The Alperin-Sheriff/Wikipedia Population dataset

This is a narrative description of the city populations dataset I’ve assembled for the Creating Data project. The headline here is: Wikipedia editors have created a much more comprehensive database of American city and town populations than historians have had to this point.

I’m writing it up separately and releasing it before any other components of the project for two reasons. First, the data is useful: a more comprehensive, long-term database of city sizes has applications in a wide variety of fields, and I’ve already spoken to a few people who might use it. (If you wish to download the data, you can do so from the GitHub site for this dataset.)

Second, I wanted to use it to try a beta launch for some of the narrative display elements of this project. I’m trying something here that’s a central part of the full project: finding ways of exploring historical data that allow for both narrative and exploratory data analysis.

Note that it’s still missing plenty of pieces, such as a loading bar! But if you want to see tens of thousands of cities dancing around the screen, this is the right place…

The left side of this page is narration: the right lets you poke around this dataset as you please. If you want to free up the full page for the map, there’s a bar to contract at the right edge of the text block. Otherwise, scroll down for a walk through the data. If you don’t eventually see a map load, let me know.

{
   "year": 2010,
   "filters.Cities": "d => return d.properties.populations.cesta['2000'] > 50000",
   "drawing": ["StateLines", "ExternalLines", "Cities"],
   "duration": 3000,
   "changeOffset": 20,
   "scales.fill.Cities": "change",
   "getters.fill.Cities": "populationChange",
   "getters.size.Cities": "population",
   "scales.size.Cities": "size"
  }

The initial view of this page showed all cities in the United States with a population over 50,000; there are several hundred in the primary dataset I’m using for this, created by Stanford’s CESTA. The full CESTA dataset, which I thought for several years was the best in existence, contains about 7,500 cities and towns across the country with populations going back, in many cases, to 1790. This data uses a broad definition of “place” before 1940, and something approximating the current census-designated place afterwards.

{
  "year": 2010,
  "filters.Cities": "d => return d.properties.popSources.indexOf('cesta') > -1",
  "duration": 3000
  }

This dataset is both good and well-vetted; it comes from cooperation between Stanford and the Census Bureau itself, and can be used for a variety of purposes. Here, for example, you can see the population change over the 1950s and 1960s as the major urban centers of the industrial midwest (Detroit, Cleveland, Chicago, St. Louis) lose population even as their suburbs grow dramatically.

{
   "year": 1970,
   "duration": 3000,
   "zoom.Cities":["Buffalo, New York", "St. Louis"],
   "scales.fill.Cities": "change",
   "getters.fill.Cities": "populationChange",
   "getters.size.Cities": "population",
   "scales.size.Cities": "size",
   "annotate.Cities":["Detroit", "Cleveland", "Chicago"]
  }

But it turns out to be far from the best possible. From reading nineteenth-century census reports, I knew that the published government figures included up to dozens of summary population statistics for each county. I noticed last year that these were starting to turn up on Wikipedia for towns much smaller than any of the published datasets contain. Here is the Wikipedia page for Montville, Maine, a small town with about 1,000 people. Wikipedia has long had current census information on places like this. But now it has the population from Montville’s year of peak population, 1840. This is information no one has been willing to pay to type up from the original census reports–not even Stanford and the US government.

Montville, Maine

Nationwide, CESTA only includes information for cities with a population over 2,500: Montville never reached this threshold, and so doesn’t appear. So how much of this stuff is in Wikipedia? And how does it compare to what academics have right now?

{
   "year": 2010,
   "duration": 3000,
   "zoom.States": ["AK","ME","FL","WA"],
   "getters.fill.Cities": "populationChange",
   "getters.size.Cities": "population",
   "labels.legendTitle": "undefined"
  }

There are a few states where the CESTA dataset has information for small towns. In Arkansas, Iowa, California, and Colorado it includes full runs of populations entered by state data centers interested in their own history. You can see how much better the resolution on those states is by comparing them (in green) to the rest of the country (in blue). (While before cities were sized by population, in this version of the map they’re all the same size.)

{
  "filters.Cities": "d => return d.properties.popSources.indexOf('cesta') > -1",
  "scales.fill.Cities": "<cat>",
  "getters.size.Cities": "d => return 1500",
  "getters.fill.Cities": "d => return d.properties.popSources.indexOf('cesta') == -1 ? 'Only Wikipedia' : ['IA','CO','AR','CA'].indexOf(d.properties.state) > -1 ? 'CESTA detailed states' : 'CESTA other states'",
  "labels.legendTitle": "Data Source"
  }

But Wikipedia has a comparable level of coverage for the entire country, which is what makes it such a remarkable source. Here, in orange, are all the cities that exist in Wikipedia and not in the existing sets. This is more than three times as many cities and towns. The midwest comes alive; suddenly, you can essentially see the rail lines themselves running through Missouri and Kansas, as well as the cities that the railroads built up along them.

{
      "filters.Cities": "undefined",
      "scales.fill.Cities": "<cat>",
      "getters.fill.Cities": "d => return d.properties.popSources.indexOf('cesta') == -1 ? 'Only Wikipedia' : ['IA','CO','AR','CA'].indexOf(d.properties.state) > -1 ? 'CESTA detailed states' : 'CESTA other states'",    
      "labels.legendTitle": "Data Source"
  }

This data entry is really incredible work. You might think of it as a testament to the power of crowd-sourcing. That isn’t crazy, but as with so much Wikipedia labor, this is almost entirely the work of a single person: Jacob Alperin-Sheriff, who undertook the work while a graduate student at Georgia Tech. (He now works as a cryptographer for the government.) Posting the populations to Wikipedia under the username “DemocraticLuntz,” he entered approximately 25,000 cities and counties from the census, accounting for 237,707 non-zero entries. That’s almost four times as many data points as the CESTA-Stanford set. It includes both towns not included in the CESTA set, and earlier years of growth from towns that spent several years below the Census Bureau’s cutoff of 2,500.

To build the union dataset you see here, I contacted Alperin-Sheriff, who sent me the CSV files he typed up before uploading them to Wikipedia. I also parsed every article in Wikipedia to find other articles that have a population history box. I then matched these to the CESTA data and each other using the population numbers as a key, and string similarity to break ties. In cases of disagreement (more about those in a bit), I’ve used whatever element seems to produce the smoothest overall growth.
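The matching step can be sketched roughly as follows. This is a minimal illustration rather than the code behind the dataset: the record layout, the function names, the `min_shared` threshold, and the use of Python's `difflib.SequenceMatcher` for string similarity are all assumptions, and the numbers are made up.

```python
from difflib import SequenceMatcher


def shared_counts(a, b):
    """Count census years where both records report the same non-zero population."""
    return sum(1 for year, pop in a["pops"].items()
               if pop and b["pops"].get(year) == pop)


def name_similarity(a, b):
    """String similarity between two place names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def best_match(record, candidates, min_shared=2):
    """Pick the candidate sharing the most population figures with `record`.

    Ties on the population key are broken by name similarity. Returns None
    when no candidate shares at least `min_shared` figures (a hypothetical
    threshold, chosen only for illustration).
    """
    scored = [(shared_counts(record, c), name_similarity(record, c), c)
              for c in candidates]
    scored = [s for s in scored if s[0] >= min_shared]
    if not scored:
        return None
    return max(scored, key=lambda s: (s[0], s[1]))[2]


# Made-up records, not real census figures.
wiki = {"name": "Exampleville, Maine",
        "pops": {1840: 1200, 1850: 1350, 1860: 1100}}
cesta = [
    {"name": "Exampleville", "pops": {1840: 1200, 1850: 1350}},
    {"name": "Sampleton", "pops": {1840: 1200, 1850: 1350}},
    {"name": "Elsewhere", "pops": {1840: 300, 1850: 410}},
]
match = best_match(wiki, cesta)
```

Here two candidates share the same two population figures with the Wikipedia record, so the name comparison decides between them.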

{
      "filters.Cities": "undefined",
      "scales.fill.Cities": "customExplanation",
      "getters.fill.Cities": "d => return d.properties.popSources.indexOf('cesta') == -1 ? 'Only Wikipedia' : ['IA','CO','AR','CA'].indexOf(d.properties.state) > -1 ? 'CESTA detailed states' : 'CESTA other states'",    
      "labels.legendTitle": "Data Source"
  }

Things are especially remarkable up in the northeast of the country, where ‘towns’ and ‘townships’ provide a fairly consistent metric of population density at the sub-county level back (in some cases) to the eighteenth century. Outside of Indian reservations and a few unincorporated areas in Maine, this gives the locations of pretty much every person in New England to within a few miles.

While Maine is represented by only a couple of cities in the existing data, there are almost 500 towns in the Alperin-Sheriff/Wikipedia set.

{
      "filters.Cities": "undefined",
      "scales.fill.Cities": "customExplanation",
      "getters.fill.Cities": "d => return d.properties.popSources.indexOf('cesta') == -1 ? 'Only Wikipedia' : ['IA','CO','AR','CA'].indexOf(d.properties.state) > -1 ? 'CESTA detailed states' : 'CESTA other states'",    
      "zoom.States": ["ME", "NJ", "PA"]
  }

This data should be invaluable for a variety of projects that want a fine-grained view of the entire country; it makes it possible to see patterns at a resolution that isn’t possible using only large cities and counties to map.

Here, for instance, is a map of when a city gained most of its population. This is a pretty good way of putting a single year on any city, better than (say) year of maximum population; you can think of it as giving an estimate of–for instance–roughly how old the buildings or street names might be. Darker colors indicate older cities.
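As a minimal sketch of one way to put such a year on a city, assume it is the mean of census years weighted by the population gained in each interval. The function name and the weighting rule are illustrative assumptions, not necessarily what the map uses, and the numbers are made up.

```python
def average_growth_year(pops):
    """Growth-weighted mean year for a {year: population} mapping.

    Each census year is weighted by the population gained since the
    previous census; the first census is weighted by its full population,
    and intervals of decline get zero weight. Note that regrowth after a
    decline is counted again under this simple rule.
    """
    weights = {}
    previous = 0
    for year in sorted(pops):
        weights[year] = max(pops[year] - previous, 0)
        previous = pops[year]
    total = sum(weights.values())
    if total == 0:
        return None
    return sum(year * w for year, w in weights.items()) / total


# Made-up numbers: an early-peaking town and a postwar suburb.
old_town = {1840: 2000, 1900: 1500, 1950: 1000, 2010: 1000}
suburb = {1840: 100, 1900: 200, 1950: 1000, 2010: 20000}

old_year = average_growth_year(old_town)
suburb_year = average_growth_year(suburb)
```

Under this rule the early-peaking town is dated to its first census, while the suburb lands close to the present.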


  {
      "filters.Cities": "d => return d.properties.popSources.indexOf('alperin') > -1",
      "scales.size.Cities": "sizeQuart",
      "getters.size.Cities": "maxPop", 
      "scales.fill.Cities": "averageYear",
      "getters.fill.Cities": "averageYear",
      "zoom.States": ["ME", "NJ", "PA"],
      "labels.legendTitle": "undefined"
  }

You can again see clear patterns in the midwest. Since the bulk of growth in major cities like Chicago happened between 1800 and 1950, the major cities tend to be old, along with market cities and towns for the farming hinterland evenly spaced through the country along the railroad lines.

{
      "filters.Cities": "d => return d.properties.averageYear <= 1945",
      "scales.size.Cities": "sizeQuart",
      "getters.size.Cities": "maxPop", 
      "scales.fill.Cities": "averageYear",
      "getters.fill.Cities": "averageYear",
      "zoom.Cities": ["Provo, Utah", "Detroit"]
  }

Growth since 1945, on the other hand, is concentrated in suburban splotches that take up much less of the map. The wide variety of suburban regions included here offers a useful way of exploring suburbanization in the country.

{
      "drawing": ["Cities", "StateLines", "ExternalLines"],
      "filters.Cities": "d => return d.properties.averageYear >= 1945",
      "scales.size.Cities": "sizeQuart",
      "getters.size.Cities": "maxPop", 
      "scales.fill.Cities": "averageYear",
      "getters.fill.Cities": "averageYear",
      "zoom.Cities": ["Provo, Utah", "Detroit"],
      "scales.x.Cities": "undefined",
      "scales.y.Cities": "undefined",
      "duration": 3000
  }

I find it informative to just look at the channels of spread of population along the railroads and rivers. The map currently shows cities that experienced their average year of growth within twenty years of 1950. Drag the slider below to adjust the timespan plotted.

{
  "year": 1950,
  "scales.size.Cities": "sizeQuart",
  "getters.size.Cities": "maxPop",
  "scales.fill.Cities": "customExplanation",
  "getters.fill.Cities": "d => return d.properties.popSources.indexOf('cesta') == -1 ? 'Only Wikipedia' : ['IA','CO','AR','CA'].indexOf(d.properties.state) > -1 ? 'CESTA detailed states' : 'CESTA other states'",
  "filters.Cities": "d => return Math.abs(d.properties.averageYear - $year) < $changeOffset",
  "changeOffset": 20,
  "duration": 2500,
  "zoom.Cities": ["Miami", "Anchorage, Alaska", "Eastport, Maine", "Seattle, Washington"],
  "scales.x.Cities": "undefined",
  "scales.y.Cities": "undefined",
  "labels.legendTitle": "Data Source"
  }

The data is strong enough that you can see these patterns in a variety of regions. Change the dropdown to zoom in on a particular city.

{
  "year": 2010,
  "scales.size.Cities": "sizeQuart",
  "getters.size.Cities": "maxPop",
  "filters.Cities": "undefined",
  "changeOffset": 20,
  "duration": 5000,
  "zoom.Cities": ["New York City"],
  "scales.x.Cities": "undefined",
  "scales.y.Cities": "undefined",
  "scales.fill.Cities": "averageYear",
  "getters.fill.Cities": "averageYear",
  "labels.legendTitle": "undefined"
  }

Zoom to:

Or you can just treat all of this as a data set to look at on its own.

There are lots of non-cartographic ways to look at this data, using information from Wikipedia or elsewhere.

{
  "drawing": ["Cities"],
  "scales.fill.Cities":"scheme2",
  "getters.fill.Cities": "region",
  "filters.Cities": "undefined",
  "getters.y.Cities":"undefined",
  "scales.y.Cities": "undefined",
  "getters.x.Cities":"undefined",
  "scales.x.Cities": "undefined",
  "duration": 4000,
  "zoom.Cities": ["Miami", "Anchorage, Alaska", "Eastport, Maine", "Seattle, Washington"],
  "labels.legendTitle": "region"
  }

Here, for example, is what it looks like when you feed city populations into the UMAP dimensionality reduction algorithm. This creates clusters that show cities that have similar patterns in their long-term growth.

{
      "drawing": ["Cities"],
      "filters.Cities": "undefined",
      "year": 2010,
      "getters.fill.Cities": "region",
      "scales.fill.Cities": "scheme2",
      "getters.y.Cities": "d => d.properties.umap_y",
      "scales.y.Cities": "<linear>",
      "getters.x.Cities": "d => d.properties.umap_x",
      "scales.x.Cities": "<linear>",
      "duration": 5000
  }

Individual portions of this graph show cities that are close not geographically but in the shape of their population curves. (You can see the full trend for each city as a sparkline next to its label.) Zooming in, for instance, on St. Louis, you can see other cities like Scranton or Bridgeport, Ohio that have similarly-shaped declines lasting decades.
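To make "close in shape" concrete, here is a small sketch that compares population curves after rescaling each one by its own peak, so that size drops out and only the trajectory remains. The rescaling, the Euclidean distance, and the made-up series are assumptions for illustration; the layout on this page comes from UMAP, not from this function.

```python
import math


def normalize(curve):
    """Rescale a population series so its peak equals 1.0."""
    peak = max(curve)
    return [p / peak for p in curve]


def shape_distance(a, b):
    """Euclidean distance between two peak-normalized series of equal length."""
    return math.dist(normalize(a), normalize(b))


# Made-up series at the same census years, not real figures.
big_decliner = [400_000, 800_000, 600_000, 300_000]
small_decliner = [4_000, 8_200, 6_100, 3_000]
steady_grower = [1_000, 2_000, 4_000, 8_000]

near = shape_distance(big_decliner, small_decliner)
far = shape_distance(big_decliner, steady_grower)
```

The two declining places end up near each other despite a hundredfold difference in size, while the steadily growing place is far from both.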

{
      "drawing": ["Cities"],
      "filters.Cities": "undefined",
      "year": 2010,
      "getters.fill.Cities": "region",
      "scales.fill.Cities": "scheme2",
      "getters.y.Cities": "d => d.properties.umap_y",
      "scales.y.Cities": "<linear>",
      "getters.x.Cities": "d => d.properties.umap_x",
      "scales.x.Cities": "<linear>",
      "zoom.ModernCounties": ["Mahoning, Ohio", "Columbiana, Ohio"],
      "annotate.Cities": ["St. Louis", "Scranton, Pennsylvania", "Bridgeport, Ohio"],
      "labels.legendTitle": "Census Region"
  }

Each of the well-defined arms on this octopus is a different census; cities that peak in the same year generally show up together.

{
      "drawing": ["Cities"],
      "filters.Cities": "undefined",
      "year": 2010,
      "getters.fill.Cities": "region",
      "scales.fill.Cities": "scheme2",
      "getters.y.Cities": "d => d.properties.umap_y",
      "scales.y.Cities": "<linear>",
      "getters.x.Cities": "d => d.properties.umap_x",
      "scales.x.Cities": "<linear>",
      "zoom.States": ["AK","ME"]
  }

So: it should be useful for a variety of purposes. I may write up some of the interesting narratives later, including the highlights of which cities and regions expand or contract in the twentieth century.

But rather than end with something interesting, it’s important to end this little introduction with a number of caveats about this data. The first, and most important, is that there is little system to what gets included in Wikipedia–a single passionate editor can build up the collection in one place, but not another. You see this in a variety of places; Chicago neighborhoods, for example, are broken out as population centers of their own while New York neighborhoods (aside from the five boroughs) are not.

{
    "drawing": ["Cities", "ExternalLines", "StateLines"],
    "scales.size.Cities": "size",
    "getters.size.Cities": "population",
    "filters.Cities": "d => d.properties.averageYear >= 1945",
    "getters.fill.Cities": "averageYear",
    "scales.fill.Cities": "averageYear",
    "year": 2010,
    "getters.y.Cities": "undefined",
    "scales.y.Cities": "undefined",
    "getters.x.Cities": "undefined",
    "scales.x.Cities": "undefined",
    "zoom.States": ["AK","ME"]  
  }

Take, for example, the odd rectangular cluster at the bottom of Michigan around the city of Battle Creek when we limit to cities that grew after 1945. This is because–as far as I can tell–one editor took it upon themselves to enter historical populations back to 1960 for the entirety of Calhoun County but nowhere else. You can see similar groups throughout the midwest–it’s a reminder of how much finer-grained the Census data could be. In most of the original census reports, population is reported back into the 19th century at this level of detail–Calhoun County, for example, has detailed township-level statistics at least back to the 1850s.

But until this data is digitized, the dribs and drabs that make it into Wikipedia aren’t very useful for large-scale thematic mapping.

{
      "drawing": ["Cities", "ExternalLines", "StateLines"],
      "scales.size.Cities": "size",
      "getters.size.Cities": "population",
      "filters.Cities": "d => d.properties.averageYear >= 1945",
      "getters.fill.Cities": "averageYear",
      "scales.fill.Cities": "averageYear",
      "year": 2010,
      "getters.y.Cities": "undefined",
      "scales.y.Cities": "undefined",
      "getters.x.Cities": "undefined",
      "scales.x.Cities": "undefined",
      "zoom.States": ["MN", "OH"],
      "annotate.Cities": ["Athens Township, Michigan"],
      "labels.legendTitle": "undefined"
  }

Fortunately, there is a heuristic that gives much more consistent data: filter to places that have a defined population in Jacob Alperin-Sheriff’s edits. Alperin-Sheriff used regular nationwide rules for what to include that make his data less of a hodgepodge than even the CESTA set, let alone Wikipedia.

The non-Alperin Wikipedia entries (in red here) are almost never for clear-cut locations, but instead cover things like Chicago neighborhoods or those Michigan townships. Sometimes the data is quite good, but sometimes–as with townships in the state of Pennsylvania–it is only entered for a few years. (There is one case in the Alperin-Sheriff data of truncated entry; there seem to only be Massachusetts town populations after 1850.)

The two big exceptions are the state of Hawaii–which uses census-designated places, not municipalities, and so is not in the Alperin-Sheriff set at all–and no-longer extant municipalities like Brooklyn, New York or ghost towns like Swansea, Arizona. Alperin-Sheriff’s raw data does include New York municipalities, but for the dataset here I use his New York town data, which are more like Midwestern townships.

{
  "scales.fill.Cities": "highlightBad",
  "filters.Cities": "d => return d.properties.popSources.indexOf('alperin') > -1 || d.properties.popSources.indexOf('wiki') > -1",
  "getters.fill.Cities": "d => d.properties.popSources.indexOf('alperin') == -1 ? 'Non-Alperin-Sheriff' : 'Alperin-Sheriff'",
  "zoom.Cities": ["Los Angeles", "New York City"],
  "duration": 5000,
  "labels.legendTitle": "Data Source"
  }

It’s still worthwhile, though, to use all the data in order to get as close to the correct populations as possible.

There are conflicts between the various sources that are not trivial to resolve. Wikipedia’s numbers are identical to Alperin-Sheriff’s only about 99% of the time, and to the CESTA numbers only about 96% of the time.
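Agreement figures like these can be computed with something like the following sketch. The `populations` layout keyed by source name mirrors the properties used in the view settings on this page, but the function name, the handling of missing values, and the sample records are assumptions.

```python
def agreement_rate(cities, source_a, source_b):
    """Share of (city, year) pairs where two sources report the same figure.

    Only pairs present and non-zero in both sources are compared. Returns
    None when the two sources have no figures in common.
    """
    same = total = 0
    for city in cities:
        a = city["populations"].get(source_a, {})
        b = city["populations"].get(source_b, {})
        for year, pop in a.items():
            if pop and b.get(year):
                total += 1
                same += int(pop == b[year])
    return same / total if total else None


# Made-up records, not real census figures.
cities = [
    {"populations": {"wiki": {"1880": 500, "1890": 800},
                     "cesta": {"1880": 500, "1890": 200}}},
    {"populations": {"wiki": {"1890": 1500},
                     "alperin": {"1890": 1500}}},
]
wiki_vs_cesta = agreement_rate(cities, "wiki", "cesta")
wiki_vs_alperin = agreement_rate(cities, "wiki", "alperin")
```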

All the cities in red here are ones for which there is some internal disagreement among the three datasets here (Wikipedia, Alperin source files, and CESTA data) in the 1890 census. I’ve checked the original census report from 1890 for each of these places.

These are just some examples.

  • Morris, Illinois is short by 600 people in the Wikipedia and Alperin-Sheriff sets, but is correct in CESTA.
  • Conneaut, Ohio has a repeated 1 in the Alperin-Sheriff set that makes the figure ten times too large; Wikipedia editors have corrected it to the same value as the CESTA number.
  • New Haven, Connecticut is 6,000 lower in the Alperin-Sheriff numbers because he uses the population for “New Haven City,” while Wikipedia and CESTA use a larger census population for “New Haven Town,” which includes the city and other municipalities.
  • Coal Hill, Arkansas is listed as 802 in the Wikipedia and the Alperin-Sheriff transcriptions, but only 202 in the CESTA set. The original census publications list the population as 802.
  • Winston-Salem, North Carolina is listed as 10,729 in the Wikipedia sources and 10,700 in CESTA, which sometimes rounds to the nearest hundred. The combined population of the two cities of Winston and Salem was 10,729 in 1890.

In cases like this, the headline population in the merged dataset is created by selecting the number that produces the smoothest overall growth curve between existing values. The purpose of this is to ensure that egregiously wrong values won’t be included.
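A minimal sketch of that selection rule, assuming "smoothest" means the candidate whose log population lies closest to the midpoint of the log populations at the neighboring censuses. The function name, the log-scale measure, and the example numbers are illustrative assumptions rather than the exact code behind the merged dataset.

```python
import math


def smoothest_candidate(before, candidates, after):
    """Choose the candidate that bends the growth curve least.

    `before` and `after` are the agreed populations at the neighboring
    censuses; all values must be positive. Roughness is the absolute
    second difference of log population, so errors are judged
    proportionally rather than in absolute head counts.
    """
    def roughness(pop):
        return abs(math.log(before) - 2 * math.log(pop) + math.log(after))

    return min(candidates, key=roughness)


# Made-up town: the neighboring censuses agree, but one source has a
# doubled digit in the middle year.
chosen = smoothest_candidate(before=3000, candidates=[3500, 33500], after=4100)
```

A doubled digit produces a huge spike on a log scale, so the plausible value wins easily. Closer calls, such as a difference of a few hundred people, are much harder for a rule like this to settle.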

{"year":1890,
  "scales.fill.Cities": "highlightBad",
  "filters.Cities": "undefined",
  "getters.fill.Cities": "cityPopulationConflict",
  "zoom.Cities": ["Los Angeles", "New York City"],
  "duration": 5000,
  "getters.annotation.Cities": "d => return `CESTA: ${d.properties.populations.cesta[$year]}         Wiki: ${d.properties.populations.wiki[$year]}         Alperin-Sheriff: ${d.properties.populations.alperin[$year]}`",
  "annotate.Cities": ["Morris, Illinois", "Conneaut, Ohio", "New Haven, Connecticut", "Coal Hill, Arkansas", "Winston-Salem, North Carolina"]
  }

The last example is an especially interesting one that points in the direction of future needs to improve data like this. Winston and Salem were independent cities at the time of the 1890 census, and to list them as joint is–strictly speaking–incorrect. CESTA, unlike Wikipedia, does have statistics for the independent cities of Winston and Salem for part of this period; Alperin-Sheriff seems to have manually combined populations of merged cities for certain municipalities such as Allegheny and Pittsburgh, Pennsylvania. In other cases of municipal merger, such as Brooklyn and New York City, or Dover and Foxcroft, Maine, only the population of the larger municipality is included. While some extinct municipalities receive their own pages with population boxes (like Allegheny), others (including Winston, Salem, Dover, and Foxcroft) do not.

There is some room for using Wikipedia text, along with addition, subtraction, and county-level populations, to create the full network of when cities merge with each other. I have not attempted to do this.
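The arithmetic part of that idea might look something like the sketch below, which searches for pairs of places whose separate populations add up to a combined figure reported elsewhere. It is not something implemented for this dataset; the function name, the `tolerance` parameter, and the county are all hypothetical.

```python
from itertools import combinations


def find_merger_pairs(components, combined_pop, tolerance=0):
    """Return pairs of places whose populations sum to `combined_pop`.

    `components` maps place name to population for a single census. A
    match is only a hint that a combined figure was reported; it is not
    proof that the places merged.
    """
    return [(a, b) for a, b in combinations(sorted(components), 2)
            if abs(components[a] + components[b] - combined_pop) <= tolerance]


# Made-up county, not real census figures.
county = {"Northtown": 6200, "Southtown": 4300, "Hilltop": 900}
pairs = find_merger_pairs(county, 10500)
```

A nonzero `tolerance` would allow for sources that round figures, at the cost of more false matches.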

Ultimately, the best way to solve this would require returning to the original census reports, which have detailed footnotes about all mergers. It seems possible to me that the thousands of points collected here could be useful in training OCR and column-discrimination models across those reports, but that task is probably not going to be possible for several years.