A guided tour of the digital library

In the last 20 years, librarians and technology companies have scanned millions upon millions of books from research libraries. This is a significant portion of all the intellectual work published in the West before the rise of the Internet. What, actually, is in this vast new archive? How big is it? What kinds of books does it contain, and which ones are worth looking at again? Librarians and information scientists have been working for the past decade to make a substantial proportion of these books accessible through the Hathi Trust. But we lack ways–even bad ways–to see the entire digital library at once.

You’re probably used to browsing books by one of the various library classification systems in common use; the best known is the Dewey Decimal system. But especially at the research level, different libraries use different classification systems. Many research libraries organize using the Library of Congress Classification (LCC); but only about a third of the books in Hathi even have an LCC number associated with them.

The only thing we have for all of these books is their text. The visualization here provides a new way of exploring this vast digital library: it arranges books visually based on the vocabulary they use, using the method from my new paper on “Stable Random Projection.”1 I’ve taken inspiration in this from some other maps of large collections of books and images, but this tries to take on a considerably more diverse and important set of cultural artifacts.2

You can visually browse through a library of 14 million volumes (about as many as there are books in the Library of Congress) and click on any one to see the original in the HathiTrust digital library.

Either navigate directly using a touch screen or mouse–the “interact” button above will hide the narration–or scroll down in this panel to get a guided tour. You can click on any title to view it in the Hathi Trust catalog, and read it if it’s in the public domain.

{
  "base_dir": "/data/scatter/hathi",
  "colors": {"language":"","Classification":"","Subclassification":"","date":"","Principal Author":""},
  "labels": ["Classification", "Subclassification","id","title","date","Principal Author", "language", "library"],
  "point_size": 2,
  "point_threshold": 10,
  "label_threshold": 0,
  "variable_point_size": false,
  "filters": null,
  "hide_uncolored": false,
  "zoom": [1, 0, 0, 1000],
  "colorize_by": "language",
  "label_field": "title",
  "scheme": "dark",
  "guides": ["legend", "color_legend", "label_legend", "filter_legend"],
  "keys": {"Subclassification": "LCC.txt", "Classification": "LCC.txt"}
  }

This visualization updates to include more books as we zoom in. Although there are about 14,000,000 books in the Hathi Trust, you’re only seeing about 17,000 out of about 12,000,000 right now.

As you’d expect, the greatest differences in words used are created by the language of a text.

Since a large proportion of the Hathi Trust is in English, the single biggest cluster is a central grouping of English-language books.

{
  "hide_uncolored": true,
  "zoom": [1.494462502704141,5.514914338661423,-3.7660311151101347]
  }

So let’s look only at English (about 17,000 books visible) to see the patterns of use there.

{
  "filters": {"English": "d.language == 'English'"},
  "hide_uncolored": true,
  "slowly": [
  {"field": "point_threshold", "value": 12},
  {"field": "point_size", "value": 2.0}
  ],
  "zoom": [1.994462502704141,5.514914338661423,-3.7660311151101347]
  }

The biggest organizing principle at this level is the general subject matter of the book. I’ve colorized here by Library of Congress Classification; I’m using the top-level classes, which are things like “Q” for science or “L” for education. To see the full names of the classes, click the button below.

While many disparate books are positioned alongside each other, there is fairly strong separation in these various areas between different subject headings. When two radically different methods–librarians reading, and vectorized document representations–coincide even this much, you can safely conclude that both are tapping into something real.

But I should say for the machine-credulous: this doesn’t mean that this particular clustering is the be-all and end-all of machine reading. If I ran the algorithm again or with different parameters, the layout could be quite different. The point is simply this: if a pattern emerges in this low-dimensional representation, it probably also exists in the space defined by a book’s vocabulary.
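To make "the space defined by a book’s vocabulary" a little more concrete, here is a minimal sketch of a hashed random projection over word counts. It is written in the spirit of the approach, not as the paper’s implementation: the dimension count, the `word_signs` hashing scheme, the log weighting, and the toy texts are all assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter

DIMS = 256  # hypothetical size, chosen only for this illustration


def word_signs(word, dims=DIMS):
    """Map a word to a reproducible vector of +1/-1 values by hashing it."""
    signs = []
    block = 0
    while len(signs) < dims:
        # Each SHA-1 digest yields 160 bits; hash again with a new suffix for more.
        digest = hashlib.sha1(f"{word}_{block}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        block += 1
    return signs[:dims]


def project(text, dims=DIMS):
    """Sum the sign vectors of a text's words, weighted by log word count."""
    counts = Counter(text.lower().split())
    vector = [0.0] * dims
    for word, n in counts.items():
        weight = math.log(1 + n)  # assumed damping of very frequent words
        for i, sign in enumerate(word_signs(word, dims)):
            vector[i] += weight * sign
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy "books": two share most of their vocabulary, the third shares none.
chem_1 = project("acid base reaction acid solution")
chem_2 = project("acid base reaction solution titration")
poem = project("love rose moon heart sorrow")
print(cosine(chem_1, chem_2), cosine(chem_1, poem))
```

Because every word always hashes to the same signs, any book can be projected on its own, without first building a shared vocabulary; books with overlapping vocabulary end up with similar vectors, which is the property a two-dimensional layout then tries to preserve.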

{
  "colorize_by": "Classification",
  "label_field": "Classification",
  "hide_uncolored": true,
  "label_threshold": 0.1,
  "slowly": [
  {
    "field": "point_threshold",
    "value": 20
  }
  ],
  "filters": {"English": "d.language=='English'"}
  }

Many sciences are located up towards the top of the map: the way the algorithm shakes out puts chemistry and physics at the far extremes of a peninsula including mathematics and a variety of publications dealing with technologies and materials.

{
  "zoom": [10.526836200232077,10.287032865090978,-21.887397213962267],
  "colorize_by": "Subclassification",
  "filters": {"English": "d.language=='English'",
  "Classification": "/[QRST]/"},
  "hide_uncolored": true,
  "label_field": "Subclassification"
  }

A second cluster lives towards the eastern side. Here, the far eastern promontory is formed by works about medicine (LC classification “R”); the biological sciences help form the bridge that links medicine to the rest of scholarship, with texts about agriculture mingling most easily with works in the social sciences.

{
  "zoom": [12.074376483081286,27.02329499000892,-2.941268304187396,8000],
  "colorize_by": "Subclassification",
  "filters": {"English": "d.language == 'English'",
  "Classification": "/[QRST]/"},
  "point_threshold": 12
  }

Just to the northwest are works about botany, plants, and parts of zoology; they have less to do with medicine in their language, but cluster back in with the bulk of the biological sciences. At the far promontory opposite medicine is the bulk of class “SB,” plant culture, which stands in the same relation to botany as medicine does to zoology.

The S classes–agriculture, general–are one of the major areas where the historical origins of the LC classification in the early 20th century are evident. Agriculture–a major area of study at US universities, and the employment area of a third of the population–seemed to logically command a top-level heading. But later bibliographers have shelved relatively few books in the ‘S’ classes.

{
  "label_threshold": 0.05,
  "zoom": [14.063431825509644,23.720800638482515,-7.285157063465448],
  "colorize_by": "Subclassification",
  "point_threshold": 13,
  "show_only_n_categories": 0
  }

But other works classed as “science” or “technology” are scattered throughout the map. For instance, way on the other side of literary history is a relatively self-contained set of science clusters that consist mostly of textbooks for various science classes. Unlike the original works of science above, down here chemistry, math, and physics are quite distinct from each other.

{
  "label_threshold": 0.1,
  "label_field": "Subclassification",
  "show_only_n_categories": 5,
  "filters": {
      "English": "d.language=='English'",
      "Science": "d.Classification=='Q'"
      },
      "zoom": [11.918685221980581, 2.3099582169341026, 20.52232542157204]
  }
   

To see what other kinds of patterns exist here, look at a different section of the library: LC classification “P”, for literature.

{
    "zoom": [1.994462502704141,5.514914338661423,-3.7660311151101347],
    "point_threshold": 20,
    "label_field": "Subclassification",
    "filters": {
    "English": "d.language=='English'",
    "Fiction": "d.Classification=='P'"
    },"color_legend_toggle":"off","show_only_n_categories":0
    }
  

P is divided into two large macro chunks. One contains scholarly work about literature; much of this has lexical similarities to history and biography.

The inability to distinguish between these two chunks from library metadata is something that’s frequently frustrating to scholars of digital literature.

That they show up in clearly distinct chunks here is a good thing; it suggests that it will be relatively easy to use machine learning tools to tell them apart in cases where the metadata isn’t clear. (This conforms with what Ted Underwood’s white paper on the subject has found, but using a considerably smaller dataset.)3

{
      "point_threshold": 20,
      "zoom": [5.077083544079225,-2.4696966121767225,11.296810764084036]
  }

The other chunk is actual literature, which occupies one wing off the side of the full English-Language set. We are now past the point where metadata is especially good at describing the structuring that the library uses, but fortunately there’s an easy way to describe what’s going on here.

{
  "zoom": [9.037977716560997,23.85478326645137,12.026148406645955]
  }

The major chunks in this cluster are literary genres. Although most works don’t have genre metadata, we can use a heuristic to tell genre; I’m simply going to look at whether a work has the word “novel,” “play,” or “poetry/poem” in its title. Not all that many works are helpful enough to identify themselves. But even with most literature removed, enough remain to make it clear that the main areas here represent poetry, prose, and plays.

(You may see several big chunks of ‘plays’ hovering around the edges of poetry; many of these are 18th-century plays in verse. The diction in these in some ways resembles poetry more than it does, say, Tennessee Williams.)
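The title heuristic described above can be sketched in a few lines. This is an illustration only: the text names the words “novel,” “play,” and “poetry/poem” but not the exact matching rules, so the patterns, the handling of plurals, and the choice to leave ambiguous titles unlabeled are all assumptions.

```python
import re

# Hypothetical patterns; only the keywords themselves come from the essay.
GENRE_PATTERNS = [
    ("novel", re.compile(r"\bnovels?\b", re.IGNORECASE)),
    ("play", re.compile(r"\bplays?\b", re.IGNORECASE)),
    ("poetry", re.compile(r"\b(?:poetry|poems?)\b", re.IGNORECASE)),
]


def guess_genre(title):
    """Return a genre label if the title names exactly one genre, else None."""
    matches = [label for label, pattern in GENRE_PATTERNS if pattern.search(title)]
    # Titles naming no genre, or more than one, are left unlabeled.
    return matches[0] if len(matches) == 1 else None


for title in ["Waverley: a novel", "Cato: a play", "Principles of geology"]:
    print(title, "->", guess_genre(title))
```

A rule this crude will mislabel some titles (a “novel method” is not a novel), which is fine for coloring a map but would need checking before being used as ground truth.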

{
  "colorize_by": "genre",
  "slowly": [{"field": "point_threshold", "value":35}],
  "label_field":"title",
  "point_threshold": 20,
  "label_threshold": 0.05,
  "filters": {"English": "d.language=='English'",
        "Literature": "d.Classification=='P'"
    },
  "zoom": [9.037977716560997,23.85478326645137,12.026148406645955]
  }

Zooming in further, we can look at a still smaller subset of the library: poetry alone. What kind of library metadata explains the organization of poetry within itself?

{
  "zoom": [15.823534662094913,22.601605328038602,15.15921208910164],
  "colorize_by": "genre",
  "filters": {
      "English": "d.language=='English'",
      "Literature": "d.Classification=='P'" }
  }

The answer has something to do with style. Or, at least, date of composition. Here the chart is colored by date, and poems fade in by date of composition. Around 1800, most poems cluster in the bottom half of the cluster; as time goes on, new poetry is written closer to 2000.

{
  "show_only_n_categories": 0,
  "point_threshold": 24,
  "filters": {
      "English": "d.language=='English'",
      "Literature": "d.Classification=='P'",
      "year": "d.date <= 1800"
      },
      "slowly": [
      {"field": "filters", "value": {
          "English": "d.language=='English'",
              "year": "d.date <= 2020",
          "Literature": "d.Classification=='P'"
      }}],
      "colorize_by": "date",
      "duration": 10000,
      "label_field": "date",
      "label_threshold": 0.1,
  "zoom": [15.823534662094913,22.601605328038602,15.15921208910164]
  }

You can filter to a single decade of books at a time using the slider below.

Zooming in on early poetry gives you a bunch of books mostly written before 1850. You can click on any of these points to read them if you like. But even at this tight scale, there are forms of local textual organization.

{
   "zoom": [42.263183432150086,23.14051290748467,18.23463191774931],
   "label_threshold": 0.05,
   "filters": {
      "English": "d.language=='English'",
      "Literature": "d.Classification=='P'" },
   "slowly": [{"field": "point_size","value": 3},{"field":"label_threshold","value":0}]
  }

Within the early poetry, one of the forms of clustering takes place by authorship. Walter Scott, William Cowper, Shakespeare, and other poets each occupy a distinct area of the chart. Homer, author of the Odyssey, appears classed as an English poet because so many different poets translated him into English in this period.

{
    "filters": {},
    "colorize_by": "first_author_name",
    "label_field": "first_author_name",
    "label_threshold": 0.05,
    "show_only_n_categories": 15,
    "color_legend_toggle": "off",
    "slowly": [{"field": "point_size","value": 3},{"field":"label_threshold","value":0.2},{"field":"point_threshold", "value":13}],
    "zoom": [52.263183432150086,23.14051290748467,18.23463191774931]
    }

At the full level of magnification, we’re just looking at a few hundred books by Walter Scott, salted with a few other books either about him or imitating him.

{
      "show_only_n_categories": 0, "hide_uncolored": false,
      "color_legend_toggle": "off",
      "label_field": "title",
      "label_threshold":0.08,
      "slowly": [{"field":"point_threshold", "value": 15}, {"field":"label_threshold", "value":0.3}],
      "zoom": [250, 24.23846452408224, 18.761868433830934]
  }

But while this resembles a library classification in certain ways, using full text also lets us play against the decisions about where a book belongs that constrain where it falls on the shelf. While almost all poetry is classed as literature, there are also rare works of history, biography, and economics in the libraries that are themselves in verse.

Looking at the poetry cluster but hiding literature surfaces these oddball works like A Metrical History of the Life and Times of Napoleon Bonaparte and historical poems about William Penn or the Spanish conquest of the new world.

Using these kinds of filters produces, at times, some spectacularly convoluted verse, as when Thomas Dibdin tries to make the British national debt rhyme in his history of England:

Image of a poem, with text reading:

War with the Dutch adds to our troubles, 1781.
And trade’s embarrassment redoubles.
If I mistake, ’tis your’s to judge it,
But only overhaul the Budget
Which, for the service of the year,
Will millions, twenty-three appear;
Thousands, seven hundred fifty-six,
And hundreds, (as accountants fix,)
Some one or two; a sum so great
Had ne’er before disturb’d the state

{
      "show_only_n_categories": 0,
      "color_legend_toggle":"off",
      "label_field": "title",
      "slowly": [{"field":"point_threshold", "value": 20}],
      "label_threshold":0.7,
      "filters":{
        "Classification":"/[^P]/"
      },
      "zoom": [200, 24.14437272604416, 17.57975452085781]
  }

Within the whole of poetry, these works can fade away. But the ability to surface them is one of the ways even a mediocre machine-ordering of the digital library can be useful. As we zoom back out to the scale of poetry generally, you can see a large number of works that–although librarians correctly class them as something other than poetry–still share something in their language with the genre. This suggests that one thing this visualization can do is point to places where the substance of texts does not match up with the ordering placed on them in a classification system.
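One simple way to hunt for such mismatches outside the interface is to flag books whose assigned class differs from that of most of their nearest neighbors on the map. The sketch below is not part of the visualization: the `neighbor_mismatches` function, the choice of `k`, and the toy coordinates and titles are all hypothetical, and the brute-force search would need replacing with a spatial index at the scale of millions of books.

```python
import math
from collections import Counter


def neighbor_mismatches(points, k=3):
    """Flag points whose label differs from the majority of their k nearest neighbors.

    points is a list of (x, y, label, title) tuples.
    Returns a list of (title, assigned_label, neighborhood_label) tuples.
    """
    flagged = []
    for i, (x, y, label, title) in enumerate(points):
        # Brute-force distances to every other point.
        others = [
            (math.hypot(x - ox, y - oy), other_label)
            for j, (ox, oy, other_label, _) in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        nearest = [other_label for _, other_label in others[:k]]
        if not nearest:
            continue
        majority, _ = Counter(nearest).most_common(1)[0]
        if majority != label:
            flagged.append((title, label, majority))
    return flagged


# Invented layout: a verse history (class D) sitting among poetry (class P).
toy_map = [
    (0.0, 0.0, "P", "poem a"),
    (0.1, 0.0, "P", "poem b"),
    (0.0, 0.1, "P", "poem c"),
    (0.1, 0.1, "D", "metrical history"),
    (5.0, 5.0, "D", "history a"),
    (5.1, 5.0, "D", "history b"),
    (5.0, 5.1, "D", "history c"),
    (5.1, 5.1, "D", "history d"),
]
print(neighbor_mismatches(toy_map))
```

A flagged book is not a cataloging error; as with the verse histories, it is a book whose language sits closer to another part of the library than its call number suggests.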

Another thing it can help us with is understanding the scale of the library. Within the poetry cluster here are tens of thousands of books. Scholars in the humanities who are thought to work with “Big Data” operate, for the most part, at the upper limits of this scale. One of the foundational texts in Digital Humanities for thinking about how to deal with large libraries was Gregory Crane’s article “What do you do with a million books.”4

{
    "point_threshold": 12,
    "colorize_by": "Classification",
    "label_field": "Classification",
    "label_threshold": 0.05,
    "filters": {"English": "d.language=='English'"},
  "zoom": [6.849506276371657, 23.349639551810583, 13.840330233777106, 5000]
  }

But Google, Microsoft, the Internet Archive, and the rest have scanned not 100,000 novels, nor 1,000,000 books, but something approaching 10,000,000 volumes just in English. Even if you’ve had the privilege of browsing one of the great open-stack research libraries in the country that begin to approach this size, you probably don’t have a great sense of what the big subcollections of books in it are.

{
    "point_threshold": 12,
    "colorize_by": "Classification",
    "label_field": "Subclassification",
    "label_threshold": 0.05,
    "filters": {"English": "d.language=='English'"},
    "zoom": [1.5,5.514914338661423,-3.7660311151101347, 10000]
  }

Look, for example, at this part of fine arts (class N). Most of the books here are things like NC (drawing), ND (painting), and NK (decorative arts). Note all the little flecks of “technology” (starting with T) here. That’s predominantly because works about photography are classed as technology, not as art; when the Library of Congress created its classification system in the early 20th century, the idea of photography as a fine art was far from its creators’ minds.

{
   "colorize_by": "Classification",
   "label": "Subclassification",
   "label_field": "Subclassification",
   "slowly": [{"field": "label_threshold","value": 0.3, "duration": 750}],
   "+filters": {"English": "d.language=='English'"},
   "zoom": [901.385229888668,1.1110842093088635,15.694344168624554]
  }

But just as with poetry, the language of art history diffuses gently through the entire library. There are bibliographical (Z) works about printers; recreational (G) works about the Disney company; and literary (P) works about puppetry.

{
   "colorize_by": "Classification",
   "label": "Subclassification",
  "hide_uncolored": true, 
   "label_field": "title",
   "label_threshold": 0,
   "slowly": [{"field": "label_threshold","value": 0.3}],
   "+filters": {"English": "d.language=='English'"},
   "zoom": [901.385229888668,1.1110842093088635,15.694344168624554, 8000]
  }

This doesn’t mean that the original classifications are wrong. If the library profession started over they would certainly not come up with the same ordering of knowledge as in the Library of Congress Classification; but the system has many principles (class by subject ahead of place; follow the intent of the author when choosing between plausible alternatives) that are sensible, useful, and unlikely to be easily reproduced algorithmically.

The point is, rather, that the ways a computer classifies can sometimes reflect reality more sensitively than a rigid set of rules. Computers can be more flexible than bureaucracies.

Here, for instance, is a set of books that, for the most part, mix Spanish and English together in their contents. Some are classified in the metadata as Spanish, some as English, and some as multiple languages. But they share a linguistic commonality.

{
  "filters": {},
  "colorize_by":"language",
  "label_field": "title",
  "zoom": [49.72846520217916, 30.876259733930837, 3.5597142144718923, 9000]
  }
  

One section like this shows the importance of non-academic writing in academic libraries. Libraries are full of instructional manuals about how to do things. The section here is riddled with instructions on how to throw a football, how to golf, how to bowl, and any other sports skill you might wish to pick up.

In the library, sports is shelved in a distinct section of class G (recreation). But clustering purely on language, these works stand near not anthropology and geography, but instead in the broader neighborhood of a library of self-improvement for non-experts.

{
  "zoom": [   520.2827513827069, 1.933, -12.4567]
  ,
  "label_threshold": 0.3,"label_field":"title","filters":{}
  }
  

Just to the south, for example, is a similarly large section focusing on building skills for farming.

{
  "zoom": [516.7049822363766, 2.4654468758139316, -11.22443283821256],
    "label_threshold": 0.05,"label_field":"title","filters":{}
  
  }

While this might seem like an esoterically unimportant slice of the library, it provides context for other areas that we know to be important. To the west of farming, for example, just past the guides to improving photographic technique, is an important set of artifacts of early digital culture: computer magazines from the 1970s through the 1990s. The most famous is probably Byte, one of the leading magazines through which code, tricks, and stories were shared before the Internet.

If you want to understand the early rise of the computer industry in the United States–and especially if you want to understand who it excluded, and how it differed from other countries–it might make sense to take a longer look at how it tapped into existing forms, rhetorics, and practices established by other American hobbyist movements.

{"zoom":[929.34,6.174,-12.197]
    ,"label_threshold":0.05
    ,"label_field":"title"
    ,"filters":{}}

Look around and you’ll find all sorts of other odd clusters of texts you might not realize have been stored in libraries. Here, for example, is a set of books of statistics that are misread as Greek.

{"colorize_by": "Classification",
  "label_field": "Subclassification",
  "label_threshold":0.05,
  "show_only_n_categories": 0, "hide_uncolored": false, 
  "zoom": [115.20024701356954,5.54516604014923,-36.85146021576969]
  }

This serves as a useful reminder that the digital library scholars work with is itself the production of a strange set of machine hallucinations of text. Most of the library books in Hathi were converted to digital files by Google; libraries serve the Google scans. Google’s optical character recognition (OCR) seems to work not a word or even a paragraph at a time, but over spans of several pages; it uses best estimates to make an assumption about the general character set of a book, and then tries to read it consistently in that light.

As an example of what constitutes this cluster, look at what happens when Google’s OCR encounters a page like this one, in one of the volumes of the famous Framingham Heart Study, which is flipped onto its side.5

Image of statistics on a page scanned oriented at 90 degrees

Google’s program seems to fail to recognize the correct orientation and, instead, encodes it as a series of nonsense numbers and Greek letters. (Even letters that appear to be Roman, here, like “Μ” are, for the most part, their Greek equivalents.)

11Ι1510Η0 3Π015Α5 Μ «3 . ΧΙΜίνΗ 30ΝνΐΗνΛΟ0-30ΝνΐΜνΛ 03"Ι00<1
  10816*16*1 εΖ189*εΐ Ζ855**6Ζ ΊΜ1510Η0 . εΖ189·εΐ 6111£*8β* *151Ζ*Ζ9
  011Ο15Α5 Ζ855**6Ζ *151Ζ*Ζ9 Ε185ΖΊ51 Μ»3 535Ϊ0-ΝΟΝ Μ03
  30Ν»I«νΛ00-30Ν*ΙΜΥΛ

When this is applied to title pages with the correct orientation, it produces ‘words’ that try to use Greek characters to spell out English. So “Fertilizer use in the United States .. United States Department of Agriculture .. Bulletin No. 408” emerges looking like this:

ΓΕΚΤΙΙΙΖΕΚ υ$Ε ΙΝ ΤΗΕ υΝΙΤΕϋ 5ΤΑΤΕ5 … υΝΙΤΕϋ 5ΤΑΤΕ5 ϋΕΡΑΚΤΛΑΕΝΤ ΟΡ ΑΟΚΙΟυίΤΙΙΚΕ .. ΒυΙΙβΗη Νο. 408
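The point about look-alike letters is easy to check with Python’s standard `unicodedata` module: a Greek capital mu and a Latin capital M render almost identically but are different code points. The sketch below also computes what share of a string’s letters are Greek; the `script_of` and `greek_fraction` helpers are hypothetical names introduced for this illustration, not part of any pipeline described here.

```python
import unicodedata


def script_of(ch):
    """Return 'GREEK' or 'LATIN' based on the Unicode character name, else None."""
    name = unicodedata.name(ch, "")
    for script in ("GREEK", "LATIN"):
        if name.startswith(script):
            return script
    return None


def greek_fraction(text):
    """Fraction of the Greek-or-Latin letters in text that are Greek."""
    scripts = [script_of(ch) for ch in text]
    letters = [s for s in scripts if s is not None]
    if not letters:
        return 0.0
    return letters.count("GREEK") / len(letters)


latin_m = "\u004d"   # LATIN CAPITAL LETTER M
greek_mu = "\u039c"  # GREEK CAPITAL LETTER MU
print(latin_m == greek_mu)               # the two strings are not equal
print(unicodedata.name(latin_m))
print(unicodedata.name(greek_mu))

# "THE" spelled with Greek tau, eta, epsilon versus ordinary Latin letters.
print(greek_fraction("\u03a4\u0397\u0395"), greek_fraction("THE"))
```

A volume cataloged as English whose text is mostly Greek letters by this measure would be a plausible candidate for the kind of OCR failure shown above, though any cutoff for "mostly" would have to be chosen by inspection.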

This gets to the heart of what it means for a book to be findable in the digital library, and why we need new forms of exploration.

In the physical libraries of the past, a book is found or lost through human interpretations. If a book is misshelved or miscataloged, it can be functionally lost for weeks or years. To find a book lost in the stacks requires persistent manipulations around the types of decisions human librarians might make.

In a digital library, the kiss of death sometimes relies on understanding not human mistakes but computer mistakes. (Of course all computer mistakes are human mistakes, originally.) While subject headings rely on a human reading of a text, OCR is, essentially, a computer reading of a text. And if the computer reads the text disastrously wrong (like “υΝΙΤΕϋ 5ΤΑΤΕ5” for “UNITED STATES”), you might think the book is fundamentally lost. No keyword search will ever find the Framingham Heart Study.

But it’s too easy to leave things there. All these volumes are clustered together based on the character strings that occur in them, which means that even if the relationships can’t be found by people using human terms, we can tease something out of the full-scale relations.

Spend some more time browsing around this map, and you’ll see this same pattern again and again; here, for example, is a big set of musical scores.

{"zoom":[32.72,-14.648,8.411]
  ,"label_threshold":0.05
    ,"label_field":"first_author_name"
    ,"filters":{}
    ,"show_only_n_categories":8}
  

Crane, Gregory. “What Do You Do with a Million Books?” D-Lib Magazine 12, no. 3. Accessed January 23, 2012. https://doi.org/10.1045/march2006-crane.

Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.

Richardson, Matthew, Judith Kamalski, Sarah Huggett, and Andrew Plume. “The Fundamental Interconnectedness of All Things. Places & Spaces: Mapping Science. Courtesy of Elsevier Ltd. In ‘8th Iteration (2012): Science Maps for Kids,’ Places & Spaces: Mapping Science, Edited by Katy Börner and Michael J. Stamper,” 2012. http://www.scimaps.org/detailMap/index/the_fundamental_inte_145.

Schmidt, Benjamin. “Stable Random Projection: Universal, Lightweight Dimensionality Reduction for Digital Libraries.” Journal of Cultural Analytics, October 2018.

Tang, Jian, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. “Visualizing Large-Scale and High-Dimensional Data.” arXiv:1602.00370 [Cs], 2016, 287–97. https://doi.org/10.1145/2872427.2883041.

Underwood, Ted. “Understanding Genre in a Collection of a Million Volumes, Interim Report.” Accessed December 29, 2014. http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.


  1. The principles of that corpus are detailed in the paper. The basic idea is that books which are similar to each other in the words they contain should be close to each other. This form of so-called “bag-of-words” representation is effective at describing what works are about; the point of that paper is to show that it can be effectively used on even massive, multilingual digital libraries with relatively small representations of individual books. The particular visualization of that high-dimensional space is done here using the LargeVis algorithm. Jian Tang et al., “Visualizing Large-Scale and High-Dimensional Data,” arXiv:1602.00370 [Cs], 2016, 287–97, https://doi.org/10.1145/2872427.2883041; Benjamin Schmidt, “Stable Random Projection: Universal, Lightweight Dimensionality Reduction for Digital Libraries,” Journal of Cultural Analytics, October 2018.

  2. For interface, I’m especially drawing from https://artsexperiments.withgoogle.com/tsnemap/. Similar maps exist of scientific research using network placement algorithms–e.g., Matthew Richardson et al., “The Fundamental Interconnectedness of All Things. Places & Spaces: Mapping Science. Courtesy of Elsevier Ltd. In ‘8th Iteration (2012): Science Maps for Kids,’ Places & Spaces: Mapping Science, Edited by Katy Börner and Michael J. Stamper,” 2012, http://www.scimaps.org/detailMap/index/the_fundamental_inte_145 and http://paperscape.org/–but they rely on citation metrics.

  3. Ted Underwood, “Understanding Genre in a Collection of a Million Volumes, Interim Report,” accessed December 29, 2014, http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.

  4. Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (University of Illinois Press, 2013); Gregory Crane, “What Do You Do with a Million Books?” D-Lib Magazine 12, no. 3, accessed January 23, 2012, https://doi.org/10.1045/march2006-crane; Ted Underwood, David Bamman, and Sabrina Lee, “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019.

  5. These transcriptions are accurate as of 2018-08-29. Google periodically updates its OCR and pushes the new versions to the libraries; things may have changed by the time you see it.