In the last 20 years, librarians and technology companies have scanned millions upon millions of books from research libraries. This is a significant portion of all the intellectual work published in the West before the rise of the Internet. What, actually, is in this vast new archive? How big is it? What kinds of books does it contain, and which ones are worth looking at again? Librarians and information scientists have been working for the past decade to make a substantial proportion of these books accessible through the Hathi Trust. But we lack ways, even bad ways, to see the entire digital library at once.
You’re probably used to browsing books by one of the library classification systems in common use; the best known is the Dewey Decimal system. But especially at the research level, different libraries use different classification systems. Many research libraries organize their collections using the Library of Congress Classification (LCC), but only about a third of the books in Hathi even have an LCC number associated with them.
The only thing we have for all of these books is their text. The visualization here offers a new way of exploring this vast digital library: it arranges books visually based on the vocabulary they use, following the method from my new paper on “Stable Random Projection.”[1] I’ve taken inspiration from some other maps of large collections of books and images, but this one tries to take on a considerably more diverse and important set of cultural artifacts.[2]
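To make the idea of arranging books by vocabulary concrete, here is a minimal sketch of a hashed bag-of-words projection. It is a simplified illustration in the spirit of that approach, not the paper's exact procedure: the SHA-1 hashing scheme, the `log(1 + count)` weighting, the whitespace tokenizer, and the 160-dimension size are all assumptions made for this example.

```python
import hashlib
import math
from collections import Counter

DIMENSIONS = 160  # assumed size for this sketch (one SHA-1 digest is 160 bits)


def word_signs(word, dimensions=DIMENSIONS):
    """Derive a fixed vector of +1/-1 values from hashes of the word."""
    signs = []
    counter = 0
    while len(signs) < dimensions:
        digest = hashlib.sha1(f"{word}_{counter}".encode("utf-8")).digest()
        for byte in digest:
            for bit in range(8):
                signs.append(1 if (byte >> bit) & 1 else -1)
        counter += 1
    return signs[:dimensions]


def project(text, dimensions=DIMENSIONS):
    """Sum the sign vectors of a text's words, weighted by log(1 + count)."""
    counts = Counter(text.lower().split())
    vector = [0.0] * dimensions
    for word, count in counts.items():
        weight = math.log1p(count)
        for i, sign in enumerate(word_signs(word, dimensions)):
            vector[i] += weight * sign
    return vector


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


sea_a = project("the whale and the sea and the ship")
sea_b = project("the ship and the sea")
physics = project("quantum electron proton neutron")
print(cosine(sea_a, sea_b), cosine(sea_a, physics))
```

Because every word always hashes to the same signs, two texts that share vocabulary end up with similar vectors without any shared dictionary being built in advance. A layout algorithm can then place those vectors in two dimensions.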
You can visually browse through a library of 14 million volumes (about as many as there are books in the Library of Congress) and click on any one to see the original in the HathiTrust digital library.
Either navigate directly using a touch screen or mouse (the “interact” button above will hide the narration) or scroll down in this panel to get a guided tour. You can click on any title to view it in the Hathi Trust catalog, and read it if it’s in the public domain.
{
"base_dir": "/data/scatter/hathi",
"colors": {"language":"","Classification":"","Subclassification":"","date":"","Principal Author":""},
"labels": ["Classification", "Subclassification","id","title","date","Principal Author", "language", "library"],
"point_size": 2,
"point_threshold": 10,
"label_threshold": 0,
"variable_point_size": false,
"filters": null,
"hide_uncolored": false,
"zoom": [1, 0, 0, 1000],
"colorize_by": "language",
"label_field": "title",
"scheme": "dark",
"guides": ["legend", "color_legend", "label_legend", "filter_legend"],
"keys": {"Subclassification": "LCC.txt", "Classification": "LCC.txt"}
}
This visualization updates to include more books as we zoom in. Although there are about 14,000,000 books in the Hathi Trust, you’re only seeing about 17,000 out of about 12,000,000 right now.
As you’d expect, the greatest differences in words used are created by the language of a text.
Since a large proportion of the Hathi Trust is in English, the single biggest cluster is a central grouping of English-language books.
{
"hide_uncolored": true,
"zoom": [1.494462502704141,5.514914338661423,-3.7660311151101347]
}
So let’s look only at English (about 17,000 books visible) to see the patterns of use there.
{
"filters": {"English": "d.language == 'English'"},
"hide_uncolored": true,
"slowly": [
{"field": "point_threshold", "value": 12},
{"field": "point_size", "value": 2.0}
],
"zoom": [1.994462502704141,5.514914338661423,-3.7660311151101347]
}
The biggest organizing principle at this level is the general subject matter of the book. I’ve colorized here by Library of Congress Classification; I’m using the top-level classes, which are things like “Q” for science or “L” for education. To see the full names of the classes, click the button below.
While many disparate books are positioned alongside each other, there is fairly strong separation in these various areas between different subject headings. When two radically different methods (librarians reading, and vectorized document representations) coincide even this much, you can safely conclude that both are tapping into something real.
But I should say for the machine-credulous: this doesn’t mean that this particular clustering is the be-all and end-all of machine reading. If I ran the algorithm again or with different parameters, the layout could be quite different. The point is simply this: if a pattern emerges in this low-dimensional representation, it probably also exists in the space defined by a book’s vocabulary.
{
"colorize_by": "Classification",
"label_field": "Classification",
"hide_uncolored": true,
"label_threshold": 0.1,
"slowly": [
{
"field": "point_threshold",
"value": 20
}
],
"filters": {"English": "d.language=='English'"}
}
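One rough way to put a number on how well the layout and the librarians agree is to ask how often a book's nearest neighbors on the map share its class. The sketch below is an illustration, not something the visualization itself computes; the function name, the brute-force search, and the toy coordinates are assumptions.

```python
import math


def neighbor_agreement(points, k=3):
    """Mean share of each point's k nearest neighbors that carry its label.

    points is a list of (x, y, label) tuples with at least two entries.
    """
    total = 0.0
    for i, (x, y, label) in enumerate(points):
        others = [
            (math.dist((x, y), (ox, oy)), other_label)
            for j, (ox, oy, other_label) in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        nearest = others[:k]
        same = sum(1 for _, other_label in nearest if other_label == label)
        total += same / len(nearest)
    return total / len(points)


# Invented coordinates: two tight, well-separated groups.
clustered = [
    (0, 0, "Q"), (0, 1, "Q"), (1, 0, "Q"), (1, 1, "Q"),
    (10, 10, "P"), (10, 11, "P"), (11, 10, "P"), (11, 11, "P"),
]
# Invented coordinates: the two classes alternate along a line.
interleaved = [(i, 0, "Q" if i % 2 == 0 else "P") for i in range(8)]

print(neighbor_agreement(clustered, k=3))
print(neighbor_agreement(interleaved, k=2))
```

A score near 1 means the classes occupy distinct neighborhoods. A score near the share you would expect by chance means the layout carries little information about the classification.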
Many sciences are located up towards the top of the map: the way the algorithm shakes out puts chemistry and physics at the far extremes of a peninsula that includes mathematics and a variety of publications dealing with technologies and materials.
{
"zoom": [10.526836200232077,10.287032865090978,-21.887397213962267],
"colorize_by": "Subclassification",
"filters": {"English": "d.language=='English'",
"Classification": "/[QRST]/"},
"hide_uncolored": true,
"label_field": "Subclassification"
}
A second cluster lives towards the eastern side. Here, the far eastern promontory is formed by works about medicine (LC classification “R”); the biological sciences help form the bridge that links medicine to the rest of scholarship, with texts about agriculture mingling most easily with works in the social sciences.
{
"zoom": [12.074376483081286,27.02329499000892,-2.941268304187396,8000],
"colorize_by": "Subclassification",
"filters": {"English": "d.language == 'English'",
"Classification": "/[QRST]/"},
"point_threshold": 12
}
Just to the northwest are works about botany, plants, and parts of zoology; they have less to do with medicine in their language, but cluster back in with the bulk of the biological sciences. At the far promontory opposite medicine is the bulk of class “SB,” plant culture, which stands in the same relation to botany as medicine does to zoology.
The S classes (agriculture, general) are one of the major areas where the historical origins of the LC classification in the early 20th century are evident. Agriculture, a major area of study at US universities and the employment area of a third of the population, seemed to logically command a top-level heading. But later bibliographers have shelved relatively few books in the “S” classes.
{
"label_threshold": 0.05,
"zoom": [14.063431825509644,23.720800638482515,-7.285157063465448],
"colorize_by": "Subclassification",
"point_threshold": 13,
"show_only_n_categories": 0
}
But other works classed as “science” or “technology” are scattered throughout the map. For instance, way on the other side of literary history is a relatively self-contained set of science clusters that consist mostly of textbooks for various science classes. Unlike the original works of science above, down here chemistry, math, and physics are quite distinct from each other.
{
"label_threshold": 0.1,
"label_field": "Subclassification",
"show_only_n_categories": 5,
"filters": {
"English": "d.language=='English'",
"Science": "d.Classification=='Q'"
},
"zoom": [11.918685221980581, 2.3099582169341026, 20.52232542157204]
}
To see what other kinds of patterns exist here, look at a different section of the library: LC classification “P”, for literature.
{
"zoom": [1.994462502704141,5.514914338661423,-3.7660311151101347],
"point_threshold": 20,
"label_field": "Subclassification",
"filters": {
"English": "d.language=='English'",
"Fiction": "d.Classification=='P'"
},"color_legend_toggle":"off","show_only_n_categories":0
}
P is divided into two large macro chunks. One contains scholarly work about literature; much of this has lexical similarities to history and biography.
The inability to distinguish between these two chunks from library metadata is something that’s frequently frustrating to scholars of digital literature.
That they show up in clearly distinct chunks here is a good thing; it suggests that it will be relatively easy to use machine learning tools to tell them apart in cases where the metadata isn’t clear. (This conforms with what Ted Underwood’s white paper on the subject found using a considerably smaller dataset.)[3]
{
"point_threshold": 20,
"zoom": [5.077083544079225,-2.4696966121767225,11.296810764084036]
}
The other chunk is actual literature, which occupies one wing off the side of the full English-language set. We are now past the point where metadata is especially good at describing the structure that the library uses, but fortunately there’s an easy way to describe what’s going on here.
{
"zoom": [9.037977716560997,23.85478326645137,12.026148406645955]
}
The major chunks in this cluster are literary genres. Although most works don’t have genre metadata, we can use a heuristic to tell genre; I’m simply going to look at whether a work has the word “novel,” “play,” or “poetry/poem” in its title. Not all that many works are helpful enough to identify themselves. But even with most literature removed, enough remain to make it clear that the main areas here represent poetry, prose, and plays.
(You may see several big chunks of “plays” hovering around the edges of poetry; many of these are 18th-century plays in verse. The diction in these in some ways resembles poetry more than it does, say, Tennessee Williams.)
{
"colorize_by": "genre",
"slowly": [{"field": "point_threshold", "value":35}],
"label_field":"title",
"point_threshold": 20,
"label_threshold": 0.05,
"filters": {"English": "d.language=='English'",
"Literature": "d.Classification=='P'"
},
"zoom": [9.037977716560997,23.85478326645137,12.026148406645955]
}
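The title heuristic described above can be sketched in a few lines. The keywords come from the text; matching plurals, ignoring case, and leaving a title unlabeled when it names more than one genre are assumptions made for this example, and `guess_genre` is a hypothetical name.

```python
import re

# Keywords taken from the description above; plural forms are an assumption.
GENRE_PATTERNS = {
    "novel": re.compile(r"\bnovels?\b", re.IGNORECASE),
    "play": re.compile(r"\bplays?\b", re.IGNORECASE),
    "poetry": re.compile(r"\b(poetry|poems?)\b", re.IGNORECASE),
}


def guess_genre(title):
    """Return a genre label when the title names exactly one genre, else None."""
    matches = [
        genre for genre, pattern in GENRE_PATTERNS.items() if pattern.search(title)
    ]
    return matches[0] if len(matches) == 1 else None


for title in ["Waverley: a novel", "Poems, chiefly lyrical", "Principles of geology"]:
    print(title, "->", guess_genre(title))
```

Most titles return `None`, which matches the observation that few works are helpful enough to identify themselves. The word-boundary anchors keep words like “playground” from counting as plays.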
Zooming in further, we can look at a yet smaller subset of the library: poetry alone. What kind of library metadata explains the organization of poetry within itself?
{
"zoom": [15.823534662094913,22.601605328038602,15.15921208910164],
"colorize_by": "genre",
"filters": {
"English": "d.language=='English'",
"Literature": "d.Classification=='P'" }
}
The answer has something to do with style. Or, at least, date of composition. Here the chart is colored by date, and poems fade in by date of composition. Around 1800, most poems cluster in the bottom half of the cluster; as time goes on, new poetry is written closer to 2000.
{
"show_only_n_categories": 0,
"point_threshold": 24,
"filters": {
"English": "d.language=='English'",
"Literature": "d.Classification=='P'",
"year": "d.date <= 1800"
},
"slowly": [
{"field": "filters", "value": {
"English": "d.language=='English'",
"year": "d.date <= 2020",
"Literature": "d.Classification=='P'"
}}],
"colorize_by": "date",
"duration": 10000,
"label_field": "date",
"label_threshold": 0.1,
"zoom": [15.823534662094913,22.601605328038602,15.15921208910164]
}
You can filter to a single decade of books at a time using the slider below.
Zooming in on early poetry gives you a bunch of books mostly written before 1850. You can click on any of these points to read them if you like. But even at this tight scale, there are forms of local textual organization.
{
"zoom": [42.263183432150086,23.14051290748467,18.23463191774931],
"label_threshold": 0.05,
"filters": {
"English": "d.language=='English'",
"Literature": "d.Classification=='P'" },
"slowly": [{"field": "point_size","value": 3},{"field":"label_threshold","value":0}]
}
Within the early poetry, one of the forms of clustering takes place by authorship. Walter Scott, William Cowper, Shakespeare, and other poets each occupy a distinct area of the chart. Homer, author of the Odyssey, appears classed as an English poet because so many different poets translated him into English in this period.
{
"filters": {},
"colorize_by": "first_author_name",
"label_field": "first_author_name",
"label_threshold": 0.05,
"show_only_n_categories": 15,
"color_legend_toggle": "off",
"slowly": [{"field": "point_size","value": 3},{"field":"label_threshold","value":0.2},{"field":"point_threshold", "value":13}],
"zoom": [52.263183432150086,23.14051290748467,18.23463191774931]
}
At the full level of magnification, we’re just looking at a few hundred books by Walter Scott, salted with a few other books either about him or imitating him.
{
"show_only_n_categories": 0, "hide_uncolored": false,
"color_legend_toggle": "off",
"label_field": "title",
"label_threshold":0.08,
"slowly": [{"field":"point_threshold", "value": 15}, {"field":"label_threshold", "value":0.3}],
"zoom": [250, 24.23846452408224, 18.761868433830934]
}
But while this resembles a library classification in certain ways, using full text also lets us play against the decisions about where a book belongs that constrain where it falls on the shelf. While almost all poetry is classed as literature, there are also rare works of history, biography, and economics in the libraries that are themselves in verse.
Looking at the poetry cluster but hiding literature surfaces these oddball works like A Metrical History of the Life and Times of Napoleon Bonaparte and historical poems about William Penn or the Spanish conquest of the new world.
Using this kind of filter produces, at times, some spectacularly convoluted verse, as when Thomas Dibdin tries to make the British national debt rhyme in his history of England:
{
"show_only_n_categories": 0,
"color_legend_toggle":"off",
"label_field": "title",
"slowly": [{"field":"point_threshold", "value": 20}],
"label_threshold":0.7,
"filters":{
"Classification":"/[^P]/"
},
"zoom": [200, 24.14437272604416, 17.57975452085781]
}
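The `"/[^P]/"` filter in the block above appears to hide everything classed as literature. The same idea, looking inside one region of the map and keeping only the books whose class is not the one you would expect there, can be sketched as follows. The `Book` record, the rectangular region, and every title and coordinate are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Book:
    title: str
    classification: str  # top-level LC class letter, e.g. "P" or "D"
    x: float  # hypothetical map coordinates
    y: float


def outsiders_in_region(books, x_range, y_range, home_class="P"):
    """Return books inside the rectangle whose class differs from home_class."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    return [
        book
        for book in books
        if x_min <= book.x <= x_max
        and y_min <= book.y <= y_max
        and book.classification != home_class
    ]


# Invented records, for illustration only.
books = [
    Book("A volume of lyric verse", "P", 24.1, 17.5),
    Book("A history told in rhyme", "D", 24.2, 17.6),
    Book("A prose history", "D", 2.0, -3.0),
]
odd_ones = outsiders_in_region(books, (24.0, 24.5), (17.0, 18.0))
print([book.title for book in odd_ones])
```

A rectangle is the crudest possible stand-in for a cluster; any real use would need a better definition of where the poetry region begins and ends.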
Within the whole of poetry, these works can fade away. But the ability to surface them is one of the ways even a mediocre machine-ordering of the digital library can be useful. As we zoom back out to the scale of poetry generally, you can see a large number of works that, although librarians correctly class them as something other than poetry, still share something in their language with the genre. This suggests that one thing this visualization can do is find places where the substance of texts does not match up with the ordering placed on them in a classification system.
Another thing it can help us with is understanding the scale of the library. Within the poetry cluster here are tens of thousands of books. Scholars in the humanities who are thought to work with “Big Data” operate, for the most part, at the upper limits of this scale. One of the foundational texts in Digital Humanities for thinking about how to deal with large libraries was Gregory Crane’s article “What Do You Do with a Million Books?”[4]
{
"point_threshold": 12,
"colorize_by": "Classification",
"label_field": "Classification",
"label_threshold": 0.05,
"filters": {"English": "d.language=='English'"},
"zoom": [6.849506276371657, 23.349639551810583, 13.840330233777106, 5000]
}
But Google, Microsoft, the Internet Archive, and the rest have scanned not 100,000 novels, nor 1,000,000 books, but something approaching 10,000,000 volumes just in English. Even if you’ve had the privilege of browsing one of the great open-stack research libraries in the country that begin to approach this size, you probably don’t have a great sense of what the big subcollections of books in it are.
{
"point_threshold": 12,
"colorize_by": "Classification",
"label_field": "Subclassification",
"label_threshold": 0.05,
"filters": {"English": "d.language=='English'"},
"zoom": [1.5,5.514914338661423,-3.7660311151101347, 10000]
}
Look, for example, at this part of fine arts (class N). Most of the books here are things like NC (drawing), ND (painting), and NK (decorative arts). Note all the little flecks of “technology” (starting with T) here. That’s predominantly because works about photography are classed as technology, not as art; when the Library of Congress created its classification system in the early 20th century, the idea of photography as a fine art was far from its mind.
{
"colorize_by": "Classification",
"label": "Subclassification",
"label_field": "Subclassification",
"slowly": [{"field": "label_threshold","value": 0.3, "duration": 750}],
"+filters": {"English": "d.language=='English'"},
"zoom": [901.385229888668,1.1110842093088635,15.694344168624554]
}
But just as with poetry, the language of art history diffuses gently through the entire library. There are bibliographical (Z) works about printers; recreational (G) works about the Disney company; and literary (P) works about puppetry.
{
"colorize_by": "Classification",
"label": "Subclassification",
"hide_uncolored": true,
"label_field": "title",
"label_threshold": 0,
"slowly": [{"field": "label_threshold","value": 0.3}],
"+filters": {"English": "d.language=='English'"},
"zoom": [901.385229888668,1.1110842093088635,15.694344168624554, 8000]
}
This doesn’t mean that the original classifications are wrong. If the library profession started over, it would certainly not come up with the same ordering of knowledge as in the Library of Congress Classification; but the system has many principles (class by subject ahead of place; follow the intent of the author when choosing between plausible alternatives) that are sensible, useful, and unlikely to be easily reproduced algorithmically.
The point is, rather, that the ways a computer classifies can sometimes reflect reality more sensitively than a rigid set of rules. Computers can be more flexible than bureaucracies.
Here, for instance, is a set of books that, for the most part, mix Spanish and English together in their contents. Some are classified in the metadata as Spanish, some as English, and some as multiple languages. But they share a linguistic commonality.
{
"filters": {},
"colorize_by":"language",
"label_field": "title",
"zoom": [49.72846520217916, 30.876259733930837, 3.5597142144718923, 9000]
}
One section like this shows the importance of non-academic writing in academic libraries. Libraries are full of instructional manuals about how to do things. The section here is riddled with instructions on how to throw a football, how to golf, how to bowl, and any other sports skill you might wish to pick up.
In the library, sports are shelved in a distinct section of class G (recreation). But clustering purely on language, these works stand not near anthropology and geography but in the broader neighborhood of a library of self-improvement for non-experts.
{
"zoom": [ 520.2827513827069, 1.933, -12.4567]
,
"label_threshold": 0.3,"label_field":"title","filters":{}
}
Just to the south, for example, is a similarly large section focusing on building skills for farming.
{
"zoom": [516.7049822363766, 2.4654468758139316, -11.22443283821256],
"label_threshold": 0.05,"label_field":"title","filters":{}
}
While this might seem like an esoterically unimportant slice of the library, it provides context for other areas that we know to be important. To the west of farming, for example, just past the guides to improving photographic technique, is an important set of artifacts of early digital culture: computer magazines from the 1970s through the 1990s. The most famous is probably Byte, one of the leading magazines through which code, tricks, and stories were shared before the Internet.
If you want to understand the early rise of the computer industry in the United States (and especially if you want to understand who it excluded, and how it differed from other countries), it might make sense to take a longer look at how it tapped into existing forms, rhetorics, and practices established by other American hobbyist movements.
{"zoom":[929.34,6.174,-12.197]
,"label_threshold":0.05
,"label_field":"title"
,"filters":{}}
Look around and you’ll find all sorts of other odd clusters of texts you might not realize have been stored in libraries. Here, for example, is a set of books of statistics that are misread as Greek.
{"colorize_by": "Classification",
"label_field": "Subclassification",
"label_threshold":0.05,
"show_only_n_categories": 0, "hide_uncolored": false,
"zoom": [115.20024701356954,5.54516604014923,-36.85146021576969]
}
This serves as a useful reminder that the digital library scholars work with is itself the product of a strange set of machine hallucinations of text. Most of the library books in Hathi were converted to digital files by Google; libraries serve the Google scans. Google’s optical character recognition (OCR) seems to work not a word or even a paragraph at a time, but over spans of several pages; it uses best estimates to make an assumption about the general character set of a book, and then tries to read it consistently in that light.
As an example of what constitutes this cluster, look at what happens when Google’s OCR encounters a page like this one, in one of the volumes of the famous Framingham Heart Study, which is flipped onto its side.[5]

Google’s program seems to fail to recognize the correct orientation and, instead, encodes it as a series of nonsense numbers and Greek letters. (Even letters that appear to be Roman here are, for the most part, their Greek equivalents.)
11Î1510Î0 3Î 015Î5 Π«3 . ΧÎÎÎŻÎœÎ 30ÎΜÎÎΜÎÎ0-30ÎΜÎÎΜΠ03"Î00<1
10816*16*1 ΔÎ189*ΔΠÎ855**6Î ÎÎ1510Î0 . ΔÎ189·ΔΠ6111ÂŁ*8ÎČ* *151Î*Î9
011Î15Î5 Î855**6Î *151Î*Î9 Î185ÎÎ51 λ3 535ÎȘ0-ÎÎÎ Î03
30λI«ΜÎ00-30Î*ÎÎ΄Î
When this is applied to title pages with the correct orientation, it produces “words” that try to use Greek characters to spell out English. So “Fertilizer use in the United States .. United States Department of Agriculture .. Bulletin No. 408” emerges looking like this:
ÎÎÎ΀ÎÎÎÎÎÎ Ï $Î ÎΠ΀ÎÎ Ï ÎÎ΀ÎÏ 5΀Î΀Î5 âŠ Ï ÎÎ΀ÎÏ 5΀Î΀Î5 ÏÎÎĄÎÎ΀ÎÎÎÎ΀ ÎÎĄ ÎÎÎÎÎÏ ÎŻÎ€ÎÎÎÎ .. ÎÏ ÎÎÎČÎη ÎÎż. 408
This gets to the heart of what it means for a book to be findable in the digital library, and why we need new forms of exploration.
In the physical libraries of the past, a book is found or lost through human interpretations. If a book is misshelved or miscataloged, it can be functionally lost for weeks or years. Finding a book lost in the stacks requires persistent guesswork about the types of decisions human librarians might make.
In a digital library, finding a lost book sometimes relies on understanding not human mistakes but computer mistakes. (Of course all computer mistakes are human mistakes, originally.) While subject headings rely on a human reading of a text, OCR is, essentially, a computer reading of a text. And if the computer reads the text disastrously wrong (like “υΝΙΤΕυ 5ΤΑΤΕ5” for “UNITED STATES”), you might think the book is fundamentally lost. No keyword search will ever find the Framingham Heart Study.
But it’s too easy to leave things there. All these volumes are clustered together based on the character strings that occur in them, which means that even if the relationships can’t be found by people using human terms, we can tease something out of the full-scale relations.
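Character strings alone are enough to flag volumes like these. As a hedged sketch (not how the map itself works), one could measure what share of a text's letters belong to the Greek script and compare that against the language recorded in the catalog. The function names and the 0.5 threshold are assumptions.

```python
import unicodedata


def greek_share(text):
    """Return the fraction of alphabetic characters that are Greek letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    greek = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("GREEK")
    )
    return greek / len(letters)


def looks_misread(text, catalog_language, threshold=0.5):
    """Flag text that is mostly Greek letters but cataloged as another language."""
    return catalog_language != "Greek" and greek_share(text) >= threshold


# Greek capitals tau, alpha, tau, epsilon: look-alikes for the Roman "TATE".
sample = "\u03a4\u0391\u03a4\u0395"
print(greek_share(sample), looks_misread(sample, "English"))
```

A check like this only catches one family of OCR failure, but it shows how a purely mechanical reading of the character strings can recover volumes that no keyword search would find.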
Spend some more time browsing around this map, and you’ll see this same pattern again and again; here, for example, is a big set of musical scores.
{"zoom":[32.72,-14.648,8.411]
,"label_threshold":0.05
,"label_field":"first_author_name"
,"filters":{}
,"show_only_n_categories":8}
Crane, Gregory. “What Do You Do with a Million Books?” D-Lib Magazine 12, no. 3. Accessed January 23, 2012. https://doi.org/10.1045/march2006-crane.
Jockers, Matthew L. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, 2013.
Richardson, Matthew, Judith Kamalski, Sarah Huggett, and Andrew Plume. “The Fundamental Interconnectedness of All Things. Places & Spaces: Mapping Science. Courtesy of Elsevier Ltd. In ‘8th Iteration (2012): Science Maps for Kids,’ Places & Spaces: Mapping Science, Edited by Katy Börner and Michael J. Stamper,” 2012. http://www.scimaps.org/detailMap/index/the_fundamental_inte_145.
Schmidt, Benjamin. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.025.
Tang, Jian, Jingzhou Liu, Ming Zhang, and Qiaozhu Mei. “Visualizing Large-Scale and High-Dimensional Data.” arXiv:1602.00370 [Cs], 2016, 287–97. https://doi.org/10.1145/2872427.2883041.
Underwood, Ted. “Understanding Genre in a Collection of a Million Volumes, Interim Report.” Accessed December 29, 2014. http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.
Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.
[1] The principles of that corpus are detailed in the paper. The basic idea is that books which are similar to each other in the words they contain should be close to each other. This so-called “bag-of-words” approach is effective at describing what works are about; the point of that paper is to show that it can be effectively used on even massive, multilingual digital libraries with relatively small representations of individual books. The particular visualization of that high-dimensional space is done here using the LargeVis algorithm. Jian Tang et al., “Visualizing Large-Scale and High-Dimensional Data,” arXiv:1602.00370 [Cs], 2016, 287–97, https://doi.org/10.1145/2872427.2883041; Benjamin Schmidt, “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.025.
[2] For interface, I’m especially drawing from https://artsexperiments.withgoogle.com/tsnemap/. Similar maps exist of scientific research using network placement algorithms (e.g., Matthew Richardson et al., “The Fundamental Interconnectedness of All Things. Places & Spaces: Mapping Science. Courtesy of Elsevier Ltd. In ‘8th Iteration (2012): Science Maps for Kids,’ Places & Spaces: Mapping Science, Edited by Katy Börner and Michael J. Stamper,” 2012, http://www.scimaps.org/detailMap/index/the_fundamental_inte_145 and http://paperscape.org/), but they rely on citation metrics.
[3] Ted Underwood, “Understanding Genre in a Collection of a Million Volumes, Interim Report,” accessed December 29, 2014, http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.
[4] Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (University of Illinois Press, 2013); Gregory Crane, “What Do You Do with a Million Books?” D-Lib Magazine 12, no. 3, accessed January 23, 2012, https://doi.org/10.1045/march2006-crane; Ted Underwood, David Bamman, and Sabrina Lee, “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019.
[5] These transcriptions are accurate as of 2018-08-29. Google periodically updates its OCR and pushes the new versions to the libraries; things may have changed by the time you see it.