Interactive Visual Bibliography: Describing corpora

At any size above a few thousand books, it can be extremely hard to get a good sense of what’s actually in a digital library. Historians and literary historians have strong prior senses about what might be missing in any library–and we can often talk at length about the biases and omissions of an archive. Indeed, it’s become somewhat normal for humanists to use visualization as a way to elicit, in Lauren Klein’s phrase, “the image of absence.”1

But while arguments about absence can help us connect datasets to social forces we know exist, they don’t necessarily provide understanding of the wildly large, sometimes surprisingly inclusive, and overwhelmingly varied contents of digital archives. To use large collections of books, we need to know what’s in them–which means finding practices for descriptive analysis of libraries. In many fields–not just the humanities–solid description is both prior to argument and worthwhile without it.2

So how do we describe the library? We need many ways, but here I want to argue that one barely-used method–interactive visual bibliographies–may be essential. Another section of this site contains a visual bibliography of the full Hathi Trust collection. The full digital library is so overwhelming in its scope that I was only able to describe a few highlights. Using the same technology stack, it’s possible to build collections that offer narrative guides but also provide free range for exploration.

Here I’ll look at a set that constitutes about 1% of the full Hathi collection, making it one of the largest collections of books under sustained examination by scholars. It is a collection of about 137,000 works of fiction made by Ted Underwood for the NovelTM project. The source for this corpus and a description of its composition are on GitHub: essentially it aims to build up a large, general-purpose model of fiction from the HathiTrust digital library. This large corpus is already beginning to be used for major research projects, like Underwood’s article on the transformation of gender in English-language fiction.3

The basic question here is: what do the contours of fiction as a whole look like when filtered against the digital library? We could slice and dice by library metadata to get a similar approximation; but we understand books through their individual profiles even more than through metadata. It’s best to be able to look at them directly.

Visual bibliography helps sketch out the contours of this collection in a way that makes it easier to interpret so-called ‘macroscopic’ readings of it. As I’ll show, it highlights several features in particular: the dominance of individual authors, especially before 1900; the presence of hundreds of works of non-fiction; the absence of certain genres of fiction outside of cultivated collection practices; and the complicated relationship of Anglo-American popular fiction, world literature, and folk tales as three dominant areas of library collection in the last 50 years.

We start with a frame that covers the entire corpus, with about 17,000 out of the 137,000 books visible and the rest hidden until we zoom in.

These works are arranged using the UMAP4 dimensionality reduction algorithm; much like t-SNE or LargeVis (which I used for the Hathi visualization), this tries to create local clusters of relatively similar documents without strictly enforcing large-scale requirements.

A good first question to ask is: is there an organizing principle that seems to work for the full set?

{
  "base_dir": "/data/scatter/fiction",
  "colors": {
    "subject_forms": "",
    "subject_places": "",
    "subject_things": "",
    "subject_times": ""
  },
  "lab": ["short_title", "author", "htid"],
  "point_size": 0.5,
  "point_threshold": 8,
  "label_threshold": 0,
  "variable_point_size": false,
  "zoom": [2.189698497269877, 4.06330232428664, 0.050736521860849315],
  "colorize_by": "author",
  "label_field": "short_title",
  "scheme": "dark",
  "guides": ["legend", "color_legend", "label_legend", "filter_legend"],
  "filters": {
    "year": "Math.atan2(d.x - 5, d.y) <= -3.14"
  },
  "slowly": [
    {
      "field": "filters",
      "value": {"year": "Math.atan2(d.x - 5, d.y) <= 3.24"},
      "duration": 5e03
    },
    {"field": "point_size", "value": 1.75}
  ]
}

In this case, there clearly is: time. A line drawn from east to west through the middle of the central cluster will run, neatly, from the present to the past. This probably seems natural, but it is not the only macro organizing principle you could imagine; in some alternate clustering, maybe there would be five big clusters shading into the middle for the detective novel, the romance novel, the adventure novel, and so forth. As we zoom in, you may see why there aren’t: the strong genres are overwhelmed by a variety of genres that blur into a central core.

You can filter to a quarter century of books at a time using the slider below.

{
  "colorize_by": "inferreddate",
  "hide_uncolored": false,
  "label_field": "inferreddate",
  "slowly": [{"field": "label_threshold", "value": 0.1}],
  "zoom": [2.189698497269877, 4.06330232428664, 0.050736521860849315],
  "filters":{}
  }

While time structures the overall view, this is not simply a circle; there are many islands and peninsulas off the main continent. It’s in these that we’ll find the most interesting results; so let’s start to look in.

Remember, you can at any time press the “interact” button on the top of the screen to zoom and pan around the chart. Hovering or tapping on any book will bring up its metadata, and then clicking or tapping again will take you to read it at the HathiTrust’s site.

{
  "filters":{},
  "zoom": [7.817635048708641, 7.439276031274092, 4.39243214221656]
  }

The far south is a field of islands that tend to each be associated with a single author.

{
  "label_field": "author",
  "colorize_by": "author",
  "zoom": [6.455322986196844, 7.337459626949853, 6.149097913666003]
  }

These exist all over the peripheries of the chart. For some authors, like the Western novelist Zane Grey, the clustering is nearly perfect; almost every work he wrote is in this cluster, and no other authors intrude. For others, it is more complicated–often in interesting ways. Especially prolific authors tend to be split across a number of different clusters.

{
  "label_field": "author",
  "colorize_by": "author",
  "zoom": [111.65514627378174, 3.6982813350819956, 5.085562780642219]
  }

There is one island that appears not to be composed of a single author, located between Alphonse Daudet and Leo Tolstoy.

{
  "zoom": [24.017799667251126, 2.9915231492363556, 4.346447680396581]
  }


If we switch the labels to show the titles, it becomes clear what unifies these works: they are not fiction at all.

{
  "label_threshold":0.1,
  "label_field":"short_title",
  "zoom": [24.017799667251126, 2.9915231492363556, 4.346447680396581]
  }

For some of these, it’s pretty clear what’s going on. Books about nineteenth-century women’s clothing show up because the classification model Underwood uses has learned that minute descriptions of clothing are highly characteristic of novels. Since one of the steps in the creation of this corpus used logistic regression to find unseen novels based on the words they use, books that are entirely about something that is itself highly characteristic of fiction can be mistakenly labeled as fiction themselves.5

{
  "zoom": [212.83836869512197, 2.949578748027431, 4.840450787191456]
  }

This is an interesting class of error to think about; it is completely alien to the type of mistake that a person would make, and yet highly characteristic of the most common methods in the digital humanities. Digital humanists like logistic regression because it produces interpretable results; you can see what words drive a classification, rather than relying on a “black box.”

But this comes at a cost. Clothing is not the only set of words that seem to trigger a problem; you also see extensive books about magic (which probably involves minute description of fabrics, body parts, and eyes–all language usually associated with fiction); and books about furniture and home decoration. Roughly, it seems that anything about the body and the domestic sphere stands some chance of being mistaken for fiction in this set. (Although I suspect there are many, many more books like these that are not classed as fiction; and some other body-heavy discourses, like medicine, have a counterbalancing vocabulary that probably prevents them from being misclassified.)
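To make this failure mode concrete, here is a minimal sketch in Python of how a bag-of-words logistic regression adds up per-word evidence. The vocabulary, weights, and bias are invented for illustration and are not taken from Underwood’s model; raw word counts stand in for whatever normalized features a real classifier would use.

```python
import math
from collections import Counter

# Hypothetical weights, invented for illustration only.
# Positive weights push a text toward "fiction", negative ones away from it.
WEIGHTS = {
    "silk": 0.9, "gown": 1.1, "sleeve": 0.8, "eyes": 1.0,
    "statistics": -1.5, "population": -1.2, "percent": -1.4,
}
BIAS = -1.0  # hypothetical intercept


def word_contributions(text):
    """Return each known word's contribution to the score (weight * count)."""
    counts = Counter(text.lower().split())
    return {w: WEIGHTS[w] * n for w, n in counts.items() if w in WEIGHTS}


def fiction_probability(text):
    """Logistic regression: sigmoid of the bias plus the summed contributions."""
    score = BIAS + sum(word_contributions(text).values())
    return 1 / (1 + math.exp(-score))


# Toy texts, also invented: a costume manual and a statistical report.
manual = "the gown has a silk sleeve and a silk lining"
report = "population statistics show ten percent growth"

contributions = word_contributions(manual)
top_word = max(contributions, key=contributions.get)
print(top_word, round(fiction_probability(manual), 3))
print(round(fiction_probability(report), 3))
```

In this toy case the costume manual scores as probable fiction, and the per-word contributions show exactly which clothing words drove the decision. That is both the interpretability and the weakness discussed above.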

{
  "zoom": [339.7878583248624, 2.8905066811287856, 4.654037749694506]
  }

In other cases in this cluster, the reasons for misclassification are less clear. (What seems ‘fictional’ about The impact of controlled access highways on population growth in nonmetropolitan communities, 1940-1970?) Some people tend to view mistakes like this as inherently calling the whole process of automatic classification into question. But so many of them are neatly lumped together here using nothing more than the bag-of-words counts that led to them being classed as fiction in the first place. That means a computer is easily capable of seeing that they have something distinctive about them, and could classify them out.

One technical solution I’d commend to the literary historians is to use, at least for corpus creation, more complicated models than simple logistic regression, which can be overwhelmed by a flood of individual words. I suspect even the equivalent of single-hidden-layer neural network models could learn to dampen out the contributions of words from a single set of vocabularies.
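The dampening suggested here can be sketched in a few lines of Python. This is only an illustration of how one saturating hidden unit caps the influence of a single vocabulary, using an invented clothing word list and invented weights; it is not a trained model and not a description of any classifier used for this corpus.

```python
import math

# Hypothetical vocabulary group and weights, invented for illustration.
CLOTHING = {"silk", "gown", "sleeve", "lace"}
WORD_WEIGHT = 0.5        # per-occurrence weight in the linear model
HIDDEN_TO_OUTPUT = 2.0   # weight from the single hidden unit to the output


def linear_score(tokens):
    """Logistic-regression-style evidence: grows without limit as words repeat."""
    return sum(WORD_WEIGHT for t in tokens if t in CLOTHING)


def hidden_unit_score(tokens):
    """One tanh hidden unit: this vocabulary can add at most HIDDEN_TO_OUTPUT."""
    return HIDDEN_TO_OUTPUT * math.tanh(linear_score(tokens))


novel = ["gown"] * 2      # a novel that mentions clothing in passing
manual = ["gown"] * 200   # a book that is entirely about clothing

print(linear_score(novel), linear_score(manual))
print(hidden_unit_score(novel), hidden_unit_score(manual))
```

In the linear model the clothing manual accumulates a hundred times the evidence of the novel. After the tanh unit, the two differ by less than half a point, so other vocabularies have room to outweigh the clothing words.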

{
  "zoom": [248.51917694271768, 3.0519584944122347, 4.040298382623485]
  }

But while it can be useful to think about mistakes, the vectorization-visualization model also lets us see the contours of the general body of fiction.

If what we’ve seen to this point are islands off the continent, there are also peninsulas; outcroppings of related texts that self-segregate from the rest of the body of fiction. So now I’m going to take you around the circle clockwise, with a few stops at the most interesting points.

We start just to the northeast of the nonfiction, in the twentieth-century area of the text.

{
    "label_threshold":0.1,
    "hide_uncolored": false,
    "label_field": "short_title", "colorize_by": "inferreddate",
    "zoom": [9.750418535326174, -0.06654332504070837, 1.5156297085520407]
  }

In this section, our imagined map corresponds to the real map of the world; I’m showing here as labels the places associated with a book in the catalog record (for catalogers: MARC 650$z or 651$a).

The main body of ‘literature’, trained on mostly American libraries, lies in Europe and the United States.

The three main peninsulas sticking down here (from the west) are literatures from East Asia, South Asia, and Russia. African literature mediates between the Indian subcontinent and the European mainland; Middle-Eastern literature shows up at the base of the Soviet peninsula.

{
  "hide_uncolored": true,
  "label_field":"subject_places","colorize_by":"subject_places",
  "zoom": [9.750418535326174, -0.06654332504070837, 1.5156297085520407]
  }

To the northwest of this lies a large region of American literature. Subject headings for the United States are much more varied than for the rest of the world; so we get a lot of “Los Angeles” and “San Francisco” rather than just “India” or “Korea.”

{
  "zoom": [10.363869091094148, -1.6078820043003947, -2.745767683496278]
  }

Within modern American literature, genre distinctions are powerful even as author clusters remain. Danielle Steel occupies a cape of her own in popular literature.

{
    "label_field": "author",
    "colorize_by": "library",
    "zoom": [65.50299916407974, -1.188740767211339, -2.764657647543107]
    
    }
  

Farther out to sea from Cape Steel is a cluster of gay erotica. The titles of these books are a remarkable collection of double entendres. But the cluster also raises an interesting point about what sorts of books make it into a fiction collection like this.

{
  "zoom": [338.14285493302265, -1.6562629709081964, -3.2441556341915643],
  "label_field": "short_title"
  }

Here the labels show years and the colors show the library that contributed the work into the Hathi Trust. Gay erotica is limited to just the 1970s, and all but one book in this cluster come from the University of Michigan system. I haven’t checked all of them, but most come, I believe, from the Labadie collection of Michigan’s libraries, which explicitly focuses on documents from marginalized political communities.

Gay erotica from the 1980s and 1990s, on the other hand, seems to be almost entirely missing. These are not novels that most university libraries collect for their literary excellence; it’s only because Michigan curators cared about the political community (and because Michigan then allowed Google to scan special collections items) that they made it into the literary collection. I haven’t seen many signs of straight smut from the same period in the library.

{
  "zoom": [64.00599832815948, -1.6932292663125015, -3.19054646413824],
  "label_field": "inferreddate"
  }
  

This means that some gaps in the chart may stand in for literal gaps in the literature.

For example, across the bay from gay erotica is the primary lesbian literature cluster in the set; but it dates almost entirely from the 1990s. If there were more lesbian literature from the 1970s, or gay writing from the 1980s-1990s, would there be stronger ties between them?

{
  "label_field":"short_title",
  "zoom": [106.80656584183127, -1.0294226594112512, -3.77856296230466]
  }

At this part of the chart, we’re in classic genre fiction. I’m switching now to label by the subject described in the MARC record (650$a), which often captures genre.

Continuing around the circle, we move past the modern American thriller.

{
  "label_field":"subject_things",
  "colorize_by":"library",
  "zoom": [39.995890843936074, 0.14815490820273247, -4.022121418131773]
  }

Science fiction occupies a healthy spot of its own: there are 1,900 visible points in the area onscreen right now.

{
  "zoom": [28.013760882045524, 2.2926195625267116, -4.117418104820118]
  }

Around here there’s another cluster of non-fiction. It’s mostly rules for games, manuals for yoga, and the like; I assume they confuse the classifier through their relatively informal language and–again–their minute description of the human body.

It shows up in this particular position on the chart, I suspect, because a cluster of baseball novels lies to the southeast.

Are there more non-fiction groupings? I suspect so. I have noticed at least one section entirely in French; there are also a great many memoirs scattered among novels about the same period. If you browse in interactive mode using subject heading labels, you’ll see a lot of this in action.

{"label_field":"short_title",
  "zoom": [99.04179193259532, 3.245644733200832, -5.231265492735812]
  }

A vast area in the northeast of the map–8,000 books in the current region–acts as a counterpart to the ‘world literature’ section in the southwest. While the geography here resembles the world, it is taken up not by single-author ‘fiction’ but by tales, sagas, fables, and folklore.


  {
  "label_field":"subject_things",
  "colorize_by":"library",
  "zoom": [12.996014701553037, 7.4506586402533905, -4.376614983198532]
  }
  

Some individual stories repeat in translation after translation, such as the Arabian Nights. There are 500 copies in just this area of the collection; if you navigate around more, you’ll find at least one other Arabian Nights cluster.

{
    "label_field":"short_title","label_threshold":0.3,
    "point_size":1.25,
    "filters": {},
    "colorize_by":"library",
    "zoom": [23.047658646521956, 14.786223292202635, -10.706838781806638],
    "duration": 3000,
    "slowly": [{"field":"point_threshold","value": 7}]
  }
  

We’ve now made it most of the way around: what remains, though–primarily from before 1920–demonstrates a great asymmetry that arises from trying to build a fiction collection from library books.

Where the more modern parts of this corpus cluster around genre distinctions, the older parts are overwhelmed by the incredible number of texts from individual authors. The most frequent author in the corpus, Walter Scott, has about 784 volumes.

{"label_field":"author",
  "label_threshold":0.3,
  "point_size":1.25,
  "filters":{},
  "colorize_by":"author",
  "zoom":[40,11.447,-1.777],
  "duration":3000
  }

If we look by decade to see how many authors have over 50 books in the collection, it’s clear that there’s a massive distinction for authors whose earliest appearance in the corpus is in the nineteenth century as opposed to the twentieth.

The reasons for this are complicated, but certainly have a lot to do with the changing practices of libraries, digital aggregators, and printing presses, not just readers. Even if a press were to put out a glossy version of the “Complete Works of Nora Roberts” today, it’s unlikely it would end up on library shelves. It’s even more unlikely that Hathi would end up with multiple copies of it, because the digital library relies so heavily on the state universities in Michigan and California for recent books.

Decade  Authors with over 50 books  Top author                      Books
1720    2                           Defoe, Daniel                   191
1730    2                           Ainsworth, William Harrison     116
1750    2                           Fielding, Henry                 139
1770    2                           Sterne, Laurence                76
1780    0                           Smith, Charlotte Turner         49
1790    0                           Goldsmith, Oliver               50
1800    3                           Edgeworth, Maria                199
1810    2                           Scott, Walter, Sir              781
1820    10                          Cooper, James Fenimore          365
1830    7                           Dickens, Charles                563
1840    13                          Thackeray, William Makepeace    404
1850    15                          Collins, Wilkie                 161
1860    17                          Eliot, George                   188
1870    18                          James, Henry                    334
1880    18                          Balzac, Honoré de               473
1890    19                          Conrad, Joseph                  180
1900    10                          Wodehouse, P. G.                102
1910    4                           Lawrence, D. H.                 76
1920    4                           Christie, Agatha                71
1930    2                           Buck, Pearl S.                  59
1940    1                           Simenon, Georges                149
1950    1                           Lessing, Doris May              51
1960    1                           Oates, Joyce Carol              53
1970    0                           Steel, Danielle                 45
1980    0                           Roberts, Nora                   31
1990    0                           McCall Smith, Alexander         22
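
A tally like the one in the table above could be computed along these lines. This is a sketch in Python, not the code used to build the table: it assumes one (author, year) pair per volume, with the year taken from something like the inferreddate field, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def prolific_authors_by_decade(records, threshold=50):
    """Summarize authors by the decade of their earliest volume.

    records: iterable of (author, year) pairs, one per volume.
    Returns {decade: (authors_over_threshold, top_author, top_author_volumes)}.
    """
    volumes = Counter()
    first_year = {}
    for author, year in records:
        volumes[author] += 1
        first_year[author] = min(year, first_year.get(author, year))

    # Group authors under the decade in which they first appear.
    by_decade = defaultdict(list)
    for author, count in volumes.items():
        decade = first_year[author] // 10 * 10
        by_decade[decade].append((count, author))

    summary = {}
    for decade, authors in sorted(by_decade.items()):
        top_count, top_author = max(authors)
        n_prolific = sum(1 for count, _ in authors if count > threshold)
        summary[decade] = (n_prolific, top_author, top_count)
    return summary


# Toy records with placeholder author names, using a threshold of 2
# instead of 50 so the example stays small.
records = [("A", 1815)] * 3 + [("A", 1822)] + [("B", 1818)] * 2 + [("C", 1901)]
print(prolific_authors_by_decade(records, threshold=2))
```

Note that every volume by an author counts toward the decade of that author’s first appearance, so a long career is credited entirely to the decade in which it begins.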

As scholars continue to work with large digital libraries, we’ll have to continue to think more about how to balance our desire to create grand categories like ‘fiction’ against the great differences in collections. Probably we’ll have to adopt different strategies for different tasks; I’ve found it useful in classification, for example, to occasionally limit a corpus to only include one book per author, lest an especially prolific writer like Jane Austen or George Eliot come to define her entire generation or genre.
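
A minimal Python sketch of that one-book-per-author strategy follows, assuming each book is a record with author and htid fields like those used for labels in the charts above. The function name and the choice of a seeded random pick are assumptions for illustration; keeping each author’s earliest edition would be an equally reasonable rule.

```python
import random


def one_book_per_author(books, seed=0):
    """Keep one randomly chosen book per author.

    books: list of dicts that each have an "author" key.
    seed: fixes the random choice so the sample is reproducible.
    """
    rng = random.Random(seed)
    by_author = {}
    for book in books:
        by_author.setdefault(book["author"], []).append(book)
    return [rng.choice(group) for group in by_author.values()]


# Placeholder identifiers, not real HathiTrust ids.
books = [
    {"author": "Eliot, George", "htid": "a1"},
    {"author": "Eliot, George", "htid": "a2"},
    {"author": "Grey, Zane", "htid": "b1"},
]
sample = one_book_per_author(books)
print([book["author"] for book in sample])
```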

And to do this adequately, we’ll need more and different ways to describe what’s in those libraries. UMAP and overview bibliographies are one; but we also need Cartesian visualization of library contents, interactive searches, and maybe even richer ways for interactively toying with higher-dimensional representations.

That will be intimidating; but it should also be delightful, surprising, and even sublime. The quantity of effort that it took to write 100,000 works of fiction or 18,000,000 books, and the richness of their contents, makes it worthwhile to find new ways of re-presenting the books themselves in contexts that we haven’t imagined yet.

McInnes, Leland, and John Healy. “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.” arXiv:1802.03426 [Cs, Stat], February 9, 2018. http://arxiv.org/abs/1802.03426.

Underwood, Ted. “Understanding Genre in a Collection of a Million Volumes, Interim Report.” Accessed December 29, 2014. http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.

Underwood, Ted, David Bamman, and Sabrina Lee. “The Transformation of Gender in English-Language Fiction.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.019.


  1. Lauren F. Klein, “The Image of Absence: Archival Silence, Data Visualization, and James Hemings,” American Literature 85, no. 4 (2013): 661–88, accessed January 14, 2015, https://doi.org/10.1215/00029831-2367310.

  2. S. Loeb, S. Dynarski, D. McFarland, P. Morris, S. Reardon, and S. Reber, “Descriptive Analysis in Education: A Guide for Researchers,” NCEE 2017–4023 (Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, 2017).

  3. Ted Underwood, David Bamman, and Sabrina Lee, “The Transformation of Gender in English-Language Fiction,” Journal of Cultural Analytics, 2018, https://doi.org/10.22148/16.019.

  4. Leland McInnes and John Healy, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv:1802.03426 [Cs, Stat], February 9, 2018, http://arxiv.org/abs/1802.03426.

  5. Ted Underwood, “Understanding Genre in a Collection of a Million Volumes, Interim Report,” accessed December 29, 2014, http://figshare.com/articles/Understanding_Genre_in_a_Collection_of_a_Million_Volumes_Interim_Report/1281251.