A recent paper of mine described a new method for turning the full digital library into a vectorized set of features–based on word counts–that ordinary computing hardware can handle.1
To get a sense of what sorts of textual properties these vectors give insight into, you can read (in addition to the second half of the paper above) my visual bibliography of 13 million Hathi books or my discussion of 130,000 works of fiction.
This page is a guide for anyone who actually wants to use them for exploration or research. They can be useful in a variety of cases beyond the ones I describe in the article.
If you want to try exploring these features, I’d recommend the following setup.
First, install the python package to work with SRP files. This is pretty easy: you just type pip install git+git://github.com/bmschmidt/pySRP.git into a terminal window, and then import SRP next time you're in python. The python module exposes a number of ways to work with the files, including a simple interface for iterating through them one row at a time that you can use to create any extracts you like. (The format is the same as that used by Google's word2vec files, so code for working with those will work here as well.2)
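Since the files share Google's word2vec binary layout (a text header giving a count and dimension, then each identifier followed by its raw floats), here is a rough, self-contained sketch of that layout. This illustrates the file format only, not the pySRP API; the function names and sample identifier are invented, and half-precision values are assumed.

```python
import io
import numpy as np

def write_vectors(f, vectors, dtype=np.float16):
    """Write {label: vector} pairs in the word2vec binary layout:
    a header line "count dim", then each label, a space, and raw floats."""
    dim = len(next(iter(vectors.values())))
    f.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for label, vec in vectors.items():
        f.write(label.encode("utf-8") + b" ")
        f.write(np.asarray(vec, dtype=dtype).tobytes())

def read_vectors(f, dtype=np.float16):
    """Read the same layout back into a {label: array} dict."""
    count, dim = map(int, f.readline().split())
    out = {}
    for _ in range(count):
        label = bytearray()
        while (c := f.read(1)) != b" ":
            label += c
        vec = np.frombuffer(f.read(dim * np.dtype(dtype).itemsize), dtype=dtype)
        out[bytes(label).decode("utf-8")] = vec
    return out

buf = io.BytesIO()
write_vectors(buf, {"mdp.39015012345678": [0.5, -1.25, 2.0]})
buf.seek(0)
vectors = read_vectors(buf)
print(vectors["mdp.39015012345678"].tolist())  # [0.5, -1.25, 2.0]
```

In practice the package's own readers are the intended route; the sketch just shows why generic word2vec tooling can read the same files.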
Second, download a copy of the features from zenodo. This link takes you not to the full 1,280-dimensional features I used in the paper but to a more compact version. That means you can download a pretty good representation of the entire HathiTrust–about 1 kilobyte of information per book–in 17GB of data. There are also segmentations by language and year if you just want to look at, say, French books. Because of the half-precision floats, this set can only be read with the python package above, which can read the binary files into a variety of more useful formats.
The big challenge here is that size matters, a lot. There are now 18 million books in the HathiTrust, and to get a useful vector representation of any one of them you need at least a few hundred vectorized points–let's say 640. If I tried to distribute these as numbers in a text file, where each point takes ten characters including spaces (e.g., -2.398139), it would take up 640 * 10 * 15,000,000 = 96GB of space.
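For what it's worth, the back-of-the-envelope arithmetic checks out, and running the same calculation with two-byte binary floats shows where the savings come from (the 15 million book count and 640 dimensions are the figures used in the text; the raw binary total lands in the same ballpark as the 17GB download):

```python
books = 15_000_000       # book count used in the estimate above
dims = 640               # vectorized points per book
chars_per_point = 10     # e.g. "-2.398139" plus a separator

text_bytes = books * dims * chars_per_point
binary_bytes = books * dims * 2   # half-precision floats: two bytes each

print(text_bytes / 1e9)    # 96.0  -- GB as plain text
print(binary_bytes / 1e9)  # 19.2  -- GB as raw half-precision floats
```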
The binary format is only 640 dimensions (which is probably about 70% of the information for half the size); it also stores numbers as half-precision floats, which lets each number be represented more compactly–if only to a few decimal places–in just two bytes.
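To see what "a few decimal places in two bytes" means in practice, here is a quick numpy check using the example value from above:

```python
import numpy as np

x = -2.398139                 # the example value from the text
h = np.float16(x)             # half precision: two bytes of storage

print(h.itemsize)             # 2
print(abs(float(h) - x) < 0.001)  # True: accurate to about three decimal places
```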
I have included in the docs for the python package a number of examples of tasks you might want to perform.
There are some other promising avenues that I haven't followed up on at length; I'm happy to talk with anyone about them.
The full data are available, in pieces, from Northeastern's digital repository. This dataset is a little large for normal handling: I have worked with it as a single 64-gigabyte file, but to make downloading feasible I have chopped it into several 2GB files by language and year. You can download just the files that are useful to you. You can also contact me if you want the full set through some other medium.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781.
Schmidt, Benjamin. “Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries.” Journal of Cultural Analytics, 2018. https://doi.org/10.22148/16.025.