Deep scatterplots

I’ve been posting a number of zoomable interactive point clouds lately; the biggest is of 14 million books in the Hathi Trust. I find these a really useful way of visualizing cultural datasets where it’s necessary to see both overall structure and individual items; I’ve written a bit more about this in a tour of 150,000 works of fiction.

So long as the number of items stays in the thousands or tens of thousands, it’s easy to visualize them in a web browser–as with Google’s beautiful Arts Experiments browser of paintings, based on a t-SNE layout. But as the number of points grows, you start to hit a new set of challenges around navigability.

Recently, new alternatives to t-SNE have made it possible to create extremely large two-dimensional embeddings. Algorithms based on nearest-neighbor networks, including UMAP and LargeVis, are able to interestingly arrange millions of items, from clothing to word embeddings to prime numbers.1 And the widespread use of embedding strategies inspired by neural networks means that more data exists in high-dimensional geometric spaces than before.

So: I’ve put the code for doing this on GitHub. Here I’m explaining how it works and some of the design decisions behind it.

Example

Here’s a basic example of the chart: it shows 13.7 million books from the Hathi Trust.

Currently they are colorized by language; the features are an approximation of word counts using a method based on random projection that I recently described. If you want to know more about this particular set, you should read the page that discusses the data.

{
  "base_dir": "/data/scatter/hathi",
  "colors": {"language":"","Classification":"","Subclassification":"","date":"","Principal Author":""},
  "lab": ["Classification", "Subclassification","id","title","date","Principal Author"],
  "point_size": 2,
  "point_threshold": 10,
  "label_threshold": 0,
  "variable_point_size": false,
  "filters": null,
  "zoom": [1, 0, 0, 1000],
  "colorize_by": "language",
  "label_field": "title",
  "scheme": "dark",
  "keys": {"Subclassification": "LCC.txt", "Classification": "LCC.txt"}
  }

Interactivity brings three fundamental things to a dataset like this.

The first is the ability to zoom in on segments of interest. Here, for instance, is the section of the chart where most works of literature are clustered.

As you zoom in, more points are loaded dynamically–more on that below.

{
  "zoom":[8.28,23.117,12.69]
  }

The second is the ability to get arbitrary information about the points. If you’re on a computer, you should be able to mouse directly over the left side of the screen to see information. If you’re on a tablet or phone, you may need to click the ‘interact’ button on the top bar, and then tap once on any point. That will bring up the full metadata profile for any point; you can then click on the point to see it in the Hathi Trust.

The third is the ability to change the representation based on metadata. Here, I change the chart so that colors represent not language but year.

I’ve hard-coded in some values to represent the years I’m likely to plot; for categorical variables, like author, classification, and library, colors are assigned from a categorical scheme.

Note the gear icon below; it opens up the underlying API to the chart. I haven’t gotten around to fully documenting it yet, but in many cases you can probably tell what the values do. If you want to muck around, try changing ‘date’ to ‘library’ below.

{
  "zoom":[9,23.117,12.69],
  "colorize_by":"date",
  "debug": false
  }

Rendering

To draw, I use an HTML canvas element, which can comfortably support between 1,000 and 20,000 points. (“Comfortably” means, technically, that it can draw them within about 35 milliseconds, allowing basically seamless panning and zooming. I’ve probably erred too much in trying to include data at the expense of performance; in some places, especially on phones, you’ll see skips between frames.)

It would be possible to cram perhaps ten times as many points onto the screen by rendering with WebGL instead. Canvas is somewhat easier to use (though not nearly as easy as SVG, at least with D3). Things would probably be prettier with more points. But since zooming ultimately allows access to any of the points, it’s not completely necessary. And even advanced technologies couldn’t display literally everything; past a million points or so, the key bottleneck becomes data, not rendering.

Tiling

So–how do you serve 14 million data points on demand?

The core principle here is tiling. Just as with online maps, I partition the space into equally sized quadrants and load data for each quadrant only when the window is zoomed far in on a point.

The idea of map tiling is to load just enough data to display points at any given resolution.

Here you can see what a basic tiling arrangement looks like. The principle I use is that each tile should have 1,000 points in it. So at the base zoom level, there are 1,000 points on screen; if you fully load a second set of tiles, there should be 4,000 points on screen; and so forth.

The code to build the tiles is part of the repo: scatterTiler.py. It only runs under Python 2 right now. To run it, you need an input file and an output directory to which tiles will be written.
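
The basic scheme is easy to sketch. This toy version (a Python 3 stand-in for illustration, not the actual scatterTiler.py) cuts the space into quadrants at each zoom level and samples up to 1,000 points per tile:

```python
import random

def naive_tiles(points, extent, max_zoom, per_tile=1000, seed=0):
    """Naive tiling sketch: at each zoom level z, the space is cut into
    2^z x 2^z quadrants, and each quadrant gets a random sample of up to
    `per_tile` points. Points are (x, y) pairs inside `extent`, given as
    (width, height) with the origin at (0, 0)."""
    rng = random.Random(seed)
    w, h = extent
    tiles = {}
    for z in range(max_zoom + 1):
        n = 2 ** z  # quadrants per side at this zoom level
        buckets = {}
        for p in points:
            x = min(int(p[0] * n / w), n - 1)
            y = min(int(p[1] * n / h), n - 1)
            buckets.setdefault((x, y), []).append(p)
        for (x, y), pts in buckets.items():
            sample = pts if len(pts) <= per_tile else rng.sample(pts, per_tile)
            tiles[(z, x, y)] = sample  # keyed by the usual (z, x, y)
    return tiles
```

Each tile here is keyed by the conventional map-tiling (z, x, y) triple; in the real pipeline each one would be written out as a CSV file.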

{
  "debug": "theoretical",
  "point_size": 0.8,
  "point_threshold": 3,
  "zoom": [0.8, 0, 0, 0]
  }

I actually implemented this naive version of tiling, and it works fairly well; but there’s a problem. If you want to zoom into the chart at any depth, you start needing to load an extraordinary number of files, many of them in areas with very few points. If I want to display 35,000 points on the screen, for example, I can set the zoom level to 35 (which is animated here). But that requires loading 341 tiles, many of which (like those in the upper left) have basically no items in them. You can see them here, numbered by the conventional ‘x, y, z’ map tiling numbers.
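
That 341, incidentally, is just the quadtree arithmetic (assuming those 341 tiles are everything through zoom level 4): one root tile, four children, sixteen grandchildren, and so on.

```python
def tiles_through(z):
    # Total tiles needed to fully load every level down to zoom z.
    return sum(4 ** level for level in range(z + 1))

print(tiles_through(4))  # 1 + 4 + 16 + 64 + 256 = 341
```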

(There’s another problem, too: it doesn’t account for duplicates well. In ordinary map rendering, you don’t show high-resolution and low-resolution tiles at the same time; but with points like this, we want to display the root information as well. I said that zoom level ‘2’ should have 4,000 points in it; but if you allow 1,000 points per tile, it will actually have 5,000: 1,000 for each of the four level-2 quadrants, and 1,000 for the root tile.)

{
  "debug": "theoretical",
  "point_size": 1,
  "point_threshold": 3,
  "zoom": [0.5, 0, 0, 0],
  "slowly":[{"field":"point_threshold", "value": 35,"duration": 7000}]
  }

The solution is to use the conventional tile structure for storing on the network, but to make sure that each tile has exactly 1,000 points in it, passing the remainder down to its children. This gives us a set of manageable CSV files for serving over the web; each tile also records the zoom level at which its children actually become necessary, which can go much further than the z index we store them with.
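
A sketch of that revised scheme (again a toy Python version, not the real tiler): each tile keeps exactly its first 1,000 points and passes the overflow down, so no point is duplicated across zoom levels.

```python
def fill_tiles(points, extent, per_tile=1000):
    """Revised tiling sketch: each tile keeps its first `per_tile` points
    and recursively passes the remainder down to its four children."""
    w, h = extent
    tiles = {}

    def fill(pts, z, tx, ty):
        tiles[(z, tx, ty)] = pts[:per_tile]
        rest = pts[per_tile:]
        if not rest:
            return
        n = 2 ** (z + 1)  # quadrants per side one level down
        children = {}
        for p in rest:
            cx = min(int(p[0] * n / w), n - 1)
            cy = min(int(p[1] * n / h), n - 1)
            children.setdefault((cx, cy), []).append(p)
        for (cx, cy), cpts in children.items():
            fill(cpts, z + 1, cx, cy)

    fill(list(points), 0, 0, 0)
    return tiles
```

Because every point lives in exactly one tile, summing the tile sizes always gives back the original count, and dense regions simply grow deeper subtrees than sparse ones.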

Here I show the tiles that actually need to be loaded. You can see in the upper right that what used to be a z-index of 4 is now over 11,000; that means we’ll have to zoom in several orders of magnification before it becomes necessary to load a new tile there.

Then, at the time of rendering, the program can simply traverse the quadtree from the top:2 as soon as it hits a tile whose recorded zoom level is greater than the current level of magnification, it knows to stop.
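
The traversal can be sketched like this; `min_next` is a hypothetical field standing in for whatever the real tiles record about when their children become necessary:

```python
def visible_tiles(tiles, magnification, key=(0, 0, 0)):
    """Render-time traversal sketch: walk the quadtree from the root and
    stop descending as soon as a tile's recorded threshold exceeds the
    current magnification. `tiles` maps (z, x, y) to a dict with a
    'min_next' entry: the magnification at which children are needed."""
    if key not in tiles:
        return []
    out = [key]
    z, x, y = key
    if magnification >= tiles[key]['min_next']:
        for dx in (0, 1):
            for dy in (0, 1):
                out += visible_tiles(tiles, magnification,
                                     (z + 1, 2 * x + dx, 2 * y + dy))
    return out
```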

{
  "debug": "actual",
  "point_size": 1,
  "point_threshold": 3,
  "slowly":[{"field":"point_threshold", "value": 35,"duration": 7000}]
  }

As you zoom in, more and more tiles will be loaded; the more points there are in a region, the greater density of tiles will be pulled from the server.

{
  "point_threshold": 22,
  "zoom": [3.3496259500337113, 20.275442124173697, 13.369927671071395],
  "debug": "actual",
  "label_threshold": 0,
  "filters": {}
  }

Mouse interactions

These tiles also serve as the basis of mouse interactions. Each tile keeps track of its own d3.quadtree object; when you hover over the canvas, the quadtrees for every visible tile are scanned to find the nearest point. Now that the number of tiles on screen is limited, this rarely requires searching more than five quadtrees, which is perfectly fast.

The other problem with using a quadtree for an interactive graphic is that a single static quadtree instance has no way to skip over points that are hidden from the display. I’ve made a patch to d3-quadtree that allows a filter on search.
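
The effect of such a filtered search can be illustrated with a brute-force stand-in (plain Python, not the d3-quadtree code): points failing the filter are skipped entirely, so hovering never lands on an item the current filter has hidden.

```python
import math

def nearest(points, x, y, accept=lambda p: True):
    """Brute-force stand-in for a quadtree nearest-point search that
    accepts a filter predicate; filtered-out points are never returned."""
    best, best_d = None, math.inf
    for p in points:
        if not accept(p):
            continue
        d = (p['x'] - x) ** 2 + (p['y'] - y) ** 2  # squared distance
        if d < best_d:
            best, best_d = p, d
    return best
```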

Filters

To make exploring these easier, I made the JSON-based API support filtering based on arbitrary functions.

Here’s an API call, for instance, that only shows works with single-word titles of eight letters or fewer. Regular expressions are useful enough that there’s a special syntax for them that can be entered by the user in ‘interactive mode’ above, through the filter panel; more complicated queries have to be written by hand.

{
  "filters": {"title": "d.title.search(/^[^ ]{0,8}$/) == 0"}
  }

{
  "label_field": "title",
  "label_threshold": 1,
  "debug": false,
  "filters": {"title": "d.title.search(/^[^ ]{0,8}$/) == 0"}
  }

Controlling the main plot through a narrative is done through general-purpose code I’ve written for this project. The nice thing about having a JSON API for control is that it’s easy to dispatch new events to update the state of the plot. This is how I’m writing all my visualizations nowadays.
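
That’s why most of the JSON blocks on this page are so short: each one is a partial update merged into the current plot state, and anything not mentioned is left alone. A minimal sketch of the idea:

```python
def dispatch(state, update):
    """Sketch of the JSON-API update model: merge a partial update
    into the current plot state without mutating the original."""
    merged = dict(state)
    merged.update(update)
    return merged
```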

{
  "zoom":[272.1,22.051,13.427]
  }

Transitions

Some properties of the chart can be changed slowly, as transitions.

One of the beauties of defining functions as text and then using D3’s interpolate function is that numeric values in functions can be automatically interpolated.

Here’s an example from the Hathi visualization, with the JSON schema code below. Initially, this says, start with only books published before 1800; and then, over the course of 10 seconds, change to a filter of only books published before 2020, while over 5 seconds making the points half as large.

D3’s interpolate function neatly redraws as quickly as possible with every intermediate value of the year.
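
The trick is roughly what d3.interpolateString does: find the numbers embedded in two otherwise-identical strings and blend them. A Python sketch of the same idea:

```python
import re

NUM = re.compile(r'-?\d+(?:\.\d+)?')

def interpolate_string(a, b):
    """Match up the numbers embedded in two strings (assumed to differ
    only in those numbers) and return a function of t that blends them."""
    nums_a = [float(m) for m in NUM.findall(a)]
    nums_b = [float(m) for m in NUM.findall(b)]
    template = NUM.sub('{}', b)  # e.g. "d.date <= {}"

    def at(t):
        vals = [x + (y - x) * t for x, y in zip(nums_a, nums_b)]
        return template.format(*(f'{v:g}' for v in vals))
    return at

step = interpolate_string("d.date <= 1800", "d.date <= 2020")
print(step(0.5))  # d.date <= 1910
```

Calling `step(t)` thirty times a second is what produces the smooth sweep through the intermediate years.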

The next portion of the narrative will run the code.

{
  "debug": false,
  "point_threshold": 25,
  "label_threshold": 0,
  "filters": {},
  "zoom": [1, 0, 0],
  "colorize_by":"date"
  }

Here’s what that looks like. In the case of library books, it highlights something quite interesting. Different genres start out around the margins of the plot; more recent ones fill in the middle, as if all genres are converging together. This replicates one of the interesting things I found in the paper: that it’s harder to classify library books today than in the past.

{
  "point_size": 5,
  "filters": {
    "year": "d.date <= 1800"
   },
  "slowly": [
   {"field": "point_size", "value":2.5, "duration": 5e03}, 
   {"field": "filters",
   "value": {"year": "d.date <= 2020"},
   "duration": 10e03}
  ]
  }

But you can put arbitrarily complex functions into the transition, which can yield visually cool results. I only really know enough trigonometry to make a spinning wheel; but if you’re better at it than I am, you might be able to create something cool by editing the API below to include a variety of spirals.

{
  "point_size": 5,
  "filters": {
    "year": "Math.atan2(d.x, d.y) <= -3.14"
   },
  "slowly": [
   {"field": "point_size", "value": 1, "duration": 10e03},
   {"field": "filters",
    "value": {"year": "Math.atan2(d.x, d.y) <= 3.24"},
    "duration": 10e03}
  ]
  }

Fast canvas tricks

Initially, I wanted to use circles for points, because circles are prettier. But I believe that in certain browser settings, circles can be a bit more time-consuming to render on a canvas element. So I use rectangles instead, shaped (approximately) like books.

This isn’t a general feature–in fact, we’re talking about putting up an abstracted version of this on one of the walls in Snell Library at Northeastern as part of our campus renovations. For that one, we’ll use SVG circles instead of canvas rectangles.

{
    "filters": {}
  }

Labelling algorithms

Let me switch to a different visualization to discuss two issues around labels.

This is a visualization of American street names, treated as text and clustered using UMAP (described further here). (I should also mention that the reason I use LargeVis for the big chart is simply that it’s hard to run these algorithms on 13 million books, and I didn’t want to do it again after UMAP was released. Some preliminary tests I did showed that UMAP produced a nice structure on the Hathi data–especially in not separating Fraktur German from regular German–but not so much better than LargeVis that I should clearly be using it instead.)

I’m using this one instead for two reasons:

  1. Because it introduces the concept of size into one of these plots. For Hathi, I shuffle the books into random order and make them all the same size. But with street names, there is a logical ordering: frequency. (This will work on any word embedding–that’s a domain where the strategy is especially useful.)

  2. Because for aesthetic reasons I made the bounding boxes visible on streets.

Avoiding bounding box collisions

When you naively plot text, you end up with a mess of colliding bounding boxes. Here I’ve created a call to the API that plots (using the ‘label_threshold’ parameter) only the top 1% of street names. But most of these are near each other!

{
    "base_dir": "/data/scatter/streets",
    "colors": {"name":""},
    "lab": ["name"],
    "point_opacity": 1,
    "collisionDetection": false,
    "point_size": 1.2,
    "font": "Overpass",
    "point_threshold": 8,
    "guides": ["legend", "label_legend", "filter_legend"],  
    "label_threshold": 0.01,
    "variable_point_size": true,
    "variable_text_size": true,  
    "zoom": [1, 4.06330232428664, 0.050736521860849315],
    "label_field": "name",
    "scheme": "streets"
    }

Rather than do a full-on jittered layout using force direction (which might be hard to recalculate 30 times a second while maintaining smooth animation), I just try to plot each word at its correct position, but then reserve that position in a quadtree. Before inserting any new word, I check nearby points in the quadtree to see if they conflict. If they do, the word is plotted as a point, not given text; and I put semi-opaque backdrops behind the words so that you can still see some of the points behind them.
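
The collision pass can be sketched like this; a plain Python list stands in for the reservation quadtree:

```python
def overlaps(a, b):
    """Axis-aligned bounding-box intersection test.
    Boxes are (x0, y0, x1, y1) tuples."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def place_labels(words):
    """Try each word at its true position; if its box collides with one
    already reserved, fall back to drawing the word as a bare point."""
    placed, labelled, points_only = [], [], []
    for word, box in words:
        if any(overlaps(box, other) for other in placed):
            points_only.append(word)
        else:
            placed.append(box)
            labelled.append(word)
    return labelled, points_only
```

The quadtree only changes the cost of the `any(...)` scan: instead of checking every reserved box, you check the handful near the candidate position.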

Calculating a bounding box for every word can itself be expensive. But luckily we only have to do it once. For any word, the bounding box size is defined by the font size and the particular width of its letters. (E.g., ‘www’ takes up more horizontal width in most fonts than ‘iii’.) We can cache that and reuse it for all future animation frames.

{
  "collisionDetection": true
  }

This means that we can plot far more words; here I slowly turn up the label_threshold to allow fully 10% of all words to be plotted if they don’t overlap.

{
  "slowly":[{"field":"label_threshold", "value": 0.10, "duration": 10e03}]
  }

Different fonts require recalculation, as well. I’m going to change to Courier–your browser will probably stutter as all the bounding boxes are recalculated.

{
  "font": "Courier"
  }

OK, that’s all. If you want to try this on another set–or to do something new, like making it plot images instead of points and text boxes, which should not be too difficult–please let me know.



  1. Leland McInnes and John Healy, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv:1802.03426 [Cs, Stat], February 9, 2018, http://arxiv.org/abs/1802.03426; Jian Tang et al., “Visualizing Large-Scale and High-Dimensional Data,” arXiv:1602.00370 [Cs], 2016, 287–97, https://doi.org/10.1145/2872427.2883041.

  2. Actually, I do a funny little shuffle rather than starting from the top; I start with all the nominally visible tiles, and use D3 promises to recursively check parent tiles.