In the Illustrated Theory of Numbers, the pictures serve different purposes. Some lend geometric insight to proofs. Others display logical flow. Others render an abstract concept. This post is about those which are data visualizations.
Using the term “data visualization” automatically increases web traffic, but I’m not just doing it for the hits. I think it’s time for the best data visualization practices to be directed towards the most interesting data in number theory, the prime numbers being the prime example. While the data visualization community typically studies people and places and money and the natural world, the Illustrated Theory of Numbers gets its data from numbers themselves. One of the fascinating things about number theory is that the data is entirely deterministic, yet it obeys heuristics for random variables.
In this spirit, I’ve provided two drafts of a data visualization below, displaying the distribution of prime numbers up to 5 million. I’ll explain the editing process that led me from left to right.
My goal in this image is to provide the reader with a sense of the microscopic irregularity and the macroscopic regularity of the prime numbers. In the leftmost column, the prime numbers are thick bars (10 points, I think). Each column displays a range of prime numbers: the first displays the primes up to 50, the second the primes up to 500, and so on. The rightmost column displays the primes up to 5 million. In some ways, this is the simplest kind of data set: a one-dimensional distribution.
From the beginning, I decided on this basic layout of columns, so that by the rightmost column the image would appear smooth, gradually growing lighter towards the top as the primes spread out. The individual primes shown on the far left are replaced by densities on the far right. A number near 5 million has about a 6.5% chance of being prime.
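As a rough check on that 6.5% figure, here is a minimal Python sketch. It assumes the standard prime number theorem heuristic, under which a number near n is prime with probability about 1/ln(n); the function name is only illustrative and is not taken from the book.

```python
import math

def heuristic_density(n):
    """Approximate chance that a number near n is prime, using 1 / ln(n)."""
    return 1 / math.log(n)

# Tops of the six columns: 50, 500, ..., 5 million.
for top in (5 * 10**k for k in range(1, 7)):
    print(f"{top:>9,}  {100 * heuristic_density(top):.1f}%")
```

The last line printed is the estimate for 5 million, which rounds to 6.5%. Note that this is the local density near the top of each column, not the average density over the whole column.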
I made a lot of changes to this image, starting with the draft on the left (from a few years ago) and ending at the draft on the right (a few weeks ago). First, I pushed the prime number labels onto the bars. There might be some printing/clarity risks with white text on black bars, but it gets across the idea that the bars are the prime numbers, and it reduces the risk that a reader will assume the same “ticks” apply to all columns.
In that spirit, I narrowed and separated the columns. This, I think, lightens the whole page, saves ink, and increases clarity. The red lines now indicate how each column is effectively contained in a tenth of the column to its right. I’ll admit there’s a bit of influence from the cover of Tufte’s Visual Display of Quantitative Information, though the subject matter is completely different. I hope the red lines also break the tendency to scan directly left-to-right, and indicate how data is squeezed into shorter intervals.
To lighten the page further, I changed the shading in columns 3-6. In the first two columns, solid black bars are used to represent prime numbers. But in columns 3-6, each bin is drawn in a shade of gray according to the density of primes within it. Among the 500 numbers from 4000 to 4499, there are 60 prime numbers. Since 60/500 = 12%, I used a line segment at 12% black in the later draft. (With TikZ, this is accomplished by setting the color to black!12.)
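The bin-by-bin shading can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used for the book: the function names are made up, the sieve is a plain sieve of Eratosthenes, and the split into ten equal bins per column is an assumption inferred from the 4000 to 4499 example.

```python
import math

def prime_sieve(limit):
    """Return a bytearray with sieve[n] == 1 exactly when n is prime (0 <= n <= limit)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            # Cross out the multiples of p, starting at p * p.
            sieve[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return sieve

def bin_shades(limit, bins=10):
    """Split 0..limit-1 into equal bins; return (start, end, prime count, TikZ color) per bin."""
    sieve = prime_sieve(limit)
    width = limit // bins
    shades = []
    for start in range(0, limit, width):
        count = sum(sieve[start:start + width])
        percent = round(100 * count / width)  # density as a whole-number percentage
        shades.append((start, start + width - 1, count, f"black!{percent}"))
    return shades

# Column 3 covers the numbers up to 5000, in bins of 500.
for start, end, count, color in bin_shades(5000):
    print(f"{start:>4}-{end:<4}  {count:>3} primes  {color}")
```

For the bin from 4000 to 4499, this reports 60 primes and the color black!12, matching the figure above.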
At first I was concerned that this would be too light, and I’ll see how it all looks when it’s printed professionally. But on the Ricoh printers here, the result looks good — even at 6.5% black (the density around 5 million), the gray is easily distinguishable from the white paper. And this fits with the principle of “smallest effective difference” described in Tufte’s Visual Explanations. It’s a bit hard (though not impossible) to see the primes spread out, as their density goes from 8.5% to 6.5% in the rightmost column. But that’s also part of the honest representation — it would be dishonest to the data to exaggerate the image to make the primes appear to spread out more quickly. The table of densities at the far right exhibits the gradual spreading unambiguously with numbers.
A note to the reader — the images tend to render with horizontal stripes on a computer monitor! Another reminder to print on a regular basis.
There’s probably a bit more tuning to do before publication. The primes deserve the effort.