Sep 23 2023
Data visualization is not just the art of presenting data to an audience. Upstream from this, you use visualizations in data cleaning to identify defective points, and in exploratory analysis, to identify patterns of interest. Then, you validate these patterns with a more formal analysis. Once confident that you have findings of value to communicate, you worry about making a compelling presentation.
Nick Desbarats and I had a long exchange on LinkedIn prompted by his article Connected Scatterplots Make Me Feel Dumb in Nightingale, the Data Visualization Society journal, on 8/29/2023. What he calls a connected scatterplot is what I call an orbit chart, and I have found orbit charts helpful, particularly in analysis.
- Charts versus Data Science
- Why “Orbit Chart”?
Charts versus Data Science
Nick Desbarats has a website called Practical Reporting, which emphasizes charts and dashboards. It offers four days of training on these topics, compared to one day on graphs for analysis.
The communication of results is the last stage in a data science project, after data acquisition, cleaning, storage/retrieval, and analysis. Without the preceding stages, the best communication skills are useless because you have nothing to say. Once you have something to say, communication skills matter. Without them, your results are moot.
Charts and Data Products
Desbarats’s focus on charts is too narrow within communications, as they are only part of what you need. A finished data product can be a single chart, like Minard’s summary of Napoleon’s Russia campaign, but it can also be a 1,000-page book, like Thomas Piketty’s Capital in the 21st Century.
It can be a video like Hans Rosling’s introduction of the trendalyzer, an infographic like Alberto Cairo’s, a performance board used to support daily meetings on a factory floor, or a slide show to move an audience to action in a conference room. For individual charts within data products, you can take advice from Edward Tufte or Jean-Luc Doumont or, perhaps, once his book comes out, Nick Desbarats.
Exploratory Data Analysis
The phrase “exploratory data analysis” usually evokes John Tukey’s eponymous book from 1977. Before it, no self-respecting statistician would dwell on this subject. Compared to the heights of mathematical statistics, they considered it trivial. This is no longer the case, and many of the data mining tools of today are, in fact, exploratory.
When exploring data, you don’t need to comply with anyone else’s standards of chart appearance. Your charts can be ugly and look unprofessional, but it doesn’t matter because it’s between you and the data. This changes once you have results to share.
The burden of communication with data products is not entirely on the analysts. It behooves professionals in the audience to be data literate enough to understand and act on solid work while detecting shoddy work.
Why “Orbit Chart”?
I found that “Connected Scatterplot” is a term some use for orbit charts, even though the former is ambiguous and the latter is both shorter and more descriptive.
Ambiguity in “Connected Scatterplot”
In the following examples, data to viz calls both of these charts “Connected Scatterplots”:
The first one shows bitcoin prices over time, and it is just a line plot with markers for the data points. The second one shows the evolution over 30 years of the popularity of Amanda and Ashley as first names. Amandas grew in numbers while Ashleys stagnated in the 1970s, then Ashleys shot up in the early 1980s while Amandas plateaued, and both have been losing ground steadily since 1990.
These are different kinds of charts that should not go by the same name.
“Orbit Chart” is Short
It’s three syllables versus six, so “orbit chart” wins on name length. We could also call them “path charts” or “path plots,” which are even shorter, but “orbit chart” rolls off the tongue more easily.
“Orbit Chart” is Descriptive
In a scatterplot, you see how data pairs sampled from two random variables scatter over the plane. The points are not ordered: you can scramble their sequence and the scatterplot remains the same. You treat the data set as independent occurrences of the same pair of variables. You want to see whether the variables tend to be high or low together, whether one tends to be high when the other is low, or whether there is no such pattern.
Nick Desbarats “couldn’t find a single example of an insight or pattern that was clear in a connected scatterplot but that wasn’t also clear in a stacked or indexed line chart of the same data.” I have many, and included some in my original post on orbit charts from 2013:
- Minard’s Russia campaign chart
- Unplanned versus planned downtime in nuclear power plants
- Gini index versus GDP in Brazil and the US, 1980-2011
- Recovery from crisis at Toyota versus GM
Here, I would like to elaborate on some of them and add new ones.
Literal Orbit Charts with Planets and Spacecraft
A literal orbit chart may show the path of spacecraft over time. It has labeled tick marks for time, as in this example from NASA:
The points along the orbit are anything but independent. They are on a path calculated through celestial mechanics, and it would make no sense to scramble them. Neither would it make sense to plot the x and y positions of the spacecraft separately. In particular, seeing the crossings with planetary orbits would be more difficult. This would be unfortunate, as these crossings were the purpose of these missions.
You don’t have a sample of independent occurrences of two variables. Instead, you have a time series with each point differing from its predecessor by a small increment. When you plot the point positions over time, you have an orbit chart, with a meaning that is different from a scatterplot.
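The difference can be sketched in a few lines of Python. This is only an illustration under made-up assumptions: the orbit is a hypothetical set of 50 points around a unit circle, and `path_length` is a helper name introduced here. Scrambling the points leaves the scatterplot unchanged but destroys the path an orbit chart would draw.

```python
import math
import random

def path_length(points):
    """Total length of the polyline joining the points in the given order."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

# Hypothetical orbit: 50 evenly spaced time steps around a unit circle.
orbit = [(math.cos(2 * math.pi * k / 50), math.sin(2 * math.pi * k / 50))
         for k in range(50)]

# Scramble the sequence with a fixed seed so the result is repeatable.
shuffled = orbit[:]
random.Random(0).shuffle(shuffled)

# A scatterplot only sees the set of points, which scrambling leaves unchanged.
same_scatter = sorted(orbit) == sorted(shuffled)

# An orbit chart joins consecutive points, so the drawn path depends on order.
ordered_length = path_length(orbit)
shuffled_length = path_length(shuffled)
```

Here `same_scatter` is true, while `shuffled_length` is much larger than `ordered_length`: the scrambled sequence zigzags across the circle instead of following it.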
Phase Diagrams in Physics
While this is less intuitive and not obvious for the rest of us, physicists commonly show the orbits of dynamic systems in a phase space of two parameters that aren’t necessarily positions in space.
For example, most power plants in the world today use Rankine engines. They boil water with heat from nuclear fission or fossil fuels. The resulting steam moves a turbine that turns alternators. Then the steam condenses and a pump pushes the water back into the boiler. Heat comes in through the boiler and goes out through the condenser, and mechanical energy comes out through the turbine:
So far, this is easy. In order to quantify the behavior of these engines, however, thermodynamicists describe the state of water cycling through this system in terms of temperature and entropy, as in the red orbit in the following diagram:
It’s a chart you learn to read if you study the subject, but not one you find in magazines for the general public. It’s useful to you if you design or sustain power plants; if you don’t, not understanding it is no reason to “feel dumb” or dismiss the chart.
Limit Cycles in Populations of Predators and Prey
The predator/prey relationship is similar to many you encounter in business, economics, and many sciences but is most easily explained by pikes eating carps in a pond. One might expect the populations to reach an equilibrium, with pikes eating just enough carps and carps breeding just fast enough to keep both populations constant.
A century ago, biologists observed that this was not happening. Instead, both populations oscillated. As the prey population grows, so does the predator’s, until too many predators deplete the prey population. Then the predators face a shortage of prey, which depletes their population, resulting in the prey population recovering… The Lotka-Volterra equations, published in the 1920s, provided a model that explains why this dynamic does not lead to equilibrium.
Pikes and Carps
The salient characteristics of this model are best explained in an orbit chart. A point in the plane can represent the joint population of the two species, and the equations allow us to plot its orbit over time. In the following chart, the black dot represents the equilibrium. It’s unstable. If you are exactly at it, you stay there but if you move away from it a bit, the equations take you further and further away. It’s like the summit of a mountain: the path of least resistance is away from it in every direction.
The equations treat the populations like continuous variables, which, clearly, they are not. At every instant, they are subject to discrete changes through eggs hatching, individuals dying from causes unrelated to the other species, and encounters between one pike and one carp.
This means that, if you are at the equilibrium, you stray from it, and you tend not to come back. Instead, you move towards a limit cycle, shown here in red, and it is true whether your starting point is inside or outside the limit cycle. The limit cycle is like a valley surrounding the summit.
By contrast, you can plot the time series of both populations and it won’t make you see the limit cycle. The orbit chart allows you to visualize a pattern you would otherwise miss.
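As a rough illustration, here is a minimal Python sketch that integrates the classic form of the Lotka-Volterra equations with a fixed-step Runge-Kutta scheme. The coefficients, starting populations, and function names are all made up for the example. It makes no attempt to reproduce the limit cycle in the chart above: in this basic form, each starting point traces its own closed loop around the equilibrium.

```python
def lotka_volterra(prey, pred, a=1.0, b=0.1, c=1.5, d=0.075):
    """Classic Lotka-Volterra rates of change (coefficients are made up)."""
    return a * prey - b * prey * pred, d * prey * pred - c * pred

def orbit(prey, pred, dt=0.01, steps=2000):
    """Integrate with fixed-step RK4; return the list of (prey, predator) points."""
    points = [(prey, pred)]
    for _ in range(steps):
        k1 = lotka_volterra(prey, pred)
        k2 = lotka_volterra(prey + dt / 2 * k1[0], pred + dt / 2 * k1[1])
        k3 = lotka_volterra(prey + dt / 2 * k2[0], pred + dt / 2 * k2[1])
        k4 = lotka_volterra(prey + dt * k3[0], pred + dt * k3[1])
        prey += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        pred += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
        points.append((prey, pred))
    return points

# Hypothetical starting populations, away from the equilibrium at (20, 10).
points = orbit(prey=10.0, pred=5.0)
prey_series = [p[0] for p in points]  # what a line chart of carps over time shows
pred_series = [p[1] for p in points]  # what a line chart of pikes over time shows
# An orbit chart plots pred_series against prey_series, not each against time.
```

Plotting `prey_series` and `pred_series` against time gives two oscillating lines; plotting one against the other gives the orbit.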
Some decades ago, shortly after learning this theory from Vladimir Arnold while interning at the Hahn-Meitner Institute in Berlin, I encountered a similar problem when researcher Wolf Montserrat wanted to solve a system of equations for the populations of vacancies and interstitials in a crystal of potassium chloride bombarded by neutrons. A few years later, in Silicon Valley, it was about market size for products with pent-up demand. More recently, the basic SIR model of epidemiology, which is also similar, helped us understand how the COVID-19 pandemic unfolded and the value of “flattening the curve.”
The spaghetti map is the most familiar form of orbit chart to manufacturing professionals. Usually, you show a spaghetti map of an existing process to highlight its inextricability. The map makes the point that the process needs disentangling. It doesn’t require mapping all the existing processes in a plant.
Always the Hurricanes Blowing
In 2022, I reviewed different ways to visualize data about Atlantic hurricanes, and many of them were literal orbit charts, showing paths on a map. The top-level summary showed hurricanes over 169 years originating off the coast of Senegal, ravaging the Caribbean, Central America, and the Eastern US, before dying in the North Atlantic:
The ROC curve
This curve, commonly used to evaluate methods that classify data into two categories based on a threshold, is an orbit chart indexed by the threshold instead of time.
Origin of the ROC Curve
As a radar operator in the UK in 1940, you need to report the airplanes on your screen, but want to report only the German planes. You use an Identification Friend or Foe (IFF) system, in which each friendly airplane has a transponder that actively responds to the radar. Is the signal from a plane just a reflection from a foe, or does it come from a friend’s transponder?
It doesn’t work perfectly, and you must set a threshold in the strength of the response below which you decide the plane is a foe. This threshold determines both the probability of correctly identifying a foe and that of mistaking a friend for a foe. With the threshold too low, you never identify a plane as a foe. With the threshold too high, you identify all planes as foes. If the IFF is any good, somewhere in between, you have thresholds that give you a high probability of correctly identifying a foe and a low one of mistaking a friend for a foe.
The most concise summary of this information is the orbit chart of these two probabilities when the threshold varies. Because of its roots in radars, it’s called the “Receiver Operating Characteristic,” or ROC Curve.
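The threshold sweep can be sketched in Python as follows. Everything numeric here is an assumption made for illustration: the response strengths are drawn from two invented normal distributions, and `roc_point` is a hypothetical helper, not part of any radar or statistics library.

```python
import random

rng = random.Random(1)
# Hypothetical response strengths: friends' transponders answer strongly,
# foes only reflect the radar pulse. Both distributions are invented.
friends = [rng.gauss(3.0, 1.0) for _ in range(1000)]
foes = [rng.gauss(1.0, 1.0) for _ in range(1000)]

def roc_point(threshold):
    """Return (share of friends called foes, share of foes called foes)."""
    false_alarm = sum(s < threshold for s in friends) / len(friends)
    detection = sum(s < threshold for s in foes) / len(foes)
    return false_alarm, detection

# Sweep the threshold from too low to too high. The threshold, not time,
# is the index along the orbit.
thresholds = [t / 10 for t in range(-60, 101)]
roc = [roc_point(t) for t in thresholds]
```

Each element of `roc` is one point of the curve. The sweep starts where no plane is called a foe and ends where every plane is, and both coordinates can only grow as the threshold rises.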
ROC Curves Versus the OC Curves of Sample Inspections
Quality professionals study the Operating Characteristics (OC) of sample inspections. For a given inspection plan, these curves show the probability of acceptance of a lot as a function of its proportion of defectives. This OC curve is best known through two of its points:
- The AQL is the proportion of defectives for which the acceptance probability of the lot is 95%.
- The LTPD is the proportion of defectives for which this probability is 10%.
The OC and ROC curves have similar-sounding names but, otherwise, nothing in common. One is a plain mapping; the other is an orbit chart.
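By way of contrast, an OC curve is an ordinary function of one variable. The sketch below assumes a hypothetical single-sampling plan (inspect n = 50 units, accept the lot if at most c = 2 are defective) and a binomial model; the plan and the function names are made up for illustration. It locates the two points described above by bisection.

```python
from math import comb

def acceptance_probability(p, n=50, c=2):
    """P(accept lot) when at most c defectives are found in a sample of n.

    Binomial model for a hypothetical single-sampling plan (n and c are made up).
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def proportion_for(target, n=50, c=2):
    """Bisect for the proportion defective whose acceptance probability is target."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        # Acceptance probability decreases as the proportion defective grows.
        if acceptance_probability(mid, n, c) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

aql = proportion_for(0.95)   # proportion defective accepted 95% of the time
ltpd = proportion_for(0.10)  # proportion defective accepted 10% of the time
```

Plotting `acceptance_probability` against the proportion defective gives the OC curve, with each x value mapped to exactly one y value.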
Data visualizations are not just for communication with an audience. Whatever tools you use to clean and analyze data are between you and the data. The more tools you master, the more effective you are.
Charts are only worth posting to an audience if they convey new information. Then you need to worry about clarity and understandability for your specific audience. The same visualizations won’t work with production operators, the R&D staff, or the board of directors. The more tools you master, the better you can adjust your aim.