Where Have The Scatterplots Gone?

What passes for “business analytics” or “business intelligence” (BI), as advertised by software vendors, is limited to basic and poorly designed charts that fail to show interactions between variables, even though the use of scatterplots and elementary regression is taught to American middle schoolers and to shop floor operators participating in quality circles.

But the software suppliers seem to think that this kind of analysis is beyond the cognitive ability of executives. Technically, scatterplots are not difficult to generate, and there are even techniques to visualize more complex interactions than those between pairs of variables, like trendalyzers or 3D scatterplots. And, of course, visualization is only the first step. You usually need other techniques to base any decision on data.

The current state of business analytics

According to INFORMS, analytics is “the scientific process of transforming data into insight for making better decisions.” There is, however, a gap between this definition and the visualizations offered by most software packages for analytics or business intelligence (BI). All you see is pie charts and stacked bar charts, with a few time series and heat maps. Other than being loaded with graphic junk, these charts have in common that each shows only one variable.

Obviously, you can’t do much analysis without considering interactions between variables. Interactions tell you which products consumers tend to buy together, whether lead times are related to quality, which combination of flow rates, temperatures and pressures gives you the most consistent output from a chemical reactor, whether first-pass yield is affected by the number of components in a product, or what effect planned maintenance has on equipment failures. But the only BI package I could locate that bragged about finding “hidden relationships in your data” and showed at least one scatterplot is TIBCO’s Spotfire:

[Image: TIBCO Spotfire screenshot]

This is not pushing the state of the art: the scatterplot is the most basic display of the relationship between two variables. In the US, scatterplots are taught in middle school, with 8th-graders learning to plot January temperatures in various world cities against latitude, deal with missing data, and take altitude into account.

In Japan and wherever QC circles exist, scatterplots are one of the “7 tools of QC” taught to each member. They are used, for example, to replace the observation of true product characteristics requiring destructive testing with easy-to-measure substitutes like length or weight. Here is a recent YouTube video by Lisa Bussom, introducing the concept at a basic level in this context:

If 8th-graders and production operators can do it, why do suppliers of analytics software assume that managers and executives can’t?

Edward Tufte thinks that a key indicator of graphical sophistication is the percentage of graphics based on more than one variable that are neither time series nor maps. Back in 1983, he published his research on the world press, in which he found three Japanese titles among the top five, rounded out by Germany’s Der Spiegel and the UK’s The Economist.

On LinkedIn Pulse, Eugene Ivanov cited an article from The Economist on 5/15/2015 about the relationship between religiosity, as measured by the share of the population of a country that self-identifies as “religious,” and innovation, as measured by patents per capita. While the choice of metrics is arguable, the key results are shown in the following scatterplot:

[Image: Innovation vs. Religiosity scatterplot in The Economist]

The chart is clear, except for the y-axis, which is logarithmic and should be labeled accordingly. The values are consistent with natural logarithms of patents per capita: “-6” for Japan actually means about 2.5 patents for every 1,000 citizens, while “-18” for Ethiopia means about 0.015 patents per million citizens, or roughly 1 patent for the entire country. The logarithmic axis has the advantage of showing a linear trend but the drawback of making the differences appear less stark than they are. The Economist has a “data team” that produces a whole section of the magazine entitled “Charts, maps and infographics.”
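To make the reading of the axis concrete, here is a minimal Python sketch of the arithmetic. It assumes the axis shows the natural logarithm of patents per capita, which is an inference from the numbers quoted above rather than something the chart states, and the function name is made up for illustration.

```python
import math

def per_capita_from_log(log_value):
    """Undo an assumed natural-log transform to get patents per person."""
    return math.exp(log_value)

# Axis readings quoted in the text.
japan = per_capita_from_log(-6)
ethiopia = per_capita_from_log(-18)

# Rescale to the units used in the text.
japan_per_thousand = japan * 1_000
ethiopia_per_million = ethiopia * 1_000_000

# A gap of 12 units on the axis is a factor of e**12 in patents per capita.
ratio = japan / ethiopia

print(f"Japan: {japan_per_thousand:.1f} patents per 1,000 citizens")
print(f"Ethiopia: {ethiopia_per_million:.3f} patents per million citizens")
print(f"Japan / Ethiopia: {ratio:,.0f}")
```

The ratio is the point: two marks that sit 12 units apart on the chart differ by a factor of more than 100,000 in the underlying metric.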

In Tufte’s ranking, the top-ranked American publication, Business Week, was 7th. This was based on articles published between 1974 and 1980. I have unfortunately not seen more recent data. A look at the way Bloomberg/Business Week chose to present the ranking of US auto sales in 2014 does not make me optimistic:

[Image: Bloomberg US auto sales 2014]

On the Bloomberg site, you don’t even see the legend unless you click “Scroll,” and, to get any data about any car, you need to hover over the corresponding bubble. If there is a logic to the relative position of the bubbles, I missed it: the Jeep Cherokee is next to the Mercedes S-Class. This is the most technically complex and least useful presentation of rankings that I have ever seen. It actually does not show any sequence of ranks!

Generating Scatterplots

Technically, a scatterplot is easy to generate. It is one of the charting options in Excel. To post a scatterplot on a web page, you can use Google Charts. These tools are sufficient if you are just interested in the interaction between two variables.

[Image: Old Faithful eruption]

In teaching, a commonly used example is the scatterplot of the durations of eruptions versus time between eruptions for Yellowstone’s Old Faithful geyser, for which a set of 272 observations from ranger logs is publicly available. It is provided with R as the faithful dataset. Possible equivalents in a manufacturing context include plotting time-to-repair versus time-between-failures on a machine that you run to failure, or order size versus time between orders from a given customer. In the first case, you might want to determine whether a periodic maintenance plan would increase availability; in the second, whether an erratic ordering pattern masks a smooth consumption by the customer, who may be better served on a replenishment basis.

With Old Faithful eruptions, the idea is to use the duration of eruptions to predict how long you will have to wait for the next one. In Excel, it takes a few seconds to generate a scatterplot, but you then have to play with the formatting options to enhance visibility and get the following result:

[Image: Old Faithful scatterplot generated in Excel]

Looking at this chart, you can’t help but notice the following:

  1. On average, the longer an eruption, the longer you will have to wait for the next one.
  2. There are two distinct clusters of points, which you can separate horizontally at Eruption Time = 3 min, and vertically at Waiting Time = 70 min.
  3. The points appear gridded, due to the rounding of waiting times by rangers when manually logging.
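The first two observations can be checked numerically rather than by eye. The Python sketch below fits a least-squares line and compares the two clusters on either side of the 3-minute boundary. The six data points are made-up illustrative values, not the actual ranger logs, and the function name is hypothetical; with the real 272 observations, you would load them in place of the two lists.

```python
import math

def least_squares(xs, ys):
    """Return slope, intercept and Pearson correlation of y against x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# ILLUSTRATIVE, MADE-UP VALUES (not the real Old Faithful data).
durations_min = [1.8, 2.0, 2.2, 4.0, 4.3, 4.5]   # eruption durations
waits_min = [52, 55, 58, 78, 82, 85]             # waits until next eruption

slope, intercept, r = least_squares(durations_min, waits_min)

# Split at the 3-minute boundary mentioned in the list above.
short_waits = [w for d, w in zip(durations_min, waits_min) if d < 3.0]
long_waits = [w for d, w in zip(durations_min, waits_min) if d >= 3.0]
mean_short = sum(short_waits) / len(short_waits)
mean_long = sum(long_waits) / len(long_waits)

print(f"wait = {intercept:.1f} + {slope:.1f} * duration, r = {r:.3f}")
print(f"mean wait after short eruptions: {mean_short:.1f} min")
print(f"mean wait after long eruptions: {mean_long:.1f} min")
```

A positive slope corresponds to the first observation, and a large gap between the two cluster means corresponds to the second.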

Using R, Stephen Murdoch has enhanced this plot as follows, based on Edward Tufte’s ideas:

[Image: Old Faithful scatterplot enhanced in R by Stephen Murdoch]

The main enhancement is the addition of histograms of each variable along the axes. The red dots along each axis mark the means, and the author drew a boundary at duration = 180 sec = 3 min to separate the two clusters. A less obvious addition is the color-coding of the points, where red is used for points where the previous eruption was under 180 sec, and blue otherwise. Visually, the long-eruption cluster appears to have roughly equal numbers of red and blue dots, indicating that a long eruption is equally likely to be preceded by a short one or by another long one, while short eruptions tend to be preceded by other short ones.
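The color-coding rule is simple to express in code. The following Python sketch is not the author’s R code; it only illustrates the rule as described, with a hypothetical function name and made-up durations. It assumes the eruptions are listed in chronological order.

```python
def color_by_previous(durations_sec, threshold_sec=180):
    """Label each eruption by the duration of the one before it.

    The first eruption has no predecessor, so it gets None.
    """
    colors = [None]
    for previous in durations_sec[:-1]:
        colors.append("red" if previous < threshold_sec else "blue")
    return colors

# ILLUSTRATIVE, MADE-UP durations in seconds, in chronological order.
durations_sec = [110, 250, 260, 120, 240]
colors = color_by_previous(durations_sec)

for duration, color in zip(durations_sec, colors):
    print(duration, color)
```

The resulting labels can then be passed to whatever charting tool you use as the color of each point.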

In Excel, if your cloud of points appears concave or convex, you can play with the axes, making either or both logarithmic to straighten it. If you succeed, you have uncovered an exponential law (one logarithmic axis) or a power law (both axes logarithmic). This is an analysis that engineers used to do painstakingly with pencil and different kinds of graph paper. This, of course, does not cover all the possible relationships, but these are common. Besides semi-log and log-log paper, they also used other special forms like normal probability paper, which lives on as Quantile-Quantile (Q-Q) plots in R and other statistical packages. If you want to do it in Excel, you have to work at it.
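The same straightening can be done numerically: take logarithms and fit a straight line. The Python sketch below uses data generated from known laws, y = 3x² and y = 5e^(0.3x), chosen purely for illustration, to show that a fit on log-log axes recovers the exponent of a power law and a fit on semi-log axes recovers the rate of an exponential. The function name is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic data generated from known laws, for illustration only.
xs = list(range(1, 11))
power_ys = [3.0 * x ** 2 for x in xs]            # power law: y = a * x**b
exp_ys = [5.0 * math.exp(0.3 * x) for x in xs]   # exponential: y = c * e**(k*x)

# Log-log axes: log y = log a + b * log x, so the slope is the exponent b.
b, log_a = fit_line([math.log(x) for x in xs],
                    [math.log(y) for y in power_ys])

# Semi-log axes: log y = log c + k * x, so the slope is the rate k.
k, log_c = fit_line(xs, [math.log(y) for y in exp_ys])

print(f"power law: y = {math.exp(log_a):.2f} * x^{b:.2f}")
print(f"exponential: y = {math.exp(log_c):.2f} * exp({k:.2f} * x)")
```

With real, noisy data, the fit will not be exact, and you would look at the residuals to judge whether the straightened cloud is really linear.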

On the other hand, if you have 10 variables and don’t know which pairs interact, you can generate the 45 scatterplots in Excel one by one, or you can use a variety of other tools to generate a matrix of all the possible scatterplots at once. You can do this with Minitab, with R’s ggplot2 package, and with many other statistical packages. In the following example from biostatisticians Ben Bolker and Jonathan Dushoff, done with R’s ggplot2 package, the diagonal slots are used to show the distribution of each variable. The parts above and below the diagonal mirror each other, respectively showing scatterplots of Y versus X and X versus Y, and are therefore redundant.

[Image: Example of a scatterplot matrix]
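The count of 45 comes from choosing 2 variables out of 10 without regard to order, which is also why half of the off-diagonal slots in the matrix are redundant. A short Python sketch, with placeholder variable names x1 to x10, enumerates the pairs:

```python
from itertools import combinations
from math import comb

# Placeholder names standing in for 10 measured variables.
variables = [f"x{i}" for i in range(1, 11)]

# Each unordered pair corresponds to one distinct scatterplot.
pairs = list(combinations(variables, 2))

# A full 10 x 10 matrix has 90 off-diagonal slots, i.e. each pair twice.
off_diagonal_slots = len(variables) * (len(variables) - 1)

print(len(pairs), "distinct scatterplots")
print(off_diagonal_slots, "off-diagonal slots in the matrix")
```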

Visualizing more complex interactions

What if you have interactions that are not pairwise? For three variables, you can use a 3D scatterplot or Hans Rosling’s Motion Chart, formerly known as the Trendalyzer. Here is a Trendalyzer example from Hans Rosling showing life expectancy versus income in multiple countries over 200 years:

The Motion Chart is available as a Google Chart. In the Motion Chart, one variable, meant to be time, plays a special role, and the chart lets you see a continuum of 2D scatterplots of the other two variables over time. A 3D scatterplot, on the other hand, just shows you a cloud of points, and none of the three variables plays a special role. For Excel, M. Kumarasamy provides a 3D scatterplot template you can download, with instructions, or you can use the scatterplot3d package in R. The following example is from Advanced Software Engineering:

[Image: Example of a 3D scatterplot]

As you can see, a 3D scatterplot can show more complex interactions than a plain, 2D scatterplot. Graphically, however, individual points are more difficult to read. Conceptually, you could have 3D motion charts, which would be a continuum of 3D scatterplots over time. I just have not seen one.

You can add more dimensions of information through varying line thickness, whiskers, bubbles of varying sizes, and color codes. Done sparingly, it enhances the chart; done to excess, it turns it into an unintelligible mess, with “too many notes.”

With higher dimensions, you can use dimensionality reduction techniques like Principal Component Analysis (PCA) to visualize a projection of your 50-dimension cloud of points onto a plane. It may let you visually identify clusters or clumps of outliers, but any pairwise interaction you see in the projected cloud is between your principal components, which are linear combinations of your variables, rather than among the variables themselves, and that makes it difficult to interpret.
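To show what “linear combinations of your variables” means in practice, here is a minimal pure-Python sketch of a projection onto the first two principal components. It finds them by power iteration with deflation, a technique chosen here only to keep the example free of dependencies; in real work you would more likely use a statistical package. The function names and the tiny 3-variable dataset are made up for illustration.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(rows):
    """Subtract the mean of each variable."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def top_eigenvector(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector."""
    rng = random.Random(0)  # fixed seed for reproducibility
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    return eigenvalue, v

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    lam1, v1 = top_eigenvector(cov)
    # Deflation: remove the first component, then find the next one.
    d = len(cov)
    deflated = [[cov[i][j] - lam1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    _, v2 = top_eigenvector(deflated)
    projected = [(dot(r, v1), dot(r, v2)) for r in centered]
    return projected, v1, v2

# Made-up data: large spread along (1, 1, 0), small spread along
# (1, -1, 0), and a third variable that never changes.
rows = [(a + b, a - b, 5.0) for a in (-2, -1, 0, 1, 2) for b in (-0.5, 0.5)]

projected, v1, v2 = pca_2d(rows)
print("first component:", [round(x, 3) for x in v1])
print("second component:", [round(x, 3) for x in v2])
```

Each coordinate of a projected point mixes the original variables according to the weights in `v1` and `v2`, which is exactly why a pattern seen in the projection cannot be read directly as a relationship between two of the original variables.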

Beyond visualization

Of course, visualizing interactions is only the first step. You still need to understand the nature of these interactions and, for example, the accuracy you can achieve when using some to predict the others. But this is another topic.