May 23 2018
If you google “data-science + manufacturing,” what comes back is recycled hype about the factory of the future. The same vision has been painted before and hasn’t come to pass, yet we are expected to believe that it will this time and that it will be a “4th industrial revolution.” Whether it’s true or not, this happy talk is no help in today’s factories. “Data science” covers real advances in the art of working with data, and the more relevant question is what it can do to improve existing operations.
This is not just about reaping tangible benefits today rather than hypothetical ones in the future but also about acquiring skills that will be needed to design new plants and production lines 5 years from now. These publications endow technology with a power to drive innovation that it doesn’t have. It is only a means for people to innovate. Their ability to do so hinges on their mastery of the technology, which is acquired by using it in continuous improvement.
Perhaps as a result of their focus, what these publications call “data science” in manufacturing is misleading because it has nothing to do with the meaning of the term everywhere else, in e-commerce, epidemiology, or economics. If you pick up any of the many books on data science, you find that they are about data wrangling, data analysis and the communication of results, not the internet of things (IoT), cyber-physical systems, or 3D printing, fascinating though these technologies may be.
Even though manufacturing depends on data to control both quantities and quality, it has never taken the lead in using technology to put its data to work in support of managerial or technical decisions. The challenge today remains to find simple answers to simple questions about what customers are buying, how it varies over time in aggregate volume and mix, how well products meet the needs they are designed to fill, where problems are and whether the solutions have worked…
Undoubtedly, direct observation of the shop floor and personal communications with stakeholders provide information that data cannot but the converse is true as well. What you see with your own eyes and what people tell you is not the complete story. It must be supplemented by data analysis but the currently prevalent practices don’t measure up. Pie charts, stacked bar charts, time series plots, safety crosses and other mainstays of performance boards are not sufficient.
From Statistics to Data Science
For decades, statisticians in manufacturing companies have tried to introduce more sophisticated methods without much success. In manufacturing, the title of “statistician” does not inspire confidence, to the point that Six Sigma called its practitioners “black belts” rather than “staff statisticians.” Today, statistics itself, as a stand-alone discipline, is an endangered species, subsumed under data science even in academia.
Yale University, which established its Department of Statistics in 1963, changed its name to Department of Statistics and Data Science in 2017. Columbia University created its Data Science Institute in 2012. In the US, 23 universities offer Masters in Data Science, including, besides Yale and Columbia, Harvard, Stanford, UC Berkeley, Cornell, Carnegie-Mellon, and Georgia Tech. Some of these programs are run by current or former statistics departments, as at Yale, Cornell, and Stanford; others, like Carnegie-Mellon’s, are in the school of computer science; others still, like Rutgers’s or Michigan State’s, are in the business school. This tells us (1) that Data Science is taken seriously as a topic by the leaders of academia and (2) that there are divergent perspectives on what it is and where it belongs.
So what is the difference between statistics and data science, and why should you care? If you study statistics, you learn mathematical methods to draw conclusions from data and that’s it. Data science has a broader scope: it starts with data acquisition and ends with finished data products that can be reports, dashboards, or online visualization and query systems.
Once data is collected through IoT devices, cameras, transaction processing systems, or manually, it needs to be organized, cleaned, stored and retrieved using the technology of databases, data warehouses, data lakes, and other variants. This is taught in Computer Science, not Statistics.
Visualization at all stages of analysis is also central to data science but does not get as much respect in statistics as a branch of math. While renowned statisticians like John Tukey have worked on it, the art of producing charts and maps is not the core of the statistics curriculum and is dismissively called “descriptive statistics,” because it does not usually involve deep math. In data science, visualization is key to identifying patterns in data, validating models, and communicating results. It is an integral part of the process.
Among the many terms currently applied to the art of analyzing data, data science is most descriptive of the field as a whole. You also hear of data mining, machine learning, deep learning, or big data, all of which describe subsets or applications of data science but are often conflated. Strictly speaking:
- Data mining is the analysis of data that was collected for a different purpose, as opposed to Design of Experiments (DOE), where the data is collected specifically for the purpose of supporting or refuting a hypothesis.
- Machine learning is what is done by algorithms that become better at a task as new data accrues. For example, a neural network may be designed to recognize a handwritten “8” and to improve its performance with experience.
- Deep learning doesn’t mean what it says — acquiring deep knowledge about a topic. It designates multiple layers of neural networks where each layer uses the output of the layer below.
- Big Data refers to the multi-terabyte datasets generated daily in e-commerce, from clickthroughs to buying and selling transactions. Manufacturing datasets don’t qualify as Big Data. True Big Data is so large that it requires special tools, like Apache’s Hadoop and Google’s MapReduce, and I have never heard either mentioned in a manufacturing setting.
Data science is a broader umbrella term that is, if anything, too broad. Taken literally, it could encompass all of information technology. As used in most publications, it does not cover data acquisition technology but kicks in once the data has been collected, and it produces human-readable output to support decisions by humans. It does not include the use of data to control a physical process, as in 3D-printing, self-driving cars, or CNC.
The analytical tools used in data science receive the most media attention but are not where data scientists spend most of their time. Instead, while estimates vary and this is not accurately measured, the bulk of their efforts is spent preparing data. Ideally, this shouldn’t be happening; in reality, it does. The company’s systems should be able to produce tables with column headers like “Product ID,” “Serial Number,” “Completion Date,” “Color,” etc. followed by rows of values that an analyst can select, filter, join with other tables, summarize and transform to find answers.
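To make "select, filter, join with other tables, summarize" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two tables, their columns beyond those named above, and every value in them are invented for illustration; they stand in for whatever a company's systems would actually deliver.

```python
import sqlite3

# Hypothetical tables and values, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id TEXT PRIMARY KEY, family TEXT);
CREATE TABLE units (
    product_id TEXT, serial_number TEXT,
    completion_date TEXT, color TEXT
);
INSERT INTO products VALUES ('P-100', 'Pumps'), ('P-200', 'Valves');
INSERT INTO units VALUES
    ('P-100', 'SN001', '2018-05-01', 'Blue'),
    ('P-100', 'SN002', '2018-05-02', 'Green'),
    ('P-200', 'SN003', '2018-05-02', 'Blue'),
    ('P-200', 'SN004', '2018-04-28', 'Blue');
""")

# Select, filter, join, and summarize in one query:
# units completed on or after May 1, 2018, counted by family and color.
rows = conn.execute("""
    SELECT p.family, u.color, COUNT(*) AS units_completed
    FROM units AS u
    JOIN products AS p ON p.product_id = u.product_id
    WHERE u.completion_date >= '2018-05-01'
    GROUP BY p.family, u.color
    ORDER BY p.family, u.color
""").fetchall()

for family, color, units_completed in rows:
    print(family, color, units_completed)
```

The point is not the tool but the shape of the data: once it arrives as clean tables with consistent keys, a question like this takes a few lines.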
The integrated system that would provide this is still in the future, and may stay there, for non-technical reasons. To date, it has not been possible for any software supplier to develop a single system with modules for all manufacturing activities from engineering change control to maintenance and quality that could outperform specialized systems for each function.
There is no technical obstacle, but the human dynamics of the software industry have kept such systems out of existence. The dominant providers of ERP products all started by being successful at one function (like multi-currency accounting or human resource management) and expanded from there into domains in which they had neither expertise nor the ability to recruit the best experts, and their specialized modules are generally not competitive with stand-alone systems developed by domain experts.
Short of having a single, all-in-one system, you might configure different systems to play together well. This would require them to have the same names for the same objects in all systems, consistent groupings and consistent relationships for products, processes, operations, equipment, critical dimensions, and people. The systems could then collaborate and feed usable extracts to analysts. The development of such a common information model, however, is not usually high on a manufacturing manager’s to-do list.
The prevailing reality is a multitude of legacy systems used by different departments, supplemented by individual spreadsheets. The same product goes by different names in Engineering, Production, Marketing, and Accounting, and the products are grouped by technical similarity in Engineering, volume class in Production, market segment in Marketing, and business unit in Accounting. Not only is the same object known by multiple names but supposedly unique names are used for several different objects.
The names are “smart” numbers, a legacy of the paper-and-pencil age, where, for example, you know that the product is blue because the 5th character of the name is “1,” and green would be “2.” In addition, the most valuable information, like a technician’s diagnosis of a machine failure, is often only available in free-text comments. And then there are missing values. In addition to the problems with the systems officially supported by the IT department, the individual spreadsheets contain tables with missing rows due to incomplete copy-and-paste operations, and errors in formulas.
The most common management response is to declare defeat. “It will be fixed when we implement a new system in two years,” they say. Or they give up on this plant and promise to do better in the next one to be built. Not only does giving up fail to provide answers to today’s questions but it also fails to prepare the organization to specify, acquire and implement new systems in the future.
Continuous improvement of existing production line layouts, workstation designs, or logistics is how you learn to design new ones. The same holds for information systems, and it translates to an organized, sustained effort to make the existing systems useful in spite of all their flaws and low data quality.
The query tools of relational databases are the workhorses of data wrangling, but they are not sufficient, as data does not always come in tables but sometimes in lists of name-value pairs in a variety of formats like JSON or XML that must first be parsed and cross-tabulated. You also need more powerful tools to split “smart” part numbers into their components, identify the meaning of each component, and translate values into plain English. And you need even more sophisticated text mining tools to convert free-text comments into formal descriptions of events by category and key parameters.
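Two of these wrangling steps can be sketched in a few lines of Python. The sketch assumes the color convention from the example above (5th character "1" for blue, "2" for green); the part numbers, the JSON field names, and the function names are all hypothetical.

```python
import json

# Hypothetical code table, based only on the example convention:
# 5th character "1" means blue, "2" means green.
COLOR_BY_CODE = {"1": "blue", "2": "green"}

def decode_color(part_number):
    """Return the color encoded in the 5th character, or None if unknown."""
    if len(part_number) < 5:
        return None
    return COLOR_BY_CODE.get(part_number[4])

# Hypothetical extract: one JSON record per measurement, as name-value pairs.
raw = """[
  {"serial": "SN001", "name": "length_mm", "value": 50.1},
  {"serial": "SN001", "name": "width_mm", "value": 20.0},
  {"serial": "SN002", "name": "length_mm", "value": 49.8}
]"""

def cross_tabulate(records):
    """Pivot name-value records into one row (a dict) per serial number."""
    table = {}
    for rec in records:
        table.setdefault(rec["serial"], {})[rec["name"]] = rec["value"]
    return table

table = cross_tabulate(json.loads(raw))

print(decode_color("AB7X1-009"))        # color looked up from the 5th character
print(table["SN001"])                   # one row with both measurements
print(table["SN002"].get("width_mm"))   # a missing value shows up as None
```

Unknown codes and missing measurements come back as None rather than raising errors, so that the gaps in the data stay visible instead of being silently filled.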
It doesn’t work perfectly. You may be able to recover only 90% or 95% of your data, but then you not only have a clean dataset but also a set of wrangling tools that can then be incrementally applied to new data and enrich this dataset, which raises the question of where to keep it. A common approach is to use a special kind of database called a data warehouse, into which you load daily extracts from all the legacy systems after they have been cleaned and properly formatted. They can then be conveniently retrieved for analysis.
The part of the data warehouse that is actually used for analysis may be a small fraction of its content, but you don’t know ahead of time which fraction. As a result, most of the data that is prepared and stored in the warehouse is never used. This has motivated companies with very large data sets, as in e-commerce, to come up with another approach called the data lake, into which you throw data objects from multiple systems in their original formats and prepare them for analysis if and when you have established that they are needed. Whether a data warehouse or data lake is preferable in an organization is a question of size. With small datasets, the penalty for preparing all data is small when weighed against the convenience of having it ready to use.
Analyzing the data
With clean data, you are, finally, at the statistician’s starting point. The first step is always to explore the data with simple summaries and plots of one variable or two at a time, and this is often sufficient to answer many questions. Being a good data scientist is about making the data talk, not about using a particular set of tools.
Data science training leaves you with a box full of tools that you don’t necessarily know what to do with, bearing names that are not self-explanatory like k-means clustering, bagging, the kernel trick, random forests, and many others. They were developed to solve problems but, to you, they are cures in search of a disease and answer questions you don’t have. The topical literature fails to answer the three questions John Seddon recommended asking about any tool:
- Who invented it?
- What problem was he or she trying to solve?
- Do I have this problem?
In data science, when a tool was invented is also essential because its use requires information technology. The tools of the 1920s rely on assumptions about probability distributions to simplify calculations; the ones from the 1990s and later require fewer assumptions and involve multiple simulations.
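The contrast between the two styles can be sketched on a small sample. The formula-based interval below assumes the sample mean is approximately normal; the simulation-based one uses the bootstrap, a resampling method that Brad Efron introduced in 1979 and that became routine once computing was cheap. The cycle times are invented, and the seed and the number of resamples are arbitrary choices.

```python
import random
from statistics import mean, stdev

# Hypothetical sample of 12 cycle times, in seconds.
sample = [41, 42, 44, 47, 48, 52, 53, 55, 58, 61, 63, 71]

# Formula style: assume the mean is approximately normal.
m, s, n = mean(sample), stdev(sample), len(sample)
half_width = 1.96 * s / n ** 0.5
formula_interval = (m - half_width, m + half_width)

# Simulation style: resample with replacement many times,
# with no assumption about the shape of the distribution.
rng = random.Random(0)  # fixed seed so that the run is repeatable
boot_means = sorted(mean(rng.choices(sample, k=n)) for _ in range(2000))

# Approximate 95% percentile interval from the sorted resampled means.
bootstrap_interval = (boot_means[49], boot_means[1949])

print("formula:  ", formula_interval)
print("bootstrap:", bootstrap_interval)
```

The first interval is one line of arithmetic; the second takes 2,000 recalculations of the mean, which was unthinkable by hand and is instantaneous today.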
You find out, for example, that logistic regression has nothing to do with moving goods and was invented in 1958 by David Cox to predict a categorical outcome from a linear combination of predictors that can be numbers or categories. In manufacturing, it will tell you how relevant the variables and attributes you collect in-process are to a finished unit’s ability to pass Final Test. If they are not relevant, you may stop collecting them and can look for better ones; if they are relevant, you can modify the final test process to leverage the information these variables provide. Logistic regression can also be used to improve binning operations.
That it’s from 1958 tells you that using it on a dataset with 20,000 points and 15 predictors is unlikely to overtax a 2018 laptop or tablet. In this particular case, the name of David Cox does not add much information because he was a theoretician, as opposed to others, who worked on specific applications, like W. E. Deming in manufacturing quality, or Brad Efron in epidemiology. You may ask what your problems have in common with epidemiology.
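As a rough illustration of the Final Test use, here is a minimal sketch that fits a logistic regression with a single predictor by gradient ascent in plain Python. The in-process measurements, the pass/fail outcomes, the learning rate, and the function names are all invented for the example; in practice you would rely on a statistics package rather than a hand-rolled fit.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, rate=0.1, steps=5000):
    """Fit P(pass) = sigmoid(b0 + b1 * x) by gradient ascent
    on the average log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1

# Hypothetical data: deviation from nominal of an in-process dimension,
# and whether the finished unit passed Final Test (1) or failed (0).
deviation = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.6, 1.8]
passed = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

b0, b1 = fit_logistic(deviation, passed)
print("intercept:", b0, "slope:", b1)
for x in (0.2, 1.0, 1.8):
    print("deviation", x, "-> estimated P(pass)", sigmoid(b0 + b1 * x))
```

A negative slope says that larger deviations go with lower odds of passing. A slope near zero would suggest that the measurement tells you little about Final Test results, though judging that properly requires the standard errors that a statistics package reports.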
Not only are you likely to find that you have no use for many of the tools in the published data science toolboxes but also that you have problems none of them addresses. Whether it is about demand, bookings and billings or about technical product characteristics, manufacturing data comes in the form of time series. There are many tools for visualizing, analyzing, modeling and controlling time series but they are just off the data science lists.
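One long-established example of such a time series tool is the individuals and moving range (XmR) control chart, which sets limits at the mean plus or minus 2.66 times the average moving range. The sketch below computes these limits from a baseline and checks new points against them; the measurements and names are invented for illustration.

```python
from statistics import mean

def xmr_limits(values):
    """Return (lower, center, upper) limits of an individuals (XmR) chart:
    center = mean, limits = center +/- 2.66 * average moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread

# Hypothetical measurements of one characteristic, in time order.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
new_points = [10.0, 10.1, 11.5]

lower, center, upper = xmr_limits(baseline)
flagged = [v for v in new_points if not lower <= v <= upper]

print("limits:", lower, center, upper)
print("points outside the limits:", flagged)
```

The limits come from a baseline period rather than from the points being judged, so that an unusual new value cannot widen the limits that are supposed to catch it.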
Once you have established that a tool may be useful to you, you need to learn how to use it. You don’t need to plough through the underlying math any more than a car driver needs to understand the theory of engines. It can remain a black box to you but you still need to know how to feed it data, what the various settings do, and how to interpret the output. By itself, this is not a trivial investment in time and effort and needs to be done selectively.
The presentation of results to stakeholders who are not data scientists is past the statistician’s end point. The results are moot unless they can be communicated to decision-makers in a clear and compelling fashion. The art of generating reports, slide sets, infographics and performance boards is not taught in statistics courses and not covered in statistics textbooks. It is often entrusted either to engineers who are poor communicators or to graphic artists who do not understand the technical content and produce charts that decorate rather than inform or persuade.
In business, the report, with a narrative in complete sentences and annotated charts, is a dying art, replaced by the slide set with bullet points that are not sentences and graphics that are limited to 3D pie charts and stacked-bar charts. When reports are produced, they are expected to fit on a single A3 or 11×17 page. Even the capstone project at the end of the Johns Hopkins University series of online courses on data science is documented by a slide set rather than a report.
This works for many activities but data science isn’t one of them. With slides and A3s alone, you can gloss over gaps in logic that report writing would expose and prompt authors to fill. Slides and A3s are useful, respectively, as visual aids for oral presentations and as summaries, but only as a supplement to a fully baked, objective and rigorous statement of analysis and results, expressed in layman’s terms and with all appropriate nuances and caveats.
That executives are “too busy” to read reports is only true for reports that haven’t been designed to be read by busy executives. An executive always has the time to read a one-page summary — possibly an A3 — and spot-check the research behind the conclusions at three locations within the report. Reading it cover to cover is not usually necessary, particularly if the report has been designed with this use in mind.
The communication of data science is heavily graphic and, rather than limit themselves to a small set of standard charts that have been used in manufacturing for a century, engineers should expand their horizon, use more types of charts, embed them in infographics, and leverage the insights of a researcher like Edward Tufte. In addition, when a report is produced in electronic form, illustrations are not limited to still images. In Hans Rosling’s trendalyzer, for example, an animation shows a scatterplot changing over time. A histogram can also come with a slider bar to allow the reader to instantly see the effect of changing bin sizes.
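What a bin-size slider lets the reader see can be approximated in text. In this minimal sketch, the same data is binned at two widths; the cycle times and the function name are hypothetical, and an interactive chart would redraw bars rather than print them.

```python
import math
from collections import Counter

def histogram(values, bin_width):
    """Count values per bin; the bin starting at s covers [s, s + bin_width)."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    return {k * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical cycle times, in seconds.
cycle_times = [41, 42, 44, 47, 48, 52, 53, 55, 58, 61, 63, 71]

for width in (10, 5):
    print("bin width:", width)
    for start, count in histogram(cycle_times, width).items():
        print(f"  {start:>3} | " + "#" * count)
```

With the wider bins, the counts fall off steadily from left to right; with the narrower ones, the same data looks nearly flat with a gap before the last value. Neither picture is wrong, which is why letting the reader move the slider is more honest than picking one bin size for them.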
The reports that are vanishing in business live on in academic papers, with abstracts in place of executive summaries. In many fields, these papers are, in fact, data science reports, and are not without challenges. First, academia’s review process does not always work. Growth in a Time of Debt, for example, an influential 2010 paper by Harvard economists, was exposed in 2013 by students as containing calculation errors.
Second, when an academic paper is cited, the conclusions are often amplified beyond recognition. This is how a lighting study conducted on just 5 women assembling relays at Western Electric’s Hawthorne plant in the late 1920s spawned the belief in a Hawthorne effect that makes all the workers of the world more productive when management pays attention to them.
Data scientists cannot prevent journalists, politicians, or even work colleagues from oversimplifying and distorting their work but it behooves them to speak up when it happens. They are responsible for the quality of the work, including not only sound analytics but effective communication as well.
The software toolkit of most engineers and managers in manufacturing is limited to Excel and PowerPoint, with the addition of Minitab for Six Sigma black belts. It doesn’t cut it for data science, and there are plenty of other options for all stages, from data wrangling to analysis and presentation. Some tools are free, powerful, and reliable, but require a high level of skills from users. Others are “for everyone” and available for fees. Regardless of the choice of tools, the main investment is in learning to apply them, just as with production machinery.