May 23 2018
If you google “data-science + manufacturing,” what comes back is recycled hype about the factory of the future. The same vision has been painted before and hasn’t come to pass. Yet we are expected to believe that this time it will be a “4th industrial revolution.” Whether it’s true or not, this happy talk is no help in today’s factories. “Data science” covers real advances in the art of working with data, and the more relevant question is what it can do to improve existing operations.
This is not just about reaping tangible benefits today rather than hypothetical ones in the future but also about acquiring skills needed to design new plants and production lines 5 years from now. These publications endow technology with a power to drive innovation that it doesn’t have. It is only a means for people to innovate. Their ability to do so hinges on their mastery of the technology, which is acquired by using it in continuous improvement.
- Data Science in Manufacturing
- From Statistics to Data Science
- Data Wrangling/Munging
- Analyzing the data
- Presenting results
Data Science in Manufacturing
Perhaps as a result of their focus, what these publications call “data science” in manufacturing is misleading because it has nothing to do with the meaning of the term everywhere else, in e-commerce, epidemiology, or economics. If you pick up any of the many books on data science, you find that they are about data wrangling, data analysis and the communication of results, not the internet of things (IoT), cyber-physical systems, or 3D printing, fascinating though these technologies may be.
Simple Answers to Simple Questions
Even though manufacturing depends on data to control both quantities and quality, it has never taken the lead in using technology to put its data to work in support of managerial or technical decisions. The challenge today remains to find simple answers to simple questions about what customers are buying, how it varies over time in aggregate volume and mix, how well products meet needs, where problems are and whether the solutions have worked…
The Limits of Direct Observation
Undoubtedly, direct observation of the shop floor and personal communications with stakeholders provide information that data cannot, but the converse is true as well. What you see with your own eyes and what people tell you are not the complete story. You must supplement them with data analysis, but current practices don’t measure up. Pie charts, stacked bar charts, time series plots, safety crosses and other mainstays of performance boards are not sufficient.
From Statistics to Data Science
For decades, statisticians in manufacturing companies have tried to introduce more sophisticated methods without much success. In manufacturing, the title of “statistician” does not inspire confidence, to the point that Six Sigma called its practitioners “black belts” rather than “staff statisticians.”
What Schools Teach
Today, statistics itself, as a stand-alone discipline, is an endangered species, subsumed under data science even in academia.
Yale University, which established its Department of Statistics in 1963, changed its name to Department of Statistics and Data Science in 2017. Columbia University created its Data Science Institute in 2012. In the US, 23 universities offer Master’s degrees in Data Science, including, besides Yale and Columbia, Harvard, Stanford, UC Berkeley, Cornell, Carnegie-Mellon, and Georgia Tech. Current or former statistics departments run some of these programs, as at Yale, Cornell, and Stanford; others, like Carnegie-Mellon’s, are in the school of computer science; others still, like Rutgers’s or Michigan State’s, are in the business school. This tells us (1) that the leaders of academia take data science seriously as a topic, and (2) that there are divergent perspectives on what it is and where it belongs.
Statistics Within Data Science
So what is the difference between statistics and data science, and why should you care? If you study statistics, you learn mathematical methods to draw conclusions from data and that’s it. Data science has a broader scope: it starts with data acquisition and ends with finished data products that can be reports, dashboards, or online visualization and query systems.
Once data is collected through IoT devices, cameras, transaction processing systems, or manually, it needs to be organized, cleaned, stored and retrieved using the technology of databases, data warehouses, data lakes, and other variants. This is taught in Computer Science, not Statistics.
Visualization at all stages of analysis is also central to data science but does not get as much respect in statistics as a branch of math. While renowned statisticians like John Tukey have worked on it, the art of producing charts and maps is not the core of the statistics curriculum and many academics dismissively call it “descriptive statistics,” because it does not usually involve deep math. In data science, visualization is key to identifying patterns in data, validating models, and communicating results. It is an integral part of the process.
Data Science Versus Other Labels
Among the many terms currently applied to the art of analyzing data, data science is most descriptive of the field as a whole. You also hear of data mining, machine learning, deep learning, or big data, all of which describe subsets or applications of data science. Strictly speaking:
- In data mining, you analyse data collected for a different purpose, as opposed to Design of Experiments (DOE), where you collect data specifically for the purpose of supporting or refuting a hypothesis.
- Machine learning designates algorithms that become better at a task as new data accrues. For example, a neural network designed to recognize a handwritten “8” improves its performance with experience.
- Deep learning doesn’t mean what it says — acquiring deep knowledge about a topic. It designates multiple layers of neural networks where each layer uses the output of the layer below.
- Big Data refers to the multi-terabyte datasets generated daily in e-commerce, from clickthroughs to buying and selling transactions. Manufacturing datasets don’t qualify as Big Data. True Big Data is so large that it requires special tools, like Apache’s Hadoop and Google’s MapReduce, and I have never heard of either in a manufacturing setting.
Data science is a broader umbrella term that is, if anything, too broad. Taken literally, it could encompass all of information technology. As used in most publications, it does not cover data acquisition technology but kicks in once the data has been collected, and it produces human-readable output to support decisions by humans. It does not include the use of data to control a physical process, as in 3D-printing, self-driving cars, or CNC.
Data Wrangling/Munging
The analytical tools used in data science receive the most media attention but are not where data scientists spend most of their time. Instead, while estimates vary, they spend the bulk of their efforts preparing data. Ideally, this shouldn’t be happening; in reality, it does. The company’s systems should be able to produce tables with column headers like “Product ID,” “Serial Number,” “Completion Date,” “Color,” etc., followed by rows of values that an analyst can select, filter, join with other tables, summarize and transform to find answers.
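To make "select, filter, join, summarize" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The two tables, their contents, and the "family" column are invented for illustration; only the column names "Product ID," "Serial Number," "Completion Date," and "Color" come from the text above.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with two small, made-up tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE units (
            product_id TEXT, serial_number TEXT,
            completion_date TEXT, color TEXT
        );
        CREATE TABLE products (product_id TEXT PRIMARY KEY, family TEXT);
    """)
    con.executemany(
        "INSERT INTO units VALUES (?, ?, ?, ?)",
        [
            ("P-100", "SN001", "2018-05-01", "blue"),
            ("P-100", "SN002", "2018-05-01", "green"),
            ("P-200", "SN003", "2018-05-02", "blue"),
            ("P-200", "SN004", "2018-05-03", "blue"),
            ("P-300", "SN005", "2018-05-03", "green"),
        ],
    )
    con.executemany(
        "INSERT INTO products VALUES (?, ?)",
        [("P-100", "Pumps"), ("P-200", "Valves"), ("P-300", "Valves")],
    )
    return con

def blue_units_by_family(con):
    """Filter on color, join to the product table, and count per family."""
    query = """
        SELECT p.family, COUNT(*) AS n_units
        FROM units AS u
        JOIN products AS p ON p.product_id = u.product_id
        WHERE u.color = 'blue'
        GROUP BY p.family
        ORDER BY p.family
    """
    return con.execute(query).fetchall()

con = build_demo_db()
summary = blue_units_by_family(con)
print(summary)
```

The point is not the specific query but that, once the data is in this shape, a question like "how many blue units did we complete per product family?" takes a handful of lines.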
Multiple Data Sources
The integrated system that would provide this is still in the future, and may stay there, for non-technical reasons. To date, it has not been possible for any software supplier to develop a single system with modules for all manufacturing activities from engineering change control to maintenance and quality that could outperform specialized systems for each function.
There is no technical obstacle, but the human dynamics of the software industry have kept such systems out of existence. The dominant providers of ERP products all started by being successful at one function — like multi-currency accounting or human resource management — and expanded from there into domains in which they neither had expertise nor the ability to recruit the best experts, and their specialized modules are generally not competitive with stand-alone systems developed by domain experts.
Making Multiple Systems Play Together
Short of having a single, all-in-one system, you might configure different systems to play together well. This would require them to have the same names for the same objects in all systems, consistent groupings and consistent relationships for products, processes, operations, equipment, critical dimensions, and people. The systems could then collaborate and feed usable extracts to analysts. The development of such a common information model, however, is not usually high on a manufacturing manager’s to-do list.
The prevailing reality is different departments using a multitude of legacy systems, and supplementing them with individual spreadsheets. The same product goes by different names in Engineering, Production, Marketing, and Accounting. Engineering groups products by technical similarity; Production, by volume class; Marketing, by segment; Accounting, by business unit. Not only do they use multiple names for the same object, but they also use supposedly unique names for different objects.
The names are “smart” numbers, a legacy of the paper-and-pencil age, where, for example, you know that the product is blue because the 5th character of the name is “1,” and green would be “2.” In addition, the most valuable information, like a technician’s diagnosis of a machine failure, is often only available in free-text comments. And then there are missing values. In addition to the problems with the systems officially supported by the IT department, the individual spreadsheets contain tables with missing rows due to incomplete copy-and-paste operations, and errors in formulas.
The most common management response is to declare defeat. “We’ll fix this when we implement a new system in two years,” they say. Or they give up on this plant and promise to do better in the next one. Not only does giving up fail to provide answers to today’s questions but it also fails to prepare the organization to specify, acquire and implement new systems in the future.
Continuous Improvement of Information Systems
Just as continuous improvement is necessary for existing production line layouts, workstation design, or logistics to learn how to design new ones, it is necessary with information systems, and this translates to an organized, sustained effort to make the existing systems useful in spite of all their flaws and low data quality.
Data Sharing Formats
The query tools of relational databases are the workhorses of data wrangling, but they are not sufficient, as data does not always come in tables but sometimes in lists of name-value pairs in a variety of formats like JSON or XML that you must first parse and cross-tabulate. You also need more powerful tools to split “smart” part numbers into their components, identify the meaning of each component, and translate values into plain English. And you need even more sophisticated text mining tools to convert free-text comments into formal descriptions of events by category and key parameters.
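The first two of these steps can be sketched in a few lines of Python. This is an illustration under assumptions, not a general tool: the only coding rule taken from the text is that the 5th character of the name is "1" for blue and "2" for green, and the part numbers, field names, and JSON records are made up.

```python
import json

# Hypothetical coding scheme: only the color rule (5th character, "1" = blue,
# "2" = green) comes from the text; the part numbers themselves are invented.
COLOR_CODES = {"1": "blue", "2": "green"}

def decode_part_number(part_number):
    """Translate the color component of a "smart" part number to plain English."""
    if len(part_number) < 5:
        return {"part_number": part_number, "color": None}
    code = part_number[4]  # 5th character, zero-based index 4
    # Unknown codes map to None so they can be reviewed rather than guessed.
    return {"part_number": part_number, "color": COLOR_CODES.get(code)}

def records_to_table(json_text):
    """Turn a JSON list of name-value records into a header and rows."""
    records = json.loads(json_text)
    # Collect every name that appears, keeping first-seen order.
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)
    # Missing names become None instead of shifting the columns.
    rows = [[record.get(name) for name in columns] for record in records]
    return columns, rows

raw = '''[
    {"part": "AB0010", "qty": 12},
    {"part": "AB0025", "qty": 7, "note": "rework"},
    {"part": "AB0097"}
]'''

columns, rows = records_to_table(raw)
decoded = [decode_part_number(row[columns.index("part")]) for row in rows]
for item in decoded:
    print(item)
```

The third part number carries a code the lookup table does not know. Returning None rather than a guess keeps such cases visible, so they can be counted among the data you could not recover.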
Data Warehouses and Data Lakes
Data wrangling doesn’t work perfectly. You may be able to recover only 90% or 95% of your data, but then you not only have a clean dataset but also a set of wrangling tools that can then be incrementally applied to new data and enrich this dataset, which raises the question of where to keep it. A common approach is to use a special kind of database called a data warehouse, into which you load daily extracts from all the legacy systems after they have been cleaned and properly formatted. They can then be conveniently retrieved for analysis.
You use a small part of the data warehouse but you don’t know which one ahead of time. As a result, most of the data that is prepared and stored in the warehouse is never used. This has motivated companies with very large data sets to come up with the data lake. You throw into the lake data objects from multiple systems in their original formats. Then you prepare them for analysis if and when you need them. Whether a data warehouse or data lake is preferable in an organization is a question of size. With small datasets, the penalty for preparing all data is small when weighed against the convenience of having it ready to use.
Analyzing the data
With clean data, you are, finally, at the statistician’s starting point. The first step is always to explore the data with simple summaries. These include plots of one variable or two at a time. This is often sufficient to answer many questions. Being a good data scientist is about making the data talk, not about using a particular set of tools.
Tools, and How to Assess Them
Data science training leaves you with a box full of tools. You don’t necessarily know what to do with them. They bear names that are not self-explanatory like k-means clustering, bagging, the kernel trick, random forests, and many others. They were developed to solve problems but, to you, they are cures in search of a disease and answer questions you don’t have. The topical literature fails to answer the three questions John Seddon recommended asking about any tool:
- Who invented it?
- What problem was he or she trying to solve?
- Do I have this problem?
In data science, the vintage of a tool matters because its use requires information technology. The tools of the 1920s rely on assumptions about probability distributions to simplify calculations. The ones from the 1990s and later require fewer assumptions and involve simulations.
Example: Logistic Regression
You find out, for example, that logistic regression has nothing to do with moving goods. David Cox invented it in 1958 to predict a categorical outcome from a linear combination of predictors. In manufacturing, it will tell you how relevant in-process variables and attributes are to a finished unit’s quality. If they are not relevant, you may stop collecting them. If they are relevant, you can modify the final test process to leverage the information they provide. Logistic regression can also be used to improve binning operations. It’s from 1958. Using it on a dataset with 20,000 points and 15 predictors is unlikely to overtax a 2018 laptop.
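As a rough illustration of what "predict a categorical outcome from a linear combination of predictors" means, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example: the data is synthetic, with one in-process variable that drives a pass/fail outcome and one that is pure noise, and the fit uses simple gradient descent on the log-loss rather than the routines a statistics package would provide.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(bias, weights, x):
    """Probability of a passing unit given its in-process measurements."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

def fit_logistic(X, y, learning_rate=0.5, n_iter=500):
    """Fit intercept and weights by plain gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    bias, weights = 0.0, [0.0] * k
    for _ in range(n_iter):
        grad_b, grad_w = 0.0, [0.0] * k
        for xi, yi in zip(X, y):
            error = predict(bias, weights, xi) - yi
            grad_b += error
            for j in range(k):
                grad_w[j] += error * xi[j]
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return bias, weights

# Synthetic data: the first variable drives the pass/fail outcome,
# the second is pure noise. Both are invented for this illustration.
rng = random.Random(0)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
y = [1 if rng.random() < sigmoid(2.0 * x[0]) else 0 for x in X]

bias, weights = fit_logistic(X, y)
predicted = [1 if predict(bias, weights, x) >= 0.5 else 0 for x in X]
accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
print("weights:", [round(w, 2) for w in weights], "accuracy:", round(accuracy, 2))
```

A fitted weight near zero for the second variable is the kind of evidence that would suggest you may stop collecting it. A real analysis would also look at the uncertainty of each weight before drawing that conclusion.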
Most likely, you will have no use for many of the tools in the published data science toolboxes. You will also have problems none of them addresses. Whether it is about demand or technical product characteristics, manufacturing data comes in the form of time series. There are many tools for time series that are off the data science lists.
Once you have established that a tool may be useful, you need to learn how to use it. You don’t need to plough through the underlying math. It can remain a black box to you. You still need to know how to feed it data, adjust settings, and interpret the output. By itself, this is not a trivial investment in time and effort, and you need to make it selectively.
Presenting results
The presentation of results to stakeholders who are not data scientists is past the statistician’s endpoint. The results are moot unless communicated to decision-makers in a clear and compelling fashion. Statistics courses do not teach the art of generating reports, slide sets, infographics or performance boards. It is often entrusted either to engineers who are poor communicators or to graphic artists who do not understand the technical content.
The Dying Art of Business Reports
In business, the report, with a narrative in complete sentences and annotated charts, is a dying art. Slide sets with bullet points have replaced it. The bullet points are not sentences and the graphics are 3D pie charts and stacked-bar charts. When professionals do produce reports, they fit them on a single A3 or 11×17 page. Even the capstone project at the end of the Johns Hopkins University series of online courses on data science is documented by a slide set rather than a report.
This works for many activities but data science is not one of them. With slides and A3s alone, you can gloss over gaps in logic that report writing would expose and prompt authors to fill them. Slides and A3s are useful as visual aids for oral presentations and as summaries. But they are a supplement to a statement of results in layman’s terms and with all appropriate nuances.
Executives are “too busy” to read a report only if the authors haven’t written it for busy readers. Executives can read a one-page summary and spot-check the research within the report. If it is well designed, the executives usually don’t need to read it cover to cover.
The communication of data science is heavily graphic. Rather than limit themselves to a small set of old, standard charts, engineers should expand their horizons. They should use more types of charts, embed them in infographics, and leverage the insights of Edward Tufte. In addition, when they produce a report in electronic form, they can use animated illustrations. Hans Rosling’s Trendalyzer, for example, shows a scatterplot changing over time. Even a histogram can come with a slider bar to see the effect of changing bin sizes.
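What such a slider does behind the scenes is simply recount the same values with a different bin width. The following sketch, with made-up measurements and widths standing in for slider positions, shows how much the picture can change.

```python
def histogram_counts(values, bin_width):
    """Count values per bin of the given width, starting at the minimum."""
    start = min(values)
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts

# Made-up measurements; each width stands in for one position of the slider.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
for width in (1, 2, 4):
    print(f"bin width {width}: {histogram_counts(values, width)}")
```

With narrow bins, the isolated value at 9 stands out after a run of empty bins; with wide bins, nearly everything collapses into the first bar. Letting the reader move between the two is what the slider buys.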
Reports are vanishing in business but live on in academic papers, with abstracts in place of executive summaries. In many fields, these papers are, in fact, data science reports, and are not without challenges. First, academia’s review process does not always work. For example, in 2013, students exposed Growth in a Time of Debt, an influential 2010 paper by Harvard economists, as containing calculation errors.
Second, people who cite an academic paper often amplify its conclusions beyond recognition. This is how a lighting study conducted on just 5 women assembling relays at Western Electric’s Hawthorne plant in the late 1920s spawned the belief in a Hawthorne effect that makes all the workers of the world more productive when management pays attention to them.
Data scientists cannot prevent journalists, politicians, or even work colleagues from oversimplifying and distorting their work. It behooves them to speak up when it happens. They are responsible for the quality of the work, including not only sound analytics but effective communication as well.
Most engineers and managers in manufacturing use Excel and PowerPoint. Six Sigma black belts add Minitab. This toolset doesn’t cut it for data science. There are plenty of options for all stages, from data wrangling to analysis and presentation. Some tools are free, powerful, and reliable, but require a high level of skill. Others are “for everyone” and available for a fee. Regardless of the choice of tools, the main investment is in learning to apply them, just as with production machinery.