Mingled last night with 209 internet operations “ninjas” at the meetup of the Large Scale Production Engineering (#lspe) group, on Actionable Metrics, hosted by Yahoo! in Sunnyvale, CA. Speakers from collectibles marketplace site Etsy, web performance testing service SOASTA, and video streaming service Netflix described how they use metrics in their work.
Both in style and content, the presentations were radically different from what I have heard in manufacturing on that subject. The speakers went fast, as if they were in a great hurry to give us as much information as possible prior to returning to work. They were also extraordinarily open about the tools they used and how they used them.
The first feature of their work that struck me was the simplicity of their dashboards. Almost all the charts they monitor are simple line plots of time series, free of the jumble of 3D pie charts, stacked bar charts, and other complicated displays commonly found on manufacturing dashboards. The most elaborate display showed a time series of, for example, login response times, with a dynamically adjusted confidence band based on a statistical model to help operations engineers tell significant changes from fluctuations. Yet, even with these features, the meaning of the chart was obvious, even to an outsider to their business. It is not difficult, for example, to understand a plot of the evolution of login times as the number of users grows or additional security is added.
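To make the idea of a dynamically adjusted band concrete, here is a minimal sketch. The speakers did not describe their statistical model in detail, so the rolling mean plus or minus `k` standard deviations used here is my assumption, and the function names, the window size, and the sample login times are all made up for illustration.

```python
from statistics import mean, stdev

def confidence_band(series, window=5, k=3.0):
    """Return a (lower, upper) pair for each point from index `window` on,
    computed from the mean and standard deviation of the preceding window."""
    bands = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        m, s = mean(recent), stdev(recent)
        bands.append((m - k * s, m + k * s))
    return bands

def flag_outliers(series, window=5, k=3.0):
    """Indices of points that fall outside the band built from prior points."""
    bands = confidence_band(series, window, k)
    return [i for i, (lo, hi) in zip(range(window, len(series)), bands)
            if not lo <= series[i] <= hi]

# Hypothetical login response times in milliseconds, with one spike.
login_ms = [120, 118, 123, 121, 119, 122, 120, 310, 121, 119]
print(flag_outliers(login_ms))  # prints [7]
```

Only the spike at index 7 is flagged. The ordinary wobble between 118 and 123 ms stays inside the band, which is the kind of distinction between significant changes and fluctuations that the chart is meant to support.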
As engineers tweak and enhance the functions provided on their servers every day, they monitor metrics to see whether each change does any good, and take immediate action if it doesn’t. The time within which they can and must react is measured in seconds, not hours or days. Their metrics fall into the following categories:
- Customer experience/business. This includes the user experience and its translation into business activity, like the number of page views and the conversion ratio of page views to orders. For a subscription-based service like Netflix, it might include the number of times subscribers visit the site without streaming anything, suggesting that they didn’t find what they were looking for.
- Infrastructure. This covers the behavior of the servers and the networks through which user inputs are passed to the applications and outputs returned. This has to do with processor and memory utilization, and with the availability of these resources in the face of very large and varying numbers of user interactions.
- Application. These metrics rate the ability of the application software to process the user data once it has them and until it sends a response. This includes the speed and quality of concurrent searches or commercial transactions, or the protection of user data.
Customer experience translates as is to a manufacturing context. Infrastructure, on the other hand, corresponds to Logistics, and Application to Production, all with different time scales. While one of the speakers described response time as “one metric to rule them all,” none of the organizations presenting imposes a standard set of metrics on its engineers. Instead, they provide the engineers with tools to capture the data, compute the metrics, display them on charts, and generate alerts, but it is the engineers’ choice and responsibility to define the metrics.
Response time is to their world what order fulfillment lead time is in manufacturing, but its importance is much greater. There are sectors in manufacturing where short order fulfillment lead times are a competitive advantage with customers, but also many, particularly with big ticket items, where other factors trump short lead times. If you are a farmer buying a tractor, for example, you will take one with the features you want over one you can have sooner that doesn’t have them. The pursuit of short production lead times has to do with other considerations, such as reducing inventory, detecting quality problems faster, or reducing the obsolescence cost of engineering changes. When providing services on the web, on the other hand, any slowdown in response causes customers to balk, and, internally, any increase in transaction processing time can cause servers to saturate and customer response times to explode.
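The saturation effect can be illustrated with a textbook queueing formula. This is my own illustration, not something the speakers presented: it assumes a single server with random arrivals and service times (the M/M/1 model), for which the mean response time is the service time divided by one minus the utilization. The function name and the numbers are hypothetical.

```python
def mean_response_time(arrival_rate, service_time):
    """Mean response time of an M/M/1 queue, in the units of service_time.

    arrival_rate: requests per second reaching the server
    service_time: mean seconds of processing per request
    """
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        # Requests arrive faster than they can be served: the queue grows without bound.
        return float("inf")
    return service_time / (1.0 - utilization)

# Hypothetical server receiving 90 requests per second.
for service_ms in (10, 11):
    w = mean_response_time(90, service_ms / 1000)
    print(f"{service_ms} ms of processing -> {w:.2f} s mean response time")
```

Under these assumptions, raising the processing time from 10 ms to 11 ms pushes utilization from 90% to 99%, and the mean response time goes from roughly 0.1 s to roughly 1.1 s. A 10% slowdown inside the server becomes an elevenfold slowdown for the customer.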
For these engineers, capturing one metric is a matter of adding one line of code in their software, and they use open-source tools to generate plots and dashboards. It is not difficult to do. The hard part is identifying the right metrics. The Netflix speaker was quite aware of this. After someone had told him “No matter what you measure, it’s useful,” he charted the evolution over one year of the proportion of user IP addresses ending with an even number.
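As for what “one line of code” might look like, here is a sketch. I am assuming a collector that accepts StatsD-style plain-text datagrams over UDP, in the form `name:value|ms` for a timing. StatsD is an open-source tool that Etsy published, but whether it is the one the speakers had in mind is my guess, and the helper names, host, port, and metric name below are hypothetical.

```python
import socket

def format_timing(name, millis):
    """Build a StatsD-style timing datagram, e.g. 'login.response_time:320|ms'."""
    return f"{name}:{millis}|ms"

def record_timing(name, millis, host="127.0.0.1", port=8125):
    """Send one timing sample over UDP, fire-and-forget.

    The host and port are assumptions; 8125 is the customary StatsD port.
    """
    payload = format_timing(name, millis).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))

print(format_timing("login.response_time", 320))
```

With helpers like these in place, instrumenting a login handler comes down to the single line `record_timing("login.response_time", elapsed_ms)`. Because UDP does not wait for an acknowledgment, the application is not slowed down if the collector is busy or down, which suits code that runs on every request.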