Why ML data DevOps is needed
Deploying machine learning (ML) to production is not an easy task; in fact, it is an order of magnitude harder than deploying conventional software.
As a result, most ML projects never see the light of day, or production, because most organizations give up on using ML to improve their products and serve their customers.
As far as we can see, the fundamental obstacle standing in the way of most teams building and deploying ML in production at the expected scale is that we still haven't managed to bring DevOps practices to machine learning.
MLOps solutions cover part of the process of building and deploying ML models, but they lack support for one of the hardest parts of ML: the data side.
In this article, we discuss why the industry needs DevOps solutions for ML data, and how the unique challenges of ML data hinder efforts to put ML into practice and deploy it to production.
The article describes a vacuum in the current ML infrastructure ecosystem and proposes to fill it with Tecton, a centralized data platform for machine learning. Follow the link from my co-founder, Mike, for more details on the launch of Tecton.
Tecton was created by a group of engineers who built in-house ML platforms at companies such as Uber, Google, Facebook, Twitter, Airbnb, AdRoll, and Quora.
These companies’ significant investments in ML have allowed them to develop processes and tools for the broad application of ML to their organizations and products. The lessons presented in this article, as well as the Tecton platform itself, are largely based on our team’s experience in deploying ML to production over the past few years.
Remember the time when software release was long and painful?
The process of developing and deploying software twenty years ago and the process of developing ML applications today have a lot in common: feedback loops were incredibly long, and by the time you got to release, your original requirements and design were already outdated. Then, in the late 2000s, a set of software development best practices emerged in the form of DevOps, providing methods for managing the development lifecycle and enabling continuous, rapid improvement.
The DevOps approach lets engineers work in a well-defined, shared code base. Once a staged change is ready for deployment, the engineer checks it in through the version control system. A continuous integration and delivery (CI/CD) process picks up the latest changes, runs unit tests, builds documentation, runs integration tests, and finally releases the changes to production in a controlled manner or prepares the release for distribution.
Fig. 1: Typical DevOps Process
Key benefits of DevOps:
- Engineers own their code from start to finish. They are empowered and fully responsible for every line of code in production. This sense of ownership generally improves code quality, as well as the availability and reliability of the software.
- Teams iterate quickly instead of being held back by the months-long cycles of the waterfall model. They can test new features with real users almost immediately.
- Performance and reliability issues are quickly identified and analyzed. If performance metrics drop immediately after a deployment, an automatic rollback is triggered, and the code changes in that deployment are very likely the cause of the drop (a minimal sketch of such a check follows this list).
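As a minimal sketch of that rollback check (the metric source and rollback hook are hypothetical stand-ins, not any particular CI/CD system's API):

```python
# A hedged sketch of the automatic-rollback idea described above: after a
# deploy, compare a health metric against the pre-deploy baseline and roll
# back on a significant regression. get_error_rate and rollback are
# hypothetical hooks into your monitoring and deployment systems.

def check_deployment(baseline_error_rate: float,
                     get_error_rate, rollback,
                     max_regression: float = 0.10) -> bool:
    """Roll back if the error rate regressed more than 10% (relative)."""
    current = get_error_rate()
    if current > baseline_error_rate * (1 + max_regression):
        rollback()
        return False
    return True

# Example wiring with stubbed hooks:
ok = check_deployment(
    baseline_error_rate=0.02,
    get_error_rate=lambda: 0.05,           # post-deploy metric (stub)
    rollback=lambda: print("rolling back"),
)
print(ok)  # False: 0.05 > 0.02 * 1.10, so the deployment is rolled back
```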
These days, this integrated approach is the baseline for many development teams.
… Meanwhile, ML deployment is still long and painful
Today, the typical path of an ML application to production looks like this:
- Discovering and accessing raw data: At most companies, data scientists spend up to 80% of their time finding the input data for their problem. This often requires cross-functional coordination with data engineers and compliance teams.
- Feature engineering and model training: Once data scientists have access to the raw data, they typically spend weeks cleansing it and turning it into features and labels. They then train models, evaluate the results, and repeat the whole cycle several times.
- Productionizing data pipelines: Next, data scientists turn to engineers to productionize their data pipelines. This usually means handing the feature transformation code to another team to be reimplemented efficiently for production (more on this below).
- Deploying and integrating the model: This step usually involves integrating with the service that consumes the model's predictions, for example an online retailer's mobile app that uses a recommendation model to suggest products.
- Setting up monitoring: Once again, engineers' help is needed to verify that the ML model and its data pipelines keep working correctly.
As a result, ML teams face the same problems that software engineers faced twenty years ago:
- Data scientists do not own the lifecycle of their models and features end to end. They have to rely on others to deploy their changes and keep them running in production.
- Data scientists cannot iterate quickly. The lack of lifecycle ownership makes fast iteration impossible, and iteration speed is critical for data scientists. Meanwhile, the teams they depend on have plenty of their own tasks and priorities, which leads to the delays and uncertainty mentioned above; these pile up and kill productivity.
Fig. 3: Both iteration speed and iteration frequency put significant pressure on the curve of expected product improvement
- Performance and reliability issues are rarely identified. When engineers reimplement data scientists' work, it is easy to overlook important details. It is even easier to miss the moment when a model in production stops producing correct predictions, either because a data pipeline broke or because the world changed and the model needs retraining.
DevOps for ML models is in full swing. But there is almost no DevOps for ML data
MLOps platforms such as SageMaker and Kubeflow are moving in the right direction, helping companies simplify putting ML into production by bringing DevOps principles and tools to ML. Getting started requires a fair upfront investment, but once properly integrated, these platforms empower data scientists to train, manage, and release ML models.
Unfortunately, most MLOps tools focus on the workflow around the model itself (training, deployment, management), which leaves today's MLOps offerings with significant gaps. ML applications are defined by code, models, and data. Their success depends on the ability to create high-quality ML data and deliver it to production quickly and consistently; otherwise, it's just another case of garbage in, garbage out. The following diagram, adapted from Google's work on technical debt in ML, illustrates the "data-centric" and "model-centric" elements of ML systems. Today's MLOps platforms help with many of the model-centric elements, but with only a few of the data-centric elements, or none at all:
Fig. 4: Model-centric and data-centric elements of ML systems. Today, the model-centric elements are largely covered by MLOps systems.
The next sections walk through some of the toughest challenges we have faced in putting ML into production. They are not meant to be comprehensive; rather, they illustrate the difficulties of managing the ML data lifecycle (features and labels):
- Accessing the right raw data
- Building features and labels from raw data
- Combining features into training data
- Computing and serving features in production
- Monitoring features in production
A quick reminder before we dive in: an ML feature is data that serves as an input signal to a model making a prediction. For example, suppose a food delivery service wants to show the expected delivery time in its app. To do that, it needs to predict the cooking duration of a specific dish, at a specific restaurant, at a specific time. A convenient signal for such a forecast, and a proxy for how busy the restaurant is, is the trailing count of incoming orders over the last 30 minutes. The feature is computed from the incoming stream of order events:
Fig. 5: Raw data is turned into feature values by a feature transformation
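To make this concrete, here is a minimal sketch of that transformation in Python with pandas. The event schema and values are illustrative, and this is not any particular platform's API:

```python
# A minimal sketch of the feature transformation above: count the orders
# each restaurant received in the trailing 30 minutes.
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical raw order events (restaurant_id, order timestamp).
orders = pd.DataFrame({
    "restaurant_id": [1, 1, 2, 1, 2],
    "created_at": pd.to_datetime([
        "2024-05-01 12:01", "2024-05-01 12:10", "2024-05-01 12:15",
        "2024-05-01 12:20", "2024-05-01 12:50",
    ]),
})

def trailing_order_count(events: pd.DataFrame, at: datetime,
                         window: timedelta = timedelta(minutes=30)) -> pd.Series:
    """Count orders per restaurant in the half-open window (at - window, at]."""
    in_window = events[(events["created_at"] > at - window)
                       & (events["created_at"] <= at)]
    return in_window.groupby("restaurant_id").size()

# Feature values as of 12:30: restaurant 1 -> 3 orders, restaurant 2 -> 1.
print(trailing_order_count(orders, datetime(2024, 5, 1, 12, 30)))
```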
Data Challenge #1: Accessing the Right Raw Data
To build any feature or model, a data scientist first needs to find the right data source and get access to it. There are several obstacles along the way:
- Data discovery: Data scientists need to know where the raw data lives. Data cataloging systems (such as Lyft's Amundsen) are an excellent solution, but they are not yet widely adopted. Often the required data simply does not exist and must first be created or cataloged.
- Access approval: Chasing approvals between teams to get the data access permissions needed to solve a problem is often an unavoidable part of the data science journey.
- Access to raw data: Data scientists may extract raw data in a one-off dump that goes stale as soon as it lands on their laptop. Or they may fight through networking and authentication hurdles, only to then have to pull raw data from sources that each speak their own query language.
Data Challenge #2: Building Features from Raw Data
Raw data can come from many sources, each with important properties that affect the kinds of features that can be extracted from it. These properties include the transformation types the source supports, the freshness of its data, and the amount of history available:
Fig. 6: Different data sources support different types of data transformation and provide access to different amounts of data at different levels of freshness
These properties matter because the type of data source determines the types of features a data scientist can build from it:
- Data warehouses (such as Snowflake and Redshift) store large amounts of data with low freshness (hours or days old). They can be a gold mine, but they are best suited for large-scale aggregations with low freshness requirements, such as the total number of transactions per user.
- Transactional data sources (such as MongoDB or MySQL) generally store smaller amounts of fresher data and are not designed for large analytical transformations. They are best suited for small-scale aggregations over short time windows, such as the number of orders a user created in the last 24 hours.
- Data streams (such as Kafka) carry high-velocity events and deliver them in near real time (on the order of milliseconds). Standard setups retain one to seven days of history. They are well suited to aggregations over short windows and simple transformations with high freshness requirements, such as the trailing 30-minute order count feature described above.
- Prediction request data is raw event data that arrives in real time, just before an ML prediction is made, such as the query a user just typed into the search box. While such data is limited, it is often as fresh as it gets and carries a highly predictive signal. It arrives with the prediction request and can be used in real-time calculations, such as estimating the similarity between a user's search query and the documents in a search index.
Looking ahead: combining data from sources with complementary characteristics is what lets you build really good features. It also requires implementing and managing more advanced feature transformations; a minimal sketch of the idea follows.
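As a hedged illustration (all names are hypothetical stubs, not a specific product's API), here is a sketch of assembling one feature vector from three complementary sources:

```python
# A hedged sketch of building one feature vector from complementary
# sources: a large historical aggregate from the warehouse, a fresh
# count from the stream, and a request-time signal computed on the fly.
# The lookup functions are stubs standing in for real source connectors.

def warehouse_lookup(user_id: int) -> int:
    """Stub: large-scale aggregate, refreshed daily by a batch job."""
    return {42: 310}.get(user_id, 0)

def stream_lookup(user_id: int) -> int:
    """Stub: trailing 24h order count, refreshed every few minutes."""
    return {42: 3}.get(user_id, 0)

def build_feature_vector(user_id: int, query: str) -> dict:
    return {
        "lifetime_order_count": warehouse_lookup(user_id),  # warehouse
        "orders_last_24h": stream_lookup(user_id),          # stream
        "query_length": len(query),  # prediction request data
    }

print(build_feature_vector(42, "vegan ramen near me"))
```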
Data Challenge #3: Combining Features into Training Data
Building training or test datasets requires joining the data of the relevant features. Along the way, many details must be tracked that can critically affect the model. Two of the most insidious are:
- Data leakage: Data scientists need to ensure their model is trained on the right information and that no unwanted information leaks into the training data. Leaks can come from the test set, from ground truth data, from the future, or from information that violates important preprocessing steps (for example, anonymization).
- Time travel: Data from the future is a particularly problematic kind of leakage. Preventing it requires computing each feature value in the training data relative to a specific point in the past (that is, "traveling in time" to that point). Conventional data systems are not built for time travel, which forces data scientists either to accept data leakage in their models or to pile up a jungle of workarounds to make the system behave correctly. One standard remedy, sketched below, is a point-in-time join.
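Here is a minimal sketch of a point-in-time join using pandas.merge_asof; the frames, timestamps, and the orders_last_24h feature are illustrative:

```python
# Each training label is joined to the latest feature value known *at or
# before* the label's timestamp, so nothing from the future leaks in.
import pandas as pd

labels = pd.DataFrame({
    "event_time": pd.to_datetime(["2024-05-01 12:30", "2024-05-01 13:00"]),
    "user_id": [42, 42],
    "label": [1, 0],
})

feature_log = pd.DataFrame({
    "event_time": pd.to_datetime(
        ["2024-05-01 12:00", "2024-05-01 12:45", "2024-05-01 13:30"]),
    "user_id": [42, 42, 42],
    "orders_last_24h": [3, 4, 9],
})

# Both frames must be sorted by the time column for merge_asof.
training_data = pd.merge_asof(
    labels.sort_values("event_time"),
    feature_log.sort_values("event_time"),
    on="event_time", by="user_id", direction="backward")

# The 12:30 row gets the 12:00 value (3); the 13:00 row gets 12:45 (4).
# The 13:30 value (9) never leaks into either row.
print(training_data)
```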
Data Challenge #4: Computing and Serving Features in Production
Once a model is serving live traffic, it needs a continuous supply of fresh feature data to generate accurate, up-to-date predictions, often at scale and with minimal latency.
How should this data get to the model? Directly from the source? Fetching data from a warehouse can take minutes, hours, or even days, which is far too long for real-time inference and a non-starter in most cases.
In such cases, feature computation must be decoupled from feature consumption: ETL pipelines precompute features and load them into a production data store optimized for serving. These pipelines add complexity and new maintenance costs:
- Finding the right tradeoff between freshness and cost: Decoupling feature computation from consumption costs freshness. Feature pipelines can be run more frequently, at greater expense, to produce fresher data. The right tradeoff varies by feature and use case: a trailing 30-minute order count feature needs to be updated far more often than a similar feature with a two-week window.
- Integrating feature pipelines: Powerful features draw on data from several different sources, which raises coordination problems harder than the single-source cases discussed above. Orchestrating these pipelines and merging their results into a single feature vector takes serious data engineering work.
- Training/serving skew: Discrepancies between the feature values used for training and those served in production lead to training/serving skew. Skew is hard to detect, and its presence can invalidate model predictions: a model can behave erratically when it predicts on data generated differently from the data it was trained on. Skew deserves its own series of articles, but two typical risks are worth highlighting:
- Logical discrepancies: When training and serving pipelines are implemented separately (as is common practice), it is easy to end up with differences in transformation logic, and even seemingly insignificant discrepancies can have huge negative consequences. Are nulls handled differently? Is decimal precision consistent? The best practice for minimizing this kind of skew is to reuse as much transformation code as possible between training and serving (a sketch of this practice follows below). The significant extra effort this takes pays for itself by saving countless hours of painful debugging later.
Fig. 7: To avoid training/serving skew, a single feature implementation should be used for both training and serving
- Temporal skew: Features are not precomputed continuously, for a number of reasons (usually cost). Recomputing a two-hour trailing order count for every user every second, for example, would add too little signal to justify the expense. In practice, by the time a feature is served, its value may be minutes, hours, or even days stale. Failing to reflect that staleness in the training data is a typical mistake, and the result is a model trained on data fresher than anything it will ever see in production.
Fig. 8: The graph shows the trailing order count: (1) the feature values served for prediction, updated every 10 minutes; (2) training data that incorrectly tracks the true values much more closely than the feature served in production
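Here is a hedged sketch of the shared-implementation practice from the skew discussion above: one transformation function feeds the training path, the batch precompute job, and the online serving path, so their logic cannot drift apart. The in-memory "online store" is an illustrative stand-in for a real key-value store:

```python
from datetime import datetime, timedelta
from typing import Dict, List

def orders_in_window(order_times: List[datetime], at: datetime,
                     window: timedelta = timedelta(minutes=30)) -> int:
    """The single shared feature definition: trailing order count."""
    return sum(1 for t in order_times if at - window < t <= at)

online_store: Dict[int, int] = {}  # stand-in for a real key-value store

def batch_precompute(orders_by_restaurant: Dict[int, List[datetime]],
                     at: datetime) -> None:
    """Offline path: an ETL job precomputes and publishes feature values."""
    for restaurant_id, times in orders_by_restaurant.items():
        online_store[restaurant_id] = orders_in_window(times, at)

def serve_feature(restaurant_id: int) -> int:
    """Online path: read the precomputed value at prediction time."""
    return online_store.get(restaurant_id, 0)

def training_feature(times: List[datetime], at: datetime) -> int:
    """Training path: calls the exact same transformation, at a past time."""
    return orders_in_window(times, at)

# Demo: precompute as of 12:30, then serve.
now = datetime(2024, 5, 1, 12, 30)
batch_precompute({1: [now - timedelta(minutes=m) for m in (5, 20, 50)]}, now)
print(serve_feature(1))  # 2 orders in the last 30 minutes (5 and 20 min ago)
```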
Data Challenge #5: Monitoring Features in Production
Despite every effort to navigate the problems above, something will break. And when an ML system breaks, it is almost always a "data integrity" problem. The term covers many different failure modes, each of which needs monitoring. Examples of data integrity problems:
- Broken data pipelines: An upstream data source can suffer an outage and suddenly start sending incorrect data, send it late, or stop sending it altogether. Outages have far-reaching knock-on effects, polluting downstream features and models, and they are very easy to miss. Even when an outage is detected, fixing it by backfilling the data is expensive and often infeasible.
- Feature drift: Some of a model's features may begin to drift, that is, lose their predictive power for the task at hand. The cause can be a bug, or perfectly normal behavior reflecting a change in the world (for example, user behavior can shift dramatically after a major news event), which in turn calls for retraining the model. A sketch of two simple monitoring checks follows this list.
- Opaque outages affecting subpopulations: Detecting an outage that breaks a feature across the board is trivial. Detecting one that affects only one or a few subpopulations (for example, only users in Germany) is much harder.
- Unclear ownership of data quality: When a feature consumes raw data from several upstream sources, who is ultimately responsible for its quality? The data scientist who built the feature? The data scientist who trained the model? The owner of the data pipeline? The engineer who integrated the model into production? When ownership is unclear, problems stay unfixed for far too long.
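As a hedged illustration of two of the checks above, here is a sketch of a naive drift alert and a per-subpopulation null-rate alert. The thresholds are illustrative; production systems would use more robust statistics (for example, a population stability index or KS test):

```python
from typing import Dict, List, Optional

def drift_alert(train_mean: float, live_values: List[float],
                tolerance: float = 0.25) -> bool:
    """Alert if the live mean moved more than `tolerance` (relative)."""
    live_mean = sum(live_values) / len(live_values)
    return abs(live_mean - train_mean) > tolerance * abs(train_mean)

def subgroup_null_alert(values_by_group: Dict[str, List[Optional[float]]],
                        max_null_rate: float = 0.1) -> List[str]:
    """Return the groups whose null rate exceeds the threshold."""
    alerts = []
    for group, values in values_by_group.items():
        null_rate = sum(v is None for v in values) / len(values)
        if null_rate > max_null_rate:
            alerts.append(group)
    return alerts

print(drift_alert(train_mean=4.0, live_values=[5.5, 6.0, 5.8]))  # True

# The feature looks healthy overall but is broken for users in Germany.
print(subgroup_null_alert({
    "US": [1.0, 2.0, 3.0, 1.5],
    "DE": [None, None, 2.0, None],
}))  # ['DE']
```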
Challenges like these form an almost insurmountable obstacle course for even the most advanced data science and ML engineering teams. Solving them takes more than the status quo at most companies, where bespoke one-off solutions address only a subset of these problems.
Introducing Tecton: the data platform for machine learning
At Tecton, we are building a data platform for machine learning to help with the most common and most challenging of these data problems.
At a high level, the Tecton platform includes:
- Feature pipelines that turn your raw data into features and labels
- A feature store that holds historical feature and label data
- A feature server that serves the latest feature values to production
- An SDK for retrieving training data and managing feature pipelines
- A web UI for monitoring and tracking features, labels, and datasets
- A monitoring engine that detects data quality or drift problems and sends alerts
Fig. 9: As a central data platform for ML, Tecton serves features to development and production environments
The platform lets ML teams bring DevOps practices to ML data:
- Plan: Tecton features are stored in a central repository, allowing data scientists to share, discover, and build on each other's work.
- Code: Tecton lets users define simple, flexible feature transformation pipelines.
- Build: Tecton compiles feature definitions into performant data processing jobs.
- Test: Tecton supports unit and integration testing of features.
- Release: Tecton integrates tightly with git. All feature definitions are version controlled and easy to reproduce.
- Deploy: Tecton deploys and orchestrates data processing jobs on processing engines (such as Spark). These pipelines continuously deliver feature data to the Tecton feature store.
- Operate: Tecton's feature server serves consistent feature values, both to data scientists for training and to production models for prediction.
- Monitor: Tecton watches feature pipelines, inbound and outbound, for drift and data quality problems.
Of course, ML data without ML models won't give you a working ML application. That's why Tecton provides flexible APIs and integrates with existing ML platforms. We started with Databricks, SageMaker, and Kubeflow, and we continue to integrate with complementary components of the ecosystem.