
Why ML data DevOps is needed

Blog: Think Data Analytics Blog


Deploying machine learning (ML) to production is not easy. In fact, it is an order of magnitude more difficult than deploying conventional software.

As a result, most ML projects never see the light of day in production, and most organizations give up on using ML to improve their products and serve their customers.

As far as we can see, the fundamental obstacle that keeps most teams from building and deploying ML in production at the expected scale is that we still haven’t been able to bring DevOps practices to machine learning.

MLOps solutions partly cover the process of building and deploying ML models, but they lack support for one of the hardest parts of ML: the data.

In this article, we discuss why the industry needs DevOps solutions for ML data, and how the unique challenges of ML data hinder efforts to put ML into practice and deploy it to production. 

The article describes the gap in the current ML infrastructure ecosystem and proposes to fill it with Tecton, a centralized data platform for machine learning. Follow the link from my co-founder, Mike, for more details on the launch of Tecton.

Tecton was created by a group of engineers who built in-house ML platforms at companies such as Uber, Google, Facebook, Twitter, Airbnb, AdRoll, and Quora. 

These companies’ significant investments in ML have allowed them to develop processes and tools for the broad application of ML to their organizations and products. The lessons presented in this article, as well as the Tecton platform itself, are largely based on our team’s experience in deploying ML to production over the past few years.

Remember the time when software release was long and painful?

Developing and deploying software twenty years ago has a lot in common with developing ML applications today: feedback loops were incredibly long, and by the time you reached the product release, your original requirements and design were already outdated. Then, in the late 2000s, a set of best practices for software development emerged in the form of DevOps, providing methods for managing the development lifecycle and enabling continuous, rapid improvement.


The DevOps approach allows engineers to work in a well-defined, shared code base. Once a change is ready for deployment, the engineer checks it in to the version control system. The continuous integration and delivery (CI/CD) process takes the most recent changes, runs unit tests, builds documentation, runs integration tests, and ultimately releases the changes to production in a controlled manner or prepares the release for distribution.

Fig. 1: Typical DevOps Process

Key DevOps Conveniences:

These days, many development teams take this integrated approach as a basis.

… In general, ML deployment is still long and painful

As a result, ML teams face the same problems that programmers faced twenty years ago:

DevOps for ML is in full swing. But there is almost no DevOps for ML data

MLOps platforms such as SageMaker and Kubeflow are moving in the right direction by helping companies simplify putting ML into production, and we can see MLOps bringing DevOps principles and tools to ML. They require a fairly substantial upfront investment to get started, but once properly integrated they extend what data scientists can do in training, managing, and releasing ML models.

Unfortunately, most MLOps tools focus on the workflow around the model itself (training, deployment, management), which covers only part of the difficulties of putting ML into production. ML applications are defined by code, models, and data. Their success depends on the ability to create high-quality ML data and deliver it to production quickly and consistently; otherwise it is just another case of garbage in, garbage out. The following diagram, adapted from Google’s work on technical debt in ML, illustrates the “data-centric” and “model-centric” elements of ML systems. These days, MLOps platforms help with many of the model-centric elements, but with few or none of the data-centric ones:

Fig. 4: Model-centric and data-centric elements of ML systems. These days, model-centric elements are largely covered by MLOps systems.

The next sections describe some of the toughest challenges we have faced in simplifying ML in production. They are not meant to be an exhaustive list, but to illustrate the problems that arise in managing the lifecycle of ML data (features and labels).

A quick reminder before we dive in: an ML feature is data that serves as an input signal for a model to make a decision. For example, a food delivery service wants to show the expected delivery time in its app. To do this, it needs to predict how long a specific dish will take to prepare, in a specific restaurant, at a specific time. One useful signal for such a prediction, a proxy for how busy the restaurant is, is the trailing count of incoming orders over the last 30 minutes. The feature is computed from the stream of raw order data:

Fig. 5: A feature transformation turns raw source data into feature values
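The order-count feature described above can be sketched in a few lines of Python. This is only an illustration, not how any particular platform implements it: the function name `trailing_order_count`, the event layout (restaurant ID plus timestamp), and the sample data are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def trailing_order_count(orders, restaurant_id, as_of, window=timedelta(minutes=30)):
    """Count one restaurant's orders in the interval (as_of - window, as_of]."""
    start = as_of - window
    return sum(
        1
        for rid, ts in orders
        if rid == restaurant_id and start < ts <= as_of
    )

# Hypothetical raw order events: (restaurant_id, order timestamp).
orders = [
    ("r1", datetime(2020, 5, 1, 12, 0)),
    ("r1", datetime(2020, 5, 1, 12, 20)),
    ("r1", datetime(2020, 5, 1, 12, 40)),
    ("r2", datetime(2020, 5, 1, 12, 35)),
]

# Feature value for restaurant r1 as seen at 12:45.
feature_value = trailing_order_count(orders, "r1", as_of=datetime(2020, 5, 1, 12, 45))
print(feature_value)  # 2
```

Passing `as_of` explicitly, rather than reading the current clock inside the function, lets the same definition be evaluated for past moments as well as for the present one.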

Data Challenge #1: Gaining Access to the Correct Raw Data

To create any feature or model, a data scientist first needs to find the correct data source and get access to it. There are several obstacles along the way:

Data Challenge #2: Creating Features From Raw Data

Raw data can come from many sources, each with its own important properties that affect the kinds of features that can be extracted from it. These properties include the transformation types the data source supports, the freshness of its data, and the amount of historical data available:

Fig. 6: Different data sources support different types of data transformation and provide access to different amounts of data at different levels of freshness

It is important to consider these properties, since the type of data source determines the types of features that a data scientist can derive from the raw data:

Looking ahead, note that combining data from different sources with complementary characteristics makes it possible to create really good features. This approach requires implementing and managing more advanced feature transformations.

Data Challenge #3: Combining Features into Training Data

Building training or test datasets requires joining the data of the relevant features. In doing so, it is necessary to keep track of many details that can have a critical impact on the model. The two most insidious of these are:
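As an illustration of how easily such details go wrong, consider timing: each training row should contain only the feature value that was already known at the moment of the labeled event, never a later one. The sketch below is a minimal, assumed implementation of such a point-in-time lookup using the standard library; the function names, the integer timestamps, and the sample values are hypothetical.

```python
from bisect import bisect_right

def point_in_time_value(feature_history, event_time):
    """Return the latest feature value recorded at or before event_time, or None."""
    timestamps = [ts for ts, _ in feature_history]
    idx = bisect_right(timestamps, event_time)
    return feature_history[idx - 1][1] if idx > 0 else None

def build_training_rows(label_events, feature_history):
    """Attach to each (event_time, label) the feature value known at that time."""
    return [
        (event_time, point_in_time_value(feature_history, event_time), label)
        for event_time, label in label_events
    ]

# Hypothetical feature log, sorted by time: (minutes since midnight, order count).
order_count_history = [(600, 3), (610, 5), (620, 9)]
# Hypothetical labeled events: (event time, observed preparation time in minutes).
label_events = [(595, 12), (605, 14), (620, 21)]

rows = build_training_rows(label_events, order_count_history)
print(rows)  # [(595, None, 12), (605, 3, 14), (620, 9, 21)]
```

The first event gets `None` because no feature value had been recorded yet. Filling it with the earliest later value would leak information from the future into the training set.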

Data Challenge #4: Computing and Serving Features in Production

Once a model is deployed for real-time inference, it must be continually supplied with fresh feature data to generate accurate and up-to-date predictions, often at scale and with minimal latency.

How should we pass this data to the model? Directly from the source? Reading and transforming data from storage can take minutes, hours, or even days, which is too long for real-time inference and therefore impossible in most cases.

In such cases, the computation of features and the consumption of features must be decoupled. ETL pipelines are needed to precompute features and load them into a production data store optimized for inference. These pipelines create additional complexity and new maintenance costs:

Finding the optimal tradeoff between freshness and cost-effectiveness: Decoupling the computation and consumption of features makes freshness a primary concern. Feature pipelines can often be run more frequently, at increased cost, and as a result produce fresher data. The right tradeoff varies by feature and use case. For example, a feature that aggregates the order count over a thirty-minute window should be refreshed more often than a similar feature with a two-week window.

Integrating feature pipelines: Putting features into production requires obtaining data from several different sources and, as a result, solving problems that are more complex than in the single-source case discussed earlier. Coordinating such pipelines and integrating their results into a single feature vector requires serious data engineering.
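The decoupling described above can be sketched as a small in-memory stand-in for a serving-optimized store: pipelines write precomputed values on their own schedules, and the serving path only reads them and assembles one feature vector. The class name, the feature names, the freshness limits, and the numeric values are all assumptions for illustration, not a real product's API.

```python
class OnlineFeatureStore:
    """In-memory stand-in for a low-latency key-value store read at inference time."""

    def __init__(self):
        self._values = {}

    def write(self, entity_id, feature_name, value, computed_at):
        """Called by a (hypothetical) feature pipeline after each precomputation run."""
        self._values[(entity_id, feature_name)] = (value, computed_at)

    def read_vector(self, entity_id, feature_names, now, max_age_seconds):
        """Assemble one feature vector; a missing or too-old value becomes None."""
        vector = []
        for name in feature_names:
            value, computed_at = self._values.get((entity_id, name), (None, None))
            too_old = computed_at is None or now - computed_at > max_age_seconds[name]
            vector.append(None if too_old else value)
        return vector

store = OnlineFeatureStore()
# The short-window feature tolerates little staleness, the long-window one much more.
max_age = {"orders_30m": 10 * 60, "orders_14d": 24 * 60 * 60}

now = 1_000_000  # hypothetical clock, in seconds
store.write("r1", "orders_30m", 7, computed_at=now - 5 * 60)
store.write("r1", "orders_14d", 1840, computed_at=now - 6 * 60 * 60)

names = ["orders_30m", "orders_14d"]
fresh = store.read_vector("r1", names, now, max_age)
later = store.read_vector("r1", names, now + 20 * 60, max_age)
print(fresh, later)  # [7, 1840] [None, 1840]
```

Twenty minutes later the thirty-minute feature has gone stale while the two-week feature is still usable, which is why the two pipelines would reasonably run at different frequencies.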

Training/serving skew: Discrepancies between the training pipeline and the serving pipeline can introduce training/serving skew. Skew is difficult to detect, and its presence can invalidate model predictions. A model can behave erratically when it runs inference on data generated differently from the data on which it was trained. The topic of skew and how to deal with it deserves a separate series of articles. However, there are two typical risks worth highlighting:

Fig. 7: To avoid training/serving skew, the same feature implementation should be used for both the training and the serving pipelines.
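One way to picture a single feature implementation shared by both paths is the sketch below. It is a simplified assumption, not a description of any specific system: the function names, the integer minute timestamps, and the sample data are hypothetical.

```python
def order_count_feature(order_times, as_of, window=30):
    """Single definition of the feature: orders in the `window` minutes up to `as_of`."""
    return sum(1 for t in order_times if as_of - window < t <= as_of)

def build_training_examples(order_times, labeled_events):
    """Offline path: compute the feature as of each historical event time."""
    return [(order_count_feature(order_times, t), label) for t, label in labeled_events]

def serve_features(order_times, now):
    """Online path: compute the same feature as of the request time."""
    return [order_count_feature(order_times, now)]

order_times = [100, 110, 125, 140]        # hypothetical order times, in minutes
labeled_events = [(126, 18), (141, 25)]   # (event time, observed preparation time)

training = build_training_examples(order_times, labeled_events)
online = serve_features(order_times, now=141)
print(training, online)  # [(3, 18), (2, 25)] [2]
```

Because both paths call `order_count_feature`, a change to the window or the boundary handling applies to training and serving at once. Two separately written implementations would have to be kept in sync by hand.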

Fig. 8: The graph shows the trailing count of orders: (1) shows the feature values served for prediction, updated every 10 minutes; (2) shows training data that incorrectly tracks the true values much more closely than the features served in production

Data Challenge #5: Tracking Features in Production

Despite every attempt to get the above right, something will break. When an ML system fails, the cause is almost always a “data integrity violation”. This term can cover many different causes, each of which requires tracking. Examples of data integrity violations:
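Two problems that are comparatively simple to check for are a rising share of missing values and a shift in a feature's distribution relative to the training data. The sketch below is a minimal illustration of such a check; the function name, the thresholds, and the sample values are assumptions, and a production monitor would likely use more robust statistics.

```python
from statistics import mean, pstdev

def check_feature_health(training_values, serving_values,
                         max_null_rate=0.05, max_shift=3.0):
    """Return a list of human-readable problems found in the served values."""
    if not serving_values:
        return ["no values served"]

    problems = []
    null_rate = sum(v is None for v in serving_values) / len(serving_values)
    if null_rate > max_null_rate:
        problems.append(f"null rate {null_rate:.0%} exceeds {max_null_rate:.0%}")

    # Compare the mean of served values with the training baseline,
    # measured in training standard deviations.
    present = [v for v in serving_values if v is not None]
    baseline_mean, baseline_std = mean(training_values), pstdev(training_values)
    if present and baseline_std > 0:
        shift = abs(mean(present) - baseline_mean) / baseline_std
        if shift > max_shift:
            problems.append(f"mean shifted by {shift:.1f} training standard deviations")
    return problems

training_values = [4, 5, 6, 5, 4, 6, 5, 5]   # hypothetical order counts seen in training
healthy = check_feature_health(training_values, [5, 4, 6, 5])
broken = check_feature_health(training_values, [0, None, 0, None])
print(healthy)
print(broken)
```

In the second call, half of the served values are missing and the remaining ones sit far below the training mean, so both checks report a problem.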

Challenges like these create an almost insurmountable obstacle course for even the most advanced data science and ML engineering teams. Solving them requires something better than the status quo at most companies, where bespoke solutions address only a subset of these problems.

Introducing Tecton: the data platform for machine learning

At Tecton, we are building a data platform for machine learning to help with the most common and most difficult data science challenges.

At a high level, the Tecton platform includes:

  1. Feature pipelines for turning your raw data into features and labels
  2. Feature store for storing historical feature and label data
  3. Feature server for serving the latest feature values in production
  4. SDK for retrieving training data and manipulating feature pipelines
  5. Web UI for monitoring and tracking features, labels and datasets
  6. Monitoring engine for detecting data quality or drift problems and alerting

Fig. 9: As the central data platform for ML, Tecton brings features to development and production environments

The platform enables ML teams to bring DevOps practices to ML data:


Of course, ML data without ML models won’t give you a practical ML implementation. Therefore, Tecton provides flexible APIs and integrates with existing ML platforms. We started with Databricks, SageMaker and Kubeflow and continue to integrate with complementary components of the ecosystem.


The post Why ML data DevOps is needed appeared first on ThinkDataAnalytics.
