AI/ML workloads in containers: 6 things to know
July 22, 2021 – 3:00am
Two of today’s big IT trends, AI/ML and containers, have become part of the same conversation at many organizations. They’re increasingly paired together as teams look for better ways to manage their artificial intelligence and machine learning workloads – enabled by a growing menu of commercial and open source technologies for doing so.
“The best news for IT leaders is that tooling and processes for running machine learning at scale in containers has improved significantly over the past few years,” says Blair Hanley Frank, enterprise technology analyst at ISG. “There is no shortage of available open source tooling, commercial products, and tutorials to help data scientists and IT teams get these systems up and running.”
Running AI/ML workloads in containers: 6 key facts
Before IT leaders and their teams begin to dig into the nitty-gritty technical aspects of containerizing AI/ML workloads, some principles are worth thinking about up front. Here are six essentials to consider.
[ Check out our primer on 10 key artificial intelligence terms for IT and business leaders: Cheat sheet: AI glossary. ]
1. AI/ML workloads represent workflows
Like many other workload types, AI/ML workloads can be described as workflows, according to Red Hat technology evangelist Gordon Haff. Thinking in terms of workflows can help illuminate some basic concepts about running AI/ML in containers.
With AI/ML, the workflow starts with the gathering and preparation of data: Your models won’t get very far without it.
“Data gets gathered, cleaned, and processed,” Haff says. Then, the work continues: “Now it’s time to train a model, tuning parameters based on a set of training data. After model training, the next step of the workflow is [deploying to] production. Finally, data scientists need to monitor the performance of models in production, tracking prediction and performance metrics.”
Haff describes this workflow in straightforward terms, but doesn’t discount the amount of effort that can be involved in terms of people, processes, and environments. Containerization can simplify that effort by bringing greater consistency and repeatability.
“Traditionally, this workflow might have involved two or three handoffs to different individuals using different environments,” Haff says. “However, a container platform-based workflow enables the sort of self-service that increasingly allows data scientists to take responsibility for both developing models and integrating into applications.”
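The workflow Haff describes – gather and clean data, train and tune a model, deploy it, then monitor it – can be sketched as a sequence of stages. A minimal, illustrative sketch follows; every function name here is a placeholder, not a real library API, and in practice each stage would typically run in its own container:

```python
# Illustrative sketch of the AI/ML workflow described above.
# All names are placeholders; each stage would typically run
# in its own container in a real pipeline.

def gather_data():
    # e.g., pull raw records from a data lake or warehouse
    return [{"feature": x, "label": 1 if x < 70 else 0} for x in range(100)]

def clean_and_process(raw):
    # drop malformed rows, normalize features, etc.
    return [row for row in raw if row["feature"] is not None]

def train_model(data):
    # tune parameters against training data; here a trivial
    # "model" that just memorizes the majority label
    labels = [row["label"] for row in data]
    majority = max(set(labels), key=labels.count)
    return {"predict": lambda _row: majority}

def deploy(model):
    # in practice: package the model in a container image
    # and roll it out to production
    return model

def monitor(model, sample):
    # track prediction and performance metrics in production
    hits = sum(model["predict"](row) == row["label"] for row in sample)
    return hits / len(sample)

data = clean_and_process(gather_data())
model = deploy(train_model(data))
accuracy = monitor(model, data)
print(f"accuracy: {accuracy:.2f}")
```

The point of the sketch is the handoffs: each arrow between stages is a place where, traditionally, work moved to a different person and environment – and where a container platform can instead provide a consistent, self-service path.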
[ Want best practices for AI workloads? Get the eBook: Top considerations for building a production-ready AI/ML environment. ]
2. The benefits are similar to other containerized workloads
Nauman Mustafa, head of AI & ML at Autify, sees three overarching benefits of containerization in the context of AI/ML workflows:
Modularity: It makes important components of the workflow – such as model training and deployment – more modular. This is similar to how containerization can enable more modular architectures, namely microservices, in the broader world of software development.
Speed: Containerization “accelerates the development/deployment and release cycle,” Mustafa says. (We’ll get back to speed in a moment.)
People management: Containerization also makes it “[easier] to manage teams by reducing cross-team dependencies,” Mustafa says. As in other IT arenas, containerization can help cut down on the “hand off and forget” mindset as work moves from one functional group to another.
While a machine learning model may have different technical requirements and considerations than other applications or services, the potential benefits of containerization are quite similar.
Audrey Reznik, data scientist at Red Hat, points to increased portability and scalability of your AI/ML workloads or solutions – think hybrid cloud – as one example. Reznik lists less overhead as another.
“Containers use fewer system resources than bare metal or VM systems,” Reznik says.
This helps lead to faster deployments. “I like to use the phrase ‘how fast can you code,’ because as soon as you finish coding we can deploy your solution with a container,” Reznik says.
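To illustrate that “finish coding, then deploy” handoff: the deployable artifact can be as small as a single serving script that becomes the container’s entrypoint. The example below is hypothetical – the model is a stub and the port is arbitrary – but it uses only the Python standard library:

```python
# Hypothetical model-serving script that could serve as a
# container's entrypoint. The "model" is a stub standing in
# for whatever the data scientist just finished coding.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # stand-in for a real trained model's inference call
    return {"score": sum(features) / max(len(features), 1)}

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # read a JSON list of features, return a JSON prediction
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps(predict(features)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

def main():
    # inside a container, 0.0.0.0:8080 would be the exposed port;
    # the image's CMD would invoke this function
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```

Once a script like this is wrapped in an image, the same artifact runs identically on a laptop, on-premises, or in any cloud – which is the portability Reznik describes.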
[ Public data sets can also help with speedy results. Read also: 6 misconceptions about AIOps, explained. ]
3. Teams need to be aligned
Just because you make the workflow more modular doesn’t mean everything – and everyone – no longer needs to work well together.
“Make sure everyone involved in building and operating machine learning workloads in a containerized environment is on the same page,” says Frank from ISG. “Operations engineers may be familiar with running Kubernetes, but may not understand the specific needs of data science workloads. At the same time, data scientists are familiar with the process of building and deploying machine learning models, but may require additional help when moving them to containers or operating them going forward.”
Containerization should improve alignment and collaboration (thanks to consistency, repeatability, and other characteristics), but don’t take this benefit as a given.
“In a world where repeatability of results is critical, organizations can use containers to democratize access to AI/ML technology and allow data scientists to share and replicate experiments with ease, all while being compliant with the latest IT and InfoSec standards,” says Sherard Griffin, director of global software engineering at Red Hat.
Let’s look at three additional principles to consider carefully:
4. The “pay attention” points don’t really change
Just as many of the benefits of containerization are roughly the same for AI/ML as with other workload types, so are the important areas of operational focus. Here are three examples of operational requirements that you’ll need to pay attention to, just like with other containerized applications:
- Resource allocation: Mustafa notes that proper resource allocation remains critical to optimizing cost and performance over time. Provision too much and you’re wasting resources (and money) over time; too little and you’re setting yourself up for performance problems.
- Observability: Just because you can’t see a problem doesn’t mean it isn’t there. “Ensure that you have the necessary observability software in place to understand how your multi-container applications behave,” Frank says.
- Security: “From a security point of view, launching AI/ML solutions is no different from launching other solutions in containers,” says Alexandra Murzina, ML engineer at Positive Technologies. That means tactics such as applying the principle of least privilege (both to people and the containers themselves), using only trusted, verified container images, runtime vulnerability scanning, and other security layers should remain top of mind.
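The resource-allocation and security points above often meet in the workload spec itself. As a hypothetical sketch – built here as a plain Python dict for illustration, with made-up names and sizes – a Kubernetes pod manifest for a training job would pin resource requests and limits and drop privileges:

```python
import json

# Hypothetical Kubernetes pod spec for a training job, expressed
# as a Python dict for illustration. The image name and resource
# sizes are placeholders.
training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "train-model"},  # illustrative name
    "spec": {
        "containers": [{
            "name": "trainer",
            # use only trusted, verified images
            "image": "registry.example.com/ml/train:1.0",
            "resources": {
                # too much wastes money; too little hurts performance
                "requests": {"cpu": "2", "memory": "4Gi"},
                "limits": {"cpu": "4", "memory": "8Gi"},
            },
            "securityContext": {
                # principle of least privilege for the container itself
                "runAsNonRoot": True,
                "allowPrivilegeEscalation": False,
            },
        }],
    },
}

print(json.dumps(training_pod, indent=2))
```

Observability then layers on top of specs like this: metrics and logs from each container feed whatever monitoring stack the operations team already runs.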
5. Containers won’t fix all underlying issues
Just as automation won’t improve a flawed process (it just helps that flawed process run faster and more frequently), containerization is not going to address fundamental problems with your AI/ML workloads.
If you’re baking bias into your ML models, for example, running them in containers will do nothing to address that potentially serious issue.
Yes, there are significant advantages to containerization. Those advantages shouldn’t delude anyone into thinking containerization will solve all problems. It’s not just a matter of bad data or bias, either. Containers can speed up aspects of the workflow, but they don’t actually do all of the work.
“Containers are very beneficial for running AI/ML workloads,” says Raghu Kishore Vempati, director of technology at Capgemini Engineering. “[But] containerizing AI/ML workloads alone doesn’t make the model more efficient. It only provides a way to accelerate the productivity associated with training the models and inferring on them.”
6. Be smart about build vs. buy
As with most technical choices, there’s a “should we or shouldn’t we?” decision in terms of containerizing AI/ML workloads. And like most important technical choices, this one doesn’t come free.
“There is a cost associated with containerizing machine learning workflows, which may not be justified for tiny teams, but for large teams, benefits outweigh the cost,” Mustafa from Autify says.
IT leaders and their teams should do it with clear goals or reasons in mind – “just because we can” probably shouldn’t be the only reason on your list.
“Don’t overcomplicate an already complex situation,” Frank says. “Make sure that containerizing ML workloads will provide business value beyond the intellectual exercise.”
That value exists for a growing number of organizations and it seems likely to increase in concert with AI/ML adoption overall. If your “should we containerize?” answer is ultimately “yes,” then there’s also a build-versus-buy decision.
The good news is there are more platforms, tools, and services than ever that can help. There’s also a robust menu of open source projects focused on running AI/ML in containers. Kubeflow, for example, focuses on orchestrating ML workloads on Kubernetes.
[ Related read: Kubernetes: 6 open source tools to put your cluster to the test ]
A general rule of thumb: You probably don’t want to get stuck in the business of building and maintaining a platform for containerizing, deploying, and managing AI/ML workflows – unless that’s actually your business.
“As is often the case in the cloud-native space, projects can fail when they become too focused on assembling platforms and workflows rather than actually solving the problem at hand,” Haff says. “Maybe they’ve built a platform and now realize they need to use GPUs but they didn’t plan for that up front.”
That team then spends its time catching up and addressing missing needs instead of focusing on developing, training, and running inference on its models.
“One approach is to use a unified self-service platform like OpenShift Data Science, which provides an integrated workflow while allowing users to add additional open source and proprietary tools based on their needs,” Haff says.
If you go the commercial and/or open source route, ensure you’ll have future flexibility, because the ecosystem around AI/ML is rapidly evolving – and your own strategy is likely to shift as well.
“Do not tie yourself to one vendor,” Reznik advises. “You want to be able to use a wide variety of open source solutions, not just what the vendor gives you. A wide variety of open source solutions will expand possibilities for your team to be more innovative.”