Going beyond No-Ops with Touchless-IT-Ops
Blog: Capgemini CTO Blog
No, really – what is “No-Ops”?
In 2011, Forrester released a report entitled Augment DevOps With No-Ops and noted that “DevOps is good, but cloud computing will usher in No-Ops.” But if we fast-forward to today, this definition is a bit too narrow. It relies on the cloud and does not address enough of the typical enterprise IT estate, which is more on-premises legacy than cloud-based.
So, what does “touchless IT operations” entail – and why does it matter? I define “touchless IT operations” as the automation of infrastructure and application management. Not everything can be automated or resolved without human intervention – but a lot can. And the list of what can be automated gets longer all the time, while the list of required human interventions gets shorter, freeing IT staff for more valuable activities that are directly related to the business. The potential benefits include significantly faster speed-to-market, higher quality, and cost savings.
No, really – that all sounds great – but how do we make this happen?
Biz/Dev/Sec/Ops – move to a product mindset and your operating model will follow
Firstly, move from a project mindset to a product one and shift to an Agile methodology. Next, rather than finishing an application development (AD) project and throwing it over the wall to applications management/support (AM), treat the application(s) as a product with a lifecycle plan and roadmap. This should include “marketing” and “adoption” strategies, as well as “end-of-life” considerations.
Ideally, the product owner should come from the business and should drive decisions on the backlog in terms of features, non-functional requirements (NFRs), and incident/ticket resolution. By combining AD and AM into ADM, you create a new paradigm of “you build it, you run it.” In other words, the more resilient the application is coming out of development, the fewer L2 and L3 incidents the team has to support. And that’s the kicker – the team supports L2 and L3 incidents, not a dedicated support organization. The ADM team is therefore intrinsically motivated to create higher-quality applications, including better monitoring and alerting.
But what about testing and security? Both are part of the “team.” Test automation and security “testing” are included in what the team does, along with automated enforcement of coding standards and the prevention of technical debt.
Harness better intelligence and insight into IT operations and automate as much as practical
Artificial Intelligence applied to IT Operations, or AIOps, is a dual approach. First, use artificial intelligence to mine IT operational data for opportunities to get to the root cause and remove the issue(s) causing incidents. Second, where this is not practical, automate the resolution.
Sounds easy, right? No, not really. This requires several integral operations to be successful:
Logging – monitors can only pick up what gets logged or passes through the network. When developers build an application, they need to include “telemetry,” information about what’s going on inside the application, so it can be logged and evaluated. If developers don’t do this, you may not have enough information to determine the root cause should an issue or opportunity arise.
Monitoring – there are several good third-party tools on the market for this, but in general, you need something to monitor the logs, aggregate and contextualize the data, and provide alerts – along with additional data mining for insights and predictive metrics.
Actionable insights – from monitoring, you need actionable insights that identify root causes or significantly advance root cause discovery. The most beneficial incident resolution ensures that the incident does not happen again. The second most beneficial resolution is one so fast that users don’t even notice it.
Automation – as previously mentioned, the resolution of what cannot be prevented needs to be automated. A robust library of automation routines and software bots is a great accelerator for automating as many resolutions as possible.
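To make the four steps above a little more concrete, here is a minimal Python sketch of how they might fit together. It is an illustration only, not a reference to any specific monitoring product: the error codes, the `ROUTINES` library, and the function names `emit_event`, `aggregate`, and `respond` are all hypothetical, and a real estate would use a proper log pipeline and runbook tooling instead of in-memory lists.

```python
import json
import logging
from collections import Counter

# Hypothetical library of automation routines, keyed by error code.
# In a real estate these would call orchestration or runbook tooling.
def restart_service(service: str) -> str:
    return f"restarted {service}"

def clear_cache(service: str) -> str:
    return f"cleared cache for {service}"

ROUTINES = {"SERVICE_DOWN": restart_service, "CACHE_FULL": clear_cache}

def emit_event(logger: logging.Logger, service: str, code: str, message: str) -> str:
    """Logging: emit one structured (JSON) telemetry record and return it."""
    record = json.dumps({"service": service, "code": code, "message": message})
    logger.error(record)
    return record

def aggregate(records: list[str]) -> Counter:
    """Monitoring: parse the records and count events per (service, code)."""
    counts: Counter = Counter()
    for raw in records:
        event = json.loads(raw)
        counts[(event["service"], event["code"])] += 1
    return counts

def respond(counts: Counter, threshold: int = 3) -> list[str]:
    """Insight and automation: act on any pattern that reaches the threshold.

    Known codes run an automated routine; unknown codes are escalated
    to the team, since not everything can be resolved without a human.
    """
    actions = []
    for (service, code), n in sorted(counts.items()):
        if n < threshold:
            continue  # too few occurrences to act on
        routine = ROUTINES.get(code)
        if routine is None:
            actions.append(f"escalate {code} on {service} to the team")
        else:
            actions.append(routine(service))
    return actions

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    log = logging.getLogger("telemetry")
    records = [emit_event(log, "billing", "SERVICE_DOWN", "health check failed")
               for _ in range(3)]
    records.append(emit_event(log, "billing", "CACHE_FULL", "cache at 100%"))
    print(respond(aggregate(records)))
```

In this sketch the three repeated `SERVICE_DOWN` events reach the threshold and trigger the restart routine, while the single `CACHE_FULL` event is ignored. The threshold of three is an arbitrary assumption; choosing it well is exactly the kind of tuning that the monitoring and insight steps are meant to inform.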
Site/Service Reliability Engineering (SRE)
Site/Service Reliability Engineering or SRE is exactly what it sounds like – engineering reliability. That being said, there is some unpacking to do with this statement. For the most part, SRE is about non-functional requirements (NFRs like security, reliability, scalability, etc.). SRE is a profoundly serious focus on the reliability of all the systems required to keep your application(s) up and running as expected.
Think about that – network, storage, logging, compute, scalability, security, etc. It’s a lot. SRE has a lot of similarities to AIOps. This includes leveraging analytics to better understand root causes and resolving issues before they become incidents, automating everything that cannot be prevented, and tearing into technical debt. Some companies consider this a “role” but at Capgemini, we consider this a capability that’s led by an SRE lead.
Tangible results with Touchless-IT-Operations – accelerated time-to-market, improved quality, higher cost savings
Implementing these changes to the IT operating model and utilizing capabilities like AIOps and SRE will yield an application development and management capability with “touchless IT operations.” This means an organization that no longer needs a dedicated application management function and that accelerates time-to-market by working more closely with the business through a product-based Agile approach. Additionally, you can expect higher quality for the same reason (helped by the SRE capability) and cost savings through a smaller and more Agile ADM capability.
“Touchless-IT-Operations” or TIO is much more than “No-Ops” or automated cloud management – it’s a profound move towards automated IT operations. To get a feel for Touchless-IT-Operations and the potential it has for your business, get in touch with me here.