Blog Posts Business Management

Site reliability engineering

Blog: Capgemini CTO Blog

Problem statement – Due to the current state of how we monitor, alert, and log our digital ecosystem, it takes more time to detect, diagnose, and fix production incidents. Communication and status updates on production incidents are not reliable or timely. Business continuity and disaster recovery capabilities are even more important today as we are betting more on digital experiences.

SRE is causing disruption in quality engineering. In the past, the focus of quality engineering was on shift-left testing, especially requirements review, functional, and non-functional testing. With SRE, the focus is shifting towards shift-right, or production testing.

DevOps vs SRE

One obvious question concerns the crossover between SRE and DevOps, and rightly so. There is a significant overlap between them as they both tend to address the silos between “dev” and “ops.” Also, in terms of practices followed, there are numerous parallels. However, the approach and objectives are quite different in both cases.

SRE pillars

SRE is focused on four delivery pillars:

SRE

Incident management: The main goal of this pillar is to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) to desired numbers. Setting up a single pane of glass of telemetry and observability makes investigating and diagnosing problems easier. Defining the SLOs and SLIs at each service layer and customer experience web site level are the key milestones. Availability, latency, and system throughput are the KPI needs to be tracked.

Problem management: This pillar deals with root cause analysis and prevention and self-healing mechanisms in the digital ecosystem. SRE dashboards and data-driven insights provide information about the overall service health, which helps us to identify the service availability for given amount of time during production monitoring

Environment management: Business continuity and disaster recovery are the core areas of focus for this pillar. Security, compliance, data management, and DevOps governance will also be part of by this pillar. Chaos engineering is a disciplined approach to identify vulnerabilities in systems in the production environment. It is implemented to check the system’s reliability, stability, and ability to survive in unstable and unexpected conditions.

Outage communications:  Curated communication of production incidents, software releases, scheduled and unscheduled maintenance. The goal of this pillar is to use all channels to clearly communicate on system health and incident resolution

SRE stages

SRE can be implemented in a phased manner. The below diagram depicts the three stage approach:

SRE 2

 

SRE Jumpstart

SRE Jumpstart consists of aspects such as:

For more information on site reliability engineering, please contact Genesis Robinson @ genesis.robinson@capgemini.com

Leave a Comment

Get the BPI Web Feed

Using the HTML code below, you can display this Business Process Incubator page content with the current filter and sorting inside your web site for FREE.

Copy/Paste this code in your website html code:

<iframe src="https://www.businessprocessincubator.com/content/site-reliability-engineering/?feed=html" frameborder="0" scrolling="auto" width="100%" height="700">

Customizing your BPI Web Feed

You can click on the Get the BPI Web Feed link on any of our page to create the best possible feed for your site. Here are a few tips to customize your BPI Web Feed.

Customizing the Content Filter
On any page, you can add filter criteria using the MORE FILTERS interface:

Customizing the Content Filter

Customizing the Content Sorting
Clicking on the sorting options will also change the way your BPI Web Feed will be ordered on your site:

Get the BPI Web Feed

Some integration examples

BPMN.org

XPDL.org

×