Key Metrics for Cloud Platform Performance: From SLOs to Developer Experience

September 4, 2024

Markus Schweig

Read time: 14 mins

In today’s digital age, it is all about the customer experience. Satisfied customers naturally translate into revenue and growth. Quick responses and high availability have a widespread positive impact, especially as the customer base keeps growing. Think of Google handling 3.5 billion searches and YouTube users watching 1 billion hours of video every day, with many other top-ranking companies close behind. Then think about how companies that are not there yet can reach meaningful scale, proportionate to their business, over time.

I have seen engineering teams at a startup enjoy quick and seamless allocation of infrastructure resources for their development, test, and production needs. Workflows were swift and turnaround times were known.

When the startup was acquired, resource creation shifted to the processes, tooling, and hosting locations of the acquirer. During this transition, the former startup engineering teams kept projecting delivery dates based on the assumption of their previous agility. However, infrastructure allocation, resource access, and general availability were now slower than before, stretching from minutes to days, with inconsistent turnaround times for the same types of requests. Developer experience quickly degraded, which triggered the adoption of the former startup’s automation and processes. This greatly improved leadership’s understanding of ship dates, and it put developers at ease to once again be able to trust predictable turnaround times for resource requests.

Takeaways

The key takeaways of this article are:

  • Establish, achieve, and communicate duration targets for infrastructure resource lifecycle management tasks. Allow application teams and developers to factor these dependencies into their own delivery schedules.
  • Consider building a configuration management platform with cloud-agnostic, purpose-built APIs. Then use its consistent, organization-wide metrics to develop duration objectives for resource creation, mutation, and deletion. Offer platform performance guarantees for your mission-critical resource lifecycle management services.
  • Upbound is built on Crossplane, a framework that helps platform creators build their own cloud platforms for use by application teams and developers. Both Upbound and Crossplane come with out-of-the-box metrics that can be used directly as platform service level indicators. This simplifies understanding your platform’s performance across its use cases.

Your top-level platform performance dashboard can conceptually look similar to the one below, courtesy of Upbound’s Ezgi Demirel. It dynamically shows how long it takes for resources to become ready or be deleted, and how long drift detection takes. It lends itself to triggering notifications for the thresholds that you set.

Now let’s look at how to get there from the current state of the world.

Closing The Gap

Great customer experiences are created by innovative application teams that require cloud and data center infrastructure on their continuous journey. The performance of cloud resource lifecycle management is paramount because it enables developers to have their infrastructure in place quickly, allowing them to get back to evolving their applications and thereby achieve a higher velocity of innovation. This in turn lets companies go to market faster, iterate on ideas for the next big thing, and outpace their competition. The DORA, SPACE, and DevEx frameworks help measure developer productivity. Based on the Atlassian State of Developer Experience Report 2024, 69% of developers lose 8 hours or more of their working week to inefficiencies.

Let’s take a brief look at the developer productivity frameworks. Keep in mind that measuring the right combination of metrics, as set forth by these frameworks, helps chart the path toward a culture of successful behaviors. One such behavior is driving chronic issues out of your systems rather than masking them. I saw this approach work well when I was leading Microsoft's first Service Management team, when running Xbox’s first operations center, and when building Z2 / King / Activision mobile gaming infrastructure automation.

DevOps Research and Assessment (DORA): Provides a standard set of DevOps metrics used for evaluating process performance and maturity. These metrics provide information about how quickly DevOps can respond to changes, the average time to deploy code, the frequency of iterations, and insight into failures.

Satisfaction and well-being, Performance, Activity, Communication and collaboration, Efficiency and flow (SPACE): Captures the most important dimensions of developer productivity, illuminating that productivity is more diverse than engineering system and developer tool performance alone, while recognizing their large impact and influence.

Developer Experience, DX (DevEx): Is what you get when developers can easily get into and maintain a flow state at work. Flow state is the goal of great DevEx; it increases engagement and the quality of software delivery, and depends on a number of factors, including the developer’s user experience with their tools and reduced cognitive load.

The four key DORA DevOps measurements are deployment frequency, lead time for changes, change failure rate, and mean time to recover (MTTR), plus a fifth DORA capability: reliability, as outlined in this article. These metrics tie back to, and are influenced by, the performance of the respective developer resource lifecycle management tasks.

Based on the Pluralsight State of Cloud 2023 report, only 27% of leaders say that their cloud strategies enable them to drive customer value, although 70% of organizations report that more than half of their infrastructure is in the cloud. This indicates that companies are still early in their journey: they manage infrastructure traditionally with Ansible, Puppet, Terraform, Pulumi, and similar tooling, without a central cloud developer platform, or at least without one that enables measuring and reporting platform performance and how it relates to customer value, as is possible with Upbound.

The Emerging Era Of Internal Developer (Cloud) Platforms

There are a few companies that have already embraced building true custom cloud platforms, including those on the Crossplane adopters list and Upbound’s customers. Those platforms are often fronted by internal developer portals like Backstage or Cortex. The Cortex 2024 State of Developer Productivity report says that teams without an internal developer portal report more frustration with finding data and context. The report further states that inadequate tooling is one of the top reasons hampering developer productivity. This is where a well-architected internal developer platform built on Upbound can help. Most other platform tool companies offer solutions that can be integrated with a developer platform, such as ArgoCD, Flux, Artifactory, and Grafana, but Upbound distinguishes itself through its unique cloud platform framework. It enables platform creators to build their own highly scalable developer platforms with a cloud-agnostic, purpose-fit API for their internal application teams, with lower risk and higher velocity than starting from scratch.

Developer platforms aim to reduce the cognitive load required of their users and increase their velocity in creating, updating, and deleting cloud and other resources, essentially implementing higher-level API abstractions and continuous resource lifecycle management. These platforms support development and software ship cycles throughout the various development, test, staging, and production environments.

Cloud platform creators can help accelerate the speed of their organization’s innovation, delivery, and real-world outcomes by uplifting the developer experience. Application teams are the primary platform users. Developing and speaking to metrics that dovetail into measuring developer productivity is key, because this enables developers to better forecast end-to-end project delivery, especially when they know within which projected timeframe they will receive usable cloud resources. Developers want to know how long it takes to create, upgrade, and delete those resources. They want self-service through a highly available custom cloud platform management API that reduces their cognitive load. This in turn helps application teams guide the business about their own performance when shipping products, services, and features by incorporating cloud platform performance guarantees.

Configuration management platform teams must effectively measure, manage, and communicate the performance of their platforms, which up to this point has not been trivial because of bespoke tooling that creates and mutates infrastructure resources in imperative ways. Ansible, Puppet, Pulumi, and Terraform are often used by individual developers and SREs without their organizations having a good grasp on overarching performance, so many factors can influence resource request delivery times. Measurement, data consolidation, and developing performance objectives are more complex due to the lack of platform APIs. Note that some platform teams create custom applications with a custom API; measuring their end-to-end performance is often still an afterthought, especially when they are part of cascading software supply tool chains. By building an internal developer platform, especially on Upbound, developing and offering platform performance guarantees becomes simpler and more achievable because of built-in framework capabilities, such as service level indicator metric counters and support for compliance and corporate guidance.

Understanding SLAs, SLOs, and SLIs

Let’s look at how we can develop and offer platform performance guarantees. This is where Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) come into play.

Service Level Agreement (SLA): The agreement you make with your clients, users and application teams. This may be a contractual obligation between a service provider and a customer, defining the expected service quality and potential penalties when that quality is not met. Create a unique SLA for each platform service and cover expected and unexpected exceptions.

Service Level Objective (SLO): The objectives your team must hit to meet the agreement. This is a quantitative measure of a service’s performance, used to track progress towards meeting SLAs. Create SLOs that support the SLA.

Service Level Indicator (SLI): The real numbers on your performance. This is a measurable metric used to evaluate the SLO. Set realistic targets based on business goals and platform capabilities.
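
To make the relationship concrete, here is a minimal sketch of how raw SLI measurements roll up into an SLO check. The durations, the 10-minute readiness target, and the 90% objective are illustrative assumptions, not recommendations.

```python
# A minimal sketch of how an SLI measurement rolls up into an SLO check.
# All numbers, including the 10-minute readiness target and the 90%
# objective, are illustrative assumptions.

# SLI: observed creation-to-ready durations (seconds) for one resource type.
readiness_durations = [312, 478, 521, 390, 1180, 455, 610, 388, 502, 940]

# SLO: 90% of resources become ready within 600 seconds (10 minutes).
slo_target_seconds = 600
slo_objective = 0.90

within_target = sum(1 for d in readiness_durations if d <= slo_target_seconds)
attainment = within_target / len(readiness_durations)

print(f"SLI attainment: {attainment:.0%} (objective: {slo_objective:.0%})")
print("SLO met" if attainment >= slo_objective else "SLO violated")
```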

By defining clear SLAs, SLOs, and SLIs, teams can establish expectations, monitor performance, and identify areas for improvement. This SLO Development Life Cycle document explains how to generally develop your objectives.

Develop Achievable Platform SLOs

With cloud computing, businesses leverage dynamic scale and resource configurations. They take advantage of evolving features that are introduced through foundational changes while the application services are running on top of them. Application teams benefit from knowing how long it will take to obtain new resources and to modify and delete existing ones, because they are striving to provide a highly available experience to their users. Here is how platform performance ties back to DORA metrics.

Commonly used deployment strategies include blue-green, canary, rolling, and feature toggles. I have seen teams combine them, where each new deployment required new infrastructure resources, including load balancers, application server farms, and at times additional database clusters. With multiple live production deployments in a day, resource availability highly influenced DORA deployment frequency.

Knowing the duration for receiving infrastructure from a highly available, self-service configuration management platform API facilitated the understanding of DORA lead time for changes.

The newly allocated infrastructure provided immediate rollback capability during deployment failures with minimal user impact, improving DORA MTTR metrics.

Allocating resilient deployment resources added resilience and increased DORA reliability. Additionally, when the platform is built on Upbound, frequent resource state reconciliation adds reliability by ensuring that the infrastructure indeed remains configured as declared.

Developer platforms wrap lower-level infrastructure services and cloud platforms with asynchronous requests that can take tens of minutes to complete, depending on the underlying cloud provider turnaround. Users of your platform want to know when resource requests are satisfied and how available your API is.

Resource Readiness: Consider the following steps to develop your custom platform SLO; a measurement sketch follows the list.

  1. Measure the duration of resource requests at various times during the day over the course of several weeks. Consider inviting a focus test user group that puts your platform through its paces. Record the duration metrics.
  2. Keep recording and evaluating as you go to detect longer term patterns. These may include massive product launches. Think new Apple products that appear in the online store, or exclusive World of Warcraft video game content.
  3. Vary the relevant user parameters when recording turnaround times, such as
    1. cloud region,
    2. instance types for Kubernetes worker nodes and virtual machines,
    3. and the number of requested resources.
  4. Identify patterns that may enable you to provide meaningful distinctions in your objectives and agreements based on criteria, such as
    1. time to allocate GPU versus non-GPU nodes, 
    2. time to receive resources in a particular region,
    3. time to create, update and delete resources in bulk, e.g. 1 versus 10, 100, 1,000.
  5. Develop independent objectives for requests that create groups of interdependent resources. From a user’s perspective, the relevant measure is the end-to-end duration of all resource creations needed to satisfy the user’s call to one particular configuration management platform API. For example, one such API call may be for an AWS EKS cluster with worker nodes, subnets, a VPC, IAM roles, and security groups.
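
Here is a minimal measurement sketch for the sampling steps above, assuming Crossplane’s time-to-first-readiness histogram (covered later in this article) is scraped into Prometheus. The server URL and the gvk label grouping are assumptions to adapt to your environment.

```python
# A sketch for sampling readiness durations so patterns can emerge over time.
# Assumptions: Crossplane's time-to-first-readiness histogram is scraped into
# Prometheus, the server URL is reachable, and results carry a "gvk" label.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical

# 95th percentile of time-to-first-readiness over the last day, per kind.
QUERY = (
    "histogram_quantile(0.95, sum by (gvk, le) "
    "(rate(crossplane_managed_resource_first_time_to_readiness_seconds_bucket[1d])))"
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    kind = result["metric"].get("gvk", "unknown")
    p95 = float(result["value"][1])
    print(f"{kind}: p95 time-to-first-readiness {p95:.0f}s")
```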

Availability: Determine for which platform interfaces you want to develop availability objectives. Your user interface to your platform may be a CI/CD system. This system may interface with API gateways or an API server. The API server may reside in Kubernetes, and dispatch resource requests to a platform. The platform may use Kubernetes controllers to request the individual resources that are part of a group of interdependent resources from a cloud provider like AWS, Azure or GCP.

Per Amazon’s AWS Reliability Pillar, the company documents availability targets for various types of applications.

Be aware of your dependent cloud provider components’ availability. For instance, based on this AWS EKS Service Level Agreement, the EKS control plane availability is 99.95%. If your platform endpoint is running on EKS, your own availability can by definition only be equal to or less than that, unless you implement a higher-availability foundation. When your endpoints are cascaded serially through multiple clusters, for instance when your CI/CD system resides on one cluster talking to your custom platform on another, the end-to-end availability is the product of each cluster’s availability, e.g. 99.95% x 99.95% = 99.9%. Notice that we just shaved off 0.05%, which amounts to 4+ hours of additional potential annual downtime.
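
Here is the same arithmetic as a small worked example; the 99.95% figure comes from the EKS SLA cited above, and everything else follows from it.

```python
# Serial availability: when requests traverse components in series, the
# end-to-end availability is the product of the individual availabilities.
HOURS_PER_YEAR = 24 * 365

eks_control_plane = 0.9995  # per the AWS EKS SLA cited above

single = eks_control_plane
cascaded = eks_control_plane * eks_control_plane  # two clusters in series

single_downtime = (1 - single) * HOURS_PER_YEAR      # ~4.4 hours per year
cascaded_downtime = (1 - cascaded) * HOURS_PER_YEAR  # ~8.8 hours per year

print(f"Cascaded availability: {cascaded:.4%}")  # ~99.90%
print(f"Additional potential annual downtime: "
      f"{cascaded_downtime - single_downtime:.1f} hours")  # ~4.4 hours
```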

Your custom platform processes may have their own availability guarantees that you want to factor in.

Let’s explore the meaning of the availability of your custom configuration management platform. Depending on how your platform is built, its unavailability may not impact running resources unless there are overarching cloud provider issues and your infrastructure does not straddle regions. Unavailability will, however, impact the ability to create, update, and delete resources. Imagine you spun up resources costing $100,000 per hour with the intent to decommission them within the hour, and then a cloud provider issue causes 4 hours of EKS downtime that is well within the AWS EKS SLA but impacts your platform management APIs, resulting in a half-million-dollar bill for resources you could not tear down.

It is good practice to determine how much configuration management downtime your business can sustain without major impact, and what that downtime costs.
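
A small sketch of that cost reasoning, using the illustrative figures from the scenario above:

```python
# Cost exposure when the platform's deletion API is down. Figures are
# illustrative and match the scenario described above.
hourly_spend = 100_000        # cost of the temporarily provisioned resources
planned_runtime_hours = 1     # intent: decommission within the hour
platform_outage_hours = 4     # deletion requests cannot be served

# You pay for the planned hour plus every hour you cannot decommission.
actual_bill = hourly_spend * (planned_runtime_hours + platform_outage_hours)
planned_bill = hourly_spend * planned_runtime_hours

print(f"Planned bill: ${planned_bill:,}")  # $100,000
print(f"Actual bill:  ${actual_bill:,}")   # $500,000
```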

Determine the granularity of your availability metrics. Upbound’s availability status page, for instance, distinguishes between the availability of its managed control plane services, its marketplace, and its accounts.

Write the SLIs that support your SLO to persistent storage outside of volatile systems, such as self-hosted Prometheus and Grafana pods, so that there is evidence of your cloud platform performance. Upbound and Crossplane expose metric counters for drift detection, time to first readiness, and time to delete resources that can be used directly for SLO development. Kube-state-metrics offers counters that help with overall service availability and pod uptime. An example SLO, and a sketch for persisting its supporting SLIs, may look as follows.
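
As a hedged illustration, the SLO could read: "over a rolling 30-day window, 99% of managed resources report ready." The sketch below samples the supporting SLIs from Prometheus and appends them to durable storage outside the monitoring stack; the objective, the server URL, and the file path are assumptions.

```python
# A sketch of persisting SLI snapshots outside volatile monitoring pods so
# evidence of platform performance survives Prometheus restarts. The server
# URL and the output path on a durable volume are assumptions.
import csv
import time

import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical
SLI_QUERIES = {
    "resources_ready": "sum(crossplane_managed_resource_ready)",
    "resources_total": "sum(crossplane_managed_resource_exists)",
}

def sample_slis() -> dict:
    """Take one point-in-time snapshot of the configured SLIs."""
    row = {"timestamp": int(time.time())}
    for name, query in SLI_QUERIES.items():
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                            params={"query": query})
        resp.raise_for_status()
        row[name] = float(resp.json()["data"]["result"][0]["value"][1])
    return row

# Append to a CSV on durable storage; write a header on first use.
with open("/durable/sli_history.csv", "a", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["timestamp", "resources_ready", "resources_total"])
    if f.tell() == 0:
        writer.writeheader()
    writer.writerow(sample_slis())
```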

Role of Crossplane in Cloud Platform Performance Management

Crossplane is a powerful cloud native framework for building custom cloud configuration platforms with resource-provider-agnostic APIs. It plays a crucial role in performance management by providing a unified control plane for managing infrastructure and application resources. As a nice side benefit, Crossplane and its providers offer metrics endpoints that simplify the process of defining and monitoring SLIs.

Crossplane Provider Metrics: Many Crossplane providers expose metrics that can be directly used as SLIs, especially those created with the Upjet code generator, which expose the following (a usage sketch follows the list).

  • crossplane_managed_resource_deletion_seconds
    The time it took for a managed resource to be deleted.
  • crossplane_managed_resource_exists
    Counts the number of managed resources.
  • crossplane_managed_resource_drift_seconds_bucket
    Measures drift caused by external sources mutating resource state, such as a person or third-party system adding tags to cloud resources.
  • crossplane_managed_resource_first_time_to_readiness_seconds
    The time it took for a managed resource to become ready for the first time after creation.
  • crossplane_managed_resource_first_time_to_reconcile_seconds
    The time it took for a managed resource to be detected by the controller.
  • crossplane_managed_resource_ready
    Counts the number of managed resources that are in a ready state.
  • crossplane_managed_resource_synched
    Counts the number of managed resources that are synched.
  • upjet_resource_ext_api_duration
    Measures in seconds how long it takes a Cloud SDK call to complete.
  • upjet_resource_external_api_calls_total
    The number of external API calls.
  • upjet_resource_ttr
    Measures in seconds the time-to-readiness (TTR) for managed resources.
  • upjet_resource_reconcile_delay_seconds
    Measures in seconds how long the reconciles for a resource have been delayed from the configured poll periods.
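
As a hedged example of using these counters directly, the sketch below computes two candidate SLIs. It assumes the duration metrics are exposed as Prometheus histograms and scraped into a server at an assumed URL.

```python
# Two candidate SLIs built from the provider metrics above. Assumptions: the
# duration metrics are Prometheus histograms and the server URL is reachable.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical

SLI_QUERIES = {
    "p99 resource deletion time (s)": (
        "histogram_quantile(0.99, sum by (le) "
        "(rate(crossplane_managed_resource_deletion_seconds_bucket[6h])))"
    ),
    "p99 cloud SDK call time (s)": (
        "histogram_quantile(0.99, sum by (le) "
        "(rate(upjet_resource_ext_api_duration_bucket[6h])))"
    ),
}

for label, query in SLI_QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                        params={"query": query})
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    value = float(results[0]["value"][1]) if results else float("nan")
    print(f"{label}: {value:.1f}")
```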

Crossplane Pod Metrics: Include controller-runtime, process, and Go metrics that can be used to raise awareness of degrading custom platform performance. Degradation may not yet impact the developed service level objectives to deliver infrastructure within the promised duration, but if the platform is stretched further, it can. While there is still time for remediation, looking at the following information can help you keep meeting your SLO.

A noteworthy set of relevant Crossplane metrics that can help correct emerging platform performance issues before they impact a custom SLO follows (an early-warning sketch appears after the list):

  • controller_runtime_reconcile_errors_total
    The number of resource state update errors. When the error counter keeps increasing, it will eventually jeopardize an SLO. Check whether the Kubernetes provider controllers have access and are authenticated to the provider API, or whether this relates to potential misconfigurations.
  • controller_runtime_webhook_requests_total
    Total number of admission requests by HTTP status code. Common webhooks include validation and API conversion. Rapidly increasing counts can be the cause for increased CPU utilization, which can cause side effects for other resources.
  • leader_election_master_status
    Indicates whether the reporting system is master or backup. This should remain stable for long periods of time.
  • rest_client_requests_total
    Number of HTTP requests, partitioned by status code, method, and host.
  • workqueue_adds_total
    Total number of additions handled by workqueue.
  • workqueue_depth
    Current depth of workqueue. If this number keeps increasing, the Crossplane core pod may run into resource limits.
  • workqueue_retries_total
    Total number of retries handled by workqueue. A growing number is worth investigating.
  • workqueue_unfinished_work_seconds
    The number of seconds of work in progress but not observed by work_duration. Large values indicate stuck threads.
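
Here is a hedged early-warning sketch built on the queue metrics above: if workqueue depth keeps growing, reconciles are falling behind and the SLO is at risk. The server URL and the growth threshold are assumptions to tune for your setup.

```python
# Early-warning check: flag workqueues whose depth has trended upward over
# the last hour, before SLOs are impacted. Threshold and URL are assumptions.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # hypothetical
GROWTH_THRESHOLD = 0.1  # average depth increase per second; tune as needed

# Per-queue rate of change of the workqueue_depth gauge over one hour.
QUERY = "sum by (name) (deriv(workqueue_depth[1h]))"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    queue = result["metric"].get("name", "unknown")
    growth = float(result["value"][1])
    if growth > GROWTH_THRESHOLD:
        print(f"WARNING: workqueue '{queue}' depth growing at {growth:.2f}/s")
```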

Custom Resource Metrics: Crossplane’s extension of the Kubernetes Resource Model allows for the creation of composite resource definitions (XRDs), enabling custom status fields that are tailored to specific platform requirements and the platform’s cloud-agnostic API resources.

Summary

We learned that there is a clear benefit to using an enterprise offering such as Upbound, built on top of Crossplane: it helps us create mission-critical custom cloud configuration platforms that readily expose consistent metrics, which can be used as SLIs across a wide range of resource lifecycle management tasks straddling the use cases of many application teams in an organization. SLIs allow us to develop SLOs and offer SLAs to those application teams. Everyone can then easily understand the platform performance and correlate it to customer value. We can use dashboards to communicate that performance, and get notified when SLIs deviate from their baseline, so that we can take corrective action and still meet our SLOs.

Upbound observability can be set up following these steps. To see Upbound in action, sign up for a free trial.
