Top Ten Things Slowing Down Your Platform Team

March 17, 2023

Sumbry

Reading time: 10 min

Platform Engineering is the “new hotness.” An internal platform is what your engineers use to build, deploy, and operate their services in production. Companies everywhere realize the value of having Internal Platforms and are restructuring Infrastructure Teams into Platform Teams with the mandate to standardize and consolidate their software development lifecycle around a consistent set of services, tooling, and processes.

While the value of an Internal Platform driven by Platform Teams is ever-so-clear, best practices around how to structure a team to execute against this vision are fuzzier. Most Platform Teams have a standard set of antipatterns that slow them down and prevent them from fully achieving the Platform Dream. So here is my Top Ten List of the things that are slowing down your Platform Team and how to fix them:

Not being staffed appropriately
Having to support the most critical parts of the company
Maintaining too high a percentage of toil
Doing too much context-switching
Needing to support Shadow Infrastructure
Underestimating the time to support new technology
Constantly having to support or defend migrations
Suffering from the effects of Conway’s Law
Constantly taking on cost-related work with no clear customer value
No consistency in how to build and operate a platform

If you want a high-performing Platform Team, you must be honest about the above pitfalls. They’re anti-patterns and will slow your Platform Team down. I’d argue that 50% of the work involved in building a platform is the same across most companies. So, focus your Platform Team on providing differentiated value for the business.

Now, how to avoid these traps?

Not being staffed appropriately

One of the biggest mistakes companies make is not adequately staffing their platform team. Most platform teams are under-resourced and don’t have critical mass in team or organization size to meet the needs of a never-decreasing product backlog. Investing in your platform has a compounding return in terms of efficiency for your engineers. By providing consistency and giving engineers a golden path to build, deploy, and operate their software in production, you remove some of the massive amounts of context required to be productive. You allow them to iterate more quickly on the features needed to operate and grow the business reliably.

A healthy ratio of engineers working on customer-facing features to engineers on technology consumed by other engineers should be 1:1. For every 50 folks working on product features, 50 folks should also work on teams that empower those engineers. This can be internal platform teams, infrastructure teams, reliability teams, quality teams, and more.
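As a rough illustration of that ratio, the sketch below computes how many more enabling engineers a company would need to reach 1:1. The function name and the example headcounts are hypothetical, chosen only to make the arithmetic concrete.

```python
def enabling_headcount_gap(product_engineers: int, enabling_engineers: int) -> int:
    """Return how many more enabling engineers are needed to reach a 1:1 ratio.

    "Enabling" here means platform, infrastructure, reliability, quality,
    and similar teams. Returns 0 when the ratio is already met or exceeded.
    """
    if product_engineers < 0 or enabling_engineers < 0:
        raise ValueError("headcounts must be non-negative")
    return max(0, product_engineers - enabling_engineers)


# Hypothetical example: 50 product engineers supported by 20 enabling engineers.
gap = enabling_headcount_gap(product_engineers=50, enabling_engineers=20)
print(f"Hire or reassign {gap} more enabling engineers to reach 1:1")
```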

Another challenge for platform teams is ensuring that you have the right skills. You need the appropriate mix of software and {system, infrastructure, reliability, quality, security, networking, etc.} engineering skill sets across the team.

Having to support the most critical parts of the company

Every time I hear about a global outage at a company, I immediately think one of three things happened:

Their global service discovery system failed
Some type of internal or external network routing issue
Someone shipped a change to The Monolith and it died

We keep preaching the wonders of microservices and distributed systems. Still, these critical core components exist in most organizations and are always directly supported by the platform or infrastructure team. This work has no glory, but there is all hell to pay when an outage occurs. From a staffing perspective, there is also not much investment in these areas and generally not much time allocated to improving these systems. Investing here would be a no-brainer if you had more resources, but as the previous item suggests, you’re always resource-constrained.

Maintaining too high a percentage of toil

The Google SRE Book gave us this fantastic description of toil: work that is tactical and doesn’t provide strategic value, work that is repeatable and done more than once, work that could be automated given the time, and work that is done manually.
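One way to make this description actionable is to tag tasks with those four traits and measure how much time goes to toil. The sketch below is a minimal illustration, not a method from the SRE Book: the `Task` fields, the rule that a task counts as toil only when all four traits apply, and the sample week are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    hours: float
    tactical: bool      # no lasting strategic value
    repeated: bool      # done more than once
    automatable: bool   # could be automated given the time
    manual: bool        # currently done by hand

    def is_toil(self) -> bool:
        # Assumption: a task is toil only when all four traits apply.
        return self.tactical and self.repeated and self.automatable and self.manual


def toil_fraction(tasks: list[Task]) -> float:
    """Return the share of total hours spent on toil, between 0.0 and 1.0."""
    total = sum(t.hours for t in tasks)
    if total == 0:
        return 0.0
    return sum(t.hours for t in tasks if t.is_toil()) / total


# Hypothetical week of work for one platform engineer.
week = [
    Task("rotate credentials by hand", 6, True, True, True, True),
    Task("restart stuck deploys", 10, True, True, True, True),
    Task("design new deploy pipeline", 24, False, False, False, False),
]
print(f"Toil: {toil_fraction(week):.0%} of the week")
```

Tracking this number over time, rather than reading it once, is what shows whether automation work is actually paying off.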

While Platform Teams can only partially eliminate this type of work, automating these processes is similar to paying down technical debt, because toil is itself a form of debt. You can manage it for a while until it eventually snowballs and consumes you in the form of …

Doing too much context-switching

This happens when your platform team is under-resourced, has to support too many critical components in the company, and thus doesn’t have time to automate processes and reduce its toil. Your platform engineers spend all their time jumping from one task to the next and doing 50 tasks in a day. These tasks are usually unrelated to each other, span many different contexts and domains, and incur a cost every time a human has to unload the context from the previous task and load the context of the new task.
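To get a feel for that cost, the sketch below counts how often consecutive tasks fall in different domains and multiplies by a per-switch reload time. The function names, the sample day, and the 15-minute reload cost are all assumptions for illustration, not measured values.

```python
def count_context_switches(contexts: list[str]) -> int:
    """Count how many times consecutive tasks belong to different contexts."""
    return sum(1 for prev, cur in zip(contexts, contexts[1:]) if prev != cur)


def switching_overhead_minutes(contexts: list[str], reload_minutes: float) -> float:
    """Estimate time lost to reloading context, given an assumed per-switch cost."""
    return count_context_switches(contexts) * reload_minutes


# Hypothetical day: each entry is the domain of one task, in the order worked.
day = ["dns", "ci", "dns", "database", "database", "ci", "networking"]
lost = switching_overhead_minutes(day, reload_minutes=15)  # 15 is an assumption
print(f"{count_context_switches(day)} switches, about {lost:.0f} minutes lost")
```

Batching the same seven tasks by domain would cut the number of switches from five to three, which is the kind of saving that gets harder to realize as the number of supported systems grows.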

Needing to support Shadow Infrastructure

Shadow Infrastructure happens when your Platform Team takes on too much work and is incapable of meeting the needs of its customers, aka the company’s engineers. Those unmet needs include work like deploying new infrastructure or technology in the organization in a way that multiple teams and other engineers can easily consume. Platform teams tend to have a global view of the world and like to provide global solutions, but feature teams, and rightly so, need to prioritize getting value to the business's customers.

They will solve this problem independently and implement new infrastructure or technology in a locally optimized way that works for the individual or team but may not be easily consumable by others. And because the priority is shipping features for the organization, a feature development team has no long-term interest in supporting and operating an underlying technology. Ultimately, ownership gets shifted over to the platform team.

Wonder why Platform teams are constantly context-switching and unable to reduce the amount of toil while supporting the most critical parts of the business? It is because the number of things they support is constantly growing, and Shadow Infrastructure can be one of the significant, silent contributors.

Underestimating the time to support new technology

While the previous five items listed were a direct result of the interactions between a platform team and its customers, these next five are things that a platform team does directly to itself. It starts with underestimating how much time it will take to support new technology.

Engineers love to build things and learn new technologies. That isn’t always the best thing for the business, and engineers often downplay how much effort and time creating something new takes in practice. Hint: it is measured in years, not months.

Engineers often need to pay more attention to how long it will take to build, operate and adopt a new service or technology in production. Ultimately this is the classic build versus buy argument. In either case, platform teams always need to pay more attention to the time it will take to support anything, whether it was built internally or bought and integrated.

Building is always more expensive than buying, so platform teams need to be honest about only building where necessary and where it provides differentiated value for their internal customers and the business. Buying may cost more up front, but in most situations it is easier and lets you offload some of the support directly onto a vendor. Engineers are not cheap from a cost and time/onboarding perspective, so buying is almost always cheaper in the long run.

Constantly having to support or defend migrations

Platforms naturally evolve, and as new technology and patterns emerge (like Kubernetes), you want your platform to take advantage of these improvements. What you don’t want to do as a platform team is force your customers to do a significant amount of new work to adopt this next iteration of your platform. This often shows up in the form of migrations:

Migrating from virtual machines to containers
Migrating from one cloud provider to another
Migrating from one backend service or API to another incompatible one

These are all variations of the same thing and force work on your engineers. This is another area where platform teams need to consider the support burden. You want to make it easier for your engineers to adopt new technology and do the heavy lifting for them. Otherwise, both they and you are running in place, doing work that may provide a benefit in the future but doesn’t help solve problems today.

Platform teams should limit these forced migrations as much as possible and offload as much work as possible from their engineers. This means being honest upfront about the resources and time it will take to implement, and not having too many things on the truck simultaneously.

This is also a side-effect of platform teams not engaging enough with their customers, building without understanding their needs, and failing to create the right abstractions, APIs, and contracts up-front. A migration should be a seamless process.

Suffering from the effects of Conway’s Law

Platform teams or organizations are usually a combination of infrastructure (compute, storage, networking), managed services (database, streaming, object store), other platforms (machine learning platform, application service platform), and more. Conway’s law states that organizations tend to design systems and products that mirror their internal team and communication structures.

A common anti-pattern is exposing your platform in a way that mirrors your organizational structure. If you have a compute team, then you have a compute platform. If you have a network or database team, you have a networking or database platform. The problem is that your engineers don’t care how you’re organized. They will usually view your platform as a single entity: the one thing they use to build, deploy, and operate their services.

When designing your platform, you should always start with the end-user perspective and abstract the underlying complexities away from them. Knowing those details should be optional for your engineers, even though any abstraction will ultimately be somewhat leaky.

Constantly taking on cost-related work with no clear customer value

The above statement is not necessarily fair. Whenever a platform team takes on cost-optimization work, it is usually for the Finance Organization, and usually because someone has called out the cloud spend for being too high, heh! That said, platform teams generally take on the burden of optimizing infrastructure costs and spending, and there is no direct benefit to their engineers. An engineer does not necessarily care that the platform is cheap, only that it is reliable, fast, and does what it is supposed to do.

This means that factoring in costs upfront when building out your platform, not just from a vendor perspective but also from a people perspective, is extremely important. You don’t want to think about this after the fact because it can be pretty expensive and time-consuming to make decisions around optimizing your platform after it already has significant usage. This also contributes to being unable to reduce toil, doing too much context-switching, and being under-resourced.

No consistency in how to build and operate a platform

This is where standardizing on an open-source technology can fit in. You wouldn’t try to create your own web server or database server today unless you had a highly specialized need, but everyone tries to build their own platform! It’s exceptionally wasteful when the foundational layers of a platform are essentially the same across companies.

Crossplane is one example of such a technology. It’s not just a control plane framework: it provides the building blocks you need for the basic foundational layer of your platform. It includes a cloud abstraction layer via its provider ecosystem, allowing you to consistently talk to anything with a public API. It also gives you composability: you can define complex relationships between underlying cloud providers, managed services, and resources, and offer your engineers a simpler abstraction or API on top. Combine this with a configuration format that can quickly be packaged and shared with others, and you have the basic building blocks for any platform! Learn Crossplane once and use it anywhere.

Now that is progress.
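To make the composition idea concrete, here is a minimal Python sketch of the general pattern: one simple request from an engineer expands into several underlying resources that the platform owns. This is not Crossplane's API or configuration format. Every class, resource kind, and setting below is hypothetical and exists only to show the shape of the abstraction.

```python
from dataclasses import dataclass


@dataclass
class DatabaseRequest:
    """The simple abstraction a product engineer fills in (hypothetical)."""
    name: str
    size: str  # "small" or "large"


@dataclass
class Resource:
    """One underlying piece of infrastructure the platform manages (hypothetical)."""
    kind: str
    name: str
    settings: dict


# Platform-owned mapping from a friendly size to concrete settings (made-up values).
SIZES = {
    "small": {"storage_gb": 20, "replicas": 1},
    "large": {"storage_gb": 500, "replicas": 3},
}


def compose(request: DatabaseRequest) -> list[Resource]:
    """Expand one simple request into the resources that back it."""
    if request.size not in SIZES:
        raise ValueError(f"unknown size: {request.size!r}")
    spec = SIZES[request.size]
    return [
        Resource("Network", f"{request.name}-net", {"private": True}),
        Resource("DatabaseInstance", request.name, dict(spec)),
        Resource("BackupPolicy", f"{request.name}-backups", {"retention_days": 7}),
    ]


for resource in compose(DatabaseRequest(name="orders", size="small")):
    print(resource.kind, resource.name, resource.settings)
```

The engineer only ever sees `name` and `size`. The platform team can change what sits behind `compose` without asking anyone to migrate, which is the same property the earlier sections on migrations and Conway's Law argue for.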

You’ve learned about what’s slowing down your Platform Team, from team misalignments, to managing toil, to managing bespoke layers of your own platform. And now you have an idea of how to fix them: invest in your team and in your foundational tooling.

A little bit about me:

I am the Head of Engineering here at Upbound, the company behind the open-source project Crossplane. I have over two decades of experience working in the platform, infrastructure, and reliability engineering space at companies like AirBnB, Uber, and Twilio but have a non-traditional background. I started my career as a software engineer. I have taken on a variety of roles over the years from founder to network engineer, software engineer, platform engineer, system engineer, reliability engineer, engineering manager, engineering leader, and more. I started my early career in several startups as the Internet grew in popularity and quickly learned how to build and scale infrastructure, striving to strike the right balance between solving the right technical challenges while ensuring the business continued to grow. I eventually started several companies because I thought I had some excellent ideas but wanted to learn more about problem-solving from a product and customer perspective. Some great (and expensive) lessons learned are a big reason why I am heavily customer-focused today. I always have my customer hat on, whether one of business or engineering.
