Building Products at Scale - Platform and Reliability Team (Part III)

This is the third instalment in the series on building products at scale. Part I covered the backlog and refinement sessions, and Part II covered planning. Now, let's get back to engineering and discuss how to support the process.

If you’re still following, what we have now is:

Regular Trains departing our stations (i.e. releases on a regular cadence),
Refinement sessions breaking down Themes → Epics → Features → User Stories (→ Tasks),
Prioritization happening at multiple levels (on a single backlog!),
Clearly visible partnerships across disciplines with stakeholders who share timelines & outcome responsibility (i.e., Engineering Lead + PM).

I have talked a lot about how the process defines and breaks down work items. But that doesn’t mean anything if we don’t have (1) the teams in place to build them, and (2) the signals in place to verify that we’re building the right thing.

While I have an opinion on point 1 (laid out in the diagram in Part I), I am not precious about it. It matters a lot less, provided the other aspects of this approach are followed. I do, however, have a strong opinion on how to deliver point 2.

The role of the platform and reliability team

The platform and reliability team plays a crucial role in ensuring the stability, performance and scalability of the product. They collaborate closely with the development team(s) to help identify potential reliability issues and to implement preventive measures up-front.

This team is responsible for implementing automated testing frameworks (i.e., the supporting infrastructure, but not the tests themselves) and the monitoring systems that keep an eye on the product. They can also run regular performance and reliability tests to fix issues preventatively, before we see them in production.

The most important one for me personally, however, is that the Platform & Reliability team also works on improving and streamlining the CI/CD pipelines. The immediate impact of this work is that teams can deploy code more frequently and efficiently, thereby enabling faster iterations.

You can also leverage the Platform & Reliability team when planning business continuity and disaster response processes, and have them collaborate with the security and ops teams.

By continuously monitoring the health and performance of the systems, and proactively fixing problems, the Platform & Reliability team enables the development teams to focus on delivering new features and functionality.

Best Practices for a Platform and Reliability Team (in an Agile Process)

The Platform & Reliability team should work closely with the rest of the engineering team to understand the architecture and design of the software systems, and provide input on how to improve reliability and performance.

However, the Product team is also a key stakeholder. The Platform & Reliability team should regularly communicate with the product team to understand their vision and goals, and work together to prioritize reliability-related features and requirements. The Platform & Reliability team should also engage with the design team to understand user needs and to ensure that the software is designed with reliability in mind. The design team might contribute suggestions on common flows, usability checks, etc., that could prove valuable in the development itself.

The Platform & Reliability team can encourage cross-functional collaboration and knowledge-sharing, so that all teams have a better understanding of the systems and how to make them more reliable. They can (and should) be a part of the regular Scrum teams, for example, contributing to each sprint. But in general, by using agile processes and techniques, the Platform & Reliability team can work effectively with other teams and quickly respond to changing requirements and priorities.

Regularly sharing metrics and insights from the monitoring and testing systems can help the product development teams to understand the health and performance of the software, and make data-driven decisions. It also drives accountability for the Platform & Reliability team. For example:

Availability and Uptime: The platform & reliability team can track the availability and uptime of the software systems, and share this information with other teams to provide insight into the overall health and reliability of the systems.
Response Time: Response time is a key metric for performance, and the platform & reliability team can measure the time it takes for the software to respond to requests from users or other systems.
Error Rates: Tracking error rates can help the platform & reliability team to identify areas of the software that need improvement, and to provide insight into the overall stability and reliability of the systems.
Performance Metrics: Performance metrics such as page load times and transaction processing times can help the platform & reliability team to measure and improve the performance of the software systems.
Security Metrics: Security metrics such as the number of security incidents and vulnerabilities can help the platform & reliability team to assess the overall security posture of the software systems, and identify areas for improvement.
User Satisfaction: User satisfaction metrics such as customer feedback and NPS (Net Promoter Score) can help the platform & reliability team to understand how well the software is meeting the needs of its users.
Capacity and Scalability: The team can track capacity and scalability metrics such as number of users, transactions, and data storage, to ensure that the software systems can handle increasing demand over time.

The last two are great examples of collaboration. NPS as a metric is highly dependent on the product and even design teams, but it is hugely beneficial for the entire engineering team to see and monitor. NPS was a key score we used to monitor our progress back in the Visual Studio team.
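
As a rough illustration of the first few metrics in the list above, here is a minimal Python sketch of how availability, error rate, and a response-time percentile might be computed from raw request data. The `Request` record, its fields, and the choice to count status codes of 500 and above as errors are assumptions made for this example, not a description of any particular monitoring stack.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request (assumed shape)."""
    duration_ms: float
    status: int  # HTTP-style status code


def availability(total_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the measured period during which the system was up."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return (total_seconds - downtime_seconds) / total_seconds


def error_rate(requests: list[Request]) -> float:
    """Share of requests that failed; 5xx counts as a failure (assumption)."""
    if not requests:
        return 0.0
    failures = sum(1 for r in requests if r.status >= 500)
    return failures / len(requests)


def response_time_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of request durations, in milliseconds."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(r.duration_ms for r in requests)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]
```

In practice these numbers would usually come from the monitoring system itself; the point of the sketch is only that each metric has a precise definition the teams can agree on before it appears on a dashboard.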

Finally, the team also plays a key part in promoting a reliability culture throughout the organization, and works with other teams to educate them on best practices for reliability and performance.

Improving the quality of engineering through platform & reliability

This reliability culture, mentioned in the paragraph above, has a key impact: it significantly contributes to the engineering process itself. As an example, providing dashboards and improving telemetry leads to better decision making. The product team can (and should...) start with hypotheses that are grounded in data.

For example, if the dashboards show that a particular feature is frequently causing errors or slowing down the system, the product team can use this data to hypothesize about potential usability issues or user flow bottlenecks. With this information, the design team can then work to improve the user experience by making design changes or adding additional functionality, leading to a more seamless and enjoyable user experience.

The platform & reliability team also contributes to a better testing process by providing valuable insights and tools. One way to do this is by establishing and maintaining a comprehensive testing infrastructure, including automated testing tools and continuous integration systems. In my current team, one of the first areas I asked the team to focus on was writing automated tests. To do this, they needed the underlying infrastructure, and they continue to need and evolve it. By infrastructure, I mean actual test execution, measurement of success/failure rates, linking to product backlog items, etc.
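
To make "measurement of success/failure rates" and "linking to product backlog items" a little more concrete, here is a minimal Python sketch that groups test results by backlog item and reports a pass rate for each. The `CaseResult` record and the `STORY-…` identifiers are hypothetical, invented for this example; a real setup would read results from whatever test runner and work-item tracker the team uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Hypothetical outcome of one automated test run."""
    test_name: str
    backlog_item: str  # assumed work-item ID the test is linked to
    passed: bool


def pass_rate_by_backlog_item(results: list[CaseResult]) -> dict[str, float]:
    """Return the fraction of passing tests for each linked backlog item."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.backlog_item] += 1
        if result.passed:
            passes[result.backlog_item] += 1
    return {item: passes[item] / totals[item] for item in totals}


results = [
    CaseResult("login_ok", "STORY-1", True),
    CaseResult("login_bad_password", "STORY-1", False),
    CaseResult("export_csv", "STORY-2", True),
]
rates = pass_rate_by_backlog_item(results)
```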

Additionally, the platform & reliability team can work with the testing teams themselves to create and maintain detailed test cases, and help them to identify and prioritize tests that will provide the most value in terms of improving the reliability and performance of the software. By collaborating in this way, the platform & reliability team can significantly enhance the quality and thoroughness of the testing process, leading to better software that is more reliable and performs better for end-users.

Finally, the platform & reliability team plays a key role in improving the release process. They can establish and maintain robust deployment pipelines, automate routine tasks, and provide real-time monitoring to ensure that releases are carried out smoothly and with minimal downtime (if applicable). The team can also collaborate with the product teams and/or a Release Manager to create a comprehensive release plan, including risk assessment, rollback strategies, and post-release validation processes. By providing a reliable and efficient release process, the platform & reliability team can help to reduce the risk of release failures and minimize their impact when they do occur.

Additionally, the team can use release metrics, such as time to deploy, release success rate, and post-release performance data, to continuously improve the release process and ensure that it remains effective and efficient over time.
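
As a minimal sketch of two of the release metrics named above, the following Python computes a release success rate and a median time to deploy. The `Release` record and its fields are assumptions for illustration only; the real data would come from the deployment pipeline's own history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Release:
    """Hypothetical record of one deployment (assumed shape)."""
    started: datetime
    finished: datetime
    succeeded: bool


def release_success_rate(releases: list[Release]) -> float:
    """Fraction of releases that completed successfully."""
    if not releases:
        raise ValueError("no releases recorded")
    return sum(r.succeeded for r in releases) / len(releases)


def median_time_to_deploy(releases: list[Release]) -> timedelta:
    """Median wall-clock duration from start to finish of a release."""
    if not releases:
        raise ValueError("no releases recorded")
    seconds = [(r.finished - r.started).total_seconds() for r in releases]
    return timedelta(seconds=median(seconds))
```

The median is used here rather than the mean so that one unusually slow release does not dominate the trend; that is a design choice for this sketch, not a rule.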

Implementing flighting is another way that the platform & reliability team can improve the release process. Flighting is a technique for gradually releasing new features to a small subset of users, rather than releasing them to all users at once. This allows the team to monitor and validate the impact of the release on a smaller scale, identify and resolve issues more quickly, and make necessary changes before releasing to the full user base.

The platform & reliability team can work with the engineering and product teams to design and implement a flighting strategy that is tailored to the specific needs and characteristics of the software. They can also provide the necessary infrastructure and tooling to manage and monitor the flighting process, and provide real-time data and telemetry to help the team make informed decisions about when and how to proceed with the full release. This helps minimize the risk of release failures and ensures that new features are released smoothly and with minimal impact on the user experience.
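
One common way to pick the "small subset of users" is a percentage-based rollout that hashes the user ID into a stable bucket. The Python below is a minimal sketch of that idea under stated assumptions: the function names, the `feature:user` hashing scheme, and the 100-bucket granularity are all invented for this example, and real flighting systems typically add targeting rules, kill switches, and telemetry on top.

```python
import hashlib


def flight_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given feature.

    Hashing the feature name together with the user ID means the same
    user lands in different buckets for different features.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_in_flight(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the user should see the feature at this rollout level."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    return flight_bucket(user_id, feature) < rollout_percent
```

Because the bucket is derived from the user ID rather than chosen at random on each request, a user who sees the feature at 10% still sees it when the rollout is widened to 50%, which keeps their experience consistent while the flight expands.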

All of these play a critical role in ensuring that releases are carried out smoothly and with minimal impact on the user experience. That, of course, further enables one of the key tenets of a successful product development: frequent iterations.

Balancing Platform & Reliability efforts with product development

Unsurprisingly, dedicating a portion of the engineering team to this particular focus area means that you have to take some capacity away from feature development. Arguably, considering all points above, both are almost equal in priority. I'm frequently quoted in my team as saying "A software with multiple new features that doesn't work, is as worthless as a product that works exceptionally well, but has no useful features". Clearly, you need a balance.

One way of achieving the balance is keeping the team a part of the engineering effort (i.e., by having them be a part of the sprints in a scrum process). We already covered the fact that they have to collaborate with the product team to prioritize issues. In a good process, this results in a well-organized backlog containing a mix of both priorities.

That said, this is a two-way street. The platform and reliability team works closely with the product development team to understand the roadmap and upcoming releases. They prioritize their efforts to ensure that the existing systems can accommodate the new features and products being developed. This may involve investing in infrastructure upgrades or improvements, performance optimizations, and testing to ensure that the systems can handle the increased load and complexity.

In turn, the product development team also works closely with the platform and reliability team to understand the limitations and constraints of the existing systems. This can help them to design products and features that are optimized for the existing infrastructure and can be quickly and easily integrated into the existing systems.

Ultimately, the goal is to find a balance that allows the organization to continue delivering new and innovative products while maintaining the stability, scalability, and reliability of the existing systems. This requires a collaborative and proactive approach that involves all teams working together towards a common goal.

Team Composition

An interesting conversation is how to create this team. I'm sure there are as many approaches as there are teams themselves. What I've started with in my team is dedicating 30% of the capacity to this area. While they still contribute features, one of the key deliverables is most of the above. The team has a dedicated PM who focuses a lot on the internal improvements we can make to the platform, to the release strategy, process, pipelines, etc.

The composition of the team itself is exactly the same as the "rest" of the engineering team: a PM, a designer, a tech lead and multiple engineers. In my case, the Platform & Reliability pod is composed of 2 additional engineers, but that will change when appropriate/needed.

Keep in mind, however, that as with all things agile, there's nothing wrong with experimenting & improving as you go along, with the important caveat that you don't whiplash people. That, again, is one important argument for a good balance. That way, every individual contributor on the team always contributes to both platform and features, which means that you end up not needing a dedicated platform & reliability team...

Conclusion

It's clear that the Platform & Reliability team plays a crucial role in ensuring the stability, performance and scalability of a product, working closely with the development team and other stakeholders such as the product, design and security teams. By continuously monitoring the health and performance of the systems, and proactively fixing problems, the Platform & Reliability team enables the development teams to focus on delivering new features and functionality. Best practices for a Platform & Reliability team in an Agile process include close collaboration with other teams, regular communication with the product team to prioritize reliability-related features, and sharing metrics and insights from monitoring and testing systems to drive accountability and make data-driven decisions. The team also plays a key part in promoting a reliability culture throughout the organization and educating other teams on best practices for reliability and performance. The end goal, then, is to work towards eliminating the need for the Platform & Reliability team itself.

Working as part of a larger process & team, this is one more wheel to help build products at scale.
