SRE: An incomplete guide to cultural Narnia

Over the last several years, SRE has become a popular topic of discussion at many of the companies I have interacted with. Every organization is trying to build and implement this ideology, and they are all asking the same questions: What is SRE? Who are SREs? How do we get it? Where do we start? I, just like everyone else, have opinions on this topic. However, there is common ground between us all. SRE is not only about tooling and tech, and it’s not only about the unicorn hires you can make; it is primarily a CULTURAL shift within companies. It changes the way developers, product owners, admins, general engineers, and others interact with one another. You have to shift your operating model and how you approach building in your organization.

As a disclaimer, these are just my opinions and experiences from building organizations like this and talking to other companies that have implemented SRE or are currently implementing it. There is no prescription for building SRE. Every company will implement it at its own cadence and according to its own internal philosophies. You will, over time, build it in a way that fits your organization and internal operating model. Do not move too fast and attempt to force the model into your organization. Fail, but fail gracefully, and learn from those failures. You’ll get to your Narnia at some point.

Definitions:

Let’s start with some quick definitions of the acronyms we will be using. These are short explanations, but they give you the general outline.

• SRE - Site Reliability Engineer, an individual who possesses an interest in infrastructure and operations, with a software engineering background.
• SWE - Software Engineer. Someone who writes application/service software.
• Infrastructure SRE - A variation of SRE that works on infrastructure-related projects (monitoring, provisioning, IaaS, etc.)
• Application/Service SRE - A variation of SRE that is dedicated to a specific Application/Service group. These are the groups building the products that generate revenue for your company (hopefully).

What/Who is SRE?

The definition of SRE is quite scattered, and I don't think anyone truly knows what it means. And no, unicorns will not show up below. This is what SRE means to me:
• SRE is a core group of individuals who have a wide array of skills. They are technology generalists more than specialists. Their skill sets range across operations, infrastructure, networking, development, hardware, distributed systems, monitoring, stability, capacity planning, software engineering, etc.
• SREs are responsible for the architecture and implementation of technical infrastructure and supporting services, with a focus on the stability, security, and scalability of those platforms.
• SREs should be building standard best practices, platforms, services, infra, etc. that reach the global organization.
• SRE is not just about the technology. SRE is a mindset, thought process, and cultural shift.
• SRE shouldn't be made mandatory for everyone in your company. Teams will have the choice to fund SRE support if they need/want it.

Areas of Responsibility

Technology has so many different aspects that it may seem hard to narrow down what SREs should actually be spending their time and energy on. Since these individuals may not be directly building public-facing products or features (depending on your company, of course), what exactly will they be doing? The list below is by no means exhaustive, but it covers the general areas where you should start to focus your SRE organization. Without this underlying infrastructure plumbing being properly built, monitored, and scaled, it will be difficult to preach this new cultural mindset to team members outside SRE. The items below may seem very infrastructure heavy. They are and they aren't. Many of them can be split between Inf and App SRE, where one is the primary producer and the other is the primary consumer.

• Monitoring
• Configuration Management & Automation
• Infrastructure Services / Networking / Platforms / Architecture (Including Hardware and Gen Compute)
• Infrastructure Tooling & Capacity Planning
• Big Data / Data Warehousing / Data Analytics
• Documentation and Runbooks
• Incident Response & Incident Management Process

**Is SRE a single team or multiple teams?**

There is no prescription, and results will vary depending on multiple factors. However, I recommend splitting SRE logically into multiple teams if you have the staff to accommodate it. You can start small with a generalist team that covers multiple areas and eventually break it into smaller, more manageable pieces. Below are those areas:

• Infrastructure (Hypervisor, Storage, Operating Systems, Containers, Automation, SDN)
• Observability (Monitoring, Telemetry, Event Correlation, Trend Analysis, IR&IM, etc)
• Tooling (Custom tools, Config Management, Developer Experience tools, etc)
• Services (Databases, Message Queues, Orchestration, Micro-services, etc)
• Apps (Mostly app side knowledge and ability to support / troubleshoot)

Even though the teams are split in some sense, the reporting chain must remain centralized throughout the organization. Decentralizing, in my opinion, will break the effectiveness and communication of the team and result in an unsuccessful SRE implementation. SREs should report to like-minded individuals whose mission is to push the SRE agenda and philosophies. Agenda is not the only reason: a central hierarchy also allows for governance around tooling, best practices, and general technical/architectural guidance. From there you can allow those folks to cross-pollinate in other groups/teams and spread those standards across the board. A breakdown in structure and communication will inevitably cause a deterioration in the mission.

Another reason for centralization is that it keeps your SRE hires from becoming a dumping ground for operational work the app teams do not want to do or have pushed off. I have seen this happen in the past, and it discourages the new hires and isolates them. They do not have a mission or a central team to work with, and they end up building things on their own that only fit the needs of whoever they are supporting instead of seeing the bigger picture. The Application SWEs should still be doing operational work, since they are the ones who created the need for it. The Application SRE can help offload some of that work and learn through the central chain which tools others are using to offload it. Operational issues can quickly turn into error budgeting for the application teams.
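To make the error budgeting point a little more concrete, here is a minimal sketch of how an error budget is commonly calculated from a service-level objective (SLO). Everything in it is an assumption for illustration: the function name, the 99.9% target, and the request counts do not come from any specific team or tool.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the success-rate objective, e.g. 0.999 for "three nines".
    A negative result means the budget is overspent.
    """
    # The budget is the number of failures the SLO tolerates for this traffic.
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("no error budget: need traffic and an SLO below 100%")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical numbers: 1,000,000 requests at a 99.9% SLO allow about 1,000 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget is left")
```

A number like this gives the Application SWEs and the Application SRE something shared to point at when deciding whether operational work should take priority over new features.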

Your SREs will likely be split into two disciplines: Application SRE and Infrastructure SRE. This allows you to cover both sides while pushing the same mission throughout all the teams. Infrastructure SREs are out there, and there are plenty of them. I have seen the role come as a natural evolution of DevOps, System Engineers, System Administrators, etc. You should focus on building your internal talent pool first and then go outside for hires.

For Application SREs, I like to see some come from folks embedded in the application team. Having some of the SWEs on the team volunteer and shift over to SRE work helps speed up the process for that specific application. You can leverage their knowledge of the code base and have them learn the centralized mission and tooling to bring down to their stack. These individuals can also train the central team on the app stack.

Not every team in your company will need dedicated SRE support. This is where the central organization comes in handy. Dedicated resources are focused on building guidelines, standards, services, and platforms that teams can consume and operate autonomously in your environment. A well-publicized central team also lets these other teams know where to go when they need assistance or would like some recommendations on their technology stack.

Application/Service SRE vs Infrastructure SRE

Application SRE
• Embedded for Application/Service Teams
• Architecture guidance for new services and infrastructure
• Design and Implementation for modern technologies
• Support within the Application/Service Team
• Automation for apps and services for the team
• Automation for operational work
• Monitoring / Metrics for Application/Service Team
• Benchmarking and performance assistance for code
• Infrastructure deployment and architecture for applications and services
• Write documentation & runbooks for alerts issued by the application/service stack

Infrastructure SRE
• Build & Management of “Plumbing” Technical Infrastructure (provisioning, OS, DNS, DHCP, networking, central auth, etc.)
• Automation of infrastructure services (telemetry, monitoring, log aggregation, config management, anomaly detection, orchestration, etc.)
• Build and Management of consumable services and tools (message queues, databases, distributed compute farms, API services/integrations, containers, etc.)
• Infrastructure as Code
• Implement Global IR&IM process
• Support for Application SRE teams
• Architectural guidelines and best practice documentation

Organizational Layout

Org structures, whether we like to believe it or not, really help the flow of communication throughout the team. They allow team leaders to form a single mission and have that mission carried out by their team members. I mentioned previously that I believe this organization must be centralized to be successful. Below is a high-level overview that shows the separation between App and Inf SRE.
The leads can handle multiple teams and disciplines. This can shrink and scale as you see fit within your organization. App SRE is straightforward. The Inf SRE could be embedded across existing infrastructure teams or form a new team focused on a specific project.

Example SRE Interaction

Below is an example interaction between an application team (this one runs a Ruby app) and an SRE hierarchy. You will see a team called “Frontline Support” as the initial interaction. This team is not required but is a nice-to-have. It helps in any environment, big or small, and can offload a significant amount of the operational workload while keeping a global view of the issues coming in. The duty of Frontline Support is to follow runbooks for alarms flowing in from the monitoring systems. It is similar to a traditional NOC but with more expertise across the stack it is watching. It also allows for trend analysis and brings data to conversations around recurring issues or bugs in the system.
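The Frontline Support flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Alert` type, the `RUNBOOKS` table, the service and alert names, the runbook path, and the threshold of three are all hypothetical, and a real setup would pull alerts from whatever monitoring system you run.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    name: str


# Hypothetical runbook index: (service, alert name) -> runbook location.
RUNBOOKS = {
    ("ruby-app", "HighErrorRate"): "runbooks/ruby-app/high-error-rate.md",
}


def triage(alert: Alert, runbooks: dict) -> str:
    """Frontline Support follows a runbook if one exists, otherwise escalates."""
    key = (alert.service, alert.name)
    if key in runbooks:
        return f"frontline: follow {runbooks[key]}"
    return f"escalate: page on-call for {alert.service}"


def recurring(alerts: list, threshold: int = 3) -> dict:
    """Count alerts per (service, name) and keep those seen at least `threshold` times."""
    counts = Counter((a.service, a.name) for a in alerts)
    return {key: n for key, n in counts.items() if n >= threshold}


known = Alert("ruby-app", "HighErrorRate")
unknown = Alert("ruby-app", "DiskFull")
print(triage(known, RUNBOOKS))
print(triage(unknown, RUNBOOKS))
print(recurring([known, known, known, unknown]))
```

The `recurring` helper is the trend-analysis piece: it turns a stream of alarms into a short list of repeat offenders to bring to the application team.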

I do recommend using this team as a starting point for junior-level team members. It allows them to learn the system quickly and buddy up with more seasoned members who are either on this team or have the pager for the week. It is a great way to train and onboard your technical employees. Nothing works better than showing them what is broken.

Closing

To sum up this blog post, everyone’s journey will be different. There are quite a few books and examples out there, but please use them only as ideas to form your own opinion on how these organizations should work. There is no prescription, and nothing is ever perfect. I have left out quite a bit of information in this post, but I am always interested in having conversations about this topic and others like it. There is one consistent theme you’ll see across the board, and that’s shaping the culture. Focus on your culture and hire great talent along with great leaders. Good luck and Happy Hacking.