I recently listened to one of my favorite podcasts, Datanauts (here), and the hosts were talking with Rob Hirschfeld (@zehicle). Rob was on to talk about what an SRE is. He talked about how IT operations is just as important as development. Google recognized this importance by creating a new title, Site Reliability Engineer. Google even created a new VP of Operations, Ben Treynor (@btreynor), to give the SREs more backing to say “no” to the development team when it tried to push through a change.
Another point discussed during the podcast was that an SRE should spend only 50% of their time on production-related tasks. The remaining time should be devoted to writing code and project work. This helps keep the team innovative by writing code to automate more and to help reduce technical debt.
Hearing this blew my mind! Work on production only 50% of the time? Work on tickets only 50% of the time? Get outta here. My roles have primarily been in a systems administrator function, and the mindset about automation has been, “Fit it in when you can”. That means trying to juggle production issues while quickly scripting something to resolve them at the same time (Hear that? It’s technical debt calling). Like most people, I find it very difficult to do two things well at the same time. Changing that mindset and restructuring your workload to allow more automation means faster recovery times, higher uptime for production, less technical debt, and code that removes meaningless tasks for the business. (Doesn’t this sound like a “full stack” engineer?)
I was hooked!! I needed to find out more about this role and its function. I found a great YouTube video by Melissa Binde, who’s the Director of Site Reliability Engineering at Google. She talks about hiring the right engineers, who know architecture but also know code. Melissa goes on to confirm the 50% rule: to keep engineers happy, you give them project work and time to code. Melissa also introduced a new concept to me, Error Budget. There is no perfect system with an uptime of 100%, not even at Google. Error Budget “provides a common incentive that allows both product development and SRE to focus on finding the right balance between innovation and reliability”. If a production service has an SLO (Service Level Objective) of 99.9%, that service has an Error Budget of 0.1%. Product Managers and developers can keep pushing out features until the service has used up that 0.1% budget. At that point, new features are stopped. This budget is monitored and reviewed quarterly.
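To make the Error Budget arithmetic concrete for myself, here is a rough Python sketch. It is only my own illustration, not how Google tracks it: the function names, the 90-day "quarter", and the idea of counting failed requests against total requests are all assumptions on my part.

```python
def error_budget(slo_percent: float) -> float:
    """Return the error budget as a fraction, e.g. an SLO of 99.9 gives ~0.001."""
    # No service has a 100% SLO, so 100 is rejected along with nonsense values.
    if not 0 < slo_percent < 100:
        raise ValueError("SLO must be between 0 and 100 (exclusive)")
    return (100.0 - slo_percent) / 100.0


def allowed_downtime_minutes(slo_percent: float, window_days: int = 90) -> float:
    """Minutes of full downtime the budget allows over the window.

    The 90-day default is an assumed stand-in for a quarterly review period.
    """
    return error_budget(slo_percent) * window_days * 24 * 60


def budget_remaining(slo_percent: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent).

    Measuring the budget in failed requests is an assumption for this sketch.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(slo_percent) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_release_features(slo_percent: float, total_requests: int, failed_requests: int) -> bool:
    """New features keep shipping only while some budget is left."""
    return budget_remaining(slo_percent, total_requests, failed_requests) > 0
```

With a 99.9% SLO, allowed_downtime_minutes(99.9) works out to roughly 129.6 minutes over 90 days. If a service handled 1,000,000 requests and 250 failed, about three quarters of the budget would be left and features could keep shipping; at 1,200 failures the budget would be overspent and can_release_features would return False.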
Talk about another complete mind shift! Common ground for both Prod and Devs? WOW! You need to have good metrics and monitoring to get these concrete numbers. If not, I can see how this budget could get skewed.
What I learned is that operations is hard no matter what organization you work for. Operations wants to automate away those meaningless tasks and keep a solid, working environment. Devs want to roll out new features into production. Development and operations need a common goal, and it seems like Error Budget is a possibility. To achieve all this, a mind shift needs to happen. It sounds like Google has laid down some good pieces of a path we can all follow.
Melissa Binde’s discussion at GCP 2017:
Ben Treynor SRECon 2014