Operations
Site Reliability Engineering at Google – Christof Leng
The rule:
- If the service is within its SLA, launch away.
- If the service is not within its SLA, launches are frozen.
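
The rule above can be written as a simple gate. This is a minimal sketch, not Google's tooling: it assumes the SLA is expressed as an availability target, and the names `ServiceStatus` and `launch_decision` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    """Hypothetical snapshot of a service (all fields are assumptions)."""
    name: str
    sla_target: float  # promised availability, e.g. 0.999
    measured: float    # availability measured over the same window


def launch_decision(status: ServiceStatus) -> str:
    """Return "launch" if the service meets its SLA, otherwise "freeze"."""
    if status.measured >= status.sla_target:
        return "launch"
    return "freeze"


if __name__ == "__main__":
    healthy = ServiceStatus("frontend", sla_target=0.999, measured=0.9995)
    degraded = ServiceStatus("backend", sla_target=0.999, measured=0.9950)
    print(healthy.name, launch_decision(healthy))    # frontend launch
    print(degraded.name, launch_decision(degraded))  # backend freeze
```

The point of the binary outcome is that the launch decision follows from a measurement rather than from negotiation between the dev and SRE teams.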
Fixes:
- Common Staffing Pool: one more SRE = one fewer developer
- SRE hires only coders
- 50% cap on Ops work (toil)
- Keep the dev team in the on-call rotation
- Excess operations load (tickets, on-call, etc.) always gets assigned to the dev team
- SRE Portability and the nuclear option
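
The 50% cap and the overflow rule from the list above can be combined into one small calculation. This is a sketch under stated assumptions: the cap is taken as a fraction of the SRE team's total working hours in a period, and the function name `split_ops_load` and its parameters are hypothetical.

```python
def split_ops_load(ops_hours: float, sre_capacity_hours: float, cap: float = 0.5):
    """Split operational work between SRE and the dev team.

    ops_hours: total ops work (tickets, on-call, etc.) in the period.
    sre_capacity_hours: total SRE working hours in the same period.
    cap: maximum fraction of SRE time spent on ops work (assumed 50%).

    Returns (hours handled by SRE, hours overflowing to the dev team).
    """
    if ops_hours < 0 or sre_capacity_hours < 0 or not 0 <= cap <= 1:
        raise ValueError("hours must be non-negative and cap within [0, 1]")
    sre_limit = cap * sre_capacity_hours
    sre_share = min(ops_hours, sre_limit)
    overflow_to_dev = ops_hours - sre_share
    return sre_share, overflow_to_dev


if __name__ == "__main__":
    # 30 hours of ops work against 40 SRE hours: SRE takes 20, dev takes 10.
    print(split_ops_load(30, 40))  # (20.0, 10.0)
```

Anything above the cap lands on the dev team, so a service that generates too much operational work is felt directly by the people who can fix it.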