Rich Lafferty is a Staff Site Reliability Engineer at PagerDuty, where he builds platforms in the clouds to make PagerDuty reliable and his development teams happy. He calls Dartmouth, Nova Scotia, home with his wife and two tortoiseshell cats. In his copious spare time, he can be found enjoying craft beer, staring at a wall, or playing bass in one of PagerDuty’s two office bands.
What is your specialty/area of expertise?
The short answer is probably “reliability engineering”, but it’s been a winding path to get there.
My career history is basically “production web operations”. I’ve been at this for a couple of decades, and back when companies still called us system administrators, I was a sysadmin. I tried out management for a while, leading the SRE teams at a SaaS company, but decided I was happier in an individual role.
These days my title is “Site Reliability Engineer”, which means different things at different places, but I think the common parts are twofold: first, that the role involves software and systems development, so programming skills are necessary, and second, that it’s about enabling development teams to operate their software in production, rather than doing it for them. (I still insist that “devops” is “developers and operations working together” and not a role, so I’ll go on the record that I am not a “devops engineer”!)
These days, as a staff engineer, I’ve been focusing a lot on how to build reliable systems -- and by systems, I mean combinations of technology and people. We’ve got some great SREs and SRE teams at PagerDuty to focus on the engineering work, and that’s given me the opportunity to step away from the hands-on coding work and think about it from a sociotechnical perspective -- how do you get a few hundred people with a variety of experience and conflicting priorities to successfully design and operate a complex software system?
What got you interested in reliability engineering?
It all kind of came down to what looks like a simple question: why do we keep having incidents? Why does software fail? Part of that was grounded in a basic “why do I keep getting paged?”, but bigger than that, why do all tech companies keep running into this?
Like a lot of people who came to the industry from the operations side, I don’t have a computer science background. I have a sociology degree, focused on industrial and organizational sociology. Mind you, that was a long time ago and was focused a lot on labour, but it planted a seed about the people side of software engineering that’s always pulled at me.
One of the fun things about working at PagerDuty is that our customers are like me: they’re software engineers trying to keep their systems running. And we’re pretty good at it, especially around responding to incidents, but that got me digging in further to figure out what everyone else knows about incident response, and that was a big rabbit hole! Through software-focused folks like John Allspaw and J. Paul Reed, I started learning about a lot of established “industrial” safety thinkers like Sidney Dekker, Erik Hollnagel, David Woods, and Richard Cook, and basically went through that familiar experience of discovering that everything I knew was, well, if not wrong, at least not on a very solid foundation. There is a lot of non-intuitive stuff in there, about the behavior of complex systems and people’s role in them. I love it. It’s a chance to work on big, complex problems at the intersection of people and technology, without being in a traditional management role with direct reports and hiring and compensation and all the things I didn’t enjoy about being in management.
What is your team currently working on?
My main short-term focus right now is improving our incident review or “postmortem” process at PagerDuty. Some of that is just updating -- the process that worked when we had 50 engineers just doesn’t work when we have 300 and growing. But it’s also about including learnings from other companies and industries and pivoting from the perspective of prevention to that of adaptability. Prevention is rooted in the idea that it’s possible to not have failures, that you can engineer them out. Identify the root cause, eliminate it, and you’re good. But that doesn’t work. Failure in complex systems doesn’t have a causal chain. Instead, it’s about focusing on the ability of the system -- people and software -- to adapt to failure, with a focus on learning rather than fixing. It’s not intuitive and generates a lot of pushback, so we’re doing this one step at a time, but I’m excited about the opportunity to reboot this and the future possibilities.
Other than that, the main focus of “my team” right now -- by which I mean “me” -- is proposing and building a team to focus specifically on reliability, which is exciting because it’s really only something you can do at a particular scale.
What are PagerDuty’s biggest challenges? The biggest challenges in your role?
As an established, reasonably large, and publicly traded company, PagerDuty has plenty of challenges, but I’d say the biggest challenge I see in my role and area of influence here is twofold. First, how do you balance functional work -- new features, new products, and so on -- with the core reliability, availability, performance, and quality work that our customers demand? The bottom-line feature of PagerDuty is that it is up when customers’ systems are not. That’s necessary, but it’s not sufficient. We also need to make sure that we focus on providing features that make it easier for our customers to keep their systems reliable. Figuring out how to balance that in a repeatable process is hard.
Second, how do you get an organization to design and build reliable systems, without every engineer, engineering manager, and product owner having to be a systems expert? We need to build reliability into PagerDuty’s organizational operating system.
What have you learned that surprised you the most while working at PagerDuty?
This is a tough one. I’ve learned a ton! One thing that surprised me a lot is that I thought I understood a lot more about how complex systems behave than I did. Definitely a “you don’t know what you don’t know” situation.
I think one thing that has surprised me the most is how hard it is to assess where you’re at in terms of operational maturity, how good you are at running things. We’ve got a lot of processes and tools that probably put us in the 95th percentile, like continuous deployment, containerization, and our on-call and incident processes. There’s always room for improvement, but especially for folks at the beginning of their careers, it can be easy to focus on the gaps in what we have without taking credit for how far along we are. Continuous improvement is critical, but so is knowing where you are in the maturity model.
What advice would you give to someone looking to get into reliability engineering?
For folks that have been at this for a while in an engineering role and want to learn more about reliability, the best advice I have is to get a comfortable reading chair. There’s so much published out there from other industries, and often the standard practices in software are years or decades out of date. For example, finding a root cause with the “Five Whys” method, that’s from the 1930s and was popularized by Toyota in the 1970s, but it feels like software only discovered it in the early 2010s. A lot has happened since 1970! The Learning from Incidents in Software community is a great starting point.
For people at the beginning of their career who want to get into SRE generally, though, I have a hard time advising. The path I took simply isn’t available anymore. It’s very hard to move from an IT-focused system administrator to a SaaS site reliability engineer now. Programming skills are way more important than they were when I started, and breadth of experience is critical too. I think the most important part is curiosity -- even if you’re on a team that’s shipping product features, look under the hood. How does this framework work? How does the database work? How do containers work? How did Amazon design AWS or Google design GCP? The best SRE is someone that understands what’s going on underneath the abstraction. And, of course, how the people are involved.
What technology and trends are you most excited about?
I find it hard to get excited about technology anymore. The state of software operations is much, much better than it was even ten years ago, with the abstractions you get from cloud, containers, and so on, and I expect that’s going to continue to improve, and there will still be a role for systems engineers to build those abstractions and provide platforms for developers. But it’s boring in the best possible way.
But in terms of trends, I think it’s the move from designing software systems to designing sociotechnical systems, that you can’t eliminate people and organizations from the system and so you have to include them in the design. We’re getting past the idea that people are the problem, that the best system is one where people don’t have to be involved. Instead, people are the thing that keeps the system running, and I think there’s a certain humanist element of that which I really appreciate. It turns out that computers still need us! How great is that?