A lot of examples of how agile approaches can improve processes focus on development and change, but the principles of focusing on value, collaboration, and continuous learning are just as important in operational work. Sometimes it can be harder to see how to apply those principles here because ops work is frequently siloed in teams and calcified in “the way we’ve always done it”. But by taking a step back and considering the wider system this work is part of, you can make changes that have a positive impact across the whole organisation.

My last post considered how to better understand operational tasks and uncover options for improvement. Here, I’ll cover how I turned one onerous task into something far better. The old version took days of manual effort every 6 months, frustrated everyone involved, and after all that effort still only produced a result with a low degree of confidence. The new version had much higher engagement, supported the goals of multiple teams, and was automated to the point that it took under an hour’s effort every quarter.

Call Tree Tests

This is actually one of the most important parts of resilience planning: there are few issues that can’t be overcome when we’re able to talk to each other. But because “it’s always been done this way”, the method most companies follow is just copied and pasted from one organisation to the next, and has not kept up with changing needs and technology options.

Traditionally, this process is:

  • Incident management team contacts the executive team
  • They contact their direct reports…
  • … who contact their direct reports
  • and so on until everyone has received the message

Then there’s some vague way of checking that everyone was contacted, either asking managers to attest that their team all got the message, or asking staff to speak up if they weren’t contacted. Neither is the most reliable of metrics.

There are plenty of issues with this approach that cause it to drag on and on:

  • Teams use an inconsistent mix of phone calls and texts, and the message changes slightly as it passes through each layer
  • People leaders are on holiday, and nobody is covering this part of their role
  • People are in meetings, or are busy, and prioritise that work over “just a test”
  • Chasing for responses or attestation leads to quite paternalistic behaviour: “you text this person now, then email me to confirm they received it”
  • There’s no data to back up the attestations, so you’re relying on folk being honest about completing a process nobody enjoys
  • Bringing responses together, waiting for attestations, and chasing folk for these can take days or weeks, usually with it all tracked in Excel…

Some places try to fix these symptoms by adding tighter rules, like all messages must have a response within an hour, or running tests in the evening to be “more realistic” (or more honestly, to make sure people can’t say they were in meetings). These added rules are all an attempt at extracting more value, but for what we’re trying to achieve with these tests, they’re a great example of trying to make the wrong thing right-er.

This is key when it comes to improving operational processes: are you actually working to improve something that’s fundamentally flawed, or are you focusing on the real goal?

Looking at the system components

Our goal is to prove that we can contact everyone, using a method of communication that’s not tied to company systems, is widely accessible by everyone on personal devices, and is simple to use.

This rules out email, Slack, WhatsApp etc. and points us towards simpler tools. Everyone here has a mobile phone, so basic phone calls and texts meet those criteria. They’re available on every phone (smart or basic), alerts are usually pushed without the need to go into an app, and as a carrier service they’re isolated from the company’s systems.

Managers mostly had mobile numbers for their teams, but not for everyone, and a lot of staff didn’t have their managers’ numbers. However, we realised that the People team did have contact details for everyone in the HR Information System. They owned the staff privacy policy, so we had to make sure that we had the right to use this data for this purpose, and they helped with this. They also had a problem of their own: getting folk to periodically log in and check that the details held in their files were accurate. That goal overlapped with ours, so they were happy to support work that was mutually beneficial.

Looking at the communication process itself, cascading calls was once the only way to effectively reach everyone, but advances in both the availability and possibilities of technology over the last 20 years have opened up a huge range of options. There’s a hint at this in how managers have moved to group text messages instead of calls in recent years (looking into ‘weak signals’ like this can often highlight emerging options for improvement). These days, the cost and availability restrictions that once applied to these systems are gone, and companies like Twilio and Nexmo have lowered the barrier to entry to just some basic coding skills and a credit card (less than £10 per quarter in our case, a drop in the ocean compared to what we’ve saved). These costs, coupled with the ease of using APIs, meant that this was a much cheaper option for us than buying an off-the-shelf commercial tool.

Now we’ve considered the goal, identified others we can work with, and seen how some limitations have changed to open up new options. With this information, we can look at making the process work within these new limits, rather than trying to force others to fit within the old constraints. Those paternalistic instructions, chasing folk to send text messages and emails, were big red flags that this process was fundamentally flawed, and the ‘weak signals’ of people leaders using new technologies and different approaches for their own teams helped show a way forward that would be embraced.

Fitting it all together

As a reformed developer, I found it pretty straightforward to pull all the contact details we needed out of the HR Information System using an API and then send them to an SMS gateway using another API, which also sent responses back to our database. This meant we could reach everyone in one go, rather than cascading through layers, immediately removing a burden from people leaders.
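To make that flow a little more concrete, here is a minimal sketch in Python of the "filter the contacts, then send one message to everyone" step. It is not the code we ran: the `StaffContact` fields, `select_recipients`, `broadcast`, and the `send_sms` callable are all hypothetical names, and `send_sms` stands in for whatever call your SMS gateway's client library provides.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class StaffContact:
    """One row of contact data as it might come back from an HR system (assumed shape)."""
    staff_id: str
    name: str
    mobile: str
    absent: bool = False  # on holiday or otherwise away, according to the HR system


def select_recipients(contacts: Iterable[StaffContact]) -> list[StaffContact]:
    """Keep staff who are present and have a mobile number on file."""
    return [c for c in contacts if not c.absent and c.mobile.strip()]


def broadcast(
    contacts: Iterable[StaffContact],
    message: str,
    send_sms: Callable[[str, str], None],
) -> dict[str, bool]:
    """Send the same message to every selected recipient.

    `send_sms(to_number, body)` is a placeholder for the gateway call.
    Returns a map of staff_id -> whether the send call succeeded.
    """
    results: dict[str, bool] = {}
    for contact in select_recipients(contacts):
        try:
            send_sms(contact.mobile, message)
            results[contact.staff_id] = True
        except Exception:
            # One failed send should not stop the rest; record it for follow-up.
            results[contact.staff_id] = False
    return results


if __name__ == "__main__":
    # A fake gateway that just prints, so the sketch runs without any account.
    staff = [
        StaffContact("s1", "Asha", "+447700900001"),
        StaffContact("s2", "Ben", "+447700900002", absent=True),
    ]
    print(broadcast(staff, "Comms test: please reply OK", lambda to, body: print(to, body)))
```

Passing the gateway in as a function keeps the selection logic testable without sending real texts, and makes it easy to swap providers later.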

So our process became:

  • Warn everyone in advance that a test is coming so they should check their contact details in the HRIS
  • Click a button to import all the staff contact details, excluding everyone on holiday or absent
  • Type in a message, and click another button to deliver it to everyone
  • Refresh the response log until everyone has replied (typically within 4 hours)
  • Check in with the handful of individuals who haven’t responded (because there’s always someone who’s not updated their absent days or has forgotten their phone)
  • Save a copy of the response log as a PDF to be filed away as evidence for governance reporting
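The response-log steps above could be sketched along these lines. Again, this is an illustration under assumptions rather than our actual tool: `summarise_responses` and the shape of its inputs (a map of recipients and a list of `(staff_id, received_at)` replies) are made up for the example.

```python
from datetime import datetime


def summarise_responses(
    sent_at: datetime,
    recipients: dict[str, str],
    replies: list[tuple[str, datetime]],
) -> dict:
    """Summarise a test from the reply log.

    recipients: staff_id -> name for everyone the message was sent to.
    replies:    (staff_id, received_at) pairs as logged from the gateway.
    """
    first_reply: dict[str, datetime] = {}
    for staff_id, received_at in replies:
        # Ignore replies from unknown senders or from before the test started.
        if staff_id not in recipients or received_at < sent_at:
            continue
        # Keep only the earliest reply per person.
        if staff_id not in first_reply or received_at < first_reply[staff_id]:
            first_reply[staff_id] = received_at

    return {
        "total": len(recipients),
        "responded": len(first_reply),
        # The people to check in with individually.
        "outstanding": sorted(set(recipients) - set(first_reply)),
        # Responders ordered from quickest to slowest.
        "by_speed": sorted(first_reply, key=lambda s: first_reply[s]),
    }
```

The `outstanding` list is the handful of people to follow up with, and the counts are what would go into the governance report.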

A single source of data, a single means of communication, a consistent message, and centralised response logs. From my point of view, the new process was a vast improvement in terms of the time I spent on it, and my confidence in the status reports I issued.

We reduced the burden on people leaders, and got people to log into the HR Information System more frequently to check their data was accurate, so the improvements were felt across a large number of teams.

And there was one surprising side effect: Some people actually started enjoying the quarterly communication tests! I gamified it a little, giving a shout out to the people who replied fastest, first team to complete, etc. and this really got some people engaged. Now when they find out a test is coming they keep an eye on me, waiting for the message, and then chase me down afterwards to find out if they were fastest. That’s a complete reversal from where this all started.

The busy-ness trap

It would have been easy to ignore this task. It was only run every 6 months (we moved to quarterly after these improvements were made), it wasn’t directly improving anything on the product (although the messaging process and chasing did cause some disruption), and as far as the status reports showed, it worked (a simple ‘completed’ metric like a traffic light masks a lot).

That’s the tempting part of “busy” work: the routine is comfortable, and a slow rate of decline in performance is tolerable, right up until the point it’s not and it becomes an emergency.

For me, the high effort in running it, the low confidence in the results, and the poor experience for the teams meant that it wasn’t going to be sustainable for much longer, even if this particular item wasn’t high up on any lists.

Taking a step back, and focusing on the value we wanted instead of the value we could extract from a flawed process, meant we found a way to reduce busy-ness not just in running the process, but across the whole company.

That is how agile principles can help you industrialise operational processes, as well as development.