Our support program has taken many forms since its inception two years ago. This post highlights some key takeaways from overhauling our 24/7 support program, and how we went from noisy alert fatigue to fairly forgettable rotations.
First, let’s dig a little deeper into what it means to have a 24/7 support team, and why you may consider re-evaluating your current support program to get more out of your relationship with Culture Foundry. This article covers how the support rotation has improved for both our Culture Foundry crew and our clients.
At Culture Foundry we have two tiers of support. The first tier includes the majority of our team, from CEO to Project Managers to Developers. All our full-time employees pull their weight in ensuring the success of this vital program. This means that each crew member can expect to be on call for one week out of every eleven.
Our Tier 2 rotation is the backstop for all issues that can’t be resolved by Tier 1. It includes the CTO, the director of DevOps, and a back-end-focused developer. These three take turns, each rotating in every three weeks to stand guard as the strong arm of the support program. In other words, if they can’t fix it, it’s probably out of our hands (think AWS downtime, in which case most of the internet is on fire).
Each week one member of the crew steps up to the plate to be the on-call first responder for that week (12pm Wed – 12pm Wed). This means phone ringer on (*gasps in millennial*), within cell or wifi service, and prepped with the keys to the kingdom.
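To make the rotation concrete, here is a minimal sketch of how a weekly Wednesday-noon handoff could be computed. The roster names and the start date are made up for illustration; only the eleven-week cycle and the 12pm Wednesday handoff come from the description above.

```python
from datetime import datetime, timedelta

# Hypothetical roster: placeholder names for an eleven-person Tier 1 rotation.
ROSTER = [f"crew-{i}" for i in range(1, 12)]

# Assumed anchor: a Wednesday at 12:00 on which ROSTER[0] began a shift.
ROTATION_START = datetime(2024, 1, 3, 12, 0)  # January 3, 2024 was a Wednesday


def on_call(at: datetime) -> str:
    """Return the Tier 1 responder on call at the given (local) time."""
    # Whole weeks since the anchor; floor division also handles earlier dates.
    weeks_elapsed = (at - ROTATION_START) // timedelta(weeks=1)
    return ROSTER[weeks_elapsed % len(ROSTER)]
```

For example, `on_call(datetime(2024, 1, 10, 11, 59))` still returns the first crew member, and one minute later the shift passes to the next person on the roster.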
A perfect support program would mean this is pretty much just a precautionary process, but that wasn’t the case when we began refining our support program last year. At that time we were plagued by alert fatigue: alerts firing when they shouldn’t, alerts for websites we didn’t hold support contracts for, and basic monitoring services that couldn’t tell us whether an alert was relevant. This is why we set about restructuring and revamping our support program.
Since overhauling this system, we’ve seen support calls drop off, and being on call has become a mostly quiet duty shared by the whole team. All said and done, here are five takeaways from setting up and iterating on our support program:
1. Eliminate Alert Fatigue
There is nothing like being woken up at 3am for no reason other than a monitor firing on a split-second blip in uptime, or an outdated monitor that is no longer needed.
When revamping our support program, our first step was an intensive audit of all our support systems. Were we still monitoring clients whose support service level had dropped? Did we have 24/7 monitors set up for ex-clients? The audit spanned service contracts, internal systems monitoring, invoicing, and client support expectations, and it proved valuable to our team. This alone allowed us to eliminate many monitors that we no longer needed, cutting down on the number of times our support staff are woken unnecessarily during the night.
We also cut down on alert fatigue by implementing new monitoring with New Relic, which lets us check a site’s status from multiple locations. By monitoring a website from both an east coast checkpoint and a west coast checkpoint, we can confirm that the site is down, down, and not just unreachable from one location. As a bonus, this means that for our international clients, we can also monitor from specific areas of the world.
As the saying goes, an ounce of prevention is worth a pound of cure. Another major move to cut down on possible downtime was to reinforce and rebuild our infrastructure. Consolidating our sites into more robust, secure, and manageable infrastructure has helped greatly to prevent the instability that can occur with external hosting providers.
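The idea behind multi-location checks can be sketched in a few lines. This is plain Python illustrating the decision logic only, not New Relic’s API (New Relic handles this inside its own alert configuration); the location names and the `min_failures` parameter are assumptions for the example.

```python
from typing import Optional


def site_is_down(results: dict[str, bool], min_failures: Optional[int] = None) -> bool:
    """Decide whether to page someone, given one check result per location.

    `results` maps a location name to True if the check passed there.
    By default every location must fail before the site counts as down.
    """
    if not results:
        return False  # no data is not the same as confirmed downtime
    failures = sum(1 for passed in results.values() if not passed)
    required = len(results) if min_failures is None else min_failures
    return failures >= required
```

With this rule, a failure seen only from the east coast checkpoint stays quiet, while a failure seen from both coasts pages the on-call responder.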
2. Set up all troubleshooting systems from the perspective of half-asleep support staff
Why is it that websites always seem to have blips in uptime in the middle of the night? The last thing you want to do is... ummm, be awake, let alone struggle half-asleep to troubleshoot a failed monitor.
To make this a smoother experience for our staff, we implemented an individual runbook for each monitor. It was imperative that we not burden our staff with scrolling through three pages of irrelevant documentation to find the troubleshooting steps for that specific monitor.
Getting runbooks to the right person was made possible by using New Relic’s field inputs and configuring custom alert formats in our OpsGenie integration to send those resource links to Slack for quick and easy reference. An alarm goes off, you ack it, open Slack, and there is the link to that specific monitor’s runbook.
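As a rough illustration of the one-runbook-per-monitor idea, the sketch below attaches the matching runbook link to an alert message. The monitor name, the URL, and the message layout are all hypothetical; in practice this mapping lives in the monitoring and alerting tools’ own configuration rather than in a script like this.

```python
# Hypothetical mapping from monitor name to its dedicated runbook.
RUNBOOKS = {
    "example-client-homepage": "https://wiki.example.com/runbooks/example-client-homepage",
}


def format_alert(monitor: str, client: str) -> str:
    """Build a short alert message that leads the responder straight to the runbook."""
    # A missing runbook is called out loudly so the gap gets fixed after the incident.
    runbook = RUNBOOKS.get(monitor, "NO RUNBOOK ON FILE")
    return f"[ALERT] {client}: {monitor} is failing\nRunbook: {runbook}"
```

The point of the design is that the half-asleep responder never has to search: the link in the message is already the right one for that monitor.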
3. Set reasonable expectations with staff and clients
We are human, and that means that life is still happening while you’re on call. After some confusing moments, we realized we needed to reset expectations with our support team and our clients on how quickly fired alerts are addressed. The last thing we need is to have people checking phones while driving or jumping out of the shower to acknowledge an alert.
So, using the whole team’s input and setting up proper escalation paths in OpsGenie, we made sure team members could breathe easy knowing that there is a buffer in expected response time. Setting expectations with clients through clear documentation of resolution times also helps guard against misunderstandings for both basic and professionally monitored clients.
Another way we sought to better prepare our team was to implement an onboarding process for new staff joining the support team. This meant that we went from jumping in headfirst to having an initial shadow rotation to get the new team member’s feet wet. We also created a resource portal with better tutorials for our systems and outlined processes and expectations.
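An escalation path with a response buffer might look something like the sketch below. The timings and target names are invented for illustration and are not our actual policy; OpsGenie configures this through its own escalation settings, so this only models the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes the alert has gone unacknowledged
    notify: str         # who gets notified at that point


# Assumed policy: Tier 1 gets a buffer before anything reaches Tier 2.
POLICY = [
    EscalationStep(0, "tier1-on-call"),
    EscalationStep(15, "tier1-on-call (repeat)"),
    EscalationStep(30, "tier2-on-call"),
]


def notified_so_far(minutes_unacked: int) -> list[str]:
    """Return everyone who should have been notified by this point."""
    return [step.notify for step in POLICY if step.after_minutes <= minutes_unacked]
```

Under these assumed timings, someone who is driving or in the shower has a window to respond before the alert moves up the chain.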
4. Take Charge
Another important step to a successful support team is making sure there are never too many cooks in the kitchen. Setting the clear expectation that the Tier 1 support person owns the alert from start to finish, including any needed Post-Mortem or Retrospective, means there is always one central person who is up to speed on the incident. When an incident is escalated in a stressful moment, Tier 2 can get heads down and hands dirty without worrying about communicating with clients. It also sets expectations with our clients that all communications continue to come from one central contact person. We’ve found that this helps keep a nerve-wracking thing like downtime from being made any more confusing to the client.
5. Communication is key
Having designated Slack channels for support communication, professional alerts, external alerts, and basic alerts means that we keep channels of communication free from alert noise and irrelevant chatter. This organized communication is imperative for quick and clear resolution of a support incident.
When an incident is triggered, aside from using the OpsGenie app, we have a Slack channel for just the urgent alerts. These alerts contain the name of the client and what is being monitored, the runbook URL, the URL for verifying the validity of the alert, and a direct link to the New Relic interface for that monitor.
Once the Tier 1 support person has acknowledged the alert, they head over to the #alerts-chat Slack channel to document the steps they take prior to escalation.
These comms are usually kept brief and often go something like:
“Can confirm, site is down”, “checking out Runbook”, “escalating to Tier 2”, “client has been notified”.
Again, in the world of support, precise and clear communication matters: it keeps brains and channels from being cluttered with conversations about process, banter, or long-winded explanations that belong in a Post-Mortem. That said, it has become a custom to add a funny GIF when the changing of the guard occurs on Wednesdays.
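The channel layout described in this section could be modeled as a simple routing table. Apart from #alerts-chat, the channel names here are hypothetical stand-ins, and the category labels are assumptions based on the alert types listed above.

```python
# Hypothetical channel names, one per alert category.
ALERT_CHANNELS = {
    "professional": "#alerts-professional",
    "external": "#alerts-external",
    "basic": "#alerts-basic",
}

# Human discussion stays out of the alert channels entirely.
DISCUSSION_CHANNEL = "#alerts-chat"


def channel_for(alert_category: str) -> str:
    """Return the Slack channel that alerts of this category are posted to."""
    try:
        return ALERT_CHANNELS[alert_category]
    except KeyError:
        raise ValueError(f"unknown alert category: {alert_category!r}") from None
```

Keeping machine-generated alerts and human conversation in separate channels is what lets the responder’s short status notes stay easy to find.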
In Conclusion
All in all, Support 2.0 has allowed us to wrangle the lessons we’ve learned since implementing our support program, and to build on something that has become a beneficial part of Culture Foundry’s services. If you want to learn more about how Culture Foundry’s support team can be there for you and your site in your time of need, contact us to discuss our levels of 24/7 monitoring!