DevOps for Pandemics

Sean D. Mack
10 min read · Oct 27, 2020

--

by Sean D. Mack and Sia Ahmadzadeh

COVID-19 has transformed the way we live and work. We sit amid a pandemic whose effects will extend well beyond today and into the future of work. This is truly an unprecedented event, and it has forced us to adapt in ways that we could not have anticipated. DevOps has evolved over the past ten years as a set of principles for responding more quickly to a rapidly changing business and technology landscape. It has transformed the way we work by bridging the gap between development and operations teams, shortening time to market while improving service availability. Key DevOps principles and practices such as collaboration, transparency, and automation stand at the core of our ability to adapt. The same DevOps principles that have helped deliver better market outcomes have also helped us adapt to the rapidly changing conditions brought about by the COVID-19 pandemic.

The people, processes, and technologies which deliver on the principles of DevOps are all critical to success during this global crisis. While DevOps covers many different things, at its core it is about collaboration, and collaboration has never been more critical to success than at times of crisis and rapid change. Other cultural aspects of DevOps, such as a learning culture and individual empowerment, are also critical to success in a rapidly changing environment. The processes which surround DevOps practices are key to enabling rapid change. A process focus on automation and continuous learning ensures organizations can adapt to these changing times. And, of course, the technology of DevOps is key to enabling organizations to rapidly pivot. Whether it is CI/CD enabling rapid product changes, loosely coupled architecture allowing scalability, or the many collaboration tools we use to enable remote teamwork, the people, processes, and technologies of DevOps help companies respond to major changes like the COVID-19 pandemic.

People are at the center of DevOps, and never have people and collaboration been more important than now. In times of crisis and when tackling difficult tasks, good teamwork and tight collaboration are what set successful teams apart. The human and cultural aspects of DevOps which help us deal with the global pandemic go beyond collaboration. Trust and empowerment of the individual are core principles of DevOps. These same principles are also critical to success in a remote work environment. In a decentralized, work-from-home environment, it is not possible to maintain the tight control of a bureaucratic or micro-managed organization. Teams with a trust-based culture where individuals are empowered to take action are ideally suited to thrive when working from home.

As remote collaboration between people becomes ever more critical, so too do the tools that enable this collaboration. We must be able to socially distance our workforce so that nothing we do requires people to go into the office. At Wiley, we saw a massive increase in usage of our collaboration tools. Chat tools such as Teams saw their usage increase by over 55%, while social tools such as Yammer saw increases of 30%. The same holds globally: video collaboration spiked during COVID-19, with Zoom usage growing 30x, from 10 million daily meeting participants in December 2019 to over 300 million in April 2020[1].

Wiley Teams activity January 2020 — September 2020
Teams activity increased 55% after Wiley moved to remote working

It is important to note that the tools and culture we built to enable a socially distanced workforce will also help build a stronger global community. All too often, workers located outside of headquarters are unintentionally left out of conversations. It is easy to forget to include the one person on the video conference when all other participants are in the room. However, when all participants are on the same chat platform and the same video conferencing platform, global communication and collaboration become easier, not harder.

A learning culture is another element central to DevOps which becomes increasingly critical in responding to a crisis. We must learn to work with new tools in new ways. At Wiley, we rolled out a massive training effort on our collaboration tools, running 12 training sessions for more than 2,700 attendees in the two weeks that followed our decision to work from home.

We also rolled out new security training to respond to increasing cybersecurity threats related to the pandemic. The US saw a massive increase in phishing and spear-phishing attacks in the weeks following the COVID-19 outbreak. While we have many security tools in place, one of the most important is training. In response to the increase in attacks, we launched a massive training campaign. We also initiated regular, company-wide communications through our social networks to ensure that all employees were part of the security solution. The ability of a company to learn and adapt is a core tenet of DevOps, and it is critical to being able to respond to COVID-19 or any crisis which disrupts the way we work.

The culture of DevOps helps us vastly improve our response to crises, and the processes related to DevOps help reinforce this culture. I know that some may feel that ‘DevOps’ and ‘process’ are antithetical, but this is not the case at all. Instead, DevOps revolves around processes that are lightweight and automated wherever possible. In many ways, this replaces manual process actions with activities that occur automatically. An excellent example is auto-scaling groups in cloud deployments, which allow for rapid expansion to meet changes in required capacity. Rather than going through an extensive manual process of submitting requests for compute resources and waiting through multiple rounds of approvals, systems can expand automatically to meet demand. This was critical to Wiley’s COVID-19 response, where we saw significant spikes in demand for our online education products as millions of students moved to online learning.
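
To make the auto-scaling idea concrete, here is a minimal sketch of the kind of proportional rule an auto-scaler might apply. It is an illustration only, not a description of what Wiley runs: the function name, the load metric, and the instance limits are all assumptions made for this example.

```python
import math


def desired_capacity(current_instances: int, observed_load: float, target_load: float,
                     min_instances: int = 2, max_instances: int = 20) -> int:
    """Return the instance count that brings per-instance load back to target.

    All names and default limits here are hypothetical, chosen for illustration.
    """
    if current_instances < 1 or target_load <= 0:
        raise ValueError("current_instances must be >= 1 and target_load must be > 0")
    # Proportional rule: grow or shrink the fleet by the ratio of observed to target load.
    raw = math.ceil(current_instances * observed_load / target_load)
    # Clamp to configured bounds so a bad metric cannot scale to zero or without limit.
    return max(min_instances, min(max_instances, raw))
```

For example, four instances running at 90% average CPU against a 60% target would call for `desired_capacity(4, 90.0, 60.0)`, which is six instances. No request form or approval round is involved; the rule is evaluated continuously against live metrics.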

DevOps technologies such as CI/CD and monitoring help us get to market quicker while increasing stability. These same technologies also help us to be significantly better at responding to crisis. Small batch sizes delivered through continuous integration and continuous deployment enable companies to rapidly pivot their delivery to market. In addition to CI/CD, agile and adaptive infrastructure helps us respond quickly to the changes a crisis demands.

As the global impact of COVID-19 began to escalate in March, Wiley encountered significant increases in platform activity and load levels, and response times started to be impacted. Activity on a single customer instance increased dramatically (due to aggressive promotion of work-from-home training to that customer’s population) and triggered a known issue that started impacting response times for all instances in that hosting region. The performance degradation was caused by caching overhead in the application. To address the issue, configuration changes were first made to limit the impact to only the customer instance that was causing it. Once that was completed, platform changes were developed and tested to optimize and reduce the caching overhead. The changes were pushed to production over the course of two days, resolving the problem and improving the overall performance of the platform.

Because of DevOps practices and technologies such as CI/CD, we were able to react quickly to a serious problem by reprioritizing current work and addressing the immediate need, while at the same time reducing our backlog and improving the baseline performance of the platform. We also saw the DevOps culture reflected in responsive teams that were engaged and energized by directly owning and managing the response and the decision-making process. We were able to rapidly deploy changes to meet the changing demand on our system because we had appropriately developed our CI/CD practices.

The following graphs show application traffic and response times before and after the optimizations were made.

Application traffic from March 14–March 31, 2020
Application traffic spiked in March as companies rapidly transitioned to work-from-home training programs.
Response times from March 28 — April 6, 2020
Response times spiked due to additional traffic and a known caching problem, but dropped dramatically after improvements were deployed.

Infrastructure as code (IaC) is another DevOps tool which helped us to adjust to rapidly changing conditions. IaC allows us to treat infrastructure in a programmatic way by describing it with code. By including IaC in our deployment pipeline, we can easily and quickly make incremental changes to adjust to changing conditions. In addition, elastic scaling can automatically scale our capacity up and down. When COVID-19 hit and usage spiked on our platforms, this ability to rapidly and elastically scale in response to change was critical to meeting the demand.
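
The core idea behind IaC can be sketched in a few lines: infrastructure is declared as data, and a tool compares that declaration with what is currently running to work out the incremental changes. The sketch below is a simplified illustration of that comparison, not any particular IaC product. The resource names and attributes are hypothetical.

```python
from typing import Dict, List, Tuple

# A resource map: resource name -> its declared attributes.
Resources = Dict[str, Dict[str, object]]

# Hypothetical example states: what is running now and what the code declares.
CURRENT: Resources = {"web": {"instances": 4}, "cache": {"size_gb": 8}}
DESIRED: Resources = {"web": {"instances": 8}, "cache": {"size_gb": 8},
                      "queue": {"workers": 2}}


def plan(current: Resources, desired: Resources) -> List[Tuple[str, str]]:
    """Compute the (action, resource_name) steps needed to reach the desired state."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            steps.append(("create", name))
        elif current[name] != desired[name]:
            steps.append(("update", name))
    # Anything running that is no longer declared gets removed.
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps
```

With the example states above, `plan(CURRENT, DESIRED)` yields a create step for `queue` and an update step for `web`, and leaves `cache` alone. Because the desired state lives in version control, a capacity change becomes a small, reviewable edit that flows through the same pipeline as application code.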

Socially Distance Your Applications

Just as social distancing is keeping people healthy during the pandemic, some appropriate distancing of our application components can help keep our services up and running. Loosely coupled systems are a good way to accomplish this sort of distancing at a systems level. Loosely coupled architectures also provide more resilient systems, which can help avoid disaster. By ensuring that the components of our system can operate independently, we ensure that the failure of one portion of the system does not cause complete system failure. Loosely coupled architectures become increasingly important as more of the components and services our applications depend on are SaaS services hosted and managed by others. With SaaS services we cannot control the availability, so we must make sure our overall services continue to operate even if SaaS components fail. In addition to “socially distancing” your applications, it is important to test that the system is resilient to component failure. One great way to do this is through chaos testing at a service level. It is important to ensure that continuous testing looks not only at failure of individual infrastructure components, but also at failure of features or functionality within your application.
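
One common way to keep a failing dependency from taking the rest of the system down is a circuit breaker with a fallback. The sketch below is a minimal illustration under our own assumptions (class name, thresholds, and the injectable clock are all hypothetical), not a production library. After a number of consecutive failures it stops calling the dependency for a cool-down period and serves the fallback instead.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            # Circuit is open: skip the dependency until the cool-down ends.
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()
            # Cool-down over: allow one trial call; a single failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A service-level chaos test can then be as simple as passing in a `func` that always raises and asserting that callers still receive the fallback, and that the broken dependency stops being called once the circuit opens.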

Transparency and visibility are other key principles that underlie DevOps, and they are implemented through proper monitoring. Monitoring enables us to see problems before they occur and to resolve issues more quickly when they do. If monitoring is available, shared, and properly configured, it can provide visibility into rapid shifts, which may indicate that some sort of crisis has changed conditions. We must ensure that we are monitoring for significant shifts that fall outside of normal cyclical patterns. When problems do occur, proper monitoring enables us to quickly determine the source of the problem, allowing us to make rapid adjustments to provide continued service for our customers.
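
Detecting a shift "outside of normal cyclical patterns" usually means comparing a metric with the same point in earlier cycles rather than with the previous minute. Here is a minimal sketch of that idea, assuming a simple standard-deviation threshold. The function name, the threshold of three standard deviations, and the sample numbers are all assumptions for illustration; real monitoring tools offer more robust methods.

```python
from statistics import mean, stdev
from typing import Sequence


def is_unusual_shift(history_same_slot: Sequence[float], current: float,
                     threshold: float = 3.0) -> bool:
    """Flag a value that deviates strongly from the same time slot in earlier cycles.

    history_same_slot holds the metric at the same slot (for example, Monday 9am)
    in previous weeks, so ordinary weekly peaks do not raise alerts.
    """
    if len(history_same_slot) < 2:
        raise ValueError("need at least two earlier observations for a baseline")
    baseline = mean(history_same_slot)
    spread = stdev(history_same_slot)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > threshold
```

With a hypothetical history of `[1000, 1050, 980, 1020]` logins for the same slot in earlier weeks, a reading of 1030 is treated as normal, while a reading of 2400 is flagged as a shift worth investigating.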

To have a culture of transparency, there must be shared visibility into the monitoring data. To collaborate, all people within the organization must have visibility into the data that makes the organization run. At Wiley, we rapidly deployed a Business Continuity (BCP) Dashboard which had a broad range of data, extending beyond system-level metrics to key business metrics. The dashboard showed VPN usage as well as the number of new registrants on each of our e-learning platforms. It also showed how internal collaboration tools were being used, so we could optimize our work-from-home workforce, as well as the number of research articles being submitted by authors around the globe researching this new disease. By sharing data with both our technical teams and our business teams, we used transparency to allow our business to make better decisions in a rapidly changing environment during unprecedented times.

It is important to recognize that responding to COVID-19 is hard. There are very real impacts on people at a very personal level and on businesses around the globe. Adapting to a world during a pandemic is difficult, but there are silver linings despite these challenges. The availability of engineers to respond rapidly has improved, with more people spending less time on the road. In addition, COVID-19 has forced us to collaborate through technology in ways which DevOps practitioners have been advocating for years.

While we are physically further apart, we can connect with our co-workers in new ways. As much as social distancing makes connecting more difficult, it has also brought us into each other’s homes. We get to meet our colleagues’ families via small cameos when a kid jumps up on a parent’s lap. Those who were previously reluctant to participate in chat platforms and video conferences are jumping on these platforms at astounding rates. This new-found focus on video and chat collaboration is especially useful for geographically dispersed teams, where on-site meetings and conversations so often inadvertently exclude remote workers. By forcing us all to work remotely, the pandemic has truly leveled the playing field and made us all better remote workers.

But the silver linings go much deeper than reduced commute time and improved use of our collaboration tools. On an individual level, we have been able to spend more time with family. For me, personally, I can say that for the first time in my career, I have had the opportunity to have dinner with my family almost every night. Between the lack of travel and the lack of late nights at the office, I have been involved in my child’s life in a way I never imagined possible. On a much broader, global level, we have seen dramatic decreases in CO2 emissions. We have seen improvements in air quality, clean beaches, and reduced environmental noise[2]. And, while these improvements may be temporary, they make it evident that we can make an impact. As much as we all hope this will end, and as much as we all hope we will see an economic recovery and a recovery for the people who are sick and suffering from this virus, there are many lessons and many benefits that I hope we can take with us as we build a new future.

[1] https://www.theverge.com/2020/4/23/21232401/zoom-300-million-users-growth-coronavirus-pandemic-security-privacy-concerns-response

[2] https://www.sciencedirect.com/science/article/pii/S0048969720323305
