Infrastructure, while critical to nearly every part of our lives, is invisible to many of us. Our most basic needs, including communication, food and water, as well as the production and delivery of other goods, depend on systems we often take for granted. Even as we use it, infrastructure is constantly evolving: It is being built and rebuilt. Nowhere is this evolution happening faster than in software infrastructure.
Software powers every form of virtual communication (not just email and video conferencing, but our phones, too), and manages information about us (everything from financial to medical records) and our things (where they are coming from, how they are working). Software controls most manufacturing processes, and probably even powers your alarm clock.
COVID has accelerated digital transformations. Automation and online interaction are becoming more important as we continue to adapt to the restrictions of the pandemic, and the impact of software on our world will continue to grow.
As software becomes more important, the people who create, maintain and operate that software–software developers–play an increasingly important role in keeping our economy and our lives on track.
Like everyone else, software developers are feeling a lot of stress right now. What’s especially stressful for developers is that just as we are expecting even more from them, the software infrastructure, the foundation on which they are building, is dangerously unstable. Worse, developers don’t understand how unstable it is and may even lack the tools to find out.
A New and Still-Evolving Discipline
Compared with other engineering disciplines, software development is still relatively immature. Its roots go back only to the middle of the 20th century, and it continues to develop rapidly: Fueled by Moore’s Law, cloud computing and other advances in computing, software can now do things that were hard to imagine only 10 or 20 years ago.
The way that software is built also continues to change rapidly: More and more, software is built upon layers and layers of other software. Instead of starting from scratch, developers compose open source software libraries, provision virtual data centers, and even leverage managed services for machine learning. What took years or months to build in the past can now be built in days or even hours.
While this is a boon to the development of new applications, it also presents risks. Each new layer added to a software application, whether it’s a new open source library, a new framework, or a new service, compounds the chances of failure. Each layer adds new sources of bugs, new opportunities for systems to become overloaded, and new vectors for attackers to exploit. And in many ways, we (as users) have become complacent, accepting and even expecting failure in software. We even have apps dedicated to tracking when other apps are not working. But software can be better, if we give software developers the means to understand, measure and improve reliability.
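To make the compounding concrete, here is a minimal sketch in Python. It assumes that an application works only when every layer beneath it works, and that layers fail independently of one another; real systems rarely satisfy either assumption exactly. The layer names and availability figures are hypothetical, chosen only for illustration, and are not measurements of any real product.

```python
from math import prod


def combined_availability(layer_availabilities):
    """Availability of a stack that works only if every layer works.

    Assumes layers fail independently, so the individual
    availabilities can simply be multiplied together.
    """
    return prod(layer_availabilities)


# Hypothetical figures for illustration only, not measurements.
layers = {
    "open source library": 0.999,
    "framework": 0.999,
    "managed service": 0.999,
    "virtual data center": 0.999,
}

stack = combined_availability(layers.values())
hours_per_year = 24 * 365

for name, availability in layers.items():
    print(f"{name}: {availability:.3%} available")

# The whole stack is less available than any single layer in it.
print(f"whole stack: {stack:.3%} available")
print(f"expected downtime: {(1 - stack) * hours_per_year:.1f} hours per year")
```

With these made-up numbers, four layers that are each available 99.9% of the time yield a stack that is available only about 99.6% of the time. No single layer looks unreliable, yet the application built on all of them is noticeably worse than any one of them.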
Rethink Established Tools
How can we create stability amid the whirlwind of change? Software architectures were already changing, but the pandemic has accelerated that change. To address it, we need to rethink how developers build software and especially the tools they use to do so.
First, we need more reliable software infrastructure. There’s no magic bullet for this: It requires engineering expertise and hard work. What we can do is make sure infrastructure providers have the incentives to commit to quality standards and improve reliability. This leads us to the second point: As we educate (and reeducate) software developers, we must train them to understand more about the infrastructure they are building on and to demand that the infrastructure meet a higher bar in terms of quality and reliability.
Finally, we need the tools to measure the quality and reliability of every layer of software systems: from the apps on our phones to the underlying infrastructure that powers those apps. When software was simpler, it was sufficient to observe it from the outside. Now, every potential internal failure point needs to be accounted for, and failures need to be put into context so developers can prioritize and address them.
COVID has put increasing demands on many forms of infrastructure, not least our software infrastructure. It has increased our dependence on new technologies and accelerated digital transformations. These demands require additional investment from the firms building software, but since software changes so quickly, that investment must go not merely into the infrastructure itself, but also into their software developers and the tools they use. With these tools, software developers will be not only less stressed, but also more productive and able to have a bigger impact on the world around them.
Daniel “Spoons” Spoonhower is a cofounder at Lightstep, provider of a full-context observability platform that makes complex microservice applications more transparent and reliable. An expert in distributed tracing, he is a core contributor to the OpenTracing project, a CNCF project. Previously, he spent almost six years as a staff software engineer on Google’s Infrastructure and Cloud Platform teams.