The Software Bathtub Curve – Understanding the Software Systems Lifecycle

Jul 23, 2014

Introduction

The Bathtub Curve is widely used in reliability engineering to explain how and why the failure rate of a product or engineering system changes through its lifecycle. Initially the failure rate is high, but it decreases rapidly as defective products or components are identified and eliminated or replaced, and as assembly, integration and installation problems are resolved. This phase is sometimes referred to as the “burn-in” phase. In the middle of the lifecycle, random failures continue to occur, but the failure rate is relatively constant and normally at its lowest. Once components approach the end of their operational lifetime, they fail more frequently, which leads to an increasing failure rate. This phase is often referred to as the “wear-out” phase.
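The shape described above can be sketched numerically. The following is a minimal illustration, assuming one common way of modelling a bathtub curve: the sum of a decreasing Weibull hazard (burn-in), a constant hazard (random failures) and an increasing Weibull hazard (wear-out). The function names and every parameter value are hypothetical and chosen only to make the shape visible.

```python
def weibull_hazard(t, shape, scale):
    """Hazard (instantaneous failure rate) of a Weibull distribution at time t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve; all parameter values are assumptions."""
    burn_in = weibull_hazard(t, shape=0.5, scale=100.0)     # shape < 1: decreasing rate
    random_failures = 0.002                                 # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=1000.0)   # shape > 1: increasing rate
    return burn_in + random_failures + wear_out


if __name__ == "__main__":
    for t in (1, 10, 100, 500, 1000, 1500):
        print(f"t={t:>5}  failure rate={bathtub_hazard(t):.4f}")
```

With these assumed parameters the rate is high at first, bottoms out in the middle of the range, and rises again towards the end.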

This model might make sense in the context of physical systems that are subject to wear and tear and the degradation of physical materials – but how is it relevant to software systems? Software is “constructed” using abstract notions of logic. There is no physical material to degrade or wear out – so once software functions according to specification it should remain that way indefinitely. The conventional, wear-based explanation of the Bathtub Curve therefore cannot be applied to software systems.

What makes Software Systems different?

Careful analysis of the software engineering process and software systems lifecycle shows that the failure rate over time of software systems also follows a Bathtub Curve. The software reliability nomenclature and models developed decades ago by Musa, Iannino and Okumoto (Software Reliability, McGraw-Hill 1987) enable this phenomenon to be explained. Software systems are different because:

Given the theoretical impossibility of exhaustive testing, software engineers commission software systems knowing that they contain latent defects – or bugs – that were not identified or resolved during testing.
Defining what constitutes a “bug” depends on the system specifications. Behavior that constitutes a failure in terms of one specification may constitute desired behavior in the context of another specification.
Software failures are essentially non-random. They may appear to be random because the complexity of even small software systems masks their internal state and the external stimuli that are present at the time of failure. Were it possible to perfectly replicate these, then the same bug that caused the one apparently random failure would cause repeated failures during each execution.
Operational profile is the term used to describe the set of possible run types and their associated probabilities of occurring. Failure rates are directly related to operational profile. Because a bug might be triggered in a run type that is infrequent, the failure rate of the system will appear to be low, but if the operational profile of the system changes so that this run type becomes more frequent, the observed failure rate will increase – even though these failures are all due to the same bug and the software system has not been modified.
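The relationship between operational profile and observed failure rate can be made concrete with a small calculation. This is a sketch under stated assumptions: the run types, their probabilities and the single bug are all hypothetical, and the failure rate is expressed as expected failures per run.

```python
def observed_failure_rate(profile, failure_prob):
    """Expected failures per run: sum of P(run type) * P(failure | run type)."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("run-type probabilities must sum to 1")
    return sum(p * failure_prob.get(run_type, 0.0) for run_type, p in profile.items())


# Hypothetical system: one latent bug that always fails the "bulk_export" run type.
failure_prob = {"lookup": 0.0, "update": 0.0, "bulk_export": 1.0}

# The same unmodified software under two different usage patterns.
original_profile = {"lookup": 0.90, "update": 0.099, "bulk_export": 0.001}
shifted_profile = {"lookup": 0.70, "update": 0.20, "bulk_export": 0.10}

rate_before = observed_failure_rate(original_profile, failure_prob)
rate_after = observed_failure_rate(shifted_profile, failure_prob)

if __name__ == "__main__":
    print(f"failures per run before shift: {rate_before:.3f}")
    print(f"failures per run after shift:  {rate_after:.3f}")
```

The code and the bug are identical in both cases. Only the mix of run types changes, yet the observed failure rate rises by roughly two orders of magnitude.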

Software Operational Lifecycle

Software systems exhibit a higher failure rate very early in the operational lifecycle. This is because software testing is inherently limited, and latent bugs often show up once the software system is live in the operational environment. Software vendors expect to release “patches” that eliminate these bugs soon after releasing the software. As these bugs are eliminated, the failure rate decreases. This phase is referred to as the “Initial Phase”.
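The declining failure rate of the Initial Phase can be illustrated with the basic execution time model associated with Musa, in which failure intensity decays exponentially as bugs are found and fixed. This is a minimal sketch, assuming that every failure leads to a fix and that fixes introduce no new bugs. The parameter values are hypothetical.

```python
import math


def failure_intensity(tau, initial_intensity, total_failures):
    """Failure intensity after tau units of execution time.

    initial_intensity: failures per unit time at release (tau = 0).
    total_failures: failures expected over unlimited execution time.
    """
    return initial_intensity * math.exp(-initial_intensity * tau / total_failures)


def expected_failures(tau, initial_intensity, total_failures):
    """Cumulative failures expected by execution time tau."""
    return total_failures * (1.0 - math.exp(-initial_intensity * tau / total_failures))


if __name__ == "__main__":
    # Hypothetical release: 10 failures per hour at go-live, 100 latent failures in total.
    for tau in (0, 5, 10, 20, 50):
        rate = failure_intensity(tau, 10.0, 100.0)
        found = expected_failures(tau, 10.0, 100.0)
        print(f"tau={tau:>3}h  intensity={rate:6.3f}/h  failures so far={found:6.1f}")
```

Under these assumptions the intensity falls towards zero. In practice it levels off at the background rate of the Operational Lifetime, because later changes to the system keep introducing new bugs.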

Once the failure rate has dropped to the level that reflects the incidence of sporadic bugs that typically arise throughout the operational lifecycle of a stable, mature system, the software system is said to have entered its “Operational Lifetime”.

With time, changes are inevitably made to the source code, normally in response to one or more factors (some of which relate to the inherent properties of software systems highlighted above). These include:

Changing operational profile – the system is used in accordance with the original functional specifications, except usage patterns change so that functions that were previously invoked infrequently are now invoked more regularly. Failures that were experienced infrequently are now experienced more frequently, necessitating changes to the system.
Changing requirements – business and functional requirements typically change with time. This is normally the result of innovation, security threats and changes in the market, regulatory environment and competitive landscape.
Changing technology stack – system software platforms such as operating systems and database management systems evolve with time. In order to avoid obsolescence, software systems must be migrated from old technology platforms to newer ones. This is true even if the functional and business requirements have not changed. In some situations a business or functional requirement itself forces changes to the technology stack. Providing support for mobile devices is one example of this.

Introducing changes to source code often results in the introduction of bugs. This is especially true when the changes that are being made introduce functionality that lies outside of the original design envelope of the software system. Strategic architectural approaches to software systems design are an attempt to provide a far-sighted, long-term foundation that will facilitate the implementation of requirements not yet envisaged at the time the software system is initially commissioned. Conversely, tactical “quick-win” architectures are often brittle and result in shorter stable Operational Lifetimes.

The Role of Architecture

Good architectural design (which is not the subject of our discussion) is the key to extending the Operational Lifetime – which is the bottom of the Bathtub. Ultimately, however, the software system will require more and more modifications to address the factors listed above, and the bugs that get introduced will result in an increasing failure rate. Good architecture can significantly extend the Operational Lifetime – but it cannot prevent the ultimate demise of the software system, which is described by the “Terminal Phase” and is characterized by an increasing and unacceptably high failure rate.

Once a software system enters the Terminal Phase, the system owners enter crisis mode – which makes it difficult to focus effort and energy on strategic replacement. Instead the focus turns to tactical fixes, which may postpone short-term catastrophic failure, but typically increase complexity, thereby contributing to a medium-term increase in failure rate.

Effective ownership and operational management of a software system entails anticipating the onset of the Terminal Phase and proactively entering a “Replacement Phase” before being overwhelmed by the operational problems that characterize the Terminal Phase.
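One way to anticipate the onset of the Terminal Phase is to watch for a statistically significant upward trend in failures. The sketch below uses the Laplace trend test, a standard reliability technique that the discussion above does not itself prescribe. The failure timestamps are hypothetical, and the 1.96 threshold assumes a 5% two-sided significance level.

```python
import math


def laplace_trend(failure_times, observation_end):
    """Laplace trend statistic for failures observed over (0, observation_end].

    Values near 0 suggest a constant failure rate, clearly positive values
    suggest an increasing rate, and clearly negative values a decreasing rate.
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure time")
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2) / (observation_end * math.sqrt(1 / (12 * n)))


# Hypothetical failure logs (days since start of a 100-day observation window).
steady = [10, 20, 30, 40, 50, 60, 70, 80, 90]
worsening = [60, 75, 85, 90, 94, 97, 99]

if __name__ == "__main__":
    for name, times in (("steady", steady), ("worsening", worsening)):
        u = laplace_trend(times, 100)
        flag = "increasing failure rate" if u > 1.96 else "no significant increase"
        print(f"{name:>10}: u={u:+.2f}  {flag}")
```

A sustained positive statistic over successive windows is a signal to begin planning the Replacement Phase rather than waiting for the operational problems to become overwhelming.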

This view of the operational lifecycle of a software system enables us to understand two common scenarios:

The Eternal Software System

Accountants seem to understand that a software system will not last forever. In electing to depreciate the software system asset over a given period of time, they implicitly acknowledge that the Operational Lifetime of the software system is limited. The fact that in most instances this acknowledgement is not accompanied by some obsolescence planning strikes a dissonant chord. It seems that some in the enterprise expect or hope that the software system will have an infinite Operational Lifetime – or at the very least, they hope that they will have moved on long before their successors need to deal with the engineering and commercial challenges of the Terminal Phase. Even the best software systems have a limited Operational Lifetime, and it is imperative that organizations plan for obsolescence and replacement.

The Infernal Software System

Unfortunately there are software systems that never enter the Operational Lifetime phase – they go directly from the Initial Phase to the Terminal Phase. For software engineers and system owners who are immersed in the development of the system, it can be difficult to distinguish between being in the Initial Phase and the Terminal Phase. Although this scenario may be attributable to poor architecture, it may also be the result of inadequate understanding of business objectives and requirements. Radical departures from previously defined goals and objectives are reflected in the design and implementation of software systems. Failure to recognize this scenario typically results in continued investment in a software system that is beyond redemption.

Conclusion

Although software systems are not subject to “wear-and-tear”, the Bathtub Curve provides us with insights into their operational lifecycle. The peculiarities of software systems ensure that they will need to be replaced with time. The Software Bathtub Curve enables software system owners to understand where a system is in its lifecycle. This makes it a useful strategic planning tool that can be used in the management of information technology and software systems to support critical decisions relating to budgeting, resourcing, ongoing development, and obsolescence and replacement planning.