Survivability Challenges for Systems of Systems



Robert J. Ellison

Carol Woody

This library item is related to the following area(s) of work:

Security and Survivability

This article was originally published in News at SEI on: June 1, 2007

The Changing Operational Environment

As systems become loosely coupled groups of software modules that function independently and are assembled dynamically to exchange information and perform shared services, establishing control boundaries for system and component certification and accreditation grows increasingly difficult. The same set of interconnected software modules could be implemented within a single operating system or distributed internationally. Connectivity could be peer-to-peer, wireless, or any combination of approaches. These implementation choices represent drastically different threat environments. Software, however, is not currently built with this range of variability in mind.

Increasing system complexity, required system adaptability, and new operational mission needs are driving software and system developers to the expanded use of technologies such as Web services and design approaches such as service-oriented architectures. The new technologies and changing operational environment raise software risks that are not addressed by existing operational risk approaches. Newly developed components join an existing operational environment and must be connected with older technology for operational effectiveness. Too often we see a patchwork of software components stitched together with home-grown solutions. Such a patchwork might work, but with little or no assurance, and it is likely to be inflexible and potentially unpredictable.

Operational support and sustainment efforts must consider that the operational environment will grow increasingly fragile as the need for greater flexibility and for solutions that meet immediate needs drives development to field solutions based on newer, less proven tools and techniques that must integrate seamlessly with legacy systems. Changes in the operational environment also occur independently of development activities. Continuous hardware and operating system upgrades take place as support for older versions expires. Vulnerability monitoring and incident mitigation introduce changes to infrastructure configurations and components such as firewalls and routers. Growing complexity and increasing interdependency of operational components can overwhelm problem analysis and response mechanisms. As new development joins the implemented environment, limited consideration of operational needs will accentuate these challenges.


  • The operational situation may change before problem analysis is completed.
  • Rapidly changing threats and attack patterns mean that quality estimates of the probabilities may not be available when the threat must be addressed.
  • Interdependencies and complexity can hide risks that may not be observed until deployment. It may be difficult to identify risks for new usage of an existing system or when systems with unanticipated configurations are linked.
  • With distributed processing, we have to protect data at rest as well as in transit, but actual transfers and resting points may not be known.


  • The effects of a component change or failure are difficult to contain, and external interdependencies may be unknown. An insignificant local change or error can be magnified by unexpected or poorly understood interdependencies.
  • Simplifying analysis by considering only critical components is less effective. It may be hard to find components that are not critical in some context.
  • The difficulty of predicting the behavior of a system of systems (SoS) is compounded by multiple control-points that do not share common risk mitigation strategies.

Mitigation Concerns

  • Operations are increasingly non-stop. A change in response to an operational state cannot require system restarts and may have to be done in minutes or hours rather than weeks. It is impractical to require manual changes at each application or to require redeployment for configuration and security changes.
  • The rapid evolution of hardware and technology implies that a homogeneous deployment eventually becomes heterogeneous. The heterogeneity creates the potential for multiple and potentially conflicting approaches to security and the risk of incomplete solutions.
  • Accountability and traceability could be difficult to incorporate in a service-based security infrastructure and will be especially difficult to provide in a distributed and heterogeneous environment.
  • The sheer number and diversity of software and hardware components limits the ability to synchronize upgrades and deployments. Such synchronization is even less likely across multiple organizations. Interoperability with legacy components or those with unapplied upgrades is a given.

Increased Importance of Operational Survivability

The survivability of an operational process (mission thread) depends on the availability of a specific set of functions. In the systems-of-systems environment, those functions can span multiple systems [Maier 96]. Those systems must support multiple mission threads and provide functionality for local processes (local threads), which the stand-alone system was originally designed to address.

Survivability focuses on what can go wrong (the risks), the consequences of those risks on the operational process (mission thread), and the possible mitigation options for those risks. Survivability emphasizes continued operations in the presence of error and recovery following a failure. A response to a failure is not necessarily a system response but may be a change in procedures or an acceptable modification of the mission.



Figure 1. Mission threads cross system boundaries

Survivability must account for the complexity of a networked operations environment, which can be distributed globally. Figure 1 presents a simple view of the global infrastructure environment. Figure 2 captures the variances among levels of connectivity, which can be described as strategic, operational, and tactical (see description in Table 1). Administrative controls are distributed across multiple systems, and there is limited visibility into system behavior beyond the boundaries of each local administration. Global guidance of systems has a role at the strategic level, but at the tactical level, systems must be able to act autonomously because communications and any associated controls are not as reliable.


Figure 2. Infrastructure capabilities vary across the organization

Survivability solutions cannot assume homogeneous configurations. In reality, the operational capabilities (including bandwidth and latency) will vary across the strategic, operational, and tactical levels of the organizational infrastructure. The risk-mitigation tactics must also allow for variances in software architecture induced by these differences. Existing systems and components are expected to evolve to greater interoperability and dependency on shared infrastructure capabilities, reducing the level of functional independence and increasing the potential for reliability and dependability concerns.

Table 1. Variations in application architectures

Network Dependability | Application Architecture
Relatively dependable network | Web-centric an option, i.e., thin clients with applications executing on the server
Noticeable variations in bandwidth and reliability | Modes: may need to shift from thin to thick client if operating conditions deteriorate
May have to resort to ad-hoc and peer-to-peer configurations | Thick clients: some applications must run locally with limited or no network access
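
The shift between thin and thick clients described in Table 1 can be pictured as a simple mode-selection rule. The following Python sketch is only an illustration: the function name, the measured inputs, and both thresholds are assumptions introduced here, not values from the article.

```python
from enum import Enum


class ClientMode(Enum):
    THIN = "thin"    # application executes on the server
    THICK = "thick"  # application executes locally


def choose_mode(bandwidth_kbps: float, loss_rate: float,
                min_bandwidth_kbps: float = 256.0,
                max_loss_rate: float = 0.05) -> ClientMode:
    """Pick thin-client operation only while the network looks dependable.

    The thresholds are illustrative assumptions, not values from the article.
    """
    if bandwidth_kbps >= min_bandwidth_kbps and loss_rate <= max_loss_rate:
        return ClientMode.THIN
    # Deteriorated conditions: fall back to running the application locally.
    return ClientMode.THICK
```

A real deployment would likely also need hysteresis so that a client does not flip between modes on every fluctuation; that refinement is omitted here for brevity.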

Mission-thread requirements crossing system boundaries (joint among systems) will compete for resources with system-provided functions. SoS mission-thread execution will compete with local threads and other multi-system threads assigned to the same system for priority, bandwidth, and other resources (as shown in Figure 3).



Figure 3. Joint threads introduce new requirements and conflicts.
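
One way to picture this competition is a priority queue shared by local and joint threads. The Python sketch below is a hypothetical illustration: the thread names and the priority numbers are assumptions, and the article does not state which kind of thread should take precedence.

```python
import heapq
from typing import List, Tuple

# (priority, arrival order, thread name); a lower priority number runs first.
Request = Tuple[int, int, str]


def schedule(requests: List[Tuple[int, str]]) -> List[str]:
    """Order competing requests by priority, then by arrival order."""
    heap: List[Request] = []
    for order, (priority, name) in enumerate(requests):
        heapq.heappush(heap, (priority, order, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


# Hypothetical workload on one system: two local threads and two joint threads.
run_order = schedule([
    (2, "local report"),
    (1, "joint targeting thread"),
    (2, "local backup"),
    (1, "joint logistics thread"),
])
```

In this sketch the local threads wait whenever a joint thread with a better priority is queued, which is exactly the kind of conflict that must be resolved by policy rather than left to chance.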

Software-intensive, complex, human-machine systems require distributed decision making across both physical and organizational boundaries. The combination of expanding interoperability, multilevel requirements, and multiple control points creates a highly complex operational environment in which system behavior can be difficult to predict. An SoS mission thread requires reliability and dependability for individual components and solid end-to-end engineering so that the integration of those components ensures the survivability of the thread.

Seamless integration among participating systems in cyberspace requires operational support and sustainment efforts that can recognize unacceptable conditions, resource contention among participants, and changes in resources, and that have options for responding to potential bottlenecks affecting mission success. Both monitoring and response needs must be considered as systems and software components are designed and built, because the controlling mechanisms of the network communication structure provide only a limited range of integration control in an SoS environment.

Challenges for SoS Operational Survivability

There is ample evidence that we are already reaching the limits of our engineering and testing practices, even in today's less dynamic environments [Howard 06]. Unfortunately, systems engineered to work together commonly fail to produce the desired joint outcome because all circumstances in the operational environment could not be anticipated. In fact, even system upgrades can break existing interoperability among partnering systems.

Operational survivability, both today and in the future, can also be affected in the following ways:

  • The difficulty of predicting the behavior of an SoS is compounded by multiple managerial and technical control points that do not necessarily share common risk mitigation strategies.
  • The effects of a component change or failure are difficult to contain, and, given external interdependencies, they may be unknown. An insignificant local change or error can be magnified by unexpected or poorly understood interdependencies.
  • The interdependencies and complexity can hide risks that may not be observed until deployment. Risks may be difficult to identify for new uses of existing systems or when systems are linked in unanticipated configurations.
  • The requirements for building a system today are typically based on pre-defined organizational and user actions that are validated prior to implementation. The deployment requirements for an SoS participant or service are not as well defined. While the interfaces may be stable, the behaviors of systems of systems change as new services and participants are added and new usages of existing services are deployed.
  • As complexity increases in both the system capabilities and the interconnections in an SoS, the ability of humans to manually monitor and respond in a sufficiently timely manner becomes less feasible. In addition, it is impractical to require manual changes at each application or to require redeployment for configuration and security changes.
  • At any given time, a large system has some component involved with error recovery. For a complex operational environment, the effects of an error can be propagated to multiple systems. Error recovery is complicated by the decentralization of control and system evolution, and by inherently diverse, conflicting, and possibly unknowable requirements. That complexity can be offset by the capability of the infrastructure and its components to adapt to error states by using alternate pathways, execution sequences, and strategies.

Thus, the future systems-of-systems infrastructure will fundamentally alter the relationship between system components. Each component will know far less about the time, reason, and environmental conditions in which it is invoked. Components must assume that errors are occurring. To protect itself, and to continue to execute its missions, a component within the infrastructure must adopt a defensive posture against a wide range of potential complications (or stresses) that were most likely not predicted during its development. Survivability of the mission threads will depend on how components manage these stresses.
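
A defensive posture of this kind, combined with the alternate pathways mentioned above, can be sketched in a few lines of Python. Everything named here (the function, the pathway names, and the use of `ConnectionError` as the failure signal) is a hypothetical choice made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pathway = Callable[[str], str]


def send_with_fallback(
    message: str, pathways: Sequence[Tuple[str, Pathway]]
) -> Tuple[Optional[str], List[str]]:
    """Try each pathway in order; return (result, names of pathways that failed)."""
    failures: List[str] = []
    for name, pathway in pathways:
        try:
            return pathway(message), failures
        except ConnectionError:
            # Assume errors are occurring: record the failure and move on.
            failures.append(name)
    # Every pathway failed; the caller must degrade or escalate.
    return None, failures


# Hypothetical pathways; a real component would wrap actual transports.
def satellite_link(message: str) -> str:
    raise ConnectionError("link down")


def radio_relay(message: str) -> str:
    return f"relayed:{message}"


result, failed = send_with_fallback(
    "status report",
    [("satellite", satellite_link), ("radio", radio_relay)],
)
```

Returning the list of failed pathways alongside the result keeps the stress visible to monitoring, so a successful fallback does not hide the underlying problem.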

New Assurance Mechanisms

Survivability Analysis Framework

The Survivability Analysis Framework (SAF)1 was developed to help organizations analyze and understand threats and gaps to survivability for operational mission threads within an SoS. A report describing the details of SAF, along with examples, will be published later this year. In the meantime, this article provides a summary to prepare readers for that document.

SAF is designed to address the following:

  • identify potential problems with existing or near-term interoperations among components within today's network environments
  • highlight the impact on survivability as constrained interoperation moves to more dynamic connectivity
  • suggest engineering guidelines to increase the chances for survivability of mission threads operating within the SoS infrastructure
  • increase our assurance that the mission threads can survive in the presence of failures

Mission Thread Steps and Step Interactions

Each critical step in a mission thread is tasked to fulfill some portion of mission-thread functionality. This tasking represents a contract of interaction between the mission-thread step and prior and subsequent steps. Pre-conditions establish the information provided to the step. Pre-conditions may trigger the execution of a step (e.g., data or a human command), or the process may be continually executed (e.g., a sensor). Each step will have outcomes (post-conditions) that may interact with subsequent steps. However, the contract with prior and subsequent steps is not necessarily static and may have to be negotiated at run time to reflect the current situation. Even the identity of prior and subsequent steps may vary across executions of the mission thread.2
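
The contract between steps can be made concrete with a small Python sketch. The `Step` class, the state dictionary, and the two example steps are assumptions introduced for illustration; SAF itself does not prescribe any code representation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Step:
    """One mission-thread step with an explicit interaction contract."""
    name: str
    preconditions: List[Callable[[State], bool]]
    action: Callable[[State], State]
    postconditions: List[Callable[[State], bool]]

    def run(self, state: State) -> State:
        # Pre-conditions describe what prior steps must have provided.
        if not all(check(state) for check in self.preconditions):
            raise ValueError(f"{self.name}: pre-conditions not met")
        result = self.action(dict(state))  # work on a copy of the input
        # Post-conditions describe what subsequent steps may rely on.
        if not all(check(result) for check in self.postconditions):
            raise ValueError(f"{self.name}: post-conditions not met")
        return result


def run_thread(steps: List[Step], state: State) -> State:
    """Execute steps in order; the step list itself may differ per execution."""
    for step in steps:
        state = step.run(state)
    return state


# Hypothetical two-step thread: detect a target, then assign a responder.
detect = Step(
    name="detect",
    preconditions=[lambda s: "sensor_reading" in s],
    action=lambda s: {**s, "target": s["sensor_reading"] > 0.5},
    postconditions=[lambda s: "target" in s],
)
assign = Step(
    name="assign",
    preconditions=[lambda s: s.get("target") is True],
    action=lambda s: {**s, "responder": "unit-7"},
    postconditions=[lambda s: "responder" in s],
)

final_state = run_thread([detect, assign], {"sensor_reading": 0.9})
```

Because the step list is passed in at call time, the same checks apply even when the identity of prior and subsequent steps changes between executions.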

Environmental, data, process, and interaction limitations can lead to potential breakdown of a step. Each limitation represents a source or type of stress on the step and, consequently, on the mission thread. However, such stress does not necessarily cause failure. Steps can be designed to manage a range of stresses and still respond appropriately or degrade gracefully. Additionally, the failure of any specific step may not necessarily doom a mission, because subsequent steps may continue to execute the thread.

Linkages among steps are driven by three primary components: people, resources (technology, systems, connectivity, etc.), and interactions (e.g., data exchange). The behavior of the linkages coupled with the activities to be addressed in each step can lead to stresses. Unmanaged stresses can potentially lead to complete failure. The mission is also likely to fail if a step manages a stress in a manner incompatible with subsequent steps. For example, consider a step that receives some data as input. If the value received by the step is out of the expected range, then the step can respond in a variety of ways. For instance, it might substitute a default value in place of the out-of-range value. This substitution, however, may have dire consequences if the decision to manage the stress by substituting a default value is inconsistent with the subsequent step's expectation for a highly accurate value.
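
The default-value example can be sketched in Python. One possible way to keep the two steps compatible, assumed here rather than taken from the article, is to have the upstream step mark substituted values so that the downstream step can decide whether to accept them. The names `Reading`, `sanitize`, and `consume` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    value: float
    substituted: bool  # True when a default replaced an out-of-range input


def sanitize(raw: float, low: float, high: float, default: float) -> Reading:
    """Upstream step: manage an out-of-range input by substituting a default."""
    if low <= raw <= high:
        return Reading(raw, substituted=False)
    return Reading(default, substituted=True)


def consume(reading: Reading, needs_accurate: bool) -> str:
    """Downstream step: refuse substituted values when accuracy is required."""
    if needs_accurate and reading.substituted:
        return "rejected"  # escalate instead of acting on a guess
    return "accepted"


in_range = sanitize(42.0, low=0.0, high=100.0, default=50.0)
out_of_range = sanitize(250.0, low=0.0, high=100.0, default=50.0)
```

Without the `substituted` flag, the downstream step would see 50.0 and have no way to tell a measured value from a placeholder, which is the incompatibility the paragraph above describes.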

SAF captures for analysis the ways in which selected stresses are handled at critical mission thread steps. It also analyzes whether the stress-handling approaches adopted by a step are compatible with subsequent mission thread steps. SAF consists of two component groups: (1) three matrices that capture stresses on a step and potential mechanisms for managing these stresses; and (2) a process for applying the matrices to a joint mission thread.

Applying the Survivability Analysis Framework

To begin the process, select a business-process (mission-thread) scenario with sufficient complexity to cross a range of organizational and technical options useful for analysis. Establish appropriate completion criteria for the scenario to represent organizational success. Decompose the scenario into a specific sequence of end-to-end steps (unique activities) that must be performed to reach the success goals.

Across the range of mission steps, assemble a matrix for each of the following:

  • people (roles) required to participate in each step
  • resources required for the activities in each step (organizational policies and procedures, technology tools and capabilities, network connectivity, etc.)
  • pre-conditions, activities, and post-conditions expected based on successful step execution as pictured in Figure 4

Figure 4. Survivability Analysis Framework

Using the three matrices, apply failure analysis techniques to identify potential points of failure that would critically impact successful completion of the mission thread [Alberts 05, Stamatis 03, Woody 07].
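
As a rough illustration of how the three matrices might be recorded and queried, consider the Python sketch below. The mission thread, roles, resources, and conditions are all invented for the example, and the two checks shown are a deliberately naive stand-in for the failure analysis techniques cited above, not a replacement for them.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical mission thread with three steps; all names are illustrative.
steps: List[str] = ["detect", "decide", "act"]

# Matrix 1: people (roles) required in each step.
people: Dict[str, Set[str]] = {
    "detect": {"sensor operator"},
    "decide": {"commander", "analyst"},
    "act": {"commander", "field unit"},
}

# Matrix 2: resources required in each step.
resources: Dict[str, Set[str]] = {
    "detect": {"radar", "tactical network"},
    "decide": {"tactical network", "planning tool"},
    "act": {"tactical network", "radio"},
}

# Matrix 3: pre-conditions, activity, and post-conditions of each step.
conditions: Dict[str, Dict[str, str]] = {
    "detect": {"pre": "sensor online", "activity": "scan", "post": "track reported"},
    "decide": {"pre": "track reported", "activity": "assess", "post": "order issued"},
    "act": {"pre": "order issued", "activity": "respond", "post": "mission complete"},
}


def steps_affected(matrix: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each role or resource to the steps that would be stressed by its loss."""
    affected: Dict[str, List[str]] = defaultdict(list)
    for step in steps:
        for item in sorted(matrix[step]):
            affected[item].append(step)
    return dict(affected)


def broken_handoffs() -> List[str]:
    """List consecutive steps whose post- and pre-conditions do not match."""
    return [
        f"{a} -> {b}"
        for a, b in zip(steps, steps[1:])
        if conditions[a]["post"] != conditions[b]["pre"]
    ]


resource_impact = steps_affected(resources)
people_impact = steps_affected(people)
```

In this invented thread, the tactical network appears in every step, so its loss would stress the whole thread; an item like that is a natural first candidate for closer failure analysis.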

For SAF to begin to realize its potential, it must be applied to a broader range of operational mission threads. From a larger body of mission-thread and development assessment examples, patterns of effectiveness can be identified and the framework refined.


[Alberts 03]
Alberts, C. & Dorofee, A. Managing Information Security Risks: The OCTAVE Approach. Boston, MA: Addison-Wesley, 2003.

[Alberts 05]
Alberts, C. & Dorofee, A. Mission Assurance Analysis Protocol (MAAP): Assessing Risk in Complex Environments (CMU/SEI-2005-TN-032). Pittsburgh, PA: Software Engineering Institute, Carnegie Mellon University, 2005.

[Howard 06]
Howard, M. & Lipner, S. The Security Development Lifecycle SDL: A Process for Developing Demonstrably More Secure Software. Microsoft, 2006.

[Leveson 04]
Leveson, N. “A Systems-Theoretic Approach to Safety in Software-Intensive Systems.” IEEE Transactions on Dependable and Secure Computing 1,1 (January-March 2004): 66-86.

[Maier 96]
Maier, M. “Architecting Principles for Systems of Systems” 567-574. Proceedings of the Sixth Annual International Symposium, International Council on Systems Engineering. Boston, MA, 1996.

[Stamatis 03]
Stamatis, D.H. Failure Mode and Effect Analysis: FMEA from Theory to Execution, 2nd ed. Milwaukee, WI: ASQ Quality Press, 2003.

[Woody 07]
Woody, C. & Alberts, C. “Considering Operational Security Risk during System Development.” IEEE Security & Privacy 5, 1 (January/February 2007): 30-35.

1 SAF was piloted for the U.S. Department of Defense, Joint Battle Mission Command and Control (JBMC2) in analysis of a time-sensitive-targeting mission thread for the Office of the Under Secretary of Defense for Acquisition, Technology, and Logistics (OUSD/AT&L). A second pilot analysis was completed for time-sensitive-targeting information assurance for the U.S. Department of Defense, Electronic Systems Center, Cryptologic Systems Group, and Network Systems Division (ESC/CPSG NIS).

2 Mission threads are expected to be dynamic in content because each specific mission is unique.

The views expressed in this article are the authors' only and do not represent directly or imply any official position or view of the Software Engineering Institute or Carnegie Mellon University. This article is intended to stimulate further discussion about this topic.
