Software Engineering Institute | Carnegie Mellon University

SoS Software Assurance

Justified confidence in system and system of systems (SoS) behavior requires software assurance theories and principles that don’t exist today. Using such theories and principles, organizations would have a better basis for confidence in deployed system behavior, and at the same time, these theories and principles could be used to make the assurance process more efficient and effective.

Our system-of-systems software assurance (SoSSA) research focuses on meeting the assurance needs of large-scale, multi-user adaptive information management and command-and-control systems of systems that will be operated in unanticipated ways. Systems of systems built using the service-oriented architecture (SOA) paradigm can fall into this category.

To make the assurance process more efficient and effective, we need to answer foundational questions such as the following:

  • Which assurance activities provide the biggest increase in justified confidence that a system will behave acceptably when fielded?
  • Can some assurance activities be curtailed without reducing justified confidence in deployed system behavior? For example, when is it reasonable to stop testing a system, and why?
  • What insights do assurance activities yield into the residual risks that are present in a deployed system?
  • What evidence is most probative in deciding whether a system should be released?
  • What is a principled theoretical basis for asserting that sufficient confidence has been obtained in software-reliant behavior?
  • What types of justification are more or less acceptable?
  • Is a proposed confidence level well justified by sound principles and theories?

Adequate answers to these and similar questions do not exist today. And although the above questions apply to all kinds of software-reliant systems, they are becoming more important as an increasing number of these systems are systems of systems and ultra-large-scale systems.

We are pursuing two research thrusts, developed from interviews with test and evaluation personnel and from other inputs. These thrusts span near-term to long-range technical and transition goals. (Our white paper provides an in-depth discussion of issues in system-of-systems software assurance.)

Thrust 1: Assurance Argumentation

Failure mode, effects, and criticality analysis (FMECA) and fault tree analysis (FTA) are standard techniques for finding design errors in hardware systems. Applying FMECAs and FTAs to software systems has been proposed by others (Haapanen 2002), but given how software systems are architected and documented today, it has never been clear how to trace out the effects of a fault. A structured argument that demonstrates some property of a system, however, captures the reasons the system is believed to work, and an FMECA/FTA-style analysis can be applied to that argument structure. For example, one could hypothesize that some claim is false, only partially true, or true only some of the time (i.e., a "fault" in the argument structure). The next step would be to analyze the effect of this defect on higher-level claims in the argument: how the failure might be manifested in actual system operation, how critical its effect on system behavior would be, how often it would occur, and so on. In short, one develops an estimate of the impact of the claim's being false. Effort devoted to demonstrating that the claim is true "buys down" the impact of its being false. This becomes a measure of the value of each claim in the argument structure and a way to allocate assurance resources based on buy-down of risk.
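The impact-and-buy-down reasoning above can be sketched in a few lines of code. This is only an illustrative model, not a published SoSSA method: the claim names, impact scores, probabilities, and the simple additive risk-propagation rule are all invented assumptions for the sake of the example.

```python
# Hypothetical sketch: treating an assurance argument as a fault tree.
# All names and numbers below are illustrative assumptions, not part of
# any published SoSSA technique.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    name: str
    impact_if_false: float                 # criticality of this claim failing (0..1)
    p_false: float                         # estimated probability the claim is false
    subclaims: List["Claim"] = field(default_factory=list)

    def residual_risk(self) -> float:
        """Risk contributed by this claim and, transitively, its subclaims.

        A false subclaim undermines its parent, so subclaim risk propagates
        upward (here: a deliberately simple additive model, capped at 1.0).
        """
        own = self.impact_if_false * self.p_false
        inherited = sum(c.residual_risk() for c in self.subclaims)
        return min(1.0, own + inherited)

    def buy_down(self, new_p_false: float) -> float:
        """Risk reduction from assurance effort that lowers p_false."""
        return self.impact_if_false * (self.p_false - new_p_false)


# Toy argument: a top-level claim resting on two subclaims.
timing = Claim("scheduler meets timing budget", 0.9, 0.10)
load = Claim("load model covers peak traffic", 0.6, 0.20)
top = Claim("system responds within deadline", 1.0, 0.05, [timing, load])

print(f"residual risk: {top.residual_risk():.2f}")          # -> 0.26
# Testing that lowers the timing claim's p_false to 0.02:
print(f"buy-down from timing tests: {timing.buy_down(0.02):.3f}")
```

Ranking claims by achievable buy-down per unit of assurance effort is then one concrete way to decide where additional testing or analysis pays off most.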

The above approach identifies the impact of a claim’s being false. A complementary approach is to look at what argumentation deficiencies might make the claim false (i.e., what oversights or mistakes might exist in the supporting argument). For example, suppose a proof has been provided that some system property holds. What would make this proof irrelevant to the actual system? Any such defect would mean that the proof has no value in demonstrating that the hypothesized system property holds, and so the overall argument would be weakened. Given this impact, we would look at the various ways in which proofs can be irrelevant (e.g., by being based on an incorrect model of the system’s actual implementation) and see what assurance efforts have been performed to eliminate these possible sources of proof irrelevancy. Based on experience, we can estimate how often experienced engineers make these kinds of mistakes, and we then have an estimate of the likelihood that this kind of assurance deficiency is present. This leads to an estimate of the value of gathering assurance evidence showing that the deficiency is not present.
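The expected-value reasoning in this complementary approach can also be sketched quantitatively. Again, the deficiency list, rates, and costs below are invented for illustration; the point is only the shape of the calculation: (likelihood the deficiency is present) × (impact on the argument) ÷ (cost of evidence ruling it out).

```python
# Hypothetical sketch: valuing assurance evidence that rules out a
# specific argument deficiency (e.g., a proof based on an incorrect
# model of the system's actual implementation). Entries are
# illustrative assumptions, not measured data.

# deficiency -> (likelihood engineers introduce it, impact if present,
#                relative cost of evidence that rules it out)
deficiencies = {
    "proof based on outdated system model": (0.15, 0.9, 4.0),
    "environmental assumption left unstated": (0.10, 0.7, 2.0),
    "analysis tool not qualified": (0.05, 0.4, 1.0),
}


def evidence_value(p_present: float, impact: float, cost: float) -> float:
    """Expected risk removed per unit of assurance effort."""
    return (p_present * impact) / cost


# Rank candidate assurance activities by value per unit cost.
ranked = sorted(deficiencies.items(),
                key=lambda kv: evidence_value(*kv[1]),
                reverse=True)
for name, params in ranked:
    print(f"{name}: value/cost = {evidence_value(*params):.3f}")
```

In practice the likelihood estimates would come from the kind of experience base the paragraph describes (how often experienced engineers make each kind of mistake); the sketch shows only how such estimates would be combined.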

Both approaches provide a new way of evaluating the soundness of structured arguments, and both are promising because they adapt the well-established FMECA and FTA techniques for reasoning about design errors in hardware systems.

Thrust 2: SoS Failure Modes

If we are going to achieve increased confidence in the behavior of a system of systems under all circumstances, we need to understand the ways in which such systems fail, and in particular the failure modes that are distinct from those of monolithic systems (whose evolution and content are completely under the control of a central authority). For example, because SoS constituents evolve independently, the collective set of evolutions can gradually degrade some desired overall SoS quality, such as end-to-end performance for certain threads. How can such degradation be detected and mitigated? What constraints on constituent evolution would help ensure that desired SoS properties are maintained under all (and changing) operational conditions? Conversely, if SoS constituents fail to adapt to changing operational conditions because no individual change can satisfy the collective need, that failure to collaborate might itself be considered a distinct SoS failure mode or failure mechanism.
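The gradual-degradation failure mode can be made concrete with a small sketch. The thread steps, latency figures, and deadline below are invented assumptions: each independent constituent upgrade looks locally benign, yet the sum of the changes eventually breaches an end-to-end SoS requirement.

```python
# Hypothetical sketch: detecting gradual SoS-level degradation of an
# end-to-end thread. Step names, latencies (ms), and the deadline are
# illustrative assumptions, not data from a real system of systems.

SOS_DEADLINE_MS = 500.0

# Per-step latency after successive, independent constituent evolutions.
history = [
    {"sensor": 80, "fusion": 150, "c2": 120, "effector": 100},  # baseline: 450
    {"sensor": 85, "fusion": 150, "c2": 130, "effector": 100},  # 465
    {"sensor": 85, "fusion": 165, "c2": 130, "effector": 110},  # 490
    {"sensor": 90, "fusion": 170, "c2": 135, "effector": 115},  # 510: breach
]


def end_to_end(step_latencies: dict) -> float:
    """Total latency of the thread across all constituent steps."""
    return float(sum(step_latencies.values()))


def first_breach(snapshots, deadline: float):
    """Index of the first evolution step whose cumulative effect breaches
    the SoS deadline, even though no single change violated a local budget."""
    for i, snap in enumerate(snapshots):
        if end_to_end(snap) > deadline:
            return i
    return None


print(first_breach(history, SOS_DEADLINE_MS))  # -> 3
```

The interesting point is that the breach is visible only at the SoS level: monitoring each constituent against its own budget would miss it, which is one motivation for constraints on constituent evolution expressed against end-to-end properties.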

Goal of this Work

Our goal is to make the assurance process more efficient and more effective. By establishing methods for evaluating the soundness of an assurance argument, we will be able to establish criteria for determining which elements of the assurance process contribute most to justified confidence in system behavior. With this information, organizations can make intelligent choices among assurance techniques, concentrating on those that are most efficient. By addressing SoS failure modes, we can begin to identify patterns of failure that will become increasingly important in a world in which systems are increasingly interdependent. Identifying underlying patterns of SoS failure will lead to quicker recognition and mitigation of such failure modes in SoS design, construction, and evolution.