2022 Research Review / DAY 1

Automating Mismatch Detection and Testing in ML Systems

As the DoD adopts machine learning (ML) to solve mission-critical problems, it faces an inability to detect and avoid inconsistencies among the assumptions and decisions made by data science/ML engineering, software engineering, and operations stakeholders. Examples include (1) poor model accuracy because model training data differs from production data, (2) system failure due to inadequate testing because developers could not produce appropriate test cases or lacked access to test data, and (3) monitoring tools not set up to detect model-related problems such as diminishing accuracy. We therefore define ML mismatch as a problem that occurs in the development, deployment, and operation of an ML-enabled system due to incorrect assumptions made about system elements by different stakeholders, resulting in negative consequences. ML mismatch can lead to delays, rework, and failure in the development, deployment, and evolution of ML-enabled systems.
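To make mismatch example (1) concrete, the sketch below checks a batch of production inputs against summary statistics recorded at training time and flags features whose distribution has drifted. This is a minimal illustration, not the SEI tooling: the feature names, the statistics format, and the two-sigma threshold are all hypothetical.

```python
# Illustrative sketch (not the SEI tool): detecting mismatch example (1),
# where production data drifts from the data the model was trained on.
from statistics import mean

def drift_report(train_stats: dict, prod_sample: dict, max_shift: float = 2.0) -> list:
    """Flag features whose production mean falls far from the training mean,
    measured in training standard deviations."""
    findings = []
    for feature, values in prod_sample.items():
        if feature not in train_stats:
            findings.append(f"{feature}: seen in production, absent from training data")
            continue
        mu, sigma = train_stats[feature]
        shift = abs(mean(values) - mu) / sigma if sigma else float("inf")
        if shift > max_shift:
            findings.append(f"{feature}: production mean is {shift:.1f} sigma from training mean")
    return findings

# Hypothetical training-time summary statistics: (mean, std) per feature.
train_stats = {"age": (34.0, 8.0), "income": (52_000.0, 9_500.0)}
# A hypothetical production batch whose 'income' distribution has drifted.
prod_sample = {"age": [33.0, 36.5, 31.0], "income": [110_000.0, 95_000.0, 120_000.0]}

for finding in drift_report(train_stats, prod_sample):
    print("MISMATCH:", finding)
```

Running this flags only the drifted income feature, illustrating the kind of check that is cheap to automate at deployment time but expensive to diagnose once the model is already misbehaving in production.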

It is common knowledge in software engineering that the later a fault is detected, the more expensive it is to fix, and the same holds for late discovery of ML mismatches. For example, in ML-enabled systems, differences between training data and production data are often not discovered until field testing or production. According to NIST, it is 15 times more expensive to fix a bug detected during system testing and 30 times more expensive if it is detected in production [NIST 2002]. Without automated and extensible tools for ML mismatch detection, incorrect assumptions made about system elements by different stakeholders are discovered too late in the development of ML-enabled systems.

This project builds on a set of SEI-developed descriptors for elements of ML-enabled systems [Lewis 2021]. We are developing a suite of tools to (1) automate ML mismatch detection and (2) demonstrate how to extend descriptors to support testing of ML-enabled systems. The tools will also support descriptor validation on open source and DoD ML systems and components. For testing, we are explicitly focusing on the production readiness of ML components, which we define in terms of several attributes: ease of integration, testability, monitorability, maintainability, and quality, where quality means meeting both model requirements and system requirements.
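As a rough illustration of how machine-readable descriptors can enable automated mismatch detection, the hedged sketch below compares a hypothetical trained-model descriptor (filled in by data science/ML engineering stakeholders) against a hypothetical production-environment descriptor (filled in by operations stakeholders). The descriptor fields shown here are invented for illustration; the actual descriptor attributes were derived empirically in [Lewis 2021].

```python
# Illustrative sketch (not the SEI tool): automated mismatch detection over
# machine-readable descriptors. All field names below are hypothetical.
trained_model = {            # descriptor from data science / ML engineering
    "input_schema": {"age": "float", "income": "float"},
    "eval_metric": "f1",
}
production_env = {           # descriptor from operations
    "available_fields": {"age": "float", "income": "int"},
    "monitored_metrics": ["latency"],
}

def detect_mismatches(model: dict, env: dict) -> list:
    """Cross-check two stakeholder descriptors and report inconsistencies."""
    findings = []
    for field, dtype in model["input_schema"].items():
        actual = env["available_fields"].get(field)
        if actual is None:
            findings.append(f"model input '{field}' is not produced in operations")
        elif actual != dtype:
            findings.append(f"'{field}' is {dtype} in training but {actual} in production")
    if model["eval_metric"] not in env["monitored_metrics"]:
        findings.append(f"model metric '{model['eval_metric']}' is not monitored in operations")
    return findings

for finding in detect_mismatches(trained_model, production_env):
    print("MISMATCH:", finding)
```

This toy check surfaces two of the mismatch patterns described above, a type inconsistency between training and production data and a model metric that operations is not monitoring, before the system is deployed rather than after it fails.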

This project’s end goal is for DoD organizations to adopt descriptors and tools for early mismatch detection and production-readiness test and evaluation as part of their ML-enabled system development process. To this end, this project contributes to and advances the SEI’s modernizing software development and acquisition objective by formalizing the detection of ML mismatch, improving testing practices for ML-enabled systems, and providing tool support that in the long run can be integrated into ML-enabled system development toolchains.

In Context

This FY2021–23 project

  • builds on the results of our FY20 Characterizing and Detecting Mismatch in ML Systems project that empirically defined and validated the information that explicitly needs to be shared between stakeholders to avoid ML mismatch
  • aligns with the CMU SEI technical objective to reduce the cost of acquisition and operations, despite increased capability, and make future costs more predictable by reducing delays, rework, and failure
  • aligns with the CMU SEI technical objective to be trustworthy in construction and implementation and resilient in the face of operational uncertainties
  • aligns with the DoD software strategy to accelerate the delivery and adoption of AI
Mentioned in this Article

[Lewis 2021]
Lewis, Grace A.; Bellomo, Stephany; & Ozkaya, Ipek. Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems. In 2021 IEEE/ACM 1st Workshop on AI Engineering - Software Engineering for AI (WAIN), pages 133–140. 2021. https://doi.org/10.1109/WAIN52551.2021.00028

[NIST 2002]
NIST. The Economic Impacts of Inadequate Infrastructure for Software Testing. May 2002. https://www.nist.gov/system/files/documents/director/planning/report02-3.pdf