
Characterizing and Detecting Mismatch in ML-Enabled Systems

Created July 2021

The development of machine learning-enabled systems typically involves three separate workflows with three different perspectives—data scientists, software engineers, and operations. The mismatches that arise can result in failed systems. We developed a set of machine-readable descriptors for elements of ML-enabled systems to make stakeholder assumptions explicit and prevent mismatch.

Developing ML Systems Involves Three Separate and Different Workflows

A machine learning (ML) model is trained to recognize people and vehicles in a desert setting. The model performs well on its training and evaluation data sets and is deployed to the field. But in the field, the model is expected to recognize people and vehicles in an urban setting. This simple example illustrates one way that a deployed ML system can fail.

The promise of ML to improve solutions to data-driven problems, along with the growing availability of ML frameworks and tools, is driving an explosion in the use of ML techniques in software systems. But the end-to-end development, deployment, and operation of ML-enabled systems remain a challenge.

The problems are many: Model training data differs from operational data, so model accuracy is poor. The trained model's inputs and outputs are incompatible with operational data types, so integrators must write large amounts of glue code. The computing resources required to execute the ML model differ from those available in the production environment, so system performance is poor. Monitoring tools are not set up to detect diminishing model accuracy, so operators do not know when it is time to retrain the model.
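
As a concrete illustration of the first two problems, the sketch below (not the SEI tooling; the data, column names, and types are purely hypothetical) compares the schema of the training data against a sample of operational data before deployment, surfacing the kind of disagreement that would otherwise show up only as poor accuracy or glue-code workarounds in the field.

```python
# Minimal sketch: detect schema disagreements between training and operational data.
# Column names and values are illustrative assumptions, not from the SEI study.
import pandas as pd

def schema_mismatches(train_df: pd.DataFrame, ops_df: pd.DataFrame) -> list:
    """Return human-readable descriptions of column/type disagreements."""
    problems = []
    train_types = dict(train_df.dtypes.items())
    ops_types = dict(ops_df.dtypes.items())
    for col, dtype in train_types.items():
        if col not in ops_types:
            problems.append(f"column '{col}' present in training data but missing operationally")
        elif ops_types[col] != dtype:
            problems.append(f"column '{col}' is {dtype} in training but {ops_types[col]} operationally")
    for col in ops_types:
        if col not in train_types:
            problems.append(f"operational column '{col}' never seen during training")
    return problems

# Example with hypothetical data: a renamed column and a changed type.
train = pd.DataFrame({"speed_kph": [42.0, 61.5], "vehicle_class": ["car", "truck"]})
ops = pd.DataFrame({"speed_mph": [26.1], "vehicle_class": [3]})
for p in schema_mismatches(train, ops):
    print("MISMATCH:", p)
```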

The cause of many of these problems is what we call ML mismatch: a problem that occurs in the development, deployment, or operation of an ML-enabled system because different stakeholders make incorrect assumptions about system elements. A challenge with deploying ML systems in production environments is that their development and operation involve three perspectives, with three different and often completely separate workflows and people: the data scientist builds the model, the software engineer integrates the model into a larger system, and operations staff deploy, operate, and monitor the system. Because these stakeholders operate separately and often use different terminologies, mismatches arise between the assumptions each perspective makes about the elements of the ML-enabled system and the actual guarantees each element provides.

We further posit that each ML mismatch can be traced to information that, had it been shared explicitly between stakeholders, would have prevented the mismatch. Therefore, if stakeholders had a standardized and easy way to share this information explicitly, they could avoid ML mismatch.

Software Engineering for Machine Learning

If data scientists, software engineers, and operations staff made explicit the information required for properly defining, building, integrating, and operating the model, they could detect ML mismatch before a system experiences negative consequences. We developed descriptors for elements of ML-enabled systems by eliciting examples of mismatch from practitioners and formalizing the definition of each mismatch in terms of the data needed to detect it.

In Phase 1 of our study, we conducted interviews with practitioners working in data scientist, software engineer, and operations roles on ML-enabled systems to identify mismatches and their consequences. We then validated the interview results with a practitioner survey. We grouped the examples of mismatch into seven categories, such as raw data, trained model, and production environment. Survey results indicated that the information contained in the descriptors is important or very important to share to avoid mismatch, and that which information matters most varies by role. The full details of the interview and survey results are available in a paper presented at the First Workshop on AI Engineering – Software Engineering for AI at ICSE 2021 (see Learn More below).

We also conducted a multi-vocal literature study to identify best practices for the software engineering of ML-enabled systems that could address the mismatches identified in the interviews and survey. From the primary studies identified in this review, we extracted or inferred attributes for documenting elements of ML-enabled systems. The result is a set of descriptors that codify the attributes of system elements and therefore make all assumptions explicit.
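
To make the idea concrete, here is a minimal sketch of what a machine-readable descriptor for one system element, a trained model, might contain. The attribute names and values below are illustrative assumptions, not the actual fields of the SEI descriptors.

```python
# Sketch of a machine-readable descriptor for a trained model element.
# All fields and values are hypothetical, chosen to show the kind of
# assumptions worth making explicit across stakeholder roles.
from dataclasses import dataclass, asdict
import json

@dataclass
class TrainedModelDescriptor:
    model_name: str
    version: str
    training_data_source: str   # where the training data came from
    input_schema: dict          # expected feature names and types
    output_schema: dict         # e.g., class labels and score ranges
    evaluation_metric: str      # metric reported by the data scientist
    evaluation_value: float
    required_memory_gb: float   # computing resources assumed at inference time
    retraining_trigger: str     # condition under which retraining is expected

# Filled in by the data scientist and shared with engineers and operations:
descriptor = TrainedModelDescriptor(
    model_name="desert-object-detector",
    version="1.2.0",
    training_data_source="s3://example-bucket/desert-imagery-2020",  # hypothetical path
    input_schema={"image": "uint8[1024,1024,3]"},
    output_schema={"classes": ["person", "vehicle"], "score": "float in [0,1]"},
    evaluation_metric="mAP@0.5",
    evaluation_value=0.87,
    required_memory_gb=4.0,
    retraining_trigger="mAP on monitored sample drops below 0.75",
)
print(json.dumps(asdict(descriptor), indent=2))
```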

The recommended descriptors give stakeholders examples of the information to request from the data scientists or ML engineers developing models, or from the business users who define requirements for models. System stakeholders can use them to consistently document and share system attributes, including information provided by third parties. The information in the descriptors can also serve as checklists as ML-enabled systems are developed.

Best practices and tools to support the software engineering of ML-enabled systems, and their subsequent deployment, operation, and sustainment, are still in their infancy. Our end goal is to develop empirically validated practices for the development, deployment, operation, and evolution of ML-enabled systems.

Looking Ahead

In Phase 2 of our study, we performed a mapping between the mismatches and system element attributes. For each mismatch, we identified the set of attributes that could be used to detect that mismatch. The descriptors are the result of a gap analysis of this mapping. We are working on the report of Phase 2 and will make the resulting descriptors publicly available.
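
As a rough sketch of how such a mapping could support detection, the example below associates hypothetical mismatches with the descriptor attributes needed to detect them and reports, for a given set of documented attributes, which mismatches are detectable and which attributes are still missing. The mismatch names and attributes are illustrative, not the published Phase 2 mapping.

```python
# Sketch of a mismatch-to-attribute mapping and a gap-analysis-style check.
# Names are illustrative assumptions, not the SEI study's actual mapping.
MISMATCH_TO_ATTRIBUTES = {
    "poor accuracy on operational data": ["training_data_source", "operational_data_source"],
    "incompatible model input/output": ["input_schema", "output_schema", "system_data_types"],
    "insufficient computing resources": ["required_memory_gb", "available_memory_gb"],
    "undetected model degradation": ["evaluation_metric", "retraining_trigger", "monitoring_metrics"],
}

def detectable_mismatches(documented_attributes: set) -> dict:
    """For each mismatch, list the attributes still missing from the shared descriptors."""
    return {
        mismatch: [a for a in required if a not in documented_attributes]
        for mismatch, required in MISMATCH_TO_ATTRIBUTES.items()
    }

# Example: attributes the stakeholders have documented so far (hypothetical).
documented = {"training_data_source", "input_schema", "output_schema", "required_memory_gb"}
for mismatch, missing in detectable_mismatches(documented).items():
    status = "detectable" if not missing else f"needs: {', '.join(missing)}"
    print(f"{mismatch}: {status}")
```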

Our next steps are to develop tools that automate mismatch detection and to extend the descriptors to support testing of ML components before they are put into production. Our vision is to create a community around tool development and descriptor extensions.

Learn More

Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems

May 31, 2021 Conference Paper
Grace Lewis, Stephany Bellomo, Ipek Ozkaya

This paper reports findings from a study of mismatches in end-to-end development of machine-learning-enabled systems and implications for improving development...


Software Engineering for Machine Learning

January 26, 2021 Webcast
Grace Lewis, Ipek Ozkaya

In this webcast, Grace Lewis provides perspectives involved in the development and operation of ML...


Poster - Characterizing and Detecting Mismatch in ML-Enabled Systems

November 03, 2020 Poster
Grace Lewis

Descriptors for machine learning system elements make stakeholder assumptions explicit and prevent...


Detecting Mismatches in Machine-Learning Systems

June 29, 2020 Blog Post
Grace Lewis

The use of machine learning (ML) could improve many business functions and meet many needs for organizations. For example, ML capabilities can be used to suggest products to users based on purchase...
