2024 Research Review

Vessel: Modelling Container Reproducibility Failures

This project focuses on creating a foundational model for all container reproducibility efforts (open-source, commercial, U.S. Department of Defense (DoD), etc.). The lack of build reproducibility is a fundamental problem that affects most software built today. While some tools exist to help promote build reproducibility, the research and developer communities lack a holistic model of container build reproducibility and its failure scenarios. This project aims to develop that model, along with tools for verifying reproducibility in container builds.


Kevin Pitstick
Senior Software Engineer

A build is reproducible when, given the same build environment, any party can recreate bit-by-bit identical copies of all specified artifacts. While full build reproducibility may not be feasible in all cases, moving toward reproducibility is essential for the DoD to provide strong guarantees that its build capabilities have not been tampered with by an insider threat. A critical component of trusting the software supply chain is securing the build process from malicious tampering.

In practice, builds are often not reproducible because elements of the build environment rely on external nondeterministic factors (e.g., timestamps, filesystem file ordering, unique ID generation). While many of these factors, such as timestamps, are benign, developers lack sufficient tooling to understand why their container builds aren’t reproducible and whether build differences could be malicious. A lack of build reproducibility can result in loss of trust, changing build behavior, and broken builds, significantly hindering DoD efforts to build and deploy mission capabilities securely and reliably. Making build source code reproducible is a manual, time-intensive process of identifying and fixing reproducibility gaps, and the state of the art lacks sufficient tooling to automate this process.
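
To make the effect of these factors concrete, the following minimal Python sketch packs the same two files into a tar archive twice, with different timestamps and a different file order, and compares SHA-256 digests. The helper names (`build_archive`, `digest`) and the file contents are hypothetical and chosen only for illustration; the sketch is not part of the Vessel tooling.

```python
import hashlib
import io
import tarfile


def build_archive(entries, build_time, normalize=False):
    """Pack (name, data) pairs into an in-memory tar archive.

    With normalize=True, entries are sorted by name and every timestamp is
    fixed to 0, removing two common sources of nondeterminism.
    """
    if normalize:
        entries = sorted(entries, key=lambda entry: entry[0])
        build_time = 0
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = build_time  # stands in for the wall-clock build time
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def digest(blob):
    """Return the SHA-256 hex digest used for bit-by-bit comparison."""
    return hashlib.sha256(blob).hexdigest()


# Hypothetical inputs: identical content, different order and build times.
files = [("app/main.py", b"print('hi')\n"), ("app/util.py", b"X = 1\n")]
shuffled = list(reversed(files))

naive_match = (
    digest(build_archive(files, build_time=1_700_000_000))
    == digest(build_archive(shuffled, build_time=1_700_000_600))
)
normalized_match = (
    digest(build_archive(files, 1_700_000_000, normalize=True))
    == digest(build_archive(shuffled, 1_700_000_600, normalize=True))
)

print("naive builds match:     ", naive_match)
print("normalized builds match:", normalized_match)
```

The naive builds differ even though the file contents are identical, while the normalized builds produce the same digest. Real container builds involve many more inputs than this toy archive, so the sketch shows only the general mechanism.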

This two-year project plans to improve container reproducibility by conducting an empirical analysis of open-source container build files to identify common reproducibility issues. The specific objective of this project is to create a model of container reproducibility failures (reproducibility issues linked to external factors); our goal is for our tools to correctly detect at least 95% of these failure cases. In addition to publishing the results from this project, we plan to release the reproducibility model, its associated dataset, and the code for detecting reproducibility failures.
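
As a rough illustration of what scanning a container build file for reproducibility issues can look like, the Python sketch below flags three widely known risk patterns in a Dockerfile: a base image not pinned by digest, a package index refresh, and a remote download. The rule names, the regular expressions, and the sample Dockerfile are assumptions made for this example; they are not the project's model or its findings.

```python
import re

# Illustrative patterns only: assumptions for this sketch, not project rules.
UPDATE_RE = re.compile(r"\b(?:apt-get|apk)\s+(?:update|upgrade)\b")
DOWNLOAD_RE = re.compile(r"\b(?:curl|wget)\s|^ADD\s+https?://", re.IGNORECASE)


def lint_dockerfile(text):
    """Return (line_number, rule_id) pairs for lines matching a risk pattern."""
    findings = []
    stage_names = set()  # aliases of earlier build stages ("FROM x AS name")
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0].upper() == "FROM":
            # Skip flags such as --platform=... to find the image reference.
            args = [w for w in words[1:] if not w.startswith("--")]
            image = args[0] if args else ""
            is_external = image and image != "scratch" and image.lower() not in stage_names
            if is_external and "@sha256:" not in image:
                # A tag can be re-pointed to different content; a digest cannot.
                findings.append((lineno, "unpinned-base-image"))
            if len(args) >= 3 and args[1].upper() == "AS":
                stage_names.add(args[2].lower())
        if UPDATE_RE.search(line):
            # The package index changes over time, so installs may differ.
            findings.append((lineno, "package-index-refresh"))
        if DOWNLOAD_RE.search(line):
            # Remote content may change between builds.
            findings.append((lineno, "remote-download"))
    return findings


# Hypothetical build file used only to exercise the sketch.
SAMPLE = """\
FROM python:3.12 AS build
RUN apt-get update && apt-get install -y gcc
RUN curl -fsSL https://example.com/tool.sh -o /tool.sh
FROM build
COPY . /app
"""

findings = lint_dockerfile(SAMPLE)
for lineno, rule_id in findings:
    print(f"line {lineno}: {rule_id}")
```

The sketch checks each physical line independently, so it ignores line continuations and build arguments substituted into `FROM`. A realistic linter would need a proper Dockerfile parser.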

Over the first year, we developed the Vessel Diff Tool (an initial version is available at https://github.com/cmu-sei/vessel), which allows post-build comparison of two container images to identify and categorize differences. We have tested the tool by running it on images built from approximately 110 GitHub repositories, and thus far it has categorized, on average, over 90% of the differences in these images. Over the next year, we will improve the diff tool’s rules, expand our datasets to include DoD-relevant repositories, build out pre-build linting and repair tooling, and test these tools with DoD transition partners.
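
The general idea of post-build comparison can be sketched in a few lines of Python. The example below is not the Vessel Diff Tool and does not reflect its rules or categories. It assumes each image has already been flattened into a single filesystem tarball (for example, with `docker export`), and the function names and category labels are hypothetical.

```python
import hashlib
import io
import tarfile


def make_tar(entries):
    """Build an in-memory tar from (name, data, mtime) tuples (demo fixture)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data, mtime in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def index_tar(blob):
    """Map each regular file name to (sha256 of content, mtime)."""
    index = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                index[member.name] = (hashlib.sha256(data).hexdigest(), member.mtime)
    return index


def categorize(blob_a, blob_b):
    """Label every path that differs between two filesystem tarballs."""
    a, b = index_tar(blob_a), index_tar(blob_b)
    report = {}
    for name in sorted(a.keys() | b.keys()):
        if name not in b:
            report[name] = "only-in-first"
        elif name not in a:
            report[name] = "only-in-second"
        elif a[name][0] != b[name][0]:
            report[name] = "content"
        elif a[name][1] != b[name][1]:
            report[name] = "timestamp-only"  # same bytes, different mtime
    return report


# Two hypothetical builds of the same image.
image_a = make_tar([
    ("etc/os-release", b"v1\n", 100),
    ("app/id", b"1234\n", 100),
    ("app/main", b"code", 100),
])
image_b = make_tar([
    ("etc/os-release", b"v1\n", 200),
    ("app/id", b"9876\n", 100),
    ("app/main", b"code", 100),
    ("tmp/cache", b"", 100),
])

report = categorize(image_a, image_b)
for path, category in report.items():
    print(f"{path}: {category}")
```

Separating timestamp-only differences from content differences is what lets a reviewer focus on the changes that may matter. Real images also carry layer metadata, ownership, permissions, and other attributes that this sketch ignores.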

Figure 1: Overview of Vessel Tooling Inserted in the Container Build Process

In Context: This FY2023-24 Project

  • leverages SEI expertise and experience in software engineering and working with the DoD
  • aligns with the CMU SEI technical objective to modernize software engineering and acquisition
  • aligns with the OUSD(R&E) critical technology priority of leveraging advanced computing and software and the DoD Software Modernization Strategy focus on automated build practices utilizing hardened software containers

Mentioned in this Article

“Reproducible Builds — a set of software development practices that create an independently-verifiable path from source to binary code.” https://reproducible-builds.org/ (accessed Jan. 24, 2023).