Automating Container Minimization for the Edge

Created November 2024

Isolated user spaces known as containers have become a popular standard for packaging and distributing software applications at the tactical edge. However, devices in this environment have limited resources. The SEI has developed an algorithm that automates the process of removing unused files and combining duplicate files to reduce storage waste and update bandwidth, allowing Department of Defense (DoD) organizations to field more capability per size, weight, and power (SWaP) at faster deployment speeds. Removing unused files has a further benefit: any software vulnerabilities present in those files are removed along with them.

The Challenge of Deploying and Using Demanding Containers in a Limited Environment

The DoD wants to use containers to support its vision of a cloud-to-edge continuum in which capabilities packaged as containers are pushed from the cloud to edge devices to support localized data processing. The tactical edge environment presents many challenges, including

  • limits in storage space and computing power
  • denied, degraded, intermittent, and limited-bandwidth (DDIL) networks
  • high likelihood of bad actors trying to tamper with devices

The way containers are currently built amplifies these challenges. Container images are typically significantly larger than necessary, with much of their size wasted by unused or duplicated files. Images can be built in a way that reduces this waste, but doing so is a more difficult process that requires deliberate development and detailed planning of layers. Large containers require greater transfer bandwidth and take a greater toll on device storage and the edge network. When containers demand more SWaP than the edge environment can provide, new capabilities cannot be deployed. In addition, the larger the container, the more vulnerabilities it is likely to carry and the more surface area adversaries have to exploit.
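
One way to see where an image's bulk lives is to inspect the layer blobs inside the archive produced by docker save. The short Python sketch below reports per-layer sizes; the archive name image.tar is a hypothetical example (e.g., from docker save myimage -o image.tar), and the member-name checks are a simplification that covers both the legacy and OCI archive layouts.

    import tarfile

    # Report the size of each layer blob inside a `docker save` archive.
    # Layers appear as *.tar members (legacy layout) or under blobs/ (OCI
    # layout, where small config/manifest blobs will also be listed).
    with tarfile.open("image.tar") as archive:
        blobs = [m for m in archive.getmembers()
                 if m.isfile() and (m.name.endswith(".tar")
                                    or m.name.startswith("blobs/"))]
        total = sum(m.size for m in blobs)
        for m in sorted(blobs, key=lambda m: -m.size):
            print(f"{m.size / 1_048_576:8.1f} MiB  {m.name}")
        print(f"{total / 1_048_576:8.1f} MiB  total across {len(blobs)} blobs")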

To address these challenges, the SEI created an automated container minimization technology to minimize the storage size of a set of container images. This technology reduces storage waste without negatively impacting functionality and advances the state of the art in deduplication across container images.

A Greedy Algorithm That Prunes and Deduplicates

There are two main sources of storage waste in container images: unused files (such as development files) and duplicated files (identical files stored in different layers). The SEI’s Container Minimization Tool (CMT) automates the process of pruning and deduplicating. Pruning removes unnecessary or unused files, and deduplicating combines shared files from multiple images into a common container layer. Crucially, minimization preserves behavior: the applications in a set of images, say Images A, B, and C, should run exactly the same after minimization as before.
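
Deduplication hinges on identifying byte-identical files across images, which can be done by content hash. The following is a minimal sketch of that identification step, assuming each image has already been unpacked into a local directory; the directory names image_a, image_b, and image_c are hypothetical stand-ins for Images A, B, and C, and the CMT itself works on images and their layers rather than unpacked directories.

    import hashlib
    from collections import defaultdict
    from pathlib import Path

    def index_by_digest(image_dirs):
        """Map each file's SHA-256 digest to the images that contain it."""
        owners = defaultdict(set)   # digest -> names of images holding the file
        sizes = {}                  # digest -> file size in bytes
        for image in image_dirs:
            for path in Path(image).rglob("*"):
                if path.is_file():
                    digest = hashlib.sha256(path.read_bytes()).hexdigest()
                    owners[digest].add(image)
                    sizes[digest] = path.stat().st_size
        return owners, sizes

    # Files present in two or more images are candidates for a shared layer.
    owners, sizes = index_by_digest(["image_a", "image_b", "image_c"])
    shared = {d for d, imgs in owners.items() if len(imgs) >= 2}
    saved = sum(sizes[d] * (len(owners[d]) - 1) for d in shared)
    print(f"{len(shared)} shared files; up to {saved:,} bytes reclaimable")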

At a high level, the CMT breaks up a set of container images into their individual files, reorganizes the layers, and reproduces a set of images and layers. Ultimately, this process reduces the storage and network costs of transferring these container images from the cloud to the edge. The algorithm takes into consideration operational costs (the cost of too many layers), storage costs (the cost of duplicate files in layers), and network costs (the cost of too few layers).
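
The trade-off between these costs can be illustrated with a toy greedy pass: group files by the exact set of images that share them, then promote a group to its own common layer only when the storage it saves exceeds a fixed per-layer cost. The LAYER_COST constant and the group-by-owner-set heuristic below are illustrative assumptions, not the CMT's actual cost model.

    from collections import defaultdict

    LAYER_COST = 4096  # assumed fixed overhead per additional layer (bytes-equivalent)

    def plan_layers(owners, sizes):
        """Greedily promote file groups shared by the same set of images
        into common layers, largest storage savings first.

        owners: file digest -> set of image names containing that file
        sizes:  file digest -> file size in bytes
        """
        groups = defaultdict(list)  # frozenset(images) -> files shared by exactly those images
        for digest, images in owners.items():
            if len(images) >= 2:
                groups[frozenset(images)].append(digest)

        def savings(images, digests):
            # Bytes that stop being duplicated if these files move to one shared layer.
            return sum(sizes[d] * (len(images) - 1) for d in digests)

        layers = []
        for images, digests in sorted(groups.items(),
                                      key=lambda kv: -savings(kv[0], kv[1])):
            if savings(images, digests) > LAYER_COST:  # the layer must pay for itself
                layers.append((images, digests))
        return layers

    # Toy input: file names stand in for content digests; sizes are in bytes.
    owners = {"libfoo.so": {"image_a", "image_b", "image_c"},
              "config.yml": {"image_a", "image_b"},
              "app_c.bin": {"image_c"}}
    sizes = {"libfoo.so": 1_800_000, "config.yml": 250_000, "app_c.bin": 64_000}

    for images, digests in plan_layers(owners, sizes):
        print(f"shared layer for {sorted(images)}: {digests}")

A fuller model would also charge for having too few layers, since packing unrelated files into one large shared layer forces edge devices to re-download that entire layer whenever any file in it changes.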

The SEI ran time-profiling and deduplication experiments on real sets of images. The experiments measured how the number of files affects the time the algorithm needs to deduplicate, and how the number of duplicated files affects its ability to deduplicate. Test cases included ClearML, a machine-learning platform frequently used for experimentation, and Stan’s Robot Shop, an open source container project.

The results? The deduplication algorithm alone can reduce the storage required for container images and the bandwidth required to pull those images by up to 5–15% for multi-container deployments. When combined with pruning unused files, the deduplication algorithm can reduce container image storage up to 10–30%. In either case, 100% of shared files (files used by two or more images) were deduplicated. The CMT can run these algorithms quickly, processing 10 images with 225,000 files in approximately 51 minutes.

Looking Ahead

This project has the potential to allow DoD organizations to field more capability per SWaP at faster deployment speeds while reducing the number of software vulnerabilities that may be present in unused files. As the SEI continues work on this project, next steps include the following:

  • Increase the algorithm speed to accommodate larger sets of images.
  • Add an optimized special case when developers update only a subset of images.
  • Test and evaluate the algorithm with other sets of real-world images.
  • Release code as open source.
  • Publish the testing results.