2022 Research Review / DAY 3

Advancing Algorithms for File Deduplication Across Containers

Container software virtually packages and isolates applications for deployment. It can operate over multiple network resources so applications can run in isolated user spaces (containers) in any cloud (or non-cloud) environment. The Department of Defense (DoD) wants to use containers to support its vision of a cloud-to-edge continuum in which capabilities packaged as containers are pushed from the cloud to edge devices to support localized data processing. However, devices deployed at the tactical edge are resource limited and commonly operate over disconnected, intermittently connected, low-bandwidth (DIL) networks or in hostile environments where bad actors are highly likely to try to tamper with them.

Kevin Pitstick
Senior Software Engineer

To address these limitations, we developed an automated container image minimization technology. This technology combines and improves on two minimization approaches: pruning (removing unnecessary files from single images) and deduplication (combining shared files across images into common layers). We focused on advancing the state of the art in deduplication across container images.

Our solution for pushing containers to the tactical edge employs a “greedy” algorithm that creates a new set of layers for every image with minimal duplicates.
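
To make the idea of rebuilding layers around shared files concrete, here is a minimal Python sketch. It is not the project's greedy algorithm, whose details are not given here. It swaps in a simpler grouping strategy: every file is identified by its path and content hash, and files found in exactly the same set of images are placed in one shared layer. The image names, file paths, and function names are all hypothetical, and real OCI layers are ordered tar archives rather than in-memory lists.

```python
import hashlib
from collections import defaultdict


def file_key(path, content):
    """Identify a file by its path and a SHA-256 digest of its bytes."""
    return (path, hashlib.sha256(content).hexdigest())


def build_shared_layers(images):
    """Group files into layers keyed by the set of images that contain them.

    `images` maps a hypothetical image name to {path: content bytes}.
    """
    # Record which images own each distinct (path, digest) pair.
    owners = defaultdict(set)
    for name, files in images.items():
        for path, content in files.items():
            owners[file_key(path, content)].add(name)

    # Files with an identical owner set go into the same layer.
    layers = defaultdict(list)
    for key, names in owners.items():
        layers[frozenset(names)].append(key)
    return dict(layers)


def layers_for_image(layers, name):
    """Return the layers an image needs: each layer whose owner set includes it."""
    return {names: keys for names, keys in layers.items() if name in names}


# Hypothetical two-image deployment sharing one library file.
images = {
    "web": {"/lib/libc.so": b"libc bytes", "/app/web": b"web bytes"},
    "db": {"/lib/libc.so": b"libc bytes", "/app/db": b"db bytes"},
}
layers = build_shared_layers(images)
stored_files = sum(len(keys) for keys in layers.values())
```

In this toy deployment the two images hold four files in total, but only three are stored after grouping, because the shared library lands in a single layer that both images reference. A grouping like this can produce many small layers when sharing patterns vary, which is one reason a practical algorithm may trade some duplication for fewer layers.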

To create this new technology, we developed an algorithm for file deduplication across a collection of container images that can reduce container image storage usage and update bandwidth by 5–15% for multi-container deployments and by 10–30% for pruned container deployments. In our tests with real multi-container image systems, our algorithm deduplicates 100% of shared files and processes 10 images with 225,000 files in approximately 81 minutes.
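
As a rough illustration of how figures like these can be computed, the sketch below measures the fraction of bytes saved when identical files are stored once, and converts the reported run (225,000 files in about 81 minutes) into a files-per-second rate. The function names and sample data are hypothetical. The baseline is also an assumption: it counts every image as stored fully on its own, which may differ from the baseline behind the percentages reported above.

```python
import hashlib


def storage_savings(images):
    """Return (bytes_before, bytes_after, fraction_saved) for a set of images.

    `images` maps a hypothetical image name to {path: content bytes}.
    Assumed baseline: each image stores all of its files separately.
    """
    before = 0
    unique = {}
    for files in images.values():
        for path, content in files.items():
            before += len(content)
            # Identical path and content across images is counted once.
            unique[(path, hashlib.sha256(content).hexdigest())] = len(content)
    after = sum(unique.values())
    saved = 1 - after / before if before else 0.0
    return before, after, saved


def files_per_second(file_count, minutes):
    """Average processing rate for a run of the given length."""
    return file_count / (minutes * 60)


# Hypothetical sample: one 4-byte file shared by both images.
sample = {
    "a": {"/x": b"1234", "/y": b"ab"},
    "b": {"/x": b"1234", "/z": b"cd"},
}
before, after, saved = storage_savings(sample)
rate = files_per_second(225_000, 81)  # roughly 46 files per second
```

For the sample data, 12 bytes shrink to 8, a saving of one third. Real deployments would show smaller ratios when shared files make up a smaller share of each image.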

This project focused on technology that supports the Open Container Initiative (OCI) standard because the DoD aims to avoid vendor lock-in and leverage OCI-compliant containers. Additionally, this project has the potential to accelerate the SEI’s impact by open sourcing minimization algorithms to gain wider interest and adoption from industry and the DoD community.

In Context

This FY2022 project

  • aligns with the SEI technical objective to be trustworthy in construction and implementation and resilient in the face of operational uncertainties, including known and yet unseen adversary capabilities
  • aligns with the SEI technical objective to be affordable such that the cost of acquisition and operations, despite increased capability, is reduced and predictable and provides a cost advantage over our adversaries
  • aligns with the DoD software objective to enhance resilience