2021 Research Review / DAY 1

README: A Learned Approach to Augmenting Software Documentation

Modern software documentation processes in development, security, and operations (DevSecOps) software development lifecycles (SDLCs) are inadequate, time consuming, and difficult to quantify. Anecdotally, any software documentation process can be painful [Shorter 2020]. For any given continuous integration/continuous deployment (CI/CD) SDLC methodology, crafting and maintaining high-quality software documentation content can be a subjective, tedious, and meticulous process that requires significant understanding and domain knowledge. Additionally, in modern Agile CI/CD or DevSecOps sprint paradigms, human-in-the-loop (HITL) documentation tasks act as blockers that detract from the metrics used to gauge development success. This situation inspires negative perceptions of current documentation processes and leads to efforts to mitigate the blocker through substandard (or even non-existent) iterative documentation.

The README research initiative is a strategic step toward a generative process for descriptive content in modern DoD DevSecOps SDLCs. The README proof of concept (POC) is not a templating engine. Rather, the primary differentiator between the README POC approach and emerging approaches in Development, Documentation, and Operations (DevDocOps) is that software documentation content is inferred directly from the underlying source code, backed by the SDLC cadence via DevSecOps policy. The README approach uses a modular machine learning (ML) architecture to learn, in an unsupervised manner, the nuanced associations between Python 3.8 source code and the corresponding descriptive software engineering (SWE) language from the commit histories of thousands of publicly available open source repositories. The README POC release establishes a viable cross-domain forward-inference POC model learned from software repositories, and a minimum viable product (MVP) DevSecOps service prototype of the model as an exemplar.

The README ML cross-domain translation architecture is defined as a latent translation bridging model nested between two pretrained models over orthogonal data modalities [Tian 2019]. The README project refers to this nesting of pretrained models for cross-domain latent translation as the “Matryoshka Technique.” The technique facilitates domain modularity and a deeper forward network by reusing nested pretrained models. It also provides a modular experimental harness for training and validation (T&V) across multiple pretrained models, under varying pretrained hyperparameter configurations, in order to learn a nested, shared latent space between them.
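The nesting idea can be illustrated with a minimal, standard-library Python sketch. It is not the README implementation: the two frozen random linear maps only stand in for pretrained models, the `LatentBridge` class, the dimensions, and the synthetic paired latents are all hypothetical, and a plain linear bridge trained by gradient descent replaces the actual bridging model. What it shows is the structure: the outer models are reused unchanged, and only the innermost bridge is trained.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Network = Callable[[Sequence[float]], Vector]


def frozen_linear(in_dim: int, out_dim: int, seed: int) -> Network:
    """Fixed random linear map standing in for a pretrained, frozen network."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x: Sequence[float]) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply


class LatentBridge:
    """Trainable linear map between two latent spaces (hypothetical stand-in)."""

    def __init__(self, in_dim: int, out_dim: int) -> None:
        self.weights = [[0.0] * in_dim for _ in range(out_dim)]

    def __call__(self, z: Sequence[float]) -> Vector:
        return [sum(w * zi for w, zi in zip(row, z)) for row in self.weights]

    def sgd_step(self, z_src: Sequence[float], z_tgt: Sequence[float], lr: float) -> None:
        """One gradient step on the squared error between bridge(z_src) and z_tgt."""
        errors = [p - t for p, t in zip(self(z_src), z_tgt)]
        for row, err in zip(self.weights, errors):
            for j, zj in enumerate(z_src):
                row[j] -= lr * 2.0 * err * zj


def mean_loss(bridge: LatentBridge, pairs: Sequence[Tuple[Vector, Vector]]) -> float:
    """Average squared error of the bridge over paired latents."""
    total = 0.0
    for z_src, z_tgt in pairs:
        total += sum((p - t) ** 2 for p, t in zip(bridge(z_src), z_tgt))
    return total / len(pairs)


# Assumed sizes: 6 input features -> 4-d code latent -> 3-d text latent -> 5 outputs.
code_encoder = frozen_linear(6, 4, seed=1)  # frozen; stands in for a pretrained encoder
text_decoder = frozen_linear(3, 5, seed=2)  # frozen; stands in for a pretrained decoder
bridge = LatentBridge(4, 3)                 # the only component that is trained

# Synthetic paired latents: a hidden linear map plays the role of the true association.
hidden_map = frozen_linear(4, 3, seed=3)
rng = random.Random(0)
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(6)] for _ in range(32)]
pairs = [(code_encoder(x), hidden_map(code_encoder(x))) for x in inputs]

initial_loss = mean_loss(bridge, pairs)
for _ in range(50):
    for z_src, z_tgt in pairs:
        bridge.sgd_step(z_src, z_tgt, lr=0.01)
final_loss = mean_loss(bridge, pairs)


def translate(features: Sequence[float]) -> Vector:
    """Full nested forward pass: frozen encoder -> trained bridge -> frozen decoder."""
    return text_decoder(bridge(code_encoder(features)))


output = translate(inputs[0])
print(f"bridge loss: {initial_loss:.4f} -> {final_loss:.4f}; output size: {len(output)}")
```

Because the outer models are passed in as plain callables, swapping either one for a differently configured pretrained model only requires matching the bridge's input and output dimensions, which is the kind of modularity the harness is described as providing.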

[Figure: Python CFG GRU Encoder]

For a generative process for software documentation content, the cross-domain latent translation ML model identified through this README research initiative, on the basis of reconstructing each pretrained model’s intermediate latent encodings, is a conditional variational auto-encoder (CVAE). The CVAE is nested between a pretrained encoder from AST2VEC and a decoder from Seq2Seq_SO with StackOverflow SWE vocabularies and word-similarity embeddings [Subramanian 2020; Paaßen 2021; Cho 2014; Efstathiou 2018].
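As a rough illustration of what a CVAE computes, the following standard-library Python sketch shows the reparameterization step, the closed-form KL term, and conditioning by concatenation. It is a sketch under assumptions, not the README model: `toy_encode` and `toy_decode` are untrained placeholders, the vectors are made up, and conditioning the VAE on the code-side latent while reconstructing the language-side latent is one assumed arrangement that the source does not spell out.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def reparameterize(mu: Sequence[float], log_var: Sequence[float], rng: random.Random) -> Vector:
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def cvae_loss(recon: Sequence[float], target: Sequence[float],
              mu: Sequence[float], log_var: Sequence[float], beta: float = 1.0) -> float:
    """Squared reconstruction error plus a beta-weighted KL regularizer."""
    recon_error = sum((r - t) ** 2 for r, t in zip(recon, target))
    return recon_error + beta * kl_to_standard_normal(mu, log_var)


def cvae_forward(
    target_latent: Sequence[float],
    condition: Sequence[float],
    encode: Callable[[Vector], Tuple[Vector, Vector]],
    decode: Callable[[Vector], Vector],
    rng: random.Random,
) -> Tuple[Vector, Vector, Vector]:
    """Condition both halves of the VAE by concatenating the condition vector."""
    mu, log_var = encode(list(target_latent) + list(condition))
    z = reparameterize(mu, log_var, rng)
    recon = decode(z + list(condition))
    return recon, mu, log_var


def toy_encode(x: Vector) -> Tuple[Vector, Vector]:
    """Untrained placeholder: maps a 6-d input to a 2-d mean and log-variance."""
    return [0.5 * x[0], 0.5 * x[1]], [-2.0, -2.0]


def toy_decode(v: Vector) -> Vector:
    """Untrained placeholder: mixes the 2-d sample with the 3-d condition."""
    z, cond = v[:2], v[2:]
    return [z[0] + 0.1 * c for c in cond]


rng = random.Random(0)
text_latent = [0.2, -0.1, 0.4]  # hypothetical latent on the language side
code_latent = [1.0, 0.5, -0.5]  # hypothetical latent from the code encoder

recon, mu, log_var = cvae_forward(text_latent, code_latent, toy_encode, toy_decode, rng)
loss = cvae_loss(recon, text_latent, mu, log_var)
print(f"reconstruction size: {len(recon)}, loss: {loss:.4f}")
```

The `beta` weight is included because trading off reconstruction quality against the KL regularizer is a common tuning knob for VAEs; whether README used such a weight is not stated in the source.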

The README research initiative successfully answers its research question: the Matryoshka Technique of nesting pretrained models to learn a cross-domain latent translation between source code snippets and subjective SWE language is viable. This result establishes the efficacy of a general approach that facilitates domain modularity and a deeper forward network by reusing nested pretrained models.

README will produce the following outcomes and deliverables:

  • README DevSecOps SDLC MVP Prototype Service; Containerized Deployment Service Prototype
  • README: A Learned Approach to Augmenting Software Documentation technical report

In Context

This FY2019-21 project

  • contributes to the SEI’s strong portfolio of ongoing work in modernizing software development and acquisitions, AI, and autonomy
  • aligns with the CMU SEI technical objective to bring capabilities that make new missions possible or improve the likelihood of success of existing ones
  • aligns with the CMU SEI technical objective to be timely to enable the DoD to field new software-enabled systems and upgrades faster than our adversaries
  • aligns with the CMU SEI technical objective to be affordable such that the cost of acquisition and operations, despite increased capability, is reduced and predictable and provides a cost advantage over our adversaries

Mentioned in this Article

[Shorter 2020]

Shorter, Cameron. What is good documentation for software projects? April 6, 2020.

[Paaßen 2021]

Paaßen, B., McBroom, J., Jeffries, B., Koprinska, I., and Yacef, K. (2021). Mapping Python Programs to Vectors using Recursive Neural Encodings. Journal of Educational Data Mining. [In press.]

[Cho 2014]

Cho, K., Merrienboer, B.V., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP. 2014.

[Efstathiou 2018]

Efstathiou, V., Chatzilenas, C., & Spinellis, D. Word embeddings for the software engineering domain. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18). Association for Computing Machinery, New York, New York. Pages 38–41. 2018.

[Subramanian 2020]

Subramanian, A. (2020). PyTorch-VAE.

[Tian 2019]

Tian, Y., & Engel, J. Latent Translation: Crossing Modalities by Bridging Generative Models. arXiv:1902.08261. 2019.