Continuous Integration and Data
As more applications move to a DevOps model with CI/CD pipelines, the testing this development model requires inevitably generates large volumes of data. The same holds for large open-source projects, which may see millions of tests executed on a daily basis.
The data produced by such CI systems describes many aspects of the continuous testing system. Engineers with specific domain experience usually parse this data daily to keep the system running smoothly.
After years of experience in the field, we wanted to investigate whether machine learning could help us extract valuable insights from CI data with minimal human intervention.
Open Source and Open Data
The first requirement for any machine learning project is data, and an open dataset was available to us. We used data generated by the OpenStack CI, which runs on Zuul, a CI system. Both Zuul and OpenStack are open source projects.
Open source code, however, does not guarantee that the data produced by the platform is open too. Luckily, the OpenStack community maintains its CI data in the open as well!
Not all the data produced by Zuul is suitable for our machine learning work. Zuul tests code before it is merged into git, following a two-pipeline approach. The check pipeline tests changes that may be broken and that still require human review. Once a change is approved by humans and passes the check pipeline, it is queued into the gate pipeline, where tests are mostly expected to pass. We consider the check pipeline too noisy, while gate represents a clean source of data. Failures in gate may be related to temporary instability in the testing infrastructure, flaky tests, race conditions in the code, or other changes that were merged after the check pipeline last ran.
The data produced by the gate pipeline is what we use to create our datasets.
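As a minimal sketch of that selection step, assume each build is available as a Python dictionary with a "pipeline" field. The field names and the sample records are illustrative assumptions, not the actual schema of the OpenStack CI data.

```python
from typing import Iterable, List


def select_gate_builds(builds: Iterable[dict]) -> List[dict]:
    """Keep only the builds whose (assumed) 'pipeline' field is 'gate'."""
    return [build for build in builds if build.get("pipeline") == "gate"]


# Made-up records, for illustration only.
sample_builds = [
    {"uuid": "a1", "pipeline": "check", "result": "FAILURE"},
    {"uuid": "b2", "pipeline": "gate", "result": "SUCCESS"},
    {"uuid": "c3", "pipeline": "gate", "result": "FAILURE"},
]

gate_builds = select_gate_builds(sample_builds)
```

Records without a "pipeline" field are dropped rather than guessed at, which keeps the resulting dataset conservative.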
Creating a Stable Dataset
The OpenStack community stores CI data only for a limited time, since new data is produced daily and storage is limited. To keep our experiments reproducible we needed a stable dataset, so we decided to pull and filter the data on a daily basis.
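One way to picture the daily pull is a write-once snapshot per day, so data that later expires upstream stays available for reruns. The sketch below is an assumption-laden illustration: the function name, the file naming scheme, the record fields, and the use of a local directory in place of cloud storage are all hypothetical.

```python
import json
import tempfile
from datetime import date
from pathlib import Path
from typing import Iterable


def store_daily_snapshot(builds: Iterable[dict], root: Path, day: date) -> Path:
    """Write one day's gate builds to a dated JSON file and return its path."""
    root.mkdir(parents=True, exist_ok=True)
    target = root / f"gate-builds-{day.isoformat()}.json"
    if target.exists():
        # Snapshots are write-once so earlier experiments stay reproducible.
        return target
    gate_only = [b for b in builds if b.get("pipeline") == "gate"]
    target.write_text(json.dumps(gate_only, indent=2))
    return target


# Usage with made-up records and a temporary directory.
pulled = [
    {"uuid": "a1", "pipeline": "check", "result": "FAILURE"},
    {"uuid": "b2", "pipeline": "gate", "result": "SUCCESS"},
]
snapshot_root = Path(tempfile.mkdtemp())
snapshot = store_daily_snapshot(pulled, snapshot_root, date(2019, 9, 1))
stored = json.loads(snapshot.read_text())
```

Refusing to overwrite an existing snapshot is a design choice: a dataset built from these files keeps the same contents no matter how often the pull is repeated.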
We structured our work into several separate stages. The first one is storing the data for our experiments in the cloud. The second stage is data preparation and visualization, which is often an iterative process. The third stage is establishing our metrics, so we have a clear definition of what we aim to optimize. The final stage is running multiple experiments against the datasets we created, fine-tuning the model and analyzing the results.
We wrote tooling in Python to help us keep track of datasets, experiments and results.
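To give a rough idea of what such bookkeeping can look like, here is a small sketch. It is not the authors' tooling: the `Experiment` and `Registry` names, their fields, and the numbers in the usage example are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Experiment:
    """One training run: which dataset it used, its settings, and its scores."""
    name: str
    dataset: str
    params: Dict[str, float]
    metrics: Dict[str, float] = field(default_factory=dict)


class Registry:
    """In-memory record of datasets and the experiments run against them."""

    def __init__(self) -> None:
        self.datasets: Dict[str, List[str]] = {}
        self.experiments: List[Experiment] = []

    def add_dataset(self, name: str, snapshot_files: List[str]) -> None:
        if name in self.datasets:
            raise ValueError(f"dataset {name!r} already registered")
        self.datasets[name] = list(snapshot_files)

    def record(self, experiment: Experiment) -> None:
        # Every experiment must point at a known dataset.
        if experiment.dataset not in self.datasets:
            raise KeyError(f"unknown dataset {experiment.dataset!r}")
        self.experiments.append(experiment)

    def best(self, metric: str) -> Experiment:
        """Return the experiment with the highest value for the given metric."""
        scored = [e for e in self.experiments if metric in e.metrics]
        return max(scored, key=lambda e: e.metrics[metric])


# Placeholder values, not real results.
registry = Registry()
registry.add_dataset("gate-week-1", ["gate-builds-2019-09-01.json"])
registry.record(Experiment("run-a", "gate-week-1", {"learning_rate": 0.1}, {"accuracy": 0.81}))
registry.record(Experiment("run-b", "gate-week-1", {"learning_rate": 0.01}, {"accuracy": 0.86}))
best_run = registry.best("accuracy")
```

Tying every experiment to a named dataset makes it possible to tell later which results are comparable with each other.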
More about the authors:
Andrea Frittoli is an Open Source Developer Advocate at IBM and Machine Learning enthusiast. He’s a strong advocate for transparency in open source. He likes working on IaaS projects as well as machine learning, trying to combine the two worlds. Andrea has previously been a speaker at FOSSASIA, FOSS Backstage, OpenStack summits, Open Source Summits, and various meetups.
Kyra Wulffert is a solution architect and IT expert with over a decade of experience in the telecommunications industry working in international environments with local and remote teams. She’s a Machine Learning and Open Source enthusiast.