Editor’s note: Aditya is a speaker for ODSC APAC 2021. Check out his talk, “Machine Learning for Software Engineering,” there!
Today, society critically depends on software. The Covid-19 pandemic has further accelerated the transition to a software-driven world. As the number of software applications grows, so does the demand for software engineers. But the size and complexity of the source code that engineers have to deal with are growing rapidly, making their job harder. Does the key to improving the productivity of software engineers and the quality of software lie in those large codebases themselves?
Machine learning methods have been incredibly successful in gleaning useful statistical information out of complex data. This has resulted in significant progress in the fields of natural language, speech, and vision. Can we develop machine learning techniques that make sense of the large amount of source code available on GitHub and Bitbucket? This prospect has attracted a lot of media attention recently, thanks to announcements such as GitHub Copilot, which can automatically generate source code from method signatures and docstrings, and TransCoder from Facebook, which performs unsupervised machine translation between programming languages.
Unlike natural languages, programming languages have well-defined syntax and semantics, which can be modeled mathematically. Source code can therefore be subjected to formal analysis, and many tools exist that can perform deep semantic analysis of software. Why do we need machine learning if we can analyze programs algorithmically? The answer lies in the statistical properties of source code. While it might be difficult to prove the correctness of an implementation of a cryptographic algorithm, it is perhaps easy enough to recognize a common coding pattern (say, a sorting algorithm) and detect bugs in it by looking for deviations from the many similar implementations that already exist. Designing a formal method of program analysis takes substantial manual effort and research expertise. If we can train neural networks to understand and reason about code, we can devise neural methods of program analysis from source code examples.
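To make the "look for deviations from similar implementations" idea concrete, here is a minimal sketch in Python. It is a deliberately simple frequency-based stand-in, not a neural method: each snippet is reduced to the sequence of its syntax-tree node types (so variable names and constants are ignored), and any snippet whose sequence differs from the majority is flagged. The helper names `shape` and `flag_deviations` and the three example snippets are hypothetical and chosen only for illustration.

```python
import ast
from collections import Counter


def shape(source):
    """Reduce a snippet to the sequence of its AST node type names.

    Identifiers and constant values are ignored, so two implementations
    that differ only in naming end up with the same shape.
    """
    tree = ast.parse(source)
    return tuple(type(node).__name__ for node in ast.walk(tree))


def flag_deviations(snippets):
    """Return indices of snippets whose shape differs from the most common one."""
    shapes = [shape(s) for s in snippets]
    majority_shape, _ = Counter(shapes).most_common(1)[0]
    return [i for i, s in enumerate(shapes) if s != majority_shape]


# Three hypothetical implementations of "find the smallest element".
# The third uses `>` where the others use `<`.
snippets = [
    "def smallest(xs):\n    best = xs[0]\n    for x in xs:\n"
    "        if x < best:\n            best = x\n    return best\n",
    "def find_min(values):\n    m = values[0]\n    for v in values:\n"
    "        if v < m:\n            m = v\n    return m\n",
    "def minimum(items):\n    lo = items[0]\n    for it in items:\n"
    "        if it > lo:\n            lo = it\n    return lo\n",
]

flagged = flag_deviations(snippets)
print(flagged)  # indices of the implementations that deviate from the majority
```

A learned model would replace the exact-match comparison with a statistical notion of similarity, so it could tolerate harmless variation while still catching the flipped comparison; the sketch only shows the shape of the idea.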
Besides source code, software engineers often work with multiple modalities of data in the form of natural language documentation, code comments, execution logs, UI designs, and so on. Social forums like Stack Overflow are a great source of knowledge. These sources of data offer insights into coding that may not be available in the source code itself. Models that can understand code together with other modalities, such as natural language, can be quite valuable. Unlike web search, keyword-based search for source code is not particularly effective, because the keywords used in search queries are likely to differ from the variable or method identifiers used in code. General-purpose contextual embedding models for source code, like the code understanding BERT (CuBERT), can be leveraged to obtain distributed representations of source code and natural language queries for better matching and retrieval.
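The retrieval setup described above can be sketched as follows: the query and every code snippet are mapped into the same vector space, and snippets are ranked by cosine similarity to the query. This is only an illustration of the pipeline, written in Python. The function `toy_embed` is a hypothetical stand-in that hashes character trigrams; it is not CuBERT and has none of the semantic understanding of a learned contextual model, which would be passed in through the `embed` parameter instead. The names `cosine`, `search`, and `DIM` are likewise assumptions of this sketch.

```python
import math
import zlib

DIM = 256  # size of the toy vector space (an arbitrary choice)


def toy_embed(text):
    """Hypothetical stand-in for a learned encoder: hashed character trigrams."""
    vec = [0.0] * DIM
    s = text.lower()
    for i in range(len(s) - 2):
        # crc32 is used because it is deterministic across runs
        vec[zlib.crc32(s[i:i + 3].encode("utf-8")) % DIM] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, snippets, embed=toy_embed):
    """Rank snippets by similarity to the query, best match first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(s)), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


corpus = [
    "def read_lines(path): return open(path).read().splitlines()",
    "def connect(host, port): return socket.create_connection((host, port))",
]
for score, snippet in search("read the lines of a file", corpus):
    print(round(score, 3), snippet)
```

Swapping `toy_embed` for a real contextual encoder leaves `search` unchanged, which is the practical appeal of the approach: the matching logic stays the same while the representation carries the understanding of code and language.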
We are seeing rapid progress in machine learning methods for code understanding and their applications to software engineering problems. Nevertheless, there remains a long way to go in terms of deeper understanding, the ability to analyze large coding contexts, and a unified treatment of source code with other sources of information. In my upcoming talk at ODSC APAC 2021, I hope to convey my enthusiasm about this emerging area and its promising prospects.
About the author/ODSC APAC 2021 speaker on advances in machine learning for software engineering:
Aditya Kanade is an Associate Professor at the Department of Computer Science and Automation of the Indian Institute of Science. He completed his PhD at IIT Bombay and post-doc at the University of Pennsylvania. His research interests span machine learning, software engineering and automated reasoning. He is particularly excited about the prospect of developing machine learning techniques to automate software engineering, and designing trustworthy and deployable machine learning systems.