AI Code Generation and Cybersecurity
The endless stream of bad news related to cybersecurity leads one to question whether it is possible to secure software at the pace it is developed. Yet progress in the fields of machine learning (ML) and artificial intelligence (AI) may eventually help us produce more secure code. Advances in hardware and algorithms have enabled ever larger models with billions of parameters, unlocking novel applications in numerous fields, from increasing the efficacy of supply chain logistics to supporting autonomous weapons. These same advances could be applied to cybersecurity, bringing new opportunities and risks that both practitioners and policymakers should consider.
One exciting area is AI code generation, or machine-assisted pair programming. These systems generally work by auto-completing portions of code written by a human or by taking instructions that describe the code to be generated. The underlying models that power these systems are trained similarly to large language models, except that they are trained on source code. While still in their infancy, these systems are advancing quickly and becoming more capable with each new generation. They hold enormous potential for the tech economy and for businesses that rely on the high-demand, short-supply engineering workforce.
Presently, the most advanced example of these systems is Copilot from GitHub. The ML model that powers Copilot is named Codex. It was developed by OpenAI and is based on GPT-3, a widely known large language model that is capable of producing human-like text. Copilot integrates with a developer’s workflow and is designed to finish small snippets of code or write functions that perform simple tasks based on a prompt described by the engineer. It excels at assisting human programmers by reducing the time it takes to develop routine functionality.
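To make this concrete, here is a hypothetical illustration of the prompt-and-completion workflow (not an actual Copilot transcript): the engineer writes the comment and function signature, and the assistant suggests a body along these lines.

```python
# The engineer writes the comment and signature below; an assistant such as
# Copilot might propose the function body that follows.

def is_valid_ipv4(address: str) -> bool:
    """Return True if the string is four dot-separated integers from 0 to 255."""
    parts = address.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each octet must be a decimal number in the range 0-255.
        if not part.isdigit() or not 0 <= int(part) <= 255:
            return False
    return True
```

Completions of this kind are routine boilerplate, which is precisely where the productivity gains come from.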
While we are seeing early success in AI-assisted pair programming, there is no widely available AI system capable of generating programs with complex business logic that rival the code repositories they were trained on. Producing larger programs that track multiple interdependent inputs and variables, make state transitions based on them, and produce complex output is a likely next step. Yet despite how intelligent these systems appear when they produce working code, they do not actually understand the finer details of the compilers that will consume and process that code, the CPUs it will execute on, or the larger trust model it may be expected to uphold.
Still, AI-generated code could have some clear security benefits in the short term. First, AI-assisted software development could make code more robust and better tested. Ideally, all code written by humans is well tested and audited by both automated tooling and manual human analysis, ensuring that the majority of defects are caught long before the code is in the hands of users. In practice, much of the code produced today does not undergo this type of automated testing because applying it takes time and expertise. The same AI systems could generate automated tests for every function they produce with little to no additional effort. Human expertise may still be required to fix the defects those tests uncover, but this presents an opportunity to significantly scale up testing efforts earlier in the development cycle. It is an underexplored area worth investing in today.
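As a sketch of what that might look like in practice, the same assistant could emit a unit test alongside the hypothetical is_valid_ipv4 function from the earlier example. The pytest-style test below illustrates this kind of machine-generated scaffolding; the module name ip_utils is assumed for the sake of the example.

```python
# Illustrative, machine-generatable pytest test for the hypothetical
# is_valid_ipv4 function shown earlier (ip_utils is an assumed module name).
import pytest

from ip_utils import is_valid_ipv4


@pytest.mark.parametrize(
    "address,expected",
    [
        ("192.168.0.1", True),       # typical private address
        ("255.255.255.255", True),   # upper bound of every octet
        ("256.0.0.1", False),        # octet out of range
        ("1.2.3", False),            # too few octets
        ("1.2.3.4.5", False),        # too many octets
        ("a.b.c.d", False),          # non-numeric octets
    ],
)
def test_is_valid_ipv4(address, expected):
    assert is_valid_ipv4(address) == expected
```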
Second, in some cases these systems may be able to reduce the attack surface through code deduplication. By recognizing when a well-tested library or framework can be substituted for human-written code and produce the same effect, the system can reduce the overall complexity of the final product. Using stable, up-to-date, and widely adopted open source alternatives can lead to more secure code than proprietary, untested equivalents, and it frees engineers to focus on their core business logic. These AI systems will have to ensure they always recommend and generate code that uses the most up-to-date versions of these libraries, or they risk introducing known vulnerabilities. Even when newer components with security patches are available, engineers don't always update them because of backward compatibility concerns. In those cases, leaning on AI to rewrite the integrations can reduce the time to update and narrow the window of vulnerability.
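A hedged illustration of the kind of substitution described above: a hand-rolled query-string parser, the sort of routine code an assistant might recognize as duplicative, replaced with Python's widely used standard library equivalent.

```python
from urllib.parse import parse_qs


# Hand-rolled parsing of the kind an assistant might flag for replacement.
# It ignores percent-encoding and silently drops repeated keys.
def parse_query_naive(query: str) -> dict:
    params = {}
    for pair in query.split("&"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            params[key] = value
    return params


def parse_query(query: str) -> dict:
    # The standard library decodes percent-encoded values and returns every
    # value for a repeated key as a list, cases the naive version mishandles.
    return parse_qs(query)
```

The same pattern applies at larger scale: swapping a bespoke parsing or cryptographic routine for a maintained open source implementation removes code that would otherwise have to be audited in isolation.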
While these systems may help the defender, they could also augment and enhance the capabilities of traditionally less sophisticated malicious actors. Code, whether written by humans or machines, is inherently dual use. We must accept and prepare for a future in which defenders use these capabilities to produce more secure code and malicious actors use them to reduce the time it takes to build new attack tools.
Moreover, these systems will only produce outputs as secure as the code their models were trained on. AI-assisted software development can rapidly and inadvertently create new attack surfaces and vulnerabilities that outpace our ability to discover and harden them. Even with smart prefiltering of potentially malicious or low-quality training data, we don't yet know what kinds of subtle supply chain attacks may be possible; attackers could seed malicious training data at various points, producing outputs that are difficult to predict. This is particularly concerning given the lack of explainability common to large neural network models. It may also create difficult challenges for the Software Bill of Materials (SBOM) standards being set forth today: by design, these AI models obscure the provenance of the code they generate, and code provenance is a key property of an SBOM. The concept of a bugdoor, a purposefully placed subtle vulnerability that provides access and plausible deniability, takes on new meaning in the context of machine-generated code for the same reason. Because open source code is necessary to build these AI models in the first place, the security and quality of the open source software ecosystem play a role in unlocking the potential of AI code generation. Investing in the security and trustworthiness of the open source ecosystem should remain a priority for both industry and government.
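To illustrate what a bugdoor might look like in machine-generated code, consider the hypothetical snippet below (the function and path names are invented for the example). The containment check reads like an honest mistake, yet it can be bypassed; the safer variant shows how small the difference is.

```python
import os

ALLOWED_ROOT = "/var/www/uploads"  # hypothetical upload directory


def read_upload(requested_path: str) -> bytes:
    # Looks like a sane containment check, but it is subtly bypassable:
    # "/var/www/uploads/../../../etc/passwd" passes the prefix test, and so
    # does "/var/www/uploads_private/key.pem" because no trailing separator
    # is enforced. A reviewer could plausibly read this as an honest mistake.
    if not requested_path.startswith(ALLOWED_ROOT):
        raise PermissionError("path outside upload directory")
    with open(requested_path, "rb") as handle:
        return handle.read()


def read_upload_safer(requested_path: str) -> bytes:
    # Resolve the path first, then compare against the allowed root so that
    # ".." sequences, symlinks, and prefix tricks are all rejected.
    resolved = os.path.realpath(requested_path)
    if os.path.commonpath([resolved, ALLOWED_ROOT]) != ALLOWED_ROOT:
        raise PermissionError("path outside upload directory")
    with open(resolved, "rb") as handle:
        return handle.read()
```

Whether such a flaw originated in a poisoned training set or in an ordinary model error would be very difficult to establish after the fact, which is what makes the combination of opaque provenance and plausible deniability so concerning.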
A new era of automated code generation is beginning to take shape. This shift will create new opportunities to develop more secure code by scaling the techniques we already know to be effective. However, a number of technical challenges remain. It is imperative that we prepare for the changes this era will bring.
Chris Rohlf is a software engineer and security expert.