This research paper investigates using large language models (LLMs), such as GPT-3.5, to assist penetration testers. The authors explore two applications: creating high-level penetration testing plans and hunting low-level vulnerabilities within a virtual machine. A closed-loop system was developed that lets the LLM suggest and execute commands on the target machine, ultimately gaining root access. The study presents initial results, discusses areas for improvement, and addresses the ethical implications of using AI in penetration testing, including concerns about malicious use. Future research directions focus on integrating the high- and low-level functionalities, exploring alternative LLMs, and improving prompt engineering.

  • We explore two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed feedback loop between LLM-generated low-level actions and a vulnerable virtual machine (connected through SSH), allowing the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors, which were then automatically executed within the virtual machine (a minimal sketch of this loop follows the list).
  • The field of cybersecurity, and software security testing (more specifically, penetration testing) in particular, suffers from a chronic lack of personnel [19]. Worse, according to the ISC2 Cybersecurity Workforce Study 2022 [18], while the global cybersecurity workforce grew by 11.1% year over year, this growth was outpaced by the workforce gap's increase of 26.2% year over year. A recent interview study with penetration testers highlighted the need for human sparring partners [16], i.e., colleagues who offer alternative ideas or approaches when testers are stuck.
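A minimal sketch of such a closed feedback loop, assuming Python with the paramiko and openai packages: the model proposes one shell command per round, the command is executed on the target VM over SSH, and its output is appended to the next prompt. Hostname, credentials, prompt wording, and the success check below are placeholders, not the authors' implementation.

```python
import paramiko
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Connect to the (intentionally) vulnerable VM as a low-privileged user.
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("vulnerable-vm.local", username="lowpriv", password="trustno1")  # placeholder target

history = []  # running transcript of commands and their outputs

for step in range(10):  # cap the number of feedback rounds
    prompt = (
        "You are a low-privileged user on a Linux machine and want to become root.\n"
        "Commands tried so far and their outputs:\n" + "\n".join(history) +
        "\nRespond with the single next shell command to try, and nothing else."
    )
    reply = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    cmd = reply.choices[0].message.content.strip()

    # Execute the suggested command on the target and capture its output.
    _, stdout, _ = ssh.exec_command(cmd, timeout=30)
    output = stdout.read().decode(errors="replace")
    history.append(f"$ {cmd}\n{output}")

    # Crude success heuristic; a real prototype needs more robust root detection.
    if "uid=0(root)" in output:
        print("root access obtained")
        break

ssh.close()
```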

LLMs show promising results in automating penetration testing tasks, particularly in low-level attack execution, but their effectiveness and ethical implications require further investigation.

  • Researchers explored the potential of LLMs, such as GPT-3.5, to act as “AI sparring partners” for penetration testers.
  • In one experiment, researchers integrated GPT-3.5 with a vulnerable virtual machine, enabling the LLM to analyze the machine, identify vulnerabilities, suggest attack vectors, and execute commands through SSH. This setup successfully gained root privileges on the virtual machine, demonstrating the potential of LLMs to automate privilege-escalation attacks.
  • In another experiment, researchers tasked LLMs with designing penetration tests for both generic scenarios and a specific target organization. The LLMs generated realistic and feasible attack plans, including techniques such as password spraying, Kerberoasting, and exploiting Active Directory Certificate Services (see the first sketch after this list).
  • However, researchers noted that individual prototype runs were not always stable: the sequence and selection of commands and the vulnerabilities identified varied between runs. They observed that results tended to converge over longer runs or when aggregating multiple runs (see the second sketch after this list).
  • Using LLMs for penetration testing raises ethical concerns, particularly regarding the potential for misuse. The researchers acknowledged that LLMs could be exploited for malicious purposes and stressed the need for defenders to prepare for LLM-driven attacks. The ease of fine-tuning existing models for malicious activities, coupled with the low cost of entry for experimentation, further amplifies this risk.
  • While LLMs can automate certain penetration testing tasks, the authors highlight the need for human oversight, particularly in prompt engineering, result verification, and ethical considerations.
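For the high-level planning use case, the following is a hedged illustration of how such a plan can be requested from the model; the scenario description, system prompt, and model name are assumptions rather than the paper's exact prompts.

```python
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical engagement description; in the paper this would be the target organization.
scenario = (
    "Target: a mid-sized company running an on-premises Active Directory domain "
    "with AD Certificate Services, roughly 500 Windows clients, and an externally "
    "reachable VPN gateway."
)

response = llm.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are assisting a licensed penetration tester in planning an authorized engagement."},
        {"role": "user",
         "content": scenario + "\nPropose a phased penetration testing plan with concrete techniques for each phase."},
    ],
)
print(response.choices[0].message.content)
```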
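And a small sketch of the aggregation idea from the stability bullet: issuing the same query several times and tallying the proposed next steps, so that results converge across runs. The machine-state string, run count, and prompt are illustrative only.

```python
from collections import Counter
from openai import OpenAI

llm = OpenAI()

# Hypothetical summary of the VM's current state, as a single prompt fragment.
machine_state = (
    "uid=1000(lowpriv); sudo -l shows (ALL) NOPASSWD: /usr/bin/find; kernel 5.4.0"
)

suggestions = Counter()
for _ in range(10):  # ten independent runs of the same query
    reply = llm.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.0,  # keep sampling variability to mirror separate runs
        messages=[{"role": "user", "content":
            f"Machine state: {machine_state}\n"
            "Name the single most promising privilege-escalation step."}],
    )
    suggestions[reply.choices[0].message.content.strip()] += 1

# Steps proposed most often across runs come out on top.
for step, count in suggestions.most_common():
    print(count, step)
```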