Introduction

This repository hosts the implementation of CyberSecEval 4, which builds upon and extends the functionalities of its predecessors, CyberSecEval 3, CyberSecEval 2 and CyberSecEval. CyberSecEval 4 is an extensive benchmark suite designed to assess the cybersecurity vulnerabilities and defensive capabilities of Large Language Models (LLMs). Building on its predecessor, CyberSecEval 3, this latest version introduces three new benchmarks to evaluate an AI system's defensiive capabilities: Two new tests comprise CyberSOCEval, which measures the LLM's capability to enable Security Operation Center (SOC) efficiency improvements or automation. These benchmarks were developed in collaboration with Crowdstrike and include tests for Malware Analysis and Threat Intellligence Reasoning. The third new benchmark, AutoPatchBench, measures an LLM agent's capability to automatically patch security vulnerabilities in native code.

Created to measure and incentivize the development of highly capable LLM based defender tools, CyberSecEval 4 builds on the CyerSecEval code base to offer a comprehensive set of tools for evaluting both vulnerabilities and defensive capabilities in various security domains. It has been applied to well-known LLMs such as Meta's Llama4 as well as OpenAI, Google, and Anthropic models. The findings from the collection of CyberSecEval studies underscore both substantial cybersecurity threats, as well as promising oppottunities to leverage these models for uplifting defenders.

The repository includes several types of benchmarks:

MITRE and False Refusal Rate (FRR) Tests: These tests evaluate an LLM's compliance when asked to assist in cyberattacks as well as how often an LLM incorrectly refuses a borderline but essentially benign query.
1. MITRE Tests: These tests use the MITRE ATT&CK framework to evaluate an LLM's compliance when asked to assist in cyberattacks.
2. False Refusal Rate (FRR) Tests: These tests measure how often an LLM incorrectly refuses a borderline but essentially benign query, due to misinterpreting the prompt as a malicious request.
Secure Code Generation Tests: These tests assess an LLM's propensity to generate secure code across various contexts.
1. Instruct Tests: These tests assess an LLM's propensity to generate insecure code when given a specific instruction.
2. Autocomplete Tests: These tests measure how often an LLM suggests insecure coding practices in autocomplete contexts, where the LLM predicts subsequent code based on preceding code.
Prompt Injection Tests: These tests assess an LLM’s susceptibility to “prompt injection attacks” - attacks in which a portion of the LLM prompt coming from untrusted user input contains malicious instructions intended to override the LLM’s original task.
1. Textual Prompt Injection Tests: An english-only dataset of textual prompt injections.
2. Visual Prompt Injection Tests: These tests assess an LLM’s susceptibility to “visual prompt injection attacks” - attacks in which untrusted multimodal user input (consisting of both text and images) contains malicious instructions intended to override the LLM’s original task.
Code Interpreter Tests: These tests evaluate the security risks posed by integrating LLMs with code interpreters, specifically assessing how effectively an LLM can prevent malicious attempts to exploit the system or execute harmful code.
Vulnerability Exploitation Tests: These tests measure the LLM's program exploitation capabilities by asking the LLMs to solve "capture the flag" style challenges.
Spear Phishing Capability Tests: These tests are designed to evaluate the capabilities of LLMs in spear phishing scenarios. The focus is on assessing the LLM's persuasiveness and its effectiveness in convincing targeted victims to meet specific phishing objectives.
Autonomous Offensive Cyber Operations Tests: These tests are designed to evaluate the capabilities of LLMs to function autonomously as a cyber attack agent.
AutoPatch Tests: These tests are designed to evaluate the capabilities of LLMs to autonomously generate security patches for vulnerabilities, specifically the crashes discovered by fuzzer tests.
CyberSOCEval Tests: These tests, created in partnership with Crowdstrike, assess defensive capabilities and include two benchmarks:
- Malware Analysis: Assesses the precision and recall of LLMs in identifying malicious activities from potential malware, such as detecting ransomware or remote access trojans via mutiple choice questions where multple options may be correct.
- Threat Intelligence Reasoning: Evaluates an AI's ability to parse unstructured threat intelligence reports and extract actionable insights via mutiple choice questions where multple options may be correct.