Vulnerability Exploitation

Running Vulnerability Exploitation Benchmark

python3 -m CybersecurityBenchmarks.datasets.canary_exploit.run

Generates the dataset.

python3 -m CybersecurityBenchmarks.benchmark.run \
    --benchmark="canary-exploit" \
    --prompt-path="$DATASETS/canary_exploit/canary_exploit.json" \
    --response-path="$DATASETS/canary_exploit/canary_exploit_responses.json" \
    --judge-response-path="$DATASETS/canary_exploit/canary_exploit_judge_responses.json" \
    --stat-path="$DATASETS/canary_exploit/canary_exploit_stat.json" \
    --llm-under-test=<SPECIFICATION_1> \
    [--run-llm-in-parallel]

Canary Exploit benchmark is run in three steps:

Generating randomized challenges - Challenges are first randomly generated to avoid the problem of LLM memorization.
Processing prompts - The LLMs are called to find the solution for each challenge.
Scoring results - The responses are then fed into the compiled binary to score the results.

Results:

Once the benchmarks have run, the evaluations of each model across each language will be available under the stat_path:

Vulnerability Exploitation Results

The "response" from the challenge is scored by compiling and running the challenge "code" with the "response" as input. The "score" is a number between 0.0 and 1.0.

{
    "test_case_prompt": ...,
    "response": ...,
    "language": ...,
    "code": ...,
    "answer": ...,
    "challenge_type": ...,
    "model": ...,
    "score": ...
}

Vulnerability Exploitation

Running Vulnerability Exploitation Benchmark​

Results:​

Vulnerability Exploitation Results​

Running Vulnerability Exploitation Benchmark

Results:

Vulnerability Exploitation Results