# Secure Code Benchmark

## Running Secure Code Generation Benchmarks
### For the Instruct Benchmark

```bash
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=instruct \
   --prompt-path="$DATASETS/instruct/instruct.json" \
   --response-path="$DATASETS/instruct_responses.json" \
   --stat-path="$DATASETS/instruct_stat.json" \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--run-llm-in-parallel]
```
### For the Autocomplete Benchmark

```bash
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=autocomplete \
   --prompt-path="$DATASETS/autocomplete/autocomplete.json" \
   --response-path="$DATASETS/autocomplete_responses.json" \
   --stat-path="$DATASETS/autocomplete_stat.json" \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--run-llm-in-parallel]
```
Both the instruct and autocomplete benchmarks are run in two steps:

- Processing prompts - Each LLM under test is called (in parallel, if `--run-llm-in-parallel` is specified) to generate an output for each prompt.
- Processing responses - Each response is then evaluated with the insecure code detector (ICD) to determine whether vulnerable code was produced; a simplified sketch of this flow follows the list.
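
The sketch below illustrates this two-step flow in a highly simplified form. It assumes each prompt record carries a `test_case_prompt` field, and the `query_llm` stub and pattern-based `looks_vulnerable` check are placeholders only; the real benchmark calls each configured LLM and the real ICD performs static analysis of the generated code.

```python
# Minimal sketch of the two-step flow, NOT the actual benchmark code.
# query_llm, looks_vulnerable, and the test_case_prompt field are
# illustrative assumptions.
import json
import re
from pathlib import Path


def query_llm(prompt: str) -> str:
    """Stand-in for calling an --llm-under-test model."""
    return "strcpy(buf, user_input);"  # pretend model output


# Toy placeholder for the ICD: flags a few obviously risky C calls.
RISKY_PATTERNS = [r"\bstrcpy\s*\(", r"\bgets\s*\(", r"\bsystem\s*\("]


def looks_vulnerable(code: str) -> bool:
    return any(re.search(p, code) for p in RISKY_PATTERNS)


def run_benchmark(prompt_path: str, response_path: str, stat_path: str) -> None:
    prompts = json.loads(Path(prompt_path).read_text())

    # Step 1: processing prompts - query the LLM once per prompt.
    responses = [
        {**p, "response": query_llm(p["test_case_prompt"])} for p in prompts
    ]
    Path(response_path).write_text(json.dumps(responses, indent=2))

    # Step 2: processing responses - run the detector and aggregate stats.
    vulnerable = sum(looks_vulnerable(r["response"]) for r in responses)
    stats = {
        "total_count": len(responses),
        "vulnerable_suggestion_count": vulnerable,
        "vulnerable_percentage": 100.0 * vulnerable / max(len(responses), 1),
    }
    Path(stat_path).write_text(json.dumps(stats, indent=2))
```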
## Results

Once the benchmarks have run, the evaluation of each model across each language will be available under the path specified by `--stat-path`:
### Instruct and Autocomplete Results
```json
{
    "model_name": {
        "language": {
            "bleu": ...,
            "total_count": ...,
            "vulnerable_percentage": ...,
            "vulnerable_suggestion_count": ...,
            "pass_rate": ...
        }
    }
}
```
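
As an example, a stats file with this structure can be summarized with a short script like the one below; `"instruct_stat.json"` is a placeholder for whatever file you passed to `--stat-path`.

```python
# Summarize per-model, per-language results from a stats file in the shape
# shown above. "instruct_stat.json" is a placeholder path.
import json

with open("instruct_stat.json") as f:
    stats = json.load(f)

for model, languages in stats.items():
    for language, metrics in languages.items():
        print(
            f"{model} / {language}: "
            f"{metrics['vulnerable_percentage']:.1f}% vulnerable "
            f"({metrics['vulnerable_suggestion_count']}/{metrics['total_count']} suggestions), "
            f"pass rate {metrics['pass_rate']}, BLEU {metrics['bleu']}"
        )
```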