Secure Code Benchmark

Running Secure Code Generation Benchmarks

For Instruct Benchmark

python3 -m \
--benchmark=instruct \
--prompt-path="$DATASETS/instruct/instruct.json" \
--response-path="$DATASETS/instruct_responses.json" \
--stat-path="$DATASETS/instruct_stat.json" \
--llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...

For Autocomplete Benchmark

python3 -m \
--benchmark=autocomplete \
--prompt-path="$DATASETS/autocomplete/autocomplete.json" \
--response-path="$DATASETS/autocomplete_responses.json" \
--stat-path="$DATASETS/autocomplete_stat.json" \
--llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...

Both the instruct and autocomplete benchmarks are run in two steps:

  • Processing prompts - The LLM is called (in parallel, if specified) to render the LLM output for each prompt.
  • Processing responses - The insecure code detector (ICD) evaluates each response to determine whether vulnerable code was produced.
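
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual implementation: `query_llm`, `looks_insecure`, `process_prompts`, and `process_responses` are hypothetical names, the model call is stubbed out, and the toy detector only flags one pattern, whereas the real ICD applies its own rule set. The formula for the percentage is also an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under test."""
    if "hash" in prompt:
        return "import hashlib\nhashlib.md5(data).hexdigest()"
    return "print('hello')"


def looks_insecure(code: str) -> bool:
    """Toy stand-in for the ICD: flags a single weak-hash pattern."""
    return "hashlib.md5(" in code


def process_prompts(prompts, max_workers=1):
    """Step 1: collect one response per prompt, optionally in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the prompts.
        return list(pool.map(query_llm, prompts))


def process_responses(responses):
    """Step 2: run the detector on each response and aggregate counts."""
    vulnerable = sum(1 for r in responses if looks_insecure(r))
    total = len(responses)
    return {
        "total_count": total,
        "vulnerable_suggestion_count": vulnerable,
        # Assumed definition: share of flagged responses, as a percentage.
        "vulnerable_percentage": 100.0 * vulnerable / total if total else 0.0,
    }


prompts = ["Write code to hash a password", "Write code to print a greeting"]
responses = process_prompts(prompts, max_workers=2)
stats = process_responses(responses)
print(stats)
```

Keeping the two steps separate means responses can be stored after step 1 and re-evaluated later without calling the model again, which is presumably why the commands take both a response path and a stat path.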


Once the benchmarks have run, the evaluations of each model across each language will be available under the stat_path:

Instruct and Autocomplete Results

"model_name": {
"language": {
"bleu": ...,
"total_count": ...,
"vulnerable_percentage": ...,
"vulnerable_suggestion_count": ...,
"pass_rate": ...