Code Interpreter

Running Code Interpreter Abuse Benchmark

python3 -m \
--benchmark=interpreter \
--prompt-path="$DATASETS/interpreter/interpreter.json" \
--response-path="$DATASETS/interpreter/interpreter_responses.json" \
--judge-response-path="$DATASETS/interpreter/interpreter_judge_responses.json" \
--stat-path="$DATASETS/interpreter/interpreter_stat.json" \
--judge-llm="OPENAI::gpt-3.5-turbo::<YOUR API KEY>" \
--llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
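
The --judge-llm value above has a `PROVIDER::MODEL::API KEY` shape. Assuming each --llm-under-test specification follows the same three-part shape (an inference from that example, not something this page confirms), a small helper can sanity-check a string before launching a run. `parse_llm_spec` and `LLMSpec` are hypothetical names used only for illustration; they are not part of the benchmark.

```python
from typing import NamedTuple


class LLMSpec(NamedTuple):
    provider: str
    model: str
    api_key: str


def parse_llm_spec(spec: str) -> LLMSpec:
    """Split a 'PROVIDER::MODEL::API_KEY' string into its three parts.

    Hypothetical helper: the three-part shape is inferred from the
    --judge-llm example, not taken from the benchmark's own parser.
    """
    parts = spec.split("::")
    # Reject strings with the wrong number of fields or an empty field.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected PROVIDER::MODEL::API_KEY, got {spec!r}")
    return LLMSpec(*parts)
```

Validating the string up front can surface a malformed specification before any API calls are made.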

The Code Interpreter Abuse benchmark runs in two steps:

  • Processing prompts - The LLM under test is called (in parallel, if specified) to generate a response for each prompt.
  • Judging responses - The judge LLM classifies each response as extremely malicious, potentially malicious, or non-malicious, based on whether the LLM under test complied with a malicious prompt. The judge LLM is independent of the LLM under test.
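
The two steps above can be sketched roughly as follows. This is an illustration of the flow, not the benchmark's actual implementation: `query_llm` and `judge` are hypothetical callables standing in for the LLM under test and the judge LLM, and the verdict labels are assumed names for the three categories.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed labels for the three categories the judge can assign.
VERDICTS = ("extremely_malicious", "potentially_malicious", "non_malicious")


def process_prompts(
    prompts: List[str],
    query_llm: Callable[[str], str],
    max_workers: int = 1,
) -> List[str]:
    """Step 1: collect one response per prompt, optionally in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the prompts.
        return list(pool.map(query_llm, prompts))


def judge_responses(
    responses: List[str],
    judge: Callable[[str], str],
) -> Dict[str, int]:
    """Step 2: ask the judge for a verdict on each response and tally them."""
    counts = {verdict: 0 for verdict in VERDICTS}
    for response in responses:
        verdict = judge(response)
        if verdict not in counts:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    return counts
```

Keeping the two steps as separate functions mirrors the separation between the LLM under test and the judge LLM: the responses can be stored once and judged independently.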


Once the benchmark has run, the evaluation of each model across each attack type is available under the stat_path:

Code Interpreter Abuse Results

"model_name": {
    "attack_type": {
        "is_extremely_malicious": ...,
        "is_potentially_malicious": ...,
        "is_non_malicious": ...,
        "total_count": ...,
        "malicious_percentage": ...
    }
}

Each prompt is evaluated, and the model's response is categorized as extremely malicious, potentially malicious, or non-malicious. For each attack type, the statistics report the count in each category, the total count, and the overall malicious percentage.
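
As a minimal sketch of how these results might be inspected, the snippet below loads the stat file and ranks each model and attack type by its reported malicious percentage. It assumes the file is a JSON object with the nesting shown above (model name, then attack type, then the listed fields); `load_stats` and `rank_attack_types` are hypothetical helper names, not part of the benchmark.

```python
import json
from typing import Any, Dict, List, Tuple

Stats = Dict[str, Dict[str, Dict[str, Any]]]


def load_stats(path: str) -> Stats:
    """Read the stat file written to --stat-path (assumed to be JSON)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def rank_attack_types(stats: Stats) -> List[Tuple[str, str, float]]:
    """Return (model, attack_type, malicious_percentage) rows, highest first."""
    rows = [
        (model, attack_type, entry["malicious_percentage"])
        for model, attacks in stats.items()
        for attack_type, entry in attacks.items()
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

For example, `rank_attack_types(load_stats("interpreter_stat.json"))` would list the attack types each model complied with most often first, using the percentage exactly as stored in the file.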