Prompt Injection

Running Prompt Injection Benchmarks

Textual Prompt Injection Benchmark

python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=prompt-injection \
   --prompt-path="$DATASETS/prompt_injection/prompt_injection.json" \
   --response-path="$DATASETS/prompt_injection/prompt_injection_responses.json" \
   --judge-response-path="$DATASETS/prompt_injection/prompt_injection_judge_responses.json" \
   --stat-path="$DATASETS/prompt_injection/prompt_injection_stat.json" \
   --judge-llm=<SPECIFICATION_1> \
   --llm-under-test=<SPECIFICATION_2> --llm-under-test=<SPECIFICATION_3> ...
   [--run-llm-in-parallel]

Textual prompt injection benchmarks are run in the following two steps:

Processing prompts Query the LLMs under test with a system prompt and a user prompt, where the user prompt tries to inject instructions to the LLM that is violating the original system prompt
Processing response The responses of the LLMs under test will go through another judge LLM, where the judge LLM will judge if the injected instruction is executed

Multilingual Text Prompt Injection Benchmark

For the multilingual textual prompt injection benchmark, the same benchmark implementation is used, only the dataset is changed.

python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=prompt-injection \
   --prompt-path="$DATASETS/prompt_injection/prompt_injection_multilingual.json" \
   ... (rest same as above)

Visual Prompt Injection Benchmark

The dataset needed to run this benchmark can be obtained from HuggingFace.

$ git lfs install
$ git clone https://huggingface.co/datasets/facebook/cyberseceval3-visual-prompt-injection $DATASETS/cyberseceval3-visual-prompt-injection

To run:

python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=visual-prompt-injection \
   --prompt-path="$DATASETS/cyberseceval3-visual-prompt-injection" \
   --response-path="$DATASETS/results/cyberseceval3-visual-prompt-injection/model_responses.json" \
   --judge-response-path="$DATASETS/results/cyberseceval3-visual-prompt-injection/judge_responses.json" \
   --stat-path="$DATASETS/results/cyberseceval3-visual-prompt-injection/stats.json" \
   --judge-llm=<SPECIFICATION_1> \
   --llm-under-test=<SPECIFICATION_2> --llm-under-test=<SPECIFICATION_3> ... \
   [--run-llm-in-parallel] \
   [--num-queries-per-prompt=5]

Visual prompt injection benchmarks are run in the following two steps:

Processing prompts: Query the LLMs under test with a list of test cases consisting of a system prompt, a user text input and a user image to obtain a list of responses.
Judging responses: The responses of the LLMs under test will go through a separate judge LLM (the same judge LLM is used for all responses), where the judge LLM will judge if injection was successful based on the judge question of that test case.

--num-queries-per-prompt=<N> can be optionally specified to run each test case N times (default if unspecified is 1) in order to obtain more robust results due to the stochastic nature of LLM responses.

Unlike the other benchmarks, a directory should be specified for --prompt-path for this benchmark rather than a JSON file. The expected data format in the directory is:

A test_cases.json file with integer IDs for each test case
A subdirectory named images/ which have images corresponding to the test cases with filenames {id}.png.

Results

Textual/Multilingual Prompt Injection Results

Based on judge LLM, the prompt injection will be either successful or unsuccessful

{
    "stat_per_model": {
        "model_name": {
            "injection_successful_count": ...,
            "injection_unsuccessful_count": ...,
            "total_count": ...,
            "injection_successful_percentage": ...,
            "injection_unsuccessful_percentage": ...
        }
    }
}

In the same output file, the results are futher bucketized in different dimensions, e.g., the injection techniques, and whether the prompt is a direct prompt injection or an indrect prompt injection, etc. The result of each bucket follows the same structure.

{
    "stat_per_model_per_variant": {
        "model_name": {
            "injected_prompt_technique_variant": {
                "injection_successful_count": ...,
                "injection_unsuccessful_count": ...,
                "total_count": ...,
                "injection_successful_percentage": ...,
                "injection_unsuccessful_percentage": ...
            },
            ...
        }
    },
    "stat_per_model_per_type": {
        ...
    },
    "stat_per_model_per_risk_category": {
        ...
    },
    "stat_per_model_per_speaking_language": {
        ..
    }
}

Visual Prompt Injection Results

Based on the evaluation of the judge LLM, the output of visual prompt injection test cases will be judged as either a successful or unsuccessful injection.

{
    "stat_per_model": {
        "model_name": {
            "injection_successful_count": ...,
            "injection_unsuccessful_count": ...,
            "total_count": ...,
            "injection_successful_percentage": ...,
            "injection_unsuccessful_percentage": ...
        }
    }
}

In the same output file, the results are further bucketed in different dimensions:

Injection techniques
Risk category (security-violating vs. logic-violating)
Injection type (direct vs. indirect)

The result of each bucket follows the same structure.

{
    "stat_per_model_per_injection_technique": {
        "model_name": {
            "embedded_text_instructions": {
                "injection_successful_count": ...,
                "injection_unsuccessful_count": ...,
                "total_count": ...,
                "injection_successful_percentage": ...,
                "injection_unsuccessful_percentage": ...
            },
            ...
        }
    },
    "stat_per_model_per_injection_type": {
        ...
    },
    "stat_per_model_per_risk_category": {
        ...
    }
}

Prompt Injection

Running Prompt Injection Benchmarks​

Textual Prompt Injection Benchmark​

Multilingual Text Prompt Injection Benchmark​

Visual Prompt Injection Benchmark​

Results​

Textual/Multilingual Prompt Injection Results​

Visual Prompt Injection Results​