Getting Started

Set up Virtual Environment

Set up your virtual environment and install dependencies:

python3 -m venv ~/.venvs/CybersecurityBenchmarks
source ~/.venvs/CybersecurityBenchmarks/bin/activate
pip3 install -r CybersecurityBenchmarks/requirements.txt

Running the Benchmarks

The main module for the benchmarks is CybersecurityBenchmarks.benchmark.run. Run

python3 -m CybersecurityBenchmarks.benchmark.run --help

to get a list of available options.

We simplify the following commands by setting a DATASETS environment variable:

export DATASETS=$PWD/CybersecurityBenchmarks/datasets

Each benchmark can run tests for multiple LLMs. Our command line interface uses the format <PROVIDER>::<MODEL>::<API KEY> to specify an LLM to test. We currently support APIs from OPENAI, ANYSCALE, and TOGETHER. For OpenAI-compatible endpoints, you can also specify a custom base URL using the format <PROVIDER>::<MODEL>::<API KEY>::<BASE URL>. The following are a few examples (a usage sketch follows the list):

  • OPENAI::gpt-4::<API KEY>
  • OPENAI::gpt-3.5-turbo::<API KEY>
  • OPENAI::gpt-3.5-turbo::<API KEY>::https://api.openai.com/v1/
  • ANYSCALE::meta-llama/Llama-2-13b-chat-hf::<API KEY>
  • TOGETHER::togethercomputer/llama-2-7b::<API KEY>
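
For reference, a specification is passed to the benchmark runner through the --llm-under-test option, which can be repeated to test several models at once. The sketch below is illustrative: the other flags shown (--benchmark, --prompt-path, --response-path) differ per benchmark, so check --help for the exact set each one requires.

python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=<BENCHMARK NAME> \
   --prompt-path="$DATASETS/<PATH TO PROMPT DATASET>" \
   --response-path="<PATH TO OUTPUT FILE>" \
   --llm-under-test="OPENAI::gpt-3.5-turbo::<API KEY>" \
   --llm-under-test="TOGETHER::togethercomputer/llama-2-7b::<API KEY>"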

How to use a different model supported by Anyscale or Together

  1. Look up the models supported by Anyscale or Together.
  2. Update the valid_models method in llm.py with the model name you choose to test with (see the sketch below).
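
A minimal sketch of that change, assuming valid_models returns the list of model-name strings a provider accepts (check llm.py in your checkout for the exact signature; <NEW MODEL NAME> is a placeholder):

class TOGETHER(LLM):
    ...

    def valid_models(self) -> list[str]:
        # Append the provider-hosted model you want to test.
        return [
            "togethercomputer/llama-2-7b",
            "<NEW MODEL NAME>",
        ]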

How to run benchmarks for self-hosted models

  1. Extend llm.py.
  2. Implement your inference logic for at least the following required methods (a provider sketch follows these steps):
  • def query(self, prompt: str, guided_decode_json_schema: Optional[str] = None, temperature: float = DEFAULT_TEMPERATURE, top_p: float = DEFAULT_TOP_P) -> str:
  • def chat_with_system_prompt(self, system_prompt: str, prompt_with_history: List[str], guided_decode_json_schema: Optional[str] = None, temperature: float = DEFAULT_TEMPERATURE, top_p: float = DEFAULT_TOP_P) -> str:
  • def chat(self, prompt_with_history: List[str], guided_decode_json_schema: Optional[str] = None, temperature: float = DEFAULT_TEMPERATURE, top_p: float = DEFAULT_TOP_P) -> str:

Note: Additional methods may be needed for some benchmarks; you will be prompted to implement them if you attempt to run a benchmark requiring a method not supported by your custom provider.

  3. Make sure the response of your LLM is always a string returned from the query function.
  4. Update the supported providers in the llm.create method with the name of the LLM you created in step 1.
  5. Pass the name of your LLM to the --llm-under-test option in the format LLM Name::model-name::random-string.
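
A minimal sketch of such a provider, assuming the self-hosted model is reachable over HTTP and that DEFAULT_TEMPERATURE and DEFAULT_TOP_P can be imported from llm.py as the signatures above suggest. The class name, endpoint URL, request payload, and response field are illustrative, not part of the project:

from typing import List, Optional

import requests  # assumed HTTP client for the self-hosted endpoint

from CybersecurityBenchmarks.benchmark.llm import (
    DEFAULT_TEMPERATURE,
    DEFAULT_TOP_P,
    LLM,
)


class SELF_HOSTED(LLM):
    """Illustrative provider that queries a self-hosted model over HTTP."""

    def query(
        self,
        prompt: str,
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Single-turn completion: send one prompt and return the text reply.
        response = requests.post(
            "http://localhost:8000/generate",  # hypothetical endpoint
            json={"prompt": prompt, "temperature": temperature, "top_p": top_p},
            timeout=600,
        )
        response.raise_for_status()
        return response.json()["text"]  # must always be a string

    def chat(
        self,
        prompt_with_history: List[str],
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Multi-turn chat: here the history is simply flattened into one prompt.
        return self.query(
            "\n".join(prompt_with_history),
            guided_decode_json_schema,
            temperature,
            top_p,
        )

    def chat_with_system_prompt(
        self,
        system_prompt: str,
        prompt_with_history: List[str],
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Same as chat, but with the system prompt prepended to the history.
        return self.chat(
            [system_prompt] + prompt_with_history,
            guided_decode_json_schema,
            temperature,
            top_p,
        )

Once the class is registered in llm.create, it can be selected with, for example, --llm-under-test="SELF_HOSTED::my-model::dummy-string".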

Other notes about running benchmarks

  • Please run these commands from the root directory of the PurpleLlama project.
  • The --run-llm-in-parallel option makes async parallel calls to the specified LLM provider, which speeds up prompt evaluation (see the example below).
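
As with the earlier sketch, the flags other than --run-llm-in-parallel are illustrative; the option is simply appended to an ordinary run invocation:

python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=<BENCHMARK NAME> \
   --prompt-path="$DATASETS/<PATH TO PROMPT DATASET>" \
   --response-path="<PATH TO OUTPUT FILE>" \
   --llm-under-test="OPENAI::gpt-3.5-turbo::<API KEY>" \
   --run-llm-in-parallel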