# Getting Started

## Prerequisites
Please make sure you are running Python 3.9 or newer.

Also ensure you have weggli installed. Its published release is outdated, so we currently use revision
`9d97d462854a9b682874b259f70cc5a97a70f2cc`:

```
cargo install weggli --rev=9d97d462854a9b682874b259f70cc5a97a70f2cc --git=https://github.com/weggli-rs/weggli
```
Set the environment variable `WEGGLI_PATH` to point to the weggli executable. For example, if `weggli` is on your `PATH`, use the command below.

```
export WEGGLI_PATH=weggli
```
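
As a quick sanity check, the sketch below resolves `WEGGLI_PATH` and runs the binary once. It is only an illustration; the fallback to a bare `weggli` on `PATH` is an assumption, not something the benchmarks require.

```python
import os
import subprocess

# Resolve the weggli binary from WEGGLI_PATH; falling back to "weggli" on
# PATH is an assumption made for this illustration only.
weggli = os.environ.get("WEGGLI_PATH", "weggli")

try:
    # Run `weggli --help` just to confirm the binary is reachable.
    result = subprocess.run([weggli, "--help"], capture_output=True, text=True)
    ok = result.returncode == 0
except FileNotFoundError:
    ok = False

print("weggli is available" if ok else "weggli was not found; check WEGGLI_PATH")
```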
## Set up Virtual Environment

Set up your virtual environment and install dependencies:

```
python3 -m venv ~/.venvs/CybersecurityBenchmarks
source ~/.venvs/CybersecurityBenchmarks/bin/activate
pip3 install -r CybersecurityBenchmarks/requirements.txt
```
## Running the Benchmarks

The main benchmark module is `CybersecurityBenchmarks.benchmark.run`. Run

```
python3 -m CybersecurityBenchmarks.benchmark.run --help
```

to get a list of available options.

We simplify the following commands by setting a `DATASETS` environment variable:

```
export DATASETS=$PWD/CybersecurityBenchmarks/datasets
```
Each benchmark can run tests for multiple LLMs. Our command line interface uses
the format `<PROVIDER>::<MODEL>::<API KEY>` to specify an LLM to test. We
currently support APIs from OPENAI, ANYSCALE, and TOGETHER. The following are a
few examples:

```
OPENAI::gpt-4::<API KEY>
OPENAI::gpt-3.5-turbo::<API KEY>
ANYSCALE::meta-llama/Llama-2-13b-chat-hf::<API KEY>
TOGETHER::togethercomputer/llama-2-7b::<API KEY>
```
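
For intuition, such a specification is simply three `::`-separated fields. The snippet below is purely illustrative and is not taken from the benchmark code; the key shown is a placeholder.

```python
# Illustrative only: split a "<PROVIDER>::<MODEL>::<API KEY>" specification
# into its parts. The benchmark's own parsing may differ.
spec = "OPENAI::gpt-4::dummy-api-key"  # placeholder key for illustration
provider, model, api_key = spec.split("::", 2)
print(provider, model)  # -> OPENAI gpt-4
```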
### How to use a different model supported by Anyscale or Together

- Look up the models supported by Anyscale or Together.
- Update the `valid_models` method in `llm.py` with the name of the model you choose to test with (see the sketch after this list).
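
The sketch below shows the kind of change intended. The exact structure of `valid_models` in `llm.py` may differ, and the added model name is only an example.

```python
# Hypothetical sketch: the real valid_models method in llm.py may be shaped
# differently. The point is to include your chosen model name in the list of
# models accepted for the provider.
def valid_models(self) -> list[str]:
    return [
        "meta-llama/Llama-2-13b-chat-hf",
        "mistralai/Mistral-7B-Instruct-v0.1",  # example: newly added model
    ]
```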
### How to run benchmarks for self-hosted models

- Extend `llm.py` and implement your inference logic in `def query(self, prompt: str) -> str:` (see the sketch after this list).
- Make sure the response of your LLM is always a string returned from the `query` function.
- Update the supported providers in the `llm.create` method with the name of the LLM you created in step (1).
- Pass the name of your LLM to the `--llm-under-test` option in the format `LLM Name::model-name::random-string`.
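
Below is a minimal sketch of such an extension. It assumes `llm.py` exposes a base class named `LLM` whose `query` method you override; the `SelfHostedLLM` class name, the HTTP endpoint, and the payload shape are illustrative assumptions, not the benchmark's actual API.

```python
import requests  # assumes the self-hosted model is reachable over HTTP

from .llm import LLM  # assumption: llm.py exposes a base class named LLM


class SelfHostedLLM(LLM):
    """Hypothetical wrapper around a self-hosted model endpoint."""

    def query(self, prompt: str) -> str:
        # Send the prompt to an illustrative local inference endpoint.
        response = requests.post(
            "http://localhost:8000/generate",  # hypothetical endpoint
            json={"prompt": prompt},
            timeout=600,
        )
        response.raise_for_status()
        # The benchmarks expect query() to always return a string.
        return str(response.json()["text"])
```

With this in place, you would register your provider name in `llm.create` and pass something like `SELF-HOSTED::my-model::dummy-key` to `--llm-under-test`.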
### Other notes about running benchmarks

- Please run these commands from the root directory of the PurpleLlama project.
- `run-llm-in-parallel` makes async parallel calls to the specified LLM provider, which speeds up prompt evaluation (see the sketch after this list).
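
For intuition only, the pattern behind such parallel evaluation looks roughly like the sketch below; `query_llm` is a hypothetical coroutine standing in for a single provider call, and this is not the benchmark's implementation.

```python
import asyncio


# Hypothetical coroutine standing in for one LLM API call.
async def query_llm(prompt: str) -> str:
    await asyncio.sleep(1)  # stand-in for network latency
    return f"response to: {prompt}"


async def main() -> None:
    prompts = ["prompt 1", "prompt 2", "prompt 3"]
    # Issue all queries concurrently instead of one after another.
    responses = await asyncio.gather(*(query_llm(p) for p in prompts))
    print(responses)


asyncio.run(main())
```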