## Getting Started
### Set up Virtual Environment
Set up your virtual environment and install dependencies:
```bash
python3 -m venv ~/.venvs/CybersecurityBenchmarks
source ~/.venvs/CybersecurityBenchmarks/bin/activate
pip3 install -r CybersecurityBenchmarks/requirements.txt
```
### Running the Benchmarks
The main module for the benchmarks is `CybersecurityBenchmarks.benchmark.run`. Run

```bash
python3 -m CybersecurityBenchmarks.benchmark.run --help
```

to get a list of available options.
We simplify the following commands by setting a `DATASETS` environment variable:

```bash
export DATASETS=$PWD/CybersecurityBenchmarks/datasets
```
Each benchmark can run tests for multiple LLMs. Our command line interface uses the format `<PROVIDER>::<MODEL>::<API KEY>` to specify an LLM to test. We currently support APIs from OPENAI, ANYSCALE, and TOGETHER. For OpenAI-compatible endpoints, you can also specify a custom base URL using the format `<PROVIDER>::<MODEL>::<API KEY>::<BASE URL>`. The following are a few examples:
- `OPENAI::gpt-4::<API KEY>`
- `OPENAI::gpt-3.5-turbo::<API KEY>`
- `OPENAI::gpt-3.5-turbo::<API KEY>::https://api.openai.com/v1/`
- `ANYSCALE::meta-llama/Llama-2-13b-chat-hf::<API KEY>`
- `TOGETHER::togethercomputer/llama-2-7b::<API KEY>`
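Putting these pieces together, a full run passes one or more `--llm-under-test` options to the run module. The command below is only an illustrative sketch: `--llm-under-test` is documented here, but the benchmark name, dataset paths, and the other flag names are assumptions, so confirm them against the output of `--help`.

```bash
# Illustrative invocation; flag names other than --llm-under-test and the
# dataset paths are assumptions. Check `--help` for the exact options.
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=mitre \
   --prompt-path="$DATASETS/mitre/mitre_prompts.json" \
   --response-path="$DATASETS/mitre_responses.json" \
   --llm-under-test="OPENAI::gpt-3.5-turbo::<API KEY>"
```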
#### How to use a different model supported by Anyscale or Together
- Look up the models supported by Anyscale or Together.
- Update the `valid_models` method in `llm.py` with the model name that you choose to test with (see the sketch after this list).
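For illustration only, the change might look like the sketch below, which assumes `valid_models` returns a list of accepted model names; the actual structure in `llm.py` may differ, so mirror whatever pattern the file already uses.

```python
# Hypothetical sketch of extending valid_models in llm.py; adapt it to the
# real method's structure in your checkout.
def valid_models(self) -> list[str]:
    # Model identifiers this provider is allowed to run against.
    return [
        "meta-llama/Llama-2-13b-chat-hf",
        "meta-llama/Llama-2-70b-chat-hf",  # newly added model to test with
    ]
```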
#### How to run benchmarks for self-hosted models
1. Extend `llm.py`.
2. Implement your inferencing logic for at least the following required methods (a sketch of a complete provider follows this list):

   ```python
   def query(
       self,
       prompt: str,
       guided_decode_json_schema: Optional[str] = None,
       temperature: float = DEFAULT_TEMPERATURE,
       top_p: float = DEFAULT_TOP_P,
   ) -> str: ...

   def chat_with_system_prompt(
       self,
       system_prompt: str,
       prompt_with_history: List[str],
       guided_decode_json_schema: Optional[str] = None,
       temperature: float = DEFAULT_TEMPERATURE,
       top_p: float = DEFAULT_TOP_P,
   ) -> str: ...

   def chat(
       self,
       prompt_with_history: List[str],
       guided_decode_json_schema: Optional[str] = None,
       temperature: float = DEFAULT_TEMPERATURE,
       top_p: float = DEFAULT_TOP_P,
   ) -> str: ...
   ```
   Note: Additional methods may be needed for some benchmarks; you will be prompted to implement them if you attempt to run a benchmark that requires a method your custom provider does not support.
3. Make sure the response of your LLM is always a string returned from the `query` function.
4. Update the supported providers in the `llm.create` method with the name of the LLM you created in step (1).
5. Pass the name of your LLM in the `--llm-under-test` option in the following format: `LLM Name::model-name::random-string`.
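As a rough illustration, a self-hosted provider could look like the sketch below. The class name, the HTTP endpoint, and the default constants are assumptions made for this example; mirror the structure of the existing providers in `llm.py` when writing yours.

```python
# Hypothetical self-hosted provider sketch for llm.py. The endpoint, class
# name, and default constants are assumptions; follow the existing providers.
from typing import List, Optional

import requests

DEFAULT_TEMPERATURE = 0.6  # placeholder; reuse the constants defined in llm.py
DEFAULT_TOP_P = 0.9  # placeholder; reuse the constants defined in llm.py


class SelfHostedLLM:  # in llm.py, subclass the existing LLM base class instead
    def query(
        self,
        prompt: str,
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Send the prompt to a locally hosted inference server and return the
        # generated text as a plain string. guided_decode_json_schema is
        # accepted but ignored in this simple sketch.
        response = requests.post(
            "http://localhost:8080/generate",  # assumed endpoint
            json={"prompt": prompt, "temperature": temperature, "top_p": top_p},
            timeout=600,
        )
        response.raise_for_status()
        return str(response.json()["text"])

    def chat(
        self,
        prompt_with_history: List[str],
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Naive approach: flatten the chat history into a single prompt.
        return self.query(
            "\n".join(prompt_with_history),
            guided_decode_json_schema,
            temperature,
            top_p,
        )

    def chat_with_system_prompt(
        self,
        system_prompt: str,
        prompt_with_history: List[str],
        guided_decode_json_schema: Optional[str] = None,
        temperature: float = DEFAULT_TEMPERATURE,
        top_p: float = DEFAULT_TOP_P,
    ) -> str:
        # Prepend the system prompt to the history and reuse chat().
        return self.chat(
            [system_prompt, *prompt_with_history],
            guided_decode_json_schema,
            temperature,
            top_p,
        )
```

After registering the class in `llm.create`, you would pass something like `SELF_HOSTED::my-model::dummy-key` via `--llm-under-test`, where the provider name is whatever name you chose in `llm.create`.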
#### Other notes about running benchmarks
- Please run these commands from the root directory of the PurpleLlama project.
- `--run-llm-in-parallel` makes async parallel calls to the specified LLM provider, which speeds up prompt evaluation (see the example below).
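For example, the flag can be appended to the illustrative invocation sketched earlier (again, the flag names other than `--llm-under-test` and `--run-llm-in-parallel` are assumptions; confirm them with `--help`):

```bash
# Same illustrative run as above, now with parallel prompt evaluation enabled.
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=mitre \
   --prompt-path="$DATASETS/mitre/mitre_prompts.json" \
   --response-path="$DATASETS/mitre_responses.json" \
   --llm-under-test="OPENAI::gpt-3.5-turbo::<API KEY>" \
   --run-llm-in-parallel
```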