# Threat Intelligence Reasoning

## Running the Threat Intelligence Reasoning Benchmark

### Download Threat Intelligence Report Data
This benchmark includes questions about public threat intelligence reports. Depending on the source, the report PDFs are hosted on various agency websites or in CrowdStrike's CyberSOCEval data repo; all of them can be downloaded and extracted to the appropriate local directory by running the `download_reports.py` script.
Before running the download script:
- To add the submodule containing the CrowdStrike reports and download the reports used in this benchmark, first run:

  ```
  git submodule add https://github.com/CrowdStrike/CyberSOCEval_data CyberSOCEval_data
  ```

  Note: you only need to run this command once. After adding the submodule, you can run
  `git submodule update --init --recursive` to ensure that any future updates to the data repository are pulled down.
- You will also need to install some additional system dependencies for the `pdf2image` package, which is used for converting the PDFs to images (a quick sanity check is shown after this list):
  - On Ubuntu/Debian: `sudo apt-get install poppler-utils`
  - On Arch Linux: `sudo pacman -S poppler`
  - On macOS: `brew install poppler`
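If you want to confirm that `pdf2image` can find the poppler binaries before running the full download, a minimal sketch like the following will surface a clear error if they are missing. The PDF path here is just a placeholder; substitute any local PDF.

```python
# Sanity check: confirm pdf2image can locate the poppler binaries.
# "sample.pdf" is a placeholder path; substitute any local PDF.
from pdf2image import convert_from_path

try:
    pages = convert_from_path("sample.pdf", dpi=100)
    print(f"Converted {len(pages)} page(s) to images")
except Exception as e:  # e.g. a PDFInfoNotInstalledError if poppler is missing
    print(f"pdf2image/poppler check failed: {e}")
```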
Once these steps are complete, you can run:

```
python3 -m CybersecurityBenchmarks.datasets.crwd_meta.threat_intel_reasoning.download_reports
```

This script downloads the reports from their various hosting sites, extracts the text of each PDF to `CybersecurityBenchmarks/datasets/threat_intelligence_reasoning/<report_id>.txt`, and writes a set of PNG images (one image per page of the PDF report) to `CybersecurityBenchmarks/datasets/crwd_meta/threat_intelligence_reasoning/<report_id>__<n>__.png`. This text and/or image data is passed to the benchmark as the context for the questions about each threat intelligence report.
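If you want to verify what was extracted, a small sketch like the one below counts the text and page-image artifacts per report. The directory is a placeholder based on the paths above; point it at wherever the script wrote its output on your machine.

```python
# Sketch: count the extracted .txt and .png artifacts per report.
# The directory below is a placeholder; adjust it to the script's output location.
from collections import defaultdict
from pathlib import Path

report_dir = Path("CybersecurityBenchmarks/datasets/crwd_meta/threat_intelligence_reasoning")

artifacts = defaultdict(lambda: {"txt": 0, "png": 0})
for path in report_dir.glob("*"):
    if path.suffix == ".txt":
        artifacts[path.stem]["txt"] += 1
    elif path.suffix == ".png":
        # Image files are named <report_id>__<n>__.png; strip the page suffix.
        artifacts[path.stem.split("__")[0]]["png"] += 1

for report_id, counts in sorted(artifacts.items()):
    print(f"{report_id}: {counts['txt']} text file(s), {counts['png']} page image(s)")
```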
### Run the benchmark
You may run this benchmark using text, images, or both as the modality in which the threat intel report is passed to the model under test.
```
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=threat_intel_reasoning \
   --prompt-path="$DATASETS/crwd_meta/threat_intel_reasoning/report_questions.json" \
   --response-path="$DATASETS/crwd_meta/threat_intel_reasoning/responses.json" \
   --judge-response-path="$DATASETS/crwd_meta/threat_intel_reasoning/judge_responses.json" \
   --stat-path="$DATASETS/crwd_meta/threat_intel_reasoning/stats.json" \
   --input-modality=<INPUT_MODALITY> \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--run-llm-in-parallel]
```

where `<INPUT_MODALITY>` is one of `text`, `image`, or `text_and_image`.
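If you want to compare modalities, a small wrapper along the following lines drives the same CLI once per modality via `subprocess`. It writes per-modality output files so the runs do not overwrite each other; the `<SPECIFICATION>` string is a placeholder for your `--llm-under-test` value, exactly as in the command above.

```python
# Sketch: run the benchmark once for each input modality using the CLI above.
# <SPECIFICATION> is a placeholder for your --llm-under-test value.
import os
import subprocess

base = f"{os.environ['DATASETS']}/crwd_meta/threat_intel_reasoning"

for modality in ("text", "image", "text_and_image"):
    subprocess.run(
        [
            "python3", "-m", "CybersecurityBenchmarks.benchmark.run",
            "--benchmark=threat_intel_reasoning",
            f"--prompt-path={base}/report_questions.json",
            f"--response-path={base}/responses_{modality}.json",
            f"--judge-response-path={base}/judge_responses_{modality}.json",
            f"--stat-path={base}/stats_{modality}.json",
            f"--input-modality={modality}",
            "--llm-under-test=<SPECIFICATION>",
        ],
        check=True,
    )
```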
## Results
Once the benchmark has been run, the evaluation results for each model under test will be available under the `--stat-path`:
### Threat Intelligence Reasoning Results
Output stats will appear in a JSON file with the following structure:
"model_name": {
"stat_per_model": {
"avg_score": ...,
"total_score": ...,
"correct_mc_count": ...,
"incorrect_mc_count": ...,
"response_parsing_error_count": ...,
"correct_mc_pct": ...
},
"stat_per_model_per_report": {
"dprk-espionage": {
"avg_score": ...,
...
},
"stat_per_model_per_source": {
"IC3": {
"avg_score": ...,
...
}
}
where:

- `stat_per_model` contains the aggregate metrics evaluated over all questions in the dataset
- `stat_per_model_per_<CATEGORY>` contains the same metrics aggregated only over the given `<CATEGORY>`, where the categories are the individual report identifier and the source from which the report was obtained (one of IC3, CISA, NSA, or CrowdStrike)

The metrics for each of these include:

- The `(in)correct_mc_count` and `(in)correct_mc_pct` fields report the number and percentage of questions for which the model chose exactly the set of correct options (or failed to do so)
- `response_parsing_error_count` is the number of questions for which we were unable to extract multiple choice selections from the model's response
- `avg_score` is the average Jaccard similarity between the model's chosen multiple choice options and the correct set of options. This provides a less strict metric of correctness that allows for partial credit when the model chooses some but not all of the correct options, or chooses options beyond the correct set
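For reference, a minimal sketch of the Jaccard similarity that `avg_score` is described as averaging (not the benchmark's own implementation) looks like this:

```python
# Sketch of the Jaccard similarity described above (not the benchmark's own code).
def jaccard_similarity(chosen: set[str], correct: set[str]) -> float:
    """Ratio of shared options to the union of chosen and correct options."""
    if not chosen and not correct:
        return 1.0
    return len(chosen & correct) / len(chosen | correct)

# Example: the model picks one of the two correct options plus one wrong option.
print(jaccard_similarity({"A", "C"}, {"A", "B"}))  # 0.333... -> partial credit
```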