# Threat Intelligence Reasoning

## Running the Threat Intelligence Reasoning Benchmark

### Download Threat Intelligence Report Data
This benchmark includes questions about public threat intelligence reports. Depending on the source, the report PDFs are hosted on various agency websites or in CrowdStrike's CyberSOCEval data repo; all of them can be downloaded and extracted to the appropriate local directory by running the `download_reports.py` script.
Before running the download script:
- To add the submodule containing the CrowdStrike reports and download the reports used in this benchmark, first run:

  ```
  git submodule add https://github.com/CrowdStrike/CyberSOCEval_data CyberSOCEval_data
  ```

  Note: you only need to run this command once. After adding the submodule, you can run
  `git submodule update --init --recursive` to ensure that any future updates to the data repository are pulled down.
- You will also need to install some additional system dependencies for the `pdf2image` package, which is used for converting the PDFs to images (a quick sanity check is shown after this list):
  - On Ubuntu/Debian: `sudo apt-get install poppler-utils`
  - On Arch Linux: `sudo pacman -S poppler`
  - On macOS: `brew install poppler`
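If you want to confirm that `pdf2image` can find the poppler binaries before running the full download, a minimal sketch like the following will surface a clear error if they are missing. The PDF path here is just a placeholder; substitute any local PDF.

```python
# Sanity check: confirm pdf2image can locate the poppler binaries.
# "sample.pdf" is a placeholder path; substitute any local PDF.
from pdf2image import convert_from_path

try:
    pages = convert_from_path("sample.pdf", dpi=100)
    print(f"Converted {len(pages)} page(s) to images")
except Exception as e:  # e.g. a PDFInfoNotInstalledError if poppler is missing
    print(f"pdf2image/poppler check failed: {e}")
```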
Once these steps are complete, you can run:

```
python3 -m CybersecurityBenchmarks.datasets.crwd_meta.threat_intel_reasoning.download_reports
```

This script downloads the reports from their various hosting sites, extracts the text of each PDF to `CybersecurityBenchmarks/datasets/threat_intelligence_reasoning/<report_id>.txt`, and writes a set of PNG images (one image per page of the PDF report) to `CybersecurityBenchmarks/datasets/crwd_meta/threat_intelligence_reasoning/<report_id>__<n>__.png`. This text and/or image data is passed to the benchmark as the context for the questions about each threat intelligence report.
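If you want to verify what was extracted, a small sketch like the one below counts the text and page-image artifacts per report. The directory is a placeholder based on the paths above; point it at wherever the script wrote its output on your machine.

```python
# Sketch: count the extracted .txt and .png artifacts per report.
# The directory below is a placeholder; adjust it to the script's output location.
from collections import defaultdict
from pathlib import Path

report_dir = Path("CybersecurityBenchmarks/datasets/crwd_meta/threat_intelligence_reasoning")

artifacts = defaultdict(lambda: {"txt": 0, "png": 0})
for path in report_dir.glob("*"):
    if path.suffix == ".txt":
        artifacts[path.stem]["txt"] += 1
    elif path.suffix == ".png":
        # Image files are named <report_id>__<n>__.png; strip the page suffix.
        artifacts[path.stem.split("__")[0]]["png"] += 1

for report_id, counts in sorted(artifacts.items()):
    print(f"{report_id}: {counts['txt']} text file(s), {counts['png']} page image(s)")
```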
### Run the benchmark
You may run this benchmark using text, images, or both as the modality in which the threat intel report is passed to the model under test.
```
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=threat_intel_reasoning \
   --prompt-path="$DATASETS/crwd_meta/threat_intel_reasoning/report_questions.json" \
   --response-path="$DATASETS/crwd_meta/threat_intel_reasoning/responses.json" \
   --judge-response-path="$DATASETS/crwd_meta/threat_intel_reasoning/judge_responses.json" \
   --stat-path="$DATASETS/crwd_meta/threat_intel_reasoning/stats.json" \
   --input-modality=<INPUT_MODALITY> \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--run-llm-in-parallel]
```

where `<INPUT_MODALITY>` is one of `text`, `image`, or `text_and_image`.
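If you want to compare modalities, a small wrapper along the following lines drives the same CLI once per modality via `subprocess`. It writes per-modality output files so the runs do not overwrite each other; the `<SPECIFICATION>` string is a placeholder for your `--llm-under-test` value, exactly as in the command above.

```python
# Sketch: run the benchmark once for each input modality using the CLI above.
# <SPECIFICATION> is a placeholder for your --llm-under-test value.
import os
import subprocess

base = f"{os.environ['DATASETS']}/crwd_meta/threat_intel_reasoning"

for modality in ("text", "image", "text_and_image"):
    subprocess.run(
        [
            "python3", "-m", "CybersecurityBenchmarks.benchmark.run",
            "--benchmark=threat_intel_reasoning",
            f"--prompt-path={base}/report_questions.json",
            f"--response-path={base}/responses_{modality}.json",
            f"--judge-response-path={base}/judge_responses_{modality}.json",
            f"--stat-path={base}/stats_{modality}.json",
            f"--input-modality={modality}",
            "--llm-under-test=<SPECIFICATION>",
        ],
        check=True,
    )
```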
## Results
Once the benchmark has been run, the evaluation results for each model under test will be available under the `--stat-path`:
### Threat Intelligence Reasoning Results
Output stats will appear in a JSON file with the following structure:
"model_name": {
"stat_per_model": {
"avg_score": ...,
"total_score": ...,
"correct_mc_count": ...,
"incorrect_mc_count": ...,
"response_parsing_error_count": ...,
"correct_mc_pct": ...
},
"stat_per_model_per_report": {
"dprk-espionage": {
"avg_score": ...,
...
},
"stat_per_model_per_source": {
"IC3": {
"avg_score": ...,
...
}
}
where:

- `stat_per_model` contains the aggregate metrics evaluated over all questions in the dataset
- `stat_per_model_per_<CATEGORY>` contains the same metrics aggregated only over the given `<CATEGORY>`, where the categories are the individual report identifier and the source from which the report was obtained (one of IC3, CISA, NSA, or CrowdStrike)

The metrics for each of these include:

- The `(in)correct_mc_count` and `(in)correct_mc_pct` fields report the number and percentage of questions for which the model chose exactly the set of correct options (or failed to do so)
- `response_parsing_error_count` is the number of questions for which we were unable to extract multiple choice selections from the model's response
- `avg_score` is the average Jaccard similarity between the model's chosen multiple choice options and the correct set of options. This provides a less strict metric of correctness that allows for partial credit when the model chooses some but not all of the correct options, or chooses options beyond the correct set
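For reference, a minimal sketch of the Jaccard similarity that `avg_score` is described as averaging (not the benchmark's own implementation) looks like this:

```python
# Sketch of the Jaccard similarity described above (not the benchmark's own code).
def jaccard_similarity(chosen: set[str], correct: set[str]) -> float:
    """Ratio of shared options to the union of chosen and correct options."""
    if not chosen and not correct:
        return 1.0
    return len(chosen & correct) / len(chosen | correct)

# Example: the model picks one of the two correct options plus one wrong option.
print(jaccard_similarity({"A", "C"}, {"A", "B"}))  # 0.333... -> partial credit
```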