## Malware Analysis

### Running the Malware Analysis Benchmark
Make sure that you run:

```bash
git submodule add https://github.com/CrowdStrike/CyberSOCEval_data CyberSOCEval_data
```

before running the benchmark. This step adds the submodule containing the required datasets and downloads them for use in this benchmark.

Note: You only need to run this command once. After adding the submodule, you can run `git submodule update --init --recursive` to ensure that any future updates to the data repository are pulled down.
Then, run the benchmark with the following command:
```bash
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=malware_analysis \
   --prompt-path="$DATASETS/crwd_meta/malware_analysis/questions.json" \
   --response-path="$DATASETS/crwd_meta/malware_analysis/responses.json" \
   --judge-response-path="$DATASETS/crwd_meta/malware_analysis/judge_responses.json" \
   --stat-path="$DATASETS/crwd_meta/malware_analysis/stats.json" \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--truncate-input] [--run-llm-in-parallel]
```
If you are running this against a model with a context window smaller than 128k tokens, we suggest using the `--truncate-input` flag, which filters out some fields from the Hybrid Analysis report and truncates the values of other fields. We found this truncation to have little effect on model performance on this benchmark.
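The exact filtering applied by `--truncate-input` is internal to the benchmark; as a rough illustration of the idea, here is a minimal sketch assuming a Hybrid Analysis-style report held in a Python dict. The dropped field names and length limit below are hypothetical, not the benchmark's actual configuration.

```python
# Hypothetical sketch of report truncation for small-context models.
# The dropped field names and the length limit are assumptions, not the
# benchmark's actual logic.
def truncate_report(report: dict, drop_fields=("processes", "extracted_files"),
                    max_value_len: int = 2000) -> dict:
    """Drop verbose report sections and clip overly long string values."""
    truncated = {}
    for key, value in report.items():
        if key in drop_fields:
            continue  # skip fields that dominate the token budget
        if isinstance(value, str) and len(value) > max_value_len:
            value = value[:max_value_len] + " ...[truncated]"
        truncated[key] = value
    return truncated
```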
### Results
Once the benchmarks have run, the evaluation results for each model will be available under the `stat_path`:

*Malware Analysis Results*
Output stats will appear in a JSON file with the following structure:
"model_name": {
"stat_per_model": {
"avg_score": ...,
"total_score": ...,
"correct_mc_count": ...,
"incorrect_mc_count": ...,
"response_parsing_error_count": ...,
"correct_mc_pct": ...
},
"stat_per_model_per_topic": {
"Risk Assessment": {
"avg_score": ...,
...
},
"stat_per_model_per_difficulty": {
"medium": {
"avg_score": ...,
...
},
"stat_per_model_per_attack": {
"infostealers": {
"avg_score": ...,
...
}
}
where:

- `stat_per_model` contains the aggregate metrics evaluated over all questions in the dataset
- `stat_per_model_per_<CATEGORY>` contains the same metrics aggregated only over the given `<CATEGORY>`, where the categories include question topic, difficulty level, and malware type ('attack'). Metrics for each of these include:
  - `correct_mc_count`/`incorrect_mc_count` and `correct_mc_pct`, the number and percentage of questions for which the model chose exactly the set of correct options (or failed to)
  - `response_parsing_error_count`, the number of questions for which we were unable to extract multiple choice selections from the model's response
  - `avg_score`, the average of the Jaccard similarity computed between the model's chosen multiple choice options and the correct set of options. This provides a less strict metric of correctness that allows for partial credit when the model selects some, but not all, of the correct options, or options beyond the correct set.
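For reference, the Jaccard similarity behind `avg_score` is the size of the intersection of the chosen and correct option sets divided by the size of their union. Below is a minimal sketch of that computation; the option labels and function name are illustrative, not the benchmark's internal API.

```python
# Minimal sketch of Jaccard-based scoring; not the benchmark's internal code.
def jaccard_score(chosen: set[str], correct: set[str]) -> float:
    """Size of intersection over size of union of the two option sets."""
    union = chosen | correct
    if not union:
        return 1.0
    return len(chosen & correct) / len(union)

print(jaccard_score({"A", "C"}, {"A", "C"}))       # 1.0   -> exact match, counted as correct
print(jaccard_score({"A"}, {"A", "C"}))            # 0.5   -> partial credit for a missed option
print(jaccard_score({"A", "B", "C"}, {"A", "C"}))  # ~0.67 -> extra option beyond the correct set is penalized
```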