## Malware Analysis

### Running the Malware Analysis Benchmark
Make sure that you run:

```bash
git submodule add https://github.com/CrowdStrike/CyberSOCEval_data CyberSOCEval_data
```

before running the benchmark. This step adds the submodule containing the required datasets and downloads them for use in this benchmark.

Note: You only need to run this command once. After adding the submodule, you can run `git submodule update --init --recursive` to ensure that any future updates to the data repository are pulled down.
Then, run the benchmark with the following command:
```bash
python3 -m CybersecurityBenchmarks.benchmark.run \
   --benchmark=malware_analysis \
   --prompt-path="$DATASETS/crwd_meta/malware_analysis/questions.json" \
   --response-path="$DATASETS/crwd_meta/malware_analysis/responses.json" \
   --judge-response-path="$DATASETS/crwd_meta/malware_analysis/judge_responses.json" \
   --stat-path="$DATASETS/crwd_meta/malware_analysis/stats.json" \
   --llm-under-test=<SPECIFICATION_1> --llm-under-test=<SPECIFICATION_2> ...
   [--truncate-input] [--run-llm-in-parallel]
```
If you are running this against a model with a context window smaller than 128k tokens, we suggest using the `--truncate-input` flag, which filters out some fields from the Hybrid Analysis report and truncates the values of other fields. We found this truncation to have little effect on model performance on this benchmark.
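The exact filtering applied by `--truncate-input` is internal to the benchmark; as a rough illustration of the idea, here is a minimal sketch assuming a Hybrid Analysis-style report held in a Python dict. The dropped field names and length limit below are hypothetical, not the benchmark's actual configuration.

```python
# Hypothetical sketch of report truncation for small-context models.
# The dropped field names and the length limit are assumptions, not the
# benchmark's actual logic.
def truncate_report(report: dict, drop_fields=("processes", "extracted_files"),
                    max_value_len: int = 2000) -> dict:
    """Drop verbose report sections and clip overly long string values."""
    truncated = {}
    for key, value in report.items():
        if key in drop_fields:
            continue  # skip fields that dominate the token budget
        if isinstance(value, str) and len(value) > max_value_len:
            value = value[:max_value_len] + " ...[truncated]"
        truncated[key] = value
    return truncated
```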
### Results
Once the benchmarks have run, the evaluation results for each model will be available under the `stat_path`:

*Malware Analysis Results*
Output stats will appear in a JSON file with the following structure:
"model_name": {
"stat_per_model": {
"avg_score": ...,
"total_score": ...,
"correct_mc_count": ...,
"incorrect_mc_count": ...,
"response_parsing_error_count": ...,
"correct_mc_pct": ...
},
"stat_per_model_per_topic": {
"Risk Assessment": {
"avg_score": ...,
...
},
"stat_per_model_per_difficulty": {
"medium": {
"avg_score": ...,
...
},
"stat_per_model_per_attack": {
"infostealers": {
"avg_score": ...,
...
}
}
where:

- `stat_per_model` contains the aggregate metrics evaluated over all questions in the dataset
- `stat_per_model_per_<CATEGORY>` contains the same metrics aggregated only over the given `<CATEGORY>`, where the categories include question topic, difficulty level, and malware type ('attack'). Metrics for each of these include:
  - `correct_mc_count`/`incorrect_mc_count` and `correct_mc_pct`, the number and percentage of questions for which the model chose exactly the set of correct options (or failed to)
  - `response_parsing_error_count`, the number of questions for which we were unable to extract multiple choice selections from the model's response
  - `avg_score`, the average of the Jaccard similarity computed between the model's chosen multiple choice options and the correct set of options. This provides a less strict metric of correctness that allows for partial credit when the model selects some, but not all, of the correct options, or options beyond the correct set.
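For reference, the Jaccard similarity behind `avg_score` is the size of the intersection of the chosen and correct option sets divided by the size of their union. Below is a minimal sketch of that computation; the option labels and function name are illustrative, not the benchmark's internal API.

```python
# Minimal sketch of Jaccard-based scoring; not the benchmark's internal code.
def jaccard_score(chosen: set[str], correct: set[str]) -> float:
    """Size of intersection over size of union of the two option sets."""
    union = chosen | correct
    if not union:
        return 1.0
    return len(chosen & correct) / len(union)

print(jaccard_score({"A", "C"}, {"A", "C"}))       # 1.0   -> exact match, counted as correct
print(jaccard_score({"A"}, {"A", "C"}))            # 0.5   -> partial credit for a missed option
print(jaccard_score({"A", "B", "C"}, {"A", "C"}))  # ~0.67 -> extra option beyond the correct set is penalized
```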