benchmark
EVAL_PROTOCOL v3 — capability first, recall second.
The benchmark is published alongside the code so the tool's claims are falsifiable.
v3 scores two independent tracks. CAP — the headline capability
score — measures methodology: did the agent pick the right pivots, stay within
budget, defuse benign infrastructure, and commit to a hypothesis? It is decay‑proof,
so a dead indicator can't sink it. REC (recall) is freshness‑gated:
it is scored only when the liveness probe finds the indicator still live, and is
otherwise skipped as DATA_DECAYED. A separate set of benign seeds tests restraint,
and any hallucinated node or edge is a hard‑fail gate. Cases are drawn from public
write‑ups by Silent Push, Sekoia, Trend Micro, DFIR Report, DomainTools, Intrinsec,
DNSFilter and Trellix.
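The two gates described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: `CaseResult`, its field names, `gate_case`, and the `HARD_FAIL` status are hypothetical, and how a hard fail is reported is an assumption. Only the `DATA_DECAYED` label comes from the protocol text.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # All field names are hypothetical, chosen for illustration only.
    cap: float            # capability score, 0-100
    rec: float            # raw recall score, 0-100
    indicator_live: bool  # outcome of the liveness probe
    hallucinated: int     # count of invented nodes or edges


def gate_case(result: CaseResult) -> dict:
    """Apply the hallucination gate and the freshness gate to one case."""
    # Hard gate: any hallucinated node or edge fails the case outright.
    # Reporting this as a "HARD_FAIL" status is an assumption.
    if result.hallucinated > 0:
        return {"status": "HARD_FAIL"}

    # CAP is decay-proof: it is reported whether or not the indicator is live.
    report = {"status": "OK", "CAP": result.cap}

    # REC is freshness-gated: scored only when the indicator is still live.
    report["REC"] = result.rec if result.indicator_live else "DATA_DECAYED"
    return report
```

Under these assumptions, a decayed case still contributes its CAP score, while its recall is excluded from the REC average instead of being counted as zero.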
EVAL_PROTOCOL · v3
CAP · REC · restraint — hallucination is a hard gate
Latest run: 2026‑06‑01 · de5a31b · Opus 4.8
CAP mean: 92.9 (+6.9 vs 2026‑05‑28)
Hallucination: 0/5 (hard gate clear)
Recall (REC): 61.1 (LIVE subset · n=3)
Restraint · benign: 67 (3 benign seeds)
CAP = 0.40·PS_pivot + 0.25·EFF_budget + 0.20·RST_defuse + 0.15·HYP_hypothesis
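As a worked check of the formula, the sketch below recomputes CAP from the four component scores. The weights are the ones stated in the formula. The function name `cap_score` and the rounding to one decimal are assumptions, chosen to match the precision shown in the results table.

```python
# Weights as stated in the CAP formula.
WEIGHTS = {"PS": 0.40, "EFF": 0.25, "RST": 0.20, "HYP": 0.15}


def cap_score(ps: float, eff: float, rst: float, hyp: float) -> float:
    """Weighted sum of the four component scores (each on a 0-100 scale)."""
    total = (
        WEIGHTS["PS"] * ps
        + WEIGHTS["EFF"] * eff
        + WEIGHTS["RST"] * rst
        + WEIGHTS["HYP"] * hyp
    )
    # Rounding to one decimal is an assumption matching the table's precision.
    return round(total, 1)


# Case c08 (Amadey + StealC) scored 100·38·100·100 in the 2026-06-01 run.
print(cap_score(100, 38, 100, 100))  # 84.5
```

Because the pivot score carries the largest weight, a drop from 100 to 75 on PS alone costs 10 CAP points, which matches the 90.0 reported for cases c09 and c12.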
Nightly fresh subset · 5 positive cases
| # | Threat | Seed | PS·EFF·RST·HYP | CAP | Δ prior | Liveness |
|---|---|---|---|---|---|---|
| c02 | MuddyWater | hash | 100·100·100·100 | 100.0 | +0.0 | LIVE |
| c03 | Bumblebee → Akira | hash | 100·100·100·100 | 100.0 | +9.8 | LIVE |
| c08 | Amadey + StealC | hash | 100·38·100·100 | 84.5 | +19.5 | LIVE |
| c09 | Tycoon 2FA | domain | 75·100·100·100 | 90.0 | +11.4 | DECAYED |
| c12 | ClearFake | domain | 75·100·100·100 | 90.0 | +10.0 | DECAYED |
Negative cases · restraint‑only (benign seeds)
| Benign seed | Type | Nodes | False tags | RST |
|---|---|---|---|---|
| Cloudflare anycast | ip | 6 | 0 | 100 |
| jsDelivr CDN | domain | 15 | 2 | 50 |
| Wikipedia | domain | 22 | 2 | 50 |
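The headline figures of the latest run can be re-derived from the two tables above. The sketch below does so with values copied from those tables. The variable names and the rounding to display precision are assumptions about how the summary is produced.

```python
from statistics import mean

# (case id, CAP, liveness) copied from the positive-case table.
positive = [
    ("c02", 100.0, "LIVE"),
    ("c03", 100.0, "LIVE"),
    ("c08", 84.5, "LIVE"),
    ("c09", 90.0, "DECAYED"),
    ("c12", 90.0, "DECAYED"),
]

# (benign seed, RST) copied from the restraint-only table.
benign = [("Cloudflare anycast", 100), ("jsDelivr CDN", 50), ("Wikipedia", 50)]

# CAP is averaged over all positive cases, live or decayed.
cap_mean = round(mean(cap for _, cap, _ in positive), 1)

# Only LIVE cases are eligible for the recall (REC) track.
live_cases = [case_id for case_id, _, state in positive if state == "LIVE"]

# Restraint is the mean RST over the benign seeds, shown as a whole number.
restraint = round(mean(rst for _, rst in benign))

print(cap_mean)         # 92.9
print(len(live_cases))  # 3
print(restraint)        # 67
```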
Full case catalog · 12 cases ● = ran 2026‑06‑01
| # | Threat | Primary marker | Difficulty |
|---|---|---|---|
| 01 | Salt Typhoon | registrant‑email reverse‑WHOIS | medium |
| 02 | MuddyWater | JARM / TLS banner | easy‑med |
| 03 | Bumblebee → Akira | file → contacted infra | medium |
| 04 | Interlock | Cloudflare tunnel defuse | hard |
| 05 | Eye Pyramid | bulletproof ASN + default page | hard |
| 06 | LummaC2 | SSL SHA‑1 cluster + content | hard |
| 07 | SocGholish | shared‑IP co‑residency | medium |
| 08 | Amadey + StealC | apex / subdomain disambig. | med‑hard |
| 09 | Tycoon 2FA | CT issuance‑date burst | medium |
| 10 | Contagious Interview | DNS TXT/MX cross‑ref | med‑hard |
| 11 | Smishing Triad | Cloudflare origin unmask | hard |
| 12 | ClearFake | cert CN → Shodan origin | medium |