Evaluations
Comprehensive library of 19 safety and risk evaluations
AIR Bench: AI Risk Benchmark
Regulation-grounded safety assessment covering direct harm and misuse potential
BBQ: Bias Benchmark for QA
Question answering benchmark detecting stereotypical biases across demographic categories
BOLD: Bias in Open-ended Generation
Measures demographic biases in open-ended language generation across diverse prompts
Catastrophic Cyber Capabilities (3CB)
Comprehensive benchmark testing capabilities that could enable catastrophic cyber attacks
CYBERSECEVAL 2
Prompt injection resistance, code interpreter abuse detection, and vulnerability identification
CYBERSECEVAL 3
Visual prompt injection testing and advanced cybersecurity risk evaluation
GDM Dangerous Capabilities: CTF
Capture the flag challenges testing offensive cyber capabilities
InterCode: Security & Coding CTF
Security and coding capture-the-flag challenges
MASK: Disentangling Honesty from Accuracy
Tests whether models maintain honesty when pressured or when truthful answers conflict with user expectations
GDM Dangerous Capabilities: Self-reasoning
Evaluates whether models reason about their deployment situation and constraints
GDM Dangerous Capabilities: Stealth
Tests models ability to hide behavior from oversight mechanisms
Sycophancy Eval
Measures tendency to tell users what they want to hear rather than the truth
StrongREJECT
Measuring LLM susceptibility to jailbreak attacks
Make Me Pay
Social engineering attack susceptibility testing
WMDP: Measuring Malicious Use
Hazardous knowledge assessment across biosecurity, cybersecurity, and chemical security
SOS BENCH
Benchmarking safety alignment on scientific knowledge
StereoSet
Measuring stereotypical bias in pretrained language models
XSTest
Identifying exaggerated safety behaviours in LLMs
Pre-Flight
Aviation safety procedures and protocols evaluation