Evaluations

Comprehensive library of 19 safety and risk evaluations

⚗️
Harmful Agent Capabilities

AIR Bench: AI Risk Benchmark

Regulation-grounded safety assessment covering direct harm and misuse potential

15 models tested · Safety Violation Rate
⚖️
Bias & Fairness

BBQ: Bias Benchmark for QA

Question answering benchmark detecting stereotypical biases across demographic categories

15 models tested · Bias Score
⚖️
Bias & Fairness

BOLD: Bias in Open-ended Generation

Measures demographic biases in open-ended language generation across diverse prompts

15 models tested · Demographic Bias
💻
Offensive Cyber Capabilities

Catastrophic Cyber Capabilities (3CB)

Comprehensive benchmark testing capabilities that could enable catastrophic cyber attacks

15 models tested · Capability Score
💻
Offensive Cyber Capabilities

CYBERSECEVAL 2

Prompt injection resistance, code interpreter abuse detection, and vulnerability identification

15 models tested · Security Risk Score
💻
Offensive Cyber Capabilities

CYBERSECEVAL 3

Visual prompt injection testing and advanced cybersecurity risk evaluation

15 models tested · Vulnerability Exploitation
💻
Offensive Cyber Capabilities

GDM Dangerous Capabilities: CTF

Capture the flag challenges testing offensive cyber capabilities

15 models tested · CTF Success Rate
💻
Offensive Cyber Capabilities

InterCode: Security & Coding CTF

Security and coding capture-the-flag challenges

15 models tested · Challenge Completion
🎭
Scheming & Deceptive Alignment

MASK: Disentangling Honesty from Accuracy

Tests whether models remain honest under pressure or when truthful answers conflict with user expectations

15 models tested · Honesty Under Pressure
🎭
Scheming & Deceptive Alignment

GDM Dangerous Capabilities: Self-reasoning

Evaluates whether models reason about their deployment situation and constraints

15 models tested · Self-awareness Score
🎭
Scheming & Deceptive Alignment

GDM Dangerous Capabilities: Stealth

Tests models' ability to hide behavior from oversight mechanisms

15 models tested · Stealth Capability
🎭
Scheming & Deceptive Alignment

Sycophancy Eval

Measures tendency to tell users what they want to hear rather than the truth

15 models tested · Sycophancy Rate
🛡️
Adversarial Robustness

StrongREJECT

Measures LLM susceptibility to jailbreak attacks

15 models tested · Jailbreak Resistance
🛡️
Adversarial Robustness

Make Me Pay

Tests susceptibility to social engineering attacks

15 models tested · Social Engineering Resistance
⚗️
Harmful Agent Capabilities

WMDP: Measuring Malicious Use

Hazardous knowledge assessment across biosecurity, cybersecurity, and chemical security

15 models tested · Hazardous Knowledge
⚗️
Harmful Agent Capabilities

SOS BENCH

Benchmarks safety alignment on hazardous scientific knowledge

15 models tested · Safety Alignment
⚖️
Bias & Fairness

StereoSet

Measures stereotypical bias in pretrained language models

15 models tested · Stereotype Score
📊
Calibration & Honesty

XSTest

Identifies exaggerated safety behaviours (over-refusal of benign requests) in LLMs

15 models tested · Refusal Calibration
✈️
Domain-Specific Safety

Pre-Flight

Evaluates knowledge of aviation safety procedures and protocols

15 models tested · Safety Protocol Adherence