Evaluations

Comprehensive library of 19 safety and risk evaluations

⚗️
Harmful Agent Capabilities

AIR Bench: AI Risk Benchmark

Regulation-grounded safety assessment covering direct harm and misuse potential

15 models tested · Safety Violation Rate
⚖️
Bias & Fairness

BBQ: Bias Benchmark for QA

Question answering benchmark detecting stereotypical biases across demographic categories

15 models tested · Bias Score
⚖️
Bias & Fairness

BOLD: Bias in Open-ended Generation

Measures demographic biases in open-ended language generation across diverse prompts

15 models tested · Demographic Bias
💻
Offensive Cyber Capabilities

Catastrophic Cyber Capabilities (3CB)

Comprehensive benchmark testing capabilities that could enable catastrophic cyber attacks

15 models tested · Capability Score
💻
Offensive Cyber Capabilities

CYBERSECEVAL 2

Prompt injection resistance, code interpreter abuse detection, and vulnerability identification

15 models tested · Security Risk Score
💻
Offensive Cyber Capabilities

CYBERSECEVAL 3

Visual prompt injection testing and advanced cybersecurity risk evaluation

15 models tested · Vulnerability Exploitation
💻
Offensive Cyber Capabilities

GDM Dangerous Capabilities: CTF

Capture the flag challenges testing offensive cyber capabilities

15 models tested · CTF Success Rate
💻
Offensive Cyber Capabilities

InterCode: Security & Coding CTF

Security and coding capture-the-flag challenges

15 models tested · Challenge Completion
🎭
Scheming & Deceptive Alignment

MASK: Disentangling Honesty from Accuracy

Tests whether models remain honest under pressure or when truthful answers conflict with user expectations

15 models tested · Honesty Under Pressure
🎭
Scheming & Deceptive Alignment

GDM Dangerous Capabilities: Self-reasoning

Evaluates whether models reason about their deployment situation and constraints

15 models tested · Self-awareness Score
🎭
Scheming & Deceptive Alignment

GDM Dangerous Capabilities: Stealth

Tests models' ability to hide behavior from oversight mechanisms

15 models tested · Stealth Capability
🎭
Scheming & Deceptive Alignment

Sycophancy Eval

Measures tendency to tell users what they want to hear rather than the truth

15 models tested · Sycophancy Rate
🛡️
Adversarial Robustness

StrongREJECT

Measures LLM susceptibility to jailbreak attacks

15 models tested · Jailbreak Resistance
🛡️
Adversarial Robustness

Make Me Pay

Tests susceptibility to social engineering attacks

15 models tested · Social Engineering Resistance
⚗️
Harmful Agent Capabilities

WMDP: Measuring Malicious Use

Hazardous knowledge assessment across biosecurity, cybersecurity, and chemical security

15 models tested · Hazardous Knowledge
⚗️
Harmful Agent Capabilities

SOS BENCH

Benchmarks safety alignment on hazardous scientific knowledge

15 models tested · Safety Alignment
⚖️
Bias & Fairness

StereoSet

Measures stereotypical bias in pretrained language models

15 models tested · Stereotype Score
📊
Calibration & Honesty

XSTest

Identifies exaggerated safety behaviours (over-refusal of benign requests) in LLMs

15 models tested · Refusal Calibration
✈️
Domain-Specific Safety

Pre-Flight

Evaluates knowledge of aviation safety procedures and protocols

15 models tested · Safety Protocol Adherence