Advancing the Fundamental Science of Trustworthy AI
The science of trustworthy AI is nascent. Today, many demonstrations of unsafe behavior are isolated, and mitigations do not generalize reliably across model families, deployment contexts, or capability regimes. Therefore, we aim to support basic technical research that improves our ability to understand, predict, and control risks from frontier AI systems while enabling their trustworthy deployment.
Opportunities for Funding
-
2026 Science of Trustworthy AI RFP
Science of Trustworthy AI
The Challenge
Every day, AI technology is becoming more consequential. As a result, safety failures have the potential to be enormously harmful.
-
We need mature, decision-relevant evaluations.
Today’s evaluations often fail exactly where we need them most: under distribution shift, long-horizon interaction, tool use, and optimization pressure. Many tests are brittle, highly correlated, or easy to “train to”, and stylized scenarios can be misleading if they do not reflect deployment-like contexts. We need a rigorous science of evaluation with construct validity, predictive validity, and clear evidence standards for when results justify real-world decisions.
-
The research capacity needed for trustworthy AI lags the pace of deployment.
Frontier systems are being deployed rapidly, but the infrastructure for trustworthy assessment and oversight is not keeping pace—especially for large, ambitious projects that require substantial compute, multidisciplinary expertise, and time. Accelerating progress will require sustained support for high-ambition, field-shaping research rather than incremental work.
-
Academics are underleveraged in trustworthy AI research.
Currently, safety research on the largest AI models is conducted primarily by leading AI labs. Yet despite the vast private capital flowing into AI development, commercial incentives for foundational, pre-product safety science are often weaker than incentives for capability and product improvements, especially when the benefits are diffuse or long-horizon. Foundational advances in the science of trustworthy AI therefore function as a global public good, motivating targeted philanthropic support for academic and nonprofit researchers.
Program Goals
-
Deepen our understanding of safety properties of AI systems
-
Build a rigorous science of evaluation with construct and predictive validity
-
Advance trustworthy AI approaches resistant to obsolescence from fast-evolving technology
-
Support a global community of researchers advancing the science of trustworthy AI
Research Agenda
Our research agenda organizes priorities around three connected aims:
-
Characterize and forecast misalignment in frontier AI systems
Investigate why frontier AI training-and-deployment safety stacks still produce models that learn effective goals which fail under distribution shift, optimization pressure, or extended interaction.
-
Develop generalizable measurement and intervention
Advance the science of evaluations with decision-relevant construct and predictive validity, and develop interventions that control what AI systems learn (not just what they say).
-
Oversee AI systems with superhuman capabilities and address multi-agent risks
Develop oversight and control methods for settings where direct human evaluation of correctness or safety isn’t feasible, and address risks that emerge from interacting AI systems.
Featured Projects: Understanding Safety in AI Systems
-
Tests of Compositional Generalization as an “Upper Bound” on AI Safety Risks
-
Unlocking Multi-Agent System Safety with Dynamic Islands of Trust
-
LLM-Derived Guardrail for Frontier Models as a Step Towards a Cautious Scientist AI
-
Robust LLM-based Scoring of Agent Alignment
-
Deception & Misinformation: Elicitation and Testing
-
A Meta-analysis Approach to Understanding LM Capabilities
-
Long-Term Safety Behavior of Agents
-
Mechanistic Interpretability to Detect Test Set Contamination
-
Benchmarks for AI Agents and Cybersecurity
-
Adaptive Stress Testing for Automated Unsupervised Large Language Model Testing and Failure Classification
-
Understanding the Mechanism of Adversarial Transfer Across AI Models
-
Beyond Simple Scaling: A Multi-Dimensional Family of Scaling Laws for Interventions in Pre-Training and Model Safety Optimization
-
What Counts as Contamination? How Generalization Could Confound Evaluation
-
Mechanistic Cognitive Alignment for Morally Guided Decision-Making
-
Multiagent-Based T&E Environment Construction
-
A Conformal Safety Assurance Framework for Large Language Models
-
Formal Verification of Software
-
Robustness and Controllability of Language Model-Based Agents Built for Automating Tasks in Software Engineering
-
Five Nines of Reliability for AI Agents
-
Multi-Agent AI Safety via Dynamic Games
-
OpenAgentSafety: Measuring and Mitigating Safety Harms of LLM-based AI Agent Interactions
-
Exploring AI Safety in Free-Form, Evolutionary Multi-Agent Systems
-
Evaluating and Defending Multimodal Computer-Use Agents under Adversarial Attacks with a Realistic Online Environment
-
Formalizing Prompt Injection Defenses
-
BenchCraft: An AI-powered Toolkit to Create Valid and Efficient Benchmarks with Measurement Theory and Human-AI Collaboration
-
Quantifying and Mitigating Privacy Risks in Multi-Agent Systems
-
Tracing and Eliminating Harmful Capabilities Across Model Generations
Featured Projects: Inference-Time Compute
-
A Dual-Framework for Grounding AI Reasoning: Measuring Faithfulness via Code Execution and Debugging
-
Causally Grounded Inference-Time Intervention for Robust Model Alignment
-
Emerging Risks from Inference-Time Compute
-
Evaluating Preparedness via Model Organism Spectrum
-
Faithfulness and Safety of Latent Chain-of-Thought Reasoning
-
Fundamental Limitations of the Test Time Compute Paradigm
-
Improving the Legibility of Inference-Time Reasoning for Scalable Oversight
-
INSPECT: Inference-time Safety for Physically Embodied Chain-of-Thought in Robotics
-
Investigating Hawthorne Effects in Large Reasoning Models
-
Large Language Model Safety in Inference-Time Motivated Reasoning
-
Preventing Reward Hacking at Inference-Time with Reinforcement Learning
-
Quantifying, Attributing, and Improving Faithfulness in Chains of Thought
-
Reasoning with Foresight: Safe Inference via Q-Value Guided Decoding
-
Safely Confident Reasoning
-
Safety in RL-Enabled Goal-Directed Agents
-
Toward Safe Reasoning: Test-Time Unlearning of Sensitive Knowledge Beyond Final Answers
-
Utility Engineering and Moral Scaffolding for Safer Reasoning Models
-
Will Chain-of-Thought Monitoring Significantly Improve Safety?
-
Reasoning Up the Instruction Ladder: Inference-Time Deliberation for Safer AI