AI Safety Science

The AI Safety Science program aims to deepen our understanding of the safety properties of systems built with large language models (LLMs) and to develop well-founded, concrete, implementable technical methods for testing and evaluating LLMs.

Scroll

Overview

Artificial intelligence (AI) systems, predominantly based on large language models (LLMs), are finding an enormous number of applications around the world at an astonishing pace. Every week, AI technology is becoming more consequential in everyday lives, government and military operations, and across almost all industries. As a result, the impacts of safety failures of the technology are potentially widespread and of great significance. While powerful, LLMs do not always provide reliable or predictable results and their vulnerability to abuse is significant. 

 

The science of AI safety is nascent. We have many ways to demonstrate unsafe model behaviors, but these tend to be anecdotal and disconnected. Currently, there are no measures of LLM model safety that are robust, verifiable, and scalable. We have no consistent and broadly applicable testing regimes. Safety training in a model can often be reversed post facto, or be gotten around with clever techniques and motivated adversaries. And we don’t have the ability to demonstrate safety as an affirmative property of a machine-learned model. 

 

In short, we view AI safety as an embryonic field of huge import, full of research opportunities and open questions. According to one estimate, only 1-3% of AI publications are on safety (Toner et al., 2022; Emerging Tech Observatory, 2024).

 

Based on our research, we have unearthed some key issues within this space:

 

  • 1. We do not have a robust ecosystem of safety benchmarks and evaluations. We believe there are too few well-founded, quality benchmarks; current benchmarks are highly correlated (suggesting they are not assessing independent capabilities of a model); and evaluations for many agentic and multi-modal capabilities do not yet exist. At the same time, experts widely acknowledge that, while current models are unlikely to have major safety concerns, more safety efforts are needed now to deal with the more sophisticated systems that will inevitably be developed and deployed—and the greater safety risks they will present.

 

  • 2. Current philanthropic and government funding of AI safety research is insufficient. This view has been shared by the academic community, peer funders, government agencies, and frontier research labs. One estimate puts total funding for AI safety research at only $80-130 million per year over the 2021-2024 period (LessWrong, 2024). One of the largest government research programs dedicated to AI safety, Safe Learning-Enabled Systems, plans to award only $10 million per year in 2023 and 2024 (NSF, 2023). This level of funding prohibits the type of larger and longer-term research efforts that would require more talent, compute, and time.

 

  • 3. Academics are underleveraged in AI safety. Machine learning as a field, led by the academic community, has grown into a powerful and worldwide discipline; contributing to key basic algorithms that undergird frontier language models and has resulted in many impressive applications. However, work on the safety of the largest models has been conducted mainly by frontier AI labs. Based on the scale and success of ML as a discipline, we believe that faculty and students can, with the right resources, make progress on deeper understanding of LLMs and principled approaches to evaluation that are generally beyond the faster research timelines for industry players under competitive pressure. Furthermore, the established practice of peer review, alongside code publication that enables reproducibility, can bring a more principled perspective to AI safety research. This community lacks access to frontier models and testing environments, adequate compute resources, and dedicated software engineers to allow them to conduct high quality research in this vital area.

Program Goals

The program aims to advance the science of AI safety by:

 

  • Deepening our understanding of the safety properties of systems built with large language models (LLMs). 
  • Developing well-founded, concrete implementable technical methods for testing and evaluating LLMs.
  • Understanding the relationships between and implications of various testing methodologies.
  • Growing the technical AI safety academic community, especially globally.
  • Ensuring that safety theory is informed by practice.

 

Program Approach

This program will pursue the following main activities:

 

Funding for discrete research projects globally aligned with a targeted research agenda, both directly and through competitive open calls, with a focus on under-resourced early- to mid-career faculty.

 

Compute capacity for academic labs.

 

Convenings to build the global AI safety research community, showcase the research of our grantees, and support frontier companies in sharing safety problems they are encountering.

Program Research Agenda

Our research will focus on state-of-the-art LLMs, and on agent and multi-agent AI (i.e., AI that can act in the real world) and multi-modal settings (including speech, code, video, still images, and drawings).

 

We are interested in supporting scientific advances that can be broadly applied to safety criteria and testing methodologies for large classes of models and which would remain applicable even in the face of rapid technological advancement:

 

A. Understanding how to assure and measure generative AI systems
1. Formal assurance
2. Characteristics of high-quality benchmarks
3. Contamination

 

B. Generalizability of test results
4. Transferability
5. Multi-agent systems
6. Relative safety

 

C. Testing and evaluation frameworks for repeatable and scalable testing
7. Agentic testing environments
8. LLM-based scoring
9. Automated evaluations

 

Read our full research agenda via the link below.

 

Contact Information Form

If you would like to be notified about upcoming funding opportunities and keep up to date with programmatic developments, please complete the form linked below.

Featured Projects

Dr. Sanjeev Arora, Princeton University

Dr. Arora will develop a methodology to quantify the upper bound of risk of an AI model creating unsafe behaviors, in contrast to many current evaluations that implicitly work with a lower bound

Dr. Yoshua Bengio, Mila - Quebec Artificial Intelligence Institute

Dr. Bengio will develop a risk mitigation technology for AI systems by building quantitative safety guardrails for frontier AI models.

Dr. Adam Gleave & Kellin Pelrine, FAR.AI, and Dr. Thomas Costello, American University and MIT

Drs. Gleave, Pelrine and Costello will assess the risks of AI-driven deception and disinformation, develop mitigations, and provide a platform for future action towards solving them.

Dr. Tatsu Hashimoto, Stanford University

Dr. Hashimoto will create a testing environment that provides statistical guarantees for the failure rates of AI agents, as well as build new safety metrics that aim to identify emergent, unsafe capabilities of AI systems before they appear.

Dr. Matthias Hein, University of Tübingen & Dr. Jonas Geiping, ELLIS Institute Tübingen

Dr. Hein and Dr. Geiping will investigate how and why AI systems become less safe when they operate for extended periods of time.

Dr. Daniel Kang, University of Illinois, Urbana-Champaign

Dr. Kang will assess the ability of AI agents to perform complex cybersecurity attacks. 

Dr. Mykel Kochenderfer, Stanford University

Dr. Kochenderfer will apply the technique of adaptive stress testing to develop a software toolbox that helps automate LLM testing and evaluation.

Dr. Zico Kolter, Carnegie Mellon University

Dr. Kolter will explore the causes of adversarial transfer, a phenomenon where attacks developed for one AI model are also effective when applied to other models.  

Dr. Sanmi Koyejo, Stanford University

Dr. Koyejo will explore how new pretraining interventions affect LLM capabilities and safety during scaling.

Dr. Anna Leshinskaya, University of California-Irvine

Dr. Leshinskaya will use cognitive theory to advance our understanding of how LLMs solve complex tasks by breaking those tasks into their more fundamental componen

Dr. Bo Li, University of Illinois Urbana-Champaign

Dr. Li will design and develop a virtual environment with advanced red teaming algorithms for automatically evaluating AI systems and AI agents, with a focus on exploring what level of access to AI models is needed for different levels of evaluation.

Dr. Evan Miyazono & Daniel Windham, Atlas Computing

Dr. Miyazono and Windham will design and develop an AI tool for generating, reviewing, and validating formal specifications for computer code, which will help test the accuracy of AI systems that write that code.

Dr. Karthik Narasimhan, Princeton University

Dr. Narasimhan will develop benchmarks to evaluate the reliability and consistency of AI systems that write computer code.

Dr. Arvind Narayanan, Princeton University

Dr. Narayanan will design and use verifiers that check the accuracy of AI systems to improve the reliability of AI agents.

Dr. Aditi Raghunathan, Dr. Aviral Kumar & Dr. Andrea Bajcsy , Carnegie Mellon University

Dr. Raghunathan, Kumar & Bajcsy will develop a novel way to evaluate the safety of AI systems using adversarial multi-player games. Dr. Raghunathan is also an Early Career Fellow of the AI2050 program at Schmidt Sciences.

Dr.. Maarten Sap & Dr. Graham Neubig, Carnegie Mellon University

Drs. Sap & Neubig will develop a simulation environment for assessing the safety risks stemming from interactions between AI users, AI agents, and tools.

Dr. Dawn Song, University of California-Berkeley

Dr. Song will investigate safety issues in multi-agent LLM systems.

Dr. Florian Tramèr, ETH Zurich

Dr. Tramèr will design guardrails against prompt injection attacks, a phenomenon where an AI system encounters text that instructs it to act in unintended and malicious ways, and develop a formal programming framework to standardize these guardrails.

Dr. Ziang Xiao, Johns Hopkins University and Dr. Susu Zhang, University of Illinois Urbana-Champaign

Dr. Xiao and Zhang will build measurement theory grounded framework and tools  to identify gaps in current AI benchmark research and scaffold novel AI benchmark creation.

Advisory Board

Percy Liang

Percy is an Associate Professor of Computer Science at Stanford University and the director of the Center for Research on Foundation Models. He is currently focused on making foundation models (in particular, language models) more accessible through open-source and understandable through rigorous benchmarking. In the past, he has worked on many topics centered on machine learning and natural language processing, including robustness, interpretability, human interaction, learning theory, grounding, semantics, and reasoning. His awards include the Presidential Early Career Award for Scientists and Engineers, IJCAI Computers and Thought Award, an NSF CAREER Award, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, and paper awards at ACL, EMNLP, ICML, COLT, ISMIR, CHI, UIST, and RSS. Percy holds a B.S. from MIT and a PhD from UC Berkeley.

Yonadav Shavit

Yonadav is a member of the policy staff at OpenAI, focused on frontier AI risks. He previously received his PhD in Computer Science from Harvard University, where he researched technical assurance mechanisms for accountable frontier AI development and deployment. Prior to that, Yonadav was an associate on the Plaintext Group at Schmidt Futures, where he worked on AI policy and grantmaking. Yonadav also holds a BS and MEng in electrical engineering and computer science from MIT.

Ajeya Cotra

Ajeya is a Senior Program Officer at Open Philanthropy focused on risks from advanced AI. She analyzes when extremely powerful AI might be developed and what risks that might pose, and runs targeted funding programs to support empirical ML research that could shed light on these questions (recently wrapping up a $25M funding program on benchmarks to measure AI agents’ capabilities). She also speaks about AI futurism and AI risk for a popular audience. She’s been quoted in NYT and Vox, appeared on podcasts including Hard Fork and Freakonomics, was a speaker at CODE 2023, and runs the publication Planned Obsolescence.

JueYan Zhang

JueYan is a philanthropic advisor and a grantmaker. He runs the AI Safety Tactical Opportunities Fund (AISTOF), a pooled multi-donor charitable fund which seeks to reduce the probability and severity of catastrophic risks from advanced AI. Since inception in September 2023, the fund has raised more than $10mm and disbursed more than $7mm. JueYan serves on the boards of Family Empowerment Media and Suvita. He also advises Ambitious Impact and is a speculation grantor for the Survival and Flourishing Fund. Previously, JueYan spent a decade earning-to-give as a hedge fund manager. JueYan holds a BS in business administration and a BA in statistics from UC Berkeley.

Mark Greaves

Mark is the Executive Director of the AI Institute at Schmidt Sciences. Prior to Schmidt Sciences, Mark was a senior leader in AI and data analytics within the National Security Directorate at Pacific Northwest National Laboratory. Prior to this, Mark was Director of Knowledge Systems at Vulcan Inc., the private asset management company for Paul Allen, where he led global research teams in question-answering textbooks, large knowledge graphs, semantic web, and crowdsourcing. Mark was Director of DARPA’s Joint Logistics Technology Office and Program Manager in DARPA’s Information Exploitation Office. Mark was awarded the Office of the Secretary of Defense Medal for Exceptional Public Service for his contributions to US national security. He holds a BA in cognitive science from Amherst College, an MS in computer science from UCLA, and a PhD in philosophy from Stanford University.

Ron Brachman

We also thank Ron Brachman for his generous support in designing this program.

Questions and Comments

We are interested in receiving feedback on our program, input on our research agenda, and ideas for research projects that would advance this idea. Please email us at aisafety@schmidtsciences.org

Work we’re doing for science

We build networks of brilliant researchers at different career stages. We lead Virtual Institutes of Science to solve hard problems across locations and fields using modern tools.