Safety Science
The Safety Science program aims to deepen our understanding of the safety properties of systems built with large language models (LLMs) and to develop well-founded, concrete, implementable technical methods for testing and evaluating LLMs.
Overview
Artificial intelligence (AI) systems, predominantly based on large language models (LLMs), are finding an enormous number of applications around the world at an astonishing pace. Every week, AI technology becomes more consequential in everyday life, in government and military operations, and across almost all industries. As a result, the impacts of safety failures of the technology are potentially widespread and of great significance. While powerful, LLMs do not always produce reliable or predictable results, and they are significantly vulnerable to abuse.
The science of AI safety is nascent. We have many ways to demonstrate unsafe model behaviors, but these tend to be anecdotal and disconnected. Currently, there are no measures of LLM safety that are robust, verifiable, and scalable, and no consistent, broadly applicable testing regimes. Safety training in a model can often be reversed after the fact or circumvented by clever techniques and motivated adversaries. And we do not yet have the ability to demonstrate safety as an affirmative property of a machine-learned model.
In short, we view AI safety as an embryonic field of huge import, full of research opportunities and open questions. According to one estimate, only 1-3% of AI publications are on safety (Toner et al., 2022; Emerging Tech Observatory, 2024).
Based on our research, we have identified several key issues in this space:
- 1. We do not have a robust ecosystem of safety benchmarks and evaluations. We believe there are too few well-founded, high-quality benchmarks; current benchmarks are highly correlated, suggesting they are not assessing independent capabilities of a model (see the illustrative sketch after this list); and evaluations for many agentic and multi-modal capabilities do not yet exist. At the same time, experts widely acknowledge that, while current models are unlikely to pose major safety concerns, more safety work is needed now to deal with the more sophisticated systems that will inevitably be developed and deployed, and with the greater safety risks they will present.
- 2. Current philanthropic and government funding of AI safety research is insufficient. This view is shared by the academic community, peer funders, government agencies, and frontier research labs. One estimate puts total funding for AI safety research at only $80-130 million per year over the 2021-2024 period (LessWrong, 2024). One of the largest government research programs dedicated to AI safety, Safe Learning-Enabled Systems, plans to award only $10 million per year in 2023 and 2024 (NSF, 2023). This level of funding precludes the larger, longer-term research efforts that require more talent, compute, and time.
- 3. Academics are underleveraged in AI safety. Machine learning as a field, led by the academic community, has grown into a powerful, worldwide discipline, contributing key algorithms that undergird frontier language models and producing many impressive applications. However, work on the safety of the largest models has been conducted mainly by frontier AI labs. Given the scale and success of ML as a discipline, we believe that faculty and students can, with the right resources, make progress on a deeper understanding of LLMs and on principled approaches to evaluation, work that generally falls outside the faster research timelines of industry players under competitive pressure. Furthermore, the established practice of peer review, alongside code publication that enables reproducibility, can bring a more principled perspective to AI safety research. Yet this community lacks the access to frontier models and testing environments, the compute resources, and the dedicated software engineers needed to conduct high-quality research in this vital area.
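To make the benchmark-correlation concern in item 1 concrete, below is a minimal sketch of one way it can be quantified, assuming each benchmark can be summarized as a single score per model. The benchmark names and score values are hypothetical placeholders rather than real leaderboard data; consistently high pairwise correlations across many models would suggest that the benchmarks measure overlapping rather than independent capabilities.

```python
# Minimal sketch: quantifying how correlated a set of benchmarks is.
# All benchmark names and scores below are hypothetical placeholders.
import numpy as np

benchmarks = ["bench_A", "bench_B", "bench_C"]

# Rows are models, columns are benchmarks (accuracy-style scores in [0, 1]).
scores = np.array([
    [0.62, 0.58, 0.71],  # model 1
    [0.75, 0.70, 0.80],  # model 2
    [0.81, 0.79, 0.85],  # model 3
    [0.55, 0.51, 0.66],  # model 4
])

# Pairwise Pearson correlations between the benchmark columns.
corr = np.corrcoef(scores, rowvar=False)

for i in range(len(benchmarks)):
    for j in range(i + 1, len(benchmarks)):
        print(f"{benchmarks[i]} vs {benchmarks[j]}: r = {corr[i, j]:.2f}")
```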
Program Goals
The program aims to advance the science of AI safety by:
- Deepening our understanding of the safety properties of systems built with large language models (LLMs).
- Developing well-founded, concrete, implementable technical methods for testing and evaluating LLMs.
- Understanding the relationships between and implications of various testing methodologies.
- Growing the technical AI safety academic community, especially globally.
- Ensuring that safety theory is informed by practice.
Program Approach
This program will pursue the following main activities:
- Funding for discrete research projects around the world that are aligned with a targeted research agenda, awarded both directly and through competitive open calls, with a focus on under-resourced early- to mid-career faculty.
- Compute capacity for academic labs.
- Convenings to build the global AI safety research community, showcase the research of our grantees, and support frontier companies in sharing safety problems they are encountering.
Program Research Agenda
Our research will focus on state-of-the-art LLMs, on agentic and multi-agent AI (i.e., AI that can act in the real world), and on multi-modal settings (including speech, code, video, still images, and drawings).
We are interested in supporting scientific advances that can be broadly applied to safety criteria and testing methodologies for large classes of models, and that would remain applicable even in the face of rapid technological advancement.
Read our full research agenda via the link below.
Advisory Board
Percy Liang
Percy is an Associate Professor of Computer Science at Stanford University and the director of the Center for Research on Foundation Models. He is currently focused on making foundation models (in particular, language models) more accessible through open source and more understandable through rigorous benchmarking. In the past, he has worked on many topics centered on machine learning and natural language processing, including robustness, interpretability, human interaction, learning theory, grounding, semantics, and reasoning. His awards include the Presidential Early Career Award for Scientists and Engineers, the IJCAI Computers and Thought Award, an NSF CAREER Award, a Sloan Research Fellowship, a Microsoft Research Faculty Fellowship, and paper awards at ACL, EMNLP, ICML, COLT, ISMIR, CHI, UIST, and RSS. Percy holds a BS from MIT and a PhD from UC Berkeley.
Yonadav Shavit
Yonadav is a member of the policy staff at OpenAI, focused on frontier AI risks. He previously received his PhD in Computer Science from Harvard University, where he researched technical assurance mechanisms for accountable frontier AI development and deployment. Prior to that, Yonadav was an associate on the Plaintext Group at Schmidt Futures, where he worked on AI policy and grantmaking. Yonadav also holds a BS and MEng in electrical engineering and computer science from MIT.
Ajeya Cotra
Ajeya is a Senior Program Officer at Open Philanthropy focused on risks from advanced AI. She analyzes when extremely powerful AI might be developed and what risks that might pose, and runs targeted funding programs to support empirical ML research that could shed light on these questions (recently wrapping up a $25M funding program on benchmarks to measure AI agents’ capabilities). She also speaks about AI futurism and AI risk for a popular audience. She’s been quoted in NYT and Vox, appeared on podcasts including Hard Fork and Freakonomics, was a speaker at CODE 2023, and runs the publication Planned Obsolescence.
JueYan Zhang
JueYan is a philanthropic advisor and grantmaker. He runs the AI Safety Tactical Opportunities Fund (AISTOF), a pooled multi-donor charitable fund that seeks to reduce the probability and severity of catastrophic risks from advanced AI. Since its inception in September 2023, the fund has raised more than $10 million and disbursed more than $7 million. JueYan serves on the boards of Family Empowerment Media and Suvita. He also advises Ambitious Impact and is a speculation grantor for the Survival and Flourishing Fund. Previously, JueYan spent a decade earning to give as a hedge fund manager. JueYan holds a BS in business administration and a BA in statistics from UC Berkeley.
Mark Greaves
Mark is the Executive Director of the AI Institute at Schmidt Sciences. Prior to Schmidt Sciences, Mark was a senior leader in AI and data analytics within the National Security Directorate at Pacific Northwest National Laboratory. Before that, Mark was Director of Knowledge Systems at Vulcan Inc., the private asset management company for Paul Allen, where he led global research teams in question answering over textbooks, large knowledge graphs, the semantic web, and crowdsourcing. Mark was Director of DARPA’s Joint Logistics Technology Office and a Program Manager in DARPA’s Information Exploitation Office. Mark was awarded the Office of the Secretary of Defense Medal for Exceptional Public Service for his contributions to US national security. He holds a BA in cognitive science from Amherst College, an MS in computer science from UCLA, and a PhD in philosophy from Stanford University.
Questions and Comments
We are interested in receiving feedback on our program, input on our research agenda, and ideas for research projects that would advance this idea. Please email us at aisafety@schmidtsciences.org.