
AI Interpretability

Building the tools to better understand large language models.

Cutting-edge AI models often function as black boxes, solving complicated problems through a reasoning process that is hidden within billions of artificial neurons. This lack of interpretability fundamentally limits the trustworthiness of AI technologies, as it is difficult to validate the reasoning that produced any given AI output. 

This program aims to address that problem by supporting basic and applied research in AI interpretability. It focuses especially on deceptive behaviors in AI models, including excessive deference to users (sycophancy) and knowingly giving users harmful advice. These problems are appearing more frequently in frontier AI systems, which are trained with noisy human feedback, and they are an ideal target for interpretability tools. By inspecting what models are thinking, and not merely what they are saying, we will be better able to detect when AI is giving people untrustworthy information.

Through a research competition, the program will support researchers in academia and nonprofits working on interpretability. The competition will pit red teams against blue teams to answer the question: how can we detect when LLMs produce deceptive reasoning or communication? Its blinded design will force blue teams to show that interpretability tools work in practice, revealing deceptive behaviors that are not known in advance. Outputs of the competition will include research papers.

Check back at a later date for additional funding opportunities in 2026. Inquiries should be directed to [email protected].
