Samir Bhatt
| Program | Schmidt Science Polymaths |
| School | University of Copenhagen |
| Field of Study | Machine Learning and Public Health |
Samir Bhatt developed an algorithm that dramatically simplifies the mathematically complex task of charting how diseases and languages evolve, and he did it “for the joy of doing it.”
How many ways can you draw a family tree connecting 30 people? The answer is not a handful of sketches on paper: the possibilities are too numerous to count.
“The number of ways to connect just 30 people is larger than the number of grains of sand on Earth,” says Dr. Samir Bhatt, a Schmidt Science Polymath and professor at the University of Copenhagen.
This is the mathematical reality facing scientists who build phylogenetic trees, diagrams that map the evolutionary history of species, populations, or strains of microbes. These trees make it possible, for example, to determine when mammals split from their ancestors or to track the emergence of viral variants during the COVID-19 pandemic. But finding the most accurate tree can mean searching through an unfathomably large number of options.
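For a sense of scale, the count can be worked out directly. The sketch below assumes the standard textbook setting of rooted, binary trees with labeled tips, where the number of distinct trees on n tips is the double factorial (2n − 3)!!, the product of the odd numbers up to 2n − 3. The function name is illustrative and not part of any published software.

```python
from math import prod

def count_rooted_binary_trees(n_leaves: int) -> int:
    """Number of rooted binary trees with n labeled leaves: (2n - 3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    # Product of the odd numbers 1, 3, ..., 2n - 3 (the empty product is 1).
    return prod(range(1, 2 * n_leaves - 2, 2))

for n in (3, 4, 10, 30):
    print(n, count_rooted_binary_trees(n))
```

Under these assumptions, 3 tips give 3 trees, 10 tips give 34,459,425, and 30 tips give roughly 4.95 × 10^38, far beyond the often-quoted rough estimates for grains of sand on Earth (on the order of 10^18 to 10^19).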
Bhatt and his team have developed an algorithm called Phylo2Vec to help researchers navigate this vast space. Rather than representing a tree as a string of nested parentheses, the conventional format, Phylo2Vec translates it into a compact sequence of integers. These sequences take up one-sixth to one-eighth of the storage space, and they simplify comparisons: two sequences that differ in a single number correspond to distinct but neighboring trees.
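To make the idea of describing a tree with integers concrete, here is a toy encoding in Python. It is a simplified stand-in, not Phylo2Vec's actual rules or API: the functions `decode` and `all_vectors` and the node-numbering convention are assumptions made for illustration. Entry i of the vector picks which existing branch leaf i + 1 is grafted onto, and the result is printed in the conventional nested-parentheses (Newick) form.

```python
from itertools import product

def newick(node, children):
    """Canonical nested-parentheses string for the subtree under `node`."""
    if node not in children:          # leaves have no entry in `children`
        return str(node)
    # Sorting the child strings makes the output independent of child order.
    parts = sorted(newick(c, children) for c in children[node])
    return "(" + ",".join(parts) + ")"

def decode(v):
    """Turn an integer vector into a rooted binary tree (toy scheme)."""
    for i, x in enumerate(v):
        if not 0 <= x <= 2 * i:
            raise ValueError(f"entry {i} must lie in 0..{2 * i}, got {x}")
    n = len(v) + 1                    # number of leaves, labeled 0..n-1
    children = {}                     # internal node id -> (child, child)
    parent = {0: None}                # node id -> parent id (None at the root)
    nodes = [0]                       # all nodes, in order of creation
    root = 0
    for i, x in enumerate(v):
        target = nodes[x]             # node whose incoming branch is split
        new_leaf, new_internal = i + 1, n + i
        up = parent[target]
        children[new_internal] = (target, new_leaf)
        parent[new_internal] = up
        parent[target] = parent[new_leaf] = new_internal
        if up is None:
            root = new_internal       # grafted above the old root
        else:
            a, b = children[up]
            children[up] = (new_internal, b) if a == target else (a, new_internal)
        nodes += [new_leaf, new_internal]
    return newick(root, children) + ";"

def all_vectors(n_leaves):
    """Every valid vector for a tree with `n_leaves` leaves."""
    ranges = [range(2 * i + 1) for i in range(n_leaves - 1)]
    return [list(v) for v in product(*ranges)]

print(decode([0, 0]))   # ((0,2),1);
print(decode([0, 1]))   # ((1,2),0);
print(len({decode(v) for v in all_vectors(4)}))   # 15
```

In this toy scheme, changing a single entry, as in `[0, 0]` versus `[0, 1]`, yields a different tree, and every valid vector maps to exactly one tree, so a tree on n leaves is fully described by n − 1 small integers.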
Bhatt developed the mathematics with his group, then partnered with engineers at the University of Washington’s Scientific Software Engineering Center, part of Schmidt Sciences’ Virtual Institute for Scientific Software. “They sped up our code and did many clever tricks under the hood,” he says. “When we got the software out, we felt reassured it was done properly.”
Mathematics moves slowly, but the method is already gaining traction. Since the paper’s publication last summer, other researchers have proposed variations on the integer system and a framework unifying them.
Bhatt’s interest in phylogenetics grew out of his work tracking infectious diseases, particularly COVID-19. During an outbreak, trees help scientists assess how much a pathogen has mutated, how fast it’s spreading, and where it likely originated—factors that shape the public health response. “How do you know that an mpox strain found a month ago is new or not?” Bhatt says. “You’re going to have to build a tree.”
In Bhatt’s view, Phylo2Vec is a first step. His team is now building mathematical models for how trees develop over time, with the ultimate goal of reconstructing evolutionary history more accurately.
The Polymaths program has given him room to follow his curiosity. “It was the first bit of funding that really allowed me to be me,” Bhatt says. “We created Phylo2Vec for the joy of doing it, and for the promise it could have in this field.”