From a purple dot in Wuhan, China, lines start to emerge, reaching the UK, Thailand, and Australia.
New colors appear, representing genetic changes that arise in the SARS-CoV-2 coronavirus as it spreads, infects large groups of people, and mutates. Soon, color-coded lines stretch from continent to continent, tracing clusters of genetically similar SARS-CoV-2 viruses from individuals in different parts of the world. A multicolored web forms across the globe.
The maps, featured on the website for the open-source pathogen genome project Nextstrain, are growing by the day and include data for thousands of publicly available SARS-CoV-2 genomes.
Emma Hodcroft, a Nextstrain co-developer and postdoctoral researcher in molecular life sciences at the University of Basel and the Swiss Institute of Bioinformatics, explained that the viral movements illustrated on the site are "generally accurate." She cautioned, though, that the inferences are limited by where viral genome sequencing is taking place, as well as by the quality of the genomes submitted to repositories such as GISAID or GenBank.
"If we don't have sequences from a country, it won't be included on that map — and we definitely don't have sequences from every country," she said, noting that this will inevitably give "a bit of a biased view of what happened," even before taking sequencing errors into account.
At the University of Florida at Gainesville's Emerging Pathogens Institute, virologist Carla Mavian is part of another team tracking the emergence of new SARS-CoV-2 genomes. She, too, cautioned against over-interpreting data from small sets of SARS-CoV-2 genomes or incomplete genome sequences, in part because so few differences exist between the coronavirus clusters.
In low-coverage sequences, Mavian explained, "there are a lot of unknown regions that aren't contributing at all if you do a phylogenetic tree, and are actually adding noise."
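As a rough illustration of the point about unknown regions, the sketch below screens out genomes with too many undetermined positions before any tree is built. It is a minimal example under stated assumptions, not the method used by Mavian's team or by Nextstrain: the function names, the sample sequences, and the 5 percent cutoff are all hypothetical, and any character other than A, C, G, or T (including N and gaps) is counted as unknown.

```python
from typing import Dict


def ambiguous_fraction(seq: str) -> float:
    """Return the fraction of positions that are not an unambiguous A, C, G, or T."""
    if not seq:
        # An empty sequence carries no information, so treat it as fully unknown.
        return 1.0
    known = sum(seq.upper().count(base) for base in "ACGT")
    return 1.0 - known / len(seq)


def filter_low_coverage(genomes: Dict[str, str], max_ambiguous: float = 0.05) -> Dict[str, str]:
    """Keep only genomes whose unknown fraction is at or below the (hypothetical) cutoff."""
    return {
        name: seq
        for name, seq in genomes.items()
        if ambiguous_fraction(seq) <= max_ambiguous
    }


# Toy sequences invented for illustration; real genomes are roughly 30,000 bases long.
genomes = {
    "sample_A": "ACGTACGTACGTACGTACGT",  # fully resolved
    "sample_B": "ACGTNNNNNNNNACGTACGT",  # 8 of 20 positions unknown
}

kept = filter_low_coverage(genomes)
print(sorted(kept))
```

In practice the cutoff is a judgment call: a stricter threshold removes more noise but also discards sequences from places that may already be thinly sampled.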
For a preprint posted to bioRxiv earlier this month, she and her co-authors delved into the phylogenetic and phylogeographic information that could be gleaned from more than 2,600 SARS-CoV-2 genomes available from 55 countries as of the end of March.
They warned that while the "number of available full genomes is growing daily, and the full dataset contains sufficient phylogenetic information that would allow reliable inference of phylogenetic relationships, country-specific SARS-CoV-2 datasets still present severe limitations" and called for "continuing concerted efforts to increase [the] number and quality of the sequences required for robust tracing of the epidemic."
Mavian, who is originally from Italy, said she is eager to see more SARS-CoV-2 sequences from that country and from other places that have been hit hard by COVID-19.
"Imagine how the picture would change if we had homogenous sampling from Italy, which was one of the first countries in Europe that had a spike of cases? We would see a totally different tree, at least as far as Europe is concerned, because who knows how many transmissions have occurred from Italy to other countries," she said.