Several Pittsburgh-based genetic research organizations have publicly released an open-source software tool that aims to make researchers' work easier when handling massive amounts of genomic data across disparate data sources.
The three participants in the software development project, called TCGA Expedition, are the University of Pittsburgh, the UPMC health system and the Pittsburgh Supercomputing Center. The 30-year-old center is a collaboration between Carnegie Mellon University and the University of Pittsburgh.
The genetic database known as the Cancer Genome Atlas is, for now, the focus of the Pittsburgh developers' attention. It is a joint project of the National Cancer Institute and the National Human Genome Research Institute at the National Institutes of Health.
So far, TCGA has amassed 2.5 petabytes of data describing both normal and cancerous tissues from 11,000 patients. The data has been used by independent researchers in more than a thousand cancer studies. Their work has resulted in mapping of genomic changes in 33 types of cancer, according to the NIH.
The Pittsburgh group's software is available for downloading free of charge at GitHub, a hosting site for open-source software development projects.
Medical records used in genomics research contain a lot of text-based information, which is more difficult for a computer to analyze than discrete data elements, project leader Dr. Rebecca Jacobson said. Jacobson is a professor of biomedical informatics at the University of Pittsburgh and chief information officer of the Institute for Precision Medicine at the University of Pittsburgh School of Medicine.
Thus, part of the TCGA Expedition tool set is software for natural language processing of text. It was developed under an earlier project called the Text Information Extraction System with funding from the now-defunct caBIG cancer research project, Jacobson said.
There are many different databases for genomic research in the U.S. and elsewhere, but “the lack of a common data model, lack of interoperability between portals (to access those databases) and lack of programmatic access to the millions of data files produces significant limitations,” according to a write-up of the TCGA Expedition project by Jacobson and her colleagues published in PLOS.
For example, it isn't uncommon for the same query submitted through different portals to produce different results. In addition, “the typical investigator will not have the computing infrastructure required to download, store, manage and analyze” these enormous data files, they said.
“Everyone is running into the same problem, all over the world,” Jacobson said. The Pittsburgh project may try later to adapt its software to work with many of these different databases, but initially, its focus is on the TCGA data trove.
“That was what our researchers really wanted to use, but didn't have the data infrastructure,” Jacobson said. “The really important thing was being able to deal with, acquire, clean, harmonize and manage this huge data set.”
The software project uses the Creative Commons licensing scheme and doesn't require developers who use and add to the code to contribute those additions back free of copyright restrictions. But, Jacobson said, if other developers want to improve the software code and donate it to the project, “absolutely we'll be happy to include anything that we think will be an advantage to the community.”