The Forensic Microbiome Database (FMD) is a human microbiome analysis resource that correlates publicly available 16S rRNA sequence data, obtained from multiple body sites and irrespective of sequencing platform or variable region sequenced, with forensically relevant metadata. Using the relative abundances of bacterial taxa within a sample, machine learning is used to predict the geographic location from which that sample came (https://www.nature.com/articles/s41598-018-27100-1). The more comprehensive and diverse the data within the FMD, the more accurate the predictions are expected to become. These predictions could potentially be used to determine the geographical origin of a human trafficking victim or as evidence at a crime scene to narrow down a search radius for a person of interest.
This study involves the collection and processing of human microbiome samples from five regions around the world—Chile, Barbados, Hong Kong, and two sites in South Africa. Paired oral and stool samples will be collected from healthy females between the ages of 18 and 26. Samples will be immediately frozen and sent to the J. Craig Venter Institute’s (JCVI) lab for sequencing and subsequent analysis.
The National Center for Biotechnology Information (NCBI) houses the Sequence Read Archive (SRA, formerly the Short Read Archive), which stores raw sequencing reads for thousands of scientific studies (Figure 1). Scientists deposit their sequencing reads in the SRA so that their studies can be reproduced and so that other studies can use the data to discover new insights. The FMD searches the SRA frequently for recently added studies to add to the database. Only a specific subset of the samples stored in the SRA is of interest: human-associated microbiome studies, that is, sequencing reads of bacterial 16S rRNA from a human body site with sufficient metadata associated with the study. Using 16S rRNA data allows the relative abundances of the microbial community at that body site to be determined. SRA records contain information about individual samples, mainly how they were prepared and sequenced.
Within NCBI, multiple samples from the same study are incorporated into an umbrella category termed the BioProject (Figure 2). This BioProject describes the study and allows users to view all SRA samples contained within.
Search terms like “human microbiome 16S” are used to query the databases to find these types of studies. When a sample that meets the requirements listed above is identified, it must also meet certain criteria, dictated by the intended function of the FMD, to merit further analysis. At a minimum, the study must have metadata that includes the geographic location (at city level) where the sample was collected, and a published manuscript describing the study. Other information, like age and gender of the human host, is beneficial and is stored in the database, but is not required for a sample’s inclusion. The metadata is mostly collected from the website from which the study was downloaded, though it can also be obtained from alternative sources, such as the published manuscripts. Currently only healthy human samples are included in the public database and its analysis tools, but unhealthy samples are also being collected for future inclusion.
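The inclusion criteria above can be sketched as a simple check. This is only an illustration: the record type, its field names, and the function names are hypothetical and are not taken from the FMD schema or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StudySample:
    """Hypothetical sample record; field names are illustrative only."""
    sample_id: str
    city: Optional[str]           # city-level collection location (required)
    publication: Optional[str]    # citation or DOI of the manuscript (required)
    healthy: bool                 # health status of the human host
    age: Optional[int] = None     # beneficial but optional
    gender: Optional[str] = None  # beneficial but optional


def is_eligible(sample: StudySample) -> bool:
    """True if the minimum metadata (city and publication) is present."""
    has_city = bool(sample.city and sample.city.strip())
    has_publication = bool(sample.publication and sample.publication.strip())
    return has_city and has_publication


def in_public_database(sample: StudySample) -> bool:
    """Eligible samples are public only if the host is healthy."""
    return is_eligible(sample) and sample.healthy
```

Age and gender never affect eligibility in this sketch, which mirrors the statement that they are stored when available but not required.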
In addition to NCBI, samples are collected from other sequence databases, such as EBI (https://academic.oup.com/nar/article/45/W1/W545/3787865) and MG-RAST (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-386), which have a similar organizational structure.
When a study is identified, the associated raw sequence reads are downloaded for all samples intended to be analyzed. For example, the SRA Toolkit (https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft) allows for easy fetching of SRA files from NCBI, which are then converted into FASTQ format. A FASTQ file contains both the nucleotide sequence of each read generated by a sequencing instrument and the associated quality score for each nucleotide. FASTQ files are preferred, but occasionally only FASTA files, which lack quality scores, are available and are used instead. These files are used for subsequent analysis.
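As a rough sketch of the download step, the following Python wrapper builds and runs two SRA Toolkit commands, `prefetch` and `fasterq-dump`. The text above does not say which toolkit programs or options the FMD actually uses, so the choice of commands, the `--split-files` option, the function names, and the output directory are all assumptions made for illustration.

```python
import re
import subprocess
from typing import List

# Run accessions from NCBI, EBI, and DDBJ start with SRR, ERR, or DRR.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def build_sra_commands(accession: str, outdir: str) -> List[List[str]]:
    """Return the command lines that fetch one run and convert it to FASTQ."""
    if not RUN_ACCESSION.match(accession):
        raise ValueError(f"not a run accession: {accession!r}")
    return [
        # Download the .sra file for the run.
        ["prefetch", accession],
        # Convert to FASTQ; paired reads go to separate files.
        ["fasterq-dump", accession, "--outdir", outdir, "--split-files"],
    ]


def fetch_fastq(accession: str, outdir: str) -> None:
    """Run the commands in order; requires the SRA Toolkit on PATH."""
    for cmd in build_sra_commands(accession, outdir):
        subprocess.run(cmd, check=True)
```

Calling `fetch_fastq("SRR000001", "fastq_out")` (the accession here is only a placeholder) would run both steps and stop with an exception if either command fails.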
There are several tools available for analyzing 16S rRNA data. The FMD uses an internal pipeline written at JCVI, based upon UPARSE (for reasons listed in the Frequently Asked Questions), to cluster reads into Operational Taxonomic Units (OTUs). Sequences within an OTU are at least 97% identical; this threshold tolerates slight variation in percent identity arising from several potential sources of error. The pipeline outputs an OTU table, a matrix giving the number of reads per sample per OTU. Mothur is then used to map OTUs to taxonomy using a reference database; the FMD uses the SILVA 16S rRNA database for this purpose.
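To make the OTU table concrete, here is a minimal sketch that represents it as a nested dictionary (reads per sample per OTU) and converts the counts to relative abundances. The data layout, the function name, and the toy counts are illustrative assumptions, not the format produced by the JCVI pipeline.

```python
from typing import Dict

# An OTU table as {sample: {otu: read_count}}.
OtuTable = Dict[str, Dict[str, int]]


def relative_abundance(otu_table: OtuTable) -> Dict[str, Dict[str, float]]:
    """Divide each count by the total number of reads in its sample."""
    result: Dict[str, Dict[str, float]] = {}
    for sample, counts in otu_table.items():
        total = sum(counts.values())
        result[sample] = {
            # A sample with no reads gets 0.0 everywhere instead of an error.
            otu: (n / total if total else 0.0)
            for otu, n in counts.items()
        }
    return result


# Toy table: two samples, two OTUs (made-up numbers).
toy_table: OtuTable = {
    "sample_A": {"OTU_1": 30, "OTU_2": 70},
    "sample_B": {"OTU_1": 5, "OTU_2": 15},
}
abundances = relative_abundance(toy_table)
```

Normalizing within each sample makes samples sequenced to different depths comparable, which is why relative abundances rather than raw counts are used for prediction.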
A series of R scripts, utilizing the phyloseq library (http://dx.plos.org/10.1371/journal.pone.0061217) among others, generates the final files uploaded into the database. These R scripts filter out samples that have fewer than 2,000 reads or have more than 85% of their reads in unclassified genera, and remove any OTU from the OTU table that is present in fewer than 10 samples in the run. OTU count tables for every taxonomic level (Kingdom, Genus, etc.) are generated by collapsing lower classification levels into the higher classification level (for example, counts for Species with the same Genus are summed to create the Genus count table). These tables, along with files that contain metadata (age, geographic location, body site sampled, etc.) about each sample, are ultimately entered into the database for querying and for use in the tools on the FMD website. A list of these tools and how to use them can be found in the User Manual.
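The filtering and collapsing rules can be sketched as follows. The FMD implements them in R with phyloseq; this is a plain-Python illustration, not that code. The function names and data layout are hypothetical, and the order of operations (samples filtered first, then OTUs) is an assumption, since the text does not specify it.

```python
from collections import defaultdict
from typing import Dict, Optional

OtuTable = Dict[str, Dict[str, int]]   # {sample: {otu: read_count}}
GenusMap = Dict[str, Optional[str]]    # {otu: genus, or None if unclassified}

MIN_READS = 2000                  # drop samples with fewer reads than this
MAX_UNCLASSIFIED_FRACTION = 0.85  # drop samples above this unclassified share
MIN_SAMPLES_PER_OTU = 10          # drop OTUs seen in fewer samples than this


def filter_samples(table: OtuTable, otu_genus: GenusMap) -> OtuTable:
    """Keep samples with enough reads and not too many unclassified reads."""
    kept: OtuTable = {}
    for sample, counts in table.items():
        total = sum(counts.values())
        if total < MIN_READS:
            continue
        unclassified = sum(
            n for otu, n in counts.items() if otu_genus.get(otu) is None
        )
        if unclassified / total > MAX_UNCLASSIFIED_FRACTION:
            continue
        kept[sample] = dict(counts)
    return kept


def filter_otus(table: OtuTable, min_samples: int = MIN_SAMPLES_PER_OTU) -> OtuTable:
    """Remove OTUs with a nonzero count in fewer than min_samples samples."""
    presence: Dict[str, int] = defaultdict(int)
    for counts in table.values():
        for otu, n in counts.items():
            if n > 0:
                presence[otu] += 1
    return {
        sample: {o: n for o, n in counts.items() if presence[o] >= min_samples}
        for sample, counts in table.items()
    }


def collapse_to_genus(table: OtuTable, otu_genus: GenusMap) -> OtuTable:
    """Sum the counts of all OTUs that share a genus, per sample."""
    collapsed: OtuTable = {}
    for sample, counts in table.items():
        genus_counts: Dict[str, int] = defaultdict(int)
        for otu, n in counts.items():
            genus_counts[otu_genus.get(otu) or "unclassified"] += n
        collapsed[sample] = dict(genus_counts)
    return collapsed
```

The same summing step would apply at each higher taxonomic level, with the OTU-to-genus map replaced by a map to that level.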