Event Details

Biostatistics Seminar Series | "Bayesian nonparametric prediction of the taxonomic affiliation of DNA sequences"

Predicting the taxonomic affiliation of DNA sequences collected from biological samples is a fundamental step in biodiversity assessment. This task is performed by leveraging existing databases of reference DNA sequences annotated with a taxonomic identification. However, environmental sequences may come from organisms that are either unknown to science or have no reference sequences available. Classification must therefore account for the possibility that a sequence is taxonomically novel. Professor Tommaso Rigon proposes Bayesian nonparametric taxonomic classifiers, called BayesANT, which use species sampling model priors to allow unobserved taxa to be discovered at each taxonomic rank. Using a simple product multinomial likelihood with conjugate Dirichlet priors at the lowest rank, a highly flexible supervised algorithm is developed to provide a probabilistic prediction of the taxonomic placement of each sequence at each rank. The algorithm is applied to a carefully annotated library of Finnish arthropods (FinBOL). To assess the ability of BayesANT to recognize novelty and to predict known taxonomic affiliations correctly, Professor Rigon tests it on two training-test splitting scenarios, each with a different proportion of taxa unobserved in training. The algorithm attains accurate predictions and reliably quantifies classification uncertainty, especially when many sequences in the test set are affiliated with taxa unknown in training.
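
As a rough illustration of the idea in the abstract, the sketch below scores a query sequence against each known taxon with a product multinomial likelihood under conjugate Dirichlet priors, and reserves probability mass for a previously unseen taxon. It is not the BayesANT implementation. It assumes a single taxonomic rank, aligned sequences of equal length over the alphabet ACGT, a symmetric Dirichlet hyperparameter `alpha`, and a Dirichlet process prior with concentration `theta` as one simple example of a species sampling model. The names `fit_counts`, `predict`, and `NEW_TAXON` are hypothetical.

```python
import math

NUCLEOTIDES = "ACGT"
NEW_TAXON = "<new taxon>"  # hypothetical label for an unobserved taxon


def fit_counts(train):
    """Tally nucleotide counts per position for each taxon.

    train: list of (taxon, sequence) pairs; sequences are assumed aligned,
    of equal length, and restricted to the letters A, C, G, T.
    """
    counts, sizes = {}, {}
    for taxon, seq in train:
        if taxon not in counts:
            counts[taxon] = [dict.fromkeys(NUCLEOTIDES, 0) for _ in seq]
            sizes[taxon] = 0
        for pos, base in enumerate(seq):
            counts[taxon][pos][base] += 1
        sizes[taxon] += 1
    return counts, sizes


def predict(seq, counts, sizes, alpha=0.5, theta=1.0):
    """Return posterior probabilities over known taxa plus a new taxon."""
    total = sum(sizes.values())
    log_scores = {}
    for taxon, n in sizes.items():
        # Dirichlet process prior: known taxon weighted by its training size.
        log_prior = math.log(n / (total + theta))
        # Dirichlet-multinomial posterior predictive, independent across positions.
        log_lik = sum(
            math.log((counts[taxon][pos][base] + alpha) / (n + 4 * alpha))
            for pos, base in enumerate(seq)
        )
        log_scores[taxon] = log_prior + log_lik
    # A new taxon has no counts, so each position has predictive probability 1/4.
    log_scores[NEW_TAXON] = (
        math.log(theta / (total + theta)) + len(seq) * math.log(0.25)
    )
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


train = [("A_sp", "ACGT"), ("A_sp", "ACGT"), ("B_sp", "TTTT"), ("B_sp", "TTTA")]
counts, sizes = fit_counts(train)
known = predict("ACGT", counts, sizes)   # matches the A_sp references
novel = predict("GGCC", counts, sizes)   # resembles neither known taxon
```

In this toy example the first query is assigned mostly to `A_sp`, while the second, which matches no reference at any position, receives most of its probability on the new-taxon label. Larger values of `theta` shift more prior mass toward novelty.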