Yijun Sun, Yunpeng Cai, Li Liu, Fahong Yu, Michael L. Farrell, William McKendree, ESPRIT: estimating species richness using large
Nucleic Acids Research, 2009, Vol. 37, No. 10 e76, 2009-04-15
Abstract: Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and that deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundreds of thousands of reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm.
Metagenomics allows researchers to study genetic materials recovered directly from environmental samples, bypassing the need for isolation and lab cultivation of individual species, and thus opens a new window to probe the hidden world of microbial communities. This technique has been successfully used in several 16S rRNA-based metagenomics analyses of various environments. For example, Sogin et al. (4) provided one of the first global in-depth descriptions of microbial diversities and their relative abundance in the ocean, and Keijser et al. (5) were among the first to study oral microbial populations. It has been shown that the microbial diversities are at least one order of magnitude larger than previously reported. These estimation results, however, were computed through extrapolation.
To obtain more accurate estimates, surveys that are several orders of magnitude larger than those reported in the literature may be required to uncover sequences from minor components (4,5). However, analyzing large collections of 16S ribosomal sequences poses a serious computational challenge for existing algorithms. In this article, we focus on taxonomy-independent analysis, where sequences are classified into operational taxonomic units (OTUs) of specified sequence variations, based on which various ecological metrics are estimated. Typically, sequences with