International Science Index


Towards End-To-End Disease Prediction from Raw Metagenomic Data

Abstract: Analysis of the human microbiome using metagenomic sequencing data has demonstrated a high ability to discriminate between various human diseases. Raw metagenomic sequencing data require multiple complex and computationally heavy bioinformatics steps prior to data analysis. Such data contain millions of short sequences read from fragmented DNA and stored as fastq files. Conventional processing pipelines consist of multiple steps, including quality control, filtering, and alignment of sequences against genomic catalogs (genes, species, taxonomic levels, functional pathways, etc.). These pipelines are complex to use, time-consuming, and rely on a large number of parameters that often introduce variability and affect the estimation of the microbiome elements. Training Deep Neural Networks directly from raw sequencing data is a promising approach to bypass some of the challenges associated with mainstream bioinformatics pipelines. Most of these methods use the concept of word and sentence embeddings to create a meaningful numerical representation of DNA sequences, while extracting features and reducing the dimensionality of the data. In this paper we present metagenome2vec, an end-to-end approach that classifies patients into disease groups directly from raw metagenomic reads. This approach is composed of four steps: (i) generating a vocabulary of k-mers and learning their numerical embeddings; (ii) learning DNA sequence (read) embeddings; (iii) identifying the genome from which each sequence is most likely to come; and (iv) training a multiple instance learning classifier that predicts the phenotype based on the vector representation of the raw data. An attention mechanism is applied in the network so that the model can be interpreted: it assigns each genome a weight reflecting its influence on the prediction.
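As an illustration of step (i), the vocabulary of k-mers can be sketched as follows. This is a minimal example under stated assumptions, not the metagenome2vec implementation: the function names `kmerize` and `build_vocabulary`, the stride of 1, and the toy reads are all hypothetical, and the embedding training itself is left out.

```python
from collections import Counter


def kmerize(read, k, stride=1):
    """Split a DNA read into overlapping k-mers (the 'words' of the read)."""
    read = read.upper()
    return [read[i:i + k] for i in range(0, len(read) - k + 1, stride)]


def build_vocabulary(reads, k):
    """Count k-mers over all reads and map each one to an integer index."""
    counts = Counter(kmer for read in reads for kmer in kmerize(read, k))
    # Sort by decreasing frequency, then alphabetically, so the index is deterministic.
    ordered = sorted(counts, key=lambda kmer: (-counts[kmer], kmer))
    return {kmer: idx for idx, kmer in enumerate(ordered)}


# Toy reads (hypothetical); real reads would be parsed from fastq files.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocabulary(reads, k=3)

# Each read becomes a sequence of integer tokens that an embedding model can consume.
encoded = [[vocab[kmer] for kmer in kmerize(read, 3)] for read in reads]
```

Treating each k-mer as a word and each read as a sentence is what lets word and sentence embedding methods be reused on DNA sequences.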
Using two public real-life datasets as well as a simulated one, we demonstrate that this original approach reaches high performance, comparable with state-of-the-art methods applied to data processed through mainstream bioinformatics workflows. These results are encouraging for this proof-of-concept work. We believe that, with further effort, DNN models have the potential to surpass mainstream bioinformatics workflows in disease classification tasks.
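The interpretability idea in step (iv), where each genome receives a weight for its influence on the prediction, can be sketched with a generic attention pooling over a bag of instance vectors. This is a hedged illustration only: the scoring form (a softmax over `w · tanh(V x)`), the function name `attention_pool`, and the hand-set parameters `V` and `w` are assumptions made for the example, whereas in a trained model these parameters would be learned and the exact architecture may differ.

```python
import math


def attention_pool(instances, V, w):
    """Pool a bag of instance vectors into one vector using attention weights.

    instances: list of n vectors of length d (for example, one embedding per genome).
    V: list of h rows of length d (hypothetical projection, normally learned).
    w: vector of length h (hypothetical scoring vector, normally learned).
    Returns the pooled vector and the per-instance attention weights.
    """
    scores = []
    for x in instances:
        hidden = [math.tanh(sum(v_j * x_j for v_j, x_j in zip(row, x))) for row in V]
        scores.append(sum(w_i * h_i for w_i, h_i in zip(w, hidden)))

    # Softmax over the bag, shifted by the maximum score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the instances gives a single bag-level representation.
    d = len(instances[0])
    pooled = [sum(a * x[j] for a, x in zip(weights, instances)) for j in range(d)]
    return pooled, weights


# Toy bag of three 2-dimensional instance embeddings (hypothetical values).
bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.5, -0.5], [0.25, 0.75]]
w = [1.0, -1.0]
pooled, weights = attention_pool(bag, V, w)
```

The weights sum to one across the bag, so they can be read as the relative contribution of each instance to the pooled representation that a downstream classifier would use.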