Towards End-To-End Disease Prediction from Raw Metagenomic Data
Abstract: Analysis of the human microbiome using metagenomic
sequencing data has shown a strong ability to discriminate between
various human diseases. Raw metagenomic sequencing data require
multiple complex and computationally heavy bioinformatics steps
prior to data analysis. Such data contain millions of short sequences
read from fragmented DNA and stored as FASTQ files.
Conventional processing pipelines consist of multiple steps, including
quality control, filtering, and alignment of sequences against genomic
catalogs (genes, species, taxonomic levels, functional pathways,
etc.). These pipelines are complex to use, time-consuming, and
rely on a large number of parameters that often introduce variability
and affect the estimation of the microbiome elements. Training
Deep Neural Networks directly from raw sequencing data is a
promising approach to bypass some of the challenges associated with
mainstream bioinformatics pipelines. Most of these methods use the
concept of word and sentence embeddings to create meaningful
numerical representations of DNA sequences, while extracting
features and reducing the dimensionality of the data. In this paper,
we present metagenome2vec, an end-to-end approach that classifies
patients into disease groups directly from raw metagenomic reads. This
approach is composed of four steps: (i) generating a vocabulary of
k-mers and learning their numerical embeddings; (ii) learning DNA
sequence (read) embeddings; (iii) identifying the genome from which
each sequence is most likely to come; and (iv) training a multiple
instance learning classifier that predicts the phenotype based on
the vector representation of the raw data. An attention mechanism
is applied in the network so that the model can be interpreted,
assigning each genome a weight that reflects its influence on the prediction.
Using two public real-life datasets as well as a simulated one, we
demonstrate that this original approach reaches high performance,
comparable with state-of-the-art methods applied to data
processed through mainstream bioinformatics workflows. These
results are encouraging for this proof-of-concept work. We believe
that, with further work, DNN models have the potential to
surpass mainstream bioinformatics workflows in disease classification.
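
As a rough illustration of steps (i) and (iv) above, the sketch below splits reads into overlapping k-mers to build a vocabulary, then pools a bag of per-genome vectors with a softmax attention so that each genome receives an interpretable weight. It is a minimal sketch under stated assumptions, not the metagenome2vec implementation: the toy reads, the genome vectors, the parameters V and w, the tanh-then-softmax scoring form, and all function names are hypothetical choices made for this example.

```python
import math
from collections import Counter


def kmers(read, k):
    """Split a DNA read into overlapping k-mers (stride 1)."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_vocabulary(reads, k):
    """Count every k-mer observed across a collection of reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def attention_pool(instances, V, w):
    """Pool a bag of instance vectors with softmax attention.

    Assumed scoring form: score_i = w . tanh(V h_i).
    Returns (attention weights, weighted sum of the instances).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    # Numerically stable softmax over the instance scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, instances))
              for d in range(dim)]
    return weights, pooled


# Toy data (hypothetical): two short reads and three genome vectors.
reads = ["ACGTAC", "CGTACG"]
vocab = build_vocabulary(reads, k=3)

genome_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, -1.0], [0.5, 0.5]]  # assumed attention projection
w = [1.0, 0.0]                 # assumed attention scoring vector
weights, bag_vector = attention_pool(genome_vectors, V, w)
```

In a trained model V and w would be learned jointly with the classifier; here they are fixed by hand only to show how the attention weights, which sum to one, rank the genomes in a bag.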