Rxivist logo

Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 73,710 bioRxiv papers from 320,843 authors.

An Efficient, Scalable and Exact Representation of High-Dimensional Color Information Enabled via de Bruijn Graph Search

By Fatemeh Almodaresi, Prashant Pandey, Michael Ferdman, Rob Johnson, Rob Patro

Posted 07 Nov 2018
bioRxiv DOI: 10.1101/464222

The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this paper, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes — patterns of color occurrence — present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e. samples or references) grows to thousands of experiments. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved more than 11x better compression compared to RRR. An implementation of the new MST-based color class encoding is written in C++17 and is available at https://github.com/splatlab/mantis.

Download data

  • Downloaded 733 times
  • Download rankings, all-time:
    • Site-wide: 14,163 out of 73,741
    • In bioinformatics: 2,259 out of 7,172
  • Year to date:
    • Site-wide: 26,399 out of 73,741
  • Since beginning of last month:
    • Site-wide: 26,399 out of 73,741

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide


Sign up for the Rxivist weekly newsletter! (Click here for more details.)