
Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 60,222 bioRxiv papers from 267,720 authors.

Integrating long-range connectivity information into de Bruijn graphs

By Isaac Turner, Kiran V. Garimella, Zamin Iqbal, Gil McVean

Posted 08 Jun 2017
bioRxiv DOI: 10.1101/147777 (published DOI: 10.1093/bioinformatics/bty157)

Motivation: The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long-range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input.

Results: We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long-range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn Graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both the de Bruijn graph and the String Graph Assembler (SGA). Finally, we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterise the genomic context of drug-resistance genes.

Availability: Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, available under the MIT license at https://github.com/mcveanlab/mccortex.
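
The idea in the abstract can be illustrated with a toy sketch. The Python below is not McCortex code and does not reproduce its link format; the function names (build_dbg, build_links, walk), the rule of attaching a link only to a read's first k-mer, and the example reads are all assumptions made for illustration. It builds a plain de Bruijn graph from two reads that share a repeat, shows that a walk stalls at the resulting fork, and then shows how a per-read annotation of fork choices lets the walk recover the original read.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all k-mers of seq in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_dbg(reads, k):
    """Plain de Bruijn graph: k-mer -> set of successor k-mers seen in reads."""
    graph = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            graph[a].add(b)
    return graph


def build_links(reads, graph, k):
    """Toy 'links' (an assumed simplification, not the McCortex format):
    annotate each read's first k-mer with the bases the read chose
    at every fork (node with more than one successor) along its path."""
    links = defaultdict(set)
    for read in reads:
        ks = kmers(read, k)
        choices = [ks[i + 1][-1]
                   for i in range(len(ks) - 1)
                   if len(graph.get(ks[i], ())) > 1]
        if choices:
            links[ks[0]].add("".join(choices))
    return links


def walk(start, graph, links, max_steps=1000):
    """Extend a sequence from start. Forks are resolved only if start
    carries exactly one link; otherwise the walk stops at the fork."""
    start_links = links.get(start, set())
    pending = list(next(iter(start_links))) if len(start_links) == 1 else []
    path, node = start, start
    for _ in range(max_steps):  # guard against cycles in the graph
        succ = graph.get(node, set())
        if len(succ) == 1:
            nxt = next(iter(succ))
        elif len(succ) > 1 and pending:
            nxt = node[1:] + pending.pop(0)
        else:
            break
        path += nxt[-1]
        node = nxt
    return path


# Two reads sharing the repeat "TCGGA"; with k=3 the graph forks at "GGA".
reads = ["ATCGGAC", "TTCGGAG"]
graph = build_dbg(reads, k=3)
links = build_links(reads, graph, k=3)

print(walk("ATC", graph, {}))      # ATCGGA  (stalls at the fork)
print(walk("ATC", graph, links))   # ATCGGAC (link resolves the fork)
print(walk("TTC", graph, links))   # TTCGGAG
```

In this sketch the graph alone cannot say which branch after "GGA" belongs with "ATC"; the stored fork choice is what carries that long-range information. A real implementation has to decide which k-mers carry links, handle both strands and cope with sequencing errors, none of which is attempted here.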

Download data

  • Downloaded 1,138 times
  • Download rankings, all-time:
    • Site-wide: 5,512 out of 60,222
    • In bioinformatics: 1,056 out of 6,078
  • Year to date:
    • Site-wide: 23,716 out of 60,222
  • Since beginning of last month:
    • Site-wide: 18,837 out of 60,222
