Compression of short-read sequences using path encoding

By Carl Kingsford, Rob Patro

Posted 24 Jun 2014
bioRxiv DOI: 10.1101/006551 (published DOI: 10.1093/bioinformatics/btv071)

Storing, transmitting, and archiving the volume of data produced by next-generation sequencing is becoming a significant computational burden. For example, large-scale RNA-seq meta-analyses may now routinely process tens of terabytes of sequence. We present here an approach to biological sequence compression that reduces the difficulty of managing the data produced by large-scale transcriptome sequencing. Our approach offers a new direction by sitting between pure reference-based compression and reference-free compression, combining much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs (a common task in genome assembly) and context-dependent arithmetic coding. Supporting this method is a system, called a bit tree, that compactly stores sets of k-mers and is of independent interest. Using these techniques, we are able to encode RNA-seq reads using 3% to 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than recent competing approaches. We also show that good compression can still be achieved even if the reference is very poorly matched to the reads being encoded.
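
The connection between de Bruijn graph paths and context-dependent coding can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the context length `K = 4`, and the uniform pseudo-counts are all assumptions made for illustration. A read is treated as a walk that starts at its first k-mer and follows one edge per remaining base. Each base is then costed under an adaptive model conditioned on the preceding k-mer, which approximates the number of bits an arithmetic coder driven by that model would emit. Storage of the starting k-mers (the role the abstract assigns to the bit tree) is not modeled here.

```python
import math
from collections import defaultdict

K = 4              # context length; hypothetical value, kept small for illustration
ALPHABET = "ACGT"  # assumes reads contain only these four bases

def read_to_path(read, k=K):
    """Split a read into its starting k-mer and the edge labels that follow it."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return read[:k], read[k:]

def path_to_read(start, steps):
    """Rebuild the read from the starting k-mer and the edge labels."""
    return start + steps

def estimated_bits(reads, k=K):
    """Sum of -log2(p) over all edge labels under an adaptive k-mer context model.

    Every context starts with a count of 1 per base (an assumed uniform prior),
    and counts are updated after each base, as an adaptive coder would do.
    """
    counts = defaultdict(lambda: {b: 1 for b in ALPHABET})
    total = 0.0
    for read in reads:
        context, steps = read_to_path(read, k)
        for base in steps:
            ctx = counts[context]
            p = ctx[base] / sum(ctx.values())
            total += -math.log2(p)
            ctx[base] += 1                 # adapt the model to what was just seen
            context = context[1:] + base   # slide to the next node on the path
    return total
```

In this sketch, a base seen for the first time in a given context costs 2 bits, and the cost falls as the same transition recurs, which is why highly repetitive transcriptome reads are cheap to encode under such a model.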

Download data

  • Downloaded 679 times
  • Download rankings, all-time:
    • Site-wide: 20,782 out of 89,266
    • In bioinformatics: 3,011 out of 8,426
  • Year to date:
    • Site-wide: 85,271 out of 89,266
  • Since beginning of last month:
    • Site-wide: 83,673 out of 89,266

