Rxivist logo

Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 73,470 bioRxiv papers from 319,827 authors.

Squeakr: An Exact and Approximate k-mer Counting System

By Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro

Posted 29 Mar 2017
bioRxiv DOI: 10.1101/122077 (published DOI: 10.1093/bioinformatics/btx636)

Motivation: k-mer-based algorithms have become increasingly popular in the processing of high-throughput sequencing (HTS) data. These algorithms span the gamut of the analysis pipeline from k-mer counting (e.g., for estimating assembly parameters), to error correction, genome and transcriptome assembly, and even transcript quantification. Yet, these tasks often use very different k-mer representations and data structures. In this paper, we set forth the fundamental operations for maintaining multisets of k-mers and classify existing systems from a data-structural perspective. We then show how to build a k-mer-counting and multiset-representation system using the counting quotient filter (CQF), a feature-rich approximate membership query (AMQ) data structure. We introduce the k-mer-counting/querying system Squeakr (Simple Quotient filter-based Exact and Approximate Kmer Representation), which is based on the CQF. This off-the shelf data structure turns out to be an efficient (approximate or exact) representation for sets or multisets of k-mers. Results: Squeakr takes 2X-4.3X less time than the state-of-the-art to count and perform a random-point-query workload. Squeakr is memory-efficient, consuming 1.5X-4.3X less memory than the state-of-the-art. It offers competitive counting performance, and answers point queries (i.e. queries for the abundance of a particular k-mer) over an order-of-magnitude faster than other systems. The Squeakr representation of the k-mer multiset turns out to be immediately useful for downstream processing (e.g., de Bruijn graph traversal) because it supports fast queries and dynamic k-mer insertion, deletion, and modification. Availability: https://github.com/splatlab/squeakr

Download data

  • Downloaded 1,536 times
  • Download rankings, all-time:
    • Site-wide: 4,280 out of 73,470
    • In bioinformatics: 833 out of 7,157
  • Year to date:
    • Site-wide: 50,677 out of 73,470
  • Since beginning of last month:
    • Site-wide: 50,677 out of 73,470

Altmetric data

Downloads over time

Distribution of downloads per paper, site-wide


Sign up for the Rxivist weekly newsletter! (Click here for more details.)