Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer Support Vector Machines Using Integrated Gradients

By Avanti Shrikumar, Eva Prakash, Anshul Kundaje

Posted 06 Nov 2018
bioRxiv DOI: 10.1101/457606 (published DOI: 10.1093/bioinformatics/btz322)

Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain.

Note: Avanti Shrikumar and Eva Prakash are co-first authors. Avanti Shrikumar and Anshul Kundaje are co-corresponding authors.
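For readers unfamiliar with Integrated Gradients, the general method the abstract names as the inspiration for gkmexplain, the sketch below may help. It is not the gkmexplain algorithm and does not involve a gkm-SVM. It assumes a toy logistic scorer over three numeric features, and every name in it (`model`, `model_gradient`, `integrated_gradients`, the weights and inputs) is hypothetical. Each feature's attribution is its difference from a baseline input multiplied by the model's gradient averaged along the straight path from baseline to input.

```python
import math


def model(x, weights, bias):
    """Toy differentiable scorer (an assumption for illustration, not a gkm-SVM)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def model_gradient(x, weights, bias):
    """Analytic gradient of the toy scorer with respect to each input feature."""
    y = model(x, weights, bias)
    return [y * (1.0 - y) * w for w in weights]


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        # Interpolate between the baseline (alpha = 0) and the input (alpha = 1).
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = model_gradient(point, weights, bias)
        for i in range(n):
            avg_grad[i] += grad[i] / steps
    # Scale the path-averaged gradient by each feature's change from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Hypothetical weights, input, and all-zero baseline.
weights = [2.0, -1.0, 0.5]
bias = -0.25
x = [1.0, 0.0, 1.0]
baseline = [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, weights, bias)

# Completeness check: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(
    sum(attributions) - (model(x, weights, bias) - model(baseline, weights, bias))
)
print(attributions, completeness_gap)
```

In this toy setup the second feature receives zero attribution because it does not differ from the baseline, and the attributions sum approximately to the change in model output between baseline and input.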

Download data

  • Downloaded 700 times
  • Download rankings, all-time:
    • Site-wide: 15,068 out of 73,481
    • In bioinformatics: 2,373 out of 7,157
  • Year to date:
    • Site-wide: 32,234 out of 73,481
  • Since beginning of last month:
    • Site-wide: 32,234 out of 73,481
