mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species

By Sergio Arredondo-Alonso, Malbert R. C. Rogers, Johanna C. Braat, T. D. Verschuuren, Janetta Top, Jukka Corander, Rob J.L. Willems, Anita C. Schurch

Posted 23 May 2018
bioRxiv DOI: 10.1101/329045 (published DOI: 10.1099/mgen.0.000224)

Assembly of bacterial short-read whole genome sequencing (WGS) data frequently results in hundreds of contigs for which the origin, plasmid or chromosome, is unclear. Long-read sequencing has emerged as a solution to resolve plasmid structures and to obtain complete genomes for most bacterial species. This information can be used to generate datasets of short-read contigs labelled as plasmid- or chromosome-derived. We investigated the use of several popular machine learning methods to classify short-read contigs of known plasmid or chromosome origin from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. Based on the resulting F1-scores, we selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium = 0.94, F1-score K. pneumoniae = 0.90, F1-score E. coli = 0.76). These models outperformed other existing plasmid tools on an independent set of isolates (precision E. faecium = 0.92, precision K. pneumoniae = 0.86, precision E. coli = 0.82). We demonstrated the scalability of our model by accurately predicting the plasmidome of a large collection of 1,644 E. faecium isolates for which only short-read WGS data were available, using a standard laptop with a single core. The low number of false-positive predictions suggests that the models' assignment of a particular gene of interest as plasmid- or chromosome-encoded is plausible. The SVM classifiers are publicly available as a new R package called mlplasmids at https://gitlab.com/sirarredondo/mlplasmids under the GNU General Public License v3.0. We additionally developed a graphical user interface using the Shiny package, which can be accessed at https://sarredondo.shinyapps.io/mlplasmids. Single genomes can easily be predicted by uploading genome assemblies. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
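
To make the feature representation and the evaluation metric in the abstract concrete, the following is a minimal Python sketch of how pentamer frequencies could be computed for a contig and how an F1-score follows from classification counts. It is not the mlplasmids implementation (which is an R package). The function names `pentamer_frequencies` and `f1_score`, the skipping of windows containing ambiguous bases, and the normalisation to relative frequencies are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from typing import List

K = 5  # pentamers, as named in the abstract
ALPHABET = "ACGT"
# All 4**5 = 1024 possible pentamers, in a fixed (lexicographic) order.
KMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def pentamer_frequencies(sequence: str) -> List[float]:
    """Return relative frequencies of the 1024 pentamers in one contig.

    Assumption: windows containing anything other than A, C, G or T
    (for example N) are ignored rather than imputed.
    """
    seq = sequence.upper()
    # Count every overlapping window of length K.
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Keep only unambiguous pentamers; Counter yields 0 for unseen keys.
    valid = [counts[kmer] for kmer in KMERS]
    total = sum(valid)
    if total == 0:  # contig shorter than K, or no unambiguous window
        return [0.0] * len(KMERS)
    return [c / total for c in valid]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = harmonic mean of precision and recall = 2TP / (2TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

Each contig becomes a fixed-length vector of 1,024 values regardless of its length, which is what lets a classifier such as an SVM be trained on contigs of very different sizes.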

Download data

  • Downloaded 1,823 times
  • Download rankings, all-time:
    • Site-wide: 11,990
    • In microbiology: 699
  • Year to date:
    • Site-wide: 144,301
  • Since beginning of last month:
    • Site-wide: 70,932
