mlplasmids: a user-friendly tool to predict plasmid- and chromosome-derived sequences for single species
Malbert R. C. Rogers,
Johanna C. Braat,
T. D. Verschuuren,
Rob J.L. Willems,
Anita C. Schurch
Posted 23 May 2018
bioRxiv DOI: 10.1101/329045 (published DOI: 10.1099/mgen.0.000224)
Assembly of bacterial short-read whole-genome sequencing (WGS) data frequently results in hundreds of contigs whose origin, plasmid or chromosome, is unclear. Long-read sequencing has emerged as a solution to resolve plasmid structures and to obtain complete genomes for most bacterial species. This information can be used to generate datasets of short-read contigs labeled as plasmid- or chromosome-derived. We investigated the use of several popular machine learning methods to classify short-read contigs of known plasmid or chromosome origin from Enterococcus faecium, Klebsiella pneumoniae and Escherichia coli using pentamer frequencies. Based on the resulting F1-scores, we selected support-vector machine (SVM) models as the best classifier for all three bacterial species (F1-score E. faecium = 0.94, F1-score K. pneumoniae = 0.90, F1-score E. coli = 0.76). On an independent set of isolates, these models outperformed other existing plasmid prediction tools (precision E. faecium = 0.92, precision K. pneumoniae = 0.86, precision E. coli = 0.82). We demonstrated the scalability of our model by accurately predicting the plasmidome of a large collection of 1,644 E. faecium isolates for which only short-read WGS data were available, using a standard laptop with a single core. The low number of false-positive predictions suggests that the models' assignment of a particular gene of interest as plasmid- or chromosome-encoded is plausible. The SVM classifiers are publicly available as a new R package called mlplasmids at https://gitlab.com/sirarredondo/mlplasmids under the GNU General Public License v3.0. We additionally developed a graphical user interface using the Shiny package, which can be accessed at https://sarredondo.shinyapps.io/mlplasmids. Predictions for single genomes can be obtained by uploading a genome assembly. We anticipate that this tool may significantly facilitate research on the dissemination of plasmids encoding antibiotic resistance and/or contributing to host adaptation.
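The abstract names pentamer frequencies as the features used to classify contigs. As a rough illustration of what such a feature vector looks like, the following Python sketch counts overlapping 5-mers in a contig and normalizes them to relative frequencies. It is not taken from the mlplasmids package: the function name pentamer_frequencies, the single-strand counting, the skipping of windows with ambiguous bases, and the normalization are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
K = 5  # pentamers

# All 4**5 = 1024 possible pentamers in a fixed order, so that vectors
# computed for different contigs share the same feature layout.
PENTAMERS = ["".join(p) for p in product(ALPHABET, repeat=K)]


def pentamer_frequencies(sequence):
    """Return the relative frequency of every pentamer in `sequence`.

    Hypothetical helper for illustration only. Assumptions: only the
    given strand is counted, and windows containing anything other than
    A, C, G or T (for example N) are ignored.
    """
    seq = sequence.upper()
    # Count every overlapping window of length K.
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Only windows made of unambiguous bases contribute to the total.
    total = sum(counts[p] for p in PENTAMERS)
    if total == 0:
        return [0.0] * len(PENTAMERS)
    return [counts[p] / total for p in PENTAMERS]


if __name__ == "__main__":
    contig = "ACGTACGTAC"
    features = pentamer_frequencies(contig)
    print(len(features))
    print(features[PENTAMERS.index("ACGTA")])
```

A vector of this kind, one per contig, is the sort of input on which a classifier such as an SVM could be trained against plasmid or chromosome labels derived from complete genomes.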