alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
By Katharine H.D. Crawford and Jesse Bloom
Posted 21 Nov 2019
bioRxiv DOI: 10.1101/850404
(published DOI: 10.21105/joss.01915)
Advances in sequencing technology have made it possible to generate large numbers of long, high-accuracy sequencing reads. For instance, the new PacBio Sequel platform can generate hundreds of thousands of high-quality circular consensus sequences in a single run. Good programs exist for aligning these reads for genome assembly. However, long reads can also be used for other purposes, such as sequencing PCR amplicons that contain various features of interest. For instance, PacBio circular consensus sequences have been used to identify the mutations in influenza viruses in single cells, and to link barcodes to gene mutants in deep mutational scanning. For such applications, aligning the sequences to the targets may be fairly trivial, but parsing specific features of interest (such as mutations, unique molecular identifiers, cell barcodes, and flanking sequences) from these alignments is not. Here we describe alignparse, a Python package for parsing complex sets of features from long sequences that map to known targets. Specifically, it allows the user to provide complex target sequences in GenBank format that contain an arbitrary number of user-defined sub-sequence features. It then aligns the sequencing reads to these targets and filters the alignments based on whether the user-specified features are present with the desired identities (which can be set to different thresholds for different features). Finally, it parses out the sequences, mutations, and/or accuracy of these features as specified by the user. The flexibility of this package therefore fulfills the need for a tool to extract and analyze complex sets of features in large numbers of long sequencing reads.
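The filter-then-parse workflow described in the abstract can be illustrated with a small, self-contained sketch. This is not alignparse's API: the `Feature` class, the `parse_read` function, the toy target, and the per-feature `max_mismatches` threshold are all hypothetical names chosen for illustration. The sketch also assumes the read has already been aligned to the target without gaps, and it treats `N` in the target as a wildcard (as one might for a barcode region).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Feature:
    """Hypothetical description of one sub-sequence feature of a target."""
    name: str
    start: int                     # 0-based, inclusive
    end: int                       # 0-based, exclusive
    max_mismatches: Optional[int]  # None means "do not filter on this feature"
    report: str                    # "sequence", "mutations", or "none"


def parse_read(target: str, aligned_read: str,
               features: List[Feature]) -> Optional[Dict[str, str]]:
    """Return the parsed features for one read, or None if the read is filtered.

    Assumption: aligned_read is a gapless, full-length alignment to target.
    """
    if len(aligned_read) != len(target):
        raise ValueError("this sketch assumes a gapless, full-length alignment")
    parsed: Dict[str, str] = {}
    for feat in features:
        ref = target[feat.start:feat.end]
        qry = aligned_read[feat.start:feat.end]
        # Mutations are numbered from 1 relative to the start of the feature;
        # an N in the target matches any base.
        muts = [f"{r}{i + 1}{q}"
                for i, (r, q) in enumerate(zip(ref, qry))
                if r != "N" and r != q]
        # Each feature carries its own threshold, so flanks can be strict
        # while the gene is allowed to carry mutations.
        if feat.max_mismatches is not None and len(muts) > feat.max_mismatches:
            return None
        if feat.report == "sequence":
            parsed[feat.name] = qry
        elif feat.report == "mutations":
            parsed[feat.name] = " ".join(muts)
    return parsed


# Toy target: 5' flank, gene, 4-nt barcode (Ns), 3' flank.
TARGET = "ACGT" + "ATGGCC" + "NNNN" + "TTAA"
FEATURES = [
    Feature("flank5", 0, 4, max_mismatches=0, report="none"),
    Feature("gene", 4, 10, max_mismatches=None, report="mutations"),
    Feature("barcode", 10, 14, max_mismatches=None, report="sequence"),
    Feature("flank3", 14, 18, max_mismatches=0, report="none"),
]

good_read = "ACGT" + "ATGACC" + "GATC" + "TTAA"  # one gene mutation
bad_read = "ACCT" + "ATGGCC" + "GATC" + "TTAA"   # mismatch in the 5' flank

print(parse_read(TARGET, good_read, FEATURES))
print(parse_read(TARGET, bad_read, FEATURES))
```

In this toy example the first read passes the flank filters and yields its gene mutation and barcode, while the second read is discarded because its 5' flank does not match. Real reads contain insertions and deletions, so an actual pipeline would work from the aligner's output rather than from position-by-position comparison.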