Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 83,609 bioRxiv papers from 360,279 authors.
Most downloaded bioRxiv papers, all time
in category bioinformatics
7,875 results found. For more information, click each entry to expand.
6,606 downloads bioinformatics
Chen-Shan Chin, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn, Ronan O'Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R. Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R. Ecker, Dario Cantu, David R. Rank, Michael C. Schatz
While genome assembly projects have been successful in a number of haploid or inbred species, one of the main current challenges is assembling non-inbred or rearranged heterozygous genomes. To address this critical need, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble Single Molecule Real-Time (SMRT®) Sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We demonstrate the quality of this approach by assembling new reference sequences for three heterozygous samples that have challenged short-read assembly approaches: an F1 hybrid of the model species Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata. The FALCON-based assemblies were substantially more contiguous and complete than alternative short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structures and heterozygosities between the homologous chromosomes, including the identification of widespread heterozygous structural variations within coding sequences.
6,588 downloads bioinformatics
To evaluate and compare the performance of variant calling methods and their confidence scores, comparisons between a test call set and a "gold standard" need to be carried out. Unfortunately, these comparisons are not straightforward with current Variant Call Format (VCF) files, which are the standard output of most variant calling algorithms for high-throughput sequencing data. Comparisons of VCFs are often confounded by the different representations of indels, MNPs, and combinations thereof with SNVs in complex regions of the genome, resulting in misleading results. A variant caller is inherently a classification method designed to score putative variants with confidence scores that could permit controlling the rate of false positives (FP) or false negatives (FN) for a given application. Receiver operating characteristic (ROC) curves and the area under the ROC (AUC) are efficient metrics to evaluate a test call set versus a gold standard. However, in the case of VCF data this also requires special accounting to deal with discrepant representations. We developed a novel algorithm for comparing variant call sets that deals with complex call representation discrepancies through a dynamic programming method that minimizes false positives and negatives globally across the entire call sets, enabling accurate performance evaluation of VCFs.
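Once test calls have been matched against the gold standard (the representation-normalization step this abstract is about), the AUC itself reduces to a Mann-Whitney statistic over the callers' confidence scores. A minimal sketch, with made-up scores and match labels purely for illustration:

```python
def roc_auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the probability that a randomly
    chosen true variant outscores a randomly chosen false call (ties count half)."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical caller confidence scores after matching against a truth set;
# True = call matches the gold standard, False = false positive.
scores = [0.99, 0.95, 0.90, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False, False]
print(round(roc_auc(scores, labels), 3))  # → 0.889
```

Equivalently, AUC is the chance that a random true variant receives a higher confidence score than a random false positive; library implementations such as scikit-learn's `roc_auc_score` compute the same quantity at scale.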
6,566 downloads bioinformatics
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe trends in article length and publication sub-topics over these nearly 200 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
6,527 downloads bioinformatics
Breast cancer remains the most common type of cancer and the leading cause of cancer-induced mortality among women, with 2.4 million new cases diagnosed and 523,000 deaths per year. Historically, diagnosis has begun with clinical screening followed by histopathological analysis. Automated classification of cancers from histopathological images is a challenging task, requiring accurate detection of tumor sub-types. This process could be facilitated by machine learning approaches, which may be more reliable and economical than conventional methods. To prove this principle, we applied fine-tuned, pre-trained deep neural networks. To test the approach, we first classified different cancer types using 6,402 tissue microarray (TMA) training samples. Our framework accurately detected on average 99.8% of the four cancer types (breast, bladder, lung, and lymphoma) using the ResNet V1 50 pre-trained model. For classification of breast cancer sub-types, the approach was then applied to 7,909 images from the BreakHis database, where ResNet V1 152 classified benign and malignant breast cancers with an accuracy of 98.7%. In addition, ResNet V1 50 and ResNet V1 152 categorized benign (adenosis, fibroadenoma, phyllodes tumor, and tubular adenoma) and malignant (ductal carcinoma, lobular carcinoma, mucinous carcinoma, and papillary carcinoma) sub-types with 94.8% and 96.4% accuracy, respectively. The confusion matrices revealed high sensitivity values of 1, 0.995, and 0.993 for cancer types, malignant sub-types, and benign sub-types, respectively. The areas under the curve (AUC) scores were 0.996, 0.973, and 0.996 for cancer types, malignant sub-types, and benign sub-types, respectively. Overall, our results show negligible false negatives (on average 3.7 samples) and false positives (on average 2 samples) across the different models.
Availability: Source code, guidelines, and data sets are temporarily available on Google Drive upon request before moving to a permanent GitHub repository.
6,505 downloads bioinformatics
Matthew Amodio, David van Dijk, Krishnan Srinivasan, William S. Chen, Hussein Mohsen, Kevin R Moon, Allison Campbell, Yujiao Zhao, Xiaomei Wang, Manjunatha Venkataswamy, Anita Desai, V. Ravi, Priti Kumar, Ruth Montgomery, Guy Wolf, Smita Krishnaswamy
Biomedical researchers are generating high-throughput, high-dimensional single-cell data at a staggering rate. As costs of data generation decrease, experimental design is moving towards measurement of many different single-cell samples in the same dataset. These samples can correspond to different patients, conditions, or treatments. While scalability of methods to datasets of these sizes is a challenge on its own, dealing with large-scale experimental design presents a whole new set of problems, including batch effects and sample comparison issues. Currently, there are no computational tools that can both handle large amounts of data in a scalable manner (many cells) and at the same time deal with many samples (many patients or conditions). Moreover, data analysis currently involves the use of different tools that each operate on their own data representation, without guaranteeing a synchronized analysis pipeline. For instance, data visualization methods can be disjoint and mismatched with the clustering method. To address this, we present SAUCIE, a deep neural network that leverages the high degree of parallelization and scalability offered by neural networks, as well as the deep representation of data they can learn, to perform many single-cell data analysis tasks, all on a unified representation. A well-known limitation of neural networks is their lack of interpretability. Our key contribution here is a set of newly formulated regularizations (penalties) that render the features learned in the hidden layers of the neural network interpretable. When large multi-patient datasets are fed into SAUCIE, the various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization, an unsupervised clustering, and other information that can be used to explore the data. We demonstrate this capability by analyzing a newly generated 180-sample dataset consisting of T cells from dengue patients in India, measured with mass cytometry.
We show that SAUCIE, for the first time, can batch correct and process this 11-million-cell dataset to identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue on the basis of single-cell measurements.
6,488 downloads bioinformatics
Mash extends the MinHash dimensionality-reduction technique to include a pairwise mutation distance and P-value significance test, enabling the efficient clustering and search of massive sequence collections. Mash reduces large sequences and sequence sets to small, representative sketches, from which global mutation distances can be rapidly estimated. We demonstrate several use cases, including the clustering of all 54,118 NCBI RefSeq genomes in 33 CPU hours; real-time database search using assembled or unassembled Illumina, Pacific Biosciences, and Oxford Nanopore data; and the scalable clustering of hundreds of metagenomic samples by composition. Mash is freely released under a BSD license (https://github.com/marbl/mash).
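The MinHash reduction described above can be sketched in a few lines: each sequence is reduced to its s smallest k-mer hashes, and the Jaccard similarity of two sequences is estimated from the merged sketch. The sequences, k, and s below are toy values for illustration, not Mash's defaults:

```python
import hashlib

def kmers(seq, k=4):
    """The set of all overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(kmer):
    """Stable 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def sketch(seq, k=4, s=8):
    """Bottom-s MinHash sketch: the s smallest hash values of the k-mer set."""
    return sorted(h(m) for m in kmers(seq, k))[:s]

def jaccard_estimate(sk_a, sk_b, s=8):
    """Estimate Jaccard similarity from the bottom-s values of the merged sketch."""
    set_a, set_b = set(sk_a), set(sk_b)
    merged = sorted(set_a | set_b)[:s]
    shared = sum(1 for v in merged if v in set_a and v in set_b)
    return shared / len(merged)

a = "ACGTACGTACGTTTGA"
b = "ACGTACGTACGTTTGC"  # differs from a by one base
print(round(jaccard_estimate(sketch(a), sketch(b)), 2))
```

Mash then converts the Jaccard estimate j into its mutation distance via D = -(1/k)·ln(2j/(1+j)); this sketch stops at the Jaccard step.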
6,409 downloads bioinformatics
Long sequencing reads generated by single-molecule sequencing technology offer the possibility of dramatically improving the contiguity of genome assemblies. The biggest challenge today is that long reads have relatively high error rates, currently around 15%. The high error rates make it difficult to use this data alone, particularly with highly repetitive plant genomes. Errors in the raw data can lead to insertion or deletion errors (indels) in the consensus genome sequence, which in turn create significant problems for downstream analysis; for example, a single indel may shift the reading frame and incorrectly truncate a protein sequence. Here we describe an algorithm that solves the high error rate problem by combining long, high-error reads with shorter but much more accurate Illumina sequencing reads, whose error rates average <1%. Our hybrid assembly algorithm combines these two types of reads to construct mega-reads, which are both long and accurate, and then assembles the mega-reads using the CABOG assembler, which was designed for long reads. We apply this technique to a large data set of Illumina and PacBio sequences from the species Aegilops tauschii, a large and highly repetitive plant genome that has resisted previous attempts at assembly. We show that the resulting assembled contigs are far larger than in any previous assembly, with an N50 contig size of 486,807. We compare the contigs to independently produced optical maps to evaluate their large-scale accuracy, and to a set of high-quality bacterial artificial chromosome (BAC)-based assemblies to evaluate base-level accuracy.
6,402 downloads bioinformatics
Alex Bishara, Eli L Moss, Ekaterina Tkachenko, Joyce B Kang, Soumaya Zlitni, Rebecca N Culver, Tessa M. Andermann, Ziming Weng, Christina Wood, Christine Handy, Hanlee Ji, Serafim Batzoglou, Ami S. Bhatt
Although shotgun short-read sequencing has facilitated the study of strain-level architecture within complex microbial communities, existing metagenomic approaches often cannot capture structural differences between closely related co-occurring strains. Recent methods, which employ read cloud sequencing and specialized assembly techniques, provide significantly improved genome drafts and show potential to capture these strain-level differences. Here, we apply this read cloud metagenomic approach to longitudinal stool samples from a patient undergoing hematopoietic cell transplantation. The patient's microbiome is profoundly disrupted and is eventually dominated by Bacteroides caccae. Comparative analysis of B. caccae genomes obtained using read cloud sequencing together with metagenomic RNA sequencing allows us to predict that particular mobile element integrations result in increased antibiotic resistance, which we further support using in vitro antibiotic susceptibility testing. Thus, we find read cloud sequencing to be useful in identifying strain-level differences that underlie differential fitness.
6,367 downloads bioinformatics
Motivation: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain higher than alternate scaffolding technologies. Results: We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. Availability and Implementation: The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.
6,328 downloads bioinformatics
Identifying robust survival subgroups of hepatocellular carcinoma (HCC) will significantly improve patient care. Efforts to integrate multi-omics data to explicitly predict HCC survival across multiple patient cohorts are currently lacking. To fill this gap, we present a deep learning (DL) based model of HCC that robustly differentiates survival subpopulations of patients in six cohorts. We built the DL-based, survival-sensitive model on 360 HCC patients' data using RNA-seq, miRNA-seq, and methylation data from TCGA; it predicts prognosis as well as an alternative model in which genomics and clinical data are both considered. The DL-based model identifies two optimal subgroups of patients with significant survival differences (P=7.13e-6) and good model fitness (C-index=0.68). The more aggressive subtype is associated with frequent TP53 inactivation mutations, higher expression of stemness markers (KRT19, EPCAM) and the tumor marker BIRC5, and activated Wnt and Akt signaling pathways. We validated this multi-omics model on five external datasets of various omics types: the LIRI-JP cohort (n=230, C-index=0.75), NCI cohort (n=221, C-index=0.67), Chinese cohort (n=166, C-index=0.69), E-TABM-36 cohort (n=40, C-index=0.77), and Hawaiian cohort (n=27, C-index=0.82). This is the first study to employ deep learning to identify multi-omics features linked to the differential survival of HCC patients. Given its robustness across multiple cohorts, we expect this workflow to be useful for predicting HCC prognosis.
6,293 downloads bioinformatics
Preranked gene set enrichment analysis (GSEA) is a widely used method for interpreting gene expression data in terms of biological processes. Here we present the FGSEA method, which estimates arbitrarily low GSEA P-values with higher accuracy and much greater speed than other implementations. We also present a polynomial algorithm to calculate GSEA P-values exactly, which we use to confirm the accuracy of the method in practice.
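For context, the statistic underlying preranked GSEA is a weighted Kolmogorov-Smirnov-style running sum over the ranked gene list: step up at genes in the set, down otherwise, and take the extreme of the running sum. FGSEA's contribution is estimating permutation P-values for this statistic quickly, which the following illustrative sketch (with placeholder gene names and uniform weights) does not attempt:

```python
def enrichment_score(ranked_genes, gene_set, weights=None):
    """Running-sum enrichment score: increments scaled to sum to 1 over hits,
    decrements scaled to sum to 1 over misses; ES is the signed extreme."""
    hits = [g in gene_set for g in ranked_genes]
    if weights is None:
        weights = [1.0] * len(ranked_genes)  # unweighted (classic KS) variant
    hit_total = sum(w for w, hit in zip(weights, hits) if hit)
    miss_step = 1.0 / (len(ranked_genes) - sum(hits))
    running, extreme = 0.0, 0.0
    for w, hit in zip(weights, hits):
        running += w / hit_total if hit else -miss_step
        if abs(running) > abs(extreme):
            extreme = running
    return extreme

# Placeholder ranking (most up-regulated first) and a toy gene set
# concentrated near the top, giving a positive enrichment score.
ranked = ["TP53", "MYC", "EGFR", "KRAS", "BRCA1", "GAPDH", "ACTB"]
print(enrichment_score(ranked, {"TP53", "MYC", "KRAS"}))  # → 0.75
```

A permutation P-value would then come from recomputing the score over many random gene sets of the same size; FGSEA's speed comes from sharing work across those permutations.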
6,185 downloads bioinformatics
The recently introduced Oxford Nanopore MinION platform generates DNA sequence data in real time. This opens immense potential to shorten the sample-to-results time and is likely to lead to enormous benefits in rapid diagnosis of bacterial infection and identification of drug resistance. However, there are very few tools available for streaming analysis of real-time sequencing data. Here, we present a framework for streaming analysis of MinION real-time sequence data, together with probabilistic streaming algorithms for species typing, multi-locus strain typing, gene-presence strain typing, and antibiotic resistance profile identification. Using three culture isolate samples as well as a mixed-species sample, we demonstrate that bacterial species and strain information can be obtained within 30 minutes of sequencing using about 500 reads, initial drug-resistance profiles within two hours, and complete resistance profiles within 10 hours. Multi-locus strain typing required more than 15x coverage to generate confident assignments, whereas gene-presence typing could detect the presence of a known strain with 0.5x coverage. We also show that our pipeline can process over 100 times more data than the current throughput of the MinION on a desktop computer.
6,124 downloads bioinformatics
Analysis of single-cell RNA-seq data begins with pre-processing of sequencing reads to generate count matrices. We investigate algorithm choices for the challenges of pre-processing, and describe a workflow that balances efficiency and accuracy. Our workflow is based on the kallisto (<https://pachterlab.github.io/kallisto/>) and bustools (<https://bustools.github.io/>) programs, and is near-optimal in speed and memory. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses. Documentation and tutorials for using the kallisto | bus workflow are available at <https://www.kallistobus.tools/>.
6,110 downloads bioinformatics
John Vivian, Arjun Rao, Frank Austin Nothaft, Christopher Ketchum, Joel Armstrong, Adam Novak, Jacob Pfeil, Jake Narkizian, Alden D. Deran, Audrey Musselman-Brown, Hannes Schmidt, Peter Amstutz, Brian Craft, Mary Goldman, Kate Rosenbloom, Melissa Cline, Brian O’Connor, Megan Hanna, Chet Birger, W. James Kent, David A. Patterson, Anthony D. Joseph, Jingchun Zhu, Sasha Zaranek, Gad Getz, David Haussler, Benedict Paten
Toil is portable, open-source workflow software that supports contemporary workflow definition languages and can be used to run scientific workflows securely, reproducibly, and efficiently at large scale. To demonstrate Toil, we processed over 20,000 RNA-seq samples to create a consistent meta-analysis of five datasets, free of computational batch effects, that we make freely available. Nearly all the samples were analysed in under four days using a commercial cloud cluster of 32,000 preemptible cores.
6,100 downloads bioinformatics
Recent technological advances have enabled assaying DNA methylation at single-cell resolution. Current protocols are limited by incomplete CpG coverage and hence methods to predict missing methylation states are critical to enable genome-wide analyses. Here, we report DeepCpG, a computational approach based on deep neural networks to predict DNA methylation states from DNA sequence and incomplete methylation profiles in single cells. We evaluated DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols, finding that DeepCpG yields substantially more accurate predictions than previous methods. Additionally, we show that the parameters of our model can be interpreted, thereby providing insights into the effect of sequence composition on methylation variability.
6,086 downloads bioinformatics
Read-based phasing allows the haplotype structure of a sample to be reconstructed purely from sequencing reads. While phasing is a required step for answering questions about population genetics, compound heterozygosity, and clinical decision making, accurate, usable, and standards-based software has been lacking. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use, and phases not only SNVs but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, which further improves accuracy when multiple related samples are provided.
6,072 downloads bioinformatics
Pseudomonas aeruginosa, an antibiotic-resistant opportunistic human pathogen, is one of the most dangerous superbugs on the World Health Organization's published list of bacteria for which new antibiotics are urgently needed. It affects patients with AIDS, cystic fibrosis, or cancer, as well as burn victims and people with prosthetics and implants. P. aeruginosa also forms biofilms, which increase resistance to antibiotics and host immune responses and render current therapies ineffective. It is therefore important to find new antibacterial treatment strategies against P. aeruginosa. Biofilm formation is regulated through a system called quorum sensing, so disrupting this system is considered a promising strategy to combat bacterial pathogenicity. Quercetin is known to inhibit P. aeruginosa biofilm formation, but its mechanism of action is unknown. In the present study, we analysed the mode of interaction of LasR with quercetin, using a combination of molecular docking, molecular dynamics (MD) simulations, and machine learning techniques to study the interaction of the P. aeruginosa LasR protein with quercetin. We assessed the conformational changes of the interaction and analysed the molecular details of quercetin binding to LasR. We show that quercetin has two binding modes. The first is interaction with the ligand-binding domain (LBD); this interaction is non-competitive and has also been shown experimentally. The second is interaction with the bridge; it involves conserved amino acid interactions from the LBD, SLR, and DBD and is likewise non-competitive. Experimental studies show that the hydroxyl group of ring A is necessary for inhibitory activity; in our model, this hydroxyl group interacts with Leu177 in the second binding mode. This could explain the molecular mechanism by which quercetin inhibits the LasR protein.
This study may offer insights into how quercetin inhibits quorum sensing circuitry by interacting with the transcriptional regulator LasR. The existence of two binding modes may explain why quercetin is effective at inhibiting biofilm formation and virulence gene expression.
6,048 downloads bioinformatics
The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and the observation of the full splicing dynamics with steady-state mRNA levels are violated. With scVelo (https://scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell's internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal dentate gyrus neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
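As a point of reference, the original steady-state model that scVelo generalizes can be written down directly: fit the ratio γ as the slope of unspliced versus spliced counts, then read each cell's velocity as the residual u − γ·s. The counts below are toy values, and this is only the baseline model; scVelo replaces it with a full likelihood-based dynamical model:

```python
def fit_gamma(unspliced, spliced):
    """Steady-state ratio: least-squares slope through the origin of
    unspliced vs. spliced counts (the real model fits on extreme quantiles)."""
    return sum(u * s for u, s in zip(unspliced, spliced)) / sum(s * s for s in spliced)

def velocity(unspliced, spliced, gamma):
    """Velocity per cell: deviation of unspliced counts from the
    steady-state expectation gamma * spliced."""
    return [u - gamma * s for u, s in zip(unspliced, spliced)]

# Toy counts for one gene across four cells, lying exactly on steady state.
u = [2.0, 4.0, 1.0, 6.0]
s = [1.0, 2.0, 0.5, 3.0]
gamma = fit_gamma(u, s)                         # → 2.0
print(velocity(u, s, gamma))                    # → [0.0, 0.0, 0.0, 0.0]
print(velocity([3.0, 1.0], [1.0, 2.0], gamma))  # → [1.0, -3.0]
```

A positive velocity (excess unspliced mRNA) indicates up-regulation and a negative one down-regulation; scVelo additionally infers gene-specific transcription, splicing, and degradation rates and a latent time, which the steady-state residual cannot provide.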
6,014 downloads bioinformatics
We describe Strelka2 (https://github.com/Illumina/strelka), an open-source small-variant calling method for clinical germline and somatic sequencing applications. Strelka2 introduces a novel mixture-model-based estimation of indel error parameters from each sample, an efficient tiered haplotype modeling strategy, and a normal-sample contamination model to improve liquid tumor analysis. For both germline and somatic calling, Strelka2 substantially outperforms current leading tools on both variant calling accuracy and compute cost.
5,972 downloads bioinformatics
The prediction of inter-residue contacts and distances from co-evolutionary data using deep learning has considerably advanced protein structure prediction. Here we build on these advances by developing a deep residual network for predicting inter-residue orientations in addition to distances, and a Rosetta constrained energy minimization protocol for rapidly and accurately generating structure models guided by these restraints. In benchmark tests on CASP13 and CAMEO derived sets, the method outperforms all previously described structure prediction methods. Although trained entirely on native proteins, the network consistently assigns higher probability to de novo designed proteins, identifying the key fold determining residues and providing an independent quantitative measure of the "ideality" of a protein structure. The method promises to be useful for a broad range of protein structure prediction and design problems.
- 18 Dec 2019: We're pleased to announce PanLingua, a new tool that enables you to search for machine-translated bioRxiv preprints using more than 100 different languages.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!