Rxivist combines preprints from bioRxiv with data from Twitter to help you find the papers being discussed in your field. Currently indexing 54,968 bioRxiv papers from 253,677 authors.
Most downloaded bioRxiv papers since the beginning of last month
51,404 results found.
379 downloads systems biology
Despite over a billion years of evolutionary divergence, several thousand human genes possess clearly identifiable orthologs in yeast, and many have undergone lineage-specific duplications in one or both lineages. The ortholog conjecture postulates that orthologous genes between species retain ancestral functions despite divergence over vast timescales, but duplicated genes will be free to diverge in function. However, the retention of ancestral functions among co-orthologs between species and within gene families has been difficult to test experimentally at scale. In order to investigate how ancestral functions are retained or lost post-duplication, we systematically replaced hundreds of essential yeast genes with their human orthologs from gene families that have undergone lineage-specific duplications, including those with single duplications (one yeast gene to two human genes, 1:2) or higher-order expansions (1:>2) in the human lineage. We observe a variable pattern of replaceability across different ortholog classes, with an obvious trend towards differential replaceability inside gene families, rarely observing replaceability by all members of a family. We quantify the ability of various properties of the orthologs to predict replaceability, showing that in the case of 1:2 orthologs, replaceability is predicted largely by the divergence and tissue-specific expression of the human co-orthologs, i.e. the human proteins that are less diverged from their yeast counterpart and more ubiquitously expressed across human tissues more often replace their single yeast ortholog. These trends were consistent with in silico simulations demonstrating that when only one ortholog is replaceable, it tends to be the least diverged of the pair. Replaceability of yeast genes having more than two human co-orthologs was marked by retention of orthologous interactions in functional or protein networks as well as by more ancestral subcellular localization. 
Overall, we performed >400 human gene replaceability assays revealing 56 new human-yeast complementation pairs, thus opening up avenues to further functionally characterize these human genes in a simplified organismal context.
375 downloads genetics
Charles P Fulco, Joseph Nasser, Thouis R Jones, Glen Munson, Drew T Bergman, Vidya Subramanian, Sharon R Grossman, Rockwell Anyoha, Tejal A Patwardhan, Tung H Nguyen, Michael Kane, Benjamin Doughty, Elizabeth M. Perez, Neva C. Durand, Elena K Stamenova, Erez Lieberman Aiden, Eric S Lander, Jesse M Engreitz
Mammalian genomes harbor millions of noncoding elements called enhancers that quantitatively regulate gene expression, but it remains unclear which enhancers regulate which genes. Here we describe an experimental approach, based on CRISPR interference, RNA FISH, and flow cytometry (CRISPRi-FlowFISH), to perturb enhancers in the genome, and apply it to test >3,000 potential regulatory enhancer-gene connections across multiple genomic loci. A simple equation based on a mechanistic model for enhancer function performed remarkably well at predicting the complex patterns of regulatory connections we observe in our CRISPR dataset. This Activity-by-Contact (ABC) model involves multiplying measures of enhancer activity and enhancer-promoter 3D contacts, and can predict enhancer-gene connections in a given cell type based on chromatin state maps. Together, CRISPRi-FlowFISH and the ABC model provide a systematic approach to map and predict which enhancers regulate which genes, and will help to interpret the functions of the thousands of disease risk variants in the noncoding genome.
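The multiplication described in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `abc_scores`, the dictionary inputs, and the toy numbers are hypothetical, and the normalization over all candidate elements near a gene is an assumption that the abstract itself does not state.

```python
def abc_scores(activity, contact):
    """Score candidate enhancers for one gene.

    activity: dict mapping element name -> measure of enhancer activity
    contact:  dict mapping element name -> measure of 3D contact with the
              gene's promoter
    Returns a dict mapping element name -> share of the summed
    activity * contact products (normalization is an assumption here).
    """
    # Core of the model as described in the abstract: activity times contact.
    products = {name: activity[name] * contact[name] for name in activity}
    total = sum(products.values())
    if total == 0:
        # No element has both activity and contact; nothing to rank.
        return {name: 0.0 for name in products}
    return {name: value / total for name, value in products.items()}


# Toy, made-up inputs for two hypothetical elements near one gene.
example = abc_scores(
    activity={"e1": 2.0, "e2": 1.0},
    contact={"e1": 0.5, "e2": 1.0},
)
```

In this toy case a strong but distant element and a weaker but closer element end up with the same score, which is the kind of trade-off a product of activity and contact encodes.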
373 downloads genomics
Ryan Poplin, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas Colthurst, Alexander Ku, Dan Newburger, Jojo Dijamco, Nam Nguyen, Pegah T. Afshar, Sam S. Gross, Lizzie Dorfman, Cory Y. McLean, Mark A. DePristo
Next-generation sequencing (NGS) is a rapidly evolving set of technologies that can be used to determine the sequence of an individual's genome by calling genetic variants present in an individual using billions of short, errorful sequence reads. Despite more than a decade of effort and thousands of dedicated researchers, the hand-crafted and parameterized statistical models used for variant calling still produce thousands of errors and missed variants in each genome. Here we show that a deep convolutional neural network can call genetic variation in aligned next-generation sequencing read data by learning statistical relationships (likelihoods) between images of read pileups around putative variant sites and ground-truth genotype calls. This approach, called DeepVariant, outperforms existing tools, even winning the "highest performance" award for SNPs in a FDA-administered variant calling challenge. The learned model generalizes across genome builds and even to other species, allowing non-human sequencing projects to benefit from the wealth of human ground truth data. We further show that, unlike existing tools which perform well on only a specific technology, DeepVariant can learn to call variants in a variety of sequencing technologies and experimental designs, from deep whole genomes from 10X Genomics to Ion Ampliseq exomes. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.
373 downloads genetics
Iosif Lazaridis, Anna Belfer-Cohen, Swapan Mallick, Nick Patterson, Olivia Cheronet, Nadin Rohland, Guy Bar-Oz, Ofer Bar-Yosef, Nino Jakeli, Eliso Kvavadze, David Lordkipanidze, Zinovi Matzkevich, Tengiz Meshveliani, Brendan J Culleton, Douglas J. Kennett, Ron Pinhasi, David Reich
The earliest ancient DNA data of modern humans from Europe dates to ~40 thousand years ago, but that from the Caucasus and the Near East to only ~14 thousand years ago, from populations who lived long after the Last Glacial Maximum (LGM) ~26.5-19 thousand years ago. To address this imbalance and to better understand the relationship of Europeans and Near Easterners, we report genome-wide data from two ~26 thousand year old individuals from Dzudzuana Cave in Georgia in the Caucasus from around the beginning of the LGM. Surprisingly, the Dzudzuana population was more closely related to early agriculturalists from western Anatolia ~8 thousand years ago than to the hunter-gatherers of the Caucasus from the same region of western Georgia of ~13-10 thousand years ago. Most of the Dzudzuana population's ancestry was deeply related to the post-glacial western European hunter-gatherers of the 'Villabruna cluster', but it also had ancestry from a lineage that had separated from the great majority of non-African populations before they separated from each other, proving that such 'Basal Eurasians' were present in West Eurasia twice as early as previously recorded. We document major population turnover in the Near East after the time of Dzudzuana, showing that the highly differentiated Holocene populations of the region were formed by 'Ancient North Eurasian' admixture into the Caucasus and Iran and North African admixture into the Natufians of the Levant. We finally show that the Dzudzuana population contributed the majority of the ancestry of post-Ice Age people in the Near East, North Africa, and even parts of Europe, thereby becoming the largest single contributor of ancestry of all present-day West Eurasians.
372 downloads developmental biology
For more than a century, researchers have been trying to understand the relationship between embryogenesis and regeneration (Morgan 1901). A long-standing hypothesis is that biological processes originally used during embryogenesis are re-deployed during regeneration. In the past decade, we have begun to understand the relationships of genes and their organization into regulatory networks responsible for driving embryogenesis (Davidson et al. 2002; Röttinger et al. 2012) and regeneration (Srivastava et al. 2014; Lobo and Levin 2015; Rodius et al. 2016) in diverse taxa. Here, we compare these networks in the same species to investigate how regeneration re-uses genetic interactions originally set aside for embryonic development. Using a uniquely suited embryonic development and whole-body regeneration model, the sea anemone Nematostella vectensis, we show that at the transcriptomic level the regenerative program partially re-uses elements of the embryonic gene network in addition to a small cohort of genes that are only activated during regeneration. We further identified co-expression modules that are either i) highly conserved between these two developmental trajectories and involved in core biological processes or ii) regeneration specific modules that drive cellular events unique to regeneration. Finally, our functional validation reveals that apoptosis is a regeneration-specific process in Nematostella and is required for the initiation of the regeneration program. These results indicate that regeneration reactivates embryonic gene modules to accomplish basic cellular functions but deploys a novel gene network logic to activate the regenerative process.
372 downloads bioinformatics
Hybrid genome assembly has emerged as an important technique in bacterial genomics, but cost and labor requirements limit large-scale application. We present Ultraplexing, a method to improve per-sample sequencing cost and hands-on time of Nanopore sequencing for hybrid assembly by at least 50% compared to molecular barcoding, while maintaining high assembly quality (Quality Value; QV ≥ 42). Ultraplexing requires the availability of Illumina data and uses inter-sample genetic variability to assign reads to isolates, which obviates the need for molecular barcoding. Thus, Ultraplexing can enable significant sequencing and labor cost reductions in large-scale bacterial genome projects.
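One way to picture assigning unbarcoded reads to isolates from inter-sample variability is a shared k-mer vote. The sketch below is an assumption made for illustration only: the abstract does not describe the actual assignment algorithm, and the helper names (`kmers`, `assign_read`), the choice of k, and the toy sequences are all hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def assign_read(read, isolate_kmers, k):
    """Assign a read to the isolate sharing the most k-mers with it.

    isolate_kmers: dict mapping isolate name -> set of k-mers, e.g. built
    from that isolate's short-read data (hypothetical input format).
    Returns the isolate name, or None if the best score is zero or tied.
    """
    read_kmers = kmers(read, k)
    scores = {name: len(read_kmers & known)
              for name, known in isolate_kmers.items()}
    ranked = sorted(scores.values(), reverse=True)
    # Leave the read unassigned when the evidence cannot separate isolates.
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


# Toy isolates that differ at a single position (made-up sequences).
K = 4
isolates = {
    "A": kmers("ACGTACGTAA", K),
    "B": kmers("ACGTTCGTAA", K),
}
spanning_read = assign_read("TACGTA", isolates, K)   # covers the variant
uninformative_read = assign_read("GTAA", isolates, K)  # identical in both
```

Reads that span a variable position can be told apart, while reads from regions identical in both isolates stay unassigned, which is why the approach depends on genetic variability between samples.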
372 downloads synthetic biology
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.
371 downloads genomics
Large-scale genetic screens play a key role in the systematic discovery of genes underlying cellular phenotypes. Pooling of genetic perturbations greatly increases screening throughput, but has so far been limited to screens of enrichments defined by cell fitness and flow cytometry, or to comparatively low-throughput single cell gene expression profiles. Although microscopy is a rich source of spatial and temporal information about mammalian cells, high-content imaging screens have been restricted to much less efficient arrayed formats. Here, we introduce an optical method to link perturbations and their phenotypic outcomes at the single-cell level in a pooled setting. Barcoded perturbations are read out by targeted in situ sequencing following image-based phenotyping. We apply this technology to screen a focused set of 952 genes across >3 million cells for involvement in NF-κB activation by imaging the translocation of RelA (p65) to the nucleus, recovering 20 known pathway components and 3 novel candidate positive regulators of IL-1β and TNFα-stimulated immune responses.
371 downloads bioinformatics
Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out sequences. We further demonstrate a novel and powerful transfer learning approach to use mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
368 downloads immunology
Felipe A Vieira Braga, Gozde Kar, Marijn Berg, Orestes A Carpaij, Krzysztof Polanski, Lukas M Simon, Sharon Brouwer, Tomas Gomes, Laura Hesse, Jian Jiang, Eirini S Fasouli, Mirjana Efremova, Roser Vento-Tormo, Karen Affleck, Subarna Palit, Paulina Strzelecka, Helen V Firth, Krishnaa TA Mahbubani, Ana Cvejic, Kerstin B Meyer, Kourosh Saeb-Parsy, Marjan Luinge, Corry-Anke Brandsma, Wim Timens, Ilias Angelidis, Maximilian Strunz, Gerard H Koppelman, Antoon J van Oosterhout, Herbert B Schiller, Fabian J. Theis, Maarten van den Berge, Martijn C Nawijn, Sarah A Teichmann
Human lungs enable efficient gas exchange, and form an interface with the environment which depends on mucosal immunity for protection against infectious agents. Tightly controlled interactions between structural and immune cells are required to maintain lung homeostasis. Here, we use single cell transcriptomics to chart the cellular landscape of upper and lower airways and lung parenchyma in health. We report location-dependent airway epithelial cell states, and a novel subset of tissue-resident memory T cells. In lower airways of asthma patients, mucous cell hyperplasia is shown to stem from a novel mucous ciliated cell state, as well as goblet cell hyperplasia. We report the presence of pathogenic effector Th2 cells in asthma, and find evidence for type-2 cytokines in maintaining the altered epithelial cell states. Unbiased analysis of cell-cell interactions identifies a shift from airway structural cell communication in health to a Th2-dominated interactome in asthma.
366 downloads genomics
Ansuman T. Satpathy, Jeffrey M. Granja, Kathryn E Yost, Yanyan Qi, Francesca Meschi, Geoffrey P McDermott, Brett N Olsen, Maxwell R. Mumbach, Sarah E Pierce, M. Ryan Corces, Preyas Shah, Jason C. Bell, Darisha Jhutty, Corey M Nemec, Jean Wang, Li Wang, Yifeng Yin, Paul G Giresi, Anne Lynn S. Chang, Grace X Y Zheng, William J. Greenleaf, Howard Y. Chang
Understanding complex tissues requires single-cell deconstruction of gene regulation with precision and scale. Here we present a massively parallel droplet-based platform for mapping transposase-accessible chromatin in tens of thousands of single cells per sample (scATAC-seq). We obtain and analyze chromatin profiles of over 200,000 single cells in two primary human systems. In blood, scATAC-seq allows marker-free identification of cell type-specific cis- and trans-regulatory elements, mapping of disease-associated enhancer activity, and reconstruction of trajectories of differentiation from progenitors to diverse and rare immune cell types. In basal cell carcinoma, scATAC-seq reveals regulatory landscapes of malignant, stromal, and immune cell types in the tumor microenvironment. Moreover, scATAC-seq of serial tumor biopsies before and after PD-1 blockade allows identification of chromatin regulators and differentiation trajectories of therapy-responsive intratumoral T cell subsets, revealing a shared regulatory program driving CD8+ T cell exhaustion and CD4+ T follicular helper cell development. We anticipate that droplet-based single-cell chromatin accessibility will provide a broadly applicable means of identifying regulatory factors and elements that underlie cell type and function.
364 downloads genomics
Since its debut in 2009, single-cell RNA-seq has been a major driver of progress in biomedical research. Developmental biology and stem cell studies especially benefit from the ability to profile single cells. While most studies still focus on individual tissues or organs, the recent development of ultra-high-throughput single-cell RNA-seq has demonstrated the potential to depict more complex systems or even the entire body. Though multiple ultra-high-throughput single-cell RNA-seq systems have attracted attention, a systematic comparison of these systems is not yet available. Here we focus on three prevalent droplet-based ultra-high-throughput single-cell RNA-seq systems: inDrop, Drop-seq, and 10X Genomics Chromium. While each system is capable of profiling single-cell transcriptomes, detailed comparison revealed distinguishing features and suitable application scenarios for each system.
362 downloads synthetic biology
Benjamin Schumann, Stacy A. Malaker, Simon P. Wisnovsky, Marjoke F. Debets, Anthony J. Agbay, Daniel Fernandez, Lauren J. S. Wagner, Liang Lin, Junwon Choi, Douglas M. Fox, Jessie Peh, Melissa A. Gray, Kayvon Pedram, Jennifer J. Kohler, Milan Mrksich, Carolyn R. Bertozzi
Studying posttranslational modifications classically relies on experimental strategies that oversimplify the complex biosynthetic machineries of living cells. Protein glycosylation contributes to essential biological processes, but correlating glycan structure, underlying protein and disease-relevant biosynthetic regulation is currently elusive. Here, we engineer living cells to tag glycans with editable chemical functionalities while providing information on biosynthesis, physiological context and glycan fine structure. We introduce a non-natural substrate biosynthetic pathway and use engineered glycosyltransferases to incorporate chemically tagged sugars into the cell surface glycome of the living cell. We apply the strategy to a particularly redundant yet disease-relevant human glycosyltransferase family, the polypeptide N-acetylgalactosaminyl transferases. This approach bestows a gain-of-function modification on cells where the products of individual glycosyltransferases can be selectively characterized or manipulated at will.
362 downloads bioinformatics
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in the vast majority of analysis pipelines. In this review, we provide both an exoteric introduction to deep learning and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems that are suitable for deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoders, and the most recent state-of-the-art architectures. We then provide eight examples, covering five bioinformatics research directions and all four kinds of data types, with the implementations written in TensorFlow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples .
362 downloads genomics
The genomes of ancient humans, Neandertals, and Denisovans contain many alleles that influence disease risks. Using genotypes at 3180 disease-associated loci, we estimated the disease burden of 147 ancient genomes. After correcting for missing data, genetic risk scores were generated for nine disease categories and the set of all combined diseases. These genetic risk scores were used to examine the effects of different types of subsistence, geography, and sample age on the number of risk alleles in each ancient genome. On a broad scale, hereditary disease risks are similar for ancient hominins and modern-day humans, and the GRS percentiles of ancient individuals span the full range of what is observed in present day individuals. In addition, there is evidence that ancient pastoralists may have had healthier genomes than hunter-gatherers and agriculturalists. We also observed a temporal trend whereby genomes from the recent past are more likely to be healthier than genomes from the deep past. This calls into question the idea that modern lifestyles have caused genetic load to increase over time. Focusing on individual genomes, we find that the overall genomic health of the Altai Neandertal is worse than 97% of present day humans and that Otzi the Tyrolean Iceman had a genetic predisposition to gastrointestinal and cardiovascular diseases. As demonstrated by this work, ancient genomes afford us new opportunities to diagnose past human health, which has previously been limited by the quality and completeness of remains.
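The scoring and percentile comparison described above can be illustrated with a small sketch. It rests on assumptions that are not taken from the abstract: genotypes are encoded as risk-allele counts (0, 1, 2, or `None` when missing), the missing-data correction is approximated by dividing by the number of alleles actually called, and the function names and numbers are hypothetical.

```python
def genetic_risk_score(risk_allele_counts):
    """Fraction of risk alleles among the loci that were genotyped.

    risk_allele_counts: list with 0, 1, or 2 risk alleles per locus, or
    None where the ancient genome has no call (assumed encoding).
    """
    called = [count for count in risk_allele_counts if count is not None]
    if not called:
        raise ValueError("no genotyped loci to score")
    # Two alleles per called locus; missing loci do not dilute the score.
    return sum(called) / (2 * len(called))


def percentile(score, reference_scores):
    """Percentage of reference individuals with a strictly lower score."""
    lower = sum(1 for ref in reference_scores if ref < score)
    return 100.0 * lower / len(reference_scores)


# Made-up genotypes for one ancient genome and a tiny reference panel.
ancient_score = genetic_risk_score([2, 1, None, 0])
ancient_percentile = percentile(ancient_score, [0.1, 0.2, 0.6, 0.7])
```

Normalizing by called loci keeps genomes with different coverage on a comparable scale, which matters when many ancient samples are incompletely genotyped.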
362 downloads developmental biology
We present CLADES (Cell Lineage Access Driven by an Edition Sequence), a technology for cell lineage studies based on CRISPR/Cas9. CLADES relies on a system of genetic switches to activate and inactivate reporter genes in a pre-determined order. Targeting CLADES to progenitor cells allows the progeny to inherit a sequential cascade of reporters, coupling birth order with reporter expression. This gives us temporal resolution of lineage development that can be used to deconstruct an extended cell lineage by tracking the reporters expressed in the progeny. When targeted to the germ line, the same cascade progresses across animal generations, marking each generation with the corresponding combination of reporters. CLADES thus offers an innovative strategy for making programmable cascades of genes that can be used for genetic manipulation or to record serial biological events.
362 downloads cell biology
Filamentous fungi are ubiquitous in nature and serve as important biological models in various scientific fields including genetics, cell biology, ecology, evolution, and chemistry. A significant obstacle in studying filamentous fungi is the lack of tools for characterizing their growth and morphology in an efficient and quantitative manner. Consequently, assessments of the growth of filamentous fungi are often subjective and imprecise. In order to remedy this problem, we developed Fungal Feature Tracker (FFT), user-friendly software comprising different image analysis tools to automatically quantify different fungal characteristics, such as spore number, spore morphology, total length, number of hyphal tips, and the area covered by the mycelium. In addition, FFT can recognize and quantify specialized structures such as the traps generated by nematode-trapping fungi, which could be tuned to quantify other distinctive fungal structures in different fungi. We present a detailed characterization and comparison of a few fungal species as a case study to demonstrate the capabilities and potential of our software. Using FFT, we were able to quantify various features at strain and species level, such as mycelial growth over time and the length and width of spores, which would be difficult to track using classical approaches. In summary, FFT is a powerful tool that enables quantitative measurements of fungal features and growth, allowing objective and precise characterization of fungal phenotypes.
360 downloads bioinformatics
Massively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While these assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so that there is a need for techniques to impute values for genotypes whose phenotypes are not directly assayed. Here we present a method based on the idea of inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction in which mutational effects change as little as possible across adjacent genetic backgrounds. Although this method is highly conservative and has no tunable parameters, it also makes no assumptions about the form that genetic interactions take, resulting in predictions that can behave in a very complicated manner where the data require it but which are nearly additive where data is sparse or absent. We apply this method to analyze a fitness landscape for protein G, showing that our technique can provide a substantially less epistatic fit to the landscape than standard methods with little loss in predictive power. Moreover, our analysis reveals that the complex structure of epistasis observed in this dataset can be well-understood in terms of a simple qualitative model consisting of three fitness peaks where the landscape is locally additive in the vicinity of each peak.
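The idea of choosing the least epistatic value for an unmeasured genotype can be shown on a toy case. The sketch below assumes a biallelic landscape (genotypes as tuples of 0 and 1), a single missing genotype whose neighbors are all measured, and a sum-of-squares penalty on pairwise epistasis; these simplifications, and the names `flip` and `impute_missing`, are illustrative assumptions rather than the method as published.

```python
from itertools import combinations


def flip(genotype, site):
    """Return a copy of the genotype with one site switched (0 <-> 1)."""
    changed = list(genotype)
    changed[site] = 1 - changed[site]
    return tuple(changed)


def impute_missing(phenotypes, missing):
    """Impute one missing genotype with minimal squared pairwise epistasis.

    For each pair of sites (i, j), the four genotypes obtained by flipping
    i and/or j form a square. Epistasis on that square is zero when
    f(missing) = f(flip i) + f(flip j) - f(flip both). Minimizing the sum
    of squared epistasis over all such squares gives the mean of these
    per-square additive predictions.
    """
    if len(missing) < 2:
        raise ValueError("need at least two sites to define epistasis")
    predictions = []
    for i, j in combinations(range(len(missing)), 2):
        one = flip(missing, i)
        other = flip(missing, j)
        both = flip(one, j)
        predictions.append(phenotypes[one] + phenotypes[other] - phenotypes[both])
    return sum(predictions) / len(predictions)


# Toy additive landscape on three sites with made-up effects 1, 2 and 3.
effects = (1, 2, 3)
landscape = {
    (a, b, c): a * effects[0] + b * effects[1] + c * effects[2]
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
    if (a, b, c) != (1, 1, 1)
}
imputed = impute_missing(landscape, (1, 1, 1))
```

On a perfectly additive landscape every square agrees and the imputed value equals the additive prediction; when the measured data contain epistasis, the squares disagree and the estimate becomes a compromise between them.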
359 downloads biochemistry
Metabolomics has started to embrace computational approaches for chemical interpretation of large data sets. Yet, metabolite annotation remains a key challenge. Recently, molecular networking and MS2LDA emerged as molecular mining tools that find molecular families and substructures in mass spectrometry fragmentation data. Moreover, in silico annotation tools obtain and rank candidate molecules for fragmentation spectra. Ideally, all structural information obtained and inferred from these computational tools could be combined to increase the resulting chemical insight one can obtain from a data set. However, integration is currently hampered as each tool has its own output format and efficient matching of data across these tools is lacking. Here, we introduce MolNetEnhancer, a workflow that combines the outputs from molecular networking, MS2LDA, in silico annotation tools (such as Network Annotation Propagation or DEREPLICATOR) and the automated chemical classification through ClassyFire to provide a more comprehensive chemical overview of metabolomics data whilst at the same time illuminating structural details for each fragmentation spectrum. We present examples from four plant and bacterial case studies and show how MolNetEnhancer enables the chemical annotation, visualization, and discovery of the subtle substructural diversity within molecular families. We conclude that MolNetEnhancer is a useful tool that greatly assists the metabolomics researcher in deciphering the metabolome through combination of multiple independent in silico pipelines.
358 downloads genomics
Sanja Vickovic, Goekcen Eraslan, Fredrik Salmen, Johanna Klughammer, Linnea Stenbeck, Tarmo Aijo, Richard Bonneau, Jose Fernandez Navarro, Ludvig Bergenstraahle, Joshua Gould, Mostafa Ronaghi, Jonas Frisen, Joakim Lundeberg, Aviv Regev, Patrik L Staahl
Tissue function relies on the precise spatial organization of cells characterized by distinct molecular profiles. Single-cell RNA-Seq captures molecular profiles but not spatial organization. Conversely, spatial profiling assays to date have lacked global transcriptome information, throughput or single-cell resolution. Here, we develop High-Density Spatial Transcriptomics (HDST), a method for RNA-Seq at high spatial resolution. Spatially barcoded reverse transcription oligonucleotides are coupled to beads that are randomly deposited into tightly packed individual microsized wells on a slide. The position of each bead is decoded with sequential hybridization using complementary oligonucleotides providing a unique bead-specific spatial address. We then capture, and spatially in situ barcode, RNA from the histological tissue sections placed on the HDST array. HDST recovers hundreds of thousands of transcript-coupled spatial barcodes per experiment at 2 μm resolution. We demonstrate HDST in the mouse brain, use it to resolve spatial expression patterns and cell types, and show how to combine it with histological stains to relate expression patterns to tissue architecture and anatomy. HDST opens the way to spatial analysis of tissues at high resolution.
- 21 May 2019: PLOS Biology has published a community page about Rxivist.org and its design.
- 10 May 2019: The paper analyzing the Rxivist dataset has been published at eLife.
- 1 Mar 2019: We now have summary statistics about bioRxiv downloads and submissions.
- 8 Feb 2019: Data from Altmetric is now available on the Rxivist details page for every preprint. Look for the "donut" under the download metrics.
- 30 Jan 2019: preLights has featured the Rxivist preprint and written about our findings.
- 22 Jan 2019: Nature just published an article about Rxivist and our data.
- 13 Jan 2019: The Rxivist preprint is live!