Most downloaded biology preprints, all time
in category synthetic biology
1,591 results found.
56,892 downloads bioRxiv synthetic biology
Integrating neurons into digital systems to leverage their innate intelligence may enable performance infeasible with silicon alone, along with providing insight into the cellular origin of intelligence. We developed DishBrain, a system which exhibits natural intelligence by harnessing the inherent adaptive computation of neurons in a structured environment. In vitro neural networks from human or rodent origins are integrated with in silico computing via a high-density multielectrode array. Through electrophysiological stimulation and recording, cultures were embedded in a simulated game-world, mimicking the arcade game 'Pong'. Applying a previously untestable theory of active inference via the Free Energy Principle, we found that learning was apparent within five minutes of real-time gameplay, which was not observed in control conditions. Further experiments demonstrate the importance of closed-loop structured feedback in eliciting learning over time. Cultures display the ability to self-organise in a goal-directed manner in response to sparse sensory information about the consequences of their actions.
37,602 downloads bioRxiv synthetic biology
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
21,235 downloads bioRxiv synthetic biology
Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth.
18,802 downloads bioRxiv synthetic biology
Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
17,789 downloads bioRxiv synthetic biology
Methods of altering wild populations are most useful when inherently limited to local geographic areas. Here we describe a novel form of gene drive based on the introduction of multiple copies of an engineered 'daisy' sequence into repeated elements of the genome. Each introduced copy encodes guide RNAs that target one or more engineered loci carrying the CRISPR nuclease gene and the desired traits. When organisms encoding a drive system are released into the environment, each generation of mating with wild-type organisms will reduce the average number of the guide RNA elements per 'daisyfield' organism by half, serving as a generational clock. The loci encoding the nuclease and payload will exhibit drive only as long as a single copy remains, placing an inherent limit on the extent of spread.
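The generational clock described in this abstract can be illustrated with a short arithmetic sketch. It assumes, as a simplification not stated in the abstract, that every mating is with a wild-type organism and that the copy number is tracked only as a deterministic average; real inheritance is stochastic. The function names and the starting copy number are hypothetical.

```python
def mean_guide_copies(initial_copies: float, generations: int) -> float:
    """Average number of guide RNA elements per organism after the given
    number of generations of mating with wild type (halved each generation)."""
    return initial_copies / (2 ** generations)


def generations_until_below_one(initial_copies: int) -> int:
    """First generation at which the average copy number drops below one.

    Assumption: drive is treated as active while the average is >= 1.
    This is a rough proxy for "as long as a single copy remains".
    """
    generation = 0
    while mean_guide_copies(initial_copies, generation) >= 1:
        generation += 1
    return generation


if __name__ == "__main__":
    start = 32  # hypothetical number of daisy elements at release
    for g in range(generations_until_below_one(start) + 1):
        print(g, mean_guide_copies(start, g))
```

Under these assumptions the number of generations with drive grows only logarithmically with the number of released elements, which is what makes the spread self-limiting.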
17,423 downloads bioRxiv synthetic biology
DNA is an attractive medium to store digital information. Here, we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14 × 10^6 bytes in DNA oligos and perfectly retrieved the information from a sequencing coverage equivalent of a single tile of Illumina sequencing. We also tested a process that can allow 2.18 × 10^15 retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes/gram of DNA, orders of magnitude higher than previous techniques.
13,132 downloads bioRxiv synthetic biology
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.
12,187 downloads bioRxiv synthetic biology
Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, avGFP and TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous multi-year, high-throughput efforts. Because it distills information from both global and local sequence landscapes, our model approximates protein function even before receiving experimental data, and generalizes from only single mutations to propose high-functioning epistatically non-trivial designs. With reproducible >500% improvements in activity from a single assay in a 96-well plate, we demonstrate the strongest generalization observed in machine-learning guided protein function optimization to date. Taken together, our approach enables efficient use of resource intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field, and clinic.
Competing Interest Statement: A full list of G.M.C.'s tech transfer, advisory roles, and funding sources can be found on the lab's website: http://arep.med.harvard.edu/gmc/tech.html. S.B. is employed by and holds equity in Nabla Bio, Inc. G.K. is employed by and holds equity in Telis Bioscience Inc.
11,659 downloads bioRxiv synthetic biology
Modeling the effect of sequence variation on function is a fundamental problem for understanding and designing proteins. Since evolution encodes information about function into patterns in protein sequences, unsupervised models of variant effects can be learned from sequence data. The approach to date has been to fit a model to a family of related sequences. The conventional setting is limited, since a new model must be trained for each prediction task. We show that using only zero-shot inference, without any supervision from experimental data or additional training, protein language models capture the functional effects of sequence variation, performing at state-of-the-art.
10,835 downloads bioRxiv synthetic biology
Therapeutic antibody optimization is time and resource intensive, largely because it requires low-throughput screening (10^3 variants) of full-length IgG in mammalian cells, typically resulting in only a few optimized leads. Here, we use deep learning to interrogate and predict antigen-specificity from a massively diverse sequence space to identify globally optimized antibody variants. Using a mammalian display platform and the therapeutic antibody trastuzumab, rationally designed site-directed mutagenesis libraries are introduced by CRISPR/Cas9-mediated homology-directed repair (HDR). Screening and deep sequencing of relatively small libraries (10^4) produced high quality data capable of training deep neural networks that accurately predict antigen-binding based on antibody sequence. Deep learning is then used to predict millions of antigen binders from an in silico library of ~10^8 variants, where experimental testing of 30 randomly selected variants showed all 30 retained antigen specificity. The full set of in silico predicted binders is then subjected to multiple developability filters, resulting in thousands of highly-optimized lead candidates. With its scalability and capacity to interrogate high-dimensional protein sequence space, deep learning offers great potential for antibody engineering and optimization.
10,705 downloads bioRxiv synthetic biology
Modern synthetic biology depends on the manufacture of large DNA constructs from libraries of genes, regulatory elements or other genetic parts. Type IIS restriction enzyme-dependent DNA assembly methods (e.g., Golden Gate) enable rapid one-pot, ordered, multi-fragment DNA assembly, facilitating the generation of high-complexity constructs. The order of assembly of genetic parts is determined by the ligation of flanking Watson-Crick base-paired overhangs. The ligation of mismatched overhangs leads to erroneous assembly, and the need to avoid such pairings has typically been met by using small sets of empirically vetted junction pairs, limiting the number of parts that can be joined in a single reaction. Here, we report the use of a comprehensive method for profiling end-joining ligation fidelity and bias to predict highly accurate sets of connections for ligation-based DNA assembly methods. This data set allows quantification of sequence-dependent ligation efficiency and identification of mismatch-prone pairings. The ligation profile accurately predicted junction fidelity in ten-fragment Golden Gate assembly reactions, and enabled efficient assembly of a lac cassette from up to 24 fragments in a single reaction. Application of the ligation fidelity profile to inform choice of junctions thus enables highly flexible assembly design, with >20 fragments in a single reaction.
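To make the overhang-pairing constraint in this abstract concrete, the sketch below screens a candidate set of junction overhangs for the most basic conflicts. It is a simplified illustration and not the authors' method: it flags only perfect Watson-Crick clashes (palindromic overhangs, duplicates, and pairs that are reverse complements of each other) and does not model the mismatch-prone pairings that the ligation fidelity data set quantifies. The function names and example overhangs are hypothetical.

```python
from itertools import combinations

# Translation table for DNA base complements.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_conflicts(overhangs):
    """List (overhang_a, overhang_b, reason) tuples for basic conflicts.

    Each junction is represented by one overhang; its partner end is
    assumed to carry the reverse complement.
    """
    seqs = [o.upper() for o in overhangs]
    conflicts = []
    for s in seqs:
        # A palindromic overhang can pair with another copy of itself.
        if s == reverse_complement(s):
            conflicts.append((s, s, "palindromic"))
    for a, b in combinations(seqs, 2):
        if a == b:
            conflicts.append((a, b, "duplicate"))
        elif a == reverse_complement(b):
            # Two junctions would expose identical sticky ends.
            conflicts.append((a, b, "reverse complements"))
    return conflicts


if __name__ == "__main__":
    print(find_conflicts(["AATG", "GGAG", "CGCT"]))  # no perfect-match conflicts
    print(find_conflicts(["AATG", "CATT", "ACGT"]))  # one clash, one palindrome
```

A set that passes this screen can still misassemble through mismatched ligation, which is the gap the profiling data described above is meant to address.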
8,862 downloads bioRxiv synthetic biology
Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.
8,675 downloads bioRxiv synthetic biology
Proteins, molecular machines that underpin all biological life, are of significant therapeutic and industrial value. Directed evolution is a high-throughput experimental approach for improving protein function, but has difficulty escaping local maxima in the fitness landscape. Here, we investigate how supervised learning in a closed loop with DNA synthesis and high-throughput screening can be used to improve protein design. Using the green fluorescent protein (GFP) as an illustrative example, we demonstrate the opportunities and challenges of generating training datasets conducive to selecting strongly generalizing models. With prospectively designed wet lab experiments, we then validate that these models can generalize to unseen regions of the fitness landscape, even when constrained to explore combinations of non-trivial mutations. Taken together, this suggests a hybrid optimization strategy for protein design in which a predictive model is used to explore difficult-to-access but promising regions of the fitness landscape that directed evolution can then exploit at scale.
8,073 downloads bioRxiv synthetic biology
Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Jan Zrimec, Simona Poviloniene, Irmantas Rokaitis, Audrius Laurynenas, Wissam Abuajwa, Otto Savolainen, Rolandas Meskys, Martin KM Engqvist, Aleksej Zelezniak
De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, mapping protein sequence to protein function is, however, neither computationally nor experimentally tangible. Here we developed ProteinGAN, a specialised variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase as a template enzyme, we show that 24% of the ProteinGAN-generated and experimentally tested sequences are soluble and display wild-type level catalytic activity in the tested conditions in vitro, even in highly mutated (>100 mutations) sequences. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.
8,009 downloads bioRxiv synthetic biology
Biology offers compelling proof that macroscopic "living materials" can emerge from reactions between diffusing biomolecules. Here, we show that molecular self-organization could be a similarly powerful approach for engineering functional synthetic materials. We introduce a programmable DNA-hydrogel that produces tunable patterns at the centimeter length scale. We generate these patterns by implementing chemical reaction networks through synthetic DNA complexes, embedding the complexes in hydrogel, and triggering with locally applied input DNA strands. We first demonstrate ring pattern formation around a circular input cavity and show that the ring width and intensity can be predictably tuned. Then, we create patterns of increasing complexity, including concentric rings and non-isotropic patterns. Finally, we show "destructive" and "constructive" interference patterns, by combining several ring-forming modules in the gel and triggering them from multiple sources. We further show that computer simulations based on the reaction-diffusion model can predict and inform the programming of target patterns.
7,833 downloads bioRxiv synthetic biology
Alejandro Chavez, Jonathan Scheiman, Suhani Vora, Benjamin W Pruitt, Marcelle Tuttle, Eswar Iyer, Samira Kiani, Christopher D Guzman, Daniel J. Wiegand, Dimtry Ter-Ovanesyan, Jonathan L Braff, Noah Davidsohn, Ron Weiss, John Aach, James J. Collins, George M Church
The RNA-guided bacterial nuclease Cas9 can be reengineered as a programmable transcription factor by a series of changes to the Cas9 protein in addition to the fusion of a transcriptional activation domain (AD). However, the modest levels of gene activation achieved by current Cas9 activators have limited their potential applications. Here we describe the development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9. We demonstrate its utility in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).
7,828 downloads bioRxiv synthetic biology
Longxing Cao, Brian Coventry, Inna Goreshnik, Buwei Huang, Joon Sung Park, Kevin M Jude, Iva Markovic, Rameshwar U. Kadam, Koen H.G. Verschueren, Kenneth Verstraete, Scott Thomas Russell Walsh, Nathaniel Bennett, Ashish Phal, Aerin Yang, Lisa Kozodoy, Michelle DeWitt, Lora Picton, Lauren Miller, Eva-Maria Strauch, Samer Halabiya, Bradley Hammerson, Wei Yang, Steffen Benard, Lance Stewart, Ian A. Wilson, Hannele Ruohola-Baker, Joseph Schlessinger, Sangwon Lee, Savvas Savvides, K. Christopher N. Garcia, David Baker
The design of proteins that bind to a specific site on the surface of a target protein using no information other than the three-dimensional structure of the target remains an outstanding challenge. We describe a general solution to this problem which starts with a broad exploration of the very large space of possible binding modes and interactions, and then intensifies the search in the most promising regions. We demonstrate its very broad applicability by de novo design of binding proteins to 12 diverse protein targets with very different shapes and surface properties. Biophysical characterization shows that the binders, which are all smaller than 65 amino acids, are hyperstable and bind their targets with nanomolar to picomolar affinities. We succeeded in solving crystal structures of four of the binder-target complexes, and all four are very close to the corresponding computational design models. Experimental data on nearly half a million computational designs and hundreds of thousands of point mutants provide detailed feedback on the strengths and limitations of the method and of our current understanding of protein-protein interactions, and should guide improvement of both. Our approach now enables targeted design of binders to sites of interest on a wide variety of proteins for therapeutic and diagnostic applications.
7,760 downloads bioRxiv synthetic biology
Inheritance-biasing “gene drives” may be capable of spreading genomic alterations made in laboratory organisms through wild populations. We previously considered the potential for RNA-guided gene drives based on the versatile CRISPR/Cas9 genome editing system to serve as a general method of altering populations. Here we report molecularly contained gene drive constructs in the yeast Saccharomyces cerevisiae that are typically copied at rates above 99% when mated to wild yeast. We successfully targeted both non-essential and essential genes, showed that the inheritance of an unrelated “cargo” gene could be biased by an adjacent drive, and constructed a drive capable of overwriting and reversing changes made by a previous drive. Our results demonstrate that RNA-guided gene drives are capable of efficiently biasing inheritance when mated to wild-type organisms over successive generations.
6,968 downloads bioRxiv synthetic biology
Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a potential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state-of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.
6,362 downloads bioRxiv synthetic biology
We present here an approach for engineering evolving DNA barcodes in living cells. The methodology entails using a homing guide RNA (hgRNA) scaffold that directs the Cas9-hgRNA complex to target the DNA locus of the hgRNA itself. We show that this homing CRISPR-Cas9 system acts as an expressed genetic barcode that diversifies its sequence and that the rate of diversification can be controlled in cultured cells. We further evaluate these barcodes in cultured cell populations and show that they can record lineage history and that their RNA can be assayed as single molecules in situ. This integrated approach will have wide-ranging applications, such as in deep lineage tracing, cellular barcoding, molecular recording, dissecting cancer biology, and connectome mapping.