Most downloaded biology preprints, all time
in category synthetic biology
1,135 results found.
27,231 downloads bioRxiv synthetic biology
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
17,128 downloads bioRxiv synthetic biology
Methods of altering wild populations are most useful when inherently limited to local geographic areas. Here we describe a novel form of gene drive based on the introduction of multiple copies of an engineered 'daisy' sequence into repeated elements of the genome. Each introduced copy encodes guide RNAs that target one or more engineered loci carrying the CRISPR nuclease gene and the desired traits. When organisms encoding a drive system are released into the environment, each generation of mating with wild-type organisms will reduce the average number of the guide RNA elements per 'daisyfield' organism by half, serving as a generational clock. The loci encoding the nuclease and payload will exhibit drive only as long as a single copy remains, placing an inherent limit on the extent of spread.
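The "generational clock" in the abstract above can be illustrated with a small deterministic sketch. It assumes only what the abstract states, that each generation of mating with wild-type organisms halves the average number of guide RNA elements. The function names and the starting copy number of 32 are hypothetical, and the sketch ignores stochastic segregation, fitness effects, and the drive dynamics themselves.

```python
def expected_copies(initial_copies: float, generations: int) -> float:
    """Average guide RNA element count after repeated outcrossing.

    Assumes a strict halving per generation of mating with wild type,
    as described in the abstract; real inheritance is stochastic.
    """
    return initial_copies / (2 ** generations)


def generations_until_below_one(initial_copies: float) -> int:
    """First generation at which the average copy number drops below one."""
    generation = 0
    copies = float(initial_copies)
    while copies >= 1.0:
        copies /= 2.0  # one round of mating with wild-type organisms
        generation += 1
    return generation


if __name__ == "__main__":
    start = 32  # hypothetical number of introduced elements
    for g in range(generations_until_below_one(start) + 1):
        print(g, expected_copies(start, g))
```

Under this simplification the clock length grows only logarithmically with the number of introduced copies, so doubling the starting copy number adds a single generation.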
15,194 downloads bioRxiv synthetic biology
DNA is an attractive medium to store digital information. Here, we report a storage strategy, called DNA Fountain, that is highly robust and approaches the information capacity per nucleotide. Using our approach, we stored a full computer operating system, movie, and other files with a total of 2.14x10^6 bytes in DNA oligos and perfectly retrieved the information from a sequencing coverage equivalent of a single tile of Illumina sequencing. We also tested a process that can allow 2.18x10^15 retrievals using the original DNA sample and were able to perfectly decode the data. Finally, we explored the limit of our architecture in terms of bytes per molecule and obtained a perfect retrieval from a density of 215 petabytes/gram of DNA, orders of magnitude higher than previous techniques.
11,741 downloads bioRxiv synthetic biology
Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.
9,652 downloads bioRxiv synthetic biology
Modern synthetic biology depends on the manufacture of large DNA constructs from libraries of genes, regulatory elements or other genetic parts. Type IIS-restriction enzyme-dependent DNA assembly methods (e.g., Golden Gate) enable rapid one-pot, ordered, multi-fragment DNA assembly, facilitating the generation of high-complexity constructs. The order of assembly of genetic parts is determined by the ligation of flanking Watson-Crick base-paired overhangs. The ligation of mismatched overhangs leads to erroneous assembly, and the need to avoid such pairings has typically been accomplished by using small sets of empirically vetted junction pairs, limiting the number of parts that can be joined in a single reaction. Here, we report the use of a comprehensive method for profiling end-joining ligation fidelity and bias to predict highly accurate sets of connections for ligation-based DNA assembly methods. This data set allows quantification of sequence-dependent ligation efficiency and identification of mismatch-prone pairings. The ligation profile accurately predicted junction fidelity in ten-fragment Golden Gate assembly reactions, and enabled efficient assembly of a lac cassette from up to 24 fragments in a single reaction. Application of the ligation fidelity profile to inform choice of junctions thus enables highly flexible assembly design, with >20 fragments in a single reaction.
9,563 downloads bioRxiv synthetic biology
Protein engineering has enormous academic and industrial potential. However, it is limited by the lack of experimental assays that are consistent with the design goal and sufficiently high-throughput to find rare, enhanced variants. Here we introduce a machine learning-guided paradigm that can use as few as 24 functionally assayed mutant sequences to build an accurate virtual fitness landscape and screen ten million sequences via in silico directed evolution. As demonstrated in two highly dissimilar proteins, avGFP and TEM-1 β-lactamase, top candidates from a single round are diverse and as active as engineered mutants obtained from previous multi-year, high-throughput efforts. Because it distills information from both global and local sequence landscapes, our model approximates protein function even before receiving experimental data, and generalizes from only single mutations to propose high-functioning epistatically non-trivial designs. With reproducible >500% improvements in activity from a single assay in a 96-well plate, we demonstrate the strongest generalization observed in machine-learning guided protein function optimization to date. Taken together, our approach enables efficient use of resource intensive high-fidelity assays without sacrificing throughput, and helps to accelerate engineered proteins into the fermenter, field, and clinic.
Competing Interest Statement: A full list of G.M.C.'s tech transfer, advisory roles, and funding sources can be found on the lab's website: http://arep.med.harvard.edu/gmc/tech.html. S.B. is employed by and holds equity in Nabla Bio, Inc. G.K. is employed by and holds equity in Telis Bioscience Inc.
8,570 downloads bioRxiv synthetic biology
Therapeutic antibody optimization is time and resource intensive, largely because it requires low-throughput screening (10^3 variants) of full-length IgG in mammalian cells, typically resulting in only a few optimized leads. Here, we use deep learning to interrogate and predict antigen-specificity from a massively diverse sequence space to identify globally optimized antibody variants. Using a mammalian display platform and the therapeutic antibody trastuzumab, rationally designed site-directed mutagenesis libraries are introduced by CRISPR/Cas9-mediated homology-directed repair (HDR). Screening and deep sequencing of relatively small libraries (10^4) produced high quality data capable of training deep neural networks that accurately predict antigen-binding based on antibody sequence. Deep learning is then used to predict millions of antigen binders from an in silico library of ~10^8 variants, where experimental testing of 30 randomly selected variants showed all 30 retained antigen specificity. The full set of in silico predicted binders is then subjected to multiple developability filters, resulting in thousands of highly-optimized lead candidates. With its scalability and capacity to interrogate high-dimensional protein sequence space, deep learning offers great potential for antibody engineering and optimization.
7,905 downloads bioRxiv synthetic biology
Biology offers compelling proof that macroscopic "living materials" can emerge from reactions between diffusing biomolecules. Here, we show that molecular self-organization could be a similarly powerful approach for engineering functional synthetic materials. We introduce a programmable DNA-hydrogel that produces tunable patterns at the centimeter length scale. We generate these patterns by implementing chemical reaction networks through synthetic DNA complexes, embedding the complexes in hydrogel, and triggering with locally applied input DNA strands. We first demonstrate ring pattern formation around a circular input cavity and show that the ring width and intensity can be predictably tuned. Then, we create patterns of increasing complexity, including concentric rings and non-isotropic patterns. Finally, we show "destructive" and "constructive" interference patterns, by combining several ring-forming modules in the gel and triggering them from multiple sources. We further show that computer simulations based on the reaction-diffusion model can predict and inform the programming of target patterns.
7,507 downloads bioRxiv synthetic biology
Proteins, the molecular machines that underpin all biological life, are of significant therapeutic and industrial value. Directed evolution is a high-throughput experimental approach for improving protein function, but has difficulty escaping local maxima in the fitness landscape. Here, we investigate how supervised learning in a closed loop with DNA synthesis and high-throughput screening can be used to improve protein design. Using the green fluorescent protein (GFP) as an illustrative example, we demonstrate the opportunities and challenges of generating training datasets conducive to selecting strongly generalizing models. With prospectively designed wet lab experiments, we then validate that these models can generalize to unseen regions of the fitness landscape, even when constrained to explore combinations of non-trivial mutations. Taken together, this suggests a hybrid optimization strategy for protein design in which a predictive model is used to explore difficult-to-access but promising regions of the fitness landscape that directed evolution can then exploit at scale.
7,449 downloads bioRxiv synthetic biology
Inheritance-biasing “gene drives” may be capable of spreading genomic alterations made in laboratory organisms through wild populations. We previously considered the potential for RNA-guided gene drives based on the versatile CRISPR/Cas9 genome editing system to serve as a general method of altering populations. Here we report molecularly contained gene drive constructs in the yeast Saccharomyces cerevisiae that are typically copied at rates above 99% when mated to wild yeast. We successfully targeted both non-essential and essential genes, showed that the inheritance of an unrelated “cargo” gene could be biased by an adjacent drive, and constructed a drive capable of overwriting and reversing changes made by a previous drive. Our results demonstrate that RNA-guided gene drives are capable of efficiently biasing inheritance when mated to wild-type organisms over successive generations.
6,884 downloads bioRxiv synthetic biology
Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.
6,560 downloads bioRxiv synthetic biology
Alejandro Chavez, Jonathan Scheiman, Suhani Vora, Benjamin W Pruitt, Marcelle Tuttle, Eswar Iyer, Samira Kiani, Christopher D Guzman, Daniel J. Wiegand, Dmitry Ter-Ovanesyan, Jonathan L Braff, Noah Davidsohn, Ron Weiss, John Aach, James J. Collins, George Church
The RNA-guided bacterial nuclease Cas9 can be reengineered as a programmable transcription factor by a series of changes to the Cas9 protein in addition to the fusion of a transcriptional activation domain (AD). However, the modest levels of gene activation achieved by current Cas9 activators have limited their potential applications. Here we describe the development of an improved transcriptional regulator through the rational design of a tripartite activator, VP64-p65-Rta (VPR), fused to Cas9. We demonstrate its utility in activating expression of endogenous coding and non-coding genes, targeting several genes simultaneously and stimulating neuronal differentiation of induced pluripotent stem cells (iPSCs).
6,405 downloads bioRxiv synthetic biology
Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ∼280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.
6,183 downloads bioRxiv synthetic biology
We present here an approach for engineering evolving DNA barcodes in living cells. The methodology entails using a homing guide RNA (hgRNA) scaffold that directs the Cas9-hgRNA complex to target the DNA locus of the hgRNA itself. We show that this homing CRISPR-Cas9 system acts as an expressed genetic barcode that diversifies its sequence and that the rate of diversification can be controlled in cultured cells. We further evaluate these barcodes in cultured cell populations and show that they can record lineage history and that their RNA can be assayed as single molecules in situ. This integrated approach will have wide ranging applications, such as in deep lineage tracing, cellular barcoding, molecular recording, dissecting cancer biology, and connectome mapping.
6,119 downloads bioRxiv synthetic biology
To extend the frontier of genome editing and enable the radical redesign of mammalian genomes, we developed a set of dead-Cas9 base editor (dBE) variants that allow editing at tens of thousands of loci per cell by overcoming the cell death associated with DNA double-strand breaks (DSBs) and single-strand breaks (SSBs). We used a set of gRNAs targeting repetitive elements – ranging in target copy number from about 31 to 124,000 per cell. dBEs enabled survival after large-scale base editing, allowing targeted mutations at up to ~13,200 and ~2610 loci in 293T and human induced pluripotent stem cells (hiPSCs), respectively, three orders of magnitude greater than previously recorded. These dBEs can overcome current on-target mutation and toxicity barriers that prevent cell survival after large-scale genome engineering.
One Sentence Summary: Base editing with reduced DNA nicking allows for the simultaneous editing of >10,000 loci in human cells.
5,792 downloads bioRxiv synthetic biology
Donatas Repecka, Vykintas Jauniskis, Laurynas Karpus, Elzbieta Rembeza, Jan Zrimec, Simona Poviloniene, Irmantas Rokaitis, Audrius Laurynenas, Wissam Abuajwa, Otto Savolainen, Rolandas Meskys, Martin KM Engqvist, Aleksej Zelezniak
De novo protein design for catalysis of any desired chemical reaction is a long-standing goal in protein engineering, due to the broad spectrum of technological, scientific and medical applications. Currently, however, mapping protein sequence to protein function is neither computationally nor experimentally tangible. Here we developed ProteinGAN, a specialised variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. ProteinGAN learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse sequence variants with natural-like physical properties. Using malate dehydrogenase as a template enzyme, we show that 24% of the ProteinGAN-generated and experimentally tested sequences are soluble and display wild-type level catalytic activity in the tested conditions in vitro, even in highly mutated (>100 mutations) sequences. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.
5,405 downloads bioRxiv synthetic biology
RNA-guided gene drive elements could address many ecological problems by altering the traits of wild organisms, but the likelihood of global spread tremendously complicates ethical development and use. Here we detail a localized form of CRISPR-based gene drive composed of genetic elements arranged in a daisy-chain such that each element drives the next. "Daisy drive" systems can duplicate any effect achievable using an equivalent global drive system, but their capacity to spread is limited by the successive loss of non-driving elements from the base of the chain. Releasing daisy drive organisms constituting a small fraction of the local wild population can drive a useful genetic element to local fixation for a wide range of fitness parameters without resulting in global spread. We additionally report numerous highly active guide RNA sequences sharing minimal homology that may enable evolutionary stable daisy drive as well as global CRISPR-based gene drive. Daisy drives could simplify decision-making and promote ethical use by enabling local communities to decide whether, when, and how to alter local ecosystems.
5,345 downloads bioRxiv synthetic biology
Multicellular development depends on the differentiation of cells into specific fates with precise spatial organization. Lineage history plays a pivotal role in cell fate decisions, but is inaccessible in most contexts. Engineering cells to actively record lineage information in a format readable in situ would provide a spatially resolved view of lineage in diverse developmental processes. Here, we introduce a serine integrase-based recording system that allows in situ readout, and demonstrate its ability to reconstruct lineage relationships in cultured stem cells and flies. The system, termed intMEMOIR, employs an array of independent three-state genetic memory elements that can recombine stochastically and irreversibly, allowing up to 59,049 distinct digital states. intMEMOIR accurately reconstructed lineage trees in stem cells and enabled simultaneous analysis of single cell clonal history, spatial position, and gene expression in Drosophila brain sections. These results establish a foundation for microscopy-readable clonal analysis and recording in diverse systems.
Competing Interest Statement: K.F., K.K.C., L.C., and M.B.E. are inventors on a patent application for recording technologies.
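The figure of 59,049 distinct states in the intMEMOIR abstract above follows from simple counting: n independent elements with three states each give 3^n joint states, and 3^10 = 59,049. The sketch below checks this arithmetic. The element count of ten is inferred from the number rather than stated in the abstract, and the function names are hypothetical.

```python
def distinct_states(num_elements: int, states_per_element: int = 3) -> int:
    """Joint state count for independent memory elements."""
    return states_per_element ** num_elements


def elements_needed(target_states: int, states_per_element: int = 3) -> int:
    """Smallest number of elements whose joint state count reaches the target."""
    count = 0
    total = 1
    while total < target_states:
        total *= states_per_element
        count += 1
    return count


if __name__ == "__main__":
    # Ten three-state elements is an inferred assumption, not a quoted figure.
    print(distinct_states(10))
    print(elements_needed(59_049))
```

The count is an upper bound on the states the array can represent; because recombination is irreversible, the states actually reachable from a given cell's current state shrink as elements are edited.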
5,208 downloads bioRxiv synthetic biology
DNA is an emerging storage medium for digital data but its adoption is hampered by limitations of phosphoramidite chemistry, which was developed for single-base accuracy required for biological functionality. Here, we establish a de novo enzymatic DNA synthesis strategy designed from the bottom up for information storage. We harness a template-independent DNA polymerase for controlled synthesis of sequences with user-defined information content. We demonstrate retrieval of 144 bits, including addressing, from perfectly synthesized DNA strands using batch-processed Illumina and real-time Oxford Nanopore sequencing. We then develop a codec for data retrieval from populations of diverse but imperfectly synthesized DNA strands, each with a ~30% error tolerance. With this codec, we experimentally validate a kilobyte-scale design which stores 1 bit per nucleotide. Simulations of the codec support reliable and robust storage of information for large-scale systems. This work paves the way for alternative synthesis and sequencing strategies to advance information storage in DNA.
4,949 downloads bioRxiv synthetic biology
Akin to Zinc Finger and Transcription Activator Like Effector based transcriptional modulators, nuclease-null CRISPR-Cas9 provides a groundbreaking programmable DNA binding platform, begetting an arsenal of targetable regulators for transcriptional and epigenetic perturbation, by either directly tethering, or recruiting, transcription enhancing effectors to either component of the Cas9/guide RNA complex. Application of these programmable regulators is now gaining traction for the modulation of disease-causing genes or activation of therapeutic genes, in vivo. Adeno-Associated Virus (AAV) is an optimal delivery vehicle for in vivo delivery of such regulators to adult somatic tissue, due to the efficacy of viral delivery with minimal concerns about immunogenicity or integration. However, present Cas9 activator systems are notably beyond the packaging capacity of a single AAV delivery vector capsid. Here, we engineer a compact CRISPR-Cas9 activator for convenient AAV-mediated delivery. We validate efficacy of the CRISPR-Cas9 transcriptional activation using AAV delivery in several cell lines.