The structural organization of the genome plays an important role in multiple aspects of genome function. Understanding how genomic sequence influences 3D organization can help elucidate the role of genome structure in various processes in healthy and disease states. However, the sequence determinants of genome structure across multiple spatial scales are still not well understood. To learn the complex sequence dependencies of multiscale genome architecture, here we developed a sequence-based deep learning approach, Orca, that predicts genome 3D architecture from kilobase to whole-chromosome scale, covering structures including chromatin compartments and topologically associating domains. Orca makes both intrachromosomal and interchromosomal predictions and captures the sequence dependencies of diverse types of interactions, including CTCF-mediated, enhancer-promoter, and Polycomb-mediated interactions. Orca enables the interpretation of the effects of any structural variant of any size on multiscale genome organization and provides an in silico model to help study the sequence-dependent mechanistic basis of genome architecture. We show that the models accurately recapitulate the effects of experimentally studied structural variants of varying sizes (300 bp to 80 Mb) using only sequence. Furthermore, these sequence models enable in silico virtual screens to probe the sequence basis of genome 3D organization at different scales. At the submegabase scale, the models predicted specific transcription factor motifs underlying cell-type-specific genome interactions.
At the compartment scale, based on virtual screens of sequence activities, we propose a new model for the sequence basis of chromatin compartments: sequences at active transcription start sites are primarily responsible for establishing the expression-active compartment A, while the inactive compartment B typically requires extended stretches of AT-rich sequences (at least 6-12 kb) and can form 'passively' without depending on any particular sequence pattern. Orca thus effectively provides an 'in silico genome observatory' to predict variant effects on genome structure and probe the sequence-based mechanisms of genome organization.
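
To make the idea of scoring structural variants "using only sequence" concrete, the following is a minimal sketch of how a deletion, inversion, or tandem duplication might be applied to a reference sequence before the edited sequence is passed to a sequence-based predictor. It is not Orca's code or interface: the function names, the variant labels, and the 0-based half-open coordinate convention are all assumptions made for illustration, and the prediction step itself is not shown.

```python
# Illustrative sketch only. All names here are hypothetical and are not
# taken from the Orca codebase.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_structural_variant(seq: str, start: int, end: int, kind: str) -> str:
    """Apply a simple structural variant to seq over [start, end).

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Supported kinds: "deletion", "inversion", "duplication" (tandem).
    """
    if not 0 <= start < end <= len(seq):
        raise ValueError("start/end must satisfy 0 <= start < end <= len(seq)")

    left, middle, right = seq[:start], seq[start:end], seq[end:]

    if kind == "deletion":
        return left + right
    if kind == "inversion":
        # An inverted segment reads as its reverse complement on the same strand.
        return left + reverse_complement(middle) + right
    if kind == "duplication":
        return left + middle + middle + right
    raise ValueError(f"unsupported variant kind: {kind!r}")


reference = "ACGTTTGACCAA"
variants = {
    kind: apply_structural_variant(reference, 4, 8, kind)
    for kind in ("deletion", "inversion", "duplication")
}
```

In a workflow of this kind, the reference and edited sequences would each be scored by the model and the two predicted interaction maps compared. Because the edit happens purely at the sequence level, the same procedure applies whether the variant spans hundreds of bases or many megabases.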