COPILOT: a Containerised wOrkflow for Processing ILlumina genOtyping daTa

By Hamel Patel, Sang-Hyuck Lee, Gerome Breen, Stephen Menzel, Oyesola Ojewunmi, Richard Dobson

Posted 26 Jul 2021
bioRxiv DOI: 10.1101/2021.07.26.453753

Background: Illumina genotyping microarrays generate data in image format, which is processed by the platform-specific software GenomeStudio and then passed through an array of complex bioinformatics analyses. This process can be time-consuming, is prone to reproducibility errors, and can be daunting for novice bioinformaticians.

Results: Here we introduce the COPILOT (Containerised wOrkflow for Processing ILlumina genOtyping daTa) protocol, which provides a clear, in-depth guide to processing raw Illumina genotype data in GenomeStudio, followed by a containerised workflow that automates the array of complex bioinformatics analyses involved in GWAS quality control (QC). The COPILOT protocol was applied to two independent cohorts of 2791 and 479 samples, genotyped on the Infinium Global Screening Array (GSA) with Multi-disease (MD) drop-in (~750,000 markers) and the Infinium H3Africa consortium array (~2,200,000 markers), respectively. Following the COPILOT protocol, sample call rates improved by an average of 1.24%, with notable improvement for low-quality samples. For example, of the 3270 samples processed, 141 had an initial sample call rate below 98%, averaging 96.6% (95% CI 95.6-97.7%), which is below the acceptable sample call rate threshold for a typical GWAS analysis. After QC with the COPILOT protocol, all 141 samples had a call rate above 98%, averaging 99.6% (95% CI 99.5-99.7%). In addition, the COPILOT pipeline automatically identified potential data issues, including gender discrepancies, heterozygosity outliers, related individuals, and, through ancestry estimation, population outliers.

Conclusions: The COPILOT protocol makes processing Illumina genotyping data transparent, effortless and reproducible. The container is deployable on multiple platforms and improves data quality, and its end product is analysis-ready PLINK-formatted data with a comprehensive, interactive summary report to guide the user in further data analyses.
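
To make the sample call rate figures above concrete, the following is a minimal Python sketch of how a per-sample call rate can be computed and compared against the 98% threshold mentioned in the abstract. It is not taken from the COPILOT pipeline: the function names, the in-memory data layout and the toy genotypes are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical layout: sample ID -> one genotype call per marker,
# with None standing in for a missing call (no-call).
Genotypes = Dict[str, List[Optional[str]]]


def sample_call_rate(calls: List[Optional[str]]) -> float:
    """Fraction of markers with a non-missing genotype call for one sample."""
    if not calls:
        raise ValueError("sample has no markers")
    called = sum(1 for genotype in calls if genotype is not None)
    return called / len(calls)


def flag_low_call_rate(genotypes: Genotypes, threshold: float = 0.98) -> List[str]:
    """Return the sorted IDs of samples whose call rate is below the threshold.

    The 0.98 default mirrors the 98% figure quoted in the abstract; it is an
    illustrative default, not a setting read from COPILOT.
    """
    return sorted(
        sample_id
        for sample_id, calls in genotypes.items()
        if sample_call_rate(calls) < threshold
    )


if __name__ == "__main__":
    # Toy data: 100 markers per sample, invented for this example.
    toy: Genotypes = {
        "S1": ["AA"] * 97 + [None] * 3,  # 97 of 100 markers called
        "S2": ["AB"] * 99 + [None],      # 99 of 100 markers called
    }
    for sample_id in flag_low_call_rate(toy):
        print(f"{sample_id} is below the call rate threshold")
```

In practice, call rates are computed on real genotype files by dedicated tools rather than on in-memory lists; the sketch only shows the arithmetic behind the threshold.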

Download data

  • Downloaded 157 times
  • Download rankings, all-time:
    • Site-wide: 157,621
    • In genomics: 7,864
  • Year to date:
    • Site-wide: 41,201
  • Since beginning of last month:
    • Site-wide: 58,796