Genotyped SNPs in UK Biobank failing Hardy-Weinberg equilibrium test
Testing SNPs for deviation from Hardy-Weinberg equilibrium (HWE) is a standard method for detecting potential genotyping errors. In the rapid-GWAS marker QC, we observed 44,184 genotyped autosomal variants with HWE p-value < 10^-12 when computed using genotypes of 361,194 white British individuals. Of these, 15,069 genotyped variants are retained in the imputed bgen files with INFO = 1 and HWE p-value < 10^-12. Of particular concern, 3,987 SNPs have no homozygous alternative genotypes despite having MAF > 1%.
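To give a sense of why zero homozygous alternative genotypes at MAF > 1% is so unexpected, here is a minimal Python sketch. It assumes the alternative allele is the minor allele with frequency exactly 1%, and it uses a Poisson approximation for the homozygote count; both are illustrative assumptions, not part of the rapid-GWAS QC.

```python
import math

def expected_hom_alt(n_samples: int, alt_freq: float) -> float:
    """Expected homozygous-alternative count under HWE: n * q^2."""
    return n_samples * alt_freq ** 2

n_samples = 361_194   # white British individuals used in the QC
alt_freq = 0.01       # assumed: alternative allele frequency at the 1% boundary

expected = expected_hom_alt(n_samples, alt_freq)

# Rough Poisson approximation (an assumption of this sketch):
# P(no homozygous-alternative genotypes) ~ exp(-expected)
p_zero = math.exp(-expected)

print(f"expected homozygous alternative genotypes: {expected:.1f}")
print(f"approximate probability of observing none: {p_zero:.1e}")
```

Under these assumptions, dozens of homozygous alternative genotypes would be expected even at the 1% boundary, so observing none is very unlikely for a well-genotyped variant in equilibrium.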
One hypothesis for why these SNPs were not filtered is that the QC criteria relied on a per-batch HWE test. According to Supplementary Note S2.3 of Bycroft, C., et al. (Nature, 2018), UK Biobank applied marker-based QC to the raw genotypes using a per-batch approach, where each batch contains up to ~4,000 samples (after restricting to 463,844 ancestrally homogeneous individuals). Within each batch, SNPs with HWE p-value < 10^-12 were set to missing for samples in that batch. However, if no subsequent HWE test across the full sample was performed after the per-batch QC, this could explain the high number of HWE-failing SNPs seen in the genotype data. For example, rs2237897 (MAF = 4.2% in gnomAD non-Finnish European) showed HWE p-value = 8.1 × 10^-233 (counts of homozygous alternative / heterozygous / homozygous reference: 0 / 26,777 / 321,577; missing: 12,840). However, only a single batch out of 106 failed the per-batch HWE test (p < 10^-12). As a result, rs2237897 in the imputed bgen shows INFO = 1 and HWE p-value = 6.3 × 10^-134. (We note that, since missing genotypes were imputed for genotyped variants, HWE p-values could differ from those calculated using only non-missing genotypes, and the INFO score could be less than 1.) Finally, the SNP intensity plot for rs2237897 in UK Biobank below confirms that the deviation from HWE is indeed a result of poor genotyping quality.
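The loss of power in a per-batch test can be illustrated numerically with the rs2237897 counts quoted above. The sketch below uses a 1-degree-of-freedom chi-square HWE test, which is not necessarily the test used to produce the p-values reported here, so its output will not match them. The "batch" is hypothetical: it simply assumes the pooled counts were spread evenly over 106 batches.

```python
import math

def hwe_chisq(n_hom_ref: int, n_het: int, n_hom_alt: int) -> tuple[float, float]:
    """Chi-square HWE test (1 df); returns (statistic, p-value)."""
    n = n_hom_ref + n_het + n_hom_alt
    q = (n_het + 2 * n_hom_alt) / (2 * n)   # alternative allele frequency
    p = 1.0 - q
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Pooled non-missing counts for rs2237897 quoted in the text
pooled_chi2, pooled_p = hwe_chisq(n_hom_ref=321_577, n_het=26_777, n_hom_alt=0)

# Hypothetical batch: pooled counts divided evenly over 106 batches (assumption)
n_batches = 106
batch_chi2, batch_p = hwe_chisq(
    n_hom_ref=round(321_577 / n_batches),
    n_het=round(26_777 / n_batches),
    n_hom_alt=0,
)

threshold = 1e-12
print(f"pooled: chi2 = {pooled_chi2:.1f}, fails threshold: {pooled_p < threshold}")
print(f"batch:  chi2 = {batch_chi2:.2f}, fails threshold: {batch_p < threshold}")
```

With these inputs, the pooled test falls far below the 10^-12 threshold while the hypothetical batch does not. This is consistent with the hypothesis that a variant can pass nearly every per-batch test yet fail clearly in the full sample.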
Retaining these SNPs could not only degrade imputation quality at the surrounding loci but also complicate the interpretation of associations around these HWE-failing variants, particularly in downstream analyses such as fine-mapping. While we excluded all markers with HWE p-value ≤ 10^-10 from the rapid-GWAS results (except for those annotated as protein-coding by VEP), we recommend that users re-evaluate HWE when they interpret GWAS signals from individual loci.
Authored by Masahiro Kanai, with input from Daniel Howrigan, Mark Daly, and Hilary Finucane