1.14 Billion Rows of Psychiatric Genetics Data, Now One Line of Python

By Prahlad Menon

The old workflow was miserable. You’d navigate the PGC download portal, track down the right study, fire off a wget, wait on a slow academic server, gunzip the result, then discover the column separator was a space in some files and a tab in others — and that was just dataset one. Multiply by 52 publications across 12 disorder groups and you had a weekend-long data wrangling session before you’d even looked at a p-value.

That friction is gone. OpenMed has published the full PGC Psychiatric GWAS Summary Statistics collection on HuggingFace as clean, standardized Parquet. All 1.14 billion rows. One call to load_dataset().

What’s in the collection

The Psychiatric Genomics Consortium (PGC) runs the world’s largest collaborative psychiatric genetics studies. Their GWAS summary statistics — the distilled output of those studies — encode associations between millions of genetic variants (SNPs) and disorder risk. Each row carries the SNP identifier, chromosome, position, effect and reference alleles, effect size (beta or odds ratio), standard error, p-value, and sample size.
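Concretely, a single summary-statistics row looks something like the sketch below. The field names and values here are illustrative assumptions, not the collection's actual schema; check the dataset card for the real column names.

```python
# One representative GWAS summary-statistics row (field names are assumptions;
# values are invented for illustration — consult the dataset card for the schema)
row = {
    "snp": "rs12345",          # SNP identifier
    "chr": "6",                # chromosome
    "pos": 32_600_000,         # base-pair position
    "effect_allele": "A",
    "other_allele": "G",
    "beta": 0.042,             # effect size (log-odds for case/control studies)
    "se": 0.007,               # standard error of the effect
    "p": 3.1e-9,               # association p-value
    "n": 150_000,              # sample size
}

# By convention, a variant is "genome-wide significant" when p < 5e-8
print(row["p"] < 5e-8)
```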

The OpenMed collection spans 12 disorder groups across 52 publications:

Dataset                  Rows     Configs
pgc-substance-use        214M     21
pgc-mdd                  179M     38
pgc-ptsd                 128M     16
pgc-schizophrenia        91.4M    26
pgc-bipolar              74.4M    21
pgc-cross-disorder       63.3M    19
pgc-adhd                 31.2M    20
pgc-anxiety              27.5M    17
pgc-ocd-tourette         36.5M    17
pgc-autism               18.6M    14
pgc-eating-disorders     10.6M    15
pgc-other                40.9M    28

The Parquet format matters beyond just convenience. Columnar storage means you can filter to a specific chromosome or SNP range without reading the whole dataset into memory — critical when you’re working with 179 million MDD rows.

The cross-disorder dataset is the real prize

For AI/ML researchers, pgc-cross-disorder is where the most interesting work lives. It captures pleiotropy — genetic variants that influence multiple psychiatric conditions simultaneously.

The shared genetic architecture between ADHD, depression, schizophrenia, and bipolar disorder isn’t just a curiosity. It’s where novel signals hide. Polygenic risk scores built across disorders can flag individuals at risk for comorbid presentations that single-disorder models miss entirely. Cross-disorder data is also the foundation for biological foundation models that need to generalize across psychiatric phenotypes rather than overfit to a single condition.
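At its simplest, a polygenic risk score is a weighted sum: each SNP's effect size from the summary statistics times the individual's effect-allele count. The sketch below uses invented numbers; real PRS pipelines add steps this omits (LD clumping, p-value thresholding, ancestry calibration).

```python
import numpy as np

# Minimal polygenic-score sketch: score = sum of (effect size x allele dosage).
# Betas and dosages are invented for illustration only.
betas = np.array([0.04, -0.02, 0.10])   # per-SNP effect sizes from summary stats
dosages = np.array([2, 1, 0])           # one individual's effect-allele counts (0/1/2)

prs = float(np.dot(betas, dosages))
print(round(prs, 3))  # 0.06
```

A cross-disorder score simply draws its betas from the cross-disorder summary statistics instead of a single-disorder study.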

Getting started

# pip install datasets pandas

from datasets import load_dataset

# Load schizophrenia GWAS
ds = load_dataset("OpenMed/pgc-schizophrenia")
df = ds["train"].to_pandas()
print(df.head())
print(df.columns.tolist())
print(f"Total variants: {len(df):,}")

# Load cross-disorder data (most useful for AI research)
ds_cross = load_dataset("OpenMed/pgc-cross-disorder")
df_cross = ds_cross["train"].to_pandas()

# Filter to genome-wide significant hits (p < 5e-8)
significant = df_cross[df_cross["p"] < 5e-8]
print(f"Genome-wide significant variants: {len(significant):,}")

For multi-disorder analysis, merge on SNP ID to surface shared signals:

import pandas as pd

disorders = ["pgc-schizophrenia", "pgc-bipolar", "pgc-mdd"]
dfs = {}
for disorder in disorders:
    ds = load_dataset(f"OpenMed/{disorder}", split="train")
    dfs[disorder] = ds.to_pandas()[["snp", "beta", "p"]].rename(
        columns={"beta": f"beta_{disorder}", "p": f"p_{disorder}"}
    )

# Merge on SNP to find shared signals
merged = dfs[disorders[0]]
for d in disorders[1:]:
    merged = merged.merge(dfs[d], on="snp", how="inner")

print(f"SNPs with data across all three disorders: {len(merged):,}")
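From a merged frame like the one above, a crude first pass at pleiotropic candidates is to keep SNPs that reach genome-wide significance in every disorder. The snippet below builds a toy stand-in for the merged frame (values invented) so the filter logic is runnable on its own.

```python
import pandas as pd

# Toy stand-in for the merged frame built above (p-values invented for illustration)
merged = pd.DataFrame({
    "snp": ["rs1", "rs2", "rs3"],
    "p_pgc-schizophrenia": [1e-9, 0.2, 2e-8],
    "p_pgc-bipolar":       [3e-9, 1e-10, 4e-8],
    "p_pgc-mdd":           [4e-8, 0.5, 6e-9],
})

# Keep SNPs significant (p < 5e-8) in all disorders at once
p_cols = [c for c in merged.columns if c.startswith("p_")]
pleiotropic = merged[(merged[p_cols] < 5e-8).all(axis=1)]
print(pleiotropic["snp"].tolist())  # ['rs1', 'rs3']
```

This is only a screening heuristic; proper pleiotropy analysis accounts for linkage disequilibrium and correlated samples across studies.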

A word on data use

This collection makes access dramatically easier, but “easy to load” isn’t the same as “unrestricted.” The PGC operates under data access agreements that vary by study. Not all datasets are cleared for commercial use. Before building anything production-facing, review the original PGC consent frameworks and check the individual dataset cards on HuggingFace for specific restrictions. The academic research use case is well-covered; commercial applications need more diligence.

Who should care

Beyond computational psychiatrists and psychiatric geneticists (the obvious audience), this collection matters for: AI/ML researchers building biological foundation models, polygenic risk score developers, drug discovery teams mining for genetic targets, and anyone doing Mendelian randomization studies on psychiatric outcomes.

The data was always there. Now it’s just actually usable.


FAQ

What is the PGC GWAS dataset? 1.14 billion rows of genetic association data from 52 psychiatric genetics publications across 12 disorder groups, now available on HuggingFace as standardized Parquet files.

How do I access this data? pip install datasets, then load_dataset("OpenMed/pgc-schizophrenia"). See the code examples above for loading, filtering, and merging across disorders.

What is GWAS summary statistics data? The output of genome-wide association studies: for millions of genetic variants (SNPs), you get the association strength with a disease phenotype — effect size, standard error, p-value, and sample size. It’s the core input for polygenic risk scoring, genetic correlation analysis, and Mendelian randomization.

Why is the cross-disorder dataset the most useful? It captures pleiotropy — SNPs influencing multiple psychiatric conditions simultaneously. This is where cross-disorder AI models and polygenic risk scores find their most novel signal, and where understanding the shared genetic architecture of conditions like ADHD, depression, and schizophrenia becomes tractable.

Can I use this data commercially? PGC data comes with access agreements that vary by study. Review the individual dataset cards and original PGC consent frameworks before any commercial application. Academic research use is generally well-covered; commercial use requires explicit clearance.

What was the friction before this release? Navigating the PGC download portal, running gunzip on each file, debugging inconsistent column separators across 52 publications, and manually normalizing formats — easily hours of work per dataset. Now it’s one Python command.