MosaicMRI: The Largest Open-Source Musculoskeletal MRI Dataset Just Dropped

By Prahlad Menon

MRI AI research has a dirty secret: most models are trained and benchmarked on knee and brain data from a handful of scanner configurations. When those models encounter a spine, an ankle, or a different coil setup, they fail. Not gracefully, but dramatically.

MosaicMRI is the dataset built to fix that.

Released last week by researchers at USC, it's the largest open-source raw musculoskeletal MRI dataset ever published: 2,671 volumes, 80,156 slices, 454 patients, 10 anatomies. It provides raw k-space data (the full acquisition, not just magnitude images) with realistic clinical variability across contrast, orientation, and coil configuration.

Why This Matters

The standard MRI ML datasets (fastMRI, SKM-TEA, CMRxRecon) were instrumental in demonstrating that learned reconstruction could work. But they mostly showed it works in narrow, controlled settings. Single anatomy. Single scanner. Limited contrast variation. Training on knee data and deploying on spine data is a known failure mode, but benchmarks haven't pushed researchers to solve it.

MosaicMRI is explicitly designed to make generalization the central challenge:

  • 10 anatomies (vs. 1-2 in most existing datasets)
  • Multi-contrast: PD, T1, T2, STIR, and clinical variants including DIXON, DESS, TIRM
  • Multi-orientation: axial, sagittal, coronal across all anatomies
  • Multi-coil: 4 to 46 channels, with 16 channels the most common
  • Spine included: almost entirely absent from public raw MRI datasets

The benchmark tracks operationalize this directly. The anatomy generalization challenge withholds ankle data from training. The contrast generalization challenge withholds T1 fat-suppressed. You can't memorize your way to good scores.
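
To make the hold-out idea concrete, here is a minimal sketch of how a training pool could be filtered so that no withheld anatomy or contrast leaks into it. The `Scan` record, its field names, and the file names are hypothetical; the real dataset ships the held-out data in separate folders, so this only illustrates the principle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """Hypothetical manifest entry; field names are illustrative, not the dataset's schema."""
    path: str
    anatomy: str
    contrast: str


def split_held_out(scans, held_out_anatomies=("ankle",), held_out_contrasts=("T1_FS",)):
    """Return (train_pool, held_out) so no held-out anatomy or contrast reaches training."""
    train_pool, held_out = [], []
    for scan in scans:
        is_held_out = (
            scan.anatomy in held_out_anatomies or scan.contrast in held_out_contrasts
        )
        (held_out if is_held_out else train_pool).append(scan)
    return train_pool, held_out


# Made-up entries, for illustration only.
scans = [
    Scan("a.h5", "knee", "PD"),
    Scan("b.h5", "ankle", "T2"),
    Scan("c.h5", "spine", "T1_FS"),
    Scan("d.h5", "knee", "STIR"),
]
train_pool, held_out = split_held_out(scans)
print([s.path for s in train_pool])
print([s.path for s in held_out])
```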

The Data

The data was collected on a 1.5T Siemens Magnetom Avantofit scanner between July and September 2025. Every scan was visually quality-checked. Files are stored as HDF5 with ISMRMRD-compatible headers and a fastMRI-style layout, which makes them drop-in compatible with existing reconstruction pipelines.

MosaicMRI/
├── multicoil_train/    1,873 scans | 303 patients | 56,235 slices | 2,382 GiB
├── multicoil_val/        398 scans |  68 patients | 12,027 slices |   580 GiB
├── multicoil_test/       400 scans |  79 patients | 11,894 slices |    72 GiB
├── anatomy_transfer_challenge/
│   └── ankle/           (held out from training: 20 files, 49 GiB)
└── contrast_generalization_challenge/
    └── T1_FS/           (held out from training: 17 files, 21 GiB)

Splits are patient-disjoint to prevent leakage, balanced by slice count with per-anatomy coverage preserved across train/val/test.
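
As a rough illustration of what "patient-disjoint, balanced by slice count" means, the sketch below assigns whole patients to splits with a greedy heuristic. This is not the authors' procedure, which the article does not describe. The function name, the target fractions, and the toy patient list are assumptions, and the sketch ignores per-anatomy coverage for brevity.

```python
from collections import defaultdict


def patient_disjoint_split(scans, fractions):
    """Assign whole patients to splits, greedily balancing slice counts.

    scans: iterable of (patient_id, n_slices) tuples, one per scan.
    fractions: dict of split name -> target share of all slices.
    Returns a dict of split name -> set of patient ids.
    """
    # Group at the patient level so one patient never spans two splits.
    slices_per_patient = defaultdict(int)
    for patient_id, n_slices in scans:
        slices_per_patient[patient_id] += n_slices

    total = sum(slices_per_patient.values())
    targets = {name: frac * total for name, frac in fractions.items()}
    assigned = {name: set() for name in fractions}
    filled = {name: 0 for name in fractions}

    # Largest patients first; each goes to the split furthest below its target.
    ordered = sorted(slices_per_patient.items(), key=lambda kv: (-kv[1], kv[0]))
    for patient_id, n_slices in ordered:
        name = max(fractions, key=lambda s: targets[s] - filled[s])
        assigned[name].add(patient_id)
        filled[name] += n_slices
    return assigned


# Toy input: patient p1 has two scans, the others one each.
scans = [("p1", 30), ("p1", 28), ("p2", 40), ("p3", 32),
         ("p4", 35), ("p5", 25), ("p6", 30)]
splits = patient_disjoint_split(scans, {"train": 0.7, "val": 0.15, "test": 0.15})
for name, patients in splits.items():
    print(name, sorted(patients))
```

Grouping by patient before assigning is what prevents leakage: two scans of the same patient can never end up on opposite sides of a split.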

Each H5 file contains:

  • k-space: raw multi-coil acquisition data
  • reconstruction_rss: root-sum-of-squares reference reconstruction
  • ISMRMRD header: full acquisition metadata (TR, TE, TI, FOV, matrix, coil count, acceleration factor, trajectory)
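
For readers new to raw MRI data, the relationship between the k-space and the root-sum-of-squares image can be sketched as follows. This is a minimal illustration and not the dataset's official loader. The HDF5 key `"kspace"` and the `(slices, coils, rows, cols)` axis order are assumptions borrowed from fastMRI and should be checked against the real files. The demo runs on synthetic data, so it does not need the dataset or `h5py`.

```python
import numpy as np


def centered_ifft2(kspace):
    """Centered 2D inverse FFT over the last two axes (orthonormal scaling)."""
    shifted = np.fft.ifftshift(kspace, axes=(-2, -1))
    image = np.fft.ifft2(shifted, axes=(-2, -1), norm="ortho")
    return np.fft.fftshift(image, axes=(-2, -1))


def rss_reconstruction(kspace):
    """Root-sum-of-squares over the coil axis.

    kspace: complex array shaped (coils, rows, cols) for one slice.
    Returns a real array shaped (rows, cols).
    """
    coil_images = centered_ifft2(kspace)
    return np.sqrt(np.sum(np.abs(coil_images) ** 2, axis=0))


def load_slice(path, slice_index):
    """Read one slice of k-space from an HDF5 file.

    ASSUMPTION: the key is "kspace" with shape (slices, coils, rows, cols),
    as in fastMRI. Verify against the actual MosaicMRI files.
    """
    import h5py  # imported lazily so the demo below runs without h5py

    with h5py.File(path, "r") as f:
        return f["kspace"][slice_index]


# Synthetic demo: 16 coils, 64x64 matrix, random complex k-space.
rng = np.random.default_rng(0)
kspace = rng.standard_normal((16, 64, 64)) + 1j * rng.standard_normal((16, 64, 64))
image = rss_reconstruction(kspace)
print(image.shape, image.dtype)
```

Because the coil axis is reduced by a sum, the same function works whether a scan has 4 channels or 46.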

The Research Directions It Opens

Accelerated reconstruction across anatomy. Today's state of the art in learned MRI reconstruction trains a separate model per anatomy/contrast. MosaicMRI enables, and the benchmark requires, single models that generalize across the full MSK spectrum.

Foundation models for MRI. The paper frames MosaicMRI as a testbed for foundation model questions: scaling laws (does more diverse data keep improving reconstruction?), data mixtures (which anatomy combinations transfer?), continual learning (can a model learn new anatomies without forgetting old ones?), and OOD generalization.

Low-field reconstruction. The 1.5T data combined with realistic coil variability creates a proxy for the noise and artifact distributions seen in low-field scanners. Training reconstruction models on 1.5T with coil diversity improves performance on 0.55T and 1.0T systems, a clinically important direction as low-field MRI expands globally.

Motion compensation. Real clinical scans have motion artifacts. MosaicMRI includes scans with realistic patient motion, enabling development of motion-robust reconstruction methods that aren't trained on pristine phantom data.

The Benchmark

Three initial tracks:

Track                        | Challenge                                     | Withheld Data
Mixed-anatomy reconstruction | 4x and 8x acceleration, all anatomies         | None
Anatomy generalization       | Reconstruct ankle with no ankle training data | Ankle (20 volumes)
Contrast generalization      | Reconstruct T1-FS with no T1-FS training data | T1 fat-suppressed (17 volumes)
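
The article does not say how the 4x and 8x undersampling is defined, so the following is only a sketch of one common convention: an equispaced Cartesian mask with a fully sampled low-frequency band. The function names, the `center_fraction` value, and the 320-column matrix are assumptions chosen for illustration. The sketch also assumes k-space is stored with low frequencies at the center of the last axis.

```python
import numpy as np


def equispaced_mask(num_cols, acceleration, center_fraction=0.08, offset=0):
    """Boolean mask over phase-encode columns: True means the column is kept."""
    mask = np.zeros(num_cols, dtype=bool)
    mask[offset::acceleration] = True  # keep every R-th column

    # Keep a fully sampled low-frequency band around the center.
    num_center = int(round(num_cols * center_fraction))
    start = (num_cols - num_center) // 2
    mask[start:start + num_center] = True
    return mask


def undersample(kspace, mask):
    """Zero out unsampled columns; columns are the last axis of kspace."""
    return kspace * mask


for acceleration in (4, 8):
    mask = equispaced_mask(num_cols=320, acceleration=acceleration)
    effective = 320 / mask.sum()
    print(f"nominal {acceleration}x, kept {int(mask.sum())} columns, "
          f"effective {effective:.2f}x")
```

Note that the extra center columns make the effective acceleration lower than the nominal factor, which is why benchmarks usually state both the factor and the mask recipe.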

Submit results at mosaicmri.ai/benchmark. More tracks will be announced over time.

Access and Citation

Data access requires a request at mosaicmri.ai. Code is on GitHub. Paper is arXiv:2604.11762.

@misc{arguello2026mosaicmri,
  title  = {MosaicMRI: A Diverse Dataset and Benchmark for Raw Musculoskeletal MRI},
  author = {Paula Arguello and Berk Tinaz and Mohammad Shahab Sepehri and Zalan Fabian and Maryam Soltanolkotabi},
  year   = {2026},
  eprint = {2604.11762},
  archivePrefix = {arXiv},
  primaryClass  = {eess.IV},
  url    = {https://arxiv.org/abs/2604.11762}
}

This is the kind of dataset release that changes what's possible in a field. If you work on MRI reconstruction, foundation models for medical imaging, or clinical AI generalization, this is worth your attention.