mL1-ACE: Fix Overconfident Medical Image Segmentation with Calibration Losses
TL;DR: Deep learning segmentation models are overconfident — they say 95% when they’re really 70% accurate. mL1-ACE is a differentiable calibration loss you add during training to fix this. Results: 16-46% reduction in calibration error while maintaining Dice scores. Open-source code available using MONAI.
Your medical image segmentation model just predicted a tumor boundary with 98% confidence. But it’s wrong. This overconfidence problem isn’t a bug — it’s the default behavior of neural networks trained on limited medical data.
A new paper from King’s College London and Imperial offers a practical fix: train-time calibration losses that make your model honest about its uncertainty.
What Problem Does mL1-ACE Solve?
mL1-ACE solves the calibration gap in medical image segmentation. When a model predicts 90% probability for a voxel being tumor, it should be correct ~90% of the time. In reality, standard models trained with Dice + cross-entropy are systematically overconfident.
This matters clinically:
- Radiotherapy planning — uncertain regions need human review
- Surgical guidance — false confidence at tissue boundaries is dangerous
- AI-assisted diagnosis — calibrated probabilities enable proper decision thresholds
Post-hoc fixes like temperature scaling help, but they’re band-aids. They can’t provide instance-specific or region-specific calibration. mL1-ACE bakes calibration directly into training.
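To see why temperature scaling is only a global fix, here is a minimal sketch: a single scalar `T` is fitted after training and divides every logit everywhere, so it cannot soften one image or one region more than another. The function name and toy logits are illustrative, not from the paper's code.

```python
import torch

def temperature_scale(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Post-hoc calibration: divide logits by one scalar T > 1 to soften
    overconfident probabilities. The same T applies to every voxel of
    every image -- no instance- or region-specific adjustment."""
    return torch.softmax(logits / T, dim=1)

# Toy logits for 4 voxels, 2 classes
logits = torch.tensor([[4.0, 0.0], [3.0, 1.0], [0.5, 2.5], [2.0, 2.0]])
probs_raw = torch.softmax(logits, dim=1)
probs_cal = temperature_scale(logits, T=2.0)  # softer, less extreme
```

One knob for the whole model is exactly why the article calls this a band-aid.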
How Does mL1-ACE Work Under the Hood?
The core idea is elegant: use Average Calibration Error as a differentiable loss term.
For each image, predictions are binned by confidence level. Within each bin, the method compares:
- Expected probability (average predicted confidence)
- Observed frequency (actual accuracy in that bin)
The difference between these is the calibration error. Sum it across bins, average across classes — that’s mL1-ACE.
The key insight: segmentation produces dense voxel-level predictions, giving you thousands of data points per image. This makes per-image calibration estimation stable, unlike classification where you need large batches.
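The binning-and-comparison steps above can be sketched for a single image and a single foreground class. This is an assumed reconstruction of the hard-binned idea, not the repository's exact implementation:

```python
import torch

def hard_binned_ace(probs: torch.Tensor, labels: torch.Tensor,
                    num_bins: int = 20) -> torch.Tensor:
    """Per-image Average Calibration Error for one foreground class.
    probs:  predicted foreground probability per voxel, flattened.
    labels: binary ground truth per voxel."""
    edges = torch.linspace(0, 1, num_bins + 1)
    errors = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()           # expected probability
            frequency = labels[in_bin].float().mean()   # observed frequency
            errors.append((confidence - frequency).abs())
    # average the |confidence - frequency| gap over non-empty bins
    return torch.stack(errors).mean()
```

With millions of voxels per volume, each bin typically holds enough samples for this per-image estimate to be stable.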
Two binning strategies:
| Approach | Kernel | Calibration Improvement | Dice Impact |
|---|---|---|---|
| Hard-binned (hL1-ACE) | Square | Moderate (16-22%) | Minimal |
| Soft-binned (sL1-ACE) | Triangular | Strong (19-46%) | Small (~1%) |
Soft binning uses triangular kernels for smoother gradients during backpropagation, achieving better calibration at the cost of slightly softer probability maps.
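The triangular-kernel idea can be sketched as a soft assignment matrix: each voxel contributes to its two nearest bin centres with linearly decaying "hat" weights, so bin membership is differentiable with respect to the predicted probability. This is an assumed reconstruction of the concept, not the repository's code:

```python
import torch

def triangular_bin_weights(probs: torch.Tensor, num_bins: int = 20) -> torch.Tensor:
    """Soft binning: each voxel's probability is split across the nearby
    bin centres with triangular (hat-function) weights, giving smooth
    gradients where hard binning would give none.
    Returns a (num_voxels, num_bins) weight matrix whose rows sum to 1."""
    centres = (torch.arange(num_bins) + 0.5) / num_bins   # bin midpoints
    width = 1.0 / num_bins
    # weight falls linearly from 1 at the centre to 0 one bin-width away
    dist = (probs.unsqueeze(1) - centres.unsqueeze(0)).abs()
    weights = torch.clamp(1.0 - dist / width, min=0.0)
    return weights / weights.sum(dim=1, keepdim=True)     # normalise rows
```

A probability sitting exactly between two centres splits 50/50 between them, which is also why soft binning yields slightly softer probability maps.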
What Results Did the Paper Achieve?
The authors tested on four standard medical imaging benchmarks:
| Dataset | Modality | Task | Baseline Dice | ACE Reduction (soft) |
|---|---|---|---|---|
| ACDC | Cardiac MRI | LV/RV/Myocardium | 0.871 | 46% |
| AMOS | Abdominal CT | 15 organs | 0.883 | 19% |
| BraTS | Brain MRI | Tumor regions | 0.905 | 30% |
| KiTS | Kidney CT | Kidney/tumor/cyst | 0.859 | 36% |
Key findings:
- Soft mL1-ACE delivers the strongest calibration improvements
- Hard mL1-ACE maintains segmentation performance with moderate calibration gains
- Both outperform post-hoc temperature scaling
- Improvements are statistically significant (p < 0.01) across datasets
The paper also introduces dataset reliability histograms — a visualization showing calibration variability across an entire dataset, not just aggregate metrics.
How Do I Install and Use Average-Calibration-Losses?
The code uses MONAI bundles for reproducible experiments:
```bash
# Clone the repository
git clone https://github.com/cai4cai/Average-Calibration-Losses
cd Average-Calibration-Losses

# Generate a bundle for your dataset/loss combination
python generate_bundle.py --data brats21 --loss softl1ace_dice_ce

# Train with Docker (recommended)
./scripts/docker_build.sh
./scripts/docker_run.sh --mode train --bundle brats21_softl1ace_dice_ce_<hash> --gpu 0
```
Available loss configurations:
- `baseline_dice_ce` — Standard Dice + cross-entropy
- `hardl1ace_dice_ce` — Adds the hard-binned calibration loss
- `softl1ace_dice_ce` — Adds the soft-binned calibration loss
Requirements: a GPU with 24GB+ VRAM (e.g. RTX 4090); 96GB of system RAM recommended.
How Do I Add mL1-ACE to My Own Training Pipeline?
The loss functions are modular. Here’s the core implementation pattern:
```python
from src.losses.softl1ace import SoftL1ACEandDiceCELoss

# Initialize combined loss
loss_fn = SoftL1ACEandDiceCELoss(
    include_background=False,
    softmax=True,
    num_bins=20,
    # Weights for the Dice, CE, and ACE components
    lambda_dice=1.0,
    lambda_ce=1.0,
    lambda_ace=1.0,
)

# Use in training loop
loss = loss_fn(predictions, ground_truth)
loss.backward()
```
The implementation handles:
- Background exclusion
- One-hot encoding
- Missing class handling (important for datasets like KiTS where 56% of cases lack cyst labels)
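The missing-class point can be illustrated with a small sketch: average per-class calibration errors only over classes that actually appear in the ground truth, so an absent class (e.g. no cyst in a KiTS case) does not distort the image's score. The function name and masking strategy are assumptions for illustration; the repository's handling may differ:

```python
import torch

def masked_class_mean(per_class_ace: torch.Tensor,
                      onehot_gt: torch.Tensor) -> torch.Tensor:
    """Average per-class ACE only over classes present in this image.
    per_class_ace: (C,) calibration error per class.
    onehot_gt:     (C, *spatial) one-hot ground truth."""
    present = onehot_gt.flatten(1).any(dim=1)   # (C,) bool: class has voxels
    return per_class_ace[present].mean()
```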
How Does mL1-ACE Compare to Other Calibration Methods?
| Method | Type | Calibration | Accuracy | Complexity |
|---|---|---|---|---|
| Temperature Scaling | Post-hoc | Moderate | Unchanged | Low |
| Label Smoothing | Train-time | Weak | May decrease | Low |
| Mixup | Train-time | Moderate | May decrease | Medium |
| NACL | Train-time | Moderate | Maintained | Medium |
| mL1-ACE (hard) | Train-time | Good | Maintained | Medium |
| mL1-ACE (soft) | Train-time | Best | Slight decrease | Medium |
The advantage of mL1-ACE: it directly optimizes the calibration metric you care about, rather than using proxies or heuristics.
When Should I Use mL1-ACE?
Use mL1-ACE when:
- Your segmentation model will inform clinical decisions
- You need reliable uncertainty estimates at tissue boundaries
- Post-hoc calibration isn’t sufficient for your application
- You want instance-specific calibration, not just global adjustment
Consider alternatives when:
- You need maximum Dice performance regardless of calibration
- Your hardware can’t support the additional computation
- You’re working with extremely small datasets where per-image calibration is unstable
Frequently Asked Questions
What is mL1-ACE loss?
mL1-ACE (marginal L1 Average Calibration Error) is a differentiable auxiliary loss that improves voxel-wise probability calibration in segmentation models by directly minimizing the gap between predicted confidence and observed accuracy during training.
Why are medical segmentation models overconfident?
Neural networks trained with standard losses (Dice, cross-entropy) on limited medical imaging data systematically overpredict confidence. The softmax function tends to push probabilities toward extremes, and limited training data doesn’t provide enough signal to regularize this.
What’s the difference between ECE and ACE?
ECE (Expected Calibration Error) weights bins by sample count, so it’s dominated by high-confidence predictions. ACE (Average Calibration Error) treats all confidence levels equally, which matters for medical imaging where boundary uncertainty is clinically important.
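The distinction is easy to demonstrate numerically. In this sketch (illustrative, not the paper's code), many well-calibrated high-confidence voxels swamp a small badly calibrated boundary bin under ECE, while ACE exposes it:

```python
import torch

def ece_vs_ace(probs: torch.Tensor, labels: torch.Tensor, num_bins: int = 10):
    """Return (ECE, ACE) for one foreground class.
    ECE weights each bin's |confidence - frequency| gap by voxel count;
    ACE weights every non-empty bin equally."""
    edges = torch.linspace(0, 1, num_bins + 1)
    gaps, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            gap = (probs[in_bin].mean() - labels[in_bin].float().mean()).abs()
            gaps.append(gap)
            counts.append(in_bin.sum())
    gaps = torch.stack(gaps)
    counts = torch.stack(counts).float()
    ece = (gaps * counts / counts.sum()).sum()  # dominated by crowded bins
    ace = gaps.mean()                            # every bin counts equally
    return ece, ace
```

With 900 near-correct voxels at 0.95 confidence and 100 wrong boundary voxels at 0.55, ECE stays low while ACE flags the boundary miscalibration.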
Does mL1-ACE work with any architecture?
Yes. The paper uses SegResNet, but the loss function is architecture-agnostic. It operates on the softmax output, so it works with U-Net, nnU-Net, Swin UNETR, or any segmentation architecture.
How many bins should I use?
The paper uses 20 bins as default. Sensitivity studies show the method is robust across 5-100 bins, with diminishing returns above 20.
Can I combine mL1-ACE with other techniques?
Yes. mL1-ACE is complementary to data augmentation, deep supervision, and ensemble methods. It specifically targets the calibration objective that other techniques address indirectly.
Is the code production-ready?
The code is research-grade with good documentation, Docker support, and MONAI integration. It’s suitable for research and prototyping. For clinical deployment, you’d want additional validation on your specific dataset.
What license is the code under?
The repository is open source on GitHub. Check the LICENSE file for specific terms.