mL1-ACE: Fix Overconfident Medical Image Segmentation with Calibration Losses
TL;DR: Deep learning segmentation models are overconfident — they say 95% when they’re really 70% accurate. mL1-ACE is a differentiable calibration loss you add during training to fix this. Results: 16-46% reduction in calibration error while maintaining Dice scores. Open-source code available using MONAI.
Your medical image segmentation model just predicted a tumor boundary with 98% confidence. But it’s wrong. This overconfidence problem isn’t a bug — it’s the default behavior of neural networks trained on limited medical data.
A new paper from King’s College London and Imperial offers a practical fix: train-time calibration losses that make your model honest about its uncertainty.
What Problem Does mL1-ACE Solve?
mL1-ACE solves the calibration gap in medical image segmentation. When a model predicts 90% probability for a voxel being tumor, it should be correct ~90% of the time. In reality, standard models trained with Dice + cross-entropy are systematically overconfident.
This matters clinically:
- Radiotherapy planning — uncertain regions need human review
- Surgical guidance — false confidence at tissue boundaries is dangerous
- AI-assisted diagnosis — calibrated probabilities enable proper decision thresholds
Post-hoc fixes like temperature scaling help, but they’re band-aids. They can’t provide instance-specific or region-specific calibration. mL1-ACE bakes calibration directly into training.
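To see why temperature scaling is only a global fix, here is a minimal sketch: a single scalar `T` is fitted after training and divides every logit everywhere, so it cannot soften one image or one region more than another. The function name and toy logits are illustrative, not from the paper's code.

```python
import torch

def temperature_scale(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Post-hoc calibration: divide logits by one scalar T > 1 to soften
    overconfident probabilities. The same T applies to every voxel of
    every image -- no instance- or region-specific adjustment."""
    return torch.softmax(logits / T, dim=1)

# Toy logits for 4 voxels, 2 classes
logits = torch.tensor([[4.0, 0.0], [3.0, 1.0], [0.5, 2.5], [2.0, 2.0]])
probs_raw = torch.softmax(logits, dim=1)
probs_cal = temperature_scale(logits, T=2.0)  # softer, less extreme
```

One knob for the whole model is exactly why the article calls this a band-aid.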
How Does mL1-ACE Work Under the Hood?
The core idea is elegant: use Average Calibration Error as a differentiable loss term.
For each image, predictions are binned by confidence level. Within each bin, the method compares:
- Expected probability (average predicted confidence)
- Observed frequency (actual accuracy in that bin)
The difference between these is the calibration error. Sum it across bins, average across classes — that’s mL1-ACE.
The key insight: segmentation produces dense voxel-level predictions, giving you thousands of data points per image. This makes per-image calibration estimation stable, unlike classification where you need large batches.
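The binning-and-comparison steps above can be sketched for a single image and a single foreground class. This is an assumed reconstruction of the hard-binned idea, not the repository's exact implementation:

```python
import torch

def hard_binned_ace(probs: torch.Tensor, labels: torch.Tensor,
                    num_bins: int = 20) -> torch.Tensor:
    """Per-image Average Calibration Error for one foreground class.
    probs:  predicted foreground probability per voxel, flattened.
    labels: binary ground truth per voxel."""
    edges = torch.linspace(0, 1, num_bins + 1)
    errors = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            confidence = probs[in_bin].mean()           # expected probability
            frequency = labels[in_bin].float().mean()   # observed frequency
            errors.append((confidence - frequency).abs())
    # average the |confidence - frequency| gap over non-empty bins
    return torch.stack(errors).mean()
```

With millions of voxels per volume, each bin typically holds enough samples for this per-image estimate to be stable.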
Two binning strategies:
| Approach | Kernel | Calibration Improvement | Dice Impact |
|---|---|---|---|
| Hard-binned (hL1-ACE) | Square | Moderate (16-22%) | Minimal |
| Soft-binned (sL1-ACE) | Triangular | Strong (19-46%) | Small (~1%) |
Soft binning uses triangular kernels for smoother gradients during backpropagation, achieving better calibration at the cost of slightly softer probability maps.
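The triangular-kernel idea can be sketched as a soft assignment matrix: each voxel contributes to its two nearest bin centres with linearly decaying "hat" weights, so bin membership is differentiable with respect to the predicted probability. This is an assumed reconstruction of the concept, not the repository's code:

```python
import torch

def triangular_bin_weights(probs: torch.Tensor, num_bins: int = 20) -> torch.Tensor:
    """Soft binning: each voxel's probability is split across the nearby
    bin centres with triangular (hat-function) weights, giving smooth
    gradients where hard binning would give none.
    Returns a (num_voxels, num_bins) weight matrix whose rows sum to 1."""
    centres = (torch.arange(num_bins) + 0.5) / num_bins   # bin midpoints
    width = 1.0 / num_bins
    # weight falls linearly from 1 at the centre to 0 one bin-width away
    dist = (probs.unsqueeze(1) - centres.unsqueeze(0)).abs()
    weights = torch.clamp(1.0 - dist / width, min=0.0)
    return weights / weights.sum(dim=1, keepdim=True)     # normalise rows
```

A probability sitting exactly between two centres splits 50/50 between them, which is also why soft binning yields slightly softer probability maps.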
What Results Did the Paper Achieve?
The authors tested on four standard medical imaging benchmarks:
| Dataset | Modality | Task | Baseline Dice | ACE Reduction (soft) |
|---|---|---|---|---|
| ACDC | Cardiac MRI | LV/RV/Myocardium | 0.871 | 46% |
| AMOS | Abdominal CT | 15 organs | 0.883 | 19% |
| BraTS | Brain MRI | Tumor regions | 0.905 | 30% |
| KiTS | Kidney CT | Kidney/tumor/cyst | 0.859 | 36% |
Key findings:
- Soft mL1-ACE delivers the strongest calibration improvements
- Hard mL1-ACE maintains segmentation performance with moderate calibration gains
- Both outperform post-hoc temperature scaling
- Improvements are statistically significant (p < 0.01) across datasets
The paper also introduces dataset reliability histograms — a visualization showing calibration variability across an entire dataset, not just aggregate metrics.
How Do I Install and Use Average-Calibration-Losses?
The code uses MONAI bundles for reproducible experiments:
```bash
# Clone the repository
git clone https://github.com/cai4cai/Average-Calibration-Losses
cd Average-Calibration-Losses

# Generate a bundle for your dataset/loss combination
python generate_bundle.py --data brats21 --loss softl1ace_dice_ce

# Train with Docker (recommended)
./scripts/docker_build.sh
./scripts/docker_run.sh --mode train --bundle brats21_softl1ace_dice_ce_<hash> --gpu 0
```
Available loss configurations:
- `baseline_dice_ce` — Standard Dice + cross-entropy
- `hardl1ace_dice_ce` — Adds the hard-binned calibration loss
- `softl1ace_dice_ce` — Adds the soft-binned calibration loss
Requirements: a GPU with 24GB+ VRAM (e.g. RTX 4090); 96GB of system RAM recommended.
How Do I Add mL1-ACE to My Own Training Pipeline?
The loss functions are modular. Here’s the core implementation pattern:
```python
from src.losses.softl1ace import SoftL1ACEandDiceCELoss

# Initialize combined loss
loss_fn = SoftL1ACEandDiceCELoss(
    include_background=False,
    softmax=True,
    num_bins=20,
    # Weights for the Dice, CE, and ACE components
    lambda_dice=1.0,
    lambda_ce=1.0,
    lambda_ace=1.0,
)

# Use in training loop
loss = loss_fn(predictions, ground_truth)
loss.backward()
```
The implementation handles:
- Background exclusion
- One-hot encoding
- Missing class handling (important for datasets like KiTS where 56% of cases lack cyst labels)
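The missing-class point can be illustrated with a small sketch: average per-class calibration errors only over classes that actually appear in the ground truth, so an absent class (e.g. no cyst in a KiTS case) does not distort the image's score. The function name and masking strategy are assumptions for illustration; the repository's handling may differ:

```python
import torch

def masked_class_mean(per_class_ace: torch.Tensor,
                      onehot_gt: torch.Tensor) -> torch.Tensor:
    """Average per-class ACE only over classes present in this image.
    per_class_ace: (C,) calibration error per class.
    onehot_gt:     (C, *spatial) one-hot ground truth."""
    present = onehot_gt.flatten(1).any(dim=1)   # (C,) bool: class has voxels
    return per_class_ace[present].mean()
```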
How Does mL1-ACE Compare to Other Calibration Methods?
| Method | Type | Calibration | Accuracy | Complexity |
|---|---|---|---|---|
| Temperature Scaling | Post-hoc | Moderate | Unchanged | Low |
| Label Smoothing | Train-time | Weak | May decrease | Low |
| Mixup | Train-time | Moderate | May decrease | Medium |
| NACL | Train-time | Moderate | Maintained | Medium |
| mL1-ACE (hard) | Train-time | Good | Maintained | Medium |
| mL1-ACE (soft) | Train-time | Best | Slight decrease | Medium |
The advantage of mL1-ACE: it directly optimizes the calibration metric you care about, rather than using proxies or heuristics.
When Should I Use mL1-ACE?
Use mL1-ACE when:
- Your segmentation model will inform clinical decisions
- You need reliable uncertainty estimates at tissue boundaries
- Post-hoc calibration isn’t sufficient for your application
- You want instance-specific calibration, not just global adjustment
Consider alternatives when:
- You need maximum Dice performance regardless of calibration
- Your hardware can’t support the additional computation
- You’re working with extremely small datasets where per-image calibration is unstable
Frequently Asked Questions
What is mL1-ACE loss?
mL1-ACE (marginal L1 Average Calibration Error) is a differentiable auxiliary loss that improves voxel-wise probability calibration in segmentation models by directly minimizing the gap between predicted confidence and observed accuracy during training.
Why are medical segmentation models overconfident?
Neural networks trained with standard losses (Dice, cross-entropy) on limited medical imaging data systematically overpredict confidence. The softmax function tends to push probabilities toward extremes, and limited training data doesn’t provide enough signal to regularize this.
What’s the difference between ECE and ACE?
ECE (Expected Calibration Error) weights bins by sample count, so it’s dominated by high-confidence predictions. ACE (Average Calibration Error) treats all confidence levels equally, which matters for medical imaging where boundary uncertainty is clinically important.
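The distinction is easy to demonstrate numerically. In this sketch (illustrative, not the paper's code), many well-calibrated high-confidence voxels swamp a small badly calibrated boundary bin under ECE, while ACE exposes it:

```python
import torch

def ece_vs_ace(probs: torch.Tensor, labels: torch.Tensor, num_bins: int = 10):
    """Return (ECE, ACE) for one foreground class.
    ECE weights each bin's |confidence - frequency| gap by voxel count;
    ACE weights every non-empty bin equally."""
    edges = torch.linspace(0, 1, num_bins + 1)
    gaps, counts = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            gap = (probs[in_bin].mean() - labels[in_bin].float().mean()).abs()
            gaps.append(gap)
            counts.append(in_bin.sum())
    gaps = torch.stack(gaps)
    counts = torch.stack(counts).float()
    ece = (gaps * counts / counts.sum()).sum()  # dominated by crowded bins
    ace = gaps.mean()                            # every bin counts equally
    return ece, ace
```

With 900 near-correct voxels at 0.95 confidence and 100 wrong boundary voxels at 0.55, ECE stays low while ACE flags the boundary miscalibration.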
Does mL1-ACE work with any architecture?
Yes. The paper uses SegResNet, but the loss function is architecture-agnostic. It operates on the softmax output, so it works with U-Net, nnU-Net, Swin UNETR, or any segmentation architecture.
How many bins should I use?
The paper uses 20 bins as default. Sensitivity studies show the method is robust across 5-100 bins, with diminishing returns above 20.
Can I combine mL1-ACE with other techniques?
Yes. mL1-ACE is complementary to data augmentation, deep supervision, and ensemble methods. It specifically targets the calibration objective that other techniques address indirectly.
Is the code production-ready?
The code is research-grade with good documentation, Docker support, and MONAI integration. It’s suitable for research and prototyping. For clinical deployment, you’d want additional validation on your specific dataset.
What license is the code under?
The repository is open source on GitHub. Check the LICENSE file for specific terms.