Can Apple's Neural Engine do training, not just inference?

Yes. A developer reverse-engineered Apple's private ANE APIs and successfully trained a transformer (forward + backward pass) directly on the Neural Engine. The barrier was always software, not hardware — Apple just hasn't exposed training capability publicly.

Is the ANE really 80x more efficient than an A100?

Yes, in energy efficiency (TFLOPS per watt). The M4 ANE delivers ~6.6 TFLOPS/W vs A100's ~0.08 TFLOPS/W. However, the A100 has 16x more raw throughput (312 vs 19 TFLOPS). For battery-powered edge inference, ANE wins. For training large models, you still want datacenter GPUs.

How does Apple's Neural Engine actually work?

The ANE is a graph execution engine, not a general-purpose accelerator. You submit a compiled computation graph and it executes atomically. It's fundamentally a convolution engine (matmul as 1×1 conv gives 3x throughput), has ~32MB on-chip SRAM, and uses hard power gating for zero idle power.

What did the reverse engineering discover about ANE?

Private Objective-C APIs (_ANEClient, _ANECompiler, _ANEInMemoryModelDescriptor) allow direct compilation and execution on ANE without CoreML. Apple's '38 TOPS' marketing is misleading — INT8 dequantizes to FP16 before compute, so true peak is 19 TFLOPS FP16.

What are the limitations of ANE training?

Current utilization is only ~11% of peak. Many operations fall back to CPU (RMSNorm backward, residual connections, loss, Adam). There's a ~119 compile limit due to resource leaks. These are private APIs that could break with any macOS update.

How can I try ANE training myself?

Clone github.com/maderix/ANE, build with 'xcrun clang' using the provided flags, and run './train_large'. Requires macOS 15+ on Apple Silicon. No external dependencies — just system frameworks and private APIs resolved at runtime.

Why does this matter for edge AI?

Every MacBook, iMac, and Mac Mini has a 19 TFLOPS accelerator locked to inference-only use. This project proves backpropagation on ANE is feasible, opening possibilities for on-device fine-tuning and potentially nudging Apple toward first-party training support.

80× More Efficient Than A100? Someone Reverse-Engineered Apple's Neural Engine for Training

By Prahlad Menon Published 2026-03-04 2 min read

Apple’s Neural Engine (ANE) has always been a black box. It ships in every M-series Mac, delivers impressive inference performance, and is completely locked down for training. CoreML only supports inference. MLX ignores it entirely. Apple provides no public API for direct access.

Then maderix reverse-engineered the entire software stack and trained a transformer on it anyway.

What Actually Happened

Over a weekend, maderix (working collaboratively with Claude) mapped the ANE’s software architecture from CoreML down to the IOKit kernel driver. They discovered private Objective-C APIs (_ANEClient, _ANECompiler, _ANEInMemoryModelDescriptor) that allow direct compilation and execution of compute graphs on the ANE—without touching CoreML.

The result: a working proof-of-concept that runs transformer training (forward + backward pass) directly on Apple’s Neural Engine. No GPU. No Metal. Pure ANE compute.

Current benchmarks (M4, single transformer layer, dim=768, seq=512):

9.3 ms/step
11.2% ANE utilization (1.78 TFLOPS sustained)
6 ANE kernel dispatches per training step
Forward and backward passes on ANE; weight gradients on CPU

The “80× More Efficient” Claim

Headlines have trumpeted “80× more efficient than A100.” This is technically accurate but needs context—it’s about energy efficiency, not raw throughput.

Accelerator	Peak Performance	Efficiency (TFLOPS/W)
M4 ANE	19 TFLOPS FP16	6.6
M4 GPU	~3.5 TFLOPS	~1.0
A100	312 TFLOPS FP16	~0.08
H100	990 TFLOPS FP16	~0.13

The ANE delivers roughly 80× more compute per watt than an A100. But the A100 has 16× more raw throughput. For battery-powered on-device inference, the ANE is extraordinary. For training large models, you still want datacenter GPUs.

Another key finding: Apple’s “38 TOPS” marketing is misleading. The ANE dequantizes INT8 weights to FP16 before compute—there’s no actual 2× INT8 speedup. The true peak is 19 TFLOPS FP16, regardless of quantization.

How the ANE Actually Works

The ANE isn’t a general-purpose accelerator—it’s a graph execution engine. You don’t issue individual matrix operations. You submit a compiled program describing an entire computation graph, and the hardware executes it atomically.

Key discoveries from the reverse engineering:

It’s fundamentally a convolution engine. Expressing matmul as 1×1 convolution gives 3× higher throughput.
~32MB on-chip SRAM. Stay under this to avoid spilling to DRAM, which kills performance.
Deep graphs win. Single operations use only ~30% of capacity. Chain 16-64 ops to approach the theoretical peak.
Zero idle power. Hard power gating means 0 milliwatts when not in use—no leakage.

How to Try It Yourself

The code is MIT licensed and available at github.com/maderix/ANE.

Requirements:

macOS 15+ on Apple Silicon (tested on M4)
Xcode command line tools

Build and run:

# Clone the repo
git clone https://github.com/maderix/ANE.git
cd ANE

# Build the training program
xcrun clang -O2 -framework Foundation -framework IOSurface \
    -framework CoreML -framework Accelerate -ldl -lobjc \
    -o train_large training/train_large.m

# Run
./train_large

No external dependencies—just system frameworks and private ANE APIs resolved at runtime.

Important Caveats

The author is refreshingly honest about limitations:

This is research code, not a production framework. Don’t expect it to replace PyTorch.
Utilization is low. Even after optimization, only ~11% of peak ANE throughput is achieved.
Many operations fall back to CPU. RMSNorm backward, residual connections, loss computation, and Adam updates all run on CPU.
~119 compile limit. The ANE compiler leaks resources, requiring process restarts via exec().
Private APIs. These are undocumented and could break with any macOS update.

Why This Matters

The technical achievement here isn’t training a large model efficiently—it’s proving that the barrier to NPU training has always been software, not hardware.

Apple’s ANE is a 19 TFLOPS accelerator sitting in every MacBook, iMac, and Mac Mini. It’s locked to inference-only use not because it can’t do training, but because Apple hasn’t chosen to expose the capability. This project demonstrates that backpropagation on ANE is entirely feasible with the right software layer.

For edge AI researchers, this opens interesting possibilities. For Apple, it might be a nudge toward first-party training support. For the rest of us, it’s a fascinating look inside silicon that Apple would prefer to keep mysterious.

References

GitHub Repository: maderix/ANE
Part 1: Reverse Engineering: maderix.substack.com
Part 2: ANE Benchmarks: maderix.substack.com
Prior Work - hollance/neural-engine: GitHub (community ANE documentation)
Prior Work - mdaiter/ane: GitHub (early ANE reverse engineering)
Apple’s ANE Transformers Reference: apple/ml-ane-transformers

This project uses Apple’s private, undocumented APIs and is not affiliated with or endorsed by Apple Inc. The legal basis for this research falls under fair use and interoperability provisions (Sega v. Accolade, 1992; DMCA §1201(f)).