LiTo: Apple's Surface Light Field Tokenizer for Image-to-3D Generation
Apple researchers have introduced LiTo — a 3D generative model that does something most image-to-3D systems quietly skip: it actually captures how objects look from different angles, not just their shape.
Most 3D generation models give you geometry. LiTo gives you geometry plus the full surface light field — the way specular highlights shift, the way Fresnel reflections change as you rotate the object, the subtle way materials respond to complex lighting. The result is 3D objects that don’t just match the input image’s shape, but its lighting and material character too.
The Problem LiTo Solves
Take any state-of-the-art image-to-3D model — Zero123, SyncDreamer, One-2-3-45. Feed it a photo of a shiny ceramic mug under warm lighting. The output will probably capture the mug’s shape reasonably well. But rotate the 3D model and those highlights stay put, like they’re painted on. That’s because these models predict diffuse appearance — view-independent color — not view-dependent light interaction.
This is the core gap LiTo addresses.
How LiTo Works
Surface Light Fields as Training Signal
A surface light field is essentially the full description of how light exits a surface at every point, in every direction. It captures specular highlights, Fresnel reflections, subsurface scattering — all the effects that make real objects look real.
LiTo’s insight: RGB-depth image pairs already contain samples of the surface light field. Each image is a view from a specific camera position. Depth gives you the surface geometry. Together, they give you a sparse sample of what the light field looks like from that viewpoint.
By encoding random subsamples of these RGB-depth pairs, LiTo builds a compact latent representation that jointly encodes both geometry and view-dependent appearance.
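The idea that RGB-depth pairs are already surface light field samples can be made concrete: unprojecting each pixel with its depth gives a surface point, the camera position gives the outgoing direction, and the pixel color gives the radiance. A minimal sketch (all function and argument names here are illustrative, not LiTo's actual API):

```python
import numpy as np

def light_field_samples(rgb, depth, K, cam_pose):
    """Convert one RGB-depth view into sparse surface light field samples.

    Each pixel yields a (surface point, outgoing direction, radiance)
    triple: exactly the (position, direction) -> color samples a surface
    light field is built from. Illustrative sketch, not LiTo's code.
    """
    H, W = depth.shape
    # Pixel grid -> ray directions in the camera frame.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays_cam = pix @ np.linalg.inv(K).T
    # Unproject with depth to get 3D surface points in the camera frame.
    pts_cam = rays_cam * depth.reshape(-1, 1)
    # Move to the world frame with the camera-to-world pose (R | t).
    R, t = cam_pose[:3, :3], cam_pose[:3, 3]
    pts_world = pts_cam @ R.T + t
    # Outgoing direction: from the surface point toward the camera center.
    dirs = t - pts_world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return pts_world, dirs, rgb.reshape(-1, 3)  # (x, omega, L(x, omega))
```

A set of such triples from a few views is the "sparse sample" of the light field that the tokenizer then encodes.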
Latent Flow Matching
Rather than using diffusion (slow, many denoising steps), LiTo trains a latent flow matching model on the surface light field representation. Flow matching learns a continuous transformation from a simple distribution to the complex distribution of real 3D object latents — conditioned on a single input image.
The result: given one photo, the model generates a 3D latent that captures the object’s shape and how it would look from any viewpoint under the original lighting conditions.
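The training objective behind this is the standard conditional flow matching loss: regress the velocity of a straight path from noise to a data latent. A sketch under assumed names (`velocity_net`, the latent `z1`, and `img_cond` are placeholders; LiTo's actual architecture is not public):

```python
import torch

def flow_matching_loss(velocity_net, z1, img_cond):
    """One conditional flow matching training step (sketch).

    Standard rectified-flow / flow matching recipe: interpolate between
    noise z0 and data z1, and regress the constant velocity (z1 - z0)
    with an image-conditioned network. Names are illustrative.
    """
    z0 = torch.randn_like(z1)              # sample from the base (noise) distribution
    t = torch.rand(z1.shape[0], 1)         # random time in [0, 1] per example
    zt = (1 - t) * z0 + t * z1             # point on the straight noise->data path
    target_v = z1 - z0                     # constant velocity of that path
    pred_v = velocity_net(zt, t, img_cond) # image-conditioned velocity prediction
    return ((pred_v - target_v) ** 2).mean()
```

Because the regression target is a straight-line velocity rather than a denoising score, sampling needs far fewer steps than typical diffusion.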
Key Architecture Points
- Latent vectors encode both geometry and appearance in a unified 3D latent space
- Flow matching model conditioned on single input image
- Training uses random subsamples of RGB-depth pairs per object
- Reproduces specular highlights and Fresnel reflections at inference
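At inference, generating a latent amounts to integrating the learned velocity field from noise to the data distribution. A minimal Euler-solver sketch (solver choice and step count are assumptions; real systems often use better ODE solvers):

```python
import torch

@torch.no_grad()
def generate_latent(velocity_net, img_cond, latent_shape, steps=32):
    """Integrate the learned flow from noise to a 3D latent (sketch).

    Simple fixed-step Euler integration; names and the 32-step default
    are illustrative, not LiTo's published inference setup.
    """
    z = torch.randn(latent_shape)                  # start at the noise distribution
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0], 1), i * dt)
        z = z + dt * velocity_net(z, t, img_cond)  # follow the conditional velocity field
    return z  # decode to geometry + view-dependent appearance downstream
```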
Companion Work: Apple Shape Tokens
LiTo arrives alongside Apple’s Shape Tokens paper from the ML Research team (Rick Chang, Yuyang Wang, Jiatao Gu, Josh Susskind, Oncel Tuzel, et al.).
Where LiTo focuses on appearance + geometry, Shape Tokens focuses on geometry as a first-class ML primitive:
| | LiTo | Shape Tokens |
|---|---|---|
| Focus | Geometry + view-dependent appearance | Geometry |
| Representation | Surface light field latents | Flow-matching PDFs on 3D surfaces |
| Conditioning | Single RGB image | Image, text, or 3D |
| Output | 3D with realistic materials | 3D at variable resolution |
| Code available | ❌ Not yet | ✅ Promised |
Shape Tokens model 3D surfaces as probability density functions p(x,y,z) learned via flow matching — a continuous, differentiable representation that plugs cleanly into downstream ML models. The team has promised a public release of pretrained tokenizers, image-conditioned flow models, training code, and their full data rendering pipeline.
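Modeling a surface as a density p(x, y, z) makes "variable resolution" trivial: each 3D point flows independently from Gaussian noise toward the surface, so resolution is just the number of points sampled. A sketch of that sampling loop (function names are assumptions, not the released API):

```python
import torch

@torch.no_grad()
def sample_surface_points(velocity_net, cond, n_points, steps=32):
    """Sample a point cloud from a surface modeled as a density p(x, y, z).

    Each point is transported independently from N(0, I) by the learned
    flow, so n_points directly controls output resolution. Illustrative
    sketch of the flow-matching view described above.
    """
    x = torch.randn(n_points, 3)              # i.i.d. Gaussian points in 3D
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n_points, 1), i * dt)
        x = x + dt * velocity_net(x, t, cond)  # transport toward the surface density
    return x                                   # (n_points, 3) point cloud
```

Calling this with `n_points=1_024` or `n_points=100_000` samples the same surface at different resolutions, which is the "3D at variable resolution" property in the table above.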
Results
LiTo achieves higher visual quality and better input fidelity than existing image-to-3D methods on the paper’s benchmarks. The key wins are in view-dependent effects — outputs correctly show highlight shifts and reflection changes as viewing angle changes, which prior methods simply cannot reproduce.
No code means we can’t verify these claims independently yet. But the approach is technically sound: surface light fields are the right representation for the problem, and flow matching is the right generative backbone for this kind of structured continuous data.
Why This Matters
The inability to capture view-dependent appearance is one of the most visible failure modes of current 3D generation. Every synthetic product render, every AR object overlay, every game asset generation pipeline hits this wall — objects that look plasticky and flat because the lighting is baked into the texture rather than responding to viewpoint.
LiTo points toward a path where image-to-3D models produce assets that behave like real materials — not just shapes with textures painted on.
With Apple’s resources and the quality of the Shape Tokens companion work (which promises public code), these ideas are likely to propagate into the broader ecosystem quickly — once the code drops.
Links:
- Paper (ICLR 2026 OpenReview): openreview.net/forum?id=TVP0p4f2Su
- Apple Shape Tokens: machinelearning.apple.com/research/3d-shape-tokenization
- Shape Tokens arXiv: arxiv.org/abs/2412.15618