LiTo: Apple's Surface Light Field Tokenizer for Image-to-3D Generation
Apple researchers have introduced LiTo — a 3D generative model that does something most image-to-3D systems quietly skip: it actually captures how objects look from different angles, not just their shape.
Most 3D generation models give you geometry. LiTo gives you geometry plus the full surface light field — the way specular highlights shift, the way Fresnel reflections change as you rotate the object, the subtle way materials respond to complex lighting. The result is 3D objects that don’t just match the input image’s shape, but its lighting and material character too.
The Problem LiTo Solves
Take any state-of-the-art image-to-3D model — Zero123, SyncDreamer, One-2-3-45. Feed it a photo of a shiny ceramic mug under warm lighting. The output will probably capture the mug’s shape reasonably well. But rotate the 3D model and those highlights stay put, like they’re painted on. That’s because these models predict diffuse appearance — view-independent color — not view-dependent light interaction.
This is the core gap LiTo addresses.
How LiTo Works
Surface Light Fields as Training Signal
A surface light field is essentially the full description of how light exits a surface at every point, in every direction. It captures specular highlights, Fresnel reflections, subsurface scattering — all the effects that make real objects look real.
LiTo’s insight: RGB-depth image pairs already contain samples of the surface light field. Each image is a view from a specific camera position. Depth gives you the surface geometry. Together, they give you a sparse sample of what the light field looks like from that viewpoint.
By encoding random subsamples of these RGB-depth pairs, LiTo builds a compact latent representation that jointly encodes both geometry and view-dependent appearance.
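The idea that RGB-depth pairs are already surface light field samples can be made concrete: unprojecting each pixel with its depth gives a surface point, the camera position gives the outgoing direction, and the pixel color gives the radiance. A minimal sketch (all function and argument names here are illustrative, not LiTo's actual API):

```python
import numpy as np

def light_field_samples(rgb, depth, K, cam_pose):
    """Convert one RGB-depth view into sparse surface light field samples.

    Each pixel yields a (surface point, outgoing direction, radiance)
    triple: exactly the (position, direction) -> color samples a surface
    light field is built from. Illustrative sketch, not LiTo's code.
    """
    H, W = depth.shape
    # Pixel grid -> ray directions in the camera frame.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays_cam = pix @ np.linalg.inv(K).T
    # Unproject with depth to get 3D surface points in the camera frame.
    pts_cam = rays_cam * depth.reshape(-1, 1)
    # Move to the world frame with the camera-to-world pose (R | t).
    R, t = cam_pose[:3, :3], cam_pose[:3, 3]
    pts_world = pts_cam @ R.T + t
    # Outgoing direction: from the surface point toward the camera center.
    dirs = t - pts_world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    return pts_world, dirs, rgb.reshape(-1, 3)  # (x, omega, L(x, omega))
```

A set of such triples from a few views is the "sparse sample" of the light field that the tokenizer then encodes.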
Latent Flow Matching
Rather than using diffusion (slow, many denoising steps), LiTo trains a latent flow matching model on the surface light field representation. Flow matching learns a continuous transformation from a simple distribution to the complex distribution of real 3D object latents — conditioned on a single input image.
The result: given one photo, the model generates a 3D latent that captures the object’s shape and how it would look from any viewpoint under the original lighting conditions.
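The training objective behind this is the standard conditional flow matching loss: regress the velocity of a straight path from noise to a data latent. A sketch under assumed names (`velocity_net`, the latent `z1`, and `img_cond` are placeholders; LiTo's actual architecture is not public):

```python
import torch

def flow_matching_loss(velocity_net, z1, img_cond):
    """One conditional flow matching training step (sketch).

    Standard rectified-flow / flow matching recipe: interpolate between
    noise z0 and data z1, and regress the constant velocity (z1 - z0)
    with an image-conditioned network. Names are illustrative.
    """
    z0 = torch.randn_like(z1)              # sample from the base (noise) distribution
    t = torch.rand(z1.shape[0], 1)         # random time in [0, 1] per example
    zt = (1 - t) * z0 + t * z1             # point on the straight noise->data path
    target_v = z1 - z0                     # constant velocity of that path
    pred_v = velocity_net(zt, t, img_cond) # image-conditioned velocity prediction
    return ((pred_v - target_v) ** 2).mean()
```

Because the regression target is a straight-line velocity rather than a denoising score, sampling needs far fewer steps than typical diffusion.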
Key Architecture Points
- Latent vectors encode both geometry and appearance in a unified 3D latent space
- Flow matching model conditioned on single input image
- Training uses random subsamples of RGB-depth pairs per object
- Reproduces specular highlights and Fresnel reflections at inference
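At inference, generating a latent amounts to integrating the learned velocity field from noise to the data distribution. A minimal Euler-solver sketch (solver choice and step count are assumptions; real systems often use better ODE solvers):

```python
import torch

@torch.no_grad()
def generate_latent(velocity_net, img_cond, latent_shape, steps=32):
    """Integrate the learned flow from noise to a 3D latent (sketch).

    Simple fixed-step Euler integration; names and the 32-step default
    are illustrative, not LiTo's published inference setup.
    """
    z = torch.randn(latent_shape)                  # start at the noise distribution
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0], 1), i * dt)
        z = z + dt * velocity_net(z, t, img_cond)  # follow the conditional velocity field
    return z  # decode to geometry + view-dependent appearance downstream
```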
Companion Work: Apple Shape Tokens
LiTo arrives alongside Apple’s Shape Tokens paper from the ML Research team (Rick Chang, Yuyang Wang, Jiatao Gu, Josh Susskind, Oncel Tuzel, et al.).
Where LiTo focuses on appearance + geometry, Shape Tokens focuses on geometry as a first-class ML primitive:
| | LiTo | Shape Tokens |
|---|---|---|
| Focus | Geometry + view-dependent appearance | Geometry |
| Representation | Surface light field latents | Flow-matching PDFs on 3D surfaces |
| Conditioning | Single RGB image | Image, text, or 3D |
| Output | 3D with realistic materials | 3D at variable resolution |
| Code available | ❌ Not yet | ✅ Promised |
Shape Tokens model 3D surfaces as probability density functions p(x,y,z) learned via flow matching — a continuous, differentiable representation that plugs cleanly into downstream ML models. The team has promised a public release of pretrained tokenizers, image-conditioned flow models, training code, and their full data rendering pipeline.
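Modeling a surface as a density p(x, y, z) makes "variable resolution" trivial: each 3D point flows independently from Gaussian noise toward the surface, so resolution is just the number of points sampled. A sketch of that sampling loop (function names are assumptions, not the released API):

```python
import torch

@torch.no_grad()
def sample_surface_points(velocity_net, cond, n_points, steps=32):
    """Sample a point cloud from a surface modeled as a density p(x, y, z).

    Each point is transported independently from N(0, I) by the learned
    flow, so n_points directly controls output resolution. Illustrative
    sketch of the flow-matching view described above.
    """
    x = torch.randn(n_points, 3)              # i.i.d. Gaussian points in 3D
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((n_points, 1), i * dt)
        x = x + dt * velocity_net(x, t, cond)  # transport toward the surface density
    return x                                   # (n_points, 3) point cloud
```

Calling this with `n_points=1_024` or `n_points=100_000` samples the same surface at different resolutions, which is the "3D at variable resolution" property in the table above.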
Results
LiTo achieves higher visual quality and better input fidelity than existing image-to-3D methods on the paper’s benchmarks. The key wins are in view-dependent effects — outputs correctly show highlight shifts and reflection changes as viewing angle changes, which prior methods simply cannot reproduce.
No code means we can’t verify these claims independently yet. But the approach is technically sound: surface light fields are the right representation for the problem, and flow matching is the right generative backbone for this kind of structured continuous data.
Why This Matters
The inability to capture view-dependent appearance is one of the most visible failure modes of current 3D generation. Every synthetic product render, every AR object overlay, every game asset generation pipeline hits this wall — objects that look plasticky and flat because the lighting is baked into the texture rather than responding to viewpoint.
LiTo points toward a path where image-to-3D models produce assets that behave like real materials — not just shapes with textures painted on.
With Apple’s resources and the quality of the Shape Tokens companion work (which promises public code), these ideas are likely to propagate into the broader ecosystem quickly — once the code drops.
Links:
- Paper (ICLR 2026 OpenReview): openreview.net/forum?id=TVP0p4f2Su
- Apple Shape Tokens: machinelearning.apple.com/research/3d-shape-tokenization
- Shape Tokens arXiv: arxiv.org/abs/2412.15618