From a single RGB image—no depth, no multi-view.
A frozen image encoder extracts visual features; these condition the flow model via cross-attention.
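The conditioning step above can be sketched as single-head cross-attention, where noisy latent tokens query the frozen image features. This is a minimal illustration with made-up dimensions and random weights; the real LEXIS-Flow layer sizes and multi-head structure are not given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, Wq, Wk, Wv):
    """Noisy latent tokens (queries) attend to image-encoder features (context)."""
    Q = queries @ Wq                               # (n_tokens, d)
    K = context @ Wk                               # (n_feats, d)
    V = context @ Wv                               # (n_feats, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) # rows sum to 1
    return attn @ V                                # (n_tokens, d)

d = 8
tokens = rng.standard_normal((4, d))   # noisy body/object latent tokens (hypothetical)
feats  = rng.standard_normal((16, d))  # frozen image-encoder features (hypothetical)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = cross_attention(tokens, feats, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because the encoder is frozen, only the attention projections (and the flow model) would carry gradients during training.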
Body and object codes start as independent noisy latents at a random timestep $t \in [0,1]$.
LEXIS-Flow denoises both streams jointly—dual-stream cross-attention couples body and object throughout denoising.
Output: clean body code $\hat{\mathcal{B}}$ and object code $\hat{\mathcal{O}}$—coherently estimated from a single forward pass.
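The joint denoising above can be caricatured as Euler integration of a coupled ODE from noise ($t=0$) toward clean codes ($t=1$). The velocity field here is a hand-written stand-in whose two streams depend on each other, mimicking the body-object coupling; the actual model is a learned dual-stream cross-attention network and its targets are not these constants.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # toy latent dimension (hypothetical)

def velocity(body, obj, t):
    """Stand-in for the learned dual-stream velocity field.
    Each stream's drift depends on the other stream (the coupling),
    pulling both toward illustrative targets."""
    body_target = np.ones(d)
    obj_target = -np.ones(d)
    vb = (body_target - body) + 0.1 * (obj - body)
    vo = (obj_target - obj) + 0.1 * (body - obj)
    return vb, vo

# Euler integration of the joint flow from t=0 (noise) to t=1.
body = rng.standard_normal(d)
obj  = rng.standard_normal(d)
body0, obj0 = body.copy(), obj.copy()
steps = 100
dt = 1.0 / steps
for i in range(steps):
    vb, vo = velocity(body, obj, i * dt)
    body = body + dt * vb
    obj  = obj + dt * vo
```

Both streams are updated in lockstep at every step, which is what "coupled throughout denoising" amounts to operationally.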
A frozen LEXIS Decoder $D_\psi$ lifts codes into 3D meshes and dense InterFields.
→ Jointly estimated 3D body and object meshes, plus dense InterField proximity, in one shot.
Guidance signals: gradients of $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ nudge the intermediate states during sampling.
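The gradient-guidance step can be sketched as a small correction applied to the intermediate state after each sampling step. The quadratic `guidance_loss` below is a placeholder; the excerpt does not define $\mathcal{L}_{\mathrm{mask}}$ or $\mathcal{L}_{\mathrm{IF}}$, which would compare rendered masks and predicted InterFields against observations, and `lam` is an assumed step-size hyperparameter.

```python
import numpy as np

def guidance_loss(x, target):
    """Stand-in differentiable loss; proxies for L_mask + L_IF (not defined here)."""
    return 0.5 * np.sum((x - target) ** 2)

def guidance_grad(x, target):
    """Analytic gradient of the quadratic stand-in loss."""
    return x - target

x = np.array([2.0, -1.0, 0.5])  # intermediate denoising state (toy)
target = np.zeros(3)            # observation the guidance pulls toward (toy)
x0 = x.copy()
lam = 0.1                       # guidance step size (assumed hyperparameter)
for _ in range(10):
    # Nudge the intermediate state downhill on the guidance loss,
    # interleaved with (omitted) denoising steps in the real sampler.
    x = x - lam * guidance_grad(x, target)
```

In the full sampler these nudges would alternate with the flow updates, steering the trajectory without retraining the model.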