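[Figure: LEXIS-Flow overview. A frozen (❄) image encoder maps the input image to visual features. The noisy body state $\mathcal{B}_{t_1}$ (tokens $\mathcal{Z}$, rotation $R$, translation $\mathbf{t}$, shape $\boldsymbol{\beta}$) and the noisy object state $\mathcal{O}_{t_2}$ (rotation $R$, translation $\mathbf{t}$), with $t_1, t_2 \in [0,1]$, enter LEXIS-Flow, a dual-stream flow-matching model that outputs the clean states $\hat{\mathcal{B}}$ and $\hat{\mathcal{O}}$. A frozen (❄) LEXIS Decoder with a codebook then produces the 3D HOI and InterFields, jointly estimated. Guidance losses: $\mathcal{L}_{\mathrm{mask}}$ (render-and-compare) and $\mathcal{L}_{\mathrm{IF}}$ (interaction proximity).]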

Dual-stream Flow Matching — decoupled timesteps

$$\bigl[v_\theta^{\mathcal{B}},\; v_\theta^{\mathcal{O}}\bigr] = \text{LEXIS-Flow}\bigl(\mathcal{B}_{t_1},\; \mathcal{O}_{t_2},\; t_1,\; t_2,\; \mathcal{I}\bigr)$$

$t_1$ = body timestep · $t_2$ = object timestep · cross-attention couples the streams
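
For reference, one common way to train a two-stream velocity field of this form is conditional flow matching along a straight-line path. The formulation below is a sketch under that assumption, not something the source states: the noise samples $\boldsymbol{\epsilon}^{\mathcal{B}}, \boldsymbol{\epsilon}^{\mathcal{O}}$, the clean targets $\mathcal{B}_1, \mathcal{O}_1$, the loss name $\mathcal{L}_{\mathrm{FM}}$, and the convention that $t = 0$ is noise and $t = 1$ is data are all introduced here for illustration.

$$
\begin{aligned}
\mathcal{B}_{t_1} &= (1 - t_1)\,\boldsymbol{\epsilon}^{\mathcal{B}} + t_1\,\mathcal{B}_1, \qquad
\mathcal{O}_{t_2} = (1 - t_2)\,\boldsymbol{\epsilon}^{\mathcal{O}} + t_2\,\mathcal{O}_1, \\
\mathcal{L}_{\mathrm{FM}} &= \mathbb{E}_{t_1,\, t_2,\, \boldsymbol{\epsilon}}\Bigl[\bigl\| v_\theta^{\mathcal{B}} - (\mathcal{B}_1 - \boldsymbol{\epsilon}^{\mathcal{B}}) \bigr\|^2 + \bigl\| v_\theta^{\mathcal{O}} - (\mathcal{O}_1 - \boldsymbol{\epsilon}^{\mathcal{O}}) \bigr\|^2\Bigr]
\end{aligned}
$$

Under this assumed setup, $t_1$ and $t_2$ are drawn independently from $[0,1]$, so the network sees body and object at different noise levels within the same training example.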

Input: a single RGB image, with no depth and no multi-view capture.

A frozen image encoder extracts visual features; these condition the flow model via cross-attention.
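
The conditioning step above can be illustrated with plain scaled dot-product cross-attention, where state tokens act as queries and visual features act as keys and values. This is a minimal sketch of the generic mechanism only: it has a single head and no learned projections, and the function name `cross_attention` and the toy inputs are hypothetical rather than taken from the LEXIS-Flow implementation.

```python
import math


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention without learned projections.

    queries: list of state-token vectors (e.g. noisy body or object tokens).
    keys, values: lists of visual-feature vectors from the image encoder.
    Returns one output vector per query plus the attention weights.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every visual feature, scaled by sqrt(dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Numerically stable softmax over the visual features.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy inputs: two state tokens attending over three visual features.
state_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_attention(state_tokens, visual_feats, visual_feats)
```

Each row of `weights` is a probability distribution over the visual features, so every state token receives a convex combination of image evidence.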

The body and object start as independent noisy codes, each at its own randomly sampled timestep, $t_1, t_2 \in [0,1]$.
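
As a rough illustration of how such noisy starting codes could be built, the sketch below blends Gaussian noise with a clean code along a straight line, using one independent timestep per stream. The linear path, the convention that $t = 0$ is pure noise and $t = 1$ is clean, the helper `noisy_state`, and the flattened toy codes are all assumptions made for the example.

```python
import random


def noisy_state(clean, t, rng):
    """Blend Gaussian noise and a clean code along a straight line.

    Assumed convention: t = 0 is pure noise, t = 1 is the clean code.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    return [(1.0 - t) * n + t * x for n, x in zip(noise, clean)]


rng = random.Random(0)

# Hypothetical flattened codes; the real body and object parameterizations differ.
body_clean = [0.5, -1.0, 2.0, 0.1]
object_clean = [1.5, 0.0, -0.3]

# Decoupled timesteps: one independent draw per stream.
t1 = rng.random()
t2 = rng.random()

body_t1 = noisy_state(body_clean, t1, rng)
object_t2 = noisy_state(object_clean, t2, rng)
```

Because `t1` and `t2` are separate draws, one stream can be nearly clean while the other is still mostly noise.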

LEXIS-Flow denoises both streams jointly—dual-stream cross-attention couples body and object throughout denoising.
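
The joint denoising described above can be sketched as an Euler integration of both streams from noise to clean codes. Everything here is illustrative: `toy_velocity` is a hypothetical stand-in for the learned network that simply points at fixed targets, the shared inference schedule ($t_1 = t_2$ at every step) is an assumption, and the number of steps is arbitrary.

```python
def toy_velocity(body, obj, t1, t2, targets):
    """Stand-in for the learned LEXIS-Flow network (hypothetical).

    Returns the straight-line velocity toward fixed targets. A real model
    would predict both velocities from both streams and the image features.
    """
    body_target, obj_target = targets
    v_body = [(x1 - x) / (1.0 - t1) for x, x1 in zip(body, body_target)]
    v_obj = [(x1 - x) / (1.0 - t2) for x, x1 in zip(obj, obj_target)]
    return v_body, v_obj


def sample(velocity_fn, body, obj, cond, num_steps=10):
    """Integrate both streams from t = 0 (noise) to t = 1 (clean) with Euler steps."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        # Assumption: both streams share one schedule at inference (t1 = t2 = t).
        v_body, v_obj = velocity_fn(body, obj, t, t, cond)
        body = [x + dt * v for x, v in zip(body, v_body)]
        obj = [x + dt * v for x, v in zip(obj, v_obj)]
    return body, obj


# Hypothetical targets and noisy starting points for the two streams.
targets = ([0.5, -1.0, 2.0], [1.5, 0.0])
body_hat, object_hat = sample(toy_velocity, [0.3, 0.9, -1.2], [-0.7, 0.4], targets)
```

With this toy velocity field the final Euler step lands on the targets, which makes the loop easy to check; a learned field would only approximate the clean codes.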

Output: clean body code $\hat{\mathcal{B}}$ and object code $\hat{\mathcal{O}}$—coherently estimated from a single forward pass.

A frozen LEXIS Decoder $D_\psi$ lifts codes into 3D meshes and dense InterFields.

→ Jointly estimated 3D body, object meshes, and dense InterField proximity in one shot.

Guidance signals — gradients of $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ nudge intermediate states during sampling.
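
One simple way to picture this guidance is a sampling step followed by a small move against the loss gradient. The sketch below assumes that form; the source does not specify whether the gradient is applied to the state directly or through a decoded estimate. The quadratic `toy_loss` is a hypothetical stand-in for $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$, whose real gradients would presumably come from automatic differentiation, and `guidance_scale` is an illustrative parameter.

```python
def guided_step(state, velocity, dt, loss_grad, guidance_scale):
    """One Euler step followed by a gradient nudge that lowers a guidance loss."""
    stepped = [x + dt * v for x, v in zip(state, velocity)]
    grad = loss_grad(stepped)
    return [x - guidance_scale * g for x, g in zip(stepped, grad)]


def toy_loss(state, goal):
    """Hypothetical quadratic stand-in for L_mask + L_IF."""
    return 0.5 * sum((x - g) ** 2 for x, g in zip(state, goal))


def toy_loss_grad(goal):
    """Analytic gradient of toy_loss with respect to the state."""
    return lambda state: [x - g for x, g in zip(state, goal)]


goal = [1.0, 2.0]
state = [0.0, 0.0]
velocity = [0.5, 0.5]

# Same Euler step with guidance switched off and on.
unguided = guided_step(state, velocity, 0.1, toy_loss_grad(goal), 0.0)
guided = guided_step(state, velocity, 0.1, toy_loss_grad(goal), 0.2)
```

Setting `guidance_scale` to zero recovers the plain Euler step, so the strength of the nudge can be tuned without changing the sampler.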

LEXIS-Flow: RGB & noise to HOI