LEXIS-Flow: RGB & noise to HOI

[Figure: LEXIS-Flow pipeline. An input RGB image passes through an image encoder to produce visual features. A noisy body state $\mathcal{B}_{t_1}$ (tokens $\mathcal{Z}$, rotation $R$, translation $\mathbf{t}$, shape $\boldsymbol{\beta}$) and a noisy object state $\mathcal{O}_{t_2}$ (rotation $R$, translation $\mathbf{t}$), with $t_1, t_2 \in [0,1]$, enter LEXIS-Flow, a dual-stream flow-matching model with cross-attention, which yields the clean states $\hat{\mathcal{B}}$ and $\hat{\mathcal{O}}$. The LEXIS Decoder, with its VQ codebook, outputs the 3D HOI and InterFields, jointly estimated. Guidance gradients $\nabla\bigl(\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}\bigr)$ combine $\mathcal{L}_{\mathrm{mask}}$ (render-and-compare) and $\mathcal{L}_{\mathrm{IF}}$ (interaction proximity).]

Dual-stream Flow Matching: decoupled timesteps
$$\bigl[v_\theta^{\mathcal{B}},\; v_\theta^{\mathcal{O}}\bigr] = \text{LEXIS-Flow}\bigl(\mathcal{B}_{t_1},\; \mathcal{O}_{t_2},\; t_1,\; t_2,\; \mathcal{I}\bigr)$$
$t_1$ = body timestep  ·  $t_2$ = object timestep  ·  $\mathcal{I}$ = image conditioning  ·  cross-attention couples the streams

From a single RGB image—no depth, no multi-view.

A frozen image encoder extracts visual features; these condition the flow model via cross-attention.
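
To make the conditioning step concrete, here is a minimal single-head cross-attention sketch in NumPy. The state tokens act as queries and the visual features as keys and values. All shapes, the projection matrices `w_q`, `w_k`, `w_v`, and the single-head layout are illustrative assumptions; the actual LEXIS-Flow architecture is not specified here.

```python
import numpy as np

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: queries attend to context tokens."""
    q = queries @ w_q                       # (n_q, d)
    k = context @ w_k                       # (n_c, d)
    v = context @ w_v                       # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product, (n_q, n_c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over context
    return weights @ v, weights

rng = np.random.default_rng(0)
state_tokens = rng.standard_normal((4, 8))       # hypothetical body/object tokens
visual_features = rng.standard_normal((10, 12))  # hypothetical encoder output
w_q = rng.standard_normal((8, 16))
w_k = rng.standard_normal((12, 16))
w_v = rng.standard_normal((12, 16))

out, weights = cross_attention(state_tokens, visual_features, w_q, w_k, w_v)
```

Each output row is a convex combination of projected visual features, so every state token can draw on image evidence from any location.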

Body and object start as independent noisy codes, each at its own random timestep: $t_1, t_2 \in [0,1]$.
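
The sketch below shows one way the two noisy states could be built with decoupled timesteps $t_1$ and $t_2$. It assumes a linear (rectified-flow style) path where $t=0$ is pure noise and $t=1$ is the clean state; the source does not state which path or time convention is used, and the code dimensions are hypothetical.

```python
import numpy as np

def make_noisy_state(clean, t, rng):
    """Noisy state and velocity target under an ASSUMED linear path."""
    noise = rng.standard_normal(clean.shape)
    x_t = (1.0 - t) * noise + t * clean  # t=0: pure noise, t=1: clean state
    v_target = clean - noise             # constant velocity along the path
    return x_t, v_target

rng = np.random.default_rng(0)
body_clean = rng.standard_normal(16)    # hypothetical flattened body code
object_clean = rng.standard_normal(7)   # hypothetical flattened object code

# Decoupled timesteps: one independent draw per stream.
t1, t2 = rng.uniform(0.0, 1.0, size=2)
body_t1, v_body = make_noisy_state(body_clean, t1, rng)
object_t2, v_object = make_noisy_state(object_clean, t2, rng)
```

Under this assumed path, stepping from the noisy state along the target velocity for the remaining time $1 - t$ recovers the clean state exactly, which is what the velocity prediction is trained to approximate.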

LEXIS-Flow denoises both streams jointly—dual-stream cross-attention couples body and object throughout denoising.
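
As a rough illustration of joint denoising, the following sketch integrates both streams with Euler steps. `toy_velocity` is a hypothetical stand-in for LEXIS-Flow: it uses the closed-form velocity of an assumed linear path toward known targets, which a real model would have to predict from the image. Advancing both streams on the same time grid at sampling is also an assumption.

```python
import numpy as np

def toy_velocity(body, obj, t, body_target, obj_target):
    """Hypothetical stand-in for LEXIS-Flow: one velocity per stream."""
    scale = 1.0 / (1.0 - t)  # exact velocity of a linear path ending at the target
    return scale * (body_target - body), scale * (obj_target - obj)

def euler_sample(body, obj, body_target, obj_target, num_steps=10):
    """Integrate both streams from t=0 (noise) to t=1 (clean) in lockstep."""
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v_body, v_obj = toy_velocity(body, obj, t, body_target, obj_target)
        body = body + dt * v_body  # both streams are updated in the same step
        obj = obj + dt * v_obj
    return body, obj

rng = np.random.default_rng(0)
body_target = rng.standard_normal(16)  # hypothetical clean body code
obj_target = rng.standard_normal(7)    # hypothetical clean object code
body_hat, obj_hat = euler_sample(
    rng.standard_normal(16), rng.standard_normal(7), body_target, obj_target
)
```

In the real model the two velocities would depend on each other through cross-attention; the toy field omits that coupling to keep the integration loop easy to follow.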

Output: clean body code $\hat{\mathcal{B}}$ and object code $\hat{\mathcal{O}}$—coherently estimated from a single forward pass.

A frozen LEXIS Decoder $D_\psi$ lifts codes into 3D meshes and dense InterFields.
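
The figure pairs the LEXIS Decoder with a VQ codebook. A common way to use such a codebook is nearest-neighbor lookup, sketched below. Whether LEXIS snaps predicted tokens to codebook entries in exactly this way is an assumption, and the codebook size and token dimension are made up for illustration.

```python
import numpy as np

def quantize(tokens, codebook):
    """Map each token to its nearest codebook entry (squared L2 distance)."""
    diffs = tokens[:, None, :] - codebook[None, :, :]  # (n_tokens, n_codes, dim)
    dist2 = (diffs ** 2).sum(axis=-1)                  # (n_tokens, n_codes)
    indices = dist2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((8, 4))  # hypothetical: 8 codes of dimension 4
# Slightly perturbed copies of entries 3, 0 and 5 stand in for predicted tokens.
predicted = codebook[[3, 0, 5]] + 1e-3 * rng.standard_normal((3, 4))

indices, quantized = quantize(predicted, codebook)
```

The quantized tokens would then be passed to the frozen decoder $D_\psi$, which is not modeled here.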

Jointly estimated 3D body, object meshes, and dense InterField proximity in one shot.

Guidance signals — gradients of $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ nudge intermediate states during sampling.
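
A guided sampling step can be sketched as an Euler update plus a gradient correction. The quadratic `toy_loss_and_grad` below is only a placeholder for $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$, whose real forms involve rendering and InterField evaluation; the guidance scale, step size, and update rule are assumptions.

```python
import numpy as np

def toy_loss_and_grad(obj_trans, contact_point):
    """Placeholder loss: half squared distance to a hypothetical contact point."""
    diff = obj_trans - contact_point
    return 0.5 * float(diff @ diff), diff  # (loss value, gradient w.r.t. obj_trans)

def guided_euler_step(x, velocity, dt, grad, scale):
    """Euler step along the predicted velocity, nudged down the loss gradient."""
    return x + dt * velocity - scale * grad

obj_trans = np.array([1.0, 0.0, 2.0])  # hypothetical intermediate object translation
contact = np.zeros(3)                  # hypothetical contact location
velocity = np.zeros(3)                 # zero model velocity isolates the guidance term

loss_before, grad = toy_loss_and_grad(obj_trans, contact)
obj_trans_new = guided_euler_step(obj_trans, velocity, dt=0.1, grad=grad, scale=0.2)
loss_after, _ = toy_loss_and_grad(obj_trans_new, contact)
```

With the model velocity set to zero, the step moves the state 20% of the way toward the contact point, so the placeholder loss decreases.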