[Figure: LEXIS-Flow* pipeline diagram. Labels: Input image · frozen (❄) Image Encoder, visual features · Body $\mathcal{B}_{\mathrm{start}}$ from CameraHMR^[1]: tokens $\mathcal{Z}$, rotation $R$, translation $\mathbf{t}$, shape $\boldsymbol{\beta}$ · Object $\mathcal{O}_{\mathrm{start}}$ from SAM3D^[2]: rotation $R$, translation $\mathbf{t}$ · SDEdit, $t_{\mathrm{start}} = 15/25$ · LEXIS-Flow: guided sampling, ODE integration · refined states $\hat{\mathcal{B}}$, $\hat{\mathcal{O}}$ · frozen (❄) LEXIS Decoder, codebook · output: LEXIS-Flow* refined 3D HOI · guidance: $\mathcal{L}_{\mathrm{mask}}$ (render-and-compare), $\mathcal{L}_{\mathrm{IF}}$ (interaction proximity)]

Dual-stream Flow Matching — decoupled timesteps

$$\bigl[v_\theta^{\mathcal{B}},\; v_\theta^{\mathcal{O}}\bigr] = \text{LEXIS-Flow}\bigl(\mathcal{B}_{t_1},\; \mathcal{O}_{t_2},\; t_1,\; t_2,\; \mathcal{I}\bigr)$$

$t_1$ = body timestep · $t_2$ = object timestep · cross-attention couples the streams
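The decoupled timesteps can be illustrated with a minimal sketch of how one training pair might be built. It assumes the standard flow-matching recipe (independent uniform timesteps per stream, linear noise-to-data interpolation matching the SDEdit formula on this slide, and velocity target "data minus noise"); the source does not spell out the training procedure. The function name `make_training_pair` and the array shapes are hypothetical.

```python
import numpy as np

def make_training_pair(body, obj, rng):
    """Build noisy inputs and velocity targets with independent timesteps.

    Assumption: linear path x_t = (1 - t) * eps + t * x, so the target
    velocity is d x_t / dt = x - eps for each stream.
    """
    t1 = rng.uniform()  # body timestep
    t2 = rng.uniform()  # object timestep, sampled independently of t1
    eps_b = rng.standard_normal(body.shape)
    eps_o = rng.standard_normal(obj.shape)

    body_t1 = (1.0 - t1) * eps_b + t1 * body
    obj_t2 = (1.0 - t2) * eps_o + t2 * obj

    target_b = body - eps_b
    target_o = obj - eps_o
    return (body_t1, obj_t2, t1, t2), (target_b, target_o)

rng = np.random.default_rng(0)
body = rng.standard_normal((4, 8))  # hypothetical body state
obj = rng.standard_normal((2, 8))   # hypothetical object state
inputs, targets = make_training_pair(body, obj, rng)
```

Because the two streams carry their own timesteps, one stream can be nearly clean while the other is still mostly noise; the network then has to rely on cross-attention to use the cleaner stream as context.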

SDEdit^[3] start transport — LEXIS-Flow*

$$\mathcal{X}_{t_{\mathrm{start}}} = (1 - t_{\mathrm{start}})\,\epsilon \;+\; t_{\mathrm{start}}\,\mathcal{X}_{\mathrm{start}}$$

$\epsilon$ = noise · $\mathcal{X}_{\mathrm{start}} = \{\mathcal{B}_{\mathrm{start}}, \mathcal{O}_{\mathrm{start}}\}$ from CameraHMR + SAM3D · $t_{\mathrm{start}} = 15/25$
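As a minimal sketch, the start transport is a single blend of the expert initialization with Gaussian noise. The function name `sdedit_start` and the latent shape are hypothetical; only the formula and $t_{\mathrm{start}} = 15/25$ come from the slide.

```python
import numpy as np

def sdedit_start(x_start, t_start, rng):
    """Return x_t = (1 - t_start) * eps + t_start * x_start with eps ~ N(0, I)."""
    eps = rng.standard_normal(x_start.shape)
    return (1.0 - t_start) * eps + t_start * x_start

rng = np.random.default_rng(0)
x_start = np.ones((4, 8))                   # hypothetical encoded init
x_t = sdedit_start(x_start, 15 / 25, rng)   # 40% noise, 60% initialization
```

With this convention, $t_{\mathrm{start}} = 1$ returns the initialization unchanged and $t_{\mathrm{start}} = 0$ returns pure noise, so $15/25$ keeps most of the expert prediction while leaving room for correction.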

Same architecture, same guidance — only the initialization changes.

Instead of pure noise, start from state-of-the-art expert predictions: CameraHMR body + SAM3D object.

Encode the initialization into the LEXIS latent state, then jump via SDEdit to the intermediate timestep $t_{\mathrm{start}} = 15/25$.

Run the same guided sampling — cross-attention to the image, $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ guidance gradients, frozen LEXIS Decoder.

Refined output: 3D HOI corrected by 2D mask and 3D InterField constraints — this is LEXIS-Flow*.
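The refinement loop can be sketched as Euler integration of the flow ODE from $t_{\mathrm{start}}$ to $t = 1$, with a guidance gradient subtracted from the velocity at each step. Several things here are assumptions rather than details from the slide: reading $15/25$ as step 15 of a 25-step schedule, the Euler solver, the way the guidance gradient is combined with the velocity, and the `scale` parameter. The toy velocity and toy quadratic loss stand in for the learned LEXIS-Flow network and for $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$.

```python
import numpy as np

def guided_euler_refine(x, velocity_fn, guidance_grad_fn,
                        start_step=15, num_steps=25, scale=0.1):
    """Euler-integrate dx/dt = v(x, t) - scale * grad L(x) from start_step/num_steps to 1."""
    dt = 1.0 / num_steps
    for k in range(start_step, num_steps):
        t = k / num_steps
        v = velocity_fn(x, t)       # stand-in for the learned velocity
        g = guidance_grad_fn(x)     # stand-in for the guidance-loss gradient
        x = x + dt * (v - scale * g)
    return x

# Toy stand-ins (hypothetical): a flow that transports toward `target`,
# and a quadratic guidance loss 0.5 * ||x - anchor||^2.
target = np.array([1.0, -2.0, 0.5])
anchor = np.array([1.0, -2.0, 0.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

def toy_guidance_grad(x):
    return x - anchor

rng = np.random.default_rng(0)
x_init = target + 0.3 * rng.standard_normal(3)           # imperfect expert init
x_t = (1 - 15 / 25) * rng.standard_normal(3) + (15 / 25) * x_init
refined = guided_euler_refine(x_t, toy_velocity, toy_guidance_grad)
```

Starting at step 15 of 25 leaves 10 integration steps, so the sampler only runs the final 40% of the trajectory instead of the full path from noise.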

LEXIS-Flow*: Guided Refinement
