LEXIS-Flow*
Guided Refinement

Figure: LEXIS-Flow* guided refinement pipeline.
Input: RGB image · Image Encoder · visual features
Body: $\mathcal{B}_{\mathrm{start}}$ from CameraHMR [1] · tokens $\mathcal{Z}$ · rotation $R$ · translation $\mathbf{t}$ · shape $\boldsymbol{\beta}$
Object: $\mathcal{O}_{\mathrm{start}}$ from SAM3D [2] · rotation $R$ · translation $\mathbf{t}$
SDEdit: both streams start at $t_{\mathrm{start}} = 15/25$
LEXIS Flow: guided sampling · ODE integration · refined states $\hat{\mathcal{B}}$, $\hat{\mathcal{O}}$
LEXIS Decoder: VQ codebook · output InterField · LEXIS-Flow* refined 3D HOI
Conditioning: cross-attention · guidance gradients $\nabla\bigl(\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}\bigr)$
Losses: $\mathcal{L}_{\mathrm{mask}}$ = render-and-compare · $\mathcal{L}_{\mathrm{IF}}$ = interaction proximity
Dual-stream Flow Matching — decoupled timesteps
$$\bigl[v_\theta^{\mathcal{B}},\; v_\theta^{\mathcal{O}}\bigr] = \text{LEXIS-Flow}\bigl(\mathcal{B}_{t_1},\; \mathcal{O}_{t_2},\; t_1,\; t_2,\; \mathcal{I}\bigr)$$
$t_1$ = body timestep  ·  $t_2$ = object timestep  ·  cross-attention couples the streams
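
To make the decoupled-timestep interface concrete, here is a minimal pure-Python sketch. It is not the LEXIS-Flow network: `toy_dual_velocity` is a hypothetical stand-in that returns the exact velocity of a linear noise-to-data path (an assumption borrowed from the SDEdit interpolation on this slide), the states are short toy vectors, and the image conditioning is accepted but ignored. It only shows that each stream carries its own timestep and returns its own velocity.

```python
import random

def lerp_state(noise, clean, t):
    """Linear path x_t = (1 - t) * noise + t * clean (t=0 is noise, t=1 is clean)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]

def toy_dual_velocity(body_t1, obj_t2, t1, t2, body_clean, obj_clean, image=None):
    """Hypothetical stand-in for LEXIS-Flow: one velocity per stream.

    On the linear path the exact velocity is (clean - x_t) / (1 - t), evaluated
    with each stream's own timestep. A trained model would predict this from
    the noisy states and the image instead of being given the clean states.
    """
    v_body = [(c - x) / (1.0 - t1) for c, x in zip(body_clean, body_t1)]
    v_obj = [(c - x) / (1.0 - t2) for c, x in zip(obj_clean, obj_t2)]
    return v_body, v_obj

random.seed(0)
body_clean, obj_clean = [0.5, -1.0, 2.0], [1.5, 0.25]  # toy "ground truth"
body_noise = [random.gauss(0.0, 1.0) for _ in body_clean]
obj_noise = [random.gauss(0.0, 1.0) for _ in obj_clean]

t1, t2 = 0.2, 0.8  # body is still noisy, object is nearly clean
body_t1 = lerp_state(body_noise, body_clean, t1)
obj_t2 = lerp_state(obj_noise, obj_clean, t2)
v_body, v_obj = toy_dual_velocity(body_t1, obj_t2, t1, t2, body_clean, obj_clean)

# One Euler step to t = 1, with a per-stream step size, recovers each clean state.
body_hat = [x + (1.0 - t1) * v for x, v in zip(body_t1, v_body)]
obj_hat = [x + (1.0 - t2) * v for x, v in zip(obj_t2, v_obj)]
```

Because the two timesteps are independent inputs, one stream can be held near its clean state while the other is still being denoised.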
SDEdit[3] start transport — LEXIS-Flow*
$$\mathcal{X}_{t_{\mathrm{start}}} = (1 - t_{\mathrm{start}})\,\epsilon \;+\; t_{\mathrm{start}}\,\mathcal{X}_{\mathrm{start}}$$
$\epsilon$ = noise  ·  $\mathcal{X}_{\mathrm{start}} = \{\mathcal{B}_{\mathrm{start}}, \mathcal{O}_{\mathrm{start}}\}$ from CameraHMR + SAM3D  ·  $t_{\mathrm{start}} = 15/25$
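
The start-transport formula above can be sketched in a few lines of Python. The helper name `sdedit_start` and the short placeholder vectors are illustrative assumptions; in the actual pipeline the inputs would be the encoded CameraHMR body and SAM3D object states.

```python
import random

def sdedit_start(x_start, t_start, rng):
    """Return (x_t, eps) with x_t = (1 - t_start) * eps + t_start * x_start."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_start]  # eps ~ N(0, I)
    x_t = [(1.0 - t_start) * e + t_start * x for e, x in zip(eps, x_start)]
    return x_t, eps

rng = random.Random(0)
t_start = 15 / 25  # 0.6: weight 0.6 on the expert prediction, 0.4 on noise

body_start = [0.1, -0.3, 0.7]  # placeholder for the encoded body state
object_start = [1.2, 0.0]      # placeholder for the encoded object state

body_t, body_eps = sdedit_start(body_start, t_start, rng)
object_t, object_eps = sdedit_start(object_start, t_start, rng)
```

At `t_start = 1` the function returns the expert prediction unchanged, and at `t_start = 0` it returns pure noise, so the timestep controls how much of the initialization survives.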

Same architecture, same guidance — only the initialization changes.

Instead of pure noise, start from state-of-the-art expert predictions: the CameraHMR body and the SAM3D object.

Encode the initialization into the LEXIS latent state, then use SDEdit to jump to the intermediate timestep $t_{\mathrm{start}} = 15/25$.

Run the same guided sampling: cross-attention to the image, $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ guidance gradients, and a frozen LEXIS Decoder.

Refined output: 3D HOI corrected by 2D mask and 3D InterField constraints. This is LEXIS-Flow*.
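
The refinement steps above can be sketched as a guided Euler integration that starts at the SDEdit timestep rather than at zero. Everything below is a toy under stated assumptions: the schedule is read as 25 uniform steps (so starting at 15/25 leaves 10), `velocity_fn` is a hypothetical oracle that pulls toward a fixed `prior_mode`, and a quadratic loss toward `evidence` stands in for $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$. The guidance scale is likewise illustrative.

```python
def guided_euler(x, t_start, num_steps_total, velocity_fn, guidance_grad_fn, scale):
    """Euler-integrate dx/dt = v(x, t) - scale * grad L(x) from t_start to 1."""
    dt = 1.0 / num_steps_total
    first_step = round(t_start * num_steps_total)  # 15 when t_start = 15/25
    for k in range(first_step, num_steps_total):   # remaining steps only
        t = k * dt
        v = velocity_fn(x, t)
        g = guidance_grad_fn(x)
        x = [xi + dt * (vi - scale * gi) for xi, vi, gi in zip(x, v, g)]
    return x

# Toy stand-ins (hypothetical, not the LEXIS components).
prior_mode = [0.0, 0.0]   # where the toy "model" wants to end up
evidence = [1.0, -1.0]    # what the toy guidance loss prefers

def velocity_fn(x, t):
    # Exact velocity of a linear path ending at prior_mode.
    return [(m - xi) / (1.0 - t) for m, xi in zip(prior_mode, x)]

def guidance_grad_fn(x):
    # Gradient of 0.5 * ||x - evidence||^2.
    return [xi - e for xi, e in zip(x, evidence)]

x_after_jump = [0.4, 0.9]  # placeholder state after the SDEdit jump
unguided = guided_euler(x_after_jump, 15 / 25, 25, velocity_fn, guidance_grad_fn, scale=0.0)
guided = guided_euler(x_after_jump, 15 / 25, 25, velocity_fn, guidance_grad_fn, scale=2.0)
```

With `scale=0.0` the trajectory lands on `prior_mode`; with a positive scale the endpoint is shifted toward `evidence`, which is the role the mask and InterField gradients play in the refinement.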

[1] CameraHMR, Patel et al., 3DV'25
[2] SAM3D, SAM3D team, arXiv'25
[3] SDEdit, Meng et al., ICLR'22