Same architecture, same guidance — only the initialization changes.
Instead of pure noise, start from SoTA expert predictions: CameraHMR body + SAM3D object.
Encode the init into the LEXIS latent state, then SDEdit-jump (noise the encoded init) to the intermediate timestep $t = 15 / 25$.
Run the same guided sampling — cross-attention to the image, $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ guidance gradients, frozen LEXIS Decoder.
Refined output: 3D HOI corrected by 2D mask and 3D InterField constraints — this is LEXIS-Flow*.
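The steps above can be sketched on a toy latent vector to show the control flow; this is not the LEXIS-Flow implementation. Everything named here is hypothetical: `velocity_fn` stands in for the image-conditioned network, and `guidance_grad_fn` stands in for the gradient of $\mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{IF}}$ with respect to the latent. The sketch also assumes a linear noising path, reads $t = 15 / 25$ as a noise level of 0.6 with 15 Euler steps remaining, and uses an arbitrary guidance scale. Encoding the expert predictions and decoding with the frozen decoder are left out.

```python
import numpy as np


def sdedit_jump(z_init, noise_level, rng):
    """Blend the encoded init with Gaussian noise.

    Assumption: linear path z_s = (1 - s) * z_0 + s * eps, s in [0, 1].
    """
    eps = rng.standard_normal(z_init.shape)
    return (1.0 - noise_level) * z_init + noise_level * eps


def guided_refine(z_init, velocity_fn, guidance_grad_fn,
                  start_step=15, num_steps=25, guidance_scale=0.1, seed=0):
    """Start from a noised init and run guided Euler steps down to s = 0."""
    rng = np.random.default_rng(seed)
    # Assumption: t = start_step / num_steps is the noise level (15/25 = 0.6).
    z = sdedit_jump(z_init, start_step / num_steps, rng)
    dt = 1.0 / num_steps
    for k in range(start_step, 0, -1):
        s = k / num_steps
        # Euler step along the (hypothetical) learned velocity field.
        z = z - dt * velocity_fn(z, s)
        # Guidance: one gradient step on the stand-in constraint loss.
        z = z - guidance_scale * guidance_grad_fn(z)
    return z


# Toy stand-ins so the sketch runs end to end.
target = np.array([1.0, -2.0, 0.5])      # what the "network" believes is clean
constraint = np.array([1.2, -1.8, 0.4])  # what the "guidance" pulls toward


def toy_velocity(z, s):
    # Exact velocity of the linear path when the clean sample is `target`.
    return (z - target) / s


def toy_guidance_grad(z):
    # Gradient of 0.5 * ||z - constraint||^2.
    return z - constraint


z_expert = np.array([0.8, -2.3, 0.9])    # encoded expert init (made up)
z_refined = guided_refine(z_expert, toy_velocity, toy_guidance_grad)
```

With `guidance_scale=0.0` the toy sampler lands on `target`; with guidance on, the result is pulled from `target` toward `constraint`, which mirrors how the mask and InterField terms correct the expert initialization.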