LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image

* Equal contribution
1University of Amsterdam, The Netherlands 2Max Planck Institute for Intelligent Systems, Tübingen, Germany 3Aristotle University of Thessaloniki, Greece
LEXIS teaser

LEXIS reconstructs 3D Human–Object Interaction from a single RGB image. We infer dense, continuous Interaction Fields (InterFields) across human and object surfaces, and use them to guide physically-plausible 3D reconstruction — without post-hoc optimization.

TL;DR 3D Human–Object Interaction reconstruction from a single image — going beyond sparse, binary contact by modeling dense, continuous proximity (InterFields). A learned manifold of interaction signatures (LEXIS) guides the model's refinement during generation, yielding physically-plausible reconstructions in one forward pass.

Abstract

Reconstructing 3D Human–Object Interaction from an RGB image is essential for perception systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields drive a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing state-of-the-art baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding.

Method

InterFields: We go beyond binary contact by representing proximity as dense, continuous fields over entire body and object surfaces. Each point is mapped to its distance to the nearest counterpart surface, encoding not just where contact occurs but the full spatial proximity — a far richer signal for guiding 3D reconstruction.

InterFields concept

InterFields encode the full proximity landscape between body and object surfaces, going beyond sparse binary contact labels.
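The per-point proximity described above can be sketched as nearest-neighbor distances between sampled surface points. The following is a minimal illustration of this idea (function and variable names are ours; the paper operates on full mesh surfaces, not toy point sets):

```python
import numpy as np

def interfields(human_pts: np.ndarray, object_pts: np.ndarray):
    """Dense, continuous proximity fields: for every surface point,
    the distance to the nearest point on the counterpart surface.
    Illustrative sketch only -- not the paper's implementation."""
    # Pairwise Euclidean distances between the two point sets, shape (H, O).
    d = np.linalg.norm(human_pts[:, None, :] - object_pts[None, :, :], axis=-1)
    human_field = d.min(axis=1)   # per human point: distance to the object
    object_field = d.min(axis=0)  # per object point: distance to the human
    return human_field, object_field

# Toy example: a point in contact gets a field value of ~0, while
# farther points carry a continuous (not binary) proximity signal.
h = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
o = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
hf, of = interfields(h, o)
```

Unlike a binary contact mask, both fields remain informative away from the contact region, which is what makes them a dense guidance signal.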

LEXIS: Inferring InterFields from a single image is ill-posed. We tame this by learning LEXIS, a compact prior: a VQ-VAE dictionary of interaction signatures. Its discrete latent codes form a structured manifold of distance-aware interaction patterns, conditioned on action type and object geometry.

LEXIS-Net architecture

LEXIS-Net learns a discrete manifold of interaction signatures via VQ-VAE, encoding human-object interaction proximity information represented via InterFields.
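The discrete bottleneck at the heart of a VQ-VAE is a nearest-entry codebook lookup. A minimal sketch of that quantization step (names and shapes are our assumptions, not LEXIS-Net's actual configuration):

```python
import numpy as np

def quantize(z: np.ndarray, codebook: np.ndarray):
    """VQ-VAE quantization: snap each encoder output to its nearest
    codebook entry. z: (N, D) latents; codebook: (K, D) learned signatures.
    Illustrative sketch of the standard mechanism."""
    # Distance from every latent to every codebook entry, shape (N, K).
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)   # discrete codes (the "signatures")
    z_q = codebook[idx]      # quantized latents fed to the decoder
    # During training, gradients pass through via the straight-through
    # estimator: z + stop_gradient(z_q - z).
    return z_q, idx

# Toy example with a 2-entry codebook in 2D.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z = np.array([[0.9, 1.1], [0.1, -0.2]])
z_q, idx = quantize(z, codebook)
```

The resulting index sequence is the discrete interaction signature; the finite codebook is what makes the learned manifold of interaction patterns structured and compact.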

LEXIS-Flow / LEXIS-Flow*: A dual-stream Flow-Matching transformer that reconstructs the spatial configuration of the 3D human and object, conditioned on an RGB image and guided by InterFields (LEXIS signatures). LEXIS-Flow* initialises from off-the-shelf predictions and refines them, combining existing methods' estimates with the LEXIS interaction prior.

LEXIS-Flow architecture

LEXIS-Flow jointly estimates meshes and InterFields in a single forward pass, guided by LEXIS signatures for proximity-aware, one-stage reconstruction.
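At inference time, flow-matching models generate samples by integrating a learned velocity field from noise (t=0) to data (t=1). The sketch below shows only this generic sampling loop, not the paper's dual-stream transformer or its signature-guided refinement; the velocity function here is a stand-in:

```python
import numpy as np

def sample_flow(v, x0: np.ndarray, steps: int = 10) -> np.ndarray:
    """Generic Euler integration of a flow-matching velocity field v(x, t),
    carrying a noise sample x0 at t=0 to a data sample at t=1.
    Illustrative sketch; real samplers condition v on the image and prior."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * v(x, t)  # one explicit Euler step along the flow
    return x

# Toy flow with a constant velocity pointing from x0 to a target:
# integrating it over unit time lands exactly on the target.
target = np.array([1.0, 2.0])
x0 = np.zeros(2)
out = sample_flow(lambda x, t: target - x0, x0, steps=4)
```

In LEXIS-Flow the analogous field acts jointly on human and object parameters, with the InterFields guiding each step toward proximity-consistent configurations.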

Results

Input LEXIS-Flow* (Ours) HOI-Gaussian InteractVLM HDM
Qualitative comparison against state-of-the-art methods

Existing methods produce floating objects (HOI-Gaussian, InteractVLM) or penetrations (HDM). LEXIS-Flow* produces tighter physical coupling via dense InterFields.

Comparisons

360° orbit renders.


Additional Results

Additional in-the-wild results

LEXIS-Flow* recovers physically-plausible interactions from diverse in-the-wild images, producing accurate spatial configurations.

BibTeX

@article{antic2026lexis,
  title   = {{LEXIS}: {LatEnt} {ProXimal} Interaction Signatures for {3D} {HOI} from an Image},
  author  = {Anti\'{c}, Dimitrije and Budria, Alvaro and Paschalidis, Georgios and Dwivedi, Sai Kumar and Tzionas, Dimitrios},
  journal = {arXiv preprint},
  year    = {2026},
}

Acknowledgments

We thank Božidar Antić and Ilya Petrov for valuable insights and discussions. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). We acknowledge HPC support by the EuroHPC Joint Undertaking that awarded access to the EuroHPC supercomputers LEONARDO (project ID EHPC-AI-2024A06-077), hosted by CINECA in Italy, and JUPITER (project ID e-reg-2025r02-393), hosted by JSC in Germany, and by the Dutch national e-infrastructure through the SURF Cooperative grant no. EINF-12852. We also acknowledge support through a research gift from Google, and the NVIDIA Academic Grant Program. This work is supported by the European Research Council (ERC) through the Starting Grant (project STRIPES, Grant agreement ID: 101165317, DOI: 10.3030/101165317, PI: D. Tzionas).