Reconstructing 3D Human–Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding.
InterFields: We go beyond binary contact by representing proximity as dense, continuous fields over entire body and object surfaces. Each point is mapped to its distance to the nearest counterpart surface, encoding not just where contact occurs but the full spatial proximity — a far richer signal for guiding 3D reconstruction.
InterFields encode the full proximity landscape between body and object surfaces, going beyond sparse binary contact labels.
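The per-point proximity mapping described above can be sketched as follows. This is an illustrative toy implementation, not the paper's code: surfaces are approximated by sampled point clouds, and `proximity_field` is a hypothetical helper name.

```python
# Minimal sketch of an InterFields-style dense proximity field between a
# human and an object surface, each approximated by an (N, 3) array of
# sampled surface points. Names and sampling sizes are illustrative.
import numpy as np

def proximity_field(src_pts, tgt_pts):
    """For each source point, return the distance to the nearest target point."""
    # Pairwise Euclidean distances: shape (N_src, N_tgt)
    d = np.linalg.norm(src_pts[:, None, :] - tgt_pts[None, :, :], axis=-1)
    return d.min(axis=1)

rng = np.random.default_rng(0)
human = rng.random((512, 3))   # sampled human surface points
obj = rng.random((256, 3))     # sampled object surface points

field_h = proximity_field(human, obj)  # proximity field over the body
field_o = proximity_field(obj, human)  # proximity field over the object
```

Unlike a binary contact label (thresholding the field at zero), the full field also encodes near-miss proximity, which is the richer signal the paragraph above refers to.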
LEXIS: Inferring InterFields from a single image is ill-posed. We tame this by learning LEXIS, a compact prior: a VQ-VAE dictionary of interaction signatures. Its discrete latent codes form a structured manifold of distance-aware interaction patterns, conditioned on action type and object geometry.
LEXIS-Net learns a discrete manifold of interaction signatures via VQ-VAE, encoding human-object interaction proximity information represented via InterFields.
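The discrete dictionary idea rests on the standard vector-quantization step of a VQ-VAE: continuous latents are snapped to their nearest codebook entry, so each interaction is summarized by a discrete code. A minimal sketch, with illustrative codebook size and latent dimension (LEXIS itself additionally conditions on action and object geometry):

```python
# Toy vector-quantization step of a VQ-VAE-style interaction dictionary.
# Codebook size (64) and latent dimension (16) are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))  # 64 discrete "signatures", dim 16

def quantize(z):
    """Snap continuous latents (B, 16) to their nearest codebook entries."""
    d = np.linalg.norm(z[:, None, :] - codebook[None, :, :], axis=-1)
    idx = d.argmin(axis=1)            # discrete code index per latent
    return codebook[idx], idx

z = rng.normal(size=(8, 16))          # encoder outputs for 8 interactions
z_q, codes = quantize(z)              # quantized latents + their indices
```

The discrete indices are what make the prior compact: a downstream model only has to predict codebook entries rather than an unconstrained continuous field.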
LEXIS-Flow / LEXIS-Flow*: A dual-stream Flow-Matching transformer that reconstructs the spatial configuration of the 3D human and object, conditioned on an RGB image and guided by InterFields (LEXIS signatures). LEXIS-Flow* initialises from off-the-shelf predictions and refines them, combining existing methods' estimates with the LEXIS interaction prior.
LEXIS-Flow jointly estimates meshes and InterFields in a single forward pass, guided by LEXIS signatures for proximity-aware, one-stage reconstruction.
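For readers unfamiliar with Flow Matching, the core training objective can be sketched in a few lines: sample a point on a straight path between noise and data, and regress the constant velocity that transports one to the other. This is a generic sketch of the standard objective, not the paper's implementation; the tiny linear model stands in for the dual-stream transformer.

```python
# Minimal (conditional) flow-matching objective sketch: regress the
# velocity field along straight noise-to-data paths. Shapes are toy.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=(32, 6))       # data samples (e.g. pose/shape params)
x0 = rng.normal(size=(32, 6))       # noise samples
t = rng.uniform(size=(32, 1))       # random time steps in [0, 1]

xt = (1 - t) * x0 + t * x1          # point on the straight path at time t
v_target = x1 - x0                  # constant velocity along that path

W = np.zeros((6, 6))                # toy linear velocity model v(xt) = xt @ W
v_pred = xt @ W
loss = np.mean((v_pred - v_target) ** 2)  # regression loss to minimize
```

At inference, integrating the learned velocity field from noise to t = 1 yields a sample in a single short ODE trajectory, which is what enables one-stage reconstruction.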
Existing methods produce floating objects (HOI-Gaussian, InteractVLM) or penetrations (HDM). LEXIS-Flow* produces tighter physical coupling via dense InterFields.
360° orbit renders. Navigate between examples with the arrows below.
LEXIS-Flow* recovers physically-plausible interactions from diverse in-the-wild images, producing accurate spatial configurations.
@article{antic2026lexis,
title = {{LEXIS}: {LatEnt} {ProXimal} Interaction Signatures for {3D} {HOI} from an Image},
author = {Anti\'{c}, Dimitrije and Budria, Alvaro and Paschalidis, Georgios and Dwivedi, Sai Kumar and Tzionas, Dimitrios},
journal = {arXiv preprint},
year = {2026},
}

We thank Božidar Antić and Ilya Petrov for valuable insights and discussions. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). We acknowledge HPC support by the EuroHPC Joint Undertaking that awarded access to the EuroHPC supercomputers LEONARDO (project ID EHPC-AI-2024A06-077), hosted by CINECA in Italy, and JUPITER (project ID e-reg-2025r02-393), hosted by JSC in Germany, and by the Dutch national e-infrastructure through the SURF Cooperative grant no. EINF-12852. We also acknowledge support through a research gift from Google, and the NVIDIA Academic Grant Program. This work is supported by the European Research Council (ERC) through the Starting Grant (project STRIPES, Grant agreement ID: 101165317, DOI: 10.3030/101165317, PI: D. Tzionas).