Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle to generalize to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. SDFit has three key innovations. First, it uses a learned category-specific and morphable signed-distance function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search to the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D–3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets: Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild.
SDFit represents object shape via a morphable Signed Distance Function (mSDF) and jointly estimates 3D pose and shape by minimizing a render-and-compare objective over rendered mask, normal and depth maps. The pipeline has three modular stages — strong shape prior, retrieval-based shape initialization, and foundational-feature-based pose initialization — followed by joint iterative refinement.
High-level overview of SDFit: shape initialization via retrieval (left), pose initialization via 2D–3D correspondences from foundational features (middle), and joint pose+shape refinement via render-and-compare (right).
① mSDF: morphable shape prior. We represent object shape with a learned, category-level morphable SDF based on Deep Implicit Templates (DIT). A compact latent code $z \in \mathbb{R}^{d}$ parameterizes a continuous SDF whose 0-level set defines the surface. This encodes the manifold of valid shapes, supports arbitrary topology, and yields dense correspondences across all morphed shapes, which is the same role SMPL plays for bodies.
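The surface definition above can be written compactly. The symbol $f_\theta$ for the learned network and $\mathcal{S}(z)$ for the resulting surface are notation assumed here for illustration, not taken from the source:

$$
\mathcal{S}(z) = \{\, x \in \mathbb{R}^{3} \;:\; f_\theta(x; z) = 0 \,\}, \qquad z \in \mathbb{R}^{d}
$$

Changing $z$ moves the zero-level set, so a single network $f_\theta$ can describe a whole family of shapes within a category.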
② Shape initialization via retrieval. We embed both the input image and every training shape into OpenShape's joint 2D–3D latent space, and retrieve the mSDF shape whose embedding is closest to the image. This is fast, scales to large databases, and gives a strong starting hypothesis $z_{\text{init}}$.
Shape initialization: retrieval in a joint 2D–3D foundational latent space.
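The retrieval step can be sketched as a nearest-neighbor search over embeddings. This is a minimal sketch, not the authors' implementation: the function name `retrieve_nearest_shape`, the use of cosine similarity as the closeness measure, and the toy 3-dimensional embeddings are all assumptions; real embeddings would come from the joint 2D–3D encoder.

```python
import numpy as np

def retrieve_nearest_shape(image_embedding, shape_embeddings):
    """Return (index, similarity) of the database shape closest to the image.

    Both inputs are assumed to live in the same joint 2D-3D embedding space.
    Closeness is measured by cosine similarity (an assumption of this sketch).
    """
    query = np.asarray(image_embedding, dtype=np.float64)
    bank = np.asarray(shape_embeddings, dtype=np.float64)
    # L2-normalize so that a dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    similarities = bank @ query
    best = int(np.argmax(similarities))
    return best, float(similarities[best])

# Toy data: three hypothetical shape embeddings and one image embedding.
shape_bank = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.6, 0.8, 0.0]])
image_vec = np.array([0.5, 0.9, 0.1])
best_index, best_similarity = retrieve_nearest_shape(image_vec, shape_bank)
```

The latent code stored for the shape at `best_index` would then serve as $z_{\text{init}}$. Because shape embeddings can be precomputed once, the per-image cost is a single matrix-vector product.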
③ Pose initialization via foundational correspondences. We decorate the initial mSDF shape with per-vertex ControlNet+DINO features, and match these to per-pixel features extracted from the image. This yields dense 2D–3D correspondences which, combined with intrinsics from PerspectiveFields, drive a RANSAC+PnP pose estimate — purely zero-shot.
Pose initialization: dense 2D–3D correspondences from foundational features.
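The matching step that produces 2D–3D correspondences can be sketched as follows. This is a hedged illustration under stated assumptions: the function name `match_vertices_to_pixels`, the vertex-to-pixel matching direction, the cosine-similarity criterion, and the `min_similarity` threshold are hypothetical choices, and random toy features stand in for the foundational features.

```python
import numpy as np

def match_vertices_to_pixels(vertex_features, pixel_features, min_similarity=0.5):
    """Match each 3D vertex to its most similar pixel by cosine similarity.

    vertex_features: (V, C) array, one feature per mesh vertex.
    pixel_features:  (H, W, C) array, one feature per image pixel.
    Returns (vertex_ids, pixel_xy): indices of matched vertices and their
    (x, y) pixel coordinates. Matches below min_similarity are dropped.
    """
    v = np.asarray(vertex_features, dtype=np.float64)
    p = np.asarray(pixel_features, dtype=np.float64)
    height, width, channels = p.shape
    p = p.reshape(-1, channels)
    # Normalize features; the epsilon guards against all-zero feature vectors.
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-12)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + 1e-12)
    similarity = v @ p.T                       # shape (V, H*W)
    best_pixel = similarity.argmax(axis=1)     # flat pixel index per vertex
    best_score = similarity.max(axis=1)
    keep = best_score >= min_similarity
    vertex_ids = np.nonzero(keep)[0]
    ys, xs = np.divmod(best_pixel[keep], width)  # flat index -> (row, col)
    return vertex_ids, np.stack([xs, ys], axis=1)

# Toy example: vertex features are copied from known pixels of a 4x5 image.
rng = np.random.default_rng(0)
pixel_feats = rng.normal(size=(4, 5, 8))
true_xy = [(0, 0), (3, 1), (4, 3)]
vertex_feats = np.stack([pixel_feats[y, x] for x, y in true_xy])
vertex_ids, pixel_xy = match_vertices_to_pixels(vertex_feats, pixel_feats)
```

The 3D positions of the matched vertices, the matched pixel coordinates, and a camera intrinsics matrix could then be handed to a PnP solver with RANSAC, for example OpenCV's `cv2.solvePnPRansac`, to obtain an initial rotation and translation.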
④ Render-and-compare refinement. Starting from $(z_{\text{init}}, R_{\text{init}}, t_{\text{init}})$, we differentiably extract a mesh (FlexiCubes), pose it, render it (nvdiffrast), and compare the renderings against image-predicted mask, normal and depth maps. Gradients flow back through the renderer into $(z, R, t)$, iteratively refining both pose and shape until convergence. A canonical-space regularizer allows topology to evolve (e.g. growing missing armrests) without falling into local minima.
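One way to read the refinement stage is as the minimization of a weighted sum of discrepancy terms. The weights $\lambda$ and the additive form below are assumptions made for illustration; the source states only which maps are compared and that a canonical-space regularizer is used:

$$
(\hat{z}, \hat{R}, \hat{t}) = \arg\min_{z, R, t} \; \lambda_{m} E_{\text{mask}} + \lambda_{n} E_{\text{normal}} + \lambda_{d} E_{\text{depth}} + \lambda_{r} E_{\text{reg}}
$$

Here $E_{\text{mask}}$, $E_{\text{normal}}$ and $E_{\text{depth}}$ measure the difference between the maps rendered from the posed mesh and the corresponding image-predicted maps, and $E_{\text{reg}}$ denotes the canonical-space regularizer. Since mesh extraction and rendering are differentiable, each term has gradients with respect to $(z, R, t)$.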
@inproceedings{antic2025sdfit,
title = {{SDFit}: {3D} Object Pose and Shape by Fitting a Morphable {SDF} to a Single Image},
author = {Anti\'{c}, Dimitrije and Paschalidis, Georgios and Tripathi, Shashank and Gevers, Theo and Dwivedi, Sai Kumar and Tzionas, Dimitrios},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
}
Our method builds upon prior open-source efforts. We thank the authors for releasing their code and models: Deep Implicit Templates, nvdiffrast, FlexiCubes, OpenShape, and Diff3F.
We thank Božidar Antić, Yuliang Xiu and Muhammed Kocabas for useful insights. We acknowledge EuroHPC JU for awarding the project ID EHPC-AI-2024A06-077 access to Leonardo BOOSTER. This work also used the Dutch national e-infrastructure with the support of the SURF Cooperative using grant no. EINF-7589. This work is partly supported by the ERC Starting Grant (project STRIPES, 101165317, PI: D. Tzionas). D. Tzionas has received a research gift from Google, and from the NVIDIA Academic Grant Program.