From Seeing to Understanding: FunREC Reconstructs Functional 3D Scenes from a Single Interaction Video
Industry, vision, and robotics form one of the central themes of AI Symposium 2026, and FunREC provides a compelling example of where the field is heading. By reconstructing functional 3D digital twins of indoor environments from a single egocentric RGB-D interaction video, the new method shows how computer vision is moving beyond static scene capture toward a deeper understanding of how real-world environments can be used and interacted with. The research was led by Alexandros Delitzas and colleagues from ETH Zurich, Stanford, and MPI for Informatics; two of our featured speakers, Marc Pollefeys and Dániel Baráth, are among the co-authors.
A central challenge in 3D scene understanding is not only to reconstruct what an environment looks like, but also to capture how it works. FunREC addresses this challenge by recovering functional indoor scenes from egocentric interaction videos recorded in everyday settings. Unlike previous articulated reconstruction methods that depend on controlled capture setups, multiple object states, or CAD priors, FunREC operates directly on in-the-wild human interaction sequences.
The method automatically detects articulated parts in a scene, estimates their kinematic parameters, tracks their motion over time, and reconstructs both static geometry and movable components in canonical 3D space. The result is a simulation-compatible digital twin in which articulated elements remain interactable rather than being reduced to a static snapshot of the environment.
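To make the notion of an interactable reconstruction more concrete, the sketch below shows one plausible way such a scene model could be organized in code. It is illustrative only: the FunctionalScene and ArticulatedPart containers and their fields are assumptions made for this post, not the representation used in the paper.

    from dataclasses import dataclass, field
    from typing import List
    import numpy as np

    @dataclass
    class ArticulatedPart:
        """A movable part with its estimated kinematic model (hypothetical fields)."""
        name: str                   # e.g. "cabinet_door_left"
        mesh_vertices: np.ndarray   # (N, 3) part geometry in canonical 3D space
        joint_type: str             # "revolute" (hinge) or "prismatic" (slider)
        axis: np.ndarray            # (3,) estimated joint axis direction
        origin: np.ndarray          # (3,) a point on the axis (hinge or slide origin)
        limits: tuple               # (lower, upper) in radians or meters
        state_per_frame: List[float] = field(default_factory=list)  # tracked joint state over time

    @dataclass
    class FunctionalScene:
        """Static background geometry plus interactable parts: a functional digital twin."""
        static_vertices: np.ndarray                        # (M, 3) static scene geometry
        parts: List[ArticulatedPart] = field(default_factory=list)

    # Example: a drawer that slides 0.45 m along the scene's y-axis.
    drawer = ArticulatedPart(
        name="kitchen_drawer",
        mesh_vertices=np.zeros((100, 3)),
        joint_type="prismatic",
        axis=np.array([0.0, 1.0, 0.0]),
        origin=np.array([1.2, 0.4, 0.7]),
        limits=(0.0, 0.45),
    )

The key point is that each movable element keeps its own joint parameters and tracked state, so the reconstruction can be re-posed or simulated rather than viewed only as frozen geometry.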
FunREC outperforms prior work by a large margin on the two newly introduced benchmarks described below (RealFun4D and OmniFun4D), improving part segmentation by up to +50 mIoU, reducing articulation and pose errors by 5–10×, and achieving significantly higher reconstruction accuracy. These results suggest that the approach not only generates visually plausible 3D models, but also captures functional structure in a way that is useful for downstream tasks.
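For readers less familiar with the segmentation metric, mIoU is the standard mean intersection-over-union between predicted and ground-truth masks. The minimal sketch below shows how it is typically computed; the benchmark's exact part-matching and aggregation protocol may differ.

    import numpy as np

    def mean_iou(pred_masks, gt_masks):
        """Mean intersection-over-union over corresponding binary part masks.

        pred_masks, gt_masks: lists of boolean arrays of identical shape, one per part.
        Illustrative only; the benchmark's exact matching and aggregation may differ.
        """
        ious = []
        for pred, gt in zip(pred_masks, gt_masks):
            union = np.logical_or(pred, gt).sum()
            inter = np.logical_and(pred, gt).sum()
            ious.append(inter / union if union > 0 else 1.0)
        return float(np.mean(ious))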
The project further introduces two new datasets, RealFun4D and OmniFun4D, designed to support research on realistic functional scene understanding. RealFun4D contains 351 in-the-wild human–scene interactions recorded across 60 apartments in four countries using a head-mounted Azure Kinect DK depth camera, providing RGB video, depth, camera poses, hand and part masks, articulation parameters, and full 3D reconstructions. OmniFun4D includes 127 photorealistic simulated interactions across 12 OmniGibson scenes, rendered with NVIDIA RTX Path Tracing, and provides the same annotation types for simulated environments. Together, these datasets provide a foundation for benchmarking methods that aim to move from static 3D reconstruction toward functional digital twins.
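The annotation types listed above map naturally onto a per-sequence record. The layout below is a hypothetical sketch for illustration; the field names and organization are assumptions, not the released dataset format.

    from dataclasses import dataclass
    from typing import List
    import numpy as np

    @dataclass
    class InteractionSequence:
        """Hypothetical per-sequence record mirroring the annotation types listed above."""
        rgb_frames: List[np.ndarray]      # (H, W, 3) color images
        depth_frames: List[np.ndarray]    # (H, W) depth maps in meters
        camera_poses: List[np.ndarray]    # (4, 4) camera-to-world transforms, one per frame
        hand_masks: List[np.ndarray]      # (H, W) binary hand masks
        part_masks: List[np.ndarray]      # (H, W) masks of the manipulated articulated part
        articulation_params: dict         # joint type, axis, origin, and limits per part
        reconstruction_path: str          # path to the full 3D scene reconstruction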
The practical relevance of the work is especially clear in robotics and simulation. FunREC supports URDF/USD export, enabling reconstructed scenes to be loaded directly into physics engines such as NVIDIA Isaac Sim. As a demonstration of one potential application, the authors show how the inferred scene model can be transferred to a mobile manipulator. Using contact points, articulation parameters, and interaction trajectories extracted from human demonstrations, a Boston Dynamics Spot with Arm can reproduce the same interactions with objects such as cabinets or drawers.
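To give a flavor of what URDF export means in practice, the snippet below writes a minimal URDF for a single reconstructed articulated part: a cabinet body with a door on a revolute hinge. The mesh paths and numeric values are placeholders, and this is not FunREC's actual exporter; it only illustrates how an estimated hinge axis, origin, and joint limits end up encoded in a simulation-ready description.

    # Minimal illustrative URDF for one reconstructed articulated part: a cabinet
    # body with a door on a revolute hinge. Mesh paths and numbers are placeholders;
    # this is not FunREC's actual export code.
    urdf = """<?xml version="1.0"?>
    <robot name="reconstructed_cabinet">
      <link name="cabinet_body">
        <visual>
          <geometry><mesh filename="meshes/cabinet_body.obj"/></geometry>
        </visual>
      </link>
      <link name="cabinet_door">
        <visual>
          <geometry><mesh filename="meshes/cabinet_door.obj"/></geometry>
        </visual>
      </link>
      <joint name="door_hinge" type="revolute">
        <parent link="cabinet_body"/>
        <child link="cabinet_door"/>
        <origin xyz="0.30 0.00 0.45" rpy="0 0 0"/>  <!-- estimated hinge location -->
        <axis xyz="0 0 1"/>                          <!-- estimated hinge axis -->
        <limit lower="0.0" upper="1.57" effort="10" velocity="1.0"/>
      </joint>
    </robot>
    """

    with open("reconstructed_cabinet.urdf", "w") as f:
        f.write(urdf)

A file like this can then be imported into a physics engine such as NVIDIA Isaac Sim, where the door stays an actuated joint rather than being fused into static geometry, which is what makes the human-to-robot transfer described above possible.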
FunREC points toward a broader shift in AI and computer vision: from recognizing objects and reconstructing geometry to understanding the interactive and functional structure of real-world environments. This capability could play an important role in future robotics, simulation, and embodied AI systems that must operate reliably in human spaces.
The full paper is available here: https://functionalscenes.github.io/