"Machines Are Learning to Understand Space" — Federico Tombari at the AI Symposium on 3D AI, Explainability, and Immersive Worlds
At the AI Symposium 2026, we spoke with Federico Tombari, Director of Research at Google Zurich and one of the leading experts in 3D computer vision, multimodal models, and autonomous perception. In our conversation, he reflected on how end-to-end AI systems are reshaping machine perception, why explainability is becoming more important as models grow more powerful, and what it will take for spatial AI and immersive 3D scene generation to move from impressive demos to real-world impact.
- Your work has long focused on helping machines move from recognising objects to understanding three-dimensional scenes. Looking back over the past decade, what would you say has been the single most significant shift in machine perception?
- There was clearly a major turning point around 2021, when we saw new breakthroughs in AI — particularly in machine learning. The emergence of large language models signalled a broader shift that made its presence felt in 3D computer vision and perception too.
One of the biggest changes has been the move towards end-to-end generalist models. In the past, many tasks were handled through pipelines made up of multiple algorithms engineered to work together. Today, we increasingly see a tendency to replace those pipelines with a single model capable of solving the same task end to end.
Ideally, that same model can then be reused across many different tasks. This brings both advantages and disadvantages. On the positive side, it allows us to concentrate knowledge about multiple tasks and challenges within one model. If we continue to provide it with sufficient data, the model can keep learning and improving across different domains.
The downside is that we are moving more and more towards giant black boxes. We may have a single end-to-end model producing the answers, but we often lack the ability to inspect it clearly and understand what is happening inside.
That becomes especially important in many real-world applications — industrial processes, manufacturing, autonomous systems, robotics, autonomous driving. When something goes wrong, we need to know what happened and how to improve the specific component that failed. For that reason, one area of AI that has existed for some time but is now becoming particularly important is explainability: the effort to open up the black box and better understand how different parts of a model work, and where exactly they fail.
- Is explainability a particularly pressing concern in visual AI — in image or video generation, for instance? With text, inaccuracies are not always immediately obvious. But with images or video, viewers tend to spot errors immediately — a hand with six fingers, say.
- Yes, although that points to a somewhat different set of challenges — ones that are more closely related to the certifiability and safety aspects of AI, even if they overlap with explainability.
Explainability becomes crucial when a machine learning model sits directly in the critical path of a real-time application. Take a car in which a machine learning model makes important decisions about how the vehicle should behave: if it makes a bad decision, that can create a dangerous situation. In such cases, you need to understand precisely what happened.
What you are referring to, by contrast, is more closely related to generative AI in digital content creation. There, one of the central challenges is determining whether a given piece of content was generated by AI or not. That is clearly something that should be addressed not only technically, through better methods for identifying and classifying AI-generated content, but also through governance.
We need policies that are shared as broadly as possible across industry and government, establishing clear guidelines for this kind of data. This also connects to the importance of traceability: being able to determine how and when a specific image or video was generated. Watermarking is one example. A number of methods are being explored in which generative models for images and video effectively mark the content they create, making it clear that it was AI-generated. That could help limit some of the risks associated with deepfakes and related challenges.

- Are we reaching a point where the average viewer can no longer reliably tell the difference between a fully AI-generated video and genuine footage?
- Absolutely — we are getting very, very close to that point. That is exactly why these kinds of safeguards need to be put in place.
There is also another issue: in some cases, generated content may closely resemble material the model saw during training. That is why it is important for models used commercially to be able to identify copyrighted material and to operate under clear policies in that respect. This is very much top of mind for companies such as Google that are working in this area. Clearer policies have already been adopted for handling training data, although of course more can still be done.
- In your presentation, you showed fascinating examples of how videos and photographs can be turned into immersive 3D environments. Which use cases do you see becoming genuinely transformative first — navigation, remote collaboration, retail, design, or something else entirely?
- There are in fact many applications that could be unlocked by this kind of technology. The goal is to push digital content creation further with generative AI — not merely producing something visually convincing, but something that is geometrically faithful, something that genuinely captures the third dimension.
That matters for a wide variety of use cases. As I mentioned in my talk, one important area is the creation of immersive environments that people can actually navigate — whether these environments are reconstructions of real places or entirely generated. This is relevant not only for gaming and mixed reality, but also for autonomous systems.
We are now seeing the emergence of so-called world models: systems that create interactive digital representations of 3D environments. Sometimes the applications are entertainment-oriented, but in other cases these models can generate highly useful, customised data for training robotic systems, robotic arms, autonomous agents, or autonomous driving models so that they perform more safely and more effectively.
For all of these applications, the third dimension is not just a visual enhancement. It has to be geometrically faithful. The underlying 3D structure must be preserved. Otherwise, you risk feeding noisy or misleading data to a robot learning to navigate an environment, or to an autonomous vehicle learning how to drive safely on real roads. If the geometry does not match the real world closely enough, the system may fail when deployed.
- You have worked in both academia and industry. How has working across both worlds shaped your thinking about what makes AI research genuinely valuable?
- Over the last few years, there has been a real rebalancing in the relationship between academia and industry, especially in AI.
Traditionally, academia was the main driver of innovation. That was where many of the disruptive ideas emerged, while industry focused more on technology transfer — taking the most promising ideas and turning them into products or applications.
In recent years, that balance has shifted. One reason is the very trend I mentioned earlier: we are increasingly training large, unified end-to-end models across many tasks, and doing so requires enormous amounts of data and compute. Access to both has become critical.
A great deal of innovation is now happening in environments where that scale of data and computation is available, and in many cases those environments are industrial rather than academic. That has changed the balance of research innovation to some extent.
But I call it a rebalancing rather than a replacement, because I do not think the importance of either side has diminished. Both continue to play a fundamental role in advancing the field. Academia still has a crucial role in pursuing disruptive, high-risk ideas, and that remains essential. Industry, meanwhile, is often better positioned to develop and scale the state of the art in large models.
What is particularly interesting now is the rise of consortia and partnerships that bring academia and industry together. These collaborations are increasingly important because they allow different institutions to pool resources and reach the critical mass needed in terms of data and compute. So one of the consequences of recent AI developments is that collaboration itself has become more important.
- It was striking to see how many different platforms can now interact with the 3D world — from phones to smart glasses to headsets. How does the experience of 3D content change when you go from a flat screen to a headset or smart glasses — and what does that transition make possible that a laptop screen simply cannot?
- The real value of immersive applications is that they enable a fundamentally different kind of user experience.
These use cases are driven by the desire for a stronger sense of presence — especially in relation to a physical world that is far away from us. If we want to connect with people who are not physically near us, immersive experiences can clearly add value. The same is true when we want to explore or learn about the world. Many concepts become easier to grasp, more intuitive, and more compelling when they are presented through immersive interfaces rather than on flat screens.
I am particularly interested in the ways AI can power rich and meaningful experiences in fields such as tutoring and education. AI can help create tools that make complex ideas easier to understand, and it can potentially make those tools much more widely available — including to people who otherwise would not have access to them. I believe that could become one of the real breakthroughs and one of the most positive impacts of AI.

- What will matter most in driving broader adoption of XR and spatial computing — better algorithms, better hardware, or a richer developer ecosystem?
- In reality, it is all of those things together. They are tightly interconnected.
On the hardware side, we are talking not only about the mechanical components and sensors, but also displays. Displays are critical for the immersive experiences we have been discussing, but they also need to be lightweight, high-definition, energy-efficient, and practical to wear. That is a very demanding engineering problem.
Another extremely important component is chipsets — mobile or embedded computing units capable of running these models and algorithms directly on device, under tight constraints related to battery life, latency, and accuracy.
Then, of course, there is the need for state-of-the-art machine learning models and algorithms. And finally, there is the developer ecosystem. That is also essential. We have already seen from the history of successful smartphone platforms how important an active and enthusiastic developer community can be. To build that, you need to provide the right tools, make the platform attractive, and give developers the freedom to be creative and turn their ideas into reality. That is another key challenge in this space.
- Looking five years ahead, what would it take to convince you that spatial AI and immersive 3D scene generation have moved decisively beyond impressive demos and into genuine mainstream use?
- It is difficult to predict the future, even just five years ahead, given how quickly things are moving.
One area I would point to is autonomous agents. At the moment, one of their major limitations is still the way they interact with the physical world. We are reaching a point where such agents can understand the world reasonably well, and they can move through it with increasing precision — in terms of obstacle avoidance, dynamic control, and overall navigation. But their ability to manipulate objects and interact meaningfully with the world remains limited.
I think the next big step will be showing that spatial intelligence can power a new generation of autonomous agents that can truly act in the physical world and manipulate the objects in it. That would open up many new applications and markets.
Of course, that would not be a matter of AI alone, or at least not of spatial intelligence alone. It would also require better robotic tools for handling and interacting with the environment. So what we are really looking for is progress on both fronts at the same time.

