Inspired by AllTracker, this project aims to extend MVTracker to dense 3D tracking at pixel-level granularity across multiple views. Efficiency and scalability will be emphasized.
Supervisor: frano.rajic@inf.ethz.ch
MVTracker was recently introduced as the first method for the task of multi-view 3D point tracking. Its representation of each tracked point with a single feature vector may limit the available context and is susceptible to noise in the backprojected depth maps. In this project, the student will explore richer ways to use the features.
Supervisor: frano.rajic@inf.ethz.ch
This project investigates combining 3D point trajectories from MVTracker with dynamic 3D Gaussian Splatting pipelines to improve optimization-based reconstruction of moving scenes.
Supervisors: frano.rajic@inf.ethz.ch, sergey.prokudin@inf.ethz.ch
The project explores adding memory and re-identification mechanisms to 2D point tracking models, enabling re-tracking of occluded or temporarily invisible points.
Supervisor: frano.rajic@inf.ethz.ch
Point tracking is the task of tracking arbitrary points on object surfaces throughout a video and is of interest for applications such as robotics. MVTracker introduces the first feedforward multi-view 3D point tracker with state-of-the-art performance across multi-view datasets. However, its performance is not robust to arbitrary scene normalizations, since it learns a tracking prior at the scene scale present in the training data. In addition, it underperforms in monocular, unbounded scenes, such as on videos from the TAPVid-2D benchmark. In this project, the student will explore the failure cases and limitations of MVTracker and work toward improving performance in those cases.
Supervisor: frano.rajic@inf.ethz.ch
Point tracking is a task that was revived in 2022, but the number and diversity of training datasets remain limited. The goal of this project is to create new dataset generation pipelines or repurpose existing ones (e.g., from video games) to generate training data for multi-view depth and 3D point tracking, and to train existing trackers on the new data.
Supervisors: frano.rajic@inf.ethz.ch, sergey.prokudin@inf.ethz.ch
TAPIP3D and MVTracker score well in monocular and multi-view 3D point tracking, but their performance on standard 2D point tracking datasets is hindered by the quality of the video depth obtained from off-the-shelf depth estimators. This project aims to make a selected 3D point tracker work well in both settings and perform 2D point tracking even when the depth is noisy or missing.
Supervisor: frano.rajic@inf.ethz.ch
Having 2–4 cameras instead of one can enable more precise and robust visual perception for robotics. The student will explore monocular and multi-view video depth estimators and work toward implementing a better method.
Supervisor: frano.rajic@inf.ethz.ch
Video depth estimation for sparse multi-camera input lacks precision and robustness, with frequent failures and considerable geometry misalignment. Such failures directly impact the precision of downstream applications such as multi-view 3D point tracking. The student will work toward improving the robustness of state-of-the-art multi-view 3D point trackers to imperfect depth.
Supervisor: frano.rajic@inf.ethz.ch
3D point tracking methods treat depth estimation as an independent problem and run point tracking on top of the depth estimates. The quality of the estimated depth can limit tracking accuracy, especially when depth estimation fails and predicts completely unreasonable geometry. However, the two problems are inherently related, and learning to solve them jointly may benefit both tasks. In this project, you will work toward developing a data-driven, feedforward method that learns, implicitly or explicitly, to solve both tasks within a single neural network.
Supervisor: frano.rajic@inf.ethz.ch
When you delete an object from a 3D scene, is it really gone? Pixels may disappear, but the scene graph, the hidden map of objects and their relationships, often keeps whispering clues: 'a chair once stood next to this table.' These symbolic residues pose privacy risks and break the illusion of a clean edit. This project takes on the challenge of erasing objects twice: once in the geometry, and once in the graph. You will design a closed-loop system that scrubs scene graphs of all traces of a removed object and then propagates the changes back into the 3D world, ensuring perfect consistency between symbolic reasoning and physical reality. The outcome not only advances privacy-preserving 3D editing, but also unlocks new possibilities for robotics, AR/VR, and spatial AI.
Supervisors: skocour@ethz.ch
Can a 'deleted' object come back to life? Even the best 3D scene editing methods often leave subtle traces that give away what was removed. This project turns those leftovers into clues: using our Remove360 dataset, you will develop a model that reconstructs the missing object—its shape, texture, and placement—so realistically that it looks like it was never gone. The work not only exposes hidden weaknesses in current removal techniques, but also pushes forward new methods for realistic scene completion with applications in AR/VR, virtual staging, and digital content creation.
Supervisors: skocour@ethz.ch
Novel view synthesis (NVS), a fundamental problem in computer vision, seeks to generate renderings from novel target viewpoints given a set of input viewpoints. Achieving this requires addressing several complex challenges: (1) inferring the geometric structure of a scene from 2D observations, (2) rendering the inferred 3D reconstruction from new viewpoints in a physically plausible manner, and (3) inpainting or extrapolating missing regions that are not observed in the input viewpoints. To tackle these challenges, diverse 3D representations, along with classical geometric constraints, advanced optimization techniques, and deep stereo priors, have been extensively studied. In recent years, diffusion generative models for 2D images and videos have demonstrated remarkable capabilities in generating photorealistic images. These advancements have opened new avenues for enhancing NVS by leveraging the priors encoded in these models. This project aims to investigate the types of prior knowledge encoded within 2D generative models that can most effectively benefit NVS. Unlike many contemporary approaches that fine-tune pretrained generative models for specific NVS tasks, this research adopts a zero-shot framework.
Supervisors: yutong.chen@inf.ethz.ch
In the era of autonomy, the creation of a 3D digital world that faithfully replicates our physical reality becomes increasingly critical. Central to this endeavor is the incorporation of realistic human behaviors. Moreover, human behaviors are intricately rooted in environments - our movements are influenced by our interactions with various objects and the spatial arrangement of our surroundings. Therefore, it is essential not only to model human motion itself but also to model how humans interact with the surrounding environment. Creating human motions within diverse environments has significant applications across numerous fields, including augmented reality (AR), virtual reality (VR), assistive robotics, biomechanics, filmmaking, and the gaming industry. However, capturing human motion in environments requires expensive devices, complicated hardware setups, and significant manual effort, and is thus not scalable for creating large-scale human-scene interaction datasets. In this project, we explore how to leverage 2D foundation models to synthesize 3D human motions in various environments in an efficient and scalable way. The project starts in December 2024 or January 2025.
Supervisor: siwei.zhang@inf.ethz.ch
Pre-trained large language models (LLMs) and vision-language models (VLMs) have demonstrated the ability to understand and autoregressively complete complex token sequences, enabling them to capture both the physical and semantic properties of a scene. By leveraging in-context learning, these models can function as general sequence modelers without requiring additional training. This project aims to explore how these zero-shot capabilities can be applied to human motion analysis tasks, such as motion prediction, generation, and denoising. By converting human motion data into token sequences, the project will assess the effectiveness of pre-trained foundation models in digital human modeling. Students will conduct a literature review, design experimental pipelines, and run tests to evaluate the feasibility of using LLMs and VLMs for motion analysis, while exploring optimal tokenization schemes and input modalities.
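As a purely illustrative starting point, the sketch below shows one possible tokenization scheme under assumed design choices: uniformly quantizing per-frame pose parameters into a small discrete vocabulary so that a motion clip can be serialized into text and handed to an LLM for in-context completion. The bin count, value range, frame/channel layout, and prompt format are all assumptions for illustration, not a prescribed design.

# A minimal sketch (not part of the project description) of one possible
# tokenization scheme: uniformly quantizing pose parameters into a small
# discrete vocabulary so a motion clip can be fed to an LLM as text.
# NUM_BINS, VAL_RANGE, and the prompt format are illustrative assumptions.
import numpy as np

NUM_BINS = 256               # assumed vocabulary size per channel
VAL_RANGE = (-np.pi, np.pi)  # assumed range for axis-angle joint rotations


def motion_to_tokens(motion: np.ndarray) -> list[str]:
    """Quantize a (T, D) motion array into per-frame token strings."""
    lo, hi = VAL_RANGE
    clipped = np.clip(motion, lo, hi)
    ids = np.round((clipped - lo) / (hi - lo) * (NUM_BINS - 1)).astype(int)
    # One "word" per frame, e.g. "12_200_31_..." over all D channels.
    return ["_".join(map(str, frame)) for frame in ids]


def tokens_to_motion(tokens: list[str], dim: int) -> np.ndarray:
    """Invert the quantization (up to binning error)."""
    lo, hi = VAL_RANGE
    ids = np.array([[int(v) for v in t.split("_")] for t in tokens])
    assert ids.shape[1] == dim
    return ids / (NUM_BINS - 1) * (hi - lo) + lo


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.uniform(-1.0, 1.0, size=(30, 72))  # 30 frames of 72-D poses
    tokens = motion_to_tokens(clip)
    # An in-context "motion completion" prompt could then look like:
    prompt = "Continue the motion sequence:\n" + "\n".join(tokens[:20])
    recon = tokens_to_motion(tokens, dim=72)
    print(len(tokens), float(np.abs(recon - clip).max()))

Learned tokenizers (e.g., VQ-VAE codebooks) would be a natural alternative for the project to compare against such naive uniform quantization.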
Supervisors: sergey.prokudin@inf.ethz.ch
Recent advancements in 3D reconstruction, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting, have led to impressive results in high-quality novel view synthesis. However, these techniques still face challenges when it comes to extracting accurate geometry, particularly in scenes with reflective or transparent surfaces. At the same time, monocular depth estimation using data-driven or diffusion-based models has shown great promise in inferring depth from a single image, and in certain controlled scenarios, access to ground-truth depth information further enables a more precise understanding of scene geometry. This project aims to investigate how depth or normal cues can be integrated into 3D reconstruction pipelines to improve geometric accuracy. The student will explore various methods for incorporating monocular geometric cues, either through direct supervision or indirectly by leveraging depth-aware features, and evaluate the effectiveness of these approaches in challenging scenarios.
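To make the direct-supervision option concrete, here is a minimal, hypothetical sketch (assuming PyTorch, a rendered depth map from the reconstruction pipeline, and a relative depth map from an off-the-shelf monocular estimator): per image, a scale and shift are fit by least squares to align the monocular prediction with the rendered depth, and the residual is penalized. Function names and the toy data are illustrative.

# Minimal sketch of scale-and-shift-invariant depth supervision.
# Assumptions: rendered depth comes from the reconstruction pipeline,
# mono depth from an off-the-shelf relative-depth estimator.
import torch


def align_scale_shift(mono: torch.Tensor, rendered: torch.Tensor):
    """Least-squares fit of scale s and shift t so that s * mono + t ≈ rendered."""
    ones = torch.ones_like(mono)
    A = torch.stack([mono, ones], dim=-1)                         # (N, 2)
    sol = torch.linalg.lstsq(A, rendered.unsqueeze(-1)).solution  # (2, 1)
    return sol[0, 0], sol[1, 0]


def monocular_depth_loss(mono: torch.Tensor, rendered: torch.Tensor) -> torch.Tensor:
    """Scale-and-shift-invariant L1 loss between monocular and rendered depth."""
    with torch.no_grad():
        s, t = align_scale_shift(mono.flatten(), rendered.flatten())
    return torch.abs(s * mono + t - rendered).mean()


if __name__ == "__main__":
    torch.manual_seed(0)
    rendered = torch.rand(1024) * 5 + 1                    # stand-in for rendered depth
    mono = (rendered - 0.5) / 2 + 0.01 * torch.randn(1024)  # relative depth, other scale/shift
    print(float(monocular_depth_loss(mono, rendered)))      # small after alignment

In practice such a term would typically be added to the photometric loss with a small weight, since monocular predictions are only reliable up to an unknown scale and shift.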
Supervisors: johannes.weidenfeller@ai.ethz.ch, lilian.calvet@balgrist.ch
This project aims to evaluate the point tracking performance of state-of-the-art dynamic 3D reconstruction methods on multi-view videos from the TAPVid-3D benchmark. In addition to the performance evaluation, failure cases will be analyzed, and improvements will be explored as time permits.
Supervisor: frano.rajic@inf.ethz.ch
Sign language is a visual means of communication that uses hand shapes, facial expressions, body movements, and gestures to convey meaning. It serves as the primary language for the deaf and hard-of-hearing communities. Technologies that capture and generate sign language can bridge communication gaps by enabling real-time translation to text or speech, providing educational tools for non-signers, and improving accessibility in public services like healthcare. This project aims to develop a generative model that can convert spoken language to 3D sign language performance by a human avatar.
Supervisors: kaifeng.zhao@inf.ethz.ch
The goal of this project is to investigate methods to learn human-scene interaction skills from 2D observations.
Supervisors: kaifeng.zhao@inf.ethz.ch, siwei.zhang@inf.ethz.ch
The goal of this project is to investigate methods to generate 3D facial animations leveraging diffusion models. Diffusion models have shown compelling results in human motion generation. Recent work leverages these models to synthesize full-body motions from sparse input (e.g. head-hand tracking signal). This project will explore extensions of this method to facial animation -- e.g., synthesizing face motion from sparse 2D/3D keypoints.
Supervisors: qianli.ma@inf.ethz.ch, fbogo@meta.com
This project aims to leverage CIRCLE, a recent 3D human motion dataset, to develop a generative human motion model that synthesizes highly complex human-scene interactions.
Supervisors: gen.li@inf.ethz.ch, yan.zhang@inf.ethz.ch
This project aims to build a system to capture interactions between people and the environment.
Supervisors: yan.zhang@inf.ethz.ch, kraus@ibk.baug.ethz.ch
This project attempts to learn object geometry and appearance from a set of 2D images while allowing for scale-specific control. We have witnessed great progress in realistic, controllable 2D image synthesis and compelling 3D-aware image generation by leveraging recent advances in volume rendering. The core idea of this project is to extend a recent 3D generator to enable a level of control over both appearance and geometry.
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: kkarunrat@inf.ethz.ch
This project attempts to reconstruct the geometry and appearance of 4D scenes (static scene + moving objects). We will start with decomposable radiance field reconstruction in a specific setting: a medium-scale static environment (a room or an outdoor street) and one class of objects (humans or cars).
Supervisors: anpei.chen@inf.ethz.ch
Supervisors: Francis Engelmann (francisengelmann@ai.ethz.ch)
Supervisors: Shengyu Huang (shengyu.huang@geod.baug.ethz.ch), Xuyang Bai (xbaiad@connect.ust.hk), Dr. Theodora Kontogianni (theodora.kontogianni@inf.ethz.ch), Prof. Dr. Konrad Schindler (konrad.schindler@geod.baug.ethz.ch)