Website borrowed from NeRFies under a Creative Commons Attribution-ShareAlike 4.0 International
Voyager consists of two key components:
(1) World-Consistent Video Diffusion: A unified architecture that jointly generates aligned RGB and depth video sequences, conditioned on existing world observations to ensure global coherence.
(2) Long-Range World Exploration: An efficient world cache with point culling, combined with auto-regressive inference and smooth video sampling, for iterative scene extension with context-aware consistency.
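To make the idea of a world cache with point culling more concrete, here is a minimal sketch in Python. It is not Voyager's actual culling rule, which this page does not specify. As a stand-in, it assumes a simple voxel-occupancy test: a new point is discarded if the cache already holds a point in the same voxel. The function name `cull_new_points` and the `voxel` parameter are hypothetical.

```python
import numpy as np


def cull_new_points(cache: np.ndarray, new_points: np.ndarray, voxel: float = 0.05) -> np.ndarray:
    """Return the cache extended by the new points that land in unoccupied voxels.

    Assumption: voxel-occupancy culling is used here for illustration only;
    it stands in for whatever criterion the real world cache applies.
    Both inputs are (N, 3) arrays of XYZ coordinates.
    """
    # Voxel indices already occupied by cached points.
    occupied = set(map(tuple, np.floor(cache / voxel).astype(np.int64).tolist()))

    kept = []
    new_keys = np.floor(new_points / voxel).astype(np.int64).tolist()
    for point, key in zip(new_points, new_keys):
        key = tuple(key)
        if key not in occupied:
            # First point seen in this voxel: keep it and mark the voxel as used.
            occupied.add(key)
            kept.append(point)

    if not kept:
        return cache
    return np.vstack([cache, np.asarray(kept)])


cache = np.array([[0.0, 0.0, 0.0]])
incoming = np.array([
    [0.01, 0.01, 0.01],  # same voxel as the cached point, so it is culled
    [1.02, 1.02, 1.02],  # new voxel, so it is kept
    [1.03, 1.03, 1.03],  # same voxel as the previous point, so it is culled
])
updated = cull_new_points(cache, incoming)
```

With a rule of this kind the cache grows only where a newly generated clip reveals unseen geometry, which is what keeps repeated auto-regressive extension affordable.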
To train Voyager, we propose a scalable data engine, i.e., a video reconstruction pipeline that automates camera pose estimation and metric depth prediction for arbitrary videos, enabling large-scale, diverse training data curation without manual 3D annotations. Using this pipeline, we compile a dataset of over 100,000 video clips, combining real-world captures and synthetic Unreal Engine renders.
We evaluate the video generation quality of Voyager by comparing it with four open-source camera-controllable video generation methods on image-to-video generation. We randomly select 150 video clips from the RealEstate test set as our test dataset and use PSNR, SSIM, and LPIPS to measure the similarity between the generated frames and the ground truth.
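For readers unfamiliar with the first of these metrics, the sketch below shows the standard PSNR definition for a single frame pair. It is an illustration, not our evaluation code: the 8-bit value range (`max_val=255`) and the function name `psnr` are assumptions, and SSIM and LPIPS are omitted because they require considerably more machinery.

```python
import numpy as np


def psnr(pred, target, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a generated frame and its ground truth.

    Assumption: pixel values lie in [0, max_val], i.e. 8-bit images by default.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        # Identical frames: PSNR is unbounded.
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy example: a uniform error of 25.5 gray levels on an 8-bit scale.
ground_truth = np.zeros((4, 4, 3))
generated = np.full((4, 4, 3), 25.5)
score = psnr(generated, ground_truth)
```

Higher PSNR means the generated frame is closer to the ground truth; a per-video score is typically the mean over frames.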
We report the quantitative results in the table on the right. Our method outperforms all the baselines, demonstrating the high generation quality of our video model. The qualitative comparison in the figure above also shows that our method generates photorealistic videos. In the last case in particular, only our method preserves the details of the products in the input image. The other methods are prone to artifacts; in the first example, for instance, they fail to produce reasonable predictions when the camera movement is too large.
To evaluate the quality of scene generation, we further compare the quality of scenes reconstructed from the generated videos. Since the compared baselines only produce RGB frames, we first use VGGT to estimate camera parameters and initialize the point clouds for the videos generated by these methods. Because our model generates RGB-D content, our results can be used directly for 3DGS reconstruction.
In the table on the right, our reconstruction results with post-hoc VGGT initialization outperform the compared baselines, indicating that our generated videos are more geometrically consistent. The results improve further when the point clouds are initialized with our own depth output, which demonstrates the effectiveness of our depth generation for scene reconstruction. The qualitative results in the figure above support the same conclusion. In the last case in particular, our method retains most details of the chandelier, while the baseline methods fail to reconstruct even its basic shape.
Beyond the in-domain comparison on RealEstate, we test Voyager on the WorldScore static benchmark for world generation. Voyager achieves the highest average score on this benchmark. The scores show that our method is competitive with 3D-based methods on camera control and spatial consistency. Our subjective quality is the highest among all methods, further demonstrating the visual quality of our generated videos. Notably, since our video condition is constructed with metric depth, the camera movement in our results is larger than in other methods, which makes the videos much harder to generate.
Method | WorldScore Average | Camera Control | Object Control | Content Alignment | 3D Consistency | Photometric Consistency | Style Consistency | Subjective Quality |
---|---|---|---|---|---|---|---|---|
WonderJourney | 63.75 | 84.6 | 37.1 | 35.54 | 80.6 | 79.03 | 62.82 | 66.56 |
WonderWorld | 72.69 | 92.98 | 51.76 | 71.25 | 86.87 | 85.56 | 70.57 | 49.81 |
EasyAnimate | 52.85 | 26.72 | 54.5 | 50.76 | 67.29 | 47.35 | 73.05 | 50.31 |
Allegro | 55.31 | 24.84 | 57.47 | 51.48 | 70.5 | 69.89 | 65.6 | 47.41 |
Gen-3 | 60.71 | 29.47 | 62.92 | 50.49 | 68.31 | 87.09 | 62.82 | 63.85 |
CogVideoX-I2V | 62.15 | 38.27 | 40.07 | 36.73 | 86.21 | 88.12 | 83.22 | 62.44 |
Voyager | 77.62 | 85.95 | 66.92 | 68.92 | 81.56 | 85.99 | 84.89 | 71.09 |
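
As a convenience for reading the table, the following sketch transcribes the scores above and reports which method leads each column. The helper name `best_per_column` is hypothetical, and the numbers are copied from the table as shown, not recomputed.

```python
COLUMNS = [
    "WorldScore Average", "Camera Control", "Object Control", "Content Alignment",
    "3D Consistency", "Photometric Consistency", "Style Consistency", "Subjective Quality",
]

# Scores transcribed row by row from the table above.
SCORES = {
    "WonderJourney": [63.75, 84.6, 37.1, 35.54, 80.6, 79.03, 62.82, 66.56],
    "WonderWorld": [72.69, 92.98, 51.76, 71.25, 86.87, 85.56, 70.57, 49.81],
    "EasyAnimate": [52.85, 26.72, 54.5, 50.76, 67.29, 47.35, 73.05, 50.31],
    "Allegro": [55.31, 24.84, 57.47, 51.48, 70.5, 69.89, 65.6, 47.41],
    "Gen-3": [60.71, 29.47, 62.92, 50.49, 68.31, 87.09, 62.82, 63.85],
    "CogVideoX-I2V": [62.15, 38.27, 40.07, 36.73, 86.21, 88.12, 83.22, 62.44],
    "Voyager": [77.62, 85.95, 66.92, 68.92, 81.56, 85.99, 84.89, 71.09],
}


def best_per_column(scores, columns):
    """Map each column name to the method with the highest score in that column."""
    return {
        col: max(scores, key=lambda method: scores[method][i])
        for i, col in enumerate(columns)
    }


leaders = best_per_column(SCORES, COLUMNS)
for column, method in leaders.items():
    print(f"{column}: {method}")
```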