Voyager

We evaluate the video generation quality of Voyager by comparing it with four open-source camera-controllable video generation methods on image-to-video generation. We randomly select 150 video clips from the RealEstate10K test set as our test data and use PSNR, SSIM, and LPIPS to measure the similarity between the generated frames and the ground truth.
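As an illustration of the simplest of these metrics, the following is a minimal sketch of PSNR using only the Python standard library. The function names `psnr` and `mean_psnr` are hypothetical, frames are assumed to be flattened sequences of pixel values in [0, 1], and averaging per frame is an assumption; the source does not state how the scores are aggregated. SSIM and LPIPS need dedicated implementations and are not sketched here.

```python
import math


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio (dB) between two equally sized pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_val ** 2 / mse)


def mean_psnr(pred_frames, target_frames, max_val=1.0):
    """Average per-frame PSNR over a clip (assumed aggregation, not from the source)."""
    scores = [psnr(p, t, max_val) for p, t in zip(pred_frames, target_frames)]
    return sum(scores) / len(scores)
```

Higher PSNR means the generated frame is closer to the ground truth, which is why the metric is marked with an upward arrow in the results table.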

We report the quantitative results in the table below. Our method outperforms all baselines on all three metrics, demonstrating the high generation quality of our video model. The qualitative comparison in the figure above also shows that our method generates photorealistic videos. In the last case in particular, only our method preserves the details of the products in the input image. The other methods are prone to artifacts; in the first example, for instance, they fail to produce reasonable predictions when the camera movement is large.

Quantitative comparison of novel view synthesis on *RealEstate10K*.
Method	PSNR ↑	SSIM ↑	LPIPS ↓
SEVA	16.648	0.613	0.349
ViewCrafter	16.512	0.636	0.332
See3D	18.189	0.694	0.290
FlexWorld	18.278	0.693	0.281
Voyager	18.751	0.715	0.277

Scene Generation

To evaluate the quality of scene generation, we further compare the quality of scenes reconstructed from the generated videos. Since the compared baselines produce only RGB frames, we first use VGGT to estimate camera parameters and initialize the point clouds for their generated videos. Because Voyager generates RGB-D content, its results can be used directly for 3DGS reconstruction.
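To make concrete how generated depth can seed a point cloud, the following is a minimal sketch of back-projecting a depth map into camera-space 3D points under a standard pinhole camera model. It is not the authors' implementation: the function name `unproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions for illustration, and transforming points to world space with camera poses is omitted.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows of depths) to camera-space points.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy) in pixels. Pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row (image y)
        for u, z in enumerate(row):        # u: pixel column (image x)
            if z <= 0:
                continue                   # invalid or missing depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Usage: a 2x2 depth map with one invalid pixel.
cloud = unproject_depth([[2.0, 0.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

With RGB-only output, the depths and intrinsics for this step have to come from a separate estimator such as VGGT, which is the extra reconstruction step applied to the baselines.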

In the table below, our results reconstructed post hoc with VGGT outperform the compared baselines, indicating that our generated videos are more geometrically consistent. The results improve further when the point clouds are initialized with our own depth output, which demonstrates the effectiveness of our depth generation for scene reconstruction. The qualitative results in the figure above support the same conclusion. In the last case in particular, our method retains most details of the chandelier, while the baseline methods fail to reconstruct even its basic shape.

Quantitative comparison of Gaussian Splatting reconstruction on *RealEstate10K*. The baselines require an additional reconstruction step, while Voyager performs better with its own generated depth.
Method	Post Rec.	PSNR ↑	SSIM ↑	LPIPS ↓
SEVA	VGGT	15.581	0.602	0.452
ViewCrafter	VGGT	16.161	0.628	0.440
See3D	VGGT	16.764	0.633	0.440
FlexWorld	VGGT	17.623	0.659	0.425
Voyager	VGGT	17.742	0.712	0.404
Voyager	-	18.035	0.714	0.381

World Generation

Beyond the in-domain comparison on RealEstate10K, we test Voyager on the WorldScore static benchmark for world generation. Voyager achieves the highest average score on this benchmark. The scores show that our method is competitive with 3D-based methods on camera control and spatial consistency. Its subjective quality is the highest among all methods, further demonstrating the visual quality of our generated videos. Notably, because our video condition is constructed with metric depth, the camera movement in our results is larger than in the other methods, which makes generation much harder.

Quantitative comparison on *WorldScore Benchmark*. **Bold and underline** indicates the 1st, **Bold** indicates the 2nd, underline indicates the 3rd.
Method	WorldScore Average	Camera Control	Object Control	Content Alignment	3D Consistency	Photometric Consistency	Style Consistency	Subjective Quality
WonderJourney	63.75	84.6	37.1	35.54	80.6	79.03	62.82	66.56
WonderWorld	72.69	92.98	51.76	71.25	86.87	85.56	70.57	49.81
EasyAnimate	52.85	26.72	54.5	50.76	67.29	47.35	73.05	50.31
Allegro	55.31	24.84	57.47	51.48	70.5	69.89	65.6	47.41
Gen-3	60.71	29.47	62.92	50.49	68.31	87.09	62.82	63.85
CogVideoX-I2V	62.15	38.27	40.07	36.73	86.21	88.12	83.22	62.44
Voyager	77.62	85.95	66.92	68.92	81.56	85.99	84.89	71.09

Voyager Long-Range and World-Consistent Video Diffusion for Explorable 3D Scene Generation
