vggt-omega

Introduction: [CVPR 2026 Oral] VGGT Omega

VGGT-Ω

Project Page arXiv

Jianyuan Wang1,2 Minghao Chen1 Shangzhan Zhang1 Nikita Karaev1
Johannes Schönberger2 Patrick Labatut2 Piotr Bojanowski2 David Novotny
Andrea Vedaldi1,2 Christian Rupprecht1

1Visual Geometry Group, University of Oxford; 2Meta AI

Before using the models, please request access to the checkpoints here. Once your request is approved, you can download the checkpoints. Please note that access requests are reviewed by an automated process based on the information provided in the request.

Model                              Resolution   Text alignment   Download
VGGT-Omega-1B-512                  512          No               Link
VGGT-Omega-1B-256-Text-Alignment   256          Yes              Link

The authors are not involved in the review process and cannot approve or reject individual applications. However, the 🤗 Hugging Face demo is available to everyone.

Quick Start

First, clone this repository and install the dependencies:

git clone git@github.com:facebookresearch/vggt-omega.git
cd vggt-omega
pip install -r requirements.txt
pip install -e .

Now, try the model with a few lines of code:

import torch

from vggt_omega.models import VGGTOmega
from vggt_omega.utils.load_fn import load_and_preprocess_images
from vggt_omega.utils.pose_enc import encoding_to_camera

checkpoint_path = "path/to/vggt_omega_1b_512.pt"
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]

# Build the model on the GPU and load the downloaded checkpoint weights.
model = VGGTOmega().to("cuda").eval()
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

# Load the input images and preprocess them at the checkpoint's resolution.
images = load_and_preprocess_images(image_names, image_resolution=512).to("cuda")

# Run a single forward pass without tracking gradients.
with torch.inference_mode():
    predictions = model(images)

# Decode the pose encoding into camera extrinsics and intrinsics.
extrinsics, intrinsics = encoding_to_camera(
    predictions["pose_enc"],
    predictions["images"].shape[-2:],
)

# Depth maps and their confidence.
depth = predictions["depth"]
depth_conf = predictions["depth_conf"]

# Along the third dimension, the first token is the camera token and the rest are registers.
camera_and_register_tokens = predictions["camera_and_register_tokens"]
camera_tokens = camera_and_register_tokens[:, :, :1]
registers = camera_and_register_tokens[:, :, 1:]

For the text-aligned checkpoint, use VGGTOmega(enable_alignment=True) with image_resolution=256 and read predictions["text_alignment_embedding"].
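
The text-aligned variant can be sketched by adapting the quick-start code above. This is a minimal sketch, not a confirmed recipe: the function name run_text_aligned and its arguments are hypothetical, and it assumes the text-aligned checkpoint loads the same way as the 512 checkpoint. Only enable_alignment=True, image_resolution=256, and the "text_alignment_embedding" key come from the instructions above.

```python
def run_text_aligned(checkpoint_path, image_names, device="cuda"):
    """Hypothetical helper: return the text-alignment embedding for a set of images."""
    # Imported inside the function so that defining it does not require
    # torch or vggt_omega to be installed.
    import torch
    from vggt_omega.models import VGGTOmega
    from vggt_omega.utils.load_fn import load_and_preprocess_images

    # Assumption: the text-aligned checkpoint is loaded exactly like the 512 one.
    model = VGGTOmega(enable_alignment=True).to(device).eval()
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    # The text-aligned checkpoint is listed at resolution 256.
    images = load_and_preprocess_images(image_names, image_resolution=256).to(device)

    with torch.inference_mode():
        predictions = model(images)

    return predictions["text_alignment_embedding"]
```

Calling it with the path to the downloaded VGGT-Omega-1B-256-Text-Alignment checkpoint and a list of image paths should return the embedding; its shape and how to compare it against text features are not described here.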

Interactive Demo

Install the demo dependencies:

pip install -r requirements_demo.txt

Launch the Gradio demo with a local checkpoint path:

python demo_gradio.py \
  --checkpoint checkpoints/VGGT-Omega-1B-512/model.pt \
  --image-resolution 512

The demo accepts uploaded images or a video, runs camera and depth inference, and visualizes the depth-unprojected point cloud and predicted cameras as a GLB scene.
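
To make "depth-unprojected" concrete, here is a minimal sketch of the standard pinhole unprojection for a single pixel. It is not the demo's implementation. It assumes the predicted depth is measured along the camera z-axis, that the intrinsics follow the usual fx, fy, cx, cy layout, and that the extrinsics map world points to camera points; none of these conventions are confirmed here. The function names are hypothetical.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with z-depth `depth` to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point_cam, rotation, translation):
    """Map a camera-space point to world space.

    Assumes x_cam = R @ x_world + t, so x_world = R^T @ (x_cam - t).
    `rotation` is a 3x3 nested list and `translation` a length-3 sequence.
    """
    d = [point_cam[k] - translation[k] for k in range(3)]
    # Row i of R^T is column i of R.
    return tuple(sum(rotation[k][i] * d[k] for k in range(3)) for i in range(3))


# Example: a pixel 500 px to the right of the principal point, 2 units deep.
point_cam = unproject_pixel(820.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point_world = camera_to_world(point_cam, identity, (0.0, 0.0, 1.0))
```

Applying this to every pixel of every frame yields a point cloud in a shared world frame. In practice the same arithmetic would be vectorized over the whole depth tensor.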

Runtime and GPU Memory

We benchmark the end-to-end peak GPU memory usage of VGGT-Omega-1B-512 on a single NVIDIA A100 GPU with 624x416 input images. The measurement covers the full inference program, from loading the model weights onto the GPU through the forward pass, so it includes both the memory needed to store the model itself and the memory used by inference activations and buffers. In other words, a GPU with at least the listed available memory is able to run the corresponding number of input frames under this setup.

Input Frames       1      10     25     50     100     200     300     400     500
Peak Memory (GB)   6.02   6.67   7.80   9.66   13.37   20.82   28.26   35.71   43.15
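
The measurements grow almost linearly with the frame count (about 0.074 GB per additional frame between 100 and 500 frames), so frame counts between the listed ones can be estimated by interpolation. The sketch below is a rough planning aid under that assumption, not a measured result. It applies only to the benchmark setup (A100, 624x416 inputs), and the function names are hypothetical.

```python
from bisect import bisect_left

# Peak-memory measurements copied from the table above.
FRAMES = [1, 10, 25, 50, 100, 200, 300, 400, 500]
PEAK_GB = [6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 35.71, 43.15]


def estimate_peak_memory_gb(num_frames):
    """Linearly interpolate peak memory between the two nearest measured frame counts."""
    if not FRAMES[0] <= num_frames <= FRAMES[-1]:
        raise ValueError("num_frames is outside the measured range 1..500")
    i = bisect_left(FRAMES, num_frames)
    if FRAMES[i] == num_frames:
        return PEAK_GB[i]  # exact measurement
    f0, f1 = FRAMES[i - 1], FRAMES[i]
    m0, m1 = PEAK_GB[i - 1], PEAK_GB[i]
    return m0 + (m1 - m0) * (num_frames - f0) / (f1 - f0)


def max_frames_for_budget(budget_gb):
    """Largest frame count in 1..500 whose estimated peak memory fits the budget (0 if none)."""
    best = 0
    for n in range(FRAMES[0], FRAMES[-1] + 1):
        if estimate_peak_memory_gb(n) <= budget_gb:
            best = n
    return best
```

For example, max_frames_for_budget(24.0) returns 242 under this interpolation. Leave headroom in practice, since other processes and allocator fragmentation also consume GPU memory.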

The benchmark uses load_and_preprocess_images with the default mode="balanced" and image_resolution=512. For the roughly 3:2 landscape images used in the benchmark, this produces 624x416 inputs. You can set mode="max_size" to resize the longest side to 512 instead; for the same aspect ratio, this gives about 512x336 inputs and uses less GPU memory.
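
The two sizes quoted above can be reproduced with simple arithmetic. The sketch below is a guess at the resizing rule, not the library's code. It assumes that "balanced" matches the pixel area of a 512x512 image while keeping the aspect ratio, and that both modes round each side down to a multiple of 16. The multiple of 16, both function names, and the 3000x2000 example image are assumptions chosen because they reproduce 624x416 and 512x336.

```python
import math


def balanced_size(width, height, image_resolution=512, multiple=16):
    """Assumed 'balanced' rule: keep aspect ratio, match the area of a square image."""
    target_area = image_resolution * image_resolution
    new_w = math.isqrt(target_area * width // height)
    new_h = math.isqrt(target_area * height // width)
    # Round each side down to the assumed multiple.
    return (new_w // multiple * multiple, new_h // multiple * multiple)


def max_size(width, height, image_resolution=512, multiple=16):
    """Assumed 'max_size' rule: scale the longest side to image_resolution."""
    longest = max(width, height)
    new_w = width * image_resolution // longest
    new_h = height * image_resolution // longest
    return (new_w // multiple * multiple, new_h // multiple * multiple)


# Hypothetical 3:2 landscape image.
balanced = balanced_size(3000, 2000)  # (624, 416)
smaller = max_size(3000, 2000)        # (512, 336)

# Fraction of pixels that max_size keeps relative to balanced for this aspect ratio.
pixel_ratio = (smaller[0] * smaller[1]) / (balanced[0] * balanced[1])
```

Under these assumptions, mode="max_size" processes roughly two thirds as many pixels per frame as mode="balanced" for 3:2 images, which is consistent with the lower memory use noted above.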

License

See the LICENSE file for details about the license under which this code is made available.

[^release]: This Release is intended to support the open source research community.

@misc{wang2026vggtomega,
      title={VGGT-$\Omega$}, 
      author={Jianyuan Wang and Minghao Chen and Shangzhan Zhang and Nikita Karaev and Johannes Schönberger and Patrick Labatut and Piotr Bojanowski and David Novotny and Andrea Vedaldi and Christian Rupprecht},
      year={2026},
      eprint={2605.15195},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.15195}, 
}