vggt-omega
Before using the models, please request access to the checkpoints here. Once your request is approved, you can download the checkpoints. Please note that access requests are reviewed by an automated process based on the information provided in the request.
| Model | Resolution | Text alignment | Download |
|---|---|---|---|
| VGGT-Omega-1B-512 | 512 | No | Link |
| VGGT-Omega-1B-256-Text-Alignment | 256 | Yes | Link |
The authors are not involved in the review process and cannot approve or reject individual applications. However, the 🤗 Hugging Face demo is available to everyone.
Quick Start
First, clone this repository and install the dependencies:
git clone git@github.com:facebookresearch/vggt-omega.git
cd vggt-omega
pip install -r requirements.txt
pip install -e .
Now, try the model with a few lines of code:
import torch
from vggt_omega.models import VGGTOmega
from vggt_omega.utils.load_fn import load_and_preprocess_images
from vggt_omega.utils.pose_enc import encoding_to_camera
checkpoint_path = "path/to/vggt_omega_1b_512.pt"
image_names = ["path/to/imageA.png", "path/to/imageB.png", "path/to/imageC.png"]
model = VGGTOmega().to("cuda").eval()
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
images = load_and_preprocess_images(image_names, image_resolution=512).to("cuda")
with torch.inference_mode():
    predictions = model(images)
    extrinsics, intrinsics = encoding_to_camera(
        predictions["pose_enc"],
        predictions["images"].shape[-2:],
    )
depth = predictions["depth"]
depth_conf = predictions["depth_conf"]
camera_and_register_tokens = predictions["camera_and_register_tokens"]
camera_tokens = camera_and_register_tokens[:, :, :1]
registers = camera_and_register_tokens[:, :, 1:]
For the text-aligned checkpoint, use VGGTOmega(enable_alignment=True) with image_resolution=256 and read predictions["text_alignment_embedding"].
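As a rough sketch, the text-aligned variant of the snippet above might look like the following. It uses only the names mentioned in this README (`VGGTOmega(enable_alignment=True)`, `image_resolution=256`, and the `text_alignment_embedding` key). The helper name `run_text_aligned` and its arguments are hypothetical, and the shape and intended use of the returned embedding are not specified here.

```python
def run_text_aligned(checkpoint_path, image_names, device="cuda"):
    """Run the text-aligned checkpoint and return its alignment embedding.

    Hypothetical helper: it only rearranges the Quick Start snippet with the
    settings this README lists for the text-aligned model.
    """
    # Imports are deferred so the helper can be defined on a machine
    # without torch or vggt_omega installed.
    import torch
    from vggt_omega.models import VGGTOmega
    from vggt_omega.utils.load_fn import load_and_preprocess_images

    # The text-aligned checkpoint expects the alignment head to be enabled.
    model = VGGTOmega(enable_alignment=True).to(device).eval()
    model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    # This checkpoint is listed at resolution 256 in the table above.
    images = load_and_preprocess_images(image_names, image_resolution=256).to(device)

    with torch.inference_mode():
        predictions = model(images)

    return predictions["text_alignment_embedding"]
```

Calling it requires the text-aligned checkpoint and a CUDA device (or another `device` string that your PyTorch build supports).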
Interactive Demo
Install the demo dependencies:
pip install -r requirements_demo.txt
Launch the Gradio demo with a local checkpoint path:
python demo_gradio.py \
    --checkpoint checkpoints/VGGT-Omega-1B-512/model.pt \
    --image-resolution 512
The demo accepts uploaded images or a video, runs camera and depth inference, and visualizes the depth-unprojected point cloud and predicted cameras as a GLB scene.
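To illustrate what "depth-unprojected" means, here is a minimal pure-Python sketch of pinhole unprojection into the camera frame. It is not the demo's implementation. It assumes the depth value is the z-coordinate along the optical axis and that the intrinsics are given as focal lengths `fx`, `fy` and principal point `cx`, `cy` in pixels. The function name and the list-of-rows input format are illustrative only.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Unproject a depth map into camera-frame 3D points.

    depth: list of rows, each a list of z-depth values (assumed convention).
    Returns a list of (x, y, z) tuples; pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row index
        for u, z in enumerate(row):        # u: pixel column index
            if z <= 0:
                continue                   # treat non-positive depth as invalid
            # Standard pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 example with one invalid pixel.
example = unproject_depth([[2.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(example)
```

Moving these points into a shared world frame additionally requires the predicted extrinsics, whose convention (camera-to-world or world-to-camera) should be checked in the library code before use.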
Runtime and GPU Memory
We benchmark the end-to-end peak GPU memory usage of VGGT-Omega-1B-512 on a
single NVIDIA A100 GPU with 624x416 input images. The measurement covers the full
inference program, from loading the model weights onto the GPU through the
forward pass, so it includes both the memory needed to store the model itself
and the memory used by inference activations and buffers. In other words, a GPU
with at least the listed available memory is able to run the corresponding
number of input frames under this setup.
| Input Frames | 1 | 10 | 25 | 50 | 100 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|---|---|---|---|
| Peak Memory (GB) | 6.02 | 6.67 | 7.80 | 9.66 | 13.37 | 20.82 | 28.26 | 35.71 | 43.15 |
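The numbers in the table grow almost linearly with the frame count, so a least-squares line through them gives a rough way to estimate how many frames fit in a given memory budget. The sketch below is only a back-of-the-envelope aid under that linearity assumption. The helper names are illustrative, and estimates beyond 500 frames, or for a different GPU or input size, are extrapolations that this benchmark does not cover.

```python
# Values copied from the benchmark table above (A100, 624x416 inputs).
FRAMES = [1, 10, 25, 50, 100, 200, 300, 400, 500]
PEAK_GB = [6.02, 6.67, 7.80, 9.66, 13.37, 20.82, 28.26, 35.71, 43.15]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_max_frames(budget_gb, slope, intercept):
    """Largest frame count whose fitted peak memory stays within budget_gb."""
    if budget_gb < intercept + slope:
        return 0  # not even a single frame fits according to the fit
    return int((budget_gb - intercept) // slope)


slope, intercept = fit_line(FRAMES, PEAK_GB)
print(f"~{slope * 1024:.0f} MB per extra frame, ~{intercept:.2f} GB fixed cost")
print("Estimated frames for a 24 GB budget:", estimate_max_frames(24.0, slope, intercept))
```

In practice, leave some headroom below the estimate, since other processes and allocator fragmentation also consume GPU memory.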
The benchmark uses load_and_preprocess_images
with the default mode="balanced" and image_resolution=512. For these roughly
3:2 landscape images, this produces 624x416 inputs. You can set
mode="max_size" to resize the longest side to 512 instead; for the same aspect
ratio, this gives about 512x336 inputs and uses less GPU memory.
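The two input sizes quoted above can be reproduced with simple arithmetic. The rule used below is an assumption inferred from those numbers, not the code of `load_and_preprocess_images`: "balanced" is modeled as matching the pixel area of a 512x512 image, "max_size" as fixing the longest side at 512, and both snap each side to a multiple of 16. All function names are hypothetical.

```python
import math


def snap(value, multiple=16):
    """Round to the nearest multiple (assumed granularity of 16 pixels)."""
    return max(multiple, round(value / multiple) * multiple)


def balanced_dims(aspect, resolution=512):
    """Assumed 'balanced' rule: keep the aspect ratio (width / height)
    while matching the area of a resolution x resolution image."""
    height = math.sqrt(resolution * resolution / aspect)
    width = height * aspect
    return snap(width), snap(height)


def max_size_dims(aspect, resolution=512):
    """Assumed 'max_size' rule: longest side equals resolution."""
    if aspect >= 1:
        width, height = resolution, resolution / aspect
    else:
        width, height = resolution * aspect, resolution
    return snap(width), snap(height)


bw, bh = balanced_dims(3 / 2)
mw, mh = max_size_dims(3 / 2)
print("balanced:", (bw, bh), "max_size:", (mw, mh))
print("pixel ratio max_size / balanced:", round(mw * mh / (bw * bh), 2))
```

For 3:2 inputs, the `max_size` result has roughly two thirds as many pixels as the `balanced` one, which is consistent with the lower memory use noted above.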
License
See the LICENSE file for details about the license under which this code is made available.
[^release]: This Release is intended to support the open source research community.
@misc{wang2026vggtomega,
title={VGGT-$\Omega$},
author={Jianyuan Wang and Minghao Chen and Shangzhan Zhang and Nikita Karaev and Johannes Schönberger and Patrick Labatut and Piotr Bojanowski and David Novotny and Andrea Vedaldi and Christian Rupprecht},
year={2026},
eprint={2605.15195},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.15195},
}
