vggt_mapper

Takes a video of a single scene as input and uses VGGT to extract camera poses, depth maps, point maps, and 3D point tracks.

  • The operator processes a video and extracts frames based on the specified frame number and duration.

  • It uses the VGGT model to analyze the extracted frames and generate various outputs such as camera parameters, depth maps, point maps, and 3D point tracks.

  • If 3D point tracks are required, the user must provide query points as non-normalized [x, y] coordinates, measured relative to the top-left corner.

  • The results are stored in the sample’s metadata under the specified tag field name, which defaults to ‘vggt_tags’.

  • Depending on the configuration, the operator can output camera parameters, depth maps, point maps from projection (inferred directly by VGGT), point maps from unprojection (constructed from depth maps and camera parameters), and 3D point tracks.

  • The VGGT model is loaded from the provided path, and the operator runs in CUDA mode if available.
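
The query point format described above can be illustrated with a minimal sketch. The helper `to_pixel_query_points` is hypothetical and not part of the operator; it only shows how normalized coordinates could be turned into the non-normalized [x, y] form, with the origin at the top-left corner. Whether the operator expects coordinates in the original frame size or in a resized frame is not stated here, so the 640x480 size is purely an assumption.

```python
from typing import List, Sequence


def to_pixel_query_points(
    normalized_points: Sequence[Sequence[float]],
    frame_width: int,
    frame_height: int,
) -> List[List[float]]:
    """Convert normalized (0..1) points to [x, y] pixel coordinates.

    Hypothetical helper for illustration. The origin is the top-left
    corner of the frame: x grows to the right, y grows downward.
    """
    pixel_points = []
    for nx, ny in normalized_points:
        if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
            raise ValueError(f"normalized point out of range: {(nx, ny)}")
        pixel_points.append([nx * frame_width, ny * frame_height])
    return pixel_points


# Assumed 640x480 frame; the resulting list has shape (N, 2) with N = 2.
query_points = to_pixel_query_points([[0.5, 0.5], [0.25, 0.75]], 640, 480)
```

A list built this way has the same layout as the `query_points` entries in the input samples of the demonstration section.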

Type: mapper

Tags: gpu, video

🔧 Parameter Configuration

| name | type | default | desc |
|------|------|---------|------|
| `vggt_model_path` | `<class 'str'>` | `'facebook/VGGT-1B'` | The path to the VGGT model. |
| `frame_num` | `typing.Annotated[int, Gt(gt=0)]` | `3` | The number of frames to extract uniformly from the video. If 1, only the middle frame is extracted. If 2, only the first and last frames are extracted. If larger than 2, the first and last frames are extracted along with additional frames spaced uniformly across the video duration. If `duration` > 0, `frame_num` is the number of frames per segment. |
| `duration` | `<class 'float'>` | `0` | The duration of each segment in seconds. If 0, frames are extracted from the entire video. If `duration` > 0, the video is split into segments of this length and frames are extracted from each segment. |
| `tag_field_name` | `<class 'str'>` | `'vggt_tags'` | The field name under which the tags are stored. Defaults to `'vggt_tags'`. |
| `frame_dir` | `<class 'str'>` | `DATA_JUICER_ASSETS_CACHE` | Output directory in which extracted frames are saved. |
| `if_output_camera_parameters` | `<class 'bool'>` | `True` | Determines whether to output camera parameters. |
| `if_output_depth_maps` | `<class 'bool'>` | `True` | Determines whether to output depth maps. |
| `if_output_point_maps_from_projection` | `<class 'bool'>` | `True` | Determines whether to output point maps directly inferred by VGGT. |
| `if_output_point_maps_from_unprojection` | `<class 'bool'>` | `True` | Determines whether to output point maps constructed from depth maps and camera parameters. |
| `if_output_point_tracks` | `<class 'bool'>` | `True` | Determines whether to output point tracks. If point tracks are required, the user should provide a list of 2D point coordinates (list shape: (N, 2)). Each point is given as [x, y], relative to the top-left corner, with non-normalized x/y values. |
| `args` | | `''` | Extra positional arguments. |
| `kwargs` | | `''` | Extra keyword arguments. |
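
The interaction between `frame_num` and `duration` can be sketched as follows. This is a minimal illustration of the documented rule, not the operator's actual code: the function names are hypothetical, it works on timestamps in seconds rather than frame indices, and the handling of a shorter final segment is an assumption.

```python
from typing import List


def uniform_times(start: float, end: float, frame_num: int) -> List[float]:
    """Pick frame_num timestamps in [start, end] following the documented rule."""
    if frame_num <= 0:
        raise ValueError("frame_num must be positive")
    if frame_num == 1:
        return [(start + end) / 2]  # only the middle frame
    # Two or more frames: first and last are always included,
    # the rest are spaced uniformly in between.
    step = (end - start) / (frame_num - 1)
    return [start + i * step for i in range(frame_num)]


def sample_times(video_len: float, frame_num: int = 3, duration: float = 0) -> List[float]:
    """Timestamps for the whole video (duration == 0) or per segment (duration > 0)."""
    if duration <= 0:
        return uniform_times(0.0, video_len, frame_num)
    times: List[float] = []
    start = 0.0
    while start < video_len:
        # Assumption: a trailing segment shorter than `duration` is still sampled.
        end = min(start + duration, video_len)
        times.extend(uniform_times(start, end, frame_num))
        start += duration
    return times


# A hypothetical 10-second video with frame_num=2 and duration=2
# yields 5 segments with 2 timestamps each.
timestamps = sample_times(10.0, frame_num=2, duration=2)
```

Under these assumptions, the total number of frames is `frame_num` multiplied by the number of segments, so longer videos produce more frames when `duration` > 0.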

📊 Effect demonstration

test

VggtMapper(vggt_model_path='facebook/VGGT-1B', frame_num=2, duration=2, frame_dir=DATA_JUICER_ASSETS_CACHE, if_output_camera_parameters=True, if_output_depth_maps=True, if_output_point_maps_from_projection=True, if_output_point_maps_from_unprojection=True, if_output_point_tracks=True)

📥 input data

Sample 1: 1 video
  video11.mp4
  query_points: [[320.0, 200.0], [500.72, 100.94]]
Sample 2: 1 video
  video10.mp4
  query_points: [[50.72, 100.94]]

📤 output data

Sample 1: empty
  camera_parameters_extrinsic: [1, 10, 3, 4]
  camera_parameters_intrinsic: [1, 10, 3, 3]
  depth_maps_depth_maps: [1, 10, 294, 518, 1]
  depth_maps_depth_conf: [1, 10, 294, 518]
  point_maps_from_projection_point_map: [1, 10, 294, 518, 3]
  point_maps_from_projection_point_conf: [1, 10, 294, 518]
  point_maps_from_unprojection_point_maps_from_unprojection: [10, 294, 518, 3]
  point_tracks_track_list: [1, 10, 2, 2]
  point_tracks_vis_score: [1, 10, 2]
  point_tracks_conf_score: [1, 10, 2]
Sample 2: empty
  camera_parameters_extrinsic: [1, 18, 3, 4]
  camera_parameters_intrinsic: [1, 18, 3, 3]
  depth_maps_depth_maps: [1, 18, 392, 518, 1]
  depth_maps_depth_conf: [1, 18, 392, 518]
  point_maps_from_projection_point_map: [1, 18, 392, 518, 3]
  point_maps_from_projection_point_conf: [1, 18, 392, 518]
  point_maps_from_unprojection_point_maps_from_unprojection: [18, 392, 518, 3]
  point_tracks_track_list: [1, 18, 1, 2]
  point_tracks_vis_score: [1, 18, 1]
  point_tracks_conf_score: [1, 18, 1]
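
To make the relationship between the depth maps, the 3x3 intrinsic matrix, and the 3x4 extrinsic matrix more concrete, here is a minimal sketch of unprojecting a single pixel into a 3D point. It is an illustration under stated assumptions, not the operator's implementation: the function name is hypothetical, a pinhole camera model is assumed, and the extrinsic matrix is assumed to be [R | t] mapping world coordinates to camera coordinates. The matrices and depth value in the example are made up.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def unproject_pixel(
    u: float, v: float, depth: float, intrinsic: Matrix, extrinsic: Matrix
) -> List[float]:
    """Map pixel (u, v) with a depth value to a 3D world point.

    Assumptions (not confirmed by this page): pinhole intrinsics and a
    3x4 extrinsic [R | t] that maps world -> camera coordinates.
    """
    fx, fy = intrinsic[0][0], intrinsic[1][1]
    cx, cy = intrinsic[0][2], intrinsic[1][2]
    # Back-project the pixel into camera coordinates.
    cam = [(u - cx) * depth / fx, (v - cy) * depth / fy, depth]
    # Invert the rigid transform: world = R^T (cam - t).
    diff = [cam[i] - extrinsic[i][3] for i in range(3)]
    return [sum(extrinsic[r][c] * diff[r] for r in range(3)) for c in range(3)]


# Made-up camera: focal length 500 px, principal point at (259, 147),
# identity rotation, translation of 1 unit along the camera z-axis.
K = [[500.0, 0.0, 259.0], [0.0, 500.0, 147.0], [0.0, 0.0, 1.0]]
E = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
world_point = unproject_pixel(309.0, 147.0, 2.0, K, E)
```

Applying the same computation to every pixel of every frame would produce an array shaped like the unprojection output above, with one 3D point per pixel.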

✨ explanation

The VggtMapper operator processes video data to extract 3D information such as camera parameters, depth maps, point maps, and point tracks. Each input sample consists of one video and a list of query points: two points for video11.mp4 and one for video10.mp4. The output shows the shape of each result stored for the sample rather than its values. With frame_num=2 and duration=2, two frames are extracted from every 2-second segment, so the frame dimension (10 for Sample 1, 18 for Sample 2) depends on the length of the video. The point track outputs carry one entry per query point (2 for Sample 1, 1 for Sample 2).