Motion control and compositing are two essential capabilities for controllable video generation.
Existing approaches address these tasks separately: trajectory-conditioned image-to-video generation typically restricts content insertion to the first frame, while reference-to-video generation lacks precise spatial and temporal control over how reference content is composed into the generated frames.
In this work, we unify both capabilities within a single model, Go-with-the-Track, by introducing joint conditioning on multiple reference images and reference-anchored point-tracks.
While conventional point-tracks are defined as the 2D trajectory of a point strictly within the generated video sequence, we extend this definition by anchoring the tracks to the reference images, explicitly establishing point correspondences between the generated video frames and the reference content.
Treating these correspondences as integral to point-track conditioning, Go-with-the-Track enables fine-grained compositing and motion control throughout the video.
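To make the representation concrete, the sketch below shows one way such a reference-anchored track bundle could be stored; all names and shapes are illustrative assumptions, not the format used in training.

```python
import numpy as np

# Illustrative container for reference-anchored point-tracks (hypothetical
# layout). A conventional track stores only per-frame 2D positions; the
# anchor arrays add, for each track, its 2D position in each reference
# image, making the frame-to-reference correspondences explicit.
T, N, R = 81, 256, 4          # frames, tracks, reference images

tracks = np.zeros((T, N, 2), dtype=np.float32)       # (x, y) per frame
visibility = np.ones((T, N), dtype=bool)             # per-frame occlusion mask
ref_anchors = np.zeros((R, N, 2), dtype=np.float32)  # (x, y) in each reference
ref_valid = np.ones((R, N), dtype=bool)              # is track n visible in ref r?

# Track n is "anchored" to reference r wherever ref_valid[r, n] is True:
# the model can source appearance from ref_anchors[r, n] and place it at
# tracks[t, n] in every generated frame t.
```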
Applications
Multi-Reference-driven Restylization
We extract point-tracks from a source video and use them to transfer complex motion patterns to new subjects defined by one or more reference images. By decoupling motion (point-tracks) from appearance (reference images), our model faithfully preserves the source dynamics while propagating the visual identity and style of the references.
Mesh-driven Compositing
Keyframe-driven Mesh Stylization
We render mesh vertices to extract point-tracks and use stylized keyframes as visual references. Guided by these vertex-driven point-tracks, our model stylizes 3D animations in a customized artistic style while faithfully preserving the original motion dynamics.
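For illustration, the vertex-to-track step is plain pinhole projection; a minimal NumPy sketch, assuming known per-frame world-to-camera matrices and intrinsics (self-occlusion handling omitted):

```python
import numpy as np

def vertices_to_tracks(verts, K, w2c):
    """Project animated mesh vertices to 2D point-tracks (illustrative helper).

    verts: (T, N, 3) world-space vertex positions over T frames
    K:     (3, 3) camera intrinsics
    w2c:   (T, 4, 4) world-to-camera extrinsics per frame
    Returns (T, N, 2) pixel-space tracks and a (T, N) in-front-of-camera mask.
    """
    T, N, _ = verts.shape
    homo = np.concatenate([verts, np.ones((T, N, 1))], axis=-1)  # (T, N, 4)
    cam = np.einsum("tij,tnj->tni", w2c, homo)[..., :3]          # camera space
    pix = np.einsum("ij,tnj->tni", K, cam)                       # project
    tracks = pix[..., :2] / np.clip(pix[..., 2:3], 1e-6, None)
    return tracks, cam[..., 2] > 0
```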
Multi-Reference-driven Mesh Compositing
By rendering and stylizing each mesh object independently (e.g., each person and the background), we generate a coherent composited video.
Keypoint-driven Compositing
By treating body and facial keypoints extracted from off-the-shelf detectors (SAM 3D Body and Pixel3DMM) as point-tracks, Go-with-the-Track transfers the appearance of reference images onto a source video even when the pose varies significantly.
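As a sketch, per-frame keypoints drop directly into the point-track format; the shapes below are assumptions for illustration, not the detectors' actual output formats:

```python
import numpy as np

def keypoints_to_tracks(body_xy, face_xy, conf, thresh=0.5):
    """Reinterpret per-frame detector keypoints as point-tracks (illustrative).

    body_xy: (T, J, 2) 2D body joints per frame
    face_xy: (T, F, 2) 2D facial landmarks per frame
    conf:    (T, J + F) per-keypoint detection confidence
    """
    tracks = np.concatenate([body_xy, face_xy], axis=1)  # (T, J + F, 2)
    visibility = conf > thresh                           # occlusion / miss mask
    return tracks, visibility
```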
Static Scene Camera Control
Multi-Reference-driven Camera Control in Static Scene
Go-with-the-Track enables precise camera control and novel view synthesis in static scenes by reprojecting 3D point clouds (reconstructed by Pi3 from reference images) along user-specified camera trajectories. It handles complex camera paths, including spirals and smooth interpolations, while flexibly leveraging reference images that need not correspond exactly to any frame of the final video.
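The reprojection itself is standard pinhole geometry; a minimal NumPy sketch, assuming a reconstructed world-space point cloud and user-specified world-to-camera poses:

```python
import numpy as np

def reproject_cloud(points, K, w2c_traj):
    """Synthesize static-scene point-tracks by reprojection (illustrative).

    points:   (N, 3) reconstructed 3D points in world coordinates
    K:        (3, 3) camera intrinsics
    w2c_traj: (T, 4, 4) user-specified world-to-camera poses
    Returns (T, N, 2) point-tracks tracing the camera trajectory.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=-1)
    cam = np.einsum("tij,nj->tni", w2c_traj, homo)[..., :3]  # camera space
    pix = np.einsum("ij,tnj->tni", K, cam)                   # project
    return pix[..., :2] / np.clip(pix[..., 2:3], 1e-6, None)
```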
360° Camera Orbit in a Static Scene
Using sparse point-tracks as motion anchors, our model generates a smooth 360° camera orbit around a static scene while maintaining spatial consistency and visual coherence.
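An orbit trajectory of this kind can be built with a simple look-at construction; the helper below is illustrative (z-forward, y-up camera convention), and its output feeds the reprojection sketch above as `w2c_traj`:

```python
import numpy as np

def orbit_poses(center, radius, height, T=120):
    """T world-to-camera matrices on a 360-degree orbit, each looking at
    `center` (illustrative; adjust axes to your projection convention)."""
    center = np.asarray(center, dtype=np.float64)
    poses = []
    for theta in np.linspace(0.0, 2.0 * np.pi, T, endpoint=False):
        eye = center + np.array([radius * np.cos(theta), height,
                                 radius * np.sin(theta)])
        fwd = center - eye
        fwd /= np.linalg.norm(fwd)
        right = np.cross(fwd, [0.0, 1.0, 0.0])
        right /= np.linalg.norm(right)
        up = np.cross(right, fwd)
        w2c = np.eye(4)
        w2c[:3, :3] = np.stack([right, up, fwd])   # world-to-camera rotation
        w2c[:3, 3] = -w2c[:3, :3] @ eye
        poses.append(w2c)
    return np.stack(poses)                          # (T, 4, 4)
```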
Camera Retargeting in Dynamic Scene
Beyond static scene camera control, our framework supports camera retargeting in dynamic scenes with independently moving objects. We achieve this by reprojecting dynamic point-tracks (estimated with DELTA and Pi3) along a new camera path, using four uniformly sampled frames from the original video as references.
Temporal Stabilization
We mitigate temporal flickering in inverse rendering tasks by interpolating albedo and shading estimates derived from keyframes.
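As a minimal sketch of keyframe interpolation (one plausible scheme for illustration, not necessarily the exact mechanism used), a per-pixel linear blend between consecutive keyframe estimates looks like this:

```python
import numpy as np

def interpolate_keyframes(key_maps, key_idx, T):
    """Linearly blend keyframe albedo/shading maps across time (illustrative).

    key_maps: (K, H, W, C) albedo or shading estimates at keyframes
    key_idx:  (K,) sorted frame indices of the keyframes
    T:        total number of output frames
    Returns (T, H, W, C) temporally smooth per-frame maps.
    """
    out = np.empty((T,) + key_maps.shape[1:], dtype=key_maps.dtype)
    for t in range(T):
        j = int(np.clip(np.searchsorted(key_idx, t), 1, len(key_idx) - 1))
        i = j - 1
        w = (t - key_idx[i]) / max(key_idx[j] - key_idx[i], 1)
        w = min(max(w, 0.0), 1.0)        # clamp outside the keyframe range
        out[t] = (1 - w) * key_maps[i] + w * key_maps[j]
    return out
```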
Baseline Comparison
First-Frame-driven Reconstruction
Given the first frame and point-tracks extracted from the source video, each method attempts to reconstruct the source video. We find that the outputs from Go-with-the-Track align much more closely with the original video, better preserving the spatial structure and element identity.
First-Frame-driven Restylization
Given a restylized first frame and source point-tracks, each method generates a motion-preserved restylized video. Our model more faithfully preserves the source motion while maintaining the subject identity defined by the restylized first frame.
Ablation Study
Model Design
Ablation Study on Model Design
We analyze the effectiveness of our key design components: the spatially-aware point-track embedder, the point-track adapter, and the relative position injection strategy.
w/o Spatially-aware ID:
Replacing our spatially-aware embeddings with random embeddings significantly degrades reference insertion capabilities, as the model struggles to place reference content at the correct locations in the generated frames.
w/o Track Adapter:
Removing the point-track adapter harms motion controllability, since naive subsampling of tracks leads to the loss of detailed motion conditioning.
w/o Relative Position:
Injecting relative positional information before max pooling is also essential for improving motion controllability, as intra-block spatial cues would otherwise be lost during pooling.
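A minimal PyTorch sketch of relative-position injection before max pooling (illustrative module and shapes, not the actual implementation):

```python
import torch
import torch.nn as nn

class TrackBlockPool(nn.Module):
    """Max-pool per-track features within spatial blocks, first injecting each
    track's offset from its block center so intra-block cues survive pooling
    (illustrative sketch, not the paper's code)."""

    def __init__(self, dim):
        super().__init__()
        self.rel_pos_mlp = nn.Sequential(
            nn.Linear(2, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, feats, xy, block_ids, num_blocks):
        # feats: (N, D) per-track features; xy: (N, 2) track positions;
        # block_ids: (N,) spatial block index of each track.
        centers = torch.zeros(num_blocks, 2).index_reduce_(
            0, block_ids, xy, reduce="mean", include_self=False)
        feats = feats + self.rel_pos_mlp(xy - centers[block_ids])
        pooled = torch.full((num_blocks, feats.shape[1]), float("-inf"))
        # blocks containing no tracks remain -inf; mask them downstream
        return pooled.index_reduce_(0, block_ids, feats, reduce="amax")
```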
PCA Analysis on Point-track Embeddings
We visualize both random embeddings and our spatially-aware point-track embeddings using PCA and project the principal components back into pixel space. Our embeddings exhibit clear spatial correlations, whereas random embeddings show no meaningful spatial structure. This suggests that spatially correlated embeddings provide a useful inductive bias that the model can effectively leverage.
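The analysis itself is standard PCA; a minimal sketch, assuming one embedding per point-track:

```python
import numpy as np

def pca_to_rgb(emb):
    """Project embeddings onto 3 principal components for visualization.

    emb: (N, D) one embedding per point-track
    Returns (N, 3) values in [0, 1], paintable at each track's pixel location.
    """
    centered = emb - emb.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)  # PCA via SVD
    pcs = centered @ Vt[:3].T                                # (N, 3)
    lo, hi = pcs.min(axis=0), pcs.max(axis=0)
    return (pcs - lo) / (hi - lo + 1e-8)          # normalize each channel
```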
Dataset
Dataset Visualization
We visualize training samples after applying the data augmentation strategies described in our paper. These examples illustrate the variety of motion patterns, reference-image variations, and point-track densities in the training data, which makes the model robust to diverse conditioning inputs.
Ablation Study on Dataset
To evaluate the effectiveness of our hybrid training strategy, we train our 1.3B model under two settings: (1) using only real videos, where point-tracks are estimated by an off-the-shelf tracker and therefore contain noise, and (2) using the full dataset, which additionally includes synthetic and static-scene videos with precise ground-truth point-track annotations. As shown, training on the full dataset significantly improves motion controllability. This result demonstrates the benefit of incorporating accurate ground-truth point-track supervision.
Reference Frames
Ablation Study on Keyframes
Unlike conventional point-track-conditioned image-to-video models that rely on the first frame, our model can be conditioned on arbitrary frames. As shown, it supports conditioning on the first, middle, or last frame of a video. Moreover, we achieve the best reconstruction performance by conditioning on four uniformly sampled reference frames.
Ablation Study on Non-keyframes
Beyond keyframe conditioning, our model can be guided by reference images that do not exactly correspond to any generated frame. As shown, the model effectively retrieves relevant visual information from reference images to produce coherent videos.
Iterative Point-track Resampling
We provide a visual comparison between point-tracks detected with our iterative resampling strategy (Appendix Algorithm 1) and those obtained by uniform random sampling of point queries over the video frames. Our iterative resampling produces denser and more uniformly distributed point-tracks, achieving better spatial coverage.
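Algorithm 1 itself is given in the appendix; the sketch below is only a plausible rendition of its spirit, assuming a tracker callable that maps (t, x, y) queries to full-length tracks and that H and W are multiples of the cell size:

```python
import numpy as np

def iterative_resample(run_tracker, T, H, W, cell=32, rounds=4):
    """Iteratively query the tracker in grid cells that existing tracks leave
    uncovered (illustrative reading of the strategy, not Appendix Algorithm 1).

    run_tracker: callable, (Q, 3) queries of (t, x, y) -> (T, Q, 2) tracks
    """
    tracks = np.empty((T, 0, 2), dtype=np.float32)
    for t0 in np.linspace(0, T - 1, rounds).astype(int):  # spread query frames
        covered = np.zeros((H // cell, W // cell), dtype=bool)
        if tracks.shape[1] > 0:
            x, y = tracks[t0, :, 0], tracks[t0, :, 1]
            inb = (x >= 0) & (x < W) & (y >= 0) & (y < H)
            covered[(y[inb] // cell).astype(int),
                    (x[inb] // cell).astype(int)] = True
        cy, cx = np.nonzero(~covered)                     # still-empty cells
        if len(cx) == 0:
            continue
        queries = np.stack([np.full(len(cx), t0),
                            cx * cell + cell // 2,        # cell-center queries
                            cy * cell + cell // 2], axis=1).astype(np.float32)
        tracks = np.concatenate([tracks, run_tracker(queries)], axis=1)
    return tracks
```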