CoWTracker: Tracking by Warping instead of Correlation
Dense point tracking and optical flow with a simple warping-based architecture that avoids cost volumes, yet achieves state-of-the-art results on TAP-Vid and RoboTAP along with strong zero-shot optical flow.
Dense Tracking without Cost Volumes
We study dense point tracking: following any pixel in a video over time. Classical approaches rely on cost volumes that are expensive, hard to scale, and often specialized to optical flow. CoWTracker asks: can we track densely by warping instead?
Problem: Dense Point Tracking
Given a video and a set of query points in a reference frame, the goal is to recover their trajectories across the entire sequence, including handling occlusions and disocclusions.
- Tracks for any pixel, not just sparse keypoints.
- Long-range motion and large viewpoint change.
- Visibility and confidence for each point over time.
Limitations of Cost Volumes
- Quadratic complexity in the number of pixels.
- Large memory footprint and compute cost.
- Hard to scale to high-resolution, long sequences.
Tracking by Warping
CoWTracker replaces expensive cost volumes with iterative warps of high-resolution features, followed by a spatio-temporal transformer that refines tracks jointly across space and time.
- No explicit feature correlation / cost volume.
- Efficient high-resolution features (stride 2).
- Single model for tracking and optical flow.
Iterative Warping + Spatio-Temporal Reasoning
Instead of searching over many candidate matches, CoWTracker maintains a current estimate of each point's position and repeatedly warps target features back to the query frame. A transformer then refines the displacement and predicts visibility and confidence.
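As a concrete illustration, here is a minimal sketch of the warping step in PyTorch, assuming (B, C, H, W) feature maps; the function name `warp_features` and its interface are ours, not the released code. Bilinear sampling at the estimated positions stands in for an explicit correlation search.

```python
import torch
import torch.nn.functional as F

def warp_features(target_feats, positions):
    """Sample target-frame features at the current track estimates.

    target_feats: (B, C, H, W) feature map of a target frame.
    positions:    (B, N, 2) estimated (x, y) positions in pixels.
    Returns:      (B, N, C) features warped back to the query points.
    """
    B, C, H, W = target_feats.shape
    N = positions.shape[1]
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    grid = positions.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    grid = grid.view(B, N, 1, 2)
    # Bilinear sampling at the estimated positions, no cost volume needed.
    sampled = F.grid_sample(target_feats, grid, align_corners=True)  # (B, C, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
```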
Tracking Loop
- Start with zero displacement (stationary initialization).
- Warp target-frame features back to the query frame.
- Concatenate warped features, query features, displacement, and hidden state.
- Apply a spatio-temporal transformer to update the state.
- Predict a displacement increment, visibility, and confidence.
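Putting the steps above together, here is a minimal sketch of the loop for a single target frame, reusing the `warp_features` helper from the previous sketch; the `refiner` interface is a hypothetical stand-in for the spatio-temporal transformer described next.

```python
import torch

def track_points(query_xy, query_feats, target_feats, refiner, num_iters=4):
    """Refine tracks into one target frame by iterative warping.

    query_xy:     (B, N, 2) query point coordinates.
    query_feats:  (B, N, C) features sampled at the query points.
    target_feats: (B, C, H, W) target-frame feature map.
    refiner:      hypothetical module mapping the concatenated state to
                  (delta, visibility, confidence, new hidden state).
    """
    B, N, _ = query_xy.shape
    C = query_feats.shape[-1]
    # Stationary initialization: tracks start at the query positions.
    displacement = torch.zeros_like(query_xy)
    hidden = torch.zeros(B, N, C, device=query_xy.device)
    for _ in range(num_iters):
        # Warp target features back to the current position estimates.
        warped = warp_features(target_feats, query_xy + displacement)
        # Concatenate warped features, query features, displacement, hidden state.
        state = torch.cat([warped, query_feats, displacement, hidden], dim=-1)
        # Predict a displacement increment plus visibility and confidence.
        delta, visibility, confidence, hidden = refiner(state)
        displacement = displacement + delta
    return query_xy + displacement, visibility, confidence
```

In the full model, this update runs jointly over all frames and all points, so the refinement step can reason across space and time rather than per frame.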
Spatio-Temporal Transformer
- Spatial attention across all points in a frame.
- Temporal attention along each point track over time.
- Alternating spatial and temporal blocks.
- Joint reasoning about motion, occlusion, and consistency.
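To make the alternation concrete, here is a minimal sketch of one block operating on a (B, T, N, C) tensor of track tokens, using standard PyTorch attention; the pre-norm layout and layer sizes are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One alternating block: spatial attention within each frame,
    then temporal attention along each track."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        B, T, N, C = tokens.shape
        # Spatial attention: all N points in a frame attend to each other.
        x = tokens.reshape(B * T, N, C)
        h = self.norm1(x)
        x = x + self.spatial(h, h, h, need_weights=False)[0]
        # Temporal attention: each track attends along its T time steps.
        x = x.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        h = self.norm2(x)
        x = x + self.temporal(h, h, h, need_weights=False)[0]
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)
```

Stacking several such blocks lets information flow both within frames and along tracks, which is what enables joint reasoning about motion, occlusion, and consistency.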
CoWTracker architecture: The backbone extracts features from the input video, and a lightweight update operator iteratively warps and refines tracks to yield dense trajectories, visibility, and confidence.
Dense Tracks and Flow in Action
Qualitative videos showing CoWTracker on challenging scenes for dense point tracking and zero-shot optical flow.
Dense Point Tracking Videos
Clips visualizing dense point tracks over time. Query points in the reference frame are colored and their trajectories are rendered across the sequence. CoWTracker maintains coherent tracks even under large viewpoint changes and occlusions.
- Thin structures (e.g., bicycle spokes, railings, ropes).
- Fast motion and strong camera shake.
- Self-occlusions and re-appearances.
Dense tracking on a DAVIS-style scene with thin structures and camera motion.
Robust tracks through strong occlusions (e.g., BMX / crowd scenes).
Long-horizon tracking on a RoboTAP / robotic manipulation sequence.
Ego-motion tracking in a dynamic scene with moving objects.
Driving scene with cars and pedestrians.
Optical Flow Videos
Using the same model in a two-frame setting, we visualize color-coded optical flow fields on Sintel, KITTI-2015, and Spring. The videos show the input frames alongside CoWTracker's predicted flow.
- Sharp motion boundaries around objects and limbs.
- Accurate flow in regions with large displacement.
- Competitive zero-shot quality compared to specialized flow models.
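To make the two-frame usage concrete, here is a minimal sketch of deriving a dense flow field from a point tracker: query every pixel of the first frame, track it into the second, and subtract the query grid. The `model` call signature here is a placeholder, not the actual API.

```python
import torch

def flow_from_tracker(model, frame1, frame2):
    """Derive dense optical flow from a point tracker (placeholder API).

    frame1, frame2: (B, 3, H, W) image pair.
    Returns:        (B, 2, H, W) flow from frame1 to frame2.
    """
    B, _, H, W = frame1.shape
    # Query every pixel of the first frame.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    queries = torch.stack([xs, ys], dim=-1).float()
    queries = queries.reshape(1, -1, 2).expand(B, -1, -1).to(frame1.device)
    # Treat the image pair as a two-frame video and track all queries.
    video = torch.stack([frame1, frame2], dim=1)  # (B, 2, 3, H, W)
    tracks, _, _ = model(video, queries)          # (B, 2, H*W, 2), placeholder signature
    # Flow is the displacement of each query between the two frames.
    flow = (tracks[:, 1] - queries).permute(0, 2, 1).reshape(B, 2, H, W)
    return flow
```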
Zero-shot optical flow on Sintel. Left to right: input frame, CoWTracker flow, ground truth flow.
Driving sequence from KITTI-2015 with sharp motion boundaries and small objects.
State-of-the-Art Dense Tracking and Strong Optical Flow
CoWTracker achieves state-of-the-art results on TAP-Vid and RoboTAP benchmarks, while also transferring effectively to optical flow datasets such as Sintel, KITTI-2015, and Spring.
5.1 Dense Point Tracking (TAP-Vid, RoboTAP)
We evaluate on the TAP-Vid suite (DAVIS, RGB-Stacking, Kinetics) and RoboTAP using Average Jaccard (AJ), average position accuracy (δavg), and Occlusion Accuracy (OA). CoWTracker significantly improves over AllTracker and other dense trackers.
- Highest mean AJ and δavg across TAP-Vid and RoboTAP.
- Strong occlusion handling and long-range consistency in DAVIS and RGB-Stacking.
- Trained on Kubric-only data, yet competitive with methods using larger training mixtures.
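For reference, here is a sketch of the δavg metric as defined by TAP-Vid: the fraction of visible points within 1, 2, 4, 8, and 16 pixels of ground truth (at the benchmark's 256×256 evaluation resolution), averaged over the five thresholds. This follows the benchmark's definition, not code from this project.

```python
import torch

def delta_avg(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average position accuracy (delta_avg) as in TAP-Vid.

    pred, gt: (N, T, 2) predicted and ground-truth tracks in pixels,
              at the benchmark's 256x256 evaluation resolution.
    visible:  (N, T) boolean ground-truth visibility mask.
    """
    err = torch.linalg.norm(pred - gt, dim=-1)  # (N, T) endpoint error
    # Fraction of visible points within each threshold, then average.
    fracs = [(err[visible] < t).float().mean() for t in thresholds]
    return torch.stack(fracs).mean()
```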
Tracking Benchmark Snapshot
| Method | Train | Mean AJ | Mean δavg | Mean OA |
|---|---|---|---|---|
| CoTracker 3 | Kub+Mix | 62.0 | 74.4 | 89.6 |
| AllTracker | Kub+Mix | 68.9 | 80.5 | 91.5 |
| CoWTracker | Kub | 71.3 | 81.8 | 93.3 |
Average results across the TAP-Vid and RoboTAP benchmarks. Higher is better.
5.2 Zero-Shot Optical Flow (Sintel, KITTI, Spring)
Without training on optical flow datasets, CoWTracker provides competitive or superior performance compared to specialized flow networks such as RAFT, SEA-RAFT, and WAFT.
- Lower EPE on Sintel and KITTI-2015 than many flow-specific models.
- Accurate motion boundaries and robustness to large displacement.
- Competitive results on the challenging Spring dataset.
Flow Benchmark Snapshot
| Method | Sintel (EPE) | KITTI (EPE) | Spring (EPE) |
|---|---|---|---|
| RAFT | 1.15 | 1.53 | 0.22 |
| SEA-RAFT | 0.97 | 1.60 | 0.21 |
| WAFT | 1.01 | 1.35 | 0.13 |
| CoWTracker (ours) | 0.78 | 1.04 | 0.17 |
Zero-shot optical flow results on three benchmarks. Our predictions are produced by the same model used for point tracking, not trained on any optical-flow datasets.
Paper, Code, and Data
Everything you need to reproduce CoWTracker results and build upon the method.
Paper & Code
- 📄 Paper: PDF / arXiv
- 💻 Code & Models: GitHub repository
- 🤗 Demo: Hugging Face Space
- 🎬 Video: YouTube
Citation
Please cite CoWTracker if you use it in your research:
@article{lai2026a,
title = {CoWTracker: Tracking by Warping instead of Correlation},
author = {Lai, Zihang and Insafutdinov, Eldar and Sucar, Edgar and Vedaldi, Andrea},
journal = {arXiv preprint arXiv:2602.04877},
year = {2026},
}