CoWTracker: Tracking by Warping instead of Correlation
Dense point tracking and optical flow with a simple warping-based architecture that avoids cost volumes, yet achieves state-of-the-art results on TAP-Vid and RoboTAP along with strong zero-shot optical flow.
Dense Tracking without Cost Volumes
We study dense point tracking: following any pixel in a video over time. Classical approaches rely on cost volumes that are expensive, hard to scale, and often specialized to optical flow. CoWTracker asks: can we track densely by warping instead?
Problem: Dense Point Tracking
Given a video and a set of query points in a reference frame, the goal is to recover their trajectories across the entire sequence, including handling occlusions and disocclusions.
- Tracks for any pixel, not just sparse keypoints.
- Long-range motion and large viewpoint change.
- Visibility and confidence for each point over time.
Limitations of Cost Volumes
- Quadratic complexity in the number of pixels.
- Large memory footprint and compute cost.
- Hard to scale to high-resolution, long sequences.
Tracking by Warping
CoWTracker replaces expensive cost volumes with iterative warps of high-resolution features, followed by a spatio-temporal transformer that refines tracks jointly across space and time.
- No explicit feature correlation / cost volume.
- Efficient high-resolution features (stride 2).
- Single model for tracking and optical flow.
Iterative Warping + Spatio-Temporal Reasoning
Instead of searching over many candidate matches, CoWTracker maintains a current estimate of each point's position and repeatedly warps target features back to the query frame. A transformer then refines the displacement and predicts visibility and confidence.
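As a concrete illustration, here is a minimal sketch of the warping step in PyTorch, assuming (B, C, H, W) feature maps; the function name `warp_features` and its interface are ours, not the released code. Bilinear sampling at the estimated positions stands in for an explicit correlation search.

```python
import torch
import torch.nn.functional as F

def warp_features(target_feats, positions):
    """Sample target-frame features at the current track estimates.

    target_feats: (B, C, H, W) feature map of a target frame.
    positions:    (B, N, 2) estimated (x, y) positions in pixels.
    Returns:      (B, N, C) features warped back to the query points.
    """
    B, C, H, W = target_feats.shape
    N = positions.shape[1]
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    grid = positions.clone()
    grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0
    grid = grid.view(B, N, 1, 2)
    # Bilinear sampling at the estimated positions, no cost volume needed.
    sampled = F.grid_sample(target_feats, grid, align_corners=True)  # (B, C, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)  # (B, N, C)
```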
Tracking Loop
- Start with zero displacement (stationary initialization).
- Warp target-frame features back to the query frame.
- Concatenate warped features, query features, displacement, and hidden state.
- Apply a spatio-temporal transformer to update the state.
- Predict a displacement increment, visibility, and confidence.
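Putting the steps above together, here is a minimal sketch of the loop for a single target frame, reusing the `warp_features` helper from the previous sketch; the `refiner` interface is a hypothetical stand-in for the spatio-temporal transformer described next.

```python
import torch

def track_points(query_xy, query_feats, target_feats, refiner, num_iters=4):
    """Refine tracks into one target frame by iterative warping.

    query_xy:     (B, N, 2) query point coordinates.
    query_feats:  (B, N, C) features sampled at the query points.
    target_feats: (B, C, H, W) target-frame feature map.
    refiner:      hypothetical module mapping the concatenated state to
                  (delta, visibility, confidence, new hidden state).
    """
    B, N, _ = query_xy.shape
    C = query_feats.shape[-1]
    # Stationary initialization: tracks start at the query positions.
    displacement = torch.zeros_like(query_xy)
    hidden = torch.zeros(B, N, C, device=query_xy.device)
    for _ in range(num_iters):
        # Warp target features back to the current position estimates.
        warped = warp_features(target_feats, query_xy + displacement)
        # Concatenate warped features, query features, displacement, hidden state.
        state = torch.cat([warped, query_feats, displacement, hidden], dim=-1)
        # Predict a displacement increment plus visibility and confidence.
        delta, visibility, confidence, hidden = refiner(state)
        displacement = displacement + delta
    return query_xy + displacement, visibility, confidence
```

In the full model, this update runs jointly over all frames and all points, so the refinement step can reason across space and time rather than per frame.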
Spatio-Temporal Transformer
- Spatial attention across all points in a frame.
- Temporal attention along each point track over time.
- Alternating spatial and temporal blocks.
- Joint reasoning about motion, occlusion, and consistency.
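To make the alternation concrete, here is a minimal sketch of one block operating on a (B, T, N, C) tensor of track tokens, using standard PyTorch attention; the pre-norm layout and layer sizes are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One alternating block: spatial attention within each frame,
    then temporal attention along each track."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        B, T, N, C = tokens.shape
        # Spatial attention: all N points in a frame attend to each other.
        x = tokens.reshape(B * T, N, C)
        h = self.norm1(x)
        x = x + self.spatial(h, h, h, need_weights=False)[0]
        # Temporal attention: each track attends along its T time steps.
        x = x.reshape(B, T, N, C).permute(0, 2, 1, 3).reshape(B * N, T, C)
        h = self.norm2(x)
        x = x + self.temporal(h, h, h, need_weights=False)[0]
        return x.reshape(B, N, T, C).permute(0, 2, 1, 3)
```

Stacking several such blocks lets information flow both within frames and along tracks, which is what enables joint reasoning about motion, occlusion, and consistency.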
CoWTracker architecture: The backbone extracts features from the input video, and a lightweight update operator iteratively warps and refines tracks to yield dense trajectories, visibility, and confidence.
Dense Tracks and Flow in Action
Qualitative videos showing CoWTracker on challenging scenes for dense point tracking and zero-shot optical flow.
Dense Point Tracking Videos
Clips visualizing dense point tracks over time. Query points in the reference frame are colored and their trajectories are rendered across the sequence. CoWTracker maintains coherent tracks even under large viewpoint changes and occlusions.
- Thin structures (e.g., bicycle spokes, railings, ropes).
- Fast motion and strong camera shake.
- Self-occlusions and re-appearances.
Dense tracking on a DAVIS-style scene with thin structures and camera motion.
Robust tracks through strong occlusions (e.g., BMX / crowd scenes).
Long-horizon tracking on a RoboTAP / robotic manipulation sequence.
Ego-motion tracking in a dynamic scene with moving objects.
Driving scene with cars and pedestrians.
Optical Flow Videos
Using the same model in a two-frame setting, we visualize color-coded optical flow fields on Sintel, KITTI-2015, and Spring. The videos show the input frames alongside CoWTracker's predicted flow.
- Sharp motion boundaries around objects and limbs.
- Accurate flow in regions with large displacement.
- Competitive zero-shot quality compared to specialized flow models.
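To make the two-frame usage concrete, here is a minimal sketch of deriving a dense flow field from a point tracker: query every pixel of the first frame, track it into the second, and subtract the query grid. The `model` call signature here is a placeholder, not the actual API.

```python
import torch

def flow_from_tracker(model, frame1, frame2):
    """Derive dense optical flow from a point tracker (placeholder API).

    frame1, frame2: (B, 3, H, W) image pair.
    Returns:        (B, 2, H, W) flow from frame1 to frame2.
    """
    B, _, H, W = frame1.shape
    # Query every pixel of the first frame.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    queries = torch.stack([xs, ys], dim=-1).float()
    queries = queries.reshape(1, -1, 2).expand(B, -1, -1).to(frame1.device)
    # Treat the image pair as a two-frame video and track all queries.
    video = torch.stack([frame1, frame2], dim=1)  # (B, 2, 3, H, W)
    tracks, _, _ = model(video, queries)          # (B, 2, H*W, 2), placeholder signature
    # Flow is the displacement of each query between the two frames.
    flow = (tracks[:, 1] - queries).permute(0, 2, 1).reshape(B, 2, H, W)
    return flow
```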
Zero-shot optical flow on Sintel. Left to right: input frame, CoWTracker flow, ground truth flow.
Driving sequence from KITTI-2015 with sharp motion boundaries and small objects.
State-of-the-Art Dense Tracking and Strong Optical Flow
CoWTracker achieves state-of-the-art results on TAP-Vid and RoboTAP benchmarks, while also transferring effectively to optical flow datasets such as Sintel, KITTI-2015, and Spring.
5.1 Dense Point Tracking (TAP-Vid, RoboTAP)
We evaluate on the TAP-Vid suite (DAVIS, RGB-Stacking, Kinetics) and RoboTAP using Average Jaccard (AJ), average position accuracy (δavg), and Occlusion Accuracy (OA). CoWTracker significantly improves over AllTracker and other dense trackers.
- Highest mean AJ and δavg across TAP-Vid and RoboTAP.
- Strong occlusion handling and long-range consistency in DAVIS and RGB-Stacking.
- Trained on Kubric-only data, yet competitive with methods using larger training mixtures.
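For reference, here is a sketch of the δavg metric as defined by TAP-Vid: the fraction of visible points within 1, 2, 4, 8, and 16 pixels of ground truth (at the benchmark's 256×256 evaluation resolution), averaged over the five thresholds. This follows the benchmark's definition, not code from this project.

```python
import torch

def delta_avg(pred, gt, visible, thresholds=(1, 2, 4, 8, 16)):
    """Average position accuracy (delta_avg) as in TAP-Vid.

    pred, gt: (N, T, 2) predicted and ground-truth tracks in pixels,
              at the benchmark's 256x256 evaluation resolution.
    visible:  (N, T) boolean ground-truth visibility mask.
    """
    err = torch.linalg.norm(pred - gt, dim=-1)  # (N, T) endpoint error
    # Fraction of visible points within each threshold, then average.
    fracs = [(err[visible] < t).float().mean() for t in thresholds]
    return torch.stack(fracs).mean()
```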
Tracking Benchmark Snapshot
| Method | Train | Mean AJ | Mean δavg | Mean OA |
|---|---|---|---|---|
| CoTracker 3 | Kub+Mix | 62.0 | 74.4 | 89.6 |
| AllTracker | Kub+Mix | 68.9 | 80.5 | 91.5 |
| CoWTracker | Kub | 71.3 | 81.8 | 93.3 |
Average results across the TAP-Vid and RoboTAP benchmarks. Higher is better.
5.2 Zero-Shot Optical Flow (Sintel, KITTI, Spring)
Without training on optical flow datasets, CoWTracker provides competitive or superior performance compared to specialized flow networks such as RAFT, SEA-RAFT, and WAFT.
- Lower EPE on Sintel and KITTI-2015 than many flow-specific models.
- Accurate motion boundaries and robustness to large displacement.
- Competitive results on the challenging Spring dataset.
Flow Benchmark Snapshot
| Method | Sintel (EPE) | KITTI (EPE) | Spring (EPE) |
|---|---|---|---|
| RAFT | 1.15 | 1.53 | 0.22 |
| SEA-RAFT | 0.97 | 1.60 | 0.21 |
| WAFT | 1.01 | 1.35 | 0.13 |
| CoWTracker (ours) | 0.78 | 1.04 | 0.17 |
Zero-shot optical flow results on three benchmarks. Our predictions are produced by the same model used for point tracking, not trained on any optical-flow datasets.
Paper, Code, and Data
Everything you need to reproduce CoWTracker results and build upon the method.
Paper & Code
- 📄 Paper: PDF / arXiv
- 💻 Code & Models: GitHub repository
- 🤗 Demo: Hugging Face Space
- 🎬 Video: YouTube
Citation
Please cite CoWTracker if you use it in your research:
@article{lai2026a,
title = {CoWTracker: Tracking by Warping instead of Correlation},
author = {Lai, Zihang and Insafutdinov, Eldar and Sucar, Edgar and Vedaldi, Andrea},
journal = {arXiv preprint arXiv:2602.04877},
year = {2026},
}