AnyGroundBench

A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Rintaro Otsubo* (1,2), Ryo Fujii* (1,2), Reina Ishikawa (1,2), Taiki Kanaya (1,2), Kanta Sawafuji (1,2), Hiroki Kajita (1,3), Shigeki Sakai (1,3), Hideo Saito (1,2), Ryo Hachiuma (4)
(1) Keio University, (2) Keio AI Research Center, (3) Keio University School of Medicine, (4) NVIDIA

* Equal contribution

Figure: AnyGroundBench teaser. AnyGroundBench evaluates spatio-temporal video grounding across specialized domains, including animal, industry, sports, surgery, and public security.

Abstract

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential.

To bridge this gap, we introduce AnyGroundBench, a benchmark designed to shift STVG evaluation from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos, such as expert-annotated mouse behaviors, with established datasets and unifies them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability.

We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.

Main Results

Each cell reports STVG / TVG / SVG scores: STVG is measured by vIoU@0.3, TVG by tIoU@0.3, and SVG by sIoU@0.3. Rows labeled +ICL show 2-shot In-Context Learning performance for the model in the row directly above.

| Models | Animal | Industry | Sports | Surgery | Public Security |
|---|---|---|---|---|---|
| Proprietary VLMs | | | | | |
| GPT-4o | 0.00 / 16.5 / 7.64 | 2.95 / 8.87 / 14.7 | 0.61 / 17.1 / 0.00 | 0.00 / 13.4 / 0.00 | 0.80 / 56.0 / 4.40 |
| +ICL | 0.00 / 17.3 / 12.7 | 1.25 / 10.4 / 4.14 | 0.61 / 22.0 / 0.61 | 0.00 / 26.1 / 0.00 | 6.35 / 66.9 / 6.40 |
| GPT-5.1 | 3.18 / 19.7 / 55.4 | 5.32 / 9.46 / 28.4 | 0.00 / 25.7 / 1.84 | 1.38 / 22.6 / 0.45 | 4.80 / 61.2 / 35.6 |
| +ICL | 5.59 / 23.0 / 59.2 | 2.74 / 11.3 / 34.9 | 1.22 / 23.9 / 7.36 | 2.64 / 41.7 / 0.44 | 9.63 / 66.4 / 39.2 |
| Gemini-2.5-Flash | 2.54 / 26.1 / 15.2 | 0.59 / 21.3 / 4.73 | 1.84 / 31.9 / 3.68 | 0.00 / 22.6 / 0.97 | 0.40 / 51.2 / 2.00 |
| +ICL | 1.91 / 19.1 / 12.1 | 1.18 / 23.0 / 8.87 | 0.00 / 41.7 / 0.00 | 0.46 / 18.9 / 3.59 | 2.40 / 61.6 / 3.60 |
| Gemini-2.5-Pro | 8.28 / 36.9 / 20.3 | 1.18 / 39.6 / 17.1 | 0.61 / 37.4 / 7.97 | 1.38 / 31.4 / 2.77 | 4.00 / 65.8 / 22.8 |
| +ICL | 8.28 / 31.2 / 45.2 | 8.28 / 37.2 / 26.0 | 3.68 / 43.5 / 5.52 | 6.01 / 41.2 / 24.8 | 8.80 / 70.4 / 20.0 |
| Gemini-3-Flash | 14.0 / 36.3 / 51.5 | 5.91 / 30.7 / 20.1 | 2.45 / 29.4 / 2.45 | 0.92 / 37.9 / 11.1 | 10.7 / 66.8 / 45.5 |
| +ICL | 13.3 / 33.1 / 45.8 | 6.50 / 33.1 / 24.2 | 1.22 / 44.1 / 3.68 | 6.48 / 42.5 / 27.0 | 30.8 / 78.4 / 25.1 |
| Gemini-3.1-Pro | 16.5 / 37.5 / 70.7 | 7.69 / 21.8 / 41.4 | 1.22 / 26.3 / 16.5 | 4.16 / 32.8 / 26.1 | 22.8 / 69.4 / 52.0 |
| +ICL | 12.7 / 33.1 / 60.5 | 11.8 / 27.8 / 39.6 | 1.22 / 39.8 / 7.36 | 9.72 / 42.1 / 23.4 | 22.4 / 77.2 / 41.6 |
| Open-source Specialized VLMs | | | | | |
| LLaVA-ST | 12.1 / 19.7 / 53.5 | 0.00 / 9.46 / 12.4 | 0.79 / 8.58 / 3.68 | 0.00 / 12.0 / 0.45 | 0.80 / 35.6 / 13.2 |
| Open-source General-Purpose VLMs | | | | | |
| Qwen3-VL-4B | 5.73 / 25.4 / 19.7 | 0.00 / 4.14 / 5.32 | 0.00 / 8.58 / 0.00 | 0.00 / 13.4 / 0.00 | 0.40 / 39.2 / 0.00 |
| +ICL | 0.63 / 28.6 / 0.63 | 1.18 / 15.9 / 0.00 | 0.00 / 17.7 / 0.00 | 0.00 / 34.2 / 0.00 | 0.00 / 59.6 / 0.00 |
| Qwen3-VL-8B | 3.82 / 19.7 / 0.00 | 0.00 / 10.6 / 0.00 | 0.00 / 11.6 / 0.00 | 0.00 / 15.7 / 0.00 | 0.80 / 46.0 / 0.00 |
| +ICL | 0.00 / 28.0 / 4.45 | 0.59 / 14.7 / 0.00 | 0.00 / 23.3 / 0.00 | 0.00 / 34.2 / 0.00 | 0.40 / 65.6 / 0.00 |
| Qwen3.5-4B | 2.54 / 30.5 / 13.3 | 0.00 / 14.7 / 7.69 | 0.00 / 12.8 / 0.00 | 0.00 / 28.2 / 1.49 | 0.40 / 49.2 / 2.00 |
| +ICL | 3.18 / 29.2 / 17.1 | 2.95 / 23.6 / 5.91 | 0.00 / 20.8 / 0.00 | 0.46 / 34.7 / 6.70 | 2.00 / 65.2 / 2.40 |
| Qwen3.5-9B | 4.45 / 35.0 / 20.3 | 0.59 / 17.1 / 12.4 | 0.61 / 14.7 / 0.00 | 0.00 / 28.2 / 1.35 | 0.40 / 50.4 / 4.80 |
| +ICL | 2.54 / 35.0 / 10.1 | 1.77 / 19.5 / 14.2 | 0.00 / 25.7 / 0.61 | 0.00 / 25.4 / 7.15 | 2.40 / 62.8 / 2.80 |
| Eagle2.5-8B | 0.00 / 24.8 / 1.27 | 0.00 / 7.69 / 2.36 | 0.00 / 7.97 / 0.00 | 0.00 / 15.2 / 0.00 | 0.00 / 47.6 / 0.40 |
| +ICL | 0.00 / 25.4 / 0.00 | 0.00 / 12.4 / 0.00 | 0.00 / 20.8 / 0.00 | 0.00 / 28.2 / 0.44 | 0.00 / 61.6 / 0.00 |
| InternVL3-8B | 0.63 / 15.2 / 3.82 | 0.00 / 5.91 / 1.18 | 0.00 / 6.79 / 0.00 | 0.00 / 10.6 / 0.00 | 0.00 / 4.40 / 0.80 |
| +ICL | 0.00 / 15.9 / 0.00 | 0.00 / 7.10 / 0.00 | 0.00 / 7.36 / 0.00 | 0.00 / 25.4 / 0.00 | 0.00 / 7.60 / 0.00 |
| InternVL3-14B | 0.63 / 17.8 / 7.64 | 0.00 / 5.32 / 1.18 | 0.00 / 7.36 / 0.00 | 0.00 / 10.6 / 0.00 | 0.00 / 13.2 / 0.00 |
| +ICL | 0.00 / 15.2 / 0.00 | 0.00 / 8.59 / 0.00 | 0.00 / 8.64 / 0.00 | 0.00 / 14.5 / 0.49 | 0.00 / 19.0 / 0.00 |
| InternVL3.5-8B | 0.00 / 11.4 / 0.00 | 0.00 / 2.95 / 1.18 | 0.00 / 6.13 / 0.00 | 0.00 / 7.40 / 0.00 | 0.00 / 3.60 / 0.00 |
| +ICL | 0.00 / 5.09 / 1.27 | 0.00 / 7.81 / 2.36 | 0.00 / 3.68 / 0.00 | 0.46 / 18.0 / 2.69 | 0.40 / 4.00 / 0.00 |
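
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how vIoU, tIoU, and sIoU at a 0.3 threshold are commonly computed in the STVG literature. This page does not spell out the benchmark's exact definitions, so several things here are assumptions: the frame-indexed box format, the choice to average sIoU over ground-truth frames, the use of `>=` at the threshold, and the reading of table values as percentages. All function names are hypothetical and are not taken from the AnyGroundBench code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format
Track = Dict[int, Box]                   # frame index -> box, assumed format


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Track, gt: Track) -> float:
    """Temporal IoU: shared frames over the union of predicted and GT frames."""
    p, g = set(pred), set(gt)
    union = p | g
    return len(p & g) / len(union) if union else 0.0


def v_iou(pred: Track, gt: Track) -> float:
    """Spatio-temporal IoU: box IoU summed over shared frames, divided by
    the number of frames in the temporal union."""
    p, g = set(pred), set(gt)
    union = p | g
    if not union:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in p & g) / len(union)


def s_iou(pred: Track, gt: Track) -> float:
    """Spatial IoU averaged over GT frames (assumption: frames with no
    prediction contribute zero)."""
    if not gt:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in gt if f in pred) / len(gt)


def score_at(scores: List[float], thr: float = 0.3) -> float:
    """Percentage of samples whose score reaches the threshold
    (assumption: inclusive comparison)."""
    if not scores:
        return 0.0
    return 100.0 * sum(s >= thr for s in scores) / len(scores)


# Toy example: GT spans frames 0-9, the prediction spans frames 5-14,
# and the boxes match exactly wherever both exist.
gt_track = {f: (0.0, 0.0, 10.0, 10.0) for f in range(10)}
pred_track = {f: (0.0, 0.0, 10.0, 10.0) for f in range(5, 15)}
```

In this toy case the two tracks share 5 of 15 frames, so tIoU and vIoU are both 1/3 and the sample would count as a hit at the 0.3 threshold, while sIoU is 0.5 because half of the ground-truth frames are covered.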

Qualitative Examples

Figure: Qualitative visualization of specialized-domain spatio-temporal grounding examples.