AnyGroundBench

A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

Rintaro Otsubo*¹,², Ryo Fujii*¹,², Reina Ishikawa¹,², Taiki Kanaya¹,², Kanta Sawafuji¹,², Hiroki Kajita¹,³, Shigeki Sakai¹,³, Hideo Saito¹,², Ryo Hachiuma⁴

¹Keio University ²Keio AI Research Center ³Keio University School of Medicine ⁴NVIDIA

* Equal contribution


Figure: AnyGroundBench evaluates spatio-temporal video grounding across five specialized domains: animal, industry, sports, surgery, and public security.

Abstract

Vision-Language Models (VLMs) have demonstrated immense promise in Spatio-Temporal Video Grounding (STVG). However, current evaluation protocols are largely confined to zero-shot assessments on general, daily-life benchmarks. This creates a critical disconnect from real-world applications in specialized fields, where models inevitably encounter rare visual concepts and complex spatio-temporal dynamics. Since exhaustive pre-training across infinite data distributions is infeasible, the ability to adapt to novel domains is essential.

To bridge this gap, we introduce AnyGroundBench, a benchmark designed to shift STVG evaluation from static zero-shot testing to rigorous domain adaptation. Targeting five specialized domains (animal, industry, sports, surgery, and public security), AnyGroundBench pairs newly captured videos, such as expert-annotated mouse behaviors, with established datasets, unifying them through dense, high-fidelity spatio-temporal annotations. Crucially, the benchmark provides dedicated training subsets to systematically measure domain adaptability.

We extensively evaluate 15 state-of-the-art VLMs, assessing their zero-shot generalization and In-Context Learning (ICL) capabilities under practical computational constraints. Ultimately, our findings reveal that current models fail in both zero-shot and ICL-based adaptation when confronted with specialized domains, exposing critical flaws in spatio-temporal reasoning that future research must address.

Main Results

Each cell reports STVG / TVG / SVG, measured by vIoU@0.3, tIoU@0.3, and sIoU@0.3, respectively. Each +ICL row shows the 2-shot In-Context Learning performance of the model in the row above it.

Models	Animal	Industry	Sports	Surgery	Public Security
Proprietary VLMs
GPT-4o	0.00 / 16.5 / 7.64	2.95 / 8.87 / 14.7	0.61 / 17.1 / 0.00	0.00 / 13.4 / 0.00	0.80 / 56.0 / 4.40
+ICL	0.00 / 17.3 / 12.7	1.25 / 10.4 / 4.14	0.61 / 22.0 / 0.61	0.00 / 26.1 / 0.00	6.35 / 66.9 / 6.40
GPT-5.1	3.18 / 19.7 / 55.4	5.32 / 9.46 / 28.4	0.00 / 25.7 / 1.84	1.38 / 22.6 / 0.45	4.80 / 61.2 / 35.6
+ICL	5.59 / 23.0 / 59.2	2.74 / 11.3 / 34.9	1.22 / 23.9 / 7.36	2.64 / 41.7 / 0.44	9.63 / 66.4 / 39.2
Gemini-2.5-Flash	2.54 / 26.1 / 15.2	0.59 / 21.3 / 4.73	1.84 / 31.9 / 3.68	0.00 / 22.6 / 0.97	0.40 / 51.2 / 2.00
+ICL	1.91 / 19.1 / 12.1	1.18 / 23.0 / 8.87	0.00 / 41.7 / 0.00	0.46 / 18.9 / 3.59	2.40 / 61.6 / 3.60
Gemini-2.5-Pro	8.28 / 36.9 / 20.3	1.18 / 39.6 / 17.1	0.61 / 37.4 / 7.97	1.38 / 31.4 / 2.77	4.00 / 65.8 / 22.8
+ICL	8.28 / 31.2 / 45.2	8.28 / 37.2 / 26.0	3.68 / 43.5 / 5.52	6.01 / 41.2 / 24.8	8.80 / 70.4 / 20.0
Gemini-3-Flash	14.0 / 36.3 / 51.5	5.91 / 30.7 / 20.1	2.45 / 29.4 / 2.45	0.92 / 37.9 / 11.1	10.7 / 66.8 / 45.5
+ICL	13.3 / 33.1 / 45.8	6.50 / 33.1 / 24.2	1.22 / 44.1 / 3.68	6.48 / 42.5 / 27.0	30.8 / 78.4 / 25.1
Gemini-3.1-Pro	16.5 / 37.5 / 70.7	7.69 / 21.8 / 41.4	1.22 / 26.3 / 16.5	4.16 / 32.8 / 26.1	22.8 / 69.4 / 52.0
+ICL	12.7 / 33.1 / 60.5	11.8 / 27.8 / 39.6	1.22 / 39.8 / 7.36	9.72 / 42.1 / 23.4	22.4 / 77.2 / 41.6
Open-source Specialized VLMs
LLaVA-ST	12.1 / 19.7 / 53.5	0.00 / 9.46 / 12.4	0.79 / 8.58 / 3.68	0.00 / 12.0 / 0.45	0.80 / 35.6 / 13.2
Open-source General-Purpose VLMs
Qwen3-VL-4B	5.73 / 25.4 / 19.7	0.00 / 4.14 / 5.32	0.00 / 8.58 / 0.00	0.00 / 13.4 / 0.00	0.40 / 39.2 / 0.00
+ICL	0.63 / 28.6 / 0.63	1.18 / 15.9 / 0.00	0.00 / 17.7 / 0.00	0.00 / 34.2 / 0.00	0.00 / 59.6 / 0.00
Qwen3-VL-8B	3.82 / 19.7 / 0.00	0.00 / 10.6 / 0.00	0.00 / 11.6 / 0.00	0.00 / 15.7 / 0.00	0.80 / 46.0 / 0.00
+ICL	0.00 / 28.0 / 4.45	0.59 / 14.7 / 0.00	0.00 / 23.3 / 0.00	0.00 / 34.2 / 0.00	0.40 / 65.6 / 0.00
Qwen3.5-4B	2.54 / 30.5 / 13.3	0.00 / 14.7 / 7.69	0.00 / 12.8 / 0.00	0.00 / 28.2 / 1.49	0.40 / 49.2 / 2.00
+ICL	3.18 / 29.2 / 17.1	2.95 / 23.6 / 5.91	0.00 / 20.8 / 0.00	0.46 / 34.7 / 6.70	2.00 / 65.2 / 2.40
Qwen3.5-9B	4.45 / 35.0 / 20.3	0.59 / 17.1 / 12.4	0.61 / 14.7 / 0.00	0.00 / 28.2 / 1.35	0.40 / 50.4 / 4.80
+ICL	2.54 / 35.0 / 10.1	1.77 / 19.5 / 14.2	0.00 / 25.7 / 0.61	0.00 / 25.4 / 7.15	2.40 / 62.8 / 2.80
Eagle2.5-8B	0.00 / 24.8 / 1.27	0.00 / 7.69 / 2.36	0.00 / 7.97 / 0.00	0.00 / 15.2 / 0.00	0.00 / 47.6 / 0.40
+ICL	0.00 / 25.4 / 0.00	0.00 / 12.4 / 0.00	0.00 / 20.8 / 0.00	0.00 / 28.2 / 0.44	0.00 / 61.6 / 0.00
InternVL3-8B	0.63 / 15.2 / 3.82	0.00 / 5.91 / 1.18	0.00 / 6.79 / 0.00	0.00 / 10.6 / 0.00	0.00 / 4.40 / 0.80
+ICL	0.00 / 15.9 / 0.00	0.00 / 7.10 / 0.00	0.00 / 7.36 / 0.00	0.00 / 25.4 / 0.00	0.00 / 7.60 / 0.00
InternVL3-14B	0.63 / 17.8 / 7.64	0.00 / 5.32 / 1.18	0.00 / 7.36 / 0.00	0.00 / 10.6 / 0.00	0.00 / 13.2 / 0.00
+ICL	0.00 / 15.2 / 0.00	0.00 / 8.59 / 0.00	0.00 / 8.64 / 0.00	0.00 / 14.5 / 0.49	0.00 / 19.0 / 0.00
InternVL3.5-8B	0.00 / 11.4 / 0.00	0.00 / 2.95 / 1.18	0.00 / 6.13 / 0.00	0.00 / 7.40 / 0.00	0.00 / 3.60 / 0.00
+ICL	0.00 / 5.09 / 1.27	0.00 / 7.81 / 2.36	0.00 / 3.68 / 0.00	0.46 / 18.0 / 2.69	0.40 / 4.00 / 0.00
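
The three metrics in the table above are not defined on this page, so the sketch below uses definitions that are common in the STVG literature and should be read as an assumption rather than the benchmark's exact protocol. tIoU is the fraction of frames shared by the predicted and ground-truth tubes, vIoU sums per-frame box IoU over the shared frames and divides by the temporal union, and sIoU is assumed here to be the mean box IoU over ground-truth frames. The `@0.3` suffix is taken to mean the percentage of samples whose score reaches 0.3. All names (`Tube`, `box_iou`, `t_iou`, `v_iou`, `s_iou`, `score_at`) are hypothetical and do not come from the AnyGroundBench code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates
Tube = Dict[int, Box]                    # frame index -> box for that frame


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred: Tube, gt: Tube) -> float:
    """Temporal IoU: shared frames over frames covered by either tube."""
    union = set(pred) | set(gt)
    return len(set(pred) & set(gt)) / len(union) if union else 0.0


def v_iou(pred: Tube, gt: Tube) -> float:
    """Spatio-temporal IoU: box IoU summed over shared frames, divided by the temporal union."""
    union = set(pred) | set(gt)
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[f], gt[f]) for f in shared) / len(union) if union else 0.0


def s_iou(pred: Tube, gt: Tube) -> float:
    """Spatial IoU (assumed definition): mean box IoU over ground-truth frames.

    Ground-truth frames without a predicted box contribute 0.
    """
    if not gt:
        return 0.0
    return sum(box_iou(pred[f], gt[f]) for f in gt if f in pred) / len(gt)


def score_at(values: List[float], threshold: float = 0.3) -> float:
    """Percentage of samples whose IoU is at least the threshold (assumed meaning of '@0.3')."""
    return 100.0 * sum(v >= threshold for v in values) / len(values) if values else 0.0


# Toy example: ground truth spans frames 0-9, the prediction spans frames 5-14
# and covers only the top half of the ground-truth box.
gt_tube = {f: (0.0, 0.0, 10.0, 10.0) for f in range(10)}
pred_tube = {f: (0.0, 0.0, 10.0, 5.0) for f in range(5, 15)}
print(t_iou(pred_tube, gt_tube), v_iou(pred_tube, gt_tube), s_iou(pred_tube, gt_tube))
```

Under these assumed definitions, vIoU can never exceed tIoU for the same sample, because each shared frame contributes at most 1 to the numerator. This is consistent with the STVG column being the lowest of the three in most cells of the table.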

Qualitative Examples