Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit markedly different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images.
§ 01 — The Puzzle
Some models are inconsistent across benchmarks — strong on one, weak on another. Others are uniformly robust. Why?
| Model | EmbSpatial Overall | CV-2D Relation | CV-3D Depth | CV-3D Distance | BLINK Depth | BLINK Spat. Rel. |
|---|---|---|---|---|---|---|
| Molmo-7B | 60.7 | 76.3 | 84.5 | 68.5 | 78.2 | 70.6 |
| NVILA-Lite-2B | 54.0 | 58.6 | 69.2 | 52.3 | 64.5 | 67.1 |
| RoboRefer-2B | 92.0 | 96.5 | 95.7 | 90.5 | 84.7 | 79.7 |
| Qwen2.5-VL-3B | 62.3 | 67.4 | 70.3 | 60.2 | 68.6 | 83.9 |
| Qwen3-VL-235B | 82.0 | 96.5 | 93.3 | 91.0 | 84.7 | 90.2 |

*All values are accuracy (%). More baseline models will be added.*
§ 02 — The Shortcut
VLMs exploit a statistical regularity in natural images — objects appearing higher tend to be farther away — using vertical position as a proxy for depth.
In everyday photographs, perspective projection creates a reliable correlation: farther objects appear higher in the image. VLMs trained on such data may internalize this as a shortcut, conflating 2D vertical position with 3D depth rather than reasoning about 3D structure directly.
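To make the regularity measurable, here is a minimal sketch that correlates an object's vertical image position with its camera distance. The `(bbox, depth)` annotation format and the function name are hypothetical placeholders for illustration, not part of our pipeline.

```python
import numpy as np
from scipy.stats import pearsonr

def vertical_depth_correlation(objects):
    """Correlate vertical image position with camera distance.

    `objects` is a list of (bbox, depth) pairs: bbox = (x0, y0, x1, y1)
    in pixels with y increasing downward; depth is distance in meters.
    """
    ys = np.array([0.5 * (y0 + y1) for (_, y0, _, y1), _ in objects])
    depths = np.array([d for _, d in objects])
    # y grows downward in image coordinates, so "higher in the image"
    # means smaller y; the perspective regularity therefore shows up
    # as a negative correlation between y and depth.
    r, p = pearsonr(ys, depths)
    return r, p
```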
One natural question: can we simply train away this bias with more spatial data?
We fine-tuned each model on a mixture of five spatial reasoning datasets (SAT, RoboSpatial, SPAR-7M, RefSpatial, and PRISM) at four scales: 80k, 400k, 800k, and 2M total samples.
For full details on the data mixture, see Appendix B.3.
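As a rough sketch of how such a mixture can be subsampled at each scale (the uniform per-dataset weights and the `pools` structure are assumptions made for this example; the actual proportions are in Appendix B.3):

```python
import random

DATASETS = ["SAT", "RoboSpatial", "SPAR-7M", "RefSpatial", "PRISM"]
SCALES = [80_000, 400_000, 800_000, 2_000_000]

def build_mixture(pools, total, seed=0):
    """Draw an equal share of `total` samples from each dataset pool.

    `pools` maps dataset name -> list of examples. Uniform weights
    are an illustrative assumption; real mixtures are often weighted.
    """
    rng = random.Random(seed)
    per_dataset = total // len(DATASETS)
    mixture = []
    for name in DATASETS:
        pool = pools[name]
        k = min(per_dataset, len(pool))   # cap at pool size
        mixture.extend(rng.sample(pool, k))
    rng.shuffle(mixture)
    return mixture
```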
Fig. 3: SpatialTunnel holds the two objects at fixed depths while sweeping their angular positions around the tunnel cross-section, so that 2D image-plane layout varies independently of depth ordering.
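A minimal sketch of that sampling scheme, assuming a pinhole camera aligned with the tunnel axis; the focal length, tunnel radius, and angular resolution below are illustrative defaults, not the benchmark's exact parameters.

```python
import numpy as np

def tunnel_placements(z_near, z_far, radius=1.0, focal=1.0, n_angles=36):
    """Sweep two objects around a tunnel cross-section at fixed depths.

    Each object sits on a circle of the given radius, one at depth
    z_near and one at z_far; a pinhole projection maps both to the
    image plane. Depth ordering stays constant while the 2D layout
    varies freely with the two angles.
    """
    thetas = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    placements = []
    for t1 in thetas:
        for t2 in thetas:
            p1 = np.array([radius * np.cos(t1), radius * np.sin(t1), z_near])
            p2 = np.array([radius * np.cos(t2), radius * np.sin(t2), z_far])
            # Pinhole projection: image coordinates scale as focal / z.
            uv1 = focal * p1[:2] / p1[2]
            uv2 = focal * p2[:2] / p2[2]
            placements.append((uv1, uv2))
    return placements
```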
Fig. 4: Mean accuracy heatmaps on SpatialTunnel for Molmo-7B. Accuracy on consistent cells improves steadily. In contrast, counter cells remain substantially harder, with the largest drop at 400k and a partial recovery at 2M.
§ 03 — Framework
If the bias is model-intrinsic, we need to examine the representations directly.
RoboRefer-2B values shown (best model). CohD remains lowest across all models even after 2M-sample fine-tuning.
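A minimal sketch of the representation probe: build delta vectors from minimal contrastive pairs, then score how consistently they point in one direction. The cosine-to-mean-direction score here is a simple illustrative proxy, not the exact CohD formula.

```python
import numpy as np

def delta_vectors(emb_a, emb_b):
    """Embedding differences over minimal contrastive pairs.

    Row i of `emb_a` and `emb_b` embeds two stimuli identical except
    along one spatial axis (e.g., near vs. far), so the difference
    isolates how the model encodes that axis.
    """
    return np.asarray(emb_b) - np.asarray(emb_a)

def coherence(deltas, eps=1e-8):
    """Mean cosine similarity of each delta to the mean delta direction.

    A value near 1 means the axis shifts embeddings in one consistent
    direction; a value near 0 means the shifts are incoherent.
    """
    unit = deltas / (np.linalg.norm(deltas, axis=1, keepdims=True) + eps)
    mean_dir = unit.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir) + eps
    return float((unit @ mean_dir).mean())
```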
§ 04 — Findings
PCA of delta vectors reveals the structural difference — answering the puzzle from §01.
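The analysis step itself is straightforward; a sketch with scikit-learn, where the two-component projection is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_of_deltas(deltas, n_components=2):
    """Project delta vectors onto their top principal components.

    A cleanly encoded axis concentrates its deltas along a few
    components; entangled axes (e.g., vertical position and distance)
    share components instead of separating.
    """
    pca = PCA(n_components=n_components)
    coords = pca.fit_transform(np.asarray(deltas))
    return coords, pca.explained_variance_ratio_
```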
As CohD grows through fine-tuning, counter accuracy rises. When CohD stagnates, accuracy stagnates too.
RoboRefer occupies a unique region of high coherence and low entanglement.
Representation structure — not benchmark accuracy — reliably indicates genuine 3D spatial understanding.
@article{min2026whyfarlooksup,
title = {Why Far Looks Up: Probing Spatial Representation in Vision-Language Models},
author = {Min, Cheolhong and Jung, Jaeyun and Lee, Daeun and Jeon, Hyeonseong and
Su, Yu and Tremblay, Jonathan and Song, Chan Hee and Park, Jaesik},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2026},
}