Why Far Looks Up: Probing Spatial
Representation in Vision-Language Models

¹Seoul National University · ²Republic of Korea Air Force · ³The Ohio State University · ⁴NVIDIA
Figure 1: VLMs often answer spatial understanding questions by using vertical image position as a proxy for depth rather than genuine 3D reasoning. We diagnose this bias at the representation level and show that internal structure, not benchmark accuracy, predicts robustness.


Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit markedly different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images.


Benchmark scores alone don't tell the full story

Some models are inconsistent across benchmarks — strong on one, weak on another. Others are uniformly robust. Why?

Accuracy (%) across spatial reasoning benchmarks:

| Model | EmbSpatial Overall | CV-2D Relation | CV-3D Depth | CV-3D Distance | BLINK Depth | BLINK Spat. Rel. |
|---|---|---|---|---|---|---|
| Molmo-7B | 60.7 | 76.3 | 84.5 | 68.5 | 78.2 | 70.6 |
| NVILA-Lite-2B | 54.0 | 58.6 | 69.2 | 52.3 | 64.5 | 67.1 |
| RoboRefer-2B | 92.0 | 96.5 | 95.7 | 90.5 | 84.7 | 79.7 |
| Qwen2.5-VL-3B | 62.3 | 67.4 | 70.3 | 60.2 | 68.6 | 83.9 |
| Qwen3-VL-235B | 82.0 | 96.5 | 93.3 | 91.0 | 84.7 | 90.2 |
Some models (Molmo, NVILA, Qwen2.5) show inconsistent cross-benchmark patterns. In contrast, RoboRefer and Qwen3-VL-235B are consistently strong across all benchmarks. Benchmark accuracy alone cannot explain this gap.

A perspective bias hidden inside benchmarks

VLMs exploit a statistical regularity in natural images — objects appearing higher tend to be farther away — using vertical position as a proxy for depth.

Consistent vs. counter-heuristic examples

In everyday photographs, perspective projection creates a reliable correlation: farther objects appear higher in the image. VLMs trained on such data may internalize this as a shortcut, conflating 2D vertical position with 3D depth rather than reasoning about 3D structure directly.
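To make the heuristic concrete, here is a minimal sketch of how a two-object depth question could be labeled consistent or counter-heuristic; the field names are illustrative, not the benchmark's actual schema:

```python
def heuristic_label(sample):
    """Label a two-object depth question as perspective-consistent or
    counter-heuristic. `sample` holds image-plane y (pixels, origin at the
    top of the frame) and ground-truth depth z for both objects; the field
    names here are hypothetical, not the benchmark's schema."""
    a, b = sample["object_a"], sample["object_b"]
    heuristic_far = "a" if a["y"] < b["y"] else "b"  # higher in image = smaller y
    actual_far = "a" if a["z"] > b["z"] else "b"     # larger depth = farther
    return "consistent" if heuristic_far == actual_far else "counter"
```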

- 80.9% of EmbSpatial-Bench questions are consistent with the perspective heuristic.
- 10.7% are counter-heuristic: the ground truth contradicts the perspective heuristic.
- −36.9 pp: worst accuracy drop (consistent → counter) across all models.
Per-model accuracy (%) on counter-heuristic vs. perspective-consistent samples; Gap = counter − consistent. The original chart shows two panels:

| Model | Counter | Consistent | Gap |
|---|---|---|---|
| Molmo-7B | 34.9 | 63.5 | −28.6 |
| NVILA-Lite-2B | 27.1 | 49.0 | −21.9 |
| RoboRefer-2B | 59.7 | 87.0 | −27.3 |
| Qwen2.5-VL-3B | 32.6 | 54.7 | −22.1 |
| Qwen3-VL-235B | 41.7 | 73.3 | −31.6 |

| Model | Counter | Consistent | Gap |
|---|---|---|---|
| Molmo-7B | 75.4 | 93.1 | −17.7 |
| NVILA-Lite-2B | 40.0 | 74.4 | −34.4 |
| RoboRefer-2B | 95.4 | 98.9 | −3.5 |
| Qwen2.5-VL-3B | 55.4 | 75.5 | −20.1 |
| Qwen3-VL-235B | 90.8 | 98.1 | −7.3 |

Does data scaling help?

One natural question: can we simply train away this bias with more spatial data? We fine-tuned each model on a mixture of five spatial reasoning datasets — SAT, RoboSpatial, SPAR-7M, RefSpatial, and PRISM — at four scales (80k, 400k, 800k, and 2M total samples). For full details on the data mixture, see Appendix B.3.
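The exact mixture ratios live in Appendix B.3; purely as a sketch, a proportional subsampler over pre-loaded datasets might look like this (the helper, dataset variables, and proportional-sampling assumption are all hypothetical):

```python
import random

def build_mixture(datasets, total, seed=0):
    """Subsample a training mixture to roughly `total` examples, drawing from
    each dataset in proportion to its size. `datasets` maps name -> list of
    loaded samples. Proportional sampling is an assumption for illustration,
    not the mixture specified in Appendix B.3."""
    rng = random.Random(seed)
    pool = sum(len(d) for d in datasets.values())
    mixture = []
    for data in datasets.values():
        k = min(round(total * len(data) / pool), len(data))
        mixture.extend(rng.sample(data, k))
    rng.shuffle(mixture)  # interleave the five sources
    return mixture

# e.g. build_mixture({"SAT": sat, "RoboSpatial": robo, "SPAR-7M": spar,
#                     "RefSpatial": refs, "PRISM": prism}, total=400_000)
```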

SpatialTunnel: diagnosing the bias with precision

Role 01
Eliminate background confounds
Real images conflate depth with background context, apparent size, and lighting. SpatialTunnel's Blender-rendered tunnel corridor leaves object position as the only variable — no background shortcuts possible.
Role 02
Fine-grained measurement via 16×16 grid
Objects are parameterized by depth z and angular position θ. The 16×16 configuration grid maps exactly which placements trigger failures — a heatmap of entanglement across the full position space.
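For intuition, a minimal pinhole-camera sketch of the (z, θ) parameterization; this is not the paper's Blender pipeline, and the radius, focal length, and depths below are made-up values:

```python
import numpy as np

def project(z, theta, r=1.0, f=1.0):
    """Image-plane (u, v) of a point on a tunnel wall of radius `r` at depth
    `z` and angle `theta`, under a pinhole camera looking down +z."""
    x, y = r * np.cos(theta), r * np.sin(theta)
    return f * x / z, f * y / z  # larger v = higher in the frame

# A 16x16 sweep: both objects sit at fixed depths while their angular
# positions vary, so 2D image layout changes independently of depth ordering.
angles = np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False)
z_near, z_far = 3.0, 6.0  # fixed, illustrative depths
grid = [((z_near, t1), (z_far, t2)) for t1 in angles for t2 in angles]
```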
Fig. 3: SpatialTunnel holds the two objects at fixed depths while sweeping their angular positions around the tunnel cross-section, so that 2D image-plane layout varies independently of depth ordering.

Fig. 4: Mean accuracy heatmaps on SpatialTunnel for Molmo-7B. Accuracy on consistent cells improves steadily. In contrast, counter cells remain substantially harder, with the largest drop at 400k and a partial recovery at 2M.

Scaling steadily improves accuracy on consistent cells, but counter cells remain substantially harder throughout. The bias persists even at 2M training samples — confirming it is model-intrinsic, not a dataset artifact.

Looking inside: contrastive probing

If the bias is model-intrinsic, we need to examine the representations directly.
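The unit of analysis is the delta vector: the embedding difference between two renders that differ along exactly one spatial axis. A minimal sketch, assuming an `encode` callable (hypothetical) that maps an (image, question) pair to the model's pooled hidden state:

```python
import torch

@torch.no_grad()
def delta_vector(encode, pair):
    """One sign-oriented delta vector for a minimal contrastive pair.
    `encode` is assumed to map (image, question) -> pooled hidden state
    (a 1-D tensor). The pair's two renders differ only along one spatial
    axis; here, the same target object placed far vs. close."""
    h_far = encode(pair["far_image"], pair["question"])
    h_close = encode(pair["close_image"], pair["question"])
    return h_far - h_close  # direction in embedding space for "far vs. close"
```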

Coh_D
Axis Coherence (Distance)
"Does depth have a stable, consistent direction in embedding space?"
Mean pairwise cosine similarity of sign-corrected delta vectors for the distance axis. High coherence = the model encodes the axis as a stable, consistent direction in representation space.
$$\mathrm{Coh} = \frac{2}{N(N-1)} \sum_{i<j} \cos\!\left(\tilde{\delta}_i, \tilde{\delta}_j\right)$$
↑ Higher is better
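A direct implementation of the coherence formula, assuming the sign-corrected delta vectors for one axis are stacked into an [N, D] tensor (the upper-triangular mean below equals the normalized sum over i < j):

```python
import torch
import torch.nn.functional as F

def axis_coherence(deltas: torch.Tensor) -> float:
    """Coh for one spatial axis: mean pairwise cosine similarity over all
    N(N-1)/2 pairs of sign-corrected delta vectors, shape [N, D]."""
    d = F.normalize(deltas, dim=-1)  # unit-norm rows
    sims = d @ d.T                   # [N, N] cosine similarities
    i, j = torch.triu_indices(d.shape[0], d.shape[0], offset=1)
    return sims[i, j].mean().item()
```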
VD-EI
VD-Entanglement Index
"Is 'far' confused with 'above' internally?"
Directional coupling between vertical and distance representations. Highly positive = vertical & distance are coupled (perspective bias)
$$\text{VD-EI} = \frac{1}{4}\Big[\cos(\mu_{\mathrm{above}}, \mu_{\mathrm{far}}) + \cos(\mu_{\mathrm{below}}, \mu_{\mathrm{close}}) - \cos(\mu_{\mathrm{above}}, \mu_{\mathrm{close}}) - \cos(\mu_{\mathrm{below}}, \mu_{\mathrm{far}})\Big]$$
↓ Lower is better
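And the entanglement index, given the four mean delta vectors (one per direction), following the pairing in the formula above:

```python
import torch.nn.functional as F

def vd_entanglement(mu_above, mu_below, mu_far, mu_close):
    """VD-EI from the four mean delta vectors (1-D tensors): positive when
    'above' aligns with 'far' and 'below' with 'close'."""
    cos = lambda a, b: F.cosine_similarity(a, b, dim=0)
    return (0.25 * (cos(mu_above, mu_far) + cos(mu_below, mu_close)
                    - cos(mu_above, mu_close) - cos(mu_below, mu_far))).item()
```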
Distance coherence is the weakest axis — across all models and scales
Horizontal: 0.649 · Vertical: 0.830 · Distance: 0.182

RoboRefer-2B values shown (best model). CohD remains lowest across all models even after 2M-sample fine-tuning.


Representation structure predicts robustness

PCA of delta vectors reveals the structural difference, answering the puzzle from the first section.

Molmo / NVILA / Qwen (2M) — Partial separation
Horizontal and vertical axes form separable clusters, but distance delta vectors collapse near the origin or overlap with the vertical cluster. Far and above occupy the same embedding region — the shortcut is encoded.
RoboRefer-2B — Three clean clusters
Each axis aligned with a distinct PCA component.
CohD 0.182 · VD-EI 0.362
Qwen3-VL-235B — Well-separated axes
Scale achieves similar disentanglement.
v = 0.908 · Δ = +0.068
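The plot itself is a standard PCA projection; a sketch assuming `deltas_horizontal`, `deltas_vertical`, and `deltas_distance` are [N, D] arrays of delta vectors per axis:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Fit PCA on all delta vectors, then plot each axis as its own cluster.
# Disentangled models show three separated clusters; entangled ones
# collapse the distance deltas onto the vertical cluster.
groups = {"horizontal": deltas_horizontal,
          "vertical": deltas_vertical,
          "distance": deltas_distance}
pca = PCA(n_components=2).fit(np.vstack(list(groups.values())))
for name, arr in groups.items():
    xy = pca.transform(arr)
    plt.scatter(xy[:, 0], xy[:, 1], label=name, s=8)
plt.legend()
plt.show()
```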
Fig. 6a: As CohD grows through fine-tuning, counter accuracy rises; when CohD stagnates, accuracy stagnates too.

Fig. 6b: RoboRefer occupies a unique region of high coherence and low entanglement.

High Coh_D  +  Low VD-EI  =  Robust Spatial Reasoning

Representation structure — not benchmark accuracy — reliably indicates genuine 3D spatial understanding.

🔵 Auxiliary depth supervision
Large-scale RGB-D training directly exposes 3D geometry, building coherent and disentangled spatial axes.
RoboRefer-2B · CohD 0.182 · VD-EI 0.362
⚡ Very large-scale pretraining
At sufficient scale, structured representations emerge even without targeted spatial supervision.
Qwen3-VL-235B · v = 0.908 · Δ = +0.068

BibTeX

@article{min2026whyfarlooksup,
  title   = {Why Far Looks Up: Probing Spatial Representation in Vision-Language Models},
  author  = {Min, Cheolhong and Jung, Jaeyun and Lee, Daeun and Jeon, Hyeonseong and
             Su, Yu and Tremblay, Jonathan and Song, Chan Hee and Park, Jaesik},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026},
}