STI-Bench : Are MLLMs Ready for
Precise Spatial-Temporal World Understanding?

Yun Li1,2, Yiming Zhang1,3, Tao Lin1, XiangRui Liu1,4, Wenxiao Cai5, Zheng Liu4, Bo Zhao1
1School of AI, Shanghai Jiao Tong University; 2China University of Geosciences; 3Nanyang Technological University; 4BAAI; 5Stanford University;


Abstract

The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding, their ability to perform precise, quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leaving their prospects uncertain. To evaluate models' spatial-temporal intelligence, we introduce STI-Bench, a benchmark that assesses MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark covers a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.

Demo Videos

Tasks Definition

We divide the questions into Static Understanding and Dynamic Understanding, and further define eight distinct tasks; three representative tasks are described below.

Dimensional Measurement. Concerns estimates of an object's geometric size, such as length, width, and height, as well as the distance between objects or between the camera and an object.
Example: “What is the height of this box?” or “How close is the camera to the table?”

Spatial Relation. Focuses on identifying spatial relationships among objects or between the camera and an object, including front/back, left/right, and above/below.
Example: “Is the chair on the left or right side of the table?” or “What is the position of the red bag relative to the fur sofa?”

3D Video Grounding. Given a semantic description such as “the red backpack on the brown sofa,” the goal is to retrieve the object's 3D bounding box in the camera coordinate system at a specific point in the video.
Example: “Locate the 3D bounding box of the red suitcase near the bed.”
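
To make the task format concrete, here is a hedged sketch of what a single multiple-choice QA item might look like; the field names and values are illustrative assumptions, not the released data schema:

```python
# Illustrative (hypothetical) structure of one STI-Bench QA item.
# Field names are assumptions for exposition, not the released schema.
qa_item = {
    "task": "Dimensional Measurement",          # one of the eight task types
    "scenario": "indoor",                        # desktop / indoor / outdoor
    "video": "scannet_scene0042.mp4",            # source clip (hypothetical id)
    "question": "What is the height of the box on the desk?",
    "options": {"A": "0.18 m", "B": "0.32 m", "C": "0.47 m", "D": "0.60 m"},
    "answer": "B",                               # derived from 3D ground truth
}
```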

Comparison


Comparison of STI-Bench with existing benchmarks. Data represents the source of our QA data, where V stands for Video and I stands for Image. Env. indicates the environment in which the data is generated, where S represents Simulation and R represents Real. The two columns under View indicate whether the dataset includes Ego-centric and Allocentric perspectives. The two columns under Evaluation specify whether the ground truth is presented in numerical or textual form. The four columns under Spatial-Temporal indicate whether the benchmark evaluates spatial distance, direction (with angular precision), velocity, or a precise and comprehensive trajectory description.

Pipeline


I. Data Collection. We collect data from desktop, indoor, and outdoor scenarios, using Omni6DPose for 6D object pose estimation, ScanNet for indoor 3D scene reconstruction, and Waymo for autonomous driving. These datasets provide frame-by-frame camera parameters and point clouds.
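
Because per-frame camera parameters are available, ground-truth quantities such as camera-to-object distance can be computed directly. Below is a minimal sketch of that computation, assuming a 4x4 world-to-camera extrinsic and an object center in world coordinates (the helper and its inputs are illustrative, not the benchmark's actual code):

```python
import numpy as np

def camera_to_object_distance(T_world_to_cam: np.ndarray,
                              object_center_world: np.ndarray) -> float:
    """Distance from the camera origin to an object's center.

    T_world_to_cam: 4x4 extrinsic matrix mapping world -> camera coordinates.
    object_center_world: (3,) object center in world coordinates, e.g. the
    centroid of its point cloud or 3D bounding box.
    """
    p_world = np.append(object_center_world, 1.0)   # homogeneous coordinates
    p_cam = T_world_to_cam @ p_world                # point expressed in camera frame
    return float(np.linalg.norm(p_cam[:3]))         # camera center sits at the origin

# Example: object 2 m in front of a camera placed at the world origin.
print(camera_to_object_distance(np.eye(4), np.array([0.0, 0.0, 2.0])))  # 2.0
```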

II. Automatic QA Pair Generation. We generate QA pairs with MLLMs, using detailed object descriptions and ground truth computed for each task. This process produces a diverse set of questions with challenging answer options.
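
For the numeric tasks, one way to obtain challenging answer options is to sample distractors near the ground-truth value. The sketch below is an illustrative assumption of such a step, not the benchmark's released implementation:

```python
import random

def make_options(ground_truth: float, rel_spread: float = 0.35, k: int = 4):
    """Build one correct option and k-1 nearby distractors around a numeric
    ground truth (e.g. a distance in meters). The spread is an illustrative
    choice; the actual benchmark may construct distractors differently."""
    options = {ground_truth}
    while len(options) < k:
        candidate = ground_truth * (1.0 + random.uniform(-rel_spread, rel_spread))
        # keep distractors clearly distinguishable from the true value
        if abs(candidate - ground_truth) > 0.1 * ground_truth:
            options.add(round(candidate, 2))
    shuffled = random.sample(sorted(options), k)
    answer = "ABCD"[shuffled.index(ground_truth)]
    return dict(zip("ABCD", shuffled)), answer

opts, ans = make_options(1.25)
print(opts, "correct:", ans)
```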

III. Human Quality Control. We conduct human quality control to filter and refine QA pairs, correcting inaccurate descriptions and removing questions the video cannot support. Answer options are then shuffled to ensure high-quality questions and robust evaluation.

IV. Fine-Grained Adjustment. We adjust QA pairs with scenario-specific scaling factors to match the precision each setting requires, from millimeters for desktop scenes to meters for outdoor environments. This fine-grained adjustment supports effective training and evaluation of MLLMs.
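
A hedged sketch of what such scenario-dependent scaling might look like (the scale table, the indoor unit, and the helper are assumptions for exposition):

```python
# Illustrative scenario-dependent precision: desktop questions phrased in
# millimeters, indoor in centimeters (assumed), outdoor in meters, so answer
# options remain meaningful at each scale.
SCALE = {"desktop": (1000.0, "mm"), "indoor": (100.0, "cm"), "outdoor": (1.0, "m")}

def format_length(meters: float, scenario: str) -> str:
    factor, unit = SCALE[scenario]
    return f"{meters * factor:.1f} {unit}"

print(format_length(0.042, "desktop"))   # "42.0 mm"
print(format_length(12.7, "outdoor"))    # "12.7 m"
```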

Results


Main Results. Experiments in indoor, outdoor, and desktop environments consistently show that MLLMs exhibit significant limitations in dimensional measurement and displacement estimation. Across all evaluated scenarios, models achieve substantially lower performance on these tasks than on other spatial reasoning tasks such as pose estimation, directional reasoning, and 3D video grounding. Performance is uniformly low across all models tested, indicating a general deficiency in accurately perceiving and estimating distances and displacements rather than a shortcoming of specific model architectures.

Radar

Analysis

By examining the models' reasoning processes and uniformly sampling approximately 200 error records across task types and scenarios, we categorize the errors into three representative patterns, which reflect three core limitations of MLLMs.

Error Analysis Pie Chart

1. Inaccurate Spatial Quantification

Models struggle with accurately estimating spatial properties from visual inputs, including object dimensions, distances, and 3D positions. These issues stem from a lack of clear visual references and the inherent challenge of inferring 3D information from 2D images, affecting all tasks requiring precise spatial measurements.

2. Flawed Temporal Dynamics Understanding

Models perform poorly at understanding information that changes over time, struggling to accurately calculate and describe motion characteristics such as displacement, speed, and trajectories. They particularly struggle to distinguish object motion from camera motion, a difficulty rooted in weak cross-frame integration and the lack of an internal physical motion model.

3. Weak Cross-Modal Integration

Models fail to effectively connect textual instructions with visual content or integrate non-visual data with visual information. This leads to misinterpretation of temporal constraints, improper use of initial conditions, and incorrect associations between structured data and visual elements, affecting all tasks relying on multimodal information.

QUIZ

What is the velocity of the object?
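
For reference, average speed can be recovered from per-frame object positions as path length divided by elapsed time; the sketch below illustrates this under the assumption that metric positions and timestamps are available (it is not the benchmark's evaluation code):

```python
import numpy as np

def average_speed(positions: np.ndarray, timestamps: np.ndarray) -> float:
    """Average speed (m/s) from per-frame object positions (N x 3, meters)
    and frame timestamps (N, seconds): path length divided by elapsed time."""
    path_length = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()
    return float(path_length / (timestamps[-1] - timestamps[0]))

# Object moving 0.5 m per frame along x at 1 fps -> 0.5 m/s.
pos = np.array([[0, 0, 0], [0.5, 0, 0], [1.0, 0, 0], [1.5, 0, 0]], float)
t = np.array([0.0, 1.0, 2.0, 3.0])
print(average_speed(pos, t))  # 0.5
```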

More Examples

Example 1
Example 2
Example 3

Conclusion

We introduced STI-Bench, a comprehensive benchmark that assesses MLLMs' spatial-temporal understanding through over 300 real-world videos and 2,000 QA pairs spanning robot desktop, indoor, and outdoor scenarios. The benchmark reveals significant limitations in current MLLMs' spatial-temporal understanding: even the top-performing models achieve no more than 50% accuracy, and all struggle with precise quantitative tasks such as dimensional measurement. Our analysis identifies three key weaknesses: inaccurate spatial quantification, flawed temporal dynamics understanding, and weak cross-modal integration. These findings highlight the substantial gap between current capabilities and the reliability required for embodied AI and autonomous driving. STI-Bench provides a valuable framework for evaluating and improving MLLMs' understanding of the physical world, which is essential for developing the next generation of embodied intelligent systems.