SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters
ICCVW 2025
Warehouse Spatial Intelligence Top🥉
Considering the transporters Region 0 Region 1 Region 2 and the pallets Region 3 Region 4 Region 5 Region 6 Region 7 Region 8 Region 9 Region 10 Region 11 Region 12, which pallet is the optimal choice for an empty transporter to pick up?
The transporter Region 0 is not carrying any boxes. The pallet Region 12 is the closest to transporter Region 0, so it is the most suitable choice for automated picking.
What is SmolRGPT?
SmolRGPT is a multimodal AI system designed to answer spatial questions in environments such as warehouses. It takes RGB and depth images as inputs and uses a frozen visual feature extractor (SigLIP2) to generate high-level representations. Trainable connector and refiner modules process these features separately for RGB and depth. Region masks select the relevant objects, and mask pooling aggregates the features inside each mask. The pooled features, together with the question, are fed into a language model (SmolLM2) that integrates visual and textual information to produce a natural language answer, such as the distance between two pallets. The architecture combines frozen and trainable components for flexible, grounded reasoning.
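The mask pooling step described above can be illustrated as averaging the feature vectors of the grid cells covered by a region mask. This is a minimal pure-Python sketch, not the authors' implementation: the function name mask_pool, the toy 2x2 feature grid, and the choice of a plain mean are assumptions made for illustration.

```python
from typing import List


def mask_pool(features: List[List[List[float]]],
              mask: List[List[int]]) -> List[float]:
    """Average the feature vectors at grid cells where mask == 1.

    features: H x W grid of D-dimensional vectors (hypothetical patch features).
    mask:     H x W grid of 0/1 values marking one region.
    """
    dim = len(features[0][0])
    total = [0.0] * dim
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                for d in range(dim):
                    total[d] += vec[d]
                count += 1
    if count == 0:
        # Empty mask: return a zero vector rather than dividing by zero.
        return total
    return [t / count for t in total]


# Toy 2x2 grid of 2-dimensional features (values are made up).
features = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
# The region covers only the top row of the grid.
region_mask = [
    [1, 1],
    [0, 0],
]
pooled = mask_pool(features, region_mask)
print(pooled)  # [2.0, 1.0]
```

Each region yields one fixed-size vector regardless of its pixel area, which is what lets a variable number of regions be passed to the language model alongside the question.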
[Figure: SmolRGPT architecture]
AIC 2025 (ICCVW Workshop): Warehouse Spatial Intelligence
SmolRGPT placed 3rd in Track 3 of the AI City Challenge 2025 with a final S1 score of 90.68, demonstrating that a 600M-parameter model can compete with much larger architectures. It excelled at spatial relationship understanding, reaching 99.8% accuracy on left-right directional tasks, and achieved strong results in counting (92.76% accuracy, RMSE 0.0750) thanks to effective depth integration. For multiple-choice spatial reasoning, SmolRGPT scored 88.02% accuracy, showing robust scene comprehension. The most challenging task was distance estimation (82.13% accuracy, RMSE 0.4740), but this still surpassed expectations for a model of this size, highlighting the impact of using both RGB and depth refiners.
Rank | Team Name | Score
1 | UWIPL_ETRI | 96.0789
2 | HCMUT.VNU | 91.9735
3 | Embia (smolRGPT) | 90.6772
4 | MIZSU | 73.0606
5 | HCMUS_HTH | 66.8861
6 | MealsRetrieval | 56.6352
7 | BKU22 | 50.3662
8 | Smart Lab | 31.9245
9 | AICV | 28.2993
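The results above are reported as accuracy and RMSE. As a reminder of what these two metric types measure, here is a minimal sketch using their standard definitions. The challenge's exact scoring rules (for example, how a numeric answer is judged correct, or how the S1 score is aggregated) are not stated here, so exact-match accuracy is an assumption, and the counts below are made-up values.

```python
import math
from typing import Sequence


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def exact_match_accuracy(predictions: Sequence[float],
                         targets: Sequence[float]) -> float:
    """Fraction of predictions equal to their target (an assumed criterion)."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, t in zip(predictions, targets) if p == t)
    return hits / len(targets)


# Hypothetical pallet counts for four questions (not challenge data).
pred_counts = [2, 3, 1, 4]
true_counts = [2, 3, 2, 4]

count_rmse = rmse(pred_counts, true_counts)
count_acc = exact_match_accuracy(pred_counts, true_counts)
print(f"RMSE: {count_rmse:.2f}, accuracy: {count_acc:.2%}")
```

Accuracy only records whether an answer counted as correct, while RMSE also reflects how far off the wrong answers were, which is why both are reported for counting and distance.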
SpatialRGPT-Bench
To evaluate general spatial reasoning, we further trained SmolRGPT on a mix of the OSD and Warehouse datasets and tested it on SpatialRGPT-Bench. Despite having only 600M parameters, SmolRGPT achieved 65.6% average accuracy on qualitative spatial tasks, outperforming much larger models such as GPT-4, GPT-4V, and LLaVA-v1.6-34B. It was especially strong on depth (Behind/Front, 79.0%) and height (Tall/Short, 74.1%) relations. SpatialRGPT and SpatialRGPT-Depth achieved higher accuracy, but they use much larger architectures. On quantitative tasks, SmolRGPT matched or outperformed several larger models on direct distance and direction, though width and height estimation remain areas to improve. Overall, SmolRGPT’s results show that it offers competitive spatial reasoning performance with exceptional parameter efficiency, making it well suited to resource-constrained devices.
Qualitative Spatial Reasoning
Method | Parameters | Below/Above | Left/Right | Big/Small | Tall/Short | Wide/Thin | Behind/Front | Avg.
GPT-4 | 1.76T | 64.1 | 42.8 | 42.8 | 61.6 | 61.6 | 49.0 | 57.8
GPT-4V | 1.76T | 63.3 | 46.6 | 64.1 | 60.7 | 68.2 | 45.4 | 58.1
LLaVA-v1.6-34B | 34B | 44.1 | 45.7 | 36.7 | 53.5 | 37.5 | 45.4 | 43.9
GPT-4V+SoM | 1.76T | 75.0 | 55.2 | 42.4 | 54.4 | 49.0 | 47.2 | 54.3
LLaVA-v1.6-34B+SoM | 34B | 44.1 | 40.0 | 33.9 | 47.3 | 41.3 | 46.3 | 42.3
Kosmos-2 | 1.3B | 28.3 | 15.2 | 4.71 | 26.7 | 12.5 | 12.7 | 17.0
RegionVILA | 7B* | 30.8 | 47.6 | 35.8 | 44.6 | 35.5 | 49.0 | 40.4
SmolRGPT | 600M | 71.6 | 49.5 | 67.9 | 74.1 | 51.9 | 79.0 | 65.6
SpatialRGPT | 7B* | 99.1 | 99.0 | 79.2 | 89.2 | 83.6 | 87.2 | 89.8
SpatialRGPT-Depth | 7B* | 99.1 | 99.0 | 80.1 | 91.9 | 87.5 | 91.8 | 91.7
Quantitative Spatial Reasoning
Method | Parameters | Direct Distance | Horizontal Dist. | Vertical Dist. | Width | Height | Direction
GPT-4 | 1.76T | 21.6 | 11.5 | 33.0 | 52.3 | 48.1 | 34.6
GPT-4V | 1.76T | 29.7 | 25.4 | 33.0 | 51.1 | 68.4 | 43.9
LLaVA-v1.6-34B | 34B | 24.3 | 24.5 | 30.4 | 30.8 | 42.8 | 33.6
GPT-4V+SoM | 1.76T | 25.7 | 22.1 | 33.9 | 45.8 | 62.4 | 54.2
LLaVA-v1.6-34B+SoM | 34B | 12.8 | 20.4 | 11.3 | 9.02 | 7.52 | 11.3
Kosmos-2 | 1.3B | 4.05 | 4.91 | 18.9 | 3.01 | 3.10 | 3.82
RegionVILA | 7B* | 22.3 | 24.6 | 17.9 | 36.8 | 49.6 | 35.5
SmolRGPT | 600M | 35.8 | 18.3 | 33.9 | 18.05 | 20.3 | 35.5
SpatialRGPT | 7B* | 35.1 | 59.0 | 53.8 | 51.9 | 54.9 | 95.3
SpatialRGPT-Depth | 7B* | 41.2 | 65.6 | 51.9 | 49.6 | 57.9 | 95.3
Examples
The following examples demonstrate SmolRGPT’s capabilities in spatial reasoning and visual comprehension within complex, real-world environments. Presented below are illustrative scenarios highlighting tasks such as object counting, determining relative positions, measuring spatial distances, and making decisions based on scene context. For each example, you can examine the question asked, SmolRGPT’s prediction, and the ground truth answer, all paired with the relevant visual input. These cases showcase the accuracy and versatility of SmolRGPT in interpreting visual scenes for practical applications.
Counting
Question: Given the buffer masks Region 0 Region 1 Region 2 and pallet masks Region 3 Region 4 Region 5 Region 6 Region 7 Region 8, what is the count of pallets in the leftmost buffer area?
SmolRGPT Prediction: The buffer region Region 1 is the leftmost buffer region from this viewpoint. The buffer region Region 1 contains pallets Region 3 Region 4. Therefore, the buffer area Region 1 has two pallets.
Ground Truth: Observing from this perspective, the buffer region Region 1 is the leftmost buffer region. The buffer region Region 1 has pallets Region 3 Region 4. Hence, in buffer area Region 1, there are exactly two pallets.
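The reasoning in this example has two steps: find the leftmost buffer region, then count the pallets that lie inside it. A rough geometric version of those steps is sketched below. It is only an illustration, not how SmolRGPT or the dataset computes the answer: the helper names, the rectangular toy masks, and the 50% pixel-overlap rule for deciding that a pallet belongs to a buffer are all assumptions.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (x, y) image coordinates


def mean_x(mask: Set[Pixel]) -> float:
    """Horizontal centroid of a mask."""
    return sum(x for x, _ in mask) / len(mask)


def pallets_in_leftmost_buffer(
    buffers: Dict[int, Set[Pixel]],
    pallets: Dict[int, Set[Pixel]],
    min_overlap: float = 0.5,  # assumed threshold, not from the source
) -> Tuple[int, List[int]]:
    """Return the leftmost buffer id and the pallet ids overlapping it."""
    leftmost = min(buffers, key=lambda rid: mean_x(buffers[rid]))
    area = buffers[leftmost]
    inside = [
        rid
        for rid, mask in sorted(pallets.items())
        if len(mask & area) / len(mask) >= min_overlap
    ]
    return leftmost, inside


def rect(x0: int, y0: int, x1: int, y1: int) -> Set[Pixel]:
    """Toy rectangular mask covering x0 <= x < x1 and y0 <= y < y1."""
    return {(x, y) for x in range(x0, x1) for y in range(y0, y1)}


# Made-up layout in which Region 1 is the leftmost buffer.
buffers = {0: rect(10, 0, 20, 10), 1: rect(0, 0, 10, 10), 2: rect(20, 0, 30, 10)}
pallets = {3: rect(1, 1, 4, 4), 4: rect(5, 5, 8, 8), 5: rect(12, 2, 15, 5)}

leftmost, inside = pallets_in_leftmost_buffer(buffers, pallets)
print(leftmost, inside, len(inside))  # 1 [3, 4] 2
```

In a real image, perspective means that 2D overlap is only a proxy for a pallet standing inside a buffer zone, which is one reason depth input is useful for this task.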
Distance
Question: Could you measure the distance between the pallet Region 0 and the pallet Region 1?
SmolRGPT Prediction: The pallet Region 0 and the pallet Region 1 are 8.23 meters apart.
Ground Truth: The pallet Region 0 is 8.15 meters away from the pallet Region 1.
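To put the gap between the prediction and the ground truth in this example into numbers, the short sketch below computes the absolute and relative error from the two distances quoted above. The variable names are illustrative, and no tolerance for counting an answer as correct is implied.

```python
predicted_m = 8.23      # SmolRGPT prediction from the example, in meters
ground_truth_m = 8.15   # ground truth from the example, in meters

# Absolute error in meters and error relative to the true distance.
abs_error = abs(predicted_m - ground_truth_m)
rel_error = abs_error / ground_truth_m

print(f"absolute error: {abs_error:.2f} m")  # absolute error: 0.08 m
print(f"relative error: {rel_error:.2%}")    # relative error: 0.98%
```

The prediction is off by about 8 centimeters, or roughly 1% of the true distance.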
Left Right
Question: Is the pallet Region 0 positioned on the left or right side of the pallet Region 1 relative to the observer?
SmolRGPT Prediction: The pallet Region 0 is to the left of the pallet Region 1.
Ground Truth: The pallet Region 0 is situated on the left of the pallet Region 1.
MCG (Multiple Choice Grounding)
Question: Given the transporters Region 0, Region 2 and the pallets Region 3, Region 4, Region 5, Region 6, Region 8, Region 9, which pallet is best suited for automated picking by an empty transporter?
SmolRGPT Prediction: The transporter Region 2 is not holding any boxes. The pallet Region 8 is the closest to transporter Region 2, so it is the most suitable choice for automated picking.
Ground Truth: At the moment, the transporter Region 2 is not loaded with any boxes. The pallet Region 8 is nearest to the transporter Region 2, making it the best choice to pick up first.
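The decision rule stated in this answer (pick the empty transporter, then choose the pallet nearest to it) can be written down directly. The sketch below is a hypothetical illustration of that rule only: the function name pick_pallet, the 2D floor coordinates, and the loaded flags are invented for the example, and SmolRGPT itself answers from images rather than from explicit coordinates.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]  # hypothetical (x, y) floor position in meters


def pick_pallet(
    transporters: Dict[int, Point],
    loaded: Dict[int, bool],
    pallets: Dict[int, Point],
) -> Tuple[int, int]:
    """Return (empty transporter id, id of the pallet nearest to it)."""
    empty = [rid for rid in sorted(transporters) if not loaded[rid]]
    if not empty:
        raise ValueError("no empty transporter available")
    transporter = empty[0]
    origin = transporters[transporter]
    # Euclidean distance; ties resolve to the lowest region id.
    pallet = min(sorted(pallets), key=lambda rid: math.dist(origin, pallets[rid]))
    return transporter, pallet


# Made-up positions loosely mirroring the example's region ids.
transporters = {0: (0.0, 0.0), 2: (10.0, 0.0)}
loaded = {0: True, 2: False}
pallets = {3: (1.0, 1.0), 4: (4.0, 3.0), 8: (9.0, 1.0), 9: (14.0, 5.0)}

choice = pick_pallet(transporters, loaded, pallets)
print(choice)  # (2, 8)
```

With these positions the rule selects transporter Region 2 and pallet Region 8, the same pairing as in the example.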
Citation
@article{traore2025smolrgptefficientspatialreasoning,
  title={SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters},
  author={Abdarahmane Traore and Éric Hervet and Andy Couturier},
  year={2025},
  eprint={2509.15490},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.15490}
}