SmolRGPT

SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters

ICCVW 2025

AI City Challenge 2025, Warehouse Spatial Intelligence track: 3rd place 🥉

Abdarahmane Traore, Éric Hervet, Andy Couturier

Considering the transporters Region 0 Region 1 Region 2 and the pallets Region 3 Region 4 Region 5 Region 6 Region 7 Region 8 Region 9 Region 10 Region 11 Region 12, which pallet is the optimal choice for an empty transporter to pick up?

The transporter Region 0 is not carrying any boxes. The pallet Region 12 is the closest to transporter Region 0, so it is the most suitable choice for automated picking.

What is SmolRGPT?

SmolRGPT is a multimodal AI system designed to answer spatial questions in environments like warehouses. It takes RGB and depth images as inputs and uses a frozen visual feature extractor (SigLIP2) to generate high-level representations. Trainable connector and refiner modules then process these features separately for RGB and depth. Region masks select the relevant objects, and mask pooling aggregates the features inside each mask. The pooled features, along with the question, are fed into a language model (SmolLM2) that integrates visual and textual information to produce a natural-language answer, such as the distance between two pallets. Combining frozen and trainable components keeps training lightweight while still grounding the answers in specific image regions.
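To make the mask pooling step concrete, here is a minimal sketch in plain Python. It assumes that pooling means averaging the feature vectors at the positions a region mask selects; the actual pooling operator, tensor layout, and feature dimensions used by SmolRGPT are not given on this page, so the function name `mask_pool` and the toy data are illustrative only.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]
Mask = List[List[bool]]               # indexed as [row][column]


def mask_pool(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the feature vectors at positions where the mask is True.

    Assumption: mean pooling. The real model may pool differently.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, selected in zip(feature_row, mask_row):
            if selected:
                count += 1
                for c in range(channels):
                    total[c] += vector[c]
    if count == 0:
        # Empty region: return zeros rather than dividing by zero.
        return total
    return [value / count for value in total]


# Toy 2x2 feature map with 2 channels; the mask selects the left column.
features = [
    [[1.0, 2.0], [10.0, 20.0]],
    [[3.0, 4.0], [30.0, 40.0]],
]
region_mask = [
    [True, False],
    [True, False],
]
region_token = mask_pool(features, region_mask)
print(region_token)  # [2.0, 3.0]
```

Each region mask yields one pooled vector, so a question that mentions several regions would give the language model one such vector per region.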

AIC 2025 (ICCVW Workshop): Warehouse Spatial Intelligence

SmolRGPT achieved 3rd place in the AI City Challenge 2025 Track 3 with a final S1 score of 90.68, demonstrating that a 600M-parameter model can compete with much larger architectures. It excelled in spatial relationship understanding, reaching 99.8% accuracy on left-right directional tasks, and achieved strong results in counting (92.76% accuracy, RMSE 0.0750) thanks to effective depth integration. For multiple-choice spatial reasoning, SmolRGPT scored 88.02% accuracy, showing robust scene comprehension. The most challenging task was distance estimation (82.13% accuracy, RMSE 0.4740), but this still surpassed expectations for a model of this size, highlighting the impact of using both RGB and depth refiners.
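For readers unfamiliar with the two metrics quoted above, the sketch below shows the textbook definitions of accuracy and root-mean-square error (RMSE). The challenge's exact scoring rules (for example, any tolerance or normalization applied to numeric answers) are not described on this page, so this should not be read as the official evaluation code; the counts are made-up toy values.

```python
import math
from typing import Sequence


def accuracy(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Fraction of predictions that exactly match their target."""
    matches = sum(1 for p, t in zip(predictions, targets) if p == t)
    return matches / len(targets)


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Square root of the mean squared difference between pairs."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy counting results (hypothetical, not challenge data).
predicted_counts = [2, 3, 1, 4]
true_counts = [2, 3, 2, 4]

print(accuracy(predicted_counts, true_counts))  # 0.75
print(rmse(predicted_counts, true_counts))      # 0.5
```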

Rank	Team Name	Score
1	UWIPL_ETRI	96.0789
2	HCMUT.VNU	91.9735
3	Embia (smolRGPT)	90.6772
4	MIZSU	73.0606
5	HCMUS_HTH	66.8861
6	MealsRetrieval	56.6352
7	BKU22	50.3662
8	Smart Lab	31.9245
9	AICV	28.2993

SpatialRGPT-Bench

To evaluate general spatial reasoning, we further trained SmolRGPT on a mix of OSD and Warehouse datasets and tested it on SpatialRGPT-Bench. Despite having only 600M parameters, SmolRGPT achieved 65.6% average accuracy on qualitative spatial tasks, outperforming much larger models such as GPT-4 (57.8%), GPT-4V (58.1%), and LLaVA-v1.6-34B (43.9%). It was especially strong on behind/front (79.0%) and tall/short (74.1%) relations. SpatialRGPT and SpatialRGPT-Depth achieved higher accuracy (89.8% and 91.7%), but they use much larger architectures. On quantitative tasks, SmolRGPT matched or outperformed the larger general-purpose baselines on direct distance (35.8) and vertical distance (33.9) and was comparable to several of them on direction (35.5), while horizontal distance, width, and height estimation remain areas to improve. Overall, these results show that SmolRGPT offers competitive spatial reasoning with high parameter efficiency, which makes it well suited to resource-constrained devices.

Qualitative Spatial Reasoning

Method	Parameters	Below/Above	Left/Right	Big/Small	Tall/Short	Wide/Thin	Behind/Front	Avg.
GPT-4	1.76T	64.1	42.8	42.8	61.6	61.6	49.0	57.8
GPT-4V	1.76T	63.3	46.6	64.1	60.7	68.2	45.4	58.1
LLaVA-v1.6-34B	34B	44.1	45.7	36.7	53.5	37.5	45.4	43.9
GPT-4V+SoM	1.76T	75.0	55.2	42.4	54.4	49.0	47.2	54.3
LLaVA-v1.6-34B+SoM	34B	44.1	40.0	33.9	47.3	41.3	46.3	42.3
Kosmos-2	1.3B	28.3	15.2	4.71	26.7	12.5	12.7	17.0
RegionVILA	7B*	30.8	47.6	35.8	44.6	35.5	49.0	40.4
SmolRGPT	600M	71.6	49.5	67.9	74.1	51.9	79.0	65.6
SpatialRGPT	7B*	99.1	99.0	79.2	89.2	83.6	87.2	89.8
SpatialRGPT-Depth	7B*	99.1	99.0	80.1	91.9	87.5	91.8	91.7

Quantitative Spatial Reasoning

Method	Parameters	Direct Distance	Horizontal Dist.	Vertical Dist.	Width	Height	Direction
GPT-4	1.76T	21.6	11.5	33.0	52.3	48.1	34.6
GPT-4V	1.76T	29.7	25.4	33.0	51.1	68.4	43.9
LLaVA-v1.6-34B	34B	24.3	24.5	30.4	30.8	42.8	33.6
GPT-4V+SoM	1.76T	25.7	22.1	33.9	45.8	62.4	54.2
LLaVA-v1.6-34B+SoM	34B	12.8	20.4	11.3	9.02	7.52	11.3
Kosmos-2	1.3B	4.05	4.91	18.9	3.01	3.10	3.82
RegionVILA	7B*	22.3	24.6	17.9	36.8	49.6	35.5
SmolRGPT	600M	35.8	18.3	33.9	18.05	20.3	35.5
SpatialRGPT	7B*	35.1	59.0	53.8	51.9	54.9	95.3
SpatialRGPT-Depth	7B*	41.2	65.6	51.9	49.6	57.9	95.3

Examples

The following examples illustrate SmolRGPT’s spatial reasoning on warehouse scenes across four task types: object counting, distance estimation, left/right relations, and multiple-choice decisions based on scene context. Each example shows the question asked, SmolRGPT’s prediction, and the ground-truth answer for the corresponding visual input.

Counting

Question: Given the buffer masks Region 0 Region 1 Region 2 and pallet masks Region 3 Region 4 Region 5 Region 6 Region 7 Region 8, what is the count of pallets in the leftmost buffer area?

SmolRGPT Prediction: The buffer region Region 1 is the leftmost buffer region from this viewpoint. The buffer region Region 1 contains pallets Region 3 Region 4. Therefore, the buffer area Region 1 has two pallets.

Ground Truth: Observing from this perspective, the buffer region Region 1 is the leftmost buffer region. The buffer region Region 1 has pallets Region 3 Region 4. Hence, in buffer area Region 1, there are exactly two pallets.

Distance

Question: Could you measure the distance between the pallet Region 0 and the pallet Region 1?

SmolRGPT Prediction: The pallet Region 0 and the pallet Region 1 are 8.23 meters apart.

Ground Truth: The pallet Region 0 is 8.15 meters away from the pallet Region 1.
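To put the gap between this prediction and the ground truth in numbers, the sketch below pulls the distance out of each sentence and computes the absolute and relative error. The helper `extract_meters` and its regular expression are illustrative assumptions, not part of SmolRGPT or of the challenge tooling.

```python
import re
from typing import Optional


def extract_meters(answer: str) -> Optional[float]:
    """Return the first number followed by 'meter' or 'meters', if any."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*meters?", answer)
    return float(match.group(1)) if match else None


prediction = "The pallet Region 0 and the pallet Region 1 are 8.23 meters apart."
ground_truth = "The pallet Region 0 is 8.15 meters away from the pallet Region 1."

predicted = extract_meters(prediction)   # 8.23
actual = extract_meters(ground_truth)    # 8.15
absolute_error = abs(predicted - actual)
relative_error = absolute_error / actual

print(f"{absolute_error:.2f} m")  # 0.08 m
print(f"{relative_error:.1%}")    # 1.0%
```

In this example the prediction is off by about 8 cm, roughly 1% of the true distance.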

Left Right

Question: Is the pallet Region 0 positioned on the left or right side of the pallet Region 1 relative to the observer?

SmolRGPT Prediction: The pallet Region 0 is to the left of the pallet Region 1.

Ground Truth: The pallet Region 0 is situated on the left of the pallet Region 1.

MCG (Multiple Choice Grounding)

Question: Given the transporters Region 0, Region 2 and the pallets Region 3, Region 4, Region 5, Region 6, Region 8, Region 9, which pallet is best suited for automated picking by an empty transporter?

SmolRGPT Prediction: The transporter Region 2 is not holding any boxes. The pallet Region 8 is the closest to transporter Region 2, so it is the most suitable choice for automated picking.

Ground Truth: At the moment, the transporter Region 2 is not loaded with any boxes. The pallet Region 8 is nearest to the transporter Region 2, making it the best choice to pick up first.
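The decision rule stated in both answers (take the empty transporter, then pick the nearest pallet) can be written out as a short sketch. SmolRGPT produces its answer as generated text from the image and masks, not by running code like this; the floor coordinates below are hypothetical values chosen for illustration and are not taken from the dataset.

```python
import math
from typing import Dict, Tuple

Point = Tuple[float, float]


def nearest_pallet(transporter: Point, pallets: Dict[str, Point]) -> str:
    """Return the name of the pallet with the smallest Euclidean distance."""
    return min(pallets, key=lambda name: math.dist(transporter, pallets[name]))


# Hypothetical floor positions in meters (not dataset values).
empty_transporter = (0.0, 0.0)  # stands in for transporter Region 2
pallets = {
    "Region 3": (6.0, 8.0),    # 10 m away
    "Region 8": (3.0, 4.0),    # 5 m away
    "Region 9": (9.0, 12.0),   # 15 m away
}

print(nearest_pallet(empty_transporter, pallets))  # Region 8
```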

Citation

@article{traore2025smolrgptefficientspatialreasoning,
  title={SmolRGPT: Efficient Spatial Reasoning for Warehouse Environments with 600M Parameters},
  author={Abdarahmane Traore and Éric Hervet and Andy Couturier},
  year={2025},
  eprint={2509.15490},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.15490}
}