AgroMind

Abstract

Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios have notable limitations, primarily insufficient scene diversity in the datasets and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions (spatial perception, object understanding, scene understanding, and scene reasoning) with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating nine public datasets and one private global parcel dataset, containing 28,482 QA pairs and 20,850 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we employ LMMs for inference, generating responses and performing detailed examinations. We evaluate 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition. Notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work.

Dataset Statistics

Dataset statistics from different aspects

Geographical coverage map of datasets

The AgroMind dataset integrates 10 datasets with comprehensive coverage:

Sensor Types: UAV (7,000 QA pairs), Satellite (12,000 QA pairs), Camera (9,000 QA pairs)
Agricultural Scenes: Anomaly Detection, Crop Monitoring, Pest Identification, Parcel Delineation, Tree Analysis
Geographic Coverage: 106 regions globally, including diverse climate zones and agricultural systems
Temporal Coverage: Multi-seasonal imagery capturing crop phenodynamics

Task Dimensions

Statistical information of AgroMind

Hierarchical task system

AgroMind comprehensively evaluates LMMs through 4 dimensions and 13 task types:

Spatial Perception

Spatial Localization (SL): Identifying distribution patterns
Spatial Relationship (SR): Determining relative positions
Boundary Detection (BD): Predicting coordinates of cultivated areas

Object Understanding

Object Classification (OC): Identifying agricultural entities
Pest/Disease Diagnostics (PDD): Recognizing pest species
Growth Status Recognition (GSR): Assessing plant health

Scene Understanding

Scene Comparison (SC): Identifying images with specific features
Counting (CO): Estimating object quantities
Area Statistics (AS): Calculating coverage rates

Scene Reasoning

Visual Prompt Reasoning (VPR): Inferring measurements
Anomaly Reasoning (AR): Identifying anomalous regions
Climate Type Reasoning (CTR): Determining climate zones
Planning (PL): Predicting outcomes like yield reduction
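
The hierarchy above can be written down as a small lookup table, which is handy when grouping results by dimension. This is a minimal sketch, not code from the AgroMind release: the names `TASKS` and `dimension_of` are hypothetical, and only the dimension names, task names, and abbreviations come from the lists above.

```python
# Task hierarchy as listed above: 4 dimensions, 13 task types.
# TASKS and dimension_of are illustrative names, not part of any AgroMind API.
TASKS = {
    "Spatial Perception": {
        "SL": "Spatial Localization",
        "SR": "Spatial Relationship",
        "BD": "Boundary Detection",
    },
    "Object Understanding": {
        "OC": "Object Classification",
        "PDD": "Pest/Disease Diagnostics",
        "GSR": "Growth Status Recognition",
    },
    "Scene Understanding": {
        "SC": "Scene Comparison",
        "CO": "Counting",
        "AS": "Area Statistics",
    },
    "Scene Reasoning": {
        "VPR": "Visual Prompt Reasoning",
        "AR": "Anomaly Reasoning",
        "CTR": "Climate Type Reasoning",
        "PL": "Planning",
    },
}


def dimension_of(abbreviation: str) -> str:
    """Return the dimension that a task abbreviation (e.g. "CTR") belongs to."""
    for dimension, tasks in TASKS.items():
        if abbreviation in tasks:
            return dimension
    raise KeyError(f"unknown task abbreviation: {abbreviation}")


if __name__ == "__main__":
    n_tasks = sum(len(tasks) for tasks in TASKS.values())
    print(f"{len(TASKS)} dimensions, {n_tasks} task types")
    print("BD belongs to:", dimension_of("BD"))
```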

Benchmark Pipeline

The benchmark curation pipeline

The AgroMind benchmark pipeline consists of four key stages:

Data Pre-processing

Customized processing protocols for heterogeneous data sources, including format conversion, annotation refinement, and multi-level standardization.

Question Generation

Questions are generated in two ways: rule-based generation for consistency and logical rigor, and human-written questions for flexibility and diversity.
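
To illustrate what a rule-based question might look like, the sketch below turns a class label into a multiple-choice item from a fixed template. Everything here is an assumption made for illustration: the source does not specify the question format, so the multiple-choice layout, the function name `make_classification_question`, the field names, and the example labels are all hypothetical.

```python
import random
import string


def make_classification_question(image_id, true_label, label_pool,
                                 n_options=4, seed=0):
    """Build one multiple-choice QA item from a class label.

    Hypothetical format: the real AgroMind templates are not given in the
    source. Distractors are sampled from the other labels in label_pool.
    """
    distractors = [label for label in label_pool if label != true_label]
    if len(distractors) < n_options - 1:
        raise ValueError("not enough distinct labels to build the options")

    rng = random.Random(seed)  # seeded so the same input gives the same item
    options = rng.sample(distractors, n_options - 1) + [true_label]
    rng.shuffle(options)

    letters = string.ascii_uppercase[:n_options]
    return {
        "image": image_id,
        "question": "Which crop type is shown in the image?",
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(true_label)],
    }


if __name__ == "__main__":
    item = make_classification_question(
        "img_001.png", "wheat",
        ["wheat", "maize", "rice", "soybean", "cotton"],
    )
    print(item["question"])
    for letter, label in item["options"].items():
        print(f"  {letter}. {label}")
    print("answer:", item["answer"])
```

Because the answer key is derived from the annotation rather than written by hand, items built this way stay consistent with the underlying labels, which is the property the rule-based approach is described as providing.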

LMMs Inference

Models process preprocessed images and generated questions to produce answers and analysis for agricultural RS tasks.

Quality Control

Systematic comparison of model outputs with expert-annotated standards, identifying incorrect, illogical, or incomplete responses.
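
A comparison of this kind can be sketched as a per-task accuracy computation. The snippet below is an assumption-laden illustration rather than the benchmark's actual scoring code: the record fields (`task`, `prediction`, `answer`), the function name `per_task_accuracy`, and the case-insensitive exact-match rule are all hypothetical, and the example records are made up.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Compute accuracy per task type by exact match.

    Each record is a dict with the hypothetical keys "task", "prediction",
    and "answer". Matching ignores surrounding whitespace and letter case.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        predicted = record["prediction"].strip().upper()
        expected = record["answer"].strip().upper()
        if predicted == expected:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"task": "OC", "prediction": "A", "answer": "A"},
        {"task": "OC", "prediction": "c", "answer": "B"},
        {"task": "CO", "prediction": " b ", "answer": "B"},
    ]
    for task, accuracy in per_task_accuracy(records).items():
        print(f"{task}: {accuracy:.2f}")
```

Exact match only suits answers with a closed form, such as option letters. Tasks with numeric or coordinate outputs (for example Boundary Detection) would need a tolerance-based or overlap-based metric instead.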

Citation

@misc{li2025largemultimodalmodelsunderstand,
      title={Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind}, 
      author={Qingmei Li and Yang Zhang and Zurong Mai and Yuhang Chen and Shuohong Lou and Henglian Huang and Jiarui Zhang and Zhiwei Zhang and Yibin Wen and Weijia Li and Haohuan Fu and Jianxi Huang and Juepeng Zheng},
      year={2025},
      eprint={2505.12207},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.12207}, 
}