AgroMind

Benchmarking Large Multimodal Models in Agricultural Remote Sensing

AgroMind Framework Overview

Abstract

Large Multimodal Models (LMMs) have demonstrated capabilities across various domains, but comprehensive benchmarks for agricultural remote sensing (RS) remain scarce. Existing benchmarks designed for agricultural RS scenarios exhibit notable limitations, primarily insufficient scene diversity in the dataset and oversimplified task design. To bridge this gap, we introduce AgroMind, a comprehensive agricultural remote sensing benchmark covering four task dimensions: spatial perception, object understanding, scene understanding, and scene reasoning, with a total of 13 task types, ranging from crop identification and health monitoring to environmental analysis. We curate a high-quality evaluation set by integrating nine public datasets and one private global parcel dataset, containing 28,482 QA pairs and 20,850 images. The pipeline begins with multi-source data pre-processing, including collection, format standardization, and annotation refinement. We then generate a diverse set of agriculturally relevant questions through the systematic definition of tasks. Finally, we run LMMs for inference, collect their responses, and examine them in detail. We evaluated 20 open-source LMMs and 4 closed-source models on AgroMind. Experiments reveal significant performance gaps, particularly in spatial reasoning and fine-grained recognition. Notably, human performance lags behind several leading LMMs. By establishing a standardized evaluation framework for agricultural RS, AgroMind reveals the limitations of LMMs in domain knowledge and highlights critical challenges for future work.

Dataset Statistics

Dataset statistics from different aspects

Geographical Coverage

Geographical coverage map of datasets

The AgroMind dataset integrates 10 datasets with comprehensive coverage:

  • Sensor Types: UAV (7,000 QA pairs), Satellite (12,000 QA pairs), Camera (9,000 QA pairs)
  • Agricultural Scenes: Anomaly Detection, Crop Monitoring, Pest Identification, Parcel Delineation, Tree Analysis
  • Geographic Coverage: 106 regions globally, including diverse climate zones and agricultural systems
  • Temporal Coverage: Multi-seasonal imagery capturing crop phenodynamics

Task Dimensions

Task Type Distribution

Statistical information of AgroMind

Task System

Hierarchical task system

AgroMind comprehensively evaluates LMMs through 4 dimensions and 13 task types:

Spatial Perception

  • Spatial Localization (SL): Identifying distribution patterns
  • Spatial Relationship (SR): Determining relative positions
  • Boundary Detection (BD): Predicting coordinates of cultivated areas

Object Understanding

  • Object Classification (OC): Identifying agricultural entities
  • Pest/Disease Diagnostics (PDD): Recognizing pest species
  • Growth Status Recognition (GSR): Assessing plant health

Scene Understanding

  • Scene Comparison (SC): Identifying images with specific features
  • Counting (CO): Estimating object quantities
  • Area Statistics (AS): Calculating coverage rates

Scene Reasoning

  • Visual Prompt Reasoning (VPR): Inferring measurements
  • Anomaly Reasoning (AR): Identifying anomalous regions
  • Climate Type Reasoning (CTR): Determining climate zones
  • Planning (PL): Predicting outcomes like yield reduction
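The hierarchy above can be written down as a small lookup table, which is handy when grouping results by dimension. This is only an illustrative sketch: the names `TASKS` and `dimension_of` are hypothetical and not part of any released AgroMind code; the dimension names, task names, and abbreviations are taken from the list above.

```python
# Task hierarchy as listed above: dimension -> {abbreviation: task name}.
# TASKS and dimension_of are hypothetical names used for illustration only.
TASKS = {
    "Spatial Perception": {
        "SL": "Spatial Localization",
        "SR": "Spatial Relationship",
        "BD": "Boundary Detection",
    },
    "Object Understanding": {
        "OC": "Object Classification",
        "PDD": "Pest/Disease Diagnostics",
        "GSR": "Growth Status Recognition",
    },
    "Scene Understanding": {
        "SC": "Scene Comparison",
        "CO": "Counting",
        "AS": "Area Statistics",
    },
    "Scene Reasoning": {
        "VPR": "Visual Prompt Reasoning",
        "AR": "Anomaly Reasoning",
        "CTR": "Climate Type Reasoning",
        "PL": "Planning",
    },
}


def dimension_of(abbrev):
    """Return the dimension that contains the given task abbreviation."""
    for dimension, tasks in TASKS.items():
        if abbrev in tasks:
            return dimension
    raise KeyError(abbrev)


# Sanity check against the stated totals: 4 dimensions, 13 task types.
num_dimensions = len(TASKS)
num_tasks = sum(len(tasks) for tasks in TASKS.values())
```

For example, `dimension_of("CTR")` returns `"Scene Reasoning"`, so per-task scores can be rolled up into per-dimension scores.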

Benchmark Pipeline

The benchmark curation pipeline

The AgroMind benchmark pipeline comprises four key stages:

Data Pre-processing

Customized processing protocols for heterogeneous data sources, including format conversion, annotation refinement, and multi-level standardization.

Question Generation

Two generation approaches: rule-based questions for consistency and logical rigor, and human-written questions for flexibility and diversity.

LMMs Inference

Models process preprocessed images and generated questions to produce answers and analysis for agricultural RS tasks.

Quality Control

Systematic comparison of model outputs with expert-annotated standards, identifying incorrect, illogical, or incomplete responses.
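As a rough illustration of comparing model outputs with reference answers, the sketch below computes per-task accuracy by exact match after light normalization. It rests on several assumptions: the page does not specify the scoring metric, so exact match is a stand-in; the record fields `task`, `prediction`, and `reference` are hypothetical; and the sample records are made up for the example, not drawn from the dataset.

```python
from collections import defaultdict


def normalize(answer):
    """Lowercase and trim an answer so trivial formatting differences are ignored."""
    return answer.strip().lower()


def accuracy_by_task(records):
    """Return {task: fraction of predictions that exactly match the reference}.

    Each record is a dict with the (assumed) keys "task", "prediction"
    and "reference".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        if normalize(record["prediction"]) == normalize(record["reference"]):
            correct[record["task"]] += 1
    return {task: correct[task] / total[task] for task in total}


# Made-up records for illustration only.
records = [
    {"task": "OC", "prediction": "A", "reference": "a"},
    {"task": "OC", "prediction": "B", "reference": "C"},
    {"task": "CO", "prediction": " 12 ", "reference": "12"},
]
scores = accuracy_by_task(records)
```

Exact match suits multiple-choice style answers; tasks with numeric or coordinate outputs, such as counting or boundary detection, would likely need a tolerance-based or overlap-based comparison instead.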

Access the Dataset

The complete AgroMind dataset is available on HuggingFace:

  • 20,850 high-quality agricultural images
  • 28,482 diverse QA pairs
  • Multi-sensor data (UAV, Satellite, Camera)
  • Comprehensive task annotations

Download on HuggingFace

Citation

@misc{li2025largemultimodalmodelsunderstand,
      title={Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind}, 
      author={Qingmei Li and Yang Zhang and Zurong Mai and Yuhang Chen and Shuohong Lou and Henglian Huang and Jiarui Zhang and Zhiwei Zhang and Yibin Wen and Weijia Li and Haohuan Fu and Jianxi Huang and Juepeng Zheng},
      year={2025},
      eprint={2505.12207},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.12207},
}