AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Yibin Wen 1,*   Qingmei Li 2,*   Zi Ye 1,*   Jiarui Zhang 1,*   Jing Wu 1
Zurong Mai 1   Shuohong Lou 1   Yuhang Chen 1   Henglian Huang 1   Xiaoya Fan 3
Yang Zhang 1   Lingyuan Zhao 4   Haohuan Fu 2,7   Huang Jianxi 5,6   Juepeng Zheng 1,7,†
1 Sun Yat-sen University   2 Tsinghua University   3 Southwest University
4 HuanTian Wisdom Technology Co., Ltd.   5 China Agricultural University
6 Southwest Jiaotong University   7 National Supercomputing Center in Shenzhen

* Equal contribution   † Corresponding author
Paper (arXiv) Hugging Face Dataset

Abstract

Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, these dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a significant gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments.

Case Samples

Key Features


Compared with previous agricultural multimodal benchmarks, AgriCoT introduces four key advantages: Multi-step Reasoning, Multimodal Alignment, Long-form Reasoning and Reasoning Evaluation.

Hierarchical Task System


AgriCoT defines five evaluation dimensions (object detection, quantitative analysis, disease monitoring, spatial understanding, and environmental management) that together cover 15 distinct task types.

Data Construction


The construction of the AgriCoT benchmark primarily comprises four steps: collecting samples from data sources, ensuring the quality of the samples, generating a CoT for each QA pair, and conducting a comprehensive evaluation of representative VLMs.

Statistics


Statistics of AgriCoT, from three perspectives: the distribution of question types across steps and dimensions, the number of steps and length distribution of CoTs, and word cloud analysis of both questions and CoTs.
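
The step-count and length statistics described above could be computed along the following lines. This is a minimal sketch, not the authors' code: the record layout (a "question" string and a "cot" list of step strings) is an assumption made for illustration and may not match the released dataset schema, and the two records are toy examples rather than AgriCoT samples.

```python
from collections import Counter
from statistics import mean

# Toy records with a HYPOTHETICAL schema ("question", "cot");
# these are not real AgriCoT samples or field names.
samples = [
    {
        "question": "How many trees are visible?",
        "cot": [
            "Locate the orchard rows.",
            "Count the crowns in each row.",
            "Sum the counts.",
        ],
    },
    {
        "question": "Is the leaf diseased?",
        "cot": [
            "Inspect the leaf surface.",
            "Compare the spots with known symptoms.",
        ],
    },
]


def cot_statistics(records):
    """Summarize CoT step counts and whitespace-tokenized word lengths."""
    step_counts = [len(r["cot"]) for r in records]
    word_lengths = [
        sum(len(step.split()) for step in r["cot"]) for r in records
    ]
    return {
        "step_histogram": dict(Counter(step_counts)),
        "mean_steps": mean(step_counts),
        "mean_words": mean(word_lengths),
    }


stats = cot_statistics(samples)
```

Word length here is a simple whitespace split; a tokenizer-based count would give different numbers.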

Experiment Results

Further Analysis


Performance and analysis of various VLMs across different perspectives, including model size, CoT length, and CoT step count.
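
One way to break performance down by CoT step count is to bucket per-sample correctness by the number of reference steps. The sketch below assumes a hypothetical result format ("cot_steps", "correct"); the rows are invented placeholders for illustration only and do not reflect any reported AgriCoT result.

```python
from collections import defaultdict

# Placeholder rows with a HYPOTHETICAL format; values are invented
# for illustration and are not results from the paper.
results = [
    {"cot_steps": 2, "correct": True},
    {"cot_steps": 2, "correct": False},
    {"cot_steps": 3, "correct": True},
    {"cot_steps": 4, "correct": False},
]


def accuracy_by_step_count(rows):
    """Return answer accuracy for each CoT step count, sorted by step count."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        totals[row["cot_steps"]] += 1
        hits[row["cot_steps"]] += int(row["correct"])
    return {steps: hits[steps] / totals[steps] for steps in sorted(totals)}


acc = accuracy_by_step_count(results)
```

The same grouping applies to model size or CoT length by swapping the bucketing key.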

BibTeX

          
@misc{wen2025agricotchainofthoughtbenchmarkevaluating,
  title={AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture},
  author={Yibin Wen and Qingmei Li and Zi Ye and Jiarui Zhang and Jing Wu and Zurong Mai and Shuohong Lou and Yuhang Chen and Henglian Huang and Xiaoya Fan and Yang Zhang and Lingyuan Zhao and Haohuan Fu and Huang Jianxi and Juepeng Zheng},
  year={2025},
  eprint={2511.23253},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.23253}
}