AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Abstract
Recent advancements in Vision-Language Models (VLMs) have significantly transformed various industries. In agriculture, their dual-modal capabilities offer promising applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. While several Visual Question Answering (VQA) datasets and benchmarks have been developed to evaluate VLM performance, they often fail to adequately assess the critical reasoning and problem-solving skills required in complex agricultural contexts. To address this gap, we introduce AgriCoT, a VQA dataset that incorporates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,535 carefully curated samples, AgriCoT offers a comprehensive and robust evaluation of reasoning abilities for VLMs, particularly in zero-shot scenarios, by focusing on their capacity to engage in logical reasoning and effective problem-solving. Our evaluations, conducted with 26 representative VLMs, including both proprietary and open-source models, reveal that while some proprietary models excel at answering questions, there is a notable gap in their reasoning capabilities. This underscores the importance of incorporating CoT for more precise and effective assessments.
Case Samples
Key Features
Compared with previous agricultural multimodal benchmarks, AgriCoT introduces four key advantages: Multi-step Reasoning, Multimodal Alignment, Long-form Reasoning and Reasoning Evaluation.
Hierarchical Task System
AgriCoT defines five evaluation dimensions (object detection, quantitative analysis, disease monitoring, spatial understanding, and environmental management), covering 15 task types.
Data Construction
Constructing the AgriCoT benchmark comprises four main steps: collecting samples from data sources, ensuring the quality of the samples, generating a CoT for each QA pair, and conducting a comprehensive evaluation of representative VLMs.
Statistics
Statistics of AgriCoT, from three perspectives: the distribution of question types across steps and dimensions, the number of steps and length distribution of CoTs, and word cloud analysis of both questions and CoTs.
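The first two perspectives (counts per step and dimension, and CoT length) can be computed with a few lines of code. The following is a minimal sketch, not the authors' tooling: the `Sample` record layout, its field names, and the two toy samples are assumptions made for illustration only, and the actual AgriCoT schema may differ.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """Hypothetical record layout; the real AgriCoT schema may differ."""
    question: str
    answer: str
    cot_steps: list[str]  # one entry per reasoning step
    dimension: str        # one of the five evaluation dimensions


def cot_statistics(samples):
    """Summarize step counts, dimension counts, and CoT length in words."""
    step_counts = Counter(len(s.cot_steps) for s in samples)
    per_dimension = Counter(s.dimension for s in samples)
    # CoT length is measured here as whitespace-separated words over all steps.
    cot_lengths = [
        sum(len(step.split()) for step in s.cot_steps) for s in samples
    ]
    return {
        "step_count_distribution": dict(step_counts),
        "samples_per_dimension": dict(per_dimension),
        "mean_cot_length_words": mean(cot_lengths) if cot_lengths else 0.0,
    }


# Toy, made-up samples for illustration; these are not drawn from AgriCoT.
toy_samples = [
    Sample(
        question="How many plants are visible?",
        answer="3",
        cot_steps=["Locate each plant.", "Count them: three."],
        dimension="quantitative analysis",
    ),
    Sample(
        question="Is the leaf diseased?",
        answer="yes",
        cot_steps=[
            "Inspect the leaf surface.",
            "Note the brown lesions.",
            "Conclude it is diseased.",
        ],
        dimension="disease monitoring",
    ),
]

stats = cot_statistics(toy_samples)
print(stats)
```

Measuring length in whitespace-separated words is one simple choice; a tokenizer-based count would work the same way with a different length function.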
Experiment Results
Further Analysis
Performance and analysis of various VLMs from different perspectives, including model size, CoT length, and CoT step count.
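An analysis by CoT step count amounts to grouping per-sample outcomes by the number of reasoning steps and averaging within each group. The sketch below illustrates that grouping under stated assumptions: the `(step_count, is_correct)` input format is hypothetical, and the values in `toy_results` are placeholders, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_step_count(results):
    """Group (step_count, is_correct) pairs and return accuracy per step count.

    The input format is a hypothetical one chosen for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])  # step_count -> [correct, total]
    for step_count, is_correct in results:
        totals[step_count][0] += int(is_correct)
        totals[step_count][1] += 1
    return {
        step_count: correct / total
        for step_count, (correct, total) in sorted(totals.items())
    }


# Placeholder outcomes for illustration only; not measurements from AgriCoT.
toy_results = [(2, True), (2, False), (3, True), (4, False)]
accuracy = accuracy_by_step_count(toy_results)
print(accuracy)
```

The same grouping applies to model size or CoT length by swapping the grouping key, for example by bucketing CoT word counts into ranges.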
BibTeX
@misc{wen2025agricotchainofthoughtbenchmarkevaluating,
title={AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture},
author={Yibin Wen and Qingmei Li and Zi Ye and Jiarui Zhang and Jing Wu and Zurong Mai and Shuohong Lou and Yuhang Chen and Henglian Huang and Xiaoya Fan and Yang Zhang and Lingyuan Zhao and Haohuan Fu and Huang Jianxi and Juepeng Zheng},
year={2025},
eprint={2511.23253},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2511.23253}}