TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs


Samples of TCC-Bench. TCC-Bench covers eight domains: Astronomy, Music, Custom, Architecture, Transportation, Diet, Clothing, and Artifact.

Introduction

We present TCC-Bench, the Traditional Chinese Culture understanding Benchmark: a bilingual (Chinese and English) Visual Question Answering (VQA) benchmark designed to evaluate how well Multimodal Large Language Models (MLLMs) understand traditional Chinese culture. We define eight knowledge domains that cover key aspects of traditional Chinese culture, and we curate the images in TCC-Bench from museum artifacts, depictions of everyday life, comics, and other culturally significant materials, ensuring both visual diversity and cultural authenticity. In addition, we introduce a semi-automated question generation method that reduces manual effort while maintaining high data quality.

Overview

We construct a multiple-choice VQA benchmark to evaluate the ability of current MLLMs to understand traditional Chinese culture. The questions are divided into eight domains: Astronomy, Music, Custom, Architecture, Transportation, Diet, Clothing, and Artifact. To ensure high quality, we recruit six annotators for image collection and question generation; all hold at least a bachelor's degree and have extensive experience living in a Chinese cultural context. We design a semi-automated question generation pipeline to reduce manual labor costs. To facilitate analysis of model understanding of traditional Chinese culture across languages, the dataset provides bilingual questions, options, and explanations, with a record format like the sketch below.
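
To make the data format concrete, here is a hypothetical sketch of what a single bilingual TCC-Bench record could look like. The field names and values are illustrative assumptions, not the benchmark's actual schema.

    # Hypothetical sketch of one TCC-Bench record; the actual field names
    # and file layout are not specified on this page.
    sample_record = {
        "image": "images/artifact_0042.jpg",  # curated museum photo, comic, etc.
        "domain": "Artifact",                 # one of the eight domains
        "question_zh": "...",                 # question in Chinese
        "question_en": "...",                 # the same question in English
        "options_zh": ["...", "...", "...", "..."],  # four options, Chinese
        "options_en": ["...", "...", "...", "..."],  # four options, English
        "answer": "B",                        # exactly one correct option
        "explanation_zh": "...",              # cultural background, Chinese
        "explanation_en": "...",              # cultural background, English
    }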

Statistics

Our TCC-Bench comprises 675 images and 860 high-quality questions. Each question is accompanied by four carefully designed options, exactly one of which is correct, for a total of 3,440 options. On average, an option is 3.2 Chinese characters or 2.3 English words long, and an explanation is 20.9 Chinese characters or 15.9 English words long, offering comprehensive insight into the traditional Chinese cultural knowledge underpinning each question. The distribution of questions across the eight domains is relatively balanced.
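
These length statistics are straightforward to reproduce. Below is a minimal Python sketch, assuming a list of records shaped like the hypothetical example above, with Chinese lengths counted in characters and English lengths in whitespace-separated words.

    from statistics import mean

    # Minimal sketch for reproducing the reported length statistics,
    # assuming records shaped like the hypothetical example above.
    def average_lengths(records):
        return {
            "option_zh_chars": mean(len(o) for r in records for o in r["options_zh"]),      # reported: 3.2
            "option_en_words": mean(len(o.split()) for r in records for o in r["options_en"]),  # reported: 2.3
            "explanation_zh_chars": mean(len(r["explanation_zh"]) for r in records),         # reported: 20.9
            "explanation_en_words": mean(len(r["explanation_en"].split()) for r in records), # reported: 15.9
        }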

Main Result

To comprehensively evaluate MLLMs' understanding of traditional Chinese culture, we select a range of MLLMs, both open-source and closed-source. For open-source MLLMs, we choose models of different parameter sizes: LLaVA-v1.6-7B, DeepSeek-VL-7B, Qwen2-VL-7B/72B, CogVLM2-19B, GLM-4V-9B, and InternVL2.5-8B/78B, which we deploy and evaluate on A800 GPUs. For closed-source MLLMs, we choose GPT-4o, Claude-3.7 Sonnet, and Gemini-2.0 Flash, evaluated through their official APIs. Prompts are formatted as multiple-choice questions. We use accuracy as the evaluation metric and apply a rule-based method to extract answers from model outputs.
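
The extraction rules themselves are not reproduced on this page, so the following is only a hedged sketch of what a rule-based answer extractor and the accuracy computation could look like. The regular expressions and the extract_choice / accuracy helpers are illustrative assumptions, not the benchmark's actual code.

    import re

    # Hypothetical rule-based answer extraction; the benchmark's actual
    # rules are not reproduced here. The idea is to recover a single
    # option letter (A-D) from free-form model output, trying explicit
    # patterns before falling back to looser ones.
    def extract_choice(response):
        patterns = [
            r"answer\s*(?:is|:)?\s*\(?([ABCD])\)?",  # e.g. "The answer is (B)"
            r"^\s*\(?([ABCD])\)?\s*[.):]",           # e.g. "B. ..." at line start
            r"\b([ABCD])\b",                         # last resort; may over-match
        ]
        for pat in patterns:
            m = re.search(pat, response, flags=re.IGNORECASE | re.MULTILINE)
            if m:
                return m.group(1).upper()
        return None  # no option recovered

    def accuracy(responses, gold_answers):
        # Fraction of responses whose extracted letter matches the gold answer;
        # responses with no recoverable letter are scored as incorrect.
        correct = sum(extract_choice(r) == g for r, g in zip(responses, gold_answers))
        return correct / len(gold_answers)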

Error Analysis

To analyze the underlying causes of the performance gap observed on TCC-Bench and to provide insights for improving MLLMs, we extract the erroneous responses of GPT-4o under the chain-of-thought (CoT) setting for manual annotation. Based on these annotations, we categorize the errors into four types: Visual Perceptual Error, Lack of Cultural Knowledge, Reasoning Error, and Reject to Answer.

BibTeX

@misc{xu2025tccbenchbenchmarkingtraditionalchinese,
  title={TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs},
  author={Pengju Xu and Yan Wang and Shuyuan Zhang and Xuan Zhou and Xin Li and Yue Yuan and Fengzhao Li and Shunyuan Zhou and Xingyu Wang and Yi Zhang and Haiying Zhao},
  year={2025},
  eprint={2505.11275},
  archivePrefix={arXiv},
  primaryClass={cs.MM},
  url={https://arxiv.org/abs/2505.11275},
}