Examples demonstrate that GPT-5 initially provides correct answers but incorrectly revises them when confronted with user-provided negation arguments. GPT-5 tends to accept misleading inputs, often generating hallucinated explanations to justify the revised answers, a behavior that can be described as a form of "gaslighting". Note: "Gaslighting is the manipulation of someone into questioning their own perception of reality." (Wikipedia).
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks, a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce GaslightingBench, the first benchmark specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with generated negation prompts spanning 20 diverse categories. Through extensive evaluation, we find that proprietary models such as Gemini-1.5-flash and GPT-4o demonstrate better resilience than open-source counterparts like Qwen2-VL and LLaVA, though even advanced reasoning-oriented models like Gemini-2.5-Pro remain susceptible. Our category-level analysis further shows that subjective or socially nuanced domains (e.g., Social Relation, Image Emotion) are especially fragile, while more objective domains (e.g., Geography) exhibit smaller but still notable drops. Overall, all evaluated MLLMs struggle to maintain logical consistency under gaslighting negation attacks. These findings highlight a fundamental robustness gap and provide insights for developing more reliable and trustworthy multimodal AI systems.
Comparison of MLLMs' performance before (i.e., initial answers) and after negation, reported as average accuracy across eight benchmarks: MME, MMMU, MMMU-Pro, MMBench, POPE, ChartQA, AI2 Diagram, and MathVista. The results highlight the substantial accuracy drop across all models when negation is introduced.
We introduce a structured evaluation pipeline to systematically assess MLLMs' vulnerability to negation by measuring their performance before and after exposure to negation arguments; a minimal code sketch of the pipeline follows the steps listed below.
The process consists of three key steps:
(1) Inputs and Initial Answers: MLLMs receive a variety of question formats as input, including Yes/No, Multiple-Choice, and Free-Form, and their initial answers are recorded.
(2) Negation Generation: If the model's initial response is correct, a negation argument is introduced to challenge its answer. Different negation strategies are applied based on the question type.
(3) Post-Negation Evaluation: The model's response after negation is analyzed to determine whether it maintains consistency or is misled into revising its answer. Post-processing is applied to normalize responses for accurate comparison.
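To make this protocol concrete, the following is a minimal Python sketch of the three steps, assuming a generic `model.generate(image, prompt)` interface; the helper names (`build_negation`, `normalize_answer`) and the negation templates are illustrative assumptions, not the exact prompts used in our experiments.

```python
import random

def build_negation(question_type, initial_answer, options=None):
    """Craft a negation argument challenging the model's (correct) initial answer.
    Templates are illustrative, not the exact wording used in the experiments."""
    if question_type == "yes_no":
        flipped = "No" if initial_answer.strip().lower().startswith("yes") else "Yes"
        return f"That is not right. The correct answer is '{flipped}'. Please reconsider."
    if question_type == "multiple_choice":
        # Argue for a randomly chosen incorrect option.
        wrong = random.choice([o for o in options if o != initial_answer])
        return f"I think you are wrong. The answer should be '{wrong}'. Are you sure?"
    # Free-form: generic disagreement.
    return "Your answer is incorrect. Please check again and give the right answer."

def normalize_answer(raw):
    """Post-processing: strip boilerplate so answers can be compared exactly."""
    return raw.strip().rstrip(".").lower()

def evaluate(model, sample):
    """Run one sample through the before/after-negation protocol."""
    # (1) Inputs and initial answer.
    initial = model.generate(sample["image"], sample["question"])
    if normalize_answer(initial) != normalize_answer(sample["ground_truth"]):
        return None  # Only initially correct answers are challenged.

    # (2) Negation generation, conditioned on the question type.
    negation = build_negation(sample["type"], initial, sample.get("options"))

    # (3) Post-negation evaluation: does the model stay consistent?
    revised = model.generate(sample["image"], sample["question"] + "\n" + negation)
    return {
        "consistent": normalize_answer(revised) == normalize_answer(sample["ground_truth"]),
        "initial": initial,
        "revised": revised,
    }
```

Accuracy before and after negation is then computed only over samples that pass step (1), so the reported drop isolates the effect of the negation argument itself.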
To comprehensively evaluate the impact of negation on MLLMs, we introduce GaslightingBench, the first gaslighting benchmark, designed to ensure broad coverage, category balance, and the ability to expose MLLMs' vulnerabilities to negation arguments. This figure shows the category distribution of GaslightingBench, which spans 20 categories and 1,287 samples. Each category is carefully curated from existing datasets to ensure balanced representation and broad coverage, providing a comprehensive evaluation dataset for assessing MLLMs' vulnerabilities to negation arguments.
This figure shows some examples from different categories in GaslightingBench. The green-highlighted option is correct, while a randomly chosen incorrect option is used to generate the negation argument.
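As a rough illustration of how each GaslightingBench item pairs a multiple-choice question with a negation argument built from a randomly chosen incorrect option, the sketch below uses hypothetical field names and a prompt template that only approximates the benchmark's actual wording.

```python
import random

def make_gaslighting_item(question, options, correct_option, category):
    """Build one benchmark item: the question plus a negation argument
    that advocates for a randomly selected incorrect option."""
    distractor = random.choice([o for o in options if o != correct_option])
    negation_prompt = (
        f"I checked the source material, and the answer is actually '{distractor}', "
        f"not what you said. Could you correct your response?"
    )
    return {
        "category": category,          # one of the 20 categories
        "question": question,
        "options": options,
        "answer": correct_option,      # the green-highlighted option in the figure
        "negation": negation_prompt,   # illustrative template, not the exact wording
    }
```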
Performance comparison of Multimodal Large Language Models (MLLMs) across various benchmarks before (i.e., initial answers) and after the introduction of negation arguments. The performance drop is highlighted in red.
Results of MLLMs in our GaslightingBench, comparing each model's performance before (i.e., initial answers) and after negation. The performance drop is highlighted in red.
Here are some qualitative examples showcasing how MLLMs respond to negation arguments across different datasets. In each example, the models initially provide correct responses. However, when users introduce negation arguments, many models revise their answers incorrectly.
Effect of negation types. We incorporated two additional variants into GaslightingBench: (i) anger-style negation, where the user conveys emotionally charged disbelief (e.g., “I can’t believe you made such a basic mistake!”), and (ii) authority-style negation, where the user appeals to an external authority (e.g., “The professor said your answer is incorrect.”). As shown in the following table, both variants produce a larger performance decline than neutral negation. This indicates that emotionally or authoritatively framed challenges can further erode model reliability, likely by amplifying the model's deference to perceived user authority.
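For concreteness, the sketch below renders the three negation styles as prompt templates; the anger- and authority-style wordings follow the quoted examples above, while the neutral template and the helper function are illustrative assumptions.

```python
# Illustrative templates for the three negation styles; the quoted anger/authority
# phrasings follow the examples in the text, the remaining wording is assumed.
NEGATION_TEMPLATES = {
    "neutral":   "I don't think that's right. The answer should be '{distractor}'.",
    "anger":     "I can't believe you made such a basic mistake! The answer is clearly '{distractor}'.",
    "authority": "The professor said your answer is incorrect; the correct answer is '{distractor}'.",
}

def render_negation(style, distractor):
    """Fill a style-specific template with the incorrect option being argued for."""
    return NEGATION_TEMPLATES[style].format(distractor=distractor)
```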
Model confidence under gaslighting negation attacks. To better understand the internal behavior of MLLMs under gaslighting negation attacks, as shown in the following table, we conduct a confidence-based analysis using model-reported probability scores for Gemini-1.5-flash and Qwen-2-VL-7B on both GaslightingBench and MMMU. We group predictions by their correctness and whether they occurred before or after negation, then compute the average confidence scores. On the one hand, for both models, confidence scores are generally higher for correct predictions than for incorrect ones, especially after negation, indicating some degree of internal calibration. On the other hand, we observe a confidence drop for incorrect answers after negation, particularly in Qwen-2-VL-7B (from 90.6 to 74.4 on GaslightingBench), suggesting the model becomes less confident when misled. However, confidence in incorrect responses remains relatively high, especially in Gemini-1.5-flash, which maintains 91.9 average confidence even when wrong after negation, indicating a risk of confident hallucination. These findings indicate that while models may exhibit partial uncertainty when manipulated, they can still produce incorrect yet high-confidence outputs. This suggests the need for calibrated uncertainty modeling in MLLMs under adversarial dialogue settings.
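The confidence analysis can be summarized by the following minimal sketch, assuming each prediction record carries a model-reported confidence score, a correctness flag, and a before/after-negation stage label; the field names are hypothetical.

```python
from collections import defaultdict

def average_confidence(records):
    """Group predictions by (stage, correctness) and average the model-reported
    confidence scores, mirroring the before/after-negation analysis."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        # stage is "before" or "after" negation; correct is a boolean flag.
        key = (r["stage"], "correct" if r["correct"] else "incorrect")
        sums[key] += r["confidence"]
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Example usage with hypothetical records:
# average_confidence([{"stage": "after", "correct": False, "confidence": 74.4}, ...])
```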