Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

[CVPR 2026]

Mohamed bin Zayed University of Artificial Intelligence  ·  Monash University

📄 Abstract

Deepfake detection models increasingly generate natural language explanations to justify their predictions. Yet, while classification accuracy has improved, the reasoning behind these predictions is often ungrounded, hallucinated, or loosely connected to the actual visual evidence. Existing evaluation protocols focus primarily on detection accuracy and largely overlook reasoning fidelity, visual grounding, and interpretability.

We introduce DeepfakeJudge, a unified framework for scalable reasoning supervision and evaluation in deepfake detection. The framework integrates an out-of-distribution detection benchmark, a densely human-annotated reasoning dataset, and a bootstrapped generator-evaluator training pipeline to build a multimodal reasoning judge. The resulting models evaluate explanation quality directly from images and support both pointwise and pairwise assessment aligned with human judgment.

DeepfakeJudge establishes reasoning fidelity as a measurable and scalable dimension of trustworthy deepfake detection, showing that effective reasoning evaluators can be trained without requiring explicit ground-truth rationales for every instance.

🔍 Why DeepfakeJudge?

Current deepfake detectors can classify images, but their reasoning is often unreliable: hallucinated artifacts, ungrounded claims, and generic explanations remain widespread. Standard metrics such as BLEU and BERTScore do not capture these deficiencies. DeepfakeJudge addresses this gap by training compact vision-language models that assess reasoning quality directly from images.

Comparison of deepfake reasoning across different VLMs · DeepfakeJudge scores vs. traditional metrics
Figure 1: (Left) VLMs produce reasoning of widely varying quality for the same deepfake: some correctly identify artifacts, while others hallucinate non-existent manipulations. (Right) Traditional metrics (BLEU, BERTScore) fail to capture reasoning fidelity, whereas DeepfakeJudge scores correlate well with human assessments.

💡 Key Contributions

🎯

Out-of-Distribution Deepfake Benchmark

A challenging benchmark combining real images, text-to-image generations, and editing-based forgeries from modern pipelines such as Gemini, SeedDream, Flux-Kontext-Max, and Qwen-Edit, designed to evaluate both detection accuracy and reasoning generalization.

✍️

Human-Annotated Visual Reasoning Dataset

A densely annotated dataset linking textual explanations to localized visual evidence, covering artifact categories, bounding boxes, referring expressions, and structured gold reasoning rationales.

🔄

Bootstrapped Generator-Evaluator Pipeline

A scalable supervision framework that produces graded reasoning traces across five quality levels, iteratively refines misaligned samples using evaluator feedback, and paraphrases accepted outputs to introduce stylistic diversity.

⚖️

MLLM-as-a-Judge

Compact 3B and 7B vision-language models trained as reasoning evaluators, supporting both pointwise scoring (1 to 5) and pairwise comparison. The 7B model achieves 96.2% pairwise accuracy and 0.95 Pearson correlation, surpassing models over 30x larger.

📊 Dataset Construction

DeepfakeJudge introduces a multi-level dataset ecosystem: DFJ-Detect (2,000 images for OOD detection), DFJ-Reason (924 images with human-annotated reasoning), DFJ-Meta (41K+ bootstrapped training samples for judge training), and DFJ-Meta-Human (155 human-validated evaluation samples).
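To make the DFJ-Reason annotation structure concrete, here is a minimal sketch of what one record might look like. The field names and values are illustrative assumptions, not the released schema:

```python
from dataclasses import dataclass, field

@dataclass
class ArtifactAnnotation:
    """One localized artifact: category, bounding box, and referring expression."""
    category: str             # e.g. "texture anomaly" (illustrative category name)
    bbox_xyxy: tuple          # (x1, y1, x2, y2) in pixel coordinates
    referring_expression: str # e.g. "the train in the mid-right"

@dataclass
class ReasonRecord:
    """Hypothetical DFJ-Reason record: image, label, evidence, gold rationale."""
    image_path: str
    label: str                                  # "real", "fake", or "edited"
    artifacts: list = field(default_factory=list)
    gold_rationale: str = ""

# Example record assembled from the paper's running train example:
record = ReasonRecord(
    image_path="images/000123.jpg",             # hypothetical path
    label="fake",
    artifacts=[ArtifactAnnotation(
        category="texture anomaly",
        bbox_xyxy=(412, 88, 640, 301),
        referring_expression="the train in the mid-right",
    )],
    gold_rationale="The train in the mid-right appears unnaturally bright...",
)
```

The key design point is that each textual claim in the rationale can be traced back to a spatial region, which is what makes grounded evaluation possible.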

Data generation pipeline overview
Figure 3: Overview of the dataset construction pipeline. Real and synthetic images are curated, annotated by human experts, and expanded through bootstrapped reasoning generation to create graded supervision at scale.

🎯 DFJ-Detect

  • 1,000 real images (OpenImages-V7)
  • 500 T2I fakes (Gemini, SeedDream)
  • 500 edited fakes (Gemini, Flux-Kontext-Max, Qwen-Edit)

🧠 DFJ-Reason

  • 924 images with dense annotations
  • Artifact categories + bounding boxes
  • Referring expressions + gold rationales
  • Cohen's κ = 0.71 inter-annotator agreement

📷 Example Images from the Dataset

Real image example
REAL OpenImages-V7
T2I fake example
FAKE Gemini T2I
Edited image example
EDITED Flux-Kontext
Qwen edited example
EDITED Qwen-Edit

🔄 Bootstrapped Reasoning Supervision

To scale training beyond the human-annotated subset, we introduce a generator-evaluator loop. A generator produces reasoning traces at five intended quality levels (1 to 5). An evaluator checks alignment between predicted and intended ratings. Misaligned samples are refined using evaluator feedback until consistency is achieved. Accepted samples are then paraphrased to prevent stylistic memorization.

Bootstrapping generator-evaluator loop
Figure 4: The bootstrapped generator-evaluator supervision framework. Reasoning traces are generated at controlled quality levels, evaluated for alignment, iteratively refined, and diversified through paraphrasing.
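The generate-evaluate-refine loop above can be sketched as follows. The `generate`, `evaluate`, and `paraphrase` callables stand in for model calls; the stubs and the refinement cap are illustrative assumptions:

```python
MAX_REFINEMENTS = 3  # assumed cap on refinement rounds

def bootstrap_sample(image, target_rating, generate, evaluate, paraphrase):
    """Generate a reasoning trace at an intended quality level, refine it with
    evaluator feedback until the predicted rating matches, then paraphrase."""
    trace = generate(image, target_rating, feedback=None)
    for _ in range(MAX_REFINEMENTS):
        predicted, feedback = evaluate(image, trace)
        if predicted == target_rating:      # aligned: accept the sample
            return paraphrase(trace)        # diversify style before training
        trace = generate(image, target_rating, feedback=feedback)
    return None                             # discard persistently misaligned samples

# Deterministic stubs, for illustration only:
def fake_generate(image, rating, feedback):
    return f"reasoning@{rating}"

def fake_evaluate(image, trace):
    rating = int(trace.split("@")[1])
    return rating, "be more specific"

def fake_paraphrase(trace):
    return trace.upper()

out = bootstrap_sample("img.jpg", 5, fake_generate, fake_evaluate, fake_paraphrase)
```

Only traces whose evaluated quality matches the intended level survive, which is what turns unlabeled images into graded supervision at scale.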

📝 Example: Graded Reasoning from Bootstrapping

Example fake image used for bootstrapping
(Image label: Fake)
  • Rating 5 (Gold): The train in the mid-right appears unnaturally bright and overly smooth, lacking realistic surface textures that would normally show wear and detail. The entire upper-right train section is distorted, with shapes fused together and unclear details, making it hard to distinguish individual parts. The merging of elements creates an unnatural and confusing structure that does not align with typical train forms. Edges between the train and the background are uneven and lack depth, further weakening the image's realism. These texture and structural anomalies clearly indicate that the scene is artificially generated.
  • Rating 4: The train located on the right side is strikingly bright and shows a lack of detailed texture, giving it an overly smooth and synthetic appearance. In the upper-right section, the shapes seem to merge, complicating the identification of distinct train parts. The boundaries between the train and its background are somewhat uneven, further diminishing its natural look. These texture irregularities and the blending of forms strongly suggest that the image is AI-generated.
  • Rating 3: The train exhibits an unusually bright and smooth texture, with certain sections blending into one another in the upper right area. The distinct features of the train's structure are somewhat obscured, and the edges around it lack sharpness, indicating possible alterations. Overall, the image raises suspicion and does not appear entirely authentic, despite some visible details.
  • Rating 2: The train looks a little off because some parts seem merged and unclear, especially on the upper right. The colors also look a bit unnatural. However, the rest of the scene looks okay. It might be edited or just a low-quality photo. There are no strong clear signs, so I'm not completely sure.
  • Rating 1: The train looks normal and the tracks appear fine, with no visible issues. The colors and textures seem consistent with a real photo. The background and surrounding objects also look natural. Nothing stands out as fake or edited here, so this image is definitely real.
Table: A fake image and the graded reasoning traces produced by the bootstrapping process. Reasoning quality is systematically controlled from level 5 (fully accurate and grounded) to level 1 (hallucinated and misleading).

🔬 Methodology

Dataset Construction

Real and synthetic images are curated for an OOD detection benchmark. Fake images span both T2I generation and image-editing pipelines. A subset is densely annotated for reasoning supervision, linking textual explanations to spatial visual evidence.

Bootstrapped Reasoning Supervision

A generator model produces reasoning samples across five quality levels. An evaluator model assigns ratings and provides feedback. Misaligned samples are iteratively refined. Accepted samples are paraphrased to introduce stylistic diversity while preserving semantic structure.

DeepfakeJudge Training

Compact VLMs (3B and 7B) are fine-tuned using LoRA with a negative log-likelihood objective. In the pointwise setting, the model predicts a reasoning quality score (1 to 5) with a brief justification. In the pairwise setting, it selects the better-grounded reasoning between two candidates.
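The two evaluation modes differ mainly in how the judge is prompted. A hypothetical prompt builder sketching both settings (the templates are illustrative, not the released prompts):

```python
def pointwise_prompt(reasoning: str) -> str:
    """Build a pointwise judge prompt: score one explanation from 1 to 5."""
    return (
        "Given the image, rate the following deepfake reasoning from 1 "
        "(hallucinated) to 5 (fully grounded in visual evidence), then "
        "briefly justify your score.\n"
        f"Reasoning: {reasoning}"
    )

def pairwise_prompt(reasoning_a: str, reasoning_b: str) -> str:
    """Build a pairwise judge prompt: pick the better-grounded explanation."""
    return (
        "Given the image, decide which reasoning (A or B) is better grounded "
        "in the visual evidence.\n"
        f"A: {reasoning_a}\n"
        f"B: {reasoning_b}"
    )
```

In both settings the judge conditions on the image itself, which is what lets it penalize claims about artifacts that are not actually present.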

📈 Benchmark Results

🎯 Deepfake Detection (OOD)

Evaluation on DeepfakeJudge-Detect (2,000 images):

| Model | Real F1 | Fake F1 | Overall Acc. |
|---|---|---|---|
| Gemini-2.5-Flash | 73.7 | 50.0 | 65.5 |
| GPT-4o-mini | 70.2 | 35.8 | 59.3 |
| Qwen-3-VL-235B | 78.6 | 68.4 | 74.5 |
| Qwen-3-VL-235B-Thinking | 76.6 | 79.8 | 63.7 |
| SIDA-13B | 57.0 | 34.5 | 48.1 |

🧠 Reasoning Evaluation

Evaluation on DeepfakeJudge-Reason:

| Model | BLEU-3 | BERTScore | DFJ-3B Score |
|---|---|---|---|
| Gemini-2.5-Flash | 0.02 | 0.60 | 3.17 |
| GPT-4o-mini | 0.01 | 0.35 | 2.83 |
| Qwen-3-VL-30B | 0.03 | 0.62 | 3.31 |
| Qwen-3-VL-235B | 0.01 | 0.60 | 3.59 |
| SIDA | 0.01 | 0.58 | 2.32 |

📌 Pointwise Evaluation

DFJ-Meta

| Model | RMSE ↓ | Pearson ↑ |
|---|---|---|
| Gemini-2.5 | 1.09 | 0.83 |
| GPT-4o-mini | 0.78 | 0.87 |
| Qwen-3-VL-235B | 1.10 | 0.82 |
| DeepfakeJudge-3B | 0.69 | 0.92 |
| DeepfakeJudge-7B | 0.61 | 0.93 |

DFJ-Meta-Human

| Model | RMSE ↓ | Pearson ↑ |
|---|---|---|
| GPT-4o-mini | 0.81 | 0.86 |
| Qwen-235B-Thinking | 0.95 | 0.86 |
| DeepfakeJudge-7B | 0.50 | 0.95 |
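The pointwise metrics above compare judge scores against reference ratings; a minimal sketch of how RMSE and Pearson correlation are computed from two score lists (the example scores are made up):

```python
import math

def rmse(pred, ref):
    """Root mean squared error between predicted and reference scores."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(pred))

def pearson(pred, ref):
    """Pearson correlation coefficient between two score lists."""
    n = len(pred)
    mp, mr = sum(pred) / n, sum(ref) / n
    cov = sum((p - mp) * (r - mr) for p, r in zip(pred, ref))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_r = sum((r - mr) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)

# Illustrative judge scores vs. reference ratings on a 1-5 scale:
preds = [5, 4, 2, 1, 3]
refs  = [5, 4, 3, 1, 3]
```

A low RMSE means the judge's absolute scores are close to the references; a high Pearson means it ranks explanations in the same order even where absolute values drift.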

⚖️ Pairwise Evaluation

Pairwise accuracy (% agreement with human preferences):

| Model | DFJ-Meta | DFJ-Meta-Human |
|---|---|---|
| Gemini-2.5 | 91.7 | 94.2 |
| GPT-4o-mini | 90.3 | 89.8 |
| Qwen-235B | 93.2 | 99.4 |
| DeepfakeJudge-3B | 94.4 | 96.6 |
| DeepfakeJudge-7B | 96.2 | 98.9 |
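Pairwise accuracy as reported above is simply the percentage of comparison pairs where the judge's choice matches the reference preference; sketched with made-up preference lists:

```python
def pairwise_accuracy(judge_prefs, ref_prefs):
    """Percentage of pairs where the judge picks the same explanation
    ("A" or "B") as the reference preference."""
    agree = sum(j == r for j, r in zip(judge_prefs, ref_prefs))
    return 100.0 * agree / len(ref_prefs)

# The judge agrees on 3 of 4 illustrative pairs:
score = pairwise_accuracy(["A", "B", "A", "A"], ["A", "B", "B", "A"])  # 75.0
```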

🦁 Model Zoo

All models are fine-tuned from Qwen2.5-VL-Instruct using LoRA and hosted on Hugging Face:

| Model | Type | Base | Link |
|---|---|---|---|
| DeepfakeJudge-3B-Pointwise | Pointwise | Qwen2.5-VL-3B | 🤗 Download |
| DeepfakeJudge-3B-Pairwise | Pairwise | Qwen2.5-VL-3B | 🤗 Download |
| DeepfakeJudge-7B-Pointwise | Pointwise | Qwen2.5-VL-7B | 🤗 Download |
| DeepfakeJudge-7B-Pairwise | Pairwise | Qwen2.5-VL-7B | 🤗 Download |

📝 Citation

If you find DeepfakeJudge useful in your research, please consider citing:

📢 arXiv paper coming soon. Citation will be updated upon release.