[CVPR 2026]
Deepfake detection models increasingly generate natural language explanations to justify their predictions. Yet, while classification accuracy has improved, the reasoning behind these predictions is often ungrounded, hallucinated, or loosely connected to the actual visual evidence. Existing evaluation protocols focus primarily on detection accuracy and largely overlook reasoning fidelity, visual grounding, and interpretability.
We introduce DeepfakeJudge, a unified framework for scalable reasoning supervision and evaluation in deepfake detection. The framework integrates an out-of-distribution detection benchmark, a densely human-annotated reasoning dataset, and a bootstrapped generator-evaluator training pipeline to build a multimodal reasoning judge. The resulting models evaluate explanation quality directly from images and support both pointwise and pairwise assessment aligned with human judgment.
DeepfakeJudge establishes reasoning fidelity as a measurable and scalable dimension of trustworthy deepfake detection, showing that effective reasoning evaluators can be trained without requiring explicit ground-truth rationales for every instance.
Current deepfake detectors can classify images, but their reasoning is often unreliable: hallucinated artifacts, ungrounded claims, and generic explanations remain widespread. Standard metrics such as BLEU and BERTScore do not capture these deficiencies. DeepfakeJudge addresses this gap by training compact vision-language models that assess reasoning quality directly from images.
- A challenging benchmark combining real images, text-to-image generations, and editing-based forgeries from modern pipelines such as Gemini, SeedDream, Flux-Kontext-Max, and Qwen-Edit, designed to evaluate both detection accuracy and reasoning generalization.
- A densely annotated dataset linking textual explanations to localized visual evidence, covering artifact categories, bounding boxes, referring expressions, and structured gold reasoning rationales.
- A scalable supervision framework that produces graded reasoning traces across five quality levels, iteratively refines misaligned samples using evaluator feedback, and paraphrases accepted outputs to introduce stylistic diversity.
- Compact 3B and 7B vision-language models trained as reasoning evaluators, supporting both pointwise scoring (1 to 5) and pairwise comparison. The 7B model achieves 96.2% pairwise accuracy and a 0.95 Pearson correlation, surpassing models over 30x larger.
DeepfakeJudge introduces a multi-level dataset ecosystem: DFJ-Detect (2,000 images for OOD detection), DFJ-Reason (924 images with human-annotated reasoning), DFJ-Meta (41K+ bootstrapped training samples for judge training), and DFJ-Meta-Human (155 human-validated evaluation samples).
To scale training beyond the human-annotated subset, we introduce a generator-evaluator loop. A generator produces reasoning traces at five intended quality levels (1 to 5). An evaluator checks alignment between predicted and intended ratings. Misaligned samples are refined using evaluator feedback until consistency is achieved. Accepted samples are then paraphrased to prevent stylistic memorization.
Example of graded reasoning traces for a single fake image:

| Rating | Reasoning |
|---|---|
| 5 (Gold) | The train in the mid-right appears unnaturally bright and overly smooth, lacking realistic surface textures that would normally show wear and detail. The entire upper-right train section is distorted, with shapes fused together and unclear details, making it hard to distinguish individual parts. The merging of elements creates an unnatural and confusing structure that does not align with typical train forms. Edges between the train and the background are uneven and lack depth, further weakening the image's realism. These texture and structural anomalies clearly indicate that the scene is artificially generated. |
| 4 | The train located on the right side is strikingly bright and shows a lack of detailed texture, giving it an overly smooth and synthetic appearance. In the upper-right section, the shapes seem to merge, complicating the identification of distinct train parts. The boundaries between the train and its background are somewhat uneven, further diminishing its natural look. These texture irregularities and the blending of forms strongly suggest that the image is AI-generated. |
| 3 | The train exhibits an unusually bright and smooth texture, with certain sections blending into one another in the upper right area. The distinct features of the train's structure are somewhat obscured, and the edges around it lack sharpness, indicating possible alterations. Overall, the image raises suspicion and does not appear entirely authentic, despite some visible details. |
| 2 | The train looks a little off because some parts seem merged and unclear, especially on the upper right. The colors also look a bit unnatural. However, the rest of the scene looks okay. It might be edited or just a low-quality photo. There are no strong clear signs, so I'm not completely sure. |
| 1 | The train looks normal and the tracks appear fine, with no visible issues. The colors and textures seem consistent with a real photo. The background and surrounding objects also look natural. Nothing stands out as fake or edited here, so this image is definitely real. |
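The refinement loop described above can be sketched as follows. Here `generate`, `evaluate`, and `paraphrase` are deterministic stand-ins for calls to the actual generator and evaluator models; all function names and the feedback string are illustrative, not part of the released pipeline.

```python
# Sketch of the generator-evaluator bootstrapping loop (illustrative only).
# In the real pipeline, generate/evaluate/paraphrase would invoke VLMs;
# here they are deterministic stubs so the control flow is runnable.

def generate(image, target_rating, feedback=None):
    """Produce a reasoning trace aimed at `target_rating` (stub)."""
    note = f" (revised: {feedback})" if feedback else ""
    return f"reasoning@{target_rating}{note}"

def evaluate(image, trace):
    """Rate a reasoning trace 1-5 and return feedback (stub)."""
    rating = int(trace.split("@")[1][0])
    return rating, "tighten grounding"

def paraphrase(trace):
    """Reword an accepted trace for stylistic diversity (stub)."""
    return trace.upper()

def bootstrap_sample(image, target_rating, max_rounds=3):
    feedback = None
    for _ in range(max_rounds):
        trace = generate(image, target_rating, feedback)
        predicted, feedback = evaluate(image, trace)
        if predicted == target_rating:   # aligned: accept and paraphrase
            return paraphrase(trace), predicted
    return None, None                    # discard persistently misaligned sample

# One graded trace per intended quality level 1..5
dataset = [bootstrap_sample("img_001.jpg", r) for r in range(1, 6)]
```

The key design point is that a sample only enters training once the evaluator's predicted rating matches the generator's intended rating, so label noise from the generator is filtered out before paraphrasing.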
Real and synthetic images are curated for an OOD detection benchmark. Fake images span both T2I generation and image-editing pipelines. A subset is densely annotated for reasoning supervision, linking textual explanations to spatial visual evidence.
A generator model produces reasoning samples across five quality levels. An evaluator model assigns ratings and provides feedback. Misaligned samples are iteratively refined. Accepted samples are paraphrased to introduce stylistic diversity while preserving semantic structure.
Compact VLMs (3B and 7B) are fine-tuned using LoRA with a negative log-likelihood objective. In the pointwise setting, the model predicts a reasoning quality score (1 to 5) with a brief justification. In the pairwise setting, it selects the better-grounded reasoning between two candidates.
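The two judging modes differ only in how the request is framed. The exact prompt templates used in training are not published, so the strings below are a plausible sketch rather than the models' actual input format:

```python
# Illustrative prompt shapes for the two judging modes (templates assumed).

def pointwise_prompt(reasoning: str) -> str:
    # Ask the judge for a 1-5 quality score plus a short justification.
    return (
        "You are a deepfake-reasoning judge. Given the image and the "
        "explanation below, rate its quality from 1 to 5 and briefly "
        f"justify the score.\n\nExplanation: {reasoning}"
    )

def pairwise_prompt(reasoning_a: str, reasoning_b: str) -> str:
    # Ask the judge which of two explanations is better grounded.
    return (
        "You are a deepfake-reasoning judge. Given the image, decide "
        "which explanation (A or B) is better grounded in the visual "
        f"evidence.\n\nA: {reasoning_a}\n\nB: {reasoning_b}"
    )
```

In both modes the image itself is part of the model input, which is what lets the judge check claims against visual evidence rather than scoring text in isolation.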
Evaluation on DeepfakeJudge-Detect (2,000 images):
| Model | Real F1 | Fake F1 | Overall Acc. |
|---|---|---|---|
| Gemini-2.5-Flash | 73.7 | 50.0 | 65.5 |
| GPT-4o-mini | 70.2 | 35.8 | 59.3 |
| Qwen-3-VL-235B | 78.6 | 68.4 | 74.5 |
| Qwen-3-VL-235B-Thinking | 76.6 | 79.8 | 63.7 |
| SIDA-13B | 57.0 | 34.5 | 48.1 |
Evaluation on DeepfakeJudge-Reason:
| Model | BLEU-3 | BERTScore | DFJ-3B Score |
|---|---|---|---|
| Gemini-2.5-Flash | 0.02 | 0.60 | 3.17 |
| GPT-4o-mini | 0.01 | 0.35 | 2.83 |
| Qwen-3-VL-30B | 0.03 | 0.62 | 3.31 |
| Qwen-3-VL-235B | 0.01 | 0.60 | 3.59 |
| SIDA | 0.01 | 0.58 | 2.32 |
Pointwise evaluation on DFJ-Meta:
| Model | RMSE ↓ | Pearson ↑ |
|---|---|---|
| Gemini-2.5 | 1.09 | 0.83 |
| GPT-4o-mini | 0.78 | 0.87 |
| Qwen-3-VL-235B | 1.10 | 0.82 |
| DeepfakeJudge-3B | 0.69 | 0.92 |
| DeepfakeJudge-7B | 0.61 | 0.93 |
Pointwise evaluation on DFJ-Meta-Human:
| Model | RMSE ↓ | Pearson ↑ |
|---|---|---|
| GPT-4o-mini | 0.81 | 0.86 |
| Qwen-235B-Thinking | 0.95 | 0.86 |
| DeepfakeJudge-7B | 0.50 | 0.95 |
Pairwise accuracy (% agreement with human preferences):
| Model | DFJ-Meta | DFJ-Meta-Human |
|---|---|---|
| Gemini-2.5 | 91.7 | 94.2 |
| GPT-4o-mini | 90.3 | 89.8 |
| Qwen-235B | 93.2 | 99.4 |
| DeepfakeJudge-3B | 94.4 | 96.6 |
| DeepfakeJudge-7B | 96.2 | 98.9 |
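Pairwise accuracy here is simply the fraction of comparisons where the judge picks the same candidate as the human annotator; a minimal sketch (data layout is illustrative):

```python
def pairwise_accuracy(judgments):
    """judgments: list of (model_choice, human_choice) pairs, each 'A' or 'B'."""
    agree = sum(1 for model, human in judgments if model == human)
    return 100.0 * agree / len(judgments)

# Toy example: 5 of 6 comparisons agree with the human preference
results = [("A", "A"), ("B", "B"), ("A", "B"),
           ("B", "B"), ("A", "A"), ("B", "B")]
```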
All models are fine-tuned from Qwen2.5-VL-Instruct using LoRA and hosted on Hugging Face:
| Model | Type | Base | Link |
|---|---|---|---|
| DeepfakeJudge-3B-Pointwise | Pointwise | Qwen2.5-VL-3B | 🤗 Download |
| DeepfakeJudge-3B-Pairwise | Pairwise | Qwen2.5-VL-3B | 🤗 Download |
| DeepfakeJudge-7B-Pointwise | Pointwise | Qwen2.5-VL-7B | 🤗 Download |
| DeepfakeJudge-7B-Pairwise | Pairwise | Qwen2.5-VL-7B | 🤗 Download |
If you find DeepfakeJudge useful in your research, please consider citing:
📢 arXiv paper coming soon. Citation will be updated upon release.