Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical reasoning. Existing privacy benchmarks mainly evaluate privacy perception, such as detecting phone numbers, names, faces, or other isolated attributes, but they do not fully capture the more critical risk of privacy reasoning: a model's ability to infer and link distributed information into individual profiles.
To address this gap, we introduce MultiPriv, a benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. MultiPriv introduces the Privacy Perception and Reasoning (PPR) framework and constructs a bilingual multimodal dataset with synthetic individual profiles, where direct identifiers such as faces and names are linked to sensitive attributes such as health status, home address, trajectory information, and financial records.
MultiPriv contains 36 privacy attributes across 9 tasks, covering attribute detection, privacy information extraction, privacy region localization, cross-image re-identification, chained reasoning, and cross-modal association. We evaluate over 50 open-source and commercial VLMs and show that reasoning, rather than perception alone, is a key driver of individual-level privacy risk.
Faces, fingerprints, names, IDs, and other direct identifiers.
Medical reports, health status, bank cards, and financial account information.
Addresses, tickets, travel routes, and movement or activity traces.
Vehicle ownership, personal relations, and social identity attributes.
Figure 3. MultiPriv decomposes privacy risk into nine subtasks spanning privacy perception and individual-level privacy reasoning.
MultiPriv evaluates privacy risks along two complementary dimensions: attribute-level privacy perception and individual-level privacy reasoning.
Detects explicit identifiers such as faces, names, fingerprints, and ID numbers.
Identifies attributes that can implicitly reveal identity, such as address, health, or trajectory.
Extracts sensitive textual or visual content from privacy-related images and documents.
Localizes private regions such as faces, license plates, or bank card numbers with bounding boxes.
Determines whether different images belong to the same individual using shared or correlated cues.
Infers one private attribute of an individual directly from another linked attribute.
Performs multi-step linkage across privacy cues, such as face to location to identity.
Associates known private attributes with the correct individual across multiple samples.
Links textual and visual privacy cues to infer sensitive attributes about a specific individual.
Extracting discrete sensitive attributes from unstructured multimodal inputs.
Linking fragmented attributes to reconstruct an identifiable individual profile.
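
The localization task above asks models for bounding boxes, and the perception table reports a "Bounding IoU" column. As a minimal sketch of how such a score is commonly computed, the function below implements standard intersection-over-union for two axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box format and the name `box_iou` are assumptions for illustration; the benchmark's exact matching protocol is not described here.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gold: Box) -> float:
    """Return intersection-over-union of two axis-aligned boxes (0.0 to 1.0)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(pred[0], gold[0])
    iy_min = max(pred[1], gold[1])
    ix_max = min(pred[2], gold[2])
    iy_max = min(pred[3], gold[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gold = max(0.0, gold[2] - gold[0]) * max(0.0, gold[3] - gold[1])
    union = area_pred + area_gold - inter

    # Degenerate boxes (zero area on both sides) score 0.0 by convention here.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1).
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Under this reading, a model that returns no usable box would score 0 on the localization column even if its detection and extraction scores are high.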
The Central Risk of Modern VLMs
VLMs do more than detect private attributes: strong reasoning lets them link fragmented multimodal cues and reconstruct identifiable profiles.
Many open-source and commercial models show substantial privacy reasoning ability. In the controlled benchmark, widely used VLMs can link sensitive attributes across languages and modalities.
Reasoning-focused models achieve high scores on chained reasoning, re-identification, and cross-modal association, showing that identity linkage is a central source of privacy risk.
Privacy behavior varies across English and Chinese. A model may appear less risky in one language while exposing stronger identity-linkage behavior in another.
Smaller reasoning-capable models can match or exceed larger models in privacy reasoning, indicating that architecture, training data, and alignment matter more than parameter count alone.
Explicit step-by-step or structured reasoning prompts can increase privacy leakage by guiding models to connect fragmented multimodal evidence into identity profiles.
Strongly aligned models may refuse many risky queries, which reduces realized exposure. Answered-only accuracy, computed over the queries a model does not refuse, shows that such models can still have strong privacy reasoning capability when they do answer.
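
The distinction between realized exposure and underlying capability can be made concrete with a small sketch. It assumes that a refusal counts as "not correct" in overall accuracy and that answered-only accuracy divides by the number of non-refused responses. The `Response` class and the function names are hypothetical, not taken from the benchmark's code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    refused: bool   # True if the model declined to answer
    correct: bool   # Whether the answer matched the label (ignored if refused)


def refusal_rate(responses: List[Response]) -> float:
    """Fraction of queries the model refused."""
    if not responses:
        raise ValueError("no responses to score")
    return sum(r.refused for r in responses) / len(responses)


def overall_accuracy(responses: List[Response]) -> float:
    """Correct answers over all queries; refusals count as not correct (assumption)."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(1 for r in responses if not r.refused and r.correct)
    return hits / len(responses)


def answered_only_accuracy(responses: List[Response]) -> Optional[float]:
    """Correct answers over non-refused queries; None if everything was refused."""
    answered = [r for r in responses if not r.refused]
    if not answered:
        return None
    return sum(r.correct for r in answered) / len(answered)


if __name__ == "__main__":
    # Toy data: 6 refusals, 3 correct answers, 1 wrong answer.
    toy = (
        [Response(refused=True, correct=False)] * 6
        + [Response(refused=False, correct=True)] * 3
        + [Response(refused=False, correct=False)]
    )
    print(refusal_rate(toy), overall_accuracy(toy), answered_only_accuracy(toy))
```

In the toy data the model looks low-risk on overall accuracy (0.3) because it refuses often, yet it is right on 3 of the 4 queries it does answer, which is the pattern the finding describes.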
Higher scores indicate greater potential privacy risk under misuse.
Perception Risk measures attribute-level exposure. Reasoning Risk measures individual-level linkage and reconstruction.
Table 3. VLM-Induced Privacy Perception Risk

| Model | Overall Risk ↑ | English Direct ↑ | English Indirect ↑ | English IEA ↑ | English Mean ↑ | Chinese Direct ↑ | Chinese Indirect ↑ | Chinese IEA ↑ | Chinese Mean ↑ | Bounding IoU ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLMs | ||||||||||
| QvQ-72B-Preview | 0.454 | 0.645 | 0.499 | 0.561 | 0.568 | 0.599 | 0.560 | 0.561 | 0.573 | 0.22 |
| Llama-4-Maverick | 0.447 | 0.493 | 0.470 | 0.484 | 0.482 | 0.378 | 0.458 | 0.485 | 0.440 | 0.42 |
| Qwen3-VL-8B | 0.381 | 0.340 | 0.341 | 0.457 | 0.379 | 0.435 | 0.372 | 0.498 | 0.435 | 0.33 |
| Qwen3-VL-32B | 0.380 | 0.377 | 0.383 | 0.532 | 0.431 | 0.409 | 0.497 | 0.530 | 0.479 | 0.23 |
| Llama-4-Scout | 0.377 | 0.442 | 0.509 | 0.480 | 0.477 | 0.374 | 0.453 | 0.507 | 0.445 | 0.21 |
| InternVL3.5-38B | 0.372 | 0.413 | 0.312 | 0.454 | 0.393 | 0.413 | 0.395 | 0.519 | 0.442 | 0.28 |
| InternVL3.5-8B | 0.358 | 0.390 | 0.373 | 0.447 | 0.403 | 0.401 | 0.360 | 0.501 | 0.421 | 0.25 |
| InternVL3.5-14B | 0.356 | 0.394 | 0.340 | 0.442 | 0.392 | 0.411 | 0.367 | 0.497 | 0.425 | 0.25 |
| Qwen3-VL-30B-A3B | 0.353 | 0.333 | 0.300 | 0.339 | 0.324 | 0.355 | 0.393 | 0.407 | 0.385 | 0.35 |
| Qwen3-VL-4B | 0.345 | 0.453 | 0.297 | 0.473 | 0.408 | 0.503 | 0.356 | 0.486 | 0.448 | 0.18 |
| Phi-4-multimodal | 0.332 | 0.404 | 0.369 | 0.384 | 0.386 | 0.413 | 0.380 | 0.433 | 0.409 | 0.20 |
| MiniCPM-V-4.5 | 0.291 | 0.410 | 0.368 | 0.474 | 0.417 | 0.428 | 0.409 | 0.531 | 0.456 | 0.00 |
| Llama-3.2-11B-Vision | 0.274 | 0.414 | 0.292 | 0.313 | 0.340 | 0.349 | 0.319 | 0.298 | 0.322 | 0.16 |
| Llava-v1.6-Vicuna-13B | 0.265 | 0.413 | 0.335 | 0.241 | 0.330 | 0.416 | 0.336 | 0.223 | 0.325 | 0.14 |
| GLM-4.1V-9B | 0.252 | 0.342 | 0.120 | 0.009 | 0.157 | 0.294 | 0.329 | 0.008 | 0.210 | 0.39 |
| Phi-3.5-vision | 0.238 | 0.359 | 0.329 | 0.297 | 0.328 | 0.497 | 0.188 | 0.145 | 0.277 | 0.11 |
| deepseek-vl2-small | 0.222 | 0.365 | 0.205 | 0.350 | 0.307 | 0.311 | 0.305 | 0.457 | 0.358 | 0.00 |
| deepseek-vl2-tiny | 0.222 | 0.365 | 0.205 | 0.350 | 0.307 | 0.311 | 0.305 | 0.457 | 0.358 | 0.00 |
| instructblip-flan-t5-xl | 0.216 | 0.795 | 0.327 | 0.052 | 0.391 | 0.534 | 0.208 | 0.032 | 0.258 | 0.00 |
| Llava-v1.6-Vicuna-7B | 0.195 | 0.362 | 0.307 | 0.260 | 0.310 | 0.419 | 0.267 | 0.136 | 0.274 | 0.00 |
| Llava-v1.6-Mistral-7B | 0.194 | 0.405 | 0.247 | 0.259 | 0.304 | 0.378 | 0.265 | 0.195 | 0.279 | 0.00 |
| instructblip-flan-t5-xxl | 0.192 | 0.833 | 0.283 | 0.013 | 0.376 | 0.418 | 0.151 | 0.027 | 0.199 | 0.00 |
| instructblip-vicuna-7b | 0.139 | 0.519 | 0.137 | 0.010 | 0.222 | 0.330 | 0.244 | 0.012 | 0.195 | 0.00 |
| instructblip-vicuna-13b | 0.113 | 0.161 | 0.146 | 0.020 | 0.109 | 0.539 | 0.124 | 0.026 | 0.230 | 0.00 |
| Phi-3-vision-128k | 0.073 | 0.049 | 0.256 | 0.002 | 0.102 | 0.000 | 0.260 | 0.068 | 0.109 | 0.07 |
| Commercial VLMs | ||||||||||
| Gemini-2.5-Pro | 0.579 | 0.775 | 0.614 | 0.431 | 0.607 | 0.788 | 0.602 | 0.443 | 0.611 | 0.52 |
| GPT-5 | 0.505 | 0.428 | 0.607 | 0.484 | 0.506 | 0.606 | 0.650 | 0.542 | 0.599 | 0.41 |
| Gemini-2.5-Flash | 0.466 | 0.467 | 0.492 | 0.454 | 0.471 | 0.473 | 0.465 | 0.527 | 0.488 | 0.44 |
| Claude-Sonnet-4 | 0.401 | 0.425 | 0.555 | 0.478 | 0.486 | 0.419 | 0.607 | 0.465 | 0.497 | 0.22 |
| GPT-4o | 0.342 | 0.432 | 0.418 | 0.420 | 0.423 | 0.399 | 0.370 | 0.440 | 0.403 | 0.20 |
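
The rows of the perception table are numerically consistent with a simple aggregation: each language mean is the average of the Direct, Indirect, and IEA scores, and Overall Risk is the average of the English mean, the Chinese mean, and Bounding IoU. This is an observation from the published numbers, not a formula stated on this page, so the sketch below should be read as a consistency check. The function name `perception_overall` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def perception_overall(
    en_tasks: Sequence[float],
    zh_tasks: Sequence[float],
    bounding_iou: float,
) -> float:
    """Average of English mean, Chinese mean, and Bounding IoU.

    Assumed aggregation, inferred from the table rows rather than
    documented by the authors.
    """
    en_mean = mean(en_tasks)  # Direct, Indirect, IEA (English)
    zh_mean = mean(zh_tasks)  # Direct, Indirect, IEA (Chinese)
    return mean([en_mean, zh_mean, bounding_iou])


if __name__ == "__main__":
    # QvQ-72B-Preview row from the table (published Overall Risk: 0.454).
    qvq = perception_overall(
        en_tasks=(0.645, 0.499, 0.561),
        zh_tasks=(0.599, 0.560, 0.561),
        bounding_iou=0.22,
    )
    print(round(qvq, 3))
```

One consequence of this reading is that a model with a Bounding IoU of 0.00 loses up to a third of its achievable overall score, which may explain why some models with moderate detection scores rank low overall.
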

| Model | Overall Risk ↑ | English Single ↑ | English Re-ID ↑ | English Chained ↑ | English Cross ↑ | English Mean ↑ | Chinese Single ↑ | Chinese Re-ID ↑ | Chinese Chained ↑ | Chinese Cross ↑ | Chinese Mean ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-Source VLMs | |||||||||||
| Qwen3-VL-32B-Thinking | 0.874 | 0.841 | 0.889 | 0.878 | 0.963 | 0.893 | 0.885 | 0.859 | 0.731 | 0.944 | 0.855 |
| Qwen3-VL-4B-Thinking | 0.871 | 0.872 | 0.843 | 0.858 | 0.963 | 0.884 | 0.859 | 0.859 | 0.771 | 0.944 | 0.858 |
| Qwen3-VL-8B-Thinking | 0.868 | 0.874 | 0.900 | 0.728 | 0.963 | 0.866 | 0.871 | 0.867 | 0.798 | 0.944 | 0.870 |
| InternVL3.5-8B | 0.858 | 0.845 | 0.878 | 0.856 | 0.926 | 0.876 | 0.783 | 0.817 | 0.787 | 0.972 | 0.840 |
| InternVL3.5-38B | 0.857 | 0.818 | 0.889 | 0.858 | 0.963 | 0.882 | 0.738 | 0.850 | 0.765 | 0.972 | 0.831 |
| Llama-4-Maverick | 0.842 | 0.724 | 0.878 | 0.867 | 0.926 | 0.849 | 0.766 | 0.859 | 0.769 | 0.944 | 0.835 |
| InternVL3.5-14B | 0.834 | 0.809 | 0.866 | 0.803 | 0.963 | 0.860 | 0.768 | 0.850 | 0.698 | 0.917 | 0.808 |
| Llama-4-Scout | 0.813 | 0.734 | 0.889 | 0.756 | 0.963 | 0.836 | 0.762 | 0.859 | 0.590 | 0.944 | 0.789 |
| MiniCPM-V-4.5 | 0.807 | 0.748 | 0.866 | 0.808 | 0.815 | 0.809 | 0.822 | 0.834 | 0.646 | 0.917 | 0.805 |
| GLM-4.1V-9B-Thinking | 0.795 | 0.802 | 0.866 | 0.622 | 0.926 | 0.804 | 0.800 | 0.867 | 0.558 | 0.917 | 0.786 |
| Phi-4-multimodal | 0.794 | 0.790 | 0.832 | 0.808 | 0.963 | 0.848 | 0.728 | 0.759 | 0.635 | 0.833 | 0.739 |
| QvQ-72B-Preview | 0.632 | 0.667 | 0.787 | 0.642 | 0.852 | 0.737 | 0.473 | 0.583 | 0.185 | 0.861 | 0.526 |
| deepseek-vl2 | 0.388 | 0.447 | 0.662 | 0.289 | 0.556 | 0.489 | 0.240 | 0.650 | 0.119 | 0.139 | 0.287 |
| Commercial VLMs | |||||||||||
| Gemini-2.5-Flash | 0.825 | 0.773 | 0.889 | 0.783 | 0.963 | 0.852 | 0.785 | 0.842 | 0.673 | 0.889 | 0.797 |
| Claude-Sonnet-4 | 0.807 | 0.586 | 0.866 | 0.819 | 0.963 | 0.809 | 0.768 | 0.825 | 0.683 | 0.944 | 0.805 |
| GPT-4o | 0.555 | 0.443 | 0.695 | 0.403 | 0.741 | 0.571 | 0.445 | 0.600 | 0.331 | 0.778 | 0.539 |
| GPT-5 | 0.459 | 0.293 | 0.559 | 0.328 | 0.889 | 0.517 | 0.348 | 0.467 | 0.175 | 0.611 | 0.400 |
| Gemini-2.5-Pro | 0.456 | 0.454 | 0.821 | 0.317 | 0.593 | 0.546 | 0.356 | 0.667 | 0.108 | 0.333 | 0.366 |
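
The reasoning table can be summarized per model in the same spirit. The published rows are consistent with each language mean being the average of the Single, Re-ID, Chained, and Cross scores, and with Overall Risk being the average of the two language means. The sketch below applies that assumed aggregation and also reports the English minus Chinese gap, which makes the cross-language variation easy to inspect. The function name and dictionary keys are hypothetical.

```python
from statistics import mean
from typing import Dict, Sequence


def reasoning_summary(
    en_tasks: Sequence[float],
    zh_tasks: Sequence[float],
) -> Dict[str, float]:
    """Summarize one model's reasoning-risk row.

    Assumed aggregation (inferred from the table, not stated by the authors):
    language mean = average of Single, Re-ID, Chained, Cross;
    overall = average of the two language means.
    """
    en_mean = mean(en_tasks)
    zh_mean = mean(zh_tasks)
    return {
        "en_mean": en_mean,
        "zh_mean": zh_mean,
        "overall": (en_mean + zh_mean) / 2,
        "en_minus_zh": en_mean - zh_mean,  # positive: riskier in English
    }


if __name__ == "__main__":
    # QvQ-72B-Preview row (published: EN mean 0.737, ZH mean 0.526, overall 0.632).
    summary = reasoning_summary(
        en_tasks=(0.667, 0.787, 0.642, 0.852),
        zh_tasks=(0.473, 0.583, 0.185, 0.861),
    )
    for key, value in summary.items():
        print(f"{key}: {value:.3f}")
```

Recomputed values match the published ones to within rounding of the three-decimal inputs. For this row the gap is roughly 0.21 in favor of English, driven largely by the Chinese chained-reasoning score.
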
Full refusal rates, additional task-level statistics, and model-specific examples are reported in the paper.