MultiPriv

40 Synthetic Profiles · 1,119 Multimodal Images · 36 Privacy Attributes · 9 PPR Tasks · 2 Languages (English, Chinese) · 50+ Evaluated VLMs

Privacy categories: Biometric Identity Document Medical Health Financial Account Location Trajectory Property Identity Social Attributes

Abstract

Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical reasoning. Existing privacy benchmarks mainly evaluate privacy perception, such as detecting phone numbers, names, faces, or other isolated attributes, but they do not fully capture the more critical risk of privacy reasoning: a model's ability to infer and link distributed information into individual profiles.

To address this gap, we introduce MultiPriv, a benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. MultiPriv introduces the Privacy Perception and Reasoning (PPR) framework and constructs a bilingual multimodal dataset with synthetic individual profiles, where direct identifiers such as faces and names are linked to sensitive attributes such as health status, home address, trajectory information, and financial records.

MultiPriv contains 36 privacy attributes across 9 tasks, covering attribute detection, privacy information extraction, privacy region localization, cross-image re-identification, chained reasoning, and cross-modal association. We evaluate over 50 open-source and commercial VLMs and show that reasoning, rather than perception alone, is a key driver of individual-level privacy risk.

⚠️ Note: MultiPriv is intended solely for academic research on VLM privacy, safety alignment, and privacy-preserving model evaluation. The benchmark uses synthetic individual profiles to study identity-linking risks without exposing real personal identifiers.

Privacy Perception and Reasoning

Figure 1. MultiPriv studies how VLMs move from privacy perception, which extracts discrete sensitive attributes from unstructured data, to privacy reasoning, which links fragmented cues into identifiable individual profiles.

MultiPriv Privacy Taxonomy and Definition

Figure 2. Privacy Taxonomy and Definition.

Benchmark Overview

📊 Dataset Composition

Component	Count	Description
Synthetic Profiles	40	Fictional individuals with linked multimodal privacy attributes; designed for identity-level privacy reasoning
Images	1,119	Multimodal privacy samples from public sources and synthesized instances; covers perception and reasoning tasks
VQA Pairs	7,414	Manually designed bilingual VQA queries and answers; English and Chinese evaluation
Task Coverage	9	4 perception subtasks · 5 reasoning subtasks · 36 privacy attributes

🌐 Privacy Taxonomy

Biometric and Identity Document

Faces, fingerprints, names, IDs, and other direct identifiers.

Medical and Financial Privacy

Medical reports, health status, bank cards, and financial account information.

Location and Trajectory

Addresses, tickets, travel routes, and movement or activity traces.

Property and Social Attributes

Vehicle ownership, personal relations, and social identity attributes.

Figure 3. MultiPriv decomposes privacy risk into nine subtasks spanning privacy perception and individual-level privacy reasoning.

Task Framework

MultiPriv evaluates privacy risks along two complementary dimensions: attribute-level privacy perception and individual-level privacy reasoning.

1. Direct Identifier Recognition: Detects explicit identifiers such as faces, names, fingerprints, and ID numbers.
2. Indirect Identifier Recognition: Identifies attributes that can implicitly reveal identity, such as address, health, or trajectory.
3. Privacy Information Extraction: Extracts sensitive textual or visual content from privacy-related images and documents.
4. Privacy Region Localization: Localizes private regions such as faces, license plates, or bank card numbers with bounding boxes.
5. Single-Step Cross-Validation: Determines whether different images belong to the same individual using shared or correlated cues.
6. Single-Step Reasoning: Infers one private attribute of an individual directly from another linked attribute.
7. Chained Reasoning: Performs multi-step linkage across privacy cues, such as face to location to identity.
8. Re-Identification and Linkability: Associates known private attributes with the correct individual across multiple samples.
9. Cross-Modal Association: Links textual and visual privacy cues to infer sensitive attributes about a specific individual.
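
Privacy Region Localization is reported as Bounding IoU in Table 3. The snippet below is a minimal sketch of intersection-over-union for two axis-aligned boxes. The (x_min, y_min, x_max, y_max) box format and the function name box_iou are assumptions made for illustration; the benchmark's annotation format and box-matching protocol are not described on this page.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max).

    The box format is an assumption for this sketch, not the benchmark's spec.
    """
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes for one private region.
predicted = (10, 10, 60, 40)
reference = (20, 10, 70, 40)
iou = box_iou(predicted, reference)
print(f"IoU = {iou:.3f}")
```

A prediction that misses the region entirely scores 0, which is one possible reading of the 0.00 entries in the Bounding IoU column.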

⚖️ From Attribute Exposure to Individual-Level Privacy Risk

Privacy Perception: extracting discrete sensitive attributes from unstructured multimodal inputs.

→

Privacy Reasoning: linking fragmented attributes to reconstruct an identifiable individual profile.
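
To make the step from perception to reasoning concrete, here is a toy sketch of linkage over already-extracted attributes. It assumes perception has produced one attribute dictionary per image and merges images that share any attribute value, using a small union-find. All image names, attribute keys, and values are fictional placeholders, and this is not the benchmark's evaluation procedure or how a VLM reasons internally.

```python
from collections import defaultdict

# Fictional per-image attributes, standing in for perception output.
observations = {
    "img_01": {"name": "Alex Doe", "face_id": "face_A"},
    "img_02": {"face_id": "face_A", "license_plate": "TEST-001"},
    "img_03": {"license_plate": "TEST-001", "home_address": "1 Example Street"},
    "img_04": {"name": "Sam Roe", "health_status": "example condition"},
}


def link_profiles(observations):
    """Merge images that share an (attribute, value) pair into one profile."""
    parent = {img: img for img in observations}

    def find(x):
        # Follow parent pointers to the set representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (attribute, value) -> first image where it appeared
    for img, attrs in observations.items():
        for pair in attrs.items():
            if pair in first_seen:
                parent[find(img)] = find(first_seen[pair])
            else:
                first_seen[pair] = img

    profiles = defaultdict(dict)
    for img, attrs in observations.items():
        profiles[find(img)].update(attrs)
    return list(profiles.values())


profiles = link_profiles(observations)
for profile in profiles:
    print(profile)
```

In this toy data no single image contains both a name and a home address, yet the linked profile does: the face connects img_01 to img_02, and the license plate connects img_02 to img_03.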

Key Findings

The Central Risk of Modern VLMs

Attribute-Level Perception → Individual-Level Reasoning

VLMs do more than detect private attributes: strong reasoning lets them link fragmented multimodal cues and reconstruct identifiable profiles.

Finding 1

Current VLMs Enable Individual-Level Privacy Risks

Many open-source and commercial models show substantial privacy reasoning ability. In the controlled benchmark, widely used VLMs can link sensitive attributes across languages and modalities.

Finding 2

Reasoning Drives Privacy Leakage

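Reasoning-focused models achieve high scores on chained reasoning, re-identification, and cross-modal association, showing that identity linkage is a central source of privacy risk. For example, the three Qwen3-VL-Thinking models have the highest Overall Risk in Table 4 (0.868 to 0.874).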

Finding 3

Single-Language Evaluation Underestimates Risk

Privacy behavior varies across English and Chinese. A model may appear less risky in one language while exposing stronger identity-linkage behavior in another.

Finding 4

Scale Alone Does Not Determine Privacy Risk

Smaller reasoning-capable models can match or exceed larger models in privacy reasoning, indicating that architecture, training data, and alignment matter more than parameter count alone.

Finding 5

Reasoning Prompts Amplify Leakage

Explicit step-by-step or structured reasoning prompts can increase privacy leakage by guiding models to connect fragmented multimodal evidence into identity profiles.

Finding 6

Refusal Can Hide Underlying Capability

Strongly aligned models may refuse many risky queries, reducing realized exposure. Answered-only accuracy reveals that such models can still possess strong privacy reasoning capability when they do answer.
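
The difference between realized exposure and answered-only accuracy can be sketched as follows. The Response record, the function names, and the toy counts are hypothetical and chosen for illustration; the exact metric definitions and refusal detection used in the paper are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Response:
    refused: bool          # the model declined to answer
    correct: bool = False  # only meaningful when refused is False


def refusal_rate(responses):
    return sum(r.refused for r in responses) / len(responses)


def realized_exposure(responses):
    """Correct answers over all queries; a refusal counts as no leak."""
    return sum((not r.refused) and r.correct for r in responses) / len(responses)


def answered_only_accuracy(responses):
    """Correct answers over the queries the model actually answered."""
    answered = [r for r in responses if not r.refused]
    if not answered:
        return 0.0
    return sum(r.correct for r in answered) / len(answered)


# Toy example: 10 queries, 6 refused, 3 answered correctly, 1 answered wrongly.
responses = (
    [Response(refused=True)] * 6
    + [Response(refused=False, correct=True)] * 3
    + [Response(refused=False, correct=False)]
)
print(f"refusal rate:           {refusal_rate(responses):.2f}")
print(f"realized exposure:      {realized_exposure(responses):.2f}")
print(f"answered-only accuracy: {answered_only_accuracy(responses):.2f}")
```

In this toy case the model leaks on only 30% of all queries but is right 75% of the time when it does answer, which is the gap Finding 6 describes.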

Experimental Results

Higher scores indicate greater potential privacy risk under misuse.

Perception Risk measures attribute-level exposure. Reasoning Risk measures individual-level linkage and reconstruction.

Table 3. VLM-Induced Privacy Perception Risk

EN = English, ZH = Chinese.

Model	Overall Risk ↑	EN Direct	EN Indirect	EN IEA	EN Mean	ZH Direct	ZH Indirect	ZH IEA	ZH Mean	Bounding IoU ↑
Open-Source VLMs
QvQ-72B-Preview	0.454	0.645	0.499	0.561	0.568	0.599	0.560	0.561	0.573	0.22
Llama-4-Maverick	0.447	0.493	0.470	0.484	0.482	0.378	0.458	0.485	0.440	0.42
Qwen3-VL-8B	0.381	0.340	0.341	0.457	0.379	0.435	0.372	0.498	0.435	0.33
Qwen3-VL-32B	0.380	0.377	0.383	0.532	0.431	0.409	0.497	0.530	0.479	0.23
Llama-4-Scout	0.377	0.442	0.509	0.480	0.477	0.374	0.453	0.507	0.445	0.21
InternVL3.5-38B	0.372	0.413	0.312	0.454	0.393	0.413	0.395	0.519	0.442	0.28
InternVL3.5-8B	0.358	0.390	0.373	0.447	0.403	0.401	0.360	0.501	0.421	0.25
InternVL3.5-14B	0.356	0.394	0.340	0.442	0.392	0.411	0.367	0.497	0.425	0.25
Qwen3-VL-30B-A3B	0.353	0.333	0.300	0.339	0.324	0.355	0.393	0.407	0.385	0.35
Qwen3-VL-4B	0.345	0.453	0.297	0.473	0.408	0.503	0.356	0.486	0.448	0.18
Phi-4-multimodal	0.332	0.404	0.369	0.384	0.386	0.413	0.380	0.433	0.409	0.20
MiniCPM-V-4.5	0.291	0.410	0.368	0.474	0.417	0.428	0.409	0.531	0.456	0.00
Llama-3.2-11B-Vision	0.274	0.414	0.292	0.313	0.340	0.349	0.319	0.298	0.322	0.16
Llava-v1.6-Vicuna-13B	0.265	0.413	0.335	0.241	0.330	0.416	0.336	0.223	0.325	0.14
GLM-4.1V-9B	0.252	0.342	0.120	0.009	0.157	0.294	0.329	0.008	0.210	0.39
Phi-3.5-vision	0.238	0.359	0.329	0.297	0.328	0.497	0.188	0.145	0.277	0.11
deepseek-vl2-small	0.222	0.365	0.205	0.350	0.307	0.311	0.305	0.457	0.358	0.00
deepseek-vl2-tiny	0.222	0.365	0.205	0.350	0.307	0.311	0.305	0.457	0.358	0.00
instructblip-flan-t5-xl	0.216	0.795	0.327	0.052	0.391	0.534	0.208	0.032	0.258	0.00
Llava-v1.6-Vicuna-7B	0.195	0.362	0.307	0.260	0.310	0.419	0.267	0.136	0.274	0.00
Llava-v1.6-Mistral-7B	0.194	0.405	0.247	0.259	0.304	0.378	0.265	0.195	0.279	0.00
instructblip-flan-t5-xxl	0.192	0.833	0.283	0.013	0.376	0.418	0.151	0.027	0.199	0.00
instructblip-vicuna-7b	0.139	0.519	0.137	0.010	0.222	0.330	0.244	0.012	0.195	0.00
instructblip-vicuna-13b	0.113	0.161	0.146	0.020	0.109	0.539	0.124	0.026	0.230	0.00
Phi-3-vision-128k	0.073	0.049	0.256	0.002	0.102	0.000	0.260	0.068	0.109	0.07
Commercial VLMs
Gemini-2.5-Pro	0.579	0.775	0.614	0.431	0.607	0.788	0.602	0.443	0.611	0.52
GPT-5	0.505	0.428	0.607	0.484	0.506	0.606	0.650	0.542	0.599	0.41
Gemini-2.5-Flash	0.466	0.467	0.492	0.454	0.471	0.473	0.465	0.527	0.488	0.44
Claude-Sonnet-4	0.401	0.425	0.555	0.478	0.486	0.419	0.607	0.465	0.497	0.22
GPT-4o	0.342	0.432	0.418	0.420	0.423	0.399	0.370	0.440	0.403	0.20
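
The Overall Risk column in Table 3 appears to be the unweighted mean of the English mean, the Chinese mean, and Bounding IoU, where each language mean averages the Direct, Indirect, and IEA scores. This is an inference from the reported numbers, not a definition taken from the paper. The sketch below recomputes three rows copied from the table; the helper names are illustrative.

```python
# Rows copied from Table 3:
# (English Direct/Indirect/IEA), (Chinese Direct/Indirect/IEA), Bounding IoU, reported Overall Risk
TABLE3_ROWS = {
    "Gemini-2.5-Pro": ((0.775, 0.614, 0.431), (0.788, 0.602, 0.443), 0.52, 0.579),
    "QvQ-72B-Preview": ((0.645, 0.499, 0.561), (0.599, 0.560, 0.561), 0.22, 0.454),
    "GPT-4o": ((0.432, 0.418, 0.420), (0.399, 0.370, 0.440), 0.20, 0.342),
}


def mean(values):
    return sum(values) / len(values)


def perception_overall(en_scores, zh_scores, iou):
    """Assumed aggregation: average of the two language means and the IoU."""
    return mean([mean(en_scores), mean(zh_scores), iou])


recomputed = {
    model: round(perception_overall(en, zh, iou), 3)
    for model, (en, zh, iou, _reported) in TABLE3_ROWS.items()
}
for model, (_en, _zh, _iou, reported) in TABLE3_ROWS.items():
    print(f"{model}: recomputed {recomputed[model]:.3f}, reported {reported:.3f}")
```

Under this assumption Bounding IoU carries one third of the overall score, so a model with an IoU of 0.00 is capped well below its language means. One row does not fit: Phi-3-vision-128k recomputes to about 0.094 against a reported 0.073, so treat the formula as an observation about the table rather than a specification.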

Table 4. VLM-Induced Privacy Reasoning Risk

EN = English, ZH = Chinese.

Model	Overall Risk ↑	EN Single	EN Re-ID	EN Chained	EN Cross	EN Mean	ZH Single	ZH Re-ID	ZH Chained	ZH Cross	ZH Mean
Open-Source VLMs
Qwen3-VL-32B-Thinking	0.874	0.841	0.889	0.878	0.963	0.893	0.885	0.859	0.731	0.944	0.855
Qwen3-VL-4B-Thinking	0.871	0.872	0.843	0.858	0.963	0.884	0.859	0.859	0.771	0.944	0.858
Qwen3-VL-8B-Thinking	0.868	0.874	0.900	0.728	0.963	0.866	0.871	0.867	0.798	0.944	0.870
InternVL3.5-8B	0.858	0.845	0.878	0.856	0.926	0.876	0.783	0.817	0.787	0.972	0.840
InternVL3.5-38B	0.857	0.818	0.889	0.858	0.963	0.882	0.738	0.850	0.765	0.972	0.831
Llama-4-Maverick	0.842	0.724	0.878	0.867	0.926	0.849	0.766	0.859	0.769	0.944	0.835
InternVL3.5-14B	0.834	0.809	0.866	0.803	0.963	0.860	0.768	0.850	0.698	0.917	0.808
Llama-4-Scout	0.813	0.734	0.889	0.756	0.963	0.836	0.762	0.859	0.590	0.944	0.789
MiniCPM-V-4.5	0.807	0.748	0.866	0.808	0.815	0.809	0.822	0.834	0.646	0.917	0.805
GLM-4.1V-9B-Thinking	0.795	0.802	0.866	0.622	0.926	0.804	0.800	0.867	0.558	0.917	0.786
Phi-4-multimodal	0.794	0.790	0.832	0.808	0.963	0.848	0.728	0.759	0.635	0.833	0.739
QvQ-72B-Preview	0.632	0.667	0.787	0.642	0.852	0.737	0.473	0.583	0.185	0.861	0.526
deepseek-vl2	0.388	0.447	0.662	0.289	0.556	0.489	0.240	0.650	0.119	0.139	0.287
Commercial VLMs
Gemini-2.5-Flash	0.825	0.773	0.889	0.783	0.963	0.852	0.785	0.842	0.673	0.889	0.797
Claude-Sonnet-4	0.807	0.586	0.866	0.819	0.963	0.809	0.768	0.825	0.683	0.944	0.805
GPT-4o	0.555	0.443	0.695	0.403	0.741	0.571	0.445	0.600	0.331	0.778	0.539
GPT-5	0.459	0.293	0.559	0.328	0.889	0.517	0.348	0.467	0.175	0.611	0.400
Gemini-2.5-Pro	0.456	0.454	0.821	0.317	0.593	0.546	0.356	0.667	0.108	0.333	0.366
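
The reasoning scores in Table 4 are consistent with a simpler aggregation: each language mean averages the four task columns, and Overall Risk averages the two language means. The same per-language means also expose the English versus Chinese gap behind Finding 3. The sketch below uses four rows copied from the table; the aggregation rule is inferred from the numbers, not quoted from the paper.

```python
# Rows copied from Table 4:
# (English Single/Re-ID/Chained/Cross), (Chinese Single/Re-ID/Chained/Cross), reported Overall Risk
TABLE4_ROWS = {
    "Qwen3-VL-32B-Thinking": ((0.841, 0.889, 0.878, 0.963), (0.885, 0.859, 0.731, 0.944), 0.874),
    "Qwen3-VL-8B-Thinking": ((0.874, 0.900, 0.728, 0.963), (0.871, 0.867, 0.798, 0.944), 0.868),
    "QvQ-72B-Preview": ((0.667, 0.787, 0.642, 0.852), (0.473, 0.583, 0.185, 0.861), 0.632),
    "GPT-5": ((0.293, 0.559, 0.328, 0.889), (0.348, 0.467, 0.175, 0.611), 0.459),
}


def mean(values):
    return sum(values) / len(values)


summary = {}
for model, (en, zh, reported) in TABLE4_ROWS.items():
    en_mean, zh_mean = mean(en), mean(zh)
    summary[model] = {
        "overall": (en_mean + zh_mean) / 2,  # assumed aggregation
        "gap": en_mean - zh_mean,            # positive: riskier in English
        "reported": reported,
    }

# List models from the largest to the smallest absolute language gap.
for model, s in sorted(summary.items(), key=lambda kv: -abs(kv[1]["gap"])):
    print(f"{model}: overall {s['overall']:.3f} "
          f"(reported {s['reported']:.3f}), EN-ZH gap {s['gap']:+.3f}")
```

For these rows the gap ranges from slightly negative (Qwen3-VL-8B-Thinking scores marginally higher in Chinese) to more than 0.2 (QvQ-72B-Preview scores much higher in English), so a single-language evaluation would rank these models differently.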

Full refusal rates, additional task-level statistics, and model-specific examples are reported in the paper.

BibTeX

@inproceedings{anonymous2026multipriv,
  title={MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models},
  author={Anonymous},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
  url={https://openreview.net/forum?id=E4CNyyUDSD}
}