Speaker
Description
The growing movement toward open data, open science, and open government has increased the demand for sharing detailed microdata while protecting individual privacy. Data anonymization methods, including classical statistical disclosure control techniques and synthetic data generation, enable data sharing, but evaluating the resulting privacy risks and analytical usefulness remains challenging.
We present riskutility, an R package that provides a unified framework for assessing both disclosure risk and data utility in anonymized datasets.
The package implements a broad collection of evaluation metrics, including attribution-based disclosure risk measures (e.g., CAP, TCAP, DCAP), distance-based memorization checks such as Distance to Closest Record (DCR) and Nearest Neighbor Distance Ratio (NNDR), information-theoretic metrics, and model-based and distribution-based utility measures. In addition to these core methods, the package includes many further diagnostics for comparing distributions, multivariate structure, predictive performance, and other aspects of analytical validity. Together, these tools allow analysts to systematically evaluate privacy risks alongside analytical usefulness within a single workflow.
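To make the two distance-based checks concrete, here is a minimal Python sketch of DCR and NNDR for numeric records. It is not the riskutility implementation: the function names `euclidean` and `dcr_nndr`, the choice of Euclidean distance, and the convention of returning 1.0 when the second-nearest distance is zero are all assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dcr_nndr(synthetic, original):
    """Return one (DCR, NNDR) pair per synthetic record.

    DCR  = distance to the closest original record.
    NNDR = closest distance divided by second-closest distance.
    Requires at least two original records.
    """
    results = []
    for s in synthetic:
        dists = sorted(euclidean(s, o) for o in original)
        nearest, second = dists[0], dists[1]
        # Assumed convention: if the two nearest originals are both exact
        # copies, the ratio is undefined, so report 1.0.
        nndr = nearest / second if second > 0 else 1.0
        results.append((nearest, nndr))
    return results

# Toy data (hypothetical): the first synthetic record copies an original one.
original = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
synthetic = [(0.0, 0.0), (6.0, 8.0)]
scores = dcr_nndr(synthetic, original)
```

A DCR of zero, as for the first synthetic record here, indicates an exact copy of an original record. An NNDR close to zero means the synthetic record sits much nearer to one original record than to any other.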
A central methodological contribution implemented in the package is RAPID (Risk of Attribute Prediction–Induced Disclosure), a novel inferential disclosure risk measure. RAPID models a realistic attacker who trains predictive models on released data and uses quasi-identifiers to infer sensitive attributes of real individuals. The method quantifies per-record vulnerability for both continuous and categorical sensitive variables.
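The attacker scenario can be illustrated with a toy Python sketch for a categorical sensitive variable. It does not reproduce the RAPID definition from the preprint: a simple frequency table stands in for the attacker's predictive model, and the function names, column names, example data, and the 0.9 threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def train_attacker(released, qi_keys, sensitive_key):
    """Toy attacker: count sensitive values per quasi-identifier combination."""
    table = defaultdict(Counter)
    for row in released:
        table[tuple(row[k] for k in qi_keys)][row[sensitive_key]] += 1
    return table

def per_record_risk(original, table, qi_keys, sensitive_key):
    """Probability the attacker assigns to each real record's true value."""
    risks = []
    for row in original:
        counts = table.get(tuple(row[k] for k in qi_keys))
        if not counts:
            # Quasi-identifier combination never seen in the released data.
            risks.append(0.0)
            continue
        risks.append(counts[row[sensitive_key]] / sum(counts.values()))
    return risks

qi = ["age", "region"]
released = [
    {"age": "30-39", "region": "north", "diagnosis": "A"},
    {"age": "30-39", "region": "north", "diagnosis": "A"},
    {"age": "30-39", "region": "north", "diagnosis": "B"},
    {"age": "40-49", "region": "south", "diagnosis": "B"},
]
original = [
    {"age": "30-39", "region": "north", "diagnosis": "A"},
    {"age": "40-49", "region": "south", "diagnosis": "B"},
    {"age": "50-59", "region": "south", "diagnosis": "A"},
]

table = train_attacker(released, qi, "diagnosis")
risks = per_record_risk(original, table, qi, "diagnosis")
# Hypothetical threshold: flag records the attacker infers with >= 90% confidence.
high_risk = [i for i, r in enumerate(risks) if r >= 0.9]
```

In this toy data only the second original record is flagged, because every released record sharing its quasi-identifiers carries the same diagnosis. A realistic attacker would replace the frequency table with a fitted classifier or regressor, which is what allows continuous sensitive variables to be handled as well.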
Through live coding examples with a real-world dataset, this talk will demonstrate practical workflows for disclosure risk analysis, including identifying high-risk records, selecting attacker models via cross-validation, analyzing which quasi-identifier combinations drive vulnerability, and exploring privacy–utility trade-offs using PCA-based visualizations.
The RAPID methodology is described in our recent work:
https://arxiv.org/abs/2602.09235
Additional Material or Paper
We previously contributed a workshop on data anonymisation and disclosure risk at useR! 2024 in Salzburg:
https://userconf2024.sched.com/event/1c8zq/tutorial-data-anonymisation-for-open-science-jiri-novak-oscar-thees-uzh-fhnw-marko-miletic-bern-university-of-applied-sciences-alzbeta-beranova-czech-statistical-office
The RAPID preprint is available at https://arxiv.org/pdf/2602.09235.
If you used AI tools or services to support the preparation of this submission, please state the name and reason for using each of them.
DeepL Write was used for minor language editing and stylistic improvements of the abstract. All scientific content and ideas are the authors’ own.
| Field | Response |
|---|---|
| Keywords: Please list up to 5 keywords to help us find the right session for your contribution. | synthetic data, statistical disclosure control, disclosure risk, privacy–utility trade-offs, R package |
| Virtual Option | This submission is for onsite presentation only |
| Video Recording | Video sharing is fine |
| The author(s) agree(s) to take responsibility and be accountable for the contents of the submission and is/are authorized to present it. | Confirm |
| Interested in serving as reviewer? | oscar.thees@fhnw.ch |