6–9 Jul 2026
Europe/Warsaw timezone

Synthetic by Design: Two R Packages for Privacy-Safe Public Health Data Generation

8 Jul 2026, 13:40
20m
Lightning Talk (5 minutes) Virtual Presentation Room

Speakers

Abigail Stamm (Minnesota Department of Health)
Eric Kvale (Minnesota Department of Health)

Description

Access to realistic public health data for training, pipeline validation, and methods development is constrained by privacy regulations that restrict the use of real patient records. We present two complementary open-source R packages that address this problem at different points along the synthetic data design spectrum.

toysurveydata (Stamm, MDH) generates simple, customizable fake survey datasets from a priori response proportions, without modeling inter-variable relationships. Designed for demonstrating data cleaning and validation workflows, it offers a lightweight, settings-table–driven approach that prioritizes usability over epidemiological fidelity.
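
The proportion-driven approach described above can be sketched in base R. This is a minimal illustration only: the settings-table layout, the column names, and the function `make_toy_survey()` are hypothetical and are not taken from the toysurveydata API. Each question is sampled independently from its own response proportions, so no relationship between variables is modeled.

```r
# Hypothetical settings table: one row per (question, response) pair with an
# a priori proportion. Layout and names are assumptions, not toysurveydata's.
settings <- data.frame(
  question   = c("smoker", "smoker", "age_group", "age_group", "age_group"),
  response   = c("yes", "no", "18-34", "35-64", "65+"),
  proportion = c(0.15, 0.85, 0.30, 0.50, 0.20),
  stringsAsFactors = FALSE
)

# Draw every question independently from its own proportions.
make_toy_survey <- function(settings, n, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  by_question <- split(settings, settings$question)
  columns <- lapply(by_question, function(s) {
    # Proportions for one question must sum to 1.
    stopifnot(abs(sum(s$proportion) - 1) < 1e-8)
    sample(s$response, size = n, replace = TRUE, prob = s$proportion)
  })
  data.frame(id = seq_len(n), columns, stringsAsFactors = FALSE)
}

toy <- make_toy_survey(settings, n = 1000, seed = 42)
head(toy)
```

Because the columns are drawn independently, cross-tabulations of the result carry no epidemiological meaning, which is acceptable for data cleaning and validation demonstrations.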

The NSSP Synthetic Data Toolkit (Kvale, MDH) occupies the other end of the spectrum: a modular Shiny dashboard generating HIPAA-safe emergency department visit data modeled after the CDC National Syndromic Surveillance Program BioSense Platform. Its architecture includes a data definitions module implementing the full NSSP Data Dictionary field structure, a baseline generator with configurable demographic distributions and temporal visit patterns, a ten-scenario disruption engine, and export utilities producing CSV, JSON, and HL7-like outputs across all five BioSense tables.
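
To make the idea of a baseline generator concrete, the following base R sketch combines a configurable demographic distribution with a temporal visit pattern. It is not the toolkit's code: the function `make_baseline_visits()`, the field names, and all numeric values (mean daily volume, day-of-week multipliers, seasonal amplitude) are assumptions chosen for illustration, and the fields are not drawn from the NSSP Data Dictionary.

```r
# Hypothetical baseline generator; names and parameter values are assumptions.
make_baseline_visits <- function(start_date, n_days, mean_daily = 120,
                                 age_probs = c("0-17" = 0.20,
                                               "18-64" = 0.55,
                                               "65+" = 0.25),
                                 seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  dates <- seq(as.Date(start_date), by = "day", length.out = n_days)
  lt <- as.POSIXlt(dates)

  # Assumed day-of-week multipliers, indexed Sunday (0) through Saturday (6).
  dow_effect <- c(1.05, 1.10, 1.00, 0.98, 0.97, 0.95, 1.00)[lt$wday + 1]
  # Assumed annual cycle peaking at the start of the year.
  season_effect <- 1 + 0.15 * cos(2 * pi * lt$yday / 365.25)

  # Daily visit counts are Poisson draws around the expected volume.
  counts <- rpois(n_days, lambda = mean_daily * dow_effect * season_effect)
  n_visits <- sum(counts)

  visits <- data.frame(
    visit_date = rep(dates, times = counts),
    age_group  = sample(names(age_probs), n_visits, replace = TRUE,
                        prob = age_probs),
    sex        = sample(c("F", "M"), n_visits, replace = TRUE),
    stringsAsFactors = FALSE
  )
  visits$visit_id <- seq_len(nrow(visits))
  visits
}

baseline <- make_baseline_visits("2026-01-01", n_days = 28, seed = 1)
table(baseline$age_group)
```

A result like this could then be written out with `write.csv()`; JSON or HL7-like export would need additional formatting that this sketch does not attempt.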

A recently added scenario simulates the downstream healthcare utilization impacts of federal immigration enforcement operations, modeling suppressed care-seeking behavior, ED volume shifts, syndromic surveillance signal changes, and behavioral health surge patterns. This is a novel application of synthetic data generation to an emerging and politically sensitive public health research problem.
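
One simple way to represent a disruption scenario of this kind is as a multiplier applied to expected daily ED volume inside a date window. The sketch below assumes that design; the function `apply_disruption()`, the window dates, and the 0.7 multiplier are hypothetical placeholders, not the toolkit's implementation and not empirical estimates of any real effect.

```r
# Hypothetical disruption step: scale expected daily volume inside a window.
apply_disruption <- function(expected, dates, start, end, multiplier) {
  in_window <- dates >= as.Date(start) & dates <= as.Date(end)
  expected[in_window] <- expected[in_window] * multiplier
  expected
}

dates    <- seq(as.Date("2026-01-01"), by = "day", length.out = 28)
expected <- rep(100, length(dates))  # flat baseline, assumed for illustration

# Arbitrary 30% suppression of care-seeking over one week.
disrupted <- apply_disruption(expected, dates,
                              start = "2026-01-10", end = "2026-01-16",
                              multiplier = 0.7)

# Synthetic observed counts drawn around the disrupted expectation.
set.seed(1)
observed <- rpois(length(dates), lambda = disrupted)
```

A surge pattern, such as increased behavioral health visits, could use the same function with a multiplier above 1 applied to a category-specific series.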

Together, these packages illustrate how synthetic data design choices (complexity, fidelity, and domain specificity) should be driven by the intended training or research use case. Both are freely available on GitHub.

  • toysurveydata: https://github.com/ajstamm/toysurveydata
  • NSSP Toolkit: https://github.com/ekvale/nssp-synthetic-data

If you used AI tools or services to support the preparation of this submission, please state the name and reason for using each of them.

Claude AI: programming assistance

Keywords: synthetic data, syndromic surveillance, Shiny, public health informatics, R packages
Virtual Option: This submission is for pre-recorded virtual presentation only
Material License: MIT license
Video Recording: Video sharing is fine
The author(s) agree(s) to take responsibility and be accountable for the contents of the submission and is/are authorized to present it: Confirm

Authors

Abigail Stamm (Minnesota Department of Health) Eric Kvale (Minnesota Department of Health)
