Description
UnitMix is an R package designed to detect and correct unit-of-measurement errors using Gaussian mixture model-based clustering, supporting both methodological research and production workflows at National Statistical Institutes (NSIs).
The core function, assign.cluster, implements a multivariate Gaussian mixture model on log-transformed variables. Clusters can be defined simultaneously over more than two variables through user-defined error patterns, and assignments are uncertainty-aware, controlled by probability and entropy thresholds. This talk presents the evolution of UnitMix, emphasizing the new refine.cluster functionality and its role in robust post-processing of multivariate cluster assignments.
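To make the idea of uncertainty-aware assignment concrete, here is a minimal base-R sketch. It is not the UnitMix implementation and does not call assign.cluster. The helper names (`log_dens`, `posterior`, `entropy`), the fixed component parameters, and the threshold values 0.95 and 0.2 are all assumptions chosen for illustration. It assumes two bivariate Gaussian components on the log10 scale, where the second component represents a hypothetical error pattern in which the first variable is reported 1000 times too large.

```r
# Illustrative only: NOT the UnitMix code. All names and values are assumptions.

log_dens <- function(x, mu, sigma) {
  # Log density of a multivariate normal, built on base R's mahalanobis()
  d2 <- mahalanobis(x, center = mu, cov = sigma)
  -0.5 * (ncol(x) * log(2 * pi) + as.numeric(determinant(sigma)$modulus) + d2)
}

posterior <- function(x, mus, sigma, weights) {
  # One column per component: log mixing weight + component log density
  ll <- sapply(seq_along(mus),
               function(k) log(weights[k]) + log_dens(x, mus[[k]], sigma))
  ll <- matrix(ll, nrow = nrow(x))
  # Normalise row-wise with the log-sum-exp trick for numerical stability
  p <- exp(ll - apply(ll, 1, max))
  p / rowSums(p)
}

entropy <- function(p) {
  # Shannon entropy of each row; 0 * log(0) is treated as 0
  -rowSums(ifelse(p > 0, p * log(p), 0))
}

# Component 2 is component 1 shifted by log10(1000) = 3 on the first variable
mus     <- list(c(2, 1), c(2 + 3, 1))
sigma   <- diag(c(0.25, 0.25))
weights <- c(0.9, 0.1)

# Three records: correct units, x1000 error, and an ambiguous value in between
x_raw <- rbind(c(100, 10), c(100000, 10), c(10^3.5, 10))
x     <- log10(x_raw)

p <- posterior(x, mus, sigma, weights)
h <- entropy(p)

# Assign a cluster only when the top posterior is high and the entropy is low;
# otherwise leave the record unassigned (NA)
cluster <- ifelse(apply(p, 1, max) >= 0.95 & h <= 0.2, max.col(p), NA)
```

In this sketch the first two records are assigned with near certainty, while the third sits halfway between the two component centres and is left unassigned.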
The refine.cluster function evaluates the compatibility of each record with its assigned cluster via Mahalanobis distance, using a chi-square cutoff to identify outliers and unstable groups based on within-cluster compatibility and minimum size constraints. Records that fail these checks, or that belong to clusters with low proportions of compatible units, are automatically moved to an unassigned cluster, improving the reliability of edited data. We illustrate how this refinement step has been integrated into Istat’s operational pipelines for the energy consumption surveys of households and enterprises, where systematic scale errors and structural outliers are frequent. In addition, we show its use in the acquisitions section of the Small and Medium Enterprises (SME) survey.
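The refinement logic described above can be sketched in base R as follows. This is a hedged illustration, not the refine.cluster source: the function name `refine_sketch`, its arguments (`alpha`, `min_size`, `min_prop`) and their defaults are hypothetical, and the sketch assumes that each cluster's centre and covariance are estimated from the sample mean and covariance of its own records, which may differ from what the package does.

```r
# Illustrative only: NOT the UnitMix code. Names and defaults are assumptions.
refine_sketch <- function(x, cluster, alpha = 0.01, min_size = 5, min_prop = 0.8) {
  out <- cluster
  # Squared Mahalanobis distances are compared with a chi-square quantile
  cutoff <- qchisq(1 - alpha, df = ncol(x))
  for (k in unique(cluster[!is.na(cluster)])) {
    idx <- which(cluster == k)
    # Clusters that are too small (or too small to estimate a covariance)
    # are dissolved: all their records become unassigned
    if (length(idx) < max(min_size, ncol(x) + 1)) {
      out[idx] <- NA
      next
    }
    xk <- x[idx, , drop = FALSE]
    d2 <- mahalanobis(xk, center = colMeans(xk), cov = cov(xk))
    ok <- d2 <= cutoff
    if (mean(ok) < min_prop) {
      out[idx] <- NA        # unstable cluster: too few compatible records
    } else {
      out[idx[!ok]] <- NA   # stable cluster: unassign only the outliers
    }
  }
  out
}

# Simulated data on the log scale: one cluster of 50 records plus a gross
# outlier (record 51), and a second cluster with only 3 records
set.seed(1)
x <- rbind(matrix(rnorm(100), ncol = 2),
           c(10, 10),
           matrix(rnorm(6, mean = 20), ncol = 2))
cluster <- c(rep(1L, 51), rep(2L, 3))

refined <- refine_sketch(x, cluster)
```

Here record 51 is flagged as incompatible with cluster 1, and the three records of cluster 2 are unassigned because the cluster falls below the minimum size.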
The contribution will focus on the advantages of moving from ad-hoc scripts to a reusable package that supports transparent, reproducible editing workflows in official statistics.
If you used AI tools or services to support the preparation of this submission, please state the name and reason for using each of them.
We used ChatGPT to check the English and ensure compliance with CRAN standards.
| Keywords: Please list up to 5 keywords to help us find the right session for your contribution. | UnitMix, Measurement errors, editing and imputation, Gaussian mixture model |
|---|---|
| Virtual Option | This submission is for onsite presentation primarily, but I would also like it to be considered for pre-recorded virtual presentation if I don't get an onsite slot |
| Video Recording | Video sharing is fine |
| The author(s) agree(s) to take responsibility and be accountable for the contents of the submission and is/are authorized to present it. | Confirm |
| Interested in serving as reviewer? | cristina.faricelli@istat.it |