Evaluating Imputation Strategies for Complex Longitudinal Cohort Studies

Evidence from TILDA

Sinéad Moylett, George Smith-Kolff, Blair Robertson

University of Limerick / University of Canterbury

Tuesday, 7 Jul, 2026

TILDA: The Irish Longitudinal Study on Ageing

Study: Nationally representative, adults aged 50+, Ireland

Waves: 5 waves, 2009 – 2018, approx. every 2 years

Wave   Sample size   Retention (%)
1      8504          100.0
2      7207           84.7
3      6400           75.3
4      5715           67.2
5      4980           58.6
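
The retention column is each wave's sample size divided by the Wave 1 sample size. A minimal sketch of that arithmetic, written in Python purely for illustration (the variable names are hypothetical and not taken from the study code):

```python
# Wave sample sizes as reported in the table above.
sample_sizes = {1: 8504, 2: 7207, 3: 6400, 4: 5715, 5: 4980}

baseline = sample_sizes[1]

# Retention (%) = participants in a wave / participants in Wave 1.
retention = {wave: round(100 * n / baseline, 1) for wave, n in sample_sizes.items()}

for wave, pct in retention.items():
    print(f"Wave {wave}: {sample_sizes[wave]} participants, {pct}% retained")
```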

Data: 265 health covariates spanning physical, mental, behavioural, and cognitive domains, with a wide range of data types (e.g., numeric, ordinal, binary)

Focus: 7 social participation items (SCQSocAct), ordinal Likert scale 1-8

MAR assumed: naniar::mcar_test() rejected MCAR (\(p < 0.001\)) in all 5 waves. Rejecting MCAR is consistent with MAR but cannot rule out MNAR.

Non-monotone missingness (79.9%) → methods that handle arbitrary missingness patterns (mice, missForest) are required

Imputation Strategies Tested

1. Mode Imputation (Baseline)

  • Replaces missing values with the most frequent category
  • Deterministic: no imputation uncertainty is propagated
  • Chosen over mean: variables are ordinal (Likert 1–8)
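
As a minimal sketch of the baseline, assuming missing answers are marked with None (the function name and the example responses are hypothetical, not TILDA data):

```python
from collections import Counter


def mode_impute(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to take a mode from")
    # most_common(1) returns [(value, count)]; ties go to the value seen first.
    mode_value = Counter(observed).most_common(1)[0][0]
    return [mode_value if v is None else v for v in values]


# Hypothetical responses on a 1-8 Likert item; None marks a missing answer.
responses = [8, 8, 7, None, 8, 3, None, 8, 5, None]
completed = mode_impute(responses)
print(completed)
```

Every gap receives the same value and repeated runs give the same result, so there is no between-imputation variability to carry into later analyses.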


2. MICE (Multiple Imputation by Chained Equations)

  • Van Buuren & Groothuis-Oudshoorn (2011)
  • Creates \(M = 5\) plausible complete datasets
  • Proportional Odds Logistic Regression (polr) for ordinal variables
  • 75 iterations (convergence verified via trace plots)
  • Combines estimates via Rubin’s rules
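
Rubin's rules pool the \(M\) per-dataset estimates by averaging them and adding the between-imputation variance to the average within-imputation variance. A minimal sketch in Python with hypothetical coefficient estimates (in R, mice::pool() performs this step; the degrees-of-freedom adjustment is omitted here):

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, squared_ses):
    """Pool M point estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(squared_ses):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(squared_ses)      # average within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor M - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, sqrt(total)


# Hypothetical results from M = 5 imputed datasets (not TILDA estimates).
estimates = [0.50, 0.54, 0.47, 0.52, 0.49]
squared_ses = [0.010, 0.011, 0.009, 0.010, 0.010]

pooled_estimate, pooled_se = pool_rubin(estimates, squared_ses)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")
```

The pooled standard error is larger than any single-dataset standard error would suggest, which is how the uncertainty due to missing data reaches the final inference.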

3. Random Forest (3 configurations)

  • missForest package
  • Naive: \(m_{\text{try}} = \lfloor\sqrt{p}\rfloor = 2\)
  • Bagged: \(m_{\text{try}} = p = 7\)
  • Optimised: \(m_{\text{try}}\) selected by OOB error


Evaluation metrics:

  • Bias: directional error; should be near zero
  • RMSE: magnitude of error; lower is better
  • KL-Divergence: distributional similarity; zero = perfect match
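
The three metrics can be sketched as follows, assuming imputed values are compared against held-out true values. The direction of the KL-divergence (true distribution relative to imputed) and the small floor eps for empty categories are assumptions made for this illustration, not details confirmed by the study.

```python
from collections import Counter
from math import log, sqrt


def bias(true, imputed):
    """Mean signed error: positive means imputed values are too high."""
    return sum(i - t for t, i in zip(true, imputed)) / len(true)


def rmse(true, imputed):
    """Root mean squared error between imputed and true values."""
    return sqrt(sum((i - t) ** 2 for t, i in zip(true, imputed)) / len(true))


def kl_divergence(true, imputed, categories=range(1, 9), eps=1e-9):
    """KL(P_true || P_imputed) over category proportions (assumed direction)."""
    n_true, n_imp = Counter(true), Counter(imputed)
    kl = 0.0
    for k in categories:
        p = n_true[k] / len(true)
        q = n_imp[k] / len(imputed)
        if p > 0:
            # eps guards against division by zero for categories never imputed.
            kl += p * log(p / max(q, eps))
    return kl


# Hypothetical held-out values and a mode-style imputation of them.
true_vals = [8, 8, 7, 6, 8, 3]
imputed_vals = [8, 8, 8, 8, 8, 8]

print(f"bias = {bias(true_vals, imputed_vals):.3f}")
print(f"RMSE = {rmse(true_vals, imputed_vals):.3f}")
print(f"KL   = {kl_divergence(true_vals, imputed_vals):.3f}")
```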

The Surprise: Mode Beats Machine Learning


Method RMSE (SD)
Mode (Naive baseline) 0.994 (0.360)
Bagged RF 1.009 (0.364)
Optimised RF 1.106 (0.418)
MICE (Statistical) 1.121 (0.301)
Naive RF 1.213 (0.418)

The naive baseline has the lowest RMSE of all five methods, ahead of MICE and every Random Forest configuration.

Why?

RMSE is the Wrong Metric for Inference

Scatter plot of mean bias versus mean RMSE for five imputation methods. MICE sits near zero bias at RMSE 1.121. Mode and all three Random Forest variants show positive bias between 0.08 and 0.12; Mode, Bagged RF, and Optimised RF have lower RMSE than MICE, while Naive RF has higher RMSE.

Method                  RMSE (SD)       Bias (SD)        KL-Div (SD)
Mode (Naive baseline)   0.994 (0.360)    0.079 (0.255)   0.0293 (0.0487)
MICE (Statistical)      1.121 (0.301)   -0.036 (0.042)   0.0009 (0.0006)
Naive RF                1.213 (0.418)    0.116 (0.336)   0.0844 (0.1379)
Bagged RF               1.009 (0.364)    0.080 (0.213)   0.0430 (0.0884)
Optimised RF            1.106 (0.418)    0.104 (0.277)   0.0581 (0.1097)

Mode wins on RMSE by always predicting the dominant category: it exploits unimodal data rather than doing better statistics.

MICE is the only method with near-zero bias, and its KL-Divergence is \(\approx 30\times\) lower than Mode's.

For inference, the distributional shape is what matters.
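
A toy calculation shows why a constant guess can win on RMSE while distorting the distribution. It is not the study's simulation: the probabilities below are hypothetical, and the "random draw" strategy is a simple stand-in for a distribution-preserving method rather than MICE itself.

```python
from math import sqrt

# Hypothetical unimodal distribution of a 1-8 Likert item (not TILDA data).
pmf = {1: 0.02, 2: 0.03, 3: 0.05, 4: 0.05, 5: 0.10, 6: 0.15, 7: 0.20, 8: 0.40}

mean = sum(k * p for k, p in pmf.items())
var = sum(p * (k - mean) ** 2 for k, p in pmf.items())
mode = max(pmf, key=pmf.get)

# Strategy A: always impute the mode (a constant).
# Expected squared error of a constant c is Var + (c - mean)^2.
mode_bias = mode - mean
mode_rmse = sqrt(var + mode_bias ** 2)

# Strategy B: impute an independent random draw from the true distribution.
# Expected squared error of an independent draw is 2 * Var; its bias is zero.
draw_bias = 0.0
draw_rmse = sqrt(2 * var)

print(f"Mode: bias = {mode_bias:+.2f}, RMSE = {mode_rmse:.3f}")
print(f"Draw: bias = {draw_bias:+.2f}, RMSE = {draw_rmse:.3f}")
```

The constant has the lower expected RMSE, yet every imputed value lands in one category and the bias is positive, whereas the random draw reproduces the category proportions exactly in expectation.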

The OOB Failure: What Ground Truth Reveals

Line plot showing ground truth RMSE across five TILDA waves for three Random Forest configurations. In Wave 3, the OOB-optimised and naive configurations spike to RMSE 1.392 while the bagged configuration remains at 0.894.




Wave 3: a 55.7% RMSE gap

Config                              OOB error   Ground-truth RMSE
Naive (\(m_{\text{try}}=2\))        0.7221      1.392
Bagged (\(m_{\text{try}}=7\))       0.7224      0.894
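
The gap can be reproduced from the two table rows. A minimal sketch, assuming OOB-based selection simply picks the configuration with the lowest OOB error (the dictionary layout is illustrative):

```python
# Wave 3 values from the table above: OOB error and ground-truth RMSE.
configs = {
    "naive (mtry=2)": {"oob": 0.7221, "gt_rmse": 1.392},
    "bagged (mtry=7)": {"oob": 0.7224, "gt_rmse": 0.894},
}

# Configuration chosen by OOB error versus the one ground truth prefers.
chosen = min(configs, key=lambda name: configs[name]["oob"])
best = min(configs, key=lambda name: configs[name]["gt_rmse"])

oob_margin = configs[best]["oob"] - configs[chosen]["oob"]
gap_pct = 100 * (configs[chosen]["gt_rmse"] / configs[best]["gt_rmse"] - 1)

print(f"OOB selects: {chosen} (OOB margin {oob_margin:.4f})")
print(f"Ground truth prefers: {best}")
print(f"RMSE penalty for following OOB: {gap_pct:.1f}%")
```

The two OOB errors differ only in the fourth decimal place, so the selection rests on a very small margin while the ground-truth penalty is large.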

OOB correctly identified the best configuration in 4 out of 5 waves. But when it failed, it failed catastrophically.

The optimised strategy inherited the Wave 3 failure: a 49.6% RMSE fluctuation across waves.

Matching Methods to Research Goals

The right imputation method depends on what you are trying to do.

Goal                                                Recommended method                     Reason
Inferential (hypothesis tests, effect estimation)   mice with polr                         Near-zero bias, valid SEs, preserved distribution
Predictive (individual outcome forecasts)           Bagged RF (\(m_{\text{try}}=p\))       Lower RMSE, flexible non-linear relationships. Do not use the OOB-optimised configuration.
Exploratory (trends, variable importance)           RF to explore, then mice to validate   Never use RF-imputed data to both explore and confirm a finding.

What does this mean for R users?

For inference (regression, hypothesis tests, effect estimation)

Use mice::mice(method = "polr") for ordinal variables.

Near-zero bias, preserved distribution, valid SEs via Rubin’s rules.

For prediction (individual outcome forecasts)

Use missForest with mtry = p (Bagged), not OOB-optimised.

In Wave 3, OOB recommended mtry = 2, but its ground-truth RMSE was 55.7% worse than with mtry = 7. No warning was given.

Acknowledgements

George Smith-Kolff

University of Canterbury, Christchurch, New Zealand

Sinéad Moylett and Blair Robertson

University of Limerick, Ireland

University of Canterbury, Christchurch, New Zealand

Data: The Irish Longitudinal Study on Ageing (TILDA), Trinity College Dublin
