Evaluating Imputation Strategies for Complex Longitudinal Cohort Studies

TILDA: The Irish Longitudinal Study on Ageing

Study: Nationally representative, adults aged 50+, Ireland

Waves: 5 waves, 2009 – 2018, approx. every 2 years

Wave	Sample Size	Retention (%)
1	8504	100.0
2	7207	84.7
3	6400	75.3
4	5715	67.2
5	4980	58.6
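
The retention column can be reproduced from the sample sizes alone. A minimal base R sketch, assuming retention is defined relative to the Wave 1 sample (the variable names are illustrative):

```r
# Wave sample sizes as reported in the table above.
n_wave <- c(8504, 7207, 6400, 5715, 4980)

# Retention relative to Wave 1, in percent, rounded to one decimal place.
retention <- round(100 * n_wave / n_wave[1], 1)

data.frame(wave = seq_along(n_wave), n = n_wave, retention = retention)
```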

Data: 265 health covariates spanning physical, mental, behavioural, and cognitive domains, with a wide range of data types (e.g., numeric, ordinal, binary)

Focus: 7 social participation items (SCQSocAct), ordinal Likert scale 1-8

MCAR rejected: naniar::mcar_test() (Little's test), \(p < 0.001\) in all 5 waves. MAR is assumed rather than confirmed, since the test cannot separate MAR from MNAR.

Non-monotone missingness (79.9%) → imputation must handle arbitrary missingness patterns, hence mice and missForest
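
One way to check for non-monotone missingness is sketched below in base R. It assumes a per-respondent definition (a pattern is monotone if, once a value is missing, every later wave is missing too); the slides do not state how the 79.9% figure was computed, and the toy data and the helper `is_monotone` are hypothetical.

```r
# Hypothetical toy data: one item observed at three waves (NA = missing).
toy <- data.frame(
  w1 = c(2, 1, NA, 3, 1),
  w2 = c(2, NA, 1, 3, NA),
  w3 = c(NA, NA, 2, 3, 1)
)

# A row is monotone if, after the first missing wave, all later waves are missing.
is_monotone <- function(row_is_na) {
  first_na <- match(TRUE, row_is_na)
  is.na(first_na) || all(row_is_na[first_na:length(row_is_na)])
}

monotone <- apply(is.na(toy), 1, is_monotone)

# Share of respondents whose pattern is non-monotone, in percent.
pct_non_monotone <- 100 * mean(!monotone)
pct_non_monotone
```

Rows 3 and 5 return to the study after a gap, so they count as non-monotone. Methods designed only for dropout cannot handle such rows.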

Imputation Strategies Tested

1. Mode Imputation (Baseline)

Replaces missing values with the most frequent category
Deterministic - no uncertainty
Chosen over mean: variables are ordinal (Likert 1–8)
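
A minimal sketch of the baseline, assuming each item is stored as numeric codes 1 to 8 and that ties are broken toward the lowest category (the slides do not specify tie handling; `impute_mode` is an illustrative helper, not the study code):

```r
# Replace missing values of one ordinal item with its most frequent observed category.
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  categories <- sort(unique(observed))
  counts <- tabulate(match(observed, categories))
  x[is.na(x)] <- categories[which.max(counts)]  # ties go to the lowest category
  x
}

item <- c(1, 1, 2, NA, 8, 1, NA, 3)
impute_mode(item)
```

Every missing cell receives the same value, which is why this method carries no imputation uncertainty.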

2. MICE (Multiple Imputation by Chained Equations)

Van Buuren & Groothuis-Oudshoorn (2011)
Creates \(M = 5\) plausible complete datasets
Proportional Odds Logistic Regression (polr) for ordinal variables
75 iterations (convergence verified via trace plots)
Combines estimates via Rubin’s rules
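
Rubin's rules can be written out in a few lines of base R. The sketch below pools one coefficient across \(M\) imputed datasets; the estimates and standard errors are hypothetical, and in practice `mice::pool()` does this step (including degrees of freedom, which are omitted here).

```r
# Pool M point estimates and their squared standard errors with Rubin's rules.
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  u_bar <- mean(variances)            # within-imputation variance
  b <- var(estimates)                 # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b    # total variance
  c(estimate = q_bar, se = sqrt(total))
}

# Hypothetical estimates of one coefficient from M = 5 imputed datasets.
est <- c(0.50, 0.54, 0.47, 0.52, 0.47)
se  <- c(0.10, 0.11, 0.10, 0.09, 0.10)
pooled <- pool_rubin(est, se^2)
pooled
```

The between-imputation term is what a single deterministic imputation cannot supply: with one dataset it is zero, and standard errors are too small.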

3. Random Forest (3 configurations)

missForest package
Naive: \(m_{\text{try}} = \lfloor\sqrt{p}\rfloor = 2\)
Bagged: \(m_{\text{try}} = p = 7\)
Optimised: \(m_{\text{try}}\) selected by OOB error

Evaluation metrics:

Bias: directional error; should be near zero
RMSE: magnitude of error; lower is better
KL-Divergence: distributional similarity; zero = perfect match
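
The three metrics might be computed as below for cells whose true values are known. This sketch assumes bias is defined as imputed minus true, and that KL-divergence runs from the true to the imputed category distribution with a small smoothing constant `eps` for empty categories; the slides state neither convention, so treat both as assumptions.

```r
# truth and imputed: numeric codes 1..8 for the cells being scored.
bias <- function(truth, imputed) mean(imputed - truth)
rmse <- function(truth, imputed) sqrt(mean((imputed - truth)^2))

# KL(P_true || Q_imputed) over the response categories.
kl_divergence <- function(truth, imputed, levels = 1:8, eps = 1e-8) {
  p <- as.numeric(table(factor(truth, levels = levels))) / length(truth)
  q <- as.numeric(table(factor(imputed, levels = levels))) / length(imputed)
  q <- (q + eps) / sum(q + eps)           # avoid log(0) for empty categories
  sum(ifelse(p > 0, p * log(p / q), 0))
}

truth <- c(1, 1, 2, 3, 8)
all_ones <- rep(1, length(truth))
c(bias = bias(truth, all_ones),
  rmse = rmse(truth, all_ones),
  kl = kl_divergence(truth, all_ones))
```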

The Surprise: Mode Beats Machine Learning

Method	RMSE (SD)
Mode (Naive baseline)	0.994 (0.360)
Bagged RF	1.009 (0.364)
Optimised RF	1.106 (0.418)
MICE (Statistical)	1.121 (0.301)
Naive RF	1.213 (0.418)

The naive baseline has the lowest RMSE of any method, ahead of MICE and all three Random Forest configurations.

Why?

RMSE is the Wrong Metric for Inference

Scatter plot of mean bias versus mean RMSE for five imputation methods. MICE sits near zero bias (-0.036) at RMSE 1.121. Mode and all three Random Forest variants show positive bias between 0.08 and 0.12. Mode, Bagged RF, and Optimised RF have lower RMSE than MICE; Naive RF has the highest RMSE (1.213).

Method	RMSE (SD)	Bias (SD)	KL-Div (SD)
Mode (Naive baseline)	0.994 (0.360)	0.079 (0.255)	0.0293 (0.0487)
MICE (Statistical)	1.121 (0.301)	-0.036 (0.042)	0.0009 (0.0006)
Naive RF	1.213 (0.418)	0.116 (0.336)	0.0844 (0.1379)
Bagged RF	1.009 (0.364)	0.080 (0.213)	0.0430 (0.0884)
Optimised RF	1.106 (0.418)	0.104 (0.277)	0.0581 (0.1097)

Mode wins RMSE by always predicting the dominant category - exploiting unimodal data, not doing better statistics.

MICE is the only method with near-zero bias, and its KL-Divergence is \(\approx 30\times\) lower than Mode's (0.0009 vs 0.0293).

For inference, the distributional shape is what matters.

This scatter plot maps every method on two dimensions simultaneously. X-axis is mean RMSE — further left is better at point prediction. Y-axis is mean bias — the dashed line at zero is perfect unbiasedness.

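Mode, Bagged RF, and Optimised RF sit to the left of MICE but above the line; Naive RF sits above the line and furthest to the right. All four show positive bias, systematically pushing imputations toward dominant categories.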

MICE sits to the right — higher RMSE — but it is the only method near the zero line. Mean bias of minus 0.036.

Why does Mode win RMSE? The SCQSocAct variables are heavily concentrated in one category: most responses are Never or Rarely. Mode always imputes the modal category. It is correct most of the time and is rarely far off. MICE honestly samples from the full distribution, sometimes imputing a 5 when the truth is 1, and RMSE penalises those large errors heavily.

But look at KL-Divergence — distributional fidelity. Mode is approximately 30 times worse than MICE. It creates an artificial spike at the dominant category, making the population appear far more homogeneous than it is. For regression, hypothesis tests, effect estimation — the joint distribution of variables is what determines validity. MICE preserves it.
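
The trade-off can be shown analytically on a toy distribution. The sketch below uses a hypothetical skewed distribution over eight categories (not TILDA data) and compares imputing the mode with an independent random draw from the true marginal distribution. The random draw is only a stand-in for a stochastic imputer, not MICE itself, and the size of the KL value for the mode depends on the smoothing constant `eps`, so only the direction of the comparison is meaningful.

```r
# Hypothetical skewed distribution over the 8 response categories.
p <- c(0.55, 0.20, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01)
k <- 1:8
mode_cat <- k[which.max(p)]

# Expected RMSE when every missing cell is set to the mode.
rmse_mode <- sqrt(sum(p * (k - mode_cat)^2))

# Expected RMSE when each missing cell is an independent draw from p.
sq_dist <- outer(k, k, function(a, b) (a - b)^2)
rmse_draw <- sqrt(sum(sq_dist * outer(p, p)))

# KL divergence from the true distribution to the imputed one.
kl <- function(p, q, eps = 1e-8) {
  q <- (q + eps) / sum(q + eps)
  sum(ifelse(p > 0, p * log(p / q), 0))
}
kl_mode <- kl(p, as.numeric(k == mode_cat))  # all imputations land on one category
kl_draw <- kl(p, p)                          # draws reproduce p in expectation

c(rmse_mode = rmse_mode, rmse_draw = rmse_draw, kl_mode = kl_mode, kl_draw = kl_draw)
```

The mode has the smaller expected RMSE and the larger KL-divergence, the same ordering as in the results table.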

The OOB Failure: What Ground Truth Reveals

Line plot showing ground truth RMSE across five TILDA waves for three Random Forest configurations. In Wave 3, the OOB-optimised and naive configurations spike to RMSE 1.392 while the bagged configuration remains at 0.894.

Wave 3: a 55.7% RMSE gap

Config	OOB	GT RMSE
Naive (\(m=2\))	0.7221	1.392
Bagged (\(m=7\))	0.7224	0.894

OOB correctly identified the best configuration in 4 out of 5 waves. But when it failed, it failed catastrophically.

The optimised strategy inherited the wave 3 failure: a 49.6% RMSE fluctuation across waves.
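
The Wave 3 figures can be checked directly from the table. A short base R sketch, assuming the gap is expressed relative to the better (Bagged) configuration:

```r
# Wave 3 values from the table above.
wave3 <- data.frame(
  config  = c("Naive (mtry = 2)", "Bagged (mtry = 7)"),
  oob     = c(0.7221, 0.7224),
  gt_rmse = c(1.392, 0.894),
  stringsAsFactors = FALSE
)

# Configuration picked by each criterion (lower is better for both).
chosen_by_oob <- wave3$config[which.min(wave3$oob)]
best_by_truth <- wave3$config[which.min(wave3$gt_rmse)]

# OOB margin between the two configurations, and the relative ground-truth gap.
oob_margin <- diff(range(wave3$oob))
rmse_gap_pct <- 100 * (max(wave3$gt_rmse) - min(wave3$gt_rmse)) / min(wave3$gt_rmse)

c(chosen_by_oob = chosen_by_oob, best_by_truth = best_by_truth)
round(c(oob_margin = oob_margin, rmse_gap_pct = rmse_gap_pct), 4)
```

An OOB difference in the fourth decimal place decided the selection, while the ground-truth difference was more than half the Bagged RMSE.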

Matching Methods to Research Goals

The right imputation method depends on what you are trying to do.

Goal	Recommended Method	Reason
Inferential (hypothesis tests, effect estimation)	`mice` with `polr`	Near-zero bias, valid SEs, preserved distribution
Predictive (individual outcome forecasts)	Bagged RF (\(m_{\text{try}}=p\))	Lower RMSE, flexible non-linear relationships. Do not use OOB-optimised.
Exploratory (trends, variable importance)	RF to explore, then `mice` to validate	Never use RF-imputed data to both explore and confirm a finding.

What does this mean for R users?

For inference (regression, hypothesis tests, effect estimation)

Use mice::mice(method = "polr") for ordinal variables.

Near-zero bias, preserved distribution, valid SEs via Rubin’s rules.
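
A minimal sketch of this workflow on simulated data follows. The three 4-level items, the missingness mechanism, and the analysis model are all hypothetical stand-ins; only `m = 5` and `method = "polr"` come from the slides, and `maxit` is reduced from the reported 75 so the example runs quickly.

```r
library(mice)

set.seed(1)
n <- 300
latent <- rnorm(n)

# Hypothetical stand-ins for ordinal items: three correlated 4-level ordered factors.
make_item <- function() {
  cut(latent + rnorm(n), breaks = c(-Inf, -0.5, 0.5, 1.5, Inf),
      labels = 1:4, ordered_result = TRUE)
}
items <- data.frame(item1 = make_item(), item2 = make_item(), item3 = make_item())

# Remove 15% of each item completely at random (toy mechanism only).
for (j in names(items)) items[sample(n, 45), j] <- NA

# m = 5 imputations, proportional odds model for every column.
imp <- mice(items, m = 5, maxit = 10, method = "polr",
            seed = 2024, printFlag = FALSE)

# Fit the analysis model in each completed dataset, then pool with Rubin's rules.
fit <- with(imp, lm(as.numeric(item1) ~ as.numeric(item2)))
summary(pool(fit))
```

Calling `plot(imp)` draws the trace plots used to judge convergence. The columns must be ordered factors for `polr` to apply.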

For prediction (individual outcome forecasts)

Use missForest with mtry = p (Bagged), not OOB-optimised.

In Wave 3, OOB recommended mtry = 2 — ground truth RMSE was 55.7% worse than mtry = 7. No warning was given.
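
The Bagged configuration and a ground-truth check might look like the sketch below. The data are simulated, the 10% random masking is an assumption about how ground truth could be obtained, and the items are treated as numeric codes, which the slides do not confirm. Because each forest predicts one item from the other \(p - 1\), randomForest may warn that `mtry = p` is out of range and cap it, hence the `suppressWarnings()`.

```r
library(missForest)

set.seed(1)
n <- 200
latent <- rnorm(n)

# Hypothetical stand-ins for the 7 items: correlated numeric codes 1..8.
items <- as.data.frame(replicate(7, {
  as.numeric(cut(latent + rnorm(n),
                 breaks = c(-Inf, seq(-1.5, 1.5, by = 0.5), Inf)))
}))
names(items) <- paste0("item", 1:7)
truth <- items

# Mask 10% of cells completely at random so the imputations can be scored.
mask <- matrix(runif(n * 7) < 0.10, nrow = n)
items[mask] <- NA

# Bagged configuration: mtry equal to the number of items.
p <- ncol(items)
fit_bagged <- suppressWarnings(missForest(items, mtry = p, ntree = 100))

fit_bagged$OOBerror   # internal OOB estimate reported by missForest

# Ground-truth RMSE on the masked cells only.
rmse_gt <- sqrt(mean((as.matrix(fit_bagged$ximp)[mask] - as.matrix(truth)[mask])^2))
rmse_gt
```

Comparing `OOBerror` with `rmse_gt` across several `mtry` values is one way to see whether the OOB ranking agrees with the ground-truth ranking on your own data.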

The most practically useful thing I can leave you with is this decision framework.

For inferential research (hypothesis testing, effect estimation, regression), use mice with polr for ordinal variables. MICE is the only method here that delivers near-zero bias, preserved distributional structure, and valid uncertainty propagation through Rubin’s rules. The higher RMSE is not a weakness; it reflects honest sampling from uncertainty.

For predictive tasks, Bagged RF’s lower RMSE is relevant. But always use mtry equals p — never OOB-optimised. In Wave 3 of our simulation, OOB recommended the naive configuration by a margin of 0.0003. Ground truth RMSE was 55.7% worse. There was no warning.

For exploratory work, RF is excellent for scoping patterns — but always reimpute with mice before drawing inferential conclusions. Never use the same RF-imputed data to both explore and confirm a finding.

Two concrete recommendations to take away.

First: for longitudinal cohort studies where your goal is inference — which is most of them — use mice with polr for ordinal variables. MICE delivers near-zero bias, distributional fidelity, and valid standard errors through Rubin’s rules. The higher RMSE is not a weakness; it reflects honest sampling from uncertainty rather than collapsing it away.

Second: if you use missForest, always use mtry equals p, the Bagged configuration. Do not let OOB guide your hyperparameter selection. The Wave 3 failure is not a theoretical risk; it is documented in our results, and no warning was given. Bagged RF is the most stable configuration across all five waves.

Acknowledgements

George Smith-Kolff

University of Canterbury, Christchurch, New Zealand

Sinéad Moylett and Blair Robertson

University of Limerick, Ireland

University of Canterbury, Christchurch, New Zealand

Data: The Irish Longitudinal Study on Ageing (TILDA), Trinity College Dublin