TILDA: The Irish Longitudinal Study on Ageing
Study: Nationally representative, adults aged 50+, Ireland
Waves: 5 waves, 2009 – 2018, approx. every 2 years
Wave   Participants   Retained (%)
1      8504           100.0
2      7207           84.7
3      6400           75.3
4      5715           67.2
5      4980           58.6
Data: 265 health covariates spanning physical, mental, behavioural, and cognitive domains, with a wide range of data types (e.g., numeric, ordinal, binary)
Focus: 7 social participation items (SCQSocAct), ordinal Likert scale 1-8
MCAR rejected: naniar::mcar_test() gave \(p < 0.001\) in all 5 waves; missingness treated as MAR.
Non-monotonic missingness (79.9% of respondents) → mice or missForest required
TILDA — the Irish Longitudinal Study on Ageing — follows adults aged 50 and over across Ireland through five waves from 2009 to 2018. We start with 8,504 participants in wave 1. By wave 5, only 4,980 remain — a 41% drop. Non-random attrition at this scale is characteristic of longitudinal studies of older adults.
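The retention figures can be checked with a few lines of arithmetic. This is only a sanity check in Python on the wave counts shown on the slide; the variable names are ours.

```python
# Participants per wave, as reported on the slide.
wave_n = {1: 8504, 2: 7207, 3: 6400, 4: 5715, 5: 4980}

baseline = wave_n[1]

# Retention relative to wave 1, in percent, rounded to one decimal.
retention = {wave: round(100 * n / baseline, 1) for wave, n in wave_n.items()}

# Cumulative attrition by wave 5, in percent.
attrition_wave5 = round(100 - 100 * wave_n[5] / baseline, 1)

print(retention)        # {1: 100.0, 2: 84.7, 3: 75.3, 4: 67.2, 5: 58.6}
print(attrition_wave5)  # 41.4
```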
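We ran Little’s MCAR test via naniar, and it rejected MCAR at p less than 0.001 in all five waves. That rules out data missing completely at random and is consistent with MAR, although the test cannot by itself exclude MNAR. In addition, 79.9% of respondents showed non-monotonic missingness, which rules out simpler sequential methods and means we need mice or missForest.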
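To make "non-monotonic" concrete: a pattern is monotone if, in a fixed ordering of measurements, nothing is observed again after the first missing value (pure dropout). The sketch below is a minimal Python illustration under that assumption. The talk does not state whether the ordering is over waves or items, and the example patterns are invented, not TILDA data.

```python
from typing import Iterable, Sequence


def is_monotone(observed: Sequence[bool]) -> bool:
    """Return True if no observed value follows a missing one."""
    seen_missing = False
    for is_observed in observed:
        if not is_observed:
            seen_missing = True
        elif seen_missing:
            # An observed value after a gap breaks monotonicity.
            return False
    return True


def share_non_monotone(patterns: Iterable[Sequence[bool]]) -> float:
    """Fraction of response patterns that are not monotone."""
    flags = [not is_monotone(p) for p in patterns]
    return sum(flags) / len(flags)


# Hypothetical patterns over five measurements (True = observed).
patterns = [
    (True, True, True, True, True),      # complete: monotone
    (True, True, False, False, False),   # dropout: monotone
    (True, False, True, True, False),    # gap, then return: non-monotone
    (False, True, True, True, True),     # late entry: non-monotone
]

print(share_non_monotone(patterns))  # 0.5
```

Methods that assume monotone dropout can only fill the first two kinds of pattern cleanly, which is why the last two push us toward chained-equation or forest-based imputation.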
Our focus is on seven social participation items — how often respondents attend concerts, visit the pub, go to classes — measured on a Likert scale from 1 to 8.
Imputation Strategies Tested
1. Mode Imputation (Baseline)
Replaces missing values with the most frequent category
Deterministic - no uncertainty
Chosen over mean: variables are ordinal (Likert 1–8)
2. MICE (Multiple Imputation by Chained Equations)
Van Buuren & Groothuis-Oudshoorn (2011)
Creates \(M = 5\) plausible complete datasets
Proportional Odds Logistic Regression (polr) for ordinal variables
75 iterations (convergence verified via trace plots)
Combines estimates via Rubin’s rules
3. Random Forest (3 configurations)
missForest package
Naive: \(m_{\text{try}} = \lfloor\sqrt{p}\rfloor = 2\)
Bagged: \(m_{\text{try}} = p = 7\)
Optimised: \(m_{\text{try}}\) selected by OOB error
Evaluation metrics:
Bias: directional error; should be near zero
RMSE: magnitude of error; lower is better
KL-Divergence: distributional similarity; zero = perfect match
Three strategies, five configurations in total.
Mode is our naive baseline. MICE is the statistical standard for MAR data — we use proportional odds logistic regression, the appropriate model for ordered categories, across M equals 5 imputed datasets combined via Rubin’s rules.
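For readers who have not seen Rubin’s rules written out, here is a minimal Python sketch of the pooling step for a single scalar estimate. The five estimates and variances are invented for illustration; in R, mice::pool() performs this step and also handles degrees of freedom, which the sketch omits.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M point estimates and their squared standard errors.

    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divides by m - 1)
    total = u_bar + (1 + 1 / m) * b
    return q_bar, sqrt(total)


# Hypothetical results from M = 5 imputed datasets (not from the study).
estimates = [0.50, 0.54, 0.47, 0.52, 0.49]
variances = [0.01, 0.01, 0.01, 0.01, 0.01]

pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The between-imputation term is what single-imputation methods such as mode or missForest drop, so their standard errors understate the uncertainty due to missing data.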
Random Forest via missForest in three configurations: naive with the default square root of p predictors, bagged using all p predictors, and optimised using OOB error to select mtry wave by wave.
Performance is assessed via a ground truth simulation: we extract 1,000 complete cases per wave, introduce artificial missingness mirroring real TILDA patterns, impute, and compare directly against the known true values.
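The three metrics can be sketched as follows. This is an illustrative Python version, not the code used in the study: the sign convention for bias (imputed minus true), the direction of the KL-Divergence (true relative to imputed), and the small smoothing constant are all assumptions on our part, and the toy vectors are invented.

```python
from collections import Counter
from math import log, sqrt

LEVELS = range(1, 9)  # Likert categories 1..8


def bias(imputed, truth):
    """Mean signed error (assumed convention: imputed minus true)."""
    return sum(i - t for i, t in zip(imputed, truth)) / len(truth)


def rmse(imputed, truth):
    """Root mean squared error between imputed and true values."""
    return sqrt(sum((i - t) ** 2 for i, t in zip(imputed, truth)) / len(truth))


def category_probs(values, eps=1e-6):
    """Relative frequency of each category, smoothed to avoid log(0)."""
    counts = Counter(values)
    total = len(values) + eps * len(LEVELS)
    return [(counts[k] + eps) / total for k in LEVELS]


def kl_divergence(truth, imputed):
    """KL divergence of the true category distribution from the imputed one."""
    p = category_probs(truth)
    q = category_probs(imputed)
    return sum(pk * log(pk / qk) for pk, qk in zip(p, q))


# Toy masked values and a mode-style imputation (hypothetical data).
truth = [1, 1, 2, 8, 1, 3]
mode_imputed = [1] * len(truth)

print(rmse(mode_imputed, truth))           # 3.0
print(bias(mode_imputed, truth))           # negative under this convention
print(kl_divergence(truth, mode_imputed))  # > 0: the imputed shape is wrong
```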
The Surprise: Mode Beats Machine Learning
Method                  RMSE
Mode (Naive baseline)   0.994 (0.360)
Bagged RF               1.009 (0.364)
Optimised RF            1.106 (0.418)
MICE (Statistical)      1.121 (0.301)
Naive RF                1.213 (0.418)
The naive baseline outperforms every machine learning method on RMSE.
Why?
This is the RMSE column from the global performance table — mean across all five simulation waves and all seven variables.
Mode imputation, the simplest possible method, achieves the lowest RMSE. Bagged RF follows closely. MICE comes in fourth.
The naive baseline is beating every sophisticated method. Before explaining why, let me show you what the other metrics reveal.
RMSE is the Wrong Metric for Inference
Method                  RMSE            Bias             KL-Divergence
Mode (Naive baseline)   0.994 (0.360)   0.079 (0.255)    0.0293 (0.0487)
MICE (Statistical)      1.121 (0.301)   -0.036 (0.042)   0.0009 (0.0006)
Naive RF                1.213 (0.418)   0.116 (0.336)    0.0844 (0.1379)
Bagged RF               1.009 (0.364)   0.080 (0.213)    0.0430 (0.0884)
Optimised RF            1.106 (0.418)   0.104 (0.277)    0.0581 (0.1097)
Mode wins RMSE by always predicting the dominant category - exploiting unimodal data, not doing better statistics.
MICE is the only near-zero bias method. KL-Divergence \(\approx 30\times\) better than Mode.
For inference, the distributional shape is what matters.
This scatter plot maps every method on two dimensions simultaneously. X-axis is mean RMSE — further left is better at point prediction. Y-axis is mean bias — the dashed line at zero is perfect unbiasedness.
Mode and all three RF configurations sit above the line: positive bias, systematically pushing imputations toward dominant categories. Mode and Bagged RF are furthest to the left, with the lowest RMSE.
MICE sits to the right of them, with higher RMSE, but it is the only method near the zero line: a mean bias of minus 0.036.
Why does Mode win RMSE? The SCQSocAct variables are heavily concentrated on a single category: most responses are Never or Rarely. Mode always imputes that modal category, so it is correct most of the time and rarely makes a large error. MICE honestly samples from the full distribution, sometimes imputing a 5 when the truth is 1, and RMSE penalises those large errors heavily.
But look at KL-Divergence — distributional fidelity. Mode is approximately 30 times worse than MICE. It creates an artificial spike at the dominant category, making the population appear far more homogeneous than it is. For regression, hypothesis tests, effect estimation — the joint distribution of variables is what determines validity. MICE preserves it.
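A stylised calculation shows why both results can hold at once. The Python sketch below uses an invented skewed distribution over the eight categories and an assumed 30% missing share; neither is estimated from TILDA. It compares always imputing the mode with imputing an independent draw from the true distribution, which is a deliberately simplified stand-in for what a well-specified multiple-imputation model does (real MICE conditions on covariates).

```python
from math import log, sqrt

LEVELS = range(1, 9)

# Hypothetical skewed distribution over categories 1..8 (not TILDA estimates).
p = [0.60, 0.20, 0.10, 0.05, 0.03, 0.01, 0.005, 0.005]
missing_frac = 0.30  # assumed share of values that are imputed

mode = max(LEVELS, key=lambda k: p[k - 1])
mean_x = sum(k * pk for k, pk in zip(LEVELS, p))
var_x = sum(pk * (k - mean_x) ** 2 for k, pk in zip(LEVELS, p))

# Expected squared error on the imputed cells when the truth follows p.
mse_mode = sum(pk * (k - mode) ** 2 for k, pk in zip(LEVELS, p))  # always the mode
mse_draw = 2 * var_x  # independent draw from p: variance of truth plus variance of draw

rmse_mode, rmse_draw = sqrt(mse_mode), sqrt(mse_draw)

# Category distribution of the completed data (observed plus imputed cells).
q_mode = [
    (1 - missing_frac) * pk + (missing_frac if k == mode else 0.0)
    for k, pk in zip(LEVELS, p)
]
kl_mode = sum(pk * log(pk / qk) for pk, qk in zip(p, q_mode))
kl_draw = 0.0  # draws from p leave the completed distribution equal to p in expectation

print(f"RMSE  mode={rmse_mode:.3f}  draw={rmse_draw:.3f}")
print(f"KL    mode={kl_mode:.4f}  draw={kl_draw:.4f}")
```

Under these assumptions the mode has the lower RMSE but a strictly positive KL-Divergence, while the sampling approach pays in RMSE and keeps the distribution intact. That is the same trade-off the table shows.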
The OOB Failure: What Ground Truth Reveals
Wave 3: a 55.7% RMSE gap
Configuration                          OOB error   Ground-truth RMSE
Naive (\(m_{\text{try}} = 2\))         0.7221      1.392
Bagged (\(m_{\text{try}} = 7\))        0.7224      0.894
OOB correctly identified the best configuration in 4 out of 5 waves. But when it failed, it failed catastrophically.
The optimised strategy inherited the wave 3 failure: a 49.6% RMSE fluctuation across waves.
Before we move to the recommendations, one result R users should know about specifically: the Out-of-Bag error used internally by missForest to select mtry.
This figure shows ground truth RMSE across all five waves for the three RF configurations. Focus on Wave 3. OOB said naive — mtry equals 2 — was optimal, by a margin of just 0.0003. Ground truth RMSE was 1.392. The bagged configuration — which OOB said was marginally worse — produced 0.894. A 55.7% gap. No warning was given.
OOB worked well in 4 of 5 waves. But the failure was invisible and catastrophic. In a real imputation scenario, you have no ground truth to catch it. Default to Bagged RF — mtry equals p — which was the most stable configuration across all waves.
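The Wave 3 numbers can be reproduced directly from the slide. The short Python check below uses only the four reported values; the dictionary keys and variable names are ours.

```python
# Wave 3 values as reported: (OOB error, ground-truth RMSE).
wave3 = {
    "naive (mtry=2)": (0.7221, 1.392),
    "bagged (mtry=7)": (0.7224, 0.894),
}

# Configuration each criterion would select.
oob_pick = min(wave3, key=lambda name: wave3[name][0])
truth_pick = min(wave3, key=lambda name: wave3[name][1])

# How narrowly OOB preferred its pick, and what that cost in ground-truth RMSE.
oob_margin = wave3[truth_pick][0] - wave3[oob_pick][0]
rmse_gap = (wave3[oob_pick][1] - wave3[truth_pick][1]) / wave3[truth_pick][1]

print(oob_pick, truth_pick)                  # naive (mtry=2) bagged (mtry=7)
print(f"{oob_margin:.4f}  {rmse_gap:.1%}")   # 0.0003  55.7%
```

An OOB difference in the fourth decimal place decided the selection, which is why a fixed mtry = p is the safer default when no ground truth is available.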
Matching Methods to Research Goals
The right imputation method depends on what you are trying to do.
Inferential (hypothesis tests, effect estimation): mice with polr. Unbiased parameters, valid SEs, preserved distribution.
Predictive (individual outcome forecasts): Bagged RF (\(m_{\text{try}}=p\)). Lower RMSE, flexible non-linear relationships. Do not use OOB-optimised.
Exploratory (trends, variable importance): RF to explore, then mice to validate. Never use RF-imputed data to both explore and confirm a finding.
What does this mean for R users?
For inference (regression, hypothesis tests, effect estimation)
Use mice::mice(method = "polr") for ordinal variables.
Near-zero bias, preserved distribution, valid SEs via Rubin’s rules.
For prediction (individual outcome forecasts)
Use missForest with mtry = p (Bagged), not OOB-optimised.
In Wave 3, OOB recommended mtry = 2 — ground truth RMSE was 55.7% worse than mtry = 7. No warning was given.
The most practically useful thing I can leave you with is this decision framework.
For inferential research — hypothesis testing, effect estimation, regression — use mice with polr for ordinal variables. MICE is the only method here that delivers unbiased parameters, preserved distributional structure, and valid uncertainty propagation through Rubin’s rules. The higher RMSE is not a weakness; it reflects honest sampling from uncertainty.
For predictive tasks, Bagged RF’s lower RMSE is relevant. But always use mtry equals p — never OOB-optimised. In Wave 3 of our simulation, OOB recommended the naive configuration by a margin of 0.0003. Ground truth RMSE was 55.7% worse. There was no warning.
For exploratory work, RF is excellent for scoping patterns — but always reimpute with mice before drawing inferential conclusions. Never use the same RF-imputed data to both explore and confirm a finding.
Two concrete recommendations to take away.
First: for longitudinal cohort studies where your goal is inference — which is most of them — use mice with polr for ordinal variables. MICE delivers near-zero bias, distributional fidelity, and valid standard errors through Rubin’s rules. The higher RMSE is not a weakness; it reflects honest sampling from uncertainty rather than collapsing it away.
Second: if you use missForest, always use mtry equals p — the Bagged configuration. Do not let OOB guide your hyperparameter selection. The Wave 3 failure is not a theoretical risk, it is documented empirical evidence. No warning was given. Bagged RF is the most stable configuration across all five waves.