Model selection after multiple imputation: a deviance correction for AIC, BIC, and likelihood-ratio tests
Abstract
Model selection on multiply imputed data is biased toward the candidates with more missing information, which in the nested families studied are the more complex models, because their larger relative increase in variance makes the fit look better than it is. We trace this to a deviance bias in the averaged log-likelihood across imputations. Under congenial proper multiple imputation with the complete-data maximum likelihood estimate as target, the averaged log-likelihood overstates its complete-data counterpart by one half the trace of the relative-increase-in-variance matrix, plus a design-imbalance term that vanishes when data are missing completely at random. The bias is specific to each candidate model, which is why uncorrected information criteria favor the models with more missing information. Adding one trace term per candidate removes the deviance bias in expectation and substantially restores complete-data model selection. The same analysis gives the bias of a likelihood-ratio comparison at the null, the missing information in the tested directions alone, and identifies where a calibrated test already supplies the correction so that it must not be applied twice. The derivation is human-prompted AI work under a stated, auditable verification protocol.
Full Text
MODEL SELECTION AFTER MULTIPLE IMPUTATION 1
Model selection after multiple imputation: a deviance correction for AIC, BIC, and
likelihood-ratio tests
AI Authors
Claude Opus 4.7–4.8, GPT-5.5 Pro, Gemini 3.1 Pro
Prompters
Marcus Waldman
Center for Innovative Design and Analysis, Department of Biostatistics and Informatics, Colorado
School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO
Author Note
Marcus Waldman https://orcid.org/0000-0002-3288-4803
The journal recognizes two classes of author. The AI authors are the model lineages that
produced the derivations, drafts, and computations and carried out the cross-model adversarial
and Delphi review: Claude Opus 4.7–4.8 (Anthropic), GPT-5.5 Pro (OpenAI), and Gemini 3.1
Pro (Google). The prompter, Marcus Waldman, conceived and directed the work and is the
ORCID-verified human prompter of record. Contributor roles (CRediT): Marcus Waldman –
conceptualization, supervision, validation, project administration; Claude Opus 4.7–4.8 –
methodology, software, formal analysis, writing of the original draft; GPT-5.5 Pro and Gemini 3.1
Pro – validation through cross-model review and Delphi consensus. The companion sourced
derivation, the verification directory with its pre-registered studies, the cross-model grading
records, and the full session transcripts are part of the public record, collected at the project page,
https://marcus-waldman.github.io/mi-spectral/, and citation discipline is enforced mechanically.
Correspondence concerning this article should be addressed to Marcus Waldman, Email:
marcus.waldman@cuanschutz.edu
Keywords: multiple imputation, model selection, likelihood-ratio test, information
criterion, missing data
1 Introduction
Multiple imputation treats missing data through a division of labor, and that division has
served applied research for nearly four decades. An imputer fills in the missing values several
times from a model for the complete data. An analyst then fits the model of substantive interest to
each completed data set, and simple rules combine the results (Rubin, 1987). The field’s own
accounts of its state of the art describe a mature methodology (Enders, 2025; Schafer & Graham,
2002). In those accounts the role of the missing-data mechanism is understood, point estimates
recover their complete-data targets under stated conditions, and Wald-type tests for single and
multiple parameters are well calibrated. On that testimony, the major inferential questions read as
settled.
The exception is inference carried by the likelihood itself. For likelihood-ratio tests, a
combining rule has existed since Meng and Rubin (1992). Its modern repairs give tests with
accurate size (Chan, 2022; Chan & Meng, 2022). For likelihood-based model selection, the
picture is different. Here the literature’s assessment of itself is that the question is open. Only one
information criterion has been proposed specifically for averaging over multiply imputed data. Its
authors call for further theoretical and practical study of the method and place that work beyond
their own scope (Consentino & Claeskens, 2010). The first comprehensive study of model
selection after multiple imputation describes the available literature as unexpectedly thin
(Schomaker & Heumann, 2014). That study also cautions that selecting models by averaged
criteria has no support in the multiple-imputation literature. A dedicated study of variable
selection with multiply imputed data reports that no guidelines yet exist (Wood et al., 2008). The
applied guides are silent in the same direction. The standard book-length treatment does not treat
information criteria for multiply imputed data at all (van Buuren, 2018). Neither does the current
state-of-the-art review (Enders, 2025).
What, then, should “settled” mean? We propose a standard and organize this paper around
it. We call it the complete-data replication principle. A procedure for multiply imputed data
replicates complete-data inference under one condition. On average over repeated samples, it
must reach the same conclusion that would have been reached had no data been missing. The
principle can be demanded at three levels. At the first level, estimates recover their complete-data
targets. This is the classical level, and Rubin’s rules settle it. At the second level, the decision
criterion itself recovers its complete-data counterpart in expectation. This criterion is a deviance
or an information criterion. At the third level, decision rates match. The same model is selected,
and the same hypotheses are rejected, as often as they would have been with complete data. Stated
this way, the settled results are settled because they pass at the first level. Posing the second and
third levels reframes these open questions, which had been posed only at the first level.
Benchmarking selection methods against the full data is not new in itself. Simulation comparisons
of that kind appear in Wood et al. (2008) and Consentino and Claeskens (2010). The principle as
an explicit yardstick is, to our knowledge, stated here for the first time. We add a characterization
of when it can and cannot be met.
The obstacle is a bias in the averaged log-likelihood. This paper’s central result describes
that bias exactly. Suppose a likelihood model with an estimated variance or covariance is fit to
congenially imputed data. We show that the averaged log-likelihood then overstates its
complete-data counterpart by
\[ \tfrac{1}{2}\,\mathrm{tr}(\mathrm{RIV}) + (A) + (C). \]
The first term is one half the trace of the relative-increase-in-variance matrix. This matrix is a
standard object in the MI literature. Its trace adds up the missing information about the model’s
parameters. The second term is smaller. It reflects imbalance between the observed and the
missing units on the conditioning variables, and vanishes when the data are missing completely at
random. The practical consequence is that, for such a model, every deviance and information
criterion built on the averaged log-likelihood across imputations is, in expectation, too optimistic.
Worse, the optimism is not uniform across models. Each candidate’s criterion is inflated in
proportion to that candidate’s own missing information. A model-comparison table built on
imputed data therefore tends, all else equal, to favor the candidates with the most missing
information. In our pre-registered simulations, 100% of uncorrected MI-AIC’s misclassifications
fell on the candidate with the largest RIV. The fix is direct. Add each candidate’s own trace back
to its criterion. We show that the corrected criterion recovers its complete-data expectation at any
signal strength. This is the principle’s second level, met in full.
The principle’s third level asks for more. It asks for the same decisions at the same rates as
the complete data. We show that the answer splits into two cases. In the first case, the competing
models fit equally well, in the sense that the smaller model is true. Correction then restores
complete-data behavior. Calibrated tests reject at the complete-data rate (Chan, 2022).
Selection that matches the criterion’s full null distribution chooses models indistinguishably from
complete-data AIC in our pre-registered design. In the second case, one model genuinely fits
better. The missing data have then destroyed part of the evidence in its favor, and the corrected
criteria and calibrated tests studied here cannot recover it. The relevant statistics shrink by factors
we predict, and the remaining shortfall is information loss rather than a fixable calibration error.
The practical reading is this. Corrected criteria are honest, not clairvoyant. Less information
means less power, and the third level is met exactly to the extent the data permit.
These results stand upstream of the test-calibration literature and beside the
model-selection one. The calibration line runs from Meng and Rubin (1992) through Chan and
Meng (2022) to Chan (2022), and it calibrates the reference distribution of an MI test statistic
while taking the statistic’s numerator as given. We derive the bias of that numerator, so the two
are complementary. Calibration makes the reference distribution right, while the present
correction makes the statistic referred to it centered. As a penalty, the correction is precedented,
because it reproduces $\mathrm{AIC}_{x;y}$ of Shimodaira and Maeda (2018). That criterion halved the
missing-data surcharge of the earlier complete-data-discrepancy criteria (Cavanaugh & Shumway,
1998). All of those criteria were derived for deterministic EM estimation under a fixed
missingness pattern. Two other routes stand nearby. The missing-covariate criteria of Claeskens
and Consentino (2008) target a different, Takeuchi-type discrepancy. The reweighting route of
Hens et al. (2006) reaches the complete-data target through inverse-probability weights, but it
explicitly leaves an imputation-based criterion open. Five things here are new. First, the
decomposition of the bias into an imputation-bias part and an estimation-mismatch part, with its
estimated-scale scope condition. Second, the design-imbalance term (𝐴) + (𝐶) under MAR,
which lies beyond Shimodaira’s fixed-pattern setting. His concluding section names the
combination of missingness with other sampling mechanisms, such as covariate shift (Shimodaira,
2000), as future work. Third, the extension from the deterministic EM Q-function to proper
multiple imputation. Fourth, the exact differential bias of the likelihood-ratio numerator at the null
in the complete-data metric. Fifth, the replication principle itself, with the null/noncentral
characterization of its third level. Congeniality is assumed throughout (Meng, 1994). The bias
derived here is what remains after the imputer and the analyst agree.
The contributions follow, stated for use rather than for novelty and ordered as an applied
reader is likely to need them.
1. A correction for model selection after imputation. Choosing among models by AIC or
BIC on multiply imputed data is biased, and the bias is specific to each candidate and grows
with that candidate’s missing information. The uncorrected criteria therefore favor the
candidates with the most missing information. Adding one term per candidate, the trace of
its relative-increase-in-variance matrix, removes the trace component of the bias and
substantially restores the ranking that complete data would have given.
2. The deviance bias behind the correction. For a model that estimates a scale or covariance,
we show that the averaged log-likelihood across imputations overstates its complete-data
counterpart by half the trace of the relative-increase-in-variance matrix, plus a
design-imbalance term that appears only under data missing at random and disappears
when data are missing completely at random. The proof for proper imputation and the
design-imbalance term are new, while the trace itself matches a penalty already known from
a related prediction problem.
3. The bias of a likelihood-ratio comparison. For two nested models compared at the null,
the relevant bias is the missing information carried by the tested directions alone, measured
in the complete-data metric. The obvious alternative, the difference of the two models’
separate corrections, always overstates it.
4. A sharper way to run that comparison. Fitting the competing models to the same
imputed data sets rather than to separate ones cancels most of the shared noise and tightens
the comparison.
5. An auditable AI-human workflow. We treat the way the derivations were produced as a
contribution in its own right, with transparent provenance and checks the reader can run.
These are citations checked against their sources, independent symbolic verification,
preregistered simulations whose failures are reported, adversarial re-derivation that caught
a sign error in this very work, and full reproducibility. Its checkable records are verified
mechanically, and its one descriptive part, a coding of the project’s own session record, was
produced by the same kind of system it describes and is reported as such.
One feature of this paper bears on how it should be read, and the title page declares it. The
derivations are human-prompted AI derivations, with an ORCID-verified prompter of record.
Section 2 describes the collaboration that produced them and shows that it was productive.
Section 4 states the verification protocol under which every claim was produced and checked, and
what each safeguard can and cannot catch. The results are then offered to be judged through that
protocol, the way an empirical paper’s results are judged through its methods. The rest of the
paper proceeds as follows. Section 3 fixes notation and restates the standard results at the
precision the argument needs. Section 5 develops the theorem and both applications. Section 6
reports the pre-registered studies, including the predictions that failed. Section 7 states what is
firm, what is measured, and what is conjectured.
2 AI–human collaboration
This paper is also a demonstration of a way of working. The derivations were produced
through a collaboration between a human author and an AI system, a kind of collaboration that is
new in the age of AI and whose products are not yet routinely trusted. One stated goal of this
paper is therefore to show a workflow that is at once productive and accurate. Productive means
that the collaboration reached results a single author would have reached slowly or not at all.
Accurate means that those results hold up to the scrutiny a skeptical referee applies to
human-derived mathematics. The division of labor is simple to state. The human author set the
direction, fixed the standards, supplied field knowledge, and accepted or rejected each result; the
AI system produced the derivations, the drafts, and the computations. By design, the provenance
is transparent, because the human prompter of record is ORCID-verified and the full record of the
work enters the public record. This section describes the collaboration and shows that it was
active and human-directed. Section 4 states the verification protocol that makes its conclusions
accurate, and what each safeguard cannot catch.
Roles and decision records. Direction, scope, and the acceptance or rejection of every
result were decided by the human author of record. Derivations, drafts, and computations were
produced by an AI assistant. This division of labor was measured rather than recalled. The
complete session record, 34 transcripts containing 599 substantive human turns, was qualitatively
coded, and the committed analysis backs the counts given here. Five patterns characterize the
collaboration. First, the human author set standards more often than steps. Rules of process such
as preregistration before code and independent cross-model review appear in 136 of the 599 turns,
the densest block of interventions after plain task assignment. Second, decisions were proposed by
the assistant and ratified by the human author. The record shows 71 explicit ratifications, and each
strategic decision was logged with its date, the options rejected, and the rationale. Third, challenge
ran in both directions. The human author disputed derivations, contested framing, caught
omissions, and rejected prose 172 times. The assistant flagged risks and surfaced decisions rather
than deciding silently, with 175 such moves observed. One example from the human direction is
on record. The author caught that the cited EM-based results concern improper imputation, and
that catch produced the Background’s paragraph on proper imputation, while the worked example
from the other direction appears under cross-model review below. Fourth, field knowledge entered
from the human side, as literature leads, methodological alternatives, and venue judgment, 131
times. Fifth, scope was actively cut and the work was partitioned across sessions with written
handoffs, 183 such moves. Recorded decisions were not revisited without a dated amendment.
The transcripts, the decision log, and the coded analysis are all part of the public record, so this
description is auditable, though the coding itself was produced by the same kind of system it
describes. The discipline cannot certify correctness. Recording who decided what catches
nothing about whether a derivation is right. The verification protocol of Section 4 exists for that.
3 Background and notation
This section fixes the paper’s notation and assembles the results the derivations use. The
reader needs five things. First are the two likelihoods of incomplete data and their two
maximizers, because the distance between those two functions is the bias this paper derives.
Second are the standing assumptions of ignorable missingness and congenial, proper imputation.
Third is the paper’s central matrix, the RIV, which comes from the missing-information principle.
Fourth is the machinery behind the averaged log-likelihood, running from the EM Q-function
through its Monte Carlo implementation to Rubin’s combining rules. Fifth are the complete-data
baselines, AIC and Wilks, along with the three prior results the applications extend. Each item is
restated in its source’s own terms, and each restatement names the later result that uses it.
Symbols introduced here are used unchanged throughout.
Notation: two likelihoods, two estimators. This paper turns on two likelihoods and two
estimators, so we fix that notation first. The basic objects are a data matrix and a missingness
indicator. Write $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$ for the complete data, split into its observed and missing parts.
Write $R$ for the indicator of which entries are observed. A parametric model $f(Y \mid \theta)$ gives two
likelihoods. The complete-data likelihood is $f(Y \mid \theta)$ itself. The observed-data likelihood
integrates the missing part out, $f(Y_{\mathrm{obs}} \mid \theta) = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\, dY_{\mathrm{mis}}$. Each likelihood has its own
maximizer, and the two play different roles. The complete-data maximum likelihood estimate
$\hat\theta_{\mathrm{com}}$ is computable only when no data are missing and is the target an analyst would have reached
with full data. The observed-data maximum likelihood estimate $\hat\theta_{\mathrm{obs}}$ is the best that can be
computed from the data at hand. The bias at the center of this paper compares the likelihood
values built around these two estimators. The comparison measures what happens when a value
built around the observed-data estimate $\hat\theta_{\mathrm{obs}}$ is read as if it were built around the complete-data
target $\hat\theta_{\mathrm{com}}$. Section 5 instantiates this notation for the multivariate normal model. Everything in
this section is general.
Ignorability. This paper assumes throughout that the missingness mechanism is
ignorable, in the sense and under the conditions set out by Rubin (1976). The data must be
missing at random, which means that the conditional probability of the observed missingness
pattern is the same for all values of the missing data, and the mechanism’s parameter must also be
distinct from the data parameter. Together these two requirements are the weakest general
conditions for ignoring the mechanism in likelihood and Bayesian inference.
Sampling-distribution inference is stricter, because it requires in addition that the observed data
are observed at random. That extra condition, together with MAR, amounts to MCAR; otherwise,
sampling-distribution inference is generally conditional on the observed pattern. The gap between
the two senses of “ignorable” is not a technicality for this paper, because two of the paper’s
objects live inside that gap. The first is the design-imbalance term (𝐴) + (𝐶) of the main theorem.
The second is the information-matrix distinction restated at the end of this section.
The missing-information principle and the RIV. The paper’s central matrix is the
relative-increase-in-variance matrix, and it comes from the missing-information principle.
Orchard and Woodbury (1972) decompose the information in the complete data into the
observed-data information plus what they call the lost information. They state this decomposition
together with a score identity, in which the observed-data score is the conditional expectation of
the complete-data score. Meng and Rubin (1991) state the same principle in the form used
throughout this paper:
\[ I_{\mathrm{obs}} = I_{\mathrm{com}} - I_{\mathrm{mis|obs}}. \tag{1} \]
In words, observed information equals complete information minus missing information. The central
matrix is built from these ingredients:
\[ \mathrm{RIV} = I_{\mathrm{obs}}^{-1}\, I_{\mathrm{mis|obs}}. \tag{2} \]
Its trace adds up the odds of missing information about each parameter, and the RIV is the matrix
form of Rubin’s scalar relative increase in variance, restated below. One warning is needed here,
because the EM literature works with a different normalization of the same ingredients, namely
the EM rate matrix $DM = I_{\mathrm{mis|obs}}\, I_{\mathrm{com}}^{-1}$ of Dempster et al. (1977) and Meng and Rubin (1991).
The RIV divides by the observed-data information, while the rate matrix divides by the
complete-data information, and the two matrices therefore have different eigenvalues. Conflating
them corrupts every trace formula in this paper. The notation keeps them apart.
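The relation between the two normalizations can be written out. The following is a short derivation from Equations 1 and 2, offered as a sketch rather than as a restatement of any cited source. The symbols $F$, $\mathrm{Id}$, $\lambda_j$, and $r_j$ are introduced here for illustration, and the derivation assumes that $I_{\mathrm{com}}$ and $I_{\mathrm{obs}}$ are symmetric positive definite.

```latex
\begin{aligned}
F &:= I_{\mathrm{com}}^{-1} I_{\mathrm{mis|obs}},
  \qquad
  DM = I_{\mathrm{mis|obs}} I_{\mathrm{com}}^{-1} = I_{\mathrm{com}}\, F\, I_{\mathrm{com}}^{-1}
  \quad\text{(similar to $F$, so both have eigenvalues } \lambda_j \in [0,1)\text{)},\\
\mathrm{RIV} &= \bigl(I_{\mathrm{com}} - I_{\mathrm{mis|obs}}\bigr)^{-1} I_{\mathrm{mis|obs}}
  = (\mathrm{Id} - F)^{-1} F,\\
r_j &= \frac{\lambda_j}{1 - \lambda_j},
  \qquad
  \mathrm{tr}(\mathrm{RIV}) = \sum_j \frac{\lambda_j}{1 - \lambda_j}
  \;\ge\; \sum_j \lambda_j = \mathrm{tr}(DM).
\end{aligned}
```

Under these assumptions each eigenvalue $\lambda_j$ of the rate matrix is a fraction of missing information and the matching eigenvalue $r_j$ of the RIV is the corresponding odds. The two traces agree only when nothing is missing, and they diverge quickly as any $\lambda_j$ approaches one.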
EM and the Q-function. The object whose bias this paper derives is a Q-function, so its
definition and one geometric fact about it are needed. Dempster et al. (1977) define
\[ Q(\phi' \mid \phi) = E\bigl\{ \log f(\mathbf{x} \mid \phi') \mid \mathbf{y}, \phi \bigr\}, \tag{3} \]
the expected complete-data log-likelihood given the observed data. The E-step computes
$Q(\cdot \mid \phi^{(p)})$ and the M-step maximizes it. The reasoning is direct. We do not know $\log f(\mathbf{x} \mid \phi)$, so
we maximize instead its current expectation given the data $\mathbf{y}$ and the current fit $\phi^{(p)}$. Two facts
from their analysis carry the weight later. First, the Q-function decomposes as
$Q(\phi' \mid \phi) = L(\phi') + H(\phi' \mid \phi)$. Here $L$ is the observed-data log-likelihood and $H$ is a
conditional-entropy term. The main theorem is, at bottom, an account of what the $H$ term does
when its parameters are estimated rather than known. Second, the curvature of $Q$ at its maximizer
is the complete-data information $I_{\mathrm{com}}$, not the observed-data information. This single geometric
fact drives the likelihood-ratio result. Constrained fits of the averaged log-likelihood project in the
$I_{\mathrm{com}}$ metric.
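The decomposition follows in one line from factoring the complete-data density. The sketch below writes $g(\mathbf{y} \mid \phi)$ for the observed-data density and $k(\mathbf{x} \mid \mathbf{y}, \phi)$ for the conditional density of the complete data given the observed data. These two names are assumptions of presentation introduced here, not notation fixed elsewhere in this paper.

```latex
\begin{aligned}
f(\mathbf{x} \mid \phi') &= g(\mathbf{y} \mid \phi')\, k(\mathbf{x} \mid \mathbf{y}, \phi'),\\
Q(\phi' \mid \phi)
  &= E\bigl\{ \log g(\mathbf{y} \mid \phi') \mid \mathbf{y}, \phi \bigr\}
   + E\bigl\{ \log k(\mathbf{x} \mid \mathbf{y}, \phi') \mid \mathbf{y}, \phi \bigr\}\\
  &= L(\phi') + H(\phi' \mid \phi).
\end{aligned}
```

The first expectation is trivial because $\log g(\mathbf{y} \mid \phi')$ is a function of the observed data alone. All dependence on the imputation parameter $\phi$ therefore sits in $H$.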
Imputation is Monte Carlo integration of the E-step. The averaged log-likelihood over
imputations is itself a Q-function, computed by Monte Carlo. Wei and Tanner (1990) implement
the E-step by simulation, drawing $z^{(1)}, \ldots, z^{(m)}$ from the conditional predictive distribution of the
missing data. These draws replace the E-step integral by an average over completed data sets.
Their Remark 2 makes the connection to multiple imputation explicit, noting that Rubin (1987)
referred to the quantities $z^{(1)}, \ldots, z^{(m)}$ as multiple imputations. One feature of this construction
must be marked at once. The draws are taken at a fixed parameter value, the current iterate or in
the limit the observed-data estimate. Imputation at a fixed parameter value is what Rubin calls
improper. The $m \to \infty$ limit of the averaged log-likelihood is written $\bar Q_\infty(\theta)$ and is the central
object of this paper. The next paragraph states the form of imputation under which this paper
studies it.
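A minimal sketch of the finite-$m$ averaged log-likelihood may help fix ideas. It assumes a univariate normal model and takes the completed data sets as given. The function names and the toy numbers are hypothetical and are not taken from the paper's verification directory.

```python
import math


def normal_loglik(data, mu, sigma2):
    """Complete-data log-likelihood of an i.i.d. N(mu, sigma2) sample."""
    n = len(data)
    ss = sum((y - mu) ** 2 for y in data)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - 0.5 * ss / sigma2


def averaged_loglik(completed_sets, mu, sigma2):
    """Monte Carlo Q-function: mean complete-data log-likelihood over imputations."""
    values = [normal_loglik(d, mu, sigma2) for d in completed_sets]
    return sum(values) / len(values)


# Hypothetical toy input: three observed values and one missing value,
# with the missing value imputed m = 3 times.
observed = [0.2, -0.5, 1.1]
imputed_draws = [0.0, 0.6, -0.3]
completed_sets = [observed + [z] for z in imputed_draws]

q_bar = averaged_loglik(completed_sets, mu=0.0, sigma2=1.0)
```

The function is evaluated at a parameter value supplied by the caller. Maximizing it over `mu` and `sigma2` is what the maximize-the-average approach does, and letting the number of completed sets grow gives the limit studied in the text.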
Proper imputation. The imputations this paper studies are proper in Rubin’s sense, and
whether imputation is proper or improper changes the bias the theorem derives. Proper imputation
propagates parameter uncertainty. A parameter value is first drawn from its posterior given the
observed data, and the missing values are then drawn from the predictive distribution at that
drawn value. Chapter 4 of Rubin (1987) states the validity conditions and gives the fully normal
Bayesian scheme as the canonical example. Improper imputation skips the first draw and imputes
at a fixed parameter value. The Monte Carlo E-step of the previous paragraph is improper by
construction, and so is the entire deterministic-EM line of work in which the earlier missing-data
criteria were derived (Cavanaugh & Shumway, 1998; Shimodaira & Maeda, 2018). The
distinction is not bookkeeping, because the extra posterior draw contributes its own variation to
the averaged log-likelihood, and the main theorem prices it exactly. Consider the known-scale
case. The bias is zero under improper imputation at the observed-data estimate, while under
proper imputation it is $-\tfrac{1}{2}\,\mathrm{tr}(\mathrm{RIV})$, so the two forms differ by precisely the posterior-draw
contribution. Properness is also not absolute. Nielsen (2003) shows that Bayesian imputations are
proper when the analyst’s complete-data estimator is the maximum likelihood estimator, but the
same imputations can fail to be proper for a different estimator. This paper’s analyst always uses
the complete-data MLE, which is the case where congeniality implies properness (Nielsen, 2003).
The extension of the bias accounting from the improper, deterministic Q-function to proper
multiple imputation is one of the things this paper adds.
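The difference between the two forms of imputation can be sketched for a univariate normal sample. The sketch assumes data missing completely at random and the standard noninformative prior for the fully normal scheme, under which $\sigma^2$ given the data is $(n-1)s^2/\chi^2_{n-1}$ and $\mu$ given $\sigma^2$ is $N(\bar y, \sigma^2/n)$. The function names, sample sizes, and seed are hypothetical choices made for illustration.

```python
import random
import statistics


def impute_improper(observed, n_missing, m, rng):
    """Draw missing values at a fixed parameter value (the observed-data estimates)."""
    mu = statistics.fmean(observed)
    sigma = statistics.pstdev(observed)  # maximum likelihood standard deviation
    return [[rng.gauss(mu, sigma) for _ in range(n_missing)] for _ in range(m)]


def impute_proper(observed, n_missing, m, rng):
    """Draw parameters from their posterior first, then draw the missing values."""
    n = len(observed)
    ybar = statistics.fmean(observed)
    s2 = statistics.variance(observed)  # divisor n - 1
    draws = []
    for _ in range(m):
        # sigma^2 | data ~ (n - 1) s^2 / chi^2_{n-1}; chi^2_k is Gamma(k/2, scale 2).
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, data ~ N(ybar, sigma^2 / n).
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        draws.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n_missing)])
    return draws


rng = random.Random(0)
observed = [rng.gauss(0.0, 1.0) for _ in range(20)]
improper = impute_improper(observed, n_missing=10, m=2000, rng=rng)
proper = impute_proper(observed, n_missing=10, m=2000, rng=rng)

# Between-imputation variance of the mean of the imputed values.
var_improper = statistics.variance([statistics.fmean(d) for d in improper])
var_proper = statistics.variance([statistics.fmean(d) for d in proper])
```

In this toy setting the proper scheme should show the larger between-imputation variance, because each imputation carries its own posterior draw of the parameters. That extra variation is the posterior-draw contribution the text describes.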
Rubin’s rules are exact posterior-moment identities. This paper uses Rubin’s
combining rules through the likelihood rather than as moment formulas, and that use requires
them in their exact form. Result 3.2 of Rubin (1987) shows two things. The posterior mean of an
estimand given the observed data equals the average of the completed-data posterior means, and
the posterior variance equals the average completed-data variance plus the variance of the
completed-data means. These are the ordinary rules for conditional moments applied to
imputation, and with infinitely many imputations they are exact identities rather than asymptotic
approximations. In Rubin’s notation they give $\bar Q_\infty$, $\bar U_\infty$, $B_\infty$, and the total variance $T_\infty = \bar U_\infty + B_\infty$.
Rubin then defines the scalar relative increase in variance due to nonresponse,
\[ r_\infty = B_\infty / \bar U_\infty \tag{4} \]
in his equation 3.1.7, and the RIV matrix of Equation 2 is this quantity in matrix form. One
approximation separates the exact identities from usable inference, namely the usual treatment of
the posterior distribution as approximately normal, a Laplace-type approximation (Tierney &
Kadane, 1986). The main theorem describes what happens when these exact moment identities
are used through the likelihood itself rather than as moments.
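For reference, the scalar combining rules can be sketched in their usual finite-$m$ form. The factor $(1 + 1/m)$ is the standard finite-$m$ adjustment and disappears in the $m \to \infty$ limit used in the text. The function name and the toy inputs are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m completed-data estimates and their variances by Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    u_bar = statistics.fmean(variances)   # within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance, divisor m - 1
    t = u_bar + (1.0 + 1.0 / m) * b       # total variance
    r = (1.0 + 1.0 / m) * b / u_bar       # relative increase in variance
    return {"q_bar": q_bar, "u_bar": u_bar, "b": b, "t": t, "r": r}


# Hypothetical toy input with m = 3 imputations.
pooled = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

With these inputs the pooled estimate is 2, the between-imputation variance is 1, and the relative increase in variance is $(4/3)/0.5 = 8/3$.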
Congeniality. The second standing assumption is congeniality, which requires the imputer
and the analyst to agree. Meng (1994) formalizes that agreement. An analysis procedure is
congenial to an imputation model when one Bayesian model reproduces both. The posterior
means and variances of that Bayesian model asymptotically match the analyst’s complete-data and
incomplete-data procedures. Its posterior predictive distribution for the missing data is the
imputation model. Everything in this paper assumes congenial, proper imputation. The bias
derived in the main theorem is therefore not an artifact of imputer-analyst disagreement, but what
remains after they agree.
AIC is a bias-corrected plug-in estimate. The model-selection application corrects AIC,
so AIC’s own logic is needed first. Akaike (1974) evaluates a fitted model by its mean
log-likelihood against the true distribution. The maximized log-likelihood is the natural estimate
of this criterion. That estimate is too optimistic. It needs a correction for the downward bias from
replacing $\theta$ with its estimate $\hat\theta$, and on the $-2\log$-likelihood scale that correction is simply to add twice the parameter count $k$. The
result is
\[ \mathrm{AIC} = -2 \log(\text{maximum likelihood}) + 2k. \tag{5} \]
The model-selection application repeats this accounting with one more bias source. Under
multiple imputation the goodness-of-fit term is $-2\bar Q_\infty$ rather than the complete-data deviance.
That added bias is exactly what the main theorem quantifies.
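The accounting can be illustrated with a small sketch. If the averaged log-likelihood overstates its complete-data counterpart by $\tfrac{1}{2}\mathrm{tr}(\mathrm{RIV})$, then $-2\bar Q$ understates the deviance by $\mathrm{tr}(\mathrm{RIV})$, and adding the trace back undoes that part of the bias. The sketch ignores the design-imbalance term, and all numbers and names in it are hypothetical values chosen to show how a ranking can flip.

```python
def mi_aic(q_bar, k, trace_riv=0.0):
    """AIC built on the averaged log-likelihood, optionally with the trace correction."""
    return -2.0 * q_bar + 2.0 * k + trace_riv


# Hypothetical candidates: (averaged log-likelihood, parameter count, tr(RIV)).
candidates = {
    "small": (-101.0, 3, 0.4),
    "large": (-99.5, 4, 2.0),
}

uncorrected = {name: mi_aic(q, k) for name, (q, k, tr) in candidates.items()}
corrected = {name: mi_aic(q, k, tr) for name, (q, k, tr) in candidates.items()}

# The smallest criterion value wins.
best_uncorrected = min(uncorrected, key=uncorrected.get)
best_corrected = min(corrected, key=corrected.get)
```

In this toy table the uncorrected criterion prefers the larger model (207 against 208), while the corrected criterion prefers the smaller one (208.4 against 209). The candidate with more missing information loses the advantage it had gained from the bias alone.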
Wilks. The complete-data baseline for testing is Wilks’ theorem, and the replication
principle is defined against it. Consider a null hypothesis that fixes $h - m$ of $h$ parameters, where $m$ follows Wilks’ notation and is not the number of imputations. For this
case Wilks (1938) shows that $-2 \log \lambda$ is distributed as $\chi^2_{h-m}$ in large samples, so this distribution
is the reference against which every multiply imputed deviance in this paper is ultimately
compared. The complete-data replication principle then asks when that comparison behaves as it
would have with full data.
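As a complete-data point of reference, the comparison for a single tested parameter can be sketched with the standard library alone. The sketch uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$ and is limited to one degree of freedom. The log-likelihood values are hypothetical.

```python
import math


def lrt_pvalue_df1(loglik_full, loglik_null):
    """Likelihood-ratio statistic and its chi-square(1) p-value for nested models."""
    stat = 2.0 * (loglik_full - loglik_null)
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical maximized log-likelihoods from complete data.
stat, p_value = lrt_pvalue_df1(loglik_full=-99.5, loglik_null=-101.42)
```

The statistic here is 3.84, which sits close to the conventional 5% critical value for one degree of freedom.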
Observed versus expected information under MAR. One convention must be fixed
before any trace in this paper is computed, namely which information matrix to use under MAR.
Kenward and Molenberghs (1998) settle the question. Under MAR the missingness indicator is
not ancillary, so the correct sampling framework is unconditional over both the data and the
missingness pattern. A “naive” expected information is computed as if the realized pattern were
fixed by design, and it is biased. MCAR is necessary and sufficient for the naive and
unconditional forms to agree. Their recommendation is to use the observed information, and this
paper follows it. The bivariate Gaussian example they give shows where the difference lodges,
because under MAR dropout the unconditional information acquires mean-covariance cross terms
that the naive form misses. Both facts recur in Section 5. The design-imbalance term (𝐴) + (𝐶)
vanishes under MCAR and is a nonzero 𝑂(1) under MAR, and it is computed against the
observed information. The observed-data information behind the one RIV of this paper carries
exactly their MAR cross term.
MI test combining, calibration, and the prior MI-AIC. Three strands of prior work
meet the applications directly, and each is restated here as the launch point it provides. The first
strand is the combining rule. Meng and Rubin (1992) combine complete-data likelihood-ratio
statistics across imputations into a single test statistic, then calibrate it against an $F$ reference
under an equal-fractions assumption. Chan and Meng (2022) repair the procedure’s known
defects by switching the order of operations. Their statistic
\[
\hat d_L = 2\bigl[\bar L(\hat\psi^{*}) - \bar L(\hat\psi^{*}_0)\bigr], \qquad
\hat\psi^{*} = \arg\max_{\psi}\, \bar L(\psi), \tag{6}
\]
maximizes the averaged log-likelihood rather than averaging the maxima, and it is nonnegative
and invariant by construction. The numerator analyzed in Section 5 is exactly this
maximize-then-average statistic. The second strand is the reference distribution. Chan (2022)
drops the equal-fractions assumption entirely. Stacking the imputed data sets yields estimators of
every eigenvalue $r_j$ of the odds-of-missing-information matrix, and the limiting null law of the
combined statistic is a weighted sum. Its mean exceeds the parameter count by the total odds of
missing information in the tested directions. That excess matters later. A reference built this way
absorbs the corresponding bias in the numerator, so the bias bears on procedures that use no such
reference. The third strand is the criterion. Consentino and Claeskens (2010) propose an AIC for
multiply imputed data by attaching the Meng-Rubin combined statistic to the standard penalty.
Their criterion does not analyze the bias of the averaged log-likelihood, and their closing
assessment leaves the theory open. The corrected criterion of Section 5 is the answer to the
question their proposal poses.
4 Methods: the derivation and verification workflow
The collaboration that produced these derivations is described in Section 2. This section
states the verification protocol under which the results were produced and checked. One question
motivates that protocol. Why do the results that follow deserve the same scrutiny that is applied to
human-derived mathematics? The protocol answers that question in five parts. The first is
citations checked against their sources, and the second is a verification sequence with explicit trust
grades. The third is preregistration of every simulation, the fourth is adversarial cross-model
review, and the fifth is full reproducibility. Each part is stated below together with what it can
catch and what it cannot. The complete protocol records are collected in the appendices and the
public repository; they include the decision logs, the assessment records, the amendment histories,
and the enforcement code. This section states the design.
Citations checked against the source. Every claim about prior literature was traced to a
source document that was archived locally and read in the working session that used it. A
pre-write check enforced the rule in software, blocking any manuscript edit that cited a paper
whose archived copy did not exist. Reliance on the AI system’s trained recollection of a paper was
prohibited throughout, because invented or misattributed citations are among the most common
failures of human-prompted AI scholarship. The enforcement has a stated limit. It checks that a
source exists and where it came from, but it does not check understanding, so a real passage can
still be misread. Only review catches that.
The verification sequence and trust grades. Every analytic claim entered a fixed
sequence. It was derived first, then verified symbolically in two independent computer-algebra
systems, and finally confirmed by Monte Carlo simulation against criteria fixed in advance.
Results are labeled throughout the paper by the checks they passed. A claim is firm if it was
derived in closed form and passed both symbolic systems and Monte Carlo. A claim is measured
if it is a quantitative finding confirmed by preregistered simulation but not established in closed
form. A claim is structural if it is argued from the form of the problem but not separately
measured. Anything weaker is a conjecture and is labeled as one. These four labels are used in the
Derivations and Simulation studies sections without further comment. The sequence has a limit,
and the limit is shared setup. Both algebra systems verify the expressions they are given. An error
upstream of the algebra passes both. A wrong conditioning or a misstated expectation is exactly
that kind of error. The next two parts exist for that class of error.
Preregistration before code. Every simulation in this paper was preregistered.
Predictions, designs, and pass criteria were committed to the repository before the simulation
code was written, and changes were handled by dated amendments, themselves committed before
any new runs. Failed predictions are reported in the main text alongside those that held. The limit
is stated directly. Preregistration disciplines the reporting and nothing more, because a frozen
prediction can rest on a wrong premise, and committing it early validates neither the derivation
behind it nor the design that tests it.
Cross-model adversarial review. The claims the paper’s results depend on were
re-derived blind by a model from an independent family. That model was given the setup but not
the result. The claims were then subjected to a second pass in which the model was instructed to
refute each one with the strongest available argument. The assessment records are committed.
One episode shows what this check catches, and it is reported here as the worked example. The
main theorem’s sign depends on a conditioning choice. The averaged log-likelihood can be
defined at the fitted imputation model, which is what multiple imputation computes. It can instead
be defined as the true-model expectation conditioned on the truth, which no procedure computes.
Eight of nine blind re-derivations produced the opposite sign, $-\tfrac12\,\mathrm{tr}(\mathrm{RIV})$. They did so because
the true-model conditioning had been silently substituted. The error was not algebraic. Every
algebra check passed, because the algebra was correct for the definition it was given. The fork is
now stated explicitly where the theorem is set up, with both conditionings and both signs. A less
diverse, single-family check could have missed that the sign turned on a conditioning choice. The
check’s limit is the mirror of its strength. Independent model families are trained on overlapping
corpora. An error common to the corpus can survive both.
Reproducibility. Every number in this paper regenerates from committed code. The
simulations run from fixed entry points with fixed seeds. Outputs are cached as committed
artifacts. The software environment is recorded. Where a result is quoted in the text, its audit trail
names the artifact it comes from. The limit is the usual one. Reproducibility guarantees the
numbers, not their meaning. A wrong design reproduces its artifact exactly.
Read with this section in hand, the rest of the paper carries its evidence with it. Each claim
in the Derivations and Simulation studies sections arrives with a grade label. Each quantitative
claim also arrives with a committed artifact behind it. The appendices hold the records. The
repository holds everything, and the project page at
https://marcus-waldman.github.io/mi-spectral/ links to the companion derivation, the code, and
the session transcripts in one place. The protocol does not ask the reader to trust an AI derivation.
It asks the reader to check one, and states where the checking has already been done and where it
cannot be.
5 Derivations
This section delivers the paper’s analytic results, in three parts. The first part states the
main theorem, which is the answer to the replication principle’s second level. It prices the bias of
the averaged log-likelihood to leading order, so the criterion can be restored to its complete-data
expectation. The second and third parts take up the third level, decision rates, for testing and for
selection in turn. Every claim carries one of the four grade labels defined in Section 4. Full proofs
are in the companion document. Where a claim was tested by a pre-registered study, the test is
flagged where the claim is made, and all empirical evidence is reported in Section 6.
Setup and main theorem
We state the theorem for a general regular likelihood that estimates a scale or covariance,
then instantiate it in the multivariate normal family. Let $Y \in \mathbb{R}^{N\times p}$ collect $N$ independent rows
$Y_i \sim N_p(\mu, \Sigma)$ with estimand $\theta = (\mu, \mathrm{vech}\,\Sigma)$ of dimension $k = p + \tfrac12 p(p+1)$. Missingness is
ignorable throughout. The two estimators of Section 3 take concrete form here. One is the
complete-data maximizer ˆ𝜃com, the target an analyst with full data would have reached. The other
is the observed-data maximizer ˆ𝜃obs, which sums each row’s marginal density over its observed
coordinates (Schafer, 1997) and is the best available from the data at hand. We take up the scope
of the normal instantiation in Section 7.
We now instantiate the RIV of Equation 2, fixing one convention and repeating one
warning. In this family
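\[
\mathrm{RIV} = I_{\mathrm{obs}}^{-1} I_{\mathrm{mis|obs}} = I_{\mathrm{obs}}^{-1} I_{\mathrm{com}} - I_k, \qquad
\mathrm{tr}(\mathrm{RIV}) = \mathrm{tr}\bigl(I_{\mathrm{obs}}^{-1} I_{\mathrm{com}}\bigr) - k. \tag{7}
\]
Under congenial proper MI with $m \to \infty$, Rubin’s rules give $\bar U_\infty = I_{\mathrm{com}}^{-1}$ and $T_\infty = I_{\mathrm{obs}}^{-1}$, so this
matrix is exactly Rubin’s $r_\infty$ in matrix form. We fix the convention first. Under MAR the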
expectation defining 𝐼obs is taken jointly over data and pattern, without conditioning, as Kenward
and Molenberghs (1998) require, and the resulting information carries their mean-covariance
cross term. Computed this way there is a single RIV, and both bias terms of the theorem calibrate
to it. An earlier reading of our own simulation output made the two terms appear to attach to two
different RIVs, but that reading was an artifact of comparing against the naive block-diagonal
information. The warning repeats Section 3 in local form. Here the
fraction-of-missing-information matrix $D = I_{\mathrm{com}}^{-1} I_{\mathrm{mis|obs}}$ (Schafer, 1997) has the same trace
ingredients but the other normalization, and the two relate by $D = (I_k + \mathrm{RIV})^{-1}\mathrm{RIV}$. Substituting
one for the other changes every constant below.
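The two normalizations can be checked numerically. The following is a minimal sketch, assuming randomly generated stand-ins for the information matrices and the additive identity $I_{\mathrm{com}} = I_{\mathrm{obs}} + I_{\mathrm{mis|obs}}$; the helper `random_spd`, the seed, and the dimension are illustrative assumptions and are not taken from the paper's simulations. It confirms the trace identity of Equation 7 and the stated relation between $D$ and RIV.

```python
import numpy as np

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive-definite k x k matrix."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)

rng = np.random.default_rng(0)
k = 5
I_obs = random_spd(k, rng)      # stand-in for the observed-data information
I_mis = random_spd(k, rng)      # stand-in for the missing information I_mis|obs
I_com = I_obs + I_mis           # complete-data information

RIV = np.linalg.solve(I_obs, I_mis)   # I_obs^{-1} I_mis|obs
D = np.linalg.solve(I_com, I_mis)     # I_com^{-1} I_mis|obs

trace_direct = np.trace(RIV)
trace_via_com = np.trace(np.linalg.solve(I_obs, I_com)) - k   # Equation 7
D_from_RIV = np.linalg.solve(np.eye(k) + RIV, RIV)            # (I_k + RIV)^{-1} RIV

print(np.isclose(trace_direct, trace_via_com), np.allclose(D, D_from_RIV))
print(trace_direct, np.trace(D))      # the two normalizations give different traces
```

The last line illustrates the warning: the trace of $D$ is strictly smaller than the trace of RIV whenever any information is missing, so the two cannot be interchanged in a correction term.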
The object of the theorem is the infinite-imputation Q-function, and its definition contains
a choice that fixes the sign of everything that follows. Under congenial proper MI the imputation
parameter is a posterior draw, $\tilde\phi \sim \pi(\phi \mid Y_{\mathrm{obs}})$, where $\pi$ is the posterior of the imputation model.
By congeniality (R4) that imputation model is the one whose predictive distribution the analyst’s
procedure agrees with. Under deterministic FIML imputation, by contrast, the draw degenerates
to the point mass $\tilde\phi = \hat\theta_{\mathrm{obs}}$. In either case $\tilde\phi$ is centered at $\hat\theta_{\mathrm{obs}}$ with $\mathrm{Var}(\tilde\phi) = I_{\mathrm{obs}}^{-1} + O(n^{-2})$, and
the order counting below draws on this property. Each completion is then drawn at $\tilde\phi$, and the
infinite-imputation Q-function averages these completions,
\[
\bar Q_\infty(\theta) := \lim_{M\to\infty} \frac{1}{M} \sum_{l=1}^{M} \ell_{\mathrm{com}}\bigl(\theta;\, Y_{\mathrm{obs}}, \tilde Y^{(l)}_{\mathrm{mis}}\bigr)
= E_{\tilde\phi}\, E_{Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}},\, \tilde\phi}\bigl[\ell_{\mathrm{com}}(\theta)\bigr]. \tag{8}
\]
The inner expectation conditions on the fitted or drawn parameter, exactly as the EM E-step
conditions on the current iterate. It does not condition on the truth. This is the fork that Section 4
reported as the sign episode. Both branches are stated here to prevent the standard confusion.
Were $\bar Q_\infty$ instead the true-model expectation $E[\ell_{\mathrm{com}}(\theta) \mid Y_{\mathrm{obs}}]$ under the data-generating law, the
tower property would force the imputation-bias term to zero and the total bias would be
$-\tfrac12\,\mathrm{tr}(\mathrm{RIV})$, with the opposite sign. The analyst imputes from a model fit to the data, not from the
true model. The two regimes are distinguished experimentally by a dedicated study in Section 6.
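The averaging in Equation 8 can be made concrete in the simplest case. The following is a minimal sketch, assuming a univariate normal sample, the deterministic arm in which every completion is drawn at the fixed parameter $\tilde\phi = \hat\theta_{\mathrm{obs}}$, and hypothetical sample sizes; the function names are illustrative. It compares a Monte Carlo average over completions with the closed-form inner expectation. A proper-MI version would additionally redraw $\tilde\phi$ from a posterior before each completion, which this sketch does not do.

```python
import numpy as np

def loglik_normal(y, mu, sigma2):
    """Gaussian log-likelihood summed over the last axis of y."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (y - mu) ** 2 / (2 * sigma2), axis=-1)

rng = np.random.default_rng(2)
y_obs = rng.normal(1.0, 2.0, size=60)   # observed part of the sample (hypothetical)
n_mis = 40                              # number of missing entries (hypothetical)

# Deterministic arm: the imputation parameter is the observed-data MLE.
mu_t, s2_t = y_obs.mean(), y_obs.var()

def q_bar_mc(mu, sigma2, M):
    """Average complete-data log-likelihood over M completions drawn at phi-tilde."""
    y_mis = rng.normal(mu_t, np.sqrt(s2_t), size=(M, n_mis))
    return loglik_normal(y_obs, mu, sigma2) + loglik_normal(y_mis, mu, sigma2).mean()

def q_bar_exact(mu, sigma2):
    """Closed form, using E[(y - mu)^2] = s2_t + (mu_t - mu)^2 under phi-tilde."""
    e_sq = s2_t + (mu_t - mu) ** 2
    imputed = n_mis * (-0.5 * np.log(2 * np.pi * sigma2) - e_sq / (2 * sigma2))
    return loglik_normal(y_obs, mu, sigma2) + imputed

mu, sigma2 = 0.5, 3.0                   # an arbitrary evaluation point theta
mc = q_bar_mc(mu, sigma2, M=50_000)
exact = q_bar_exact(mu, sigma2)
print(mc, exact)
```

In this sketch the closed form is maximized at $(\hat\mu_{\mathrm{obs}}, \hat\sigma^2_{\mathrm{obs}})$, which is the EM self-consistency property: conditioning on the fitted parameter, not on the truth, makes the averaged log-likelihood reproduce the observed-data fit.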
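The bias is split at the pivot $\hat\theta_{\mathrm{obs}}$ into two terms with distinct mechanisms,
\[
T = \underbrace{\bar Q_\infty(\hat\theta_{\mathrm{obs}}) - \ell_{\mathrm{com}}(\hat\theta_{\mathrm{obs}})}_{T_{\mathrm{imp}}:\ \text{imputation-bias term}}
\;+\; \underbrace{\ell_{\mathrm{com}}(\hat\theta_{\mathrm{obs}}) - \ell_{\mathrm{com}}(\hat\theta_{\mathrm{com}})}_{T_{\mathrm{est}}:\ \text{estimation-mismatch term}}, \tag{9}
\]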
on the log-likelihood scale, while deviance-scale statements double everything. The companion
derivation calls these Term A and Term B, but the descriptive subscripts are used here instead.
For pieces inside the first term, the companion’s bookkeeping also uses the labels (𝐴) and (𝐶), so
those letters must not collide. Seven regularity conditions are in force, and none is exotic. R1 and
R2 are smoothness and positive-definite information (Vaart, 1998), R3 is ignorability (Rubin,
1976), and R4 through R6 are congeniality, properness, and self-efficiency (Meng, 1994; Rubin,
1987). R7 is the infinite-imputation idealization, and Rubin’s finite-$m$ corrections apply otherwise.
Together they deliver the variance-recovery property $T_\infty = I_{\mathrm{obs}}^{-1}$ that identifies Equation 7 with $r_\infty$.
Theorem 5.1 (Q-function deviance bias). Under R1-R7, for a model that estimates a scale or
covariance, to leading order
\[
E[T] = +\tfrac12\,\mathrm{tr}(\mathrm{RIV}) + (A) + (C) + O(n^{-1}), \tag{10}
\]
decomposing as $E[T_{\mathrm{imp}}] = +\,\mathrm{tr}(\mathrm{RIV}) + [(A) + (C)] + O(n^{-1})$ and $E[T_{\mathrm{est}}] = -\tfrac12\,\mathrm{tr}(\mathrm{RIV}) + O(n^{-1})$.
The design-imbalance term (𝐴) + (𝐶) vanishes under MCAR and is a nonzero 𝑂(1) under MAR.
The theorem is firm in the sense of Section 4. The two terms are derived in turn. The
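estimation-mismatch term comes first. We expand $\ell_{\mathrm{com}}$ to second order about its own maximizer.
The score vanishes there, giving
\[
T_{\mathrm{est}} = -\tfrac12\,(\hat\theta_{\mathrm{obs}} - \hat\theta_{\mathrm{com}})^{\top} \mathcal{J}_{\mathrm{com}}(\hat\theta_{\mathrm{com}})\,(\hat\theta_{\mathrm{obs}} - \hat\theta_{\mathrm{com}}) + O_p(n^{-3/2}).
\]
We then take expectations, replace the realized curvature by $I_{\mathrm{com}}$, and write the quadratic form as
a trace. This gives
\[
E[T_{\mathrm{est}}] = -\tfrac12\,\mathrm{tr}\bigl\{ I_{\mathrm{com}}\, \mathrm{Var}(\hat\theta_{\mathrm{obs}} - \hat\theta_{\mathrm{com}}) \bigr\} + O(n^{-3/2}).
\]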
The variance of the gap is the substantive step. It runs through the score identity. Differentiating
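the factorization of Equation 9’s first term and taking conditional expectations gives
$U_{\mathrm{obs}} = E[U_{\mathrm{com}} \mid Y_{\mathrm{obs}}]$. From this,
\[
\mathrm{Cov}(U_{\mathrm{obs}}, U_{\mathrm{com}}) = \mathrm{Var}(U_{\mathrm{obs}}) = I_{\mathrm{obs}}, \qquad
\mathrm{Cov}(\hat\theta_{\mathrm{obs}}, \hat\theta_{\mathrm{com}}) = I_{\mathrm{obs}}^{-1} I_{\mathrm{obs}} I_{\mathrm{com}}^{-1} = I_{\mathrm{com}}^{-1},
\]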
the second equality holding by the asymptotic linearity of both estimators. An equivalent
statement is that the complete-data estimator is asymptotically uncorrelated with the gap.
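Assembling the three pieces of the variance,
\[
\mathrm{Var}(\hat\theta_{\mathrm{obs}} - \hat\theta_{\mathrm{com}}) = I_{\mathrm{obs}}^{-1} + I_{\mathrm{com}}^{-1} - 2 I_{\mathrm{com}}^{-1} = I_{\mathrm{obs}}^{-1} - I_{\mathrm{com}}^{-1}, \tag{11}
\]
and substituting back,
\[
E[T_{\mathrm{est}}] = -\tfrac12\,\mathrm{tr}\bigl( I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} - I_k \bigr)
= -\tfrac12 \bigl[ \mathrm{tr}( I_{\mathrm{obs}}^{-1} I_{\mathrm{com}} ) - k \bigr]
= -\tfrac12\,\mathrm{tr}(\mathrm{RIV}).
\]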
One trap is worth naming. Substituting $\mathrm{Var}(\hat\theta_{\mathrm{obs}})$ for the variance of the gap in the second display
produces a spurious $-\tfrac12 k$. We display Equation 11 to forestall it.
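The estimation-mismatch algebra, including the trap, can be verified on arbitrary matrices. The following is a minimal sketch, assuming randomly generated stand-ins for the information matrices (the helper `random_spd`, the seed, and the dimension are illustrative assumptions). It evaluates the trace expression with the gap variance of Equation 11 and then with the mistaken substitution.

```python
import numpy as np

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive-definite k x k matrix."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)

rng = np.random.default_rng(1)
k = 4
I_obs = random_spd(k, rng)
I_mis = random_spd(k, rng)
I_com = I_obs + I_mis
I_obs_inv = np.linalg.inv(I_obs)
I_com_inv = np.linalg.inv(I_com)

# Equation 11: Var(gap) = Var(obs) + Var(com) - 2 Cov(obs, com)
var_gap = I_obs_inv + I_com_inv - 2.0 * I_com_inv
e_t_est = -0.5 * np.trace(I_com @ var_gap)

# Target value: -1/2 tr(RIV)
target = -0.5 * np.trace(np.linalg.solve(I_obs, I_mis))

# The trap: using Var(theta_obs) in place of the variance of the gap
trap = -0.5 * np.trace(I_com @ I_obs_inv)

print(e_t_est, target, trap - target)   # the last value equals -k/2
```

The difference between the mistaken value and the target is exactly $-\tfrac12 k$, independent of the matrices drawn, which is why the error is easy to make and easy to detect.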
The imputation-bias term is the delicate half. Its derivation has three steps. First, factor
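the complete-data log-likelihood as $\ell_{\mathrm{com}}(\theta) = \ell_{\mathrm{obs}}(\theta) + \ell_{\mathrm{mis|obs}}(\theta)$ with
$\ell_{\mathrm{mis|obs}}(\theta) := \log P(Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}}, \theta)$. The observed parts cancel inside $T_{\mathrm{imp}}$, leaving
\[
T_{\mathrm{imp}} = E_{Y_{\mathrm{mis}} \mid Y_{\mathrm{obs}},\, \tilde\phi}\bigl[ \ell_{\mathrm{mis|obs}}(\hat\theta_{\mathrm{obs}}) \bigr] - \ell_{\mathrm{mis|obs}}(\hat\theta_{\mathrm{obs}}).
\]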
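Second, take expectations over the true completion. What remains is the discrepancy between
averaging the missing-data log-density under the fitted $\tilde\phi$ and averaging it under the truth,
\[
E[T_{\mathrm{imp}}] = E_{Y_{\mathrm{obs}}}\Bigl\{ E_{\tilde\phi}\bigl[ \ell_{\mathrm{mis|obs}}(\hat\theta_{\mathrm{obs}}) \bigr] - E_{\theta_0}\bigl[ \ell_{\mathrm{mis|obs}}(\hat\theta_{\mathrm{obs}}) \bigr] \Bigr\}.
\]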
These two inner expectations would cancel were $\tilde\phi = \theta_0$, which is the true-model case of
Equation 8 and the tower-property branch stated there. The term is nonzero precisely because the
imputations use $\tilde\phi \approx \hat\theta_{\mathrm{obs}} \neq \theta_0$. Third, expand to second order in $\tilde\phi$ about $\theta_0$, differentiating under
the integral and using the Bartlett identity $E_{\theta_0}[S_{\mathrm{mis|obs}} S_{\mathrm{mis|obs}}^{\top}] = I_{\mathrm{mis|obs}}$. Three pieces result:
\[
E[T_{\mathrm{imp}}] = \underbrace{E\bigl[ (\tilde\phi - \theta_0)^{\top} \alpha \bigr]}_{(A)}
+ \underbrace{E\bigl[ (\tilde\phi - \theta_0)^{\top} I_{\mathrm{mis|obs}} (\hat\theta_{\mathrm{obs}} - \theta_0) \bigr]}_{\text{main cross term}}
+ \underbrace{\tfrac12\, E\bigl[ (\tilde\phi - \theta_0)^{\top} H_{\phi\phi} (\tilde\phi - \theta_0) \bigr]}_{(C)}. \tag{12}
\]
Here $\alpha = \mathrm{Cov}_{\theta_0}(\ell_{\mathrm{mis|obs}}, S_{\mathrm{mis|obs}}) = -E'(\theta_0)$ is the gradient of the conditional missing-data
entropy. The term $H_{\phi\phi}$ is the curvature of the conditional cross-entropy. The labels $(A)$ and $(C)$
follow the companion derivation’s bookkeeping. The companion letters its main piece $(B)$. This
paper avoids that label for obvious reasons. The entropy gradient is not zero. Gibbs’ inequality
concerns varying the evaluation point of $E_{\theta_0}[\log p_\theta]$. By contrast, $\alpha$ varies the sampling
distribution of the completions. The two are different functions. For the normal family the
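gradient is supported entirely on the covariance parameters. In the univariate case $\alpha_\mu = 0$ and
$\alpha_{\sigma^2} = -n_{\mathrm{mis}}/(2\sigma^2) \neq 0$. The main cross term reduces directly. Proper MI centers $\tilde\phi$ at $\hat\theta_{\mathrm{obs}}$, so
\[
E\bigl[ (\tilde\phi - \theta_0)^{\top} I_{\mathrm{mis|obs}} (\hat\theta_{\mathrm{obs}} - \theta_0) \bigr]
= \mathrm{tr}\bigl\{ I_{\mathrm{mis|obs}}\, \mathrm{Cov}(\tilde\phi, \hat\theta_{\mathrm{obs}}) \bigr\}
= \mathrm{tr}\bigl\{ I_{\mathrm{mis|obs}}\, I_{\mathrm{obs}}^{-1} \bigr\} + O(n^{-1})
= \mathrm{tr}(\mathrm{RIV}) + O(n^{-1}),
\]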
the leading constant of the theorem.
The two remaining pieces of Equation 12 form the design-imbalance term, and their
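orders are counted directly. Under proper MI centered at $\hat\theta_{\mathrm{obs}}$, the displacement $\delta := \tilde\phi - \theta_0$ has
$E[\delta] = O(n^{-1})$, the MLE bias, and $\mathrm{Var}(\delta) = I_{\mathrm{obs}}^{-1} = O(n^{-1})$. The factors $\alpha$ and $H_{\phi\phi}$ are both $O(n)$,
extensive in $n_{\mathrm{mis}}$. Hence
\[
(A) = E[\delta]^{\top} \alpha = O(1), \qquad
(C) = \tfrac12\,\mathrm{tr}\bigl( H_{\phi\phi}\, I_{\mathrm{obs}}^{-1} \bigr) + O(n^{-1}) = O(1),
\]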
two order-one pieces of opposite sign. Their leading parts cancel exactly when the missing and
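observed units share a conditioning-variable distribution, which is the MCAR case. Under MAR
the sum survives, and it has the bivariate monotone closed form
\[
(A) + (C) = \frac{n_{\mathrm{mis}}}{n_{\mathrm{obs}}} \Bigl[ 1 - \tfrac12\,\mathrm{tr}\bigl( Q_{\mathrm{mis}} Q_{\mathrm{obs}}^{-1} \bigr) \Bigr] + O(n^{-1}), \tag{13}
\]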
where 𝑄mis and 𝑄obs are the conditioning-variable second-moment matrices of the missing and
observed units. The term is zero exactly when the two distributions agree. For non-monotone
patterns the covariance-MLE bias entering (𝐴) comes from the general second-order Cox-Snell
bias of the MLE, which we carry to general dimension and verify in two computer-algebra
systems. Summing the pieces of Equation 12 with E[𝑇est] gives Equation 10. These statements do
not all rest on the same footing, and the distinction matters for how the term is used. The
structural facts are firm in the sense of Section 4. The term’s sign, its $O(1)$ order under MAR, its exact
vanishing under MCAR, the closed form Equation 13, and the information convention it is
computed in are all proved. Its absolute magnitude is another matter. The
per-replicate statistic is heavy-tailed, so direct Monte Carlo estimates scatter without trend around
the analytic leading-order value, and the magnitude is measured only imprecisely. This
imprecision does not hurt the comparison, because comparisons never depend on the magnitude
alone. Candidates fit to the same imputations touch (𝐴) + (𝐶) only through their difference, and
the heavy-tailed component cancels in that difference. The comparison-relevant differential is
derived and measured precisely in the next part.
Two checks close the theorem, one on its scope and one on its form. The scope check is
the known-scale collapse. With a known scale and a location-only fit the conditional missing-data
entropy is parameter-free. The net bias then collapses to
\[
E[T]_{\text{known scale}} =
\begin{cases}
0 & \text{deterministic FIML}, \\
-\tfrac12\,\mathrm{tr}(\mathrm{RIV}) & \text{proper MI}.
\end{cases} \tag{14}
\]
The two arms differ by exactly the posterior-draw imputation variance of Section 3’s
proper-imputation paragraph. The four-way collapse is a pre-registered prediction, tested in
Section 6. Every model compared below estimates a covariance. The applications therefore fall in
the estimated-scale regime. The form check is the entropy-plug-in reading. Write $C_n(\theta)$ for the
conditional entropy of the missing block given the observed block. The bias Equation 10 equals
the plug-in bias of evaluating $C_n$ at the estimate rather than the truth. One curvature identity
regroups the proof’s three pieces into a single expansion,
\[
\nabla^2 C_n(\theta_0) = H_{\phi\phi} + I_{\mathrm{mis|obs}}, \tag{15}
\]
and delivers the known-scale collapse as the parameter-free special case. The total is
convention-free. Both readings are verified symbolically. The theorem’s pre-registered Monte
Carlo confirmation is reported in Section 6.
Likelihood-ratio comparison
This part takes up the replication principle’s third level for testing. The theorem prices one
model’s bias. A likelihood-ratio comparison involves two such quantities on the same imputed
data, and everything here follows from asking what survives the subtraction. Let the null model be
a smooth submodel $\theta = g(\gamma)$ with full-rank Jacobian $G$ and $q_d$ tested constraints. Both models
are fit to the same imputations, and the numerator is
\[
\hat d_L = 2\bigl[ \bar Q_\infty(\hat\psi^{*}) - \bar Q_\infty(\hat\psi^{*}_0) \bigr], \tag{16}
\]
the infinite-𝑚limit of the maximize-then-average statistic of Chan and Meng (2022) restated in
Section 3. One definition matters here. The constrained fit $\hat\psi^{*}_0$ maximizes the shared $\bar Q_\infty$ over the
null manifold. It is not the null model’s own observed-data MLE. The complete-data counterpart
has null expectation $q_d + O(n^{-1})$ by Wilks’ theorem, and the object of interest is the differential.
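Proposition 5.1 (Differential bias at the null). Under R1-R7 the following holds at the null.
\[
\begin{aligned}
E\bigl[ \hat d_L - \hat d_{\mathrm{com}} \bigr] &= \mathrm{tr}(\mathrm{RIV}_\perp) + O(n^{-1}), \\
\mathrm{tr}(\mathrm{RIV}_\perp) &:= \mathrm{tr}\bigl( I_{\mathrm{obs}}^{-1} I_{\mathrm{mis|obs}} \bigr)
- \mathrm{tr}\bigl\{ (G^{\top} I_{\mathrm{com}} G)^{-1}\, G^{\top} I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} I_{\mathrm{mis|obs}}\, G \bigr\}.
\end{aligned} \tag{17}
\]
This is the missing information of the tested directions, projected onto the null tangent space in the
$I_{\mathrm{com}}$ metric.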
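This result is firm. The metric is its substance. Write $Z = \nabla \bar Q_\infty(\theta_0)$ for the score of the
averaged log-likelihood at the truth. Write $J = -\nabla^2 \bar Q_\infty(\theta_0)$ for its curvature. Because the
conditional density integrates to one, the evaluation-slot gradient of the imputed part vanishes at
$(\theta_0, \theta_0)$. The score then reduces to
\[
Z = S + I_{\mathrm{mis|obs}} (\hat\theta_{\mathrm{obs}} - \theta_0) + O_p(1) = I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} S + O_p(1), \qquad
\mathrm{Var}(Z) = I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} I_{\mathrm{com}} + O(\sqrt{n}),
\]
with $S$ the observed-data score. The curvature converges to
\[
J \to_p I_{\mathrm{obs}} + I_{\mathrm{mis|obs}} = I_{\mathrm{com}},
\]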
the EM identity restated in Section 3. The two limits are consistent only together. The
unconstrained maximizer of $\bar Q_\infty$ must reproduce $\hat\theta_{\mathrm{obs}}$, which is EM self-consistency. Indeed
$J^{-1} Z = I_{\mathrm{com}}^{-1} I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} S = \hat\theta_{\mathrm{obs}} - \theta_0$ holds only for $J = I_{\mathrm{com}}$. The constrained maximizer therefore
projects $Z$ onto $\mathrm{col}(G)$ in the $I_{\mathrm{com}}$ metric. The deviance is then the standard difference of
quadratic forms,
\[
\hat d_L = Z^{\top} \bigl[ I_{\mathrm{com}}^{-1} - G (G^{\top} I_{\mathrm{com}} G)^{-1} G^{\top} \bigr] Z + O_p(n^{-1/2}).
\]
Taking expectations against $\mathrm{Var}(Z)$ and subtracting Wilks’ $E[\hat d_{\mathrm{com}}] = q_d + O(n^{-1})$ gives
Equation 17. The natural error is now visible. Substituting $I_{\mathrm{obs}}$ for the curvature reproduces
exactly the naive difference that the next proposition shows always overstates. That substitution is
the same as conflating $\hat\psi^{*}_0$ with the null model’s own observed-data MLE. The derivation was
carried through three independent routes. A pre-registered design built to separate the two
formulas is assessed in Section 6.
Proposition 5.2 (The naive difference always overstates). Let
$\mathrm{tr}(\mathrm{RIV}_0) = \mathrm{tr}[(G^{\top} I_{\mathrm{obs}} G)^{-1} G^{\top} I_{\mathrm{mis|obs}} G]$ be the null model’s own self-contained trace correction.
Then always
\[
\mathrm{tr}(\mathrm{RIV}_\perp) \le \mathrm{tr}(\mathrm{RIV}) - \mathrm{tr}(\mathrm{RIV}_0), \tag{18}
\]
with equality if and only if $\mathrm{col}(I_{\mathrm{obs}}^{1/2} G)$ is an invariant subspace of the standardized missing
information $H = I_{\mathrm{obs}}^{-1/2} I_{\mathrm{mis|obs}} I_{\mathrm{obs}}^{-1/2}$. Equal fractions of missing information is a special case of
equality, with both sides equal to $r q_d$ for the common odds $r$.
This result is firm. The proof is one matrix inequality. The Gram matrix of the pair
$(I_{\mathrm{obs}}^{1/2} G,\; I_{\mathrm{obs}}^{-1/2} I_{\mathrm{com}} G)$ is positive semidefinite. Its Schur complement gives
\[
G^{\top} I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} I_{\mathrm{com}} G \;\succeq\; (G^{\top} I_{\mathrm{com}} G)\, (G^{\top} I_{\mathrm{obs}} G)^{-1}\, (G^{\top} I_{\mathrm{com}} G).
\]
Tracing both sides against $(G^{\top} I_{\mathrm{com}} G)^{-1}$ yields Equation 18. The gap also has an exact closed
form. That form makes the equality condition transparent. Partition $H$ into a block $H_{11}$ on the
retained directions $\mathrm{col}(I_{\mathrm{obs}}^{1/2} G)$, a block on the tested directions, and a coupling block $H_{12}$. This
partition gives
\[
\mathrm{tr}(\mathrm{RIV}) - \mathrm{tr}(\mathrm{RIV}_0) - \mathrm{tr}(\mathrm{RIV}_\perp) = \mathrm{tr}\bigl\{ (I + H_{11})^{-1} H_{12} H_{12}^{\top} \bigr\} \ge 0. \tag{19}
\]
The gap is zero exactly when 𝐻12 = 0. That is when the tested and retained directions carry
independent missing information. The overstatement is therefore precisely the
missing-information coupling between the two subspaces. The retained block’s own information
screens it. The practical reading is direct. Correcting an MI deviance comparison by the
difference of the two models’ own traces over-corrects at the null. This happens whenever the
tested directions mix unequally-missing information. The over-correction is negligible when the
design lies near the invariance case. It grows to multiples when the design does not. Both regimes
are exhibited in Section 6.
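The inequality of Proposition 5.2 and the closed-form gap of Equation 19 can be checked numerically. The following is a minimal sketch, assuming randomly generated stand-ins for the information matrices and for the Jacobian $G$; the helpers `random_spd`, `sym_power`, and `trace_terms`, the seeds, and the dimensions are illustrative assumptions, not the designs assessed in Section 6. It evaluates both sides of Equation 18 for a generic design and for the equal-fractions case.

```python
import numpy as np

def random_spd(k, rng):
    """Hypothetical helper: a random symmetric positive-definite k x k matrix."""
    a = rng.standard_normal((k, k))
    return a @ a.T + k * np.eye(k)

def sym_power(a, p):
    """Power of a symmetric positive-definite matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * w ** p) @ v.T

def trace_terms(I_obs, I_mis, G):
    """Return tr(RIV), tr(RIV_0), tr(RIV_perp), and the Equation 19 gap."""
    I_com = I_obs + I_mis
    riv = np.linalg.solve(I_obs, I_mis)
    tr_riv = np.trace(riv)
    tr_riv0 = np.trace(np.linalg.solve(G.T @ I_obs @ G, G.T @ I_mis @ G))
    proj = np.linalg.solve(G.T @ I_com @ G, G.T @ I_com @ riv @ G)
    tr_perp = tr_riv - np.trace(proj)                     # Equation 17
    # Standardize by I_obs, then split into retained and tested directions.
    root, inv_root = sym_power(I_obs, 0.5), sym_power(I_obs, -0.5)
    H = inv_root @ I_mis @ inv_root
    Q, _ = np.linalg.qr(root @ G, mode="complete")
    m = G.shape[1]
    Q1, Q2 = Q[:, :m], Q[:, m:]                           # retained, tested
    H11, H12 = Q1.T @ H @ Q1, Q1.T @ H @ Q2
    gap = np.trace(np.linalg.solve(np.eye(m) + H11, H12 @ H12.T))
    return tr_riv, tr_riv0, tr_perp, gap

rng = np.random.default_rng(3)
k, m = 6, 4                        # k parameters, m free under the null, q_d = k - m
I_obs, I_mis = random_spd(k, rng), random_spd(k, rng)
G = rng.standard_normal((k, m))

tr_riv, tr_riv0, tr_perp, gap = trace_terms(I_obs, I_mis, G)
naive = tr_riv - tr_riv0
print(tr_perp, naive, naive - tr_perp, gap)

r = 0.3                            # equal fractions: I_mis = r * I_obs
eq = trace_terms(I_obs, r * I_obs, G)
print(eq[2], eq[0] - eq[1], r * (k - m), eq[3])
```

In the generic case the naive difference exceeds the projected trace by exactly the coupling term. In the equal-fractions case the coupling block vanishes and both sides reduce to $r q_d$.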
Proposition 5.3 (Pairing collapses the noise). Let 𝐷be the per-dataset paired differential. At the
null and under local alternatives, sd(𝐷) = 𝑂(1), against 𝑂(√𝑛) for either level separately. At a
fixed alternative the cancellation fails and sd(𝐷) reverts to 𝑂(√𝑛).
This result is firm. The mechanism is exact cancellation of realizations rather than
averaging, because the large noise of each level lives in fit-independent realized constants. Both
fits maximize the same realized ¯𝑄∞, and they build it from the same imputation parameter, so
these constants are identical in the two levels and cancel dataset by dataset. The simulations in
Section 6 show both effects across sample sizes. The single-model noise grows, while the paired
differential stays flat. This is why the paper estimates every comparison-relevant quantity by
paired contrasts, never by differencing separately estimated levels.
Proposition 5.4 (The design-imbalance differential cancels at the null). The (𝐴) + (𝐶)
contributions of the two levels are properties of the imputation, not of which model is fit. At the
null they are identical realizations, so they cancel exactly to leading order. Under local
alternatives the differential is $O(n^{-1/2})$, and it becomes a genuine $O(1)$ only when the candidates’
pseudo-true values are separated at 𝑂(1). That last case is the regime of Vuong (1989).
Separation, not nesting, is the criterion, because a nested but false restriction triggers the
decoupling just as a non-nested pair does. The expansion leads us to expect that the differential’s
size scales with how differently the competitors handle the missing data, since similar candidates
respond to the same imbalance almost identically and their contributions nearly cancel. One
caveat is analytic and travels with dissimilar pairs. A badly misspecified candidate also carries a
mechanism-independent misspecification 𝑂(1), and an MCAR contrast separates this
misspecification term from the design-imbalance term in measurement. Even so, the
misspecification term can still offset part of the design-imbalance term in the net ranking bias. A
pre-registered measurement covered all three, namely the decoupling, the dissimilarity scaling,
and the caveat. Section 6 reports it.
The last result of this part locates which procedures the differential bias actually affects.
At the null, Equation 17 is precisely the mean inflation that a correctly calibrated reference
distribution already absorbs. The limiting law of $\hat d_L$ is the weighted sum $\sum_j \lambda_j \chi^2_1$ with
$\lambda_j = 1 + r_{\perp,j}$ (Chan, 2022). The weights have a fixed sum,
\[
\sum_{j=1}^{q_d} \lambda_j = q_d + \mathrm{tr}(\mathrm{RIV}_\perp). \tag{20}
\]
A test that refers the uncorrected numerator to such a reference is therefore approximately
calibrated at the null. Correcting the numerator on top of it double-counts. We check that
prediction against pre-registered null rates in Section 6. The differential bias matters instead
exactly where no reference distribution stands between the statistic and its use. Three procedures
qualify. Information-criterion ranking compares penalized values directly. Explicit numerator
corrections must use Equation 17 rather than the naive difference. Non-nested comparison brings
back both the design-imbalance differential and the heavy-tailed noise. The first of these is taken
up in the Information-criterion selection part.
A scope statement closes this part. It is structural in the sense of Section 4. The
propositions above are proved for the deterministic-FIML ¯𝑄∞. Under proper MI the
imputation-side quantities acquire posterior smearing. But they remain common to both fits, the
same posterior and the same draws. So the cancellation structure and the leading-order form carry
over structurally. This carry-over is argued, not separately measured. We list it as a limitation in
Section 7. The cross-model check of Section 4 was run on this part in full. A blind re-derivation
reproduced the 𝐼com metric, Equation 17 term by term, the definite sign of Equation 18 with a
third independent proof, and the noise orders. An adversarial pass instructed to break each claim
sustained all four.
Information-criterion selection
This part takes up the replication principle’s third level for selection, where the bias has no
reference distribution to hide behind. Model selection by information criterion compares
penalized deviances as numbers and takes the smallest, so whatever bias each candidate’s $-2\bar Q_\infty$
carries lands directly in the ranking. By Theorem 5.1 that bias is model-specific, because $\mathrm{RIV}_k$ is
computed on candidate $k$’s own parameter space. The corrected criterion takes this form.
\[
\mathrm{AIC}^{c}_{\mathrm{MI}}(k) = -2\,\bar Q_\infty(\hat\psi^{*}_k) + 2 q_k + \mathrm{tr}(\mathrm{RIV}_k). \tag{21}
\]
This penalty reproduces $\mathrm{AIC}_{x;y}$ of Shimodaira and Maeda (2018), and its missing-data surcharge
is exactly half that of the earlier complete-data-discrepancy criteria (Cavanaugh & Shumway,
1998). What Theorem 5.1 adds is the anatomy, the MAR term, and the proper-MI scope. The
ranking consequence is immediate. For two candidates the uncorrected difference carries a
differential bias whose leading term is the surplus of missing information, so the candidate with
more missing information about its own parameters looks artificially better by exactly that surplus.
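The ranking effect of Equation 21 is simple arithmetic. The following is a minimal sketch with entirely hypothetical numbers: the maximized averaged log-likelihoods, parameter counts, and traces are invented for illustration, the function names are not from the paper's code, and the design-imbalance term is ignored. It shows how adding each candidate's own trace can reverse an uncorrected ranking.

```python
def aic_mi_uncorrected(q_bar_max, n_params):
    """Uncorrected MI-AIC: -2 * averaged log-likelihood + 2 * parameter count."""
    return -2.0 * q_bar_max + 2.0 * n_params

def aic_mi_corrected(q_bar_max, n_params, tr_riv):
    """Equation 21: the uncorrected value plus the candidate's own tr(RIV)."""
    return aic_mi_uncorrected(q_bar_max, n_params) + tr_riv

# Hypothetical candidates: the larger model carries more missing information.
candidates = {
    "small": {"q_bar_max": -512.0, "n_params": 5, "tr_riv": 1.0},
    "large": {"q_bar_max": -510.5, "n_params": 6, "tr_riv": 4.0},
}

uncorrected = {name: aic_mi_uncorrected(c["q_bar_max"], c["n_params"])
               for name, c in candidates.items()}
corrected = {name: aic_mi_corrected(c["q_bar_max"], c["n_params"], c["tr_riv"])
             for name, c in candidates.items()}

best_uncorrected = min(uncorrected, key=uncorrected.get)
best_corrected = min(corrected, key=corrected.get)
print(uncorrected, best_uncorrected)
print(corrected, best_corrected)
```

With these invented values the uncorrected criterion prefers the larger candidate by one unit, and the trace surplus of three units reverses that preference after correction.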
Two qualifications follow from the likelihood-ratio part. First, candidate sets are generally
not nested chains. For pairs with 𝑂(1)-separated pseudo-true values, the design-imbalance
differential of Proposition 5.4 is a genuine 𝑂(1) that Equation 21 does not remove and no
reference absorbs. That differential is small for similar candidates, but it grows with dissimilarity.
Second, the per-model trace corrections are exactly the self-contained levels whose difference
Proposition 5.2 shows can overstate the projected trace. The candidate family studied below
happens to fall at the exact equality case. That is a structural fact, and it is established at the end of
this part.
Selection is where the theorem’s directional content becomes testable. The corrected
criterion makes two pre-registered predictions. Uncorrected MI-AIC should misclassify toward
the candidates with the most missing information about their own parameters. That is the
direction of the differential bias. Adding each candidate’s own trace should remove the tilt and
recover complete-data selection in expectation. Section 6 grades both predictions. It also grades
the recovery the correction achieves and the residual it leaves. The residual matters. A correctly
centered criterion can still select worse than complete-data AIC. The reason is the subject of the
rest of this part.
Selection depends on more than a mean. At the null the limiting law of each anchored
deviance is Í 𝑗𝜆𝑗𝜒2 1 with every weight at least one (Equation 20), so the variance is inflated by Í 𝑗𝜆2 𝑗/𝑞d even after the mean is fixed, and that inflated spread flips rankings. This motivates a
three-step contrast, pre-registered in full before any code was written, that matches the first
moment, the first two moments, or the entire null distribution and then measures what each step
achieves. The working statistic anchors every candidate at the saturated model fit to the same
imputations,
\[
\hat d_k = 2\bigl\{ \bar Q_\infty(\hat\psi^*_{\mathrm{sat}}) - \bar Q_\infty(\hat\psi^*_k) \bigr\} \ge 0.
\]
The companion derivation writes $T_k$ for this statistic, but the deviance notation is used here
because the paper’s $T$ is the total bias of Equation 9. Anchoring costs nothing, because AIC
ranking is invariant to the common shift, and yet it buys three things. Since the anchor is the
congenial imputation model itself, the heavy-tailed realized (𝐴) + (𝐶) component and the
𝑂𝑝(√𝑛) noise cancel dataset by dataset, which is the Proposition 5.3 mechanism applied to every
candidate. Every $\hat d_k$ is a proper likelihood-ratio statistic with a constructible per-model null law,
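whose analytic weights $\lambda_{k,1}, \ldots, \lambda_{k,q_{d,k}}$ are the nonzero eigenvalues of
\[
\bigl\{ I_{\mathrm{com}}^{-1} - G_k (G_k^{\top} I_{\mathrm{com}} G_k)^{-1} G_k^{\top} \bigr\}\, I_{\mathrm{com}} I_{\mathrm{obs}}^{-1} I_{\mathrm{com}},
\qquad \sum_j \lambda_{k,j} = q_{d,k} + \operatorname{tr}(\mathrm{RIV}_{\perp,k}),
\qquad \lambda_{k,j} \ge 1,
\]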
available from the law of Proposition 5.1 without knowing the global truth. Each candidate here is
a block-diagonal zero pattern, so the constrained maximizer is closed-form, and
\[
\hat d_k = N \bigl\{ \log|\hat\Sigma_k| - \log|S^*| \bigr\}
\]
exactly, with $S^*$ the saturated E-step second-moment matrix.
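The closed form can be sketched in a few lines. The sketch assumes that, for a block-diagonal zero pattern, the constrained maximizer is the saturated second-moment matrix with its cross-block entries set to zero, which is the standard result for the normal likelihood. The matrix `s_star`, the block layout, and the sample size are hypothetical, and the helper names are ours.

```python
import math


def log_det_spd(a):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    log_det = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][i] = math.sqrt(s)
                log_det += 2.0 * math.log(chol[i][i])
            else:
                chol[i][j] = s / chol[j][j]
    return log_det


def block_diagonal(s, blocks):
    """Copy of s with every entry outside the listed index blocks set to zero."""
    n = len(s)
    out = [[0.0] * n for _ in range(n)]
    for block in blocks:
        for i in block:
            for j in block:
                out[i][j] = s[i][j]
    return out


def anchored_deviance(s_star, blocks, n_obs):
    """N * (log|Sigma_k| - log|S*|) for a block-diagonal candidate."""
    sigma_k = block_diagonal(s_star, blocks)
    return n_obs * (log_det_spd(sigma_k) - log_det_spd(s_star))


# Hypothetical saturated second-moment matrix and candidate with two 2x2 blocks.
s_star = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.3, 0.0],
    [0.0, 0.3, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
d_two_blocks = anchored_deviance(s_star, [[0, 1], [2, 3]], n_obs=200)
d_saturated = anchored_deviance(s_star, [[0, 1, 2, 3]], n_obs=200)
```

The statistic is zero for the saturated pattern and nonnegative otherwise, because zeroing cross-block entries of a positive-definite matrix cannot lower its determinant.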
Three maps are compared, each built from the analytic null weights. The mean map is the
Equation 17 complement shift,
\[
\hat d_k^{(1)} = \hat d_k - \operatorname{tr}(\mathrm{RIV}_{\perp,k}).
\]
This is a linear per-model shift that telescopes, so every pairwise comparison is simultaneously
mean-corrected. The two-moment map is the affine transformation
\[
\hat d_k^{(2)} = a_k \hat d_k + b_k,
\qquad a_k = \sqrt{\frac{q_{d,k}}{\sum_j \lambda_{k,j}^2}},
\qquad b_k = q_{d,k} - a_k \sum_j \lambda_{k,j}.
\]
This is the unique increasing affine map carrying the null law’s mean and variance onto those of $\chi^2_{q_{d,k}}$.
Uniqueness is an immediate corollary of the moment identities $\mathrm{E}[\sum_j \lambda_j \chi^2_1] = \sum_j \lambda_j$ and
$\mathrm{Var}[\sum_j \lambda_j \chi^2_1] = 2 \sum_j \lambda_j^2$. Matching a misbehaving statistic’s first two moments is the standard repair in the
structural-equation tradition. The Satorra-Bentler scaled difference statistic matches the mean
(Satorra & Bentler, 2010), while the mean-and-variance-adjusted statistics match both from the
same moment inputs tr(𝑀) and tr(𝑀2) (Asparouhov & Muthén, 2006). Here the coefficients are
derived from the analytic null law rather than estimated from sandwich matrices. The third map
matches all moments by equipercentile equating,
\[
\hat d_k^{(3)} = F^{-1}_{\chi^2_{q_{d,k}}} \bigl\{ F_{W_k}(\hat d_k) \bigr\},
\qquad W_k = \sum_j \lambda_{k,j} \chi^2_1.
\]
The equating function from observed-score test equating is $e_Y(x) = G^{-1}[F(x)]$ (Kolen &
Brennan, 2014). We apply it with the analytic null law as $F$ and the complete-data $\chi^2$ as $G$, so by
the probability integral transform the equated statistic matches the entire null distribution. The
weighted-𝜒2 distribution function is evaluated by numerical inversion of the characteristic
function (Davies, 1980; Imhof, 1961). A simulated reference in the style of Chan’s Monte Carlo
null was the pre-registered fallback. The pre-registration predicted the split the data then
delivered. Each step should close more of the null-side gap. The two stronger maps also shrink
evidence against misspecified candidates, the affine map by 𝑎𝑘< 1 and the equating map by
approximately 1/𝜆max in the far tail. The reason is that no transform built from the observed data
can restore destroyed Fisher information.
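The first two maps need only the analytic null weights, so they can be sketched directly. The weights below are hypothetical, the function names are ours, and the number of weights is taken to be the tested dimension. The sketch assumes independent one-degree-of-freedom chi-square components.

```python
import math


def null_moments(weights):
    """Mean and variance of sum_j weights[j] * chi2_1 with independent terms."""
    mean = sum(weights)
    variance = 2.0 * sum(w * w for w in weights)
    return mean, variance


def mean_map(d, weights):
    """Subtract the excess of the null mean over the tested dimension q."""
    q = len(weights)
    return d - (sum(weights) - q)


def two_moment_coefficients(weights):
    """Slope a and intercept b matching the mean q and variance 2q of chi2_q."""
    q = len(weights)
    a = math.sqrt(q / sum(w * w for w in weights))
    b = q - a * sum(weights)
    return a, b


def two_moment_map(d, weights):
    a, b = two_moment_coefficients(weights)
    return a * d + b


# Hypothetical null weights, each at least one, for four tested directions.
weights = [1.2, 1.5, 2.0, 2.6]
null_mean, null_variance = null_moments(weights)
a_k, b_k = two_moment_coefficients(weights)
```

Because every weight is at least one, the slope never exceeds one, which is the source of the noncentral shrinkage by the affine map.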
The pre-registration states what each map should achieve, and the predictions split along
the null/noncentral axis. On the null side, each map should close more of the calibration gap than
the one before it, and matching the full null distribution should make selection indistinguishable
from complete-data AIC wherever overfit flips drive the errors. On the noncentral side, the two
stronger maps should shrink the evidence against misspecified candidates by factors computable
in advance from the null weights. The affine map shrinks it by 𝑎𝑘, while the equating map shrinks
it by approximately 1/𝜆max in the far tail. Where the complete-data benchmark itself struggles, by
contrast, no observed-data correction should close the remaining gap, because the shortfall there
is information loss rather than miscalibration. All three predictions are reported in Section 6,
together with the frozen pass criteria they were assessed against.
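The equating map and its shrinkage can be illustrated with a simulated reference. This sketch swaps the numerical inversion of the characteristic function for Monte Carlo draws, in the spirit of the simulated fallback, so it is an approximation and not the procedure the studies ran. The function names, the number of draws, and the weights are assumptions. With equal weights the null law is an exact multiple of the complete-data chi-square, so the equating map should reduce to division by the common weight, which makes the shrinkage easy to check.

```python
import bisect
import random


def simulate_weighted_chi2(weights, n_draws, rng):
    """Sorted draws of sum_j weights[j] * Z_j**2 with independent N(0, 1) Z_j."""
    return sorted(
        sum(w * rng.gauss(0.0, 1.0) ** 2 for w in weights)
        for _ in range(n_draws)
    )


def make_equating_map(weights, n_draws=100_000, seed=0):
    """Build d -> G^{-1}[F(d)] from simulated null (F) and target (G) draws."""
    rng = random.Random(seed)
    null_draws = simulate_weighted_chi2(weights, n_draws, rng)
    target_draws = simulate_weighted_chi2([1.0] * len(weights), n_draws, rng)

    def equate(d):
        # Empirical null CDF at d, expressed as a count of draws <= d.
        count = bisect.bisect_right(null_draws, d)
        # Matching empirical quantile of the complete-data chi-square.
        return target_draws[min(n_draws - 1, count)]

    return equate


# Equal hypothetical weights: the exact map is d / 2 at every point.
equate = make_equating_map([2.0, 2.0])
mid_value = equate(4.0)    # should be close to 2
tail_value = equate(12.0)  # should be close to 6
```

For unequal weights the same construction applies, but the shrinkage then varies along the scale and approaches one over the largest weight only in the far tail.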
Two structural limits close this part. Both are proved. First, block-diagonal zero patterns
make the naive trace difference and the exact projected trace coincide. Such constraints
decompose both information matrices over within-block and cross-block coordinates. This is
exactly the equality condition of Proposition 5.2. The overstatement of Equation 18 is therefore
invisible in block-diagonal candidate families. It is material only for constraints that do not
block-decompose, such as the mean restriction in the design of Proposition 5.1. Second,
per-model marginal transforms cannot calibrate a difference distribution. The dependence
between two candidates’ scores is invariant under maps applied to each score separately. So no
per-model map controls the law of their difference. And near-tied comparisons remain
uncalibrated after equating. Selection-aware refinements are left to future work. Both limits are
exhibited numerically in Section 6. As in the likelihood-ratio part, everything here is stated for the
deterministic-FIML ¯𝑄∞. The anchoring cancellations are properties of sharing the imputation
model. So the construction carries to proper MI unchanged. The proper-MI check is part of the
pre-registered evidence in Section 6.
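The second limit can be seen in a short sketch. It checks that strictly increasing maps applied to each score separately leave the rank dependence between two scores unchanged, which is why no per-model map can reshape the law of their difference. The simulated scores, the two maps, and the helper names are hypothetical, and rank correlation stands in for the full dependence structure.

```python
import math
import random


def ranks(values):
    """Rank positions 0..n-1 of the values, assuming no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0] * len(values)
    for rank, index in enumerate(order):
        out[index] = rank
    return out


def spearman(x, y):
    """Spearman rank correlation for tie-free samples of equal length."""
    rx, ry = ranks(x), ranks(y)
    centre = (len(x) - 1) / 2.0
    cov = sum((a - centre) * (b - centre) for a, b in zip(rx, ry))
    var = sum((a - centre) ** 2 for a in rx)
    return cov / var


rng = random.Random(1)
shared = [rng.gauss(0.0, 1.0) ** 2 for _ in range(2000)]
score_a = [s + rng.gauss(0.0, 1.0) ** 2 for s in shared]
score_b = [s + 2.0 * rng.gauss(0.0, 1.0) ** 2 for s in shared]

# Two different strictly increasing per-score maps (hypothetical choices).
mapped_a = [0.6 * d + 1.1 for d in score_a]
mapped_b = [math.sqrt(d) for d in score_b]

rho_before = spearman(score_a, score_b)
rho_after = spearman(mapped_a, mapped_b)
```

The marginal laws change under the maps while the rank correlation does not, so calibration of each score leaves the joint behavior of near-tied comparisons untouched.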
6 Simulation studies
Simulation studies: setup
This section is the paper’s complete empirical account. Every quantitative claim in
Section 5 points here, and every number here traces to one study. Before any code was written,
each study fixed its predictions and pass criteria (Section 4). The setup subsection states each
study as a stand-alone design that another analyst could reproduce, while the results subsection
presents each study as a figure or a table and includes the predictions that were not met. A reader
can enter at any single exhibit, because each one names the perfect reference and the value the
study achieved. Everything here carries the measured grade of Section 4 unless stated otherwise.
Apparatus. The designs are multivariate normal in four dimensions, with one bivariate
special case. Two imputation arms run throughout. The deterministic arm evaluates the averaged
log-likelihood $\bar Q_\infty$ in closed form at the full-information maximum-likelihood estimate $\hat\theta_{\mathrm{obs}}$. This
arm is the expectation-maximization Q-function with no simulation error. The proper arm draws
imputations by expectation-maximization with bootstrapping, as implemented in Amelia
(Honaker et al., 2011). That sampler runs expectation-maximization on bootstrap resamples of the
incomplete data. It approximates the posterior of (𝜇, Σ) and then draws completions at those
values. The exact alternative is data augmentation. We did not run it. The engine-sensitivity study
below bounds what that choice costs.
Assessment and registration. Each study is assessed against pass criteria fixed before its
code existed, and the pre-registration documents together with their dated amendment histories
record those criteria. The bias-decomposition, likelihood-ratio, and selection studies were
registered together, while the distribution-matching ladder and the non-nested measurement each
carry their own registration. Both the known-scale and sign-regime runs are dedicated
single-question designs. Two reporting rules hold throughout. The first rule is that failed
predictions appear here in the main text beside the successes. The second rule is that every
quantitative claim is stated for totals and paired differentials rather than for components, and the
first study makes the reason for that second rule concrete.
Theorem-validation design. The data-generating model is a four-variate normal with
zero mean and unit variances. Three coordinate pairs have correlations 0.6, 0.3, and 0.5, and the
rest are zero, which gives a relative-increase-in-variance trace near 5.55. Missingness falls on the
first two coordinates at random, with probabilities Φ(−0.5 + 0.4𝑋3) and Φ(−0.5 + 0.4𝑋4), so the
two fully observed coordinates drive it. The resulting pattern is non-monotone at about one third
missing per coordinate. Sample sizes are 𝑁∈{200, 500, 1000, 2000}, with 1000 repetitions each.
The estimand is the net deviance bias E[𝑇], which we carry on the analytic arm so that it is free of
simulation error. We pre-registered a target of one half the trace, near 2.77, but refined it during
analysis to the order-one-augmented band 2.42 to 2.55 once the missing-at-random term was
derived. The study passes if the inverse-variance pooled estimate falls in that band.
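The stated missingness rate can be checked from the mechanism alone. For a standard normal driver, the expected value of Φ(a + bX) equals Φ(a / √(1 + b²)), which gives about 0.32 per coordinate here. The sketch below also simulates the pattern counts. It assumes the two driving coordinates are independent standard normals, which affects only the joint pattern counts and not the marginal rates. The function names are ours.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def expected_missing_rate(intercept, slope):
    """E[Phi(intercept + slope * X)] for X ~ N(0, 1)."""
    return PHI(intercept / math.sqrt(1.0 + slope ** 2))


def simulate_patterns(n, seed=0):
    """Count missingness patterns on the first two coordinates."""
    rng = random.Random(seed)
    counts = {"none": 0, "first_only": 0, "second_only": 0, "both": 0}
    for _ in range(n):
        # Assumption: X3 and X4 drawn as independent standard normals.
        x3, x4 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        miss1 = rng.random() < PHI(-0.5 + 0.4 * x3)
        miss2 = rng.random() < PHI(-0.5 + 0.4 * x4)
        if miss1 and miss2:
            counts["both"] += 1
        elif miss1:
            counts["first_only"] += 1
        elif miss2:
            counts["second_only"] += 1
        else:
            counts["none"] += 1
    return counts


n_sim = 20_000
analytic_rate = expected_missing_rate(-0.5, 0.4)
counts = simulate_patterns(n_sim)
rate_first = (counts["first_only"] + counts["both"]) / n_sim
rate_second = (counts["second_only"] + counts["both"]) / n_sim
```

Both single-coordinate patterns occur, so neither missingness indicator nests in the other, which is what makes the pattern non-monotone.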
Known-scale and sign-regime design. Two single-question runs close the theorem. The
known-scale run holds the covariance at its true value and fits only the mean, so the conditional
missing-data entropy no longer depends on the estimated parameters. This run uses the
four-variate normal above at $N = 200$ with $2 \times 10^5$ repetitions, on both the deterministic and the
proper arm. The estimand is again E[𝑇]. Here Equation 14 predicts zero on the deterministic arm
and $-\tfrac{1}{2}\operatorname{tr}(\mathrm{RIV})$ on the proper arm. The sign-regime run imputes from the true model rather than
the fitted model, and it checks that the total reverses sign, as the fitted-model-versus-true-model
distinction of Section 5 requires. Each run passes if its arms reproduce the predicted values within
Monte Carlo error.
Likelihood-ratio design. The test is 𝐻0 : 𝜎12 = 0 in the four-variate normal, and the local
alternative is 𝜎12 = 𝛿/√𝑛on the grid 𝛿∈{0, 0.5, 1, 1.5, 2, 2.5, 3, 4}. At 𝑁= 200, 𝑀= 200
imputations are drawn from the proper arm, and 1000 repetitions are run per grid point. Four
references are compared, each scored by its rejection rate at level 0.05. The complete-data
benchmark refers its statistic to $\chi^2_1$, while the corrected and uncorrected numerators are referred to
a simulated reference distribution. By contrast, the Satorra-Bentler arm applies a
scaled-and-shifted statistic (Asparouhov & Muthén, 2010) referred to $\chi^2_1$, in the scaled and
adjusted chi-square difference tradition (Asparouhov & Muthén, 2006; Satorra & Bentler, 2010).
The null arm passes if it lies near nominal, with the corrected numerator slightly conservative as
double-counting predicts.
Formula-discrimination design. This run separates the two candidate formulas for the
differential bias. The two formulas agree under equal missing information, but they diverge when
the missing information is uneven. The test is 𝐻0 : 𝜇1 = 0 in the four-variate normal. Here the first
coordinate is made heavily missing at random through Φ(0.8 + 1.2𝑋3), about 70 to 79 percent,
while the second coordinate is essentially complete and the rest are fully observed. One direction
is tested, and the sample sizes are 𝑁∈{500, 1000} with 2000 repetitions on the deterministic arm.
The metric is the paired differential bias. On that metric the projected trace of Equation 17
predicts about 2.64, while the naive trace difference predicts about 8.5, so the run passes if the
data land on the projected-trace value.
Pairing design. This run shows that comparing the same imputations under both models
collapses the per-dataset noise. The design is the four-variate 𝜎12 = 0 cell with one tested
direction, run at 𝑁∈{500, 1000, 2000} with 2000 repetitions. Of the three arms, the null and the
local alternative hold the differential at order one, while a fixed alternative is included as the
failure mode. The metric contrasts the single-model deviance standard deviation with the
paired-differential standard deviation, and the run passes if pairing holds the differential standard
deviation flat at the null while the single-model figure grows with the sample size.
Selection-sweep design. The candidate set is four nested multivariate-normal covariance
models, ranging from a diagonal model up to the saturated model, while the truth has a
compound-symmetry structure. The sweep covers 60 cells crossing four factors. The factors are
two missingness patterns, the two mechanisms missing at random and not at random, three sample
sizes 𝑁∈{200, 500, 1000}, and five engine slots that pair the deterministic arm with proper
imputation at 𝑀∈{20, 200} under congenial and uncongenial specification. Missingness is set
near 40 percent, and each cell uses 2000 repetitions. The metric is the true-model selection rate by
the Akaike criterion, computed for the complete-data benchmark, which is the result an analyst
would reach with no missing data, and then for the uncorrected imputation criterion and the
corrected criterion. The headline cell is non-monotone missing-at-random on the deterministic
arm. The sweep passes on two conditions. The correction must move selection toward the
complete-data benchmark, and the uncorrected criterion’s errors must concentrate on the
largest-missing-information candidate.
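The factorial layout can be enumerated to confirm the cell count. The labels below are our reading of the design and are assumptions: the second missingness pattern is taken to be monotone, and the five engine slots are read as the deterministic arm plus proper imputation at two values of M under each specification.

```python
from itertools import product

# Assumed labels; only the counts and the headline cell come from the text.
PATTERNS = ("monotone", "non-monotone")
MECHANISMS = ("MAR", "MNAR")
SAMPLE_SIZES = (200, 500, 1000)
ENGINES = (
    "deterministic",
    "proper-M20-congenial",
    "proper-M200-congenial",
    "proper-M20-uncongenial",
    "proper-M200-uncongenial",
)

cells = list(product(PATTERNS, MECHANISMS, SAMPLE_SIZES, ENGINES))

# Headline cells: non-monotone, missing at random, deterministic arm.
headline = [
    cell for cell in cells
    if cell[0] == "non-monotone" and cell[1] == "MAR" and cell[3] == "deterministic"
]
```

Under this reading the headline configuration appears once per sample size.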
Distribution-matching ladder design. This study takes the same candidate family and
contrasts four ways of referring the saturated-anchored imputation deviance to its complete-data
law, with the contrast spanning a range of signal strength. Three cells run at 𝑁= 500 with 2000
repetitions and four tested directions, and all three are non-monotone missing at random near 40
percent on the deterministic arm. The first cell is a main design at correlation 0.40, while the
second is a weak-signal design at 0.10 with near-tied candidates, and the third is a junk design at 0
where the truth is diagonal. Five constructions are then compared against these cells. The first
three are the complete-data benchmark, the uncorrected deviance, and a mean correction that
subtracts the projected trace of Equation 17, while the remaining two are a two-moment match to
the reference and a per-model equipercentile equating. The metric is the true-model selection rate,
and null-side distance and variance checks support that rate, along with the noncentral shrinkage
factors. The study passes its pre-registered criteria on three conditions. The null side calibrates,
the stronger arms reach the complete-data benchmark where overfit flips drive the errors, and the
noncentral side shrinks by the predicted factors.
Non-nested design. This measurement isolates the order-one design-imbalance
differential of Proposition 5.4 for genuinely separated candidates. That differential is the one
quantity the derivations leave unmeasured. The truth is a four-variate first-order autoregressive
covariance at correlation 0.5, and two candidate pairs run. The similar pair sets compound
symmetry against the autoregressive model, while the dissimilar pair sets a diagonal model
against it. Each cell runs at 𝑁∈{500, 1000, 2000} with 20,000 repetitions, and is also paired with
a missing-completely-at-random twin that removes the design-imbalance term. The metric is the
isolated differential, the missing-at-random paired residual minus its completely-at-random twin.
Three separate points carry the assessment. The mechanism passes if the level is order one and
collapses under the twin, while the dissimilar pair passes if its differential resolves at the predicted
scale. The pre-registered similar-pair headline is assessed honestly against the resolution the
design affords.
Simulation studies: results
Theorem validation. The net deviance bias tracks the analytic half-trace once the sample
sizes are pooled (Figure 1). We preregistered the target as the leading-order half-trace, near 2.77,
but the completed theory of Section 5 adds a second term to that target. That second term is order
one in size and appears only under missing-at-random data, where it is about −0.22. We derived it
independently from the missingness mechanism after the run and recorded it as a dated
amendment. Adding it shifts the prediction to the band 2.42 to 2.55. The inverse-variance pooled
estimate is 2.43 ± 0.26 (one standard error), which falls inside the augmented band. Its two-standard-error interval also contains the
leading-order target 2.77, which lies 1.3 standard errors away, so the data are consistent with both
predictions. One cell is a genuine miss, because at 𝑁= 1000 the estimate falls 2.6 standard errors
below the leading-order target. The cause is the per-repetition statistic, which has heavy tails, so its
standard deviation grows with the sample size. At a fixed repetition count the larger cells therefore
cannot resolve an order-one offset. The component terms each miss their own targets by roughly 9
to 20 percent, which is why the assessment reports only the totals and the paired differentials.
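The arithmetic behind these statements is short enough to write out. It uses only the reported values: the trace near 5.55, the missing-at-random term near −0.22, and the pooled estimate with its standard error.

```latex
\[
\tfrac{1}{2}\operatorname{tr}(\mathrm{RIV}) \approx \tfrac{1}{2}(5.55) \approx 2.77,
\qquad
2.77 - 0.22 = 2.55,
\qquad
\frac{2.77 - 2.43}{0.26} \approx 1.3 .
\]
```

The first value is the leading-order target, the second is the upper edge of the augmented band, and the third is the distance of the pooled estimate from the leading-order target in standard errors.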
Known scale and sign regime. Holding the covariance known and fitting only the mean
collapses the bias to the two values Equation 14 predicts (Figure 2). The deterministic arm returns
0.025 ± 0.018 against a target of zero, while the proper arm returns −0.536 ± 0.018 against
$-\tfrac{1}{2}\operatorname{tr}(\mathrm{RIV}) = -0.561$. Both arms land within Monte Carlo error of their targets, and the two arms
differ by exactly the posterior-draw imputation variance. Imputing from the true model rather than
the fitted model reverses the sign of the total, and that sign reversal is what the
fitted-model-versus-true-model distinction requires.
Likelihood-ratio absorption. A correctly calibrated reference absorbs the differential
bias at the null, so the test controls its Type I error (Figure 3). The uncorrected numerator rejects
at 0.042 and the additionally corrected numerator at 0.034. Both lie near the nominal 0.05. The
corrected one is slightly conservative, as double-counting predicts. Raw power across the
alternative is not compared here. The four references reject at different Type I error, so a raw
power comparison would be confounded. The size-adjusted power arms are the partial support
cited for the power conjecture of Section 7. One scope note travels with this study. Its committed
correction arm used the naive trace difference. That difference is about 5 percent from
Equation 17 in this near-invariance design and is immaterial here. Still, Equation 17 is the correct
general form.
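How near the two rejection rates sit to nominal can be gauged with a binomial standard error. The sketch assumes 1000 independent repetitions at the null grid point, as in the design, and evaluates the standard error at the nominal level. The function names are ours.

```python
import math


def binomial_se(p, n_reps):
    """Standard error of a rejection rate with true rate p over n_reps runs."""
    return math.sqrt(p * (1.0 - p) / n_reps)


def z_from_nominal(rate, nominal=0.05, n_reps=1000):
    """Distance of an observed rejection rate from nominal, in standard errors."""
    return (rate - nominal) / binomial_se(nominal, n_reps)


se_nominal = binomial_se(0.05, 1000)
z_uncorrected = z_from_nominal(0.042)
z_corrected = z_from_nominal(0.034)
```

On this scale the uncorrected numerator lies a little over one standard error below nominal and the additionally corrected numerator a little over two, which is the conservative direction that double-counting predicts.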
Discrimination of the two formulas. This run uses an adversarial design that makes one
coordinate 70 to 79 percent missing. That design separates the two candidate formulas as widely
as it allows (Figure 4). The observed differential is 2.64 ± 0.11 at 𝑁= 500 and 2.70 ± 0.11 at
𝑁= 1000. Both match the projected-trace prediction of 2.64. Correctness rests on that match.
The same statistic lies 52 standard errors from the naive trace difference. That gap excludes the
naive formula in this engineered cell. That the naive difference never understates the projected
trace, with equality under the block-decomposition condition, is the proved general statement of Proposition 5.2. The size of the gap here reflects the
maximal-separation design rather than typical balanced missingness.
Pairing. Comparing the same imputations under both models holds the per-dataset noise
at order one (Figure 5). At the null the paired-differential standard deviation stays near 1.5 across
𝑁= 500, 1000, and 2000. The single-model figure instead grows from 18.4 to 36.5. At a fixed
alternative the square-root-of-𝑛growth returns, rising to 12.4, 16.8, and 24.9. This is exactly the
failure mode the result specifies.
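The square-root-of-n reading of these figures can be checked by projecting the smallest-sample value forward. The sketch assumes the quoted endpoints belong to 𝑁 = 500 and 𝑁 = 2000, and the function name is ours.

```python
import math


def sqrt_n_projection(sd_at_base, n_base, n_new):
    """Standard deviation implied at n_new if it grows like sqrt(n)."""
    return sd_at_base * math.sqrt(n_new / n_base)


# Single-model deviance: reported 18.4 at the smallest N and 36.5 at the largest.
single_projected = sqrt_n_projection(18.4, 500, 2000)

# Fixed alternative: reported 12.4, 16.8, and 24.9 across the three sample sizes.
fixed_projected_1000 = sqrt_n_projection(12.4, 500, 1000)
fixed_projected_2000 = sqrt_n_projection(12.4, 500, 2000)


def relative_gap(projected, reported):
    return abs(projected - reported) / reported
```

The projections land within a few percent of the reported values, while the paired differential at the null stays flat and so shows no such growth.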
Selection. The uncorrected criterion favors high-missing-information models, and the
correction moves selection back toward the complete-data benchmark (Figure 6). At 𝑁= 500 on
the deterministic arm, the complete-data benchmark selects the true model at 0.90, while the
uncorrected criterion selects it at 0.67 and the corrected criterion at 0.82. In our simulations,
every misclassification by the uncorrected criterion fell on the saturated,
largest-missing-information candidate, 100% in all sixty cells. This directional pattern was
registered for the congenial cells and held in the uncongenial ones as well. Because the candidates
are nested and ordered by missing information, the saturated model is the natural destination for
any downward-biased deviance, so the error pattern confirms the predicted direction of the bias
rather than discriminating the specific decomposition. The recovery is substantial but visibly short
of the complete-data benchmark, and that gap is what the distribution-matching ladder was
registered to explain.
The distribution-matching ladder. Across three levels of signal strength the stronger
constructions recover the complete-data benchmark where the errors come from choosing too rich
a model. A floor remains where selection is genuinely hard (Figure 7). The true-model selection
rates are reported in Table 1, with Monte Carlo standard error near 0.010.
At correlation 0.40 the two-moment and equating arms reach 0.904 and 0.903 against a
complete-data benchmark of 0.899, and the differences are within twice the Monte Carlo standard
error of about 0.010. Both arms are therefore statistically indistinguishable from the benchmark,
with a small overshoot the same interval does not resolve. The junk cell closes 90 percent of the
gap from uncorrected to benchmark. In the weak-signal cell the benchmark itself drops to 0.820,
and no arm passes 0.650. We read that floor as a limit of the information in the data rather than a
calibration error, and the calibration checks below support that interpretation, though they do not
separately decompose it in the weak-signal cell. The residual is a measurement on these designs
rather than a theorem.
Ladder internals. Two internal measurements show why the ladder works (Figure 8). On
the true model’s anchored statistic the distance to the paired complete-data statistic falls from
0.281 under the uncorrected deviance to 0.019 under equating. Over the same arms the variance
ratio falls from 3.18 to near one. The mean correction’s paired gap of 0.076 ± 0.100 is a direct
confirmation of Equation 17 at four tested directions. On the underfit candidate the noncentral
statistic shrinks by its two predicted factors. The two-moment match shrinks it by the factor 𝑎𝑘,
predicted near 0.56 and measured at 0.59. Equating shrinks it by 1/𝜆max, predicted near 0.40 and
measured at 0.36. The internal validity checks held on every repetition. The spectrum-trace
identity held to $3 \times 10^{-15}$, and the reference inversion never failed in 18,000 evaluations.
The structural limits. Three structural checks of Section 5 appear directly in the data.
They are the equality case where the naive and projected traces coincide, the cost of the naive
moment-map input, and the limit of a per-model map. Table 2 reports them on the selection rates,
with Monte Carlo standard error near 0.010.
Two points follow. In this block-diagonal candidate family the naive and projected traces
coincide, to $3.6 \times 10^{-15}$, so the family cannot by itself separate the two trace formulas. The 0.824
against 0.904 gap in the second row is therefore a cost of the naive moment-map input, not
evidence about the trace formulas. Where the trace formulas actually do differ, the proof of
Proposition 5.2 establishes that the naive trace difference overstates the projected trace, and the
non-nested measurement of Figure 9 supplies the measured off-equality case, about 9 percent.
The third row is separate again, and it shows that a per-model marginal map cannot calibrate the
joint difference distribution.
The non-nested measurement. This study checks three claims at once (Figure 9). The
mechanism behind the design-imbalance term is confirmed, the predicted scaling resolves for a
dissimilar pair, and the pre-registered headline for the similar pair stays below resolution. Each
candidate’s order-one level is large under missing-at-random data, near +2.6 and +2.9 deviance
units for the two candidates, but that level collapses under the completely-at-random twin, most
sharply for the autoregressive candidate. For the dissimilar diagonal-versus-autoregressive pair
the isolated differential resolves at −1.2, lying three and a half standard errors from zero and
staying stable across sample sizes. For the similar compound-symmetry-versus-autoregressive
pair, by contrast, the differential lies below Monte Carlo resolution. A pilot’s apparent value of −2
dissolved in the full run. We report that failure as the discipline requires. As a byproduct, the
analysis gave the first measured case off the equality condition of Proposition 5.2, where the naive
trace difference overstates the exact projected trace by about 9 percent.
Engine sensitivity. The bootstrap imputation sampler reproduces the analytic results on
the most demanding cells, which indicates the cost of not running data augmentation is small on
these designs. Its hardest case is the non-monotone missing-at-random cells with samples as
small as 𝑁= 200, where the bootstrap approximation is weakest. Even there the sample
relative-increase-in-variance trace from the imputation draws matches the observed-data value
within 2 to 3 percent, with no widening as the sample grows. Selection also agrees between the
analytic arm and the imputation arm within 0.011 on the recovery gap. This robustness is
empirical and scoped to the designs studied. The bootstrap sampler is only approximately
Bayesian and data augmentation was not run, so a fully Bayesian sampler remains the
conservative choice when the fraction of missing information is large (Section 7).
7 Discussion
The result is scoped by its own mechanism. The scope comes first. The deviance
optimism that inflates the criteria exists only when the fit estimates a scale or covariance. In that
case the conditional missing-data entropy depends on the estimated parameters. A known-scale,
location-only fit has no such bias under deterministic FIML and only $-\tfrac{1}{2}\operatorname{tr}(\mathrm{RIV})$ under proper MI.
That is why the estimated-scale clause sits in the theorem rather than in a footnote. Congeniality is
assumed throughout. What the bias becomes without it (Meng, 1994) is a separate question. This
paper does not take it up. Beyond these scope conditions, seven specific limits remain. Each is stated
next to the claim it qualifies, with its evidential standing. They are followed by one conjecture,
three directions, and the answer to the question the paper opened with.
G1. The absolute magnitude of (𝐴) + (𝐶) is not independently determined. This qualifies
the theorem. The structural facts are firm. Its sign, its 𝑂(1) order, its exact MCAR vanishing, the
closed form Equation 13, and the 𝑂(1/𝑛) order of the correction beyond the leading-order
analytic value are all proved. The absolute magnitude is not. The direct Monte Carlo estimates are
heavy-tailed and untrended. The variance-reduced estimate of the higher-order remainder is
conditional on the analytic anchor it is paired against. Every comparison in this paper therefore
uses (𝐴) + (𝐶) only through the better-conditioned paired differential. G2. The proper-MI
carry-over of the likelihood-ratio propositions is structural, not separately measured. This
qualifies the likelihood-ratio part of Section 5. The propositions are proved and confirmed for the
deterministic-FIML ¯𝑄∞. Under proper MI the imputation-side quantities remain common to both
fits. This preserves the cancellation arguments. But no dedicated proper-MI replication of the
differential experiments exists.
G3. Imputation-engine robustness is empirical and scoped, which qualifies Section 6. On
the designs studied, the EMB engine tracks deterministic FIML within the stated tolerances.
Exact data augmentation was not run, so nothing here establishes engine-independence beyond
those designs. G4. The weak-signal floor is a measurement, not an impossibility proof, and this
qualifies the selection part of Section 5. In the weak-signal cell, the residual is information loss as
measured through these corrections. No argument here shows that no estimator built from the
same observed data could do better.
G5. For dissimilar candidate pairs, a mechanism-independent misspecification 𝑂(1)
coexists with the design-imbalance 𝑂(1). This qualifies Proposition 5.4 and its use in selection.
The MCAR contrast separates the two in measurement. But they can reinforce or partially offset
each other in the net ranking bias a criterion sees. We make no claim that the design-imbalance
term dominates every pair. G6. All instantiation is multivariate normal. This qualifies the
theorem. The theorem is stated for general regular likelihoods with an estimated scale. But every
constant is verified, symbolically and by simulation, only in the normal family.
G7. The correction’s guarantees assume congenial imputation. Uncongeniality is the
practical boundary. The imputation model can shrink the cross-block correlations that the analysis
estimates. A strong ridge prior does this. When it happens the deviance bias no longer points the
way the theory assumes. The sweep includes uncongenial cells, and they show the effect. There
the uncorrected criterion already selects the true model more often than the complete-data
benchmark, near 0.93 against the benchmark’s 0.91. The correction then pushes past the
benchmark to about 0.98. That is an overshoot, not a recovery. We registered this behavior as an
observational stress test, not as a pass target. It marks where the framework’s safety margin ends.
The practical recommendation is to diagnose congeniality before trusting bias-corrected selection.
One question is left as a conjecture. We label it as one. We conjecture that the
bias-corrected likelihood-ratio comparison dominates its uncorrected counterpart in power
uniformly. The preregistered likelihood-ratio study’s power arms provide partial support
(Section 6). We state no theorem.
Three directions seem most worth pursuing. The first is covariate shift. The concluding
section of Shimodaira and Maeda (2018) names the combination of a missing mechanism with
other sampling mechanisms as future work. The design-imbalance term derived here is nonzero
exactly when the missing and observed units differ on the conditioning variables. This is a step
into that program. The weighting machinery exists (Shimodaira, 2000). The second is an
exact-engine replication. Running the headline studies under data augmentation would convert G3
from an empirical tolerance into a tested equivalence. The third is the calibration program beyond
per-model null maps. De-shrinkage of estimated noncentrality and joint calibration of score
vectors across candidates are left to future work. So is the extension beyond the normal family.
The paper closes where it began, with the replication principle. The question was whether
an analyst working from multiply imputed data reaches the same conclusions, as often, as the
analyst who never lost the data. The answer now has three parts. The first part concerns the
criterion itself. The answer is yes. After correction, a deviance or information criterion computed
from imputed data means what its complete-data counterpart means, on average, at any signal
strength. The second part concerns decisions when the competing models fit alike. The answer is
again yes. There, corrected selection and calibrated tests behave as they would have with complete
data. The third part concerns decisions when one model genuinely fits better. There the answer is
no. The missing data carried part of the evidence. Less information means less power. No
correction studied here manufactures the lost evidence back. The practical summary for an
applied reader is short. Correct the criterion, or match its null distribution. Then trust the result
the way complete-data results are trusted, subject to the stated conditions. The one exception is
where the comparison was close enough that the missing data could have decided it. There, the
honest answer is that the data no longer say.
Two implications for applied practice follow. The first is statistical. When the aim is to
rank models on imputed data, and no calibrated reference distribution already absorbs the bias,
add the per-candidate trace correction before reading off the result. Information-criterion
selection is the main such case. The exception is a calibrated likelihood-ratio test, where the
reference distribution already carries the null mean, so a numerator correction applied on top
would double-count and should be left out. For a nested comparison without a calibrated
reference, the relevant bias is the missing information in the tested directions alone, not the
difference of the candidates’ separate trace corrections. The second implication is about method.
An applied team deciding whether to rely on a human-prompted AI derivation can ask for the
same evidence required here. That evidence is citations checked against their sources,
independent symbolic checks, preregistered numerical criteria with their failures reported,
adversarial review by a separate model, and a reproducible record. What makes such a result safe
to use is the standard it meets, not the name of the system that produced it.
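As a minimal sketch of the first implication, the per-candidate correction can be computed from standard multiple-imputation output. The code assumes the usual Rubin-rules form of the relative-increase-in-variance matrix, RIV = (1 + 1/M) B Ū⁻¹, with B the between-imputation covariance of the estimates and Ū the average within-imputation covariance; the finite-M factor is an assumption here. It also assumes the correction enters on the deviance scale as a plus tr(RIV) term, since the averaged log-likelihood overstates by half the trace. The sketch omits the design-imbalance term, which vanishes when data are missing completely at random. The function names `trace_riv` and `corrected_aic` are hypothetical.

```python
import numpy as np


def trace_riv(estimates, covariances):
    """Return tr(RIV) under the assumed form RIV = (1 + 1/M) * B @ inv(U_bar).

    estimates:   (M, k) per-imputation parameter estimates.
    covariances: (M, k, k) per-imputation covariance matrices.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(covariances, dtype=float)
    m = q.shape[0]
    u_bar = u.mean(axis=0)                              # average within-imputation covariance
    b = np.atleast_2d(np.cov(q, rowvar=False, ddof=1))  # between-imputation covariance
    # tr(B @ inv(U_bar)) equals tr(inv(U_bar) @ B); solve avoids an explicit inverse.
    return (1.0 + 1.0 / m) * np.trace(np.linalg.solve(u_bar, b))


def corrected_aic(logliks, n_params, tr_riv):
    """Averaged deviance plus the assumed trace correction plus the AIC penalty."""
    deviance = -2.0 * np.mean(logliks)
    return deviance + tr_riv + 2.0 * n_params


# Toy input: M = 3 imputations, k = 2 parameters (illustrative numbers only).
est = [[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]]
cov = [2.0 * np.eye(2)] * 3
t = trace_riv(est, cov)
print(t, corrected_aic([-100.0, -102.0, -101.0], n_params=2, tr_riv=t))
```

The sketch applies to information-criterion ranking only. It does not implement the projected trace for nested comparisons, and it should not be stacked on a calibrated likelihood-ratio test, for the double-counting reason given above.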
A final word on the workflow, held to the same standard as everything it produced. Three
catches are on the record. The protocol caught a sign error that eight of nine blind re-derivations
shipped, caught a wrong shortcut in an early entropy-gradient argument, and forced every failed
prediction into Section 6. The cost is also visible. Claims arrived more slowly and hedged to their grade,
and two preregistered headlines were given up rather than rescued. What the protocol cannot catch was
stated in Section 4 and bears repeating once. It cannot catch errors shared across model lineages,
misreadings of real sources, or designs that answer the wrong question reproducibly. Whether this
workflow generalizes beyond one paper is asserted, not demonstrated. What it leaves behind is a
public record. The decision log, the preregistrations with their amendments, the verification
checks, and the session transcripts are all published. So the assertion can be tested by someone
other than its authors.
8 References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on
Automatic Control, 19(6), 716–723. https://doi.org/10.1109/TAC.1974.1100705
Asparouhov, T., & Muthén, B. (2006). Robust Chi Square Difference Testing with Mean and
Variance Adjusted Test Statistics: Webnote 10.
http://www.statmodel.com/examples/webnotes/webnote10.pdf
Asparouhov, T., & Muthén, B. (2010). Simple Second Order Chi-Square Correction.
Cavanaugh, J. E., & Shumway, R. H. (1998). An Akaike information criterion for model selection
in the presence of incomplete data. Journal of Statistical Planning and Inference, 67(1),
45–65. https://doi.org/10.1016/S0378-3758(97)00115-8
Chan, K. W. (2022). General and feasible tests with multiply-imputed datasets. The Annals of
Statistics, 50(2). https://doi.org/10.1214/21-AOS2132
Chan, K. W., & Meng, X.-L. (2022). Multiple Improvements of Multiple Imputation Likelihood
Ratio Tests. Statistica Sinica. https://doi.org/10.5705/ss.202019.0314
Claeskens, G., & Consentino, F. (2008). Variable Selection with Incomplete Covariate Data.
Biometrics, 64(4), 1062–1069. https://doi.org/10.1111/j.1541-0420.2008.01003.x
Consentino, F., & Claeskens, G. (2010). Order selection tests with multiply imputed data.
Computational Statistics & Data Analysis, 54(10), 2284–2295.
https://doi.org/10.1016/j.csda.2010.04.009
Davies, R. B. (1980). Algorithm AS 155: The Distribution of a Linear Combination of χ²
Random Variables. Applied Statistics, 29(3), 323. https://doi.org/10.2307/2346911
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum Likelihood from Incomplete
Data Via the EM Algorithm. Journal of the Royal Statistical Society Series B: Statistical
Methodology, 39(1), 1–22. https://doi.org/10.1111/j.2517-6161.1977.tb01600.x
Enders, C. K. (2025). Missing data: An update on the state of the art. Psychological Methods,
30(2), 322–339. https://doi.org/10.1037/met0000563
Hens, N., Aerts, M., & Molenberghs, G. (2006). Model selection for incomplete and design-based
samples. Statistics in Medicine, 25(14), 2502–2520. https://doi.org/10.1002/sim.2559
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A Program for Missing Data. Journal
of Statistical Software, 45(7). https://doi.org/10.18637/jss.v045.i07
Imhof, J. P. (1961). Computing the distribution of quadratic forms in normal variables.
Biometrika, 48(3-4), 419–426. https://doi.org/10.1093/biomet/48.3-4.419
Kenward, M. G., & Molenberghs, G. (1998). Likelihood Based Frequentist Inference When Data
Are Missing at Random. Statistical Science, 13(3), 236–247.
https://www.jstor.org/stable/2676702
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling, and linking: Methods and
practices (Third edition). Springer.
Meng, X.-L. (1994). Multiple-Imputation Inferences with Uncongenial Sources of Input.
Statistical Science, 9(4). https://doi.org/10.1214/ss/1177010269
Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance
matrices: The SEM algorithm. Journal of the American Statistical Association, 86(416),
899–909. https://doi.org/10.1080/01621459.1991.10475130
Meng, X.-L., & Rubin, D. B. (1992). Performing likelihood ratio tests with multiply-imputed data
sets. Biometrika, 79(1), 103–111. https://doi.org/10.1093/biomet/79.1.103
Nielsen, S. F. (2003). Proper and Improper Multiple Imputation. International Statistical Review,
71(3), 593–607. https://doi.org/10.1111/j.1751-5823.2003.tb00214.x
Orchard, T., & Woodbury, M. A. (1972). A Missing Information Principle: Theory and
Applications.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
https://doi.org/10.1093/biomet/63.3.581
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley.
Satorra, A., & Bentler, P. M. (2010). Ensuring Positiveness of the Scaled Difference Chi-square
Test Statistic. Psychometrika, 75(2), 243–248. https://doi.org/10.1007/s11336-009-9135-y
Schafer, J. L. (1997). Analysis of incomplete multivariate data (1st ed.). Chapman & Hall/CRC.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.
Psychological Methods, 7(2), 147–177. https://doi.org/10.1037/1082-989X.7.2.147
Schomaker, M., & Heumann, C. (2014). Model selection and model averaging after multiple
imputation. Computational Statistics & Data Analysis, 71, 758–770.
https://doi.org/10.1016/j.csda.2013.02.017
Shimodaira, H. (2000). Improving predictive inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical Planning and Inference, 90(2), 227–244.
https://doi.org/10.1016/s0378-3758(00)00115-4
Shimodaira, H., & Maeda, H. (2018). An information criterion for model selection with missing
data via complete-data divergence. Annals of the Institute of Statistical Mathematics,
70(2), 421–438. https://doi.org/10.1007/s10463-016-0592-7
Tierney, L., & Kadane, J. B. (1986). Accurate approximations for posterior moments and
marginal densities. Journal of the American Statistical Association, 81(393), 82–86.
https://doi.org/10.1080/01621459.1986.10478240
van der Vaart, A. W. (1998). Asymptotic Statistics (1st ed.). Cambridge University Press.
https://doi.org/10.1017/CBO9780511802256
van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman
and Hall/CRC. https://doi.org/10.1201/9780429492259
Vuong, Q. H. (1989). Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses.
Econometrica, 57(2), 307. https://doi.org/10.2307/1912557
Wei, G. C. G., & Tanner, M. A. (1990). A Monte Carlo Implementation of the EM Algorithm and
the Poor Man’s Data Augmentation Algorithms. Journal of the American Statistical
Association, 85(411), 699–704. https://doi.org/10.1080/01621459.1990.10474930
Wilks, S. S. (1938). The Large-Sample Distribution of the Likelihood Ratio for Testing
Composite Hypotheses. The Annals of Mathematical Statistics, 9(1), 60–62.
https://doi.org/10.1214/aoms/1177732360
Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed
with multiply imputed data? Statistics in Medicine, 27(17), 3227–3246.
https://doi.org/10.1002/sim.3177
Table 1 True-model selection rates across the distribution-matching ladder. Rate of selecting the true model for each construction in three signal cells at 𝑁 = 500 with 2000 repetitions; Monte Carlo standard error near 0.010.
Cell | complete-data | uncorrected | mean | two-moment | equating
𝜌 = 0.40 | 0.899 | 0.678 | 0.814 | 0.904 | 0.903
𝜌 = 0.10 | 0.820 | 0.582 | 0.650 | 0.648 | 0.650
𝜌 = 0 (junk) | 0.727 | 0.422 | 0.585 | 0.702 | 0.696
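The standard error quoted in the caption can be checked with the binomial formula. The sketch below assumes independent repetitions, so that a selection rate p estimated from R repetitions has standard error sqrt(p(1 − p)/R); the function name `mc_standard_error` is hypothetical.

```python
import math


def mc_standard_error(rate, repetitions):
    """Binomial standard error of a proportion from independent repetitions."""
    return math.sqrt(rate * (1.0 - rate) / repetitions)


# Selection rates from the rho = 0.40 row of Table 1, 2000 repetitions each.
for label, rate in [("complete-data", 0.899), ("uncorrected", 0.678)]:
    print(label, round(mc_standard_error(rate, 2000), 4))
```

Under this assumption the standard error is largest at a rate of 0.5, about 0.011, and smaller for rates close to the complete-data benchmark, consistent with the caption's "near 0.010".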
Table 2 Structural checks of the derivations seen in the selection rates. Each row pairs an exact structural prediction with its measured value; Monte Carlo standard error near 0.010.
Structural limit | perfect | achieved
block-diagonal family, naive vs projected trace | exact equality | agree to 3.6 × 10⁻¹⁵
off-equality moment map, main-cell rate | complete-data 0.899 | naive input 0.824, correct input 0.904
per-model map, equated difference mean and sd | complete-data 10.00 and 7.10 | 6.10 and 5.99
Figure 1 Theorem validation. Each point is the net deviance bias E[𝑇] at one sample size with a 95 percent interval, estimated on the analytic arm with 1000 repetitions. Perfect is the analytic target, the dotted line at the preregistered half-trace 2.77 and the shaded band at the order-one-augmented prediction 2.42 to 2.55. The solid line is the inverse-variance pooled estimate 2.43 ± 0.26, whose interval contains the leading-order target and centers on the band. The 𝑁= 1000 cell falls 2.6 standard errors below the leading-order target, the study’s one reported miss.
Figure 2 Known scale. The two imputation arms of the known-scale run at 𝑁 = 200 with 2 × 10⁵ repetitions, each with a 95 percent interval. Perfect is the pair of analytic targets. The target is zero for the deterministic arm and −½ tr(RIV) = −0.561 for the proper arm. Both arms reach their target within Monte Carlo error.
Figure 3 Likelihood-ratio Type I error. The rejection rate at the null for four references at 𝑁 = 200 with 1000 repetitions and Monte Carlo error bars. Perfect is the nominal 0.05 line. The uncorrected numerator lies at 0.042 and the corrected one at 0.034. Both are near nominal. The corrected one is slightly conservative because it double-counts. Raw power across the alternative is not shown, because the references reject at different Type I error rates and the comparison would be confounded. The size-adjusted power conjecture is treated in the Discussion.
Figure 4 Discrimination. The observed paired differential at two sample sizes with 95 percent intervals, from 2000 repetitions on the heavy-missingness 𝜇1 design. Perfect is the projected-trace prediction tr(RIV⊥) = 2.64, the solid line. The naive trace difference is the dashed line near 8.5. It is excluded at 52 standard errors in this maximal-separation cell.
Figure 5 Pairing. Standard deviation of the deviance statistic against sample size, from 2000 repetitions, on a logarithmic scale. Perfect is a flat line at order one. The paired differential at the null achieves it, the lower curve. The single-model statistic grows with the sample size. The paired differential at a fixed alternative grows too, the documented limit of pairing.
Figure 6 Selection. True-model selection rate by the Akaike criterion at three sample sizes on the non-monotone missing-at-random cell, 2000 repetitions. Perfect is the complete-data benchmark, the dashed line and the black bar. The uncorrected criterion lies well below it, and the correction recovers most of the gap. All uncorrected errors fall on the largest-missing-information candidate.
Figure 7 Distribution-matching ladder. True-model selection rate for four constructions in three cells at 𝑁= 500 with 2000 repetitions. Perfect is the complete-data benchmark, the dashed line above each cell. At 𝜌= 0.40 the two-moment and equating arms reach the benchmark. At 𝜌= 0.10 no arm reaches it, the weak-signal floor.
Figure 8 Ladder internals. Left, the null-side Kolmogorov-Smirnov distance to the complete-data statistic across the four arms in the main cell. Perfect is zero, approached by the two-moment and equating arms, with the variance ratio falling from 3.18 toward one. Right, the noncentral shrinkage factor measured against predicted on the line 𝑦= 𝑥. The two-moment and equating points lie on the diagonal.
Figure 9 Non-nested measurement. Left, each candidate’s order-one level under missing-at-random data and under its completely-at-random twin, dissimilar pair at 𝑁= 500 with 20,000 repetitions. Perfect for the twin is zero, and the level collapses toward it, most sharply for the autoregressive candidate. Right, the isolated design-imbalance differential with 95 percent intervals. Perfect is no effect, the line at zero. The similar pair stays on that line, the reported failure, while the dissimilar pair resolves at −1.2.