Estimation of Item Parameters in Different Cohorts
To examine measurement invariance of the CBCL and SDQ items related to anxiety/depression and ADHD, we use the Bayesian method for modeling measurement non-invariance proposed by Verhagen and Fox (2013a) and applied earlier by Van den Berg et al. (2014). This method is a Bayesian IRT approach that tests differences in item parameters across groups and identifies true differences in the means and variances of the latent trait across groups, while modeling measurement non-invariance through random item parameters. The IRT approach allows us to include both person (latent trait) and item (discrimination and threshold) parameters in the analysis, while the Bayesian method enables straightforward estimation of complicated models through hierarchical modeling (Van den Berg et al., 2014; Verhagen, 2012). Unlike other procedures (e.g., the likelihood ratio test; Thissen et al., 1993), the Bayesian method for modeling measurement non-invariance does not require that some items be designated as invariant beforehand (Verhagen, 2012).
Verhagen and Fox (2013a) applied this approach to dichotomous data, while Van den Berg et al. (2014) adapted it for use on polytomous items. In both cases, data consisted of a large number of groups (23 in Verhagen & Fox, 2013a; 9 in Van den Berg et al., 2014). Because of the large number of groups, they did not test differences between each pair of groups, but instead assessed the variance of item parameters across cohorts, assuming a hierarchical structure for both the person (scores on the measured construct) and item (discrimination and threshold) parameters (Fox et al., 2007; Fox & Glas, 2001; Verhagen & Fox, 2013a, 2013b; Verhagen, 2012).
In the case of a small number of groups (as we have here), it is generally advised to use fixed effects to model group differences (Van den Berg, 2018). We therefore changed the random effects for item parameters into fixed effects by increasing the variance of their prior distributions.
In this study, we used the generalized partial credit model (GPCM) to model item parameters. In the GPCM, the probability of a certain response c (c = 1, …, C) for person i (i = 1, …, N) in group j (j = 1, …, J) on item k (k = 1, …, K) is defined as a function of the respondent's latent trait, \({\theta }_{i}\), the item discrimination parameter, \({\widetilde{\alpha }}_{kj}\), and the item thresholds for that category, \({\widetilde{\beta }}_{ckj}\), and the ones below it:
$$\mathcal{P}({Y}_{ijk}=c\mid {\eta }_{ijk})=\frac{\text{exp}(\sum_{v\le c}{\eta }_{ijvk})}{\sum_{h=1}^{C}\text{exp}(\sum_{v\le h}{\eta }_{ijvk})}$$
$${\eta }_{ijck}={\widetilde{\alpha }}_{kj}({\theta }_{i}-{\widetilde{\beta }}_{ckj})$$
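The category probabilities of a GPCM item can be sketched in a few lines of Python. This is a minimal illustration, not the authors' estimation code: the function name `gpcm_probs`, the example parameter values, and the indexing convention (the lowest category acts as a baseline with an empty cumulative sum, so two thresholds yield three categories) are all assumptions.

```python
import math

def gpcm_probs(theta, alpha, betas):
    """Category probabilities for one item under a GPCM (illustrative).

    theta: latent trait value; alpha: discrimination;
    betas: list of thresholds (assumed: C - 1 thresholds for C categories).
    """
    # Cumulative sums of eta_v = alpha * (theta - beta_v); baseline category is 0.
    cumulative = [0.0]
    for beta in betas:
        cumulative.append(cumulative[-1] + alpha * (theta - beta))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cumulative)
    weights = [math.exp(z - top) for z in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values: one item with a discrimination and two thresholds.
probs = gpcm_probs(theta=0.5, alpha=1.2, betas=[-0.5, 1.0])
```

With a single threshold the function reduces to a two-parameter logistic item, which is a quick sanity check on the indexing.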
The item and person parameters are estimated through a Markov chain Monte Carlo (MCMC) procedure (Van den Berg et al., 2014; Verhagen & Fox, 2013a, 2013b). The person parameters (latent trait values) are modeled as normally distributed around their group mean, \({\mu }_{{\theta }_{j}}\), with precision \({\tau }_{j}\), where the precision is the inverse of the variance:
$${\theta }_{i}\sim \mathcal{N}({\mu }_{{\theta }_{j}},{\tau }_{j})$$
In the case of fixed effects for the group-specific parameters, appropriate prior distributions for the group means are normal priors with very large variance parameters (low precision). Accordingly, the group means in our model were given normal priors centered at 0 with a precision of 0.1 (variance of 10) in order to obtain quasi-fixed effects for groups.
$${\mu }_{{\theta }_{j}}\sim \mathcal{N}(0,.1)$$
The precision \({\tau }_{j}\) of the person parameters \({\theta }_{i}\) is given a gamma prior distribution with a shape parameter of 1 and a scale parameter of 0.1. This precision is the inverse of the within-group variance \({\sigma }_{j}^{2}\).
$${\tau }_{j}\sim \Gamma (1,.1)$$
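The precision parameterization can be made concrete by drawing once from these priors. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, and the second gamma argument is treated as a scale parameter, following the wording of the text (Python's `random.gammavariate` also takes shape and scale). If the model software uses a shape/rate parameterization instead, the second argument would need to be inverted.

```python
import math
import random

def precision_to_sd(precision):
    """Convert a precision (inverse variance) to a standard deviation."""
    return math.sqrt(1.0 / precision)

def draw_group_prior(rng, mean_precision=0.1, shape=1.0, scale=0.1):
    """Draw one group mean and one within-group precision from the priors."""
    # Group mean: normal around 0 with precision 0.1, i.e. variance 10.
    mu = rng.gauss(0.0, precision_to_sd(mean_precision))
    # Within-group precision: gamma with shape 1 and (assumed) scale 0.1.
    tau = rng.gammavariate(shape, scale)
    return mu, tau

rng = random.Random(1)  # fixed seed so the sketch is reproducible
mu_j, tau_j = draw_group_prior(rng)
# A person parameter for group j, drawn around the group mean.
theta_i = rng.gauss(mu_j, precision_to_sd(tau_j))
```

A precision of 0.1 corresponds to a standard deviation of about 3.16 on the latent scale, which is why the prior on the group means is only weakly informative.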
The multilevel structure on the item parameters consists of group-specific item parameters \({\widetilde{\xi }}_{kj}\): each set of item parameters per item per group comes from a multivariate normal distribution with an item-specific mean and an item-specific covariance (precision) matrix. In this study, we have one discrimination parameter and two threshold parameters per item. Accordingly, the group-specific item parameters \({\widetilde{\xi }}_{kj}\) consist of a group-specific discrimination parameter \({\widetilde{\alpha }}_{kj}\) and two group-specific threshold parameters (\({\widetilde{\beta }}_{1kj}\) and \({\widetilde{\beta }}_{2kj}\)) for each item in a given group. The group-specific item parameters follow a multivariate normal distribution around the general item parameters \({\xi }_{k}\), with a precision matrix \(\mathcal{Q}\) chosen to be uninformative.
$${\widetilde{\xi }}_{kj}\sim \mathcal{N}({\xi }_{k},\mathcal{Q})$$
The general item parameters \({\xi }_{k}\) are assumed to be multivariate normally distributed around mean \({\xi }_{0}\), with covariance matrix \({\Sigma }_{{\xi }_{k}}\).
$${\xi }_{k}\sim \mathcal{N}({\xi }_{0},{\Sigma }_{{\xi }_{k}})$$
The mean \({\xi }_{0}\) is assumed to be multivariate normally distributed around the overall parameter means \({\mu }_{k}\), with variance described by precision matrix \(\mathcal{R}\) for the discrimination and each threshold parameter. An inverse Wishart prior distribution is chosen for the covariance matrix \({\Sigma }_{{\xi }_{k}}\).
$${\xi }_{0}\sim \mathcal{N}({\mu }_{k},\mathcal{R})$$
$${\Sigma }_{{\xi }_{k}}\sim \mathcal{I}\mathcal{W}(\mathcal{R},d)$$
The inverse Wishart distribution is specified with a \(\mathcal{C}\times \mathcal{C}\) scale matrix \(\mathcal{R}\), where \(\mathcal{C}\) is the number of item parameters, and with d degrees of freedom. The degrees of freedom must be greater than \(\mathcal{C}-1\) (Schuurman et al., 2016). It is assumed that the overall parameter means are equal to 0 in the case of threshold parameters, and 1 in the case of the discrimination parameter.
The model is identified by restricting the overall mean threshold in each group to 0 and the product of the discrimination parameters within each group to 1.
One cohort can have higher scores on the test because of a higher group-mean latent trait or because all items have lower threshold parameters for this cohort (Verhagen, 2012). The restriction that the threshold parameters are 0 on average allows us to identify the means of the thetas. Similarly, a larger test-score variance in one cohort can be a consequence of a larger variance of the latent distribution or of smaller discrimination parameters for this cohort. Accordingly, the variance of the latent scale can be identified by restricting the product of the discrimination parameters to 1 (Van den Berg et al., 2014; Verhagen, 2012).
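The logic of these two restrictions can be illustrated by rescaling an arbitrary set of parameters for one group so that the thresholds average 0 and the discriminations multiply to 1, while every \({\eta }_{ijck}\) stays unchanged. The sketch below is an illustration of the identification argument only; in practice the restriction is imposed within the MCMC procedure, and the function name and example values are hypothetical.

```python
import math

def identify(alphas, betas, thetas):
    """Rescale one group's parameters to satisfy both restrictions.

    alphas: K positive discriminations; betas: K lists of thresholds;
    thetas: latent trait values of the persons in the group.
    """
    # Geometric mean of the discriminations and overall mean threshold.
    g = math.exp(sum(math.log(a) for a in alphas) / len(alphas))
    flat = [b for item in betas for b in item]
    m = sum(flat) / len(flat)
    # alpha' = alpha / g, beta' = g * (beta - m), theta' = g * (theta - m)
    # leaves alpha' * (theta' - beta') equal to alpha * (theta - beta).
    new_alphas = [a / g for a in alphas]
    new_betas = [[g * (b - m) for b in item] for item in betas]
    new_thetas = [g * (t - m) for t in thetas]
    return new_alphas, new_betas, new_thetas

# Hypothetical parameters for three items and three persons in one group.
alphas = [0.8, 1.5, 1.1]
betas = [[-1.0, 0.5], [0.2, 1.4], [-0.3, 0.9]]
thetas = [-1.0, 0.0, 2.0]
new_alphas, new_betas, new_thetas = identify(alphas, betas, thetas)
```

Because the shift and scale removed from the item parameters are absorbed by the thetas, the group mean and variance of the latent trait become interpretable once the item parameters are pinned down in this way.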
In the Supplementary Material, we include examples and explain the logic of the model in more detail.
Quantification and Visualization of Measurement Invariance at the Scale Level
We will quantify differential test functioning (DTF) using the method proposed by Stark et al. (2004). This method is an improved version of one of the most prominent methods for calculating DTF, proposed by Raju et al. (1995). It enables researchers to quantify DTF and to express its degree in raw test scores. Stark et al. (2004) proposed estimating the expected total test scores for participants from one cohort based on different sets of item parameters. In other words, the expected total test score for a participant from the reference group will first be estimated based on item parameters from the reference group and then based on item parameters from the focal group. We will thus have two expected total test scores for each participant: one estimated using item parameters from the reference group and another estimated using item parameters from the focal group. Accordingly, we will have two test characteristic curves (TCCs): one based on item parameters from the reference group and another based on item parameters from the focal group.
To express the amount of DTF, Stark et al. (2004) used the DTFR parameter. The DTFR measure is similar to the DTF parameter proposed by Raju et al. (1995) and is the expected difference between the TCCs:
$$DTFR=E(TC{C}_{R}-TC{C}_{F})$$
\(TC{C}_{R}\) is based on the item parameters from the reference group, while \(TC{C}_{F}\) is based on item parameters from the focal group.
The greater the absolute value of DTFR, the greater the differential functioning of the test. While the absolute value of DTFR tells us how much DTF is present, its sign tells us which group has higher expected test scores due to differential test functioning. Stark et al. (2004) proposed subtracting the expected test scores of the focal group from the expected test scores of the reference group (\(TC{C}_{R}-TC{C}_{F}\)). Consequently, a positive value of DTFR means that the expected test scores based on item parameters from the focal group are lower than the expected test scores based on item parameters from the reference group, while a negative value means that the scores based on item parameters from the focal group are higher than the scores based on item parameters from the reference group. Since this is somewhat counter-intuitive, instead of \(TC{C}_{R}-TC{C}_{F}\) we will use \(TC{C}_{F}-TC{C}_{R}\) (Raju et al., 1995).
$$DTFR=E(TC{C}_{F}-TC{C}_{R})$$
Based on DTFR, we can obtain an effect size by using (Stark et al.,
2004):
$${d}_{DTF}=\frac{DTFR}{S{D}_{F}}$$
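The two quantities above can be sketched as follows. This is a minimal illustration rather than the procedure used in the study: the function names, all parameter values, the scoring of categories as 0, 1, 2, the choice of which cohort's latent trait values enter the expectation, and the use of the standard deviation of the focal group's observed sum scores for \(S{D}_{F}\) are assumptions.

```python
import math
from statistics import mean, stdev

def expected_item_score(theta, alpha, betas):
    """Expected score on one GPCM item, categories scored 0, 1, 2, ..."""
    cumulative = [0.0]
    for beta in betas:
        cumulative.append(cumulative[-1] + alpha * (theta - beta))
    top = max(cumulative)
    weights = [math.exp(z - top) for z in cumulative]
    return sum(score * w for score, w in enumerate(weights)) / sum(weights)

def tcc(theta, items):
    """Test characteristic curve: expected sum score at a given theta."""
    return sum(expected_item_score(theta, a, b) for a, b in items)

def dtfr(thetas, items_focal, items_reference):
    """Mean of TCC_F - TCC_R over a sample of latent trait values."""
    return mean([tcc(t, items_focal) - tcc(t, items_reference) for t in thetas])

# Hypothetical item parameters: (discrimination, [threshold 1, threshold 2]).
items_reference = [(1.0, [-0.5, 0.5]), (1.2, [0.0, 1.0])]
items_focal = [(1.0, [-0.2, 0.8]), (1.2, [0.0, 1.0])]
thetas = [-1.5, -0.5, 0.0, 0.4, 1.2]           # hypothetical trait values
focal_sum_scores = [0, 1, 1, 2, 3, 4, 2, 1]    # hypothetical observed scores

dtfr_value = dtfr(thetas, items_focal, items_reference)
d_dtf = dtfr_value / stdev(focal_sum_scores)
```

In this toy setup the first item has higher thresholds in the focal parameter set, so the expected scores under the focal parameters are lower and DTFR, defined as focal minus reference, comes out negative.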
Stark et al. (2004) also proposed another parameter of interest, IMPACT. DTFR is the difference between groups caused by differential test functioning, while IMPACT represents the true mean difference between groups, that is, the component of the observed mean difference that is not caused by differential test functioning. Together, they make up the observed mean difference (OMD).
$$OM{D}_{(F-R)}={M}_{(F)}-{M}_{(R)}$$
$$OM{D}_{(F-R)}=DTFR+IMPACT$$
$$IMPACT=OM{D}_{(F-R)}-DTFR$$
We will calculate an IMPACT score to determine the amount of difference between sum scores that is caused by the true mean difference between groups. Because of the specific restriction that the threshold parameters are on average 0 in the Bayesian modelling approach, the DTFR should be close to 0, once the groups are allowed to have different trait means (the IMPACT component of the OMD).
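As a purely illustrative worked example with hypothetical values (not results from this study), suppose the mean sum scores are \({M}_{(F)}=5.4\) and \({M}_{(R)}=4.6\), that \(DTFR=0.1\), and that \(S{D}_{F}=2.5\). The decomposition then gives:

$$OM{D}_{(F-R)}=5.4-4.6=0.8$$
$$IMPACT=0.8-0.1=0.7$$
$${d}_{DTF}=\frac{0.1}{2.5}=0.04$$

In such a case, nearly all of the observed mean difference would reflect a true mean difference between the groups, and only a small part would be attributable to differential test functioning.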
We will visualize DTF to gain more precise insight into its presence at different values of the latent trait, using the method proposed by Stark et al. (2004), which is based on the test characteristic curve (TCC). The TCC is the functional relation between the true score and the latent trait scale (Baker & Kim, 2017). Stark et al. (2004) calculated expected sum scores for points on the latent trait scale from −3 to 3 and plotted them as the TCC, where the scale is defined by a variance of 1. By visualizing the TCC, researchers can find the corresponding test score for any level of the latent trait. Since our scale for person parameters is defined by the restriction that the product of the discrimination parameters is equal to 1, we will instead plot TCCs based on item parameters from different cohorts, showing expected sum scores for any point on the latent trait scale between the minimum and maximum observed theta in the sample.
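The points needed for such a plot can be computed as sketched below. This is a hedged illustration: the function names, the cohort labels, the item parameters, the observed theta values, and the grid size are hypothetical, and categories are assumed to be scored 0, 1, 2. The resulting grid and curves could then be passed to any plotting library.

```python
import math

def tcc(theta, items):
    """Expected sum score at theta for GPCM items given as (alpha, betas)."""
    total = 0.0
    for alpha, betas in items:
        cumulative = [0.0]
        for beta in betas:
            cumulative.append(cumulative[-1] + alpha * (theta - beta))
        top = max(cumulative)
        weights = [math.exp(z - top) for z in cumulative]
        total += sum(s * w for s, w in enumerate(weights)) / sum(weights)
    return total

def tcc_curves(observed_thetas, items_by_cohort, n_points=61):
    """Evaluate one TCC per cohort on a grid spanning the observed thetas."""
    lo, hi = min(observed_thetas), max(observed_thetas)
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    curves = {name: [tcc(t, items) for t in grid]
              for name, items in items_by_cohort.items()}
    return grid, curves

# Hypothetical cohorts, each with its own item parameter estimates.
items_by_cohort = {
    "cohort_A": [(1.0, [-0.5, 0.5]), (1.2, [0.0, 1.0])],
    "cohort_B": [(1.0, [-0.2, 0.8]), (1.2, [0.0, 1.0])],
}
observed_thetas = [-2.3, -0.7, 0.1, 0.9, 2.8]
grid, curves = tcc_curves(observed_thetas, items_by_cohort)
```

The vertical gap between two curves at a given grid point shows how much the expected sum score differs between cohorts for respondents at that level of the latent trait.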