Background
An economic evaluation generally calculates quality-adjusted life years (QALYs) to measure the efficiency of healthcare technologies. Health technology assessment (HTA) agencies, including the National Institute for Health and Care Excellence (NICE) in the UK, ask pharmaceutical and medical device companies to submit cost-effectiveness data using QALYs [1]. In such countries, measurement of QALYs is important not only for academic researchers but also for pharmaceutical and medical device companies because it influences the reimbursement or pricing of pharmaceuticals and medical devices. The situation in Japan is similar. The Japanese government enacted a new pricing system in 2019 that uses economic evaluation to recalculate pharmaceutical or medical device prices [2]. The Japanese HTA organization, the Center for Outcomes Research and Economic Evaluation for Health (C2H), requests QALY-based outcome data for cost-effectiveness analysis [3].
QALYs are calculated by multiplying life years by quality-of-life, or utility, weights, which are anchored on a scale where 0 represents death and 1 represents full health; values below zero indicate a health state considered worse than dead. Preference-based measures (PBMs), or preference-weighted measures, are generally used to provide the utility weights for QALYs. A PBM includes a set of dimensions defining health states and a value set containing a weight for every health state described. Value sets are derived using a preference elicitation method and are usually country specific (for example, many PBMs have Japanese value sets [4‐11], including widely used generic measures such as the EQ-5D-5L and SF-6Dv1).
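The QALY calculation described above can be illustrated with a minimal Python sketch. The function name `qalys` and the health profile are hypothetical and serve only to show the arithmetic (duration multiplied by utility weight, summed over periods); no discounting is applied.

```python
def qalys(periods):
    """Sum duration (years) x utility weight over consecutive periods.

    `periods` is a list of (years, utility) tuples; utilities are anchored
    at 0 (death) and 1 (full health) and may be negative.
    """
    return sum(years * utility for years, utility in periods)


# Hypothetical profile: 2 years at utility 0.8, 3 years at 0.5,
# then 1 year in a state valued as worse than dead (-0.1).
profile = [(2.0, 0.8), (3.0, 0.5), (1.0, -0.1)]
total = qalys(profile)
print(round(total, 3))
```

A period spent in a state valued below zero reduces the total, which is why the lower bound of a value set matters for cost-effectiveness results.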
SF-6Dv1 is a generic PBM developed in the UK [12]. SF-6D consists of six domains [physical functioning (PF), role limitations (RL), pain (PA), vitality (VT), social functioning (SF), and mental health (MH)] that can be scored from the SF-36 Health Survey. A Japanese value set for the SF-6Dv1 was developed by Fukuhara and colleagues [6]. The SF-6Dv1 scores can be derived from 11 SF-36v1 or SF-36v2 items. The valuation survey of SF-6Dv1 was based on the standard gamble (SG) method, in which respondents trade a risk of death or severely impaired health to avoid impaired health [13, 14]. The use of SG is sometimes criticized because the respondents’ risk aversion leads to relatively higher values for severe states. For example, the value for the most severe health state for the SF-6Dv1 in the UK is 0.29. Therefore, in some countries, value sets based on a discrete choice experiment (DCE) have been published [15] using DCE with duration (i.e. a profile consisting of a health state experienced for a specified number of life years) rather than the SG method. In another Australian study, the value of the worst health state using DCE with duration was − 0.36 to − 0.44 depending on the model. Based on these and other results, the DCE method was applied to the SF-6Dv2. The SF-6Dv2 [16] health state classification system was developed to improve on the SF-6Dv1. It consists of the same six dimensions as SF-6Dv1, but with changes to the descriptors. Valuation surveys of the SF-6Dv2 have already been completed in the UK [17], Australia [18], China [19] and the US to generate country-specific value sets.
The primary objective of this study was to perform a valuation survey of the SF-6Dv2 based on an international protocol that included three DCE designs and to obtain a Japanese value set. Japanese preferences for each item in a PBM sometimes differ quantitatively and qualitatively from those in Western countries [4‐11]. Consequently, it is not appropriate to apply an existing value set developed in other countries. Indeed, C2H requests the use of a value set that “reflects the preferences of the general population in Japan”. In our survey, we used the DCE with duration method to elicit the value set. The DCE method has been increasingly used for valuation surveys, including those for the cancer-specific EORTC QLU-C10D [20] and FACT-8D [21]. The secondary objective was to compare the Japanese value set with those of the UK, Australia, and China, where published SF-6Dv2 value sets exist.
Methods
SF-6Dv2 classification system
The SF-6Dv2 is a classification system comprising six dimensions: physical functioning (PF), role limitations (RL), pain (PA), vitality (VT), social functioning (SF), and mental health (MH), with five to six severity levels (only PA has six levels). Similar to the SF-6Dv1, the SF-6Dv2 scores can be derived from SF-36v2 items. The SF-6Dv2 can also be scored from an independent six-item instrument, the SF-6Dv2 Health Utility Survey (HUS) [22]. The Japanese version of the SF-6Dv2 HUS was established by the research team. The Japanese team drafted a translation of the SF-6Dv2 HUS consistent with the existing Japanese SF-36 translation, and the draft was back-translated into English for review by the UK team. Cognitive debriefing was then performed with 10 Japanese people. After the feedback from the cognitive debriefing was incorporated, the Japanese version of the SF-6Dv2 HUS was finalized.
Discrete choice experiment
We used DCE with duration for valuing SF-6Dv2 health states. In the DCE survey, participants were required to imagine hypothetical health states, which consisted of health states derived from the SF-6Dv2 classification system and life years (1, 4, 7, and 10 years). Subsequently, two health states (states A and B) were presented, and the participants chose the one they preferred between the two options. In addition, we used the ternary method, in which three health states (states A, B and “immediate death”) were shown to respondents, who were asked to identify what they thought was the best and the worst health state.
Survey process and design
Respondents were asked to choose their preferred profile for each choice set. A total of 15 choice sets were presented, consisting of three training tasks, two “common tasks”, eight core tasks and two ternary tasks. The two common tasks were randomly selected from a set of 76 choice tasks across 38 blocks based on health states that are commonly experienced by the general population. These choice tasks used health states selected from the 200 most common health states identified in general population surveys. The choice set was identified using the Fedorov algorithm implemented in Ngene. Regarding core tasks, respondents were randomly allocated to a set of 304 core tasks across 38 blocks, which were selected from among all of the health states described by the SF-6Dv2. As before, the choice set was constructed on the basis of the Fedorov algorithm. Two ternary tasks were randomly selected from a set of 76 choice tasks that include a third choice of immediate death. In contrast to normal DCE tasks, respondents were asked to select the best and worst health states from the three options.
These three types of tasks were presented in order. Two pairs (common), eight pairs (core), and two pairs (ternary) were randomly allocated to each participant from each of the 38 blocks. In each task, the order in which the questions were presented was randomized and the presentation positions (left or right) of the two health states were randomized to avoid a positioning effect.
The sample size of 3800 was chosen to match the power of the original UK study, which used a sample size of 3000 respondents, a set of 300 core choice tasks, and 60 ternary tasks. The UK respondents were grouped into 30 subgroups of 100 respondents that each answered 10 core choice tasks and 2 ternary tasks. In the current study, respondents were grouped into 38 subgroups of 100 respondents that each answered 2 common choice tasks, 8 core tasks, and 2 ternary tasks, for a total of 76 common choice tasks, 304 core tasks and 76 ternary tasks.
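The design arithmetic and the randomization described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and drawing the block independently for each task type is an assumption, since the text does not state whether one block index was shared across the common, core and ternary sets.

```python
import random

N_BLOCKS = 38
RESPONDENTS_PER_BLOCK = 100
TASKS_PER_BLOCK = {"common": 2, "core": 8, "ternary": 2}


def design_totals():
    """Total number of distinct tasks per type and the target sample size."""
    totals = {kind: n * N_BLOCKS for kind, n in TASKS_PER_BLOCK.items()}
    totals["sample_size"] = N_BLOCKS * RESPONDENTS_PER_BLOCK
    return totals


def allocate(rng):
    """Build one respondent's task list (hypothetical allocation scheme)."""
    tasks = []
    for kind in ("common", "core", "ternary"):  # task types shown in this order
        block = rng.randrange(N_BLOCKS)  # assumption: independent draw per type
        order = list(range(TASKS_PER_BLOCK[kind]))
        rng.shuffle(order)  # randomize question order within the type
        for task_id in order:
            tasks.append({
                "kind": kind,
                "block": block,
                "task": task_id,
                "a_on_left": rng.random() < 0.5,  # randomize left/right position
            })
    return tasks


print(design_totals())
print(len(allocate(random.Random(0))))
```

The totals reproduce the figures in the text (76 common, 304 core and 76 ternary tasks for 3800 respondents), and each respondent receives 12 non-training tasks.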
Survey participants
An online survey was conducted. Respondents (aged 20–79 years) were recruited through a Japanese web panel based on quota sampling by sex and age to represent the general population. This means that an equal number of respondents was recruited from each of the 12 groups [six age categories (20–29, 30–39, 40–49, 50–59, 60–69, 70–79) multiplied by two sex categories]. Once the target number of respondents in a group was reached, recruitment for that group was closed. Respondents were invited to this survey by email and asked to click the link if they wanted to join the survey. Respondents had to provide informed consent before proceeding to the survey. Background information on respondents was collected after the 15 DCE tasks were completed. Respondents who completed all the tasks could obtain a small incentive. When the required number of responses was collected, the web page for the survey was closed.
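The quota logic described above (12 age-by-sex groups, each closed once its target is reached) can be sketched as follows. The class name `QuotaSampler`, the sex labels and the per-group quota are hypothetical; the actual panel software and quota size are not described in the text.

```python
AGE_BANDS = [(20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
SEXES = ("female", "male")  # assumption: two sex categories, giving 12 groups


class QuotaSampler:
    """Track enrolment per age-by-sex group and close groups that are full."""

    def __init__(self, quota_per_group):
        self.quota = quota_per_group
        self.counts = {(band, sex): 0 for band in AGE_BANDS for sex in SEXES}

    def group_of(self, age, sex):
        """Return the (age band, sex) key, or None if outside the target range."""
        for low, high in AGE_BANDS:
            if low <= age <= high and sex in SEXES:
                return ((low, high), sex)
        return None

    def try_enroll(self, age, sex):
        """Enroll a respondent unless ineligible or the group is already closed."""
        key = self.group_of(age, sex)
        if key is None or self.counts[key] >= self.quota:
            return False
        self.counts[key] += 1
        return True

    def complete(self):
        """True once every group has reached its quota."""
        return all(n >= self.quota for n in self.counts.values())


sampler = QuotaSampler(quota_per_group=2)  # tiny quota for illustration
print(sampler.try_enroll(25, "female"), sampler.try_enroll(85, "male"))
```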
The inclusion criteria were as follows: (a) being aged 20 years and over (definition of “adult” citizens in Japan), (b) currently living in Japan, (c) providing informed consent, (d) possessing literacy skills in Japanese, and (e) having access to a device with an internet connection. The survey was conducted in March 2022.
Statistical analysis
We calculated the number and percentage of background factors. A conditional logit model was used for the analysis of the choice tasks. The model for the estimation of coefficients was based on Bansback et al. [23] and Norman et al. [24] and included continuous duration (time) and the interaction between duration and the severity of each dimension (with the least severe level, level 1, as the baseline). Let $t_{ij}$ be the duration and $U_{ij}$ be the utility of profile $j$ for individual $i$. In that case, $U_{ij}$ can be formulated as follows:
$$ U_{ij} = \beta_{1} t_{ij} + \beta_{2} x_{ij} t_{ij} + \varepsilon_{ij} $$
(1)
where $x_{ij}$ represents the severity levels of each dimension and $\varepsilon_{ij}$ is an error term. However, the estimated $\beta_{2}$ is not anchored on the 0 (death) to 1 (full health) scale. To change the latent coefficients to the disutility of each level, we can calculate the utility weight using the following equation:
$$ - \widehat{\beta}_{2} / \widehat{\beta}_{1} $$
(2)
In the immediate death profile of the ternary tasks, duration was treated as 0. We also included an interaction term (WORST) to assess the impact of the worst level of each dimension in the analysis. If the profile had one or more dimensions at the worst level, the WORST term was defined as 1 (the “worst” model). If the estimated disutility was not logically consistent (consistency implied that “weights at the higher level in the same dimension were higher, and those at the lower level were lower”), the inconsistent levels were combined and the dataset was reanalyzed using the same models.
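The anchoring step in Eq. (2) and the logical consistency check can be sketched in Python. The coefficient values below are hypothetical and are not estimates from this study; the helper names `anchor` and `inconsistent_pairs` are likewise illustrative. The sketch only flags inconsistent adjacent levels; combining them requires re-estimating the model with a shared level.

```python
def anchor(beta_duration, beta_levels):
    """Rescale latent interaction coefficients to disutilities: -beta2 / beta1."""
    return {level: -b / beta_duration for level, b in beta_levels.items()}


def inconsistent_pairs(disutility_by_level):
    """Return adjacent level pairs where the more severe level has a
    smaller disutility than the less severe one (level 1 is the baseline, 0)."""
    levels = [1] + sorted(disutility_by_level)
    values = [0.0] + [disutility_by_level[lv] for lv in sorted(disutility_by_level)]
    return [
        (levels[k], levels[k + 1])
        for k in range(len(levels) - 1)
        if values[k + 1] < values[k]
    ]


# Hypothetical latent coefficients for one dimension (levels 2 to 5).
beta1 = 0.25
beta2 = {2: -0.01, 3: -0.02, 4: -0.05, 5: -0.04}

disutility = anchor(beta1, beta2)
print(disutility)
print(inconsistent_pairs(disutility))
```

In this made-up example, level 5 receives a smaller disutility than level 4, so the pair (4, 5) is flagged, which is the kind of pattern that would lead to the two levels being combined in a constrained model.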
We analyzed four different subsamples of the data and nine models. Model 1 included only core task responses (eight tasks per respondent, from the total of 304 included in the design) for analysis without a worst term. Model 3 included only the core task responses, but included a worst term. Model 4 included eight core tasks and two common tasks (10 tasks per respondent). Model 6 included the eight core tasks and two ternary tasks (10 tasks per respondent). Finally, model 8 included eight core tasks, two common tasks, and two ternary tasks (12 tasks per respondent). Corresponding to each of the above models, a constrained model was applied if inconsistencies were observed (models 2, 5, 7 and 9). The only exception was model 3, where the number of inconsistencies was deemed to be too high to attempt a constrained model. The parameters were estimated using PROC PHREG in SAS 9.4 and clogit in Stata 17. These two approaches gave the same results. We compared the models using the log likelihood, the number of logical inconsistencies (where utility increases as severity increases), the coefficients of each level and the distribution of utilities. To obtain the distribution of all utilities that can be generated by the SF-6Dv2, the utilities of $5^{5} \times 6$ = 18,750 health states were calculated using the parameter estimates for each level of each dimension.
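The enumeration of all SF-6Dv2 health states can be sketched as follows. The level counts per dimension come from the classification system described above; the disutility table is a placeholder (0.05 per level step) used only to make the example run, not the Japanese value set, and the additive scoring without a WORST term is an assumption matching the models without that term.

```python
from itertools import product

# Number of severity levels per dimension (only PA has six levels).
LEVELS = {"PF": 5, "RL": 5, "PA": 6, "VT": 5, "SF": 5, "MH": 5}


def all_states():
    """Every combination of levels, as tuples ordered like LEVELS."""
    return list(product(*(range(1, n + 1) for n in LEVELS.values())))


def utility(state, disutility):
    """1 minus the summed disutility of each dimension's level."""
    return 1.0 - sum(disutility[dim][lvl] for dim, lvl in zip(LEVELS, state))


# Placeholder disutilities: 0 at level 1, increasing by 0.05 per level.
placeholder = {
    dim: {lvl: 0.05 * (lvl - 1) for lvl in range(1, n + 1)}
    for dim, n in LEVELS.items()
}

states = all_states()
utilities = [utility(s, placeholder) for s in states]
print(len(states), max(utilities), round(min(utilities), 3))
```

Replacing `placeholder` with the estimated decrements of a given model would give the distribution of utilities for that model.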
This study was approved by the ethics committee of the National Institute of Public Health, to which the first author belongs (NIPH-IBRA #12338).
Discussion
In this study, we used data from a large sample of respondents from the general Japanese population to estimate a value set for the SF-6Dv2 based on an international protocol using DCE. The value set obtained using Model 5 can now be used for cost-effectiveness analyses in Japan. According to the Japanese value set of the SF-6Dv1, the score of the worst state was 0.392, which was much higher than that of other PBMs, including the EQ-5D-5L. In contrast, the worst score of the SF-6Dv2 was − 0.722 (Model 5), although the valuation methods differed between the two studies (standard gamble for the SF-6Dv1 and DCE for the SF-6Dv2). The problem of a narrow measurement range was therefore mitigated.
The results calculated by all unconstrained models revealed some logical inconsistencies, in which utility increased as health state severity increased. Inconsistencies in the RL and VT dimensions for levels 4 and 5 were observed in all the unconstrained models, which suggests that the Japanese respondents did not distinguish between levels 4 and 5 of the RL and VT dimensions. The preference weights of levels 4 and 5 in the PF dimension and those of levels 5 and 6 in the PA dimension are considerably larger in terms of their impact on utilities than the other weights. They have considerable influence on the range of the Japanese value set. In particular, compared with the UK and Australian weights, the Japanese utility decrements for levels 4 and 5 in PF are notably large (level 4: − 0.327 (Japan, model 5), − 0.092 (the UK) [17], − 0.138 (Australia) [18]; level 5: − 0.593 (Japan), − 0.186 (the UK) [17] and − 0.222 (Australia) [18]), although the UK and Australia use the WORST model, in which the weight of the worst term is − 0.084 (the UK) and − 0.079 (Australia); this term is not included here with the exception of model 3. However, the coefficients of the PA and MH dimension are higher than those for the UK and Australian weights. For Japan, in contrast with the UK and Australia, the lowest weight of the PF dimension in model 5 is lower than that of the PA dimension. The Chinese data showed a similar tendency in that the utility decrements of PF and PA were small, but particularly so for the PA dimension. In the case of the Japanese value set of the EQ-5D-5L [4], the utility decrement of the worst level of mobility (Mo) was the largest (− 0.243), although those of pain/discomfort (Pd) and anxiety/depression (Ad) were − 0.191 and − 0.196, respectively. Mo was the most influential item on utility, but Pd was comparable to Ad. In contrast, Devlin et al. [27] indicated that the decrease in Pd was the largest (− 0.335), and that of Ad was the second largest (− 0.289). The coefficient of Mo was − 0.274. These findings may result partly from cultural differences between other countries and Japan, where physical independence is more valued, and partly from the characteristics of the SF-6Dv2.
The minimum scores obtained by all the models were lower than that of the Japanese EQ-5D-5L (− 0.025). Although the scores of the Japanese EQ-5D-5L are much higher than those of the EQ-5D-5L in almost all other countries, the scores of the Japanese SF-6Dv2 are low compared to the UK scores. The reason may be that the valuation method of the SF-6Dv2 is DCE with duration, whereas that of the EQ-5D-5L is time trade-off (TTO), in which Japanese people tend to avoid choosing immediate death. Moreover, DCE with duration trades expected life years, and the trading of death is not explicit. In contrast, the ternary tasks do include a direct trade of death and impaired health. When we included data from ternary tasks in our analyses, the worst possible scores did increase (− 0.488 and − 0.426). These results support the hypothesis that Japanese people tend to avoid choosing immediate death.
The new Japanese guidelines for economic evaluation in 2024 recommend using EQ-5D-5L (“8.2.1 The Japanese version of the EQ-5D-5L is recommended as the initial choice for the PBM.”) [3]. However, the guidelines also accept the use of other generic PBMs including SF-6Dv2 as the second choice (Data collected using a generic Japanese PBM with a Japanese value set other than the EQ-5D-5L). Therefore, developing a Japanese value set for SF-6Dv2 is important because increasing the number of PBMs with Japanese value sets is helpful for academia and decision-makers. A PBM can lack sensitivity or responsiveness when measuring the utility of certain conditions or diseases. Different PBMs reflect different aspects of health states in terms of utility. However, it is also essential to consider comparability among PBMs, especially for decision-makers.
Considering the number of logical inconsistencies in the coefficients, models 1 and 4 had a few inconsistencies, but these were remodeled with ordering imposed to produce consistent versions, models 2 and 5. Model 4 showed only two inconsistencies: levels 4 and 5 of the RL and VT dimensions. Model 1 showed an additional inconsistency in the RL dimension between level 1 (baseline) and level 2. The absence of this inconsistency in model 4 may be due to the inclusion of data from the common design, which provides more statistical power for the estimation of utility decrements for mild health problems. In the UK and Australia, the WORST model (model 3 in our report) was preferred, but in Japan, model 3 had a high number of inconsistencies. For these reasons, we recommend the constrained version of model 4 (i.e. model 5) to be used for scoring the Japanese SF-6Dv2.
With the development of the value set of the Japanese SF-6Dv2, it is now possible to calculate QALYs for economic evaluation using SF-6Dv2. The value set is based on the results of an online survey completed by 3933 members of the Japanese public, and the web survey was well controlled. One limitation of this study was the sampling method. Because this was a web survey with recruitment from an existing web panel, respondents could not be chosen randomly across Japan. In addition, this survey was performed during the latter stages of the outbreak of COVID-19. The influence of the COVID-19 outbreak, which could have changed the preferences for health states, is unknown. Compared with the numerically large weights for PF and PA dimensions, the weights were numerically smaller for other dimensions, especially RL, VT and SF.
Finally, our statistical model makes the following three assumptions: (a) linear time preference [28], (b) independence from irrelevant alternatives (IIA), and (c) a multiplicative utility function (health state × duration) [29]. According to Jonker et al. [28], the assumption of linear time preference (without discounting) is not valid; the estimated discount rate is larger than that normally used by HTA agencies, and the hyperbolic discount function is better fitted than the exponential one. However, we did not consider these time preferences in the survey. For example, a mixed logit model can relax this assumption; however, we used only a fixed model. Jonker et al. [29] showed that many respondents’ choices were based on the additive utility function that does not differ from the multiplicative utility function, which is the model assumption of Bansback et al. [23]. If the respondents violated this assumption, the estimated value sets were biased; however, we analyzed the DCE data based on the multiplicative assumption. When non-linear time preference is reflected, the absolute value of the utility coefficients of the SF-6D becomes smaller [28]; in contrast, when only respondents with a multiplicative utility function are considered, those values become larger [29]. We did not empirically assess which of these influences on utility is more severe; however, our estimates of utility decrements may have been affected by these factors.
Some aspects of the Japanese SF-6Dv2 have not yet been clarified because experiences with SF-6Dv2 use have not accumulated to a sufficient degree. For example, the relationship between the SF-6Dv2 and other PBMs is unknown. Moreover, the population norms [30, 31] of SF-6Dv2 may help interpret obtained data for both the general population and specific patient groups. Further studies may be required to address these issues. Nevertheless, the present study contributes to promoting and enabling economic evaluations in Japan.