Journal of Clinical Oncology, Vol 25, No 23 (August 10), 2007: pp. 3482-3487
© 2007 American Society of Clinical Oncology. DOI: 10.1200/JCO.2007.11.3670
Statistical Power of Negative Randomized Controlled Trials Presented at American Society for Clinical Oncology Annual Meetings
From the Division of Medical Oncology and Hematology, Biostatistics, Princess Margaret Hospital and University of Toronto, Toronto, Ontario, Canada
Address reprint requests to Ian F. Tannock, MD, PhD, Division of Medical Oncology and Hematology, Princess Margaret Hospital, 610 University Ave, Toronto, Ontario, M5G 2M9, Canada; e-mail: ian.tannock{at}uhn.on.ca
Purpose To investigate the prevalence of underpowered randomized controlled trials (RCTs) presented at American Society of Clinical Oncology (ASCO) annual meetings.
Methods We surveyed all two-arm phase III RCTs presented at ASCO annual meetings from 1995 to 2003 for which negative results were obtained. Post hoc calculations were performed using a power of 80% and an α level of .05 (two-sided).
Results Of 423 negative RCTs for which post hoc sample size calculations could be performed, 45 (10.6%), 138 (32.6%), and 233 (55.1%) had adequate sample size to detect small, medium, and large effect sizes, respectively. Only 35 negative RCTs (7.1%) reported a reason for inadequate sample size. In a multivariable model, studies that were presented at oral sessions (P = .0038), multicenter studies supported by a cooperative group (P < .0001), and studies with time to event as primary outcome (P < .0001) were more likely to have adequate sample size.
Conclusion More than half of negative RCTs presented at ASCO annual meetings do not have an adequate sample to detect a medium-size treatment effect.
New treatments in clinical oncology are accepted on the basis of efficacy or decreased toxicity demonstrated in randomized controlled trials (RCTs). Despite promising results in earlier phase II studies, many treatment regimens do not show statistically significant gains when tested in larger RCTs.1 The power of a study is its probability of detecting a clinically important effect of the experimental treatment, compared with the control arm, if a difference actually exists. If a clinical trial fails to show a statistically significant benefit in favor of the experimental treatment, an investigator may erroneously conclude that the experimental treatment is of no benefit, even if the trial did not include enough participants to demonstrate reliably a clinically meaningful effect. Previous studies of negative RCTs reported in the general medical and specialty literature show that many trials are inadequately powered to detect a meaningful difference between the arms.2,3,5-11 Such underpowered RCTs have been criticized as unethical because they expose participants to the toxicities of experimental treatments but are unable to determine whether those treatments are effective.12-15 The objective of this study was to investigate the prevalence of underpowered RCTs published in abstract form in the Proceedings of the American Society for Clinical Oncology (ASCO) Annual Meetings from 1995 to 2003, and the factors associated with lack of statistical power.
Identification of Studies
To identify a large cohort of negative clinical trials, the Proceedings of the ASCO Annual Meetings from 1995 to 2003 were reviewed to identify randomized phase III clinical trials. Superiority trials with two-group parallel design for both dichotomous and continuous primary outcomes were included. Abstracts with explicit statements of negative results were classified as negative studies. If there was no explicit statement within the text of the abstract, a study was considered negative if it did not show a statistically significant benefit (P > .05 or CI including 1.0) in favor of the experimental treatment arm for the primary outcome measure. If a study did not explicitly state its primary outcome measure, the primary outcome was considered the first end point reported in the abstract.16 Abstracts reporting preliminary results before the completion of patient accrual, phase II studies, meta-analyses, equivalence studies, overviews, pooled data from two or more studies, and secondary analyses were excluded.
Data Abstraction
Data from 30 RCTs were extracted independently by two investigators (P.L.B. and M.K.K.). After resolution of minor differences, studies were identified and data abstraction was performed on the remaining sample by one author (P.L.B.).
Statistical Analysis
The main outcome of this study was the proportion of negative RCTs that were underpowered. Post hoc sample size calculations were performed on all abstracts included in our cohort (Appendix and Tables A1 and A2, online only). Studies were considered to be underpowered if the total number of assessable participants was less than the sample size needed to detect a prespecified difference in outcome with 80% power and an α level of .05 (two-sided).
Logistic regression analysis was performed to identify factors associated with lack of power to detect a medium effect size. Factors assessed in both univariable and multivariable analyses included year of publication, cancer type, format of presentation (oral, poster, or published only), whether the primary end point was identified, whether the study was multicenter, whether it involved a cooperative group and/or was sponsored by the pharmaceutical industry, and the type of primary end point (mean, proportion, or time to event). All statistical analyses were carried out using SAS version 9 (SAS Institute Inc, Cary, NC).
Study Population
We identified 514 abstracts that met inclusion criteria. Twenty-two abstracts were excluded subsequently (11 interim reports, one four-arm trial, one trial with an unknown number of participants, and nine trials for which the type of outcome was unclear). Characteristics of the remaining 492 abstracts are summarized in Table 1. Abstracts describing negative clinical trials were published in a nearly uniform manner from 1995 to 2003, with the largest number published in 2001 (71 trials) and the fewest published in 2003 (48 trials). Among this cohort of negative studies, the most common tumor site was breast (24%). The median sample size in the trials was 210, with a median of 189 participants assessable for the primary end point. A primary end point was stated explicitly in only 168 trials (34%). Fewer than 10% of trials reported sample size considerations. A time-to-event variable was identified as the primary end point in 263 trials, with 216 trials expressing their primary result as a proportion and 13 trials expressing it as a mean. Most studies were multicenter (78%), and only 25% of them indicated pharmaceutical sponsorship. Involvement of a cooperative group was identified in almost half of the trials.
Data Abstraction
Of the 16 items used in the analysis, nine had a concordance proportion of 90% or more, four had a concordance proportion between 80% and 90%, and three had a concordance proportion between 70% and 80%. Differences between investigators were resolved by consultation.
Statistical Power of Negative Randomized Trials
There were 263 trials for which the primary outcome was expressed as a time to event, 59 of which (22.4%) were excluded because of insufficient information to perform post hoc power calculations. Of the remaining 204 trials, 45 trials (22.1%) were adequately powered to detect a small effect size (HR
There were 216 trials for which the primary outcome was expressed as a proportion. Of these, 10 were excluded because of insufficient information to perform post hoc power calculations (4.6%). Of the remaining 206 studies, none had adequate sample size to detect a small effect size (OR 1.3), only three trials (1.4%) had adequate sample size for a medium effect size (OR 1.5), and 26 trials (12.6%) were sufficiently powered to detect a large effect size (OR 2.0; Table 2).
Thirteen trials expressed their primary outcome as a mean. Of these trials, none had adequate sample size for a small effect size (
Predictors of Lack of Statistical Power
Reasons for Premature Termination
In 35 trials, the authors indicated that their studies were terminated prematurely before the attainment of their targeted accrual (Table 4). In 13 trials (37%), the authors attributed the early termination of their studies to slower than anticipated accrual. In nine trials (26%), an interim analysis suggesting lack of efficacy of the experimental treatment was cited as the indication for premature termination.
Our survey of 423 negative clinical trials indicates that 55% of trials had too few participants to detect a medium effect size in favor of the experimental over the standard treatment arm for their primary end point with at least 80% statistical power. Although underpowered negative clinical trials have been widely reported in the general medical and subspecialty literature,2-11 there are few reports relating to trials evaluating treatment of cancer. A review of 22 negative randomized oncology trials published in major general medical or oncology journals during a 1-year period found that 16 trials (73%) lacked adequate statistical power to detect a 50% improvement in median survival in favor of the experimental arm.16 The present study is a more robust survey of research practices in clinical oncology; it is not limited to negative trials that used survival as a primary end point. Furthermore, our study encompasses all negative trials presented at the ASCO Annual Meetings during a 9-year period and includes negative studies that are never published. We have demonstrated previously that RCTs presented at the ASCO Annual Meeting that have nonsignificant findings are less likely to be published than RCTs with significant results.18
A trial may be underpowered because an investigator fails to perform an a priori sample size calculation. In our study, fewer than 10% of abstracts reported sample size calculations. This observation is consistent with the results of other surveys of abstracts published at oncology meetings.19 Many investigators may have performed sample size calculations that were not reported in the abstract. However, given that many negative clinical trials never achieve journal publication, authors of trials with negative results should be obliged to report a brief summary of the sample size calculation in the abstract so that their findings can be evaluated properly.
A trial may be designed with an appropriate sample size calculation, but then fail to accrue its target sample. This may occur because of patient-related factors, such as preference for a specific treatment arm, concerns about random assignment, and practical issues such as distance from the clinic and transportation costs. There may also be clinician and organizational barriers, such as lack of time for recruitment, preference for a particular treatment arm, poor organizational infrastructure, and multiple trials competing for the same patient. Trials may also be terminated before recruitment of their planned sample size because an interim analysis demonstrates lack of efficacy or unexpected toxicity in the treatment arm, because of lack of financial support for continuing the trial, or because new evidence renders the question being addressed by the trial no longer of clinical interest or even unethical. In our study, few authors indicated why their studies were underpowered.
In our multivariable model, failure to identify explicitly a primary end point, type of presentation, type of sponsorship, and type of primary end point were the most significant predictors of lack of adequate sample size to detect a medium effect. Negative RCTs reporting a time-to-event variable as a primary end point were more likely to demonstrate adequate sample size to detect a medium effect size than studies reporting a proportion or mean variable as a primary end point. In our cohort, time-to-event studies had larger sample sizes than studies with a proportion or mean variable as a primary end point. Moreover, time-to-event studies were more likely to explicitly identify a primary end point (P = .0007), involve multiple centers or a cooperative group (P < .0001), and present results in oral sessions at the ASCO Annual Meeting (P = .0002). Time-to-event studies were also less likely to be sponsored by the pharmaceutical industry than studies reporting a proportion or mean variable as a primary end point (P = .0007). This suggests that trials that report a time-to-event primary end point are more heavily funded and may be more likely to involve a statistician in the research design phase to perform a priori sample size calculations.
Some authors have characterized underpowered clinical trials as unethical, given that they expose patients to the risks of research without providing a reasonable opportunity for the outcome to contribute to scientific knowledge.12-15 They suggest that investigators should perform appropriate sample size calculations when designing trials and anticipate potential recruitment problems that might threaten the statistical power of their trial design. Other authors have challenged this doctrine, suggesting that well-conducted underpowered studies may provide valuable point estimates and CIs of treatment effect and can be synthesized with other studies in meta-analyses to perform valid treatment comparisons.20-22 Some have suggested that underpowered trials are unavoidable for rare cancers; however, in our study, 60% of trials underpowered for a medium effect size were for common tumor sites, such as breast, lung, and GI cancer.
Although power is only one variable that determines the validity of a clinical trial result, our findings indicate that most negative trials in clinical oncology lack an adequate sample size to detect at least a medium effect size for their primary end point. In contrast, most clinical trials in oncology with positive results demonstrate much smaller effect sizes. For example, a review published by two major cooperative groups in the United States of clinical trials that achieved their targeted accrual during a 15-year period showed that the average effect size was 1.20, or a relative improvement of 20% in favor of the experimental versus the standard treatment arm.23 The average effect size to detect clinical improvements in the advanced-disease setting rather than adjuvant setting may be even smaller.24
There are several limitations of our study. We used a power value of 80% with an α level of .05 (two-sided).
In summary, more than half of negative RCTs published at ASCO Annual Meetings are underpowered to detect a medium-sized treatment effect. We propose that abstracts that report clinical trials in oncology should identify explicitly a primary end point; provide a brief summary of the sample size calculation; and indicate the statistical power,
The author(s) indicated no potential conflicts of interest.
Conception and design: Philippe L. Bedard, Monika K. Krzyzanowska, Melania Pintilie, Ian F. Tannock
Administrative support: Philippe L. Bedard, Monika K. Krzyzanowska, Ian F. Tannock
Collection and assembly of data: Philippe L. Bedard, Monika K. Krzyzanowska
Data analysis and interpretation: Philippe L. Bedard, Monika K. Krzyzanowska, Melania Pintilie, Ian F. Tannock
Manuscript writing: Philippe L. Bedard, Monika K. Krzyzanowska, Ian F. Tannock
Final approval of manuscript: Philippe L. Bedard, Monika K. Krzyzanowska, Melania Pintilie, Ian F. Tannock
Definition of effect size. Classification of effect size as small, medium, or large was defined differently for studies reporting a mean, proportion, or time-to-event variable as a primary end point. Mean as primary end point. For studies reporting a mean as a primary end point, effect size was assessed as a multiple of the standard deviation of the sample in the standard arm (SD) with a small effect size 0.2 SD, medium effect size 0.5 SD, and a large effect size 0.8 SD. These criteria are based on the published guidelines of Cohen.17
Proportion as primary end point. For studies with a proportion (eg, an odds ratio [OR]) as a primary end point, we could not find a published definition of small, medium, and large effect sizes. To establish criteria, we considered a hypothetical study with an event rate of 50% in the standard treatment arm. The ORs calculated in this scenario for a range of observed treatment effects are listed in Table A1. These absolute differences provide a reasonable definition of small, medium, and large effect sizes, respectively, and we have therefore defined a small effect size as OR 1.3, a medium effect size as OR 1.5, and a large effect size as OR 2.0.
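To make the link between an OR and an absolute difference concrete, the following minimal sketch converts the OR thresholds reported in the Results (1.3, 1.5, and 2.0) into experimental-arm proportions for the hypothetical 50% event rate in the standard arm. The function name and constant are illustrative, and the output is not claimed to reproduce Table A1, which is not available here.

```python
def p2_from_odds_ratio(p1: float, odds_ratio: float) -> float:
    """Event proportion in the experimental arm implied by an odds ratio.

    Uses the definition OR = [p2 / (1 - p2)] / [p1 / (1 - p1)].
    """
    odds_standard = p1 / (1.0 - p1)
    odds_experimental = odds_ratio * odds_standard
    return odds_experimental / (1.0 + odds_experimental)


# Hypothetical event rate in the standard arm, as in the text.
P_STANDARD = 0.50

for label, odds_ratio in [("small", 1.3), ("medium", 1.5), ("large", 2.0)]:
    p2 = p2_from_odds_ratio(P_STANDARD, odds_ratio)
    print(f"{label}: OR {odds_ratio} -> p2 = {p2:.3f}, "
          f"absolute difference = {p2 - P_STANDARD:.3f}")
```

With a 50% baseline, these ORs correspond to absolute differences of roughly 6.5, 10, and 16.7 percentage points.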
Time-to-event as primary end point. For studies with a time-to-event variable (eg, a hazard ratio [HR]) as a primary end point, we could not find a published definition of small, medium, and large effect sizes. To establish criteria, we considered a hypothetical study in which the survival at 2 years was 50% in the standard treatment arm. In this scenario, the calculated HRs for the observed treatment effects are listed in Table A2. These absolute differences provide a reasonable definition of small, medium, and large effect sizes, respectively, and we have therefore defined a small effect size as HR
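As an illustration of how an HR relates to an absolute survival difference in this scenario, the sketch below assumes exponentially distributed survival times (the assumption stated later in this Appendix) and defines the HR as the hazard in the standard arm divided by the hazard in the experimental arm, so that values above 1 favor the experimental arm. That direction, the function names, and the experimental-arm survival values are assumptions chosen for illustration; they are not the entries of Table A2.

```python
import math


def hazard_ratio_from_survival(surv_standard: float, surv_experimental: float) -> float:
    """HR (standard / experimental) implied by survival proportions at the same time point.

    Under exponential survival, S(t) = exp(-hazard * t), so the ratio of
    hazards equals ln(S_standard) / ln(S_experimental).
    """
    return math.log(surv_standard) / math.log(surv_experimental)


def survival_from_hazard_ratio(surv_standard: float, hazard_ratio: float) -> float:
    """Experimental-arm survival implied by a given HR (inverse of the function above)."""
    return surv_standard ** (1.0 / hazard_ratio)


# 2-year survival of 50% in the standard arm, as in the text.
SURV_STANDARD = 0.50

# Illustrative experimental-arm survival values (hypothetical).
for surv_experimental in (0.55, 0.60, 0.70):
    hr = hazard_ratio_from_survival(SURV_STANDARD, surv_experimental)
    print(f"2-year survival {surv_experimental:.2f} vs {SURV_STANDARD:.2f}: HR = {hr:.2f}")
```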
Post hoc sample size calculations. Post hoc sample size calculations were performed in the following manner for studies with a mean, proportion, or time-to-event variable as a primary end point.
Mean as primary end point. The effect size $\delta$ is given as a multiple of the standard deviation. The total sample size is given by the following equation:
$$N = \frac{4\,(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}$$
where $\delta$ is the effect size as defined earlier, and $z_{1-\alpha/2}$ and $z_{1-\beta}$ are the quantiles of the standard normal distribution, using $z_{1-\alpha/2} = 1.96$ and $z_{1-\beta} = 0.84$. To perform post hoc power calculations, the following assumptions were made: the data are normally distributed, the standard deviation is the same in both arms, and there was the same number of patients in each of the two arms (ie, 1:1 randomization).
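As a numerical illustration for a mean end point, the sketch below assumes the usual two-sample normal-approximation formula under the stated assumptions (normal data, equal standard deviations, 1:1 randomization): total N = 4(z1–α/2 + z1–β)²/δ², with δ in units of the standard deviation. The function names are hypothetical, and the formula is the textbook form rather than a confirmed transcription of the authors' equation.

```python
Z_ALPHA = 1.96  # z quantile for a two-sided alpha of .05, as given in the text
Z_BETA = 0.84   # z quantile for 80% power, as given in the text


def total_sample_size_mean(effect_size_sd: float) -> float:
    """Total sample size (both arms) to detect a difference of effect_size_sd SDs.

    Assumes normal data, equal SDs, and 1:1 randomization.
    Returns the unrounded value; round up to get a whole number of participants.
    """
    return 4.0 * (Z_ALPHA + Z_BETA) ** 2 / effect_size_sd ** 2


def is_underpowered(n_assessable: int, effect_size_sd: float) -> bool:
    """A trial is underpowered if it has fewer assessable participants than required."""
    return n_assessable < total_sample_size_mean(effect_size_sd)


# Cohen's small, medium, and large effect sizes.
for label, delta in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label} ({delta} SD): total N = {total_sample_size_mean(delta):.1f}")
```

Under these assumptions the required totals are about 784, 126, and 49 participants for small, medium, and large effects, respectively.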
Proportion as primary end point. The effect size is expressed in terms of the OR. If an increase in OR is anticipated ($p_2 > p_1$), then
$$p_2 = \frac{\mathrm{OR} \cdot p_1}{1 - p_1 + \mathrm{OR} \cdot p_1}$$
The total sample size is given by
Here the difference in proportions is $p_2 - p_1$, and $z_{1-\alpha/2} = 1.96$ and $z_{1-\beta} = 0.84$.
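The calculation for a proportion can be sketched as follows. Because several sample size formulas for comparing two proportions are in common use, this example assumes one simple unpooled-variance approximation, total N = 2(z1–α/2 + z1–β)²[p1(1 – p1) + p2(1 – p2)]/(p2 – p1)², which may differ from the exact expression the authors used. All function names are hypothetical.

```python
Z_ALPHA = 1.96  # z quantile for a two-sided alpha of .05, as given in the text
Z_BETA = 0.84   # z quantile for 80% power, as given in the text


def p2_from_odds_ratio(p1: float, odds_ratio: float) -> float:
    """Experimental-arm proportion implied by the OR and the standard-arm proportion."""
    return odds_ratio * p1 / (1.0 - p1 + odds_ratio * p1)


def total_sample_size_proportion(p1: float, odds_ratio: float) -> float:
    """Total sample size (both arms, 1:1) to detect the given OR.

    ASSUMPTION: unpooled-variance normal approximation; other published
    formulas give slightly different answers.
    """
    p2 = p2_from_odds_ratio(p1, odds_ratio)
    difference = p2 - p1
    variance_sum = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return 2.0 * (Z_ALPHA + Z_BETA) ** 2 * variance_sum / difference ** 2


# Example: standard-arm proportion of 50% and the OR thresholds from the Results.
for label, odds_ratio in [("small", 1.3), ("medium", 1.5), ("large", 2.0)]:
    n_total = total_sample_size_proportion(0.50, odds_ratio)
    print(f"{label} (OR {odds_ratio}): total N = {n_total:.1f}")
```

Under this approximation and a 50% baseline, detecting OR 1.5 requires about 769 participants in total, far more than the median of 189 assessable participants reported for the cohort.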
Time-to-event as a primary end point. The effect size is expressed in terms of the HR. The following effect sizes are considered: small (
The effect size is as defined earlier, and $z_{1-\alpha/2} = 1.96$ and $z_{1-\beta} = 0.84$. On the basis of the information obtained from the abstract, the total number of events over the duration of the study was estimated by the following methods.
(1) For studies in which competing risks were not present, the accrual and follow-up time were provided, along with an estimate of the percent survival in the standard arm at a point in time or the median survival.
(a) The hazard rate for the standard arm can be calculated as either
(b) The hazard rate for the experimental arm can be calculated as
(c) For each arm, the probability of event during the study is
(d) The total number of events is $n_1 p_1 + n_2 p_2$, where the $p_i$ values are calculated in (c), and $n_1$ and $n_2$ are the total number of patients randomly assigned in the standard and experimental arms, respectively.
(2) For studies in which competing risks might have been present, the total number of events, if provided in the abstract, was used.
(3) For studies in which competing risks were not present, an estimate of percent survival in the standard arm at a given point in time was provided or the median survival was provided. In these studies, the accrual and follow-up times were not known but the median follow-up time was provided. To calculate sample size, the procedure was similar to that described in section (1), except (c) was replaced by
(4) For studies in which competing risks were not present, an estimate of percent survival for the standard arm at a point in time was provided or the median survival was provided. If neither the accrual and follow-up times nor the median follow-up time were provided but the total number of events was explicitly stated, then this number of events was used.
(5) For studies in which competing risks might have been present and the total number of events was not provided, but all the information from either sections (1) or (3) or (4) was provided, then the same procedures as outlined in the respective paragraphs were used.
The total number of events observed in the study is contrasted with the number of events necessary, calculated as $n_{ev}$ (equation 1).
To perform post hoc power calculations, the following assumptions were made: it was considered that the time to event was exponentially distributed and the accrual was uniform over time; when the end point was described using words such as "relapse," "progression," or "failure," unless specifically defined, it was considered that the competing risks might have been present.
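The event-count procedure in method (1) can be sketched as follows under the stated assumptions of exponential time to event and uniform accrual. Several choices here are assumptions rather than the authors' confirmed equations: the HR is taken as standard-arm hazard divided by experimental-arm hazard, the required number of events uses Schoenfeld's approximation 4(z1–α/2 + z1–β)²/(ln HR)² for 1:1 randomization, and the trial parameters in the usage example are hypothetical.

```python
import math

Z_ALPHA = 1.96  # z quantile for a two-sided alpha of .05, as given in the text
Z_BETA = 0.84   # z quantile for 80% power, as given in the text


def hazard_from_survival(survival: float, time: float) -> float:
    """Step (a): exponential hazard from the survival proportion at a given time."""
    return -math.log(survival) / time


def hazard_from_median(median_survival: float) -> float:
    """Step (a), alternative: exponential hazard from the median survival."""
    return math.log(2.0) / median_survival


def prob_event(hazard: float, accrual: float, follow_up: float) -> float:
    """Step (c): probability that a patient has an event during the study.

    Patients enter uniformly over `accrual` and are then followed for a
    further `follow_up` after accrual closes.
    """
    if accrual == 0:
        return 1.0 - math.exp(-hazard * follow_up)
    survived = (math.exp(-hazard * follow_up)
                - math.exp(-hazard * (accrual + follow_up))) / (hazard * accrual)
    return 1.0 - survived


def expected_events(n1: int, n2: int, hazard_standard: float,
                    hazard_ratio: float, accrual: float, follow_up: float) -> float:
    """Steps (b) and (d): expected total events across both arms."""
    hazard_experimental = hazard_standard / hazard_ratio  # ASSUMED direction of the HR
    return (n1 * prob_event(hazard_standard, accrual, follow_up)
            + n2 * prob_event(hazard_experimental, accrual, follow_up))


def required_events(hazard_ratio: float) -> float:
    """Events needed for 80% power (ASSUMED: Schoenfeld's approximation, 1:1 arms)."""
    return 4.0 * (Z_ALPHA + Z_BETA) ** 2 / math.log(hazard_ratio) ** 2


# Hypothetical trial: 100 patients per arm, median survival 12 months in the
# standard arm, 24 months of accrual, 12 months of additional follow-up, HR 1.5.
hazard_std = hazard_from_median(12.0)
events = expected_events(100, 100, hazard_std, 1.5, accrual=24.0, follow_up=12.0)
print(f"expected events: {events:.1f}, required events: {required_events(1.5):.1f}")
```

A trial would be counted as underpowered for a given effect size when the estimated (or reported) number of events falls below the required number.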
We thank Ida Lee for assistance with data entry and Mel Giovinazzo for administrative support.
Authors' disclosures of potential conflicts of interest and author contributions are found at the end of this article.
1. Zia MI, Siu LL, Pond GR, et al: Comparison of outcomes of phase II studies and subsequent randomized control studies using identical chemotherapeutic regimens. J Clin Oncol 23:6982-6991, 2005
2. Freiman JA, Chalmers TC, Smith H, et al: The importance of beta, the type II error, and sample size in the design and interpretation of the randomized controlled trial: A survey of 71 negative trials. N Engl J Med 299:690-694, 1978
3. Moher D, Dulberg CS, Wells GA: Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 272:122-124, 1994
4. Reference deleted by author.
5. Hebert B, Wright S, Dittus R, et al: Prominent medical journals often provide insufficient information to assess the validity of studies with negative results. J Negat Results Biomed 1:1, 2002
6. Brown CG, Kelen GD, Ashton JJ, et al: The beta error and sample size determination in clinical trials in emergency medicine. Ann Emerg Med 16:183-187, 1987
7. Edmund MJ, Overall JE, Rhoades HM: Beta, or type II error in psychiatric controlled clinical trials. J Psychiatr Res 19:563-567, 1985
8. Mengel MB, Davis AB: The statistical power of family practice research. Fam Pract Res J 13:105-111, 1993
9. Dimick JB, Diener-West M, Lipsett PA: Negative results of randomized clinical trials published in the surgical literature: Equivalency or error? Arch Surg 136:796-800, 2001
10. Williams HC, Seed P: Inadequate size of negative clinical trials in dermatology. Br J Dermatol 128:317-326, 1993
11. Keen HI, Pile K, Hill CL: The prevalence of under-powered randomized clinical trials in rheumatology. J Rheumatol 32:2083-2088, 2005
12. Halpern S, Karlawish JH, Berlin JA: The continuing unethical conduct of underpowered clinical trials. JAMA 288:358-362, 2002
13. Altman DG: Statistics and ethics in medical research: III. How large a sample? BMJ 281:1336-1338, 1980
14. Newell DJ: Type II errors and ethics. BMJ iv:1789, 1978
15. Altman DG: The scandal of poor medical research. BMJ 308:283-284, 1994
16. Martins RG, Finkelstein DM, Selden MV: The importance of beta, type II error, in negative trials in oncology. Proc Am Soc Clin Oncol 15:415a, 1997 (abstr 1482)
17. Cohen J: Statistical power analysis for the behavioral sciences (ed 2). Hillsdale, NJ, Erlbaum, 1988
18. Krzyzanowska MK, Pintilie M, Tannock IF: Factors associated with large randomized trials presented at an annual oncology meeting. JAMA 290:495-501, 2003
19. Krzyzanowska MK, Pintilie M, Brezden-Masley C, et al: Quality of abstracts describing randomized trials in the proceedings of American Society of Clinical Oncology meetings: Guidelines for improved reporting. J Clin Oncol 22:1993-1999, 2004
20. Edwards SJL, Lilford RJ, Braunholtz D, et al: Why "underpowered" trials are not necessarily unethical. Lancet 350:804-807, 1997
21. Knapp TR: The overemphasis on power analysis. Nurs Res 45:379-381, 1996
22. Lilford R, Stevens AJ: Underpowered studies. Br J Surg 89:129-131, 2002
23. Joffe S, Harrington DP, George SL, et al: Satisfaction of the uncertainty principle in cancer clinical trials: Retrospective cohort analysis. BMJ 328:1463, 2003
24. Chlebowski RT, Lillington LM: A decade of breast cancer clinical investigation: Results as reported in the program/Proceedings of the American Society of Clinical Oncology. J Clin Oncol 12:1789-1795, 1994
25. Janosky JE: The ethics of underpowered clinical trials. JAMA 288:2118-2119, 2002
Submitted February 19, 2007; accepted May 29, 2007.
Copyright © 2007 by the American Society of Clinical Oncology, Online ISSN: 1527-7755. Print ISSN: 0732-183X