
Innovative research methods for studying treatments for rare diseases: methodological review

  • Joshua J Gagne , assistant professor ,
  • Lauren Thompson , research assistant ,
  • Kelly O’Keefe , research manager ,
  • Aaron S Kesselheim , assistant professor
  • 1 Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA
  • Correspondence to: J J Gagne jgagne1{at}partners.org
  • Accepted 4 November 2014

Objective To examine methods for generating evidence on health outcomes in patients with rare diseases.

Design Methodological review of existing literature.

Setting PubMed, Embase, and Academic Search Premier searched for articles describing innovative approaches to randomized trial design and analysis methods and methods for conducting observational research in patients with rare diseases.

Main outcome measures We assessed information related to the proposed methods, the specific rare disease being studied, and outcomes from the application of the methods. We summarize methods with respect to their advantages in studying health outcomes in rare diseases and provide examples of their application.

Results We identified 46 articles that proposed or described methods for studying patient health outcomes in rare diseases. Articles covered a wide range of rare diseases and most (72%) were published in 2008 or later. We identified 16 research strategies for studying rare disease. Innovative clinical trial methods minimize sample size requirements (n=4) and maximize the proportion of patients who receive active treatment (n=2), strategies crucial to studying small populations of patients with limited treatment choices. No studies describing unique methods for conducting observational studies in patients with rare diseases were identified.

Conclusions Though numerous studies apply unique clinical trial designs and considerations to assess patient health outcomes in rare diseases, less attention has been paid to innovative methods for studying rare diseases using observational data.


Though an individual rare disease is by definition uncommon according to the statutory definitions set in the United States (fewer than 200 000 people affected; equating to a prevalence of approximately <64 per 100 000 people) and European Union (<50 per 100 000 people), more than 6800 different conditions qualify as rare diseases and 6-8% of the population is affected. 1 2 3 This translates to about 60 million people in the United States and EU alone. Rare diseases comprise a heterogeneous set of conditions that afflict various organ systems, have wide ranging prognoses, and even vary along a gradient of rareness.
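The unit conversions in the paragraph above can be checked with a few lines of arithmetic. This is only a rough sketch: the two population figures are round values assumed for illustration and do not come from the article.

```python
# Convert the US statutory threshold (a head count) to a prevalence per 100 000.
# ASSUMPTION: both population figures are round illustrative values,
# not numbers taken from the article.
US_POPULATION = 312_000_000  # assumed
EU_POPULATION = 503_000_000  # assumed


def per_100k(count: int, population: int) -> float:
    """Express a head count as a rate per 100 000 population."""
    return count / population * 100_000


us_threshold = per_100k(200_000, US_POPULATION)

# People affected if 6-8% of the combined population has a rare disease.
combined = US_POPULATION + EU_POPULATION
affected_low = 0.06 * combined
affected_high = 0.08 * combined

print(f"US threshold: about {us_threshold:.0f} per 100 000")
print(f"Affected: {affected_low / 1e6:.0f} to {affected_high / 1e6:.0f} million")
```

Under these assumed populations, the figure of about 60 million falls inside the computed range.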

Many barriers exist to advancing knowledge of and treatment options for rare diseases. 4 The small patient populations can dampen commercial interest in development of treatments. Yet even for those rare conditions where funding is plentiful and manufacturers of therapeutics are engaged, methodological and data constraints limit the ability to generate evidence on patient health outcomes. The most obvious challenge to conducting rigorous research is the small number of eligible participants for a given study. In addition, geographic dispersion of patients, lack of knowledge about the clinical course of disease, and lack of appropriate comparator treatments further hinder the generation of evidence. 5 As a result, relatively little is known about the clinical course of many rare diseases and few treatment options exist.

However, there may be pathways for collectively advancing the study of rare diseases. Although rare diseases may present unique clinical problems, the methodological challenges to studying health outcomes are often shared. In recent years, innovative epidemiological and clinical trial methods have been developed that offer promise for promoting more efficient and effective research. Because rare diseases are so clinically dissimilar, clinicians, scientists, and other stakeholders working in one medical specialty may not be familiar with methods being applied in other disciplines. Thus, we conducted a methodological review to catalogue and describe innovative approaches to studying health outcomes in patients with rare diseases. Our goal was to identify innovative approaches to research that have been, or can be, applied to overcome the methodological challenges inherent in the study of rare diseases.

Search strategy

We searched PubMed, Embase, and Academic Search Premier from their inception through December 2012 for English language articles that included the following terms: “rare diseases”, “orphan drug”, “comparative effectiveness”, “evidence-based medicine”, “health technology assessment”, “outcome assessment”, “methods”, “epidemiology”, and “registries”. The supplementary file provides details of the search strategies.

We also conducted ad hoc searches of the three reference databases as well as general internet searches in Google using search terms specific to individual rare diseases (for example, progeria) and names of methods (for example, response adaptive randomization) identified in the database searches. Finally, we mined the reference lists of qualifying articles to supplement our search.

Article selection

We combined the results of each search strategy and removed duplicates. One author (LT) screened titles and abstracts to exclude those articles that were clearly not relevant. Another author (JJG) conducted a second stage screening of those articles that passed the title and abstract screens. We included articles covering randomized trial design and analysis methods and methods for conducting observational research. Articles relating to other facets of rare diseases and their treatments (for example, those related to clinical practice or policy) were excluded.

Data extraction

We extracted descriptive information about each article, including information on the authors, title, and publication. If the article focused on a specific rare disease, we extracted the name of the condition. We then summarized the unique methods proposed or used in each article to study patient health outcomes in rare diseases. If the article presented an empirical application of an innovative method, we extracted the study’s objective, the number of participants, the description of the method, and the description of the outcome.

For the qualitative synthesis, we classified novel research methods relating to the study of rare diseases into two broad categories: advances in clinical trial design for patients with rare diseases, and methods for observational studies of health outcomes in rare diseases. In each category we highlighted the most innovative research methodologies, and, where possible, provided examples of their applications.

We identified 5346 records through our search process. After removing duplicates and performing an initial title screening to exclude those that were clearly irrelevant to our review, we identified 442 potentially relevant articles and, after the subsequent two stage screening process, we obtained full text versions of 55 articles. Of these, 46 proposed or employed methods for studying patient health outcomes in rare diseases (figure). Articles covered a wide range of rare diseases, from amyotrophic lateral sclerosis to multiple myeloma to uveal melanoma. Of the 13 articles that involved an empirical application, the number of participants ranged from 23 to 4980. Most of the articles (33/46, 72%) were published between 2008 and 2012. Table 1 presents a summary of the research methods we identified and their advantages in the setting of research into rare diseases.

Figure: PRISMA flow diagram

Table 1: Summary of research strategies for studying rare diseases and their advantages

Clinical trial designs used in patients with rare diseases

Conventional parallel group randomized controlled trials, which randomly allocate participants to one of two or more treatment groups, are not always feasible in rare conditions. 6 We found 19 articles proposing or employing novel clinical trial methods for studying therapeutic interventions in rare diseases. These approaches were classified into two groups: designs that minimize the total number of participants, and designs that maximize the number of on-treatment participants.

Minimizing trial sample size

Investigators studying rare diseases have tried to deal with the small pool of potential trial participants. Some proposed or made adjustments to traditional randomized trials. For example, when considering the treatment period, choosing a longer trial duration can reduce sample size requirements by capturing more events among the trial participants. 7 Focusing on high risk patients can reduce sample size and study duration, 8 and using genetic testing can reduce variability between individuals and allow inclusion of patients before they experience symptoms. 9 Finally, some investigators have sought to reduce sample size by tackling multiple treatment options in a factorial study, in which two (or more) treatment comparisons are carried out simultaneously. 10 Factorial designs provide answers to multiple questions within the same study population. This reduces the total number of patients required to answer all of the questions of interest but does not reduce the number of patients required to answer each individual question.

Another way to reduce sample size requirements in rare disease studies is through selection of the outcome measure: a continuous outcome variable, a surrogate marker, a composite endpoint, or a repeated measure outcome. Identifying a continuous outcome variable, rather than a binary measure, can enhance statistical efficiency. 7 For example, percentage reduction in a continuous measurement imparts greater statistical power in an analysis than an outcome measurement based on the proportion of patients who attain some threshold in reduction of the measure, provided that the continuous outcome variable has a small variance. Surrogate endpoints, such as biomarkers, that predict whether patients will experience clinical outcomes of interest may also be useful. They can enhance statistical power because a potentially small number of patients in a study experience the hard endpoint of interest, whereas nearly all patients have measured values of the biomarker. 7 11 Validating biomarkers as good surrogates of the clinical outcome of interest can be difficult, however. When hard clinical endpoints are preferred, combining multiple outcomes into a single composite outcome measure can increase the number of observed events and thus the statistical power. 12 Repeated outcome measurements permit patients to contribute more than one outcome event or measurement, which also increases study power, allowing more precise estimation of variance between patients while permitting estimation of the variance within patients. 12
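The efficiency gain from a continuous outcome can be illustrated with the standard normal approximation sample size formulas for comparing two means and two proportions. The sketch below is illustrative only: the effect size, standard deviation, and responder threshold are hypothetical values chosen for the example, not figures from any cited study.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_total(alpha: float = 0.05, power: float = 0.80) -> float:
    """Sum of the two sided alpha quantile and the power quantile."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def n_per_group_continuous(delta, sd, alpha=0.05, power=0.80):
    """Patients per group to detect a mean difference `delta`."""
    return math.ceil(2 * (z_total(alpha, power) * sd / delta) ** 2)


def n_per_group_binary(p_control, p_treated, alpha=0.05, power=0.80):
    """Patients per group to detect a difference in two proportions."""
    variance_sum = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil(
        z_total(alpha, power) ** 2 * variance_sum / (p_treated - p_control) ** 2
    )


# Hypothetical scenario: treatment shifts a normally distributed outcome
# by half a standard deviation; a "responder" is anyone above the threshold.
delta, sd, threshold = 0.5, 1.0, 0.0
p_control = 1 - NormalDist(0.0, sd).cdf(threshold)
p_treated = 1 - NormalDist(delta, sd).cdf(threshold)

n_continuous = n_per_group_continuous(delta, sd)
n_binary = n_per_group_binary(p_control, p_treated)
print(f"per group: continuous outcome {n_continuous}, dichotomized outcome {n_binary}")
```

In this scenario, analyzing the same underlying measurement as a responder proportion requires more patients per group than analyzing it on its continuous scale. The size of the penalty depends on where the threshold is placed.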

A third approach to the sample size problem is to build networks to allow broader access to trials. Development of clinical trial networks for rare diseases can facilitate the conduct of multicenter and even multinational randomized trials. 13 Trial networks facilitate the recruitment of larger and more geographically diverse patient populations than may be permitted by single center studies. 14 The existence of such networks can also decrease the time required to complete a trial. 14 For example, Goss and colleagues provide a comprehensive overview of clinical trial networks for rare diseases in the context of the Cystic Fibrosis Therapeutics Development Network. 14

Finally, we found investigators who proposed and used novel trial design strategies to account for small pools of patients with rare diseases. Trials featuring an “adaptive design” allow modification of some aspects of the trial based on prospectively planned interim data analyses. The two basic types of adaptive designs are adaptive randomization and sequential trials. In trials using adaptive randomization, the probability of being randomized to an intervention changes during the enrollment period. The goal of adaptive randomization may be to minimize imbalance in baseline covariates among treatment groups (covariate-adaptive randomization) or to increase the proportion of patients assigned to the seemingly more effective treatment while reducing overall trial enrollment (response-adaptive randomization). By contrast, in sequential trials, data are analyzed intermittently to guide decisions on termination when safety concerns, futility, efficacy, or a combination of these factors is demonstrated. Trials that are stopped early because of important interim results require fewer patients. However, to control for multiple testing, trials that are not stopped early generally require larger sample sizes compared with similarly designed non-sequential trials. Chow and colleagues, Gupta and colleagues, and Cornu and colleagues have all summarized adaptive and sequential design methods in clinical trials and provide examples of applications to rare diseases. 15 16 17 Gupta and colleagues also provide a framework for selecting among these approaches for studies of rare diseases.
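One simple form of response-adaptive randomization is the randomized play-the-winner urn, sketched below as an illustration. It is not the design used in any specific trial cited here. The sketch assumes two arms, a binary outcome that is observed before the next patient is randomized, and hypothetical response rates. It shows only the shift in allocation toward the better performing arm, not any reduction in total enrollment.

```python
import random


def play_the_winner_trial(p_success, n_patients, rng):
    """Simulate one two-arm randomized play-the-winner trial.

    p_success maps arm name -> true response probability (unknown in
    practice; supplied here only to simulate outcomes).
    Returns the number of patients allocated to each arm.
    """
    arms = list(p_success)
    urn = {arm: 1 for arm in arms}  # start with one ball per arm
    allocated = {arm: 0 for arm in arms}
    for _ in range(n_patients):
        # Draw an arm with probability proportional to its balls in the urn.
        arm = rng.choices(arms, weights=[urn[a] for a in arms])[0]
        allocated[arm] += 1
        other = arms[1] if arm == arms[0] else arms[0]
        if rng.random() < p_success[arm]:
            urn[arm] += 1  # success: add a ball for the arm just used
        else:
            urn[other] += 1  # failure: add a ball for the other arm
    return allocated


rng = random.Random(2014)
true_rates = {"new": 0.8, "standard": 0.3}  # hypothetical response rates
n_patients, n_trials = 40, 2000

total_new = sum(
    play_the_winner_trial(true_rates, n_patients, rng)["new"]
    for _ in range(n_trials)
)
share_new = total_new / (n_trials * n_patients)
print(f"average share of patients allocated to the better arm: {share_new:.2f}")
```

Averaged over many simulated trials, more than half of the patients end up on the arm with the higher response rate, whereas fixed 1:1 randomization would allocate half.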

Many variants of adaptive randomization and sequential designs are applicable to studying rare diseases because they can reduce the sample size required for conventional trials. In addition, certain adaptive designs can also increase participants’ probability of receiving the most effective treatment, which can encourage enrollment in a trial. 7 11 12 15 16 18 The decision about whether to use an adaptive design involves considering whether a set sample size can be reasonably recruited, the number of therapeutic options to be compared, and whether preliminary data suggest one treatment is superior. 16 Cornu and colleagues proposed an algorithm for choosing an experimental design for small randomized clinical trials that also involves judging whether the outcome is reversible, whether the treatment response is likely to be rapid, and whether investigators seek to minimize the time participants are receiving placebo. 17

Even if investigators use one of these innovative designs or adaptations of traditional trials in studying a rare disease treatment, individual trials of patient health outcomes may not be capable of attaining sufficient power to reject the null hypothesis using a conventional frequentist threshold (α=0.05). One solution is to increase α, as was done in the alternating design trial of itraconazole by Gallin and colleagues. 19 Another solution is to conduct the underpowered study and incorporate the results into a prospectively planned meta-analysis. 18 20 21 A third option is to incorporate the results into a bayesian framework. Lilford and colleagues recommend the third approach for trials in rare diseases in which the individual trials are unlikely to result in a definitive answer but each can change the level of certainty around the clinical question. 22 The bayesian approach uses all available data—from the trial and other sources—to calculate probabilities that a particular treatment is effective. These probabilities can then be applied to clinical practice. Bayesian methods can also be useful in individual studies (randomized controlled trials and observational) of health outcomes in rare diseases. 11

Tan and colleagues described a bayesian approach to combining previous data with data from a new randomized controlled trial by creating scores that are then used to weight the pieces of evidence according to their pertinence, validity, and precision. 23 The validity scores enable investigators to down-weight evidence based on studies with flaws or other concerns, such as confounding in non-randomized trials. Pertinence scores are based on how closely the information from each source relates to the information to be gained in the trial. In theory, pertinence scores could also be based on the degree to which the evidence streams are relevant to patients’ decision making and could therefore support patient centered decision making. The authors make the case that such a bayesian approach can increase the robustness of information from small trials and can be used to help design and provide justification for such trials. However, bayesian approaches require appropriate specification of a prior distribution, which may be subjective or based on limited information.
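The general idea of combining down-weighted earlier evidence with a new small trial can be sketched with a conjugate normal model. This is a generic illustration, not the scoring system of Tan and colleagues: the single discount weight per source, the vague starting prior, and all effect estimates are hypothetical assumptions.

```python
import math
from statistics import NormalDist


def combine_evidence(sources, prior_mean=0.0, prior_sd=10.0):
    """Normal-normal update with discounted precisions.

    sources: list of (estimate, standard_error, weight) with weight in [0, 1];
    a weight below 1 shrinks that source's precision (assumed scheme).
    Returns the posterior (mean, sd) of the treatment effect.
    """
    precision = 1.0 / prior_sd ** 2
    weighted_sum = prior_mean * precision
    for estimate, se, weight in sources:
        discounted = weight / se ** 2
        precision += discounted
        weighted_sum += discounted * estimate
    return weighted_sum / precision, math.sqrt(1.0 / precision)


def prob_benefit(mean, sd):
    """Posterior probability that the treatment effect exceeds zero."""
    return 1 - NormalDist(mean, sd).cdf(0.0)


earlier_study = (0.40, 0.20, 0.5)  # hypothetical non-randomized study, half weight
new_trial = (0.30, 0.30, 1.0)      # hypothetical small randomized trial

trial_mean, trial_sd = combine_evidence([new_trial])
post_mean, post_sd = combine_evidence([earlier_study, new_trial])

print(f"trial alone:   P(benefit) = {prob_benefit(trial_mean, trial_sd):.2f}")
print(f"with evidence: P(benefit) = {prob_benefit(post_mean, post_sd):.2f}")
```

The small trial alone leaves considerable uncertainty. Adding the discounted earlier evidence narrows the posterior distribution, but the result depends directly on the chosen weight and prior, which is the subjectivity noted above.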

Maximizing on-treatment participants

Trials that guarantee participants receive an intervention can enhance recruitment for patients with rare diseases who have limited treatment options. Some of these designs can also reduce recruitment requirements compared with alternative conventional parallel group randomized controlled trials. For example, crossover trials involve randomizing patients to treatment at one time (or several times) and to no treatment (or treatment with a comparator) at another time (or other times). 10 12 13 16 23 24 In addition to guaranteeing treatment, crossover designs are more statistically efficient than their parallel group randomized controlled trial counterparts. Crossover trials are particularly well suited to studying treatments for chronic conditions in which the treatments provide immediate relief of symptoms. But crossover trials generally cannot be used to study treatments that have curative effects or conditions that are rapidly changing. Many rare diseases are chronic conditions that progress over time. Changes in the disease over time that are unrelated to the treatment under study can cause bias in crossover trials. Crossover trials also require a transient treatment effect to minimize carryover effects into the subsequent treatment periods.
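The statistical efficiency of a crossover design can be made concrete with a textbook approximation. The sketch assumes a two period crossover with no carryover or period effects, a normally distributed outcome, and a correlation `rho` between the two measurements in the same patient. All numbers are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def z_total(alpha: float = 0.05, power: float = 0.80) -> float:
    """Sum of the two sided alpha quantile and the power quantile."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)


def parallel_total_n(delta, sd, alpha=0.05, power=0.80):
    """Total patients for a two-arm parallel group trial."""
    per_group = math.ceil(2 * (z_total(alpha, power) * sd / delta) ** 2)
    return 2 * per_group


def crossover_total_n(delta, sd, rho, alpha=0.05, power=0.80):
    """Total patients for a two period crossover trial.

    Each patient contributes a within-patient difference whose variance
    is 2 * sd^2 * (1 - rho), so stable between-patient differences cancel.
    """
    var_difference = 2 * sd ** 2 * (1 - rho)
    return math.ceil(z_total(alpha, power) ** 2 * var_difference / delta ** 2)


# Hypothetical inputs: detect a 0.5 SD effect; within-patient correlation 0.6.
n_parallel = parallel_total_n(delta=0.5, sd=1.0)
n_crossover = crossover_total_n(delta=0.5, sd=1.0, rho=0.6)
print(f"parallel group: {n_parallel} patients; crossover: {n_crossover} patients")
```

The ratio of the two totals is roughly (1 - rho) / 2, so the saving grows as repeated measurements in the same patient become more strongly correlated. The saving disappears if the assumptions about carryover and disease stability do not hold.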

In the most basic crossover design involving two treatments, patients are randomly assigned to one treatment, followed by a washout period, and then receive a different treatment. Other patients are randomized to the reverse ordering. More complex crossover studies include so called alternating designs, in which patients are randomly assigned to each treatment at multiple time points. 25 Gallin and colleagues conducted a randomized crossover trial to examine itraconazole for fungal infections in patients with chronic granulomatous disease. 19 Given the rarity of this disease, it took 10 years to enroll only 39 patients. The investigators randomly assigned patients to receive itraconazole or placebo for one year and then to alternate annually between itraconazole and placebo. While this approach could not provide much information on the long term safety of itraconazole treatment, the multiple observations that each patient contributed made it possible to achieve sufficient statistical power, at a two sided type I error probability of 0.10, with only 39 participants. 25

An n-of-1 study is a special type of crossover design in which the trial comprises one patient. 10 11 12 13 16 23 24 Within clinical practice settings, healthcare providers administer a treatment and a control at randomly determined times and observe subsequent outcomes. These trials require the same general assumptions as crossover trials. While statistical inference cannot be made based on a single n-of-1 trial, results of multiple such studies can be aggregated in case series or even meta-analyzed quantitatively. 26 Investigators in the Netherlands are developing an n-of-1 trial service integrated in the Dutch healthcare system to generate evidence on the efficacy of treatments for rare neuromuscular diseases. 27 It will involve testing treatments that are available on the market but not necessarily approved for the neuromuscular indications. The project will create protocols for each n-of-1 trial and will collect the data in an electronic registry system. Less common variants of crossover designs include the Latin square design, the stepped wedge design, and the randomized withdrawal design. 17 Cornu and colleagues and Gupta and colleagues provide more detailed descriptions of the application of these clinical trial designs to studying treatments in rare diseases. 16 17

Methods for observational studies of health outcomes in rare diseases

In addition to the often small samples, studies using observational data to assess patient health outcomes in rare diseases face important challenges. For example, there is often no appropriate comparison group against which to compare outcome frequencies in patients with rare diseases and even when there is, controlling for confounding can be difficult because the risk factors of those outcomes are usually not well understood. Table 2 summarizes methods that have been proposed or used to analyze health outcomes in patients with rare disease in observational data. These methods can be generally classified into four categories: advanced methods to tackle confounding, self controlled observational study designs, approaches for case-control studies, and prospective inception cohorts.

Table 2: Selected observational studies of health outcomes in patients with rare diseases

Advanced methods to deal with confounding

Some authors have suggested the use of certain advanced methods to tackle confounding in studies of rare disease health outcomes, such as propensity scores. 28 29 When comparing patients being treated for a particular rare disease to patients with the same disease but who are not being treated, confounding will occur if the determinants of one patient’s receipt of treatment over another are also risk factors for the outcome of interest. Often, many such confounders can be present. Propensity scores reduce the dimensionality of confounding in observational studies by summarizing all measured potential confounders into a single scalar score. 30 This tool is particularly useful in studies in which there are few outcome events relative to the number of confounders, which is common in studies of rare diseases. 31 In a study of a dose-response effect of enzyme replacement therapy in patients with Gaucher disease type 1, Grabowski and colleagues created propensity scores to summarize multiple confounders and then used the scores to match patients who received different doses of enzyme therapy. 32 Though propensity scores can facilitate adjustment for many potential confounders by modeling the exposure rather than the outcome, neither propensity scores nor traditional outcome regression modeling can overcome confounding due to unmeasured variables.
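A minimal sketch of how a propensity score collapses several measured covariates into one number per patient, followed by 1:1 matching, is shown below. It is not the analysis of Grabowski and colleagues. The data are simulated, the covariates and coefficients are hypothetical, and the logistic model is fitted with plain gradient ascent only to keep the example free of external libraries.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def linear(beta, row):
    """Intercept plus weighted sum of covariates."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))


def fit_logistic(rows, treated, steps=1000, lr=0.2):
    """Fit P(treated | covariates) by gradient ascent on the log likelihood."""
    beta = [0.0] * (len(rows[0]) + 1)  # intercept first
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for row, t in zip(rows, treated):
            resid = t - sigmoid(linear(beta, row))
            grad[0] += resid
            for j, v in enumerate(row):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def greedy_match(scores, treated, caliper=0.05):
    """1:1 nearest neighbor matching on the score, without replacement."""
    controls = {i for i, t in enumerate(treated) if t == 0}
    pairs = []
    for i, t in enumerate(treated):
        if t == 1 and controls:
            j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
            if abs(scores[j] - scores[i]) <= caliper:
                pairs.append((i, j))
                controls.remove(j)
    return pairs


# Simulated cohort: three measured confounders drive treatment assignment.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
treated = [
    1 if rng.random() < sigmoid(0.8 * r[0] + 0.5 * r[1] - 0.5 * r[2]) else 0
    for r in rows
]

beta = fit_logistic(rows, treated)
scores = [sigmoid(linear(beta, r)) for r in rows]  # one scalar per patient
pairs = greedy_match(scores, treated)
print(f"{sum(treated)} treated patients, {len(pairs)} matched pairs")
```

Treated patients without a control within the caliper are left unmatched, which shrinks an already small sample. This is one reason why confounding control may not always be achievable in very small studies.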

Self controlled observational study designs

Self controlled observational designs may be useful in the rare disease setting. These approaches are observational analogues to the randomized crossover trials described above in which patients act as their own controls. These studies can be indexed by outcome, such as in case-crossover designs, 33 in which the frequency of exposure is compared during different time points among those who develop the outcome. They can also be indexed by exposure, such as the self controlled case series, 34 in which the frequency of outcome is compared during different time points among those exposed to the intervention of interest. Notable for patients with rare diseases, these approaches are immune to confounding by factors that do not change over time because of the within person comparisons. Similar to randomized self controlled trial designs, self controlled observational methods enhance statistical power and therefore reduce sample size requirements. Self controlled observational methods are subject to the same limitations as randomized self controlled trial designs but can also be susceptible to time varying confounding, such as when worsening of disease, which may be a risk factor for the outcome of interest, may also prompt treatment.
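For a case-crossover analysis with one hazard window and one control window per patient, the within-person comparison reduces to the matched pair odds ratio, which uses only the discordant patients. The sketch below uses made up exposure data, and the window definitions are left abstract.

```python
def case_crossover_odds_ratio(cases):
    """Matched pair odds ratio for a 1:1 case-crossover analysis.

    cases: list of (exposed_in_hazard_window, exposed_in_control_window),
    one tuple per patient who experienced the outcome.
    Patients exposed in both windows or in neither carry no information.
    """
    hazard_only = sum(1 for h, c in cases if h and not c)
    control_only = sum(1 for h, c in cases if c and not h)
    if control_only == 0:
        raise ValueError("odds ratio undefined: no patient exposed only in the control window")
    return hazard_only / control_only


# Hypothetical exposure histories for 18 patients with the outcome.
cases = (
    [(True, False)] * 8    # exposed only just before the outcome
    + [(False, True)] * 2  # exposed only in the earlier control window
    + [(True, True)] * 3   # concordant: exposed in both windows
    + [(False, False)] * 5 # concordant: exposed in neither window
)

odds_ratio = case_crossover_odds_ratio(cases)
print(f"case-crossover odds ratio: {odds_ratio:.1f}")
```

Because each patient is compared only with themselves, characteristics that stay fixed between the two windows cancel out of the estimate. A factor that changes between the windows, such as worsening disease, does not cancel.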

Case-control designs

Several observational studies of rare diseases have used a case-control design, which is particularly useful in settings in which outcomes are rare and require primary data collection methods. Case-control studies involve sampling from an underlying cohort of patients rather than utilizing information on all cohort patients, which can be resource prohibitive. Schmidt-Pokrzywniak and colleagues conducted an institutional based case-control study to examine risk factors for uveal melanoma. 35 Rather than using a full cohort approach, the authors recruited cases from a referral center for eye tumors and sampled controls from among the cases’ siblings and from local ophthalmologists’ case loads. The case-control design yields an estimate of the same effect as if the entire underlying cohort were used, but with slightly less precision given the sampling. In addition to reducing sample size requirements by identifying all cases and sampling controls, the case-control design allows investigators to easily examine multiple risk factors related to the outcome of interest. In other articles, Schmidt-Pokrzywniak and colleagues have examined the relations between uveal melanoma and mobile phone use, occupational cooking, and ultraviolet radiation. 36 37 38

Cole and colleagues conducted a case-control study using the International Collaborative Gaucher Group registry. 28 The authors compared the odds of splenectomy in patients with avascular necrosis (cases) with the odds in patients without avascular necrosis (controls). The authors used risk set matching, which can reduce bias in case-control studies relative to other sampling strategies. In risk set matching, controls are sampled from sets of patients at risk for the outcome at the time of the corresponding case event. These sets are usually defined by calendar time but can be defined by other variables as well, such as age and sex.
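Risk set sampling as described above can be sketched in a few lines. This is an illustration, not the procedure used by Cole and colleagues. It assumes that all patients enter follow-up at time zero, that risk sets are defined by follow-up time only, and that the cohort records (with the hypothetical fields `id`, `exit_time`, and `is_case`) are simulated.

```python
import random


def risk_set_sample(cohort, controls_per_case, rng):
    """Sample controls for each case from patients still at risk.

    cohort: list of dicts with 'id', 'exit_time', and 'is_case';
    exit_time is the event time for cases and end of follow-up otherwise.
    A patient who becomes a case later is still eligible as a control
    for an earlier case, because they were at risk at that time.
    """
    matched = {}
    for case in (p for p in cohort if p["is_case"]):
        event_time = case["exit_time"]
        risk_set = [
            p for p in cohort
            if p["id"] != case["id"] and p["exit_time"] >= event_time
        ]
        k = min(controls_per_case, len(risk_set))
        matched[case["id"]] = rng.sample(risk_set, k)
    return matched


# Simulated registry of 30 patients, 5 of whom experience the outcome.
rng = random.Random(7)
cohort = [
    {"id": i, "exit_time": rng.uniform(0, 10), "is_case": i % 6 == 0}
    for i in range(30)
]

matched = risk_set_sample(cohort, controls_per_case=2, rng=rng)
for case_id, controls in matched.items():
    print(case_id, [c["id"] for c in controls])
```

Further matching variables such as age and sex could be added as extra conditions when building each risk set, at the cost of smaller pools of eligible controls.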

Prospective inception cohorts

A fourth group of studies employed prospective inception cohort designs, which are also sometimes referred to as “new user” designs when cohort inception is defined by the start of some medical treatment. 39 40 Inception cohorts permit investigators to establish clear temporality among study variables (that is, baseline confounders, exposures, and outcomes) and capture outcome events that occur shortly after entry to the cohort. This approach is particularly important for outcomes related to medical interventions that may be immediately affected by those interventions. While inception cohorts increase validity of observational studies, they can be difficult to implement for rare diseases because they require restricting the already small patient population to those with an observable start of the exposure, risk factor, or disease of interest. Identifying patients at the onset of a rare disease can be challenging because there can be a long lag time associated with making accurate diagnoses for rare diseases. Thus, patients enrolling in registries and other data sources may have had the underlying condition and subsequent treatment for some time. In addition, identifying “new users” of medical treatments for rare diseases outside of clinical trials can be limited if a large proportion of patients with the disease participated in the trial and were exposed to the treatment. Bernard and colleagues described the design and implementation of institution based prospective inception cohort studies in pediatric thrombosis and stroke research. 41

In this review of methods that have been proposed for and used to study health outcomes in rare diseases, we identified a wide variety of non-traditional approaches. The majority of the identified articles were published in 2008 or later, highlighting the increasing interest in this area. Most articles also focused on innovations in methods for clinical trials intended to minimize the number of participants needed to meet the study goals or to maximize the proportion of participants who receive active treatment to encourage enrollment.

Implications for randomized trials

Advances in clinical trial design relevant to rare diseases are well developed, having been discussed in several technical articles and applied in many clinical scenarios. Cornu and colleagues provide examples of studies that have used each of 12 different randomized designs in the setting of rare diseases. 17 They and Gupta and colleagues have also proposed frameworks to aid selection of randomized clinical trial methods for studying health outcomes in rare diseases. 16 17 Both algorithms pose similar questions to address whether the assumptions of crossover and n-of-1 trials are likely to hold, such as whether the intervention of interest has only a short term effect on the outcome. Gupta and colleagues’ algorithm asks about whether sufficient numbers of patients are likely to be recruited for a given design and offers alternatives when this is not the case. Cornu and colleagues’ algorithm explicitly asks about whether objectives of the study include minimizing the time patients are receiving placebo or ensuring that patients receive active treatment by the end of the trial. Until a unified framework is developed, both algorithms can be used to help decide the most appropriate design to study health outcomes in patients with rare diseases.

Implications for observational studies

In addition to dealing with considerations about general design and analysis (for example, outcome selection, incorporation of evidence into larger context), our methodological review is the first to go beyond randomized trial methods for studying rare diseases. This is important because with small sample sizes, randomization will not always achieve its goal of balancing patient characteristics between treatment groups. In contrast with the body of literature on clinical trial methods in rare diseases, however, the literature on observational methods is considerably less mature. Several observational studies presented only descriptive frequencies of outcomes after a treatment and often with no comparison group, limiting the inferences that can be drawn about the treatment and subsequent outcomes. In general, observational studies of rare diseases used the same methods that are used to study health outcomes in more common conditions. However, several advanced observational methods that are used to study outcomes in common conditions, including propensity scores and self controlled designs, are particularly well suited for tackling confounding in the setting of rare events. Propensity scores deal with confounding in between person comparisons, whereas self controlled designs implicitly tackle time invariant confounding by making within person comparisons. It is important to note, however, that statistically controlling for confounding may not always be possible, even with propensity scores in studies with few participants.

In addition to the often small samples, studies of patient health outcomes in rare diseases using observational data face other important challenges. For example, there is often no appropriate group against which to compare outcome frequencies in patients with rare diseases and, even when there is, controlling for confounding can be difficult because the risk factors of those outcomes are usually not well understood. Yet, little work has been done to develop or apply methods to directly deal with these challenges. We did not identify any novel observational methods that have been developed to study outcomes in rare diseases. As observational data on rare diseases become more widely available, greater attention is needed on methods to analyze these data to validly evaluate health outcomes in patients with rare diseases.

Limitations of this study

This survey of research methods for rare diseases has several limitations. Firstly, our literature search was focused on articles that mentioned “rare disease” in a searchable field. Because of the large number of unique rare diseases, we were not able to search for applications of innovative methods related to each specific disease. In addition, our review was intended to provide a general overview of non-traditional methods that have been proposed or applied to studying rare diseases. If other non-traditional methods exist that might be applicable to rare diseases but have not yet been discussed in a publication in the databases we searched, we may not have identified them. Moreover, our review is intended to enhance awareness of the availability and use of innovative methods for studying health outcomes in rare diseases and is not intended to provide a technical review of these methods, which can be found in the cited references. Finally, while we searched three databases, two of which include biomedical journals and a third that covers disciplines including psychology, physics, and engineering, it is possible that we missed relevant methods that have been used in other specialties, such as the social sciences.

Conclusions and future directions

Despite these limitations, we found several promising strategies that may contribute substantial advances to the study of health outcomes in patients with rare diseases. Some of these methods (for example, crossover designs and propensity scores) are already used in studies of common conditions. Awareness of the armamentarium of research tools available will help investigators design studies in patients with specific rare diseases and will help clinicians interpret the results of these studies when treating patients with these conditions. Observational studies are an important approach for studying health outcomes in rare diseases, particularly as patient registries and electronic healthcare databases continue to grow and offer richer clinical information. However, greater attention to innovative methods for using observational data to study rare disease health outcomes is needed.

What is already known on this topic

Many barriers exist to advancing knowledge of and treatment options for rare diseases

Because rare diseases are clinically dissimilar, clinicians, scientists, and other stakeholders working in one medical specialty may not be familiar with methods being applied in other disciplines

What this study adds

Several promising strategies that may contribute substantial advances to the study of health outcomes in patients with rare diseases have been proposed, particularly for randomized trials

Greater attention to innovative methods for using observational data to study rare disease health outcomes is needed

Cite this as: BMJ 2014;349:g6802

Contributors: JJG and ASK conceived and designed the study. JJG drafted the article. All authors analysed and interpreted the data, revised the manuscript for important intellectual content, gave final approval of the version to be published, and fulfill the criteria for authorship. No one who is not included as an author fulfills the criteria. JJG is the guarantor.

Funding: This project was funded under Contract No 290 2010 00006l TO #4 from the Agency for Healthcare Research and Quality, US Department of Health and Human Services as part of the Developing Evidence to Inform decisions about Effectiveness (DEcIDE) program. The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality or the US Department of Health and Human Services.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf (available on request from the corresponding author) and declare: support from the Agency for Healthcare Research & Quality for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Ethical approval: Not required.

Data sharing: Summary data are available from the corresponding author at jgagne1{at}partners.org .

Transparency: JJG affirms that the manuscript is an honest, accurate, and transparent account of the study being reported; that no important aspects of the study have been omitted; and that any discrepancies from the study as planned have been explained.

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/ .



Small Data Challenges of Studying Rare Diseases

  • 1 Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
  • 2 Statistical Editor, JAMA Network Open
  • Original Investigation Assessment of Thyroid Function in Patients With Alkaptonuria Shirisha Avadhanula, MD; Wendy J. Introne, MD; Sungyoung Auh, PhD; Steven J. Soldin, PhD; Brian Stolze, MSc; Debra Regier, MD; Carla Ciccone, MS; Fady Hannah-Shmouni, MD; Armando C. Filie, MD; Kenneth D. Burman, MD; Joanna Klubo-Gwiezdzinska, MD, PhD, MHSc JAMA Network Open

The age of big data is in full swing, with researchers in both clinical medicine and public health seeking to take advantage of the increasing availability of massive amounts of electronic and administrative health data. In turn, this has led to substantial resources and efforts being poured into the development and teaching of methods for data collection and storage as well as machine learning analytic methods. 1

However, big data are not always available, especially in the study of rare diseases. Indeed, in the study of rare diseases, small sample sizes are inevitable, especially when the primary end point is also uncommon. As an example, Avadhanula et al 2 used data from a cohort of 125 patients with alkaptonuria, a rare autosomal recessive disorder. Patients were recruited between 2000 and 2018 as part of a prospective longitudinal study conducted at the National Human Genome Research Institute to investigate the incidence of thyroid dysfunction among patients with alkaptonuria. While this is by no means a generous sample size, the cohort is the largest of its kind for patients with alkaptonuria, according to the authors.

In the US, a rare disease is defined as a health condition that affects fewer than 200 000 individuals. 3 This definition was created by Congress as part of the Orphan Drug Act of 1983, which aimed to use financial incentives to motivate pharmaceutical and medical device companies to develop new treatments for patients with rare diseases. Close to 7000 conditions meet this definition. Although a relatively small number of individuals are affected by each rare disease, the estimated total number of individuals living with any rare disease is between 25 million and 30 million. 3 Support for rare disease research continues today. In 2016, the US Food and Drug Administration awarded $23 million over 4 years to support research in 21 different rare diseases. 4 The Patient-Centered Outcomes Research Institute also has a special advisory board for rare disease research and thus far has funded more than 28 patient-centered comparative effectiveness studies that focus on the treatment and management of rare diseases. 5

That the study of rare diseases poses unique challenges has been recognized. From the perspective of study design, researchers investigating rare diseases have many options, including crossover and adaptive trials. 6 For observational studies, Whicher et al 7 list self-controlled study designs, case-control designs, and prospective inception cohorts as potential designs suitable for rare disease research. Beyond the choice of study design, researchers must also be wary of the analytic challenges that arise from studying rare diseases, including the extent to which the available data can be viewed as representative of the entire population of patients with the condition and whether there is sufficient (statistical) power to draw definitive conclusions (ie, those that could inform decision-making). It is perhaps less well recognized that, when the sample size is small, P values are especially vulnerable to small deviations in the observed number of outcomes. For example, in the study by Avadhanula et al, 2 1 patient was diagnosed with hyperthyroidism in the cohort of 125 individuals. Based on the exact test for 1-sample proportion, Avadhanula et al 2 found insufficient evidence that the estimated prevalence in the study population (ie, 1 of 125 [0.8%]) was different than that in the general population (ie, 0.5%), with a P value of .88. As a thought experiment, suppose 2 patients instead of 1 had been diagnosed with hyperthyroidism. The same test would yield a P value of .23. Furthermore, if 3 patients were diagnosed, the resulting P value would then be .04. Thus, by hypothetically observing just 2 more cases, there is a dramatic change in the P value, a change that would likely alter decision-making.
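The sensitivity described above can be explored with a short calculation. The sketch below is illustrative only: it assumes a reference prevalence of exactly 0.5% and a two-sided exact binomial test that sums the probabilities of all outcomes no more likely than the one observed. The function names are hypothetical, and because neither the two-sided convention nor the unrounded reference prevalence is stated here, the printed values need not match the P values quoted above. The point is how sharply the result moves as the count goes from 1 to 3 in a cohort of 125.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k events in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def exact_binomial_p(k: int, n: int, p0: float) -> float:
    """Two-sided exact P value for k events in n trials under prevalence p0.

    Assumed convention: add up the probability of every outcome that is
    no more likely under p0 than the observed count k.
    """
    observed = binom_pmf(k, n, p0)
    tolerance = 1 + 1e-7  # guards against floating-point ties
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p0) for i in range(n + 1))
        if prob <= observed * tolerance
    )
    return min(1.0, total)


if __name__ == "__main__":
    n, p0 = 125, 0.005  # cohort size and assumed reference prevalence
    for cases in (1, 2, 3):
        print(cases, round(exact_binomial_p(cases, n, p0), 3))
```

Other two-sided conventions (for example, doubling the smaller tail) give different numbers for the same data, which is one more reason to report the method alongside the P value when counts are this small.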

This is all the more important to acknowledge when it is placed against the backdrop of a 2019 editorial by the American Statistical Association that called on researchers to move away from using the term statistical significance to describe results with a P value of less than .05. 8 As part of the editorial, the American Statistical Association solicited suggestions for alternative paradigms. One interesting proposal was that journals adopt a so-called results-blind review process in which study results are omitted from the initial manuscript submission. In doing so, the central criteria for publication would be whether the study objective is relevant and interesting from either a clinical or public health perspective and whether the study design and methods are appropriate. Rare disease research that lacks statistical power or fails to achieve the conventional levels of statistical significance may especially benefit from this type of review process. More publications and dissemination of knowledge of rare disease research would increase awareness and possibly foster new collaborations among different institutions that could lead to small data becoming bigger.

In a 2019 study, Rees et al 9 reported on the completion and publication status of 659 clinical trials for rare diseases registered at ClinicalTrials.gov between January 2010 and December 2012. They found that, as of December 2014, 199 trials (30.2%) were discontinued, with insufficient patient accrual as the most cited reason. Furthermore, among those completed, more than half (306 [66.5%]) remained unpublished at 2 years and nearly one-third (142 [31.5%]) remained unpublished at 4 years. Although the authors were unable to ascertain whether sample size and statistical significance factored into whether a study was published, it seems highly plausible that they would in many instances. 10

Currently, JAMA Network Open does not use a results-blind review process. However, although not explicitly stated in the Instructions to Authors, statistical significance is not considered a criterion for publication. Driven by the desire to publish important science, JAMA Network Open is open to publishing high-quality studies with an important research question, a sound study design, appropriate methodology, and conclusions that are a reasonable and accurate reflection of the nature and strength of the evidence. Consequently, this journal represents an important venue for the publication of studies of rare diseases and embraces the challenges that arise from studying diseases that are often overlooked. After all, do we not hope that every disease will become rare in the future?

Published: March 23, 2020. doi:10.1001/jamanetworkopen.2020.1965

Open Access: This is an open access article distributed under the terms of the CC-BY License. © 2020 Mitani AA et al. JAMA Network Open.

Corresponding Author: Sebastien Haneuse, PhD, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Ave, Bldg II, Room 407, Boston, MA 02115.

Conflict of Interest Disclosures: None reported.

Mitani AA , Haneuse S. Small Data Challenges of Studying Rare Diseases. JAMA Netw Open. 2020;3(3):e201965. doi:10.1001/jamanetworkopen.2020.1965

Study Design 101

Case Control Study

A study that compares patients who have a disease or outcome of interest (cases) with patients who do not have the disease or outcome (controls), and looks back retrospectively to compare how frequently the exposure to a risk factor is present in each group to determine the relationship between the risk factor and the disease.

Case control studies are observational because no intervention is attempted and no attempt is made to alter the course of the disease. The goal is to retrospectively determine the exposure to the risk factor of interest from each of the two groups of individuals: cases and controls. These studies are designed to estimate odds.

Case control studies are also known as "retrospective studies" and "case-referent studies."

Advantages

  • Good for studying rare conditions or diseases
  • Less time needed to conduct the study because the condition or disease has already occurred
  • Lets you simultaneously look at multiple risk factors
  • Useful as initial studies to establish an association
  • Can answer questions that could not be answered through other study designs


Disadvantages

  • Retrospective studies have more problems with data quality because they rely on memory, and people with a condition may be more motivated to recall risk factors (recall bias).
  • Not good for evaluating diagnostic tests because it’s already clear that the cases have the condition and the controls do not
  • It can be difficult to find a suitable control group

Design pitfalls to look out for

Care should be taken to avoid confounding, which arises when an exposure and an outcome are both strongly associated with a third variable. Controls should be subjects who might have been cases in the study but are selected independent of the exposure. Cases and controls should also not be "over-matched."

Is the control group appropriate for the population? Does the study use matching or pairing appropriately to avoid the effects of a confounding variable? Does it use appropriate inclusion and exclusion criteria?

Fictitious Example

There is a suspicion that zinc oxide, the white non-absorbent sunscreen traditionally worn by lifeguards, is more effective at preventing sunburns that lead to skin cancer than absorbent sunscreen lotions. A case-control study was conducted to investigate whether exposure to zinc oxide is a more effective skin cancer prevention measure. The study compared a group of former lifeguards who had developed cancer on their cheeks and noses (cases) with a group of lifeguards without this type of cancer (controls) and assessed their prior exposure to zinc oxide or absorbent sunscreen lotions.

This study would be retrospective in that the former lifeguards would be asked to recall which type of sunscreen they used on their face and approximately how often. This could be either a matched or unmatched study, but efforts would need to be made to ensure that the former lifeguards are of the same average age, and lifeguarded for a similar number of seasons and amount of time per season.

Real-life Examples

Boubekri, M., Cheung, I., Reid, K., Wang, C., & Zee, P. (2014). Impact of windows and daylight exposure on overall health and sleep quality of office workers: a case-control pilot study . Journal of Clinical Sleep Medicine : JCSM : Official Publication of the American Academy of Sleep Medicine, 10 (6), 603-611. https://doi.org/10.5664/jcsm.3780

This pilot study explored the impact of exposure to daylight on the health of office workers (measuring well-being and sleep quality subjectively, and light exposure, activity level and sleep-wake patterns via actigraphy). Individuals with windows in their workplaces had more light exposure, longer sleep duration, and more physical activity. They also reported better scores in the areas of vitality and role limitations due to physical problems, better sleep quality, and fewer sleep disturbances.

Togha, M., Razeghi Jahromi, S., Ghorbani, Z., Martami, F., & Seifishahpar, M. (2018). Serum Vitamin D Status in a Group of Migraine Patients Compared With Healthy Controls: A Case-Control Study . Headache, 58 (10), 1530-1540. https://doi.org/10.1111/head.13423

This case-control study compared serum vitamin D levels in individuals who experience migraine headaches with those of their matched controls. In participants studied over a period of thirty days, higher levels of serum vitamin D were associated with lower odds of migraine headache.

Related Formulas

  • Odds ratio in an unmatched study
  • Odds ratio in a matched study
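As a rough illustration of the two formulas listed above, the sketch below uses the standard textbook definitions: in an unmatched study the odds ratio is the cross-product ratio ad/bc of the 2×2 table, and in a 1:1 matched study it is the ratio of the two kinds of discordant pairs. The function names and the counts are hypothetical and are not taken from any study cited here.

```python
def odds_ratio_unmatched(a: int, b: int, c: int, d: int) -> float:
    """Cross-product odds ratio for an unmatched 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


def odds_ratio_matched(case_only_exposed: int, control_only_exposed: int) -> float:
    """Odds ratio for 1:1 matched pairs, using discordant pairs only.

    case_only_exposed: pairs where only the case was exposed
    control_only_exposed: pairs where only the control was exposed
    """
    if control_only_exposed == 0:
        raise ValueError("odds ratio is undefined with no control-only pairs")
    return case_only_exposed / control_only_exposed


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(odds_ratio_unmatched(a=30, b=10, c=20, d=40))
    print(odds_ratio_matched(case_only_exposed=15, control_only_exposed=5))
```

Concordant pairs do not enter the matched calculation at all, so a matched study with many pairs can still rest on very few informative observations.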

Related Terms

Case

A patient with the disease or outcome of interest.

Confounding

When an exposure and an outcome are both strongly associated with a third variable.

Control

A patient who does not have the disease or outcome.

Matched Design

Each case is matched individually with a control according to certain characteristics such as age and gender. It is important to remember that the concordant pairs (pairs in which the case and control are either both exposed or both not exposed) tell us nothing about the risk of exposure separately for cases or controls.
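The role of concordant pairs can be seen in a short sketch of the matched-pair odds ratio, which is the ratio of the two kinds of discordant pairs. The function name and the pair counts below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def matched_pair_odds_ratio(pairs):
    """Matched-pair odds ratio from (case_exposed, control_exposed) booleans.

    Only discordant pairs enter the estimate: pairs where the case alone
    is exposed divided by pairs where the control alone is exposed.
    """
    counts = Counter((bool(case), bool(control)) for case, control in pairs)
    case_only = counts[(True, False)]
    control_only = counts[(False, True)]
    if control_only == 0:
        raise ValueError("no pairs with only the control exposed; ratio undefined")
    return case_only / control_only


# Hypothetical data: 6 and 2 discordant pairs, plus 12 concordant pairs.
pairs = (
    [(True, False)] * 6
    + [(False, True)] * 2
    + [(True, True)] * 5
    + [(False, False)] * 7
)
print(matched_pair_odds_ratio(pairs))

# Adding more concordant pairs leaves the estimate unchanged.
print(matched_pair_odds_ratio(pairs + [(True, True)] * 100))
```

Both calls give the same value, which is why concordant pairs carry no information about the exposure effect in a matched analysis.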

Observed Assignment

The method of assignment of individuals to study and control groups in observational studies when the investigator does not intervene to perform the assignment.

Unmatched Design

The controls are a sample from a suitable non-affected population.

Now test yourself!

1. Case Control Studies are prospective in that they follow the cases and controls over time and observe what occurs.

a) True b) False

2. Which of the following is an advantage of Case Control Studies?

a) They can simultaneously look at multiple risk factors.
b) They are useful to initially establish an association between a risk factor and a disease or outcome.
c) They take less time to complete because the condition or disease has already occurred.
d) b and c only
e) a, b, and c

© 2011-2019, The Himmelfarb Health Sciences Library
  • Open access
  • Published: 12 September 2011

Reducing selection bias in case-control studies from rare disease registries

  • J Alexander Cole 1 ,
  • John S Taylor 1 ,
  • Thomas N Hangartner 2 ,
  • Neal J Weinreb 3 ,
  • Pramod K Mistry 4 &
  • Aneal Khan 5  

Orphanet Journal of Rare Diseases volume 6, Article number: 61 (2011)


In clinical research of rare diseases, where small patient numbers and disease heterogeneity limit study design options, registries are a valuable resource for demographic and outcome information. However, in contrast to prospective, randomized clinical trials, the observational design of registries is prone to introduce selection bias and negatively impact the validity of data analyses.

The objective of the study was to demonstrate the utility of case-control matching and the risk-set method in order to control bias in data from a rare disease registry. Data from the International Collaborative Gaucher Group (ICGG) Gaucher Registry were used as an example.

A case-control matching analysis using the risk-set method was conducted to identify two groups of patients with type 1 Gaucher disease in the ICGG Gaucher Registry: patients with avascular osteonecrosis (AVN) and those without AVN. The frequency distributions of gender, decade of birth, treatment status, and splenectomy status were presented for cases and controls before and after matching. Odds ratios (and 95% confidence intervals) were calculated for each variable before and after matching.

The application of case-control matching methodology results in cohorts of cases (i.e., patients with AVN) and controls (i.e., patients without AVN) who have comparable distributions for four common parameters used in subject selection: gender, year of birth (age), treatment status, and splenectomy status. Matching resulted in odds ratios of approximately 1.00, indicating no bias.


We demonstrated bias in case-control selection in subjects from a prototype rare disease registry and used case-control matching to minimize this bias. Therefore, this approach appears useful to study cohorts of heterogeneous patients in rare disease registries.

Rare diseases, exemplified by Gaucher disease, are defined in the United States as those affecting fewer than 200,000 patients [1]. A major impediment to the study of these diseases is the scarcity of patients in any one city or country. Nevertheless, the global burden of patients affected by rare diseases is substantial: at least 30 million patients are estimated to suffer from one of the 7,000 rare diseases currently identified [2]. On average, each rare disease is estimated to afflict 4,200 patients [2]. Our search of the word 'registry' on clinicaltrials.gov as of 4 May 2011 identified 913 results.

Rare disease patient registries provide relatively large representative cohorts for clinical study. As a rule individual rare diseases are highly heterogeneous in phenotypic expression, which hinders optimal natural history or outcomes studies using data from rare disease registries. An excellent example of a rare disease registry is the International Collaborative Gaucher Group (ICGG) Gaucher Registry, which has been collecting patient data for 20 years. In fact, the ICGG Gaucher Registry is the prototype by which several disease registries have been created (Table 1 ).

Randomized double-blind, placebo controlled clinical trials represent the highest category of evidence base for determining efficacy of treatments. For rare hereditary diseases, such as Gaucher disease, there are significant impediments to the design and conduct of adequately powered clinical trials. For example, rarity of the disease compounded by genetic and phenotypic heterogeneity hinders the development of appropriate subject groups for study that are controlled for factors such as age, sex, disease severity, and genotype. Moreover, following the introduction of an effective therapy, few patients remain treatment-naive for evaluation of alternative therapies, which may differ in mechanism of action and have overlapping effects. An additional consideration when evaluating long-term treatment outcomes is the chronic nature of many rare diseases, which often extends beyond the reasonable time span of a traditional clinical trial. As an alternative model, the Framingham heart study provides an example of the design and conduct of an observational cohort study designed to collect longitudinal data with the goal of studying health outcomes [ 3 ].

An important feature of disease registries is the potential to provide real-world data from the community [2]. Therefore, data from registries could complement data obtained from clinical trials to develop optimal standards of care for rare diseases. Indeed, data from the ICGG Gaucher Registry have been effectively used to demonstrate treatment outcomes in multiple disease compartments, which have in turn been used to develop a standard of care and expected treatment outcomes for Gaucher disease [4–6]. These have formed the basis for developing therapeutic goals [7] and for defining endpoints for subsequent clinical trials of new therapeutic agents [8–10]. Analytical approaches used in these studies from the ICGG Gaucher Registry have included multivariate mixed-effects analyses [11], propensity scoring and non-linear effects modeling [12], and Poisson regression modeling to determine relative risk [13].

A major limitation of registry data is selection bias, which is inherent in the observational design of the registry and the flexibility accorded to contributors to determine which patients to include and what data to submit [14]. One approach to overcoming such selection bias is case-control matching, in which cases are selected based on the presence of a specific disease outcome and matched to controls identified as not having that outcome. These cases and controls are matched according to values for a set of background characteristics. However, this type of analysis requires a population sufficiently large to identify cases of interest and randomly selected controls. With almost 6,000 enrolled subjects, the ICGG Gaucher Registry is the largest worldwide registry for an inborn error of metabolism, which makes it feasible to attempt case-control matching.

In this paper, the cases of interest are patients with skeletal avascular osteonecrosis (AVN), a serious and irreversible complication of Gaucher disease that occurs sporadically and unpredictably in a subset of patients. The set of matched controls are patients with type 1 Gaucher disease who did not develop AVN. By applying the risk-set method approach, we demonstrate the utility of the case-control matching method to identify case and control patients who have comparable distributions for four common parameters used in subject selection: gender, year of birth (age), treatment status, and splenectomy status. We conclude that selection bias in case-control selection of subjects from rare disease registries occurs and that this can be overcome through case-control matching to minimize bias. Therefore, application of this technique permits the study of treatment outcomes or natural history within rare disease registries.

International Collaborative Gaucher Group (ICGG) Gaucher Registry

The ICGG Gaucher Registry was started to track the clinical, demographic, genetic, biochemical and therapeutic characteristics of patients with Gaucher disease throughout the world, irrespective of disease severity, treatment status, or treatment choice [ 15 ]. An independent international group of physician experts in Gaucher disease provides scientific direction and governance of the Registry, with logistical support from Genzyme, a Sanofi Company (Cambridge, Massachusetts). Since its inception in 1991, with Institutional Review Board/Ethics Committee approvals, over 700 physicians from more than 60 countries have voluntarily submitted de-identified data on over 5,800 patients to the Registry.

Study population

We identified all patients in the ICGG Gaucher Registry as of 1 October 2010 with type 1 Gaucher disease and a reported treatment status, including date of initiation of imiglucerase (Cerezyme®, Genzyme Corporation) or alglucerase (Ceredase®, Genzyme Corporation) treatment. Until early 2010, alglucerase and imiglucerase were the only commercially approved enzyme treatments for Gaucher disease. Alglucerase and imiglucerase have been shown to be therapeutically equivalent in a randomized, two-arm clinical trial [16]. For simplicity, these two treatments will be denoted as imiglucerase in this publication.

Case identification

Based on data from the ICGG Gaucher Registry skeletal case report forms, we identified all patients with affirmative reports of AVN. Cases of AVN were typically ascertained through radiographic or magnetic resonance image (MRI) results. An affirmative report was based on the treating physician's review of the corresponding radiographic or MRI result. Each patient's earliest date of an affirmative report of AVN was considered to be the index date.

Case-control matching

In order to quantify the association between risk factors and the onset of AVN, we initially sought to identify all patients without AVN as controls in our analysis. A review of characteristics revealed apparent differences between cases and controls in gender, decade of birth, imiglucerase/alglucerase treatment status, and history of splenectomy. Prior to the advent of imiglucerase, patients underwent splenectomy for relief of cytopenia and/or pressure symptoms; however, splenectomy itself has the potential to alter the phenotype and natural course of the disease [17, 18]. Since these variables (gender, decade of birth, treatment status, history of splenectomy) may affect the risk of AVN and may also be associated with other risk factors for AVN, we implemented a case-control matching algorithm using the risk-set method [19]. For each case of AVN, we identified all controls who matched on gender and year of birth (± five years). Among these matched controls, we then assigned their index date to be the same date as the AVN onset date for the corresponding case and excluded controls who were not followed up in the ICGG Gaucher Registry as of that index date. We further determined whether the case and controls as of their index date had 1) initiated treatment with imiglucerase/alglucerase and 2) undergone prior splenectomy. For each individual case, we randomly selected up to five controls who matched on all four characteristics [20].
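The matching steps described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' SAS implementation: the `Patient` record, its field names, the interpretation of "followed up as of the index date" as the index date falling within a follow-up window, and the decision to let one control serve several cases are all assumptions of this sketch.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Patient:
    # Hypothetical record layout; the real Registry schema is not given in the text.
    pid: str
    gender: str
    birth_year: int
    followup_start: date
    followup_end: date
    treatment_start: Optional[date] = None
    splenectomy_date: Optional[date] = None
    avn_date: Optional[date] = None  # None means no AVN reported


def status_on(event: Optional[date], index: date) -> bool:
    """True if the event had already occurred on or before the index date."""
    return event is not None and event <= index


def match_controls(case, pool, max_controls=5, rng=None):
    """Randomly pick up to max_controls AVN-free patients matching the case."""
    if case.avn_date is None:
        raise ValueError("a case must have an AVN date to serve as index date")
    rng = rng or random.Random()
    index = case.avn_date  # controls inherit the case's AVN date as index date
    eligible = [
        p for p in pool
        if p.avn_date is None
        and p.gender == case.gender
        and abs(p.birth_year - case.birth_year) <= 5
        # Assumed meaning of "followed up as of the index date".
        and p.followup_start <= index <= p.followup_end
        # Treatment and splenectomy status are evaluated at the index date.
        and status_on(p.treatment_start, index) == status_on(case.treatment_start, index)
        and status_on(p.splenectomy_date, index) == status_on(case.splenectomy_date, index)
    ]
    return rng.sample(eligible, min(max_controls, len(eligible)))
```

Evaluating treatment and splenectomy status at the case's index date, rather than at the end of follow-up, is what makes this a risk-set style comparison: a control who started treatment after the index date counts as untreated for that case.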

Statistical analysis

We presented the frequency distributions of gender, decade of birth, treatment status, and splenectomy status for cases and controls before and after matching. We calculated odds ratios (and 95% confidence intervals) for each variable before and after matching and present the percent bias for each variable [21, 22] using the formula below:

Percent bias = ((ARR - RR)/RR) × 100

where

ARR = Apparent exposure relative risk (i.e., before matching)

RR = 'True' or fully adjusted exposure relative risk (i.e., after matching)

An odds ratio of 1·00 indicates no difference in the distributions between cases and controls [23]. All analyses were conducted in SAS 9·1 (SAS Institute Inc., Cary, North Carolina, USA) in accordance with STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) guidelines [24].

As of 1 October 2010, the ICGG Gaucher Registry contained a total of 5,894 patients. Of these, 5,156 patients met the study inclusion criteria: type 1 Gaucher disease, known treatment status, and known date of initiation of treatment. From this group of patients (n = 5,156), 176 patients had a history of AVN with no accompanying assessment or diagnosis dates reported to the Registry and were therefore excluded from the study. Of the remaining 4,980 patients, we identified 853 patients with reports of AVN and 4,127 patients without AVN.

Patient characteristics before matching are shown in Table 2 . Before matching, the ratio of females to males was similar in both groups, with a slightly higher percentage of females in the control group. In contrast, before matching, a higher percentage of patients born in earlier decades (i.e. older patients with more years at risk) reported AVN compared to the group without AVN. Additionally, distributions of splenectomy and treatment status were substantially different between case and control patients, as indicated by odds ratios of 3·21 for splenectomy status and 6·09 for treatment status.

In general, matching resulted in odds ratios of approximately 1·00 as seen in Table 3 . After matching, the distributions of patients born in each decade in both groups were more comparable. For splenectomy status and treatment status, where differences in distributions before matching were apparent, the percent bias was ((3·21 - 1·32)/1·32) × 100 = 143·2% and ((6·09 - 1·10)/1·10) × 100 = 453·6%, respectively (Figure 1 ).
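The two percent bias figures in the preceding paragraph can be reproduced with a few lines of Python. The function name is hypothetical; the odds ratios are the before-matching and after-matching values quoted in the text.

```python
def percent_bias(apparent, adjusted):
    """Percent bias of the apparent (pre-matching) estimate relative to the
    adjusted (post-matching) estimate: ((apparent - adjusted) / adjusted) * 100."""
    if adjusted == 0:
        raise ValueError("adjusted estimate must be non-zero")
    return (apparent - adjusted) / adjusted * 100


# Odds ratios before and after matching, as reported in the text.
for label, before, after in [
    ("splenectomy status", 3.21, 1.32),
    ("treatment status", 6.09, 1.10),
]:
    print(f"{label}: {percent_bias(before, after):.1f}%")
# splenectomy status: 143.2%
# treatment status: 453.6%
```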

Figure 1. Odds ratios in subjects with and without avascular osteonecrosis before and after matching.

Registries for the study of rare diseases serve to create pooled patient populations that are sufficiently large for robust statistical analysis. However, studies based on registry databases are vulnerable to bias. For example, domains captured in the database may differ from center to center; patients with less severe disease may not be enrolled or, if enrolled, may have fewer data collected. In addition, the data may be incomplete. Verification of the quality or completeness of the data may be lacking and there is no systematic evaluation of statistical methods to generate an unbiased dataset from registry data. Nevertheless, as many long-term studies [ 25 – 27 ] have demonstrated in a variety of diseases, having longitudinal data is critical to understanding the natural history or response to treatment of a chronic disease. This type of data is often analyzed using case-control methodology.

However, case-control studies in patients with rare diseases, whether performed in individual large clinics or through disease registries, are inherently vulnerable to bias. Chronic diseases, such as Gaucher disease, are highly heterogeneous, and the phenotype can vary depending on the age of onset, age of the patient, adjunct therapies, genotype, access to health-care resources, and environmental factors. Patients with milder disease tend to have less contact with specialty clinics and less frequent and intensive follow-up; many are not diagnosed for several years [28]. When more than one control matches a given case, it has not been validated, to our knowledge, whether non-random selection of a control, pooling of all controls, or selection of a group of controls is a valid method of reducing selection bias. Therefore, selecting an unbiased control group is not simply a matter of finding subjects who are negative for the disease variable being studied, and arbitrary selection of controls or pooling of controls does not prevent a biased control group that may lead to an erroneous conclusion. The method we used permitted appropriate risk-set selection and subsequent matching, and it circumvented the challenge of clinical heterogeneity in observational registries. However, it is applicable only in the context of a large, well annotated patient cohort combined with extensive follow-up data.

This study shows that some biases can be successfully minimized in an observational database such as the ICGG Gaucher Registry by using case-control matching and a modified risk-set method approach. Applying this established method to registry data, we demonstrated the effective use of the case-control matching method to yield cohorts of case and control patients who have comparable distributions for four common parameters used in subject selection: gender, year of birth (age), treatment status, and splenectomy status. The results after matching showed odds ratios close to one, which indicates no difference or bias between cases and controls on these matching variables. Skeletal avascular osteonecrosis was selected for this analysis because it is a complication of type 1 Gaucher disease associated with serious acute and chronic morbidity [13], but it is a difficult target to study because it occurs sporadically and unpredictably. The matched patients now constitute a resource for further analysis. In this cohort, other risk factors can now be studied without introducing bias due to differences in age, gender, treatment status, and splenectomy status.

In this study, the main outcome variable was the change in odds ratios. The odds ratios indicate the amount of bias in the groups. The largest changes were observed for treatment status and splenectomy status. This difference may be due to several factors. One factor is that many of the controls, even though they were not symptomatic for the variable in question, were receiving imiglucerase therapy. Because biased selection of controls may over or under represent the variables in case-control pairs, having more controls than cases may have made it appear as if AVN was more likely to occur in younger patients or subjects without a history of imiglucerase therapy or who underwent a prior splenectomy. Having randomly matched controls, the cases and controls were numerically equally represented, thus reducing the bias. The purpose of having matched data is to reduce the finding of any such relationship due to biased case or control selection.

The practical application of this technique is to validate that case-control studies have a minimized bias in subject selection, which provides researchers with an analytical tool to test their hypotheses of interest. This study has demonstrated the use of case-control matching to reduce the bias between groups. We conclude that bias in case-control selection in subjects from rare disease registries can occur, and case-control matching is one method to minimize this bias.

This study shows that some biases can be successfully minimized in an observational database such as the ICGG Gaucher Registry by using case-control matching and a modified risk-set method approach.

Authors' information

AK is an Assistant Professor of Medical Genetics and Pediatrics at the University of Calgary at Alberta Children's Hospital. His primary work is in the clinical management of patients with inborn errors of metabolism, including Gaucher disease, in addition to clinical research in the same area.

TNH is a Distinguished Professor of Biomedical Engineering, Medicine & Physics at Wright State University in Dayton, OH. His long-term interests in non-invasive, quantitative assessment of bone resulted in the invitation to participate in the data analysis and subsequent drafting of this manuscript.

JAC is Director, Epidemiology at Genzyme, a Sanofi Company, where he participates in the design and conduct of data analysis from disease registries, including the ICGG Gaucher Registry. He holds a Doctor of Science degree in Epidemiology.

JST is a Senior Biostatistician at Genzyme, a Sanofi Company, where he participates in the design and conduct of data analysis from the ICGG Gaucher Registry. He holds a Master of Arts degree in Statistics.

PKM is Professor and Chief, National Gaucher Disease Treatment Center at Yale School of Medicine. He has major clinical and research interests in Gaucher disease. He is a member of the Scientific Board of ICGG Gaucher Registry and his participation in the study derives from this role.

NJW is Voluntary Associate Professor of Medicine at the Miller School of Medicine of the University of Miami and Director of the University Research Foundation for Lysosomal Storage Diseases (unaffiliated with the University of Miami). He has had a research and clinical interest in Gaucher disease for 44 years. NJW is the chair of the North American Scientific Board of ICGG Gaucher Registry and co-chair of the International ICGG Board. His participation in the study derives from these roles.

References

1. Office of Rare Disease Research. [http://rarediseases.info.nih.gov/RareDiseaseList.aspx?PageID=1].

2. Rubinstein YR, Groft SC, Bartek R, Brown K, Christensen RA, Collier E, Farber A, Farmer J, Ferguson JH, Forrest CB, Lockhart NC, McCurdy KR, Moore H, Pollen GB, Richesson R, Miller VR, Hull S, Vaught J: Creating a global rare disease patient registry linked to a rare diseases biorepository database: Rare Disease-HUB (RD-HUB). Contemp Clin Trials. 2010, 31: 394-404. 10.1016/j.cct.2010.06.007.

3. Thanassoulis G, Massaro JM, Cury R, Manders E, Benjamin EJ, Vasan RS, Cupples LA, Hoffmann U, O'Donnell CJ, Kathiresan S: Associations of long-term and early adult atherosclerosis risk factors with aortic and mitral valve calcium. J Am Coll Cardiol. 2010, 55: 2491-2498. 10.1016/j.jacc.2010.03.019.

4. Andersson H, Kaplan P, Kacena K, Yee J: Eight-year clinical outcomes of long-term enzyme replacement therapy for 884 children with Gaucher disease type 1. Pediatrics. 2008, 122: 1182-1190. 10.1542/peds.2007-2144.

5. Kaplan P, Andersson HC, Kacena KA, Yee JD: The clinical and demographic characteristics of nonneuronopathic Gaucher disease in 887 children at diagnosis. Arch Pediatr Adolesc Med. 2006, 160: 603-608. 10.1001/archpedi.160.6.603.

6. Weinreb N, Taylor J, Cox T, Yee J, vom Dahl S: A benchmark analysis of the achievement of therapeutic goals for type 1 Gaucher disease patients treated with imiglucerase. Am J Hematol. 2008, 83: 890-895. 10.1002/ajh.21280.

7. Pastores GM, Weinreb NJ, Aerts H, Andria G, Cox TM, Giralt M, Grabowski GA, Mistry PK, Tylki-Szymanska A: Therapeutic goals in the treatment of Gaucher disease. Semin Hematol. 2004, 41: 4-14.

8. Cox TM: Eliglustat tartrate, an orally active glucocerebroside synthase inhibitor for the potential treatment of Gaucher disease and other lysosomal storage diseases. Curr Opin Investig Drugs. 2010, 11: 1169-1181.

9. Lukina E, Watman N, Arreguin EA, Dragosky M, Iastrebner M, Rosenbaum H, Phillips M, Pastores GM, Kamath RS, Rosenthal DI, Kaper M, Singh T, Puga AC, Peterschmitt MJ: Improvement in hematological, visceral, and skeletal manifestations of Gaucher disease type 1 with oral eliglustat tartrate (Genz-112638) treatment: 2-year results of a phase 2 study. Blood. 2010, 116: 4095-4098. 10.1182/blood-2010-06-293902.

10. Lukina E, Watman N, Arreguin EA, Banikazemi M, Dragosky M, Iastrebner M, Rosenbaum H, Phillips M, Pastores GM, Rosenthal DI, Kaper M, Singh T, Puga AC, Bonate PL, Peterschmitt MJ: A phase 2 study of eliglustat tartrate (Genz-112638), an oral substrate reduction therapy for Gaucher disease type 1. Blood. 2010, 116: 893-899. 10.1182/blood-2010-03-273151.

11. Wenstrup RJ, Kacena KA, Kaplan P, Pastores GM, Prakash-Cheng A, Zimran A, Hangartner TN: Effect of enzyme replacement therapy with imiglucerase on BMD in type 1 Gaucher disease. J Bone Miner Res. 2007, 22: 119-126.

12. Grabowski GA, Kacena K, Cole JA, Hollak CE, Zhang L, Yee J, Mistry PK, Zimran A, Charrow J, vom Dahl S: Dose-response relationships for enzyme replacement therapy with imiglucerase/alglucerase in patients with Gaucher disease type 1. Genet Med. 2009, 11: 92-100. 10.1097/GIM.0b013e31818e2c19.

13. Mistry PK, Deegan P, Vellodi A, Cole JA, Yeh M, Weinreb NJ: Timing of initiation of enzyme replacement therapy after diagnosis of type 1 Gaucher disease: effect on incidence of avascular necrosis. Br J Haematol. 2009, 147: 561-570. 10.1111/j.1365-2141.2009.07872.x.

14. Registries for Evaluating Patient Outcomes: A Users Guide. [http://www.ahrq.gov].

15. Charrow J, Andersson HC, Kaplan P, Kolodny EH, Mistry P, Pastores G, Rosenbloom BE, Scott CR, Wappner RS, Weinreb NJ, Zimran A: The Gaucher registry: demographics and disease characteristics of 1698 patients with Gaucher disease. Arch Intern Med. 2000, 160: 2835-2843. 10.1001/archinte.160.18.2835.

16. Grabowski GA, Barton NW, Pastores G, Dambrosia JM, Banerjee TK, McKee MA, Parker C, Schiffmann R, Hill SC, Brady RO: Enzyme therapy in type 1 Gaucher disease: comparative efficacy of mannose-terminated glucocerebrosidase from natural and recombinant sources. Ann Intern Med. 1995, 122: 33-39.

17. Cox TM, Aerts JM, Belmatoug N, Cappellini MD, vom Dahl S, Goldblatt J, Grabowski GA, Hollak CE, Hwu P, Maas M, Martins AM, Mistry PK, Pastores GM, Tylki-Szymanska A, Yee J, Weinreb N: Management of non-neuronopathic Gaucher disease with special reference to pregnancy, splenectomy, bisphosphonate therapy, use of biomarkers and bone disease monitoring. J Inherit Metab Dis. 2008, 31: 319-336. 10.1007/s10545-008-0779-z.

18. Deegan PB, Pavlova E, Tindall J, Stein PE, Bearcroft P, Mehta A, Hughes D, Wraith JE, Cox TM: Osseous manifestations of adult Gaucher disease in the era of enzyme replacement therapy. Medicine (Baltimore). 2011, 90: 52-60. 10.1097/MD.0b013e3182057be4.

19. Rothman K: Epidemiology: An Introduction. New York: Oxford University Press; 2002.

20. Strom B: Pharmacoepidemiology. 4 edition New Jersey: Wiley Publishers; 2005.

21. Maldonado G, Greenland S: Simulation study of confounder-selection strategies. Am J Epidemiol. 1993, 138: 923-936.

22. Schneeweiss S: Sensitivity analysis and external adjustment for unmeasured confounders in epidemiologic database studies of therapeutics. Pharmacoepidemiol Drug Saf. 2006, 15: 291-303. 10.1002/pds.1200.

23. Szklo M, Nieto F: Epidemiology: Beyond the Basics. Massachusetts: Jones and Bartlett; 2006.

24. STROBE Statements. [http://www.strobe-statement.org].

25. Baldwin LM, Dobie SA, Cai Y, Saver BG, Green PK, Wang CY: Receipt of general medical care by colorectal cancer patients: a longitudinal study. J Am Board Fam Med. 2011, 24: 57-68. 10.3122/jabfm.2011.01.100080.

26. Freedman MS: Long-term follow-up of clinical trials of multiple sclerosis therapies. Neurology. 2011, 76: S26-34.

27. Mansoor O, Chandar J, Rodriguez MM, Abitbol CL, Seeherunvong W, Freundlich M, Zilleruelo G: Long-term risk of chronic kidney disease in unilateral multicystic dysplastic kidney. Pediatr Nephrol. 2011, 26: 597-603. 10.1007/s00467-010-1746-0.

28. Mistry PK, Sadan S, Yang R, Yee J, Yang M: Consequences of diagnostic delays in type 1 Gaucher disease: the need for greater awareness among hematologists-oncologists and an opportunity for early diagnosis and intervention. Am J Hematol. 2007, 82: 697-701. 10.1002/ajh.20908.

29. Clinical Trials. [http://www.clinicaltrials.gov].

Acknowledgements and Funding

Robert Brown is a graphic artist employed by Genzyme, a Sanofi Company who imported the figure into a graphics program to produce the final figure submitted.

Andrea Gwosdow, Ph.D. was responsible for writing, editing, and managing the manuscript, and interpretation of data. This included managing author reviews and synthesizing the comments of each individual author into each draft of the manuscript. Andrea Gwosdow is a medical writer contracted by Genzyme, a Sanofi Company.

We would like to thank the patients with type 1 (non-neuronopathic) Gaucher disease and their physicians and health-care personnel who submit data to the Gaucher Registry, the Gaucher Registry support team at Genzyme Corporation, and Radhika Tripuraneni, MD, MPH.

Operational support of the ICGG Gaucher Registry is provided by Genzyme, a Sanofi Company.

Author information

Authors and affiliations.

Biomedical Data Sciences and Informatics, Genzyme, a Sanofi Company, 500 Kendall Street, Cambridge, MA, 02142, USA

J Alexander Cole & John S Taylor

Biomedical, Industrial & Human Factors Engineering, Wright State University, 3640 Col. Glenn Highway, 207 Russ Egr. Center, Dayton, OH, 45435, USA

Thomas N Hangartner

Northwest Oncology Hematology Associates PA, University Research Foundation for Lysosomal Storage Diseases, 8170 Royal Palm Boulevard, Coral Springs, FL, 33065, USA

Neal J Weinreb

Pediatric Gastroenterology and Hepatology Yale University School of Medicine, PO Box 208064, 333 Cedar Street; LMP 4093, New Haven, CT, 06520, USA

Pramod K Mistry

University of Calgary, Alberta Children's Hospital, 2888 Shaganappi Tr NW, 3rd Floor Metabolic Clinic, Calgary, Alberta, Canada

Aneal Khan

Corresponding author

Correspondence to Aneal Khan .

Additional information

Competing interests.

Aneal Khan, Pramod Mistry and Neal Weinreb receive honoraria and expense reimbursement for serving on the Board of Advisors of the ICGG Gaucher Registry; travel reimbursements and/or honoraria and/or research support from Genzyme, a Sanofi Company, Shire Pharmaceuticals, Amicus Therapeutics, and Actelion. Aneal Khan and Neal Weinreb do not hold any financial interest in any pharmaceutical company. Thomas Hangartner receives travel reimbursement and/or honoraria for speaking engagements from Genzyme, a Sanofi Company, and Shire Pharmaceuticals. John Taylor and J. Alexander Cole are employees of Genzyme, a Sanofi Company.

Aneal Khan, Pramod Mistry, Neal Weinreb and Thomas Hangartner did not receive funding for this study.

Authors' contributions

AK was responsible for the hypothesis, overall concept, analyses, and data interpretation. He wrote the first draft, edited, and oversaw the writing of the manuscript. The research hypothesis was developed as an independent research question prior to joining the ICGG. AK presented a research request to the ICGG Gaucher Registry in order to test his hypothesis.

TH assisted in hypothesis development, data interpretation, and editing the manuscript.

JAC was primarily responsible for the overall epidemiologic design and statistical analyses, including the overall concept, data interpretation, and drafting and editing the manuscript.

JST was primarily responsible for the overall statistical analyses, including the data interpretation, and drafting and editing the manuscript.

PKM assisted in hypothesis development, editing the manuscript, and interpretation of data.

NJW assisted in hypothesis development, writing and editing the manuscript, and interpretation of data.

All authors read and approved the final manuscript.

Rights and permissions.

Open Access: This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article.

Cole, J.A., Taylor, J.S., Hangartner, T.N. et al. Reducing selection bias in case-control studies from rare disease registries. Orphanet J Rare Dis 6 , 61 (2011). https://doi.org/10.1186/1750-1172-6-61

Received : 04 July 2011

Accepted : 12 September 2011

Published : 12 September 2011

DOI : https://doi.org/10.1186/1750-1172-6-61

  • Gaucher Disease
  • Alglucerase
  • Imiglucerase
  • Genzyme Corporation

Orphanet Journal of Rare Diseases

ISSN: 1750-1172

Handbook of Epidemiology, pp 293–323

Case-Control Studies

  • Norman E. Breslow 3  
  • Reference work entry

The case-control study examines the association between disease and potential risk factors by taking separate samples of diseased cases and of controls at risk of developing disease. Information may be collected for both cases and controls on genetic, social, behavioral, environmental, or other determinants of disease risk. The basic study design has a long history, extending back at least to Guy’s 1843 comparison of the occupations of men with pulmonary consumption to the occupations of men having other diseases (Lilienfeld and Lilienfeld 1979). Beginning in the 1920s, it was used to link cancer to environmental and hormonal exposures. Broders (1920) discovered an association between pipe smoking and lip cancer; Lane-Claypon (1926), who selected matched hospital controls, investigated the relationship between reproductive experience and female breast cancer; and Lombard and Doering (1928) related pipe smoking to oral cancer. The publication in 1950 of three reports on the association between cigarette smoking and lung cancer generated enormous interest in case-control methodology as well as bitter criticism (Levin et al. 1950; Wynder and Graham 1950; Doll and Hill 1950). The landmark study of Doll and Hill (1950, 1952), in particular, inspired future generations of epidemiologists to use this methodology. It remains to this day a model for the design and conduct of case-control studies, with excellent suggestions on how to reduce or eliminate selection, interview, and recall bias.

  • Incidence Rate Ratio
  • Random Digit Dialing
  • Control Selection
  • Disease Incidence Rate
  • Odds Ratio Estimator

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Aird L, Bentall HH, Roberts JAF (1953) A relationship between cancer of stomach and the ABO blood groups. Br Med J 1:799–801

Andrieu N, Goldstein AM (1998) Epidemiologic and genetic approaches in the study of gene-environment interaction: an overview of available methods. Epidemiol Rev 20:137–147

Armenian HK, Lilienfeld DE (1994) Overview and historical perspective. Epidemiol Rev 16:1–5

Armstrong RW, Armstrong MJ, Yu MC, Henderson BE (1983) Salted fish and inhalants as risk-factors for nasopharyngeal carcinoma in Malaysian Chinese. Cancer Res 43:2967–2970

Austin H, Hill HA, Flanders WD, Greenberg RS (1994) Limitations in the application of case-control methodology. Epidemiol Rev 16:65–76

Barlow WE (1994) Robust variance estimation for the case-cohort design. Biometrics 50:1064–1072

Benichou J, Gail MH (1995) Methods of inference for estimates of absolute risk derived from population-based case-control studies. Biometrics 51:182–194

Berkson J (1946) Limitations of the application of fourfold table analysis to hospital data. Biom Bull 2:47–53

Breslow N (1982) Design and analysis of case-control studies. Annu Rev Public Health 3:29–54

Breslow NE (1996) Statistics in epidemiology: the case-control study. J Am Stat Assoc 91:14–28

Breslow NE (2003) Are statistical contributions to medicine undervalued? Biometrics 59:1–8

Breslow NE, Cain KC (1988) Logistic regression for two-stage case-control data. Biometrika 75:11–20

Breslow NE, Chatterjee N (1999) Design and analysis of two-phase studies with binary outcomes applied to Wilms tumor prognosis. Appl Stat 48:457–468

Breslow NE, Day NE (1980) Statistical methods in cancer research I: the analysis of case-control studies. International Agency for Research on Cancer, Lyon

Breslow NE, Lubin JH, Marek P, Langholz B (1983) Multiplicative models and cohort analysis. J Am Stat Assoc 78:1–12

Broders AC (1920) Squamous-cell epithelioma of the lip. A study of five hundred and thirty-seven cases. J Am Med Assoc 74:656–664

Carroll RJ, Ruppert D, Stefanski LA (1995) Measurement error in nonlinear models. Chapman and Hall, London

Chase G, Klauber MR (1965) A graph of sample sizes for retrospective studies. Am J Public Health 55:1993–1996

Cole P (1979) The evolving case-control study. J Chronic Dis 32:15–27

Comstock GW (1994) Evaluating vaccination effectiveness and vaccine efficacy by means of case-control studies. Epidemiol Rev 16:77–89

Cornfield J (1951) A method of estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix. J Natl Cancer Inst 11:1269–1275

Correa A, Stewart WF, Yeh HC, Santos-Burgoa C (1994) Exposure measurement in case-control studies: reported methods and recommendations. Epidemiol Rev 16:18–32

Cox DR (1972) Regression models and life-tables (with discussion). J R Stat Soc (Ser B) 34:187–220

Daling JR, Weiss NS, Metch BJ, Chow WH, Soderstrom RM, Stadel BV (1985) Primary tubal infertility in relation to the use of an intrauterine-device. N Engl J Med 312:937–941

Doll R, Hill AB (1950) Smoking and carcinoma of the lung. Preliminary report. Br Med J 2:739–748

Doll R, Hill AB (1952) A study of the aetiology of carcinoma of the lung. Br Med J 2:1271–1286

Dorn HF (1959) Some problems arising in prospective and retrospective studies of the etiology of disease. N Engl J Med 261:571–579

Fleming PJ, Gilbert R, Azaz Y, Berry PJ, Rudd PT, Stewart A, Hall E (1990) Interaction between bedding and sleeping position in the sudden-infant-death-syndrome – a population based case-control study. Br Med J 301:85–89

Gordis L (1982) Should dead cases be matched to dead controls? Am J Epidemiol 115:1–5

Graubard BI, Fears TR, Gail MH (1989) Effects of cluster sampling on epidemiologic analysis in population-based case-control studies (Corr: V47 p. 779–780). Biometrics 45:1053–1071

Greenberg ER (1990) Random digit dialing for control selection – a review and a caution on its use in studies of childhood-cancer. Am J Epidemiol 131:1–5

Greenland S (1987) Estimation of exposure-specific rates from sparse case-control data. J Chronic Dis 40:1087–1094

Greenland S, Robins JM (1985) Confounding and misclassification. Am J Epidemiol 122:495–506

Greenland S, Thomas DC (1982) On the need for the rare disease assumption in case-control studies. Am J Epidemiol 116:547–553

Harlow BL, Davis S (1988) Two one-step methods for household screening and interviewing using random digit dialing. Am J Epidemiol 127:857–863

Henderson MM, Kushi LH, Thompson DJ, Gorbach SL, Clifford CK, Thompson RS (1990) Feasibility of a randomized trial of a low-fat diet for the prevention of breast-cancer – dietary compliance in the womens health trial vanguard study. Prev Med 19:115–133

Herbst AL, Ulfelder H, Poskanzer DC (1971) Adenocarcinoma of the vagina. N Engl J Med 284:878–881

Hill AB (1953) Observation and experiment. N Engl J Med 248:995–1001

Hill AB (1965) The environment and disease: association or causation? Proc R Soc Med 58:295–300

Hill AB (1971) Principles of medical statistics. Oxford University Press, New York

Horvitz DG, Thompson DJ (1952) A generalization of sampling without replacement from a finite universe. J Am Stat Assoc 47:663–685

Hsieh DA, Manski CF, McFadden D (1985) Estimation of response probabilities from augmented retrospective observations. J Am Stat Assoc 80:651–662

Ibrahim MA, Spitzer WO (1979) The case-control study: the problem and the prospect. J Chronic Dis 32:139–144

Jablon S, Neel JV, Gershowitz H, Atkinson GF (1967) The NAS-NRC twin panel: methods of construction of the panel, zygosity diagnosis, and proposed use. Am J Human Genet 19:133–161

Kelsey JL, Whittemore AS, Evans AS, Thompson WD (1996) Methods in observational epidemiology, 2nd edn. Oxford University Press, New York

Khoury MJ, Beaty TH (1994) Applications of the case-control method in genetic epidemiology. Epidemiol Rev 16:134–150

Kupper LL, McMichael AJ, Spirtas R (1975) A hybrid epidemiologic study design useful in estimating relative risk. J Am Stat Assoc 70:524–528

Lane-Claypon JE (1926) A further report on cancer of the breast. Her Majesty’s Stationery Office, London

Langholz B, Borgan O (1995) Counter-matching: a stratified nested case-control sampling method. Biometrika 82:69–79

Langholz B, Borgan O (1997) Estimation of absolute risk from nested case-control data. Biometrics 53:767–774

Langholz B, Goldstein L (1996) Risk set sampling in epidemiologic cohort studies. Stat Sci 11:35–53

Levin ML, Goldstein H, Gerhardt PR (1950) Cancer and tobacco smoking. A preliminary report. J Am Med Assoc 143:336–338

Liddell FDK, McDonald JC, Thomas DC (1977) Methods of cohort analysis: appraisal by application to asbestos mining. J R Stat Soc (Ser A) 140:469–491

Lilienfeld AM, Lilienfeld DE (1979) A century of case-control studies: progress? J Chronic Dis 32:5–13

Lin DY, Ying Z (1993) Cox regression with incomplete covariate measurements. J Am Stat Assoc 88:1341–1349

Little RJA, Rubin DB (2002) Statistical analysis with missing data, 2nd edn. Wiley, New York

Lombard HL, Doering CR (1928) Cancer studies in Massachusetts. 2. Habits, characteristics and environment of individuals with and without cancer. N Engl J Med 198:481–487

MacMahon B, Cole P, Lin TM, Lowe CR, Mirra AP, Ravnihar B, Salber EJ, Valaoras VG, Yuasa S (1970) Age at first birth and breast cancer risk. Bull World Health Org 43:209–221

Mantel N, Haenszel W (1959) Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst 22:719–748

Miettinen OS (1970) Matching and design efficiency in retrospective studies. Am J Epidemiol 91:111–118

Miettinen O (1976) Estimability and estimation in case-referent studies. Am J Epidemiol 103:226–235

Miettinen O (1982) Design options in epidemiologic research: an update. Scand J Work Environ Health 8:7–14

Miettinen OS (1985) Theoretical epidemiology: principles of occurrence research in medicine. Wiley, New York

Neutra RR, Drolette ME (1978) Estimating exposure-specific disease rates from case-control studies using Bayes theorem. Am J Epidemiol 108:214–222

Neyman J (1955) Statistics – servant of all sciences. Science 122:401–406

O’Neil MJ (1979) Estimating the nonresponse bias due to refusals in telephone surveys. Public Opin Q 43:218–232

Poole C (1987) Critical appraisal of the exposure-potential restriction rule. Am J Epidemiol 125:179–183

Prentice RL (1986) A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73:1–11

Prentice RL (1996) Measurement error and results from analytic epidemiology: dietary fat and breast cancer. J Natl Cancer Inst 88:1738–1747

Prince AM, Szmuness W, Michon J, Demaille J, Diebolt G, Linhard J, Quenum C, Sankale M (1975) A case-control study of the association between primary liver cancer and hepatitis B infection in Senegal. Int J Cancer 16:376–383

Robins J, Pike M (1990) The validity of case-control studies with nonrandom selection of controls. Epidemiology 1:273–284

Robins JM, Gail MH, Lubin JH (1986) More on ‘biased selection of controls for case-control analyses of cohort studies’. Biometrics 42:293–29

Robison LL, Daigle A (1984) Control selection using random digit dialing for cases of childhood cancer. Am J Epidemiol 120:164–165

Rodrigues L, Kirkwood BR (1990) Case-control designs in the study of common diseases: updates on the demise of the rare disease assumption and the choice of sampling scheme for controls. Int J Epidemiol 19:205–213

Rosenbaum PR (1987) The role of a second control group in an observational study (with discussion). Stat Sci 2:292–316

Rothman KJ (1986) Modern epidemiology. Little, Brown, Boston

Rothman KJ, Greenland S (1998) Modern epidemiology, 2nd edn. Lippincott-Raven, Philadelphia

Schlesselman JJ (1982) Case-control studies. Oxford University Press, New York

Schlesselman JJ, Stadel BV (1987) Exposure opportunity in epidemiologic studies. Am J Epidemiol 125:174–178

Sheehe PR (1962) Dynamic risk analysis in retrospective matched pair studies of disease. Biometrics 18:323–341

Smith DC, Prentice R, Thompson DJ, Herrmann W (1975) Association of exogenous estrogen and endometrial carcinoma. N Engl J Med 293:1164–1167

Smith PG, Day NE (1984) The design of case-control studies: the influence of confounding and interaction effects. Int J Epidemiol 13:356–365

Smith PG, Rodrigues LC, Fine PEM (1984) Assessment of the protective efficacy of vaccines against common diseases using case-control and cohort studies. Int J Epidemiol 13:87–93

Taubes G (1995) Epidemiology faces its limits. Science 269:164–169

Thomas DC, Greenland S (1983) The relative efficiencies of matched and independent sample designs for case-control studies. J Chronic Dis 36:685–697

Tuyns AJ, Péquignot G, Jensen OM (1977) Le cancer de l’oesophage en Ille-et-Vilaine en fonction des niveaux de consommation d’alcool et de tabac. Bull Cancer 64:45–60

Wacholder S (1995) Design issues in case-control studies. Stat Methods Med Res 4:293–309

Wacholder S, McLaughlin JK, Silverman DT, Mandel JS (1992a) Selection of controls in case-control studies I. Principles. Am J Epidemiol 135:1019–1028

Wacholder S, Silverman DT, McLaughlin JK, Mandel JS (1992b) Selection of controls in case-control studies II. Types of controls. Am J Epidemiol 135:1029–1041

Waksberg J (1978) Sampling methods for random digit dialing. J Am Stat Assoc 73:40–46

Weinberg CR, Wilcox AJ (1998) Reproductive epidemiology. In: Rothman KJ, Greenland S (eds) Modern epidemiology, 2nd edn., Chap. 29. Lippincott-Raven, Philadelphia, pp 585–608

Weiss NS (1994) Application of the case-control method in the evaluation of screening. Epidemiol Rev 16:102–108

Weiss NS (2002) Can the ‘specificity’ of an association be rehabilitated as a basis for supporting a causal hypothesis? Epidemiology 13:6–8

White JE (1982) A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 115:119–128

Wynder EL, Graham EA (1950) Tobacco smoking as a possible etiologic factor in bronchogenic carcinoma. A study of six hundred and eighty-four proved cases. J Am Med Assoc 143:329–336

Yu MC, Ho JHC, Lai SH, Henderson BE (1986) Cantonese-style salted fish as a cause of nasopharyngeal carcinoma – report of a case-control study in Hong-Kong. Cancer Res 46:956–961

Ziel HK, Finkle WD (1975) Increased risk of endometrial carcinoma among users of conjugated estrogens. N Engl J Med 293:1167–1170


I am indebted to Sander Greenland, Noel Weiss, the editors and an anonymous referee for helpful comments on an earlier draft. This work was supported in part by grant R01-CA40644 from the US Public Health Service.

Author information

Authors and affiliations.

Department of Biostatistics, University of Washington, 357232, 98155-7232, Seattle, WA, USA

Norman E. Breslow

Editor information

Editors and affiliations.

Department of Epidemiological Methods and Etiologic Research, Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany

Wolfgang Ahrens

Department of Biometry and Data Management, Leibniz Institute for Prevention Research and Epidemiology – BIPS, Bremen, Germany

Iris Pigeot

Copyright information

© 2014 Springer Science+Business Media New York

About this entry

Cite this entry.

Breslow, N.E. (2014). Case-Control Studies. In: Ahrens, W., Pigeot, I. (eds) Handbook of Epidemiology. Springer, New York, NY. https://doi.org/10.1007/978-0-387-09834-0_7

DOI : https://doi.org/10.1007/978-0-387-09834-0_7

Publisher Name : Springer, New York, NY

Print ISBN : 978-0-387-09833-3

Online ISBN : 978-0-387-09834-0

eBook Packages : Medicine Reference Module Medicine

Reducing selection bias in case-control studies from rare disease registries


  • 1 Biomedical Data Sciences and Informatics, Genzyme, a Sanofi Company, 500 Kendall Street, Cambridge, MA 02142, USA.
  • PMID: 21910867
  • PMCID: PMC3200984
  • DOI: 10.1186/1750-1172-6-61

Background: In clinical research of rare diseases, where small patient numbers and disease heterogeneity limit study design options, registries are a valuable resource for demographic and outcome information. However, in contrast to prospective, randomized clinical trials, the observational design of registries is prone to introduce selection bias and negatively impact the validity of data analyses. The objective of the study was to demonstrate the utility of case-control matching and the risk-set method in order to control bias in data from a rare disease registry. Data from the International Collaborative Gaucher Group (ICGG) Gaucher Registry were used as an example.

Methods: A case-control matching analysis using the risk-set method was conducted to identify two groups of patients with type 1 Gaucher disease in the ICGG Gaucher Registry: patients with avascular osteonecrosis (AVN) and those without AVN. The frequency distributions of gender, decade of birth, treatment status, and splenectomy status were presented for cases and controls before and after matching. Odds ratios (and 95% confidence intervals) were calculated for each variable before and after matching.
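The odds ratio and 95% confidence interval mentioned in the Methods can be illustrated with a minimal sketch. It assumes a simple 2x2 table and the Woolf (log-scale Wald) interval; the abstract does not state which interval method the authors used, and the function name and counts below are hypothetical, not registry data.

```python
import math


def odds_ratio_ci(exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls, z=1.96):
    """Return (odds ratio, lower, upper) using the Woolf log-scale interval.

    Assumption: an unmatched 2x2 table with no empty cells. A matched
    analysis would normally use a conditional method instead.
    """
    a, b, c, d = (exposed_cases, unexposed_cases,
                  exposed_controls, unexposed_controls)
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for this estimator")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts: the variable is equally common in cases and controls.
estimate, lower, upper = odds_ratio_ci(30, 70, 30, 70)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An odds ratio near 1.00 with an interval spanning 1.00 is the pattern the abstract describes for the matching variables after matching.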

Results: The application of case-control matching methodology results in cohorts of cases (i.e., patients with AVN) and controls (i.e., patients without AVN) who have comparable distributions for four common parameters used in subject selection: gender, year of birth (age), treatment status, and splenectomy status. Matching resulted in odds ratios of approximately 1.00, indicating no bias.

Conclusions: We demonstrated bias in case-control selection in subjects from a prototype rare disease registry and used case-control matching to minimize this bias. Therefore, this approach appears useful to study cohorts of heterogeneous patients in rare disease registries.
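As a rough illustration of risk-set matching, the sketch below draws controls for each case from patients who are still under observation and event-free at the time the case occurs, and who share the case's values on the matching variables. It is not the authors' implementation: the record layout, field names (`event_time`, `followup_end`), and the two matching keys shown are assumptions chosen for brevity, and treatment status and splenectomy status could be added as further keys.

```python
import random


def risk_set_match(patients, match_keys, controls_per_case=1, seed=0):
    """Map each case id to a list of control ids drawn from its risk set.

    Hypothetical record layout: each patient is a dict with "id",
    "event_time" (None if the event never occurred), "followup_end",
    and one entry per matching key.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    cases = sorted((p for p in patients if p["event_time"] is not None),
                   key=lambda p: p["event_time"])
    matched = {}
    for case in cases:
        t = case["event_time"]
        risk_set = [
            p for p in patients
            if p["id"] != case["id"]
            and p["followup_end"] >= t  # still under observation at time t
            and (p["event_time"] is None or p["event_time"] > t)  # event-free at t
            and all(p[k] == case[k] for k in match_keys)
        ]
        k = min(controls_per_case, len(risk_set))
        matched[case["id"]] = [p["id"] for p in rng.sample(risk_set, k)]
    return matched


# Hypothetical patients; times are ages in years.
patients = [
    {"id": 1, "sex": "F", "birth_decade": 1960, "event_time": 30, "followup_end": 30},
    {"id": 2, "sex": "F", "birth_decade": 1960, "event_time": None, "followup_end": 45},
    {"id": 3, "sex": "F", "birth_decade": 1960, "event_time": 40, "followup_end": 40},
    {"id": 4, "sex": "M", "birth_decade": 1960, "event_time": None, "followup_end": 50},
    {"id": 5, "sex": "F", "birth_decade": 1960, "event_time": None, "followup_end": 25},
]
matched = risk_set_match(patients, match_keys=("sex", "birth_decade"))
print(matched)
```

In this sketch a patient who later becomes a case can serve as a control for an earlier case, and the same control may be drawn for more than one case. Both are common features of risk-set sampling, but whether they apply to the registry analysis is not stated in the abstract.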

Publication types

  • Evaluation Study

MeSH terms

  • Aged, 80 and over
  • Case-Control Studies
  • Child, Preschool
  • Cooperative Behavior
  • Gaucher Disease / epidemiology*
  • Gaucher Disease / physiopathology
  • Infant, Newborn
  • Internationality
  • Middle Aged
  • Osteonecrosis / epidemiology*
  • Osteonecrosis / physiopathology
  • Patient Selection*
  • Rare Diseases / epidemiology*
  • Rare Diseases / physiopathology
  • Registries*
  • Research Design*
  • Selection Bias
  • Young Adult

What is a case-control study in medical research?

A case-control study is a type of medical research investigation often used to help determine the cause of a disease, particularly when investigating a disease outbreak or rare condition.

If public health scientists want a quick and easy way to highlight clues about the cause of a new disease outbreak, they can compare two groups of people: Cases, the term for people who already have the disease, and controls, or people not affected by the disease.

Other terms used to describe case-control studies include epidemiological, retrospective, and observational.

What is a case-control study?

A case-control study is a way of carrying out a medical investigation to confirm or indicate what is likely to have caused a condition.

Case-control studies are usually retrospective, meaning that the researchers look at past data to test whether a particular outcome can be linked back to a suspected risk factor. The findings can then help to prevent further outbreaks.

Prospective case-control studies are less common. These involve enrolling a specific selection of people and following that group while monitoring their health. Cases emerge as people who develop the disease or condition under investigation as the study progresses. Those unaffected by the disease form the control group.

To test for specific causes, the scientists need to create a hypothesis about possible causes of the outbreak or disease. These are known as risk factors.

They compare how often the people in the group of cases had been exposed to the suspected cause against how often members of the control group had been exposed. If a larger proportion of the case group than of the control group was exposed to the risk factor, this suggests that it may be a cause of the disease.

Researchers might also uncover likely risk factors not mentioned in their hypothesis by studying the medical and personal histories of the people in each group. A pattern may emerge that links the condition to certain factors.

If a specific risk factor has already been identified for a disease or condition, such as age, sex, smoking, or eating red meat, the researchers can use statistical methods to adjust the study to account for that risk factor, helping them to identify other possible risk factors more easily.
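One common way to adjust for a known risk factor is to split the data into strata of that factor and pool the stratum-specific results. The sketch below uses the Mantel-Haenszel pooled odds ratio as an example of such a method; the article does not name a specific technique, and the counts (a suspected exposure, stratified by smoking) are hypothetical.

```python
def crude_odds_ratio(strata):
    """Odds ratio after collapsing all strata into a single 2x2 table."""
    a = sum(s[0] for s in strata)  # exposed cases
    b = sum(s[1] for s in strata)  # unexposed cases
    c = sum(s[2] for s in strata)  # exposed controls
    d = sum(s[3] for s in strata)  # unexposed controls
    return (a * d) / (b * c)


def mantel_haenszel_odds_ratio(strata):
    """Pooled odds ratio across strata of a known risk factor.

    Each stratum is (exposed cases, unexposed cases,
    exposed controls, unexposed controls).
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("pooled odds ratio is undefined for these strata")
    return numerator / denominator


# Hypothetical counts: the exposure is more common among smokers,
# and smokers are more likely to be cases.
strata = [
    (40, 10, 20, 5),   # smokers
    (5, 20, 15, 60),   # non-smokers
]
crude = crude_odds_ratio(strata)
adjusted = mantel_haenszel_odds_ratio(strata)
print(f"crude OR = {crude:.2f}, smoking-adjusted OR = {adjusted:.2f}")
```

In these made-up numbers the crude odds ratio is well above 1, but within each smoking stratum the exposure is unrelated to the disease, so the adjusted estimate is 1.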

Case-control research is a vital tool used by epidemiologists, or researchers who look into the factors affecting health and illness of populations.

Just one risk factor could be investigated for a particular outcome. A good example of this is to compare the proportion of people with lung cancer who have a history of smoking with the proportion of people without lung cancer who do. This will indicate the link between lung cancer and smoking.

Why is it useful?

There are multiple reasons for the use of case-control studies.

Relatively quick and easy

Case-control studies are usually based on past data, so all of the necessary information is readily available, making them quick to carry out. Scientists can analyze existing data to look at health events that have already happened and risk factors that have already been observed.

A retrospective case-control study does not require scientists to wait and see what happens in a trial over a period of days, weeks, or years.

The fact that the data is already available for collation and analysis means that a case-control study is useful when quick results are desired, perhaps when clues are sought for what is causing a sudden disease outbreak.

A prospective case-control study may also be helpful in this scenario as researchers can collect data on suspected risk factors while they monitor for new cases.

The time-saving advantage offered by case-control studies also means they are more practical than other scientific trial designs if the exposure to a suspected cause occurs a long time before the outcome of a disease.

For example, if you wanted to test the hypothesis that a disease seen in adulthood is linked to factors occurring in young children, a prospective study would take decades to carry out. A case-control study is a far more feasible option.

Does not need large numbers of people

Numerous risk factors can be evaluated in case-control studies since they do not require large numbers of participants to be statistically meaningful. More resources can be dedicated to the analysis of fewer people.

Overcomes ethical challenges

As case-control studies are observational and usually about people who have already experienced a condition, they do not pose the ethical problems seen with some interventional studies.

For example, it would be unethical to deprive a group of children of a potentially lifesaving vaccine to see who developed the associated disease. However, analyzing a group of children with limited access to that vaccine can help determine who is at most risk of developing the disease, as well as helping to guide future vaccination efforts.

Limitations


While a case-control study can help to test a hypothesis about the link between a risk factor and an outcome, it is not as powerful as other types of study in confirming a causal relationship.

Case-control studies are often used to provide early clues and inform further research using more rigorous scientific methods.

The main problem with case-control studies is that they are not as reliable as planned studies that record data in real time, because they look into data from the past.

The main limitations of case-control studies are:

‘Recall bias’

When people answer questions about their previous exposure to certain risk factors, their ability to recall may be unreliable. Compared to people not affected by a condition, individuals with a certain disease outcome may be more likely to recall a certain risk factor, even if it did not exist, because of a temptation to make their own subjective links to explain their condition.
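The effect of this kind of over-reporting can be shown with simple arithmetic. The sketch below assumes a deliberately simplified model in which a fixed fraction of truly unexposed cases wrongly report exposure while controls report accurately; the function name and all counts are hypothetical.

```python
def observed_odds_ratio(exposed_cases, unexposed_cases,
                        exposed_controls, unexposed_controls,
                        false_recall_in_cases):
    """Odds ratio seen when some unexposed cases wrongly report exposure.

    false_recall_in_cases is the assumed fraction of truly unexposed
    cases who report having been exposed. Controls are assumed to
    report accurately, which is a simplification.
    """
    shifted = unexposed_cases * false_recall_in_cases
    reported_exposed = exposed_cases + shifted
    reported_unexposed = unexposed_cases - shifted
    return ((reported_exposed * unexposed_controls)
            / (reported_unexposed * exposed_controls))


# Hypothetical truth: 30 of 100 cases and 30 of 100 controls were exposed,
# so the true odds ratio is 1.
true_or = observed_odds_ratio(30, 70, 30, 70, false_recall_in_cases=0.0)
biased_or = observed_odds_ratio(30, 70, 30, 70, false_recall_in_cases=0.2)
print(f"true OR = {true_or:.2f}, observed OR with recall bias = {biased_or:.2f}")
```

Under these assumptions, recall error alone produces an apparent association even though exposure is equally common in both groups.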

This bias may be reduced if data about the risk factors – exposure to certain drugs, for example – had been entered into reliable records at the time. But this may not be possible for lifestyle factors, for example, because they are usually investigated by questionnaire.

An example of recall bias is the difference between asking study participants to recall the weather at the time of the onset of a certain symptom, versus an analysis of scientifically measured weather patterns around the time of a formal diagnosis.

Finding a measurement of exposure to a risk factor in the body is another way of making case-control studies more reliable and less subjective. Such measurements are known as biomarkers. For example, researchers may look at results of blood or urine tests for evidence of a specific drug, rather than asking a participant about drug use.

Cause and effect

An association found between a disease and a possible risk does not necessarily mean one factor directly caused the other.

In fact, a retrospective study can never definitively prove that a link represents a cause, as it is not an experiment. There are, though, questions that can be used to test the likelihood of a causal relationship, such as how strong the association is or whether there is a ‘dose response’ to increasing exposure to the risk factor.
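A dose response can be checked by computing an odds ratio for each exposure level against the unexposed group and seeing whether the estimates rise with exposure. The sketch below is illustrative only: the level names and counts are hypothetical, and a real analysis would also test the trend formally.

```python
def odds_ratios_by_level(counts, reference):
    """Odds ratio for each exposure level relative to a reference level.

    counts maps level name -> (number of cases, number of controls).
    """
    ref_cases, ref_controls = counts[reference]
    return {
        level: (cases * ref_controls) / (controls * ref_cases)
        for level, (cases, controls) in counts.items()
    }


# Hypothetical counts, ordered from no exposure to high exposure.
counts = {
    "none": (20, 80),
    "low": (30, 60),
    "high": (50, 40),
}
ratios = odds_ratios_by_level(counts, reference="none")
values = list(ratios.values())
# True when each level's odds ratio exceeds that of the level before it.
rises_with_exposure = all(x < y for x, y in zip(values, values[1:]))
print(ratios, rises_with_exposure)
```

A steadily rising set of odds ratios supports, but does not prove, a causal interpretation.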

One way of illustrating the limitations of cause-and-effect is to look at associations found between a cultural factor and a particular health effect. The cultural factor itself, such as a certain type of exercise, may not be causing the outcome if the same cultural group of cases shares another plausible common factor, such as a certain food preference.

Some risk factors are linked to others. Researchers have to take into account overlaps between risk factors, such as leading a sedentary lifestyle, being depressed, and living in poverty.

If researchers conducting a retrospective case-control study find an association between depression and weight gain over time, for example, they cannot say with any certainty that depression is a risk factor for weight gain without bringing in a control group containing people who follow a sedentary lifestyle.

‘Sampling bias’

The cases and controls selected for study may not truly represent the disease under investigation.

An example of this occurs when cases are seen in a teaching hospital, a highly specialized setting compared with most settings in which the disease may occur. The controls, too, may not be typical of the population. People volunteering their data for the study may have a particularly high level of health motivation.

Other limitations

There are other limitations to case-control studies. While they are good for studying rare conditions, as they do not require large groups of participants, they are less useful for examining rare risk factors, which are more clearly indicated by cohort studies.

Finally, case-control studies cannot distinguish between different levels or types of the disease being investigated. They can look at only one outcome because participants are classified as cases or controls according to whether or not they have the condition.

Last medically reviewed on May 16, 2018

  • Public Health
  • Clinical Trials / Drug Trials
  • Pharma Industry / Biotech Industry

How we reviewed this article:

  • Introduction to study designs – case-control studies. (n.d.) https://www.healthknowledge.org.uk/e-learning/epidemiology/practitioners/introduction-study-design-ccs
  • Mann, C. J. (2003). Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emergency Medicine Journal, 20, 54-60 http://emj.bmj.com/content/emermed/20/1/54.full.pdf
  • Prospective vs. retrospective studies. (n.d.) https://www.statsdirect.com/help/default.htm#basics/prospective.htm


  • Open access
  • Published: 05 December 2023

Prostate cancer genetic risk and associated aggressive disease in men of African ancestry

  • Pamela X. Y. Soh   ORCID: orcid.org/0000-0002-8485-6556 1 ,
  • Naledi Mmekwa 2 ,
  • Desiree C. Petersen   ORCID: orcid.org/0000-0002-0817-2574 3 ,
  • Kazzem Gheybi 1 ,
  • Smit van Zyl 4 ,
  • Jue Jiang   ORCID: orcid.org/0000-0003-0920-8310 1 ,
  • Sean M. Patrick 2 ,
  • Raymond Campbell 5 ,
  • Weerachai Jaratlerdseri   ORCID: orcid.org/0000-0001-9100-1807 1 ,
  • Shingai B. A. Mutambirwa 6 ,
  • M. S. Riana Bornman   ORCID: orcid.org/0000-0003-3975-2333 2 &
  • Vanessa M. Hayes   ORCID: orcid.org/0000-0002-4524-7280 1 , 2 , 4 , 7  

Nature Communications volume 14, Article number: 8037 (2023)

  • Cancer genetics
  • Genetics research
  • Prostate cancer

African ancestry is a significant risk factor for prostate cancer and advanced disease. Yet, genetic studies have largely been conducted outside the context of Sub-Saharan Africa, identifying 278 common risk variants contributing to a multiethnic polygenic risk score, with rare variants focused on a panel of roughly 20 pathogenic genes. Based on this knowledge, we are unable to determine polygenic risk or differentiate prostate cancer status when interrogating whole genome data for 113 Black South African men. To further assess for potentially functional common and rare variant associations, here we interrogate 247,870 exomic variants for 798 Black South African men using a case versus control or aggressive versus non-aggressive study design. Notable genes of interest include HCP5, RFX6 and H3C1 for risk, and MKI67 and KLF5 for aggressive disease. Our study highlights the need for further inclusion across the African diaspora to establish African-relevant risk models aimed at reducing prostate cancer health disparities.


Prostate cancer (PCa) is characterised by substantial ancestral disparity 1 , which, together with significant heritability 2 , suggests an inherited genetic contribution. Specifically, men of African ancestry are at greatest risk, with African Americans 1.7 times more likely to receive a diagnosis and 2.1 times more likely to die from PCa than White American men 3 . Within Sub-Saharan Africa, mortality rates are up to 3.2 times greater than global estimates 4 . For southern Africa, PCa incidence rates largely reflect those reported for Northern America; however, mortality rates are 2.7-fold greater (age-standardised mortality rate of 22 per 100,000 men) 4 . Previously, we (and others) have demonstrated that southern Africa is not only home to genetically diverse populations 5 , but that Black South African men are at 2.1-fold increased risk for advanced PCa at presentation compared with African Americans (adjusted for age) 6 . Despite this, there is a notable lack of genetic data for populations across Africa.

As of 8 February 2023, the National Human Genome Research Institute-European Bioinformatics Institute (NHGRI-EBI) genome-wide association study (GWAS) Diversity Monitor reports that while Europeans contribute 95.85% to global GWAS, African Americans or Afro-Caribbeans contribute 0.49%, and Sub-Saharan Africans only 0.15% 7 . PCa GWAS research for Sub-Saharan Africa is no different, with further limitations including a focus on specific variants and study power 8 , 9 . As outlined in Fig.  1A , studies have been restricted to Ghana (474 cases, 458 controls; 2,837,019 variants) 10 , Uganda (560 cases, 480 controls; 118 known PCa risk loci and 17,125,421 imputed variants) 11 , South Africa (552 cases, 315 controls; 46 known PCa risk alleles) 12 , and a single study (MADCaP) that merged data from Ghana, Nigeria, and South Africa (399 cases, 403 controls; >1.5 million custom markers) 13 . Furthermore, studies associating rare variants with PCa pathogenesis within Sub-Saharan Africa are equally scarce. While a single study from Uganda associated pathogenic BRCA2, ATM, PALB2 and NBN variants with aggressive prostate disease 14 , pathogenic variants for BRCA2, ATM, CHEK2 , and TP53 , and rare early-onset/familial oncogenic variants of unknown pathogenicity for BRCA2, FANCA and RAD51C , were linked to advanced disease in Black South African men 15 . The latter study highlights a maximum 30% utility for current European-biased PCa germline testing gene panels for men from southern Africa.

Fig. 1

A Map showing populations across Africa where GWAS have been conducted for prostate cancer, and sample locations and sizes of African ancestry from the Human Genome Diversity Project (HGDP) and 1000 Genomes Project (1KGP) subset of gnomAD v3.1.2 67 (red circles). Mortality rates of prostate cancer from GLOBOCAN 2020 are shown in the bottom left bar plot, with the population size of men in the region indicated in brackets below the region name 4 . Study references: a Cook et al. 10 , b Harlemon et al. 13 , c Tindall et al. 12 , d Du et al. 11 . B Admixture plot (K = 5, cross-validation error = 0.162), which was replicated in 10 out of 10 runs, including 1003 African, 20 European, and 20 Chinese individuals from the HGDP and 1KGP subset of gnomAD together with our dataset of 781 South Africans.

Most recently, the largest meta-analysis of African ancestral PCa GWAS included samples from Ghana and Uganda, with additional representation from the Democratic Republic of Congo (3,149 cases, 2,547 controls recruited from Africa out of a total of 19,378 cases and 61,620 controls) 16 . The study increased the number of known PCa risk alleles from 269 to 278, which were combined to generate a multi-ancestry polygenic risk score (PRS) 16 , 17 . The PRS demonstrated effective PCa risk stratification for men of African ancestry, including MADCaP validation, and African men in the top PRS decile were further distinguished by aggressive disease. Notably and of relevance to this study, of the nine recently identified African-specific/predominant PCa risk variants, seven occurred in gene regions, including a protein truncating variant in the prostate-specific gene anoctamin 7 ( ANO7 Ser914Ter), adding a third functionally relevant protein coding variant to the repertoire of known African-specific PCa risk alleles, including previously identified CHEK2 Ile448Ser 18 and HOXB13 Ter285Lys 19 . Recognising that only 7.4% of the meta-analysis included men from Sub-Saharan Africa, which in turn represents only a fraction of the region with vast ethno-linguistic and genetic diversity 20 , the authors call for further studies across the African diaspora 16 .

Placing the Black South African population into global perspective, in this work we use data from 113 whole-genome sequenced men with predominantly high-risk PCa (HRPCa) 21 to evaluate the use of polygenic scoring based on 278 known common risk PCa variants 16 .

To further assess for potentially functional variants associated with PCa and aggressive disease in 798 Black South African men, we conduct an exome-wide association study (EWAS; 247,870 variants), including both common variant (minor allele frequency (MAF) > 0.01) and gene-based rare variant (MAF ≤ 0.01) analyses.

Known risk alleles in Black South Africans with PCa

Published whole genome sequenced data generated for 113 Black South African PCa cases (average 45.4X coverage), filtered to represent no more than 8% non-African genetic ancestral contribution (Fig.  2A ) 21 , were interrogated for known risk variants and previously associated variants from African GWAS. This included the recently reported set of 278 PCa risk variants 16 , and the top associated variants from the Ugandan 11 and Ghanaian GWAS 10 . Biased towards HRPCa at presentation, defined as International Society of Urological Pathology (ISUP) grade group ≥3 (Gleason score ≥4 + 3, n  = 87, 76.99%), the average age of the cohort was 66.9 years (standard deviation (SD) 8.43, range 45–99), with prostate specific antigen (PSA) levels highly elevated (33.63% PSA ≥ 100 ng/ml), as previously observed within the region 6 (Supplementary Data  1 ). Of the 278 risk variants, 18 (6.47%) were absent, 21 (9.09%) fixed in our Southern African (SA) cohort, and eight were excluded from scoring due to uncertain risk variants (Supplementary Data  2 ). When compared to previously published risk allele frequencies (RAF) for African-ancestral (AA) controls 16 , 11 showed differences in RAF > 0.15, of which eight were more common in our SA population and three were more common in the published largely US-derived AA data (Fig.  3 , Supplementary Data  2 ). The largest differences included rs111595856 ( INHBB , RAF SA  = 0.67, RAF AA  = 0.398), rs35159226 ( ZNF322 , RAF SA  = 0.562, RAF AA  = 0.308), and rs8005621 ( SALRNA1 , RAF SA  = 0.549, RAF AA  = 0.348). 
Among the top 136 associated variants in the Ugandan GWAS, three variants showed differences in RAF > 0.15 compared to Ugandan cases and controls, rs6431219 ( BIN1 , RAF SA  = 0.416, RAF UGPCS_Cases  = 0.6, RAF UGPCS_Controls  = 0.51), rs61005944 ( ENSG00000237101 , RAF SA  = 0.527, RAF UGPCS_Cases  = 0.31, RAF UGPCS_Controls  = 0.22) and rs140698498 ( RBFOX1 , RAF SA  = 1, RAF UGPCS_Cases  = 0.87, RAF UGPCS_Controls  = 0.8) (Supplementary Fig.  1 ). A total of 19 variants were more common in Ugandan controls than in our SA population (difference in RAF ranged from 0.001 to 0.094, Supplementary Data  3 ). Among the top 30 associated variants in the Ghanaian GWAS, only one had a difference in RAF > 0.15 (rs28747043, closest gene MTCO3P1 18.4 kb away, RAF SA  = 0.181, RAF Ghana_Controls  = 0.371; Supplementary Fig.  2 ). A total of 19 of these variants were more common in Ghanaian controls than our SA cases (difference in RAF ranged from 0.0003 to 0.19; Supplementary Data  4 ).

Fig. 2

A Summary of the data used for polygenic risk scoring (PRS), ( B ) sample filtering for the genotype data, ( C ) variant filtering for the exome wide association studies (EWAS), and ( D ) rare variant filtering for the gene-based analyses.

Fig. 3

Comparison of the RAF for 267 out of 278 known risk variants between South African prostate cancer cases ( N  = 113) and African Ancestry controls ( N  = 61,620) as previously reported 16 . Gene labels in white boxes indicate variants that overlap a gene, while gene labels in grey indicate the closest genes to the variant. Note four risk variants had no risk allele frequency reported in the African Ancestry PCa cases 16 and another seven variants in the South African PCa cases were excluded from the plot due to being indel repeats or having unclear risk variants.

Polygenic risk scores (PRS) in Black South Africans

The PRS was evaluated using two definitions of aggressive PCa: firstly, Chen et al., 2023’s definition, ISUP 4-5 or PSA ≥ 20 ng/ml, which grouped our samples into N = 101 aggressive and N = 11 non-aggressive (one sample with missing PSA and ISUP excluded); and secondly, our definition, ISUP 3-5, which grouped our samples into N = 87 aggressive and N = 18 non-aggressive (eight samples with missing ISUP excluded) (Fig. 2 ). A total of 231 non-fixed variants (Supplementary Data 2 ) were used to score the SA population with PLINK v1.9 22 using African and multiethnic weights, as previously published 16 . Using Chen et al., 2023’s definition of aggressiveness with African weights, the mean score of the aggressive group was 0.034 (SD = 0.002, range 0.029–0.039) while that of the non-aggressive group was 0.034 (SD = 0.002, range 0.032–0.037). For multiethnic weights, the aggressive group’s mean score was 0.041 (SD = 0.002, range 0.035–0.046), and that of the non-aggressive group was 0.041 (SD = 0.002, range 0.039–0.045). Using our definition of aggressiveness with African weights, the aggressive group had a mean score of 0.034 (SD = 0.002, range 0.029–0.039), while the non-aggressive group had a mean score of 0.034 (SD = 0.002, range 0.032–0.037). With multiethnic weights, we observed a mean of 0.041 (SD = 0.002, range 0.035–0.046) for the aggressive group and a mean of 0.041 (SD = 0.002, range 0.039–0.045) for the non-aggressive group (Supplementary Figs. 3 , 4 ).

No significant associations were detected using either the multiethnic study’s definition of aggressiveness 16 : African score OR per SD = 1.38, 95% CI = 0.72–2.66; multiethnic score OR per SD = 1.24, 95% CI = 0.65–2.34; nor using our definition of HRPCa: African score OR per SD = 1.01, 95% CI = 0.6–1.71; multiethnic score OR per SD = 1.13, 95% CI = 0.67–1.9.
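The scoring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a polygenic score is taken to be the weighted sum of risk-allele dosages, optionally divided by the number of scored alleles (our understanding of the default averaging in PLINK 1.9 `--score`, which is an assumption here), and scores are standardised so that an odds ratio can be expressed per SD. The function names, variant IDs and weights are hypothetical.

```python
from statistics import mean, stdev


def polygenic_score(dosages, weights, average=True):
    """Weighted sum of risk-allele dosages for one individual.

    dosages: dict variant_id -> risk-allele count (0, 1, 2), or None if missing
    weights: dict variant_id -> per-allele weight (e.g. a published log odds ratio)
    average: divide by the number of scored alleles (assumed PLINK-like default)
    """
    total = 0.0
    n_alleles = 0
    for variant, weight in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:  # this simple sketch skips missing genotypes
            continue
        total += weight * dosage
        n_alleles += 2  # diploid: two alleles per scored variant
    if not average:
        return total
    return total / n_alleles if n_alleles else float("nan")


def standardise(scores):
    """Convert raw scores to z-scores so effects can be reported per SD."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]


# Hypothetical weights and genotypes, for illustration only.
example_weights = {"rsA": 0.10, "rsB": 0.20}
example_person = {"rsA": 2, "rsB": 1}
example_score = polygenic_score(example_person, example_weights)
```

If averaging over alleles is used, scores sit on the scale of a single per-allele weight, which is at least consistent in magnitude with the group means of roughly 0.03 to 0.04 reported above.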

Common risk variants associated with PCa in Black South Africans

A total of 798 Black South Africans were genotyped on the Infinium HumanExome-12 v1.0 BeadChip array (Illumina, California, United States), screening 247,870 variants. A total of 781 remained after sample quality control (QC) filtering (Fig. 2B ). The dataset was assessed for regional ancestral clustering using the Human Genome Diversity Project (HGDP) and 1000 Genomes Project (1KGP) subset of gnomAD v3.1.2, including 1,003 African, 20 European, and 20 Chinese samples. We assessed for within-population substructure using principal component analysis (PCA, Supplementary Fig. 5 ) and ADMIXTURE v1.3.0 for K = 1 to 10 with five-fold cross-validation (CV) and 10 replications each (Supplementary Fig. 6 ), with K = 5 generating the lowest CV error at 0.162 (Supplementary Fig. 7 ). Excluding a single patient who clustered with Nigerian Yoruba and Esan (West African) populations, we confirm that our cohort represents a distinct southern African genetic ancestry (Fig. 1B ). Recruited from Southern African Prostate Cancer Study (SAPCS) representative urology clinics within South Africa, cases were defined as presenting with clinicopathologically confirmed PCa ( N = 451) and controls as having no histopathological evidence of cancer (i.e. no Gleason score, N = 292), with a relatively even age distribution across the two groups (average 70.52, range 49–102 vs average 69.99, range 45–99, respectively). Further clinical characteristics are summarised in Supplementary Data 5 . For exome-wide association analysis (EWAS) and gene-based analysis, a total of 37 men with unknown age at diagnosis were excluded, leaving 743 men.

After variant QC, 50,591 common variants (MAF > 0.01) remained for further case-control EWAS (Fig. 2C ), with no variants reaching genome-wide significance ( q < 0.05) (Fig. 4A ). The QQ plot and genomic inflation factor (λ = 1.06) from the P -values of the EWAS indicated no population stratification in the data (Supplementary Fig. 8 ). Of the 17 SNPs with a P -value of <5E−04 (Table 1 ), six (35.3%) were on chromosome 6, including two intronic variants in the HLA-complex P5 ( HCP5 ) lncRNA gene (rs2244839, rs12660382), two within or close to MUC22 (rs1634718, rs1634725), and one each in LINC00243 (rs1264362) and RFX6 (rs339331). The RFX6 variant rs339331, a known PCa variant (Fig. 4A ) although not included in the set of 278 risk variants, was found to be in strong linkage ( r² = 0.95) with GPRC6A rs2274911 (Supplementary Fig. 9 ). The top-ranked SNPs included rs2244839 ( P = 3.4E−05; OR 1.63, 95% CI: 1.29–2.05) in HCP5 , rs11009235 ( P = 6.26E−05; OR 1.6, 95% CI: 1.27–2.01) located 15.8 kb upstream and 44.9 kb downstream from IATPR and NRP1 , respectively, and rs3865188 ( P = 1.03E−04; OR 1.58, 95% CI: 1.25–1.99) 9.9 kb upstream of CDH13 . Other noteworthy associated variants included rs7963300 ( P = 1.43E−04; OR 2.15, 95% CI: 1.45–3.19), located within an unknown gene ENSG00000286069 approximately 74.7 kb upstream from a HOXC gene cluster (Supplementary Fig. 10 ), and the nonsynonymous rs114057260 ( P = 4.18E−04; OR 0.34, 95% CI: 0.19–0.62) in ZZEF1 , which is the only predicted deleterious variant (PDV), defined as a variant predicted to be deleterious by SIFT and damaging (or possibly damaging) by PolyPhen.
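For readers unfamiliar with the genomic inflation factor quoted above, the following is a minimal sketch of one common way to compute it, assuming 1-degree-of-freedom tests: each P-value is converted back to a chi-square statistic and the median is divided by the median of the null chi-square distribution (about 0.4549). The function name is hypothetical and the exact procedure used by the authors is not stated in the text.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(p_values):
    """Estimate lambda from two-sided P-values of 1-df association tests."""
    norm = NormalDist()
    chi2 = []
    for p in p_values:
        if not 0.0 < p <= 1.0:
            raise ValueError("P-values must lie in (0, 1]")
        # For 1 df, chi-square equals the squared z-score; the lower tail
        # is used so that very small P-values remain numerically stable.
        chi2.append(norm.inv_cdf(p / 2.0) ** 2)
    return median(chi2) / CHI2_1DF_MEDIAN


# Perfectly uniform P-values (the null expectation) give lambda close to 1.
uniform_p = [(i + 0.5) / 1000 for i in range(1000)]
lambda_null = genomic_inflation(uniform_p)
```

Values close to 1, such as the reported λ = 1.06, are conventionally read as little evidence of inflation from population stratification.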

Fig. 4

A Manhattan plot of −log10 P-values from an age-adjusted logistic regression, with 451 cases and 292 controls. B Manhattan plot of −log10 P-values from an age-adjusted logistic regression for high grade prostate cancer (ISUP 3-5) cases ( N = 203) against low risk or no PCa ( N = 540). Known risk variants ( N = 9) from the recently described set of 278 variants 16 are shown as red circles, and known cancer variants as summarised previously 13 are shown as orange circles, while variants in both datasets are represented as red triangles. Variants labelled in a white box indicate the overlapping gene, while those labelled in grey are the closest genes.

Only 17 of the 278 known PCa risk variants 16 were captured by the exome array data (Supplementary Data  2 ), with three SNPs found to be fixed for the risk allele (rs77482050, rs33984059, rs61752561), an additional two almost fixed (rs138708, rs17804499), two were fixed for the reference allele (rs77559646, rs74911261), and one rare (MAF < 0.01 rs76832527) in our SA study population. None of the nine remaining SNPs (MAF 0.015 to 0.49) showed risk association (all P  > 0.25) (Fig.  4A ). A total of 367 out of 2477 known cancer variants, summarised previously 13 , were genotyped in the exome array (Supplementary Data  6 ). Among these, the top associated variants included the RFX6 SNP rs339331 ( P  = 0.0002), intergenic variant rs9600079 (pseudogene RNU4-10P 37.5 kb downstream, closest protein-coding gene is KLF5 76 kb upstream, P  = 0.0014), and CASC8 / PCAT1 variant rs445114 ( P  = 0.0047).

Common risk variants associated with HRPCa in Black South Africans

Further classification of our study cohort as HRPCa (ISUP ≥ 3, N = 203) versus low-risk or no PCa ( N = 461) again showed no genome-wide significance, while 25 SNPs had P < 5E−04 (Fig. 4B , Supplementary Fig. 11 ), of which three were among the top ranking PCa risk EWAS SNPs (rs11009235, rs2897495, rs7963300). Several of the top SNPs are in genes associated with PCa or PCa processes, including MKI67 (rs8473), PCSK6 (rs80278342), ABCB6 (rs60322991), and TCHP (rs11068997); or other cancer-associated processes, including GGA2 (rs1135045), H1-5 (rs11970638), and COL15A1 (rs2075662). The rs8473 SNP in MKI67 was strongly linked ( r² = 0.92) to rs1063535 in the same gene, and moderately linked ( r² = 0.4 to 0.64) to rs34750407, rs11016071, rs10082391, rs1050767, rs12777740, rs11016076, and rs7095325 (Supplementary Fig. 12 ). SNPs rs60322991 ( ABCB6 ), rs877834 ( NPVF ), rs2075662 ( COL15A1 ), and rs77944357 ( ABCA10 ) have deleterious and potentially damaging effects based on SIFT and/or PolyPhen, yet are benign or are lacking prediction by ClinVar (Table 2 ). The nine SNPs included in the 278 PCa risk allele panel showed non-significance (all P > 0.11), while the top associated known cancer variants were rs1859962 in CASC17 ( P = 0.003; OR 1.49, 95% CI: 1.142–1.93) and rs3734805 in CCDC170 ( P = 0.004; OR 2.0, 95% CI: 1.244–3.2) (Fig. 4B ).
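Reported odds ratios, confidence intervals and P-values can be cross-checked against one another. The sketch below assumes a Wald-type interval that is symmetric on the log-odds scale, which is typical of logistic regression output but is an assumption here; the function name is hypothetical. It recovers the standard error from the interval width and derives the implied two-sided P-value.

```python
import math
from statistics import NormalDist


def wald_from_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Back-calculate the log-odds SE, z statistic and two-sided P-value."""
    norm = NormalDist()
    z_crit = norm.inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    se = (math.log(ci_high) - math.log(ci_low)) / (2.0 * z_crit)
    z = math.log(odds_ratio) / se
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return se, z, p


# Values quoted in the text for rs1859962 and rs3734805.
se_a, z_a, p_a = wald_from_ci(1.49, 1.142, 1.93)
se_b, z_b, p_b = wald_from_ci(2.0, 1.244, 3.2)
```

Under this assumption both implied P-values fall within rounding distance of the reported 0.003 and 0.004, so the quoted effect sizes and intervals are internally consistent.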

Genes associated with PCa and HRPCa in Black South African men

Derived from more than 12,000 European-biased sequenced genomes, the exomic array was designed with a focus on protein altering (nonsynonymous, splicing and nonsense) variants. These were selected based on a minimum of three observations across two or more datasets, and as such many rare variants have been included, allowing for further gene-based analyses (Fig. 2D ). After improving rare variant genotype calls using zCall v3.4 23 , gene-based analyses were performed using the optimal unified sequence kernel association test (SKAT-O). SKAT-O adjusts for small sample size and retains power when variants in a region are causal and the effect is in a single direction (through burden tests) as well as when variants in a region have bidirectional effects and may contain noncausal variants (through SKAT) 24 .

Using this method, genes with rare variants associated with PCa risk (family-wise error rate (FWER) < 0.05; Supplementary Fig.  13 , Supplementary Data  7 ) included H3C1 (rs199943654, P  = 9.91e−05), MBP (rs61742941, P  = 1.23e−04), and MTG1 (3 variants including predicted deleterious variant (PDV) rs138851534, P  = 1.59e−04). Although no significance was observed for HRPCa (Supplementary Fig.  14 ), the top associated gene was EPS15 (3 rare variants, P  = 1.47e−04). Through further analyses for common and rare variants (Supplementary Fig.  15 , Supplementary Data  8 ), we show MBP to be significantly associated with PCa risk (1 rare and 5 common variants, P  = 1.26e−04) and included the rare PDV rs61742941. For the HRPCa common and rare variants analysis (Supplementary Fig.  16 , Supplementary Data  9 ), we found KLF5 to be significantly associated with aggressive disease (1 rare variant, 3 common variants, P  = 1.52e−04), including the common PDV rs115503899.

PCa variance explained by exome SNPs

GREML was used to estimate the phenotypic variance explained by genetic variance (SNP heritability). The 49,534 common autosomal SNPs (post-QC) in GREML explained 48.19% (standard error (SE) 22.2%, P  = 4.77E−03) of disease liability for PCa, and only 17.78% (SE 8.19%) when transformed for PCa prevalence of 0.001 (Table  3 ). Common and rare autosomal SNPs together ( N  = 80,421) explained 50.7% (SE 25.13%, P  = 0.014) of disease liability for PCa and at a prevalence of 0.001, explained 18.7% (SE 9.27%).
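The drop from roughly 48% to 18% after adjusting for prevalence can be reproduced with the standard observed-to-liability scale transformation for case-control data. The sketch below assumes that the first figure is on the observed (0/1) scale, that the case fraction is 451 of 743 as reported earlier, and that this is the transformation applied; the function name is hypothetical.

```python
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Rescale SNP heritability from the observed scale to the liability scale.

    h2_liability = h2_observed * K^2 (1 - K)^2 / (z^2 * P * (1 - P))
    K: population prevalence, P: proportion of cases in the sample,
    z: standard normal density at the liability threshold for prevalence K.
    """
    norm = NormalDist()
    threshold = norm.inv_cdf(1.0 - prevalence)
    z = norm.pdf(threshold)
    k, p = prevalence, case_fraction
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))


# Common SNPs only, then common plus rare SNPs, at an assumed prevalence of 0.001.
h2_common = observed_to_liability(0.4819, 0.001, 451 / 743)
h2_common_rare = observed_to_liability(0.507, 0.001, 451 / 743)
```

With these inputs the rescaling factor is about 0.37, which brings 48.19% and 50.7% to values that agree with the reported 17.78% and 18.7% to within rounding.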

Using the top 16 SNPs from the EWAS with all cases and controls, only 5.07% (SE 1.78%, P  = 1.59E−25) of disease liability was explained. When stratifying the cohort by HRPCa versus low-risk/no-PCa at an estimated prevalence of 0.0004 for HRPCa, all 49,534 autosomal SNPs explained 16.15% (SE 11.17%, P  = 0.065) of disease liability, 9.24% (SE 2.31%, P  = 2.3E−44) of which could be explained using the top 25 SNPs from the HRPCa EWAS.

In this study, motivated by limited studies having identified three African-specific protein-altering risk alleles 19 , 25 , we examined PCa risk and aggressive disease associations and have provided much needed evaluation of known PCa risk alleles within the under-represented region of southern Africa. While dwarfed in sample size compared to European-ancestral or African American GWAS studies, this study highlights significant resources and efforts needed to elucidate the genetic contribution to ancestrally-driven PCa health disparities across the African diaspora.

When examining the previously reported allele frequencies of the 278 known risk variants 16 in our population, large differences in RAF (>0.15) were observed for 11 variants, including those in the genes INHBB (rs111595856), ZNF322 (rs35159226), SALRNA1 (rs8005621), FGF10 (rs1482675), HNF1B (rs11263763), PCAT19 (rs11673591), and TAB3 (rs5972255). These differences could mean that alternate oncogenic pathways or epigenetic regulation are at play in southern Africa. Importantly, although we were restricted in sample size and biased to aggressive PCa, we were unable to replicate previous findings of the multiethnic PRS association with aggressive disease 16 , nor with our definition of HRPCa. Notably, the variant rs72725854, which is the most strongly associated risk variant for PCa in men of African ancestry with an allele frequency of 6.1% 26 , was present at a frequency of 13.7% in our case-only South African population.

In the classic case-control PCa EWAS analysis, the top associations included variants in HCP5 and RFX6 , and near IATPR and NRP1 . The histocompatibility leucocyte antigen (HLA) complex P5 ( HCP5 ) is a long non-coding RNA located in the HLA class I region and has shown aberrant expression in multiple cancers, including PCa 27 . A single study investigating HCP5 in PCa using tissues and cell lines found high expression of HCP5 to be positively correlated with prostate tumour metastasis 28 . Expression of HCP5 acted as a sponge for miR-4656, preventing miR-4656 from suppressing the cell migration-inducing hyaluronidase 1 ( CEMIP ) gene, leading to upregulated expression of CEMIP, which plays a key role in tumour proliferation 28 , 29 . Functional experiments will be needed to explore whether the two associated variants from this study affect expression levels of HCP5. Conversely, the variant located between the lncRNA IATPR and NRP1 provides a potential proxy for a yet unknown gene-associated variant. Notably, IATPR has been found to promote cell migration and development in other cancers 30 , 31 , while NRP1 is an androgen-repressed gene that plays a role in cancer progression, with its expression associated with prostate tumour grade and biochemical recurrence 32 .

Intriguingly, the rs339331 variant in RFX6 , a member of the regulatory factor X family of transcription factors, was associated with a decreased PCa risk (C-allele OR 0.64, 95% CI: 0.5–0.82, P = 4.6E−04) and was in strong linkage with GPRC6A rs2274911 ( r² = 0.95). The rs339331 association replicates findings from previous studies of PCa in Ghanaian, Japanese, and Chinese men 10 , 33 , 34 . Conversely, the T-allele has been shown to increase HOXB13 binding to a transcriptional enhancer, which upregulates the expression of RFX6 associated with tumour progression, metastasis and biochemical relapse 35 , as well as upregulating GPRC6A expression 36 . Additionally, the A-allele at rs2274911 has been associated with increased PSA levels 37 . As the major allele in the current study (MAF = 0.7941), it could potentially contribute (at least in part) to the elevated PSA levels observed for Black South African men, irrespective of PCa status 6 . Notably, the intron variant rs339351 in the RFX6 gene is among the 278 known PCa risk variants 16 , differing in frequency by 0.039 (RAF SA = 0.783, RAF AA = 0.744). As noted for other PCa risk variants, such as European-specific HOXB13 p.Gly84Glu (rs138213197) 38 and African-specific HOXB13 p.Ter285Lys (rs77179853) 19 , 39 , it is possible that different ancestral-specific causative variants may represent the same PCa gene.

In the HRPCa EWAS, the top associated variants were in several genes relevant to PCa, including MKI67 , PCSK6 , ABCB6 , and TCHP . MKI67 encodes the Ki67 protein, a widely used diagnostic marker of proliferation in numerous human cancers, with increased expression associated with poor prognosis in localised PCa 40 . One study reported low PSA and high Ki67 expression in patients with HRPCa and TMPRSS2-ERG fusion gene, whereas high PSA and low Ki67 expression predominated in patients with low-risk disease and favourable outcomes 41 . The association of the T-allele of MKI67 rs8473 with reduced odds for HRPCa ( P = 3.09E−04, OR = 0.64, 95% CI: 0.5–0.82) warrants further investigation into the potential of this allele to reduce Ki67 expression. The PCSK6 variant rs80278342 is not regarded as a PDV in this study, but an isoform of PCSK6 has previously been identified as a plasma biomarker for PCa, and expression levels have been correlated with ERG tumour status and ISUP grade group 42 . ABCB6 codes for an ABC transporter, with expression levels linked to chemoresistance 43 . While increased ABCB6 expression has been reported in PCa, deregulated expression has been further associated with recurrent versus non-recurrent disease 44 . Although the correlation between ABCB6 expression levels and PCa grade has not yet been examined, increased ABCB6 has been associated with histological grade in gliomas 45 . The PDV rs11068997, associated with HRPCa in our study, is located within the tumour suppressor gene TCHP , shown to inhibit cell growth in PCa cells 46 . Other notable SNPs associated with HRPCa in our study and in genes representing cancer-associated processes include GGA2 , involved in cell growth 47 , H1-5 , with transcriptional regulatory effects 48 , and COL15A1 , shown to have tumour suppressive effects 49 . Lastly, the associated SNP rs7963300, upstream of a HOXC gene cluster, is of note as increased expression of HOX genes has been observed in prostate tumours 50 .

Since the distribution of several disease-associated alleles across a range of African populations has previously shown large variation 51 , it is plausible that, on top of potentially different oncogenic pathways or epigenetic regulation, germline PCa risks may also differ between regions within Sub-Saharan Africa. African American bias and within-continental representation limited to a snapshot of largely west and east African ancestral diversity 10 , 11 , 16 , 17 provide an explanation for the limited replication of previous GWAS findings in our SA-focused study. This is exemplified by the HOXB13 risk variant (rs77179853) found in men of West African ancestry being absent in Uganda and South Africa, and possibly arising after the Bantu migration from western to eastern and southern Africa around 1500–4600 years ago 19 . This variant was also absent in our cohort, along with another recently identified African-specific variant, CHEK2 (rs17886163) 18 . Conversely, the ANO7 variant (rs60985508) 18 was present (RAF SA = 0.363) 52 . Although further investigations are needed, this study highlights some avenues of interest for future germline studies of PCa across southern Africa.

The gene-based analyses showed association between the genes H3C1 , MBP , and MTG1 with PCa, and KLF5 with HRPCa. Modifications to the H3 histone play a key role in epigenetic regulation, and their relevance to PCa and treatment options has been reviewed elsewhere 53 , 54 . In this study, the significant association of H3C1 with PCa suggests that differences in transcription may exist, and since the variant was only present in controls it may suggest a protective effect against PCa. The MBP gene encodes the myelin basic protein, which is a constituent of the myelin sheath, and has no obvious role in PCa; however, mouse model studies have shown that neural progenitor cells can invade prostate tumours, triggering neurogenesis and promoting tumour growth and metastasis 55 . The MTG1 gene is involved in the regulation of mitochondrial translation 56 and, although it has no known direct role in PCa, mutations in mitochondria have been associated with PCa aggression 57 , including in Black South African men 58 . Conversely, KLF5 is a transcription factor that has been implicated in several cancers with opposing roles (tumour suppressor or oncogenic driver) depending on the context 59 . Expression levels and post-translational modifications of KLF5, specifically acetylation, are of key interest in PCa with therapeutic implications for chemoresistance 60 , 61 , 62 , 63 , 64 . While there are no PCa risk variants within KLF5 , two risk SNPs in close proximity include snRNA rs7489409 (65 kb downstream) and lncRNA rs7336001 (344 kb downstream) 16 . A known cancer variant rs9600079 approximately 76 kb downstream from KLF5 showed slight association ( P = 0.0017). Located at chromosome 13q22.1, this region is frequently deleted in PCa 65 . Although earlier PCa cell line and xenograft research found that mutations were rare, deletions and down-regulation of KLF5 were frequent 66 . In our recent study genome profiling SA derived prostate tumours, we describe a molecular taxonomy we call global mutational subtypes (GMSs), identifying a single KLF5 African-specific predicted cancer driver mutation 21 .

While we acknowledge our limited sample size, our EWAS is not only comparable to the previous Ghanaian and Ugandan PCa GWAS 10 , but also provides an as yet unmet, regionally focused SA alternative for both risk allele validation and discovery across Sub-Saharan Africa (Fig. 1A ). While the HGDP and 1KGP subset of the gnomAD v3.1.2 dataset offers limited representation (20 genomes) across SA 67 (Fig. 1A ), we acknowledge and appreciate that, through projects like H3Africa ( www.h3africa.org ) 68 , additional resources are becoming available, although the pace remains a fraction of the global effort. As with the exome array used in this study 69 , commercial genotyping arrays have been designed based on variant frequencies heavily skewed towards Europeans, whereas the arrays from MADCaP and H3Africa are tailored for African populations, with the MADCaP array specifically designed for PCa research 13 , 68 . Furthermore, GREML analyses showed that only 49.71% (±22.71%) of the variance in phenotype could be explained by the autosomal genomic variance, indicating that the genomic risk for PCa was not fully captured by the European-biased exome array. Finally, although we attempted to improve rare variant calls using zCall 23 , the software can introduce false positives 70 and, as such, we call for caution when interpreting allele frequencies.

While we appreciate our limited sample size, we were unable to replicate previous findings of an association between multiethnic PRS and PCa aggression in our African population. Consequently, we call for further African-relevant whole-genome sequencing and genome-wide interrogation studies to establish PRS of relevance for Sub-Saharan Africa. In our exome-wide association analyses, we identified several avenues of interest for further investigation, including HCP5 , RFX6 , and H3C1 for PCa, and MKI67 and KLF5 for HRPCa. Clearly, significant resources are necessary to elucidate the genomic variants contributing to ethnic disparity in PCa. The global inclusion of southern African data, from a region with the most diverse human populations, will not only benefit the design of African-relevant cancer screening panels and further enhance ancestrally focused SNP arrays, but will also be important for accurate multiethnic PRS.

Ethics approvals and recruitment

Written informed consent was obtained from all participants, with study approval granted by the University of Pretoria Faculty of Health Sciences Research Ethics Committee (HREC #43/2010, with US Federal wide assurance FWA00002567 and IRB00002235 IORG0001762). Study participants were recruited at time of biopsy (diagnosis) from participating Southern African Prostate Cancer Study (SAPCS) urology clinics within the Gauteng and Limpopo Provinces of South Africa. The majority of men presented with a urological or associated complaint without a predetermined prostate specific antigen (PSA) test 6 . Men self-identifying ethno-linguistically, by two generations, as Black South African were included in this study: firstly, irrespective of their PCa diagnosis, to undergo whole exome genotyping (case-control study, N = 798); and secondly, selected for aggressive PCa at presentation and having undergone deep tumour/blood paired whole genome sequencing ( N = 113), as recently published 21 . Genomic interrogation was performed under approval granted by the St. Vincent’s Sydney HREC (#SVH/15/227) and an executed material and data sharing agreement between the University of Pretoria in South Africa and the University of Sydney in Australia. While data are shared, all data remain the property of the University of Pretoria, as chair of the SAPCS data sharing committee.

Interrogation of known PCa risk alleles in SAPCS

We interrogated population-matched whole genome sequenced data for the distribution of the 278 known risk variants 16 , and the top associated variants from the Uganda 11 and Ghana 10 GWAS within a cohort of men selected for bias towards aggressive PCa at presentation. Gene annotations were fetched through ANNOVAR from hg38 ensGene (GENCODE v43, last updated from UCSC 2023-02015) 71 . As recently described 21 , deep sequenced data (average 45.4X coverage) was generated for 113 Black South African PCa cases, which were filtered to include samples with no more than 8% non-African genetic ancestral contribution (Fig.  2A ). A summary of clinical information is available in Supplementary Data  1 .

We scored the South African cases via PLINK v1.9 22 using default settings based on their genotypes at 231 out of 278 available risk variants (Fig.  2A ) using multiethnic and African ancestry weights 16 . Aggressiveness was defined as previously described (ISUP 4 or 5, or PSA ≥ 20 ng/ml) 16 , as well as using an alternative definition (ISUP 3-5), to use as the outcome variable in a logistic regression using the score (African or multiethnic) and age as covariates.
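The scoring and outcome definitions above can be pictured with a small sketch. This is only an illustration in Python, not the PLINK implementation: the variant IDs, weights, and genotypes are invented, the score is a plain weighted sum of risk-allele dosages, and PLINK's own defaults (for example, how scores are averaged and how missing genotypes are handled) are not reproduced here.

```python
from typing import Dict, Optional


def polygenic_score(dosages: Dict[str, Optional[int]],
                    weights: Dict[str, float]) -> float:
    """Sum of weight x risk-allele dosage over variants with a genotype call."""
    score = 0.0
    for variant, weight in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:  # no call for this variant in this sample: skip it
            continue
        score += weight * dosage
    return score


def is_aggressive(isup: int, psa: float) -> bool:
    """Primary definition from the text: ISUP 4 or 5, or PSA >= 20 ng/ml."""
    return isup >= 4 or psa >= 20.0


def is_aggressive_alt(isup: int) -> bool:
    """Alternative definition from the text: ISUP 3-5."""
    return isup >= 3


# Hypothetical weights (log odds ratios) and one hypothetical sample.
weights = {"rsA": 0.10, "rsB": 0.25, "rsC": -0.05}
sample = {"rsA": 2, "rsB": 1, "rsC": None}

score = polygenic_score(sample, weights)
```

The resulting score, together with age, would then enter a logistic regression with the aggressiveness flag as the outcome.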

SAPCS EWAS data generation and genotype filtering

A total of 798 Black South Africans were genotyped on the Infinium HumanExome-12 v1.0 BeadChip array (Illumina, California, United States), assaying 247,870 variants (Fig.  2B–D ). Genotypes were called using the Illumina GenomeStudio 2.0 software following previously published guides 70 , 72 . Briefly, non-pseudoautosomal variants with poor GenTrain scores were manually re-clustered based on visual inspection to improve the accuracy of genotype calls. No samples were removed as all sample call rates were >0.97.

Several quality control steps were conducted to prepare the dataset for exome-wide association analyses (EWAS) (Fig. 2C ). Following manual re-clustering, variants that had a poor GenTrain score (<0.7) and poor call frequency (<0.95) were excluded ( N = 3743). The remaining 244,127 variants were exported to PLINK format. The strand was converted using scripts by William Rayner from the Wellcome Centre for Human Genetics, Oxford website ( https://www.well.ox.ac.uk/~wrayner/strand/ ), in the process removing two single nucleotide polymorphisms (SNPs) that did not reach the required 90% threshold for mapping to the genome. The variant coordinates were converted from hg18 to hg38 using UCSC’s web tool liftOver ( http://genome.ucsc.edu/cgi-bin/hgLiftOver ), and 43 variants not in hg38 were removed. Triallelic or duplicated variants were removed ( N = 826); among duplicated variant pairs, the variant with the higher call rate was kept. Heterozygous chromosome X SNPs ( N = 38) were removed. Variants with MAF < 0.01 ( N = 192,606; Supplementary Data 10 ) and 21 SNPs that failed the Hardy-Weinberg exact test (threshold 1E−6) were removed, leaving 50,591 variants.
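As a quick consistency check, the variant counts reported for each filter can be tallied in sequence. The short Python sketch below uses only the numbers given in the text; the step labels are paraphrases, and the assumption is that the filters were applied in the order described.

```python
# Variant counts reported in the text, in the order the filters are described.
start = 247_870
filters = [
    ("poor GenTrain score / call frequency", 3_743),
    ("failed strand mapping", 2),
    ("not in hg38 after liftOver", 43),
    ("triallelic or duplicated", 826),
    ("heterozygous chromosome X", 38),
    ("MAF < 0.01", 192_606),
    ("failed Hardy-Weinberg exact test", 21),
]


def tally(start_count, steps):
    """Return the running variant count after each filter step."""
    remaining = start_count
    trail = []
    for label, removed in steps:
        remaining -= removed
        trail.append((label, remaining))
    return trail


trail = tally(start, filters)
for label, remaining in trail:
    print(f"{label:<40} {remaining:>8,}")
```

The running total passes through 244,127 after the first step and ends at 50,591, matching the figures stated above.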

SAPCS EWAS cohort characterisation

After excluding a single patient presenting with prostate metastasis with squamous cell primary carcinoma (Fig. 2B ), the remaining 797 SAPCS EWAS samples were checked for genetic duplicates. Identity-by-descent (IBD) was calculated using the --genome function in PLINK v1.9 22 . Eight pairs of genetic duplicates were identified (PI_HAT ≥ 0.99), and all 16 individuals were removed from further analysis. The maximum PI_HAT in remaining pairs of individuals was 0.29, so no further samples were removed. To check for genetic admixture or non-African genetic fractions, the exome array variants were extracted from published African ancestral genomes ( N = 1003) as well as from 20 randomly selected individuals each from the European (CEU) and Han Chinese (CHB) populations in the Human Genome Diversity Project (HGDP) and 1000 Genomes Project (1KGP) subset of gnomAD v3.1.2 67 . The extracted data were merged with the data in the current study and pruned for SNPs based on linkage disequilibrium (LD), using a 50 SNP window moving 5 SNPs at a time, at a variance inflation factor of 1.5 (--indep 50 5 1.5) in PLINK v1.9 22 . The remaining 77,372 SNPs were then used for a principal component analysis (PCA) using PLINK v1.9 22 and plotted with ‘ggplot2’ in RStudio v4.1.1 73 . While no individual showed substantial non-African genetic contribution, a single study participant clustering near the Nigerian Yoruba and Esan west African populations was excluded (Supplementary Fig. 5 ), leaving a total of 780 samples. The distribution of samples from this study and previous PCa GWAS studies in Africa 10 , 11 , 12 , 13 was plotted on a map using the R package ‘rnaturalearth’ 74 .
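The duplicate-removal rule described above (drop both members of any pair with PI_HAT ≥ 0.99) can be sketched as follows. This is a minimal Python illustration, not the authors' script; the sample IDs and PI_HAT values are invented, and in practice the pairs would be read from the pairwise IBD output rather than typed in.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str, float]  # (sample 1, sample 2, PI_HAT)


def duplicates_to_remove(pairs: Iterable[Pair],
                         threshold: float = 0.99) -> Set[str]:
    """Collect both members of every pair at or above the PI_HAT threshold."""
    flagged: Set[str] = set()
    for id1, id2, pi_hat in pairs:
        if pi_hat >= threshold:
            flagged.update((id1, id2))
    return flagged


# Hypothetical pairwise estimates; sample IDs are invented.
pairs = [
    ("S001", "S002", 1.00),   # duplicate pair
    ("S003", "S004", 0.29),   # below the duplicate threshold, so kept
    ("S005", "S006", 0.995),  # duplicate pair
]

removed = duplicates_to_remove(pairs)
```

Applied to the study data, the same rule would flag the eight duplicate pairs, that is, the 16 individuals removed.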

To further assess for African-specific ancestral fractions, an unsupervised ADMIXTURE v1.3.0 analysis 75 was performed using the same dataset. The ADMIXTURE analysis was conducted for K  = 1 to 10 with five-fold cross-validation (CV) and 10 replications each. The tool pong v1.5 76 was used to plot ancestry proportions with a greedy approach set to 0.95 (Supplementary Fig.  6 ). The K  = 5 ADMIXTURE run produced the lowest cross-validation error at 0.16182 (Supplementary Fig.  7 ).
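Choosing K by cross-validation error amounts to picking the minimum across runs. The sketch below assumes log lines of the form `CV error (K=5): 0.16182`, which is how ADMIXTURE reports cross-validation error; only the K = 5 value comes from the text, and the neighbouring values are invented for illustration.

```python
import re
from typing import Dict

CV_LINE = re.compile(r"CV error \(K=(\d+)\): ([0-9.]+)")


def best_k(log_text: str) -> int:
    """Return the K with the lowest cross-validation error in the log text."""
    errors: Dict[int, float] = {}
    for k, err in CV_LINE.findall(log_text):
        errors[int(k)] = float(err)
    return min(errors, key=errors.get)


# Only the K=5 value (0.16182) comes from the text; the others are invented.
example_log = """
CV error (K=4): 0.16300
CV error (K=5): 0.16182
CV error (K=6): 0.16250
"""

chosen = best_k(example_log)
```

With 10 replicates per K, one would first summarise the replicates (for example, by taking the lowest or the mean error per K) before comparing across K; that choice is not specified in the text.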

As age is a significant PCa risk factor, 37 samples were further excluded from the EWAS for lack of reported age at diagnosis, leaving a total of 451 clinicopathologically confirmed cases (mean age 70.52 years, SD ± 9.21, range 49–102) and 292 controls either with or without benign prostate hyperplasia (mean age 69.99 years, SD ± 9.07, range 45–99) (Supplementary Data 5 ). Cases were relatively evenly distributed across the ISUP grade groupings, representing high-risk (ISUP 3 to 5, 54.58%) and low-risk PCa (ISUP 1 to 2, 45.43%). As previously observed for this study population 6 , PSA levels were significantly elevated, with 44.95% of cases presenting with PSA > 100 ng/ml, and only 20.85% of controls presenting with a PSA level less than the global standard for PCa diagnosis (4 ng/ml) 77 .

EWAS analysis and statistics

As the controls in this study were recruited based on a referral to a urology clinic and were negative at histopathological examination of resected biopsy cores (on average 12 per patient), one cannot ignore the possibility of missed PCa diagnoses in these individuals. Therefore, in parallel to a classic case-control EWAS analysis (451 cases, 292 controls), we designed an additional EWAS focused on distinguishing HRPCa (ISUP 3–5, N = 203) from low-risk or no PCa (LRPCa/NoPCa, N = 461). Cases that were missing an ISUP grading were excluded from this analysis.

RStudio v4.1.1 and PLINK v1.9 were used for analyses and visualisation 22 , 78 . A logistic regression using an additive genetic model accounting for age was used. A q-value false-discovery rate (FDR) cut-off of 0.05, calculated in R using the package ‘qvalue’ 79 , was used to determine genome-wide significance. Manhattan, quantile-quantile (QQ), and regional association plots were generated using ‘ggplot2’ in RStudio 73 , with gene annotations for the canonical transcripts fetched through the R package ‘biomaRt’ 80 from Ensembl Human GRCh38.p13, version 108 and GENCODE v43 via UCSC 81 . African ancestry minor allele frequencies were fetched from gnomAD v3.1.2 67 .
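To make the false-discovery-rate cut-off concrete, the sketch below computes Benjamini-Hochberg adjusted p-values in plain Python. This is a deliberately simpler stand-in, not the Storey q-value method implemented in the ‘qvalue’ package: Benjamini-Hochberg corresponds to assuming that the proportion of true null hypotheses is 1, so it is slightly more conservative. The p-values are invented.

```python
from typing import List


def bh_adjust(p_values: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for five variants.
p = [0.001, 0.04, 0.03, 0.20, 0.50]
adj = bh_adjust(p)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
```

In this toy example only the first variant stays below the 0.05 threshold after adjustment.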

For the 17 out of 278 known PCa risk variants 16 , the three variants from the Ghana study 10 , and the previously summarised known cancer variants 13 that were available in our exome array data, the risk allele frequency in cases and controls, age-adjusted odds ratio (OR), 95% confidence intervals (CI), and P-values were calculated in PLINK v1.9 22 . For SNPs where the risk allele was the major allele in our population, the inverse odds ratio (and 95% CI) was calculated to reflect the odds of the risk allele in cases compared to controls. Three of the 17 known PCa risk SNPs were rare variants that were processed with zCall v3.4 23 (see below) to improve genotype calls. There were no changes to the allele frequencies of the three SNPs post-processing.
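The inversion used when the risk allele is the major allele is a simple reciprocal, with the confidence bounds swapping places. A minimal Python sketch, using an invented estimate, is:

```python
from typing import Tuple


def invert_odds_ratio(odds_ratio: float, ci_low: float,
                      ci_high: float) -> Tuple[float, float, float]:
    """Re-express an odds ratio and its CI for the other allele.

    Taking reciprocals swaps the CI bounds: the old upper bound becomes
    the new lower bound, and vice versa.
    """
    return 1.0 / odds_ratio, 1.0 / ci_high, 1.0 / ci_low


# Hypothetical estimate reported for the minor (non-risk) allele.
or_risk, low_risk, high_risk = invert_odds_ratio(0.80, 0.64, 1.00)
```

Here an odds ratio of 0.80 (0.64 to 1.00) for the minor allele becomes 1.25 (1.00 to 1.5625) for the risk allele; the P-value is unchanged by the inversion.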

Gene-based analyses with rare variants

To improve the genotype calls of rare variants (minor allele frequency (MAF) < 0.01), zCall v3.4 23 was used with a default z value of 7 (global concordance=99.19%). Quality control followed that of EWAS filtering, including strand flipping, converting coordinates from hg18 to hg38, removing triallelic or duplicated variants, and removing heterozygous chromosome X variants (Fig.  2D ). A total of 31,445 rare variants (0 < MAF ≤ 0.01) remained. Processing the data with zCall reduced the total number of fixed variants from 163,838 to 162,678 SNPs (Supplementary Data  10 ). Variant annotations for the canonical transcripts were fetched via the R package ‘biomaRt’ 80 from Ensembl Human GRCh38.p13 version 108. The R package ‘SKAT’ version 2.2.4 was used to conduct optimal unified sequence kernel association tests (SKAT-O) 24 . SNPs were grouped by genes, and two analyses were conducted per phenotype: (1) SKAT_CommonRare to test for the combined effect of common and rare variants, using 71,908 common and rare SNPs across 15,593 genes; and (2) SKATBinary to test for rare variant associations, using 31,345 rare SNPs across 11,740 genes. The phenotypes tested were as per the EWAS analyses: PCa risk (cases vs controls) and HRPCa risk (HRPCa vs LRPCa/No PCa). The analyses were conducted chromosome by chromosome, using age as a covariate with N  = 1000 resampling used in the null model. Significant genes were identified using the family-wise error rate (FWER) multiple testing correction (cut-off 0.05) built into the package.

Heritability estimates

SNP-based heritability was calculated using genome-based restricted maximum likelihood (GREML) in GCTA v1.92.0 82 , using autosomal variants at a prevalence of 0.001 based on the 5-year prevalence of PCa in South Africa (39,863 cases out of 29,216,012 men) 4 , as well as an approximate prevalence of 0.0004 for high risk PCa based on the fraction of cases with ISUP 3-5 (43%) in this study.
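The prevalence figures above follow from simple arithmetic, sketched here in Python. The second function shows the standard conversion of an observed-scale estimate to the liability scale for a case-control sample; that this is the transformation applied when a prevalence is supplied is an assumption on our part, and the function and parameter names are illustrative only.

```python
from statistics import NormalDist

cases, men = 39_863, 29_216_012
prevalence = cases / men               # about 0.00136, used as 0.001 in the text
high_risk_prevalence = 0.001 * 0.43    # about 0.0004, as in the text


def liability_scale_h2(h2_observed: float, k: float, p: float) -> float:
    """Convert observed-scale h2 to the liability scale.

    k: population prevalence; p: proportion of cases in the sample.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - k)  # liability threshold for prevalence k
    z = std_normal.pdf(threshold)            # normal density at that threshold
    return h2_observed * (k * (1.0 - k)) ** 2 / (z ** 2 * p * (1.0 - p))
```

For example, `liability_scale_h2(h2, 0.001, 451 / 743)` would rescale an observed-scale estimate `h2` using the case fraction of the case-control EWAS.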

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

The sequencing data analysed in this study were obtained from the European Genome-Phenome Archive (EGA; https://ega-archive.org/ ) under overarching accession EGAS00001006425, with access to the Southern African Prostate Cancer Study (SAPCS) Dataset (EGAD00001009067) granted by the SAPCS Data Access Committee (DAC). Exomic genotyping summary statistics have been deposited in the GWAS Catalog database ( www.ebi.ac.uk/gwas ) under accession code GCST90296485 for cases versus control data and GCST90296486 for high-risk PCa versus low-risk PCa and control. Polygenic risk scores are available in the PGS Catalog database ( https://www.pgscatalog.org/ ) under accession code PGP000516. Source data are provided with this paper. The remaining data are available within the Article, Supplementary Information or Source Data file.

Mahal, B. A. et al. Prostate cancer racial disparities: a systematic review by the prostate cancer foundation panel. Eur. Urol. Oncol. 5 , 18–29 (2022).

Hjelmborg, J. B. et al. The heritability of prostate cancer in the Nordic Twin Study of Cancer. Cancer Epidemiol. Biomarkers Prev. 23 , 2303–2310 (2014).

Siegel, R. L., Miller, K. D., Fuchs, H. E. & Jemal, A. Cancer statistics, 2022. CA Cancer J. Clin. 72 , 7–33 (2022).

Sung, H. et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J. Clin. 71 , 209–249 (2021).

Petersen, D. C. et al. Complex patterns of genomic admixture within southern Africa. PLoS Genet. 9 , e1003309 (2013).

Tindall, E. A. et al. Clinical presentation of prostate cancer in black South Africans. Prostate 74 , 880–891 (2014).

Mills, M. C. & Rahal, C. The GWAS Diversity Monitor tracks diversity by disease in real time. Nat. Genet. 52 , 242–243 (2020).

Acheampong, E. et al. Association of genetic variants with prostate cancer in Africa: a concise review. Egyptian J. Med. Hum. Genet. 22 , 1–9 (2021).

Rotimi, S. O., Rotimi, O. A. & Salhia, B. A review of cancer genetics and genomics studies in Africa. Front. Oncol. 10 , 606400 (2020).

Cook, M. B. et al. A genome-wide association study of prostate cancer in West African men. Hum. Genet. 133 , 509–521 (2014).

Du, Z. et al. Genetic risk of prostate cancer in Ugandan men. Prostate 78 , 370–376 (2018).

Tindall, E. A. et al. Addressing the contribution of previously described genetic and epidemiological risk factors associated with increased prostate cancer risk and aggressive disease within men from South Africa. BMC Urol. 13 , 74 (2013).

Harlemon, M. et al. A custom genotyping array reveals population-level heterogeneity for the genetic risks of prostate cancer and other cancers in Africa. Cancer Res. 80 , 2956–2966 (2020).

Matejcic, M. et al. Pathogenic variants in cancer predisposition genes and prostate cancer risk in men of African ancestry. JCO Precis. Oncol. 4 , 32–43 (2020).

Gheybi, K. et al. Evaluating germline testing panels in southern African males with advanced prostate cancer. J. Natl. Compr. Canc. Netw. 21 , 289–296 e283 (2023).

Chen, F. et al. Evidence of novel susceptibility variants for prostate cancer and a multiancestry polygenic risk score associated with aggressive disease in men of African ancestry. Eur. Urol. 84 , 13–21 (2023).

Conti, D. V. et al. Trans-ancestry genome-wide association meta-analysis of prostate cancer identifies new susceptibility loci and informs genetic risk prediction. Nat. Genet. 53 , 65–75 (2021).

Conti, D. V. et al. Two novel susceptibility loci for prostate cancer in men of African ancestry. J. Natl Cancer Inst. 109 (2017).

Darst, B. F. et al. A rare germline HOXB13 variant contributes to risk of prostate cancer in men of African ancestry. Eur. Urol. 81 , 458–462 (2022).

Soh, P. X. Y. & Hayes, V. M. Common genetic variants associated with prostate cancer risk: the need for African inclusion. Eur. Urol. In press.

Jaratlerdsiri, W. et al. African-specific molecular taxonomy of prostate cancer. Nature 609 , 552–559 (2022).

Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4 , 7 (2015).

Goldstein, J. I. et al. zCall: a rare variant caller for array-based genotyping: genetics and population analysis. Bioinformatics (Oxford, England) 28 , 2543–2545 (2012).

Lee, S. et al. Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies. Am. J. Hum. Genet. 91 , 224–237 (2012).

Na, R. et al. The HOXB13 variant X285K is associated with clinical significance and early age at diagnosis in African American prostate cancer patients. Br. J. Cancer 126 , 791–796 (2022).

Walavalkar, K. et al. A rare variant of African ancestry activates 8q24 lncRNA hub by modulating cancer associated enhancer. Nat. Commun. 11 , 3598 (2020).

Zou, Y. & Chen, B. Long non-coding RNA HCP5 in cancer. Clin. Chim. Acta 512 , 33–39 (2021).

Hu, R. & Lu, Z. Long non‑coding RNA HCP5 promotes prostate cancer cell proliferation by acting as the sponge of miR‑4656 to modulate CEMIP expression. Oncol. Rep. 43 , 328–336 (2020).

Li, L., Yan, L. H., Manoj, S., Li, Y. & Lu, L. Central role of CEMIP in tumorigenesis and its potential as therapeutic target. J. Cancer 8 , 2238–2246 (2017).

Yan, M. et al. Long noncoding RNA linc-ITGB1 promotes cell migration and invasion in human breast cancer. Biotechnol. Appl. Biochem. 64 , 5–13 (2017).

Dai, L. et al. LncRNA ITGB1 promotes the development of bladder cancer through regulating microRNA-10a expression. Eur. Rev. Med. Pharmacol. Sci. 23 , 6858–6867 (2019).

Tse, B. W. C. et al. Neuropilin-1 is upregulated in the adaptive response of prostate tumors to androgen-targeted therapies and is prognostic of metastatic progression and patient mortality. Oncogene 36 , 3417–3427 (2017).

Takata, R. et al. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nat. Genet. 42 , 751–754 (2010).

Wang, N. N. et al. Susceptibility loci associations with prostate cancer risk in northern Chinese men. Asian Pac. J. Cancer Prev. 14 , 3075–3078 (2013).

Huang, Q. et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat. Genet. 46 , 126–135 (2014).

Wang, M. et al. Replication and cumulative effects of GWAS-identified genetic variations for prostate cancer in Asians: a case-control study in the ChinaPCa consortium. Carcinogenesis 33 , 356–360 (2012).

Qi, N. et al. rs2274911 polymorphism in GPRC6A associated with serum E2 and PSA in a Southern Chinese male population. Gene 763 , 145067 (2020).

Ewing, C. M. et al. Germline mutations in HOXB13 and prostate-cancer risk. N. Engl. J. Med. 366 , 141–149 (2012).

Marlin, R. et al. Mutation HOXB13 c.853delT in Martinican prostate cancer patients. Prostate 80 , 463–470 (2020).

Berlin, A. et al. Prognostic role of Ki-67 score in localized prostate cancer: A systematic review and meta-analysis. Urol. Oncol. 35 , 499–506 (2017).

Hammarsten, P. et al. Immunoreactivity for prostate specific antigen and Ki67 differentiates subgroups of prostate cancer related to outcome. Mod. Pathol. 32 , 1310–1319 (2019).

Couture, F. et al. PACE4-altCT isoform of proprotein convertase PACE4 as tissue and plasmatic biomarker for prostate cancer. Sci. Rep. 12 , 6066 (2022).

Minami, K. et al. Expression of ABCB6 is related to resistance to 5-FU, SN-38 and vincristine. Anticancer Res. 34 , 4767–4773 (2014).

Karatas, O. F., Guzel, E., Duz, M. B., Ittmann, M. & Ozen, M. The role of ATP-binding cassette transporter genes in the progression of prostate cancer. Prostate 76 , 434–444 (2016).

Zhao, S. G. et al. Increased expression of ABCB6 enhances protoporphyrin IX accumulation and photodynamic effect in human glioma. Ann. Surg. Oncol. 20 , 4379–4388 (2013).

Vecchione, A. et al. MITOSTATIN, a putative tumor suppressor on chromosome 12q24.1, is downregulated in human bladder and breast cancer. Oncogene 28 , 257–269 (2009).

Uemura, T., Kametaka, S. & Waguri, S. GGA2 interacts with EGFR cytoplasmic domain to stabilize the receptor expression and promote cell growth. Sci. Rep. 8 , 1368 (2018).

Li, H. et al. Mutations in linker histone genes HIST1H1 B, C, D, and E; OCT2 (POU2F2); IRF8; and ARID1A underlying the pathogenesis of follicular lymphoma. Blood 123 , 1487–1498 (2014).

Mutolo, M. J. et al. Tumor suppression by collagen XV is independent of the restin domain. Matrix Biol. 31 , 285–289 (2012).

Morgan, R. et al. Targeting HOX transcription factors in prostate cancer. BMC Urol. 14 , 17 (2014).

Choudhury, A. et al. High-depth African genomes inform human migration and health. Nature 586 , 741–748 (2020).

Jiang, J. et al. ANO7 African-ancestral genomic diversity and advanced prostate cancer. Prostate Cancer Prostatic Dis (2023).

Jones, K. et al. Epigenetics in prostate cancer treatment. J. Transl. Genet. Genom. 5 , 341–356 (2021).

Sugiura, M. et al. Epigenetic modifications in prostate cancer. Int. J. Urol. 28 , 140–149 (2021).

Mauffrey, P. et al. Progenitors from the central nervous system drive neurogenesis in cancer. Nature 569 , 672–678 (2019).

Barrientos, A. et al. MTG1 codes for a conserved protein required for mitochondrial translation. Mol. Biol. Cell 14 , 2292–2302 (2003).

Hopkins, J. F. et al. Mitochondrial mutations drive prostate cancer aggression. Nat. Commun. 8 , 656 (2017).

McCrow, J. P. et al. Spectrum of mitochondrial genomic variation and associated clinical presentation of prostate cancer in South African men. Prostate 76 , 349–358 (2016).

Diakiw, S. M., D’Andrea, R. J. & Brown, A. L. The double life of KLF5: Opposing roles in regulation of gene-expression, cellular function, and transformation. IUBMB Life 65 , 999–1011 (2013).

Xing, C. et al. Different expression patterns and functions of acetylated and unacetylated Klf5 in the proliferation and differentiation of prostatic epithelial cells. PLoS ONE 8 , e65538 (2013).

Jia, J. et al. KLF5 downregulation desensitizes castration-resistant prostate cancer cells to docetaxel by increasing BECN1 expression and inducing cell autophagy. Theranostics 9 , 5464–5477 (2019).

Li, Y. et al. TGF-beta causes docetaxel resistance in prostate cancer via the induction of Bcl-2 by acetylated KLF5 and protein stabilization. Theranostics 10 , 7656–7670 (2020).

Che, M. et al. Opposing transcriptional programs of KLF5 and AR emerge during therapy for advanced prostate cancer. Nat. Commun. 12 , 6377 (2021).

Zhang, B. et al. Acetylation of KLF5 maintains EMT and tumorigenicity to cause chemoresistant bone metastasis in prostate cancer. Nat. Commun. 12 , 1714 (2021).

Kluth, M. et al. 13q deletion is linked to an adverse phenotype and poor prognosis in prostate cancer. Genes Chromosomes Cancer 57 , 504–512 (2018).

Chen, C., Bhalala, H. V., Vessella, R. L. & Dong, J. T. KLF5 is frequently deleted and down-regulated but rarely mutated in prostate cancer. Prostate 55 , 81–88 (2003).

Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581 , 434–443 (2020).

Mulder, N. et al. H3Africa: current perspectives. Pharmgenomics Pers. Med. 11 , 59–66 (2018).

Grove, M. L. et al. Best practices and joint calling of the HumanExome BeadChip: the CHARGE Consortium. PLoS ONE 8 , e68095 (2013).

Guo, Y. et al. Illumina human exome genotyping array clustering and quality control. Nat. Protoc. 9 , 2643–2662 (2014).

Wang, K., Li, M. & Hakonarson, H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38 , e164 (2010).

Zhao, S. et al. Strategies for processing and quality control of Illumina genotyping arrays. Brief Bioinform. 19 , 765–775 (2018).

Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).

Massicotte, P. & South, A. rnaturalearth: World Map Data from Natural Earth. R package version 0.3.4 edn (2023).

Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19 , 1655–1664 (2009).

Behr, A. A., Liu, K. Z., Liu-Fang, G., Nakka, P. & Ramachandran, S. pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics (Oxford, England) 32 , 2817–2823 (2016).

David MK & Leslie SW. Prostate Specific Antigen. [Updated 2022 Nov 10]. In: StatPearls [Internet] . https://www.ncbi.nlm.nih.gov/books/NBK557495/ (StatPearls Publishing, Treasure Island, FL, 2023)

Team R. RStudio: Integrated Development for R (RStudio, PBC, 2020).

Storey, J. D., Bass, A. J., Dabney, A. & Robinson, D. qvalue: Q-value estimation for false discovery rate control. R package version 2.30.0 edn (2022).

Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4 , 1184–1191 (2009).

Karolchik, D. et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32 , D493–496 (2004).

Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88 , 76–82 (2011).

Acknowledgements

The authors are grateful to the patients and their families who have contributed to this study, to co-founding SAPCS members the late Professor Philip A. Venter and retired urologist Dr. Richard Monare from the University of Limpopo in South Africa, as well as the Medical Research Council (MRC) of South Africa for SAPCS seed-funding; without their contribution, this research would not be possible. This study was supported by a Cancer Association of South Africa (CANSA) grant to M.S.R.B. and V.M.H., the National Health and Medical Research Council (NHMRC) of Australia via a Project grant (APP1165762) to V.M.H. and Ideas grants (APP2001098) to V.M.H. and M.S.R.B. and (APP2010551) to V.M.H., as well as partially via the U.S.A. Congressionally Directed Medical Research Programs (CDMRP) Prostate Cancer Research Program (PCRP) through an Idea Development Award (PC200390, TARGET Africa) to V.M.H. and a HEROIC Consortium Award (PC210168, HEROIC PCaPH Africa1K) to V.M.H. and M.S.R.B., acknowledging our co-leads Professors Gail S. Prins (University of Illinois at Chicago) and Mungai Peter Ngugi (University of Nairobi, Kenya). V.M.H. is further supported by the Petre Foundation via the University of Sydney Foundation, Australia.

Author information

Authors and affiliations.

Ancestry and Health Genomics Laboratory, Charles Perkins Centre, School of Medical Sciences, Faculty of Medicine and Health, University of Sydney, Camperdown, NSW, 2006, Australia

Pamela X. Y. Soh, Kazzem Gheybi, Jue Jiang, Weerachai Jaratlerdsiri & Vanessa M. Hayes

School of Health Systems and Public Health, University of Pretoria, Pretoria, South Africa

Naledi Mmekwa, Sean M. Patrick, M. S. Riana Bornman & Vanessa M. Hayes

South African Medical Research Council Centre for Tuberculosis Research, Division of Molecular Biology and Human Genetics, Faculty of Medicine and Health Sciences, Stellenbosch University, Cape Town, South Africa

Desiree C. Petersen

Faculty of Health Sciences, University of Limpopo, Turfloop Campus, South Africa

Smit van Zyl & Vanessa M. Hayes

Phulukisa Health Care, Pretoria, South Africa

Raymond Campbell

Department of Urology, Sefako Makgatho Health Science University, Dr George Mukhari Academic Hospital, Medunsa, South Africa

Shingai B. A. Mutambirwa

Manchester Cancer Research Centre, University of Manchester, Manchester, M20 4GJ, UK

Vanessa M. Hayes

Contributions

V.M.H. and M.S.R.B. co-conceived the study. V.M.H. and D.C.P. designed the experimental approach. P.X.Y.S. led the statistical analyses, with statistical and computational assistance provided by K.G., J.J. and W.J. S.v.Z., R.C. and S.B.A.M. recruited patients and provided critical clinical review. N.M., D.C.P., S.M.P. and M.S.R.B. collated the specimens and provided both quality control and administrative data support. D.C.P. prepared the DNA for analysis and quality control. P.X.Y.S. generated the figures and provided the data analytics, with further interpretation provided by V.M.H. M.S.R.B. as the SAPCS Clinical Director, S.B.A.M. as the SAPCS Urological Director and V.M.H. as the SAPCS Scientific Director provided expertise-specific supervision. P.X.Y.S. drafted the manuscript with supervision from V.M.H., with all authors reviewing and editing the manuscript.

Corresponding author

Correspondence to Vanessa M. Hayes .

Ethics declarations

Competing interests.

All authors declare no competing interest.

Peer review

Peer review information.

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information, Peer Review File, Description of Additional Supplementary Files, Supplementary Data 1–10, Reporting Summary, Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Soh, P.X.Y., Mmekwa, N., Petersen, D.C. et al. Prostate cancer genetic risk and associated aggressive disease in men of African ancestry. Nat Commun 14, 8037 (2023). https://doi.org/10.1038/s41467-023-43726-w


Received: 28 May 2023

Accepted: 17 November 2023

Published: 05 December 2023

DOI: https://doi.org/10.1038/s41467-023-43726-w


Observational Studies: Cohort and Case-Control Studies

Jae W. Song

1 Research Fellow, Section of Plastic Surgery, Department of Surgery The University of Michigan Health System; Ann Arbor, MI

Kevin C. Chung

2 Professor of Surgery, Section of Plastic Surgery, Department of Surgery The University of Michigan Health System; Ann Arbor, MI

Observational studies are an important category of study designs. To address some investigative questions in plastic surgery, randomized controlled trials are not always indicated or ethical to conduct. Instead, observational studies may be the next best method to address these types of questions. Well-designed observational studies have been shown to provide results similar to randomized controlled trials, challenging the belief that observational studies are second-rate. Cohort studies and case-control studies are two primary types of observational studies that aid in evaluating associations between diseases and exposures. In this review article, we describe these study designs, methodological issues, and provide examples from the plastic surgery literature.

Because of the innovative nature of the specialty, plastic surgeons are frequently confronted with a spectrum of clinical questions by patients who inquire about “best practices.” It is thus essential that plastic surgeons know how to critically appraise the literature to understand and practice evidence-based medicine (EBM) and also contribute to the effort by carrying out high-quality investigations. 1 Well-designed randomized controlled trials (RCTs) have held the pre-eminent position in the hierarchy of EBM as level I evidence ( Table 1 ). However, RCT methodology, which was first developed for drug trials, can be difficult to conduct for surgical investigations. 3 Instead, well-designed observational studies, recognized as level II or III evidence, can play an important role in deriving evidence for plastic surgery. Results from observational studies are often criticized for being vulnerable to influences by unpredictable confounding factors. However, recent work has challenged this notion, showing comparable results between observational studies and RCTs. 4 , 5 Observational studies can also complement RCTs in hypothesis generation, establishing questions for future RCTs, and defining clinical conditions.

Levels of Evidence Based Medicine

From REF 1 .

Analytic study designs are sub-classified as observational or experimental ( Figure 1 ). The goal of analytic studies is to identify and evaluate causes or risk factors of diseases or health-related events. The differentiating characteristic between observational and experimental study designs is that in the latter, the groups are defined by whether or not subjects undergo an intervention assigned by the investigator. By contrast, in an observational study, the investigator does not intervene and rather simply “observes” and assesses the strength of the relationship between an exposure and disease variable. 6 Three types of observational studies include cohort studies, case-control studies, and cross-sectional studies ( Figure 1 ). Case-control and cohort studies offer a temporal dimension (i.e. prospective or retrospective study design) for measuring disease occurrence and its association with an exposure. Cross-sectional studies, also known as prevalence studies, examine the data on disease and exposure at one particular time point ( Figure 2 ). 6 Because the temporal relationship between disease occurrence and exposure cannot be established, cross-sectional studies cannot assess the cause and effect relationship. In this review, we will primarily discuss cohort and case-control study designs and related methodologic issues.

Figure 1. Analytic Study Designs. Adapted with permission from Joseph Eisenberg, Ph.D.

Figure 2. Temporal Design of Observational Studies: Cross-sectional studies are known as prevalence studies and do not have an inherent temporal dimension. These studies evaluate subjects at one point in time, the present time. By contrast, cohort studies can be either retrospective (Latin-derived prefix, “retro” meaning “back, behind”) or prospective (Greek-derived prefix, “pro” meaning “before, in front of”). Retrospective studies “look back” in time, in contrast with prospective studies, which “look ahead” to examine causal associations. Case-control study designs are also retrospective and assess the history of the subject for the presence or absence of an exposure.


The term “cohort” is derived from the Latin word cohors. Roman legions were composed of ten cohorts. During battle each cohort, or military unit, consisting of a specific number of warriors and commanding centurions, was traceable. The word “cohort” has been adopted into epidemiology to define a set of people followed over a period of time. W.H. Frost, an epidemiologist from the early 1900s, was the first to use the word “cohort” in his 1935 publication assessing age-specific mortality rates and tuberculosis. 7 The modern epidemiological definition of the word now means a “group of people with defined characteristics who are followed up to determine incidence of, or mortality from, some specific disease, all causes of death, or some other outcome.” 7

Study Design

A well-designed cohort study can provide powerful results. In a cohort study, an outcome or disease-free study population is first identified by the exposure or event of interest and followed in time until the disease or outcome of interest occurs ( Figure 3A ). Because exposure is identified before the outcome, cohort studies have a temporal framework to assess causality and thus have the potential to provide the strongest scientific evidence. 8 Advantages and disadvantages of a cohort study are listed in Table 2 . 2 , 9 Cohort studies are particularly advantageous for examining rare exposures because subjects are selected by their exposure status. Additionally, the investigator can examine multiple outcomes simultaneously. Disadvantages include the need for a large sample size and the potentially long follow-up duration of the study design resulting in a costly endeavor.

Figure 3. Cohort and Case-Control Study Designs

Advantages and Disadvantages of the Cohort Study

Cohort studies can be prospective or retrospective ( Figure 2 ). Prospective studies are carried out from the present time into the future. Because prospective studies are designed with specific data collection methods, they can be tailored to collect specific exposure data, and the resulting data may be more complete. The disadvantage of a prospective cohort study may be the long follow-up period while waiting for events or diseases to occur. Thus, this study design is inefficient for investigating diseases with long latency periods and is vulnerable to a high loss to follow-up rate. Although prospective cohort studies are invaluable, as exemplified by the landmark Framingham Heart Study, started in 1948 and still ongoing, 10 in the plastic surgery literature this study design is generally seen to be inefficient and impractical. Instead, retrospective cohort studies are better indicated given the timeliness and inexpensive nature of the study design.

Retrospective cohort studies, also known as historical cohort studies, are carried out at the present time and look to the past to examine medical events or outcomes. In other words, a cohort of subjects selected based on exposure status is chosen at the present time, and outcome data (i.e. disease status, event status), which were measured in the past, are reconstructed for analysis. The primary disadvantage of this study design is the limited control the investigator has over data collection. The existing data may be incomplete, inaccurate, or inconsistently measured between subjects. 2 However, because of the immediate availability of the data, this study design is comparatively less costly and shorter than prospective cohort studies. For example, Spear and colleagues examined the effect of obesity on complication rates after pedicled TRAM flap reconstruction by retrospectively reviewing 224 pedicled TRAM flaps in 200 patients over a 10-year period. 11 In this example, subjects who underwent the pedicled TRAM flap reconstruction were selected and categorized into cohorts by their exposure status: normal/underweight, overweight, or obese. The outcomes of interest were various flap and donor site complications. The findings revealed that obese patients had a significantly higher incidence of donor site complications, multiple flap complications, and partial flap necrosis than normal or overweight patients. An advantage of the retrospective study design is the immediate access to the data. A disadvantage is the limited control over data collection, because the data were gathered retrospectively over 10 years; for example, a limitation reported by the authors is that mastectomy flap necrosis was not uniformly recorded for all subjects. 11
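The cohort comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the counts below are hypothetical and are not taken from Spear and colleagues, and the names `Cohort` and `risk_ratio` are introduced here for the example. It shows how subjects grouped by exposure status yield a risk (cumulative incidence) per cohort and a risk ratio against a reference cohort.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """One exposure group in a cohort study (hypothetical data holder)."""
    name: str
    n_subjects: int  # subjects in this exposure group
    n_events: int    # subjects who developed the outcome

    @property
    def risk(self) -> float:
        """Cumulative incidence of the outcome in this group."""
        return self.n_events / self.n_subjects


def risk_ratio(exposed: Cohort, reference: Cohort) -> float:
    """Risk in the exposed group divided by risk in the reference group."""
    return exposed.risk / reference.risk


# Hypothetical counts, chosen only to illustrate the arithmetic.
cohorts = [
    Cohort("normal/underweight", n_subjects=100, n_events=10),
    Cohort("overweight", n_subjects=60, n_events=9),
    Cohort("obese", n_subjects=40, n_events=12),
]
reference = cohorts[0]

for cohort in cohorts:
    print(f"{cohort.name}: risk={cohort.risk:.2f}, "
          f"risk ratio={risk_ratio(cohort, reference):.2f}")
```

Because exposure groups are fixed at the outset, the same cohorts can be reused to compute risks for several different outcomes, which is the "multiple outcomes" advantage noted earlier.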

An important distinction lies between cohort studies and case-series. The distinguishing feature between these two types of studies is the presence of a control, or unexposed, group. In contrast with epidemiological cohort studies, case-series are descriptive studies following one small group of subjects. In essence, they are extensions of case reports. Usually the cases are obtained from the authors' experiences, generally involve a small number of patients, and more importantly, lack a control group. 12 There is often confusion in designating studies as “cohort studies” when only one group of subjects is examined. Yet, unless a second comparative group serving as a control is present, these studies are defined as case-series. The next step in strengthening an observation from a case-series is selecting appropriate control groups to conduct a cohort or case-control study, the latter of which is discussed in the following section about case-control studies. 9

Methodological Issues

Selection of subjects in cohort studies.

The hallmark of a cohort study is defining the selected group of subjects by exposure status at the start of the investigation. A critical characteristic of subject selection is that both the exposed and unexposed groups are selected from the same source population ( Figure 4 ). 9 Subjects who are not at risk for developing the outcome should be excluded from the study. The source population is determined by practical considerations, such as sampling. Subjects may be effectively sampled from a hospital, a community, or a doctor's individual practice. A subset of these subjects will be eligible for the study.

Figure 4. Levels of Subject Selection. Adapted from Ref 9.

Attrition Bias (Loss to follow-up)

Because prospective cohort studies may require long follow-up periods, it is important to minimize loss to follow-up. Loss to follow-up is a situation in which the investigator loses contact with the subject, resulting in missing data. If too many subjects are lost to follow-up, the internal validity of the study is reduced. A general rule of thumb requires that the loss to follow-up rate not exceed 20% of the sample. 6 Any systematic differences in outcome or exposure between those who drop out and those who stay in the study must be examined, if possible, by comparing individuals who remain in the study with those who were lost to follow-up or dropped out. It is therefore important to select subjects who can be followed for the entire duration of the cohort study. Methods to minimize loss to follow-up are listed in Table 3 .

Methods to Minimize Loss to Follow-Up

Adapted from REF 2 .
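As a small worked illustration of the 20% rule of thumb for attrition, the following Python sketch computes the loss to follow-up rate and flags a cohort that exceeds the threshold. The function names and the enrollment figures are hypothetical and introduced only for this example; the 20% default comes from the rule of thumb quoted in the text.

```python
def loss_to_follow_up_rate(enrolled: int, completed: int) -> float:
    """Fraction of enrolled subjects who did not complete follow-up."""
    if enrolled <= 0:
        raise ValueError("enrolled must be positive")
    if not 0 <= completed <= enrolled:
        raise ValueError("completed must be between 0 and enrolled")
    return (enrolled - completed) / enrolled


def exceeds_rule_of_thumb(enrolled: int, completed: int,
                          threshold: float = 0.20) -> bool:
    """True when attrition is above the rule-of-thumb threshold."""
    return loss_to_follow_up_rate(enrolled, completed) > threshold


# Hypothetical cohort: 200 subjects enrolled, 170 completed follow-up.
rate = loss_to_follow_up_rate(enrolled=200, completed=170)
print(f"Loss to follow-up: {rate:.0%}")
print("Exceeds 20% rule of thumb:", exceeds_rule_of_thumb(200, 170))
```

A rate under the threshold does not by itself rule out attrition bias; the comparison of dropouts with completers described above is still needed.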


Case-control studies were historically born out of interest in disease etiology. The conceptual basis of the case-control study is similar to taking a history and physical; the diseased patient is questioned and examined, and elements from this history taking are knitted together to reveal characteristics or factors that predisposed the patient to the disease. In fact, the practice of interviewing patients about behaviors and conditions preceding illness dates back to the Hippocratic writings of the 4th century B.C. 7

Reasons of practicality and feasibility inherent in the study design typically dictate whether a cohort study or case-control study is appropriate. The case-control design was first recognized in Janet Lane-Claypon's 1926 study of breast cancer, which found that a low fertility rate raises the risk of breast cancer. 13 , 14 In the ensuing decades, case-control study methodology crystallized with the landmark publication linking smoking and lung cancer in the 1950s. 15 Since that time, retrospective case-control studies have become more prominent in the biomedical literature with more rigorous methodological advances in design, execution, and analysis.

Case-control studies identify subjects by outcome status at the outset of the investigation. Outcomes of interest may be whether the subject has undergone a specific type of surgery, experienced a complication, or is diagnosed with a disease ( Figure 3B ). Once outcome status is identified and subjects are categorized as cases, controls (subjects without the outcome but from the same source population) are selected. Data about exposure to a risk factor or several risk factors are then collected retrospectively, typically by interview, abstraction from records, or survey. Case-control studies are well suited to investigate rare outcomes or outcomes with a long latency period because subjects are selected from the outset by their outcome status. Thus in comparison to cohort studies, case-control studies are quick, relatively inexpensive to implement, require comparatively fewer subjects, and allow for multiple exposures or risk factors to be assessed for one outcome ( Table 4 ). 2 , 9

Advantages and Disadvantages of the Case-Control Study

An example of a case-control investigation is by Zhang and colleagues, who examined the association of environmental and genetic factors with rare congenital microtia, 16 which has an estimated prevalence of 0.83 to 17.4 in 10,000. 17 They selected 121 congenital microtia cases based on clinical phenotype, and 152 unaffected controls, matched by age and sex in the same hospital and same period. Controls were of Han Chinese origin from Jiangsu, China, the same area from which the cases were selected. This allowed both the controls and cases to have the same genetic background, which is important given the investigated association between genetic factors and congenital microtia. To examine environmental factors, a questionnaire was administered to the mothers of both cases and controls. The authors concluded that adverse maternal health was among the main risk factors for congenital microtia, specifically maternal disease during pregnancy (OR 5.89, 95% CI 2.36-14.72), maternal toxicity exposure during pregnancy (OR 4.76, 95% CI 1.66-13.68), and resident area, such as living near industries associated with air pollution (OR 7.00, 95% CI 2.09-23.47). 16 A case-control study design is most efficient for this investigation, given the rarity of the disease outcome. Because congenital microtia is thought to have multifactorial causes, an additional advantage of the case-control study design in this example is the ability to examine multiple exposures and risk factors.
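To make the odds ratios and confidence intervals quoted above more concrete, the sketch below computes a crude (unadjusted) odds ratio from a 2x2 table, with a 95% confidence interval from the standard log-based (Woolf) approximation. The choice of that method is an assumption of this example; the source does not state how Zhang and colleagues computed their estimates. The counts and the function name `odds_ratio_ci` are hypothetical.

```python
import math


def odds_ratio_ci(exposed_cases: int, unexposed_cases: int,
                  exposed_controls: int, unexposed_controls: int,
                  z: float = 1.96) -> tuple[float, float, float]:
    """Crude odds ratio with a log-based (Woolf) confidence interval.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    cells = (exposed_cases, unexposed_cases,
             exposed_controls, unexposed_controls)
    if min(cells) <= 0:
        raise ValueError("all four cells must be positive for this method")
    odds_ratio = (exposed_cases * unexposed_controls) / (
        unexposed_cases * exposed_controls)
    # Standard error of log(OR) is the root of the summed reciprocals.
    se_log_or = math.sqrt(sum(1 / cell for cell in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical 2x2 table, not data from the microtia study.
or_, lower, upper = odds_ratio_ci(exposed_cases=40, unexposed_cases=60,
                                  exposed_controls=20, unexposed_controls=80)
print(f"OR {or_:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

Note that the point estimate always lies inside an interval built this way, and an interval that excludes 1 corresponds to a statistically significant association at the chosen level. Published estimates are often adjusted for confounders, so they will generally differ from a crude calculation like this one.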

Selection of Cases

Sampling in a case-control study design begins with selecting the cases. In a case-control study, it is imperative that the investigator has explicitly defined inclusion and exclusion criteria prior to the selection of cases. For example, if the outcome is having a disease, specific diagnostic criteria, disease subtype, stage of disease, or degree of severity should be defined. Such criteria ensure that all the cases are homogeneous. Second, cases may be selected from a variety of sources, including hospital patients, clinic patients, or community subjects. Many communities maintain registries of patients with certain diseases, which can serve as a valuable source of cases. However, despite the methodologic convenience of this method, validity issues may arise. For example, if cases are selected from one hospital, identified risk factors may be unique to that single hospital. This methodological choice may weaken the generalizability of the study findings. Another example is choosing cases from the hospital versus the community; cases from the hospital sample will most likely represent a more severe form of the disease than those in the community. 2 Finally, it is also important to select cases that are representative of cases in the target population to strengthen the study's external validity ( Figure 4 ). Potential reasons why only some cases from the original target population eventually filter through and become available as cases (study participants) for a case-control study are illustrated in Figure 5 .

Figure 5. Levels of Case Selection. Adapted from Ref 2.

Selection of Controls

Selecting the appropriate group of controls can be one of the most demanding aspects of a case-control study. An important principle is that controls should reflect the distribution of exposure in the source population that gave rise to the cases; in other words, both cases and controls should stem from the same source population. The investigator may also consider the control group to be an at-risk population, with the potential to develop the outcome. Because the validity of the study depends upon the comparability of these two groups, cases and controls should otherwise meet the same inclusion criteria in the study.

A case-control study design that exemplifies this methodological feature is by Chung and colleagues, who examined maternal cigarette smoking during pregnancy and the risk of newborns developing cleft lip/palate. 18 A salient feature of this study is the use of the 1996 U.S. Natality database, a population database, from which both cases and controls were selected. This database provides a large sample size to assess newborn development of cleft lip/palate (outcome), which has a reported incidence of 1 in 1000 live births, 19 and also enabled the investigators to choose controls (i.e., healthy newborns) that were representative of the general population, strengthening the study's external validity. A significant relationship between maternal cigarette smoking and cleft lip/palate in the newborn was reported in this study (adjusted OR 1.34, 95% CI 1.36-1.76). 18

Matching is a method used to ensure comparability between cases and controls; it reduces variability and systematic differences due to background variables that are not of interest to the investigator. 8 Each case is typically individually paired with a control subject with respect to the background variables. The exposure to the risk factor of interest is then compared between the cases and the controls. This matching strategy is called individual matching. Age, sex, and race are often used to match cases and controls because they are typically strong confounders of disease. 20 Confounders are variables associated with the risk factor and may potentially be a cause of the outcome. 8 Table 5 lists several advantages and disadvantages with a matching design.

Advantages and Disadvantages for Using a Matching Strategy
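Individual matching, as described above, can be illustrated with a simple greedy procedure. This is a minimal sketch under stated assumptions: matching on sex exactly and on age within a tolerance, one control per case, and a control used at most once. The `Subject` class, the `match_cases` function, the two-year age tolerance, and the sample subjects are all hypothetical; real studies may use other matching variables and more careful algorithms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subject:
    subject_id: str
    age: int
    sex: str


def match_cases(cases: list[Subject], controls: list[Subject],
                max_age_gap: int = 2) -> dict[str, Optional[str]]:
    """Greedy 1:1 matching on sex (exact) and age (within max_age_gap).

    Returns a mapping from case id to matched control id, or None when
    no eligible control remains for that case.
    """
    available = list(controls)  # copy so the caller's list is untouched
    pairs: dict[str, Optional[str]] = {}
    for case in cases:
        candidates = [c for c in available
                      if c.sex == case.sex
                      and abs(c.age - case.age) <= max_age_gap]
        if not candidates:
            pairs[case.subject_id] = None
            continue
        # Prefer the closest age; ties go to the earliest listed control.
        best = min(candidates, key=lambda c: abs(c.age - case.age))
        pairs[case.subject_id] = best.subject_id
        available.remove(best)  # each control is used at most once
    return pairs


# Hypothetical subjects for illustration only.
cases = [Subject("case1", 30, "F"), Subject("case2", 45, "M")]
controls = [Subject("ctl1", 31, "F"), Subject("ctl2", 60, "M"),
            Subject("ctl3", 29, "F")]
print(match_cases(cases, controls))
```

The unmatched case in this example reflects one practical cost of matching: the stricter the criteria, the more cases may be left without an eligible control.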

Multiple Controls

Investigations examining rare outcomes may have a limited number of cases to select from, whereas the source population from which controls can be selected is much larger. In such scenarios, the study may be able to provide more information if multiple controls per case are selected. This method increases the “statistical power” of the investigation by increasing the sample size. The precision of the findings may improve by having up to about three or four controls per case. 21 - 23
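The diminishing return from adding controls can be illustrated with a commonly cited approximation, which is not given in the source: with k controls per case, statistical efficiency relative to an unlimited supply of controls is roughly k / (k + 1). The sketch below assumes that approximation, which holds best when the association is near the null, and the function name is hypothetical.

```python
def relative_efficiency(controls_per_case: int) -> float:
    """Approximate efficiency of k controls per case, relative to
    having unlimited controls, using the k / (k + 1) approximation."""
    if controls_per_case < 1:
        raise ValueError("need at least one control per case")
    return controls_per_case / (controls_per_case + 1)


previous = 0.0
for k in range(1, 7):
    efficiency = relative_efficiency(k)
    print(f"{k} control(s) per case: efficiency {efficiency:.3f} "
          f"(gain {efficiency - previous:.3f})")
    previous = efficiency
```

Under this approximation, going from one to four controls per case raises relative efficiency from 0.5 to 0.8, while each further control adds only a few percentage points, which is consistent with the guidance of about three or four controls per case.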

Bias in Case-Control Studies

Evaluating exposure status can be the Achilles heel of case-control studies. Because information about exposure is typically collected by self-report, by interview, or from recorded information, it is susceptible to recall bias, to interviewer bias, or to gaps and inaccuracies in the recorded information, respectively. These biases decrease the internal validity of the investigation and should be carefully addressed and reduced in the study design. Recall bias occurs when cases and controls differ in how they recall and report past exposures. The common scenario is when a subject with disease (case) unconsciously recalls and reports an exposure with better clarity due to the disease experience. Interviewer bias occurs when the interviewer asks leading questions or has an inconsistent interview approach between cases and controls. A good study design will implement a standardized interview in a non-judgemental atmosphere with well-trained interviewers to reduce interviewer bias. 9

The STROBE Statement: The Strengthening the Reporting of Observational Studies in Epidemiology Statement

In 2004, the first meeting of the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) group took place in Bristol, UK. 24 The aim of the group was to establish guidelines on reporting observational research to improve the transparency of the methods, thereby facilitating the critical appraisal of a study's findings. A well-designed but poorly reported study is disadvantaged in contributing to the literature because the results and generalizability of the findings may be difficult to assess. Thus a 22-item checklist was generated to enhance the reporting of observational studies across disciplines. 25 , 26 This checklist is also located at the following website: www.strobe-statement.org . This statement is applicable to cohort studies, case-control studies, and cross-sectional studies. In fact, 18 of the checklist items are common to all three types of observational studies, and 4 items are specific to each of the 3 specific study designs. In an effort to provide specific guidance to go along with this checklist, an “explanation and elaboration” article was published for users to better appreciate each item on the checklist. 27 Plastic surgery investigators should peruse this checklist prior to designing their study and when they are writing up the report for publication. In fact, some journals now require authors to follow the STROBE Statement. A list of participating journals can be found on this website: http://www.strobe-statement.org./index.php?id=strobe-endorsement .

Due to the limitations in carrying out RCTs in surgical investigations, observational studies are becoming more popular to investigate the relationship between exposures, such as risk factors or surgical interventions, and outcomes, such as disease states or complications. It is important for the plastic surgery community to recognize that well-designed observational studies can provide valid results, so that investigators can both critically appraise and appropriately design observational studies to address important clinical research questions. The investigator planning an observational study can use the STROBE statement as a tool to outline key features of a study and can return to it at the end to enhance transparency in methodology reporting.


Supported in part by a Midcareer Investigator Award in Patient-Oriented Research (K24 AR053120) from the National Institute of Arthritis and Musculoskeletal and Skin Diseases (to Dr. Kevin C. Chung).

None of the authors has a financial interest in any of the products, devices, or drugs mentioned in this manuscript.




  1. Innovative research methods for studying treatments for rare diseases

    Rare diseases comprise a heterogeneous set of conditions that afflict various organ systems, have wide ranging prognoses, and even vary along a gradient of rareness. Many barriers exist to advancing knowledge of and treatment options for rare diseases. 4 The small patient populations can dampen commercial interest in development of treatments.

  2. Case Control Studies

    Advantages There are many advantages to case-control studies. First, the case-control approach allows for the study of rare diseases. If a disease occurs very infrequently, one would have to follow a large group of people for a long period of time to accrue enough incident cases to study.

  3. Reducing selection bias in case-control studies from rare disease

    The objective of the study was to demonstrate the utility of case-control matching and the risk-set method in order to control bias in data from a rare disease registry. Data from the International Collaborative Gaucher Group (ICGG) Gaucher Registry were used as an example. Methods

  4. Small Data Challenges of Studying Rare Diseases

    From the perspective of study design, researchers investigating rare diseases have many options, including crossover and adaptive trials. 6 For observational studies, Whicher et al 7 list self-controlled study designs, case-control designs, and prospective inception cohorts as potential designs suitable for rare disease research.

  5. A Practical Overview of Case-Control Studies in Clinical Practice

    Case-Control Study Subtypes The case-control study can be subcategorized into four different subtypes based on how the control group is selected and when the cases develop the disease of interest as described in the following sections. Nested Case-Control Study

  6. An overview of the impact of rare disease characteristics on research

    Our objectives were to: 1. identify algorithms for matching study design to rare disease attributes and the methodological approaches applicable to these algorithms; 2. draw inferences on how research communities and infrastructure can contribute to the efficiency of research on rare diseases; and 3. to describe methodological approaches in the ...

  7. PDF Case-control studies: an efficient study design

    aetiology of a disease or condition (i.e. an outcome). Case-control studies are particularly useful for studying the cause of an outcome that is rare and for studying the effects of prolonged ...

  8. Case-Control Studies

    Abstract. Case-control studies are a specialized type of observational study design ideally suited for evaluating rare diseases and those with a long latency period. Case-control studies begin by targeting people who have and do not have a disease or condition of interest and then work backward to determine associations with previous exposures.

  9. Case-Control Study

    Case-control studies are ideal for the study of rare disease or conditions that are slow to evolve, as they permit the assembly of a group of cases of appropriate size for analysis, without requiring an extremely large study population. This presents an important advantage as it reduces the cost and time necessary for the study of such ...

  10. Case Control

    A study that compares patients who have a disease or outcome of interest (cases) with patients who do not have the disease or outcome (controls), and looks back retrospectively to compare how frequently the exposure to a risk factor is present in each group to determine the relationship between the risk factor and the disease.

  11. Case Control Studies

    Advantages There are many advantages to case-control studies. First, the case-control approach allows for the study of rare diseases. If a disease occurs very infrequently, one would have to follow a large group of people for a long period of time to accrue enough incident cases to study.

  12. On the need for the rare disease assumption in case-control studies

    Abstract. The conditions under which matched and unmatched odds ratios are consistent estimators of the incidence-density ratio in case-control studies are examined. Under "incidence-density" sampling, in which controls are selected from those at risk at the time of onset of each case, the matched estimator is shown to be consistent.

  13. Case-control designs in the study of common diseases: updates on the

    Rodrigues L, Kirkwood BR. Case-control designs in the study of common diseases: updates on the demise of the rare disease assumption and the choice of sampling scheme for controls. Int J Epidemiol. 1990 Mar;19(1):205-13. doi: 10.1093/ije/19.1.205.

  14. Epidemiology in Practice: Case-Control Studies

    Since case-control studies start with people known to have the outcome (rather than starting with a population free of disease and waiting to see who develops it) it is possible to enroll a sufficient number of patients with a rare disease.

  15. Case-control studies

    • Cumulative ("epidemic") case-control studies: the odds ratio will approximate the rate ratio if the proportion diseased in each exposure group is low (< 20%) and remains steady during the study period.
    When is the rare disease assumption needed?
    • Cumulative-based sampling, if the aim is to approximate the relative risk.

  16. Case-control study

    The case-control study design is often used in the study of rare diseases or as a preliminary study where little is known about the association between the risk factor and disease of interest. [8] Compared to prospective cohort studies they tend to be less costly and shorter in duration.

  17. Reducing selection bias in case-control studies from rare disease

    The objective of the study was to demonstrate the utility of case-control matching and the risk-set method in order to control bias in data from a rare disease registry. Data from the International Collaborative Gaucher Group (ICGG) Gaucher Registry were used as an example. Methods

  18. Case-Control Studies

    1.1 A Brief History. The case-control study examines the association between disease and potential risk factors by taking separate samples of diseased cases and of controls at risk of developing disease. Information may be collected for both cases and controls on genetic, social, behavioral, environmental, or other determinants of disease risk.

  19. Reducing selection bias in case-control studies from rare disease

    Background: In clinical research of rare diseases, where small patient numbers and disease heterogeneity limit study design options, registries are a valuable resource for demographic and outcome information. However, in contrast to prospective, randomized clinical trials, the observational design of registries is prone to introduce selection bias and negatively impact the validity of data ...

  20. A Practical Overview of Case-Control Studies in Clinical Practice

    Case-control studies are particularly appropriate for studying disease outbreaks, rare diseases, or outcomes of interest. This article describes several types of case-control designs, with simple graphical displays to help understand their differences.

  21. Case-control study in medical research: Uses and limitations

    A case-control study is a type of medical research investigation often used to help determine the cause of a disease, particularly when investigating a disease outbreak or rare...

  22. Case-Control Studies

    A case-control study is a better way of studying rare diseases because a very large cohort would be required to demonstrate an excess of a rare disease. In contrast, a case-control study is an inefficient way of assessing the effect of an uncommon exposure, when it might be possible to conduct a cohort study of all those exposed.

  23. Prostate cancer genetic risk and associated aggressive disease ...

    African ancestry is a significant risk factor for prostate cancer and advanced disease. Yet, genetic studies have largely been conducted outside the context of Sub-Saharan Africa, identifying 278 ...

  24. Observational Studies: Cohort and Case-Control Studies

    Cohort studies and case-control studies are two primary types of observational studies that aid in evaluating associations between diseases and exposures. In this review article, we describe these study designs, discuss methodological issues, and provide examples from the plastic surgery literature.
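
Several of the snippets above refer to the rare disease assumption, under which the odds ratio from cumulative sampling approximates the relative risk. The following is a minimal sketch of that arithmetic, not taken from any of the listed sources. The function names and all counts are hypothetical and chosen only for illustration.

```python
def risk_ratio(a, b, c, d):
    """Relative risk from a 2x2 table.

    a: exposed with disease,   b: exposed without disease
    c: unexposed with disease, d: unexposed without disease
    """
    return (a / (a + b)) / (c / (c + d))


def odds_ratio(a, b, c, d):
    """Odds ratio (cross-product ratio) from the same 2x2 table."""
    return (a * d) / (b * c)


# Hypothetical populations (assumed counts, for illustration only).
rare = (20, 9980, 10, 9990)        # risks of 0.2% vs 0.1%
common = (4000, 6000, 2000, 8000)  # risks of 40% vs 20%

for label, table in (("rare", rare), ("common", common)):
    print(f"{label}: RR={risk_ratio(*table):.2f} OR={odds_ratio(*table):.2f}")
```

With these assumed counts the relative risk is 2 in both populations. The odds ratio stays close to 2 when the disease is rare but rises to about 2.67 when it is common, which is why the assumption matters for cumulative sampling.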