The Nature of Evidence, Standards of Proof, and Certainty
The challenge of producing actionable evidence does not absolve us of the need for evidence-based practice.
This essay is something of an addendum to the most recent essay on the Cass Review and the postmodernisation of science. Following the Review’s conclusion that the evidence in support of the “affirmative care” model of paediatric gender medicine is “remarkably weak”, some clinicians and academics have responded by questioning the evidential bar for treating children and adolescents with gender dysphoria. A reader’s comment posed an important question related to such responses: is seeking “high quality” evidence feasible and/or ethical in this area of research and medicine?
Watching responses from advocates of the “affirmative care” model has been rather mindblowing. The ideological chokehold they’ve so willingly placed themselves in is utterly anathema to any formal training in medicine or science. The complete lack of clarity and understanding of evidence and standards of proof among healthcare professionals is stunning, and a testament to how ideological views have triumphed over what should be a dispassionate task of assessing the safety, efficacy, and effectiveness, of interventions. This applies to any intervention, and as stated in the previous essay, the social cause to which research relates should not be allowed to mislead the evidential assessment of that research.
Such responses have not been confined to clinicians. A British political party, the Liberal Democrats, issued an official statement expressing their “disappointment at the review’s lack of evidence…” The apparent lack of irony in this statement is difficult to fathom; the review’s “lack of evidence” is precisely because there is a lack of evidence. They went on to label the conclusions of the review as “undeniable transphobia towards young adults and people”, deploying the common tactic of dishonestly shifting questions of medical and scientific evidence to a question of “phobia” against this population group. This is akin to suggesting that evidence supporting a carcinogenic effect of processed meats is “phobic” of people who consume processed meat.
In this essay, we’ll consider the question posed above in more detail: what is “high quality” or “low quality” evidence, and what level of evidence is required to justify an intervention or policy? To consider this, we need to return to some first principles.
The Nature of Evidence, Proof, and Certainty
Three concepts intertwine in any scientific evaluation of evidence for or against a given intervention or policy: what we mean by “evidence” itself, what thresholds for evidence may justify the intervention/policy (i.e., does the evidence need to be “proven”), and the relationship between evidence and the spectrum of uncertainty-certainty. I’ll endeavour to illustrate these interactions by reference to examples from existing research and its application to public health.
The term “evidence” may be given a straightforward conceptual definition: that of any facts or data that support an inference in favour of, or against, a given theory or hypothesis being tested. Simply put, “evidence” is information upon which to support a conclusion. Evidence may thus exist in many forms, and what evidence is required and acceptable will be relative to the nature of the question. For example, whether statins lower cardiovascular disease risk, folic acid supplementation prevents neural tube defects, or vaccinations reduce measles incidence, all pose different evidential questions. Answering questions such as these requires a quantitative evaluation of efficacy, effectiveness, and safety, and requires that we attempt to draw on the best available evidence, methodologically.
However, the “best available evidence” does not necessarily mean “highest quality evidence”, because the available evidence may be constrained by practical, methodological, ethical, and/or safety issues. It is important to note that “best available evidence” is a qualified concept, not an absolute one. Evidence synthesis in science relies on inclusion and exclusion criteria that attempt to evaluate the available evidence according to the “highest quality” of evidence available, which is always relative to the types of research designs available and the strength of the body of evidence. For health and medical sciences, the “quality” of evidence has been framed as a pyramid (below), where the hierarchy reflects an assumption of increasing “internal validity”, i.e., the degree to which a study design may minimise potential sources of bias.
It is important to stress the word assumption; a poorly controlled and conducted RCT is not “higher quality” evidence than a well-designed and conducted prospective cohort study, even though the default position of an RCT sits above cohorts on the pyramid. Thus, evidence assessment relies on appraising individual studies on their methodological merits, not making unwarranted assumptions as to their evidential probity based on their default position on the hierarchy of evidence. The use of hormone replacement therapy (HRT) and coronary heart disease (CHD) risk in postmenopausal women offers a cautionary example of the perils of this intellectual indolence. Observational research suggested a lower risk of CHD in postmenopausal women on HRT1, which was contradicted by a subsequent RCT showing an increase in cardiovascular disease (CVD) risk.2 Later re-analysis reconciled the discrepancy in the respective findings, demonstrating that when factoring in the timing of HRT initiation relative to the time since the onset of menopause, both the observational and RCT designs yielded similar results.3,4
A more refined way to think about the hierarchy of evidence was proposed in 2016 by Murad et al.5, illustrated by the figure below. In (A) below, the traditional static view of the evidential pyramid is shown; in (B), the wavy lines merge each level of the pyramid either higher or lower, to reflect the fact that a study can be rated up or down based on its methodological quality, but (B) still contained meta-analysis and systematic reviews on top, an assumption of superiority which is often not warranted due to issues with included studies, e.g., “bad input equals bad output”. In (C), the individual study designs were placed under the microscope of systematic reviews and meta-analysis, representing that evidence synthesis is a lens through which to assess the body of evidence available from those respective lines of evidence.
Thus, while “high quality” evidence is desirable, it isn’t always available. Depending on the question being addressed, the “best available evidence” may be a synthesis of prospective cohorts and case-control studies, i.e., it is possible, and indeed common, that “best available evidence” is a body of evidence that does not include randomised controlled trials (RCTs). Given that we are often left to synthesise evidence that does not entirely consist of highly powered, double-blind, placebo-controlled, RCTs, how do we then make decisions with what may be, on a cursory reference to the hierarchy of evidence, a “lower quality” of evidence?
This is where we need to consider evidence in the context of standards of proof. While the nature of evidence is conceptually simple, the nature of proof is a conceptual conundrum. Applying the strictest philosophical sense of the word “proof”, we would be confined to Hume’s conclusion that proof is an empirical impossibility6. This is clearly unsatisfactory for the need to make individual clinical and population-wide public health decisions based on the available body of scientific evidence. Because the level of evidence required to make decisions differs relative to the nature (e.g., the risk-to-benefit ratio) of the question we are seeking to answer, proof itself exists on a continuum where different levels of evidence may satisfy different questions and purposes. This continuum is the “standard of proof”, evaluating available evidence relative to the “degree of necessary persuasion”7 required to be satisfied that there is a sufficiency of evidence to support a given conclusion.
In this context, what is “sufficient” depends on the research question being asked, the related risk-to-benefit consideration of that question, the knowledge already acquired on that question, what understanding of the relationship between an intervention/exposure and outcome is being sought, and whether a new theory is being developed or an established line of inquiry further expanded. Conclusions in health and medical sciences are unlikely to be adduced from any one study design alone. Rather, the persuasiveness of the evidence base tends to encompass a synthesis of different lines of research that informs the evidential assessment of a given question. The place of a study design in the assessment of evidence is relative, reflecting the fact that standards of proof exist on a spectrum. This is where it becomes crucial to distinguish the hierarchy of evidence from standards of proof. The weight attributed to evidence should depend, irrespective of source (i.e., design), on its relevance and persuasiveness, which are more important criteria than a preordained place in an arbitrary hierarchy. Three examples will help us to illustrate these concepts: cigarette smoking and cancer, sleeping in the prone position and Sudden Infant Death Syndrome (SIDS), and folic acid (synthetic vitamin B9) and neural tube defects.
The case of cigarette smoking is a well-known example of these principles of evidence assessment in action. For ethical reasons, there was never an RCT randomly assigning participants to smoke 20 cigarettes a day with lung cancer as an endpoint. How, then, have we concluded that smoking is causative of cancers? A deterministic causal conclusion would have required the identification of the specific agent within cigarettes causative of cancer. However, considering cigarette smoking through a probabilistic causal framework viewed the cigarette itself as, if not a “direct” cause of lung cancer, the most relevant exposure for public health interventions. The original synthesis of evidence in the 1964 U.S. Surgeon General’s report utilised probabilistic causal concepts based on five criteria – consistency and strength of association, specificity, temporality (i.e., exposure precedes the outcome), and coherence (i.e., with other lines of evidence) – taken together with experimental evidence, that would satisfy the conclusion that smoking was causative of lung cancer in men.
Those criteria, along with others (i.e., plausibility, reversibility, biological gradient), would form the basis of Sir Austin Bradford Hill’s criteria for causal inference from observational research, a probabilistic framework predicated upon the question of whether a change in exposure influences the incidence of disease8. The continued updating of available evidence over the 50 years following the 1964 Surgeon General’s Report culminated in a comprehensive body of evidence supporting the conclusion that exposure to tobacco smoke was causative of cancers and numerous other disease outcomes. From the perspective of standards of proof, the continued accumulation of evidence against tobacco smoke allowed for the degree of necessary persuasion to proceed from a probabilistic causal inference to a conclusive causal conviction.
So, despite the fact that RCTs on this issue were ethically unfeasible, we could be confident in determining that “high quality” evidence exists that smoking causes cancer, based on multiple converging lines of evidence. There is also little to no risk related to giving up smoking cigarettes and obvious benefits. The evidence thus justified the implementation of public health strategies that have successfully decreased smoking prevalence in the population, and related risk.
The second example is the high incidence of sudden infant death syndrome (SIDS) in New Zealand in the 1980s, which precipitated the initiation of the New Zealand Cot Death Study, a prospective case-control study.9 The 1-year results indicated a significant 3.5-fold higher odds of SIDS associated with sleeping in the prone position, confirmed by a 4-fold higher odds at the 3-year follow-up, in 370 cases of infants who died of SIDS and 1,559 controls. The fact that the effects of sleeping in the prone position were not significantly different between risk stratifications, e.g., socioeconomic status, indicated that targeting this risk factor would be amenable to a whole-population approach.10 These findings prompted the National Cot Death Prevention Programme, a public health education policy targeting a reduction in the prevalence of sleeping in the prone position. Again, recall that this was based on findings from a case-control study. During the first 6 months of the programme, the SIDS mortality rate was 43% lower than in the same period over the previous 3 years.
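To make the "3.5-fold" and "4-fold higher odds" concrete, here is a minimal sketch of how an odds ratio and an approximate 95% confidence interval are derived from a case-control 2x2 table. The counts are hypothetical and chosen only so the arithmetic is easy to follow; they are not the New Zealand Cot Death Study data, and the function name is an illustrative assumption.

```python
import math


def odds_ratio(cases_exposed, cases_unexposed, controls_exposed, controls_unexposed):
    """Return (odds ratio, lower, upper) using the Woolf log-based 95% interval."""
    # Cross-product ratio of the 2x2 table.
    ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)
    # Standard error of the log odds ratio.
    se_log = math.sqrt(
        1 / cases_exposed
        + 1 / cases_unexposed
        + 1 / controls_exposed
        + 1 / controls_unexposed
    )
    z = 1.96  # normal quantile for a two-sided 95% interval
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts (NOT study data): 80 of 100 cases exposed, 50 of 100 controls exposed.
ratio, lower, upper = odds_ratio(80, 20, 50, 50)
print(f"OR = {ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

An interval that sits well above 1, as it does with these illustrative counts, is the kind of large and clear signal that the essay argues can justify a low-risk recommendation even from a design lower on the pyramid.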
In the SIDS example, a case-control study provided the “best available evidence” at the time; it provided a degree of necessary persuasion sufficient to determine that sleeping in the prone position was a modifiable risk factor strongly associated with mortality from SIDS. Considering the ethical inability to conduct an intervention trial (one could hardly randomise babies to sleep in the prone position) and the importance of action before awaiting outcomes of longer-term prospective studies, the available data satisfied a lower standard of proof sufficient to implement a low-risk population-wide approach that successfully reduced incidence of SIDS by modifying sleeping position.
Importantly, however, evidence of a causal relationship was not required to implement policies to improve public health. This illustrates an important distinction between standards of proof and determinations of causality; the former can be satisfied without recourse to the latter. Thus, standards of proof provide a threshold to determine the sufficiency of evidence for a given question that is distinct from the language of causality, instead emphasising the gravity of the question being addressed and the rigour of the tests that this question has survived.11
The SIDS example is very much a point in principle, as it is rare that there would be only one or two lines of evidence upon which to base recommendations. An example of more thorough persuasiveness from a body of evidence, including epidemiology, human intervention trials, and experimental research, is our final example of vitamin B9/folic acid and neural tube defects (NTDs). The evidence supporting a causal relationship between vitamin B9 deficiency and NTDs included a convergence of case-control and prospective studies on periconceptional multivitamin use and NTD risk12, case-control biomarker studies on red blood cell (RBC) folate status and risk of NTD13, RCTs of folic acid supplementation reducing NTD incidence14, and RCTs on effective strategies to increase RBC folate levels.15 This example is illustrative of a fundamental theme of this essay: that establishing proof is a process. The place of a study design in the assessment of evidence is relative, and all evidence may contribute some probative value.
These examples introduce the final crucial concept, on top of evidence and standards of proof, which is the relationship between two axes: risk-benefit and uncertainty-certainty. As a matrix, we can distil the interaction into four categories:
Low risk and high certainty.
Low risk and low certainty.
High risk and low certainty.
High risk and high certainty.
The intersection between these may be described as follows: where there is increasing potential for risk with any intervention or policy, there is an increasing requirement for certainty in the available evidence. Conversely, where there is a low potential for risk, there is more margin for uncertainty. For example, the fortification of foods with certain vitamins has a relatively high certainty of preventing deficiencies in the population, but there is also a low risk of any adverse effects of those added nutrients in the food supply. Conversely, drugs and surgical interventions both carry a much higher risk of potential adverse effects and therefore a more onerous requirement for certainty of their safety, efficacy, and effectiveness in their use.
In the smoking example, the evidence is both high certainty and low risk, an ideal evidential position upon which to base policy interventions and recommendations. In the SIDS example, the evidence was low certainty and low risk; it was the risk aspect that justified the National Cot Death Prevention Programme, as there is a low risk of recommending that children are not placed in a prone position to sleep. Thus, the uncertainty was mitigated by the low risk. In the NTDs example, there was initially lower certainty evidence that the UK Medical Research Council RCT results cemented into high certainty evidence, and a low risk to both food fortification with folic acid and specific recommendations for women to supplement folic acid preconception.
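The four-cell matrix can be written down as a simple lookup. This is only a sketch of one reading of the argument above: the `Level` enum, the `POSTURE` table, and the wording of each cell are illustrative assumptions, not a formal decision rule from any guideline.

```python
from enum import Enum


class Level(Enum):
    LOW = "low"
    HIGH = "high"


# Keys are (risk, certainty). The text in each cell paraphrases the essay's
# argument and is an assumption made for illustration only.
POSTURE = {
    (Level.LOW, Level.HIGH): "ideal position on which to base policy (e.g., smoking cessation)",
    (Level.LOW, Level.LOW): "uncertainty mitigated by low risk; action may be justified (e.g., SIDS)",
    (Level.HIGH, Level.HIGH): "the more onerous requirement for certainty has been met",
    (Level.HIGH, Level.LOW): "caution; greater certainty is required before acting",
}


def evidential_posture(risk: Level, certainty: Level) -> str:
    """Look up the posture for a given combination of risk and certainty."""
    return POSTURE[(risk, certainty)]


print(evidential_posture(Level.LOW, Level.LOW))
```

The point of the table is that the same level of certainty leads to different postures depending on the risk attached to acting.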
Evidence, Standards of Proof, and Risk-Benefit in the Paediatric Gender Medicine Context
The paediatric gender medicine question meets almost none of these criteria. The interventions are high risk, not low. There is uncertainty over the potential benefit relative to the risk of adverse effects. There is nothing equivalent to the SIDS example, i.e., an enormous effect size from lower-quality evidence that justifies a low-risk policy recommendation. On the contrary, the low quality of evidence available is compounded by the fact that there is disconfirming evidence of core tenets of the “affirmative care” model (e.g., “puberty blockers are just pressing the pause button and totally reversible…”). This is why the pushback on the lack of evidence in favour of this model of care is startling. Perhaps tellingly, much of the pushback is non-evidential, instead mounting the usual arguments on theoretical (in the postmodern sense of “Theory”) grounds that are as spurious as they are tangential to the question of what model of care is justified by evidence in a medical context.
Before considering some of the more evidential claims, it is important to return to a point made in the previous essay, which is the lack of hypothesis-driven direction in this field. This is crucial, because science is a forward-moving iterative process that requires theories, and the veracity of their assumptions, to be testable. Postmodern paediatric gender medicine has proceeded with a series of assumptions that the field, i.e., the researchers and clinicians committed to the “affirmative care” model, already believe with absolute certitude are “true”. This is why there is a lack of hypothesis-driven direction in the field; why test your theories when you already know them to be true? This is the precise opposite of the scientific method and the evidence-based medical model. And this type of epistemic arrogance and its attendant intellectual dishonesty has a dark history in science and medicine.
The problem with the assumptions that practitioners of postmodern paediatric gender medicine hold to be true is that they are anything but proven; they are highly contestable and in several cases open to disproof.16,17 Levine and Abbruzzese18 set out ten assumptions used to justify the “affirmative care” model that have been represented by its adherents as facts:
1. The emergence of a trans identity is the result of reaching a higher level of self-awareness.
2. Whether the trans-identity emerges in very young children, older children, teens, or mature adults, it is authentic and will be lifelong.
3. All gender identity variations are biologically determined and inherently healthy.
4. The frequently co-occurring psychiatric symptoms are a direct result of gender incongruence (the so-called “minority distress” model).
5. The only way to relieve, or prevent, psychiatric problems is to alter the body at the earliest signs of puberty.
6. Psychological evaluations and attempts to address psychiatric comorbidities should only be used to support transition.
7. Attempts to resolve gender dysphoria with psychotherapy range from ineffective to harmful.
8. Gender-dysphoric youth must have unquestioning social, hormonal, and surgical support for their current gender identities and desired physical appearance.
9. All individual embodiment goals, even those that do not occur in nature, must be fulfilled to the full extent technically possible.
10. Science has proven the benefits of early gender transition, and low rates of regret and detransition further validate the practice.
These are far from established, proven theories or statements; they are highly contested both in philosophy and medicine. Thus, before we even get to the question of what level of evidence would be feasible and ethical to achieve for the care of gender dysphoric youth, there is a fundamental issue that the very theoretical basis of the model is predicated upon flawed and contestable assumptions, several of which are falsifiable based on the totality of evidence (see Refs 16 and 17).
With that said, let’s consider some of the common responses from an evidential perspective, framed as a series of statements.
“It’s unreasonable/infeasible to expect high quality RCTs in this context.”
There are two problems with this statement. The first is that no one is expecting the kinds of mega-trials we would have for, e.g., lipid-lowering therapies for cardiovascular disease; that just isn’t going to happen because the population subgroup in question comprises ~0.1-0.2% of the population. So, trials in this area will probably always have low statistical power. This is not a licence to throw rigour out the window, which is precisely what activist-clinicians in this area want to do.
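To give a sense of what "low statistical power" means for a small trial, the sketch below uses a standard normal approximation for a two-arm comparison of means. The effect size, sample sizes, and the function name `approx_power` are hypothetical choices for illustration, not figures from any trial in this field.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-arm comparison of means.

    effect_size is the standardised mean difference (Cohen's d).
    Uses a normal approximation and ignores the negligible opposite tail.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the alternative hypothesis.
    noncentrality = effect_size * (n_per_arm / 2) ** 0.5
    return std_normal.cdf(noncentrality - z_crit)


# Hypothetical scenario: a moderate effect (d = 0.5) with 20 vs 64 participants per arm.
power_small = approx_power(0.5, 20)
power_larger = approx_power(0.5, 64)
print(f"n = 20 per arm: power ~ {power_small:.2f}")
print(f"n = 64 per arm: power ~ {power_larger:.2f}")
```

Under these assumptions the smaller trial would detect a true moderate effect only about a third of the time, which is why design rigour matters more, not less, when samples are necessarily small.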
The second problem with this statement is that not every RCT has to be a highly powered, double-blind, “hard endpoint” intervention to be considered high quality. As someone in a field (nutrition science) that is often browbeaten with what David Gorski of ScienceBasedMedicine termed “methodolatry”, i.e., “the profane worship of the randomized clinical trial as the only valid method of investigation”, I can empathise with this issue.
This problem tends to relate to other fields trying to emulate biomedical-style drug trials. However, several adaptive clinical trial design options could be utilised in a paediatric gender medicine setting, and contribute to a vastly improved evidence base. So-called “N-of-1” trials, in which a single patient is randomised to an order of different treatments and a placebo/control condition (i.e., a crossover design), could be particularly valuable in this area. N-of-1 trials are useful in instances where potentially heterogeneous responses to treatments exist, and have been used successfully to show, e.g., that cannabis extract may be beneficial for chronic pain management19, to compare sustained-release paracetamol with the non-steroidal anti-inflammatory medication celecoxib in individual patients with osteoarthritis20, and to show that self-reported side effects of statins are primarily “nocebo” effects.21
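The core mechanic of an N-of-1 trial, randomising the order of conditions within repeated cycles for a single patient, can be sketched in a few lines. The function name, the condition labels, and the number of cycles below are illustrative assumptions; a real protocol would also specify washout periods, blinding, and outcome measurement.

```python
import random


def n_of_1_schedule(conditions, n_cycles, seed=None):
    """Return a list of cycles, each a random ordering of all conditions.

    Every cycle contains each condition exactly once, so the patient
    serves as their own control across repeated crossover periods.
    """
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for _ in range(n_cycles):
        cycle = list(conditions)
        rng.shuffle(cycle)
        schedule.append(cycle)
    return schedule


# Hypothetical example: three cycles of treatment versus placebo for one patient.
schedule = n_of_1_schedule(["treatment", "placebo"], n_cycles=3, seed=1)
for number, cycle in enumerate(schedule, start=1):
    print(f"cycle {number}: {' -> '.join(cycle)}")
```

Randomising within each cycle, rather than across the whole sequence, keeps the conditions balanced over time, which helps separate treatment effects from drift in the patient's underlying state.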
Of course, it is important to bear in mind that the interventions proposed for paediatric gender medicine are, ultimately, drugs and surgeries, both of which are highly amenable to a traditional comparative trial. They would be small trials in a minuscule population subgroup, but small trials offer an opportunity for enhanced rigour. Other potential clinical trial designs, such as registry-based trials, could be utilised to examine longer-term follow-up outcome data.
“It’s unethical to do RCTs in this population group.”
This is a fairly incomprehensible argument, given the position the field finds itself in. What is unethical is proceeding to implement high-risk interventions with little certainty of supporting evidence of benefit. This is particularly so because, once medicalisation has been initiated, the requirement for treatment will be lifelong; this reason alone is sufficient to justify calls for greater certainty in the evidence for care. This is also probably why the “affirmative care” model is so attractive for American for-profit healthcare systems.
“Producing ‘high quality’ evidence isn’t feasible for this area.”
This particular statement is why I laboured over the previous section, to highlight various examples of different levels of evidence - different standards of proof - that have underpinned successful policies and interventions, from smoking reduction to pre-conception folic acid supplementation. However, let’s return to the SIDS example again, because it is the example where the least overall evidence, and methodological quality in that evidence, was available to base the policy recommendations on. The case-control study which first identified sleeping in the prone position as a risk factor was based on 370 cases of infants who died of SIDS and 1,559 controls.22 However, the effect sizes were very large, with 3 to 4-fold higher odds of mortality associated with that risk factor, and the subsequent decrease in mortality observed during the implementation of the national programme mirrored what was predicted from the population-attributable risk associated with sleeping in the prone position from the case-control study.
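The population-attributable risk mentioned above can be illustrated with Levin's formula, which estimates the fraction of cases in a population that would not occur if the exposure were removed. This is a sketch under stated assumptions: the exposure prevalence is hypothetical, and the odds ratio from a case-control study is treated as an approximation of the relative risk, which is reasonable only when the outcome is rare.

```python
def population_attributable_fraction(exposure_prevalence, relative_risk):
    """Levin's formula: PAF = p(RR - 1) / (1 + p(RR - 1))."""
    excess = exposure_prevalence * (relative_risk - 1)
    return excess / (1 + excess)


# Hypothetical inputs (NOT study data): 40% of the population exposed,
# and a 4-fold higher risk taken as an approximation from an odds ratio.
paf = population_attributable_fraction(0.4, 4.0)
print(f"Population-attributable fraction ~ {paf:.0%}")
```

When a large effect size is combined with a common exposure, the attributable fraction is substantial, which is what allows a prediction of the kind the essay describes to be checked against what happens after a programme is implemented.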
The issue facing paediatric gender medicine is that there is no equivalent signal in the noise from the low-quality research that defines this area. For example, suicide is often weaponised by activist-clinicians, with the claim that rapid medicalised “affirmation” lowers suicide risk and therefore constitutes an imperative that is worth the risk. However, there is no evidence of any large effect size in reductions in suicide or even in improved mental health post-transition (which has now been shown in the UK, Swedish, and Finnish systematic reviews, respectively). This not only shifts the evidential needle back towards “high risk, low certainty” and attendant caution, but it also falsifies assumptions No. 4 and 5 set out above.
Given the very clear overlap with gay and lesbian youths who experience transient gender dysphoria, the widespread epidemic of mental health issues in youth, with young girls particularly afflicted, and the differential diagnoses of other neurocognitive conditions, the precautionary principle strongly applies in the case of paediatric gender medicine.
It is absolutely possible to develop an evidence-based modality of care for gender dysphoric youth. This could range from small but well-controlled intervention trials utilising alternative clinical trial designs, through longer-term prospective cohorts with repeated measures of physiological and psychological health markers, to case-control studies on important questions such as bone mineral density with puberty blockade drugs, cardiovascular risk with the use of cross-sex hormones, and mental health functioning.
The fact that the activist-clinicians in this area seem to so strenuously oppose this should be a flashing red light to any paediatrician, endocrinologist, or researcher with a shred of medical and scientific integrity.
1. Grodstein F, Stampfer MJ, Manson JE, et al. Postmenopausal estrogen and progestin use and the risk of cardiovascular disease. N Engl J Med 1996;335:453–61.
2. Rossouw JE, Anderson GL, Prentice RL, et al. Risks and benefits of estrogen plus progestin in healthy postmenopausal women: principal results from the women's health initiative randomized controlled trial. JAMA 2002;288:321–33.
3. Prentice RL, Langer R, Stefanick ML, et al. Combined postmenopausal hormone therapy and cardiovascular disease: toward resolving the discrepancy between observational studies and the women’s health initiative clinical trial. Am J Epidemiol 2005;162:404–14.
4. Hernán MA, Alonso A, Logan R, et al. Observational studies analyzed like randomized experiments: an application to postmenopausal hormone therapy and coronary heart disease. Epidemiology 2008;19:766–79.
5. Murad MH, Asi N, Alsawas M, Alahdab F. New evidence pyramid. Evid Based Med. 2016 Aug;21(4):125-7.
6. Hume D. A Treatise of Human Nature. 2nd ed. Selby-Bigge LA, editor. Vols. I–III. London: Clarendon Press; 1739. 1–736.
7. Clermont KM, Sherwin E. A Comparative View of Standards of Proof. Am J Comp Law. 2002;50(2):243–75.
8. Hill AB. The Environment and Disease: Association or Causation? J R Soc Med. 1965;58(5):295–300.
9. Mitchell EA, Taylor BJ, Ford RPK, Stewart AW, Becroft DMO, Thompson JMD, et al. Four modifiable and other major risk factors for cot death: The New Zealand study. J Paediatr Child Health. 1992 Aug;28(s1):S3–8.
10. Mitchell EA, Ford RPK, Taylor BJ, Stewart AW, Becroft DMO, Scragg R, et al. Further evidence supporting a causal relationship between prone sleeping position and SIDS. J Paediatr Child Health. 1992 Aug;28(s1):S9–12.
11. Loevinger L. Standards of Proof in Science and Law. Jurimetrics. 1992;32(3):323–44.
12. Milunsky A, Jick H, Jick SS, Bruell CJ, MacLaughlin DS, Rothman KJ, et al. Multivitamin/folic acid supplementation in early pregnancy reduces the prevalence of neural tube defects. JAMA. 1989 Nov 24;262(20):2847–52.
13. Daly LE, Kirke PN, Molloy A, Weir DG, Scott JM. Folate Levels and Neural Tube Defects. JAMA. 1995 Dec 6;274(21):1698.
14. MRC Vitamin Study Research Group. Prevention of neural tube defects: results of the Medical Research Council Vitamin Study. Lancet. 1991 Jul 20;338(8760):131–7.
15. Cuskelly GJ, McNulty H, Scott JM. Effect of increasing dietary folate on red-cell folate: implications for prevention of neural tube defects. The Lancet. 1996 Mar;347(9002):657–9.
16. Levine SB, Abbruzzese E. Current Concerns About Gender-Affirming Therapy in Adolescents. Curr Sex Health Rep. 2023;15:113-123.
17. Cohn J. Some Limitations of "Challenges in the Care of Transgender and Gender-Diverse Youth: An Endocrinologist's View". J Sex Marital Ther. 2023;49(6):599-615.
18. Ref. 16.
19. Notcutt W, Price M, Miller R, Newport S, Phillips C, Simmons S, Sansom C. Initial experiences with medicinal extracts of cannabis for chronic pain: results from 34 'N of 1' studies. Anaesthesia. 2004 May;59(5):440-52.
20. Yelland MJ, Nikles CJ, McNairn N, Del Mar CB, Schluter PJ, Brown RM. Celecoxib compared with sustained-release paracetamol for osteoarthritis: a series of n-of-1 trials. Rheumatology (Oxford). 2007 Jan;46(1):135-40.
21. Howard JP, Wood FA, Finegold JA, Nowbar AN, Thompson DM, Arnold AD, Rajkumar CA, Connolly S, Cegla J, Stride C, Sever P, Norton C, Thom SAM, Shun-Shin MJ, Francis DP. Side Effect Patterns in a Crossover Trial of Statin, Placebo, and No Treatment. J Am Coll Cardiol. 2021 Sep 21;78(12):1210-1222.
22. Ref. 9.