Symposium: Methods for evaluating medical tests and biomarkers, hosted by the Test Evaluation Research Group, University of Birmingham, UK, 19th and 20th July 2016
In this blog post, CIDG Editor Dr Karen Steingart discusses a recent symposium, 'Methods for evaluating medical tests and biomarkers'.
I have a special interest in systematic reviews that assess the diagnostic accuracy of tuberculosis (TB) tests. I have co-authored three Cochrane reviews which were the basis for developing World Health Organization (WHO) policy guidelines. It has been a privilege observing how evidence informs policy and policy offers potential for people to achieve healthier lives. Considering my interests and desire to keep abreast of the latest research on diagnostic tests, I was thrilled to attend the symposium, 'Methods for evaluating medical tests and biomarkers'. Around 100 people participated.
Evaluations of medical tests and biomarkers present distinct methodological challenges. To paraphrase Jon Deeks, Scientific Committee Chair and Professor of Biostatistics, Institute of Applied Health Research, University of Birmingham, the aim of the symposium was “to take participants on a bit of a journey, disseminate recent research findings, and consider the uncertainties and unanswered questions in the evaluation of medical tests for diagnosis and prognosis”. Abstracts from this symposium may be found in “Methods for Evaluating Medical Tests and Biomarkers. Symposium; 2016 July 19-20; Institute of Applied Health Research, University of Birmingham, Birmingham, UK”. In addition, all abstracts will be published in Diagnostic and Prognostic Research1, a newly launched open access journal.
The presentations were uniformly excellent and it was hard to choose which ones to include in this synopsis. Unless otherwise noted, below I have summarized, in my own words, several presentations that I found particularly germane to systematic reviews of diagnostic test accuracy (DTA). I have added my unbridled comments (in italics), which I hope will make the technical aspects more understandable.
Developing Plain Language Summaries for DTA Reviews
Whiting P, Davenport C, Leeflang M, Gopalakrishna G, de Salis I
The Plain Language Summary (PLS) is critical to ensure that DTA systematic reviews are understandable to clinicians, policy makers, and consumers who may be unfamiliar with the highly technical content of these reviews. Challenges to writing a PLS for DTA reviews include the review methodology (Who can easily explain what is meant by a ‘hierarchical random-effects model’?); diagnostic accuracy measures like sensitivity and specificity (Zhelev and colleagues2 queried healthcare decision makers about DTA systematic reviews. The authors reported that, “Participants could not recall the definitions for sensitivity and specificity and spontaneously commented that, since they did not come across these terms very often, they always found them confusing and every time had to look them up.”); and heterogeneity (see below). There have been prior drafts offering guidance for writing the PLS for DTA reviews. The aim of this project is to develop a template and additional guidance.
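For readers who, like the decision makers Zhelev and colleagues interviewed, have to look these terms up every time: sensitivity and specificity fall straight out of a 2×2 table of index test results against the reference standard. A quick sketch (my own illustration, with invented counts):

```python
# Hypothetical 2x2 table: index test result vs. reference standard
tp, fp = 90, 30   # test positive: with disease / without disease
fn, tn = 10, 870  # test negative: with disease / without disease

sensitivity = tp / (tp + fn)  # proportion of diseased people the test detects
specificity = tn / (tn + fp)  # proportion of non-diseased people the test rules out

print(f"sensitivity = {sensitivity:.2f}")  # 0.90
print(f"specificity = {specificity:.2f}")  # 0.97
```

A test can score well on one measure and poorly on the other, which is one reason a PLS that reports only a single number can mislead.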
The authors are implementing the project in four stages: focus groups, user testing, a web-based survey, and templates developed from the first three stages. The authors distributed a draft PLS for The Informant Questionnaire on Cognitive Decline in the Elderly (IQCODE), a screening tool for diagnosing dementia. As symposium participants, we had an opportunity to vote in real time on our preferences for writing the PLS for IQCODE.
The authors plan to incorporate feedback from participant voting into the next version of the PLS.
In DTA reviews, heterogeneity (i.e. between-study differences in results) may occur due to differences in study design or conduct, participants, index tests, reference standards, and positivity thresholds, as well as from errors in determining accuracy measures or from chance. The I² statistic (used for measuring heterogeneity in systematic reviews of interventions) is not recommended for Cochrane DTA reviews, as it does not account for heterogeneity explained by phenomena such as positivity threshold effects. Instead, the prediction region is a way of illustrating the extent of heterogeneity by depicting a region within which we may expect the results of a future study to lie3.
User testing of Test-Treatment Pathway derivation to help formulating focused diagnostic questions
Gopalakrishna G, Langendam M, Scholten R, Bossuyt P, Leeflang M
The test-treatment pathway is a framework for linking diagnostic accuracy to patient health outcomes. The pathway depicts where a new test is positioned relative to an existing test(s). However, there is little practical guidance on how to describe this pathway.
The authors started with the PICO elements (participants, index test, comparator, outcome). They then brainstormed a set of triggering questions, which they refined after three rounds of user testing. With the final set of questions, the authors interviewed users and obtained their feedback on different test-treatment model pathways. The authors recorded and summarized the sessions.
Users agreed that a framework or model describing the existing test-treatment pathway and where to position the new test would be useful in ensuring the formulation of the right diagnostic questions. However, they had no clear preference for a model. Users would also like to see an electronic tool that can help them describe the pathway.
The authors noted that deriving a pathway remains challenging and that further development of a model is needed.
In the literature, one finds different descriptions and terminology for testing pathways. As in this abstract, di Ruffano and colleagues described the ‘test-treatment pathway’ and classified it into five major categories (test delivery, test results, diagnostic decision, treatment decision, treatment implementation), with key steps in each category4. Bossuyt and colleagues described the ‘diagnostic pathway’ in which they elaborated on the role of the new test as being replacement, triage, or add-on5. RevMan for DTA reviews includes a section called ‘clinical pathway,’ with three optional subheadings: prior tests, the role of tests, and alternative tests. The US Preventive Services Task Force and Agency for Healthcare Research and Quality (AHRQ) refer to the ‘analytic framework:’ “Two key components of the analytic framework are: (1) a typology for describing the context in which the test is to be used, and (2) some form of visual representation of the relationship between the application of the test or treatment and the outcomes of importance for decision making,”6 and the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach refers to the ‘analytical PICO framework’7. Common to all these schemes is the importance of placing the new test to be evaluated in a clinical context and linking test results to patient management decisions and important downstream health consequences. For a systematic review of diagnostic test evaluation frameworks, see Lijmer et al 20098.
Using machine learning and crowdsourcing for the identification of diagnostic test accuracy
Noel-Storr A, Thomas J, Marshall I, Wallace B
Poor reporting, dissimilar indexing, and an increasing number of studies make it difficult to identify studies of diagnostic test accuracy (DTA). Search filters are not recommended9. A crowd model has been successfully applied to the identification of randomized controlled trials (RCTs) in Cochrane’s EMBASE screening project, and data generated from this project are being used for machine learning10. This approach is now being extended to identify DTA studies.
The authors created a reference standard dataset (n = 1120) consisting of DTA and non-DTA studies. They evaluated two machine-learning methods (active learning and binary classifier). The outcomes of interest were recall and precision. (The ideal search would achieve high recall and high precision, but often there is a trade-off between the two.)
The active learning process achieved 95% recall after 30% of the articles were manually screened and increased to 100% recall after 77% were manually screened. The binary classifier approach achieved 95% recall and 40% precision and increased to 100% recall, but with a decrease in precision to 13%.
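To make the recall/precision trade-off concrete, both measures are simple ratios over screening counts. The numbers below are invented for illustration and are not the study's data:

```python
# Hypothetical screening counts (invented, not the study's actual data):
# of 1120 records, suppose 120 are true DTA studies.
relevant_found = 114    # true DTA studies that the classifier flagged
relevant_total = 120    # all true DTA studies in the dataset
flagged_total = 285     # all records the classifier flagged

recall = relevant_found / relevant_total    # share of DTA studies caught
precision = relevant_found / flagged_total  # share of flagged records that are DTA

print(f"recall = {recall:.2f}, precision = {precision:.2f}")  # recall = 0.95, precision = 0.40
```

For systematic reviews, recall is the measure that cannot be sacrificed (a missed study is gone for good), which is why the authors report precision falling as recall is pushed to 100%.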
The authors considered these preliminary findings to be promising. They added that, if a reference standard dataset could be created through the crowd (as it has been for Cochrane’s EMBASE project), then a much larger human-generated reference dataset would be available to improve the accuracy of machine learning.
This was a fascinating presentation. If the process works, it would make screening DTA studies faster and more efficient. For those (including myself) who have screened > 20,000 citations in a given review, this process would be a gift for sore eyes (literally).
Estimates of excess cancer incidence and cancer deaths avoided in Great Britain from 1980-2012: the potential for overdiagnosis
Oke J, O’Sullivan J, Perera R, Nicholson B
The authors defined overdiagnosis as “the detection of disease that will not progress to cause symptoms or premature death.” There is no single generally accepted method for measuring overdiagnosis. The authors examined incidence and mortality patterns in the UK over the past 30 years with the aim of describing a method to identify potential overdiagnosis.
The authors used mortality data, available since 1950, and incidence data, available since 1979. They considered two time periods: 1950 to 1978 (the so-called pre-diagnostic era) and 1980 to 2012 (the diagnostic era). Results from the pre-diagnostic era were used to predict cancer incidence and mortality in the diagnostic era, and the predicted rates were compared with observed rates obtained from Cancer Research UK. They calculated excess incidence (observed minus predicted) and deaths avoided (predicted minus observed) for 10 cancers, separately for males and females.
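The arithmetic is simple but the direction of each subtraction matters, so here it is spelled out with invented rates (not the study's data):

```python
# Illustrative rates per 100,000 person-years (invented numbers)
predicted_incidence, observed_incidence = 50.0, 80.0
predicted_mortality, observed_mortality = 30.0, 24.0

# Excess incidence: diagnoses beyond what the pre-diagnostic trend predicts
excess_incidence = observed_incidence - predicted_incidence
# Deaths avoided: mortality shortfall relative to the predicted trend
deaths_avoided = predicted_mortality - observed_mortality

print(excess_incidence, deaths_avoided)  # 30.0 6.0
```

Large excess incidence with little or no mortality benefit is the signature the authors interpret as potential overdiagnosis.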
Except for breast cancer, mortality in the diagnostic era followed predictions from the pre-diagnostic era. However, cancer incidence in the diagnostic era was several orders of magnitude greater than predicted.
The authors found that the incidence of common cancers in the UK has increased considerably in the last 30 years. However, they noted that, “This has not necessarily prevented cancer mortality.” This finding supports the notion that some of the increased detection represents cancer overdiagnosis.
Several presentations reported on methods for estimating overdiagnosis. Clearly, this is a hot topic. Of interest, a poster presenting results from a non-systematic review of overdiagnosis reported that the clinical domains in which overdiagnosis was most commonly reported were oncology (60.0%), mental disorders (8.2%), and infectious diseases (5.2%) (CIDG take heed)11.
The experience of a Diagnostic Evidence Co-operative in working with industry to evaluate new medical tests
Van den Bruel A
The National Institute for Health Research (NIHR) has funded four Diagnostic Evidence Co-operatives (DECs) to help generate evidence on the clinical and cost-effectiveness of in vitro diagnostic devices (IVDs), which are important in improving the way diseases are diagnosed. This should help patients access the most appropriate treatments more quickly and help the National Health Service (NHS) make the best use of its resources12. The author described one of the DECs (Oxford), which focuses on primary care, including general practice, pharmacy, and midwifery. The DEC holds an annual forum to bring together diverse stakeholders with an interest in the development of new tests (e.g. clinicians and other healthcare professionals, the IVD industry, NHS pathology services, NHS commissioners, academic researchers including health economists, and patient groups). They also organize a workshop aimed at industry, in particular to explain the evidence and regulatory environment for bringing a new test to market and into use in the NHS.
I was excited about this presentation because bringing industry into the testing picture early makes a lot of sense. The TB community has taken a slightly different approach to involving industry partners in new diagnostics. FIND, the Bill and Melinda Gates Foundation, World Health Organization, Stop TB Partnership's New Diagnostics Working Group, PATH, UNITAID, McGill International TB Centre and other partners have produced a technology and market landscape report, a consensus report on target product profiles, a series of market analyses, and answers to frequently asked questions by TB product developers13.
The problem with composite reference standards
I extracted the following information from the article by Schiller et al14.
When there is no agreed upon single perfect diagnostic test for a disease, some researchers advocate the use of composite reference standards that use multiple tests. Two common rules for composite reference standards are the OR rule (any positive test) and the AND rule (all positive tests). The rationale for using a composite reference standard is that multiple imperfect tests lead to a more accurate reference test than a single test. But is this true?
Using, as an example, a new test for Chlamydia trachomatis, an often asymptomatic infection for which no perfect reference standard exists, the authors considered a composite reference standard that classified patients as positive if at least one component test was positive. They studied how the sensitivity and specificity of the composite reference standard changed as the number of component tests increased.
When specificity of all components was 100%, a composite reference standard was useful. However, when specificity of all components was less than 100% (even if specificity was high), as the number of components increased, sensitivity of the composite reference standard increased, but at the expense of specificity.
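The specificity penalty is easy to see analytically. Assuming conditionally independent components, under the OR rule the composite detects a case if any component does, but stays negative only if every component does. A minimal sketch (my own, with illustrative accuracy values):

```python
# OR-rule composite of n conditionally independent components,
# each with sensitivity se and specificity sp (assumed values).
def composite_or(se: float, sp: float, n: int) -> tuple[float, float]:
    sens = 1 - (1 - se) ** n  # positive if ANY component detects the case
    spec = sp ** n            # negative only if ALL components stay negative
    return sens, spec

for n in (1, 2, 3, 5):
    sens, spec = composite_or(se=0.80, sp=0.95, n=n)
    print(f"n={n}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Even with components that are each 95% specific, five components drive composite specificity down toward 0.95^5 ≈ 0.77, so what look like false positives of the index test may in fact be errors of the composite.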
A composite reference standard is not a guarantee of improved accuracy over a single imperfect reference standard unless each component test has perfect specificity, and the composite reference standard is conditionally independent of the index test. When these conditions are not met, as in the example of testing for C. trachomatis, other statistical models may be necessary, instead of relying on a composite reference standard.
There was a lot of food for thought in this presentation. At first blush these findings seemed counterintuitive, but that is where the motivating example was brilliant.
Estimating the level of over-treatment in childhood pulmonary tuberculosis in the absence of an accurate reference standard: a Bayesian latent class analysis
Schumacher S, van Smeden M, Dendukuri N, Joseph L, Nicol M, Pai M, Zar H
The WHO estimates that, globally, in 2014, one million children under 15 years old fell ill from tuberculosis (TB)15. Children with TB typically carry few TB bacteria and have difficulty producing sputum, making the diagnosis of TB challenging. Healthcare providers often rely on clinical rather than microbiological findings for the diagnosis. The evaluation of tests for diagnosing childhood TB and estimating over-treatment are compromised by the absence of a quality reference standard.
Using data from a study of hospitalized South African children thought to have pulmonary TB (n = 749), the authors developed a Bayesian latent class model based on five different TB tests: smear microscopy, culture, nucleic acid amplification test, tuberculin skin test, and chest radiography. They then used the model to estimate accuracy for the five tests and over-treatment.
The authors found the prevalence of childhood pulmonary TB to be 27% (95% credible interval 21% to 35%). Forty-six percent of children without TB received treatment. Even when the probability of TB was low (around 30%), the probability of receiving TB treatment was greater than 80%.
The authors found latent class analysis to be useful for estimating TB over-treatment. They thought that, in this setting, over-treatment might be due to suboptimal accuracy of the available TB tests and concern about the high risk of mortality if a TB diagnosis were missed.
In DTA systematic reviews, the reference standard is the agreed upon best yardstick against which to assess the new test (index test). The reference standard classifies participants with the disease and those without the disease of interest. Reitsma and colleagues describe four main ways of determining diagnostic accuracy in the presence of an imperfect reference standard16. One way is to construct a reference standard by latent class analysis using a statistical model based on actual data. This means that the disease (or target condition) is mathematically defined, rather than clinically defined. The application of this method involves a lot of hard work. I commend the authors for applying this approach to childhood TB. I wonder if latent class analysis would be useful for evaluating the diagnostic accuracy of tests for extrapulmonary TB.
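To give a flavor of the idea (this is my own toy sketch, not the authors' Bayesian model), a two-class latent class model with conditionally independent tests can be fitted by expectation-maximization on invented result-pattern counts:

```python
# Toy data: counts of each result pattern for three binary tests
# (invented numbers, not the study's data). 1 = positive, 0 = negative.
counts = {
    (1, 1, 1): 40, (1, 1, 0): 15, (1, 0, 1): 12, (0, 1, 1): 10,
    (1, 0, 0): 8,  (0, 1, 0): 9,  (0, 0, 1): 11, (0, 0, 0): 395,
}
n_tests = 3
total = sum(counts.values())

# Initial guesses: prevalence, per-test sensitivity and specificity
prev, se, sp = 0.5, [0.8] * n_tests, [0.8] * n_tests

for _ in range(500):  # EM iterations
    # E-step: posterior probability of disease for each result pattern
    post = {}
    for pattern in counts:
        p_dis, p_non = prev, 1 - prev
        for j, r in enumerate(pattern):
            p_dis *= se[j] if r else (1 - se[j])
            p_non *= (1 - sp[j]) if r else sp[j]
        post[pattern] = p_dis / (p_dis + p_non)
    # M-step: update parameters from expected class memberships
    n_dis = sum(counts[p] * post[p] for p in counts)
    prev = n_dis / total
    for j in range(n_tests):
        pos_dis = sum(counts[p] * post[p] for p in counts if p[j] == 1)
        neg_non = sum(counts[p] * (1 - post[p]) for p in counts if p[j] == 0)
        se[j] = pos_dis / n_dis
        sp[j] = neg_non / (total - n_dis)

print(f"estimated prevalence: {prev:.2f}")
print("sensitivities:", [round(x, 2) for x in se])
print("specificities:", [round(x, 2) for x in sp])
```

Note that "disease" here is whatever latent class best explains the pattern counts, with no reference standard in sight: the target condition is mathematically, not clinically, defined, which is precisely the caveat to keep in mind when interpreting such models.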
2Zhelev Z, Garside R, Hyde C. Systematic Reviews 2013 2:32.http://www.ncbi.nlm.nih.gov/pubmed/23680077
3Harbord RM, Deeks JJ, Egger M, Whiting P, Sterne JA. A unification of models for meta-analysis of diagnostic accuracy studies. Biostatistics 2007;8(2):239-51. http://www.ncbi.nlm.nih.gov/pubmed/16698768
4Di Ruffano LF, Hyde CJ, McCaffery KJ, Bossuyt PMM, Deeks JJ. BMJ 2012;344 http://www.bmj.com/content/344/bmj.e686.long
5Bossuyt PM, Irwig L, Craig J, Glasziou P. BMJ 2006;332 http://www.bmj.com/content/332/7549/1089
6Methods Guide for Medical Test Reviews, 2012, Chapter 2. http://effectivehealthcare.ahrq.gov/index.cfm/search-for-guides-reviews-and-reports/?productid=558&pageaction=displayproduct
7Schunemann HJ, Mustafa R, Brozek J, Santesso N, Alonso-Coello P, Guyatt G, et al. Journal of Clinical Epidemiology 2016 Feb 27. pii: S0895-4356(16)00136-0 http://www.ncbi.nlm.nih.gov/pubmed/26931285
9Whiting P, Westwood M, Beynon R, Burke M, Sterne JA, Glanville J. Journal of Clinical Epidemiology 2011 64(6):602-7 http://www.ncbi.nlm.nih.gov/pubmed/21075596
10For information about Cochrane crowd see http://crowd.cochrane.org/index.html
11Jenniskens, Naaktgeboren, Reitsma, Moons, de Groot. A bird’s-eye view of overdiagnosis: What’s out there? Poster 51.
13See www.tbfaqs.org for these reports.
14Schiller. Statistics in Medicine 2015. http://www.ncbi.nlm.nih.gov/pubmed/26555849
15Global Tuberculosis Report. WHO, Geneva, 2015. www.who.int/tb/publications/global_report/en/
16Reitsma JB, Rutjes AW, Khan KS, Coomarasamy A, Bossuyt PM. Journal of Clinical Epidemiology 2009 62(8):797-806. http://www.ncbi.nlm.nih.gov/pubmed/19447581