You can enhance your skills in evidence-based practice using these DiTA tutorials:

## 1. Is this study valid?

DiTA is a database of studies of diagnostic test accuracy. Studies of diagnostic test accuracy seek to determine how well particular diagnostic tests, or combinations of tests, distinguish between people who do and do not have a particular condition.

Published reports of diagnostic test accuracy studies vary greatly in quality: some report high-quality studies that are well-designed, well-conducted and well-analysed, but others report low-quality studies that have been poorly designed, conducted and analysed.

This tutorial is the first of two tutorials providing guidance on how to read reports of studies of diagnostic test accuracy. This first tutorial considers how readers of diagnostic test accuracy studies can distinguish between high-quality and low-quality studies. The second tutorial will discuss how the findings of diagnostic test accuracy studies can assist in making clinical diagnoses.

The logic of diagnostic test accuracy studies is simple. In these studies, a sample of people suspected of having a particular condition are assessed with the diagnostic test of interest (called the “index test”). The same people also have another test applied to them that is thought to accurately measure the condition of interest. This test is called a “reference test” or “reference standard”, or sometimes the “gold standard test”. As the reference test is assumed to be accurate, the index test is said to be accurate if it is found to give the same results as the reference test. The degree of concordance between the index test and the reference test provides a measure of the accuracy of the index test.

You might wonder why, if there is a reference test that is assumed to be accurate, we would ever be interested in the accuracy of an index test. If the reference test is accurate, why would we ever want to use a test other than the reference test? The answer is that, while reference tests are assumed to be accurate, they may also be difficult to administer, invasive, hard to access, expensive, time consuming, unpleasant or potentially harmful. By investigating the accuracy of an index test, we hope to be able to identify a test that is nearly as accurate as the reference test but easier to administer, less invasive, easier to access, less expensive, faster, less unpleasant or less risky than the reference test.

When reading a report of a study of diagnostic test accuracy, you should look for study characteristics that suggest the study is likely to provide trustworthy estimates of the accuracy of the diagnostic test. Ask three questions:

**1. Was the reference test accurate?**

Trustworthy studies of diagnostic test accuracy have accurate reference tests. If the reference test is accurate, then a finding of a strong concordance between the index test and the reference test is evidence that the index test is accurate, and a finding of a weak concordance between the index test and the reference test is evidence that the index test is not accurate. If, however, the reference test is not accurate, interpretation of the study is difficult. Usually inaccuracies in the reference test cause the study to underestimate the true accuracy of the index test. But that is not always the case: it is also possible that studies with inaccurate reference tests can overestimate index test accuracy.

The problem of inaccurate reference tests is common – many studies of diagnostic test accuracy have imperfect reference tests. This is because it is hard to accurately diagnose some conditions, even with the best available tests. When that is the case, the reference test may not be accurate. Even though reference test inaccuracy may be unavoidable, the use of an inaccurate reference test in a study of diagnostic test accuracy still makes the study findings difficult to interpret.

How can we know if the reference test used in a particular study of diagnostic test accuracy is or is not accurate? Unfortunately, there is usually no objective way of assessing the accuracy of a reference test, because to test the accuracy of a reference test we would need to compare its findings to another reference test (and then we would need to know *its* accuracy …!). Consequently, we must judge the accuracy of reference tests by thinking about how the reference test is conducted and its likely sources of error. Such considerations might lead us to believe, for example, that magnetic resonance imaging (MRI) is quite a good reference test for complete tears of the anterior cruciate ligament of the knee (because you can usually see complete anterior cruciate ligament tears pretty clearly on MRI scans). On the other hand, we might think that MRI is a less satisfactory reference test for partial tears of the medial ligament of the knee (because partial tears of the medial ligament are harder to see on MRI than complete cruciate ligament tears).

When making judgements about a reference test we need to think about *how much* inaccuracy there could be in the reference test. If the reference test is likely to be correct almost all of the time (only infrequently incorrect) then the reference test may be *good enough* to provide reasonably trustworthy estimates of index test accuracy. On the other hand, if the reference test is often incorrect then the estimates of index test accuracy could be very biased. Studies of diagnostic test accuracy that have reference standards that are likely to be frequently incorrect should not be considered trustworthy.
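To get a feel for how much an imperfect reference test can distort the results, here is a minimal sketch in Python. It rests on a simplifying assumption that is not part of the tutorial: the index test and the reference test make their errors independently of each other, given the person's true condition. The function name and all of the numbers are hypothetical. Under that assumption the index test's accuracy is underestimated; when the two tests tend to make the same mistakes, it can instead be overestimated.

```python
def apparent_accuracy(prevalence, index_se, index_sp, ref_se, ref_sp):
    """Sensitivity and specificity of the index test as they would be
    measured against an imperfect reference test.

    Assumes (for illustration only) that the two tests err independently
    of each other, given the person's true condition. All inputs are
    proportions between 0 and 1.
    """
    p = prevalence
    # Probability that both tests are positive, and that the reference is positive.
    both_pos = p * index_se * ref_se + (1 - p) * (1 - index_sp) * (1 - ref_sp)
    ref_pos = p * ref_se + (1 - p) * (1 - ref_sp)
    # Probability that both tests are negative, and that the reference is negative.
    both_neg = p * (1 - index_se) * (1 - ref_se) + (1 - p) * index_sp * ref_sp
    ref_neg = p * (1 - ref_se) + (1 - p) * ref_sp
    return both_pos / ref_pos, both_neg / ref_neg


# Hypothetical index test with a true sensitivity and specificity of 90%.
for ref_accuracy in (1.0, 0.99, 0.9):
    se, sp = apparent_accuracy(0.5, 0.9, 0.9, ref_accuracy, ref_accuracy)
    print(f"reference accuracy {ref_accuracy:.0%}: "
          f"apparent sensitivity {se:.1%}, apparent specificity {sp:.1%}")
```

In this made-up scenario, a reference test that is right 99% of the time barely changes the estimates, whereas one that is right only 90% of the time makes a 90% accurate index test look about 82% accurate.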

**2. Was the index test conducted and interpreted without knowledge of the finding of the reference test?**

Sometimes, when a diagnostic test is conducted, the findings are clear: the test is clearly positive (the findings are consistent with the hypothesis that the person being tested has the condition that is being tested for) or negative (consistent with the hypothesis that the person being tested does not have the condition that is being tested for). But sometimes the test findings are not so clear, and the tester may be uncertain whether the test was positive or negative. When that occurs, it is preferable that the person conducting and interpreting the index test is not aware of the findings of the reference test. If the person conducting and interpreting the index test is uncertain about whether the index test was positive or negative but knows the result of the reference test, he or she may (consciously or subconsciously) be more inclined to assign to the index test the finding of the reference test. In other words, knowledge of the reference test finding could influence the finding of the index test. This would make the index test appear more accurate than it really is.

Rigorous studies of diagnostic test accuracy are designed in a way that ensures the person who is conducting and interpreting the index test does not know the findings of the reference test. One easy way to achieve that is to conduct the index test before the reference test is conducted.

In trustworthy studies of diagnostic test accuracy, the index test is conducted without knowledge of the result of the reference test.

**3. Was the study conducted on people for whom there was diagnostic uncertainty?**

In clinical practice, diagnostic tests are applied to people in whom there is diagnostic uncertainty (i.e., to people for whom the diagnosis is not already known); the purpose of testing is to resolve uncertainty about whether a particular condition is or is not present. We want that context to be mirrored in studies of diagnostic test accuracy. Studies of diagnostic test accuracy should be conducted on people for whom there is diagnostic uncertainty and who are therefore representative of the people the test would be used on in clinical practice.

You may be surprised to hear that many studies of diagnostic test accuracy are conducted on people for whom there is no diagnostic uncertainty. Such studies are often conducted on a sample consisting of both “cases” (people for whom a definitive diagnosis has already been made) and “non-cases” (people who are not suspected of having the diagnosis; often healthy volunteers who are not seeking health care). These studies provide an indication of how well the index test can discriminate between people who *obviously do* and *obviously do not* have the condition of interest. However, the task of discriminating between people who obviously do and obviously do not have the condition of interest is easier than the task of clinical diagnosis, because clinical diagnosis requires discrimination between people who do (but do not *obviously*) have the condition and people who don’t (but do not *obviously not*) have the condition of interest. Clinical diagnosis is more challenging than the problem of discriminating between people who obviously do and don’t have the condition because clinical diagnosis is carried out on people for whom there is diagnostic uncertainty. For that reason, studies conducted on cases and non-cases (rather than a population with diagnostic uncertainty) are likely to generate inflated estimates of diagnostic accuracy. Studies conducted on people for whom there is diagnostic uncertainty are likely to provide more realistic, less biased, estimates of real-world diagnostic accuracy.

When you read studies of diagnostic test accuracy, look for these three study characteristics. Studies with accurate reference tests that conduct and interpret index tests without knowledge of the reference test’s findings and sample from a population with diagnostic uncertainty are likely to provide the most trustworthy estimates of diagnostic test accuracy. Such studies are most useful for informing clinical practice. The next tutorial considers how the findings of these studies can be used to inform clinical practice.

If you would like to read more about critical appraisal of studies of diagnostic test accuracy, check out the relevant chapters in reference (1).

*References*

*1. Herbert RD, Jamtvedt G, Mead J, Hagen KB. Practical Evidence-Based Physiotherapy. 2nd ed. Oxford: Elsevier; 2011.*

## 2. How can I use evidence of diagnostic test accuracy?

This tutorial is the second of two tutorials on critical appraisal of studies of diagnostic test accuracy.

The first tutorial considered how readers of diagnostic test accuracy studies can distinguish between high-quality and low-quality studies. Low-quality studies should not be relied upon for clinical decision making, but high-quality studies provide trustworthy information about diagnostic test accuracy that can be used to inform clinical practice. This second tutorial considers how the findings of high-quality diagnostic test accuracy studies can be used to inform clinical decision making.

The findings of high-quality diagnostic test studies can be used in two ways. First, they can help us identify accurate tests. Second, they can help us to interpret test findings. Let’s look at those in turn:

**1. Using the findings of diagnostic test accuracy studies to identify accurate tests.**

The preceding tutorial briefly explained the logic of studies of diagnostic test accuracy. Studies of diagnostic test accuracy involve comparing the findings of the index test to a reference test. The degree of concordance of the findings of the index and reference tests provides a measure of the accuracy of the index test.

How can the accuracy of a diagnostic test be quantified? Somehow we have to come up with some numbers that say something about the concordance between the findings of the index test and the reference test. This task is easiest when each of the index test and the reference test can generate just one of two findings: a positive finding or a negative finding. Here we will restrict consideration to these sorts of tests, as they are the most common sorts of diagnostic tests. We say the test is positive when its findings suggest the person who was tested has the condition of interest, and we say the test is negative when its findings suggest the person who was tested does not have the condition of interest.

The most frequently reported measures of diagnostic test accuracy are sensitivity and specificity. Sensitivity is the probability that a person who has the condition of interest will test positive. We can estimate sensitivity by first identifying all of the people in the study who tested positive to the reference test (i.e., the people who really do have the condition of interest) and then calculating the proportion of these people who tested positive with the index test. Specificity is the probability that a person who does not have the condition of interest will test negative. We can estimate specificity by identifying all of the people in the study who tested negative with the reference test (the people who really do not have the condition of interest) and then calculating the proportion of these people who tested negative with the index test.
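The calculation described above can be sketched in a few lines of Python. The counts are invented for illustration, and the function and argument names are hypothetical; the only inputs are how many people fall into each combination of index test and reference test finding.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Estimate sensitivity and specificity from study counts.

    true_pos / false_neg: reference-positive people who tested positive /
    negative with the index test.
    true_neg / false_pos: reference-negative people who tested negative /
    positive with the index test.
    """
    # Proportion of reference-positive people who were index-positive.
    sensitivity = true_pos / (true_pos + false_neg)
    # Proportion of reference-negative people who were index-negative.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical study: 50 people were positive on the reference test and
# 100 people were negative on the reference test.
sens, spec = sensitivity_specificity(true_pos=45, false_neg=5,
                                     true_neg=80, false_pos=20)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
# prints "sensitivity = 90%, specificity = 80%"
```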

Obviously we would like tests to be both sensitive and specific. That is, the most useful index tests are usually positive in people who have the condition and negative in people who do not have the condition. Both sensitivity and specificity are proportions, so they can theoretically have values between 0% and 100%. However, even a completely random diagnostic test (think of using a fair coin toss as a diagnostic test!) will have a sensitivity and specificity of 50%. So a useful test must have a sensitivity and a specificity that are more than 50% – the closer to 100% the better. As a rough guide, a diagnostic test is only likely to be clinically useful if both its sensitivity and specificity are greater than 80%. (There are some exceptions to this but it’s not a bad rule of thumb.)

Studies of diagnostic test accuracy will almost always report sensitivity and specificity. When you read studies of the accuracy of a particular diagnostic test, look to see if the sensitivity and specificity of the test are greater than 80%. If so, the test might be useful. (Sometimes a test can be useful if it has a very high sensitivity but a low specificity, or if it has a very high specificity but a low sensitivity, but usually it will only be useful if both sensitivity and specificity are high.)

**2. Using the findings of diagnostic test accuracy studies to interpret test findings.**

So you’ve just conducted a diagnostic test on a particular patient, and you came up with a test finding (the test was positive or negative). How should you interpret the test’s findings? If the test was positive, should you now believe that the person has the condition you were testing for? And if the test was negative should you now believe that the person does not have the condition you were testing for?

There are a few ways that we can use the findings of studies of diagnostic test accuracy to help us interpret the findings of a particular diagnostic test when it is applied to a particular patient. The methods at our disposal vary in complexity – simple methods give approximate answers and more complex methods give more precise answers. Here we will consider two simple approaches.

The simplest approach to using information about diagnostic test accuracy is to use the mnemonics “Sp-in” and “Sn-out”. Sp-in and Sn-out remind us that:

Sp-in: **Sp**ecificity is important when you want to rule a diagnosis *in*.

Sn-out: **Sen**sitivity is important when you want to rule a condition *out*.

What’s all that about? Well, if you have just conducted a test and its finding is positive, then the test is suggesting you should rule the condition “in”. However it is still important to consider how confidently we can rule the condition in. The mnemonic Sp-in reminds us that our confidence in ruling a condition in depends on the specificity of the test. If the test is positive and the specificity of the test is very high (say, greater than 90%) then you can quite confidently rule the condition in. But if the specificity is low (say, 60%) you should be less inclined to rule the condition in.

Conversely, if you have just conducted a test and its finding is negative, then the test finding is suggesting you should rule the condition “out”. The mnemonic Sn-out says that your inclination to rule the condition out should depend on the sensitivity of the test. If the test is negative and the sensitivity of the test is very high then you can quite confidently rule the condition out, but if the sensitivity is low you are less able to confidently rule the condition out.

Many people find Sp-in and Sn-out counterintuitive – they think it should be the other way around. That’s what makes Sp-in and Sn-out good mnemonics.

Sp-in and Sn-out are helpful. But it would be better if we could be a bit more quantitative about how confidently we can rule in or rule out a condition after performing a diagnostic test. Before getting more quantitative about diagnostic probabilities, it is necessary to introduce two new ideas.

(a) The process of diagnosis is usually incremental. Typically, when making a diagnosis, we start with several diagnostic hypotheses. For example, we might think “This person could have a ruptured anterior cruciate ligament, or a torn posterior cruciate ligament, or even possibly a torn meniscus”. It is almost always the case that not all hypotheses appear equally credible. That is, we implicitly or explicitly assign different probabilities to each hypothesis. (We might be thinking “the diagnosis of a meniscal tear is a bit unlikely because the person doesn’t report locking of the knee”, implying a lower probability for the meniscal tear hypothesis.) The diagnostic process is a process of progressively modifying the probability assigned to particular diagnoses. We use diagnostic tests to modify our estimates of the probability of particular diagnostic hypotheses. A positive test for condition A increases the probability of that person having condition A, and a negative test decreases the probability. If the probability of a particular diagnosis becomes low we may choose to rule out that diagnosis. And if the probability of a diagnosis becomes high we may choose to rule the diagnosis in (i.e., make a diagnosis).

The key idea here is that we use diagnostic tests to modify probabilities. Whenever we conduct a diagnostic test, there is a pre-test probability (the probability we assigned to a particular condition before testing) and a post-test probability (the revised probability obtained by modifying the pre-test probability on the basis of the test finding). Those probabilities can be expressed in qualitative terms (unlikely, possible, very likely, nearly certain etc.) or in quantitative terms (a number between 0 and 100%).

(b) Once we start being more quantitative about diagnosis, we will do better to stop using sensitivity and specificity and instead start using likelihood ratios. Likelihood ratios can be derived from sensitivity and specificity, so they carry the same information about diagnostic test accuracy, but they package information in a way that makes them easier to use. Likelihood ratios are pretty cool in a whole lot of ways, but we’ll leave it to you to explore their total coolness. Here we will focus on just one way to use likelihood ratios.

There is a likelihood ratio associated with each possible test outcome. And, because most tests have two possible outcomes (positive or negative), most tests have two likelihood ratios: a positive likelihood ratio (LR+) associated with a positive test outcome and a negative likelihood ratio (LR-) associated with a negative test outcome.

Positive likelihood ratios are used to help interpret the findings of a positive test. Obviously, positive tests make us think it is more probable that the person has the condition we are testing for. Positive likelihood ratios tell us how much more probable the condition is after a positive test than it was before the test finding was known. (Or, to be a bit more precise, the positive likelihood ratio is the factor by which the odds of the condition are increased when the test result is positive. If you don’t know what “odds” are, don’t worry about it for now.) The more a positive likelihood ratio is above 1, the more the probability that the condition is present is increased by a positive test. Clinically useful tests have large positive likelihood ratios, which means that they substantially increase the probability that the condition is present. In other words, when accurate tests are positive, the post-test probability is substantially higher than the pre-test probability. Lousy tests have positive likelihood ratios that are not much greater than 1, so they don’t have much effect on the estimated probability of the condition being present. We might as well not use lousy tests!

Negative likelihood ratios tell us how to interpret the findings of a negative test. Negative tests make us think it is less probable that the person has the condition we are testing for, and negative likelihood ratios tell us how much less probable. Accurate tests have small negative likelihood ratios (well below 1, close to 0), which means that they greatly decrease the probability that the condition is present. When accurate tests are negative, the post-test probability is substantially lower than the pre-test probability. Get it?

These days many – perhaps most – studies of diagnostic test accuracy report likelihood ratios. But if they don’t, you can easily calculate the positive and negative likelihood ratios using these formulas:

LR+ = sensitivity in % / (100 – specificity in %)

LR- = (100 – sensitivity in %) / specificity in %.
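As a minimal sketch, the two formulas above translate directly into Python. The function name is hypothetical, and the handling of the edge cases (a specificity of exactly 100% or 0%, where the formulas would divide by zero) is an assumption made here rather than something the tutorial specifies.

```python
import math


def likelihood_ratios(sensitivity_pct, specificity_pct):
    """Return (LR+, LR-) from sensitivity and specificity given in percent."""
    false_positive_pct = 100 - specificity_pct
    false_negative_pct = 100 - sensitivity_pct
    # Assumption: report an infinite ratio instead of dividing by zero.
    lr_pos = (sensitivity_pct / false_positive_pct
              if false_positive_pct > 0 else math.inf)
    lr_neg = (false_negative_pct / specificity_pct
              if specificity_pct > 0 else math.inf)
    return lr_pos, lr_neg


# Hypothetical test with 90% sensitivity and 80% specificity.
lr_pos, lr_neg = likelihood_ratios(90, 80)
print(lr_pos, lr_neg)  # prints 4.5 0.125
```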

How can we use the positive and negative likelihood ratios reported in high-quality studies of diagnostic test accuracy? Here are some rules of thumb recommended by McGee (2).

- If the **test is positive** and the **LR+ is about 2, increase** your estimate of the probability of the condition being present by about **15%**.
- If the **test is positive** and the **LR+ is about 5, increase** your estimate of the probability of the condition being present by about **30%**.
- If the **test is positive** and the **LR+ is about 10, increase** your estimate of the probability of the condition being present by about **45%**.
- If the **test is negative** and the **LR- is about 0.5, decrease** your estimate of the probability of the condition being present by about **15%**.
- If the **test is negative** and the **LR- is about 0.2, decrease** your estimate of the probability of the condition being present by about **30%**.
- If the **test is negative** and the **LR- is about 0.1, decrease** your estimate of the probability of the condition being present by about **45%**.
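The six rules above can be written as a small lookup table. This is only an illustrative sketch: the names are hypothetical, the changes are treated as percentage points added to or subtracted from the pre-test probability, and keeping the answer between 0% and 100% is an assumption made here so that the result is always a valid probability.

```python
# Benchmark likelihood ratios from the rules above, mapped to the approximate
# change in probability (in percentage points).
RULE_OF_THUMB_SHIFTS = {2: 15, 5: 30, 10: 45, 0.5: -15, 0.2: -30, 0.1: -45}


def approximate_post_test_probability(pre_test_pct, benchmark_lr):
    """Apply the rule of thumb for one of the six benchmark likelihood ratios."""
    shift = RULE_OF_THUMB_SHIFTS[benchmark_lr]
    # Assumption: keep the result within the range 0 to 100.
    return min(100, max(0, pre_test_pct + shift))


# Hypothetical patient with a pre-test probability of 30% and a positive
# test whose LR+ is about 5.
print(approximate_post_test_probability(30, 5))  # prints 60
```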

Here’s a hypothetical example. Imagine you are testing a person who has injured her knee. Everything you know about the person up to now tells you that it is quite possible, though perhaps not likely, that she has a meniscal tear. It’s line ball, so you might nominate a pre-test probability of meniscal tear of 50%. Then imagine you conduct a McMurray’s test for a meniscal tear and find that the test is negative. How much should you decrease your estimate of the probability of a meniscal tear? Should you now conclude that the patient does not have a meniscal tear? Well, the negative likelihood ratio of McMurray’s test for a meniscal tear is about 0.5 (1). So McGee’s rules suggest the negative McMurray’s test should decrease your estimate of the probability of a meniscal tear by 15%, from 50% to 35%. In other words, McMurray’s test is not good enough to confidently rule out a meniscal tear – it’s still plausible this patient has a meniscal tear, even though McMurray’s test was negative.

What if, instead, the McMurray’s test had been positive? The positive likelihood ratio is excellent: about 16. Using McGee’s rules this increases the probability by more than 45% (a positive likelihood ratio of 10 increases the probability by about 45%, so a positive likelihood ratio of 16 must increase the probability by more than 45%). So the post-test probability is roughly 95%. You should therefore be nearly certain that this woman has a meniscal tear.
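For readers who want to check the rule of thumb against an exact calculation, here is a minimal sketch that uses the definition given earlier: the likelihood ratio is the factor by which the odds of the condition are multiplied. The function name is hypothetical; the conversion between probability and odds (odds = probability / (1 - probability)) is the standard one. The rules of thumb are approximations, so the two methods can differ by a few percentage points.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Exact post-test probability from a pre-test probability (a proportion
    that is at least 0 and less than 1) and a likelihood ratio."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


# The two McMurray's test scenarios, with a pre-test probability of 50%.
print(round(post_test_probability(0.5, 0.5), 2))  # negative test: prints 0.33
print(round(post_test_probability(0.5, 16), 2))   # positive test: prints 0.94
```

The exact answers (about 33% after a negative test and about 94% after a positive test) are close to the rule-of-thumb estimates and lead to the same clinical conclusions.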

In his paper, McGee (2) gives a clear explanation of his rules and how they can be used. These rules are neat because they tell us how much a positive or negative test result should influence our estimate of the probability that the condition being tested for is present.

There are other ways to use likelihood ratios to interpret the findings from studies of diagnostic test accuracy. The interested reader is referred to reference (3) for more details.

*References*

*1. Jackson JL, O’Malley PG, Kroenke K. Evaluation of acute knee pain in primary care. Ann Intern Med 2003;139(7):575-88.*

*2. McGee S. Simplifying likelihood ratios. J Gen Intern Med 2002;17(8):647-50.*

*3. Herbert RD, Jamtvedt G, Mead J, Hagen KB. Practical Evidence-Based Physiotherapy. 2nd ed. Oxford: Elsevier; 2011.*