The Rebirth of CAD: How Is Modern AI Different from the CAD We Know?
Luke Oakden-Rayner
Author Affiliations
From the Department of Radiology, Royal Adelaide Hospital, North Terrace, Adelaide, SA, Australia 5000; School of Public Health, University of Adelaide, Adelaide, Australia; and Australian Institute for Machine Learning, Adelaide, Australia.
Address correspondence to the author (e-mail: luke.oakden-rayner@adelaide.edu.au).
Published Online: May 29, 2019. https://doi.org/10.1148/ryai.2019180089
Computer-aided diagnosis is a phrase that inspires strong opinions among radiologists, and many of those opinions are negative.
The term computer-aided detection (CAD) arose in the 1980s and 1990s, during the second era of artificial intelligence (AI). Like the first era in the 1950s and 1960s, and the third era through which we are currently living, the second era was built around new and exciting technologies paired with the promise that computers would soon solve all of our problems. The technology of the second era fell far short of these lofty expectations and led to a period of extreme disillusionment, often described as an “AI winter.” Given this history, it is no wonder that many consider CAD to be a disappointment.
Radiologists have more reason than most to be disappointed, because CAD in medical imaging was more than an unrealized promise. Almost uniquely across the world of technology, medical or otherwise, the hype and optimism around second-era AI led to the widespread utilization of CAD in clinical imaging. This use was most obvious in screening mammography, where it has been estimated that by 2010 more than 74% of mammograms in the United States were read with CAD assistance (1).
Unfortunately, CAD’s benefit has been questionable. Several large trials concluded that CAD at best delivered no benefit (2) and at worst actually reduced radiologist accuracy (3), resulting in higher recall and biopsy rates (4).
Not only were the medical outcomes disappointing, but they came with an estimated 20% increase in the time needed to interpret each study, a result of the radiologist needing to dismiss the many false alarms these systems produced (5). It was not unusual for a radiologist to be asked to second-guess half a dozen spurious CAD-detected lesions per study, and research has shown that this can bias radiologists’ interpretations (6), for example making them less likely to detect a real cancer when the CAD system does not flag any potential lesions.
Given this experience of increased costs without improved performance, and of poor user experiences and unfulfilled promises, why are radiologists now expressing renewed interest in CAD?
The answer is that a new technology has been developed that vastly outperforms the methods of historical CAD. This technology is known as deep learning, which has rapidly spread across the technology sector and appears entirely capable of fulfilling the promises that second-era CAD could not.
This leads us to the first difference between deep learning and traditional computer vision systems: at least outside of medicine, deep learning actually works. As an example, in the well-known ImageNet image analysis competition, the best traditional CAD methods produced five times as many errors as a practiced human when asked to identify everyday objects in photographs, such as bicycles, dogs, and airplanes (7). In the space of a few short years, deep learning has surpassed human performance in this challenge and now makes around half as many errors as humans (8).
Similarly, before deep learning, autonomous vehicles could not get out of the parking lot. Now hundreds of cars are being tested on real roads all over the world, with over 10 million miles driven (9). The same technology is already part of your life, being used to categorize your photographs, understand your speech, and suggest sensible responses to your e-mails.
Although these applications are unrelated to medical imaging, they represent a massive change in the perceptual abilities of our computer systems. Computer vision in the era of traditional CAD was unable to perform visual tasks that a toddler would find trivial, but modern AI is succeeding in tasks that have previously been the domain of human experts. The difference in capability is difficult to overstate.
Although we do not yet have the level of evidence required to prove that deep learning can do the work of human doctors, early results show systems that appear to perform at a human level in common medical tasks such as retinal assessment (10) and skin lesion analysis (11). We have seen similar results in radiology; for example, Chilamkurthy et al used a large dataset to produce a system that can detect a variety of critical findings on CT head scans (12), which may be useful for the triage of reporting worklists.
Further testing is required, particularly given our previous experiences with traditional CAD, but the success of deep learning in so many “human” domains is unprecedented and a degree of optimism is justified.
This ability of deep learning to succeed across a range of perceptual tasks also distinguishes the technology from traditional CAD. Second-era AI was task-specific; each system had to be crafted for a single task. Deep learning is task-agnostic; it simply learns from the data it is given, whether those data are clinical photographs, radiographs, or pathologic slides. This ability is incredibly powerful, because deep learning systems often perform well on multiple similar tasks and can be fine-tuned on new tasks with much less effort than traditional CAD required (13). This capacity to generalize across tasks may mean that the way we think about CAD will need to change.
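To make the idea of fine-tuning concrete, the sketch below shows one common pattern in a widely used deep learning framework (PyTorch/torchvision): a network pretrained on everyday photographs is repurposed for a new imaging task by replacing and retraining only its final layer. This is a generic illustration, not a description of any system cited in this editorial; the two-class radiograph task, the training data, and the parameter choices are hypothetical.
```python
# A minimal, illustrative sketch (not from the editorial): fine-tuning a network
# pretrained on everyday photographs (ImageNet) for a hypothetical two-class
# radiograph task. Dataset, labels, and hyperparameters are all placeholders.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)          # start from ImageNet weights

# Swap only the final classification layer for the new task
# (e.g., "finding present" vs "no finding").
model.fc = nn.Linear(model.fc.in_features, 2)

# Freeze the pretrained feature extractor and train just the new head,
# which typically needs far less data and compute than training from scratch.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

# train_loader is a hypothetical DataLoader yielding (image, label) batches.
# for images, labels in train_loader:
#     optimizer.zero_grad()
#     loss = criterion(model(images), labels)
#     loss.backward()
#     optimizer.step()
```
Because the pretrained features do most of the work, such fine-tuning typically requires far less labeled data and engineering effort than building a traditional CAD pipeline from scratch.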
We have historically thought of CAD as a solution to a variety of independent tasks and have even defined specific subgroups of CAD based on the intended use of the systems. CADx (computer-aided diagnosis) and CADe (computer-aided detection) are the most commonly used terms, but the list of CAD variants also includes CADq (computer-aided quantification) and CAST (computer-aided simple triage), among others. These subgroups have even been used to define regulatory frameworks (14), which made sense when the previous technology could only do the single task for which it was designed.
Deep learning systems discover patterns that are useful beyond the task they are designed to perform. Just as a human can only diagnose a disease if he or she can identify the image features that inform the diagnosis, a diagnostic deep learning system also contains information that can help to localize and explain its decisions. Similarly, a triage system that identifies cases with possible cerebral artery occlusion for urgent review (15) is inherently performing a diagnostic task; it is recognizing some of the same features a neuroradiologist would look for to diagnose a stroke. With this flexibility in mind, we need to consider whether the boundaries we previously drew between CADx, CADe, and the other CAD variants are still relevant in the third era of AI. Although these categories are undoubtedly useful descriptions, they have historically been used to define risk. For example, the U.S. Food and Drug Administration (FDA) has treated CADx as higher risk than CADe (14).
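As a concrete illustration of the localization point above, the sketch below shows one widely used technique, gradient-weighted class activation mapping (Grad-CAM), for recovering a coarse heat map of the image regions that most influenced a classifier’s prediction. It is a generic sketch, not a method described in this editorial; the model, layer, and input are placeholders.
```python
# A minimal, illustrative sketch (not from the editorial): recovering a coarse
# localization map from an image classifier with gradient-weighted class
# activation mapping (Grad-CAM). Model, layer, and input are all placeholders.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(pretrained=True).eval()

# Capture the activations and gradients of the last convolutional block.
feats, grads = {}, {}
model.layer4.register_forward_hook(lambda m, i, o: feats.update(value=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

image = torch.randn(1, 3, 224, 224)         # placeholder input tensor
logits = model(image)
class_idx = logits.argmax(dim=1).item()     # the predicted class
logits[0, class_idx].backward()             # gradients of that class score

# Channel-wise importance weights, then a weighted sum of the feature maps.
weights = grads["value"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["value"].detach()).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
# "cam" now highlights the regions that most influenced the prediction.
```
In principle, the same trained classifier therefore offers both a diagnostic output and a rough localization, which is part of why the CADx/CADe boundary is becoming harder to draw.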
In the current third era of AI, where computer systems approach human-level capability and are able to perform human tasks, it may be far more important to define risk based on the level of human supervision an algorithm requires. A diagnostic algorithm whose output is reviewed by a radiologist may carry lower risk than a system that only quantifies the size of a lung nodule, if a clinician could make a treatment decision based on that measurement without any human ever looking at the images. While this idea of autonomous AI might seem too futuristic to some readers, we should remember that the FDA has already approved an autonomous system to review retinal photographs for signs of diabetic eye disease (16). It is promising that the Therapeutic Goods Administration in Australia has proposed new regulations that take the degree of autonomy into account when determining the risk of AI systems (17), building on earlier work from the International Medical Device Regulators Forum (18).
These developments highlight the most important way that modern AI differs from traditional CAD: we have learned from the past. At a basic level, CAD failed because its performance in controlled multireader experiments did not translate into performance in clinical practice. There is a range of likely causes for this failure, but whatever the reason, we now have decades of hard-won experience that we can use to inform AI design, testing, validation, policy, and regulation. It is heartening to see that these conversations are beginning to take place, with many groups in leadership roles, including the RSNA, identifying safety and evidence-based AI use as key priorities (19,20).
It would be easy to point to the negative historical experience of CAD in radiology and dismiss the current claims around modern AI, but this is not the CAD we have known. Outside of medicine, where the techniques that underpinned CAD failed, deep learning succeeds. Where CAD methods were narrow and brittle, deep learning can exploit patterns that are more meaningful and more broadly useful. We should consider this new CAD through a lens that makes the most of our historical experiences as well as our new knowledge. We should hold these systems to higher standards than we have in the past, avoiding the pitfalls we now know how to recognize, while at the same time acknowledging the technological advances that might finally allow computers to fulfill their clinical potential. Only by doing so will we be able to plan for the coming changes in our profession, to predict where this technology will succeed and where it may still fail, and, most importantly, to protect ourselves and our patients from harm.