Showing posts with label Health. Show all posts

Using Machine Learning to Detect Deficient Coverage in Colonoscopy Screenings

Colorectal cancer (CRC) is a global health problem and the second deadliest cancer in the United States, resulting in an estimated 900K deaths per year. While deadly, CRC can be prevented by removing small precancerous lesions in the colon, called polyps, before they become cancerous. In fact, it is estimated that a 1% increase in the adenoma detection rate (ADR, defined as the fraction of procedures in which a physician discovers at least one polyp) can lead to a 6% decrease in the rate of interval CRCs (a CRC that is diagnosed within 60 months of a negative colonoscopy).

Colonoscopy is considered the gold standard procedure for the detection and removal of polyps. Unfortunately, the literature indicates that endoscopists miss on average 22%-28% of polyps during colonoscopies; furthermore, 20% to 24% of polyps that have the potential to become cancerous (adenomas) are missed. Two major factors that may cause an endoscopist to miss a polyp are (1) the polyp appears in the field of view, but the endoscopist misses it, perhaps due to its small size or flat shape; and (2) the polyp does not appear in the field of view, as the endoscopist has not fully covered the relevant area during the procedure.

In “Detecting Deficient Coverage in Colonoscopies”, we introduce the Colonoscopy Coverage Deficiency via Depth algorithm, or C2D2, a machine learning-based approach to improving colonoscopy coverage. The C2D2 algorithm performs a local 3D reconstruction of the colon as images are captured during the procedure, and on that basis, identifies which areas of the colon were covered and which remained outside of the field of view. C2D2 can then indicate in real time whether a particular area of the colon has suffered from deficient coverage so the endoscopist can return to that area. Our work proposes a novel approach to compute coverage in real time, for which 3D reconstruction is done using a calibration-free, unsupervised learning method, and evaluate it in a large scale way.

The C2D2 Algorithm
When considering colon coverage, it is important to estimate the coverage fraction — what percentage of the relevant regions were covered by a complete procedure. While a retrospective analysis is useful for the physician and could provide general guidance for future procedures, it is more useful to have real-time estimation of coverage fraction, on a segment by segment basis, i.e. knowledge of what fraction of the current segment has been covered while traversing the colon. The helpfulness of such functionality is clear: during the procedure itself, a physician may be alerted to segments with deficient coverage, and can immediately return to review these areas. Higher coverage will result in a higher proportion of polyps being seen.

The C2D2 algorithm is designed to compute such a segment-by-segment coverage in two phases: computing depth maps for each frame of the colonoscopy video, followed by computation of coverage based on these depth maps.

C2D2 computes a depth image from a single RGB image. Then, based on the computed depth images for a video sequence, C2D2 calculates local coverage, so it can detect where the coverage has been deficient and a second look is required.

Depth map creation consists of both depth estimation as well as pose estimation — the localization of where the endoscope is in space, as well as the direction it is pointing. In addition to the detection of deficient coverage, depth and pose estimation are useful for a variety of other interesting tasks. For example, depth can be used for improved detection of flat polyps, while pose estimation can be used for relocalizing areas of the colon (including polyps) that the endoscopist wishes to revisit, and both together can be used for visualization and navigation.

Top row: RGB image, from which the depth is computed. Bottom row: Depth image as computed by C2D2. Yellow is deeper, blue is shallower. Note that the “tunnel” structure is captured, as well as the Haustral ridges.

In order to compute coverage fractions from these depth maps, we trained C2D2 on two sources of data: synthetic sequences and real sequences. We generated the synthetic videos using a graphical model of a colon. For each synthetic video, ground truth coverage is available in the form of a number between 0 (completely uncovered) and 1 (completely covered). For real sequences, we analyzed de-identified colonoscopy videos, for which ground truth coverage is unavailable.

Performance on Synthetic Videos
When using synthetic videos, the availability of ground truth coverage enables the direct measurement of C2D2’s performance. We quantify this using the mean absolute error (MAE), which indicates how much the algorithm’s prediction differs, on average, from the ground truth. We find that C2D2’s MAE = 0.075; meaning that, on average, the prediction of C2D2 is within 7.5% of the ground truth. By contrast, a group of physicians given the same task achieved MAE = 0.177, i.e., within 17.7% of the ground truth. Thus, the C2D2 attained an accuracy rate 2.4 times higher on synthetic sequences.

Performance on Real Videos
Of course, what matters most is performance on videos of real colonoscopies. The challenge in this case is the absence of ground truth labelling: we don’t know what the actual coverage is. Additionally, one cannot use labels provided by experts directly as they are not always accurate, due to the challenges described earlier. However, C2D2 can still perform inference on real colonoscopy videos. Indeed, the learning pipeline is designed to perform equally well on synthetic and real colonoscopy videos.

To verify performance on real sequences, we used a variant of a technique common in the generative modelling literature, which involves providing video sequences to human experts along with C2D2’s coverage scores for those sequences. We then ask the experts to assess whether C2D2’s score is correct. The idea is that while it is difficult for experts to assign a score directly, the task of verifying a given score is considerably easier. (This is similar to the fact that verifying a proposed solution to an algorithmic problem is generally much easier than computing that solution.) Using this methodology, experts verified C2D2’s score 93% of the time. And in a more qualitative sense, C2D2’s output seems to pass the “eyeball test”, see the figure below.

Coverage on real colonoscopy sequences. Top row: Frames from a well covered sequence — the entire “tunnel” down the lumen may be seen; C2D2 coverage = 0.931. Middle row: A partially covered sequence — the bottom may be seen, but the top is not as visible; C2D2 coverage = 0.427. Bottom row: A poorly covered sequence, much of what is seen is the wall; C2D2 coverage = 0.227.

Next steps
By alerting physicians to missed regions of the colon wall, C2D2 promises to lead to the discovery of more adenomas, thereby increasing the ADR and concomitantly decreasing the rate of interval CRC. This would be of tremendous benefit to patients.

In addition to this work that addresses colonoscopy coverage, we are concurrently conducting research to improve polyp detection by combining C2D2 with an automatic, real-time polyp detection algorithm. This study adds to the mounting evidence that physicians may use machine learning methods to augment their efforts, especially during procedures, to improve the quality of care for patients.

Acknowledgements
This research was conducted by Daniel Freedman, Yochai Blau, Liran Katzir, Amit Aides, Ilan Shimshoni, Danny Veikherman, Tomer Golany, Ariel Gordon, Greg Corrado, Yossi Matias, and Ehud Rivlin, with support from Verily. We would like to thank all of our team members and collaborators who worked on this project with us, including: Nadav Rabani, Chen Barshai, Nia Stoykova, David Ben-Shimol, Jesse Lachter, and Ori Segol, 3D-Systems and many others. We'd also like to thank Yossi Matias for support and guidance. The research was conducted by teams from Google Health and Google Research, Israel.

Exploring Faster Screening with Fewer Tests via Bayesian Group Testing



How does one find a needle in a haystack? At the turn of World War II, that question took on a very concrete form when doctors wondered how to efficiently detect diseases among those who had been drafted into the war effort. Inspired by this challenge, Robert Dorfman, a young statistician at that time (later to become Harvard professor of economics), proposed in a seminal paper a 2-stage approach to detect infected individuals, whereby individual blood samples first are pooled in groups of four before being tested for the presence or absence of a pathogen. If a group is negative, then it is safe to assume that everyone in the group is free of the pathogen. In that case, the reduction in the number of required tests is substantial: an entire group of four people has been cleared with a single test. On the other hand, if a group tests positive, which is expected to happen rarely if the pathogen’s prevalence is small, at least one or more people within that group must be positive; therefore, a few more tests to determine the infected individuals are needed.
Left: Sixteen individual tests are required to screen 16 people — only one person’s test is positive, while 15 return negative. Right: Following Dorfman’s procedure, samples are pooled into four groups of four individuals, and tests are executed on the pooled samples. Because only the second group tests positive, 12 individuals are cleared and only those four belonging to the positive group need to be retested. This approach requires only eight tests, instead of the 16 needed for an exhaustive testing campaign.
Dorfman’s proposal triggered many follow-up works with connections to several areas in computer science, such as information theory, combinatorics or compressive sensing, and several variants of his approach have been proposed, notably those leveraging binary splitting or side knowledge on individual infection probability rates. The field has grown to the extent that several sub-problems are recognized and deserving of an entire literature on their own. Some algorithms are tailored for the noiseless case in which tests are perfectly reliable, whereas some consider instead the more realistic case where tests are noisy and may produce false negatives or positives. Finally, some strategies are adaptive, proposing groups based on test results already observed (including Dorfman’s, since it proposes to re-test individuals that appeared in positive groups), whereas others stick to a non-adaptive setting in which groups are known beforehand or drawn at random.

In “Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design”, we present an approach to group testing that can operate in a noisy setting (i.e., where tests can be mistaken) to decide adaptively by looking at past results which groups to test next, with the goal to converge on a reliable detection as quickly, and with as few tests, as possible. Large scale simulations suggest that this approach may result in significant improvements over both adaptive and non-adaptive baselines, and are far more efficient than individual tests when disease prevalence is low. As such, this approach is particularly well suited for situations that require large numbers of tests to be conducted with limited resources, as may be the case for pandemics, such as that corresponding to the spread of COVID-19. We have open-sourced the code to the community through our GitHub repo.

Noisy and Adaptive Group Testing in a Non-Asymptotic Regime
A group testing strategy is an algorithm that is tasked with guessing who, among a list of n people, carries a particular pathogen. To do so, the strategy provides instructions for pooling individuals into groups. Assuming a laboratory can execute k tests at a time, the strategy will form a kn pooling matrix that defines these groups. Once the tests are carried out, the results are used to decide whether sufficient information has been gathered to determine who is or is not infected, and if not, how to form new groups for another round of testing.

We designed a group testing approach for the realistic setting where the testing strategy can be adaptive and where tests are noisy — the probability that the test of an infected sample is positive (sensitivity) is less than 100%, as is the specificity, the probability that a non-infected sample returns negative.

Screening More People with Fewer Tests Using Bayesian Optimal Experimental Design
The strategy we propose proceeds the way a detective would investigate a case. They first form several hypotheses about who may or may not be infected, using evidence from all tests (if any) that have been carried out so far and prior information on the infection rate (a). Using these hypotheses, our detectives produce an actionable item to continue the investigation, namely a next wave of groups that may help in validating or invalidating as many hypotheses as possible (b), and then loop back to (a) until the set of plausible hypotheses is small enough to unambiguously identify the target of the search. More precisely,
  1. Given a population of n people, an infection state is a binary vector of length n that describes who is infected (marked with a 1), and who is not (marked with a 0). At a certain time, a population is in a given state (most likely a few 1’s and mostly 0’s). The goal of group testing is to identify that state using as few tests as possible. Given a prior belief on the infection rate (the disease is rare) and test results observed so far (if any), we expect that only a small share of those infection states will be plausible. Rather than evaluating the plausibility of all 2n possible states (an extremely large number even for small n), we resort to a more efficient method to sample plausible hypotheses using a sequential Monte Carlo (SMC) sampler. Although quite costly by common standards (a few minutes using a GPU in our experimental setup), we show in this work that SMC samplers remain tractable even for large n, opening new possibilities for group testing. In short, in return for a few minutes of computations, our detectives get an extensive list of thousands of relevant hypotheses that may explain tests observed so far.

  2. Equipped with a relevant list of hypotheses, our strategy proceeds, as detectives would, by selectively gathering additional evidence. If k tests can be carried out at the next iteration, our strategy will propose to test k new groups, which are computed using the framework of Bayesian optimal experimental design. Intuitively, if k=1 and one can only propose a single new group to test, there would be clear advantage in building that group such that its test outcome is as uncertain as possible, i.e., with a probability that it returns positive as close to 50% as possible, given the current set of hypotheses. Indeed, to progress in an investigation, it is best to maximize the surprise factor (or information gain) provided by new test results, as opposed to using them to confirm further what we already hold to be very likely. To generalize that idea to a set of k>1 new groups, we score this surprise factor by computing the mutual information of these “virtual” group tests vs. the distribution of hypotheses. We also consider a more involved approach that computes the expected area under the ROC curve (AUC) one would obtain from testing these new groups using the distribution of hypotheses. The maximization of these two criteria is carried out using a greedy approach, resulting in two group selectors, GMIMAX and GAUCMAX (greedy maximization of mutual information or AUC, respectively).
The interaction between a laboratory (wet_lab) carrying out testing, and our strategy, composed of a sampler and a group selector, is summarized in the following drawing, which uses names of classes implemented in our open source package.
Our group testing framework describes an interaction between a testing environment, the wet_lab, whose pooled test results are used by the sampler to draw thousands of plausible hypotheses on the infection status of all individuals. These hypotheses are then used by an optimization procedure, group_selector, that figures out what groups may be the most relevant to test in order to narrow down on the true infection status. Once formed, these new groups are then tested again, closing the loop. At any point in the procedure, the hypotheses formed by the sampler can be averaged to obtain the average probability of infection for each patient. From these probabilities, a decision on whether a patient is infected or not can be done by thresholding these probabilities at a certain confidence level.
Benchmarking
We benchmarked our two strategies GMIMAX and GAUCMAX against various baselines in a wide variety of settings (infection rates, test noise levels), reporting performance as the number of tests increases. In addition to simple Dorfman strategies, the baselines we considered included a mix of non-adaptive strategies (origami assays, random designs) complemented at later stages with the so-called informative Dorfman approach. Our approaches significantly outperform the others in all settings.
We executed 5000 simulations on a sample population of 70 individuals with an infection rate of 2%. We have assumed sensitivity/specificity values of 85% / 97% for tests with groups of maximal size 10, which are representative of current PCR machines. This figure demonstrates that our approach outperforms the other baselines with as few as 24 tests (up to 8 tests used in 3 cycles), including both adaptive and non-adaptive varieties, and performs significantly better than individual tests (plotted in the sensitivity/specificity plane as a hexagon, requiring 70 tests), highlighting the savings potential offered by group testing. See preprint for other setups.
Conclusion
Screening a population for a pathogen is a fundamental problem, one that we currently face during the current COVID-19 epidemic. Seventy years ago, Dorfman proposed a simple approach currently adopted by various institutions. Here, we have proposed a method to extend the basic group testing approach in several ways. Our first contribution is to adopt a probabilistic perspective, and form thousands of plausible hypotheses of infection distributions given test outcomes, rather than trust test results to be 100% reliable as Dorfman did. This perspective allows us to seamlessly incorporate additional prior knowledge on infection, such as when we suspect some individuals to be more likely than others to carry the pathogen, based for instance on contact tracing data or answers to a questionnaire. This provides our algorithms, which can be compared to detectives investigating a case, the advantage of knowing what are the most likely infection hypotheses that agree with prior beliefs and tests carried out so far. Our second contribution is to propose algorithms that can take advantage of these hypotheses to form new groups, and therefore direct the gathering of new evidence, to narrow down as quickly as possible to the "true" infection hypothesis, and close the case with as little testing effort as possible.

Acknowledgements
We would like to thank our collaborators on this work, Olivier Teboul, in particular, for his help preparing figures, as well as Arnaud Doucet and Quentin Berthet. We also thank Kevin Murphy and Olivier Bousquet (Google) for their suggestions at the earliest stages of this project, as well as Dan Popovici for his unwavering support pushing this forward; Ignacio Anegon, Jeremie Poschmann and Laurent Tesson (INSERM) for providing us background information on RT-PCR tests and Nicolas Chopin (CREST) for giving guidance on his work to define SMCs for binary spaces.