Machine learning and medical devices: data quality and bias

Data quality is a key factor in the success or failure of a machine learning system; in fact, data quality is as important as, or more important than, the machine learning algorithm. There are two main elements that determine that success: the dataset and the model. The dataset is what the model learns from. A model cannot learn anything beyond what this dataset contains, and the dataset's size and variability determine how readily a model can learn from it. Data scientists therefore play an important role in curating the data and scaling the algorithm to it.

AI may fail (i.e., become untrustworthy) either because the data was not representative or because it was not fit for the task to which it was applied. Therefore, the key to making medical AI more trustworthy is ensuring the necessary data quality and confirming that algorithms are sufficiently robust and fit for purpose. In short, ensuring the safety and effectiveness of AI depends on verification of data quality and validation of its suitability for the algorithm model. Furthermore, given that AI has the ability to change over time, the processes of verification and validation cannot be a one-time premarket activity, but must instead continue over the life cycle of a system, from the initial design and clinical substantiation, across its post-market use, until decommissioning. Continual assurance of the AI-based device’s safety and performance across its life cycle will help regulators, clinicians, and patients gain trust in machine learning AI.

There are many aspects that contribute to data quality, including the completeness, correctness, and appropriateness of the data; annotation; bias; and consistency in labelling of the data (e.g., different labels may mean the same thing but the algorithm treats them differently).
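To make the completeness and labelling-consistency points concrete, the short Python sketch below shows the kind of pre-training data-quality checks implied above. The field names and records are hypothetical and are not taken from the white paper.

```python
# A minimal sketch of basic data-quality checks before training.
# Field names ("patient_id", "age", "label") and values are hypothetical.
import pandas as pd

records = pd.DataFrame([
    {"patient_id": 1, "age": 45,   "label": "Pneumonia"},
    {"patient_id": 2, "age": None, "label": "pneumonia"},   # missing age
    {"patient_id": 3, "age": 62,   "label": "No finding"},
])

# Completeness: flag records with missing values.
missing = records[records.isna().any(axis=1)]
print(f"{len(missing)} record(s) with missing fields")

# Labelling consistency: "Pneumonia" and "pneumonia" mean the same thing,
# but an algorithm would treat them as different classes unless normalised.
print("Raw labels:", sorted(records["label"].unique()))
records["label"] = records["label"].str.strip().str.lower()
print("Normalised labels:", sorted(records["label"].unique()))
```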

Dataset annotations carry the variables, and potentially the biases, that humans apply so that an AI solution can learn to spot the features of interest.

Any bias that exists in a dataset will affect the performance of the machine learning system. There are many sources of bias, including population, frequency, and instrumentation bias.

Having a system that is unintentionally biased towards one subset of a patient population can result in poor model performance when it is faced with a different subset, and ultimately this can lead to healthcare inequities. Even when working with quality data, instances of intentional bias (also known as positive bias) can be present, such as a dataset made up only of people over the age of 70 in order to look at age-related health concerns.

When considering the application of a dataset for a machine learning application, it is important to understand the claims that can be made about it: whether a proper balance across the represented population classes has been achieved, whether the data can be reproduced, and whether any annotations are reliable. For example, a dataset could contain chest X-rays from males aged 18–30 in a specific country, half of whom have pneumonia. This dataset cannot claim to represent pneumonia in females. It may also not be able to claim to represent young males of a particular ethnic group, as this subgroup might not be recorded among the dataset variables and might not be plausibly represented within the sample size.
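The sketch below illustrates one way to tabulate which population subgroups such a dataset actually contains; subgroups that are absent (females, in this example) or variables that are not recorded at all (ethnicity, in this example) cannot support any claim of representation. The column names and figures are illustrative and are not from the white paper.

```python
# A minimal sketch of checking which subgroups a dataset can plausibly
# claim to represent. Column names and counts are hypothetical.
import pandas as pd

xrays = pd.DataFrame({
    "sex":       ["male"] * 500,                  # only males are present
    "age_band":  ["18-30"] * 500,                 # only one age band is present
    "pneumonia": [True] * 250 + [False] * 250,    # balanced positive/negative cases
})
# Note: "ethnicity" is not recorded at all, so no claim about ethnic
# subgroups can be supported by this dataset.

# Tabulate counts per subgroup: any subgroup that is absent or very small
# cannot be represented by a claim based on this dataset.
summary = xrays.groupby(["sex", "age_band", "pneumonia"]).size()
print(summary)
```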

The AI model is trained on the dataset and will learn from the variables and annotations that the dataset contains. In healthcare, the vast majority of neural networks are initially trained on a dataset, evaluated for accuracy, and then used for inference (e.g., by running the model on new images).
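A minimal sketch of that train / evaluate / infer workflow is shown below, using synthetic tabular data and scikit-learn in place of medical images and a production model; it is illustrative only.

```python
# Train -> evaluate -> infer, on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train the model on the dataset.
model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
model.fit(X_train, y_train)

# 2. Evaluate accuracy on held-out data.
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Inference: run the trained model on new, unseen cases.
new_cases = X_test[:5]
print("Predictions:", model.predict(new_cases))
```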

It is important to understand what the model can reliably identify (i.e., the model claims). Neural networks can generalize to a degree, allowing them to handle inputs that differ slightly from their training dataset. For example, a model that is carefully trained on male chest X-rays may also perform well on the female population, or with different X-ray equipment. The only way to verify this is to present the trained model with a new test dataset. Depending on the model's performance, it may be possible to demonstrate that the AI can accurately identify pneumonia across male and female patients and generalize across different X-ray machines. There may be minor differences in performance between datasets, but the model could still be more accurate than a human.
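One way to check such a claim is to score the trained model separately on each subgroup of an independent test set. The sketch below uses entirely synthetic data and hypothetical field names (it is not from the white paper); the point is the per-subgroup breakdown, not the numbers.

```python
# A minimal sketch of verifying a model's claims on an external test set,
# broken down by subgroup. Data and field names are synthetic and illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Train on a male-only synthetic cohort (one feature standing in for an image).
X_train = rng.normal(size=(500, 1))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

# External test set that also contains female patients.
X_test = rng.normal(size=(400, 1))
y_test = (X_test[:, 0] > 0).astype(int)
sex = np.array(["male", "female"] * 200)

# Report accuracy per subgroup: a claim of generalisation to female patients
# is only supportable if performance holds up on that subgroup.
for value in np.unique(sex):
    mask = sex == value
    acc = accuracy_score(y_test[mask], model.predict(X_test[mask]))
    print(f"{value}: accuracy = {acc:.3f} (n = {mask.sum()})")
```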

In summary, AI will learn the variables, biases, and annotations of a dataset, with the expectation that it can spot an important feature. Once trained, an algorithm is tested, revealing whether it is able to identify this feature with a certain level of accuracy. In order to test the claim that the AI can identify a specific item, it needs to be tested on a dataset that can claim to represent this feature fairly. If it performs to a satisfactory level on this dataset, the model can then claim to be able to identify this item in future datasets that share the same variables as the test dataset.

The following example shows an instance where a poor-quality dataset and its incorrect relationship with the algorithm model caused a failure in the output.

An adaptive learning classifier system[1] analyzed photographs to differentiate between wolves and huskies. Instead of detecting distinguishing features between the two animals, the system determined the most salient distinction was that photos of huskies included snow in the background, whereas photos of wolves did not. The system’s conclusions were correct with respect to its training data but were not usable in real-world scenarios, because extraneous and inappropriate variables (i.e., the backgrounds) were included in the learning dataset.[2] This is an example of how an AI system may detect incidental patterns or correlations in a dataset and assign a causal or meaningful relationship that is incorrect or irrelevant.
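A synthetic sketch of this failure mode is given below; it is entirely illustrative and is not the system described in [1]. A classifier is trained on data in which a spurious background feature ("snow") is perfectly correlated with the label, so it latches onto that feature and then performs poorly once the correlation no longer holds.

```python
# Synthetic demonstration of a spurious-correlation failure (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000

# Training data: feature 0 is a weak "animal appearance" signal;
# feature 1 ("snow in background") is spurious but perfectly correlated.
y_train = rng.integers(0, 2, n)                      # 1 = husky, 0 = wolf
appearance = y_train + rng.normal(0, 2.0, n)         # weak real signal
snow = y_train.astype(float)                         # perfect only in training
X_train = np.column_stack([appearance, snow])

model = LogisticRegression().fit(X_train, y_train)
print("Accuracy on training-style data:", model.score(X_train, y_train))

# Real-world data: snow is no longer correlated with the animal,
# so a model that relied on it loses most of its apparent accuracy.
y_test = rng.integers(0, 2, n)
X_test = np.column_stack([y_test + rng.normal(0, 2.0, n),
                          rng.integers(0, 2, n).astype(float)])
print("Accuracy on real-world data:", model.score(X_test, y_test))
```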

 


[1] https://arxiv.org/pdf/1602.04938.pdf

[2] A similar medical AI example occurred when Stanford researchers tested an AI tool to identify melanomas from pictures of moles and found the tool used the presence of rulers in the photos as a positive indicator of cancer. See http://stanmed.stanford.edu/2018summer/artificial-intelligence-putshumanity-health-care.html

 

This is an excerpt from the BSI/AAMI white paper: Machine learning AI in medical devices: adapting regulatory frameworks and standards to ensure safety and performance. To browse our collection of medical device white papers, please visit the Insight page on the Compliance Navigator website.


The Compliance Navigator blog is issued for information only. It does not constitute an official or agreed position of BSI Standards Ltd or of the BSI Notified Body.  The views expressed are entirely those of the authors.