Can systems based on AI be certified for medical use?
The answer to this question needs to consider the types of algorithm used in the AI system. Devices that make use of sensor data captured from patients might be expected to have very well-defined input/output mappings. However, as the number of sensor measurements increases, or as the history over which a signal is analysed (i.e. its time-scale) grows, there is arguably an increased possibility of unexpected device behaviour. In particular, when a system’s performance is defined by the examples to which it has been exposed during network training, specific requirements may need to be met in the certification process.
We should make a distinction between “AI” that refers to partially autonomous behaviour based on a well-defined rule set or schema, and AI which incorporates machine learning in some way. For the former, one can see medical devices that contain AI as “business as usual”. For the latter, there may be extra requirements on aspects of system design that are traditionally considered under software versioning. For example, all other things being equal, it is possible to entirely change a system’s function by altering the weights of the network. Since weight changes are very easy to effect, and the difference between two trained networks can be difficult to detect, there is the possibility for error.
There are two obvious solutions to this conundrum, depending on the complexity of the function performed by the network, and on the complexity of signals or data on which it operates. Let us take the example of
“symptoms checkers” (Armstrong, 2018; Elliot et al., 2015). Semigran et al. (2016) proposed the use of patient diagnostic vignettes: examples of possible diagnostic cues that could be provided by patients. These take the form of lists of patient symptoms for which gold standard diagnoses by experienced human doctors are maintained. One could envisage that such a test could be expanded to many different types of diagnostic data, and indeed the development of techniques that cope with missing or ambiguous data is an area of active research (Campos et al., 2015).
So, one of the likely mechanisms required for certification in the future will be based on agreed sets of sensor signal examples, symptoms, and even images, which represent an open, accepted standard against which all candidate systems must demonstrate a required level of agreement with human expert diagnoses. Precisely how such datasets will be established and agreed upon is unclear, but it is likely to be an essential part of the certification process for medical devices which exceed a certain minimum degree of complexity.
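To make this concrete, such an agreement test can be sketched in a few lines. The following is a minimal, hypothetical illustration: the vignette contents, the toy predictor, and the 0.9 agreement threshold are all assumptions for the sake of the example, not part of any real certification scheme.

```python
# Hypothetical sketch: scoring a candidate system against a standard
# vignette set with gold standard diagnoses. All data are illustrative.

def agreement_rate(vignettes, predict):
    """Fraction of vignettes on which the system matches the gold diagnosis."""
    matches = sum(
        1 for v in vignettes if predict(v["symptoms"]) == v["gold_diagnosis"]
    )
    return matches / len(vignettes)

vignettes = [
    {"symptoms": ["fever", "cough"], "gold_diagnosis": "influenza"},
    {"symptoms": ["chest pain", "dyspnoea"], "gold_diagnosis": "angina"},
]

def toy_predictor(symptoms):
    # Stand-in for a trained diagnostic system.
    return "influenza" if "fever" in symptoms else "angina"

rate = agreement_rate(vignettes, toy_predictor)
certified = rate >= 0.9  # illustrative minimum agreement level
```

A real scheme would of course involve far larger vignette sets, handling of missing or ambiguous data, and statistical rather than point estimates of agreement; the sketch only shows the shape of the comparison.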
Once a system has been guaranteed to perform to a required standard, information on the software, hardware and operating systems – the entire software stack – should be captured. In addition, since function is largely determined by network weights, these should be uniquely encoded in some way. For example, a hash, or some equivalent of a checksum, could be used to produce a unique signature of the weights that define a particular network’s behaviour.
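A weight signature of this kind is straightforward to compute. The sketch below, assuming a fixed flattening order and a canonical float64 serialisation (both assumptions of this example), shows how even a minute, hard-to-spot weight change produces a completely different signature:

```python
import hashlib
import struct

def weight_signature(weights):
    """SHA-256 over a canonical little-endian float64 serialisation of
    the network weights, flattened in a fixed, agreed order."""
    h = hashlib.sha256()
    for w in weights:
        h.update(struct.pack("<d", w))
    return h.hexdigest()

w_original = [0.25, -1.5, 3.0]
w_altered = [0.25, -1.5, 3.0001]  # a tiny change, invisible to casual inspection

sig_original = weight_signature(w_original)
sig_altered = weight_signature(w_altered)
# The two signatures differ completely, flagging the altered network.
```

The design point is that the serialisation must be canonical: if two tools flatten the same weights in different orders or precisions, identical networks would yield different signatures.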
Finally, a question naturally arises: should the data or examples used to train an ML system also be maintained, or subject to scrutiny? Scrutiny of data might sound impractical: the data sizes are large by definition. But tools can be created to find pathological items in datasets, which might lead to bad clinical decisions or device behaviour; and datasets and trained models can be treated as an inseparable pair, and assigned a joint signature.
Given the common practice of retraining networks with new data, verifying a unique model-and-data combination seems a sensible part of the certification process. In addition, the data or examples used in training a network should be archived and guaranteed to be retained as part of the system specification.
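Treating the dataset and the trained model as an inseparable pair can be made concrete with a joint signature. The following is a minimal sketch, assuming weights and training records can be given a canonical JSON serialisation (the record fields shown are illustrative):

```python
import hashlib
import json

def joint_signature(weights, training_records):
    """Single SHA-256 signature binding network weights to the exact
    training data. sort_keys keeps the serialisation order-stable."""
    h = hashlib.sha256()
    h.update(json.dumps(weights, sort_keys=True).encode())
    for record in training_records:
        h.update(json.dumps(record, sort_keys=True).encode())
    return h.hexdigest()

weights = [0.12, -0.7, 2.3]
data = [{"id": 1, "label": "benign"}, {"id": 2, "label": "malignant"}]

sig = joint_signature(weights, data)
# Retraining with added data changes the joint signature,
# even if the resulting weights happened to be identical.
sig_retrained = joint_signature(weights, data + [{"id": 3, "label": "benign"}])
```

In practice, large datasets would be hashed incrementally or via per-item digests in a Merkle-style structure rather than serialised wholesale, but the principle – one signature for one model-and-data pair – is the same.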
Online learning remains an open topic in machine learning. In online learning, models are updated while an inference system is live. For the healthcare setting, we should distinguish between periodically updating performance across a large population and adaptive behaviour within an intelligent device. Online learning is more likely to be useful if applied using large amounts of data from several individuals, batched together: this is appropriate to the diagnostic setting. On the other hand, we might consider the actions of a specific device with a single patient as adaptive, patient-specific behaviour. The latter is more likely to fall under the category of reinforcement learning; then, we might imagine that certain bounds of behaviour would have to be defined which hold for all patients: adaptation would then be within those agreed bounds. How to establish such bounds remains to be seen.
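One simple realisation of "adaptation within agreed bounds" is to clamp every adaptive update to a certified safe range. The sketch below is purely illustrative: the gain parameter, the update rule, and the bounds are all assumptions, not a proposal for any real device.

```python
# Hypothetical sketch of patient-specific adaptation constrained to
# pre-agreed bounds certified for all patients. Values are illustrative.

AGREED_BOUNDS = (0.5, 2.0)  # certified safe range for the adaptive gain

def adapt_gain(current_gain, observed_error, learning_rate=0.1):
    """One adaptive update, clamped so the gain can never leave the
    range certified as safe for all patients."""
    proposed = current_gain + learning_rate * observed_error
    low, high = AGREED_BOUNDS
    return min(max(proposed, low), high)

gain = 1.0
for error in [5.0, 5.0, 5.0]:  # repeated large errors push the gain upward
    gain = adapt_gain(gain, error)
# The gain saturates at the upper bound rather than growing without limit.
```

Real bounded-adaptation schemes would need to constrain trajectories and combinations of parameters, not just individual values, which is part of why establishing such bounds remains an open problem.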
This raises, of course, questions around the nature of the examples themselves, and how the examples used to train a system using machine learning are labelled according to some form of gold standard. Although, in most fields of diagnostic medicine, there are recognised gold standards for diagnostic tests, treatment decisions by individual clinicians might diverge. But when an AI system is to be deployed for patient triage, how does one establish the “gold standard” for treatment? For palliative or intensive care, how do quality-of-life considerations come into the picture in establishing the “best” outcome? Here, it seems that we are facing a wider challenge, perhaps to medicine as a whole. It might be that the way forward is to have agreement on standardisation and sharing of clinical records, pooled across hospitals and with fine-grained clinical detail. Thus, the gold standards will emerge, and even the rare events, from which both machines and humans can learn, will be captured.
This is an excerpt from the white paper Recent advancements in AI - implications for medical device technology and certification. To download our other medical device white papers, please visit the Insight page on the Compliance Navigator website.
The Compliance Navigator blog is issued for information only. It does not constitute an official or agreed position of BSI Standards Ltd or of the BSI Notified Body. The views expressed are entirely those of the authors.