The use of artificial intelligence in medical devices is on the rise. Although there is no specific directive or regulatory framework yet, AI decisions still need to be understandable and verifiable to ensure patient safety. TÜV SÜD shares key aspects of artificial intelligence in the medtech industry.

Dr. Abtin Rad, global director, functional safety, software, and digitization

March 17, 2021

Image by Gerd Altmann from Pixabay

In early January 2020, the World Health Organization (WHO) released information about a special case of a flu-like disease in Wuhan, China. However, a Canadian company specializing in artificial intelligence (AI)-based monitoring of the spread of infectious diseases had already warned its customers of the risk of an epidemic in China as early as December 2019.1 The warning had been derived from AI-based analyses of news reports and articles in online networks for animal and plant diseases. Access to global flight ticket data enabled AI to correctly forecast the spread of the virus in the days after it emerged.

Lack of Regulatory Framework

The example reveals the capabilities of AI and machine learning (ML). Both are also used in an increasing number of medical devices, for example, in the form of integrated circuits. Despite the risks likewise associated with the use of AI, common standards and regulations do not yet include specific requirements addressing these innovative technologies. The European Union’s Medical Device Regulation (MDR), for example, only sets forth general software requirements. According to the regulation, software must be developed and manufactured in line with the state of the art and designed for its intended use.

This implies that AI, too, must ensure predictable and reproducible performance, which in turn requires a verified and validated AI model. The requirements for validation and verification are described in the software standards IEC 62304 and IEC 82304-1. However, there are still fundamental differences between conventional software and artificial intelligence with machine learning. Machine learning is based on using data to train a model, without explicitly programming its behavior. As training progresses, the model is continually improved and optimized, with “hyperparameters” steering the training process itself.
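
To make this difference concrete, the following minimal sketch (in Python with scikit-learn, using a synthetic dataset and illustrative hyperparameter values that are assumptions of this example, not part of any standard) trains a classifier whose behavior is learned from data, while hyperparameters such as the regularization strength C steer the training process and are selected through cross-validation.

```python
# Minimal sketch: the model's behaviour is learned from data rather than
# explicitly programmed; the "hyperparameters" (here the candidate values of C)
# steer the training process itself. Dataset and values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate hyperparameter values; the learned model parameters (weights)
# are fitted from the training data for each candidate.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_train, y_train)

print("best hyperparameters:", search.best_params_)
print("validation accuracy:", search.best_estimator_.score(X_val, y_val))
```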

Testing AI Training Data and Defining the Scope

Data quality is crucial for the forecasts delivered by AI. Frequent problems include bias, over-fitting or under-fitting of the model, and labeling errors in supervised machine-learning models. Thorough testing reveals some of these problems.

Such testing shows that bias and labeling errors are often caused unintentionally by training data that lack diversity. Take the example of an AI model trained to recognize apples. If the data used to train the model consist predominantly of green apples of different shapes and sizes, the model might identify a green pear as an apple but fail to recognize a red apple. Under certain circumstances, the AI may accidentally rate features that happen to be common in the training data as significant even though they are irrelevant. The statistical distribution of the data must therefore be justified and correspond to the real environment. The existence of two legs, for example, must not become the critical factor for classifying something as a human being.
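
A simple plausibility check of this kind can be automated. The sketch below stays with the apple example and uses made-up counts and target shares; it compares the distribution of one attribute in the training data against the distribution expected in the real environment and flags under-represented values.

```python
# Minimal sketch of a data-diversity check: compare the distribution of an
# attribute (here "color") in the training set against the distribution
# expected in the real environment. Counts and expected shares are invented.
from collections import Counter

training_labels = ["green"] * 940 + ["red"] * 40 + ["yellow"] * 20
expected_share = {"green": 0.45, "red": 0.40, "yellow": 0.15}

counts = Counter(training_labels)
total = sum(counts.values())

for color, share in expected_share.items():
    observed = counts[color] / total
    flag = "  <-- under-represented" if observed < 0.5 * share else ""
    print(f"{color:7s} expected {share:.0%}, observed {observed:.0%}{flag}")
```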

Labeling errors are also caused by subjectivity (“severity of disease”) or identifiers that are unsuitable for the purpose of the model. Labeling of large data volumes and selection of suitable identifiers is a time- and cost-intensive process. In some cases, only a very minor amount of the data will be processed manually. These data are used to train AI. Subsequently, AI is instructed to label the remaining data. This process is not always error-free, which in turn means that errors will be reproduced.
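
The following sketch illustrates this semi-supervised workflow with scikit-learn's SelfTrainingClassifier on a synthetic dataset; the 5 percent labeling share and the base model are illustrative assumptions. It shows how the labels the model assigns automatically can be compared against a reference to estimate how many errors would be propagated.

```python
# Minimal sketch of the semi-supervised workflow described above: only a small
# fraction of the data is labelled by hand, a model is trained on it, and the
# model then labels the remaining data itself. Any labelling errors in this
# step are reproduced in the final model. Dataset is synthetic for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y_true = make_classification(n_samples=2000, n_features=20, random_state=0)

# Pretend only 5 percent of the samples were labelled manually (-1 = unlabelled).
y_partial = np.full_like(y_true, -1)
labelled_idx = np.random.RandomState(0).choice(len(y_true), size=100, replace=False)
y_partial[labelled_idx] = y_true[labelled_idx]

self_training = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
self_training.fit(X, y_partial)

# Compare the automatically assigned labels against the (normally unknown) truth.
pseudo_labels = self_training.transduction_
print("share of correct automatic labels:", (pseudo_labels == y_true).mean())
```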

Key success factors are data quality and the volume of data used. So far, empirical estimates of the amount of data an algorithm requires are few and far between. While even a weak algorithm can perform well if the quality and volume of data are high enough, in most cases capabilities are limited by the availability of (labeled) data and by computing power. The minimum amount of data required depends on the complexity of both the problem and the AI algorithm, with non-linear algorithms generally requiring more data than linear ones.
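
One empirical way to estimate whether the available data volume is sufficient is a learning curve, which tracks validation performance as the training set grows. The sketch below uses a synthetic dataset and a random-forest classifier purely as placeholders.

```python
# Minimal sketch of an empirical data-volume check: a learning curve shows how
# validation performance develops as the training set grows. Dataset and model
# are placeholders for illustration only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)

sizes, _, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training samples -> mean validation accuracy {score:.3f}")
```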

Normally, 70 to 80 percent of the available data are used to train the model, while the rest is used to verify its predictions. The data used for AI training should cover the widest possible range of attributes.
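
A minimal sketch of such a split is shown below; the 80/20 ratio and the synthetic, imbalanced dataset are illustrative. Stratifying on the label keeps the class distribution comparable in both subsets, which helps the held-out data cover the same range of attributes as the training data.

```python
# Minimal sketch of the split described above: roughly 80 percent of the data
# for training and 20 percent held back to verify the predictions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print("training samples:", len(X_train), " verification samples:", len(X_test))
print("positive share (train):", y_train.mean(), " (test):", y_test.mean())
```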

Example: Identification of Osteoarthritis of the Knee

According to a black-box AI, one of the two patients shown in the following images will develop osteoarthritis of the knee within the next three years. The difference is invisible to the human eye, and the diagnosis cannot be verified. Would a patient still choose to undergo surgery? (The images are taken from “Making Medical AI Trustworthy,” IEEE Spectrum, August 2018 [https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8423571], and originally from The Osteoarthritis Initiative [https://nda.nih.gov/]. This article reflects the views of the author and may not reflect the opinions or views of the NIH or of the researchers who submitted the original data to The Osteoarthritis Initiative.)

Figure 1. This patient will not suffer from osteoarthritis in the next three years.

Figure 2. This patient will suffer from osteoarthritis in the next three years.

AI: Beware of the Black-Box Problem

The transparency of the AI algorithm used in a medical device is clinically relevant. Because AI models have highly complex, non-linear structures, they often operate as a “black box”: it can be very difficult, if not impossible, to understand how they reach their decisions. In that case, experts can no longer determine which part of the data fed into the model (e.g., a diagnostic image) triggers the decision made by the AI (e.g., cancer tissue detected in the image).
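
One generic way to probe such a black box is occlusion sensitivity: patches of the input image are masked one at a time, and the change in the model's output indicates which regions drive the decision. The sketch below is a simplified illustration of that idea; the score function and the toy model are placeholders, not part of any specific product or library.

```python
# Minimal sketch of occlusion sensitivity for a black-box image model: mask
# one patch at a time and record how much the output score changes.
import numpy as np

def occlusion_map(image, score_fn, patch=8):
    """Return a heat map of how much each occluded patch changes the score."""
    baseline = score_fn(image)
    heat = np.zeros_like(image, dtype=float)
    for i in range(0, image.shape[0], patch):
        for j in range(0, image.shape[1], patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask one patch
            heat[i:i + patch, j:j + patch] = baseline - score_fn(occluded)
    return heat

# Toy stand-in for a trained model: it "responds" to bright pixels at top left.
dummy_model = lambda img: img[:16, :16].mean()
example_image = np.random.rand(64, 64)
print("strongest influence:", occlusion_map(example_image, dummy_model).max())
```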

AI methods used in the reconstruction of MRI and CT images have also proven unstable in some cases. Even minor changes in the input images may lead to completely different results. One reason is that some algorithms are developed with accuracy, but not stability, in mind.
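
Such instability can be screened for with a simple perturbation test: feed the model slightly noisy copies of the same input and measure how much the output changes. The sketch below shows the idea with a placeholder model; the noise level and number of trials are arbitrary assumptions.

```python
# Minimal sketch of a stability check: a large output change for a tiny input
# perturbation signals an unstable reconstruction or classification.
import numpy as np

def stability_check(model, image, noise_level=0.01, trials=20, seed=0):
    rng = np.random.default_rng(seed)
    reference = model(image)
    deviations = []
    for _ in range(trials):
        perturbed = image + noise_level * rng.standard_normal(image.shape)
        deviations.append(abs(model(perturbed) - reference))
    return max(deviations)

# Toy placeholder model; real use would wrap e.g. an image-reconstruction net.
model = lambda img: img.mean()
print("worst-case output change:", stability_check(model, np.random.rand(64, 64)))
```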

Without transparent and explainable AI forecasts, the medical validity of a decision may be called into doubt. Errors already observed in pre-clinical AI applications further increase these doubts. To ensure safe use in patients, however, experts must be able to explain the decisions made by AI. This is the only way to inspire and maintain trust.

The following figures demonstrate the differences between black-box and white-box AI.

Figure 3. Black-box AI.

Figure 4. Opening black-box AI.

 

Data Quality

The figures below demonstrate the effects of training AI using low-quality data, and a short code sketch after the figures reproduces the over- and under-fitting behavior numerically. Examples include:

  • Biased data: systematic bias in assigning entries to one category of results.

  • Over-fitting of data (see Figure 6): characteristics of little or no relevance are included and weighted excessively.

  • Under-fitting of data (see Figure 7): the model does not represent the training data with sufficient accuracy.

Figure 5. Effect of training using low-quality data.

Figure 6. Over-fitting (red line) of the data (points): characteristics of little or no relevance are included and weighted excessively.

Figure 7. Under-fitting (red line) of the data (points): the model does not represent the training data with sufficient accuracy.
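
The over- and under-fitting behavior shown in Figures 6 and 7 can be reproduced numerically. The sketch below fits the same noisy points with polynomials of too low, suitable, and too high degree and compares their error on held-out data; all values are illustrative.

```python
# Minimal sketch of under- and over-fitting: the same noisy points fitted with
# polynomials of increasing degree, judged by their error on clean test points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 12):        # under-fit, reasonable fit, over-fit
    coeffs = np.polyfit(x, y, degree)
    error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test error {error:.3f}")
```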

 

Free Guidance for Developers and Manufacturers

A free checklist published by the Interest Group of the Notified Bodies in Germany (IG-NB) lists about 150 requirements for the development and post-market surveillance of medical devices (see info box below). Until standards governing the safety of AI-based medical devices are published, this guidance can be used to minimize risks across the lifecycle of medical AI and facilitates placing new technologies on a market that is highly regulated by nature.

 

Checklist for Medical Devices with AI

Safety of AI-based medical devices requires a process-focused approach across all phases of the product lifecycle. The checklist published by IG-NB covers the following three areas:

General requirements

General requirements include the certifiability of AI, the pertinent processes and the competencies required in development, as well as thorough documentation.

Requirements for product development

Tasks include identifying users, gathering software requirements, and developing and evaluating models.

Requirements for downstream phases

After development, the focus must be directed to production, distribution, and installation. This is as important as continual post-market surveillance.

The full list (in German only) is available for free download from: www.ig-nb.de/dok_view?oid=795601

 

Reference

1. https://www.rnd.de/digital/coronavirus-warum-ein-algorithmus-zuerst-von-der-epidemie-wusste-JE32CSE745EW7CBU5ESLVE36ZE.html

 

About the Author(s)

Dr. Abtin Rad

global director, functional safety, software, and digitization, TÜV SÜD Product Service GmbH

Dr. Abtin Rad serves as global director, functional safety, software, and digitization, for TÜV SÜD Product Service GmbH.

Contact

TÜV SÜD Product Service GmbH
Medical & Health Services
Ridlerstraße 65
80339 Munich
Germany

www.tuvsud.com/en/industries/healthcare-and-medical-devices

