Deep learning demystified: Why testing is the key to success

By: Jara Linders

The deep learning revolution

Before we jump into testing, let’s quickly recap what deep learning is all about. Deep learning is a subfield of machine learning that has taken the tech world by storm in recent years. It’s responsible for some of the most impressive advancements in artificial intelligence, from image recognition and natural language processing to autonomous vehicles and medical diagnosis. In my previous blog post I explained how deep learning is the backbone of our BoneMRI product: we can’t do without it!

At the core of deep learning are neural networks, inspired by the human brain’s interconnected neurons. These networks learn patterns and make decisions based on data, allowing them to perform tasks that would have seemed like science fiction not too long ago. But here’s the secret: the effectiveness of these models depends heavily on how well they are tested and evaluated.

Why testing matters

Testing helps us ensure that our models are not only learning from the data but also generalising their knowledge to new, unseen data. It’s like preparing for an exam: you don’t want to memorise answers; you want to understand the concepts so that you can tackle any new question thrown at you. In the world of deep learning, testing means challenging the model with data it hasn’t seen before to check whether it truly comprehends the underlying patterns.

To give an example of what can go wrong, let’s go back in time to some preliminary BoneMRI experiments. At the start we had limited data, so the first deep learning models were naively trained and validated on data from the same clinical centre. Just when we thought we had created the best-performing model ever, we received data from a new clinical centre. As you might guess, this centre used a slightly different MRI setup, and our freshly trained model could not stand up to this real-world variation. In deep learning terms this is a classic example of overfitting: the model fails to generalise to new, unseen data. In other words, we had overfitted our system to memorise the answers for a single clinical centre.

For a medical device especially, the data you test on should demonstrate safety and generalisation to any medical scenario the product is intended for. Of course it’s impossible to cover every situation, so for BoneMRI validation we carefully select data from patient populations and scanners that were not presented to the model during training. If the model performs well on this unseen data, that provides a good estimate of its generalisation capabilities in real-world practice.
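The idea of holding out entire clinical centres can be sketched in a few lines. This is an illustrative example, not our actual pipeline: the scan IDs and centre names are made up, and a real split would also balance patient populations and scanner types.

```python
# Sketch of a centre-aware split: entire clinical centres are held out for
# testing, so the model is always evaluated on data from centres it never
# saw during training. Scan IDs and centre names are hypothetical.
scans = {f"scan_{i:02d}": centre
         for i, centre in enumerate(["centre_A"] * 4 + ["centre_B"] * 4 + ["centre_C"] * 4)}

def split_by_centre(scans, test_centres):
    """Assign every scan to train or test based on its centre, never mixing."""
    train = [s for s, c in scans.items() if c not in test_centres]
    test = [s for s, c in scans.items() if c in test_centres]
    return train, test

train, test = split_by_centre(scans, test_centres={"centre_C"})
# No centre contributes data to both sides of the split:
assert {scans[s] for s in train}.isdisjoint({scans[s] for s in test})
print(len(train), len(test))  # → 8 4
```

The point is that the unit of splitting is the centre, not the individual scan: splitting scan-by-scan would leak centre-specific characteristics into the test set and hide the kind of overfitting described above.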

Example of an old BoneMRI experiment where testing revealed that there was still some work to do. Left: Perfect BoneMRI from an MRI centre the model was trained on. Right: Not so perfect BoneMRI from an MRI centre the model never saw before.

The importance of good metrics

In one of our previous blog posts we explained how BoneMRI software is clinically validated by performing clinical studies and investigations. However, you can imagine that development wouldn’t move very fast if we needed a clinical study for every deep learning experiment to decide whether it’s an improvement. To make informed decisions during development, metrics are vital. Metrics, in the context of deep learning, are quantitative measures that help us assess how well our models are performing, and they can be computed on-the-fly while we develop our algorithms.

During development, metrics help set a benchmark for your model’s performance. Good metrics make it clear when your model has reached the desired level of performance. By continually testing your models and tracking their metrics, you can identify areas where they fall short and fine-tune them accordingly. This iterative process is at the core of building trustworthy deep learning models.

The metrics behind BoneMRI

Our previous deep learning blog post explains that BoneMRI models are trained to transform MRI images into CT images. Therefore, most BoneMRI metrics are computed by comparison with a CT of the same patient. To demystify the magic behind BoneMRI, we will reveal to you our most important metrics:

Bone Morphology
Accurate morphology (shape) of the bone is important to guarantee safe use of our device in surgical planning and guidance workflows. For example, when using BoneMRI as guidance to place screws in the spine, it is important that the shape of the bone truly reflects its shape in reality. We assess the morphological accuracy of bone with the surface distance metric, which is computed for individual bone segments between BoneMRI and the ground-truth CT.

Analysis of the morphological accuracy based on a bone segment on the BoneMRI (left), a bone segment on the CT (middle) and the computed surface distance map (right). Green means perfect alignment.
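A common way to compute a surface distance between two binary bone masks is via Euclidean distance transforms. The sketch below is illustrative only (not the actual BoneMRI implementation) and assumes isotropic 1 mm voxels; a real implementation would account for voxel spacing.

```python
# Illustrative sketch: symmetric mean surface distance between two binary
# bone segmentations, assuming isotropic 1 mm voxels.
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def surface(mask):
    """Boundary voxels of a binary mask (mask minus its erosion)."""
    return mask & ~binary_erosion(mask)

def mean_surface_distance(pred, ref):
    ps, rs = surface(pred), surface(ref)
    # distance_transform_edt gives, at each voxel, the distance to the
    # nearest zero voxel; ~rs is zero exactly on the reference surface.
    d_to_ref = distance_transform_edt(~rs)[ps]    # pred surface -> ref surface
    d_to_pred = distance_transform_edt(~ps)[rs]   # ref surface -> pred surface
    return (d_to_ref.sum() + d_to_pred.sum()) / (ps.sum() + rs.sum())

# Two identical cubes give a distance of 0.0 mm:
a = np.zeros((20, 20, 20), dtype=bool)
a[5:15, 5:15, 5:15] = True
print(mean_surface_distance(a, a))  # → 0.0
```

Averaging both directions (prediction to reference and back) keeps the metric symmetric, so neither segmentation is privileged.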

Radiodensity Accuracy
Radiodensity refers to the degree to which a substance absorbs x-rays and is commonly expressed in Hounsfield Units (HU) in CT scans. The normalised HU scale allows for reproducible imaging, makes differences and deviations easy to spot, and simplifies and standardises the reading and reporting of images. By measuring the radiodensity of different tissues and structures within the body, radiologists can distinguish between normal and abnormal tissues. This helps in diagnosing various medical conditions, such as fractures, tumours, and infections. Unlike CT, MRI is non-quantitative and thus not expressed in HU. Although BoneMRI is based on MRI, the underlying deep learning model is explicitly trained to generate images in the HU scale. To optimise for this, the accuracy of radiodensity in BoneMRI is evaluated by computing the difference in HU between BoneMRI and CT.

Analysis of the radiodensity accuracy of a BoneMRI (left) compared to a CT (middle) with the differences in Hounsfield Units (right). Blue indicates an underestimation of the HU on the BoneMRI, and red an overestimation.
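A minimal version of such a radiodensity metric is the mean absolute HU difference inside a region of interest. The arrays and values below are made up for illustration; they are not BoneMRI data or the exact metric we ship.

```python
# Sketch of a radiodensity accuracy metric: mean absolute difference in
# Hounsfield Units between a synthetic CT and the reference CT, restricted
# to a region of interest. Values are illustrative.
import numpy as np

def mean_hu_error(synthetic_ct, reference_ct, roi_mask):
    """Mean absolute HU difference inside the region of interest."""
    diff = synthetic_ct[roi_mask] - reference_ct[roi_mask]
    return np.abs(diff).mean()

ref = np.full((4, 4), 300.0)         # e.g. bone-like HU values
syn = ref + 25.0                     # synthetic CT overestimates by 25 HU
roi = np.ones_like(ref, dtype=bool)  # here: evaluate over the whole image
print(mean_hu_error(syn, ref, roi))  # → 25.0
```

Keeping the sign of the difference (as in the blue/red map above) additionally tells you whether the model under- or overestimates HU, not just by how much.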

Do we measure what we truly want?

BoneMRI images are perceived and interpreted by the trained eye of a medical specialist, who expects an image that looks like the usual CT image. Optimising for high similarity between those images is therefore important. A limitation of using only such metrics is that it is very challenging to fully replicate human perception, especially the critical eye of the trained medical expert. In a perfect world we would use the professional opinion of a specialist to assess the millions of attempts a model needs to optimise towards a good solution. However, this would require an army of radiologists working round the clock for years. In order to develop and iterate quickly, we need metrics that can be computed deterministically, quickly and automatically, like the surface distance and the difference in Hounsfield Units.

To mitigate the limitations of these metrics, we always include a validation phase in which we review the resulting BoneMRI images with a clinical expert before a new algorithm is released. In this way, we make sure our metrics didn’t miss anything, and it gives us an opportunity to validate and reflect: “do we measure what we truly want?”. Even better, most of our metrics are linked to clinical outcomes or reproduced by clinical measures, such as in the study: MRI-based synthetic CT of the lumbar spine: Geometric measurements for surgery planning in comparison with CT.

In the world of deep learning, where we’re teaching machines to make sense of vast amounts of data, good metrics are our guiding lights. They help us steer our models in the right direction, fine-tune their performance, and ensure they produce reliable results.

So, whether you’re building an AI to classify images of adorable kittens or diagnose life-threatening diseases, remember this: good metrics are your best friends on this journey. They help you measure success, set goals, and keep iterating until you achieve AI greatness.