Enhancing bone radiology images classification through appropriate preprocessing: a deep learning and explainable artificial intelligence approach
Introduction
Deep learning techniques are now widely applied in the medical field, and one of the most popular directions is the use of neural networks for medical image classification (1,2). A large number of high-performance deep learning classification models for medical images have been proposed, and since the emergence of the explainable artificial intelligence (XAI) concept, more and more researchers have incorporated XAI into their work. For works that combine deep learning models with XAI techniques for medical image classification, the most straightforward approach is to achieve the best possible performance with the deep learning model while using XAI techniques to verify that the model ‘learned’ the correct knowledge and based its classifications on it (3-5).
Current research on the use of deep learning for medical image recognition is thriving, and many researchers focus on the design and structure of the model itself. For common image recognition, many models achieve remarkable performance (6,7), and most of these works apply only basic, standard data preprocessing steps. Yet medical images, much like the diseases they represent, require ‘special treatment’.
In common image classification tasks, an image is recognized because of the object presented in it, meaning that every part of that object supports the corresponding prediction. A widely used data augmentation method called “cropping” segments an image into several partial images, which are added to the dataset under the same class as new data, thereby increasing the dataset size. For common images, where every part of the object supports the true label, cropping is as suitable for augmentation as flipping, rotation, etc. For medical images, however, the object is the body part in the image, and the key indication of a disease lies only in a part, sometimes a very small fragment, of that object; these indications are precisely what help human radiologists diagnose diseases from medical images. Cropping is therefore not a suitable augmentation method for medical images, because once it is applied, we risk labeling new images that contain no key indication as abnormal. What we wish to point out is that medical images differ from common ones, not only for the reason explained above, but also because we must be cautious and precise when developing methods for important medical use. Medical image data such as computed tomography (CT) scans, X-rays, and magnetic resonance imaging (MRI) now often have high resolution, with clear boundaries between body parts and between the background and the object itself. We believe it is therefore necessary and beneficial to preprocess them accordingly for higher accuracy and better performance.
On the other hand, with the rapid development of XAI driven by researchers’ growing awareness of the accountability and reliability of their creations, XAI techniques have become popular tools for validation and verification in black-box model research (8). Even with extraordinary accuracy and performance, a black-box model may learn the wrong knowledge and reach its results in the wrong way; XAI helps humans interpret black-box models and thereby supervise, observe, and verify them.
Additionally, in critical fields such as radiology and medicine, where human well-being is concerned, choosing a suitable XAI method for case-specific scenarios, so that human users can appropriately understand black-box models involved in important decision-making processes, is also a priority, as stated in the work of Retzlaff et al. (9).
In this paper, we utilize one of the most widely used XAI techniques, gradient-weighted class activation mapping (GradCAM), to illustrate the necessity of extended preprocessing methods specifically for medical images.
Related works
In the current research field, there have been tens of thousands of publications on medical image recognition using deep learning models. Among these, lung disease is one of the most popular topics. Beyond the recent outbreak of coronavirus disease 2019 (COVID-19) and the immediate research demands that came with it, the topic is popular because lung-related image data are extraordinarily abundant, far more so than for other diseases. Even before deep learning research started to thrive in the medical field, this abundance gave researchers rich material and ample room to develop. After the concept of computer-aided diagnosis (CAD) systems emerged, Hua et al. (10) proposed a combined design of a convolutional neural network (CNN) and a deep belief network (DBN) and demonstrated a satisfactory improvement in model performance for lung nodule classification on CT images. The Thorax-Net proposed by Wang et al. (11) classifies 14 lung diseases and contains two branches, a classification branch and an attention branch, whose outputs are averaged and binarized to produce the final prediction. Liang et al. (1) proposed a model combining dilated convolutions and residual structures for the classification of pediatric pneumonia; with a more specific scope but still a rich amount of chest X-ray image data, it achieves 96.7% recall and a 92.7% F1-score. In the thriving research on COVID-19 diagnosis and classification, the models proposed by Keles et al. (12), plainly called ‘COV19-CNNet’ and ‘COV19-ResNet’, achieved 97.61% and 94.28% accuracy, respectively, when performing multi-class classification on lung radiology image datasets containing normal, COVID-19, and viral pneumonia cases.
As we can see from the works above, many researchers in this field emphasize model design to achieve higher performance; few pay close attention to the image data itself and to how it is preprocessed before training. For example, Salehinejad et al. (13) used a generative adversarial network to generate synthetic images as additional training material, essentially an upsampling technique for imbalanced image datasets. The principle is similar to image data augmentation, a form of preprocessing; because medical image data are already scarce, strict downsampling is rarely applied. Shin et al. (14), in contrast, applied transfer learning to reduce the amount of training material CNNs require, so that a fair training effect can be achieved even with a smaller amount of medical image data. In both works, the researchers put considerable effort into designing and calibrating sophisticated networks and techniques, thereby achieving innovative results. Such publications are still far fewer than conventional studies, and we seek a more effort- and time-efficient approach to improving performance on medical images. To address this, we propose in this paper an extended preprocessing approach for medical image data.
Although there is much deep learning research for medical applications, XAI has been introduced into this field for less than a decade. As awareness of the accountability and reliability required for critical applications has risen, XAI techniques provide a means to transfer trust and confidence from black-box computer systems to human users, and researchers have started to use XAI as a validation and verification tool for black-box models. For example, Majkowska et al. (15) proposed deep learning models for chest radiograph interpretation, along with validations and evaluations using XAI techniques. They used four CNNs to detect fractures, nodules/masses, pneumothorax, and opacity in chest X-ray images; the models performed as well as human experts, with slightly higher sensitivity. They used SmoothGrad (16), an XAI technique that adds random noise to copies of the input image, computes a gradient-based saliency map for each copy, and averages the maps to suppress noise in the final result. Based on the explanations produced by SmoothGrad, experts can assess them and provide valuable input on where the model might err. Dunnmon et al. (17) and Rajpurkar et al. (18) both used the classic CAM method to validate deep learning models developed for interpreting chest radiographs. CheXNeXt (18) was developed to detect 14 types of lung pathology and achieved human-expert-level performance on 11 of them; CAM results showed that the model focused on meaningful regions of interest (ROIs) and produced correct detections.
Methods
Dataset
In the experiment, we use the musculoskeletal radiographs (MURA) dataset developed by Rajpurkar et al. (19), currently the largest public bone radiograph dataset worldwide. The MURA dataset contains a total of 40,895 images categorized by body part: elbow, finger, forearm, hand, humerus, shoulder, and wrist. The dataset is organized by patient, and data masking was performed at the source (Stanford University), so no patient information is contained and only a random index is assigned to each patient. Each patient folder contains one to three images of the same subject taken from different angles. Data extraction follows the csv files provided by the original developers. For our experiment, we extracted the elbow and shoulder images, which represent different kinds of redundant image elements; the two datasets are stored and used independently in separate experimental runs.
The elbow dataset contains a total of 5,396 images categorized into two classes, normal and abnormal. According to the official documentation, the training set contains 4,931 images, of which 2,006 are abnormal and 2,925 are normal, and the testing set contains 235 normal and 230 abnormal images.
The shoulder dataset contains a total of 8,942 images, also categorized into normal and abnormal. According to the official documentation, the training set contains 4,211 normal and 4,168 abnormal images, and the testing set contains 285 normal and 278 abnormal images.
The data that support the findings of this study are openly available at https://github.com/ushashwat/MURA-Bone-Abnormality-Detection.
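As an illustration of the extraction step described above, the following is a minimal sketch that assumes the image-path csv files of the public MURA release; the file name, helper name, and label handling are illustrative rather than part of the original pipeline (in MURA, the study folder name encodes the label, with “positive” denoting abnormal).

```python
import pandas as pd

def load_subset(csv_path, body_part):
    """Illustrative helper: collect image paths and labels for one body part."""
    paths = pd.read_csv(csv_path, header=None, names=["path"])["path"]
    subset = paths[paths.str.contains(f"XR_{body_part}")]
    # Study folder names encode the label ("positive" = abnormal, "negative" = normal).
    labels = subset.str.contains("positive").astype(int)
    return pd.DataFrame({"path": subset, "label": labels})

# Assumed csv name from the public MURA release; adjust to the local copy.
elbow_train = load_subset("MURA-v1.1/train_image_paths.csv", "ELBOW")
shoulder_train = load_subset("MURA-v1.1/train_image_paths.csv", "SHOULDER")
```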
Preprocessing
The workflow for preprocessing individual images is illustrated in Figure 1 and includes the following four steps:
- Contrast enhancement (including brightness). A fair portion of the X-ray images are too dark, and the model might not be able to clearly distinguish the features in them.
- Background cropping. As mentioned before, the black background outside the image frame should be removed.
- Centered cropping. From our observation, all the images contain the upper arm and forearm (in the case of the elbow dataset), along with nearby areas that often contain logos, while the abnormal areas almost always lie at or around the joint. Centered cropping is therefore performed to remove background areas containing logos as well as the surrounding forearm and upper arm areas, including muscle and skin tissue.
- The final step is subject extraction, which keeps only the main subject.
We considered precise manual processing to remove the complete background along with irrelevant logos and other components from the images, but with almost 14 thousand images, the workload would be overwhelming. We therefore implemented automated functions in Python to perform the removal automatically.
Using functions provided by the OpenCV (20) package in Python, we are able to preprocess all 14 thousand images following the workflow illustrated in Figure 1 in approximately 7 minutes at low computational cost, with a loop that applies all operations to each image. We first apply a contrast gain of 2 and a brightness offset of 60. Then, by setting a threshold that distinguishes black and white pixels, the background is cropped away. Centered cropping is performed with a cropping size customized from the image borders. Finally, subject extraction is achieved by thresholding again. The first three steps enable a more precise subject extraction.
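The following is a minimal sketch of this per-image pipeline. It assumes cv2.convertScaleAbs for the contrast and brightness step and simple binary thresholding with bounding-box and margin cropping for the remaining steps; the threshold value and crop margin are illustrative, not the exact values used in our runs.

```python
import cv2
import numpy as np

def preprocess(path, thresh=40, margin=0.15):
    """Sketch of the four-step pipeline: contrast/brightness enhancement,
    background cropping, centered cropping, subject extraction."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # 1. Contrast gain of 2 and brightness offset of 60.
    img = cv2.convertScaleAbs(img, alpha=2.0, beta=60)

    # 2. Background cropping: keep the bounding box of pixels brighter than
    #    the black/white threshold.
    _, mask = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY)
    ys, xs = np.where(mask > 0)
    img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # 3. Centered cropping: trim a margin from each border to discard logos
    #    and tissue far from the joint.
    h, w = img.shape
    dy, dx = int(h * margin), int(w * margin)
    img = img[dy:h - dy, dx:w - dx]

    # 4. Subject extraction: threshold again and keep only the foreground.
    _, mask = cv2.threshold(img, thresh, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_and(img, img, mask=mask)
```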
Our preprocessing method is automated, easy to operate, and low in time, labor, computational, and potential financial cost, while still producing valuable data samples, making it suitable for researchers with high-volume image data and limited resources.
Most images are successfully processed and the essential parts are extracted, as shown in Figure 2 (original image on the left, processed image on the right). A small percentage of low-resolution images lose part of the essential region, but these images provide little valuable information to the model in the first place. We therefore judge that this automated process greatly reduces experimental cost while the small error has little negative influence on the results, and we chose it over manual processing in our experiment.
As an example of an XAI explanation, shown in Figure 3, we can observe that for unprocessed images with background and irrelevant parts (image on the right), the explanation ROI can appear on the background or even on the side opposite the true ROI, as well as on the left/right-hand logo of the image (shown in the image on the left inside the red frame), which is clearly irrelevant to the outcome. Since the elbow images are categorized by left and right side, they are tagged with an “L” or “R” logo. Even though these logos are irrelevant to the classification (for diagnostic purposes; surgical purposes are not considered here), the ratio of left- and right-side abnormal cases may well be imbalanced in the datasets, which could bias the training of the model. If, for example, there are more left-side than right-side abnormal elbow cases, the model could learn that right-side images are less likely to be abnormal. To achieve accurate training, removing these irrelevant indicators is necessary.
Finally, an important rebalancing step is conducted for both datasets. The ratio of normal to abnormal classes is slightly imbalanced in both the training and testing sets, with normal images outnumbering abnormal ones. Since the objective of a CAD system is always to find and identify abnormal cases, more material for training the model to identify abnormal cases is preferred. To avoid the effect of this imbalance, for the elbow dataset we discarded 919 normal cases from the training set and 5 normal cases from the testing set, achieving an ideal balance. For our experiment, we therefore have 4,012 training images (2,006 in each class) and 460 testing images (230 in each class).
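A minimal sketch of this rebalancing step (random undersampling of the surplus normal images) is shown below; the helper name and random seed are illustrative.

```python
import random

def rebalance(normal_paths, abnormal_paths, seed=0):
    """Randomly discard surplus normal images so both classes are equal in size."""
    rng = random.Random(seed)
    if len(normal_paths) > len(abnormal_paths):
        normal_paths = rng.sample(normal_paths, len(abnormal_paths))
    return normal_paths, abnormal_paths

# e.g., elbow training set: 2,925 normal vs. 2,006 abnormal -> 2,006 per class.
```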
CNN models
For our experiment, we selected five classic CNN models included in the TensorFlow Keras applications, which are known for their excellent performance in image classification; one representative was selected from each model family. They are VGG16 (21), ResNet50 (22), EfficientNetB0 (23), MobileNet (24), and DenseNet201 (25). Details of each model are shown in Table 1; for descriptions and concept definitions, please refer to the referenced literature.
Table 1
Model | Parameters | Layers | Pre-trained on |
---|---|---|---|
VGG16 | 138 million | 16 | ImageNet |
ResNet50 | 25,636,712 | 50 | ImageNet |
MobileNet | 4.2 million | 28 | ImageNet |
EfficientNetB0 | 5,330,564 | 237 | ImageNet |
DenseNet201 | 20,242,984 | 201 | ImageNet |
Parameter configuration
To ensure the same experimental environment for every selected model, we use identical parameter settings for all models. We conducted a parameter screening to find a suitable configuration, covering three parameters: batch size, optimizer, and initial learning rate. For batch size we chose between 16 and 32; for optimizer, between stochastic gradient descent (SGD) and Adam; and for initial learning rate, between 0.01 and 0.001. This yields eight parameter combinations, each of which was used to run an independent test on all models listed in Table 1 with the elbow dataset. Each combination therefore produces five accuracies (one per model), and we calculated the variance of these accuracies for each combination and selected the combination with the lowest variance. The final configuration is shown in Table 2.
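A sketch of this screening loop is given below; the train_and_evaluate placeholder stands in for one full training run per model and configuration, and the combination with the lowest variance across the five accuracies is kept.

```python
import itertools
import numpy as np

def train_and_evaluate(model_name, batch_size, optimizer, lr):
    """Placeholder: train `model_name` on the elbow dataset with the given
    configuration and return its test accuracy (real training goes here)."""
    return np.random.rand()  # stand-in value so the sketch runs

models = ["VGG16", "ResNet50", "MobileNet", "EfficientNetB0", "DenseNet201"]
best_combo, best_var = None, float("inf")
for batch_size, optimizer, lr in itertools.product([16, 32], ["sgd", "adam"], [0.01, 0.001]):
    accs = [train_and_evaluate(m, batch_size, optimizer, lr) for m in models]
    var = np.var(accs)        # variance of the five models' accuracies
    if var < best_var:        # keep the most consistent combination
        best_combo, best_var = (batch_size, optimizer, lr), var

print("Selected configuration:", best_combo)
```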
Table 2
Parameter | Setting |
---|---|
Input size | 128×128 |
Batch size | 16 |
Optimizer | SGD |
Initial learning rate | 0.01 |
Learning rate reduction factor | 0.1 |
Loss function | Categorical cross-entropy |
Activation function | ReLU |
Epochs | 10 |
All models are pre-trained with ImageNet, as various research has shown that ImageNet pre-trained models generally have significantly superior performance over untrained ones. SGD, stochastic gradient descent.
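A minimal sketch of this shared configuration using the TensorFlow Keras applications is shown below. The input size, optimizer, initial learning rate, reduction factor, loss, batch size, and epoch count follow Table 2; the classification head, the ReduceLROnPlateau patience, and the data handling are our assumptions.

```python
import tensorflow as tf

def build_model(name="DenseNet201", input_size=128, num_classes=2):
    """Instantiate an ImageNet pre-trained backbone with a softmax head and
    the shared training configuration from Table 2."""
    base_cls = getattr(tf.keras.applications, name)
    base = base_cls(include_top=False, weights="imagenet",
                    input_shape=(input_size, input_size, 3), pooling="avg")
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(base.output)
    model = tf.keras.Model(base.input, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Reduce the learning rate by a factor of 0.1 when validation loss plateaus
# (the patience value is an assumption).
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(factor=0.1, patience=2)
# model = build_model("VGG16")
# model.fit(x_train, y_train, batch_size=16, epochs=10,
#           validation_data=(x_val, y_val), callbacks=[reduce_lr])
```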
XAI method
The XAI method used in our experiment is GradCAM, proposed by Selvaraju et al. (26) as a “generalization of CAM”, the base concept of class activation mapping introduced by Zhou et al. (27). Since CAM is implemented under the premise of having a global average pooling (GAP) layer in place of the last fully connected layer, it requires the network structure to be changed and the model may need retraining. In the original CAM method, the feature maps are processed by GAP, and the CAM is the weighted linear sum of the last convolutional layer’s feature maps, using the weights that connect the GAP outputs to the predicted class. Because the feature maps are generally smaller than the input, the CAM result must be upsampled to the input size and overlaid on the original image to obtain the final observable result. GradCAM works by a similar mechanism without requiring GAP and is therefore applicable to different network structures. We call GradCAM a “class-oriented method” because, in its process, a “class of interest” is first specified for the target image and its output gradient is set to 1, which is then backpropagated to weight the feature maps and form a rough area of interest for that class. GradCAM therefore produces a heatmap of the ROI with vague boundaries. Since GradCAM produces heatmaps from the model itself, it is known to yield stable explanations for the same data sample on the same model.
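A compact sketch of the GradCAM computation described above, written against the Keras models used here, is shown below; the last convolutional layer name and the class index must be supplied by the caller, and the overlay on the original image is omitted.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer, class_index):
    """Gradient-weighted sum of the last convolutional feature maps for the
    chosen class, upsampled to the input resolution."""
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(last_conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]        # score of the class of interest
    grads = tape.gradient(class_score, conv_out)   # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))   # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0]                       # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)        # normalize to [0, 1]
    return tf.image.resize(cam[..., tf.newaxis], image.shape[:2]).numpy().squeeze()
```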
Since all CNN models are black-box models, users cannot determine from the prediction results alone whether the model based its prediction on the correct knowledge; hence the necessity of XAI. As mentioned in the Related Works section, many researchers now value the reliability of the predictions made by black-box models, especially medical ones, since it is crucial that doctors receive reliable aid from them, and they therefore use XAI methods to validate their results; this has often been the sole purpose of XAI in their research. In our work, the XAI technique is not only used to evaluate whether the models make reliable predictions: through XAI we also discovered that CNN models trained with unprocessed image datasets focus on the wrong features (background, logos) when making predictions, which inspired us to propose an effective, efficient, low-cost, and easily operated preprocessing method specifically for medical images.
Experimental setup
Our experiment is a two-dimensional process. Vertically, experiments are conducted on the elbow dataset and the shoulder dataset using the same five CNN models. Horizontally, for each dataset, experiments are conducted twice, with unprocessed and processed images respectively, and evaluation is performed as previously described using both the XAI method and statistical measures. Our experiment includes the following main steps:
- Individually feed the two datasets, still containing the background and irrelevant parts, into the selected CNNs to produce trained models.
- Select testing image samples and feed them into the corresponding trained models.
- Using the XAI method, produce the corresponding ROIs for the testing samples.
- Remove the background and irrelevant parts from both datasets and feed them into newly instantiated CNNs (carrying no weights from the previous runs) to produce new trained models.
- Feed the same testing image samples into the new trained models.
- Using the XAI method, produce the corresponding new ROIs for the testing samples.
The same steps are performed for both datasets in both forms (processed and unprocessed).
In our repeated experiments, the performance tends to stabilize after epoch 10 in most cases, so the number of epochs is set to 10.
Results
Results are presented for the two datasets in turn, each with a statistical analysis and a visual analysis.
For both the elbow and the shoulder dataset, five test samples are selected. For a fair experimental comparison, only the training material is processed; the samples tested afterwards are all unprocessed images, so that we precisely measure the improvement gained by training with processed material.
For clearer description and reading, the models trained with raw images are referred to as ‘Model A’, and the models trained with processed images as ‘Model B’.
Quantitative results: elbow dataset
Quantitative performance for the elbow dataset is shown in Tables 3,4.
Table 3
Model | VGG16 | ResNet50 | MobileNet | EfficientNetB0 | DenseNet201 |
---|---|---|---|---|---|
A | 0.72 | 0.71 | 0.74 | 0.63 | 0.78 |
B | 0.78 | 0.74 | 0.77 | 0.79 | 0.80 |
Aug | 0.06 | 0.03 | 0.03 | 0.16 | 0.02 |
Model A, the models trained by raw images; Model B, the models trained by processed images; Aug, performance augmentation (from Model A to Model B).
Table 4
Model | Model A TP (%) | Model A TN (%) | Model B TP (%) | Model B TN (%) |
---|---|---|---|---|
VGG16 | 71 | 73 | 74 | 81 |
ResNet50 | 71 | 73 | 64 | 84 |
MobileNet | 70 | 78 | 78 | 79 |
EfficientNetB0 | 47 | 79 | 72 | 85 |
DenseNet201 | 66 | 90 | 71 | 90 |
Model A, the models trained by raw images; Model B, the models trained by processed images. TP, true positive; TN, true negative.
Figures 4,5 present a comparison of validation accuracy and validation loss for ResNet50.
Figure 6 presents a comparison of validation accuracy for EfficientNetB0. Figure 7 presents a comparison of validation loss for EfficientNetB0.
Figure 8 shows the confusion matrices of DenseNet201 trained with raw images and with processed images, respectively; the confusion matrix values are normalized.
Visual results: elbow dataset
Five testing samples are randomly selected from the elbow part of the original MURA dataset before preprocessing; they are never included or involved in any training or testing process.
The testing samples are raw images without any preprocessing (the same holds for the testing samples from the shoulder dataset used in the next part of the experiment). We intentionally use raw images to test Models A and B in order to demonstrate that, after being trained with the “effective training material” described above, the models are capable of basing their predictions for new images on the correct attention areas where the abnormality occurs, even when the new images come with a black background, logos, or other irrelevant features.
The original images are shown in Figure 9.
Figures 10,11 present the GradCAM explanations for the five testing samples from the models trained with raw images (Model A, corresponding to Figure 10A and Figure 11A) and from the models trained with processed images (Model B, corresponding to Figure 10B and Figure 11B). The order of the testing samples (Nos. 1 to 5) is the same as in Figure 9. Comparisons are not made between models, since performance differences between models are not our objective; only the differences in explanations from the same model trained with different materials are compared. Therefore, we do not need the results of all five selected models. Judging from overall performance, we determine that VGG16 and DenseNet201 are suitable for the elbow dataset, so the explanations produced by these two models are presented and compared.
Quantitative results: shoulder dataset
Table 5 presents the accuracy comparison for the selected models trained with raw images and with processed images, and Table 6 presents the comparison of true positive (TP) and true negative (TN) rates.
Table 5
Model | VGG16 | ResNet50 | MobileNet | EfficientNetB0 | DenseNet201 |
---|---|---|---|---|---|
A | 0.61 | 0.70 | 0.66 | 0.75 | 0.66 |
B | 0.68 | 0.71 | 0.73 | 0.79 | 0.75 |
Aug | 0.07 | 0.01 | 0.07 | 0.04 | 0.09 |
Model A, the models trained by raw images; Model B, the models trained by processed images; Aug, performance augmentation (from Model A to Model B).
Table 6
Model | Model A TP (%) | Model A TN (%) | Model B TP (%) | Model B TN (%) |
---|---|---|---|---|
VGG16 | 56 | 66 | 67 | 69 |
ResNet50 | 69 | 70 | 70 | 71 |
MobileNet | 64 | 71 | 65 | 81 |
EfficientNetB0 | 75 | 76 | 76 | 81 |
DenseNet201 | 36 | 95 | 74 | 76 |
Model A, the models trained by raw images; Model B, the models trained by processed images. TP, true positive; TN, true negative.
Figures 12,13 illustrate the accuracy and loss of MobileNet trained with raw and processed images.
Figures 14,15 respectively present the accuracy and loss comparisons for DenseNet201. Figure 16 presents the comparison of confusion matrices for EfficientNetB0.
Visual results: shoulder dataset
Five testing samples are selected from the shoulder part of the original MURA dataset; none of them has been included or involved in any training or testing process. The original images are shown in Figure 17.
Judging from overall performance, we determined that MobileNet and ResNet50 are the two models suitable for the shoulder dataset. Figures 18,19 contain the explanation images produced using GradCAM for the testing samples (ordered as in Figure 17, Nos. 1 to 5) processed by these two models, trained respectively with raw images (Model A, corresponding to Figure 18A and Figure 19A) and processed images (Model B, corresponding to Figure 18B and Figure 19B).
Discussion
This section presents discussions corresponding to the respective experimental results. For better readability, the discussions are presented in the same order as the “Results” section.
Quantitative results: elbow dataset
As we can observe from Table 3, the performance of all selected models is enhanced when trained with images processed to remove the background and irrelevant components. EfficientNetB0 shows the largest gain, with validation accuracy improved by 16%, while the improvements for the other models range from 2% to 6%. Table 4 presents the comparison of true prediction rates for the selected models trained with the different materials. Their capability of producing true predictions is improved, with the improvement slightly larger for TP than for TN predictions, although the models generally remain more accurate at distinguishing negative cases.
We can also observe from Figure 4 that the model accuracy improved during the training process: curve A started from a very low value of 0.027 and rose, while curve B, although relatively unstable, has a higher average accuracy. From Figure 5, the validation loss curve for Model B fluctuates approximately between 0.25 and 1.25, whereas for Model A the loss reached as high as 4.34 at epoch 4; the average validation loss of Model B is therefore lower than that of Model A.
From Figure 6, one can observe that the validation accuracy of Model A fluctuates strongly, while that of Model B increases more stably, and model evaluation showed that the mean accuracy of Model B is higher. From Figure 7, Model A’s loss reached 1.163, while Model B’s loss values are all below 0.475; the overall loss of Model B is therefore lower, and its loss curve is comparatively more stable.
From Figure 8, we can observe that the TN rate is improved by 0.003, and besides the already decent TN rate, the TP rate is improved by 0.05. This indicates that preprocessing the elbow dataset helps the model reduce false negative predictions. Generally speaking, for all selected models, using processed image data leads to better results.
Visual results: elbow dataset
As we observe from Figure 10, for the randomly selected testing samples, VGG16 presents decent performance with fair accuracy and relies on roughly the right areas. For samples Nos. 3 and 4, the indications lie almost on top of the correct area, and with training on processed images the area is subtly adjusted to match the exact position more precisely. On test samples Nos. 2 and 5, the explanations show attention to the background and the logos “L” and “R”, suggesting that the model may sometimes be confused by these elements; with training on processed images, the attention drifts away from the logos and the ROI becomes narrower and more concentrated. For sample No. 1, which is a clear case of bone fracture, the ROI presented by Model A lies almost on the opposite side; with training on processed images, the ROI drifts closer to the correct area, although not centered on it. For sample No. 5, the ROI is located on the bone fragment that has broken away from the bone due to the fracture; with training on processed images, the attention still does not shift, only becoming narrower and more concentrated on the fragment rather than the fractured bone, so this improvement is only technical and not strictly medical.
From the explanations provided by DenseNet201, shown in Figure 11, we can see more clearly that the background and logos can affect the training process. Most notably, for sample No. 5, the attention area lies solely on the unrelated muscle area, background, and logo, with only mild interest in the forearm bone (yellow area); Model A clearly does not perceive anything abnormal in this image and processes the different elements as a whole. After training with processed images, the ROI presented in Figure 11B for sample No. 5 roughly covers both the bone fragment and the fractured bone end. For sample No. 1, Model A produces a wider and less precise explanation, and the explanation produced by Model B covers a narrower area, but not the correct one, similar to the case of sample No. 5 with VGG16; we presume the model recognized other indications that do not meet the expert’s standard, and the training is not yet sufficient for it to recognize this type of abnormality. Observing column 4, we can see that for sample No. 4 the model trained with raw images produced an explanation concentrated on the inner-side muscle instead of the joint, whereas the model trained with processed images focused on the correct area of the abnormal joint, a good demonstration of effectiveness.
Quantitative results: shoulder dataset
We can see from Tables 5,6 that the overall accuracy of all selected models is improved; DenseNet201 improved the most, from 0.66 to 0.75, a gain of 9%.
From Table 6, we can see that almost all rates of correct prediction are improved, with the improvement on TN cases slightly larger overall than on TP cases. For DenseNet201, which also obtained the largest accuracy improvement, as with the elbow dataset, training with processed images significantly improves its capability of recognizing positive images compared with training with raw images, although its recognition rate for negative samples drops.
From Figures 12,13, we can observe that when trained with raw images, both the validation accuracy and the validation loss curves bounce strongly, indicating unstable predictions. For Model B, trained with processed images, the validation accuracy curve remains relatively stable; although the loss curve trends upward, its mean loss value is still lower than that of Model A.
Figures 14,15 show that with raw images, DenseNet201 starts at a low accuracy of 0.183, both the accuracy and loss curves are unstable, and the highest loss reaches 2.67. When trained with processed images, the validation accuracy curve remains relatively stable, starting from 0.708, and the loss stays between 0.6 and 1, more than two times lower than the peak loss of Model A. From Figure 16, we can see that the number of true predictions increases while false predictions decrease: the TP rate improves from 0.75 to 0.76 and the TN rate from 0.76 to 0.81, indicating that the preprocessing mainly benefits the model by reducing false positive predictions.
Visual results: shoulder dataset
We can observe from Figure 18A that, trained with raw images, MobileNet produces comparatively vague ROIs with lower saturation, though they all roughly cover the key areas. The lighter blue color indicates that the model does not rely on these areas as heavily as the other models do (which show deep, saturated blue), but they are still positive indications for MobileNet. On the two samples with logos (Nos. 1 and 3), it still shows mild interest in the logos (yellow area). After training with processed images, the light blue becomes deeper, indicating that the model relies more on the ROIs for its predictions, and the ROIs for all five samples become narrower and more concentrated. For sample No. 1, although the ROI is narrower and more concentrated, the rib cage area is still considered an indication of abnormality and is even more centered than in the explanation of Model A; we surmise that during both training runs the rib cage consistently attracts the model’s attention, regardless of the other irrelevant parts being processed out. For samples Nos. 2, 3, and 4, the ROIs shift slightly toward more centered areas around the bone tissue, while the ROI for sample No. 5 appears centered on muscle tissue.
From Figure 19A, we can see that ResNet50 produces more concentrated areas of interest with clearer edges than MobileNet, and, surprisingly, on sample No. 3 it shows a distinct opposing response specifically on the logo area rather than mere ‘non-interest’ (where the logo is treated together with the background as one irrelevant region, as seen for sample No. 1 in Figure 19A), which shows that the model processes these elements separately and is affected differently by each. On sample No. 1, the area of interest includes the rib bones, which are obviously irrelevant to shoulder joint abnormality; although the ROI includes the correct indicating area, performance can still be improved, and the same holds for sample No. 3, whose ROI also covers the rib bone area. After training with processed images, sample No. 1 shows a situation similar to the MobileNet explanation above: both models, whether trained with raw or processed images, agree that the rib bone area of the patient who provided sample No. 1 contains elements supporting the abnormality, an interesting phenomenon suggesting that the two models may see patterns that do not meet human eyes and standards but are actually connected to the abnormality. For samples Nos. 2 and 5, the ROIs remain centered on approximately the same locations but with more concentrated coverage, whereas the ROIs of samples Nos. 3 and 4 shift slightly: for sample No. 3 the area is no longer centered on the rib bone, and for sample No. 4 it no longer covers the muscle tissue under the armpit.
Conclusions
In many current studies developing deep learning models for medical image recognition, only basic standard preprocessing is performed on the images, as is appropriate for image data in general. We hold the opinion, however, that due to the special nature of medical images, an extended preprocessing process is necessary, especially compared with modifying the model’s structure and design and calibrating its parameters, which can be exhausting with little reward. The extended preprocessing method proposed in this paper for medical image data is easy to implement, time- and effort-friendly, and effective for improving performance. Additionally, our deep learning and XAI approach highlights that when developing CAD systems specifically for medical images, special factors must be considered, such as the background and “irrelevant parts” emphasized in this paper; as evidenced by the experimental results and the XAI visualizations, they significantly influence the outcome.
Corresponding to what was mentioned in the Introduction about “special treatment for medical image data”, the data augmentation process in our experiment excluded cropping. This method, which crops one image into several and adds them as new data under the same class, is considered inappropriate for medical images, for two reasons. First, the datasets contain a fair amount of data as they are, so we do not need data augmentation to increase the training material, and we thereby avoid the risk of compromising dataset integrity. Second, within abnormal images, the areas that indicate abnormality are only partial rather than general, unlike common object recognition, where every part of the object supports the true classification result. If cropping were performed, cropped images from abnormal samples that contain no abnormal factor would still be labeled abnormal, possibly confusing the model and degrading its performance.
XAI has been a crucial component of our findings. Thanks to the explanations it provides, we discovered that models trained with unprocessed raw images containing irrelevant features (background, logos, etc.) can indeed be biased toward these features, hence our proposal of an appropriate, efficient, low-cost, and easily operated preprocessing method for medical image training material. We firmly believe in the necessity of XAI as an important bridge connecting human intelligence and AI. Black-box models developed for important purposes such as radiology and medicine can benefit doctors by improving efficiency and aiding diagnosis, yet all information coming from the models is for reference only, and the ultimate determination lies with humans. As stated in the work of Sorantin et al. (28): “For legal reasons, humans remain in control and are responsible for decision-making—even when AI is doing/supporting the decision”.
Future work
For future research, we believe that explanations of black-box models can serve more purposes than merely proving that a model operates as expected. By implementing XAI techniques, we can not only validate a model’s classification results but also identify removable elements beforehand, thereby improving data preprocessing. Our future research direction is the combination of medical image segmentation and classification. Medical image segmentation techniques are now very mature and accurate, enabling exact areas to be pinpointed with clear boundaries, which could further improve the preprocessing of medical images. For example, femoral head necrosis is observed through radiology images of the patient’s articulatio coxae (hip joint). The disease develops exclusively on the femoral head, yet the radiographs capture the whole hip compartment, which contains “irrelevant parts” for this disease, such as the spine, upper thigh bone, and partial rib bones, that could distract an AI model. If future research can further narrow down the precise area corresponding to a specific disease as a preprocessing step, model performance can hopefully be improved and the model made more trustworthy for medical applications.
On the other hand, regarding the point raised in the Introduction about the importance of selecting an appropriate XAI method for case-specific scenarios (9), we believe that new XAI approaches designed exclusively for radiology are in order. To that end, the “counterfactual explanation” approach discussed by Del Ser et al. (29) would be an interesting solution with great potential; the concepts of “causability” and “plausibility” in their work can serve as important base factors when developing new XAI techniques, thereby producing advanced, credible XAI methods and providing further confidence and trust between humans and AI.
Acknowledgments
None.
Footnote
Funding: This work was supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1745/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Liang G, Zheng L. A transfer learning method with deep residual network for pediatric pneumonia diagnosis. Comput Methods Programs Biomed 2020;187:104964. [Crossref] [PubMed]
- Gharaibeh M, Almahmoud M, Ali MZ, Al-Badarneh A, El-Heis M, Abualigah L, Altalhi M, Alaiad A, Gandomi AH. Early Diagnosis of Alzheimer’s Disease Using Cerebral Catheter Angiogram Neuroimaging: A Novel Model Based on Deep Learning Approaches. Big Data Cogn. Comput 2022;6:2.
- Lei Y, Tian Y, Shan H, Zhang J, Wang G, Kalra MK. Shape and margin-aware lung nodule classification in low-dose CT images via soft activation mapping. Med Image Anal 2020;60:101628. [Crossref] [PubMed]
- Yang C, Rangarajan A, Ranka S. Visual Explanations From Deep 3D Convolutional Neural Networks for Alzheimer's Disease Classification. AMIA Annu Symp Proc 2018;2018:1571-80. [PubMed]
- Palatnik de Sousa I, Maria Bernardes Rebuzzi Vellasco M, Costa da Silva E. Local Interpretable Model-Agnostic Explanations for Classification of Lymph Node Metastases. Sensors (Basel) 2019;19:2969. [Crossref] [PubMed]
- Liu L, Feng W, Chen C, Liu M, Qu Y, Yang J. Classification of breast cancer histology images using MSMV-PFENet. Sci Rep 2022;12:17447. [Crossref] [PubMed]
- Ma L, Su X, Ma L, Gao X, Sun M. Deep learning for classification and localization of early gastric cancer in endoscopic images. Biomedical Signal Processing and Control 2023;79:104200. [Crossref]
- Shen X, Luo J, Tang X, Chen B, Qin Y, Zhou Y, Xiao J. Deep Learning Approach for Diagnosing Early Osteonecrosis of the Femoral Head Based on Magnetic Resonance Imaging. J Arthroplasty 2023;38:2044-50. [Crossref] [PubMed]
- Retzlaff CO, Angerschmid A, Saranti A, Schneeberger D, Röttger R, Müller H, Holzinger A. Post-Hoc vs Ante-Hoc Explanations: XAI Design Guidelines for Data Scientists. Cognitive Systems Research 2024;86:101243. [Crossref]
- Hua KL, Hsu CH, Hidayati SC, Cheng WH, Chen YJ. Computer-aided classification of lung nodules on computed tomography images via deep learning technique. Onco Targets Ther 2015;8:2015-22. [PubMed]
- Wang H, Jia H, Lu L, Xia Y. Thorax-Net: An Attention Regularized Deep Neural Network for Classification of Thoracic Diseases on Chest Radiography. IEEE J Biomed Health Inform 2020;24:475-85. [Crossref] [PubMed]
- Keles A, Keles MB, Keles A. COV19-CNNet and COV19-ResNet: Diagnostic Inference Engines for Early Detection of COVID-19. Cognit Comput 2021; Epub ahead of print. [Crossref] [PubMed]
- Salehinejad H, Valaee S, Dowdell T, Colak E, Barfett J. Generalization of deep neural networks for chest pathology classification in x-rays using generative adversarial networks. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2018:990-4.
- Shin HC, Roth HR, Gao M, Lu L, Xu Z, Nogues I, Yao J, Mollura D, Summers RM. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging 2016;35:1285-98. [Crossref] [PubMed]
- Majkowska A, Mittal S, Steiner DF, Reicher JJ, McKinney SM, Duggan GE, Eswaran K, Cameron Chen PH, Liu Y, Kalidindi SR, Ding A, Corrado GS, Tse D, Shetty S. Chest Radiograph Interpretation with Deep Learning Models: Assessment with Radiologist-adjudicated Reference Standards and Population-adjusted Evaluation. Radiology 2020;294:421-31. [Crossref] [PubMed]
- Smilkov D, Thorat N, Kim B, Viégas F, Wattenberg M. SmoothGrad: removing noise by adding noise. arXiv 2017. arXiv:1706.03825.
- Dunnmon JA, Yi D, Langlotz CP, Ré C, Rubin DL, Lungren MP. Assessment of Convolutional Neural Networks for Automated Classification of Chest Radiographs. Radiology 2019;290:537-44. [Crossref] [PubMed]
- Rajpurkar P, Irvin J, Ball RL, Zhu K, Yang B, Mehta H, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med 2018;15:e1002686. [Crossref] [PubMed]
- Rajpurkar P, Irvin J, Bagul A, Ding D, Duan T, Mehta H, Yang B, Zhu K, Laird D, Ball RL, Langlotz C, Shpanskaya K, Lungren MP, Ng AY. MURA Dataset: Towards Radiologist-Level Abnormality Detection in Musculoskeletal Radiographs. arXiv 2017. arXiv:1712.06957.
- Bradski G. The OpenCV Library. Dr. Dobb’s Journal of Software Tools 2000;120:122-5.
- Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition. 3rd International Conference on Learning Representations (ICLR 2015). Computational and Biological Learning Society; 2015.
- He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016:770-8.
- Tan M, Le Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research 2019;97:6105-14.
- Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017. arXiv:1704.04861.
- Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017:2261-9.
- Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. 2017 IEEE International Conference on Computer Vision (ICCV); 2017:618-26.
- Zhou B, Khosla A, Lapedriza A, Oliva A, Torralba A. Learning Deep Features for Discriminative Localization. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016:2921-9.
- Sorantin E, Grasser MG, Hemmelmayr A, Tschauner S, Hrzic F, Weiss V, Lacekova J, Holzinger A. The augmented radiologist: artificial intelligence in the practice of radiology. Pediatr Radiol 2022;52:2074-86. [Crossref] [PubMed]
- Del Ser J, Barredo-Arrieta A, Díaz-Rodríguez N, Herrera F, Saranti A, Holzinger A. On Generating Trustworthy Counterfactual Explanations. Information Sciences 2024;655:119898. [Crossref]