Original Article

Automated elbow ultrasound image recognition: a two-stage deep learning system via Swin Transformer

Weichen Zhou1,2 ORCID logo, Chengting Zhou2, Lirong Hu2, Li Qiu1 ORCID logo

1Department of Medical Ultrasound, West China Hospital of Sichuan University, Chengdu, China; 2Department of Ultrasound, Chengdu First People’s Hospital, Chengdu, China

Contributions: (I) Conception and design: W Zhou; (II) Administrative support: L Qiu; (III) Provision of study materials or patients: C Zhou, L Hu; (IV) Collection and assembly of data: C Zhou; (V) Data analysis and interpretation: W Zhou; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Li Qiu, PhD. Department of Medical Ultrasound, West China Hospital of Sichuan University, No. 37 Guoxue Alley, Chengdu 610041, China. Email: wsqiuli@vip.126.com.

Background: Ultrasound imaging is pivotal for point-of-care, non-invasive diagnosis of musculoskeletal (MSK) injuries. Notably, MSK ultrasound demands a higher level of operator expertise than general ultrasound procedures, necessitating thorough checks on image quality and precise categorization of each image. This need for skilled assessment highlights the importance of developing supportive tools for quality control and categorization in clinical settings. To address these challenges, we aimed to develop and evaluate an automated system that assesses whether an ultrasound image meets established standards and identifies its specific category, thereby helping clinicians improve diagnostic efficiency and accuracy in MSK ultrasound.

Methods: We proposed a two-stage system built upon a Swin Transformer-based Unet model. Initially, an ultrasound image of the elbow is categorized, following which its quality is evaluated to determine if it meets the established standard criteria set by experienced radiologists.

Results: The proposed method surpassed other approaches in performance, with the Swin Transformer backbone outperforming conventional convolutional neural network (CNN) models. When the same backbone was used, the two-stage model exceeded the capabilities of the single-stage model. Notably, the accuracy of our model exceeded 97%, with each category achieving over 95%. Processing time was reduced from over 16 minutes to less than 15 seconds per 100 images. Furthermore, the model reduces the workload of experienced radiologists, demonstrating its efficiency and effectiveness, and allows human experts to focus on rare and challenging cases that machine learning struggles to solve and on creating high-quality labels for the models to learn from.

Conclusions: We have introduced an elbow ultrasound image recognition system leveraging the Swin Transformer, achieving competitive outcomes in both categorization and standard verification. This approach significantly reduces diagnostic time, allowing radiologists to focus more on scientific validation rather than image analysis.

Keywords: Ultrasound imaging; elbow classification; deep learning; Swin Transformer


Submitted Apr 16, 2024. Accepted for publication Nov 16, 2024. Published online Dec 18, 2024.

doi: 10.21037/qims-24-763


Introduction

In the realm of musculoskeletal (MSK) imaging, magnetic resonance imaging (MRI) is often considered the gold standard due to its superior tissue resolution, which allows detailed imaging of both soft tissues and bones simultaneously (1). Despite these advantages, the widespread adoption of MRI as a routine imaging modality has been hindered by its high cost and the discomfort it often causes patients. For example: (I) MRI scans typically take longer than other imaging tests, around 30 minutes, and require patients to remain still, which can be difficult for those with joint issues. (II) The confined space of the MRI machine can trigger anxiety or claustrophobia in some patients. (III) MRI machines are loud, and even with earplugs or headphones, the noise may still cause discomfort. In contrast, ultrasound imaging is more patient-friendly, widely accessible, and cost-effective. Over the past two decades, advancements in ultrasound equipment and techniques have significantly enhanced its utility in MSK imaging (2). It is now frequently employed to evaluate joints, tendons, ligaments, muscles, nerves, and bone surfaces.

Ultrasound imaging plays a critical role in non-invasively diagnosing injuries, especially in scenarios where more advanced diagnostic tools are unavailable (3,4). The elbow joint, in particular, with its relatively straightforward anatomical structure and superficially located main tendons and ligaments, is especially amenable to ultrasound examination. Numerous guidelines and protocols have been developed to standardize elbow ultrasound examinations (3,5). However, it is important to note that MSK ultrasound is more dependent on the operator’s skill compared to general ultrasound imaging. It demands a thorough understanding of the anatomical structures and proficiency in examination techniques (6). Consequently, ultrasound radiologists are required to undertake standardized training to minimize the risk of inaccurate interpretations.

Therefore, interpreting ultrasound images can be particularly challenging, as it requires specialized expertise in MSK anatomy and advanced proficiency in ultrasound techniques to ensure accurate assessments. To address this issue, there is a notable shift towards utilizing artificial intelligence (AI) algorithms to enhance the process of image analysis and diagnosis (7). A significant breakthrough in this domain is the development of a convolutional neural network (CNN) (8,9) tailored for identifying shrapnel in ultrasound images. This method was initially validated by embedding various types and sizes of shrapnel in a tissue-mimicking phantom and subsequently in swine thigh tissues, marking an early application of this technology (10). Moreover, the past 5 years have witnessed remarkable advancements with the integration of transformers, which, originating in natural language processing (11), have shown immense potential in areas such as image classification and denoising (12,13). Recent advancements in deep learning have significantly improved medical ultrasound imaging, particularly in segmentation and visualization. Shen et al. (14) introduced the Dilated Transformer with residual axial attention for breast ultrasound segmentation, outperforming CNNs by capturing long-range dependencies. Zhao et al. (15) applied deep learning to visualize distal humeral cartilage, showcasing AI's potential for ultrasound-guided therapies. Chen et al. (16) proposed GDUNet, a U-shaped network with attention gates and dilation, enhancing multiscale breast ultrasound segmentation. Shin et al. (17) developed tFUSFormer, a physics-guided transformer for simulating transcranial ultrasound in brain stimulation. Lastly, Chi et al. (18) presented a hybrid Transformer UNet for thyroid segmentation, blending Transformer and UNet strengths. These works demonstrate deep learning's growing impact on ultrasound imaging.

However, accurately classifying elbow ultrasound images into categories while simultaneously determining their standardization within a single model is a complex challenge. Drawing inspiration from two-stage object detection models, we proposed simplifying the deep learning model by dividing it into two sequential phases: one for classifying the elbow image type and another for determining its standardization.

In our study, we introduce a first-of-its-kind approach to the automated recognition of elbow joint ultrasound images, specifically targeting those from healthy subjects. By adapting the deep learning model into a two-stage framework, we significantly enhance its performance. This optimization not only reduces diagnostic times but also removes the need for radiologist involvement in routine image review. We present this article in accordance with the STARD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-24-763/rc).


Methods

Dataset descriptions

This dataset was collected at Chengdu First People’s Hospital in Chengdu, Sichuan, China, from 2020 to 2022. In line with the Guidelines of Chinese Musculoskeletal Ultrasound Examination (19), seven standard ultrasound sections of the elbow were identified for this study, as shown in Figure 1. The scanning positions are displayed in Table 1. From 120 adult participants, 600 standard elbow ultrasound images and 600 corresponding non-standard images were gathered for each section type, as shown in Table 2; two of the categories had two non-standard types (1,200 non-standard images each). The study complies with the Declaration of Helsinki (as revised in 2013). All participants provided informed consent. The study protocol was reviewed and approved by the Ethics Committee of Chengdu First People’s Hospital (approval No. 2023 KT 010). The demographic information of the participants is shown in Table 3. Identifying subject information was removed immediately upon receipt of the data.

Figure 1 Column I–VII indicates different elbow ultrasound image types. I, short axis view of the anterior surface of the elbow, showing the capitellum and trochlea of the humerus. II, long axis view at the anterior and radial surface of the elbow, showing the capitellum, the radial head and the humeroradial joint. III, long axis view at the anterior and ulnar surface of the elbow, showing the coronoid fossa of the humerus, trochlea, and coronoid process of the ulna. IV, long axis view of the anterior surface of the elbow, showing the distal biceps tendon. V, long axis view at the lateral surface of the elbow, showing the common extensor tendon. VI, long axis view at the medial surface of the elbow, showing the common flexor tendon. VII, long axis view at the posterior surface of the elbow, showing the triceps brachii tendon and its distal attachment point. Row (A) refers to the standard view of the elbow images, and rows (B) and (C) refer to the non-standard ones.

Table 1

The demonstration of each elbow ultrasound image type

Elbow ultrasound image types The photo of scanning position Scanning position notes
I: short axis view of the anterior surface of the elbow; II: long axis view at the anterior and radial surface of the elbow; III: long axis view at the anterior and ulnar surface of the elbow The client is seated with the elbow straightened and placed on the examination table
IV: long axis view of the anterior surface of the elbow The client is seated with the elbow extended and the forearm supinated on the examination table
V: long axis view at the lateral surface of the elbow The client is seated with the elbow flexed at 90 degrees and the forearm supinated on the examination table
VI: long axis view at the medial surface of the elbow The client is seated with the elbow slightly flexed and the forearm everted and positioned on the examination table
VII: long axis view at the posterior surface of the elbow The client is seated with the elbow flexed at 90 degrees and the palm facing down on the examination table

Table 2

The number of different types of elbow ultrasound images

Elbow ultrasound image types Total Standard Non-standard
Short axis view of the anterior surface of the elbow 1,200 600 600
Long axis view at the anterior and radial surface of the elbow* 1,800 600 1,200
Long axis view at the anterior and ulnar surface of the elbow* 1,800 600 1,200
Long axis view of the anterior surface of the elbow 1,200 600 600
Long axis view at the lateral surface of the elbow 1,200 600 600
Long axis view at the medial surface of the elbow 1,200 600 600
Long axis view at the posterior surface of the elbow 1,200 600 600

*, there are two different non-standard sections for those elbow ultrasound image types.

Table 3

The demographic information of the participants

Patients’ demographics Value
Age (years) 31.25±6.93
Gender
   Female 70 (58.33)
   Male 50 (41.66)
BMI (kg/m2) 25.12±3.69

Data are presented as mean ± standard deviation or count (%). BMI, body mass index.

The ultrasound examinations were conducted using the SonoScape S60 Exp ultrasound system, manufactured by SonoScape Medical Corporation, Shenzhen, China. This system was equipped with a 12L-A array linear transducer, featuring a bandwidth frequency range of 3–17 MHz, and was operated by a sonographer with more than 5 years of experience in MSK ultrasound imaging. To ensure consistency and reproducibility across the scans, grayscale setting parameters were standardized. These settings included a B-mode frequency of 11.4–13.8 MHz, a mechanical index of 0.5, a dynamic range of 200, and a focus position tailored to the area of interest. Since all images were acquired using the same scanner and imaging sequence, we used the raw images directly as input for our model, resizing each image to 227×227 pixels. The acquisition specifications varied based on the scanned area. For the elbow joint, which has a shallow structure, the depth is typically set between 2–4 cm. The field of view (FOV) should be adjusted according to the anatomical structure, with an FOV for the elbow joint generally falling between 1–4 cm.

Network structures

The network’s architecture is designed in two distinct stages, as illustrated in Figure 2. The first stage classifies the type of elbow ultrasound image, whereas the second stage evaluates whether the image meets the standardization criteria. In the first stage, the Swin Transformer UNet (20) serves as the backbone model, offering a modern approach to feature extraction and analysis. In the second stage, where the task reduces to a standardization decision, a traditional CNN performs comparably to the Swin Transformer, so a CNN is used there for efficiency and simplicity. The first stage of the network comprises three key modules: shallow feature extraction, UNet feature extraction, and categorization classification, each contributing to the robust processing and accurate classification of elbow ultrasound images.

Figure 2 The architecture of the two-stage network. US, ultrasound; CNN, convolutional neural network.

The shallow feature extraction module utilizes a single 3×3 convolution layer to capture low-frequency details such as color, shape, and texture from the input elbow ultrasound images. These initial, shallow features are then processed by the UNet feature extraction module, which is designed to extract high-level and multi-scale deep features essential for accurate analysis. This process is facilitated by three Swin Transformer Blocks (STBs), each comprising eight Swin Transformer Layers (STLs). The specifics of the STBs and STLs will be detailed in the following section. To finalize the feature extraction phase, another 3×3 convolution layer is employed to classify the ultrasound images into their respective categories based on the deep features identified.
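To make the data flow of the first stage concrete, the minimal sketch below mirrors the three modules described above: a 3×3 shallow convolution, a UNet-style encoder-decoder whose blocks would be Swin Transformer Blocks, and a final 3×3 convolution feeding a 7-way classifier. The `SimpleBlock` stand-in (plain residual convolutions), the layer widths, and the pooling-based classification head are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SimpleBlock(nn.Module):
    """Placeholder for a Swin Transformer Block (STB); the windowed attention it
    would contain is sketched separately below."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)  # residual connection, as in transformer blocks

class StageOneClassifier(nn.Module):
    """Category classifier: shallow conv -> UNet-style features -> 7-way head."""
    def __init__(self, in_ch=1, width=96, num_classes=7):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, width, 3, padding=1)        # shallow feature extraction
        self.encoder = nn.Sequential(SimpleBlock(width), nn.MaxPool2d(2))
        self.bottleneck = SimpleBlock(width)
        self.decoder = nn.Sequential(nn.Upsample(scale_factor=2), SimpleBlock(width))
        self.fuse = nn.Conv2d(width, width, 3, padding=1)           # final 3x3 convolution
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(width, num_classes))

    def forward(self, x):
        s = self.shallow(x)
        f = self.decoder(self.bottleneck(self.encoder(s)))
        return self.head(self.fuse(f + s))                          # skip connection, then classify

model = StageOneClassifier()
logits = model(torch.randn(2, 1, 224, 224))   # -> shape (2, 7)
```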

In the UNet feature extraction module, we use STBs to substitute for the traditional convolution layers, as shown in Figure 3A. The STL (20) is based on the original Transformer layer (11) from natural language processing. The number of STLs is always a multiple of two, allocating one layer to window multi-head self-attention (W-MSA) and the next to shifted-window multi-head self-attention (SW-MSA). Diverging from the traditional transformer layer architecture, the STL employs a cyclic shift strategy, as shown in Figure 3B. This approach reduces computational time while preserving key convolutional properties such as translation invariance, rotation invariance, and a consistent relationship between receptive field size and depth. The hyperparameters of the Swin Transformer and CNN are listed in Tables 4,5, respectively.
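The core of each STL is attention computed inside fixed windows, with every second layer cyclically shifting the feature map so that information crosses window borders. The snippet below is a minimal sketch assuming the 7×7 window and shift size of 3 from Table 4; it shows only the window partitioning and the `torch.roll` cyclic shift, while the attention itself would follow the standard multi-head formulation (11,20).

```python
import torch

def window_partition(x, window_size=7):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows*B, window_size*window_size, C) for window-local attention (W-MSA)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def cyclic_shift(x, shift_size=3):
    """Shifted-window step (SW-MSA): roll the map so the next window partition
    mixes tokens that previously sat in different windows."""
    return torch.roll(x, shifts=(-shift_size, -shift_size), dims=(1, 2))

x = torch.randn(1, 56, 56, 96)                     # e.g., 224/4 patch grid with 96-dim embeddings
wins = window_partition(x)                         # windows for W-MSA
wins_shifted = window_partition(cyclic_shift(x))   # windows for SW-MSA
print(wins.shape, wins_shifted.shape)              # (64, 49, 96) in both cases
```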

Figure 3 The network architecture of the proposed method for elbow US recognition. (A) STB, which contains 8 STLs in our experiments. (B) The structure of the STL. STB, Swin Transformer Block; STL, Swin Transformer Layer; H, height; W, width; C, channel; MSA, multi-head self-attention; MLP, multilayer perceptron.

Table 4

The hyperparameters of Swin Transformer

Hyperparameter Value
Input image size (pixel) 224×224
Patch size (pixel) 4×4
Window size (pixel) 7×7
Embedding dimension 96
Number of heads 3
Number of layers 4
MLP ratio 4
Dropout rate 0.1
Stochastic depth rate 0.1
Shift size (pixel) 3
Optimizer ADAM
Learning rate 1e−4
Batch size (sample) 16
Weight decay 0.05

MLP, multilayer perceptron; ADAM, adaptive moment estimation.

Table 5

The hyperparameters of CNN

Hyperparameter Value
Input image size (pixel) 224×224
Kernel size (pixel) 7×7
Number of filters 32, 64, 128, 256, 128, 64, 32
Stride (pixel) 1
Padding Same
Pooling (pixel) 2×2
Activation function ReLU
Optimizer ADAM
Learning rate 1e−4
Batch size (sample) 32
Loss function Cross entropy loss

CNN, convolutional neural network; ReLU, rectified linear unit; ADAM, adaptive moment estimation.

Upon successful categorization, the process advances to the second stage, where an individual CNN model, ResNet-34 (8), is deployed for each image category. This stage determines whether the input image adheres to the standardization criteria, effectively distinguishing standard from non-standard images within the specified category.
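Operationally, the second stage can be viewed as one ResNet-34 classifier per image type, selected by the first-stage prediction. The routing sketch below uses `torchvision`'s ResNet-34; the ordering of the class counts, the zero-based category indices, and the grayscale-to-RGB handling are assumptions for illustration only.

```python
import torch
from torchvision.models import resnet34

# Seven category-specific standardization classifiers. Indices 1 and 2 are assumed to be
# the two starred categories with two non-standard sections (hence three classes).
classes_per_type = [2, 3, 3, 2, 2, 2, 2]
stage_two = [resnet34(num_classes=c) for c in classes_per_type]

def recognize(image, stage_one):
    """image: (1, 1, 224, 224) grayscale tensor; stage_one: a trained category classifier."""
    with torch.no_grad():
        category = stage_one(image).argmax(dim=1).item()           # stage 1: which of the 7 views
        rgb = image.repeat(1, 3, 1, 1)                             # ResNet-34 expects 3 channels
        standard = stage_two[category](rgb).argmax(dim=1).item()   # stage 2: standard or not
    return category, standard
```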

The loss function of the model combines the losses of the two stages, as shown below:

\mathrm{Loss} = L_{CE\_1} + \sum_{i=1}^{N} L_{CE\_2\_i}

where L_CE_1 is the cross entropy loss used in the first stage to classify the elbow image type, and the summation of L_CE_2_i over the N categories is the cross entropy loss used in the second stage to determine, for each category, whether the image is standard. The details of the cross entropy loss are as follows:

L_{CE} = -\sum_{i=1}^{n} t_i \log(p_i)

where t_i is the truth label and p_i is the Softmax probability for the i-th class, and n represents the number of classes within each stage. In the first stage, n is set to 7, corresponding to the seven distinct types of elbow ultrasound images being identified. For the second stage, n is reduced to 2, serving to classify images based on the standardization criteria, effectively distinguishing between standard and non-standard images.
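Under the definitions above, the combined objective is simply the first-stage cross entropy plus the sum of the per-category second-stage cross entropies. The sketch below is one hedged reading of that formula; how the second-stage terms are batched in the authors' code is not specified, so here each category classifier contributes a loss only for the samples of its own category.

```python
import torch.nn.functional as F

def combined_loss(stage1_logits, type_labels, stage2_logits_per_type, standard_labels):
    """
    stage1_logits:           (B, 7) category predictions.
    type_labels:             (B,)   ground-truth category indices.
    stage2_logits_per_type:  list of N tensors, the i-th holding standardization logits
                             for the batch samples belonging to category i.
    standard_labels:         list of N tensors with the matching standard/non-standard labels.
    """
    loss = F.cross_entropy(stage1_logits, type_labels)             # L_CE_1
    for logits_i, labels_i in zip(stage2_logits_per_type, standard_labels):
        if logits_i.numel() > 0:                                   # skip categories absent from the batch
            loss = loss + F.cross_entropy(logits_i, labels_i)      # sum of L_CE_2_i
    return loss
```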

Training details

All the ultrasound images were categorized and standardized by two experienced radiologists. In cases of initial disagreement, the images were reviewed again to reach a consensus on the final label, which served as both the training reference and the ground truth for evaluation. The entire dataset was randomly shuffled and split into training, validation, and testing sets in a 60/20/20 ratio. Because only 20% of the data formed the testing set, we split the data into five groups and conducted five experiments as five-fold cross validation so that the entire dataset was covered; 300 epochs and a batch size of 4 were used for training. The adaptive moment estimation (ADAM) optimizer (21) with an initial learning rate of 0.001 was used to train the network. After 120 and 180 epochs, the learning rate was reduced by factors of 9 and 19, respectively. The proposed network is an end-to-end trainable model, trained without any pretrained networks, and was implemented in PyTorch 1.8.0 (22) on a single NVIDIA GTX 1080 Ti GPU under Ubuntu 22.04.
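A minimal training loop consistent with the stated settings (ADAM, initial learning rate 0.001, 300 epochs, schedule changes at epochs 120 and 180) might look like the sketch below. The `LambdaLR` factors of 1/9 and 1/19 are our reading of the learning-rate reductions described above, and `model` and `train_loader` are assumed to exist; only the stage-1 cross entropy is shown, with `combined_loss` usable in its place.

```python
import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

def lr_factor(epoch):
    # Assumed interpretation of "reduced by factors of 9 and 19" after epochs 120 and 180.
    return 1.0 if epoch < 120 else (1 / 9 if epoch < 180 else 1 / 19)

def train(model, train_loader, epochs=300, device="cuda"):
    model = model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-3)
    scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:                  # batches from the shuffled 60% split
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)     # stage-1 loss; swap in combined_loss
            loss.backward()
            optimizer.step()
        scheduler.step()                                      # epoch-based learning-rate schedule
```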

Evaluation metrics

Since this is the first-of-its-kind deep learning method for elbow ultrasound image recognition, there is no prior method to compare against. We therefore compared the proposed method with our single-stage models using Swin Transformer and CNN backbones. For the quantitative comparisons, predictions on the test image sets were generated after training using a separate Python script that loaded the trained model and performed independent predictions on every image in the test sets. These test results were used to calculate the overall performance metrics of the trained models: accuracy, sensitivity, and specificity, as shown below:

\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{True Positives} + \text{True Negatives} + \text{False Positives} + \text{False Negatives}}

\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}

\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}}
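For reference, these metrics reduce to simple counts over the independent test-set predictions; a short sketch, written here for the binary standard/non-standard decision, is shown below.

```python
def binary_metrics(predictions, labels, positive=1):
    """predictions, labels: iterables of class indices; 'positive' marks the standard class."""
    tp = sum(p == positive and t == positive for p, t in zip(predictions, labels))
    tn = sum(p != positive and t != positive for p, t in zip(predictions, labels))
    fp = sum(p == positive and t != positive for p, t in zip(predictions, labels))
    fn = sum(p != positive and t == positive for p, t in zip(predictions, labels))
    accuracy    = (tp + tn) / max(tp + tn + fp + fn, 1)
    sensitivity = tp / max(tp + fn, 1)
    specificity = tn / max(tn + fp, 1)
    return accuracy, sensitivity, specificity

# Example: perfect predictions on four test images
print(binary_metrics([1, 0, 1, 0], [1, 0, 1, 0]))   # (1.0, 1.0, 1.0)
```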


Results

Since 20% of the images were used as the testing set in each split, five-fold cross validation was conducted so that the entire dataset was evaluated as testing samples.

Figure 4 shows the loss curves for both the training and validation phases across the different models. The training loss declined consistently for all models, with convergence occurring around the 25th epoch. In contrast, the validation loss behaved quite differently across models. The single-stage CNN model failed to converge even after 100 epochs, despite displaying the smallest training loss among the evaluated methods, and its validation performance peaked after just 10 epochs of training. Conversely, the transformer models outperformed the CNN model: both the single-stage and two-stage SWIN Transformer models showed a steady reduction in validation loss as the number of epochs increased, aligning with the trends observed in our quantitative assessments.

Figure 4 The training and validation loss of different models. (A) Single-stage CNN model; (B) single-stage SWIN model; (C) two-stage SWIN model. CNN, convolutional neural network; SWIN, Shift WINdow.

In Table 6, we calculated the accuracy of all three models across the entire dataset. In the single-stage models, both standard and non-standard elbow image types were treated as individual classes. The SWIN Transformer backbone demonstrated superior performance over the traditional CNN model, showing an approximate 5% improvement in accuracy. By dividing the model into two stages, first categorizing the image type and then assessing its standardization, the accuracy was further enhanced to 97.32%. Notably, the majority of inaccuracies occurred during the second stage, suggesting that the model finds it more challenging to differentiate the distribution of non-standard images from standard ones. Training with additional non-standard data could potentially improve the results.

Table 6

The accuracy of different methods across all the images

Method Accuracy (%)
Single-stage CNN model 90.69
Single-stage SWIN Transformer model 95.86
Two-stage SWIN Transformer model 97.32

CNN, convolutional neural network; SWIN, Shift Window.

In our investigation of the two-stage SWIN Transformer model, we evaluated each image type for sensitivity, specificity, and accuracy. According to Table 7, all categories showed remarkable performance. However, the categories comprising two non-standard types (long axis view at the anterior and radial surface of the elbow, and long axis view at the anterior and ulnar surface of the elbow) exhibited the lowest outcomes. This diminished performance is attributed to the complexity of their second-stage classification, which involves distinguishing among three classes instead of the simpler two-class models for the other categories. Enhancing the model with additional training data appears to be the most effective strategy for improving the results.

Table 7

The sensitivity, specificity, and accuracy of the two-stage model in terms of different elbow ultrasound image types

Elbow ultrasound image types Sensitivity (%) Specificity (%) Accuracy (%)
Short axis view of the anterior surface of the elbow 98.33 96.67 97.50
Long axis view at the anterior and radial surface of the elbow* 97.50 95.00 95.83
Long axis view at the anterior and ulnar surface of the elbow* 98.83 95.83 96.83
Long axis view of the anterior surface of the elbow 98.50 98.33 98.42
Long axis view at the lateral surface of the elbow 99.17 98.67 98.92
Long axis view at the medial surface of the elbow 98.33 97.83 98.08
Long axis view at the posterior surface of the elbow 96.33 97.00 96.67

*, there are two different non-standard sections for those elbow ultrasound image types.

In Table 8, we conducted a time efficiency comparison among the methods, with two experienced radiologists evaluating the entire dataset to establish a baseline. Remarkably, all deep learning-based approaches drastically reduced the average evaluation time for 100 ultrasound images from over 16 minutes to under 15 seconds. The most time-consuming aspect of these models is the training phase, which, fortunately, can be performed offline. Once trained, the model can be deployed in diverse settings without requiring the intervention of experienced radiologists. With just a basic training session, anyone can operate the automated system and directly obtain results from the images. This frees radiologists to contribute to the development of the model by labeling existing ultrasound images to compile a training dataset. By consolidating the efforts of radiologists from various locations, it is possible to build a more robust model capable of managing different image sequences and potentially accommodating multiple joints simultaneously.

Table 8

The average evaluation time of different methods

Method Average evaluation time per 100 images
Trained radiologist I 16 min 32 s
Trained radiologist II 16 min 54 s
Single-stage SWIN model 5.5 s
Two-stage SWIN model 13.2 s

SWIN, Shift WINdow.


Discussion

To the best of our knowledge, this research marks the inaugural effort to employ deep learning techniques in the development of an automated system for elbow ultrasound image recognition (23). The findings demonstrate the Swin Transformer’s exceptional capability in accurately classifying elbow ultrasound images, addressing both the categorization and quality control aspects. This approach has the potential to significantly reduce the manual effort required by radiologists for sequential image diagnosis. Instead, their valuable expertise could be redirected towards training deep learning models for elbow ultrasound image recognition, enabling the deployment of such systems globally. The success of this experiment holds promise for application to other joints or body parts, indicating a scalable and versatile solution in the realm of medical imaging.

According to Figure 4, the CNN appears to overfit compared to the Transformer models, which likely accounts for its lower performance relative to the Swin Transformer. The Transformer models handled the data more effectively without showing signs of overfitting, suggesting that they, and the Swin Transformer in particular, are better suited to capturing the underlying patterns in the data and thus achieve superior performance under similar training settings.

Additionally, the performance gain from transitioning to a two-stage model underlines deep learning's proficiency at simpler, decomposed tasks, particularly when training samples are limited. Providing sufficient data to the single-stage model might enable it to achieve results comparable to those of the multi-stage model. Moreover, the efficacy of the two-stage model hinges on the accuracy of the initial stage; inaccuracies at this level could compromise the subsequent stage's outcomes. In the future, we aim to generate saliency maps for deeper insight into how the model processes elbow ultrasound images. It will be intriguing to explore whether the deep learning model prioritizes similar regions when categorizing elbow images and when assessing their quality, offering a clearer understanding of its decision-making processes.

The code for both the single-stage and two-stage models using CNN and Swin Transformer will be released soon. However, due to Institutional Review Board (IRB) and hospital regulations, we are unable to share the original data. Instead, we will provide toy data and pre-trained weights to help run the experiments, along with detailed instructions for users to implement the models on their own datasets.


Conclusions

In this paper, we presented an elbow ultrasound image recognition system based on a Swin Transformer backbone that achieves competitive results on both categorization and standardization classification. Furthermore, we proposed a two-stage model to improve the performance of the automated system. Finally, the system saves time and does not require trained radiologists for routine image assessment. It is too early to say that deep learning can replace the expertise of radiologists; however, the potential of deep learning, and the Swin Transformer in particular, deserves continued attention in the future.


Acknowledgments

Funding: None.


Footnote

Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-24-763/rc

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-763/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). The study was approved by the Ethics Committee of Chengdu First People’s Hospital (approval No. 2023 KT 010) and informed consent was taken from all individual participants.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Sneag DB, Abel F, Potter HG, Fritz J, Koff MF, Chung CB, Pedoia V, Tan ET. MRI Advancements in Musculoskeletal Clinical and Research Practice. Radiology 2023;308:e230531. [Crossref] [PubMed]
  2. Cook CR. Ultrasound Imaging of the Musculoskeletal System. Vet Clin North Am Small Anim Pract 2016;46:355-71. v. [Crossref] [PubMed]
  3. Jäschke M, Weber MA. Ultrasound of the elbow-standard examination technique and normal anatomy. Radiologe 2018;58:985-90. [Crossref] [PubMed]
  4. Tamborrini G, Bianchi S. Ultrasound of the Elbow (Adapted According to SGUM Guidelines). Praxis (Bern 1994) 2020;109:641-51.
  5. Martinoli C, Bianchi S, Giovagnorio F, Pugliese F. Ultrasound of the elbow. Skeletal Radiol 2001;30:605-14. [Crossref] [PubMed]
  6. Karanasios S, Korakakis V, Moutzouri M, Drakonaki E, Koci K, Pantazopoulou V, Tsepis E, Gioftsos G. Diagnostic accuracy of examination tests for lateral elbow tendinopathy (LET) - A systematic review. J Hand Ther 2022;35:541-51. [Crossref] [PubMed]
  7. Snider EJ, Hernandez-Torres SI, Boice EN. An image classification deep-learning algorithm for shrapnel detection from ultrasound images. Sci Rep 2022;12:8427. [Crossref] [PubMed]
  8. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:770-8.
  9. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer; 2015:234-41.
  10. Liu S, Wang Y, Yang X, Lei B, Liu L, Li SX, Ni D, Wang T. Deep learning in medical ultrasound analysis: a review. Engineering 2019;5:261-75.
  11. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in Neural Information Processing Systems 2017;30.
  12. Fan CM, Liu TJ, Liu KH. SUNet: swin transformer UNet for image denoising. In: 2022 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE; 2022:2333-7.
  13. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision. Springer; 2022:205-18.
  14. Shen X, Wang L, Zhao Y, Liu R, Qian W, Ma H. Dilated transformer: residual axial attention for breast ultrasound image segmentation. Quant Imaging Med Surg 2022;12:4512-28. [Crossref] [PubMed]
  15. Zhao W, Su X, Guo Y, Li H, Basnet S, Chen J, Yang Z, Zhong R, Liu J, Chui EC, Pei G, Li H. Deep learning based ultrasonic visualization of distal humeral cartilage for image-guided therapy: a pilot validation study. Quant Imaging Med Surg 2023;13:5306-20. [Crossref] [PubMed]
  16. Chen J, Shen X, Zhao Y, Qian W, Ma H, Sang L. Attention gate and dilation U-shaped network (GDUNet): an efficient breast ultrasound image segmentation network with multiscale information extraction. Quant Imaging Med Surg 2024;14:2034-48. [Crossref] [PubMed]
  17. Shin M, Seo M, Yoo SS, Yoon K. tFUSFormer: Physics-Guided Super-Resolution Transformer for Simulation of Transcranial Focused Ultrasound Propagation in Brain Stimulation. IEEE J Biomed Health Inform 2024;28:4024-35. [Crossref] [PubMed]
  18. Chi J, Li Z, Sun Z, Yu X, Wang H. Hybrid transformer UNet for thyroid segmentation from ultrasound scans. Comput Biol Med 2023;153:106453. [Crossref] [PubMed]
  19. Chinese Society of Ultrasound Medical Engineering Musculoskeletal System Ultrasound Professional Committee. Specification for Musculoskeletal Ultrasound Examination and Reporting. Chin J Med Ultrasound 2015;1:11-7. (Electronic Edition).
  20. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2021:10012-22.
  21. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint 2014. arXiv:1412.6980.
  22. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 2019;32.
  23. Song K, Feng J, Chen D. A survey on deep learning in medical ultrasound imaging. Front Phys 2024;12:1398393.
Cite this article as: Zhou W, Zhou C, Hu L, Qiu L. Automated elbow ultrasound image recognition: a two-stage deep learning system via Swin Transformer. Quant Imaging Med Surg 2025;15(1):731-740. doi: 10.21037/qims-24-763
