Original Article

Lightweight attention network for guidewire segmentation and localization in clinical fluoroscopic images of vascular interventional surgery

Haoyun Wang1 ORCID logo, Ziyang Mei1 ORCID logo, Kanqi Wang1 ORCID logo, Jingsong Mao2, Lianxin Wang3, Gang Liu1,4, Yang Zhao1,5,6 ORCID logo

1Xiamen University, Xiamen, China; 2Department of Vascular Intervention, Affiliated Hospital of Guilin Medical University, Guilin, China; 3Department of Orthopedics, The First Affiliated Hospital of Xiamen University, School of Medicine, Xiamen University, Xiamen, China; 4Center for Molecular Imaging and Translational Medicine, School of Public Health, Xiamen University, Xiamen, China; 5Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University, Xiamen, China; 6Shenzhen Research Institute of Xiamen University, Shenzhen, China

Contributions: (I) Conception and design: Y Zhao, G Liu, H Wang; (II) Administrative support: Y Zhao, G Liu; (III) Provision of study materials or patients: G Liu, J Mao; (IV) Collection and assembly of data: H Wang, Z Mei, K Wang; (V) Data analysis and interpretation: H Wang, Z Mei, L Wang; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Yang Zhao, PhD. Pen-Tung Sah Institute of Micro-Nano Science and Technology, Xiamen University, No. 4221, Xiang’an South Road, Xiang’an District, Xiamen 361102, China; Xiamen University, Xiamen, China; Shenzhen Research Institute of Xiamen University, Shenzhen, China. Email: zhaoy@xmu.edu.cn; Gang Liu, PhD. Xiamen University, Xiamen, China; Center for Molecular Imaging and Translational Medicine, School of Public Health, Xiamen University, No. 4221, Xiang’an South Road, Xiang’an District, Xiamen 361102, China. Email: gangliu.cmitm@xmu.edu.cn; Lianxin Wang, PhD. Department of Orthopedics, The First Affiliated Hospital of Xiamen University, School of Medicine, Xiamen University, No. 55, Zhenhai Road, Siming District, Xiamen 361005, China. Email: dr_shepherd@sina.com.

Background: During transcatheter arterial chemoembolization (TACE), the delivery of a guidewire to the lesion site is a critical step, making the analysis and positioning of guidewire morphology crucial for both robotic systems and physicians in interventional surgeries. Current research on guidewires often faces challenges such as a low image signal-to-noise ratio and severe class imbalance. To overcome these issues and enhance the practical delivery of guidewires in clinical settings, this study introduces a comprehensive dataset for guidewire delivery during TACE and develops a specialized deep learning model for segmenting guidewire morphology in X-ray fluoroscopic images.

Methods: We retrospectively collected 2,839 X-ray images acquired under real-time guidance from 38 subjects and manually annotated the guidewires. We proposed a deep learning-based guidewire segmentation method, which integrated two effective modules designed in this study: a bilateral feature fusion (BGA) module and a lightweight gated attention (SDA) module, achieving precise segmentation of guidewires in intraoperative images.

Results: Quantitative and qualitative assessments were performed on 903 clinical images from 27 X-ray fluoroscopy videos. The segmentation network proposed in this paper demonstrated superior performance, achieving an area under the curve (AUC) of 91.64%, a Macro-F1 score of 85.63%, and a Dice coefficient of 71.29%.

Conclusions: This study introduces a novel guidewire segmentation method specifically designed for clinical TACE. It not only assists physicians during interventional procedures but is also expected to be integrated into the intelligent systems of vascular interventional surgical robots, enabling robotic assistance in future interventional surgeries.

Keywords: Transcatheter arterial chemoembolization (TACE); deep learning; medical image segmentation; X-ray fluoroscopy


Submitted Dec 25, 2024. Accepted for publication Mar 19, 2025. Published online Apr 28, 2025.

doi: 10.21037/qims-2024-2926


Introduction

Due to the insidious onset and high malignancy of liver cancer, the majority of patients are diagnosed at intermediate to late stages, resulting in a low success rate for surgical resection. With the advancement of vascular interventional therapy, an increasing number of clinical cases have indicated that transcatheter arterial chemoembolization (TACE) has become the primary treatment method for unresectable hepatocellular carcinoma (HCC) (1). Compared to traditional surgical methods, interventional procedures offer several distinct advantages, such as minimal invasiveness, the potential for repeated treatments, improved patient quality of life, reduced risk of complications, and alleviated surgical burden on physicians.

In TACE interventional procedures, physicians insert catheters, microcatheters, and guidewires through a puncture site, starting from the femoral artery, to deliver the guidewire to the specific hepatic tumor-feeding artery, and subsequently inject embolic agents and chemotherapeutic drugs for treatment (2). During the procedure, physicians rely on real-time X-ray fluoroscopy images to observe the state of guidewire delivery, enabling intraoperative judgment and treatment decisions to deliver the guidewire to the target lesion and perform minimally invasive therapy. Therefore, precise manipulation of the guidewire is pivotal for successful embolization and drug delivery. With the advancement of vascular interventional robotic technology (3), exemplified by the CorPath GRX vascular interventional robot (Newton, USA), clinical adoption has been realized to assist physicians in intraoperative guidewire delivery, thereby reducing the intraoperative burden on physicians. However, observing and tracking guidewires based on X-ray fluoroscopy images present significant challenges. Real-time segmentation of the intraoperative guidewire to acquire its morphology ensures the safety and efficiency of interventional procedures and enhances the accuracy of visual guidewire localization. Moreover, with the development of intelligent systems for interventional procedures and vascular interventional surgical robots, there is potential to assist interventional robots in real-time inference of TACE surgical conditions, thereby better supporting physicians during the surgery.

Real-time acquisition of the morphology and location of the guidewire during surgery can provide visual feedback to interventional physicians and robotic systems. However, this task is challenging and fraught with several technical challenges, especially concerning the segmentation of guidewires from TACE clinical fluoroscopy images. The key challenges include: (I) variable proportions of effective content in images. During the procedure, physicians may dynamically adjust the display window size of the X-ray images, thereby demanding high feature extraction and generalization capabilities from the guidewire segmentation network. (II) Extreme class imbalance. In X-ray fluoroscopy images, guidewires have a slender structure and occupy a very small portion of the image, resulting in a very low ratio of guidewire pixels to background pixels. (III) Similar morphology of guidewires to other structures. The structures within the patient’s body, such as bone edges, appear similar to the guidewire in X-ray images, posing potential interference to subsequent guidewire segmentation efforts.

In recent years, deep learning, particularly deep convolutional neural networks (CNNs), has found extensive applications in the segmentation, detection, and keypoint localization of medical surgical instruments. Current research on guidewires is generally based on various vascular interventional surgeries. To more closely align with the clinical application of TACE interventional procedures, we collected real clinical images of guidewires with varying image window sizes for the study of microcatheter and guidewire segmentation. To address the aforementioned key challenges, we applied a series of preprocessing steps to the fluoroscopic images, employing cropping methods to ensure the model focuses exclusively on regions with relevant information. Furthermore, we proposed a guidewire segmentation network utilizing a patch-based approach for both training and inference phases, ultimately achieving the segmentation of guidewire morphology in real clinical intraoperative fluoroscopic images.

The contributions of this paper are as follows:

  • A TACE clinical guidewire dataset is proposed, and a guidewire segmentation framework, termed the focused gated attention network for clinical interventions (FGA-Net), is developed specifically for clinical fluoroscopic images.
  • A feature fusion module, bilateral feature fusion (BGA), is proposed, which integrates multi-scale information from both low-level and high-level features, enabling more effective capture of guidewire characteristics in images.
  • A lightweight attention gate mechanism, lightweight gated attention (SDA), is introduced, which enhances the model’s adaptability to the irregular and elongated structure of the guidewire based on spatial shift operations and deformable convolutions.
  • Extensive experiments demonstrate that our approach achieves state-of-the-art performance on clinical images during TACE procedures.

The code for this paper is publicly available at https://github.com/why-26/Guidewire-Segmentation.git.

Related works

Over the past decade, significant progress has been made in the segmentation of medical surgical instruments. Notably, there have been advances in segmentation and localization techniques for essential interventional instruments such as microcatheters and guidewires. However, the body of existing research in this area remains limited. Current studies can be broadly categorized into two main approaches: those based on curve fitting and those based on deep learning.

Curve fitting-based methods are traditional approaches for segmenting and tracking interventional instruments (4-6). In these methods, the first frame of the intraoperative fluoroscopic sequence requires manual annotation for initialization, and no significant deformation of the instrument should occur between two adjacent frames. Heibel et al. used B-splines to model interventional tools and employed first-order Markov random fields (MRF) for guidewire tracking (5); however, this method still suffers from insufficient tracking accuracy when the visibility of the guidewire is poor. Similarly, Vandini et al. proposed segment-like features to model the guidewire between two adjacent frames using splines (6). These improved methods ensure a certain level of accuracy, achieving the tracking of significant deformations between adjacent frames, but they remain flawed. They cannot achieve efficient real-time segmentation of guidewire morphology, and exhibit limited robustness and generalization in noisy fluoroscopic images.

With the rapid development of deep learning in the field of computer vision, numerous high-performance networks have emerged in the domain of medical imaging (7-10) (e.g., nnU-Net, U-Net, TransUNet, Swin-UNet). These methods assist physicians in improving diagnostic accuracy and can also aid in surgical planning and navigation. CNNs have found extensive applications in the study of guidewires and catheters, including tasks such as morphological segmentation and endpoint localization.

Ambrosini et al. proposed a fully automatic catheter segmentation method based on the U-Net model (11), and trained a network that simultaneously segments guidewires and catheters. However, due to the thickness difference between the two, the accuracy for catheters was significantly higher than for guidewires, indicating that this approach is still in an exploratory stage. Vlontzos and Mikolajczyk trained a multi-class U-Net within the Siamese architecture (12) to segment and localize both vasculature and medical instruments. However, the need for manual threshold selection during ground truth creation may affect the consistency and accuracy of the final segmentation results. Zhou et al. proposed the pyramid attention recurrent network (PAR-Net) for real-time guidewire segmentation and tracking (13), achieving precise guidewire segmentation in intraoperative images. Additionally, Zhou et al. also introduced a two-stage multi-task network framework for guidewires (14), accomplishing guidewire segmentation, endpoint localization, and angle calculation. However, being a two-stage framework, it suffers from redundancy and cannot be trained end-to-end. Du et al. proposed a two-stage detection framework for small objects (15), improving the segmentation and localization accuracy of guidewires. Nevertheless, this study was conducted on a dataset from live animals, thus not fully reflecting the complexity and multi-faceted nature of real clinical surgeries. Ghosh et al. introduced a topology-aware geometric deep learning method, achieving automatic and accurate segmentation of catheters and guidewires in cerebral vasculature images. However, issues such as misclassification of some guidewires and missed detection of catheters still persist (16).

In recent years, deep CNNs and Transformers, along with their variants based on U-shaped architectures, have been widely applied to medical image segmentation tasks. U-shaped architecture-based medical image segmentation networks have achieved commendable performance in various tasks involving X-ray images, such as U-Net (8), Attention U-Net (17), and U-Net++ (18-20). These methods effectively capture multi-scale features through encoder-decoder structures and skip connections, significantly improving segmentation accuracy. For instance, MoNuSAC2020 (21) has demonstrated the effectiveness and potential of deep learning-based segmentation methods in handling multi-task challenges. Sharp U-Net (22) introduced depthwise separable convolutions and sharpening filters, further enhancing the performance of medical image segmentation. However, despite the excellent performance of CNNs, the intrinsic local receptive field limitation of convolutional operations poses challenges in capturing long-range dependencies and global contextual information. Consequently, many studies in medical imaging have introduced attention mechanisms to mitigate these issues. Self-attention mechanisms and multi-head attention mechanisms, as exemplified by the Transformer architecture, allow models to capture long-range dependencies and global contextual information, facilitating a better understanding of image content (23-26).

Additionally, convolutional attention mechanisms, which primarily include channel attention and spatial attention, enhance the model’s effective utilization of image information by adjusting the importance assigned to different feature channels and spatial locations (27-29). Moreover, gated attention mechanisms introduce gated units to achieve dynamic control over information flow transmission, thereby regulating the network’s focus on specific features or information (30,31).


Methods

Dataset

At present, there is no publicly available dataset for micro-guidewire trajectories in liver cancer interventional surgeries. Therefore, this paper proposes a new dataset benchmark, TCGSeg (TACE guidewire segmentation dataset), to validate the performance of the proposed network architecture. This dataset is designed for the task of guidewire morphology segmentation in intraoperative images. Real-time X-ray fluoroscopic videos are collected and extracted at 8 frames per second (fps) to obtain the primary images of micro-guidewire delivery trajectories during surgery. Each image has a resolution of 512×512 and contains only one guidewire.

The morphology, position, and edges of each guidewire were manually annotated. The sequence frames of the interventional surgery fluoroscopy video were segmented into individual guidewire masks using the Labelme tool by an interventional physician with 4 years of experience, under the supervision of another interventional physician with 23 years of experience. In this process, the guidewire pixels were assigned a value of 1, while the background pixels were assigned a value of 0. Finally, the annotated dataset was reviewed and corrected by a medical doctor with 3 years of experience and an interventional physician with 10 years of experience to ensure the accuracy of the dataset construction, thus providing a benchmark for subsequent research on guidewires in interventional surgeries. Figure 1 shows an example of the annotations in our dataset, including the original frame images and their corresponding binary segmentation labels.

Figure 1 The current frame and its corresponding binary segmentation mask in the TCGSeg dataset. The first row shows three fluoroscopic images with yellow bounding boxes indicating the location of the guidewire in the images. The second row displays the corresponding binary segmentation masks of the guidewire, represented as white curves on a black background. Symbols: yellow bounding boxes indicate the guidewire location in the images, and white curves represent the segmented guidewire. TCGSeg, TACE guidewire segmentation dataset; TACE, transcatheter arterial chemoembolization.

In routine interventional procedures, doctors perform intraoperative detection of guidewires and catheters according to the surgical workflow and save the corresponding imaging data. Our dataset is constructed from DSA images of patients undergoing liver interventional therapy, which were collected and preserved following standard procedures at the Affiliated Hospital of Guilin Medical University. The images were acquired using Philips FD20 and FD20UNiQ equipment (Amsterdam, The Netherlands). The entire dataset includes 80 fluoroscopy sequences from 27 subjects, comprising a total of 2,839 fluoroscopic images. The images are divided into a training set (1,936 images from 53 fluoroscopy sequences) and a test set (903 images from 27 fluoroscopy sequences). Across the 46 fluoroscopy sequences, we analyzed the demographic characteristics of the patients, with an average age of 50.2 years (standard deviation: 12.0 years). The group included more male participants (n=28) than female participants (n=22). The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Affiliated Hospital of Guilin Medical University (No. 2022YJSLL-46), and individual consent for this retrospective analysis was waived.

The constructed guidewire segmentation dataset offers the following two main advantages:

  • This dataset includes images of the entire intraoperative trajectory of the micro-guidewire during interventional therapy, starting with the superselective insertion of the micro-guidewire catheter system into the common hepatic artery, proper hepatic artery, left hepatic artery, or right hepatic artery, and extending all the way to the tumor-feeding artery in the liver. Additionally, the TCGSeg dataset includes images with several interference factors, such as stents, internal fixation devices, contrast agents, Lipiodol deposition, and respiratory motion artifacts in some patients, further enriching the dataset composition and enhancing its applicability to actual surgical scenarios. This comprehensive coverage ensures that the dataset provides researchers with a more complete view of the surgical process, thereby increasing the comprehensiveness of subsequent studies and their reliability in clinical practice.
  • This dataset is collected from real intraoperative interventional therapy sessions for liver cancer patients, accurately reflecting the dynamic behavioral changes of physicians during the surgical process, including adjustments in the super-selection morphology of the micro-guidewire, the coordination between the guidewire and catheter, image zoom levels, flow rate, and dosage of the contrast agent. As a result, it is more closely aligned with actual clinical application scenarios.

During surgery, physicians need to make real-time adjustments based on the dynamic feedback from the progress of the guidewire and catheter. This includes modifying the super-selection morphology of the micro-guidewire, coordinating between the guidewire and catheter, and adjusting image zoom levels, image resolution, exposure parameters, window width, and window level. These adjustments ensure that crucial details required by the physician are captured at key moments, thereby guiding the surgical procedure. Consequently, the proportion of content with useful information in intraoperative X-ray images varies. This variation reflects the flexible adjustments physicians make based on intraoperative feedback, demonstrating their judgment and strategic adaptability. Figure 2 shows clinical fluoroscopic images under different conditions in our dataset.

Figure 2 Effective fluoroscopic images with different window sizes. The yellow arrows indicate structures resembling the guidewire, while the yellow dashed boxes denote interference factors such as stents and internal fixations.

In summary, our dataset holds clinical application value in actual liver cancer interventional treatments. It enables segmentation networks to learn more diverse imaging features of guidewires across different time periods, varying window widths, and different machine parameters, further increasing the complexity of segmentation. This not only enhances the accuracy and generalizability of the guidewire segmentation model in clinical intraoperative environments but also provides a reliable foundation for subsequent research aimed at improving the quality of physician training and optimizing intraoperative decision-making processes.

Preprocessing

In X-ray fluoroscopy images, the guidewire occupies a very small portion of the frame and the signal-to-noise ratio is relatively low. This paper therefore adopts a preprocessing pipeline for X-ray fluoroscopy images prior to the pixel-wise binary classification task. This comprehensive preprocessing strategy plays a significant role in the training phase of the network: the preprocessed image data markedly enhance the model's accuracy and robustness in the guidewire segmentation task. Figure 3 shows the flowchart of our data collection and preprocessing.

Figure 3 Overview of TACE dataset collection and preprocessing. This figure provides an overview of the (A) data acquisition and (B) preprocessing pipeline for the TACE dataset. The process is divided into two main stages: data acquisition and data processing. TACE, transcatheter arterial chemoembolization.

Cropping: the proportion of effective content varies across uniformly sized X-ray fluoroscopy images; that is, actual clinical X-ray frames contain both an effective region and an entirely black invalid region. All data are therefore cropped to the non-zero pixel region, with areas outside the crop marked as invalid. This approach allows us to precisely identify and retain the minimal region of the image containing effective information, ensuring that the model focuses solely on the image parts with valid information. This significantly reduces unnecessary computation, thereby enhancing the computational efficiency of subsequent tasks.
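As a concrete illustration, the cropping step can be sketched as follows. This is a minimal example assuming single-channel NumPy arrays; the function name and the returned bounding box are illustrative and not taken from the released code.

```python
import numpy as np

def crop_to_nonzero(image: np.ndarray, mask: np.ndarray):
    """Crop a fluoroscopic frame (and its label) to the bounding box of
    non-zero pixels, discarding the entirely black invalid border."""
    nonzero = np.argwhere(image > 0)                 # coordinates of valid pixels
    if nonzero.size == 0:                            # degenerate all-black frame
        return image, mask, (0, image.shape[0], 0, image.shape[1])
    (y0, x0), (y1, x1) = nonzero.min(0), nonzero.max(0) + 1
    bbox = (y0, y1, x0, x1)                          # kept so predictions can be mapped back
    return image[y0:y1, x0:x1], mask[y0:y1, x0:x1], bbox
```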

Normalization: this paper also introduces a mask-guided z-score normalization for X-ray fluoroscopic images. This approach normalizes only the foreground regions defined by the segmentation mask, calculating the mean and standard deviation of pixel values within this region, and then applying normalization based on these statistics. By normalizing only the regions of interest (i.e., the areas defined by the segmentation mask), this method reduces potential interference and noise impact from the background on model training. Consequently, it allows the model to more easily learn the most critical visual features in the images, thereby improving the accuracy of the model’s understanding and analysis of the images.
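A minimal sketch of this mask-guided z-score normalization is given below, assuming the guiding mask is the valid-region map produced by the cropping step; the function name and the epsilon guard are our own additions.

```python
import numpy as np

def masked_zscore(image: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """Z-score normalization computed only over the region defined by the mask.

    `valid_mask` is assumed to be a boolean map of the informative pixels
    (e.g., the non-zero region kept by the cropping step); pixels outside the
    mask are left at zero so they do not bias the statistics.
    """
    fg = valid_mask.astype(bool)
    vals = image[fg]
    mean, std = vals.mean(), vals.std()
    out = np.zeros_like(image, dtype=np.float32)
    out[fg] = (vals - mean) / (std + 1e-8)
    return out
```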

Model architecture

Figure 4 shows the overall architecture diagram of our guidewire segmentation network. The proposed network employs an encoder-decoder architecture. The encoder comprises six stages with channels set to {32, 64, 128, 256, 512, 1024} in each stage. The first two stages employ two conventional convolutions with a kernel size of 3. For the latter four stages, we introduce the designed SDA module, placing it before the two 3×3 convolutional layers, to enhance the model’s focus on the target guidewire features, thereby improving the extraction of guidewire characteristics. The decoder consists of five stages with channels set to {512, 256, 128, 64, 32} in each stage. The SDA module is introduced before the convolution layers in the first three stages, while the last two stages utilize conventional 3×3 convolutions. The SDA module is selectively omitted in the initial encoder stages (32 and 64 channels) and final decoder stages (64 and 32 channels) to prioritize computational efficiency for clinical deployment: shallow encoder layers primarily extract low-level features (e.g., edges) with high spatial resolution, where attention mechanisms would introduce redundant parameters, while deeper decoder stages focus on reconstructing localized outputs where additional attention may compromise real-time inference speed in robotic systems.
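For reference, the stage layout described above can be written as a small configuration sketch; the variable names are illustrative and not taken from the released code.

```python
# Stage configuration inferred from the description above: channel widths per
# encoder/decoder stage and where an SDA block precedes the two 3x3 convolutions.
ENCODER_CHANNELS = [32, 64, 128, 256, 512, 1024]
DECODER_CHANNELS = [512, 256, 128, 64, 32]

# True where an SDA module is inserted before the stage's convolutions.
ENCODER_USE_SDA = [False, False, True, True, True, True]
DECODER_USE_SDA = [True, True, True, False, False]
```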

Figure 4 An overview of the proposed method. (A) The overall architecture of the network. It is an encoder-decoder architecture-based network that integrates our designed modules and exhibits advanced performance. The detailed diagram of the BGA module is presented in the “BGA module” section. (B) A detailed diagram of the designed SDA module, which enhances the model’s adaptability to the variable structure of guidewires. (C) The network incorporates a deep supervision strategy during the training phase, further enhancing the network’s feature learning capability and generalization ability. FGA-Net, focused gated attention network for clinical interventions; BGA, bilateral feature fusion; SDA, lightweight gated attention; GELU, Gaussian error linear unit; Conv, convolution; Norm, normalization.

Unlike the skip connection approach commonly used in U-Net and other network architectures, we employ a BGA module to better capture features of different sizes between the encoder and the decoder. The BGA module connects each stage of the network. Additionally, we apply a deep supervision approach by performing upsampling on the output of each stage of the decoder and then computing auxiliary loss functions with the ground-truth labels. Compared to previous segmentation networks, the architecture of the proposed network is more suitable for clinical interventional surgery scenarios and achieves superior segmentation performance on clinical interventional surgery datasets.

BGA module

In current research, the fusion of different types of features within an image is often achieved using the skip connection method in the U-Net model. This method is one of the simplest and most widely used feature fusion techniques. It effectively enhances the network’s ability to extract and utilize features, leading to improved performance. However, the challenge in the guidewire segmentation task for X-ray fluoroscopy images is how to better integrate features from images with low signal-to-noise ratios, interference from other linear structures, and a minimal proportion of guidewire occupancy. Simple skip connections are insufficient to fully exploit different types of features, resulting in suboptimal model performance. Therefore, it is necessary to explore a method that can more comprehensively integrate and utilize various features. Based on the characteristics of the guidewire segmentation task, this paper designs a feature extraction and fusion method tailored to these requirements.

First, the features from the two branches are each processed via two pathways. For low-level detail features, the first pathway employs a 3×3 depthwise separable convolution followed by a 1×1 convolution to enhance feature representation and integration. This approach reduces computational complexity while improving the model’s generalization ability. The second pathway uses a conventional 3×3 convolution to extract features. For high-level semantic features, the first pathway is similar to that for low-level features, using a 3×3 depthwise separable convolution and a 1×1 convolution for efficient feature extraction and fusion. The other pathway first employs a 3×3 convolution for feature extraction, followed by a 2×2 upsampling operation to match the feature map size with that of the low-level features. Finally, the outputs of the two pathways for each feature are combined through cross-multiplication and addition, followed by a 1×1 convolution for final feature fusion. This enhances the model’s feature representation capabilities, allowing it to capture features at different scales without significantly increasing computational burden.
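A minimal PyTorch sketch of this fusion scheme is given below. The exact pairing of the cross-multiplied pathways and the upsampling applied to the first high-level pathway are our assumptions for shape consistency; the released implementation may differ in these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BGASketch(nn.Module):
    """Sketch of the bilateral feature fusion described above.

    `low` is the encoder (detail) feature map; `high` is the deeper (semantic)
    feature map at half the spatial size."""
    def __init__(self, low_ch: int, high_ch: int, out_ch: int):
        super().__init__()
        # low-level pathway 1: 3x3 depthwise separable conv followed by 1x1 conv
        self.low_dw = nn.Sequential(
            nn.Conv2d(low_ch, low_ch, 3, padding=1, groups=low_ch),
            nn.Conv2d(low_ch, out_ch, 1))
        # low-level pathway 2: conventional 3x3 conv
        self.low_conv = nn.Conv2d(low_ch, out_ch, 3, padding=1)
        # high-level pathway 1: 3x3 depthwise separable conv followed by 1x1 conv
        self.high_dw = nn.Sequential(
            nn.Conv2d(high_ch, high_ch, 3, padding=1, groups=high_ch),
            nn.Conv2d(high_ch, out_ch, 1))
        # high-level pathway 2: 3x3 conv, then upsampled to the low-level size
        self.high_conv = nn.Conv2d(high_ch, out_ch, 3, padding=1)
        self.fuse = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        size = low.shape[-2:]
        low_a, low_b = self.low_dw(low), self.low_conv(low)
        # upsampling of the first high-level pathway is assumed so shapes match
        high_a = F.interpolate(self.high_dw(high), size=size,
                               mode="bilinear", align_corners=False)
        high_b = F.interpolate(self.high_conv(high), size=size,
                               mode="bilinear", align_corners=False)
        # cross-multiplication followed by addition, then a 1x1 fusion conv
        return self.fuse(low_a * high_b + high_a * low_b)
```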

Figure 5 illustrates the design details of the BGA module. This method provides a more refined and efficient feature enhancement and fusion approach for handling the complex X-ray fluoroscopy images of guidewires. It not only strengthens information exchange between features at different scales but also enhances the model’s ability to segment small and elongated objects like guidewires. Additionally, while maintaining computational efficiency, it significantly improves the accuracy and robustness of guidewire segmentation, making it suitable for addressing complex segmentation tasks that traditional methods cannot effectively manage.

Figure 5 Detailed diagram of the bilateral feature fusion module. This figure provides a detailed diagram of the bilateral feature fusion module, which integrates features from both the encoder and decoder branches to enhance the model’s ability to capture multi-scale information. DWConv, depthwise convolution; Conv, convolution.

SDA module

Attention mechanisms are widely applied in the field of image processing. Existing attention mechanisms, with the self-attention mechanism in Transformers as a representative, can effectively capture global dependencies within images but face significant computational and memory consumption issues when handling large-sized images. Another category combines spatial and channel attention mechanisms, focusing simultaneously on important spatial regions and feature channels in the image, but this also increases the computational load and the number of model parameters. For the guidewire segmentation task, it is essential not only to focus the model on the critical features of the image but also to ensure the feasibility of future model deployment in clinical applications and surgical robots. Therefore, this paper designs a lightweight attention module aimed at enhancing model performance without significantly increasing the computational burden. Figure 6 shows the feature map visualizations of the outputs from the same layer, with and without the SDA module.

Figure 6 Comparative visualization of model attention with vs. without SDA module via Grad-CAM++. The first row shows the original images, while the second and third rows show Grad-CAM++ visualizations with and without the SDA module, respectively. SDA, lightweight gated attention.

An SDA mechanism is proposed in this study, comprising two primary components. The first part is the feature extraction layer, which utilizes a spatial shifting operation (32). First, we set the shift direction $d$ and step size $s$ to determine the direction and distance of feature movement in space. Then, we combine the shift direction and step size to define a spatial shifting step $\mathrm{step} = (d_h \times s_h, d_w \times s_w)$. Based on the desired local coverage of the corresponding convolution kernel size, we construct a set of spatial shifting steps $S = \{\mathrm{step}_i \mid i = 1, \ldots, n\}$. The input features are divided into $n$ groups along the channel dimension (where $n$ equals the size of $S$). Each group of features is spatially shifted according to its corresponding shifting step. Finally, new features are generated, where each pixel feature in the new feature map contains local feature information from the surrounding area. This method does not require additional floating-point operations or parameters.

First, feature extraction is performed through a 1×1 convolution, followed by a spatial-shifting operation to aggregate the local features of the image and capture information from surrounding pixels. This simple and effective method extends the feature processing capability of the 1×1 convolution layer, enhancing feature representation, expanding the receptive field, and allowing flexible adjustment of shift direction and step size to accommodate different aggregation needs. After extracting and processing the features, we use instance normalization and the Gaussian error linear unit (GELU) activation function to further optimize model performance and feature representation, thereby improving the model’s stability and efficiency.
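The following sketch illustrates this first part of the SDA module (1×1 convolution, spatial shift, instance normalization, and GELU). The shift set covering a 3×3 neighbourhood and the use of a circular shift are simplifying assumptions made for illustration.

```python
import torch
import torch.nn as nn

def spatial_shift(x: torch.Tensor, steps) -> torch.Tensor:
    """Split the channels into len(steps) groups and shift each group by its
    (dh*sh, dw*sw) step. torch.roll (a circular shift) is used here for
    brevity; the operation adds no parameters or floating-point operations."""
    groups = torch.chunk(x, len(steps), dim=1)
    return torch.cat([torch.roll(g, shifts=s, dims=(2, 3))
                      for g, s in zip(groups, steps)], dim=1)

class ShiftBlock(nn.Module):
    """First part of the SDA module: 1x1 conv -> spatial shift -> InstanceNorm -> GELU."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm = nn.InstanceNorm2d(channels)
        self.act = nn.GELU()
        # steps covering a 3x3 neighbourhood (centre plus eight neighbours) -- an assumption
        self.steps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = spatial_shift(self.proj(x), self.steps)
        return self.act(self.norm(x))
```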

In the second part, we introduce deformable convolutions as the attention gate (33,34). Deformable convolutions incorporate a dynamic offset Δp and a modulation parameter Δm, enabling the convolution to adjust the offset of sampling positions and modulate the amplitude of features at different spatial locations. This allows the convolution kernel to focus on relevant areas of the image, even if these regions are located outside the reference point. The formulation of the deformable convolution is as follows:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x\left(p + p_k + \Delta p_k\right) \cdot \Delta m_k \tag{1}$$

where $K \in \mathbb{N}$ denotes the total number of sampling points in the convolution kernel; we set $K = 9$ for a 3×3 kernel. For each sampling position $k \in \{1, \ldots, K\}$, the aggregation of input features is determined by an associated weight $w_k$ and a predefined offset $p_k$, which collectively define the feature extraction process. After the deformable convolution effectively and flexibly captures information from the relevant regions of the input features, we apply layer normalization and the sigmoid activation function to the processed features. This further enhances the network’s feature processing capability and improves the overall model performance. The following Eq. [2] represents the attention gate, which controls the flow of information from the feature map to the next layer. Each element in the gate tensor has a value between 0 and 1, representing the parts of the feature map that need to be emphasized or ignored.

$$A = \sigma\left(\mathrm{LayerNorm}\left(y(p)\right)\right) \tag{2}$$

Finally, the SDA mechanism is seamlessly incorporated into the proposed network framework. The original output tensor is pointwise-multiplied with the output tensor from the attention layer to obtain the final output, as shown in Eq. [3].

$$X_{out} = X \odot A \tag{3}$$
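A possible PyTorch realization of this gating mechanism, built on torchvision's modulated deformable convolution, is sketched below; the offset/modulation prediction layers and the channel-last LayerNorm are assumptions made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d

class DeformableGate(nn.Module):
    """Second part of the SDA module: a modulated deformable convolution whose
    output, after LayerNorm and a sigmoid, gates the feature map (Eqs. [2]-[3])."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        k = kernel_size
        # offsets (2 per sampling point) and modulation scalars (1 per point)
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.modulation = nn.Conv2d(channels, k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset(x)
        mask = torch.sigmoid(self.modulation(x))        # Δm_k in Eq. [1]
        y = self.deform(x, offset, mask)
        # LayerNorm over the channel dimension, then sigmoid -> attention map A
        y = y.permute(0, 2, 3, 1)
        y = F.layer_norm(y, y.shape[-1:])
        gate = torch.sigmoid(y).permute(0, 3, 1, 2)
        return x * gate                                  # X_out = X ⊙ A
```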

Loss

In the task of guidewire segmentation in X-ray images, we employ a combination of the binary cross-entropy (BCE) loss function, centerline Dice (clDice), and Dice hybrid loss function to train our network. As this study involves a binary classification task, we use the BCE loss function to minimize the difference between the predicted probabilities and the ground truth labels, thereby focusing on the classification accuracy of each pixel in the image sample. The BCE loss function is expressed as follows:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right] \tag{4}$$

where $N$ represents the total number of pixels in a single image, $i$ denotes the index of the current pixel being computed, $y_i$ is the ground truth label of the $i$-th pixel, and $p_i$ is the predicted probability of the $i$-th pixel.

The Dice loss function is widely used in medical image segmentation tasks. It is based on the Dice coefficient, which directly reflects the similarity between the segmentation results and the ground truth labels. This allows the network to more directly relate to the ultimate goal of the segmentation task. However, in guidewire segmentation tasks, there is a class imbalance problem, as the guidewire occupies only a small portion of the image. Therefore, we introduce the Soft Dice loss function, which directly optimizes the overlap between the predicted results and the ground truth, enhancing sensitivity to small targets and making it more suitable for small object tasks in medical image segmentation. The softDice loss function is expressed as follows:

$$L_{Dice} = 1 - \frac{2 \times \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2} \tag{5}$$

The definitions of the relevant parameters remain the same as those in $L_{BCE}$, with $p_i$ denoting the predicted probability and $g_i$ the ground truth label of the $i$-th pixel.

Due to the elongated structure of the guidewire and its small proportion in the image, the continuity of the guidewire morphology is particularly important. This is especially challenging in our clinical dataset, where feature extraction is more difficult. To achieve more accurate segmentation continuity of the guidewire morphology and to minimize issues of discontinuity and breakage in the segmentation results, we employ the clDice. The clDice is a topology-preserving similarity metric specifically designed for the segmentation of tubular structures (35). The soft-clDice loss function is expressed as follows:

$$\mathrm{clDice}(V_P, V_L) = 2 \times \frac{T_{prec}(S_P, V_L) \times T_{sens}(S_L, V_P)}{T_{prec}(S_P, V_L) + T_{sens}(S_L, V_P)} \tag{6}$$

where $V_P$ and $V_L$ represent the predicted segmentation mask and the ground truth mask, respectively. $S_P$ and $S_L$ are the skeletons extracted from $V_P$ and $V_L$, respectively. $T_{prec}(S_P, V_L)$ denotes the fraction of the predicted skeleton $S_P$ that lies within $V_L$, representing topological precision, while $T_{sens}(S_L, V_P)$ denotes the fraction of the ground truth skeleton $S_L$ that lies within $V_P$, representing topological sensitivity.

To achieve precise guidewire morphology segmentation while maintaining the topological structure of the guidewire, we introduce a loss function based on the combination of softclDice and softDice, expressed as follows:

$$L_c = (1 - \alpha)\left(1 - \mathrm{softDice}\right) + \alpha\left(1 - \mathrm{softclDice}\right) \tag{7}$$

In this study, due to the elongated morphology of the guidewire and its small target proportion, there exists an extreme class imbalance problem. To ensure the pixel-level classification performance and region-level segmentation effectiveness of the model, we use a hybrid loss function combining BCE, softclDice, and softDice. Additionally, deep supervision is employed to calculate the loss functions at different stages of the network, accelerating the model’s convergence speed and enhancing segmentation performance. The hybrid loss function is expressed as follows:

$$l_i = L_{BCE}(y, \hat{y}) + L_c(y, \hat{y}) \tag{8}$$

$$L = \sum_{i=0}^{5} \lambda_i \times l_i \tag{9}$$

where $i$ indexes the stages in deep supervision, with $i \in \{0, 1, 2, 3, 4, 5\}$; $l_i$ denotes the loss value at the $i$-th stage, and $\lambda_i$ is the weight assigned to the loss at stage $i$. Specifically, $y$ represents the ground truth segmentation mask, while $\hat{y}$ denotes the predicted segmentation mask produced by the model.
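The hybrid loss can be sketched as follows. The soft skeletonization follows the reference clDice scheme; the number of skeletonization iterations, $\alpha$, and the stage weights $\lambda_i$ are hyperparameters not specified in the text, and the deeply supervised stage outputs are assumed to have been upsampled to the label resolution beforehand.

```python
import torch
import torch.nn.functional as F

def _soft_erode(img):  return -F.max_pool2d(-img, 3, 1, 1)
def _soft_dilate(img): return F.max_pool2d(img, 3, 1, 1)
def _soft_open(img):   return _soft_dilate(_soft_erode(img))

def soft_skeleton(img, iters: int = 10):
    """Differentiable skeletonization (2D clDice reference scheme); `iters` is assumed."""
    skel = F.relu(img - _soft_open(img))
    for _ in range(iters):
        img = _soft_erode(img)
        delta = F.relu(img - _soft_open(img))
        skel = skel + F.relu(delta - skel * delta)
    return skel

def soft_dice_loss(pred, target, eps: float = 1e-6):
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / ((pred ** 2).sum() + (target ** 2).sum() + eps)

def soft_cldice_loss(pred, target, eps: float = 1e-6):
    skel_p, skel_t = soft_skeleton(pred), soft_skeleton(target)
    tprec = ((skel_p * target).sum() + eps) / (skel_p.sum() + eps)   # topological precision
    tsens = ((skel_t * pred).sum() + eps) / (skel_t.sum() + eps)     # topological sensitivity
    return 1 - 2 * tprec * tsens / (tprec + tsens)

def hybrid_loss(stage_logits, target, alpha: float = 0.5, stage_weights=None):
    """L = sum_i lambda_i * [BCE + (1-alpha)(1-softDice) + alpha(1-softclDice)]
    over the deeply supervised decoder outputs (Eqs. [7]-[9])."""
    if stage_weights is None:
        stage_weights = [1.0] * len(stage_logits)
    total = 0.0
    for logits, lam in zip(stage_logits, stage_weights):
        prob = torch.sigmoid(logits)
        bce = F.binary_cross_entropy_with_logits(logits, target)
        total = total + lam * (bce
                               + (1 - alpha) * soft_dice_loss(prob, target)
                               + alpha * soft_cldice_loss(prob, target))
    return total
```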

Experimental framework

Evaluation metric

In this study, the Macro-F1 score is employed as an evaluation metric. It was selected to ensure a fair assessment of the model’s performance across the different classes; in particular, when the number of positive (guidewire) pixels is small, the Macro-F1 score provides a more comprehensive and balanced performance evaluation.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

$$\mathrm{Macro\text{-}F1} = \frac{1}{N}\sum_{i=1}^{N} F1_i \tag{13}$$

In Eqs. [10] and [11], $TP$ denotes the number of pixels correctly predicted as the guidewire category, $FP$ denotes the number of background pixels incorrectly predicted as the guidewire category, and $FN$ denotes the number of guidewire pixels incorrectly predicted as background. In Eq. [13], $N$ represents the total number of classes, $i$ denotes the index of a class, and $F1_i$ represents the F1 score of the $i$-th class.

Mean intersection over union (MIoU) is a commonly used metric in semantic segmentation, measuring the similarity between the predicted results and the ground truth labels by computing the intersection over union (IoU) for each class and then averaging across all classes. Additionally, the object IoU (OIoU) metric is employed to further assess performance, focusing on object-level segmentation accuracy. MIoU considers the classification effectiveness for each class, while OIoU ensures the consistency of object boundaries, which is crucial for precise guidewire segmentation. By incorporating both MIoU and OIoU, we can comprehensively evaluate the performance of our guidewire segmentation model, providing a balanced and thorough analysis of its capabilities.

$$IoU = \frac{TP}{TP + FP + FN} \tag{14}$$

$$MIoU = \frac{1}{N}\sum_{i=1}^{N} IoU_i \tag{15}$$

$$OIoU = \sum_{i=1}^{N} IoU_i \tag{16}$$

In this context, $TP$, $FP$, and $FN$ have the same definitions as those described for the F1 score. $N$ represents the total number of classes, and $IoU_i$ is the IoU value for the $i$-th class.

In medical image segmentation, the Dice coefficient is one of the most widely used evaluation metrics. We employ the Dice coefficient as one of the evaluation metrics for guidewire segmentation results, effectively measuring the degree of overlap between the predicted results and the ground truth labels, thereby calculating the similarity between the two sets.

$$Dice = \frac{2\left|X \cap Y\right|}{\left|X\right| + \left|Y\right|} \tag{17}$$

where $X$ denotes the set of predicted results and $Y$ represents the set of ground truth labels. The Dice coefficient ranges from 0 to 1. Additionally, we utilize the area under the curve (AUC) to evaluate the accuracy of pixel-wise classification in the segmentation model.
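For clarity, the pixel-level metrics can be computed from a binary prediction and its ground truth mask as in the following sketch; the Macro-F1 reported in the paper then averages the per-class F1 scores over the guidewire and background classes. Function and variable names are illustrative.

```python
import numpy as np

def binary_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Pixel-level metrics for a binary guidewire mask (values in {0, 1})."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)                 # Eq. [14] for the guidewire class
    dice = 2 * tp / (2 * tp + fp + fn + eps)        # Eq. [17]
    return {"precision": precision, "recall": recall,
            "F1": f1, "IoU": iou, "Dice": dice}
```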

Implementation details

To evaluate the performance of the model, the proposed TCGSeg dataset, comprising 2,839 images, was utilized. During the training phase on the TCGSeg dataset, five interventional patient sequences (388 frames) were randomly selected from the training data as the validation set to verify the model’s performance. Our network framework is implemented based on PyTorch (version 2.0.0), and all experiments were conducted on a single NVIDIA RTX 4090 GPU. The stochastic gradient descent (SGD) optimizer was employed with an initial learning rate of 1×10⁻⁸ and a weight decay of 0.00005. A cosine annealing strategy was used to adjust the learning rate, specifically leveraging the CosineAnnealingLR scheduler in PyTorch with a maximum iteration count set to 50. The model was trained for a total of 180 epochs with a batch size of 8.
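A minimal sketch of this training configuration is shown below; `model` and `train_loader` are placeholders, the model is assumed to return the list of deeply supervised stage outputs, and `hybrid_loss` refers to the loss sketch given earlier.

```python
import torch

# Optimizer and scheduler as described above (SGD, cosine annealing with T_max=50).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-8, weight_decay=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(180):                        # 180 epochs, batch size 8 in the loader
    for images, labels in train_loader:
        optimizer.zero_grad()
        stage_outputs = model(images)           # assumed: list of deep-supervision logits
        loss = hybrid_loss(stage_outputs, labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```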


Results

Quantitative analysis

To demonstrate the advancements and superiority of the proposed method in each metric, as well as its potential for real-world interventional clinical applications, we conducted comparisons with a series of widely-used and well-performing image segmentation networks including U-Net (8), Attention U-Net (17), U-Net++ (18), DeepLabV3+ (36), and CENet (37). The results indicate that the proposed method achieved superior performance on the TCGSeg dataset across all evaluated metrics.

Table 1 presents a comparison of our proposed method against other approaches. As shown in Table 1, our method exhibits superior performance in guidewire segmentation while maintaining favorable computational efficiency and inference speed, achieving a comprehensive optimization of model performance. The traditional U-Net model shows inferior performance in both segmentation and classification metrics compared to the other networks in the experiment. When handling the guidewire in fluoroscopic images, U-Net has lower sensitivity: its relatively simple structure, with direct skip connections and fewer network parameters, reduces computational complexity but fails to capture guidewire features effectively. Consequently, FGA-Net outperforms it in classification and segmentation accuracy, with improvements of 0.09% in AUC, 2.26% in Macro-F1, and 4.49% in Dice, respectively.

Table 1

Quantitative results of comparative experiments on the TCGSeg dataset

Model | GFLOPs ↓ | Time (ms) ↓ | AUC (%) ↑ | Macro-F1 (%) ↑ | Dice (%) ↑ | MIoU (%) ↑ | OIoU (%) ↑
U-Net | 54.98 | 4.06 | 91.55 | 83.37 | 66.80 | 50.76 | 50.68
U-Net++ | 139.61 | 18.68 | 89.75 | 84.01 | 68.07 | 52.35 | 51.54
Attention U-Net | 266.53 | 17.42 | 87.90 | 83.80 | 67.64 | 52.06 | 50.47
CENet | 35.60 | 4.02 | 88.32 | 84.28 | 68.60 | 53.08 | 51.95
DeepLabV3+ | 83.32 | 21.64 | 85.42 | 83.30 | 66.64 | 51.32 | 49.47
Ours | 75.36 | 17.18 | 91.61 | 85.63 | 71.29 | 56.59 | 55.30

AUC, area under the curve; GFLOPs, giga floating point operations per second; MIoU, mean intersection over union; OIoU, object intersection over union; TCGSeg, TACE guidewire segmentation dataset; TACE, transcatheter arterial chemoembolization.

Compared to U-Net++, which achieves better multi-scale information fusion through nested and optimized skip connections, FGA-Net demonstrates superior performance in processing elongated structures, with improvements of 1.89% in AUC, 1.62% in Macro-F1, and 3.22% in Dice. Attention U-Net introduces an attention mechanism into the U-Net framework, effectively improving the network’s feature selection capability. However, due to the simple, slender structure of the guidewire in fluoroscopic images and the potential for confusion with other structures, general attention mechanisms still show limitations in extracting guidewire features from such images. Our method improves segmentation and classification accuracy by 3.74% in AUC, 1.83% in Macro-F1, and 3.65% in Dice, respectively.

CENet incorporates a context encoding module to better capture and aggregate global features in images. However, it tends to lose some critical details of the guidewire when processing images with substantial intricate information. In comparison, FGA-Net achieves superior classification and segmentation accuracy, with improvements of 3.32% in AUC, 1.35% in Macro-F1, and 2.69% in Dice, respectively. DeepLabV3+ introduces atrous convolution and the atrous spatial pyramid pooling module, enabling better capture of multi-scale contextual information and enhancing the network’s adaptability to different scenes. However, in terms of detail processing, particularly for fine structures such as guidewires, FGA-Net significantly outperforms it, improving by 6.22% in AUC, 2.3% in Macro-F1, and 4.65% in Dice, respectively.

Figure 7 shows the performance metrics of our network compared to other networks on the TCGSeg dataset. Compared to widely used medical image segmentation networks (such as U-Net, Attention U-Net, U-Net++, etc.), our proposed method achieves state-of-the-art results across various metrics. In summary, our approach is better at capturing the features of elongated structures like guidewires, resulting in more accurate and superior segmentation outcomes.

Figure 7 The performance metrics of different network models on the TCGSeg dataset. This figure presents a comparative analysis of various performance metrics for different network models evaluated on the TCGSeg dataset. The metrics include precision, AUC, Macro-F1, Dice coefficient, MIoU, and OIoU. Each model’s performance is represented by a distinct color bar. AUC, area under the curve; MIoU, mean intersection over union; OIoU, object intersection over union; TCGSeg, TACE guidewire segmentation dataset; TACE, transcatheter arterial chemoembolization.

Qualitative analysis

Figure 8 shows a comparison of the segmentation results of various networks, highlighting the advantages of our network in segmenting guidewire structures. From left to right, the third to seventh columns display the segmentation accuracy results of five different networks on the TCGSeg dataset. Compared to other networks, our model can better adapt to and perceive the slender structure of guidewires, significantly reducing instances of segmentation results exhibiting breaks, incorrect predictions, or missed predictions. Our network is tailored to the characteristics of interventional guidewire images, including small targets, slender structures, and low signal-to-noise ratios. It better extracts the features of guidewires from real clinical images, enabling the model to more effectively learn and perceive the structure of guidewires, thus achieving superior performance in guidewire segmentation.

Figure 8 Results of qualitative analysis. The results indicate that our network outperforms other models in terms of segmentation accuracy and topological continuity. The yellow arrows represent areas with segmentation breaks, the red arrows represent areas with missing or incorrect predictions, and the green arrows represent areas with successful segmentation and accurate prediction. This figure presents a qualitative comparison of segmentation performance across different models, including U-Net, Attention U-Net, CENet, DeepLabV3+, U-Net++, and the proposed network (our network). The analysis focuses on segmentation accuracy and topological continuity, with specific attention to areas where each model excels or falls short.

Discussion

Ablation study

In this study, ablation experiments were conducted to demonstrate the effectiveness of the various modules in the model. The baseline network employed in our research is a basic U-shaped architecture, where each encoder and decoder stage includes two convolution operations with a kernel size of 3. The number of channels at each stage is set to {32, 64, 128, 256, 512, 1024}. Table 2 presents the ablation experiments for the BGA module and the SDA module. In the experiments involving the BGA module, it replaced all simple skip connections, with the features from the encoder and the corresponding decoder stages used as its inputs. Thanks to the BGA module’s fine-grained fusion of features across different scales via bilateral guidance, improvements over the baseline were observed in all metrics, with increases of 1.72% in MIoU, 1.08% in Macro-F1, and 2.18% in Dice.

Table 2

Ablation experiments on the TCGSeg dataset

Model^a | Loss^b | AUC (%) | Macro-F1 (%) | Dice (%) | MIoU (%) | OIoU (%)
Baseline | BSC | 88.27 | 83.14 | 66.32 | 51.13 | 50.23
Baseline + BGA | BSC | 92.01 | 84.22 | 68.50 | 52.85 | 52.32
Baseline + SDA | BSC | 90.63 | 85.14 | 70.32 | 55.64 | 54.05
Our network | BD | 92.78 | 84.03 | 68.11 | 52.47 | 51.78
Our network | BSC | 91.64 | 85.63 | 71.29 | 56.59 | 55.30

a, ablation of two modules; b, ablation of loss functions. AUC, area under the curve; BD, a combination of binary cross-entropy and Dice loss; BGA, bilateral feature fusion; BSC, a combination of binary cross-entropy and Dice loss with clDice; MIoU, mean intersection over union; OIoU, object intersection over union; SDA, lightweight gated attention; TCGSeg, TACE guidewire segmentation dataset; TACE, transcatheter arterial chemoembolization.

In the experiments involving the SDA module, it was added before the convolution layers of the last four encoder stages and the first three decoder stages. The SDA module effectively focuses on significant features in the image. Compared to the baseline, it achieved superior performance, with increases of 4.51% in MIoU, 2% in Macro-F1, and 4% in Dice, among other metrics.

Additionally, we conducted ablation experiments on the loss function, comparing a combination of BCE and Dice loss (BD) with a combination of BCE, Dice, and clDice loss (BSC), each using its respective optimal parameter settings. The network architecture was likewise employed with its best parameter configuration. As shown in Table 2, the combined loss function of BCE, Dice, and clDice achieved the best performance across most metrics on the TCGSeg dataset, with the exception of AUC, which decreased by a relative 1.14%.

Analysis of inference time and giga floating point operations per second (GFLOPs)

Our method is designed to be applied in clinical and surgical robot settings. Thus, we have analyzed the computational performance and average inference time of each model. As shown in Table 2, our model demonstrates good performance in terms of computational performance and inference time, with GFLOPs and average inference time being 75.36 and 17.18 ms, respectively. The CENet, utilizing a pre-trained backbone network and depthwise separable convolutions, shows the best performance in terms of computational efficiency and inference time, with GFLOPs and average inference time of 35.6 and 4.02 ms, respectively. The traditional U-Net, with a simple architecture and direct skip connections, also performs well, with GFLOPs and average inference time being 54.98 and 4.06 ms, respectively. DeepLabv3+ introduces atrous convolutions to enhance segmentation performance without increasing computational load, resulting in GFLOPs and average inference time of 83.32 and 21.64 ms, respectively. U-Net++ has GFLOPs and average inference time of 139.61 and 18.68 ms, respectively, due to its dense skip connections that lead to computational redundancy and longer inference time. Attention U-Net exhibits the poorest performance, with GFLOPs and average inference time being 266.53 and 17.42 ms, respectively, attributed to the high computational cost introduced by the attention mechanism. This analysis of computational performance and average inference time provides a theoretical foundation for the portability and feasibility of our model in future clinical and surgical robot applications.

Our model achieves a satisfactory inference time, meeting the fundamental requirements for clinical applications. However, there remains potential for further optimization in computational efficiency. Compared to the lightweight CENet architecture, our model achieves superior segmentation performance but comes with a higher GFLOPs overhead. While deeper architectures such as U-Net++ and Attention U-Net theoretically offer strong performance, their computational demands remain prohibitive for resource-constrained systems. Given the current clinical practice, we prioritize maintaining segmentation accuracy at a clinically viable level to ensure that physicians receive the most precise visual feedback. At the same time, we strive to strike a balance between computational efficiency and accuracy.

Analysis of limitations

Although our method has demonstrated excellent performance in guidewire segmentation for clinical TACE procedures, several limitations remain that warrant further investigation. First, although the dataset used in this study is relatively comprehensive within its scope, its origin from a single hospital’s clinical environment may restrict the model’s generalizability across other settings. Future work will therefore focus on expanding the dataset in scale and diversity to encompass a broader range of clinical scenarios, thereby enhancing the model’s adaptability and robustness. Second, while our bilateral guidance and SDA modules have significantly improved segmentation accuracy, they may also introduce additional complexity when integrated into real-time surgical workflows. To ensure efficient application during actual procedures, further optimization and adjustments of the framework are needed to streamline deployment and enhance usability. Third, although the model has demonstrated significant performance advantages in guidewire segmentation, its computational efficiency still needs improvement. While it meets the basic real-time requirements for clinical applications, the model’s GFLOPs are higher than those of lightweight architectures such as CENet, indicating there is room for further optimization. Looking ahead to future deployments in clinical and surgical robotics, we plan to explore hardware-specific optimizations, including TensorRT acceleration, INT8 quantization, and the adoption of edge computing. These efforts aim to reduce the current latency of 17.18 ms and enhance overall computational efficiency, ensuring that the stringent real-time demands of complex surgical scenarios are met, all while preserving accuracy.


Conclusions

This paper proposes a guidewire delivery dataset for interventional clinical procedures and a guidewire segmentation network, aiming to achieve precise segmentation of guidewires in interventional clinical images, thereby providing visual feedback for interventional doctors and robotic surgery systems. Our method has demonstrated state-of-the-art performance in comparative experiments. Specifically, our BGA module employs a bilateral guidance strategy to facilitate feature fusion and interaction between high-level semantic features and low-level features. The SDA module utilizes a lightweight context feature extraction method and deformable convolutions as a gated attention mechanism, allowing the model to effectively focus on key regions of the guidewires in the images. Experimental results demonstrate that compared to other segmentation networks, our method exhibits superior accuracy and continuity in the task of guidewire segmentation in interventional clinical images. Additionally, the computational efficiency of our network framework is notable, achieving a lightweight design with enhanced portability.

In future work, we will further develop the guidewire segmentation framework, focusing on continuing to improve segmentation performance while reducing computational complexity and the number of model parameters. We also aim to increase the model’s inference speed so that the framework can be deployed in real clinical environments and on interventional surgical robots. Such advances could alleviate visual fatigue for interventional physicians, improve the efficiency of guidewire delivery in interventional surgery, and ultimately improve patient treatment outcomes.


Acknowledgments

None.


Footnote

Funding: This work was financially supported by the National Key Research and Development Program of China (No. 2023YFB3810000), Shenzhen Science and Technology Program (No. JCYJ20220530143217037), and the National Natural Science Foundation of China (Nos. 81925019 and U22A20333).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2024-2926/coif). G.L. serves as an unpaid editorial board member of Quantitative Imaging in Medicine and Surgery. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Affiliated Hospital of Guilin Medical University (No. 2022YJSLL-46), and individual consent for this retrospective analysis was waived.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Raoul JL, Forner A, Bolondi L, Cheung TT, Kloeckner R, de Baere T. Updated use of TACE for hepatocellular carcinoma treatment: How and when to use it based on clinical evidence. Cancer Treat Rev 2019;72:28-36. [Crossref] [PubMed]
  2. Sieghart W, Hucke F, Peck-Radosavljevic M. Transarterial chemoembolization: modalities, indication, and patient selection. J Hepatol 2015;62:1187-95. [Crossref] [PubMed]
  3. Zhao Y, Mei Z, Luo X, Mao J, Zhao Q, Liu G, Wu D. Remote vascular interventional surgery robotics: a literature review. Quant Imaging Med Surg 2022;12:2552-74. [Crossref] [PubMed]
  4. Bismuth V, Vaillant R, Talbot H, Najman L. Curvilinear structure enhancement with the polygonal path image--application to guide-wire segmentation in X-ray fluoroscopy. Med Image Comput Comput Assist Interv 2012;15:9-16.
  5. Heibel H, Glocker B, Groher M, Pfister M, Navab N. Interventional tool tracking using discrete optimization. IEEE Trans Med Imaging 2013;32:544-55. [Crossref] [PubMed]
  6. Vandini A, Glocker B, Hamady M, Yang GZ. Robust guidewire tracking under large deformations combining segment-like features (SEGlets). Med Image Anal 2017;38:150-64. [Crossref] [PubMed]
  7. Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021;18:203-11. [Crossref] [PubMed]
  8. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer; 2015:234-41.
  9. Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou Y. TransUNet: transformers make strong encoders for medical image segmentation. arXiv preprint 2021. arXiv:2102.04306.
  10. Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European Conference on Computer Vision; 2022. Springer; 2022:205-18.
  11. Ambrosini P, Ruijters D, Niessen WJ, Moelker A, van Walsum T. Fully automatic and real-time catheter segmentation in X-ray fluoroscopy. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2017: 20th International Conference; 2017 Sep 11-13; Quebec City, QC, Canada. Springer; 2017:577-85.
  12. Vlontzos A, Mikolajczyk K. Deep segmentation and registration in X-ray angiography video. arXiv preprint 2018. arXiv:1805.06406.
  13. Zhou YJ, Xie XL, Zhou XH, Liu SQ, Bian GB, Hou ZG. Pyramid attention recurrent networks for real-time guidewire segmentation and tracking in intraoperative X-ray fluoroscopy. Comput Med Imaging Graph 2020;83:101734. [Crossref] [PubMed]
  14. Zhou Y, Xie X, Zhou X, Liu S, Bian G, Hou Z. A real-time multifunctional framework for guidewire morphological and positional analysis in interventional X-ray fluoroscopy. IEEE Transactions on Cognitive and Developmental Systems 2020;13:657-67.
  15. Du W, Yi G, Omisore OM, Duan W, Chen X, Akinyemi T, Liu J, Lee BG, Wang L. Guidewire endpoint detection based on pixel-adjacent relation during robot-assisted intravascular catheterization: in vivo mammalian models. Adv Intell Syst 2024;6:2300687.
  16. Ghosh R, Wong K, Zhang YJ, Britz GW, Wong STC. Automated catheter segmentation and tip detection in cerebral angiography with topology-aware geometric deep learning. J Neurointerv Surg 2024;16:290-5. [Crossref] [PubMed]
  17. Oktay O, Schlemper J, Le Folgoc L, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B, Glocker B, Rueckert D. Attention U-Net: Learning where to look for the pancreas. arXiv preprint 2018. arXiv:1804.03999.
  18. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
  19. Gao Y, Zhou M, Metaxas DN. UTNet: a hybrid transformer architecture for medical image segmentation. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III. Springer; 2021:61-71.
  20. Verma R, Kumar N, Patil A, Kurian NC, Rane S, Graham S, et al. MoNuSAC2020: A Multi-Organ Nuclei Segmentation and Classification Challenge. IEEE Trans Med Imaging 2021;40:3413-23. [Crossref] [PubMed]
  21. Zunair H, Ben Hamza A. Sharp U-Net: Depthwise convolutional network for biomedical image segmentation. Comput Biol Med 2021;136:104699. [Crossref] [PubMed]
  22. Yang X, Li Z, Guo Y, Zhou D. DCU-net: a deformable convolutional neural network based on cascade U-net for retinal vessel segmentation. Multimed Tools Appl 2022;81:15593-607.
  23. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4-9; Long Beach, CA, USA: Curran Associates Inc.; 2017:6000-10.
  24. Dosovitskiy A, Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint 2020. arXiv:2010.11929.
  25. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z. Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021 Oct 11-17; Montreal, QC, Canada. Piscataway, NJ: IEEE; 2021:9992-10002.
  26. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Lu T, Luo P, Shao L. Pyramid Vision Transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision; 2021 Oct 11-17; Montreal, QC, Canada. Piscataway, NJ: IEEE; 2021:548-58.
  27. Hu J, Shen L, Sun G. Squeeze-and-Excitation Networks. In: Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2018 Jun 18-23; Salt Lake City, UT, USA. Piscataway, NJ: IEEE; 2018:7132-41.
  28. Jaderberg M, Simonyan K, Zisserman A, Kavukcuoglu K. Spatial transformer networks. In: Proceedings of the 29th International Conference on Neural Information Processing Systems; 2015 Dec 7-12; Montreal, Canada. Cambridge, MA: MIT Press; 2015:2017-25.
  29. Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the 15th European Conference on Computer Vision; 2018 Sep 8-14; Munich, Germany. Cham, Switzerland: Springer; 2018:3-19.
  30. Yang Z, Zhu L, Wu Y, Yang Y. Gated channel transformation for visual recognition. In: Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 13-19; Seattle, WA, USA. Piscataway, NJ: IEEE; 2020:11791-800.
  31. Li X, Zhao H, Han L, Tong Y, Tan S, Yang K. Gated fully fusion for semantic segmentation. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence; 2020 Feb 7-12; New York, NY, USA. Palo Alto, CA: AAAI Press; 2020:11418-25.
  32. Wu G, Jiang J, Jiang K, Liu X. Fully 1×1 convolutional network for lightweight image super-resolution. arXiv preprint 2023. arXiv:2307.16140.
  33. Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. Deformable convolutional networks. In: Proceedings of the 2017 IEEE International Conference on Computer Vision; 2017 Oct 22-29; Venice, Italy. Piscataway, NJ: IEEE; 2017:764-73.
  34. Zhu X, Hu H, Lin S, Dai J. Deformable convnets v2: More deformable, better results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 16-20; Long Beach, CA, USA. IEEE/CVF; 2019:9308-16.
  35. Shit S, Paetzold JC, Sekuboyina A, Ezhov I, Unger A, Zhylka A, Pluim JPW, Bauer U, Menze BH. ClDice - A novel topology-preserving loss function for tubular structure segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 19-25; Nashville, TN, USA. IEEE/CVF; 2021:16560-9.
  36. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the European Conference on Computer Vision; 2018 Sep 8-14; Munich, Germany. Springer; 2018:801-18.
  37. Gu Z, Cheng J, Fu H, Zhou K, Hao H, Zhao Y, Zhang T, Gao S, Liu J. CE-Net: Context Encoder Network for 2D Medical Image Segmentation. IEEE Trans Med Imaging 2019;38:2281-92. [Crossref] [PubMed]
Cite this article as: Wang H, Mei Z, Wang K, Mao J, Wang L, Liu G, Zhao Y. Lightweight attention network for guidewire segmentation and localization in clinical fluoroscopic images of vascular interventional surgery. Quant Imaging Med Surg 2025;15(5):4689-4707. doi: 10.21037/qims-2024-2926