Cross-domain dynamic routing decoders for multi-domain generalization in ultrasound imaging
Introduction
In the current era of rapid artificial intelligence (AI) advancement, deep learning (DL) technology has achieved remarkable progress in medical imaging analysis, particularly demonstrating powerful diagnostic potential in cardiology (1), oncology (2), and neuroimaging (3). DL-based algorithms have shown exceptional capabilities in automated image interpretation and pathology detection, substantially improving clinical efficiency and diagnostic precision.
However, despite their impressive performance, most existing DL models rely on the assumption that the training and testing data are identically distributed. In real-world medical scenarios, this assumption rarely holds due to domain shift—the discrepancy between data distributions from different healthcare institutions, imaging devices, or acquisition protocols (4,5). Such heterogeneity is especially pronounced in ultrasound (US) imaging, where image quality can vary significantly owing to operator skill, patient anatomy, and equipment settings. Consequently, models trained on high-quality data from well-resourced medical centers often suffer a substantial performance drop when deployed in other clinical environments, undermining their generalizability and reliability (6).
To mitigate these challenges, domain generalization (DG) techniques have emerged as a promising paradigm (7). DG seeks to learn domain-invariant representations from multiple source domains, enabling models to generalize effectively to unseen target domains without explicit adaptation. Extending this idea, multi-domain generalization (MDG) focuses on learning from multiple labeled source domains that exhibit distinct yet related distributions, thereby encouraging the extraction of robust and transferable features across varying acquisition conditions (8,9). This paradigm is particularly well suited for US imaging, where data from different quality levels can be treated as distinct domains.
In this paper, we introduce USHydraNet, a novel multi-decoder architecture explicitly designed within the MDG framework to address domain and quality variations in US classification and segmentation. USHydraNet incorporates an image quality evaluation module that estimates both image- and feature-level statistics to assess input quality. Based on these assessments, a dynamic routing mechanism selectively activates the most appropriate quality-specific decoder, thereby maintaining decoder specialization while achieving differentiable routing. The model is trained on multiple datasets representing low-, medium-, and high-quality image domains, enabling it to learn domain-invariant yet quality-aware representations that improve robustness under heterogeneous imaging conditions.
The key contributions of this work are summarized as follows:
- We propose USHydraNet, a unified multi-decoder framework with a single shared encoder and multiple quality-specialized decoders, specifically designed for the classification of fetal US images and the segmentation task for cardiac US images across diverse imaging domains and varying quality levels.
- We introduce a dynamic routing strategy that jointly considers pixel-level and feature-level image quality metrics to determine decoder selection. This mechanism effectively bridges decoder specialization with differentiable routing, enhancing adaptability and scalability.
- We frame our method within the MDG paradigm, demonstrating through extensive experiments that USHydraNet achieves improved robustness and generalization across heterogeneous US datasets compared to conventional single-decoder models.
We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-521/rc).
Methods
DL models in DG
DG was initially formulated as a machine learning problem by Blanchard et al. (10), and the term was formally defined by Muandet et al. (11). Unlike related concepts such as domain adaptation or domain transfer, DG assumes that target domain data are entirely unavailable during training. Consequently, models must learn domain-invariant and robust feature representations solely from the source domains to ensure reliable performance in unseen environments. Typical DG approaches include data augmentation, invariant representation learning, and meta-learning, all of which aim to enhance generalization across diverse data distributions.
Building upon this concept, MDG extends DG by learning domain-invariant features from multiple labeled source domains instead of a single one. Theoretical and empirical studies (8,9) have demonstrated that MDG frameworks can effectively extract such shared representations, improving model robustness under varying acquisition conditions and unseen variations within related domains. In contrast to single-source DG, MDG emphasizes leveraging multiple distinct yet related data distributions to train a unified model that generalizes across them.
In the broader context of medical image analysis, classification and segmentation remain fundamental tasks supported by a variety of neural network architectures. For classification, models such as vision transformer (ViT) (12), ResNet (13), and GoogLeNet (14) are widely utilized, while segmentation tasks commonly rely on fully convolutional networks (FCN) (15), UNet (16), and SegNet (17). In this work, our classification framework follows the MDG paradigm rather than conventional DG. Although no completely unseen external dataset is included for classification, our multi-domain training strategy, built upon datasets categorized by image quality (low, medium, and high), encourages the model to learn invariant features through the ViT encoder while promoting specialization in quality-specific decoders. This design aligns closely with the MDG principle, fostering consistent performance across diverse acquisition conditions and quality variations. For the segmentation task, the inclusion of the external CardiacUDA dataset further supports the model’s generalization capability beyond the training domains. We employ ViT and UNet as backbone encoders for feature extraction. Both are initialized with pretrained weights to leverage their strong global representation capabilities, ensuring enhanced adaptability to heterogeneous medical datasets originating from varied imaging qualities and acquisition settings.
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics board of Kiang Wu Hospital (Ethics approval No. K-2023-067-H01) and individual consent for this retrospective analysis was waived.
Feature analysis in medical images
Medical image analysis encounters numerous challenges, including image quality degradation, noise interference, and image shifts caused by variations in equipment and operator proficiency. These factors introduce substantial differences in feature distributions for US images sourced from different hospitals or regions, which negatively impact the performance of conventional DL models. To effectively address these cross-device and cross-regional disparities, as well as to enhance classification accuracy, researchers have focused on multi-domain learning and domain adaptability as central themes. A critical step in accommodating data from disparate domains involves minimizing the feature distribution discrepancy across these domains (18).
In this paper, we propose a decoder selection strategy that leverages feature mean and variance to dynamically adapt to diverse data distributions. Specifically, mean and variance are calculated for both shallow features (e.g., RGB color channels) and deep features (e.g., feature vectors extracted by DL models). This dynamic computation enables the selection of the most appropriate decoder for the input image. In contrast to the feature alignment strategy proposed by Sun et al. (19), our approach goes beyond aligning feature distributions. By incorporating statistical feature insights, our method optimizes decoder selection, significantly improving adaptability to inter-domain differences. This unique strategy enhances classification performance and ensures robust handling of cross-domain challenges in medical image analysis.
Model architecture
To effectively address the challenge of MDG in both classification and segmentation tasks, we propose a novel architecture, USHydraNet, designed to handle data with diverse quality conditions. Depending on the downstream task, USHydraNet adopts two configurations: USHydra-ViT for classification and USHydra-UNet for segmentation. As illustrated in Figure 1, USHydraNet is built to process multi-center US datasets that exhibit varying image quality levels (low, medium, and high). These datasets encompass both fetal and cardiac US images, reflecting the inherent heterogeneity of acquisition conditions across different regions.
The workflow begins with a preprocessing stage, where an image quality evaluation module assesses the input image in two complementary steps:
- Pixel-level calculation: local variations in image patches are analyzed through pixel-wise intensity and texture statistics, enabling the detection of low-level image quality degradations.
- Feature-level calculation: the mean and variance of high-level feature maps are computed to characterize abstract, semantic quality attributes, providing a more refined assessment of overall image quality.
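The two evaluation steps above can be sketched as follows (a minimal NumPy illustration with hypothetical function names and shapes; the actual module may differ):

```python
import numpy as np

def pixel_level_stats(image):
    """Per-channel mean and standard deviation of a (C, H, W) image
    (the pixel-level calculation described above)."""
    return image.mean(axis=(1, 2)), image.std(axis=(1, 2))

def feature_level_stats(features):
    """Mean and variance of an encoder feature vector
    (the feature-level calculation described above)."""
    return float(features.mean()), float(features.var())
```

In practice, the pixel-level statistics flag low-level degradations (e.g., speckle noise, low contrast), while the feature-level statistics summarize the semantic content seen by the encoder.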
For the classification task (fetal US images), we employ ViT (12) as the backbone to extract meaningful feature information. For the segmentation task (four-chamber cardiac US images), we utilize UNet (16) as the backbone. In the first training stage, each decoder is independently trained on its corresponding quality-specific subset (low-, medium-, or high-quality images). During this stage, the routing weights in Eq. [1] and Eq. [2] are assigned in a one-hot manner according to the image quality label of each training sample. This ensures that each decoder learns to specialize in its designated quality domain.
After pre-training, the best-performing decoder weights from each subset are saved and subsequently loaded into the unified USHydraNet architecture. The classification or segmentation heads of the backbone are then frozen, allowing the pretrained decoders to retain their specialized capabilities. This training pipeline enables the model to perform differentiable routing while preserving decoder specialization, ensuring logical consistency between the network design and its training objectives.
Following the image quality evaluation, a dynamic decoder selection mechanism is triggered. This mechanism dynamically routes the input image to the decoder most suitable for its estimated quality level. The image is first encoded into latent representations by the backbone encoder, and subsequently processed by the selected decoder to perform adaptive classification or segmentation, depending on the task. Finally, USHydraNet produces task-specific outputs, either classification probabilities or segmentation maps, optimized to maintain robustness and accuracy across diverse image quality conditions.
Dynamic routing loss function
During the training process for the classification task, the backbone ViT model is initialized with pre-trained weights derived from natural image datasets. To address the variations in image quality (low, medium, and high), three decoders are introduced, each specialized for a specific quality level. These decoders are trained to extract and interpret discriminative features corresponding to their assigned quality range.
In our implementation, the training dataset for classification is strictly divided into three subsets according to image quality. During training, the routing weights in Eq. [1] are assigned in a one-hot manner based on the quality label of the input image: when a high-quality image is provided, $w_{high}=1$ and $w_{low}=w_{mid}=0$; similarly, for medium-quality images, $w_{mid}=1$ and the others are set to zero, and so on. This ensures that each decoder is updated only by samples of its corresponding quality level, thereby achieving decoder specialization while maintaining differentiable routing for inference.
The overall loss function for the classification task is formulated as:

$$\mathcal{L}_{cls} = \sum_{q \in \{low,\, mid,\, high\}} w_q \, \mathcal{L}_{CE}\!\left(y, \hat{y}_q\right) \qquad [1]$$

where $q$ represents the image quality level (low, medium, or high), $w_q$ denotes the routing weight assigned to decoder $q$, and $\mathcal{L}_{CE}$ is the cross-entropy loss between the ground truth $y$ and the prediction $\hat{y}_q$ from the corresponding decoder. This design enables each decoder to focus on its respective image-quality domain while allowing the overall network to maintain logical consistency between architecture and training objectives.
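As an illustrative sketch of the one-hot routed classification loss in Eq. [1] (plain NumPy, with hypothetical names; the actual implementation operates on batched PyTorch tensors):

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy of one sample given its predicted class probabilities."""
    return -np.log(probs[label] + 1e-12)

def routed_classification_loss(decoder_probs, label, quality):
    """One-hot routed loss in the spirit of Eq. [1]: the weight w_q is 1 for
    the decoder matching the sample's quality label ('low', 'mid', 'high')
    and 0 otherwise, so only that decoder is updated by this sample."""
    return sum(float(q == quality) * cross_entropy(p, label)
               for q, p in decoder_probs.items())
```

Because the non-matching terms are multiplied by zero, gradients flow only into the decoder assigned to the sample's quality level, which is exactly the specialization behavior described above.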
For the segmentation task, a similar strategy is applied using a UNet backbone for feature extraction. In this case, two decoders are employed, each corresponding to low- or high-quality input images. The segmentation loss is defined as:

$$\mathcal{L}_{seg} = \sum_{q \in \{low,\, high\}} w_q \, \mathcal{L}_{Dice}\!\left(y, \hat{y}_q\right) \qquad [2]$$

where $\mathcal{L}_{Dice}$ denotes the Dice loss between the ground truth $y$ and the segmentation prediction $\hat{y}_q$ from the decoder specialized for quality level $q$. The routing weights follow the same one-hot assignment scheme as in the classification task.
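A minimal sketch of the routed Dice loss in Eq. [2] (NumPy, hypothetical names; the real training loop works on probability maps from the UNet decoders):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - Dice coefficient between a predicted map and a binary target."""
    inter = float((pred * target).sum())
    return 1.0 - (2.0 * inter + eps) / (float(pred.sum() + target.sum()) + eps)

def routed_segmentation_loss(decoder_preds, target, quality):
    """One-hot routed Dice loss in the spirit of Eq. [2], with one decoder
    per quality level ('low' or 'high'); only the matching decoder
    contributes to the loss for this sample."""
    return sum(float(q == quality) * dice_loss(p, target)
               for q, p in decoder_preds.items())
```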
After these specialized decoders are trained independently on their respective subsets, the pre-trained weights are integrated into our unified USHydraNet framework. In the fine-tuning stage, the network dynamically estimates the mean and variance of image-level and feature-level quality cues to route inputs through the most appropriate decoder. This two-stage training and dynamic routing strategy effectively combines the advantages of specialization and adaptability, ensuring robust performance across diverse image quality conditions.
Quality evaluation and dynamic routing
To enhance the adaptability and performance across heterogeneous datasets, we introduce an image quality statistics calculation and dynamic routing mechanism during the training phase. Medical datasets often suffer from variability caused by differences in equipment, acquisition conditions, and operator techniques, which hinder model generalization. To address these issues in classification and segmentation tasks, we propose a module that evaluates both low-dimensional (pixel-level) and high-dimensional (feature-level) data distributions to guide decoder selection dynamically.
Specifically, we compute the mean and variance of the input at both the pixel and feature levels and use these statistics to dynamically select the decoder that best matches the data distribution, thereby improving model performance and generalization across different datasets.
The model’s computation proceeds as follows. Each task corresponds to a dataset $D$ containing $N$ images, where each image $x$ has $C$ channels of dimensions $H \times W$. For each channel $c$ of the original image, we calculate the mean and standard deviation:

$$\mu_c^{img} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{c,h,w} \qquad [3]$$

$$\sigma_c^{img} = \sqrt{\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(x_{c,h,w}-\mu_c^{img}\right)^2} \qquad [4]$$
After feeding the image into the neural network model corresponding to classification or segmentation, the feature vector $f \in \mathbb{R}^{K}$ is obtained, and its mean and variance are computed:

$$\mu^{feat} = \frac{1}{K}\sum_{k=1}^{K} f_k \qquad [5]$$

$$\left(\sigma^{feat}\right)^2 = \frac{1}{K}\sum_{k=1}^{K}\left(f_k-\mu^{feat}\right)^2 \qquad [6]$$
Next, the original image statistics are combined with the weighted feature statistics extracted by the corresponding neural network model, using a weighting coefficient $\lambda \in [0,1]$, to obtain the combined mean $\hat{\mu}_c$ and standard deviation $\hat{\sigma}_c$:

$$\hat{\mu}_c = \lambda\,\mu_c^{img} + (1-\lambda)\,\mu^{feat} \qquad [7]$$

$$\hat{\sigma}_c = \lambda\,\sigma_c^{img} + (1-\lambda)\,\sigma^{feat} \qquad [8]$$
The overall mean $\bar{\mu}$ and standard deviation $\bar{\sigma}$ are then obtained by averaging the combined statistics over all channels, covering the image both before and after it enters the neural network model:

$$\bar{\mu} = \frac{1}{C}\sum_{c=1}^{C}\hat{\mu}_c \qquad [9]$$

$$\bar{\sigma} = \frac{1}{C}\sum_{c=1}^{C}\hat{\sigma}_c \qquad [10]$$
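The combination and averaging steps described above can be sketched as follows (NumPy; `lam` is an assumed weighting coefficient for illustration, not a value reported in the paper):

```python
import numpy as np

def combined_stats(img_mean, img_std, feat_mean, feat_std, lam=0.5):
    """Combine per-channel image statistics with the encoder's feature
    statistics; lam is a hypothetical weighting coefficient."""
    mu = lam * img_mean + (1.0 - lam) * feat_mean
    sigma = lam * img_std + (1.0 - lam) * feat_std
    return mu, sigma

def overall_stats(mu_combined, sigma_combined):
    """Average the combined per-channel statistics into one overall
    mean and standard deviation."""
    return float(np.mean(mu_combined)), float(np.mean(sigma_combined))
```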
During inference, the domain disparity ($D_d$) is calculated by comparing the overall mean and standard deviation of the test data with those of each training domain. The domain with the smallest disparity is selected, and the corresponding decoders are used for classification and segmentation. The disparity is computed as:

$$D_d = \left|\bar{\mu}_{test}-\bar{\mu}_d\right| + 10\left|\bar{\sigma}_{test}-\bar{\sigma}_d\right| \qquad [11]$$

where $\bar{\mu}_d$ and $\bar{\sigma}_d$ represent precalculated averages derived from each quality domain’s complete training dataset. For each domain $d$, these statistics are obtained by processing all training images through the encoder and subsequently computing the mean and standard deviation of the resulting feature vectors. The weighting factor of 10 in Eq. [11] was empirically determined through extensive ablation studies to appropriately balance mean and standard deviation differences. Our experiments revealed that standard deviation differences provide more robust signals for domain disparity, necessitating this amplification factor.
In Eq. [11], the domain $d$ denotes the specific quality-level dataset. The strategy for selecting the classification and segmentation decoders is as follows:

$$d^{*} = \arg\min_{d} D_d \qquad [12]$$
Here, the classification and segmentation decoders corresponding to the domain with the smallest disparity are used to normalize features and make predictions for the target domain. By leveraging this adaptive routing mechanism, USHydraNet achieves robust generalization across datasets with varying image quality conditions.
During inference, we first extract the features of the test image using the same encoder. Then, the global mean and standard deviation of the test image are computed from these features.
Next, these test statistics are compared to the statistics of each precalculated domain using Eq. [11]. Finally, the decoder of the specific domain with the smallest feature deviation from the test image is selected for subsequent processing.
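The inference-time routing just described, including the factor-of-10 weighting of Eq. [11], can be sketched in a few lines (pure Python, hypothetical names):

```python
def domain_disparity(test_mu, test_sigma, dom_mu, dom_sigma):
    """Eq. [11]: absolute mean difference plus ten times the absolute
    standard-deviation difference."""
    return abs(test_mu - dom_mu) + 10.0 * abs(test_sigma - dom_sigma)

def select_decoder(test_mu, test_sigma, domain_stats):
    """Pick the quality domain with the smallest disparity; domain_stats
    maps a domain name to its precalculated (mean, std) pair."""
    return min(domain_stats,
               key=lambda d: domain_disparity(test_mu, test_sigma,
                                              *domain_stats[d]))
```

For example, with precomputed statistics for three quality domains, a test image whose feature statistics sit closest to the medium-quality domain is routed to that domain's decoder.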
Results
Experimental setup
Implementation details: in this study, we used ViT and UNet as backbones for classification and segmentation, respectively. The models were implemented on an NVIDIA RTX A4000 GPU (16 GB memory) using the PyTorch framework. The training set was used to train the model for 100 epochs with a batch size of 16. The ReLU activation function was applied to prevent vanishing gradients.
Evaluation metrics: for the classification task, several common evaluation metrics were used, including precision, recall, F1-Score, accuracy, and area under the curve (AUC). Higher values of these metrics indicate better model performance. In the segmentation task, in addition to the metrics mentioned above, two additional commonly used metrics were employed: the Dice coefficient and intersection over union (IoU). The Dice coefficient measures the overlap between the predicted and ground truth regions, while IoU evaluates the overlapping area of the two regions. As with the classification metrics, higher values of the Dice coefficient and IoU indicate better performance.
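The two segmentation metrics defined above have standard closed forms; a minimal NumPy version for binary masks might look like:

```python
import numpy as np

def dice_coefficient(pred, gt, eps=1e-12):
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

def iou(pred, gt, eps=1e-12):
    """IoU = |A intersect B| / |A union B| for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + eps)
```

Note that Dice is always at least as large as IoU for the same prediction, which is why the Dice columns in the tables below exceed the IoU columns.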
Datasets
For the fetal classification task, three datasets were used: two public datasets and one private dataset, as shown in Table 1 and Figure 2. The first public dataset is the Zenodo dataset (20), which contains maternal fetal US images from low-resource imaging environments in five African countries: Egypt, Algeria, Uganda, Ghana, and Malawi. This dataset includes 450 images, with 100 samples from 25 patients in each country. The images are relatively low in quality, reflecting the instrumentation used in these regions. The second public dataset is the FETAL PLANES DB (21), collected at BCNatal, an institution with two affiliated hospitals (Clínic Hospital and Sant Joan de Déu Hospital, Barcelona, Spain), as well as four Danish Fetal Research Centres (FRCs): Copenhagen University Hospital Rigshospitalet, Hvidovre Hospital, Herlev Hospital, and Nordsjællands Hospital Hillerød. The images in this dataset are of moderate quality and reflect the instrumentation used at these centers. The private dataset was collected from Kiang Wu Hospital in Macao, where the images are fully annotated, and both the image quality and instrumentation are relatively high.
Table 1
| Dataset | Brain | Thorax | Femur | Abdomen | Total |
|---|---|---|---|---|---|
| Africa | 125 | 75 | 125 | 125 | 450 |
| Europe | 3,092 | 1,718 | 1,040 | 711 | 6,565 |
| Macau | 123 | 94 | 102 | 112 | 431 |
US, ultrasound.
For cardiac US segmentation, two publicly available datasets were used, as shown in Table 2 and Figure 3. The first dataset is Camus (22), which contains data from 500 patients obtained at the University Hospital of Saint-Étienne (France). The dataset includes 2D sequences of apical four-chamber and two-chamber sections, along with their corresponding masks. The second dataset is EchoNet-Dynamic (23), which includes 10,030 apical four-chamber echocardiography videos. These videos were collected as part of routine clinical care at Stanford University Hospital between 2016 and 2018. In addition, we used the CardiacUDA (24) dataset to demonstrate the model’s ability to generalize. The CardiacUDA dataset consists of the parasternal left ventricular long-axis (LVLA), pulmonary artery long-axis (PALA), left ventricular short-axis (LVSA), and apical four-chamber (A4C) views, with four views per patient. The resolution of each video was 800×800 or 1,024×768, depending on the scanner used (Philips or Hitachi). Approximately 100 different patients contributed 516 and 476 videos from the G and R regions, respectively. Each video consisted of more than 100 frames covering at least one cardiac cycle. Only five annotated frames were provided for each video, and these five frames along with their corresponding original video frames were used as training data.
Table 2
| Dataset | Train | Val | Test |
|---|---|---|---|
| Camus | 400 | 50 | 50 |
| Echo | 6,014 | 2,004 | 2,006 |
US, ultrasound.
Data preprocessing
In this paper, the datasets are categorized into three different resource settings for the classification task: the dataset from the five African countries reflects the prevailing conditions in low-resource regions, the European dataset from Spain represents the typical healthcare resource level in medium regions, and the dataset from Macao exemplifies the healthcare environment in high-resource regions. During the training phase, images are resized to 224×224 pixels and normalized. Data augmentation techniques are applied, including resizing, horizontal flipping, and rotation. The data is split into 60% for training, 20% for testing, and 20% for validation.
For the segmentation task, the datasets are divided into two scenarios: the Camus dataset corresponds to low- and medium-resource regions, while the EchoNet-Dynamic dataset reflects the situation in regions with medium- to high-resource levels. Due to the differences in formats between the CAMUS and Echo datasets, distinct preprocessing methods are applied to each. For the CAMUS dataset, the four-chamber cardiac US slices (4CH) and their corresponding segmentation labels are read. Each sample consists of an image file and a segmentation label file, both following the same naming convention. The segmentation label file marks the target region (e.g., cardiac chambers). For the Echo dataset, the US image and its corresponding mask label (binary mask) are loaded, which indicates the target region. During the label processing stage, a uniform binarization operation is performed on the segmentation labels from both datasets by setting the pixel value of the target region to 1 and the pixel value of the background region to 0. Both CAMUS and EchoNet-Dynamic datasets are divided into training (60%), validation (20%), and test (20%) sets, maintaining consistency in data processing and model training across the segmentation task.
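The label binarization and the 60/20/20 split described above can be sketched as follows (NumPy; `target_value` is a hypothetical class index, since CAMUS labels are multi-class while EchoNet masks are already binary):

```python
import numpy as np

def binarize_mask(label, target_value=1):
    """Uniform binarization: target-region pixels -> 1, background -> 0.
    target_value selects the chamber class for CAMUS-style labels."""
    return (label == target_value).astype(np.uint8)

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 60/20/20 into
    train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.6 * n_samples), int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```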
Ablation study
Table 3 summarizes the performance of the baseline ViT model and the proposed USHydra-ViT model on the European, African, and Macao datasets. While the ViT model is centered on domain-invariant feature extraction, our approach aims to enhance generalization across different domains. The results show that USHydra-ViT significantly outperforms the baseline approach on all metrics. In particular, the improvements observed on the Macao and African datasets are striking. For the Macao dataset, the proposed method achieves an improvement of about 10% on almost all key metrics, demonstrating its ability to maintain high accuracy in high-resource settings. On the African dataset, which represents a low-resource setting, the proposed method shows even greater improvements, with performance gains of up to 20%. While the European dataset shows a relatively small improvement, possibly due to its larger size and better inherent data distribution, the gains are still statistically significant. These findings reflect the effectiveness of USHydra-ViT in processing heterogeneous medical image data and improving generalization performance across different resource settings. In addition, USHydra-ViT shows significant advantages over ViT on the mixed dataset, confirming its robustness in diverse resource environments.
Table 3
| Methods | Dataset | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) | AUC (%) |
|---|---|---|---|---|---|---|
| ViT | European | 83.99 | 83.90 | 83.92 | 83.90 | 92.67 |
| | African | 72.34 | 64.50 | 66.25 | 64.50 | 87.56 |
| | Macao | 60.08 | 60.00 | 60.04 | 60.00 | 82.40 |
| | Mixture | 73.52 | 68.97 | 73.02 | 75.39 | 88.39 |
| USHydra-ViT (ours) | European | 93.99 | 93.75 | 93.71 | 93.75 | 99.29 |
| | African | 84.06 | 80.00 | 78.91 | 80.00 | 96.29 |
| | Macao | 84.06 | 85.05 | 80.34 | 85.05 | 95.00 |
| | Mixture | 92.93 | 93.70 | 93.68 | 93.70 | 99.17 |
AUC, area under the curve; US, ultrasound.
For the segmentation task, Table 4 reports the performance of the baseline UNet model and the proposed USHydra-UNet model on the Camus and EchoNet-Dynamic datasets. The results show that USHydra-UNet outperforms the baseline on all evaluation metrics. On the Camus dataset, which reflects low- and medium-resource regions, USHydra-UNet improves the Dice coefficient and IoU by 12.23% and 19.52%, respectively. These significant improvements demonstrate the ability of USHydra-UNet to effectively model and generalize data in resource-constrained environments. For the EchoNet-Dynamic dataset, which represents a medium- to high-resource region, USHydra-UNet shows a smaller but still meaningful improvement, with an increase in the Dice coefficient of 0.59% and an increase in IoU of 1.05%. The smaller magnitude of improvement can be attributed to the higher quality data and annotations inherent in this dataset, which already provide strong baseline performance for UNet. Overall, the results highlight the superior generalization ability of USHydra-UNet for segmenting medical images on datasets with varying resource levels and image quality, validating the robustness and adaptability of the proposed framework in addressing various segmentation challenges. In addition, on the mixed dataset, USHydra-UNet demonstrates clearer advantages in the Dice coefficient and IoU metrics compared to UNet, which further proves the robustness of the method in diverse resource environments.
Table 4
| Methods | Dataset | Dice (%) | IoU (%) |
|---|---|---|---|
| UNet | Camus | 81.72 | 69.09 |
| | EchoNet | 91.14 | 83.72 |
| | Mixture | 85.60 | 78.30 |
| USHydra-UNet (ours) | Camus | 93.95 | 88.61 |
| | EchoNet | 91.73 | 84.77 |
| | Mixture | 92.90 | 85.84 |
IoU, intersection over union; US, ultrasound.
Comparison experiments
In the classification module, we evaluated the performance of our proposed method against several widely used models, including ResNet18 (13), ResNet50 (13), VGG (25), and DenseNet121 (26). To ensure a fair comparison, we utilized a mixed dataset that combines data from multiple regions, serving as a benchmark for both training and testing. The results of these experiments are presented in Table 5.
Table 5
| Methods | Dataset | Precision (%) | Recall (%) | F1-score (%) | Accuracy (%) | AUC (%) |
|---|---|---|---|---|---|---|
| ResNet50 (13) | European | 88.04 | 88.15 | 89.12 | 88.15 | 94.83 |
| | African | 78.93 | 79.00 | 78.89 | 79.00 | 84.05 |
| | Macao | 66.80 | 68.00 | 66.36 | 68.00 | 82.70 |
| | Mixture | 91.79 | 91.33 | 91.38 | 91.33 | 98.49 |
| VGG (25) | European | 86.12 | 86.19 | 86.20 | 86.19 | 95.29 |
| | African | 80.52 | 83.00 | 79.93 | 83.00 | 84.00 |
| | Macao | 77.30 | 80.00 | 78.60 | 80.00 | 82.67 |
| | Mixture | 87.09 | 87.00 | 86.72 | 87.00 | 96.51 |
| DenseNet121 (26) | European | 87.30 | 87.10 | 87.14 | 87.10 | 93.77 |
| | African | 78.01 | 78.00 | 77.02 | 78.00 | 84.01 |
| | Macao | 66.80 | 60.00 | 68.30 | 60.00 | 82.25 |
| | Mixture | 92.74 | 92.76 | 92.69 | 92.67 | 98.75 |
| ResNet18 (13) | European | 83.99 | 83.90 | 83.92 | 83.90 | 92.67 |
| | African | 72.34 | 64.50 | 66.25 | 64.50 | 87.56 |
| | Macao | 60.08 | 60.00 | 60.04 | 60.00 | 82.40 |
| | Mixture | 73.52 | 68.97 | 73.02 | 75.39 | 88.39 |
| USHydra-ViT (ours) | European | 93.99 | 93.75 | 93.71 | 93.75 | 99.29 |
| | African | 84.06 | 80.00 | 78.91 | 80.00 | 96.29 |
| | Macao | 84.06 | 85.05 | 80.34 | 85.05 | 95.00 |
| | Mixture | 92.93 | 93.70 | 93.68 | 93.70 | 99.17 |
AUC, area under the curve.
The results show that ResNet18 performs the worst among all compared models, while ResNet50 and DenseNet121 achieve higher performance. However, our method outperforms all of these models on all datasets by a significant margin, especially on key metrics such as AUC and accuracy, demonstrating its strong robustness. Specifically, our method utilizes a new multi-decoder framework that dynamically adapts to different data distributions and quality levels, distinguishing it from traditional models such as ResNet and VGG, which do not explicitly account for such variability. The ability to adaptively select the most appropriate decoder based on the features of the input data is one of the key innovations of our approach, enhancing its ability to generalize across different domains. This feature allows our model to provide better performance, especially in resource-constrained environments where other models struggle to maintain accuracy.
In the segmentation module, Table 6 presents a comparison of the performance of various segmentation models on the mixed test set, evaluated using the Dice coefficient and IoU. The models compared in this study include DoFE (27), SegNet (28), DeepLab v3 (29), and FCN (15). The results clearly show that USHydra-UNet, the model proposed in this paper, outperforms all other models in both metrics, with a Dice coefficient of 91.87% and an IoU of 84.74%. Although models like SegNet and DeepLab v3 show competitive performance, they still fall short of USHydra-UNet, especially in more challenging tasks. USHydra-UNet achieves a Dice coefficient of 93.95% and an IoU of 88.61% on the Camus dataset, 91.73% and 84.77% on the EchoNet dataset, and 92.90% and 85.84% on the Mixture dataset. These results far exceed those of other models, especially on the Mixture dataset, where USHydra-UNet has a clear advantage in both the Dice coefficient and IoU. On the unfamiliar CardiacUDA dataset, USHydra-UNet still maintains strong performance, with 89.26% Dice and 84.14% IoU. One of the key innovations of USHydra-UNet is its multi-decoder architecture, which dynamically selects the optimal decoder based on the features of the input image and is thus able to handle variations in data quality. This approach ensures that our model provides higher accuracy and robustness in target-area segmentation, even in the presence of diverse data from different regions and resource environments. In addition, USHydra-UNet’s ability to maintain robust performance across multiple datasets highlights its exceptional versatility, making it an ideal solution for addressing real-world medical imaging challenges, especially where data quality and availability are often inconsistent.
Table 6
| Methods | Dataset | Dice (%) | IoU (%) |
|---|---|---|---|
| DoFE (27) | Camus | 86.86 | 76.49 |
| | EchoNet | 87.87 | 81.14 |
| | Mixture | 89.81 | 81.62 |
| | CardiacUDA | 77.67 | 69.72 |
| SegNet (28) | Camus | 90.14 | 82.28 |
| | EchoNet | 89.66 | 81.26 |
| | Mixture | 92.90 | 85.84 |
| | CardiacUDA | 87.16 | 77.25 |
| DeepLab v3 (29) | Camus | 85.21 | 74.23 |
| | EchoNet | 89.31 | 80.69 |
| | Mixture | 89.64 | 81.23 |
| | CardiacUDA | 74.62 | 66.70 |
| FCN (15) | Camus | 88.44 | 79.28 |
| | EchoNet | 88.97 | 80.13 |
| | Mixture | 89.81 | 81.51 |
| | CardiacUDA | 80.81 | 73.62 |
| USHydra-UNet (ours) | Camus | 93.95 | 88.61 |
| | EchoNet | 91.73 | 84.77 |
| | Mixture | 92.90 | 85.84 |
| | CardiacUDA | 89.26 | 84.14 |
FCN, fully convolutional networks; IoU, intersection over union.
Our experiments reveal that while traditional models can perform well on certain datasets, they fail to maintain robustness and accuracy when faced with varying data qualities, especially in resource-constrained environments. In contrast, the dynamic decoder selection mechanism employed by USHydra-UNet enables it to adapt to different data characteristics, thereby delivering better segmentation results in a broader range of real-world scenarios. This represents a significant advancement over existing models, providing a more reliable and versatile solution for medical image segmentation tasks.
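The Dice coefficient and IoU reported above are standard overlap metrics for binary masks. A minimal sketch of their computation (the metric definitions only, not the authors' evaluation code):

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """Compute Dice coefficient and IoU for two binary masks.

    pred, gt: boolean or {0, 1} arrays of the same shape.
    eps guards against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (union + eps)
    return float(dice), float(iou)
```

Note that Dice and IoU are monotonically related (Dice = 2·IoU / (1 + IoU)), so rankings on one metric usually carry over to the other, as seen in Table 6.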
Data visualization
In the classification module, as shown in Figure 4, the heatmaps indicate that USHydra-ViT effectively captures complex features with high accuracy and robustness. DenseNet121 performs well across the four comparative experiments, with more localized activations that reflect its dense feature reuse. ResNet50, with its residual structure, shows strong activation in edge and contour regions, indicating good performance on complex features. ResNet18 is slightly less effective, with fewer activated regions and lower performance, although it remains efficient for relatively simple tasks. The VGG model, with its simpler structure, exhibits more uniform activations and lacks the ability to capture fine details, which lowers its accuracy on more intricate tasks. Overall, USHydra-ViT is best suited for complex tasks, DenseNet121 and ResNet50 also handle them well, and VGG and ResNet18 may be preferable in resource-limited scenarios.
Figure 5 shows the confusion matrices. The model struggles to accurately classify the fetal abdomen and thorax in the African dataset, whereas the fetal brain and femur are predicted more accurately, although some confusion remains. Overall, prediction accuracy is lower on the African dataset, indicating room for improvement, while the European and Macau datasets, with their larger size and higher quality, yield stronger classification results. Taken together, these results illustrate how the model generalizes across datasets of varying quality and quantity.
In the segmentation module, as shown in Figure 6, the segmentation results on cardiac US images are visualized in three columns: the original US image, the ground truth (the segmentation region manually annotated by an expert), and the model prediction. The first column shows the input US image, displaying the original morphology of the target structure. The second column shows the ground truth, the expert-annotated target region that serves as the segmentation reference. The third column shows the model's segmentation output, i.e., the predicted regions.
The comparison shows that the predictions closely match the ground-truth labels in shape, location, and size, demonstrating the model's ability to recognize and segment target regions against complex backgrounds. We also show the predictions of the comparative models on the same input images; among them, our model's predictions are closest to the expert annotations. The fine detail of the predictions highlights the model's effectiveness in boundary processing, as it accurately captures the integrity of each region, further underscoring its superior performance in cardiac US image segmentation.
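The three-column layout described above can be reproduced with a short plotting routine. A sketch, assuming the inputs are 2D grayscale arrays (the function name and output path are illustrative, not the authors' code):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, suitable for batch export
import matplotlib.pyplot as plt
import numpy as np

def show_segmentation(image, gt, pred, out_path="panel.png"):
    """Render image / ground truth / prediction side by side and save to disk."""
    fig, axes = plt.subplots(1, 3, figsize=(9, 3))
    for ax, data, title in zip(
        axes, (image, gt, pred),
        ("Ultrasound image", "Ground truth", "Prediction"),
    ):
        ax.imshow(data, cmap="gray")
        ax.set_title(title)
        ax.axis("off")  # hide ticks; only the image content matters
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return fig
```

For qualitative comparison across models, the same routine can be called once per model with the shared `image` and `gt` and each model's `pred`.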
Discussion
A key objective of DL applications in healthcare is the development of inclusive and robust solutions that accommodate resource constraints across different regions. However, building effective predictive models often requires large datasets and extensive model parameters, which are difficult to acquire in low-resource environments. Moreover, the challenge of DG, i.e., ensuring that models perform effectively across diverse domains, is critical to address, especially in highly variable medical scenarios.
In this study, we introduced the USHydraNet framework, a novel approach aimed at addressing DG in both classification and segmentation tasks for medical images, particularly US images. The proposed architecture integrates a multi-decoder design with a dynamic decoder selection mechanism that leverages image quality metrics, such as the mean and variance of input images. By dynamically selecting the most appropriate decoder output, the framework ensures adaptability and robust performance, even when applied to datasets with varying resource levels and characteristics. USHydraNet's effectiveness was demonstrated across multiple cross-region datasets, including those from Africa, Europe, and Macau, where it achieved outstanding results in both classification and segmentation tasks.
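A routing rule based on intensity statistics can be sketched as follows. This is a minimal illustration of the idea of selecting a decoder from the mean and variance of the input; the quality score, thresholds, and three-way split are assumptions for the sketch, not the paper's exact routing rule:

```python
import numpy as np

def select_decoder(img: np.ndarray, thresholds=(0.35, 0.65)) -> int:
    """Pick a decoder index from simple intensity statistics.

    img: grayscale image scaled to [0, 1].
    thresholds: illustrative cut points on the quality score.
    """
    mean = float(img.mean())
    var = float(img.var())
    # Combine brightness (mean) and contrast (variance) into one score.
    # The variance term is rescaled and clipped so both terms live in [0, 1].
    score = 0.5 * mean + 0.5 * min(var * 10.0, 1.0)
    if score < thresholds[0]:
        return 0  # decoder specialized for low-quality inputs
    if score < thresholds[1]:
        return 1  # decoder for mid-range quality
    return 2      # decoder for high-quality inputs
```

At inference time the encoder's features would then be passed only through the selected decoder head, which keeps the per-image cost close to that of a single-decoder network.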
In the classification module, USHydraNet showcased its ability to process images from diverse geographic regions, delivering significant improvements in classification accuracy, particularly with low-quality images common in resource-constrained environments. One of the key strengths of USHydraNet is its dynamic decoder selection capability, which enables the model to adapt to variations in image quality and maintain strong generalization performance across domains. Likewise, in the segmentation module, the model achieved exceptional results in segmenting cardiac US data. Comparative experiments demonstrated that USHydraNet consistently outperformed the traditional UNet baseline, especially in scenarios involving datasets of heterogeneous quality.
While USHydraNet represents a significant step forward in handling DG and achieving reliable performance across diverse datasets, it is not without its limitations. Firstly, the model may encounter challenges when processing datasets with extreme imbalances or stark differences in image quality. In such scenarios, the decoder selection mechanism may fail to identify the most suitable decoder, potentially leading to degraded performance. Secondly, the reliance on pretrained weights facilitates faster training but introduces potential risks when applying the model to vastly different domains from those represented in the pretraining datasets. Such domain discrepancies could reduce the model’s accuracy and robustness in previously unseen scenarios.
Future research will aim to address these limitations by incorporating a wider variety of datasets to further enhance the model’s generalization and robustness across more complex and diverse medical imaging scenarios. Additionally, we aim to optimize the decoder selection mechanism and integrate more advanced DL techniques to enhance the model’s performance, efficiency, and adaptability in real-world applications.
Conclusions
This paper proposes the USHydraNet framework to address DG challenges in medical imaging, tackling limitations arising from regional, technological, and resource differences. Using fetal US for classification and cardiac US for segmentation, USHydraNet integrates a dynamic routing mechanism to optimize performance by selecting the most appropriate decoder based on input quality. Experimental results demonstrate significant improvements over baseline models, highlighting the framework’s robustness and adaptability across diverse healthcare environments. USHydraNet shows great potential for reliable deployment in real-world medical scenarios, particularly in resource-constrained settings.
Acknowledgments
None.
Footnote
Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-521/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-521/dss
Funding: This work was supported by the grant from
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-521/coif). R.G. is employed by Shenzhen RayShape Medical Technology Co., Ltd. The other authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the institutional ethics board of Kiang Wu Hospital (Ethical approval number: K-2023-067-H01) and individual consent for this retrospective analysis was waived.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Martin-Isla C, Campello VM, Izquierdo C, et al. Image-Based Cardiac Diagnosis With Machine Learning: A Review. Front Cardiovasc Med 2020;7:1. [Crossref] [PubMed]
- Houssein EH, Emam MM, Ali AA, et al. Deep and machine learning techniques for medical imaging-based breast cancer: A comprehensive review. Expert Systems with Applications 2021;167:114161.
- Zhang L, Wang M, Liu M, et al. A Survey on Deep Learning for Neuroimaging-Based Brain Disorder Analysis. Front Neurosci 2020;14:779. [Crossref] [PubMed]
- Saito K, Watanabe K, Ushiku Y, et al. Maximum classifier discrepancy for unsupervised domain adaptation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018, pp. 3723-32.
- Ganin Y, Lempitsky V. Unsupervised domain adaptation by backpropagation. International Conference on Machine Learning, PMLR; 2015, pp. 1180-9.
- Yang Y, Gandhi M, Wang Y, et al. A Textbook Remedy for Domain Shifts: Knowledge Priors for Medical Image Analysis. Adv Neural Inf Process Syst 2024;37:90683-713.
- Guan H, Liu M. Domain Adaptation for Medical Image Analysis: A Survey. IEEE Trans Biomed Eng 2022;69:1173-85. [Crossref] [PubMed]
- Li Y, Tian X, Gong M, et al. Deep domain generalization via conditional invariant adversarial networks. Proceedings of the European Conference on Computer Vision (ECCV); 2018, pp. 624-39.
- Liu Q, Dou Q, Heng PA. Shape-aware meta-learning for generalizing prostate MRI segmentation to unseen domains. Medical Image Computing and Computer Assisted Intervention—MICCAI 2020: 23rd International Conference, Lima, Peru, October 4-8, 2020, Proceedings, Part II 23. Springer; 2020, pp. 475-85.
- Blanchard G, Lee G, Scott C. Generalizing from several related classification tasks to a new unlabeled sample. Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'11); 2011:2178-86.
- Muandet K, Balduzzi D, Schölkopf B. Domain generalization via invariant feature representation. International conference on machine learning, PMLR; 2013, pp. 10-8.
- Han K, Wang Y, Chen H, et al. A Survey on Vision Transformer. IEEE Trans Pattern Anal Mach Intell 2023;45:87-110. [Crossref] [PubMed]
- He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016, pp. 770-8.
- Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015, pp. 1-9.
- Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA; 2015, pp. 3431-40.
- Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, et al., editors. Medical image computing and computer-assisted intervention—MICCAI 2015. Springer, 2015, pp. 234-41.
- Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:2481-95. [Crossref] [PubMed]
- Wang W, Deng W. Deep visual domain adaptation: A survey. Neurocomputing 2018;312:135-53.
- Sun B, Saenko K. Deep CORAL: Correlation alignment for deep domain adaptation. In: Hua G, Jégou H, editors. Computer Vision—ECCV 2016 Workshops. Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part III 14. Springer, 2016, pp. 443-50.
- Sendra-Balcells C, Campello VM, Torrents-Barrena J, et al. Generalisability of fetal ultrasound deep learning models to low-resource imaging settings in five African countries. Sci Rep 2023;13:2728. [Crossref] [PubMed]
- Burgos-Artizzu XP, Coronado-Gutiérrez D, Valenzuela-Alcaraz B, et al. Evaluation of deep convolutional neural networks for automatic classification of common maternal fetal ultrasound planes. Sci Rep 2020;10:10200. [Crossref] [PubMed]
- Leclerc S, Smistad E, Pedrosa J, et al. Deep Learning for Segmentation Using an Open Large-Scale Dataset in 2D Echocardiography. IEEE Trans Med Imaging 2019;38:2198-210. [Crossref] [PubMed]
- Yang J, Ding X, Zheng Z, et al. Graphecho: Graph-driven unsupervised domain adaptation for echocardiogram video segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 11878-87.
- Ouyang D, He B, Ghorbani A, et al. Video-based AI for beat-to-beat assessment of cardiac function. Nature 2020;580:252-6. [Crossref] [PubMed]
- Tammina S. Transfer learning using vgg-16 with deep convolutional neural network for classifying images. International Journal of Scientific and Research Publications (IJSRP) 2019;9:143-50.
- Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017, pp. 4700-8.
- Wang S, Yu L, Li K, et al. DoFE: Domain-Oriented Feature Embedding for Generalizable Fundus Image Segmentation on Unseen Datasets. IEEE Trans Med Imaging 2020;39:4237-48. [Crossref] [PubMed]
- Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans Pattern Anal Mach Intell 2017;39:2481-95. [Crossref] [PubMed]
- Si H, Shi Z, Hu X, et al. Image semantic segmentation based on improved DeepLab V3 model. Int J Model Identif Control 2020;36:116-25.

