Original Article

A developmental stage-aware graph transformer framework for automated bone-age assessment

Kerang Cao1,2, Chang Liu1,2, Jiaming Du1,2, Lini Duan3, Li Li4, Ye Ma5, Hoekyung Jung6

1College of Computer Science and Technology, Shenyang University of Chemical Technology, Shenyang, China; 2Key Laboratory of Intelligent Technology of Chemical Process Industry in Liaoning Province, Shenyang, China; 3School of Economics and Management, Shenyang University of Chemical Technology, Shenyang, China; 4Shenyang Maternal and Child Health Hospital, Shenyang, China; 5School of Computer Science and Technology, Dalian University of Technology, Dalian, China; 6Computer Engineering Department, Paichai University, Daejeon, Republic of Korea

Contributions: (I) Conception and design: K Cao, C Liu; (II) Administrative support: K Cao, H Jung; (III) Provision of study materials or patients: K Cao, C Liu, J Du, L Li; (IV) Collection and assembly of data: J Du, L Duan, L Li; (V) Data analysis and interpretation: C Liu, L Duan, L Li, Y Ma; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Hoekyung Jung, PhD. Computer Engineering Department, Paichai University, Baejae-ro 155-40, Daejeon 35345, Republic of Korea. Email: hkjung@pcu.ac.kr.

Background: Bone-age assessment is a crucial tool in pediatric healthcare for monitoring children’s growth and diagnosing endocrine disorders. Traditional manual assessment methods such as Greulich-Pyle (GP) and Tanner-Whitehouse (TW) standards have significant limitations including high subjectivity, complex operation, and a time-consuming process. The currently available automated methods struggle to effectively highlight clinically relevant growth regions and capture anatomical associations between skeletal structures across different developmental stages. This study developed a novel framework integrating anatomical knowledge with developmental stage awareness to improve automated bone-age assessment accuracy and interpretability.

Methods: We evaluated our proposed framework using the publicly available Radiological Society of North America (RSNA) pediatric bone age dataset. The model was trained and tested using the standard data split, with performance primarily assessed by the mean absolute error (MAE) in months. We designed a developmental stage-aware graph transformer framework (DSGTF) for automated bone-age assessment. Our framework integrates image preprocessing with automatic key region detection, along with subsequent feature extraction from both local skeletal regions and the whole-hand area. The core innovation lies in our graph transformer architecture that models anatomical relationships with a biologically informed skeletal graph structure. This is enhanced by our developmental stage-aware visual transformer fusion module, which adaptively identifies each sample’s developmental stage and dynamically adjusts processing strategies to accommodate variations in bone characteristics across maturity levels. The system is optimized through multitask learning, which balances bone age prediction with anatomically meaningful feature representation.

Results: The DSGTF model achieved an MAE of 4.82 months in bone-age assessment. The model exhibited consistent performance across different age groups and sexes, with the MAE ranging from 3.39 to 6.03 months. Robustness evaluation showed that the model maintained stable performance under various image transformations, with 11 out of 16 transformations showing no statistically significant performance degradation (P>0.05). Geometric transformations resulted in minimal increases in MAE (<0.43 months), while moderate photometric changes produced MAE differences within 0.21 months. Visualization analysis revealed that the developmental stage–aware module successfully captures differentiated attention patterns across skeletal regions, with early stage processing focusing primarily on wrist regions, the middle stage showing balanced attention across structures, and the late stage emphasizing the proximal phalanges while maintaining high attention to radius-ulna regions—a pattern that demonstrates the model’s ability to capture clinically relevant developmental priorities across different maturation stages.

Conclusions: By leveraging a skeletal graph structure based on anatomical relationships and incorporating a developmental stage-aware processing mechanism, the DSGTF framework provides an accurate and efficient automated solution for clinical bone-age assessment. The proposed approach exhibits strong stability and consistency across different age groups and sexes, making it a reliable tool for real-world medical applications. This model effectively simulates the clinical assessment process by dynamically adjusting attention to different skeletal regions based on developmental stages, mirroring the decision-making process of radiologists.

Keywords: Bone age assessment; graph neural networks (GNNs); developmental stage awareness; transformer; deep learning


Submitted Apr 27, 2025. Accepted for publication Oct 22, 2025. Published online Dec 31, 2025.

doi: 10.21037/qims-2025-998


Introduction

Bone age refers to the biological age of a person’s skeletal maturity as determined through an assessment of the degree of bone development. Bone age assessment holds significant clinical importance in children and adolescents, particularly in evaluating growth development and skeletal health. Studies have found that abnormal growth in children is associated with various conditions including metabolic disorders (1) and genetic bone disorders such as skeletal dysplasias (2,3), and thus bone-age assessment has gradually been established as an essential tool in pediatric endocrinology and orthopedics (4). Accurate bone-age assessment not only helps to determine an individual’s growth potential but also allows for the early detection of growth abnormalities, skeletal diseases, and endocrine disorders, thereby providing a basis for early intervention. In clinical practice, bone age is typically determined manually by physicians using posteroanterior X-ray images of the left hand, with assessment standards primarily falling into two categories: the Greulich-Pyle (GP) atlas standard (5) and the Tanner-Whitehouse (TW) standard (6). However, both of these clinical methods involve considerable subjectivity, complex operations, and time-consuming procedures. Moreover, as the assessment results are easily influenced by the evaluator’s experience, standardization remains elusive.

Due to the sustained advancements in computer vision technology and medical image analysis (7), bone-age assessment methods have undergone a significant transformation, from traditional machine learning to deep learning. Early automated systems such as BoneXpert (8) primarily achieved automatic assessment by extracting the morphological features, grayscale distribution, and structural textures of bones. Somkantha et al. (9) proposed an edge-tracking-based method that combines vector image models with edge map information to extract carpal bone boundary features, achieving precise bone-age assessment via support vector regression. Despite these advances, traditional machine learning approaches still depend heavily on domain knowledge and expert experience for feature design, requiring complex image preprocessing and feature extraction steps, which results in low overall processing efficiency (10). The development of deep learning has pushed bone-age assessment into a new phase (11,12), with convolutional neural network (CNN)-based methods improving assessment accuracy and efficiency by learning deep features directly from skeletal images (13-16). However, it is crucial to acknowledge the inherent limitations in automated bone-age assessment systems that rely on human annotations as the ground truth. As demonstrated by Halabi et al. (17), interradiologist disagreement in bone-age assessment is typically around 4 months or more, establishing a practical lower bound for artificial intelligence (AI) systems trained on human-curated data. This interobserver variability represents a fundamental ceiling for assessment accuracy, as no automated system can realistically achieve precision beyond the consistency of its training annotations. Furthermore, recent work by Santomartino et al. (18) highlighted the critical importance of evaluating algorithm robustness to clinical image variations, emphasizing that the discrepancy between research and clinical implementation lies not only in absolute accuracy but also in the ability to maintain performance consistency across a diversity of imaging conditions encountered in routine clinical practice—particularly those acquired at community hospitals rather than specialized pediatric centers.

Transformer architectures have recently been applied to bone-age assessment, marking a significant evolution in automated assessment methodologies. Zhang et al. (19) proposed bone-age estimation vision transformer (BAE-ViT), an efficient multimodal vision transformer that integrates image and sex information, which demonstrated improved robustness and accuracy on the Radiological Society of North America (RSNA) dataset through use of novel tokenization techniques for multimodal data fusion. Choi et al. (20) introduced a self-accumulative vision transformer for Sauvegrain-based elbow assessment, achieving strong performance with reduced model complexity through token replay and regional attention bias mechanisms. Additionally, Mao et al. (21) developed two-stage convolutional transformer pipelines that combine region of interest (ROI) detection with transformer-based assessment, reporting promising results in capturing spatial dependencies within skeletal structure. Although these transformer-based approaches represent significant improvements in accuracy and feature representation capabilities, they primarily focus on optimizing individual architectural components rather than explicitly modeling anatomical relationships between skeletal structures or incorporating developmental stage-specific processing strategies.

Automated bone-age assessments can be broadly categorized into three types: end-to-end approaches that directly process raw X-ray images without auxiliary information (22,23); methods that incorporate global auxiliary information (such as skeletal masks) for CNN training, as demonstrated by Iglovikov et al. (24); and those that train based on local regions, as exemplified by Son et al. (16), who performed assessment after localizing the key skeletal regions. However, the end-to-end approach struggles to highlight key growth regions that are clinically relevant, while methods based on local regions neglect the anatomical associations between skeletal regions. In recent years, graph neural networks (GNNs) have increasingly become a focal point in medical image analysis due to their advantages in processing structured data (25). As a variant of GNNs, graph transformers can more effectively capture complex relationships between nodes and global dependencies by incorporating self-attention mechanisms (26). In bone-age assessment, skeletal structures possess inherent graphical characteristics, with significant anatomical associations among different skeletal nodes that could be effectively modeled through use of these advanced architectures. A key challenge in bone-age assessment is the processing of skeletal feature differences across various developmental stages. This challenge is particularly prominent in clinical practice, as skeletal structures undergo significant morphological changes and density transitions from infancy to adolescence. Especially during critical developmental periods, such as the formation of ossification centers in carpal bones during early childhood and the fusion process of epiphyses and diaphyses in adolescence, accurately capturing these feature changes is crucial for precise assessment. 
Furthermore, traditional automated assessment systems lack adaptive recognition mechanisms for these stage-specific features, resulting in inconsistent assessment accuracy across different age groups (27). As shown in Figure 1, bones at different developmental stages exhibit markedly distinct characteristics, rendering the uniform processing strategy adopted by traditional methods inadequate for accommodating these variations (28).

Figure 1 Comparison of bone-age features in hand X-rays at different developmental stages.

To address these challenges, this paper proposes a novel bone-age assessment framework: developmental stage-aware graph transformer framework (DSGTF). This framework effectively captures the spatial relationships of skeletal structures by integrating CNNs with graph transformers. At its core, the framework features an innovative developmental stage-aware visual transformer (DS-ViT) fusion module that adaptively estimates the developmental stage of each sample and dynamically adjusts feature-processing strategies accordingly, effectively simulating how clinical experts modify their assessment approach across different developmental phases. Furthermore, the framework constructs a skeletal graph structure based on anatomical relationships, which more precisely models the spatial associations between skeletal nodes and thus overcomes the limitations of traditional methods in capturing these critical anatomical connections. Through this synergistic combination of advanced techniques, DSGTF significantly enhances bone-age assessment accuracy across the entire spectrum of developmental stages—from infancy to adolescence—providing a more reliable, efficient, and automated solution for clinical applications. We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-998/rc).


Methods

Dataset and study population

This study used the pediatric hand X-ray image dataset provided by the RSNA for experimental evaluation (17). This dataset is one of the largest publicly available bone-age assessment datasets, containing 12,611 left hand and wrist X-ray images, comprehensively covering developmental stages of children and adolescents ranging from 0 to 228 months (0–19 years). Each image was assessed and annotated with bone age by professional radiologists, and patient sex information is provided as important auxiliary data. The dataset has a relatively balanced sex distribution, with male samples accounting for 47.2% (5,952 cases) and female samples for 52.8% (6,659 cases), which helped the model learn skeletal development characteristics across different sexes.

For experimental validation, we followed the standard data partitioning protocol commonly used in bone-age assessment research (29). The dataset was randomly divided into training (n=8,827, 70%), validation (n=1,892, 15%), and test (n=1,892, 15%) sets, with a balanced sex distribution being maintained across all splits. Since the RSNA dataset contains only a single image per patient with no longitudinal follow-up data, there was no risk of patient-level data leakage between splits. Each image in the dataset corresponds to a unique patient identifier, and thus our random splitting procedure naturally prevented any form of data leakage while maintaining statistical independence across the training, validation, and test sets. For data annotation, we adopted the hand bone keypoint definition scheme proposed by Escobar et al., which identified 17 anatomically significant keypoints in hand X-rays, covering important areas such as the finger phalanges, metacarpal bones, and wrist bones (15). We converted these keypoint definitions into annotation box formats suitable for object detection tasks, providing the model with precise region localization information. We used a publicly available research dataset, in which all patient identifiers were previously anonymized by the original data providers in compliance with ethical standards. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Image preprocessing and key region detection

In our proposed method, image preprocessing and key region detection are crucial components of data processing. First, we designed an adaptive grayscale normalization method to enhance image quality, specifically tailored for X-ray images. This method automatically determines effective grayscale value intervals by analyzing the image’s grayscale histogram to identify meaningful intensity boundaries while excluding sparse outlier values that may result from imaging artifacts or background noise. Linear remapping is then performed with the following formula:

I′ = 255 × (I − I_low) / (I_up − I_low)

where I′ is the normalized intensity, I is the original intensity, and I_low and I_up represent the identified lower and upper bounds, respectively, after exclusion of outliers. This effectively enhances contrast by stretching the informative grayscale range and is followed by Gaussian filtering to reduce noise effects, thereby improving image contrast and clarity.
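An illustrative NumPy sketch of this adaptive normalization; the sparse-bin threshold (`sparse_frac`) is an assumed parameter, as the text does not specify how outlier bins are identified:

```python
import numpy as np

def normalize_grayscale(img, sparse_frac=0.001):
    """Adaptive grayscale normalization (illustrative sketch).

    Histogram bins containing fewer than `sparse_frac` of all pixels are
    treated as sparse outliers; the remaining intensity range is linearly
    stretched to [0, 255]. `sparse_frac` is a hypothetical threshold.
    """
    hist, edges = np.histogram(img, bins=256, range=(0, 255))
    valid = np.nonzero(hist >= sparse_frac * img.size)[0]
    i_low, i_up = edges[valid[0]], edges[valid[-1] + 1]
    out = 255.0 * (img.astype(np.float64) - i_low) / (i_up - i_low)
    return np.clip(out, 0, 255).astype(np.uint8)
```

In the full pipeline described above, this step would be followed by Gaussian filtering (e.g., with an OpenCV or SciPy blur) to suppress noise.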

Our detection method follows the key point definitions proposed by Escobar et al. and converts these key points into annotation box formats suitable for object detection tasks. During the image-preprocessing stage, we dynamically determined the detection box size for each key region based on the image width. We defined the baseline size as 8% of the image width, which served as the fundamental unit for region size calculation. For phalangeal regions, considering their structural characteristics, we set the detection box width to 1.5 times the baseline size and the height to the baseline size. For the carpal bone region, as it contains multiple bone structures, we used larger detection boxes with a width three times the baseline size and a height twice the baseline size. For the radius-ulna region, the detection box width was set to three times the baseline size and the height to 1.5 times. These region size settings were determined based on clinical standards and skeletal anatomical features. This adaptive sizing approach ensured that our detection framework can accommodate variations in image resolution and patient-specific anatomical differences, which is essential for robust performance across diverse clinical datasets. Furthermore, we also retained the bounding box detection for the entire hand region, forming a detection task with 18 categories (17 skeletal nodes and 1 whole-hand region).
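The sizing rules above can be summarized in a small helper; the region labels are illustrative, but the multipliers follow the text:

```python
def region_box_size(image_width, region_type):
    """Return (width, height) of the detection box for a region type.

    Baseline = 8% of the image width; multipliers as described above.
    Region names ("phalanx", "carpal", "radius_ulna") are illustrative.
    """
    base = 0.08 * image_width
    if region_type == "phalanx":
        return 1.5 * base, 1.0 * base   # elongated finger segments
    if region_type == "carpal":
        return 3.0 * base, 2.0 * base   # multiple bones in one box
    if region_type == "radius_ulna":
        return 3.0 * base, 1.5 * base   # wide distal forearm region
    return base, base                    # fallback for other regions
```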

For the automatic detection of these key regions, we implemented the YOLOv11 object detection model (30,31). The model was trained on the annotated RSNA dataset to automatically locate the 18 key regions in hand X-ray images. This automated detection approach eliminates the need for manual region selection, making the overall bone-age assessment process more efficient and reproducible. Figure 2 illustrates the preprocessing and key region detection process for hand X-rays, with Figure 2A showing the original X-ray images with different hand postures, Figure 2B displaying the effects of image preprocessing, and Figure 2C summarizing the automatic detection results of key skeletal regions (marked with red boxes) as well as key point markings for phalanges and carpal bones (green dots and connecting lines).

Figure 2 The complete preprocessing and key region detection pipeline for hand X-ray images. (A) Original X-ray images with different hand postures. (B) Results of image preprocessing. (C) Automatic detection results: key skeletal regions (red boxes) and key point markings for phalanges and carpal bones (green dots and connecting lines).

Feature extraction architecture

During the feature extraction phase, we processed 17 key point regions and 1 whole-hand region obtained from hand X-ray images based on object detection results. To effectively extract feature information from these regions, we employed two pre-trained CNN models that have demonstrated excellent performance in the field of computer vision. For the whole-hand region, we selected the Inception v3 network as the feature extractor, which excels in capturing image features at different levels through its innovative multiscale convolutional architecture (32). For the 17 key point regions, considering the balance between computational efficiency and feature expression capability, we used EfficientNet B1 for feature extraction (33). Both networks are initialized with ImageNet pre-trained parameters to ensure the reliability and effectiveness of feature extraction. This transfer learning approach allows the model to leverage general visual patterns learned from large-scale datasets, thereby improving performance on the specific task of bone-age assessment despite the relatively limited size of medical imaging datasets. The input images were resized to 299×299 pixels for Inception v3 and 240×240 pixels for EfficientNet B1 in accordance with their standard input requirements. Furthermore, given the importance of sex factors in bone-age assessment (5,6), we integrated sex information into the feature extraction network through a linear layer. This approach enables the model to comprehensively consider the influence of sex characteristics on bone age prediction, as male and female children often exhibit different patterns of skeletal development at the same chronological age.
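A minimal PyTorch sketch of how sex information can be folded into pooled backbone features through a linear layer, as described above; the embedding width (16) and the ReLU activations are our assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SexAwareFusion(nn.Module):
    """Sketch: concatenate a learned sex embedding with pooled CNN
    features and project back to the feature dimension. Dimensions
    (feat_dim=128, sex_dim=16) are illustrative assumptions."""

    def __init__(self, feat_dim=128, sex_dim=16):
        super().__init__()
        self.sex_embed = nn.Linear(1, sex_dim)          # sex as 0/1 scalar
        self.fuse = nn.Linear(feat_dim + sex_dim, feat_dim)

    def forward(self, cnn_feat, sex):
        # cnn_feat: (B, feat_dim) pooled features from Inception v3 or
        # EfficientNet B1; sex: (B, 1) binary indicator
        s = torch.relu(self.sex_embed(sex.float()))
        return torch.relu(self.fuse(torch.cat([cnn_feat, s], dim=1)))
```

In practice the `cnn_feat` input would come from the pre-trained backbones (299×299 inputs for Inception v3, 240×240 for EfficientNet B1) after global pooling.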

DSGTF

In the DSGTF framework, we incorporate two key processing modules: the bone age graph transformer module and DS-ViT fusion module. Together, they form the core algorithm for bone-age assessment. As shown in Figure 3, the overall architecture of DSGTF consists of four main components: an image-preprocessing and key-region-detection module, a feature extraction module, a bone age graph transformer module, and a DS-ViT fusion module. The following sections present the detailed design and implementation of these modules.

Figure 3 Overall architecture of a graph transformer framework with developmental stage awareness for bone-age assessment from X-ray images. *, weighted sum (expectation regression). DS-ViT, developmental stage-aware visual transformer; GAP, global average pooling.

Skeletal graph structure construction

The hand X-ray image is constructed as a structured undirected graph representation G = (F, A), with the feature matrix F being represented as follows:

F = [f_1, f_2, …, f_17] ∈ ℝ^(N×d)

F contains the feature representations of the 17 key regions (N=17), where each region is represented by a d-dimensional feature vector extracted from the CNN backbone networks, with d=128 for EfficientNet B1 features from individual regions. The adjacency matrix A ∈ {0, 1}^(N×N) describes the anatomical associations between regions, with A_{i,j} = 1 when an anatomical connection exists between regions i and j. The construction of the graph strictly follows the anatomical characteristics of hand bones. The graph structure incorporates the chain connections of the four fingers, where each phalanx is connected to its adjacent ones in the same finger to model the natural growth pattern along each digit. The thumb connections are specifically modeled to account for its unique anatomical structure distinct from the other four fingers. The connections in the wrist region capture the interrelations between carpal bones, which form a complex functional unit with highly correlated developmental patterns. Additionally, connections between finger bases and the wrist joint are established to model the crucial transitional area between the metacarpal bones and the carpal system. This anatomy-based graph structure design ensures that the model can effectively capture the structural relationships between hand bones.
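The anatomy-based adjacency matrix can be sketched as follows; the keypoint indexing below is hypothetical (the exact scheme of Escobar et al. is not reproduced here), but the edge groups mirror the four connection types described above:

```python
import numpy as np

# Hypothetical indexing of the 17 keypoints: four fingers with three
# phalanges each (0-11), thumb with two phalanges (12-13), two carpal
# nodes (14-15), and one radius-ulna node (16).
FINGER_CHAINS = [(0, 1), (1, 2), (3, 4), (4, 5),
                 (6, 7), (7, 8), (9, 10), (10, 11)]
THUMB = [(12, 13)]
WRIST = [(14, 15), (14, 16), (15, 16)]
BASE_TO_WRIST = [(2, 14), (5, 14), (8, 15), (11, 15), (13, 14)]

def build_adjacency(n=17):
    """Symmetric 0/1 adjacency matrix for the skeletal graph sketch."""
    A = np.zeros((n, n), dtype=np.int64)
    for i, j in FINGER_CHAINS + THUMB + WRIST + BASE_TO_WRIST:
        A[i, j] = A[j, i] = 1
    return A
```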

Bone-age graph transformer module

To better capture the positional information of nodes in the graph structure, position encoding vectors PE ∈ ℝ^(N×d) are calculated based on the graph Laplacian matrix L = D − A, where D is the degree matrix, and A is the adjacency matrix. The positional encoding is computed with the eigenvectors of the normalized Laplacian matrix, L_norm = D^(−1/2) L D^(−1/2). The k smallest nonzero eigenvalues and their corresponding eigenvectors are selected to form the positional encoding matrix. This graph structure-based position encoding method effectively preserves the topological features of the skeletal structure, providing important spatial information for subsequent feature learning.
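A compact NumPy sketch of this positional encoding, computing the normalized Laplacian and retaining eigenvectors of the k smallest nonzero eigenvalues (k and the zero-eigenvalue tolerance are assumed parameters):

```python
import numpy as np

def laplacian_pe(A, k=4, tol=1e-8):
    """Laplacian positional encoding sketch.

    Builds L_norm = D^(-1/2) (D - A) D^(-1/2) and returns the
    eigenvectors for the k smallest nonzero eigenvalues.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg, 1.0) ** -0.5
    L = np.diag(deg) - A
    L_norm = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_norm)       # eigenvalues in ascending order
    nonzero = np.nonzero(vals > tol)[0][:k]   # skip the trivial zero mode(s)
    return vecs[:, nonzero]                   # (N, k) encoding matrix
```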

In the graph transformer processing stage, two cascaded transformer convolution layers process the graph structure data, with the computation process for each layer represented as follows:

F^(k) = ReLU(BatchNorm(TransformerConv(F^(k−1), A)))

where F^(k) represents the output features of the kth layer, and F^(k−1) represents the input features from the previous layer. As illustrated in Figure 4, the TransformerConv module is equipped with a multihead attention mechanism, capable of learning feature relationships from different representational subspaces. The architecture incorporates hierarchical feature extraction with two transformer layers focusing on local and global feature patterns, respectively. Specifically, the first transformer layer focuses on learning feature representations of local anatomical structures, with each node primarily aggregating information from its directly connected neighbors. The second transformer layer emphasizes capturing global dependencies, allowing each node to receive information from second-order neighbors, thereby achieving feature integration across a broader range. After processing by both transformer layers, the node-level features from each layer are retained and concatenated to form enhanced node representations. This multilevel feature concatenation strategy accomplishes the fusion of “local-global” features, simultaneously preserving local structural information and global association features, enabling the model to comprehensively understand complex patterns of skeletal development.

Figure 4 Architecture of the bone age graph transformer module with multihead attention and hierarchical feature extraction.
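The neighborhood-restricted attention performed by each TransformerConv layer can be illustrated with a simplified single-head NumPy stand-in; BatchNorm, multiple heads, and learned per-layer parameters are omitted, and the weight matrices are assumed inputs:

```python
import numpy as np

def graph_attention_layer(F, A, Wq, Wk, Wv):
    """Simplified single-head stand-in for TransformerConv.

    Each node attends only to itself and its graph neighbours, so a
    single layer aggregates first-order neighbourhood information and a
    second stacked layer reaches second-order neighbours.
    """
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = (A + np.eye(A.shape[0])) > 0            # self-loop + neighbours
    scores = np.where(mask, scores, -np.inf)       # block non-edges
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)        # row-wise softmax
    return np.maximum(attn @ V, 0.0)               # ReLU activation
```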

DS-ViT fusion module

To integrate clinical bone-age assessment expertise into the model, we propose the DS-ViT fusion module. Inspired by the TW3 bone-age assessment standard, we designed this module to adaptively identify the developmental stage of a sample and optimize feature-processing strategies accordingly. Figure 5 illustrates the architectural design of the DS-ViT module. The module first applies independent transformation and normalization to the node features output by the bone age graph transformer, the whole-hand features, and sex features to ensure that they exist in the same feature space while incorporating positional encoding to preserve spatial information. Subsequently, the module estimates the probability distribution across three developmental stages—early, middle, and late—by analyzing statistical information derived from global features and node features as follows:

stage_weights = Softmax(f_stage([f_global, Mean(f_nodes)]))

Figure 5 Architecture of the developmental stage-aware ViT module with dynamic stage-based feature processing and weighted fusion. ViT, visual transformer.

where f_stage is a multilayer perceptron used to map the global features f_global and the mean of node features Mean(f_nodes) to a probability distribution across three developmental stages. An important design consideration is ensuring that the developmental stage-aware mechanism does not simply act as a proxy for direct age prediction, which would create circularity in the model. Our approach addresses this concern through several key architectural choices. First, the stage weights are derived from intermediate feature representations rather than from any explicit age-related predictions or the final model output. This ensures that stage estimation occurs independently within the feature space based on learned skeletal maturity patterns rather than on chronological age. Second, the three developmental stages are designed as broad maturity categories spanning wide chronological age ranges, which are intentionally much coarser than the month-level bone age predictions our model produces. This coarse-grained design ensures that the stage classifier captures general skeletal maturity patterns rather than specific chronological ages, thereby preventing the mechanism from functioning as a simple age predictor and maintaining focus on biologically meaningful developmental characteristics. Third, the soft assignment mechanism prevents discrete age-based partitioning, allowing smooth transitions for samples exhibiting characteristics of multiple developmental stages. The effectiveness of this architectural design was evaluated through ablation studies presented in the Results section.

The DS-ViT equips each developmental stage with dedicated feature transformation layers (called routers) and transformer processors. Routers map input features to representation spaces suitable for processing specific developmental stages, while transformer processors capture skeletal feature patterns of specific developmental stages through multihead attention mechanisms and feed-forward networks. The processing results from various stages are dynamically weighted and fused according to developmental stage weights as follows:

f_fused = ∑_{i=1}^{3} w_i · Transformer_i(f_router_i(f_input))

where w_i is the weight of the sample belonging to the ith developmental stage; f_input contains node features processed by the graph transformer, whole-hand features, and sex features; f_router_i is the router function for the ith stage; and Transformer_i is the transformer processor for the ith stage.
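The weighted fusion above can be sketched with plain callables standing in for the stage-specific routers and transformer processors (the real module uses learned layers; this only illustrates the soft-routing arithmetic):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def stage_aware_fusion(f_input, stage_logits, routers, processors):
    """DS-ViT fusion sketch: soft stage weights from a softmax, then a
    weighted sum of the three stage-specific processing branches.

    `routers` and `processors` are any callables standing in for the
    per-stage feature transformation layers and transformer processors.
    """
    w = softmax(stage_logits)                       # (3,) stage weights
    fused = sum(w[i] * processors[i](routers[i](f_input))
                for i in range(3))
    return fused, w
```

Because the weights are a soft distribution rather than a hard assignment, a sample near a stage boundary blends the outputs of adjacent branches.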

This dynamic routing mechanism enables the model to process skeletal features differently across various developmental stages, effectively addressing the diversity of skeletal development patterns from infancy to adolescence. From a clinical perspective, it simulates the thought process of radiologists adjusting assessment strategies based on bone-age developmental stages. When assessing an infant’s bone age, clinicians focus more on the appearance of ossification centers in carpal bones; when evaluating adolescents, they primarily observe the degree of fusion between epiphyses and diaphyses. Our model achieves optimal interpretation of bones at different developmental stages through this adaptive mechanism.

Multitask loss function

To further enhance the model’s performance and robustness, we carefully designed a multitask joint loss function:

L = w_1·L_mae + w_2·L_graph + w_3·L_ortho + w_4·L_kl

The loss function includes four complementary components. The first is L_mae, which represents the primary bone age prediction loss from the final model output. The second is L_graph, which is the auxiliary prediction loss from the first graph transformer layer output, providing additional supervision for intermediate feature learning. The third is L_ortho, which promotes the complementarity of different feature representations through feature orthogonal constraints, specifically measuring the correlation between pooled features from the first and second graph transformer layers. These two feature sets are expected to have low correlation, as they should capture different aspects of skeletal development. The constraint is defined as follows:

L_ortho = max(corr(F_1, F_2) − θ, 0)

where corr(F_1, F_2) is the correlation between pooled features from the first and second graph transformer layers, and θ is the correlation threshold. The fourth component is L_kl, which guides the model to learn more accurate bone age probability distributions through Kullback-Leibler (KL) divergence loss. The Gaussian assumption is justified by the clinical observation that radiologist disagreements typically follow normal distributions around the ground truth age. This can be expressed as follows:

L_{KL} = \mathrm{KL}(p \,\|\, G)

where p represents the bone age probability distribution predicted by the model, and G is a Gaussian distribution constructed based on the true bone age, which is calculated as follows:

G_j = \frac{1}{Z} \exp\!\left(-\frac{(j - y)^2}{2\sigma^2}\right), \quad j \in \{1, 2, \ldots, 240\}

where y is the true bone age, σ controls the width of the distribution, and Z is a normalization constant. The final hyperparameter values were determined through systematic validation as follows: w_1 = 0.6, w_2 = 0.1, w_3 = 0.2, and w_4 = 0.1, with σ = 1.0 for the Gaussian distribution width and θ = 0.3 for the correlation threshold. These values were selected through a grid search conducted exclusively on the validation set over approximately 50 iterations, ensuring that no test set contamination occurred during hyperparameter tuning. The component weights balance the different task objectives. This developmental stage-aware feature fusion strategy, combined with a multitask learning framework, allows DSGTF to effectively integrate multisource features from skeletal graph structures, global images, and sex information, enhancing the model’s expressive capability and prediction accuracy and thereby providing a more reliable technical solution for clinical bone-age assessment.
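Under the definitions above, the four-component loss can be sketched in PyTorch as follows; the tensor shapes, the cosine-style proxy used for corr(F_1, F_2), and all function names are our assumptions rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_age, aux_pred_age, true_age, feat1, feat2, logits,
                   w1=0.6, w2=0.1, w3=0.2, w4=0.1, theta=0.3, sigma=1.0,
                   n_bins=240):
    """Sketch of L = w1*L_MAE + w2*L_graph + w3*L_ortho + w4*L_KL.

    pred_age, aux_pred_age, true_age: (B,) ages in months
    feat1, feat2: (B, D) pooled features from the two graph transformer layers
    logits: (B, 240) unnormalized scores over monthly age bins
    """
    # (1) Primary prediction loss (MAE on the final output)
    l_mae = F.l1_loss(pred_age, true_age)

    # (2) Auxiliary loss on the first graph transformer layer's prediction
    l_graph = F.l1_loss(aux_pred_age, true_age)

    # (3) Orthogonality: hinge on a correlation proxy above threshold theta
    f1 = feat1 - feat1.mean(dim=0, keepdim=True)
    f2 = feat2 - feat2.mean(dim=0, keepdim=True)
    corr = (f1 * f2).sum() / (f1.norm() * f2.norm() + 1e-8)
    l_ortho = torch.clamp(corr.abs() - theta, min=0.0)

    # (4) KL divergence to a Gaussian target G centred on the true age
    bins = torch.arange(1, n_bins + 1, dtype=torch.float32)
    g = torch.exp(-(bins - true_age.unsqueeze(1)) ** 2 / (2 * sigma ** 2))
    g = g / g.sum(dim=1, keepdim=True)  # the 1/Z normalization
    l_kl = F.kl_div(F.log_softmax(logits, dim=1), g, reduction="batchmean")

    return w1 * l_mae + w2 * l_graph + w3 * l_ortho + w4 * l_kl
```

Note that `F.kl_div` expects log-probabilities for the prediction and plain probabilities for the Gaussian target, which matches the KL(p ‖ G) formulation above.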

Implementation details

In the experimental implementation phase, we built the DSGTF model on the PyTorch framework and trained it on a workstation equipped with a GeForce RTX 4090 GPU (Nvidia, Santa Clara, CA, USA). Training ran for 100 epochs with a batch size of 8, balancing computational efficiency and memory utilization. For optimization, we selected AdamW and applied gradient clipping to maintain training stability. To improve prediction accuracy, we constructed a multitask joint optimization framework combining bone-age prediction with graph structure feature learning. Meanwhile, targeting the characteristics of medical imaging data, we implemented a comprehensive data augmentation scheme, including random affine transformations (±15° rotation and ±10% translation), horizontal/vertical flips (50% and 30% probability, respectively), additional 90°/180° rotations (30% probability), and brightness/contrast adjustments (±20%). This simulated diverse image variations encountered in clinical environments, enhancing the model’s adaptability to different imaging conditions and potentially contributing to the robust performance observed in our evaluation.
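As a simplified stand-in for the augmentation scheme above, the NumPy sketch below applies the flips, extra rotations, and ±20% brightness/contrast adjustments with the stated probabilities; in practice a pipeline would more likely use torchvision's RandomAffine and ColorJitter, and the ±15°/±10% affine step is omitted here for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Apply a subset of the augmentation scheme to a 2-D grayscale array.
    Probabilities and ranges mirror the text; the implementation itself
    is an illustrative assumption."""
    out = image.astype(np.float32)
    # Horizontal flip (p=0.5) and vertical flip (p=0.3)
    if rng.random() < 0.5:
        out = out[:, ::-1]
    if rng.random() < 0.3:
        out = out[::-1, :]
    # Additional 90°/180° rotation (p=0.3)
    if rng.random() < 0.3:
        out = np.rot90(out, k=int(rng.choice([1, 2])))
    # Brightness/contrast adjustment within ±20%
    brightness = rng.uniform(0.8, 1.2)
    contrast = rng.uniform(0.8, 1.2)
    mean = out.mean()
    out = (out - mean) * contrast + mean * brightness
    return np.clip(out, 0, 255)
```

Applying this on the fly during training exposes the model to a different variant of each radiograph every epoch.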

Computational efficiency

We conducted performance profiling on the test set with a GeForce RTX 4090 GPU to evaluate clinical feasibility. The complete inference pipeline consisted of two sequential stages: first, YOLOv11-based region detection for identifying 17 skeletal key points and the whole-hand region; second, subsequent feature extraction and bone age prediction through our graph transformer and DS-ViT fusion modules. According to testing with 100 randomly selected test samples, we found that the region detection stage requires approximately 0.1 seconds per image, while the feature extraction and prediction stage averages 0.015 seconds, resulting in a total end-to-end inference time of approximately 0.115 seconds per image. Although the two-stage architecture introduces additional computational steps compared to end-to-end approaches, the total processing time of under 0.2 seconds per image remains well within clinically acceptable bounds for routine bone-age assessment workflows. Future work will explore model optimization techniques such as knowledge distillation and quantization to further enhance computational efficiency for deployment in resource-constrained clinical environments.
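The per-image latencies above can be measured with a simple averaging harness of the following form; the function name, warm-up count, and sample handling are our assumptions:

```python
import time

def avg_inference_time(fn, samples, warmup=10):
    """Average per-sample latency of `fn` in seconds, with a short warm-up
    pass to exclude one-time initialization costs from the measurement."""
    for s in samples[:warmup]:
        fn(s)                      # warm-up: not timed
    start = time.perf_counter()
    for s in samples:
        fn(s)
    return (time.perf_counter() - start) / len(samples)
```

Applied separately to the detection stage and the feature extraction/prediction stage over the 100 test samples, the two averages sum to the end-to-end figure reported above.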

Evaluation metrics

In this study, the performance evaluation of bone-age prediction primarily used the mean absolute error (MAE), which is calculated as follows:

\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|

where \hat{y}_i is the predicted bone age for the ith sample, y_i is its true bone age, and N is the total number of samples. A smaller MAE indicates higher prediction accuracy. Additionally, for precision evaluation of key region detection, we adopted mean average precision (mAP), specifically two metrics: the average precision at an intersection over union (IoU) threshold of 0.50 (mAP@50) and the average precision at IoU thresholds from 0.50 to 0.95 with a step size of 0.05 (mAP@50–95). mAP@50 reflects the detection capability of the model under relatively lenient conditions, while mAP@50–95 provides a comprehensive assessment of detection performance under varying degrees of strictness. Together, these metrics reflect the overall performance of the proposed model in bone-age prediction and key region detection, ensuring the comprehensiveness and reliability of the experimental results.
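Both quantities follow directly from their definitions; the helper names below are ours, and the IoU function (the building block underlying mAP) takes boxes as (x1, y1, x2, y2) corner coordinates:

```python
import numpy as np

def mae(pred, true):
    """Mean absolute error between predicted and true bone ages (months)."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, `mae([10, 20], [12, 18])` evaluates to 2.0, and a predicted box counts toward mAP@50 when its IoU with the ground-truth box is at least 0.50.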


Results

We evaluated our DSGTF framework using the metrics defined in the Methods section, with a primary focus on MAE for bone-age prediction accuracy and mAP for region detection performance. Recent works have successfully applied vision transformers and other advanced architectures to bone-age assessment, yet our framework’s distinctive contribution lies in the integration of anatomically informed graph neural networks with a developmental stage-aware fusion mechanism. This enables the explicit modeling of skeletal structural relationships and adaptive processing of features based on developmental maturity patterns.

Detection performance of key regions

To evaluate the performance of key region detection, we conducted a comprehensive assessment of the YOLOv11 model on the test set. Table 1 shows the detection performance of the model across five major region types. Overall, the model achieved excellent detection results in most regions, with an average mAP@50 of 99.3% and an mAP@50–95 of 89.8%, indicating that the model can locate key skeletal regions in hand X-rays with high precision.

Table 1

Detection performance of key regions in hand X-ray images

Region type mAP@50 (%) mAP@50–95 (%)
Whole-hand region 99.5 98.4
Carpal bone region 99.4 89.1
Radius-ulna region 99.4 86.5
Thumb region 99.2 85.9
Other phalanges 99.2 89.5
Average 99.3 89.8

mAP@50 denotes the average precision at an IoU threshold of 0.50, while mAP@50–95 represents the average precision across IoU thresholds from 0.50 to 0.95 with a step size of 0.05. IoU, intersection over union; mAP, mean average precision.

The detection results demonstrate that our model performs consistently well across all region types, with particularly high precision in the whole-hand region detection. The slightly lower mAP@50–95 values for specific bone regions compared to their corresponding mAP@50 values indicate that while the model can accurately locate these regions, there is some variability in the precision of bounding box placement. This is expected given the anatomical variations across different participants and age groups. Nevertheless, the high detection performance across all regions provides a solid foundation for subsequent feature extraction and bone-age assessment.

Bone-age assessment performance

Comparison with state-of-the-art methods

Table 2 shows the performance comparison of our DSGTF model with that of other methods on the RSNA dataset. Our proposed DSGTF model achieved a mean MAE of 4.82 months on the test set.

Table 2

Comparison of different bone-age assessment methods

Author [year] Extra labels MAE Method details
Spampinato et al. [2017] (13) No 9.5 Deep CNN on raw X-ray images
Larson et al. [2018] (14) No 6.24 Optimized deep residual network
Wu et al. [2019] (34) Hand mask 7.38 Mask R-CNN + residual attention
Li et al. [2022] (35) Hand mask 6.20 Unsupervised region localization
Zhang et al. [2020] (19) Hand mask 5.49 Multiregion CNN ensemble
Ren et al. [2018] (36) Hand mask 5.20 Regression CNN architecture
Iglovikov et al. [2018] (24) Hand mask + ROI 4.97 Combined segmentation approach
Son et al. [2019] (16) ROI 5.52 TW3-based ROI processing
Reddy et al. [2020] (37) ROI 5.10 Index finger-focused CNN
Proposed ROI 4.82 Graph transformer + stage awareness

CNN, convolutional neural network; MAE, mean absolute error; ROI, region of interest.

It is important to acknowledge that recent works have achieved superior MAE performance, with Yang et al. (38) reporting 3.88 months and Rassmann et al. (23) 3.87 months on the RSNA dataset. Although our DSGTF framework’s MAE of 4.82 months reflects a different performance tier, our work addresses complementary research objectives focused on interpretability, anatomical reasoning, and clinical alignment. Importantly, our results approach the interradiologist agreement baseline of approximately 4 months established by Halabi et al. (17), indicating that our method achieves clinically acceptable accuracy while providing additional benefits in terms of explainability and developmental stage-aware processing.

Performance across different age groups and sex

To provide a more detailed analysis of the DSGTF model’s performance, we evaluated its accuracy across different age groups and sexes. Table 3 lists in detail the model’s performance across different age groups and sex populations.

Table 3

Bone-age assessment performance across different age groups and sexes

Age group Male MAE Female MAE Overall MAE
Infant (≥0 to <3 years) 4.77 3.39 4.10
Early childhood (≥3 to <8 years) 6.03 5.07 5.45
Middle childhood (≥8 to <13 years) 5.85 5.67 5.76
Adolescent (≥13 years) 3.63 4.85 3.95
Overall (all ages) 5.07 4.74 4.82

MAE, mean absolute error.

Experimental results showed that the model achieved acceptable accuracy levels across all age ranges and sex groups, with the MAE ranging from 3.39 to 6.03 months. The model performed particularly well in the infant and adolescent age groups, with overall MAEs of 4.10 and 3.95 months, respectively. The relatively higher errors in early and middle childhood reflect the inherent challenges in bone-age assessment during these transitional developmental periods, which is consistent with the known variability in skeletal maturation patterns across different childhood stages (39).

Figure 6 provides supplementary analysis of the DSGTF model’s performance through the error distribution across the different age groups. The boxplot reveals that the infant and adolescent groups not only have lower median errors but also smaller error variance compared to the childhood groups, indicating more stable prediction performance for these age ranges. Figure 7 presents a scatter plot comparing predicted bone ages against ground truth values. The plot indicates a strong linear relationship between predicted and true bone ages, with points clustered closely around the diagonal line across the entire age spectrum. This pattern confirms the overall high accuracy of the model regardless of sex, as indicated by the consistent distribution of both male and female data points.

Figure 6 Absolute error distribution by age group. Box plot analysis showing the distribution of prediction errors across different developmental stages. Lower median values and narrower interquartile ranges indicate more accurate and consistent predictions.
Figure 7 Scatter plot of predicted versus ground truth bone age. Comparison between predicted bone ages (y-axis, in months) and ground truth bone ages (x-axis, in months) across the entire dataset, with male samples shown in blue and female samples in red. The proximity of points to the diagonal line (perfect prediction) demonstrates the model’s accuracy across different sexes and age ranges.

Ablation studies

To systematically evaluate the contribution of key components in the DSGTF framework and their synergistic effects, we designed and executed a series of ablation experiments. These experiments analyzed three aspects: graph structure design, feature fusion strategy, and multitask loss function.

First, we examined the impact of graph structure design on model performance. As shown in Table 4, when the graph transformer module was completely removed, the “no graph structure” approach relying solely on CNN feature extraction produced an MAE of 5.83 months. Introducing a fully connected graph structure improved performance to 5.49 months but still fell short of our anatomy-based graph structure design, which achieved an MAE of 4.82 months. This result emphasizes the importance of precisely modeling skeletal anatomical associations: although a fully connected graph can capture global relationships, it lacks an accurate expression of the specific anatomical connections between bones.

Table 4

Ablation results of graph structure design

Graph structure Male MAE Female MAE Average MAE
No graph 5.74 5.92 5.83
Fully connected 5.37 5.63 5.49
Anatomical graph (proposed) 5.07 4.74 4.82

MAE, mean absolute error.

Second, we analyzed the impact of feature fusion strategies on model performance, with the results summarized in Table 5. The simple concatenation strategy, which directly joins features from different sources, produced an MAE of 5.51 months; single-stage processing, which encodes all samples through a unified transformer, improved the MAE to 5.25 months; and hard stage classification, which employs an explicit classifier for discrete routing, yielded an MAE of 5.44 months. In comparison, our proposed DS-ViT fusion strategy achieved the best MAE of 4.82 months, reducing the error by 0.69 months compared to the simple concatenation method. This indicates that the developmental stage-aware mechanism can effectively identify a sample’s developmental stage and dynamically adjust processing strategies, providing targeted processing for bones with different developmental characteristics.

Table 5

Ablation results of different feature fusion strategies

Fusion strategy Description Male MAE Female MAE Average MAE
Simple concatenation Direct feature concatenation without attention 5.28 5.79 5.51
Single-stage processing Unified transformer encoding for all samples 5.27 5.24 5.25
Hard stage classification Explicit stage classifier with discrete routing 5.45 5.46 5.44
DS-ViT fusion (proposed) Adaptive developmental stage awareness 5.07 4.74 4.82

DS-ViT, developmental stage-aware visual transformer; MAE, mean absolute error.

Finally, we analyzed the contribution of the multitask loss function by progressively adding loss components, as shown in Table 6. When using only the basic MAE loss, the model achieved an MAE of 5.73 months; after introduction of the graph feature-learning loss, the MAE significantly decreased to 5.42 months; further addition of the feature orthogonal loss reduced the MAE to 5.18 months; and finally, integration of the KL divergence loss achieved the optimal performance of 4.82 months. This gradual improvement process verifies the effectiveness of the multitask learning framework, with different loss components working synergistically to encourage the model to learn more discriminative feature representations, thereby enhancing the accuracy of bone-age prediction.

Table 6

Ablation results of multitask loss function

Configuration Loss function formula Male MAE Female MAE Average MAE
Base L = w_1 L_MAE 5.51 5.99 5.73
Stage 1 L = w_1 L_MAE + w_2 L_graph 5.45 5.37 5.42
Stage 2 L = w_1 L_MAE + w_2 L_graph + w_3 L_ortho 5.32 5.04 5.18
Stage 3 L = w_1 L_MAE + w_2 L_graph + w_3 L_ortho + w_4 L_KL 5.07 4.74 4.82

MAE, mean absolute error.

The ablation experiments collectively demonstrate that each component of our proposed DSGTF framework contributes meaningfully to the overall performance. The anatomical graph structure effectively captures the spatial relationships between skeletal regions, the DS-ViT fusion module successfully adapts to different developmental stages, and the multitask loss function guides the model to learn more discriminative features. These components work together synergistically, resulting in a robust and accurate bone-age assessment system.

Analysis of developmental stage-aware mechanism

To gain deeper insight into how the DS-ViT module simulates clinical experts’ assessment strategies across different developmental stages, we conducted a visualization analysis of the attention distribution the model learned for skeletal regions. Figure 8 presents the relative attention distribution of the early, middle, and late developmental stage processors across 17 skeletal regions. These data are based on the average attention values from 80 test samples and were normalized within each developmental stage to facilitate intuitive comparison of focus points within each stage. The analysis revealed that the three developmental stage processors exhibit distinctly differentiated skeletal attention patterns. The early-stage processor primarily focuses on the wrist region group, which receives significantly higher attention than do the other regions, while finger bone regions receive relatively lower attention. The middle-stage processor presents a more balanced attention distribution, with focus gradually transitioning from the wrist to finger regions, particularly with the distal and middle phalanges receiving higher attention. The late-stage processor shows increased emphasis on the proximal phalanges region group, with the distal phalanx of the thumb and the middle phalanx of the ring finger receiving the highest attention values. Importantly, the radius-ulna region continues to receive substantial attention, which is consistent with the clinical understanding that these bones remain important assessment targets as they are among the last to complete fusion. This attention distribution demonstrates that our model captures the multifocal nature of late-stage skeletal assessment, with attention being appropriately distributed across several clinically relevant anatomical regions.

Figure 8 Relative importance of different bone regions across developmental stages (early, middle, and late) as learned by the DS-ViT module. DS-ViT, developmental stage-aware visual transformer.

Robustness evaluation

To assess the clinical applicability of our DSGTF model under varying imaging conditions, we conducted a robustness evaluation following the computational stress testing methodology proposed by Santomartino et al. (18). We evaluated the model’s performance on 500 randomly selected test samples under 16 different image transformations, including geometric transformations (rotation and flipping), photometric adjustments (brightness and contrast), pixel inversion, and resolution changes. Table 7 presents the detailed robustness evaluation results. The DSGTF model demonstrated notable stability, with 11 out of 16 transformations showing no statistically significant performance degradation (P>0.05). Geometric transformations resulted in minimal MAE increases (<0.43 months), while moderate photometric changes produced MAE differences within 0.21 months. For consistency evaluation, we employed a 6-month threshold due to the RSNA dataset’s lack of chronological age information, a limitation also encountered by Santomartino et al. in their evaluation on the same dataset. The model maintained high prediction consistency rates (>90%) for most transformations, with the most challenging being pixel inversion and severe resolution reduction, which represent extreme conditions unlikely to occur in routine practice.

Table 7

Robustness evaluation results of DSGTF model under image transformations

Transformation MAE difference (months) Consistency rate (%) Statistical significance
Rotation 90° 0.107 91.6 No
Rotation 180° 0.008 96.0 No
Rotation 270° 0.428 89.6 Yes
Flip horizontal −0.022 96.4 No
Flip vertical 0.127 95.2 No
Brightness 0.5× 0.207 98.0 No
Brightness 0.7× 0.090 99.6 No
Brightness 1.3× 0.145 93.8 No
Brightness 1.5× 2.455 80.8 Yes
Contrast 0.5× 0.061 97.2 No
Contrast 1.3× −0.044 97.0 Yes
Contrast 1.5× 0.050 93.2 No
Pixel inversion 18.004 24.4 Yes
Resolution 256×256 0.205 97.8 Yes
Resolution 128×128 5.144 45.8 Yes

The consistency rate is defined as the percentage of predictions differing by ≤6 months between original and transformed images, with statistical significance determined via the Wilcoxon signed-rank test (P<0.05). DSGTF, developmental stage-aware graph transformer framework; MAE, mean absolute error.
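The consistency rate and significance test described in the table note can be computed as follows; the function and variable names are illustrative, not taken from the study's code:

```python
import numpy as np
from scipy.stats import wilcoxon

def robustness_stats(pred_orig, pred_trans, threshold=6.0):
    """Consistency rate (% of paired predictions differing by <= threshold
    months) and Wilcoxon signed-rank p-value between predictions on the
    original and transformed images."""
    pred_orig = np.asarray(pred_orig, dtype=float)
    pred_trans = np.asarray(pred_trans, dtype=float)
    diff = np.abs(pred_trans - pred_orig)
    consistency = 100.0 * np.mean(diff <= threshold)
    # Paired nonparametric test: does the transformation shift predictions?
    _, p_value = wilcoxon(pred_orig, pred_trans)
    return consistency, p_value
```

A transformation is flagged as significant when the returned p-value falls below 0.05, matching the criterion used in Table 7.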


Discussion

This study developed a novel deep learning framework for automated bone-age assessment that addresses several key challenges in the field. Our DSGTF framework achieved an MAE of 4.82 months on the RSNA dataset. To properly contextualize these results, it is important to examine how our approach compares with existing methodological approaches while acknowledging the broader landscape of recent achievements in this field.

Recent research has achieved notable advances in bone-age assessment accuracy. Methods proposed by Yang et al. (38) and Rassmann et al. (23) achieved MAEs of 3.88 and 3.87 months, respectively, through optimized end-to-end learning. Transformer-based approaches have also shown strong results, with BAE-ViT (19) achieving 4.1 months through efficient multimodal fusion, self-accumulative ViT (20) demonstrating effectiveness for Sauvegrain-based assessment, and two-stage transformer pipelines (21) reaching 5.2 months on RSNA data. Our DSGTF framework achieved 4.82 months MAE, which approaches the interradiologist agreement baseline of approximately 4 months (17), indicating clinically acceptable accuracy. Importantly, our framework’s distinctive contribution lies in explicitly modeling anatomical relationships through graph neural networks while incorporating developmental stage-aware processing. This design provides enhanced interpretability and anatomical reasoning capabilities that complement pure accuracy optimization, addressing critical needs for building clinical trust and enabling transparent decision-making in medical AI systems.

Among the methods evaluated in our comparative analysis, approaches using only original X-ray images have shown progressive improvement over time. Early deep learning approaches such as Spampinato et al. (13) achieved an MAE of 9.5 months, while more sophisticated architectures such as that by Larson et al. (14) reduced this to 6.24 months through optimized deep residual network designs. The introduction of auxiliary information has provided variable results, with methods incorporating hand masks achieving MAEs ranging from 7.38 months (34) to 5.20 months (36). Li et al. (35) also employed unsupervised region localization approaches combined with deep learning architectures, achieving an MAE of 6.20 months on the RSNA dataset. Methods using ROI annotations have generally achieved better performance; for instance, Son et al. (16) and Reddy et al. (37) reported MAEs of 5.52 and 5.10 months, respectively. Within this comparative framework, our DSGTF model achieved an MAE of 4.82 months, demonstrating competitive performance among the ROI-based approaches.

The primary contribution of our DSGTF framework lies not solely in achieving the lowest possible MAE but in providing a system that combines competitive accuracy with enhanced interpretability, anatomical reasoning, and developmental stage awareness. This represents a complementary research direction that prioritizes clinical alignment and explainability alongside predictive performance. The explicit modeling of anatomical relationships through our graph structure and the developmental stage-aware attention patterns provide insights into the model’s decision-making process that are crucial for building clinician trust and promoting clinical adoption. Methods achieving lower MAEs, while impressive in terms of pure accuracy, may not necessarily provide the same level of clinical interpretability that our framework offers through its anatomically informed design and stage-aware processing mechanisms. Compared to recent ViT-based methods that focus on optimizing multimodal fusion for accuracy improvement, our graph-based approach provides complementary advantages in anatomical reasoning and developmental stage awareness. Methods such as BAE-ViT efficiently integrate image and sex information through token-based fusion, while our framework explicitly models the spatial relationships between skeletal structures through anatomically informed graph connections. This architectural choice enables the model to capture domain-specific knowledge about bone development patterns and their anatomical dependencies, providing interpretability that is valuable for clinical validation and adoption.

The improvement in assessment accuracy can be attributed to the key innovations of our approach. The introduction of an anatomical graph structure for modeling spatial relationships between skeletal regions proves more effective than either omitting the graph structure or using a fully connected graph. This result suggests that explicitly incorporating domain knowledge of anatomical connections provides valuable constraints that help the model focus on biologically meaningful relationships between bone regions. The 1.01-month reduction in MAE from 5.83 to 4.82 months achieved by incorporating the anatomical graph structure highlights the importance of domain-specific structural modeling in medical image analysis. Beyond structural modeling, the developmental stage-aware mechanism significantly enhances the model’s adaptability to different skeletal maturity levels. The attention pattern analysis revealed that our DS-ViT module successfully mimics the clinical reasoning process of radiologists by dynamically adjusting its focus across different bone regions based on developmental stage. The 0.69-month error reduction compared to simple feature concatenation demonstrates the practical value of this adaptive processing approach. Additionally, the multitask learning framework with complementary loss components guides the model to learn more discriminative feature representations. The progressive improvement in performance with the addition of each loss component confirms that diverse supervisory signals are beneficial for this complex regression task. The graph feature-learning loss and the orthogonality constraint help the model extract more informative features from the graph structure, while the KL divergence loss provides a more nuanced prediction of the bone-age probability distribution.

The model’s performance across different age groups reveals interesting patterns that align with clinical observations. The relatively higher accuracy in infant and adolescent groups compared to childhood periods reflects the complex nature of skeletal development during middle childhood. During this transitional phase, although individual variability is higher, the bone features themselves may be less distinct or in intermediate states of development, making precise age determination more challenging for both human experts and computational models. This finding suggests that future research might benefit from more fine-grained stage-specific processing strategies for these challenging age ranges. The observed sex-based performance differences (males: 5.07 months overall; females: 4.74 months overall) may reflect the different skeletal maturation timelines between sexes and potential representation imbalances in specific age-sex combinations during critical developmental periods. Future work should address these fairness considerations through sex-stratified validation and implementation of fairness constraints to ensure equitable performance across demographic groups.

To properly contextualize our 4.82-month MAE achievement, it is important to consider the fundamental limitations inherent in bone-age assessment. As established by Halabi et al., interradiologist disagreement typically ranges around 4 months or more, representing a theoretical lower bound for AI systems relying on human annotations as the ground truth. Our achieved MAE approaches this theoretical limit, indicating that our method achieves clinically acceptable accuracy while providing additional benefits in terms of explainability and stage-aware processing. Additionally, our robustness evaluation, based on the computational stress testing approach proposed by Santomartino et al., provides insights into the DSGTF framework’s behavior under clinical image variations. The framework maintained stable performance across most clinically relevant transformations, with 11 out of 16 tested transformations showing no statistically significant performance degradation. Notably, our evaluation methodology was constrained by the same dataset limitations encountered by Santomartino et al.—the absence of chronological age information in the RSNA dataset prevented calculation of clinically significant error rates, necessitating our adoption of a prediction-based consistency metric using a 6-month threshold. This threshold provides a clinically relevant assessment of prediction stability, as variations within this range are generally considered acceptable in clinical practice.

These results suggest that the methodological innovations in our approach—particularly the developmental stage-aware processing mechanism and anatomically informed graph structure—may contribute to both accuracy improvements and prediction stability under image variation.

Although our study produced notable advances, several limitations should be acknowledged. Our framework comprises two separate processing stages: key region detection and bone-age regression. This two-stage approach, while effective, may not achieve optimal parameter sharing and computational efficiency. Furthermore, our experiments were conducted on a single dataset (RSNA), and cross-dataset validation would be beneficial for assessing generalizability across different imaging protocols and patient populations. Although our analysis of developmental stage-aware attention patterns demonstrated that the model learns differentiated focus strategies aligned with general clinical knowledge, it was limited by the absence of a systematic radiologist review to validate whether the learned attention patterns truly correspond to clinical expert priorities. Future work should include structured expert validation studies to strengthen our interpretability claims and refine the developmental stage-aware mechanism. Evaluation through standardized competition platforms such as Kaggle could provide additional validation of our approach across diverse datasets and evaluation conditions, complementing our current detailed analysis on the RSNA benchmark dataset.

Nevertheless, the DSGTF framework represents a substantial step forward in automated bone-age assessment, demonstrating how the integration of anatomical domain knowledge with advanced deep learning architectures can improve performance in critical medical image analysis tasks.


Conclusions

We developed DSGTF for automated bone-age assessment. By combining anatomical graph modeling with developmental stage-aware processing, the framework achieved an MAE of 4.82 months on the RSNA dataset, approaching the interradiologist agreement baseline while providing enhanced clinical interpretability. The developmental stage-aware mechanism successfully adapts attention patterns across skeletal regions based on maturity level, demonstrating clinical relevance. Robustness evaluation confirmed stable performance across a range of image variations. Future research will focus on end-to-end framework integration and extensive validation on diverse clinical datasets to enhance system efficiency and ensure the model’s suitability for widespread clinical deployment.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-998/rc

Funding: This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) - Innovative Human Resource Development for Local Intellectualization program grant funded by the Korea government (MSIT) (No. IITP-2026-RS-2022-00156334, contribution rate: 70%) and Liaoning Provincial Department of Science and Technology Plan Project - General Project (No. 2025-MS-141).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-998/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Giordano D, Spampinato C, Scarciofalo G, Leonardi R. An automatic system for skeletal bone age measurement by robust processing of carpal and epiphysial/metaphysial bones. IEEE Transactions on Instrumentation and Measurement 2010;59:2539-53.
  2. Krakow D, Rimoin DL. The skeletal dysplasias. Genet Med 2010;12:327-41. [Crossref] [PubMed]
  3. Warman ML, Cormier-Daire V, Hall C, Krakow D, Lachman R, LeMerrer M, Mortier G, Mundlos S, Nishimura G, Rimoin DL, Robertson S, Savarirayan R, Sillence D, Spranger J, Unger S, Zabel B, Superti-Furga A. Nosology and classification of genetic skeletal disorders: 2010 revision. Am J Med Genet A 2011;155A:943-68. [Crossref] [PubMed]
  4. Albanese A, Stanhope R. Investigation of delayed puberty. Clin Endocrinol (Oxf) 1995;43:105-10. [Crossref] [PubMed]
  5. Greulich WW, Pyle SI. Radiographic atlas of skeletal development of the hand and wrist. The American Journal of the Medical Sciences 1959;238:393.
  6. Tanner JM. Assessment of skeletal maturity and prediction of adult height (TW2 Method). Academic Press; 1983:50-106.
  7. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88. [Crossref] [PubMed]
  8. Thodberg HH, Kreiborg S, Juul A, Pedersen KD. The BoneXpert method for automated determination of skeletal maturity. IEEE Trans Med Imaging 2009;28:52-66. [Crossref] [PubMed]
  9. Somkantha K, Theera-Umpon N, Auephanwiriyakul S. Bone age assessment in young children using automatic carpal bone feature extraction and support vector regression. J Digit Imaging 2011;24:1044-58. [Crossref] [PubMed]
  10. Rayed ME, Islam SS, Niha SI, Jim JR, Kabir MM, Mridha MF. Deep learning for medical image segmentation: State-of-the-art advancements and challenges. Informatics in Medicine Unlocked 2024;47:101504.
  11. Wang X, Zhou B, Gong P, Zhang T, Mo Y, Tang J, Shi X, Wang J, Yuan X, Bai F, Wang L, Xu Q, Tian Y, Ha Q, Huang C, Yu Y, Wang L. Artificial Intelligence-Assisted Bone Age Assessment to Improve the Accuracy and Consistency of Physicians With Different Levels of Experience. Front Pediatr 2022;10:818061. [Crossref] [PubMed]
  12. Liu Y, Ouyang L, Wu W, Zhou X, Huang K, Wang Z, Song C, Chen Q, Su Z, Zheng R, Wei Y, Lu W, Wu W, Liu Y, Yan Z, Wu Z, Fan J, Zhou M, Fu J. Validation of an established TW3 artificial intelligence bone age assessment system: a prospective, multicenter, confirmatory study. Quant Imaging Med Surg 2024;14:144-59. [Crossref] [PubMed]
  13. Spampinato C, Palazzo S, Giordano D, Aldinucci M, Leonardi R. Deep learning for automated skeletal bone age assessment in X-ray images. Med Image Anal 2017;36:41-51. [Crossref] [PubMed]
  14. Larson DB, Chen MC, Lungren MP, Halabi SS, Stence NV, Langlotz CP. Performance of a Deep-Learning Neural Network Model in Assessing Skeletal Maturity on Pediatric Hand Radiographs. Radiology 2018;287:313-22. [Crossref] [PubMed]
  15. Escobar M, González C, Torres F, Daza L, Triana G, Arbeláez P. Hand pose estimation for pediatric bone age assessment. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22 Springer International Publishing; 2019:531-9.
  16. Son SJ, Song Y, Kim N, Do Y, Kwak N, Lee MS, Lee BD. TW3-based fully automated bone age assessment system using deep neural networks. IEEE Access 2019;7:33346-58.
  17. Halabi SS, Prevedello LM, Kalpathy-Cramer J, Mamonov AB, Bilbily A, Cicero M, Pan I, Pereira LA, Sousa RT, Abdala N, Kitamura FC, Thodberg HH, Chen L, Shih G, Andriole K, Kohli MD, Erickson BJ, Flanders AE. The RSNA Pediatric Bone Age Machine Learning Challenge. Radiology 2019;290:498-503. [Crossref] [PubMed]
  18. Santomartino SM, Putman K, Beheshtian E, Parekh VS, Yi PH. Evaluating the Robustness of a Deep Learning Bone Age Algorithm to Clinical Image Variation Using Computational Stress Testing. Radiol Artif Intell 2024;6:e230240. [Crossref] [PubMed]
  19. Zhang J, Chen W, Joshi T, Zhang X, Loh PL, Jog V, Bruce RJ, Garrett JW, McMillan AB. BAE-ViT: An Efficient Multimodal Vision Transformer for Bone Age Estimation. Tomography 2024;10:2058-72. [Crossref] [PubMed]
  20. Choi HJ, Na D, Cho K, Bae B, Kong ST, An H. Self-Accumulative Vision Transformer for Bone Age Assessment Using the Sauvegrain Method. In: European Conference on Computer Vision 2024 Sep 29. Cham: Springer Nature Switzerland; 2024:160-76.
  21. Mao X, Hui Q, Zhu S, Du W, Qiu C, Ouyang X, Kong D. Automated Skeletal Bone Age Assessment with Two-Stage Convolutional Transformer Network Based on X-ray Images. Diagnostics (Basel) 2023;13:1837. [Crossref] [PubMed]
  22. Cheng CF, Huang ET, Kuo JT, Liao KY, Tsai FJ. Report of clinical bone age assessment using deep learning for an Asian population in Taiwan. Biomedicine (Taipei) 2021;11:50-8. [Crossref] [PubMed]
  23. Rassmann S, Keller A, Skaf K, Hustinx A, Gausche R, Ibarra-Arrelano MA, Hsieh TC, Madajieu YED, Nöthen MM, Pfäffle R, Attenberger UI, Born M, Mohnike K, Krawitz PM, Javanmardi B. Deeplasia: deep learning for bone age assessment validated on skeletal dysplasias. Pediatr Radiol 2024;54:82-95. [Crossref] [PubMed]
  24. Iglovikov VI, Rakhlin A, Kalinin AA, Shvets AA. Paediatric bone age assessment using deep convolutional neural networks. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing. 2018:300-8.
  25. Du H, Wang J, Hui J, Zhang L, Wang H. DenseGNN: universal and scalable deeper graph neural networks for high-performance property prediction in crystals and molecules. npj Computational Materials. 2024;10:292.
  26. Gao M, Zhang D, Chen Y, Zhang Y, Wang Z, Wang X, Li S, Guo Y, Webb GI, Nguyen ATN, May L, Song J. GraphormerDTI: A graph transformer-based approach for drug-target interaction prediction. Comput Biol Med 2024;173:108339. [Crossref] [PubMed]
  27. Gao C, Qian Q, Li Y, Xing X, He X, Lin M, Ding Z. A comparative study of three bone age assessment methods on Chinese preschool-aged children. Front Pediatr 2022;10:976565. [Crossref] [PubMed]
  28. Prokop-Piotrkowska M, Marszałek-Dziuba K, Moszczyńska E, Szalecki M, Jurkiewicz E. Traditional and New Methods of Bone Age Assessment-An Overview. J Clin Res Pediatr Endocrinol 2021;13:251-62. [Crossref] [PubMed]
  29. Lee H, Tajmir S, Lee J, Zissen M, Yeshiwas BA, Alkasab TK, Choy G, Do S. Fully Automated Deep Learning System for Bone Age Assessment. J Digit Imaging 2017;30:427-41. [Crossref] [PubMed]
  30. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016:779-88.
  31. Jocher G, Qiu J. Ultralytics YOLO11 [Internet]. 2024. Available online: https://github.com/ultralytics/ultralytics. Accessed: January 15, 2025.
  32. Wang C, Chen D, Hao L, Liu X, Zeng Y, Chen J, Zhang G. Pulmonary image classification based on inception-v3 transfer learning model. IEEE Access 2019;7:146533-41.
  33. Tan M, Le Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning 2019 May 24. PMLR; 2019:6105-14.
  34. Wu E, Kong B, Wang X, Bai J, Lu Y, Gao F, Zhang S, Cao K, Song Q, Lyu S, Yin Y. Residual attention based network for hand bone age assessment. 2019 IEEE 16th international symposium on biomedical imaging (ISBI 2019), Venice, Italy, 2019;1158-61.
  35. Li S, Liu B, Li S, Zhu X, Yan Y, Zhang D. A deep learning-based computer-aided diagnosis method of X-ray images for bone age assessment. Complex Intell Systems 2022;8:1929-39. [Crossref] [PubMed]
  36. Ren X, Li T, Yang X, Wang S, Ahmad S, Xiang L, Stone SR, Li L, Zhan Y, Shen D, Wang Q. Regression Convolutional Neural Network for Automated Pediatric Bone Age Assessment From Hand Radiograph. IEEE J Biomed Health Inform 2019;23:2030-8. [Crossref] [PubMed]
  37. Reddy NE, Rayan JC, Annapragada AV, Mahmood NF, Scheslinger AE, Zhang W, Kan JH. Bone age determination using only the index finger: a novel approach using a convolutional neural network compared with human radiologists. Pediatr Radiol 2020;50:516-23. [Crossref] [PubMed]
  38. Yang C, Dai W, Qin B, He X, Zhao W. A real-time automated bone age assessment system based on the RUS-CHN method. Front Endocrinol (Lausanne) 2023;14:1073219. [Crossref] [PubMed]
  39. Cavallo F, Mohn A, Chiarelli F, Giannini C. Evaluation of Bone Age in Children: A Mini-Review. Front Pediatr 2021;9:580314. [Crossref] [PubMed]
Cite this article as: Cao K, Liu C, Du J, Duan L, Li L, Ma Y, Jung H. A developmental stage-aware graph transformer framework for automated bone-age assessment. Quant Imaging Med Surg 2026;16(1):62. doi: 10.21037/qims-2025-998
