Multi-scale generative adversarial network: three-dimensional reconstruction of the scoliotic spine from biplanar X-rays
Original Article

Yuhao Lai1, Ranran Tie1,2, Gaosheng Xie1, Zhong He3, Zezhang Zhu3, Zhen Liu3, Hongda Bao3, Jing Xiong1,2

1The Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; 2University of Chinese Academy of Sciences, Beijing, China; 3Division of Spine Surgery, Department of Orthopedic Surgery, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, Nanjing, China

Contributions: (I) Conception and design: J Xiong, Y Lai, R Tie; (II) Administrative support: J Xiong, G Xie; (III) Provision of study materials or patients: Z He, Z Zhu; (IV) Collection and assembly of data: Z Liu, H Bao; (V) Data analysis and interpretation: Y Lai, R Tie; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Jing Xiong, PhD. The Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Nanshan District, Shenzhen 518055, China; University of Chinese Academy of Sciences, Beijing, China. Email: jing.xiong@siat.ac.cn.

Background: Adolescent idiopathic scoliosis (AIS) causes lateral spinal curvature, affecting adolescent growth, nerve function, and cardiopulmonary health, with an incidence of 1% to 3% in adolescents. Accurate three-dimensional (3D) spinal assessment is critical for AIS management, yet clinical imaging faces dilemmas. Specifically, two-dimensional (2D) X-rays lack 3D information due to vertebral overlap, while computed tomography (CT) involves excessive ionizing radiation that increases tumor risks in adolescents. Existing biplanar X-ray-based 3D reconstruction methods suffer from feature fusion loss and poor adaptability to scoliotic deformities. This study aimed to develop a low-radiation and high-precision deep learning framework for 3D spinal reconstruction from orthogonal X-rays.

Methods: We proposed a multi-scale generative adversarial network (MS-GAN) to reconstruct 3D spinal CT volumes from two orthogonal X-rays of both normal and scoliotic spines. The model incorporates three innovative designs: a residual-dense encoder for retaining fine vertebral details, an adaptive cross-view fusion module for integrating orthogonal projections, and a multi-scale fusion discriminator (MSFD) to ensure structural consistency. It was validated on a mixed dataset including 1,087 normal spines from CTSpine1K and clinical AIS data covering 138 cases with Cobb angles ranging from 15° to 85°. Reconstruction performance was evaluated using mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and Dice coefficient (Dice). Clinical applicability was further assessed using sagittal vertical axis (SVA), lumbar lordosis (LL), and thoracic kyphosis (TK) measurements on 38 real biplanar X-ray cases. We also compared model performance on real clinical data against simulated data, and verified the effectiveness of each architectural improvement through ablation experiments.

Results: Compared to the baseline X2CT-GAN, MS-GAN achieved a 30% reduction in MSE; its PSNR reached 73.20 dB, an improvement of 12.2%; its SSIM reached 0.956, an improvement of 5.5%; and its Dice reached 0.92, an improvement of 8%. Bland-Altman analysis demonstrated strong agreement between reconstructed spines and CT gold standards, with mean differences of −0.198 mm for SVA, 0.087° for LL, and 0.095° for TK. When applied to real clinical X-rays, performance decreases remained within 5% for key metrics. Ablation experiments further confirmed the effectiveness of each architectural improvement.

Conclusions: MS-GAN resolves feature fusion and dimension mismatch in biplanar X-ray 3D reconstruction, with robust performance on scoliotic spines. It avoids CT radiation and reduces equipment reliance, providing a low-cost, high-precision tool for AIS assessment and is valuable for primary hospitals.

Keywords: Scoliosis; biplanar X-rays; three-dimensional spinal computed tomography reconstruction (3D spinal CT reconstruction); multi-scale generative adversarial network (MS-GAN); deep learning


Submitted Oct 20, 2025. Accepted for publication Feb 27, 2026. Published online Apr 13, 2026.

doi: 10.21037/qims-2025-aw-2176


Introduction

Adolescent idiopathic scoliosis (AIS) is a common spinal deformity defined by a three-dimensional (3D) structural lateral curvature of the spine. Beyond cosmetic issues, it can compress nerves, impair cardiopulmonary function, and affect adolescent growth and development (1). Clinical data indicate that the incidence of AIS among adolescents is approximately 1% to 3%, and the formulation of its diagnosis and treatment plans relies heavily on accurate imaging assessment (1). However, current clinical spinal imaging techniques face a triple dilemma involving accuracy, radiation safety, and equipment accessibility: X-ray imaging enables non-invasive and rapid scanning based on tissue attenuation properties, making it the first-choice method for scoliosis screening. Nevertheless, its two-dimensional (2D) projection causes overlapping of structures such as vertebral bodies and pedicles. Clinicians can only indirectly assess scoliosis severity by manually measuring the Cobb angle, a process that not only introduces errors but also fails to capture 3D spatial position and morphological details of the spine (2). Computed tomography (CT) can generate high-resolution 3D tomographic images, clearly revealing key information such as vertebral rotation and intervertebral space narrowing. However, it involves extremely high radiation doses: a single traditional spinal CT scan delivers an effective dose equivalent to 50 to 250 chest X-rays, while modern low-dose spine CT protocols have optimized this range to 20 to 80 chest X-rays (3). Given the high sensitivity of adolescents’ hematopoietic tissues to radiation, this exposure significantly increases the risk of long-term tumors (4). Additionally, CT equipment is expensive and unevenly distributed globally.

To address this dilemma, medical image 3D reconstruction technology has become a research focus, but existing methods have long been limited by insufficient clinical applicability (5-8). Early reconstruction approaches relied on anatomical priors or manual intervention, leading to poor adaptability to AIS-related anatomical deformities and reconstruction errors exceeding 2 mm (9). Moreover, some methods require matching with specialized equipment such as the EOS system, further restricting their adoption (10).

The rise of deep learning has ushered in a data-driven paradigm for spinal 3D reconstruction, among which generative adversarial networks (GANs) have become the mainstream technical route due to their strong nonlinear mapping capabilities and efficiency. Ying et al. proposed X2CT-GAN, pioneering the direct synthesis of 3D CT from biplanar X-rays (11), but it suffers from insufficient feature fusion. Subsequent improvements, such as domain adaptation strategies (12) and lightweight designs (13), have addressed specific bottlenecks but failed to fully resolve 2D-to-3D mapping challenges in scoliotic spines. In recent years, conditional diffusion models have emerged for their high reconstruction fidelity, especially in fine structures, but their iterative denoising process results in computation times 5 to 8 times longer than GANs, making them impractical for real-time clinical scenarios (14).

Despite technological advancements, existing studies still have significant shortcomings in clinical validation. Firstly, there are biases in the dataset. Lonstein et al. pointed out in their classic study that only 12% of literature on spinal reconstruction included severe scoliosis samples with Cobb angles >30°, leading to issues such as “scoliosis curve flattening” and “vertebral rotation direction distortion” when models are applied to real clinical data (2). Secondly, the rigor of validation is insufficient, as approximately 78% of the studies rely on cross-validation within a single institution’s internal dataset, without undergoing multi-center external testing. Cootes et al. clearly noted in their review that models trained on single-institution data could experience an accuracy drop of up to 40% in cross-center tests, and poor generalization has become a major barrier to the clinical translation of these technologies (9).

In summary, current research on spinal 3D reconstruction has formed three core directions: first, safe imaging alternatives, including low-dose CT optimization, magnetic resonance imaging (MRI) protocol innovation, and radiation-free surface scanning, to build a full-spectrum radiation safety solution; second, robust segmentation and recognition methods, which achieve accurate structural analysis of deformed spines through strategies such as shape prior fusion and multi-scale feature learning; third, efficient deep generative models, which explore clinically applicable 3D mapping architectures by balancing the generation speed of GANs and the reconstruction quality of diffusion models. However, existing technologies still have significant research gaps in adaptability to complex anatomical variations, rigorous multi-center validation, and quantitative assessment of clinical indicators. To this end, this paper proposes an improved multi-scale generative adversarial network (MS-GAN), with the core goal of achieving “low-radiation, low-cost, and high-precision” 3D reconstruction of scoliotic spines. Existing low-radiation clinical solutions also have inherent limitations. Although the EOS imaging system reduces radiation, it is expensive, beyond the budget of most primary hospitals, and less accurate for severe scoliosis (e.g., Cobb angles greater than 60°). Modern low-dose CT reduces radiation doses but still poses residual tumor risks for adolescents and requires costly equipment. In contrast, MS-GAN uses standard biplanar X-rays, which are low-radiation, low-cost, and universally available, to generate high-precision 3D volumes, bridging the gap between the limited availability of advanced imaging tools and the need for safe, accurate AIS assessment.
Its innovative designs include: first, using a pre-trained U-Net model to extract pure spinal regions, completely eliminating soft tissue grayscale interference; second, designing a 2D encoder based on residual-dense connections, preserving high-frequency details such as vertebral edges and intervertebral spaces through hierarchical feature cascading and residual feedback; finally, introducing an adaptive dimension expansion module and a multi-scale discriminator to optimize the mapping accuracy from 2D features to 3D voxels and the consistency of global structures. The main contributions of this paper are as follows:

  • Proposing MS-GAN, a low-radiation 3D reconstruction model for scoliotic spines based on biplanar X-rays. It not only avoids the high radiation risk associated with CT examinations but also addresses the limited accessibility of CT equipment, offering a promising technical approach with potential for supporting scoliosis diagnosis and management in resource-limited settings.
  • Innovating core modules such as residual dense encoders, adaptive cross-view fusion modules, and multi-scale discriminators. These modules effectively solve the problems of soft tissue interference and insufficient adaptability to scoliosis scenarios faced by existing methods, significantly improving reconstruction accuracy and robustness.
  • Conducting rigorous validation on a mixed dataset of “normal spines and scoliotic spines”. Through multi-dimensional evaluation based on various indicators, it is proven that MS-GAN outperforms existing mainstream methods in reconstruction performance, providing theoretical support and practical guidance for clinical diagnosis and treatment management of AIS.

We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2176/rc).


Methods

Overall model design

Traditional encoder structures tend to lose detailed features after multiple downsampling steps, resulting in rough and blurry reconstruction results. As shown in Figure 1, our model consists of three components: first, orthogonal X-ray images were input into the encoder, which used residual dense connections to maximize the retention of feature information; second, the discriminator based on 3D feature convolution adopted multiple scales to compare spinal information from global to local perspectives, optimizing the reconstructed images; third, the dimension of 2D information was expanded, and adaptive weighting was used to fuse anteroposterior and lateral features, achieving better 3D reconstruction results.

Figure 1 Overall model design. 3D, three-dimensional.

Residual dense module

In the 2D encoder, we embedded the dense connection concept of DenseNet into each downsampling unit to form a residual-dense block (RDB). By retaining cascaded features and reusing outputs from all previous layers, we enhanced gradient flow and feature expression capabilities, thereby alleviating the information bottleneck caused by dimension differences during reconstruction.

The above process was applied to two orthogonal X-ray images separately: the input of each layer included the output of the previous layer, and outputs from all previous layers were concatenated along the channel dimension. Thus, each layer could directly access all previously extracted features, greatly enriching the input information. Additionally, each layer was directly constrained by the final loss function, which facilitates gradient flow and helps the network learn more robust features. Features output by the RDB underwent average pooling and then entered the “Transition-to-3D” modules, where all features were concatenated through fully connected layers and directly reshaped to obtain the basic 3D image.

Residual addition of the original input of this module with the feature map obtained after dense connection and local feature fusion can effectively retain low-frequency global structural information of the input image, avoiding the loss of overall morphology caused by downsampling.
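As an illustration, the dense-connection and residual-addition pattern described above can be sketched as follows (a simplified NumPy toy, not the paper's PyTorch implementation: 2D convolutions are replaced with random 1×1 projections, and the layer count and growth rate are arbitrary assumptions):

```python
import numpy as np

def dense_block(x, n_layers=3, growth=4, seed=0):
    """Toy residual-dense block on a (C, H, W) feature map: every layer
    consumes the channel-wise concatenation of the input and all previous
    layer outputs, and the block output adds the input back (residual)."""
    rng = np.random.default_rng(seed)
    feats = [x]
    for _ in range(n_layers):
        cat = np.concatenate(feats, axis=0)              # reuse all earlier features
        w = rng.standard_normal((growth, cat.shape[0]))  # 1x1 'conv' as a matrix
        feats.append(np.maximum(0.0, np.einsum('oc,chw->ohw', w, cat)))  # ReLU
    fused = np.concatenate(feats, axis=0)                # local feature fusion input
    w_lff = rng.standard_normal((x.shape[0], fused.shape[0]))
    local = np.einsum('oc,chw->ohw', w_lff, fused)       # project back to C channels
    return x + local                                     # residual addition

x = np.ones((8, 16, 16))
y = dense_block(x)
print(y.shape)  # (8, 16, 16)
```

Note how the residual path guarantees the output keeps the input's channel count and spatial size, so low-frequency global structure survives regardless of what the dense layers learn.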

Orthogonal view feature fusion and skip connection

The 3D anteroposterior and lateral X-ray features were fused through the cross-view fusion module, with dynamic weight learning as its core mechanism. Two basic 3D voxel images were rotated to corresponding angles, and voxel units at corresponding positions were fused according to specific weights. The skip connection module transmitted network features from each layer to the decoder, fully optimizing CT generation. This process included three tasks: duplication, dimension expansion, and concatenation.

The i-th level 2D features of the encoder first underwent 1×1 convolution to align the number of channels (Eq. [1]), then were duplicated along a third dimension to obtain pseudo-3D features (Eq. [2]), and were finally concatenated with the 3D features at the same level of the decoder (Eq. [3]), where $D_i$ is the number of voxels along the depth direction of the current decoder level and $[\cdot]$ denotes concatenation along the channel dimension.

$F_i^{2D} = \mathrm{Conv2D}\left(X_i^{enc},\ k=1,\ s=1,\ p=0\right)$ [1]

$F_i^{3D} = \mathrm{Repeat}\left(\mathrm{ExpandDims}\left(F_i^{2D},\ \mathrm{axis}=3\right),\ n=D_i\right)$ [2]

$F_i^{fuse} = \left[F_i^{3D},\ X_i^{dec}\right]$ [3]

The inputs to cross-view feature fusion are the 3D features $F^{PA}$ (encoded from the anteroposterior X-ray) and $F^{LAT}$ (encoded from the lateral X-ray). The fusion process is shown in Eq. [4]. The fused 3D volume was input into the decoder for upsampling and, at each level, concatenated with skip features transformed from the 2D encoder. This ensures that the reconstructed voxels retain the overall shape of the spine while integrating local texture details.

$F^{fuse} = \dfrac{F^{PA} + F^{LAT}}{2}$ [4]
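A minimal sketch of the dimension expansion, skip concatenation, and cross-view averaging steps (batch dimensions are dropped, a channel-first layout is assumed, and the rotation that aligns the two views is omitted; shapes are illustrative, not the paper's exact configuration):

```python
import numpy as np

def expand_to_3d(f2d, depth):
    """Eq. [2] analogue: duplicate a (C, H, W) feature map along a new
    depth axis to obtain a pseudo-3D feature of shape (C, D, H, W)."""
    return np.repeat(np.expand_dims(f2d, axis=1), depth, axis=1)

def fuse_with_decoder(f3d, x_dec):
    """Eq. [3] analogue: concatenate pseudo-3D skip features with decoder
    features along the channel dimension."""
    return np.concatenate([f3d, x_dec], axis=0)

def cross_view_fuse(f_pa, f_lat):
    """Eq. [4]: voxel-wise average of the anteroposterior and lateral
    volumes (the view-alignment rotation is omitted in this sketch)."""
    return (f_pa + f_lat) / 2.0

f2d = np.ones((16, 8, 8))                                # (C, H, W) encoder feature
f3d = expand_to_3d(f2d, depth=8)                         # (16, 8, 8, 8)
fused = fuse_with_decoder(f3d, np.zeros((16, 8, 8, 8)))  # (32, 8, 8, 8)
vol = cross_view_fuse(np.full((8, 8, 8), 2.0), np.zeros((8, 8, 8)))
print(f3d.shape, fused.shape, vol.mean())  # (16, 8, 8, 8) (32, 8, 8, 8) 1.0
```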

Multi-scale feature fusion

To further enhance the discriminator’s ability to distinguish between real and fake CT voxels in terms of global structure, middle-level texture, and fine-grained details, we introduced a multi-scale fusion discriminator (MSFD) module at the end of the original 3D PatchGAN discriminator, as shown in Figure 2.

Figure 2 Multi-scale feature fusion.

3D features were extracted independently at three voxel scales (128³, 64³, 32³) through parallel multi-branches, avoiding large-area blurring or detail omission caused by single-scale discrimination. The input 128³ voxels were subjected to bilinear downsampling by factors of 2 and 4 to obtain the 64³ and 32³ scales, respectively. Independent 3D convolution feature extraction was performed at each scale, as depicted in Eq. [5]. Weighted fusion was performed after unifying channel dimensions, as depicted in Eq. [6], where $w_s$ are channel-wise learnable weights and $\sigma$ is the LeakyReLU activation function.

$F_s = \mathrm{Conv3D}_s(V_s), \quad s \in \{1, 2, 3\}$ [5]

$F^{fuse} = \sigma\left(\mathrm{BN}\left(\sum_{s=1}^{3} w_s \cdot \mathrm{Conv3D}_{1\times1}(F_s)\right)\right)$ [6]

Different scale feature maps were mapped to a unified channel through learnable 1×1×1 convolutions, and cross-scale weight fusion was performed. This enabled the discriminator to simultaneously focus on structural consistency at different scales during the same loss backpropagation. The output of the discriminator is shown in Eq. [7].

$D(V) = \mathrm{Conv3D}_{out}\left(F^{fuse}\right)$ [7]

The fused multi-scale features were directly used to calculate the adversarial loss with real CT. The discriminator could simultaneously penalize global shape distortion and tiny texture artifacts, thereby promoting the generator to produce more detailed and structurally consistent 3D CT. This significantly improved training stability and final reconstruction quality.
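The multi-scale discrimination pipeline can be sketched as follows (a toy NumPy version: average pooling stands in for bilinear downsampling, the per-scale convolutional feature extractors are omitted, the fusion weights are fixed rather than learned, batch normalization is dropped, and a 16³ volume replaces the 128³ input for brevity):

```python
import numpy as np

def avg_pool3d(v, k):
    """Average-pool a cubic volume by factor k (stand-in for bilinear
    downsampling of the input voxels)."""
    d = v.shape[0] // k
    return v.reshape(d, k, d, k, d, k).mean(axis=(1, 3, 5))

def msfd_fuse(vol, weights=(0.5, 0.3, 0.2), alpha=0.01):
    """Sketch of Eqs. [5]-[6]: build three scales, map each back to a
    common grid, and combine with per-scale weights followed by LeakyReLU."""
    scales = [vol, avg_pool3d(vol, 2), avg_pool3d(vol, 4)]
    full = []
    for s in scales:  # upsample coarse scales by nearest-neighbour repeat
        r = vol.shape[0] // s.shape[0]
        full.append(np.repeat(np.repeat(np.repeat(s, r, 0), r, 1), r, 2))
    fused = sum(w * f for w, f in zip(weights, full))
    return np.where(fused > 0, fused, alpha * fused)  # LeakyReLU

v = np.random.default_rng(0).standard_normal((16, 16, 16))
out = msfd_fuse(v)
print(out.shape)  # (16, 16, 16)
```

The fused map is what a final output convolution (Eq. [7]) would score, so a single backward pass penalizes both global shape distortion and fine texture artifacts.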

Network structure

The improved X2CT-GAN proposed in this paper still follows the three-stage framework of a 2D encoder, a multi-scale discriminator, and a 3D decoder, but introduces innovative modules at three key positions. First, the 2D encoder consisted of four stages of residual-dense structures, with each stage internally stacking K dense blocks. Second, to bridge the 2D-to-3D dimension gap, the 2D features at the end of the encoder were first reshaped into 8×8×8 voxels through fully connected layers; meanwhile, 2D features at the same level were duplicated into 3D space via skip connections (after channel alignment with 1×1 convolution) for supplementation, then concatenated with the decoder. Third, the 3D decoder consisted of four symmetric stages of 3×3×3 transposed convolution and residual convolution. At each stage, up-sampling was first performed via Conv3DTranspose (kernel 4, stride 2, padding 1), and the result was then concatenated with skip features and convolved. Finally, the multi-scale discriminator extracted features in parallel at three scales (128³, 64³, 32³), fused them via 1×1×1 convolution, and output authenticity probabilities. Through repeated iterations and competition, the model generated images close to the original CT, as shown in Figure 3.

Figure 3 MS-GAN network structure. 2D, two-dimensional; 3D, three-dimensional; D, depth; H, height; MS-GAN, multi-scale generative adversarial network; W, width.

Loss function

The total loss function of the improved model consists of four components, which constrain reconstruction accuracy, projection consistency, adversarial fidelity, and multi-scale discriminator consistency, respectively. Let $V_{gt}$ and $V_{pred}$ denote the real CT and network-predicted volumes, respectively; $P_{ax}(\cdot)$, $P_{co}(\cdot)$, and $P_{sa}(\cdot)$ denote projection operators along the axial, coronal, and sagittal orthogonal directions, respectively; and $D_{ms}(\cdot)$ denotes the output of the multi-scale discriminator. The total loss is shown in Eq. [8]; $L_{rec}$, $L_{proj}$, $L_{adv}^{D}$, $L_{adv}^{G}$, and $L_{ms}$ are shown in Eqs. [9-13].

$L_{total} = \lambda_{rec} L_{rec} + \lambda_{proj} L_{proj} + \lambda_{adv} L_{adv} + \lambda_{ms} L_{ms}$ [8]

$L_{rec} = \left\| V_{gt} - V_{pred} \right\|_2^2$ [9]

$L_{proj} = \frac{1}{3}\left( \left\| P_{ax}(V_{gt}) - P_{ax}(V_{pred}) \right\|_1 + \left\| P_{co}(V_{gt}) - P_{co}(V_{pred}) \right\|_1 + \left\| P_{sa}(V_{gt}) - P_{sa}(V_{pred}) \right\|_1 \right)$ [10]

$L_{adv}^{D} = \frac{1}{2}\left[ \left( D_{ms}(V_{gt}) - 1 \right)^2 + \left( D_{ms}(V_{pred}) - 0 \right)^2 \right]$ [11]

$L_{adv}^{G} = \frac{1}{2}\left( D_{ms}(V_{pred}) - 1 \right)^2$ [12]

$L_{ms} = \sum_{s=1}^{3} \mathrm{KL}\left( D_s(V_{pred}) \,\|\, D_s(V_{gt}) \right)$ [13]

In Eq. [8], $\lambda_{rec}$, $\lambda_{proj}$, $\lambda_{adv}$, and $\lambda_{ms}$ were empirically set to 10, 10, 0.1, and 0.05, respectively, prioritizing global structural accuracy while balancing local detail fidelity and scale consistency.
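For illustration, the weighted combination of the reconstruction and projection terms (Eqs. [8]-[10]) can be sketched as follows (projections are taken as plain axis sums, and the adversarial and multi-scale terms are passed in precomputed; this is a toy sketch with the paper's weights, not the training code):

```python
import numpy as np

# Empirical loss weights reported in the paper
LAMBDA = dict(rec=10.0, proj=10.0, adv=0.1, ms=0.05)

def rec_loss(v_gt, v_pred):
    """Eq. [9]: squared L2 reconstruction loss."""
    return float(np.sum((v_gt - v_pred) ** 2))

def proj_loss(v_gt, v_pred):
    """Eq. [10]: mean L1 difference of the three orthogonal projections,
    with each projection taken as a sum along one axis in this sketch."""
    return float(np.mean([np.sum(np.abs(v_gt.sum(axis=a) - v_pred.sum(axis=a)))
                          for a in (0, 1, 2)]))

def total_loss(v_gt, v_pred, adv=0.0, ms=0.0):
    """Eq. [8]: weighted sum of the four components."""
    return (LAMBDA['rec'] * rec_loss(v_gt, v_pred)
            + LAMBDA['proj'] * proj_loss(v_gt, v_pred)
            + LAMBDA['adv'] * adv + LAMBDA['ms'] * ms)

gt = np.zeros((4, 4, 4)); pred = np.full((4, 4, 4), 0.1)
print(round(total_loss(gt, pred), 3))  # 70.4
```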

Data composition

Digital reconstructed radiograph (DRR)

Due to the lack of large-scale datasets with strictly spatially aligned paired 2D X-ray images and 3D CT scans in the real world, this study used DRR technology for data preprocessing. By simulating the X-ray imaging process, DRR generated virtual 2D projection images from CT volume data, enabling precise control of imaging geometric parameters and ensuring complete spatial alignment between the generated X-ray images and the original CT data. This method not only solved the problems of scarce real data and difficult alignment but also provided diverse, high-quality synthetic training data for the model, effectively supporting the training and validation of end-to-end reconstruction tasks from 2D X-rays to 3D shapes.
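A toy illustration of the idea, using a parallel-beam projection (a plain axis sum over attenuation values) in place of the cone-beam ray casting and calibrated geometry of a real DRR pipeline:

```python
import numpy as np

def drr_parallel(ct, axis):
    """Toy DRR: integrate (sum) a CT attenuation volume along one axis and
    normalise to [0, 1]. Because the projection is computed directly from
    the CT voxels, the resulting image is perfectly aligned with the CT,
    which is the property DRR-based training pairs rely on."""
    proj = ct.sum(axis=axis)
    span = proj.max() - proj.min()
    return (proj - proj.min()) / span if span > 0 else proj

ct = np.zeros((32, 32, 32))
ct[8:24, 14:18, 14:18] = 1.0   # a crude 'vertebral column' of unit attenuation
ap  = drr_parallel(ct, axis=1)  # coronal (AP-like) projection
lat = drr_parallel(ct, axis=2)  # sagittal (lateral-like) projection
print(ap.shape, lat.shape)  # (32, 32) (32, 32)
```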

Datasets

We used the public CTSpine1K dataset, which contains 1,087 CT scans of normal spines, and supplemented it with 138 clinical scans of scoliosis patients from Nanjing Drum Tower Hospital. The data were split into training, validation, and test sets at a ratio of 8:1:1. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Nanjing Drum Tower Hospital (No. 2024-289), and the requirement for written informed consent was waived by institutional policy and the retrospective study design.

To enable machine learning to focus on the spine, this study adopted a semi-automatic segmentation method to achieve accurate and efficient segmentation of vertebral structures. This process integrated the MONAILabel open-source platform and 3D Slicer medical image analysis software. Segmentation was mainly performed in 3D Slicer, using its built-in MONAILabel plugin to call the pre-trained spinal segmentation model on the remote inference server, as shown in Figure 4. The pre-trained model was based on the nnU-Net architecture and trained on a large-scale, multi-center spinal CT dataset, exhibiting good generalization ability. The datasets used subsequently were all pre-segmented spinal data.

Figure 4 Schematic diagram of spinal segmentation. (A) Patient’s CT coronal plane, and (B) spinal segmentation results. CT, computed tomography.

Results

Segmentation and labeling metrics

To quantify the geometric and surface consistency between reconstructed CT and real CT, we adopted evaluation metrics from multiple perspectives: mean squared error (MSE) and peak signal-to-noise ratio (PSNR) were used to measure the absolute error between reconstructed and reference images; the structural similarity index measure (SSIM) was used to evaluate the perceptual similarity of two images in terms of brightness, contrast, and structure; and the Dice coefficient (Dice) was used to measure voxel-level overlap. The formulas for MSE, PSNR, SSIM, and Dice are shown in Eqs. [14-17], respectively, where $MAX_1$ denotes the maximum possible pixel intensity.

$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left( y_i - \hat{y}_i \right)^2$ [14]

$\mathrm{PSNR} = 20 \log_{10}\left( \frac{MAX_1}{\sqrt{\mathrm{MSE}}} \right)$ [15]

$\mathrm{SSIM}(x,y) = \frac{\left(2\mu_x\mu_y + c_1\right)\left(2\sigma_{xy} + c_2\right)}{\left(\mu_x^2 + \mu_y^2 + c_1\right)\left(\sigma_x^2 + \sigma_y^2 + c_2\right)}$ [16]

$\mathrm{Dice} = \frac{2 \times \left| pred \cap true \right|}{\left| pred \right| + \left| true \right|}$ [17]
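The four metrics follow directly from Eqs. [14-17]; the sketch below evaluates SSIM over a single global window (real evaluations typically average over local sliding windows) and computes Dice on binary volumes, with the stabilising constants chosen for illustration:

```python
import numpy as np

def mse(y, y_hat):
    """Eq. [14]."""
    return float(np.mean((y - y_hat) ** 2))

def psnr(y, y_hat, max_val=1.0):
    """Eq. [15], with MAX_1 = max_val; undefined for identical images."""
    return float(20.0 * np.log10(max_val / np.sqrt(mse(y, y_hat))))

def ssim_global(x, y, c1=1e-4, c2=9e-4):
    """Eq. [16] over one global window (a simplification of windowed SSIM)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2)
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

def dice(pred, true):
    """Eq. [17] on binary volumes."""
    inter = np.logical_and(pred, true).sum()
    return float(2.0 * inter / (pred.sum() + true.sum()))

a = np.zeros((8, 8, 8)); b = a.copy(); b[0, 0, 0] = 0.5
print(round(psnr(a, b), 2))  # 33.11
p = np.zeros((4, 4, 4), dtype=bool); t = p.copy()
p[:2] = True; t[1:3] = True
print(dice(p, t))  # 0.5
```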

Implementation details

We generated paired coronal and sagittal X-rays from segmented CT scans via DRR generation, and resized them to 128×128 pixels. Correspondingly, each real CT segmentation was resampled to isotropic 128×128×128 voxels to ensure strict alignment with the generated X-ray projections. Our algorithm was implemented using the PyTorch framework and trained on a single H100 graphics processing unit (GPU) with 80 GB of memory. The network batch size was set to 8; the initial learning rate was 0.0002, determined via 5-fold cross-validation; and the number of training epochs was 100, determined based on the SSIM convergence curve of the validation set. A cosine annealing strategy was adopted for learning rate adjustment.
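For reference, cosine annealing decays the learning rate from its initial value along half a cosine cycle over the training run; a minimal version (with a zero floor, an assumption not stated in the text) is:

```python
import math

def cosine_annealed_lr(epoch, total_epochs=100, lr0=2e-4, lr_min=0.0):
    """Cosine annealing: decay the learning rate from lr0 toward lr_min
    over total_epochs, following half a cosine cycle."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

print(cosine_annealed_lr(0))    # initial rate, 0.0002
print(cosine_annealed_lr(50))   # half-way, approximately 0.0001
```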

Reconstruction performance comparison between different models

We compared the reconstruction performance of different network structures. Table 1 listed the metric comparisons of reconstructed results from different models, Figure 5 showed the reconstruction results of normal spines, and Figure 6 showed the reconstruction results of scoliotic spines.

Table 1

Model comparison on reconstruction performance

Method MSE PSNR SSIM Dice coefficient
X2CT-GAN 0.016 65.23 0.906 0.85
DiffuX2CT 0.013 70.64 0.934 0.89
BX2S-Net 0.012 71.58 0.941 0.90
MS-GAN (ours) 0.011 73.20 0.956 0.92

MS-GAN, multi-scale generative adversarial network; MSE, mean squared error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure.

Figure 5 Reconstruction results of normal spines among different models. Real stands for real spine. From left to right, the generated results are respectively from MS-GAN, X2CT-GAN, DiffuX2CT, and BX2S-Net. MS-GAN, multi-scale generative adversarial network.
Figure 6 Reconstruction results of scoliotic spines among different models. From left to right, the generated results are respectively from MS-GAN, X2CT-GAN, DiffuX2CT, and BX2S-Net. MS-GAN, multi-scale generative adversarial network.

Model performance based on real X-ray data

We conducted a preliminary experiment on real clinical X-ray data, comprising 38 real biplanar X-ray cases that were not involved in model training. Figure 7 showed an example of a real patient’s frontal and lateral X-rays used in this experiment. We directly applied the MS-GAN, trained solely on DRRs, to these real X-rays. The changes in the model’s performance are shown in Table 2.

Figure 7 Schematic diagram of a real patient’s frontal and lateral X-rays. (A) Coronal plane X-ray of the patient, and (B) sagittal X-ray of the patient. L, left; R, right.

Table 2

Reconstruction performance on DRR versus real X-ray inputs

Method MSE PSNR SSIM Dice coefficient
MS-GAN (DRRs) 0.011 73.20 0.956 0.92
MS-GAN (true) 0.013 72.31 0.927 0.89

MS-GAN (DRRs), the experiment utilizes DRRs data; MS-GAN (true), the experiment utilizes real data. DRR, digital reconstructed radiograph; MS-GAN, multi-scale generative adversarial network; MSE, mean squared error; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure.

Statistical analysis of clinical medical indicators

We utilized the 38 sets of real biplanar X-ray data as our experimental dataset to analyze the agreement between the spinal volumes generated by our model and the gold standard in terms of clinical indicators. All clinical indicators were measured by a senior orthopedic expert. In this experiment, we measured three clinical indicators: the sagittal vertical axis (SVA), lumbar lordosis (LL) angle, and thoracic kyphosis (TK) angle. Figure 8 displayed the Bland-Altman plots for these three indicators. Table 3 presented the key statistical results.

Figure 8 Bland-Altman plots comparing the spinal reconstruction results of MS-GAN with the key clinical indicators of gold-standard spinal CT. (A) Displays the Bland-Altman plot for the SVA indicator parameters, unit in mm; (B) displays the Bland-Altman plot for the LL indicator parameters, unit is degrees; (C) displays the Bland-Altman plot for the TK indicator parameters, unit is degrees. CT, computed tomography; LL, lumbar lordosis; MS-GAN, multi-scale generative adversarial network; SD, standard deviation; SVA, sagittal vertical axis; TK, thoracic kyphosis.

Table 3

Quantitative results of Bland-Altman consistency analysis for SVA, LL, TK, and CT gold standards

Clinical indicators Mean difference 95% limit of agreement
SVA (mm) −0.198 −6.190 to 5.794
LL (°) 0.087 −4.197 to 4.371
TK (°) 0.095 −4.631 to 4.820

CT, computed tomography; LL, lumbar lordosis; SVA, sagittal vertical axis; TK, thoracic kyphosis.
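The quantities in Table 3 follow the standard Bland-Altman recipe: the mean of the paired differences and the 95% limits of agreement at mean ± 1.96 SD. A minimal sketch on hypothetical paired measurements (not the study’s data):

```python
import statistics as st

def bland_altman(a, b):
    """Mean difference and 95% limits of agreement (mean ± 1.96 SD)
    between paired measurements, as reported for SVA/LL/TK in Table 3."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = st.mean(diffs)
    sd = st.stdev(diffs)
    return mean_d, mean_d - 1.96 * sd, mean_d + 1.96 * sd

# hypothetical paired measurements (reconstruction vs. CT gold standard)
recon = [35.2, 12.1, 48.9, 22.4, 30.0]
gold  = [35.0, 12.5, 49.3, 22.0, 30.4]
m, lo, hi = bland_altman(recon, gold)
print(round(m, 3), round(lo, 3), round(hi, 3))
```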

Ablation experiment

Figures 9,10 showed the change curves of SSIM and PSNR during training for different modules, respectively. Table 4 listed the SSIM and PSNR values of the model at convergence. The ablation configurations were as follows: without RDB, the generator adopted a standard convolutional block in place of the RDB; without cross-view, simple average fusion replaced adaptive cross-view fusion; and without MSFD, a single-scale discriminator replaced the MSFD.

Figure 9 Curves during SSIM training process. MSFD, multi-scale fusion discriminator; MS-GAN, multi-scale generative adversarial network; RDB, residual-dense block; SSIM, structural similarity index measure; w/o, without.
Figure 10 Curves during PSNR training process. MSFD, multi-scale fusion discriminator; MS-GAN, multi-scale generative adversarial network; PSNR, peak signal-to-noise ratio; RDB, residual-dense block; w/o, without.

Table 4

Ablation final-epoch summary

Setting SSIM PSNR N runs
w/o RDB 0.9260±0.0012 67.61±0.20 5
w/o cross-view 0.9389±0.0043 69.49±0.35 5
w/o MSFD 0.9491±0.0025 71.01±0.21 5
Full MS-GAN 0.9572±0.0047 73.02±0.35 5

Data are presented as number or mean ± standard deviation. MSFD, multi-scale fusion discriminator; MS-GAN, multi-scale generative adversarial network; PSNR, peak signal-to-noise ratio; RDB, residual-dense block; SSIM, structural similarity index measure; w/o, without.


Discussion

Model architecture innovation and methodological differentiation

This study successfully developed and validated the MS-GAN model, which can efficiently generate high-precision 3D spinal CT volume data from orthogonal biplanar X-rays. Compared with existing methods, the core breakthrough of this study lies in effectively solving the feature fusion and dimension conversion challenges faced in the 2D-3D mapping process, especially for spines with anatomical deformities. The following is an in-depth discussion of the model’s innovation, performance improvement, and clinical significance.

The effectiveness of the MS-GAN architecture is key to its strong performance. Existing methods often lose information when extracting features from anteroposterior and lateral X-rays, resulting in blurry vertebral details or geometric distortions in the reconstruction. To address this bottleneck, this study adopted an encoder with dense connections and a decoder with residual connections. Through dense connections, the encoder maximizes the reuse of feature maps at different levels, ensuring that low-level edge texture information and high-level semantic information are retained during fusion, which significantly alleviates information loss. Together, the dense connections and residual design resolve the core challenge of dimension mismatch, laying a structural foundation for high-quality 3D reconstruction. It is worth noting that academic research has made valuable contributions to spinal image analysis, particularly through innovative strategies such as multi-scale feature fusion, attention mechanisms, and network pruning for spinal segmentation and lesion classification (15-23). These efforts have laid a solid foundation for automated spinal image processing. MS-GAN uses low-radiation biplanar X-rays as input specifically for 3D reconstruction of scoliosis, a task rarely emphasized in the aforementioned work. Compared with the quantum convolutional neural network method used for spinal reconstruction (23), MS-GAN eliminates the need for a complex quantum computing framework and offers higher efficiency and easier implementation in clinical environments. At the feature extraction level, the residual dense block (RDB) outperforms traditional simple multi-scale fusion strategies by capturing richer vertebral details through dense residual feedback.
The adaptive cross-view fusion module effectively integrates orthogonal biplanar information and provides the 2D-to-3D mapping capability that single-modal segmentation networks lack. The MSFD further enhances the fidelity of local details, making MS-GAN a comprehensive framework for 3D spine assessment that balances low radiation, high accuracy, and clinical practicality.
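To make the dense connections and local residual feedback concrete, the following is a minimal NumPy sketch of an RDB-style block. This is an illustrative sketch, not the authors’ implementation: the linear `conv_layer` stand-in, the layer widths, and the uniform fusion weights are all assumptions for demonstration only.

```python
import numpy as np

def conv_layer(x, w):
    # Stand-in for a convolutional layer: linear map over channels + ReLU.
    return np.maximum(x @ w, 0.0)

def residual_dense_block(x, weights):
    """Each layer receives the concatenation of the block input and all
    earlier layer outputs (dense connections); a final fusion projection
    plus the identity shortcut forms the local residual."""
    features = [x]
    for w in weights:
        dense_in = np.concatenate(features, axis=-1)  # reuse all feature maps
        features.append(conv_layer(dense_in, w))
    fused = np.concatenate(features, axis=-1)
    # 1x1-style fusion back to the input width, then the residual shortcut
    w_fuse = np.full((fused.shape[-1], x.shape[-1]), 1.0 / fused.shape[-1])
    return x + fused @ w_fuse

# Hypothetical usage: 4 spatial positions, 8 channels, growth rate 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
weights = [rng.standard_normal((8 + i * 8, 8)) * 0.1 for i in range(3)]
y = residual_dense_block(x, weights)  # same shape as the block input
```

Because every layer sees all earlier feature maps, low-level edge textures remain available at the deepest layer, which is the property the text credits with alleviating feature loss.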

Quantitative performance validation

Quantitative results confirm the effectiveness of this method in overall accuracy and anatomical feature preservation. Compared with the baseline X2CT-GAN, MS-GAN achieved a 30% reduction in MSE; its PSNR reached 73.20 dB, an improvement of 12.2%; its SSIM reached 0.956, an improvement of 5.5%; and its Dice reached 0.92, an improvement of 8%. These metrics indicate that the 3D CT images generated by MS-GAN closely match real CT scans in voxel intensity, signal-to-noise ratio, and structural similarity. In particular, the high PSNR indicates that noise and geometric distortions were minimized, while the high SSIM shows that the model excels at preserving the geometric shape and texture details of anatomical structures such as vertebral bodies and intervertebral spaces. For anatomical features, even minor distortions can affect subsequent diagnosis and measurement, so the superior performance of MS-GAN in this regard further confirms its clinical value. Furthermore, when MS-GAN takes real clinical X-ray images instead of DRRs as input, performance declines somewhat: MSE increases by 0.001, PSNR decreases by 0.89 dB, SSIM decreases by 0.029, and Dice decreases by 0.03. Except for MSE, the declines are within 5%. This is due to domain gaps between real clinical X-ray images and DRRs, such as noise and calibration differences. The magnitude of this error is within the clinically acceptable range.
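The overall metrics above follow their standard definitions; the sketch below shows MSE, PSNR, and Dice computed with NumPy on toy volumes. This is not the study’s evaluation code, and the data range and masks are illustrative assumptions.

```python
import numpy as np

def mse(pred, gt):
    """Mean squared voxel-intensity error."""
    return float(np.mean((pred - gt) ** 2))

def psnr(pred, gt, data_range=1.0):
    """Peak signal-to-noise ratio in dB for intensities in [0, data_range]."""
    m = mse(pred, gt)
    return float("inf") if m == 0 else float(10.0 * np.log10(data_range ** 2 / m))

def dice(pred_mask, gt_mask):
    """Dice overlap between two binary segmentation masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    return float(2.0 * inter / (pred_mask.sum() + gt_mask.sum()))

# Toy volumes: a uniform 0.1 intensity offset gives MSE 0.01 and PSNR 20 dB
gt = np.zeros((8, 8, 8))
pred = gt + 0.1
```

SSIM additionally compares local luminance, contrast, and structure statistics, which is why it is the metric most sensitive to preserved vertebral geometry and texture.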

It is worth noting that diffusion models such as DiffuX2CT, with their iterative denoising mechanism, may have potential advantages in reconstructing fine anatomical structures such as pedicles and vertebral endplates. These structures occupy a small fraction of the volume and have limited impact on global quantitative metrics such as SSIM and Dice, so they do not change the quantitative lead of MS-GAN shown in Table 1, although the qualitative rendering of such details may be comparable. This detail advantage, however, comes at a significant efficiency cost: the computation time of diffusion models is 5-8 times that of GAN-based methods, making it difficult to meet real-time requirements such as rapid clinical screening and intraoperative evaluation. Through its residual dense encoder and multi-scale discriminator, MS-GAN balances reconstruction efficiency and practical value while preserving the overall spinal structure and core clinical indicators, striking a reasonable balance between accuracy and real-time performance that better suits primary hospitals and routine diagnosis and treatment scenarios.

The ablation experiments showed that after removing the RDB, the SSIM of the model decreased from 0.9572 to 0.9260, a decline of 3.26%, and the PSNR decreased from 73.02 to 67.61 dB, a drop of 5.41 dB. After removing the MSFD, SSIM decreased only from 0.9572 to 0.9491, a decline of 0.85%, and PSNR decreased from 73.02 to 71.01 dB, a drop of 2.01 dB. This difference clearly indicates that the RDB, as the core module of the generator, maximizes the preservation of low-level edge textures and high-level semantic features in the 2D X-rays through its dense residual feedback mechanism, directly resolving the key bottleneck of feature loss in 2D-3D mapping; it is the foundation of high-quality spine reconstruction. The MSFD, as the discriminator, mainly optimizes the fidelity of local details and the overall structural consistency of vertebral bodies through global and local multi-scale feature verification. Although it significantly improves reconstruction accuracy, it depends on the generator fully preserving features, so its impact on overall model performance is secondary. The synergy of the two modules both ensures the core quality of reconstruction and optimizes detail rendering, jointly constituting the performance advantage of MS-GAN.
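The SSIM declines quoted above are expressed relative to the full model’s score; the arithmetic can be checked in a few lines (the helper name is ours, not from the paper):

```python
def relative_drop(before, after):
    """Decline expressed as a percentage of the full model's score."""
    return (before - after) / before * 100.0

# SSIM after removing the RDB: 0.9572 -> 0.9260
print(round(relative_drop(0.9572, 0.9260), 2))  # 3.26
# SSIM after removing the MSFD: 0.9572 -> 0.9491
print(round(relative_drop(0.9572, 0.9491), 2))  # 0.85
```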

Clinical robustness and applicability

More importantly, this study highlights the robustness and clinical applicability of the model in complex scoliosis scenarios. By training and validating on a dataset containing both normal and clinically scoliotic spines, the model learns to adapt to spinal morphological variation ranging from physiological curvature to pathological deformity. Bland-Altman analysis further confirms that its reconstructions agree well with the CT gold standard on three core clinical indicators: SVA, LL, and TK. Specifically, 95% of the measurement differences cluster around the zero line and fall entirely within clinically acceptable thresholds, and the mean difference is close to zero, indicating no significant systematic error. This means that MS-GAN is not merely a “laboratory model” under ideal conditions but a practical tool capable of handling complex cases in real clinical practice. This study demonstrates that a generative model based on biplanar X-rays can both adapt to the morphological changes of deformed spines and provide precise clinical quantitative indicators, meeting the need for accurate assessment of deformed spinal structures.
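The Bland-Altman statistics referred to above (bias and 95% limits of agreement) follow the standard definition; the sketch below computes them with NumPy. The paired TK values are hypothetical illustration data, not measurements from this study.

```python
import numpy as np

def bland_altman(method_a, method_b):
    """Bias (mean difference) and 95% limits of agreement between two
    measurement methods, per the standard Bland-Altman definition."""
    diff = np.asarray(method_a, float) - np.asarray(method_b, float)
    bias = float(diff.mean())
    sd = float(diff.std(ddof=1))  # sample standard deviation of differences
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Hypothetical paired TK angles (degrees): reconstruction vs. CT gold standard
recon_tk = [31.2, 28.5, 40.1, 35.8, 22.9]
ct_tk = [30.8, 29.1, 39.5, 36.2, 23.4]
bias, (lo, hi) = bland_altman(recon_tk, ct_tk)  # bias near 0 indicates no systematic error
```

Agreement holds when the bias is close to zero and the limits of agreement lie within the clinically acceptable threshold for that indicator.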

Research limitations

This study has certain limitations. Firstly, the performance of the model depends on the quality and calibration accuracy of the input X-ray images; any deviation in projection angle or patient movement may affect the reconstruction. Secondly, to eliminate soft tissue interference, this study uses a pre-trained model for spine segmentation, but such models produce minor omissions and errors when segmenting scoliotic spines, which hinders model training. In addition, model training mainly relies on paired data generated from DRRs. Although performance was validated on real X-ray data, domain adaptation strategies have not been fully introduced, so domain gaps between DRRs and real clinical X-rays, such as equipment differences, scattering noise, and tissue overlap patterns, are not completely eliminated. This may cause a slight decrease in reconstruction accuracy and a slight increase in clinical indicator measurement error on real X-ray data from different hospitals or scanning parameters, or on low-quality images from severe spinal deformities combined with low-dose scans. In clinical scenarios, more comprehensive information may be needed, and the adaptability of the current model to multi-source heterogeneous clinical images still requires further improvement.

Future research directions

Looking ahead, this study provides a robust technical solution for low-radiation, low-cost, and high-precision diagnosis and treatment of spinal diseases. Considering the current limitations and practical clinical needs, future work will advance along three lines. Firstly, to address the domain gap between DRRs and real clinical X-rays, we will introduce a domain-adaptive learning strategy and integrate real X-ray data from multiple centers and devices to expand the training set, further enhancing the model’s adaptability to heterogeneous clinical images. We will also optimize the segmentation-reconstruction integration framework, tackling pain points such as missed segmentation of scoliotic segments and positioning deviations of deformed vertebral bodies, thereby strengthening reconstruction robustness in complex pathological scenarios. Secondly, we will deepen the integration of the reconstruction model with downstream clinical applications, encompassing not only surgical planning, pedicle screw guide plate printing, and biomechanical analysis, but also AI-assisted automatic measurement of clinical indicators and prediction of deformity progression, constructing a full-process closed loop of “image reconstruction-quantitative evaluation-treatment planning-postoperative follow-up”. Lastly, we will explore reconstruction paradigms that do not require paired data, combining the high-fidelity advantage of diffusion models with the efficiency of GANs to further reduce measurement errors in clinical indicators. This will promote adoption of the technique in primary hospitals, ultimately facilitating the large-scale implementation and standardized development of intelligent, precise spinal surgery.


Conclusions

We proposed an MS-GAN method for 3D spinal reconstruction, which achieves high-precision, structurally consistent spinal CT reconstruction from orthogonal biplanar X-ray images. By introducing an RDB in the encoder, the method effectively alleviates feature loss and gradient vanishing; through cross-view dynamic fusion and multi-scale skip connection mechanisms, it enhances the fusion of orthogonal view information and maintains spatial consistency; in addition, the multi-scale discriminator further improves discrimination between global structures and local details, significantly improving reconstruction quality. Experiments on public datasets and clinical scoliosis data showed that the proposed method significantly outperforms existing mainstream methods on multiple quantitative metrics and exhibits good robustness and clinical applicability, especially in scoliotic cases with anatomical deformities.

Despite significant progress in spinal 3D reconstruction, this study still has limitations and clear directions for improvement. Training currently relies on synthetic DRRs, which alleviates the scarcity of real paired data, but the domain gap between DRRs and real X-ray images still needs to be bridged through domain adaptation strategies. In addition, the generalization of the model to severe spinal deformity cases requires further verification. Future research will focus on three aspects: expanding the training set with more real clinical imaging data, exploring reconstruction paradigms without paired data, and integrating cutting-edge generative techniques such as diffusion models, in order to further improve reconstruction accuracy and clinical adaptability and promote large-scale clinical adoption of the technology.


Acknowledgments

None.


Footnote

Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2176/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2176/dss

Funding: This work was supported in part by the National Natural Science Foundation of China (Nos. U24A20671 and 12426305), and in part by the Guangdong Basic and Applied Basic Research Foundation (No. 2022B1515020042).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-aw-2176/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Institutional Review Board of Nanjing Drum Tower Hospital (No. 2024-289), and the requirement for written informed consent was waived by institutional policy and the retrospective study design.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Weinstein SL, Dolan LA, Cheng JC, Danielsson A, Morcuende JA. Adolescent idiopathic scoliosis. Lancet 2008;371:1527-37. [Crossref] [PubMed]
  2. Lonstein JE, Carlson JM. The prediction of curve progression in untreated idiopathic scoliosis during growth. J Bone Joint Surg Am 1984;66:1061-71. [Crossref] [PubMed]
  3. Brenner DJ, Hall EJ. Computed tomography--an increasing source of radiation exposure. N Engl J Med 2007;357:2277-84. [Crossref] [PubMed]
  4. Pearce MS, Salotti JA, Little MP, McHugh K, Lee C, Kim KP, Howe NL, Ronckers CM, Rajaraman P, Sir Craft AW, Parker L, Berrington de González A. Radiation exposure from CT scans in childhood and subsequent risk of leukaemia and brain tumours: a retrospective cohort study. Lancet 2012;380:499-505. [Crossref] [PubMed]
  5. Qiao Z, Chu DH, Ouyang HQ, Yuan HS, Zhen XT, Dong P, Qian Z. Coarse-Fine View Attention Alignment-Based GAN for CT Reconstruction from Biplanar X-Rays. arXiv preprint arXiv:2408.09736 [Preprint]. 2024. Available online: 10.48550/arXiv.2408.09736
  6. Dubey P, Saxena A, Jordan JE, Xian Z, Javed Z, Jindal G, Vahidy F, Sostman DH, Nasir K. Contemporary national trends and disparities for head CT use in emergency department settings: Insights from National Hospital Ambulatory Medical Care Survey (NHAMCS) 2007-2017. J Natl Med Assoc 2022;114:69-77. [Crossref] [PubMed]
  7. Chen Z, Guo L, Zhang R, Fang Z, He X, Wang J. BX2S-Net: Learning to reconstruct 3D spinal structures from bi-planar X-ray images. Comput Biol Med 2023;154:106615. [Crossref] [PubMed]
  8. Chen Y, Gao Y, Fu X, Chen Y, Wu J, Guo C, Li X. Automatic 3D reconstruction of vertebrae from orthogonal bi-planar radiographs. Sci Rep 2024;14:16165. [Crossref] [PubMed]
  9. Cootes TF, Taylor CJ. Statistical Models of Appearance for Medical Image Analysis and Computer Vision. Proceedings of SPIE - The International Society for Optical Engineering 2001;4322: [Crossref]
  10. Rehm J, Germann T, Akbar M, Pepke W, Kauczor HU, Weber MA, Spira D. 3D-modeling of the spine using EOS imaging system: Inter-reader reproducibility and reliability. PLoS One 2017;12:e0171258. [Crossref] [PubMed]
  11. Ying X, Guo H, Ma K, Wu J, Weng Z, Zheng Y. X2CT-GAN: Reconstructing CT From Biplanar X-Rays With Generative Adversarial Networks. arXiv:1905.06902 [Preprint]. 2019. Available online: 10.1109/CVPR.2019.01087
  12. Jecklin S, Shen Y, Gout A, Suter D, Calvet L, Zingg L, Straub J, Cavalcanti NA, Farshad M, Fürnstahl P, Esfandiari H. Domain adaptation strategies for 3D reconstruction of the lumbar spine using real fluoroscopy data. Med Image Anal 2024;98:103322. [Crossref] [PubMed]
  13. Xing X, Li X, Wei C, Zhang Z, Liu O, Xie S, Chen H, Quan S, Wang C, Yang X, Jiang X, Shuai J. DP-GAN+B: A lightweight generative adversarial network based on depthwise separable convolutions for generating CT volumes. Comput Biol Med 2024;174:108393. [Crossref] [PubMed]
  14. Liu XH, Qiao Z, Liu RK, Li H, Zhang J, Zhen XT, Qian Z, Zhang BC. DiffuX2CT: Diffusion Learning to Reconstruct CT Images from Biplanar X-Rays. arXiv:2407.13545 [Preprint]. 2024. Available online: 10.48550/arXiv.2407.13545
  15. Saeed MU, Bin W, Sheng J, Mobarak Albarakati H. An Automated Multi-scale Feature Fusion Network for Spine Fracture Segmentation Using Computed Tomography Images. J Imaging Inform Med 2024;37:2216-26. [Crossref] [PubMed]
  16. Saeed MU, Bin W, Sheng J, Albarakati HM, Dastgir A. MSFF: An Automated Multi-scale Feature Fusion Deep Learning Model for Spine Fracture Segmentation Using MRI. International Conference on Computational Science and Computational Intelligence. Springer, Cham. 2025. doi:10.1007/978-3-031-99586-6_8.
  17. Saeed MU, Bin W, Sheng J, Ghulam A, Dastgir A. 3D MRU-Net: A novel mobile residual U-Net deep learning model for spine segmentation using computed tomography images. Biomedical Signal Processing and Control 2023;86:105153. [Crossref]
  18. Saeed MU, Dikaios N, Dastgir A, Ali G, Hamid M, Hajjej F. An Automated Deep Learning Approach for Spine Segmentation and Vertebrae Recognition Using Computed Tomography Images. Diagnostics (Basel) 2023;13:2658. [Crossref] [PubMed]
  19. Saeed MU, Bin W, Sheng J, Saleem S. 3D MFA: An automated 3D Multi-Feature Attention based approach for spine segmentation using a multi-stage network pruning. Comput Biol Med 2025;185:109526. [Crossref] [PubMed]
  20. Dastgir A, Bin W, Saeed MU, Sheng J, Saleem S. MAFMv3: An automated Multi-Scale Attention-Based Feature Fusion MobileNetv3 for spine lesion classification. Image and Vision Computing 2025;155:105440. [Crossref]
  21. Dastgir A, Bin W, Saeed MU, Sheng J, Site L, Hassan H. Attention LinkNet-152: a novel encoder-decoder based deep learning network for automated spine segmentation. Sci Rep 2025;15:13102. [Crossref] [PubMed]
  22. Saeed MU, Bin W, Sheng J. Multimodal Lumbar Spine Segmentation Pipeline with Optimized Deep Learning and Network Pruning. 2025 International Joint Conference on Neural Networks (IJCNN); Rome, Italy. IJCNN; 2025:1-8.
  23. Dastgir A, Bin W, Sheng J, Saeed MU. Spine Image Reconstruction and Lesion Classification Based on Transfer Learning and Quantum Convolutional Neural Network. 2024 IEEE International Conference on High Performance Computing and Communications (HPCC). doi:10.1109/HPCC64274.2024.00012.
Cite this article as: Lai Y, Tie R, Xie G, He Z, Zhu Z, Liu Z, Bao H, Xiong J. Multi-scale generative adversarial network: three-dimensional reconstruction of the scoliotic spine from biplanar X-rays. Quant Imaging Med Surg 2026;16(5):341. doi: 10.21037/qims-2025-aw-2176
