Original Article

An interpretable cascaded residual iterative network for sparse-view spectral CT imaging

Xinrui Zhang1, Shaoyu Wang2, Ningning Liang1, Zhizhong Zheng1, Ailong Cai1, Lei Li1, Hengyong Yu3, Bin Yan1

1Key Laboratory of Imaging and Intelligent Processing, Information Engineering University, Zhengzhou, China; 2The Academy of Information Engineering, Nanchang University, Nanchang, China; 3The Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA, USA

Contributions: (I) Conception and design: X Zhang, S Wang; (II) Administrative support: B Yan, L Li, N Liang, A Cai; (III) Provision of study materials or patients: H Yu, S Wang, X Zhang; (IV) Collection and assembly of data: X Zhang, S Wang; (V) Data analysis and interpretation: X Zhang, Z Zheng; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Prof. Lei Li, PhD. Key Laboratory of Imaging and Intelligent Processing, Information Engineering University, 62 Kexue Avenue, Zhengzhou 450001, China. Email: leehotline@163.com.

Background: Sparse-view spectral tomographic image reconstruction represents a typical ill-posed inverse problem, resulting in distortion of image structures and surging noise in basis materials. Deep learning (DL) has emerged as a state-of-the-art method in spectral image reconstruction and quantitative material analysis. However, interpretability, generalizability, and data consistency remain challenges for existing DL-based methods. Additionally, there is no general network framework capable of simultaneously handling a series of dependent tasks in spectral imaging. This study aimed to establish a general framework integrating multi-scene spectral imaging issues, in which the spectral imaging tasks interact in a cyclic manner during the iterative process and are optimized together.

Methods: The interpretable cascaded residual iterative network (ICRIN) for spectral tomographic reconstruction and material decomposition was established. First, as a general iterative framework based on hybrid-domain networks, ICRIN integrates physical model-driven, compressed sensing (CS), and data-driven priors to promote model stability and data consistency. Second, a residual iterative mechanism is employed to extract residual image features, which are further emphasized by a transformer attention module. Third, an interpretable objective function is established using the alternating minimization method to jointly optimize spectral images and decomposed materials. Fourth, a feedback mechanism is employed to improve the stability and performance of ICRIN in both tasks. Numerical simulations were conducted on eight patients and real preclinical experiments on 126 mouse slices to evaluate the performance of the proposed model.

Results: Qualitative and quantitative comparisons between ICRIN and other state-of-the-art methods were conducted. The interpretability and generalizability of the ICRIN model were verified using the change curves of the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) indicators as the number of iterations increased. After iterations, the highest PSNR improvements for low- and high-energy spectral images and bone and tissue materials were approximately 6.9, 6.6, 4.0, and 8.4 dB, respectively. After the introduction of the feedback mechanism, the PSNR of the reconstructed images increased by approximately 3 dB, while that of the material images improved by approximately 1–3 dB.

Conclusions: This study established a general iterative framework, referred to as ICRIN, and discussed its advantages in terms of interpretability, generalizability, and data consistency in a mathematical modeling context. ICRIN could be applied across a wider range of spectral computed tomography (CT) imaging tasks, enabling clinical multi-task imaging and material quantification.

Keywords: Sparse-view imaging; spectral computed tomography (spectral CT); model interpretability; image reconstruction; material decomposition


Submitted Sep 02, 2025. Accepted for publication Dec 29, 2025. Published online Feb 11, 2026.

doi: 10.21037/qims-2025-1895


Introduction

Spectral computed tomography (CT) can provide multi-contrast imaging information while exploiting energy-dependent attenuation coefficients of different substances to achieve material decomposition. Spectral CT is a promising technology with potential applications in virtual non-contrast imaging, clinical lesion detection, and quantitative differential diagnosis (1,2). However, existing diagnostic CT systems face challenges in balancing acquisition efficiency with radiation safety. Prolonged scan times increase motion artifact susceptibility and patient discomfort, as well as cumulative radiation exposure. The implementation of sparse-view scanning in applications such as cone-beam CT (3) and static CT (4) can accelerate the acquisition process and reduce the radiation dose, but may compromise imaging quality. Hence, sparse-view spectral CT algorithms are being employed to address ill-posed inverse imaging problems. Hybrid tasks in medical scenarios of spectral imaging often have competing constraints, such as electronic noise filtering for low X-ray flux rates, artifact suppression under finite conditions (5), and super-resolution (6). These tasks need to be combined into a unified process to simplify clinical operations.

Recently, significant progress has been made in the optimization of independent spectral CT imaging tasks [e.g., image reconstruction from degraded measurement (7) and multi-material decomposition (8)]. Traditional model-based analytic reconstruction methods typically rely on precise physical models. Conversely, iterative reconstruction approaches are sensitive to hand-crafted priors and parameters (9). To address these issues, deep learning (DL)-based methods have emerged as state-of-the-art techniques for designing data-driven deep neural networks that learn latent features of spectral images. However, DL-based methods still exhibit limitations in critical areas, including data consistency, interpretability, robustness, and false positives and negatives.

Material decomposition methods in spectral CT imaging can be categorized into three groups: projection domain methods (10), image domain methods (11), and direct iteration methods (12). Classical image domain methods are approximated using linear models, which incorrectly characterize the nonlinear physical nature of the polychromatic X-ray source and the energy-dependent attenuation of materials. The accurate estimation of energy spectra represents another challenge. DL-based methods can effectively address this complex nonlinear problem, achieving high-quality material analysis and noise removal. However, most existing DL-based methods are mainly image domain methods, making it difficult to avoid linear approximation and leading to undesirable beam hardening artifacts (13).

The deep hybrid model is a novel network framework that integrates physical model priors, compressed sensing (CS) priors, and data-driven priors. One such representative technique is the unrolling method. For example, the typical ADMM-Net (14), based on the alternating direction multiplier method, achieves network unrolling. Another approach employs testing iterations, which reduces computational costs in the training process; examples include the learned primal-dual gradient algorithm (15), analytic compressed iterative deep framework (ACID) (16), and momentum-Net (17).

In this study, we developed a general interpretable cascaded residual iterative network (ICRIN) to synchronously perform spectral CT reconstruction and material decomposition from sparse-view measurements. As shown in Figure 1, ICRIN is a novel deep hybrid model that can leverage sparse priors from measurement data. It employs the alternating minimization method to solve the sub-problems, and the networks are further unfolded according to the derived expressions. The networks are classified into three categories to implement different imaging tasks: image reconstruction (NET1 and NET1*), material decomposition (NET2 and NET2*), and image generation (NET3). The significant innovation is the introduction of the image generation network (NET3), which feeds the material information back into the reconstructed image. This process transforms the framework into a closed-loop iterative structure, thus enhancing the framework stability and dual-task imaging effect. Experiments were conducted to examine the ICRIN effect.

Figure 1 The proposed ICRIN algorithm with five functional neural networks (NET1 and NET1*: image reconstruction; NET2 and NET2*: material decomposition, and NET3: image generation) is a general closed-loop generative iterative framework for cascaded spectral CT reconstruction and material decomposition tasks. NET1 and NET1* adopt different structures, while NET2 and NET2* adopt the same structure but different parameters. CT, computed tomography; ICRIN, interpretable cascaded residual iterative network.

The main contributions of this study are as follows:

  • A general hybrid model was established for cascaded tasks by combining spectral tomographic reconstruction and material decomposition. ICRIN integrates physical model-driven, CS, and data-driven priors. The physical model-driven prior can be used to maintain data consistency with the raw input, the data-driven prior to extract intrinsic data characteristics, and the CS prior to stabilize ICRIN iteration.
  • A residual iterative mechanism was introduced to remove redundant information contained in the residual domain and emphasize residual image features. A transformer-based convolutional neural network (CNN) is exploited in the residual domain to extract global and local residual features.
  • A heuristic objective function was proposed to model ICRIN, which was solved and interpreted via the method of alternating minimization. Every step of the algorithm is unfolded, and different network modules can be interpreted as implementation operators in the heuristic function.
  • Detailed numerical simulations and real preclinical experiments were performed to verify the effectiveness of ICRIN. The results showed that the introduction of a feedback mechanism stabilized iteration and improved imaging performance. A stability analysis under different sparse views and an anti-noise analysis were also conducted to demonstrate the generation ability of the model.

The rest of this article is organized as follows: section Methods introduces our ICRIN mathematical model as well as the network architectures; section Results compares ICRIN with other state-of-the-art methods and provides the corresponding reconstruction and decomposition results; sections Discussion and Conclusions comprise the discussion and conclusion, respectively.


Methods

Spectral CT imaging model

The X-ray photons emitted by a source can be described by a polychromatic spectrum distribution over an energy range $E \in [E_{\min}^{c}, E_{\max}^{c}]$ for a certain energy bin $c$. We assume that the path of the line integral is $L$, and $l$ is its variable of integration. For a normalized spectrum distribution $W_c(E)$ and the attenuation coefficient $\mu(l,E)$ of the object, the projection in each energy bin can be expressed as:

$$ y_c = -\ln\left( \int_{E_{\min}^{c}}^{E_{\max}^{c}} W_c(E)\, \exp\left[ -\int_L \mu(l,E)\,\mathrm{d}l \right] \mathrm{d}E \right) \quad [1] $$

The above forward projection formula can be approximated using a linear model. Hence, for all energy bins, the forward model for spectral CT image reconstruction can be simplified as $y^{(0)} = Af^* + \delta$. $y^{(0)} \in \mathbb{R}^{P \times C}$ ($P = V \times L_0$) is the original tomographic measurement unfolded from a third-order tensor into a matrix by element rearranging, where $V$, $L_0$, and $C$ represent the number of projection angles, detector pixels, and energy bins, respectively. $f^* \in \mathbb{R}^{Q \times C}$ ($Q = N_1 \times N_2$) is the ground truth (GT) of the reconstructed image, also an unfolded tensor, where the width and height of the image are $N_1$ and $N_2$, respectively. $A \in \mathbb{R}^{P \times Q}$ represents the measurement matrix. The pixel value of the image is the so-called attenuation coefficient. $\delta \in \mathbb{R}^{P \times C}$ is an error caused by data noise and linear approximation. The backward model is the inverse process of Eq. [1].

For material decomposition, $\mu(l,E)$ can be expressed as the combination of the material attenuation coefficient and decomposition coefficient; that is, $\mu(l,E) \approx \sum_{\tau=1}^{N} b_\tau(l)\,\mu_\tau(E)$, where $\tau \in [1, N]$ represents the material category, $N$ is the number of basis materials, and $b_\tau(l)$ denotes the decomposition coefficient of the $\tau$-th material, representing the volume fraction of these $N$ materials at the same position. The image composed of the decomposition coefficients is therefore the basis material image, and $\mu_\tau(E)$ is the material attenuation coefficient. Similarly, the material decomposition model based on the image domain can be simplified as a linear transformation, $m^* = f^* B^T + \delta'$, where $m^* \in \mathbb{R}^{Q \times N}$ denotes the decomposed material GT, $T$ represents the matrix transpose, $B \in \mathbb{R}^{N \times C}$ is a transformation matrix related to the attenuation coefficient, and $\delta' \in \mathbb{R}^{Q \times N}$ is an error generated by the linear approximation. In our experiment, two-channel stacked reconstructed images were used to obtain tissue and bone materials from the thoracic cavity.
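Both linearized models are plain matrix products. The following NumPy sketch (with toy dimensions, a random stand-in for the measurement matrix $A$, and an invented two-material attenuation matrix in place of real NIST values) illustrates the shapes involved and the per-pixel decomposition solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: V angles, L0 detector pixels, C energy bins,
# N1 x N2 image, N basis materials (all far smaller than in the paper).
V, L0, C, N1, N2, N = 12, 16, 2, 8, 8, 2
P, Q = V * L0, N1 * N2

A = rng.random((P, Q))            # stand-in measurement matrix (not a real system)
M = np.array([[0.5, 0.2],         # made-up per-bin attenuation of material 1
              [1.8, 0.9]])        # made-up per-bin attenuation of material 2
m_true = rng.random((Q, N))       # basis-material images (volume fractions)
f_true = m_true @ M               # spectral image: materials mixed per energy bin

# Forward model: y(0) = A f* + delta (noiseless here, delta = 0)
y0 = A @ f_true                   # (P, C) measurement

# Image-domain decomposition m* = f* B^T: with C = N, the transformation
# B^T is simply the inverse of M, so decomposition is a per-pixel linear solve.
B_T = np.linalg.inv(M)            # (C, N) transformation
m_est = f_true @ B_T
assert np.allclose(m_est, m_true, atol=1e-10)
```

With more energy bins than materials ($C > N$), the same solve becomes a per-pixel least-squares problem rather than an exact inversion.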

ICRIN architecture

As shown in Figure 2, our ICRIN is a closed-loop iterative heuristic framework composed of five networks. In the process of prior image acquisition, the raw spectral measured data $y^{(0)}$ are fed to a cascaded network framework, including the initialization reconstruction network (RecNet1), residual iterative reconstruction network (RecNet2), material decomposition network (DecNet1), residual material decomposition network (DecNet2), and generation network (GenNet) (denoted as $\Phi_1$, $\Phi_2$, $D_1$, $D_2$, and $G$, respectively). The first-round outputs of RecNet1 and DecNet1 are the initial reconstructed spectral images $f^{(0)}$ and materials $m^{(0)}$, respectively. In the residual-domain iteration, RecNet2 and DecNet2 constitute a cascaded network in the residual domain, making it possible to fit the sparse-view residual measurements to full-view residual spectral images and produce the decomposed residual materials. The residual image and residual material iteratively update the image and the material, respectively. The GenNet branch, as a feedback mechanism, feeds the material information back into the spectral image and fuses it with the image through weighting. All networks are roughly divided into reconstruction and decomposition modules, which are described below.

Figure 2 The overall architecture of the proposed ICRIN. (A) The workflow of the multi-domain hybrid framework. (B) The initialization reconstruction network (RecNet1) is a dual-channel U-Net preceded by a FBP module. (C) The iterative reconstruction network (RecNet2) mainly comprises several convolutional layers, a hidden feature layer, a linear projection layer, and three transformer modules for each path. Each transformer module employs cascaded MSA layers and MLPs to capture the distant spatial correlations in the images. (D) The decomposition networks (DecNet1 and DecNet2) and generation network (GenNet) adopt the same structure of dual-channel WGAN [DIWGAN (18)] but different parameters. b, c, p, w, h represent the batch size, channel number, patch size, image width, and image height, respectively, where $l = (h/p)\times(w/p)$. BN, batch normalization; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; ICRIN, interpretable cascaded residual iterative network; MLP, multilayer perceptron; MSA, multi-head self-attention; TV, total variation; WGAN, Wasserstein generative adversarial networks.

Reconstruction module

As shown in Figure 2B,2C, our reconstruction module comprises two networks: RecNet1 and RecNet2. RecNet1 is a dual-channel U-Net preceded by a filtered back projection (FBP) module, implementing spectral image reconstruction from sparse-view data. RecNet2 is incorporated into the residual iterative framework. The attention technique is applied to RecNet2 based on the existing CNN-based RecNet1.

Decomposition module

As shown in Figure 2D, the paired networks DecNet1, DecNet2, and GenNet in the decomposition module are all dual interactive Wasserstein generative adversarial networks (DIWGAN) (18) that use the Wasserstein distance as the loss function. During the upsampling process, the dual-channel paths interact to exchange information. The introduction of GenNet not only forms a closed loop for the entire framework, but also feeds back material information into the spectral images, further improving the reconstruction effect.

Attention module

The attention technique is employed in RecNet2. RecNet2 mainly comprises several convolutional layers, a hidden feature layer, a linear projection layer, and three transformer modules for each path. The hidden feature layer is used to flatten the input feature map into a sequence. The linear projection layer is used to embed the sequence into the latent space. The transformer modules employ cascaded multi-head self-attention layers and multilayer perceptrons (MLPs) to capture the distant spatial correlations in the images. All the modules are generalized into a dual-channel multi-scale network to simultaneously extract the global and local spatial features of the residual images.

ICRIN mathematical model

After establishing the mathematical model of ICRIN, we sought to optimize three variables: the reconstructed spectral image $f$, the material image $m$, and the sinogram-domain residual error $y$. The proposed framework (shown in Figure 2), constructed using Eqs. [6], [11], and [19], integrates a feedback mechanism (denoted as the $\beta$-branch). The output of $G(\cdot)$ and the image $f$ are integrated using weighting factors $\alpha = [\alpha_1, \dots, \alpha_C]^T$ and $\beta = [\beta_1, \dots, \beta_C]^T$ with entries in $[0, 1]$ to produce a fusion image $\Pi_{f,m}$; that is, $\Pi_{f,m} = \alpha f + \beta G(m)$ (where the products denote column-wise scaling), subsequently simplified as $\Pi = \alpha f + \beta G(m)$. Here, $\Pi$ can be considered an intermediate variable that innovatively feeds the material information back into the framework and fuses it with the spectral image. Optimizing $\Pi$ is equivalent to optimizing both the image $f$ and the material $m$. This approach not only preserves the original reconstruction information but also incorporates new information from the material domain, which helps enhance the performance of ICRIN. To further simplify the derivation, we set $\alpha + \beta = 1$. A regularization method is then adopted to find an approximate solution for the variables $y$, $f$, and $m$. In this case, our model is described as a regularized model-based reconstruction problem, which can be obtained by minimizing:

$$ \underset{f,\,y,\,m}{\arg\min} \left( \frac{1}{2}\left\| \Phi_1(A\Pi - y) - \Pi \right\|_2^2 + \frac{\lambda_1}{2}\left\| A\Pi - y^{(0)} - y \right\|_2^2 + \frac{\lambda_2}{2}\left\| D_1(f) - m \right\|_2^2 + \rho \left\| Hf \right\|_1 \right) \quad [2] $$

where $\lambda_i > 0$ ($i = 1, 2$) and $\rho > 0$ are hyperparameters that balance the data fidelity and regularization terms, respectively. Inspired by the ACID (16) deep hybrid baseline model, this objective function incorporates both image fidelity and material fidelity, while retaining the stability and convergence properties of ACID. Specifically, the first term enforces image-domain fidelity by minimizing the difference between the output $\Phi_1(\cdot)$ and $\Pi$; thus, the predicted fusion image $\Pi$ should not deviate from the original reconstructed image $\Phi_1(A\Pi - y)$ during the iteration. A residual error $y$ is introduced; this intermediate variable of the correction mechanism is optimized to closely match the difference between the predicted data and the measured data. Therefore, the second term enforces data consistency by minimizing the error between $A\Pi - y$ and $y^{(0)}$. The third term enforces material fidelity by minimizing the difference between the output of the material decomposition network, $D_1(\cdot)$, and the material image, $m$. To stabilize the network and make it converge to the solutions of the operator equation, we introduce a unitary matrix $H$ of size $N_1 \times N_2$ as the sparse prior, where $H^*$ is the adjoint of $H$. The last term, $\rho\|Hf\|_1$, is a sparse constraint term. To solve the objective function (Eq. [2]), we define its right-hand side as $\Psi(y, m, f)$. The block coordinate descent method can then be used to alternately solve for the involved variables via three sub-problems:

$$ \begin{cases} y^{(k+1)} = \underset{y}{\arg\min}\ \Psi(y, m^{(k)}, f^{(k)}) \\ m^{(k+1)} = \underset{m}{\arg\min}\ \Psi(y^{(k+1)}, m, f^{(k)}) \\ f^{(k+1)} = \underset{f}{\arg\min}\ \Psi(y^{(k+1)}, m^{(k+1)}, f) \end{cases} \quad [3] $$

where $k$ is the iteration index, the maximum number of iterations is $K$, and $k = 0, 1, \dots, K$. To simplify the expressions, we denote $\Pi^{(k)} = \alpha f^{(k)} + \beta G(m^{(k)})$, $\Pi_f = \alpha f + \beta G(m^{(k+1)})$, and $\Pi_m = \alpha f^{(k)} + \beta G(m)$.
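The alternation in Eq. [3] is ordinary block coordinate descent. A minimal generic sketch, with hypothetical callables standing in for the three per-variable minimizers of $\Psi$:

```python
def block_coordinate_descent(f0, m0, step_y, step_m, step_f, K):
    """Generic alternation of Eq. [3]. Each step_* is a caller-supplied
    function that (approximately) minimizes Psi over its own variable with
    the other two held fixed; the concrete ICRIN updates are Eqs. [6],
    [11], and [19]."""
    f, m = f0, m0
    for _ in range(K):
        y = step_y(m, f)   # y^(k+1) = argmin_y Psi(y, m^(k), f^(k))
        m = step_m(y, f)   # m^(k+1) = argmin_m Psi(y^(k+1), m, f^(k))
        f = step_f(y, m)   # f^(k+1) = argmin_f Psi(y^(k+1), m^(k+1), f)
    return f, m
```

For instance, with the toy objective $\Psi = (y-m)^2 + (m-f)^2 + (f-3)^2$, the exact per-variable minimizers are `step_y = lambda m, f: m`, `step_m = lambda y, f: (y + f) / 2`, and `step_f = lambda y, m: (m + 3) / 2`, and the alternation converges to $y = m = f = 3$.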

Updating y

To update y, the first line of Eq. [3] is expanded into:

$$ y^{(k+1)} = \underset{y}{\arg\min} \left( \frac{1}{2}\left\| \Phi_1(A\Pi^{(k)} - y) - \Pi^{(k)} \right\|_2^2 + \frac{\lambda_1}{2}\left\| y^{(0)} - (A\Pi^{(k)} - y) \right\|_2^2 \right) \quad [4] $$

Setting the derivative of the right-hand side of Eq. [4] to zero gives:

$$ \left( \frac{\partial \Phi_1(A\Pi^{(k)} - y)}{\partial (A\Pi^{(k)} - y)} \right)^{T} \left( \Phi_1(A\Pi^{(k)} - y) - \Pi^{(k)} \right) = \lambda_1 \left( A\Pi^{(k)} - y - y^{(0)} \right) \quad [5] $$

We assume that the network $\Phi_1(\cdot)$ is well-trained; that is, the reconstruction problem can be approximated as $A\Phi_1(\cdot) \approx (\cdot)$. The term $\left(\partial\Phi_1(\cdot)/\partial(\cdot)\right)^{T}$ can be seen as the backpropagation of the network $\Phi_1(\cdot)$ (19). We can solve Eq. [5] with this approximation and obtain the result of the first sub-problem:

$$ y^{(k+1)} \approx \frac{\lambda_1 \left( A\Pi^{(k)} - y^{(0)} \right)}{\lambda_1 - 1} \quad [6] $$

According to Eq. [6] and Figure 2A, the residual error $y^{(k+1)}$ in the $(k+1)$-th iteration can be inferred from the fusion image $\Pi^{(k)}$ in the $k$-th iteration and the initial sinogram $y^{(0)}$.

Updating m

Similarly, we can update the corresponding material using the residual material; that is, solve the sub-problem in the second line of Eq. [3]:

$$ m^{(k+1)} = \underset{m}{\arg\min} \left\{ \frac{1}{2}\left\| \Phi_1(A\Pi_m - y^{(k+1)}) - \Pi_m \right\|_2^2 + \frac{\lambda_1}{2}\left\| y^{(0)} - (A\Pi_m - y^{(k+1)}) \right\|_2^2 + \frac{\lambda_2}{2}\left\| D_1(f^{(k)}) - m \right\|_2^2 \right\} \quad [7] $$

Taking the derivative of Eq. [7] and setting it to zero, we obtain:

$$ \lambda_1 \beta A^{T}\left( y^{(0)} - A\Pi_m + y^{(k+1)} \right)\left( \frac{\partial G(m)}{\partial m} \right)^{T} \approx \lambda_2\left( m - D_1(f^{(k)}) \right) \quad [8] $$

Similarly, the network $G(\cdot)$ is well-trained, and we have $\partial G(\cdot)/\partial(\cdot) \approx B$. We then substitute $y^{(0)}$ in Eq. [8] using Eq. [6]:

$$ \lambda_1 \beta A^{T}\left( A\Pi^{(k)} + \frac{1}{\lambda_1} y^{(k+1)} - A\Pi_m \right) B^{T} \approx \lambda_2\left( m - D_1(f^{(k)}) \right) \quad [9] $$

Using the conclusion $G(m^{(k)}) \approx m^{(k)} B$ from (15), Eq. [9] can be simplified as:

$$ \lambda_1 \beta^{2}\left( m^{(k)} - m \right) + \beta A^{T} y^{(k+1)} B^{T} = \lambda_2\left( m - D_1(f^{(k)}) \right) \quad [10] $$

Finally, according to $\Phi_2(y) \approx A^{T}y$ and $D_2(\Phi_2) \approx \Phi_2 B^{T}$, we have:

$$ m^{(k+1)} \approx \frac{1}{\lambda_2 + \lambda_1\beta^{2}}\left( \lambda_2 D_1(f^{(k)}) + \lambda_1\beta^{2} m^{(k)} + \beta D_2\!\left(\Phi_2(y^{(k+1)})\right) \right) \quad [11] $$

Figure 2A also shows the iterative process of the basis material $m^{(k+1)}$ in the $(k+1)$-th iteration: it is obtained from the spectral image $f^{(k)}$ through the material decomposition network $D_1(\cdot)$ and updated by adding the residual material $D_2(\Phi_2(y^{(k+1)}))$, which is consistent with Eq. [11].

Updating f

The last sub-problem is updating the spectral image $f$. Letting $\bar{f} = Hf$, the unitary transformation gives $H^{*}H = I$. Letting $\Pi_f^{*} = \alpha H^{*}\bar{f} + \beta G(m^{(k+1)})$, the optimization problem in the third line of Eq. [3] can be expressed as:

$$ \bar{f}^{(k+1)} = \underset{\bar{f}}{\arg\min} \left( \frac{1}{2}\left\| \Phi_1(A\Pi_f^{*} - y^{(k+1)}) - \Pi_f^{*} \right\|_2^2 + \frac{\lambda_1}{2}\left\| y^{(0)} - (A\Pi_f^{*} - y^{(k+1)}) \right\|_2^2 + \frac{\lambda_2}{2}\left\| D_1(H^{*}\bar{f}) - m^{(k+1)} \right\|_2^2 + \rho\left\| \bar{f} \right\|_1 \right) \quad [12] $$

The derivative of the right side can be computed as follows:

$$ \lambda_1 \alpha H A^{T}\left( y^{(0)} + y^{(k+1)} - A\Pi_f^{*} \right) - \rho\,\mathrm{sgn}(\bar{f}) \approx \lambda_2 H \left( D_1(H^{*}\bar{f}) - m^{(k+1)} \right)\left( \frac{\partial D_1(H^{*}\bar{f})}{\partial (H^{*}\bar{f})} \right)^{T} \quad [13] $$

We also assume that the network $D_1(\cdot)$ is well-trained; that is, the decomposition problem can be approximated as $D_1(\cdot) \approx (\cdot)\,B^{T}$, and $\left(\partial D_1(\cdot)/\partial(\cdot)\right)^{T}$ can be seen as the backpropagation of the network $D_1(\cdot)$. Similar to the study by Pan et al. (19), we solve Eq. [13] with the approximation $D_1(\cdot)\left(\partial D_1(\cdot)/\partial(\cdot)\right)^{T} \approx (\cdot)$. Further, we substitute $y^{(0)}$ in Eq. [13] using Eq. [6]:

$$ \lambda_1 \alpha H A^{T}\left( A\Pi^{(k)} + \frac{1}{\lambda_1} y^{(k+1)} - A\Pi_f^{*} \right) - \rho\,\mathrm{sgn}(\bar{f}) \approx \lambda_2 H \left( D_1(H^{*}\bar{f}) - m^{(k+1)} \right)\left( \frac{\partial D_1(H^{*}\bar{f})}{\partial (H^{*}\bar{f})} \right)^{T} \quad [14] $$

Similarly, according to $\Phi_2(y) \approx A^{T}y$, we have:

$$ \bar{f} \approx \frac{1}{\lambda_2 + \lambda_1\alpha^{2}}\left( \lambda_1\alpha^{2} H f^{(k)} + \alpha H \Phi_2(y^{(k+1)}) + (\lambda_2 - \lambda_1\alpha\beta)\, H G(m^{(k+1)}) + \lambda_1\alpha\beta\, H G(m^{(k)}) - \rho\,\mathrm{sgn}(\bar{f}) \right) \quad [15] $$

Eq. [15] can be solved by soft-threshold filtering, and we have:

$$ f^{(k+1)} \approx \frac{\lambda_1\alpha}{\lambda_2 + \lambda_1\alpha^{2}}\, H^{*} S_{\omega}\!\left( H\left( \Pi^{(k)} + \frac{1}{\lambda_1}\Phi_2(y^{(k+1)}) + \frac{\lambda_2 - \lambda_1\alpha\beta}{\lambda_1\alpha}\, G(m^{(k+1)}) \right) \right) \quad [16] $$

where $S_{\omega}$ is the soft-thresholding kernel with parameter $\omega = \rho/(\lambda_2 + \lambda_1\alpha^{2})$, which can be defined as:

$$ S_{\omega}(x) = \begin{cases} 0, & |x| < \omega \\ x - \mathrm{sgn}(x)\,\omega, & \text{otherwise} \end{cases} \quad [17] $$
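Eq. [17] acts element-wise; a minimal NumPy version:

```python
import numpy as np

def soft_threshold(x, omega):
    """Element-wise soft-thresholding kernel S_omega of Eq. [17]:
    zero out entries smaller than omega in magnitude, shrink the rest."""
    return np.where(np.abs(x) < omega, 0.0, x - np.sign(x) * omega)
```

For example, `soft_threshold(np.array([-2.0, -0.1, 0.0, 0.5, 3.0]), 1.0)` yields `[-1.0, 0.0, 0.0, 0.0, 2.0]`.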

Eq. [16] can be extended to non-unitary discrete gradient transformations for total variation (TV) minimization. This means that the sparse regularization term in Eq. [12] can be written as a TV function for each energy-channel image. Eq. [18] is expressed as follows:

$$ \mathrm{TV}(\Pi_c) = \sum_{i=2}^{N_1} \sum_{j=2}^{N_2} \left( \left| \Pi_c(i,j) - \Pi_c(i-1,j) \right| + \left| \Pi_c(i,j) - \Pi_c(i,j-1) \right| \right) \quad [18] $$

where $(i, j)$ is the pixel coordinate of the reconstructed image. Thus, the residual-updated $\Pi$ is optimized through the CS-based module in Figure 2A, and we simply select TV filtering. In the experiment, the coefficient of $G(m^{(k+1)})$ in Eq. [16] is set to a relatively small value and can thus be ignored. Hence, the actual iteration can be further simplified as:

$$ f^{(k+1)} \approx \frac{\lambda_1\alpha}{\lambda_2 + \lambda_1\alpha^{2}}\, \mathrm{TV}\!\left( \Pi^{(k)} + \frac{1}{\lambda_1}\Phi_2(y^{(k+1)}) \right) \quad [19] $$

Figure 2A also shows the iterative update of the spectral image $f^{(k+1)}$ in the $(k+1)$-th iteration, which is obtained by adding the residual image update $\Phi_2(y^{(k+1)})$ to the fusion image $\Pi^{(k)}$ from the $k$-th iteration, followed by TV filtering.
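For reference, the anisotropic TV sum of Eq. [18] can be evaluated directly; a minimal NumPy sketch:

```python
import numpy as np

def total_variation(img):
    """Anisotropic TV of one energy-channel image, per Eq. [18]:
    the sums run over i = 2..N1 and j = 2..N2, i.e., both difference
    terms are evaluated on the interior slice img[1:, 1:]."""
    dv = np.abs(img[1:, 1:] - img[:-1, 1:])   # |Pi(i,j) - Pi(i-1,j)|
    dh = np.abs(img[1:, 1:] - img[1:, :-1])   # |Pi(i,j) - Pi(i,j-1)|
    return dv.sum() + dh.sum()
```

A constant image gives a TV of zero; note that Eq. [19] uses TV *filtering* (minimization of this functional), for which this sum is the objective, not the filter itself.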

Eq. [11] and Eq. [19] iteratively update the materials and spectral images, consistent with the workflow in Figure 2A. Therefore, our ICRIN achieves the dual tasks of material decomposition and spectral CT reconstruction in the process of collaborative optimization.
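The closed-form updates of Eqs. [6], [11], and [19] assemble into the following test-time loop. This is an illustrative sketch only: the five trained networks and the TV module are abstracted as caller-supplied functions (the names `phi1`, `phi2`, `d1`, `d2`, `g`, and `tv_filter` are ours, not from the released implementation), and $\alpha$, $\beta$ are taken as scalars rather than per-channel vectors:

```python
import numpy as np

def icrin_iterate(y0, A, phi1, phi2, d1, d2, g, tv_filter,
                  alpha, beta, lam1, lam2, K):
    """Sketch of test-time ICRIN iteration. phi1/phi2/d1/d2/g stand in for
    the trained RecNet1/RecNet2/DecNet1/DecNet2/GenNet; tv_filter for the
    CS-based module. All network parameters are fixed during iteration."""
    f = phi1(y0)                          # f^(0): initial reconstruction
    m = d1(f)                             # m^(0): initial decomposition
    for _ in range(K):
        Pi = alpha * f + beta * g(m)      # fusion image Pi^(k)
        # Eq. [6]: residual sinogram update
        y = lam1 * (A @ Pi - y0) / (lam1 - 1)
        # Eq. [11]: material update via decomposed residual material
        m = (lam2 * d1(f) + lam1 * beta**2 * m
             + beta * d2(phi2(y))) / (lam2 + lam1 * beta**2)
        # Eq. [19]: image update via residual image plus TV filtering
        f = lam1 * alpha / (lam2 + lam1 * alpha**2) * tv_filter(
            Pi + phi2(y) / lam1)
    return f, m
```

As a sanity check, with idealized operators (identity networks, $A = I$, and $\lambda_2 = \lambda_1\alpha\beta$, which is exactly the choice that makes the $G(m^{(k+1)})$ coefficient in Eq. [16] vanish), a consistent solution is a fixed point of the loop.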

Training and testing

A multi-domain joint training approach is adopted for network training. Since our framework contains five networks, to address the high computational resource demands of integrated end-to-end training, the networks are jointly trained in the image and material domains, as well as in the residual domain. Specifically, the mapping is first constructed from the raw measured data $y^{(0)}$ to the initial reconstructed image $f^{(0)} = \Phi_1(y^{(0)})$, as well as to the initial decomposed materials $m^{(0)} = D_1(\Phi_1(y^{(0)}))$. The initial materials are then processed through the network $G(\cdot)$ to obtain $G(m^{(0)})$. Based on the image label $f^*$ and material label $m^*$, a joint image-material loss function based on the mean squared error (MSE) is constructed:

$$ \mathrm{Loss}_1 = \sigma_1 L_{\mathrm{MSE}}(f^{(0)}, f^{*}) + \sigma_2 L_{\mathrm{MSE}}(m^{(0)}, m^{*}) + \sigma_3 L_{\mathrm{MSE}}(G(m^{(0)}), f^{*}) \quad [20] $$

where $\sigma_1$, $\sigma_2$, and $\sigma_3$ are the weighting parameters that control the trade-offs. After jointly training $\Phi_1(\cdot)$, $D_1(\cdot)$, and $G(\cdot)$, the images and materials can be collaboratively optimized. In the residual domain, the residual error is initialized as $y^{(1)} = A\Phi_1(y^{(0)}) - y^{(0)}$. The loss function of $\Phi_2(\cdot)$ is defined in the residual domain as:

$$ \mathrm{Loss}_2 = \sigma_4 L_{\mathrm{MAE}}(\Phi_2(y^{(1)}), f^{(0)} - f^{*}) + \sigma_5 L_{\mathrm{MAE}}(D_2(\Phi_2(y^{(1)})), m^{(0)} - m^{*}) \quad [21] $$

where σ4 and σ5 are the corresponding weighting parameters. Training in the residual domain ensures that the residual information can accurately approximate the true error.

After the training process, the trained network undergoes iterative optimization during testing, with all network parameters remaining fixed throughout the iterations. The workflow of ICRIN is summarized in Figure 3.

Figure 3 ICRIN training and testing. ICRIN, interpretable cascaded residual iterative network.

Data preparation

We conducted two groups of numerical experiments to: (I) verify the feasibility of our model using simulation datasets (Group 1); and (II) test the model’s effectiveness on real datasets (Group 2). Group 1 comprises medical images and phantom experiments. Group 2 comprises a real preclinical mouse dataset collected by the Medipix All Resolution System (MARS).

Generation of simulation datasets

For the simulation datasets, 1,587 single-energy reconstructed image slices were collected from eight patients at the Radiology Department of the Information Engineering University. The size of each image was 512×512. The training set comprised 1,387 image slices from seven patients, and the testing set comprised the remaining 200 image slices from one patient. Data augmentation was performed on the training datasets by rotating the images by 0, 90, 180, and 270 degrees, and flipping them horizontally, expanding the datasets by a factor of 8. In addition, an imaging system was used, comprising a Thales Hawkeye 130 radiation source, a Varian 4030E flat-panel detector, and a high-precision four-axis linkage mechanical system equipped with a stage for scanning the Quasi-Realistic model (QRM).

For the spectral CT reconstruction, our experiments were mainly conducted on 120 and 60 views of fan-beam datasets. For the material decomposition, our experiments primarily focused on bone and tissue decomposition in the thoracic cavity. The following describes the overall process of the simulated data, including the input sparse-view projections, clean spectral image labels, and material labels. Our simulation process is based on the discrete form of the physical model (Eq. [1]) of spectral imaging:

$$ y_c = -\ln\left( \sum_{E = E_{\min}^{c}}^{E_{\max}^{c}} W_c[E]\, \exp\left[ -A\mu(l,E) \right] \right) \quad [22] $$

As shown in Figure 4, the material labels were obtained from the prepared CT image slices through thresholding segmentation. $\mu(l,E)$ was obtained by the linear combination of the material labels $b_\tau(l)$ and their corresponding attenuation coefficients $\mu_\tau(E)$ at specific energies. The attenuation coefficients of the two materials were obtained from the National Institute of Standards and Technology (NIST) database (20). According to Eq. [22], a 120 kVp spectrum $W_c(E)$ was generated using the SpekCalc (21) software with tungsten-anode X-ray tube and Al filtration settings. The resulting spectrum was divided into low- and high-energy bins of [7, 70] and [70, 120] keV to simulate the photon-counting energy spectrum used in the real datasets. The spectral image labels were then projected on 720 views with a measurement matrix to obtain full-view spectral data (size: 720×3,072). Sparse-view data were obtained by performing 6× and 12× downsampling on the full-view data. The image labels were reconstructed from the full-view data using FBP (22).
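The discrete projection of Eq. [22] and the angular downsampling can be sketched as follows (NumPy; the function names and toy shapes are ours, not from the authors' pipeline):

```python
import numpy as np

def spectral_projection(A, mu, W_c, e_lo, e_hi):
    """Discrete polychromatic projection for one energy bin, per Eq. [22]:
    y_c = -ln( sum_E W_c[E] * exp(-A @ mu[:, E]) ) over E in [e_lo, e_hi).
    A: (P, Q) system matrix; mu: (Q, E_total) attenuation per energy;
    W_c: (E_total,) spectrum, renormalized within the bin."""
    trans = np.exp(-A @ mu[:, e_lo:e_hi])        # (P, n_E) transmission
    w = W_c[e_lo:e_hi] / W_c[e_lo:e_hi].sum()    # normalized bin weights
    return -np.log(trans @ w)                    # (P,) bin projection

def downsample_views(y_full, n_views_full, k):
    """k-fold angular downsampling, e.g., 720 -> 120 views at k = 6."""
    y = y_full.reshape(n_views_full, -1)         # (views, detector pixels)
    return y[::k]
```

A quick consistency check: if $\mu$ is constant across the bin (monochromatic case), the polychromatic projection reduces exactly to the linear projection $A\mu$.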

Figure 4 The flowchart of the simulated data preprocessing procedure. (A) Material attenuation coefficients of bone and soft tissue under different X-ray energies. (B) Simulated photon-counting CT energy spectra distribution. CT, computed tomography.

For the phantom experiments, the same simulation method was adopted to obtain the images of the QRM at 60 sparse views and the labeled images at 720 full views. Since it was a standard phantom, it was relatively easy to directly extract the target materials from the images based on the pixel proportion of the substances. To test the generalization of ICRIN under different conditions, as shown in Table 1, we set up two independent experiment groups with different distances from the X-ray source to the system isocenter (SOD) and from the X-ray source to the detector array (SDD), as well as other parameters. Group 1 was used for the simulation data experiment.

Table 1

The simulation experiment parameter settings

Datasets | Number of views | Energy bins (keV) | Data size | Distance (mm) | Detector pixel size (mm)
Group 1 | 60 | [7, 70], [70, 120] | 60×3,072 | SOD: 1,000; SDD: 1,500 | 0.204
Group 2 | 120 | [7, 70], [70, 120] | 120×600 | SOD: 156; SDD: 256 | 0.110

SDD, distance from the X-ray source to the detector array; SOD, distance from the X-ray source to the system isocenter.

Generation of real preclinical datasets

To further verify the generalization of the proposed algorithm, the mouse thoracic cavity was selected as real data for validation. An anesthetized mouse was scanned with an advanced MARS system. All parameters, such as detector elements, pixel size, SOD, and SDD, were the same as the real settings of Group 2 (see Table 1). In practice, we directly tested the ICRIN model’s performance on the real datasets using parameters pre-trained on the simulation datasets. We selected 126 real data projections (size: 720×600) and validated the performance of ICRIN on Group 2, using the same regularization parameter settings of ICRIN. The originally collected data were obtained under an X-ray spectrum of 120 kVp, with energy intervals of [7, 32], [32.1, 43], [43.1, 54], [54.1, 70], and [70, 120] keV. Spectrum merging was performed in the projection domain to obtain dual-energy data of [7, 70] and [70, 120] keV. Six-fold downsampling was then performed on the 720-view projection data to obtain sparse projections of 120 views as input for ICRIN. FBP reconstruction was performed on the 720-view full data, and the resulting reconstructed images served as labels. Based on these reconstructed images, a regularization method with meticulous parameter adjustment was adopted to conduct material decomposition and obtain the material labels.

Ethical statement

The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. As this study did not involve human trials, clinical diagnosis, treatment information, sensitive personal information, commercial interests, or any procedures that could harm the human body, and given its retrospective nature, it was exempt from ethical approval by the Information Engineering University Ethics Committee. The animal experiments in the study were conducted in accordance with the Laboratory Animal Guidelines for Ethical Review of Animal Welfare and approved by The University of Hong Kong Li Ka Shing Faculty of Medicine Ethics Committee.


Results

Our experiments were conducted on a PC equipped with a 24 GB NVIDIA Quadro RTX 6000 GPU, an Intel(R) Xeon(R) Gold 6234 CPU at 3.30 GHz, and 256 GB of RAM. The network was implemented in Python with the TensorFlow framework. The learning rate was set to 1×10^−4 for all networks, and the Adam optimizer was used with a batch size of 8. The networks Φ1, D1, and G were jointly trained for 120 epochs, and Φ2 and D2 for 100 epochs. The five sub-networks (two reconstruction networks, two decomposition networks, and one generation network) were trained independently. We compared our method with the following state-of-the-art methods: (I) for the image reconstruction task: FBP (22), TV (23), FBP-based convolutional network (FBPConvNet) (24), learned experts’ assessment-based reconstruction network (LEARN) (25), fast iterative shrinkage/thresholding algorithm-based network (FISTA-Net) (26), and multi-domain integrative swin transformer network (MIST-Net) (27); (II) for the material decomposition task: direct inversion (28), material decomposition framework with prior knowledge aware iterative denoising (MD-PKAID) (29), CNN (30), domain transformation enabled end-to-end deep CNN (DIRECT-Net) (31), butterfly-structured network for dual-energy CT (Butterfly-Net) (32), and DIWGAN (18). The peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) were used to quantitatively evaluate the two tasks.
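As a concrete reference for the evaluation protocol, PSNR is computed from the mean squared error against the label image. A minimal NumPy sketch is shown below (in practice, library implementations such as those in skimage.metrics are commonly used, and SSIM follows the standard windowed formulation):

```python
import numpy as np

def psnr(reference, test, data_range=None):
    """Peak signal-to-noise ratio in dB between a reference and a test image."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    if data_range is None:
        # Dynamic range of the reference image (e.g., 1.0 for a [0, 1] image).
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy example: a unit-range image with a uniform 1% error gives 40 dB.
gt = np.zeros((64, 64)); gt[16:48, 16:48] = 1.0
noisy = gt + 0.01 * np.ones_like(gt)
print(round(psnr(gt, noisy), 1))  # 40.0
```

Higher PSNR therefore reflects a lower pixel-wise error relative to the label, which is how the tables below should be read.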

Simulated data results

Figure 5 shows the representative reconstruction and material decomposition results of one slice (case 1) selected from the medical images in Group 1. Compared with the traditional and DL-based methods, ICRIN effectively restored texture details in the regions of interest (ROIs) of the sparse-view spectral images. As Figure 5 shows, under the 60-view condition, the spectral CT images reconstructed by the traditional FBP and TV methods were of low quality. The results of FBPConvNet, LEARN, and MIST-Net obtained very high PSNR scores; however, the images were oversmoothed. FISTA-Net obtained clearer images, but sparse-view artifacts remained in the spectral images. The 5th and 7th rows show the decomposed bone and tissue. For the material decomposition comparison, FBPConvNet was added as a preprocessing module before each comparison algorithm, enabling the other material decomposition algorithms to be compared with ICRIN on the same level. In the ROIs of the two materials, the classic image-domain direct inversion and the iterative MD-PKAID only restored the approximate material contours. The results of CNN, Butterfly-Net, and DIRECT-Net exhibited substantial loss of structural information. Conversely, ICRIN generated clear images and detailed textures that were superior to those of the other algorithms.

Figure 5 Reconstruction and material decomposition results in case 1 from simulated datasets of Group 1. (A1-H1) and (A2-H2) represent label, FBP, TV, FBPConvNet, LEARN, FISTA-Net, MIST-Net, and ICRIN for the [7, 70] kVp and [70, 120] kVp reconstructed spectral images, respectively. (A3-H3) and (A4-H4) represent label, Direct-inversion, MD-PKAID, DIRECT-Net, CNN, Butterfly-Net, DIWGAN, and ICRIN for the bone and tissue materials, respectively. The 2nd, 4th, 6th, and 8th rows show the difference images relative to the GT. The red box and the yellow box indicate the ROI and the zoomed-in ROI, respectively. The display windows for spectral images are set to [0.01, 0.025] mm−1, and for bone and tissue images to [0, 0.8], respectively. The figure displays the color bars of the error images. CNN, convolutional neural network; DIRECT-Net, domain transformation enabled end-to-end deep CNN; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; FISTA-Net, fast iterative shrinkage/thresholding algorithm-based network; GT, ground truth; ICRIN, interpretable cascaded residual iterative network; LEARN, learned experts’ assessment-based reconstruction network; MD-PKAID, material decomposition framework with prior knowledge aware iterative denoising; MIST-Net, multi-domain integrative swin transformer network; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; TV, total variation.

Figure 6 shows the results of our algorithm on another slice (case 2) selected from the 120-view simulated dataset in Group 1. Compared with the GTs, the TV, FBPConvNet, and MIST-Net results smoothed out many tissue texture structures, while LEARN and FISTA-Net lost edge contours in the ROIs. In relation to material decomposition, the bone and soft tissue decomposed by DIWGAN exhibited stripe artifacts and small missing regions, as indicated by the yellow circle. In summary, from a visual perspective, our ICRIN achieved better results than the other methods.

Figure 6 Representative reconstruction and decomposition results in case 2 from simulated datasets of Group 1. (A1-H1,A2-H2) The [7, 70] kVp and [70, 120] kVp reconstruction results: (A1,A2) label; (B1,B2) FBP; (C1,C2) TV; (D1,D2) FBPConvNet; (E1,E2) LEARN; (F1,F2) FISTA-Net; (G1,G2) MIST-Net; (H1,H2) ICRIN. (A3-H3,A4-H4) The decomposition results: (A3,A4) label; (B3,B4) Direct-inversion; (C3,C4) MD-PKAID; (D3,D4) DIRECT-Net; (E3,E4) CNN; (F3,F4) Butterfly-Net; (G3,G4) DIWGAN; (H3,H4) ICRIN. The red box and the yellow box indicate the ROI and the zoomed-in ROI, respectively. The yellow circle indicates the detailed structure in the tissue. The display windows for spectral images are set to [0.01, 0.025] mm−1, and for bone and tissue images to [0, 0.8], respectively. CNN, convolutional neural network; DIRECT-Net, domain transformation enabled end-to-end deep CNN; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; FISTA-Net, fast iterative shrinkage/thresholding algorithm-based network; ICRIN, interpretable cascaded residual iterative network; LEARN, learned experts’ assessment-based reconstruction network; MD-PKAID, material decomposition framework with prior knowledge aware iterative denoising; MIST-Net, multi-domain integrative swin transformer network; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; TV, total variation.

We also conducted a statistical analysis of the quantitative indicators on 200 testing image slices. Table 2 summarizes the quantitative evaluation indicators on the Group 1 dataset. The results reveal that ICRIN outperforms the other state-of-the-art methods on all indicators and achieves the best performance on the simulation datasets, indicating that ICRIN has strong generalization ability.

Table 2

The quantitative results in the Group 1 dataset (60 views)

Task 1 (image reconstruction):

| Algorithm | PSNR↑, high-energy channel | PSNR↑, low-energy channel | SSIM↑, high-energy channel | SSIM↑, low-energy channel |
|---|---|---|---|---|
| FBP | 25.292±1.571 | 26.383±1.746 | 0.629±0.052 | 0.622±0.055 |
| TV | 27.653±1.891 | 27.881±1.992 | 0.668±0.051 | 0.561±0.060 |
| FBPConvNet | 37.884±3.232 | 37.911±3.424 | 0.975±0.023 | 0.976±0.023 |
| LEARN | 37.416±2.363 | 40.230±2.334 | 0.984±0.009 | 0.983±0.010 |
| FISTA-Net | 38.903±2.142 | 38.908±2.509 | 0.945±0.014 | 0.963±0.011 |
| MIST-Net | 44.040±3.111 | 43.732±3.358 | 0.985±0.006 | 0.985±0.006 |
| Proposed | 45.770±1.533 | 45.511±1.555 | 0.991±0.007 | 0.987±0.006 |

Task 2 (material decomposition):

| Algorithm | PSNR↑, bone | PSNR↑, soft tissue | SSIM↑, bone | SSIM↑, soft tissue |
|---|---|---|---|---|
| Inverse | 26.312±4.071 | 12.302±2.013 | 0.854±0.100 | 0.670±0.065 |
| MD-PKAID | 35.824±2.304 | 18.291±3.063 | 0.899±0.043 | 0.868±0.055 |
| DIRECT-Net | 32.042±1.961 | 17.402±2.296 | 0.892±0.042 | 0.867±0.039 |
| CNN | 35.657±2.983 | 24.662±3.081 | 0.970±0.014 | 0.909±0.025 |
| Butterfly-Net | 36.812±3.381 | 27.260±3.402 | 0.978±0.016 | 0.930±0.018 |
| DIWGAN | 40.503±3.426 | 28.952±5.695 | 0.989±0.007 | 0.950±0.021 |
| Proposed | 44.401±2.265 | 35.941±4.182 | 0.994±0.003 | 0.982±0.007 |

Data are presented as mean ± standard deviation; in each column, the best result was achieved by the proposed method. An up arrow (↑) signifies that a higher value is desirable. CNN, convolutional neural network; DIRECT-Net, domain transformation enabled end-to-end deep CNN; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; FISTA-Net, fast iterative shrinkage/thresholding algorithm-based network; LEARN, learned experts’ assessment-based reconstruction network; MD-PKAID, material decomposition framework with prior knowledge aware iterative denoising; MIST-Net, multi-domain integrative swin transformer network; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; TV, total variation.

Figure 7 shows the performance of our algorithm in terms of the visual effects of the reconstruction and material decomposition results of the QRM phantom in Group 1. The left panel shows the optical image of the phantom in the actual experiment and a schematic diagram of its components. The phantom mainly includes five tissue-equivalent substances: muscle, spongious bone, adipose, CT water, and cortical bone. Each substance forms a cylinder with a diameter of 20 mm, embedded in a cylinder with a diameter of 100 mm. Cortical bone and water were the two target materials in this experiment. As the right panel shows, under the 60 sparse-view condition and without retraining on the QRM datasets, our framework performed well in the spectral imaging of multiple mixed substances. Notably, the tissue and bone in our medical training data are close in density to the CT water and cortical bone of the phantom, respectively. However, since the density of spongious bone is closer to that of water, the area indicated by the red arrow in the water material map contains a large amount of spongious bone, while the corresponding area in the bone material map contains only a small amount. Thus, although ICRIN cannot completely distinguish between the two types of bone materials, it identifies the concentration information of different materials rather than relying on morphological threshold segmentation. Therefore, water and cortical bone can be decomposed from the mixed substances.

Figure 7 The left panel shows the standard QRM phantom scanned by the laboratory imaging system. Two substances, cortical bone and CT-equivalent water, are selected from the five material components for decomposition. The right panel shows the visual effects of the reconstruction and material decomposition results of the QRM phantom. (A1-E1,A2-E2) The [7, 70] and [70, 120] kVp reconstruction results: (A1,A2) label; (B1,B2) FBP; (C1,C2) TV; (D1,D2) FBPConvNet; (E1,E2) ICRIN. (A3-E3,A4-E4) The decomposition results: (A3,A4) label; (B3,B4) Direct-inversion; (C3,C4) Iteration Method; (D3,D4) DIWGAN; (E3,E4) ICRIN. The red arrow indicates the region of decomposed spongious bone of interest in the material image. The display windows for spectral images are set to [0.01, 0.03] mm−1, and for material images to [0.15, 0.8]. CT, computed tomography; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; ICRIN, interpretable cascaded residual iterative network; PSNR, peak signal-to-noise ratio; QRM, Quasi-Realistic model; SSIM, structural similarity index measure.

Preclinical mouse validation

Figure 8 shows the qualitative results and the corresponding ROI of one slice (case 3) selected from the real datasets in Group 2. In relation to spectral image reconstruction, although TV and FBPConvNet suppressed most sparse-view artifacts, their results were oversmoothed. LEARN, FISTA-Net, and MIST-Net showed structure and texture deterioration due to their poor generalizability across datasets. As the magnified ROIs show, our ICRIN not only recovered accurate structures from artifacts but also effectively suppressed the noise.

In relation to the material decomposition results in Figure 8, DIRECT-Net, CNN, and Butterfly-Net incorrectly decomposed structures and produced unclear textures in both materials. Although MD-PKAID and DIWGAN produced relatively good visual effects, the fine textures of the bone and the subtle changes in the tissue were blurred. Conversely, ICRIN provided richer structural details, as indicated by the yellow arrow, and richer textures than the other methods, as indicated by the yellow circle. In the noisy spectral images, these structures can only be vaguely identified, but our algorithm was able to capture subtle texture changes in the material structures. Figure 9 specifically compares the differences in material decomposition details between DIWGAN and ICRIN. Notably, the ROI1 of the bone material generated by ICRIN was closer to the GT in certain fine structures, while the results of DIWGAN showed insufficient resolution for some structures. For the ROI2 of the tissue material, the results of DIWGAN still contained some residual radial artifacts, while the structure generated by ICRIN was more realistic and natural. The quantitative SSIM metric also indicates that the imaging quality of ICRIN in the ROIs was improved by approximately 0.02 compared with that of DIWGAN.

Figure 8 Representative reconstruction and decomposition results in case 3 from 120-view real datasets of Group 2. For the reconstructed low- and high-energy images: (A) FBP; (B) TV; (C) FBPConvNet; (D) LEARN; (E) FISTA-Net; (F) MIST-Net; (G) ICRIN; (H) label. For the two decomposed materials: (A) Direct-inversion; (B) MD-PKAID; (C) DIRECT-Net; (D) CNN; (E) Butterfly-Net; (F) DIWGAN; (G) ICRIN; (H) label. The red box and the yellow box indicate the ROI and the zoomed-in ROI, respectively. The yellow circle and arrow indicate the detailed structure in the tissue and bone, respectively. The display windows for spectral images are set to [0.01, 0.025] mm−1, and for bone and tissue images to [0.4, 1.2] and [0.15, 0.4], respectively. CNN, convolutional neural network; DIRECT-Net, domain transformation enabled end-to-end deep CNN; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; FISTA-Net, fast iterative shrinkage/thresholding algorithm-based network; ICRIN, interpretable cascaded residual iterative network; LEARN, learned experts’ assessment-based reconstruction network; MD-PKAID, material decomposition framework with prior knowledge aware iterative denoising; MIST-Net, multi-domain integrative swin transformer network; TV, total variation.
Figure 9 Detailed qualitative and quantitative comparison of the real data (case 3) material decomposition performance between DIWGAN and ICRIN. The first row and the second row show the ROI1 of bone and the ROI2 of tissue, respectively. DIWGAN, dual interactive Wasserstein generative adversarial networks; ICRIN, interpretable cascaded residual iterative network; ROI, region of interest; SSIM, structural similarity index measure.

Further, we conducted a statistical analysis of the quantitative indicators on 126 testing image slices. Table 3 sets out the quantitative metrics, demonstrating our method’s advantages on the real datasets.

Table 3

The quantitative results in the Group 2 dataset (120 views)

Task 1 (image reconstruction):

| Algorithm | PSNR↑, high-energy channel | PSNR↑, low-energy channel | SSIM↑, high-energy channel | SSIM↑, low-energy channel |
|---|---|---|---|---|
| FBP | 24.561±1.232 | 24.790±1.523 | 0.912±0.013 | 0.923±0.028 |
| TV | 26.393±1.032 | 27.570±1.021 | 0.934±0.054 | 0.940±0.044 |
| FBPConvNet | 28.543±2.213 | 30.262±1.233 | 0.950±0.012 | 0.954±0.025 |
| LEARN | 33.680±2.001 | 34.344±2.100 | 0.961±0.014 | 0.966±0.002 |
| FISTA-Net | 34.724±1.664 | 33.789±3.120 | 0.967±0.002 | 0.946±0.001 |
| MIST-Net | 36.742±1.255 | 37.280±2.098 | 0.974±0.002 | 0.978±0.002 |
| Proposed | 39.223±2.077 | 38.489±2.200 | 0.981±0.002 | 0.980±0.001 |

Task 2 (material decomposition):

| Algorithm | PSNR↑, bone | PSNR↑, soft tissue | SSIM↑, bone | SSIM↑, soft tissue |
|---|---|---|---|---|
| Inverse | 29.389±4.205 | 15.644±1.024 | 0.710±0.005 | 0.548±0.011 |
| MD-PKAID | 33.267±3.257 | 20.023±4.386 | 0.876±0.022 | 0.772±0.076 |
| DIRECT-Net | 36.758±1.524 | 28.640±3.178 | 0.921±0.015 | 0.863±0.006 |
| CNN | 38.297±2.845 | 30.045±3.167 | 0.930±0.022 | 0.872±0.024 |
| Butterfly-Net | 39.749±2.256 | 30.212±1.034 | 0.941±0.038 | 0.892±0.009 |
| DIWGAN | 40.823±1.207 | 32.775±2.442 | 0.972±0.009 | 0.921±0.026 |
| Proposed | 42.730±1.275 | 33.185±2.632 | 0.988±0.003 | 0.956±0.007 |

Data are presented as mean ± standard deviation; in each column, the best result was achieved by the proposed method. An up arrow (↑) signifies that a higher value is desirable. CNN, convolutional neural network; DIRECT-Net, domain transformation enabled end-to-end deep CNN; DIWGAN, dual interactive Wasserstein generative adversarial networks; FBP, filtered back projection; FBPConvNet, FBP-based convolutional network; FISTA-Net, fast iterative shrinkage/thresholding algorithm-based network; LEARN, learned experts’ assessment-based reconstruction network; MD-PKAID, material decomposition framework with prior knowledge aware iterative denoising; MIST-Net, multi-domain integrative swin transformer network; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; TV, total variation.

Ablation analysis

Due to the implementation of dual-task iteration in our framework, both iteration and module effectiveness were assessed in our ablation experiments, which were conducted using 60-view simulation datasets.

Iteration effectiveness

Our ICRIN achieved dual-task iterative optimization simultaneously. Figure 10 shows the gradual evolution of the reconstructed images and decomposed materials with increasing iterations. As the PSNR curves in Figure 10 (B1-B4) show, the highest improvements of the low- and high-energy spectral images and the bone and tissue materials were approximately 6.9, 6.6, 4.0, and 8.4 dB, respectively. These results indicate that, under reasonable parameter settings, the ICRIN framework iterates stably and converges.

Figure 10 Indicators of reconstructed images and decomposed materials in Group 1 with increased iterations. (A1,A3,B1,B3) The PSNR results; (A2,A4,B2,B4) the SSIM results. Case A: ICRIN without feedback mechanism (i.e., β=0). Case B: ICRIN with feedback mechanism. ICRIN, interpretable cascaded residual iterative network; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure.

Module effectiveness

We also conducted ablation experiments on the different modules. ICRIN includes two key optimization components: a feedback mechanism (β-branch) and a sub-network (RecNet2) based on the attention mechanism. As shown in Table 4, by removing the β-branch and replacing RecNet2 with RecNet1, we defined four variants: ICRIN with two RecNet1 models and without the β-branch (NFCN2), ICRIN with two RecNet1 models and the β-branch (FCN2), ICRIN with RecNet2 but without the β-branch (NFCN-Trans), and the proposed ICRIN with both RecNet2 and the β-branch (FCN-Trans). As Figure 10 shows, after introducing the fusion image ∏ (i.e., the β-branch), the PSNR of the reconstructed images increased by approximately 3 dB, while the PSNR of the material images improved by approximately 1–3 dB. As shown in Figure 11, we validated these four variants on 200 testing image slices and conducted a statistical analysis. The center position and dispersion of each cluster graph reflect the average level and variance of the data. The reconstruction and material decomposition results improved after the introduction of the β-branch and the addition of the transformer module. Thus, the fusion image ∏, which incorporates the feedback term βG(m) and the data fidelity term αf, improves the imaging performance for both the reconstruction and material decomposition tasks.
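The role of the fusion image ∏ can be illustrated schematically. The exact operator forms are defined in the Methods; in the sketch below, the data-fidelity correction and the generation-network output are toy stand-ins, and the additive combination is an assumption made only for illustration:

```python
import numpy as np

ALPHA, BETA = 1.0, 1e-4  # BETA matches the ablation setting in Table 4

def fuse(x, fidelity_term, material_feedback, alpha=ALPHA, beta=BETA):
    """Schematic fusion: combine the current image estimate x with the
    data-fidelity correction (alpha * f) and the material-domain feedback
    term (beta * G(m)) to form the fusion image Pi."""
    return x + alpha * fidelity_term + beta * material_feedback

x = np.zeros((4, 4))       # current spectral image estimate
f = 0.5 * np.ones((4, 4))  # stand-in for the data-fidelity term f
g_m = np.ones((4, 4))      # stand-in for the generation network output G(m)
pi = fuse(x, f, g_m)       # fusion image fed to the next iteration
print(float(pi[0, 0]))     # approximately 0.5001 with these toy inputs
```

Setting β = 0 in this sketch corresponds to the no-feedback variants (NFCN2 and NFCN-Trans) of the ablation study.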

Table 4

The different modules of the ablation experiments

| Type | Feedback branch (β=10^−4) | RecNet1 | RecNet2 |
|---|---|---|---|
| NFCN2 | × | ✓ | × |
| FCN2 | ✓ | ✓ | × |
| NFCN-Trans | × | × | ✓ |
| FCN-Trans | ✓ | × | ✓ |
Figure 11 Ablation analysis of different modules (iteration fixed at 150). (A1-D1) The PSNR; (A2-D2) the SSIM. (A1,A2) Indicators of [7, 70] kVp spectral reconstructed image. (B1,B2) Indicators of a [70, 120] kVp spectral reconstructed image. (C1,C2) Indicators of decomposed bone. (D1,D2) Indicators of decomposed soft tissue. PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure.

Anti-noise performance analysis

We analyzed the anti-noise performance of ICRIN. To address the degradation of reconstruction and material decomposition caused by measurement noise in practical spectral CT imaging, we conducted simulation experiments by adding Poisson noise to the original sinogram inputs. The intensity of the noise is controlled by the preset number of incident photons I0, and the noisy sinogram input is denoted as y_n^(0). The Poisson noise is introduced as follows:

y_n^(0) = −ln( PM( I_0 · e^(−y^(0)) ) / I_0 )   [23]

where PM is a Poisson random perturbation controlled by the parameter I0. We set up simulation experiments at 10-fold intervals within the range I0 ∈ [10^2, 10^6]; that is, Poisson noise was added to the data in Group 1 with Eq. [23], and the noisy FBP reconstruction images are shown in Figure 12. To ensure good generalization, we retrained the model on the noisy data, obtaining a model capable of handling various noise levels. Figure 13 shows the visual results of reconstruction and material decomposition of the 60 sparse-view data under various noise levels, and Figure 14 shows the curves of the quantitative indicators varying with the noise level. The qualitative and quantitative results show that a decrease in the number of photons led to a decline in the quality of the reconstructed and material images; however, the overall tissue structure was not severely destroyed. Hence, our ICRIN framework is relatively stable within a certain noise range.
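The noise-injection step of Eq. [23] can be sketched directly in NumPy. The clamping of zero counts is a practical assumption added here to keep the logarithm finite; the toy sinogram values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_poisson_noise(sinogram, I0):
    """Add photon-counting (Poisson) noise to a line-integral sinogram y.

    Transmitted photon counts are I0 * exp(-y); the noisy sinogram is then
    recovered as -ln(counts / I0), matching Eq. [23].
    """
    counts = rng.poisson(I0 * np.exp(-sinogram)).astype(np.float64)
    counts = np.maximum(counts, 1.0)  # avoid log(0) for fully absorbed rays
    return -np.log(counts / I0)

y = np.full((60, 128), 2.0)          # toy 60-view sinogram of line integrals
y_noisy = add_poisson_noise(y, I0=1e5)
print(y_noisy.shape)                  # (60, 128)
```

Lowering I0 from 10^6 toward 10^2 raises the relative count fluctuations and thus the noise in y_noisy, mirroring the noise sweep used in the experiments.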

Figure 12 The visual effect display of reconstructed spectral images of 60 sparse views under different noise levels with the number of photons I0 = 10^6, 10^5, 10^4, 10^3, and 10^2.
Figure 13 Representative reconstruction and material decomposition results of ICRIN at different noise levels. The first five columns represent the imaging results with the number of photons I0 = 10^6, 10^5, 10^4, 10^3, and 10^2, respectively, and the 6th column is the label. The four rows from top to bottom represent the low- and high-energy images, tissues, and bones, respectively. The display windows for spectral images and materials are set to [0.01, 0.025] mm−1 and [0, 0.8], respectively. ICRIN, interpretable cascaded residual iterative network.
Figure 14 Quantitative indicator curves for image reconstruction and material decomposition at different noise levels. (A,C) The PSNR results; (B,D) the SSIM results. The intensity of the noise is controlled by I0, which ranges from 10^6 to 10^2; the curves correspond to the indicators of the noisy images. The original image refers to the reconstructed image or material image obtained by processing the original noise-free data using ICRIN. H-E, high-energy; ICRIN, interpretable cascaded residual iterative network; L-E, low-energy; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure.

Discussion

This study proposed ICRIN, a novel interpretable cascaded residual iterative network for spectral CT imaging and material decomposition. By incorporating physical model-driven priors, CS priors, and data-driven priors in a hybrid-domain network framework, ICRIN achieves a significant improvement in model stability and data consistency in the dual tasks. The residual iterative mechanism employed in ICRIN, coupled with the transformer attention module, effectively extracts and emphasizes residual image features, and the cascaded residual-domain decomposition network updates the material images with material residuals. The interpretability of ICRIN is further strengthened by solving the objective function with alternating minimization, allowing for the joint optimization of the dual tasks.

The proposed feedback mechanism with weighting factors α and β is a significant addition to ICRIN, as it improves the stability and performance of the cascaded network by feeding material-domain information back into the image domain. As Figure 11 shows, the statistical results confirmed this conclusion, with the indicators improved in the FCN2 and FCN-Trans ablation studies. The superiority of ICRIN was shown through numerical simulations (Figures 5-7) and real preclinical experiments (Figure 8). The quantitative and qualitative results provide compelling evidence that the proposed method outperforms the other state-of-the-art approaches. To show that the ICRIN model is interpretable and convergent, we plotted the PSNR and SSIM curves as the number of iterations increased (Figure 10). Moreover, the inclusion of physical and CS priors in the network ensures that the reconstructed images and decomposed materials are not only accurate but also physically meaningful.

Although the current ICRIN is limited by model complexity to photon-counting CT imaging with two energy bins, it could easily be extended to support more energy bins and decompose more materials in the future. More efficient architectures, such as generative diffusion models (33,34), will be explored and applied to the framework. To summarize, ICRIN is suitable as a general framework for processing multi-task spectral CT imaging scenes (35,36), and has the potential to be extended, optimized, and applied to practical clinical problems, such as scenarios with more thresholds and K-edge contrast agents.


Conclusions

This study proposed a general ICRIN for sparse-view spectral CT reconstruction and material decomposition. A multi-domain hybrid framework was designed for optimization in multiple domains, and a residual-domain attention mechanism was applied to further improve ICRIN’s performance. Moreover, a mathematical model was used to elaborate the advantages of ICRIN in terms of interpretability. Most importantly, a feedback mechanism was proposed that feeds the decomposed material information back into the image domain for repeated optimization during the iterative process.

Our method was evaluated on both simulated datasets and preclinical datasets to assess sparse-view spectral CT reconstruction and material decomposition. The experimental results showed that our algorithm had advantages in both qualitative effects and quantitative indicators. Further, we showed the generalization of ICRIN with experiments on different sparse views and datasets. Finally, the results of our ablation experiments showed that the combination of the introduced feedback mechanism and data fidelity improved the performance and stability of the model.


Acknowledgments

The authors thank Dr. Varut Vardhanabhuti from the Department of Diagnostic Radiology at The University of Hong Kong, China, for providing the preclinical mouse datasets.


Footnote

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1895/dss

Funding: This work was supported by the National Natural Science Foundation of China (Nos. 62271504, 62101596 and 62201616), the Central Plains Science and Technology Innovation Leading Talent Project (No. 244200510015), Key Research and Development Special Project of Henan Province (Nos. 251111220600 and 251111312900), and Natural Science Foundation of Henan Province (No. 252300420395).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1895/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. As this study did not involve human trials, clinical diagnosis, treatment information, sensitive personal information, commercial interests, or any procedures that could harm the human body, and given its retrospective nature, it was exempt from ethical approval by the Information Engineering University Ethics Committee. The animal experiments were conducted in accordance with the Laboratory Animal Guidelines for Ethical Review of Animal Welfare and were approved by The University of Hong Kong Li Ka Shing Faculty of Medicine Ethics Committee.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Adam SZ, Rabinowich A, Kessner R, Blachar A. Spectral CT of the abdomen: Where are we now? Insights Imaging 2021;12:138. [Crossref] [PubMed]
  2. Wu F, Zhou H, Li F, Wang JT, Ai T. Spectral CT Imaging of Lung Cancer: Quantitative Analysis of Spectral Parameters and Their Correlation with Tumor Characteristics. Acad Radiol 2018;25:1398-404. [Crossref] [PubMed]
  3. Greene R. Fluoroscopic axialography: clinical applications in thoracic disease. Radiology 1976;121:527-31. [Crossref] [PubMed]
  4. Mushtaq S, Conte E, Pontone G, Baggiano A, Annoni A, Formenti A, Mancini ME, Guglielmo M, Muscogiuri G, Tanzilli A, Nicoli F, Bartorelli AL, Pepi M, Andreini D. State-of-the-art-myocardial perfusion stress testing: Static CT perfusion. J Cardiovasc Comput Tomogr 2020;14:294-302. [Crossref] [PubMed]
  5. Wang F, Eljarrat A, Müller J, Henninen TR, Erni R, Koch CT. Multi-resolution convolutional neural networks for inverse problems. Sci Rep 2020;10:5730. [Crossref] [PubMed]
  6. Tang Y, Gong W, Chen X, Li W. Deep Inception-Residual Laplacian Pyramid Networks for Accurate Single-Image Super-Resolution. IEEE Trans Neural Netw Learn Syst 2020;31:1514-28. [Crossref] [PubMed]
  7. Zha Z, Wen B, Yuan X, Zhou JT, Zhou J, Zhu C. Triply Complementary Priors for Image Restoration. IEEE Trans Image Process 2021;30:5819-34. [Crossref] [PubMed]
  8. Willemink MJ, Persson M, Pourmorteza A, Pelc NJ, Fleischmann D. Photon-counting CT: Technical Principles and Clinical Prospects. Radiology 2018;289:293-312. [Crossref] [PubMed]
  9. Zhang Y, Mou X, Wang G, Yu H. Tensor-Based Dictionary Learning for Spectral CT Reconstruction. IEEE Trans Med Imaging 2017;36:142-54. [Crossref] [PubMed]
  10. Noh J, Fessler JA, Kinahan PE. Statistical sinogram restoration in dual-energy CT for PET attenuation correction. IEEE Trans Med Imaging 2009;28:1688-702. [Crossref] [PubMed]
  11. Wu W, Chen P, Wang S, Vardhanabhuti V, Liu F, Yu H. Image-domain Material Decomposition for Spectral CT using a Generalized Dictionary Learning. IEEE Trans Radiat Plasma Med Sci 2021;5:537-47. [Crossref] [PubMed]
  12. Ren J, Wang Y, Cai A, Wang S, Liang N, Li L, Yan B. MISD-IR: material-image subspace decomposition-based iterative reconstruction with spectrum estimation for dual-energy computed tomography. Quant Imaging Med Surg 2024;14:4155-76. [Crossref] [PubMed]
  13. Bousse A, Kandarpa VSS, Rit S, Perelli A, Li M, Wang G, Zhou J, Wang G. Systematic Review on Learning-based Spectral CT. IEEE Trans Radiat Plasma Med Sci 2024;8:113-37. [Crossref] [PubMed]
  14. Yang Y, Sun J, Li H, Xu Z. ADMM-CSNet: A Deep Learning Approach for Image Compressive Sensing. IEEE Trans Pattern Anal Mach Intell 2020;42:521-38. [Crossref] [PubMed]
  15. Adler J, Oktem O. Learned Primal-Dual Reconstruction. IEEE Trans Med Imaging 2018;37:1322-32. [Crossref] [PubMed]
  16. Wu W, Hu D, Cong W, Shan H, Wang S, Niu C, Yan P, Yu H, Vardhanabhuti V, Wang G. Stabilizing deep tomographic reconstruction: Part A. Hybrid framework and experimental results. Patterns (N Y) 2022;3:100474. [Crossref] [PubMed]
  17. Chun IY, Huang Z, Lim H, Fessler JA. Momentum-Net: Fast and Convergent Iterative Neural Network for Inverse Problems. IEEE Trans Pattern Anal Mach Intell 2023;45:4915-31. [Crossref] [PubMed]
  18. Shi Z, Li H, Cao Q, Wang Z, Cheng M. A material decomposition method for dual-energy CT via dual interactive Wasserstein generative adversarial networks. Med Phys 2021;48:2891-905. [Crossref] [PubMed]
  19. Pan J, Yu H, Gao Z, Wang S, Zhang H, Wu W. Iterative Residual Optimization Network for Limited-Angle Tomographic Reconstruction. IEEE Trans Image Process 2024;33:910-25. [Crossref] [PubMed]
  20. Hubbell JH, Seltzer SM. Tables of X-Ray Mass Attenuation Coefficients and Mass Energy-Absorption Coefficients (version 1.4). National Institute of Standards and Technology, Gaithersburg, MD, 2004. Available online: http://physics.nist.gov/xaamdi
  21. Poludniowski G, Landry G, DeBlois F, Evans PM, Verhaegen F. SpekCalc: a program to calculate photon spectra from tungsten anode x-ray tubes. Phys Med Biol 2009;54:N433-8.
  22. Zeng GL. Noise-weighted spatial domain FBP algorithm. Med Phys 2014;41:051906. [Crossref] [PubMed]
  23. Huang S, Tang C, Xu M, Qiu Y, Lei Z. BM3D-based total variation algorithm for speckle removal with structure-preserving in OCT images. Appl Opt 2019;58:6233-43. [Crossref] [PubMed]
  24. Jin KH, McCann MT, Froustey E, Unser M. Deep Convolutional Neural Network for Inverse Problems in Imaging. IEEE Trans Image Process 2017;26:4509-22. [Crossref] [PubMed]
  25. Chen H, Zhang Y, Chen Y, Zhang J, Zhang W, Sun H, Lv Y, Liao P, Zhou J, Wang G. LEARN: Learned Experts' Assessment-Based Reconstruction Network for Sparse-Data CT. IEEE Trans Med Imaging 2018;37:1333-47. [Crossref] [PubMed]
  26. Xiang J, Dong Y, Yang Y. FISTA-Net: Learning a Fast Iterative Shrinkage Thresholding Network for Inverse Problems in Imaging. IEEE Trans Med Imaging 2021;40:1329-39. [Crossref] [PubMed]
  27. Pan J, Zhang H, Wu W, Gao Z, Wu W. Multi-domain integrative Swin transformer network for sparse-view tomographic reconstruction. Patterns (N Y) 2022;3:100498. [Crossref] [PubMed]
  28. Lyu Q, O'Connor D, Niu T, Sheng K. Image-domain multimaterial decomposition for dual-energy computed tomography with nonconvex sparsity regularization. J Med Imaging (Bellingham) 2019;6:044004. [Crossref] [PubMed]
  29. Yu Z, Leng S, Li Z, McCollough CH. Spectral prior image constrained compressed sensing (spectral PICCS) for photon-counting computed tomography. Phys Med Biol 2016;61:6707-32. [Crossref] [PubMed]
  30. Nadkarni R, Allphin A, Clark DP, Badea CT. Material decomposition from photon-counting CT using a convolutional neural network and energy-integrating CT training labels. Phys Med Biol 2022;67. [Crossref] [PubMed]
  31. Su T, Sun X, Yang J, Mi D, Zhang Y, Wu H, Fang S, Chen Y, Zheng H, Liang D, Ge Y. DIRECT-Net: A unified mutual-domain material decomposition network for quantitative dual-energy CT imaging. Med Phys 2022;49:917-34. [Crossref] [PubMed]
  32. Zhang W, Zhang H, Wang L, Wang X, Hu X, Cai A, Li L, Niu T, Yan B. Image domain dual material decomposition for dual-energy CT using butterfly network. Med Phys 2019;46:2037-51. [Crossref] [PubMed]
  33. Nichol AQ, Dhariwal P. Improved denoising diffusion probabilistic models. Proceedings of the 38th International Conference on Machine Learning, PMLR 2021;139:8162-71.
  34. Xu Y, Gong M, Xie S, Wei W, Grundmann M, Batmanghelich K, Hou T. Semi-Implicit Denoising Diffusion Models (SIDDMs). Adv Neural Inf Process Syst 2023;36:17383-94.
  35. Liu W, Sun X, Duan Y, Huo M, Zhang M, Wang R. Application Value of Dual-source CT Noise-optimized Virtual Monoenergetic Imaging Technology in Portal Vein Imaging of Patients with Portal Hypertension. CT Theory and Applications 2025;34:864-71.
  36. Xiang R, Zhang L. Suppression Method for Cone-Beam CT Artifact Based on FDK Compensation and Dual-Source Weighting. CT Theory and Applications 2025;34:830-8.
Cite this article as: Zhang X, Wang S, Liang N, Zheng Z, Cai A, Li L, Yu H, Yan B. An interpretable cascaded residual iterative network for sparse-view spectral CT imaging. Quant Imaging Med Surg 2026;16(3):203. doi: 10.21037/qims-2025-1895