Intra- and interobserver variability in renal multiparametric magnetic resonance imaging: T1 mapping, diffusion-weighted imaging, arterial spin labeling, blood oxygen level-dependent and renal artery flow measurements utilizing manual segmentation
Introduction
Renal multiparametric magnetic resonance imaging (mpMRI) enables acquisition of multiple structural and functional magnetic resonance imaging (MRI) sequences within a single scan session, offering extensive tissue characterisation. This non-invasive approach is theoretically advantageous over renal biopsy, avoiding procedural risks while mitigating the sampling bias inherent in needle cores. Functional sequences include T1 mapping (for tissue composition/fibrosis), blood oxygen level-dependent (BOLD) imaging (for tissue oxygenation), diffusion-weighted imaging (DWI) (for microstructure and microcirculation), arterial spin labeling (ASL) (for renal perfusion), and phase-contrast MRI (for renal artery flow) (1).
Quantitative measurements from these sequences vary significantly depending on whether they are obtained from the whole kidney, the cortex, or the medulla (2). These compartments are physiologically distinct and respond differently to disease, making region-specific measurements both complementary and necessary. Various segmentation approaches—manual, semi-automated, and fully automated—have been reported (3).
International consortia, such as PARENCHIMA and the UK Renal Imaging Network, have identified harmonization of renal MRI acquisition and analysis as a research priority (4,5). Consensus papers have proposed technical recommendations for post-processing, including segmentation methods (6-11). Manual segmentation is recommended for most sequences, with defined region of interest (ROI) placement strategies for DWI, BOLD, and ASL; however, no consensus was reached for T1 mapping (8).
Manual segmentation is inherently observer-dependent and sensitive to subtle variations in ROI placement. This issue is amplified in patients with chronic kidney disease (CKD), where reduced cortico-medullary differentiation complicates boundary identification (12).
Studies have reported generally robust interobserver reproducibility for cortical T1, DWI, and ASL measurements. Intraclass correlation coefficients (ICCs) above 0.9 have been observed for cortical T1 and ASL (13,14), although not consistently across the literature (e.g., ICC of 0.72 for cortical T1) (15). Apparent diffusion coefficient (ADC) values show slightly lower reproducibility (ICC >0.8) (2).
Despite increasing use of renal mpMRI, real-world data on intra- and interobserver reproducibility—particularly across full mpMRI protocols and in mixed populations including CKD—remain sparse. Most studies are limited in scope, size, and methodological consistency. Given the absence of standardized segmentation approaches, transparent reporting of manual ROI reproducibility is essential for guiding clinical translation and future automation.
In this study, we report intra- and interobserver variability for T1 mapping, T2*/BOLD, ASL, DWI (ADC), T1-rho and phase-contrast flow measurements using manual segmentation in a large, mixed patient cohort. Our aim is to quantify the variability inherent to current ROI-based methods and provide benchmark data to support standardization and reproducibility efforts in renal MRI. We present this article in accordance with the STARD reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2685/rc).
Methods
Patient selection and preparation
All patients scheduled for a kidney biopsy at Akershus University Hospital from February 2021 to November 2023 were invited to participate in the study. Exclusion criteria included metal implants incompatible with MRI, severe claustrophobia and inability to provide informed consent. A total of 72 patients were included and underwent mpMRI following a 10-hour fast. As part of their scheduled kidney biopsy later the same day, most patients took diazepam and blood pressure medications within 3 hours of the scan. Due to the absence of prior effect-size data for a priori power calculation, a sample size of n=72 was utilized, as this cohort is nested within an overarching study involving renal biopsies. This sample size is supported by Sim and Lewis, who suggest that a sample of at least 50 is often sufficient for pilot studies to provide stable parameter estimation (16). Furthermore, this size was deemed sufficient to demonstrate significant inter- and intraobserver agreement and assess variability between raters, providing the necessary depth for stable parameter estimation and exploratory modeling.
Ethics
The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Regional Committee for Medical and Health Research Ethics (REK) for the South-Eastern health region of Norway (Ref: 80527), and informed consent was taken from all individual participants.
mpMRI
Images were acquired on a 1.5 T Philips Ingenia scanner. Balanced steady-state-free-precession (B-SSFP) localizer scans were acquired in three orthogonal planes with parameters: field-of-view 400 mm × 400 mm, in-plane resolution 1.6 mm × 1.6 mm, slice thickness 7 mm, using end-expiration breath-holding. These images were used to quantify kidney volume and plan placement of the subsequent functional scans, all using seven contiguous coronal oblique slices, except for the phase contrast MRI (PC-MRI) sequence. All MRI data were visually inspected for motion. As no significant motion was observed in the majority of subjects, and to avoid potential interpolation artifacts, no automated motion correction was applied to the final dataset.
T1 mapping
A respiratory-triggered inversion recovery (IR) sequence was used with inversion times (TIs): 200/400/600/800/1,000/1,200/1,400/1,600/1,800 ms, using a fat-suppressed spin-echo echo-planar imaging (SE-EPI) readout [sensitivity encoding (SENSE) factor 2.3, echo time (TE) 19.7 ms, repetition time (TR) 6,000 ms, field of view (FOV) 288 mm × 288 mm, pixel size 3 mm × 3 mm, slice thickness 5 mm].
BOLD/R2* mapping
BOLD R2* data were acquired in seven slices using a multi-echo Fast Gradient Echo (mGRE) sequence (12 echoes, TE 5–71 ms, SENSE 2, flip angle 25°, three breath-holds, TR 72.2 ms, pixel size 2 mm × 2 mm).
DWI
Respiratory-triggered fat-suppressed SE-EPI DWI data were acquired (SENSE 2.3, TE 65.7 ms, pixel size 1.5 mm × 1.5 mm) at 14 b-values (0/5/10/20/30/40/50/60/100/150/200/300/400/500 s/mm²) in three orthogonal directions to minimize the influence of diffusion anisotropy.
ASL perfusion
Respiratory-triggered flow alternating inversion recovery (FAIR) ASL data were acquired in 5 oblique coronal slices with a post-label delay (PLD) of 1,800 ms and label/control inversion thicknesses of 45/400 mm. Twenty label/control pairs were collected using SE-EPI readout (SENSE 2.3, TE 19.7 ms, pixel size 1.5 mm × 1.5 mm). M0 scans were acquired for quantification.
Renal artery blood flow
A non-contrast-enhanced MR angiogram was acquired to plan PC-MRI slice placement proximal to the renal artery bifurcation. PC-MRI data were collected for each renal artery in a breath hold [flip-angle 25°, resolution 1.2 mm × 1.2 mm × 6 mm, 20 cardiac phases, velocity encoding (vENC) 100 cm/s].
T1-rho mapping
T1-rho weighted images were acquired in 7 oblique coronal slices with a fast spin echo sequence with a spin locking pre-pulse. Locking frequency was 300 Hz and locking durations [time of spin lock (TSL)] of 20/30/40/50/60 ms were applied. Slice thickness was 5 mm and pixel size 3 mm × 3 mm.
Data analysis
Postprocessing and image analysis were performed using in-house analysis software, Siswin, which has been previously used for functional kidney examinations in other studies (17). SE-EPI IR data were fitted to a single exponential model to generate T1 and M0 maps. ADC maps were generated by fitting to a single exponential signal decay. Perfusion-weighted images were computed, realigned and averaged to create a single perfusion-weighted (ΔM) map. This map was then used in a kinetic model to calculate tissue perfusion maps. mGRE data were fitted to an exponential model to compute R2* maps. T1-rho maps were calculated by fitting the five images with increasing TSL to a single exponentially decaying function.
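As an illustration of the single-exponential fitting described above, the sketch below fits S(b) = S0·exp(−b·ADC) to one voxel by log-linear least squares. It is not the Siswin implementation: the fitting strategy (log-linear rather than non-linear), the function name `fit_monoexponential`, and the synthetic signal values are assumptions made for this example. Only the b-values are taken from the DWI protocol.

```python
import math

def fit_monoexponential(b_values, signals):
    """Log-linear least-squares fit of S(b) = S0 * exp(-b * ADC).

    Returns (S0, ADC). ADC is in mm^2/s when b is given in s/mm^2.
    Hypothetical helper for illustration; not the study software.
    """
    if len(b_values) != len(signals) or len(b_values) < 2:
        raise ValueError("need at least two (b, signal) pairs of equal length")
    ys = [math.log(s) for s in signals]  # requires strictly positive signals
    n = len(b_values)
    mean_b = sum(b_values) / n
    mean_y = sum(ys) / n
    sxx = sum((b - mean_b) ** 2 for b in b_values)
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(b_values, ys))
    slope = sxy / sxx                    # slope of ln(S) versus b equals -ADC
    intercept = mean_y - slope * mean_b  # intercept equals ln(S0)
    return math.exp(intercept), -slope

# b-values from the DWI protocol (s/mm^2)
B_VALUES = [0, 5, 10, 20, 30, 40, 50, 60, 100, 150, 200, 300, 400, 500]

# Synthetic, noise-free voxel: illustrative numbers only, not study data.
true_s0, true_adc = 1000.0, 2.0e-3
signals = [true_s0 * math.exp(-b * true_adc) for b in B_VALUES]

s0, adc = fit_monoexponential(B_VALUES, signals)
print(f"S0 = {s0:.1f}, ADC = {adc * 1e3:.3f} x10^-3 mm^2/s")
```

The same functional form, with TE or TSL in place of b, underlies the R2* (where R2* = 1/T2*) and T1-rho fits. A non-linear fit may behave better than the log-linear shortcut at low signal-to-noise ratio.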
T1 maps report the spin-lattice relaxation time in milliseconds (ms). ADC maps report the magnitude of water diffusion in ×10⁻³ mm²/s. T2* (BOLD) maps report the effective transverse relaxation time in ms. ASL maps report arterial spin inflow in mL/100 g/min. T1-rho maps report the rotating-frame relaxation time in ms. Patient preparation, image acquisition, and data processing were performed in accordance with the recommendations and guidelines of the PARENCHIMA network (6-11).
MRI-image analysis
Technical validity was established via a pilot study of seven healthy volunteers evaluated by an MRI physicist. Additionally, all study scans were screened by a radiologist to rule out incidental pathology.
Image analysis was performed by two trained readers. Observer 1 (a nephrologist) underwent a one-week training period with an MRI physicist, analyzing 15 test cases together and separately to establish proficiency. Observer 1 analyzed the full cohort (n=72) twice, with a six-month washout period to assess intraobserver variability. Observer 2 (a radiologist) was subsequently trained by Observer 1 using the same 15-case protocol and analyzed the cohort once over a 3-month period to assess interobserver variability. T1-rho was only examined by Observer 1. The full segmentation protocol can be found in Appendix 1.
Image quality was assessed qualitatively via visual inspection by the observers. While signal-to-noise ratios were not formally calculated, images were excluded if they exhibited significant artifacts or signal voids that prevented accurate manual segmentation. Renal volume was assessed semi-automatically on coronal slices acquired with a B-SSFP sequence by manually drawing a contour on alternating slices and interpolating the remaining volume. For functional sequences (T1, T1-rho, T2*, DWI and ASL), cortical and medullary ROIs were manually segmented on three central slices per kidney. We did not use mask propagation; ROIs were drawn on each sequence individually. In cases of suboptimal image quality (frequently observed in ASL and T1-rho), analysis was restricted to two slices. ROIs were delineated manually on the final parametric maps. To ensure accurate corticomedullary identification despite the inherently lower anatomical contrast of these maps, observers used the corresponding high-resolution anatomical sequences as side-by-side visual guidance. Cortical ROIs traced the renal border, excluding cysts and artifacts, while medullary ROIs targeted the pyramids, excluding the sinus (Figure 1). Medullary segmentation was omitted for ASL. Values were averaged across slices and kidneys to produce mean global estimates. Corticomedullary difference [delta (Δ)] values were derived by subtracting cortical from medullary means (Δ = medulla − cortex). Renal artery blood flow was quantified using phase contrast MRI by integrating velocities within a dynamic circular ROI tracked across the cardiac cycle.
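The aggregation steps in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's software: slices are averaged within each kidney before averaging across kidneys (the text does not specify the order), flow per cardiac phase is taken as mean ROI velocity times ROI area, and all function names and numbers are hypothetical.

```python
from statistics import mean

def global_estimate(per_kidney_slices):
    """Mean over slices within each kidney, then mean over kidneys.

    The averaging order is an assumption; pooling all slices would give a
    slightly different result when kidneys contribute unequal slice counts.
    """
    return mean(mean(slices) for slices in per_kidney_slices.values())

def delta_value(cortex, medulla):
    """Corticomedullary difference: medullary mean minus cortical mean."""
    return global_estimate(medulla) - global_estimate(cortex)

def mean_flow_ml_per_min(mean_velocity_cm_s, roi_area_cm2):
    """Average flow over the cardiac cycle from per-phase ROI values.

    velocity (cm/s) * area (cm^2) = cm^3/s = mL/s for each cardiac phase;
    the phase average is then converted to mL/min.
    """
    per_phase = [v * a for v, a in zip(mean_velocity_cm_s, roi_area_cm2)]
    return 60.0 * mean(per_phase)

# Hypothetical per-slice ROI means for T1 in ms (not study data).
# The right kidney has two slices, as when image quality is suboptimal.
cortex = {"left": [1000.0, 1010.0, 990.0], "right": [1020.0, 1000.0]}
medulla = {"left": [1400.0, 1420.0, 1380.0], "right": [1410.0, 1390.0]}
delta_t1 = delta_value(cortex, medulla)

# Hypothetical four-phase flow example (the protocol used 20 phases).
flow = mean_flow_ml_per_min([20.0, 40.0, 30.0, 10.0], [0.2, 0.2, 0.2, 0.2])
print(delta_t1, flow)
```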
Statistical analysis
Inter- and intraobserver variability were assessed using ICCs from a two-way random-effects model [ICC(2,1)] assessing absolute agreement. Where the two-way model could not be computed, a one-way model was used. This was the case for the following intraobserver ICC calculations: ΔT1, T2* map medulla, ADC cortex, T1-rho medulla and ΔT1-rho. For the interobserver calculations, this applied only to T2* map medulla. For evaluation of interobserver agreement, the second image analysis by Observer 1 was used. In accordance with ICC interpretation guidelines, ICC values were interpreted as moderate (0.5–0.75), good (0.76–0.90), or excellent (>0.90). Bland-Altman plots were generated to visualize agreement and assess for systematic bias.
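For readers unfamiliar with the statistic, a minimal sketch of ICC(2,1) computed from the two-way ANOVA mean squares is given below, together with the interpretation bands stated above. The statistical package used in the study is not specified here, so this is only an illustration; the function names, the example ratings, and the label for values below 0.5 are assumptions.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one per subject, each holding k rater values.
    """
    n = len(ratings)       # subjects
    k = len(ratings[0])    # raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def interpret_icc(icc):
    """Bands as stated in the text; values between 0.75 and 0.76 count as good."""
    if icc > 0.90:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc >= 0.50:
        return "moderate"
    return "poor"  # the band below 0.5 is not named in the text; label assumed

# Illustrative ratings (6 subjects x 4 raters), not study data.
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(ratings)
label = interpret_icc(icc)
print(f"ICC(2,1) = {icc:.2f} ({label})")  # ICC(2,1) = 0.29 (poor)
```

Because the between-rater mean square enters the denominator, a constant offset between observers lowers ICC(2,1) even when their rankings of subjects agree, which is the sense in which it measures absolute agreement.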
To determine whether rater reliability was influenced by the degree of renal impairment, we compared the ICCs between low and high estimated glomerular filtration rate (eGFR) groups by computing the ratio of the ICC in the low eGFR group to the ICC in the high eGFR group. Low eGFR was defined as eGFR below the median of 37 mL/min/1.73 m², and high eGFR as >37 mL/min/1.73 m². For the majority of MRI parameters, a two-way mixed-effects model (absolute agreement) was utilized. However, for T1 medulla, T2* map medulla, and ADC, a one-way random-effects model was applied due to specific data distribution constraints. The 95% confidence intervals (CIs) for the ICC ratios were generated using bootstrapping with 1,000 resamples.
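One way such a bootstrapped ICC ratio could be obtained is sketched below. The resampling scheme (subjects resampled with replacement within each eGFR group), the percentile CI, the use of ICC(2,1) for both groups, and the simulated two-reader data are all assumptions for illustration; the text does not specify these details.

```python
import random

def icc_2_1(ratings):
    """ICC(2,1), absolute agreement; rows are subjects, columns are raters."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    ss_rows = k * sum((sum(row) / k - grand) ** 2 for row in ratings)
    ss_cols = n * sum(
        (sum(row[j] for row in ratings) / n - grand) ** 2 for j in range(k)
    )
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def bootstrap_icc_ratio(low, high, n_boot=1000, seed=0):
    """Point estimate and approximate percentile 95% CI of ICC(low)/ICC(high)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        low_s = [rng.choice(low) for _ in low]     # resample subjects
        high_s = [rng.choice(high) for _ in high]
        ratios.append(icc_2_1(low_s) / icc_2_1(high_s))
    ratios.sort()
    lower = ratios[round(0.025 * n_boot)]
    upper = ratios[round(0.975 * n_boot) - 1]
    return icc_2_1(low) / icc_2_1(high), (lower, upper)

def simulate(n, rng, noise_sd):
    """Simulated paired readings: shared true value plus independent reader noise."""
    rows = []
    for _ in range(n):
        truth = rng.gauss(100.0, 15.0)
        rows.append([truth + rng.gauss(0.0, noise_sd),
                     truth + rng.gauss(0.0, noise_sd)])
    return rows

rng = random.Random(42)
low_group = simulate(30, rng, 5.0)    # hypothetical low-eGFR readings
high_group = simulate(30, rng, 5.0)   # hypothetical high-eGFR readings
ratio, (ci_low, ci_high) = bootstrap_icc_ratio(low_group, high_group)
print(f"ICC ratio = {ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A CI that contains 1.0 is read as no detectable difference in reliability between the two groups.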
Results
A total of 72 patients were included, with 64 completing all MRI examinations (Figure 2). The 8 dropouts were due to suboptimal image quality or technical failure (missing images or heavy distortion). All MRI examinations were completed without complication or adverse reactions. The average age was 57 years, and 66.7% were male. Kidney function spanned across CKD stages 1 to 5, with 14 cases of AKI, and ranged in eGFR from 3 to 132 mL/min/1.73 m² with an average of 44 mL/min/1.73 m². Forty-four patients had significant proteinuria with a protein-to-creatinine ratio >100 mg/mmol with a total average of 228 mg/mmol (Table 1).
Table 1
| Variable | Value (n=72) |
|---|---|
| Demographics | |
| Age (years) | 56.8±16.2 |
| Sex (male) | 48 (66.7) |
| Anthropometrics | |
| BMI (kg/m²) | 28.3±5.8 |
| Kidney function | |
| eGFR (mL/min/1.73 m²) | 44.4±29.4 |
| Kidney volume (BSA adjusted, unilateral) | 70.2±22.3 |
| CKD stage | |
| 1–2 | 17 (23.6) |
| 3 | 20 (27.8) |
| 4–5 | 21 (29.1) |
| AKI | 14 (19.4) |
| Diagnosis | |
| Glomerular disease (e.g., IgAN) | 48 (66.7) |
| Vascular disease (e.g., HTN) | 16 (22.2) |
| Other | 8 (11.1) |
Continuous variables are presented as mean ± standard deviation for normally distributed data. Categorical variables are presented as frequency (percentage). AKI, acute kidney injury; BMI, body mass index; BSA, body surface area; CKD, chronic kidney disease; eGFR, estimated glomerular filtration rate; HTN, hypertension; IgAN, immunoglobulin A nephropathy.
Interobserver variability ranged from moderate to excellent, with ICC values ranging from 0.53 to 0.99. Moderate agreement was observed for kidney volume, ΔT1 and ΔADC, good for T1 cortex, T1 medulla, ΔT2*, ADC cortex, ASL and renal artery flow, and excellent for ADC medulla and T2* maps of cortex and medulla (Table 2). Intraobserver variability also ranged from moderate to excellent with ICC values ranging from 0.52 to 0.97 (Table 3). Moderate agreement was found for ΔT2*, ΔADC, T1-rho cortex, and ΔT1-rho; good for kidney volume, T1 cortex, ΔT1, ADC cortex, ADC medulla, T1-rho medulla, and renal artery flow, and excellent for T1 medulla, T2* cortex, T2* medulla and cortical perfusion. In general, intraobserver agreement was higher than the interobserver agreement, with more parameters being classified as good or excellent (11 vs. 9).
Table 2
| Variable | ICC (95% CI) | Interpretation |
|---|---|---|
| Kidney volume | 0.75 (0.63 to 0.85) | Moderate |
| T1 cortex | 0.88 (0.81 to 0.92) | Good |
| T1 medulla | 0.88 (0.81 to 0.93) | Good |
| ∆T1 | 0.68 (0.53 to 0.79) | Moderate |
| T2* map cortex | 0.99 (0.98 to 0.99) | Excellent |
| T2* map medulla | 0.96 (0.94 to 0.98) | Excellent |
| ∆T2* | 0.84 (0.76 to 0.90) | Good |
| ADC cortex | 0.84 (0.77 to 0.90) | Good |
| ADC medulla | 0.91 (0.86 to 0.94) | Excellent |
| ∆ADC | 0.53 (0.34 to 0.69) | Moderate |
| ASL (cortical perfusion) | 0.78 (0.66 to 0.86) | Good |
| Renal artery flow | 0.82 (0.68 to 0.92) | Good |
Absolute agreement listed with 95% CI and interpretation (bootstrapped 95% CI based on 1,000 resamples). ADC, apparent diffusion coefficient; ASL, arterial spin labeling; CI, confidence interval; ICC, intraclass correlation coefficient; T1, longitudinal relaxation time.
Table 3
| Variable | ICC (95% CI) | Interpretation |
|---|---|---|
| Kidney volume | 0.90 (0.86 to 0.94) | Good |
| T1 cortex | 0.89 (0.83 to 0.93) | Good |
| T1 medulla | 0.96 (0.93 to 0.97) | Excellent |
| ∆T1 | 0.81 (0.72 to 0.88) | Good |
| T2* map cortex | 0.97 (0.95 to 0.98) | Excellent |
| T2* map medulla | 0.94 (0.91 to 0.97) | Excellent |
| ∆T2* | 0.73 (0.61 to 0.83) | Moderate |
| ADC cortex | 0.89 (0.82 to 0.93) | Good |
| ADC medulla | 0.89 (0.84 to 0.93) | Good |
| ∆ADC | 0.52 (0.34 to 0.67) | Moderate |
| ASL (cortical perfusion) | 0.94 (0.91 to 0.96) | Excellent |
| T1-rho cortex | 0.73 (0.62 to 0.82) | Moderate |
| T1-rho medulla | 0.81 (0.71 to 0.88) | Good |
| ∆T1-rho | 0.55 (0.37 to 0.70) | Moderate |
| Renal artery flow | 0.82 (0.73 to 0.89) | Good |
Absolute agreement listed with 95% CI and interpretation (bootstrapped 95% CI based on 1,000 resamples). ADC, apparent diffusion coefficient; ASL, arterial spin labeling; CI, confidence interval; ICC, intraclass correlation coefficient; T1, longitudinal relaxation time.
Bland-Altman plots showed that the mean difference between measurements was close to zero for most parameters, with occasional outliers (4–5 per plot) (Figure 3). A proportional bias was noted for interobserver variability in measurements of cortical perfusion and renal artery flow, where the magnitude of the difference increased with higher values. No such patterns were noted in the Bland-Altman plots for intraobserver variability (Figure S1).
Analysis of the ICC ratios revealed no significant systematic differences in interrater agreement between the low and high eGFR groups for most MRI variables (Table 4). We observed borderline findings for T2* map cortex (ICC ratio: 1.02; 95% CI: 1.00 to 1.05) and T2* map medulla (ICC ratio: 1.06; 95% CI: 1.00 to 1.12). In these cases, the lower limit of the CI reached 1.00, suggesting a slight trend toward higher rater consistency in the low eGFR group compared to the high eGFR group.
Table 4
| Variable | ICC ratio (95% CI) |
|---|---|
| Kidney volume | 1.26 (0.92 to 1.52) |
| T1 cortex | 0.92 (0.81 to 1.06) |
| T1 medulla | 1.04 (0.90 to 1.11) |
| ∆T1 | 0.55 (0.49 to 1.10) |
| T2* map cortex | 1.02 (1.00 to 1.05) |
| T2* map medulla | 1.06 (1.00 to 1.12) |
| ∆T2* | 0.95 (0.87 to 1.09) |
| ADC cortex | 1.08 (0.95 to 1.21) |
| ADC medulla | 1.05 (0.96 to 1.16) |
| ∆ADC | 0.90 (0.70 to 1.44) |
| ASL (cortical perfusion) | 0.88 (0.70 to 1.10) |
| Renal artery flow | 1.08 (0.97 to 1.13) |
Systematic differences in reliability were assessed via the ICC ratio low eGFR/high eGFR. A ratio of 1.0 represents identical agreement between groups. Values in parentheses represent the 95% CI obtained through bootstrapping with 1,000 resamples. ADC, apparent diffusion coefficient; ASL, arterial spin labeling; CI, confidence interval; eGFR, estimated glomerular filtration rate; ICC, intraclass correlation coefficient; T1, longitudinal relaxation time.
A post-hoc outlier analysis was performed on the Bland-Altman plots to determine whether disagreements were clustered within specific subjects. Outliers were not uniformly distributed across the cohort; 68.1% of patients had zero outliers across all investigated variables. While 16.7% and 4.2% of patients had one or two outliers, respectively, a small subset exhibited higher clustering. Most notably, one patient was identified as a systematic outlier across eight separate variables, including T1, T2*, ADC, and cortical perfusion metrics.
Discussion
Our study evaluated the intra- and interobserver variability in renal mpMRI using manual segmentation and demonstrated good or excellent agreement between observers for T1, T2* and ADC maps in both cortex and medulla. Agreement for derived “delta values” was lower than agreement for absolute cortical and medullary values in both intra- and interobserver measurements (Tables 2,3). This is expected because a difference between two regions has a lower signal-to-noise ratio than either regional value. While the absolute measurement error remains similar, it becomes much larger in proportion to the small numerical difference between the two regions, thereby reducing the relative precision of the delta parameter. Our results are consistent with previous studies reporting ICC ≥0.8 for T1, T2*, and DWI (18-21). Furthermore, our results indicate that the reliability of MRI-derived renal parameters is largely independent of kidney function. While T2* mapping parameters showed a marginal trend toward higher agreement in the low eGFR group, these ratios were very close to 1.0, suggesting that any difference in reliability is likely of minimal clinical impact. This consistency is crucial for the longitudinal monitoring of patients with progressing CKD, as it makes it more likely that observed changes in MRI metrics reflect physiological shifts rather than rater variability at different disease stages. Bland-Altman plots divided into low and high eGFR groups are available in Figure S2.
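The loss of relative precision in the delta values can be made explicit with a short error-propagation sketch. The symbols below are introduced here for illustration only: the cortical and medullary means are written as C and M, their measurement errors are assumed independent with a common standard deviation σ, and the ICC is written in a simplified form that ignores any systematic rater effect.

```latex
\begin{align*}
\Delta &= \bar{M} - \bar{C}, \\
\sigma_\Delta^2 &= \sigma_M^2 + \sigma_C^2 - 2\,\mathrm{cov}(M, C)
  \;\approx\; 2\sigma^2
  \quad \text{(independent errors, } \sigma_M = \sigma_C = \sigma\text{)}, \\
\frac{\sigma_\Delta}{|\Delta|} &\approx \frac{\sqrt{2}\,\sigma}{|\bar{M} - \bar{C}|}, \\
\mathrm{ICC} &\approx \frac{\sigma_{\text{between}}^2}{\sigma_{\text{between}}^2 + \sigma_{\text{error}}^2}.
\end{align*}
```

As a purely hypothetical example, with cortical T1 of 1,000 ms, medullary T1 of 1,400 ms and σ = 30 ms, the relative error is 3.0% for cortex and 2.1% for medulla, but about 42/400 ≈ 10.6% for ΔT1. If, in addition, the between-subject spread of Δ is smaller than that of the regional values, the error term takes up a larger share of the ICC denominator, which lowers the ICC.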
T1-rho imaging, a novel technique primarily tested in preclinical models, yielded moderate to good intraobserver agreement but was limited by poor image quality and anatomical delineation, particularly for cortex vs. medulla (22). Renal artery flow showed good intra- and interobserver agreement, while ASL showed good and excellent inter- and intraobserver agreement, respectively.
The Bland-Altman plots confirmed close agreement for most sequences, but there were approximately 4–5 outliers for each parameter. These most likely reflect atypical anatomy, artefacts or poor image quality, where the two observers demarcated the outline of the anomaly, cortex and/or medulla differently (Figure 1). In renal artery flow measurements and, to a slightly lesser degree, cortical perfusion, there appears to be a proportional bias where the magnitude of the difference increases with higher values (Figure 3). This may be a consequence of accessory renal arteries; if they are included by one observer and left out by the other, the resulting difference will be greater when the flow is higher. It is also possible that the measurement errors simply scale with the underlying signal. To address this, we performed a secondary analysis using percentage-difference Bland-Altman plots (for ASL and renal artery flow). When the results were expressed in relative (%) terms, the proportional bias was effectively eliminated, resulting in constant limits of agreement across the entire measurement range (Figure S3). To our knowledge, previous studies on renal flow measurements with phase contrast MRI have shown good or excellent repeatability and/or intraobserver agreement (23). For ASL measurements we saw excellent intraobserver agreement, but only good interobserver agreement. The proportional bias observed in the ASL Bland-Altman plots likely originates from inherent quantification challenges and the limited signal-to-noise ratio characteristic of these sequences. Because ASL parametric maps often exhibit low intrinsic contrast, manual segmentation is susceptible to increased variability, particularly at higher perfusion values. A potential methodological refinement to mitigate this would be the implementation of automated mask propagation from high-contrast anatomical sequences.
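The difference between absolute and percentage-difference Bland-Altman statistics can be illustrated with a minimal sketch. The paired values below are hypothetical and constructed so that the disagreement scales with the measurement (about ±5%); the function name and the use of 1.96 standard deviations for the limits of agreement are conventional choices assumed here, not details taken from the study.

```python
from statistics import mean, stdev

def bland_altman(a, b, percent=False):
    """Bias and 95% limits of agreement for paired measurements.

    With percent=True each difference is expressed relative to the
    pair mean, i.e. 100 * (a - b) / ((a + b) / 2).
    """
    diffs = []
    for x, y in zip(a, b):
        d = x - y
        if percent:
            d = 100.0 * d / ((x + y) / 2.0)
        diffs.append(d)
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical flow readings by two observers (mL/min), not study data.
observer_1 = [100.0, 200.0, 300.0, 400.0]
observer_2 = [95.0, 210.0, 285.0, 420.0]

abs_diffs = [x - y for x, y in zip(observer_1, observer_2)]
pct_diffs = [100.0 * (x - y) / ((x + y) / 2.0)
             for x, y in zip(observer_1, observer_2)]

abs_bias, abs_lo, abs_hi = bland_altman(observer_1, observer_2)
pct_bias, pct_lo, pct_hi = bland_altman(observer_1, observer_2, percent=True)
print(abs_diffs)                         # magnitudes grow with the measured value
print([round(d, 2) for d in pct_diffs])  # magnitudes stay near 5%
```

In the absolute plot the spread widens with the measured value, whereas the percentage differences keep a roughly constant magnitude, which is the pattern described for ASL and renal artery flow.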
As expected, intra-observer agreement was slightly superior to inter-observer agreement for most MRI parameters. This suggests that while personal systematic bias remains a minor factor, the high inter-rater concordance validates the clinical utility of the standardized measurement protocol for different observers.
In our study, inter-observer agreement for total kidney volume was moderate, compared with the good to excellent agreement seen in functional parameters such as T1 and T2* mapping. This discrepancy is likely attributable to our segmentation methodology, which utilized manual contouring on alternating slices followed by software interpolation. While this approach is more time-efficient in a clinical setting, it is more susceptible to variations in slice selection and edge detection at the renal poles.
Renal MRI is an emerging method with high diagnostic and prognostic potential, but studies evaluating its utility against clinical markers of kidney disease or renal endpoints show considerable heterogeneity in results. This is believed to be caused by differences in patient preparation, MRI parameters, image acquisition protocols, post-processing techniques and image analysis. Different image analysis techniques are in use, the most common being ROIs placed on distinct anatomical regions of the kidney that do not include the whole cortex or medulla. Whereas predetermined ROIs from the upper or lower poles of the kidney might be susceptible to motion artefacts, an observer-determined representative ROI has in some studies yielded excellent reproducibility (24). In another study, a representative cortex ROI provided a lower ICC, but correlation to eGFR was better for the representative ROI than for the whole cortex (14). A reason for this may be that avoidance of artefacts, noise and non-physiological structures such as cysts is easier with the representative ROI method. Although this method is labor-efficient, it will, much like a kidney biopsy, be prone to sampling bias. A study using ROIs in segmentation of T1 maps showed lower interobserver agreement in the medullary region than in the cortex, with ICCs of 0.42 and 0.72 respectively, indicating that the medulla might be particularly sensitive to ROI placement (15). One strategy to improve the performance of the ROI method could be to increase the number of ROIs, reducing variance and the risk of sampling bias. This, however, increases labor and undermines the efficiency argument for using this method. Finally, there are some data showing an association between BOLD and kidney function only when whole-cortex segmentation is used (25).
Disease processes like fibrosis often affect the kidneys in a focal and non-uniform pattern, and evidence suggests that biopsy-estimates of fibrosis-burden are highly inaccurate (26). Furthermore, histological evaluation of fibrosis on kidney biopsies shows a high degree of intra- and interobserver variability and a low degree of agreement on sequential biopsies (27). One major potential advantage of MRI over kidney biopsies, in regard to fibrosis assessment and prognostication, is the ability to accurately determine whole-kidney fibrosis burden. Hence, even though some studies show good agreement with representative ROIs, this method has an inherent theoretical flaw. We believe measuring the whole kidney, segmented by cortex and medulla, is both viable and will produce better correlations to kidney function and fibrosis in the long term. This is supported by a recent study where small ROIs showed significantly lower repeatability than manual (or automatic) segmentation across all MRI parameters (except for ASL where ICC was low for all) (28). Automated segmentation is both labor-efficient and has shown a high degree of consistency and agreement in several studies. In our study, one thorough examination of all MRI-parameters on one single patient could take as long as 90 minutes.
The clinical translation of mpMRI necessitates automated segmentation to ensure scalability and reproducibility. Although approaches based on convolutional neural networks excel at whole-organ and cyst segmentation, their application to internal renal structures is limited by the low tissue contrast found in diseased kidneys (29-31). Consequently, while automation represents the future, manual expert oversight remains critical for accurate corticomedullary quantification in CKD populations.
Our study demonstrates high inter- and intraobserver agreement in evaluating mpMRI using whole cortex and medulla segmentation, but highlights challenges in evaluating kidney volume, and to some degree ASL and renal artery flow. A potential limitation of our study is its single-center design, relying on an MRI scanner and software version from a single vendor. We did not perform test-retest examinations on the scanner, so scan-rescan reliability was not assessed. We performed the segmentation directly on parametric maps rather than on the anatomical images with better contrast that consensus papers recommend. We chose this approach to ensure that ROI placement remained spatially coincident with the physiological data, thereby avoiding the registration errors and geometric distortions that can occur when transferring ROIs between different MRI acquisitions. Automated motion and artifact correction were not applied to the final dataset. Although we attempted to mitigate this during segmentation by avoiding regions with severe artifacts, the absence of automated correction may leave the parametric maps susceptible to noise. The impact of these technical constraints is reflected in our outlier analysis. The clustering of outliers in a small subset of patients suggests that inter-observer variability was largely driven by systematic technical challenges in specific cases, such as localized motion or poor parametric map contrast, rather than a generalized lack of rater consistency. This highlights a translational challenge: while manual segmentation is highly reliable for approximately 68% of the population, ‘edge cases’ with poor image quality remain a significant hurdle for clinical consistency.
A considerable strength of our study is a large cohort with varied degree of kidney disease and function. This provides evidence for the robustness of mpMRI measurements across a wide range of eGFR, age and CKD-stages, and increases generalizability of our results. We have a well-defined image analysis protocol with documented ROI rules and preprocessing steps, which has been followed rigorously by the examiners and decreases the variability of our results.
Conclusions
Our study showed good to excellent intra- and interobserver agreement across most mpMRI parameters using kidney cortical and medullary segmentation. This supports their application for research and potential clinical use when our described protocol and image analysis technique are used. Furthermore, the future integration of automated segmentation and artificial intelligence (AI)-based border detection represents a critical next step to further enhance measurement reproducibility.
Acknowledgments
We thank Gabriel Melles for assisting in image acquisition and providing technical help throughout the project; Lisa Katarina Frōdin and her colleagues at the Department of Research at Akershus University Hospital for all help with patient preparation; and Elisabeth Solberg Hoel and Anne Tolås for providing indispensable aid in patient logistics and preparation. We thank Anne-Dorte Blankholm for providing critical insight and experience in initialization of the MRI-protocol. Finally, we thank the Department of Nephrology at Akershus University Hospital for facilitating the implementation of the project. Generative AI tools (Google Gemini) were used during the drafting of this manuscript to improve readability, grammar, and flow. The authors reviewed and revised the output, and take full responsibility for the content of the publication.
Footnote
Reporting Checklist: The authors have completed the STARD reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2685/rc
Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2685/dss
Funding: This study was financially supported by
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1-2685/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by the Regional Committee for Medical and Health Research Ethics (REK) for the South-Eastern health region of Norway (Ref: 80527), and informed consent was taken from all individual participants.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Tournebize C, Schleef M, De Mul A, Pacaud S, Derain-Dubourg L, Juillard L, Rouvière O, Lemoine S. Multiparametric MRI: can we assess renal function differently? Clin Kidney J 2025;18:sfae365.
- Chen J, Zhang Z, Liu J, Li C, Yin M, Nie L, Song B. Multiparametric Magnetic Resonance Imaging of the Kidneys: Effects of Regional, Side, and Hydration Variations on Functional Quantifications. J Magn Reson Imaging 2023;57:1576-86.
- Zöllner FG, Kociński M, Hansen L, Golla AK, Trbalić AŠ, Lundervold A. Kidney Segmentation in Renal Magnetic Resonance Imaging – Current Status and Prospects. IEEE Access 2021;9:71577-605.
- Caroli A, Pruijm M, Burnier M, Selby NM. Functional magnetic resonance imaging of the kidneys: where do we stand? The perspective of the European COST Action PARENCHIMA. Nephrol Dial Transplant 2018;33:ii1-3.
- Selby NM, Blankestijn PJ, Boor P, Combe C, Eckardt KU, Eikefjord E, et al. Magnetic resonance imaging biomarkers for chronic kidney disease: a position paper from the European Cooperation in Science and Technology Action PARENCHIMA. Nephrol Dial Transplant 2018;33:ii4-ii14.
- Ljimani A, Caroli A, Laustsen C, Francis S, Mendichovszky IA, Bane O, et al. Consensus-based technical recommendations for clinical translation of renal diffusion-weighted MRI. MAGMA 2020;33:177-95.
- Caroli A, Schneider M, Friedli I, Ljimani A, De Seigneux S, Boor P, Gullapudi L, Kazmi I, Mendichovszky IA, Notohamiprodjo M, Selby NM, Thoeny HC, Grenier N, Vallée JP. Diffusion-weighted magnetic resonance imaging to assess diffuse renal pathology: a systematic review and statement paper. Nephrol Dial Transplant 2018;33:ii29-40.
- Dekkers IA, de Boer A, Sharma K, Cox EF, Lamb HJ, Buckley DL, et al. Consensus-based technical recommendations for clinical translation of renal T1 and T2 mapping MRI. MAGMA 2020;33:163-76.
- Bane O, Mendichovszky IA, Milani B, Dekkers IA, Deux JF, Eckerbom P, et al. Consensus-based technical recommendations for clinical translation of renal BOLD MRI. MAGMA 2020;33:199-215.
- Nery F, Buchanan CE, Harteveld AA, Odudu A, Bane O, Cox EF, et al. Consensus-based technical recommendations for clinical translation of renal ASL MRI. MAGMA 2020;33:141-61.
- de Boer A, Villa G, Bane O, Bock M, Cox EF, Dekkers IA, et al. Consensus-Based Technical Recommendations for Clinical Translation of Renal Phase Contrast MRI. J Magn Reson Imaging 2022;55:323-35.
- Buchanan CE, Mahmoud H, Cox EF, McCulloch T, Prestwich BL, Taal MW, Selby NM, Francis ST. Quantitative assessment of renal structural and functional changes in chronic kidney disease using multi-parametric magnetic resonance imaging. Nephrol Dial Transplant 2020;35:955-64.
- Xie S, Chen M, Chen C, Zhao Y, Qin J, Qiu C, Zhu J, Nickel MD, Kuehn B, Shen W. T1 mapping combined with arterial spin labeling MRI to identify renal injury in patients with liver cirrhosis. Front Endocrinol (Lausanne) 2024;15:1363797.
- Rankin AJ, Allwood-Spiers S, Lee MMY, Zhu L, Woodward R, Kuehn B, Radjenovic A, Sattar N, Roditi G, Mark PB, Gillis KA. Comparing the interobserver reproducibility of different regions of interest on multi-parametric renal magnetic resonance imaging in healthy volunteers, patients with heart failure and renal transplant recipients. MAGMA 2020;33:103-12.
- Dekkers IA, Paiman EHM, de Vries APJ, Lamb HJ. Reproducibility of native T1 mapping for renal tissue characterization at 3T. J Magn Reson Imaging 2019;49:588-96.
- Sim J, Lewis M. The size of a pilot study for a clinical trial should be calculated in relation to considerations of precision and efficiency. J Clin Epidemiol 2012;65:301-8.
- Gullaksen S, Vernstrøm L, Sørensen SS, Ringgaard S, Laustsen C, Funck KL, Poulsen PL, Laugesen E. Separate and combined effects of semaglutide and empagliflozin on kidney oxygenation and perfusion in people with type 2 diabetes: a randomised trial. Diabetologia 2023;66:813-25.
- Garcia-Ruiz L, Echeverria-Chasco R, Aramendía-Vidaurreta V, Solis-Barquero SM, Garcia-Fernandez N, Mora-Gutiérrez JM, Vidorreta M, Bastarrika G, Fernández-Seara MA. Influence of Field Strength, Sex, and Age on Pseudo-Continuous Arterial Spin Labeling and T1 Mapping in the Kidney. J Magn Reson Imaging 2025;62:1180-95.
- Nowak M, Henningsson M, Davis T, Chowdhury N, Dennis A, Fernandes C, Thomaides Brears H, Robson MD. Repeatability, Reproducibility, and Observer Variability of Cortical T1 Mapping for Renal Tissue Characterization. J Magn Reson Imaging 2025;61:1914-22.
- Friedli I, Crowe LA, Berchtold L, Moll S, Hadaya K, de Perrot T, Vesin C, Martin PY, de Seigneux S, Vallée JP. New Magnetic Resonance Imaging Index for Renal Fibrosis Assessment: A Comparison between Diffusion-Weighted Imaging and T1 Mapping with Histological Validation. Sci Rep 2016;6:30088.
- Liang P, Chen Y, Li S, Xu C, Yuan G, Hu D, Kamel I, Zhang Y, Li Z. Noninvasive assessment of kidney dysfunction in children by using blood oxygenation level-dependent MRI and intravoxel incoherent motion diffusion-weighted imaging. Insights Imaging 2021;12:146.
- Hu G, Liang W, Wu M, Lai C, Mei Y, Li Y, Xu J, Luo L, Quan X. Comparison of T1 Mapping and T1rho Values with Conventional Diffusion-weighted Imaging to Assess Fibrosis in a Rat Model of Unilateral Ureteral Obstruction. Acad Radiol 2019;26:22-9.
- Villa G, Ringgaard S, Hermann I, Noble R, Brambilla P, Khatir DS, Zöllner FG, Francis ST, Selby NM, Remuzzi A, Caroli A. Phase-contrast magnetic resonance imaging to assess renal perfusion: a systematic review and statement paper. MAGMA 2020;33:3-21.
- Feng YZ, Ye YJ, Cheng ZY, Hu JJ, Zhang CB, Qian L, Lu XH, Cai XR. Non-invasive assessment of early stage diabetic nephropathy by DTI and BOLD MRI. Br J Radiol 2020;93:20190562.
- Chen F, Yan H, Yang F, Cheng L, Zhang S, Li S, Liu C, Xu K, Sun D. Evaluation of Renal Tissue Oxygenation Using Blood Oxygen Level-Dependent Magnetic Resonance Imaging in Chronic Kidney Disease. Kidney Blood Press Res 2021;46:441-51.
- Hysi E, Baek J, Koven A, He X, Ulloa Severino L, Wu Y, Kek K, Huang S, Krizova A, Farcas M, Ordon M, Fok KH, Stewart R, Pace KT, Kolios MC, Parker KJ, Yuen DA. A first-in-human study of quantitative ultrasound to assess transplant kidney fibrosis. Nat Med 2025;31:970-8.
- Farris AB, Alpers CE. What is the best way to measure renal fibrosis?: A pathologist's perspective. Kidney Int Suppl (2011) 2014;4:9-15.
- Liang C, Loster I, Ursprung S, Ghoul A, Küstner T, Gückel B, Kühn B, Schick F, Martirosian P, Seith F. Multiparametric functional MRI of the kidneys - evaluation of test-retest repeatability and effects of different manual and automatic image analysis strategies. Rofo 2025;197:1176-87.
- Inoue K, Hara Y, Nagawa K, Koyama M, Shimizu H, Matsuura K, Takahashi M, Osawa I, Inoue T, Okada H, Ishikawa M, Kobayashi N, Kozawa E. The utility of automatic segmentation of kidney MRI in chronic kidney disease using a 3D convolutional neural network. Sci Rep 2023;13:17361.
- Daniel AJ, Buchanan CE, Allcock T, Scerri D, Cox EF, Prestwich BL, Francis ST. Automated renal segmentation in healthy and chronic kidney disease subjects using a convolutional neural network. Magn Reson Med 2021;86:1125-36.
- Aslam I, Aamir F, Kassai M, Crowe LA, Poletti PA, Seigneux S, Moll S, Berchtold L, Vallée JP. Validation of automatically measured T1 map cortico-medullary difference (ΔT1) for eGFR and fibrosis assessment in allograft kidneys. PLoS One 2023;18:e0277277.


