MMD-Net: a weakly supervised solution for quantification of nonalcoholic fatty liver biopsies
Introduction
Nonalcoholic fatty liver disease (NAFLD), the most prevalent chronic liver disorder worldwide with a global prevalence of 25%, has emerged as a leading etiological factor for cirrhosis and hepatocellular carcinoma (HCC). Substantial epidemiological evidence links the rising NAFLD incidence to modifiable lifestyle factors including physical inactivity, excessive caloric intake, and imbalanced dietary patterns (1). The disease pathogenesis originates from hepatic steatosis caused by triglyceride accumulation within hepatocytes. Clinically significant progression to inflammatory liver injury, termed nonalcoholic steatohepatitis (NASH), was observed in 20% of NAFLD patients in 2015. With aging as an independent risk factor, this proportion is projected to rise to 27% by 2030 due to population aging (2). As the principal driver of hepatic fibrosis (3,4) and HCC (5), NASH constitutes a major cause of liver-related morbidity and mortality.
Accurate histological differentiation of ballooning degeneration, steatosis, and inflammation is clinically pivotal in NAFLD diagnosis, directly impacting disease staging, prognosis, and therapeutic decisions. The NAFLD spectrum spans benign nonalcoholic fatty liver (NAFL) to high-risk NASH, with distinct prognoses: isolated steatosis (NAFL) carries a ≤2% 10-year cirrhosis risk (cardiovascular mortality predominates) (6), whereas NASH (steatosis + inflammation + ballooning) shows 10–15% cirrhosis progression and elevated liver-related mortality [standardized mortality ratio (SMR) 4.1] (7). Crucially, this triad, particularly ballooning, distinguishes indolent NAFL from progressive NASH and guides biopsy and management choices.
Steatosis (>5% hepatocyte triglyceride accumulation) signifies metabolic dysregulation and requires lifestyle intervention despite slow progression. Inflammation (lobular/portal lymphocytic/neutrophilic infiltration) indicates disease activity, driving progression to NASH, hepatocyte injury, and fibrogenesis. Ballooning degeneration (hepatocyte swelling, cytoplasmic rarefaction, Mallory-Denk bodies) (8) directly marks hepatocyte injury/death and is the strongest independent fibrosis predictor; combined with inflammation, it initiates fibrosis—the paramount prognostic indicator (9).
Quantification via the NAFLD Activity Score (NAS; steatosis 0–3, inflammation 0–3, ballooning 0–2) enables risk stratification: NAS ≥5 suggests NASH requiring intervention; ballooning ≥1 independently predicts fibrosis progression; grade 2–3 inflammation quadruples annual fibrosis progression. Thus, ballooning acts as a hepatocyte injury/fibrosis alarm, inflammation accelerates cirrhosis, and steatosis reflects metabolic burden. Precise quantification of these features refines patient stratification, preventing low-risk overtreatment while ensuring timely high-risk intervention to improve outcomes.
Early detection and timely intervention are crucial for reversing disease progression, so accurate assessment of hepatic steatosis severity and early intervention to block pathological progression are clinically important. Liver biopsy with histopathological evaluation remains the gold standard for NAFLD/NASH diagnosis and staging. Among various scoring systems, the methodology developed by Kleiner, Brunt, and the NASH Clinical Research Network (CRN) Pathology Committee stands as the most validated framework (9). It requires separate semi-quantitative assessments of steatosis (0–3), lobular inflammation (0–3), hepatocellular ballooning (0–2), and fibrosis (0–4). The NAS is calculated as the sum of the steatosis, lobular inflammation, and ballooning scores and thus ranges from 0 to 8 (see Table 1); fibrosis is staged separately and does not enter the NAS. Traditional manual histopathological scoring systems exhibit two critical limitations. First, the specialized training required for hepatopathological assessment has created a global shortage of qualified hepatopathologists, leaving capacity well short of escalating clinical demand (10). Second, substantial inter-observer and intra-observer variability undermines the reproducibility of histological evaluations, particularly for critical histological features such as ballooning degeneration and inflammatory activity. As demonstrated in the seminal 2005 study by Kleiner et al., inter-observer agreement for ballooning degeneration and inflammation yielded kappa values of 0.56 and 0.45, respectively. Even intra-observer agreement for ballooning assessment was imperfect (κ=0.68), indicating subjective interpretation due to ambiguous morphological criteria (9).
Current scoring protocols require pathologists to assign discrete ordinal classifications based on morphological characteristics, spatial distribution patterns, and zonal localization of pathological features. However, this visual interpretation-based methodology inherently introduces diagnostic discrepancies due to subjective assessment biases. Artificial intelligence (AI)-powered quantitative image analysis holds promise for overcoming these limitations through standardized, data-driven classification of disease severity. These constraints collectively highlight the urgent need for developing automated systems capable of delivering standardized, precise, and efficient evaluation of NAFLD/NASH histopathology. Our study specifically focused on deep learning (DL)-based automated quantification of these critical histopathological features.
Table 1
| Histological component | Score 0 | Score 1 | Score 2 | Score 3 |
|---|---|---|---|---|
| Steatosis (hepatocyte fat accumulation) | <5% | 5–33% | >33–66% | >66% |
| Lobular inflammation (inflammatory foci per 200× field) | No foci | <2 foci | 2–4 foci | >4 foci |
| Ballooning (hepatocyte ballooning degeneration) | None | Few balloon cells | Many cells/prominent ballooning | – |
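The thresholds in Table 1 and the summation rule for the NAS can be expressed as a short sketch. The function names are illustrative and are not taken from the study's code. The ballooning grade is passed in directly, because Table 1 defines it qualitatively rather than by a numeric cut-off.

```python
def steatosis_score(fat_percent: float) -> int:
    """Map the percentage of hepatocytes containing fat to a 0-3 score (Table 1)."""
    if fat_percent < 5:
        return 0
    if fat_percent <= 33:
        return 1
    if fat_percent <= 66:
        return 2
    return 3


def inflammation_score(foci_per_200x_field: float) -> int:
    """Map the number of inflammatory foci per 200x field to a 0-3 score (Table 1)."""
    if foci_per_200x_field <= 0:
        return 0
    if foci_per_200x_field < 2:
        return 1
    if foci_per_200x_field <= 4:
        return 2
    return 3


def nas(steatosis: int, inflammation: int, ballooning: int) -> int:
    """NAS = steatosis (0-3) + lobular inflammation (0-3) + ballooning (0-2)."""
    if not (0 <= steatosis <= 3 and 0 <= inflammation <= 3 and 0 <= ballooning <= 2):
        raise ValueError("component score out of range")
    return steatosis + inflammation + ballooning


# Example: 40% steatosis, 3 foci per field, few balloon cells (grade 1).
example_nas = nas(steatosis_score(40), inflammation_score(3), 1)
```

In this example the component scores are 2, 2, and 1, giving a NAS of 5.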
DL-based approaches have recently achieved important breakthroughs in biopsy image analysis. Heinemann et al. [2022] proposed an InceptionV3-based stepwise Kleiner scoring system for assessing the progression of NAFLD/NASH, which requires separate sub-models for ballooning, inflammation, fibrosis, and steatosis feature extraction followed by artificial neural network (ANN) regression integration. The system quantifies these histopathological features by analyzing microscope images of liver biopsy samples. The features are aggregated into continuous scores using ANNs, offering finer granularity than discrete pathologist scores. Validation on a dataset of 467 samples demonstrated that the automated system achieves high consistency with pathologist scores (11).
Multi-instance learning (MIL) has emerged as an effective framework for processing weakly-labeled whole slide images (WSIs) with slide-level annotations. To address multiclass classification challenges in histopathological analysis, recent methodological advancements include ReMix by Yang et al. [2022] (12), a stochastic augmentation method that enhances sample variability through instance-level feature mixing. Concurrently, Lin et al. [2023] introduced interventional bag MIL (IBMIL) (13), an attention-based framework that improves discriminative feature learning via confounder-aware instance selection and cross-bag feature interaction.
Hashimoto et al. [2020] achieved breakthrough performance in lymphoma cancer subtype classification by integrating MIL for bag-level prediction, domain adversarial (DA) training with gradient reversal layers to mitigate staining variations, and multi-scale learning to capture tumor heterogeneity across spatial scales. Key innovations include: (I) a two-stage training strategy where single-scale DA-MIL networks [using pretrained convolutional neural networks (CNNs) such as VGG16] are first trained individually, followed by multi-scale feature fusion; (II) attention-based MIL to focus on diagnostically critical image patches. Validated on a 196-case malignant lymphoma dataset, the method outperforms conventional CNNs (achieving pathologist-level accuracy) and highlights tumor regions consistent with clinical observations. Its clinical relevance lies in mimicking pathologists’ multi-scale diagnostic workflows while enabling scalable analysis of unannotated whole-slide images (14).
Yan et al. [2022] proposed a Swin Transformer-based deep self-supervised framework integrated with residual modules to enhance model performance. To the best of our knowledge, this work pioneers the application of Swin Transformer in hepatic histopathological analysis. WSIs were processed through dual-scale patch cropping with systematic quantification of four diagnostic features (15).
Yin et al. [2024] developed a primal-dual graph architecture to explicitly model spatial interactions between vascular systems and fibrotic matrices in hepatic fibrosis histopathology. By constructing a vascular network-derived primal graph and a fibrosis region-induced dual graph, this framework systematically extracts topological features of fibrosis-related structures in WSIs. The specifically designed primal-dual graph convolutional module enables independent characterization of vascular morphological patterns and fibrotic distribution features, while establishing their pathological correlation model (16).
Junaid et al. [2025] proposed a two-stage DL framework for molecular subtype classification (basal-like vs. classical) of pancreatic ductal adenocarcinoma (PDAC) using routine hematoxylin and eosin (H&E)-stained histopathology slides. It first employs a CNN to localize tumor regions in WSIs, then evaluates four MIL architectures to integrate local morphological features (e.g., glandular patterns) for subtype prediction. Validated on both The Cancer Genome Atlas-pancreatic adenocarcinoma (TCGA-PAAD) data (97 slides) and an external biopsy cohort (44 patients, 110 slides), the approach demonstrates generalizability and provides interpretable decision patterns through Grad-CAM visualization, bridging histopathological features with molecular subtypes for clinical application (17).
Lu et al. [2024] developed CONtrastive learning from Captions for Histopathology (CONCH), a vision-language foundation model specifically developed for histopathological analysis. Trained on over 1.17 million image-text pairs aggregated from multiple sources of histopathology images and biomedical captions through task-agnostic pretraining, CONCH demonstrates exceptional multimodal capabilities across diverse benchmarks. The model achieves state-of-the-art performance in image classification, segmentation, caption generation, as well as cross-modal retrieval tasks including text-to-image and image-to-text search (18).
Shabanian et al. [2025] investigated the feasibility of leveraging clustering-constrained attention multiple instance learning (CLAM) (19), a weakly supervised learning method, for staging liver fibrosis on trichrome-stained WSIs in children and young adults. Through a retrospective analysis, 217 trichrome-stained WSIs from pediatric liver biopsies were collected and independently scored by two pediatric pathologists using both METAVIR and Ishak fibrosis staging systems. The cases were stratified into low- and high-fibrosis stages, and a binary classification model was subsequently developed using the CLAM pipeline to distinguish between these stages (20).
Ahmadvand et al. [2024] developed a two-stage DL strategy to classify molecular subtypes of PDAC using routine H&E-stained histopathology slides. Firstly, a CNN was trained to automatically localize tumor regions in WSI, effectively excluding interference from normal tissues. Subsequently, the researchers evaluated four DL architectures (Vanilla, IDaRS, DeepMIL, and VarMIL) on the identified tumor regions, integrating local image features through an MIL strategy to construct a classification model distinguishing between basal-like and classical subtypes. Trained on the TCGA-PAAD dataset (97 slides), the model demonstrated robust generalization capability when validated on an external cohort of 44 patients (110 biopsy slides) (21).
Addressing the limitations of existing methodologies—exemplified by Heinemann et al.’s [2022] InceptionV3-based stepwise Kleiner scoring system requiring separate sub-models for ballooning, inflammation, fibrosis, and steatosis feature extraction followed by ANN regression integration (11)—our approach resolves two critical operational constraints:
- Redundant tile-level feature prediction and aggregation procedures on WSIs (typically 40,000×40,000 pixels), which incur significant time overhead;
- Prohibitive annotation costs associated with tile-level labeling of gigapixel WSIs.
Building upon existing research, we were inspired to integrate multiple advanced technologies to address the labor-intensive and time-consuming challenges associated with manual interpretation and annotation in the quantitative assessment of hepatic histopathological images for NAFLD patients. Multi-task learning (MTL) enables joint feature learning through shared representations (22), and demonstrates proven efficiency gains in computer vision (23), natural language processing (NLP) (24), and speech recognition (25) through multi-objective optimization.
MIL, as a weakly-supervised paradigm, circumvents instance-level labeling requirements by operating on diagnostically-labeled bags of image tiles (26,27), so weakly-labeled WSI data can be handled through bag-level supervision alone. DA modules mitigate domain shifts caused by staining variations. Experimental validation supports the efficacy of the combined framework.
This methodological advancement provides an efficient and reliable solution for automated NAS quantification, showing substantial potential to accelerate histopathology-based NAFLD/NASH research and clinical translation. The major contributions of this work include:
- Development of a novel end-to-end network architecture dedicated to NAS quantification, which streamlines computational workflows while achieving competitive performance metrics;
- Construction of an innovative instance bag dynamic sampling mechanism specifically designed to address class-imbalanced data distribution;
- The weakly supervised nature of the proposed MMD-Net model streamlines the annotation workflow in dataset construction, enabling researchers to selectively label a subset from hundreds to thousands of image patches generated from a single WSI.
We present this article in accordance with the TRIPOD+AI reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-981/rc).
Methods
We present MMD-Net, a weakly supervised framework integrating MIL with MTL for the concurrent assessment of three key histopathological features in NAFLD: ballooning degeneration, inflammation, and steatosis. The method requires only bag-level annotations for WSIs, eliminating the need for pixel- or instance-level labeling and significantly reducing annotation costs. Figure 1 presents the dataset preparation process and an overview of the workflow in this study.
Figure 2 illustrates the histopathological data preprocessing workflow.
Problem formulation
We constructed a multi-task classification model on a liver histopathology dataset comprising N patients, with three core subtasks: ballooning grading (3-class), lobular inflammation assessment (4-class), and steatosis quantification (4-class). We employed an MIL framework to process high-resolution WSIs, where each WSI is partitioned into multiple tiles and organized into bags, using only bag-level weak labels for training. Detailed mathematical notation, dataset representation, and formal definitions of the MIL framework are provided in Appendix 1.
Class imbalance handling
The dataset exhibits severe class imbalance. Taking ballooning degeneration as an example (similar distributions were observed for inflammation and steatosis), there are only 796 (1.5%) positive patches versus 29,913 (55.6%) negative patches, with 23,094 (42.9%) “ignore”-labeled patches excluded due to quality issues. To address sampling bias, we implemented:
- Generative data augmentation: applying Stable Diffusion v2.1 for semantics-preserving enhancement of positive patches (Figure 3);
- Dynamic priority sampling: positive-first selection during bag construction.
The detailed methodology is provided in Appendix 2.
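As a rough illustration of positive-first selection during bag construction, the sketch below draws the scarce positive patches before filling the rest of the bag with negatives. It is an assumption about how such a sampler could look, not the procedure from Appendix 2. The function name, the `max_positive` cap, and the binary bag label are hypothetical simplifications, since the actual tasks are multi-class.

```python
import random


def build_bag(positive_patches, negative_patches, bag_size, max_positive, rng):
    """Build one MIL bag, selecting positive patches first (hypothetical sketch)."""
    # Take positives first, capped so that a bag is not made of positives only.
    n_pos = min(max_positive, len(positive_patches), bag_size)
    chosen_pos = rng.sample(positive_patches, n_pos)
    # Fill the remaining slots with randomly drawn negative patches.
    n_neg = min(bag_size - n_pos, len(negative_patches))
    chosen_neg = rng.sample(negative_patches, n_neg)
    bag = chosen_pos + chosen_neg
    rng.shuffle(bag)  # avoid a fixed positive-then-negative order inside the bag
    bag_label = 1 if n_pos > 0 else 0  # simplified binary bag-level label
    return bag, bag_label


rng = random.Random(0)
positives = ["pos_0", "pos_1"]
negatives = [f"neg_{i}" for i in range(20)]
bag, bag_label = build_bag(positives, negatives, bag_size=8, max_positive=3, rng=rng)
```

Because positives are drawn before negatives, every available positive patch up to the cap ends up in a bag, which raises the share of positive bags relative to uniform sampling.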
Through data augmentation and class-balancing strategies under computational constraints, a refined dataset of 3,957 samples was established. As shown in Figure 4, the dataset demonstrated balanced inter-class distribution across three pathological features.
Network
MMD-Net comprises three core components (Figure 5):
- Feature extractor: a CNN that maps 299×299 pixel images to feature vectors;
- Label predictor: an attention-based neural network with three independent parameter sets for different pathological features;
- Domain predictor: a fully-connected network that maps features to domain label probabilities.
The network employs an adversarial training strategy, achieving domain-invariant feature learning through a gradient reversal layer. Detailed network structure, parameter settings, and attention mechanism formulas are provided in Appendix 3.
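To make the role of the attention-based label predictor concrete, the following sketch implements a widely used form of attention pooling for MIL, in which each instance receives a softmax-normalized weight and the bag feature is the weighted sum of instance features. This is a sketch under assumptions: the exact formulas are in Appendix 3 and may differ, and the parameter names `V` and `w` are illustrative. In a three-task setting, one such parameter pair per pathological feature would be used.

```python
import math


def attention_pool(instances, V, w):
    """Attention pooling over a bag of feature vectors.

    instances: list of feature vectors, each of length d
    V: list of hidden-layer rows, each of length d
    w: list of output weights, one per hidden unit
    Returns (bag_feature, attention_weights).
    """
    scores = []
    for h in instances:
        hidden = [math.tanh(sum(v * x for v, x in zip(row, h))) for row in V]
        scores.append(sum(wi * zi for wi, zi in zip(w, hidden)))
    # Numerically stable softmax over the instances of the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    bag_feature = [sum(a * h[j] for a, h in zip(weights, instances)) for j in range(dim)]
    return bag_feature, weights


instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, 1.0]
bag_feature, weights = attention_pool(instances, V, w)
```

With these toy parameters the third instance activates both hidden units and therefore receives the largest attention weight.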
Experimental setup
Dataset
This study utilized a publicly available hepatocellular pathology dataset originally published by Heinemann et al. (11) on the Open Science Framework (OSF) platform (accessible at osf.io/8e7hd). The dataset comprises a retrospective collection of 467 clinical liver biopsy specimens sourced from three independent institutions, incorporating diversity in digital slide scanners and staining protocols (H&E and Sirius red). The data distribution across the contributing centers is as follows:
- Duke University Medical Center (Durham, NC, USA): 338 whole-slide digital pathology images;
- Institute of Pathology, Hannover Medical School (Germany): 72 specimens;
- Boehringer Ingelheim Biorepository (Biberach, Germany/Ridgefield, CT, USA): 57 specimens provided by Discovery Life Sciences (Huntsville, AL, USA).
The dataset includes two primary biopsy types:
- Wedge biopsies: characterized by substantial tissue volume (typical dimension: 0.8 cm edge length) and containing multiple intact portal tracts;
- Needle biopsies: typically 1–2 cm in length, presenting 6–10 representative portal triads.
All slides were stained using standardized protocols for either Masson-Goldner trichrome or Masson trichrome and were digitized using either a Leica Aperio AT2 whole-slide scanner (Leica Biosystems, Wetzlar, Germany) or a Carl Zeiss Axioscan Z1 digital pathology system (Carl Zeiss, Jena, Germany). Scanning parameters were standardized to bright-field imaging mode with a 20× objective resolution (0.5 µm/pixel).
Expert pathologists manually annotated four key histological features: steatosis, ballooning, inflammation, and fibrosis. To enhance scoring consistency, steatosis grades were further calibrated using a U-Net-based automated segmentation system to mitigate potential systematic bias.
From the initial 467 specimens, a subset of 282 cases with complete histopathological annotations was selected as the final dataset for analysis. The remaining 185 cases were excluded due to incomplete labels or image quality concerns. The curated dataset maintains cross-center staining protocol consistency, structural integrity of portal tracts, and full traceability of histopathological grading labels. Despite the multicenter design enhancing scanner and stain diversity, the dataset presents certain constraints, including quality heterogeneity (with staining artifacts proactively excluded), label incompleteness (specifically, missing inflammation/ballooning scores in some subsets), and a relative scarcity of high-grade cases. Furthermore, fibrosis-annotated slides were excluded from MIL bag construction due to incompatible magnification scales with the specific requirements for NAS scoring.
Evaluation metrics
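To quantitatively evaluate model performance, five common metrics were employed: accuracy, precision, recall, F1 score, and Cohen’s κ. The formulas corresponding to each metric are as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Where TP, FP, TN, and FN indicate true positive, false positive, true negative, and false negative, respectively.

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where $p_o$ is the observed agreement between predicted and reference labels, and $p_e$ is the agreement expected by chance.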
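These five metrics can all be computed from a multi-class confusion matrix, as the following sketch shows. Macro-averaging of the per-class precision, recall, and F1 values is an assumption, since the averaging scheme is not stated here, and the function name is illustrative.

```python
def classification_metrics(confusion):
    """Compute metrics from confusion[i][j] = count of true class i predicted as j."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    row_totals = [sum(confusion[i]) for i in range(k)]  # true counts per class
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # predicted counts
    diagonal = [confusion[c][c] for c in range(k)]

    accuracy = sum(diagonal) / n
    precisions = [diagonal[c] / col_totals[c] if col_totals[c] else 0.0 for c in range(k)]
    recalls = [diagonal[c] / row_totals[c] if row_totals[c] else 0.0 for c in range(k)]
    f1s = [2 * p * r / (p + r) if (p + r) else 0.0 for p, r in zip(precisions, recalls)]

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_o = accuracy
    p_e = sum(row_totals[c] * col_totals[c] for c in range(k)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,  # macro average (assumption)
        "recall": sum(recalls) / k,
        "f1": sum(f1s) / k,
        "kappa": kappa,
    }


metrics = classification_metrics([[8, 2], [1, 9]])
```

For this two-class example, 17 of 20 samples lie on the diagonal, so accuracy is 0.85; the chance agreement is 0.5, giving κ = 0.7.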
Training strategy
We employ a multi-task loss combining weighted cross-entropy terms, with the optimization objective including bag-level classification loss and attention-weighted domain adaptation regularization. The training process utilizes dynamic domain regularization parameters and an early stopping strategy. The complete training algorithm pseudocode is provided in Appendix 4, and hyperparameter settings are detailed in Appendix 5.
All experiments in this study were conducted on a workstation equipped with an NVIDIA RTX 5080 GPU (16 GB VRAM; NVIDIA, Santa Clara, CA, USA) and an Intel® Core™ Ultra9 285K (3.7 GHz; Intel, Santa Clara, CA, USA) CPU. The operating system was Windows 11 Home Chinese Edition (Microsoft, Redmond, WA, USA), with PyTorch 2.7.1 (Meta, Menlo Park, CA, USA) as the DL framework and Python 3.11.13 (Python Software Foundation, Wilmington, DE, USA) as the programming language. GPU acceleration was enabled via CUDA 12.8 (NVIDIA).
In the ablation study protocol, the dataset was stratified into training, validation, and test subsets through randomized partitioning (6:2:2 ratio, n=3,957 samples). For comparative analysis of backbone network efficacy, we implemented 5-fold cross-validation with randomized stratified sampling, ensuring each fold maintained equivalent class distribution (792±1 samples per fold).
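A stratified 5-fold partition of this kind can be sketched as follows. The sketch assumes a single class label per sample for stratification; how the three task labels were combined for stratification is not specified here, and the function name is illustrative.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so that each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    offset = 0  # continue the round-robin across classes to keep fold sizes even
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i, index in enumerate(members):
            folds[(offset + i) % n_folds].append(index)
        offset += len(members)
    return folds


labels = [0] * 50 + [1] * 30 + [2] * 23
folds = stratified_folds(labels, n_folds=5, seed=1)
```

Dealing the shuffled members of each class round-robin keeps both the per-class counts and the total fold sizes within one sample of each other.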
The model was trained for 100 epochs with early stopping patience set to 10. Dropout with a rate of 0.5 was applied during class prediction. The dynamic domain regularization parameter was calculated as:
Where . The hyperparameter α was optimized through grid search on the validation set.
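For orientation, a schedule commonly paired with gradient reversal layers ramps the domain regularization weight smoothly from 0 toward 1 as training progresses. The sketch below shows that generic schedule. Whether it matches the formula used in this study is an assumption, and the default value of `alpha` is a placeholder rather than the grid-searched value.

```python
import math


def domain_reg_weight(progress: float, alpha: float = 10.0) -> float:
    """Generic sigmoid-shaped ramp for the domain regularization weight.

    progress: fraction of training completed, in [0, 1]
    alpha: steepness of the ramp (placeholder default, an assumption)
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return 2.0 / (1.0 + math.exp(-alpha * progress)) - 1.0


# Weight at the start, the midpoint, and the end of a 100-epoch run.
schedule = [domain_reg_weight(epoch / 100) for epoch in (0, 50, 100)]
```

Starting the weight at zero lets the label predictor stabilize before the adversarial domain signal begins to shape the features.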
The datasets analyzed during the study are available in the OSF repository, https://osf.io/8e7hd/.
Results
Table 2 compares the experimental results of DA-MIL, MTL-MIL, and the multi-task domain-adversarial variants (MTL-DA-MIL). The baseline DA-MIL employs a single-feature training protocol for its classification head (iteratively training individual feature channels) while maintaining identical hyperparameter configurations to MTL-DA-MIL. It should be noted that, because of the MTL approach employed in the proposed model, the three features (ballooning degeneration, inflammation, and steatosis) did not converge optimally at the same time during training. A common observation was that ballooning degeneration tended to overfit while inflammation and steatosis remained underfit. Given that the primary objective of this work was the calculation of the NAS, the experimental focus was primarily directed towards optimizing the average performance across these three features. The comparative results indicate that multi-task cooperative training yields measurable improvements. Specifically, as shown in the table, MTL-MIL improved on DA-MIL in all five mean metrics, which also indicates that MTL contributes a larger performance improvement than DA does. Notably, MTL-DA-MIL (3 tasks), comprising three primary tasks (ballooning, inflammation, and steatosis prediction), shows marginal improvements over MTL-MIL in mean accuracy, F1 score, and Cohen’s κ, whereas mean precision decreased from 0.8315 to 0.8262 and mean recall from 0.8107 to 0.8072. However, MTL-DA-MIL (4 tasks), which introduces an auxiliary task to determine the presence of ignored patches within an MIL bag alongside these primary tasks, was suboptimal relative to MTL-DA-MIL (3 tasks), exhibiting declines in three mean metrics (accuracy, F1 score, and Cohen’s κ). We attribute this to the limited relevance of the fourth supplementary task (determining whether bags contain patches annotated as ‘ignored’) to the primary objectives.
This observation suggests that task selection within MTL should prioritize those demonstrating strong correlation with the core task. The confusion matrices in Figure 6 demonstrate the classification performance of MMD-Net (MTL-DA-MIL, 4 tasks, ConvNeXt backbone) on three critical histopathological features (ballooning, inflammation, and steatosis) within the test set.
Table 2
| Method | Feature | Accuracy | Precision | Recall | F1 score | Cohen’s kappa |
|---|---|---|---|---|---|---|
| DA-MIL | Ballooning | 0.9295 | 0.9308 | 0.9092 | 0.9272 | 0.9549 |
| | Inflammation | 0.7976 | 0.7964 | 0.7312 | 0.7918 | 0.7185 |
| | Steatosis | 0.7421 | 0.7625 | 0.6782 | 0.7406 | 0.6376 |
| | Mean | 0.8231 | 0.8299 | 0.7729 | 0.8199 | 0.7703 |
| MTL-MIL | Ballooning | 0.946 | 0.9404 | 0.9376 | 0.9456 | 0.9617 |
| | Inflammation | 0.8426 | 0.8316 | 0.7885 | 0.8383 | 0.7709 |
| | Steatosis | 0.7361 | 0.7226 | 0.706 | 0.7343 | 0.6196 |
| | Mean | 0.8416 | 0.8315 | 0.8107 | 0.8394 | 0.7841 |
| MTL-DA-MIL (3 tasks) | Ballooning | 0.9414 | 0.9418 | 0.9265 | 0.9404 | 0.9593 |
| | Inflammation | 0.8123 | 0.7792 | 0.7651 | 0.8097 | 0.7479 |
| | Steatosis | 0.7718 | 0.7577 | 0.7301 | 0.7706 | 0.6964 |
| | Mean | 0.8418 | 0.8262 | 0.8072 | 0.8402 | 0.8012 |
| MTL-DA-MIL (4 tasks) | Ballooning | 0.94 | 0.9396 | 0.94 | 0.9394 | 0.9579 |
| | Inflammation | 0.8201 | 0.8217 | 0.8201 | 0.8198 | 0.7805 |
| | Steatosis | 0.7631 | 0.7629 | 0.7631 | 0.7608 | 0.6316 |
| | Mean | 0.8411 | 0.8414 | 0.8411 | 0.84 | 0.79 |
DA, domain adversarial; MIL, multi-instance learning; MTL, multi-task learning.
To identify the optimal neural architecture for NAS quantification in NAFLD, this study implemented a 5-fold cross-validation approach to systematically compare multiple classical feature extraction networks, including VGG16 (28), ResNet50 (29), MobileNet V3 (30), ViT (31), ConvNeXt (32), and Swin Transformer (33). Notably, ResNet50 and MobileNet V3 were excluded from the final comparative analysis due to their suboptimal performance metrics (35–39% lower classification accuracy and a 0.44–0.53 reduction in F1 scores compared to top-performing architectures). As shown in the systematic comparison in Table 3, the choice of backbone architecture affects network performance. Swin Transformer achieves the highest accuracy among the compared backbones, whereas ConvNeXt and VGG16 remain competitive and ViT trails the other three. This performance disparity can be primarily attributed to two factors:
- Long-range dependency requirements: the MIL framework contains diverse image tiles that necessitate effective long-range dependency modeling—a capability inherently strengthened by the self-attention mechanisms in Transformer-based architectures such as Swin Transformer.
- Pathological texture characteristics: histopathological images exhibit densely packed textural patterns, making them particularly amenable to VGG’s architectural strength in capturing fine-grained local features through its stacked 3×3 convolutional operations.
Table 3
| Backbone | Feature | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|---|
| VGG16 | Ballooning | 0.913±0.009 | 0.918±0.008 | 0.919±0.01 | 0.913±0.009 |
| | Inflammation | 0.799±0.022 | 0.797±0.024 | 0.79±0.025 | 0.802±0.022 |
| | Steatosis | 0.721±0.017 | 0.723±0.014 | 0.721±0.018 | 0.721±0.017 |
| | Mean | 0.811±0.009 | 0.813±0.01 | 0.81±0.011 | 0.812±0.009 |
| ViT_b_32 | Ballooning | 0.78±0.066 | 0.791±0.069 | 0.767±0.084 | 0.781±0.066 |
| | Inflammation | 0.684±0.072 | 0.687±0.068 | 0.673±0.069 | 0.683±0.073 |
| | Steatosis | 0.709±0.046 | 0.715±0.05 | 0.702±0.042 | 0.709±0.045 |
| | Mean | 0.757±0.009 | 0.762±0.007 | 0.754±0.01 | 0.757±0.01 |
| ConvNeXt_tiny | Ballooning | 0.924±0.015 | 0.929±0.011 | 0.93±0.016 | 0.924±0.014 |
| | Inflammation | 0.824±0.024 | 0.821±0.021 | 0.816±0.021 | 0.825±0.024 |
| | Steatosis | 0.754±0.035 | 0.755±0.031 | 0.755±0.036 | 0.754±0.034 |
| | Mean | 0.834±0.021 | 0.835±0.018 | 0.834±0.02 | 0.834±0.02 |
| Swin Transformer (tiny) | Ballooning | 0.932±0.005 | 0.935±0.004 | 0.937±0.007 | 0.932±0.004 |
| | Inflammation | 0.835±0.018 | 0.829±0.019 | 0.831±0.017 | 0.836±0.016 |
| | Steatosis | 0.766±0.03 | 0.772±0.023 | 0.767±0.038 | 0.766±0.029 |
| | Mean | 0.844±0.015 | 0.845±0.012 | 0.845±0.018 | 0.845±0.014 |
Data are presented as mean ± standard deviation.
To delineate region-specific attention patterns during model prediction, the class activation mapping (CAM) technique was implemented in this study. Figure 7 employs the Grad-CAM++ (34) visualization method to generate corresponding CAM images for each task. The lesion areas displayed in Figure 7A were annotated by a senior pathologist (with over 15 years of experience) from a tier-3 hospital. Blue arrows indicate ballooning degeneration, red arrows denote inflammation, and green arrows represent steatosis. Analysis of these images reveals that although the original dataset contained incomplete patch-level annotations (e.g., patches labeled as ‘inflammation’ lacked annotations for ‘ballooning’ or ‘steatosis’), the model successfully identified suspected lesion areas within the patches. For instance, in Figure 7B, ballooning degeneration was detected in patches 2, 4, 6, and 14, whereas patch 16 exhibited signs of inflammation. This suggests a potential advantage of the algorithm over manual review: it is not subject to fatigue-related oversight. Furthermore, Figure 7C indicates that the model directs considerably more attention towards inflammatory regions. However, the detection performance for steatosis, as shown in Figure 7D, was suboptimal. Notably, confusion with ballooning degeneration regions can be observed, including instances where macro-vesicular steatosis is misclassified as ballooning degeneration. This finding aligns with the experimental results showing relatively lower accuracy in steatosis detection, indicating potential for further model refinement.
To clinically validate the performance of MMD-Net (MTL-DA-MIL, 4 tasks, Swin backbone), we conducted a slide-level evaluation following the procedure illustrated in Figure 8. After excluding patches annotated as “ignore”, all patches from the same case were grouped into multiple instance bags. These bags were then processed by the model, and the prediction results for each bag were recorded. The highest score for each target feature among all bags from the same slide was aggregated to determine the final slide-level score.
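The aggregation step described above, taking the highest predicted score per feature across all bags of a slide, can be sketched as follows. The data layout (a list of slide identifier and per-feature grade pairs) and the function name are illustrative assumptions.

```python
from collections import defaultdict

FEATURES = ("ballooning", "inflammation", "steatosis")


def aggregate_slide_scores(bag_predictions):
    """Keep, for each slide and feature, the maximum grade over all of its bags.

    bag_predictions: iterable of (slide_id, {feature: predicted_grade}) pairs
    """
    slide_scores = defaultdict(lambda: {feature: 0 for feature in FEATURES})
    for slide_id, grades in bag_predictions:
        for feature in FEATURES:
            current = slide_scores[slide_id][feature]
            slide_scores[slide_id][feature] = max(current, grades[feature])
    return dict(slide_scores)


bag_predictions = [
    ("slide_a", {"ballooning": 0, "inflammation": 1, "steatosis": 2}),
    ("slide_a", {"ballooning": 1, "inflammation": 0, "steatosis": 3}),
    ("slide_b", {"ballooning": 0, "inflammation": 0, "steatosis": 1}),
]
slide_scores = aggregate_slide_scores(bag_predictions)
slide_a_nas = sum(slide_scores["slide_a"].values())
```

Here the two bags of `slide_a` combine to grades of 1, 1, and 3, so a single high-grade bag determines the slide-level score for that feature.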
The test set was strictly curated to include only WSIs containing patches exhibiting all three key histological features: ballooning, inflammation, and steatosis. This selection criterion resulted in a final test cohort of 190 WSIs. The quantitative results of this slide-level evaluation are presented in Table 4.
Table 4
| Feature | Accuracy | Precision | Recall | F1 score |
|---|---|---|---|---|
| Ballooning | 0.7326±0.0172 | 0.7633±0.0519 | 0.593±0.0178 | 0.6882±0.0192 |
| Inflammation | 0.7074±0.0858 | 0.7157±0.088 | 0.6787±0.0889 | 0.6982±0.0902 |
| Steatosis | 0.6874±0.0485 | 0.5923±0.0359 | 0.712±0.0393 | 0.7202±0.043 |
Data are presented as mean ± standard deviation.
Upon transitioning the model evaluation from the bag level to the comprehensive WSI level, we observed a moderate performance degradation. This decline may be attributed to the variability in sampling ratios across different slides. For instance, although some slides yielded over 2,400 patches, others contributed fewer than 30, creating a domain-level imbalance that likely undermined the effectiveness of the DA strategy employed in this study. Furthermore, this imbalance increases the difficulty for the model to accurately “focus” on a few critical pathological regions, such as scattered ballooning cells. Although the attention mechanism is theoretically capable of such localization, in practice, attention weights can be “diluted” by large volumes of normal tissue, resulting in reduced sensitivity to subtle lesions.
Additionally, the MIL aggregation strategy adopted in this work is based on the “max-pooling” assumption, where the label of a bag is determined by its most abnormal instance. At the WSI level, this assumption may be an oversimplification. For example, a slide may contain multiple isolated, low-grade inflammatory foci that do not collectively constitute high-grade “diffuse” inflammation. The model, however, might overestimate the severity due to these focal lesions. This method of label aggregation inherently differs from the complex integrative process used by pathologists during global assessment, thereby introducing inevitable evaluation uncertainty.
It should be noted that, due to computational resource constraints, all networks in this study were deployed in lightweight configurations. This limitation may have prevented more sophisticated architectures, such as Swin Transformer and ConvNeXt, from reaching their full potential.
Discussion
NAFLD, as the most prevalent chronic liver disease globally, affects approximately 25% of the world’s population and has emerged as a primary cause of cirrhosis and HCC. However, despite the crucial importance of early diagnosis and intervention in reversing disease progression, current diagnostic approaches, such as liver biopsy and the Kleiner scoring system (9), are hindered by limitations, including inefficiency and labor-intensiveness. In particular, the interpretation of traditional pathology slides is a time-consuming process, often requiring pathologists to spend 5–10 minutes meticulously analyzing each slide, which substantially increases their diagnostic workload.
In the quest for more efficient diagnostic pathways, the application of supervised DL algorithms, although showing some potential, faces dual challenges of cumbersome annotation processes (35) and high costs associated with obtaining high-quality labeled samples (36). To overcome this dilemma, our study designed MMD-Net, a weakly supervised scoring framework that integrates MIL and MTL strategies, enabling efficient assessment of three key histopathological features: steatosis, inflammation, and ballooning. Through a weakly supervised learning mechanism, MMD-Net may reduce the dependence on large-scale labeled data, thereby effectively controlling implementation costs.
In performance evaluations, MMD-Net achieved promising results. Its quadratic weighted Cohen’s κ coefficients for the three key features compared favorably with those of Heinemann et al.’s method (11), suggesting potential effectiveness in pathological image quantification. The framework may offer advantages not only in addressing annotation efficiency challenges but also in balancing computational efficiency with accuracy, potentially serving as a supplementary tool for pathological assessment. MMD-Net’s MTL performance also indicates possible utility for evaluating complex histopathological features.
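For readers unfamiliar with the agreement metric, the following is a minimal sketch of a quadratic weighted Cohen's κ, assuming that quadratic weighting is what is meant here and that grades are ordinal integers from 0 to `n_classes - 1`. The function name and the example grades are hypothetical, and this is not the evaluation code used in the study.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights for ordinal grades."""
    n = len(y_true)
    # Observed confusion matrix: rows are reference grades, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(observed[i]) for i in range(n_classes)]
    col = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Penalty grows with the squared distance between the two grades.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += w * row[i] * col[j] / n
    # Undefined (division by zero) if both raters use a single identical grade.
    return 1.0 - weighted_observed / weighted_expected


# Made-up grades on a 0-3 scale: one prediction is off by one grade.
pathologist = [0, 1, 2, 3, 2, 1]
model = [0, 1, 2, 2, 2, 1]
kappa = quadratic_weighted_kappa(pathologist, model, n_classes=4)
```

Because the penalty is quadratic in the grade difference, a prediction that is off by one grade costs far less than one that is off by three, which suits ordinal histological scores.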
A performance degradation was observed when evaluation shifted from the bag level to the WSI level, which we attribute to the substantial data complexity inherent in WSIs. This complexity includes extensive non-diagnostic background regions and the increased difficulty for attention mechanisms to localize critical lesions within an ultra-large instance pool. Nevertheless, the model retained satisfactory discriminative ability at the WSI level, underscoring MMD-Net’s potential for real-world clinical application. Future work will focus on refining preprocessing strategies and attention mechanisms to improve robustness in complex whole-slide settings.
This study has several limitations. First, the dataset was limited to specific histopathological features, which may affect the model’s generalizability to all NAFLD/NASH manifestations. Second, although MMD-Net showed overall competence, its accuracy for certain features (e.g., steatosis) could be further optimized. Future work will expand datasets to include more diverse histopathological features and patient demographics, while exploring clinical applications such as real-time assessment and remote diagnosis. Model architecture refinements will also be pursued to enhance efficiency without compromising performance.
Conclusions
We present MMD-Net, a DL system for automated NAS quantification in partially annotated NAFLD histopathology images. The primary objective of this research was to establish a weakly supervised framework using MIL for NAFLD assessment, aiming to reduce the annotation workload for pathologists while developing a clinically applicable diagnostic approach. Although class imbalance in the original dataset necessitated distribution-balancing strategies that might theoretically introduce confounding factors, this study demonstrates that integrating MIL with DA training and MTL can substantially enhance model performance. MMD-Net provides end-to-end, concurrent prediction of Kleiner scores for three diagnostic hallmarks (ballooning, inflammation, steatosis) under weak supervision, with promising performance (average Cohen’s κ: 0.845±0.014). These results suggest that the framework could contribute to standardized NAFLD assessment through clinically applied AI-powered histopathology analysis.
Acknowledgments
The authors gratefully acknowledge the hepatocellular pathology dataset published by Heinemann et al. (July 2023). This publicly available dataset was provided by Duke University Medical Center (Durham, NC, USA), the Institute of Pathology at Hannover Medical School (Germany), and the Boehringer Ingelheim Biorepository (Biberach, Germany/Ridgefield, CT, USA), and was critical for this research.
Footnote
Reporting Checklist: The authors have completed the TRIPOD+AI reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-981/rc
Funding: None.
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-981/coif). The authors have no conflicts of interest to declare.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.
Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.
References
- Eslam M, Sanyal AJ, George J; International Consensus Panel. MAFLD: A Consensus-Driven Proposed Nomenclature for Metabolic Associated Fatty Liver Disease. Gastroenterology 2020;158:1999-2014.e1.
- Estes C, Razavi H, Loomba R, Younossi Z, Sanyal AJ. Modeling the epidemic of nonalcoholic fatty liver disease demonstrates an exponential increase in burden of disease. Hepatology 2018;67:123-33.
- Schuppan D, Afdhal NH. Liver cirrhosis. Lancet 2008;371:838-51.
- Tsochatzis EA, Bosch J, Burroughs AK. Liver cirrhosis. Lancet 2014;383:1749-61.
- Cholankeril G, Wong RJ, Hu M, Perumpail RB, Yoo ER, Puri P, Younossi ZM, Harrison SA, Ahmed A. Liver Transplantation for Nonalcoholic Steatohepatitis in the US: Temporal Trends and Outcomes. Dig Dis Sci 2017;62:2915-22.
- Singh S, Allen AM, Wang Z, Prokop LJ, Murad MH, Loomba R. Fibrosis progression in nonalcoholic fatty liver vs nonalcoholic steatohepatitis: a systematic review and meta-analysis of paired-biopsy studies. Clin Gastroenterol Hepatol 2015;13:643-54.e1-9; quiz e39-40.
- Angulo P, Kleiner DE, Dam-Larsen S, Adams LA, Bjornsson ES, Charatcharoenwitthaya P, Mills PR, Keach JC, Lafferty HD, Stahler A, Haflidadottir S, Bendtsen F. Liver Fibrosis, but No Other Histologic Features, Is Associated With Long-term Outcomes of Patients With Nonalcoholic Fatty Liver Disease. Gastroenterology 2015;149:389-97.e10.
- Brunt EM, Kleiner DE, Wilson LA, Belt P, Neuschwander-Tetri BA; NASH Clinical Research Network (CRN). Nonalcoholic fatty liver disease (NAFLD) activity score and the histopathologic diagnosis in NAFLD: distinct clinicopathologic meanings. Hepatology 2011;53:810-20.
- Kleiner DE, Brunt EM, Van Natta M, Behling C, Contos MJ, Cummings OW, Ferrell LD, Liu YC, Torbenson MS, Unalp-Arida A, Yeh M, McCullough AJ, Sanyal AJ; Nonalcoholic Steatohepatitis Clinical Research Network. Design and validation of a histological scoring system for nonalcoholic fatty liver disease. Hepatology 2005;41:1313-21.
- Metter DM, Colgan TJ, Leung ST, Timmons CF, Park JY. Trends in the US and Canadian Pathologist Workforces From 2007 to 2017. JAMA Netw Open 2019;2:e194337.
- Heinemann F, Gross P, Zeveleva S, Qian HS, Hill J, Höfer A, Jonigk D, Diehl AM, Abdelmalek M, Lenter MC, Pullen SS, Guarnieri P, Stierstorfer B. Deep learning-based quantification of NAFLD/NASH progression in human liver biopsies. Sci Rep 2022;12:19236.
- Yang J, Chen H, Zhao Y, Yang F, Zhang Y, He L, Yao J. ReMix: A General and Efficient Framework for Multiple Instance Learning Based Whole Slide Image Classification. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022. Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg; 2022:35-45.
- Lin T, Yu Z, Hu H, Xu Y, Chen CW. Interventional bag multi-instance learning on whole-slide pathological images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023:19830-9.
- Hashimoto N, Fukushima D, Koga R, Takagi Y, Ko K, Kohno K, Nakaguro M, Nakamura S, Hontani H, Takeuchi I. Multi-scale Domain-adversarial Multiple-instance CNN for Cancer Subtype Classification with Unannotated Histopathological Images. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Seattle, WA, USA; 2020:3851-60.
- Yan R, He Q, Liu Y, Gou J, Sun Q, Zhou G, He Y, Tian G. DEST: Deep Enhanced Swin Transformer Toward Better Scoring for NAFLD. In: Pattern Recognition and Computer Vision: 5th Chinese Conference, PRCV 2022, Shenzhen, China, November 4–7, 2022; Proceedings, Part II. Springer-Verlag, Berlin, Heidelberg; 2022:204-14.
- Yin C, Liu S, Lyu F, Lu J, Darkner S, Wong VWS, Yuen PC. Xfibrosis: Explicit vessel-fiber modeling for fibrosis staging from liver pathology images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024:11282-91.
- Junaid HHS, Daneshfar F, Mohammad MA. Automatic colorectal cancer detection using machine learning and deep learning based on feature selection in histopathological images. Biomedical Signal Processing and Control 2025;107:107866.
- Lu MY, Chen B, Williamson DFK, Chen RJ, Liang I, Ding T, Jaume G, Odintsov I, Le LP, Gerber G, Parwani AV, Zhang A, Mahmood F. A visual-language foundation model for computational pathology. Nat Med 2024;30:863-74.
- Lu MY, Williamson DFK, Chen TY, Chen RJ, Barbieri M, Mahmood F. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat Biomed Eng 2021;5:555-70.
- Shabanian M, Taylor Z, Woods C, Bernieh A, Dillman J, He L, Ranganathan S, Picarsic J, Somasundaram E. Liver fibrosis classification on trichrome histology slides using weakly supervised learning in children and young adults. J Pathol Inform 2025;16:100416.
- Ahmadvand P, Farahani H, Farnell D, Darbandsari A, Topham J, Karasinska J, Nelson J, Naso J, Jones SJM, Renouf D, Schaeffer DF, Bashashati A. A Deep Learning Approach for the Identification of the Molecular Subtypes of Pancreatic Ductal Adenocarcinoma Based on Whole Slide Pathology Images. Am J Pathol 2024;194:2302-12.
- Caruana R. Multitask learning. Machine learning 1997;28:41-75.
- Kokkinos I. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017:6129-38.
- Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning; 2008:160-7.
- Huang JT, Li J, Yu D, Deng L, Gong Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE; 2013:7304-8.
- Gao Z, Mao A, Wu K, Li Y, Zhao L, Zhang X, Wu J, Yu L, Xing C, Gong T, Zheng Y, Meng D, Zhou M, Li C. Childhood Leukemia Classification via Information Bottleneck Enhanced Hierarchical Multi-Instance Learning. IEEE Trans Med Imaging 2023;42:2348-59.
- Kamoona AM, Gostar AK, Bab-Hadiashar A, Hoseinnezhad R. Multiple instance-based video anomaly detection using deep temporal encoding–decoding. Expert Syst Appl 2023;214:119079.
- Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Available online:
- He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:770-8.
- Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019:1314-24.
- Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S. An image is worth 16x16 words: Transformers for image recognition at scale. Available online:
- Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. 2022. Available online: 10.48550/arXiv.2201.03545
- Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021:10012-22.
- Chattopadhay A, Sarkar A, Howlader P, Balasubramanian VN. Grad-Cam++: Generalized gradient-based visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE; 2018:839-47.
- Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, van der Laak JAWM, van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal 2017;42:60-88.
- Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med 2019;25:24-9.



