STB-Net: a Siamese architecture-based reconstruction-segmentation network for ocular surface image segmentation
Original Article

STB-Net: a Siamese architecture-based reconstruction-segmentation network for ocular surface image segmentation

Cheng Wan1,2, Jimei Wu1 ORCID logo, Yulong Mao1, Weihua Yang3, Yang Yang1

1College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China; 2College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China; 3Ophthalmic Imaging and AI Center, Shenzhen Eye Hospital, Shenzhen Eye Medical Center, Southern Medical University, Shenzhen, China

Contributions: (I) Conception and design: C Wan, J Wu, Y Mao; (II) Administrative support: C Wan, W Yang, Y Yang; (III) Provision of study materials or patients: C Wan, W Yang; (IV) Collection and assembly of data: J Wu, Y Mao; (V) Data analysis and interpretation: J Wu, Y Mao; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Weihua Yang, MD. Ophthalmic Imaging and AI Center, Shenzhen Eye Hospital, Shenzhen Eye Medical Center, Southern Medical University, No. 18 Zetian Road, Futian District, Shenzhen, 518040, China. Email: benben0606@139.com; Yang Yang, PhD. College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Jiangjun Road Campus, 29 Jiangjun Road, Jiangning District, Nanjing 211106, China. Email: eeyy@nuaa.edu.cn.

Background: Eyelid morphological parameters are crucial for quantitatively assessing eyelid morphology and diagnosing related disorders. However, achieving precise and automated measurement of these parameters remains a challenge. This study aims to develop an automated segmentation model for ocular surface images to accurately segment key anatomical structures and compute critical eyelid metrics.

Methods: We propose STB-Net, a novel segmentation model tailored for ocular surface imagery. The baseline model, TB-Net, enhances the TransUNet architecture by integrating a Bottom-up Local Attention Modulation (BLAM) module into its decoder, effectively encoding fine-grained features from shallow layers into high-level semantics. The final STB-Net framework integrates TB-Net with an SRSNetwork. In this setup, a TB-Net is first employed as a reconstruction model to learn complex semantic information. Subsequently, its encoder serves as a dynamic convolution module to generate adaptive parameters for a second, segmentation-oriented TB-Net, thereby boosting segmentation performance through augmented reconstruction-task training. The model automatically computes left, central, and right palpebral fissure heights, palpebral fissure width, and area based on the segmentation results.

Results: Experimental evaluation on a local dataset demonstrated the model’s high efficacy. For palpebral fissure segmentation, the model achieved a Dice score of 0.9875, Global Accuracy (GA) of 0.9955, and Intersection-over-Union (IoU) of 0.9767. Corneal segmentation performance was equally strong, attaining a Dice of 0.9891, GA of 0.9978, and IoU of 0.9790.

Conclusions: The proposed STB-Net model provides a robust and effective solution for the automated segmentation of ocular surface structures and the precise quantification of eyelid morphological parameters. It holds significant promise for enhancing the objectivity and efficiency of clinical diagnoses of eyelid disorders.

Keywords: Deep learning; image segmentation; unsupervised learning; ocular surface imaging; assisted diagnosis


Submitted May 14, 2025. Accepted for publication Oct 28, 2025. Published online Nov 21, 2025.

doi: 10.21037/qims-2025-1140


Introduction

The eye, a vital sensory organ for human-environment interaction, exhibits external morphological features that reflect diverse physiological characteristics, including aging progression, severity of eyelid disorders, and clinical status of thyroid-associated ophthalmopathy (1). As critical protective structures of the ocular surface, the eyelids demonstrate complex kinematic functions and heterogeneous morphological traits. Key anatomical components comprise the palpebral fissure, cornea, upper eyelid, and lower eyelid (2). The upper eyelid, positioned anteriorly in the orbit and overlying the palpebral fissure, consists of pliable tissues serving protective roles. Conversely, the lower eyelid, situated anterior to the globe and inferior to the palpebral fissure, forms a curtain-like structure with analogous protective functions. The palpebral fissure, defined as the interpalpebral space between the upper and lower eyelids, is quantified by its width (linear distance between medial and lateral canthi) and height (vertical interpalpebral distance). The cornea, a transparent membrane covering the anterior ocular surface, plays essential roles in visual transduction. When light enters the eyes, it first passes through the cornea (3). The corneal epithelium’s crucial metabolic functions and remarkable capacity for dynamic remodeling have long been acknowledged (4). Abnormalities in eyelid morphology often indicate varying types and severity levels of ocular surface diseases.

Automated measurement of eyelid morphological parameters holds significant clinical value in ophthalmic diagnosis and perioperative assessment, particularly for diagnosing blepharoptosis, eyelid lesions, and systemic disease-related palpebral alterations. Conventional manual measurements using physical tools remain subjective and labor-intensive. Traditional segmentation procedures mainly rely on manual operations requiring specialist knowledge, which are still time-consuming and labor-intensive (5). With recent advancements in computer vision and deep learning technologies, increasing research efforts have been directed toward automated measurement methodologies to enhance precision, minimize human intervention, and improve diagnostic/therapeutic efficiency for eyelid disorders. The field of intelligent ophthalmology (IO) is developing at a rapid pace due to the continuous innovation in medical technology, and this emerging force is transforming ophthalmic healthcare (6). It seems evident that deep learning technology is augmenting human ability in medical image analysis and raising new approaches to diagnosis and treatment in clinical settings (7). In the early 21st century, researchers pioneered methodologies to assess aging progression by quantifying dimensional discrepancies between the ptotic lower eyelid surface and its virtual projection area using three-dimensional (3D) imaging (8). This approach established the foundation for eyelid morphological studies while revealing intrinsic correlations between palpebral configurations and physiological states. In 2007, Read et al. (9) identified significant associations between corneal topography data and eyelid morphology in cohorts with refractive errors, providing new directions for exploring the diagnostic relevance of eyelid features. Further advancements emerged in 2015, when Maseedupally et al. 
(10) demonstrated negligible racial disparities in corneal asphericity but highlighted strong correlations between upper eyelid curvature, lower eyelid slope, and corneal parameters. The maturation of 3D imaging technologies catalyzed breakthroughs in 2021, as Liu et al. (11) reconstructed high-fidelity 3D eyelid models via point cloud data acquisition from precision scanners. Their framework integrated feature point detection with geometric analysis to quantify curvature, angular parameters, and other morphometric indices. Most recently, in 2023, Guo et al. (12) developed a standardized 3D imaging protocol for upper eyelid area and volume measurements, validating its reliability through systematic region-of-interest selection strategies. In 2023, Zhang et al. (13) demonstrated that the presence of pterygium significantly affects corneal densitometry, particularly in the anterior and central layers. These changes are closely correlated with the severity of pterygium and various corneal parameters, such as corneal astigmatism and keratometry. The emergence of deep learning convolutional neural network (CNN) technology has further developed image segmentation methods and applied them in clinical practice. Image semantic segmentation plays an important role in the field of computer vision (14). Recent advances in deep learning have significantly enhanced methodologies for measuring eyelid morphological parameters. In 2021, Van Brummen et al. (15) upgraded U-Net with ResNet50 as the backbone to segment iris, corneal, and brow regions, enabling automated measurements of medial canthal height, lateral canthal height, medial brow height, lateral brow height, intercanthal distance, and lateral intercanthal distance. Also in 2021, Deng et al. (16) integrated a modified ResNet18 encoder into U-Net to segment the pupil and tear meniscus for tear river height quantification. Concurrently, Cao et al. 
(17) incorporated attention gates into U-Net to improve feature extraction, achieving multi-dimensional measurements of eyelid height, width, and curvature through conjunctiva and iris segmentation. In 2023, Shao et al. (18) proposed an R2U-Net variant with attention gates to segment eyelids and corneas in thyroid-associated ophthalmopathy, automating comprehensive parameter calculations including palpebral fissure length, eyelid retraction distance, height, and area. Hua et al. (19) developed a DeepLab V3-based framework for palpebral fissure segmentation and tear meniscus height measurement to assist dry eye disease diagnosis. Most recently, in 2024, Nam et al. (20) devised a CNN-driven system employing three DeepLab V3+ models to measure vertical distances between the upper/lower eyelid margins and corneal light reflex points via scleral, corneal, and light reflex region segmentation.

While these studies effectively measure various eyelid morphological parameters, they exhibit notable limitations. First, most approaches improve segmentation accuracy by modifying specific components of U-Net (21), yet fail to effectively establish long-range dependency modeling. Second, their reliance on limited segmentation datasets with small sample sizes constrains model generalizability. To address these issues, we propose an effective and high-precision segmentation model that integrates long-range dependency construction and leverages non-segmentation datasets, aiming to enhance both measurement accuracy and model generalization capabilities. We present this article in accordance with the CLEAR reporting checklist (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1140/rc).


Methods

This study aims to achieve precise segmentation of the corneal region and palpebral fissure region in ocular surface images to accurately measure the palpebral fissure width, left palpebral fissure height, central palpebral fissure height, right palpebral fissure height, and palpebral fissure area. For the segmentation of the corneal and palpebral fissure regions, an improved TransUNet model, named TB-Net, is proposed. Based on this model, a reconstruction-segmentation network utilizing a Siamese architecture with dynamic parameter convolution is implemented and named STB-Net. Finally, the segmented corneal and palpebral fissure regions are input into a measurement module to automatically obtain the required eyelid morphological parameters.

Parameter measurement process

In the field of deep learning, image classification and image segmentation are two critical areas with substantial overlap. Specifically, the encoder module of a segmentation model is equivalent to a classification model without its fully connected layers. Moreover, while image classification aims to categorize the entire image, image segmentation can be interpreted as classifying each pixel within the image. This distinction, however, poses a challenge: many medical image datasets, which do not include segmentation labels, cannot be directly applied to image segmentation tasks. To mitigate this, our research adopts a Siamese architecture to connect the reconstruction task with the segmentation task. By utilizing unsupervised learning in the reconstruction model with classification datasets of the same type, the reconstruction task facilitates the segmentation task.

The eyelid morphological parameter measurement method proposed in this paper is illustrated in Figure 1, and the overall process can be divided into the following four stages:

  • Reconstruction task phase: initially, a reconstruction task is performed using ocular surface images and their counterparts. By feeding these images into the reconstruction model, the model learns the latent feature distributions and structural information within the images during this phase. The purpose of the reconstruction task is to train the model’s encoder through unsupervised learning, enabling it to fully extract image features. This lays a solid foundation for the subsequent segmentation task.
  • Siamese architecture construction phase: after training the reconstruction model, the encoder module is extracted and embedded into the segmentation model as a dynamic convolution module. This modular design leverages the semantic information learned by the reconstruction model during the reconstruction task to generate adaptive convolution parameters for the segmentation model, thereby enhancing its ability to recognize details and complex structures. By utilizing the interoperability between the reconstruction and segmentation tasks, a Siamese architecture-based Reconstruction-Segmentation Network (STB-Net) is formed.
  • Segmentation task phase: upon completing the construction of the Siamese architecture, the ocular surface images from the dataset are input into the STB-Net to perform the segmentation task. Through the segmentation network, the corneal region and the palpebral fissure region can be accurately segmented. These segmentation results not only provide a precise basis for subsequent parameter measurements but also offer reliable support for medical diagnosis and treatment planning.
  • Automated measurement of eyelid morphological parameters phase: finally, the measurement module is utilized to analyze and compute the segmentation results, generating specific values for three eyelid morphological parameters. These parameters are automatically generated, eliminating the subjectivity and errors associated with manual annotation while significantly improving measurement efficiency and accuracy. The design of the measurement module fully considers the characteristics of the segmentation results and the computational requirements of medical parameters, achieving a seamless transition from images to clinical indicators.
Figure 1 The flowchart of eyelid morphological parameter measurement method.

TB-Net

In the field of medical image segmentation, the U-Net network is a classic and widely used deep learning model. Its architecture is characterized by a typical encoder-decoder structure, forming a U-shaped configuration. With subsequent research advancements, the U-Net network has undergone multiple developmental stages, resulting in various variants and derivative versions. Among these, U-Net++ (22) is a notable improvement, which introduces multi-layer network structures within the skip connections to enhance the model’s feature representation and detail recovery capabilities, thereby further improving segmentation accuracy. In recent years, U-Net variants have also incorporated Transformer mechanisms, introducing self-attention mechanisms at the encoder level to address the limitations of traditional CNNs in capturing long-range dependencies. A representative example of this is TransUNet.

TransUNet employs a hybrid CNN-Transformer architecture to compensate for the loss of feature resolution inherent in Transformers. The decoder upsamples the encoded features and fuses them with CNN feature maps of different resolutions from the encoder to achieve precise localization. The TransUNet network structure is illustrated in Figure 2. In the encoder part of the network, three stages of bottleneck modules (23) are first utilized, followed by a linear transformation to convert the feature format into the Transformer feature format. The features are then further encoded through 12 Transformer layers (24). The decoder part is similar to U-Net, employing CNN upsampling operations and fusing the features at each layer with the output features from the corresponding CNN layers in the encoder. While such decoding operations can enhance the model’s feature recovery capabilities through skip connections, the decoder itself does not consider the fusion of high-level semantic information with low-level semantic information, leading to limitations in local information processing.

Figure 2 The structure of TransUNet.

In the process of feature extraction, most convolutional networks learn high-level semantic features by gradually reducing the size of feature maps, which often results in local information being easily overwhelmed by surrounding background features in deeper layers, making it difficult to accurately reconstruct the edges of segmentation targets. To highlight and preserve local information, we employ a Bottom-Up Local Attentional Modulation (BLAM) (25) module, which integrates small-scale fine-grained features from low-level features into deeper high-level features. The BLAM module is an enhancement feature attention module applied in the field of infrared small target detection. Unlike general computer vision tasks, the bottleneck in infrared small target detection lies in how to retain and highlight the features of weak and small targets in deeper layers, rather than the lack of high-level semantic information in shallow layers. This is similar to the issue of local information loss encountered at the edges of targets in medical image segmentation. Therefore, inspired by the idea of preserving weak features, we apply the BLAM module to the decoder of the segmentation model to effectively enhance the decoder’s attention to fine and edge-local information. This is particularly beneficial in medical image segmentation, as it helps to more accurately restore the boundaries of complex structures and improve the segmentation accuracy of small lesions and ambiguous boundaries.

The structure of BLAM is illustrated in Figure 3. The Local Channel Attention Mechanism (L) aggregates channel contextual features at each spatial position, and the aggregation process is described by Eq. [1]. Here, PWConv, σ, δ, and B denote Pointwise Convolution, Sigmoid function (26), ReLU activation function (27), and Batch Normalization layer (28), respectively. The kernel sizes of PWConv_1 and PWConv_2 are C⁄4×C×1×1 and C×C⁄4×1×1, respectively, resembling a bottleneck structure. It is noteworthy that the attention weight map has the same shape as the input feature map, thus enabling the highlighting of subtle details in an element-wise manner (both spatially and across channels). The motivation behind the BLAM module is to embed small-scale details into high-level coarse feature maps, which is achieved through dynamic weighted modulation of high-level features guided by low-level features. Given X as the low-level feature and Y as the high-level feature, the cross-layer fusion feature Z can be obtained via a BLAM module, as described by Eq. [2], where ⊗ represents element-wise multiplication.

L(X) = σ(B(PWConv_2(δ(B(PWConv_1(X))))))

Z = X + L(X) ⊗ Y

Figure 3 The structure of BLAM. BLAM, Bottom-up Local Attention Modulation.
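Eqs. [1] and [2] can be sketched in PyTorch as follows; the module and variable names are our own, and the layer ordering simply transcribes Eq. [1], so this is an illustrative sketch rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BLAM(nn.Module):
    """Sketch of Bottom-up Local Attention Modulation (Eqs. [1]-[2])."""
    def __init__(self, channels: int):
        super().__init__()
        # L(X) = sigma(B(PWConv_2(delta(B(PWConv_1(X)))))), a bottleneck C -> C/4 -> C
        self.local_attn = nn.Sequential(
            nn.Conv2d(channels, channels // 4, kernel_size=1),  # PWConv_1
            nn.BatchNorm2d(channels // 4),                      # B
            nn.ReLU(inplace=True),                              # delta
            nn.Conv2d(channels // 4, channels, kernel_size=1),  # PWConv_2
            nn.BatchNorm2d(channels),                           # B
            nn.Sigmoid(),                                       # sigma
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Z = X + L(X) * Y: the low-level feature X gates the high-level feature Y
        return x + self.local_attn(x) * y

blam = BLAM(16)
x = torch.randn(2, 16, 32, 32)  # upsampled low-level feature
y = torch.randn(2, 16, 32, 32)  # high-level feature at the same resolution
z = blam(x, y)
print(z.shape)  # torch.Size([2, 16, 32, 32])
```

Because `local_attn` preserves the (B, C, H, W) shape of X, the modulation is element-wise both spatially and across channels, matching the description of the attention weight map above.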

The deeper networks can provide better semantic features and a more comprehensive understanding of scene context, which helps to resolve ambiguities between targets and background distractors. However, as the network depth increases, there is also a heightened risk of losing spatial details of the target. The network structure of TB-Net, as illustrated in Figure 4, improves upon the decoder module of TransUNet. In TB-Net, low-level features are upsampled and then interact with high-level features within the BLAM, thereby enhancing the local information focus of low-level features through dynamic weighting of high-level features. Through the progressive optimization facilitated by multiple BLAM modules, TB-Net achieves efficient feature progression in the decoder, ensuring that features at different levels possess stronger specificity and local focus during the fusion process. Each BLAM module dynamically adjusts the response of high-level semantic features to low-level detailed information, gradually reducing the over-reliance of deep networks on background information and thereby highlighting the saliency of target regions. This layer-by-layer reinforcement mechanism ensures that the segmentation network retains spatial details while effectively capturing local features of the target, thereby improving the robustness and accuracy of segmentation in complex scenes.

Figure 4 The structure of TB-Net. BLAM, Bottom-up Local Attention Modulation.

STB-Net

Over the past decade, deep learning-based methods have achieved remarkable progress in the field of computer vision, particularly in tasks such as image reconstruction and image segmentation. Image reconstruction aims to restore images through network models, which is an unsupervised task that does not rely on additional labeled data. In contrast, image segmentation requires network models to extract target objects from images, often depending on accurate segmentation labels. In the domain of medical image segmentation, the scarcity of datasets and the high cost of annotating segmentation labels result in limited datasets that struggle to demonstrate good generalization on deep learning models. Consequently, exploring an effective method to leverage large amounts of unlabeled data for segmentation tasks, thereby enhancing the performance of segmentation models, has become a critical issue in the field of medical image segmentation.

This study introduces a Siamese architecture, SRSNetwork, based on dynamic-parameter convolution for reconstruction-segmentation tasks. In medical image segmentation, due to the absence of textures, sizes, shapes, and other features commonly found in natural images, the data characteristics are highly complex. SRSNetwork proposes Dynamic-Parameter Convolution (DPConv), which generates adaptive dynamic-parameter convolution kernels based on the data distribution of input features. Leveraging DPConv, SRSNetwork links the reconstruction and segmentation tasks, forming a symmetrical Siamese architecture. As illustrated in Figure 5, the upper part is specifically designed for the reconstruction task, while the lower part is tailored for the segmentation task. DPConv establishes the connection between these two parts. In this study, the previously proposed TB-Net is employed for both the reconstruction and segmentation tasks within this architecture, resulting in a new model named STB-Net. STB-Net utilizes unsupervised reconstruction tasks to learn high-level semantic information from unlabeled medical image datasets of the same category, thereby providing more reliable dynamic-parameter convolution kernels for the segmentation task, ultimately enhancing the performance and accuracy of the segmentation model.

Figure 5 The SRSNetwork architecture. DPConv, Dynamic-Parameter Convolution.

Traditional dynamic convolution methods include CConv (29), DyConv (30), and ODConv. The mathematical description of these methods can be summarized as Eq. [3], where ω represents the new convolution kernel, α_{ji} denotes the combination coefficients, and ω_i signifies the set of fixed convolution kernels.

ω = Σ_{i=1}^{n} Σ_{j=1}^{n} α_{ji} ω_i

Both DyConv and ODConv employ a fixed number of convolutional combinations, with the sole distinction being the quantity of these combinations. Although they utilize a series of convolutions to replace static convolutions, thereby achieving dynamic convolution, they are fundamentally still based on fixed convolutions: once the network converges, they rely solely on a limited set of fixed convolutional kernels. In contrast, DPConv represents a truly dynamic convolution, as it directly generates the convolutional kernels.
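For contrast, the fixed-kernel combination of Eq. [3] can be sketched in the DyConv style as follows; the gating network and the per-sample loop here are illustrative assumptions, not any specific paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FixedKernelDynamicConv(nn.Module):
    """DyConv/ODConv-style dynamic convolution: the effective kernel is a
    weighted combination of n fixed kernels (Eq. [3])."""
    def __init__(self, in_ch: int, out_ch: int, n_kernels: int = 4):
        super().__init__()
        self.kernels = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, 3, 3) * 0.01)
        self.gate = nn.Linear(in_ch, n_kernels)  # alpha from globally pooled input

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        alpha = torch.softmax(self.gate(x.mean(dim=(2, 3))), dim=1)  # (B, n)
        outs = []
        for i in range(x.size(0)):  # one combined kernel per sample
            w = (alpha[i].view(-1, 1, 1, 1, 1) * self.kernels).sum(dim=0)
            outs.append(F.conv2d(x[i:i + 1], w, padding=1))
        return torch.cat(outs, dim=0)

dyn = FixedKernelDynamicConv(8, 16)
out = dyn(torch.randn(2, 8, 20, 20))
print(out.shape)  # torch.Size([2, 16, 20, 20])
```

However the input varies, the resulting kernel always lies in the span of the n fixed kernels, which is exactly the limitation DPConv removes by generating kernels directly.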

The principle of DPConv is illustrated in Figure 6, where Er, Es, and DS represent the encoder of the reconstruction model, the encoder of the segmentation model, and the decoder of the segmentation model, respectively. The semantic features output from the upper part are transformed into convolutional kernels through parameter deformation, while the feature maps generated from the lower part serve as the convolution targets. These two components undergo a convolution operation, and the resulting feature maps are then added to the original feature maps before being passed to the decoder. This constitutes the core of the entire network’s operation. It is important to note that DPConv is only applied at the end of the encoder to generate additional feature outputs, which are then added to the original encoder’s outputs without altering the number of channels during the feature transmission process. This method, akin to residual connections, ensures that the features transmitted by the original segmentation model remain effective and are further enhanced by the integration of new features.

Figure 6 The principle of DPConv. BN, batch normalization; DPConv, Dynamic-Parameter Convolution.
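The DPConv operation can be sketched as follows; the depthwise pooling used here to deform the reconstruction encoder's features into a kernel is a simplified assumption for illustration, not SRSNetwork's exact parameter deformation.

```python
import torch
import torch.nn.functional as F

def dpconv(rec_feat: torch.Tensor, seg_feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Sketch of DPConv: features from the reconstruction encoder (E_r) are
    deformed into a convolution kernel and applied to the segmentation
    encoder's (E_s) feature map, with a residual-style add that leaves the
    channel count unchanged before the decoder (D_s)."""
    b, c = seg_feat.shape[:2]
    # Parameter deformation (assumed): pool E_r features to one k x k filter
    # per channel, giving a depthwise kernel of shape (B*C, 1, k, k).
    kernel = F.adaptive_avg_pool2d(rec_feat, k).reshape(b * c, 1, k, k)
    # Fold the batch into the channel axis so each sample uses its own kernel.
    x = seg_feat.reshape(1, b * c, *seg_feat.shape[2:])
    out = F.conv2d(x, kernel, padding=k // 2, groups=b * c).reshape_as(seg_feat)
    return seg_feat + out  # add to the original features, as in the text

rec = torch.randn(2, 32, 16, 16)   # reconstruction-encoder features
seg = torch.randn(2, 32, 16, 16)   # segmentation-encoder features
fused = dpconv(rec, seg)
print(fused.shape)  # torch.Size([2, 32, 16, 16])
```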

Methodology for calculating eyelid morphological parameters

The final step of this study involves measuring three types of eyelid morphology parameters based on the corneal segmentation results and palpebral fissure segmentation results obtained from the previously described segmentation algorithm. The resolution of the ocular surface images in this study is 2,976×1,984. After professional ophthalmologists measured the height and width of all ocular surface images in the experiment, the average width was found to be 14.65 cm, and the average height was 9.77 cm. Since the instrument used to capture the ocular surface images produces images that are 4 times the actual size, the actual average width is 3.6625 cm, and the actual average height is 2.4425 cm. Therefore, on the original image, the width of each pixel is 0.012306 mm, and the height is 0.012310 mm. Due to the Transformer layer in the proposed model, there are constraints on the input image size. An excessively large input size would lead to insufficient GPU memory during training, while an excessively small input size would result in lower segmentation accuracy. Consequently, we resized the input images to 384×256, making the width of each pixel 0.09537 mm and the height 0.09541 mm.

For each ocular surface image in the test set, record the left boundary, right boundary, and the horizontal coordinate of the center point in the corneal segmentation map. In the palpebral fissure segmentation map, locate the number of target pixels in the columns corresponding to these three horizontal coordinates, and multiply by the corresponding pixel height to obtain the left palpebral fissure height, central palpebral fissure height, and right palpebral fissure height. For the palpebral fissure width, calculate the difference between the leftmost and rightmost horizontal coordinates in the palpebral fissure segmentation map, and then multiply by the pixel width to obtain the corresponding palpebral fissure width. As for the palpebral fissure area, directly calculate the total number of target pixels and multiply by the area of a single pixel to obtain the corresponding palpebral fissure area.
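The measurement procedure above can be sketched in plain Python; the masks are binary (0/1) row lists, the function and key names are our own, and the default pixel sizes follow the 384×256 resizing described earlier.

```python
def eyelid_parameters(cornea, fissure, px_w=0.09537, px_h=0.09541):
    """Compute the five eyelid parameters (mm / mm^2) from binary masks."""
    # Corneal left boundary, right boundary, and centre column
    cornea_cols = [x for row in cornea for x, v in enumerate(row) if v]
    left, right = min(cornea_cols), max(cornea_cols)
    centre = (left + right) // 2

    def column_height(mask, x):
        # Count target pixels in column x, then scale by the pixel height
        return sum(row[x] for row in mask) * px_h

    fissure_cols = [x for row in fissure for x, v in enumerate(row) if v]
    return {
        'left_height': column_height(fissure, left),
        'central_height': column_height(fissure, centre),
        'right_height': column_height(fissure, right),
        'width': (max(fissure_cols) - min(fissure_cols)) * px_w,
        'area': sum(v for row in fissure for v in row) * px_w * px_h,
    }

# Toy 4x6 masks: the fissure fills the frame, the cornea spans columns 2-3
fissure = [[1] * 6 for _ in range(4)]
cornea = [[0] * 6, [0, 0, 1, 1, 0, 0], [0, 0, 1, 1, 0, 0], [0] * 6]
print(eyelid_parameters(cornea, fissure, px_w=1.0, px_h=1.0))
```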

This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by Shenzhen Eye Hospital (approval No. 2025KYYJ023-01) and the Affiliated Eye Hospital of Nanjing Medical University (approval No. 2017012). Informed consent was waived due to the retrospective and anonymized nature of the data.


Results

Dataset

This study involves two types of datasets due to the inclusion of both reconstruction and segmentation tasks in the proposed algorithm. One is the ocular surface image classification dataset, and the other is the ocular surface image segmentation dataset. These two types of images are captured using different instruments, resulting in variations in their actual sizes and image resolutions. Additionally, the images are automatically labeled as left or right eye upon input by the instruments. Although the classification dataset lacks segmentation labels, it consists of the same type of ocular surface images and therefore holds significant reuse value; it can be utilized alongside the ocular surface image segmentation dataset in the reconstruction task of this study.

First, the ocular surface image segmentation dataset is provided by Shenzhen Eye Hospital, comprising a total of 250 ocular surface images with corneal and palpebral fissure labels, each with a resolution of 2,976×1,984. All images were captured under consistent conditions. Second, the ocular surface image classification dataset consists of 2,366 ocular surface disease images provided by Shenzhen Eye Hospital and 489 ocular surface disease images provided by the Affiliated Eye Hospital of Nanjing Medical University, with resolutions of 2,976×1,984 and 5,184×3,456, respectively (31). These images are categorized into three classes: normal ocular surface, ocular surface hemorrhage, and pterygium. All images have been anonymized to ensure no patient privacy information is included.

In the reconstruction task, both the classification dataset and the segmentation dataset of ocular surface images are utilized, totaling 3,105 ocular surface images. Since the reconstruction task is intended to assist the segmentation task, no separate test set is required. Therefore, the dataset is divided into a training set and a validation set in a 9:1 ratio, resulting in 2,795 images for training and 310 images for validation. In the segmentation task, the ocular surface image segmentation dataset is partitioned into training, validation, and test sets in a 7:1:2 ratio, adhering to the principle of stratified sampling. This yields 174 images for training, 25 images for validation, and 51 images for testing. Moreover, to ensure the scientific rigor of model evaluation and the independence of dataset partitioning, we implemented controls during the data collection phase to guarantee that only one eye was imaged per patient. This approach effectively prevents images from the same patient or the same eye from appearing across the training, validation, and test sets, thereby minimizing the risk of data leakage at the source.

In addition, since the width and height of the images differ and the optimal input size for the proposed algorithm is 384×384, directly stretching the images to that size would violate their original aspect ratio, easily distorting target shapes and thereby degrading the network’s understanding of the images and its segmentation accuracy. Therefore, in the experiments, we adopted a proportional scaling method: each image is scaled to the required size based on its longer side, and zeros are padded on both sides of the shorter side to reach the same dimensions. This preserves the original aspect ratio of the image and avoids information loss due to deformation. The data augmentation methods employed include random resizing, random horizontal flipping, and random vertical flipping. The segmentation labels in this study were completed under the guidance of medical professionals, using the annotation software LabelMe. The preprocessing methods for the labels are consistent with those for the original images. The ocular surface images and their corresponding labels are shown in Figure 7.

Figure 7 The ocular surface image and corresponding labels.

Model training

When training TB-Net on the reconstruction task, the model must restore all three RGB channels of the image, and the discrepancy between the model’s prediction and the ground truth must be evaluated across all of them. Therefore, the mean squared error (MSE) is employed as the loss function, as given in Eq. [4], where N represents the total number of pixels across the three channels, yi denotes the ground-truth value of the i-th pixel, and ŷi its predicted value.

\[ \mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(y_i-\hat{y}_i\right)^2 \tag{4} \]

In addition, Stochastic Gradient Descent (SGD) (32) is employed as the optimizer with a momentum of 0.9, weight decay of 0.0001, and a learning rate of 0.001. The batch size for training is set to 4, and the number of iterations is 200.
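Under the stated settings, the objective and a single optimizer update can be sketched in NumPy (illustrative only; in PyTorch this corresponds to `nn.MSELoss()` together with `torch.optim.SGD(params, lr=0.001, momentum=0.9, weight_decay=0.0001)`):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error over all pixels of the three channels (Eq. [4])."""
    return np.mean((y_true - y_pred) ** 2)

def sgd_step(w, grad, velocity, lr=0.001, momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay, using the
    reconstruction-training hyperparameters stated above."""
    g = grad + weight_decay * w          # L2 weight-decay term folded into the gradient
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity
```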

For the segmentation task, STB-Net is trained using a weighted sum of BCE Loss and Dice Loss as the loss function, as detailed in Eqs. [5-7]. Here, N represents the total number of pixels, yi denotes the ground truth of the i-th pixel, ŷi its predicted value, and α and β are the weights of the two loss terms, both set to 0.5.

\[ \mathrm{DiceLoss}=1-\frac{2\sum_{i=1}^{N}\hat{y}_i y_i}{\sum_{i=1}^{N}\hat{y}_i^{2}+\sum_{i=1}^{N}y_i^{2}} \tag{5} \]

\[ \mathrm{BCELoss}=-\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\right] \tag{6} \]

\[ \mathrm{BCEDiceLoss}=\alpha\cdot\mathrm{DiceLoss}+\beta\cdot\mathrm{BCELoss} \tag{7} \]

In addition, the Adam optimizer (33) was employed with a weight decay set to 0.0003 and a learning rate of 0.0001. The training batch size was configured to 8, and the number of iterations was set to 300. A cosine annealing learning rate scheduling strategy was adopted to adjust the learning rate during the training process.
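The combined loss follows directly from Eqs. [5-7] (a NumPy sketch; the small `eps` terms are an assumption added for numerical stability and are not part of the equations):

```python
import numpy as np

def dice_loss(y, p, eps=1e-7):
    """Eq. [5]; eps guards against an all-zero denominator."""
    return 1.0 - 2.0 * np.sum(p * y) / (np.sum(p ** 2) + np.sum(y ** 2) + eps)

def bce_loss(y, p, eps=1e-7):
    """Eq. [6]; predictions clipped away from 0/1 before taking logs."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_dice_loss(y, p, alpha=0.5, beta=0.5):
    """Eq. [7], with alpha = beta = 0.5 as in the text."""
    return alpha * dice_loss(y, p) + beta * bce_loss(y, p)
```

Weighting the region-level Dice term against the pixel-level BCE term balances shape overlap against per-pixel calibration.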

The experiments in this study were conducted using the PyTorch deep learning framework. The hardware configuration includes an Intel i7-14700KF processor running at 5.6 GHz and an NVIDIA RTX 4090 GPU with 24 GB of memory. The software environment consists of Windows 11, Python 3.8, PyTorch 2.3.1, and OpenCV 4.5.5.

Evaluation indicators

To evaluate the segmentation performance of the model, the experiments employ the Dice coefficient (34), Intersection-over-Union (IoU), and Global Accuracy (GA) as evaluation metrics. The formulas are presented in Eqs. [8-10], where y denotes the label pixel matrix, ŷ the predicted pixel matrix, TP the number of pixels correctly predicted as the target, TN the number correctly predicted as background, FP the number incorrectly predicted as the target, and FN the number incorrectly predicted as background. The Dice coefficient describes the similarity between the prediction and the ground truth, with values closer to 1 indicating higher similarity. The IoU measures the overlap between the prediction and the ground truth, with a ratio closer to 1 signifying greater overlap. The GA is the proportion of correctly predicted pixels among all pixels, a crucial metric for assessing the model’s ability to distinguish target from background.

\[ \mathrm{Dice}=\frac{2\left|y\cap\hat{y}\right|}{\left|y\right|+\left|\hat{y}\right|} \tag{8} \]

\[ \mathrm{IoU}=\frac{\left|y\cap\hat{y}\right|}{\left|y\cup\hat{y}\right|} \tag{9} \]

\[ \mathrm{GA}=\frac{TP+TN}{TP+TN+FP+FN} \tag{10} \]
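Eqs. [8-10] can be computed from a pair of binary masks as follows (a minimal NumPy sketch; it assumes at least one foreground pixel in y or ŷ so the ratios are defined):

```python
import numpy as np

def seg_metrics(y, p):
    """Dice, IoU and GA (Eqs. [8-10]) for a pair of binary masks."""
    y, p = y.astype(bool), p.astype(bool)
    tp = np.sum(y & p)    # foreground predicted as foreground
    tn = np.sum(~y & ~p)  # background predicted as background
    fp = np.sum(~y & p)   # background predicted as foreground
    fn = np.sum(y & ~p)   # foreground predicted as background
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    ga = (tp + tn) / (tp + tn + fp + fn)
    return dice, iou, ga
```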

In this study, linear regression analysis and Bland-Altman consistency analysis (35) were employed to quantitatively evaluate the measurement results. In the linear regression analysis, the relationship between the measured and true values and the goodness of fit were quantified by fitting the regression line and computing the coefficient of determination r². The closer r² is to 1, the higher the consistency between the system’s predictions and the true values, indicating better measurement accuracy. The Bland-Altman analysis was used to further assess the agreement between the system’s measurements and the physicians’ measurements, analyzing the error distribution and the limits of agreement (LoA) and thereby intuitively revealing the bias and reliability between the two. The two methods complement each other: linear regression addresses the correlation and accuracy of the results, while the consistency analysis addresses the stability and consistency of the measurements in practical applications, supporting the system’s reliability and effectiveness in clinical use.
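Both analyses reduce to a few lines (a NumPy sketch; `r_squared` and `bland_altman` are illustrative helper names, not from the study’s code):

```python
import numpy as np

def r_squared(measured, truth):
    """Coefficient of determination of the least-squares line truth ~ measured."""
    slope, intercept = np.polyfit(measured, truth, 1)
    pred = slope * measured + intercept
    ss_res = np.sum((truth - pred) ** 2)
    ss_tot = np.sum((truth - np.mean(truth)) ** 2)
    return 1.0 - ss_res / ss_tot

def bland_altman(a, b):
    """Mean difference and 95% limits of agreement (mean +/- 1.96 SD)."""
    diff = np.asarray(a) - np.asarray(b)
    m, s = diff.mean(), diff.std(ddof=1)
    return m, (m - 1.96 * s, m + 1.96 * s)
```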

Reconstruction results of ocular surface image

In this study, the proposed TB-Net is employed for the reconstruction task. The optimal reconstruction model, obtained through training, achieves an MSE of 0.00311 on the validation set. As illustrated in Figure 8, the reconstruction task demonstrates commendable image restoration performance overall, indicating that the model effectively recovers high-quality images from the given data.

Figure 8 The reconstruction results.

Segmentation results of ocular surface image

In this study, for the segmentation task, in addition to a comparative analysis of TransUNet, TB-Net, and STB-Net, we selected two classical segmentation models, U-Net and U-Net++, for comparison. All models were tested on the same dataset, and each was trained multiple times to obtain its best performance, so that the effectiveness of the proposed method could be fairly evaluated.

As shown in Table 1, for the segmentation of the palpebral fissure region, the proposed STB-Net model outperforms other models in all metrics, with Dice, GA, and IoU scores of 0.9875, 0.9955, and 0.9767, respectively. Similarly, as presented in Table 2, for the segmentation of the corneal region, the proposed STB-Net model also achieves the highest scores across all three metrics. Specifically, the GA score is tied with U-Net++ and TB-Net at 0.9978, while the Dice and IoU scores are 0.9891 and 0.9790, respectively. This demonstrates the superior performance of the STB-Net model in both palpebral fissure and corneal segmentation tasks.

Table 1

The evaluation indicators for palpebral fissure segmentation

Model Dice GA IoU
U-Net 0.9827 0.9940 0.9688
U-Net++ 0.9841 0.9945 0.9711
TransUNet 0.9861 0.9950 0.9740
TB-Net 0.9869 0.9953 0.9755
STB-Net 0.9875 0.9955 0.9767

The optimal results are highlighted in the original table. GA, Global Accuracy; IoU, Intersection-over-Union.

Table 2

The evaluation indicators for corneal segmentation

Model Dice GA IoU
U-Net 0.9869 0.9975 0.9760
U-Net++ 0.9885 0.9978 0.9781
TransUNet 0.9885 0.9977 0.9780
TB-Net 0.9887 0.9978 0.9783
STB-Net 0.9891 0.9978 0.9790

The optimal results are highlighted in the original table. GA, Global Accuracy; IoU, Intersection-over-Union.

In addition to evaluating the segmentation performance of the models on ocular surface images through quantitative metrics, this study also visually demonstrates the advantages of the proposed model by presenting the segmentation results of each model on ocular surface images. Specifically, the corneal segmentation results are illustrated in Figure 9, while the palpebral fissure segmentation results are shown in Figure 10.

Figure 9 The comparative analysis of corneal segmentation results.
Figure 10 The comparative analysis of palpebral fissure segmentation results.

Measurement results of ocular surface image

Further, the segmentation results of STB-Net were measured and compared with the ground truth values to calculate the relative errors. The relative errors were found to be 2% for the left palpebral fissure height, 0.78% for the central palpebral fissure height, 1.93% for the right palpebral fissure height, 1.31% for the palpebral fissure width, and 0.98% for the palpebral fissure area. The measurement values for some samples are shown in Tables 3,4, where LH, MH, RH, PW, and PA represent the left palpebral fissure height, central palpebral fissure height, right palpebral fissure height, palpebral fissure width, and palpebral fissure area, respectively. This further validates that STB-Net can accurately segment key regions in most samples and achieve precise measurements.

Table 3

The ground truth measurements of partial samples

Images LH (mm) MH (mm) RH (mm) PW (mm) PA (mm2)
11.png 8.481 10.845 8.764 25.448 186.879
60.png 8.321 10.968 9.466 22.876 186.408
73.png 7.435 10.180 7.176 21.757 159.183
79.png 6.647 9.417 7.373 24.808 154.043
97.png 8.592 9.835 8.321 26.482 191.516
151.png 5.290 7.324 5.379 23.467 119.110
248.png 4.788 6.918 5.699 24.452 116.494

LH, left palpebral fissure height; MH, middle palpebral fissure height; PA, palpebral fissure area; PW, palpebral fissure width; RH, right palpebral fissure height.

Table 4

The segmentation measurements of STB-Net of partial samples

Images LH (mm) MH (mm) RH (mm) PW (mm) PA (mm2)
11.png 8.491 10.876 8.968 25.368 187.772
60.png 8.300 10.876 9.350 22.793 185.561
73.png 7.441 10.113 7.155 21.744 160.438
79.png 6.678 9.35 7.346 24.605 154.641
97.png 8.586 9.827 8.300 26.322 192.394
151.png 5.342 7.346 5.342 23.079 118.071
248.png 4.770 6.774 5.629 24.319 116.370

LH, left palpebral fissure height; MH, middle palpebral fissure height; PA, palpebral fissure area; PW, palpebral fissure width; RH, right palpebral fissure height.
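As a spot check of the reported relative errors, the values for sample 11.png can be compared directly between Tables 3,4 (a plain Python sketch):

```python
# Ground truth (Table 3) and STB-Net measurements (Table 4) for 11.png
truth = {"LH": 8.481, "MH": 10.845, "RH": 8.764, "PW": 25.448, "PA": 186.879}
pred = {"LH": 8.491, "MH": 10.876, "RH": 8.968, "PW": 25.368, "PA": 187.772}

# per-parameter relative error |pred - truth| / truth
rel_err = {k: abs(pred[k] - truth[k]) / truth[k] for k in truth}
```

For this sample every parameter stays within a few percent of the ground truth, consistent with the dataset-level relative errors quoted in the text.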

To further evaluate the reliability of the model in measurement tasks, this study conducted a linear regression analysis between the measured values and the true values for the 51 images in the test set. In Figure 11, the x-axis of each regression plot represents the measured value of the corresponding morphological parameter, and the y-axis represents the true value. The blue dots correspond to the samples, and the red line is the fitted regression line. The results indicate that, except for a slightly lower r² for the palpebral fissure height parameters, the r² values for the remaining parameters are all above 0.99. This demonstrates a high degree of agreement between the eyelid morphological parameters measured from the segmentation maps predicted by STB-Net and the true values, thereby validating the reliability of the model.

Figure 11 The regression lines for each eyelid morphological parameter.

This study also employed the Bland-Altman analysis method to evaluate the consistency between the measured values and the true values of various eyelid morphological parameters. In the analysis, the average of the two sets of data was used as the horizontal axis, and the difference was used as the vertical axis, generating a Bland-Altman plot, as shown in Figure 12. The red dashed line represents the mean difference between the two sets of data, while the green dashed line indicates the 95% limits of agreement. Statistically, if the errors fall within the 95% limits of agreement, the errors are considered acceptable. The results showed that three sample points for the left palpebral fissure height exceeded the limits of agreement, as did three for the central palpebral fissure height, one for the right palpebral fissure height, four for the palpebral fissure width, and three for the palpebral fissure area. This indicates that the proposed method for measuring eyelid morphological parameters in this study has good agreement with the true values, demonstrating its potential value in practical applications.

Figure 12 The Bland-Altman consistency test plots for each eyelid morphological parameter. LoA, limits of agreement.

Discussion

This study proposes an enhanced TransUNet model, termed TB-Net, and constructs a reconstruction-segmentation network, STB-Net, based on a Siamese architecture with dynamic parameter convolution. The network is designed for the precise segmentation of the cornea and palpebral fissure regions in ocular surface images, and for measuring corresponding parameters including the left palpebral fissure height, central palpebral fissure height, right palpebral fissure height, palpebral fissure width, and palpebral fissure area. The TB-Net model improves the decoder of TransUNet by introducing a BLAM module, effectively addressing the issue of missing fine and edge local information in ocular surface image segmentation, thereby enhancing segmentation accuracy. The STB-Net model adopts the SRSNetwork architecture, integrating reconstruction tasks with segmentation tasks, and utilizes dynamic parameter convolution to generate adaptive convolution kernels, optimizing segmentation performance on limited annotated data.

As illustrated in Figure 8, although the reconstruction task exhibits some localized speckles and voids in certain reflective regions, it demonstrates a high degree of accuracy in reconstructing the overall ocular surface area. Notably, the texture of the corneal region and the vascular and color details in the palpebral fissure are restored with remarkable quality. This indicates that TB-Net has achieved a robust learning of ocular surface image features in the reconstruction task, thereby laying a solid foundation for subsequent analysis and applications.

Further, as shown in Tables 1,2, the three metrics of TB-Net for palpebral fissure segmentation have improved compared to TransUNet, with Dice increasing by 0.08%, GA by 0.03%, and IoU by 0.15%, indicating that the introduction of BLAM into the decoder of TransUNet is effective. STB-Net, based on TB-Net, shows further improvements with Dice increasing by 0.06%, GA by 0.02%, and IoU by 0.12%, which also demonstrates the effectiveness of adopting a Siamese architecture with TB-Net as the baseline. In terms of corneal segmentation metrics, TB-Net exhibits enhancements in Dice, GA, and IoU compared to TransUNet, and STB-Net, compared to TB-Net, shows further improvements in Dice and IoU, with GA remaining unchanged. It is evident that the proposed model, STB-Net, can effectively segment both the palpebral fissure and corneal regions in ocular surface images.

In addition, as observed in Figure 10, STB-Net demonstrates superior accuracy in segmenting the palpebral fissure region. Firstly, in terms of the segmentation details of the palpebral fissure angle, other models exhibit varying degrees of omission or discontinuity. For instance, in the first set of segmentation results, STB-Net maximally restores the details at the palpebral fissure angle, whereas other models either fail to recognize it as part of the palpebral fissure or present incomplete segmentation. Secondly, in the segmentation of polyps on the upper and lower eyelids, other models tend to misclassify polyps. For example, in the third set of segmentation results, STB-Net accurately distinguishes polyps, treating them as background, while other models incorrectly include polyps as part of the palpebral fissure. Similarly, as shown in Figure 9, STB-Net achieves more precise corneal segmentation. In the first set of segmentation results, STB-Net effectively segments the cornea, whereas other models, influenced by polyps, exhibit hollow or missing areas. In the second set of segmentation results, STB-Net remains unaffected by eyelashes, while other models mistakenly include non-corneal regions as part of the target due to eyelash interference.

Compared to existing methods for measuring eyelid morphological parameters, the STB-Net proposed in this study demonstrates significant advantages in segmentation accuracy and detail capture capability. Previous studies, such as those by Van Brummen et al. (15), Deng et al. (16), and Cao et al. (17), primarily focused on enhancing the encoder of U-Net but did not consider designing network modules based on the structural features of the task-specific images. Meanwhile, Shao et al. (18), Hua et al. (19), and Nam et al. (20) utilized existing models for segmentation and measurement. However, these studies overlooked the construction of long-range dependency information and were constrained by the limitations of small sample datasets, resulting in insufficient generalization capability.

There are certain limitations to this study. Firstly, the experiments used a single pixel-to-millimeter scale, computed from the average of professional physicians’ measurements across all images, as the measurement benchmark. Because the exact shooting width and height were not recorded for each image, subtle differences remain between images even though all were captured under the same scene conditions. In future research, professional ocular surface measurement tools could be used to obtain actual values of the eyelid morphological parameters, against which the intelligent measurements proposed in this study could be compared to further verify their accuracy and effectiveness. Additionally, the reconstruction task in this study is an unsupervised learning task; by collecting more datasets of the same type, the semantic feature information extracted by the dynamic convolution model could be enriched, further improving segmentation performance. Moreover, as the dataset used in this study is primarily derived from a Chinese population, we plan to incorporate publicly available datasets or multi-center data involving individuals from different ethnicities and regions in future work, to systematically assess the model’s cross-population generalizability and robustness and thereby enhance its clinical applicability across broader populations.


Conclusions

This study proposes a Siamese architecture-based reconstruction-segmentation network, STB-Net, based on dynamic parameter convolution, for the precise segmentation of the cornea and palpebral fissure regions in ocular surface images. STB-Net can be utilized for the automatic measurement of eyelid morphological parameters, thereby reducing the subjectivity and errors associated with manual measurements by physicians and enhancing diagnostic efficiency and accuracy. This approach holds significant clinical application value for the diagnosis and treatment of eyelid disorders.


Acknowledgments

The authors gratefully acknowledge the support from Shenzhen Eye Hospital and the Affiliated Eye Hospital of Nanjing Medical University for providing the ocular surface image datasets used in this study. We also extend our sincere appreciation to all individuals who contributed to this research through their technical assistance and valuable suggestions.


Footnote

Reporting Checklist: The authors have completed the CLEAR reporting checklist. Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1140/rc

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1140/dss

Funding: This work was financially supported by Sanming Project of Medicine in Shenzhen (No. SZSM202311012). The authors declare financial support was received for the research, authorship, and/or publication of this article.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1140/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. The study was approved by Shenzhen Eye Hospital (approval No. 2025KYYJ023-01) and the Affiliated Eye Hospital of Nanjing Medical University (approval No. 2017012). Informed consent was waived due to the retrospective and anonymized nature of the data.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Liu J, Rokohl AC, Liu H, Fan W, Li S, Hou X, Ju S, Guo Y, Heindl LM. Age-related changes of the periocular morphology: a two- and three-dimensional anthropometry study in Caucasians. Graefes Arch Clin Exp Ophthalmol 2023;261:213-22. [Crossref] [PubMed]
  2. Most SP, Mobley SR, Larrabee WF Jr. Anatomy of the eyelids. Facial Plast Surg Clin North Am 2005;13:487-92. v. [Crossref] [PubMed]
  3. Madrid J, Hout MC. Eye spy: Why we need to move our eyes to gather information about the world. Front. Young Minds 2018;6:71.
  4. Morgado CR, Santhiago MR. Normative distribution of corneal epithelial thickness on 9-mm OCT maps. Front Med (Lausanne) 2025;12:1572326. [Crossref] [PubMed]
  5. Yang T, Zhu G, Cai L, Yeo JH, Mao Y, Yang J. A benchmark study of convolutional neural networks in fully automatic segmentation of aortic root. Front Bioeng Biotechnol 2023;11:1171868. [Crossref] [PubMed]
  6. Gong D, Li WT, Li XM, Wan C, Zhou YJ, Wang SJ, Wang JT, Xu YW, Zhang SC, Yang WH. Development and research status of intelligent ophthalmology in China. Int J Ophthalmol 2024;17:2308-15. [Crossref] [PubMed]
  7. Suk HI, Liu M, Cao X, Kim J. Editorial: Advances in deep learning methods for medical image analysis. Front Radiol 2023;2:1097533. [Crossref] [PubMed]
  8. Fukuda Y, Fujimura T, Moriwaki S, Kitahara T. A new method to evaluate lower eyelid sag using three-dimensional image analysis. Int J Cosmet Sci 2005;27:283-90. [Crossref] [PubMed]
  9. Read SA, Collins MJ, Carney LG. The influence of eyelid morphology on normal corneal shape. Invest Ophthalmol Vis Sci 2007;48:112-9. [Crossref] [PubMed]
  10. Maseedupally V, Gifford P, Swarbrick H. Variation in normal corneal shape and the influence of eyelid morphometry. Optom Vis Sci 2015;92:286-300. [Crossref] [PubMed]
  11. Liu N, Liang G, Li L, Zhou H, Zhang L, Song X. An eyelid parameters auto-measuring method based on 3D scanning. Displays 2021;69:102063.
  12. Guo Y, Rokohl AC, Fan W, Theodosiou R, Li X, Lou L, Gao T, Lin M, Yao K, Heindl LM. A novel standardized approach for the 3D evaluation of upper eyelid area and volume. Quant Imaging Med Surg 2023;13:1686-98. [Crossref] [PubMed]
  13. Zhang J, Zhang L, Hu H, Sun L, He W, Zhang Z, Wang J, Nie D, Liu X. The influence of pterygium on corneal densitometry evaluated using the Oculus Pentacam system. Front Med (Lausanne) 2023;10:1184318. [Crossref] [PubMed]
  14. Wu Z, Li X, Zuo J. RAD-UNet: Research on an improved lung nodule semantic segmentation algorithm based on deep learning. Front Oncol 2023;13:1084096. [Crossref] [PubMed]
  15. Van Brummen A, Owen JP, Spaide T, Froines C, Lu R, Lacy M, Blazes M, Li E, Lee CS, Lee AY, Zhang M. PeriorbitAI: Artificial Intelligence Automation of Eyelid and Periorbital Measurements. Am J Ophthalmol 2021;230:285-96. [Crossref] [PubMed]
  16. Deng X, Tian L, Liu Z, Zhou Y, Jie Y. A deep learning approach for the quantification of lower tear meniscus height. Biomedical Signal Processing and Control 2021;68:102655.
  17. Cao J, Lou L, You K, Gao Z, Jin K, Shao J, Ye J. A Novel Automatic Morphologic Analysis of Eyelids Based on Deep Learning Methods. Curr Eye Res 2021;46:1495-502. [Crossref] [PubMed]
  18. Shao J, Huang X, Gao T, Cao J, Wang Y, Zhang Q, Lou L, Ye J. Deep learning-based image analysis of eyelid morphology in thyroid-associated ophthalmopathy. Quant Imaging Med Surg 2023;13:1592-604. [Crossref] [PubMed]
  19. Wan C, Hua R, Guo P, Lin P, Wang J, Yang W, Hong X. Measurement method of tear meniscus height based on deep learning. Front Med (Lausanne) 2023;10:1126754. [Crossref] [PubMed]
  20. Nam Y, Song T, Lee J, Lee JK. Development of a neural network-based automated eyelid measurement system. Sci Rep 2024;14:1202. [Crossref] [PubMed]
  21. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-assisted Intervention 2015:234-41.
  22. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. Unet++: A nested u-net architecture for medical image segmentation. Deep Learning in Medical Image Analysis 2018:3-11.
  23. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition 2016:770-8.
  24. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, Uszkoreit J, Houlsby N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations. 2021.
  25. Dai Y, Wu Y, Zhou F, Barnard K. Attentional local contrast networks for infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing 2021;59:9813-24.
  26. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. 1943. Bull Math Biol 1990;52:99-115; discussion 73-97.
  27. Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. International Conference on Machine Learning, 2010:807-14.
  28. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning 2015:448-56.
  29. Yang B, Bender G, Le QV, Ngiam J. Condconv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems 2019;32:6960-70.
  30. Chen Y, Dai X, Liu M, Chen D, Yuan L, Liu Z. Dynamic convolution: Attention over convolution kernels. IEEE Conference on Computer Vision and Pattern Recognition 2020:11030-9.
  31. Wan C, Shao Y, Wang C, Jing J, Yang W. A Novel System for Measuring Pterygium's Progress Using Deep Learning. Front Med (Lausanne) 2022;9:819971. [Crossref] [PubMed]
  32. Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics 1951;22:400-7.
  33. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [Preprint]. Available online: https://arxiv.org/abs/1412.6980
  34. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945;26:297-302.
  35. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;1:307-10.
Cite this article as: Wan C, Wu J, Mao Y, Yang W, Yang Y. STB-Net: a Siamese architecture-based reconstruction-segmentation network for ocular surface image segmentation. Quant Imaging Med Surg 2025;15(12):11851-11869. doi: 10.21037/qims-2025-1140
