Original Article

Instance segmentation of cells and nuclei from multi-organ cross-protocol microscopic images

Sushish Baral1, May Phu Paing2

1Department of Robotics and AI, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand; 2Department of Biomedical Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand

Contributions: (I) Conception and design: Both authors; (II) Administrative support: MP Paing; (III) Provision of study materials or patients: Both authors; (IV) Collection and assembly of data: Both authors; (V) Data analysis and interpretation: Both authors; (VI) Manuscript writing: Both authors; (VII) Final approval of manuscript: Both authors.

Correspondence to: May Phu Paing, D.Eng (Electrical Engineering). Department of Biomedical Engineering, School of Engineering, King Mongkut’s Institute of Technology Ladkrabang, 1 Chalong Krung 1 Alley, Lat Krabang, Bangkok 10520, Thailand. Email: may.pa@kmitl.ac.th.

Background: Light microscopy is a widely used technique in cell biology due to its satisfactory resolution for cellular structure analysis, the prevalent availability of fluorescent probes for staining, and its compatibility with the dynamic analysis of live cells. However, the segmentation of cells and nuclei from microscopic images is not a straightforward process because it faces several challenges, such as high variation in morphology and shape, the presence of noise and diverse background contrast, and the clustered or overlapping nature of cells. Dealing with these challenges and facilitating more reliable analysis necessitates computer-aided methods that leverage image processing techniques and deep learning algorithms. The major goal of this study is to propose a model for the instance segmentation of cells and nuclei based on the most cutting-edge deep learning techniques.

Methods: A fine-tuned You Only Look Once version 9 extended (YOLOv9-E) model is initially applied as a prompt generator to generate bounding box prompts. Using the generated prompts, a pre-trained segment anything model (SAM) is subsequently applied through zero-shot inferencing to produce raw segmentation masks. These segmentation masks are then refined using non-max suppression and simple image processing methods such as image addition and morphological processing. The proposed method is developed and evaluated on an open-source dataset called Expert Visual Cell Annotation (EVICAN), which is relatively large and contains 4,738 microscopy images acquired from multiple organs using different protocols.

Results: Based on the evaluation results on three different levels of EVICAN test sets, the proposed method demonstrates noticeable performances showing average mAP50 [mean average precision at intersection over union (IoU) =0.50] scores of 96.25, 95.05, and 94.18 for cell segmentation, and 68.04, 54.66, and 38.29 for nucleus segmentation on easy, medium, and difficult test sets, respectively.

Conclusions: Our proposed method for instance segmentation of cells and nuclei provided favorable performance compared to the existing methods in the literature, indicating its potential utility as an assistive tool for cell culture experts, facilitating prompt and reliable analysis.

Keywords: Expert Visual Cell Annotation (EVICAN); You Only Look Once version 9 extended (YOLOv9-E); segment anything model (SAM)


Submitted Apr 20, 2024. Accepted for publication Jul 30, 2024. Published online Aug 28, 2024.

doi: 10.21037/qims-24-801


Introduction

Analysis of cell structures and physiological functions is indispensable for diagnosing diseases and monitoring their progression. Since its invention more than 400 years ago, microscopy has been widely used as a main technique to visualize and study cell structures. Illumination and geometry are the fundamental aspects of microscopy, and based on these principles, various techniques (electron, optical, scanning probe, super-resolution microscopy, etc.) have been devised (1). Among them, optical or light microscopy is broadly used in cell biology due to (I) its resolution, which is well-suited for subcellular structures even at nanometer scales, (II) the abundance of fluorescent probes to stain cells, and (III) the non-perturbing nature of light, which allows the detection of dynamics in living cells (2).

In general, analysis of cell structures from light microscopy images requires fluorescent staining, which enhances the contrast between the cells and the background (3). Since staining helps to improve the visualization of cells and cellular structures, more precise image analysis and higher-performance diagnosis can be achieved. However, in some cases, the incompatibility between the staining agents and cells may lead to several issues, including cell death and undesired changes in cellular structure and morphology (4,5). On the other hand, analysis of cells directly from unstained microscopic images, especially brightfield (BF) or phase-contrast (PhC) images, is also challenging due to several reasons, such as (I) non-specific contrast between cell structures and background, (II) high variety in cell morphology, (III) an abundance of noise, and (IV) unclear boundaries or overlapping cells.

Computer-aided diagnosis, leveraging advanced deep learning technology, has paved the way for solving such challenges. According to a survey by Liu et al. (2021) (6), deep learning approaches have significantly contributed to cell analysis, including classification, segmentation, tracking, denoising, and super-resolution of cell structures from unstained microscopic images. Moreover, because these approaches can perform cell analysis directly from unstained images, they reduce not only the staining issues but also the laboratory workload and cost. The primary objective of this research is to segment cells and nuclei from the background and other undesired structures, facilitating further analysis. Our focus is on unstained microscopic images, especially BF and PhC images, as they are more challenging yet more practical.

Related works

Segmentation of cells and nuclei from microscopic images is a prerequisite in cell analysis as it simplifies downstream processes such as counting and tracking. Accurate segmentation is helpful for extracting precise features from cellular structures and leads to more accurate analysis results. Different studies in the literature have proposed various methods for the segmentation of cells from unstained microscopic images. Based on the algorithms used, these methods can be divided into two categories: (I) conventional segmentation and (II) deep learning-based segmentation methods. Before the era of deep learning, most research on microscopic cell segmentation used conventional image segmentation methods: intensity-based segmentation such as thresholding (7), region-based segmentation such as contour detection using snakes (8,9) or level-set methods (10), and clustering-based segmentation (9,11). Although these conventional methods are simple and easy to use, many researchers acknowledge that they require human involvement for feature extraction. Besides, their performance is highly dependent on user- or developer-defined parameters, so they are limited in adaptability and low in generalization capability.

With the accelerated progress of deep learning technology, the trends in computer vision research have changed accordingly. Recently, a vast number of powerful deep learning models have emerged and been deployed in microscopic cell segmentation research. Based on the type of segmentation, the deep learning models for microscopic cell segmentation can be further divided into two groups. The first group is semantic segmentation, which separates image pixels into foreground objects (i.e., regions of interest, for example, cells and nuclei in this study) and unwanted background. The most well-known deep learning architecture for semantic segmentation is the Fully Convolutional Network (FCN). Unlike the convolutional neural networks (CNNs) commonly used for image classification or detection tasks, an FCN does not contain fully connected layers. Instead, it is comprised entirely of convolutional layers (6) in an encoder-decoder style. The use of FCNs for cell segmentation from unstained microscopic images can be seen in References (12-16). For instance, Zhu et al. (2017) (12) proposed a simple FCN architecture that stacked convolution layers, max-pooling layers, and deconvolution layers. On the other hand, Zhao and Yin (2018) (13) enhanced the FCN architecture using a pyramid style. Their method segmented cells in a cascaded refinement manner, in which high-level FCNs focused on coarse segmentation masks while lower levels focused on detailed structures of the cells. The architecture of FCNs can be modified by arranging the layers in different styles. The most popular FCN model for semantic segmentation is U-Net (14), which arranges layers in a U-shaped structure. Unlike vanilla FCN models, U-Net can provide more contextual information due to the use of feature fusion through skip connections (6,14). Aiming for better segmentation results, many researchers in the field of microscopic cell analysis have redesigned the classical U-Net architecture by combining it with advanced methods. For example, Long (2020) (15) proposed U-Net+, an enhanced and lightweight version of U-Net for cell nuclei segmentation from microscopic images. U-Net+ was especially enhanced in the encoder branch so that it can work with low-resource computing. The performance of U-Net+ was compared with those of the original U-Net (14) and U-Net++ (16), another variation of U-Net. Based on the comparative results, the author showed that U-Net+ requires fewer weights and a shorter inference time while providing more accurate segmentations. Moreover, improvements of U-Net using attention models have also been reported. To enhance cell segmentation results, Kakumani et al. (2022) (17) used a pre-trained autoencoder with attention modules for the encoding part and a U-Net decoder with attention for the decoding part. On the other hand, Ghaznavi et al. (2022) (18) combined attention modules with residual connections in order to improve the segmentation results.

The second group of cell segmentation methods is instance segmentation. In contrast to semantic segmentation, which separates images into only foreground objects and background, instance segmentation can distinguish individual objects of the same class. Furthermore, instance segmentation can also generate bounding boxes and corresponding masks for the objects, meaning both detection and segmentation tasks are performed simultaneously (19). Mask-RCNN (20), developed by Facebook Artificial Intelligence Research (FAIR), has been one of the most prominent instance segmentation models across various application areas. Schwendy et al. (2020) (4) deployed the original version of Mask-RCNN to segment cells and nuclei from microscopic images, while Fujita et al. (2021) (21) and Khalid et al. (2021) (22) applied enhanced versions of Mask-RCNN. A focal loss that mainly focuses on the training of hard samples was integrated to improve the performance of Mask-RCNN (21). On the other hand, (22) applied a cascaded pipeline of Mask-RCNN using ResNeSt as a backbone. Additionally, an advanced deep learning method, namely the segment anything model (SAM) (2023) (23), has recently demonstrated impressive results in segmentation tasks. Inspired by SAM, a method called “CellSAM” (2023) was proposed by Israel et al. (24). Initially, CellSAM obtains prompts from CellFinder, a transformer-based object detector (DETR). Using those detected bounding boxes as prompts, a SAM model is developed and fine-tuned on cell images to perform segmentation. CellSAM was trained and evaluated on ten different datasets containing a variety of cellular images across different modalities. Alternatively, Na et al. (2024) (25) proposed a SAM-based segmentation method, called the “Segment Any Cell” model, using fine-tuning and auto-prompting. They applied Low-Rank Adaptation (LoRA) to make the fine-tuning process more efficient and lightweight. Unlike CellSAM (24), Na et al. applied an auxiliary neural network, specifically a U-Net, for auto-prompting. Their proposed “Segment Any Cell” model was tested for segmenting nuclei from digital microscopic images, achieving an intersection over union (IoU) of up to 86.17%.

Similar to the methods in the literature, the major goal of this paper is to develop a robust model for the instance segmentation of cells and nuclei from unstained microscopic images. As mentioned in the introduction, unstained microscopic images, especially BF or PhC images, are simple, low-cost, and compatible with live-cell imaging. Therefore, direct and automatic segmentation of cells and nuclei from such images offers practical advantages, including (I) timely diagnosis due to no extra effort for fluorescent staining, (II) better results due to high resolution even for tiny subcellular structures, and (III) suitability for live-cell detection and tracking. Comparable to (24) and (25), our proposed segmentation model is also based on the SAM. However, it distinguishes itself from existing methods in the literature through two main contributions: (I) first, we apply the most recent state-of-the-art object detection model, You Only Look Once version 9 (YOLOv9), for the generation of bounding box prompts for SAM; (II) with the help of YOLOv9’s superior localization performance, no extra fine-tuning of SAM is needed for the segmentation task. A pre-trained SAM model from the original paper (23) is applied directly, making our proposed cell and nucleus segmentation versatile. Following that, non-max suppression and simple image processing techniques are applied to refine the raw segmentation masks generated by SAM.


Methods

Dataset

In this study, we use an open-source dataset called Expert Visual Cell Annotation (EVICAN) (4) for the experiments. It comprises 4,738 microscopy images extracted from 30 different cell lines originating from various organs, including the lung, kidney, colon, mammary gland, pharynx, prostate, bone marrow, cervix, and more. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013). Moreover, the dataset provides annotation files not only as binary masks but also in the Common Objects in Context (COCO) JavaScript Object Notation (JSON) instance segmentation format. These annotations were collected by cell culture experts using fluorescent staining. Even though staining was applied to find the cellular outlines, the stained images were not included in the dataset; they were used only as references to obtain the locations of cells and nuclei on the assembled BF or PhC images. There are a total of 54,016 instances (including cells and nuclei) in the dataset, which have been randomly divided into three sets: training, validation, and testing, as described in Table 1. Moreover, the dataset also includes an additional 1,000 background images (750 for training and 250 for validation). These background images do not contain any instances, such as cells or nuclei, and are intentionally added to the training and validation sets to prevent false positives (FPs).
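To make the COCO-style annotation format concrete, the minimal sketch below loads an instance-segmentation JSON file with the pycocotools library and tallies its categories and instances; the file name "evican_train.json" is a placeholder and may not match the dataset's actual file naming.

```python
# Minimal sketch: inspecting COCO-style instance annotations such as EVICAN's.
# Assumes pycocotools is installed; "evican_train.json" is a placeholder name.
from pycocotools.coco import COCO

coco = COCO("evican_train.json")

# List the annotated categories (expected to include cells and nuclei).
cats = coco.loadCats(coco.getCatIds())
print("categories:", [c["name"] for c in cats])

# Count images and instances, mirroring the totals reported in Table 1.
print(f"{len(coco.getImgIds())} images, {len(coco.getAnnIds())} annotated instances")

# Per-category instance counts.
for c in cats:
    n = len(coco.getAnnIds(catIds=[c["id"]]))
    print(f'{c["name"]}: {n} instances')
```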

Table 1

Details of the EVICAN dataset

Splits | Total no. of images (background + images with cells/nuclei) | Background images | Images with cells/nuclei
Training | 4,464 | 750 | 3,714 images (partially annotated: 42,317 instances; cells: 21,106; nuclei: 21,211)
Validation | 1,176 | 250 | 926 images (partially annotated: 10,642 instances; cells: 5,485; nuclei: 5,157)
Testing | 98 | 0 | 98 images (fully annotated: 1,057 instances; cells: 525; nuclei: 532)

EVICAN, Expert Visual Cell Annotation.

In the training and validation sets, most images contain 5 to 7 cells and 2 to 5 nuclei. The maximum number of cells in a single image of the training and validation sets is 25, while the maximum number of nuclei is 33. However, the minimum count can be zero, as background images do not contain any cells or nuclei. It is important to note that the images in the training and validation sets are partially annotated, meaning some cells and nuclei are not included in the annotations.

The test set is further divided into three levels of difficulty based on cellular appearance, image quality, and contrast (4,22).

  • Easy (Difficulty level 1) contains 33 images with a total of 374 instances (187 cells and 187 nuclei). Most images in this level exhibit well-defined cell outlines and nuclei with less touching or overlapping, making them easy to detect.
  • Medium (Difficulty level 2) contains 33 images with a total of 356 instances (176 cells and 180 nuclei). Most images in this level contain cells that are touching each other. Moreover, the visibility of some nuclei is also not clear.
  • Difficult (Difficulty level 3) contains 32 images with a total of 327 instances (162 cells and 165 nuclei). Most images at this level feature cell clusters or colonies, with some nuclei being very difficult to see without staining.

Figure 1 illustrates some example microscopic images representing each level of difficulty in the test set. In Figure 1A, easy (Difficulty level 1), the cells and nuclei in all three representative images are distinct and rarely touch each other. However, in Figure 1B, medium (Difficulty level 2), some cells are touching and certain nuclei appear unclear. Finally, in Figure 1C, difficult (Difficulty level 3), the images contain cells clustered as colonies as well as some nuclei with very low visibility that are challenging to detect.

Figure 1 Example microscopic images in three levels of test sets. (A) Easy (Difficulty level 1); (B) medium (Difficulty level 2); and (C) difficult (Difficulty level 3).

Proposed instance segmentation

The proposed cell and nucleus instance segmentation model in this study comprises three main steps, namely: (I) prompt generator; (II) segmentation model; and (III) mask refinement, as can be seen in Figure 2.

Figure 2 Overview of the proposed instance segmentation model consisting of three parts: (I) prompt generator; (II) segmentation model; and (III) mask refinement. EVICAN, Expert Visual Cell Annotation; YOLOv9-E, You Only Look Once version 9 extended; SAM, segment anything model.

The first step, the prompt generator, aims to detect all possible locations of cells and nuclei and return the detected bounding boxes as prompts. Subsequently, the generated prompts are fed into the second step, the segmentation model, to produce raw binary segmentation masks of cells and nuclei. Finally, in the third step, mask refinement, the raw masks are enhanced by removing unnecessary objects. The technical details of each step are discussed in the following sections.

(I) Prompt generator: YOLOv9

Generally, prompts can be defined as hints to the segmentation model about which specific object or region to segment within the input images. A prompt can be a point, box, text, or mask. Specifically, points, boxes, and texts are termed sparse prompts, while masks are referred to as dense prompts. In the cellular structure segmentation proposed by (24), box prompts were used, while in (25) mask prompts were used. A prompt generator is simply an auxiliary model to find the potential locations of cells and nuclei. It is necessary because our segmentation model is based on SAM, which is a promptable model with zero-shot learning. Similar to (24), we empirically choose box prompts because they ensure better segmentation results compared to point prompts. We refrain from using mask prompts because they are difficult to generate and require more computational effort. Unlike (24), instead of using the DETR model, we apply YOLOv9 (26) as a prompt generator. The rationale behind this choice is that it is the most recent state-of-the-art model for object detection and localization. To the best of our knowledge, no cell segmentation model has been proposed using YOLOv9. YOLOv9 was recently released, and its performance has been proven through comparative analysis against existing object detection models. YOLOv9 outperformed its counterparts due to the use of programmable gradient information (PGI), which reduces the data loss (information bottleneck) problem found in the majority of deep learning models. Furthermore, it also incorporates an innovative concept called the Generalized Efficient Layer Aggregation Network (GELAN), which enhances the model’s speed and efficiency while preserving accuracy. YOLOv9 was released in four different versions (S, M, C, and E), ordered from smallest to largest parameter size. Among them, we apply the YOLOv9-E version (58.1M parameters) to ensure precise prompt generation.
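As an illustration of this step, the sketch below shows how box prompts could be obtained from a fine-tuned YOLOv9-E model with the Ultralytics package; the checkpoint and image file names are placeholders, and the confidence threshold of 0.3 reflects the setting reported later in the Results section.

```python
# Sketch of the prompt-generation step, assuming the Ultralytics package
# (which ships YOLOv9 models) and a checkpoint fine-tuned on EVICAN.
# "yolov9e_evican.pt" and "example.png" are placeholder names.
from ultralytics import YOLO

detector = YOLO("yolov9e_evican.pt")          # fine-tuned YOLOv9-E weights

# Keep every detection with confidence >= 0.3 so that weak but plausible
# cells/nuclei still become prompts (duplicates are removed later by NMS).
results = detector.predict("example.png", conf=0.3, imgsz=640)[0]

box_prompts = results.boxes.xyxy.cpu().numpy()   # (N, 4) boxes in [x1, y1, x2, y2]
classes     = results.boxes.cls.cpu().numpy()    # 0 = cell, 1 = nucleus
scores      = results.boxes.conf.cpu().numpy()   # confidence per box
```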

(II) Segmentation model: SAM

SAM, developed by (23), stands out as one of the latest and most popular segmentation methods, demonstrating promising results across various segmentation tasks. In this study, we choose SAM as our segmentation model for two main reasons: (i) its impressive promptable segmentation capability and (ii) its zero-shot performance. The architecture of SAM comprises three components: (i) an image encoder, (ii) a prompt encoder, and (iii) a mask decoder. SAM uses a masked autoencoder (MAE) pre-trained vision transformer (ViT) as its image encoder to extract comprehensive features from the input images. The prompt encoder is a unique feature of SAM that cannot be found in other segmentation models. It receives different types of prompts (sparse or dense) from user inputs or automatic prompt generators (YOLOv9 in our study) and converts them into embeddings. These prompt embeddings indicate which areas the model should focus on for segmentation. The last component, the mask decoder, maps the image and prompt embeddings to the output segmentation masks; for this, SAM uses a modified transformer decoder and a dynamic mask prediction head. SAM demonstrates remarkable zero-shot performance, as it was trained on a huge dataset called SA-1B, which contains 11M diverse, high-quality images. Therefore, most downstream tasks can easily and effectively transfer the pre-trained weights of SAM without needing additional fine-tuning.
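The following sketch illustrates zero-shot, box-prompted inference with Meta's segment-anything package and a ViT-H checkpoint; the checkpoint file name, the example image, and the example box array (standing in for the YOLOv9-E prompts) are assumptions made only for illustration.

```python
# Sketch of zero-shot, box-prompted segmentation with a pre-trained SAM.
# Assumes the "segment-anything" package and a downloaded ViT-H checkpoint
# ("sam_vit_h_4b8939.pth" is the file name distributed by the SAM authors).
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("example.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)                      # run the image encoder once

# Example boxes in xyxy format; in the proposed pipeline these would come
# from the YOLOv9-E prompt generator.
box_prompts = np.array([[50, 40, 220, 210]], dtype=np.float32)

raw_masks = []
for box in box_prompts:
    masks, mask_scores, _ = predictor.predict(
        box=box[None, :],                       # (1, 4) box prompt
        multimask_output=False,                 # one mask per box prompt
    )
    raw_masks.append(masks[0].astype(np.uint8)) # binary mask for this instance
```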

(III) Mask refinement

Using the prompts generated by the prompt generator, SAM produces raw binary masks. As we are focusing on instance segmentation, our prompt generator (YOLOv9) detects each instance (individual cell or nucleus) and returns a corresponding bounding box prompt. Similarly, SAM works on each bounding box and generates a separate mask for each prompt. However, in some cases, multiple cells and nuclei may be present within a single input image, so it is necessary to merge all corresponding masks to generate the final segmentation mask. Moreover, in other cases, the input image may contain overlapping or clustered instances. Therefore, in order to ensure the detection of all instances, our prompt generator is set to detect all possible objects, even if they exhibit low prediction scores. As a trade-off, this results in duplicated or over-segmented masks for certain input images. To solve this problem, we apply non-max suppression not during detection but after combining all individual binary masks within an input image. Besides, we also apply simple image processing techniques at this stage, including image addition (for combining individual masks) and morphological processing (to refine minor over- or under-segmentation results).
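A minimal sketch of this refinement idea is given below; the mask-level IoU threshold (0.8) and the 3×3 structuring element are illustrative choices, since the paper does not report its exact refinement parameters.

```python
# Sketch of the mask-refinement step: drop near-duplicate masks with a
# mask-level non-max suppression, merge the survivors by image addition,
# and clean the result with morphological opening/closing.
import numpy as np
from scipy import ndimage

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def refine_masks(raw_masks, scores, iou_thresh=0.8):
    # Non-max suppression on masks: keep the highest-scoring mask among
    # any group of masks that overlap above the IoU threshold.
    order = np.argsort(np.asarray(scores))[::-1]
    kept = []
    for i in order:
        if all(mask_iou(raw_masks[i], raw_masks[j]) < iou_thresh for j in kept):
            kept.append(i)

    # Image addition: combine the surviving instance masks into one map.
    combined = np.zeros_like(raw_masks[0], dtype=np.uint8)
    for i in kept:
        combined = np.maximum(combined, raw_masks[i].astype(np.uint8))

    # Morphological opening then closing to remove specks and fill small gaps.
    combined = ndimage.binary_opening(combined, structure=np.ones((3, 3)))
    combined = ndimage.binary_closing(combined, structure=np.ones((3, 3)))
    return [raw_masks[i] for i in kept], combined.astype(np.uint8)
```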

Performance evaluation criteria

To evaluate the capabilities of the proposed method, we calculate the following criteria:

  • Intersection over union (IoU): IoU is the most fundamental and widely used measurement for image detection and segmentation tasks. It assesses how well the predicted outputs (segmented masks in this study) of the deep learning model align with the ground truth masks generated by human experts. IoU is calculated by the following formula (a small computational sketch of these metrics follows this list):

    \[ \mathrm{IoU} = \frac{\left| \mathrm{mask}_{\mathrm{predicted}} \cap \mathrm{mask}_{\mathrm{ground\,truth}} \right|}{\left| \mathrm{mask}_{\mathrm{predicted}} \cup \mathrm{mask}_{\mathrm{ground\,truth}} \right|} \]

  • Mean average precision (mAP): mAP is another standard metric for object detection and instance segmentation tasks. As the name implies, mAP computes the mean value of the average precision (AP). To calculate AP, IoU has to be computed first. Different IoU thresholds, for example 0.50, 0.75, or 0.95, can be applied during the AP calculation. Based on the selected IoU threshold, true positives (TP), true negatives (TN), FPs, and false negatives (FN) are determined. Using TP, TN, FP, and FN, precision and recall are then calculated:

    \[ \mathrm{precision} = \frac{TP}{TP + FP} \]

    \[ \mathrm{recall} = \frac{TP}{TP + FN} \]

    Precision focuses on the correctness of the model’s predictions on positive instances, while recall focuses on the model’s ability to find all positive instances in the dataset. Once the precision and recall values are obtained, a precision-recall curve can be generated by plotting precision against recall for different confidence threshold values. The AP is the area under the precision-recall curve for a specific class, and mAP extends this concept to multiple classes by averaging the AP across all classes. In this study, we calculate mAP50, which uses an IoU threshold of 0.50, and mAP75, which uses 0.75. Moreover, we also calculate mAP50–95, which averages the AP over IoU thresholds from 0.50 to 0.95 in increments of 0.05.
  • Cellular features: additionally, we also take into account other metrics that are crucial for cell analysis. These include: (i) the count of segmented cells and nuclei per image; (ii) the size, measured as the area in pixels; and (iii) the sphericity of segmented cells and nuclei.
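The sketch below illustrates the definitions above with a per-mask IoU and precision/recall at a single IoU threshold; the greedy one-to-one matching is a simplification of the full COCO-style mAP computation and is shown only for illustration.

```python
# Sketch of the evaluation metrics defined above: per-mask IoU, and
# precision/recall at one IoU threshold. Full mAP additionally averages
# precision over the precision-recall curve and over IoU thresholds.
import numpy as np

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def precision_recall(pred_masks, gt_masks, iou_thresh=0.5):
    matched_gt = set()
    tp = 0
    for p in pred_masks:
        best_j, best_iou = None, 0.0
        for j, g in enumerate(gt_masks):
            if j in matched_gt:
                continue
            v = iou(p, g)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j is not None and best_iou >= iou_thresh:
            matched_gt.add(best_j)     # each ground truth is matched at most once
            tp += 1
    fp = len(pred_masks) - tp          # unmatched predictions
    fn = len(gt_masks) - tp            # missed ground-truth instances
    precision = tp / (tp + fp) if pred_masks else 0.0
    recall = tp / (tp + fn) if gt_masks else 0.0
    return precision, recall
```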

Results

Our proposed cell and nucleus segmentation model was developed using the training and validation sets of the EVICAN dataset. Initially, we built a prompt generator based on the YOLOv9-E version initialized with COCO pre-trained weights and then fine-tuned it on our training and validation data. The prompt generator aims to detect potential cell and nucleus locations, generating the corresponding bounding boxes as output. The detailed settings of the YOLOv9-E model are stated in Table 2.

Table 2

Detailed settings of the prompt generator (YOLOv9-E)

Parameter name Values
Input size 640×640 (pixels)
Batch size 4
Epochs 100
Pretrained_weights YOLOv9e.pt
Overlap_mask True
Mask ratio 4
Optimizer Auto
Learning rate 0.01
Momentum 0.937
Weight_decay 0.0005
Warmup_epochs 3.0
Warmup_momentum 0.8
Warmup bias lr 0.1
Number of classes (nc) 2 (class 0 = cells and class 1 = nucleus)
Layers 1,225
Total number of parameters 58,146,454 params
FLOPS 192.7 G

YOLOv9-E, You Only Look Once version 9 extended; FLOPS, floating point operations per second.
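For readers reproducing the setup, the settings in Table 2 map onto a standard Ultralytics training call as sketched below; the dataset description file "evican.yaml" is a placeholder that must be prepared separately, and the call is an approximation rather than the authors' exact training script.

```python
# Sketch of fine-tuning the prompt generator with the Table 2 settings,
# assuming the Ultralytics package; "evican.yaml" is a placeholder for a
# dataset file pointing at the EVICAN images and labels (2 classes).
from ultralytics import YOLO

model = YOLO("yolov9e.pt")        # COCO pre-trained YOLOv9-E weights

model.train(
    data="evican.yaml",           # class 0 = cell, class 1 = nucleus
    imgsz=640,
    batch=4,
    epochs=100,
    optimizer="auto",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_epochs=3.0,
    warmup_momentum=0.8,
    warmup_bias_lr=0.1,
    overlap_mask=True,            # mask options listed in Table 2
    mask_ratio=4,
)
```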

Training the prompt generator took approximately 6 hours using Google Colab (T4 GPU). As a training result, our prompt generator achieved an mAP50 of up to 0.855, with 0.965 for the cell class and 0.745 for the nucleus class. For mAP50–95, it achieved up to 0.675, with 0.846 for cells and 0.505 for nuclei. Figure 3 shows the detailed performance measurements of our prompt generator.

Figure 3 Training and validation performance of the prompt generator (YOLOv9-E). YOLOv9-E, You Only Look Once version 9 extended.

Once the prompt generator is fine-tuned, it can be applied to produce bounding box prompts, which are then utilized by SAM for further mask generation. Figure 4 depicts an example of the output bounding boxes generated by our prompt generator. Figure 4A shows the ground truth bounding boxes generated by human experts, while Figure 4B shows the predictions of our prompt generator. The values above the boxes describe the predicted class (0 represents the cell class and 1 represents the nucleus class) and the confidence scores. To ensure comprehensive coverage of all cells and nuclei, we select all bounding box prompts with a confidence score of at least 0.3 for the subsequent segmentation tasks.

Figure 4 Outputs of the prompt generator (YOLOv9-E) on an input image from the medium test set (21_CHO image in Figure 1A). (A) Ground truth bounding boxes: purple boxes denote cells (Class 0), while red boxes represent nuclei (Class 1); (B) detected bounding boxes with confidence score >0.3: yellow boxes indicate cells (Class 0), and green boxes denote nuclei (Class 1). YOLOv9-E, You Only Look Once version 9 extended.

Subsequently, these generated box prompts are fed into SAM. Since we have fine-tuned the prompt generator, our SAM does not require extra fine-tuning; the input images, along with the predicted box prompts, can be directly input into SAM for zero-shot inference. Based on each prompt, SAM segments the object and outputs a binary mask, as demonstrated in Figure 5. Figure 5A shows the binary masks of individual cells, while Figure 5B shows those of nuclei. As seen in Figure 4A, there are only four boxes for the nuclei in the ground truth image, but our prompt generator predicts five boxes (Figure 4B). This is because of the use of a low confidence threshold (0.3) and the absence of non-maximum suppression in the prompt generator. We intentionally used this approach with the aim of capturing all potential prompts for cells and nuclei, thereby reducing the probability of missed detections. However, these extra detected nuclei need to be refined to prevent over- or under-segmentation results. Accurate segmentation is essential, as the shape and size of nuclei provide important information for assessing the state of cells, including pathological conditions, in many biomedical applications. Without proper refinement, the analysis may lead to incorrect conclusions, impacting clinical decision-making. Thus, we improve the raw segmentation masks in the mask refinement phase. After combining all individual masks using image addition, we apply non-max suppression and morphological image processing to obtain the correct final segmentation output, as illustrated in Figure 5C.

Figure 5 Step-by-step outputs of SAM and final refined mask. (A) Masks of individual cells, (B) masks of individual nuclei and (C) refined masks of segmented cells and nuclei. SAM, segment anything model.

Moreover, Figure 6 illustrates some example outputs of our proposed segmentation model, showing its performance across three distinct levels of difficulty. In the figure, we compare the ground truth masks of each representative image with the segmentation results produced by our method; the mAP50 for each class is also reported. As we can see in Figure 6A, all three representative images at the easy level achieved good segmentation results for both cells and nuclei. However, in Figure 6B, at the medium level, some touching cells could not be separated and some nuclei were under-segmented, particularly in the 48_HT29 image. Finally, in Figure 6C, the segmentation performance is the lowest compared to the easy and medium levels.

Figure 6 Example outputs of the proposed segmentation model, showing its performance across three distinct levels of difficulty. (A) Easy (Difficulty level 1); (B) medium (Difficulty level 2); (C) difficult (Difficulty level 3).

To assess the performance of our segmentation model, we calculated the mean IoU, mAP50, and mAP50–95 of all images across the three levels of test sets. Figure 7 is a graphical representation showing the distribution of mean IoU, mAP50, and mAP50–95 values. Figure 7A-7C show the boxplots with scatter points illustrating the distribution of IoU, mAP50, and mAP50–95 for cell segmentation, while Figure 7D-7F show the results of nucleus segmentation. The Mean IoU for each image was calculated by determining the average IoU of all instances (cell/nucleus) per image.

Figure 7 MeanIoU, mAP50, and mAP50–95 of cell and nucleus segmentation on three levels of test sets. (A) MeanIoU of cell segmentation, (B) mAP50 of cell segmentation, (C) mAP50–95 of cell segmentation, (D) MeanIoU of nucleus segmentation, (E) mAP50 of nucleus segmentation and (F) mAP50–95 of nucleus segmentation. MeanIoU, mean intersection over union; mAP, mean average precision.

As observed in the boxplots of Figure 7A, the minimum mean IoU value for cell segmentation in all three sets exceeds 0.6, indicating that our proposed method effectively segments cells, with the majority of masks achieving an IoU >0.6. Similarly, as we can see in Figure 7B, our method also achieved high mAP50 scores for cell segmentation, with the majority of mAP50 scores reaching 1 and a minimum value of 0.5. Moreover, the mAP50–95 for cell segmentation also maintained promising scores (>0.7 at the Q2 line in Figure 7C). Based on the comparative analysis of mean intersection over union (meanIoU), mAP50, and mAP50–95 scores, it is evident that our method performs well in cell segmentation across all three levels of the test set.

However, for nucleus segmentation, some instances result in missed detections, leading to a minimum IoU of zero, as shown in Figure 7D. The median (Q2) value of the mean IoU is 0.64 (0.56±0.2) for the easy test set, 0.38 (0.42±0.2) for the medium test set, and 0.27 (0.39±0.3) for the difficult test set. For mAP50 (Figure 7E), the easy set was the highest, with a median (Q2) value of 0.78 (0.68±0.3), followed by the medium test set with a median (Q2) value of 0.5 (0.54±0.3), and the difficult test set with a median (Q2) value of 0.25 (0.38±0.3). Lastly, for the mAP50–95 scores (Figure 7F), the easy set was the highest, with a median (Q2) value of 0.46 (0.42±0.2), followed by the medium test set with a median (Q2) value of 0.25 (0.35±0.3), and the difficult test set with a median (Q2) value of 0.1 (0.19±0.2). Table 3 provides a detailed summary of the performance measures for each level of test images.

Table 3

Detailed performance measurements of the proposed segmentation model

Level MaxIoU MeanIoU mAP50 mAP75 mAP50–95
Cell segmentation
   Easy 0.87 0.82 0.96 0.79 0.66
   Medium 0.89 0.82 0.95 0.82 0.69
   Difficult 0.90 0.82 0.94 0.86 0.72
Nucleus segmentation
   Easy 0.77 0.56 0.68 0.45 0.42
   Medium 0.72 0.42 0.54 0.35 0.35
   Difficult 0.61 0.39 0.38 0.18 0.19

MaxIoU, maximum intersection over union; MeanIoU, mean intersection over union; mAP, mean average precision.

Moreover, as mentioned in the performance evaluation criteria section, we also measure some features crucial for cell analysis, including the counts, area in pixels, and sphericity of segmented cells and nuclei. Figure 8 depicts bar plots illustrating the counts of cells and nuclei segmented by our proposed method. Figure 8A represents the counts of cells per image, while Figure 8B displays the counts of nuclei per image. From this comparison, we can see that our proposed method is capable of segmenting a maximum of 15 cells and nuclei per image. At minimum, it reliably segments at least one cell, but in the case of nuclei, some may remain unsegmented, leading to a minimum count of zero. From Figure 8B, we can see that the nuclei in four images of the difficult test set failed to be segmented.

Figure 8 Cells and nuclei counts segmented per image. (A) Cells counts and (B) nuclei counts.

We then analyze the area and sphericity of the segmented cells and nuclei. Figure 9A shows the distribution of the area (measured in pixels) of segmented cells, while Figure 9B presents the distribution of the area of segmented nuclei. The range of cell areas that our proposed method can segment spans from ~200 to ~150,000 pixels. For nuclei, our method can segment areas ranging from ~130 to ~9,600 pixels. Similarly, Figure 10A,10B show the distributions of sphericity values of the segmented cells and nuclei from the test set. Sphericity is a measure related to the shape of cells and nuclei, and its value ranges from 0 (non-spherical) to 1 (spherical). As depicted in Figure 10, the majority of segmented cells and nuclei in our test set demonstrate a spherical structure, as evidenced by sphericity scores exceeding 0.5. The range of sphericity values of the segmented structures is from ~0.1 to ~0.92 for cells and from ~0.3 to ~0.98 for nuclei. Lastly, we assess the average processing time required to segment cells and nuclei per image; our proposed method typically delivers the final segmentation results within approximately 2 minutes.

Figure 9 Area in pixels of segmented cells and nuclei. (A) Areas of cells and (B) areas of nuclei.
Figure 10 Sphericity of segmented cells and nuclei. (A) Sphericity of cells and (B) sphericity of nuclei.
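Because the paper does not state its exact sphericity formula, the sketch below uses the common 2D circularity measure 4πA/P² as an assumed stand-in, computed per instance mask with OpenCV; the counts and areas reported in Figures 8 and 9 follow directly from the same per-mask measurements.

```python
# Sketch of the per-instance shape features reported in Figures 8-10:
# instance count, area in pixels, and a sphericity-like measure. The
# circularity formula 4*pi*A / P**2 is an assumed stand-in, since the
# paper does not state how sphericity was computed.
import cv2
import numpy as np

def instance_features(binary_masks):
    """binary_masks: list of (H, W) uint8 masks, one per segmented instance."""
    features = []
    for mask in binary_masks:
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        contour = max(contours, key=cv2.contourArea)   # main object outline
        area = cv2.contourArea(contour)                # size in pixels
        perimeter = cv2.arcLength(contour, True)       # closed contour length
        circularity = 4 * np.pi * area / perimeter**2 if perimeter else 0.0
        features.append({"area_px": area, "sphericity": circularity})
    return features

# The per-image instance count is simply len(instance_features(masks)).
```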

Discussion

In this section, we conduct a comparative analysis of our proposed method against the existing methods in the literature, especially emphasizing those designed and evaluated on the same dataset (EVICAN). To ensure a fair comparison, we select mAP50, mAP75, and mAP50–95 as the standard assessment criteria because all the compared methods report them. Tables 4,5 summarize the performance of the methods in the literature (4,22) and the proposed method on the three different levels of the test set. Through this analysis, we shed light on the superior performance of our proposed method. For cell segmentation (Table 4), our method can segment almost all cells across all test sets, achieving an IoU ≥0.5. This is evident from the mAP50 scores exceeding 90% across all sets (96.25% for easy, 95.05% for medium, and 94.18% for difficult, respectively). Moreover, our cell segmentation maintained higher scores not only in mAP50 but also in mAP75 and mAP50–95, reaching an mAP75 of ~80% and above 65% for mAP50–95. This indicates that our method ensures excellent capability, effectiveness, and robustness across a range of IoU thresholds.

Table 4

Cell segmentation: performance comparison with the existing methods in the literature

Level Method mAP50 (average %) mAP75 (average %) mAP50–95 (average %)
Easy MRCNN (4) 50.03 21.95 24.61
DeepCeNS (22) 89.25 74.25 63.38
Proposed method 96.25 78.96 66.05
Medium MRCNN (4) 7.63 2.38 3.19
DeepCeNS (22) 54.17 29.18 29.73
Proposed method 95.05 81.94 69.38
Difficult MRCNN (4) 12.89 1.74 4.70
DeepCeNS (22) 40.99 24.06 21.64
Proposed method 94.18 85.50 71.60

mAP, mean average precision; MRCNN, mask region-based convolutional neural network.

Table 5

Nucleus segmentation: performance comparison with the existing methods in literature

Level Method mAP50 (average %) mAP75 (average %) mAP50–95 (average %)
Easy MRCNN (4) 33.54 10.61 13.95
DeepCeNS (22) 74.43 30.56 36.46
Proposed Method 68.04 44.85 41.64
Medium MRCNN (4) 11.15 1.52 4.51
DeepCeNS (22) 39.60 14.44 18.46
Proposed Method 54.66 34.61 30.43
Difficult MRCNN (4) 2.32 0.99 1.15
DeepCeNS (22) 23.28 5.36 8.86
Proposed Method 38.29 17.55 19.19

mAP, mean average precision; MRCNN, mask region-based convolutional neural network.

For nucleus segmentation (Table 5), our mAP50 score for the easy test set (68.04%) is lower than that of DeepCeNS (74.43%). However, our approach consistently delivers better mAP50 scores for the medium (54.66%) and difficult (38.29%) sets. The reason behind the lower mAP50 score for the easy test set may be attributed to the relatively simpler nature of the images in that set; as the easy test set contains fewer complex nucleus patterns, which may not be fully represented in the training and validation sets, inference on it may be negatively affected. Nevertheless, for the average mAP75 and mAP50–95 scores, our approach maintains higher scores. Overall, compared to cell segmentation, the performance of nucleus segmentation is significantly lower due to the tiny, barely visible, and complex structure of nuclei.


Conclusions

This study has proposed a deep learning-based instance segmentation model to segment cells and nuclei from light microscopy images taken from different organs using different protocols. It initially uses a fine-tuned YOLOv9-E, the most recent and advanced object detection model, to generate bounding box prompts. Using the predicted bounding boxes, raw segmentation masks are then generated by a pre-trained SAM. Finally, those raw masks are refined using non-max suppression and image processing methods. The proposed instance segmentation model was designed and validated using a large public dataset called EVICAN and achieved favorable performance, showing average mAP50 scores of 96.25, 95.05, and 94.18 for cell segmentation, and 68.04, 54.66, and 38.29 for nucleus segmentation on the easy, medium, and difficult test sets of EVICAN, respectively. However, there are areas in our proposed method that require improvement. While our cell segmentation demonstrates high performance, the results of nucleus segmentation are relatively low. Further enhancement can be achieved by augmenting the prompt generator to increase its generalizability, specifically for nucleus segmentation.


Acknowledgments

Funding: This study was supported by King Mongkut’s Institute of Technology Ladkrabang (KMITL) Research and Innovative Service (KRIS) (No. 2566-02-01-010).


Footnote

Conflicts of Interest: Both authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-801/coif). Both authors report that this research was supported by King Mongkut’s Institute of Technology Ladkrabang (KMITL) Research and Innovative Service (KRIS) (No. 2566-02-01-010). The authors have no other conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki (as revised in 2013).

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Wang Y, Zhang X, Xu J, Sun X, Zhao X, Li H, Liu Y, Tian J, Hao X, Kong X, Wang Z, Yang J, Su Y. The Development of Microscopic Imaging Technology and its Application in Micro- and Nanotechnology. Front Chem 2022;10:931169. [Crossref] [PubMed]
  2. Thorn K. A quick guide to light microscopy in cell biology. Mol Biol Cell 2016;27:219-22. [Crossref] [PubMed]
  3. Wählby C, Lindblad J, Vondrus M, Bengtsson E, Björkesten L. Algorithms for cytoplasm segmentation of fluorescence labelled cells. Anal Cell Pathol 2002;24:101-11. [Crossref] [PubMed]
  4. Schwendy M, Unger RE, Parekh SH. EVICAN-a balanced dataset for algorithm development in cell and nucleus segmentation. Bioinformatics 2020;36:3863-70. [Crossref] [PubMed]
  5. Trizna EY, Sinitca AM, Lyanova AI, Baidamshina DR, Zelenikhin PV, Kaplun DI, Kayumov AR, Bogachev MI. Brightfield vs Fluorescent Staining Dataset-A Test Bed Image Set for Machine Learning based Virtual Staining. Sci Data 2023;10:160. [Crossref] [PubMed]
  6. Liu Z, Jin L, Chen J, Fang Q, Ablameyko S, Yin Z, Xu Y. A survey on applications of deep learning in microscopy image analysis. Comput Biol Med 2021;134:104523. [Crossref] [PubMed]
  7. Mohammed ZF, Abdulla AA. Thresholding-based White Blood Cells Segmentation from Microscopic Blood Images. UHD J Sci Technol 2020;4:9-17.
  8. Hiremath PS, Bannigidad P. Automatic classification of bacterial cells in digital microscopic images. Int J Eng Technol 2009;2:9-15.
  9. Jo H, Han J, Kim YS, Lee Y, Yang S. A Novel Method for Effective Cell Segmentation and Tracking in Phase Contrast Microscopic Images. Sensors (Basel) 2021;21:3516. [Crossref] [PubMed]
  10. Ali R, Gooding M, Christlieb M, Brady M. Phase-based segmentation of cells from brightfield microscopy. 2007 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, Arlington, VA, USA, 2007:57-60
  11. Gamarra M, Manjarres Y, Torres MT, Escorcia-Gutierrez J, Zurek E. MC-Kmeans: an Approach to Cell Image Segmentation Using Clustering Algorithms. Int J Artif Intel 2021;19:80-94.
  12. Zhu R, Sui D, Qin H, Hao A. An extended type cell detection and counting method based on FCN. 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE), Washington, DC, USA, 2017:51-6.
  13. Zhao T, Yin Z. Pyramid-Based Fully Convolutional Networks for Cell Segmentation. In: Frangi A, Schnabel J, Davatziko, C, Alberola-López C, Fichtinger G. editors. Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Lecture Notes in Computer Science, Springer, Cham 2018;11073:677-85.
  14. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N, Hornegger J, Wells W, Frangi A. editors. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer Science, Springer, Cham 2015;9351:234-41.
  15. Long F. Microscopy cell nuclei segmentation with enhanced U-Net. BMC Bioinformatics 2020;21:8. [Crossref] [PubMed]
  16. Zhou Z, Siddiquee MMR, Tajbakhsh N, Liang J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. Deep Learn Med Image Anal Multimodal Learn Clin Decis Support (2018) 2018;11045:3-11. [Crossref] [PubMed]
  17. Kakumani AK, Sree LP, Krishna CS, Uppalapati G, Pavithra GSS, Harshini S. Semantic Segmentation of Cells in Microscopy Images via Pretrained Autoencoder and Attention U-Net. 2022 International Conference on Machine Learning, Computer Systems and Security (MLCSS), Bhubaneswar, India, 2022:94-9.
  18. Ghaznavi A, Rychtáriková R, Saberioon M, Štys D. Cell segmentation from telecentric bright-field transmitted light microscopy images using a Residual Attention U-Net: A case study on HeLa line. Comput Biol Med 2022;147:105805. [Crossref] [PubMed]
  19. Wen T, Tong B, Liu Y, Pan T, Du Y, Chen Y, Zhang S. Review of research on the instance segmentation of cell images. Comput Methods Programs Biomed 2022;227:107211. [Crossref] [PubMed]
  20. He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. arXiv:1703.06870, 2018.
  21. Fujita S, Han XH. Cell Detection and Segmentation in Microscopy Images with Improved Mask R-CNN. In: Sato I, Han B. editors. Computer Vision – ACCV 2020 Workshops. Lecture Notes in Computer Science, Springer, Cham, 2021;12628:58-70.
  22. Khalid N, Munir M, Edlund C, Jackson TR, Trygg J, Sjögren R, Dengel A, Ahmed S. DeepCeNS: An end-to-end Pipeline for Cell and Nucleus Segmentation in Microscopic Images. 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021:1-8.
  23. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo WY, Dollár P, Girshick R. Segment Anything. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV). 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023:3992-4003.
  24. Israel U, Marks M, Dilip R, Li Q, Yu C, Laubscher E, Li S, Schwartz M, Pradhan E, Ates A, Abt M, Brown C, Pao E, Pearson-Goulart A, Perona P, Gkioxari G, Barnowski R, Yue Y, Valen DV. A Foundation Model for Cell Segmentation. bioRxiv [Preprint]. 2024.
  25. Na S, Guo Y, Jiang F, Ma H, Huang J. Segment Any Cell: A SAM-based Auto-prompting Fine-tuning Framework for Nuclei Segmentation. arXiv:2401.13220, 2024.
  26. Wang CY, Yeh IH, Liao HYM. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv:2402.13616, 2024.
Cite this article as: Baral S, Paing MP. Instance segmentation of cells and nuclei from multi-organ cross-protocol microscopic images. Quant Imaging Med Surg 2024;14(9):6204-6221. doi: 10.21037/qims-24-801
