Original Article

Surgical instrument segmentation and classification in transcanal endoscopic ear surgery video using Segment Anything Model 2

Ryunosuke Ueno1, Takeshi Fujita2, Kazuhiro Matsui1,3, Keita Atsuumi1,4, Natsumi Uehara2, Toshihiko Yamashita2, Hiroaki Hirai1, Toshikazu Kawai5, Hisashi Suzuki6, Atsushi Nishikawa1

1Graduate School of Engineering Science, The University of Osaka, Toyonaka, Japan; 2Department of Otolaryngology Head and Neck Surgery, Kobe University Graduate School of Medicine, Kobe, Japan; 3Faculty of Information Science and Arts, Osaka Electro-Communication University, Shijonawate, Japan; 4Graduate School of Information Sciences, Hiroshima City University, Hiroshima, Japan; 5Graduate School of Robotics and Design, Osaka Institute of Technology, Osaka, Japan; 6Faculty of Science and Engineering, Chuo University, Tokyo, Japan

Contributions: (I) Conception and design: R Ueno, A Nishikawa; (II) Administrative support: T Fujita, A Nishikawa; (III) Provision of study materials or patients: T Fujita, N Uehara, T Yamashita; (IV) Collection and assembly of data: R Ueno; (V) Data analysis and interpretation: R Ueno, K Matsui, K Atsuumi, H Hirai, A Nishikawa; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

Correspondence to: Takeshi Fujita, MD, PhD. Department of Otolaryngology Head and Neck Surgery, Kobe University Graduate School of Medicine, 7-5-1 Kusunokicho, Chuoku, Kobe, Hyogo 650-0017, Japan. Email: fujitake@med.kobe-u.ac.jp.

Background: Accurate automatic recognition of surgical instruments is important for enhancing the safety and efficiency of transcanal endoscopic ear surgery (TEES). Segment Anything Model 2 (SAM 2) is a state-of-the-art artificial intelligence (AI) model featuring “zero-shot segmentation”, which allows it to identify objects without extensive training data. The purpose of this study, the first to our knowledge to apply SAM 2 to TEES videos, was to evaluate its performance in segmenting and classifying surgical instruments.

Methods: From videos of three clinical TEES cases (tympanoplasty), we created a 684-frame evaluation video that simulated the sequential use of four types of surgical instruments: cupped forceps, pick, alligator forceps, and circular knife. SAM 2 was instructed to identify the instruments using only 12 reference images. Performance was quantitatively evaluated using four metrics—Precision, Recall, dice similarity coefficient (DSC), and intersection over union (IoU)—as well as a confusion matrix to assess frame-level classification accuracy by comparing the model’s output to manually created ground-truth annotations.

Results: For segmentation without classification, SAM 2 demonstrated excellent performance, with a mean Precision of 0.98, a mean Recall of 0.89, a mean DSC of 0.93, and a mean IoU of 0.87. When classifying the four specific instrument types, performance varied, with mean Precision ranging from 0.68 to 0.98, mean Recall from 0.79 to 0.86, mean DSC from 0.62 to 0.89, and mean IoU from 0.58 to 0.84. Notably, false-positive detections in non-instrument regions were extremely rare. A trend of lower classification accuracy was observed for articulating instruments with moving parts (cupped forceps and alligator forceps).

Conclusions: This study demonstrated that SAM 2 can accurately identify surgical instruments in TEES videos without the need to prepare large, costly training datasets. This zero-shot segmentation characteristic significantly lowers the barrier to clinical implementation of AI, suggesting it is a promising foundational technology for future surgical navigation and robotic automation. Key future challenges include improving classification accuracy for articulating instruments and accelerating processing speed for practical, real-time use.

Keywords: Endoscopic ear surgery; image segmentation; artificial intelligence (AI); surgical instruments; computer-assisted surgery


Submitted Jun 30, 2025. Accepted for publication Oct 11, 2025. Published online Nov 21, 2025.

doi: 10.21037/qims-2025-1469


Video S1 Video of segmentation and classification results for surgical instruments using SAM 2. This video shows the segmentation and classification results for the entire 68-second evaluation video, played at 10 frames per second. The region of each surgical instrument identified by SAM 2 is visualized as a color-coded overlay. The color key is as follows: orange, cupped forceps; green, pick; light blue, alligator forceps; and dark blue, circular knife.

Introduction

Transcanal endoscopic ear surgery (TEES) is a minimally invasive technique for treating middle ear diseases such as otitis media and otosclerosis (1). TEES involves inserting an endoscope and surgical instruments through the external auditory canal, which is less than 1 cm in diameter. This approach offers benefits such as smaller incisions and a wider field of view for the surgeon. However, TEES presents unique challenges. In contrast to laparoscopic surgery, both the endoscope and instruments pass through the same narrow, bony canal. This severely limits instrument maneuverability and frequently causes interference between the endoscope and the instruments. To overcome these challenges and further improve procedural safety and efficiency, the integration of surgical assistance technologies is anticipated. Specifically, the accurate and automatic recognition of surgical instruments within the operative field is an essential prerequisite for future applications like surgical navigation and robotic surgery.

In recent years, advances in artificial intelligence (AI), particularly in deep learning, have fueled research into the automatic identification of specific objects within surgical videos. The technique of precisely identifying an object’s boundaries on a pixel-by-pixel basis is known as “segmentation”. In the field of TEES, Horinouchi et al. successfully used an AI model (DeepLabv3+) (2) to achieve the segmentation of a single surgical instrument, but their work did not extend to classifying different types of instruments (3). Nwosu et al. reported the first use of an AI model (Detectron2) (4) to automatically detect both surgical instruments and anatomical structures in TEES videos (5). However, these previous studies did not perform instrument classification. In a related study, Liu et al. used another AI model (YOLOv8) (6) to detect two types of instruments in microscopic videos of mastoidectomy (7). However, a limitation of the AI models used in these prior studies is their primary reliance on spatial information within individual video frames, which prevents them from fully leveraging the temporal continuity inherent in video data.

Against this backdrop, research on segmentation for both images and videos has been actively pursued in the field of computer vision in recent years. Significant efforts have been made to improve video object segmentation by developing sophisticated memory networks (8,9), efficient feature propagation techniques (10,11), and long-term tracking capabilities (12). More recently, the field has moved towards creating generalized models capable of segmenting or tracking “anything” in context (13-17). Among these, this study focuses on the Segment Anything Model 2 (SAM 2), introduced by Meta in 2024 (17). Unlike many other temporal-aware segmentation models, SAM 2 is a state-of-the-art, general-purpose AI model featuring “zero-shot segmentation”, which allows it to identify any object without extensive training data. Notably, whereas its predecessor (SAM) (18) was primarily designed for still images, SAM 2 features significantly enhanced capabilities for processing temporal continuity in videos. This characteristic suggests its potential to robustly track and identify objects, even in real-world clinical videos containing unstable elements such as lighting changes and motion blur. Indeed, emerging reports have already begun to confirm its effectiveness in the analysis of other types of surgical videos (19,20).

To the best of our knowledge, as of June 2025, no study has applied SAM 2 to TEES videos for both the segmentation and classification of surgical instruments. The ability to automatically identify and log the type of instrument being used is crucial for the objective analysis of surgical workflows, the training of junior surgeons, and as a foundational technology for future surgical automation. Therefore, the purpose of this study was to apply SAM 2 to clinical TEES videos in which multiple surgical instruments (cupped forceps, pick, alligator forceps, circular knife) are used sequentially, and to quantitatively evaluate its performance in both segmenting and classifying these instruments.


Methods

Study data

For this study, we used video recordings from three clinical TEES cases, all of which were tympanoplasty procedures. The use of these surgical videos was approved by the Institutional Review Board (IRB) of Kobe University Graduate School of Medicine (No. B200397). Individual consent for this retrospective analysis was waived. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Inputting instrument information to SAM 2 and segmentation procedure

The AI model, SAM 2, requires prior instruction on the visual features of the surgical instruments to be identified. Specifically, we used the Hiera Large-based model for this study (17). To instruct the model, we employed an interactive region specification method (17). This technique involves an operator using a set of reference images (three images for each of the four instruments, totaling 12 images), which were separate from the evaluation video frames. On these reference images, the operator specifies a few points belonging to the instrument (positive examples) and to the background (negative examples) with mouse clicks. From these sparse inputs, SAM 2 automatically learns and generates the entire region of the instrument. Based on the visual information learned through this process, SAM 2 then performs tracking and segmentation on the evaluation video.

We conducted two types of evaluations:

  • Instrument segmentation: this task involved identifying all four instrument types collectively under a single “surgical instrument” category by assigning them a common identification (ID) label.
  • Instrument classification: this task involved distinguishing between the four instrument types by assigning a unique ID label to each (e.g., cupped forceps = ID 1, pick = ID 2).

An example of the interactive region specification by the operator is shown in Figure 1. The complete set of reference images and their specification results are provided in Figures S1-S4.

Figure 1 Example of interactive segmentation of surgical instruments using SAM 2. This figure illustrates the interactive instruction process for four types of surgical instruments. The orange overlays represent the instrument regions automatically segmented by SAM 2. This segmentation is guided by a few points interactively provided by an operator: positive points on the instrument (green dots) and negative points on the background (red “x” marks). SAM 2, Segment Anything Model 2.

The experiments were performed on a workstation equipped with an Intel Xeon W-1270P CPU, 32 GB of RAM, and an NVIDIA Quadro RTX 5000 GPU with 16 GB of VRAM.

To assess the model’s suitability for real-time surgical applications, we employed SAM 2 in its online, streaming configuration. In this mode, the model processes video sequentially, using only past frames to inform the segmentation of the current frame, without accessing future frames.

Performance evaluation

To evaluate segmentation performance, we first created “ground truth” images by manually annotating the precise pixel-level area of the instrument in all frames. The “prediction images” generated by SAM 2 were then compared against these ground truth images on a frame-by-frame basis. Both prediction and ground truth images are binary, with pixels corresponding to the instrument region assigned a value of 1 (white) and background pixels a value of 0 (black).

As evaluation metrics, we used Precision and Recall, which are standard in the computer vision field and have been used in previous studies on otologic instrument detection (3,7), as well as the dice similarity coefficient (DSC) and intersection over union (IoU). These metrics can be defined using the areas (in pixels) of true positive (TP), false negative (FN), and false positive (FP) predictions made by the AI model.

TP: the area where the AI correctly predicted the instrument region [unit: pixel].

FP: the area where the AI incorrectly predicted the background as the instrument region [unit: pixel].

FN: the area where the AI incorrectly predicted the instrument region as the background (missed detection) [unit: pixel].

Based on these, we define a ‘Predicted frame’ as any frame where the model predicted at least one pixel of a specific instrument (i.e., where TP >0 or FP >0). We define a ‘Ground-truth frame’ as any frame where a specific instrument was actually present (i.e., where TP >0 or FN >0).

The specific definitions and formulas are as follows:

  • Precision: the proportion of pixels predicted as “instrument” that were correct. This metric indicates the accuracy of the predictions. It is calculated as TP/(TP + FP) in frames where TP >0 or FP >0.
  • Recall: the proportion of all actual instrument pixels that were correctly identified by the model. This metric indicates the completeness of the detection. It is calculated as TP/(TP + FN) in frames where TP >0 or FN >0.
  • DSC: the harmonic mean of Precision and Recall, providing a comprehensive measure of both prediction accuracy and completeness. By definition of the harmonic mean, it is calculated as 2/(1/Precision + 1/Recall) = 2TP/(2TP + FP + FN) in frames where TP + FP + FN >0.
  • IoU: the ratio of the intersection area to the union area of the predicted instrument region and the actual instrument region. Similar to DSC, it comprehensively evaluates both prediction accuracy and completeness. It is calculated as TP/(TP + FP + FN) in frames where TP + FP + FN >0.
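The per-frame definitions above can be expressed as a short NumPy sketch. The function name, dictionary keys, and variable names are ours for illustration, not taken from the study's implementation:

```python
import numpy as np

def frame_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Per-frame pixel metrics from binary masks (1 = instrument, 0 = background).

    `pred` is the AI's prediction image, `gt` the manually annotated ground truth.
    """
    tp = np.sum((pred == 1) & (gt == 1))  # instrument pixels correctly predicted
    fp = np.sum((pred == 1) & (gt == 0))  # background predicted as instrument
    fn = np.sum((pred == 0) & (gt == 1))  # instrument pixels missed

    metrics = {}
    if tp + fp > 0:                       # "Predicted frame": Precision is defined
        metrics["precision"] = tp / (tp + fp)
    if tp + fn > 0:                       # "Ground-truth frame": Recall is defined
        metrics["recall"] = tp / (tp + fn)
    if tp + fp + fn > 0:                  # DSC and IoU are defined
        metrics["dsc"] = 2 * tp / (2 * tp + fp + fn)
        metrics["iou"] = tp / (tp + fp + fn)
    return metrics
```

Note that a frame with neither instrument pixels nor predictions yields an empty dictionary, mirroring the conditional definitions: such frames contribute to none of the four metrics.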

We calculated these metrics for each frame and for each ID label, then assessed the overall performance using their mean and standard deviation. Furthermore, to visually analyze the distribution of the performance scores, we generated histograms of the Precision and Recall values, dividing the [0, 1] range into 100 bins.

To quantitatively evaluate the frame-level instrument classification accuracy, we created a ground-truth frame-based confusion matrix (5 rows × 16 columns) comparing the actual instrument presence patterns with the AI’s estimated patterns (Figure 2). In this experiment, there are five actual instrument presence patterns (only cupped forceps present, only pick present, only alligator forceps present, only circular knife present, or no instrument present). However, since SAM 2 assigns instrument labels at the pixel level, the AI can output a total of 16 (2^4) possible presence patterns, representing the presence/absence combinations of the four instruments. Furthermore, we also created a predicted frame-based confusion matrix (4 rows × 4 columns) illustrating the relationship between the instruments predicted by the AI (4 types) and the actual instrument frames (4 types) (Figure 3).
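A minimal sketch of how such a 5×16 presence-pattern matrix could be tallied from per-frame outputs is shown below; the instrument abbreviations follow Figure 2, but the function and variable names are illustrative assumptions, not the study's code:

```python
from itertools import product
import numpy as np

# Cupped forceps, pick, alligator forceps, circular knife (abbreviations as in Figure 2)
INSTRUMENTS = ["Cf", "Pi", "Af", "Ck"]

# All 16 (2^4) possible predicted presence patterns, as tuples of booleans
PATTERNS = list(product([True, False], repeat=4))

# The five actual patterns: one instrument at a time, or no instrument present
ACTUAL = INSTRUMENTS + ["No inst"]

def confusion_5x16(actual_labels, predicted_presence):
    """Tally frames into a 5x16 matrix.

    actual_labels: per-frame label drawn from ACTUAL.
    predicted_presence: per-frame tuple of 4 booleans (AI presence output
    per instrument, in INSTRUMENTS order).
    """
    mat = np.zeros((len(ACTUAL), len(PATTERNS)), dtype=int)
    for a, p in zip(actual_labels, predicted_presence):
        mat[ACTUAL.index(a), PATTERNS.index(tuple(p))] += 1
    return mat
```

Each row sum then equals the number of ground-truth frames for that pattern (e.g., 22 for the no-instrument row in this study).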

Figure 2 Ground-truth frame-based 5×16 confusion matrix. This matrix compares the actual instrument presence patterns (rows) with the AI’s estimated presence patterns (columns). The numbers in the cells represent the count of frames for each combination. Af, alligator forceps; Cf, cupped forceps; Ck, circular knife; F, false (instrument absent); No inst, no instruments present; Pi, pick; T, true (instrument present).
Figure 3 Predicted frame-based 4×4 confusion matrix. This matrix, derived from the data in Figure 2, illustrates the relationship between the instrument type predicted by the AI (rows) and the actual instrument present in the frame (columns). The numbers in the cells represent the count of frames for each combination. The off-diagonal elements correspond to misclassification events where the Precision score is zero. Af, alligator forceps; AI, artificial intelligence; Cf, cupped forceps; Ck, circular knife; Pi, pick.

Results

Preparation of evaluation videos and visual evaluation of segmentation

The videos had a resolution of 1,920×1,080 pixels and a frame rate of 30 frames per second (fps). From these recordings, we extracted scenes where only one of four types of surgical instruments (cupped forceps, pick, alligator forceps, circular knife) appeared in the frame. We created two video clips per instrument, for a total of eight clips. Each clip was then converted into a sequence of still images (frames) at intervals of 5 to 10 frames, resulting in a total of 684 frames for evaluation. For the evaluation process, these eight clips were concatenated in a specific order (cupped forceps → pick → alligator forceps → circular knife → …) to create a single video that simulates the sequential changing of instruments during surgery.
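The sampling-and-concatenation step can be sketched as follows; `build_evaluation_sequence`, its arguments, and the fixed `step` value are illustrative assumptions (the study sampled at intervals of 5 to 10 frames):

```python
def build_evaluation_sequence(clip_lengths, step):
    """Concatenate clips in order and take every `step`-th frame of each,
    yielding (clip_index, frame_index) pairs for the evaluation sequence."""
    sequence = []
    for clip_idx, n_frames in enumerate(clip_lengths):
        sequence.extend((clip_idx, f) for f in range(0, n_frames, step))
    return sequence
```

Applied to eight clips ordered cupped forceps → pick → alligator forceps → circular knife, this produces a single frame sequence simulating sequential instrument changes.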

For a visual assessment of the surgical instrument segmentation and classification results, the detected instrument regions were color-coded and overlaid on the original endoscopic video frames. The cupped forceps was colored orange, the pick green, the alligator forceps light blue, and the circular knife dark blue. A timeline of these results is presented in Figure 4, showing images extracted at 10-frame intervals from the total of 684 frames. Additionally, a video of the results, adjusted to 10 fps (68 seconds in length), is provided as Video S1.

Figure 4 Timeline of surgical instrument segmentation and classification results using SAM 2. This figure displays 69 frames extracted at 10-frame intervals from the full 684-frame evaluation video. The region of each surgical instrument identified by SAM 2 is visualized as a color-coded overlay. The color key is as follows: orange represents the cupped forceps; green, the pick; light blue, the alligator forceps; and dark blue, the circular knife. SAM 2, Segment Anything Model 2.

Quantitative performance of segmentation and classification

The quantitative performance evaluation under each condition is summarized in Table 1. In the task of individually classifying the four instrument types, the mean Precision was highest for the pick (0.98), followed by the circular knife (0.93), cupped forceps (0.74), and alligator forceps (0.68). The mean Recall was highest for the pick (0.86), followed by the alligator forceps (0.85), circular knife (0.82), and cupped forceps (0.79). The mean DSC was highest for the pick (0.89), followed by the circular knife (0.82), cupped forceps (0.64), and alligator forceps (0.62). The mean IoU was highest for the pick (0.84), followed by the circular knife (0.76), cupped forceps (0.59), and alligator forceps (0.58). The pick scored the highest on all four metrics, and the circular knife also maintained relatively high values across all metrics. In contrast, while the alligator forceps had a high Recall comparable to the pick, its other metrics were the lowest among the four instruments. The cupped forceps showed the lowest Recall, and its other metrics were also relatively low, although not the lowest.

Table 1

Performance of the SAM 2 model for the segmentation and classification of surgical instruments

Condition/instrument | Total frames | Predicted frames | Ground-truth frames | Mean precision (SD) | Mean recall (SD) | Mean DSC (SD) | Mean IoU (SD)
Classification task
   Cupped forceps | 684 | 135 | 109 | 0.74 (0.43) | 0.79 (0.28) | 0.64 (0.43) | 0.59 (0.41)
   Pick | 684 | 234 | 240 | 0.98 (0.09) | 0.86 (0.18) | 0.89 (0.19) | 0.84 (0.19)
   Alligator forceps | 684 | 196 | 133 | 0.68 (0.46) | 0.85 (0.13) | 0.62 (0.43) | 0.58 (0.41)
   Circular knife | 684 | 179 | 180 | 0.93 (0.20) | 0.82 (0.22) | 0.82 (0.27) | 0.76 (0.26)
Segmentation-only task
   All (no classification) | 684 | 662 | 662 | 0.98 (0.03) | 0.89 (0.07) | 0.93 (0.04) | 0.87 (0.07)

DSC, dice similarity coefficient; IoU, intersection over union; SAM 2, Segment Anything Model 2; SD, standard deviation.

Next, when the four instruments were segmented as a single “surgical instrument” category without classification, the performance was excellent, with a mean Precision of 0.98, a mean Recall of 0.89, a mean DSC of 0.93, and a mean IoU of 0.87. Throughout the results in Table 1, a trend was observed where lower mean values for Precision, Recall, DSC, and IoU were associated with larger standard deviations, indicating greater variability in performance.

Distribution and analysis of performance scores

To understand the detailed distribution of the performance scores, histograms for Precision and Recall are shown in Figures 5,6, respectively. These histograms reveal that for most frames, segmentation was successful with extremely high accuracy under all conditions. Specifically, the mode (peak) of the Precision histogram was in the [0.99, 1.0] range, and the mode for Recall was also in a high range of [0.92, 1.0].

Figure 5 Histograms of Precision scores for each condition. This figure shows the distribution of Precision scores under five conditions: individual classification for the four instrument types (cupped forceps, pick, alligator forceps, and circular knife) and the segmentation-only task [all (no classification)]. The parenthesized red number at the far left of each histogram indicates the total number of frames with a Precision score of zero, representing the frequency of false positives or complete misclassifications.
Figure 6 Histograms of Recall scores for each condition. This figure shows the distribution of Recall scores under five conditions: individual classification for the four instrument types (cupped forceps, pick, alligator forceps, and circular knife) and the segmentation-only task [all (no classification)]. The parenthesized red number at the far left of each histogram indicates the total number of frames with a Recall score of zero, representing the frequency of complete detection failures (false negatives).

However, the mean values presented in Table 1 are lower than these peak values. This discrepancy is due to the presence of cases where segmentation failed completely in some frames, resulting in an evaluation score of zero. As indicated by the parenthesized red numbers below the x-axis in Figure 5, there were a number of frames with a Precision of zero, and these instances were the primary factor pulling down the mean score. This corresponds to cases of FPs, where the AI misidentified a different instrument or another object as the target instrument.

Analysis of instrument classification accuracy using confusion matrices

The ground-truth frame-based 5×16 confusion matrix, which tabulates the actual instrument presence patterns against the AI’s estimated patterns, is shown in Figure 2. For clarity, the AI’s possible output patterns are denoted with “T” (true) if an instrument is present and “F” (false) if it is absent.

The first row of Figure 2 shows that of the 684 total frames, 22 frames contained no instruments, and the AI correctly identified all of these frames as having no instruments present.

The second row shows that of the 109 frames containing only cupped forceps, the AI correctly estimated this in just over half (59 frames, 54%). In 41 frames, it output that both cupped forceps and alligator forceps were present (i.e., misidentifying a part of the cupped forceps as alligator forceps). The third row indicates that for the 240 frames containing only the pick, the AI correctly estimated this in over 93% of cases (225 frames). The fourth row shows that for the 133 frames containing only alligator forceps, the AI correctly estimated this in over 81% of cases (109 frames). The fifth row shows that for the 180 frames containing only the circular knife, the AI correctly estimated this in over 86% of cases (156 frames).

Next, the predicted frame-based 4×4 confusion matrix, derived from Figure 2, is presented in Figure 3. The off-diagonal elements in this matrix correspond to frames with a Precision of zero.

The first row of Figure 3 shows that of the 135 frames where the AI predicted the presence of cupped forceps, approximately 75% (101 frames) were actual frames containing only cupped forceps. The second row shows that of the 234 frames where the AI predicted the presence of the pick, approximately 99% (232 frames) were actual frames containing only the pick. The third row shows that of the 196 frames where the AI predicted the presence of alligator forceps, approximately 68% (133 frames) were actual frames containing only alligator forceps. The fourth row shows that of the 179 frames where the AI predicted the presence of the circular knife, approximately 96% (171 frames) were actual frames containing only the circular knife.

Processing speed

The processing speed for instrument segmentation in our experimental environment was 7 fps.


Discussion

In this study, we demonstrated that the state-of-the-art AI model, SAM 2, can segment and classify surgical instruments in TEES videos with high accuracy using only a small number of reference images. Notably, the model exhibited a very low rate of misidentifying non-instrument regions as instruments (high precision), while classification accuracy varied depending on the type and morphological characteristics of the instrument.

To compare the performance of our model with previous studies, we evaluated its segmentation capabilities without instrument classification. Our model achieved a Precision of 0.98, Recall of 0.89, DSC of 0.93, and IoU of 0.87. These results show a 0.02- to 0.04-point improvement across three metrics—Precision, Recall, and DSC—compared to a prior study using DeepLabv3+ (Precision 0.85–0.95, Recall 0.84–0.87, DSC 0.86–0.89) (3). A direct comparison of IoU was not possible as it was not reported in that study.

When compared with another study using Detectron2 (Recall 0.91–0.94, IoU 0.82–0.84) (5), our model’s Recall was slightly lower, while its IoU was slightly higher. The lower Recall indicates a slightly higher proportion of FNs relative to TPs in our model. Conversely, the higher IoU signifies a smaller combined total of FPs and FNs relative to TPs. Taken together, these two results imply that our model has a relatively lower proportion of FPs than the Detectron2-based model. Based on the definition of Precision, this strongly suggests that the Precision of our SAM 2-based method surpasses that of the Detectron2 approach, even though Precision was not directly reported in their work (5).

Based on this comparative analysis, the most significant advantage of our SAM 2-based method is its exceptionally high precision. Our model’s average Precision of 0.98 is clinically significant, as it implies an extremely low probability of falsely detecting instruments in non-instrument regions. A particularly noteworthy finding, as shown in the first row of our 5×16 ground-truth frame-based confusion matrix (Figure 2), is that in frames where no instrument was present, the model did not misidentify a single pixel as an instrument. This 100% accuracy in identifying instrument-absent frames means our system could serve as a highly reliable trigger for autonomous systems, such as an endoscopic robot (21).

Furthermore, the performance for classifying four instrument types (Precision 0.68–0.98, Recall 0.79–0.86) was comparable to that of a study using YOLOv8 to classify two instruments in mastoidectomy videos (Precision 0.67–0.93, Recall 0.61–0.89) (7), although a direct comparison of other metrics was not possible. The most noteworthy aspect of these results is that SAM 2 achieved this performance using a “zero-shot segmentation” approach. In contrast to prior methods that required thousands of training images (3,5,7), our study required zero training images and only simple guidance on 12 reference images, indicating a potential to dramatically reduce the preparatory costs of implementing AI systems.

Next, we discuss the variation in classification performance among instruments. Instruments with a fixed shape, such as the pick and circular knife, showed few misclassifications. In contrast, instruments with articulating parts that change shape upon opening and closing, such as the cupped forceps and alligator forceps, showed lower Precision. This is likely because the AI misidentified a part of a different instrument or other objects (e.g., blood clots) as a part of an “open” forceps, leading to a complete misclassification. Indeed, a number of frames for these instruments had a Precision of zero (complete misclassification), which was the primary factor that lowered the average scores. Specifically, the off-diagonal elements of the predicted frame-based 4×4 confusion matrix correspond to these zero-precision cases. The most frequent misclassification was identifying a part of the cupped forceps as alligator forceps, followed by other errors such as misclassifying parts of the alligator forceps or circular knife as cupped forceps. Given that only a single instrument was present in the images, occlusion was minimal, and no zero-precision cases occurred in the segmentation-only task, we infer that the primary cause of these misclassifications was insufficient variation in the reference images provided to SAM 2 (especially for the visually similar cupped and alligator forceps). However, in frames where this type of misclassification did not occur, the performance remained as high as that for other instruments. This suggests that developing strategies to counter these specific misclassification patterns is key to substantially improving overall performance.

Regarding Recall, complete misses of an instrument (Recall =0) were extremely rare for any instrument type, confirming that SAM 2 has an excellent characteristic of not losing track of the target object. However, partial detection failure remains a challenge and is the reason the mean Recall did not exceed 0.9.

Finally, we address the limitations of this study and future prospects. The first limitation is processing speed. The performance in our environment did not reach real-time (30 fps) capabilities, requiring future enhancements such as algorithmic optimization (22) or more powerful hardware. Notably, SAM 2 was originally developed with real-time video processing in mind. A previous report has shown that a frame rate of 30.2 fps was achieved using the same Hiera Large-based SAM 2 as in our study, but with a single high-performance GPU (A100 GPU) (17). We believe that real-time processing is achievable for our application by upgrading hardware. The second limitation is that our evaluation was confined to scenes with a single instrument. Considering future robotic-assisted TEES, where a surgeon can operate with both hands (21), the ability to handle multiple instruments simultaneously becomes an essential research objective. Furthermore, our evaluation was based on data from only three surgical cases performed at a single institution. While this study serves as a crucial first step in demonstrating the feasibility of SAM 2, future studies using a larger and more diverse dataset from multiple centers are necessary to validate the generalizability of our findings. A final limitation is the absence of a direct head-to-head comparison against prior methods on a standardized benchmark dataset. The lack of such a public dataset in the field of TEES constrained our analysis to a comparison of performance metrics reported in separate studies. Establishing a shared dataset would be a valuable future endeavor to facilitate more direct and robust model evaluation. We aim to address these challenges to enhance the autonomy of our TEES endoscopic robot (21).


Conclusions

This study demonstrated that the zero-shot segmentation model, SAM 2, can segment surgical instruments in TEES videos with high precision without the immense cost of preparing large training datasets. This characteristic suggests that the technology is a promising and readily implementable foundation for developing future surgical navigation and robotic-assistance systems. Key future challenges include improving the classification accuracy for articulating instruments and accelerating the processing speed for practical application.


Acknowledgments

None.


Footnote

Data Sharing Statement: Available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1469/dss

Funding: This work was supported by JSPS KAKENHI (grant No. JP22H00589, to A.N., T.F., T.K., and H.S.) and JST FOREST Program (grant No. JPMJFR215F, to T.F.).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-2025-1469/coif). T.F. reports a grant from JSPS KAKENHI (grant No. JP22H00589) and JST FOREST Program (grant No. JPMJFR215F) for the submitted work. T.K., H.S. and A.N. report a grant from JSPS KAKENHI (grant Number JP22H00589) for the submitted work. The other authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The use of these surgical videos was approved by the Institutional Review Board (IRB) of Kobe University Graduate School of Medicine (No. #B200397). Individual consent for this retrospective analysis was waived. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Ito T, Kubota T, Furukawa T, Matsui H, Futai K, Kakehata S. Transcanal Endoscopic Ear Surgery for Congenital Middle Ear Anomalies. Otol Neurotol 2019;40:1299-305. [Crossref] [PubMed]
  2. Chen LC, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, eds. Computer Vision – ECCV 2018. Springer, Cham; 2018:833-51.
  3. Horinouchi T, Matsui K, Atsuumi K, Fujita T, Uehara N, Yamashita T, Kawai T, Suzuki H, Nishizawa Y, Taniguchi K, Hirai H, Nishikawa A. Surgical Instrument Segmentation Using Deep Learning for Robot-assisted Transcanal Endoscopic Ear Surgery. Proc 18th Asian Conf Comput Aided Surg (ACCAS); 2022:44-5.
  4. Wu Y, Kirillov A, Massa F, Lo WY, Girshick R. Detectron2. 2019. Accessed August 17, 2025. Available online: https://github.com/facebookresearch/detectron2
  5. Nwosu O, Suresh K, Lee DJ, Crowson MG. Proof-of-Concept Computer Vision Model for Instrument and Anatomy Detection During Transcanal Endoscopic Ear Surgery. Otolaryngol Head Neck Surg 2024;170:1602-4. [Crossref] [PubMed]
  6. Varghese R, Sambath M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS); 2024:1-6.
  7. Liu GS, Parulekar S, Lee MC, El Chemaly T, Diop M, Park R, Blevins NH. Artificial Intelligence Tracking of Otologic Instruments in Mastoidectomy Videos. Otol Neurotol 2024;45:1192-7. [Crossref] [PubMed]
  8. Cheng HK, Tai YW, Tang CK. Rethinking Space-time Networks with Improved Memory Coverage for Efficient Video Object Segmentation. In: Ranzato M, Lacoste-Julien S, Keshavan J, eds. Advances in Neural Information Processing Systems 34 (NeurIPS 2021); 2021:11781-94.
  9. Cheng HK, Schwing AG. XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model. In: Avidan S, Brostow G, Cissé M, eds. Computer Vision – ECCV 2022. Springer, Cham; 2022:640-58.
  10. Yang Z, Yang Y. Decoupling Features in Hierarchical Propagation for Video Object Segmentation. In: Koyejo S, Mohamed S, Agarwal A, eds. Advances in Neural Information Processing Systems 35 (NeurIPS 2022); 2022:36324-36.
  11. Li M, Hu L, Xiong Z, Zhang B, Pan P, Liu D. Recurrent Dynamic Embedding for Video Object Segmentation. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2022:1322-31.
  12. Hong L, Chen W, Liu Z, Zhang W, Guo P, Chen Z, Zhang W. LVOS: A Benchmark for Long-term Video Object Segmentation. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023:13434-46.
  13. Wang X, Zhang X, Cao Y, Wang W, Shen C, Huang T. SegGPT: Towards Segmenting Everything in Context. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023:1130-40.
  14. Cheng HK, Oh SW, Price B, Schwing A, Lee JY. Tracking Anything with Decoupled Video Segmentation. Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023:1316-26.
  15. Cheng HK, Oh SW, Price B, Lee JY, Schwing A. Putting the Object Back into Video Object Segmentation. Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024:3151-61.
  16. Liu Q, Wang J, Yang Z, Li L, Lin K, Niethammer M, Wang L. LiVOS: Light Video Object Segmentation with Gated Linear Matching. Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2025:8668-78.
  17. Ravi N, Gabeur V, Hu YT, Hu R, Ryali C, Ma T, Khedr H, Rädle R, Rolland C, Gustafson L, Mintun E, Pan J, Alwala KV, Carion N, Wu CY, Girshick R, Dollár P, Feichtenhofer C. SAM 2: Segment Anything in Images and Videos. arXiv:2408.00714. Published online August 2, 2024.
  18. Kirillov A, Mintun E, Ravi N, Mao H, Rolland C, Gustafson L, Xiao T, Whitehead S, Berg AC, Lo WY, Dollár P, Girshick R. Segment Anything. 2023 IEEE/CVF International Conference on Computer Vision (ICCV); 2023:3992-4003.
  19. Lou A, Li Y, Zhang Y, Labadie RF, Noble J. Zero-shot surgical tool segmentation in monocular video using Segment Anything Model 2. Proc SPIE 13406, Medical Imaging 2025: Image Processing; 2025;13406:134062V.
  20. Zhang J, Tang H. SAM2 for Image and Video Segmentation: A Comprehensive Survey. arXiv:2503.12781. Published online March 28, 2025.
  21. Fujita T, Yokoyama K, Kawai T, Uehara N, Yamashita T, Kamakura T, Matsumoto Y, Mizutari K, Kakigi A, Nibu K, Suzuki H, Nishikawa A. A Robot for Transcanal Endoscopic Ear Surgery with Gimbal-based Rotational Linkage and Linear Guide Rail Mechanisms. Advanced Biomedical Engineering 2025;14:146-54.
  22. Liu H, Zhang E, Wu J, Hong M, Jin Y. Surgical SAM 2: Real-time Segment Anything in Surgical Video by Efficient Frame Pruning. arXiv:2408.07931. Published online August 14, 2024.
Cite this article as: Ueno R, Fujita T, Matsui K, Atsuumi K, Uehara N, Yamashita T, Hirai H, Kawai T, Suzuki H, Nishikawa A. Surgical instrument segmentation and classification in transcanal endoscopic ear surgery video using Segment Anything Model 2. Quant Imaging Med Surg 2025;15(12):12386-12397. doi: 10.21037/qims-2025-1469
