Original Article

Detecting keypoints with semantic labels on skull point cloud for plastic surgery

Shenghui Liao1#, Qiuyang Chen1#, Peishan Dai1, Xiaohui Qiu2, Chi Zhong2, Fuchang Han3, Ziyang Hu1, Xiaoyan Kui1

1School of Computer Science and Engineering, Central South University, Changsha, China; 2Department of Plastic and Reconstructive Surgery, The Third Xiangya Hospital of Central South University, Changsha, China; 3School of Advanced Interdisciplinary Studies, Hunan University of Technology and Business, Changsha, China

Contributions: (I) Conception and design: Q Chen, S Liao, Z Hu; (II) Administrative support: S Liao, X Kui, P Dai; (III) Provision of study materials or patients: X Qiu, C Zhong; (IV) Collection and assembly of data: X Qiu, C Zhong, Q Chen, F Han; (V) Data analysis and interpretation: Q Chen, F Han, Z Hu; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work.

Correspondence to: Peishan Dai, PhD. School of Computer Science and Engineering, Central South University, 932 South Lushan Road, Yuelu District, Changsha 410083, China. Email: daipeishan@csu.edu.cn.

Background: Using deep learning models to automatically generate reference keypoints with surgical semantic labels and segment bone blocks in plastic surgery provides valuable preoperative planning references. This study aimed to develop a robust and precise keypoint detection framework for dense three-dimensional (3D) skull point clouds to assist in plastic surgery.

Methods: A keypoint descriptor-detector framework was proposed to address keypoint detection in dense point cloud models for facial plastic surgery. The keypoint descriptor identified potential keypoint areas on the point cloud model using the PointRes2Net module to initialize keypoints, which were further optimized by a keypoint detector constructing a self-organized map. Based on the detected keypoints, a new localized small-part segmentation strategy for dense point cloud models was introduced. A bounding box was generated by the detected keypoints, enclosing small bone blocks as the region of interest (ROI) for segmentation.

Results: The mean squared error (MSE) between the keypoints detected on the point cloud using this framework and the ground truth has been reduced to 3.35 mm on the skull models with an average size of 231 mm × 173 mm × 151 mm, outperforming existing point cloud keypoint detection algorithms without requiring additional keypoint annotation on two-dimensional (2D) images for auxiliary training. Furthermore, the proposed framework’s segmentation strategy demonstrated a 22.69% improvement in average precision compared to direct segmentation, with a 34.15% improvement in precision for smaller parts.

Conclusions: The proposed method accurately detects keypoints with surgical semantic labels on dense medical point clouds. Both keypoint detection and segmentation results align closely with the ground truth, providing valuable references for preoperative planning in plastic surgery.

Keywords: Plastic surgery preoperative planning; medical image processing; supervised keypoint detection; point cloud part segmentation


Submitted Jul 03, 2024. Accepted for publication Feb 10, 2025. Published online Mar 24, 2025.

doi: 10.21037/qims-24-1358


Introduction

Before performing plastic surgery, surgeons commonly need to conduct preoperative planning on three-dimensional (3D) mesh models using computer simulation software. Most preoperative planning involves selecting relevant keypoints as references, which typically requires adherence to specific metric rules. However, manual annotation of these keypoints can be time-consuming and labor-intensive. Automatically detecting these keypoints with deep learning models could significantly alleviate the burden on surgeons. Therefore, we introduced deep learning models to predict these keypoints and assist surgeons in decision-making. Currently, deep learning models have gradually been applied in various areas of plastic surgery, such as rhinoplasty (1), blepharoplasty (2), and facial rejuvenation surgery (3), playing a critical role in preoperative planning, intraoperative navigation, and postoperative evaluation (4,5). Nonetheless, the automatic detection of specific anatomical keypoints on 3D skull models requires a specialized deep learning framework to meet clinical application needs. 3D medical models can be reconstructed from computed tomography (CT). However, conventional 3D medical models cannot be processed using standard convolution operations like two-dimensional (2D) images due to inconsistencies in their topological structure. Qi et al. (6,7) proposed using a ball query strategy and introducing max pooling as a symmetric function to aggregate local features of point clouds—a 3D data format—by operating on the features of points in the neighborhood of node sets. Building on this concept, an increasing number of studies have adopted point clouds as the data format for deep learning tasks on 3D models. Point clouds are now widely used in object detection, segmentation, reconstruction (8), and registration (9,10) tasks due to their ease of feature extraction and ability to be downsampled for deep learning models. Therefore, we converted the mesh models reconstructed from CT data into point clouds for further research.
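To make the ball-query-plus-max-pooling idea concrete, the following is a minimal NumPy sketch of this style of local feature aggregation; the radius, neighborhood size, and feature dimensions are illustrative assumptions rather than the configuration used by PointNet++ or by our network.

```python
import numpy as np

def ball_query(points, centers, radius, max_samples):
    """For each center, return indices of up to max_samples points within radius."""
    groups = []
    for c in centers:
        dists = np.linalg.norm(points - c, axis=1)
        idx = np.where(dists < radius)[0][:max_samples]
        if idx.size == 0:                       # fall back to the single nearest point
            idx = np.array([np.argmin(dists)])
        groups.append(idx)
    return groups

def aggregate_local_features(points, features, centers, radius=0.1, max_samples=32):
    """Max-pool per-point features inside each ball neighborhood (a symmetric,
    order-invariant aggregation, as in PointNet-style networks)."""
    pooled = []
    for c, idx in zip(centers, ball_query(points, centers, radius, max_samples)):
        local = np.concatenate([points[idx] - c, features[idx]], axis=1)  # relative coords + features
        pooled.append(local.max(axis=0))        # max pooling as the symmetric function
    return np.stack(pooled)

# toy usage: 1,000 points with 8-dimensional features, 16 sampled centers
pts = np.random.rand(1000, 3)
feats = np.random.rand(1000, 8)
ctrs = pts[np.random.choice(len(pts), 16, replace=False)]
print(aggregate_local_features(pts, feats, ctrs).shape)   # (16, 11)
```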

Keypoints are crucial for capturing the shape information and the most representative characteristics of objects (11,12). Descriptors and detectors play pivotal roles in selecting keypoints for objects and scenes in computer vision. Detectors are primarily used for the direct extraction of matching keypoints, while descriptors are more commonly employed in applications such as registration (13) or scene recognition. 3D hand-crafted keypoint detectors include 3D Harris (14), 3D scale-invariant feature transform (SIFT) (15), 3D speeded up robust feature (SURF) (16), heat kernel signature (HKS) (17), and intrinsic shape signature (ISS) (18) with non-maximum suppression (19). These methods focus on the features of point clouds like curvatures (20,21) and shape information. As discussed earlier, the introduction of PointNet (6) and PointNet++ (7) proposed by Qi et al. paved the way for deep learning models for point clouds, such as relation-shape convolutional neural network (RSCNN) (22), dynamic graph convolutional neural network (DGCNN) (23), point relation-aware network (PRA-Net) (24), and transformers for point cloud (25,26). These networks have achieved excellent performance on public datasets, including ShapeNet (27) and ModelNet (28). Despite the effectiveness of deep neural networks in point cloud processing tasks, two main challenges prevent the adoption of supervised keypoint detection on skull point clouds:

  • Skull point cloud datasets typically have relatively low sample sizes but contain an increased number of features compared to public datasets. This necessitates specialized approaches, as conventional methods may fail to fully capture the detailed structures of skull point clouds.
  • Due to the lack of ground truth for supervision, research on point cloud keypoint detectors has primarily focused on self-supervised and unsupervised tasks. Typical unsupervised methods include several notable approaches. D3Feat (29) predicts points’ detection scores and description features for keypoint selection. Unsupervised stable interest point detection (USIP) (30) introduces a point-to-node feature proposal network that clusters points on nodes and employs a probabilistic chamfer loss to ensure the repeatability of keypoints. Spatial keypoints net (SK-Net) (31) uses the feature representation of the entire point cloud, learned from other deep learning tasks, to optimize the generation of keypoints. Meanwhile, Unsupervised Key Point GANeration (UKPGAN) (32) localizes keypoints through a salient information distillation process, while using a generative adversarial network (GAN) loss to control the sparsity of keypoints. The shape-aware neural 3D keypoint field (SNAKE) (33) projects the point cloud into a high-dimensional feature space, calculating implicit shape indicators and keypoint saliency. In contrast, typical supervised point cloud keypoint detectors include SyncspecCNN (34) and deep functional dictionaries (35). SyncspecCNN utilizes a spectral convolutional neural network (CNN) with shared weights as its backbone to effectively fit vertex functions in point clouds, leveraging the spectral domain to enhance feature learning. On the other hand, deep functional dictionaries produce dictionaries that encode semantic functions for both shapes and points, focusing on capturing semantic relationships and characteristics to aid in more accurate keypoint detection and interpretation. However, the keypoints detected by the methods above often lack semantic labels that correspond directly to ground truth. SC3K (36) proposes a self-supervised learning method that calculates the “probability” of each point in a point cloud being a certain keypoint and performs weighted calculations to obtain the final coordinates of the keypoints with semantic labels. Nevertheless, this method does not utilize ground truth for supervision, and the detected keypoints are predominantly located on the convex hull of the point cloud, mainly reflecting the object’s shape information.

Our task is to detect keypoints in point clouds with specific anatomical meanings, which serve as semantic labels to assist plastic surgeons in preoperative planning, intraoperative navigation, and postoperative evaluation. This ultimately improves the efficiency, precision, and consistency of surgical procedures. As mentioned earlier, few supervised methods exist for keypoint detection on point clouds, and even fewer are capable of detecting keypoints on dense medical point cloud models with semantic labels that correspond one-to-one with ground truth. To address this gap, we propose a supervised keypoint descriptor-detector framework designed to precisely detect keypoints with specific anatomical meanings on complex 3D point clouds. Building on the detected keypoints, we further introduce a novel strategy inspired by CenterNet (37) for small-part segmentation in high-vertex-density medical point cloud models.


Methods

Overview of this study

This study was approved by the Ethics Committee of The Third Xiangya Hospital of Central South University (fast-track No. 23687, 10/13/2023), and was conducted in accordance with the Declaration of Helsinki (as revised in 2013). Informed consent was taken from all the patients. Our task was to precisely detect keypoints with semantic labels on 3D skull point cloud models. These 3D skull point cloud models were extracted from 3D skull mesh models, which were reconstructed from CT images. The detected keypoints were used to generate a bounding box for the region of interest (ROI), enclosing the parts to be segmented. Within the ROI, we achieved accurate segmentation of small parts in high-vertex-density medical point cloud models.

Challenges and solutions

In facial plastic surgery, doctors typically need to manually mark keypoints on a 3D model to plan surgeries. However, existing 3D keypoint detection methods face challenges in identifying these keypoints with surgical semantic labels in high-vertex-density skull point cloud models. To address this challenge, we proposed a keypoint descriptor-detector framework. This framework first employed a keypoint descriptor to identify potential regions for keypoints, followed by selecting initial keypoints from these regions. The initial keypoints were then processed through a keypoint detector to construct a self-organized map, further optimizing the positional accuracy of the keypoints.

Building on the detected keypoints, we also proposed a “localized segmentation” strategy for small-part segmentation on complex skull point clouds. While numerous deep learning algorithms exist for part segmentation of point clouds, directly applying existing 3D part segmentation networks to skull point cloud data is unsuitable due to the dataset’s high vertex density, with millions of points per sample. Typically, dense point clouds must be downsampled before part segmentation. However, downsampling to a size manageable for segmentation networks often results in significant geometric information loss and destruction of the topological structure, making the point cloud sparser. This issue is particularly evident in the removed-out bone block on the zygomatic region, which contains fewer than 1×10⁴ points—substantially fewer than the entire skull. After downsampling, the remaining points in the bone block may be reduced to only a few hundred, leading to a defective topological structure. To address this issue, we proposed a strategy involving both localization and segmentation of bone blocks. First, a bounding box was generated using the keypoints detected by our framework to locate the ROI containing the bone blocks to be segmented. Then, part segmentation was performed within the ROI.

Keypoint descriptor-detector framework

We proposed a supervised point cloud keypoint descriptor-detector framework, which includes a regression-based descriptor for generating “keypoint-related saliency maps” and a coordinate regression-based detector, as shown in Figure 1. Figure 1A and Figure 1B depict the procedures of the keypoint descriptor and the keypoint detector, respectively. As shown in Figure 1A, we initialized the keypoints $\{kp_i\}$ by randomly selecting points from the keypoint-related saliency maps produced by the descriptor. The skull point cloud, denoted as $P = \{p_1, \ldots, p_N\} \in \mathbb{R}^{3 \times N}$ and labeled with $kp_i$, was then fed into the keypoint detector network, as shown in Figure 1B. The keypoint descriptor-detector framework localized the keypoints more precisely than a single detector, with more details discussed later. Since an original point cloud contains millions of points, we reduced its size through downsampling. Specifically, we employed farthest point sampling (FPS), which preserves geometric information while ensuring a more even distribution of points. We tested point clouds of various scales in our experiments, and the best performance was achieved at approximately 6×10⁴ points.
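As a reference for how this downsampling step works, below is a minimal NumPy sketch of greedy FPS; the starting index and the target size are illustrative.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, seed=0):
    """Greedy FPS: repeatedly pick the point farthest from the already-selected set."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(n_samples, dtype=np.int64)
    selected[0] = rng.integers(n)                      # arbitrary starting point
    min_dist = np.linalg.norm(points - points[selected[0]], axis=1)
    for i in range(1, n_samples):
        selected[i] = np.argmax(min_dist)              # farthest from the current set
        new_dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)      # distance to nearest selected point
    return points[selected]

# e.g., downsample a dense skull cloud to roughly 6x10^4 points:
# sparse = farthest_point_sampling(dense_skull_xyz, 60_000)
```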

Figure 1 Keypoints descriptor-detector framework. (A) The procedure of our keypoints descriptor. K represents the value chosen when gathering points near the keypoints using the KNN algorithm. (B) The procedure of our keypoints detector. KNN, K-nearest neighbor; MLP, multi-layer perceptron; ReLU, rectified linear unit.

Keypoint descriptor

Accurate keypoint localization relies heavily on the proper initialization of keypoints. Conventional methods often initialize keypoints using FPS. However, when applying FPS to over 6×10⁴ points to select only 10 keypoints, the initial keypoint distribution may deviate significantly from the ground truth. This inefficiency hampers training, as it becomes challenging for neural networks to locate keypoints across the entire skull point cloud. To address this issue, we introduced our keypoint descriptor to initialize keypoints, effectively constraining the prediction interval of keypoints to a much smaller range. Most existing descriptors emphasize geometric information, which is insufficient for locating the specific keypoints required in our task. Inspired by heatmap regression and keypoint saliency estimation (38), our keypoint descriptor identifies specific regions in the point cloud as ‘keypoint-related saliency regions’, which correspond one-to-one with keypoints. These regions, defined as $R_i = \{(\mathrm{point}_{i1}, i), \ldots, (\mathrm{point}_{ik}, i) \mid TP_i, 1 \le i \le 10\}$, gather the points near the ground-truth keypoints $\{TP_i\}$ using K-nearest neighbor (KNN) search and serve as the ground truth for the descriptor. The resulting point cloud containing ‘keypoint-related saliency regions’ is referred to as the ‘keypoint-related saliency map’ M, which is then used to initialize the keypoints. The basic structure of our keypoint descriptor is shown in Figure 1A. To capture local features and fine geometric details of high-vertex-density point clouds at multiple scales, we selected PointNet++ as the backbone of our network, leveraging its “multi-scale grouping” (MSG) operation. We further optimized the MSG operation based on the Res2Net (39) architecture, which we refer to as the ‘PointRes2Net’ module, as shown in Figure 1A. The structure of our ‘PointRes2Net’ module is depicted in Figure 2.
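For illustration, a minimal sketch of constructing these KNN-based keypoint-related saliency region labels is shown below; the label convention (0 for background, i+1 for the region of keypoint i) is an assumption made for the example, not a requirement of the method.

```python
import numpy as np
from scipy.spatial import cKDTree

def saliency_map_labels(points, keypoints, k=60):
    """Label the k nearest neighbors of each ground-truth keypoint as its saliency region.

    Returns per-point integer labels: 0 = background, i + 1 = region of keypoint i.
    (The 0 / i+1 convention is an illustrative assumption.)
    """
    labels = np.zeros(len(points), dtype=np.int64)
    tree = cKDTree(points)
    for i, kp in enumerate(keypoints):
        _, idx = tree.query(kp, k=k)      # indices of the k closest points to keypoint i
        labels[idx] = i + 1
    return labels
```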

Figure 2 The structure of PointRes2Net. MLP, multi-layer perceptron; ReLU, rectified linear unit.

The PointRes2Net module integrated features in the node’s neighborhood across different sampling scales $S_i$, enabling the extraction of rich local features $H_l$ from the high-density neighborhoods of the nodes. The local features $H_l$ are defined as:

$H_1 = \mathrm{maxpool}\left(\mathrm{PointRes2Net}\left(\mathrm{concat}\left(\mathrm{point}_s, \mathrm{point}_{N_1}\right)\right)\right), \quad \mathrm{point}_s \in P$

$H_2 = \mathrm{maxpool}\left(\mathrm{PointRes2Net}\left(\mathrm{concat}\left(\mathrm{point}_s, \mathrm{point}_{N_1}\right)\right)\right), \quad \mathrm{point}_s \in N_1$

$H_3 = \mathrm{maxpool}\left(\mathrm{PointNet}\left(\mathrm{concat}\left(\mathrm{point}_s, H_2\right)\right)\right), \quad \mathrm{point}_s \in N_2$

Where PointNet represents the combination of MLP, batch-normalization, and rectified linear unit (ReLU).
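Since Figure 2 is not reproduced in the text, the PyTorch sketch below only illustrates the Res2Net idea that PointRes2Net builds on—splitting the channel dimension of grouped point features into groups that are refined sequentially and fused hierarchically. The channel sizes, number of scales, and layer composition are assumptions for illustration, not the module’s actual design.

```python
import torch
import torch.nn as nn

class PointRes2NetBlock(nn.Module):
    """Res2Net-style block for grouped point features of shape (B, C, K, N): the
    channel dimension is split into `scales` groups, each refined by a small conv
    and fused with the output of the previous group, giving multi-scale receptive
    fields. Channel sizes and the number of scales are illustrative assumptions."""

    def __init__(self, channels=64, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(width, width, 1), nn.BatchNorm2d(width), nn.ReLU())
            for _ in range(scales - 1)
        ])

    def forward(self, x):                      # x: (B, C, K, N) grouped neighborhood features
        splits = torch.chunk(x, self.scales, dim=1)
        out = [splits[0]]                      # first split passes through unchanged
        prev = None
        for conv, s in zip(self.convs, splits[1:]):
            prev = conv(s) if prev is None else conv(s + prev)   # hierarchical fusion
            out.append(prev)
        return torch.cat(out, dim=1)           # same shape as the input

# usage: features of 32 neighbors around 512 nodes, 64 channels, batch of 2
# y = PointRes2NetBlock()(torch.randn(2, 64, 32, 512))   # -> (2, 64, 32, 512)
```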

In layer l, the feature vector was scattered back to the points in layer l−1 through inverse interpolation, based on a KNN search as described in PointNet++ (7). Specifically, the features from the k nearest nodes in layer l were scattered back to the corresponding node in layer l−1, weighted by their Euclidean distance. Subsequently, each point’s feature vector was updated using a PointNet module. Finally, the descriptor generated the predicted keypoint-related saliency map M:

$dmat^{l} = \frac{1}{\mathrm{dist}_k\left(\mathrm{point}_s^{l-1}, \mathrm{point}_s^{l}\right)}$

$H_2 = \mathrm{PointNet}\left(\mathrm{concat}\left(H_2, \left(\mathrm{point}_s^{l-1}, \mathrm{point}_s^{l}\right) \odot dmat^{3}\right)\right)$

$H_1 = \mathrm{PointNet}\left(\mathrm{concat}\left(H_1, \frac{dmat^{2} \odot H_2}{\sum dmat^{2}}\right)\right)$

$\mathrm{Pred} = \mathrm{LogSoftmax}\left(\mathrm{PointNet}\left(H_1\right)\right)$

$\odot$ represents the Hadamard product, i.e., element-wise multiplication.
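This inverse-distance interpolation follows the feature propagation scheme of PointNet++ (7); a minimal NumPy/SciPy sketch is given below, where k=3 and the per-point weight normalization are the standard PointNet++ choices and are assumed here.

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_features(points_prev, points_cur, feats_cur, k=3, eps=1e-8):
    """Interpolate features from the sparser layer (points_cur) back to the denser
    layer (points_prev), weighting each of the k nearest nodes by inverse distance."""
    tree = cKDTree(points_cur)
    dist, idx = tree.query(points_prev, k=k)             # (N_prev, k)
    w = 1.0 / (dist + eps)                                # the "dmat" inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)                  # normalize weights per point
    return (feats_cur[idx] * w[..., None]).sum(axis=1)    # (N_prev, C) interpolated features
```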

When using KNN to search the points in keypoints’ neighborhood, the smaller the k value we select, the smaller the region we search for initial keypoints, and the closer initial keypoints are to the ground truth. However, when k falls below a certain threshold, blank regions may appear in the keypoint-related saliency maps. This occurs because the input point cloud contains over 10,000 points, and without sufficient points in keypoint-related regions, the network may fail to focus its attention effectively. Based on our experiments, the threshold for k was determined to be 60. We generated maps at three levels (k=60, 80, 100), as shown in Figure 1A, to compare the distances between initial keypoints and ground truth keypoints. Additionally, we evaluated the mean intersection over union (mIoU) of maps under different k scales against ground truth. For this purpose, we used negative log-likelihood (NLL) loss LN, commonly applied in segmentation tasks:

$L_N = -\frac{1}{n}\sum_{i=1}^{n} \ln \mathrm{pred}\left(i, \mathrm{target}_i\right)$

where n represents the number of points in the input point cloud; target_i denotes the class that the current point i belongs to; and pred(i, target_i) refers to the predicted probability that the current point belongs to class target_i.
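In a PyTorch implementation, this pairing of a LogSoftmax output with the NLL loss can be written as in the sketch below; the tensor shapes and the class count (10 keypoint-related regions plus a background class) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# pred: per-point log-probabilities from a LogSoftmax head, shape (B, n_classes, N)
# target: per-point region labels (0 = background, 1..10 = keypoint-related regions), shape (B, N)
pred = torch.randn(2, 11, 4096).log_softmax(dim=1)
target = torch.randint(0, 11, (2, 4096))

# F.nll_loss averages -log p(point i belongs to target_i) over all points
loss = F.nll_loss(pred, target)
```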

Keypoint detector

The keypoint detector often requires an initialization. In our framework, the initial keypoints $\{kp_i\}$ for the keypoint detector were randomly selected from the keypoint-related saliency maps M generated by our descriptor. Compared to FPS, the initialized keypoints were significantly closer to the ground truth, which facilitated subsequent training. To further augment the data, we introduced transformation matrices for the training point clouds. These transformation matrices $\{T_1, \ldots, T_k\} \subset SE(3)$ were randomly generated and applied to both the point cloud P and the corresponding initial keypoints $kp_i$, obtaining the transformed point cloud and keypoints. For each keypoint $kp_i$, we gathered its neighborhood using a point-to-node grouping method, where each keypoint served as a node. The point-to-node grouping identified the closest node for each point in the point cloud, and normalized the point by subtracting the coordinates of its corresponding closest node. This approach differs from KNN and ball query, which consider only points within a specific region.
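A minimal NumPy sketch of these two steps—random rigid-transform augmentation and point-to-node grouping with node-centered normalization—is given below; the rotation sampling method and the translation range are illustrative assumptions.

```python
import numpy as np

def random_se3():
    """Sample a random rigid transform (rotation + translation) via QR decomposition."""
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    if np.linalg.det(q) < 0:            # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    t = np.random.uniform(-0.1, 0.1, size=3)
    return q, t

def point_to_node_grouping(points, nodes):
    """Assign every point to its closest node and normalize it by that node's coordinates."""
    d = np.linalg.norm(points[:, None, :] - nodes[None, :, :], axis=2)  # (N, num_nodes)
    assign = d.argmin(axis=1)
    groups = [points[assign == i] - nodes[i] for i in range(len(nodes))]
    return groups   # one (n_i, 3) array of node-centered points per keypoint node

# augmented copy of a cloud P and its initial keypoints kp:
# R, t = random_se3();  P_aug, kp_aug = P @ R.T + t, kp @ R.T + t
```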

The point-to-node grouped point sets were then fed into a PointNet module to generate local feature vectors $\{V_1|kp_1, \ldots, V_{10}|kp_{10}\}$ corresponding to the nodes:

$V_i = \mathrm{PointNet}\left(p_{k_i} - kp_i\right)$

To achieve hierarchical information aggregation, the node-based local features were again gathered by a KNN fusion module. The node-vector pairs in the neighborhood of node $kp_i$ were denoted as $\{(V_1|kp_{j_1})|kp_i, \ldots, (V_k|kp_{j_k})|kp_i\}, \; 1 \le (j_1, \ldots, j_k) \le 10$. Nodes were normalized by subtracting the coordinates of their corresponding closest node $kp_i$, and then concatenated with the local features. The concatenated local features within a certain region were aggregated into global features G, using max pooling as a symmetric function:

$G = \mathrm{maxpool}\left(\mathrm{PointNet}\left(\mathrm{concat}\left(V_i, kp_{j_k} - kp_i\right)\right)\right)$

At this stage, the feature vectors had already aggregated rich hierarchical information from multiple layers. To further enhance the understanding and capture of both local and global features, we introduced a self-attention module in place of a simple MLP, as shown in Figure 1B. The detector ultimately generated the coordinates of 10 predicted keypoints $\{kp_i\}$ along with their corresponding saliency uncertainties $\{\sigma_i\}$. These outputs are represented as $\{KP = [kp_1, \ldots, kp_{10}], \; \Sigma = [\sigma_1, \ldots, \sigma_{10}]\}$:

$\mathrm{Atten} = \mathrm{softmax}\left(\mathrm{MLP}_1(G) \cdot \mathrm{MLP}_1(G)^{T}\right)$

$KP, \Sigma = \mathrm{MLP}\left(\mathrm{ReLU}\left(\mathrm{normalization}\left(\mathrm{MLP}_2(G) \cdot \mathrm{Atten} + \mathrm{Atten}\right)\right)\right)$
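The PyTorch sketch below shows a self-attention readout of this general form, regressing keypoint coordinates together with positive saliency uncertainties. The layer widths, the exact composition around Atten, and the use of an exponential to keep σ positive are assumptions made for illustration, not the exact head used in our detector.

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """Illustrative self-attention readout over node-level global features G that
    regresses 10 keypoint coordinates plus their saliency uncertainties sigma."""

    def __init__(self, feat_dim=256, n_kp=10):
        super().__init__()
        self.q = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, feat_dim)
        self.norm = nn.LayerNorm(feat_dim)
        self.head = nn.Linear(feat_dim, 4)          # (x, y, z, log_sigma) per node

    def forward(self, g):                           # g: (B, n_kp, feat_dim)
        q = self.q(g)
        atten = torch.softmax(q @ q.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        h = torch.relu(self.norm(atten @ self.v(g) + g))     # attention plus residual
        out = self.head(h)                          # (B, n_kp, 4)
        kp, sigma = out[..., :3], torch.exp(out[..., 3])     # exp keeps sigma > 0 (assumption)
        return kp, sigma
```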

Inspired by USIP (30), we incorporated the saliency uncertainties of points into the loss function calculation. Saliency uncertainty is a characteristic derived from projecting points into a high-dimensional space. By setting the optimization function accordingly, the Euclidean distance $d_i$ between the predicted keypoint $kp_i$ and the ground truth $TP_i$ can be made to follow an exponential distribution parameterized by its saliency uncertainty $\sigma_i$, as follows:

$p\left(d_i \mid \sigma_i\right) = \frac{1}{\sigma_i} e^{-\frac{d_i}{\sigma_i}}, \quad \sigma_i > 0, \quad 1 \le i \le 10$

$d_i$ represents the Euclidean distance between the predicted keypoint $kp_i$ and the ground truth $TP_i$. It can be seen that the shorter the distance between the predicted keypoint $kp_i$ and the ground truth, the higher the probability that the keypoint matches the ground truth. As each pair $(kp_i, TP_i)$ is independently distributed, the joint distribution is defined as:

$p\left(D \mid \Sigma\right) = \prod_{i=1}^{10} p\left(d_i \mid \sigma_i\right)$

According to the two equations above, the probabilistic distance loss $L_p$ is given by the NLL of the joint distribution:

$L_p = -\sum_{i=1}^{10} \ln p\left(d_i \mid \sigma_i\right) = \sum_{i=1}^{10}\left(\ln \sigma_i + \frac{d_i}{\sigma_i}\right)$

This loss function also incorporates saliency uncertainty information projected into high-dimensional space, guiding the predicted keypoints to exhibit features that are closer to the ground truth in high-dimensional space. To ensure that the predicted keypoints are sufficiently close to the ground truth in 3D space, we additionally introduced the Euclidean distance loss as a corrective term. Therefore, the final loss function L was defined as:

$L = L_p + L_d = \sum_{i=1}^{10}\left(\ln \sigma_i + \frac{d_i}{\sigma_i}\right) + \sum_{i=1}^{10} d_i = \sum_{i=1}^{10}\left(\ln \sigma_i + d_i\left(1 + \frac{1}{\sigma_i}\right)\right)$
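A direct PyTorch sketch of this combined loss is shown below; the small epsilon added for numerical stability is an implementation assumption.

```python
import torch

def keypoint_loss(pred_kp, gt_kp, sigma, eps=1e-6):
    """Combined loss from the text: the probabilistic term ln(sigma) + d/sigma plus
    the Euclidean correction term d, i.e. ln(sigma) + d * (1 + 1/sigma)."""
    d = torch.norm(pred_kp - gt_kp, dim=-1)                  # (B, 10) Euclidean distances
    per_kp = torch.log(sigma + eps) + d * (1.0 + 1.0 / (sigma + eps))
    return per_kp.sum(dim=-1).mean()                         # sum over keypoints, mean over batch
```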

The localized segmentation strategy

The basic procedure of our localized segmentation method for skull point cloud is illustrated in Figure 3. It consists of keypoint descriptor-detector framework (Figure 3A) which generates the bounding box of ROI, and segmentation network (Figure 3B).

Figure 3 The basic procedure for skull point cloud part segmentation method. (A) Our keypoints descriptor-detector framework. (B) The segmentation network. The segmented removed-out bone block is labeled in blue, and the pushed-in bone block is labeled in red. ROI, region of interest.

As shown in Figure 3A, the process begins with the keypoint descriptor generating keypoint-related saliency maps for the skull point cloud, from which the initial keypoints were selected. The keypoint detector then optimized the positions of these keypoints. Using the predicted keypoints, a bounding box was generated to encompass the bone blocks required for the segmentation task, and the ROI was extracted from the skull point cloud. The point cloud of the ROI was fed into the segmentation network to complete the task.
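For illustration, the sketch below crops an ROI using an axis-aligned box spanned by the detected keypoints plus a margin; the margin value and the axis-aligned simplification are assumptions (the Discussion describes a PCA-based minimum oriented bounding box, sketched there).

```python
import numpy as np

def extract_roi(points, keypoints, margin=0.05):
    """Keep the points inside the axis-aligned box spanned by the predicted
    keypoints, enlarged by a margin (both are illustrative choices)."""
    lo = keypoints.min(axis=0) - margin
    hi = keypoints.max(axis=0) + margin
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask], mask   # the mask lets segmentation labels map back to the full cloud
```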

Our segmentation network shares the same structure as the keypoint descriptor, incorporating the PointRes2Net module, as shown in Figure 1A. To evaluate its performance on our skull point cloud dataset, we compared our segmentation network with mainstream part segmentation networks, including PointNet, PointNet++, RSCNN, DGCNN, and PointNext (40).


Results

Dataset

Our method was tested on a manually annotated skull point cloud dataset from The Third Xiangya Hospital of Central South University, which contains 198 samples, each with over three million points. Each sample contains two bone blocks annotated by doctors from the Department of Plastic and Reconstructive Surgery at The Third Xiangya Hospital of Central South University, designated to be either removed or pushed in on the zygomatic bones, along with 10 keypoints marked on the cheeks on both sides, as shown in Figure 4A: (I) point 0 is the lower point on the left orbit, located at the intersection of the outer edge and lower edge of the eye socket; (II) point 1 is the prominent point below the temporal bone; (III) point 2 is the point below the root of the zygomatic arch; (IV) point 3 is the intersection point between the sagittal plane, determined by the lower point of the orbit and the foramen magnum of the occipital bone, and the most prominent position at the bottom of the zygomatic bone; and (V) point 4 is the inflection point between the zygomatic bone and the eye socket. Similarly, points 5 to 9 are the symmetrical counterparts of points 0 to 4, respectively.

Figure 4 Illustration of keypoints and segmented bone blocks in the zygomatic bone. (A) The keypoints manually labeled on skull mesh model. 0 represents the lower point on the orbit; 1 represents the prominent point below the temporal bone; 2 represents the point below the root of the zygomatic arch; 3 represents the intersection point between the sagittal plane; 4 represents the inflection point between the zygomatic bone and the eye socket; 5 to 9 are the symmetrical counterparts of 0 to 4, respectively. (B) Bone blocks manually labeled on skull mesh model: (I) pushed-in bone block in different views; and (II) removed-out bone block in different views.

We utilized keypoints to locate the zygomatic region on the skull for segmentation. The target bone blocks on zygomatic bone that required segmentation are shown in Figure 4B. As evident from the figure, the bone blocks to be segmented are relatively small compared to the entire skull. To address this, we generated a bounding box of the ROI using the keypoints and achieved accurate part segmentation within the ROI.

Experiment setting

The experiment dataset was divided into a training set and a testing set at a ratio of 8:2. The IoU between the output segmented bone blocks and the ground truth was used to evaluate the segmentation performance. Additionally, the Euclidean distance between the predicted keypoints and the ground truth was measured to assess the reliability of the ROI.

Keypoints prediction

The regression of keypoints begins with the regression of keypoint-related saliency maps. To save computing resources and improve efficiency, we downsampled the point cloud to 6×10⁴ points for the regression of keypoint-related saliency maps. Keypoint-related maps were generated at different k-scales, with each map expected to contain 10 keypoint-related regions. However, our experiments showed that when k is below 60, some keypoint-related regions can become blank, resulting in a lack of initial keypoints. Thus, we set 60 as the threshold of k. The regression of a keypoint-related map on a point cloud is analogous to the regression of a heatmap on a 2D image, essentially performing a point-level classification similar to segmentation. The bone blocks requiring segmentation contain only 10²–10³ points, while the entire point cloud consists of 6×10⁴ points. As a result, even under optimal conditions, the mIoU will not be exceedingly high. However, this does not affect the initialization of keypoints, as long as the predicted keypoint-related regions have a sufficient intersection with the corresponding regions in the ground truth of keypoint-related saliency maps. The regression results for keypoint-related saliency maps at different k-scales are shown in Table 1. The results demonstrate that the initial keypoints generated by our method are significantly closer to the ground truth compared to those generated by FPS, with the best performance achieved when k=60. Additionally, we visualized the keypoint-related saliency maps, initial keypoints, and ground truth keypoints, as shown in Figure 5.

Table 1

Evaluation of keypoints’ initialization and comparison with FPS

k Euclidean distance mIoU
60 0.04200 0.61814
80 0.04853 0.66147
100 0.05300 0.66482
FPS 0.88843 NA

FPS, farthest point sampling; mIoU, mean intersection over union; NA, not available.

Figure 5 Visualization of keypoint-related map and initial keypoints. (A,C,E) The predicted keypoint-related maps under different k scales, as purple represents keypoint-related saliency regions. (B,D,F) The positions of initial keypoints and ground truth keypoints.

The initial keypoints were then fed into the keypoint detector. We compared the results of our keypoint descriptor-detector framework with other methods, as shown in Figure 6 and Table 2. In Figure 6, the percentage of correct keypoints (PCK) is defined as the proportion of predicted keypoints whose Euclidean distance to the ground truth keypoints falls below a predefined threshold, serving as a metric to evaluate the keypoint detection accuracy. By assessing PCK at different thresholds, we were able to intuitively compare the performance of various methods, highlighting the robustness and accuracy of the model under varying precision requirements. As there are very few supervised keypoint detection methods for point clouds, we selected an advanced and representative method, deep functional dictionaries (35), for comparison with our method. Additionally, since our detector network was inspired by USIP (30)—which integrates the self-attention module with USIP’s feature proposal network—we included both USIP with a modified loss function and our detector initialized with keypoints by FPS for comparison. The results indicate that our method outperforms other approaches.

Figure 6 Keypoints prediction comparison. Each point on the PCK curve shows the percentage of accurately predicted keypoints under a given Euclidean distance threshold. FPS, farthest point sampling; PCK, percentage of correct keypoints; USIP, unsupervised stable interest point detection.
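The PCK values plotted in Figure 6 (and reported later in Table 3) can be computed as in the short sketch below; the array shapes are illustrative.

```python
import numpy as np

def pck_curve(pred_kp, gt_kp, thresholds):
    """Fraction of predicted keypoints within each distance threshold of their
    ground-truth counterparts; pred_kp and gt_kp have shape (n_samples, n_kp, 3)."""
    d = np.linalg.norm(pred_kp - gt_kp, axis=-1)              # (n_samples, n_kp)
    return [(d < t).mean() for t in thresholds]

# e.g., PCK at the thresholds reported in Table 3 (in mm, on the original meshes):
# pck_curve(pred_mm, gt_mm, thresholds=[2, 4, 6, 8])
```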

Table 2

The comparison of different methods for keypoints prediction

Methods MSE
Deep functional dictionaries 0.0539
USIP 0.1533
FPS + detector 0.1494
Our framework 0.0267

The best performance was achieved by our framework. FPS, farthest point sampling; MSE, mean squared error; USIP, unsupervised stable interest point detection.

We also tested SC3K (36) on our dataset by modifying its loss function. However, the results were unsatisfactory, as the distribution of detected keypoints appeared highly chaotic. This finding underscores the increased complexity and difficulty of handling high-vertex-density medical point clouds compared to public datasets. A visualization of the predicted keypoints is presented in Figure 7, demonstrating that the predicted keypoints align closely with the ground truth. Additionally, we remapped the detected keypoints from the normalized point clouds back onto the original mesh model. To further evaluate the accuracy, we computed the mean squared error (MSE) between each predicted keypoint and its corresponding ground truth on each mesh model. PCK was then calculated at thresholds of 2, 4, 6, and 8 mm, as shown in Table 3. Although there remains a precision gap compared to state-of-the-art 2D cephalometric X-ray landmark detection algorithms, our results provide significant reference value, especially considering the differences between 2D images and 3D space. Furthermore, our approach effectively generates bounding boxes and can be directly applied to 3D models reconstructed from CT data, eliminating the need for acquiring multi-view images or performing additional annotation work.

Figure 7 Visualization of predicted keypoints and ground truth keypoints.

Table 3

The MSE (mm) and PCK (%) between predicted keypoints and ground truth on original skull mesh model

Keypoints MSE (mm) PCK@2 mm (%) PCK@4 mm (%) PCK@6 mm (%) PCK@8 mm (%)
kp1 2.71 33.10 83.10 99.30 100.00
kp2 3.13 22.30 74.60 97.00 100.00
kp3 4.13 14.60 50.80 86.20 96.20
kp4 3.25 13.10 73.10 97.70 100.00
kp5 3.17 21.50 74.60 97.70 100.00
kp6 2.93 26.20 80.80 99.30 100.00
kp7 3.27 16.20 66.20 96.90 100.00
kp8 4.06 13.90 52.30 83.80 97.70
kp9 3.33 18.50 66.20 93.90 100.00
kp10 3.52 23.90 60.00 91.60 99.30
Mean 3.35 20.30 68.20 94.30 99.30

MSE, mean squared error; PCK, percentage of correct keypoints.

Skull point cloud segmentation

We constructed several mainstream part segmentation networks, including PointNet, PointNet++, RSCNN, DGCNN, and PointNext, to compare with our PointRes2Net for segmenting the point cloud of the ROI, which was extracted from the original skull point cloud. For comparison, we also tested these methods on the entire skull point cloud, which was downsampled to 6×10⁴ points due to GPU memory limitations, as well as on its corresponding ROI. The results, presented in Table 4, show that segmenting the ROI achieves significantly higher precision compared to directly segmenting the entire skull. This improvement is largely attributed to the removal of irrelevant or ‘noise’ points in the skull point cloud model, enabling the network to focus more effectively on the points critical for segmentation. Moreover, since the ROI contains far fewer points than the original skull point cloud, segmenting within the ROI greatly reduces computational resource requirements and significantly improves segmentation efficiency. This makes it feasible to segment removed-out and pushed-in bone blocks on a skull point cloud with over three million points using a standard computer. We visualized the segmentation results in Figure 8, which clearly demonstrates that our method accurately segments the bone blocks to be removed and pushed in, showing high consistency with the ground truth. Additionally, we reconstructed the point clouds into mesh models for a more intuitive visualization of the segmentation results, as shown in Figure 9.

Table 4

The comparison of segmentation results in ROI and skull point cloud by different networks

Segmentation networks Segmentation region mIoU (%) Removed-out block IoU (%) Pushed-in block IoU (%)
PointNet++ (MSG) Downsampled skull 64.07 43.44 84.70
ROI from downsampled skull 68.20 50.38 86.01
ROI from original skull 86.53 80.14 92.92
PointNet Downsampled skull 57.26 32.89 81.63
ROI from downsampled skull 68.79 50.84 86.74
ROI from original skull 76.78 66.80 86.76
DGCNN Downsampled skull 64.46 45.48 83.44
ROI from downsampled skull 70.89 54.37 87.42
ROI from original skull 80.63 72.15 89.10
RSCNN Downsampled skull 57.52 33.04 82.00
ROI from downsampled skull 65.60 46.20 85.00
ROI from original skull 67.91 53.41 82.40
PointNext Downsampled skull 62.58 41.50 83.66
ROI from downsampled skull 66.81 48.00 85.70
ROI from original skull 82.12 82.80 81.40
PointRes2Net (ours) Downsampled skull 65.11 48.25 81.97
ROI from downsampled skull 68.82 50.47 87.17
ROI from original skull 87.80 82.40 93.30

The best overall performance was achieved by PointRes2Net (ours) on the ROI extracted from the original skull. DGCNN, dynamic graph convolutional neural network; IoU, intersection over union; mIoU, mean intersection over union; MSG, multi-scale grouping; RSCNN, relation-shape convolutional neural network; ROI, region of interest.

Figure 8 Visualization of segmentation results in different segmentation regions. ROI, region of interest.
Figure 9 Reconstruction of the segmented bone block point clouds. The removed-out bone blocks are labeled in red, and the pushed-in bone blocks are labeled in blue.

In the experiments involving direct segmentation of downsampled skull models, we observed that due to the limited size of the original dataset, which contained only 198 samples, networks like RSCNN and DGCNN suffered from overfitting. This overfitting resulted in a significant gap between the segmentation performance on training samples and test samples. Additionally, it is worth noting that transformer-like networks, such as Point Transformer, are not recommended for complex point cloud part segmentation. These networks require training times that are tens of times longer than PointNet-like networks, making them less practical for this task.

Achieving high-precision results when segmenting dense skull point cloud models with complex topological structures is challenging using simple direct segmentation approaches. Keypoints-driven segmentation on ROI effectively eliminates many noise points in the point cloud, thereby significantly improving segmentation performance.


Discussion

This study demonstrates the detection of keypoints with semantic labels on high-vertex-density skull point cloud models. The proposed method can directly detect keypoints corresponding to ground truth semantic labels on point clouds, without requiring additional annotation of keypoints on 2D images for assistance. We first employed a keypoint descriptor with a novel MSG module, named PointRes2Net, to extract features from high-vertex-density medical point cloud models. The keypoint descriptor generated potential keypoint regions for initializing keypoints, which were then fed into the keypoint detector to construct a self-organized map for optimization. Our framework can precisely detect the keypoints with surgical semantic labels, achieving an MSE of 3.35 mm in 3D space. Based on the detected keypoints, we proposed a new localized segmentation strategy for small-part segmentation in high-vertex-density medical point cloud models. The detected keypoints were used to automatically generate a bounding box of the ROI, which encompasses the parts to be segmented.

To vividly highlight the differences between our method and other supervised point cloud keypoint detection methods, we conducted a visual comparison. This involved visualizing the predicted keypoints and the generated bounding boxes from both our method and deep functional dictionaries, which was selected as a representative of existing methods. Principal component analysis (PCA) combined with the convex hull (41,42) was used to generate the minimum oriented bounding box of the ROI containing the bone blocks to be segmented. The results are shown in Figure 10. It should be noted that the keypoints predicted by deep functional dictionaries do not correspond one-to-one with the ground truth, requiring manual determination of the keypoint order to generate bounding boxes. In contrast, the keypoints predicted by our method correspond directly to the ground truth, enabling the automatic generation of bounding boxes. As shown in Figure 10, the bounding box generated by deep functional dictionaries fails to fully encompass the bone blocks to be segmented, demonstrating the infeasibility of such keypoint detection methods for this specific task.

Figure 10 The generated bounding box of ROI by different keypoints. The red spheres represent ground truth, while the blue spheres represent predicted keypoints. (A) The ground truth. (B) Deep functional dictionaries. (C) Our method. ROI, region of interest.
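A minimal sketch of this PCA-plus-convex-hull construction is given below; whether the box is fitted to the detected keypoints themselves or to the hull of the enclosed region, and the absence of any safety margin, are simplifying assumptions.

```python
import numpy as np
from scipy.spatial import ConvexHull

def pca_oriented_bbox(keypoints):
    """Fit an oriented bounding box: PCA on the convex-hull vertices of the keypoints
    gives the box axes; the extents come from projecting the hull onto those axes."""
    hull_pts = keypoints[ConvexHull(keypoints).vertices]
    center = hull_pts.mean(axis=0)
    # principal axes from the SVD of the centered hull vertices
    _, _, axes = np.linalg.svd(hull_pts - center, full_matrices=False)
    proj = (hull_pts - center) @ axes.T                       # coordinates in the PCA frame
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    corners_pca = np.array([[x, y, z] for x in (lo[0], hi[0])
                                      for y in (lo[1], hi[1])
                                      for z in (lo[2], hi[2])])
    return corners_pca @ axes + center                        # 8 box corners in world coordinates
```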

In the “Results” section, both the keypoint detection and segmentation outcomes demonstrate high accuracy, closely aligning with the ground truth. These results provide valuable references for preoperative planning in plastic surgery. Furthermore, to ensure patient safety during surgery, a minimum distance of 3 mm between the cutting plane and the nearest facial nerve is established as a safety criterion in zygomatic resection procedures. To evaluate our method’s adherence to this criterion, we tested 40 samples and measured the minimum distance between the segmentation plane and the infraorbital nerve. The results showed an average distance of 13.10 mm, with a minimum distance of 6.05 mm, confirming that our approach satisfies the safety requirements.


Conclusions

In this paper, we proposed a keypoint descriptor-detector framework for supervised 3D point cloud keypoint detection, which was further applied to small-part segmentation on high-vertex-density skull point clouds. Within this framework, the keypoint descriptor utilized the PointRes2Net module to generate potential keypoint regions on the point cloud model for initializing keypoints. These initialized keypoints were subsequently optimized by a keypoint detector that constructed a self-organized map. Based on the detected keypoints, we generated a bounding box as the boundary of the ROI. From this ROI, we extracted the point cloud containing the bone blocks for the segmentation task and achieved superior segmentation performance using our proposed network.


Acknowledgments

We are grateful for resources from the High Performance Computing Center of Central South University.


Footnote

Funding: This work was supported by the National Natural Science Foundation of China (Nos. 62372475, 61772556, 82171581, U22A2034, and 62177047).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://qims.amegroups.com/article/view/10.21037/qims-24-1358/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. This study was approved by the Ethics Committee of The Third Xiangya Hospital of Central South University (fast-track No. 23687, 10/13/2023), and was conducted in accordance with the Declaration of Helsinki (as revised in 2013). Informed consent was taken from all the patients.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.


References

  1. Knoedler S, Alfertshofer M, Simon S, Panayi AC, Saadoun R, Palackic A, Falkner F, Hundeshagen G, Kauke-Navarro M, Vollbach FH, Bigdeli AK, Knoedler L. Turn Your Vision into Reality-AI-Powered Pre-operative Outcome Simulation in Rhinoplasty Surgery. Aesthetic Plast Surg 2024;48:4833-8. [Crossref] [PubMed]
  2. Goodyear K, Saffari PS, Esfandiari M, Baugh S, Rootman DB, Karlin JN. Estimating apparent age using artificial intelligence: Quantifying the effect of blepharoplasty. J Plast Reconstr Aesthet Surg 2023;85:336-43. [Crossref] [PubMed]
  3. Elliott ZT, Bheemreddy A, Fiorella M, Martin AM, Christopher V, Krein H, Heffelfinger R. Artificial intelligence for objectively measuring years regained after facial rejuvenation surgery. Am J Otolaryngol 2023;44:103775. [Crossref] [PubMed]
  4. Nogueira R, Eguchi M, Kasmirski J, de Lima BV, Dimatos DC, Lima DL, Glatter R, Tran DL, Piccinini PS. Machine Learning, Deep Learning, Artificial Intelligence and Aesthetic Plastic Surgery: A Qualitative Systematic Review. Aesthetic Plast Surg 2025;49:389-99. [Crossref] [PubMed]
  5. Lin X, Heredia Pérez SA, Harada K. A cranial-feature-based registration scheme for robotic micromanipulation using a microscopic stereo camera system. Advanced Robotics 2024;38:1730-42.
  6. Qi CR, Su H, Mo K, Guibas LJ. Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:652-60.
  7. Qi CR, Yi L, Su H, Guibas LJ. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems 30 (NIPS 2017). 2017.
  8. Peng S, Niemeyer M, Mescheder L, Pollefeys M, Geiger A. Convolutional Occupancy Networks. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III 16. Cham: Springer International Publishing; 2020:523-40.
  9. Gojcic Z, Zhou C, Wegner JD, Wieser A. The perfect match: 3d point cloud matching with smoothed densities. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:5545-54.
  10. Du J, Wang R, Cremers D. Dh3d: Deep hierarchical 3d descriptors for robust large-scale 6dof relocalization. In: Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part IV 16. Springer International Publishing; 2020:744-62.
  11. Geng Z, Sun K, Xiao B, Zhang Z, Wang J. Bottom-up human pose estimation via disentangled keypoint regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021:14676-86.
  12. Zhang Y, Lu Y, Liu B, Zhao Z, Chu Q, Yu N. Evopose: A Recursive Transformer for 3D Human Pose Estimation with Kinematic Structure Priors. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2023:1-5.
  13. Tinchev G, Penate-Sanchez A, Fallon M. SKD: Keypoint detection for point clouds using saliency estimation. IEEE Robot Autom Lett 2021;6:3785-92.
  14. Sipiran I, Bustos B. Harris 3D: a robust extension of the Harris operator for interest point detection on 3D meshes. Vis Comput 2011;27:963-76.
  15. Scovanner P, Ali S, Shah M. A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia. New York: Association for Computing Machinery; 2007:357-60.
  16. Knopp J, Prasad M, Willems G, Timofte R, Van Gool L. Hough transform and 3D SURF for robust three dimensional classification. In: Computer Vision-ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part VI 11. Springer Berlin Heidelberg; 2010:589-602.
  17. Bronstein MM, Kokkinos I. Scale-invariant heat kernel signatures for non-rigid shape recognition. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE; 2010:1704-11.
  18. Zhong Y. Intrinsic shape signatures: A shape descriptor for 3D object recognition. In: 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. IEEE; 2009:689-96.
  19. Neubeck A, Van Gool L. Efficient non-maximum suppression. In: 18th International Conference on Pattern Recognition (ICPR'06). IEEE; 2006:850-5.
  20. Castellani U, Cristani M, Fantoni S, Murino V. Sparse points matching by combining 3D mesh saliency with statistical descriptors. Comput Graph Forum 2008;27:643-52.
  21. Zaharescu A, Boyer E, Varanasi K, Horaud R. Surface feature detection and description with applications to mesh matching. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2009:373-80.
  22. Liu Y, Fan B, Xiang S, Pan C. Relation-shape convolutional neural network for point cloud analysis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019:8895-904.
  23. Wang Y, Sun Y, Liu Z, Sarma SE, Bronstein MM, Solomon JM. Dynamic graph CNN for learning on point clouds. ACM Trans Graph 2019;38:1-12.
  24. Cheng S, Chen X, He X, Liu Z, Bai X. Pra-net: Point relation-aware network for 3D point cloud analysis. IEEE Trans Image Process 2021;30:4436-48. [Crossref] [PubMed]
  25. Zhao H, Jiang L, Jia J, Torr PHS, Koltun V. Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021:16259-68.
  26. Guo MH, Cai JX, Liu ZN, Mu TJ, Martin RR, Hu SM. PCT: Point cloud transformer. Comput Vis Media 2021;7:187-99.
  27. Chang AX, Funkhouser T, Guibas L, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H, Xiao J, Yi L, Yu F. Shapenet: An information-rich 3d model repository. arXiv:1512.03012 [Preprint]. 2015. Available online: https://arxiv.org/abs/1512.03012
  28. Wu Z, Song S, Khosla A, Yu F, Zhang L, Tang X, Xiao J. 3D shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015:1912-20.
  29. Bai X, Luo Z, Zhou L, Fu H, Quan L, Tai CL. D3feat: Joint learning of dense detection and description of 3D local features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:6359-67.
  30. Li J, Lee GH. USIP: Unsupervised stable interest point detection from 3D point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019:361-70.
  31. Wu W, Zhang Y, Wang D, Lei Y. SK-Net: Deep learning on point cloud via end-to-end discovery of spatial keypoints. Proceedings of the AAAI Conference on Artificial Intelligence. 2020;34:6422-9.
  32. You Y, Liu W, Ze Y, Li YL, Wang W, Lu C. UKPGAN: A general self-supervised keypoint detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022:17042-51.
  33. Zhong C, You P, Chen X, Zhao H, Sun F, Zhou G, Mu X, Gan C, Huang W. SNAKE: Shape-aware neural 3D keypoint field. In: Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track. 2022;35:7052-64.
  34. Yi L, Su H, Guo X, Guibas LJ. Syncspeccnn: Synchronized spectral CNN for 3D shape segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2282-90.
  35. Sung M, Su H, Yu R, Guibas LJ. Deep functional dictionaries: Learning consistent semantic structures on 3d models from functions. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018). 2018.
  36. Zohaib M, Del Bue A. Sc3k: Self-supervised and coherent 3d keypoints estimation from rotated, noisy, and decimated point cloud data. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023:22509-19.
  37. Duan K, Bai S, Xie L, Qi H, Huang Q, Tian Q. Centernet: Keypoint triplets for object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019:6569-78.
  38. You Y, Lou Y, Li C, Cheng Z, Li L, Ma L, Lu C, Wang W. Keypointnet: A large-scale 3d keypoint dataset aggregated from numerous human annotations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020:13647-56.
  39. Gao SH, Cheng MM, Zhao K, Zhang XY, Yang MH, Torr P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans Pattern Anal Mach Intell 2021;43:652-62. [Crossref] [PubMed]
  40. Qian G, Li Y, Peng H, Mai J, Hammoud H, Elhoseiny M, Ghanem B. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. In: Advances in Neural Information Processing Systems 35 (NeurIPS 2022) Main Conference Track. 2022;35:23192-204.
  41. Dimitrov D, Knauer C, Kriegel K, Rote G. Bounds on the quality of the PCA bounding boxes. Computational Geometry 2009;42:772-89.
  42. Naujoks B, Wuensche HJ. An orientation corrected bounding box fit based on the convex hull under real time constraints. In: 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE; 2018:1-6.
Cite this article as: Liao S, Chen Q, Dai P, Qiu X, Zhong C, Han F, Hu Z, Kui X. Detecting keypoints with semantic labels on skull point cloud for plastic surgery. Quant Imaging Med Surg 2025;15(4):3501-3516. doi: 10.21037/qims-24-1358
