International scientific journal «ВЕСТНИК НАУКИ» No. 11 (80), Vol. 2, November 2024. UDC 004.8
Perfilev Dmitrii
Master's student, School of Computer and Communication, Lanzhou University of Technology (Lanzhou, China)
OPTIMIZING 3D POINT CLOUD PROCESSING WITH K-MEANS CLUSTERING: VALIDATION ON THE KITTI DATASET
Abstract: This research utilizes K-means clustering to enhance object classification by preprocessing 3D point clouds, which are essential in robotics and autonomous driving. Traditional methods face challenges due to large data volumes and noise in point clouds obtained by LiDAR. By applying K-means clustering as a preprocessing step, the data volume is reduced, making it easier to focus on relevant regions within complex scenarios. Tests conducted on the KITTI dataset show that clustering significantly improves classification accuracy and processing efficiency. The findings underscore the potential of clustering to optimize data handling in large-scale 3D point cloud applications, particularly for autonomous driving systems, where reliable and efficient perception is crucial. This approach demonstrates that cluster-based preprocessing can make 3D point cloud analysis faster and more effective for real-world applications.
Keywords: 3D point cloud, clustering, K-means, LiDAR, KITTI dataset.
1. Introduction.
Modern self-driving and robotics systems rely on accurate and efficient data processing to interpret their surroundings, with 3D point clouds from LiDAR [1, 2] among the primary sources of this data. These point clouds consist of dense clusters of points that form the spatial representation of the environment, presenting challenges in data volume and noise. Traditional point cloud processing methods may be insufficient for large-scale data, underscoring the need for optimized preprocessing techniques. Cluster analysis is a promising approach, as it can segment point clouds into clusters representing distinct objects or areas, streamlining subsequent processing. This study validates a K-means [3-5] clustering approach for preprocessing 3D point clouds, focusing on its impact on data efficiency and noise reduction. The validation process utilizes data from the KITTI dataset [6, 7], a benchmark for evaluating 3D data processing techniques in autonomous driving. By validating clustering on this dataset, the study demonstrates how effective segmentation improves data handling, laying the groundwork for more reliable and efficient 3D data processing in autonomous vehicle perception and other applications. The results illustrate how K-means clustering can enhance the preprocessing stage, facilitating more rapid and precise downstream analysis.
2. Problem Statement.
This paper proposes a preprocessing method based on K-means clustering to improve the efficiency of 3D point cloud processing. By reducing data volume and mitigating noise, clustering enables faster and more accurate downstream analysis. This approach aims to address challenges in handling large, noisy datasets, as seen in autonomous driving applications, where efficient and reliable perception is crucial.
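As a rough illustration of this idea (not the exact pipeline of the paper), the sketch below uses Python with numpy and scikit-learn, neither of which is prescribed by the text, to replace a raw point cloud with the centroids of its K-means clusters, shrinking the volume of data handed to downstream stages.

import numpy as np
from sklearn.cluster import KMeans

def reduce_point_cloud(points: np.ndarray, n_clusters: int = 300) -> np.ndarray:
    """Compress an (N, 3) point cloud to its K-means cluster centroids.

    A minimal sketch of cluster-based preprocessing: the centroids act as a
    coarse, noise-reduced summary of the original points.
    """
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    kmeans.fit(points)
    return kmeans.cluster_centers_          # shape (n_clusters, 3)

if __name__ == "__main__":
    # Synthetic data standing in for a LiDAR scan.
    cloud = np.random.rand(30_000, 3) * 50.0
    reduced = reduce_point_cloud(cloud, n_clusters=300)
    print(cloud.shape, "->", reduced.shape)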
3. Method.
The proposed method focuses on improving the evaluation of cluster quality in 3D point cloud data by introducing the Variety Index, a novel metric designed to assess intra-cluster variability. This approach aims to provide a precise measurement of the homogeneity or diversity of points within each cluster, offering deeper insights into the internal organization of the clusters, particularly in large-scale point cloud datasets like the KITTI dataset.
3.1. Related Work.
3.1.1. K-means Clustering.
K-means clustering is a popular method for partitioning a dataset into several distinct, non-overlapping clusters. The algorithm operates by minimizing the value of S described in equation (1):

S = \sum_{C_i \in C} \frac{1}{2\,|C_i|} \sum_{x_j, x_k \in C_i} \lVert x_j - x_k \rVert^2 \qquad (1)

where x_j and x_k are the jth and kth data points assigned to the same cluster C_i.
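For concreteness, a minimal sketch (assuming numpy and scikit-learn, which the paper does not specify) that fits K-means and evaluates the objective S of equation (1) through the equivalent within-cluster sum of squared distances to the centroids:

import numpy as np
from sklearn.cluster import KMeans

points = np.random.rand(10_000, 3)          # stand-in for an (N, 3) LiDAR point cloud
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(points)

# S from equation (1): sum of squared distances of points to their cluster centroid.
S = sum(
    np.sum((points[kmeans.labels_ == i] - c) ** 2)
    for i, c in enumerate(kmeans.cluster_centers_)
)
print(S, kmeans.inertia_)                   # the two values agree (inertia_ is scikit-learn's name for this sum)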
3.1.2. DBSCAN Clustering.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)[8,9] is a clustering method that groups points based on their density. The algorithm is defined by two key parameters: the minimum number of points required to form a dense region and the maximum distance within which two points are considered neighbors.
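A brief sketch (again assuming scikit-learn, an implementation choice not stated in the paper) showing the two DBSCAN parameters mentioned above, eps (neighborhood radius) and min_samples (minimum points for a dense region); the eps and min_samples values are illustrative only:

import numpy as np
from sklearn.cluster import DBSCAN

points = np.random.rand(5_000, 3) * 20.0     # stand-in for LiDAR points
db = DBSCAN(eps=0.7, min_samples=10).fit(points)

labels = db.labels_                          # label -1 marks points treated as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, noise points: {np.sum(labels == -1)}")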
3.1.3. Dunn Index.
The Dunn Index assesses cluster validity by calculating the ratio of the smallest distance between clusters to the largest distance within a cluster:

I_D(C) = \frac{\min_{i \neq j} d_e(C_i, C_j)}{\max_{k} \operatorname{diam}(C_k)} \qquad (2)

where d_e(C_i, C_j) is the smallest distance between a point of C_i and a point of C_j, and diam(C_k) is the largest distance between two points of C_k. The Dunn Index ranges from 0 to infinity, with higher values being preferable. A high I_D indicates that the points within each cluster are tightly grouped and the clusters are well-separated from each other.
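The Dunn Index is not provided by scikit-learn, so a small helper has to be written by hand. The sketch below (numpy and scipy are assumptions of this note, not tools named in the paper) follows the single-linkage / diameter form of equation (2); it is quadratic in the number of points and therefore only practical on subsampled clouds.

import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(points: np.ndarray, labels: np.ndarray) -> float:
    # Ratio of the smallest inter-cluster distance to the largest intra-cluster diameter.
    clusters = [points[labels == k] for k in np.unique(labels) if k != -1]
    # Largest within-cluster diameter (maximum pairwise distance inside any cluster).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Smallest distance between points belonging to two different clusters.
    min_sep = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_sep / max_diam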
3.1.4. Calinski-Harabasz Index.
The Calinski-Harabasz Index (also known as the Variance Ratio Criterion) is defined as:
I_{CH}(C) = \frac{N - K}{K - 1} \cdot \frac{\sum_{k=1}^{K} n_k \, d_e(c_k, \bar{x})^2}{\sum_{k=1}^{K} \sum_{i:\, C(i)=k} d_e(x_i, c_k)^2} \qquad (3)

where N is the total number of points, K the number of clusters, n_k and c_k the size and centroid of cluster k, and \bar{x} the centroid of the entire dataset. The Calinski-Harabasz index I_{CH} compares the variance between clusters to the variance within each cluster; higher values indicate better-defined clusters.
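scikit-learn ships this criterion directly as calinski_harabasz_score; a minimal usage sketch (the synthetic cloud and the choice of eight clusters are illustrative only):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

points = np.random.rand(10_000, 3) * 30.0          # stand-in for an (N, 3) point cloud
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(points)
print(calinski_harabasz_score(points, labels))     # larger values = better-separated clusters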
3.1.5. C-Index.
The C-Index measures the compactness of the clustering:
I_C(C) = \frac{S(C) - S_{\min}(C)}{S_{\max}(C) - S_{\min}(C)} \qquad (4)

where

S(C) = \sum_{C_k \in C} \sum_{x_i, x_j \in C_k,\; i<j} d_e(x_i, x_j)

is the sum of all within-cluster pairwise distances, S_{\min}(C) is the sum of the n_w smallest pairwise distances in X, and S_{\max}(C) is the sum of the n_w largest pairwise distances in X, with n_w the number of within-cluster pairs, X the entire dataset, x_i, x_j the data points, and d_e the Euclidean distance.
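No C-Index implementation is provided by scikit-learn, so a direct sketch of equation (4) is given below (numpy and scipy assumed; it materializes the full pairwise distance matrix, so it is only suitable for modest or subsampled point counts):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def c_index(points: np.ndarray, labels: np.ndarray) -> float:
    # I_C = (S - S_min) / (S_max - S_min), with S the sum of within-cluster pair distances.
    d = squareform(pdist(points))                  # all pairwise Euclidean distances
    iu = np.triu_indices(len(points), k=1)         # each unordered pair counted once
    same = (labels[:, None] == labels[None, :])[iu]
    all_dists = d[iu]
    within = all_dists[same]
    n_w = len(within)                              # number of within-cluster pairs
    ordered = np.sort(all_dists)
    s_min = ordered[:n_w].sum()                    # sum of the n_w smallest distances in X
    s_max = ordered[-n_w:].sum()                   # sum of the n_w largest distances in X
    return (within.sum() - s_min) / (s_max - s_min)

Lower values of the C-Index indicate more compact clusterings.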
3.1.6 Silhouette Score.
The Silhouette Score measures how similar a point is to its own cluster compared to other clusters:

I_S(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (5)

where a(i) is the average distance between point i and all other points in the same cluster, and b(i) is the minimum average distance from point i to points in a different cluster.

3.1.7. Davies-Bouldin Index.

The Davies-Bouldin Index evaluates clustering by measuring the average similarity ratio of each cluster with its most similar cluster:

I_{DB}(C) = \frac{1}{K} \sum_{k=1}^{K} \max_{j \neq k} \frac{\bar{d}_k + \bar{d}_j}{d_e(c_k, c_j)} \qquad (6)

where \bar{d}_k is the average distance from the points of cluster k to its centroid c_k, and d_e(c_k, c_j) is the distance between the centroids of clusters k and j. Lower values indicate better clustering.
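Both of these indices are available in scikit-learn; a short usage sketch follows (the synthetic data, the cluster count, and the sample_size used to keep the silhouette computation tractable are all illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

points = np.random.rand(50_000, 3) * 30.0          # stand-in for a LiDAR scan
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(points)

# The silhouette is quadratic in the number of points, so evaluate it on a random subsample.
print("silhouette:     ", silhouette_score(points, labels, sample_size=2000, random_state=0))
print("davies-bouldin: ", davies_bouldin_score(points, labels))   # lower is better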
3.2. Our approach to clustering.
Processing LiDAR data is essential for understanding the environment in various applications, including autonomous driving, robotics, and geographic information systems. Clustering is an effective technique for analyzing LiDAR data, as it organizes data points into distinct clusters based on their spatial distribution.
3.2.1. Variety Index Method (Our Index).
The focus of the Variety Index is intra-cluster variability. Unlike traditional clustering evaluation metrics, such as the Silhouette score and the Dunn Index, which assess both intra-cluster cohesion and inter-cluster separation, the Variety Index specifically measures the variability within individual clusters. This provides a clear indication of the homogeneity or diversity of points within each cluster, offering a deeper understanding of their internal structure. The Variety Index is computed as the average variance of point positions within each cluster, providing a simple and interpretable statistic to quantify the internal variation of clustering results.
To address small clusters, the method includes a mechanism that assigns a diversity value of zero to clusters with very few points. This prevents the overall index from being skewed by insufficient data in smaller clusters. This feature ensures that the index remains reliable even when clusters differ significantly in size. The Variety Index is designed to estimate intra-cluster variability by calculating the average variance of points within each cluster, formally defined in Formula 7.
The theoretical foundation of the Variety Index lies in its ability to assess intra-cluster variability by computing the average variance of points within each cluster. Its formal definition is as follows:
\text{Variety Index} = \frac{1}{k} \sum_{i=1}^{k} \operatorname{Var}(C_i) \qquad (7)

where Var(C_i) is the variance of points within cluster i, and k is the total number of clusters.
Intra-cluster variance Var(C_i): the variance within cluster i measures the degree of spread of points within the cluster. High variance suggests that points within the cluster are more dispersed, while low variance indicates a tighter and more uniform distribution. The Variety Index calculates the average variance across all clusters, providing an assessment of the overall diversity within the clusters. This metric is particularly useful for comparing clustering results: a lower Variety Index signifies more homogeneous groups.
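The paper does not give an implementation, so the following is only one possible reading of Formula (7), assuming Python with numpy; the min_points threshold and the use of the mean per-coordinate variance are illustrative choices, since the text leaves both unspecified.

import numpy as np

def variety_index(points: np.ndarray, labels: np.ndarray, min_points: int = 5) -> float:
    """One possible reading of the Variety Index of Formula (7).

    For each cluster the variance of its points is computed (here: the mean of
    the per-coordinate variances); clusters with fewer than `min_points` points
    contribute zero, as described in the text, and the result is the average
    over all k clusters. Lower values indicate more homogeneous clusters.
    """
    variances = []
    for k in np.unique(labels):
        members = points[labels == k]
        if len(members) < min_points:
            variances.append(0.0)        # guard against tiny clusters skewing the index
        else:
            variances.append(members.var(axis=0).mean())
    return float(np.mean(variances))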
4. Experiments.
Table 1. Internal cluster validity index results for five scenarios (KITTI dataset, scans 06-10, .bin files).
Scenario Dunn Calinski-H. C-Index Silhouette Davies-B. Ours
06 0.0004 124784 186797 0.49 0.688 0.89
07 0.0017 189213 372867 0.64 0.507 0.47
08 0.0006 084184 154537 0.57 0.545 0.99
09 0.0045 143722 216997 0.48 0.662 0.46
10 0.0009 090154 300328 0.72 0.300 0.72
This section provides an analysis of clustering performance across five scenarios using internal validation metrics on the KITTI dataset (scans 06-10). The clustering outcomes were evaluated with several metrics, including the Dunn Index, Calinski-Harabasz Index, C-Index, Silhouette score, Davies-Bouldin Index, and a custom metric named the Variety Index. As summarized in Table 1, the analysis shows that scenario 10 consistently achieved the best results in terms of Silhouette score and Davies-Bouldin index, indicating well-defined and separated clusters. Scenario 07 also performed well on the Calinski-Harabasz Index. However, low Dunn Index values across all scenarios point to potential challenges in achieving sufficient cluster separation. These findings suggest that, while some scenarios demonstrate promising clustering characteristics, further optimization is necessary to enhance inter-cluster separation for more effective segmentation.
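For reproducibility, a sketch of how such an evaluation could be run on KITTI Velodyne scans is given below (the file name, the cluster count, and the sample_size are illustrative assumptions; KITTI .bin scans store float32 x, y, z, reflectance tuples, and the variety_index sketch from Section 3.2 could be added to the same dictionary):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

def load_kitti_bin(path: str) -> np.ndarray:
    """Read a KITTI Velodyne scan: float32 records of (x, y, z, reflectance)."""
    return np.fromfile(path, dtype=np.float32).reshape(-1, 4)[:, :3]

def evaluate_scan(path: str, n_clusters: int = 20) -> dict:
    points = load_kitti_bin(path)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(points)
    return {
        "silhouette": silhouette_score(points, labels, sample_size=2000, random_state=0),
        "calinski_harabasz": calinski_harabasz_score(points, labels),
        "davies_bouldin": davies_bouldin_score(points, labels),
    }

# Hypothetical file name; the actual scenario files come from the KITTI dataset.
print(evaluate_scan("000006.bin"))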
5. Conclusion.
In conclusion, this study demonstrates the effectiveness of using various clustering evaluation metrics to analyze and optimize point cloud clustering on the KITTI dataset. By applying internal validation indices, including the Dunn Index, Calinski-Harabasz Index, Silhouette score, and the custom Variety Index, we assessed the quality and separability of clusters across different scenarios.
The results show that certain scenarios, such as scenario 10, achieved well-defined clusters, with the highest Silhouette score and the lowest (best) Davies-Bouldin index, indicating compact and distinct cluster formations. However, the generally low Dunn Index values across all scenarios reveal ongoing challenges in inter-cluster separation. This indicates potential for further refinement in clustering methods to better partition the point clouds, thus improving segmentation accuracy.
Overall, these findings highlight the importance of using a combination of validation metrics tailored to both inter- and intra-cluster characteristics. The proposed Variety Index, focused on intra-cluster variability, adds a valuable perspective to the evaluation process, facilitating a more comprehensive understanding of clustering effectiveness for point cloud data. Future work could focus on enhancing cluster separation and testing the Variety Index on additional datasets to validate its general applicability.
REFERENCES:
1. Abbasi R. et al. Lidar point cloud compression, processing and learning for autonomous driving // IEEE Trans. Intell. Transp. Syst. IEEE, 2022. Vol. 24, № 1. P. 962-979;
2. Ding P., Wang Z. 3D LiDAR point cloud loop detection based on dynamic object removal. IEEE, 2021. P. 980-985;
3. Ikotun A.M. et al. K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data // Inf. Sci. 2022;
4. Arthur D., Vassilvitskii S. k-means++: The advantages of careful seeding. Stanford, 2006;
5. Jain A.K. Data clustering: 50 years beyond K-means // Pattern Recognit. Lett. Elsevier, 2010. Vol. 31, № 8. P. 651-666;
6. Geiger A. et al. Vision meets robotics: The KITTI dataset // Int. J. Robot. Res. Sage Publications Sage UK: London, England, 2013. Vol. 32, № 11. P. 1231-1237;
7. Geiger A., Lenz P., Urtasun R. Are we ready for autonomous driving? The KITTI vision benchmark suite. IEEE, 2012. P. 3354-3361;
8. Wang C. et al. An improved DBSCAN method for LiDAR data segmentation with automatic Eps estimation // Sensors. 2019. Vol. 19, № 1. P. 172;
9. Oliveira M.I., Marfal A.R. Clustering LiDAR data with K-means and DBSCAN. 2023. P. 822-831.