Is the following statement correct?
"k-means may diverge if there is noise in the data."
Large noise in the training data can certainly degrade the quality of the k-means clustering result, but the algorithm is nonetheless guaranteed to converge to a local minimum of its objective function $J$, the sum of squared distances between each data point and its assigned centroid. What k-means does not guarantee is finding the global minimum, because the result is sensitive to the initial centroid positions.
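Concretely, with data points $x_1, \dots, x_n$, centroids $\mu_1, \dots, \mu_k$, and $c(i)$ denoting the index of the centroid assigned to $x_i$, the objective is

$$J = \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2 .$$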
The algorithm starts by initializing $k$ centroids, usually at random. In the assignment step, each data point is assigned to its nearest centroid, which minimizes $J$ over the assignments for the current centroids. In the update step, each centroid is moved to the mean of the points assigned to it, which minimizes $J$ over the centroid positions for the current assignments. Each iteration of these two steps therefore never increases $J$. Since $J$ is bounded below by zero and is monotonically non-increasing, and since the number of possible assignments of $n$ data points to $k$ clusters is finite, the algorithm must eventually reach a state where further iterations change neither the assignments nor the centroids, no matter how many outliers the data contain. The sketch below demonstrates this.
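Here is a minimal NumPy sketch (the function `kmeans` and the generated data are illustrative, not a production implementation) that prints $J$ after every assign-update iteration. Even when the data are salted with extreme outliers, the printed values are non-increasing until the assignments stop changing:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means; prints the objective J each iteration
    to illustrate its monotonic non-increase."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by sampling k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    prev_labels = None
    for it in range(n_iters):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        for j in range(k):
            if np.any(labels == j):           # leave empty clusters unchanged
                centroids[j] = X[labels == j].mean(axis=0)
        J = ((X - centroids[labels]) ** 2).sum()
        print(f"iter {it}: J = {J:.4f}")
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break                              # assignments stable: converged
        prev_labels = labels
    return labels, centroids

# Two Gaussian blobs plus a few extreme outliers ("noise"): J still converges.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               rng.normal(0, 1, (5, 2)) * 50])  # outliers
labels, centroids = kmeans(X, k=2)
```

The outliers may drag a centroid toward them and yield a poor local minimum, but they cannot make the iteration diverge.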
In practice, when the data contain substantial noise, density-based spatial clustering of applications with noise (DBSCAN) is often preferred.
Unlike k-means, DBSCAN does not require specifying the number of clusters a priori. It also has an explicit notion of noise, which makes it robust to outliers.
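As a quick illustration using scikit-learn's `DBSCAN` (the `eps` and `min_samples` values here are arbitrary and must be tuned per dataset), DBSCAN labels points it considers noise with `-1` instead of forcing them into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus scattered outliers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(4, 0.3, (50, 2)),
               rng.uniform(-10, 10, (10, 2))])  # noise points

# eps and min_samples are illustrative; they control the density threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Noise points get the label -1 rather than a cluster assignment.
print("clusters found:", len(set(labels) - {-1}))
print("points flagged as noise:", np.sum(labels == -1))
```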