I'm experimenting with clustering algorithms that automatically weight features as they go, as a way to improve ML interpretability and filter out noise. I have some new algorithms I'm running experiments on, and want to find the broadest range of existing algorithms to compare them against. I’m most interested in algorithms that do not require a pre-supplied tuning parameter, but not exclusively so.
So far, I’m benchmarking against:
Scikit-learn KMeans, left at its default k-means++ initialisation. I subclass KMeans and override its init method so that I can capture internal data such as the initial centroids: https://github.com/scikit-learn/scikit-learn/blob/d5082d32d/sklearn/cluster/_kmeans.py#L1196
An implementation of w-KMeans (Weighted KMeans): J. Z. Huang, M. K. Ng, Hongqiang Rong and Zichen Li, "Automated variable weighting in k-means type clustering," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668, May 2005, doi: 10.1109/TPAMI.2005.95. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1407871
An implementation of FWSA (Feature Weight Self Adjustment): Chieh-Yuan Tsai, Chuang-Cheng Chiu, "Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm," in Computational Statistics & Data Analysis, Volume 52, Issue 10, 2008, Pages 4658-4672, ISSN 0167-9473, https://doi.org/10.1016/j.csda.2008.03.002 https://www.sciencedirect.com/science/article/pii/S0167947308001552
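For anyone wanting to reproduce the scikit-learn baseline, here is a minimal sketch of the kind of subclass I mean. It assumes a scikit-learn version in which KMeans.fit calls the private method _init_centroids (as in the linked source); because that method is private, its signature varies between releases, so the override simply passes all arguments through. The names RecordingKMeans and initial_centroids_ are my own and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


class RecordingKMeans(KMeans):
    """KMeans that keeps a copy of the centroids chosen at each initialisation.

    Assumption: KMeans.fit calls self._init_centroids once per restart.
    This is a private method, so it may change between scikit-learn versions.
    """

    def _init_centroids(self, *args, **kwargs):
        # Delegate to scikit-learn, then store a copy before Lloyd's iterations
        # have a chance to modify the array.
        centers = super()._init_centroids(*args, **kwargs)
        self.initial_centroids_.append(np.array(centers, copy=True))
        return centers

    def fit(self, X, y=None, sample_weight=None):
        # Reset the record on every fit so repeated fits do not accumulate.
        self.initial_centroids_ = []
        return super().fit(X, y, sample_weight=sample_weight)


# Small synthetic example: 3 restarts, so 3 sets of initial centroids.
X, _ = make_blobs(n_samples=200, centers=3, n_features=4, random_state=0)
model = RecordingKMeans(n_clusters=3, n_init=3, random_state=0).fit(X)
print(len(model.initial_centroids_), model.initial_centroids_[0].shape)
```

With n_init greater than 1 the list holds one array per restart, and nothing in this sketch records which restart produced the final cluster_centers_.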
The experiments seek to minimise weighted inertia (aka distortion cost, SSq, etc.) and rate the algorithms using the Adjusted Rand Index against ground truth. For reference, I’m also looking to test the datasets for clusterability by sampling some measures of dataset entropy. I’m also looking to swap out minimising inertia for some other novel criterion, such as maximising some measure of cluster pseudo-density or minimising some measure of cluster pseudo-surface area. The experiments are based on about 6000 datasets constructed with make_blobs and, so far, 50 or so well-known real-world datasets.
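To make the evaluation concrete, here is a rough sketch of the two measures on a single make_blobs dataset. The helper weighted_inertia is hypothetical, not taken from either paper: it assumes one non-negative weight per feature, applied linearly to the squared per-feature distances. An algorithm that raises its weights to an exponent would need to apply that exponent before calling it. Plain KMeans with uniform weights stands in for the weighted algorithms here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def weighted_inertia(X, labels, centers, weights):
    """Sum of feature-weighted squared distances from each point to its centre.

    Hypothetical helper: `weights` has one entry per feature and is applied
    linearly to the squared differences.
    """
    diffs = X - centers[labels]  # shape (n_samples, n_features)
    return float(np.sum(weights * diffs ** 2))


# One synthetic dataset with known ground-truth labels.
X, y_true = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# With every weight equal to 1 this reduces to ordinary KMeans inertia.
uniform = np.ones(X.shape[1])
inertia = weighted_inertia(X, km.labels_, km.cluster_centers_, uniform)

# Adjusted Rand Index of the predicted labels against the ground truth.
ari = adjusted_rand_score(y_true, km.labels_)
print(inertia, km.inertia_, ari)
```

With uniform weights the value should agree with km.inertia_ up to floating-point error, which is a handy sanity check before plugging in learned weights.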
Could anyone recommend any other papers, or perhaps code references? I’m working in Python, but would adapt code from any language.
I've started a literature review and been through sites like litmaps looking for comparable algorithms, and have also sought counsel from my PhD supervisor, who recommended posting here.
Good would look like a paper with an algorithm, even if it were thick with mathematics. Great would be the same, but maybe with some code.