-
Notifications
You must be signed in to change notification settings - Fork 2k
BalancedKMeans.cs: It does not find a solution for this case #451
Description
I have prepared this little method with all the data you need to try/debug/fix the issue:
public static int[] Test()
{
int numClusters = 6;
double[][] observations = {
new double[] { 10.8, 18.706148721743876 },
new double[] { -10.8, 18.706148721743876 },
new double[] { -21.6, 0.0 },
new double[] { -10.8, -18.706148721743876 },
new double[] { 10.8, -18.706148721743876 },
new double[] { 21.6, 0.0 },
new double[] { 32.400000000000006, 18.706148721743876 },
new double[] { 21.600000000000005, 37.412297443487752 },
new double[] { 3.5527136788005009E-15, 37.412297443487752 },
new double[] { -21.599999999999998, 37.412297443487752 },
new double[] { -32.4, 18.706148721743876 },
new double[] { -43.2, 0.0 },
new double[] { -32.400000000000006, -18.706148721743876 },
new double[] { -21.600000000000005, -37.412297443487752 },
new double[] { -3.5527136788005009E-15, -37.412297443487752 },
new double[] { 21.599999999999998, -37.412297443487752 },
new double[] { 32.4, -18.706148721743876 },
new double[] { 43.2, 0.0 } };
Accord.Math.Random.Generator.Seed = 0;
Accord.MachineLearning.BalancedKMeans kmeans = new Accord.MachineLearning.BalancedKMeans(numClusters);
// If a limit is not set, the following Learn call does not return....
kmeans.MaxIterations = 1000;
Accord.MachineLearning.KMeansClusterCollection clusters = kmeans.Learn(observations);
int[] labels = clusters.Decide(observations);
return labels;
}
The following image shows the distribution of the points (they are the centers of each numbered circle):
It might seem a rather borderline example (all points are equally distributed using a triangle distribution), but it has only 18 points to be divided in 6 clusters (3 items each one). A valid solution can easily be found visually and even a brute-force backtracking algorithm solves it for such a small amount of points (although it does not scale at all when the numbers grow).
For example, a valid solution would be: (0,6,7), (1,8,9), (2,10,11), (3,13,14), (4,15,16), (5,6,17)
However, the former code returns this solution:
int[] labels = { 5, 4, 2, 0, 1, 3, 3, 3, 4, 4, 4, 2, 0, 0, 0, 0, 3, 3 };
As you can see, the clusters are not balanced: Cluster-0 has 5 elements, Cluster-1 has only 1 element and so forth.
If you comment the line setting MaxIterations, the Learn call does not return in quite a long time (more than 15 minutes). If you leave it there, actually shortcutting the calculations, the Proportions are incorrectly calculated. See my previous comment #450