Skip to content
This repository was archived by the owner on Nov 19, 2020. It is now read-only.
This repository was archived by the owner on Nov 19, 2020. It is now read-only.

BalancedKMeans.cs: It does not find a solution for this case #451

@jagbarcelo

Description

@jagbarcelo

I have prepared this little method with all the data you need to try/debug/fix the issue:

public static int[] Test()
{
    int numClusters = 6;
    double[][] observations = {
        new double[] { 10.8, 18.706148721743876 },
        new double[] { -10.8, 18.706148721743876 },
        new double[] { -21.6, 0.0 },
        new double[] { -10.8, -18.706148721743876 },
        new double[] { 10.8, -18.706148721743876 },
        new double[] { 21.6, 0.0 },
        new double[] { 32.400000000000006, 18.706148721743876 },
        new double[] { 21.600000000000005, 37.412297443487752 },
        new double[] { 3.5527136788005009E-15, 37.412297443487752 },
        new double[] { -21.599999999999998, 37.412297443487752 },
        new double[] { -32.4, 18.706148721743876 },
        new double[] { -43.2, 0.0 },
        new double[] { -32.400000000000006, -18.706148721743876 },
        new double[] { -21.600000000000005, -37.412297443487752 },
        new double[] { -3.5527136788005009E-15, -37.412297443487752 },
        new double[] { 21.599999999999998, -37.412297443487752 },
        new double[] { 32.4, -18.706148721743876 },
        new double[] { 43.2, 0.0 } };

    Accord.Math.Random.Generator.Seed = 0;
    Accord.MachineLearning.BalancedKMeans kmeans = new Accord.MachineLearning.BalancedKMeans(numClusters);

    // If a limit is not set, the following Learn call does not return....
    kmeans.MaxIterations = 1000;
    Accord.MachineLearning.KMeansClusterCollection clusters = kmeans.Learn(observations);

    int[] labels = clusters.Decide(observations);
    return labels;
}

The following image shows the distribution of the points (they are the centers of each numbered circle):
balancedkmeans-18points

It might seem a rather borderline example (all points are equally distributed using a triangle distribution), but it has only 18 points to be divided in 6 clusters (3 items each one). A valid solution can easily be found visually and even a brute-force backtracking algorithm solves it for such a small amount of points (although it does not scale at all when the numbers grow).

For example, a valid solution would be: (0,6,7), (1,8,9), (2,10,11), (3,13,14), (4,15,16), (5,6,17)

However, the former code returns this solution:
int[] labels = { 5, 4, 2, 0, 1, 3, 3, 3, 4, 4, 4, 2, 0, 0, 0, 0, 3, 3 };
As you can see, the clusters are not balanced: Cluster-0 has 5 elements, Cluster-1 has only 1 element and so forth.

If you comment the line setting MaxIterations, the Learn call does not return in quite a long time (more than 15 minutes). If you leave it there, actually shortcutting the calculations, the Proportions are incorrectly calculated. See my previous comment #450

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions