machine_learning.k_means_clust

README, Author - Anurag Kumar (mailto:anuragkumarak95@gmail.com)

Requirements:

  • sklearn

  • numpy

  • matplotlib

Python:
  • 3.5

Inputs:
  • X, a 2D numpy array of features.

  • k, the number of clusters to create.

  • initial_centroids, the initial centroid values generated by the utility function get_initial_centroids (see Usage).

  • maxiter, the maximum number of iterations to run.

  • heterogeneity, an empty list that is filled with heterogeneity values when passed to the kmeans function.

Usage:
  1. Define the ‘k’ value, the ‘X’ features array, and an empty ‘heterogeneity’ list.

  2. Create initial_centroids:
    initial_centroids = get_initial_centroids(
        X,
        k,
        seed=0,  # seed value for initial centroid generation;
                 # None for randomness (default=None)
    )

  3. Find centroids and clusters using the kmeans function:
    centroids, cluster_assignment = kmeans(
        X,
        k,
        initial_centroids,
        maxiter=400,
        record_heterogeneity=heterogeneity,
        verbose=True,  # whether to print logs to the console (default=False)
    )

  4. Plot the loss function, i.e. the heterogeneity values saved in the heterogeneity list for every iteration:
    plot_heterogeneity(heterogeneity, k)

  5. Convert the DataFrame to Excel format. It must have a column called ‘Clust’ containing the k-means cluster numbers. (The report_generator example in this document uses a column named ‘Cluster’.)

Attributes

TAG

dataset

Functions

assign_clusters(data, centroids)

centroid_pairwise_dist(x, centroids)

compute_heterogeneity(data, k, centroids, ...)

get_initial_centroids(data, k[, seed])

Randomly choose k data points as initial centroids

kmeans(data, k, initial_centroids[, maxiter, ...])

Runs k-means on given data and initial set of centroids.

plot_heterogeneity(heterogeneity, k)

report_generator(predicted, clustering_variables[, fill_missing_report]) → pandas.DataFrame

Generate a clustering report.

revise_centroids(data, k, cluster_assignment)

Module Contents

machine_learning.k_means_clust.assign_clusters(data, centroids)
machine_learning.k_means_clust.centroid_pairwise_dist(x, centroids)
machine_learning.k_means_clust.compute_heterogeneity(data, k, centroids, cluster_assignment)
machine_learning.k_means_clust.get_initial_centroids(data, k, seed=None)

Randomly choose k data points as initial centroids
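One way to choose k data points at random is sketched below. This is an illustration only, not the module's implementation: the helper name pick_initial_centroids is hypothetical, and sampling without replacement (so that no two starting centroids are the same row) is an assumption made for this sketch.

```python
import numpy as np


def pick_initial_centroids(data, k, seed=None):
    """Hypothetical helper: pick k distinct rows of `data` as starting centroids."""
    # seed=None gives a different draw on every run; an integer makes it repeatable.
    rng = np.random.default_rng(seed)
    # Assumption for this sketch: sample row indices without replacement.
    indices = rng.choice(data.shape[0], size=k, replace=False)
    return data[indices, :]


X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [9.0, 11.0]])
centroids = pick_initial_centroids(X, k=2, seed=0)
```

With a fixed seed the same rows are selected on every call, which is useful when comparing runs.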

machine_learning.k_means_clust.kmeans(data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False)

Runs k-means on given data and initial set of centroids.

maxiter: maximum number of iterations to run (default=500).

record_heterogeneity: (optional) a list to store the history of heterogeneity as a function of iterations. If None, do not store the history.

verbose: if True, print how many data points changed their cluster labels in each iteration.
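The parameters above fit the standard k-means (Lloyd) iteration, which can be sketched as follows. This is a minimal sketch under stated assumptions, not the module's code: the function name lloyd_kmeans is hypothetical, heterogeneity is assumed to be the sum of squared Euclidean distances from each point to its assigned centroid, and an empty cluster is assumed to keep its previous centroid.

```python
import numpy as np


def lloyd_kmeans(data, k, initial_centroids, maxiter=500,
                 record_heterogeneity=None, verbose=False):
    """Hypothetical sketch of the k-means loop; requires maxiter >= 1."""
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    prev_assignment = None
    for itr in range(maxiter):
        # 1. Assign every point to its nearest centroid (squared Euclidean distance).
        distances = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignment = distances.argmin(axis=1)
        # 2. Move each centroid to the mean of its points (assumption: an empty
        #    cluster keeps its previous centroid).
        for j in range(k):
            members = data[assignment == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
        # 3. Stop as soon as no point changes cluster.
        if prev_assignment is not None and np.array_equal(prev_assignment, assignment):
            break
        if verbose and prev_assignment is not None:
            changed = int(np.sum(prev_assignment != assignment))
            print(f"iteration {itr}: {changed} points changed cluster")
        # 4. Optionally record the heterogeneity for this iteration.
        if record_heterogeneity is not None:
            score = float(((data - centroids[assignment]) ** 2).sum())
            record_heterogeneity.append(score)
        prev_assignment = assignment
    return centroids, assignment


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
history = []
centroids, assignment = lloyd_kmeans(X, 2, X[[0, 3]], maxiter=400,
                                     record_heterogeneity=history)
```

Passing a list for record_heterogeneity mirrors the optional parameter described above; leaving it as None skips the bookkeeping.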

machine_learning.k_means_clust.plot_heterogeneity(heterogeneity, k)
machine_learning.k_means_clust.report_generator(predicted: pandas.DataFrame, clustering_variables: numpy.ndarray, fill_missing_report=None) → pandas.DataFrame
Generate a clustering report given these two arguments:

predicted - dataframe with the predicted cluster column

fill_missing_report - dictionary of rules on how to fill in missing values for the final generated report (not included in modelling)

>>> predicted = pd.DataFrame()
>>> predicted['numbers'] = [1, 2, 3]
>>> predicted['col1'] = [0.5, 2.5, 4.5]
>>> predicted['col2'] = [100, 200, 300]
>>> predicted['col3'] = [10, 20, 30]
>>> predicted['Cluster'] = [1, 1, 2]
>>> report_generator(predicted, ['col1', 'col2'], 0)
           Features               Type   Mark           1           2
0    # of Customers        ClusterSize  False    2.000000    1.000000
1    % of Customers  ClusterProportion  False    0.666667    0.333333
2              col1    mean_with_zeros   True    1.500000    4.500000
3              col2    mean_with_zeros   True  150.000000  300.000000
4           numbers    mean_with_zeros  False    1.500000    3.000000
..              ...                ...    ...         ...         ...
99            dummy                 5%  False    1.000000    1.000000
100           dummy                95%  False    1.000000    1.000000
101           dummy              stdev  False    0.000000         NaN
102           dummy               mode  False    1.000000    1.000000
103           dummy             median  False    1.000000    1.000000

[104 rows x 5 columns]
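The first rows of the report above (cluster size, cluster proportion, and per-cluster means) can be reproduced with plain pandas group-by operations. The sketch below is an illustration only and does not call report_generator; the variable names sizes, proportions, and means are hypothetical.

```python
import pandas as pd

# Same toy data as the example above, minus the columns not needed here.
predicted = pd.DataFrame(
    {"col1": [0.5, 2.5, 4.5], "col2": [100, 200, 300], "Cluster": [1, 1, 2]}
)

# Number of rows per cluster (the "# of Customers" row).
sizes = predicted.groupby("Cluster").size()
# Share of all rows per cluster (the "% of Customers" row).
proportions = sizes / len(predicted)
# Per-cluster mean of each clustering variable.
means = predicted.groupby("Cluster")[["col1", "col2"]].mean()
```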
machine_learning.k_means_clust.revise_centroids(data, k, cluster_assignment)
machine_learning.k_means_clust.TAG = 'K-MEANS-CLUST/ '
machine_learning.k_means_clust.dataset