machine_learning.k_means_clust
README, Author - Anurag Kumar (mailto:anuragkumarak95@gmail.com)
Requirements:
- sklearn
- numpy
- matplotlib
Python:
- 3.5
Inputs:
- X, a 2D numpy array of features.
- k, the number of clusters to create.
- initial_centroids, initial centroid values generated by the utility function (mentioned in Usage).
- maxiter, the maximum number of iterations to process.
- heterogeneity, an empty list that will be filled with heterogeneity values if passed to the kmeans function.
Usage:
- Define the ‘k’ value, the ‘X’ features array and an empty ‘heterogeneity’ list.
- Create initial_centroids:
    initial_centroids = get_initial_centroids(
        X,
        k,
        seed=0,  # seed value for initial centroid generation,
                 # None for randomness (default=None)
    )
- Find centroids and clusters using the kmeans function:
    centroids, cluster_assignment = kmeans(
        X,
        k,
        initial_centroids,
        maxiter=400,
        record_heterogeneity=heterogeneity,
        verbose=True,  # whether to print logs in console or not (default=False)
    )
- Plot the loss function and heterogeneity values for every iteration saved in the heterogeneity list:
    plot_heterogeneity(
        heterogeneity,
        k
    )
- Transfer the DataFrame into Excel format; it must have a feature called ‘Clust’ that holds the k-means cluster numbers.
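Attributes
TAG |
dataset |
Functions
assign_clusters(data, centroids) | |
centroid_pairwise_dist(x, centroids) | |
compute_heterogeneity(data, k, centroids, cluster_assignment) | |
get_initial_centroids(data, k, seed=None) | Randomly choose k data points as initial centroids |
kmeans(data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False) | Runs k-means on given data and initial set of centroids. |
plot_heterogeneity(heterogeneity, k) | |
report_generator(predicted, clustering_variables, fill_missing_report=None) | Generate a clustering report given these two arguments: |
revise_centroids(data, k, cluster_assignment) | |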
Module Contents
- machine_learning.k_means_clust.assign_clusters(data, centroids)
- machine_learning.k_means_clust.centroid_pairwise_dist(x, centroids)
- machine_learning.k_means_clust.compute_heterogeneity(data, k, centroids, cluster_assignment)
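The three helpers above are listed without descriptions. As a rough illustration of what functions with these names usually do in k-means, the sketch below computes point-to-centroid Euclidean distances, assigns each point to its nearest centroid, and sums the squared distances within clusters. The names `pairwise_dist`, `nearest_centroid` and `heterogeneity_of` are hypothetical stand-ins written in plain NumPy, not the module's implementation, and treating heterogeneity as the within-cluster sum of squared distances is an assumption.

```python
import numpy as np


def pairwise_dist(data, centroids):
    """Euclidean distance from every row of data to every centroid.

    Returns an array of shape (n_points, n_centroids).
    """
    diff = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def nearest_centroid(data, centroids):
    """Index of the closest centroid for each data point."""
    return np.argmin(pairwise_dist(data, centroids), axis=1)


def heterogeneity_of(data, k, centroids, cluster_assignment):
    """Assumed definition: sum of squared distances to the assigned centroid."""
    total = 0.0
    for i in range(k):
        members = data[cluster_assignment == i]
        if members.shape[0] > 0:  # skip empty clusters
            total += ((members - centroids[i]) ** 2).sum()
    return float(total)


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])
labels = nearest_centroid(X, centroids)
print(labels)                                     # [0 0 1 1]
print(heterogeneity_of(X, 2, centroids, labels))  # 1.0
```

Each point sits 0.5 away from its centroid, so the four squared distances of 0.25 add up to 1.0.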
- machine_learning.k_means_clust.get_initial_centroids(data, k, seed=None)
Randomly choose k data points as initial centroids
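A minimal sketch of seeded random initialisation, assuming the centroids are simply rows drawn from the data. `pick_initial_centroids` is a hypothetical name, and sampling without replacement through NumPy's `Generator` is a choice made for this example; the module may draw its indices differently.

```python
import numpy as np


def pick_initial_centroids(data, k, seed=None):
    """Pick k distinct rows of data as starting centroids (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # seed=None gives a different draw each run
    rows = rng.choice(data.shape[0], size=k, replace=False)
    return data[rows, :]


X = np.arange(20, dtype=float).reshape(10, 2)
c1 = pick_initial_centroids(X, 3, seed=0)
c2 = pick_initial_centroids(X, 3, seed=0)
print(c1.shape)                # (3, 2)
print(np.array_equal(c1, c2))  # True: same seed, same centroids
```

Fixing the seed makes a clustering run reproducible, which helps when comparing heterogeneity curves across different values of k.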
- machine_learning.k_means_clust.kmeans(data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False)
Runs k-means on given data and initial set of centroids.
- maxiter: maximum number of iterations to run (default=500).
- record_heterogeneity: (optional) a list to store the history of heterogeneity as a function of iterations; if None, do not store the history.
- verbose: if True, print how many data points changed their cluster labels in each iteration.
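The parameters above map onto a standard Lloyd-style loop, which can be sketched as follows. `kmeans_sketch` is a hypothetical stand-in with the same argument names, not the module's code. Stopping when no label changes, leaving the centroid of an empty cluster in place, and recording the within-cluster sum of squares as the heterogeneity are assumptions of this sketch.

```python
import numpy as np


def kmeans_sketch(data, k, initial_centroids, maxiter=500,
                  record_heterogeneity=None, verbose=False):
    """Lloyd-style k-means loop (illustrative only, assumes maxiter >= 1)."""
    centroids = np.array(initial_centroids, dtype=float)  # work on a float copy
    prev_assignment = None
    for itr in range(maxiter):
        # 1. Assign each point to its nearest centroid (squared Euclidean).
        sq_dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assignment = sq_dist.argmin(axis=1)

        # 2. Move each centroid to the mean of its members.
        for i in range(k):
            members = data[assignment == i]
            if len(members) > 0:  # assumption: an empty cluster keeps its centroid
                centroids[i] = members.mean(axis=0)

        # 3. Stop once no point changes cluster.
        if prev_assignment is not None:
            changed = int((prev_assignment != assignment).sum())
            if verbose:
                print(f"iteration {itr}: {changed} points changed cluster")
            if changed == 0:
                break

        # 4. Optionally record the within-cluster sum of squares.
        if record_heterogeneity is not None:
            score = ((data - centroids[assignment]) ** 2).sum()
            record_heterogeneity.append(float(score))

        prev_assignment = assignment
    return centroids, assignment


X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
history = []
centroids, labels = kmeans_sketch(X, 2, X[[0, 2]], maxiter=400,
                                  record_heterogeneity=history)
print(labels)   # [0 0 1 1]
print(history)  # [1.0]
```

On this toy input the loop converges on the second pass, so only one heterogeneity value is recorded before the early exit.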
- machine_learning.k_means_clust.plot_heterogeneity(heterogeneity, k)
- machine_learning.k_means_clust.report_generator(predicted: pandas.DataFrame, clustering_variables: numpy.ndarray, fill_missing_report=None) -> pandas.DataFrame
Generate a clustering report given these two arguments:
- predicted: dataframe with predicted cluster column
- fill_missing_report: dictionary of rules on how we are going to fill in missing values for the final generated report (not included in modelling)

    >>> predicted = pd.DataFrame()
    >>> predicted['numbers'] = [1, 2, 3]
    >>> predicted['col1'] = [0.5, 2.5, 4.5]
    >>> predicted['col2'] = [100, 200, 300]
    >>> predicted['col3'] = [10, 20, 30]
    >>> predicted['Cluster'] = [1, 1, 2]
    >>> report_generator(predicted, ['col1', 'col2'], 0)
               Features               Type   Mark           1           2
    0    # of Customers        ClusterSize  False    2.000000    1.000000
    1    % of Customers  ClusterProportion  False    0.666667    0.333333
    2              col1    mean_with_zeros   True    1.500000    4.500000
    3              col2    mean_with_zeros   True  150.000000  300.000000
    4           numbers    mean_with_zeros  False    1.500000    3.000000
    ..              ...                ...    ...         ...         ...
    99            dummy                 5%  False    1.000000    1.000000
    100           dummy                95%  False    1.000000    1.000000
    101           dummy              stdev  False    0.000000         NaN
    102           dummy               mode  False    1.000000    1.000000
    103           dummy             median  False    1.000000    1.000000
    [104 rows x 5 columns]
- machine_learning.k_means_clust.revise_centroids(data, k, cluster_assignment)
- machine_learning.k_means_clust.TAG = 'K-MEANS-CLUST/ '
- machine_learning.k_means_clust.dataset