machine_learning.k_means_clust
==============================

.. py:module:: machine_learning.k_means_clust

.. autoapi-nested-parse::

   README, Author - Anurag Kumar(mailto:anuragkumarak95@gmail.com)

   Requirements:
     - sklearn
     - numpy
     - matplotlib

   Python:
     - 3.5

   Inputs:
     - X, a 2D numpy array of features.
     - k, number of clusters to create.
     - initial_centroids, initial centroid values generated by the utility function
       (mentioned in Usage).
     - maxiter, maximum number of iterations to process.
     - heterogeneity, an empty list that is filled with heterogeneity values if passed
       to the kmeans function.

   Usage:
     1. Define the 'k' value, the 'X' features array and an empty 'heterogeneity' list.

     2. Create initial_centroids::

           initial_centroids = get_initial_centroids(
               X,
               k,
               seed=0  # seed value for initial centroid generation,
                       # None for randomness (default=None)
           )

     3. Find centroids and clusters using the kmeans function::

           centroids, cluster_assignment = kmeans(
               X,
               k,
               initial_centroids,
               maxiter=400,
               record_heterogeneity=heterogeneity,
               verbose=True  # whether to print logs in console or not (default=False)
           )

     4. Plot the loss function and the heterogeneity values saved for every iteration
        in the heterogeneity list::

           plot_heterogeneity(
               heterogeneity,
               k
           )

     5. Transfer the DataFrame into Excel format. It must have a feature called 'Clust'
        containing the k-means cluster numbers (the ``report_generator`` example uses
        the column name 'Cluster').


Attributes
----------

.. autoapisummary::

   machine_learning.k_means_clust.TAG
   machine_learning.k_means_clust.dataset


Functions
---------

.. autoapisummary::

   machine_learning.k_means_clust.assign_clusters
   machine_learning.k_means_clust.centroid_pairwise_dist
   machine_learning.k_means_clust.compute_heterogeneity
   machine_learning.k_means_clust.get_initial_centroids
   machine_learning.k_means_clust.kmeans
   machine_learning.k_means_clust.plot_heterogeneity
   machine_learning.k_means_clust.report_generator
   machine_learning.k_means_clust.revise_centroids


Module Contents
---------------

.. py:function:: assign_clusters(data, centroids)

.. py:function:: centroid_pairwise_dist(x, centroids)

.. py:function:: compute_heterogeneity(data, k, centroids, cluster_assignment)

.. py:function:: get_initial_centroids(data, k, seed=None)

   Randomly choose k data points as initial centroids.


.. py:function:: kmeans(data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False)

   Runs k-means on the given data and initial set of centroids.

   maxiter: maximum number of iterations to run (default=500).

   record_heterogeneity: (optional) a list in which to store the history of
   heterogeneity as a function of iterations; if None, the history is not stored.

   verbose: if True, print how many data points changed their cluster labels in
   each iteration.


.. py:function:: plot_heterogeneity(heterogeneity, k)

.. py:function:: report_generator(predicted: pandas.DataFrame, clustering_variables: numpy.ndarray, fill_missing_report=None) -> pandas.DataFrame

   Generate a clustering report given these arguments:

   predicted - DataFrame with the predicted cluster column

   clustering_variables - the variables used for clustering
   (``['col1', 'col2']`` in the example below)

   fill_missing_report - dictionary of rules for filling in missing values in the
   final generated report (not included in modelling)

   >>> predicted = pd.DataFrame()
   >>> predicted['numbers'] = [1, 2, 3]
   >>> predicted['col1'] = [0.5, 2.5, 4.5]
   >>> predicted['col2'] = [100, 200, 300]
   >>> predicted['col3'] = [10, 20, 30]
   >>> predicted['Cluster'] = [1, 1, 2]
   >>> report_generator(predicted, ['col1', 'col2'], 0)
              Features               Type   Mark           1           2
   0    # of Customers        ClusterSize  False    2.000000    1.000000
   1    % of Customers  ClusterProportion  False    0.666667    0.333333
   2              col1    mean_with_zeros   True    1.500000    4.500000
   3              col2    mean_with_zeros   True  150.000000  300.000000
   4           numbers    mean_with_zeros  False    1.500000    3.000000
   ..              ...                ...    ...         ...         ...
   99            dummy                 5%  False    1.000000    1.000000
   100           dummy                95%  False    1.000000    1.000000
   101           dummy              stdev  False    0.000000         NaN
   102           dummy               mode  False    1.000000    1.000000
   103           dummy             median  False    1.000000    1.000000
   [104 rows x 5 columns]


.. py:function:: revise_centroids(data, k, cluster_assignment)

.. py:data:: TAG
   :value: 'K-MEANS-CLUST/ '


.. py:data:: dataset
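
This page lists the signatures of ``assign_clusters``, ``revise_centroids`` and ``compute_heterogeneity`` but not their bodies, so the following is only a minimal sketch of how a k-means loop with this interface might work. It is not the module's implementation. The name ``kmeans_sketch`` is hypothetical, and two points are assumptions: that clusters are assigned by squared Euclidean distance, and that "heterogeneity" means the within-cluster sum of squared distances. The starting centroids are fixed by hand here (rather than chosen randomly) so that the result is reproducible.

```python
import numpy as np


def kmeans_sketch(data, k, initial_centroids, maxiter=500, record_heterogeneity=None):
    """Hypothetical k-means loop; NOT the module's kmeans implementation."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    assignment = None
    for _ in range(maxiter):
        # Assignment step (assumed metric: squared Euclidean distance).
        # Broadcasting gives an (n_samples, k) matrix of distances.
        sq_dist = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assignment = sq_dist.argmin(axis=1)
        # Stop as soon as no point changes its cluster label.
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        for i in range(k):
            members = data[assignment == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
        # Assumed definition of heterogeneity: within-cluster sum of squares.
        if record_heterogeneity is not None:
            within = ((data - centroids[assignment]) ** 2).sum()
            record_heterogeneity.append(float(within))
    return centroids, assignment


# Two well-separated groups of three points each.
X = np.array(
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
)
initial_centroids = X[[0, 3]]  # hand-picked starting points, one per group
heterogeneity = []
centroids, cluster_assignment = kmeans_sketch(
    X, 2, initial_centroids, maxiter=400, record_heterogeneity=heterogeneity
)
print(cluster_assignment)  # [0 0 0 1 1 1]
```

The call mirrors the usage steps in the module docstring: the caller supplies an empty ``heterogeneity`` list, and the loop appends one value per completed iteration. On this toy data the labels stop changing after the first update, so the list ends up with a single entry.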