machine_learning.k_means_clust
==============================

.. py:module:: machine_learning.k_means_clust

.. autoapi-nested-parse::

   README, Author - Anurag Kumar (mailto:anuragkumarak95@gmail.com)
   Requirements:
     - sklearn
     - numpy
     - matplotlib
   Python:
     - 3.5
   Inputs:
     - X, a 2D numpy array of features.
     - k, the number of clusters to create.
     - initial_centroids, initial centroid values generated by the utility function
       (see Usage).
     - maxiter, the maximum number of iterations to run.
     - heterogeneity, an empty list that is filled with heterogeneity values when it is
       passed to the kmeans function as record_heterogeneity.
   Usage:
     1. Define the 'k' value, the 'X' features array and an empty 'heterogeneity' list.
     2. Create initial_centroids::

           initial_centroids = get_initial_centroids(
               X,
               k,
               seed=0,  # seed value for initial centroid generation,
                        # None for randomness (default=None)
           )

     3. Find the centroids and clusters using the kmeans function::

           centroids, cluster_assignment = kmeans(
               X,
               k,
               initial_centroids,
               maxiter=400,
               record_heterogeneity=heterogeneity,
               verbose=True,  # whether to print logs to the console (default=False)
           )

     4. Plot the heterogeneity (loss) values saved in the heterogeneity list for
        every iteration::

           plot_heterogeneity(
               heterogeneity,
               k,
           )

     5. Convert a DataFrame into an Excel-style report. The DataFrame must have a
        column called 'Cluster' (as in the report_generator example) that holds the
        k-means cluster numbers.
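
   The heterogeneity recorded in step 3 and plotted in step 4 is not defined in this
   README. A common choice for k-means, assumed here, is the within-cluster sum of
   squared Euclidean distances. The sketch below illustrates that quantity on four
   points; ``within_cluster_sse`` is a hypothetical helper and not part of this module.

   .. code-block:: python

      import numpy as np


      def within_cluster_sse(data, centroids, cluster_assignment):
          """Sum of squared Euclidean distances from each point to its centroid.

          Hypothetical helper for illustration; not part of this module.
          """
          # For every row of data, look up the centroid of its assigned cluster.
          assigned = centroids[cluster_assignment]
          return float(np.sum((data - assigned) ** 2))


      X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
      centroids = np.array([[0.0, 1.0], [10.0, 2.0]])
      labels = np.array([0, 0, 1, 1])

      heterogeneity = within_cluster_sse(X, centroids, labels)
      print(heterogeneity)  # 1 + 1 + 4 + 4 = 10.0

   Lower values mean the points sit closer to their assigned centroids, which is why
   the per-iteration history is useful as a loss curve.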


Attributes
----------

.. autoapisummary::

   machine_learning.k_means_clust.TAG
   machine_learning.k_means_clust.dataset


Functions
---------

.. autoapisummary::

   machine_learning.k_means_clust.assign_clusters
   machine_learning.k_means_clust.centroid_pairwise_dist
   machine_learning.k_means_clust.compute_heterogeneity
   machine_learning.k_means_clust.get_initial_centroids
   machine_learning.k_means_clust.kmeans
   machine_learning.k_means_clust.plot_heterogeneity
   machine_learning.k_means_clust.report_generator
   machine_learning.k_means_clust.revise_centroids


Module Contents
---------------

.. py:function:: assign_clusters(data, centroids)

.. py:function:: centroid_pairwise_dist(x, centroids)

.. py:function:: compute_heterogeneity(data, k, centroids, cluster_assignment)

.. py:function:: get_initial_centroids(data, k, seed=None)

   Randomly choose k data points as initial centroids
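
   A minimal sketch of this idea, assuming NumPy and sampling row indices without
   replacement so that the k centroids are distinct data points. The name
   ``pick_initial_centroids`` is hypothetical, and the module's own implementation
   may draw its indices differently.

   .. code-block:: python

      import numpy as np


      def pick_initial_centroids(data, k, seed=None):
          """Return k distinct rows of data as starting centroids (illustrative only)."""
          # seed=None gives a different draw on every call; an integer makes it repeatable.
          rng = np.random.default_rng(seed)
          # Sample k distinct row indices (assumption: no duplicates wanted).
          idx = rng.choice(data.shape[0], size=k, replace=False)
          return data[idx]


      X = np.arange(20, dtype=float).reshape(10, 2)
      init = pick_initial_centroids(X, k=3, seed=0)
      print(init.shape)  # (3, 2)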


.. py:function:: kmeans(data, k, initial_centroids, maxiter=500, record_heterogeneity=None, verbose=False)

   Runs k-means on the given data and initial set of centroids.

   - maxiter: maximum number of iterations to run (default=500).
   - record_heterogeneity: (optional) a list that stores the history of heterogeneity
     as a function of iterations; if None, the history is not stored.
   - verbose: if True, print how many data points changed their cluster labels in
     each iteration.
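
   The loop described above can be sketched as follows. This is an illustrative
   NumPy-only version of the standard assign-then-update iteration, not the module's
   code. ``kmeans_sketch`` is a hypothetical name, and three details are assumptions:
   heterogeneity is taken to be the within-cluster sum of squared distances, empty
   clusters keep their previous centroid, and ``maxiter`` is at least 1.

   .. code-block:: python

      import numpy as np


      def kmeans_sketch(data, k, initial_centroids, maxiter=500,
                        record_heterogeneity=None, verbose=False):
          """Illustrative k-means loop; assumes maxiter >= 1."""
          # Copy so the caller's initial centroids are not modified in place.
          centroids = np.array(initial_centroids, dtype=float)
          prev_assignment = None
          for itr in range(maxiter):
              # Assignment step: label each point with its nearest centroid
              # (squared Euclidean distance).
              dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
              assignment = dists.argmin(axis=1)

              # Update step: move each centroid to the mean of its members.
              for j in range(k):
                  members = data[assignment == j]
                  if len(members) > 0:  # assumption: empty clusters stay put
                      centroids[j] = members.mean(axis=0)

              # Stop once no point changes its label.
              if prev_assignment is not None and np.array_equal(assignment, prev_assignment):
                  break

              if verbose and prev_assignment is not None:
                  changed = int(np.sum(assignment != prev_assignment))
                  print(f"iteration {itr}: {changed} points changed cluster")

              if record_heterogeneity is not None:
                  sse = float(((data - centroids[assignment]) ** 2).sum())
                  record_heterogeneity.append(sse)

              prev_assignment = assignment
          return centroids, assignment


      X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
      history = []
      centroids, labels = kmeans_sketch(X, 2, X[[0, 3]], maxiter=400,
                                        record_heterogeneity=history)
      print(labels)  # [0 0 0 1 1 1]

   Passing a list as ``record_heterogeneity`` lets the caller inspect or plot the
   values afterwards without changing the return value.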


.. py:function:: plot_heterogeneity(heterogeneity, k)

.. py:function:: report_generator(predicted: pandas.DataFrame, clustering_variables: numpy.ndarray, fill_missing_report=None) -> pandas.DataFrame

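   Generate a clustering report from the following arguments:

   - predicted: DataFrame with a predicted cluster column ('Cluster').
   - clustering_variables: names of the columns used for clustering; their rows are
     flagged True in the report's Mark column.
   - fill_missing_report: dictionary of rules for filling in missing values in the
     final generated report (not included in modelling). The example here passes 0.

   >>> import pandas as pd
   >>> predicted = pd.DataFrame()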
   >>> predicted['numbers'] = [1, 2, 3]
   >>> predicted['col1'] = [0.5, 2.5, 4.5]
   >>> predicted['col2'] = [100, 200, 300]
   >>> predicted['col3'] = [10, 20, 30]
   >>> predicted['Cluster'] = [1, 1, 2]
   >>> report_generator(predicted, ['col1', 'col2'], 0)
              Features               Type   Mark           1           2
   0    # of Customers        ClusterSize  False    2.000000    1.000000
   1    % of Customers  ClusterProportion  False    0.666667    0.333333
   2              col1    mean_with_zeros   True    1.500000    4.500000
   3              col2    mean_with_zeros   True  150.000000  300.000000
   4           numbers    mean_with_zeros  False    1.500000    3.000000
   ..              ...                ...    ...         ...         ...
   99            dummy                 5%  False    1.000000    1.000000
   100           dummy                95%  False    1.000000    1.000000
   101           dummy              stdev  False    0.000000         NaN
   102           dummy               mode  False    1.000000    1.000000
   103           dummy             median  False    1.000000    1.000000
   <BLANKLINE>
   [104 rows x 5 columns]


.. py:function:: revise_centroids(data, k, cluster_assignment)

.. py:data:: TAG
   :value: 'K-MEANS-CLUST/ '


.. py:data:: dataset