machine_learning.word_frequency_functions
=========================================

.. py:module:: machine_learning.word_frequency_functions


Functions
---------

.. autoapisummary::

   machine_learning.word_frequency_functions.document_frequency
   machine_learning.word_frequency_functions.inverse_document_frequency
   machine_learning.word_frequency_functions.term_frequency
   machine_learning.word_frequency_functions.tf_idf


Module Contents
---------------

.. py:function:: document_frequency(term: str, corpus: str) -> tuple[int, int]

   Calculate the number of documents in a corpus that contain a
   given term.

   @params : term, the term to search each document for, and corpus, a collection of
   documents. Each document should be separated by a newline.

   @returns : a tuple holding the number of documents in the corpus that contain the
   term and the total number of documents in the corpus

   @examples :

   >>> document_frequency("first", "This is the first document in the corpus.\nThIs is the second document in the corpus.\nTHIS is the third document in the corpus.")
   (1, 3)
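
   The counting described above can be sketched without importing this module. The
   following is only an illustration, not the module's implementation: the helper name
   ``count_documents_with_term`` is hypothetical, and it assumes that matching is
   case-insensitive, that punctuation is ignored, and that a document counts only when
   the term appears as a whole word.

   .. code-block:: python

      import string


      def count_documents_with_term(term: str, corpus: str) -> tuple[int, int]:
          """Hypothetical helper: (documents containing term, total documents)."""
          # Assumption: lowercase everything and drop punctuation before matching.
          table = str.maketrans("", "", string.punctuation)
          docs = corpus.lower().translate(table).split("\n")
          needle = term.lower()
          # Assumption: a document matches only if the term is one of its words.
          hits = sum(1 for doc in docs if needle in doc.split())
          return hits, len(docs)


      corpus = "The first document.\nA second one.\nTHE third, and the last."
      print(count_documents_with_term("the", corpus))  # (2, 3)

   The second element of the returned pair is the corpus size, which is the value an
   inverse document frequency calculation needs alongside the match count.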


.. py:function:: inverse_document_frequency(df: int, n: int, smoothing=False) -> float

   Return a float denoting the importance of a word. This measure of
   importance is calculated as log10(n / df), where n is the number of
   documents in the corpus and df is the document frequency.

   @params : df, the document frequency; n, the number of documents in the
   corpus; and smoothing, which, if True, selects the smoothed idf

   @returns : log10(n / df), or 1 + log10(n / (1 + df)) when smoothing is True,
   rounded to three decimal places

   @examples :

   >>> inverse_document_frequency(3, 0)
   Traceback (most recent call last):
    ...
   ValueError: log10(0) is undefined.
   >>> inverse_document_frequency(1, 3)
   0.477
   >>> inverse_document_frequency(0, 3)
   Traceback (most recent call last):
    ...
   ZeroDivisionError: df must be > 0
   >>> inverse_document_frequency(0, 3, True)
   1.477
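
   The two formulas can be restated as a small self-contained sketch. The name
   ``idf_sketch`` is hypothetical and the body is an assumption about how the formulas
   might be written, not the module's code. In particular, the order in which the two
   error conditions are checked is a guess.

   .. code-block:: python

      from math import log10


      def idf_sketch(df: int, n: int, smoothing: bool = False) -> float:
          """Hypothetical restatement of the idf formulas, rounded to 3 places."""
          if n == 0:
              raise ValueError("log10(0) is undefined.")
          if smoothing:
              # Smoothed variant: adding 1 to df avoids division by zero.
              return round(1 + log10(n / (1 + df)), 3)
          if df == 0:
              raise ZeroDivisionError("df must be > 0")
          return round(log10(n / df), 3)


      print(idf_sketch(1, 3))        # 0.477
      print(idf_sketch(0, 3, True))  # 1.477
      print(idf_sketch(3, 3))        # 0.0

   A term found in every document gets an unsmoothed idf of 0.0, which is why very
   common words contribute nothing to a tf-idf score.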


.. py:function:: term_frequency(term: str, document: str) -> int

   Return the number of times a term occurs within
   a given document.

   @params : term, the term to search a document for, and document,
   the document to search within

   @returns : an integer representing the number of times a term is
   found within the document

   @examples :

   >>> term_frequency("to", "To be, or not to be")
   2
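
   A minimal sketch of this kind of count follows. The helper name ``count_term`` is
   hypothetical, and the sketch assumes that punctuation is stripped, that case is
   ignored, and that words are separated by whitespace. The module itself may tokenize
   differently.

   .. code-block:: python

      import string


      def count_term(term: str, document: str) -> int:
          """Hypothetical helper: occurrences of term among the document's words."""
          # Assumption: punctuation is removed and case is ignored.
          table = str.maketrans("", "", string.punctuation)
          words = document.translate(table).lower().split()
          return words.count(term.lower())


      print(count_term("to", "To be, or not to be"))  # 2
      print(count_term("be", "To be, or not to be"))  # 2
      print(count_term("is", "This is it."))          # 1

   Counting whole words rather than substrings is what keeps "is" from also matching
   inside "This" in the last call.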


.. py:function:: tf_idf(tf: int, idf: int) -> float

   Combine the term frequency and the inverse document frequency to
   calculate the originality of a term. This 'originality' is the product
   of the two values: tf-idf = tf * idf.

   @params : tf, the term frequency, and idf, the inverse document
   frequency (usually a float, as in the example)

   @returns : the product tf * idf

   @examples :

   >>> tf_idf(2, 0.477)
   0.954
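
   The arithmetic behind this example can be traced end to end with the standard
   library alone. The counts below are hypothetical, and rounding to three decimal
   places is an assumption chosen to match the values shown in the examples.

   .. code-block:: python

      from math import log10

      # Hypothetical counts for one term: it occurs twice in the document of
      # interest and appears in 1 of the 3 documents in the corpus.
      tf, df, n = 2, 1, 3

      idf = round(log10(n / df), 3)  # unsmoothed idf
      score = round(tf * idf, 3)     # tf-idf = tf * idf

      print(idf)    # 0.477
      print(score)  # 0.954

      # A term present in all 3 documents scores zero however often it occurs.
      common_idf = round(log10(3 / 3), 3)
      print(round(5 * common_idf, 3))  # 0.0

   The contrast between the two scores shows why a rare term with a modest count can
   outrank a frequent term that appears in every document.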