machine_learning.word_frequency_functions

Functions

document_frequency(→ tuple[int, int])

Calculate the number of documents in a corpus that contain a given term.

inverse_document_frequency(→ float)

Return a float denoting the importance of a word.

term_frequency(→ int)

Return the number of times a term occurs within a given document.

tf_idf(→ float)

Combine the term frequency and inverse document frequency to calculate the originality of a term.

Module Contents

machine_learning.word_frequency_functions.document_frequency(term: str, corpus: str) → tuple[int, int]

Calculate the number of documents in a corpus that contain a given term.

@params : term, the term to search each document for, and corpus, a collection of documents. Each document should be separated by a newline.

@returns : the number of documents in the corpus that contain the term you are searching for, and the total number of documents in the corpus

@examples :

>>> document_frequency("first", "This is the first document in the corpus.\nThIs is the second document in the corpus.\nTHIS is the third document in the corpus.")
(1, 3)
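The counting described above can be sketched as follows. This is an illustration only, not necessarily how the module implements it: the name `count_documents_with_term` is hypothetical, and the sketch assumes that matching is case-insensitive, that punctuation is ignored, and that a document contains a term only when the term appears as a whole word.

```python
import string


def count_documents_with_term(term: str, corpus: str) -> tuple[int, int]:
    """Hypothetical helper: (documents containing term, total documents)."""
    # Assumption: matching is case-insensitive and ignores punctuation.
    strip_punctuation = str.maketrans("", "", string.punctuation)
    documents = corpus.lower().translate(strip_punctuation).split("\n")
    wanted = term.lower()
    # Assumption: a document "contains" the term if it appears as a whole word.
    matches = sum(1 for doc in documents if wanted in doc.split())
    return matches, len(documents)


corpus = (
    "This is the first document in the corpus.\n"
    "ThIs is the second document in the corpus.\n"
    "THIS is the third document in the corpus."
)
print(count_documents_with_term("first", corpus))  # (1, 3)
```

The second element of the tuple is the corpus size, which is the N that the inverse document frequency needs.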

machine_learning.word_frequency_functions.inverse_document_frequency(df: int, n: int, smoothing=False) → float

Return a float denoting the importance of a word. This measure of importance is calculated by log10(N/df), where N is the number of documents and df is the Document Frequency.

@params : df, the Document Frequency, n, the number of documents in the corpus (N in the formula), and smoothing, if True return the idf-smooth

@returns : log10(N/df), or 1 + log10(N/(1 + df)) when smoothing is True

@examples :

>>> inverse_document_frequency(3, 0)
Traceback (most recent call last):
    ...
ValueError: log10(0) is undefined.
>>> inverse_document_frequency(1, 3)
0.477
>>> inverse_document_frequency(0, 3)
Traceback (most recent call last):
    ...
ZeroDivisionError: df must be > 0
>>> inverse_document_frequency(0, 3, True)
1.477
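A minimal sketch of the two formulas may help show why smoothing accepts df = 0 while the plain form does not. The name `idf_sketch` is hypothetical, and rounding to three decimals is an assumption inferred from the documented results (0.477 and 1.477); the smoothed form is read here as 1 + log10(N / (1 + df)).

```python
from math import log10


def idf_sketch(df: int, n: int, smoothing: bool = False) -> float:
    """Hypothetical helper mirroring the documented behaviour."""
    if smoothing:
        if n == 0:
            raise ValueError("log10(0) is undefined.")
        # Adding 1 to df keeps the denominator non-zero for unseen terms.
        return round(1 + log10(n / (1 + df)), 3)
    if df == 0:
        raise ZeroDivisionError("df must be > 0")
    if n == 0:
        raise ValueError("log10(0) is undefined.")
    # Assumption: three-decimal rounding, as in the documented examples.
    return round(log10(n / df), 3)


print(idf_sketch(1, 3))                  # 0.477
print(idf_sketch(0, 3, smoothing=True))  # 1.477
```

A term that appears in every document gets a plain idf of log10(N/N) = 0, so it contributes nothing to a tf-idf score.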

machine_learning.word_frequency_functions.term_frequency(term: str, document: str) → int

Return the number of times a term occurs within a given document.

@params : term, the term to search a document for, and document, the document to search within

@returns : an integer representing the number of times a term is found within the document

@examples :

>>> term_frequency("to", "To be, or not to be")
2
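The example returns 2 even though one occurrence is capitalised and the text contains a comma, which suggests case-insensitive matching on punctuation-free words. A rough sketch under that assumption, using the hypothetical name `count_term`:

```python
import string


def count_term(term: str, document: str) -> int:
    """Hypothetical helper: occurrences of term in document."""
    # Assumption: punctuation is dropped and matching ignores case.
    strip_punctuation = str.maketrans("", "", string.punctuation)
    words = document.translate(strip_punctuation).lower().split()
    return words.count(term.lower())


print(count_term("to", "To be, or not to be"))  # 2
```

Splitting on whitespace means only whole words are counted, so "to" would not match inside "tomorrow" in this sketch.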

machine_learning.word_frequency_functions.tf_idf(tf: int, idf: int) → float

Combine the term frequency and inverse document frequency functions to calculate the originality of a term. This 'originality' is calculated by multiplying the term frequency and the inverse document frequency: tf-idf = TF * IDF

@params : tf, the term frequency, and idf, the inverse document frequency

@examples :

>>> tf_idf(2, 0.477)
0.954
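The arithmetic behind this example can be traced end to end. In the sketch below, `tf_idf_sketch` is a hypothetical name, `idf` is annotated as a float because inverse document frequencies are generally fractional, and the three-decimal rounding is an assumption based on the documented output.

```python
from math import log10


def tf_idf_sketch(tf: int, idf: float) -> float:
    """Hypothetical helper: tf-idf = TF * IDF."""
    # Assumption: three-decimal rounding, as in the documented example.
    return round(tf * idf, 3)


# A term found in 1 of 3 documents has idf = log10(3 / 1).
idf = round(log10(3 / 1), 3)
print(idf)                    # 0.477
# The same term occurring twice in one document scores 2 * 0.477.
print(tf_idf_sketch(2, idf))  # 0.954
```

A higher score means the term is frequent in the document but rare across the corpus.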