compression.lz77
================

.. py:module:: compression.lz77

.. autoapi-nested-parse::

   LZ77 compression algorithm

   - lossless data compression algorithm published by Abraham Lempel and Jacob Ziv in 1977
   - also known as LZ1 or sliding-window compression
   - together with LZ78, forms the basis for many variations, including LZW, LZSS, LZMA and others

   It uses a “sliding window” method. The sliding window consists of:

   - search buffer
   - look-ahead buffer

   ``len(sliding_window) = len(search_buffer) + len(look_ahead_buffer)``
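
   The size relationship above can be illustrated with a minimal sketch. The helper
   ``split_window`` and its ``position`` parameter are hypothetical and not part of
   this module; the defaults mirror the constructor defaults of ``LZ77Compressor``.

   .. code-block:: python

      def split_window(text, position, window_size=13, lookahead_buffer_size=6):
          """Split the window around ``position`` into its two buffers (illustrative only)."""
          # The search buffer gets whatever the look-ahead buffer leaves over.
          search_buffer_size = window_size - lookahead_buffer_size
          # Already-processed characters immediately before the current position.
          search_buffer = text[max(0, position - search_buffer_size):position]
          # Characters still to be encoded, starting at the current position.
          look_ahead_buffer = text[position:position + lookahead_buffer_size]
          return search_buffer, look_ahead_buffer


      search, ahead = split_window("cabracadabrarrarrad", 8)
      print(search, ahead)  # abracad abrarr

   With a window of 13 and a look-ahead buffer of 6, the search buffer holds at most
   7 characters, which matches the ``search_buffer_size`` attribute value.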

   LZ77 manages a dictionary that uses triples composed of:

   - Offset: the distance from the current position back to the start of the
     matching phrase in the search buffer.
   - Length: the number of characters that make up the matched phrase.
   - Indicator: the next character to be encoded, which follows the match.

   As a file is parsed, the dictionary is dynamically updated to reflect the compressed
   data contents and size.

   Examples:
   "cabracadabrarrarrad" <-> [(0, 0, 'c'), (0, 0, 'a'), (0, 0, 'b'), (0, 0, 'r'),
                              (3, 1, 'c'), (2, 1, 'd'), (7, 4, 'r'), (3, 5, 'd')]
   "ababcbababaa" <-> [(0, 0, 'a'), (0, 0, 'b'), (2, 2, 'c'), (4, 3, 'a'), (2, 2, 'a')]
   "aacaacabcabaaac" <-> [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a'), (1, 2, 'c')]

   Sources:
   en.wikipedia.org/wiki/LZ77_and_LZ78


Attributes
----------

.. autoapisummary::

   compression.lz77.__author__
   compression.lz77.__version__
   compression.lz77.lz77_compressor


Classes
-------

.. autoapisummary::

   compression.lz77.LZ77Compressor
   compression.lz77.Token


Module Contents
---------------

.. py:class:: LZ77Compressor(window_size: int = 13, lookahead_buffer_size: int = 6)

   Class containing compress and decompress methods that use the LZ77 compression algorithm.


   .. py:method:: _find_encoding_token(text: str, search_buffer: str) -> Token

      Finds the encoding token for the longest match that starts at the first character of the text.

      Tests:
          >>> lz77_compressor = LZ77Compressor()
          >>> lz77_compressor._find_encoding_token("abrarrarrad", "abracad").offset
          7
          >>> lz77_compressor._find_encoding_token("adabrarrarrad", "cabrac").length
          1
          >>> lz77_compressor._find_encoding_token("abc", "xyz").offset
          0
          >>> lz77_compressor._find_encoding_token("", "xyz").offset
          Traceback (most recent call last):
              ...
          ValueError: We need some text to work with.
          >>> lz77_compressor._find_encoding_token("abc", "").offset
          0


   .. py:method:: _match_length_from_index(text: str, window: str, text_index: int, window_index: int) -> int

      Calculate the length of the longest match between text and window, starting
      at text_index in text and window_index in window.

      Args:
          text: text whose characters are being matched
          window: sliding window
          text_index: index of character in text
          window_index: index of character in sliding window

      Returns:
          The length of the longest match between text and window, from the given indexes.

      Tests:
          >>> lz77_compressor = LZ77Compressor(13, 6)
          >>> lz77_compressor._match_length_from_index("rarrad", "adabrar", 0, 4)
          5
          >>> lz77_compressor._match_length_from_index("adabrarrarrad",
          ...     "cabrac", 0, 1)
          1
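
      The doctests above show that a match may run past the end of the window and
      continue into the text itself. A minimal iterative sketch of that idea follows.
      The function ``match_length_sketch`` is hypothetical, and the bound on
      ``len(text)`` is an assumption added here so the loop always terminates.

      .. code-block:: python

         def match_length_sketch(text, window, window_index):
             """Count how many leading characters of ``text`` match from ``window_index``."""
             # Appending the text to the window lets a match overlap into the
             # characters that are currently being encoded.
             combined = window + text
             length = 0
             # Assumes 0 <= window_index < len(window), so the index stays in range.
             while length < len(text) and text[length] == combined[window_index + length]:
                 length += 1
             return length


         print(match_length_sketch("rarrad", "adabrar", 4))        # 5
         print(match_length_sketch("adabrarrarrad", "cabrac", 1))  # 1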


   .. py:method:: compress(text: str) -> list[Token]

      Compress the given string text using the LZ77 compression algorithm.

      Args:
          text: string to be compressed

      Returns:
          output: the compressed text as a list of Tokens

      >>> lz77_compressor = LZ77Compressor()
      >>> str(lz77_compressor.compress("ababcbababaa"))
      '[(0, 0, a), (0, 0, b), (2, 2, c), (4, 3, a), (2, 2, a)]'
      >>> str(lz77_compressor.compress("aacaacabcabaaac"))
      '[(0, 0, a), (1, 1, c), (3, 4, b), (3, 3, a), (1, 2, c)]'
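
      The overall loop can be sketched as a standalone function. This is an
      illustrative sketch, not the class's code: ``compress_sketch`` is a
      hypothetical name, tokens are plain ``(offset, length, indicator)`` tuples, and
      two choices are assumptions made here: equally long matches resolve to the
      smaller offset, and a match stops one character before the end of the text so
      that an indicator character always remains.

      .. code-block:: python

         def compress_sketch(text, window_size=13, lookahead_buffer_size=6):
             """Encode ``text`` as a list of (offset, length, indicator) tuples."""
             search_buffer_size = window_size - lookahead_buffer_size
             tokens = []
             search_buffer = ""
             while text:
                 offset, length = 0, 0
                 combined = search_buffer + text  # lets matches overlap into the text
                 for start, character in enumerate(search_buffer):
                     if character != text[0]:
                         continue
                     found = 0
                     # Assumption: keep the last character free for the indicator.
                     while found < len(text) - 1 and text[found] == combined[start + found]:
                         found += 1
                     # ">=" means a later start (smaller offset) wins a tie.
                     if found and found >= length:
                         offset, length = len(search_buffer) - start, found
                 tokens.append((offset, length, text[length]))
                 # Slide the window: keep only the newest characters in the search buffer.
                 search_buffer = (search_buffer + text[: length + 1])[-search_buffer_size:]
                 text = text[length + 1:]
             return tokens


         print(compress_sketch("ababcbababaa"))
         # [(0, 0, 'a'), (0, 0, 'b'), (2, 2, 'c'), (4, 3, 'a'), (2, 2, 'a')]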


   .. py:method:: decompress(tokens: list[Token]) -> str

      Convert the list of tokens into an output string.

      Args:
          tokens: list containing triplets (offset, length, char)

      Returns:
          output: decompressed text

      Tests:
          >>> lz77_compressor = LZ77Compressor()
          >>> lz77_compressor.decompress([Token(0, 0, 'c'), Token(0, 0, 'a'),
          ... Token(0, 0, 'b'), Token(0, 0, 'r'), Token(3, 1, 'c'),
          ... Token(2, 1, 'd'), Token(7, 4, 'r'), Token(3, 5, 'd')])
          'cabracadabrarrarrad'
          >>> lz77_compressor.decompress([Token(0, 0, 'a'), Token(0, 0, 'b'),
          ... Token(2, 2, 'c'), Token(4, 3, 'a'), Token(2, 2, 'a')])
          'ababcbababaa'
          >>> lz77_compressor.decompress([Token(0, 0, 'a'), Token(1, 1, 'c'),
          ... Token(3, 4, 'b'), Token(3, 3, 'a'), Token(1, 2, 'c')])
          'aacaacabcabaaac'
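
      Decoding a token amounts to copying ``length`` characters from ``offset``
      positions back in the output and then appending the indicator. A minimal
      standalone sketch, using plain tuples instead of ``Token`` objects
      (``decompress_sketch`` is a hypothetical name):

      .. code-block:: python

         def decompress_sketch(tokens):
             """Rebuild the text from (offset, length, indicator) tuples."""
             output = []
             for offset, length, indicator in tokens:
                 # Copy one character at a time so that a match longer than its
                 # offset can reuse characters written earlier in the same copy.
                 for _ in range(length):
                     output.append(output[-offset])
                 output.append(indicator)
             return "".join(output)


         print(decompress_sketch([(0, 0, "a"), (1, 1, "c"), (3, 4, "b"),
                                  (3, 3, "a"), (1, 2, "c")]))
         # aacaacabcabaaac

      The token ``(3, 4, 'b')`` has a length greater than its offset, which is why
      the copy proceeds character by character rather than as a single slice.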


   .. py:attribute:: lookahead_buffer_size
      :value: 6


   .. py:attribute:: search_buffer_size
      :value: 7


   .. py:attribute:: window_size
      :value: 13


.. py:class:: Token

   Dataclass representing a triplet, called a token, consisting of offset, length
   and indicator. This triplet is used during LZ77 compression.


   .. py:method:: __repr__() -> str

      >>> token = Token(1, 2, "c")
      >>> repr(token)
      '(1, 2, c)'
      >>> str(token)
      '(1, 2, c)'
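
      A dataclass along the following lines would produce the representation shown
      above. This is a sketch under the assumption that the fields are declared in
      the order offset, length, indicator; the name ``TokenSketch`` is hypothetical.

      .. code-block:: python

         from dataclasses import dataclass


         @dataclass
         class TokenSketch:
             offset: int
             length: int
             indicator: str

             def __repr__(self) -> str:
                 # The indicator is printed without quotes, as in the doctest above.
                 return f"({self.offset}, {self.length}, {self.indicator})"


         print(repr(TokenSketch(1, 2, "c")))  # (1, 2, c)
         print(str([TokenSketch(0, 0, "a")]))  # [(0, 0, a)]

      Because ``__repr__`` is defined explicitly, the dataclass decorator keeps it,
      and ``str()`` falls back to the same text.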


   .. py:attribute:: indicator
      :type:  str


   .. py:attribute:: length
      :type:  int


   .. py:attribute:: offset
      :type:  int


.. py:data:: __author__
   :value: 'Lucia Harcekova'


.. py:data:: __version__
   :value: '0.1'


.. py:data:: lz77_compressor