Readme

Overview

The algorithm finds quotations between two texts, called the source and the target. If the direction of quotation is known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipses in quotations, e.g.

0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on
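
The two numbers in each line are character offsets into the respective text. The short sketch below checks them against the example strings. It assumes the offsets are half-open ranges (start inclusive, end exclusive), which is inferred from this example and not stated by the project.

```python
# Example texts taken from the overview above.
source = "This is a long Text and the long test goes on and on"
target = "This is a long Text [...] test goes on and on"

# Offsets from the example, read as half-open [start, end) ranges (assumption).
source_span = (0, 52)
target_span = (0, 45)

# Python slicing uses the same half-open convention.
source_quote = source[source_span[0]:source_span[1]]
target_quote = target[target_span[0]:target_span[1]]

print(source_quote == source)  # True: the match covers the whole source text
print(target_quote == target)  # True: the match covers the whole target text
```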

Usage

The algorithm can be used in code or from the command line. The following two sections describe each in turn.

In code

The algorithm can be found in the package lotte. To use it, create a Lotte object, which expects one argument:

  • The length of the shortest matches to be found (default: 5)

Then call the compare method on the object, passing the two texts to be compared. The method returns a list of matches (List[Match]). Each Match stores two MatchSegments, one for the source text and one for the target text. Each MatchSegment stores the character_start_pos and character_end_pos of the matching segment in its text.
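
As a rough illustration of how such a result could be consumed, the sketch below defines stand-in Match and MatchSegment types and recovers the matched text from the offsets. The stand-in classes, the attribute names source_match_segment and target_match_segment (borrowed from the JSON output of the command line interface), and the helper matched_texts are assumptions for illustration. The real classes in the lotte package may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Stand-in types mirroring the structure described above (assumption:
# the real classes in the lotte package may look different).
@dataclass
class MatchSegment:
    character_start_pos: int
    character_end_pos: int


@dataclass
class Match:
    source_match_segment: MatchSegment
    target_match_segment: MatchSegment


def matched_texts(matches: List[Match], source: str, target: str) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (source_text, target_text) for each match."""
    pairs = []
    for match in matches:
        src = match.source_match_segment
        tgt = match.target_match_segment
        pairs.append((
            source[src.character_start_pos:src.character_end_pos],
            target[tgt.character_start_pos:tgt.character_end_pos],
        ))
    return pairs


source = "This is a long Text and the long test goes on and on"
target = "This is a long Text [...] test goes on and on"

# With the real package, this list would come from the compare method of a
# Lotte object; here it is built by hand from the example in the overview.
matches = [Match(MatchSegment(0, 52), MatchSegment(0, 45))]

for source_text, target_text in matched_texts(matches, source, target):
    print(source_text)
    print(target_text)
```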

Command line

The LotteCLI.py file provides a command line interface to the algorithm. Its usage is as follows:

usage: LotteCLI.py [-h] [--text | --no-text] [--output-type {json,text}]
                   [--min-match-length MIN_MATCH_LENGTH]
                   source-file-path target-file-path

LotteCLI allows the user to find quotations in two texts, a source text and a
target text. If known, the source text should be the one that is quoted by the
target text. This allows the algorithm to handle things like ellipses in
quotations.

positional arguments:
  source-file-path      Path to the source text file
  target-file-path      Path to the target text file

optional arguments:
  -h, --help            show this help message and exit
  --text, --no-text     Include matched text in the returned data structure
                        (default: True)
  --output-type {json,text}
  --min-match-length MIN_MATCH_LENGTH
                        The length of the shortest match (>= 3, default: 5)

By default, the result is returned as a JSON structure: List[Match]. Each Match stores two MatchSegments, one for the source text and one for the target text. Each MatchSegment stores the character_start_pos and character_end_pos of the matching segment in its text. For example,

[
  {
    "source_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
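
A minimal sketch of reading this JSON output in Python, using only the standard library. The function name load_spans is hypothetical. The keys are the ones shown in the example above, and the "text" key is treated as optional because it can be excluded from the output.

```python
import json

# JSON output as shown in the example above.
raw = """
[
  {
    "source_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
"""


def load_spans(json_text):
    """Hypothetical helper: return one dict per match with offsets and text."""
    spans = []
    for match in json.loads(json_text):
        src = match["source_match_segment"]
        tgt = match["target_match_segment"]
        spans.append({
            "source": (src["character_start_pos"], src["character_end_pos"]),
            "target": (tgt["character_start_pos"], tgt["character_end_pos"]),
            # "text" may be missing when the text is excluded from the output.
            "source_text": src.get("text"),
            "target_text": tgt.get("text"),
        })
    return spans


for span in load_spans(raw):
    print(span["source"], span["target"])
```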

Alternatively, the result can be printed in a human-readable text format, e.g.:

0	52	This is a long Text and the long test goes on and on
0	45	This is a long Text [...] test goes on and on 

If the matching text is not needed, the --no-text option excludes it from the output.
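
When calling the command line interface from another Python script, the argument list can be assembled as sketched below. The helper build_lotte_command and the file names are hypothetical. The flags are the ones listed in the usage message above, and the check on the minimum match length follows the ">= 3" note in that message.

```python
import sys
from typing import List


def build_lotte_command(
    source_path: str,
    target_path: str,
    min_match_length: int = 5,
    include_text: bool = True,
    output_type: str = "json",
    script: str = "LotteCLI.py",  # assumed to be reachable from the working directory
) -> List[str]:
    """Hypothetical helper: build the argument list for a LotteCLI.py call."""
    if min_match_length < 3:
        raise ValueError("min_match_length must be >= 3")
    if output_type not in ("json", "text"):
        raise ValueError("output_type must be 'json' or 'text'")
    return [
        sys.executable,
        script,
        "--text" if include_text else "--no-text",
        "--output-type", output_type,
        "--min-match-length", str(min_match_length),
        source_path,
        target_path,
    ]


# Hypothetical file names, for illustration only.
command = build_lotte_command("source.txt", "target.txt", include_text=False)
print(command[1:])
```

The resulting list can be passed to subprocess.run(command, capture_output=True, text=True, check=True), and the captured standard output then holds the JSON or text result.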

HTML

The package htmlOut contains code that creates the content of a web page visualizing the result of the algorithm. A functioning website also needs JavaScript and CSS, which are not yet publicly available.

License

  • TBD