Readme
Overview
The algorithm can be used to find quotations in two texts, called source and target. If the direction of quotation is known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipses in quotations, e.g.
0 52 This is a long Text and the long test goes on and on
0 45 This is a long Text [...] test goes on and on
Usage
There are two ways to use the algorithm. The following two sections describe the use of the algorithm in code and from the command line.
In code
The algorithm can be found in the package lotte. To use it, create a Lotte object, which expects one argument:
- The length of the shortest matches to be found (default: 5)
Then call the compare method on the object, which expects the two texts to be compared.
The method returns a list with the following structure: List[Match]. A Match stores two MatchSegments, one for the source text and one for the target text. A MatchSegment stores the character_start_pos and character_end_pos of the matching segment in its text.
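The returned positions can be used to slice the matched passages out of the original texts. The following is a minimal sketch, not the package's own code: the two dataclasses are hypothetical stand-ins modeled on the description above, the attribute names source_match_segment and target_match_segment are borrowed from the JSON output of the command line interface, and the match list is built by hand instead of calling compare. It also assumes that the end position is exclusive, which the example values (0 to 52 for a 52-character passage) suggest.

```python
from dataclasses import dataclass
from typing import List, Tuple


# Hypothetical stand-ins that mirror the structure described above;
# the real classes in the lotte package may be defined differently.
@dataclass
class MatchSegment:
    character_start_pos: int
    character_end_pos: int


@dataclass
class Match:
    source_match_segment: MatchSegment
    target_match_segment: MatchSegment


def segment_text(text: str, segment: MatchSegment) -> str:
    # Assumes the end position is exclusive, as the example (0, 52)
    # for a 52-character passage suggests.
    return text[segment.character_start_pos:segment.character_end_pos]


def quoted_passages(source: str, target: str, matches: List[Match]) -> List[Tuple[str, str]]:
    """Return one (source passage, target passage) pair per match."""
    return [
        (segment_text(source, m.source_match_segment),
         segment_text(target, m.target_match_segment))
        for m in matches
    ]


source = "This is a long Text and the long test goes on and on"
target = "This is a long Text [...] test goes on and on"
# Built by hand here; in practice this list would come from compare().
matches = [Match(MatchSegment(0, 52), MatchSegment(0, 45))]

for src, tgt in quoted_passages(source, target, matches):
    print(src)
    print(tgt)
```

Keeping only positions in the match objects means the matched text never has to be stored twice; it can always be recovered from the original strings.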
Command line
The LotteCLI.py file provides a command line interface to the algorithm. It is called as follows:
usage: LotteCLI.py [-h] [--text | --no-text] [--output-type {json,text}]
[--min-match-length MIN_MATCH_LENGTH]
source-file-path target-file-path
LotteCLI allows the user to find quotations in two texts, a source text and a
target text. If known, the source text should be the one that is quoted by the
target text. This allows the algorithm to handle things like ellipses in
quotations.
positional arguments:
  source-file-path      Path to the source text file
  target-file-path      Path to the target text file
optional arguments:
  -h, --help            show this help message and exit
  --text, --no-text     Include matched text in the returned data structure
                        (default: True)
  --output-type {json,text}
  --min-match-length MIN_MATCH_LENGTH
                        The length of the shortest match (>= 3, default: 5)
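For readers who want to wrap or reimplement the interface, the options above can be declared with the standard argparse module. This is a sketch of how such a parser could look, not the actual source of LotteCLI.py: the function names build_parser and min_match_length are made up for illustration, and the json default for --output-type is taken from the description of the default output. It requires Python 3.9 or later for BooleanOptionalAction.

```python
import argparse


def min_match_length(value: str) -> int:
    # Reject values below the documented minimum of 3.
    number = int(value)
    if number < 3:
        raise argparse.ArgumentTypeError("must be >= 3")
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="LotteCLI.py")
    parser.add_argument("source_file_path", metavar="source-file-path",
                        help="Path to the source text file")
    parser.add_argument("target_file_path", metavar="target-file-path",
                        help="Path to the target text file")
    # BooleanOptionalAction generates both --text and --no-text.
    parser.add_argument("--text", action=argparse.BooleanOptionalAction, default=True,
                        help="Include matched text in the returned data structure")
    parser.add_argument("--output-type", choices=["json", "text"], default="json")
    parser.add_argument("--min-match-length", type=min_match_length, default=5,
                        help="The length of the shortest match (>= 3, default: 5)")
    return parser


args = build_parser().parse_args(
    ["source.txt", "target.txt", "--no-text", "--min-match-length", "4"]
)
print(args.source_file_path, args.target_file_path,
      args.text, args.output_type, args.min_match_length)
```

Validating the minimum length in the type function lets argparse report an out-of-range value as an ordinary usage error.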
By default, the result is returned as a JSON structure: List[Match]. A Match stores two MatchSegments, one for the source text and one for the target text. A MatchSegment stores the character_start_pos and character_end_pos of the matching segment in its text and, unless --no-text is given, the matched text itself.
For example,
[
  {
    "source_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
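The JSON output can be consumed with the standard json module. The sketch below parses the example above and flattens each match into one line per segment, source first. The helper to_text_lines is a made-up name and not part of the package, the single-space separator is an assumption, and so is the guess that the text key is simply absent when --no-text is used.

```python
import json

# The example output from above, as it might be captured from the CLI.
cli_output = """
[
  {
    "source_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_match_segment": {
      "character_start_pos": 0,
      "character_end_pos": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
"""


def to_text_lines(matches):
    """Flatten parsed matches into 'start end text' lines, source first."""
    lines = []
    for match in matches:
        for key in ("source_match_segment", "target_match_segment"):
            segment = match[key]
            parts = [str(segment["character_start_pos"]),
                     str(segment["character_end_pos"])]
            # Assumption: with --no-text the "text" key is absent.
            if "text" in segment:
                parts.append(segment["text"])
            lines.append(" ".join(parts))
    return lines


for line in to_text_lines(json.loads(cli_output)):
    print(line)
```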
Alternatively, with --output-type text, the result is printed in a human-readable text format, e.g.:
0 52 This is a long Text and the long test goes on and on
0 45 This is a long Text [...] test goes on and on
If the matching text is not needed, the --no-text option excludes it from the output.
HTML
The package htmlOut contains code to create the content for a web page that visualizes the result of the algorithm. A functioning website also needs JavaScript and CSS, which are not yet publicly available.
License
- TBD