Merge branch 'release/v.2.0.0-beta01'

Commit 84c09b03 authored 2 years ago by Frederik Arnold
--- a/.gitignore
+++ b/.gitignore
@@ -3,6 +3,6 @@ venv/
 __pycache__/
 *.pyc
 .coverage
-Lotte.egg-info/
+Quid.egg-info/
 dist/
 build/
\ No newline at end of file
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -14,9 +14,9 @@ cache:
 stages:
  - test
  - badge
-  - upload
-  - release
-  - zenodo
+#  - upload
+#  - release
+#  - zenodo

 test:
  stage: test
@@ -28,7 +28,7 @@ test:
    - pip install -r requirements.txt
    - pip install coverage
  script:
-    - coverage run --source=lotte,match,key_passager,visualization -m unittest discover -p 'Test*.py'
+    - coverage run --source=quid -m unittest discover -p 'Test*.py'
    - coverage report -m

 badge:
@@ -45,35 +45,35 @@ badge:
  only:
    - tags

-upload:
-  stage: upload
-  script:
-    - pip install twine
-    - cat $PYPIRC > /tmp/.pypirc
-    - python -m pip install --upgrade build
-    - python -m build
-    - python -m twine upload --repository pypi dist/* --config-file /tmp/.pypirc
-  only:
-    - tags
+#upload:
+#  stage: upload
+#  script:
+#    - pip install twine
+#    - cat $PYPIRC > /tmp/.pypirc
+#    - python -m pip install --upgrade build
+#    - python -m build
+#    - python -m twine upload --repository pypi dist/* --config-file /tmp/.pypirc
+#  only:
+#    - tags

-release_job:
-  stage: release
-  image: registry.gitlab.com/gitlab-org/release-cli:latest
-  script:
-    - echo $CI_COMMIT_TAG
-  release:
-    tag_name: $CI_COMMIT_TAG
-    name: 'Lotte $CI_COMMIT_TAG'
-    description: 'Release $CI_COMMIT_TAG'
-  only:
-    - tags
+#release_job:
+#  stage: release
+#  image: registry.gitlab.com/gitlab-org/release-cli:latest
+#  script:
+#    - echo $CI_COMMIT_TAG
+#  release:
+#    tag_name: $CI_COMMIT_TAG
+#    name: 'Quid $CI_COMMIT_TAG'
+#    description: 'Release $CI_COMMIT_TAG'
+#  only:
+#    - tags

-zenodo_upload:
-  stage: zenodo
-  before_script:
-    - pip install requests
-  script:
-    - git archive --format zip --output Lotte-$CI_COMMIT_TAG.zip $CI_COMMIT_TAG
-    - python zenodo.py --access-token $ZENODO_ACCESS_TOKEN --id 6123229 --title Lotte --version $CI_COMMIT_TAG --file-path Lotte-$CI_COMMIT_TAG.zip
-  only:
-    - tags
\ No newline at end of file
+#zenodo_upload:
+#  stage: zenodo
+#  before_script:
+#    - pip install requests
+#  script:
+#    - git archive --format zip --output Quid-$CI_COMMIT_TAG.zip $CI_COMMIT_TAG
+#    - python zenodo.py --access-token $ZENODO_ACCESS_TOKEN --id 6123229 --title Quid --version $CI_COMMIT_TAG --file-path Quid-$CI_COMMIT_TAG.zip
+#  only:
+#    - tags
\ No newline at end of file
--- a/DATA_STRUCTURE_README.md
+++ b/DATA_STRUCTURE_README.md
@@ -5,6 +5,7 @@
 - The extent of a match between source text and target text is always indicated with a start character position, and an end character position.

 ## Building blocks
+- Each building block has a corresponding Python class in the `passager` package. This file gives a high-level description of each class.

 ### SourceSegment
 - One block of text in the source text ranging from `start` to `end`.
@@ -13,7 +14,7 @@
 - A group of overlapping `SourceSegments`

 ### TargetLocation
-- One block of text in the target text ranging from `start` to `end`.
+- One block of text (i.e. a quotation) in the target text ranging from `start` to `end`.

 ### TargetText
 - One scholarly work with a `filename` and a list of `TargetLocations`.
@@ -26,4 +27,20 @@
 - Only used in combination with `CitationSourceLink` to map `CitationSources` to `TargetLocations`.

 ### CitationSourceLink
-- Links a `CitationSource` to a `TargetText` and matched `TargetLocations` via `TargetLocationSelections`.
\ No newline at end of file
+- Links a `CitationSource` to a `TargetText` and matched `TargetLocations` via `TargetLocationSelections`.
+
+## File documentation
+
+### citation_sources.json
+- A list of `CitationSources`. Each `CitationSource` has an ID and a list of `SourceSegments`.
+- A `SourceSegment` has an ID, start and end position, a frequency, a length in tokens and the text of the segment.
+
+### citation_source_links.json
+- A list of `CitationSourceLinks`.
+
+### target_texts.json
+- A list of `TargetTexts`. Each `TargetText` has an ID, a filename and a list of `TargetLocations`.
+- A `TargetLocation` has an ID, a start and end position and the text of the location.
+
+### target_text_location_links.json
+- A list of `TargetTextLocationLinks`.
\ No newline at end of file
--- a/README.md
+++ b/README.md
 # Readme

-Lotte is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.
+Quid is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.

-If you use Lotte or base your work on our code, please cite our paper:
+If you use Quid or base your work on our code, please cite our paper:
 ~~~
 @inproceedings{arnold2021lotte,
  title = {Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works},
@@ -14,7 +14,7 @@ If you use Lotte or base your work on our code, please cite our paper:
 For a preprint, see [Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works](https://amor.cms.hu-berlin.de/~arnolfre/paper/NLP4DH_2021_arnold_lotte_preprint.pdf)

 ## Overview
-Lotte is a tool to find quotations in two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipsis in quotations, e.g.
+Quid is a tool to find quotations in two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipsis in quotations, e.g.
 ~~~
 0	52	This is a long Text and the long test goes on and on
 0	45	This is a long Text [...] test goes on and on
@@ -22,30 +22,30 @@ Lotte is a tool to find quotations in two texts, called source and target. If kn

 ## Installation
 ~~~
-pip install Lotte
+pip install Quid
 ~~~

 ## Usage
 There are two ways to use the algorithm. The following two sections describe the use of the algorithm in code and from the command line.

 ### In code
-The algorithm can be found in the package `lotte`. To use it create a `Lotte` object which expects the following arguments:
+The algorithm can be found in the package `quid`. To use it create a `Quid` object which expects the following arguments:
 - The length of the shortest match (default: 5)
 - The number of tokens to skip when looking backwards (default: 10)
 - The number of tokens to skip when looking ahead (default: 3)
 - The maximum distance in tokens between two matches considered for merging (default: 2)
 - The maximum distance in tokens between two matches considered for merging where the target text contains an ellipsis between the matches (default: 10)

-
 Then call the `compare` method on the object which expects two texts to be compared.
-The method returns a list with the following structure: `List[Match]`. `Match` stores two `MatchSegments`. One for the source text and one for the target text. `MatchSegment` stores the `character_start_pos` and `character_end_pos` for the matching segments in the source and target text.
+The method returns a list with the following structure: `List[Match]`. `Match` stores two `MatchSpans`. One for the source text and one for the target text. `MatchSpan` stores the `start` and `end` character positions for the matching spans in the source and target text.

 ### Command line
-The `lotte compare` command provides a command line interface to the algorithm.
+The `quid compare` command provides a command line interface to the algorithm.

 ~~~
-usage: LotteCLI.py compare [-h] [--text] [--no-text]
-                           [--output-type {json,text}]
+usage: QuidCLI.py compare [-h] [--text] [--no-text]
+                           [--output-type {json,text,csv}]
+                           [--csv-sep CSV_SEP]
                           [--output-folder-path OUTPUT_FOLDER_PATH]
                           [--min-match-length MIN_MATCH_LENGTH]
                           [--look-back-limit LOOK_BACK_LIMIT]
@@ -59,7 +59,7 @@ usage: LotteCLI.py compare [-h] [--text] [--no-text]
                           [--no-keep-ambiguous-matches]
                           source-file-path target-path

-Lotte compare allows the user to find quotations in two texts, a source text
+Quid compare allows the user to find quotations in two texts, a source text
 and a target text. If known, the source text should be the one that is quoted
 by the target text. This allows the algorithm to handle things like ellipsis
 in quotations.
@@ -73,8 +73,9 @@ optional arguments:
  --text                Include matched text in the returned data structure
  --no-text             Don't include matched text in the returned data
                        structure
-  --output-type {json,text}
+  --output-type {json,text,csv}
                        The output type
+  --csv-sep CSV_SEP     output separator for csv (default: '\t')
  --output-folder-path OUTPUT_FOLDER_PATH
                        The output folder path. If this option is set the
                        output will be saved to a file created in the
@@ -109,20 +110,20 @@ optional arguments:
                         Don't keep ambiguous matches
 ~~~

-By default, the result is returned as a json structure: `List[Match]`. `Match` stores two `MatchSegments`. One for the source text and one for the target text. `MatchSegment` stores the `character_start_pos` and `character_end_pos` for the matching segments in the source and target text.
+By default, the result is returned as a json structure: `List[Match]`. `Match` stores two `MatchSpans`. One for the source text and one for the target text. `MatchSpan` stores the `start` and `end` character positions for the matching spans in the source and target text.
 For example,

 ~~~
 [
  {
-    "source_match_segment": {
-      "character_start_pos": 0,
-      "character_end_pos": 52,
+    "source_span": {
+      "start": 0,
+      "end": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
-    "target_match_segment": {
-      "character_start_pos": 0,
-      "character_end_pos": 45,
+    "target_span": {
+      "start": 0,
+      "end": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
@@ -138,16 +139,17 @@ Alternatively, the result can be printed in a human-readable text format, e.g.:

 If the matching text is not needed, the `--no-text` option excludes it from the output.

-## KeyPassager
-The package `key_passager` contains code to extract key passages from the found matches. The resulting data structure is documented in the [data structure readme](DATA_STRUCTURE_README.md).
+## Passager
+The package `passager` contains code to extract key passages from the found matches. The `passage` command produces several json files.
+The resulting data structure is documented in the [data structure readme](DATA_STRUCTURE_README.md).

 ### Usage
 ~~~
-usage: LotteCLI.py keypassage [-h]
+usage: QuidCLI.py passage [-h]
                              source-file-path target-folder-path
                              matches-folder-path output-folder-path

-Lotte keypassage allows the user to extract key passages from the found
+Quid passage allows the user to extract key passages from the found
 matches.

 positional arguments:
@@ -159,24 +161,24 @@ positional arguments:

 ## Visualization
 The package `visualization` contains code to create the content for a web page to visualize the key passages.
-For the website, see [LotteVizEx](/../../../../lottevizex/).
+For a white label version of the website, see [LotteVizEx](https://scm.cms.hu-berlin.de/schluesselstellen/lottevizex).

 ### Usage
 ~~~
-usage: LotteCLI.py visualize [-h] [--title TITLE] [--author AUTHOR]
+usage: QuidCLI.py visualize [-h] [--title TITLE] [--author AUTHOR]
                             [--year YEAR] [--censor]
                             source-file-path target-folder-path
-                             key-passages-folder-path output-folder-path
+                             passages-folder-path output-folder-path

-Lotte visualize allows the user to create the files needed for a website that
-visualizes the lotte algorithm results.
+Quid visualize allows the user to create the files needed for a website that
+visualizes the Quid algorithm results.

 positional arguments:
  source-file-path      Path to the source text file
  target-folder-path    Path to the target texts folder path
-  key-passages-folder-path
+  passages-folder-path
                        Path to the folder with the key passages files, i.e.
-                        the resulting files from lotte keypassage
+                        the resulting files from Quid passage
  output-folder-path    Path to the output folder

 optional arguments:

--- a/lotte/InternalMatchSegment.py
+++ b/lotte/InternalMatchSegment.py
-class InternalMatchSegment:
-    def __init__(self, token_start_pos: int, token_length: int, character_start_pos: int, character_end_pos: int):
-        self.token_start_pos = token_start_pos
-        self.token_length = token_length
-        self.character_start_pos = character_start_pos
-        self.character_end_pos = character_end_pos
-
-    def __str__(self):  # pragma: no cover
-        return "MatchSegment (" + str(self.character_start_pos) + ", " + str(self.character_end_pos) + ")"
--- a/quid/__init__.py
+++ b/quid/__init__.py
--- a/cli/LotteCLI.py
+++ b/cli/LotteCLI.py
--- a/quid/cli/__init__.py
+++ b/quid/cli/__init__.py
--- a/lotte/BestMatch.py
+++ b/lotte/BestMatch.py
@@ -3,7 +3,7 @@ from dataclasses import dataclass

 @dataclass
 class BestMatch:
-    source_token_start_pos: int
-    target_token_start_pos: int
+    source_token_start: int
+    target_token_start: int
    source_length: int
    target_length: int
--- a/lotte/InternalMatch.py
+++ b/lotte/InternalMatch.py
 from dataclasses import dataclass
-from lotte.InternalMatchSegment import InternalMatchSegment
+from quid.core.InternalMatchSpan import InternalMatchSpan


 @dataclass
 class InternalMatch:
-    source_match_segment: InternalMatchSegment
-    target_match_segment: InternalMatchSegment
+    source_match_span: InternalMatchSpan
+    target_match_span: InternalMatchSpan
--- a/quid/core/InternalMatchSpan.py
+++ b/quid/core/InternalMatchSpan.py
+class InternalMatchSpan:
+    def __init__(self, token_start: int, token_length: int, character_start: int, character_end: int):
+        self.token_start_pos = token_start
+        self.token_length = token_length
+        self.character_start = character_start
+        self.character_end = character_end
+
+    def __str__(self):  # pragma: no cover
+        return "MatchSpan (" + str(self.character_start) + ", " + str(self.character_end) + ")"
--- a/lotte/Lotte.py
+++ b/lotte/Lotte.py
--- a/lotte/Text.py
+++ b/lotte/Text.py
--- a/lotte/Token.py
+++ b/lotte/Token.py
--- a/quid/core/__init__.py
+++ b/quid/core/__init__.py
--- a/helper/Decoder.py
+++ b/helper/Decoder.py
-from key_passager.CitationSource import CitationSource
-from key_passager.CitationSourceLink import CitationSourceLink
-from key_passager.ImportantSegment import ImportantSegment
-from key_passager.SourceSegment import SourceSegment
-from key_passager.TargetLocationSelection import TargetLocationSelection
-from key_passager.TargetTextLocationLink import TargetTextLocationLink
-from match.Match import Match
-from match.MatchSegment import MatchSegment
+from quid.passager.CitationSource import CitationSource
+from quid.passager.CitationSourceLink import CitationSourceLink
+from quid.passager.ImportantSegment import ImportantSegment
+from quid.passager.SourceSegment import SourceSegment
+from quid.passager.TargetLocationSelection import TargetLocationSelection
+from quid.passager.TargetTextLocationLink import TargetTextLocationLink
+from quid.match.Match import Match
+from quid.match.MatchSpan import MatchSpan


 def json_decoder_match(json_input):
-    if 'source_match_segment' in json_input and 'target_match_segment' in json_input:
-        return Match(json_input['source_match_segment'], json_input['target_match_segment'])
+    if 'source_span' in json_input and 'target_span' in json_input:
+        return Match(json_input['source_span'], json_input['target_span'])
    else:
-        return MatchSegment(json_input['character_start_pos'], json_input['character_end_pos'])
+        return MatchSpan(json_input['start'], json_input['end'])


 def json_decoder_citation_source(json_input):

--- a/helper/Loader.py
+++ b/helper/Loader.py
 import json
-from helper.Decoder import json_decoder_match, json_decoder_citation_source, json_decoder_target_text_location_link, \
+from quid.helper.Decoder import json_decoder_match, json_decoder_citation_source, json_decoder_target_text_location_link, \
    json_decoder_citation_source_link



--- a/quid/helper/__init__.py
+++ b/quid/helper/__init__.py
--- a/match/Match.py
+++ b/match/Match.py
 from dataclasses import dataclass
-from match.MatchSegment import MatchSegment
+from quid.match.MatchSpan import MatchSpan


 @dataclass
 class Match:
-    source_match_segment: MatchSegment
-    target_match_segment: MatchSegment
+    source_span: MatchSpan
+    target_span: MatchSpan
--- a/match/MatchSegment.py
+++ b/match/MatchSegment.py
@@ -2,7 +2,7 @@ from dataclasses import dataclass


 @dataclass
-class MatchSegment:
-    character_start_pos: int
-    character_end_pos: int
+class MatchSpan:
+    start: int
+    end: int
    text: str = ''
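
Note on the renamed JSON schema: the diff above renames the match keys to `source_span`/`target_span` and `start`/`end`. The following is a minimal sketch of how output in the new format could be read back. It defines local dataclasses that mirror `Match` and `MatchSpan` from the diff instead of importing the `quid` package, and `decode_match` is a hypothetical name used only for this illustration. Unlike `json_decoder_match` in the diff, this sketch also carries the optional `text` field over; that is an assumption about what a caller might want, not the behaviour of the committed code.

```python
import json
from dataclasses import dataclass


@dataclass
class MatchSpan:
    # Mirrors the fields of MatchSpan shown in the diff.
    start: int
    end: int
    text: str = ''


@dataclass
class Match:
    # Mirrors the fields of Match shown in the diff.
    source_span: MatchSpan
    target_span: MatchSpan


def decode_match(obj):
    """Hypothetical object_hook for json.loads (illustration only).

    json.loads calls the hook for inner objects first, so both spans are
    already MatchSpan instances when the enclosing match is decoded.
    """
    if 'source_span' in obj and 'target_span' in obj:
        return Match(obj['source_span'], obj['target_span'])
    # Assumption: keep the optional text if present.
    return MatchSpan(obj['start'], obj['end'], obj.get('text', ''))


# Example payload copied from the README section of the diff.
raw = '''
[
  {
    "source_span": {
      "start": 0,
      "end": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_span": {
      "start": 0,
      "end": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
'''

matches = json.loads(raw, object_hook=decode_match)
for match in matches:
    print(match.source_span.start, match.source_span.end, match.source_span.text)
    print(match.target_span.start, match.target_span.end, match.target_span.text)
```

Files written before this commit use the old keys (`source_match_segment`, `character_start_pos`, and so on), so a decoder like this one would raise a `KeyError` on them.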