Commit 84c09b03 authored by Frederik Arnold

Merge branch 'release/v.2.0.0-beta01'

parents 4df448d8 2c787430
Pipeline #37540 passed
344 additions and 284 deletions
......@@ -3,6 +3,6 @@ venv/
__pycache__/
*.pyc
.coverage
Lotte.egg-info/
Quid.egg-info/
dist/
build/
\ No newline at end of file
......@@ -14,9 +14,9 @@ cache:
stages:
- test
- badge
- upload
- release
- zenodo
# - upload
# - release
# - zenodo
test:
stage: test
......@@ -28,7 +28,7 @@ test:
- pip install -r requirements.txt
- pip install coverage
script:
- coverage run --source=lotte,match,key_passager,visualization -m unittest discover -p 'Test*.py'
- coverage run --source=quid -m unittest discover -p 'Test*.py'
- coverage report -m
badge:
......@@ -45,35 +45,35 @@ badge:
only:
- tags
upload:
stage: upload
script:
- pip install twine
- cat $PYPIRC > /tmp/.pypirc
- python -m pip install --upgrade build
- python -m build
- python -m twine upload --repository pypi dist/* --config-file /tmp/.pypirc
only:
- tags
#upload:
# stage: upload
# script:
# - pip install twine
# - cat $PYPIRC > /tmp/.pypirc
# - python -m pip install --upgrade build
# - python -m build
# - python -m twine upload --repository pypi dist/* --config-file /tmp/.pypirc
# only:
# - tags
release_job:
stage: release
image: registry.gitlab.com/gitlab-org/release-cli:latest
script:
- echo $CI_COMMIT_TAG
release:
tag_name: $CI_COMMIT_TAG
name: 'Lotte $CI_COMMIT_TAG'
description: 'Release $CI_COMMIT_TAG'
only:
- tags
#release_job:
# stage: release
# image: registry.gitlab.com/gitlab-org/release-cli:latest
# script:
# - echo $CI_COMMIT_TAG
# release:
# tag_name: $CI_COMMIT_TAG
# name: 'Quid $CI_COMMIT_TAG'
# description: 'Release $CI_COMMIT_TAG'
# only:
# - tags
zenodo_upload:
stage: zenodo
before_script:
- pip install requests
script:
- git archive --format zip --output Lotte-$CI_COMMIT_TAG.zip $CI_COMMIT_TAG
- python zenodo.py --access-token $ZENODO_ACCESS_TOKEN --id 6123229 --title Lotte --version $CI_COMMIT_TAG --file-path Lotte-$CI_COMMIT_TAG.zip
only:
- tags
\ No newline at end of file
#zenodo_upload:
# stage: zenodo
# before_script:
# - pip install requests
# script:
# - git archive --format zip --output Quid-$CI_COMMIT_TAG.zip $CI_COMMIT_TAG
# - python zenodo.py --access-token $ZENODO_ACCESS_TOKEN --id 6123229 --title Quid --version $CI_COMMIT_TAG --file-path Quid-$CI_COMMIT_TAG.zip
# only:
# - tags
\ No newline at end of file
......@@ -5,6 +5,7 @@
- The extent of a match between source text and target text is always indicated with a start character position, and an end character position.
## Building blocks
- Each building block has a corresponding Python class in the `passager` package. This file gives a high-level description of each class.
### SourceSegment
- One block of text in the source text ranging from `start` to `end`.
......@@ -13,7 +14,7 @@
- A group of overlapping `SourceSegments`
### TargetLocation
- One block of text in the target text ranging from `start` to `end`.
- One block of text (i.e. a quotation) in the target text ranging from `start` to `end`.
### TargetText
- One scholarly work with a `filename` and a list of `TargetLocations`.
......@@ -26,4 +27,20 @@
- Only used in combination with `CitationSourceLink` to map `CitationSources` to `TargetLocations`.
### CitationSourceLink
- Links a `CitationSource` to a `TargetText` and matched `TargetLocations` via `TargetLocationSelections`.
\ No newline at end of file
- Links a `CitationSource` to a `TargetText` and matched `TargetLocations` via `TargetLocationSelections`.
## File documentation
### citation_sources.json
- A list of `CitationSources`. Each `CitationSource` has an ID and a list of `SourceSegments`.
- A `SourceSegment` has an ID, start and end position, a frequency, a length in tokens and the text of the segment.
### citation_source_links.json
- A list of `CitationSourceLinks`.
### target_texts.json
- A list of `TargetTexts`. Each `TargetText` has an ID, a filename and a list of `TargetLocations`.
- A `TargetLocation` has an ID, a start and end position and the text of the location.
### target_text_location_links.json
- A list of `TargetTextLocationLinks`.
\ No newline at end of file
# Readme
Lotte is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.
Quid is a tool for quotation detection in texts and can deal with common properties of quotations, for example, ellipses or inaccurate quotations.
If you use Lotte or base your work on our code, please cite our paper:
If you use Quid or base your work on our code, please cite our paper:
~~~
@inproceedings{arnold2021lotte,
title = {Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works},
......@@ -14,7 +14,7 @@ If you use Lotte or base your work on our code, please cite our paper:
For a preprint, see [Lotte and Annette: A Framework for Finding and Exploring Key Passages in Literary Works](https://amor.cms.hu-berlin.de/~arnolfre/paper/NLP4DH_2021_arnold_lotte_preprint.pdf)
## Overview
Lotte is a tool to find quotations in two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipsis in quotations, e.g.
Quid is a tool to find quotations in two texts, called source and target. If known, the source text should be the one that is quoted by the target text. This allows the algorithm to handle things like ellipsis in quotations, e.g.
~~~
0 52 This is a long Text and the long test goes on and on
0 45 This is a long Text [...] test goes on and on
......@@ -22,30 +22,30 @@ Lotte is a tool to find quotations in two texts, called source and target. If kn
## Installation
~~~
pip install Lotte
pip install Quid
~~~
## Usage
There are two ways to use the algorithm. The following two sections describe the use of the algorithm in code and from the command line.
### In code
The algorithm can be found in the package `lotte`. To use it create a `Lotte` object which expects the following arguments:
The algorithm can be found in the package `quid`. To use it create a `Quid` object which expects the following arguments:
- The length of the shortest match (default: 5)
- The number of tokens to skip when looking backwards (default: 10)
- The number of tokens to skip when looking ahead (default: 3)
- The maximum distance in tokens between two matches considered for merging (default: 2)
- The maximum distance in tokens between two matches considered for merging where the target text contains an ellipsis between the matches (default: 10)
Then call the `compare` method on the object which expects two texts to be compared.
The method returns a list with the following structure: `List[Match]`. `Match` stores two `MatchSegments`. One for the source text and one for the target text. `MatchSegment` stores the `character_start_pos` and `character_end_pos` for the matching segments in the source and target text.
The method returns a list with the following structure: `List[Match]`. `Match` stores two `MatchSpans`. One for the source text and one for the target text. `MatchSpan` stores the `start` and `end` character positions for the matching spans in the source and target text.
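As a minimal sketch of how the returned structure can be consumed, the snippet below uses stand-in `Match` and `MatchSpan` dataclasses that mirror the fields described above. It does not import Quid itself, and the helper `span_text` is a hypothetical name introduced for illustration. It assumes that `end` is exclusive, which is consistent with the example spans (0, 52) and (0, 45) shown in the overview.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MatchSpan:
    # Stand-in for Quid's MatchSpan: character offsets into a text.
    start: int
    end: int
    text: str = ''


@dataclass
class Match:
    # Stand-in for Quid's Match: one span per text.
    source_span: MatchSpan
    target_span: MatchSpan


def span_text(full_text: str, span: MatchSpan) -> str:
    """Hypothetical helper: slice the matched characters out of a text.

    Assumes `end` is exclusive, as the overview example suggests.
    """
    return full_text[span.start:span.end]


source_text = "This is a long Text and the long test goes on and on"
target_text = "This is a long Text [...] test goes on and on"

# A hand-built result list, shaped like the List[Match] described above.
matches: List[Match] = [Match(MatchSpan(0, 52), MatchSpan(0, 45))]

for match in matches:
    print(span_text(source_text, match.source_span))
    print(span_text(target_text, match.target_span))
```

Working with character offsets rather than copied strings keeps the result compact and lets the caller decide whether the matched text is needed at all.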
### Command line
The `lotte compare` command provides a command line interface to the algorithm.
The `quid compare` command provides a command line interface to the algorithm.
~~~
usage: LotteCLI.py compare [-h] [--text] [--no-text]
[--output-type {json,text}]
usage: QuidCLI.py compare [-h] [--text] [--no-text]
[--output-type {json,text, csv}]
[--csv-sep CSV_SEP]
[--output-folder-path OUTPUT_FOLDER_PATH]
[--min-match-length MIN_MATCH_LENGTH]
[--look-back-limit LOOK_BACK_LIMIT]
......@@ -59,7 +59,7 @@ usage: LotteCLI.py compare [-h] [--text] [--no-text]
[--no-keep-ambiguous-matches]
source-file-path target-path
Lotte compare allows the user to find quotations in two texts, a source text
Quid compare allows the user to find quotations in two texts, a source text
and a target text. If known, the source text should be the one that is quoted
by the target text. This allows the algorithm to handle things like ellipsis
in quotations.
......@@ -73,8 +73,9 @@ optional arguments:
--text Include matched text in the returned data structure
--no-text Don't include matched text in the returned data
structure
--output-type {json,text}
--output-type {json,text, csv}
The output type
--csv-sep CSV_SEP output separator for csv (default: '\t')
--output-folder-path OUTPUT_FOLDER_PATH
The output folder path. If this option is set the
output will be saved to a file created in the
......@@ -109,20 +110,20 @@ optional arguments:
Don't keep ambiguous matches
~~~
By default, the result is returned as a json structure: `List[Match]`. `Match` stores two `MatchSegments`. One for the source text and one for the target text. `MatchSegment` stores the `character_start_pos` and `character_end_pos` for the matching segments in the source and target text.
By default, the result is returned as a json structure: `List[Match]`. `Match` stores two `MatchSpans`. One for the source text and one for the target text. `MatchSpan` stores the `start` and `end` character positions for the matching spans in the source and target text.
For example,
~~~
[
{
"source_match_segment": {
"character_start_pos": 0,
"character_end_pos": 52,
"source_span": {
"start": 0,
"end": 52,
"text": "This is a long Text and the long test goes on and on"
},
"target_match_segment": {
"character_start_pos": 0,
"character_end_pos": 45,
"target_span": {
"start": 0,
"end": 45,
"text": "This is a long Text [...] test goes on and on"
}
}
......@@ -138,16 +139,17 @@ Alternatively, the result can be printed in a human-readable text format, e.g.:
If the matching text is not needed, the `--no-text` option excludes it from the output.
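The JSON output can be read back into Python objects with the standard library's `json.loads` and an `object_hook`. The following is a sketch under assumptions: the dataclasses are local stand-ins for Quid's `Match` and `MatchSpan`, and `decode_match` is a hypothetical helper, not necessarily the decoder shipped with the package. The `text` key is treated as optional so that output produced with `--no-text` also decodes.

```python
import json
from dataclasses import dataclass


@dataclass
class MatchSpan:
    # Stand-in for Quid's MatchSpan.
    start: int
    end: int
    text: str = ''


@dataclass
class Match:
    # Stand-in for Quid's Match.
    source_span: MatchSpan
    target_span: MatchSpan


def decode_match(obj: dict):
    """Hypothetical object_hook: json calls it for every JSON object,
    innermost first, so spans are built before the enclosing match."""
    if 'source_span' in obj and 'target_span' in obj:
        return Match(obj['source_span'], obj['target_span'])
    # 'text' is absent when the output was produced without matched text.
    return MatchSpan(obj['start'], obj['end'], obj.get('text', ''))


raw = '''
[
  {
    "source_span": {
      "start": 0,
      "end": 52,
      "text": "This is a long Text and the long test goes on and on"
    },
    "target_span": {
      "start": 0,
      "end": 45,
      "text": "This is a long Text [...] test goes on and on"
    }
  }
]
'''

matches = json.loads(raw, object_hook=decode_match)

for m in matches:
    print(m.source_span.start, m.source_span.end,
          m.target_span.start, m.target_span.end)
```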
## KeyPassager
The package `key_passager` contains code to extract key passages from the found matches. The resulting data structure is documented in the [data structure readme](DATA_STRUCTURE_README.md).
## Passager
The package `passager` contains code to extract key passages from the found matches. The `passage` command produces several json files.
The resulting data structure is documented in the [data structure readme](DATA_STRUCTURE_README.md).
### Usage
~~~
usage: LotteCLI.py keypassage [-h]
usage: QuidCLI.py passage [-h]
source-file-path target-folder-path
matches-folder-path output-folder-path
Lotte keypassage allows the user to extract key passages from the found
Quid passage allows the user to extract key passages from the found
matches.
positional arguments:
......@@ -159,24 +161,24 @@ positional arguments:
## Visualization
The package `visualization` contains code to create the content for a web page to visualize the key passages.
For the website, see [LotteVizEx](/../../../../lottevizex/).
For a white label version of the website, see [LotteVizEx](https://scm.cms.hu-berlin.de/schluesselstellen/lottevizex).
### Usage
~~~
usage: LotteCLI.py visualize [-h] [--title TITLE] [--author AUTHOR]
usage: QuidCLI.py visualize [-h] [--title TITLE] [--author AUTHOR]
[--year YEAR] [--censor]
source-file-path target-folder-path
key-passages-folder-path output-folder-path
passages-folder-path output-folder-path
Lotte visualize allows the user to create the files needed for a website that
visualizes the lotte algorithm results.
Quid visualize allows the user to create the files needed for a website that
visualizes the Quid algorithm results.
positional arguments:
source-file-path Path to the source text file
target-folder-path Path to the target texts folder path
key-passages-folder-path
passages-folder-path
Path to the folder with the key passages files, i.e.
the resulting files from lotte keypassage
the resulting files from Quid passage
output-folder-path Path to the output folder
optional arguments:
......
class InternalMatchSegment:
def __init__(self, token_start_pos: int, token_length: int, character_start_pos: int, character_end_pos: int):
self.token_start_pos = token_start_pos
self.token_length = token_length
self.character_start_pos = character_start_pos
self.character_end_pos = character_end_pos
def __str__(self): # pragma: no cover
return "MatchSegment (" + str(self.character_start_pos) + ", " + str(self.character_end_pos) + ")"
This diff is collapsed.
......@@ -3,7 +3,7 @@ from dataclasses import dataclass
@dataclass
class BestMatch:
source_token_start_pos: int
target_token_start_pos: int
source_token_start: int
target_token_start: int
source_length: int
target_length: int
from dataclasses import dataclass
from lotte.InternalMatchSegment import InternalMatchSegment
from quid.core.InternalMatchSpan import InternalMatchSpan
@dataclass
class InternalMatch:
source_match_segment: InternalMatchSegment
target_match_segment: InternalMatchSegment
source_match_span: InternalMatchSpan
target_match_span: InternalMatchSpan
class InternalMatchSpan:
def __init__(self, token_start: int, token_length: int, character_start: int, character_end: int):
self.token_start_pos = token_start
self.token_length = token_length
self.character_start = character_start
self.character_end = character_end
def __str__(self): # pragma: no cover
return "MatchSpan (" + str(self.character_start) + ", " + str(self.character_end) + ")"
This diff is collapsed.
File moved
File moved
from key_passager.CitationSource import CitationSource
from key_passager.CitationSourceLink import CitationSourceLink
from key_passager.ImportantSegment import ImportantSegment
from key_passager.SourceSegment import SourceSegment
from key_passager.TargetLocationSelection import TargetLocationSelection
from key_passager.TargetTextLocationLink import TargetTextLocationLink
from match.Match import Match
from match.MatchSegment import MatchSegment
from quid.passager.CitationSource import CitationSource
from quid.passager.CitationSourceLink import CitationSourceLink
from quid.passager.ImportantSegment import ImportantSegment
from quid.passager.SourceSegment import SourceSegment
from quid.passager.TargetLocationSelection import TargetLocationSelection
from quid.passager.TargetTextLocationLink import TargetTextLocationLink
from quid.match.Match import Match
from quid.match.MatchSpan import MatchSpan
def json_decoder_match(json_input):
if 'source_match_segment' in json_input and 'target_match_segment' in json_input:
return Match(json_input['source_match_segment'], json_input['target_match_segment'])
if 'source_span' in json_input and 'target_span' in json_input:
return Match(json_input['source_span'], json_input['target_span'])
else:
return MatchSegment(json_input['character_start_pos'], json_input['character_end_pos'])
return MatchSpan(json_input['start'], json_input['end'])
def json_decoder_citation_source(json_input):
......
import json
from helper.Decoder import json_decoder_match, json_decoder_citation_source, json_decoder_target_text_location_link, \
from quid.helper.Decoder import json_decoder_match, json_decoder_citation_source, json_decoder_target_text_location_link, \
json_decoder_citation_source_link
......
from dataclasses import dataclass
from match.MatchSegment import MatchSegment
from quid.match.MatchSpan import MatchSpan
@dataclass
class Match:
source_match_segment: MatchSegment
target_match_segment: MatchSegment
source_span: MatchSpan
target_span: MatchSpan
......@@ -2,7 +2,7 @@ from dataclasses import dataclass
@dataclass
class MatchSegment:
character_start_pos: int
character_end_pos: int
class MatchSpan:
start: int
end: int
text: str = ''