Update README.md

Merged Andreas Vollmer requested to merge laudaadm-master-patch-36913 into master
1 file changed, +18 -18
@@ -38,7 +38,7 @@ python3 runmefirst.py
-##LAUDATIO DATA PIPELINE XPATH MAPPING
+## LAUDATIO DATA PIPELINE XPATH MAPPING ##
The Laudatio Data Pipeline is an XML-based transformation pipeline that processes any XML validating against its schema or DTD into a denormalized index in JSON format for the Elasticsearch search engine.
@@ -48,7 +48,7 @@ This mapping is not subject to an automated process as it is a mapping on the se
With the XPATH mapping in place, the pipeline first generates an Elasticsearch index template, if one is not already present. The template informs the creation of the indexes, which in turn hold the indexed documents.
-###The XPATH Mapping
+### The XPATH Mapping ###
The general format of the XPATH mapping is as follows:
```
@@ -80,7 +80,7 @@ The field names and their corresponding paths to the corresponding xml content i
By default, the XML data is denormalized into a flat document structure in Elasticsearch that contains the specific data of interest, repeating the common or bibliographic data and metadata for each document.
The loop_mapping directive specifies at what level of granularity the core data for each indexed document is derived from the XML.
-### The Loop Mapping
+### The Loop Mapping ###
The loop mapping directive is an array of directives that tells the pipeline how to parse the XML.
The array has two slots:
@@ -88,7 +88,7 @@ The array has two slots:
2. The second slot defines an element where the parser picks up data / metadata that is common to all denormalized documents, e.g. bibliographical data.
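
The effect of the two slots can be pictured with a minimal Python sketch. It is not the pipeline's implementation: the sample XML, the `denormalize` helper and its parameters are assumptions made for illustration, the first slot is assumed to name the element that is looped over, and the standard-library `xml.etree.ElementTree` stands in for whatever parser the pipeline actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the README.
SAMPLE = """
<doc>
  <head><title>Main title</title><date>2019-05-01</date></head>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</doc>
"""


def denormalize(xml_text, loop_tag, common_tag):
    """Return one flat dict per loop element, each repeating the common data."""
    root = ET.fromstring(xml_text)

    # Second slot: data shared by every document (e.g. bibliographic data).
    common = {}
    common_el = root.find(f".//{common_tag}")
    if common_el is not None:
        for child in common_el:
            common[child.tag] = (child.text or "").strip()

    # First slot (assumed): one output document per loop element.
    docs = []
    for el in root.iter(loop_tag):
        doc = dict(common)  # copy, so each document carries its own common data
        doc[loop_tag] = (el.text or "").strip()
        docs.append(doc)
    return docs


docs = denormalize(SAMPLE, "p", "head")
```

Each entry in `docs` is a flat dictionary that holds the text of one `p` element together with a copy of the `head` fields, which mirrors the repetition of common data described above.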
-#### The XML
+#### The XML ####
Considering the following XML, we could express the following loop mapping
```
@@ -155,15 +155,15 @@ Considering the following XML, we could express the following loop mapping
```
-#### Loop Mapping:
+#### Loop Mapping ####
-#####Root element:
+##### Root element #####
```
[p] => the pipeline loops all p elements
```
-#####Parents:
+##### Parents #####
```
[p>section>chapter]
@@ -172,7 +172,7 @@ Considering the following XML, we could express the following loop mapping
the pipeline loops through all p elements, then ascends up through the given parents
-#####Preceding Siblings:
+##### Preceding Siblings ######
```
First sibling preceding root element(s):
[p:pre(section)>section>chapter]
@@ -187,7 +187,7 @@ Sibling before root, filtered by conditional:
=> loop into the preceding sibling filtered by the conditional of the root element
```
-#####Following Siblings:
+##### Following Siblings #####
```
First sibling following root element
@@ -198,7 +198,7 @@ Second sibling following root element
```
-##### Preceding and following siblings
+##### Preceding and following siblings #####
```
Second sibling following root element
@@ -206,7 +206,7 @@ Second sibling following root element
```
-##### Freestanding element
+##### Freestanding element #####
The freestanding element can be any element outside of the loop that contains common data to be appended to each iteration of the loop.
```
@@ -214,7 +214,7 @@ The freestanding element can be any element outside of the loop, which contains
=> the second slot lists the root element, which will be visited and parsed according to the XPATH mapping
```
-#### Date and language mapping
+#### Date and language mapping ####
The date mapping directive makes it possible to express paths to elements that contain dates. Determining dates in a dataset is not trivial, given the many differing formats in use and the consistent inconsistency of the humans writing the XML.
The language mapping helps translate the dates correctly when they contain, e.g., names of days and months:
@@ -222,19 +222,19 @@ A list of language codes, e.g. [‘en’, ‘es’, ‘zh-Hant’]. If locales a
The dates will, when possible, be converted into the **YYYY-MM-DD'T'HH:mm:ss** format
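
As a rough illustration of the target format, the following Python sketch tries a handful of input formats and emits **YYYY-MM-DD'T'HH:mm:ss**. It is only a sketch under stated assumptions: `KNOWN_FORMATS` and `normalize_date` are hypothetical names, the format list is far from complete, and the pipeline itself may rely on a dedicated date-parsing library instead.

```python
from datetime import datetime

# Hypothetical list of input formats; real data will need many more.
# %B matches full month names in the current locale (English by default).
KNOWN_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d",
    "%d.%m.%Y",
    "%d %B %Y",
]


def normalize_date(raw):
    """Return raw as YYYY-MM-DD'T'HH:mm:ss, or None if no format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next candidate format
        return parsed.strftime("%Y-%m-%dT%H:%M:%S")
    return None  # "when possible": unparseable values are left unconverted


converted = normalize_date("01.05.2019")
```

Returning `None` for unrecognized input reflects the "when possible" wording above; what the pipeline does with such values is not stated here.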
-#####Example date mapping
+##### Example date mapping #####
```
"date_mapping" : ["./head/date/text()"]
=> states that rootelement/head/date is a date and should be parsed accordingly
```
-#####Example language mapping
+##### Example language mapping #####
```
"language_mapping" : ["en"]
=> states that names of days and months in the dates are in English and should be translated accordingly
```
-#### To nest or not
+#### To nest or not ####
XML is itself often a nested structure that carries the semantics of a document; its syntax thereby also indirectly expresses relations between the semantics of its elements.
The resulting Elasticsearch index templates and indexes are denormalized and flattened by default, which gives good performance when reading them and when analyzing them for statistics in Kibana, the analysis and visualization tool of the Elastic stack.
@@ -248,7 +248,7 @@ The __nested__ directive is set to **"false"** as default.
"nested": "true"
```
-###Example XPATH Mapping
+### Example XPATH Mapping ###
```
{
@@ -277,7 +277,7 @@ The __nested__ directive is set to **"false"** as default.
}
```
-### Resulting Elasticsearch template
+### Resulting Elasticsearch template ###
```
{
@@ -383,7 +383,7 @@ The __nested__ directive is set to **"false"** as default.
}
```
-### Resulting Elasticsearch document
+### Resulting Elasticsearch document ###
```
{
"_index": "docu-main-title_1",