# DHO Knowledge Graph Data Integration

Automated pipelines and mappings to integrate new data into the Digital Heraldry Knowledge Graph.

The latest Knowledge Graph can be found in `data/rdf-output/2023-11-14_research-dataset`.

Please note: this RDF does not represent the current state of the Digital Heraldry Knowledge Graph, but only the final RDF version of the transformed armorial.dk database.

- Add info on where the current KG dump can be found
## Directory Structure

- `data/`
  - `input/`: New data to be integrated into the Knowledge Graph
  - `rdf-output/`: RDF files created by the transformation pipelines
- `src/`
  - `rdf-mappings/`: Mapping scripts to transform data into RDF
- `config/`: JSON files containing information on how to run the scripts. Each config file has the corresponding script name embedded in its name as well as in its content.
## Changelog

Changes between versions of all ontologies are documented in `CHANGELOG.md`.
## Pipeline

- Visualisation of the complete pipeline with GitHub Mermaid
## Usage

- Add usage instructions when the pipeline is complete
## Current order of mappings

- Automate calling all mappings in a single pipeline (a possible driver is sketched below)

1. Call `map-tblBranch.py`
2. Call `map-tblArmItems.py`
3. Call `merge_rdf_files_into_kg.py` with the RDF files to be merged
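A minimal sketch of how these calls could be chained, assuming the scripts live in `src/rdf-mappings/` and that `merge_rdf_files_into_kg.py` takes the RDF files to merge as positional arguments; the output file names are placeholders:

```python
# Hypothetical pipeline driver: runs the mapping scripts in the order listed above.
# Script locations, output file names and the exact CLI of merge_rdf_files_into_kg.py
# are assumptions, not taken from this repository.
import subprocess

steps = [
    ["python3", "src/rdf-mappings/map-tblBranch.py"],
    ["python3", "src/rdf-mappings/map-tblArmItems.py"],
    ["python3", "src/rdf-mappings/merge_rdf_files_into_kg.py",
     "data/rdf-output/tblBranch.ttl", "data/rdf-output/tblArmItems.ttl"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # abort the pipeline if any step fails
```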
Namespaces
All namespaces must be defined in dho_namespaces.py
. Each mapping script can bind all of these namespaces to its local graph by calling the function bind_namespaces()
.
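A minimal sketch of how a mapping script might use this, assuming `bind_namespaces()` accepts the target `rdflib` graph as its argument (the exact signature is not documented here):

```python
# Minimal sketch; assumes bind_namespaces(graph) binds every namespace defined
# in dho_namespaces.py to the given rdflib Graph.
from rdflib import Graph

from dho_namespaces import bind_namespaces

g = Graph()
bind_namespaces(g)  # all DHO prefixes are then available when serializing g
```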
## Individual Mapping Scripts
### Map descriptions of Coats of Arms to RDF

Uses the descriptions from the OMA table `tblBranch`. Mapping is done by the script `map-tblBranch.py`. The script can be configured through the file `config/config-map-tblBranch.json`. This config file contains:

- `csv_input_path`: Source file from which the coat of arms descriptions shall be mapped.
- `initial_ontology_definitions`: Decides whether classes and properties are defined before adding new data to the knowledge graph. Can be set to a Python file which contains a number of class and property definitions, executed by rdflib. These definitions are then executed in `map-tblBranch.py` before any data is mapped from `tblBranch` (set in `csv_input_path`). If `null` is given as the value for `initial_ontology_definitions`, no classes or properties are added in advance.
- `existing_ontology`: File link to an existing knowledge graph. If set, this KG is loaded before adding any new data. The old data, including UUIDs, is then not overwritten when `map-tblBranch.py` is run.
- `term_mappings`: Mapping table resolving abbreviations for heraldic terms that are used in `tblBranch`.
- `concepts_with_multiple_inheritance`: List of heraldic concepts that are used as synonyms. This is a special case in which a heraldic concept is used in different contexts, e.g. "per bend" may be used as a Pattern as well as an Arrangement. In this case, `new_class_name` has to be differentiated by its subclass, e.g. by transforming it to `ArrangedPerPale` or `PatternedPerPale`.
- `add_metadata`: Boolean value. States whether the metadata defined in `dho_metadata.py` is to be added as ontology metadata.
- `output_files`: List of output files and the corresponding formats into which the results are to be serialized. The first output object in the list is considered preferred and is therefore used by following steps in the pipeline.
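A hypothetical sketch of the structure of `config/config-map-tblBranch.json`; all paths, value conventions, and the shape of the `output_files` entries are placeholders rather than values taken from the repository. Running the snippet prints JSON that can be adapted:

```python
# Hypothetical config for map-tblBranch.py; key names follow the list above,
# all values (paths, output_files shape) are placeholders.
import json

example_config = {
    "csv_input_path": "data/input/tblBranch.csv",
    "initial_ontology_definitions": "initial_definitions.py",  # or None (null) to skip predefined classes/properties
    "existing_ontology": None,  # or a path to a KG whose data (incl. UUIDs) must not be overwritten
    "term_mappings": "data/input/term_mappings.csv",
    "concepts_with_multiple_inheritance": ["per bend"],
    "add_metadata": True,
    "output_files": [
        {"path": "data/rdf-output/tblBranch.ttl", "format": "turtle"},
    ],
}

print(json.dumps(example_config, indent=2))  # prints JSON to paste into the config file
```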
### Map occurrences of coats of arms in manuscripts to RDF

Uses the list of occurrences of coats of arms in manuscripts from the OMA table `tblArmItems`. Mapping is done by the script `map-tblArmItems.py`. The script can be configured through the file `config/config-map-tblArmItems.json`. This config file contains:

- `csv_input_path`: Source file from which the coat of arms descriptions shall be mapped.
- `initial_ontology_definitions`: Decides whether classes and properties are defined before adding new data to the knowledge graph. Can be set to a Python file which contains a number of class and property definitions, executed by rdflib. These definitions are then executed in `map-tblArmItems.py` before any data is mapped from `tblArmItems` (set in `csv_input_path`). If `null` is given as the value for `initial_ontology_definitions`, no classes or properties are added in advance.
- `existing_ontology`: File link to an existing knowledge graph. If set, this KG is loaded before adding any new data. The old data, including UUIDs, is then not overwritten when `map-tblArmItems.py` is run.
- `output_files`: List of output files and the corresponding formats into which the results are to be serialized. The first output object in the list is considered preferred and is therefore used by following steps in the pipeline.
- `include_armcodes`: You can also map only specifically selected manuscripts (identified through `ArmCode`) to RDF. To do so, set `include_armcodes` to a list containing the `ArmCode`s. If you want to map the whole OMA database to RDF, set this to `null`.
- `use_id_mode_for_manuscripts`: Sets whether the URIs of the `dhoo:Manuscript`s should be composed of the letter code from OMA (`ArmCode`) or of numbers (`numerical`).
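The structure mirrors the `tblBranch` config above; a hypothetical sketch of `config/config-map-tblArmItems.json` with its additional keys (all values, including the accepted strings for `use_id_mode_for_manuscripts`, are assumptions):

```python
# Hypothetical config for map-tblArmItems.py; values are placeholders and the accepted
# values for "use_id_mode_for_manuscripts" are assumed, not documented in this README.
example_config = {
    "csv_input_path": "data/input/tblArmItems.csv",
    "initial_ontology_definitions": None,
    "existing_ontology": None,
    "output_files": [{"path": "data/rdf-output/tblArmItems.ttl", "format": "turtle"}],
    "include_armcodes": ["ABC"],  # map only these ArmCodes; None (null) maps the whole OMA database
    "use_id_mode_for_manuscripts": "numerical",
}
```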
If you call `map-tblArmItems.py` with `-t` as a parameter, only a small test dataset is created.
### Merge multiple RDF files into one single Knowledge Graph

Merging is done by the script `merge_rdf_files_into_kg.py`. The input is given as terminal parameters. For more information, call `python3 merge_rdf_files_into_kg.py -h`. The output, and whether the content of an existing graph is to be overwritten, is set in the configuration file `config-merge_rdf_files_into_kg.json`. This config file contains:

- `existing_ontology`: File link to an existing knowledge graph. If set, this KG is loaded before adding any new data. The old data, including UUIDs, is then not overwritten when `merge_rdf_files_into_kg.py` is run.
- `metadata_file`: File link to a table with manuscript metadata. Necessary to create complete IDs for `dhor:CoatOfArmsRepresentation`s and `dhoo:Manuscript`s.
- `output_files`: List of output files and the corresponding formats into which the results are to be serialized. The first output object in the list is considered preferred and is therefore used by following steps in the pipeline.
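A hypothetical sketch of `config-merge_rdf_files_into_kg.json` (paths are placeholders; the RDF files to merge are passed on the command line, not in the config):

```python
# Hypothetical merge config; all paths are placeholders.
example_config = {
    "existing_ontology": None,  # or a path to a KG whose data (incl. UUIDs) must be preserved
    "metadata_file": "data/input/manuscript_metadata.csv",
    "output_files": [{"path": "data/rdf-output/digital-heraldry-kg.ttl", "format": "turtle"}],
}
```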
### Integrate metadata

The script `integrate_manuscript_metadata_into_kg.py` creates entities for the manuscripts in the Knowledge Graph and integrates their metadata. The script is only a preliminary version; its configuration is hard-coded into the script.
### Create ontology documentation

The content of the documentation of all classes and properties is stored as TSV files in `data/input/documentation`. To integrate the whole content of the documentation directory into an RDF file, call the script `update_documentation.py` with the RDF file as a command line parameter (in most cases, this will be `digital-heraldry-ontology.ttl`).
The HTML documentation of the ontology is created with WIDOCO. To create the first version of the HTML documentation (without prior versions), call:

`java -jar widoco-1.4.16-jar-with-dependencies.jar -ontFile "data/rdf-output/digital-heraldry-ontology.ttl" -outFolder documentation -getOntologyMetadata -rewriteAll -htaccess -webVowl -includeAnnotationProperties`
To create additional versions, call:

`java -jar widoco-1.4.16-jar-with-dependencies.jar -ontFile "data/rdf-output/digital-heraldry-ontology.ttl" -outFolder documentation -getOntologyMetadata -rewriteAll -webVowl -includeAnnotationProperties -licensius`
To update an existing documentation while preserving prior versions, call:

- Add command to take versioning into account
- Automate upload of the HTML documentation to the server
## Testing changes

For testing purposes (e.g. to check whether changes made to scripts or ontologies apply correctly in the data), the pipeline in `test-dataset-creation-pipeline.ipynb` may be used. Note that you may have to adapt the file paths in the config files in the `config` directory.
## Important Dependencies

## License

- Add License