Ieben Smessaert, Arthur Vercruysse, Julián Rojas Meléndez, Pieter Colpaert,
ISWC 2025, November 2, 2025
Ghent University – imec – IDLab, Belgium
# Clone the tutorial repository
git clone git@github.com:rdf-connect/nara-weather-forecast-kg-pipeline.git
cd nara-weather-forecast-kg-pipeline/pipeline/resources
# Start the development container
docker compose up -d
# Open a shell inside the container
docker compose exec devbox bash
# Prepare the Python environment for the processor
cd processor/
hatch env create
hatch shell
All of the above commands are run inside the devbox environment.
You will learn the motivation behind RDF-Connect, its conceptual model, architecture, and roadmap, all while setting up, extending, and running an example RDF-Connect pipeline.
A pipeline implementing a knowledge graph lifecycle process, in which weather data (from the Japan Meteorological Agency's forecast API) is collected, transformed into RDF, enriched, validated against a SHACL shape, and published to an RDF graph store.
The tutorial website has a complete description and motivation for the tutorial. There you may also find all the resources you need to follow along, including these slides.
We have prepared a GitHub repository containing a step-by-step guide (split over dedicated branches) that will allow you to start and check the result of any task of the tutorial at any time.
We want to [fetch data from the JMA meteorological forecast API] and [log its contents] to the console.
Run locally or in a containerized environment:
# Build and run the Docker image
cd pipeline/resources
docker compose up -d
# Access the devbox container
docker compose exec devbox bash
cd pipeline/
# You can now run commands like `npm install` or `npx rdfc pipeline.ttl`
# inside the container
Install the orchestrator, runner, and processors:
npm install @rdfc/orchestrator-js
npm install @rdfc/js-runner
npm install @rdfc/http-utils-processor-ts
npm install @rdfc/log-processor-ts
Add the prefixes rdfc, owl, ex
@prefix rdfc: <https://w3id.org/rdf-connect#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix ex: <http://example.org/>.
Declare the RDF-Connect pipeline
<> a rdfc:Pipeline.
Import definition via owl:imports
<> owl:imports <./node_modules/@rdfc/js-runner/index.ttl>.
Attach it to the pipeline declaration
<> a rdfc:Pipeline;
rdfc:consistsOf [
rdfc:instantiates rdfc:NodeRunner;
].
Import definition via owl:imports
<> owl:imports <./node_modules/@rdfc/http-utils-processor-ts/processors.ttl>.
Define the channel
<json> a rdfc:Reader, rdfc:Writer.
Define the processor instantiation
<fetcher> a rdfc:HttpFetch;
rdfc:url "https://www.jma.go.jp/bosai/forecast/data/overview_forecast/290000.json";
rdfc:writer <json>.
Attach the processor to the runner
<> a rdfc:Pipeline;
rdfc:consistsOf [
rdfc:instantiates rdfc:NodeRunner;
rdfc:processor <fetcher> ].
Import definition via owl:imports
<> owl:imports <./node_modules/@rdfc/log-processor-ts/processor.ttl>.
Define the processor instantiation
<logger> a rdfc:LogProcessorJs;
rdfc:reader <json>;
rdfc:level "info";
rdfc:label "output".
Attach the processor to the runner
[ rdfc:instantiates rdfc:NodeRunner;
rdfc:processor <fetcher>, <logger> ].
npx rdfc pipeline.ttl
# or with debug logging:
LOG_LEVEL=debug npx rdfc pipeline.ttl
✅ Solution available in task-1 branch.
Data processing pipelines enable the transformation, integration, and analysis of data from and to various sources and targets.
However, building, managing and reusing these pipelines can be complex and challenging.
Again and again, we faced the challenges of handling the Knowledge Graph lifecycle across multiple domains.
Traditional batch processing systems suffer from latency problems due to the need to collect input data into batches before it can be processed. — Isah, H., et al., A Survey of Distributed Data Stream Processing Frameworks, IEEE Access, 2019
Current real-world data systems often require real-time or near-real-time processing of dynamic data. Stream processing allows for the continuous ingestion and processing of data as it arrives, enabling timely insights and actions.
The ability to execute applications written in different programming languages in an integrated manner offers several advantages:
Scientists want to use provenance data to answer questions such as: Which data items were involved in the generation of a given partial result? or Did this actor employ outputs from one of these two other actors? — Cuevas-Vicentin, V., et al., Scientific Workflows and Provenance: Introduction and Research Opportunities, Datenbank Spektrum, 2012
Provenance is instrumental to activities such as traceability, reproducibility, accountability, and quality assessment. — Herschel, M., et al., A Survey on Provenance: What for? What form? What from?, VLDB, 2017
Prospective provenance—the execution plan—is essentially the workflow itself: it includes a machine-readable specification with the processing steps to be performed and the data and software dependencies to carry out each computation. — Simone, L., et al., Recording provenance of workflow runs with RO-Crate, PLoS ONE, 2024
Common Workflow Language (CWL) is an open standard for describing how to run command line tools and connect them to create workflows.
| Feature | RDF-Connect | CWL |
|---|---|---|
| Streaming support | Event-based design that supports both batch and streaming paradigms | Primarily batch-oriented, although implementation-dependent streaming can be supported (e.g., using named pipes) |
| Polyglot | Supports any language through an add-in libraries approach | Can accommodate polylingual workflows via POSIX CLI interfaces |
| Provenance | Built-in semantic prospective and retrospective provenance tracking based on PROV-O | Retrospective provenance extension available (CWLProv) based on PROV-O |
| Schema expressivity | Full SHACL-based expressivity | Set of defined types and limited constraint definitions |
Workflow Run RO-Crate profiles provide a semantic way to describe workflows including:
A pipeline is described in RDF configuration files:
General definitions in the RDFC ontology:
# Processor class definition
rdfc:Processor a rdfs:Class;
rdfs:subClassOf prov:Activity.
rdfc:implementationOf a rdf:Property;
rdfs:subPropertyOf rdfs:subClassOf.
# Property for JavaScript processors
rdfc:jsImplementationOf a rdf:Property;
rdfs:subPropertyOf rdfc:implementationOf.
# JavaScript Runner definition
<myJSRunner> a rdfc:Runner;
rdfc:handlesSubjectsOf rdfc:jsImplementationOf;
rdfc:command "npx js-runner".
Concrete processor definition:
# Language-specific processor definition
ex:LogProcessorJS rdfc:jsImplementationOf rdfc:Processor;
rdfs:label "Simple Log Processor for JavaScript";
rdfs:comment "Logs incoming messages";
rdfc:entrypoint <./>;
rdfc:file <./lib/util_processors.js>;
rdfc:class "LogProcessor".
# Processor instantiation in pipeline
_:p1 a ex:LogProcessorJS;
...
rdfc:jsImplementationOf rdfs:subPropertyOf rdfs:subClassOf.
ex:LogProcessorJS rdfc:implementationOf rdfc:Processor;
rdfs:subClassOf rdfc:Processor;
rdfs:subClassOf prov:Activity.
_:p1 a rdfc:Processor, prov:Activity.
Each runner and processor comes with a SHACL shape.
These shapes serve as the glue of RDF-Connect:
Currently, runners exist for JavaScript, JVM, and Python.
Example of a runner configuration in RDF (Turtle):
DEBUG=:fetcher npx rdfc pipeline.ttl
JavaScript runner: npm install
JVM runner: build.gradle
Python runner: uv add (or pip install)
Follow along in the GitHub repository.
All tasks are in the README. Each branch is a solution to a task!
open.gent/r/iswc-rdfc-repo
Follow along on branch task-1, or jump to the slides for a recap.
@prefix rdfc: <https://w3id.org/rdf-connect#>.
@prefix owl: <http://www.w3.org/2002/07/owl#>.
@prefix ex: <http://example.org/>.
Start the orchestrator with the configuration file:
npx rdfc pipeline.ttl
Follow along in the GitHub repository.
All tasks are in the README. Each branch is a solution to a task!
open.gent/r/iswc-rdfc-repo
Install the additionally required processors:
npm install @rdfc/file-utils-processors-ts
npm install @rdfc/shacl-processor-ts
npm install @rdfc/sparql-ingest-processor-ts
Add the required dependency to your Gradle build file:
plugins { id 'java' }
repositories {
mavenCentral()
maven { url = uri("https://jitpack.io") }
}
dependencies {
implementation("com.github.rdf-connect:rml-processor-jvm:master-SNAPSHOT:all")
}
tasks.register('copyPlugins', Copy) {
from configurations.runtimeClasspath
into "$buildDir/plugins"
}
Install the jars with gradle copyPlugins.
The jvm-runner itself is downloaded automatically, so no installation is required for it.
If you do not want to use Gradle, you can also download the jars manually and put them in the build/plugins/ folder.
wget 'https://jitpack.io/com/github/rdf-connect/rml-processor-jvm/master-SNAPSHOT/rml-processor-jvm-master-SNAPSHOT-all.jar'
Start the orchestrator with the configuration file:
npx rdfc pipeline.ttl
🛠️ Follow Part 1 in the repo (up to Task 4)
⏰ You have time till lunch
🙋 Ask questions!
Follow along in the GitHub repository.
All tasks are in the README. Each branch is a solution to a task!
open.gent/r/iswc-rdfc-repo
init, transform, and produce methods
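To make these lifecycle methods concrete, below is a minimal, purely illustrative TypeScript sketch. The Reader and Writer interfaces and the method signatures are assumptions for illustration only, not the actual @rdfc/js-runner API; see the template-processor-ts repository for the real interfaces.

// Conceptual sketch only: hypothetical channel types standing in for rdfc:Reader / rdfc:Writer.
interface Reader {
  strings(): AsyncIterable<string>; // stream of incoming messages
}
interface Writer {
  string(msg: string): Promise<void>; // push a message downstream
  close(): Promise<void>;             // signal that no more messages will follow
}

// A processor receives its (SHACL-validated) arguments and implements the
// lifecycle methods the runner invokes: init, transform, and produce.
class ExampleProcessor {
  constructor(private reader: Reader, private writer: Writer) {}

  // init: one-time setup before any data flows (open connections, load resources, ...)
  async init(): Promise<void> {}

  // transform: consume the reader channel and forward results on the writer
  async transform(): Promise<void> {
    for await (const msg of this.reader.strings()) {
      await this.writer.string(msg.toUpperCase()); // placeholder transformation
    }
    await this.writer.close();
  }

  // produce: emit data that does not depend on an incoming channel (e.g., a fetcher)
  async produce(): Promise<void> {}
}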
@prefix rdfc: <https://w3id.org/rdf-connect#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdfc: <https://w3id.org/rdf-connect#>.
@prefix sh: <http://www.w3.org/ns/shacl#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
sh:targetClass links back to the IRI used on the previous slide
Custom class for readers and writers: rdfc:Reader & rdfc:Writer
sh:name links to variable name in code
sh:path links to property in pipeline.ttl
Optional and multiple arguments with sh:minCount and sh:maxCount (sh:maxCount != 1 results in a list of arguments); see the sketch below.
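As a purely illustrative sketch of how such a shape surfaces in processor code, the TypeScript types below are assumptions modelled on the HttpFetch example earlier (the url and writer arguments); the headers argument and the Writer interface are hypothetical and not taken from the actual shape.

// Illustrative only: how SHACL shape arguments could map onto a typed arguments object.
interface Writer { string(msg: string): Promise<void>; } // hypothetical channel type

interface HttpFetchArgs {
  url: string;        // sh:name "url", sh:maxCount 1 -> a single required value
  headers?: string[]; // sh:minCount 0, sh:maxCount != 1 -> an optional list of values
  writer: Writer;     // rdfc:Writer argument, wired to a channel in pipeline.ttl
}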
TypeScript
https://github.com/rdf-connect/template-processor-ts
transform method: consume the reader channel
Set up the pyproject.toml for your pipeline
Configure a specific Python version to get a deterministic path to the dependencies.
Add the rdfc-runner as a dependency.
Install the runner:
uv add rdfc_runner
Import definition via owl:imports
<> owl:imports <./.venv/lib/python3.13/site-packages/rdfc_runner/index.ttl>.
Attach it to the pipeline declaration
<> a rdfc:Pipeline;
rdfc:consistsOf [...], [
rdfc:instantiates rdfc:PyRunner;
rdfc:processor <translator>
].
Install your local processor after hatch build
uv add ../processor/dist/rdfc_translation_processor.tar.gz
Import definition via owl:imports
<> owl:imports <./.venv/lib/python3.13/site-packages/rdfc_translation_processor/processor.ttl>.
Define the channel
<translated> a rdfc:Reader, rdfc:Writer.
Define the processor instantiation
<translator> a rdfc:TranslationProcessor;
rdfc:reader <rdf>;
rdfc:writer <translated>;
... .
Follow along in the GitHub repository.
All tasks are in the README. Each branch is a solution to a task!
open.gent/r/iswc-rdfc-repo
🛠️ Follow Part 2 in the repo (Task 5 - 7)
🙋 Ask questions!
We envision the following development and research directions:
Several initiatives exist for the standardization of workflow metadata:
The WCI aims to foster collaboration and standardization in the field of scientific workflow management. It provides a common framework for describing workflows, their components, and execution metadata. By aligning RDF-Connect with WCI standards, we can enhance interoperability and facilitate the sharing of workflow metadata across different platforms and tools.
WorkflowHub is a platform for sharing and discovering scientific workflows. It provides a repository for workflow definitions, metadata, and execution records.
Dockstore is a platform for sharing reusable and scalable analytical tools and workflows. It supports a variety of workflow languages and provides features for versioning, collaboration, and execution tracking.
The Workflow Run RO-Crates Profile extends the RO-Crate specification to better support the description of workflow executions and their associated metadata. A semantic alignment between RDF-Connect and the Workflow Run RO-Crates Profile will enable seamless integration of workflow execution metadata into RO-Crates, facilitating better reproducibility and sharing of scientific workflows.
OpenMetadata is an open-source metadata management platform that provides a unified view of data assets across an organization. It offers features for data discovery, lineage tracking, and governance.
OpenLineage is an open standard for metadata and lineage collection designed to instrument data pipelines and applications.
Integration of RDF-Connect with systems such as Prometheus. This will allow real-time tracking of pipeline execution, resource utilization, and performance metrics, enabling users to monitor and optimize their workflows effectively.
RDF-Connect extension to support other languages such as:
Remote execution of RDF-Connect Runners beyond CLI. For instance within EOSC (European Open Science Cloud) nodes.
Leverage generative AI capabilities to automate Processor and Pipeline development. Also, provide UI-based pipeline management.
Zero-copy data movement: Integration with Apache Arrow (where possible) to optimize data flow performance and efficiency.
Goal: build a pipeline using software that’s going to be introduced later during ISWC.
Each of you can contribute by wrapping new or existing software as a processor compatible with RDF-Connect.
Once we have a few, we’ll connect them into a shared pipeline and see what we can create together!
Do you know of any software we could try? We’ve spotted some ideas already: Jelly, RDFMutate, pycottas, rdf2vecgpu, or we could even generate data cubes.
We already have plenty of ideas for individual processors — but what about the bigger picture? What could a complete pipeline actually achieve?
Let’s brainstorm: combine analysis, transformation, or visualization — something fun and meaningful that shows what RDF-Connect can do when we link our work together.
👉 Don’t worry about perfection — the goal is to explore, experiment, and have fun connecting ideas.
Please create a GitHub repository for your processor and let us link them together in a pipeline.
Thank you!
We sincerely hope you enjoyed this tutorial
and found it valuable.