Description
This tutorial introduces RDF-Connect (RDFC), a novel, language-agnostic framework for constructing streaming data processing pipelines for RDF and non-RDF data. By leveraging RDF and PROV-O, RDFC enables seamless integration of data processors in multiple programming languages. The tutorial focuses on pipeline construction and processor development, equipping participants to build their own streaming workflows.
Participants will gain hands-on experience with RDFC, learning how to build pipelines by chaining modular components (a.k.a. processors) that perform specific operations on (RDF) data streams. The emphasis lies on implementing custom processors and integrating them into functioning pipelines. Instead of solving a specific data processing problem, the tutorial demonstrates how to structure and manage adaptable (RDF) data workflows across domains and use cases.
To make the learning experience tangible, the tutorial includes a practical project based on a common use case for SEMANTiCS’ audience: creating a live and multilingual RDF knowledge graph using data from the GeoSphere Austria API. This example illustrates how different processors can be combined, such as a REST API client in JavaScript, a Python-based ML model for language translation, a Java-based RML engine, a SHACL validator, and a SPARQL-based triple store update processor.
The expected outcome is a functional pipeline, created by the participants, that integrates both existing and custom components within the RDFC framework. The pipeline will continuously extract weather forecast data from the GeoSphere Austria API and transform it to RDF. It will then validate the data against a predefined schema using a SHACL validator. Next, the custom processor implemented by the participants will perform language-aware transformations based on a machine learning model: it will translate literal objects tagged as German (@de) into English, generating new triples tagged as English (@en). The resulting RDF data will be written to a triple store using a SPARQL-based processor.
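To make this translation step concrete, the following sketch shows what the core of such a processor could look like when operating on an in-memory RDF graph with rdflib. This is a minimal sketch under assumptions: the `translate_de_to_en` helper is a hypothetical stand-in for the machine learning model, and the actual processor built in the tutorial additionally reads from and writes to the pipeline's streaming channels through the RDFC framework.

```python
# Minimal sketch of the translation step (assumes rdflib; translate_de_to_en is
# a hypothetical placeholder for the lightweight ML translation model).
from rdflib import Graph, Literal


def translate_de_to_en(text: str) -> str:
    """Hypothetical stand-in for the local translation model."""
    raise NotImplementedError


def add_english_translations(graph: Graph) -> Graph:
    """For every literal object tagged @de, add a parallel triple tagged @en."""
    new_triples = [
        (s, p, Literal(translate_de_to_en(str(o)), lang="en"))
        for s, p, o in graph
        if isinstance(o, Literal) and o.language == "de"
    ]
    for triple in new_triples:  # add afterwards to avoid mutating while iterating
        graph.add(triple)
    return graph
```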
By the end of the tutorial, participants will be able to:
- Design language-independent, modular data processing pipelines.
- Create custom processors for diverse data processing tasks within RDF-Connect.
- Leverage RDF and PROV-O to document and trace pipeline structure and execution (see the provenance sketch below).
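As an illustration of the last outcome, the snippet below sketches what PROV-O provenance for a single pipeline run could look like, using rdflib's built-in PROV namespace. The resource names are hypothetical, and the RDFC-specific extension of PROV-O presented in the tutorial may use additional or different terms.

```python
# Illustrative PROV-O provenance for one pipeline run (resource names are
# hypothetical; RDFC's PROV-O extension is introduced during the tutorial).
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import PROV, XSD

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.run1, RDF.type, PROV.Activity))
g.add((EX.run1, PROV.used, EX.geosphereForecastData))
g.add((EX.run1, PROV.startedAtTime,
       Literal("2025-09-03T09:00:00", datatype=XSD.dateTime)))
g.add((EX.translatedGraph, PROV.wasGeneratedBy, EX.run1))
print(g.serialize(format="turtle"))
```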
This tutorial is designed to empower researchers, developers, and data practitioners with the skills to build scalable, maintainable, and explainable streaming pipelines using RDF-based technologies.
Motivation
As the Semantic Web community embraces increasingly diverse data sources and application domains, there is a growing need for flexible, interoperable tooling bridging technology, language, and paradigm gaps. RDFC directly addresses this need by providing a language-agnostic framework for building modular, reusable and traceable streaming data pipelines.
Semantic Web workflows often rely on custom tooling in specific languages, leading to brittle, monolithic systems that are difficult to maintain, extend, and reuse. RDFC addresses these challenges by defining a specification that decouples processing logic from implementation language and describes pipeline configurations using SHACL and an extension of PROV-O. This approach makes it easier to combine pipeline components, reason over them, and share them across teams and communities.
Moreover, the importance of provenance is more pressing than ever, especially in the current context of AI-generated content and automated decision-making. RDFC simplifies the publication of machine-readable documentation of data transformations, in alignment with the FAIR principles, enhancing transparency, reproducibility, and trust.
This tutorial aims to fill a critical gap in current Semantic Web tooling by introducing a practical, extensible way to build explainable and modular streaming data pipelines. It is particularly valuable for early-career researchers, practitioners building real-world applications, and anyone seeking to build more interoperable and maintainable data-centric systems.
Format and Schedule
This tutorial is designed as a half-day session as outlined in Table 1. It includes presentations of the conceptual foundations of the RDFC framework and hands-on implementation.
| Session | Topic | Duration |
|---|---|---|
| Morning: Introduction (1:30) | Why RDF-Connect? | 0:30 |
| | RDF-Connect architecture & components | 1:00 |
| Lunch Break | — | — |
| Afternoon: Hands-on (1:30) | Recap: How to implement an RDFC Processor? | 0:10 |
| | Hands-on: Implementing a processor | 0:35 |
| | Recap: How to build and execute an RDFC Pipeline? | 0:10 |
| | Hands-on: Assembling a pipeline | 0:35 |
Table 1: Schedule of the tutorial
The program is structured in two sessions, one in the morning and one in the afternoon, progressively building from a conceptual overview to hands-on development.
The first session starts with an overview of the tutorial’s content and a description of the pipeline participants will build throughout the day. It then presents the motivation and rationale behind RDFC, together with its high-level architecture.
The second session is primarily hands-on and focuses on two activities: (i) processor development, in which participants learn to implement a processor in a language of their choice and configure it for pipeline integration; concretely, they will implement a processor that transforms a stream of RDF triples according to a configured language translation, leveraging a lightweight ML model to perform the translations locally; and (ii) pipeline assembly, in which participants construct a working streaming pipeline that combines existing processors with their own custom-built one to pull data, apply transformations, validate the results, and publish them to a triple store.
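To indicate what performing the translations locally could look like, the snippet below loads a small open translation model through the Hugging Face transformers library. The model choice (Helsinki-NLP/opus-mt-de-en) is an assumption for illustration; the tutorial may rely on a different lightweight model. Such a function could back the hypothetical `translate_de_to_en` helper sketched earlier.

```python
# Illustrative local German-to-English translation with a small MarianMT model
# (the model choice is an assumption; the tutorial may use a different one).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")


def translate_de_to_en(text: str) -> str:
    """Translate a German string to English, entirely on the local machine."""
    return translator(text)[0]["translation_text"]


print(translate_de_to_en("Starker Regen wird erwartet."))
```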
Material
The tutorial is guided by slides, which are shared online. For the hands-on coding sessions, we provide a git repository with separate branches for each of the sequentially completed tasks. This allows participants who are unable to complete a certain task to still start the next one by checking out the corresponding branch. An online specification provides detailed technical descriptions. The slides, specification, and git repository are openly available under the CC BY 4.0 license.
Audience
This tutorial targets intermediate to advanced developers and researchers interested in data processing pipelines using Semantic Web technologies. It is designed to accommodate about 20 participants with a basic understanding of RDF and SHACL, and programming experience in either Python, JavaScript, or Java.
Presenters
This tutorial will be presented by Ieben Smessaert (primary contact: ieben.smessaert@ugent.be), Arthur Vercruysse and Pieter Colpaert from Ghent University–imec.
Ieben Smessaert is a second-year PhD student and one of the main developers and maintainers of the RDF-Connect ecosystem. He actively applies RDFC in real-world use cases and has teaching experience from assisting with the practical sessions of the Web Development course taught by Ruben Verborgh at Ghent University.
Arthur Vercruysse is a third-year PhD student who started his research journey in Knowledge Graph construction and, driven by his love for compilers, now focuses on developer tooling. He is the main architect behind RDFC.
Prof. Pieter Colpaert co-leads the Knowledge on Web Scale (KNoWS) team at Ghent University, where he also teaches the Knowledge Graphs course. He is the editor of both the W3C TREE Community Group reports and the SEMIC LDES specification. With his team, he initiated the work on RDF-Connect.
Requirements
For this tutorial, we require a projector to present our slides and a power outlet to charge our laptops. Both the presenters and the participants need an internet connection.