Building Streaming and Cross-Environment Data Processing Pipelines with RDF-Connect

Tutorial at ISWC 2025, November 2nd 2025


Abstract

RDF-Connect is a novel, language-agnostic framework for building provenance-aware, streaming data pipelines that integrate heterogeneous processors across languages. It aims to facilitate the construction, maintenance, and reusability of modular, interoperable pipelines for complex, semantically rich data workflows. Data processing pipelines are essential for modern data-centric systems, such as knowledge graphs, LLMs, and machine learning systems. Developers and researchers need flexible, interoperable tools for creating multilingual data processing pipelines. To meet this need, we present a comprehensive tutorial that blends conceptual foundations with hands-on experience. Participants will learn how to use RDF-Connect to design and execute reusable, extensible, and transparent streaming pipelines.
Participants will construct a streaming data processing pipeline from real-world data: generating a weather forecast knowledge graph for Nara, Japan, from Japan Meteorological Agency data. They will: (i) construct a machine learning pipeline using processors in multiple programming languages, (ii) create custom data processors for diverse endpoints, and (iii) explore provenance tracking using RDF and the PROV-O ontology. By the end of the tutorial, participants from varied backgrounds, including Python, JavaScript, and Java developers, will gain practical experience in building language-agnostic, semantically rich data processing pipelines. This tutorial not only introduces RDF-Connect but also opens new avenues for interdisciplinary data transformation strategies in Semantic Web research and development.

Contents

Description

This tutorial introduces RDF-Connect (RDFC), a novel, language-agnostic framework for constructing streaming data processing pipelines for RDF and non-RDF data. By leveraging RDF and PROV-O, RDFC enables seamless integration of data processors in multiple programming languages. The tutorial focuses on pipeline construction and processor development, equipping participants to build their own streaming workflows.

Participants will gain hands-on experience with RDFC, learning how to build pipelines by chaining modular components (a.k.a. processors) that perform specific operations on (RDF) data streams. The tutorial focuses on implementing custom processors and integrating them into functioning pipelines. Rather than solving one specific data processing problem, it demonstrates how to structure and manage adaptable (RDF) data workflows across domains and use cases.
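
To make the idea of chaining concrete, the minimal Python sketch below is purely illustrative and is not the RDFC API: it shows a pipeline as a chain of processors, each consuming and producing a data stream. In RDFC, the equivalent processors can be written in different languages and are wired together through an RDF pipeline configuration.

    from typing import Iterator

    def source() -> Iterator[dict]:
        """Produce raw records, e.g. fetched from a REST API."""
        yield {"label": "晴れ", "lang": "ja"}

    def to_statements(records: Iterator[dict]) -> Iterator[str]:
        """Map each record to an (illustrative) RDF statement."""
        for i, record in enumerate(records):
            yield f'<urn:obs:{i}> <urn:ex:label> "{record["label"]}"@{record["lang"]} .'

    def sink(statements: Iterator[str]) -> None:
        """Consume the stream, e.g. write it to a triple store."""
        for statement in statements:
            print(statement)

    # Chaining the processors yields a (single-language) streaming pipeline.
    sink(to_statements(source()))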

To make the learning experience tangible, the tutorial includes a practical project based on a common use case for ISWC’s audience: creating a live, multilingual RDF knowledge graph. In this case, we will be using real data from the Japan Meteorological Agency. This example illustrates how different processors can be combined, such as a REST API client in JavaScript, a Python-based ML model for language translation, a Java-based RML engine for RDF generation, a SHACL validator, and a SPARQL-based triple store update processor.

The expected outcome is a functional pipeline, created by the participants, that integrates both existing and custom components within the RDFC framework. The pipeline will continuously extract weather forecast data from the Japan Meteorological Agency’s API and transform it to RDF. It will then validate the data against a predefined shape using a SHACL validator. Next, a custom processor implemented by the participants will perform language-aware transformations based on a dedicated machine learning model: it will translate literal objects tagged as Japanese (@ja) into English, generating new triples tagged as English (@en). The resulting RDF data will be written into a triple store using a SPARQL-based processor.
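
As a rough illustration of that translation step, the sketch below assumes the rdflib library; translate() is a hypothetical stand-in for the local ML translation model. It adds an English-tagged literal for every Japanese-tagged literal in a graph.

    from rdflib import Graph, Literal

    def translate(text: str) -> str:
        # Hypothetical placeholder for the lightweight local ML translation model.
        return text

    def add_english_literals(graph: Graph) -> Graph:
        new_triples = []
        for s, p, o in graph:
            # Only translate literal objects tagged as Japanese (@ja).
            if isinstance(o, Literal) and o.language == "ja":
                new_triples.append((s, p, Literal(translate(str(o)), lang="en")))
        for triple in new_triples:
            graph.add(triple)
        return graph

In the tutorial, this logic lives inside a custom RDFC processor rather than a standalone function, so that it can operate on the streaming output of the validation step.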

By the end of the tutorial, participants will be able to:

  • Design language-independent, modular data processing pipelines.
  • Create custom processors for diverse data processing tasks within RDF-Connect.
  • Leverage RDF and PROV-O to document and trace pipeline structure and execution.

This tutorial is designed to empower researchers, developers, and data practitioners with the skills to build scalable, maintainable, and explainable streaming pipelines using RDF-based technologies.

Motivation

As the Semantic Web community embraces increasingly diverse data sources and application domains, there is a growing need for flexible, interoperable tooling that bridges technology, language, and paradigm gaps. RDFC directly addresses this need by providing a language-agnostic framework for building modular, reusable, and traceable streaming data pipelines.

Semantic Web workflows often rely on custom tooling in specific languages, leading to brittle, monolithic systems that are difficult to maintain, extend, and reuse. RDFC addresses these challenges by defining a specification that decouples processing logic from implementation language and describes pipeline configurations using SHACL and an extension of PROV-O. This approach makes it easier to combine, reason about, and share pipeline components across teams and communities.

Moreover, the importance of provenance is more pressing than ever, especially in the current context of AI-generated content and automated decision-making. RDFC simplifies the publication of machine-readable documentation of data transformations, in alignment with the FAIR principles, enhancing transparency, reproducibility, and trust.
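
As a rough illustration (using rdflib and example IRIs, not RDFC's actual vocabulary, which extends PROV-O), the provenance of a single pipeline step could be documented along these lines:

    from rdflib import Graph, Namespace
    from rdflib.namespace import PROV, RDF

    EX = Namespace("http://example.org/")

    g = Graph()
    # The translation step is an activity that used the Japanese graph
    # and generated the English-enriched graph.
    g.add((EX.translationRun, RDF.type, PROV.Activity))
    g.add((EX.translationRun, PROV.used, EX.japaneseForecastGraph))
    g.add((EX.translationRun, PROV.wasAssociatedWith, EX.translationProcessor))
    g.add((EX.englishForecastGraph, RDF.type, PROV.Entity))
    g.add((EX.englishForecastGraph, PROV.wasGeneratedBy, EX.translationRun))

    print(g.serialize(format="turtle"))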

This tutorial aims to fill a critical gap in current Semantic Web tooling by introducing a practical, extensible way to build explainable and modular streaming data pipelines. It is particularly valuable for early-career researchers, practitioners building real-world applications, and anyone seeking to build more interoperable and maintainable data-centric systems.

Format and Schedule

This tutorial is designed as a full-day session, as outlined in Table 1. It combines presentations on the conceptual foundations of the RDFC framework with hands-on implementation sessions.

  Session                          | Topic                                         | Start time
  Morning 1: Introduction (1:30)   | What will happen in this tutorial? 🤔         | 9:00
                                   | Let’s get our hands dirty already! 🛠️         | 9:10
                                   | The what and why of RDF-Connect 🎯            | 10:00
  Break
  Morning 2: Architecture (1:30)   | RDF-Connect concepts and architecture ⚙️      | 11:00
                                   | Hands-on: Assembling a pipeline 🔗            | 11:30
  Lunch Break
  Afternoon 1: Roadmap (1:30)      | Hands-on: Implementing a custom processor 🏗️  | 13:30
                                   | What is next for RDF-Connect? 🛫              | 14:30
  Break
  Afternoon 2: Hackathon (1:30)    | Hackathon 🧑‍💻                                 | 15:30

Table 1: Planning of the tutorial

The program is structured into four sessions, two in the morning and two in the afternoon, which progressively build the conceptual overview while keeping participants engaged with hands-on tasks. The day concludes with a collaborative hackathon where participants can apply what they have learned to explore extensions or develop new applications.

The first session starts by presenting the tutorial’s structure, immediately followed by a first hands-on task: assembling a hello-world pipeline. The target example is then introduced, together with a high-level overview of RDFC and its motivation.

The second session gives a deep dive into RDFC’s design and architecture. Participants will then follow a step-by-step guide to setting up and running the target example pipeline, progressively increasing its complexity.

The third session will cover the outlook and roadmap of RDFC and will teach participants how to develop their own custom processors in a language of their choice and configure them for pipeline integration. They will implement a processor that transforms a stream of RDF triples based on a configured language translation, leveraging a lightweight ML model to perform the translations locally. Once implemented, the custom processor will be incorporated into the pipeline built in the previous session.

The fourth session is a hackathon, where all participants work together to either extend the pipeline created in the previous session with new data sources, or build a new pipeline using existing processors to achieve a different goal.

Material

The tutorial is guided by a set of Web slides. For the hands-on coding sessions, we provide a GitHub repository with a separate branch for each sequentially completed task. This allows participants who are unable to complete a certain task to still begin the next one by checking out the corresponding branch. The RDF-Connect website offers an overview and links to complementary resources, such as the online specification, which provides a detailed technical description. The slides, specification, and GitHub repository are openly available under the CC BY 4.0 license.

Audience

This tutorial targets intermediate to advanced developers and researchers interested in data processing pipelines using Semantic Web technologies. It is designed to accommodate about 20 participants with a basic understanding of RDF and SHACL, and programming experience in either Python, JavaScript, or Java.

Presenters

This tutorial will be presented by Arthur Vercruysse (primary contact: arthur.vercruysse@ugent.be), Ieben Smessaert and Julián Rojas from Ghent University – imec.

Arthur Vercruysse is a third-year PhD student who started his research journey with knowledge graph construction and, driven by his love for compilers, now focuses on developer tooling. He is the main architect behind RDFC.

Ieben Smessaert is a second-year PhD student and one of the main developers and maintainers of the RDF-Connect ecosystem. He actively applies RDFC in real-world use cases and has teaching experience, assisting with the practical sessions of the Web Development course taught by Ruben Verborgh at Ghent University.

Julián Rojas is a postdoctoral researcher focused on efficient knowledge graph lifecycle management at Web scale. He also contributes to the development and maintenance of the RDFC ecosystem and applies it in various applied research projects.

Requirements

For this tutorial, we require a projector to present our slides and a power outlet for charging our laptop. Both the presenters and the participants need an internet connection.