Building Streaming and Cross-Environment Data Processing Pipelines with RDF-Connect

Tutorial at SEMANTiCS 2025, September TBD 2025


Abstract

RDF-Connect is a novel, language-agnostic framework for building provenance-aware, streaming data pipelines that integrate heterogeneous processors across programming languages. It aims to facilitate the construction, maintenance, and reuse of modular, interoperable pipelines for complex, semantically rich data workflows. Data processing pipelines are essential for modern data-centric systems, such as knowledge graphs, LLMs, and machine learning systems, and developers and researchers need flexible, interoperable tools for creating multilingual data processing pipelines. To meet this need, we present a comprehensive tutorial that blends conceptual foundations with hands-on experience. Participants will learn how to use RDF-Connect to design and execute reusable, extensible, and transparent streaming pipelines.
Participants will construct a streaming data processing pipeline from real-world data: generating a weather forecast knowledge graph for Vienna, Austria. They will (i) construct a machine learning pipeline using processors in multiple programming languages, (ii) create custom data processors for diverse endpoints, and (iii) explore provenance tracking using RDF and the PROV-O ontology. By the end of the tutorial, participants from varied backgrounds, including Python, JavaScript, and Java developers, will have gained practical experience building language-agnostic, semantically rich data processing pipelines. This tutorial not only introduces RDF-Connect but also opens new avenues for interdisciplinary data transformation strategies in Semantic Web research and development.

Description

This tutorial introduces RDF-Connect (RDFC), a novel, language-agnostic framework for constructing streaming data processing pipelines for RDF and non-RDF data. By leveraging RDF and PROV-O, RDFC enables seamless integration of data processors in multiple programming languages. The tutorial focuses on pipeline construction and processor development, equipping participants to build their own streaming workflows.

Participants will gain hands-on experience with RDFC, learning how to build pipelines by chaining modular components (a.k.a. processors) that perform specific operations on (RDF) data streams. The emphasis lies on implementing custom processors and integrating them into functioning pipelines. Rather than solving one specific data processing problem, the tutorial demonstrates how to structure and manage adaptable (RDF) data workflows across domains and use cases.
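To make the chaining idea concrete before diving into RDFC itself, the sketch below shows the general pattern in plain Python: independent processing steps connected over an asynchronous stream. It is purely illustrative and does not use the actual RDF-Connect processor interface; the function names and record structure are invented for this example.

    import asyncio
    from typing import AsyncIterator

    # Purely illustrative sketch of "chaining processors over a stream";
    # this is NOT the RDF-Connect processor interface.

    async def source() -> AsyncIterator[dict]:
        """Hypothetical source processor: emits raw records as they arrive."""
        for i in range(3):
            await asyncio.sleep(0.1)  # simulate a slow upstream API
            yield {"id": i, "temp_celsius": 20 + i}

    async def to_fahrenheit(stream: AsyncIterator[dict]) -> AsyncIterator[dict]:
        """Hypothetical transformation processor: enriches each record."""
        async for record in stream:
            record["temp_fahrenheit"] = record["temp_celsius"] * 9 / 5 + 32
            yield record

    async def sink(stream: AsyncIterator[dict]) -> None:
        """Hypothetical sink processor: consumes the final stream."""
        async for record in stream:
            print(record)

    # Chain the processors: source -> transform -> sink.
    asyncio.run(sink(to_fahrenheit(source())))

In RDFC, the same separation of concerns applies, except that each processor can live in a different language and the wiring between them is declared in RDF rather than in code.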

To make the learning experience tangible, the tutorial includes a practical project based on a common use case for SEMANTiCS’ audience: creating a live, multilingual RDF knowledge graph using data from the GeoSphere Austria API. This example illustrates how different processors can be combined – such as a REST API client in JavaScript, a Python-based ML model for language translation, a Java-based RML engine, a SHACL validator, and a SPARQL-based triple store update processor.

The expected outcome is a functional pipeline, created by the participants, that integrates both existing and custom components within the RDFC framework. The pipeline will continuously extract weather forecast data from the GeoSphere Austria API and transform it to RDF. It will then validate the data against a predefined schema using a SHACL validator. Next, the custom processor implemented by the participants will perform a language-aware transformation based on a machine learning model: it will translate literal objects tagged as German (@de) into English, generating new triples tagged as English (@en). Finally, the resulting RDF data will be written into a triple store using a SPARQL-based processor.
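The core of that translation step can be sketched in a few lines of Python. The fragment below is illustrative only: it omits the RDF-Connect processor wrapper and streaming I/O, and it assumes rdflib for RDF handling plus the Helsinki-NLP/opus-mt-de-en model from the transformers library as a stand-in for whichever lightweight translation model participants end up using.

    from rdflib import Graph, Literal
    from transformers import pipeline

    # Illustrative sketch of the translation logic only; the RDF-Connect
    # processor wrapper and the streaming input/output are omitted.
    # The model choice is an example, not prescribed by the tutorial.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    def translate_german_literals(graph: Graph) -> Graph:
        """For every literal tagged @de, add an equivalent triple tagged @en."""
        additions = Graph()
        for s, p, o in graph:
            if isinstance(o, Literal) and o.language == "de":
                english_text = translator(str(o))[0]["translation_text"]
                additions.add((s, p, Literal(english_text, lang="en")))
        graph += additions
        return graph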

By the end of the tutorial, participants will be able to:

  • Design language-independent, modular data processing pipelines.
  • Create custom processors for diverse data processing tasks within RDF-Connect.
  • Leverage RDF and PROV-O to document and trace pipeline structure and execution.

This tutorial is designed to empower researchers, developers, and data practitioners with the skills to build scalable, maintainable, and explainable streaming pipelines using RDF-based technologies.

Motivation

As the Semantic Web community embraces increasingly diverse data sources and application domains, there is a growing need for flexible, interoperable tooling that bridges gaps between technologies, languages, and paradigms. RDFC directly addresses this need by providing a language-agnostic framework for building modular, reusable, and traceable streaming data pipelines.

Semantic Web workflows often rely on custom tooling written in specific languages, leading to brittle, monolithic systems that are difficult to maintain, extend, and reuse. RDFC addresses these challenges by defining a specification that decouples processing logic from implementation language and describes pipeline configurations using SHACL and an extension of PROV-O. This approach makes it easier to combine, reason about, and share pipeline components across teams and communities.

Moreover, the need for provenance is more pressing than ever, especially in the current context of AI-generated content and automated decision-making. RDFC simplifies the publication of machine-readable documentation of data transformations, in alignment with the FAIR principles, enhancing transparency, reproducibility, and trust.

This tutorial aims to fill a critical gap in current Semantic Web tooling by introducing a practical, extensible way to build explainable and modular streaming data pipelines. It is particularly valuable for early-career researchers, practitioners building real-world applications, and anyone seeking to build more interoperable and maintainable data-centric systems.

Format and Schedule

This tutorial is designed as a half-day session, as outlined in Table 1. It combines presentations on the conceptual foundations of the RDFC framework with hands-on implementation sessions.

Topic                                               Duration
Morning: Introduction (1:30)
  Why RDF-Connect?                                  0:30
  RDF-Connect architecture & components             1:00
Lunch Break
Afternoon: Hands-on (1:30)
  Recap: How to implement an RDFC Processor?        0:10
  Hands-on: Implementing a processor                0:35
  Recap: How to build and execute an RDFC Pipeline? 0:10
  Hands-on: Assembling a pipeline                   0:35

Table 1: Planning of the tutorial

The program is structured in two sessions, one in the morning and one in the afternoon, progressively building from a conceptual overview to hands-on development.

The first session starts with an overview of the tutorial’s content and a description of the pipeline that participants will build throughout the day. Then, the motivation and rationale of RDFC will be presented, together with its high-level architecture.

The second session is primarily hands-on and focuses on two parts: (i) processor development, where participants learn to implement a processor in a language of their choice and configure it for pipeline integration; concretely, they will implement a processor that transforms a stream of RDF triples according to a configured language translation, leveraging a lightweight ML model to perform the translations locally; and (ii) pipeline assembly, where participants construct a working streaming pipeline, combining existing processors with their own custom-built processor, that pulls data, applies transformations, validates the results, and publishes them to a triple store.
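As a rough illustration of what the final publication step does under the hood (the tutorial itself relies on an existing SPARQL-based RDFC processor for this), the snippet below pushes a graph of translated triples to a SPARQL 1.1 update endpoint over HTTP. The endpoint URL is a placeholder, and rdflib 6+ (where serialize returns a string) is assumed.

    import requests
    from rdflib import Graph

    # Illustrative only: the tutorial uses an existing RDF-Connect processor
    # for this step. The endpoint URL below is a placeholder.
    SPARQL_UPDATE_ENDPOINT = "http://localhost:3030/weather/update"

    def publish(graph: Graph) -> None:
        """Insert all triples of the graph into the triple store."""
        # N-Triples is valid inside INSERT DATA, since it is a subset of Turtle.
        update_query = "INSERT DATA { %s }" % graph.serialize(format="nt")
        response = requests.post(
            SPARQL_UPDATE_ENDPOINT,
            data=update_query.encode("utf-8"),
            headers={"Content-Type": "application/sparql-update"},
        )
        response.raise_for_status()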

Material

The tutorial is guided by slides, which are shared online. For the hands-on coding sessions, we provide a git repository with separate branches for all sequentially completed tasks. This allows participants who are unable to complete a certain task to still begin the next task by checking out the corresponding branch. An online specification with detailed technical descriptions is also provided. The slides, specification, and git repository are openly available to everyone under the CC BY 4.0 license.

Audience

This tutorial targets intermediate to advanced developers and researchers interested in data processing pipelines using Semantic Web technologies. It is designed to accommodate about 20 participants with a basic understanding of RDF and SHACL, and programming experience in Python, JavaScript, or Java.

Presenters

This tutorial will be presented by Ieben Smessaert (primary contact: ieben.smessaert@ugent.be), Arthur Vercruysse and Pieter Colpaert from Ghent University–imec.

Ieben Smessaert is a second-year PhD student and one of the main developers and maintainers of the RDF-Connect ecosystem. He actively applies RDFC in real-world use cases and has teaching experience from assisting with the practical sessions of the Web Development course taught by Ruben Verborgh at Ghent University.

Arthur Vercruysse is a third-year PhD student who started his research journey with Knowledge Graph construction and, driven by his love for compilers, now focuses on developer tooling. He is the main architect behind RDFC.

Prof. Pieter Colpaert co-leads the Knowledge on Web Scale (KNoWS) team at Ghent University, where he also teaches the Knowledge Graphs course. He is the editor of both the W3C TREE Community Group reports and the SEMIC LDES specification. With his team, he initiated the work on RDF-Connect.

Requirements

For this tutorial, we require a projector to present our slides and power outlets for charging our laptops. Both the presenters and the participants will need an internet connection.