1. Introduction
RDF-Connect is a modular framework for building and executing multilingual data processing pipelines using RDF as the configuration and orchestration layer.
It enables fine-grained, reusable processor components that exchange streaming data, allowing workflows to be described declaratively across programming languages and environments. RDF-Connect is especially well suited for data transformation, integration, and linked data publication.
2. Usage Paths
Depending on your use case, you may only need a subset of this specification:
- Pipeline Authors: Start with § 5 Getting Started and focus on § 4.1 Pipeline.
- Processor Developers: Read § 4.2 Processor and § 4.3 Runner.
- Platform Maintainers: Read all sections, including implementation notes.
3. Design Goals
RDF-Connect is designed to be language-agnostic, enabling seamless integration of processor components written in diverse programming languages such as JavaScript, Python, and shell scripts. This flexibility allows developers to leverage existing tools and libraries within their language of choice, avoiding the constraints of monolithic, single-language frameworks.
A key architectural principle of RDF-Connect is that it operates in a streaming-by-default mode. Data flows between processors via a central orchestrator, supporting real-time and large-scale data processing scenarios. This streaming model ensures efficient memory usage and enables continuous transformation and publication of data as it is ingested.
The configuration of pipelines, processors, inputs, and outputs is expressed semantically using RDF. This use of semantic configuration promotes clarity, extensibility, and interoperability by describing the structure and behavior of the system in a machine-readable, standards-based format.
Processor components in RDF-Connect are designed for reusability. A processor defined once can be reused across multiple pipelines and in varying contexts, reducing duplication and encouraging modular design practices. This modularity fosters rapid prototyping and easier maintenance of complex workflows.
Transparency is a fundamental goal of the framework. By leveraging RDF vocabularies to describe pipeline components and their interactions, RDF-Connect makes it easy to inspect, document, and reason about the behavior and structure of a pipeline, both during development and after deployment.
Finally, RDF-Connect provides native support for provenance tracking using the PROV-O ontology. Each step in the pipeline can be traced to its source, enabling detailed lineage tracking. This capability is especially critical in applications involving data quality, reproducibility, and compliance.
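As a sketch of the kind of lineage PROV-O can express (the exact triples RDF-Connect emits are implementation-specific and not specified in this section; the ex: resources are hypothetical), a derived dataset can be traced back to the step and source that produced it:

@prefix prov: <http://www.w3.org/ns/prov#> .

# Hypothetical lineage: the output entity was generated by a pipeline
# step (a prov:Activity) that used the input entity.
ex:outputData a prov:Entity ;
    prov:wasGeneratedBy ex:transformStep ;
    prov:wasDerivedFrom ex:inputData .

ex:transformStep a prov:Activity ;
    prov:used ex:inputData .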
4. Concepts
This section introduces the core concepts underpinning the RDF-Connect framework. These concepts collectively define the modular, streaming, and language-agnostic architecture that enables RDF-Connect to support sophisticated, provenance-aware data pipelines.
4.1. Pipeline
At the heart of RDF-Connect is the notion of a pipeline, which defines a structured sequence of processing steps. Each step corresponds to a processor, configured with parameters and paired with an appropriate runner that governs its execution. Pipelines specify the flow of data between processors using readers and writers, forming a streaming architecture in which data is continuously passed along and transformed. The pipeline configuration itself is expressed in RDF, making it semantically explicit and machine-interpretable.
4.2. Processor
A processor is a modular, reusable software component that performs a discrete data processing task. In a typical scenario, a processor receives input data through a reader, transforms it according to its logic, and emits the result via a writer. However, processors are also flexible enough to support single-directional tasks, such as those that only produce or only consume data. Crucially, processors are implementation-agnostic --- they can be written in any programming language and integrated into pipelines via language-specific runners. This makes processors the building blocks of RDF-Connect’s cross-language interoperability.
4.3. Runner
A runner is responsible for executing a processor on behalf of the orchestrator. Each runner targets a specific language or execution environment, enabling processors written in different languages to participate seamlessly in a pipeline. For example, a JavaScript processor would be executed using a NodeRunner, which knows how to initialize and manage JavaScript-based components. In this sense, a runner serves as an execution strategy, abstracting away the platform-specific details of launching and interacting with a processor.
4.4. Orchestrator
The orchestrator is the central component that interprets and runs RDF-Connect pipelines. It parses the RDF-based configuration, initializes and assigns runners, instantiates processors, and manages the data flow between components. Acting as the conductor of the system, the orchestrator ensures that processors execute in the correct order and that data is passed efficiently and correctly through the pipeline. Its role is fundamental to realizing the streaming and semantic integration goals of RDF-Connect.
4.5. Reader / Writer
Readers and writers provide the streaming interfaces that connect processors within a pipeline. A writer streams data out from a processor, while a reader receives data into a processor. Together, they define how data flows between pipeline steps in an idiomatic and composable way. This separation of concerns allows for flexible data routing and makes it easy to compose and recombine processors across different pipeline configurations.
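For example, a single channel can connect one processor's writer to another processor's reader. The sketch below is illustrative only: it reuses the rdfc:input and rdfc:output terms from the examples in § 6, while the processor and channel names are hypothetical.

# Illustrative sketch (hypothetical names): <channel1> carries data
# from ex:producer's writer to ex:consumer's reader.
ex:producer rdfc:output <channel1> .  # writer side: streams data out
ex:consumer rdfc:input <channel1> .   # reader side: receives data in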
5. Getting Started
This section provides a high-level overview of how to define and run a pipeline in RDF-Connect. The rest of the specification provides detail on how each part works.
Here is a basic example:
ex:pipeline a rdfc:Pipeline ;
    rdfc:instantiates ex:myRunner ;
    rdfc:processor ex:myProcessor .
Once a pipeline is fully described using RDF, it is handed over to the orchestrator. The orchestrator parses the configuration, resolves all runner and processor definitions, and initiates execution.
6. SHACL as Configuration Schema
RDF-Connect uses SHACL [shacl] not only as a data validation mechanism but also as a schema language for defining the configuration interface of components such as processors and runners. These SHACL shapes enable:
- Static validation of component descriptions.
- Programmatic extraction of configuration contracts.
- Type-safe interpretation in environments like JavaScript/TypeScript.
Shapes define required and optional configuration properties, which are transformed into JSON objects at startup according to the pipeline configuration.
SHACL shape defining some required configuration for a processor
[] a sh:NodeShape ;
    sh:targetClass <FooBar> ;
    sh:property [
        sh:name "repeat" ;        # JSON field name
        sh:datatype xsd:integer ; # specify datatype
        sh:maxCount 1 ;
        sh:path :repeat ;
    ], [
        sh:name "messages" ;
        sh:datatype xsd:string ;
        sh:path :msg ;
    ] .
Processor configuration
<MyProcessor> a <FooBar> ;
    :repeat 10 ;
    :msg "Hello", "World" .
This results in the following JSON structure:
{ "repeat" : 10 , "messages" : [ "Hello" , "World" ] }
6.1. Mapping SHACL to Configuration Structures
Each sh:property statement in a SHACL shape directly maps to a field in a configuration object, such as a JSON structure. This enables semantically rich, machine-validated configurations for processors and pipelines within RDF-Connect.
6.1.1. Required Fields
sh:path
This property indicates the RDF predicate that must be present on the target resource. In the context of configuration mapping, sh:path is used to extract the corresponding value from the data graph. It defines the link between the RDF representation and the configuration data being generated or interpreted.
sh:name
The sh:name property provides the external field name used in the resulting configuration object. This allows decoupling the internal RDF predicate (as defined in sh:path) from how the value appears in the configuration file.
sh:datatype or sh:class
These properties define the expected type of the configuration field. Use sh:datatype for primitive types such as xsd:string, xsd:boolean, or xsd:anyURI. When the expected value is a nested resource, i.e., an object with its own structured fields, use sh:class instead. This signals that the value is an embedded configuration object to be interpreted according to its own SHACL shape, enabling deeply nested configuration structures.
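For instance, the shape from § 6 combines these properties to map the RDF predicate :msg onto a JSON field named "messages":

sh:property [
    sh:path :msg ;            # RDF predicate to extract from the data graph
    sh:name "messages" ;      # field name in the resulting configuration object
    sh:datatype xsd:string ;  # each value is a primitive string
]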
6.1.2. Optional Fields
sh:minCount
This property defines the minimum number of values that must be provided. If the actual number of values is below this threshold, a validation error will be raised. A sh:minCount of 1 or higher indicates that the field is required in the configuration.
sh:maxCount
This property sets the maximum number of values allowed for a field. It also determines how the field should be interpreted structurally. If sh:maxCount is set to 1, the corresponding field is treated as a single value (T). If it is greater than 1 or left unspecified, the field is interpreted as a list or array of values (T[]). This behavior ensures that the structure of the configuration aligns with user expectations and downstream processing logic.
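Applied to the shape from § 6, these cardinality rules make repeat a single integer and messages an array of strings:

sh:property [
    sh:path :repeat ;
    sh:name "repeat" ;
    sh:datatype xsd:integer ;
    sh:maxCount 1 ;  # at most one value: interpreted as a single value (T)
], [
    sh:path :msg ;
    sh:name "messages" ;
    sh:datatype xsd:string ;
    # no sh:maxCount: interpreted as an array of values (T[])
]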
6.2. Nested Shapes and Component Types
Configuration fields can reference structured values instead of primitive literals. This is supported via sh:class, which indicates that the value must conform to another shape associated with a given RDF class.
For example:
sh:property [
    sh:path rdfc:input ;
    sh:name "input" ;
    sh:class rdfc:Reader ;
]
This defines a configuration field named input whose value is expected to be an instance of the class rdfc:Reader.
6.2.1. Use of sh:class and sh:targetClass
The sh:class predicate is used within a property constraint to indicate that the value of a field must be an RDF resource belonging to a specific class. This class must, in turn, have a shape associated with it that defines its expected structure. To make this connection, sh:targetClass is used on a sh:NodeShape. This associates the shape with a class so that tools can look up the shape definition when they encounter an RDF resource of that class.
In RDF-Connect, sh:targetClass allows reusable schemas to be defined for configuration components. These can then be referenced in other shapes through sh:class, enabling configuration structures that are both modular and type-safe.
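The complete example in § 6.3 uses exactly this pattern; abbreviated, the two sides of the connection look as follows:

# A reusable shape is associated with the class :Channels
# (its property constraints are shown in full in § 6.3).
[] a sh:NodeShape ;
    sh:targetClass :Channels .

# Another shape then references :Channels as the expected type of a field:
sh:property [
    sh:path :channel ;
    sh:name "channels" ;
    sh:class :Channels ;  # values must conform to the :Channels shape
]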
6.2.2. Special Component Types: rdfc:Reader and rdfc:Writer
rdfc:Reader and rdfc:Writer are two special classes that represent input and output endpoints in RDF-Connect pipelines. Fields declared with sh:class rdfc:Reader or sh:class rdfc:Writer are not simply nested configuration objects; they serve as runtime injection points for streaming data.
At execution time, the runner is responsible for resolving these references into actual language-native abstractions. Depending on the programming environment, these may be JavaScript streams, async iterators, or callback-based interfaces.
For example:
sh:property [
    sh:path rdfc:input ;
    sh:name "input" ;
    sh:class rdfc:Reader ;
    sh:minCount 1 ;
]
This declares an input field that must be provided and will be automatically connected to an input stream by the runner.
This design decouples the declarative pipeline configuration from the underlying data transport logic, allowing developers to focus on processor behavior rather than infrastructure.
6.3. Example: Putting it all together
The following SHACL definitions and RDF instance demonstrate how a FooBar2 processor might be configured to append text to each incoming message and send the result to an output channel.
[] a sh:NodeShape ;
    sh:targetClass :Channels ;
    sh:property [
        sh:path rdfc:input ;
        sh:name "input" ;
        sh:class rdfc:Reader ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ], [
        sh:path rdfc:output ;
        sh:name "output" ;
        sh:class rdfc:Writer ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .

[] a sh:NodeShape ;
    sh:targetClass <FooBar2> ;
    sh:property [
        sh:path :channel ;
        sh:name "channels" ;
        sh:minCount 1 ;
        sh:class :Channels ;
    ], [
        sh:path :append ;
        sh:name "append" ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:datatype xsd:string ;
    ] .
<foobar> a <FooBar2> ;
    :channel [
        # `a :Channels` is not required; this is implicit from the definition
        rdfc:input <channel1> ;
        rdfc:output <channel2> ;
    ], [
        rdfc:input <channel2> ;
        rdfc:output <channel1> ;
    ] ;
    :append " World!" .
This results in the following JSON object:
{
    "channels": [
        {
            "input": { /* idiomatic input stream for channel <channel1> */ },
            "output": { /* idiomatic output stream for channel <channel2> */ }
        },
        {
            "input": { /* idiomatic input stream for channel <channel2> */ },
            "output": { /* idiomatic output stream for channel <channel1> */ }
        }
    ],
    "append": " World!"
}
7. RDF-Connect by Layer
7.1. Orchestrator
The orchestrator is the central runtime entity in RDF-Connect. It reads the pipeline configuration, sets up the runners, initiates processors, and routes messages between them. It ensures the dataflow graph described by the pipeline is brought to life across isolated runtimes. The orchestrator acts as a coordinator, not an executor. Each runner is responsible for running the actual processor code, but the orchestrator ensures the pipeline as a whole behaves as intended.
Responsibilities:
- Parse the pipeline RDF.
- Load SHACL shapes for processors and runners.
- Validate and coerce configuration to structured JSON.
- Instantiate runners.
- Start processors.
- Mediate messages (data and control).
- Handle retries, streaming, and backpressure.
7.1.1. Protobuf Messaging Protocol
Communication between the orchestrator and the runners happens using a strongly typed protocol defined in Protocol Buffers (protobuf). This enables language-independent and efficient communication across processes and machines.
7.1.2. Startup Flow
The following diagram describes the startup sequence handled by the orchestrator. This includes validating pipeline structure, instantiating runners, and initializing processor instances.
sequenceDiagram
    participant O as Orchestrator
    participant R as Runner
    participant P as Processor
    loop For every runner
        Note over R: Runner is started with cli
        O-->>R: Startup with address and uri
        R-->>O: RPC.Identify: with uri
    end
    loop For every processor
        O-->>R: RPC.Processor: Start processor
        Note over P: Load module and class
        R-->>P: Load processor
        R-->>P: Start processor with args
        R-->>O: RPC.ProcessorInit: processor started
    end
    loop For every runner
        O-->>R: RPC.Start: Start
        loop For every Processor
            R-->>P: Start
        end
    end
7.1.3. Message Handling Flow
Once processors are running, the orchestrator handles incoming messages and forwards them to the appropriate reader instances, based on their declared channels.
sequenceDiagram
    participant O as Orchestrator
    participant R2 as Runner 2
    participant P2 as Processor 2
    O-->>R2: Msg to Channel A
    R2-->>P2: Msg to Channel A
7.1.4. Streaming Messages
For large messages or real-time input, RDF-Connect supports a streaming model. Instead of sending an entire payload as a single message, the payload can be broken into chunks that are sent over time. This is handled by the StreamChunk message type.
sequenceDiagram
    participant P1 as Processor 1
    participant R1 as Runner 1
    participant O as Orchestrator
    participant R2 as Runner 2
    participant P2 as Processor 2
    critical Start stream message
        R1->>O: rpc.sendStreamMessage (bidirectional stream)
        O-->>R1: sends generated ID of stream message
        R1-->>O: announce StreamMessage with ID over normal stream
        O-->>R2: announce StreamMessage with ID over normal stream
        R2->>O: rpc.receiveMessage with ID, starts receiving stream
        R2-->>P2: incoming stream message
    end
    loop Streams data
        P1-->>R1: Data chunks
        R1-->>O: Data chunks over stream
        O-->>R2: Data chunks over stream
        R2-->>P2: Data chunks
    end
7.2. Runner
7.3. Processor
7.4. Pipeline
8. Ontology Reference
The RDF-Connect ontology provides the terms used in RDF pipeline definitions. See the full RDF-Connect Ontology for details.