RDF-Connect Specification

Living Standard

Issue Tracking:
GitHub
Inline In Spec
Editor:
RDF-Connect Team

Abstract

RDF-Connect is a modular, language-agnostic framework for building and executing multilingual data processing pipelines, using RDF as the configuration and orchestration layer. This specification defines the core concepts of RDF-Connect, its SHACL-based configuration schema, and the protocol by which orchestrators, runners, and processors interact.

1. Introduction

RDF-Connect is a modular framework for building and executing multilingual data processing pipelines using RDF as the configuration and orchestration layer.

It enables fine-grained, reusable processor components that exchange streaming data, allowing workflows to be described declaratively across programming languages and environments. RDF-Connect is especially well suited for data transformation, integration, and linked data publication.

2. Usage Paths

This section is complete in terms of content, but may be reorganized or rewritten for clarity during editorial review.

Depending on your use case, you may only need a subset of this specification:

Pipeline authors, who describe and run pipelines, can focus on § 5 Getting Started and § 6 SHACL as Configuration Schema.

Processor developers additionally need the concepts in § 4 and the configuration mapping in § 6.

Runner developers and integrators should consult § 7 RDF-Connect by Layer, which covers the orchestrator protocol in detail.

3. Design Goals

This section is complete in terms of content, but may be reorganized or rewritten for clarity during editorial review.

RDF-Connect is designed to be language-agnostic, enabling seamless integration of processor components written in diverse programming languages such as JavaScript, Python, and shell scripts. This flexibility allows developers to leverage existing tools and libraries within their language of choice, avoiding the constraints of monolithic, single-language frameworks.

A key architectural principle of RDF-Connect is that it operates in a streaming-by-default mode. Data flows between processors via a central orchestrator, supporting real-time and large-scale data processing scenarios. This streaming model ensures efficient memory usage and enables continuous transformation and publication of data as it is ingested.

The configuration of pipelines, processors, inputs, and outputs is expressed semantically using RDF. This use of semantic configuration promotes clarity, extensibility, and interoperability by describing the structure and behavior of the system in a machine-readable, standards-based format.

Processor components in RDF-Connect are designed for reusability. A processor defined once can be reused across multiple pipelines and in varying contexts, reducing duplication and encouraging modular design practices. This modularity fosters rapid prototyping and easier maintenance of complex workflows.

Transparency is a fundamental goal of the framework. By leveraging RDF vocabularies to describe pipeline components and their interactions, RDF-Connect makes it easy to inspect, document, and reason about the behavior and structure of a pipeline, both during development and after deployment.

Finally, RDF-Connect provides native support for provenance tracking using the PROV-O ontology. Each step in the pipeline can be traced to its source, enabling detailed lineage tracking. This capability is especially critical in applications involving data quality, reproducibility, and compliance.

4. Concepts

This section introduces the core concepts underpinning the RDF-Connect framework. These concepts collectively define the modular, streaming, and language-agnostic architecture that enables RDF-Connect to support sophisticated, provenance-aware data pipelines.

4.1. Pipeline

At the heart of RDF-Connect is the notion of a pipeline, which defines a structured sequence of processing steps. Each step corresponds to a processor, configured with parameters and paired with an appropriate runner that governs its execution. Pipelines specify the flow of data between processors using readers and writers, forming a streaming architecture in which data is continuously passed along and transformed. The pipeline configuration itself is expressed in RDF, making it semantically explicit and machine-interpretable.
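
For example, a minimal pipeline declaration could look as follows (an illustrative sketch: the ex: names are placeholders, and the terms are those used in the example of § 5 Getting Started):

# Illustrative pipeline: one runner executing two processors.
# All ex: names are placeholders.
ex:pipeline a rdfc:Pipeline ;
    rdfc:instantiates ex:nodeRunner ;
    rdfc:processor ex:produce, ex:transform .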

4.2. Processor

A processor is a modular, reusable software component that performs a discrete data processing task. In a typical scenario, a processor receives input data through a reader, transforms it according to its logic, and emits the result via a writer. However, processors are also flexible enough to support single-directional tasks, such as those that only produce or only consume data. Crucially, processors are implementation-agnostic --- they can be written in any programming language and integrated into pipelines via language-specific runners. This makes processors the building blocks of RDF-Connect’s cross-language interoperability.

4.3. Runner

A runner is responsible for executing a processor on behalf of the orchestrator. Each runner targets a specific language or execution environment, enabling processors written in different languages to participate seamlessly in a pipeline. For example, a JavaScript processor would be executed using a NodeRunner, which knows how to initialize and manage JavaScript-based components. In this sense, a runner serves as an execution strategy, abstracting away the platform-specific details of launching and interacting with a processor.

4.4. Orchestrator

The orchestrator is the central component that interprets and runs RDF-Connect pipelines. It parses the RDF-based configuration, initializes and assigns runners, instantiates processors, and manages the data flow between components. Acting as the conductor of the system, the orchestrator ensures that processors execute in the correct order and that data is passed efficiently and correctly through the pipeline. Its role is fundamental to realizing the streaming and semantic integration goals of RDF-Connect.

4.5. Reader / Writer

Readers and writers provide the streaming interfaces that connect processors within a pipeline. A writer streams data out from a processor, while a reader receives data into a processor. Together, they define how data flows between pipeline steps in an idiomatic and composable way. This separation of concerns allows for flexible data routing and makes it easy to compose and recombine processors across different pipeline configurations.
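
For example, two processors can be wired together by referencing the same channel IRI, one as an output and one as an input (an illustrative sketch following the rdfc:input / rdfc:output pattern of § 6.3; all names are placeholders):

# Illustrative sketch: <channelA> carries data from the writer of
# ex:produce to the reader of ex:transform (names are placeholders).
ex:produce a ex:Producer ;
    rdfc:output <channelA> .

ex:transform a ex:Transformer ;
    rdfc:input <channelA> .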

5. Getting Started

🚧 This section is a work in progress and will be expanded soon.

This section provides a high-level overview of how to define and run a pipeline in RDF-Connect. The rest of the specification provides detail on how each part works.

Here is a basic example:

ex:pipeline a rdfc:Pipeline ;
    rdfc:instantiates ex:myRunner ;
    rdfc:processor ex:myProcessor .

Once a pipeline is fully described using RDF, it is handed over to the orchestrator. The orchestrator parses the configuration, resolves all runner and processor definitions, and initiates execution.

6. SHACL as Configuration Schema

RDF-Connect uses SHACL [shacl] not only as a data validation mechanism but also as a schema language for defining the configuration interface of components such as processors and runners. These shapes define the required and optional configuration properties of a component, which are transformed into JSON objects at startup according to the pipeline configuration.

The following SHACL shape defines a configuration structure for a processor. Such shapes describe the required parameters and yield a well-typed JSON object that developers can rely on during implementation.

SHACL shape defining the required configuration for a processor

[] a sh:NodeShape;
    sh:targetClass <FooBar>;
    sh:property [
        sh:name "repeat";         # JSON field name
        sh:datatype xsd:integer;  # specify datatype
        sh:maxCount 1;
        sh:path :repeat;
    ], [
        sh:name "messages";
        sh:datatype xsd:string;
        sh:path :msg;
    ].

Processor configuration

<MyProcessor> a <FooBar>;
    :repeat 10;
    :msg "Hello", "World".

This results in the following JSON structure:

{ 
    "repeat": 10,
    "messages": [ "Hello", "World"]
}

6.1. Mapping SHACL to Configuration Structures

Each sh:property statement in a SHACL shape directly maps to a field in a configuration object, such as a JSON structure. This enables semantically rich, machine-validated configurations for processors and pipelines within RDF-Connect.

6.1.1. Required Fields

sh:path

This property indicates the RDF predicate that must be present on the target resource. In the context of configuration mapping, sh:path is used to extract the corresponding value from the data graph. It defines the link between the RDF representation and the configuration data being generated or interpreted.

sh:name

The sh:name property provides the external field name used in the resulting configuration object. This allows decoupling the internal RDF predicate (as defined in sh:path) from how the value appears in the configuration file.

sh:datatype or sh:class

These properties define the expected type of the configuration field. Use sh:datatype for primitive types such as xsd:string, xsd:boolean, or xsd:anyURI. When the expected value is a nested resource, i.e., an object with its own structured fields, use sh:class instead. This signals that the value is an embedded configuration object to be interpreted according to its own SHACL shape, enabling deeply nested configuration structures.
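
For example, the two cases can appear side by side in one shape (an illustrative sketch; the :timeout, :auth, and :Credentials names are hypothetical):

sh:property [
    sh:path :timeout ;
    sh:name "timeout" ;
    sh:datatype xsd:integer ;  # primitive value
], [
    sh:path :auth ;
    sh:name "auth" ;
    sh:class :Credentials ;    # nested object, interpreted via its own shape
]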

6.1.2. Optional Fields

sh:minCount

This property defines the minimum number of values that must be provided. If the actual number of values is below this threshold, a validation error will be raised. A sh:minCount of 1 or higher indicates that the field is required in the configuration.

sh:maxCount

This property sets the maximum number of values allowed for a field. It also determines how the field should be interpreted structurally. If sh:maxCount is set to 1, the corresponding field is treated as a single value (T). If it is greater than 1 or left unspecified, the field is interpreted as a list or array of values (T[]). This behavior ensures that the structure of the configuration aligns with user expectations and downstream processing logic.
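
The first example of § 6 already illustrates this rule: "repeat" is declared with sh:maxCount 1 and maps to the single value 10, while "messages" carries no sh:maxCount and maps to the array ["Hello", "World"]. As a compact sketch of both cardinality cases (hypothetical names):

sh:property [
    sh:path :limit ;
    sh:name "limit" ;
    sh:datatype xsd:integer ;
    sh:minCount 1 ;  # required
    sh:maxCount 1 ;  # single value (T), e.g. "limit": 5
], [
    sh:path :tag ;
    sh:name "tags" ;
    sh:datatype xsd:string ;
    # no sh:maxCount: interpreted as an array (T[]), e.g. "tags": ["a", "b"]
]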

6.2. Nested Shapes and Component Types

Configuration fields can reference structured values instead of primitive literals. This is supported via sh:class, which indicates that the value must conform to another shape associated with a given RDF class.

For example:

sh:property [
  sh:path rdfc:input ;
  sh:name "input" ;
  sh:class rdfc:Reader ;
]

This defines a configuration field named input whose value is expected to be an instance of the class rdfc:Reader.

6.2.1. Use of sh:class and sh:targetClass

The sh:class predicate is used within a property constraint to indicate that the value of a field must be an RDF resource belonging to a specific class. This class must, in turn, have a shape associated with it that defines its expected structure.

To make this connection, sh:targetClass is used on a sh:NodeShape. This associates the shape with a class so that tools can look up the shape definition when they encounter an RDF resource of that class.

In RDF-Connect, sh:targetClass allows reusable schemas to be defined for configuration components. These can then be referenced in other shapes through sh:class, enabling configuration structures that are both modular and type-safe.
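
For example, a minimal version of this pattern uses two shapes (an illustrative sketch; the :Credentials class, <FetchProcessor>, and the property names are hypothetical, and § 6.3 gives a complete worked example):

# Reusable schema, bound to the hypothetical class :Credentials.
[] a sh:NodeShape ;
    sh:targetClass :Credentials ;
    sh:property [
        sh:path :token ;
        sh:name "token" ;
        sh:datatype xsd:string ;
        sh:maxCount 1 ;
    ] .

# Another shape references that class via sh:class.
[] a sh:NodeShape ;
    sh:targetClass <FetchProcessor> ;
    sh:property [
        sh:path :auth ;
        sh:name "auth" ;
        sh:class :Credentials ;
        sh:maxCount 1 ;
    ] .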

6.2.2. Special Component Types: rdfc:Reader and rdfc:Writer

rdfc:Reader and rdfc:Writer are two special classes that represent input and output endpoints in RDF-Connect pipelines. Fields declared with sh:class rdfc:Reader or sh:class rdfc:Writer are not simply nested configuration objects; they serve as runtime injection points for streaming data.

At execution time, the runner is responsible for resolving these references into actual language-native abstractions. Depending on the programming environment, these may be JavaScript streams, async iterators, or callback-based interfaces.

For example:

sh:property [
  sh:path rdfc:input ;
  sh:name "input" ;
  sh:class rdfc:Reader ;
  sh:minCount 1 ;
]

This declares an input field that must be provided and will be automatically connected to an input stream by the runner. This design decouples the declarative pipeline configuration from the underlying data transport logic, allowing developers to focus on processor behavior rather than infrastructure.

6.3. Example: Putting it all together

The following SHACL definitions and RDF instance demonstrate how a FooBar2 processor might be configured to append text to each incoming message and send the result to an output channel.

[ ] a sh:NodeShape;
  sh:targetClass :Channels;
  sh:property [
    sh:path rdfc:input ;
    sh:name "input" ;
    sh:class rdfc:Reader ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
  ], [
    sh:path rdfc:output ;
    sh:name "output" ;
    sh:class rdfc:Writer ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
  ].

[ ] a sh:NodeShape ;
  sh:targetClass <FooBar2> ;
  sh:property [
    sh:path :channel ;
    sh:name "channels" ;
    sh:minCount 1 ;
    sh:class :Channels ;
  ], [
    sh:path :append ;
    sh:name "append" ;
    sh:minCount 1 ;
    sh:maxCount 1 ;
    sh:datatype xsd:string ;
  ].
<foobar> a <FooBar2>;
    :channel [
        # `a :Channels` is not required; it is implicit from the shape definition
        rdfc:input <channel1>;
        rdfc:output <channel2>;
    ], [
        rdfc:input <channel2>;
        rdfc:output <channel1>;
    ];
    :append " World!".

This results in the following JSON object:

{
    "channels": [
        {
            "input":  { /* idiomatic input stream for channel <channel1> */ },
            "output": { /* idiomatic output stream for channel <channel2> */ }
        },
        {
            "input":  { /* idiomatic input stream for channel <channel2> */ },
            "output": { /* idiomatic output stream for channel <channel1> */ }
        }
    ],
    "append": " World!"
}

7. RDF-Connect by Layer

7.1. Orchestrator

The orchestrator is the central runtime entity in RDF-Connect. It reads the pipeline configuration, sets up the runners, initiates processors, and routes messages between them. It ensures the dataflow graph described by the pipeline is brought to life across isolated runtimes. The orchestrator acts as a coordinator, not an executor. Each runner is responsible for running the actual processor code, but the orchestrator ensures the pipeline as a whole behaves as intended.

Responsibilities of the orchestrator include parsing the pipeline configuration, initializing and assigning runners, instantiating processors, and routing messages between components.

The remainder of this section is intended for developers building custom runners or integrating RDF-Connect into infrastructure.

7.1.1. Protobuf Messaging Protocol

Communication between the orchestrator and the runners happens using a strongly typed protocol defined in Protocol Buffers (protobuf). This enables language-independent and efficient communication across processes and machines.

7.1.2. Startup Flow

The following diagram describes the startup sequence handled by the orchestrator. This includes validating pipeline structure, instantiating runners, and initializing processor instances.

sequenceDiagram
    autonumber
    participant O as Orchestrator
    participant R as Runner
    participant P as Processor
    Note over O: Discovered Runners and processors
    loop For every runner
        Note over R: Runner is started with cli
        O-->>R: Startup with address and uri
        R-->>O: RPC.Identify: with uri
    end
    loop For every processor
        O-->>R: RPC.Processor: Start processor
        Note over P: Load module and class
        R-->>P: Load processor
        R-->>P: Start processor with args
        R-->>O: RPC.ProcessorInit: processor started
    end
    loop For every runner
        O-->>R: RPC.Start: Start
        loop For every Processor
            R-->>P: Start
        end
    end

7.1.3. Message Handling Flow

Once processors are running, the orchestrator handles incoming messages and forwards them to the appropriate reader instances, based on their declared channels.

sequenceDiagram
    participant P1 as Processor 1
    participant R1 as Runner 1
    participant O as Orchestrator
    participant R2 as Runner 2
    participant P2 as Processor 2
    P1-->>R1: Msg to Channel A
    R1-->>O: Msg to Channel A
    Note over O: Channel A is connected to processor of Runner 2
    O-->>R2: Msg to Channel A
    R2-->>P2: Msg to Channel A

7.1.4. Streaming Messages

For large messages or real-time input, RDF-Connect supports a streaming model. Instead of sending an entire payload as a single message, a sender can break it into chunks that are delivered over time. This is handled by the StreamChunk message type.

sequenceDiagram
    autonumber
    participant P1 as Processor 1
    participant R1 as Runner 1
    participant O as Orchestrator
    participant R2 as Runner 2
    participant P2 as Processor 2
    P1-->>R1: Send streaming message
    critical Start stream message
        R1->>O: rpc.sendStreamMessage (bidirectional stream)
        O-->>R1: sends generated ID of stream message
        R1-->>O: announce StreamMessage with ID over normal stream
        O-->>R2: announce StreamMessage with ID over normal stream
        R2->>O: rpc.receiveMessage with Id starts receiving stream
        R2-->>P2: incoming stream message
    end
    loop Streams data
        P1-->>R1: Data chunks
        R1-->>O: Data chunks over stream
        O-->>R2: Data chunks over stream
        R2-->>P2: Data chunks
    end

7.2. Runner

🚧 This section is a work in progress and will be expanded soon.

7.3. Processor

🚧 This section is a work in progress and will be expanded soon.

7.4. Pipeline

🚧 This section is a work in progress and will be expanded soon.

8. Ontology Reference

The RDF-Connect ontology provides the terms used in RDF pipeline definitions. See the full RDF-Connect Ontology for details.

9. Putting It All Together: Example Flow and Use Case

🚧 This section is a work in progress and will be expanded soon.

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119

Informative References

[SHACL]
Holger Knublauch; Dimitris Kontokostas. Shapes Constraint Language (SHACL). URL: https://w3c.github.io/data-shapes/shacl/
