
Building

This section covers how to construct a knowledge graph from text using RetriCo's build pipeline.

How It Works

The build pipeline processes text through a series of steps:

Build pipeline: from raw text through chunking, NER, linking, relation extraction, graph writing, and embedding

  1. Chunking — split input texts into manageable pieces
  2. NER — extract entities from each chunk
  3. Relation Extraction — discover relationships between entities
  4. Entity Linking (optional) — resolve entities to a reference knowledge base
  5. Graph Writing — deduplicate and store entities, relations, and chunks in a graph database
  6. Embedding (optional) — generate vector embeddings for chunks and/or entities

Each step is a registered processor. Every step can be configured, swapped, or skipped independently.

Creating a Build Pipeline

RetriCo offers three ways to create a build pipeline:

Option 1: One-liner

import retrico

result = retrico.build_graph(
    texts=["Einstein was born in Ulm and worked at the Swiss Patent Office."],
    entity_labels=["person", "organization", "location"],
    relation_labels=["born in", "works at"],
)

Option 2: Builder API

builder = retrico.RetriCoBuilder(name="my_pipeline")
builder.chunker(method="sentence")
builder.ner_gliner(labels=["person", "organization", "location"])
builder.relex_gliner(entity_labels=["person", "organization"], relation_labels=["works at", "born in"])
builder.graph_writer()
executor = builder.build()
result = executor.run(texts=["Einstein was born in Ulm."])

Option 3: YAML Config

name: my_pipeline
nodes:
  - id: chunker
    processor: chunker
    inputs:
      texts: {source: "$input", fields: "texts"}
    output: {key: "chunker_result"}
    config:
      method: sentence

  - id: ner
    processor: ner_gliner
    requires: [chunker]
    inputs:
      chunks: {source: "chunker_result", fields: "chunks"}
    output: {key: "ner_result"}
    config:
      model: "knowledgator/gliner-multitask-large-v0.5"
      labels: [person, organization, location]

  - id: relex
    processor: relex_gliner
    requires: [ner]
    inputs:
      entities: {source: "ner_result", fields: "entities"}
      chunks: {source: "ner_result", fields: "chunks"}
    output: {key: "relex_result"}
    config:
      entity_labels: [person, organization, location]
      relation_labels: [works at, born in]

  - id: writer
    processor: graph_writer
    requires: [relex]
    inputs:
      entities: {source: "relex_result", fields: "entities"}
      relations: {source: "relex_result", fields: "relations"}
      chunks: {source: "relex_result", fields: "chunks"}
    output: {key: "writer_result"}

Then load and run the pipeline:

executor = retrico.ProcessorFactory.create_pipeline("my_pipeline.yaml")
result = executor.run(texts=["Einstein was born in Ulm."])

Connecting to a Database

RetriCo needs a graph database to store the knowledge graph. By default, it uses FalkorDB Lite (embedded, zero-config). To use a different backend:

import retrico

# FalkorDB Lite (default — no setup needed)
result = retrico.build_graph(texts=[...], entity_labels=[...])

# Neo4j
result = retrico.build_graph(
    texts=[...],
    entity_labels=[...],
    store_config=retrico.Neo4jConfig(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="password",
    ),
)

# FalkorDB (server)
result = retrico.build_graph(
    texts=[...],
    entity_labels=[...],
    store_config=retrico.FalkorDBConfig(
        host="localhost",
        port=6379,
        graph="my_graph",
    ),
)

# Memgraph
result = retrico.build_graph(
    texts=[...],
    entity_labels=[...],
    store_config=retrico.MemgraphConfig(
        uri="bolt://localhost:7687",
    ),
)

With the builder API:

builder = retrico.RetriCoBuilder(name="my_pipeline")
builder.graph_store(retrico.Neo4jConfig(uri="bolt://localhost:7687"))
# All downstream components inherit this store
builder.chunker(method="sentence")
builder.ner_gliner(labels=["person", "org"])
builder.graph_writer()

In YAML:

stores:
  graph:
    store_type: neo4j
    uri: "bolt://localhost:7687"
    user: neo4j
    password: password

nodes:
  # ... processors inherit the store automatically

See Databases for full configuration details.


Components

Chunking

The chunker splits input texts into smaller pieces for processing.

Builder API:

builder.chunker(
    method="sentence",  # "sentence", "paragraph", or "fixed"
    max_length=512,     # for "fixed" method
    overlap=50,         # overlap between fixed chunks
)

YAML:

- id: chunker
  processor: chunker
  config:
    method: sentence
    max_length: 512
    overlap: 50

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| method | "sentence" | Splitting strategy: "sentence", "paragraph", or "fixed" |
| max_length | 512 | Maximum chunk length (for "fixed" method) |
| overlap | 50 | Token overlap between consecutive fixed chunks |

| Method | Description |
| --- | --- |
| sentence | Split on sentence boundaries (default) |
| paragraph | Split on paragraph breaks |
| fixed | Fixed-size chunks with optional overlap |
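For intuition, the "fixed" method can be sketched as sliding a window of max_length tokens forward by max_length minus overlap at each step. A minimal illustration, with whitespace-separated words standing in for tokens (RetriCo's actual tokenization may differ):

```python
def fixed_chunks(text, max_length=512, overlap=50):
    """Illustrative fixed-size chunking with overlap (words approximate tokens)."""
    words = text.split()
    step = max(1, max_length - overlap)  # guard against overlap >= max_length
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_length]))
        if start + max_length >= len(words):
            break
    return chunks

print(fixed_chunks("a b c d e f g h", max_length=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h']
```

Each chunk repeats the last `overlap` words of its predecessor, so entity mentions that straddle a boundary still appear whole in at least one chunk.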

NER (Named Entity Recognition)

RetriCo offers two interchangeable NER backends. Both produce the same output shape ({"entities": List[List[EntityMention]], "chunks": List[Chunk]}), so they can be swapped freely.
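Because both backends share this contract, downstream processors do not care which one produced the result. A stand-in sketch of the shape (the dataclass below is illustrative, not RetriCo's own type; its fields mirror those used elsewhere on this page):

```python
from dataclasses import dataclass

@dataclass
class EntityMention:  # stand-in for RetriCo's type, for illustration only
    text: str
    label: str
    score: float

# "entities" is List[List[EntityMention]]: one inner list per chunk
ner_result = {
    "entities": [
        [EntityMention("Einstein", "person", 0.97)],
        [EntityMention("Ulm", "location", 0.91)],
    ],
}

for chunk_idx, mentions in enumerate(ner_result["entities"]):
    for m in mentions:
        print(f"chunk {chunk_idx}: [{m.label}] {m.text} ({m.score:.2f})")
```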

GLiNER (Local, Fast)

Runs locally with zero API costs. Supports any entity types — no fine-tuning needed.

Builder API:

builder.ner_gliner(
    model="knowledgator/gliner-multitask-large-v0.5",
    labels=["person", "organization", "location", "date"],
    threshold=0.3,
    flat_ner=True,
    device="cpu",
)

YAML:

- id: ner
  processor: ner_gliner
  requires: [chunker]
  inputs:
    chunks: {source: "chunker_result", fields: "chunks"}
  output: {key: "ner_result"}
  config:
    model: "knowledgator/gliner-multitask-large-v0.5"
    labels: [person, organization, location, date]
    threshold: 0.3
    flat_ner: true
    device: cpu

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| model | "knowledgator/gliner-multitask-large-v0.5" | HuggingFace model ID or local path |
| labels | (required) | Entity types to extract |
| threshold | 0.3 | Minimum confidence score |
| flat_ner | True | Non-overlapping entities |
| device | "cpu" | "cpu" or "cuda" |

LLM (API-based, High Accuracy)

Works with any OpenAI-compatible API: OpenAI, vLLM, Ollama, LM Studio, etc.

Builder API:

builder.ner_llm(
    api_key="sk-...",
    model="gpt-4o-mini",
    labels=["person", "organization", "location"],
    temperature=0.1,
    base_url=None,  # set for local servers
)

YAML:

- id: ner
  processor: ner_llm
  requires: [chunker]
  inputs:
    chunks: {source: "chunker_result", fields: "chunks"}
  output: {key: "ner_result"}
  config:
    api_key: "sk-..."
    model: "gpt-4o-mini"
    labels: [person, organization, location]
    temperature: 0.1

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| api_key | (required) | OpenAI-compatible API key |
| model | "gpt-4o-mini" | LLM model name |
| labels | (required) | Entity types to extract |
| temperature | 0.1 | LLM sampling temperature |
| base_url | None | Custom API endpoint for local servers |

With a local server:

builder.ner_llm(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
    model="Qwen/Qwen2.5-7B-Instruct",
    labels=["person", "organization"],
)

Relation Extraction (Relex)

Like NER, relation extraction offers two interchangeable backends with identical output shapes.

GLiNER-Relex (Local)

Builder API:

builder.relex_gliner(
    model="knowledgator/gliner-relex-large-v0.5",
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in", "located in"],
    relation_threshold=0.5,
)

YAML:

- id: relex
  processor: relex_gliner
  requires: [ner]
  inputs:
    entities: {source: "ner_result", fields: "entities"}
    chunks: {source: "ner_result", fields: "chunks"}
  output: {key: "relex_result"}
  config:
    model: "knowledgator/gliner-relex-large-v0.5"
    entity_labels: [person, organization, location]
    relation_labels: [works at, born in, located in]
    relation_threshold: 0.5

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| model | "knowledgator/gliner-relex-large-v0.5" | HuggingFace model ID |
| entity_labels | (required) | Entity types |
| relation_labels | (required) | Relation types to extract |
| relation_threshold | 0.5 | Minimum relation confidence score |
| threshold | 0.5 | Entity confidence threshold |
| adjacency_threshold | 0.55 | Adjacency scoring threshold |

When used after ner_gliner, it receives pre-extracted entities and only resolves relations — making it faster.

LLM-Relex (API-based)

Builder API:

builder.relex_llm(
    api_key="sk-...",
    model="gpt-4o-mini",
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in", "located in"],
)

YAML:

- id: relex
  processor: relex_llm
  requires: [ner]
  inputs:
    entities: {source: "ner_result", fields: "entities"}
    chunks: {source: "ner_result", fields: "chunks"}
  output: {key: "relex_result"}
  config:
    api_key: "sk-..."
    model: "gpt-4o-mini"
    entity_labels: [person, organization, location]
    relation_labels: [works at, born in, located in]

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| api_key | (required) | OpenAI-compatible API key |
| model | "gpt-4o-mini" | LLM model name |
| entity_labels | (required) | Entity types |
| relation_labels | (required) | Relation types to extract |
| temperature | 0.1 | LLM sampling temperature |
| base_url | None | Custom API endpoint |

Mixed Pipeline (Best of Both)

Use GLiNER for fast local NER, then LLM for higher-quality relation extraction:

builder = retrico.RetriCoBuilder(name="mixed")
builder.chunker(method="sentence")
builder.ner_gliner(labels=["person", "org", "location"])
builder.relex_llm(
    api_key="sk-...",
    entity_labels=["person", "org", "location"],
    relation_labels=["works at", "born in"],
)
builder.graph_writer()

In YAML — just change the processor field:

nodes:
  - id: ner
    processor: ner_gliner  # local GLiNER
    config:
      labels: [person, org, location]

  - id: relex
    processor: relex_llm   # LLM-based relex
    requires: [ner]
    config:
      api_key: "sk-..."
      entity_labels: [person, org, location]
      relation_labels: [works at, born in]

Graph Writer

Deduplicates entities, sanitizes relation types, and writes everything to the graph database.

Builder API:

builder.graph_writer(
    json_output="output/data.json",  # optional: also save to JSON
)

YAML:

- id: writer
  processor: graph_writer
  requires: [relex]
  inputs:
    entities: {source: "relex_result", fields: "entities"}
    relations: {source: "relex_result", fields: "relations"}
    chunks: {source: "relex_result", fields: "chunks"}
  output: {key: "writer_result"}
  config:
    json_output: "output/data.json"

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| store_type | from pool | Graph database type (auto-resolved from store pool) |
| json_output | None | Path to save extracted data as JSON |

Embedding

Generate vector embeddings for chunks and entities during the build phase. These enable semantic retrieval at query time.

Builder API:

builder.chunk_embedder(
    embedding_method="sentence_transformer",
    model_name="all-MiniLM-L6-v2",
    vector_store_type="faiss",
)

builder.entity_embedder(
    embedding_method="sentence_transformer",
    model_name="all-MiniLM-L6-v2",
    vector_store_type="in_memory",
)

YAML:

- id: chunk_embedder
  processor: chunk_embedder
  requires: [writer]
  config:
    embedding_method: sentence_transformer
    model_name: "all-MiniLM-L6-v2"
    vector_store_type: faiss

- id: entity_embedder
  processor: entity_embedder
  requires: [writer]
  config:
    embedding_method: sentence_transformer
    model_name: "all-MiniLM-L6-v2"
    vector_store_type: in_memory

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| embedding_method | "sentence_transformer" | "sentence_transformer" or "openai" |
| model_name | "all-MiniLM-L6-v2" | Embedding model name |
| vector_store_type | "in_memory" | "in_memory", "faiss", or "qdrant" |

Entity Linking

Resolve extracted entities to a reference knowledge base using GLinker:

Builder API:

builder.linker(
    model="knowledgator/gliner-linker-large-v1.0",
    entities="data/entities.jsonl",
    threshold=0.5,
)

YAML:

- id: linker
  processor: entity_linker
  requires: [ner]
  config:
    model: "knowledgator/gliner-linker-large-v1.0"
    entities: "data/entities.jsonl"
    threshold: 0.5

Parameters:

| Parameter | Default | Description |
| --- | --- | --- |
| model | (required) | GLinker model name |
| entities | (required) | Path to reference entity file (JSONL) |
| threshold | 0.5 | Minimum linking confidence |

Linked entities get a linked_entity_id that is used for deduplication in the graph writer and entity lookup during retrieval.


PDF Parsing

Build a knowledge graph directly from PDF files:

result = retrico.build_graph_from_pdf(
    pdf_paths=["paper.pdf", "report.pdf"],
    entity_labels=["person", "organization", "concept"],
    relation_labels=["authored", "references", "describes"],
)

The PDF reader extracts text and converts tables to Markdown before passing to the pipeline.


Extraction Without a Database

You can run the extraction pipeline independently of graph writing. This is useful for inspecting results, exporting to other systems, or working offline.

Standalone Extraction

retrico.extract() runs NER and relation extraction and returns an ExtractionResult — no database, no graph writer, no pipeline setup:

import retrico

result = retrico.extract(
    texts=[
        "Acme Corporation was founded by John Smith in San Francisco.",
        "Sarah Johnson joined Acme as CTO in 2015.",
    ],
    entity_labels=["person", "organization", "location"],
    relation_labels=["founded_by", "works_at", "located_in"],
)

# Per-text entity mentions (list of lists, one inner list per input text)
for i, text_entities in enumerate(result.entities):
    print(f"Text {i}:")
    for entity in text_entities:
        print(f"  [{entity.label}] {entity.text} (score: {entity.score:.2f})")

# Per-text relations
for i, text_relations in enumerate(result.relations):
    for rel in text_relations:
        print(f"  {rel.head_text} --[{rel.relation_type}]--> {rel.tail_text}")

Works with both backends:

# GLiNER (default — local, no API key)
result = retrico.extract(texts=[...], entity_labels=[...], method="gliner")

# LLM (any OpenAI-compatible API)
result = retrico.extract(
    texts=[...],
    entity_labels=[...],
    relation_labels=[...],
    method="llm",
    api_key="sk-...",
    model="gpt-4o-mini",
)

Pipeline Extraction to JSON

Use the builder API with graph_writer(json_output=...) to run the full pipeline and save results to a JSON file. The JSON uses the same format as ingest_data(), so you can re-import it into any database later:

import retrico

builder = retrico.RetriCoBuilder(name="extract_only")
builder.chunker(method="sentence")
builder.ner_gliner(
    labels=["person", "organization", "location", "date"],
    threshold=0.3,
)
builder.relex_gliner(
    entity_labels=["person", "organization", "location", "date"],
    relation_labels=["works at", "founded by", "located in"],
)
builder.graph_writer(json_output="output/extracted.json")

executor = builder.build(verbose=True)
result = executor.run(texts=[
    "Acme Corporation was founded by John Smith in 2010. "
    "The company is headquartered in San Francisco.",
])

The output file (output/extracted.json) contains a list of documents with their entities and relations:

[
  {
    "text": "Acme Corporation was founded by John Smith in 2010. ...",
    "entities": [
      {"text": "Acme Corporation", "label": "organization"},
      {"text": "John Smith", "label": "person"},
      {"text": "San Francisco", "label": "location"},
      {"text": "2010", "label": "date"}
    ],
    "relations": [
      {"head": "Acme Corporation", "tail": "John Smith", "type": "founded by"},
      {"head": "Acme Corporation", "tail": "San Francisco", "type": "located in"}
    ]
  }
]

You can later ingest this JSON into any supported database:

import json

import retrico

with open("output/extracted.json") as f:
    data = json.load(f)

retrico.ingest_data(
    data=data,
    store_config=retrico.Neo4jConfig(uri="bolt://localhost:7687"),
)

One-liner JSON Export

The build_graph() convenience function also supports json_output:

result = retrico.build_graph(
    texts=[...],
    entity_labels=["person", "organization"],
    relation_labels=["works at", "founded by"],
    json_output="output/data.json",
)

This writes to both the graph database and the JSON file. The JSON file is a portable snapshot that can be version-controlled, shared, or re-ingested independently.


Ingesting Structured Data

If you already have structured entities and relations (e.g. from an external source, a CSV, or a previous export), write them directly to the graph — no chunking, NER, or relation extraction needed.

ingest_data() convenience function

import retrico

ctx = retrico.ingest_data(
    data=[
        {
            "entities": [
                {"text": "Albert Einstein", "label": "person"},
                {"text": "Ulm", "label": "location"},
                {"text": "Princeton University", "label": "organization"},
            ],
            "relations": [
                {"head": "Albert Einstein", "tail": "Ulm", "type": "born_in", "score": 1.0},
                {"head": "Albert Einstein", "tail": "Princeton University", "type": "works_at"},
            ],
        },
    ],
)

stats = ctx.get("writer_result")
print(f"Entities: {stats['entity_count']}, Relations: {stats['relation_count']}")

You can also group entities and relations per document with source text and metadata (same format as JSON export):

result = retrico.ingest_data(
    data=[
        {
            "entities": [
                {"text": "Einstein", "label": "person", "properties": {"birth_year": 1879}},
                {"text": "Ulm", "label": "location"},
            ],
            "relations": [
                {"head": "Einstein", "tail": "Ulm", "type": "born_in"},
            ],
            "text": "Einstein was born in Ulm.",
            "metadata": {"source": "wikipedia"},
        },
    ],
)

Input Format

Entities — each dict requires text and label:

{"text": "Einstein", "label": "person"}                                       # minimal
{"text": "Einstein", "label": "person", "id": "Q937"}                         # explicit ID (used for dedup)
{"text": "Einstein", "label": "person", "score": 0.95}                        # with confidence
{"text": "Einstein", "label": "person", "properties": {"birth_year": 1879}}   # with properties

Relations — each dict requires head, tail, and type:

{"head": "Einstein", "tail": "Ulm", "type": "born_in"}                  # minimal
{"head": "Einstein", "tail": "Ulm", "type": "born_in", "score": 0.9}    # with score
{"head": "Einstein", "tail": "Ulm", "type": "born_in", "head_label": "person",
 "tail_label": "location", "properties": {"year": 1879}}                # full
{"head": "Einstein", "tail": "ETH Zurich", "type": "worked_at",
 "start_date": "1912-01-01", "end_date": "1914-03-01"}                  # temporal

The head and tail values must match an entity text (case-insensitive).
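Because of this matching rule, a quick pre-ingest check can catch relations whose endpoints name no entity. The helper below is a minimal sketch, not part of the RetriCo API:

```python
def find_dangling(entities, relations):
    """Return relations whose head or tail matches no entity text (case-insensitive)."""
    known = {e["text"].strip().lower() for e in entities}
    return [
        r for r in relations
        if r["head"].strip().lower() not in known
        or r["tail"].strip().lower() not in known
    ]

entities = [{"text": "Einstein", "label": "person"}, {"text": "Ulm", "label": "location"}]
relations = [
    {"head": "einstein", "tail": "Ulm", "type": "born_in"},    # OK: matching is case-insensitive
    {"head": "Einstein", "tail": "Bern", "type": "worked_in"}, # dangles: "Bern" is not an entity
]
print(find_dangling(entities, relations))
# [{'head': 'Einstein', 'tail': 'Bern', 'type': 'worked_in'}]
```

Running such a check before ingest_data() avoids silently dropped or malformed edges in the graph.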

Ingest Builder API

from retrico import RetriCoIngest

builder = RetriCoIngest(name="my_ingest")
builder.graph_writer(
    store_type="memgraph",
    memgraph_uri="bolt://localhost:7687",
)

executor = builder.build()
ctx = executor.run({
    "entities": [
        {"text": "Einstein", "label": "person"},
        {"text": "Ulm", "label": "location"},
    ],
    "relations": [
        {"head": "Einstein", "tail": "Ulm", "type": "born_in"},
    ],
})

# Save config for reproducibility
builder.save("configs/ingest.yaml")

Ingesting from a JSON File

The ingest format is designed to be loaded directly from JSON:

import json
import retrico

with open("data/knowledge_graph.json") as f:
    data = json.load(f)

ctx = retrico.ingest_data(data=data)

Building from a Relational Store

Instead of passing texts directly, pull them from an existing relational database using the store_reader processor.

Pipeline: store_reader → chunker → NER → relex → graph_writer

build_graph_from_store() convenience function

import retrico

result = retrico.build_graph_from_store(
    table="articles",
    text_field="body",
    id_field="article_id",
    metadata_fields=["author", "date"],
    relational_store_type="sqlite",
    sqlite_path="/data/articles.db",
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in"],
    limit=1000,
    offset=0,
    filter_empty=True,
)

Builder API

builder = retrico.RetriCoBuilder(name="from_store")
builder.chunk_store(type="sqlite", sqlite_path="/data/articles.db")
builder.store_reader(
    table="articles",
    text_field="body",
    id_field="article_id",
    metadata_fields=["author", "date"],
    limit=500,
)
builder.chunker(method="sentence")
builder.ner_gliner(labels=["person", "organization", "location"])
builder.relex_gliner(
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in"],
)
builder.graph_writer()

executor = builder.build(verbose=True)
result = executor.run({})  # empty input — store_reader provides texts

Works with PostgreSQL and Elasticsearch too:

# PostgreSQL
builder.chunk_store(type="postgres", postgres_host="localhost", postgres_database="mydb")
builder.store_reader(table="documents", text_field="content")

# Elasticsearch
builder.chunk_store(type="elasticsearch", elasticsearch_url="http://localhost:9200")
builder.store_reader(table="articles", text_field="text")

Without store_reader(), the pipeline behaves exactly as before — the chunker reads from $input.texts.


Building from PDF Files

Extract text and tables from PDF documents and build a knowledge graph. Uses pdfminer.six for layout analysis and pdfplumber for table extraction.

Pipeline: pdf_reader → NER → relex → graph_writer

Each PDF page becomes one chunk. Tables are detected and converted to Markdown format.

pip install 'retrico[pdf]'

build_graph_from_pdf() convenience function

result = retrico.build_graph_from_pdf(
    pdf_paths=["reports/annual_report.pdf", "papers/research.pdf"],
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in"],
    extract_tables=True,
    page_ids=None,  # None = all pages, or [0, 1, 2] for specific pages
)

Builder API

builder = retrico.RetriCoBuilder(name="pdf_pipeline")
builder.pdf_reader(
    extract_text=True,
    extract_tables=True,
    page_ids=None,
)
builder.ner_gliner(labels=["person", "organization", "location"])
builder.relex_gliner(
    entity_labels=["person", "organization", "location"],
    relation_labels=["works at", "born in"],
)
builder.graph_writer()

executor = builder.build(verbose=True)
result = executor.run(pdf_paths=["document.pdf"])

How it works

  1. Layout analysis — pdfminer.six extracts page elements
  2. Table detection — pdfplumber identifies table structures
  3. Table → Markdown — tables are converted to pipe-separated Markdown
  4. Page chunking — each page becomes a Chunk with metadata {"page_number": N, "source_pdf": "filename.pdf"}
  5. Document creation — one Document per PDF file
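Step 3 above amounts to emitting pipe-separated rows. A minimal sketch of such a conversion (illustrative only, not RetriCo's actual implementation):

```python
def table_to_markdown(rows):
    """Convert a list-of-lists table (first row = header) to pipe-separated Markdown."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(cell) for cell in row) + " |" for row in body]
    return "\n".join(lines)

print(table_to_markdown([["Name", "Role"], ["Einstein", "physicist"]]))
```

Rendering tables as Markdown keeps their row/column structure visible to the downstream NER and relation-extraction steps, which only see plain text.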

Page-level chunking (without PDF reader)

If you already have page-separated text, use the "page" chunking method:

builder.chunker(method="page")  # splits on \f characters
text = "Page 1 content here...\fPage 2 content here..."
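In plain-Python terms, the "page" method behaves like splitting on the form-feed character:

```python
text = "Page 1 content here...\fPage 2 content here..."
pages = text.split("\f")  # "\f" is the form-feed page separator
print(pages)
# ['Page 1 content here...', 'Page 2 content here...']
```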

Accessing Intermediate Results

build_graph() and executor.run() return a PipeContext containing every pipeline stage output:

result = retrico.build_graph(texts=..., entity_labels=...)

# Chunks produced by the chunker
chunks = result.get("chunker_result")["chunks"]

# Per-chunk entity mentions from NER
entities = result.get("ner_result")["entities"] # List[List[EntityMention]]

# Linked entities (if linker was enabled)
linked = result.get("linker_result")["entities"]

# Per-chunk relations (if relex was enabled)
relations = result.get("relex_result")["relations"] # List[List[Relation]]

# Final write stats + deduplicated entity map
writer = result.get("writer_result")
print(writer["entity_count"], writer["relation_count"])
entity_map = writer["entity_map"] # dedup_key -> Entity

# Embedding stats (if embedders were enabled)
chunk_emb = result.get("chunk_embedder_result")
entity_emb = result.get("entity_embedder_result")

Available Processors

| Processor | Description | Key config |
| --- | --- | --- |
| store_reader | Pull texts from a relational store | table, text_field, id_field, metadata_fields, limit |
| pdf_reader | Extract text + tables from PDFs | extract_text, extract_tables, page_ids |
| chunker | Split text into chunks | method (sentence/paragraph/fixed/page), chunk_size, overlap |
| ner_gliner | Entity extraction with GLiNER | model, labels, threshold, device |
| ner_llm | Entity extraction with LLM | model, labels, api_key, base_url |
| entity_linker | Entity linking with GLinker | executor, model, threshold, entities |
| relex_gliner | Relation extraction with GLiNER-relex | model, entity_labels, relation_labels, relation_threshold |
| relex_llm | Relation extraction with LLM | model, entity_labels, relation_labels, api_key |
| data_ingest | Convert flat JSON to graph_writer format | (internal) |
| graph_writer | Deduplicate and write to graph store | store_type, json_output, setup_indexes |
| chunk_embedder | Embed chunk texts into vector store | embedding_method, model_name, vector_store_type |
| entity_embedder | Embed entity labels into vector store | embedding_method, model_name, vector_store_type |
| query_parser | Extract entities from a query | method (gliner/llm/tool), labels, api_key |
| retriever | Look up entities + k-hop subgraph | max_hops |
| path_retriever | Shortest paths between parsed entities | max_path_length, max_pairs |
| community_retriever | Vector search communities | top_k, max_hops |
| entity_embedding_retriever | Vector search entities | top_k, max_hops |
| chunk_embedding_retriever | Vector search chunks | top_k, max_hops |
| tool_retriever | LLM agent with graph tools | api_key, model, max_tool_rounds |
| keyword_retriever | Full-text search chunks | top_k, search_source |
| fusion | Merge multiple retriever subgraphs | strategy, top_k |
| chunk_retriever | Fetch source chunks for entities | max_chunks, chunk_entity_source |
| reasoner | LLM multi-hop reasoning | api_key, model |
| community_detector | Community detection | method, levels, resolution |
| community_summarizer | LLM summaries for communities | api_key, model, top_k |
| community_embedder | Embed community summaries | embedding_method, model_name |
| kg_triple_reader | Read triples from graph store or TSV | source, train_ratio |
| kg_trainer | Train PyKEEN KG embedding model | model, embedding_dim, epochs |
| kg_embedding_storer | Store trained KG embeddings | model_path, vector_store_type |
| kg_scorer | Score triples, predict links | model_path, top_k, predict_tails |

Graph Schema

RetriCo writes the following node and relationship types to the graph database:

(:Entity {id, label, entity_type, properties})
(:Chunk {id, document_id, text, index, start_char, end_char})
(:Document {id, source, metadata})

(entity)-[:MENTIONED_IN {start, end, score, text}]->(chunk)
(entity)-[:RELATION_TYPE {score, chunk_id, id}]->(entity)
(chunk)-[:PART_OF]->(document)
  • Entities are deduplicated by canonical name (label.strip().lower())
  • Relation types are sanitized: spaces become underscores, uppercased (e.g. "born in" becomes BORN_IN)
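These two normalization rules can be expressed directly in Python; the sketch below mirrors the stated behavior:

```python
def dedup_key(name):
    """Canonical entity name used for deduplication: stripped, lowercased."""
    return name.strip().lower()

def sanitize_relation_type(label):
    """Relation type sanitization: spaces become underscores, result uppercased."""
    return label.strip().replace(" ", "_").upper()

print(dedup_key("  Albert Einstein "))    # albert einstein
print(sanitize_relation_type("born in"))  # BORN_IN
```

This means "Einstein" and "einstein " collapse into one Entity node, and the relation label "born in" appears in Cypher queries as the BORN_IN relationship type.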

Full YAML Configuration Reference

GLiNER Build Pipeline

name: build_gliner
stores:
  graph:
    store_type: neo4j
    uri: "bolt://localhost:7687"
    user: neo4j
    password: password

nodes:
  - id: chunker
    processor: chunker
    inputs:
      texts: {source: "$input", fields: "texts"}
    output: {key: "chunker_result"}
    config:
      method: sentence

  - id: ner
    processor: ner_gliner
    requires: [chunker]
    inputs:
      chunks: {source: "chunker_result", fields: "chunks"}
    output: {key: "ner_result"}
    config:
      model: "knowledgator/gliner-multitask-large-v0.5"
      labels: [person, organization, location, date]
      threshold: 0.3
      flat_ner: true
      device: cpu

  - id: relex
    processor: relex_gliner
    requires: [ner]
    inputs:
      entities: {source: "ner_result", fields: "entities"}
      chunks: {source: "ner_result", fields: "chunks"}
    output: {key: "relex_result"}
    config:
      model: "knowledgator/gliner-relex-large-v0.5"
      entity_labels: [person, organization, location, date]
      relation_labels: [works at, born in, located in, developed]
      relation_threshold: 0.5

  - id: writer
    processor: graph_writer
    requires: [relex]
    inputs:
      entities: {source: "relex_result", fields: "entities"}
      relations: {source: "relex_result", fields: "relations"}
      chunks: {source: "relex_result", fields: "chunks"}
      documents: {source: "chunker_result", fields: "documents"}
    output: {key: "writer_result"}

All-LLM Build Pipeline

name: build_llm
stores:
  graph:
    store_type: neo4j
    uri: "bolt://localhost:7687"

nodes:
  - id: chunker
    processor: chunker
    inputs:
      texts: {source: "$input", fields: "texts"}
    output: {key: "chunker_result"}
    config:
      method: sentence

  - id: ner
    processor: ner_llm
    requires: [chunker]
    inputs:
      chunks: {source: "chunker_result", fields: "chunks"}
    output: {key: "ner_result"}
    config:
      api_key: "sk-..."
      model: "gpt-4o-mini"
      labels: [person, organization, location]
      temperature: 0.1

  - id: relex
    processor: relex_llm
    requires: [ner]
    inputs:
      entities: {source: "ner_result", fields: "entities"}
      chunks: {source: "ner_result", fields: "chunks"}
    output: {key: "relex_result"}
    config:
      api_key: "sk-..."
      model: "gpt-4o-mini"
      entity_labels: [person, organization, location]
      relation_labels: [works at, born in, located in]

  - id: writer
    processor: graph_writer
    requires: [relex]
    inputs:
      entities: {source: "relex_result", fields: "entities"}
      relations: {source: "relex_result", fields: "relations"}
      chunks: {source: "relex_result", fields: "chunks"}
    output: {key: "writer_result"}

Mixed Pipeline (GLiNER NER + LLM Relex)

name: build_mixed
stores:
  graph:
    store_type: falkordb_lite

nodes:
  - id: chunker
    processor: chunker
    inputs:
      texts: {source: "$input", fields: "texts"}
    output: {key: "chunker_result"}
    config:
      method: sentence

  - id: ner
    processor: ner_gliner
    requires: [chunker]
    inputs:
      chunks: {source: "chunker_result", fields: "chunks"}
    output: {key: "ner_result"}
    config:
      labels: [person, organization, location]
      threshold: 0.3

  - id: relex
    processor: relex_llm
    requires: [ner]
    inputs:
      entities: {source: "ner_result", fields: "entities"}
      chunks: {source: "ner_result", fields: "chunks"}
    output: {key: "relex_result"}
    config:
      api_key: "sk-..."
      model: "gpt-4o-mini"
      entity_labels: [person, organization, location]
      relation_labels: [works at, born in]

  - id: writer
    processor: graph_writer
    requires: [relex]
    inputs:
      entities: {source: "relex_result", fields: "entities"}
      relations: {source: "relex_result", fields: "relations"}
      chunks: {source: "relex_result", fields: "chunks"}
    output: {key: "writer_result"}

Build Pipeline with Embeddings

name: build_with_embeddings
stores:
  graph:
    store_type: neo4j
    uri: "bolt://localhost:7687"
  vector:
    store_type: faiss

nodes:
  - id: chunker
    processor: chunker
    inputs:
      texts: {source: "$input", fields: "texts"}
    output: {key: "chunker_result"}
    config:
      method: sentence

  - id: ner
    processor: ner_gliner
    requires: [chunker]
    inputs:
      chunks: {source: "chunker_result", fields: "chunks"}
    output: {key: "ner_result"}
    config:
      labels: [person, organization, location]

  - id: relex
    processor: relex_gliner
    requires: [ner]
    inputs:
      entities: {source: "ner_result", fields: "entities"}
      chunks: {source: "ner_result", fields: "chunks"}
    output: {key: "relex_result"}
    config:
      entity_labels: [person, organization, location]
      relation_labels: [works at, born in]

  - id: writer
    processor: graph_writer
    requires: [relex]
    inputs:
      entities: {source: "relex_result", fields: "entities"}
      relations: {source: "relex_result", fields: "relations"}
      chunks: {source: "relex_result", fields: "chunks"}
    output: {key: "writer_result"}

  - id: chunk_embedder
    processor: chunk_embedder
    requires: [writer]
    config:
      embedding_method: sentence_transformer
      model_name: "all-MiniLM-L6-v2"
      vector_store_type: faiss

  - id: entity_embedder
    processor: entity_embedder
    requires: [writer]
    config:
      embedding_method: sentence_transformer
      model_name: "all-MiniLM-L6-v2"
      vector_store_type: in_memory