AI Data Pipeline Automation with AIDB
AIDB is EDB's Postgres extension that automates the entire AI data preparation pipeline — chunking, embedding, and vector indexing — triggered live inside the database the moment new data arrives. To show it in action, I used an investigation story PDF: a real document full of players, events, and relationships that needed to become a queryable knowledge base without any manual data wrangling.
What AIDB automates — and why it matters
Every AI pipeline that works with unstructured documents has the same hidden cost: data preparation. Before a single query can be answered, raw text must be cleaned, split into model-friendly chunks, converted into vector embeddings, and indexed for retrieval. In a conventional setup, each of those steps is a separate job — something you write, schedule, monitor, and re-run every time source data changes.
AIDB removes that cost entirely. You declare a preparer and a knowledge base as live database objects. From that moment on, the database owns data preparation. Every INSERT into a source table automatically triggers chunking, embedding, and indexing — with no external scheduler, no ETL script, and no risk of the knowledge base falling out of sync.
The core principle: data preparation is not a job you run on a schedule — it is a behaviour the database exhibits automatically on every INSERT. AIDB's Live mode makes this real.
To put this to the test, I took an investigation story PDF — a dense, character-rich document full of named players, a timeline of events, and a web of relationships — and used AIDB to turn it into a fully automated, queryable RAG knowledge base. Here is the exact pipeline, in SQL.
The automated pipeline at a glance
PDF file → source_documents → target_preparer (auto) → RAG_KB (auto) → Query ready
Only the first step — loading parsed PDF content into source_documents — involves any manual action. Everything beyond that is owned by AIDB. The target_preparer chunks each passage the moment it arrives. The RAG_KB embeds each chunk immediately after. The investigation story is query-ready before you close your SQL client.
Step 1 — load the PDF content into source_documents
The investigation story PDF is parsed externally — each passage, section, or paragraph extracted as a discrete piece of text. Those pieces land as rows in the source table. The table is intentionally simple: just an ID, a part number, a generated unique key, and the raw text.
The generated unique_id column — combining document ID and part number — ensures every passage is traceable all the way through chunking and embedding. When you query RAG_KB later and get a result back, you can trace it directly to the original story passage.
DROP TABLE IF EXISTS source_documents;
CREATE TABLE source_documents (
id TEXT,
part_id INTEGER NOT NULL,
unique_id TEXT NOT NULL GENERATED ALWAYS AS
((id || '.part.') || part_id) STORED,
result TEXT,
CONSTRAINT source_documents_pkey PRIMARY KEY (unique_id)
);
CREATE INDEX source_documents_id_idx
ON source_documents (id);
-- confirm story passages are loaded
SELECT * FROM source_documents;In the investigation story use case, each row represents one passage — a player profile, a scene description, a timeline entry, or a relationship note. The richer and more granular the parsing, the more precise the retrieval. Once the rows are inserted, AIDB automation takes over immediately.
Step 2 — automate chunking with target_preparer
Story passages vary in length and structure. AIDB's target_preparer applies the ChunkText operation to break each passage into consistently-sized, embedding-ready chunks, writing them automatically to the prepared document table. This is the first stage of automated data preparation — configured once, runs forever.
Calling aidb.set_auto_preparer with 'Live' activates the automation. Every subsequent INSERT into source_documents is chunked instantly — no cron job, no Celery worker, no polling loop.
-- idempotent: safe to re-run
SELECT aidb.delete_preparer('target_preparer');
SELECT aidb.create_table_preparer(
name => 'target_preparer',
operation => 'ChunkText',
source_table => 'source_documents',
source_data_column => 'result',
destination_table => 'prepared_document',
destination_data_column => 'chunks',
source_key_column => 'unique_id',
destination_key_column => 'id',
options => '{"desired_length": 100}'::JSONB
);
-- Live mode: chunking fires automatically on every INSERT
SELECT aidb.set_auto_preparer('target_preparer', 'Live');A chunk size of 100 tokens suits narrative content well — specific enough to isolate a player or event, broad enough to carry surrounding context. For the investigation story, this means each chunk typically covers a single character trait, a single scene, or a single relationship link.
Step 3 — automate embedding with RAG_KnowledgeBase
With chunks flowing automatically from the preparer, RAG_KB handles the second stage: embedding. AIDB converts each chunk into a BERT vector and stores it in the backing vector table. Set to Live mode, this fires the moment a new chunk is written — completing the fully automated data preparation chain.
No manual embedding run. No batch job. The entire investigation story — every player, every event, every relationship — is encoded as semantic vectors inside Postgres, continuously and automatically.
-- idempotent: safe to re-run
SELECT aidb.delete_knowledge_base('RAG_KB');
SELECT aidb.create_table_knowledge_base(
name => 'RAG_KB',
model_name => 'bert',
source_table => 'prepared_document',
source_data_column => 'chunks',
source_data_format => 'Text',
source_key_column => 'unique_id'
);
-- Live mode: embedding fires automatically on every new chunk
SELECT aidb.set_auto_knowledge_base('RAG_KB', 'Live');
-- verify vectors are populated
SELECT * FROM RAG_KB_vector LIMIT 10;The pipeline is now fully live. Insert a new passage from the investigation story into source_documents — new testimony, a newly discovered document, an updated player profile — and it flows through target_preparer and into RAG_KB automatically. No further action needed.
The result: querying the story like a database
With the automated pipeline live, RAG_KB can be queried using natural language — passed directly to AIDB's semantic retrieval function. Ask about a player by name, a type of event, a location, or a relationship between parties. The retrieval is similarity-based, not keyword-based: even if the query words differ from the original text, AIDB finds and ranks the most relevant passages.
This is what AIDB's automated data preparation makes possible. The investigation story stops being a static PDF and becomes a structured, searchable intelligence layer — one that stays current automatically as new material is added.
Summary: what AIDB automates end to end
The investigation story example demonstrates a general capability. Any document — PDF, report, case file, research paper, HTML, etc. — can be loaded into source_documents and immediately become part of a live, automated RAG pipeline. AIDB handles the rest:
Chunking — target_preparer splits raw text into model-ready chunks automatically on INSERT.
Embedding — RAG_KB converts each chunk to a vector automatically as chunks arrive.
Indexing — vectors are stored and indexed inside Postgres, with no external vector store required.
Freshness — the knowledge base is always in sync with source data. No scheduled re-runs, no drift.
Swap bert for any AIDB-supported model — OpenAI embeddings, a local Ollama model, or a fine-tuned domain model — without changing the pipeline. target_preparer and RAG_KB are model-agnostic by design.