Pipeline steps v7

Each step in an AI pipeline is defined by its operation type and an optional configuration object. You pass the operation name as a string to the step_N parameter of aidb.create_pipeline(), and the configuration as JSONB to the corresponding step_N_options parameter.

SELECT aidb.create_pipeline(
    name              => 'my_pipeline',
    source            => 'my_source_table',
    source_key_column => 'id',
    source_data_column => 'content',
    step_1            => 'ChunkText',
    step_1_options    => aidb.chunk_text_config(desired_length => 120, max_length => 150),
    step_2            => 'KnowledgeBase',
    step_2_options    => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Use the helper functions described below to build the step_N_options value for each operation type. All helper functions return a JSONB configuration object.
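Because the helpers simply return JSONB, you can call one standalone in psql to inspect the configuration it produces before wiring it into a pipeline (a sketch; the exact JSON keys in the output are not specified in this reference):

```sql
-- Call a config helper directly; the result is a plain JSONB value
-- that can be inspected, stored, or passed to a step_N_options parameter.
SELECT aidb.chunk_text_config(desired_length => 120, max_length => 150);
```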

ChunkText

Divides text into smaller segments to fit within LLM context windows.

Helper function: aidb.chunk_text_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| desired_length | INTEGER | Required | Target chunk size. The unit depends on strategy. |
| max_length | INTEGER | NULL | Maximum chunk size. If omitted, desired_length acts as a strict upper limit. |
| overlap_length | INTEGER | NULL | Amount of content to overlap between consecutive chunks. Defaults to 0 (no overlap). |
| strategy | TEXT | NULL | 'chars' (default) for character-based or 'words' for word-based chunking. |

Basic chunking example:

SELECT aidb.create_pipeline(
    name               => 'chunk_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'ChunkText',
    step_1_options     => aidb.chunk_text_config(
        desired_length => 100,
        max_length     => 150,
        overlap_length => 20,
        strategy       => 'words'
    )
);

Result

ChunkText transforms the shape of the data by introducing a part_id column. Each source row may produce multiple output rows, one per chunk.

ParseHtml

Extracts readable text from HTML strings, stripping tags while preserving logical structure.

Helper function: aidb.html_parse_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| method | TEXT | NULL | 'StructuredPlaintext' (default) for plain text extraction, or 'StructuredMarkdown' to retain hierarchy. |

SELECT aidb.create_pipeline(
    name               => 'html_pipeline',
    source             => 'web_data_table',
    source_key_column  => 'id',
    source_data_column => 'html_content',
    step_1             => 'ParseHtml',
    step_1_options     => aidb.html_parse_config(method => 'StructuredMarkdown'),
    step_2             => 'ChunkText',
    step_2_options     => aidb.chunk_text_config(desired_length => 120)
);

ParsePdf

Extracts text from binary PDF data, with options to handle non-compliant or complex files.

Helper function: aidb.pdf_parse_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| method | TEXT | NULL | 'Structured' (default). Uses the PDF specification to identify text blocks. |
| allow_partial_parsing | BOOLEAN | NULL | If true (default), continues parsing when errors are encountered on individual pages. |

The resulting part_id column maps to the page index from which each text block was extracted.

SELECT aidb.create_pipeline(
    name               => 'pdf_pipeline',
    source             => 'pdf_files_table',
    source_key_column  => 'id',
    source_data_column => 'pdf_data',
    step_1             => 'ParsePdf',
    step_1_options     => aidb.pdf_parse_config(
        method                => 'Structured',
        allow_partial_parsing => true
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

ParsePdf unnests results — a multi-page PDF produces one row per page, each with a part_id corresponding to the page index.

PerformOcr

Extracts text from images using an OCR-capable AI model, such as NVIDIA NIM PaddleOCR.

Helper function: aidb.ocr_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | TEXT | Required | Name of the registered OCR model to use. |

Before using this step, register an OCR-capable model:

SELECT aidb.create_model(
    'my_paddle_ocr_model',
    'nim_paddle_ocr',
    credentials => '{"api_key": "<NVIDIA_NIM_API_KEY>"}'::JSONB
);

Then reference it in your pipeline:

SELECT aidb.create_pipeline(
    name               => 'ocr_pipeline',
    source             => 'images_table',
    source_key_column  => 'id',
    source_data_column => 'image_data',
    step_1             => 'PerformOcr',
    step_1_options     => aidb.ocr_config(model => 'my_paddle_ocr_model'),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

Result

PerformOcr unnests results: a single image may produce multiple rows, one per detected text block. The NVIDIA NIM provider currently supports only PNG and JPEG formats.

SummarizeText

Generates concise summaries of long text passages using an AI language model.

Helper function: aidb.summarize_text_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | TEXT | Required | Name of the registered model to use for summarization. |
| chunk_config | JSONB | NULL | Optional chunking configuration (from aidb.chunk_text_config()) applied before summarization. |
| prompt | TEXT | NULL | Custom prompt to guide the summarization. Uses a standard prompt if omitted. |
| strategy | TEXT | NULL | 'append' (default) concatenates per-chunk summaries; 'reduce' iteratively compresses them. |
| reduction_factor | INTEGER | NULL | Used with the 'reduce' strategy. Controls how aggressively text is reduced per iteration (default is 3). |
| inference_config | JSONB | NULL | Optional runtime inference settings (from aidb.inference_config()). |

SELECT aidb.create_pipeline(
    name               => 'summary_pipeline',
    source             => 'articles_table',
    source_key_column  => 'id',
    source_data_column => 'body',
    step_1             => 'SummarizeText',
    step_1_options     => aidb.summarize_text_config(
        model        => 'my_t5_model',
        chunk_config => aidb.chunk_text_config(desired_length => 100, max_length => 120, overlap_length => 10, strategy => 'words'),
        prompt       => 'Summarize the key points concisely',
        strategy     => 'reduce',
        reduction_factor => 3
    ),
    step_2             => 'KnowledgeBase',
    step_2_options     => aidb.knowledge_base_config(model => 'bert', data_format => 'Text')
);

KnowledgeBase

Converts processed text or image data into vector embeddings and stores them in a searchable knowledge base. This step must always be the last step in a pipeline, as its output is a VECTOR type that cannot be used as input by any subsequent step. For querying the knowledge base with semantic or hybrid search, see Knowledge bases.

Helper function: aidb.knowledge_base_config()

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | TEXT | Required | Name of the embedding model. |
| data_format | aidb.PipelineDataFormat | Required | 'Text' or 'Image'. |
| distance_operator | aidb.DistanceOperator | NULL | Similarity metric: L2 (default), Cosine, or InnerProduct. |
| vector_index | JSONB | NULL | Vector index config, built with a vector index helper such as aidb.vector_index_hnsw_config(). |

SELECT aidb.create_pipeline(
    name               => 'kb_pipeline',
    source             => 'source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'KnowledgeBase',
    step_1_options     => aidb.knowledge_base_config(
        model             => 'bert',
        data_format       => 'Text',
        distance_operator => 'Cosine',
        vector_index      => aidb.vector_index_hnsw_config(m => 16, ef_construction => 64)
    )
);

To link multiple pipelines to the same knowledge base, use aidb.knowledge_base_config_from_kb(data_format) instead. This technique inherits the model and distance operator settings from the existing knowledge base.
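As a sketch, a second pipeline feeding the existing knowledge base could look like the following. The pipeline and source table names are hypothetical, and this assumes the target knowledge base is resolved by the pipeline machinery rather than passed explicitly, since this reference shows only the data_format argument:

```sql
-- Hypothetical second pipeline writing into an existing knowledge base.
-- The model and distance operator are inherited from that knowledge base,
-- so only the data format is specified here.
SELECT aidb.create_pipeline(
    name               => 'kb_pipeline_2',
    source             => 'another_source_table',
    source_key_column  => 'id',
    source_data_column => 'content',
    step_1             => 'KnowledgeBase',
    step_1_options     => aidb.knowledge_base_config_from_kb(data_format => 'Text')
);
```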

Destination table

The KnowledgeBase step automatically creates a destination table named pipeline_<pipeline_name> with the following schema:

| Column | Type | Description |
|--------|------|-------------|
| id | BIGSERIAL | Primary key. |
| pipeline_id | INT | Reference to the originating pipeline. |
| source_id | TEXT | ID of the original source record. |
| part_ids | BIGINT[] | Tracks segments if the data was chunked or parsed. |
| value | VECTOR | The pgvector embedding. |
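Since this is an ordinary table, it can be inspected with plain SQL. A minimal sketch, assuming a pipeline named kb_pipeline (so the destination table is pipeline_kb_pipeline) and using pgvector's vector_dims() to check the embedding width:

```sql
-- Inspect which source records were embedded, which parts (chunks or
-- pages) contributed, and the dimensionality of each stored vector.
SELECT source_id,
       part_ids,
       vector_dims(value) AS embedding_dimensions
FROM pipeline_kb_pipeline
ORDER BY source_id;
```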

Multi-pipeline knowledge bases

A single knowledge base can aggregate embeddings from multiple pipelines. The internal knowledge_base_pipeline junction table manages these mappings. When retrieving results via aidb.retrieve_text(), each row includes a pipeline_name column so you can identify which pipeline produced each embedding.
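A hedged sketch of such a retrieval follows. The argument shape shown here (knowledge base name, query text, result count) is an assumption; only the function name and the pipeline_name output column are documented above, so consult the Knowledge bases reference for the actual signature:

```sql
-- Hypothetical call shape for aidb.retrieve_text(); verify the real
-- signature in the Knowledge bases reference before use.
SELECT *
FROM aidb.retrieve_text('kb_pipeline', 'how do I configure chunking?', 5);
-- The pipeline_name column in each result row identifies which
-- pipeline produced the matching embedding.
```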

For knowledge base views and statistics, see Knowledge bases reference.

To see pipeline steps used together in a complete end-to-end workflow, see Example.