Examples (Innovation Release)

The following examples show how to perform bulk knowledge base (KB) embedding using AI pipelines, including the expected output for each stage. They cover ingesting data from a standard Postgres table and from an external S3 bucket volume.

Example 1: Bulk embedding from a Postgres table

This workflow is used when your source data is already stored within your database.

Step 1: Initialize the AI model

First, register a model that supports text encoding operations:

SELECT aidb.create_model('model__6545', 'bert_local');
Output
 create_model 
--------------
 model__6545
(1 row)

Step 2: Prepare the source data

Create a table and insert the records you wish to embed:

CREATE TABLE source_table__6545 (
    id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    content TEXT NOT NULL
);

INSERT INTO source_table__6545 (id, content)
VALUES (43941, 'Catwalk Women Brown Heels'),
       (55018, 'Lakme 3 in 1 Orchid Aqua Shine Lip Color'),
       (19337, 'United Colors of Benetton Men Stripes Black Jacket');

Step 3: Define the pipeline

Create a pipeline that connects the source table to a knowledge base destination:

SELECT aidb.create_pipeline(
    name => 'pipeline__6545',
    source => 'source_table__6545',
    source_key_column => 'id',
    source_data_column => 'content',
    step_1 => 'KnowledgeBase',
    step_1_options => aidb.knowledge_base_config(
        model => 'model__6545',
        data_format => 'Text'
    )
);
Output
INFO:  Created pipeline "pipeline__6545"
INFO:  Created knowledge base "public.pipeline_pipeline__6545" for pipeline "pipeline__6545"
                               create_pipeline                            
-----------------------------------------------------------------------
 (pipeline__6545,Table,public,pipeline_pipeline__6545,source_id,value)
(1 row)

Step 4: Execute and verify

Trigger the processing engine and inspect the generated vector table.

SELECT aidb.run_pipeline('pipeline__6545');

-- View the generated embeddings
SELECT * FROM pipeline_pipeline__6545 ORDER BY id;
Output
 id | pipeline_id | source_id | part_ids |     value     
----+-------------+-----------+----------+---------------
  1 |          49 | 43941     | {0}      | [0.1,0.2,0.3]
  2 |          49 | 55018     | {0}      | [0.1,0.2,0.3]
  3 |          49 | 19337     | {0}      | [0.1,0.2,0.3]
(3 rows)
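To relate each embedding back to its source row, you can join the generated table to the source table on the key column. This is a plain-SQL sketch; it assumes `source_id` is stored as text in the pipeline table (as the left-aligned values in the output above suggest), so it is cast back to `INT` for the join:

```sql
-- Map each embedding back to its source row.
-- Assumes source_id is stored as text; cast it for the join.
SELECT e.source_id, s.content, e.value
FROM pipeline_pipeline__6545 AS e
JOIN source_table__6545 AS s
  ON s.id = e.source_id::INT
ORDER BY s.id;
```

This is useful for spot-checking that every inserted record produced exactly one embedding.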

Example 2: Bulk embedding from an S3 bucket (PGFS)

This workflow is used for unstructured data stored in external cloud storage.

Step 1: Configure external storage and volume

Define the S3 bucket location and map it to an AIDB volume.

SELECT pgfs.create_storage_location(
    'bucket__6545',
    's3://noah-html-test',
    options => '{"region": "eu-central-1", "skip_signature": "true"}'
);

SELECT aidb.create_volume('volume__6545', 'bucket__6545', '/', 'Text');

Step 2: List volume content

Verify the files available for processing:

SELECT * FROM aidb.list_volume_content('volume__6545');
Output
    file_name     | size  |      last_modified      
------------------+-------+-------------------------
 Bilge Ince.html  |  8878 | 2025-02-19 09:58:29 UTC
 all_tags.html    | 11170 | 2025-02-19 09:58:29 UTC
 hello_world.html |   108 | 2025-02-19 09:58:30 UTC
 simple.html      |   662 | 2025-02-19 09:58:29 UTC
(4 rows)

Step 3: Create and run the volume pipeline

Define a pipeline that reads from the volume, then trigger it:

SELECT aidb.create_pipeline(
    name => 'volume_pipeline__6545',
    source => 'volume__6545',
    step_1 => 'KnowledgeBase',
    step_1_options => aidb.knowledge_base_config(
        model => 'model__6545',
        data_format => 'Text'
    )
);

SELECT aidb.run_pipeline('volume_pipeline__6545');

Step 4: Verify embeddings

Inspect the generated vector table:

SELECT source_id, part_ids, value FROM pipeline_volume_pipeline__6545 ORDER BY source_id;
Output
    source_id     | part_ids |     value     
------------------+----------+---------------
 Bilge Ince.html  | {0}      | [0.1,0.2,0.3]
 all_tags.html    | {0}      | [0.1,0.2,0.3]
 hello_world.html | {0}      | [0.1,0.2,0.3]
 simple.html      | {0}      | [0.1,0.2,0.3]
(4 rows)
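For volume pipelines, the file name serves as the key, so you can retrieve the embedding for a single document directly. A minimal sketch, filtering the table generated above:

```sql
-- Fetch the embedding for one document by its file name,
-- which the pipeline uses as source_id for volume sources.
SELECT source_id, value
FROM pipeline_volume_pipeline__6545
WHERE source_id = 'hello_world.html';
```

When new files are added to the bucket, rerunning the pipeline with aidb.run_pipeline processes the volume again and populates rows for the new documents.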