Background workers v7

The background worker is the execution engine for asynchronous AI pipelines in EDB Postgres AI. It allows for high-volume data processing without blocking standard database transactions.

A background worker is activated automatically when a pipeline is created or updated with auto_processing => 'Background'. Before enabling Background mode, confirm that the required Postgres server settings are in place.

Postgres prerequisites

Two postgresql.conf settings must be in place before background pipelines can run.

shared_preload_libraries

shared_preload_libraries = 'aidb' must be set in postgresql.conf before background workers can start. This is configured during AIDB installation — see Configuring AIDB. Confirm it is in place before enabling Background mode on any pipeline.
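To confirm the library is loaded, you can query the current setting from any SQL session (SHOW is a standard Postgres command):

```sql
-- Verify the preload configuration; the value must include 'aidb'
-- for background workers to be available.
SHOW shared_preload_libraries;
```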

max_worker_processes

Each pipeline in Background mode requires one dedicated Postgres background worker process. The max_worker_processes setting controls how many background worker processes are allowed across the entire cluster.

The default value is 8. If you plan to run multiple background pipelines, increase this value:

max_worker_processes = 20

Restart Postgres after changing this setting. To check the current value:

SHOW max_worker_processes;
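If you prefer not to edit postgresql.conf by hand, the same change can be made with ALTER SYSTEM, a standard Postgres mechanism. A full restart is still required either way, because max_worker_processes can only change at server start:

```sql
-- Persists the setting to postgresql.auto.conf (requires superuser).
ALTER SYSTEM SET max_worker_processes = 20;
-- Reloading the configuration is not sufficient for this parameter;
-- restart the server, then verify:
SHOW max_worker_processes;
```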

If max_worker_processes is exhausted, newly created background pipelines will not start a worker and will silently skip processing until capacity is freed.
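One way to check whether worker slots are being consumed is to list non-client backends in pg_stat_activity. The exact backend_type label used by AIDB workers is not documented here, so treat this as a rough diagnostic rather than an official interface:

```sql
-- Count background processes currently running in the cluster,
-- grouped by their reported backend type.
SELECT backend_type, count(*)
FROM pg_stat_activity
WHERE backend_type <> 'client backend'
GROUP BY backend_type;
```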

Core functionality

  • Asynchronous execution: When a pipeline is set to Background mode, processing occurs independently of the user session, so queries and data modifications on the source table are not delayed by embedding generation or OCR tasks.

  • Batch processing: Background workers group records into configurable batch sizes. This optimizes throughput, especially when interacting with GPU-based models or remote AI service APIs.

  • Parallel operations: Within each batch, the worker runs pipeline steps (data retrieval, embedding computation, and storage) as parallel operations to maximize performance.

  • Continuous polling: The worker continuously monitors the source for changes based on a defined background_sync_interval.

Change detection

The background worker handles different source types using specific detection logic:

  • Table sources: Lightweight triggers capture insert, update, and delete events and record them in a backlog. The background worker then processes this backlog at the next interval.

  • Volume sources: The background worker performs a scan of the external storage. It compares the last_modified timestamps of files against a state table to identify new or changed documents.
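Conceptually, the volume-source comparison resembles the query below. The table and column names here (files_listing, aidb_file_state, last_modified) are illustrative only and do not reflect AIDB's internal schema:

```sql
-- Hypothetical sketch: find files that are new or changed since the
-- last scan by comparing listing timestamps against recorded state.
SELECT l.file_name
FROM files_listing l                     -- result of the storage scan
LEFT JOIN aidb_file_state s USING (file_name)
WHERE s.file_name IS NULL                -- new file
   OR l.last_modified > s.last_modified; -- changed file
```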

Configuration & constraints

Required

  • Postgres prerequisites: Background workers require shared_preload_libraries = 'aidb' and a sufficient max_worker_processes value in postgresql.conf. See Postgres prerequisites on this page.

Optional tuning

background_sync_interval controls how often the worker polls for new or changed data. This is configured per pipeline via aidb.create_pipeline() or aidb.update_pipeline(), not in postgresql.conf.

SELECT aidb.create_pipeline(
    name                     => 'my_pipeline',
    source                   => 'my_table',
    auto_processing          => 'Background',
    background_sync_interval => '30 seconds',
    ...
);

The default interval is suitable for most table sources. For volume sources backed by cloud object stores like AWS S3, each scan incurs a list operation, which may have cost or rate-limit implications. In those cases, consider a longer interval:

SELECT aidb.update_pipeline(
    'my_s3_pipeline',
    background_sync_interval => '1 day'
);

background_sync_interval accepts any Postgres interval value ('5 minutes', '1 hour', '1 day', etc.).

Monitoring and observability

You can track the status and health of background workers using the aidb.pipeline_metrics view (also accessible as aidb.pipem). Key metrics include:

  • Unprocessed rows (table sources): The number of source rows not yet processed.

  • Scans completed (volume sources): The number of full storage scans completed.

  • Count (source records): Total number of records in the source.

  • Count (destination records): Total number of records in the destination.

  • Status: Current pipeline status.
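For example, to inspect metrics for all pipelines (column names may differ between AIDB versions, so a SELECT * is the safest starting point):

```sql
-- Full metrics view; aidb.pipem is a shorter alias for the same view.
SELECT * FROM aidb.pipeline_metrics;
```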