Set up BM25 directly on your source table and reference it in your hybrid queries, it operates independently of AI pipelines.
Install the VectorChord-BM25 package first. See Installing VectorChord-BM25 for package installation instructions.
This makes the pg_tokenizer and vchord_bm25 extensions available in your database, which you then enable using CREATE EXTENSION in step 1 below.
Enable extensions
Enable both extensions before creating BM25 columns or running BM25 queries — pg_tokenizer handles tokenization and vchord_bm25 provides the BM25 vector type and index.
CREATE EXTENSION IF NOT EXISTS pg_tokenizer; CREATE EXTENSION IF NOT EXISTS vchord_bm25;
Create a tokenizer
Create a tokenizer to convert raw text into token sequences for BM25 indexing. Reference it by name when populating and querying the BM25 column, and use the same model consistently — mixing models produces incorrect results.
SELECT tokenizer_catalog.create_tokenizer('bert', $$ model = "bert_base_uncased" $$);
Add a BM25 column and index
Add a
bm25vectorcolumn to store the tokenized representation of each row, populate it, then create a BM25 index to enable efficient top-k keyword retrieval at query time.ALTER TABLE my_source ADD COLUMN bm25_tokens bm25_catalog.bm25vector; UPDATE my_source SET bm25_tokens = tokenizer_catalog.tokenize(description, 'bert'); CREATE INDEX my_source_bm25 ON my_source USING bm25 (bm25_tokens bm25_catalog.bm25_ops);
Keep the index current
Tokenize new rows before they'll appear in BM25 results. Use a trigger or run this update as part of your data ingestion process:
UPDATE my_source SET bm25_tokens = tokenizer_catalog.tokenize(description, 'bert') WHERE bm25_tokens IS NULL;
For full tokenizer configuration options, see VectorChord-BM25 and VectorChord-BM25 in pg_extensions.
Worked example: BM25 with RRF
This example shows how to combine semantic vector search with BM25 keyword search using Reciprocal Rank Fusion (RRF) to merge the two result sets. It uses the products table and food_products pipeline set up in the relational predicate filtering example.
Adding a BM25 index to the source table
Set up BM25 directly on your source table, as it is independent of AI pipelines This section adds the tokenizer and index that the keyword leg of the RRF query relies on.
Enable the
pg_tokenizerandvchord_bm25extensions:CREATE EXTENSION IF NOT EXISTS pg_tokenizer; CREATE EXTENSION IF NOT EXISTS vchord_bm25;
Create a tokenizer using an available model:
Use the BERT model as shown, or substitute any model supported by pg_tokenizer. Ensure you use the same model for both tokenizing the source data and the query.
SELECT tokenizer_catalog.create_tokenizer('bert', $$ model = "bert_base_uncased" $$);
Add a BM25 vector column to the source table and populate it with tokenized values:
ALTER TABLE products ADD COLUMN embedding_bm25 bm25_catalog.bm25vector; UPDATE products SET embedding_bm25 = tokenizer_catalog.tokenize(product_name, 'bert');
Create a BM25 index on the new column:
CREATE INDEX products_bm25_idx ON products USING bm25 (embedding_bm25 bm25_catalog.bm25_ops);
Hybrid query with RRF
This query runs semantic and keyword searches as separate CTEs, then merges the two result sets using Reciprocal Rank Fusion (RRF). Items that rank well in either search surface at the top; items that rank well in both receive the highest scores.
SET hnsw.iterative_scan = relaxed_order; SET search_path TO public, tokenizer_catalog, bm25_catalog; WITH semantic AS ( SELECT source.id, source.product_name, source.category, ROW_NUMBER() OVER (ORDER BY vector.value <=> aidb.kb_query_encode('public.pipeline_food_products', 'fast food')::vector) AS rank FROM public.pipeline_food_products AS vector JOIN products AS source ON source.id::TEXT = vector.source_id ORDER BY vector.value <=> aidb.kb_query_encode('public.pipeline_food_products', 'fast food')::vector LIMIT 20 ), keyword AS ( SELECT id, product_name, ROW_NUMBER() OVER (ORDER BY embedding_bm25 <&> to_bm25query('products_bm25_idx', tokenize('fast food', 'bert'))) AS rank FROM products ORDER BY embedding_bm25 <&> to_bm25query('products_bm25_idx', tokenize('fast food', 'bert')) LIMIT 20 ) SELECT COALESCE(s.id, k.id) AS id, COALESCE(s.product_name, k.product_name) AS product_name, s.category, s.rank AS semantic_rank, k.rank AS keyword_rank, COALESCE(1.0 / (60 + s.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score FROM semantic s FULL OUTER JOIN keyword k ON s.id = k.id ORDER BY rrf_score DESC LIMIT 10;
id | product_name | category | semantic_rank | keyword_rank | rrf_score ----+---------------+------------+---------------+--------------+------------------------ 1 | Hamburger | meat | 6 | 1 | 0.03154495777446597119 4 | Fries | vegetarian | 3 | 4 | 0.03149801587301587302 6 | Pizza | meat | 2 | 6 | 0.03128054740957966765 8 | Veggie Burger | vegetarian | 1 | 8 | 0.03109932497589199614 3 | Fish n Chips | fish | 7 | 3 | 0.03079838900734423123 5 | Burrito | meat | 5 | 5 | 0.03076923076923076924 2 | Cheeseburger | meat | 9 | 2 | 0.03062178588125292193 7 | Sandwich | meat | 4 | 7 | 0.03055037313432835821 9 | Kebab | meat | 8 | 9 | 0.02919863597612958227
The RRF score combines both rankings: items that rank highly in either semantic or keyword search surface at the top, with items that rank well in both receiving the highest scores.
Next steps
- Query patterns — Use your BM25 index in hybrid queries.