Text preprocessing for summarization Innovation Release
- Hybrid Manager dual release strategy
- Documentation for the current Long-term support release
When working with real-world text data like CRM notes, emails, or documents, preprocessing can significantly improve summarization quality.
The following utility functions are written in pure PL/pgSQL. They allow you to clean text and extract high-value content before summarizing, and they require no external dependencies beyond AI Accelerator.
clean_text
The clean_text function removes general noise from text that often dilutes summarization quality. It handles:
- Markdown formatting:
**bold**,# headers, and lists. - Common filler words and phrases: "basically", "just wanted to", "at the end of the day".
- Extra whitespace and empty lines.
CREATE OR REPLACE FUNCTION clean_text(input_text TEXT) RETURNS TEXT LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE result TEXT; BEGIN result := input_text; -- Remove markdown formatting: **bold**, __underline__, # headers, --- separators result := regexp_replace(result, '\*{1,3}', '', 'g'); result := regexp_replace(result, '_{1,3}', '', 'g'); result := regexp_replace(result, '^#{1,6}\s*', '', 'gm'); result := regexp_replace(result, '^-{3,}$', '', 'gm'); result := regexp_replace(result, '^\s*[-*+]\s+', '', 'gm'); -- Remove numbered list prefixes (1. 2. etc) but keep the content result := regexp_replace(result, '^\s*\d+\.\s+', '', 'gm'); -- Remove common filler phrases (use \y for word boundary in PostgreSQL) result := regexp_replace(result, '\y(just wanted to|I just wanted to|wanted to follow up|as discussed|per our conversation|as mentioned|going forward|at the end of the day|in terms of|with respect to|in regards to|please find attached|hope this helps|let me know if you have any questions|looking forward to hearing from you)\y', '', 'gi'); -- Remove common filler words (use \y for word boundary in PostgreSQL) result := regexp_replace(result, '\y(basically|essentially|actually|literally|honestly|frankly|obviously|clearly|simply|really|very|quite|rather|pretty much|kind of|sort of|in order to|due to the fact that|at this point in time|for all intents and purposes)\y', '', 'gi'); -- Lowercase result := LOWER(result); -- Collapse multiple spaces/tabs to single space result := regexp_replace(result, '[ \t]+', ' ', 'g'); -- Collapse multiple newlines to single newline result := regexp_replace(result, '\n\s*\n+', E'\n', 'g'); -- Remove leading/trailing whitespace per line result := regexp_replace(result, '^\s+', '', 'gm'); result := regexp_replace(result, '\s+$', '', 'gm'); -- Remove empty lines result := regexp_replace(result, '^\s*$\n?', '', 'gm'); -- Final trim result := TRIM(result); RETURN result; END; $$; COMMENT ON FUNCTION clean_text(TEXT) IS 'Removes markdown, filler words, and extra whitespace from text. Returns cleaned lowercase text.';
Usage
SELECT clean_text('**Meeting Notes** Just wanted to follow up on our discussion. Basically, the customer is interested in the product.');
clean_text --------------------------------------------------------------------------- meeting notes + follow up on our discussion. , the customer is interested in the product. (1 row)
clean_crm_boilerplate
The clean_crm_boilerplate function removes common CRM-specific boilerplate phrases that add no value to summarization. This function is separated from clean_text so you can customize the regex patterns for your organization's specific patterns.
CREATE OR REPLACE FUNCTION clean_crm_boilerplate(input_text TEXT) RETURNS TEXT LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE result TEXT; BEGIN result := input_text; -- Remove boilerplate CRM patterns (use \y for word boundary in PostgreSQL) result := regexp_replace(result, '\yNo update[s]?\y\.?', '', 'gi'); result := regexp_replace(result, '\yCalendar invite sent[.]?\y', '', 'gi'); result := regexp_replace(result, '\ySent (proposal|case study|documentation|overview|pricing) (documentation |via email|as requested)?\.?\y', '', 'gi'); result := regexp_replace(result, '\yWaiting (on|for) callback\.?\y', '', 'gi'); result := regexp_replace(result, '\yUpdated CRM with latest info\.?\y', '', 'gi'); result := regexp_replace(result, '\yMeeting confirmed for next week\.?\y', '', 'gi'); result := regexp_replace(result, '\yFollowing standard sales process\.?\y', '', 'gi'); result := regexp_replace(result, '\yMeeting went as expected\.?\y', '', 'gi'); -- Remove empty lines created by removals result := regexp_replace(result, '^\s*$\n?', '', 'gm'); RETURN TRIM(result); END; $$; COMMENT ON FUNCTION clean_crm_boilerplate(TEXT) IS 'Removes common CRM boilerplate phrases like "No updates", "Calendar invite sent", etc. Customize this function for your organization.';
Usage
SELECT clean_crm_boilerplate('Customer is interested in the product. No updates. Calendar invite sent. Will follow up next week.');
clean_crm_boilerplate ---------------------------------------- Customer is interested in the product.+ . + Will follow up next week. (1 row)
extract_sentences
The extract_sentences function performs extractive summarization by selecting the most relevant sentences from text based on keyword matching. It scores sentences by:
- Keyword overlap with the provided search terms.
- Sentence length (bonus for substantive sentences over 200 characters).
- Presence of numbers (bonus for data-rich content).
- Causal/analytical language (bonus for words like "because", "reason", "issue", "competitor", "pricing").
CREATE OR REPLACE FUNCTION extract_sentences( input_text TEXT, keywords TEXT, num_sentences INTEGER DEFAULT 10 ) RETURNS TEXT LANGUAGE plpgsql IMMUTABLE STRICT AS $$ DECLARE keyword_arr TEXT[]; sentences TEXT[]; BEGIN IF input_text IS NULL OR input_text = '' THEN RETURN ''; END IF; -- Parse keywords: lowercase, split on spaces, remove short words keyword_arr := ARRAY( SELECT DISTINCT LOWER(TRIM(word)) FROM unnest(string_to_array(LOWER(keywords), ' ')) AS word WHERE LENGTH(TRIM(word)) > 2 ); IF array_length(keyword_arr, 1) IS NULL THEN RETURN LEFT(input_text, 2000); END IF; -- Split text into sentences on . ? ! and newlines -- Keep sentences that are meaningful (> 20 chars) sentences := ARRAY( SELECT TRIM(s) FROM unnest( regexp_split_to_array( regexp_replace(input_text, E'\n+', '. ', 'g'), E'(?<=[.!?])\\s+' ) ) AS s WHERE LENGTH(TRIM(s)) > 20 ); IF array_length(sentences, 1) IS NULL THEN RETURN input_text; END IF; -- Score each sentence and collect top N RETURN ( SELECT STRING_AGG(scored_sentence, E'\n' ORDER BY rank_score DESC) FROM ( SELECT s AS scored_sentence, ( SELECT COUNT(*)::NUMERIC FROM unnest(keyword_arr) AS keyword WHERE LOWER(s) LIKE '%' || keyword || '%' ) -- Bonus for longer, more substantive sentences + CASE WHEN LENGTH(s) > 200 THEN 0.5 ELSE 0 END -- Bonus for sentences with numbers (data-rich) + CASE WHEN s ~ '\d+' THEN 0.3 ELSE 0 END -- Bonus for causal/analytical language + CASE WHEN LOWER(s) ~ '(because|reason|due to|caused|result|impact|issue|problem|concern|risk|challenge|blocker|gap|lack|missing|competitor|pricing|budget|cost|expensive|cheaper|alternative|decided|chose|prefer|switched|rejected|declined)' THEN 1.0 ELSE 0 END AS rank_score FROM unnest(sentences) AS s ORDER BY rank_score DESC LIMIT num_sentences ) ranked ); END; $$; COMMENT ON FUNCTION extract_sentences(TEXT, TEXT, INTEGER) IS 'Extracts the N most relevant sentences from a text block based on keyword matching. Uses keyword overlap scoring with bonuses for substantive, data-rich, and causal/analytical language.';
Usage
SELECT extract_sentences( 'The customer loved the demo. They mentioned pricing concerns. The competitor offers a cheaper alternative. We discussed the technical requirements. The meeting went well overall.', 'pricing competitor cost', 2 );
extract_sentences ---------------------------------------------- They mentioned pricing concerns. + The competitor offers a cheaper alternative. (1 row)
extract_relevant
The extract_relevant function combines text cleaning and sentence extraction into a single call. It first cleans the input text (including CRM boilerplate), then extracts the most relevant sentences.
CREATE OR REPLACE FUNCTION extract_relevant( input_text TEXT, prompt TEXT, num_sentences INTEGER DEFAULT 10 ) RETURNS TEXT LANGUAGE plpgsql IMMUTABLE STRICT AS $$ BEGIN RETURN extract_sentences( clean_text(clean_crm_boilerplate(input_text)), prompt, num_sentences ); END; $$; COMMENT ON FUNCTION extract_relevant(TEXT, TEXT, INTEGER) IS 'Cleans text (removes markdown, filler, CRM boilerplate) then extracts the N most relevant sentences based on keyword matching against the prompt.';
Example: Analyzing lost deals with preprocessing
This example shows how to use extract_relevant with summarize_text_aggregate to analyze why deals were lost. The preprocessing extracts only the most relevant sentences about lost deals before summarizing.
-- Create a model for summarization SELECT aidb.create_model( 'nvidia_llama'::text, 'nim_completions'::character varying, config => '{"model": "nvidia/llama-3.3-nemotron-super-49b-v1.5"}'::JSONB, credentials => jsonb_build_object('api_key', '<your-api-key>')::JSONB, replace_credentials => true ); -- Analyze lost deals by product SELECT p.product_name, COUNT(*) AS lost_deals, aidb.summarize_text_aggregate( extract_relevant( sn.note_text, 'lost deal reason competitor pricing budget cost declined rejected', 2 ), aidb.summarize_text_config( 'nvidia_llama', prompt => 'Analyze these sales notes for lost deals. Identify the top 3 reasons we lost. Be specific and actionable. Keep your response under 250 words. Format as a numbered list.' )::json ORDER BY sn.created_at ) AS loss_insights FROM sales_orders so JOIN products p ON so.product_id = p.product_id JOIN sales_notes sn ON sn.order_id = so.order_id WHERE so.status = 'lost' AND so.order_date >= '2025-11-01' AND so.order_date <= '2026-02-01' GROUP BY p.product_name ORDER BY lost_deals DESC;
The extract_relevant function:
- Cleans each note (removes filler words, markdown, boilerplate).
- Extracts the two most relevant sentences based on keywords like "lost", "competitor", "pricing", "budget".
This preprocessed text is then aggregated and summarized, producing more focused insights than summarizing the raw notes directly.
Customizing the functions
These functions are designed as modular starting points. You can adapt them to your specific domain or organizational needs:
clean_text: Add or modify the filler word patterns to match your communication style.clean_crm_boilerplate: Add your organization's specific CRM templates and boilerplate phrases.extract_sentences: Adjust the bonus keywords in the scoring section to prioritize different types of content, such as technical terms, specific product names, or sentiment indicators.extract_relevant: Change the defaultnum_sentencesbased on your typical text length and summarization needs.