Monitoring and maintaining analytical tables
Postgres Analytics Accelerator (PGAA) provides built-in functions to help you audit storage utilization, table formats, and metadata health of your data lake tables directly from the Postgres prompt.
Analyzing table statistics
In environments with hundreds of standard Postgres tables, identifying which ones are managed by PGAA can be challenging. The pgaa.list_analytics_tables() function provides a global overview of all lakehouse-integrated tables.
SELECT * FROM pgaa.list_analytics_tables();
The function output provides the following insights:
- Identity & format: Displays the schema, table name, and storage format (e.g., Iceberg, Delta, or Parquet).
- Storage location: The exact path to where the data files reside.
- Catalog metadata: If using an external catalog, this function output shows the external catalog service name, namespace, and remote table name.
- Replication status: Displays the current state of data movement (only for PGD replication scenarios).
- Storage metrics:
  - object_storage_snapshot_size_bytes: The size of the latest live snapshot. This represents the actual volume of data scanned during a SELECT *.
  - object_storage_total_size_bytes: The total footprint in the bucket, including metadata, manifests, and historical data files not used by the current version.
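For example, the following sketch ranks lakehouse tables by the size of their live snapshots. The two storage-metric columns are named as described above; the schema_name and table_name columns are illustrative and may differ from the actual function output.
-- Rank lakehouse-integrated tables by live data volume (column names partly assumed).
SELECT schema_name,
       table_name,
       pg_size_pretty(object_storage_snapshot_size_bytes) AS live_data,
       pg_size_pretty(object_storage_total_size_bytes) AS total_footprint
FROM pgaa.list_analytics_tables()
ORDER BY object_storage_snapshot_size_bytes DESC;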
The pgaa.lakehouse_table_stats() function provides detailed information about a specific table's physical characteristics. This is essential for capacity planning and fine-tuning query performance.
SELECT * FROM pgaa.lakehouse_table_stats('my_schema.my_table'::regclass);
This function queries the underlying object storage to verify what is happening "under the hood" of a specific mapping. It helps you:
- Distinguish live vs. dead data: Identify exactly how much storage is being "wasted" by old transaction logs or uncompacted small files.
- Monitor growth: Track how snapshot sizes evolve over time as new data is ingested from external sources.
- Cache awareness: The results of this function are governed by the pgaa.lakehouse_table_stats_cache_ttl_s configuration parameter. If you need the most current statistics, set the cache TTL to 0, as in the sketch after this list.
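For instance, a minimal sketch of forcing fresh statistics, assuming the parameter can be changed at the session level:
-- Disable stats caching for this session so the next call reads object storage directly.
SET pgaa.lakehouse_table_stats_cache_ttl_s = 0;
SELECT * FROM pgaa.lakehouse_table_stats('my_schema.my_table'::regclass);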
Comparing latest_snapshot_size and total_size is essential for balancing performance and cost. The snapshot size represents "live" data—the actual volume scanned during a query—while the total size includes the entire bucket footprint, such as historical versions, transaction logs, and uncompacted rows.
In formats like Iceberg or Delta Lake, the total size exceeds the snapshot size to support time-travel and rollbacks. A significant gap between these values indicates a buildup of historical overhead and serves as a signal to perform a VACUUM or compaction. Monitoring these metrics helps you maintain a lean, high-performing data lake.
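As a rough illustration, the sketch below estimates how much of a table's footprint is historical overhead rather than live data. The column names latest_snapshot_size and total_size follow the prose above and may not match the function's actual output.
-- Share of the bucket footprint that is history, logs, and uncompacted files (column names assumed).
SELECT pg_size_pretty(total_size - latest_snapshot_size) AS historical_overhead,
       round(100.0 * (total_size - latest_snapshot_size) / NULLIF(total_size, 0), 1) AS overhead_pct
FROM pgaa.lakehouse_table_stats('my_schema.my_table'::regclass);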
Optimizing table performance
Analytical performance is heavily impacted by the "small file problem," where frequent data updates create thousands of tiny files that increase metadata overhead.
Performing Delta Lake maintenance
For Delta tables, maintenance is handled asynchronously by a background worker using the pgaa.launch_task() function. This allows you to trigger heavy operations without blocking your active SQL session.
SELECT pgaa.launch_task(
  table_name => 'my_delta_table'::regclass,
  task_type => 'compaction', -- or 'z-order', 'vacuum', 'purge'
  options => '{"target_size": "128MB"}'
);
Supported task types include:
- Compaction: Merges small files into larger, optimized blocks (defined by target_size) to speed up analytical scans.
- Z-Order: Reorganizes data across multiple columns so related values are stored physically close together. This enables data skipping, where the engine ignores entire files that don't match query filters.
- Vacuum: Deletes data files no longer associated with active snapshots and older than a specified retention period. This reclaims object storage space.
- Purge: An administrative tool to explicitly remove data from a specific storage path, useful for manual cleanup of failed exports.
Delta Lake maintenance examples
Use Compaction to merge small files into larger, 256MB blocks to speed up scan performance:
SELECT pgaa.launch_task(
  table_name => 'lakehouse.sales_data'::regclass,
  task_type => 'compaction',
  options => '{"target_size": "256MB"}'
);
Use Z-Order to physically reorganize data so that rows with similar values in customer_id and region are stored together, allowing the engine to skip irrelevant files:
SELECT pgaa.launch_task(
  table_name => 'lakehouse.sales_data'::regclass,
  task_type => 'z-order',
  options => '{"columns": ["customer_id", "region"]}'
);
Use Vacuum to delete files that are no longer part of the active table state and are older than 7 days, reclaiming storage space:
SELECT pgaa.launch_task(
  table_name => 'lakehouse.sales_data'::regclass,
  task_type => 'vacuum',
  options => '{"retention_period": "7 days"}'
);
See the function reference section for pgaa.launch_task() for a complete list of supported options and examples.
Performing Iceberg maintenance
Iceberg maintenance is performed synchronously using the Spark engine to ensure complex metadata updates and manifest rewrites follow Iceberg's consistency guarantees.
Note
To perform these operations, you must set the pgaa.spark_connect_url configuration parameter to point to an available Spark Connect service.
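For example, a minimal sketch of pointing PGAA at a Spark Connect endpoint. The host and port are placeholders, and whether a server-wide setting (rather than a session-level SET) is required depends on your deployment:
-- Placeholder endpoint; replace with your Spark Connect service address.
ALTER SYSTEM SET pgaa.spark_connect_url = 'sc://spark-connect.example.com:15002';
SELECT pg_reload_conf();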
The pgaa.execute_compaction() function consolidates fragmented data files into larger, more efficient Parquet files, improving query performance and optimizing metadata:
SELECT pgaa.execute_compaction('my_iceberg_table'::regclass);
Use the function pgaa.spark_sql() to run Spark SQL queries directly:
SELECT pgaa.spark_sql('query', 'catalog_name');
Iceberg maintenance examples
Use standard table compaction to merge fragmented data files into larger files and update the Iceberg manifest:
SELECT pgaa.execute_compaction('lakehouse.inventory_iceberg'::regclass);
Use pgaa.spark_sql() to reduce metadata overhead via the rewrite_data_files Spark procedure:
SELECT pgaa.spark_sql($$
  CALL preexisting.system.rewrite_data_files(
    table => '"preexisting"."ns-1"."table-1"',
    strategy => 'sort',
    sort_order => 'value DESC',
    options => map('rewrite-all', 'true')
  )
$$);
See the function reference section for pgaa.execute_compaction() for a complete list of supported options and examples.