Accelerating with Spark v1.6
By default, the Postgres Analytics Accelerator (PGAA) utilizes Seafowl, an embedded analytical engine, to accelerate queries. However, for large-scale data processing that exceeds the resources of a single Postgres instance, you can offload execution to a remote Apache Spark cluster via Spark Connect.
Spark Connect is a thin client-server protocol for Apache Spark that decouples the application from the Spark driver. With it, Postgres sends query instructions to a remote, distributed Spark cluster, so you can use the compute capacity of an external cluster without running Spark on the same machine as your database.
Choosing your executor engine
The pgaa.executor_engine configuration parameter determines where the heavy lifting of your analytical queries happens.
| Feature | Seafowl | Spark Connect |
|---|---|---|
| Architecture | Runs as a process alongside Postgres. | Connects to an external Spark cluster. |
| Best for | Small to medium datasets, low latency. | Petabyte-scale data, heavy ETL/Z-Ordering. |
| Scalability | Limited by the host machine's RAM/CPU. | Distributed across multiple worker nodes. |
| Complexity | Zero-config; starts automatically. | Requires a running Spark Connect endpoint. |
| Performance | Faster for single-node data skipping. | Faster for massive joins and aggregations. |
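As a rough sketch of how the engine choice above might be inspected and changed from a SQL session: the parameter name pgaa.executor_engine comes from this page, but the value 'spark_connect' is an assumed placeholder, and it is also an assumption that the parameter can be changed at session level. Check the PGAA parameter reference for the accepted values and the required scope.

```sql
-- Show which engine currently executes analytical queries.
SHOW pgaa.executor_engine;

-- Switch the engine for the current session.
-- ASSUMPTION: 'spark_connect' is a placeholder value, and session-level SET
-- may not be permitted for this parameter in your installation.
SET pgaa.executor_engine = 'spark_connect';

-- Confirm the change took effect.
SHOW pgaa.executor_engine;
```

Switching to Spark Connect also requires a running Spark Connect endpoint, as the table notes, so a change like this only helps once that endpoint is reachable from the Postgres host.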
When to switch to Spark?
While Seafowl is highly optimized for performance on a single node, you should consider switching to Spark Connect if:
- Memory constraints: Your aggregations or joins are hitting the pgaa.autostart_seafowl_max_memory_mb limit.
- Maintenance heavy: You are performing resource-intensive operations like Z-Ordering or large-scale Compaction on Delta or Iceberg tables.
- Centralized compute: You already have a managed Spark environment and want to leverage existing compute credits.
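To judge the memory criterion in the list above, it can help to check the configured Seafowl memory ceiling first. The following is a minimal sketch that uses only the standard Postgres current_setting function and the parameter name given on this page. The column alias is illustrative.

```sql
-- Read the Seafowl memory ceiling (parameter name taken from this page).
-- Passing true as the second argument (missing_ok) returns NULL instead of
-- raising an error if the parameter is not defined in this session.
SELECT current_setting('pgaa.autostart_seafowl_max_memory_mb', true)
       AS seafowl_max_memory_mb;
```

A NULL result suggests the parameter is not defined in the current session, for example because PGAA is not loaded.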