Powering Big Data Processing in Postgres with Apache Spark

March 02, 2017

Postgres plays a central role in today’s integrated data center. A powerful Postgres feature called a Foreign Data Wrapper (FDW) supports data integration by combining data from multiple database solutions as if it were a single Postgres database. FDWs essentially act as pipelines that connect Postgres with external database solutions, including NoSQL solutions such as MongoDB, Cassandra, or Hadoop, and present the data in Postgres, where it can be queried with SQL. The result is a seamless view of the data and far more control for database administrators. (For more insight, read the blog by Stephen Horn, Why It’s Cool to be an OLTP Database Again.)
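To make the idea concrete, the sketch below queries external data through a foreign table with ordinary SQL, just as if it were a local Postgres table. The extension, server, and table names here are hypothetical placeholders; the actual hdfs_fdw commands appear in the walkthrough later in this post.

    -- Hypothetical sketch: once a foreign server and foreign table are defined,
    -- the external data is queried like any local table.
    CREATE EXTENSION some_fdw;                           -- placeholder FDW

    CREATE SERVER remote_svr FOREIGN DATA WRAPPER some_fdw
    OPTIONS (host 'remote-host', port '9999');           -- options vary by FDW

    CREATE FOREIGN TABLE remote_orders (id int, amount numeric) SERVER remote_svr;

    -- The external rows can be filtered, aggregated, and joined with local tables.
    SELECT count(*), sum(amount) FROM remote_orders WHERE amount > 100;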


EnterpriseDB® (EDB™) invests significantly in research and development of FDWs and releases key advances to the open source community to further this capability for all Postgres users. As Postgres advances, EnterpriseDB releases new versions of its FDWs (e.g., for Hadoop, MongoDB, and MySQL) to take advantage of emerging capabilities. EnterpriseDB also offers EDB Postgres Data Adapters for Hadoop, MongoDB, and MySQL as packaged solutions for the EDB Postgres™ Platform. The FDWs developed by EnterpriseDB can be found on EDB’s GitHub page, or via StackBuilder Plus or yum.enterprisedb.com if you are an EnterpriseDB subscription holder.

The most recent advance from EnterpriseDB is a new version of the EDB Postgres Data Adapter for Hadoop with compatibility for the Apache Spark cluster computing framework. The new version gives organizations the ability to combine analytic workloads based on the Hadoop Distributed File System (HDFS) with operational data in Postgres, using an Apache Spark interface. (The new version was announced on February 8, 2017. Read the press release here.)

Below is a demonstration of the Hadoop FDW with Apache Spark. Apache Spark is a general-purpose, distributed computing framework that supports a wide variety of use cases. It provides real-time stream processing as well as batch processing with speed, ease of use, and sophisticated analytics. Spark does not provide a storage layer; instead, it relies on third-party storage providers such as Hadoop, HBase, Cassandra, S3, and others. Spark integrates seamlessly with Hadoop and can process existing data. Spark SQL is 100 percent compatible with HiveQL and can be used as a replacement for HiveServer2, using the Spark Thrift Server. (For background on the HDFS_FDW and how it works with Hive, please refer to the blog post Hadoop to Postgres - Bridging the Gap.)

Advantages of Apache Spark:

  • Apache Spark is fast. For a comparison against Hive, see the blog post Hive vs. SparkSQL.
  • Apache Spark is general purpose, providing in a single framework capabilities that otherwise require separate tools:
    • Batch processing (as with MapReduce)
    • Stream processing (as with Storm)
    • Interactive processing (as with Impala)
    • Graph processing (as with Neo4j)
    • SQL processing with Spark SQL (as with Hive)
  • Apache Spark supports many third-party storage providers and formats, such as:
    • Hadoop
    • HBase
    • Cassandra
    • Amazon S3
    • Parquet

Using the hdfs_fdw with Apache Spark on top of a Hadoop cluster:

The following components must be installed in order to use the hdfs_fdw: EDB Postgres Advanced Server with the hdfs_fdw extension, a Hadoop (HDFS) cluster, and Apache Spark with its Thrift Server, which stands in for HiveServer2.

Steps to use the hdfs_fdw with Apache Spark:

1. Install EDB Postgres Advanced Server 9.5 and the hdfs_fdw using the installer.

2. At the edb-psql prompt, issue the following commands:

    CREATE EXTENSION hdfs_fdw;

    CREATE SERVER hdfs_svr FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host '127.0.0.1', port '10000', client_type 'hiveserver2');

    CREATE USER MAPPING FOR enterprisedb SERVER hdfs_svr;

    CREATE FOREIGN TABLE f_names_tab (a int, name varchar(255)) SERVER hdfs_svr
    OPTIONS (dbname 'testdb', table_name 'my_names_tab');

Please note that we use the same port and client_type when creating the foreign server because the Spark Thrift Server is compatible with the Hive Thrift Server; applications written against HiveServer2 work with Spark without any code changes.
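As an optional sanity check (not part of the original steps), the foreign server definition can be inspected from the Postgres catalogs to confirm the host, port, and client_type options:

    -- Optional check: list the options recorded for the foreign server.
    SELECT srvname, srvoptions
    FROM   pg_foreign_server
    WHERE  srvname = 'hdfs_svr';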

3. Download and install Apache Spark in local mode.

4. Test the Apache Spark installation using spark shell:

    ./spark-shell
    Spark session available as 'spark'.
    Welcome to Spark version 2.1.0
    Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111)
    Type in expressions to have them evaluated.
    Type :help for more information.

    scala> val no = Array(1, 2, 3, 4, 5,6,7,8,9,10)
    no: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    scala> val noData = sc.parallelize(no)
    scala> noData.sum
    res0: Double = 55.0

 

5. In the folder $SPARK_HOME/conf, create a file called spark-defaults.conf containing the following line:

    spark.sql.warehouse.dir hdfs://localhost:9000/user/hive/warehouse

   By default, Apache Spark uses Derby for the metadata and keeps the data itself (called the warehouse in Apache Spark) on the local filesystem. In order to have Apache Spark use Hadoop (HDFS) as the warehouse, we have to add this property.

6. Start the Spark Thrift Server:

    ./start-thriftserver.sh

7. Make sure the Spark Thrift Server is running by checking its log file.

8. Run the following commands in the beeline command line tool:

  ./beeline
  Beeline version 1.0.1 by Apache Hive
  beeline> !connect jdbc:hive2://localhost:10000 abbasbutt '' org.apache.hive.jdbc.HiveDriver
  Connecting to jdbc:hive2://localhost:10000
  Connected to: Spark SQL (version 2.1.0)
  Driver: Hive JDBC (version 1.0.1)
  Transaction isolation: TRANSACTION_REPEATABLE_READ
  0: jdbc:hive2://localhost:10000> create database my_test_db;
  +---------+--+
  | Result  |
  +---------+--+
  +---------+--+
  No rows selected (0.379 seconds)
  0: jdbc:hive2://localhost:10000> use my_test_db;
  +---------+--+
  | Result  |
  +---------+--+
  +---------+--+
  No rows selected (0.03 seconds)
  0: jdbc:hive2://localhost:10000> create table my_names_tab(a int, name string)
                                   row format delimited fields terminated by ' ';
  +---------+--+
  | Result  |
  +---------+--+
  +---------+--+
  No rows selected (0.11 seconds)
  0: jdbc:hive2://localhost:10000> load data local inpath '/path/to/file/names.txt'
                                   into table my_names_tab;
  +---------+--+
  | Result  |
  +---------+--+
  +---------+--+
  No rows selected (0.33 seconds)
  0: jdbc:hive2://localhost:10000> select * from my_names_tab;
  +-------+---------+--+
  |   a   |  name   |
  +-------+---------+--+
  | 1     | abcd    |
  | 2     | pqrs    |
  | 3     | wxyz    |
  | 4     | a_b_c   |
  | 5     | p_q_r   |
  | NULL  | NULL    |
  +-------+---------+--+
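Before moving on, Spark SQL can optionally be exercised with a quick aggregate in the same beeline session. This check is not part of the original steps, and the exact output formatting will differ from the listings above:

      -- Optional check (not in the original steps): Spark SQL runs HiveQL aggregates unchanged.
      select count(*) from my_names_tab;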

9. Stop the Spark Thrift Server:

    ./stop-thriftserver.sh

10. Start the Spark Thrift Server with no authentication:

    ./start-thriftserver.sh --hiveconf hive.server2.authentication=NOSASL
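The NOSASL setting matches the user mapping created in step 2, which supplies no credentials. If your Thrift Server does require authentication, the hdfs_fdw user mapping can carry credentials instead. The option names below are an assumption and may vary by hdfs_fdw version, so check the hdfs_fdw documentation before relying on them:

    -- Assumption: option names may differ across hdfs_fdw versions.
    CREATE USER MAPPING FOR enterprisedb SERVER hdfs_svr
    OPTIONS (username 'spark_user', password 'spark_password');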

11. Run the following command in edb-psql:

    select * from f_names_tab;
     a |  name
    ---+--------
     1 | abcd
     2 | pqrs
     3 | wxyz
     4 | a_b_c
     5 | p_q_r
     0 |
    (6 rows)
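The payoff of the FDW approach is that these HDFS-backed rows can now be combined with operational data stored locally in Postgres in a single SQL statement. The local table below is a hypothetical example and is not part of the demo setup:

    -- Hypothetical local table holding operational data in Postgres.
    CREATE TABLE local_profiles (id int, department varchar(64));
    INSERT INTO local_profiles VALUES (1, 'engineering'), (2, 'sales');

    -- Join operational Postgres data with data stored in HDFS via the foreign table.
    SELECT f.a, f.name, l.department
    FROM f_names_tab f
    JOIN local_profiles l ON l.id = f.a;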

Here are the corresponding files in Hadoop:

$ hadoop fs -ls /user/hive/warehouse/
Found 1 items
drwxrwxr-x - user supergroup 0 2017-01-19 10:47 /user/hive/warehouse/my_test_db.db
$ hadoop fs -ls /user/hive/warehouse/my_test_db.db/
Found 1 items
drwxrwxr-x - user supergroup 0 2017-01-19 10:50 /user/hive/warehouse/my_test_db.db/my_names_tab

Download EDB Postgres Advanced Server 9.5 and the Data Adapter for Hadoop from the Advanced Downloads page on the EnterpriseDB website.

A 17-minute demo video that shows the setup steps above is available here.

For more information on using EDB Postgres Data Adapters, contact us or send an email to info@edbpostgres.com.

Abbas Butt is an Architect at EnterpriseDB. 
