
Connecting Hadoop and EDB Postgres to Shrink Big Data Challenges

05/06/2016

EnterpriseDB® (EDB™) has certified the EDB Postgres™ Advanced Server database for the Hortonworks Data Platform. The certification ultimately will provide organizations with a powerful solution for combining data stored in Hadoop clusters with their transactional information. This seamless view of information will empower organizations to engage in deeper analysis of corporate data to find new efficiencies in operations or customer-driven revenue programs. (Read the press release here.)

EDB developed a Foreign Data Wrapper (FDW) for the Hadoop Distributed File System (HDFS_FDW) that helped achieve this certification for the Hortonworks Data Platform (HDP). FDWs are a Postgres feature for connecting external data stores to Postgres and enable applications to query this data with SQL as if it were native Postgres data. The HDFS_FDW is an EDB Postgres Data Adapter that uses the capabilities enabled by FDW technology to ease integration of EDB Postgres with other solutions deployed throughout the data center.  

EDB has an aggressive development program for EDB Postgres Data Adapters to enable organizations to create a seamless data center fabric and extract greater insight from their data by combining information from multiple sources. The ability to more easily combine data also eliminates complexity and risk and positions Postgres as a single, federated data management solution.

Brief Introduction to HDP and YARN

The HDP, as explained by Hortonworks:

“Hortonworks Data Platform (HDP) is an open source distribution powered by Apache Hadoop. HDP provides you with the actual Apache-released versions of the components with all the necessary bug fixes to make all the components interoperable in your production environments. It is packaged with an easy-to-use installer (HDP Installer) that deploys the complete Apache Hadoop stack to your entire cluster and provides the necessary monitoring capabilities using Ganglia and Nagios.”

In a nutshell, HDP provides an integrated platform and open source distribution for Apache HDFS and the components that interact with HDFS (i.e., Hive, HBase, MapReduce, etc.). If you have tried installing these components separately and getting them to work together, you will appreciate the pain you save by installing HDP, or by getting the HDP Sandbox from Hortonworks, which has all of these components installed and integrated, along with the user-friendly web interface “Ambari” that lets you manage, administer, and monitor your HDP cluster.

The HDP certification means the EDB Postgres Advanced Server database has been validated to work with HDP, the Hortonworks data platform, and, as a result, is YARN-certified as well.

YARN (Yet Another Resource Negotiator) is a generic resource management and distributed application framework that allows multiple data processing applications customized to the task at hand to work in a coordinated fashion.

Hortonworks describes YARN as:

“YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they can take advantage of cost effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.”

YARN-ready certification means that EDB’s HDFS_FDW is integrated with YARN, allowing OLTP and SQL workloads in general to participate in YARN solutions. 

The HDFS_FDW – The Architecture and Roadmap

Certifying a solution with Hortonworks requires the partner – in this case EDB – to complete a set of architectural and design documents explaining how the solution is integrated with HDP. It also involves a live demo of the partner solution working with HDP in an integrated fashion.

In this blog, I will provide some insight into the technologies involved from EDB and Hortonworks for this certification, and I will explain how users can deploy and use this solution to meet their business needs.

FDW technology enables the EDB Postgres Advanced Server or open source PostgreSQL server to access and manipulate objects in a remote data store that ranges from SQL databases through to flat file storage. The HDFS_FDW is an open source extension provided by EDB for its EDB Postgres Advanced Server database and for community PostgreSQL. 

The HDFS_FDW extension can be downloaded from GitHub. Currently, you need to download the code and build the extension and its dependent libraries manually (using the instructions on the hdfs_fdw GitHub page). EDB is planning to release a GUI installer by mid-2016 that will install the HDFS_FDW and its dependent libraries automatically. This will make the process of installing HDFS_FDW much easier than it is today.

The HDFS_FDW is an extension for EDB Postgres and community PostgreSQL; it provides the capability to read data from HDFS using SQL and use it for OLTP and analytical purposes. The HDFS_FDW uses the HiveServer in order to retrieve and process the data from the HDFS cluster (the architecture is explained further in this blog).

The HDFS_FDW combines the power of a relational database and HDP (a big data solution based on HDFS) by providing a SQL interface for big data stored in HDFS. The user can perform anything from simple SQL queries to complex analytical and aggregate queries using the relational SQL constructs in Postgres. The data is read from HDFS using the HiveServer, and SQL query results are returned to the user in a timely fashion after the required query processing is applied.

Currently the HDFS_FDW provides READ capabilities only, but EDB is planning the following additional functionality:

  • GUI installer for HDFS_FDW
  • Support for various authentication methods (i.e. Kerberos, NOSASL, etc.)
  • Support for write-to HDFS
  • Bulk data transfer to and from HDFS and Advanced Server/PostgreSQL
  • Performance improvements by pushing down more SQL query clauses to the remote server (this is explained further in this blog)

In terms of the architecture, the HDFS_FDW uses Hive to connect to and retrieve data from the HDFS cluster. Both hiveserver1 and hiveserver2 are supported. The HDFS_FDW transforms the SQL query into the Hive query language (HiveQL) and sends it to the HiveServer. The HiveServer does the required processing, which in some cases includes running a MapReduce job, and returns the data to the database server. The database server then performs its own processing. In the case of aggregates, which currently don't get pushed down to the HiveServer, the required columns are retrieved from HDFS via the HiveServer and the aggregate operations are performed locally before returning the result set to the user.
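As a concrete illustration of this division of labor (a hedged sketch based on the behavior described above, using the weblogs foreign table defined later in this post), an aggregate query like the following causes the HDFS_FDW to fetch the referenced columns through the HiveServer, while the grouping and count(*) are computed by the Postgres server itself:

-- Aggregates are not yet pushed down: the FDW retrieves the needed columns
-- from HDFS via the HiveServer, and Postgres performs the GROUP BY and count(*).
SELECT month, count(*) AS requests
FROM weblogs
WHERE year = '2012'
GROUP BY month;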

The HDFS_FDW currently pushes down the WHERE predicate and the target column list to the HiveServer; all the other operations, like aggregates, foreign joins, and sorts, are performed by the database server. The FDW machinery in Postgres is being enhanced to support push down for sorts, joins, and aggregates. Once this functionality is part of the FDW infrastructure, all the FDWs, including the HDFS_FDW, will be able to provide this performance-enhancing capability. The push down features provide better performance by moving more processing to the remote server, so less data travels to the database server. This reduces network traffic and significantly reduces query execution time.
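If you want to see what happens for a particular query, EXPLAIN VERBOSE is a simple way to check which parts of the plan run locally. This is a hedged sketch (the exact plan output depends on your Postgres and hdfs_fdw versions), again using the weblogs foreign table defined later in this post:

-- Inspect the plan: the WHERE predicate and the referenced columns are handled
-- by the foreign scan, while the ORDER BY appears as a local sort node.
EXPLAIN VERBOSE
SELECT client_ip, uri
FROM weblogs
WHERE http_status_code = '404'
ORDER BY client_ip;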

The architectural diagram below shows how HDP, the HDFS_FDW, and EDB Postgres fit together in the overall architecture to meet today's business needs. Big Data can be weblogs, telecommunications data, medical transcripts, etc., and is stored in HDFS using the HDP platform. The data is accessed by applications using SQL via the HDFS_FDW extension provided by EDB Postgres.

Next, I will describe how to get this integrated solution up and running using a real world use-case.

I am going to use the real-world use case of weblogs to demonstrate how the HDFS_FDW and HDP can work together to meet Big Data business needs. Weblogs refer to website traffic information; a busy website can generate a massive amount of weblog data, easily measured in gigabytes or even terabytes.

But first, we need to perform the installation of Hortonworks HDP and EDB Postgres with the HDFS_FDW extension. For the HDP installation, the easiest way is to either download and install the Hortonworks Sandbox on a virtual machine (VM) or instantiate an instance of Hortonworks Sandbox in the cloud.

Applications – How to set up an instance of Hortonworks Sandbox on Azure

The instructions for downloading and installing Hortonworks Sandbox on a VM or bringing up HDP instances in the cloud are available at the following links:

http://hortonworks.com/products/hortonworks-sandbox/#install

http://hortonworks.com/hadoop-tutorial/deploying-hortonworks-sandbox-on-microsoft-azure/

For the purposes of this demo, I have deployed an instance of HDP 2.3 on Azure and created another CentOS 6.7 VM on Azure that has EDB Postgres and the HDFS_FDW installation.

After following the instructions from the above links, you should have an instance of HDP running on Azure. The following screenshot shows the instance of HDP 2.3 running on Azure that we will be using for this demo. Notice the DNS name “hdp4.cloudapp.net”. I will be using this DNS name to connect to this HDP instance from Advanced Server using the HDFS_FDW. This HDP instance has a complete, working Apache HDFS cluster and related components ready to use.

Hortonworks Ambari Interface

The Hortonworks Ambari web interface for managing and monitoring your HDP cluster can be accessed at the following URL: http://hdp4.cloudapp.net:8080/#/main/dashboard/metrics. The screenshot below shows the landing page of Ambari for the HDP instance deployed on Azure in the previous step. You can see HDFS and other components up and running.


Uploading Files to HDFS Using Ambari Interface

Now that we have HDP up and running, we need to upload the weblogs file to HDFS. You can get the sample weblogs files from the EDB HDFS_FDW github page using the link shared previously, or here: https://github.com/EnterpriseDB/hdfs_fdw

The weblogs file can be uploaded to HDFS using the Ambari interface. Below are the steps (a command-line alternative is shown after the list):

  • Log in to Ambari interface using the URL:  http://hdp4.cloudapp.net:8080/#/main/dashboard/metrics
  • Click on the dropdown menu option “HDFS Files” as shown in the above screenshot
  • Go to tmp directory
  • Click on the “New Directory” button, add a new directory called “admin”, and click to go inside the admin directory. It will not have any files yet.
  • Next to “New Directory” button, you will see the “Upload” button. Click Upload and upload the weblogs file that you downloaded in the previous step
  • The file is now uploaded to HDFS
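Alternatively, if you have shell access to the HDP node, the standard HDFS command-line client can be used instead of the Ambari upload. This is a hedged sketch; weblogs_parse.txt is a hypothetical placeholder for the name of the sample weblogs file you downloaded from the hdfs_fdw GitHub page:

# Create the target directory in HDFS and copy the downloaded sample file into it
hdfs dfs -mkdir -p /tmp/admin
hdfs dfs -put weblogs_parse.txt /tmp/admin/
hdfs dfs -ls /tmp/admin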

Installing EDB Postgres Advanced Server and HDFS_FDW

Next we need to install EDB Postgres Advanced Server and HDFS_FDW from EDB. You can download the installer from EnterpriseDB’s website here: http://www.enterprisedb.com/downloads/postgres-postgresql-downloads

For this demo, I have deployed a CentOS 6.7 VM on Azure and installed Advanced Server v9.4. (The complete version string can be acquired by running "select version();" after connecting with edb-psql, the Advanced Server command line interface.)

[ahsanhadi@hdfscentos67 ~]$ uname -a
Linux hdfscentos67 2.6.32-573.1.1.el6.x86_64 #1 SMP Sat Jul 25 17:05:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

edb=# select version();
                                                      version
-------------------------------------------------------------------------------------------------------------------
 EnterpriseDB 9.4.1.3 on x86_64-unknown-linux-gnu, compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-55), 64-bit
(1 row)

Next you need to download the HDFS_FDW from the EnterpriseDB GitHub repository (https://github.com/EnterpriseDB/hdfs_fdw) and build it. The detailed instructions for building the HDFS_FDW are given in Appendix A of this blog. The instructions are for CentOS 6.7; note that the instructions for building some of the dependent modules may be different on other operating systems.

The steps of manually building HDFS_FDW (which can be complicated) will be eliminated once EDB releases the GUI installer for HDFS_FDW in mid-2016. 

Creating Hive Table Using the Ambari Interface

At this point, you should have EDB Postgres and the HDFS_FDW up and running after successfully building HDFS_FDW (using the instructions in Appendix A) and starting the EDB Postgres database server. You are ready to start using the HDFS_FDW and performing SQL operations on the weblogs data that we uploaded to HDFS in the previous steps.

Before we go into the usage of HDFS_FDW from edb-psql, we need to create a table in Hive that maps to the data file uploaded to HDFS. This Hive table will be used by the HDFS_FDW internally to query data from HDFS.

The Hive table can be created by clicking Hive from the Ambari dropdown menu shown in the previous screenshot. This opens up the Hive interface with the Hive database explorer on the left and the worksheet pane in the middle. We are going to create the weblogs table in the default database for this demo. Select the default database from the database explorer, copy the following commands into the worksheet, and hit Execute.

CREATE TABLE weblogs (
   client_ip              STRING,
   full_request_date      STRING,
   day                    STRING,
   month                  STRING,
   month_num              INT,
   year                   STRING,
   hour                   STRING,
   minute                 STRING,
   second                 STRING,
   timezone               STRING,
   http_verb              STRING,
   uri                    STRING,
   http_status_code       STRING,
   bytes_returned         STRING,
   referrer               STRING,
   user_agent             STRING)
row format delimited
fields terminated by '\t';

 

You can execute the following query to see if the table was created successfully:

select * from weblogs limit 1;
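If this query returns no rows, the table exists but is still empty; in that case you may need to load the file you uploaded earlier into it. A hedged sketch in HiveQL, assuming the file was uploaded to the /tmp/admin directory created above (weblogs_parse.txt is the same hypothetical placeholder file name used earlier):

-- Move the uploaded file from its HDFS location into the weblogs table
LOAD DATA INPATH '/tmp/admin/weblogs_parse.txt' INTO TABLE weblogs;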

HDFS_FDW (setup and usage)

The command line application for EDB Postgres is edb-psql; you need to log in to edb-psql using the username/password given at the time of your Advanced Server install. The following commands need to be executed at the edb-psql prompt:

  • Load hdfs_fdw extension

CREATE EXTENSION hdfs_fdw;

  • Create the foreign server pointing to the HDP instance created on Azure. Note the host, port (hiveserver2), and client_type options:

CREATE SERVER hdfs_server
FOREIGN DATA WRAPPER hdfs_fdw
OPTIONS (host 'hdp4.cloudapp.net', port '10000', client_type 'hiveserver2');

  • Create a user mapping for the foreign server created in the above step.

CREATE USER MAPPING FOR enterprisedb server hdfs_server;

  • Create the foreign table. This table points to the table created in Hive using the Ambari interface in the step above. The database name, as you can see in the OPTIONS clause, is "default" and the table name is weblogs. The table columns map to the columns of the table created in Hive.

 

CREATE FOREIGN TABLE weblogs
(
 client_ip         TEXT,
 full_request_date TEXT,
 day               TEXT,
 month             TEXT,
 month_num         INTEGER,
 year              TEXT,
 hour              TEXT,
 minute            TEXT,
 second            TEXT,
 timezone          TEXT,
 http_verb         TEXT,
 uri               TEXT,
 http_status_code  TEXT,
 bytes_returned    TEXT,
 referrer          TEXT,
 user_agent        TEXT
)
SERVER hdfs_server
OPTIONS (dbname 'default', table_name 'weblogs');
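Before querying, you can optionally confirm that the objects were created. This is a hedged sketch using standard psql meta-commands, which also work in edb-psql:

-- List the foreign servers, user mappings, and foreign tables just created
\des+
\deu+
\det+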

  • You are now ready to use your HDFS_FDW foreign table and run simple and complex SQL queries. Here are a few examples:

 

edb=# select * from weblogs limit 1;
  client_ip   |     full_request_date      | day | month | month_num | year | hour | minute | second | timezone | http_verb |        uri        | http_status_code | bytes_returned |     referrer      |                                                    user_agent
---------------+----------------------------+-----+-------+-----------+------+------+--------+--------+----------+-----------+-------------------+------------------+----------------+-------------------+------------------------------------------------------------------------------------------------------------------
 612.57.72.653 | 03/Jun/2012:09:12:23 -0500 | 03  | Jun   |         6 | 2012 | 09   | 12     | 23     | -0500    | GET       | /product/product2 | 200              | 0              | /product/product2 | Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
(1 row)

—————-

select client_ip, (year || ' ' || month || ' '|| day || ' ' || hour || ':' || minute || ':' || second) datetime from weblogs order by datetime desc limit 20;

    client_ip    |       datetime
-----------------+----------------------
 55.50.35.364    | 2012 Sep 30 04:45:30
 43.60.688.623   | 2012 Sep 30 04:45:00
 658.338.616.336 | 2012 Sep 30 04:43:32
 658.338.616.336 | 2012 Sep 30 04:43:30
 11.308.46.48    | 2012 Sep 30 04:41:00
 361.631.17.30   | 2012 Sep 30 04:40:31
 361.631.17.30   | 2012 Sep 30 04:40:30
 361.631.17.30   | 2012 Sep 30 04:40:30
 325.87.75.36    | 2012 Sep 30 04:28:31
 325.87.75.36    | 2012 Sep 30 04:28:30
 324.01.320.658  | 2012 Sep 30 04:24:00
 325.87.75.336   | 2012 Sep 30 04:13:30
 325.87.75.336   | 2012 Sep 30 04:13:30
 324.677.46.306  | 2012 Sep 30 04:09:26
 324.677.46.306  | 2012 Sep 30 04:09:01
 324.677.46.306  | 2012 Sep 30 04:09:01
 324.677.46.306  | 2012 Sep 30 04:09:01
 324.677.46.306  | 2012 Sep 30 04:09:00
 324.01.67.83    | 2012 Sep 30 04:08:30
 46.372.322.323  | 2012 Sep 30 04:05:30
(20 rows)


Join your hdfs_fdw foreign table with a local Postgres table

 

edb=# CREATE TABLE premium_ip
edb-# (
edb(#       client_ip TEXT, category TEXT
edb(# );
CREATE TABLE
edb=#
edb=# INSERT INTO premium_ip VALUES ('683.615.622.618','Category A');
INSERT 0 1
edb=# INSERT INTO premium_ip VALUES ('14.323.74.653','Category A');
INSERT 0 1
edb=# INSERT INTO premium_ip VALUES ('13.53.52.13','Category A');
INSERT 0 1
edb=# INSERT INTO premium_ip VALUES ('361.631.17.30','Category A');
INSERT 0 1
edb=# INSERT INTO premium_ip VALUES ('361.631.17.30','Category A');
INSERT 0 1
edb=# INSERT INTO premium_ip VALUES ('325.87.75.336','Category B');
INSERT 0 1
edb=#

 

edb=#  SELECT hd.client_ip IP, pr.category, count(hd.client_ip)
edb-#                            FROM weblogs hd, premium_ip pr
edb-#                            WHERE hd.client_ip = pr.client_ip
edb-#                            AND hd.year = '2011'
edb-#                            GROUP BY hd.client_ip,pr.category;
       ip        |  category  | count
-----------------+------------+-------
 14.323.74.653   | Category A |  3153
 361.631.17.30   | Category A | 25462
 683.615.622.618 | Category A |  3821
 13.53.52.13     | Category A |  1096
 325.87.75.336   | Category B |  1272
(5 rows)
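As a convenience (a hedged sketch that is not part of the original walkthrough; premium_traffic is a hypothetical name), you can wrap such a join in a view so applications can query the combined local and HDFS-backed data like any other table:

-- Hypothetical view combining the local premium_ip table with the weblogs foreign table
CREATE VIEW premium_traffic AS
SELECT hd.client_ip, pr.category, hd.year, hd.month, hd.uri
FROM weblogs hd
JOIN premium_ip pr ON hd.client_ip = pr.client_ip;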

 

Appendix A - How to build HDFS_FDW and its dependent components

STOP. Before doing anything, please replace the paths in the instructions below according to your environment. These instructions are for CentOS 6.7; building the dependent libraries on other operating systems might be different. Also, please check the library versions on your machine and only install a component if your current version is less than the one required for the HDFS_FDW.

1. Make sure you are able to telnet to the machine that is running HDP, on the port running hiveserver2.

telnet hdp4.cloudapp.net 10000

2. Make sure hiveserver2 is running with NOSASL authentication (i.e., the hive.server2.authentication property in hive-site.xml is set to NOSASL). You can check the hive server configuration using the Ambari interface.

3. Get thrift package, the HDFS_FDW uses the thrift library

wget http://apache.osuosl.org/thrift/0.9.2/thrift-0.9.2.tar.gz

4. Install some modules that are required for thrift installation...

4a. sudo -s yum install gcc*

4b. yum remove bison

Please check your bison version first; steps 4b and 4c are only needed if your current version is below 3.0.

4c. Install bison 3.0

wget http://ftp.gnu.org/gnu/bison/bison-3.0.tar.gz
tar -zxvf bison-3.0.tar.gz
cd bison-3.0
./configure
sudo make install

Please make sure the bison installed in the above step comes first in your PATH.

4d. sudo yum install libevent-devel

4e. Install boost 1.54.0

* Download boost_1_54_0.tar.gz
* Untar
* run ./bootstrap.sh
* exec ./b2

4f.  export CPPFLAGS for directory containing boost (the directory was created by the above step)

export CPPFLAGS='-I/home/ahsanhadi/boost'

5. untar the thrift package

5a. Configure thrift using the configure command below. We only need the C++ driver so we aren't interested in building other drivers for thrift.

./configure --prefix=/usr/local/thrift --without-csharp --without-java --without-erlang --without-python --without-perl --without-php --without-ruby --without-haskell --without-go --with-lua=no --with-boost=/home/ahsanhadi/boost_1_54_0

Please change the prefix and boost path according to your machine.

When you run configure, make sure there are no errors in trying to find a required library. If you see an error, it needs to be resolved before proceeding.

At the end of configure, the C++ driver should be set to yes and the rest should be no.
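The steps above configure thrift but do not build it; assuming a standard autotools build, thrift itself needs to be built and installed into the --prefix given to configure before moving on to fb303 (a hedged sketch):

# Build thrift and install it under /usr/local/thrift (the --prefix used above)
make
sudo make install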

6. build fb303

* cd thrift-0.9.2/contrib/fb303
* ./configure --prefix=/usr/local/thrift --with-thriftpath=/usr/local/thrift
* make && make install

7. Need to build libhive

* Edit the makefile and change the following paths according to your environment:
THRIFT_HOME=/usr/local/thrift/include
INSTALL_DIR=/opt/PostgresPlus/9.4AS/lib
* make && make install

8. Ready to build hdfs_fdw

git clone https://github.com/EnterpriseDB/hdfs_fdw.git
cd hdfs_fdw
export PATH=/opt/PostgresPlus/9.4AS/bin/:$PATH
export LD_LIBRARY_PATH=/usr/local/thrift/lib/:$LD_LIBRARY_PATH
make USE_PGXS=1
sudo make USE_PGXS=1 install

9. The HDFS_FDW is built and deployed. Next you need to follow the instructions on our github page to create the extension, foreign server and foreign tables.

Ahsan Hadi is Senior Director, Product Development, at EnterpriseDB.
