You can use the Hadoop Foreign Data Wrapper either through Apache Hive or Apache Spark. Both Hive and Spark store metadata in the configured metastore, where databases and tables are created using HiveQL.
Using HDFS FDW with Apache Hive on Top of Hadoop
Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
There are two versions of the Hive server, HiveServer1 and HiveServer2, both of which can be downloaded from the Apache Hive website.
The Hadoop Foreign Data Wrapper supports only HiveServer2.
To use HDFS FDW with Apache Hive on top of Hadoop:
Step 2: Upload the weblog_parse.txt file using these commands:
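A minimal sketch of the upload; the `/weblogs` HDFS directory and the assumption that `weblog_parse.txt` is in the current directory are placeholders for your environment:

```shell
# Create a target directory in HDFS and copy the sample log file into it.
# The /weblogs path is an assumed example location.
hdfs dfs -mkdir -p /weblogs
hdfs dfs -copyFromLocal weblog_parse.txt /weblogs/

# Verify the upload.
hdfs dfs -ls /weblogs
```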
Step 3: Start HiveServer2, if it isn't already running, using the following command:
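One way to start it, assuming `$HIVE_HOME` points at your Hive installation:

```shell
# Start HiveServer2 in the foreground; in production, run it in the
# background or under a service manager instead.
$HIVE_HOME/bin/hive --service hiveserver2
```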
Step 4: Connect to HiveServer2 using the hive beeline client. For example:
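A sketch of the connection, assuming HiveServer2 is listening on its default port (10000) on localhost; replace the host and user name with values for your cluster:

```shell
# Connect with the beeline client over JDBC.
beeline -u jdbc:hive2://localhost:10000 -n user_name
```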
Step 5: Create a table in Hive. The example creates a table named weblogs:
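A sketch of the table creation; the column list and delimiter below are illustrative assumptions, since the original listing isn't reproduced here. Adjust them to match the format of weblog_parse.txt:

```shell
# Create the weblogs table via beeline. The schema is an assumed example.
beeline -u jdbc:hive2://localhost:10000 -n user_name -e "
CREATE TABLE weblogs (
    client_ip         STRING,
    full_request_date STRING,
    day               STRING,
    month             STRING,
    year              STRING,
    method            STRING,
    uri               STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
"
```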
Step 6: Load data into the table.
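For example, loading the file from an assumed HDFS location into the table:

```shell
# LOAD DATA INPATH moves the file from HDFS into the Hive table's storage.
# The /weblogs path is an assumed example location.
beeline -u jdbc:hive2://localhost:10000 -n user_name -e "
LOAD DATA INPATH '/weblogs/weblog_parse.txt' INTO TABLE weblogs;
"
```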
Step 7: Access your data from Postgres; you can now use the weblogs table. Once you are connected using psql, follow these steps:
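A hedged sketch of the psql side. The database name, server name, credentials, host, and column list are example assumptions; the `host`, `port`, and `client_type` options follow the ones hdfs_fdw documents:

```shell
# Run the setup statements in psql; 'postgres' is a placeholder database name.
psql -d postgres <<'SQL'
-- Load the extension and point a foreign server at HiveServer2.
CREATE EXTENSION IF NOT EXISTS hdfs_fdw;

CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'localhost', port '10000', client_type 'hiveserver2');

CREATE USER MAPPING FOR public SERVER hdfs_server
    OPTIONS (username 'user_name', password 'password');

-- A foreign table mirroring the Hive weblogs table (assumed schema).
CREATE FOREIGN TABLE weblogs (
    client_ip         TEXT,
    full_request_date TEXT,
    method            TEXT,
    uri               TEXT
) SERVER hdfs_server OPTIONS (dbname 'default', table_name 'weblogs');

SELECT client_ip, uri FROM weblogs LIMIT 5;
SQL
```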
Using HDFS FDW with Apache Spark on Top of Hadoop
Apache Spark is a general-purpose distributed computing framework that supports a wide variety of use cases. It provides real-time streaming as well as batch processing with speed, ease of use, and sophisticated analytics. Spark doesn't provide a storage layer; it relies on third-party storage providers such as Hadoop, HBase, Cassandra, and S3. Spark integrates seamlessly with Hadoop and can process existing data. Spark SQL is 100% compatible with HiveQL, and the Spark Thrift Server can be used as a replacement for HiveServer2.
To use HDFS FDW with Apache Spark on top of Hadoop:
Step 1: Download and install Apache Spark in local mode.
Step 2: In the folder $SPARK_HOME/conf create a file spark-defaults.conf containing the following line:
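The property could look like the following; the HDFS URL and port are assumptions for a local single-node setup:

```
spark.sql.warehouse.dir hdfs://localhost:9000/user/hive/warehouse
```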
By default, Spark uses Derby for both the metadata and the data itself (called the warehouse in Spark). To have Spark use Hadoop as its warehouse, add this property.
Step 3: Start the Spark Thrift Server.
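For example, using the start script bundled with Spark, assuming `$SPARK_HOME` points at your Spark installation:

```shell
# Start the Thrift JDBC/ODBC server that ships with Spark.
$SPARK_HOME/sbin/start-thriftserver.sh
```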
Step 4: Make sure the Spark Thrift server is running and writing to a log file.
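One way to check, assuming the default log location; the exact file name includes the user and host, so the wildcard below is an assumption:

```shell
# The start script writes its output under $SPARK_HOME/logs.
tail -f $SPARK_HOME/logs/spark-*HiveThriftServer2*.out
```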
Step 5: Create a local file (names.txt) that contains the following entries:
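The original entries aren't reproduced here, so the `id,name` rows below are illustrative placeholders:

```shell
# Create a small comma-delimited sample file; values are placeholders.
cat > names.txt <<'EOF'
1,abcd
2,pqrs
3,wxyz
4,a_b_c
5,p_q_r
EOF
```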
Step 6: Connect to the Spark Thrift Server using the Spark beeline client. For example:
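A sketch of the connection; Spark ships its own beeline, and the Thrift server listens on port 10000 by default. The host and user name are placeholders:

```shell
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default -n user_name
```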
Step 7: Prepare the sample data on Spark. Run the following commands in the beeline command line tool:
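A sketch of the data preparation; the database name, table name, and schema are example assumptions chosen to match a two-column `id,name` file:

```shell
# Create a comma-delimited table and load the local sample file into it.
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000/default -n user_name -e "
CREATE DATABASE IF NOT EXISTS fdw_db;
CREATE TABLE fdw_db.names (id INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'names.txt' INTO TABLE fdw_db.names;
SELECT * FROM fdw_db.names;
"
```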
The following commands list the corresponding files in Hadoop:
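For example, assuming the warehouse directory is `/user/hive/warehouse` in HDFS; table data lands in a subdirectory per database and table:

```shell
hdfs dfs -ls /user/hive/warehouse
```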
Step 8: Access your data from Postgres using psql:
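A hedged sketch of the psql side for Spark. The database name, server name, credentials, and host are example assumptions; `client_type 'spark'` selects the Spark-specific behavior described below:

```shell
# Run the setup statements in psql; 'postgres' is a placeholder database name.
psql -d postgres <<'SQL'
-- Point a foreign server at the Spark Thrift Server.
CREATE EXTENSION IF NOT EXISTS hdfs_fdw;

CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'localhost', port '10000', client_type 'spark');

CREATE USER MAPPING FOR public SERVER hdfs_server
    OPTIONS (username 'user_name', password 'password');

-- A foreign table over the sample data (assumed names and schema).
CREATE FOREIGN TABLE names (id INT, name TEXT)
    SERVER hdfs_server OPTIONS (dbname 'fdw_db', table_name 'names');

SELECT * FROM names;
SQL
```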
The same port is used when creating the foreign server because the Spark Thrift Server is compatible with the Hive Thrift Server. Applications that work with HiveServer2 also work with Spark, except for the behavior of the ANALYZE command and the connection string in the NOSASL case. If you replace Hive with Spark, we recommend using ALTER SERVER to change the client_type option.