You can use the Hadoop Foreign Data Wrapper through either Apache Hive or Apache Spark. Both Hive and Spark store metadata in the configured metastore, where databases and tables are created using HiveQL.
Using HDFS FDW with Apache Hive on top of Hadoop
Apache Hive data warehouse software helps with querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it's inconvenient or inefficient to express this logic in HiveQL.
Two versions of Hive, HiveServer1 and HiveServer2, are available to download from the Apache Hive website.
Note
The Hadoop Foreign Data Wrapper supports only HiveServer2.
To use HDFS FDW with Apache Hive on top of Hadoop:
Upload the weblog_parse.txt file using these commands:
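The paths here are only an example; this assumes the file is in your current directory and that HDFS is reachable from your shell:

```bash
# create a target directory in HDFS and copy the parsed web log into it
hdfs dfs -mkdir -p /weblogs/parse
hdfs dfs -put weblog_parse.txt /weblogs/parse/
```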
Start HiveServer2, if it's not already running, using the following command:
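Either of the following forms works, assuming HIVE_HOME points to your Hive installation:

```bash
$HIVE_HOME/bin/hiveserver2
```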
or
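```bash
$HIVE_HOME/bin/hive --service hiveserver2
```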
Connect to HiveServer2 using the hive beeline client. For example:
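A sketch of the connection, assuming HiveServer2 listens on the default port 10000 on localhost; substitute your own credentials:

```bash
$HIVE_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000 <username> <password> org.apache.hive.jdbc.HiveDriver
```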
Create a table in Hive. The example creates a table named weblogs:
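The column list below is illustrative and assumes a tab-delimited parsed web log; adjust it to match your file. The time-part column names are backquoted because recent Hive releases treat them as reserved words:

```sql
CREATE TABLE weblogs (
    client_ip           STRING,
    full_request_date   STRING,
    `day`               STRING,
    `month`             STRING,
    month_num           INT,
    `year`              STRING,
    `hour`              STRING,
    `minute`            STRING,
    `second`            STRING,
    timezone            STRING,
    http_verb           STRING,
    uri                 STRING,
    http_status_code    STRING,
    bytes_returned      STRING,
    referrer            STRING,
    user_agent          STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
```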
Load data into the table.
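One way, assuming the file was uploaded to the HDFS path used above, is HiveQL's LOAD DATA statement:

```sql
LOAD DATA INPATH '/weblogs/parse/weblog_parse.txt' INTO TABLE weblogs;
```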
Access your data from Postgres. You can now query the weblogs table. Once you're connected using psql, follow these steps:
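A sketch of the setup; the server name, credentials, and option values are placeholders, and the option names follow the hdfs_fdw documentation, so check them against your installed version:

```sql
-- depending on your installation, you may first need to point
-- hdfs_fdw.classpath and hdfs_fdw.jvmpath at the Hive JDBC jars and the JVM

-- load the extension (first time only)
CREATE EXTENSION hdfs_fdw;

-- create a server object pointing at HiveServer2
CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'localhost', port '10000', client_type 'hiveserver2');

-- map the Postgres user to a Hive login
CREATE USER MAPPING FOR CURRENT_USER
    SERVER hdfs_server
    OPTIONS (username 'hive_user', password 'hive_password');

-- expose the Hive table; the column list mirrors the Hive definition
CREATE FOREIGN TABLE weblogs (
    client_ip           TEXT,
    full_request_date   TEXT,
    day                 TEXT,
    month               TEXT,
    month_num           INTEGER,
    year                TEXT,
    hour                TEXT,
    minute              TEXT,
    second              TEXT,
    timezone            TEXT,
    http_verb           TEXT,
    uri                 TEXT,
    http_status_code    TEXT,
    bytes_returned      TEXT,
    referrer            TEXT,
    user_agent          TEXT)
SERVER hdfs_server
OPTIONS (dbname 'default', table_name 'weblogs');

-- query the data
SELECT client_ip, count(*)
FROM weblogs
GROUP BY client_ip
ORDER BY 2 DESC;
```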
Using HDFS FDW with Apache Spark on top of Hadoop
Apache Spark is a general-purpose distributed computing framework that supports a wide variety of use cases. It provides real-time stream processing as well as batch processing, with speed, ease of use, and sophisticated analytics. Spark doesn't provide a storage layer; it relies on third-party storage providers like Hadoop, HBase, Cassandra, and S3. Spark integrates seamlessly with Hadoop and can process existing data. Spark SQL is compatible with HiveQL, so you can use Spark Thrift Server in place of HiveServer2.
To use HDFS FDW with Apache Spark on top of Hadoop:
Download and install Apache Spark in local mode.
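A minimal sketch, assuming a prebuilt release; the version and Hadoop profile shown are only examples, so substitute whatever you actually use:

```bash
# download and unpack a prebuilt Spark distribution (version is an example)
wget https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
tar -xzf spark-3.4.1-bin-hadoop3.tgz
export SPARK_HOME=$PWD/spark-3.4.1-bin-hadoop3
```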
In the folder $SPARK_HOME/conf, create a file spark-defaults.conf containing the following line:
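Assuming your HDFS namenode listens on localhost:9000 and you want the standard Hive warehouse path:

```
spark.sql.warehouse.dir hdfs://localhost:9000/user/hive/warehouse
```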
By default, Spark uses Derby for both the metadata and the data itself (called the warehouse in Spark). Setting this property makes Spark use Hadoop as its warehouse instead.
Start Spark Thrift Server.
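Assuming SPARK_HOME is set as above:

```bash
$SPARK_HOME/sbin/start-thriftserver.sh
```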
Make sure Spark Thrift Server is running and writing to a log file.
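The start script prints the log file it writes to; one way to watch it (the exact file name includes your user and host name) is:

```bash
tail -f $SPARK_HOME/logs/*HiveThriftServer2*.out
```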
Create a local file (names.txt) that contains the following entries:
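The exact values don't matter, as long as each line is a comma-separated id and name; for example, saved as /tmp/names.txt:

```
1,abcd
2,pqrs
3,wxyz
4,a_b_c
5,p_q_r
```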
Connect to Spark Thrift Server using the Spark beeline client. For example:
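For example, assuming the Thrift server listens on the default port 10000 with SASL disabled:

```bash
$SPARK_HOME/bin/beeline
beeline> !connect jdbc:hive2://localhost:10000/default;auth=noSasl <username> <password> org.apache.hive.jdbc.HiveDriver
```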
Prepare the sample data on Spark. Run the following commands in the beeline command line tool:
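A sketch of the session; the database and table names (my_test_db, my_names_tab) are only examples, and the LOAD DATA path assumes the names.txt file created above:

```sql
CREATE DATABASE my_test_db;
USE my_test_db;
CREATE TABLE my_names_tab (a INT, name STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- adjust the path to wherever you saved names.txt
LOAD DATA LOCAL INPATH '/tmp/names.txt' INTO TABLE my_names_tab;
SELECT * FROM my_names_tab;
```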
The following commands list the corresponding files in Hadoop:
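For example, if Spark's warehouse directory is the one configured earlier:

```bash
hdfs dfs -ls /user/hive/warehouse/
hdfs dfs -ls /user/hive/warehouse/my_test_db.db/
```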
Access your data from Postgres using psql:
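A sketch mirroring the Hive example. The main differences are the client_type and auth_type server options (assuming your hdfs_fdw build accepts 'spark' and 'NOSASL' there) and the table definition; skip CREATE EXTENSION if you already ran it:

```sql
CREATE EXTENSION hdfs_fdw;

-- server object pointing at Spark Thrift Server
CREATE SERVER hdfs_server
    FOREIGN DATA WRAPPER hdfs_fdw
    OPTIONS (host 'localhost', port '10000', client_type 'spark', auth_type 'NOSASL');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER hdfs_server
    OPTIONS (username 'spark_user', password 'spark_password');

-- foreign table over the Spark table created in beeline
CREATE FOREIGN TABLE f_names_tab (a INTEGER, name VARCHAR(255))
    SERVER hdfs_server
    OPTIONS (dbname 'my_test_db', table_name 'my_names_tab');

SELECT * FROM f_names_tab;
```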
Note
This example uses the same port while creating the foreign server because Spark Thrift Server is compatible with Hive Thrift Server. Applications that use HiveServer2 work with Spark, except for the behavior of the ANALYZE command and the connection string in the case of NOSASL. We recommend using ALTER SERVER to change the client_type option if you replace Hive with Spark.