Configuring the Hadoop Foreign Data Wrapper v2.0.7
Before creating the extension and the database objects that use the extension, you must modify the Postgres host, providing the location of the supporting libraries.
After installing Postgres, modify the
postgresql.conf located in:
Modify the configuration file with your editor of choice, adding the
hdfs_fdw.jvmpath parameter to the end of the configuration file, and setting the value to specify the location of the Java virtual machine (
libjvm.so). Set the value of
hdfs_fdw.classpath to indicate the location of the java class files used by the adapter; use a colon (:) as a delimiter between each path. For example:
The jar files (hive-jdbc-1.0.1-standalone.jar and hadoop-common-2.6.4.jar) mentioned in the above example should be copied from respective Hive and Hadoop sources or website to PostgreSQL instance where Hadoop Foreign Data Wrapper is installed.
If you are using EDB Advanced Server and have a
DATE column in your database, you must set
edb_redwood_date = OFF in the
After setting the parameter values, restart the Postgres server. For detailed information about controlling the service on an Advanced Server host, see the EDB Postgres Advanced Server Installation Guide, available at:
Before using the Hadoop Foreign Data Wrapper, you must:
- Use the CREATE EXTENSION command to create the extension on the Postgres host.
- Use the CREATE SERVER command to define a connection to the Hadoop file system.
- Use the CREATE USER MAPPING command to define a mapping that associates a Postgres role with the server.
- Use the CREATE FOREIGN TABLE command to define a table in the Advanced Server database that corresponds to a database that resides on the Hadoop cluster.
CREATE EXTENSION command to create the
hdfs_fdw extension. To invoke the command, use your client of choice (for example, psql) to connect to the Postgres database from which you will be querying the Hive or Spark server, and invoke the command:
IF NOT EXISTS
IF NOT EXISTS clause to instruct the server to issue a notice instead of throwing an error if an extension with the same name already exists.
Optionally specify the name of the schema in which to install the extension's objects.
The following command installs the
hdfs_fdw hadoop foreign data wrapper:
CREATE EXTENSION hdfs_fdw;
For more information about using the foreign data wrapper
CREATE EXTENSION command, see:
CREATE SERVER command to define a connection to a foreign server. The syntax is:
The role that defines the server is the owner of the server; use the
ALTER SERVER command to reassign ownership of a foreign server. To create a foreign server, you must have
USAGE privilege on the foreign-data wrapper specified in the
CREATE SERVER command.
server_name to specify a name for the foreign server. The server name must be unique within the database.
FOREIGN_DATA_WRAPPER clause to specify that the server should use the
hdfs_fdw foreign data wrapper when connecting to the cluster.
OPTIONS clause of the
CREATE SERVER command to specify connection information for the foreign server. You can include:
|host||The address or hostname of the Hadoop cluster. The default value is `localhost`.|
|port||The port number of the Hive Thrift Server or Spark Thrift Server. The default is `10000`.|
|client_type||Specify hiveserver2 or spark as the client type. To use the ANALYZE statement on Spark, you must specify a value of spark; if you do not specify a value for client_type, the default value is hiveserver2.|
|auth_type||The authentication type of the client; specify LDAP or NOSASL. If you do not specify an auth_type, the data wrapper will decide the auth_type value on the basis of the user mapping:- If the user mapping includes a user name and password, the data wrapper will use LDAP authentication. - If the user mapping does not include a user name and password, the data wrapper will use NOSASL authentication.|
|connect_timeout||The length of time before a connection attempt times out. The default value is `300` seconds.|
|fetch_size||A user-specified value that is provided as a parameter to the JDBC API setFetchSize. The default value is `10,000`.|
|log_remote_sql||If true, logging will include SQL commands executed on the remote hive server and the number of times that a scan is repeated. The default is `false`.|
|query_timeout||Use query_timeout to provide the number of seconds after which a request will timeout if it is not satisfied by the Hive server. Query timeout is not supported by the Hive JDBC driver.|
|use_remote_estimate||Include the use_remote_estimate to instruct the server to use EXPLAIN commands on the remote server when estimating processing costs. By default, use_remote_estimate is false, and remote tables are assumed to have `1000` rows.|
The following command creates a foreign server named
hdfs_server that uses the
hdfs_fdw foreign data wrapper to connect to a host with an IP address of
The foreign server uses the default port (
10000) for the connection to the client on the Hadoop cluster; the connection uses an LDAP server.
For more information about using the
CREATE SERVER command, see:
CREATE USER MAPPING command to define a mapping that associates a Postgres role with a foreign server:
You must be the owner of the foreign server to create a user mapping for that server.
Please note: the Hadoop Foreign Data Wrapper supports NOSASL and LDAP authentication. If you are creating a user mapping for a server that uses LDAP authentication, use the
OPTIONS clause to provide the connection credentials (the username and password) for an existing LDAP user. If the server uses NOSASL authentication, omit the OPTIONS clause when creating the user mapping.
role_name to specify the role that will be associated with the foreign server.
server_name to specify the name of the server that defines a connection to the Hadoop cluster.
OPTIONS clause to specify connection information for the foreign server. If you are using LDAP authentication, provide a:
username: the name of the user on the LDAP server.
password: the password associated with the username.
If you do not provide a user name and password, the data wrapper will use NOSASL authentication.
The following command creates a user mapping for a role named
enterprisedb; the mapping is associated with a server named
CREATE USER MAPPING FOR enterprisedb SERVER hdfs_server;
If the database host uses LDAP authentication, provide connection credentials when creating the user mapping:
The command creates a user mapping for a role named
enterprisedb that is associated with a server named
hdfs_server. When connecting to the LDAP server, the Hive or Spark server will authenticate as
alice, and provide a password of
For detailed information about the
CREATE USER MAPPING command, see:
A foreign table is a pointer to a table that resides on the Hadoop host. Before creating a foreign table definition on the Postgres server, connect to the Hive or Spark server and create a table; the columns in the table will map to to columns in a table on the Postgres server. Then, use the
CREATE FOREIGN TABLE command to define a table on the Postgres server with columns that correspond to the table that resides on the Hadoop host. The syntax is:
Specifies the name of the foreign table; include a schema name to specify the schema in which the foreign table should reside.
IF NOT EXISTS
IF NOT EXISTS clause to instruct the server to not throw an error if a table with the same name already exists; if a table with the same name exists, the server will issue a notice.
Specifies the name of a column in the new table; each column should correspond to a column described on the Hive or Spark server.
Specifies the data type of the column; when possible, specify the same data type for each column on the Postgres server and the Hive or Spark server. If a data type with the same name is not available, the Postgres server will attempt to cast the data type to a type compatible with the Hive or Spark server. If the server cannot identify a compatible data type, it will return an error.
COLLATE clause to assign a collation to the column; if not specified, the column data type's default collation is used.
INHERITS (parent_table [, ... ])
INHERITS clause to specify a list of tables from which the new foreign table automatically inherits all columns. Parent tables can be plain tables or foreign tables.
Specify an optional name for a column or table constraint; if not specified, the server will generate a constraint name.
NOT NULL keywords to indicate that the column is not allowed to contain null values.
NULL keywords to indicate that the column is allowed to contain null values. This is the default.
CHECK (expr) [NO INHERIT]
CHECK clause to specify an expression that produces a Boolean result that each row in the table must satisfy. A check constraint specified as a column constraint should reference that column's value only, while an expression appearing in a table constraint can reference multiple columns.
CHECK expression cannot contain subqueries or refer to variables other than columns of the current row.
NO INHERIT keywords to specify that a constraint should not propagate to child tables.
DEFAULT clause to specify a default data value for the column whose column definition it appears within. The data type of the default expression must match the data type of the column.
SERVER server_name [OPTIONS (option 'value' [, ... ] ) ]
To create a foreign table that will allow you to query a table that resides on a Hadoop file system, include the
SERVER clause and specify the
server_name of the foreign server that uses the Hadoop data adapter.
OPTIONS clause to specify the following
options and their corresponding values:
|dbname||The name of the database on the Hive server; the database name is required.|
|table_name||The name of the table on the Hive server; the default is the name of the foreign table.|
To use data that is stored on a distributed file system, you must create a table on the Postgres host that maps the columns of a Hadoop table to the columns of a Postgres table. For example, for a Hadoop table with the following definition:
You should execute a command on the Postgres server that creates a comparable table on the Postgres server:
SERVER clause to specify the name of the database stored on the Hadoop file system (
webdata) and the name of the table (
weblogs) that corresponds to the table on the Postgres server.
For more information about using the
CREATE FOREIGN TABLE command, see:
When using the foreign data wrapper, you must create a table on the Postgres server that mirrors the table that resides on the Hive server. The Hadoop data wrapper will automatically convert the following Hive data types to the target Postgres type: