In this article we will see one of the most interesting features added in Greenplum 4.2:
XML datatype support.
Introduction
This article has an educational goal. Most likely you will not want to use the code presented here in a production system. However, it will give you hints and interesting ideas on how to add XML capabilities (both for loading and – most importantly – for searching) to your Greenplum database.
My intention is to show the potential of XML data type and XML data processing in a concurrent environment.
Goals
We will load XML data from a file into a Greenplum table in a XML field.
We will then perform some XPath queries on that data.
The xpath
function operates on XML data, we will write a demo wrapper function which works on a XML file.
For the sake of this example, the XML file must reside on the server: we will use indeed the COPY
command to load data from it.
Can I see some code?
Download and extract a dataset in XML format directly from the blog:
$ wget https://www.2ndquadrant.com/wp-content/uploads/2012/07/100songs.tar.gz
$ tar xvf 100songs.tar.gz
Copy the following code in a file named xml.sql:
BEGIN;
CREATE TABLE t (id INTEGER, data XML)
DISTRIBUTED BY (id);
CREATE OR REPLACE FUNCTION _xpath(
_xml_file VARCHAR,
_xpath_query VARCHAR
)
RETURNS TEXT AS
$BODY$
DECLARE
_is_well_formed BOOLEAN;
_result TEXT;
BEGIN
CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
EXECUTE $$ COPY tmp FROM $$ || quote_literal(_xml_file) ;
INSERT INTO t (data) SELECT xml(d) FROM tmp;
DROP TABLE tmp;
SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
IF _is_well_formed = False THEN
RETURN 'XML document is not well formed';
END IF;
SELECT INTO _result xpath(_xpath_query, data ) FROM t;
RETURN _result;
END;
$BODY$ LANGUAGE plpgsql;
SELECT _xpath('/greenplum/tests/data/100songs.xml',
'/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;
ROLLBACK;
Execute the SQL script with:
$ psql -qtxf xml.sql
You should see:
song | {"Stop Bajon (Primavera)"}
Code in detail
First of all, we have to know the XML document structure:
<response>
<row>
<row _id=... _uuid=... _position=... _address=...>
<theme>...</theme>
<title>...</title>
<artist>...</artist>
<year>...</year>
<spotify_url url=.../>
</row>
...
</row>
</response>
The first part of the code creates a table to store XML data:
CREATE TABLE t (id INTEGER, data XML)
DISTRIBUTED BY (id);
In the function body, we create a temporary table to store the file content as a single TEXT:
CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
EXECUTE $$ COPY tmp FROM $$ || quote_literal(_xml_file) ;
Then, we convert the TEXT data into XML data using the xml
function:
INSERT INTO t (data) SELECT xml(d) FROM tmp;
DROP TABLE tmp;
And we check if the XML is well formed:
SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
IF _is_well_formed = False THEN
RETURN 'XML document is not well formed';
END IF;
Lastly, we perform the XPath query and return the result:
SELECT INTO _result xpath(_xpath_query, data ) FROM t;
RETURN _result;
The last part of the script shows how to call the function to retrieve data:
-- Get the Tullio De Piscopo's song from the list
SELECT _xpath('/greenplum/tests/data/100songs.xml', '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;
Conclusions
XPath is a technology of mixed feeling. Many hate it, some love it. However, when working with XML structured data type, it is a must.
The good thing with Greenplum 4.2 is that it is now possible to load millions of XML documents in a Greenplum database, have the data distributed in multiples segments and take advantage of parallelism when retrieving data using XPath expressions.