Greenplum CE 4.2 XML support

July 23, 2012

In this article we will see one of the most interesting features added in Greenplum 4.2:
XML datatype support.

Introduction

This article has an educational goal. Most likely you will not want to use the code presented here in a production system. However, it will give you hints and interesting ideas on how to add XML capabilities (both for loading and – most importantly – for searching) to your Greenplum database.

My intention is to show the potential of XML data type and XML data processing in a concurrent environment.

Goals

We will load XML data from a file into a Greenplum table in a XML field.

We will then perform some XPath queries on that data.

The xpath function operates on XML data, we will write a demo wrapper function which works on a XML file.

For the sake of this example, the XML file must reside on the server: we will use indeed the COPY command to load data from it.

Can I see some code?

Download and extract a dataset in XML format directly from the blog:

$ wget https://www.2ndquadrant.com/wp-content/uploads/2012/07/100songs.tar.gz
$ tar xvf 100songs.tar.gz

Copy the following code in a file named xml.sql:

BEGIN;

CREATE TABLE t (id INTEGER, data XML)
        DISTRIBUTED BY (id);

CREATE OR REPLACE FUNCTION _xpath(
        _xml_file VARCHAR,
        _xpath_query VARCHAR
)
RETURNS TEXT AS
$BODY$
DECLARE
        _is_well_formed BOOLEAN;
        _result TEXT;
BEGIN

        CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
        EXECUTE $$ COPY tmp FROM $$ ||  quote_literal(_xml_file) ;

        INSERT INTO t (data) SELECT xml(d) FROM tmp;
        DROP TABLE tmp;

        SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
        IF _is_well_formed = False THEN
                RETURN 'XML document is not well formed';
        END IF;

        SELECT INTO _result xpath(_xpath_query, data ) FROM t;
        RETURN _result;
END;
$BODY$ LANGUAGE plpgsql;

SELECT _xpath('/greenplum/tests/data/100songs.xml',
   '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;

ROLLBACK;

Execute the SQL script with:

$ psql -qtxf xml.sql

You should see:

song | {"Stop Bajon (Primavera)"}

Code in detail

First of all, we have to know the XML document structure:

<response>
    <row>
        <row _id=... _uuid=... _position=... _address=...>
            <theme>...</theme>
            <title>...</title>
            <artist>...</artist>
            <year>...</year>
            <spotify_url url=.../>
        </row>
        ...
    </row>
</response>

The first part of the code creates a table to store XML data:

CREATE TABLE t (id INTEGER, data XML)
        DISTRIBUTED BY (id);

In the function body, we create a temporary table to store the file content as a single TEXT:

CREATE TEMP TABLE tmp (d TEXT) distributed by (d);
EXECUTE $$ COPY tmp FROM $$ ||  quote_literal(_xml_file) ;

Then, we convert the TEXT data into XML data using the xml function:

INSERT INTO t (data) SELECT xml(d) FROM tmp;
DROP TABLE tmp;

And we check if the XML is well formed:

SELECT INTO _is_well_formed xml_is_well_formed(text(data)) FROM t;
IF _is_well_formed = False THEN
    RETURN 'XML document is not well formed';
END IF;

Lastly, we perform the XPath query and return the result:

SELECT INTO _result xpath(_xpath_query, data ) FROM t;
RETURN _result;

The last part of the script shows how to call the function to retrieve data:

-- Get the Tullio De Piscopo's song from the list
SELECT _xpath('/greenplum/tests/data/100songs.xml', '/response/row/row[artist="Tullio De Piscopo"]/title/text()') AS song;

Conclusions

XPath is a technology of mixed feeling. Many hate it, some love it. However, when working with XML structured data type, it is a must.

The good thing with Greenplum 4.2 is that it is now possible to load millions of XML documents in a Greenplum database, have the data distributed in multiples segments and take advantage of parallelism when retrieving data using XPath expressions.

 

Share this

More Blogs