JSONB type performance in PostgreSQL 9.4

March 03, 2015


The 9.4 version of PostgreSQL introduces the JSONB data type, a specialised representation of the JSON data, allowing PostgreSQL to be competitive in managing the “lingua franca” of the moment for the exchange of data via web services. It is useful to perform a number of tests to verify its actual performance.

Test dataarticolo-json-marco

We will run our tests using the customer reviews data from Amazon for the year 1998 in JSON format. The file customer_reviews_nested_1998.json.gz can be downloaded from the website of Citus Data.

The file, once unzipped, takes up 209MB and contains approximately 600k of records in JSON format with a structure similar to the following:

{

    "customer_id": "ATVPDKIKX0DER",
    "product": {
        "category": "Arts & Photography",
        "group": "Book",
        "id": "1854103040",
        "sales_rank": 72019,
        "similar_ids": [
            "1854102664",
            "0893815381",
            "0893816493",
            "3037664959",
            "089381296X"
        ],
        "subcategory": "Art",
        "title": "The Age of Innocence"
    },
    "review": {
        "date": "1995-08-10",
        "helpful_votes": 5,
        "rating": 5,
        "votes": 12
    }
}

Dimensions

The data can be loaded into a PostgreSQL database using the JSONB data type with the following commands:

CREATE TABLE reviews(review jsonb);
\copy reviews FROM 'customer_reviews_nested_1998.json'
VACUUM ANALYZE reviews;

The resulting table will take up approximately 268MB, with an additional cost of disk storage of around 28%. If we try to load the same data using the JSON type, which stores it as text, the result will be a table of 233MB, with an increase in space of roughly 11%. The reason for this difference is that the internal structures of JSONB, which are used to access the data without analysing the entire document each time, have a cost in terms of space.

Data access

Once the data is stored in the database, it is necessary to create an index in order to access it efficiently. Before the 9.4 version of PostgreSQL, the only method of indexing the contents of a JSON field was to use a B-tree index on a specific search expression. For example, if we want to perform a search by product category we will use:

CREATE INDEX on reviews ((review #>> '{product,category}'));

The newly created index takes up 21MB, or approximately 10% of the original data, and will be used for queries that contain within the WHERE clause the exact search term “review #>> {product,category}”, such as:

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review #>> '{product,category}' = 'Fitness & Yoga'
GROUP BY 1 ORDER BY 2;
                       title                       |        avg
---------------------------------------------------+--------------------
 Kathy Smith - New Yoga Challenge                  | 1.6666666666666667
 Pumping Iron 2                                    | 2.0000000000000000
 Kathy Smith - New Yoga Basics                     | 3.0000000000000000
 Men Are from Mars, Women Are from Venus           | 4.0000000000000000
 Kathy Smith - Functionally Fit - Peak Fat Burning | 4.5000000000000000
 Kathy Smith - Pregnancy Workout                   | 5.0000000000000000
(6 rows)

The query takes approximately 0.180ms to be performed on the test machine, but the index that has been created is highly specific and cannot be used for different searches.

Starting with version 9.4, the JSONB data type supports the use of inverted indexes (GIN, or General inverted Indexes), which allow indexing of the components of a complex object.

Let’s create a GIN index on our table reviews with the following command:

CREATE INDEX on reviews USING GIN (review);

The resulting index takes up 64MB of disk, which is approximately 30% of the size of the original table. This index can be used to speed up searches using the following operators:

  • JSON @> JSON is a subset
  • JSON ? TEXT contains a value
  • JSON ?& TEXT[] contains all the values
  • JSON ?| TEXT[] contains at least one value

The above query shall thus be rewritten using the operator @> to search for rows that contain '{"product": {"category": "Fitness & Yoga"}}':

SELECT
    review #>> '{product,title}' AS title,
    avg((review #>> '{review,rating}')::int)
FROM reviews
WHERE review @> '{"product": {"category": "Fitness & Yoga"}}'
GROUP BY 1 ORDER BY 2;

The query takes approximately 1.100ms to be performed on the test machine, and the index that has been created is flexible and can be used for any search within the JSON data.

In fact, it is often the case that the only operation used in the applications is the search for a subset. In this case it is possible to use a different GIN index which only supports the @> operator and is therefore considerably smaller. The syntax for creating this type of “optimised” index is as follows:

 

CREATE INDEX on reviews USING GIN (review jsonb_path_ops);

The resulting index occupies 46MB – only 22% of the size of the original data – and, thanks to its smaller size, it is used more efficiently by PostgreSQL.

This allows the above query to be run in just 0.167ms, with a performance increase of 650% compared to the original GIN index and 8% compared to the specific B-tree index initially used: all this without loss of generality with regards to the possible search operation.

Conclusions

With the introduction of the JSONB type and the GIN indexes
using jsonb_path_ops operator class, PostgreSQL combines the elasticity of the JSON format at an amazing data access speed.

Today it is thus possible to store and process data in JSON format with high performance while enjoying the robustness and flexibility that PostgreSQL has habitually provided us with over the years.

Share this

More Blogs