The 9.4 version of PostgreSQL introduces the JSONB
data type, a specialised representation of the JSON
data, allowing PostgreSQL to be competitive in managing the “lingua franca” of the moment for the exchange of data via web services. It is useful to perform a number of tests to verify its actual performance.
Test data
We will run our tests using the customer reviews data from Amazon for the year 1998 in JSON
format. The file customer_reviews_nested_1998.json.gz can be downloaded from the website of Citus Data.
The file, once unzipped, takes up 209MB and contains approximately 600k of records in JSON
format with a structure similar to the following:
|
Dimensions
The data can be loaded into a PostgreSQL database using the JSONB
data type with the following commands:
|
The resulting table will take up approximately 268MB, with an additional cost of disk storage of around 28%. If we try to load the same data using the JSON
type, which stores it as text, the result will be a table of 233MB, with an increase in space of roughly 11%. The reason for this difference is that the internal structures of JSONB
, which are used to access the data without analysing the entire document each time, have a cost in terms of space.
Data access
Once the data is stored in the database, it is necessary to create an index in order to access it efficiently. Before the 9.4 version of PostgreSQL, the only method of indexing the contents of a JSON
field was to use a B-tree
index on a specific search expression. For example, if we want to perform a search by product category we will use:
|
The newly created index takes up 21MB, or approximately 10% of the original data, and will be used for queries that contain within the WHERE clause the exact search term “review #>> {product,category}
”, such as:
|
The query takes approximately 0.180ms to be performed on the test machine, but the index that has been created is highly specific and cannot be used for different searches.
Starting with version 9.4, the JSONB
data type supports the use of inverted indexes (GIN, or General inverted Indexes), which allow indexing of the components of a complex object.
Let’s create a GIN
index on our table reviews
with the following command:
|
The resulting index takes up 64MB of disk, which is approximately 30% of the size of the original table. This index can be used to speed up searches using the following operators:
JSON
@>
JSON
is a subsetJSON
?
TEXT
contains a valueJSON
?&
TEXT[]
contains all the valuesJSON
?|
TEXT[]
contains at least one value
The above query shall thus be rewritten using the operator @>
to search for rows that contain '{"product": {"category": "Fitness & Yoga"}}'
:
|
The query takes approximately 1.100ms to be performed on the test machine, and the index that has been created is flexible and can be used for any search within the JSON
data.
In fact, it is often the case that the only operation used in the applications is the search for a subset. In this case it is possible to use a different GIN
index which only supports the @>
operator and is therefore considerably smaller. The syntax for creating this type of “optimised” index is as follows:
|
The resulting index occupies 46MB – only 22% of the size of the original data – and, thanks to its smaller size, it is used more efficiently by PostgreSQL.
This allows the above query to be run in just 0.167ms, with a performance increase of 650% compared to the original GIN
index and 8% compared to the specific B-tree
index initially used: all this without loss of generality with regards to the possible search operation.
Conclusions
With the introduction of the JSONB
type and the GIN
indexes
using jsonb_path_ops
operator class, PostgreSQL combines the elasticity of the JSON
format at an amazing data access speed.
Today it is thus possible to store and process data in JSON
format with high performance while enjoying the robustness and flexibility that PostgreSQL has habitually provided us with over the years.