Regression Analysis in PostgreSQL with Tensorflow: Part 1 - Getting Started

Regression Analysis in PostgreSQL with Tensorflow: Part 1 - Getting Started

In my last Machine Learning blog, PostgreSQL and Machine Learning, I looked at why we might want to integrate machine learning into our databases and showed examples using Apache MADlib and Tensorflow to analyse the Boston Housing Dataset.

In this blog mini-series, we'll take a deeper dive into the code I wrote to perform this analysis using Tensorflow. Whilst MADlib is certainly useful for building intelligent analytics into your databases, use of Tensorflow is arguably much more interesting as we can easily build whatever we want using pl/python3 as we'll have access to all the nuts and bolts of Tensorflow (or PyTorch or scikit-learn etc.) as well as virtually the entire Python package ecosystem which includes incredibly handy libraries such as Pandas and Numpy.

In part 1, we'll look at getting everything set up and some of the basics of using pl/python3 from within PostgreSQL.

 

Installing PostgreSQL and pl/python3

To get started, you'll need a PostgreSQL installation with the pl/python3 procedural language extension installed. On Windows or macOS, install PostgreSQL using the EDB Installers which you can get via the PostgreSQL website. You can then use the StackBuilder utility to install the EDB LanguagePack which will add the required Python support.

Instructions for installing PostgreSQL on Linux can also be found at the link above. In the examples below, I'll be working on Ubuntu 20.04.

dpage@ubuntu:~$ sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'
[sudo] password for dpage: 
dpage@ubuntu:~$ wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
OK
dpage@ubuntu:~$ sudo apt-get update
Hit:1 http://us.archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://security.ubuntu.com/ubuntu focal-security InRelease [109 kB]
Get:3 http://us.archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]                 
Get:4 http://apt.postgresql.org/pub/repos/apt focal-pgdg InRelease [81.6 kB]                                          
Get:5 http://apt.postgresql.org/pub/repos/apt focal-pgdg/main amd64 Packages [188 kB]                
...
dpage@ubuntu:~$ sudo apt-get -y install postgresql-13 postgresql-plpython3-13
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  libpq5 pgdg-keyring postgresql-client-13 postgresql-client-common postgresql-common sysstat
...
dpage@ubuntu:~$ 

It's worth noting that the official Tensorflow packages support the use of GPUs on Linux and Windows, but not macOS. For simple regression analysis this shouldn't be a problem, but for more complex tasks it might cause performance issues when training. It is possible to use distributed nodes, GPUs, or TPUs with Tensorflow, so if you don't have a GPU on your database server or you're running on macOS you might consider offloading work to external nodes that do have GPUs. See the Tensorflow documentation for more information.

 

Setting up the Python environment

Now, we need to prepare the Python environment. On Windows or macOS, use the full path to the pip utility that was included in the LanguagePack installation to ensure that the packages are installed into the correct Python environment. On Linux, the system Python environment is used. Depending on your distribution, you may need to use pip3 instead of pip to ensure you're installing packages for Python 3 and not for Python 2:

dpage@ubuntu:~$ sudo pip3 install tensorflow pandas matplotlib seaborn
Collecting tensorflow
  Downloading tensorflow-2.4.1-cp38-cp38-manylinux2010_x86_64.whl (394.4 MB)
     |████████████████████████████████| 394.4 MB 25 kB/s 
Collecting numpy
  Downloading numpy-1.20.1-cp38-cp38-manylinux2010_x86_64.whl (15.4 MB)
     |████████████████████████████████| 15.4 MB 1.3 MB/s 
...
dpage@ubuntu:~$ 

Note that I've not explicitly installed the numpy library here; a compatible version will be automatically installed as Tensorflow is dependent on it. I've also installed a number of additional libraries:

  • pandas; a powerful library for data analysis
  • matplotlib; useful for plotting graphs for data analysis
  • seaborn; works with matplotlib to provide various data visualisation options

Now that everything is installed, we can run a quick test in PostgreSQL:

dpage@ubuntu:~$ sudo su - postgres
postgres@ubuntu:~$ psql postgres
psql (13.2 (Ubuntu 13.2-1.pgdg20.04+1))
Type "help" for help.
 
postgres=# CREATE DATABASE tf;
CREATE DATABASE
postgres=# \connect tf
You are now connected to database "tf" as user "postgres".
tf=# CREATE LANGUAGE plpython3u;
CREATE EXTENSION
tf=# CREATE FUNCTION tf_version()
tf-#   RETURNS text
tf-# AS $$
tf$#   import tensorflow as tf
tf$#   return tf.__version__
tf$# $$
tf-# LANGUAGE plpython3u;
CREATE FUNCTION
tf=# SELECT tf_version();
 test_tf 
---------
 2.4.1
(1 row)
 
tf=#

We're testing here by creating a database called tf, connecting to it, and then creating the pl/python3u procedural language (the u is intentional; it's part of PostgreSQL's naming convention that indicates this is an untrusted language, i.e. one that is not sandboxed for security). Then, we create a simple function that imports the Tensorflow library and returns the version number. Finally, we call the function using SELECT which returns the same version number for Tensorflow that we saw was installed by pip3 previously.

So, it looks like everything is set up and ready to go!
 

pl/python3 Basics

When we tested our Tensorflow installation, we created a simple pl/python3 function that simply returned a value. In practice we'll likely need to be able to pass values into the function, raise errors, execute SQL queries, and read and write files.
 

Passing and Returning Values

To pass values into a function, we declare them when we create the function and can immediately access the variables from within our Python code. For example:

tf=# CREATE FUNCTION py_add(a integer, b integer)
tf=#    RETURNS integer
tf=# AS $$
tf=#   return a+b
tf=# $$ LANGUAGE plpython3u;
tf=# SELECT py_add(4, 5);
 py_add 
--------
      9
(1 row)
 
tf=# 

The SQL code surrounds the Python code which can be seen between the $$ markers.

Note that you cannot redefine the values of parameters passed to the function from within the function, unless you declare them as globals in the function body.

Arrays passed into pl/python3 functions are converted to lists (or nested lists) in Python, and will be converted back the other way if returned. Sets of rows can be returned by returning a tuple of lists or similar constructs, an iterator, or a generator.
 

Displaying Messages

As print() won't work from within pl/python3, we need another way to output messages to the user, using PostgreSQL's logging infrastructure (which also handles messages sent to the client interface). We can raise notices, errors, warnings and more. For example:

tf=# CREATE FUNCTION hello_world()
tf-#   RETURNS void
tf-# AS $$
tf$#   plpy.notice('Hello world!')
tf$# $$ LANGUAGE plpython3u;
CREATE FUNCTION
tf=# SELECT hello_world();
NOTICE:  Hello world!
 hello_world 
-------------
 
(1 row)
 
tf=#

Note that raising an error (or fatal) will cause a Python exception to be raised which will be propagated back to PostgreSQL if it's not caught and handled, causing the transaction to abort.

 

Executing SQL Queries

Being able to execute SQL queries from within our functions is critical if we're going to truly combine the power of machine learning, Python and PostgreSQL. Thankfully, it's pretty easy. In this example, we select the name, schema and owner for all the tables in our database, load the data into a pandas dataframe, and then print out a sample (which pandas will format nicely for us):

tf=# CREATE FUNCTION show_tables()
tf-#   RETURNS void
tf-# AS $$
tf$#   import pandas as pd
tf$# 
tf$#   tables = plpy.execute('SELECT schemaname, tablename, tableowner FROM pg_tables;')
tf$# 
tf$#   columns = list(rows[0].keys())
tf$#   df = pd.DataFrame.from_records(tables, columns = columns)
tf$# 
tf$#   plpy.notice('Tables: \n{}'.format(df))
tf$# $$ LANGUAGE plpython3u;
tf=# SELECT show_tables();
NOTICE:  Tables: 
            schemaname                tablename tableowner
0           pg_catalog             pg_statistic   postgres
1           pg_catalog                  pg_type   postgres
2           pg_catalog         pg_foreign_table   postgres
3           pg_catalog                pg_authid   postgres
4           pg_catalog    pg_statistic_ext_data   postgres
..                 ...                      ...        ...
61          pg_catalog      pg_subscription_rel   postgres
62  information_schema  sql_implementation_info   postgres
63  information_schema                sql_parts   postgres
64  information_schema               sql_sizing   postgres
65  information_schema             sql_features   postgres
 
[66 rows x 3 columns]
 
 show_tables 
-------------
 
(1 row)
 
tf=#

 

Reading and Writing Files

In our use of Tensorflow we will need to read and write files on disk; at the very least, the models that we create and use.

When a function executes in PostgreSQL, we must remember that as far as the operating system is concerned, it's being executed by the user account under which PostgreSQL is running regardless of which user actually called the function, often referred to as the service account. This means that any files written from a pl/python3 function will be owned by the service account (typically postgres), and will be created with quite restrictive permissions on non-Windows platforms, usually allowing only the service account to read or write the file. Files must be written to a directory that the service account has permission to write to, and any files to be read must also be accessible to the service account of course. On Windows PostgreSQL doesn't attempt to manipulate file system ACLs itself, and relies on the data directory being properly secured.

It's also a good idea to use absolute paths when reading/writing files from pl/python3. The working directory for a postgres process will be the data directory, and it's generally not a good idea to write files there. Instead, create a suitable directory elsewhere and use that via absolute paths to avoid any complications or mistakes with relative paths.
 

Conclusion

In this post we've covered the basics of setting up a PostgreSQL installation with pl/python3 so we can use Tensorflow (and other Python libraries) from within our databases. We've also covered some of the basics of pl/python3 which will help us later when we start integrating machine learning techniques into our PostgreSQL databases.

When I originally started writing this blog post, it was intended to be one article that covered the setup and use of Tensorflow with PostgreSQL. It quickly became apparent that that was going to lead to a much longer post than expected if I was to do more than just skim the details, which is really not what I wanted. 

So now that we have everything setup and are familiar with the basics of pl/python3, look out for part two in which I'll cover some of the ways we can analyse our data and pre-process it ready for training a model. 

In the meantime, take a look at some of the other useful functionality that pl/python3 has to offer.
 

Dave Page

Dave Page is Vice President and Chief Architect, Database Infrastructure, currently working in the CTO team on research and development, best practices with Postgres, and providing high-level guidance and support for key customers. 

Dave has been working with PostgreSQL sin ...