The GDPR (General Data Protection Regulation) goes into effect on May 25, 2018 throughout the European Union. The regulation focuses on the secure management of personal information. For more detailed information, consult the official GDPR website or listen to our recent GDPR Webinar.
A successful GDPR-compliant implementation requires the use of many technical capabilities, such as authentication, authorization, access control, virtual database, and encryption. One of the techniques often considered is data redaction - a technique that limits sensitive data exposure by dynamically changing data as it is displayed for specific users.
Get ready for GDPR: 6 things database administrators should know. Download Whitepaper.
For example, a social security number (SSN) is stored as ‘021-23-9567’. Privileged users can see the full SSN, while other users only see the last four digits ‘xxx-xx-9567’.
In this article I describe how we can use standard EDB Postgres capabilities to create user-specific data redaction mechanisms. Keep in mind that all the fields shown in the example are examples of personal data, not just the ones that have been redacted for the purpose of this example.
N.B.: EDB Postgres Advanced Server 11, targeted for release in late 2018, will have native data redaction capabilities. Those will be more robust and performant than the techniques described in this blog.
Approach
The approach leverages the Postgres search_path feature to direct privileged users to the raw unredacted data when they run a query, and to direct non-privileged users to a view that implements redaction logic.
Step-by step walkthrough
- Two schemas are defined in the database ‘mycompany’:
- Schema ‘employeedata’ has the detailed data in a table called ‘employees’
- Schema ‘redacteddata’ has a view called ‘employees’ that refers to the table ‘employeedata.employees’ and applies user-defined functions and standard SQL to redact data in select columns
- A sample data set with employee IDs, names, social security numbers, birth dates etc. is created in the table ‘employees’ in the schema ‘employeedata’.
- A library of redaction functions for SSN, email, dates, salaries and phone numbers apply data type specific redaction techniques
- The ALTER command is used to set the search_path in Postgres to direct non-privileged users to the views, and privileged users to the underlying unredacted data
- A function is used to show how non-privileged users could query based on redacted fields
All the code described in this example is released under PostgreSQL open source license, and is intended for use with EDB Postgres Advanced Server 10 in Oracle-compatible mode. This code is in intended for illustration purposes only.
Scripts
-- connect to the database as user enterprisedb
DROP DATABASE IF EXISTS mycompany;
CREATE DATABASE mycompany WITH OWNER = enterprisedb;
-- connect to the new database
\c mycompany;
-- the schema employeedata will hold the personally identifiable information (PII)
CREATE SCHEMA employeedata;
-- the schema redacteddata will hold the view that does the redaction
CREATE SCHEMA redacteddata;
-- create table with employee information
CREATE TABLE employeedata.employees
(id integer GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
name varchar(40) NOT NULL,
SSN varchar(11) NOT NULL,
phone varchar(10),
birthday date,
salary money,
email varchar(100)
);
-- add sample data
INSERT INTO employeedata.employees (name, ssn, phone, birthday, salary, email)
VALUES
( 'Sally Sample', '020-78-9345', '5081234567', '1961-02-02', 51234.34, 'sally.sample@enterprisedb.com'),
( 'Jane Doe', '123-33-9345', '6171234567', '1963-02-14', 62500.00, 'jane.doe@gmail.com'),
( 'Bill Foo', '123-89-9345', '9781234567','1963-02-14', 45350, 'william.foe@hotmail.com');
-- define redaction functions. Add the SECURITY DEFINER to specify that the function is to be executed with the privileges of the user that owns it.
CREATE OR REPLACE FUNCTION redact_ssn (ssn varchar(11))
RETURNS varchar(11)
/* replaces 020-12-9876 with xxx-xx-9876 */
AS
$$ SELECT overlay (ssn placing 'xxx-xx' from 1) ;$$
LANGUAGE SQL SECURITY DEFINER;
CREATE OR REPLACE FUNCTION redact_date (input_date date)
RETURNS date
/* sets the year to 0 */
AS
$$ SELECT input_date - ((extract (year from input_date) + 1) *interval '1 year') ;$$
LANGUAGE SQL SECURITY DEFINER;
CREATE or REPLACE FUNCTION redact_email
(
email varchar(100),
visible integer DEFAULT 1,
red_char char default 'x'
)
RETURNS varchar(100)
/*
Redacts the first part of an email address, starting with visible number br of characters. Same for second part after the '@'. The redaction character can be set. Checks if the email address has exactly one '@' and at least one '.'. Otherwise returns 'illegal@email.address'
*/
AS
$$
DECLARE
pos1 integer;
pos2 integer;
part1 varchar; -- this is the name before the @
part2 varchar; -- this is the domain
part3 varchar; -- this is the suffix
BEGIN
--- check if this is an email address
IF (SELECT REGEXP_COUNT (email, '@',1) = 1) AND (SELECT REGEXP_COUNT (email, '\.',1) >= 1)
THEN
SELECT POSITION('@' in email) INTO pos1;
SELECT LENGTH(email) - POSITION('.' IN reverse(email)) INTO pos2;
part1 = RPAD(SUBSTRING(email, 1,visible),pos1-1, red_char );
part2 = RPAD(SUBSTRING(email,pos1+1,visible), pos2-pos1, red_char );
part3 = SUBSTRING(email, pos2 +1, LENGTH(email));
RETURN part1 || '@' || part2 || part3;
ELSE
RETURN 'illegal@email.address';
END IF;
END
$$ LANGUAGE PLPGSQL SECURITY DEFINER
CREATE OR REPLACE FUNCTION redact_salary (salary money)
RETURNS money
/* always returns 0 */
AS
$$ SELECT 0::money; $$
LANGUAGE SQL SECURITY DEFINER;
CREATE OR REPLACE FUNCTION redact_phonenbr
(
phone_nbr varchar(10),
visible integer DEFAULT 4,
red_char char default 'x')
RETURNS varchar(10)
/*
Replaces all digits, except for the last visible digits, with the redaction char
*/
AS
$$
SELECT overlay (phone_nbr placing rpad(red_char, length(phone_nbr) - visible, red_char) from 1);
$$
LANGUAGE SQL SECURITY DEFINER;
-- define redaction view in the schema redacteddata. It calls the redaction functions to redact data in certain columns.
CREATE OR REPLACE VIEW redacteddata.employees AS
SELECT
id,
name,
redact_ssn(ssn) ssn,
redact_phonenbr(phone) phone,
redact_date(birthday) birthday ,
redact_salary(salary) salary,
redact_email(email) email
FROM employeedata.employees;
-- create privileged user
CREATE ROLE privilegeduser LOGIN PASSWORD 'password';
-- grant access to schema and tables
GRANT USAGE ON SCHEMA employeedata TO privilegeduser;
GRANT ALL ON ALL TABLES IN SCHEMA employeedata TO privilegeduser;
-- set search path
ALTER ROLE privilegeduser IN DATABASE mycompany SET search_path = "$user", public, employeedata;
-- create redacted user
CREATE ROLE redacteduser LOGIN PASSWORD 'password';
-- grant access to schema redacteddata schema and views
GRANT USAGE ON SCHEMA redacteddata TO redacteduser;
GRANT ALL ON ALL TABLES IN SCHEMA redacteddata TO redacteduser ;
-- set search path
ALTER ROLE redacteduser IN DATABASE mycompany SET search_path TO "$user", public, redacteddata;
-- define a function that allows to search by SSN, but redacts other data
CREATE OR REPLACE FUNCTION employee_ssn (ssn varchar)
RETURNS TABLE (ID integer, name varchar(40), ssn varchar(11), phone varchar(10), birthday date, salary money, email varchar(100))
/*
Allows a non-priviledged user to search by SSN (a redacted data element). The function returns a table that implements the redaction functions. Obviously this function could be called in an exhaustive search loop to guess the SSN.
*/
AS
$$
SELECT
id,
name,
redact_ssn(ssn) ssn,
redact_phonenbr(phone) phone,
redact_date(birthday) birthday ,
redact_salary(salary) salary,
redact_email(email) email
FROM employeedata.employees WHERE ssn =$1;
$$
LANGUAGE SQL SECURITY DEFINER;
-- connect to database as privilegeduser
Server [localhost]:
Database [edb]: mycompany
Port [5444]:
Username [enterprisedb]: privilegeduser
Password for user privilegeduser:
edb-psql (10.1.5)
mycompany=> select * from employees;
id | name | ssn | phone | birthday
---+--------------+-------------+------------+--------------------
1 | Sally Sample | 020-78-9345 | 5081234567 | 02-FEB-61 00:00:00
1 | Jane Doe | 123-33-9345 | 6171234567 | 14-FEB-63 00:00:00
1 | Bill Foo | 123-89-9345 | 9781234567 | 14-FEB-63 00:00:00
(3 rows)
--- connect to database as redacteduser
Server [localhost]:
Database [edb]: mycompany
Port [5444]:
Username [enterprisedb]: redacteduser
Password for user redacteduser:
edb-psql (10.1.5)
mycompany=> select * from employees;
id | name | ssn | phone | birthday
---+--------------+-------------+------------+--------------------
1 | Sally Sample | xxx-xx-9345 | 5081234567 | 02-FEB-02 00:00:00
1 | Jane Doe | xxx-xx-9345 | 6171234567 | 14-FEB-02 00:00:00
1 | Bill Foo | xxx-xx-9345 | 9781234567 | 14-FEB-02 00:00:00
(3 rows)
-- redacteduser tries to access unredacted dataset
mycompany=> select * from employeedata.employees;
ERROR: permission denied for schema employeedata
LINE 1: select * from employeedata.employees;
^
-- redacted user uses table function to query dataset and queries based on redacted column
mycompany=> select name, ssn, phone, salary from employee_ssn ('123-89-9345');
name | ssn | phone | salary
----------+-------------+------------+--------
Bill Foo | xxx-xx-9345 | xxxxxx4567 | $0.00
(1 row)
Conclusions
Data redaction is one of many techniques being brought to bear on the challenges of GDPR and on other challenges of dealing with confidential data. This article shows how we can use the Postgres search_path, user defined functions and views to improve data protection.
This post benefited from helpful feedback from EnterpriseDB colleagues Phil Allsopp, Robert Haas and Vibhor Kumar.
Marc Linster, Ph.D., is Senior Vice President, Product Development, at EnterpriseDB.