MapReduce is a popular programming model, introduced by Google in 2004.
It is a large topic, and it is not possible to cover all of its aspects in a single blog article.
This is a simple introduction to using _MapReduce_ in Greenplum 4.1.
## What is MapReduce, exactly?
MapReduce's main goal is to process highly distributable problems over huge datasets using a large number of computers (nodes).
As you may guess, this fits Greenplum perfectly: Greenplum is at ease with huge distributed datasets and allows integration
with the SQL language.
Mapreduce consists of two separate steps: _Map_ and _Reduce_.
### Map step
During this step, the main problem is partitioned into smaller sub-problems that are passed to child nodes, recursively.
This process leads to a multi-level tree structure.
### Reduce step
During this step, the solutions to all the sub-problems are merged to obtain the solution to the initial problem.
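Before getting Greenplum involved, here is a minimal, framework-free sketch of the two steps in plain Python (the input rows are made up for illustration):

```python
from collections import defaultdict

# Hypothetical input rows (x, y), standing in for a distributed dataset
rows = [(1, 42.0), (2, 7.0), (3, 19.0)]

# Map step: each input row produces zero or more (key, value) pairs
pairs = []
for x, y in rows:
    pairs.append(('Sum of x', x))
    pairs.append(('Sum of y', y))

# Reduce step: all values sharing the same key are merged into one result
totals = defaultdict(float)
for key, value in pairs:
    totals[key] += value

print(dict(totals))  # {'Sum of x': 6.0, 'Sum of y': 68.0}
```

In a real deployment the map calls run in parallel on the nodes that hold the data, and only the (key, value) pairs travel to the reducers.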
## Install MapReduce in Greenplum
Do you think installing MapReduce in Greenplum is a difficult task? The answer is no: MapReduce is already included in Greenplum!
## Let’s get practical
I assume that you have a Greenplum 4.1 system installed.
Run `gpmapreduce --version`:

```
$ gpmapreduce --version
gpmapreduce - Greenplum Map/Reduce Driver 1.00b2
```
Perfect, we can go on.
The main point here is a specially formatted file, which we will conventionally call `test.yml` from now on.
As you may guess, it is a YAML file, which defines all the parts needed by the MapReduce data flow to complete:
* Input Data
* Map Function
* Reduce Function
* Output Data
The Greenplum MapReduce specification file has a specific YAML schema; I invite you to have a look at the Admin Guide for details.
In particular, MapReduce is covered in Chapter 23.
For the sake of this article, we will focus on function definitions.
Let's start writing a text file named `test.yml` with the mandatory header:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: dbname
USER: gpadmin
HOST: host
```
where `dbname` and `host` are the name of the database and the host that MapReduce will connect to.
### Input Data
Input Data can be obtained in many ways; in this example we will use an SQL `SELECT` statement.
Let's create a table in the database to get data from:
```
$ psql -c "CREATE TABLE mydata AS SELECT i AS x,
    floor(random()*100) AS y FROM generate_series(1,5) i"
```
This will create a table with 5 rows and the following structure:

```
=# \d mydata
     Table "public.mydata"
 Column |       Type       | Modifiers
--------+------------------+-----------
 x      | integer          |
 y      | double precision |
Distributed by: (x)
```
The set of rows of this table is our Input Data.
Let's define it in the MapReduce configuration file by appending this to `test.yml`:
```yaml
DEFINE:
- INPUT:
    NAME: my_input_data
    QUERY: SELECT x,y FROM mydata
```
That is self-explanatory: it just selects all rows from the `mydata` table as input data for MapReduce.
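Note that `QUERY` is only one of the input types that the specification supports: according to the Admin Guide, an `INPUT` can also read a table directly, or external files. As a sketch, a table-based input equivalent to ours (the name is illustrative) could look like this:

```yaml
# Hypothetical alternative INPUT: read the table directly, no SQL needed
- INPUT:
    NAME: my_table_input
    TABLE: mydata
```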
### Map Function
It is very important to understand that a Map function takes *a single row* as input, and produces *zero or more* output rows.
Map functions can be written in C, Perl or Python.
They reside directly in the YAML configuration file.
Parameter management varies between programming languages (please consult the Admin Guide for details).
Let's see an example of a map function written in Python. You can append the following to `test.yml`:
```yaml
- MAP:
    NAME: my_map_function
    LANGUAGE: PYTHON
    PARAMETERS: [x integer, y float]
    RETURNS: [key text, value float]
    FUNCTION: |
      yield {'key': 'Sum of x', 'value': x }
      yield {'key': 'Sum of y', 'value': y }
```
As you can see, the function source is placed directly in the YAML configuration file.
The function takes `x` and `y` as input and, for every input row, emits (`yield`s) two output rows: one with key `Sum of x` and value `x`, and one with key `Sum of y` and value `y`. The keys will let the Reduce step group the values to be summed.
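To convince yourself of the row-in, rows-out behavior, you can simulate the function with plain Python outside Greenplum (the sample row is made up):

```python
# Stand-alone simulation of the map function above; roughly speaking,
# Greenplum runs the FUNCTION body with the PARAMETERS as arguments.
def my_map_function(x, y):
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}

# A single hypothetical input row produces exactly two output rows
for out_row in my_map_function(1, 42.0):
    print(out_row)
# {'key': 'Sum of x', 'value': 1}
# {'key': 'Sum of y', 'value': 42.0}
```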
### Reduce Function
A Reduce function takes a set of rows as input and, for each distinct key, produces *a single* reduced row.
Several predefined Reduce functions are included in Greenplum.
Here's the list:
* IDENTITY – returns (key, value) pairs unchanged
* SUM – calculates the sum of numeric data
* AVG – calculates the average of numeric data
* COUNT – calculates the count of input data
* MIN – calculates the minimum value of numeric data
* MAX – calculates the maximum value of numeric data
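If none of the built-ins fits, the YAML schema also allows user-defined Reduce functions, built from a transition function that folds one value at a time into an accumulator state. The following is only a sketch of a hand-rolled equivalent of SUM, assuming the `TRANSITION`/`REDUCE` layout described in the Admin Guide:

```yaml
# Hypothetical custom reduce, to be placed under DEFINE:
- TRANSITION:
    NAME: my_transition
    LANGUAGE: PYTHON
    PARAMETERS: [state float, value float]
    RETURNS: [state float]
    FUNCTION: |
      # fold one incoming value into the running total
      return state + value
- REDUCE:
    NAME: my_sum
    TRANSITION: my_transition
```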
Let's apply a REDUCE function to our input data, so append this to `test.yml`:

```yaml
EXECUTE:
- RUN:
    SOURCE: my_input_data
    MAP: my_map_function
    REDUCE: SUM
```
This sums, separately for each key emitted by the map function, all of the corresponding values. It is a deliberately simple job, but it is enough to show the Reduce step in action and get you started.
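Also note that we never defined the fourth part from our list, the Output Data: when no output is specified, `gpmapreduce` simply prints the results to standard output, which is what we want here. According to the Admin Guide, results can also be directed to a table; a hedged sketch (the table name is illustrative):

```yaml
# Hypothetical OUTPUT, placed under DEFINE:; results land in a table
- OUTPUT:
    NAME: my_output
    TABLE: mapreduce_results
    MODE: REPLACE
```

The RUN block would then reference it with `TARGET: my_output`.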
OK, let's see the complete `test.yml`:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
- INPUT:
    NAME: my_input_data
    QUERY: SELECT x,y FROM mydata
- MAP:
    NAME: my_map_function
    LANGUAGE: PYTHON
    PARAMETERS: [x integer, y float]
    RETURNS: [key text, value float]
    FUNCTION: |
      yield {'key': 'Sum of x', 'value': x }
      yield {'key': 'Sum of y', 'value': y }
EXECUTE:
- RUN:
    SOURCE: my_input_data
    MAP: my_map_function
    REDUCE: SUM
```
Remember that YAML does not allow tabs for indentation!
It is now possible to execute this MapReduce job by simply running:

```
$ gpmapreduce -f test.yml
```
The results here will most likely be different from yours, due to the use of the `random()` function during data generation.
Here are mine:

```
mapreduce_2508_run_1
   key    | value
----------+-------
 Sum of x |    15
 Sum of y |   278
(2 rows)
```
Exactly the sum of all `x` and `y` values from the input table `mydata`.
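As a sanity check, the same numbers can be obtained with a plain aggregate query on the source table:

```
$ psql -c 'SELECT sum(x) AS "Sum of x", sum(y) AS "Sum of y" FROM mydata'
```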
In conclusion, this is just a taste of how MapReduce works in Greenplum.
MapReduce is a complex and wide-ranging topic, and its usage is growing in popularity every day.
Greenplum has excellent support for it, and allows business analytics users to take advantage
of the shared-nothing architecture by executing map/reduce functions in a distributed way and by
working on distributed datasets.