In this article, we are going to complete the MapReduce job started in the [previous article](https://www.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html).
## Picking up the problem from the previous article
In the [previous article](https://www.2ndquadrant.com/en/2011/10/mapreduce-in-greenplum.html), we left with this MapReduce configuration file:
```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x,y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer , y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```
Which produces the following output:
```
   key    | value
----------+-------
 Sum of x |    15
 Sum of y |   278
(2 rows)
```
In short, that job sums all the values of two different columns of a test table.
Our goal here is to divide these two results, namely 15 and 278.
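Before moving on, the map-and-SUM job recapped above can be sketched in plain Python. This is only an illustration: the rows below are hypothetical, chosen so that their column totals match the 15 and 278 from the article.

```python
# Hypothetical sample rows for my_data; only the column totals
# (15 and 278) are known from the article.
rows = [(1, 100.0), (2, 80.0), (3, 50.0), (4, 30.0), (5, 18.0)]

def my_map_function(x, y):
    # Mirrors the FUNCTION block: emit one key/value pair per column.
    yield {'key': 'Sum of x', 'value': x}
    yield {'key': 'Sum of y', 'value': y}

# The SUM reduce step: accumulate the values emitted for each key.
sums = {}
for x, y in rows:
    for pair in my_map_function(x, y):
        sums[pair['key']] = sums.get(pair['key'], 0) + pair['value']

print(sums)  # {'Sum of x': 15, 'Sum of y': 278.0}
```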
Let’s check what the result is with a calculator, just to be sure that the MapReduce job will return the correct value:
```
$ psql -c "SELECT 15/278::FLOAT AS result" test_database
       result
--------------------
 0.0539568345323741
(1 row)
```
Yes, we use Greenplum as a calculator :).
## Introducing “tasks”
What we will do here is define a separate task that performs the sums.
We will use the result of that task as input for a query that actually does the division step.
Let’s see it in practice.
* Remove the `EXECUTE` part from `test.yml`, that is, these lines:

```yaml
EXECUTE:
  - RUN:
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```
* Define a task, which is responsible for computing the sums of the _x_ and _y_ values. To do that, it reuses the old map function. Append this to `test.yml`:

```yaml
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
```
The useful characteristic of tasks is that they can serve as input for further processing stages.
* Define the step that actually performs the division. It is a SQL SELECT that uses the task defined earlier as its input. Append this to `test.yml`:

```yaml
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums WHERE key = 'Sum of x') /
          (SELECT value FROM sums WHERE key = 'Sum of y')
        AS final_division;
```

As you can see, the `FROM` clause contains the name of the task defined above: `sums`.
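In plain Python terms, this second stage simply divides the two values produced by the `sums` task. A minimal sketch, with the task's output written by hand:

```python
# Hand-written stand-in for the output of the "sums" task.
sums = {'Sum of x': 15.0, 'Sum of y': 278.0}

# The "division" stage: one scalar divided by the other.
final_division = sums['Sum of x'] / sums['Sum of y']
print(final_division)  # approximately 0.0539568345323741
```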
* Finally, execute the job and display its output. Append this to `test.yml`:

```yaml
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```

This step runs the _division_ query and displays the result on standard output.
## Putting everything together
This is the complete `test.yml` file:

```yaml
%YAML 1.1
---
VERSION: 1.0.0.1
DATABASE: test_database
USER: gpadmin
HOST: localhost
DEFINE:
  - INPUT:
      NAME: my_input_data
      QUERY: SELECT x,y FROM my_data
  - MAP:
      NAME: my_map_function
      LANGUAGE: PYTHON
      PARAMETERS: [ x integer , y float ]
      RETURNS: [ key text, value float ]
      FUNCTION: |
        yield {'key': 'Sum of x', 'value': x }
        yield {'key': 'Sum of y', 'value': y }
  - TASK:
      NAME: sums
      SOURCE: my_input_data
      MAP: my_map_function
      REDUCE: SUM
  - INPUT:
      NAME: division
      QUERY: |
        SELECT
          (SELECT value FROM sums WHERE key = 'Sum of x') /
          (SELECT value FROM sums WHERE key = 'Sum of y')
        AS final_division;
EXECUTE:
  - RUN:
      SOURCE: division
      TARGET: STDOUT
```
Execute the whole job with:
```
$ gpmapreduce -f test.yml
mapreduce_2235_run_1
   final_division
--------------------
 0.0539568345323741
(1 row)
```
Compare it with the calculator result: it matches.
## Conclusion
The task is complete: we have correctly calculated `sum(x)/sum(y)`.
The power of MapReduce lies mainly in the number of servers involved in the calculation: many servers each perform a small part of the work to produce the final result.
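As a toy illustration of that idea (not real Greenplum internals), here is a SUM split across two pretend segments, each summing its own slice of the data before the partial results are combined:

```python
# Toy illustration of a distributed SUM, using hypothetical values of x.
rows = [1, 2, 3, 4, 5]

# Pretend two segments each hold a slice of the data.
segments = [rows[:2], rows[2:]]

# Each segment computes a local partial sum...
partials = [sum(chunk) for chunk in segments]

# ...and the partial results are combined into the final answer.
total = sum(partials)
print(total)  # 15
```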
You will probably not notice the power of MapReduce in such a small example, but it is a good starting point.