Introduction

MapReduce is a programming pattern and programming model, and the name is also used for implementations of that model. It has become an important approach for processing and generating large data sets. Here we work with a SAGA-based implementation of MapReduce. The framework uses a pilot-job as its underlying execution environment, which is specific to this implementation: SAGA-BigJob is used to submit the "Map" and "Reduce" tasks.

The framework processes multiple files in the input directory. The files are physically chunked in the temp directory, and a map task is submitted for each input chunk file. When a map task completes, partition files are created based on the number of reduces, sorted, and then 'moved' from the temp directory to the output directory. The sorted partition files that belong to the same reduce are grouped together and passed as arguments to that reduce job. The output files are created in the output directory when the reduce jobs complete.
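
The flow described above (map each chunk, partition the map output by the number of reduces, sort, then reduce the grouped partitions) can be illustrated with a minimal in-memory sketch. This is not the framework's code: the function names are hypothetical, Python lists stand in for the chunk and partition files, the crc32-based partitioning is an assumption, and the sketch is written for Python 3 and does not use SAGA or BigJob.

```python
from collections import defaultdict
from zlib import crc32


def map_words(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def partition(pairs, num_reduces):
    """Split mapper output into num_reduces sorted partitions, keyed by word hash."""
    parts = [[] for _ in range(num_reduces)]
    for word, count in pairs:
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        parts[crc32(word.encode("utf-8")) % num_reduces].append((word, count))
    return [sorted(p) for p in parts]


def reduce_counts(partitions):
    """Reduce step: sum the counts of every word seen in the given partitions."""
    totals = defaultdict(int)
    for part in partitions:
        for word, count in part:
            totals[word] += count
    return dict(totals)


def word_count(chunks, num_reduces=2):
    """Run map, partition, and reduce in sequence over a list of text chunks."""
    mapped = [partition(map_words(c), num_reduces) for c in chunks]
    result = {}
    for r in range(num_reduces):
        # Group partition r from every map task, as one reduce task would receive it.
        result.update(reduce_counts([m[r] for m in mapped]))
    return result
```

Because every occurrence of a word hashes to the same partition, each reduce sees all counts for its words, and the per-reduce results can be combined without conflicts.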

Local MapReduce (LMR)
Distributed MapReduce (DMR)

to launch BigJob remotely. To use this URL, the ssh adaptors need to be installed.

Dependencies

SAGA BigJob based MapReduce depends on:

SAGA (http://saga.cct.lsu.edu/)
PBSPro/ssh adaptors (http://www.saga-project.org/download/adaptors)
BigJob (http://faust.cct.lsu.edu/trac/BigJob/wiki)

Deployments

MapReduce Installation

Obtain the MapReduce code from:
https://svn.cct.lsu.edu/repos/saga-projects/applications/MapReduce/branches/pyMapReduce2011

Files

MapReduce

mapreduce.py - Python class that initiates BigJob and submits map and reduce tasks to it.
mrfunctions.py - class containing functions that chunk the data according to user requirements.
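
As an illustration of the kind of chunking described for mrfunctions.py, the sketch below splits one input file into chunk files in a temp directory. It is not taken from mrfunctions.py: the function name chunk_file, the chunk file naming, and the choice of a fixed number of lines per chunk are all assumptions, and the code is written for Python 3.

```python
import os


def chunk_file(input_path, temp_dir, lines_per_chunk):
    """Split input_path into chunk files of at most lines_per_chunk lines each.

    Returns the list of chunk file paths, in order.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    os.makedirs(temp_dir, exist_ok=True)
    base = os.path.basename(input_path)
    chunk_paths = []
    out = None
    with open(input_path, "r", encoding="utf-8") as src:
        for line_number, line in enumerate(src):
            if line_number % lines_per_chunk == 0:
                # Start a new chunk file (hypothetical naming scheme).
                if out is not None:
                    out.close()
                path = os.path.join(temp_dir, "%s.chunk%04d" % (base, len(chunk_paths)))
                chunk_paths.append(path)
                out = open(path, "w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths
```

Splitting on line boundaries keeps words intact, which matters for a wordcount mapper; a size-based split would need extra care to avoid cutting a word in half.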

MapReduce/applications/wordcount

Directory containing the wordcount application.

wordcount_wrapper.sh - wrapper shell script for the wordcount MapReduce application (wordcountApp.py).
wordcountApp.py - MapReduce application Python script; uses the classes and functions in mapreduce.py and mrfunctions.py.
wordcount_map_partition.py - wordcount mapper script.
wordcount_reduce.py - wordcount reduce script.
wordcount_map_partition_comb.py - used only for distributed MapReduce; not required for local MapReduce.
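
To show what a wordcount reduce step over sorted partition files might look like, here is a minimal sketch. It is not the contents of wordcount_reduce.py: the function names and the "word<TAB>count" line format are assumptions made for illustration, and the code is written for Python 3. It relies on each partition file already being sorted by word.

```python
import heapq
from itertools import groupby


def read_pairs(path):
    """Yield (word, count) pairs from a partition file of 'word<TAB>count' lines."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            word, count = line.rstrip("\n").split("\t")
            yield word, int(count)


def reduce_partitions(partition_paths, output_path):
    """Merge sorted partition files and write one total per word."""
    # heapq.merge streams the inputs in word order without loading them fully.
    merged = heapq.merge(*(read_pairs(p) for p in partition_paths),
                         key=lambda pair: pair[0])
    with open(output_path, "w", encoding="utf-8") as out:
        for word, group in groupby(merged, key=lambda pair: pair[0]):
            out.write("%s\t%d\n" % (word, sum(count for _, count in group)))
```

Merging pre-sorted files this way keeps memory use low, since only one line per partition file is held at a time.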

LMR Execution

Create the input, temp, and output directories.

Edit the inputs in the wordcount_wrapper.sh script as required.

Run wordcount_wrapper.sh, which launches the wordcountApp.py script with the required inputs.

Temporary and output files are created in the specified temp and output directories.
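
The directory setup in the first step can also be scripted. The sketch below is only an illustration: the helper name make_work_dirs and the directory names "input", "temp", and "output" are hypothetical, and the paths actually used must match whatever is set in wordcount_wrapper.sh. It is written for Python 3.

```python
import os


def make_work_dirs(base_dir):
    """Create input, temp, and output directories under base_dir and return their paths."""
    paths = {}
    for name in ("input", "temp", "output"):
        paths[name] = os.path.join(base_dir, name)
        # exist_ok avoids an error when the directory is already there.
        os.makedirs(paths[name], exist_ok=True)
    return paths
```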

Wordcount application help

python wordcountApp.py --help

Contact

saga-users@cct.lsu.edu - SAGA installation related problems.
saga-users@cct.lsu.edu - MapReduce related problems.
bigjob-users@cct.lsu.edu - BigJob installation problems.