Introduction
MapReduce is a programming pattern and a programming model; the name
is also used for specific implementations of that model. It has
become an extremely important approach for processing and generating
large data sets.
Here we will work with a SAGA-based implementation of MapReduce.
The framework uses a pilot-job as the underlying execution
environment; this is specific and unique to this particular
implementation. SAGA-BigJob is used to submit "Map" & "Reduce" tasks.
The framework processes multiple files in the input directory. The
files are physically chunked in the temp directory, and a map
task is submitted for each input chunk file. When a map task
completes, partition files are created according to the number of
reduces, sorted, and then 'moved' from the temp directory to the
output directory.
The sorted partition files that belong to the same reduce are grouped
together and passed as arguments to that reduce job. The output files
are created in the output directory when the reduce jobs complete.
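The flow described above can be sketched in plain Python. This is a minimal in-memory illustration, not the framework's code: the function names (map_words, partition, reduce_counts, run_wordcount) and the CRC32-based partitioning are assumptions made for this example, and Python lists stand in for the chunk and partition files.

```python
import zlib
from collections import defaultdict


def map_words(chunk_text):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk_text.split()]


def partition(pairs, num_reduces):
    """Split map output into num_reduces partitions and sort each one.

    A stable hash (CRC32, an assumption of this sketch) guarantees that a
    given key lands in the same partition index for every map task.
    """
    partitions = [[] for _ in range(num_reduces)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reduces
        partitions[index].append((key, value))
    return [sorted(p) for p in partitions]


def reduce_counts(sorted_partitions):
    """Reduce step: sum the values per key over the partitions of one reduce."""
    totals = defaultdict(int)
    for part in sorted_partitions:
        for key, value in part:
            totals[key] += value
    return dict(totals)


def run_wordcount(chunks, num_reduces):
    # One map task per chunk, each producing num_reduces sorted partitions.
    map_outputs = [partition(map_words(c), num_reduces) for c in chunks]
    result = {}
    for r in range(num_reduces):
        # Group partition r from every map task and hand it to reduce r.
        grouped = [out[r] for out in map_outputs]
        result.update(reduce_counts(grouped))
    return result
```

Calling run_wordcount(["a b a", "b c"], 2) runs two map tasks and two reduces. Because every occurrence of a word is routed to the same partition index, no key is split between reduces.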
Local MapReduce (LMR)
Data is moved into one centralized cluster, and all tasks are
executed locally on that cluster.
Distributed MapReduce (DMR)
Data is distributed across multiple clusters. An individual local
MapReduce job runs on each cluster, and the results are then combined
by a final MapReduce job. (HINT: To implement DMR, use SAGA BigJob to
launch LMRs on remote machines.)
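The two-level structure of DMR can be illustrated with a small word count sketch. It is only a model of the idea, under stated assumptions: local_wordcount stands in for an LMR job on one cluster, combine stands in for the final MapReduce job, and the dictionary keyed by cluster name is a hypothetical way of representing where the data lives. No BigJob or SAGA calls are made here.

```python
from collections import Counter


def local_wordcount(lines):
    """Stand-in for one LMR job: count words in the data held by one cluster."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def combine(partial_counts):
    """Stand-in for the final MapReduce job: merge the per-cluster counts."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts for matching keys
    return dict(total)


def distributed_wordcount(data_by_cluster):
    # Each cluster processes only its own data; only the (much smaller)
    # partial results are brought together for the final step.
    partials = [local_wordcount(lines) for lines in data_by_cluster.values()]
    return combine(partials)
```

In a real deployment the per-cluster step would be launched remotely, and only the partial results would need to move between machines.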
On FutureGrid, Globus is not present, so use a pbs-ssh:// URL to
launch BigJob remotely. To use this URL, the ssh adaptors need to be
installed.
Dependencies
SAGA BigJob based MapReduce depends on:
- SAGA (http://saga.cct.lsu.edu/)
- PBSPro/ssh adaptors (http://www.saga-project.org/download/adaptors)
- BigJob (http://faust.cct.lsu.edu/trac/BigJob/wiki)
Most of the dependencies are already available on FutureGrid; check
out Deployments.
MapReduce Installation
Obtain the MapReduce code from:
https://svn.cct.lsu.edu/repos/saga-projects/applications/MapReduce/branches/pyMapReduce2011
Files
MapReduce
- mapreduce.py - Python class that initiates BigJob and submits map & reduce tasks to it.
- mrfunctions.py - Class containing functions that chunk the data according to user requirements.
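As an illustration of the kind of chunking described for mrfunctions.py, the sketch below splits one input file into chunk files of roughly a requested size while keeping lines whole. The function name chunk_file, its parameters, and the ".chunkN" file naming are hypothetical and are not taken from the actual module.

```python
import os


def chunk_file(input_path, temp_dir, chunk_size):
    """Split input_path into chunk files in temp_dir.

    Each chunk holds whole lines and is closed once it reaches at least
    chunk_size bytes. Returns the list of chunk file paths in order.
    """
    os.makedirs(temp_dir, exist_ok=True)
    base = os.path.basename(input_path)
    chunk_paths = []
    buffer, buffered = [], 0

    def flush():
        # Write the buffered lines to the next chunk file, if there are any.
        nonlocal buffer, buffered
        if not buffer:
            return
        path = os.path.join(temp_dir, "%s.chunk%d" % (base, len(chunk_paths)))
        with open(path, "wb") as out:
            out.writelines(buffer)
        chunk_paths.append(path)
        buffer, buffered = [], 0

    with open(input_path, "rb") as src:
        for line in src:
            buffer.append(line)
            buffered += len(line)
            if buffered >= chunk_size:
                flush()
    flush()  # write any remaining lines as the last chunk
    return chunk_paths
```

Splitting on line boundaries means no word is cut in half between two map tasks, at the cost of chunks that are not exactly chunk_size bytes.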
MapReduce/applications/wordcount
Directory which contains the wordcount application.
- wordcount_wrapper.sh - wrapper shell script for the wordcount MapReduce application (wordcountApp.py).
- wordcountApp.py - MapReduce application Python script; uses the mapreduce.py & mrfunctions.py class functions.
- wordcount_map_partition.py - wordcount mapper script.
- wordcount_reduce.py - wordcount reduce script.
- wordcount_map_partition_comb.py - used only for distributed MapReduce; not required for local MapReduce.
LMR Execution
Create the input, temp and output directories.
Edit the inputs in the wordcount_wrapper.sh script as required.
Run wordcount_wrapper.sh, which launches the wordcountApp.py script with the required inputs.
Temporary & output files are created in the specified temp & output directories.
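The first step can also be scripted. The helper below is a small sketch, assuming the three directories live under one base directory; the helper name make_work_dirs and the directory names "input", "temp" and "output" are placeholders, and the paths actually used are whatever is configured in wordcount_wrapper.sh.

```python
import os


def make_work_dirs(base_dir, names=("input", "temp", "output")):
    """Create the working directories under base_dir and return their paths."""
    paths = {}
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths[name] = path
    return paths
```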
Wordcount application help
python wordcountApp.py --help
Contact
- saga-users@cct.lsu.edu - SAGA installation related problems.
- saga-users@cct.lsu.edu - MapReduce related problems.
- bigjob-users@cct.lsu.edu - BigJob installation problems.