Introduction

MapReduce is a programming pattern and a programming model, and the name is also used for implementations of that model. It has become an important approach for processing and generating large data sets. Here we will work with a SAGA-based implementation of MapReduce. The framework uses a pilot-job as the underlying execution environment, a feature unique to this particular implementation. SAGA-BigJob is used to submit the "Map" and "Reduce" tasks.

The framework processes multiple files in the input directory. The files are physically split into chunks in the temp directory, and a map task is submitted for each input chunk file. After a map task completes, partition files are created according to the number of reduces, sorted, and then 'moved' from the temp directory to the output directory. The sorted partition files that belong to the same reduce are grouped together and passed as arguments to the reduce job. When the reduce jobs complete, the output files are created in the output directory.
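The chunk, map, partition, sort, and reduce stages described above can be illustrated with a small in-memory word count. This is only a sketch of the general data flow, not the SAGA-BigJob implementation: it runs in a single process, keeps everything in lists instead of files, and all function names (chunk_lines, map_chunk, partition, reduce_partitions, word_count) are hypothetical. The use of a CRC32 hash to pick a partition is likewise an assumption made for the example.

```python
import zlib
from collections import defaultdict


def chunk_lines(lines, chunk_size):
    """Split the input into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map task: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def partition(pairs, num_reduces):
    """Create one sorted partition per reduce from a map task's output.

    Assumption: a stable CRC32 hash of the key selects the partition.
    """
    parts = [[] for _ in range(num_reduces)]
    for key, value in pairs:
        index = zlib.crc32(key.encode("utf-8")) % num_reduces
        parts[index].append((key, value))
    return [sorted(part) for part in parts]


def reduce_partitions(partitions):
    """Reduce task: sum the counts found in the partitions of one reduce."""
    counts = defaultdict(int)
    for part in partitions:
        for key, value in part:
            counts[key] += value
    return dict(counts)


def word_count(lines, chunk_size=2, num_reduces=2):
    """Run all stages in order and merge the per-reduce outputs."""
    map_outputs = [
        partition(map_chunk(chunk), num_reduces)
        for chunk in chunk_lines(lines, chunk_size)
    ]
    result = {}
    for r in range(num_reduces):
        # Group the partitions that belong to reduce r across all map tasks.
        grouped = [output[r] for output in map_outputs]
        result.update(reduce_partitions(grouped))
    return result
```

Because every occurrence of a word hashes to the same partition, each word is counted by exactly one reduce, so the per-reduce outputs can be merged without conflicts.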

Dependencies

SAGA BigJob based MapReduce is dependent on

MapReduce Installation

Obtain MapReduce code
https://svn.cct.lsu.edu/repos/saga-projects/applications/MapReduce/branches/pyMapReduce2011

Files

MapReduce

MapReduce/applications/wordcount

Directory that contains the wordcount application.

LMR Execution

  • Create the Input, temp, and Output directories.
  • Edit the inputs in the wordcount_wrapper.sh script as required.
  • Run wordcount_wrapper.sh, which launches the wordcountApp.py script with the required inputs.
  • Temporary and output files are created in the specified temp and output directories.
  • To see the wordcount application help:

    python wordcountApp.py --help
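The first step above, creating the three working directories, could be scripted along these lines. This is a minimal sketch: the helper name create_work_dirs is hypothetical, the directory names are taken from the step list, and a throwaway temporary directory stands in for whatever base location you actually use. The paths still have to be entered in wordcount_wrapper.sh by hand.

```python
import os
import tempfile


def create_work_dirs(base_dir, names=("Input", "temp", "Output")):
    """Create the working directories under base_dir and return their paths."""
    paths = {}
    for name in names:
        path = os.path.join(base_dir, name)
        # exist_ok=True makes the call safe to repeat on an existing layout.
        os.makedirs(path, exist_ok=True)
        paths[name] = path
    return paths


# Usage: a temporary base directory is an assumption made for this example.
base = tempfile.mkdtemp()
dirs = create_work_dirs(base)
```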
