Basic Usage
===========

This framework is designed to make it easy to implement parallel
computations on large datasets. This README describes how to set up and
run this MapReduce implementation.

First, create binary executables for every type of machine your maps and
reduces will run on (ex: arch = x86_32, x86_64; OS = Linux, Mac, Windows).
Next, create a simple XML file that describes your grid. Finally, call

    master/main --config /path/to/xml/config/file

1. Creating executable

   First you must include "MapReduceBase.hpp"
   *JHA: Include where?

   Then you implement
      MapReduce::MapReduceBase::map(std::string chunkName)
   and
      MapReduce::MapReduceBase::reduce(std::string key, std::vector<std::string> values)

   a. A MapReduce::MapReduceBase::emitIntermediate(std::string key, std::string value)
      function is provided; you must use it to create a key/value pair
   b. A simple Word Count example is shown below

      #include "MapReduceBase.hpp"
      #include <string>
      #include <vector>
      #include <boost/iostreams/stream.hpp>
      #include <boost/iostreams/device/file.hpp>
      #include <boost/lexical_cast.hpp>

      class MapReduceImpl : public MapReduce::MapReduceBase {
      public:
         MapReduceImpl(int argCount, char **argList)
            : MapReduce::MapReduceBase(argCount, argList) {}

         void map(std::string chunkName) {
            // boost::iostreams::stream is a way to create a simple stream.
            // The device type here is an assumption (a plain local file);
            // use the device that matches where your chunks are stored.
            boost::iostreams::stream<boost::iostreams::file_source> in(chunkName);
            std::string elem;
            while (in >> elem) {
               emitIntermediate(elem, "1");   // one count per word
            }
         }

         void reduce(std::string key, std::vector<std::string> values) {
            int result = 0;
            std::vector<std::string>::const_iterator valuesIT = values.begin();
            while (valuesIT != values.end()) {
               result += boost::lexical_cast<int>(*valuesIT);
               ++valuesIT;   // advance, otherwise the loop never ends
            }
            emit(key, boost::lexical_cast<std::string>(result));
         }
      };

   Now, all you have to do is define a main function which calls
   MapReduce::MapReduceBase::init(int argc, char **argv), and link this
   executable against MapReduceBase.o

   a. A simple example is shown below

      int main(int argc, char **argv) {
         MapReduceImpl app(argc, argv);
         app.run();
         return 0;
      }

2. Configuring XML file

   The basic structure will look like this

   a. - Describes all of the information needed for one MapReduce session
      i. Attributes
         *name         - A name to identify the session by [not required]
         *version      - A version of your session [not required]
         *user         - The user creating the session [not required]
         *priority     - The level of importance of the session [not required]
         *experimentID - Another ID to recognize the session [not required]
         *eventLevel   - A way to label its stability ex: DEBUG/STABLE [not required]
   b. - Lists where the host is located (only one host is allowed)
      i. Children
         - A URL to the path of the host directory
   c. - Lists the machines you want to run on
      i. Children
         - Describes each individual host's type
           i. Attributes
              *arch - Architecture type ex: x86_32 [required]
              *OS   - Operating System ex: Linux [required]
   d. - A list of executables for different platforms/architectures
      i. Children
         - A complete path to an executable
           i. Attributes
              *arch      - Architecture type ex: x86_32 [required]
              *OS        - Operating System ex: Linux [required]
              *extraArgs - **Not Implemented yet** [not required]
   e. - A list of files that you will map/reduce
      i. Children
         - A complete path to the location of the file

3. Running master/main

   From a command prompt in the
   $SAGA_LOCATION/saga-projects/applications/MapReduce/source/master
   directory, type:

      $ ./main --config /path/to/config/file
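
To try out the map/reduce contract from section 1 before setting up a grid, the following is a minimal standalone sketch of the same word count flow. It does not use MapReduceBase at all: WordCountSketch and runLocally are hypothetical names introduced only for illustration, the in-memory emitIntermediate and emit are stand-ins for the framework's functions, and each chunk is passed as a string of text rather than as a chunk name.

```cpp
// Standalone sketch of the map -> group by key -> reduce flow for word count.
// WordCountSketch and runLocally are hypothetical; they are NOT framework APIs.
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <vector>

class WordCountSketch {
public:
    // Map step: split one chunk of text into words and emit (word, "1").
    void map(const std::string& chunkText) {
        std::istringstream in(chunkText);
        std::string word;
        while (in >> word) {
            emitIntermediate(word, "1");
        }
    }

    // Reduce step: sum the counts collected for one key.
    void reduce(const std::string& key, const std::vector<std::string>& values) {
        assert(!values.empty());  // a key only exists if it was emitted at least once
        int total = 0;
        for (const std::string& v : values) {
            total += std::stoi(v);
        }
        emit(key, std::to_string(total));
    }

    // Stand-in for the grid: map every chunk, then reduce every distinct key.
    std::map<std::string, std::string> runLocally(const std::vector<std::string>& chunks) {
        intermediate_.clear();
        results_.clear();
        for (const std::string& chunk : chunks) {
            map(chunk);
        }
        for (const auto& entry : intermediate_) {
            reduce(entry.first, entry.second);
        }
        return results_;
    }

private:
    // In-memory stand-in: group intermediate values by key.
    void emitIntermediate(const std::string& key, const std::string& value) {
        intermediate_[key].push_back(value);
    }

    // In-memory stand-in: record the final value for a key.
    void emit(const std::string& key, const std::string& value) {
        results_[key] = value;
    }

    std::map<std::string, std::vector<std::string>> intermediate_;
    std::map<std::string, std::string> results_;
};
```

Calling runLocally with the two chunks "the cat the dog" and "dog the" returns the counts the = 3, dog = 2, cat = 1. Grouping the intermediate values by key before reducing is what lets each reduce call see every value emitted for its key, regardless of which chunk produced it.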