Next: Parameters Up: Providing Your own Checkpointing/Recovery Previous: Adding a Checkpointing Method   Contents

Adding a Recovery Method

Recovering from a checkpoint is a two-step operation:

  1. Right after reading the parameter file and finding out if recovery was requested, all the parameters are restored from the checkpoint file (overwriting any previous settings for non-steerable parameters).
  2. After the flesh has created the grid hierarchy with all the grid variables it contains, and the driver has set up storage for them, their contents are restored from the checkpoint (overwriting any previously initialized contents).

The flesh provides the special time bins RECOVER_PARAMETERS and RECOVER_VARIABLES for these two steps (see also the chapter on Adding a Checkpointing/Recovery Method in the Infrastructure Thorn Writer's Guide as part of the Cactus User's Guide).

Thorn IOUtil evaluates the recovery parameters (determines the recovery mode to use, constructs the name(s) of the recovery file(s), etc.). It also controls the recovery process by invoking the recovery methods of other I/O thorns, one after another, until one method succeeds.

A recovery method must provide a routine with the following prototype:

  int Recover (cGH *GH, const char *basefilename, int called_from);

This routine will be invoked by IOUtil with the following arguments:

  • a GH pointer referring to the grid hierarchy and its grid variables
    Note that this can also be a NULL pointer if the routine was called at CP_RECOVER_PARAMETERS when no grid hierarchy exists yet.

  • the basename of the checkpoint file to recover from
    This name is constructed by IOUtil from the settings of IO::recover_dir and IO::recover_file. It will also include an iteration number if one of the auto recovery modes is used. The recovery method must append the filename extension itself; this extension is what distinguishes the checkpoint files of different recovery methods.

  • a flag identifying when the routine was called
    This can be one of the keywords CP_INITIAL_DATA, CP_EVOLUTION_DATA, CP_RECOVER_PARAMETERS, CP_RECOVER_DATA, or FILEREADER_DATA. The flag tells the routine what to do when IOUtil calls it. Note that IOUtil assumes the recovery method can also serve as a filereader routine, which is essentially the same as recovering individual grid variables from data files.
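
As an illustration of how a routine with this prototype might act on its three arguments, here is a minimal sketch in C. It is not taken from any existing thorn: the name MyMethod_Recover and the extension ".mychk" are hypothetical, the cGH type and the flag values are placeholders so the sketch compiles outside Cactus, and the return convention (positive on success, negative on failure) is assumed from the description of how IOUtil detects success.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-ins so the sketch compiles outside Cactus. In a real thorn, cGH and
   the flags come from the Cactus/IOUtil headers; the values below are
   arbitrary placeholders. */
typedef struct { int placeholder; } cGH;
enum { CP_INITIAL_DATA, CP_EVOLUTION_DATA, CP_RECOVER_PARAMETERS,
       CP_RECOVER_DATA, FILEREADER_DATA };

/* Hypothetical extension for this method's checkpoint files. */
#define MYMETHOD_EXTENSION ".mychk"

/* Hypothetical recovery routine following the prototype above.
   Assumed convention: > 0 on success, < 0 on failure. */
int MyMethod_Recover (cGH *GH, const char *basefilename, int called_from)
{
  char filename[1024];
  int len;

  assert (basefilename != NULL);

  /* IOUtil passes only the basename; the method appends its own extension. */
  len = snprintf (filename, sizeof filename, "%s%s",
                  basefilename, MYMETHOD_EXTENSION);
  if (len < 0 || (size_t) len >= sizeof filename)
  {
    return -1;                       /* filename would not fit */
  }

  switch (called_from)
  {
    case CP_RECOVER_PARAMETERS:
      /* GH may be NULL here: no grid hierarchy exists yet.
         A real method would read the parameters from 'filename'. */
      return 1;

    case CP_RECOVER_DATA:
    case FILEREADER_DATA:
      /* Restoring grid variables needs a grid hierarchy. */
      if (GH == NULL)
      {
        return -1;
      }
      /* A real method would read the variables from 'filename'. */
      return 1;

    default:
      /* This sketch treats every other flag as unsupported. */
      return -1;
  }
}
```

The point of the switch is that the same routine serves both recovery steps and the filereader case, so it has to check the flag before touching the GH pointer.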

To perform the first step of the recovery process, a recovery method must register a routine with the flesh's scheduler at the RECOVER_PARAMETERS time bin. This routine itself should call the IOUtil API

  int IOUtil_RecoverParameters (int (*recover_fn) (cGH *GH,
                                                   const char *basefilename,
                                                   int called_from),
                                const char *file_extension,
                                const char *file_type);

which will determine the recovery filename(s) and in turn invoke the actual recovery method's routine as a callback function. The arguments to pass to this routine are

  • a function pointer recover_fn as a reference to the recovery method's actual recovery routine (as described above)

  • file_extension - the filename extension for recovery files which are accepted by this recovery method
    When IOUtil constructs the recovery filename and searches for potential recovery files (in the auto recovery modes), it will only match filenames that consist of the basename given in the IO::recover_file parameter followed by file_extension.

  • file_type - the type of checkpoint files which are accepted by this recovery method
    This is just a descriptive string used in the informational output printed during recovery.

The routine registered at RECOVER_PARAMETERS should return a negative value to the scheduler if parameter recovery failed for this recovery method (e.g. if no appropriate recovery file was found). The scheduler then continues with the next recovery method until one succeeds (returns a positive value). If none of the available recovery methods succeeds, the flesh stops the run.

A value of zero should be returned to the scheduler to indicate that no recovery was requested.
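
The scheduled routine and its three possible outcomes can be sketched as follows. This is only an illustration under stated assumptions: MyMethod_RecoverParameters, MyMethod_Recover, ".mychk" and "MyMethod" are hypothetical names, and IOUtil_RecoverParameters is replaced by a mock so the sketch builds outside Cactus. The mock uses a fixed basename and simply forwards the callback's result, which is not how the real IOUtil routine is implemented. The sketch also assumes that the return value of IOUtil_RecoverParameters can be handed to the scheduler unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-ins so the sketch builds outside Cactus (placeholder values). */
typedef struct { int placeholder; } cGH;
enum { CP_RECOVER_PARAMETERS = 2 };

/* Used only by the mock below: was recovery requested at all? */
static int mock_recovery_requested = 1;

/* Mock with the prototype shown above. The real IOUtil routine also works
   out the recovery filename(s); this one uses a fixed basename. */
int IOUtil_RecoverParameters (int (*recover_fn) (cGH *GH,
                                                 const char *basefilename,
                                                 int called_from),
                              const char *file_extension,
                              const char *file_type)
{
  assert (recover_fn != NULL && file_extension != NULL && file_type != NULL);

  if (! mock_recovery_requested)
  {
    return 0;                        /* zero: no recovery requested */
  }
  printf ("Trying %s recovery (files ending in '%s')\n",
          file_type, file_extension);

  /* No grid hierarchy exists yet, so the GH pointer is NULL. */
  return recover_fn (NULL, "checkpoint", CP_RECOVER_PARAMETERS);
}

/* Hypothetical recovery routine; succeeds for parameter recovery only. */
static int MyMethod_Recover (cGH *GH, const char *basefilename,
                             int called_from)
{
  (void) GH;
  (void) basefilename;
  return called_from == CP_RECOVER_PARAMETERS ? 1 : -1;
}

/* Routine the thorn would schedule at RECOVER_PARAMETERS:
   negative = failed, zero = not requested, positive = recovered. */
int MyMethod_RecoverParameters (void)
{
  return IOUtil_RecoverParameters (MyMethod_Recover, ".mychk", "MyMethod");
}
```

Keeping the scheduled routine to a single call leaves filename construction and the search for recovery files entirely to IOUtil.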

The second step of recovery, restoring the contents of grid variables from the recovery file, is invoked by thorn IOUtil, which registers a routine at RECOVER_VARIABLES. This routine calls all recovery methods that were previously registered with IOUtil via the API

  int IOUtil_RegisterRecover (const char *name,
                              int (*recover_fn) (cGH *GH,
                                                 const char *basefilename,
                                                 int called_from));

With this registration, each recovery method's actual recovery routine is made known to IOUtil, along with a descriptive name under which it is registered.

At RECOVER_VARIABLES thorn IOUtil will loop over all available recovery routines (passing the same arguments as for parameter recovery) until one succeeds (returns a positive value).
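
The register-then-loop behaviour can be modelled with a small toy registry. This is not IOUtil's implementation: Toy_RegisterRecover and Toy_RecoverVariables are hypothetical counterparts of IOUtil_RegisterRecover and IOUtil's RECOVER_VARIABLES routine, the two recovery methods are invented for the example, and the cGH type and flag value are placeholders.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds outside Cactus (placeholder values). */
typedef struct { int placeholder; } cGH;
enum { CP_RECOVER_DATA = 3 };

typedef int (*recover_fn_t) (cGH *GH, const char *basefilename,
                             int called_from);

#define MAX_METHODS 8

static struct { const char *name; recover_fn_t fn; } registry[MAX_METHODS];
static int num_methods = 0;

/* Toy counterpart of IOUtil_RegisterRecover: remember the routine under a
   descriptive name. Returns the slot index, or -1 on error. */
int Toy_RegisterRecover (const char *name, recover_fn_t fn)
{
  if (name == NULL || fn == NULL || num_methods >= MAX_METHODS)
  {
    return -1;
  }
  registry[num_methods].name = name;
  registry[num_methods].fn   = fn;
  return num_methods++;
}

/* Toy counterpart of the RECOVER_VARIABLES loop: try each registered
   routine in turn and stop at the first one that returns a positive value.
   Returns the index of that method, or -1 if none succeeded. */
int Toy_RecoverVariables (cGH *GH, const char *basefilename)
{
  int i;

  assert (GH != NULL);               /* the grid hierarchy exists by now */
  for (i = 0; i < num_methods; i++)
  {
    if (registry[i].fn (GH, basefilename, CP_RECOVER_DATA) > 0)
    {
      return i;
    }
  }
  return -1;
}

/* Two hypothetical methods: A finds no matching file, B succeeds. */
int MethodA_Recover (cGH *GH, const char *basefilename, int called_from)
{
  (void) GH; (void) basefilename; (void) called_from;
  return -1;
}

int MethodB_Recover (cGH *GH, const char *basefilename, int called_from)
{
  (void) GH; (void) basefilename; (void) called_from;
  return 1;
}
```

With both methods registered in that order, the loop calls MethodA_Recover first, sees the negative return value, and moves on to MethodB_Recover.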

