Idea:

Do not submit simulations directly into queues.  Instead, introduce
two abstractions: one which handles PBS jobs ("jobs"), and one which
handles the simulation restarts which should be executed.


Jobs:

A job is approximately what PBS calls a job.  (It is also called a
"job shell".)  Jobs can be created, which means that a PBS job is
submitted.  They can be deleted, which means that the currently
running simulation is terminated.

Jobs do not execute simulations.  Jobs execute restarts.  When one
restart is finished, the job looks for another restart.

This makes the following scenarios possible:

- A job runs a simulation.  The user interrupts the simulation, and
  the job switches over to a debugging run.  When the debugging run is
  finished, the job continues with the original simulation.

- A job ends before a simulation is finished.  Another job (which
  may have waited for the first to finish) takes over the simulation.

- In principle, two jobs can work on the same restart, or a job can be
  broken into two restarts.
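
The first scenario can be sketched as a loop in which the job keeps
taking restarts from a list until none are left.  This is only an
illustration: the Job class, add_restart, run, and the "urgent" flag
are hypothetical names, the list is held in memory instead of in the
job directory, and an interruption is modelled simply by putting the
debugging restart at the front of the list.

```python
from collections import deque


class Job:
    """Hypothetical in-memory model of a job (job shell)."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()  # restarts this job should still execute

    def add_restart(self, restart, urgent=False):
        # An urgent restart (e.g. a debugging run) is executed next;
        # all other restarts are appended at the end.
        if urgent:
            self.pending.appendleft(restart)
        else:
            self.pending.append(restart)

    def run(self, execute):
        # Jobs execute restarts: when one is finished, look for another.
        while self.pending:
            restart = self.pending.popleft()
            execute(restart)


job = Job("job-1")
job.add_restart("sim-A/restart-0000")
job.add_restart("debug/restart-0000", urgent=True)
job.run(print)
```

Here the debugging restart runs first, and the job then continues
with the restart of the original simulation.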


Implementation:

Jobs: Each job has a directory, containing state information.  The
directory contains the PBS job id.  It also contains a list of tasks
which the job should execute.

Restarts: Each restart directory contains state information.  The
directory contains the set of jobs which should execute this restart,
and the job which is actually executing the restart.  Each restart can
only be executed once, by one job.
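
One possible way to enforce "executed once, by one job" with plain
directories is to let a job claim a restart by creating a file
exclusively.  The sketch below assumes a layout that is not specified
above: the file names "candidate-jobs" and "active-job" and the
functions create_restart and claim_restart are made up for
illustration.  Whether exclusive creation is reliable on a given
cluster's shared file system would have to be checked.

```python
import os
import tempfile


def create_restart(restart_dir, job_ids):
    """Create a restart directory listing the jobs which may execute it."""
    os.makedirs(restart_dir)
    with open(os.path.join(restart_dir, "candidate-jobs"), "w") as f:
        f.write("\n".join(job_ids) + "\n")


def claim_restart(restart_dir, job_id):
    """Try to claim the restart for job_id.  Return True on success."""
    with open(os.path.join(restart_dir, "candidate-jobs")) as f:
        if job_id not in f.read().split():
            return False  # this job is not supposed to run the restart
    owner = os.path.join(restart_dir, "active-job")
    try:
        # O_EXCL makes the call fail if the file already exists, so
        # only the first job to get here becomes the executing job.
        fd = os.open(owner, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(job_id + "\n")
    return True


base = tempfile.mkdtemp()
restart = os.path.join(base, "restart-0000")
create_restart(restart, ["job-1", "job-2"])
first = claim_restart(restart, "job-1")   # job-1 gets the restart
second = claim_restart(restart, "job-2")  # job-2 finds it taken
```

The same claim step would also cover the case where several jobs
wait for one simulation and the first job to start takes it.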


Commands:

create:
        create-simulation
submit/restart:
        create-restart
        create-job
        (add restart to job)
delete:
        remove restart from job
delete:
        delete-job
cleanup:
        cleanup-restart


Open questions:

- Do we allow empty jobs?  Should they stay around, waiting for a
  restart, or should they exit?  Maybe there should be a timeout.

- Do we allow empty restarts?  If they are created asynchronously,
  either empty restarts or empty jobs are necessary.  Why not allow
  both?
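
If empty jobs are allowed and a timeout is used, the waiting could
look roughly like the sketch below.  It does not settle the question;
it only shows one option.  The function wait_for_restart and its
parameters are hypothetical, and find_restart stands for whatever
mechanism a job uses to look for its next restart.

```python
import time


def wait_for_restart(find_restart, timeout, poll_interval=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll find_restart() until it returns a restart or time runs out.

    Returns the restart, or None if the timeout expired, in which
    case the empty job would exit.
    """
    deadline = clock() + timeout
    while True:
        restart = find_restart()
        if restart is not None:
            return restart
        if clock() >= deadline:
            return None
        sleep(poll_interval)


# An empty job that never receives a restart gives up after the timeout.
result = wait_for_restart(lambda: None, timeout=0.05, poll_interval=0.01)
```

The clock and sleep parameters exist only so that the waiting can be
tested without real delays.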


Further ideas:

- It would be nice if people could submit restarts to each other's
  jobs.  (This could also overcome queue limitations.)

- Jobs could depend on each other, so that they are automatically
  chained.

- Jobs could be waiting "cold", i.e. submitted in advance without a
  restart, and could exit quickly if they are not needed.  There
  could be a mechanism that automatically submits cold jobs.

- Multiple jobs could be waiting for the same simulation, e.g. in
  different queues, or with different numbers of nodes.  The first job
  to start gets the simulation.