Idea:

Do not submit simulations directly into queues. Instead, introduce one abstraction which handles PBS jobs ("jobs"), and another which handles the simulation restarts that should be executed ("restarts").

Jobs:

A job is approximately what PBS calls a job. (It is also called "job shell".) Jobs can be created, which means that a PBS job is submitted. They can be deleted, which means that the currently running simulation is terminated.

Jobs do not execute simulations. Jobs execute restarts. When one restart is finished, a job will look for another restart. This makes the following scenarios possible:

- A job runs a simulation. The user interrupts the simulation, and the job switches over to a debugging run. When the debugging run is finished, the job continues with the original simulation.
- A job ends before a simulation is finished. Another job (which may have waited for the first to finish) takes over the simulation.
- In principle, two jobs can work on the same restart, or a job can be broken into two restarts.

Implementation:

Jobs:

Each job has a directory containing state information. The directory contains the PBS job id. It also contains a list of restarts which the job should execute.

Restarts:

Each restart directory contains state information. The directory contains the set of jobs which should execute this restart, and the job which is actually executing the restart. Each restart can only be executed once, by one job.

Commands:

create:          create-simulation
submit/restart:  create-restart
                 create-job
                 (add restart to job)
delete:          remove restart from job
delete:          delete-job
cleanup:         cleanup-restart

Open questions:

- Do we allow empty jobs? Should they stay around, waiting for a restart, or should they exit? Maybe there should be a timeout.
- Do we allow empty restarts? If they are created asynchronously, either empty restarts or empty jobs are necessary. Why not allow both?

Further ideas:

- It would be nice if people could submit restarts to each other's jobs. (This could also overcome queue limitations.)
- Jobs could depend on each other, so that they are automatically chained.
- Jobs could be waiting cold, and if not needed, could quickly exit. There could be a mechanism that automatically submits cold jobs.
- Multiple jobs could be waiting for the same simulation, e.g. in different queues, or with different numbers of nodes. The first job to start gets the simulation.
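
Sketch:

The directory layout described under "Implementation" could look roughly like the following Python sketch. It is only an illustration under assumptions, not a settled design. The file names (pbs-id, restarts, candidates, owner) and all function names are made up for this sketch. Claiming a restart by exclusively creating an "owner" file is one possible way to get "each restart can only be executed once, by one job" and "the first job to start gets the simulation". Whether exclusive creation is reliable on the cluster's shared file system would have to be checked. The idle timeout in run_job is one possible answer to the open question about empty jobs.

```python
import os
import tempfile
import time
from pathlib import Path

# All file and function names here are hypothetical, chosen for illustration.

def create_job(root: Path, name: str, pbs_id: str) -> Path:
    """Create a job directory holding the PBS job id and an empty restart list."""
    job = root / "jobs" / name
    job.mkdir(parents=True)
    (job / "pbs-id").write_text(pbs_id + "\n")
    (job / "restarts").write_text("")
    return job

def create_restart(root: Path, name: str) -> Path:
    """Create a restart directory with an empty set of candidate jobs."""
    restart = root / "restarts" / name
    restart.mkdir(parents=True)
    (restart / "candidates").write_text("")
    return restart

def add_restart_to_job(job: Path, restart: Path) -> None:
    """Record the link on both sides: the job's list and the restart's job set."""
    with open(job / "restarts", "a") as f:
        f.write(str(restart) + "\n")
    with open(restart / "candidates", "a") as f:
        f.write(job.name + "\n")

def try_claim(job: Path, restart: Path) -> bool:
    """Mark the restart as owned by this job; return False if already claimed."""
    try:
        # O_EXCL makes the creation fail if the owner file already exists.
        fd = os.open(restart / "owner", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(job.name + "\n")
    return True

def next_restart(job: Path):
    """Return the first restart in the job's list that this job can claim."""
    for line in (job / "restarts").read_text().splitlines():
        restart = Path(line)
        if try_claim(job, restart):
            return restart
    return None

def run_job(job: Path, execute, idle_timeout: float = 0.0, poll: float = 1.0):
    """Execute restarts until none could be claimed for idle_timeout seconds."""
    idle_since = time.monotonic()
    done = []
    while True:
        restart = next_restart(job)
        if restart is not None:
            execute(restart)
            done.append(restart)
            idle_since = time.monotonic()
        elif time.monotonic() - idle_since >= idle_timeout:
            return done
        else:
            time.sleep(poll)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        job = create_job(root, "job-a", "1001.example")
        add_restart_to_job(job, create_restart(root, "sim1-restart0"))
        run_job(job, lambda restart: print("executing", restart.name))
```

In this sketch an empty job with idle_timeout=0 exits immediately, while a larger value lets it stay around waiting for restarts that are added later. A finished restart keeps its owner file, so no other job picks it up again.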