Troubleshooting

How to solve your problem

1) Find out what aborted (model or script?)
2) Find out why your model / script aborted
3) Restart your model / script



Find out what aborted (model / script)

If the model or a script got canceled (e.g. because it ran out of time) while running on the machine, you should receive an email from the scheduler (LoadLeveler under AIX).

You can also check your delayed_jobs directory (e.g. ${HOME}/delayed_jobs/machine). There you will find all post-processing jobs that were created; they may be running, have aborted, or never have started.

And of course, you should always check the listings on ${mach}, ${xfer}, and ${lehost} to see which job / script aborted. If you received an email from LoadLeveler, the cancel time written in the email should roughly match the time the listing was written.

The model aborted / got canceled

Check the end of your model listing. There you will see if it aborted.

Resubmit the model / clone.

The post processing aborted / got canceled

If you did not receive the full output including the listings but the next model job got started, your post processing aborted.



Frequent reasons why models / scripts abort

a) Disk quota exceeded

Check your quota on the machine on which the job aborted.
You can check your user quota with "quota -v" (e.g. on pollux and Linux) or "mmlsquota" (on AIX), and the available disk space with "df -k" (e.g. on Linux). If your quota is exceeded, you will have to make room and then restart the aborted model / scripts.
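A quick way to run both checks is sketched below; the commands themselves are standard, but which of them exists depends on the machine:

```shell
# Disk space on the file system holding the current directory:
df -k .

# User quota (Linux; prints nothing if no quotas are configured,
# and is skipped quietly if the command does not exist):
quota -v 2>/dev/null || true

# User quota on AIX / GPFS (uncomment on those systems):
# mmlsquota
```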

b) Time limit exceeded

On AIX you will receive an email from LoadLeveler telling you "Hard WALL CLOCK limit exceeded". On other machines the listings will just be "cut off".

If your model (or entry) ran out of time, you can ask for more time by increasing the value of 'BACKEND_time_mod' (or 'BACKEND_time_ntr') in your 'configexp.dot.cfg'.
If this is not possible for the model job (on certain machines a maximum time limit exists), you can:
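For illustration, the corresponding lines in 'configexp.dot.cfg' might look like the following; the values are made up, and the exact units and admissible limits depend on your system:

```shell
# configexp.dot.cfg -- illustrative values only
BACKEND_time_mod=10800   # wall-clock time limit for the model job
BACKEND_time_ntr=3600    # wall-clock time limit for the entry job
```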

If the post processing ran out of time you can:


Continuing / Resubmitting the model

Resubmitting the model / clone

You can resubmit a model or clone that aborted without rerunning the entry.
In your HOME you will find a job called *${exp}_M*.
Simply resubmit this job with:
  r.qsub_clone ${HOME}/jobname

What is a clone?
When your model needs more than 3 wall-clock hours to run, you cannot run it in one shot on AIX but have to run it in smaller 'chunks'. These 'chunks' are called clones. The number of time steps per clone can be set with 'Step_rsti' in your 'gemclim_settings.nml', or the number of days per clone with 'climat_rsti' in your 'configexp.dot.cfg' (the latter overrides your 'Step_rsti').
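For example, the 'configexp.dot.cfg' entry might look like this (the value is illustrative; in 'gemclim_settings.nml' you would instead set 'Step_rsti' to a number of time steps per clone):

```shell
# configexp.dot.cfg: days per clone (overrides Step_rsti when set)
climat_rsti=31
```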

Continuing the run from a restart file

You need to have an appropriate set of restart files to continue from, in an ${OLD_EXECDIR} directory on the machine.
If the restarts have already been archived and moved, you need to copy them back to the machine on which you want to run the model, into the directory ~/MODEL_EXEC_RUN/${mach}. The restarts get saved on ${arch_mach} in ${archdir} (as specified in your 'configexp.dot.cfg'). They are gzipped cmc-archives with the name:

   ${exp}step#.ca.gz

After having copied them back, you need to 'gunzip' and unarchive them. You will then get the directory ${OLD_EXECDIR} back.
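The restore step might look like the sketch below. The paths follow the text above; the 'cmcarc' extraction options are an assumption, so check 'cmcarc -h' on your system before relying on them:

```shell
# Sketch: restore an archived restart file (names as in the text above)
cd ~/MODEL_EXEC_RUN/${mach}
gunzip ${exp}step#.ca.gz        # yields ${exp}step#.ca
cmcarc -f ${exp}step#.ca -x     # unpack; recreates ${OLD_EXECDIR}
```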

Then, on the machine from which you originally launched this experiment, go into the directory from which you started it with 'Um_lance'. You will again use 'Um_lance', except that you now have to add several parameters to your call:

Um_lance . -UM_EXEC_exp ${new_exp} -CLIMAT_continue ${old_exp} -CLIMAT_step_total ${step_total} -CLIMAT_stepout ${old_last_step} -UM_EXEC_r_ent ${r_ent} -CLIMAT_interval ${interval} -restart 1

${new_exp}       :  Experiment you want to start
${old_exp}       :  Experiment you want to continue / start from
${step_total}    :  Last time step of the experiment you want to start
${old_last_step} :  Last time step of the experiment you continue / start from
${r_ent}         :  '1' for LAM grids, '0' for global grids
${interval}      :  'CLIMAT_interval' as defined in your 'configexp.dot.cfg'
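A hypothetical invocation, with all experiment names and numbers made up purely for illustration, could look like:

```shell
# Continue experiment 'abc' under the new name 'abcd' (illustrative values)
Um_lance . -UM_EXEC_exp abcd -CLIMAT_continue abc \
           -CLIMAT_step_total 17520 -CLIMAT_stepout 8760 \
           -UM_EXEC_r_ent 1 -CLIMAT_interval 12 -restart 1
```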

If the job ${new_exp} has already been launched automatically before, the command to restart it can be found in the file:

   ~/Climat_log/${old_exp}.log



Author: Katja Winger
Last update: January 2010