Troubleshooting

How to solve your problem

1) Find out what aborted (model? script?)
2) Find out why your model/script aborted
3) Fix the problem
4) Restart your model / script


Hints on how to read this web page

Find out what aborted ( model / script )

If the model or a script got canceled (e.g. because it ran out of time) while running on machine, you should receive an email from the scheduler (LoadLeveler under AIX).

You can also check your delayed_jobs directory (e.g. ${HOME}/delayed_jobs/machine). There you will find all of the post-processing jobs that were created and are running, aborted, or never got started.

And of course, you should always check the listings on ${mach}, ${xfer}, ${lehost} and ${arch_mach} (on ${arch_mach} only if you save nesting information) to see which job / script aborted. If you received an email from LoadLeveler, the cancel time given in the email should roughly match the time the listing was written.

The model aborted / got canceled

Check the end of your model listing. There you will see if it aborted.

Resubmit the model / clone.

The rassemble-job aborted / got canceled

If the rassemble-job aborted / got canceled, you will find in your post-processing directory a directory called 'WORKING_ON_last_step_###_files' containing dm-, dp-, and pm-files, possibly also md- and pr-files, and probably 1 to 8 working directories. You need to erase these working directories in order to get the diagnostics started! This post-processing directory should also contain a job called 'rassemble_WORKING_ON_last_step_###_files'.

Don't erase any files! You simply have to resubmit the job.

The diagnostic-job aborted / got canceled

If the diagnostic-job aborted / got canceled, you will already have a few diagnostic files in your post-processing directory and a listing named 'diag_${exp}*' in your listings directory on ${xfer}.

Don't erase anything! You simply have to resubmit the diagnostics.



Frequent reasons why models / scripts abort

a) Disk quota exceeded

Check your quota on the machine on which the job aborted. You can check your quota with "quota -v" (e.g. on pollux and Linux), "df -k" (e.g. on Linux), or "mmlsquota" (on machine). If your quota is exceeded, you will have to make room and then restart the aborted model / scripts.
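Which of these commands works depends on the file system, so a sketch that probes for the system-specific tools before calling them can look like this ('command -v' guards against tools that are not installed):

```shell
# Report free space on the file system holding $HOME (works everywhere):
df -k "${HOME}"
# Then run whichever quota tool this machine actually has:
if command -v quota     >/dev/null 2>&1; then quota -v  || true; fi
if command -v mmlsquota >/dev/null 2>&1; then mmlsquota || true; fi   # GPFS quota (AIX)
```

The '|| true' keeps the snippet from failing on machines where the quota tool exists but no quota is set for your account.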

b) Time limit exceeded

You will receive an email from LoadLeveler telling you "Hard WALL CLOCK limit exceeded".

This usually happens either with the rassemble- or the diag-job, because you did not ask for enough time or because of memory management problems on machine. Sometimes you get charged for much more time than you actually used, so that your job runs out of time without having done many calculations. Check your listings directory on ${xfer} to see what aborted. The cancel time given in the email should roughly match the time the listing was written.
See also the section 'Find out what aborted' above.

Fix the problem, and then restart the aborted model / scripts.

c) Diagnostics never got started

Check whether you have a file called '${exp}_WORKING_ON_last_step_###_files' in your ${HOME}/.WORKING_ON directory. If there is such a file, check whether all md- and pr-files for this time step are in your post-processing directory. If they are, the file in your .WORKING_ON directory may be a left-over from a previous experiment that did not finish, or the corresponding post-processing job aborted and got restarted, leaving some old working directories in the WORKING_ON-directory of your post-processing directory. In that case, erase the left-over working directories as well as the file '${exp}_WORKING_ON_last_step_###_files' in your ${HOME}/.WORKING_ON directory, and submit the diagnostic job. If md- and pr-files are missing, you will have to resubmit the corresponding rassemble-job.
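The cleanup described above can be sketched as follows. The directories and the step number '000072' are mocked up here so the snippet is self-contained; in real use, substitute your post-processing directory, your ${HOME}/.WORKING_ON directory, and the actual step number:

```shell
# Mock post-processing and .WORKING_ON directories (made-up names):
PP=$(mktemp -d); WO=$(mktemp -d)
mkdir -p "$PP/WORKING_ON_last_step_000072_files/work_1" \
         "$PP/WORKING_ON_last_step_000072_files/work_2"
touch "$PP/WORKING_ON_last_step_000072_files/dm000072"    # data file: must survive
touch "$WO/abc_WORKING_ON_last_step_000072_files"         # left-over marker file

# Erase only the left-over working directories and the marker file,
# never the dm-/dp-/pm-/md-/pr-files themselves:
rm -r "$PP"/WORKING_ON_last_step_000072_files/work_*
rm    "$WO"/abc_WORKING_ON_last_step_000072_files
```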


Fixing the problem


Making room

If your disk quota is exceeded, you have to make room. In the following, ${suffix} can be "_hi", "_lo" or "", depending on whether one of the variables ${splitout} or ${window} is defined to be non-zero in your 'configexp.dot.cfg'.

Quota exceeded on ${arch_mach}

If your quota on the machine on which you archive the files is exceeded, save your files somewhere else. In this case, file transfers from ${xfer} to ${arch_mach} were stopped and some of them may still be found in your post-processing directory in a sub-directory named archives_${exp}${suffix}. You will have to transfer the files in this sub-directory by hand.
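Moving the stranded files by hand can look like the sketch below. The directories are mocked up so the snippet is self-contained; in real use you would cd into archives_${exp}${suffix} and 'scp' the files to a machine with free space:

```shell
SRC=$(mktemp -d)    # stands in for archives_${exp}${suffix} on ${xfer}
DST=$(mktemp -d)    # stands in for a directory on a machine with free space
touch "$SRC/abc_200001.ca" "$SRC/abc_200002.ca"   # made-up archive file names

cp "$SRC"/*.ca "$DST"/    # real case: scp "$SRC"/*.ca somehost:/archive/path/
rm "$SRC"/*.ca            # free the space on ${xfer} once the copy succeeded
```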

Quota exceeded on post-processing machine (${xfer})
Go into your post-processing directories.


Hard WALL CLOCK limit exceeded

Check in the email you got from LoadLeveler whether you really ran for the whole time you have been charged for: compare the "Real time" with the "Total Job User Time". If the "Real time" is much bigger than the "Total Job User Time", there is not much you can do but resubmit the canceled job. If your job really ran almost the whole time (and your jobs often get canceled this way), you can do the following, depending on which job got canceled:

The rassemble-job got canceled

The diagnostics got canceled



Continuing / Resubmitting the model

Continuing the run from a restart file

You need to have an appropriate set of restart files to continue from on machine, in an ${OLD_EXECDIR}/process directory.
If the restarts already got archived and moved, you need to copy them back to the machine on which you want to run the model, into the directory ~/gemclim/${mach}. The restarts are saved on ${arch_mach} in ${archdir} (as specified in your 'configexp.dot.cfg'). They are gzipped cmc-archives with the name:

   ${exp}step#.ca.gz

After having copied them back, you need to 'gunzip' and unarchive them. You will then get the directory ${OLD_EXECDIR} back.
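The unpacking step can be sketched like this. The restart file is faked here (with a made-up name) so the snippet is self-contained, and the final unarchiving call with the CMC archiver is left as a comment since its exact syntax on your machine is an assumption:

```shell
WORK=$(mktemp -d); cd "$WORK"
echo "restart data" > abc_step8760.ca   # stands in for ${exp}step#.ca
gzip abc_step8760.ca                    # as it arrives from ${arch_mach} (.ca.gz)

gunzip abc_step8760.ca.gz               # back to the plain cmc-archive
# cmcarc -x -f abc_step8760.ca          # would recreate ${OLD_EXECDIR} (assumed syntax)
```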

Then, on the machine from which you started this experiment, go into the directory from which you launched it with 'Um_launch'. You will again use 'Um_launch', except that you now have to add several parameters to your call. If you started your model with 'Um_launch .', you will now have to start it with:

Um_launch . -exp ${new_exp} -continue ${old_exp} -step_total ${step_total} -stepout ${old_last_step} -r_ent ${r_ent} -interval ${interval}

${new_exp}       :  experiment you want to start
${old_exp}       :  experiment you want to continue / start from
${step_total}    :  last time step of the experiment you want to start
${old_last_step} :  last time step of the experiment you continue / start from
${r_ent}         :  '1' for LAM grids, '0' for global grids
${interval}      :  'interval' as defined in your 'configexp.dot.cfg'
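With made-up values for a hypothetical experiment 'abc' continued as 'abc2', the call would be assembled like this (the 'echo' only prints the command for inspection; drop it to actually launch):

```shell
new_exp=abc2 ; old_exp=abc   # made-up experiment names
step_total=17520             # last step of the new experiment (made-up value)
old_last_step=8760           # last step of the finished experiment (made-up value)
r_ent=1                      # LAM grid
interval=1                   # as defined in configexp.dot.cfg (assumed value)

echo Um_launch . -exp ${new_exp} -continue ${old_exp} \
     -step_total ${step_total} -stepout ${old_last_step} \
     -r_ent ${r_ent} -interval ${interval}
```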

If the job ${new_exp} already got launched automatically before, the command to restart it is written in the file:

   ~/Climat_log/${old_exp}.log

Resubmitting the model / clone

Erase all dm-, dp-, and pm-files of the aborted job in your '${EXECDIR}/output/current_last_step/??-??' or '${EXECDIR}/process/??-??' directory, respectively. If your job has clones, erase only the files from the last clone.
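A sketch of that cleanup, with the directories mocked up so the snippet is self-contained (real paths: '${EXECDIR}/output/current_last_step/??-??' or '${EXECDIR}/process/??-??'):

```shell
EXECDIR=$(mktemp -d)                          # stands in for your ${EXECDIR}
mkdir -p "$EXECDIR/output/current_last_step/00-00"
cd "$EXECDIR/output/current_last_step/00-00"
touch dm000072 dp000072 pm000072 other_file   # made-up file names

rm dm* dp* pm*    # only the aborted job's output; 'other_file' survives
```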

Resubmitting the model
You can resubmit a model that had aborted without rerunning the entry.
You will have a file in your ${EXECDIR} called 'soumet_la_job'. Simply execute it with 'ksh soumet_la_job'.

Resubmitting a clone
To resubmit a clone you just have to submit the job in your ${HOME} with:
r.qsub_clone ${HOME}/jobname
The jobname will look like *${exp}_M*

What is a clone?
When your model needs more than 3 wall-clock hours to run, you cannot run it in one shot on AIX but have to run it in smaller 'chunks'. These 'chunks' are called clones. The number of time steps per clone can be set with 'Step_rsti' in your 'gemclim_settings.nml', or the number of days per clone with 'climat_rsti' in your 'configexp.dot.cfg' (the latter will then override your 'Step_rsti').
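The number of clones then follows from the total number of time steps divided by 'Step_rsti' (rounded up); all values below are made up for illustration:

```shell
step_total=17520   # total time steps of the run (made-up value)
Step_rsti=2920     # time steps per clone, from gemclim_settings.nml (made-up value)

# number of clones, rounded up:
clones=$(( (step_total + Step_rsti - 1) / Step_rsti ))
echo "$clones"     # prints 6
```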


Resubmitting a script

All aborted or never started post-processing and diagnostic jobs are in the delayed_jobs directory of the machine on which you do your post-processing, e.g. ${HOME}/delayed_jobs/machine on 'machine'. If one of the jobs there aborted, you can find the command to resubmit it in your climat-log-file.
In your ${HOME} there is a directory called 'Climat_log'. In this directory there is a log file for each experiment you ran, called '${exp}.log'. This log file contains all the submission commands that got executed so far. These commands are typically two to four lines long. You can execute them again from any machine. If a job did not yet get submitted, its submission command will not be there; you could copy one from an older experiment and modify it for the actual experiment.
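Since the submission commands span several lines, grepping the log with a little context is handy. A sketch with a mocked-up log file (the real one is ~/Climat_log/${exp}.log, and '000072' stands in for the actual step number):

```shell
LOG=$(mktemp)   # stands in for ~/Climat_log/${exp}.log
cat > "$LOG" <<'EOF'
cd /home/user/delayed_jobs/machine
soumet gem_abc_PP_000072.jobtmp \
    -mach machine -t 3600 \
    -listing /home/user/listings/machine
EOF

# print the submission command for that step with a line of context around it:
grep -B 1 -A 2 'PP_000072' "$LOG"
```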


There are three types of jobs you mainly find in the delayed_jobs directory:

*${exp}*_PP_###.jobtmp

The job '*${exp}*_PP_###.jobtmp' can be used to resubmit the whole post-processing.

To be able to (re)submit this job the dm-, dp-, and pm-files for this time step still have to be in a 'last_step_###' sub-directory of your output directory. If you don't know where this is, look into the job you want to submit. The parameter '-s' of the Um_delam call tells you where your files should be.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet gem_${exp}_PP_###.jobtmp \
            -mach machine -t 3600 -cm 200000 \
            -listing ${HOME}/listings/machine  \
            -jn soumet_gem_${exp}_PP_###.jobtmp

post_${exp}_WORKING_ON_last_step_###_files.jobtmp

The job 'post_${exp}_WORKING_ON_last_step_###_files.jobtmp' starts the rassemble-job.

To be able to (re)submit this job, the job 'rassemble_WORKING_ON_last_step_###_files' has to be in your post-processing directory.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet post_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -jn soumet_post_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -cpus # -mach machine -t 3600 -cm 100000 \
            -listing ${HOME}/listings/machine

The number of CPUs ('-cpus') is 8 on an AIX multi-node machine and 1 everywhere else.

diag_${exp}_WORKING_ON_last_step_###_files.jobtmp

The job 'diag_${exp}_WORKING_ON_last_step_###_files.jobtmp' starts the diagnostics themselves.

To be able to (re)submit this job, the job 'diag_${exp}' has to be in your post-processing directory.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet diag_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -jn soumet_diag_${exp}_WORKING_ON_last_step_###_files.jobtmp  \
            -cpus # -mach machine -t 10800 -cm 500000 \
            -listing ${HOME}/listings/machine

The number of CPUs ('-cpus') is 8 on an AIX multi-node machine and 1 everywhere else.



Author: Katja Winger
Last update: May 2006