Troubleshooting

How to solve your problem

1) Find out what aborted (model? script?)
2) Find out why your model/script aborted
3) Fix the problem
4) Restart your model / script


Hints on how to read this web page

Find out what aborted ( model / script )

If the model or a script got canceled (e.g. because it ran out of time) while running on machine, you should receive an email from the scheduler (LoadLeveler under AIX).

You can also check your delayed_jobs directory (e.g. ${HOME}/delayed_jobs/machine). There you will find all of the post-processing jobs that were created and are running, aborted, or never got started.

And of course, you should always check the listings on ${mach}, ${xfer}, ${lehost} and ${arch_mach} (on ${arch_mach} only if you save nesting information) to see which job / script aborted. If you received an email from LoadLeveler, the cancel time given in the email should roughly match the time the listing was written.

The model aborted / got canceled

Check the end of your model listing. There you will see if it aborted.

Resubmit the model / clone.

The rassemble-job aborted / got canceled

If the rassemble-job aborted / got canceled, you will find in your post-processing directory a directory called 'WORKING_ON_last_step_###_files' containing dm-, dp-, and pm-files, possibly also md- and pr-files, and probably 1 to 8 working directories. You need to erase these working directories in order to get the diagnostics started! This post-processing directory should also contain a job called 'rassemble_WORKING_ON_last_step_###_files'.

Don't erase any files! You simply have to resubmit the job.

The diagnostic-job aborted / got canceled

If the diagnostic-job aborted / got canceled, you will already have a few diagnostic files in your post-processing directory and a listing named 'diag_${exp}*' in your listings directory on ${xfer}.

Don't erase anything! You simply have to resubmit the diagnostics.



Frequent reasons why models / scripts abort

a) Disk quota exceeded

Check your quota on the machine on which the job aborted. You can check your quota with "quota -v" (e.g. on pollux and Linux), "df -k" (e.g. on Linux), or "mmlsquota" (on machine). If your quota is exceeded, you will have to make room and then restart the aborted model / scripts.
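Which of these commands works depends on the file system, so a sketch that probes for the system-specific tools before calling them can look like this ('command -v' guards against tools that are not installed):

```shell
# Report free space on the file system holding $HOME (works everywhere):
df -k "${HOME}"
# Then run whichever quota tool this machine actually has:
if command -v quota     >/dev/null 2>&1; then quota -v  || true; fi
if command -v mmlsquota >/dev/null 2>&1; then mmlsquota || true; fi   # GPFS quota (AIX)
```

The '|| true' keeps the snippet from failing on machines where the quota tool exists but no quota is set for your account.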

b) Time limit exceeded

You will receive an email from LoadLeveler telling you "Hard WALL CLOCK limit exceeded".

This usually happens either with the rassemble- or the diag-job, because you did not ask for enough time or because of memory management problems on machine. Sometimes you get charged for much more time than you actually used, so that your job runs out of time without having done many calculations. Check your listings directory on ${xfer} to see what aborted. The cancel time given in the email should roughly match the time the listing was written.
See also the section 'Find out what aborted' above.

Fix the problem, and then restart the aborted model / scripts.

c) Diagnostics never got started

Check whether you have a file called '${exp}_WORKING_ON_last_step_###_files' in your ${HOME}/.WORKING_ON directory. If there is such a file, check whether all md- and pr-files for this time step are in your post-processing directory. If they are, the file in your .WORKING_ON directory may be a left-over from a previous experiment that did not finish, or the corresponding post-processing job aborted and got restarted, leaving some old working directories in the WORKING_ON-directory of your post-processing directory. In that case, erase the left-over working directories as well as the file '${exp}_WORKING_ON_last_step_###_files' in your ${HOME}/.WORKING_ON directory, and submit the diagnostic job. If md- and pr-files are missing, you will have to resubmit the corresponding rassemble-job.
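The cleanup described above can be sketched as follows. The directories and the step number '000072' are mocked up here so the snippet is self-contained; in real use, substitute your post-processing directory, your ${HOME}/.WORKING_ON directory, and the actual step number:

```shell
# Mock post-processing and .WORKING_ON directories (made-up names):
PP=$(mktemp -d); WO=$(mktemp -d)
mkdir -p "$PP/WORKING_ON_last_step_000072_files/work_1" \
         "$PP/WORKING_ON_last_step_000072_files/work_2"
touch "$PP/WORKING_ON_last_step_000072_files/dm000072"    # data file: must survive
touch "$WO/abc_WORKING_ON_last_step_000072_files"         # left-over marker file

# Erase only the left-over working directories and the marker file,
# never the dm-/dp-/pm-/md-/pr-files themselves:
rm -r "$PP"/WORKING_ON_last_step_000072_files/work_*
rm    "$WO"/abc_WORKING_ON_last_step_000072_files
```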


Fixing the problem


Making room

If your disk quota is exceeded, you have to make room. In the following, ${suffix} can be "_hi", "_lo" or "", depending on whether one of the variables ${splitout} or ${window} is defined to be non-zero in your 'configexp.dot.cfg'.

Quota exceeded on ${arch_mach}

If your quota on the machine on which you archive the files is exceeded, save your files somewhere else. In this case, file transfers from ${xfer} to ${arch_mach} were stopped and some of them may still be found in your post-processing directory in a sub-directory named archives_${exp}${suffix}. You will have to transfer the files in this sub-directory by hand.
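Moving the stranded files by hand can look like the sketch below. The directories are mocked up so the snippet is self-contained; in real use you would cd into archives_${exp}${suffix} and 'scp' the files to a machine with free space:

```shell
SRC=$(mktemp -d)    # stands in for archives_${exp}${suffix} on ${xfer}
DST=$(mktemp -d)    # stands in for a directory on a machine with free space
touch "$SRC/abc_200001.ca" "$SRC/abc_200002.ca"   # made-up archive file names

cp "$SRC"/*.ca "$DST"/    # real case: scp "$SRC"/*.ca somehost:/archive/path/
rm "$SRC"/*.ca            # free the space on ${xfer} once the copy succeeded
```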

Quota exceeded on post-processing machine (${xfer})
Go into your post-processing directories.


Hard WALL CLOCK limit exceeded

Check in the email you got from LoadLeveler whether you really ran for the whole time you have been charged for: compare the "Real time" with the "Total Job User Time". If the "Real time" is much bigger than the "Total Job User Time", there is not much you can do but resubmit the canceled job. If your job really ran almost the whole time (and your jobs often get canceled this way), you can do the following, depending on which job got canceled:

The rassemble-job got canceled

The diagnostics got canceled



Continuing / Resubmitting the model

Continuing the run from a restart file

You need to have an appropriate set of restart files to continue from on machine, in an ${OLD_EXECDIR}/process directory.
If the restarts already got archived and moved, you need to copy them back to the machine on which you want to run the model, into the directory ~/gemclim/${mach}. The restarts are saved on ${arch_mach} in ${archdir} (as specified in your 'configexp.dot.cfg'). They are gzipped cmc-archives with the name:

   ${exp}step#.ca.gz

After having copied them back, you need to 'gunzip' and unarchive them. You will then get the directory ${OLD_EXECDIR} back.
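The unpacking step can be sketched like this. The restart file is faked here (with a made-up name) so the snippet is self-contained, and the final unarchiving call with the CMC archiver is left as a comment since its exact syntax on your machine is an assumption:

```shell
WORK=$(mktemp -d); cd "$WORK"
echo "restart data" > abc_step8760.ca   # stands in for ${exp}step#.ca
gzip abc_step8760.ca                    # as it arrives from ${arch_mach} (.ca.gz)

gunzip abc_step8760.ca.gz               # back to the plain cmc-archive
# cmcarc -x -f abc_step8760.ca          # would recreate ${OLD_EXECDIR} (assumed syntax)
```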

Then, on the machine from which you started this experiment, go into the directory from which you launched it with 'Um_launch'. You will again use 'Um_launch', except that you now have to add several parameters to your call. If you started your model with 'Um_launch .', you will now have to start it with:

Um_launch . -exp ${new_exp} -continue ${old_exp} -step_total ${step_total} -stepout ${old_last_step} -r_ent ${r_ent} -interval ${interval}

${new_exp}       :  experiment you want to start
${old_exp}       :  experiment you want to continue / start from
${step_total}    :  last time step of the experiment you want to start
${old_last_step} :  last time step of the experiment you continue / start from
${r_ent}         :  '1' for LAM grids, '0' for global grids
${interval}      :  'interval' as defined in your 'configexp.dot.cfg'
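With made-up values for a hypothetical experiment 'abc' continued as 'abc2', the call would be assembled like this (the 'echo' only prints the command for inspection; drop it to actually launch):

```shell
new_exp=abc2 ; old_exp=abc   # made-up experiment names
step_total=17520             # last step of the new experiment (made-up value)
old_last_step=8760           # last step of the finished experiment (made-up value)
r_ent=1                      # LAM grid
interval=1                   # as defined in configexp.dot.cfg (assumed value)

echo Um_launch . -exp ${new_exp} -continue ${old_exp} \
     -step_total ${step_total} -stepout ${old_last_step} \
     -r_ent ${r_ent} -interval ${interval}
```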

If the job ${new_exp} already got launched automatically before, the command to restart it is written in the file:

   ~/Climat_log/${old_exp}.log

Resubmitting the model / clone

Erase all dm-, dp-, and pm-files of the aborted job in your '${EXECDIR}/output/current_last_step/??-??' or '${EXECDIR}/process/??-??' directory, respectively. If your job has clones, erase only the files from the last clone.
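A sketch of that cleanup, with the directories mocked up so the snippet is self-contained (real paths: '${EXECDIR}/output/current_last_step/??-??' or '${EXECDIR}/process/??-??'):

```shell
EXECDIR=$(mktemp -d)                          # stands in for your ${EXECDIR}
mkdir -p "$EXECDIR/output/current_last_step/00-00"
cd "$EXECDIR/output/current_last_step/00-00"
touch dm000072 dp000072 pm000072 other_file   # made-up file names

rm dm* dp* pm*    # only the aborted job's output; 'other_file' survives
```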

Resubmitting the model
You can resubmit a model that had aborted without rerunning the entry.
You will have a file in your ${EXECDIR} called 'soumet_la_job'. Simply execute it with 'ksh soumet_la_job'.

Resubmitting a clone
To resubmit a clone you just have to submit the job in your ${HOME} with:
r.qsub_clone ${HOME}/jobname
The jobname will look like *${exp}_M*

What is a clone?
When your model needs more than 3 wall-clock hours to run, you cannot run it in one shot on AIX but have to run it in smaller 'chunks'. These 'chunks' are called clones. The number of time steps per clone can be set with 'Step_rsti' in your 'gemclim_settings.nml', or the number of days per clone with 'climat_rsti' in your 'configexp.dot.cfg' (the latter will then override your 'Step_rsti').
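The number of clones then follows from the total number of time steps divided by 'Step_rsti' (rounded up); all values below are made up for illustration:

```shell
step_total=17520   # total time steps of the run (made-up value)
Step_rsti=2920     # time steps per clone, from gemclim_settings.nml (made-up value)

# number of clones, rounded up:
clones=$(( (step_total + Step_rsti - 1) / Step_rsti ))
echo "$clones"     # prints 6
```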


Resubmitting a script

All aborted or never started post-processing and diagnostic jobs are in the delayed_jobs directory of the machine on which you do your post-processing, e.g. ${HOME}/delayed_jobs/machine on 'machine'. If one of the jobs there aborted, you can find the command to resubmit it in your climat-log-file.
In your ${HOME} there is a directory called 'Climat_log'. In this directory there is a log file for each experiment you ran, called '${exp}.log'. This log file contains all the submission commands that got executed so far. These commands are typically two to four lines long. You can execute them again from any machine. If a job did not yet get submitted, its submission command will not be there; you could copy one from an older experiment and modify it for the actual experiment.
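Since the submission commands span several lines, grepping the log with a little context is handy. A sketch with a mocked-up log file (the real one is ~/Climat_log/${exp}.log, and '000072' stands in for the actual step number):

```shell
LOG=$(mktemp)   # stands in for ~/Climat_log/${exp}.log
cat > "$LOG" <<'EOF'
cd /home/user/delayed_jobs/machine
soumet gem_abc_PP_000072.jobtmp \
    -mach machine -t 3600 \
    -listing /home/user/listings/machine
EOF

# print the submission command for that step with a line of context around it:
grep -B 1 -A 2 'PP_000072' "$LOG"
```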


There are three types of jobs you mainly find in the delayed_jobs directory:

*${exp}*_PP_###.jobtmp

The job '*${exp}*_PP_###.jobtmp' can be used to resubmit the whole post-processing.

To be able to (re)submit this job the dm-, dp-, and pm-files for this time step still have to be in a 'last_step_###' sub-directory of your output directory. If you don't know where this is, look into the job you want to submit. The parameter '-s' of the Um_delam call tells you where your files should be.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet gem_${exp}_PP_###.jobtmp \
            -mach machine -t 3600 -cm 200000 \
            -listing ${HOME}/listings/machine  \
            -jn soumet_gem_${exp}_PP_###.jobtmp

post_${exp}_WORKING_ON_last_step_###_files.jobtmp

The job 'post_${exp}_WORKING_ON_last_step_###_files.jobtmp' starts the rassemble-job.

To be able to (re)submit this job, the job 'rassemble_WORKING_ON_last_step_###_files' has to be in your post-processing directory.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet post_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -jn soumet_post_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -cpus # -mach machine -t 3600 -cm 100000 \
            -listing ${HOME}/listings/machine

The number of CPUs ('-cpus') is 8 on an AIX multi-node machine and 1 everywhere else.

diag_${exp}_WORKING_ON_last_step_###_files.jobtmp

The job 'diag_${exp}_WORKING_ON_last_step_###_files.jobtmp' starts the diagnostics themselves.

To be able to (re)submit this job, the job 'diag_${exp}' has to be in your post-processing directory.
The command to submit this job looks like this:

cd ${HOME}/delayed_jobs/machine
soumet diag_${exp}_WORKING_ON_last_step_###_files.jobtmp \
            -jn soumet_diag_${exp}_WORKING_ON_last_step_###_files.jobtmp  \
            -cpus # -mach machine -t 10800 -cm 500000 \
            -listing ${HOME}/listings/machine

The number of CPUs ('-cpus') is 8 on an AIX multi-node machine and 1 everywhere else.



Author: Katja Winger
Last update: May 2006