${xfer} is used on this page as the machine and/or directory in which your post-processing will be done. It is specified in your 'configexp.dot.cfg'. Your post-processing directory is a directory with the name of the experiment (${exp}), on the machine and in the directory specified by ${xfer} in your 'configexp.dot.cfg'.
${EXECDIR} is the execution directory on the machine on which you run the model. It is always set to '${HOME}/gem/${mach}/${exp}'. It contains, amongst other things, a copy of your absolutes, 'outcfg.out' and 'gemclim_settings.nml', as well as the 'process' and 'output' directories. The 'process' directory contains the model's current state (i.e. its restart).
Note that in the following, ${mach} is where the model was run and ${arch_mach} is where the archives are to be stored. Finally, ${lehost} stands for the main front-end that is used here to handle most of the actual job submissions and file transfers (defaults to pollux at CMC/RPN).
Find out what aborted (model / script)
If the model or a script got canceled while running on azur, you will receive an email from LoadLeveler. You can also check your delayed_jobs directory (e.g. ${HOME}/delayed_jobs/machine). There you will find all of the post-processing jobs that were created and are either running, aborted, or never got started.
And of course, you should always check the listings on ${mach}, ${xfer}, ${lehost} and ${arch_mach} (on ${arch_mach} only if you save nesting information) to see which job / script aborted. If you received an email from LoadLeveler, the cancel time written in the email should roughly match the time the listing was written.
The model aborted / got canceled
Check the end of your model listing. There you will see whether it aborted.
If the diagnostic-job aborted / got canceled, you will already have a few diagnostic files in your post-processing directory and a listing with the name 'diag_${exp}*' in your listings directory on ${xfer}.
If the rassemble-job aborted / got canceled, you will find in your post-processing directory a directory called 'WORKING_ON_last_step_###_files', in which you will have dm-, dp-, and pm-files and probably also md- and pr-files. This post-processing directory should also contain a job called 'rassemble_WORKING_ON_last_step_###_files'.
a) Quota exceeded
You can check your quota with "quota -v" (e.g. on pollux and under Linux), "df -k" (on Linux) or "mmlsquota" (on azur). If your quota is exceeded, you will have to make room and then restart the aborted model / scripts.
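For quick reference, run the check that matches your machine from a shell prompt:
  quota -v     # pollux and Linux
  df -k        # Linux
  mmlsquota    # azur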
b) Time limit exceeded
You will receive an email from LoadLeveler telling you "Hard WALL CLOCK limit exceeded". This usually happens with either the rassemble- or the diag-job, because of memory management problems on azur. Sometimes you get charged for much more time than you actually used, so that your job runs out of time without having done much computation. Check in your listings directory on ${xfer} to see what aborted. The cancel time written in the email should roughly match the time the listing was written. Or see the section "Find out what aborted" above.
Check if you have a file called '${exp}_WORKING_ON_last_step_###_files' in your ${HOME}/.WORKING_ON directory. If there is such a file, check if all md- and pr-files for this time step are in your post-processing directory. If this is the case, the file in your .WORKING_ON directory may be a left-over file from a previous experiment that did not finish: you can then erase it and submit the diagnostic job. If md- or pr-files are missing, you will have to re-submit the corresponding rassemble-job.
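A minimal sketch of this check from the shell ('/path/to/postproc' stands for your post-processing directory):
  ls ${HOME}/.WORKING_ON/${exp}_WORKING_ON_last_step_*_files
  ls /path/to/postproc/md${exp}* /path/to/postproc/pr${exp}*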
Fixing the problem
Making room
If your disc quota is exceeded you have to make room. In the following, ${suffix} can be either "_hi", "_lo" or "", depending on whether one of the variables ${splitout} or ${window} is defined to be non-zero in your 'configexp.dot.cfg'.
Quota exceeded on ${arch_mach}
If your quota on the machine on which you archive the files is exceeded, save your files somewhere else. In this case, file transfers from ${xfer} to ${arch_mach} were stopped and some of the files may still be found in your post-processing directory, in a sub-directory named 'archives_${exp}${suffix}'. You will have to transfer the files in this sub-directory by hand.
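A hypothetical manual transfer with scp; the user, host and target path are placeholders you have to replace:
  cd /path/to/postproc/archives_${exp}${suffix}
  scp * someuser@somehost:/some/archive/path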
Quota exceeded on post-processing machine (${xfer})
Go into your post-processing directories. If there are experiments for which the diagnostics are already finished, you'll find the archived files in 'archives_${exp}${suffix}'. You have to transfer these files by hand.
If the diagnostics aborted, you have to find out if they aborted before Climat_mdpr_clean was run (which removes every second md- and pr-file, i.e. those at 06Z and 18Z). If you still have the md- and pr-files at every 6 hours, you can erase the current diagnostic files. These are the following files: md${exp}*, pr${exp}*, ts${exp}*, res*, *lancer*, mg* and all temporary directories.
Never erase any file starting with mddate_...p, prdate_...p or diag_${exp}!!!
If the removal of the 06Z and 18Z md- and pr-files has already started, you could try to continue / resubmit the diagnostics as described under 'Resubmitting a script' below. Note that this will only work if you still have enough room to archive the files.
If the model aborted, you can erase all dm-, dp-, and pm-files in your '${EXECDIR}/process/??-??' and all files and directories in '${EXECDIR}/output'. You can also erase the files and directories from this job or clone in your post-processing directory. Erase only files and directories from the aborted job. If your job has clones, erase only files and directories from the last clone.
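A sketch of this clean-up; double-check every path before deleting anything:
  rm -f ${EXECDIR}/process/??-??/dm* ${EXECDIR}/process/??-??/dp* ${EXECDIR}/process/??-??/pm*
  rm -rf ${EXECDIR}/output/*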
Hard WALL CLOCK limit exceeded
Check in the email you got from LoadLeveler to see if you really ran the whole time you have been charged for. Compare the "Real time" with the "Total Job User Time". If the "Real time" is much bigger than the "Total Job User Time", there is not much you can do but resubmit the canceled job. If your job ran almost all the time (and your jobs often get canceled this way) you can do the following, depending on which job got canceled:
The rassemble-job got canceled
You can submit the rassemble-jobs with more time. To do so, use the parameter 'climat_job_size' in your 'configexp.dot.cfg'. The possible values are 'small', 'medium', and 'big'. Or you can do your "clean up" more often: set the parameter 'climat_rsti' in your 'configexp.dot.cfg' to a smaller number of days, so that the post-processing will be started more often, i.e. for fewer time steps.
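A sketch of both options in 'configexp.dot.cfg' (the climat_rsti value is a hypothetical example):
  climat_job_size=big   # possible values: small, medium, big
  climat_rsti=10        # hypothetical: clean up every 10 days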
The diagnostics got canceled
Make sure that 'climat_diag_cpus' in your 'configexp.dot.cfg' is not set to 1 when you do your post-processing on AIX. The default value under AIX is 8; anywhere else it is 1.
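For example, to set it explicitly in 'configexp.dot.cfg':
  climat_diag_cpus=8   # the default under AIX; elsewhere the default is 1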
You can also reduce the amount of computation by interpolating your global grid to a Gaussian grid with lower resolution, using the parameter 'gaussout' in your 'configexp.dot.cfg'. You can interpolate to any Gaussian grid you want. If you want to interpolate, e.g., to a 240 x 120 grid, set "gaussout=240120". Alternatively, you can do all of the diagnostics on the model grid itself if you choose to define the 'rotate' and 'rlats' parameters. In addition to the interpolated global grid, you can also save the high resolution area of a stretched grid (just set "splitout=1" in your 'configexp.dot.cfg') or any window with the original resolution using the parameter 'window' in your 'configexp.dot.cfg'. Usage: window="lon1 lon2 lat1 lat2", where lon1, lon2 and lat1, lat2 are the left, right and lower, upper grid indices, respectively. In these cases (using 'splitout' or 'window'), two independent diagnostics streams will be launched, one for the high resolution area/window and one for the (interpolated) global grid. The rassemble-job will take care of the interpolations.
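A sketch of these settings in 'configexp.dot.cfg'; the window indices are hypothetical and must match your own grid:
  gaussout=240120        # interpolate the global grid to 240 x 120
  splitout=1             # also save the high-resolution area of a stretched grid
  window="10 50 20 60"   # hypothetical grid indices: lon1 lon2 lat1 lat2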
Continuing / Resubmitting the model
Continuing the run from a restart file
We assume that you still have an appropriate set of restart files to continue from on azur, in an ${OLD_EXECDIR}/process directory. If this is the case, go to the directory from where you started this experiment with 'Um_launch'. You will again use 'Um_launch', except that you now have to add four parameters to your call. If you started your model with 'Um_launch .', you will now have to start your model with:
Last time step of the experiment you want to start
${old_last_step}: Last time step of the experiment you continue / start from
Resubmitting the model / clone
Erase all dm-, dp-, and pm-files in
your '${EXECDIR}/process/??-??' and in '${EXECDIR}/output' from the aborted job. If your job has clones, erase only files from the last clone.
Resubmitting the model
You can resubmit a model that aborted without rerunning the entry. You will have a file in your ${EXECDIR} called 'soumet_la_job'. Make this file executable (e.g. 'chmod u+x soumet_la_job') and execute it.
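In short, from a shell on the machine where the model ran:
  cd ${EXECDIR}
  chmod u+x soumet_la_job
  ./soumet_la_job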
Resubmitting a clone
To resubmit a clone you just have to submit the job in your ${HOME}
with:
r.qsub_clone ${HOME}/jobname
The jobname will look like
*${exp}_M*
What is a clone?
When your model needs more than 3 wall-clock hours to run, you cannot run it in one shot on AIX but have to run it in smaller 'chunks'. These 'chunks' are called clones. The number of time steps per clone can be set with 'Step_rsti' in your 'gemclim_settings.nml', or the number of days per clone with 'climat_rsti' in your 'configexp.dot.cfg' (this will then override your 'Step_rsti').
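As a sketch, with hypothetical values:
  Step_rsti = 2160      (in 'gemclim_settings.nml': time steps per clone)
  climat_rsti=30        (in 'configexp.dot.cfg': days per clone; overrides 'Step_rsti')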
Resubmitting a script
All aborted or never started post-processing jobs are in your delayed_jobs directory on the machine on which you do your post-processing, e.g. ${HOME}/delayed_jobs/azur on 'azur'. If one of the jobs here aborted, you can find the command to resubmit it in your climat-log-file. You'll have in your ${HOME} a directory called 'Climat_log'. In this directory is a log-file for each experiment you ran, called ${exp}.log. This log file contains all the submission commands that got executed so far. These commands are typically two to four lines long. You can execute them again from any machine. If a job has not been submitted yet, the submission command will not be there. You could copy one from an older experiment and modify it according to the actual experiment.
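For example, to look up the submission commands logged for the current experiment:
  cat ${HOME}/Climat_log/${exp}.log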
There are three types of jobs you mainly find in the delayed_jobs directory:
The job '*${exp}*_PP_###.jobtmp' can be used to resubmit the whole post-processing. To be able to (re)submit this job, the dm-, dp-, and pm-files for this time step still have to be in a 'last_step_###' sub-directory of your output directory. If you don't know where this is, look into the job you want to submit. The parameter '-s' of the Um_delam call tells you where your files should be.
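A sketch of that inspection (replace ### with the actual time step):
  grep Um_delam ${HOME}/delayed_jobs/azur/*${exp}*_PP_###.jobtmp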
The command to submit this job looks
like this:
The job 'post_${exp}_WORKING_ON_last_step_###_files.jobtmp' starts the rassemble-job. To be able to (re)submit this job, the job 'rassemble_WORKING_ON_last_step_###_files' has to be in your post-processing directory. The corresponding 'WORKING_ON_last_step_###_files' directory must also contain the dm-, dp-, and pm-files for this time step and possibly some already processed md- and pr-files.
The command to submit this job looks like this: