Running on Colosse
Table of contents
- Apply for an account
- Initial setup
- Good things to know
Apply for an account
First you will have to apply for a Compute Canada account:
How to obtain a CLUMEQ account?
Your sponsor will then receive an email to confirm your application.
Then you will be invited to create an account on colosse.
By the way, you can choose your own username!
Initial setup (only do once)
Running the model on colosse is essentially the same as running it on marvin.
There are only a few things you have to set up once, at the beginning:
In your HOME create a link:
ln -s /home/winger/armnlib/ssm/all/share/env_univ/.profile
and take a copy of my .bash_profile:
cp ~winger/.bash_profile ~/.
To allow 'ssh colosse1' without typing a password:
cd ~/.ssh
ssh-keygen
(just press 'Enter' whenever you are asked a question, 3 times)
cat id_rsa.pub >> authorized_keys2
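If you prefer not to answer the prompts, here is a minimal non-interactive sketch of the same key setup. The function name setup_ssh_key and its optional directory argument are hypothetical; the ssh-keygen options used (-t, -N, -f, -q) are standard OpenSSH flags, and -N "" gives the empty passphrase you get by pressing 'Enter'.

```shell
# Hypothetical helper: non-interactive version of the ssh key steps above.
# KEY_DIR defaults to ~/.ssh; -N "" sets an empty passphrase, the same as
# pressing Enter at the passphrase prompts.
setup_ssh_key() {  # usage: setup_ssh_key [KEY_DIR]
  local key_dir="${1:-$HOME/.ssh}"
  mkdir -p "$key_dir" && chmod 700 "$key_dir" || return 1
  # Generate a key only if none exists, so an existing key is never overwritten.
  if [ ! -f "$key_dir/id_rsa" ]; then
    ssh-keygen -t rsa -N "" -f "$key_dir/id_rsa" -q || return 1
  fi
  # Append the public key to authorized_keys2 unless it is already listed.
  if ! grep -qxF -e "$(cat "$key_dir/id_rsa.pub")" "$key_dir/authorized_keys2" 2>/dev/null; then
    cat "$key_dir/id_rsa.pub" >> "$key_dir/authorized_keys2"
  fi
  # Keep the file private (sshd may ignore it if others can write to it).
  chmod 600 "$key_dir/authorized_keys2"
}
```

Afterwards, `ssh -o BatchMode=yes colosse1 true` should return without asking for a password; with BatchMode, ssh fails instead of prompting, so it is a quick way to check the setup.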
You can put your aliases, exports etc. in your .profile_usr.
Just make sure that you do not set and export the variable AFSISIO!
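A quick way to check this is sketched below. check_afsisio is a hypothetical helper: it only looks at whether AFSISIO is set in the current shell and whether the word appears anywhere in your .profile_usr (so a comment mentioning it will also be reported).

```shell
# Hypothetical helper: warn if AFSISIO is set in this shell or appears in
# the given profile file (default ~/.profile_usr).
check_afsisio() {  # usage: check_afsisio [PROFILE_FILE]
  local f="${1:-$HOME/.profile_usr}"
  # ${AFSISIO+set} expands to "set" whenever AFSISIO is set, even if empty.
  if [ -n "${AFSISIO+set}" ]; then
    echo "AFSISIO is set in this shell" >&2
    return 1
  fi
  if [ -f "$f" ] && grep -q 'AFSISIO' "$f"; then
    echo "AFSISIO appears in $f" >&2
    return 1
  fi
}
```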
soumet
Create the directory:
mkdir ~/ovbin
and in it the link:
ln -s /home/winger/armnlib/ssm/linux24-x86-64/bin/soumet_your_group_project ~/ovbin/soumet
Your group project is either xgk-345-ab or your group name with -aa instead of -01 at the end.
You can check your group with:
id -gn
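For the second case, the renaming can be done in the shell. This is a minimal sketch; project_from_group is a hypothetical name, and abc-123-01 in the usage line is a made-up group, not a real one. It does not cover the xgk-345-ab case.

```shell
# Hypothetical helper: turn a group name ending in "-01" into the project
# name ending in "-aa". Other names are printed unchanged and return 1.
project_from_group() {  # usage: project_from_group GROUP
  local grp="$1"
  case "$grp" in
    *-01) printf '%s\n' "${grp%-01}-aa" ;;
    *)    printf '%s\n' "$grp"; return 1 ;;  # no "-01" suffix: leave as is
  esac
}
```

For example, `project_from_group abc-123-01` prints abc-123-aa; on colosse you would call it as `project_from_group "$(id -gn)"`.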
If there is no soumet for your group project, let me know: Katja.Winger@ec.gc.ca
Then log out and back in.
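The soumet steps can also be wrapped in a small function that first checks whether a soumet exists for your project. This is only a sketch: link_soumet and SOUMET_DIR are hypothetical names, SOUMET_DIR is the directory from the ln command above, and the project name is whatever applies to you.

```shell
# Hypothetical helper: create BIN_DIR/soumet (default ~/ovbin/soumet)
# pointing to the soumet of the given group project.
SOUMET_DIR=/home/winger/armnlib/ssm/linux24-x86-64/bin

link_soumet() {  # usage: link_soumet PROJECT [BIN_DIR]
  local project="$1" bin_dir="${2:-$HOME/ovbin}"
  local target="$SOUMET_DIR/soumet_$project"
  # Stop early if there is no soumet for this project.
  if [ ! -e "$target" ]; then
    echo "No soumet found for project '$project': $target" >&2
    return 1
  fi
  mkdir -p "$bin_dir" || return 1
  ln -sf "$target" "$bin_dir/soumet"
}
```

If it reports that no soumet was found, that is the case where you should send me an email.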
Good things to know
Directories and other known machines
Wherever you use the following directories on marvin:
/local/sata?/${USER}
/local/fiber1/${USER}
use this one on colosse instead:
/rap/your_group_directory/${USER}
(You will have to create this directory first.)
If you do not know your group directory, run:
id -gn
and replace the -01 at the end with -aa.
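The path can be built the same way in the shell. rap_dir_for is a hypothetical helper, and abc-123-01 and jdoe in the example are made-up names; it only applies the -01 to -aa rule described above.

```shell
# Hypothetical helper: print /rap/<group directory>/<user>, where the group
# directory is the Unix group with a trailing "-01" replaced by "-aa".
rap_dir_for() {  # usage: rap_dir_for GROUP USER
  local grp="$1" user="$2"
  case "$grp" in
    *-01) grp="${grp%-01}-aa" ;;
  esac
  printf '/rap/%s/%s\n' "$grp" "$user"
}
```

For example, `rap_dir_for abc-123-01 jdoe` prints /rap/abc-123-aa/jdoe. On colosse you could then create your directory with `mkdir -p "$(rap_dir_for "$(id -gn)" "$USER")"`.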
None of our UQAM machines can be seen from the model when it runs on colosse.
So instead of 'headnode', 'st?' or 'skynet?', you will now have to write 'colosse1'.
Runtimes
On marvin there is no time limit on job duration and no priority for smaller and shorter jobs.
But on colosse there are both!
Therefore, you should set the following parameters in your configexp.dot.cfg as accurately as possible:
BACKEND_time_mod=
BACKEND_time_ntr=
Set these parameters to the time (in seconds) you expect the model (BACKEND_time_mod) and the entry (BACKEND_time_ntr), respectively, to take.
To start with, use the times the jobs took on marvin.
But these numbers must not exceed 172800 s (2 days).
The smaller these two numbers (especially 'BACKEND_time_mod'), the shorter your jobs will wait in the queue.
But if the time you set elapses before your job has finished, the job will be killed and you will have to start over!
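To turn a marvin run time into seconds and check it against the limit, a small sketch like the following may help. It assumes the run time is written as H:MM:SS; hms_to_seconds and check_backend_time are hypothetical helpers, not part of the model scripts.

```shell
# Hypothetical helper: convert an H:MM:SS run time (e.g. what a job took
# on marvin) into seconds; awk avoids problems with leading zeros like "08".
hms_to_seconds() {  # usage: hms_to_seconds H:MM:SS
  printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: return 1 if a value exceeds the 172800 s (2 day) limit.
check_backend_time() {  # usage: check_backend_time SECONDS
  if [ "$1" -gt 172800 ]; then
    echo "BACKEND time $1 s exceeds the 172800 s limit" >&2
    return 1
  fi
}
```

For example, `hms_to_seconds 5:30:00` prints 19800, and `check_backend_time "$(hms_to_seconds 5:30:00)"` succeeds.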
OpenMP
Whereas on marvin we can parallelize the model only with MPI, on colosse we can use both MPI and OpenMP.
Therefore, I suggest you set in your 'configexp.dot.cfg':
BACKEND_OMP=2;
and in your 'gemclim_settings.nml':
Ptopo_smtdyn = 2 ,
Ptopo_smtphy = 2 ,
Just make sure these 3 parameters are always set to the same value!
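Here is a rough sketch for checking that the three values match. It assumes each setting sits on its own line, as in "BACKEND_OMP=2;" and "Ptopo_smtdyn = 2 ,", and that the names are written with exactly this capitalization; the namelist parsing is deliberately simple, and get_setting and check_omp_consistency are hypothetical helpers.

```shell
# Hypothetical helper: print the last integer value assigned to NAME in FILE,
# for lines like "BACKEND_OMP=2;" or "  Ptopo_smtdyn = 2 ,".
get_setting() {  # usage: get_setting NAME FILE
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | tail -n 1
}

# Hypothetical helper: compare BACKEND_OMP (configexp.dot.cfg) with
# Ptopo_smtdyn and Ptopo_smtphy (gemclim_settings.nml).
check_omp_consistency() {  # usage: check_omp_consistency CFG_FILE NML_FILE
  local omp dyn phy
  omp=$(get_setting BACKEND_OMP "$1")
  dyn=$(get_setting Ptopo_smtdyn "$2")
  phy=$(get_setting Ptopo_smtphy "$2")
  if [ -n "$omp" ] && [ "$omp" = "$dyn" ] && [ "$omp" = "$phy" ]; then
    echo "OK: all three set to $omp"
  else
    echo "Mismatch: BACKEND_OMP=$omp Ptopo_smtdyn=$dyn Ptopo_smtphy=$phy" >&2
    return 1
  fi
}
```

Usage: `check_omp_consistency configexp.dot.cfg gemclim_settings.nml` in your experiment directory.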
Output blocking
And please also set:
Ptopo_nblocx = 1 ,
Ptopo_nblocy = 1 ,
This avoids some IO problems that occur on colosse.
Another way to avoid the IO problems is to save all time steps of one month in one file instead of writing one file per time step.
Therefore, please also set:
Clim_allin1_L = .true. ,
To be able to use this parameter, you might have to recreate your model absolute with the patches in:
~winger/gem/v_3.3.2.1/Abs/Patches/AllOut
Number of cores to use
On colosse, the total number of cores you use to run your model must be a multiple of 8, where:
Total number of cores = Ptopo_npex * Ptopo_npey * BACKEND_OMP
Therefore you will also have to set in your 'configexp.dot.cfg':
CLIMAT_pp_cpus=8;
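The core count rule above is simple arithmetic; here is a minimal sketch for checking a layout before submitting. check_core_count is a hypothetical helper, and you pass it your own Ptopo_npex, Ptopo_npey and BACKEND_OMP values.

```shell
# Hypothetical helper: check that Ptopo_npex * Ptopo_npey * BACKEND_OMP
# is a multiple of 8.
check_core_count() {  # usage: check_core_count NPEX NPEY OMP
  local total=$(( $1 * $2 * $3 ))
  if [ $(( total % 8 )) -eq 0 ]; then
    echo "$total cores: OK"
  else
    echo "$total cores: not a multiple of 8" >&2
    return 1
  fi
}
```

For example, `check_core_count 4 4 2` prints "32 cores: OK", while a 3 x 3 layout with BACKEND_OMP=2 gives 18 cores and fails.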
Output compressing
Please make sure you have the parameter:
Out3_compress_L = .true. ,
in your gemclim_settings.nml file so that your model output is compressed.
Author: Katja Winger
Last update: July 2011