Running on Colosse

Table of contents

  1. Apply for an account
  2. Initial setup
  3. Good things to know


Apply for an account

First you will have to apply for a Compute Canada account:
 
  How to obtain a CLUMEQ account?

Your sponsor will then receive an email to confirm your application.
After that you will be invited to create an account on colosse.
By the way, you can choose your own username!


Initial setup (only do once)

Running the model on colosse is essentially the same as running it on marvin.
There are only a few things you have to set up once at the beginning:

In your HOME create a link:
    ln -s /home/winger/armnlib/ssm/all/share/env_univ/.profile

and take a copy of my .bash_profile:
    cp ~winger/.bash_profile ~/.

Allow 'ssh colosse1' without typing a password (just press 'Enter' each of the 3 times ssh-keygen asks a question):
    cd ~/.ssh
    ssh-keygen
    cat id_rsa.pub >> authorized_keys2

You can put your aliases, exports etc. in your .profile_usr.

Just make sure that you do not set and export the variable AFSISIO!

soumet

Create the directory:
    mkdir ~/ovbin
and in there the link:
    ln -s /home/winger/armnlib/ssm/linux24-x86-64/bin/soumet_your_group_project soumet

Your group project is either
    xgk-345-ab
or
    your group but with -aa instead of -01 at the end.
You can check your group with:
    id -gn

If there is no soumet for your group project let me know: Katja.Winger@ec.gc.ca

Then log out and back in.



Good things to know

Directories and other known machines

Wherever you use these directories on marvin:
    /local/sata?/${USER}
    /local/fiber1/${USER}
use this one on colosse instead:
    /rap/your_group_directory/${USER}
(You will have to create this directory first.)

If you do not know your group directory just execute:
    id -gn
and replace the -01 at the end by -aa.
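
The renaming rule above can be sketched as a small shell function. This is only an illustration: the function name rap_group_dir and the group name abc-123-01 in the comment are made up, and it assumes your group name really ends in -01.

```shell
# Sketch: turn a group name ending in -01 into the matching -aa directory name.
# rap_group_dir is a hypothetical helper, not an existing command on colosse.
rap_group_dir() {
    case "$1" in
        *-01) printf '%s\n' "${1%-01}-aa" ;;
        *)    echo "group '$1' does not end in -01" >&2
              return 1 ;;
    esac
}

# Example with a made-up group name:
#   rap_group_dir abc-123-01      gives abc-123-aa
#
# To create your directory (uncomment once you have checked the name):
#   mkdir -p "/rap/$(rap_group_dir "$(id -gn)")/${USER}"
```

If the function reports that your group does not end in -01, use the group directory you were given rather than a guessed one.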

None of our UQAM machines can be seen from the model when running on colosse.
So instead of 'headnode', 'st?', 'skynet?' you will now have to write 'colosse1'.

Runtimes

On marvin there is no time limit on job duration and no priority for smaller and shorter jobs.
But on colosse there are both!!!

Therefore you should set the following parameters in your configexp.dot.cfg as accurately as possible:

    BACKEND_time_mod=
    BACKEND_time_ntr=

Set these parameters to the time (in seconds) you think the model and the entry, respectively, will take.
To start with, use the time the jobs took on marvin.
But these numbers must not exceed 172800 s (2 days).

The smaller these two numbers (especially 'BACKEND_time_mod'), the less time your jobs will spend in the queue. But if the time you set elapses before your job has finished, the job will get kicked out and you will have to start over!!!
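
Since the times have to be given in seconds, a small helper can do the conversion and catch values above the 2-day limit. This is only a sketch; the function name walltime_seconds is made up and the 6 h 30 min runtime is just an example.

```shell
# Sketch: convert hours (and optional minutes) to seconds and refuse
# anything above the 172800 s (2 days) limit mentioned above.
# walltime_seconds is a hypothetical helper name.
walltime_seconds() {
    hours="$1"
    minutes="${2:-0}"
    total=$(( hours * 3600 + minutes * 60 ))
    if [ "$total" -gt 172800 ]; then
        echo "requested ${total}s exceeds the 172800s limit" >&2
        return 1
    fi
    printf '%s\n' "$total"
}

walltime_seconds 6 30    # prints 23400
```

The printed number is what you would put after 'BACKEND_time_mod=' or 'BACKEND_time_ntr='. It is probably wise to add a safety margin to the time measured on marvin, because a job that runs out of time has to start over.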

OpenMP

Whereas on marvin we can parallelize the model only with MPI, on colosse we can use both MPI and OpenMP.
Therefore I suggest you set in your 'configexp.dot.cfg':
    BACKEND_OMP=2;
and in your 'gemclim_settings.nml':
    Ptopo_smtdyn    = 2          , Ptopo_smtphy    = 2          ,

Just make sure these 3 parameters are always set to the same value!!!
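
A quick way to check that the three values agree is sketched below. The helper names get_value and check_omp are made up, and the sketch assumes the settings are written on one line each, as in the examples above, with a plain integer after the '='.

```shell
# Sketch: read an integer setting "NAME = value" from a file.
# get_value and check_omp are hypothetical helpers; the parsing only
# handles the simple layouts shown in this document.
get_value() {
    sed -n "s/.*$1[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p" "$2" | head -n 1
}

# Compare BACKEND_OMP with Ptopo_smtdyn and Ptopo_smtphy.
check_omp() {
    omp=$(get_value BACKEND_OMP "$1")
    dyn=$(get_value Ptopo_smtdyn "$2")
    phy=$(get_value Ptopo_smtphy "$2")
    if [ -n "$omp" ] && [ "$omp" = "$dyn" ] && [ "$omp" = "$phy" ]; then
        echo "OK: all three are set to $omp"
    else
        echo "Mismatch: BACKEND_OMP='$omp' Ptopo_smtdyn='$dyn' Ptopo_smtphy='$phy'" >&2
        return 1
    fi
}

# Usage, from the directory holding both files:
#   check_omp configexp.dot.cfg gemclim_settings.nml
```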

Output blocking

And please also set:
    Ptopo_nblocx    = 1          , Ptopo_nblocy    = 1          ,
There are IO problems on colosse that you can avoid this way.


Another way to avoid the IO problems is to save all time steps of one month in one file instead of having 1 file per time step.
Therefore please also set:
    Clim_allin1_L   = .true.     ,

To be able to use this parameter you might have to recreate your model absolute with the patches in:
    ~winger/gem/v_3.3.2.1/Abs/Patches/AllOut

Number of cores to use

On colosse the total number of cores you want to use to run your model must be a multiple of 8.
Total number of cores = Ptopo_npex * Ptopo_npey * BACKEND_OMP

Therefore you will also have to set in your 'configexp.dot.cfg':
    CLIMAT_pp_cpus=8;
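
The multiple-of-8 rule for the model can be checked with a few lines of shell before submitting. This is only a sketch; the function name check_cores is made up and the topology values are examples.

```shell
# Sketch: compute Ptopo_npex * Ptopo_npey * BACKEND_OMP and check that the
# total is a multiple of 8. check_cores is a hypothetical helper name.
check_cores() {
    npex="$1"
    npey="$2"
    omp="$3"
    total=$(( npex * npey * omp ))
    if [ $(( total % 8 )) -ne 0 ]; then
        echo "total of $total cores is not a multiple of 8" >&2
        return 1
    fi
    printf '%s\n' "$total"
}

check_cores 4 4 2    # prints 32
```

For example, a 3 x 3 topology with BACKEND_OMP=2 gives 18 cores and would be rejected by this check, while 4 x 4 with BACKEND_OMP=2 gives 32 and passes.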

Output compressing

Please make sure you have the parameter:
    Out3_compress_L = .true.     ,
in your file gemclim_settings.nml so your model output will be compressed.




Author: Katja Winger
Last update: July 2011