October 16, 2019

Basic procedures for using the Supercomputer System

 

This page explains the basic usage of this supercomputer system. Although some of the contents overlap, the materials distributed at the briefing session (workshop) are also posted below, so please refer to them together.

 

Our Supercomputer System comprises the hardware described in hardware configuration. The hardware system environment utilizes the Univa Grid Engine (UGE) job management system so that multiple users can efficiently share it. The environment comprises the following components:

Gateway node (gw.ddbj.nig.ac.jp)

This is the node used to connect to the system from the internet. To utilize the system, a user must first connect to this node from outside the system.

Login node

This is the node through which users develop programs or enter jobs into the job management system. Users log in to this node from the gateway node using the qlogin command. There are multiple login nodes within the system, and each user is assigned to the login node with the lowest load when they log in.

Compute node

This is the node on which jobs entered by users and managed by the job management system are executed. There are Thin compute nodes, Medium compute nodes, and Fat compute nodes, according to the type of hardware.

Queue

A queue is a concept in UGE in which compute nodes are grouped logically. When UGE is instructed to execute a job using a queue, computation is carried out automatically on a compute node so that the specified conditions are satisfied. There are multiple types of queues; these will be described later. The specific procedures for using these queues are provided in a subsequent section.

A schematic drawing of system utilization that shows how these components are connected is provided below:
[Figure: sys01_e2.png — schematic drawing of system utilization]
The basic steps in the procedure for using the system are as follows:

Connecting to the system

First, connect to the gateway node (gw.ddbj.nig.ac.jp) via ssh. Please prepare ssh client software that can be used on your terminal and establish a connection. The authentication method is password authentication.

 ・Phase3 system gateway ⇒ gw.ddbj.nig.ac.jp
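For example, with an OpenSSH client, the connection can be established as follows (your_account is a placeholder for the account name you were issued; enter your password when prompted):

    ssh your_account@gw.ddbj.nig.ac.jp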

Please note that the gateway node is only the entrance to the system from outside; no computations or programs can be executed on this node, nor does it provide an environment suitable for doing so. Once you are connected to the gateway node, log in to a login node by following the procedure below:

Logging in to a login node

Use qlogin, a UGE command, as follows:

[username@gw1 ~]$ qlogin
Your job 80308 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 80308 has been successfully scheduled.
Establishing /home/geadmin/UGER/utilbin/lx-amd64/qlogin_wrapper session to host at028 ...
Last login: Tue Feb 26 13:55:12 2019 from gw1

When executed, qlogin checks the load conditions of the login nodes, automatically selects the node with the lowest load, and logs in to it. (In the above example, the login session itself is handled as an interactive job (job ID 80308), and the compute node at028 is selected as the login destination.) Executing the qstat command (described later) on the logged-in node displays information about the jobs currently in the queue under UGE management, as follows:

[username@at028 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------
     80308 0.50000 QLOGIN     username      r     02/27/2019 16:03:29 login.q@at028                                                     1

Various development environments and scripting language environments have been preinstalled on the login nodes. To conduct development activities, please work on the login nodes. For information about the programming environments on the login nodes, please refer to programming environments. Do not execute large-scale computations on the login nodes; be sure to execute computations on the compute nodes by entering computation jobs via UGE.

Login nodes equipped with GPGPU are provided for program development using the GPGPU development environment. To log in to such a node, specify gpu with the -l option when executing qlogin on the gateway node, as follows:

  qlogin -l gpu

This allows you to log in to a login node equipped with GPGPU.

 

The procedure and relevant steps to enter jobs on the compute nodes are as follows:

Entering computation jobs

To enter computation jobs on the compute nodes, use the qsub command. To enter jobs with qsub, a job script needs to be prepared. A simple example of such a script is shown below:

#!/bin/sh
#$ -S /bin/sh
pwd
hostname
date
sleep 20
date
echo "to stderr" 1>&2

Lines beginning with "#$" at the top of the script set options for UGE. Options can be communicated to UGE either by writing these instruction lines in the shell script or by specifying them as options when executing the qsub command. The major options are as follows:

Script directive | Command-line option | Meaning
#$ -S interpreter path | -S interpreter path | Specifies the path for the command interpreter. It is also possible to specify scripting languages other than a shell. This option does not have to be specified.
#$ -cwd | -cwd | Specifies the current working directory for job execution; the job's standard output and standard error output are written to cwd. If this is not specified, the home directory is used as the current working directory for job execution.
#$ -N job name | -N job name | Specifies the name of the job. The script name is used as the job name if this is not specified.
#$ -o file path | -o file path | Specifies the output destination for the job's standard output.
#$ -e file path | -e file path | Specifies the output destination for the job's standard error output.


Many other options in addition to the ones listed above can be specified for qsub. For details, please use "man qsub" after logging in to the system to access the online manual and check the other available options.
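As a minimal sketch, some of these options can be written at the top of the sample script shown earlier and the script entered with qsub as follows (the script name test.sh, the job name, and the output file names are only examples):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N testjob
#$ -o testjob.out
#$ -e testjob.err
hostname
date
sleep 20
date

    qsub test.sh

The same settings can also be given on the command line instead of in the script, for example "qsub -cwd -N testjob -o testjob.out -e testjob.err test.sh".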

Queue selection when entering jobs

As described in the software configuration section, the queues in this system are set up as follows (as of March 11, 2019). Queues can be selected using the -l option of the qsub command. To enter a job into a specific queue, execute qsub with one of the "queue specification options" in the table below. If nothing is specified, the job will be entered in epyc.q.

Phase3 System

Queue name | Number of job slots | Maximum memory capacity | Upper limit on execution time | Purpose | Queue specification option
epyc.q | 4,224 | 512G | 62 days | Jobs with no particular resource request and an execution time within two months | No specification or -l month (default)
intel.q | 1,472 | 384G | 62 days | Use of Intel Xeon | -l intel or -l month -l intel
gpu.q | 384 | 512G | 62 days | Use of GPU | -l gpu or -l month -l gpu
medium.q | 160 | 3T | 62 days | Use of Medium compute nodes | -l medium or -l month -l medium
login.q | 258 | 512G | Unlimited | qlogin from the gateway node | (none)
login_gpu.q | 48 | 384G | Unlimited | qlogin from the gateway node when using GPU | (none)
short.q | 744 | 192G | 3 days | Short-time jobs | -l short

 

For example, enter the command below to enter a job called test.sh in intel.q.

    qsub -l intel test.sh

Additionally, to enter a job called test.sh on the Medium compute node, enter the following command:

    qsub -l medium test.sh

Note, however, that nodes equipped with GPGPU are also equipped with SSD, and nodes equipped with SSD are also equipped with HDD. Consequently, the system is configured so that jobs flow to the GPU queue when the slots of the SSD queue are full, and jobs entered while the HDD queue is full are entered into the SSD queue. Please keep this in mind.

Checking the status of a job entered

Whether a job entered with qsub has actually been accepted can be checked using the qstat command, which shows the status of entered jobs. If a number of jobs have been entered, for instance, qstat will give output similar to the following:

[username@at027 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID 
------------------------------------------------------------------------------------------------------------------------------------------------
     80312 0.50000 QLOGIN     username     r     02/27/2019 17:42:00 login.q@at027                                                     1
     80313 0.25000 jobname    username     r     02/27/2019 17:44:30 epyc.q@at040                                                      1
     80314 0.25000 jobname    username     r     02/27/2019 17:44:35 epyc.q@at040                                                      1
     80315 0.25000 jobname    username     r     02/27/2019 17:44:40 epyc.q@at040                                                      1

The meanings of the characters in the “state” column in this case are as follows:

Character | Meaning
r | Job being executed
qw | Job waiting in the queue
t | Job being transferred to the execution host
E | An error occurred while processing the job
d | Job in the process of deletion

To check the queue utilization status, enter "qstat -f". This gives the following output:

[username@at027 ~]$ qstat -f
queuename                      qtype resv/used/tot. np_load  arch          states
---------------------------------------------------------------------------------
medium.q@m01                   BP    0/0/80         0.00     lx-amd64      
---------------------------------------------------------------------------------
medium.q@m02                   BP    0/0/80         0.00     lx-amd64      
---------------------------------------------------------------------------------
medium.q@m03                   BP    0/0/80         0.00     lx-amd64      
---------------------------------------------------------------------------------
medium.q@m04                   BP    0/0/80         0.00     lx-amd64      
(omitted)
---------------------------------------------------------------------------------
epyc.q@at033                   BP    0/0/64         0.00     lx-amd64      
---------------------------------------------------------------------------------
epyc.q@at034                   BP    0/0/64         0.00     lx-amd64      
---------------------------------------------------------------------------------
epyc.q@at035                   BP    0/0/64         0.00     lx-amd64      
(omitted)
---------------------------------------------------------------------------------
intel.q@it003                  BP    0/0/32         0.00     lx-amd64      
---------------------------------------------------------------------------------
intel.q@it004                  BP    0/0/32         0.00     lx-amd64      
---------------------------------------------------------------------------------
intel.q@it005                  BP    0/0/32         0.00     lx-amd64      
---------------------------------------------------------------------------------
intel.q@it006                  BP    0/0/32         0.00     lx-amd64      
---------------------------------------------------------------------------------
(omitted)

This output can be used to determine the node (queue) on which the job is entered. To view the overall state for each queue such as job entering status and queue load status, “qstat -g c” can be used.

[username@at027 ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE  
--------------------------------------------------------------------------------
epyc.q                            0.00      0      0   4160   4224      0     64 
gpu.q                             0.00      0      0     64    112      0     48 
intel.q                           0.00      0      0   1472   1472      0      0 
login.q                           0.00      1      0    383    384      0      0 
login_gpu.q                       0.00      0      0     48     48      0      0 
medium.q                          0.00      0      0    800    800      0      0 
short.q                           0.00      0      0    128    224      0     96

Detailed information on a job can be obtained by specifying “qstat -j jobID.”

[username@at027 ~]$ qstat -j 199666
==============================================================
job_number:                 199666
jclass:                     NONE
submission_time:            02/27/2019 17:42:00.867
owner:                      username
uid:                        9876
group:                      ddbj
gid:                        9876
supplementary group:        ddbj
sge_o_home:                 /home/username
sge_o_log_name:             username
sge_o_path:                 /cm/local/apps/gcc/7.2.0/bin:/home/geadmin/UGER/bin/lx-amd64:/cm/local/apps/environment-modules/4.0.0//bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/4.0.0/bin:/home/username/.local/bin:/home/username/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre8/home/username
sge_o_host:                 gw1
account:                    sge
stderr_path_list:           NONE:NONE:/dev/null
hard resource_list:         d_rt=259200,mem_req=8G,s_rt=259200,s_stack=10240K,s_vmem=8G
soft resource_list:         epyc=TRUE,gpu=TRUE,intel=TRUE,login=TRUE
mail_list:                  username@gw1
notify:                     FALSE
job_name:                   QLOGIN
stdout_path_list:           NONE:NONE:/dev/null
priority:                   0
jobshare:                   0
restart:                    n
env_list:                   TERM=xterm
department:                 defaultdepartment
binding:                    NONE
mbind:                      NONE
submit_cmd:                 qlogin
category_id:                4
request_dispatch_info:      FALSE
start_time            1:    02/27/2019 17:42:00.884
job_state             1:    r
exec_host_list        1:    at027:1
granted_req.          1:    mem_req=8.000G
usage                 1:    wallclock=01:00:01, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
scheduling info:            -

If, on checking the execution status, a job is found to be in an incorrect state, it can be deleted immediately without waiting for completion by using the qdel command: specify "qdel jobID". To delete all the jobs entered by a specific user, specify "qdel -u username".
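For example, to delete the job with ID 80313 from the earlier qstat output, or all of your own jobs (the job ID and user name here are only examples):

    qdel 80313
    qdel -u username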

Checking the results

The results of a job are output to a file named jobname.o<jobID> for standard output and a file named jobname.e<jobID> for standard error output. Please check these files. Detailed information, such as how many resources an executed job used, can be checked using the qacct command.

[username@at027 ~]$ qacct -j 199666
==============================================================
qname        intel.q             
hostname     it003               
group        lect                
owner        username            
project      NONE                
department   defaultdepartment   
jobname      jobscript.sh        
jobnumber    XXXXX               
taskid       undefined
pe_taskid    NONE                
account      sge                 
priority     0                   
cwd          NONE                
submit_host  at027               
submit_cmd   qsub -l intel jobscript.sh
qsub_time    02/27/2019 18:49:09.854
start_time   02/27/2019 18:49:15.069
end_time     02/27/2019 18:49:35.128
granted_pe   NONE                
slots        1                   
failed       0    
deleted_by   NONE
exit_status  0                   
ru_wallclock 20.059       
ru_utime     0.016        
ru_stime     0.034        
ru_maxrss    10220               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    9181                
ru_majflt    2                   
ru_nswap     0                   
ru_inblock   344                 
ru_oublock   32                  
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     108                 
ru_nivcsw    18                  
wallclock    20.108       
cpu          0.051        
mem          0.001             
io           0.000             
iow          0.000             
ioops        1277                
maxvmem      228.078M
maxrss       0.000
maxpss       0.000
arid         undefined
jc_name      NONE
bound_cores  NONE
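In addition to the accounting information above, the result files themselves can be checked directly. For example, if the simple sample script shown earlier was entered as test.sh without -N, -o, or -e options and given job ID 80313 (both values are only examples), the output files could be inspected as follows:

    cat test.sh.o80313
    cat test.sh.e80313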

 

How to use high-speed domain (Lustre domain)

Lustre configuration

In this supercomputer system, the user home directories are constructed on a Lustre File System, a parallel file system that is mainly used in large-scale supercomputer systems.

In Phase 2 (March 2018), one file system, lustre6, was constructed from two units of MDS (described later), 12 units of OSS (described later), and 41 sets of OST (described later). In Phase 3 (March 2019), two additional file systems, lustre7 and lustre8, were introduced, each consisting of two units of MDS, four units of OSS, and 54 sets of OST.

File system | Date of introduction | Capacity | MDS | OSS | OST
lustre6 | March 1, 2018 | 3.8PB | 2 | 4VM on SFA14KXE | 41 x RAID6(8D+2P)
lustre7 | March 1, 2019 | 5.0PB | 2 | 4VM on SFA14KXE | 54 x RAID6(8D+2P)
lustre8 | March 1, 2019 | 5.0PB | 2 | 4VM on SFA14KXE | 54 x RAID6(8D+2P)

Lustre components

Simply put, a Lustre File System comprises an IO server called the Object Storage Server (OSS), a disk device called the Object Storage Target (OST), and a server called the MetaData Server (MDS) that manages the file metadata. A description of each component is provided in the table below:

Component | Description
Object Storage Server (OSS) | Manages the OSTs (described below) and controls IO requests from the network to the OSTs.
Object Storage Target (OST) | A block storage device that stores the file data. One OST is treated as one virtual disk and comprises multiple physical disks. User data is stored in one or more OSTs as one or more objects. The number of objects per file can be changed, and storage performance can be adjusted by tuning this. In this system configuration, eight OSTs are managed by one OSS unit.
Meta Data Server (MDS) | One server unit per Lustre file system (two units in an HA configuration in this system). It manages the position data of the objects to which files are assigned and the file attribute data within the file system, and guides file IO requests to the proper OSS. The MDS is involved only when file IO is initiated; after that, data transfer is conducted directly between the client and the OSS. This is why the Lustre File System can achieve high IO performance.
Meta Data Target (MDT) | The storage device that stores the metadata (file names, directories, ACLs, etc.) of the files in the Lustre File System. It is connected to the MDS.
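As a supplementary sketch, the MDT and OSTs that make up the file system holding a given directory can be listed with the "lfs df" command (the path below is only an example, with username as a placeholder; the output layout depends on the Lustre version):

    lfs df -h /lustre8/home/username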

File striping

One of the characteristics of Lustre is its ability to divide one file into multiple segments and store them dispersed over multiple OSTs. This function is called file striping. The advantage of file striping is that, because one file is stored in multiple OSTs as multiple segments, the client can read and write those segments in parallel, so large files can be read and written at high speed. However, there are also disadvantages: the overhead for handling the dispersed data increases as a file is spread over more OSTs. In general, therefore, tuning the stripe size and stripe count is considered effective only when the target files are several GB or larger. Also, since Lustre manages metadata centrally on the MDS, file operations that involve heavy metadata access (such as ls -l or creating many small files) concentrate the load on the MDS and are therefore not very fast compared with the equivalent operations on a local file system. Please keep this in mind and avoid, for example, placing tens of thousands of small files in a single directory (it is better to distribute them over multiple directories).

Checking the usage status of the home directory

The usage status of the user’s current home directory can be checked using the “lfs quota” command.

[username@at031 ~]$ lfs quota -u username ./
Disk quotas for usr username (uid 9876):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 359743396       0       0       -    5250       0       0       -
Item | Meaning/description
kbytes | File capacity in use (KB)
quota | Limit value for file capacity/number (soft limit)
limit | Absolute limit value for file capacity/number (hard limit)
grace | Grace period for exceeding the limit value
files | Number of files in use

 

How to set up file striping

To set up file striping, please follow the procedure below. First, check the current stripe count using "lfs getstripe <target file or directory>". (The system default is set to one.)

[username@at031 ~]$ ls -ld tmp
drwxr-xr-x 2 username lect 4096 Feb 22 00:51 tmp
[username@at031 ~]$ lfs getstripe tmp
tmp
stripe_count:  1 stripe_size:   1048576 stripe_offset: -1

Stripe settings can be set with the "lfs setstripe" command.

Option | Description
-c | Specifies the stripe count (the number of OSTs over which a file is striped).
-o | Specifies the stripe offset (the index of the OST from which striping starts).
-s | Specifies the stripe size.
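For example, to stripe large files written under the tmp directory from the earlier example over four OSTs with a 4 MB stripe size, something like the following could be used and the result confirmed with "lfs getstripe" (the values are only an example, and the option syntax may differ between Lustre versions):

    lfs setstripe -c 4 -s 4m tmp
    lfs getstripe tmp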

 

How to use X client on the login node

To utilize an X Window System client on the login node, please prepare a corresponding X server emulator if you are using Windows. For Mac, the X11 SDK is included in the Xcode Tools; install it so that X11 can be used.
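For example, X11 forwarding can usually be requested when connecting to the gateway node with ssh, so that X clients started in the subsequent session are displayed on your local terminal (your_account is a placeholder; whether the DISPLAY setting is carried over to the login node by qlogin depends on the system configuration and your ssh client):

    ssh -X your_account@gw.ddbj.nig.ac.jp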