Basic procedures for using the Supercomputer System
This page explains the basic usage of this supercomputer system. Although the content partly overlaps, the materials distributed at the briefing session (workshop) are also posted below, so please refer to them together.
Our supercomputer system comprises the hardware described in the hardware configuration section. The system environment uses the Univa Grid Engine (UGE) job management system so that multiple users can share it efficiently. The environment comprises the following components:
Gateway node (gw.ddbj.nig.ac.jp)
This is the node used to connect to the system from the internet. To utilize the system, a user must first connect to this node from outside the system.
Login node
This is the node on which users develop programs and enter jobs into the job management system. Users log in to this node from the gateway node using the qlogin command. There are multiple login nodes within the system, and each user is assigned to the login node with the lowest load at login.
Compute node
This is the node through which jobs entered by the users and managed under the job management system are executed. There are Thin compute nodes, Medium compute nodes, and Fat compute nodes, according to the type of computer.
Queue
A queue is a concept in UGE in which compute nodes are grouped logically. When UGE is instructed to execute a job using a queue, computation is carried out automatically on a compute node so that the specified conditions are satisfied. There are multiple types of queues; these will be described later. The specific procedures for using these queues are provided in a subsequent section.
A schematic drawing of system utilization that shows how these components are connected is provided below:
The basic steps in the procedure for using the system are as follows:
Connecting to the system
First, connect to the gateway node (gw.ddbj.nig.ac.jp) via ssh. Please prepare ssh client software that can be used on the terminal at hand and establish a connection. The authentication method is password authentication.
・Phase3 system gateway ⇒ gw.ddbj.nig.ac.jp
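For example, with a standard OpenSSH client the connection can be established as follows (a minimal sketch; replace username with your own account name):

ssh username@gw.ddbj.nig.ac.jp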
Please note that the gateway node is only the entrance to the system from outside: no computations or programs can be executed on this node, and no environment for doing so is provided. Once you are connected to the gateway node, log in to a login node by following the procedure below:
Logging in to a login node
Use qlogin, a UGE command, as follows:
[username@gw1 ~]$ qlogin
Your job 80308 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 80308 has been successfully scheduled.
Establishing /home/geadmin/UGER/utilbin/lx-amd64/qlogin_wrapper session to host at028 ...
Last login: Tue Feb 26 13:55:12 2019 from gw1
When executed, qlogin checks the load on the login nodes, automatically selects the node with the lowest load, and logs in to it. (In the above example, the login session itself is treated as an interactive job (job ID 80308) and the node at028 is selected for the login.) By executing the qstat command (described later) on the logged-in node, information about the jobs currently under UGE management can be viewed as follows:
[username@at028 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue          jclass     slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     80308 0.50000 QLOGIN     username     r     02/27/2019 16:03:29 login.q@at028             1
Various development environments and scripting language environments are preinstalled on the login nodes. Please carry out development work on the login nodes; for information about the programming environments available there, refer to the programming environments section. Do not execute large-scale computations on the login nodes. Computations must be executed on the compute nodes by entering computation jobs via UGE.
Login nodes equipped with GPGPUs are provided for program development that uses the GPGPU development environment. To log in to such a node, specify gpu with the -l option when executing qlogin on the gateway node, as follows:
qlogin -l gpu
This allows you to login to a login node equipped with GPGPU.
The procedure and relevant steps to enter jobs on the compute nodes are as follows:
Entering computation jobs
To enter computation jobs on the compute nodes, use the qsub command. Entering a job with qsub requires a job script to be prepared first. A simple example of such a script is shown below:
#!/bin/sh
#$ -S /bin/sh
pwd
hostname
date
sleep 20
date
echo "to stderr" 1>&2
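Assuming the script above is saved as test.sh, it can be entered into the job management system as follows:

qsub test.sh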
Lines beginning with "#$" set options for UGE. The mode of operation is communicated to UGE either by writing such instruction lines in the job script or by specifying the corresponding options on the qsub command line. The major options are as follows:
Instruction line in script | Command-line option | Meaning |
---|---|---|
#$ -S interpreter path | -S interpreter path | Specifies the path for the command interpreter. It is also possible to specify script languages other than shell. This option does not have to be specified. |
#$ -cwd | -cwd | Specifies the current working directory for job execution. This will output the standard job outputs and error outputs to cwd. If this is not specified, the home directory will be the current working directory used for job execution. |
#$ -N job name | -N job name | Specifies the name of the job. The script name is used as the job name if this is not specified. |
#$ -o file path | -o file path | Specifies the output destination for standard output for jobs. |
#$ -e file path | -e file path | Specifies the output destination for standard error output for jobs. |
Many options other than the ones listed above can be specified for qsub. For details, please run "man qsub" after logging in to the system to access the online manual and check the remaining options.
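For example, the directives in the table above could be combined at the top of a job script as follows (an illustrative sketch; the job name and output file names are placeholders):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -N testjob
#$ -o testjob.out
#$ -e testjob.err
hostname
date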
Queue selection when entering jobs
As described in the software configuration section, this system has the queues listed below (as of March 11, 2019). Queues can be selected using the -l option of the qsub command. To enter a job into a specific queue, execute qsub with one of the "queue specification options" in the table below. If nothing is specified, the job is entered into epyc.q.
Phase3 System
Queue name | Number of job slots | Maximum memory capacity | Upper limit for execution time | Purpose | Options for queue specification |
---|---|---|---|---|---|
epyc.q | 4,224 | 512G | 62 days | When no particular resource is requested and the execution period is within two months | No specification or -l month (default) |
intel.q | 1,472 | 384G | 62 days | Use of Intel Xeon | -l intel or -l month -l intel |
gpu.q | 384 | 512G | 62 days | Use of GPU | -l gpu or -l month -l gpu |
medium.q | 160 | 3T | 62 days | Use of Medium compute node | -l medium or -l month -l medium |
login.q | 258 | 512G | Unlimited | Used to execute qlogin from the gateway node | |
login_gpu.q | 48 | 384G | Unlimited | Used to execute qlogin from the gateway node when using GPU |
short.q | 744 | 192G | 3 days | For short-time jobs | -l short |
For example, enter the command below to enter a job called test.sh in intel.q.
qsub -l intel test.sh
Additionally, to enter a job called test.sh on the Medium compute node, enter the following command:
qsub -l medium test.sh
Note, however, that nodes equipped with GPGPU are also equipped with SSD, and nodes equipped with SSD are also equipped with HDD. The system is therefore configured so that jobs flow to the GPU queue when the slots of the SSD queue are full, and jobs entered while the HDD queue is full are entered into the SSD queue. Please keep this in mind.
Checking the status of a job entered
Whether a job entered by qsub has actually been accepted can be checked with the qstat command, which shows the status of entered jobs. If several jobs have been entered, for instance, qstat gives output similar to the following:
[username@at027 ~]$ qstat
job-ID     prior   name       user         state submit/start at     queue          jclass     slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
     80312 0.50000 QLOGIN     username     r     02/27/2019 17:42:00 login.q@at027             1
     80313 0.25000 jobname    username     r     02/27/2019 17:44:30 epyc.q@at040              1
     80314 0.25000 jobname    username     r     02/27/2019 17:44:35 epyc.q@at040              1
     80315 0.25000 jobname    username     r     02/27/2019 17:44:40 epyc.q@at040              1
The meanings of the characters in the “state” column in this case are as follows:
Character | Meaning |
---|---|
r | Job being executed |
qw | Job in standby queue |
t | Job being transferred to the execution host |
E | An error occurred while processing the job |
d | Job in the process of deletion |
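For example, to list only the jobs that are currently running, the -s option of qstat can be used (a brief sketch based on the standard Grid Engine option; see "man qstat" for the full list of state selectors):

qstat -s r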
To check the queue utilization status, input "qstat -f". This gives the following output:
[username@at027 ~]$ qstat -f
queuename                      qtype resv/used/tot. np_load  arch          states
---------------------------------------------------------------------------------
medium.q@m01                   BP    0/0/80         0.00     lx-amd64
---------------------------------------------------------------------------------
medium.q@m02                   BP    0/0/80         0.00     lx-amd64
---------------------------------------------------------------------------------
medium.q@m03                   BP    0/0/80         0.00     lx-amd64
---------------------------------------------------------------------------------
medium.q@m04                   BP    0/0/80         0.00     lx-amd64
(omitted)
---------------------------------------------------------------------------------
epyc.q@at033                   BP    0/0/64         0.00     lx-amd64
---------------------------------------------------------------------------------
epyc.q@at034                   BP    0/0/64         0.00     lx-amd64
---------------------------------------------------------------------------------
epyc.q@at035                   BP    0/0/64         0.00     lx-amd64
(omitted)
---------------------------------------------------------------------------------
intel.q@it003                  BP    0/0/32         0.00     lx-amd64
---------------------------------------------------------------------------------
intel.q@it004                  BP    0/0/32         0.00     lx-amd64
---------------------------------------------------------------------------------
intel.q@it005                  BP    0/0/32         0.00     lx-amd64
---------------------------------------------------------------------------------
intel.q@it006                  BP    0/0/32         0.00     lx-amd64
---------------------------------------------------------------------------------
(omitted)
This output can be used to determine the node (queue) on which a job has been entered. To view the overall state of each queue, such as the job entry status and the queue load, "qstat -g c" can be used.
[username@at027 ~]$ qstat -g c
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDS  cdsuE
--------------------------------------------------------------------------------
epyc.q                            0.00      0      0   4160   4224      0     64
gpu.q                             0.00      0      0     64    112      0     48
intel.q                           0.00      0      0   1472   1472      0      0
login.q                           0.00      1      0    383    384      0      0
login_gpu.q                       0.00      0      0     48     48      0      0
medium.q                          0.00      0      0    800    800      0      0
short.q                           0.00      0      0    128    224      0     96
Detailed information on a job can be obtained by specifying "qstat -j jobID".
[username@at027 ~]$ qstat -j 199666
==============================================================
job_number:                 199666
jclass:                     NONE
submission_time:            02/27/2019 17:42:00.867
owner:                      username
uid:                        9876
group:                      ddbj
gid:                        9876
supplementary group:        ddbj
sge_o_home:                 /home/username
sge_o_log_name:             username
sge_o_path:                 /cm/local/apps/gcc/7.2.0/bin:/home/geadmin/UGER/bin/lx-amd64:/cm/local/apps/environment-modules/4.0.0//bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/sbin:/usr/sbin:/cm/local/apps/environment-modules/4.0.0/bin:/home/username/.local/bin:/home/username/bin
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre8/home/username
sge_o_host:                 gw1
account:                    sge
stderr_path_list:           NONE:NONE:/dev/null
hard resource_list:         d_rt=259200,mem_req=8G,s_rt=259200,s_stack=10240K,s_vmem=8G
soft resource_list:         epyc=TRUE,gpu=TRUE,intel=TRUE,login=TRUE
mail_list:                  username@gw1
notify:                     FALSE
job_name:                   QLOGIN
stdout_path_list:           NONE:NONE:/dev/null
priority:                   0
jobshare:                   0
restart:                    n
env_list:                   TERM=xterm
department:                 defaultdepartment
binding:                    NONE
mbind:                      NONE
submit_cmd:                 qlogin
category_id:                4
request_dispatch_info:      FALSE
start_time            1:    02/27/2019 17:42:00.884
job_state             1:    r
exec_host_list        1:    at027:1
granted_req.          1:    mem_req=8.000G
usage                 1:    wallclock=01:00:01, cpu=00:00:00, mem=0.00000 GBs, io=0.00000 GB, iow=0.000 s, ioops=0, vmem=N/A, maxvmem=N/A
scheduling info:            -
If you check the execution status of a job and find that something is wrong, the qdel command can be used to delete the job immediately without waiting for it to finish. Specify "qdel jobID". To delete all the jobs entered by a specific user, specify "qdel -u username".
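For example, to delete the job with job ID 80313 shown in the earlier qstat output (an illustrative sketch; substitute the actual job ID):

qdel 80313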
Checking the results
The standard output of a job is written to a file named <job name>.o<job ID>, and the standard error output to a file named <job name>.e<job ID>. Please check these files. Detailed information on how many resources an executed job used, and so forth, can be checked with the qacct command.
[username@at027 ~]$ qacct -j 199666
==============================================================
qname               intel.q
hostname            it003
group               lect
owner               username
project             NONE
department          defaultdepartment
jobname             jobscript.sh
jobnumber           XXXXX
taskid              undefined
pe_taskid           NONE
account             sge
priority            0
cwd                 NONE
submit_host         at027
submit_cmd          qsub -l intel jobscript.sh
qsub_time           02/27/2019 18:49:09.854
start_time          02/27/2019 18:49:15.069
end_time            02/27/2019 18:49:35.128
granted_pe          NONE
slots               1
failed              0
deleted_by          NONE
exit_status         0
ru_wallclock        20.059
ru_utime            0.016
ru_stime            0.034
ru_maxrss           10220
ru_ixrss            0
ru_ismrss           0
ru_idrss            0
ru_isrss            0
ru_minflt           9181
ru_majflt           2
ru_nswap            0
ru_inblock          344
ru_oublock          32
ru_msgsnd           0
ru_msgrcv           0
ru_nsignals         0
ru_nvcsw            108
ru_nivcsw           18
wallclock           20.108
cpu                 0.051
mem                 0.001
io                  0.000
iow                 0.000
ioops               1277
maxvmem             228.078M
maxrss              0.000
maxpss              0.000
arid                undefined
jc_name             NONE
bound_cores         NONE
How to use high-speed domain (Lustre domain)
Lustre configuration
In this supercomputer system, the user home directories are located on a Lustre File System. The Lustre File System is a parallel file system mainly used in large-scale supercomputer systems.
In this system, two units of MDS (described later), 12 units of OSS (described later), and 41 sets of OST (described later) constitute the lustre6 file system, introduced in Phase 2 (March 2018). In Phase 3 (March 2019), two file systems, lustre7 and lustre8, were added, each consisting of two units of MDS, four units of OSS, and 54 sets of OST.
File system | Date of introduction | Capacity | mds | oss | ost |
---|---|---|---|---|---|
lustre6 | March 1, 2018 | 3.8PB | 2 | 4VM on SFA14KXE | 41 x RAID6(8D+2P) |
lustre7 | March 1, 2019 | 5.0PB | 2 | 4VM on SFA14KXE | 54 x RAID6(8D+2P) |
lustre8 | March 1, 2019 | 5.0PB | 2 | 4VM on SFA14KXE | 54 x RAID6(8D+2P) |
Lustre components
Simply put, a Lustre File System comprises as its components an IO server called the Object Storage Server (OSS), a disk device called the Object Storage Target (OST), and a server called the MetaData Server (MDS) that manages the file metadata. A description of each component is provided in the table below:
Component | Description of term |
---|---|
Object Storage Server (OSS) | Manages OST (described below) and controls the IO requests from the network to OST. |
Object Storage Target (OST) | This is a block storage device that stores the file data. One OST is considered one virtual disk, and comprises multiple physical disks. The user data is stored in one or more OSTs as one or more objects. It is possible to change the set number of objects per file, and the storage performance can be adjusted by tuning this. In this system configuration, eight OSTs are managed for one unit of OSS. |
Meta Data Server (MDS) | One Lustre file system has one MDS (two servers in an HA configuration in this system). It manages the location of the objects to which files are assigned and the file attribute data within the file system, and directs file IO requests to the proper OSS. The MDS is involved only when file IO is initiated; after that, data is transferred directly between the client and the OSS. This is why the Lustre File System can achieve high IO performance. |
Meta Data Target (MDT) | This is the storage device used to store the metadata (file name, directory, ACL, etc.) for the files in the Lustre File System. It is connected to MDS. |
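From a login node, the MDTs and OSTs that make up the mounted Lustre file systems can be listed with the lfs df command (a brief sketch; the -h option prints sizes in human-readable units):

lfs df -h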
File striping
One of the characteristics of Lustre is its ability to divide one file into multiple segments and store them dispersed over multiple OSTs. This function is called file striping. The advantage of file striping is that, because one file is stored on multiple OSTs as multiple segments, the client can read and write the segments in parallel and thus handle large files at high speed. However, there are also disadvantages: the overhead of handling the dispersed data increases as the file is spread over more OSTs. In general, therefore, tuning the stripe size and stripe count is considered effective only when the target file is several GB or larger. In addition, since Lustre manages all metadata centrally on the MDS, file operations that involve metadata operations (such as ls -l, or creating many small files) concentrate load on the MDS and are therefore not very fast compared with the equivalent operations on a local file system. Please keep this in mind and avoid placing tens of thousands of small files in the same directory; it is better to distribute them over multiple directories, as in the sketch below.
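As an illustration of the last point, the following sketch distributes many small files from one directory into 16 subdirectories keyed by the first character of an MD5 hash of each file name (the directory names data and split are placeholders):

# create subdirectories 0-9 and a-f
mkdir -p split/{0..9} split/{a..f}
# move each file into the subdirectory matching the first hex digit of its name hash
for f in data/*; do
    prefix=$(basename "$f" | md5sum | cut -c1)
    mv "$f" "split/$prefix/"
done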
Checking the usage status of the home directory
The usage status of the user's current home directory can be checked using the "lfs quota" command.
[username@at031 ~]$ lfs quota -u username ./
Disk quotas for usr username (uid 9876):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
             ./ 359743396       0       0       -    5250       0       0       -
Item | Meaning/description |
---|---|
kbytes | File capacity in use (KB) |
quota | Limitation value for file capacity/number (soft limit) |
limit | Absolute limitation value for file capacity/number (hard limit) |
grace | Tolerable period for exceeding the limitation value |
files | Number of files in use |
How to set up file striping
To set up file striping, please follow the procedure below. First, check the current stripe count. It can be checked using "lfs getstripe <target file or directory>". (The system default is set to one.)
[username@at031 ~]$ ls -ld tmp
drwxr-xr-x 2 username lect 4096 Feb 22 00:51 tmp
[username@at031 ~]$ lfs getstripe tmp
tmp
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
Stripe settings can be set with the "lfs setstripe" command.
Option name | Description |
---|---|
-c | Sets the stripe count (the number of OSTs over which a file is striped). |
-o | Specifies the offset. |
-s | Specifies the striping size. |
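For example, to set a stripe count of 4 on the tmp directory from the example above and confirm the new setting (a brief sketch):

lfs setstripe -c 4 tmp
lfs getstripe tmp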
How to use X client on the login node
To use an X Window client on a login node, prepare an appropriate X Window server emulator on your local machine if you are using Windows. On a Mac, the X11 SDK is included in the Xcode Tools; install it so that X11 can be used.
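For example, on Linux or macOS the X11 connection can typically be forwarded by adding the -X option when connecting to the gateway, assuming the system permits X11 forwarding through the gateway and the subsequent qlogin session (a sketch; behaviour depends on the local X server and the system's ssh configuration):

ssh -X username@gw.ddbj.nig.ac.jp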