
 

Current Issue: May 2005
Checking Efficiency of Parallel Jobs

by Brian Hopkins

In order to make the best use of MCSR computing resources, it is important to ensure that parallel computing jobs really do run in parallel. Users (especially new users) should monitor their jobs to ensure that all the processors being tied up by PBS are really being used by their applications. PBS provides various utilities with which users can monitor their jobs’ progress. To begin, simply type

qstat -u $USER

This will return a list of all the jobs queued by the current user. The list should look something like this:

00001.sweetgum r0001 MM-defR job1 4697 -- 2 1gb 100:0 R 00:45

00002.sweetgum r0001 MM-defR job2 5750 -- 2 1gb 100:0 R 00:44

00003.sweetgum r0001 MM-defR job3 5818 -- 2 1gb 100:0 R 00:45

The number preceding each row is the ID number assigned to the job by PBS. To gain more information on a particular job, type 'qstat -f jobnumber'. For instance, to access more information about the first job listed above, type the command

qstat -f 00001

Unfortunately, this command returns a great deal of data, most of which is not relevant here. To get just the essential usage data, pipe the output of qstat -f through a grep for the string 'used'; i.e.,

qstat -f 00001 | grep 'used'

The output from this command is more manageable:

resources_used.cpupercent = 96

resources_used.cput = 00:47:58

resources_used.mem = 59792kb

resources_used.ncpus = 2

resources_used.vmem = 308592kb

resources_used.walltime = 00:48:40

The important numbers here are the CPU time used (cput), the walltime elapsed (walltime), and the number of CPUs (ncpus). For a job running with perfect parallel efficiency, the CPU time would be equal to the elapsed walltime times the number of CPUs being used:

cput = ncpus x walltime

Of course, no job runs with perfect parallel efficiency. In reality, cput will always be somewhat less than the product of ncpus and walltime, but it should not fall far short of it. In the case above, however, cput (00:47:58) is less than the walltime alone (00:48:40), even though two CPUs are reserved. This condition raises a very large red flag, indicating that the job is not really running in parallel at all. In such cases, only one CPU is really being used by the job; the others that PBS has set aside are simply wasted.
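As a worked example of the relation above, the sample output gives cput = 00:47:58 = 2878 seconds and walltime = 00:48:40 = 2920 seconds on 2 CPUs, so the ratio cput / (ncpus x walltime) is 2878 / 5840, or roughly 49 percent. The same arithmetic can be sketched as a small shell function. This is only an illustration: the name parallel_efficiency is hypothetical, and it assumes the resources_used lines are formatted as in the sample output, with times given as HH:MM:SS.

```shell
# parallel_efficiency (hypothetical helper): read `qstat -f <jobid>` output
# on stdin and print cput / (ncpus x walltime) as a percentage.
# Assumes lines of the form "resources_used.name = value", as in the sample.
parallel_efficiency() {
    awk '
        # Convert an HH:MM:SS string to seconds.
        function to_seconds(t,    parts) {
            split(t, parts, ":")
            return parts[1] * 3600 + parts[2] * 60 + parts[3]
        }
        $1 == "resources_used.cput"     { cput  = to_seconds($3) }
        $1 == "resources_used.walltime" { wall  = to_seconds($3) }
        $1 == "resources_used.ncpus"    { ncpus = $3 }
        END {
            if (ncpus > 0 && wall > 0)
                printf "%.1f\n", 100 * cput / (ncpus * wall)
            else
                print "n/a"    # usage data missing or job not yet started
        }
    '
}
```

With the function defined, 'qstat -f 00001 | parallel_efficiency' would print 49.3 for the job shown above. A value near 100 means the reserved CPUs are busy; a value near 100 divided by ncpus suggests that only one CPU is doing the work.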

In order to prevent such waste, we ask that users periodically check on the parallel efficiency of their jobs. If you find that your jobs are not running efficiently, contact assist@mcsr.olemiss.edu and we will be happy to help you get things properly parallelized. Ensuring that jobs really use all the resources available to them makes everyone's work go faster.
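The periodic check could also be scripted. The following is a minimal sketch, not an MCSR-provided tool: flag_serial_jobs is a hypothetical name, and the function assumes that job rows in the qstat -u listing begin with a numeric ID followed by a dot, and that qstat -f accepts the numeric part of the ID, as in the commands above. It prints the ID of every job that has more than one CPU reserved but whose cput has not exceeded its walltime, which is the red-flag condition described earlier.

```shell
# flag_serial_jobs (hypothetical helper): print the numeric ID of each of the
# current user's jobs that reserves more than one CPU but whose cput has not
# exceeded its walltime.
flag_serial_jobs() {
    # Keep only rows that start with "<digits>." and strip the server name.
    qstat -u "$USER" | awk '$1 ~ /^[0-9]+\./ { sub(/\..*/, "", $1); print $1 }' |
    while read -r job; do
        # </dev/null keeps qstat from consuming the list of job IDs on stdin.
        qstat -f "$job" </dev/null | awk -v job="$job" '
            # Convert an HH:MM:SS string to seconds.
            function to_seconds(t,    parts) {
                split(t, parts, ":")
                return parts[1] * 3600 + parts[2] * 60 + parts[3]
            }
            $1 == "resources_used.cput"     { cput  = to_seconds($3) }
            $1 == "resources_used.walltime" { wall  = to_seconds($3) }
            $1 == "resources_used.ncpus"    { ncpus = $3 }
            END { if (ncpus > 1 && wall > 0 && cput <= wall) print job }
        '
    done
}
```

Running flag_serial_jobs from time to time while jobs are active gives a quick list of candidates to investigate. Jobs that have only just started may be flagged briefly before their CPU time accumulates.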



--------------------------------
Last Modified: Thursday, 19-May-2005 16:50:11 CDT
Copyright © 1997-2005 The Mississippi Center for Supercomputing Research. All Rights Reserved.