by Brian Hopkins

In order to make the best use of MCSR computing resources, it is important to ensure that parallel computing jobs really do run in parallel. Users (especially new users) should monitor their jobs to ensure that all the processors reserved by PBS are actually being used by their applications.

PBS provides various utilities with which users can monitor their jobs' progress. To begin, simply type

    qstat -u $USER

This will return a list of all the jobs queued by the current user. The list should look something like this:

    00001.sweetgum  r0001  MM-defR  job1  4697  --  2  1gb  100:0  R  00:45
    00002.sweetgum  r0001  MM-defR  job2  5750  --  2  1gb  100:0  R  00:44
    00003.sweetgum  r0001  MM-defR  job3  5818  --  2  1gb  100:0  R  00:45

The number at the start of each row is the ID assigned to the job by PBS. To get more information on a particular job, type 'qstat -f jobnumber'. For instance, to see more information about the first job listed above, type the command

    qstat -f 00001

Unfortunately, this command returns a great deal of data, most of which is not relevant here. To get just the essential usage data, pipe the output of qstat -f through grep for the string 'used'; i.e.,

    qstat -f 00001 | grep 'used'

The output from this command is more manageable:

    resources_used.cpupercent = 96
    resources_used.cput = 00:47:58
    resources_used.mem = 59792kb
    resources_used.ncpus = 2
    resources_used.vmem = 308592kb
    resources_used.walltime = 00:48:40

The important numbers here are the CPU time used (cput), the walltime elapsed (walltime), and the number of CPUs (ncpus). For a job running with perfect parallel efficiency, the CPU time would be equal to the elapsed walltime times the number of CPUs being used:

    cput = ncpus x walltime

Of course, no job runs with perfect parallel efficiency. In reality, cput will always be somewhat less than the product of ncpus and walltime, but it should never be far less.
For instance, in the case above, cput (00:47:58) is actually less than the elapsed walltime (00:48:40), even though two CPUs were reserved. With perfect efficiency cput would be close to 2 x 00:48:40 = 01:37:20, so this job is running at roughly 49% parallel efficiency. This raises a very large red flag, indicating that the job is not really running in parallel at all. In such cases, only one CPU is really being used by the job; the others that PBS has set aside are simply wasted.

In order to prevent such waste, we ask that users periodically check the parallel efficiency of their jobs. If you find that your jobs are not running efficiently, contact assist@mcsr.olemiss.edu and we will be happy to help you get things properly parallelized. Ensuring that jobs really use all the resources available to them makes everyone's work go faster.
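The efficiency check described above (comparing cput against ncpus x walltime) can be scripted. The following is a minimal sketch in Python, not an MCSR-provided tool. The function names hms_to_seconds and parallel_efficiency are hypothetical, and the sketch assumes the 'qstat -f' output uses the resources_used.* lines in the HH:MM:SS format shown in the sample above.

```python
import re


def hms_to_seconds(hms):
    """Convert an HH:MM:SS string (hours may exceed 99) to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parallel_efficiency(qstat_text):
    """Return cput / (ncpus * walltime) parsed from 'qstat -f' output text.

    Assumes the text contains resources_used.cput, resources_used.ncpus
    and resources_used.walltime lines, as in the sample output.
    """
    fields = dict(re.findall(r"resources_used\.(\w+)\s*=\s*(\S+)", qstat_text))
    cput = hms_to_seconds(fields["cput"])
    walltime = hms_to_seconds(fields["walltime"])
    ncpus = int(fields["ncpus"])
    if walltime == 0:
        # Job has only just started; no meaningful ratio yet.
        return 0.0
    return cput / (ncpus * walltime)


# Sample output copied from the article (job 00001).
SAMPLE = """
    resources_used.cpupercent = 96
    resources_used.cput = 00:47:58
    resources_used.mem = 59792kb
    resources_used.ncpus = 2
    resources_used.vmem = 308592kb
    resources_used.walltime = 00:48:40
"""

if __name__ == "__main__":
    efficiency = parallel_efficiency(SAMPLE)
    print(f"parallel efficiency: {efficiency:.0%}")
```

To check a live job, the text passed to parallel_efficiency could come from the output of 'qstat -f jobnumber', for example by saving it to a file or capturing it with Python's subprocess module. A value near 1.0 means the reserved CPUs are being used; a value near 1/ncpus suggests that only one CPU is doing the work.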
