ENGR 692: Parallel Programming
The University of Mississippi
Spring 2007
MCSR Course Support Page

Checking For Orphaned Processes on Mimosa

When your mimosa PBS job completes, aborts, or is deleted, there is a possibility that some of the parallel processes on the compute nodes may still be running, unbeknownst to PBS. It is important to check for, and kill, such orphaned processes after your job completes, since their survival could endanger the life of subsequent jobs, or compromise their performance.

checkprocs_me

should do the first part of the trick: checking for the orphaned processes. You can run it interactively after your job dies, or add it as the last line in your PBS script and look for the output in the PBS output file. If all of your processes have died nicely, the output will simply be the names of all of the compute nodes assigned to the MCSR-CA queue:

node4-1
node4-2
node4-3
...
node5-8

If, however, some processes owned by you appear on one or more of the nodes, with output similar to this:

node4-1
node4-2
root 20690 20689 0 14:07 ? 00:00:00 login -- en692603
en692603 20718 20691 0 14:08 pts/0 00:00:00 -bash
en692603 20691 20690 0 14:07 pts/0 00:00:00 -bash
en692603 20719 20718 0 14:08 pts/0 00:00:00 sleep 20
node4-3
...
node5-7
node5-8

then you will need to use rsh to kill processes on the nodes where they are found, until they are all gone. A crude script you can use to do this is:

killproc

Usage: killproc node-id process-id
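
When several orphans show up, it can help to generate the killproc invocations from the checkprocs_me output instead of typing each node and process id by hand. The following is only a sketch: the function name orphans_to_kill_cmds is hypothetical, and it assumes that checkprocs_me writes to standard output in the format shown in the sample above (a node name on its own line, followed by ps -ef style lines with the owner in the first column and the process id in the second).

```shell
#!/bin/sh
# Hypothetical helper (not an MCSR script): read checkprocs_me-style output on
# stdin and print one "killproc node-id process-id" line per process owned by
# the given account. Nothing is killed here; the commands are only printed.
orphans_to_kill_cmds() {
    awk -v owner="$1" '
        # A line such as "node4-2" names the node that the next lines belong to.
        /^node[0-9]+-[0-9]+$/ { node = $1; next }
        # Assumed ps -ef layout: column 1 is the owner, column 2 is the PID.
        $1 == owner && node != "" { print "killproc", node, $2 }
    '
}

# Possible use (assumes checkprocs_me prints to stdout):
#   checkprocs_me | orphans_to_kill_cmds "$USER"
```

Lines owned by other accounts, such as the root-owned login line in the sample, are skipped. Review the printed commands before running any of them.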

If you cannot get your processes to die, send email to assist at mcsr.olemiss.edu and we will kill them for you.

Another script:

killprocs_me

will attempt to kill all of the orphaned processes discovered by checkprocs_me. You might need to run it twice, then run checkprocs_me to make sure they all died. It would be a good idea to put these two lines as the last lines of all your PBS jobs:

killprocs_me

If your computation is not working as expected in parallel, you might just make sure that none of your classmates have stranded processes. You can check to see whether they have any processes running on the MCSR-CA compute nodes with this script:

checkprocs_en6

It works just like checkprocs_me, but casts a wider net. If you see that some other Engineering 692 classmates have processes running, then note the account id owning the processes, and then use qstat to see if they have PBS jobs running. If they have processes, but no PBS jobs, then you may assume that the processes are orphans, and you should let the student, the professor, or the MCSR support staff know so that we can get the orphan processes nice and killed.
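
The qstat check described above can be sketched as a small filter. This is an illustration under assumptions, not an MCSR tool: the function name has_pbs_jobs is hypothetical, and it assumes the default qstat listing, in which the third column holds the account that owns each job.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the given account owns at least one
# job in the qstat listing read from stdin, and fail (exit 1) otherwise.
# Assumes the third whitespace-separated column is the job owner.
has_pbs_jobs() {
    awk -v user="$1" '
        $3 == user { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Possible use, with en692603 standing in for the account you noted:
#   qstat | has_pbs_jobs en692603 || echo "no PBS jobs; processes may be orphans"
```

If the account has processes on the nodes but this check finds no jobs, that matches the orphan case described above. Some qstat versions shorten long account names, so confirm against the raw qstat output before reporting.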

Viewing the PBS Nodes File on Mimosa

On mimosa, the file:

/var/spool/PBS/server_priv/nodes

lists all of the compute nodes in the cluster, along with their PBS attributes. To see a list of the nodes that are associated with the MCSR-CA queue, you can run:

grep MCSR-CA /var/spool/PBS/server_priv/nodes
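
If only the node names are wanted, for example to loop over them in a script, the grep output can be trimmed to its first field. This is a minimal sketch: the function name nodes_for is hypothetical, and it assumes each line of the nodes file begins with the node name followed by whitespace-separated attributes.

```shell
#!/bin/sh
# Hypothetical helper: print only the node names from the lines of a PBS nodes
# file that mention the given string (for example a queue name).
# Assumes the node name is the first whitespace-separated field on each line.
nodes_for() {
    grep -F -- "$1" "$2" | awk '{ print $1 }'
}

# Possible use on mimosa:
#   nodes_for MCSR-CA /var/spool/PBS/server_priv/nodes
```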

Chapter 4 Circuit MPI Exercise

The exercise notes for the Chapter 4 MPI lab dealing with the circuit program can be found here.

MCSR MPI Web Page

The MCSR MPI Web Page, including 3 additional MPI exercises for mimosa, can be found here.

MCSR PBS Tutorial

A tutorial for using PBS at MCSR can be found here.

Speedup Timings Worksheet

A worksheet for calculating the speedup and parallel efficiency of calculations is here.
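
For a quick check at the command line, the same quantities can be computed with the usual definitions: speedup is the serial run time divided by the parallel run time, and parallel efficiency is the speedup divided by the number of processors. The sketch below is not the worksheet itself, and the function name speedup is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: speedup SERIAL_SECONDS PARALLEL_SECONDS NUM_PROCESSORS
# Uses the standard definitions:
#   speedup    = serial time / parallel time
#   efficiency = speedup / number of processors
speedup() {
    awk -v t1="$1" -v tp="$2" -v p="$3" 'BEGIN {
        s = t1 / tp
        printf "speedup=%.2f efficiency=%.2f\n", s, s / p
    }'
}

# Example: a 100 second serial run that takes 25 seconds on 8 processors.
#   speedup 100 25 8    # prints: speedup=4.00 efficiency=0.50
```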


Last Modified: June 08, 2007 10:31:48. Copyright © 1997-2012 The Mississippi Center for Supercomputing Research. All Rights Reserved. The University of Mississippi