ENGR 692: Parallel Programming
The University of Mississippi
Spring 2007
MCSR Course Support Page
Checking For Orphaned Processes on Mimosa
When your mimosa PBS job completes, aborts, or is deleted, there is a possibility that some of the parallel processes on the compute nodes may still be running, unbeknownst to PBS. It is important to check for, and kill, such orphaned processes after your job completes, since their survival could endanger the life of subsequent jobs, or compromise their performance.
checkprocs_me
should do the first part of the trick--checking for the orphaned processes. You can run it interactively after your job dies, or add it as the last line in your PBS script and look for the output in the PBS output file. If all of your processes have died nicely, the output will simply be the names of all of the compute nodes assigned to the MCSR-CA queue:
node4-1
node4-2
node4-3
...
node5-8
If, however, some processes owned by you appear on one or more of the nodes, with output similar to this:
node4-1
node4-2
root 20690 20689 0 14:07 ? 00:00:00 login -- en692603
en692603 20718 20691 0 14:08 pts/0 00:00:00 -bash
en692603 20691 20690 0 14:07 pts/0 00:00:00 -bash
en692603 20719 20718 0 14:08 pts/0 00:00:00 sleep 20
node4-3
...
node5-7
node5-8
then you will need to use rsh to kill processes on the nodes where they are found, until they are all gone. A crude script you can use to do this is:
killproc
Usage: killproc node-id process-id
If you cannot get your processes to die, send email to assist at mcsr.olemiss.edu and we will kill them for you.
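Before killing anything by hand, it can help to list exactly which rsh commands are needed. The sketch below is a dry run: it reads output in the format shown above (a node name on its own line, followed by ps-style lines whose first field is the owner and second field is the process id) and prints one rsh kill command per process owned by the given account. The helper name plan_kills is hypothetical and is not one of the MCSR scripts; it also assumes that checkprocs_me writes its report to standard output.

```shell
#!/bin/sh
# Dry run: turn checkprocs_me-style output into rsh kill commands.
# plan_kills is a hypothetical helper, not an MCSR-provided script.
# It only prints the commands; nothing is killed.
plan_kills() {
    user="$1"
    awk -v user="$user" '
        # A line with a single field beginning with "node" names the
        # compute node that the following process lines belong to.
        NF == 1 && $1 ~ /^node/ { node = $1; next }
        # Process lines: field 1 is the owner, field 2 is the PID.
        $1 == user && node != "" { print "rsh " node " kill " $2 }
    '
}
```

Assuming the above, checkprocs_me | plan_kills en692603 would print the commands for review; once they look right, they can be run one at a time or piped to sh.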
Another script:
killprocs_me
will attempt to kill all of the orphaned processes discovered by checkprocs_me. You might need to run it twice, then run checkprocs_me to make sure they all died. It would be a good idea to put these two lines as the last lines of all your PBS jobs:
killprocs_me
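As a rough illustration, the end of a PBS script could follow the advice above by running killprocs_me twice and then checkprocs_me to confirm the result. This is only a sketch: the function name cleanup_orphans is hypothetical, the job commands are left as a placeholder, and the guard that skips a script that is not on the PATH is an added assumption so that the example runs harmlessly outside mimosa.

```shell
#!/bin/sh
#PBS -q MCSR-CA

# PBS sets PBS_O_WORKDIR to the directory the job was submitted from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-.}"

# ... the commands for your parallel job go here ...

# cleanup_orphans is a hypothetical wrapper: kill twice, then check.
cleanup_orphans() {
    for cmd in killprocs_me killprocs_me checkprocs_me; do
        if command -v "$cmd" >/dev/null 2>&1; then
            "$cmd"
        else
            echo "skipping $cmd (not found)"
        fi
    done
}

# Last step of the job: the report lands in the PBS output file.
cleanup_orphans
```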
If your computation is not working as expected in parallel, you might check that none of your classmates has stranded processes. You can check to see whether they have any processes running on the MCSR-CA compute nodes with this script:
checkprocs_en6
It works just like checkprocs_me, but casts a wider net. If you see that some other Engineering 692 classmates have processes running, then note the account id owning the processes, and then use qstat to see if they have PBS jobs running. If they have processes, but no PBS jobs, then you may assume that the processes are orphans, and you should let the student, the professor, or the MCSR support staff know so that we can get the orphan processes nice and killed.
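The qstat check described above can be sketched as a small helper. It relies on the -u option of qstat, which restricts the listing to one user's jobs, and it assumes that qstat prints nothing at all when that user has no jobs; confirm that on mimosa before trusting the result. The function names has_pbs_jobs and report_user are hypothetical.

```shell
#!/bin/sh
# has_pbs_jobs and report_user are hypothetical helpers.
has_pbs_jobs() {
    # Assumption: `qstat -u USER` produces no output when USER has no jobs.
    [ -n "$(qstat -u "$1" 2>/dev/null)" ]
}

report_user() {
    if has_pbs_jobs "$1"; then
        echo "$1 has PBS jobs; the processes may be legitimate"
    else
        echo "$1 has no PBS jobs; the processes are probably orphans"
    fi
}
```

For example, report_user en692603 would report on the account that appears in the sample output earlier on this page.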
Viewing the PBS Nodes File on Mimosa
On mimosa, the file:
/var/spool/PBS/server_priv/nodes
lists all of the compute nodes in the cluster, along with their PBS attributes. To see a list of the nodes that are associated with the MCSR-CA queue, you can run:
grep MCSR-CA /var/spool/PBS/server_priv/nodes
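If only the node names are wanted, the grep output can be trimmed to its first column. The sketch below assumes the usual PBS nodes file layout, in which each line begins with the node name followed by its attributes; the function name queue_nodes is hypothetical.

```shell
#!/bin/sh
# queue_nodes is a hypothetical helper.
# $1 = queue (attribute) name, $2 = nodes file (defaults to the mimosa path).
queue_nodes() {
    file="${2:-/var/spool/PBS/server_priv/nodes}"
    # Assumption: the node name is the first whitespace-separated field.
    grep -- "$1" "$file" | awk '{ print $1 }'
}
```

Running queue_nodes MCSR-CA would then list one node name per line.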
Chapter 4 Circuit MPI Exercise
The exercise notes for the Chapter 4 MPI lab dealing with the circuit program can be found here.
MCSR MPI Web Page
The MCSR MPI Web Page, including 3 additional MPI exercises for mimosa, can be found here.
MCSR PBS Tutorial
A tutorial for using PBS at MCSR can be found here.
Speedup Timings Worksheet
A worksheet for calculating the speedup and parallel efficiency of calculations is here.
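For a quick check at the command line, the same two quantities can be computed from a pair of timings, using the standard definitions: speedup is the one-processor time divided by the p-processor time, and parallel efficiency is the speedup divided by p. The function name speedup is hypothetical, and this sketch is not taken from the worksheet itself.

```shell
#!/bin/sh
# speedup is a hypothetical helper.
# $1 = time on one processor, $2 = time on p processors, $3 = p
speedup() {
    awk -v t1="$1" -v tp="$2" -v p="$3" 'BEGIN {
        s = t1 / tp                      # speedup = T1 / Tp
        printf "speedup=%.2f efficiency=%.2f\n", s, s / p
    }'
}
```

With timings of 100 seconds on one processor and 25 seconds on 8 processors, speedup 100 25 8 gives a speedup of 4.00 and an efficiency of 0.50.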