<html>
<head>
<style>
.hmmessage P
{
margin:0px;
padding:0px
}
body.hmmessage
{
FONT-SIZE: 10pt;
FONT-FAMILY:Tahoma
}
</style>
</head>
<body class='hmmessage'>
Hello,<BR>
<BR>
I have a 64-node cluster on which I am trying to run Linpack. Each node has 16 cores and 4 HCAs. After building multirail MVAPICH, I can successfully run Linpack on 10 nodes (160 cores). However, when I try to run on more nodes than that, I get the errors below. Note that cn35 and cn36 are my 11th and 12th nodes in this case. The command I used was mpirun_rsh -np 192 -hostfile hostfile ./xhpl. If I remove two other nodes so that cn35 and cn36 are among my 10 nodes and run with -np 160, the job completes just fine. I have set ulimit -l to unlimited, and each node has MaxStartups set to 32 in /etc/ssh/sshd_config. Any help would be greatly appreciated.<BR>
<BR>
<BR>
[176] Abort: [cn35:176] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[175] Abort: [cn35:175] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[172] Abort: [cn35:172] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[173] Abort: [cn35:173] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[170] Abort: [cn35:170] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[171] Abort: [cn35:171] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[174] Abort: [cn35:174] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[178] Abort: [cn36:178] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[181] Abort: [cn36:181] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[180] Abort: [cn36:180] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[189] Abort: [cn36:189] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[183] Abort: [cn36:183] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[187] Abort: [191] Abort: [cn36:191] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[186] Abort: [cn36:186] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[185] Abort: [cn36:185] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[188] Abort: [cn36:188] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[177] Abort: [cn36:177] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[190] Abort: [cn36:190] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[cn36:187] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>[cn36:182] Got completion with 
error code 12<BR> at line 1277 in file viacheck.c<BR>[cn36:179] Got completion with error code 12<BR> at line 1277 in file viacheck.c<BR>Timeout alarm signaled<BR>Cleaning up all processes ...done.<BR>
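<BR>
In case it helps to summarize the output, the aborts can be tallied per node with a short script. This is only a minimal sketch: the function name tally_aborts and the regular expression are my own assumptions based on the messages above, not part of MVAPICH or HPL, and the sample text is a shortened excerpt of the log.<BR>
<pre>
```python
import re
from collections import Counter

# Matches the "[host:rank] Got completion with error code N" part of an
# abort message. \s+ tolerates a line wrap between "with" and "error".
ABORT_RE = re.compile(r"\[(\w+):(\d+)\] Got completion with\s+error code (\d+)")


def tally_aborts(log_text):
    """Count aborting ranks per (host, error code) in saved job output."""
    counts = Counter()
    for host, _rank, code in ABORT_RE.findall(log_text):
        counts[(host, int(code))] += 1
    return counts


if __name__ == "__main__":
    # Shortened excerpt of the output above, used only as sample input.
    sample = (
        "[176] Abort: [cn35:176] Got completion with error code 12\n"
        " at line 1277 in file viacheck.c\n"
        "[178] Abort: [cn36:178] Got completion with error code 12\n"
        " at line 1277 in file viacheck.c\n"
        "[cn36:182] Got completion with \nerror code 12\n"
    )
    for (host, code), n in sorted(tally_aborts(sample).items()):
        print(f"{host}: {n} rank(s) aborted with error code {code}")
```
</pre>
Running it on the full saved output should show whether the aborts are confined to the nodes added beyond the first ten.<BR>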
<BR>
Thanks<BR>
Tatek<BR></body>
</html>