<span style="font-weight: bold;">Hi,</span><br><br style="font-weight: bold;"><span style="font-weight: bold;">I have been running MVAPICH 1.0 over the OFED 1.2 uDAPL interface on four nodes.</span><br style="font-weight: bold;"><span style="font-weight: bold;">I ran 64 processes, i.e. 16 processes per node, and the job ran fine.</span><br style="font-weight: bold;"><br style="font-weight: bold;"><span style="font-weight: bold;">But after increasing the number of processes further, I started getting errors. Here are some of the final lines of the output from a run with 68 processes on 4 nodes, i.e. 17 processes per node:</span><br><br>[rdma_udapl_init.c:1875] error(-2147024846): Could not reset ep<br>rank 58 in job 12 in05_33381 caused collective abort of all ranks<br> exit status of rank 58: return code 1<br>[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep<br>[rdma_udapl_init.c:1875] error(-2147024849): Could not reset
ep<br>[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep<br>rank 57 in job 12 in05_33381 caused collective abort of all ranks<br> exit status of rank 57: killed by signal 9<br>[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep<br><br><br><span style="font-weight: bold;">And here is the corresponding output from a run with 200 processes, i.e. 50 processes per node:</span><br><br>hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.<br>hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.<br>hello: dapl/common/dapl_evd_util.c:1012: dapli_evd_cqe_to_event: Assertion `(((void *)0) != cookie)' failed.<br>rank 158 in job 16 in05_36664 caused collective abort of all ranks<br> exit status of rank 158: killed by signal 9<br>[rdma_udapl_init.c:1875] error(-2147024849): Could not reset ep<br>rank 86 in job 16
in05_36664 caused collective abort of all ranks<br> exit status of rank 86: killed by signal 9<br>rank 46 in job 16 in05_36664 caused collective abort of all ranks<br> exit status of rank 46: killed by signal 9<br>rank 6 in job 16 in05_36664 caused collective abort of all ranks<br> exit status of rank 6: killed by signal 9<br><br><br><span style="font-weight: bold;">Could anybody please tell me:<br><br>Why does increasing the number of processes result in this erratic behaviour?<br>Is there a limit affecting this run that needs to be changed?<br>What is the solution for running a larger number of processes successfully?<br><br>Thanks,<br>Jasjit Singh</span><p> 