Hi again -<br><br>I noticed this in the benchmark code:<br><pre>int large_message_size = 8192;</pre><br>Does MVAPICH internally treat messages over 8192 bytes differently from those of 8 KB or less? Could something be wrong with how I've configured InfiniBand? I already had a program running over IB with OpenMPI on this system, but maybe I need to configure something special for MVAPICH?
<br><br>Sorry if I appear to be grasping at straws... but I am ;)<br><br>Thanks,<br> Brian<br><br><div class="gmail_quote">On Jan 3, 2008 5:46 PM, Brian Budge <<a href="mailto:brian.budge@gmail.com">brian.budge@gmail.com
</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi all -<br><br>I'm new to the list here... hi! I have been using OpenMPI for a while, and LAM before that, but new requirements keep pushing me to new implementations. In particular, I am interested in using InfiniBand (with OFED
1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is the library for that particular combination :)<br><br>In any case, I installed MVAPICH, and I can boot the daemons and run the ring speed test with no problems. When I run a program with mpirun, however, I get an error as soon as it sends or receives more than 8192 bytes.
<br><br>For example, if I run the bandwidth test from the benchmarks page (osu_bw.c), I get the following:<br>---------------------------------------------------------------<br>budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out Thursday 06:16:00
<br>burn<br>burn-3<br># OSU MPI Bandwidth Test v3.0<br># Size Bandwidth (MB/s)<br>1 1.24<br>2 2.72<br>4 5.44<br>8 10.18
<br>16 19.09<br>32 29.69<br>64 65.01<br>128 147.31<br>256 244.61<br>512 354.32<br>1024 367.91
<br>2048 451.96<br>4096 550.66<br>8192 598.35<br>[1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send<br>Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
<br>Fatal error in MPI_Waitall:<br>Other MPI error, error stack:<br>MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0, status_array=0xdb3140) failed<br>(unknown)(): Other MPI error<br>rank 1 in job 4 burn_37156 caused collective abort of all ranks
<br> exit status of rank 1: killed by signal 9 <br>---------------------------------------------------------------<br><br>I get a similar problem with the latency test, although the protocol value reported in the error is different:
<br>--------------------------------------------------------------------<br>budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out Thursday 09:21:20<br># OSU MPI Latency Test v3.0<br># Size Latency (us)
<br>0 3.93<br>1 4.07<br>2 4.06<br>4 3.82<br>8 3.98<br>16 4.03<br>32 4.00
<br>64 4.28<br>128 5.22<br>256 5.88<br>512 8.65<br>1024 9.11<br>2048 11.53<br>4096 16.17
<br>8192 25.67<br>[1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to send<br>Internal Error: invalid error code ffffffff (Ring Index out of range) in MPIDI_CH3_RndvSend:263
<br>Fatal error in MPI_Recv:<br>Other MPI error, error stack:<br>MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1, MPI_COMM_WORLD, status=0x7fff14c7bde0) failed<br>(unknown)(): Other MPI error<br>
rank 1 in job 5 burn_37156 caused collective abort of all ranks<br>--------------------------------------------------------------------<br><br>The reported protocol values (0 for the bandwidth test, 8126589 for the latency test) stay the same each time I rerun the programs.
<br><br>Anyone have any ideas? If you need more info, please let me know.<br><br>Thanks,<br><font color="#888888"> Brian<br><br>
</font></blockquote></div><br>