Hi Wei -<br><br>I am running Gentoo Linux on amd64, with 2 or 4 Opteron 8216 processors per node. The kernel is 2.6.23-gentoo-r4 SMP. I have InfiniBand built into the kernel:<br><br>CONFIG_INFINIBAND=y<br>CONFIG_INFINIBAND_USER_MAD=y<br>CONFIG_INFINIBAND_USER_ACCESS=y
<br>CONFIG_INFINIBAND_USER_MEM=y<br>CONFIG_INFINIBAND_ADDR_TRANS=y<br>CONFIG_INFINIBAND_MTHCA=y<br>CONFIG_INFINIBAND_MTHCA_DEBUG=y<br>CONFIG_INFINIBAND_AMSO1100=y<br>CONFIG_MLX4_INFINIBAND=y<br>CONFIG_INFINIBAND_IPOIB=y<br>
CONFIG_INFINIBAND_IPOIB_DEBUG=y<br><br>I am using the openib-mvapich2-1.0.1 package from the gentoo-science overlay, in addition to the standard Gentoo packages. I have also tried 1.0, with the same results.<br><br>I compiled with multithreading turned on (I haven't tried without it; the sample codes I am initially testing are not multithreaded, although my application is). I also tried with and without RDMA, with no change. The script seems to be setting the build for SMALL_CLUSTER.
<br><br>Let me know what other information would be useful.<br><br>Thanks,<br> Brian<br><br><br><br><div class="gmail_quote">On Jan 4, 2008 6:12 PM, wei huang <<a href="mailto:huanwei@cse.ohio-state.edu">huanwei@cse.ohio-state.edu
</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hi Brian,<br><br>Thanks for letting us know this problem. Would you please let us know some
<br>more details to help us locate the issue?<br><br>1) More details on your platform.<br><br>2) The exact version of mvapich2 you are using. Is it from the OFED package, or<br>some version from our website?<br><br>3) If it is from our website, did you change anything from the default
<br>compiling scripts?<br><br>Thanks.<br><font color="#888888"><br>-- Wei<br></font><div><div></div><div class="Wj3C7c">> I'm new to the list here... hi! I have been using OpenMPI for a while, and<br>> LAM before that, but new requirements keep pushing me to new
<br>> implementations. In particular, I was interested in using infiniband (using<br>> OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is the<br>> library for that particular combination :)
<br>><br>> In any case, I installed MVAPICH, and I can boot the daemons, and run the<br>> ring speed test with no problems. When I run any programs with mpirun,<br>> however, I get an error when sending or receiving more than 8192 bytes.
<br>><br>> For example, if I run the bandwidth test from the benchmarks page<br>> (osu_bw.c), I get the following:<br>> ---------------------------------------------------------------<br>> budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out
<br>> Thursday 06:16:00<br>> burn<br>> burn-3<br>> # OSU MPI Bandwidth Test v3.0<br>> # Size Bandwidth (MB/s)<br>> 1 1.24<br>> 2 2.72<br>> 4
5.44<br>> 8 10.18<br>> 16 19.09<br>> 32 29.69<br>> 64 65.01<br>> 128 147.31<br>> 256
244.61<br>> 512 354.32<br>> 1024 367.91<br>> 2048 451.96<br>> 4096 550.66<br>> 8192 598.35<br>> [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to send
<br>> Internal Error: invalid error code ffffffff (Ring Index out of range) in<br>> MPIDI_CH3_RndvSend:263<br>> Fatal error in MPI_Waitall:<br>> Other MPI error, error stack:<br>> MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,
<br>> status_array=0xdb3140) failed<br>> (unknown)(): Other MPI error<br>> rank 1 in job 4 burn_37156 caused collective abort of all ranks<br>> exit status of rank 1: killed by signal 9<br>> ---------------------------------------------------------------
<br>><br>> I get a similar problem with the latency test, however, the protocol that is<br>> complained about is different:<br>> --------------------------------------------------------------------<br>> budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out<br>> Thursday 09:21:20<br>> # OSU MPI Latency Test v3.0<br>> # Size Latency (us)<br>> 0 3.93<br>> 1
4.07<br>> 2 4.06<br>> 4 3.82<br>> 8 3.98<br>> 16 4.03<br>> 32 4.00<br>> 64
4.28<br>> 128 5.22<br>> 256 5.88<br>> 512 8.65<br>> 1024 9.11<br>> 2048 11.53<br>> 4096
16.17<br>> 8192 25.67<br>> [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req to<br>> send<br>> Internal Error: invalid error code ffffffff (Ring Index out of range) in
<br>> MPIDI_CH3_RndvSend:263<br>> Fatal error in MPI_Recv:<br>> Other MPI error, error stack:<br>> MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0, tag=1,<br>> MPI_COMM_WORLD, status=0x7fff14c7bde0) failed
<br>> (unknown)(): Other MPI error<br>> rank 1 in job 5 burn_37156 caused collective abort of all ranks<br>> --------------------------------------------------------------------<br>><br>> The protocols (0 and 8126589) are consistent if I run the program multiple
<br>> times.<br>><br>> Anyone have any ideas? If you need more info, please let me know.<br>><br>> Thanks,<br>> Brian<br>><br><br></div></div></blockquote></div><br>