Hi Wei -<br><br>I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference. <br><br>Building with RDMA enabled adds the following:<br> export LIBS="${LIBS} -lrdmacm"<br> export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"
<br><br>It seems that I am using the make.mvapich2.detect script to build. It asks me for my interface and offers the Mellanox interface as an option, which I choose.<br><br>I just tried a fresh install directly from the tarball instead of using the Gentoo package. Now the program completes (it gets past the 8K message size), but my bandwidth isn't very good. Running the osu_bw.c test, I get about 250 MB/s maximum. It seems like IB isn't being used.
<br><br>I did the following:<br>./make.mvapich2.detect # and chose the Mellanox option<br>./configure --enable-threads=multiple<br>make<br>make install<br><br>So it seems that the package is doing something to enable InfiniBand that I am not doing with the tarball. On the other hand, the tarball build runs without crashing.
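<br><br>One rough way to check the "IB isn't being used" suspicion is to see whether the benchmark binary is even dynamically linked against the InfiniBand libraries. This is only a sketch under assumptions: has_ib_libs is a made-up helper name, ./a.out is the benchmark binary from the runs in this thread, and a statically linked MVAPICH2 would not show up in ldd output at all.

```shell
#!/bin/sh
# Sketch: check whether a binary is dynamically linked against the
# InfiniBand verbs / RDMA CM libraries. has_ib_libs is a hypothetical
# helper name, not part of MVAPICH2.

# Reads `ldd` output on stdin; succeeds if libibverbs or librdmacm is listed.
has_ib_libs() {
    grep -E -q 'libibverbs|librdmacm'
}

# Assumption: the benchmark binary is ./a.out unless a path is given.
bin="${1:-./a.out}"

if ldd "$bin" 2>/dev/null | has_ib_libs; then
    echo "$bin: links libibverbs/librdmacm"
else
    echo "$bin: no libibverbs/librdmacm in ldd output"
fi
```

A miss here does not prove that IB is unused (static linking would hide the libraries), and a hit does not prove the RDMA fast path is taken. It only helps narrow down whether the tarball build and the package build differ at link time.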
<br><br>Advice?<br><br>Thanks,<br> Brian<br><br><div class="gmail_quote">On Jan 6, 2008 6:38 AM, wei huang <
<a href="mailto:huanwei@cse.ohio-state.edu" target="_blank">huanwei@cse.ohio-state.edu</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi Brian,
<br><div><br>> I am using the openib-mvapich2-1.0.1 package in the gentoo-science overlay<br>> addition to the standard gentoo packages. I have also tried 1.0 with the<br>> same results.<br>><br>
> I compiled with multithreading turned on (haven't tried without this, but<br>> the sample codes I am initially testing are not multithreaded, although my<br>> application is). I also tried with or without rdma with no change. The
<br>> script seems to be setting the build for SMALL_CLUSTER.<br><br></div>So you are using make.mvapich2.ofa to compile the package? I am a bit<br>confused about "I also tried with or without rdma with no change". What
<br>exact change did you make here? Also, SMALL_CLUSTER is obsolete for the OFA<br>stack...<br><font color="#888888"><br>-- Wei<br></font><div><div></div><div><br>><br>> Let me know what other information would be useful.
<br>><br>> Thanks,<br>> Brian<br>><br>><br>><br>> On Jan 4, 2008 6:12 PM, wei huang <<a href="mailto:huanwei@cse.ohio-state.edu" target="_blank">huanwei@cse.ohio-state.edu</a>> wrote:<br>>
<br>> > Hi Brian,
<br>> ><br>> > Thanks for letting us know about this problem. Would you please give us some<br>> > more details to help us locate the issue?<br>> ><br>> > 1) More details on your platform.<br>
> >
<br>> > 2) Exact version of mvapich2 you are using. Is it from OFED package? or<br>> > some version from our website.<br>> ><br>> > 3) If it is from our website, did you change anything from the default
<br>> > compiling scripts?<br>> ><br>> > Thanks.<br>> ><br>> > -- Wei<br>> > > I'm new to the list here... hi! I have been using OpenMPI for a while,<br>> > and<br>> > > LAM before that, but new requirements keep pushing me to new
<br>> > > implementations. In particular, I was interested in using infiniband<br>> > (using<br>> > > OFED 1.2.5.1) in a multi-threaded environment. It seems that MVAPICH is
<br>> > the<br>> > > library for that particular combination :)<br>> > ><br>> > > In any case, I installed MVAPICH, and I can boot the daemons, and run<br>> > the<br>> > > ring speed test with no problems. When I run any programs with mpirun,
<br>> > > however, I get an error when sending or receiving more than 8192 bytes.<br>> > ><br>> > > For example, if I run the bandwidth test from the benchmarks page<br>> > > (osu_bw.c), I get the following:
<br>> > > ---------------------------------------------------------------<br>> > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out<br>> > > Thursday 06:16:00<br>> > > burn<br>
> > > burn-3
<br>> > > # OSU MPI Bandwidth Test v3.0<br>> > > # Size Bandwidth (MB/s)<br>> > > 1 1.24<br>> > > 2 2.72<br>> > > 4
5.44<br>> > > 8 10.18<br>> > > 16 19.09<br>> > > 32 29.69<br>> > > 64 65.01<br>> > > 128
147.31<br>> > > 256 244.61<br>> > > 512 354.32<br>> > > 1024 367.91<br>> > > 2048 451.96<br>> > > 4096
550.66<br>> > > 8192 598.35<br>> > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to<br>> > send<br>> > > Internal Error: invalid error code ffffffff (Ring Index out of range) in
<br>> > > MPIDI_CH3_RndvSend:263<br>> > > Fatal error in MPI_Waitall:<br>> > > Other MPI error, error stack:<br>> > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,<br>> > > status_array=0xdb3140) failed
<br>> > > (unknown)(): Other MPI error<br>> > > rank 1 in job 4 burn_37156 caused collective abort of all ranks<br>> > > exit status of rank 1: killed by signal 9<br>> > > ---------------------------------------------------------------
<br>> > ><br>> > > I get a similar problem with the latency test, however, the protocol<br>> > that is<br>> > > complained about is different:<br>> > > --------------------------------------------------------------------
<br>> > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out<br>> > > Thursday 09:21:20<br>> > > # OSU MPI Latency Test v3.0<br>> > > # Size Latency (us)<br>> > > 0
3.93<br>> > > 1 4.07<br>> > > 2 4.06<br>> > > 4 3.82<br>> > > 8 3.98<br>> > > 16
4.03<br>> > > 32 4.00<br>> > > 64 4.28<br>> > > 128 5.22<br>> > > 256 5.88<br>> > > 512
8.65<br>> > > 1024 9.11<br>> > > 2048 11.53<br>> > > 4096 16.17<br>> > > 8192 25.67<br>> > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv req
<br>> > to<br>> > > send<br>> > > Internal Error: invalid error code ffffffff (Ring Index out of range) in<br>> > > MPIDI_CH3_RndvSend:263<br>> > > Fatal error in MPI_Recv:<br>> > > Other MPI error, error stack:
<br>> > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0,<br>> > tag=1,<br>> > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed<br>> > > (unknown)(): Other MPI error<br>
> > > rank 1 in job 5 burn_37156 caused collective abort of all ranks<br>> > > --------------------------------------------------------------------<br>> > ><br>> > > The protocols (0 and 8126589) are consistent if I run the program
<br>> > multiple<br>> > > times.<br>> > ><br>> > > Anyone have any ideas? If you need more info, please let me know.<br>> > ><br>> > > Thanks,<br>> > > Brian
<br>
> > ><br>> ><br>> ><br>><br><br></div></div></blockquote></div><br>