Hi Matt -<br><br>I have now done the install from the ofa build file, and I can boot and run the ring test, but now when I run the osu_bw.c benchmark, the executable dies in MPI_Init().<br><br>The things I altered in make.mvapich2.ofa
were:<br><br>OPEN_IB_HOME=${OPEN_IB_HOME:-/usr}<br>SHARED_LIBS=${SHARED_LIBS:-yes}<br><br>and on the configure line I added:<br> --disable-f77 --disable-f90 <br><br>Here is the error message that I am getting:<br><br>rank 1 in job 1 burn_60139 caused collective abort of all ranks
<br> exit status of rank 1: killed by signal 9 <br><br>Thanks,<br> Brian<br><br><div class="gmail_quote">On Jan 7, 2008 1:21 PM, Matthew Koop <<a href="mailto:koop@cse.ohio-state.edu">koop@cse.ohio-state.edu</a>> wrote:
<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Brian,<br><br>The make.mvapich2.detect script is just a helper script (not meant to be<br>
executed directly). You need to use the make.mvapich2.ofa script, which<br>will call configure and make for you with the correct arguments.<br><br>More information can be found in our MVAPICH2 user guide under<br>"4.4.1
Build MVAPICH2 with OpenFabrics Gen2-IB and iWARP"<br><br><a href="https://mvapich.cse.ohio-state.edu/support/" target="_blank">https://mvapich.cse.ohio-state.edu/support/</a><br><br>Let us know if you have any other problems.
<br><br>Matt<br><div><div></div><div class="Wj3C7c"><br><br><br><br>On Mon, 7 Jan 2008, Brian Budge wrote:<br><br>> Hi Wei -<br>><br>> I changed from SMALL_CLUSTER to MEDIUM_CLUSTER, but it made no difference.<br>
><br>> When I build with rdma, this adds the following:<br>> export LIBS="${LIBS} -lrdmacm"<br>> export CFLAGS="${CFLAGS} -DADAPTIVE_RDMA_FAST_PATH -DRDMA_CM"<br>><br>> It seems that I am using the
make.mvapich2.detect script to build. It asks<br>> me for my interface, and gives me the option for the mellanox interface,<br>> which I choose.<br>><br>> I just tried a fresh install directly from the tarball instead of using the
<br>> gentoo package. Now the program completes (goes beyond 8K message), but my<br>> bandwidth isn't very good. Running the osu_bw.c test, I get about 250 MB/s<br>> maximum. It seems like IB isn't being used.
<br>><br>> I did the following:<br>> ./make.mvapich2.detect #, and chose the mellanox option<br>> ./configure --enable-threads=multiple<br>> make<br>> make install<br>><br>> So it seems that the package is doing something to enable infiniband that I
<br>> am not doing with the tarball. Conversely, the tarball can run without<br>> crashing.<br>><br>> Advice?<br>><br>> Thanks,<br>> Brian<br>><br>> On Jan 6, 2008 6:38 AM, wei huang < <a href="mailto:huanwei@cse.ohio-state.edu">
huanwei@cse.ohio-state.edu</a>> wrote:<br>><br>> > Hi Brian,<br>> ><br>> > > I am using the openib-mvapich2-1.0.1 package in the gentoo-science<br>> > overlay<br>> > > addition to the standard gentoo packages. I have also tried
1.0 with<br>> > the<br>> > > same results.<br>> > ><br>> > > I compiled with multithreading turned on (haven't tried without this,<br>> > but<br>> > > the sample codes I am initially testing are not multithreaded, although
<br>> > my<br>> > > application is). I also tried with or without rdma with no change. The<br>> ><br>> > > script seems to be setting the build for SMALL_CLUSTER.<br>> ><br>> > So you are using
make.mvapich2.ofa to compile the package? I am a bit<br>> > confused about "I also tried with or without rdma with no change". What<br>> > exact change did you make here? Also, SMALL_CLUSTER is obsolete for the ofa
<br>> > stack...<br>> ><br>> > -- Wei<br>> ><br>> > ><br>> > > Let me know what other information would be useful.<br>> > ><br>> > > Thanks,<br>> > > Brian
<br>> > ><br>> > ><br>> > ><br>> > > On Jan 4, 2008 6:12 PM, wei huang <<a href="mailto:huanwei@cse.ohio-state.edu">huanwei@cse.ohio-state.edu</a>> wrote:<br>> > ><br>> > > > Hi Brian,
<br>> > > ><br>> > > > Thanks for letting us know about this problem. Would you please give us<br>> > some<br>> > > > more details to help us locate the issue?<br>> > > >
<br>> > > > 1) More details on your platform.<br>> > > ><br>> > > > 2) Exact version of mvapich2 you are using. Is it from the OFED package,<br>> > or<br>> > > > some version from our website?
<br>> > > ><br>> > > > 3) If it is from our website, did you change anything from the default<br>> ><br>> > > > compiling scripts?<br>> > > ><br>> > > > Thanks.
<br>> > > ><br>> > > > -- Wei<br>> > > > > I'm new to the list here... hi! I have been using OpenMPI for a<br>> > while,<br>> > > > and<br>> > > > > LAM before that, but new requirements keep pushing me to new
<br>> > > > > implementations. In particular, I was interested in using<br>> > infiniband<br>> > > > (using<br>> > > > > OFED 1.2.5.1) in a multi-threaded environment. It seems that
<br>> > MVAPICH is<br>> > > > the<br>> > > > > library for that particular combination :)<br>> > > > ><br>> > > > > In any case, I installed MVAPICH, and I can boot the daemons, and
<br>> > run<br>> > > > the<br>> > > > > ring speed test with no problems. When I run any programs with<br>> > mpirun,<br>> > > > > however, I get an error when sending or receiving more than 8192
<br>> > bytes.<br>> > > > ><br>> > > > > For example, if I run the bandwidth test from the benchmarks page<br>> > > > > (osu_bw.c), I get the following:<br>> > > > > ---------------------------------------------------------------
<br>> > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out<br>> > > > > Thursday 06:16:00<br>> > > > > burn<br>> > > > > burn-3<br>> > > > > # OSU MPI Bandwidth Test
v3.0<br>> > > > > # Size Bandwidth (MB/s)<br>> > > > > 1 1.24<br>> > > > > 2 2.72<br>> > > > > 4
5.44<br>> > > > > 8 10.18<br>> > > > > 16 19.09<br>> > > > > 32 29.69<br>> > > > > 64
65.01<br>> > > > > 128 147.31<br>> > > > > 256 244.61<br>> > > > > 512 354.32<br>> > > > > 1024
367.91<br>> > > > > 2048 451.96<br>> > > > > 4096 550.66<br>> > > > > 8192 598.35<br>> > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 0 type from rndv req to
<br>> > > > send<br>> > > > > Internal Error: invalid error code ffffffff (Ring Index out of<br>> > range) in<br>> > > > > MPIDI_CH3_RndvSend:263<br>> > > > > Fatal error in MPI_Waitall:
<br>> > > > > Other MPI error, error stack:<br>> > > > > MPI_Waitall(242): MPI_Waitall(count=64, req_array=0xdb21a0,<br>> > > > > status_array=0xdb3140) failed<br>> > > > > (unknown)(): Other MPI error
<br>> > > > > rank 1 in job 4 burn_37156 caused collective abort of all ranks<br>> > > > > exit status of rank 1: killed by signal 9<br>> > > > > ---------------------------------------------------------------
<br>> > > > ><br>> > > > > I get a similar problem with the latency test, however, the protocol<br>> > > > that is<br>> > > > > complained about is different:<br>> > > > > --------------------------------------------------------------------
<br>> ><br>> > > > > budge@burn:~/tests/testMvapich2> mpirun -np 2 ./a.out<br>> > > > > Thursday 09:21:20<br>> > > > > # OSU MPI Latency Test v3.0<br>> > > > > # Size Latency (us)
<br>> > > > > 0 3.93<br>> > > > > 1 4.07<br>> > > > > 2 4.06<br>> > > > > 4
3.82<br>> > > > > 8 3.98<br>> > > > > 16 4.03<br>> > > > > 32 4.00<br>> > > > > 64
4.28<br>> > > > > 128 5.22<br>> > > > > 256 5.88<br>> > > > > 512 8.65<br>> > > > > 1024
9.11<br>> > > > > 2048 11.53<br>> > > > > 4096 16.17<br>> > > > > 8192 25.67<br>> > > > > [1][ch3_rndvtransfer.c:112] Unknown protocol 8126589 type from rndv
<br>> > req<br>> > > > to<br>> > > > > send<br>> > > > > Internal Error: invalid error code ffffffff (Ring Index out of<br>> > range) in<br>> > > > > MPIDI_CH3_RndvSend:263
<br>> > > > > Fatal error in MPI_Recv:<br>> > > > > Other MPI error, error stack:<br>> > > > > MPI_Recv(186): MPI_Recv(buf=0xa8ff80, count=16384, MPI_CHAR, src=0,<br>> > > > tag=1,
<br>> > > > > MPI_COMM_WORLD, status=0x7fff14c7bde0) failed<br>> > > > > (unknown)(): Other MPI error<br>> > > > > rank 1 in job 5 burn_37156 caused collective abort of all ranks
<br>> > > > > --------------------------------------------------------------------<br>> > > > ><br>> > > > > The protocols (0 and 8126589) are consistent if I run the program<br>
> > > > multiple<br>> > > > > times.<br>> > > > ><br>> > > > > Anyone have any ideas? If you need more info, please let me know.<br>> > > > ><br>> > > > > Thanks,
<br>> > > > > Brian<br>> > > > ><br>> > > ><br>> > > ><br>> > ><br>> ><br>> ><br>><br><br></div></div></blockquote></div><br>