From nilesh_awate at yahoo.com Mon Dec 1 08:47:54 2008
From: nilesh_awate at yahoo.com (nilesh awate)
Date: Mon Dec 1 09:15:01 2008
Subject: [mvapich-discuss] messege truncated
References:
Message-ID: <193397.30657.qm@web94104.mail.in2.yahoo.com>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pallas_rdma_with_check_log
Type: application/octet-stream
Size: 131108 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081201/b6e9a278/pallas_rdma_with_check_log-0001.obj

From perkinjo at cse.ohio-state.edu Mon Dec 1 11:34:36 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Dec 1 11:34:59 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <87y6z4kmbs.fsf@taris.box>
References: <87y6z4kmbs.fsf@taris.box>
Message-ID: <20081201163435.GD2973@cse.ohio-state.edu>

On Fri, Nov 28, 2008 at 02:32:39PM +0100, Thomas Bach wrote:
> Hi,
>
> I misinterpreted some signs in my last mail. Compilation still
> fails. With the original make.mvapich.gen2 shipped by your
> distribution it stops somewhere at viainit.c.
>
> I changed that a bit to get shared-libs, f90-modules, and to
> explicitly compile the f77 stuff (which it doesn't by default).

Thomas:
In order to get Fortran support using the Sun Studio compilers, you'll
need to replace the -Wl,-rpath option with -R.

http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide-1.1.html#x1-420007.1.14

I see from your config-mine.log that this is the first issue encountered.
Can you try this out and let us know if this resolves this and any
subsequent issues?
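
A rough illustration of the flag change described above, offered as a
sketch rather than as part of the shipped make.mvapich.gen2. The helper
name to_sun_rpath and the /opt/ib/lib directory are made up for this
example. The attached form -R<dir> (no '=' between the flag and the
directory) is the one given later in this thread.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the GNU-style runtime path flag
# (-Wl,-rpath=<dir>) into the Sun Studio form (-R<dir>, no '=').
to_sun_rpath() {
    printf '%s\n' "$1" | sed 's/-Wl,-rpath=/-R/g'
}

# Placeholder library directory; substitute your own IBHOME_LIB.
IBHOME_LIB=/opt/ib/lib
gnu_libs="-L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread"

to_sun_rpath "$gnu_libs"
# prints: -L/opt/ib/lib -R/opt/ib/lib -libverbs -libumad -lpthread
```

The same substitution can of course be made by hand in the LIBS line of
the build script; the helper only makes the before/after form explicit.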
>
> So that the configure command looks like this:
> ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
> --with-romio --without-mpe -lib="$LIBS" --enable-cxx --enable-f77 \
> --enable-f90modules --enable-f90 --enable-sharedlib 2>&1 |tee config-mine.log
>
> Now viainit.c compiles, but I get errors with cpplib which are
> ignored. The test-routine still crashes with the same output as the
> last compilation without adapted configure.
>
> Also I can't find any fortran files in the installation.
>
> Greets,
> Thomas Bach.
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From bachth at uni-mainz.de Tue Dec 2 10:17:13 2008
From: bachth at uni-mainz.de (Thomas Bach)
Date: Tue Dec 2 10:17:39 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <20081201163435.GD2973@cse.ohio-state.edu> (Jonathan Perkins's message of "Mon, 1 Dec 2008 17:34:36 +0100")
References: <87y6z4kmbs.fsf@taris.box> <20081201163435.GD2973@cse.ohio-state.edu>
Message-ID: <87r64q7gjq.fsf@taris.box>

Jonathan Perkins writes:

> Thomas:
> In order to get fortran support using the Sun Studio compilers, you'll
> need to replace the -Wl,-rpath option with -R.

OK, so I first changed

export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread}

to

export LIBS=${LIBS:--L${IBHOME_LIB} -R -libverbs -libumad -lpthread}

and another time to

export LIBS=${LIBS:--L${IBHOME_LIB} -R=${IBHOME_LIB} -libverbs -libumad -lpthread}

The second one failed with more or less the same output. The first one
looked much better but still seems to fail. I attached the log file of
that one.

Especially compilation of the testing programs fails: $ mpicc -o cpi cpi.c cpi.o: In function `main': cpi.c:(.text+0x276): undefined reference to `fabs' $ mpiCC -o hello++ hello++.cc "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi++.h", line 40: Error: Could not open include file "mpi2c++/mpi2c++_config.h". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 32: Error: Could not open include file "mpi2c++/mpi2c++_config.h". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 57: Error: Type name expected instead of "MPI2CPP_BOOL_T". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 57: Error: Identifier expected instead of "const". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 57: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 58: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 58: Error: Type name expected instead of "MPI2CPP_BOOL_T". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 58: Error: Identifier expected instead of "const". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 58: Error: Multiple declaration for const. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 58: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi2c++_list.h", line 63: Error: The operation "List::iter != List::iter" is illegal. 
"/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/mpi++.h", line 97: Error: Type name expected instead of "_MPIPP_EXTERN_". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 32: Error: Type name expected instead of "_MPIPP_EXTERN_". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 32: Error: Identifier expected instead of "const". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 32: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 33: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 33: Error: Type name expected instead of "_MPIPP_EXTERN_". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 33: Error: Identifier expected instead of "const". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 33: Error: Multiple declaration for const. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 33: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 34: Error: Use ";" to terminate declarations. "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 34: Error: Type name expected instead of "_MPIPP_EXTERN_". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 34: Error: Identifier expected instead of "const". "/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 34: Error: Multiple declaration for const. 
"/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/include/mpi2c++/constants.h", line 34: Error: Use ";" to terminate declarations. Compilation aborted, too many Error messages. $ mpif90 pi3f90.f90 sunf90: Warning: Option -Wl,-rpath-link passed to ld, if ld is invoked, ignored otherwise sunf90: Warning: Option -Wl,/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-wl-rpath-r2/lib/shared passed to ld, if ld is invoked, ignored otherwise /usr/bin/ld: unrecognized option '-Wl,-rpath-link' /usr/bin/ld: use the --help option for usage information I was only able to succesfully compile and run iotest... My configure-line still looks like this: >> ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \ >> --with-romio --without-mpe -lib="$LIBS" --enable-cxx --enable-f77 \ >> --enable-f90modules --enable-f90 --enable-sharedlib 2>&1 |tee config-mine.log >> Greets, Thomas Bach. PS: Is mvapich-1.1 known to work with sunstudio? I'm not quite sure if the problem depends on our setup or if there is a basic conflict between mvapich and latest sunstudio. -------------- next part -------------- A non-text attachment was scrubbed... Name: make-mine.log.gz Type: application/octet-stream Size: 15007 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081202/ce2fa69e/make-mine.log-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: install-mine.log.gz Type: application/octet-stream Size: 962 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081202/ce2fa69e/install-mine.log-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: config-mine.log.gz
Type: application/octet-stream
Size: 5348 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081202/ce2fa69e/config-mine.log-0001.obj

From Terrence.LIAO at total.com Tue Dec 2 10:32:02 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Tue Dec 2 10:32:22 2008
Subject: [mvapich-discuss] Problem with slow start in mpirun_rsh using mvapich1.1
Message-ID:

Dear mvapich,

I have this slow start problem and do not know how to fix it. I am
trying to run a very simple hello world using this typical command:
mpirun_rsh -hostfile host.list -np 27 ./mpi_hello.exe
on our IB cluster. From time to time it runs, but in most cases it gives
me "Timeout during client startup". Any advice on fixing this problem?
Is VIA_CM_TIMEOUT a parameter I should tune for this?

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081202/eb6c3642/attachment.html

From sridharj at cse.ohio-state.edu Tue Dec 2 12:46:18 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Tue Dec 2 12:46:38 2008
Subject: [mvapich-discuss] Problem with slow start in mpirun_rsh using mvapich1.1
In-Reply-To:
References:
Message-ID: <4935746A.4090207@cse.ohio-state.edu>

Hi Terrence,

This timeout means we failed to launch mpispawn on some node within a
reasonable amount of time. This could be due to ssh or other network
issues. Some node isn't accepting ssh connections or has a large delay.
If you want a larger timeout, you need to export MPIRUN_TIMEOUT (in
seconds):

$ export MPIRUN_TIMEOUT=11111
$ mpirun_rsh ...

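
Before raising the timeout, it may help to find out which node is slow
to accept ssh connections. The following is only a sketch under stated
assumptions: check_hosts is a made-up helper name, the host file is
assumed to hold one host name per line (blank lines and lines starting
with '#' are skipped), and the 5 second default is arbitrary. It uses
the standard OpenSSH options BatchMode and ConnectTimeout.

```shell
#!/bin/sh
# Hypothetical helper: report hosts from a host file that do not answer
# a trivial ssh command within a few seconds.
#   $1 = host file, one host name per line (assumed format)
#   $2 = connect timeout in seconds (optional, defaults to 5)
check_hosts() {
    hostfile=$1
    timeout_s=${2:-5}
    while read -r host rest; do
        case $host in
            ''|\#*) continue ;;   # skip blank lines and comments
        esac
        # -n keeps ssh from swallowing the rest of the host file on stdin;
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ! ssh -n -o BatchMode=yes -o ConnectTimeout="$timeout_s" \
                "$host" true 2>/dev/null; then
            echo "no quick ssh answer from: $host"
        fi
    done < "$hostfile"
}

# Usage (not run here): check_hosts host.list 5
```

Any host the function reports is a candidate for the "Timeout during
client startup" message, though a clean result does not rule out other
network problems.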
You can also try to use rsh with -rsh flag to mpirun_rsh.

-Jaidev

On Tuesday 02 December 2008 10:32 AM, Terrence.LIAO@total.com wrote:
>
> Dear mvapich,
>
> I have this slow start problem and do not know how to fix it. I am
> trying to run a very simple hello world and using this typical command:
> mpirun_rsh -hostfile host.list -np 27 ./mpi_hello.exe on our IB cluster,
> from time to time it will run but in most case it give me "Timeout
> during client startup". Any advice to fix this problem. Is
> VIA_CM_TIMEOUT a parameter I should tune for this?
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao@total.com

From koop at cse.ohio-state.edu Tue Dec 2 14:57:19 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Tue Dec 2 14:57:34 2008
Subject: [mvapich-discuss] messege truncated
In-Reply-To: <193397.30657.qm@web94104.mail.in2.yahoo.com>
Message-ID:

Nilesh,

Using RDMA fastpath requires that your network adapter place data into
the destination buffer with the last byte being placed last. If this
guarantee does not hold, then there will be corruption. Mellanox gives
this guarantee for their InfiniBand adapters -- if your hardware does
not (or you are not sure), the RDMA fast path method should be turned
off for your system.

Thanks,
Matt

On Mon, 1 Dec 2008, nilesh awate wrote:

> Thanks for suggestion, sorry for late reply, initially we were using Pallas V2.2.
> Now with IMB3.2 over a proprietary network (nic & dapl), we tried
> mvapich2-1.2 without RDMA_FAST_PATH, using only the send-recv path. It
> worked fine for a long-duration run, but with the RDMA path it is
> failing. The error file is attached. As a cross-check we ran the same
> thing over a Mellanox network (dapl) and it works fine. What can we
> deduce from the above error?

________________________________
From: Dhabaleswar Panda
To: nilesh awate
Cc: MVAPICH2 ; pmb@pallas.com
Sent: Thursday, 27 November, 2008 3:25:02 AM
Subject: Re: [mvapich-discuss] messege truncated

Which version of Pallas are you running? As you might know, Pallas
benchmarks are outdated. They have been replaced with Intel MPI
Benchmarks (IMB). The latest version is 3.1. Can you try your tests with
IMB 3.1?

Thanks,
DK

On Tue, 25 Nov 2008, nilesh awate wrote:

> Hi all, I want to detail the information regarding this discussion as
> all my trials are failing over standards. I am using RHEL5 on AMD
> Opteron dual core, mvapich2-1.2 (dapl interconnect; with and without
> RDMA_FAST_PATH) with Mellanox network. I am running Pallas (with check)
> with the above setup. I got the following error:
>
> Fatal error in MPI_Recv:
> Message truncated, error stack:
> MPI_Recv(186)...........................: MPI_Recv(buf=0x7fff3072accc, count=896311571, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff3072acb0) failed
> MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -709721012
> rank 0 in job 5 test01_44984 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> The above error occurs in the SendRecv benchmark most of the time.
> I ran the same thing with gen2 and it worked fine . . .
> but with dapl interconnect it is failing.
>
> Waiting for reply,
> Nilesh
>
> Nilesh Awate
> C-DAC R&D

________________________________
From: Justin
To: nilesh awate
Cc: MVAPICH2
Sent: Friday, 21 November, 2008 9:09:51 PM
Subject: Re: [mvapich-discuss] messege truncated

One thing that I have used to track down bugs of this nature in the past
is to use the MPI_Errhandler functionality. Try placing this in your code
after MPI_Init:

    MPI_Errhandler_set(MPI_COMM_WORLD,MPI_ERRORS_RETURN);

Then at your MPI_Recv's add an if around them and some debugging output:

    // needs <iostream> and <unistd.h>; the loop spins until a debugger attaches
    if(MPI_Recv(...)!=MPI_SUCCESS)
    {
        char hostname[100];
        gethostname(hostname,100);
        cout << "MPI Recv returned error on " << hostname << ":" << getpid() << endl;
        cout << "Waiting for a debugger\n";
        while(1);
    }

Then from here you should be able to ssh into the back node doing the
processing (specified by the hostname above) and then attach gdb to the
process (specified by the pid above). Make sure you have compiled with
-g. Then look at the parameters to MPI_Recv and see if something doesn't
look right.

Good Luck,
Justin

nilesh awate wrote:
>
> Hi Justin,
>
> We are running Pallas over MPI (dapl interconnect). I got the same error while running Pallas with a tcp-ip (ethernet) network.
>
> Fatal error in MPI_Recv:
> Message truncated, error stack:
> MPI_Recv(186)...........................: MPI_Recv(buf=0x7fff23cdd22c, count=976479459, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff23cdd210) failed
> MPIDI_CH3U_Post_data_receive_found(163): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -389049460
>
> I am running it over a 5-node AMD cluster with this configuration (1Ghz Dual-Core AMD Opteron Processor 1216).
>
> I don't know how MPI_Recv got such a huge count when Pallas is sending at most 4194304 bytes.
>
> Is this some garbage value it receives?
>
> waiting for reply,
>
> Nilesh
>
> ------------------------------------------------------------------------
> *From:* Justin
> *To:* nilesh awate
> *Cc:* Dhabaleswar Panda ; MVAPICH2
> *Sent:* Thursday, 20 November, 2008 9:27:42 PM
> *Subject:* Re: [mvapich-discuss] messege truncated
>
> The message means MPI received a message larger than the buffer size you
> specified. Namely, in this case the buffer length is '-514665432', thus
> any length of message would be bigger than it. What I find odd is the
> parameters you are sending MPI_Recv. You are sending a count of
> '945075466'. Are you really sending a message that is a gigabyte in
> size? It might be possible that the count is being converted to a signed
> int, causing it to wrap to a negative number. Check the size that you
> are specifying for the buffer. It is odd that you have it specified to
> be a GB in size when you are only receiving 4 bytes.
> nilesh awate wrote:
> >
> > Thanks for suggestion (use mvapich2-1.2) sir,
> >
> > I have tried the same but still we are facing the same problem
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186).........................: MPI_Recv(buf=0x7fff1faf6008, count=945075466, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff1faf5fe0) failed
> > MPIDI_CH3U_Request_unpack_uebuf(590): Message truncated; 4 bytes received but buffer size is -514665432
> > rank 0 in job 4 test01_52519 caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> > is there any suggestion ?
> >
> > what does this error mean ?
> >
> > is this a result of data corruption/packet missing, or something else ?
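
The wrap-around suspected above can be checked with plain arithmetic.
This is a sketch, not anything taken from MVAPICH: wrap_int32 is a
made-up helper, MPI_INT is assumed to be 4 bytes, and the shell is
assumed to do 64-bit arithmetic (as bash does on Linux). It multiplies
the count by the element size and reads the low 32 bits of the product
as a signed integer.

```shell
#!/bin/sh
# wrap_int32 is a hypothetical helper, not an MPI or MVAPICH function.
# It multiplies an element count by an element size and reinterprets the
# low 32 bits of the product as a signed two's-complement integer.
wrap_int32() {
    bytes=$(( $1 * $2 ))
    bytes=$(( bytes & 4294967295 ))       # keep the low 32 bits
    if [ "$bytes" -ge 2147483648 ]; then
        bytes=$(( bytes - 4294967296 ))   # values >= 2^31 read as negative
    fi
    echo "$bytes"
}

# Counts from the error stacks in this thread, with 4-byte MPI_INT assumed.
wrap_int32 945075466 4    # prints -514665432
wrap_int32 976479459 4    # prints -389049460
wrap_int32 896311571 4    # prints -709721012
```

Each result matches the negative "buffer size" printed next to that
count in the thread, which suggests the negative size simply follows
from the huge count. Where the huge count itself comes from is a
separate question that this arithmetic does not answer.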
> >
> > waiting for reply
> > Nilesh Awate
> >
> > ------------------------------------------------------------------------
> > *From:* Dhabaleswar Panda
> > *To:* nilesh awate
> > *Cc:* MVAPICH2
> > *Sent:* Wednesday, 19 November, 2008 9:27:36 PM
> > *Subject:* Re: [mvapich-discuss] messege truncated
> >
> > MVAPICH2 1.2 was released around two weeks back. Can you try the latest
> > version.
> >
> > DK
> >
> > On Wed, 19 Nov 2008, nilesh awate wrote:
> >
> > > Hi all,
> > I am using mvapich2-1.0.3 with dapl interconnect (it's a proprietary nic & dapl library)
> > I got the following error while running Pallas over a 5-node (AMD dual core) cluster.
> >
> > Fatal error in MPI_Recv:
> > Message truncated, error stack:
> > MPI_Recv(186)..........................: MPI_Recv(buf=0x7fff24744cec, count=952788905, MPI_INT, src=2, tag=1000, MPI_COMM_WORLD, status=0x7fff24744cd0) failed
> > MPIDI_CH3U_Post_data_receive_found(243): Message from rank 2 and tag 1000 truncated; 4 bytes received but buffer size is -483811676
> > rank 0 in job 2 test01_40634 caused collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
> > Can you suggest where we should look to solve the above error?
> > What can we interpret from the above message?
> >
> > waiting for reply
> > thanking
> > Nilesh

From kvdheeraj at indiatimes.com Wed Dec 3 08:47:37 2008
From: kvdheeraj at indiatimes.com (Dheeraj KV)
Date: Wed Dec 3 08:54:26 2008
Subject: [mvapich-discuss] mpif77 and mpif90 not getting created
Message-ID: <690090932.70391228312057443.JavaMail.root@mbr8.indiatimes.com>

Hi
I want to install mvapich-1.1 on my supercomputer.
I have gcc version gcc-4.2 installed. I want to install mvapich on top
of that.
Though the installation is created, mpif77 and mpif90 are not created
in the bin directory of mvapich.
I have set the path of gfortran, but they are still not installed.
I have also installed mvapich with gcc-3.4.4 (default with RHEL 4
Update 5), but mpif90 was created as an empty file. Is it that
mvapich-1.1 can't be installed with gfortran?
Please suggest!

Thanks & Regards
Dheeraj K V

From perkinjo at cse.ohio-state.edu Wed Dec 3 11:23:19 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Wed Dec 3 11:23:35 2008
Subject: [mvapich-discuss] mpif77 and mpif90 not getting created
In-Reply-To: <690090932.70391228312057443.JavaMail.root@mbr8.indiatimes.com>
References: <690090932.70391228312057443.JavaMail.root@mbr8.indiatimes.com>
Message-ID: <20081203162318.GE2874@cse.ohio-state.edu>

Dheeraj:
Please take a look at section 7.1.5 of the MVAPICH user guide. There is
an environment variable that needs to be set during the build. Section
7.1.6 will also be useful to you once you're ready to run an MPI
program.

Here is a link for your convenience:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide-1.1.html#x1-330007.1.5

Please let me know if this helps.

On Wed, Dec 03, 2008 at 07:17:37PM +0530, Dheeraj KV wrote:
> Hi
> I want to install mvapich-1.1 on my super computer.
> I have gcc version gcc-4.2 installed. I want to install mvapich on top of that.
> Though its getting created mpif77 and mpif90 is not getting created on bin of mvapich.
> I have set the path of gfortran, but still its not getting installed.
> I have installed mvapich with gcc-3.4.4 (default with RHEL 4 up 5), but mpif90 created an empty file. Is it that mvapch-1.1 can't be installed with gfortran ?
> Please suggest !!!!
>
> Thanks & Regards
> Dheeraj K V

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu Thu Dec 4 12:12:04 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Thu Dec 4 12:12:22 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <87r64q7gjq.fsf@taris.box>
References: <87y6z4kmbs.fsf@taris.box> <20081201163435.GD2973@cse.ohio-state.edu> <87r64q7gjq.fsf@taris.box>
Message-ID: <20081204171203.GC2869@cse.ohio-state.edu>

Thomas:
We've tested mvapich-1.1 with a new install of Sun Studio 12 early this
week. Everything works fine for us. It looks like there is a slight
issue with the arguments that you're passing to the compiler.

You should use the second export line after removing the '='.

Ex.
export LIBS=${LIBS:--L${IBHOME_LIB} -R${IBHOME_LIB} -libverbs -libumad -lpthread}

On Tue, Dec 02, 2008 at 04:17:13PM +0100, Thomas Bach wrote:
> Jonathan Perkins writes:
> > Thomas:
> > In order to get fortran support using the Sun Studio compilers, you'll
> > need to replace the -Wl,-rpath option with -R.
>

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From kvdheeraj at indiatimes.com Fri Dec 5 01:30:44 2008
From: kvdheeraj at indiatimes.com (Dheeraj KV)
Date: Fri Dec 5 01:38:03 2008
Subject: [mvapich-discuss] mpif77 and mpif90 not getting created
In-Reply-To: <20081203162318.GE2874@cse.ohio-state.edu>
Message-ID: <1021867863.159091228458644106.JavaMail.root@mbr8.indiatimes.com>

Hi
Thanks for your help. I could compile mvapich-1.1 with the option, and
mpif77 and mpif90 are now created. But there is a problem with mpif90.
If I try to compile a code with mpif90, it gives the error below.

No Fortran 90 compiler specified when mpif90 was created,
or configuration file does not specify a compiler.

I have exported the path correctly and gfortran is available in PATH.
Is there any option to compile mvapich so that mpif90 works without any
problem?

Regards
Dheeraj K V

----- Original Message -----
From: Jonathan Perkins
To: Dheeraj KV
Cc: mvapich-discuss@cse.ohio-state.edu
Sent: Wed, 3 Dec 2008 21:53:19 +0530 (IST)
Subject: Re: [mvapich-discuss] mpif77 and mpif90 not getting created

Dheeraj:
Please take a look at section 7.1.5 of the MVAPICH userguide. There is
an environment variable that needs to be set during the build. Section
7.1.6 will also be useful to you once you're ready to run an MPI
program.

Here is a link for your convenience:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide-1.1.html#x1-330007.1.5

Please let me know if this helps.

On Wed, Dec 03, 2008 at 07:17:37PM +0530, Dheeraj KV wrote:
> Hi
> I want to install mvapich-1.1 on my super computer.
> I have gcc version gcc-4.2 installed. I want to install mvapich on top of that.
> Though its getting created mpif77 and mpif90 is not getting created on bin of mvapich.
> I have set the path of gfortran, but still its not getting installed.
> I have installed mvapich with gcc-3.4.4 (default with RHEL 4 up 5), but mpif90 created an empty file.
> Is it that mvapich-1.1 can't be installed with gfortran ?
> Please suggest !!!!
>
> Thanks & Regards
> Dheeraj K V

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu Fri Dec 5 07:58:26 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Fri Dec 5 07:58:42 2008
Subject: [mvapich-discuss] mpif77 and mpif90 not getting created
In-Reply-To: <1021867863.159091228458644106.JavaMail.root@mbr8.indiatimes.com>
References: <20081203162318.GE2874@cse.ohio-state.edu> <1021867863.159091228458644106.JavaMail.root@mbr8.indiatimes.com>
Message-ID: <20081205125825.GC2900@cse.ohio-state.edu>

Dheeraj:
Can you please forward me your build logs (configure and make-mine.log)
as well as your build script (make.mvapich.gen2)?

On Fri, Dec 05, 2008 at 12:00:44PM +0530, Dheeraj KV wrote:
> Hi
> Thanks for your help. I could compile mvapich-1.1 with the option adn mpif77 and mpif90 is getting created. But there is a problem with mpif90. If i try to compile a code with mpif90 it is giving the below giving error.
>
> No Fortran 90 compiler specified when mpif90 was created,
> or configuration file does not specify a compiler.
>
> I have exported the path correctly and gfortran is available in PATH .
> IS there any option to compile mvapich to get mpif90 without any problem.

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu Fri Dec 5 09:33:15 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Fri Dec 5 09:33:30 2008
Subject: [mvapich-discuss] Re: [ofa-general] ***SPAM*** MPIR_Init_thread(310).......: Initialization failed
In-Reply-To:
References:
Message-ID: <20081205143314.GE2900@cse.ohio-state.edu>

Hello,

At first glance it appears that this is an issue with an inability to
pin the memory required. I'm cc'ing mvapich-discuss as well since this
message is specific to the MVAPICH/MVAPICH2 packages.

The following is a snippet from the MVAPICH2 Userguide...
http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-580009.3.4

A possible reason could be inability to pin the memory required. Make
sure the following steps are taken.

1. In /etc/security/limits.conf add the following

   * soft memlock phys_mem_in_KB

2. After this, add the following to /etc/init.d/sshd

   ulimit -l phys_mem_in_KB

3. Restart sshd

With some distros, we've found that adding the ulimit -l line to the
sshd init script is no longer necessary. For instance, the following
steps work for our rhel5 systems.

1. Add the following lines to /etc/security/limits.conf

   * soft memlock unlimited
   * hard memlock unlimited

2. Restart sshd

On Wed, Dec 03, 2008 at 12:30:07PM +0530, ???? wrote:
> Hi
>
> I have compiled mvapich2-1.2p1 for gen2.
>
> I tried to run IMB ( Intel MPI Benchmark) over it.
>
> But I'm getting the following error :
>
> Fatal error in MPI_Init_thread:
> Other MPI error, error stack:
> MPIR_Init_thread(310).......: Initialization failed
> MPID_Init(113)..............: channel initialization failed
> MPIDI_CH3_Init(168).........:
> MPIDI_CH3I_RDMA_init(138)...:
> rdma_setup_startup_ring(334): cannot create cq
> MPI process terminated unexpectedly
> Exit code -5 signaled from pnetib2
> # here pnetib2 is the host name assigned to ipoib interface
> cleanupKilling remote processes...Signal 15 received.
> DONE
>
> Please tell me where is the problem. Or how can i debug this.
>
> Thanks Alot
>
> Regards,
>
> --
> Anuj Aggarwal
>
>  .''`.
> : :' :   # apt-get install hakuna-matata
> `. `'`
>   `-
> _______________________________________________
> general mailing list
> general@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From Terrence.LIAO at total.com Fri Dec 5 16:04:21 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Fri Dec 5 16:04:43 2008
Subject: [mvapich-discuss] Question on VIADEV_ENABLE_AFFINITY and VIADEV_USE_AFFINITY
Message-ID:

Dear mvapich,

I was asked to use VIADEV_ENABLE_AFFINITY=0 to disable the CPU binding,
but reading the MVAPICH1.1 User and Tuning Guide, I only find
VIADEV_USE_AFFINITY env. Here comes my question. Does mvapich has
VIADEV_ENABLE_AFFINITY env? if so what is the difference between these
two?

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com

Houston HPC site: http://us-hou-spt01/sites/rt/hpc/default.aspx
Pau HPC site: http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081205/2222fe41/attachment.html

From koop at cse.ohio-state.edu Fri Dec 5 16:32:49 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri Dec 5 16:33:05 2008
Subject: [mvapich-discuss] Question on VIADEV_ENABLE_AFFINITY and VIADEV_USE_AFFINITY
In-Reply-To:
Message-ID:

Terrence,

They both work (and do the same thing), but VIADEV_ENABLE_AFFINITY has
been deprecated. We've switched all parameters over to "USE" so as not to
confuse the user if the default were to change.

Thanks,

Matt

On Fri, 5 Dec 2008 Terrence.LIAO@total.com wrote:

> Dear mvapich,
>
> I was asked to use VIADEV_ENABLE_AFFINITY=0 to disable the CPU binding,
> but reading the MVAPICH1.1 User and Tuning Guide, I only find
> VIADEV_USE_AFFINITY env. Here comes my question. Does mvapich has
> VIADEV_ENABLE_AFFINITY env? if so what is the difference between these
> two?
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao@total.com
>
> Houston HPC site: http://us-hou-spt01/sites/rt/hpc/default.aspx
> Pau HPC site: http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx
>

From forum.san at gmail.com Sun Dec 7 02:34:12 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Sun Dec 7 02:34:29 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
Message-ID:

Hello,

The mvapich2-1.2p1 installation on a Rocks 4.3 cluster (& voltaire
infiniband) with intel 10 compilers has failed (make) with the
following error:

rdma_cm.c(171): error: struct "rdma_cm_event" has no field "param"
    if (!event->param.conn.private_data_len){
        ^
rdma_cm.c(176): error: struct "rdma_cm_event" has no field "param"
    rank = ((int *)event->param.conn.private_data)[0];
        ^
rdma_cm.c(177): error: struct "rdma_cm_event" has no field "param"
    rail_index = ((int *)event->param.conn.private_data)[1];
        ^
rdma_cm.c(376): warning #589: transfer of control bypasses initialization of:
    variable "cMinPort" (declared at line 388)
    variable "minPort" (declared at line 389)
    variable "portRange" (declared at line 409)
    variable "envPort" (declared at line 426)
    MPIU_ERR_SETANDJUMP3(
    ^
rdma_cm.c(397): warning #589: transfer of control bypasses initialization of:
    variable "portRange" (declared at line 409)
    variable "envPort" (declared at line 426)
    MPIU_ERR_SETANDJUMP3(
    ^
rdma_cm.c(414): warning #589: transfer of control bypasses initialization of:
    variable "envPort" (declared at line 426)
    MPIU_ERR_SETANDJUMP2(
    ^
compilation aborted for rdma_cm.c (code 2)

During configuration the option "--with-rdma=gen2" is used & there
were no errors.

Is this the problem with the code or anything else? How that can be resolved?

Thanks,
Sangamesh

From bachth at uni-mainz.de Mon Dec 8 08:57:40 2008
From: bachth at uni-mainz.de (Thomas Bach)
Date: Mon Dec 8 08:58:02 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <20081204171203.GC2869@cse.ohio-state.edu> (Jonathan Perkins's message of "Thu, 4 Dec 2008 18:12:04 +0100")
References: <87y6z4kmbs.fsf@taris.box> <20081201163435.GD2973@cse.ohio-state.edu> <87r64q7gjq.fsf@taris.box> <20081204171203.GC2869@cse.ohio-state.edu>
Message-ID: <87bpvmzs4r.fsf@taris.box>

Hello,

Jonathan Perkins writes:

> Thomas:
> We've tested mvapich-1.1 with a new install of Sun Studio 12 early this
> week. Everything works fine for us. It looks like there is a slight
> issue with the arguments that you're passing to the compiler.
>
> You should use the second export line after removing the '='.
>
> Ex.
> export LIBS=${LIBS:--L${IBHOME_LIB} -R${IBHOME_LIB} -libverbs -libumad
> -lpthread}
>

Ok, with the new LIBS variable it seems to work except for the c++
part. Also I had to exchange the line
SHARED_LIB_SEARCH_PATH_LEADER='-Wl,-rpath-link -Wl,'
with
SHARED_LIB_SEARCH_PATH_LEADER='-R'
in mpif77 and mpif90 to get both working.

That also seems to be the problem with the c++ part:

$ cat MPI-2-C++/config.log
This file contains any messages produced by compilers while
running configure, to aid debugging if configure makes a mistake.

configure:604: checking for a BSD compatible install
configure:657: checking whether build environment is sane
configure:714: checking whether make --no-print-directory sets ${MAKE}
configure:753: checking for working aclocal
configure:766: checking for working autoconf
configure:779: checking for working automake
configure:792: checking for working autoheader
configure:805: checking for working makeinfo
configure:959: checking host system type
configure:1083: checking for awk
configure:1118: checking for wc
configure:1398: checking MPICH version
configure:1594: checking for MPICH's underlying C++ compiler
configure:2127: checking if want profiling support
configure:2186: checking for c++
configure:2218: checking whether the C++ compiler (/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-serious3/bin/mpicxx -DMPICH_SKIP_MPICXX ) works
configure:2234: /homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-serious3/bin/mpicxx -o conftest -DMPICH_SKIP_MPICXX conftest.C -L/opt/ofed/lib64 -R/opt/ofed/lib64 -libverbs -libumad -lpthread 1>&5
sunCC: Warning: Option -Wl,-rpath-link passed to ld, if ld is invoked, ignored otherwise
sunCC: Warning: Option -Wl,/homes/zdv/bachth/libraries/mvapich/mvapich-1.1-sun-serious3/lib/shared passed to ld, if ld is invoked, ignored otherwise
/usr/local/sunstudio/suse_es10_64/sunstudio12/prod/lib/amd64/ld: unrecognized option '-Wl,-rpath-link'
/usr/local/sunstudio/suse_es10_64/sunstudio12/prod/lib/amd64/ld: use the --help option for usage information
configure: failed program was:
#line 2229 "configure"
#include "confdefs.h"

int main(){return(0);}

In which step is mpicxx generated?

Greets,
Thomas

From perkinjo at cse.ohio-state.edu Mon Dec 8 10:25:31 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Dec 8 10:25:55 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <87bpvmzs4r.fsf@taris.box>
References: <87y6z4kmbs.fsf@taris.box> <20081201163435.GD2973@cse.ohio-state.edu> <87r64q7gjq.fsf@taris.box> <20081204171203.GC2869@cse.ohio-state.edu> <87bpvmzs4r.fsf@taris.box>
Message-ID: <20081208152530.GA3639@cse.ohio-state.edu>

On Mon, Dec 08, 2008 at 02:57:40PM +0100, Thomas Bach wrote:
> Hello,

My responses are inline.

>
> Jonathan Perkins writes:
>
> > Thomas:
> > We've tested mvapich-1.1 with a new install of Sun Studio 12 early this
> > week. Everything works fine for us. It looks like there is a slight
> > issue with the arguments that you're passing to the compiler.
> >
> > You should use the second export line after removing the '='.
> >
> > Ex.
> > export LIBS=${LIBS:--L${IBHOME_LIB} -R${IBHOME_LIB} -libverbs -libumad
> > -lpthread}
> >
>
> Ok, with the new LIBS variable it seems to work except for the c++
> part. Also I had to exchange the line
> SHARED_LIB_SEARCH_PATH_LEADER='-Wl,-rpath-link -Wl,'
> with
> SHARED_LIB_SEARCH_PATH_LEADER='-R'
> in mpif77 and mpif90 to get both working.

You'll also need to set this in mpicxx.

>
> That also seems to be the problem with the c++ part:
>
> $ cat MPI-2-C++/config.log

...

>
> In which step is mpicxx generated?

This is generated during the configure step at the same time as the
other mpi* compiler scripts. It looks like you're having to take this
additional step because of some tests during configure that will not use
-R over -Wl,-rpath-link -Wl, unless the arch is detected as solaris.
We'll look into a more direct solution for users of Sun Studio in future
releases.
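
For anyone else hitting this with Sun Studio, the wrapper edit discussed in this thread could be scripted roughly as follows. This is only a sketch under stated assumptions: fix_rpath_leader is a made-up helper name (not something shipped with MVAPICH), the wrappers are assumed to contain the SHARED_LIB_SEARCH_PATH_LEADER assignment exactly as quoted here, and the directory argument is wherever your install put its bin directory.

```shell
#!/bin/sh
# fix_rpath_leader: hypothetical helper, not part of MVAPICH.
# Rewrites the GNU-style search path leader to Sun Studio's -R in the
# generated wrapper scripts, keeping a .orig backup of each one.
fix_rpath_leader() {
    bindir=$1
    for wrapper in mpif77 mpif90 mpicxx; do
        script="$bindir/$wrapper"
        [ -f "$script" ] || continue      # skip wrappers that were not built
        cp "$script" "$script.orig"
        # Assumes the assignment appears exactly as quoted in this thread.
        sed "s|SHARED_LIB_SEARCH_PATH_LEADER='-Wl,-rpath-link -Wl,'|SHARED_LIB_SEARCH_PATH_LEADER='-R'|" \
            "$script.orig" > "$script"
    done
}

# Example (PREFIX being your install prefix): fix_rpath_leader "$PREFIX/bin"
```

Since the wrappers are generated by configure, a later reconfigure would presumably overwrite them again, so the edit would need to be repeated after each configure run.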
>
> Greets,
> Thomas

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From koop at cse.ohio-state.edu Mon Dec 8 15:43:44 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Dec 8 15:43:58 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
In-Reply-To:
Message-ID:

Sangamesh,

You seem to be using a very old version of OFED. Are you using 1.1? I'd
suggest that you update your OFED install.

If this is not possible, run the 'make' step with the following ENV
set beforehand.

export CFLAGS="OFED_VERSION_1_1"

Let us know if you have any other problems,

Matt

On Sun, 7 Dec 2008, Sangamesh B wrote:

> Hello,
>
> The mvapich2-1.2p1 installation on a Rocks 4.3 cluster (& voltaire
> infiniband) with intel 10 compilers has failed (make) with the
> following error:
>
> rdma_cm.c(171): error: struct "rdma_cm_event" has no field "param"
> if (!event->param.conn.private_data_len){
> ^
> rdma_cm.c(176): error: struct "rdma_cm_event" has no field "param"
> rank = ((int *)event->param.conn.private_data)[0];
> ^
> rdma_cm.c(177): error: struct "rdma_cm_event" has no field "param"
> rail_index = ((int *)event->param.conn.private_data)[1];
> ^
> rdma_cm.c(376): warning #589: transfer of control bypasses initialization of:
> variable "cMinPort" (declared at line 388)
> variable "minPort" (declared at line 389)
> variable "portRange" (declared at line 409)
> variable "envPort" (declared at line 426)
> MPIU_ERR_SETANDJUMP3(
> ^
> rdma_cm.c(397): warning #589: transfer of control bypasses initialization of:
> variable "portRange" (declared at line 409)
> variable "envPort" (declared at line 426)
> MPIU_ERR_SETANDJUMP3(
> ^
> rdma_cm.c(414): warning #589: transfer of control bypasses initialization of:
> variable "envPort" (declared at line 426)
> MPIU_ERR_SETANDJUMP2(
> ^
> compilation aborted for rdma_cm.c (code 2)
>
> During configuration the option "--with-rdma=gen2" is used & there
> were no errors.
>
> Is this the problem with the code or anything else? How that can be resolved?
>
> Thanks,
> Sangamesh

From forum.san at gmail.com Tue Dec 9 01:59:08 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Tue Dec 9 01:59:24 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
In-Reply-To:
References:
Message-ID:

Thanks for the response..

Since infiniband hardware is voltaire, we've used Rocks 4.3 voltaire
infiniband roll:

http://www.rocksclusters.org/wordpress/?page_id=3

I used the option suggested by you.
export CFLAGS="OFED_VERSION_1_1"
The configure failed: icc failed to compile sample C program.
Then, did configure without the above option.
Edited Makefile of Mvapich2 root directory, but it failed with same error.
CFLAGS = "-DNDEBUG -O2 OFED_VERSION_1_1"
I see during compilation, it was taking -DNDEBUG -O2 only, there was
no "OFED_VERSION_1_1".

Is there a file which will be accessed by all Makefiles under subdirectories?

Any older version of MVAPICH2 support voltaire?

Thanks,
Sangamesh

On Tue, Dec 9, 2008 at 2:13 AM, Matthew Koop wrote:
> Sangamesh,
>
> You seem to be using a very old version of OFED. Are you using 1.1? I'd
> suggest that you update your OFED install.
>
> If this is not possible, run the 'make' step with the following ENV
> set beforehand.
>
> export CFLAGS="OFED_VERSION_1_1"
>
> Let us know if you have any other problems,
>
> Matt

From forum.san at gmail.com Tue Dec 9 04:33:49 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Tue Dec 9 04:34:06 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
In-Reply-To:
References:
Message-ID:

Some more updates:

I put
CFLAGS = -DNDEBUG -O2 -D_GNU_SOURCE OFED_VERSION_1_1
in mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2/Makefile.

But no effect:

/opt/intel/cce/10.1.018/bin/icc -DHAVE_CONFIG_H -I. -I. -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/include -DNDEBUG -O2 -D_GNU_SOURCE OFED_VERSION_1_1 -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/include -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/include -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/common/datatype -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/common/datatype -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/include -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/include -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2 -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2 -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/common/locks -I/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/common/locks -c ibv_send.c
icc: error #10104: unable to open 'OFED_VERSION_1_1'
make[8]: *** [ibv_send.o] Error 1
make[8]: Leaving directory `/opt/packages/libraries/mvapich2/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/gen2'
make[7]: *** [all-redirect] Error 2

Changed and placed a "-"
CFLAGS = -DNDEBUG -O2 -D_GNU_SOURCE -OFED_VERSION_1_1

It threw:

icc: command line warning #10156: ignoring option '-O'; no argument required
rdma_cm.c(171): error: struct "rdma_cm_event" has no field "param"
    if (!event->param.conn.private_data_len){
        ^
rdma_cm.c(176): error: struct "rdma_cm_event" has no field "param"
    rank = ((int *)event->param.conn.private_data)[0];

error.

Thanks,
Sangamesh

On Tue, Dec 9, 2008 at 12:29 PM, Sangamesh B wrote:
> Thanks for the response..
>
> Since infiniband hardware is voltaire, we've used Rocks 4.3 voltaire
> infiniband roll:
>
> http://www.rocksclusters.org/wordpress/?page_id=3
>
> I used the option suggested by you.
> export CFLAGS="OFED_VERSION_1_1"
> The configure failed: icc failed to compile sample C program.
> Then, did configure without the above option.
> Edited Makefile of Mvapich2 root directory, but it failed with same error.
> CFLAGS = "-DNDEBUG -O2 OFED_VERSION_1_1"
> I see during compilation, it was taking -DNDEBUG -O2 only, there was
> no "OFED_VERSION_1_1".
>
> Is there a file which will be accessed by all Makefiles under subdirectories?
>
> Any older version of MVAPICH2 support voltaire?
>
> Thanks,
> Sangamesh

From cincaipatron at gmx.net Tue Dec 9 05:36:37 2008
From: cincaipatron at gmx.net (Verdi March)
Date: Tue Dec 9 05:37:44 2008
Subject: [mvapich-discuss] MVAPICH2 and hardware multicast for MPI_Bcast
Message-ID: <200812091836.38072.cincaipatron@gmx.net>

Hi,

does MVAPICH2 support hardware multicast for MPI_Bcast? I don't see
the MCST_* flag in its build scripts, nor any indication in its
source code. At least, I could find in MVAPICH's source that mcast is
supported.

Or, does MVAPICH2 automatically use multicast without a need to
explicitly set any build/run-time flag?

Your clarification is appreciated.

Regards,
Verdi

From perkinjo at cse.ohio-state.edu Tue Dec 9 07:35:19 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Tue Dec 9 07:35:48 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
In-Reply-To:
References:
Message-ID: <20081209123519.GA2865@cse.ohio-state.edu>

On Tue, Dec 09, 2008 at 12:29:08PM +0530, Sangamesh B wrote:
> Thanks for the response..
>
> Since infiniband hardware is voltaire, we've used Rocks 4.3 voltaire
> infiniband roll:
>
> http://www.rocksclusters.org/wordpress/?page_id=3
>
> I used the option suggested by you.
> export CFLAGS="OFED_VERSION_1_1"

I think this needs to be...
export CFLAGS="-DOFED_VERSION_1_1"

> The configure failed: icc failed to compile sample C program.
> Then, did configure without the above option.
> Edited Makefile of Mvapich2 root directory, but it failed with same error.
> CFLAGS = "-DNDEBUG -O2 OFED_VERSION_1_1"

or this can be...
CFLAGS = "-DNDEBUG -O2 -DOFED_VERSION_1_1"

> I see during compilation, it was taking -DNDEBUG -O2 only, there was
> no "OFED_VERSION_1_1".
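
To spell out why the bare word fails: a word in CFLAGS without a leading '-' is handed to the compiler as an input file (hence icc's "unable to open 'OFED_VERSION_1_1'" error reported elsewhere in this thread), whereas -D defines it as a preprocessor macro. A rough sanity check along those lines is sketched below; check_cflags is a hypothetical helper name, not part of MVAPICH2 or icc, and it assumes every flag is a single word with no separate argument.

```shell
#!/bin/sh
# check_cflags: hypothetical helper, not part of MVAPICH2 or icc.
# Reports every word in a CFLAGS string that does not start with '-',
# since the compiler would try to open such a word as an input file.
check_cflags() {
    rc=0
    for word in $1; do
        case $word in
            -*) ;;                               # looks like an option
            *)  echo "not an option: $word"
                rc=1 ;;
        esac
    done
    return $rc
}

# The bare macro name is reported; the -D form passes silently.
check_cflags "-DNDEBUG -O2 OFED_VERSION_1_1" || echo "fix CFLAGS before running make"
check_cflags "-DNDEBUG -O2 -DOFED_VERSION_1_1" && echo "CFLAGS look fine"
```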
>
> Is there a file which will be accessed by all Makefiles under subdirectories?
>
> Any older version of MVAPICH2 support voltaire?
>
> Thanks,
> Sangamesh

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From forum.san at gmail.com Wed Dec 10 01:07:01 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Wed Dec 10 01:07:22 2008
Subject: [mvapich-discuss] mvapich2-1.2p1 + voltaire infiniband + intel: compilation failure
In-Reply-To: <20081209123519.GA2865@cse.ohio-state.edu>
References: <20081209123519.GA2865@cse.ohio-state.edu>
Message-ID:

Thanks. It got installed

-Sangamesh

On Tue, Dec 9, 2008 at 6:05 PM, Jonathan Perkins wrote:
> On Tue, Dec 09, 2008 at 12:29:08PM +0530, Sangamesh B wrote:
>> Thanks for the response..
>>
>> Since infiniband hardware is voltaire, we've used Rocks 4.3 voltaire
>> infiniband roll:
>>
>> http://www.rocksclusters.org/wordpress/?page_id=3
>>
>> I used the option suggested by you.
>> export CFLAGS="OFED_VERSION_1_1"
>
> I think this needs to be...
> export CFLAGS="-DOFED_VERSION_1_1"
>
>> The configure failed: icc failed to compile sample C program.
>> Then, did configure without the above option.
>> Edited Makefile of Mvapich2 root directory, but it failed with same error.
>> CFLAGS = "-DNDEBUG -O2 OFED_VERSION_1_1"
>
> or this can be...
> CFLAGS = "-DNDEBUG -O2 -DOFED_VERSION_1_1"
>
>> I see during compilation, it was taking -DNDEBUG -O2 only, there was
>> no "OFED_VERSION_1_1".
>>
>> Is there a file which will be accessed by all Makefiles under subdirectories?
>>
>> Any older version of MVAPICH2 support voltaire?

From eirc.lew at gmail.com Wed Dec 10 02:52:44 2008
From: eirc.lew at gmail.com (luxingjing)
Date: Wed Dec 10 08:26:03 2008
Subject: [mvapich-discuss] install error
Message-ID: <493f83b6.09876e0a.77e4.3bf4@mx.google.com>

Hi,

When I install mvapich1.1, I encounter the problem described below.

What I have done is:

1) ./configure --prefix=/home/autopar/lxj/mvapich1.1
2) ./make.mvapich.gen2

Then it shows the following error:

viainit.c: In function `create_srq':
viainit.c:427: warning: assignment makes pointer from integer without a cast
viainit.c:428: error: structure has no member named `xrc_srq_num'
viainit.c:428: error: structure has no member named `xrc_srq_num'
viainit.c: In function `xrc_init':
viainit.c:1144: error: `IBV_DEVICE_XRC' undeclared (first use in this function)
viainit.c:1144: error: (Each undeclared identifier is reported only once
viainit.c:1144: error: for each function it appears in.)
viainit.c:1161: warning: assignment makes pointer from integer without a cast
make[3]: *** [viainit.o] Error 1
Exit status from make was 2
make[2]: *** [mpilib] Error 1
make[1]: *** [mpi-modules] Error 2
make: *** [mpi] Error 2
Failure in building MVAPICH.

Wish for your reply!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081210/e2b45e03/attachment.html

From noam.bernstein at nrl.navy.mil Wed Dec 10 11:13:43 2008
From: noam.bernstein at nrl.navy.mil (Noam Bernstein)
Date: Wed Dec 10 11:14:00 2008
Subject: [mvapich-discuss] non-deterministic crashes in mvapich-1.1
Message-ID: <6A7D6C72-0C6F-4C79-9D94-A4F4640731BA@nrl.navy.mil>

We have a system with dual Opteron nodes, Infiniband Infinihost III Lx cards, running Rocks 5.1 (CentOS 5.2) OFED 1.3.1, and mvapich 1.1. My code is crashing in odd, non-deterministic ways. The same code on the same hardware worked fine under Rocks 4.1 (CentOS 4.3, OFED 1.2.4.?, and mvapich-0.99?), and on this cluster when I use OpenMPI 1.2.8 instead of mvapich. It also works fine on other platforms.

The code is in Fortran 90, compiled with Intel fortran 10.1.021 (same compiler used for MPI compilation, together with gcc). I'm using acml 3.6.0, because 4.2.0 leads to problems with the intel compiled code.

I always run with exactly the same input, and there should be no randomness involved. There are several things that I have observed to happen:
1. some numbers become infinities (results of LAPACK routines, which are then combined using mpi_allreduce, but I'm not sure at what point they become infinity - the non-reproducibility of the symptoms makes it hard to determine)
2. LAPACK zhegv complains that the B matrix is not positive definite, despite the fact that it should be exactly the same as on the previous call to zhegv
3. mpi_allreduce complains that the cookie on the communicator is invalid
4. segmentation fault in mpi_finalize()

Symptoms 1-3 usually occur not on the first call to the problem routine, but after many calls. Symptoms 1 and 2 usually after a few calls (1-4), symptom 3 usually after tens of calls (about 40 iterations of the code, not sure exactly how many calls to mpi_allreduce).
Right now the problem seems fairly reproducible - usually symptom 3, with infrequent symptom 1 or 2. Symptom 3 always occurs on task number 16, regardless of which node it happens to be. The allreduce is doing a sum of a smallish (dim=1116) array of reals. Given that the code behaves fine on other machines and on this machine with OpenMPI, I tend to suspect mvapich (or perhaps how mvapich interacts with OFED). I know this is a relatively unhelpful description of the problem, but I haven't been able to isolate it or make it more reproducible. Has anyone seen anything like this before? Does anyone have any ideas how to go about finding/ fixing the problem? thanks, Noam From koop at cse.ohio-state.edu Wed Dec 10 12:14:44 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Dec 10 12:15:00 2008 Subject: [mvapich-discuss] non-deterministic crashes in mvapich-1.1 In-Reply-To: <6A7D6C72-0C6F-4C79-9D94-A4F4640731BA@nrl.navy.mil> Message-ID: Noam, I suspect this may have something to do with the shared memory all reduce optimization. Can you try turning it off and seeing if the problem still occurs to help us narrow down the problem? e.g. mpirun_rsh -np 128 -hostfile ./h VIADEV_USE_SHMEM_ALLREDUCE=0 ./exec Thanks, Matt On Wed, 10 Dec 2008, Noam Bernstein wrote: > We have a system with dual Opteron nodes, Infiniband Infinihost III Lx > cards, > running Rocks 5.1 (CentOS 5.2) OFED 1.3.1, and mvapich 1.1. My code > is crashing > in odd, non deterministic ways. The same code on the same hardware > worked fine > under Rocks 4.1 (CentOS 4.3, OFED 1.2.4.?,and mvapich-0.99?), and on > this cluster > when I use OpenMPI 1.2.8 instead of mvapich. It also works fine on > other platforms. > > The code is in Fortran 90, compiled with Intel fortran 10.1.021 (same > compiler > used for MPI compilation, together with gcc). I'm using acml 3.6.0, > because > 4.2.0 leads to problem with the intel compiled code. 
> > I always run with exactly the same input, and there should be no > randomness > involved. There are several things that I have observed to happen: > 1. some numbers become infinities (results of LAPACK routines, which > are then > combined using mpi_allreduce, but I'm not sure at what point they > become infinity - > the non-reproducibilty of the symptoms makes it hard to determine) > 2. LAPACK zhegv complains that the B matrix is not positive definite, > despite > the fact that it should be exactly the same as on the previous > call to zhegv > 3. mpi_allreduce complains that the cookie on the communicator is > invalid > 4. segmentation fault in mpi_finalize() > > Symptoms 1-3 usually occur not on the first call to the problem > routine, but after > many calls. Symptoms 1 and 2 usually after a few calls (1-4), symptom > 3 usually > after tens of calls (about 40 iterations of the code, not sure exactly > how many calls > to mpi_allreduce). > > Right now the problem seems fairly reproducible - usually symptom 3, > with infrequent > symptom 1 or 2. Symptom 3 always occurs on task number 16, regardless > of which node it happens to be. The allreduce is doing a sum of a > smallish (dim=1116) > array of reals. > > Given that the code behaves fine on other machines and on this machine > with OpenMPI, I tend to suspect mvapich (or perhaps how mvapich > interacts > with OFED). > > I know this is a relatively unhelpful description of the problem, but > I haven't been > able to isolate it or make it more reproducible. Has anyone seen > anything > like this before? Does anyone have any ideas how to go about finding/ > fixing the > problem? 
>
> thanks,
> Noam
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From sridharj at cse.ohio-state.edu Wed Dec 10 13:04:21 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Wed Dec 10 13:04:38 2008
Subject: [mvapich-discuss] install error
In-Reply-To: <493f83b6.09876e0a.77e4.3bf4@mx.google.com>
References: <493f83b6.09876e0a.77e4.3bf4@mx.google.com>
Message-ID: <20081210180421.GA17286@kappa.cse.ohio-state.edu>

Hi,

On Wed, Dec 10, 2008 at 04:52:44PM +0900, luxingjing wrote:
>
> viainit.c: In function `create_srq':
> viainit.c:427: warning: assignment makes pointer from integer without a cast
> viainit.c:428: error: structure has no member named `xrc_srq_num'
> viainit.c:428: error: structure has no member named `xrc_srq_num'
> viainit.c: In function `xrc_init':
> viainit.c:1144: error: `IBV_DEVICE_XRC' undeclared (first use in this function)
> viainit.c:1144: error: (Each undeclared identifier is reported only once
> viainit.c:1144: error: for each function it appears in.)
> viainit.c:1161: warning: assignment makes pointer from integer without a cast
>

This failure is because you're using an older OFED version which doesn't support XRC. You can either -
(a) Install OFED version 1.3 or later which has XRC support
(b) Compile without XRC - remove the -DXRC from CFLAGS in make.mvapich.gen2

-Jaidev

--
You can rent this space for only $5 a week.

From koop at cse.ohio-state.edu Wed Dec 10 13:50:57 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed Dec 10 13:51:13 2008
Subject: [mvapich-discuss] MVAPICH2 and hardware multicast for MPI_Bcast
In-Reply-To: <200812091836.38072.cincaipatron@gmx.net>
Message-ID:

Verdi,

MVAPICH2 does not currently support hardware multicast. Instead, it supports shared-memory collective implementations that, in practice, we have found perform quite well.
Thanks,

Matt

On Tue, 9 Dec 2008, Verdi March wrote:

> Hi,
>
> does MVAPICH2 support hardware multicast for MPI_Bcast?
>
> I don't see the MCST_* flag in its build scripts, nor does any
> indication in its source code.
>
> At least, I could find in MVAPICH's source that mcast is supported.
>
> Or, does MVAPICH2 automatically use multicast without a need to
> explictly set any built/run-time flag?
>
> Your clarification is appreciated.
>
> Regards,
> Verdi
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From worldeb at ukr.net Wed Dec 10 17:19:54 2008
From: worldeb at ukr.net (Egor Tur)
Date: Wed Dec 10 17:20:11 2008
Subject: [mvapich-discuss] mpispawn: No such file or directory
Message-ID:

Hi folk.

I compiled last trunk mvapich2 (mvapich2-trunk-2008-12-06.tar.gz) source code. OK. It was compiled & installed (to /opt/mvapich2-1.2p1) without errors. But I tried to run some code. For example from examples directory: I had compiled cpi. ok it was compiled. then

/opt/mvapich2-1.2p1/bin/mpirun_rsh -np 4 n01 n02 n03 n04 ./cpi
/usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory
Child exited abnormally!
cleanupKilling remote processes.../usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory
/usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory
/usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory
DONE

ls -l /opt/mvapich2-1.2p1/bin/mpispawn
-rwxr-xr-x 1 root root 34797 2008-12-10 23:31 /opt/mvapich2-1.2p1/bin/mpispawn

Any ideas?
Thanx.

From worldeb at ukr.net Wed Dec 10 17:50:44 2008
From: worldeb at ukr.net (Egor Tur)
Date: Wed Dec 10 17:51:02 2008
Subject: [mvapich-discuss] mpispawn: No such file or directory
In-Reply-To:
Message-ID:

OK. I share now mvapich2 directory to all node & it is working. Also I found this error message in mvapich2 manual.
Sorry & thanx. --- I compiled last trunk mvapich2 (mvapich2-trunk-2008-12-06.tar.gz) source code. OK. It was compiled & installed (to /opt/mvapich2-1.2p1) without errors. But I tried to run some code. For example from examples directory: I had compiled cpi. ok it was compiled. then /opt/mvapich2-1.2p1/bin/mpirun_rsh -np 4 n01 n02 n03 n04 ./cpi /usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory Child exited abnormally! cleanupKilling remote processes.../usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory /usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory /usr/bin/env: /opt/mvapich2-1.2p1/bin/mpispawn: No such file or directory DONE ls -l /opt/mvapich2-1.2p1/bin/mpispawn -rwxr-xr-x 1 root root 34797 2008-12-10 23:31 /opt/mvapich2-1.2p1/bin/mpispawn Any ideas? From sridharj at cse.ohio-state.edu Thu Dec 11 11:11:53 2008 From: sridharj at cse.ohio-state.edu (Jaidev Sridhar) Date: Thu Dec 11 11:12:12 2008 Subject: [mvapich-discuss] install error In-Reply-To: <49408f5b.02066e0a.70f9.6750@mx.google.com> References: <49408f5b.02066e0a.70f9.6750@mx.google.com> Message-ID: <49413BC9.7060104@cse.ohio-state.edu> Hi, Good to know you got it to install and thanks for lettings us know. -Jaidev On Wednesday 10 December 2008 09:54 PM, luxingjing wrote: > Hi, > You are right, now I install successfully. Thanks for your help! 
>
> -----Original Message-----
> From: Jaidev Sridhar [mailto:sridharj@cse.ohio-state.edu]
> Sent: Thursday, December 11, 2008 3:04 AM
> To: luxingjing
> Cc: mvapich-discuss@cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] install error
>
> Hi,
>
> On Wed, Dec 10, 2008 at 04:52:44PM +0900, luxingjing wrote:
> >
> > viainit.c: In function `create_srq':
> > viainit.c:427: warning: assignment makes pointer from integer without a cast
> > viainit.c:428: error: structure has no member named `xrc_srq_num'
> > viainit.c:428: error: structure has no member named `xrc_srq_num'
> > viainit.c: In function `xrc_init':
> > viainit.c:1144: error: `IBV_DEVICE_XRC' undeclared (first use in this function)
> > viainit.c:1144: error: (Each undeclared identifier is reported only once
> > viainit.c:1144: error: for each function it appears in.)
> > viainit.c:1161: warning: assignment makes pointer from integer without a cast
> >
>
> This failure is because you're using an older OFED version which doesn't
> support XRC. You can either -
> (a) Install OFED version 1.3 or later which has XRC support
> (b) Compile without XRC - remove the -DXRC from CFLAGS in
> make.mvapich.gen2
>
> -Jaidev
>

From bachth at uni-mainz.de Fri Dec 12 16:29:34 2008
From: bachth at uni-mainz.de (Thomas Bach)
Date: Fri Dec 12 16:29:54 2008
Subject: [mvapich-discuss] Error while compiling mvapich-1.1 with sunstudio
In-Reply-To: <20081208152530.GA3639@cse.ohio-state.edu> (Jonathan Perkins's message of "Mon, 8 Dec 2008 16:25:31 +0100")
References: <87y6z4kmbs.fsf@taris.box> <20081201163435.GD2973@cse.ohio-state.edu> <87r64q7gjq.fsf@taris.box> <20081204171203.GC2869@cse.ohio-state.edu> <87bpvmzs4r.fsf@taris.box> <20081208152530.GA3639@cse.ohio-state.edu>
Message-ID: <87y6ylkrpd.fsf@taris.box>

Hi,

after changing

gcc)
echo "-Wl,-rpath-link -Wl," ;;

to

gcc)
echo "-R" ;;

in util/makesharedlib and

export LIBS=${LIBS:--L${IBHOME_LIB} -R${IBHOME_LIB} -libverbs -libumad -lpthread}

in make.mvapich.gen2.
Everything works fine now.

I do configure with:

./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
--with-romio --without-mpe -lib="$LIBS" --enable-cxx --enable-f77 \
--enable-f90modules --enable-f90 --enable-sharedlib 2>&1 |tee config-mine.log

Thank you very much for all your help!

Greets,
Thomas.

From eirc.lew at gmail.com Mon Dec 15 06:42:42 2008
From: eirc.lew at gmail.com (luxingjing)
Date: Mon Dec 15 11:14:45 2008
Subject: [mvapich-discuss] run error when use pbs
Message-ID: <4946511c.08486e0a.7cdf.ffffced3@mx.google.com>

Hi,

Recently, I installed mvapich1.1 and the network is infiniband. In the last, I install brkeley_upc-2.8 whose conduit is infiniband-ibv,

And the upcrun will use mpirun( mvapich ) to layout the thread.

I write the nodes from $PBS_NODEFILE to a file hosts, and MPIRUNCMD is

MPIRUN_CMD="${MPIRUN_CMD:-/home/paraorc/lxj/mvapich1.1/bin/mpirun -machinefile /home/paraorc/lxj/test/hosts -np %N %C }

But when I qsub hello.pb, in the file hello.e the errors are:

Child exited abnormally!
Killing remote processes...DONE

Wish your help. Thank you!

Eric

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081215/5c1344a7/attachment.html

From sridharj at cse.ohio-state.edu Mon Dec 15 21:43:03 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Mon Dec 15 21:43:20 2008
Subject: [mvapich-discuss] run error when use pbs
In-Reply-To: <4946511c.08486e0a.7cdf.ffffced3@mx.google.com>
References: <4946511c.08486e0a.7cdf.ffffced3@mx.google.com>
Message-ID: <494715B7.5020903@cse.ohio-state.edu>

Hi,

Your command line is wrong. You should use -

mpirun_rsh -np x -hostfile /path/to/file /path/to/app

-Jaidev

On Monday 15 December 2008 06:42 AM, luxingjing wrote:
> Hi,
>
> Recently, I installed mvapich1.1 and the network is infiniband.
In the
> last, I install brkeley_upc-2.8 whose conduit is infiniband-ibv,
>
> And the upcrun will use mpirun( mvapich ) to layout the thread.
>
> I write the nodes from $PBS_NODEFILE to a file hosts, and MPIRUNCMD is
>
> MPIRUN_CMD="${MPIRUN_CMD:-/home/paraorc/lxj/mvapich1.1/bin/mpirun
> -machinefile /home/paraorc/lxj/test/hosts -np %N %C }
>
> But when I qsub hello.pb, in the file hello.e the errors are:
>
> Child exited abnormally!
>
> Killing remote processes...DONE
>
> Wish your help. Thank you!
>
> Eric
>
> __________ Information from ESET NOD32 Antivirus, version of virus
> signature database 3230 (20080701) __________
>
> The message was checked by ESET NOD32 Antivirus.
>
> http://www.eset.com
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From sridharj at cse.ohio-state.edu Wed Dec 17 15:15:23 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Wed Dec 17 15:15:38 2008
Subject: [mvapich-discuss] run error when use pbs
In-Reply-To: <494890a1.05886e0a.1646.ffff9a7a@mx.google.com>
References: <20081216211500.GA27122@omicron.cse.ohio-state.edu> <494890a1.05886e0a.1646.ffff9a7a@mx.google.com>
Message-ID: <1229544923.13665.0.camel@t13.nowlab.cis.ohio-state.edu>

Thanks for letting us know that it works now, we'll consider putting this in the FAQ.

-Jaidev

On Wed, Dec 17, 2008 at 01:38:16PM +0900, luxingjing wrote:
>
> Hi,
>
> I am sorry for not having informed you that the problem is resolved.
>
> It is nothing wrong with mvapich1.1, it is the result of PBS, the PBS does not
> allow user to "ssh other node", instead we have to do like below:
>
> mpirun_rsh -rsh -np ....,
>
> Now it works .
>
> Thank you for advice.
>
> -Eric
>
> -----Original Message-----
> From: 'Jaidev Sridhar' [mailto:sridharj@cse.ohio-state.edu]
> Sent: Wednesday, December 17, 2008 6:15 AM
> To: luxingjing
> Subject: Re: [mvapich-discuss] run error when use pbs
>
> > Looks like the cpi application is crashing. Can you set 'ulimit -c unlimited'
> > in your bash profile and see if we get any core dumps?
>
> > -Jaidev
>
> > On Tue, Dec 16, 2008 at 11:13:35AM +0900, luxingjing wrote:
> > >
> > > Hi,
> > >
> > > Thank you for your reply, but it seems not the problem.
> > >
> > > Now my pbs script is:
> > >
> > > #!/bin/sh
> > > #PBS -N cpi
> > > #PBS -l nodes=1:ppn=1
> > > #PBS -q dawning
> > > #PBS -o cpi1
> > > #PBS -e cpi1.e
> > > cd $PBS_O_WORKDIR
> > > declare -a no
> > > count=0
> > > for i in $( uniq $PBS_NODEFILE )
> > > do
> > > echo $i
> > > echo $count
> > > no[$count]=$i
> > > count=$(($count + 1))
> > > done
> > > export UPC_NODES="${no[0]} ${no[1]} ${no[2]} ${no[3]}"
> > > #PBS -V
> > > exec 1>/home/paraorc/lxj/test/hosts
> > > echo "${no[0]}"
> > > exec 1<&-
> > >
> > > /home/paraorc/lxj/mvapich1.1/bin/mpirun_rsh -np 1 -hostfile
> > > /home/paraorc/lxj/test/hosts /home/paraorc/lxj/test/cpi
> > >
> > > Bash
> > >
> > > But the error is still there ,Error is:
> > >
> > > Child exited abnormally!
> > >
> > > Killing remote processes...DONE
> > >
> > > .The network is infiniband, and use openfabrics1.1, the mvapich is
> > > 1.1too. I wonder if mvapich1.1 support the openfabrics-1.1 ,
> > >
> > > And when I install the mvapich, I removed the CFLAG -DXRC for errors
> > > as below, Does it matter ?
> > >
> > > viainit.c: In function `create_srq':
> > > viainit.c:427: warning: assignment makes pointer from integer without a cast
> > > viainit.c:428: error: structure has no member named `xrc_srq_num'
> > > viainit.c:428: error: structure has no member named `xrc_srq_num'
> > > viainit.c: In function `xrc_init':
> > > viainit.c:1144: error: `IBV_DEVICE_XRC' undeclared (first use in this function)
> > > viainit.c:1144: error: (Each undeclared identifier is reported only once
> > > viainit.c:1144: error: for each function it appears in.)
> > > viainit.c:1161: warning: assignment makes pointer from integer without a cast
> > > make[3]: *** [viainit.o] Error 1
> > > Exit status from make was 2
> > > make[2]: *** [mpilib] Error 1
> > > make[1]: *** [mpi-modules] Error 2
> > > make: *** [mpi] Error 2
> > > Failure in building MVAPICH.
> > >
> > > I have tried all day for the problem, but I have not got it resolved
> > > now. Thank you for your help
> > >
> > > -Eric
> > >
> > > -----Original Message-----
> > > From: Jaidev Sridhar [mailto:sridharj@cse.ohio-state.edu]
> > > Sent: Tuesday, December 16, 2008 11:43 AM
> > > To: luxingjing
> > > Cc: mvapich-discuss@cse.ohio-state.edu
> > > Subject: Re: [mvapich-discuss] run error when use pbs
> > >
> > > Hi,
> > >
> > > Your command line is wrong. You should use -
> > >
> > > mpirun_rsh -np x -hostfile /path/to/file /path/to/app
> > >
> > > -Jaidev
> > >
> > > On Monday 15 December 2008 06:42 AM, luxingjing wrote:
> > > > Hi,
> > > >
> > > > Recently, I installed mvapich1.1 and the network is infiniband. In the
> > > > last, I install brkeley_upc-2.8 whose conduit is infiniband-ibv,
> > > >
> > > > And the upcrun will use mpirun( mvapich ) to layout the thread.
> > > >
> > > > I write the nodes from $PBS_NODEFILE to a file hosts, and MPIRUNCMD is
> > > >
> > > > MPIRUN_CMD="${MPIRUN_CMD:-/home/paraorc/lxj/mvapich1.1/bin/mpirun
> > > > -machinefile /home/paraorc/lxj/test/hosts -np %N %C }
> > > >
> > > > But when I qsub hello.pb, in the file hello.e the errors are:
> > > >
> > > > Child exited abnormally!
> > > >
> > > > Killing remote processes...DONE
> > > >
> > > > Wish your help. Thank you!
> > > >
> > > > Eric
> > > >
> > > > ------------------------------------------------------------------------
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss@cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
> > --
> > You can rent this space for only $5 a week.

--
You can rent this space for only $5 a week.
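
[Editor's note] The outcome of this thread (use mpirun_rsh with a hostfile, and add -rsh when PBS does not allow ssh between nodes) can be sketched as a small shell helper. This is only a sketch under assumptions, not a tested MVAPICH recipe: the function name build_launch_cmd and all paths are hypothetical, and the -rsh option is taken from luxingjing's report above. The function only prints the command line so that it can be checked before anything is launched.

```shell
#!/bin/sh
# Sketch only: build a hostfile from a PBS node list and print the
# mpirun_rsh command line instead of running it.
# build_launch_cmd is a hypothetical helper, not part of MVAPICH or PBS.

build_launch_cmd() {
    nodefile=$1   # file with one host name per line (e.g. $PBS_NODEFILE)
    hostfile=$2   # hostfile to write for mpirun_rsh
    app=$3        # MPI program to start

    # Copy the non-empty lines; PBS lists a node once per allocated slot,
    # so repeated host names are kept on purpose.
    grep . "$nodefile" > "$hostfile" || return 1

    # One MPI process per hostfile entry.
    np=$(grep -c . "$hostfile")

    # -rsh asks mpirun_rsh to use rsh instead of ssh, as reported in the thread.
    echo "mpirun_rsh -rsh -np $np -hostfile $hostfile $app"
}
```

Inside a job script this might be called as build_launch_cmd "$PBS_NODEFILE" ./hosts ./cpi, and the printed line run once it looks right.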
From Terrence.LIAO at total.com Thu Dec 18 10:56:52 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Fri Dec 19 18:53:05 2008
Subject: [mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre
Message-ID:

Dear Mvapich-discuss,

I have encountered a very strange IBV_WC_RETRY_EXC_ERR code=12 problem and need your advice.
This problem only happens when using MPI-IO calls such as mpi_file_write_all() on lustre.
We are using ofed1.4rc3 on CentOS 5.2. The IB is infinipath SDR HTX. lustre is running version 1.6.5.1 and mounted with rw,_netdev flags.
The same code runs fine on standard ethernet type of storage, such as NetAPP (i.e. no IB to storage). Also, the code without MPI-IO has no problem writing into lustre.

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081218/8f320333/attachment.html

From panda at cse.ohio-state.edu Mon Dec 22 17:22:57 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon Dec 22 17:23:03 2008
Subject: [mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre
In-Reply-To:
Message-ID:

Terrence,

This error code signifies issues related to flow control in the IB network. This could be coming from the OFED implementation + InfiniPath SDR HTX. This particular adapter is an older one. Under high I/O load (when using Lustre), the flow control issues might be becoming critical and you are getting this error code. You may check with QLogic people on this. Do you see the same error with any other recent IB adapters from QLogic or Mellanox?
Thanks,

DK

> I have encountered a very strange IBV_WC_RETRY_EXC_ERR code=12 problem
> and need your advise.
> This problem only happens when using MPI-IO calls such as
> mpi_file_write_all() on lustre.
> We are using ofed1.4rc3 on CentOS 5.2. The IB is infinipath SDR HTX.
> lustre is running version 1.6.5.1 and mounted with rw,_netdev flags.
> The same code run fine on standard ethernet type of storage, such as
> NetAPP (i.e. no IB to storage). Also, the code without using MPI-IO, has
> no problem to write into lustre.
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao@total.com
>
>

From Terrence.LIAO at total.com Tue Dec 23 07:26:12 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Tue Dec 23 07:26:21 2008
Subject: [mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre
In-Reply-To:
Message-ID:

Professor Panda,

We do NOT have the same problem on our newer cluster, which has Mellanox PCIe IB cards. Your mvapich works very nicely on this cluster.
On the old cluster, we have finally been able to use IB on Lustre with the OFED driver; however, this error code=12 has become a big problem. I also see the MPI pingpong run hang with np 36 from time to time. I guess this is also linked to the flow control issue you mentioned.
I recall you mentioned you have a cluster with HTX cards running infinipath's driver with mvapich. Is it better for me to try this? Also, is there an IB parameter I can set to avoid this kind of flow control problem?

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com
Houston HPC site: http://us-hou-spt01/sites/rt/hpc/default.aspx
Pau HPC site: http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx

Dhabaleswar Panda
12/22/2008 04:22 PM

To: Terrence.LIAO@total.com
cc: mvapich-discuss@cse.ohio-state.edu, Jing WEN , Brian Stevens , , Craig VERSHON
Subject: Re: [mvapich-discuss] Need advice on Error code =12 problem only when running with MPIIO on lustre

Terrence,

This error code signifies issues related to flow control in the IB network. This could be coming from the OFED implementation + InfiniPath SDR HTX. This particular adapter is an older one. Under high I/O load (when usign Lustre), the flow control issues might be becoming critical and you are getting this error code. You may check with QLogic people on this. Do you see the same error with any other recent IB adapters from QLogic or Mellanox.

Thanks,

DK

> I have encountered a very strange IBV_WC_RETRY_EXC_ERR code=12 problem
> and need your advise.
> This problem only happens when using MPI-IO calls such as
> mpi_file_write_all() on lustre.
> We are using ofed1.4rc3 on CentOS 5.2. The IB is infinipath SDR HTX.
> lustre is running version 1.6.5.1 and mounted with rw,_netdev flags.
> The same code run fine on standard ethernet type of storage, such as
> NetAPP (i.e. no IB to storage). Also, the code without using MPI-IO, has
> no problem to write into lustre.
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao@total.com
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081223/390b1c3c/attachment.html

From dorian.krause at scai.fraunhofer.de Mon Dec 29 13:33:30 2008
From: dorian.krause at scai.fraunhofer.de (Dorian Krause)
Date: Mon Dec 29 13:33:48 2008
Subject: [mvapich-discuss] Hang in MPI_Win_fence
Message-ID: <495917FA.2090001@scai.fraunhofer.de>

Hi List,

the attached program (bs-db.cc) uses a combination of one-sided communication and derived datatypes to collect data from 2 origin processes on 2 other target processes. The derived datatypes have been checked to contain no overlap and the target window is large enough. Unfortunately the program hangs in MPI_Win_fence after the access epoch (MPI_Put). Two processes hang in MPIDI_CH3I_SMP_read_progress while two others don't return from MPIDI_CH3I_SMP_writev.

I'm using mpich2-1.1a2 with the intel compiler suite. The OFED version is 1.1 (old, I know ...). The configure command was

$ ./configure --prefix=/home/dkrause/mvapich2-1.2p1-icc10 CFLAGS=-DOFED_VERSION_1_1 CC=icc CXX=icpc FC=ifort --enable-romio --with-file-system=lustre

The program works correctly with mpich2-1.1a2 (it crashes with OpenMPI though, but I think this is a different issue).

For my tests I ran 4 instances of the program on the cluster headnode.

The program is extracted from a real application which hangs IFF the amount of transferred data is too large.

Any help/ideas would be appreciated ...

Thanks + Regards,
Dorian

-------------- next part --------------
A non-text attachment was scrubbed...
Name: mvapichtest.tar.bz2
Type: application/x-bzip
Size: 149632 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20081229/3a5a09db/mvapichtest.tar-0001.bin