From david-m at orbotech.com Wed Aug 1 02:24:25 2007
From: david-m at orbotech.com (David Minor)
Date: Wed Aug 1 02:25:09 2007
Subject: [mvapich-discuss] core dump on mpi_init with ofed 2
Message-ID:

Hello Wei,

ib_rdma_lat and ib_rdma_bw work. Paths to ofed 1.2 are correct.
Remember, I have the same problem with the mvapich that comes with ofed
and the one I compiled from the 1.0 beta. What I haven't tried is
compiling with udapl support. I'm using udapl successfully with the
Intel MPI.

Regards,
David

-----Original Message-----
From: Abhinav Vishnu [mailto:vishnu@cse.ohio-state.edu]
Sent: Tuesday, July 31, 2007 5:48 PM
To: David Minor
Cc: wei huang; mvapich-discuss@cse.ohio-state.edu
Subject: Re: [mvapich-discuss] core dump on mpi_init with ofed 2

Hi David,

Thanks for this information. With this information, I am speculating
that it could be a problem with the setup. I think following these
steps may help us narrow down the problem:

1. Are you able to run the Verbs level tests (ib_rdma_lat, ib_rdma_bw,
   etc) between the two nodes?

2. Please check the path of the OFED libraries which you are linking to
   the compilation script. I hope that you are recompiling your programs
   with your OFED 1.2 MPI installation.

Please let us know the outcome of your experimentation.

Thanks,

:- Abhinav

> Hi Wei,
> I'm using 1.0 beta, but the same problem is with 0.9.8, both p3 and
> the version that comes with ofed 1.2 release. I compiled using the
> make.mvapich2.ofa option. I haven't specified any environment
> variables. I didn't change any of the scripts, except to set PREFIX
> before compiling. I reproduced the problem with a trivial program
> running on 2 nodes. I didn't see the problem running on ethernet on
> 0.9.8.
> Thanks,
> David
>
> -----Original Message-----
> From: wei huang [mailto:huanwei@cse.ohio-state.edu]
> Sent: Monday, July 30, 2007 4:36 PM
> To: David Minor
> Cc: mvapich-discuss@cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] core dump on mpi_init with ofed 2
>
> Hi David,
>
> Thanks for letting us know the problem. Could you please tell us more
> information so we can look into this problem?
>
> 1) Which version of mvapich2 are you using? Is it 0.9.8? The latest
> version for 0.9.8 is mvapich2-0.9.8p3. Also, we have just released
> mvapich2-1.0-beta. You are welcome to try these two versions and let us
> know if your problem is reproducible there.
>
> 2) Are you using native ib verbs or udapl?
>
> 3) Have you specified any environmental variables?
>
> 4) Did you use our default compiling scripts? Or did you make any changes
> to the scripts?
>
> 5) On how many processes do you see the problem? How many processes per
> physical node?
>
> Thanks.
>
> Regards,
> Wei Huang
>
>

From panda at cse.ohio-state.edu Wed Aug 1 22:02:09 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Aug 1 22:02:32 2007
Subject: [mvapich-discuss] Announcing the release of MVAPICH2 1.0-beta
In-Reply-To: <392f95800707310732n5409de51w2bcc95e6bbe1902f@mail.gmail.com> from "Eric A. Borisch" at Jul 31, 2007 09:32:35 AM
Message-ID: <200708020202.l7222Arn021331@xi.cse.ohio-state.edu>

Hi Eric,

We have fixed these problems. Please feel free to download the latest
version of the code from the trunk (either through svn checkout or the
nightly tarball being generated from the trunk .... changes will be
reflected in tonight's tarball). Web links to these are available on
the mvapich2 download page.

Let us know if you experience any additional problems.

Thanks,

DK

> Dr. Panda,
>
> Thank you again for you and your group's hard work on this software.
>
> I'll start by saying that I know I should move over to OpenFabrics and Gen2,
> but as we've discussed previously, this isn't currently a viable option for
> reasons that are outside the scope of this forum. With that said...
>
> A few compilation snags with MVAPICH2-1.0-beta on the VAPI flavor:
>
> (1) In src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_read_progress.c line 146:
>
>     type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking)
>
> calls with four arguments; the VAPI version (defined in
> src/mpid/osu_ch3/channels/mrail/src/vapi/mpidi_ch3_rdma_post.h) has only
> the first three arguments. I imagine this is just a missing #ifdef switch
> ...
>
> (2) in src/mpid/osu_ch3/channels/mrail/src/vapi/rdma_iba_1sc.c lines 151-156:
>
>     if (SMP_INIT)
>     {
>         /*correspoding post has not been issued */
>         flag = 0;
>         break;
>     }
>
> These lines appear to have migrated here from somewhere else in the code
> (perhaps the function immediately above it.) The variable flag is undefined
> at this point, and there's a break statement without a loop to break out
> of...
>
> By no means a tested fix, but removing the last argument from the issue
> mentioned in (1) and commenting out the offending lines in (2) appears to
> allow the VAPI channel to compile and run (benchmarks, in-house tools)
> successfully. I haven't been able to get logging working, but that is
> another discussion.
>
> Your thoughts?
>
> Thanks again!
> Eric Borisch
>
> On 7/26/07, Dhabaleswar Panda wrote:
> > The MVAPICH team is pleased to announce the availability of
> > MVAPICH2-1.0-beta with the following NEW features:
> >
> > - Message coalescing support to enable reduction of per Queue-pair
> >   send queues for reduction in memory requirement on large scale
> >   clusters. This design also increases the small message messaging
> >   rate significantly. Available for Open Fabrics Gen2-IB.
> >
> > - Hot-Spot Avoidance Mechanism (HSAM) for alleviating
> >   network congestion in large scale clusters.
> >   Available for Open Fabrics Gen2-IB.
> >
> > - RDMA CM based on-demand connection management for large scale
> >   clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP.
> >
> > - uDAPL on-demand connection management for large scale clusters.
> >   Available for uDAPL interface (including Solaris IB implementation).
> >
> > - RDMA Read support for increased overlap of computation and
> >   communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP.
> >
> > - Application-initiated system-level (synchronous) checkpointing in
> >   addition to the user-transparent checkpointing. User application can
> >   now request a whole program checkpoint synchronously with BLCR by
> >   calling special functions within the application. Available for
> >   OpenFabrics Gen2-IB.
> >
> > - Network-Level fault tolerance with Automatic Path Migration (APM)
> >   for tolerating intermittent network failures over InfiniBand.
> >   Available for OpenFabrics Gen2-IB.
> >
> > - Integrated multi-rail communication support for OpenFabrics
> >   Gen2-iWARP.
> >
> > - Blocking mode of communication progress. Available for OpenFabrics
> >   Gen2-IB.
> >
> > - Based on MPICH2 1.0.5p4.
> >
> > For downloading MVAPICH2 1.0-beta source code, associated user guide
> > and accessing the anonymous SVN, please visit the following URL:
> >
> > http://mvapich.cse.ohio-state.edu
> >
> > All feedbacks, including bug reports and hints for performance tuning,
> > are welcome. Please post it to the mvapich-discuss mailing list.
> >
> > Thanks,
> >
> > MVAPICH Team
> >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss@cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
>
>
> --
> Eric A. Borisch
> eborisch@ieee.org
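
Point (1) in Eric's report, a call site that passes four arguments while the VAPI header declares three, is the kind of mismatch a preprocessor switch can absorb. The sketch below only illustrates that idea and is not the fix that went into the trunk: `USE_VAPI_CHANNEL`, `cq_poll`, `CQ_POLL`, and `poll_once` are hypothetical names invented here, and the real parameter types of `MPIDI_CH3I_MRAILI_Cq_poll` are not shown anywhere in this thread.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for a poll routine whose argument count differs
 * between channels. Neither body reflects MVAPICH2 internals; each one
 * simply reports how many arguments it received.
 */
#ifdef USE_VAPI_CHANNEL /* hypothetical build flag */
static int cq_poll(void *v_ptr, void *arg2, int arg3)
{
    (void)v_ptr; (void)arg2; (void)arg3;
    return 3;
}
/* Drop the fourth argument where the channel has no use for it. */
#define CQ_POLL(v, a2, a3, blocking) cq_poll((v), (a2), (a3))
#else
static int cq_poll(void *v_ptr, void *arg2, int arg3, int is_blocking)
{
    (void)v_ptr; (void)arg2; (void)arg3; (void)is_blocking;
    return 4;
}
#define CQ_POLL(v, a2, a3, blocking) cq_poll((v), (a2), (a3), (blocking))
#endif

/* Shared call site: the same line compiles for either channel. */
int poll_once(void *v_ptr, int is_blocking)
{
    return CQ_POLL(v_ptr, NULL, 0, is_blocking);
}
```

Built without the flag, `poll_once` reaches the four-argument version; built with `-DUSE_VAPI_CHANNEL`, the same call site reaches the three-argument one.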

From paulhoward at microway.com Thu Aug 2 09:44:05 2007
From: paulhoward at microway.com (Paul Howard)
Date: Thu Aug 2 09:44:23 2007
Subject: [mvapich-discuss] ENOMEM when writing to /dev/infiniband/uverbs0
In-Reply-To: <200708010246.l712kxGN014447@xi.cse.ohio-state.edu>
References: <200708010246.l712kxGN014447@xi.cse.ohio-state.edu>
Message-ID: <46B1DFA5.6000807@microway.com>

Dr. Panda,

You wrote:

> Thanks for reporting this problem. Have you tried this application
> with the latest MVAPICH 0.9.9 release? MVAPICH 0.9.8 is already one
> year old. Many new features, enhancements and bug fixes have gone into
> the 0.9.9 version. Some of these are related to memory allocation. Can
> you try this with MVAPICH 0.9.9 and let us know the outcome? If the
> problem persists with 0.9.9, it will be easier to debug.
>

The problem does not seem to occur with the latest MVAPICH 0.9.9.
Thanks for the suggestion.

> Also, do you see this problem with MVAPICH2 0.9.8p3 (or the latest
> released MVAPICH2 1.0-beta)? This will also help us to narrow down the
> problem.
>

We did not try MVAPICH2.

I guess no further action is required on your part or mine, unless you
need more information from me.

Thanks,
Paul

Original problem report:

>
>> I have an issue with an MPI application.
>>
>> The version of MVAPICH is 0.9.8, compiled with PGI 6.2.
>>
>> The program, also compiled with PGI 6.2, is running on an 8-node
>> cluster, with 2 dual-core Opteron 2218's on each node. Each node has
>> 4GB of memory. The nodes are named node10, node11, ..., node17. I
>> start the MPI job on node10: "mpirun -np 32 ./wrf.exe". The machines
>> list lists the 8 nodes on the first 8 lines, then repeats those 8
>> lines 3 more times, for a total of 32 lines.
>>
>> The program runs successfully as root with np=32. (It takes hours to
>> run.) When run as an ordinary user, it fails almost immediately
>> (within 5 seconds or so) with a segmentation fault.
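
The machines list described in the report (node10 through node17 on the first eight lines, then the same eight lines repeated three more times) spreads the 32 ranks round-robin across the nodes. As a small illustration, the C sketch below writes such a list to any stream. `write_machines_list` and its parameters are names made up for this example; nothing here comes from MVAPICH itself.

```c
#include <stdio.h>

/*
 * Write a round-robin machines list: `nodes` hosts named
 * <prefix><first>, <prefix><first+1>, ... with the whole group
 * repeated `repeats` times.
 * Returns the number of lines written, or -1 on a write error.
 * (Hypothetical helper, for illustration only.)
 */
int write_machines_list(FILE *out, const char *prefix,
                        int first, int nodes, int repeats)
{
    int lines = 0;
    for (int r = 0; r < repeats; r++) {
        for (int n = 0; n < nodes; n++) {
            if (fprintf(out, "%s%d\n", prefix, first + n) < 0)
                return -1;
            lines++;
        }
    }
    return lines;
}
```

Calling `write_machines_list(stdout, "node", 10, 8, 4)` emits the 32-line layout from the report.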
>>
>> It also fails when I remove the last 3 occurrences of node10 from the
>> machines list and run with np=29 as an ordinary user (and as expected,
>> it does not fail immediately as root with np=29). Doing it this way
>> lets me run strace on the single process on node10.
>>
>> It seems to fail with error ENOMEM some times but not every time that
>> it writes to /dev/infiniband/uverbs0. It reports ENOMEM a number of
>> times; the segmentation fault came on the 38th ENOMEM. (When run in a
>> similar way as root, with np=29 and running strace on the only process
>> on node10, there are no ENOMEM errors.) I couldn't find anything with
>> Google.
>>
>> The output of strace is like this (I've added some blank lines to make
>> things stand out). I can provide the whole 7MB strace log if
>> it would be useful.
>>
>> =============== START OF strace SNIPPETS ===============================
>>
>> [280 lines deleted]
>>
>> open("/sys/class/infiniband_verbs/uverbs0/abi_version", O_RDONLY) = 3
>> read(3, "1\n", 8) = 2
>> close(3) = 0
>> open("/sys/class/infiniband_verbs/uverbs0/device/vendor", O_RDONLY) = 3
>> read(3, "0x15b3\n", 8) = 7
>> close(3) = 0
>> open("/sys/class/infiniband_verbs/uverbs0/device/device", O_RDONLY) = 3
>> read(3, "0x6274\n", 8) = 7
>> close(3) = 0
>>
>>
>> open("/dev/infiniband/uverbs0", O_RDWR) = 3
>>
>>
>> write(3, "\0\0\0\0\4\0\4\0000\223\336\356\377\177\0\0", 16) = 16
>> mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, 3, 0) = 0x2ac1bc8e2000
>> write(3, "\3\0\0\0\4\0\3\0\0\223\336\356\377\177\0\0", 16) = 16
>> write(3, "\3\0\0\0\4\0\3\0`\223\336\356\377\177\0\0", 16) = 16
>> write(3, "\2\0\0\0\6\0\n\0\20\223\336\356\377\177\0\0\1\0\0\0\0\0"..., 24) = 24
>>
>>
>> [about 157000 lines deleted, none of them involving opening or closing
>> fd=3, but about 200 of them involving write(3,...)]
>>
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220n/\0\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\222/\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\300\265/\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0!\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0\"\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0-\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0.\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0/\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0000\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0001\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0002\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0003\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0004\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0005\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0006\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0007\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0008\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0009\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0:\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0;\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0<\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0=\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0>\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\260\332/\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0?\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0@\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0A\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0B\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0C\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0D\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0E\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0F\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0G\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0H\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0J\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\220\377/\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0K\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\200$0\0\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0L\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0pI0\0\0\0\0\0\360"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0M\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0Pn0\0\0\0\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0N\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0@\2230\0\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0O\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0000\2700\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> write(3, "\r\0\0\0\3\0\0\0P\0\0\0", 12) = 12
>> write(3, "\t\0\0\0\f\0\3\0pf\336\356\377\177\0\0\0\20\3350\0\0\0"..., 48) = 48
>> lseek(8, 0, SEEK_CUR) = 71696384
>> read(8, "\276\22\335r\275\367\315A\275\312\337p\275\237\22\303\275"..., 131072) = 131072
>> lseek(8, 0, SEEK_CUR) = 71827456
>>
>>
>>
>> [another 1000 lines or so not involving fd=3]
>>
>>
>>
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\240\353\'\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0@\360\'\0\0\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\340\364\'\0\0"..., 48) = 48
>> write(3, "\t\0\0\0\f\0\3\0@f\336\356\377\177\0\0\0\200\371\'\0\0"..., 48) = -1 ENOMEM (Cannot allocate memory)
>> --- SIGSEGV (Segmentation fault) @ 0 (0) ---
>> +++ killed by SIGSEGV +++
>> Process 12024 detached
>>
>> =============== END OF strace SNIPPETS ===============================
>>
>>
>>
>>
>> I'd appreciate any insight into this problem. Let me know if you need
>> more information, or the full log file.
>>
>> Thanks,
>> Paul
>>
>> --
>> Paul Howard
>> Chief Scientist
>> Microway, Inc.
>>
>> paulhoward@microway.com
>> 1-508-732-5521
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss@cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>

--
Paul Howard
Chief Scientist
Microway, Inc.

paulhoward@microway.com
1-508-732-5521

From THOMAS.T.O'SHEA at saic.com Fri Aug 3 18:33:20 2007
From: THOMAS.T.O'SHEA at saic.com (OShea, Thomas T.)
Date: Fri Aug 3 18:33:56 2007
Subject: [mvapich-discuss] MVAPICH Error
Message-ID: <3A8D5723B7BEC34C88B5506F25F3FA4607437C3E@0599-its-exmb02.us.saic.com>

Hello again,

Thanks for all your help in the past; I've been able to get my code up
and running on a small 32 processor cluster. I'm doing scaling tests and
I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors and
saw fairly good scaling. When I increased the array sizes to 32x32x32 my
code runs fine for all but the 8 processor case. The odd part is that it
doesn't crash until the 15th iteration, and I'm doing 21 iterations for
each case. Here is the error it produces:

ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion
'(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed.

I imagine this will be a pain for me to debug since it takes about 30
minutes to get to the point where it fails. Ever seen this error or have
any idea what might be causing it? Any tips would be greatly
appreciated.

Thanks,

Thomas O'Shea
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070803/7edae6df/attachment-0001.html

From Nathan.Dauchy at noaa.gov Fri Aug 3 19:19:33 2007
From: Nathan.Dauchy at noaa.gov (Nathan Dauchy)
Date: Sat Aug 4 07:53:32 2007
Subject: [mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
Message-ID: <46B3B805.4080800@noaa.gov>

Pierrick, All,

We recently upgraded to OFED-1.2 (mvapich-0.9.9) and are now getting an
error that looks similar to yours:

[0:w72] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
code=16 at line 2552 in file viacheck.c

Did you ever find a solution? (I can't find one in the archives.)

Can someone explain what the IBV_EVENT_QP_LAST_WQE_REACHED error means?
I can't find any clues in the source, and have been unable to turn up
any relevant docs either.

Thanks for any help and clues you can offer!

Regards,
Nathan

Pierrick Penven, Tue Mar 27 12:33:03 EDT 2007:
> Dear all,
>
> I am trying to install an ocean model on a cluster based on 64 bit bi-dual
> core AMD opterons with infiniband using pathf90 and mvapich v0.9.8.
>
> The model runs and scales well on 1 node, but is not able to run on several
> nodes. I have tried to use mvapich v0.9.9.beta, and I get the following
> message:
>
> [1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion with error
> IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
> [1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line 2554 in
> file viacheck.c
> [0:chpcc058] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
> at line 2554 in file viacheck.c
> /CHPC/usr/local/mvapich_099b/bin/mpirun: line 1: 17412
> Terminated /CHPC/usr/local/mvapich_099b/bin/mpirun_rsh -np
> 2 -hostfile /CHPC/home/loadl/execute/chpcln.4269.0.machinefile /CHPC/home/ppenven/Roms_tools/TEST1/./roms
>
> The problem does not occur using the MPI over IP rather than VAPI.
>
> Is there a solution to this problem?
>
> Thanks a lot
>
> Pierrick

From panda at cse.ohio-state.edu Sat Aug 4 08:14:30 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sat Aug 4 08:14:53 2007
Subject: [mvapich-discuss] MVAPICH Error
In-Reply-To: <3A8D5723B7BEC34C88B5506F25F3FA4607437C3E@0599-its-exmb02.us.saic.com> from "OShea, Thomas T." at Aug 03, 2007 03:33:20 PM
Message-ID: <200708041214.l74CEU9N009109@xi.cse.ohio-state.edu>

Hi Thomas,

Are you seeing this behavior with MVAPICH2 0.9.8p2 with the patch Gopal
had sent to you on July 7th? Have you tried MVAPICH2 0.9.8p3 or the
latest release MVAPICH2 1.0-beta? Do you see the same behavior with
these two versions also? In these versions we have applied a better
solution to the problem you had reported originally.

If you can let us know which version you are using currently, it will
help us to narrow down the problem further.

Best Regards,

DK

> Hello again,
>
> Thanks for all your help in the past; I've been able to get my code up
> and running on a small 32 processor cluster. I'm doing scaling tests and
> I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors and
> saw fairly good scaling. When I increased the array sizes to 32x32x32 my
> code runs fine for all but the 8 processor case. The odd part is that it
> doesn't crash until the 15th iteration, and I'm doing 21 iterations for
> each case. Here is the error it produces:
>
> ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion
> '(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed.
>
> I imagine this will be a pain for me to debug since it takes about 30
> minutes to get to the point where it fails. Ever seen this error or have
> any idea what might be causing it? Any tips would be greatly
> appreciated.
>
> Thanks,
>
> Thomas O'Shea

From surs at cse.ohio-state.edu Sat Aug 4 15:27:21 2007
From: surs at cse.ohio-state.edu (Sayantan Sur)
Date: Sat Aug 4 15:27:41 2007
Subject: [mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
In-Reply-To: <46B3B805.4080800@noaa.gov>
References: <46B3B805.4080800@noaa.gov>
Message-ID: <46B4D319.5060302@cse.ohio-state.edu>

Hi Nathan,

Nathan Dauchy wrote:
> Pierrick, All,
>
> We recently upgraded to OFED-1.2 (mvapich-0.9.9) and are now getting an
> error that looks similar to yours:
>
> [0:w72] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
> code=16 at line 2552 in file viacheck.c
>
> Did you ever find a solution? (I can't find one in the archives.)
>
> Can someone explain what the IBV_EVENT_QP_LAST_WQE_REACHED error means?
> I can't find any clues in the source, and have been unable to turn up
> any relevant docs either.
>
> Thanks for any help and clues you can offer!
>

Thanks for reporting the problem. The event
IBV_EVENT_QP_LAST_WQE_REACHED means that the QP (internal InfiniBand
communication channel) is in an error state and all requests are
consumed. Could it be related to a setup issue? Can you run any other
MPI programs such as OSU benchmarks, IMB etc. on all these nodes?

Thanks,
Sayantan.

> Regards,
> Nathan
>
>
> Pierrick Penven, Tue Mar 27 12:33:03 EDT 2007:
>
>> Dear all,
>>
>> I am trying to install an ocean model on a cluster based on 64 bit bi-dual
>> core AMD opterons with infiniband using pathf90 and mvapich v0.9.8.
>>
>> The model runs and scales well on 1 node, but is not able to run on several
>> nodes.
>> I have tried to use mvapich v0.9.9.beta, and I get the following
>> message:
>>
>> [1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion with error
>> IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
>> [1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line 2554 in
>> file viacheck.c
>> [0:chpcc058] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>> at line 2554 in file viacheck.c
>> /CHPC/usr/local/mvapich_099b/bin/mpirun: line 1: 17412
>> Terminated /CHPC/usr/local/mvapich_099b/bin/mpirun_rsh -np
>> 2 -hostfile /CHPC/home/loadl/execute/chpcln.4269.0.machinefile /CHPC/home/ppenven/Roms_tools/TEST1/./roms
>>
>> The problem does not occur using the MPI over IP rather than VAPI.
>>
>> Is there a solution to this problem?
>>
>> Thanks a lot
>>
>> Pierrick
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

--
http://www.cse.ohio-state.edu/~surs

From Nathan.Dauchy at noaa.gov Sun Aug 5 09:53:23 2007
From: Nathan.Dauchy at noaa.gov (Nathan Dauchy)
Date: Sun Aug 5 09:53:16 2007
Subject: [mvapich-discuss] FATAL event IBV_EVENT_QP_LAST_WQE_REACHED
In-Reply-To: <46B4D319.5060302@cse.ohio-state.edu>
References: <46B3B805.4080800@noaa.gov> <46B4D319.5060302@cse.ohio-state.edu>
Message-ID: <46B5D653.8050706@noaa.gov>

Sayantan,

Thanks for the quick reply!

We were previously running openIB gen2 (r-7.6.07 I think) and MVAPICH
0.9.8. In that environment, several benchmarks were run, as well as many
user codes, and I'm not aware of any problems related to
QP_LAST_WQE_REACHED.

Since upgrading, we have not yet run many benchmarks. I'm out of the
office for the next week but will see about running the ones you
suggest when I get back.
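
The abort lines quoted in this thread carry two different kinds of numeric code, which is easy to miss when reading them side by side. The C sketch below simply tabulates the two values that appear in the quoted output. `verbs_code`, `seen_codes`, and `lookup_code` are names invented for this note; the gloss for the event is Sayantan's, while the gloss for `IBV_WC_LOC_PROT_ERR` is a paraphrase of its usual description and is not stated anywhere in the thread.

```c
#include <stddef.h>
#include <string.h>

/* Only the two codes that actually appear in this thread's abort lines. */
struct verbs_code {
    const char *name;    /* identifier as printed in the abort line */
    int code;            /* numeric value printed after "code=" */
    const char *kind;    /* which family of libibverbs values it belongs to */
    const char *meaning; /* short gloss (see the note above on sources) */
};

static const struct verbs_code seen_codes[] = {
    { "IBV_WC_LOC_PROT_ERR", 4, "work completion status",
      "local protection error on a posted work request" },
    { "IBV_EVENT_QP_LAST_WQE_REACHED", 16, "asynchronous event",
      "QP is in an error state and all requests are consumed" },
};

/* Return the table entry for `name`, or NULL if it is not one of the two. */
const struct verbs_code *lookup_code(const char *name)
{
    for (size_t i = 0; i < sizeof seen_codes / sizeof seen_codes[0]; i++)
        if (strcmp(seen_codes[i].name, name) == 0)
            return &seen_codes[i];
    return NULL;
}
```

Note that in Pierrick's output the completion error (code=4) is printed before the fatal event (code=16) on the same rank.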

Thanks,
Nathan

Sayantan Sur wrote:
> Hi Nathan,
>
> Nathan Dauchy wrote:
>> Pierrick, All,
>>
>> We recently upgraded to OFED-1.2 (mvapich-0.9.9) and are now getting an
>> error that looks similar to yours:
>>
>> [0:w72] Abort: [0] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED,
>> code=16 at line 2552 in file viacheck.c
>>
>> Did you ever find a solution? (I can't find one in the archives.)
>>
>> Can someone explain what the IBV_EVENT_QP_LAST_WQE_REACHED error means?
>> I can't find any clues in the source, and have been unable to turn up
>> any relevant docs either.
>>
>> Thanks for any help and clues you can offer!
>>
>
>
> Thanks for reporting the problem. The event
> IBV_EVENT_QP_LAST_WQE_REACHED means that the QP (internal InfiniBand
> communication channel) is in an error state and all requests are
> consumed. Could it be related to a setup issue? Can you run any other
> MPI programs such as OSU benchmarks, IMB etc. on all these nodes?
>
> Thanks,
> Sayantan.
>
>> Regards,
>> Nathan
>>
>>
>> Pierrick Penven, Tue Mar 27 12:33:03 EDT 2007:
>>
>>> Dear all,
>>>
>>> I am trying to install an ocean model on a cluster based on 64 bit
>>> bi-dual core AMD opterons with infiniband using pathf90 and mvapich
>>> v0.9.8.
>>>
>>> The model runs and scales well on 1 node, but is not able to run
>>> on several nodes. I have tried to use mvapich v0.9.9.beta, and I
>>> get the following message:
>>>
>>> [1:chpcc060] Abort: [1:chpcc060] Abort: [chpcc060:1] Got completion
>>> with error IBV_WC_LOC_PROT_ERR, code=4 at line 2374 in file viacheck.c
>>> [1] Got FATAL event IBV_EVENT_QP_LAST_WQE_REACHED, code=16 at line
>>> 2554 in file viacheck.c
>>> [0:chpcc058] Abort: [0] Got FATAL event
>>> IBV_EVENT_QP_LAST_WQE_REACHED, code=16
>>> at line 2554 in file viacheck.c
>>> /CHPC/usr/local/mvapich_099b/bin/mpirun: line 1: 17412
>>> Terminated /CHPC/usr/local/mvapich_099b/bin/mpirun_rsh
>>> -np 2 -hostfile /CHPC/home/loadl/execute/chpcln.4269.0.machinefile
>>> /CHPC/home/ppenven/Roms_tools/TEST1/./roms
>>>
>>> The problem does not occur using the MPI over IP rather than VAPI.
>>>
>>> Is there a solution to this problem?
>>>
>>> Thanks a lot
>>>
>>> Pierrick
>>>
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss@cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
>

From THOMAS.T.O'SHEA at saic.com Mon Aug 6 18:22:52 2007
From: THOMAS.T.O'SHEA at saic.com (OShea, Thomas T.)
Date: Mon Aug 6 18:23:30 2007
Subject: [mvapich-discuss] MVAPICH Error
In-Reply-To: <200708041214.l74CEU9N009109@xi.cse.ohio-state.edu>
References: <3A8D5723B7BEC34C88B5506F25F3FA4607437C3E@0599-its-exmb02.us.saic.com> from "OShea, Thomas T." at Aug 03, 2007 03:33:20 PM <200708041214.l74CEU9N009109@xi.cse.ohio-state.edu>
Message-ID: <3A8D5723B7BEC34C88B5506F25F3FA4607437C4B@0599-its-exmb02.us.saic.com>

Hello,

Actually we would like to be running 1.0-beta, but we are having trouble
compiling it. The configure script bombs out while trying to find the
size of 'bool' or something.

The version we are currently using is the 0.9.8p3 with the patch you
gave me earlier applied.

Thanks,
Tom

-----Original Message-----
From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu]
Sent: Saturday, August 04, 2007 5:15 AM
To: OShea, Thomas T.
Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] MVAPICH Error Hi Thomas, Are you seeing this behavior with MVAPICH2 0.9.8p2 with the patch Gopal had sent to you on July 7th? Have you tried MVAPICH2 0.9.8p3 or the latest release MVAPICH2 1.0-beta. Do you see the same behavior with these two versions also. In these versions we have applied a better solution to the problem you had reported originally. If you can let us know which version you are using currently, it will help us to narrow down the problem further. Best Regards, DK > Hello again, > > Thanks for all your help in the past; I've been able to get my code up > and running on a small 32 processor cluster. I'm doing scaling tests and > I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors and > saw fairly good scaling. When I increased the array sizes to 32x32x32 my > code runs fine for all but the 8 processor case. The odd part is that is > doesn't crash until the 15th iteration, and I'm doing 21 iterations for > each case. Here is the error it produces: > > > > ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion > '(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed. > > > > I imagine this will be a pain for me to debug since it takes about 30 > minutes to get to the point where it fails. Ever seen this error or have > any idea what might be causing it? Any tips would be greatly > appreciated. > > > > Thanks, > > Thomas O'Shea From panda at cse.ohio-state.edu Mon Aug 6 23:36:12 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Aug 6 23:36:36 2007 Subject: [mvapich-discuss] MVAPICH Error In-Reply-To: <3A8D5723B7BEC34C88B5506F25F3FA4607437C4B@0599-its-exmb02.us.saic.com> from "OShea, Thomas T." at Aug 06, 2007 03:22:52 PM Message-ID: <200708070336.l773aCDV024485@xi.cse.ohio-state.edu> > Hello, > > Actually we would like to be running 1.0-beta, but we are having trouble > compiling it.
The configure script bombs out while trying to find the > size of 'bool' or something. Sorry to know that you are having trouble compiling 1.0-beta. Could you please let us know the exact error you are seeing. It will help us to solve this problem. We have not seen any such errors on our systems. > The version we are currently using is the 0.9.8p3 with the patch you > gave me earlier applied. Thanks for this information. We will investigate the assertion error issue. Thanks, DK > Thanks, > Tom > > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: Saturday, August 04, 2007 5:15 AM > To: OShea, Thomas T. > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] MVAPICH Error > > Hi Thomas, > > Are you seeing this behavior with MVAPICH2 0.9.8p2 with the patch > Gopal had sent to you on July 7th? > > Have you tried MVAPICH2 0.9.8p3 or the latest release MVAPICH2 > 1.0-beta. Do you see the same behavior with these two versions > also. In these versions we have applied a better solution to the > problem you had reported originally. > > If you can let us know which version you are using currently, it will > help us to narrow down the problem further. > > Best Regards, > > DK > > > Hello again, > > > > Thanks for all your help in the past; I've been able to get my code up > > and running on a small 32 processor cluster. I'm doing scaling tests > and > > I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors > and > > saw fairly good scaling. When I increased the array sizes to 32x32x32 > my > > code runs fine for all but the 8 processor case. The odd part is that > is > > doesn't crash until the 15th iteration, and I'm doing 21 iterations > for > > each case. Here is the error it produces: > > > > > > > > ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion > > '(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed.
> > > > > > > > I imagine this will be a pain for me to debug since it takes about 30 > > minutes to get to the point where it fails. Ever seen this error or > have > > any idea what might be causing it? Any tips would be greatly > > appreciated. > > > > > > > > Thanks, > > > > Thomas O'Shea > > From nilesha at cdac.in Wed Aug 8 10:29:19 2007 From: nilesha at cdac.in (Nilesh Awate) Date: Wed Aug 8 10:31:04 2007 Subject: [mvapich-discuss] Getting Error !!! Message-ID: <46B9D33F.1070809@cdac.in> Hi I am trying to install MVAPICH2-1.0 beta using the SilverStorm (InfiniHost0) udapl 1.1 stack but I encountered the following error rdma_udapl_priv.c: In function `rdma_iba_hca_init': rdma_udapl_priv.c:753: error: structure has no member named `max_rdma_read_iov' rdma_udapl_priv.c:754: error: structure has no member named `max_rdma_write_iov' rdma_udapl_priv.c:762: error: structure has no member named `max_message_size' rdma_udapl_priv.c:779: error: structure has no member named `max_rdma_read_iov' rdma_udapl_priv.c:780: error: structure has no member named `max_rdma_write_iov' rdma_udapl_priv.c:781: error: structure has no member named `srq_soft_hw' rdma_udapl_priv.c: In function `cm_ep_create': rdma_udapl_priv.c:753: error: structure has no member named `max_rdma_read_iov' rdma_udapl_priv.c:754: error: structure has no member named `max_rdma_write_iov' rdma_udapl_priv.c:762: error: structure has no member named `max_message_size' rdma_udapl_priv.c:779: error: structure has no member named `max_rdma_read_iov' rdma_udapl_priv.c:780: error: structure has no member named `max_rdma_write_iov' rdma_udapl_priv.c:781: error: structure has no member named `srq_soft_hw' Is there any Solution or way of compilation or MVAPICH2-1.0 is not compatible with udapl1.1 waiting for reply Nilesh From THOMAS.T.O'SHEA at saic.com Wed Aug 8 13:36:58 2007 From: THOMAS.T.O'SHEA at saic.com (OShea, Thomas T.)
Date: Wed Aug 8 13:37:36 2007 Subject: [mvapich-discuss] MVAPICH Error In-Reply-To: <200708070336.l773aCDV024485@xi.cse.ohio-state.edu> References: <3A8D5723B7BEC34C88B5506F25F3FA4607437C4B@0599-its-exmb02.us.saic.com> from "OShea, Thomas T." at Aug 06, 2007 03:22:52 PM <200708070336.l773aCDV024485@xi.cse.ohio-state.edu> Message-ID: <3A8D5723B7BEC34C88B5506F25F3FA4607437C54@0599-its-exmb02.us.saic.com> As it turns out my system admin tells me that the installation of open fabrics wasn't complete enough, once he fixed that it installed without a hitch. With this new 1.0-beta version I'm not getting that error any longer. Although, run times seem to be a bit longer; but we are still investigating that. Thanks for you help. Tom -----Original Message----- From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] Sent: Monday, August 06, 2007 8:36 PM To: OShea, Thomas T. Cc: panda@cse.ohio-state.edu; mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] MVAPICH Error > Hello, > > Actually we would like to be running 1.0-beta, but we are having trouble > compiling it. The configure script bombs out while trying to find the > size of 'bool' or something. Sorry to know that you are having trouble compiling 1.0-beta. Could you please let us know the exact error you are seeing. It will help us to solve this problem. We have not seen any such errors on our systems. > The version we are currently using is the 0.9.8p3 with the patch you > gave me earlier applied. Thanks for this information. We will investigate the assertion error issue. Thanks, DK > Thanks, > Tom > > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: Saturday, August 04, 2007 5:15 AM > To: OShea, Thomas T. > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] MVAPICH Error > > Hi Thomas, > > Are you seeing this behavior with MVAPICH2 0.9.8p2 with the patch > Gopal had sent to you on July 7th? 
> > Have you tried MVAPICH2 0.9.8p3 or the latest release MVAPICH2 > 1.0-beta. Do you see the same behavior with these two versions > also. In these versions we have applied a better solution to the > problem you had reported originally. > > If you can let us know which version you are using currently, it will > help us to narrow down the problem further. > > Best Regards, > > DK > > > Hello again, > > > > Thanks for all your help in the past; I've been able to get my code up > > and running on a small 32 processor cluster. I'm doing scaling tests > and > > I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors > and > > saw fairly good scaling. When I increased the array sizes to 32x32x32 > my > > code runs fine for all but the 8 processor case. The odd part is that > is > > doesn't crash until the 15th iteration, and I'm doing 21 iterations > for > > each case. Here is the error it produces: > > > > > > > > ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion > > '(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed. > > > > > > > > I imagine this will be a pain for me to debug since it takes about 30 > > minutes to get to the point where it fails. Ever seen this error or > have > > any idea what might be causing it? Any tips would be greatly > > appreciated. > > > > > > > > Thanks, > > > > Thomas O'Shea > > From panda at cse.ohio-state.edu Wed Aug 8 15:50:07 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed Aug 8 15:50:31 2007 Subject: [mvapich-discuss] MVAPICH Error In-Reply-To: <3A8D5723B7BEC34C88B5506F25F3FA4607437C54@0599-its-exmb02.us.saic.com> from "OShea, Thomas T." at Aug 08, 2007 10:36:58 AM Message-ID: <200708081950.l78Jo70v008856@xi.cse.ohio-state.edu> > As it turns out my system admin tells me that the installation of open > fabrics wasn't complete enough, once he fixed that it installed without > a hitch. > > With this new 1.0-beta version I'm not getting that error any longer.
Glad to know that you do not see any errors with 1.0-beta. > Although, run times seem to be a bit longer; but we are still > investigating that. OK. Please keep us updated about your investigation result so that we will take a look at it. > Thanks for you help. You are welcome. Best Regards, DK > Tom > > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: Monday, August 06, 2007 8:36 PM > To: OShea, Thomas T. > Cc: panda@cse.ohio-state.edu; mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] MVAPICH Error > > > Hello, > > > > Actually we would like to be running 1.0-beta, but we are having > trouble > > compiling it. The configure script bombs out while trying to find the > > size of 'bool' or something. > > Sorry to know that you are having trouble compiling 1.0-beta. Could > you please let us know the exact error you are seeing. It will help us > to solve this problem. We have not seen any such errors on our > systems. > > > The version we are currently using is the 0.9.8p3 with the patch you > > gave me earlier applied. > > Thanks for this information. We will investigate the assertion error > issue. > > Thanks, > > DK > > > Thanks, > > Tom > > > > -----Original Message----- > > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > > Sent: Saturday, August 04, 2007 5:15 AM > > To: OShea, Thomas T. > > Cc: mvapich-discuss@cse.ohio-state.edu > > Subject: Re: [mvapich-discuss] MVAPICH Error > > > > Hi Thomas, > > > > Are you seeing this behavior with MVAPICH2 0.9.8p2 with the patch > > Gopal had sent to you on July 7th? > > > > Have you tried MVAPICH2 0.9.8p3 or the latest release MVAPICH2 > > 1.0-beta. Do you see the same behavior with these two versions > > also. In these versions we have applied a better solution to the > > problem you had reported originally. > > > > If you can let us know which version you are using currently, it will > > help us to narrow down the problem further. 
> > > > Best Regards, > > > > DK > > > > > Hello again, > > > > > > Thanks for all your help in the past; I've been able to get my code > up > > > and running on a small 32 processor cluster. I'm doing scaling tests > > and > > > I ran with an array size of 16x16x16 with 1,2,4,8 and 16 processors > > and > > > saw fairly good scaling. When I increased the array sizes to > 32x32x32 > > my > > > code runs fine for all but the 8 processor case. The odd part is > that > > is > > > doesn't crash until the 15th iteration, and I'm doing 21 iterations > > for > > > each case. Here is the error it produces: > > > > > > > > > > > > ch3_rndvtransfer.c:614: MPIDI_CH3_Get_rndv_push: Assertion > > > '(get_resp_pkt->seqnum) + 1 == (vc)->seqnum_send' failed. > > > > > > > > > > > > I imagine this will be a pain for me to debug since it takes about > 30 > > > minutes to get to the point where it fails. Ever seen this error or > > have > > > any idea what might be causing it? Any tips would be greatly > > > appreciated. > > > > > > > > > > > > Thanks, > > > > > > Thomas O'Shea > > > > > From chai.15 at osu.edu Wed Aug 8 18:00:46 2007 From: chai.15 at osu.edu (LEI CHAI) Date: Wed Aug 8 18:01:33 2007 Subject: [mvapich-discuss] Getting Error !!! Message-ID: <26082bd26062f0.26062f026082bd@osu.edu> Hi Nilesh, Thanks for trying MVAPICH2-1.0 beta. MVAPICH2-uDAPL is uDAPL 1.2 compliant. As far as we know both OpenFabrics uDAPL and Solaris uDAPL are of version 1.2. You may want to update your uDAPL library... Lei ----- Original Message ----- From: Nilesh Awate Date: Wednesday, August 8, 2007 7:29 am Subject: [mvapich-discuss] Getting Error !!!
> Hi > > I m Trying Installation of MVAPICH2-1.0 beta using SilverStorm > (InfiniHost0) udapl 1.1 stack > > but I encoured following error > > rdma_udapl_priv.c: In function `rdma_iba_hca_init': > rdma_udapl_priv.c:753: error: structure has no member named > `max_rdma_read_iov' > rdma_udapl_priv.c:754: error: structure has no member named > `max_rdma_write_iov' > rdma_udapl_priv.c:762: error: structure has no member named > `max_message_size' > rdma_udapl_priv.c:779: error: structure has no member named > `max_rdma_read_iov' > rdma_udapl_priv.c:780: error: structure has no member named > `max_rdma_write_iov' > rdma_udapl_priv.c:781: error: structure has no member named > `srq_soft_hw'rdma_udapl_priv.c: In function `cm_ep_create': > rdma_udapl_priv.c:753: error: structure has no member named > `max_rdma_read_iov' > rdma_udapl_priv.c:754: error: structure has no member named > `max_rdma_write_iov' > rdma_udapl_priv.c:762: error: structure has no member named > `max_message_size' > rdma_udapl_priv.c:779: error: structure has no member named > `max_rdma_read_iov' > rdma_udapl_priv.c:780: error: structure has no member named > `max_rdma_write_iov' > rdma_udapl_priv.c:781: error: structure has no member named > `srq_soft_hw' > Is there any Solution or way of compilation > > or MVAPICH2-1.0 is not compatible with udapl1.1 > > waiting for reply > > Nilesh > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From mhouston at graphics.stanford.edu Mon Aug 13 18:07:43 2007 From: mhouston at graphics.stanford.edu (Mike Houston) Date: Mon Aug 13 18:08:12 2007 Subject: [mvapich-discuss] Announcing the release of MVAPICH2 1.0-beta In-Reply-To: <200708020202.l7222Arn021331@xi.cse.ohio-state.edu> References: <200708020202.l7222Arn021331@xi.cse.ohio-state.edu> Message-ID: <46C0D62F.9030002@graphics.stanford.edu> I just grabbed the tarball and now hit 
problems with the same line of code: ch3_read_progress.c: In function `MPIDI_CH3I_read_progress': ch3_read_progress.c:146: error: too many arguments to function `MPIDI_CH3I_MRAILI_Cq_poll' Dhabaleswar Panda wrote: > Hi Eric, > > We have fixed these problems. Please feel free to download the latest > version of the code from the trunk (either through svn checkout or the > nightly tarball being generated from the trunk .... changes will be > reflected in tonight's tarball). Web links to these are available on > the mvapich2 download page. > > Let us know if you experience any additional problems. > > Thanks, > > DK > > >> Dr. Panda, >> >> Thank you again for you and your group's hard work on this software. >> >> I'll start by saying that I know I should move over to OpenFabrics and Gen2, >> but as we've discussed previously, this isn't currently a viable option for >> reasons that are outside the scope of this forum. With that said... >> >> A few compilation snags with MVAPICH2-1.0-beta on the VAPI flavor: >> >> (1) In src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_read_progress.c line 146 >> : >> >> type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking) >> >> calls with four arguments; the VAPI version ( defined in >> src/mpid/osu_ch3/channels/mrail/src/vapi/mpidi_ch3_rdma_post.h ) has only >> the first three arguments. I imagine this is just a missing #ifdef switch >> ... >> >> (2) in src/mpid/osu_ch3/channels/mrail/src/vapi/rdma_iba_1sc.c lines 151-156 >> : >> >> if (SMP_INIT) >> { >> /*correspoding post has not been issued */ >> flag = 0; >> break; >> } >> >> These lines appear to have migrated here from somewhere else in the code >> (perhaps the function immediately above it.) The variable flag is undefined >> at this point, and there's a break statement without a loop to break out >> of... 
>> >> By no means a tested fix, but removing the last argument from the issue >> mentioned in (1) and commenting out the offending lines in (2) appears to >> allow the VAPI channel to compile and run (benchmarks, in-house tools) >> successfully. I haven't been able to get logging working, but that is >> another discussion. >> >> Your thoughts? >> >> Thanks again! >> Eric Borisch >> >> On 7/26/07, Dhabaleswar Panda wrote: >> >>> The MVAPICH team is pleased to announce the availability of >>> MVAPICH2-1.0-beta with the following NEW features: >>> >>> - Message coalescing support to enable reduction of per Queue-pair >>> send queues for reduction in memory requirement on large scale >>> clusters. This design also increases the small message messaging >>> rate significantly. Available for Open Fabrics Gen2-IB. >>> >>> - Hot-Spot Avoidance Mechanism (HSAM) for alleviating >>> network congestion in large scale clusters. Available for >>> Open Fabrics Gen2-IB. >>> >>> - RDMA CM based on-demand connection management for large scale >>> clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP. >>> >>> - uDAPL on-demand connection management for large scale clusters. >>> Available for uDAPL interface (including Solaris IB implementation). >>> >>> - RDMA Read support for increased overlap of computation and >>> communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP. >>> >>> - Application-initiated system-level (synchronous) checkpointing in >>> addition to the user-transparent checkpointing. User application can >>> now request a whole program checkpoint synchronously with BLCR by >>> calling special functions within the application. Available for >>> OpenFabrics Gen2-IB. >>> >>> - Network-Level fault tolerance with Automatic Path Migration (APM) >>> for tolerating intermittent network failures over InfiniBand. >>> Available for OpenFabrics Gen2-IB. >>> >>> - Integrated multi-rail communication support for OpenFabrics >>> Gen2-iWARP. 
>>> >>> - Blocking mode of communication progress. Available for OpenFabrics >>> Gen2-IB. >>> >>> - Based on MPICH2 1.0.5p4. >>> >>> For downloading MVAPICH2 1.0-beta source code, associated user guide >>> and accessing the anonymous SVN, please visit the following URL: >>> >>> http://mvapich.cse.ohio-state.edu >>> >>> All feedbacks, including bug reports and hints for performance tuning, >>> are welcome. Please post it to the mvapich-discuss mailing list. >>> >>> Thanks, >>> >>> MVAPICH Team >>> >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss@cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>> >>> >> -- >> Eric A. Borisch >> eborisch@ieee.org >> >> ------=_Part_46971_9804973.1185892355266 >> Content-Type: text/html; charset=ISO-8859-1 >> Content-Transfer-Encoding: 7bit >> Content-Disposition: inline >> >> Dr. Panda,

Thank you again for you and your group's hard work on this software.

I'll start by saying that I know I should move over to OpenFabrics and Gen2, but as we've discussed previously, this isn't currently a viable option for reasons that are outside the scope of this forum. With that said...

A few compilation snags with MVAPICH2-1.0-beta on the VAPI flavor:

(1) In src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_read_progress.c line 146:

type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking)

calls with four arguments; the VAPI version (defined in src/mpid/osu_ch3/channels/mrail/src/vapi/mpidi_ch3_rdma_post.h) has only the first three arguments. I imagine this is just a missing #ifdef switch ...
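
A minimal sketch of the kind of #ifdef switch guessed at in (1). The names USE_VAPI_BACKEND, cq_poll and read_progress are hypothetical stand-ins, not MVAPICH2's real macro or functions; the sketch only assumes that one backend's poll routine takes three arguments and the other takes four, and shows a shared caller picking the call form that matches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: neither variant is MVAPICH2's real code.
 * They only model "one backend takes 3 arguments, the other takes 4". */
#ifdef USE_VAPI_BACKEND                /* hypothetical build macro */
static int cq_poll(void **vbuf, void *vc, int receiving)
{
    (void)vbuf; (void)vc; (void)receiving;
    return 3;                          /* number of arguments this variant takes */
}
#else
static int cq_poll(void **vbuf, void *vc, int receiving, int is_blocking)
{
    (void)vbuf; (void)vc; (void)receiving; (void)is_blocking;
    return 4;                          /* number of arguments this variant takes */
}
#endif

/* Shared caller: the guard selects the call that matches the backend's arity. */
static int read_progress(int is_blocking)
{
    void *v_ptr = NULL;
#ifdef USE_VAPI_BACKEND
    (void)is_blocking;                 /* three-argument variant has no blocking flag */
    return cq_poll(&v_ptr, NULL, 0);
#else
    return cq_poll(&v_ptr, NULL, 0, is_blocking);
#endif
}
```

Compiled as is, read_progress() reaches the four-argument variant; compiled with -DUSE_VAPI_BACKEND it reaches the three-argument one, so the shared file builds against either header.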

(2) in src/mpid/osu_ch3/channels/mrail/src/vapi/rdma_iba_1sc.c lines 151-156:

if (SMP_INIT)
{
    /*correspoding post has not been issued */
    flag = 0;
    break;
}

These lines appear to have migrated here from somewhere else in the code (perhaps the function immediately above it.) The variable flag is undefined at this point, and there's a break statement without a loop to break out of...

By no means a tested fix, but removing the last argument from the issue mentioned in (1) and commenting out the offending lines in (2) appears to allow the VAPI channel to compile and run (benchmarks, in-house tools) successfully. I haven't been able to get logging working, but that is another discussion.
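
The workaround for (2) can be sketched as below, assuming the stray block is simply compiled out with #if 0 rather than deleted. The function post_check and the SMP_INIT definition are hypothetical stand-ins; only the six quoted lines come from the report, and this is an illustration of the workaround, not the fix that went into the trunk.

```c
#include <assert.h>

#define SMP_INIT 1      /* hypothetical stand-in for the real macro */

/* Hypothetical function; it only illustrates disabling a stray block. */
static int post_check(void)
{
    int posted = 1;
#if 0   /* Disabled: `flag` is not declared in this function and `break`
           has no enclosing loop or switch, so the block cannot compile. */
    if (SMP_INIT)
    {
        /*correspoding post has not been issued */
        flag = 0;
        break;
    }
#endif
    return posted;
}
```

Using #if 0 instead of a comment keeps the original lines intact, including their own comment, which would otherwise end an enclosing /* ... */ early.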

Your thoughts?

Thanks again!
 Eric Borisch

On 7/26/07, Dhabaleswar Panda <panda@cse.ohio-state.edu> wrote:
> The MVAPICH team is pleased to announce the availability of
> MVAPICH2-1.0-beta with the following NEW features:
>
> - Message coalescing support to enable reduction of per Queue-pair
>   send queues for reduction in memory requirement on large scale
>   clusters. This design also increases the small message messaging
>   rate significantly. Available for Open Fabrics Gen2-IB.
>
> - Hot-Spot Avoidance Mechanism (HSAM) for alleviating
>   network congestion in large scale clusters. Available for
>   Open Fabrics Gen2-IB.
>
> - RDMA CM based on-demand connection management for large scale
>   clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP.
>
> - uDAPL on-demand connection management for large scale clusters.
>   Available for uDAPL interface (including Solaris IB implementation).
>
> - RDMA Read support for increased overlap of computation and
>   communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP.
>
> - Application-initiated system-level (synchronous) checkpointing in
>   addition to the user-transparent checkpointing. User application can
>   now request a whole program checkpoint synchronously with BLCR by
>   calling special functions within the application. Available for
>   OpenFabrics Gen2-IB.
>
> - Network-Level fault tolerance with Automatic Path Migration (APM)
>   for tolerating intermittent network failures over InfiniBand.
>   Available for OpenFabrics Gen2-IB.
>
> - Integrated multi-rail communication support for OpenFabrics
>   Gen2-iWARP.
>
> - Blocking mode of communication progress. Available for OpenFabrics
>   Gen2-IB.
>
> - Based on MPICH2 1.0.5p4.
>
> For downloading MVAPICH2 1.0-beta source code, associated user guide
> and accessing the anonymous SVN, please visit the following URL:
>
> http://mvapich.cse.ohio-state.edu
>
> All feedbacks, including bug reports and hints for performance tuning,
> are welcome. Please post it to the mvapich-discuss mailing list.
>
> Thanks,
>
> MVAPICH Team
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>


--
Eric A. Borisch
eborisch@ieee.org

>> >> ------=_Part_46971_9804973.1185892355266-- >> >> > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > From mhouston at graphics.stanford.edu Mon Aug 13 18:55:04 2007 From: mhouston at graphics.stanford.edu (Mike Houston) Date: Mon Aug 13 18:55:33 2007 Subject: [mvapich-discuss] Announcing the release of MVAPICH2 1.0-beta In-Reply-To: <46C0D62F.9030002@graphics.stanford.edu> References: <200708020202.l7222Arn021331@xi.cse.ohio-state.edu> <46C0D62F.9030002@graphics.stanford.edu> Message-ID: <46C0E148.3010909@graphics.stanford.edu> Nevermind, I was building a slightly older version I guess. Grabbing from SVN seems to build clean for me. Now onto multi-threaded testing.... Thanks! -Mike Mike Houston wrote: > I just grabbed the tarball and now hit problems with the same line of > code: > > ch3_read_progress.c: In function `MPIDI_CH3I_read_progress': > ch3_read_progress.c:146: error: too many arguments to function > `MPIDI_CH3I_MRAILI_Cq_poll' > > > Dhabaleswar Panda wrote: >> Hi Eric, >> We have fixed these problems. Please feel free to download the latest >> version of the code from the trunk (either through svn checkout or the >> nightly tarball being generated from the trunk .... changes will be >> reflected in tonight's tarball). Web links to these are available on >> the mvapich2 download page. >> >> Let us know if you experience any additional problems. >> >> Thanks, >> DK >> >> >>> Dr. Panda, >>> >>> Thank you again for you and your group's hard work on this software. >>> >>> I'll start by saying that I know I should move over to OpenFabrics >>> and Gen2, >>> but as we've discussed previously, this isn't currently a viable >>> option for >>> reasons that are outside the scope of this forum. With that said... 
>>> >>> A few compilation snags with MVAPICH2-1.0-beta on the VAPI flavor: >>> >>> (1) In src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_read_progress.c >>> line 146 >>> : >>> >>> type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking) >>> >>> calls with four arguments; the VAPI version ( defined in >>> src/mpid/osu_ch3/channels/mrail/src/vapi/mpidi_ch3_rdma_post.h ) has >>> only >>> the first three arguments. I imagine this is just a missing #ifdef >>> switch >>> ... >>> >>> (2) in src/mpid/osu_ch3/channels/mrail/src/vapi/rdma_iba_1sc.c lines >>> 151-156 >>> : >>> >>> if (SMP_INIT) >>> { >>> /*correspoding post has not been issued */ >>> flag = 0; >>> break; >>> } >>> >>> These lines appear to have migrated here from somewhere else in the >>> code >>> (perhaps the function immediately above it.) The variable flag is >>> undefined >>> at this point, and there's a break statement without a loop to break >>> out >>> of... >>> >>> By no means a tested fix, but removing the last argument from the issue >>> mentioned in (1) and commenting out the offending lines in (2) >>> appears to >>> allow the VAPI channel to compile and run (benchmarks, in-house tools) >>> successfully. I haven't been able to get logging working, but that is >>> another discussion. >>> >>> Your thoughts? >>> >>> Thanks again! >>> Eric Borisch >>> >>> On 7/26/07, Dhabaleswar Panda wrote: >>> >>>> The MVAPICH team is pleased to announce the availability of >>>> MVAPICH2-1.0-beta with the following NEW features: >>>> >>>> - Message coalescing support to enable reduction of per Queue-pair >>>> send queues for reduction in memory requirement on large scale >>>> clusters. This design also increases the small message messaging >>>> rate significantly. Available for Open Fabrics Gen2-IB. >>>> >>>> - Hot-Spot Avoidance Mechanism (HSAM) for alleviating >>>> network congestion in large scale clusters. Available for >>>> Open Fabrics Gen2-IB. 
>>>> >>>> - RDMA CM based on-demand connection management for large scale >>>> clusters. Available for OpenFabrics Gen2-IB and Gen2-iWARP. >>>> >>>> - uDAPL on-demand connection management for large scale clusters. >>>> Available for uDAPL interface (including Solaris IB implementation). >>>> >>>> - RDMA Read support for increased overlap of computation and >>>> communication. Available for OpenFabrics Gen2-IB and Gen2-iWARP. >>>> >>>> - Application-initiated system-level (synchronous) checkpointing in >>>> addition to the user-transparent checkpointing. User application can >>>> now request a whole program checkpoint synchronously with BLCR by >>>> calling special functions within the application. Available for >>>> OpenFabrics Gen2-IB. >>>> >>>> - Network-Level fault tolerance with Automatic Path Migration (APM) >>>> for tolerating intermittent network failures over InfiniBand. >>>> Available for OpenFabrics Gen2-IB. >>>> >>>> - Integrated multi-rail communication support for OpenFabrics >>>> Gen2-iWARP. >>>> >>>> - Blocking mode of communication progress. Available for OpenFabrics >>>> Gen2-IB. >>>> >>>> - Based on MPICH2 1.0.5p4. >>>> >>>> For downloading MVAPICH2 1.0-beta source code, associated user guide >>>> and accessing the anonymous SVN, please visit the following URL: >>>> >>>> http://mvapich.cse.ohio-state.edu >>>> >>>> All feedbacks, including bug reports and hints for performance tuning, >>>> are welcome. Please post it to the mvapich-discuss mailing list. >>>> >>>> Thanks, >>>> >>>> MVAPICH Team >>>> >>>> _______________________________________________ >>>> mvapich-discuss mailing list >>>> mvapich-discuss@cse.ohio-state.edu >>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>>> >>>> >>> -- >>> Eric A. Borisch >>> eborisch@ieee.org >>> >>> ------=_Part_46971_9804973.1185892355266 >>> Content-Type: text/html; charset=ISO-8859-1 >>> Content-Transfer-Encoding: 7bit >>> Content-Disposition: inline >>> >>> Dr. Panda,

Thank you again for your and your group's hard work on this software.

I'll start by saying that I know I should move over to OpenFabrics and Gen2, but as we've discussed previously, this isn't currently a viable option for reasons that are outside the scope of this forum. With that said...

A few compilation snags with MVAPICH2-1.0-beta on the VAPI flavor:

(1) In src/mpid/osu_ch3/channels/mrail/src/rdma/ch3_read_progress.c, line 146:

    type = MPIDI_CH3I_MRAILI_Cq_poll(v_ptr, NULL, 0, is_blocking)

calls the function with four arguments; the VAPI version (defined in src/mpid/osu_ch3/channels/mrail/src/vapi/mpidi_ch3_rdma_post.h) takes only the first three. I imagine this is just a missing #ifdef switch ...
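
For readers unfamiliar with this kind of mismatch, here is a minimal sketch of what an #ifdef switch for point (1) could look like. It is not MVAPICH2 code: demo_cq_poll, DEMO_CQ_POLL, demo_read_progress and the DEMO_VAPI build flag are all hypothetical stand-ins, and the real fix in the tree may well look different.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins, not MVAPICH2 code. Build with -DDEMO_VAPI to
 * mimic a channel whose poll routine lacks the trailing is_blocking argument.
 */
#ifdef DEMO_VAPI
static int demo_cq_poll(void **vbuf_out, void *vc, int receiving)
{
    (void)vc;
    (void)receiving;
    *vbuf_out = NULL;
    return 0;
}
/* The wrapper drops the fourth argument for the three-argument flavor. */
#define DEMO_CQ_POLL(v, vc, recv, blocking) demo_cq_poll((v), (vc), (recv))
#else
static int demo_cq_poll(void **vbuf_out, void *vc, int receiving, int is_blocking)
{
    (void)vc;
    (void)receiving;
    *vbuf_out = NULL;
    return is_blocking ? 1 : 0;
}
#define DEMO_CQ_POLL(v, vc, recv, blocking) demo_cq_poll((v), (vc), (recv), (blocking))
#endif

/* Shared caller: the same source line compiles against either flavor. */
int demo_read_progress(int is_blocking)
{
    void *v_ptr = NULL;

    assert(is_blocking == 0 || is_blocking == 1);
    (void)is_blocking; /* unused when DEMO_VAPI is defined */
    return DEMO_CQ_POLL(&v_ptr, NULL, 0, is_blocking);
}
```

Hiding the difference behind one macro keeps the shared caller free of per-channel conditionals; guarding the call site itself with #ifdef would be an equally plausible choice.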

(2) In src/mpid/osu_ch3/channels/mrail/src/vapi/rdma_iba_1sc.c, lines 151-156:

    if (SMP_INIT)
    {
        /*correspoding post has not been issued */
        flag = 0;
        break;
    }

These lines appear to have migrated here from somewhere else in the code (perhaps the function immediately above it). The variable flag is undefined at this point, and there's a break statement without a loop to break out of...
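
As an illustration of why the fragment in (2) cannot compile where it sits, the sketch below shows the kind of context such a block needs: a declared flag and an enclosing loop for the break. Everything here (demo_all_posts_issued, post_issued, smp_init) is a hypothetical stand-in, not the function the lines originally came from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-peer "post issued" table; not MVAPICH2 code. */
int demo_all_posts_issued(const int *post_issued, size_t n, int smp_init)
{
    int flag = 1;               /* declared before it is assigned below */
    size_t i;

    assert(post_issued != NULL || n == 0);

    for (i = 0; i < n; i++) {   /* the enclosing loop makes `break` legal */
        if (smp_init && !post_issued[i]) {
            /* corresponding post has not been issued */
            flag = 0;
            break;
        }
    }
    return flag;
}
```

Without the declaration and the loop, a C compiler rejects both the assignment to flag and the break statement, which matches the two errors described above.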

By no means a tested fix, but removing the last argument from the issue mentioned in (1) and commenting out the offending lines in (2) appears to allow the VAPI channel to compile and run (benchmarks, in-house tools) successfully. I haven't been able to get logging working, but that is another discussion.

Your thoughts?

Thanks again!
 Eric Borisch



--
Eric A. Borisch
eborisch@ieee.org

From yogyas at gmail.com Thu Aug 16 03:16:54 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Thu Aug 16 03:17:19 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
Message-ID:

Hi all,

Usually, to run MVAPICH over OFED, make.mvapich2.ofa is used. After successful compilation, MVAPICH will use "OpenFabrics Gen2-IB" as the underlying transport interface. I have tried this and it is running fine.

Since OFED contains a DAPL component, can the uDAPL interface be used to run MVAPICH over OFED? Or, after compiling MVAPICH with make.mvapich2.udapl, will it work using "uDAPL" as the underlying transport interface provided by OFED?

If anybody has tried this before, please help me.

For info: I am using MVAPICH2-0.9.8 / MVAPICH2-1.0 with OFED 1.1 on an InfiniBand card.

Thanks,
Yogeshwar

From panda at cse.ohio-state.edu Thu Aug 16 08:10:51 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Aug 16 08:11:13 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: from "yogeshwar sonawane" at Aug 16, 2007 12:46:54 PM
Message-ID: <200708161210.l7GCApWr020637@xi.cse.ohio-state.edu>

> Hi all,
> Usually, to run MVAPICH over OFED, make.mvapich2.ofa is used. After
> successful compilation, MVAPICH will use "OpenFabrics Gen2-IB" as
> underlying transport interfaces.
> This i have tried & is running fine.
>
> Now as OFED contains dapl component, so can uDAPL interfaces be used
> to run MVAPICH over OFED ?
> OR
> After compiling MVAPICH with make.mvapich2.udapl, will it work using
> "uDAPL" as underlying transport interfaces provided by OFED ?

Yes, this will work. The uDAPL support in MVAPICH/MVAPICH2 works well with any uDAPL layer (including that of OFED). In fact, for every release we carry out extensive testing of the uDAPL interface over OFED uDAPL.

You can also find this information in the user guides (available from the mvapich web site).

> If anybody has tried this before, can help me.
>
> For info:- I am using MVAPICH2-0.9.8/ MVAPICH2-1.0 with OFED 1.1 on a
> infiniband card.

Thanks,

DK

From yogyas at gmail.com Thu Aug 16 08:53:05 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Thu Aug 16 08:53:31 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To:
References: <200708161210.l7GCApWr020637@xi.cse.ohio-state.edu>
Message-ID:

Hello,

I tried it, but I am getting the following error when I run the cpi application with 2 processes:

[rdma_udapl_priv.c:833] error(-2147287038): Could not create EP
[rdma_udapl_priv.c:830] error(-2147287038): Could not create EP
rank 1 in job 2 in06_32882 caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 2 in06_32882 caused collective abort of all ranks
  exit status of rank 0: return code 1

Any help?

Thanks,
Yogeshwar

On 8/16/07, yogeshwar sonawane wrote:
> Thanks for help.
> I will try it.
>
> Yogeshwar
From panda at cse.ohio-state.edu Thu Aug 16 09:01:29 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Aug 16 09:01:53 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: from "yogeshwar sonawane" at Aug 16, 2007 06:23:05 PM
Message-ID: <200708161301.l7GD1ToX021561@xi.cse.ohio-state.edu>

On your system, are you able to run basic uDAPL-level tests with the OFED 1.1 uDAPL installation? It would be good to try this first to make sure that the uDAPL installation is correct. Then you can put MPI on top of it and carry out MPI-level tests and performance evaluation.

DK
From yogyas at gmail.com Fri Aug 17 08:56:19 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Fri Aug 17 08:56:45 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: <200708161301.l7GD1ToX021561@xi.cse.ohio-state.edu>
References: <200708161301.l7GD1ToX021561@xi.cse.ohio-state.edu>
Message-ID:

Yes, the uDAPL-level tests with the OFED 1.1 uDAPL installation are working fine. I am able to create EPs, transfer data, etc. But with MPI, I am getting this error.

I am using MVAPICH2-0.9.8p2 with OpenIB-cma as the uDAPL provider name. I have OFED 1.1 installed.
From chai.15 at osu.edu Fri Aug 17 15:23:14 2007
From: chai.15 at osu.edu (LEI CHAI)
Date: Fri Aug 17 15:24:29 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
Message-ID: <3e2f73d73e.3d73e3e2f7@osu.edu>

Hi,

Could you try the following things:

Could you check whether gen2 works fine? You can use the make.mvapich2.ofa script to build mvapich2 with gen2.

After making sure gen2 works, if you want to upgrade your OFED version to 1.2 or the latest 1.2.5 release, we are sure mvapich2 will work with the OpenIB-cma provider.
(recommended)

If you want to stick with OFED 1.1, could you use libdaplscm.so instead of libdaplcma.so in your /etc/dat.conf file?

And finally (not related to this problem), please try our latest MVAPICH2-1.0 beta release if you are interested :-)

Lei
From yogyas at gmail.com Sun Aug 19 07:14:30 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Sun Aug 19 07:14:56 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: <3e2f73d73e.3d73e3e2f7@osu.edu>
References: <3e2f73d73e.3d73e3e2f7@osu.edu>
Message-ID:

Hello,

I have tested mvapich2-0.9.8p2 with OFED1.1-gen2. It's working fine.
I have tested mvapich2-0.9.8p2 with OFED1.2-gen2. It's also working fine.
I have tested mvapich2-1.0-beta with OFED1.2-gen2. It's also working fine.
But again, mvapich2-1.0-beta with OFED1.2-uDAPL (OpenIB-cma) is failing with this error:

[rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
[rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
rank 1 in job 1 in06_32868 caused collective abort of all ranks
  exit status of rank 1: return code 1
rank 0 in job 1 in06_32868 caused collective abort of all ranks
  exit status of rank 0: return code 1

Until now, I was not using the MVAPICH2 that is shipped with OFED 1.2; I was compiling MVAPICH2 externally. But then I tried MVAPICH2-0.9.8-12, which is shipped with OFED 1.2, and it's working fine. The problem seems to occur only when I compile MVAPICH2 externally (I think). Any comments?

What is the "-12" in MVAPICH2-0.9.8-12? Some patched version?

I have not tried libdaplscm.so yet.

Yogeshwar

P.S. As per your suggestion, I am now using OFED 1.2 and MVAPICH2-1.0-beta.
From yogyas at gmail.com Sun Aug 19 07:25:33 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Sun Aug 19 07:25:59 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To:
References: <3e2f73d73e.3d73e3e2f7@osu.edu>
Message-ID:

On 8/19/07, yogeshwar sonawane wrote:
> Hello,
> I have tested mvapich2-0.9.8p2 with OFED1.1-gen2. Its working fine.
> I have tested mvapich2-0.9.8p2 with OFED1.2-gen2. Its also working fine.
> I have tested mvapich2-1.0-beta with OFED1.2-gen2. Its also working fine.
> But again, mvapich2-1.0-beta with OFED1.2-uDAPL(OpenIB-cma), its
> failing with error:-
>
> [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> rank 1 in job 1 in06_32868 caused collective abort of all ranks
> exit status of rank 1: return code 1
> rank 0 in job 1 in06_32868 caused collective abort of all ranks
> exit status of rank 0: return code 1
>

The observation below is related to the OFED-uDAPL interface only. I get the error only when I use the OFED-uDAPL interface. With OFED-gen2, things run fine, as mentioned above.

> Till now, i was not using the MVAPICH2 which is shipped with OFED1.2,
> I was compiling MVAPICH2 externally. But then i tried
> MVAPICH2-0.9.8-12 which is shipped with OFED1.2 & its working fine.
> Only when i compile MVAPICH2 externally, the problem seems to be
> occurring (i think)? Any comments ?
>
> What is "-12" from MVAPICH2-0.9.8-12 ? Some patched version?
>
> I have not tried the things with libdaplscm.so till now.
>
> Yogeshwar
> p.s:- As per ur suggestion, i am using now OFED1.2 & MVAPICH2-1.0-beta
From panda at cse.ohio-state.edu Sun Aug 19 09:52:54 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sun Aug 19 09:53:19 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: from "yogeshwar sonawane" at Aug 19, 2007 04:44:30 PM
Message-ID: <200708191352.l7JDqsYS016395@xi.cse.ohio-state.edu>

Hi,

> I have tested mvapich2-0.9.8p2 with OFED1.1-gen2. Its working fine.
> I have tested mvapich2-0.9.8p2 with OFED1.2-gen2. Its also working fine.
> I have tested mvapich2-1.0-beta with OFED1.2-gen2. Its also working fine.
> But again, mvapich2-1.0-beta with OFED1.2-uDAPL(OpenIB-cma), its
> failing with error:-
>
> [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> rank 1 in job 1 in06_32868 caused collective abort of all ranks
> exit status of rank 1: return code 1
> rank 0 in job 1 in06_32868 caused collective abort of all ranks
> exit status of rank 0: return code 1
>
> Till now, i was not using the MVAPICH2 which is shipped with OFED1.2,
> I was compiling MVAPICH2 externally. But then i tried
> MVAPICH2-0.9.8-12 which is shipped with OFED1.2 & its working fine.
> Only when i compile MVAPICH2 externally, the problem seems to be
> occurring (i think)? Any comments ?
Thanks for letting us know the status of your testing for different MVAPICH2 versions. Good to know that MVAPICH2-0.9.8-12 shipped with OFED 1.2 is working fine for you. I am assuming this is working with the uDAPL interface. We will look into why there is a compilation problem when you use MVAPICH2 externally and get back to you soon.

> What is "-12" from MVAPICH2-0.9.8-12 ? Some patched version?

As you might have noticed, OFED releases typically go through multiple RC versions. As this testing continues and problems come up, we have been updating MVAPICH2-0.9.8 with corresponding fixes and different suffixes (-1, -2, ..., -12). From the MPI code perspective, MVAPICH2-0.9.8-12 is equivalent to the latest MVAPICH2-0.9.8p3 (available from the mvapich web site). The OFED version has additional pieces for building it in an integrated manner with other components.

> I have not tried the things with libdaplscm.so till now.
>
> Yogeshwar
> p.s:- As per ur suggestion, i am using now OFED1.2 & MVAPICH2-1.0-beta

Thanks.

DK

From yogyas at gmail.com Mon Aug 20 03:00:36 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Mon Aug 20 03:01:03 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: <200708191352.l7JDqsYS016395@xi.cse.ohio-state.edu>
References:
 <200708191352.l7JDqsYS016395@xi.cse.ohio-state.edu>
Message-ID:

On 8/19/07, Dhabaleswar Panda wrote:
> Hi,
>
> > I have tested mvapich2-0.9.8p2 with OFED1.1-gen2. Its working fine.
> > I have tested mvapich2-0.9.8p2 with OFED1.2-gen2. Its also working fine.
> > I have tested mvapich2-1.0-beta with OFED1.2-gen2. Its also working fine.
> > But again, mvapich2-1.0-beta with OFED1.2-uDAPL(OpenIB-cma), its
> > failing with error:-
> >
> > [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> > [rdma_udapl_priv.c:837] error(-2147287038): Could not create EP
> > rank 1 in job 1 in06_32868 caused collective abort of all ranks
> > exit status of rank 1: return code 1
> > rank 0 in job 1 in06_32868 caused collective abort of all ranks
> > exit status of rank 0: return code 1
> >
> > Till now, i was not using the MVAPICH2 which is shipped with OFED1.2,
> > I was compiling MVAPICH2 externally. But then i tried
> > MVAPICH2-0.9.8-12 which is shipped with OFED1.2 & its working fine.
> > Only when i compile MVAPICH2 externally, the problem seems to be
> > occurring (i think)? Any comments ?
>
> Thanks for letting us know the status of your testing for different
> MVAPICH2 versions. Good to know that MVAPICH2-0.9.8-12 shipped with
> OFED 1.2 is working fine for you. I am assuming this is working with
> the uDAPL interface.
>

Yes, it is working with the uDAPL interface.

> We will take a look at it why there is a compilation problem when you
> use MVAPICH2 externally and get back to you soon.
>

A slight correction here: the problem is not during compilation.
Compilation is clean. The error occurs at run time. I am running a
simple MPI application calling MPI_Init() & MPI_Finalize() only.

> > What is "-12" from MVAPICH2-0.9.8-12 ? Some patched version?
>
> As you might have noticed, OFED releases typically go through multiple
> RC versions. As these testings continue and problems come up, we have
> been updating MVAPICH2-0.9.8 with corresponding fixes and different
> suffixes (-1, -2, ..., -12).
>
> From MPI code perspective, MVAPICH2-0.9.8-12 is equivalent to the
> latest MVAPICH2-0.9.8p3 (available from mvapich web site). The OFED
> version has additional stuff for building it in an integrated manner
> with other components.
>

Thanks for the information,
Yogeshwar

> > I have not tried the things with libdaplscm.so till now.
> >
> > Yogeshwar
> > p.s:- As per ur suggestion, i am using now OFED1.2 & MVAPICH2-1.0-beta
>
> Thanks.
>
> DK
>
> > On 8/18/07, LEI CHAI wrote:
> > > Hi,
> > >
> > > Could you try the following things:
> > >
> > > Could you check if gen2 works fine. You can use make.mvapich2.ofa script to build mvapich2 with gen2.
> > >
> > > After making sure gen2 works, if you want to upgrade your OFED version to 1.2 or the latest 1.2.5 release, we are sure mvapich2 will work with OpenIB-cma provider. (recommended)
> > >
> > > If you want to stick to OFED1.1, could you use libdaplscm.so instead of libdaplcma.so in your /etc/dat.conf file.
> > >
> > > And finally (not related to this problem), please try our latest MVAPICH2-1.0 beta release if you are interested :-)
> > >
> > > Lei
> > >
> > >
> > > ----- Original Message -----
> > > From: yogeshwar sonawane
> > > Date: Friday, August 17, 2007 5:56 am
> > > Subject: Re: [mvapich-discuss] Query related to running MVAPICH over OFED
> > >
> > > > Yes, the uDAPL-level tests with OFED 1.1-uDAPL installation are
> > > > working fine.
> > > > I am able to create EPs, transfer data, etc.
> > > > But with MPI, i am getting this error.
> > > >
> > > > I am using MVAPICH2-0.9.8p2 with OpenIB-cma as uDAPL provider name.
> > > > I have OFED1.1 installed.
> > > >
> > > > On 8/16/07, Dhabaleswar Panda wrote:
> > > > > On your system, are you able to run basic uDAPL-level tests with
> > > > > OFED 1.1-uDAPL installation?

From chai.15 at osu.edu Mon Aug 20 13:22:30 2007
From: chai.15 at osu.edu (LEI CHAI)
Date: Mon Aug 20 13:22:57 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
Message-ID: <3abbb23acd57.3acd573abbb2@osu.edu>

Hi,

We have tested mvapich2 with ofed 1.2, and it's working fine. Could you try one thing: use an environment variable to change the dapl_provider name:

$ mpiexec -n 2 -env MV2_DAPL_PROVIDER OpenIB-cma ./aout

Or when you compile mvapich2 you can specify OpenIB-cma as the default dapl_provider.
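
The run-time selection above can be sketched as a small shell helper. This is only a minimal sketch under stated assumptions: it assumes a POSIX shell and that the provider name is the first field of each non-comment line in /etc/dat.conf. The function names list_dapl_providers and provider_cmd are hypothetical and not part of MVAPICH2; the mpiexec line itself is the one quoted above.

```shell
#!/bin/sh
# Print the provider names registered in a DAT configuration file.
# Assumption: the provider name is the first field of each non-comment line.
list_dapl_providers() {
    conf="${1:-/etc/dat.conf}"
    [ -r "$conf" ] || { echo "cannot read $conf" >&2; return 1; }
    awk 'NF > 0 && $1 !~ /^#/ { print $1 }' "$conf"
}

# Print (do not run) the mpiexec command from the message above.
# Arguments: provider name, process count, program path.
provider_cmd() {
    printf 'mpiexec -n %s -env MV2_DAPL_PROVIDER %s %s\n' "$2" "$1" "$3"
}

# Example:
#   list_dapl_providers /etc/dat.conf
#   provider_cmd OpenIB-cma 2 ./aout
```

Listing the registered names first makes it easier to spot a mismatch between the name passed in MV2_DAPL_PROVIDER and what the DAT registry actually contains.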

thanks,
Lei

From rcbord at wm.edu Mon Aug 20 17:00:03 2007
From: rcbord at wm.edu (rcbord@wm.edu)
Date: Mon Aug 20 17:45:08 2007
Subject: [mvapich-discuss] ofed config
Message-ID:

Hi all,
We are trying to determine the best way to install ofed-1.2
(infiniband drivers) for our cluster to make the mvapich/mvapich2
installation easier.
Any suggestions?

Thanks in advance!

Chris Bording
Application Analyst
High Performance Computing Group
Information Technology
The College of William and Mary
(757)-221-3488
rcbord@wm.edu

From rowland at cse.ohio-state.edu Mon Aug 20 19:40:09 2007
From: rowland at cse.ohio-state.edu (Shaun Rowland)
Date: Mon Aug 20 19:40:33 2007
Subject: [mvapich-discuss] ofed config
In-Reply-To:
References:
Message-ID: <46CA2659.7080400@cse.ohio-state.edu>

rcbord@wm.edu wrote:
> Hi all,
> We are trying to determine the best way to install ofed-1.2
> (infiniband drivers) for our cluster to make the mvapich/mvapich2
> installation easier.
> Any suggestions?

The easiest thing to do is select the MVAPICH/MVAPICH2 packages that
come with OFED 1.2. Those are the current releases, and that will cause
the OFED 1.2 install to make sure all the packages required are
selected, even if you don't select them explicitly. Otherwise, you
should make sure you have the following packages selected for
installation:

MVAPICH
-------
libibverbs
libibverbs-devel
libibumad
libibcommon

MVAPICH2 (OFA Build - make.mvapich2.ofa)
----------------------------------------
libibverbs
libibverbs-devel
libibumad
libibumad-devel
librdmacm
librdmacm-devel
libibcommon
libibcommon-devel

MVAPICH2 (uDAPL Build - make.mvapich2.udapl)
--------------------------------------------
dapl
dapl-devel
libibverbs
librdmacm

I think including those would include the right OFA base packages too.
It's hard to tell from the install script mechanism actually. It's
really complicated. It might be easier to just select to install
everything, but regardless, the easiest thing to do I think is just
select the MVAPICH/MVAPICH2 packages that are part of OFED 1.2.
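
The package lists above can also be checked mechanically on a node. What follows is a minimal sketch, assuming an RPM-based system where `rpm -q <name>` exits non-zero for a package that is not installed. The function names and the PKG_QUERY override are hypothetical helpers for illustration, not part of OFED or MVAPICH.

```shell
#!/bin/sh
# Package names copied from the lists above, keyed by build type.
pkgs_for_build() {
    case "$1" in
        mvapich)
            echo "libibverbs libibverbs-devel libibumad libibcommon" ;;
        mvapich2-ofa)
            echo "libibverbs libibverbs-devel libibumad libibumad-devel librdmacm librdmacm-devel libibcommon libibcommon-devel" ;;
        mvapich2-udapl)
            echo "dapl dapl-devel libibverbs librdmacm" ;;
        *)
            echo "unknown build type: $1" >&2
            return 1 ;;
    esac
}

# Print each listed package that the query command reports as not installed.
# PKG_QUERY is a hypothetical override (handy for dry runs); default is "rpm -q".
missing_pkgs() {
    list=$(pkgs_for_build "$1") || return 1
    for p in $list; do
        ${PKG_QUERY:-rpm -q} "$p" >/dev/null 2>&1 || echo "$p"
    done
}

# Example: missing_pkgs mvapich2-udapl
```

An empty result from `missing_pkgs mvapich2-udapl` would mean all four packages listed for the uDAPL build are present.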
You can build your own later or remove the RPMs installed later if you
want.
--
Shaun Rowland rowland@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~rowland/

From yogyas at gmail.com Tue Aug 21 00:47:18 2007
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Tue Aug 21 00:47:45 2007
Subject: [mvapich-discuss] Query related to running MVAPICH over OFED
In-Reply-To: <3abbb23acd57.3acd573abbb2@osu.edu>
References: <3abbb23acd57.3acd573abbb2@osu.edu>
Message-ID:

Hi,

With the "MV2_DAPL_PROVIDER=OpenIB-cma" environment variable specified
as per your guidance, things are working now.
Thanks Mr. Panda & Lei Chai, for all the help & suggestions.

One piece of feedback:
Throughout this discussion and these trials, I was using OpenIB-cma as
the default dapl provider during compilation of MVAPICH2. When I was
getting the error, OpenIB-cma was likewise specified as the default dapl
provider at compile time. But specifying MV2_DAPL_PROVIDER at run time
has solved that problem.

I tried ib0 as the default dapl provider during compilation, and then
I also got an error: Cannot Open IA. But again, specifying
"-env MV2_DAPL_PROVIDER OpenIB-cma" at run time makes things work.
I think this variable overrides the provider chosen during compilation.

This info may be useful for you.

Yogeshwar

On 8/20/07, LEI CHAI wrote:
> Hi,
>
> We have tested mvapich2 with ofed 1.2, and it's working fine. Could you try one thing, use environment variable to change the dapl_provider name:
>
> $ mpiexec -n 2 -env MV2_DAPL_PROVIDER OpenIB-cma ./aout
>
> Or when you compile mvapich2 you can specify OpenIB-cma as the default dapl_provider.
>
> thanks,
> Lei
>
>
> ----- Original Message -----
> From: yogeshwar sonawane
> Date: Monday, August 20, 2007 0:00 am
> S