From koop at cse.ohio-state.edu Wed Apr 2 16:30:41 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed Apr 2 16:30:46 2008
Subject: Fwd: [mvapich-discuss] Very bad latency scaling on bisectional bandwidth test with OFED 1.3/MVAPICH 1.0.0
In-Reply-To:
Message-ID:

Chris,

> > VIADEV_SRQ_SIZE=1024
>
> Excellent. That fixes it. The problem seems to kick in around
> 125-132 nodes (running 1 process per node)... it's unstable in that
> range... you may or may not see the issue in multiple runs at node
> counts in that range. Higher node counts repeatably exhibit the
> problem, and setting the above fixes it.
>
> Can you explain more of how this only affects "short running" benchmarks?

The new method in 1.0 increases the number of buffers available for
communication dynamically. By default only a small number of buffers
is allocated (256) -- in 0.9.9 the default was 512. However, unlike
0.9.9, 1.0 will dynamically increase the number of buffers when the
shared receive queue runs out of them. Whenever this re-size occurs you
will see decreased performance. In practice this re-size should not
happen more than a couple of times per process during a job -- so a job
running for a long time should see minimal impact due to re-sizing.

> Also, what is the cutoff for short running (what would I do to the
> benchmark to make this not happen)?

Given the current default, if a process receives more than 226 messages
at a time (from all other hosts combined) a re-size event is likely. It
will depend on application characteristics. If the benchmark did a few
warm-up rounds I'd expect the behavior to be more "normal."

> Extra credit: why doesn't my benchmark show full bisectional bandwidth
> given a fat tree switch (this is not the fault of MVAPICH... but I'd
> like to know what I'm doing wrong)?

There's nothing really that you're doing wrong here. This is a known
problem with the static routing of InfiniBand.
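To make the static-routing point concrete, here is a toy sketch in C. It
is not MVAPICH or opensm code: the two-level fat tree, the constant K,
the function names, and the rule "a leaf switch forwards destination d
on uplink d % K" are all assumptions invented for the illustration. The
only thing it relies on is that a switch picks the output port from the
destination alone, so two flows leaving the same leaf switch can be
pinned to the same uplink even while the other uplinks sit idle.

```c
#include <assert.h>

/* Hypothetical topology: K leaf switches, K hosts per leaf, K uplinks per leaf. */
#define K 4
#define N (K * K)

/* Assumed static routing rule for this sketch only: the uplink is chosen
 * from the destination host number, never from the source or the load. */
static int uplink_for(int dst) { return dst % K; }

/* dst[i] is the host that host i sends to.  Returns the largest number of
 * flows that end up sharing a single uplink on any leaf switch. */
int max_uplink_load(const int dst[N]) {
    int load[K][K] = {{0}};   /* load[leaf][uplink] */
    int worst = 0;
    for (int src = 0; src < N; src++) {
        assert(dst[src] >= 0 && dst[src] < N);
        int leaf = src / K;
        if (dst[src] / K == leaf)
            continue;         /* traffic stays inside the leaf switch */
        int flows = ++load[leaf][uplink_for(dst[src])];
        if (flows > worst)
            worst = flows;
    }
    return worst;
}

/* Every host sends to the host s positions further on (circularly). */
void shift_pattern(int dst[N], int s) {
    for (int i = 0; i < N; i++)
        dst[i] = (i + s) % N;
}

/* Host j on leaf l sends to host l on leaf j. */
void transpose_pattern(int dst[N]) {
    for (int i = 0; i < N; i++)
        dst[i] = (i % K) * K + i / K;
}
```

Under these assumptions a shift by K keeps every uplink at one flow,
while the transpose pattern puts K - 1 flows on a single uplink of each
leaf, so those senders share one link's bandwidth. A real subnet manager
fills its forwarding tables differently, so which patterns collide on a
real fabric will differ.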
We have a paper on this topic:

http://nowlab.cse.ohio-state.edu/publications/conf-papers/2007/vishnu-ccgrid07.pdf

One workaround is to enable the LMC mechanism of InfiniBand, which
increases the number of paths between pairs of LIDs on the network. We
have enhanced MVAPICH2 to take advantage of these additional paths (see
the user guide for additional details).

Let us know if you have any additional questions and if this helps out,

Matt

> Thanks!
>
> Chris
>
>
> >
> > e.g.
> > mpirun_rsh -np X ... VIADEV_SRQ_SIZE=1024 ./exec
> >
> > And see if performance goes back to 0.9.9 levels?
> >
> > Thanks,
> > Matt
> >
> >
> > On Fri, 28 Mar 2008, Chris Worley wrote:
> >
> > >
> > > I just upgraded to OFED 1.3 and MVAPICH 1.0.0 (from OFED 1.2.5.5 and
> > > MVAPICH 0.9.9). I'm using ConnectX cards. IB diagnostics show no
> > > fabric issues.
> > >
> > > I have a test for bisectional bandwidth and latency; the latency test
> > > is showing very poor worst-case results repeatably as the node count
> > > goes over ~100. Other MPI implementations (that will remain nameless
> > > as they normally don't perform as well as MVAPICH) don't have this
> > > issue... so I don't think it's strictly an OFED 1.3 issue.
> > >
> > > What I'm seeing shows worst-case latency (in the msecs!), for all
> > > nodes.
> > > Here's a sample of the current results (only testing ~120
> > > nodes), all times in usecs:
> > >
> > > C-25-38: worst=1038.127000 (C-27-06,C-27-37), best=2.688000 (C-25-41,C-25-35), avg=31.816815
> > > C-25-29: worst=1037.159000 (C-26-42,C-27-28), best=2.694000 (C-25-32,C-25-26), avg=31.645870
> > > C-25-26: worst=1038.052000 (C-26-39,C-27-25), best=2.695000 (C-25-29,C-25-23), avg=31.757562
> > > C-25-41: worst=1037.349000 (C-27-09,C-27-40), best=2.695000 (C-25-44,C-25-38), avg=31.776089
> > > C-25-17: worst=1036.924000 (C-26-30,C-27-16), best=2.697000 (C-25-20,C-25-14), avg=31.664692
> > > C-26-05: worst=1037.973000 (C-27-18,C-27-54), best=2.697000 (C-26-08,C-26-02), avg=31.809110
> > > C-26-17: worst=1038.095000 (C-27-30,C-25-04), best=2.703000 (C-26-20,C-26-14), avg=31.774685
> > > C-26-20: worst=1038.225000 (C-27-33,C-25-07), best=2.704000 (C-26-23,C-26-17), avg=30.208007
> > > C-25-14: worst=1037.357000 (C-26-27,C-27-13), best=2.705000 (C-25-17,C-25-11), avg=31.576705
> > > C-26-08: worst=1038.058000 (C-27-21,C-27-57), best=2.705000 (C-26-11,C-26-05), avg=31.639445
> > > C-27-20: worst=1037.043000 (C-25-21,C-26-07), best=2.706000 (C-27-23,C-27-17), avg=31.819363
> > > C-27-23: worst=1037.909000 (C-25-24,C-26-10), best=2.706000 (C-27-26,C-27-20), avg=31.714664
> > > C-26-32: worst=1037.963000 (C-27-45,C-25-19), best=2.707000 (C-26-35,C-26-29), avg=31.694966
> > > C-26-41: worst=1037.817000 (C-27-59,C-25-28), best=2.707000 (C-26-44,C-26-38), avg=31.674466
> > > C-27-08: worst=1037.036000 (C-25-09,C-25-40), best=2.708000 (C-27-11,C-27-05), avg=31.781712
> > > C-26-29: worst=1038.093000 (C-27-42,C-25-16), best=2.709000 (C-26-32,C-26-26), avg=31.812582
> > > C-27-11: worst=1038.136000 (C-25-12,C-25-43), best=2.710000 (C-27-14,C-27-08), avg=31.653336
> > > C-26-44: worst=1038.070000 (C-27-62,C-25-31), best=2.711000 (C-27-02,C-26-41), avg=31.666521
> > > C-27-32: worst=1037.563000 (C-25-33,C-26-19), best=2.712000 (C-27-35,C-27-29), avg=31.778705
> > > C-25-44: worst=1036.881000 (C-27-12,C-27-43), best=2.713000 (C-26-02,C-25-41), avg=31.752103
> > >
> > > While other MPI implementations running under OFED 1.3 on the same
> > > node set show more stability:
> > >
> > > C-25-30: worst=35.822153 (C-27-39,C-27-19), best=3.398895 (C-25-10,C-26-34), avg=11.128738
> > > C-26-11: worst=35.799026 (C-27-22,C-26-39), best=3.398180 (C-25-14,C-27-05), avg=10.864981
> > > C-27-36: worst=35.802126 (C-26-02,C-25-11), best=3.396034 (C-26-34,C-27-38), avg=10.694269
> > > C-25-16: worst=35.804033 (C-27-51,C-27-26), best=3.391981 (C-25-20,C-26-27), avg=11.112461
> > > C-25-10: worst=35.800934 (C-27-37,C-27-53), best=3.388882 (C-26-23,C-25-30), avg=10.956828
> > > C-27-08: worst=35.817862 (C-26-30,C-25-25), best=3.388882 (C-26-06,C-27-54), avg=10.765788
> > > C-27-11: worst=35.810947 (C-25-31,C-26-15), best=3.386974 (C-27-56,C-26-23), avg=11.048172
> > > C-26-10: worst=35.799026 (C-27-20,C-26-40), best=3.386974 (C-25-01,C-27-04), avg=10.720228
> > > C-26-31: worst=37.193060 (C-25-33,C-26-43), best=3.386021 (C-27-27,C-25-20), avg=10.809506
> > > C-26-23: worst=35.791874 (C-26-35,C-27-17), best=3.386021 (C-27-11,C-25-10), avg=10.747612
> > > C-25-20: worst=35.769939 (C-27-28,C-27-61), best=3.385782 (C-26-31,C-25-16), avg=11.007588
> > > C-27-37: worst=35.790205 (C-26-03,C-25-10), best=3.385067 (C-26-35,C-27-39), avg=10.998216
> > > C-27-54: worst=35.786867 (C-25-29,C-25-19), best=3.384829 (C-27-08,C-27-35), avg=11.109948
> > > C-27-43: worst=35.766125 (C-25-14,C-25-44), best=3.382921 (C-27-34,C-26-39), avg=11.048978
> > > C-26-13: worst=35.825968 (C-27-18,C-26-04), best=3.382921 (C-25-28,C-27-10), avg=11.002756
> > > C-25-44: worst=35.761833 (C-27-43,C-26-20), best=3.382921 (C-25-18,C-27-01), avg=10.673819
> > > C-25-19: worst=35.789013 (C-27-54,C-27-06), best=3.382206 (C-25-25,C-26-05), avg=10.877252
> > > C-27-56: worst=37.191868 (C-26-43,C-27-59), best=3.381014 (C-27-24,C-27-11), avg=10.994493
> > > C-26-24: worst=35.806894 (C-26-42,C-27-24), best=3.381014 (C-27-03,C-25-12), avg=10.798348
> > >
> > > Previously, MVAPICH 0.9.9 (with OFED 1.2.5.5) showed stability (this
> > > example was on a ~280 node test):
> > >
> > > C-25-35: worst=34.869000 (C-21-36,C-21-20), best=2.208000 (C-21-28,C-21-28), avg=5.739774
> > > C-21-28: worst=34.914000 (C-25-43,C-25-27), best=2.210000 (C-25-35,C-25-35), avg=5.692484
> > > C-27-63: worst=23.946000 (C-25-13,C-23-44), best=3.123000 (C-21-01,C-27-61), avg=5.673201
> > > C-25-41: worst=34.792000 (C-21-42,C-21-26), best=3.177000 (C-25-43,C-25-39), avg=5.715597
> > > C-25-45: worst=34.944000 (C-22-01,C-21-30), best=3.182000 (C-26-02,C-25-43), avg=5.734901
> > > C-25-37: worst=34.829000 (C-21-38,C-21-22), best=3.183000 (C-25-39,C-25-35), avg=5.715226
> > > C-25-33: worst=34.900000 (C-21-34,C-21-18), best=3.185000 (C-25-35,C-25-31), avg=5.726198
> > > C-25-39: worst=34.939000 (C-21-40,C-21-24), best=3.185000 (C-25-45,C-25-37), avg=5.760403
> > > C-25-43: worst=34.913000 (C-21-44,C-21-28), best=3.185000 (C-25-45,C-25-41), avg=5.774223
> > > C-25-25: worst=34.834000 (C-21-26,C-21-10), best=3.187000 (C-25-27,C-25-23), avg=5.701191
> > > C-25-29: worst=34.935000 (C-21-30,C-21-14), best=3.188000 (C-25-31,C-25-27), avg=5.732307
> > > C-25-17: worst=34.905000 (C-21-18,C-21-02), best=3.193000 (C-25-19,C-25-15), avg=5.712527
> > > C-25-09: worst=34.839000 (C-21-10,C-27-58), best=3.195000 (C-25-11,C-25-07), avg=5.719269
> > > C-25-21: worst=34.826000 (C-21-22,C-21-06), best=3.195000 (C-25-23,C-25-19), avg=5.709025
> > >
> > > The bandwidth portion of the test seems unaffected; I expect ~1.5GB/s
> > > best, ~300MB/s worst, and ~600MB/s average. Here's a sample from the
> > > MVAPICH 1.0.0 test, values in MB/s, with about 120 nodes in the test:
> > >
> > > C-27-28: worst=311.533788 (C-26-43,C-25-01), best=1528.480740 (C-27-27,C-27-29), avg=587.437197
> > > C-27-27: worst=304.659190 (C-26-42,C-27-62), best=1528.480740 (C-27-26,C-27-28), avg=578.232542
> > > C-27-26: worst=305.106860 (C-26-43,C-27-59), best=1528.480740 (C-27-25,C-27-27), avg=586.553406
> > > C-27-25: worst=305.004799 (C-26-42,C-27-58), best=1528.369347 (C-27-24,C-27-26), avg=607.532908
> > > C-27-22: worst=305.069134 (C-26-39,C-27-55), best=1528.369347 (C-27-21,C-27-23), avg=581.500478
> > > C-27-21: worst=387.081961 (C-27-33,C-27-09), best=1528.369347 (C-27-20,C-27-22), avg=586.030643
> > > C-27-24: worst=312.083157 (C-26-39,C-27-59), best=1528.257970 (C-27-23,C-27-25), avg=586.508800
> > > C-27-19: worst=389.750871 (C-26-21,C-25-05), best=1528.202288 (C-27-18,C-27-20), avg=610.889360
> > > C-27-23: worst=305.124616 (C-26-40,C-27-56), best=1528.146610 (C-27-22,C-27-24), avg=592.528629
> > > C-27-20: worst=375.241912 (C-26-23,C-25-05), best=1528.146610 (C-27-19,C-27-21), avg=587.685815
> > > C-27-18: worst=381.980984 (C-27-30,C-27-06), best=1528.035265 (C-27-17,C-27-19), avg=617.072824
> > > C-25-34: worst=367.293139 (C-27-34,C-27-01), best=1527.534416 (C-25-33,C-25-35), avg=587.728400
> > >
> > > Previous tests (280 nodes, in this case), look about the same...
> > > here's a sample:
> > >
> > > C-27-62: worst=324.551124 (C-25-28,C-23-27), best=1528.759294 (C-25-05,C-25-05), avg=513.596860
> > > C-25-05: worst=286.193170 (C-27-20,C-21-35), best=1528.480740 (C-27-62,C-27-62), avg=531.348528
> > > C-26-06: worst=309.232357 (C-22-17,C-21-26), best=1527.200699 (C-26-02,C-26-10), avg=526.550142
> > > C-26-02: worst=308.702059 (C-22-13,C-21-22), best=1527.089492 (C-25-43,C-26-06), avg=517.635786
> > > C-26-10: worst=303.692998 (C-27-26,C-23-39), best=1527.033895 (C-26-06,C-26-14), avg=526.867299
> > > C-26-14: worst=302.966896 (C-27-29,C-23-44), best=1526.255959 (C-26-10,C-26-18), avg=521.675016
> > > C-22-09: worst=353.618467 (C-25-12,C-27-20), best=1526.144890 (C-26-16,C-26-16), avg=564.268143
> > > C-26-26: worst=311.455134 (C-27-18,C-25-34), best=1526.089361 (C-26-22,C-26-30), avg=547.530324
> > > C-26-18: worst=304.969316 (C-27-36,C-23-45), best=1526.089361 (C-26-14,C-26-22), avg=515.463309
> > > C-26-22: worst=302.152809 (C-25-02,C-27-42), best=1526.033837 (C-26-18,C-26-26), avg=529.672528
> > > C-26-13: worst=299.106027 (C-27-28,C-23-43), best=1525.978316 (C-26-12,C-26-14), avg=549.159212
> > > C-21-08: worst=292.192329 (C-25-11,C-25-19), best=1525.978316 (C-25-15,C-25-15), avg=539.609085
> > > C-25-15: worst=270.248064 (C-21-04,C-21-12), best=1525.922800 (C-21-08,C-21-08), avg=523.654988
> > > C-26-16: worst=293.111198 (C-21-30,C-22-33), best=1525.867288 (C-22-09,C-22-09), avg=488.649320
> > > C-26-11: worst=300.619544 (C-27-29,C-23-38), best=1525.811779 (C-26-10,C-26-12), avg=496.530124
> > > C-25-24: worst=317.670885 (C-22-08,C-27-40), best=1525.756275 (C-21-17,C-21-17), avg=510.542500
> > >
> > > The test itself may be to blame. The test is run w/ one rank per
> > > node.
> > > The idea is to get a bisectional test where each node is
> > > exclusively sending to another node, and exclusively receiving from
> > > another node, with all nodes sending/receiving simultaneously; note
> > > that the sender and receiver will most likely be different in the
> > > sendrecv call, which allows testing bisectional bandwidth for an odd
> > > number of nodes. The latency test sends/receives zero bytes 1000
> > > times, the bandwidth test sends 4MB 10 times. Iteratively, all ranks
> > > will eventually send and receive to/from all other ranks, but all
> > > send/recv combinations will not be completely enumerated (where
> > > nodes>2).
> > >
> > > While you'd expect a fat-tree switch to get full bisectional
> > > bandwidth, it never does; a problem w/ a static subnet manager
> > > (opensm). Given that the average is ~1/3 the best bandwidth, I
> > > interpret that to mean that on average a rank is being blocked by two
> > > other ranks. The worst case shows roughly 5 or 6 ranks blocking each
> > > other.
> > >
> > > The routine goes through a "for" loop starting at the current rank:
> > > send-ranks are decreasing and recv-ranks are increasing (both
> > > circularly) for each iteration until you get back to the current rank.
> > > The core of the routine looks like:
> > >
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_size(MPI_COMM_WORLD, &wsize);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &me);
> > >     for(hi = (me == 0) ? wsize - 1 : me - 1,
> > >         lo = (me + 1 == wsize) ? 0 : me + 1;
> > >         hi != me;
> > >         hi = (hi == 0) ? wsize - 1 : hi - 1,
> > >         lo = (lo + 1 == wsize) ? 0 : lo + 1) {
> > >
> > >         MPI_Barrier(MPI_COMM_WORLD);
> > >
> > >         start = MPI_Wtime();
> > >
> > >         for ( i = 0; i < iters; i++ ) {
> > >             MPI_Sendrecv(&comBuf, size, MPI_CHAR, lo, 0, &comBuf, size,
> > >                          MPI_CHAR, hi, 0, MPI_COMM_WORLD, &stat);
> > >         }
> > >         diff = MPI_Wtime() - start;
> > >         sum += diff;
> > >         n++;
> > >         if (diff < min) {
> > >             minnode1 = lo;
> > >             minnode2 = hi;
> > >             min = diff;
> > >         }
> > >         if (diff > max) {
> > >             maxnode1 = lo;
> > >             maxnode2 = hi;
> > >             max = diff;
> > >         }
> > >     }
> > >
> > > At the end of the test, the best, worst, and average cases are
> > > reported for each node/rank, along with the node names associated with
> > > that best/worst event. So, if there is an issue with a node, you'd
> > > expect that node to show up in multiple reports, as a single reported
> > > event only narrows the culprit down to two.
> > >
> > > Any ideas would be appreciated.
> > >
> > > Chris
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss@cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > >
> >
>

From christopher.tanner at gatech.edu Thu Apr 3 16:20:21 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Thu Apr 3 16:20:30 2008
Subject: [mvapich-discuss] Is MVAPICH working?
Message-ID:

Hello all -

This is my first post to this group and I am a newbie to MVAPICH.
Our cluster has both gigabit ethernet and Infiniband connections.
Since I was unfamiliar with Infiniband, I simply used MPICH2 with good
success. However, we paid for the Infiniband connections, so I want to
use them. So I downloaded and compiled the MVAPICH2 1.0.1 source.

I have a couple issues:
a) I don't know what kind of Infiniband libraries we have (i.e.
VAPI, uDAPL, Gen2-IB, iWARP). In fact, I don't know what any of those
are... Does it really matter? I just did the usual 'configure' and
'make' with what seemed to be no critical errors during compilation.
All of the included make.mvapich2.* scripts did not work for one reason
or another.

b) MVAPICH2 uses an mpd just like MPICH2. Is there a way I can tell if
the mpiexec compiled from the MVAPICH2 source is really using the
Infiniband links instead of the ethernet links? I haven't been able to
see a big speed difference in the MPI applications I executed
recently. Thus far everything runs the same as when MPICH2 was
installed...

Thanks guys and sorry for the really newbie questions...

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner@gatech.edu
-------------------------------------------

From koop at cse.ohio-state.edu Thu Apr 3 17:08:44 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Thu Apr 3 17:08:50 2008
Subject: [mvapich-discuss] Is MVAPICH working?
In-Reply-To:
Message-ID:

Chris,

Have you (or your vendor) installed any InfiniBand libraries on the
machines yet? You will need to have a library like OpenFabrics that
interfaces with InfiniBand installed on the cluster. You can download the
latest OpenFabrics packages at:

http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3.tgz

This is the "Gen2" interface. You will need to use the make.mvapich.gen2
script to compile -- directly using configure and make will result in a
TCP build only. OpenFabrics Enterprise Distribution (OFED) includes the
MVAPICH2 package as well, which may help simplify the process for you.

To verify that InfiniBand is being used, you can use the included OSU
benchmarks in the 'osu_benchmarks' directory to test performance. You can
expect <5usec latency when using IB.

Let us know if you get stuck anywhere.
Thanks,

Matt

On Thu, 3 Apr 2008, Christopher Tanner wrote:

> Hello all -
>
> This is my first post to this group and I am a newbie to MVAPICH.
> Our cluster has both gigabit ethernet and Infiniband connections.
> Since I was unfamiliar with Infiniband, I simply used MPICH2 with good
> success. However, we paid for the Infiniband connections, so I want to
> use them. So I downloaded and compiled the MVAPICH2 1.0.1 source.
>
> I have a couple issues:
> a) I don't know what kind of Infiniband libraries we have (i.e. VAPI,
> uDAPL, Gen2-IB, iWARP). In fact, I don't know what any of those are...
> Does it really matter? I just did the usual 'configure' and 'make'
> with what seemed to be no critical errors during compilation. All of
> the included make.mvapich2.* scripts did not work for one reason or
> another.
>
> b) MVAPICH2 uses an mpd just like MPICH2. Is there a way I can tell if
> the mpiexec compiled from the MVAPICH2 source is really using the
> Infiniband links instead of the ethernet links? I haven't been able to
> see a big speed difference in the MPI applications I executed
> recently. Thus far everything runs the same as when MPICH2 was
> installed...
>
> Thanks guys and sorry for the really newbie questions...
>
> -------------------------------------------
> Chris Tanner
> Space Systems Design Lab
> Georgia Institute of Technology
> christopher.tanner@gatech.edu
> -------------------------------------------
>

From panda at cse.ohio-state.edu Thu Apr 3 22:53:44 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Apr 3 22:53:52 2008
Subject: [mvapich-discuss] Is MVAPICH working?
In-Reply-To:
Message-ID:

I just want to add the following thing to Matt's reply. Please feel free
to refer to the detailed MVAPICH2 user guide.
It will answer many of your questions with respect to installation,
usage, benchmarking, etc. The User Guide is available from the MVAPICH
web page under `support'.

DK

On Thu, 3 Apr 2008, Matthew Koop wrote:

> Chris,
>
> Have you (or your vendor) installed any InfiniBand libraries on the
> machines yet? You will need to have a library like OpenFabrics that
> interfaces with InfiniBand installed on the cluster. You can download the
> latest OpenFabrics packages at:
>
> http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3.tgz
>
> This is the "Gen2" interface. You will need to use the make.mvapich.gen2
> script to compile -- directly using configure and make will result in a
> TCP build only. OpenFabrics Enterprise Distribution (OFED) includes the
> MVAPICH2 package as well, which may help simplify the process for you.
>
> To verify that InfiniBand is being used, you can use the included OSU
> benchmarks in the 'osu_benchmarks' directory to test performance. You can
> expect <5usec latency when using IB.
>
> Let us know if you get stuck anywhere. Thanks,
>
> Matt
>
>
> On Thu, 3 Apr 2008, Christopher Tanner wrote:
>
> > Hello all -
> >
> > This is my first post to this group and I am a newbie to MVAPICH.
> > Our cluster has both gigabit ethernet and Infiniband connections.
> > Since I was unfamiliar with Infiniband, I simply used MPICH2 with good
> > success. However, we paid for the Infiniband connections, so I want to
> > use them. So I downloaded and compiled the MVAPICH2 1.0.1 source.
> >
> > I have a couple issues:
> > a) I don't know what kind of Infiniband libraries we have (i.e. VAPI,
> > uDAPL, Gen2-IB, iWARP). In fact, I don't know what any of those are...
> > Does it really matter? I just did the usual 'configure' and 'make'
> > with what seemed to be no critical errors during compilation. All of
> > the included make.mvapich2.* scripts did not work for one reason or
> > another.
> >
> > b) MVAPICH2 uses an mpd just like MPICH2.
> > Is there a way I can tell if
> > the mpiexec compiled from the MVAPICH2 source is really using the
> > Infiniband links instead of the ethernet links? I haven't been able to
> > see a big speed difference in the MPI applications I executed
> > recently. Thus far everything runs the same as when MPICH2 was
> > installed...
> >
> > Thanks guys and sorry for the really newbie questions...
> >
> > -------------------------------------------
> > Chris Tanner
> > Space Systems Design Lab
> > Georgia Institute of Technology
> > christopher.tanner@gatech.edu
> > -------------------------------------------
> >
>

From christopher.tanner at gatech.edu Fri Apr 4 11:41:44 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Fri Apr 4 11:41:53 2008
Subject: [mvapich-discuss] Is MVAPICH working?
In-Reply-To:
References:
Message-ID: <2F8D1C69-70D5-4DD1-A082-323E11FDB921@gatech.edu>

Thanks Matt -

Upon trying to install OFED, I received the following error
'Failed to install ibutils RPM'
Apparently this is b/c it couldn't find the dependency
'libstdc++.so.6(GLIBCXX_3.4.9)(64bit)'

I searched for an RPM for this, but I couldn't find one for my release
(RHEL4). I have a libstdc++.so.6.0.3 file in the /usr/lib
directory, but I'm assuming this isn't what it needs.

Is the ibutils package important? Can I install a library for a
different release (I think I found it for Mandriva)? If not, where can
I find the source to compile it myself for my release?

Thanks again.

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner@gatech.edu
-------------------------------------------

On Apr 3, 2008, at 5:08 PM, Matthew Koop wrote:

> Chris,
>
> Have you (or your vendor) installed any InfiniBand libraries on the
> machines yet? You will need to have a library like OpenFabrics that
> interfaces with InfiniBand installed on the cluster. You can download the
> latest OpenFabrics packages at:
>
> http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3.tgz
>
> This is the "Gen2" interface. You will need to use the make.mvapich.gen2
> script to compile -- directly using configure and make will result in a
> TCP build only. OpenFabrics Enterprise Distribution (OFED) includes the
> MVAPICH2 package as well, which may help simplify the process for you.
>
> To verify that InfiniBand is being used, you can use the included OSU
> benchmarks in the 'osu_benchmarks' directory to test performance. You can
> expect <5usec latency when using IB.
>
> Let us know if you get stuck anywhere. Thanks,
>
> Matt
>
>
> On Thu, 3 Apr 2008, Christopher Tanner wrote:
>
>> Hello all -
>>
>> This is my first post to this group and I am a newbie to MVAPICH.
>> Our cluster has both gigabit ethernet and Infiniband connections.
>> Since I was unfamiliar with Infiniband, I simply used MPICH2 with good
>> success. However, we paid for the Infiniband connections, so I want to
>> use them. So I downloaded and compiled the MVAPICH2 1.0.1 source.
>>
>> I have a couple issues:
>> a) I don't know what kind of Infiniband libraries we have (i.e. VAPI,
>> uDAPL, Gen2-IB, iWARP). In fact, I don't know what any of those are...
>> Does it really matter? I just did the usual 'configure' and 'make'
>> with what seemed to be no critical errors during compilation. All of
>> the included make.mvapich2.* scripts did not work for one reason or
>> another.
>>
>> b) MVAPICH2 uses an mpd just like MPICH2.
>> Is there a way I can tell if
>> the mpiexec compiled from the MVAPICH2 source is really using the
>> Infiniband links instead of the ethernet links? I haven't been able to
>> see a big speed difference in the MPI applications I executed
>> recently. Thus far everything runs the same as when MPICH2 was
>> installed...
>>
>> Thanks guys and sorry for the really newbie questions...
>>
>> -------------------------------------------
>> Chris Tanner
>> Space Systems Design Lab
>> Georgia Institute of Technology
>> christopher.tanner@gatech.edu
>> -------------------------------------------
>>
>

From koop at cse.ohio-state.edu Fri Apr 4 13:16:36 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri Apr 4 13:16:41 2008
Subject: [mvapich-discuss] Is MVAPICH working?
In-Reply-To: <2F8D1C69-70D5-4DD1-A082-323E11FDB921@gatech.edu>
Message-ID:

Chris,

You should be able to skip the ibutils package (don't select it in the
customized view).

Matt

On Fri, 4 Apr 2008, Christopher Tanner wrote:

> Thanks Matt -
>
> Upon trying to install OFED, I received the following error
> 'Failed to install ibutils RPM'
> Apparently this is b/c it couldn't find the dependency
> 'libstdc++.so.6(GLIBCXX_3.4.9)(64bit)'
>
> I searched for an RPM for this, but I couldn't find one for my release
> (RHEL4). I have a libstdc++.so.6.0.3 file in the /usr/lib
> directory, but I'm assuming this isn't what it needs.
>
> Is the ibutils package important? Can I install a library for a
> different release (I think I found it for Mandriva)? If not, where can
> I find the source to compile it myself for my release?
>
> Thanks again.
>
> -------------------------------------------
> Chris Tanner
> Space Systems Design Lab
> Georgia Institute of Technology
> christopher.tanner@gatech.edu
> -------------------------------------------
>
>
>
> On Apr 3, 2008, at 5:08 PM, Matthew Koop wrote:
> > Chris,
> >
> > Have you (or your vendor) installed any InfiniBand libraries on the
> > machines yet? You will need to have a library like OpenFabrics that
> > interfaces with InfiniBand installed on the cluster. You can download the
> > latest OpenFabrics packages at:
> >
> > http://www.openfabrics.org/builds/ofed-1.3/release/OFED-1.3.tgz
> >
> > This is the "Gen2" interface. You will need to use the make.mvapich.gen2
> > script to compile -- directly using configure and make will result in a
> > TCP build only. OpenFabrics Enterprise Distribution (OFED) includes the
> > MVAPICH2 package as well, which may help simplify the process for you.
> >
> > To verify that InfiniBand is being used, you can use the included OSU
> > benchmarks in the 'osu_benchmarks' directory to test performance. You can
> > expect <5usec latency when using IB.
> >
> > Let us know if you get stuck anywhere. Thanks,
> >
> > Matt
> >
> >
> > On Thu, 3 Apr 2008, Christopher Tanner wrote:
> >
> >> Hello all -
> >>
> >> This is my first post to this group and I am a newbie to MVAPICH.
> >> Our cluster has both gigabit ethernet and Infiniband connections.
> >> Since I was unfamiliar with Infiniband, I simply used MPICH2 with good
> >> success. However, we paid for the Infiniband connections, so I want to
> >> use them. So I downloaded and compiled the MVAPICH2 1.0.1 source.
> >>
> >> I have a couple issues:
> >> a) I don't know what kind of Infiniband libraries we have (i.e. VAPI,
> >> uDAPL, Gen2-IB, iWARP). In fact, I don't know what any of those are...
> >> Does it really matter? I just did the usual 'configure' and 'make'
> >> with what seemed to be no critical errors during compilation.
> >> All of the included make.mvapich2.* scripts did not work for one
> >> reason or another.
> >>
> >> b) MVAPICH2 uses an mpd just like MPICH2. Is there a way I can tell if
> >> the mpiexec compiled from the MVAPICH2 source is really using the
> >> Infiniband links instead of the ethernet links? I haven't been able to
> >> see a big speed difference in the MPI applications I executed
> >> recently. Thus far everything runs the same as when MPICH2 was
> >> installed...
> >>
> >> Thanks guys and sorry for the really newbie questions...
> >>
> >> -------------------------------------------
> >> Chris Tanner
> >> Space Systems Design Lab
> >> Georgia Institute of Technology
> >> christopher.tanner@gatech.edu
> >> -------------------------------------------
> >>
> >
>

From coutinho at dcc.ufmg.br Fri Apr 4 16:01:58 2008
From: coutinho at dcc.ufmg.br (Bruno Coutinho)
Date: Fri Apr 4 16:02:06 2008
Subject: [mvapich-discuss] tcp/IPoIB and shared memory
Message-ID:

Does tcp/IPoIB support shared memory in MVAPICH2?

From chai.15 at osu.edu Fri Apr 4 17:59:45 2008
From: chai.15 at osu.edu (LEI CHAI)
Date: Fri Apr 4 18:00:32 2008
Subject: [mvapich-discuss] tcp/IPoIB and shared memory
In-Reply-To:
References:
Message-ID:

Hi,

The tcp/IPoIB support in MVAPICH2 is from MPICH2. You can use the
nemesis device which supports shared memory.
Lei ----- Original Message ----- From: Bruno Coutinho Date: Friday, April 4, 2008 4:02 pm Subject: [mvapich-discuss] tcp/IPoIB and shared memory To: mvapich-discuss@cse.ohio-state.edu > Does tcp/IPoIB support shared memory in MVAPICH2? > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080404/5499d9a7/attachment.html From maya.usatu at gmail.com Sun Apr 6 09:43:43 2008 From: maya.usatu at gmail.com (Maya Khaliullina) Date: Sun Apr 6 09:43:50 2008 Subject: [mvapich-discuss] Migration for MPI processes Message-ID: Hi all, We have an infiniband cluster: Node: 2xQuad Core Intel Xeon 2.33 GHz O/S: RHEL4.5 File System: GPFS We are using MVAPICH2-1.0.2p1 with BLCR-0.6.5. At this moment we have no problems with C/R(everything works fine). I wonder could the MPI job be restarted after a checkpointing on another subset of nodes, i.e. could the migration for MPI processes be realized from a node on another one? If not so, will you support this capability in the future? Thanks. Maya -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080406/ff625898/attachment.html From huanwei at cse.ohio-state.edu Sun Apr 6 10:19:22 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Sun Apr 6 10:19:29 2008 Subject: [mvapich-discuss] Migration for MPI processes In-Reply-To: Message-ID: Hi Maya, This is doable. Currently such functionality is supported in mvapich2. Apparently, to do that your checkpoint images should be readable from the new node. Please let us know if you meet any issues here. Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. 
of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Sun, 6 Apr 2008, Maya Khaliullina wrote: > Hi all, > > We have an infiniband cluster: > Node: 2xQuad Core Intel Xeon 2.33 GHz > O/S: RHEL4.5 > File System: GPFS > We are using MVAPICH2-1.0.2p1 with BLCR-0.6.5. > At this moment we have no problems with C/R(everything works fine). > > I wonder could the MPI job be restarted after a checkpointing on another > subset of nodes, > i.e. could the migration for MPI processes be realized from a node on > another one? > If not so, will you support this capability in the future? Thanks. > > Maya > From pasha at dev.mellanox.co.il Sun Apr 6 11:04:14 2008 From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha)) Date: Sun Apr 6 11:04:36 2008 Subject: [mvapich-discuss] Re: [ofa-general] MVAPICH2 crashes on mixed fabric In-Reply-To: References: Message-ID: <47F8E66E.6060505@dev.mellanox.co.il> MVAPICH(1) and OMPI have HCA auto-detect system and both of them works well on heterogeneous cluster. I'm not sure about mvapich2 but I think that mvapich-discussion list will be better place for this kind of question. So I'm forwarding this mail to mvapich list. Pasha. Mike Heinz wrote: > Hey, all, I'm not sure if this is a known bug or some sort of > limitation I'm unaware of, but I've been building and testing with the > OFED 1.3 GA release on a small fabric that has a mix of Arbel-based > and newer Connect-X HCAs. > > What I've discovered is that mvapich and openmpi work fine across the > entire fabric, but mvapich2 crashes when I use a mix of Arbels and > Connect-X. The errors vary depending on the test program but here's an > example: > > [mheinz@compute-0-0 IMB-3.0]$ mpirun -n 5 ./IMB-MPI1 > . > . > . > (output snipped) > . > . > . 
>
> #-----------------------------------------------------------------------------
> # Benchmarking Sendrecv
> # #processes = 2
> # ( 3 additional processes waiting in MPI_Barrier)
> #-----------------------------------------------------------------------------
>   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
>        0          1000         3.51         3.51         3.51        0.00
>        1          1000         3.63         3.63         3.63        0.52
>        2          1000         3.67         3.67         3.67        1.04
>        4          1000         3.64         3.64         3.64        2.09
>        8          1000         3.67         3.67         3.67        4.16
>       16          1000         3.67         3.67         3.67        8.31
>       32          1000         3.74         3.74         3.74       16.32
>       64          1000         3.90         3.90         3.90       31.28
>      128          1000         4.75         4.75         4.75       51.39
>      256          1000         5.21         5.21         5.21       93.79
>      512          1000         5.96         5.96         5.96      163.77
>     1024          1000         7.88         7.89         7.89      247.54
>     2048          1000        11.42        11.42        11.42      342.00
>     4096          1000        15.33        15.33        15.33      509.49
>     8192          1000        22.19        22.20        22.20      703.83
>    16384          1000        34.57        34.57        34.57      903.88
>    32768          1000        51.32        51.32        51.32     1217.94
>    65536           640        85.80        85.81        85.80     1456.74
>   131072           320       155.23       155.24       155.24     1610.40
>   262144           160       301.84       301.86       301.85     1656.39
>   524288            80       598.62       598.69       598.66     1670.31
>  1048576            40      1175.22      1175.30      1175.26     1701.69
>  2097152            20      2309.05      2309.05      2309.05     1732.32
>  4194304            10      4548.72      4548.98      4548.85     1758.64
> [0] Abort: Got FATAL event 3 at line 796 in file ibv_channel_manager.c
> rank 0 in job 1 compute-0-0.local_36049 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> If, however, I define my mpdring to contain only Connect-X systems OR
> only Arbel systems, IMB-MPI1 runs to completion.
>
> Can anyone suggest a workaround or is this a real bug with mvapich2?
> > -- > Michael Heinz > Principal Engineer, Qlogic Corporation > King of Prussia, Pennsylvania > > ------------------------------------------------------------------------ > > _______________________________________________ > general mailing list > general@lists.openfabrics.org > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general -- Pavel Shamis (Pasha) Mellanox Technologies From stuart at cs.ucdavis.edu Sun Apr 6 11:27:29 2008 From: stuart at cs.ucdavis.edu (Jeff Stuart) Date: Sun Apr 6 11:27:39 2008 Subject: [mvapich-discuss] Consistent and reproducible vbuf problems In-Reply-To: <696fd4820804060824l6cec48b6wda6709e788168a90@mail.gmail.com> References: <696fd4820804060822l796851a5g3583bf6c2b60ff45@mail.gmail.com> <696fd4820804060824l6cec48b6wda6709e788168a90@mail.gmail.com> Message-ID: <696fd4820804060827xf401112p631ea66e764a2b57@mail.gmail.com> I am using MVAPICH2 on a small set of workstations equipped with infiniband. I am using a GPU device library known as CUDA. CUDA uses page-locked memory areas, and I believe this is conflicting with MVAPICH2. If I run a series of broadcasts of size (1024, 2048, ..., 2MB) and run each size of broadcast a number of times (30 seems to work), nodes consistenly abort. 
A typical call stack: 2: 0x00002adab034db90 in memset () from /lib/libc.so.6 2: (gdb) 2: (gdb) bt 2: #0 0x00002adab034db90 in memset () from /lib/libc.so.6 2: #1 0x0000000000443bdd in allocate_vbuf_region () 2: #2 0x0000000000443fe5 in get_vbuf () 2: #3 0x000000000043716f in MRAILI_Get_Vbuf () 2: #4 0x000000000043739e in MPIDI_CH3I_MRAILI_Eager_send () 2: #5 0x000000000042f483 in MPIDI_CH3_Rendezvous_r3_push () 2: #6 0x000000000042f721 in MPIDI_CH3_Rendezvous_push () 2: #7 0x000000000042f9b1 in MPIDI_CH3I_MRAILI_Process_rndv () 2: #8 0x000000000042d904 in MPIDI_CH3I_Progress () 2: #9 0x0000000000420b6e in MPIC_Wait () 2: #10 0x00000000004215b9 in MPIC_Send () 2: #11 0x000000000041f62e in MPIR_Bcast () 2: #12 0x000000000042097e in PMPI_Bcast () 2: #13 0x000000000041dff9 in dcgn::BroadcastRequest::performCollectiveGlobal ( 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 2: #14 0x000000000041e7f0 in dcgn::CollectiveRequest::poll (this=0xc5c6f0, 2: ioRequests=@0x5f3708) at infrastructure/src/CollectiveRequest.cpp:75 2: #15 0x000000000041e75c in dcgn::CollectiveRequest::poll (this=0xc5c6f0, 2: ioRequests=@0x5f3708) at infrastructure/src/CollectiveRequest.cpp:84 2: #16 0x00000000004146e3 in dcgn::MPIWorker::serviceRequest (this=0x5f3680, 2: req=0xc5c6f0, isShutdown=) 2: at infrastructure/src/MPIWorker.cpp:118 2: #17 0x000000000041499f in dcgn::MPIWorker::loop (this=0x5f3680) 2: at infrastructure/src/MPIWorker.cpp:78 2: #18 0x0000000000415986 in dcgn::MPIWorker::launchThread ( 2: param=) at infrastructure/src/MPIWorker.cpp:221 2: #19 0x000000000041ce1a in dcgn::Thread::run (p=0x801de0) 2: at infrastructure/src/Thread.cpp:18 2: #20 0x00002adaaf78e297 in start_thread () from /lib/libpthread.so.0 2: #21 0x00002adab039a51e in clone () from /lib/libc.so.6 2: (gdb) up 13 2: #13 0x000000000041dff9 in dcgn::BroadcastRequest::performCollectiveGlobal ( 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 2: 18 MPI_Bcast(buf, numBytes, MPI_BYTE, 
mpiWorker->getMPIRankByTarget(root), MPI_COMM_WORLD); 2: Current language: auto; currently c++ 2: (gdb) print numBytes 2: $1 = 2097152 Before this crash, successful broadcasts were performed with sizes 1K, 2K, all the way to 1MB. The next broadcast is a 2MB broadcast and causes this crash. I am not sure if any 2MB broadcasts were performed before this, or if the first 2MB broadcast is what fails. Is there any way to either disable VBUF usage (I know this will cause performance degredation, but then again, slow performance is better than no performance) or limit vbuf usage? I tried limiting the number of vbufs to 1024 with a size of 16K (mpiexec -gdb -env MV2_VBUF_MAX 1024 -env MV2_VBUF_TOTAL_SIZE 16384) and I end up with 0: [0] Abort: VBUF alloc failure, limit exceeded at line 138 in file vbuf.c -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080406/402ea0c0/attachment.html From huanwei at cse.ohio-state.edu Sun Apr 6 14:24:25 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Sun Apr 6 14:24:33 2008 Subject: [mvapich-discuss] Consistent and reproducible vbuf problems (fwd) In-Reply-To: Message-ID: Hi Jeff, Would you please let us know the size of the job you are running (number of processes)? And are you running your broadcasts from the same user buffer or different ones? Also, could you please verify the memory load one the system when the failure happens? Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Sun, 6 Apr 2008, Dhabaleswar Panda wrote: > I am using MVAPICH2 on a small set of workstations equipped with infiniband. > I am using a GPU device library known as CUDA. CUDA uses page-locked memory > areas, and I believe this is conflicting with MVAPICH2. 
If I run a series of > broadcasts of size (1024, 2048, ..., 2MB) and run each size of broadcast a > number of times (30 seems to work), nodes consistenly abort. A typical call > stack: > > 2: 0x00002adab034db90 in memset () from /lib/libc.so.6 > 2: (gdb) 2: (gdb) bt > 2: #0 0x00002adab034db90 in memset () from /lib/libc.so.6 > 2: #1 0x0000000000443bdd in allocate_vbuf_region () > 2: #2 0x0000000000443fe5 in get_vbuf () > 2: #3 0x000000000043716f in MRAILI_Get_Vbuf () > 2: #4 0x000000000043739e in MPIDI_CH3I_MRAILI_Eager_send () > 2: #5 0x000000000042f483 in MPIDI_CH3_Rendezvous_r3_push () > 2: #6 0x000000000042f721 in MPIDI_CH3_Rendezvous_push () > 2: #7 0x000000000042f9b1 in MPIDI_CH3I_MRAILI_Process_rndv () > 2: #8 0x000000000042d904 in MPIDI_CH3I_Progress () > 2: #9 0x0000000000420b6e in MPIC_Wait () > 2: #10 0x00000000004215b9 in MPIC_Send () > 2: #11 0x000000000041f62e in MPIR_Bcast () > 2: #12 0x000000000042097e in PMPI_Bcast () > 2: #13 0x000000000041dff9 in > dcgn::BroadcastRequest::performCollectiveGlobal ( > 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 > 2: #14 0x000000000041e7f0 in dcgn::CollectiveRequest::poll (this=0xc5c6f0, > 2: ioRequests=@0x5f3708) at infrastructure/src/CollectiveRequest.cpp:75 > 2: #15 0x000000000041e75c in dcgn::CollectiveRequest::poll (this=0xc5c6f0, > 2: ioRequests=@0x5f3708) at infrastructure/src/CollectiveRequest.cpp:84 > 2: #16 0x00000000004146e3 in dcgn::MPIWorker::serviceRequest > (this=0x5f3680, > 2: req=0xc5c6f0, isShutdown=) > 2: at infrastructure/src/MPIWorker.cpp:118 > 2: #17 0x000000000041499f in dcgn::MPIWorker::loop (this=0x5f3680) > 2: at infrastructure/src/MPIWorker.cpp:78 > 2: #18 0x0000000000415986 in dcgn::MPIWorker::launchThread ( > 2: param=) at infrastructure/src/MPIWorker.cpp:221 > 2: #19 0x000000000041ce1a in dcgn::Thread::run (p=0x801de0) > 2: at infrastructure/src/Thread.cpp:18 > 2: #20 0x00002adaaf78e297 in start_thread () from /lib/libpthread.so.0 > 2: #21 0x00002adab039a51e in clone 
() from /lib/libc.so.6 > 2: (gdb) up 13 > 2: #13 0x000000000041dff9 in > dcgn::BroadcastRequest::performCollectiveGlobal ( > 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 > 2: 18 MPI_Bcast(buf, numBytes, MPI_BYTE, > mpiWorker->getMPIRankByTarget(root), MPI_COMM_WORLD); > 2: Current language: auto; currently c++ > 2: (gdb) print numBytes > 2: $1 = 2097152 > > > Before this crash, successful broadcasts were performed with sizes 1K, 2K, > all the way to 1MB. The next broadcast is a 2MB broadcast and causes this > crash. I am not sure if any 2MB broadcasts were performed before this, or if > the first 2MB broadcast is what fails. > > Is there any way to either disable VBUF usage (I know this will cause > performance degredation, but then again, slow performance is better than no > performance) or limit vbuf usage? I tried limiting the number of vbufs to > 1024 with a size of 16K (mpiexec -gdb -env MV2_VBUF_MAX 1024 -env > MV2_VBUF_TOTAL_SIZE 16384) and I end up with > > 0: [0] Abort: VBUF alloc failure, limit exceeded at line 138 in file vbuf.c > From stuart at cs.ucdavis.edu Sun Apr 6 14:51:47 2008 From: stuart at cs.ucdavis.edu (Jeff Stuart) Date: Sun Apr 6 14:51:59 2008 Subject: [mvapich-discuss] Consistent and reproducible vbuf problems (fwd) In-Reply-To: <696fd4820804061150y55c80981i37081509b979ad89@mail.gmail.com> References: <696fd4820804061150y55c80981i37081509b979ad89@mail.gmail.com> Message-ID: <696fd4820804061151l71bd77cco555d115170433a0b@mail.gmail.com> Four processes, multithreaded, though all MPI calls are made under one specific thread. As the broadcast size increases, the user buffer changes. Broadcasts of the same size use the same buffer. The memory load on the system appears to be under 5%, at least that is what I can discern from using top. > On Sun, Apr 6, 2008 at 11:24 AM, wei huang > wrote: > > > Hi Jeff, > > > > Would you please let us know the size of the job you are running (number > > of processes)? 
And are you running your broadcasts from the same user > > buffer or different ones? Also, could you please verify the memory load > > one the system when the failure happens? Thanks. > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > On Sun, 6 Apr 2008, Dhabaleswar Panda wrote: > > > > > I am using MVAPICH2 on a small set of workstations equipped with > > infiniband. > > > I am using a GPU device library known as CUDA. CUDA uses page-locked > > memory > > > areas, and I believe this is conflicting with MVAPICH2. If I run a > > series of > > > broadcasts of size (1024, 2048, ..., 2MB) and run each size of > > broadcast a > > > number of times (30 seems to work), nodes consistenly abort. A typical > > call > > > stack: > > > > > > 2: 0x00002adab034db90 in memset () from /lib/libc.so.6 > > > 2: (gdb) 2: (gdb) bt > > > 2: #0 0x00002adab034db90 in memset () from /lib/libc.so.6 > > > 2: #1 0x0000000000443bdd in allocate_vbuf_region () > > > 2: #2 0x0000000000443fe5 in get_vbuf () > > > 2: #3 0x000000000043716f in MRAILI_Get_Vbuf () > > > 2: #4 0x000000000043739e in MPIDI_CH3I_MRAILI_Eager_send () > > > 2: #5 0x000000000042f483 in MPIDI_CH3_Rendezvous_r3_push () > > > 2: #6 0x000000000042f721 in MPIDI_CH3_Rendezvous_push () > > > 2: #7 0x000000000042f9b1 in MPIDI_CH3I_MRAILI_Process_rndv () > > > 2: #8 0x000000000042d904 in MPIDI_CH3I_Progress () > > > 2: #9 0x0000000000420b6e in MPIC_Wait () > > > 2: #10 0x00000000004215b9 in MPIC_Send () > > > 2: #11 0x000000000041f62e in MPIR_Bcast () > > > 2: #12 0x000000000042097e in PMPI_Bcast () > > > 2: #13 0x000000000041dff9 in > > > dcgn::BroadcastRequest::performCollectiveGlobal ( > > > 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 > > > 2: #14 0x000000000041e7f0 in dcgn::CollectiveRequest::poll > > (this=0xc5c6f0, > > > 2: ioRequests=@0x5f3708) at > > 
infrastructure/src/CollectiveRequest.cpp:75 > > > 2: #15 0x000000000041e75c in dcgn::CollectiveRequest::poll > > (this=0xc5c6f0, > > > 2: ioRequests=@0x5f3708) at > > infrastructure/src/CollectiveRequest.cpp:84 > > > 2: #16 0x00000000004146e3 in dcgn::MPIWorker::serviceRequest > > > (this=0x5f3680, > > > 2: req=0xc5c6f0, isShutdown=) > > > 2: at infrastructure/src/MPIWorker.cpp:118 > > > 2: #17 0x000000000041499f in dcgn::MPIWorker::loop (this=0x5f3680) > > > 2: at infrastructure/src/MPIWorker.cpp:78 > > > 2: #18 0x0000000000415986 in dcgn::MPIWorker::launchThread ( > > > 2: param=) at > > infrastructure/src/MPIWorker.cpp:221 > > > 2: #19 0x000000000041ce1a in dcgn::Thread::run (p=0x801de0) > > > 2: at infrastructure/src/Thread.cpp:18 > > > 2: #20 0x00002adaaf78e297 in start_thread () from > > /lib/libpthread.so.0 > > > 2: #21 0x00002adab039a51e in clone () from /lib/libc.so.6 > > > 2: (gdb) up 13 > > > 2: #13 0x000000000041dff9 in > > > dcgn::BroadcastRequest::performCollectiveGlobal ( > > > 2: this=0xc5c6f0) at infrastructure/src/BroadcastRequest.cpp:18 > > > 2: 18 MPI_Bcast(buf, numBytes, MPI_BYTE, > > > mpiWorker->getMPIRankByTarget(root), MPI_COMM_WORLD); > > > 2: Current language: auto; currently c++ > > > 2: (gdb) print numBytes > > > 2: $1 = 2097152 > > > > > > > > > Before this crash, successful broadcasts were performed with sizes 1K, > > 2K, > > > all the way to 1MB. The next broadcast is a 2MB broadcast and causes > > this > > > crash. I am not sure if any 2MB broadcasts were performed before this, > > or if > > > the first 2MB broadcast is what fails. > > > > > > Is there any way to either disable VBUF usage (I know this will cause > > > performance degredation, but then again, slow performance is better > > than no > > > performance) or limit vbuf usage? 
I tried limiting the number of vbufs > > to > > > 1024 with a size of 16K (mpiexec -gdb -env MV2_VBUF_MAX 1024 -env > > > MV2_VBUF_TOTAL_SIZE 16384) and I end up with > > > > > > 0: [0] Abort: VBUF alloc failure, limit exceeded at line 138 in file > > vbuf.c > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080406/904b2fc6/attachment-0001.html From maya.usatu at gmail.com Mon Apr 7 11:55:06 2008 From: maya.usatu at gmail.com (Maya Khaliullina) Date: Mon Apr 7 11:55:21 2008 Subject: [mvapich-discuss] Fwd: MVAPICH2 + BLCR performance problem on multi-core cluster In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Maya Khaliullina Date: 07.04.2008 21:50 Subject: MVAPICH2 + BLCR performance problem on multi-core cluster To: mvapich-discuss@cse.ohio-state.edu. Hello, I have a performance problem when using mvapich2 compiled with BLCR support on infiniband cluster with following parameters: Node: 2xQuad Core Intel Xeon 2.33 GHz O/S: RHEL4.5 File System: GPFS We are using MVAPICH2-1.0.2p1 with BLCR-0.6.5. 
I've done 3 test runs of my program using 8 MPI processes:
1) All 8 processes on one node
2) 4 processes on each of two nodes
3) 2 processes on each of 4 nodes

*Results MVAPICH2 configured for BLCR support:*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test*
*Calc time: 341.3279, send/recv time = 297.817*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test*
*Calc time: 85.7075, send/recv time = 42.2270*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test*
*Calc time: 84.6182, send/recv time = 40.3554*

*Results MVAPICH2 configured without BLCR support:*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test*
*Calc time: 51.5888, send/recv time = 8.0186*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test*
*Calc time: 53.6679, send/recv time = 10.1187*
*[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test*
*Calc time: 63.6611, send/recv time = 20.0127*

So when using MVAPICH2 configured with BLCR support, a lot of time is
spent on communication between processes.
Is this because shared-memory support is automatically disabled in such a
build?
If so, do you plan to include support for both BLCR and shared-memory
communication in future releases?
And are there other ways to improve the performance of an MPI program
running on a multi-core node?

Thanks.

Maya
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080407/c570b72a/attachment.html

From devesh28 at gmail.com Mon Apr 7 07:55:03 2008
From: devesh28 at gmail.com (Devesh Sharma)
Date: Mon Apr 7 11:55:33 2008
Subject: [mvapich-discuss] Problem in running mpi processes on TCP/IP
Message-ID: <309a667c0804070455m6c8d7e26t8c68d9e1972e9ecc@mail.gmail.com>

Hello all,

I am trying to run an MPI job between two nodes using TCP/IP.
The nodes have two different network interfaces, eth0 and ib0 (using IPoIB).
I want to run my job through the ib0 interface. The host names for the two
interfaces are different: node1 and node2 for eth0, and ibnode1 and ibnode2
for ib0. But when I run mpdboot after specifying ibnode1 and ibnode2 in the
mpd.hosts file, mpdtrace still shows node1 and node2!
How can I select ib0 as the communication interface?

-Devesh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080407/a09e348d/attachment.html

From thuynh at vni.com Mon Apr 7 10:13:45 2008
From: thuynh at vni.com (Tai Huynh)
Date: Mon Apr 7 11:56:55 2008
Subject: [mvapich-discuss] OPEN_IB_HOME
Message-ID: <52779A68A4CAD94FAEED3C9ACB445B5A014B6D9B@vega.vni.com>

Hi,
I am trying to build MVAPICH2 and have a question. I get this error:
OPEN_IB_HOME directory /usr/local/ofed does not exist.
How or where can I find out where the ofed directory is?

Thanks,
Tai

From maya.usatu at gmail.com Mon Apr 7 11:57:01 2008
From: maya.usatu at gmail.com (Maya Khaliullina)
Date: Mon Apr 7 11:57:11 2008
Subject: [mvapich-discuss] MVAPICH2 + BLCR performance problem on multi-core cluster
Message-ID:

Hello,

I have a performance problem when using mvapich2 compiled with BLCR support
on an InfiniBand cluster with the following parameters:

Node: 2xQuad Core Intel Xeon 2.33 GHz
O/S: RHEL4.5
File System: GPFS
We are using MVAPICH2-1.0.2p1 with BLCR-0.6.5.
I've done 3 test runs of my program using 8 MPI processes: 1) All of 8 processes on one node 2) by 4 processes on two nodes 3) by 2 processes on 4 nodes *Results MVAPICH2 configured for BLCR support:* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test* *Calc time: 341.3279, send/recv time = 297.817* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test* *Calc time: 85.7075, send/recv time = 42.2270* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test* *Calc time: 84.6182, send/recv time = 40.3554* *Results MVAPICH2 configured without BLCR support:* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test* *Calc time: 51.5888, send/recv time = 8.0186* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test* *Calc time: 53.6679, send/recv time = 10.1187* *[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test* *Calc time: 63.6611, send/recv time = 20.0127 * So when using MVAPICH2 configured with BLCR support I have much time which is spent on communication between processes. Is it concerned with the fact of shared-memory support automatic disabling in such build? If it is so, do you plan to include support of both BLCR & shared-memory communications in future releases? And maybe there are another ways to improve performance of MPI program running on multi-core node? Thanks. Maya -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080407/6d065973/attachment.html From panda at cse.ohio-state.edu Mon Apr 7 12:15:18 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Apr 7 12:15:26 2008 Subject: [mvapich-discuss] Fwd: MVAPICH2 + BLCR performance problem on multi-core cluster In-Reply-To: Message-ID: > Hello, > > I have a performance problem when using mvapich2 compiled with BLCR support > on infiniband cluster with following parameters: > > Node: 2xQuad Core Intel Xeon 2.33 GHz > O/S: RHEL4.5 > File System: GPFS > We are using MVAPICH2-1.0.2p1 with BLCR-0.6.5. > > I've done 3 test runs of my program using 8 MPI processes: > 1) All of 8 processes on one node > 2) by 4 processes on two nodes > 3) by 2 processes on 4 nodes > > *Results MVAPICH2 configured for BLCR support:* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test* > *Calc time: 341.3279, send/recv time = 297.817* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test* > *Calc time: 85.7075, send/recv time = 42.2270* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test* > *Calc time: 84.6182, send/recv time = 40.3554* > > *Results MVAPICH2 configured without BLCR support:* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf1 -np 8 ./test* > *Calc time: 51.5888, send/recv time = 8.0186* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf2 -np 8 ./test* > *Calc time: 53.6679, send/recv time = 10.1187* > *[ccs-dev@n5304]$ mpiexec -machinefile ./mf3 -np 8 ./test* > *Calc time: 63.6611, send/recv time = 20.0127 > * > > So when using MVAPICH2 configured with BLCR support I have much > time which is spent on communication between processes. > Is it concerned with the fact of shared-memory support automatic disabling > in such build? The previous version of BLCR didn't have capability to check-point shared-memory. That's why the shared-memory support was disabled in MVAPICH2 1.0 to make it work with BLCR. 
The performance degradation comes from the lack of shared memory support.

> If it is so, do you plan to include support of both BLCR & shared-memory
> communications in future releases?

Since the latest version of BLCR supports check-pointing shared-memory,
we are working on enabling this in MVAPICH2. The next version of MVAPICH2
(1.1) will support this capability. You will see the best performance as
well as the checkpointing capability.

Thanks,

DK

> And maybe there are another ways to improve performance of MPI program
> running on multi-core node?
>
>
> Thanks.
>
> Maya
>

From koop at cse.ohio-state.edu Mon Apr 7 12:59:40 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Apr 7 12:59:48 2008
Subject: [mvapich-discuss] Problem in running mpi processes on TCP/IP
In-Reply-To: <309a667c0804070455m6c8d7e26t8c68d9e1972e9ecc@mail.gmail.com>
Message-ID:

Devesh,

In addition to specifying ibnode1 and ibnode2 in the mpd.hosts, can you
add --ifhn=ibnode1 to the mpdboot command (assuming you are starting
mpdboot on ibnode1).

mpdboot -n 2 --ifhn=ibnode1

Let us know if this helps. Thanks.

Matt

On Mon, 7 Apr 2008, Devesh Sharma wrote:

> Hello all,
>
> I am trying to run an MPI job beween two nodes using TCP/IP. The nodes
> consists of two different ethernet interfaces those are eth0 and ib0 (using
> ipoib)
> I want to run my job through ib0 interface, host-names for both the
> interfaces are different node1 and node2 for eth0 and ibnode1 and ibnode2
> for ib0,
> but when I run mpdboot by specifying ibnode1 and ibnode2 in the mdp.host
> file, mpdtrace shows
> node1 and node2 still!
> How can I select ib0 as communication interface?
> > -Devesh > From koop at cse.ohio-state.edu Mon Apr 7 13:03:40 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Apr 7 13:03:47 2008 Subject: [mvapich-discuss] OPEN_IB_HOME In-Reply-To: <52779A68A4CAD94FAEED3C9ACB445B5A014B6D9B@vega.vni.com> Message-ID: Tai, Generally you'd want to ask the system administrator who installed the OFED/OpenFabrics package on the system (that must be installed first). We have it defaulted to /usr/local/ofed, however, it may be /usr/local or /usr/ofed on your system. You can try 'which ofed_info' and see if OFED is installed. That generally will give you an idea of the installation path used (remove the bin/ofed_info). Let us know if you have any other questions, Matt On Mon, 7 Apr 2008, Tai Huynh wrote: > Hi, I am trying to build MVAPICH2 and have a question. I get this > error OPEN_IB_HOME directory /usr/local/ofed does not exist. How or > where is can I get information on where is the ofed directory? > > Thanks, > Tai > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From pasha at dev.mellanox.co.il Mon Apr 7 13:04:18 2008 From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha)) Date: Mon Apr 7 13:04:28 2008 Subject: [mvapich-discuss] OPEN_IB_HOME In-Reply-To: <52779A68A4CAD94FAEED3C9ACB445B5A014B6D9B@vega.vni.com> References: <52779A68A4CAD94FAEED3C9ACB445B5A014B6D9B@vega.vni.com> Message-ID: <47FA5412.2060908@dev.mellanox.co.il> Try to run: /etc/infiniband/info > Hi, > I am trying to build MVAPICH2 and have a question. I get this error OPEN_IB_HOME directory /usr/local/ofed does not exist. > How or where is can I get information on where is the ofed directory? 
>
> Thanks,
> Tai
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

--
Pavel Shamis (Pasha)
Mellanox Technologies

From huanwei at cse.ohio-state.edu Mon Apr 7 13:15:26 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Mon Apr 7 13:15:35 2008
Subject: [mvapich-discuss] OPEN_IB_HOME
In-Reply-To: <52779A68A4CAD94FAEED3C9ACB445B5A014B6D9B@vega.vni.com>
Message-ID:

Hi Tai,

Please make sure that the OFED package is properly installed on your
systems. You can perhaps consult your system administrator regarding the
installation path, etc.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

On Mon, 7 Apr 2008, Tai Huynh wrote:

> Hi,
> I am trying to build MVAPICH2 and have a question. I get this error OPEN_IB_HOME directory /usr/local/ofed does not exist.
> How or where is can I get information on where is the ofed directory?
>
> Thanks,
> Tai
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From devesh28 at gmail.com Tue Apr 8 02:14:51 2008
From: devesh28 at gmail.com (Devesh Sharma)
Date: Tue Apr 8 10:24:34 2008
Subject: [mvapich-discuss] Problem in running mpi processes on TCP/IP
In-Reply-To: <309a667c0804072200x4f1aa3d9leb20d42781c56a22@mail.gmail.com>
References: <309a667c0804070455m6c8d7e26t8c68d9e1972e9ecc@mail.gmail.com> <309a667c0804072200x4f1aa3d9leb20d42781c56a22@mail.gmail.com>
Message-ID: <309a667c0804072314g31a2dfe0k97fc419f51ec6658@mail.gmail.com>

Hello Matthew,

Thanks for replying. I have already tried this, but I don't see any data
traffic on the ib0 interface (the visual indication is the LED blinking),
even when I run Pallas after booting mpd in the same way!
Secondly, on the node where mpd was started, mpdtrace -l shows the IP of
ibnode1 and the IP of node2, and vice versa on the other node. What is the
reason?

On Mon, Apr 7, 2008 at 10:29 PM, Matthew Koop wrote:

> Devesh,
>
> In addition to specifying ibnode1 and ibnode2 in the mpd.hosts, can you
> add --ifhn=ibnode1 to the mpdboot command (assuming you are starting
> mpdboot on ibnode1).
>
> mpdboot -n 2 --ifhn=ibnode1
>
> Let us know if this helps. Thanks.
>
> Matt
>
> On Mon, 7 Apr 2008, Devesh Sharma wrote:
>
> > Hello all,
> >
> > I am trying to run an MPI job beween two nodes using TCP/IP. The nodes
> > consists of two different ethernet interfaces those are eth0 and ib0 (using
> > ipoib)
> > I want to run my job through ib0 interface, host-names for both the
> > interfaces are different node1 and node2 for eth0 and ibnode1 and ibnode2
> > for ib0,
> > but when I run mpdboot by specifying ibnode1 and ibnode2 in the mdp.host
> > file, mpdtrace shows
> > node1 and node2 still!
> > How can I select ib0 as communication interface?
> >
> > -Devesh
> >
> >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080408/21861dbc/attachment.html

From christopher.tanner at gatech.edu Tue Apr 8 14:58:29 2008
From: christopher.tanner at gatech.edu (Christopher Lee Tanner)
Date: Tue Apr 8 14:58:38 2008
Subject: [mvapich-discuss] Does MVAPICH have to be on each node to setup a ring?
Message-ID: <47FBC055.5040507@gatech.edu>

All -

I'm getting an error when trying to execute mpdboot. My command is:
$ mpdboot --rsh=rsh -n 16

I receive this error:
masternode (handle_mpd_output 396): from mpd on node9, invald port info:
bash: /usr/local/mvapich2/bin/mpd.py: No such file or directory

I'm not savvy enough to set up the SSH stuff between the nodes, so that's
why I use the 'rsh' option. Do I need to copy mvapich2 to the /usr/local
directory on each node in order to start up the ring? Thanks!
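
One way to look into the error above before booting the ring is to check
each host for the mpd.py path named in the message. The following is only a
sketch: the function name check_path_on_hosts is made up for illustration,
the host list is assumed to be a file with one hostname per line, and the
result is read from the remote command's output because classic rsh
generally does not pass back the remote exit status.

```shell
#!/bin/sh
# Hypothetical helper (not part of MVAPICH2 or mpd): list the hosts on which
# a given path is missing.
#   $1  file with one hostname per line (assumed layout)
#   $2  path to look for on each host
#   $3  remote shell command; defaults to rsh, as in the mpdboot call above
check_path_on_hosts() {
    hosts_file=$1
    path=$2
    remote=${3:-rsh}
    while read -r host; do
        # Skip blank lines in the host file.
        [ -n "$host" ] || continue
        # Ask the remote side to print a token instead of relying on its
        # exit status. stdin is redirected so the remote shell cannot
        # consume the rest of the host list.
        result=$($remote "$host" "test -e '$path' && echo present" </dev/null 2>/dev/null)
        [ "$result" = "present" ] || echo "$host: missing $path"
    done < "$hosts_file"
}

# Example call (run by hand; the path is taken from the error message):
#   check_path_on_hosts mpd.hosts /usr/local/mvapich2/bin/mpd.py
```

Any host printed by the function would need the installation copied over or
made visible through a shared file system.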
~Chris Tanner From koop at cse.ohio-state.edu Tue Apr 8 15:28:48 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue Apr 8 15:28:56 2008 Subject: [mvapich-discuss] Does MVAPICH have to be on each node to setup a ring? In-Reply-To: <47FBC055.5040507@gatech.edu> Message-ID: Chris, Yes, you will need to have MVAPICH2 installed on each node -- or available on all nodes through NFS. You'll need to have the executable you're running available on all nodes as well, so you'll need something like NFS regardless. Let us know if you have any other questions, Matt On Tue, 8 Apr 2008, Christopher Lee Tanner wrote: > All - > > I'm a getting an error when trying to execute mpdboot. My command is: > $ mpdboot --rsh=rsh -n 16 > > I receive this error: > masternode (handle_mpd_output 396): from mpd on node9, invald port info: > bash: /usr/local/mvapich2/bin/mpd.py: No such file or directory > > I'm not savvy enough to setup the SSH stuff between the nodes, so that's > why I use the 'rsh' option. Do I need to copy mvapich2 to the /usr/local > directory on each node in order to start up the ring? Thanks! > > ~Chris Tanner > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From Fred.Stecher at atk.com Wed Apr 9 13:43:35 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 9 13:43:34 2008 Subject: [mvapich-discuss] MVAPICH default installation Message-ID: Hi, When I installed MVAPICH, I used the default. If Infiniband is not working will my executable still run? Thanks, Fred -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080409/0e793983/attachment.html From koop at cse.ohio-state.edu Wed Apr 9 14:17:35 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Apr 9 14:17:41 2008 Subject: [mvapich-discuss] MVAPICH default installation In-Reply-To: Message-ID: Hi Fred, If InfiniBand is not working then the job will not run. There is currently no method by which it will fall back to TCP/IP. Does this answer your question? Matt On Wed, 9 Apr 2008, Stecher, Fred wrote: > Hi, > When I installed MVAPICH, I used the default. If Infiniband is not > working will my executable still run? > > Thanks, > > Fred > > From thuynh at vni.com Wed Apr 9 15:08:58 2008 From: thuynh at vni.com (Tai Huynh) Date: Wed Apr 9 15:40:20 2008 Subject: [mvapich-discuss] error report Message-ID: <52779A68A4CAD94FAEED3C9ACB445B5A014B6DA7@vega.vni.com> hi, I received an error message when I try to make.mvapich2.ofa. I'm running on Red Hat 5 enterprise. Attach is my log file. Tai <> -------------- next part -------------- A non-text attachment was scrubbed... Name: config-mine.log Type: application/octet-stream Size: 11609 bytes Desc: config-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080409/13f72b0b/config-mine-0001.obj From Fred.Stecher at atk.com Wed Apr 9 16:38:43 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 9 16:46:05 2008 Subject: FW: [mvapich-discuss] MVAPICH default installation Message-ID: Matt, The system administrator for our new SGI cluster computer informed me that InfiniBand was not configured correctly and that my run could not be using InfiniBand. The Config file had the wrong name for the network interface card, i.e. NIC was set to ibO (ib letter "o) when NIC should have been set to ib0 ( ib number "0"). My executable still ran even though NIC was set wrong. Is this possible? 
Thanks, Fred -----Original Message----- From: Matthew Koop [mailto:koop@cse.ohio-state.edu] Sent: Wednesday, April 09, 2008 1:18 PM To: Stecher, Fred Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] MVAPICH default installation Hi Fred, If InfiniBand is not working then the job will not run. There is currently no method by which it will fall back to TCP/IP. Does this answer your question? Matt On Wed, 9 Apr 2008, Stecher, Fred wrote: > Hi, > When I installed MVAPICH, I used the default. If Infiniband is not > working will my executable still run? > > Thanks, > > Fred > > From christopher.tanner at gatech.edu Wed Apr 9 20:01:00 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Wed Apr 9 20:01:10 2008 Subject: [mvapich-discuss] Running latency tests In-Reply-To: References: Message-ID: <334CA8E5-6D19-4F64-9736-7C56758E920F@gatech.edu> All - I believe I am gravy with the mvapich2 install so now I'm trying to run the latency tests to see if it's really working. But, I'm a dummy and can't get it to work. Here's what I've done so far: a) Initiated a mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh -n 16 -1). I have multiple processors, each with multiple cores on each node, thus the '-1'. b) Compiled osu_latency.c using mpicc (to an executable called osu_latency) b) Tried to execute the compile file via 'mpiexec -machinefile machine.list -n 16 ./osu_latency' I receive the following error (16 times naturally) :: ./osu_latency: error while loading shared libraries: librdmacm.so.1: cannot open shared object file: No such file or directory I don't know where this file would be -- it's not in the /usr/lib with all of the other *.so.* files. Any thoughts? Thanks. 
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner@gatech.edu ------------------------------------------- On Apr 9, 2008, at 2:17 PM, Matthew Koop wrote: > Hi Fred, > > If InfiniBand is not working then the job will not run. There is > currently > no method by which it will fall back to TCP/IP. > > Does this answer your question? > > Matt

From huanwei at cse.ohio-state.edu Thu Apr 10 11:18:27 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Thu Apr 10 11:18:34 2008 Subject: [mvapich-discuss] Running latency tests In-Reply-To: Message-ID: Hi Chris, It seems that some IB libraries are not in your default library path. You may need to explicitly export the path to the IB libraries in your environment variables (bash profile or similar places). To find where those libraries are, you can look at the /etc/infiniband/info file, or ask your system administrator about the path. Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Thu, 10 Apr 2008, Dhabaleswar Panda wrote: > ---------- Forwarded message ---------- > Date: Wed, 9 Apr 2008 20:01:00 -0400 > From: Christopher Tanner > To: mvapich-discuss@cse.ohio-state.edu > Subject: [mvapich-discuss] Running latency tests > > All - > > I believe I am gravy with the mvapich2 install so now I'm trying to > run the latency tests to see if it's really working. But, I'm a dummy > and can't get it to work. Here's what I've done so far: > > a) Initiated a mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh -n 16 > -1). I have multiple processors, each with multiple cores on each > node, thus the '-1'. > b) Compiled osu_latency.c using mpicc (to an executable called > osu_latency) > b) Tried to execute the compile file via 'mpiexec -machinefile > machine.list -n 16 ./osu_latency' > > I receive the following error (16 times naturally) :: > ./osu_latency: error while loading shared libraries: librdmacm.so.1: > cannot open shared object file: No such file or directory > > I don't know where this file would be -- it's not in the /usr/lib with > all of the other *.so.* files. > Any thoughts? Thanks. > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner@gatech.edu > -------------------------------------------

From huanwei at cse.ohio-state.edu Thu Apr 10 11:53:45 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Thu Apr 10 11:53:52 2008 Subject: [mvapich-discuss] Running latency tests (fwd) Message-ID: Do you see the same error?
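A minimal sketch, offered as an illustration rather than as anything posted in this thread, of one way to see which shared libraries the dynamic loader cannot resolve for a binary such as osu_latency. `which` only searches PATH for executables, whereas the loader consults LD_LIBRARY_PATH and the ld.so cache, so `ldd` is usually the more telling check. The helper name unresolved_libs is hypothetical; it only filters ldd-style output read from stdin.

```shell
# Hypothetical helper (not part of MVAPICH or OFED): read `ldd` output on
# stdin and print the names of libraries the loader reported as not found.
unresolved_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (assumes the benchmark binary is in the current directory):
#   ldd ./osu_latency | unresolved_libs
# An empty result means every library resolved with the current
# LD_LIBRARY_PATH and ld.so cache on the node where the command was run.
```

Since the MPI processes are started on the remote nodes, the same check would presumably have to pass there as well, in the environment those processes actually inherit.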
Try: export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Thu, 10 Apr 2008, Christopher Tanner wrote: > Thanks Wei. Of course, the problem isn't solved yet... > > So I found the file in the /usr/local/lib64 directory on the master > node only. I copied the file to the rest of the nodes to the /usr/ > local/lib64 directory and included the directory in my path. When I > tried to execute the osu_latency program, it gave me the same error. A > 'which librdmacm.so.1' command reveals that it can indeed find the > library. > > Any clues? Or perhaps, any other ways to determine if the Infiniband > is working? > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner@gatech.edu > -------------------------------------------

From Fred.Stecher at atk.com Thu Apr 10 12:23:49 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Thu Apr 10 12:23:45 2008 Subject: [mvapich-discuss] Install MVAPICH 1 Message-ID: Hi, This is a follow-up to my previous question concerning whether MVAPICH 1 is using InfiniBand or Ethernet. Upon monitoring network traffic, my executable is definitely using Ethernet. I have reinstalled MVAPICH. The user manual stated "Go to the mvapich-1.0 directory. We have included a single script for OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different platforms, compilers and architectures. By default, the compilation script uses gcc. In order to select your compiler, please set the variable CC in the script to use either Intel, PathScale or PGI compiler. The platform/architecture is detected automatically." I tried make -f make.mvapich.gen2 and got the following error message: make.mvapich.gen2:7: *** missing separator. Stop. I then just typed make. This resulted in installation of some version of MVAPICH. I am not sure what version. Does anyone know what version was installed or how to determine the version? Thanks, Fred -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080410/120b1c56/attachment.html

From huanwei at cse.ohio-state.edu Thu Apr 10 13:49:50 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Thu Apr 10 13:50:01 2008 Subject: [mvapich-discuss] Running latency tests (fwd) In-Reply-To: <96CE1ABD-C610-431A-B7BF-54EFB0B68049@gatech.edu> Message-ID: Hi Chris, You have to make sure the related kernel modules are loaded (including rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Thu, 10 Apr 2008, Christopher Tanner wrote: > Ok Wei - > > Even though I've copied the libib* libraries from the master node to > all of the other nodes and included the /usr/local/lib directory in > the LD_LIBRARY_PATH, it seems that osu_latency still cannot find > libibverbs.so.1. I'm kind of stuck... Any thoughts? > > Also, whenever I try to execute osu_latency using just one core on the > master node (mpiexec -n 1 ./osu_latency), I receive the following error: > > libibverbs: Fatal: couldn't read uverbs ABI version. > Fatal error in MPI_Init: > Other MPI error, error stack: > MPIR_Init_thread(259)...........: Initialization failed > MPID_Init(102)..................: channel initialization failed > MPIDI_CH3_Init(178).............: > MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters > rdma_get_control_parameters(432): > rdma_open_hca(367)..............: No IB device found > rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective > abort of all ranks > exit status of rank 0: return code 1 > > Does this output help solve the other problem? > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner@gatech.edu > -------------------------------------------

From gopalakk at cse.ohio-state.edu Thu Apr 10 14:58:04 2008 From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan) Date: Thu Apr 10 14:58:18 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: References: Message-ID: <92eddfb50804101158i3488336bs4a5f0d7769bef1ec@mail.gmail.com> Hi Fred. make.mvapich.gen2 is not a makefile. It is just an executable bash shell script that takes care of setting up the necessary environment, running "configure" & building the necessary components. You just have to execute it as follows. # export CC= # ./make.mvapich.gen2 The default PREFIX is /usr/local/mvapich. You can change this path by editing the "PREFIX" variable within make.mvapich.gen2. Hope this helps. Regards, Karthik On 4/10/08, Stecher, Fred wrote: > > > Hi, > This is a follow-up to previous question concerning whether MVAPICH 1 is > using InfiniBand or Ethernet. Upon monitoring network traffic, my executable > is definitely using Ethernet. > > I have reinstalled MVAPICH.
The user manual stated "Go to the mvapich-1.0 > directory. We have included > a single script for OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of > different > platforms, compilers and architectures. By default, the compilation script > uses gcc. In > order to select your compiler, please set the variable CC in the script to > use either Intel, > PathScale or PGI compiler. The platform/architecture is detected > automatically." I tried make -f make.mvapich.gen2 with following error > message: > > make.mvapich.gen2:7: *** missing separator. Stop. > I then just typed make. This resulted in installation of some version of > MVAPICH. I am not sure what version. Do anyone know what version was > installed or how to determine the version? > > Thanks, > > Fred > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > From Fred.Stecher at atk.com Thu Apr 10 18:21:10 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Thu Apr 10 18:21:09 2008 Subject: [mvapich-discuss] (no subject) Message-ID: Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... Name: make-mine.log_gz Type: application/octet-stream Size: 8243 bytes Desc: make-mine.log_gz Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080410/bab4516e/make-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: make.log_gz Type: application/octet-stream Size: 12359 bytes Desc: make.log_gz Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080410/bab4516e/make-0001.obj From perkinjo at cse.ohio-state.edu Fri Apr 11 08:36:32 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Fri Apr 11 08:36:46 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> Message-ID: <20080411123632.GB2766@cse.ohio-state.edu> After you reinstalled MVAPICH, did you also rebuild your MPI application before running? It's possible that you were still using the old library when you restarted. In order to debug the compiler issue I'd like to see the other log files as well. Specifically the config.log and the config-mine.log. On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > Jonathan, > I performed the ./make.mvapich.gen2 command and output to a make.log > file. In the make.log file there was a Warning message. Also, the pgcc > compiler was not used. No Fortran compiler was used either. I have > attached the make.log file. I then restarted my run. Monitoring the > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > indicated some traffic. I do not think that InfiniBand is being used. > > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Thursday, April 10, 2008 12:06 PM > To: Stecher, Fred > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > So, > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > Errors out? > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > command line (without the quotes of course). Before doing this, be sure > to export any variables that you may need to override in > make.mvapich.gen2. 
> > For more information please see > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > Section 4 should answer most of your questions. > > > > > Thanks, > > > > Fred > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 11:50 AM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > Hi, > > > This is a follow-up to previous question concerning whether MVAPICH > > > 1 is using InfiniBand or Ethernet. Upon monitoring network traffic, > > > my executable is definitely using Ethernet. > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > mvapich-1.0 directory. We have included a single script for > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > platforms, compilers and architectures. By default, the compilation > > > script uses gcc. In order to select your compiler, please set the > > > variable CC in the script to use either Intel, PathScale or PGI > > > compiler. The platform/architecture is detected automatically." I > > > tried make -f make.mvapich.gen2 with following error > > > > You should use ./make.mvapich.gen2 > > > > > message: > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > I then just typed make. This resulted in installation of some > > > version of MVAPICH. I am not sure what version. Do anyone know what > > > version was installed or how to determine the version? > > > > > > > By using make directly you almost certainly have made the TCP version. 
> > > > > Thanks, > > > > > > Fred > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo

From christopher.tanner at gatech.edu Fri Apr 11 14:58:28 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Fri Apr 11 14:58:45 2008 Subject: [mvapich-discuss] How do I start the IB modules? In-Reply-To: References: Message-ID: <30A341F7-5D76-4901-BA77-F3F08E1929EE@gatech.edu> All - How do I make sure that the pertinent IB modules are loaded (e.g. rdma_ucm, ib_uverbs, etc)? I am getting the following error when I try to execute the OSU benchmarks: libibverbs: Fatal: couldn't read uverbs ABI version. Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(259)...........: Initialization failed MPID_Init(102)..................: channel initialization failed MPIDI_CH3_Init(178).............: MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters rdma_get_control_parameters(432): rdma_open_hca(367)..............: No IB device found rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective abort of all ranks exit status of rank 0: return code 1 ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner@gatech.edu ------------------------------------------- On Apr 10, 2008, at 1:49 PM, wei huang wrote: > Hi Chris, > > You have to make sure related kernel modules are loaded (including > rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501

From bgauthier at terrascale.net Fri Apr 11 16:06:16 2008 From: bgauthier at terrascale.net (Bruno Gauthier) Date: Fri Apr 11 16:08:25 2008 Subject: [mvapich-discuss] How do I start the IB modules? In-Reply-To: <30A341F7-5D76-4901-BA77-F3F08E1929EE@gatech.edu> References: <30A341F7-5D76-4901-BA77-F3F08E1929EE@gatech.edu> Message-ID: <1207944376.29911.83.camel@ajax.ts> I guess you need to load your InfiniBand drivers.
lsmod should give you something like this:

Module                  Size  Used by
iscsi_tcp              27904  0
ib_iser                37416  0
libiscsi               29824  2 iscsi_tcp,ib_iser
scsi_transport_iscsi   36240  3 iscsi_tcp,ib_iser,libiscsi
rdma_ucm               16128  0
rdma_cm                36132  2 ib_iser,rdma_ucm
iw_cm                  12552  1 rdma_cm
ib_addr                10248  1 rdma_cm
sunrpc                201608  3
dm_mirror              26112  0
dm_mod                 64240  1 dm_mirror
button                 11424  0
ib_mthca              130052  0
i2c_amd756              9220  0
i2c_core               27648  1 i2c_amd756
ib_ipoib               82288  0
ib_umad                19752  0
ib_ucm                 19720  0
ib_uverbs              45776  2 rdma_ucm,ib_ucm
ib_cm                  37208  3 rdma_cm,ib_ipoib,ib_ucm
ib_sa                  44248  3 rdma_cm,ib_ipoib,ib_cm
ib_mad                 40888  4 ib_mthca,ib_umad,ib_cm,ib_sa
ib_core                64128  12 ib_iser,rdma_ucm,rdma_cm,iw_cm,ib_mthca,ib_ipoib,ib_umad,ib_ucm,ib_uverbs,ib_cm,ib_sa,ib_mad
ipv6                  278408  19 ib_ipoib
tg3                   115076  0
floppy                 66056  0
sr_mod                 20644  0
ext3                  136464  2
jbd                    56232  1 ext3
sata_sil               14216  3
libata                161144  1 sata_sil
usb_storage            72480  0
uhci_hcd               27552  0
ohci_hcd               25220  0
ehci_hcd               36364  0
sd_mod                 30592  4
scsi_mod              168056  8 iscsi_tcp,ib_iser,libiscsi,scsi_transport_iscsi,sr_mod,libata,usb_storage,sd_mod

You might refer to your InfiniBand manufacturer's instructions and/or the OpenIB instructions for a proper installation.

On Fri, 2008-04-11 at 14:58 -0400, Christopher Tanner wrote:
> All -
>
> How do I make sure that the pertinent IB modules are loading (i.e.
> rdma_ucm, ib_uverbs, etc)? I am getting the following error when I try
> to execute the OSU benchmarks:
>
> libibverbs: Fatal: couldn't read uverbs ABI version.
> Fatal error in MPI_Init: > Other MPI error, error stack: > MPIR_Init_thread(259)...........: Initialization failed > MPID_Init(102)..................: channel initialization failed > MPIDI_CH3_Init(178).............: > MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters > rdma_get_control_parameters(432): > rdma_open_hca(367)..............: No IB device found > rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective > abort of all ranks > exit status of rank 0: return code 1 > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner@gatech.edu > ------------------------------------------- > > > > On Apr 10, 2008, at 1:49 PM, wei huang wrote: > > Hi Chris, > > > > You have to make sure related kernel modules are loaded (including > > rdma_ucm, ib_uverbs, ib_mthca, etc). Thanks. > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > On Thu, 10 Apr 2008, Christopher Tanner wrote: > > > >> Ok Wei - > >> > >> Even though I've copied the libib* libraries from the master node to > >> all of the other nodes and included the /usr/local/lib directory in > >> the LD_LIBRARY_PATH, it seems that osu_latency still cannot find > >> libibverbs.so.1. I'm kind of stuck... Any thoughts? > >> > >> Also, whenever I try to execute osu_latency using just one core on > >> the > >> master node (mpiexec -n 1 ./osu_latency), I receive the following > >> error: > >> > >> libibverbs: Fatal: couldn't read uverbs ABI version. 
> >> Fatal error in MPI_Init: > >> Other MPI error, error stack: > >> MPIR_Init_thread(259)...........: Initialization failed > >> MPID_Init(102)..................: channel initialization failed > >> MPIDI_CH3_Init(178).............: > >> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters > >> rdma_get_control_parameters(432): > >> rdma_open_hca(367)..............: No IB device found > >> rank 0 in job 15 master.cl.ae.gatech.edu_42042 caused collective > >> abort of all ranks > >> exit status of rank 0: return code 1 > >> > >> Does this output help solve the other problem? > >> > >> ------------------------------------------- > >> Chris Tanner > >> Space Systems Design Lab > >> Georgia Institute of Technology > >> christopher.tanner@gatech.edu > >> ------------------------------------------- > >> > >> > >> > >> On Apr 10, 2008, at 11:53 AM, wei huang wrote: > >>> > >>> Do you see the same error? > >>> > >>> Try: > >>> export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH > >>> > >>> Regards, > >>> Wei Huang > >>> > >>> 774 Dreese Lab, 2015 Neil Ave, > >>> Dept. of Computer Science and Engineering > >>> Ohio State University > >>> OH 43210 > >>> Tel: (614)292-8501 > >>> > >>> > >>> On Thu, 10 Apr 2008, Christopher Tanner wrote: > >>> > >>>> Thanks Wei. Of course, the problem isn't solved yet... > >>>> > >>>> So I found the file in the /usr/local/lib64 directory on the master > >>>> node only. I copied the file to the rest of the nodes to the /usr/ > >>>> local/lib64 directory and included the directory in my path. When I > >>>> tried to execute the osu_latency program, it gave me the same > >>>> error. A > >>>> 'which librdmacm.so.1' command reveals that it can indeed find the > >>>> library. > >>>> > >>>> Any clues? Or perhaps, any other ways to determine if the > >>>> Infiniband > >>>> is working? 
> >>>> > >>>> ------------------------------------------- > >>>> Chris Tanner > >>>> Space Systems Design Lab > >>>> Georgia Institute of Technology > >>>> christopher.tanner@gatech.edu > >>>> ------------------------------------------- > >>>> > >>>> > >>>> > >>>> On Apr 10, 2008, at 11:18 AM, wei huang wrote: > >>>>> Hi Chris, > >>>>> > >>>>> It seems that some ib libraries are not in your default path. You > >>>>> may need > >>>>> to explicitly export the path to ib library in your environmental > >>>>> variables (bash profile or similar places). To find where those > >>>>> libraries > >>>>> are, you may try to see /etc/infiniband/info file. Or you can ask > >>>>> your > >>>>> system administrator about the path. > >>>>> > >>>>> Thanks. > >>>>> > >>>>> Regards, > >>>>> Wei Huang > >>>>> > >>>>> 774 Dreese Lab, 2015 Neil Ave, > >>>>> Dept. of Computer Science and Engineering > >>>>> Ohio State University > >>>>> OH 43210 > >>>>> Tel: (614)292-8501 > >>>>> > >>>>> > >>>>> On Thu, 10 Apr 2008, Dhabaleswar Panda wrote: > >>>>> > >>>>>> ---------- Forwarded message ---------- > >>>>>> Date: Wed, 9 Apr 2008 20:01:00 -0400 > >>>>>> From: Christopher Tanner > >>>>>> To: mvapich-discuss@cse.ohio-state.edu > >>>>>> Subject: [mvapich-discuss] Running latency tests > >>>>>> > >>>>>> All - > >>>>>> > >>>>>> I believe I am gravy with the mvapich2 install so now I'm > >>>>>> trying to > >>>>>> run the latency tests to see if it's really working. But, I'm a > >>>>>> dummy > >>>>>> and can't get it to work. Here's what I've done so far: > >>>>>> > >>>>>> a) Initiated a mpd ring with 16 hosts (i.e. mpdboot --rsh=rsh - > >>>>>> n 16 > >>>>>> -1). I have multiple processors, each with multiple cores on each > >>>>>> node, thus the '-1'. 
> >>>>>> b) Compiled osu_latency.c using mpicc (to an executable called > >>>>>> osu_latency) > >>>>>> b) Tried to execute the compile file via 'mpiexec -machinefile > >>>>>> machine.list -n 16 ./osu_latency' > >>>>>> > >>>>>> I receive the following error (16 times naturally) :: > >>>>>> ./osu_latency: error while loading shared libraries: > >>>>>> librdmacm.so. > >>>>>> 1: > >>>>>> cannot open shared object file: No such file or directory > >>>>>> > >>>>>> I don't know where this file would be -- it's not in the /usr/lib > >>>>>> with > >>>>>> all of the other *.so.* files. > >>>>>> Any thoughts? Thanks. > >>>>>> > >>>>>> ------------------------------------------- > >>>>>> Chris Tanner > >>>>>> Space Systems Design Lab > >>>>>> Georgia Institute of Technology > >>>>>> christopher.tanner@gatech.edu > >>>>>> ------------------------------------------- > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Apr 9, 2008, at 2:17 PM, Matthew Koop wrote: > >>>>>>> Hi Fred, > >>>>>>> > >>>>>>> If InfiniBand is not working then the job will not run. There is > >>>>>>> currently > >>>>>>> no method by which it will fall back to TCP/IP. > >>>>>>> > >>>>>>> Does this answer your question? > >>>>>>> > >>>>>>> Matt > >>>>>>> > >>>>>>> On Wed, 9 Apr 2008, Stecher, Fred wrote: > >>>>>>> > >>>>>>>> Hi, > >>>>>>>> When I installed MVAPICH, I used the default. If Infiniband is > >>>>>>>> not > >>>>>>>> working will my executable still run? 
> >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> > >>>>>>>> Fred > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> mvapich-discuss mailing list > >>>>>>> mvapich-discuss@cse.ohio-state.edu > >>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>>>>> > >>>>>> _______________________________________________ > >>>>>> mvapich-discuss mailing list > >>>>>> mvapich-discuss@cse.ohio-state.edu > >>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>>>>> > >>>>> > >>>>> > >>>> > >>> > >>> > >>> _______________________________________________ > >>> mvapich-discuss mailing list > >>> mvapich-discuss@cse.ohio-state.edu > >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From tgetachew at hotmail.com Sun Apr 13 01:01:25 2008 From: tgetachew at hotmail.com (tek mobster) Date: Sun Apr 13 11:47:13 2008 Subject: [mvapich-discuss] Multi-rail mvapich issue with more than 10 systems Message-ID: Hello, I have a 64 node cluster that I am trying to run linpak on. Ecah node has 16 cores and 4 HCAs. After building multirail mvapich, I can sucessfuly run linpak on 10 nodes (160 cores). However, when I try to run any more than that, I get the following errors. Note that node cn35 and cn36 are my 11th and 12th nodes in this case. I used mpirun_rsh -np 192 -hostfile hostfile ./xhpl as the command. If I delete say 2 other nodes and use cn35 and cn36 as two of my 10 nodes and run with np 160, it completes just fine. I did set ulimit -l to be unlimited and each node has MaxStartups set to 32 in /etc/ssh/sshd_config. Any help would be greatly appreciated. 
[176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c
[175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c
[172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c
[173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c
[170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c
[171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c
[174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c
[178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c
[181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c
[180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c
[189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c
[183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c
[187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c
[179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c
[186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c
[185] Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c
[188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c
[177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c
[190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c
[cn36:179] Got completion with error code 12 at line 1277 in file viacheck.c
Timeout alarm signaled
Cleaning up all
processes ...done. Thanks Tatek From christopher.tanner at gatech.edu Sun Apr 13 16:02:10 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Sun Apr 13 16:02:21 2008 Subject: [mvapich-discuss] No IB device found Message-ID: <9AF09B57-CB4B-4F3C-A372-D89409D58E8A@gatech.edu> All - I have installed OFED and mvapich2 on the master and all 16 nodes. I have rebooted the master, followed by rebooting all of the nodes. After starting up a mpd ring, I try to execute the OSU benchmarks to receive: libibverbs: Fatal: couldn't read uverbs ABI version. Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(259)...........: Initialization failed MPID_Init(102)..................: channel initialization failed MPIDI_CH3_Init(178).............: MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters rdma_get_control_parameters(432): rdma_open_hca(367)..............: No IB device found rank 3 in job 1 master.cl.ae.gatech.edu_41302 caused collective abort of all ranks exit status of rank 3: return code 1 I've posted this error before, however I simply received the solution to "make sure you are loading your Infiniband kernel modules" or something similar. Aside from performing the install and rebooting the machines - how do I initialize the Infiniband stuff? I'm new to Linux so I'm pretty clueless as to the commands / procedures to do this... Thanks for your help guys.
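The "make sure the kernel modules are loaded" advice quoted above can be checked on each node with a short script. What follows is only a minimal Python sketch, not part of OFED or MVAPICH2. The helper name missing_modules is hypothetical, and the module list is an assumption taken from the modules named earlier in this thread (ib_mthca applies only to mthca-based HCAs, so other cards need a different hardware driver). It also assumes that libibverbs reads the ABI version from /sys/class/infiniband_verbs/abi_version, a file that should only exist once ib_uverbs is loaded.

```python
import os

# Modules named in this thread; ib_mthca is an assumption that only fits
# mthca-based HCAs. Adjust the tuple for other hardware drivers.
REQUIRED = ("ib_uverbs", "rdma_ucm", "ib_mthca")

# Assumed location of the file libibverbs reads for the uverbs ABI version.
ABI_FILE = "/sys/class/infiniband_verbs/abi_version"


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required modules that are absent from /proc/modules text."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]


def main():
    # /proc/modules only exists on Linux; skip quietly elsewhere.
    if not os.path.exists("/proc/modules"):
        print("no /proc/modules on this system")
        return
    with open("/proc/modules") as f:
        missing = missing_modules(f.read())
    print("missing modules:", ", ".join(missing) or "none")
    print("uverbs ABI file present:", os.path.exists(ABI_FILE))


if __name__ == "__main__":
    main()
```

If the script reports missing modules, loading them as root with modprobe, or starting the OFED service script if this install provides one (often /etc/init.d/openibd, which is an assumption about this setup), would be the usual next step before rerunning the benchmark.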
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner@gatech.edu ------------------------------------------- From koop at cse.ohio-state.edu Sun Apr 13 23:41:42 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Sun Apr 13 23:41:50 2008 Subject: [mvapich-discuss] No IB device found In-Reply-To: <9AF09B57-CB4B-4F3C-A372-D89409D58E8A@gatech.edu> Message-ID: Chris, Does the master node also have an IB card on it or not? Note that by default the node that you start the mpd ring from (probably the master) is also included in the ring, even if it doesn't have an IB card. You can either specifically configure it to not put any processes on the first node, use a machine file, or just start the mpd ring from a compute node. To use a machine file you'd start the ring normally and then create a host file: e.g. in 'h': m1 m1 m2 m2 mpiexec -machinefile ./h ./exec Otherwise, on each node run the 'ibstat' command and verify that it shows an HCA is installed. Let us know if these suggestions help, Matt On Sun, 13 Apr 2008, Christopher Tanner wrote: > All - > > I have installed OFED and mvapich2 on the master and all 16 nodes. I > have rebooted the master, followed by rebooting all of the nodes. > After starting up a mpd ring, I try to execute the OSU benchmarks to > receive: > > libibverbs: Fatal: couldn't read uverbs ABI version.
> Fatal error in MPI_Init: > Other MPI error, error stack: > MPIR_Init_thread(259)...........: Initialization failed > MPID_Init(102)..................: channel initialization failed > MPIDI_CH3_Init(178).............: > MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters > rdma_get_control_parameters(432): > rdma_open_hca(367)..............: No IB device found > rank 3 in job 1 master.cl.ae.gatech.edu_41302 caused collective > abort of all ranks > exit status of rank 3: return code 1 > > I've posted this error before, however I simply received the solution > to "make sure you are loading your Infiniband kernel modules" or > something similar. Aside from performing the install and rebooting the > machines - how do I initialize the Infiniband stuff? I'm new to Linux > so I'm pretty clueless as to the commands / procedures to do this... > > Thanks for your help guys. > > ------------------------------------------- > Chris Tanner > Space Systems Design Lab > Georgia Institute of Technology > christopher.tanner@gatech.edu > ------------------------------------------- > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Sun Apr 13 23:47:26 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Sun Apr 13 23:47:35 2008 Subject: [mvapich-discuss] Multi-rail mvapich issue with more than 10 systems In-Reply-To: Message-ID: Tatek, If you run 192 cores on 12 nodes and skip over cn35 and cn36 do you still seen any issue? Also, is this a new cluster? Code 12 errors can occur when there are loose cables or bad internal switch links. Also, I'd encourage you to use MVAPICH2 since it has additional multirail features that are not present in MVAPICH. Thanks, Matt On Sun, 13 Apr 2008, tek mobster wrote: > > Hello, > I have a 64 node cluster that I am trying to run linpak on. 
Ecah > node has 16 cores and 4 HCAs. After building multirail mvapich, I can > sucessfuly run linpak on 10 nodes (160 cores). However, when I try to > run any more than that, I get the following errors. Note that node > cn35 and cn36 are my 11th and 12th nodes in this case. I used > mpirun_rsh -np 192 -hostfile hostfile ./xhpl as the command. If I > delete say 2 other nodes and use cn35 and cn36 as two of my 10 nodes > and run with np 160, it completes just fine. I did set ulimit -l to > be unlimited and each node has MaxStartups set to 32 in > /etc/ssh/sshd_config. Any help would be greatly appreciated. > > > [176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c[175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c[172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c[173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c[170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c[171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c[174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c[178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c[181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c[180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c[189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c[183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c[187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c[179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c[186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c[185] 
Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c[188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c[177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c[190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:179] Got completion with error code 12 at line 1277 in file viacheck.cTimeout alarm signaledCleaning up all processes ...done. > > Thanks > Tatek > _________________________________________________________________ > More immediate than e-mail? Get instant access with Windows Live Messenger. > http://www.windowslive.com/messenger/overview.html?ocid=TXT_TAGLM_WL_Refresh_instantaccess_042008 From tgetachew at hotmail.com Mon Apr 14 04:10:38 2008 From: tgetachew at hotmail.com (tek mobster) Date: Mon Apr 14 04:10:49 2008 Subject: [mvapich-discuss] Multi-rail mvapich issue with more than 10 systems In-Reply-To: References: Message-ID: Hello Matt, Thanks for the quick reply. Yes, I have tried different sets of 12 nodes and same error occurs. It always happens on the last 2 nodes in my hostfile. If I move cn35 and cn36 2 nodes up and have the last 2 nodes be cn33 and cn34 the exact error occurs now on cn33 and cn34. I have also verified the IB fabric already and was able to run linpak using openmpi on all 64 nodes. Also, using mvapich, I can run 10 nodes at a time without any problem (including the nodes that give the errors). So, I think the issue is something else. I will try mvapich2 and see what I get but I think the issue seems to be a limit of some sort on how many nodes I can run. 
Thanks Tatek> Date: Sun, 13 Apr 2008 23:47:26 -0400> From: koop@cse.ohio-state.edu> To: tgetachew@hotmail.com> CC: mvapich-discuss@cse.ohio-state.edu> Subject: Re: [mvapich-discuss] Multi-rail mvapich issue with more than 10 systems> > Tatek,> > If you run 192 cores on 12 nodes and skip over cn35 and cn36 do you still> seen any issue? Also, is this a new cluster? Code 12 errors can occur when> there are loose cables or bad internal switch links.> > Also, I'd encourage you to use MVAPICH2 since it has additional multirail> features that are not present in MVAPICH.> > Thanks,> > Matt> > On Sun, 13 Apr 2008, tek mobster wrote:> > >> > Hello,> > I have a 64 node cluster that I am trying to run linpak on. Ecah> > node has 16 cores and 4 HCAs. After building multirail mvapich, I can> > sucessfuly run linpak on 10 nodes (160 cores). However, when I try to> > run any more than that, I get the following errors. Note that node> > cn35 and cn36 are my 11th and 12th nodes in this case. I used> > mpirun_rsh -np 192 -hostfile hostfile ./xhpl as the command. If I> > delete say 2 other nodes and use cn35 and cn36 as two of my 10 nodes> > and run with np 160, it completes just fine. I did set ulimit -l to> > be unlimited and each node has MaxStartups set to 32 in> > /etc/ssh/sshd_config. 
Any help would be greatly appreciated.> >> >> > [176] Abort: [cn35:176] Got completion with error code 12 at line 1277 in file viacheck.c[175] Abort: [cn35:175] Got completion with error code 12 at line 1277 in file viacheck.c[172] Abort: [cn35:172] Got completion with error code 12 at line 1277 in file viacheck.c[173] Abort: [cn35:173] Got completion with error code 12 at line 1277 in file viacheck.c[170] Abort: [cn35:170] Got completion with error code 12 at line 1277 in file viacheck.c[171] Abort: [cn35:171] Got completion with error code 12 at line 1277 in file viacheck.c[174] Abort: [cn35:174] Got completion with error code 12 at line 1277 in file viacheck.c[178] Abort: [cn36:178] Got completion with error code 12 at line 1277 in file viacheck.c[181] Abort: [cn36:181] Got completion with error code 12 at line 1277 in file viacheck.c[180] Abort: [cn36:180] Got completion with error code 12 at line 1277 in file viacheck.c[189] Abort: [cn36:189] Got completion with error code 12 at line 1277 in file viacheck.c[183] Abort: [cn36:183] Got completion with error code 12 at line 1277 in file viacheck.c[187] Abort: [191] Abort: [cn36:191] Got completion with error code 12 at line 1277 in file viacheck.c[179] Abort: [182] Abort: [184] Abort: [cn36:184] Got completion with error code 12 at line 1277 in file viacheck.c[186] Abort: [cn36:186] Got completion with error code 12 at line 1277 in file viacheck.c[185] Abort: [cn36:185] Got completion with error code 12 at line 1277 in file viacheck.c[188] Abort: [cn36:188] Got completion with error code 12 at line 1277 in file viacheck.c[177] Abort: [cn36:177] Got completion with error code 12 at line 1277 in file viacheck.c[190] Abort: [cn36:190] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:187] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:182] Got completion with error code 12 at line 1277 in file viacheck.c[cn36:179] Got completion with error code 12 at line 1277 in file 
viacheck.cTimeout alarm signaledCleaning up all processes ...done.> >> > Thanks> > Tatek From christopher.tanner at gatech.edu Mon Apr 14 08:56:54 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Mon Apr 14 08:57:08 2008 Subject: [mvapich-discuss] No IB device found In-Reply-To: References: Message-ID: Matt - All good suggestions. I nominally use a machinefile that excludes the master node from executing any processes unless the rest of the cluster is busy. However, the master node IS in the mpd ring. If I need to remove it from the mpd ring altogether... let me know. Also, I don't have a command 'ibstat' located anywhere in the /usr/ local/bin, /usr/bin, or /etc/infiniband directories. The prefix during OFED install was /usr. I didn't install the 'ibutils' package b/c there was an error and it would not install - is the ibstat command in that package? ------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner@gatech.edu ------------------------------------------- On Apr 13, 2008, at 11:41 PM, Matthew Koop wrote: > Chris, > > Does the master node also have an IB card on it or not?
Note that by > default the node that you start the mpd ring from (probably the > master) is > also included in the ring, even if it doesn't have an IB card. You can > either specifically configure it to not put any processes on the first > node, use a machine file, or just start the mpd ring from a compute > node. > > To use a machine file you'd start the ring normally and then create > a host > file: > > e.g. in 'h': > m1 > m1 > m2 > m2 > > mpiexec -machinefile ./h ./exec > > Otherwise, on each node run the 'ibstat' command and verify that it > shows > an HCA is installed. > > Let us know if these suggestions help, > > Matt > > On Sun, 13 Apr 2008, Christopher Tanner wrote: > >> All - >> >> I have installed OFED and mvapich2 on the master and all 16 nodes. I >> have rebooted the master, followed by rebooting all of the nodes. >> After starting up a mpd ring, I try to execute the OSU benchmarks to >> receive: >> >> libibverbs: Fatal: couldn't read uverbs ABI version. >> Fatal error in MPI_Init: >> Other MPI error, error stack: >> MPIR_Init_thread(259)...........: Initialization failed >> MPID_Init(102)..................: channel initialization failed >> MPIDI_CH3_Init(178).............: >> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters >> rdma_get_control_parameters(432): >> rdma_open_hca(367)..............: No IB device found >> rank 3 in job 1 master.cl.ae.gatech.edu_41302 caused collective >> abort of all ranks >> exit status of rank 3: return code 1 >> >> I've posted this error before, however I simply received the solution >> to "make sure you are loading your Infiniband kernel modules" or >> something similar. Aside from performing the install and rebooting >> the >> machines - how do I initialize the Infiniband stuff? I'm new to Linux >> so I'm pretty clueless as to the commands / procedures to do this... >> >> Thanks for your help guys. 
>>
>> -------------------------------------------
>> Chris Tanner
>> Space Systems Design Lab
>> Georgia Institute of Technology
>> christopher.tanner@gatech.edu
>> -------------------------------------------
>>
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss@cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>

From perkinjo at cse.ohio-state.edu Mon Apr 14 11:15:45 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Apr 14 11:16:04 2008
Subject: [mvapich-discuss] Install MVAPICH 1
In-Reply-To:
References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu>
Message-ID: <20080414151544.GA2836@cse.ohio-state.edu>

On Fri, Apr 11, 2008 at 03:46:04PM -0500, Stecher, Fred wrote:
> Jonathan,
> When I tried to rebuild my application the following error message was
> output:
>
> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf.o): In function `pmpi_dup_fn_':
> dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE'
> dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE'
> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initializef.o): In function `pmpi_initialized_':
> initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE'
> initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE'
> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_copyfnf.o): In function `pmpi_null_copy_fn_':
> null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE'

This looks like it may have something to do with the Fortran bindings. Can you forward us those other log files?
> > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Friday, April 11, 2008 7:37 AM > To: Stecher, Fred > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > After you reinstalled MVAPICH, did you also rebuild your MPI application > before running? It's possible that you were still using the old library > when you restarted. > > In order to debug the compiler issue I'd like to see the other log files > as well. Specifically the config.log and the config-mine.log. > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > > Jonathan, > > I performed the ./make.mvapich.gen2 command and output to a make.log > > file. In the make.log file there was a Warning message. Also, the pgcc > > > compiler was not used. No Fortran compiler was used either. I have > > attached the make.log file. I then restarted my run. Monitoring the > > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > > indicated some traffic. I do not think that InfiniBand is being used. > > > > > > Thanks, > > > > Fred > > > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 12:06 PM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > > So, > > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > > Errors out? > > > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > > command line (without the quotes of course). Before doing this, be > > sure to export any variables that you may need to override in > > make.mvapich.gen2. > > > > For more information please see > > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > > > Section 4 should answer most of your questions. 
> > > > > > > > Thanks, > > > > > > Fred > > > > > > -----Original Message----- > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > Sent: Thursday, April 10, 2008 11:50 AM > > > To: Stecher, Fred > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > > Hi, > > > > This is a follow-up to previous question concerning whether > > > > MVAPICH > > > > 1 is using InfiniBand or Ethernet. Upon monitoring network > > > > traffic, my executable is definitely using Ethernet. > > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > > mvapich-1.0 directory. We have included a single script for > > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > > platforms, compilers and architectures. By default, the > > > > compilation script uses gcc. In order to select your compiler, > > > > please set the variable CC in the script to use either Intel, > > > > PathScale or PGI compiler. The platform/architecture is detected > > > > automatically." I tried make -f make.mvapich.gen2 with following > > > > error > > > > > > You should use ./make.mvapich.gen2 > > > > > > > message: > > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > > I then just typed make. This resulted in installation of some > > > > version of MVAPICH. I am not sure what version. Do anyone know > > > > what version was installed or how to determine the version? > > > > > > > > > > By using make directly you almost certainly have made the TCP > version. 
> > > > > > > Thanks, > > > > > > > > Fred > > > > > > > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > -- > > > Jonathan Perkins > > > http://www.cse.ohio-state.edu/~perkinjo > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > > > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From panda at cse.ohio-state.edu Tue Apr 15 08:52:47 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Tue Apr 15 08:52:57 2008 Subject: [mvapich-discuss] [mvapich-commit] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband... (fwd) Message-ID: Please note that mvapich-commit list is not the list for discussion. You need to use mvapich-discuss list for this. I am forwarding your note to mvapich-discuss. Regarding your question, dynamic process management over native IB is not available with MVAPICH2 yet. We are working on it and it will be available in future releases. You can try the TCP/IP interface of MVAPICH2 (which is equivalent to MPICH2). DK ---------- Forwarded message ---------- Date: Tue, 15 Apr 2008 17:00:26 +1000 From: leon zadorin To: mvapich-commit@cse.ohio-state.edu Subject: [mvapich-commit] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband... Hello everyone, I am relatively new to the whole MPI/Infiniband scene, so my apologies if some of the questions/thoughts of mine are naive... I am currently experiencing difficulties with dynamic process connections (MPI_Comm_join) between 2 hosts (each with Infiniband and Ethernet card). The setup is: 2 hosts, each with Ethernet (Gigabit) card and with Infiniband card (PCI-e), running Linux 32 bits. 
Hosts are connected via Infiniband switch (w.r.t Infiniband cards) and via Ethernet/IP network (w.r.t. Ethernet cards). mvapich2 has been made with "make.make.mvapich2.ofa" mpdboot has been executed and mpd daemons are running on both hosts I would like to know if it is currently possible to achieve the following: 1) start 1 app on 1 host (without using mpirun); 2) then later, after some time, start another app on 2nd host (without using mpirun); 3) make the app in step 2 automatically connect to the app started in step 1 I was able to achieve the above when running with mpich2 library, using sock channels and only when using 'MPI_Comm_join' call (using MPI_Publish_name, etc. did not work when starting apps without mpirun [even with all mpds being active]). However, the MPI_Comm_join tactic fails when attempting to use mvapich2 (mvapich2-1.0-2008-04-10) over Infiniband... I wonder if the following has something to do with it: http://lists.openfabrics.org/pipermail/commits/2006-January/004707.html " -------------------------------------------------------------------------------- - Known Deficiencies -------------------------------------------------------------------------------- - -- The sock channel is the only channel that implements dynamic process support - (i.e., MPI_COMM_SPAWN, MPI_COMM_CONNECT, MPI_COMM_ACCEPT, etc.). All other - channels will experience failures for tests exercising dynamic process - functionality. " and in http://lists.openfabrics.org/pipermail/commits/2006-May/007209.html we have: " -- MPI_COMM_JOIN has been implemented; although like the other dynamic process - routines, it is only supported by the Sock channel. " Given that above quotes mentioned both the MPI_Comm_join and MPI_Comm_connect ... is there any way at all to currently achieve the above 3 steps when using Infiniband cards (and may be having Ethernet cards on all of the hosts as well)? 
I would imagine that, albeit theoretically, it is plausible to use the sock channel to 'bootstrap' the Infiniband channel?

http://www.mpi-forum.org/docs/mpi-20-html/node115.htm
" MPI uses the socket to bootstrap creation of the intercommunicator, and for nothing else. "

Perhaps I need to build mvapich2 not via "make.mvapich2.ofa" but something else so that both socket and infiniband channels are supported?

Of course the same aforementioned link (http://www.mpi-forum.org/docs/mpi-20-html/node115.htm) says:
" Advice to users. An MPI implementation may require a specific communication medium for MPI communication, such as a shared memory segment or a special switch. In this case, it may not be possible for two processes to successfully join even if there is a socket connecting them and they are using the same MPI implementation. ( End of advice to users.) "

If this is the case here and there is no way to use MPI_Comm_join to achieve the originally described 3 steps (connecting apps started at different times and without the use of mpirun) - is that then at all possible (e.g. using MPI's open port, publish name, lookup name, accept/connect calls)?

Are the limitations purely theoretical or more of a practical nature?

Ideally, for async. server design purposes and, given that MPI_Comm_accept is blocking and there is no 'test'/'poll' for it, it would be good to be able to use the sockets channel to coordinate infiniband channel bootstrapping with MPI_Comm_join (even if MPI_Comm_join in itself is blocking, at least one can 'poll' the TCP socket's fd before calling 'accept' and subsequently MPI_Comm_join)...

If mvapich2 is unable to provide dynamic process connectivity over Infiniband... are there any other libs that could do that?

Kind regards
Leon.
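A minimal sketch of the polling idea described in the message above, written in Python for brevity: wait on the listening TCP socket with a timeout, and call accept only once a peer is actually waiting. The helper names make_listener and accept_if_ready are hypothetical. In a C program, the descriptor of the accepted connection is what would be handed to MPI_Comm_join(fd, &intercomm); that step is not shown here, and whether the join then succeeds depends on the MPI channel in use.

```python
import select
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket (port 0 lets the OS pick a free port)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def accept_if_ready(listener, timeout_s):
    """Poll the listener; return (conn, addr) if a peer is waiting, else None.

    The server stays free to do other work between polls, because accept()
    is only called when select() reports the listening socket as readable.
    """
    readable, _, _ = select.select([listener], [], [], timeout_s)
    if not readable:
        return None
    return listener.accept()
```

A server loop would call accept_if_ready(listener, 0.1) between other work items and, once it returns a connection, pass conn.fileno() on to the blocking join step.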
_______________________________________________
mvapich-commit mailing list
mvapich-commit@cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-commit

From christopher.tanner at gatech.edu Tue Apr 15 19:20:59 2008
From: christopher.tanner at gatech.edu (Christopher Tanner)
Date: Tue Apr 15 19:21:12 2008
Subject: [mvapich-discuss] IB is not loading
Message-ID: <349E9D29-897F-498B-A990-4F498663B415@gatech.edu>

Sorry to keep hassling everyone, but I have received several potential solutions to my problem, but none have worked (or I'm a little too novice to understand what to do). Thanks for all your help though. Here's another try...

I'm pretty sure the IB drivers are not loading and I don't know how to load them. Here's the error I get when trying to execute the osu_latency benchmark in mvapich2:

libibverbs: Fatal: couldn't read uverbs ABI version.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)...........: Initialization failed
MPID_Init(102)..................: channel initialization failed
MPIDI_CH3_Init(178).............:
MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters
rdma_get_control_parameters(432):
rdma_open_hca(367)..............: No IB device found
rank 3 in job 1 master.cl.ae.gatech.edu_41302 caused collective abort of all ranks
exit status of rank 3: return code 1

Matt suggested running 'ibstat', which doesn't exist on my machine. I'm executing the script on four separate nodes via a machinefile (not the master), all of which have OFED and mvapich2 installed.

So... I'm essentially looking for a way to load the drivers. Rebooting the master and each node post install didn't work. Anyone have any thoughts?

Thanks!
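One way to check whether the kernel side is present at all, without ibstat, is to look in sysfs. The sketch below rests on two assumptions: that the ib_uverbs module exposes class/infiniband_verbs/abi_version under /sys (which would match the "couldn't read uverbs ABI version" message), and that each HCA shows up as a directory under class/infiniband. The function name check_ib_sysfs is hypothetical, and the sysfs root is a parameter only so the check can be tried against a scratch directory.

```python
import os


def check_ib_sysfs(sysfs_root="/sys"):
    """Report whether the uverbs ABI file exists and which IB devices are listed.

    Assumed layout (not taken from the thread):
      <root>/class/infiniband_verbs/abi_version  -> created by ib_uverbs
      <root>/class/infiniband/<device>           -> one entry per HCA
    """
    abi_file = os.path.join(sysfs_root, "class", "infiniband_verbs", "abi_version")
    dev_dir = os.path.join(sysfs_root, "class", "infiniband")
    devices = sorted(os.listdir(dev_dir)) if os.path.isdir(dev_dir) else []
    return {"uverbs_abi_present": os.path.isfile(abi_file), "devices": devices}


if __name__ == "__main__":
    report = check_ib_sysfs()
    print("uverbs ABI file present:", report["uverbs_abi_present"])
    print("IB devices:", ", ".join(report["devices"]) or "none")
```

If the ABI file is missing and no devices are listed on a compute node, that points at the kernel modules not being loaded rather than at the MPI library.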
------------------------------------------- Chris Tanner Space Systems Design Lab Georgia Institute of Technology christopher.tanner@gatech.edu ------------------------------------------- From leonleon77 at gmail.com Tue Apr 15 20:06:19 2008 From: leonleon77 at gmail.com (leon zadorin) Date: Tue Apr 15 20:06:43 2008 Subject: [mvapich-discuss] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband... Message-ID: <26d2cb010804151706p3973e774ia8deb83e9d0f3ea3@mail.gmail.com> Hello everyone, It had occurred to me that my original post of my questions (to mvapich-commit) would be better if emailed to this list instead... so here it is... I am relatively new to the whole MPI/Infiniband scene, so my apologies if some of the questions/thoughts of mine are naive... I am currently experiencing difficulties with dynamic process connections (MPI_Comm_join) between 2 hosts (each with Infiniband and Ethernet card). The setup is: 2 hosts, each with Ethernet (Gigabit) card and with Infiniband card (PCI-e), running Linux 32 bits (AMD arch). Hosts are connected via Infiniband switch (w.r.t Infiniband cards) and via Ethernet/IP network (w.r.t. Ethernet cards). mvapich2 has been made with "make.mvapich2.ofa" mpdboot has been executed and mpd daemons are running on both hosts I would like to know if it is currently possible to achieve the following: 1) start 1 app on 1 host (without using mpirun); 2) then later, after some time, start another app on 2nd host (without using mpirun); 3) make the app in step 2 automatically connect to the app started in step 1 I was able to achieve the above when running with mpich2 library, using sock channels and only when using 'MPI_Comm_join' call (using MPI_Publish_name, etc. did not work when starting apps without mpirun [even with all mpds being active]). However, the MPI_Comm_join tactic fails when attempting to use mvapich2 (mvapich2-1.0-2008-04-10) over Infiniband... 
I wonder if the following has something to do with it: http://lists.openfabrics.org/pipermail/commits/2006-January/004707.html " -------------------------------------------------------------------------------- - Known Deficiencies -------------------------------------------------------------------------------- - -- The sock channel is the only channel that implements dynamic process support - (i.e., MPI_COMM_SPAWN, MPI_COMM_CONNECT, MPI_COMM_ACCEPT, etc.). All other - channels will experience failures for tests exercising dynamic process - functionality. " and in http://lists.openfabrics.org/pipermail/commits/2006-May/007209.html we have: " -- MPI_COMM_JOIN has been implemented; although like the other dynamic process - routines, it is only supported by the Sock channel. " Given that above quotes mentioned both the MPI_Comm_join and MPI_Comm_connect ... is there any way at all to currently achieve the above 3 steps when using Infiniband cards (and may be having Ethernet cards on all of the hosts as well)? I would imagine that, albeit theoretically, it is plausible to use sock channel to 'bootstrap' the Infiniband channel? http://www.mpi-forum.org/docs/mpi-20-html/node115.htm " MPI uses the socket to bootstrap creation of the intercommunicator, and for nothing else. " Perhaps I need to build mvapich2 not via "make.make.mvapich2.ofa" but something else so that both: socket and infiniband channels are supported? Of course the same aforementioned link (http://www.mpi-forum.org/docs/mpi-20-html/node115.htm) says: " Advice to users. An MPI implementation may require a specific communication medium for MPI communication, such as a shared memory segment or a special switch. In this case, it may not be possible for two processes to successfully join even if there is a socket connecting them and they are using the same MPI implementation. ( End of advice to users.) 
" If this is the case here and there is no way to use MPI_Comm_join to achieve the originally described 3 steps (connecting apps started at different times and without the use of mpirun) - is that then at all possible (e.g. using MPI's open port, publish name, lookup name, accept/connect calls)? Are the limitations purely theoretical or more of a practical nature? Ideally, for async. server design purposes and, given that MPI_Comm_accept is blocking and there is no 'test'/'poll' for it, it would be good to be able to use sockets channel to coordinate infiniband channel bootstrapping with MPI_Comm_join (even if MPI_Comm_join in itself is blocking, at least one can 'poll' for the TCP's socket's fd before calling 'accept' and subsequently MPI_Comm_join)... If mvapich2 is unable to provide dynamic process connectivity over Infiniband... are there any other libs that could do that? Kind regards Leon. From christopher.tanner at gatech.edu Wed Apr 16 09:39:59 2008 From: christopher.tanner at gatech.edu (Christopher Tanner) Date: Wed Apr 16 09:40:12 2008 Subject: [mvapich-discuss] IB is not loading In-Reply-To: <4805673D.7020705@ucla.edu> References: <349E9D29-897F-498B-A990-4F498663B415@gatech.edu> <4805673D.7020705@ucla.edu> Message-ID: <365A15E3-589C-496B-A7B6-5EF7C5E49A85@gatech.edu> Scott - thanks for your help man. I'm still new to Linux, so detailed commands were great. I did use the automated installer, and I did just the basic OFED 1.3 install. However... a) 'ibstat' just doesn't exist. I've installed OFED three times now and each time 'ibstat' is not created or is not placed in an intuitive directory (/usr/bin, /usr/local/bin, etc.) b) I've confirmed that the modules are NOT loaded - the lsmod returned nothing (literally) c) chkconfig resulted in this : openibd 0:off 1:off 2:on 3:on 4:on 5:on 6:off Which I assume to mean the initscripts for run levels 3 and 5 are executing. So, I think we're gravy there. 
d) service command resulted in this :

Loading Mellanox HCA driver: [FAILED]
Loading Mellanox MLX4 HCA driver: [FAILED]
Loading cxgb3 driver: [FAILED]
Loading HCA driver and Access Layer: [FAILED]
Please open an issue in the http://openib.org/bugzilla and attach /tmp/ib_debug_info.log

Tek recommended burning new firmware onto each of the Infiniband cards, but that seems like an arduous process for a relatively new cluster. Is it this hard to get an Infiniband network running on every cluster or am I really missing something?

-------------------------------------------
Chris Tanner
Space Systems Design Lab
Georgia Institute of Technology
christopher.tanner@gatech.edu
-------------------------------------------

On Apr 15, 2008, at 10:41 PM, Scott A. Friedman wrote:

> Hi Chris,
>
> I have been watching your messages and thought I'd send you a note.
>
> If your drivers (kernel modules) are not loading you should try a
> few simple things - especially if you used the OFED installer.
>
> You should run 'ibstat' first. You normally need to be root to run
> this but you can also run it with the full path if you are not root.
>
> /usr/sbin/ibstat
>
> You should also confirm that the kernel modules are in fact loaded.
>
> /sbin/lsmod | grep ib
>
> You should see a bunch of ib_blah entries - like ib_uverbs etc.
>
> if none of these work then the modules are probably not loaded at
> all. In that case you should check (on a redhat/fedora/centos type
> system) as root.
>
> chkconfig --list openibd
>
> It should show that the initscript runs (on) for the run level you
> are using (typically 3 or 5). If it says it is off then that is your
> problem - and why the modules are not loaded upon startup.
>
> chkconfig openibd on
> service openibd start
>
> Then try your mpi again - no need to reboot.
>
> You will also need the subnet manager running - which is opensmd on
> at least one node.
/usr/sbin/sminfo will show you if it is running > someplace on your IB network - have to run this as root. If it > isn't... > > chkconfig opensmd on > service opensmd start > > do this on, say, your head node. > > You may also need to setup ipoib if you are using mvapich with the > newer connection management rdmacm setup (which I think the default > mvapich that comes with OFED does for the connection management to > work). > > Let me know how it goes, > Scott > > ---- > Scott A. Friedman, Ph.D > Computer Scientist > Research Computing Technologies Group > UCLA Academic Technology Services > 310-825-8607 > > > > Christopher Tanner wrote: >> Sorry to keep hassling everyone, but I have received several >> potential solutions to my problem, but none have worked (or I'm a >> little to novice to understand what to do). Thanks for all your >> help though. Here's another try... >> I'm pretty sure the IB drivers are not loading and I don't know how >> to load them. Here's the error I get when trying to execute the >> osu_latency benchmark in mvapich2: >> libibverbs: Fatal: couldn't read uverbs ABI version. >> Fatal error in MPI_Init: >> Other MPI error, error stack: >> MPIR_Init_thread(259)...........: Initialization failed >> MPID_Init(102)..................: channel initialization failed >> MPIDI_CH3_Init(178).............: >> MPIDI_CH3I_RMDA_init(115).......: rdma_get_control_parameters >> rdma_get_control_parameters(432): >> rdma_open_hca(367)..............: No IB device found >> rank 3 in job 1 master.cl.ae.gatech.edu_41302 caused collective >> abort of all ranks >> exit status of rank 3: return code 1 >> Matt suggested running 'ibstat', which doesn't exist on my machine. >> I'm executing the script on four separate nodes via a machinefile >> (not the master), all of which have OFED and mvapich2 installed. >> So... I'm essentially looking for a way to load the drivers. >> Rebooting the master and each node post install didn't work. Anyone >> have any thoughts? 
Thanks! >> ------------------------------------------- >> Chris Tanner >> Space Systems Design Lab >> Georgia Institute of Technology >> christopher.tanner@gatech.edu >> ------------------------------------------- >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From Fred.Stecher at atk.com Fri Apr 11 14:46:25 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 16 14:33:40 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: <20080411123632.GB2766@cse.ohio-state.edu> References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> Message-ID: Jonathan, No I did not rebuild the executable. If I rebuild it will the new executable be able to read the restart files? I just have to try it. Thanks, Fred -----Original Message----- From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] Sent: Friday, April 11, 2008 7:37 AM To: Stecher, Fred Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] Install MVAPICH 1 After you reinstalled MVAPICH, did you also rebuild your MPI application before running? It's possible that you were still using the old library when you restarted. In order to debug the compiler issue I'd like to see the other log files as well. Specifically the config.log and the config-mine.log. On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > Jonathan, > I performed the ./make.mvapich.gen2 command and output to a make.log > file. In the make.log file there was a Warning message. Also, the pgcc > compiler was not used. No Fortran compiler was used either. I have > attached the make.log file. I then restarted my run. Monitoring the > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > indicated some traffic. I do not think that InfiniBand is being used. 
> > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Thursday, April 10, 2008 12:06 PM > To: Stecher, Fred > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > So, > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > Errors out? > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > command line (without the quotes of course). Before doing this, be > sure to export any variables that you may need to override in > make.mvapich.gen2. > > For more information please see > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > Section 4 should answer most of your questions. > > > > > Thanks, > > > > Fred > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 11:50 AM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > Hi, > > > This is a follow-up to previous question concerning whether > > > MVAPICH > > > 1 is using InfiniBand or Ethernet. Upon monitoring network > > > traffic, my executable is definitely using Ethernet. > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > mvapich-1.0 directory. We have included a single script for > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > platforms, compilers and architectures. By default, the > > > compilation script uses gcc. In order to select your compiler, > > > please set the variable CC in the script to use either Intel, > > > PathScale or PGI compiler. The platform/architecture is detected > > > automatically." I tried make -f make.mvapich.gen2 with following > > > error > > > > You should use ./make.mvapich.gen2 > > > > > message: > > > make.mvapich.gen2:7: *** missing separator. Stop. 
> > > I then just typed make. This resulted in installation of some > > > version of MVAPICH. I am not sure what version. Do anyone know > > > what version was installed or how to determine the version? > > > > > > > By using make directly you almost certainly have made the TCP version. > > > > > Thanks, > > > > > > Fred > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: config-mine.log Type: application/octet-stream Size: 16918 bytes Desc: config-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080411/9c4af186/config-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: config.log
Type: application/octet-stream
Size: 11406 bytes
Desc: config.log
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080411/9c4af186/config-0001.obj

From Fred.Stecher at atk.com Fri Apr 11 16:46:04 2008
From: Fred.Stecher at atk.com (Stecher, Fred)
Date: Wed Apr 16 14:34:07 2008
Subject: [mvapich-discuss] Install MVAPICH 1
In-Reply-To: <20080411123632.GB2766@cse.ohio-state.edu>
References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu>
Message-ID:

Jonathan,
When I tried to rebuild my application the following error message was output:

/data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf.o): In function `pmpi_dup_fn_':
dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE'
dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE'
/data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initializef.o): In function `pmpi_initialized_':
initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE'
initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE'
/data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_copyfnf.o): In function `pmpi_null_copy_fn_':
null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE'

Thanks,

Fred

-----Original Message-----
From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu]
Sent: Friday, April 11, 2008 7:37 AM
To: Stecher, Fred
Cc: mvapich-discuss@cse.ohio-state.edu
Subject: Re: [mvapich-discuss] Install MVAPICH 1

After you reinstalled MVAPICH, did you also rebuild your MPI application before running? It's possible that you were still using the old library when you restarted.

In order to debug the compiler issue I'd like to see the other log files as well. Specifically the config.log and the config-mine.log.
On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > Jonathan, > I performed the ./make.mvapich.gen2 command and output to a make.log > file. In the make.log file there was a Warning message. Also, the pgcc > compiler was not used. No Fortran compiler was used either. I have > attached the make.log file. I then restarted my run. Monitoring the > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > indicated some traffic. I do not think that InfiniBand is being used. > > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Thursday, April 10, 2008 12:06 PM > To: Stecher, Fred > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > So, > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > Errors out? > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > command line (without the quotes of course). Before doing this, be > sure to export any variables that you may need to override in > make.mvapich.gen2. > > For more information please see > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > Section 4 should answer most of your questions. > > > > > Thanks, > > > > Fred > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 11:50 AM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > Hi, > > > This is a follow-up to previous question concerning whether > > > MVAPICH > > > 1 is using InfiniBand or Ethernet. Upon monitoring network > > > traffic, my executable is definitely using Ethernet. > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > mvapich-1.0 directory. 
We have included a single script for > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > platforms, compilers and architectures. By default, the > > > compilation script uses gcc. In order to select your compiler, > > > please set the variable CC in the script to use either Intel, > > > PathScale or PGI compiler. The platform/architecture is detected > > > automatically." I tried make -f make.mvapich.gen2 with following > > > error > > > > You should use ./make.mvapich.gen2 > > > > > message: > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > I then just typed make. This resulted in installation of some > > > version of MVAPICH. I am not sure what version. Do anyone know > > > what version was installed or how to determine the version? > > > > > > > By using make directly you almost certainly have made the TCP version. > > > > > Thanks, > > > > > > Fred > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From perkinjo at cse.ohio-state.edu Fri Apr 11 08:36:32 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Wed Apr 16 14:35:22 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> Message-ID: <20080411123632.GB2766@cse.ohio-state.edu> After you reinstalled MVAPICH, did you also rebuild your MPI application before running? It's possible that you were still using the old library when you restarted. In order to debug the compiler issue I'd like to see the other log files as well. Specifically the config.log and the config-mine.log. 
On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > Jonathan, > I performed the ./make.mvapich.gen2 command and output to a make.log > file. In the make.log file there was a Warning message. Also, the pgcc > compiler was not used. No Fortran compiler was used either. I have > attached the make.log file. I then restarted my run. Monitoring the > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > indicated some traffic. I do not think that InfiniBand is being used. > > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Thursday, April 10, 2008 12:06 PM > To: Stecher, Fred > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > So, > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > Errors out? > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > command line (without the quotes of course). Before doing this, be sure > to export any variables that you may need to override in > make.mvapich.gen2. > > For more information please see > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > Section 4 should answer most of your questions. > > > > > Thanks, > > > > Fred > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 11:50 AM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > Hi, > > > This is a follow-up to previous question concerning whether MVAPICH > > > 1 is using InfiniBand or Ethernet. Upon monitoring network traffic, > > > my executable is definitely using Ethernet. > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > mvapich-1.0 directory. 
We have included a single script for > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > platforms, compilers and architectures. By default, the compilation > > > script uses gcc. In order to select your compiler, please set the > > > variable CC in the script to use either Intel, PathScale or PGI > > > compiler. The platform/architecture is detected automatically." I > > > tried make -f make.mvapich.gen2 with following error > > > > You should use ./make.mvapich.gen2 > > > > > message: > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > I then just typed make. This resulted in installation of some > > > version of MVAPICH. I am not sure what version. Do anyone know what > > > version was installed or how to determine the version? > > > > > > > By using make directly you almost certainly have made the TCP version. > > > > > Thanks, > > > > > > Fred > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From Fred.Stecher at atk.com Mon Apr 14 11:29:32 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 16 14:36:41 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: <20080414151544.GA2836@cse.ohio-state.edu> References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> <20080414151544.GA2836@cse.ohio-state.edu> Message-ID: -----Original Message----- From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] Sent: Monday, April 14, 2008 10:16 AM To: Stecher, Fred Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] Install MVAPICH 1 On Fri, Apr 11, 2008 at 
03:46:04PM -0500, Stecher, Fred wrote: > Jonathan, > When I tried to rebuild my application the following error message was > output: > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf > .o > ): In function `pmpi_dup_fn_': > dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' > dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initial > iz > ef.o): In function `pmpi_initialized_': > initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' > initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_co > py > fnf.o): In function `pmpi_null_copy_fn_': > null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' This looks like it may have something to do with the Fortran bindings. Can you forward us those other log files? > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Friday, April 11, 2008 7:37 AM > To: Stecher, Fred > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > After you reinstalled MVAPICH, did you also rebuild your MPI > application before running? It's possible that you were still using > the old library when you restarted. > > In order to debug the compiler issue I'd like to see the other log > files as well. Specifically the config.log and the config-mine.log. > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > > Jonathan, > > I performed the ./make.mvapich.gen2 command and output to a make.log > > file. In the make.log file there was a Warning message. Also, the > > pgcc > > > compiler was not used. No Fortran compiler was used either. I have > > attached the make.log file. I then restarted my run. Monitoring the > > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > > indicated some traffic. 
I do not think that InfiniBand is being used. > > > > > > Thanks, > > > > Fred > > > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 12:06 PM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > > So, > > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > > Errors out? > > > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > > command line (without the quotes of course). Before doing this, be > > sure to export any variables that you may need to override in > > make.mvapich.gen2. > > > > For more information please see > > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > > > Section 4 should answer most of your questions. > > > > > > > > Thanks, > > > > > > Fred > > > > > > -----Original Message----- > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > Sent: Thursday, April 10, 2008 11:50 AM > > > To: Stecher, Fred > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > > Hi, > > > > This is a follow-up to previous question concerning whether > > > > MVAPICH > > > > 1 is using InfiniBand or Ethernet. Upon monitoring network > > > > traffic, my executable is definitely using Ethernet. > > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > > mvapich-1.0 directory. We have included a single script for > > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of > > > > different platforms, compilers and architectures. By default, > > > > the compilation script uses gcc. In order to select your > > > > compiler, please set the variable CC in the script to use either > > > > Intel, PathScale or PGI compiler. The platform/architecture is > > > > detected automatically." 
I tried make -f make.mvapich.gen2 with > > > > following error > > > > > > You should use ./make.mvapich.gen2 > > > > > > > message: > > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > > I then just typed make. This resulted in installation of some > > > > version of MVAPICH. I am not sure what version. Do anyone know > > > > what version was installed or how to determine the version? > > > > > > > > > > By using make directly you almost certainly have made the TCP > version. > > > > > > > Thanks, > > > > > > > > Fred > > > > > > > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > -- > > > Jonathan Perkins > > > http://www.cse.ohio-state.edu/~perkinjo > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > > > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: install-mine.log Type: application/octet-stream Size: 1392 bytes Desc: install-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/688dc384/install-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: config-mine.log Type: application/octet-stream Size: 16918 bytes Desc: config-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/688dc384/config-mine-0001.obj From perkinjo at cse.ohio-state.edu Mon Apr 14 11:15:45 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Wed Apr 16 14:37:55 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> Message-ID: <20080414151544.GA2836@cse.ohio-state.edu> On Fri, Apr 11, 2008 at 03:46:04PM -0500, Stecher, Fred wrote: > Jonathan, > When I tried to rebuild my application the following error message was > output: > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf.o > ): In function `pmpi_dup_fn_': > dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' > dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initializ > ef.o): In function `pmpi_initialized_': > initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' > initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_copy > fnf.o): In function `pmpi_null_copy_fn_': > null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' This looks like it may have something to do with the Fortran bindings. Can you forward us those other log files? > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Friday, April 11, 2008 7:37 AM > To: Stecher, Fred > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > After you reinstalled MVAPICH, did you also rebuild your MPI application > before running? It's possible that you were still using the old library > when you restarted. 
> > In order to debug the compiler issue I'd like to see the other log files > as well. Specifically the config.log and the config-mine.log. > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > > Jonathan, > > I performed the ./make.mvapich.gen2 command and output to a make.log > > file. In the make.log file there was a Warning message. Also, the pgcc > > > compiler was not used. No Fortran compiler was used either. I have > > attached the make.log file. I then restarted my run. Monitoring the > > InfiniBand network traffic indicated no traffic. Monitoring Ethernet > > indicated some traffic. I do not think that InfiniBand is being used. > > > > > > Thanks, > > > > Fred > > > > > > -----Original Message----- > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Thursday, April 10, 2008 12:06 PM > > To: Stecher, Fred > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > > So, > > > How do I make the InfiniBand version if make -f make.mvapich.gen2 > > > Errors out? > > > > Don't call make yourself. Just type in './make.mvapich.gen2' at the > > command line (without the quotes of course). Before doing this, be > > sure to export any variables that you may need to override in > > make.mvapich.gen2. > > > > For more information please see > > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > > > Section 4 should answer most of your questions. > > > > > > > > Thanks, > > > > > > Fred > > > > > > -----Original Message----- > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > Sent: Thursday, April 10, 2008 11:50 AM > > > To: Stecher, Fred > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > > Hi, > > > > This is a follow-up to previous question concerning whether > > > > MVAPICH > > > > 1 is using InfiniBand or Ethernet. 
Upon monitoring network > > > > traffic, my executable is definitely using Ethernet. > > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > > mvapich-1.0 directory. We have included a single script for > > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different > > > > platforms, compilers and architectures. By default, the > > > > compilation script uses gcc. In order to select your compiler, > > > > please set the variable CC in the script to use either Intel, > > > > PathScale or PGI compiler. The platform/architecture is detected > > > > automatically." I tried make -f make.mvapich.gen2 with following > > > > error > > > > > > You should use ./make.mvapich.gen2 > > > > > > > message: > > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > > I then just typed make. This resulted in installation of some > > > > version of MVAPICH. I am not sure what version. Do anyone know > > > > what version was installed or how to determine the version? > > > > > > > > > > By using make directly you almost certainly have made the TCP > version. 
> > > > > > > Thanks, > > > > > > > > Fred > > > > > > > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > -- > > > Jonathan Perkins > > > http://www.cse.ohio-state.edu/~perkinjo > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > > > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From Fred.Stecher at atk.com Mon Apr 14 12:24:39 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 16 14:39:21 2008 Subject: [mvapich-discuss] Install MVAPICH 1 In-Reply-To: <20080414160011.GC2836@cse.ohio-state.edu> References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> <20080414151544.GA2836@cse.ohio-state.edu> <20080414160011.GC2836@cse.ohio-state.edu> Message-ID: -----Original Message----- From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] Sent: Monday, April 14, 2008 11:00 AM To: Stecher, Fred Subject: Re: [mvapich-discuss] Install MVAPICH 1 One more file, "config.log" and I should be able to look into why pgCC couldn't be used. I believe there may be some mismatch between using gcc as your c compiler and pgf77 as the fortran compiler. In the meantime while I'm inspecting your log files you can try using gcc and gfortran/g77 as your compilers to rebuild MVAPICH. Then we can see if this error persists when you try to rebuild your mpi application. 
On Mon, Apr 14, 2008 at 10:29:32AM -0500, Stecher, Fred wrote: > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Monday, April 14, 2008 10:16 AM > To: Stecher, Fred > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > On Fri, Apr 11, 2008 at 03:46:04PM -0500, Stecher, Fred wrote: > > Jonathan, > > When I tried to rebuild my application the following error message > > was > > output: > > > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_f > > nf > > .o > > ): In function `pmpi_dup_fn_': > > dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' > > dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initi > > al > > iz > > ef.o): In function `pmpi_initialized_': > > initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' > > initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_ > > co > > py > > fnf.o): In function `pmpi_null_copy_fn_': > > null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' > > This looks like it may have something to do with the Fortran bindings. > Can you forward us those other log files? > > > > > Thanks, > > > > Fred > > > > > > -----Original Message----- > > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Friday, April 11, 2008 7:37 AM > > To: Stecher, Fred > > Cc: mvapich-discuss@cse.ohio-state.edu > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > After you reinstalled MVAPICH, did you also rebuild your MPI > > application before running? It's possible that you were still using > > the old library when you restarted. > > > > In order to debug the compiler issue I'd like to see the other log > > files as well. Specifically the config.log and the config-mine.log. 
> > > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > > > Jonathan, > > > I performed the ./make.mvapich.gen2 command and output to a > > > make.log > > > > file. In the make.log file there was a Warning message. Also, the > > > pgcc > > > > > compiler was not used. No Fortran compiler was used either. I have > > > attached the make.log file. I then restarted my run. Monitoring > > > the > > > > InfiniBand network traffic indicated no traffic. Monitoring > > > Ethernet > > > > indicated some traffic. I do not think that InfiniBand is being > used. > > > > > > > > > Thanks, > > > > > > Fred > > > > > > > > > -----Original Message----- > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > Sent: Thursday, April 10, 2008 12:06 PM > > > To: Stecher, Fred > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > > > So, > > > > How do I make the InfiniBand version if make -f > > > > make.mvapich.gen2 > > > > > Errors out? > > > > > > Don't call make yourself. Just type in './make.mvapich.gen2' at > > > the > > > > command line (without the quotes of course). Before doing this, > > > be sure to export any variables that you may need to override in > > > make.mvapich.gen2. > > > > > > For more information please see > > > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html > > > > > > Section 4 should answer most of your questions. > > > > > > > > > > > Thanks, > > > > > > > > Fred > > > > > > > > -----Original Message----- > > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > > Sent: Thursday, April 10, 2008 11:50 AM > > > > To: Stecher, Fred > > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > > > Hi, > > > > > This is a follow-up to previous question concerning whether > > > > > MVAPICH > > > > > 1 is using InfiniBand or Ethernet. 
Upon monitoring network > > > > > traffic, my executable is definitely using Ethernet. > > > > > I have reinstalled MVAPICH. The user manual stated "Go to the > > > > > mvapich-1.0 directory. We have included a single script for > > > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of > > > > > different platforms, compilers and architectures. By default, > > > > > the compilation script uses gcc. In order to select your > > > > > compiler, please set the variable CC in the script to use > > > > > either > > > > > > Intel, PathScale or PGI compiler. The platform/architecture is > > > > > detected automatically." I tried make -f make.mvapich.gen2 > > > > > with > > > > > > following error > > > > > > > > You should use ./make.mvapich.gen2 > > > > > > > > > message: > > > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > > > I then just typed make. This resulted in installation of some > > > > > version of MVAPICH. I am not sure what version. Do anyone know > > > > > what version was installed or how to determine the version? > > > > > > > > > > > > > By using make directly you almost certainly have made the TCP > > version. > > > > > > > > > Thanks, > > > > > > > > > > Fred > > > > > > > > > > > > > > _______________________________________________ > > > > > mvapich-discuss mailing list > > > > > mvapich-discuss@cse.ohio-state.edu > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discus > > > > > s > > > > > > > > > > > > -- > > > > Jonathan Perkins > > > > http://www.cse.ohio-state.edu/~perkinjo > > > > > > -- > > > Jonathan Perkins > > > http://www.cse.ohio-state.edu/~perkinjo > > > > > > > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... 
Name: config.log
Type: application/octet-stream
Size: 11406 bytes
Desc: config.log
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/4f9f2b36/config-0001.obj

From Fred.Stecher at atk.com Mon Apr 14 15:34:28 2008
From: Fred.Stecher at atk.com (Stecher, Fred)
Date: Wed Apr 16 14:39:40 2008
Subject: [MVAPICH-discuss] Install MVAPICH 1
In-Reply-To: <20080414165535.GF2836@cse.ohio-state.edu>
References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> <20080414151544.GA2836@cse.ohio-state.edu> <20080414160011.GC2836@cse.ohio-state.edu> <20080414165535.GF2836@cse.ohio-state.edu>
Message-ID:

Jonathan,
You were right. The new build used the PGI compilers. I have rebuilt the executable with the gen2 version of MVAPICH. I restarted my run and InfiniBand was not used. I then ran the osu_bw program with the following results:

# OSU MPI Bandwidth Test v3.0
# Size        Bandwidth (MB/s)
1             2.46
2             5.07
4             9.83
8             18.69
16            35.98
32            69.31
64            125.95
128           225.55
256           354.17
512           491.20
1024          615.13
2048          727.53
4096          795.82
8192          861.88
16384         858.40
32768         906.21
65536         931.13
131072        943.26
262144        949.19
524288        952.36
1048576       954.01
2097152       954.83
4194304       955.25

I have attached the log files.

Thanks,

Fred

-----Original Message-----
From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu]
Sent: Monday, April 14, 2008 11:56 AM
To: Stecher, Fred
Subject: Re: [mvapich-discuss] Install MVAPICH 1

It looks like you're having a problem contacting the PGI license server. You may need to contact your sys admin to resolve this if the issue still exists. Try compiling a bare-bones C program with pgcc to see if the problem persists.
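
The osu_bw figures reported in this message can be sanity-checked with a little arithmetic: a 1 Gbit/s Ethernet link cannot carry more than 1000/8 = 125 MB/s, so a peak far above that cannot be coming over such a link. The sketch below assumes the cluster's Ethernet is Gigabit (the thread does not say) and uses a few rows copied from the results above; the function names are hypothetical. It does not by itself distinguish InfiniBand from other fast paths, for example shared memory when both ranks run on one node.

```python
# Upper bound for Gigabit Ethernet in MB/s (10^6 bytes), ignoring protocol overhead.
# Assumption: the Ethernet network in question is 1 Gbit/s.
GIGE_LIMIT_MB_S = 1000 / 8


def parse_osu_bw(text):
    """Parse osu_bw output into a list of (message_size_bytes, bandwidth_mb_s)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the two header lines
        size, bandwidth = line.split()
        rows.append((int(size), float(bandwidth)))
    return rows


def peak_bandwidth(rows):
    """Return the highest bandwidth seen across all message sizes."""
    return max(bandwidth for _, bandwidth in rows)


# A few rows taken from the results reported in this thread.
sample = """# OSU MPI Bandwidth Test v3.0
# Size Bandwidth (MB/s)
1 2.46
1024 615.13
65536 931.13
4194304 955.25
"""

rows = parse_osu_bw(sample)
peak = peak_bandwidth(rows)
print(f"peak {peak} MB/s, GigE ceiling {GIGE_LIMIT_MB_S} MB/s")
print("exceeds GigE ceiling:", peak > GIGE_LIMIT_MB_S)
```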
On Mon, Apr 14, 2008 at 11:24:39AM -0500, Stecher, Fred wrote: > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Monday, April 14, 2008 11:00 AM > To: Stecher, Fred > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > One more file, "config.log" and I should be able to look into why pgCC > couldn't be used. I believe there may be some mismatch between using > gcc as your c compiler and pgf77 as the fortran compiler. > > In the meantime while I'm inspecting your log files you can try using > gcc and > gfortran/g77 as your compilers to rebuild MVAPICH. Then we can see if > this error persists when you try to rebuild your mpi application. > > On Mon, Apr 14, 2008 at 10:29:32AM -0500, Stecher, Fred wrote: > > > > > > -----Original Message----- > > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > > Sent: Monday, April 14, 2008 10:16 AM > > To: Stecher, Fred > > Cc: mvapich-discuss@cse.ohio-state.edu > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > On Fri, Apr 11, 2008 at 03:46:04PM -0500, Stecher, Fred wrote: > > > Jonathan, > > > When I tried to rebuild my application the following error message > > > was > > > output: > > > > > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup > > > _f > > > nf > > > .o > > > ): In function `pmpi_dup_fn_': > > > dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' > > > dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' > > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(ini > > > ti > > > al > > > iz > > > ef.o): In function `pmpi_initialized_': > > > initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' > > > initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' > > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(nul > > > l_ > > > co > > > py > > > fnf.o): In function `pmpi_null_copy_fn_': > > > null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' 
> > > > This looks like it may have something to do with the Fortran bindings. > > Can you forward us those other log files? > > > > > > > > Thanks, > > > > > > Fred > > > > > > > > > -----Original Message----- > > > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > Sent: Friday, April 11, 2008 7:37 AM > > > To: Stecher, Fred > > > Cc: mvapich-discuss@cse.ohio-state.edu > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > After you reinstalled MVAPICH, did you also rebuild your MPI > > > application before running? It's possible that you were still > > > using > > > > the old library when you restarted. > > > > > > In order to debug the compiler issue I'd like to see the other log > > > files as well. Specifically the config.log and the config-mine.log. > > > > > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: > > > > Jonathan, > > > > I performed the ./make.mvapich.gen2 command and output to a > > > > make.log > > > > > > file. In the make.log file there was a Warning message. Also, > > > > the pgcc > > > > > > > compiler was not used. No Fortran compiler was used either. I > > > > have > > > > > attached the make.log file. I then restarted my run. Monitoring > > > > the > > > > > > InfiniBand network traffic indicated no traffic. Monitoring > > > > Ethernet > > > > > > indicated some traffic. I do not think that InfiniBand is being > > used. > > > > > > > > > > > > Thanks, > > > > > > > > Fred > > > > > > > > > > > > -----Original Message----- > > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > > Sent: Thursday, April 10, 2008 12:06 PM > > > > To: Stecher, Fred > > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > > > On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: > > > > > So, > > > > > How do I make the InfiniBand version if make -f > > > > > make.mvapich.gen2 > > > > > > > Errors out? > > > > > > > > Don't call make yourself. 
Just type in './make.mvapich.gen2' at > > > > the > > > > > > command line (without the quotes of course). Before doing this, > > > > be sure to export any variables that you may need to override in > > > > make.mvapich.gen2. > > > > > > > > For more information please see > > > > http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.htm > > > > l > > > > > > > > Section 4 should answer most of your questions. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Fred > > > > > > > > > > -----Original Message----- > > > > > From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] > > > > > Sent: Thursday, April 10, 2008 11:50 AM > > > > > To: Stecher, Fred > > > > > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > > > > > > > > > On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: > > > > > > Hi, > > > > > > This is a follow-up to previous question concerning whether > > > > > > MVAPICH > > > > > > 1 is using InfiniBand or Ethernet. Upon monitoring network > > > > > > traffic, my executable is definitely using Ethernet. > > > > > > I have reinstalled MVAPICH. The user manual stated "Go to > > > > > > the mvapich-1.0 directory. We have included a single script > > > > > > for > > > > > > OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of > > > > > > different platforms, compilers and architectures. By > > > > > > default, the compilation script uses gcc. In order to select > > > > > > your compiler, please set the variable CC in the script to > > > > > > use either > > > > > > > > Intel, PathScale or PGI compiler. The platform/architecture > > > > > > is > > > > > > > detected automatically." I tried make -f make.mvapich.gen2 > > > > > > with > > > > > > > > following error > > > > > > > > > > You should use ./make.mvapich.gen2 > > > > > > > > > > > message: > > > > > > make.mvapich.gen2:7: *** missing separator. Stop. > > > > > > I then just typed make. This resulted in installation of > > > > > > some version of MVAPICH. 
I am not sure what version. Do > > > > > > anyone know > > > > > > > what version was installed or how to determine the version? > > > > > > > > > > > > > > > > By using make directly you almost certainly have made the TCP > > > version. > > > > > > > > > > > Thanks, > > > > > > > > > > > > Fred > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > mvapich-discuss mailing list > > > > > > mvapich-discuss@cse.ohio-state.edu > > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-disc > > > > > > us > > > > > > s > > > > > > > > > > > > > > > -- > > > > > Jonathan Perkins > > > > > http://www.cse.ohio-state.edu/~perkinjo > > > > > > > > -- > > > > Jonathan Perkins > > > > http://www.cse.ohio-state.edu/~perkinjo > > > > > > > > > > > > > > > -- > > > Jonathan Perkins > > > http://www.cse.ohio-state.edu/~perkinjo > > > > -- > > Jonathan Perkins > > http://www.cse.ohio-state.edu/~perkinjo > > > > > -- > Jonathan Perkins > http://www.cse.ohio-state.edu/~perkinjo -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: make-mine.log_gz Type: application/octet-stream Size: 13310 bytes Desc: make-mine.log_gz Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/01f8a4ef/make-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: config.log Type: application/octet-stream Size: 7583 bytes Desc: config.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/01f8a4ef/config-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: config-mine.log Type: application/octet-stream Size: 23176 bytes Desc: config-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/01f8a4ef/config-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: install-mine.log Type: application/octet-stream Size: 2090 bytes Desc: install-mine.log Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/01f8a4ef/install-mine-0001.obj -------------- next part -------------- A non-text attachment was scrubbed... Name: make.log_gz Type: application/octet-stream Size: 18479 bytes Desc: make.log_gz Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080414/01f8a4ef/make-0001.obj From Fred.Stecher at atk.com Mon Apr 14 17:07:29 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Apr 16 14:39:59 2008 Subject: [MVAPICH-discuss] Install MVAPICH 1 In-Reply-To: <20080414203456.GE3435@cse.ohio-state.edu> References: <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> <20080414151544.GA2836@cse.ohio-state.edu> <20080414160011.GC2836@cse.ohio-state.edu> <20080414165535.GF2836@cse.ohio-state.edu> <20080414203456.GE3435@cse.ohio-state.edu> Message-ID: Jonathan, What I am saying is that when I monitor the network while my application is running, no packets are being transferred over InfiniBand, while Ethernet is busy. Maybe the SGI Clustervis monitor software is not catching the InfiniBand traffic. It outputs every 2 seconds. Is there another interactive utility that I can use to monitor the networks while my program is executing? Thanks, Fred -----Original Message----- From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] Sent: Monday, April 14, 2008 3:35 PM To: Stecher, Fred Subject: Re: [MVAPICH-discuss] Install MVAPICH 1 On Mon, Apr 14, 2008 at 02:34:28PM -0500, Stecher, Fred wrote: > Jonathan, > You were right. The new build used the PGI compilers. I have rebuilt > the executable with the gen2 version of MVAPICH. I restarted my run > and InfiniBand was not used. 
I then ran the osu_bw program with the
> following results:
> # OSU MPI Bandwidth Test v3.0
> # Size        Bandwidth (MB/s)
> 1             2.46
> 2             5.07
> 4             9.83
> 8             18.69
> 16            35.98
> 32            69.31
> 64            125.95
> 128           225.55
> 256           354.17
> 512           491.20
> 1024          615.13
> 2048          727.53
> 4096          795.82
> 8192          861.88
> 16384         858.40
> 32768         906.21
> 65536         931.13
> 131072        943.26
> 262144        949.19
> 524288        952.36
> 1048576       954.01
> 2097152       954.83
> 4194304       955.25
>
> I have attached the log files.
>
> Thanks,
>
> Fred

I'm not quite sure what is being asked here. The osu_bw numbers seem to
indicate that Infiniband is being used. Are you saying that your
application is still not using Infiniband even after recompiling it or
are you saying that it is now working?

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From Fred.Stecher at atk.com Tue Apr 15 19:34:47 2008
From: Fred.Stecher at atk.com (Stecher, Fred)
Date: Wed Apr 16 14:40:19 2008
Subject: [MVAPICH-discuss] Install MVAPICH 1
In-Reply-To: <20080415012742.GA2808@cse.ohio-state.edu>
References: <20080411123632.GB2766@cse.ohio-state.edu> <20080414151544.GA2836@cse.ohio-state.edu> <20080414160011.GC2836@cse.ohio-state.edu> <20080414165535.GF2836@cse.ohio-state.edu> <20080414203456.GE3435@cse.ohio-state.edu> <20080415012742.GA2808@cse.ohio-state.edu>
Message-ID:

Jonathan,
The system administrator found another network monitor tool. I ran
osu_bw while watching the monitor. The Ethernet in and out traffic
peaked out at about 3000 counts/sec. There was no InfiniBand traffic. I
need to run a test using the SGI mpirun package to make sure that the
InfiniBand is working. I'll let you know.
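As a rough sanity check on the osu_bw figures quoted in this thread, the peak bandwidth can be compared with the most a given link could carry. The sketch below is not part of the original exchange. It assumes the Ethernet network is Gigabit Ethernet (1 Gbit/s, i.e. 125 MB/s before protocol overhead), which the thread does not state, and the helper names are illustrative only.

```python
# Rough check: could the peak osu_bw bandwidth have travelled over Ethernet?
# ASSUMPTION: the Ethernet link is 1 Gbit/s; the thread does not say so.

# Excerpt of the osu_bw output quoted in the thread.
OSU_BW_OUTPUT = """\
# OSU MPI Bandwidth Test v3.0
# Size Bandwidth (MB/s)
1 2.46
65536 931.13
4194304 955.25
"""

GIGE_PEAK_MB_S = 1_000_000_000 / 8 / 1_000_000  # 1 Gbit/s = 125 MB/s


def peak_bandwidth(report: str) -> float:
    """Return the largest bandwidth (MB/s) in an osu_bw-style report."""
    rates = []
    for line in report.splitlines():
        fields = line.split()
        # Skip blank lines and '#' header lines.
        if len(fields) != 2 or line.lstrip().startswith("#"):
            continue
        rates.append(float(fields[1]))
    return max(rates)


def exceeds_link(report: str, link_peak_mb_s: float = GIGE_PEAK_MB_S) -> bool:
    """True if the measured peak is more than the link could carry."""
    return peak_bandwidth(report) > link_peak_mb_s


if __name__ == "__main__":
    peak = peak_bandwidth(OSU_BW_OUTPUT)
    print(f"peak {peak:.2f} MB/s, GigE limit {GIGE_PEAK_MB_S:.0f} MB/s")
    print("faster than Gigabit Ethernet:", exceeds_link(OSU_BW_OUTPUT))
```

Under that assumption, a peak near 955 MB/s is well above what Gigabit Ethernet could deliver, which is consistent with the reading that osu_bw ran over InfiniBand. It says nothing about which network the application itself uses.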
Thanks,

Fred

-----Original Message-----
From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu]
Sent: Monday, April 14, 2008 8:28 PM
To: Stecher, Fred
Subject: Re: [MVAPICH-discuss] Install MVAPICH 1

On Mon, Apr 14, 2008 at 04:07:29PM -0500, Stecher, Fred wrote:
> Jonathan,
> What I am saying is that when I monitor the network while my
> application is running, no packets are being transferred over
> InfiniBand, while Ethernet is busy. Maybe the SGI Clustervis monitor
> software is not catching the InfiniBand traffic. It outputs every 2
> seconds. Is there another interactive utility that I can use to
> monitor the networks while my program is executing?

I'm not too familiar with those tools. It would be best to ask that
question on the discuss list. However, one way to find out whether the
Clustervis monitor is catching any IB traffic is to see whether it shows
any activity while running the OSU benchmarks (such as osu_bw).

>
> Thanks,
>
> Fred
>
>
> -----Original Message-----
> From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu]
> Sent: Monday, April 14, 2008 3:35 PM
> To: Stecher, Fred
> Subject: Re: [MVAPICH-discuss] Install MVAPICH 1
>
> On Mon, Apr 14, 2008 at 02:34:28PM -0500, Stecher, Fred wrote:
> > Jonathan,
> > You were right. The new build used the PGI compilers. I have rebuilt
> > the executable with the gen2 version of MVAPICH. I restarted my run
> > and InfiniBand was not used. I then ran the osu_bw program with the
> > following results:
> > # OSU MPI Bandwidth Test v3.0
> > # Size        Bandwidth (MB/s)
> > 1             2.46
> > 2             5.07
> > 4             9.83
> > 8             18.69
> > 16            35.98
> > 32            69.31
> > 64            125.95
> > 128           225.55
> > 256           354.17
> > 512           491.20
> > 1024          615.13
> > 2048          727.53
> > 4096          795.82
> > 8192          861.88
> > 16384         858.40
> > 32768         906.21
> > 65536         931.13
> > 131072        943.26
> > 262144        949.19
> > 524288        952.36
> > 1048576       954.01
> > 2097152       954.83
> > 4194304       955.25
> >
> > I have attached the log files.
> >
> > Thanks,
> >
> > Fred
>
> I'm not quite sure what is being asked here.
The osu_bw numbers seem
> to indicate that Infiniband is being used. Are you saying that your
> application is still not using Infiniband even after recompiling it or
> are you saying that it is now working?
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From sridharj at cse.ohio-state.edu Wed Apr 16 14:47:32 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Wed Apr 16 14:47:47 2008
Subject: [mvapich-discuss] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband...
In-Reply-To: <26d2cb010804151706p3973e774ia8deb83e9d0f3ea3@mail.gmail.com>
References: <26d2cb010804151706p3973e774ia8deb83e9d0f3ea3@mail.gmail.com>
Message-ID: <1208371652.2697.3.camel@t13.nowlab.cis.ohio-state.edu>

Leon,

I believe you missed Dr. Panda's earlier response
(http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-April/001568.html).
I've attached his response. Do let us know if you have any other queries.

Thanks,
Jaidev

On Wed, 2008-04-16 at 10:06 +1000, leon zadorin wrote:
> Hello everyone,
>
> It had occurred to me that my original post of my questions (to
> mvapich-commit) would be better if emailed to this list instead... so
> here it is...
>
> I am relatively new to the whole MPI/Infiniband scene, so my apologies
> if some of the questions/thoughts of mine are naive...
>
> I am currently experiencing difficulties with dynamic process
> connections (MPI_Comm_join) between 2 hosts (each with Infiniband and
> Ethernet card).
>
> The setup is:
> 2 hosts, each with Ethernet (Gigabit) card and with Infiniband card
> (PCI-e), running Linux 32 bits (AMD arch).
> Hosts are connected via Infiniband switch (w.r.t Infiniband cards) and
> via Ethernet/IP network (w.r.t. Ethernet cards).
> mvapich2 has been made with "make.mvapich2.ofa" > mpdboot has been executed and mpd daemons are running on both hosts > > I would like to know if it is currently possible to achieve the following: > > 1) start 1 app on 1 host (without using mpirun); > 2) then later, after some time, start another app on 2nd host (without > using mpirun); > 3) make the app in step 2 automatically connect to the app started in step 1 > > I was able to achieve the above when running with mpich2 library, > using sock channels and only when using 'MPI_Comm_join' call (using > MPI_Publish_name, etc. did not work when starting apps without mpirun > [even with all mpds being active]). > > However, the MPI_Comm_join tactic fails when attempting to use > mvapich2 (mvapich2-1.0-2008-04-10) over Infiniband... I wonder if the > following has something to do with it: > http://lists.openfabrics.org/pipermail/commits/2006-January/004707.html > " > -------------------------------------------------------------------------------- > - Known Deficiencies > -------------------------------------------------------------------------------- > - > -- The sock channel is the only channel that implements dynamic process support > - (i.e., MPI_COMM_SPAWN, MPI_COMM_CONNECT, MPI_COMM_ACCEPT, etc.). All other > - channels will experience failures for tests exercising dynamic process > - functionality. > " > and in http://lists.openfabrics.org/pipermail/commits/2006-May/007209.html > we have: > " > -- MPI_COMM_JOIN has been implemented; although like the other dynamic process > - routines, it is only supported by the Sock channel. > " > > Given that above quotes mentioned both the MPI_Comm_join and > MPI_Comm_connect ... is there any way at all to currently achieve the > above 3 steps when using Infiniband cards (and may be having Ethernet > cards on all of the hosts as well)? > > I would imagine that, albeit theoretically, it is plausible to use > sock channel to 'bootstrap' the Infiniband channel? 
> http://www.mpi-forum.org/docs/mpi-20-html/node115.htm > " > MPI uses the socket to bootstrap creation of the intercommunicator, > and for nothing else. > " > > Perhaps I need to build mvapich2 not via "make.make.mvapich2.ofa" but > something else so that both: socket and infiniband channels are > supported? > > Of course the same aforementioned link > (http://www.mpi-forum.org/docs/mpi-20-html/node115.htm) > says: > " > Advice to users. An MPI implementation may require a specific > communication medium for MPI communication, such as a shared memory > segment or a special switch. In this case, it may not be possible for > two processes to successfully join even if there is a socket > connecting them and they are using the same MPI implementation. ( End > of advice to users.) > " > > If this is the case here and there is no way to use MPI_Comm_join to > achieve the originally described 3 steps (connecting apps started at > different times and without the use of mpirun) - is that then at all > possible (e.g. using MPI's open port, publish name, lookup name, > accept/connect calls)? Are the limitations purely theoretical or more > of a practical nature? > > Ideally, for async. server design purposes and, given that > MPI_Comm_accept is blocking and there is no 'test'/'poll' for it, it > would be good to be able to use sockets channel to coordinate > infiniband channel bootstrapping with MPI_Comm_join (even if > MPI_Comm_join in itself is blocking, at least one can 'poll' for the > TCP's socket's fd before calling 'accept' and subsequently > MPI_Comm_join)... > > If mvapich2 is unable to provide dynamic process connectivity over > Infiniband... are there any other libs that could do that? > > Kind regards > Leon. 
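The polling idea described above (check the listening TCP socket for readiness, and only then call accept and hand the connected descriptor to MPI_Comm_join) can be sketched as follows. This is a plain-socket illustration in Python, not MVAPICH2 code: it makes no MPI calls, the function name `poll_for_peer` is hypothetical, and whether MPI_Comm_join would then succeed over InfiniBand is exactly the open question in this message.

```python
import select
import socket


def poll_for_peer(listener: socket.socket, timeout: float = 0.0):
    """Return a connected socket if a peer is waiting, else None.

    Uses select() so the caller never blocks inside accept().
    """
    readable, _, _ = select.select([listener], [], [], timeout)
    if not readable:
        return None
    conn, _addr = listener.accept()
    return conn


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    # Nobody has connected yet, so the poll reports no peer.
    print("peer waiting:", poll_for_peer(listener) is not None)

    # Stand-in for the second application, started later.
    client = socket.create_connection(listener.getsockname())
    conn = poll_for_peer(listener, timeout=1.0)
    print("peer waiting:", conn is not None)

    # An MPI program would pass conn.fileno() to MPI_Comm_join at this point.
    for s in (conn, client, listener):
        s.close()
```

The design point is only that the non-blocking wait happens on the TCP side; the MPI call itself would still block once it is made.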
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
-------------- next part --------------
An embedded message was scrubbed...
From: Dhabaleswar Panda
Subject: [mvapich-discuss] [mvapich-commit] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband... (fwd)
Date: Tue, 15 Apr 2008 08:52:47 -0400 (EDT)
Size: 8794
Url: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080416/3daf718f/attachment.mht

From methier at CGR.Harvard.edu Wed Apr 16 14:58:18 2008
From: methier at CGR.Harvard.edu (Michael Ethier)
Date: Wed Apr 16 14:58:28 2008
Subject: [mvapich-discuss] error IBV_WC_LOC_LEN_ERR and FATAL event
Message-ID: <72AF30DC2881964CB911FD08E57157E76ED369@lsdiv-msxbe-001.nucleus.harvard.edu>

Hi,

I just wanted to say that, regarding the IBV_WC_LOC_LEN_ERR and
IBV_EVENT_QP_LAST_WQE_REACHED errors, the fix was to remove system calls
in the code. Once that was done, it ran fine.

Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080416/5148c987/attachment-0001.html

From curtisbr at cse.ohio-state.edu Wed Apr 16 14:58:43 2008
From: curtisbr at cse.ohio-state.edu (Brian Curtis)
Date: Wed Apr 16 15:01:43 2008
Subject: [mvapich-discuss] Install MVAPICH 1
In-Reply-To:
References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu>
Message-ID: <48064C63.9030503@cse.ohio-state.edu>

Fred,

Please see
http://www-unix.mcs.anl.gov/mpi/mpich1/docs/mpichman-chshmem/node111.htm
for some tips on this problem.
Brian Stecher, Fred wrote: > Jonathan, > When I tried to rebuild my application the following error message was > output: > > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf.o > ): In function `pmpi_dup_fn_': > dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' > dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initializ > ef.o): In function `pmpi_initialized_': > initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' > initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' > /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_copy > fnf.o): In function `pmpi_null_copy_fn_': > null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' > > Thanks, > > Fred > > > -----Original Message----- > From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] > Sent: Friday, April 11, 2008 7:37 AM > To: Stecher, Fred > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > After you reinstalled MVAPICH, did you also rebuild your MPI application > before running? It's possible that you were still using the old library > when you restarted. > > In order to debug the compiler issue I'd like to see the other log files > as well. Specifically the config.log and the config-mine.log. > > On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: >> Jonathan, >> I performed the ./make.mvapich.gen2 command and output to a make.log >> file. In the make.log file there was a Warning message. Also, the pgcc > >> compiler was not used. No Fortran compiler was used either. I have >> attached the make.log file. I then restarted my run. Monitoring the >> InfiniBand network traffic indicated no traffic. Monitoring Ethernet >> indicated some traffic. I do not think that InfiniBand is being used. >> >> >> Thanks, >> >> Fred >> >> >> -----Original Message----- >> From: Jonathan L. 
Perkins [mailto:perkinjo@cse.ohio-state.edu] >> Sent: Thursday, April 10, 2008 12:06 PM >> To: Stecher, Fred >> Subject: Re: [mvapich-discuss] Install MVAPICH 1 >> >> On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: >>> So, >>> How do I make the InfiniBand version if make -f make.mvapich.gen2 >>> Errors out? >> Don't call make yourself. Just type in './make.mvapich.gen2' at the >> command line (without the quotes of course). Before doing this, be >> sure to export any variables that you may need to override in >> make.mvapich.gen2. >> >> For more information please see >> http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html >> >> Section 4 should answer most of your questions. >> >>> Thanks, >>> >>> Fred >>> >>> -----Original Message----- >>> From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] >>> Sent: Thursday, April 10, 2008 11:50 AM >>> To: Stecher, Fred >>> Subject: Re: [mvapich-discuss] Install MVAPICH 1 >>> >>> On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: >>>> Hi, >>>> This is a follow-up to previous question concerning whether >>>> MVAPICH >>>> 1 is using InfiniBand or Ethernet. Upon monitoring network >>>> traffic, my executable is definitely using Ethernet. >>>> I have reinstalled MVAPICH. The user manual stated "Go to the >>>> mvapich-1.0 directory. We have included a single script for >>>> OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different >>>> platforms, compilers and architectures. By default, the >>>> compilation script uses gcc. In order to select your compiler, >>>> please set the variable CC in the script to use either Intel, >>>> PathScale or PGI compiler. The platform/architecture is detected >>>> automatically." I tried make -f make.mvapich.gen2 with following >>>> error >>> You should use ./make.mvapich.gen2 >>> >>>> message: >>>> make.mvapich.gen2:7: *** missing separator. Stop. >>>> I then just typed make. This resulted in installation of some >>>> version of MVAPICH. 
I am not sure what version. Does anyone know
>>>> what version was installed or how to determine the version?
>>>>
>>> By using make directly you almost certainly have made the TCP version.
>>>> Thanks,
>>>>
>>>> Fred
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss@cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>>> --
>>> Jonathan Perkins
>>> http://www.cse.ohio-state.edu/~perkinjo
>> --
>> Jonathan Perkins
>> http://www.cse.ohio-state.edu/~perkinjo
>
>
>
>
> --
> Jonathan Perkins
> http://www.cse.ohio-state.edu/~perkinjo
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From curtisbr at cse.ohio-state.edu Wed Apr 16 15:11:59 2008
From: curtisbr at cse.ohio-state.edu (Brian Curtis)
Date: Wed Apr 16 15:14:55 2008
Subject: [mvapich-discuss] Install MVAPICH 1
In-Reply-To:
References: <20080410164938.GA15644@cse.ohio-state.edu> <20080410170532.GC15644@cse.ohio-state.edu> <20080411123632.GB2766@cse.ohio-state.edu> <48064C63.9030503@cse.ohio-state.edu>
Message-ID: <48064F7F.8030701@cse.ohio-state.edu>

Right, sorry about that. The suggested link was for missing symbols,
not undefined references.

As for the application not running on InfiniBand, have you successfully
run one of the OSU benchmarks?

Brian

Stecher, Fred wrote:
> Brian,
> Jonathan figured out why the PGI compilers were not used. I have
> successfully built MVAPICH with the PGI compilers. The only problem
> now is that the executable will not run on InfiniBand.
> > Thanks, > > Fred > > > -----Original Message----- > From: Brian Curtis [mailto:curtisbr@cse.ohio-state.edu] > Sent: Wednesday, April 16, 2008 1:59 PM > To: Stecher, Fred > Cc: Jonathan Perkins; mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] Install MVAPICH 1 > > Fred, > > Please see > http://www-unix.mcs.anl.gov/mpi/mpich1/docs/mpichman-chshmem/node111.htm > . > For some tips for this problem. > > Brian > > Stecher, Fred wrote: >> Jonathan, >> When I tried to rebuild my application the following error message was >> output: >> >> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(dup_fnf >> .o >> ): In function `pmpi_dup_fn_': >> dup_fnf.c:(.text+0x23): undefined reference to `MPIR_F_TRUE' >> dup_fnf.c:(.text+0x2a): undefined reference to `MPIR_F_FALSE' >> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(initial >> iz >> ef.o): In function `pmpi_initialized_': >> initializef.c:(.text+0x27): undefined reference to `MPIR_F_TRUE' >> initializef.c:(.text+0x2e): undefined reference to `MPIR_F_FALSE' >> /data1/home/fstecher/mvapich/usr/local/mvapich/lib/libfmpich.a(null_co >> py >> fnf.o): In function `pmpi_null_copy_fn_': >> null_copyfnf.c:(.text+0x6): undefined reference to `MPIR_F_FALSE' >> >> Thanks, >> >> Fred >> >> >> -----Original Message----- >> From: Jonathan Perkins [mailto:perkinjo@cse.ohio-state.edu] >> Sent: Friday, April 11, 2008 7:37 AM >> To: Stecher, Fred >> Cc: mvapich-discuss@cse.ohio-state.edu >> Subject: Re: [mvapich-discuss] Install MVAPICH 1 >> >> After you reinstalled MVAPICH, did you also rebuild your MPI >> application before running? It's possible that you were still using >> the old library when you restarted. >> >> In order to debug the compiler issue I'd like to see the other log >> files as well. Specifically the config.log and the config-mine.log. 
>> >> On Thu, Apr 10, 2008 at 04:49:41PM -0500, Stecher, Fred wrote: >>> Jonathan, >>> I performed the ./make.mvapich.gen2 command and output to a make.log >>> file. In the make.log file there was a Warning message. Also, the >>> pgcc >>> compiler was not used. No Fortran compiler was used either. I have >>> attached the make.log file. I then restarted my run. Monitoring the >>> InfiniBand network traffic indicated no traffic. Monitoring Ethernet >>> indicated some traffic. I do not think that InfiniBand is being used. >>> >>> >>> Thanks, >>> >>> Fred >>> >>> >>> -----Original Message----- >>> From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] >>> Sent: Thursday, April 10, 2008 12:06 PM >>> To: Stecher, Fred >>> Subject: Re: [mvapich-discuss] Install MVAPICH 1 >>> >>> On Thu, Apr 10, 2008 at 11:53:37AM -0500, Stecher, Fred wrote: >>>> So, >>>> How do I make the InfiniBand version if make -f make.mvapich.gen2 >>>> Errors out? >>> Don't call make yourself. Just type in './make.mvapich.gen2' at the >>> command line (without the quotes of course). Before doing this, be >>> sure to export any variables that you may need to override in >>> make.mvapich.gen2. >>> >>> For more information please see >>> http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html >>> >>> Section 4 should answer most of your questions. >>> >>>> Thanks, >>>> >>>> Fred >>>> >>>> -----Original Message----- >>>> From: Jonathan L. Perkins [mailto:perkinjo@cse.ohio-state.edu] >>>> Sent: Thursday, April 10, 2008 11:50 AM >>>> To: Stecher, Fred >>>> Subject: Re: [mvapich-discuss] Install MVAPICH 1 >>>> >>>> On Thu, Apr 10, 2008 at 11:23:49AM -0500, Stecher, Fred wrote: >>>>> Hi, >>>>> This is a follow-up to previous question concerning whether MVAPICH >>>>> 1 is using InfiniBand or Ethernet. Upon monitoring network traffic, > >>>>> my executable is definitely using Ethernet. >>>>> I have reinstalled MVAPICH. The user manual stated "Go to the >>>>> mvapich-1.0 directory. 
We have included a single script for >>>>> OpenFabrics/Gen2 (make.mvapich.gen2) that takes care of different >>>>> platforms, compilers and architectures. By default, the compilation > >>>>> script uses gcc. In order to select your compiler, please set the >>>>> variable CC in the script to use either Intel, PathScale or PGI >>>>> compiler. The platform/architecture is detected automatically." I >>>>> tried make -f make.mvapich.gen2 with following error >>>> You should use ./make.mvapich.gen2 >>>> >>>>> message: >>>>> make.mvapich.gen2:7: *** missing separator. Stop. >>>>> I then just typed make. This resulted in installation of some >>>>> version of MVAPICH. I am not sure what version. Do anyone know what > >>>>> version was installed or how to determine the version? >>>>> >>>> By using make directly you almost certainly have made the TCP >> version. >>>>> Thanks, >>>>> >>>>> Fred >>>>> >>>>> _______________________________________________ >>>>> mvapich-discuss mailing list >>>>> mvapich-discuss@cse.ohio-state.edu >>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>>> -- >>>> Jonathan Perkins >>>> http://www.cse.ohio-state.edu/~perkinjo >>> -- >>> Jonathan Perkins >>> http://www.cse.ohio-state.edu/~perkinjo >> >> >> >> -- >> Jonathan Perkins >> http://www.cse.ohio-state.edu/~perkinjo >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From DDickerson2 at dow.com Wed Apr 16 16:08:20 2008 From: DDickerson2 at dow.com (Dickerson, Dee) Date: Wed Apr 16 18:41:24 2008 Subject: [mvapich-discuss] Problems running mpdboot Message-ID: <2C31D3432BD9574D97D9F8437FFCA778E57BD9@USFRPMDOWX011.dow.com> I have installed the latest version of mvapich2 using the make.makefile.ofa file. The only changes I made to this file was to change g77 to gfortran and the prefix to an NFS mounted directory. 
When I run mpdboot I get the following error node002 31% mpdboot -n 4 mpdboot_node002 (handle_mpd_output 396): from mpd on node001, invalid port info: I can manually start mpd & on each node but if I run mpiexec -n16 hostname it responds the hostname on the node 16 times. It does not go to the other node. Any help would be greatly appreciated. Thank you Dee _______________________________________________ Dee Dickerson Engineering & Process Sciences - Process Optimization Core R&D The Dow Chemical Company B-1603 Freeport, Texas 77541 Phone: +1 979-238-4449 Dickerson4@dow.com -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080416/5e690de6/attachment.html From gabra at us.ibm.com Wed Apr 16 16:15:19 2008 From: gabra at us.ibm.com (Gregory D Abram) Date: Wed Apr 16 18:43:15 2008 Subject: [mvapich-discuss] MVAPICH and pthreads Message-ID: I have two versions of a MPI program - one which does all the work in one thread, and one that spreads the work over several threads, but which does all the MPI calls in the original thread. I've been trying to figure out why the threaded version doesn't go any faster on an 8 processor system, and found that it appears that all the threads are running on one processor (according to the top command - one is pegged, seven are idle). So I wrote a little test program that spawns 5 threads that just sit there doing arithmetic. When run, top shows 5 processors busy. I then stuck MPI_Init at the start, compiled it for MVAPICH and ran it under mpirun on the same node - just one process - and sure enough, only 1 processor is busy. I then recompiled it for OpenMPI and ran it, again on the same node, and got 5 processors busy. Is this expected? I'm trying to be MPI-version agnostic, but this is a problem. 
Greg

From pasha at dev.mellanox.co.il Thu Apr 17 03:27:16 2008
From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha))
Date: Thu Apr 17 03:27:29 2008
Subject: [mvapich-discuss] MVAPICH and pthreads
In-Reply-To:
References:
Message-ID: <4806FBD4.9060504@dev.mellanox.co.il>

In MVAPICH1, CPU/memory affinity support is enabled by default, and it
may cause problems for multithreaded applications. (I had a similar
problem with the threaded version of GotoBLAS.) You can disable the
affinity support with the "VIADEV_USE_AFFINITY=0" parameter.

Thanks,
Pasha

Gregory D Abram wrote:
> I have two versions of a MPI program - one which does all the work in one
> thread, and one that spreads the work over several threads, but which does
> all the MPI calls in the original thread. I've been trying to figure out
> why the threaded version doesn't go any faster on an 8 processor system,
> and found that it appears that all the threads are running on one processor
> (according to the top command - one is pegged, seven are idle).
>
> So I wrote a little test program that spawns 5 threads that just sit there
> doing arithmetic. When run, top shows 5 processors busy. I then stuck
> MPI_Init at the start, compiled it for MVAPICH and ran it under mpirun on
> the same node - just one process - and sure enough, only 1 processor is
> busy. I then recompiled it for OpenMPI and ran it, again on the same node,
> and got 5 processors busy.
>
> Is this expected? I'm trying to be MPI-version agnostic, but this is a
> problem.
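One way to check for the symptom discussed above is to look at the CPU affinity mask of the running process: a process pinned to one core can only schedule its threads there. The sketch below uses Python's `os.sched_getaffinity`, which is available on Linux. It does not call MPI, and the helper name `allowed_cpus` is illustrative. It reports the mask of whichever process calls it, so in an MPI program the equivalent check (for example `sched_getaffinity` in C) would presumably need to run after MPI_Init to see any mask the library has applied.

```python
import os


def allowed_cpus(pid: int = 0):
    """Return the sorted CPU ids the given process may run on (Linux only).

    pid 0 means the calling process. Returns None where the call is missing.
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    cpus = allowed_cpus()
    if cpus is None:
        print("sched_getaffinity is not available on this platform")
    else:
        print(f"may run on {len(cpus)} CPU(s): {cpus}")
        if len(cpus) == 1:
            print("pinned to a single CPU; extra threads will share it")
```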
>
> Greg
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>

--
Pavel Shamis (Pasha)
Mellanox Technologies

From vera_wx_cn at yahoo.com.cn Thu Apr 17 05:15:51 2008
From: vera_wx_cn at yahoo.com.cn (强 马)
Date: Thu Apr 17 05:16:04 2008
Subject: [mvapich-discuss] problem about ibv_dealloc_pd
Message-ID: <506474.76991.qm@web15313.mail.cnb.yahoo.com>

I built mvapich-1.0 with make.mvapich.gen2_multirail. I first ran my MPI program on a single HCA (setting NUM_HCAS=1).
All MPI tasks catch a signal. The steps in the signal handler are:
1) flush all pending messages;
2) MPIR_BsendRelease(,)
3) MPI_Barrier()
4) MPID_End()
5) checkpoint
6) exit

As a result, sometimes a few of the MPI tasks fail in ibv_dealloc_pd() (viainit.c:516) while the others succeed.
Sometimes all tasks finish all of the above steps and exit successfully.

When it fails, ibv_dealloc_pd() always returns 16 (IBV_WC_REM_ABORT_ERR).

Which InfiniBand resources are still associated with the PD?

I have spent almost two weeks checking and debugging my sources, and I'm tired.
I am testing with bt.C.36 in the following InfiniBand environment:
CA type: MT25204, ports: 1, rate: 20

Please help me.
Thanks in advance.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080417/218f26dd/attachment.html From L-marks at northwestern.edu Fri Apr 18 13:31:27 2008 From: L-marks at northwestern.edu (Laurence Marks) Date: Fri Apr 18 13:32:25 2008 Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else In-Reply-To: <876512660804181028k740136e3y1deecf552034b78b@mail.gmail.com> References: <876512660804181028k740136e3y1deecf552034b78b@mail.gmail.com> Message-ID: <876512660804181031vc80f673jeb4ca84a35af04aa@mail.gmail.com> I have an issue running mpi tasks on a new cluster which may be any of (or a combination of) a) The intel scalapack libraries b) The version of OFED and infiniband cards c) The compiler (ifort) and the architecture d) Something I've not thought of. It's not the code, that is stable and runs fine on other systems. Running mvapich, on a dual-quadcore Intel(R) Xeon(R) CPU E5410 everything works if I run only 1 mpi task per quadcore. If I do 2 or more I get a SIGSEV within the scalapack call PDSYEVX which looks like it is associated with threading: libpthread.so.0 00000030D9C0DD40 libpthread.so.0 00000030D9C0DC1D libiomp5.so 00002AAAAB4C1511 Running mvapich2 and/or intelmpi I get a different error, Fatal error in MPI_Comm_size: Invalid communicator, error stack: MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0xa4ee68) failed MPI_Comm_size(69).: Invalid communicator which I can trace to the scalapack call CALL SL_INIT(ICTXTALL, 1, NPE) The code ran fine when it was benchmarked a few months ago, and so far has been tested (by Intel) on a dual duo-core without problems; the engineer is going to use a dual quadcore. I would appreciate any suggestions as to where to look to try and understand what is going on. 
--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

From koop at cse.ohio-state.edu Fri Apr 18 16:57:16 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri Apr 18 16:57:27 2008
Subject: [mvapich-discuss] problem about ibv_dealloc_pd
In-Reply-To: <506474.76991.qm@web15313.mail.cnb.yahoo.com>
Message-ID:

Hi,

So you are trying to implement your own checkpointing library in
MVAPICH? You may be interested in MVAPICH2, which already has multirail
as well as checkpointing support.

There are numerous issues with checkpointing InfiniBand: for example,
the QPs (connections) need to be torn down and registered memory needs
to be unregistered. The following paper has additional information:

http://nowlab.cse.ohio-state.edu/publications/conf-papers/2006/gaoq-icpp06.pdf

Thanks,
Matt

On Thu, 17 Apr 2008, 强 马 wrote:

> I built mvapich-1.0 with make.mvapich.gen2_multirail. I first ran my MPI program on a single HCA (setting NUM_HCAS=1).
> All MPI tasks catch a signal. The steps in the signal handler are:
> 1) flush all pending messages;
> 2) MPIR_BsendRelease(,)
> 3) MPI_Barrier()
> 4) MPID_End()
> 5) checkpoint
> 6) exit
>
> As a result, sometimes a few of the MPI tasks fail in ibv_dealloc_pd() (viainit.c:516) while the others succeed.
> Sometimes all tasks finish all of the above steps and exit successfully.
>
> When it fails, ibv_dealloc_pd() always returns 16 (IBV_WC_REM_ABORT_ERR).
>
> Which InfiniBand resources are still associated with the PD?
>
> I have spent almost two weeks checking and debugging my sources, and I'm tired.
> I am testing with bt.C.36 in the following InfiniBand environment:
> CA type: MT25204, ports: 1, rate: 20
>
> Please help me.
> Thanks in advance.
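A note on the return value quoted above: according to the libibverbs documentation, ibv_dealloc_pd() returns 0 on success or an errno value on failure, rather than an ibv_wc_status code (the two happen to share the number 16). The short sketch below decodes 16 as an errno, assuming Linux numbering. On that reading the call is reporting EBUSY, which would fit the question about resources still attached to the PD. The function name is illustrative.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return 'NAME: message' for an errno value, e.g. a verbs return code."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


if __name__ == "__main__":
    # 16 is the value reported for ibv_dealloc_pd() in the message above.
    print(describe_errno(16))
```

If the EBUSY reading is right, the usual suspects would be objects created on that PD (such as QPs or memory regions) that have not yet been destroyed or deregistered, which matches the teardown issues mentioned in the reply.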
>

From koop at cse.ohio-state.edu Fri Apr 18 17:03:00 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Fri Apr 18 17:03:07 2008
Subject: [mvapich-discuss] Problems running mpdboot
In-Reply-To: <2C31D3432BD9574D97D9F8437FFCA778E57BD9@USFRPMDOWX011.dow.com>
Message-ID:

Dee,

You may want to follow the MPICH2 setup guide for MPD to diagnose the
problem:

http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-doc-install.pdf

(Check Appendix A for MPD debugging)

You will likely want to try starting the mpd daemons by hand rather than
through mpdboot to figure out setup issues, since that will give you more
output. Simply running 'mpd &' on each node does not "connect" the
daemons into one ring, so only one node ends up in the ring. You should
also check whether you have any firewalls running, etc.

Thanks,
Matt

On Wed, 16 Apr 2008, Dickerson, Dee wrote:

>
> I have installed the latest version of mvapich2 using the
> make.makefile.ofa file. The only changes I made to this file was to
> change g77 to gfortran and the prefix to an NFS mounted directory.
>
> When I run mpdboot I get the following error
>
> node002 31% mpdboot -n 4
> mpdboot_node002 (handle_mpd_output 396): from mpd on node001, invalid
> port info:
>
> I can manually start mpd & on each node but if I run mpiexec -n16
> hostname it responds the hostname on the node 16 times. It does not go
> to the other node.
>
> Any help would be greatly appreciated.
Thank you > > > Dee > _______________________________________________ > Dee Dickerson > Engineering & Process Sciences - Process Optimization > Core R&D > The Dow Chemical Company > B-1603 > Freeport, Texas 77541 > Phone: +1 979-238-4449 > Dickerson4@dow.com > > > From koop at cse.ohio-state.edu Fri Apr 18 17:10:58 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Fri Apr 18 17:11:06 2008 Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else In-Reply-To: <876512660804181031vc80f673jeb4ca84a35af04aa@mail.gmail.com> Message-ID: Laurence, Has the code been run on any other InfiniBand cluster using MVAPICH? Does your code make any sort of system calls or fork? Also, is this code available so that we can try to reproduce and debug? Matt On Fri, 18 Apr 2008, Laurence Marks wrote: > I have an issue running mpi tasks on a new cluster which may be any of > (or a combination of) > a) The intel scalapack libraries > b) The version of OFED and infiniband cards > c) The compiler (ifort) and the architecture > d) Something I've not thought of. > > It's not the code, that is stable and runs fine on other systems. > > Running mvapich, on a dual-quadcore Intel(R) Xeon(R) CPU E5410 > everything works if I run only 1 mpi task per quadcore. 
> If I do 2 or more I get a SIGSEGV within the scalapack call PDSYEVX
> which looks like it is associated with threading:
> libpthread.so.0 00000030D9C0DD40
> libpthread.so.0 00000030D9C0DC1D
> libiomp5.so 00002AAAAB4C1511
>
> Running mvapich2 and/or intelmpi I get a different error,
> Fatal error in MPI_Comm_size: Invalid communicator, error stack:
> MPI_Comm_size(110): MPI_Comm_size(comm=0x5b, size=0xa4ee68) failed
> MPI_Comm_size(69).: Invalid communicator
>
> which I can trace to the scalapack call CALL SL_INIT(ICTXTALL, 1, NPE)
>
> The code ran fine when it was benchmarked a few months ago, and so far
> has been tested (by Intel) on a dual duo-core without problems; the
> engineer is going to use a dual quadcore.
>
> I would appreciate any suggestions as to where to look to try and
> understand what is going on.
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From L-marks at northwestern.edu Fri Apr 18 18:10:36 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Fri Apr 18 18:29:32 2008
Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else
In-Reply-To:
References: <876512660804181031vc80f673jeb4ca84a35af04aa@mail.gmail.com>
Message-ID: <876512660804181510l115d76d3nebbb785b8c379a70@mail.gmail.com>

It's a fairly common code (http://www.wien2k.at/) that's been used on
many computers, but I don't know for certain about InfiniBand and the
more recent dual quadcore systems (I may be the first).

It does not use any system calls or fork UNLESS the most recent intel
scalapack/mkl does -- which it might (I've no idea, I don't know how
they are doing their multithreading).

I have a subset of the code which can be used for debugging purposes;
about 3.5 MB total. It would need to be compiled with "appropriate"
options, but this is fast. I can send this directly to you; sending it
to the list would be inappropriate.

--
Laurence Marks

From bjoo at jlab.org Fri Apr 18 19:50:22 2008
From: bjoo at jlab.org (Balint Joo)
Date: Fri Apr 18 19:50:33 2008
Subject: [mvapich-discuss] Optimised MPI features
Message-ID: <480933BE.7030601@jlab.org>

Dear All,

I have recently returned from a trip to a Cray workshop where we
discussed which features of MPI are optimized. As it turns out, they
like to have receives pre-posted, in which case messages are transferred
directly to the memory space of our application. Also very short
messages (<1Kb) go through a 'fast path' by being directly included in
the message headers.

Our applications use the 'MPI_Request' paradigm to pre-declare
persistent communications. We declare the requests with MPI_Send_init
and MPI_Recv_init and then start and finish them multiple times with
MPI_Startall & MPI_Waitall.

As it turns out MPI_Send_init and MPI_Recv_init do not pre-post the
communications.
The posting apparently only happens when we call MPI_Startall. Thus, if
processes are not tightly synchronized, one process may post its sends
before the others have pre-posted their receives, even though the
receives are always declared before the sends.

Rearranging the comms pattern to always start the receives at the
beginning of the calculation helps if the local problem is big enough.
For the minimal problem (strong scaling), however, the local compute is
small enough that the MPI_Startalls are sufficiently unsynchronised that
the receives do not get pre-posted, although the declarations of the
receives always precede the declarations of the sends.

I would like to ask the developers whether this approach of using
MPI_Startall and MPI_Waitall is well optimized in MVAPICH, or whether
we'd be better off using MPI_Isend / MPI_Irecv pairs. As it happens,
with MVAPICH 1.0.0 over InfiniBand we see poor scaling on Ranger and our
own IB cluster beyond 128-256 cores, though scaling seems near perfect
up to 128 cores.

Other features we use which may be impacting us include:

- Duplicating MPI_COMM_WORLD with MPI_Comm_dup to make a new
  communicator which we call QMP_COMM_WORLD and then using
  QMP_COMM_WORLD thereafter. On the Cray XT and IBM BG/P it appears that
  MPI_COMM_WORLD is 'special' and it knows that all MPI tasks
  communicate. Is MPI_COMM_WORLD special in MVAPICH?

- Using MPI_Cart_create to try and get a virtual topology, assuming that
  it will generate a topology that is somehow close to the machine. On
  InfiniBand (which is a switch or tree of switches) I wonder whether
  this approach is efficient. I have heard that on the Crays it is not
  optimal -- users don't have access to the physical topology and
  MPI_Cart_create may return a topology unrelated to the machine (even
  though it is physically a 3D mesh/torus underneath, if not in terms of
  the current job).

- Recent extensions to our comms interface are considering the use of
  single ended (one-sided) comms primitives from MPI2 (mpich2). Would
  the developers care to express an opinion on this issue?

- Finally: Our problem is a closely coupled stencil-like calculation.
  The reason for using the 'MPI_Request' paradigm is to pre-declare
  (preferably pre-post) an asynchronous communications pattern to
  essentially communicate our halos. We can then overlap the
  communication (initiated by MPI_Startall) with computation, and once
  the computation finishes we complete the communication with
  MPI_Waitall.

I would very much appreciate views from the developers regarding these
issues. We'd like to raise our performance on our own IB clusters as
well as on Ranger, or at least to improve our scaling beyond 256 cores.
On Ranger our target is at least 2048 cores. We wonder whether our
paradigm of using MPI_Requests is in some way inhibiting our
performance.

I look forward with thanks to comments on this issue.

With my very best wishes,
Balint

--
-------------------------------------------------------------------
Dr Balint Joo High Performance Computational Scientist
Jefferson Lab
12000 Jefferson Ave, Mail Stop 12B2, Room F217,
Newport News, VA 23606, USA
Tel: +1-757-269-5339, Fax: +1-757-269-5427
email: bjoo@jlab.org (old email: bj@ph.ed.ac.uk)
-------------------------------------------------------------------

From L-marks at northwestern.edu Sat Apr 19 18:15:08 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Sat Apr 19 18:19:18 2008
Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else
In-Reply-To: <876512660804181031vc80f673jeb4ca84a35af04aa@mail.gmail.com>
References: <876512660804181028k740136e3y1deecf552034b78b@mail.gmail.com> <876512660804181031vc80f673jeb4ca84a35af04aa@mail.gmail.com>
Message-ID: <876512660804191515t7d7bbf91la4c8cc0216904c2a@mail.gmail.com>

Does mvapich need -lpthread?
This appears to be defined in mpid/ch_gen2/Makefile:BASE_LIB_LIST

--
Laurence Marks

From koop at cse.ohio-state.edu Sun Apr 20 00:41:24 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Sun Apr 20 00:41:35 2008
Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else
In-Reply-To: <876512660804191515t7d7bbf91la4c8cc0216904c2a@mail.gmail.com>
Message-ID:

Yes, this is needed. If you use the make.mvapich.gen2 script to build
and then use mpicc, it should be included automatically though.

Matt

From L-marks at northwestern.edu Sun Apr 20 09:10:51 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Sun Apr 20 09:12:32 2008
Subject: [mvapich-discuss] Re: Architecture compatibility or maybe something else
In-Reply-To:
References: <876512660804191515t7d7bbf91la4c8cc0216904c2a@mail.gmail.com>
Message-ID: <876512660804200610kfc96fc7qb8c477094ff7a691@mail.gmail.com>

Thanks. I'm trying to trace where pthreads is included (since maybe my
current problems are somehow thread related), and this appears to be the
only location where it is used. Is there any way (as a debugging step)
to exclude it?

Also, what really is the meaning of
Fatal error in MPI_Comm_size: Invalid communicator?

--
Laurence Marks

From L-marks at northwestern.edu Sun Apr 20 11:51:09 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Sun Apr 20 11:51:22 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
Message-ID: <876512660804200851p408fe7b6v2867dd2bbad5d144@mail.gmail.com>

By replacing the Intel scalapack calls with scalapack version 1.7 I
managed to trace the problem to the call from pdstebz to
BLACS_GRIDEXIT; if I comment out this line (at the bottom of pdstebz)
everything works fine.

I'm not sure if this is a mvapich issue; maybe someone can tell me.

--
Laurence Marks

From koop at cse.ohio-state.edu Sun Apr 20 15:05:46 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Sun Apr 20 15:05:54 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
In-Reply-To: <876512660804200851p408fe7b6v2867dd2bbad5d144@mail.gmail.com>
Message-ID:

Laurence,

Just to try to narrow this down, can you try running with
VIADEV_USE_SHMEM_COLL=0?

e.g.
mpirun_rsh -np 4 -hostfile ./h VIADEV_USE_SHMEM_COLL=0 ./exec

Matt

From L-marks at northwestern.edu Sun Apr 20 18:10:36 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Sun Apr 20 18:10:48 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
In-Reply-To:
References: <876512660804200851p408fe7b6v2867dd2bbad5d144@mail.gmail.com>
Message-ID: <876512660804201510j2087226dj2f637e3a79524749@mail.gmail.com>

mpirun_rsh -np 4 -hostfile .machine1 VIADEV_USE_SHMEM_COLL=0 ./exec works.

Can you expand a little, since I have no idea what this is doing? For
instance, does it suggest that something is not appropriate in how OFED
is installed, mvapich being compiled by icc, Intel's scalapack, or all
(or none) of the above?

--
Laurence Marks

From mamidala at cse.ohio-state.edu Sun Apr 20 20:35:08 2008
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Sun Apr 20 20:35:18 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
In-Reply-To: <876512660804201510j2087226dj2f637e3a79524749@mail.gmail.com>
Message-ID:

Hi Laurence,

I might have missed this. Which version of mvapich did you try? We had a
bug while running scalapack and we applied a patch for it in the latest
trunk. Please let us know.

Thanks,
Amith

From yogyas at gmail.com Mon Apr 21 01:45:44 2008
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Mon Apr 21 01:46:04 2008
Subject: [mvapich-discuss] uDAPL SRQ support in MVAPICH2 ?
Message-ID:

Hi all,

Is SRQ support for uDAPL available in MVAPICH2?
If it is available, what is its status: are all features implemented, or
only some?

Is this feature available by default, or is some flag required?
Kindly share the information.

Thanks in advance.
-Yogeshwar

From L-marks at northwestern.edu Mon Apr 21 09:20:34 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Mon Apr 21 09:21:00 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
In-Reply-To:
References: <876512660804201510j2087226dj2f637e3a79524749@mail.gmail.com>
Message-ID: <876512660804210620j610a5b95xc808ab02876edf2d@mail.gmail.com>

For mvapich2, configure.in says 1.0.5, although the top of CHANGELOG
has:

MVAPICH2-1.0.2 (02/20/08)

* Change the default MV2_DAPL_PROVIDER to OpenIB-cma

For mvapich1, configure.in does not include the version number
(tut-tut); the first entry in CHANGELOG is:

10/29/2007
* Added the typo patch provided by pat latifi@qlogic

--
Laurence Marks

From mamidala at cse.ohio-state.edu Mon Apr 21 10:17:28 2008
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Mon Apr 21 10:17:36 2008
Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug
In-Reply-To: <876512660804210620j610a5b95xc808ab02876edf2d@mail.gmail.com>
Message-ID:

Hi Laurence,

Looks like you may be working with an older copy. Can you check out the
latest mvapich from the trunk and give it a try?

svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk mvapich

Thanks,
Amith

From chai.15 at osu.edu Mon Apr 21 10:22:21 2008
From: chai.15 at osu.edu (LEI CHAI)
Date: Mon Apr 21 10:23:13 2008
Subject: [mvapich-discuss] uDAPL SRQ support in MVAPICH2 ?
In-Reply-To:
References:
Message-ID:

Hi,

The udapl device in MVAPICH2 does not support SRQ. The gen2 device
supports all the features in MVAPICH2 including SRQ.

Lei

From leonleon77 at gmail.com Mon Apr 21 11:19:27 2008
From: leonleon77 at gmail.com (leon zadorin)
Date: Mon Apr 21 11:19:38 2008
Subject: [mvapich-discuss] dynamic process connections (accept/connect or MPI_Comm_join) and Infiniband...
In-Reply-To: <1208371652.2697.3.camel@t13.nowlab.cis.ohio-state.edu>
References: <26d2cb010804151706p3973e774ia8deb83e9d0f3ea3@mail.gmail.com>
 <1208371652.2697.3.camel@t13.nowlab.cis.ohio-state.edu>
Message-ID: <26d2cb010804210819h52d4f407v761a790a68929417@mail.gmail.com>

On 4/17/08, Jaidev Sridhar wrote:
> Leon,
>
> I believe you missed Dr. Panda's earlier response
> (http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2008-April/001568.html).
> I've attached his response. Do let us know if you have any other queries.

ah - great - thanks for that.

>---------- Forwarded message ----------
>From: Dhabaleswar Panda
>To: mvapich-discuss@cse.ohio-state.edu
>Date: Tue, 15 Apr 2008 08:52:47 -0400 (EDT)
>Subject: [mvapich-discuss] [mvapich-commit] dynamic process connections (accept/connect or
>MPI_Comm_join) and Infiniband... (fwd)
>...
>Regarding your question, dynamic process management over native IB is not
>available with MVAPICH2 yet. We are working on it and it will be available
>in future releases. You can try the TCP/IP interface of MVAPICH2 (which is
>equivalent to MPICH2).
>

Thanks, I see :-)

Any generic timeframe on future releases which might support the dynamic
process connectivity over infiniband? I guess it will depend on how high on
the 'stack of priorities' this issue is positioned :-)

In the meantime - is anyone aware of any other libs which may support
MPI_Comm_join for bootstrapping MPI message deliveries over Infiniband?

Once again - thanks for your feedback.
Leon.

From jhawkes at penguincomputing.com Tue Apr 22 13:01:19 2008
From: jhawkes at penguincomputing.com (John Hawkes)
Date: Tue Apr 22 13:01:47 2008
Subject: [mvapich-discuss] race in mvapich-0.9.9 cm_create_rc_qp() with viadev.connections==NULL
Message-ID: <1208883679.25709.14.camel@jhawkes>

I've encountered a race condition in mvapich-0.9.9 (also exists in
mvapich-1.0) in cm_create_rc_qp() (mpid/ch_gen2/cm.c).
On occasion, under conditions of dozens of threads starting up,
cm_create_rc_qp() encounters viadev.connections==NULL.

I believe the problem stems from the ordering of initialization. The
main viainit.c calls:

    if (MPICM_Connect_UD(viadev.ud_qpn_table, viadev.lid_table)) {
        error_abort_all(GEN_EXIT_ERR, "MPICM_Connect_UD");
    }

and soon thereafter it initializes viadev.connections. Meanwhile,
MPICM_Connect_UD() has done a pthread_create() of cm_completion_handler().
That concurrently executing thread handles incoming messages, one of
which may get to cm_accept(), which then calls cm_create_rc_qp(), which
may dereference viadev.connections before the main thread has
initialized it.

I seem to be able to avoid this race condition by moving the call to
MPICM_Connect_UD() to follow the initialization of viadev.connections.
Does this fix create other problems that my current testing has not yet
encountered?

John Hawkes
jhawkes@PenguinComputing.com

From koop at cse.ohio-state.edu Wed Apr 23 11:27:12 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed Apr 23 11:27:23 2008
Subject: [mvapich-discuss] race in mvapich-0.9.9 cm_create_rc_qp() with viadev.connections==NULL
In-Reply-To: <1208883679.25709.14.camel@jhawkes>
Message-ID:

John,

Thanks for reporting this problem and looking into a possible solution.
This does appear to be a race condition in the initialization of
viadev.connections. We'll add this as a bug report and fix this in the
very near future.

Thanks again,
Matt

On Tue, 22 Apr 2008, John Hawkes wrote:

> I've encountered a race condition in mvapich-0.9.9 (also exists in
> mvapich-1.0) in cm_create_rc_qp() (mpid/ch_gen2/cm.c). On occasion,
> under conditions of dozens of threads starting up, cm_create_rc_qp()
> encounters viadev.connections==NULL.
>
> I believe the problem stems from the ordering of initialization. The
> main viainit.c calls:
>     if (MPICM_Connect_UD(viadev.ud_qpn_table, viadev.lid_table)) {
>         error_abort_all(GEN_EXIT_ERR, "MPICM_Connect_UD");
>     }
> and soon thereafter it initializes viadev.connections. Meanwhile,
> MPICM_Connect_UD() has done a pthread_create() of
> cm_completion_handler(). That concurrently executing thread handles
> incoming messages, one of which may get to cm_accept(), which then
> calls cm_create_rc_qp(), which may dereference viadev.connections
> before the main thread has initialized it.
>
> I seem to be able to avoid this race condition by moving the call to
> MPICM_Connect_UD() to follow the initialization of viadev.connections.
> Does this fix create other problems that my current testing has not yet
> encountered?
>
> John Hawkes
> jhawkes@PenguinComputing.com
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From L-marks at northwestern.edu Thu Apr 24 10:42:32 2008
From: L-marks at northwestern.edu (Laurence Marks)
Date: Thu Apr 24 10:42:43 2008
Subject: [mvapich-discuss] Any plausible optimizations?
Message-ID: <876512660804240742n3ece5d6wd161db6820b958d2@mail.gmail.com>

For a small cluster of Dual Quad-Core E5410 @ 2.33GHz 1333 FSB with 8GB
667MHz DDR2 FB-DIMM the total time for mpi jobs is scaling roughly as

    Total Time ~ C1/(Total Number of mpi) * ([Jobs per Bus]**F1) + C2

Where
    Jobs per Bus = max(1, [mpi per Node]/2), i.e. dual bus architecture
    C1 and C2 depend upon the specific benchmark (C1 >> C2)
    F1 ~ 0.6

For reference, I'm using mvapich (not mpd) and infiniband; the infiniband
is fast enough.

Among the slew of different possible optimization options in mvapich, I
wonder if there is anything which might increase F1, i.e. what I understand
as the FSB limiting term?
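
As a rough worked example of the scaling expression above (a sketch only: the per-node process counts are illustrative assumptions, not measurements from this cluster), holding the total number of MPI processes fixed and varying only how many are placed on each node changes the first term as follows. The symbols N (total number of MPI processes), n (processes per node) and J (Jobs per Bus) are shorthand introduced here, not notation from the original message.

```latex
\[
  T \approx \frac{C_1}{N}\, J^{F_1} + C_2,
  \qquad J = \max\!\left(1, \tfrac{n}{2}\right),
  \qquad F_1 \approx 0.6
\]
\[
  n = 2:\; J = 1,\; J^{F_1} = 1
  \qquad
  n = 4:\; J = 2,\; J^{F_1} = 2^{0.6} \approx 1.52
  \qquad
  n = 8:\; J = 4,\; J^{F_1} = 4^{0.6} \approx 2.30
\]
```

Under this model, filling all eight cores of a node makes the C1 term about 2.3 times larger than running two processes per node with the same N, and a smaller F1 would shrink that factor.
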
-- Laurence Marks Department of Materials Science and Engineering MSE Rm 2036 Cook Hall 2220 N Campus Drive Northwestern University Evanston, IL 60208, USA Tel: (847) 491-3996 Fax: (847) 491-7820 email: L-marks at northwestern dot edu Web: www.numis.northwestern.edu Commission on Electron Diffraction of IUCR www.numis.northwestern.edu/IUCR_CED From L-marks at northwestern.edu Thu Apr 24 10:57:59 2008 From: L-marks at northwestern.edu (Laurence Marks) Date: Thu Apr 24 10:58:10 2008 Subject: [mvapich-discuss] Re: Architecture compatibility -- looks like Intel mkl Blacs_GridExit bug In-Reply-To: References: <876512660804210620j610a5b95xc808ab02876edf2d@mail.gmail.com> Message-ID: <876512660804240757mc46a3e9vd79f025f1c3b89a9@mail.gmail.com> That works. Thanks. On Mon, Apr 21, 2008 at 9:17 AM, amith rajith mamidala wrote: > Hi Laurence, > > Looks like you may be working with an older copy. Can you checkout the > latest mvapich from the trunk and give it a try? > > svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk mvapich > > Thanks, > Amith > > > > > On Mon, 21 Apr 2008, Laurence Marks wrote: > > > For mvapich2 in configure.in it says 1.0.5, although at the top of > > CHANGELOG it has > > > > MVAPICH2-1.0.2 (02/20/08) > > > > * Change the default MV2_DAPL_PROVIDER to OpenIB-cma > > > > > > For mvapich1 configure.in does not include the version number (tut-tut), > > the first entry in CHANGELOG is: > > > > 10/29/2007 > > * Added the typo patch provided by pat latifi@qlogic > > > > > > On Sun, Apr 20, 2008 at 7:35 PM, amith rajith mamidala > > wrote: > > > Hi Laurence, > > > > > > I might have missed this.. Which version of mvapich did you try? We had a > > > bug while running scalapack and we applied a patch for it in the latest > > > trunk. Please let us know.. > > > > > > Thanks, > > > Amith > > > > > > > > > > > > On Sun, 20 Apr 2008, Laurence Marks wrote: > > > > > > > mpirun_rsh -np 4 -hostfile .machine1 VIADEV_USE_SHMEM_COLL=0 ./exec works. 
> > > > > > > > Can you expand a little, since I have no idea what this is doing. For > > > > instance does it suggest that something is not appropriate in how OFED > > > > is installed, mvapich being compiled by icc, Intel's scalapack, or all > > > > (or none) of the above? > > > > > > > > > > > > On Sun, Apr 20, 2008 at 2:05 PM, Matthew Koop wrote: > > > > > Laurence, > > > > > > > > > > Just to try to narrow this down, can you try running with > > > > > VIADEV_USE_SHMEM_COLL=0 ? > > > > > > > > > > e.g. > > > > > mpirun_rsh -np 4 -hostfile ./h VIADEV_USE_SHMEM_COLL=0 ./exec > > > > > > > > > > Matt > > > > > > > > > > > > > > > > > > > > On Sun, 20 Apr 2008, Laurence Marks wrote: > > > > > > > > > > > By replacing the Intel scalapack calls by scalapack version 1.7 I > > > > > > managed to trace the problem to the call from pdstebz to > > > > > > BLACS_GRIDEXIT; if I comment out this line (at the bottom of pdstebz) > > > > > > everything works fine. > > > > > > > > > > > > I'm not sure if this is a mvapich issue, maybe someone can tell me. 
> > > > > > > > > > > > > > > > > > -- > > > > > > Laurence Marks > > > > > > Department of Materials Science and Engineering > > > > > > MSE Rm 2036 Cook Hall > > > > > > 2220 N Campus Drive > > > > > > Northwestern University > > > > > > Evanston, IL 60208, USA > > > > > > Tel: (847) 491-3996 Fax: (847) 491-7820 > > > > > > email: L-marks at northwestern dot edu > > > > > > Web: www.numis.northwestern.edu > > > > > > Commission on Electron Diffraction of IUCR > > > > > > www.numis.northwestern.edu/IUCR_CED > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Laurence Marks > > > > Department of Materials Science and Engineering > > > > MSE Rm 2036 Cook Hall > > > > 2220 N Campus Drive > > > > Northwestern University > > > > Evanston, IL 60208, USA > > > > Tel: (847) 491-3996 Fax: (847) 491-7820 > > > > email: L-marks at northwestern dot edu > > > > Web: www.numis.northwestern.edu > > > > Commission on Electron Diffraction of IUCR > > > > www.numis.northwestern.edu/IUCR_CED > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > -- > > Laurence Marks > > Department of Materials Science and Engineering > > MSE Rm 2036 Cook Hall > > 2220 N Campus Drive > > Northwestern University > > Evanston, IL 60208, USA > > Tel: (847) 491-3996 Fax: (847) 491-7820 > > email: L-marks at northwestern dot edu > > Web: www.numis.northwestern.edu > > Commission on Electron Diffraction of IUCR > > www.numis.northwestern.edu/IUCR_CED > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > 
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

--
Laurence Marks
Department of Materials Science and Engineering
MSE Rm 2036 Cook Hall
2220 N Campus Drive
Northwestern University
Evanston, IL 60208, USA
Tel: (847) 491-3996 Fax: (847) 491-7820
email: L-marks at northwestern dot edu
Web: www.numis.northwestern.edu
Commission on Electron Diffraction of IUCR
www.numis.northwestern.edu/IUCR_CED

From Lars.Paul.Huse at Sun.COM Fri Apr 25 04:10:21 2008
From: Lars.Paul.Huse at Sun.COM (Lars Paul Huse)
Date: Fri Apr 25 10:26:17 2008
Subject: [mvapich-discuss] mvapich alltoall(v)
Message-ID: <481191ED.4090600@Sun.COM>

Hi all,

In a large IB fabric we are observing application performance
degradation (using mvapich 0.9.9) that seems to be correlated to running
MPI_Alltoall or MPI_Alltoallv. To get a uniform traffic pattern
MPI_Alltoall(v) might be aka' *):

MPI_Alltoall(v)
{
    MPI_Sendrecv(to myself);
    if (size > 1) {
        for (i = 1; i < size; i++)
            MPI_Irecv(source = (rank+i) % size);
        for (i = 1; i < size; i++)
            MPI_Isend(destin = (size + rank - i) % size);
        MPI_Waitall(2*(size-1));
    }
}

I assume that the collectives are implemented in the int??_fns.c files
in the source tree. For the default implementation of MPI_Alltoallv in
mvapich-0.9.9/src/coll/intra_fns.c things look ok'ish, but for what I
assume is the IB relevant implementation in
mvapich-0.9.9/mpid/vapi/intra_fns.c the source & destin indexes are
equal to the loop-counter i.e. aka' stressing one MPI process at a
time. Can someone please confirm or disprove my observation.

/lars paul

PS! Please bear with me for my lack of basic MVAPICH knowledge.
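
The staggered pairing in the pseudo-code above can be checked without an MPI installation. The following is a minimal sketch in plain C; the helper names recv_source, send_dest and max_senders_per_dest are made up for this illustration and are not MVAPICH or MPI functions. It only computes which peer each rank targets at step i and counts how many ranks hit the same destination, comparing the staggered rule with a rule where every rank uses the loop counter as the destination.

```c
#include <assert.h>
#include <stdlib.h>

/* Peer that `rank` receives from at step i (staggered rule from the
 * pseudo-code: source = (rank + i) % size). */
int recv_source(int rank, int i, int size)
{
    assert(size > 0 && rank >= 0 && rank < size && i >= 0 && i < size);
    return (rank + i) % size;
}

/* Peer that `rank` sends to at step i (staggered rule from the
 * pseudo-code: destin = (size + rank - i) % size). */
int send_dest(int rank, int i, int size)
{
    assert(size > 0 && rank >= 0 && rank < size && i >= 0 && i < size);
    return (size + rank - i) % size;
}

/* Largest number of ranks that target one and the same destination at
 * step i.
 *   staggered != 0: use send_dest() above.
 *   staggered == 0: hypothetical "loop counter" rule, every rank uses
 *                   destination i (the rank's own self-send is counted). */
int max_senders_per_dest(int i, int size, int staggered)
{
    int *hits;
    int rank, worst = 0;

    assert(size > 0 && i >= 0 && i < size);
    hits = calloc((size_t)size, sizeof *hits);
    assert(hits != NULL);

    for (rank = 0; rank < size; rank++) {
        int dest = staggered ? send_dest(rank, i, size) : i;
        hits[dest]++;
        if (hits[dest] > worst)
            worst = hits[dest];
    }

    free(hits);
    return worst;
}
```

For example, with size = 8 and i = 1, max_senders_per_dest(1, 8, 1) is 1 (each destination has a single sender), while max_senders_per_dest(1, 8, 0) is 8 (every rank targets rank 1 at that step). Whether the vapi code path really behaves like the second rule is the open question in this message; the sketch does not settle that.
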
*) other source/destin sequencing might give better performance :-)

From mamidala at cse.ohio-state.edu Fri Apr 25 11:08:33 2008
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Fri Apr 25 11:08:44 2008
Subject: [mvapich-discuss] mvapich alltoall(v)
In-Reply-To: <481191ED.4090600@Sun.COM>
Message-ID:

Hi Lars,

We have optimized MPI_Alltoall and MPI_Alltoallv in the latest mvapich-1.0
branch. Can you download it and see if you get better performance? Also,
I assume that you are using the gen2 device and not vapi.

Thanks,
Amith

On Fri, 25 Apr 2008, Lars Paul Huse wrote:

> Hi all,
>
> In a large IB fabric we are observing application performance
> degradation (using mvapich 0.9.9) that seems to be correlated to running
> MPI_Alltoall or MPI_Alltoallv. To get a uniform traffic pattern
> MPI_Alltoall(v) might be aka' *):
>
>
> MPI_Alltoall(v)
> {
>     MPI_Sendrecv(to myself);
>     if (size > 1) {
>         for (i = 1; i < size; i++)
>             MPI_Irecv(source = (rank+i) % size);
>         for (i = 1; i < size; i++)
>             MPI_Isend(destin = (size + rank - i) % size);
>         MPI_Waitall(2*(size-1));
>     }
> }
>
> I assume that the collectives are implemented in the int??_fns.c files
> in the source tree. For the default implementation of MPI_Alltoallv in
> mvapich-0.9.9/src/coll/intra_fns.c things look ok'ish, but for what I
> assume is the IB relevant implementation in
> mvapich-0.9.9/mpid/vapi/intra_fns.c the source & destin indexes are
> equal to the loop-counter i.e. aka' stressing one MPI process at a
> time. Can someone please confirm or disprove my observation.
>
> /lars paul
>
> PS! Please bear with me for my lack of basic MVAPICH knowledge.
>
> *) other source/destin sequencing might give better performance :-)
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From Craig.Tierney at noaa.gov Fri Apr 25 11:44:01 2008
From: Craig.Tierney at noaa.gov (Craig Tierney)
Date: Fri Apr 25 11:44:13 2008
Subject: [mvapich-discuss] Problem with fabric combining DDR and SDR cards
Message-ID: <4811FC41.7090100@noaa.gov>

I have an SDR based fabric running OFED-1.2.5.1 and MVAPICH (both 1.0 and
1.0.2p1). My vendor sent a DDR card as a replacement for a failed SDR
and said 'it should just work'. I tried to use it, but I am not able to
run jobs. I get the following error as codes start up:

send desc error
[0] Abort: [] Got completion with error 9, vendor code=8a, dest rank=2
 at line 513 in file ibv_channel_manager.c
rank 0 in job 1 w347_44628 caused collective abort of all ranks
  exit status of rank 0: killed by signal 9

The codes are able to start (for instance HPL is able to print its
headers). This problem happens using both 1.0 and 1.0.2p1. It does not
happen with OpenMPI-1.2.4.

Should I be able to combine DDR and SDR cards in the same fabric and
run jobs across them? Are there any performance issues with this (not
with things running at DDR, but running worse than SDR)?

Thanks,
Craig

--
Craig Tierney (craig.tierney@noaa.gov)

From stevejones at stanford.edu Sun Apr 27 21:26:00 2008
From: stevejones at stanford.edu (Steve Jones)
Date: Sun Apr 27 21:26:12 2008
Subject: [mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn
Message-ID: <20080427182600.3p26nn8uo804wwoo@webmail.stanford.edu>

Hi.

I'm receiving an error on a number of Intel MPI Benchmark (IMB) jobs
that result in a PMGR_COLLECTIVE ERROR, shown below. The job failure is
not constant, I'm able to run the benchmark on a large number of nodes,
it seems to only error on sets of nodes.
Can you provide more detail on this error? I'm using MVAPICH 1.0gen2 OFED 1.2.5 on RHEL4 2.6.9-55.0.12 The start command is $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE ./IMB-MPI1 mpispawn.c:303 Unexpected exit status Exit code -1 signaled from COMPUTE-1-3 Killing remote processes...PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file pmgr_collective_mpispawn.c:137 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file pmgr_collective_mpispawn.c:137 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121PMGR_COLLECTIVE ERROR: 
reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 DONE Signal 15 received. Signal 15 received. Signal 15 received. Signal 15 received. Signal 15 received. Signal 15 received. Signal 15 received. Signal 15 received. From sridharj at cse.ohio-state.edu Sun Apr 27 22:29:13 2008 From: sridharj at cse.ohio-state.edu (Jaidev Sridhar) Date: Sun Apr 27 22:29:25 2008 Subject: [mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn In-Reply-To: <20080427182600.3p26nn8uo804wwoo@webmail.stanford.edu> References: <20080427182600.3p26nn8uo804wwoo@webmail.stanford.edu> Message-ID: <48153679.4000607@cse.ohio-state.edu> Steve, On Sunday 27 April 2008 09:26 PM, Steve Jones wrote: > Hi. > > I'm receiving an error on a number of Intel MPI Benchmark (IMB) jobs > that result in a PMGR_COLLECTIVE ERROR, shown below. 
The job failure is > not constant, I'm able to run the benchmark on a large number of nodes, > it seems to only error on sets of nodes. Can you provide more detail on > this error? > > I'm using MVAPICH 1.0gen2 OFED 1.2.5 on RHEL4 2.6.9-55.0.12 > The start command is $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE > ./IMB-MPI1 > > mpispawn.c:303 Unexpected exit status This error message indicates that one of the processes terminated / was unable to start for some reason. We catch this and kill the other processes which is what caused the later messages. Do you see a reason why some processes are failing to start? A faulty node perhaps? You might want to try narrowing it down to the node(s) that are causing this. If you need anymore help, do let us know. -Jaidev > Exit code -1 signaled from COMPUTE-1-3 > Killing remote processes...PMGR_COLLECTIVE ERROR: reading from (read() > Success errno=0) @ file pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: unexpected value: received 0, expecting 7 @ file > pmgr_collective_mpispawn.c:137 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: 
unexpected value: received 0, expecting 7 @ file > pmgr_collective_mpispawn.c:137 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121PMGR_COLLECTIVE ERROR: reading from > (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: reading from (read() > Success errno=0) @ file pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > PMGR_COLLECTIVE ERROR: reading from (read() Success errno=0) @ file > pmgr_collective_mpispawn.c:121 > reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 > > reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 > reading from (read() Success errno=0) @ file pmgr_collective_mpispawn.c:121 > DONE > Signal 15 received. > Signal 15 received. > Signal 15 received. > Signal 15 received. > Signal 15 received. 
> Signal 15 received. > Signal 15 received. > Signal 15 received. > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From stevejones at stanford.edu Sun Apr 27 22:42:17 2008 From: stevejones at stanford.edu (Steve Jones) Date: Sun Apr 27 22:42:29 2008 Subject: [mvapich-discuss] PMGR_COLLECTIVE ERROR - pmgr_collective_mpispawn In-Reply-To: <48153679.4000607@cse.ohio-state.edu> References: <20080427182600.3p26nn8uo804wwoo@webmail.stanford.edu> <48153679.4000607@cse.ohio-state.edu> Message-ID: <20080427194217.q6csv4q5oog88o88@webmail.stanford.edu> >> I'm receiving an error on a number of Intel MPI Benchmark (IMB) >> jobs that result in a PMGR_COLLECTIVE ERROR, shown below. The job >> failure is not constant, I'm able to run the benchmark on a large >> number of nodes, it seems to only error on sets of nodes. Can you >> provide more detail on this error? >> >> I'm using MVAPICH 1.0gen2 OFED 1.2.5 on RHEL4 2.6.9-55.0.12 >> The start command is $ mpirun_rsh -np 136 -hostfile $PBS_NODEFILE ./IMB-MPI1 >> >> mpispawn.c:303 Unexpected exit status > > This error message indicates that one of the processes terminated / was > unable to start for some reason. We catch this and kill the other > processes which is what caused the later messages. > > Do you see a reason why some processes are failing to start? A faulty > node perhaps? You might want to try narrowing it down to the node(s) > that are causing this. > > If you need anymore help, do let us know. > > -Jaidev Hi Jaidev. This makes sense as I've been able to locate a few nodes with mismatched firmware. The job error rate has already decreased and I'm looking for the rest of the node issues. Thanks again for the sanity check. 
Steve

From yiannis.georgiou at imag.fr Mon Apr 28 11:20:55 2008
From: yiannis.georgiou at imag.fr (Yiannis Georgiou)
Date: Mon Apr 28 11:20:26 2008
Subject: [mvapich-discuss] mvapich2 installation with BLCR support
Message-ID: <4815EB57.3030505@imag.fr>

Hello,

I'm trying to install MVAPICH2 with BLCR support upon an infiniband cluster
and I would like to ask if it is possible to do this installation without
the OFED packages installed?

Some details for the system:

Debian kernel 2.6.22-3-amd64
gcc version 4.2.3 (Debian 4.2.3-2)
InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
compatibility mode) (rev a0)
ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
blcr version 0.6.5

I am trying for mvapich2 version 1.0.2p1

The problem is that I cannot install successfully the official OFED
distribution, needed to use the script "make.mvapich2.ofa" ....but I think
that a lot of the OFED packages can be found as debian packages....That's
why I would like to ask which are the OFED packages that are needed for the
installation and if there is a different procedure that I could follow
other than changing the script "make.mvapich2.ofa" to point to the correct
values, for the installation-configuration....

Thank you in advance for your help...
Regards,

yiannis

--

Yiannis Georgiou LIG Laboratory / MESCAL Project
Yiannis.Georgiou@imag.fr http://mescal.imag.fr/ +33 (0)4.76.61.20.33
FRANCE

From perkinjo at cse.ohio-state.edu Mon Apr 28 13:07:17 2008
From: perkinjo at cse.ohio-state.edu (Jonathan L. Perkins)
Date: Mon Apr 28 13:07:27 2008
Subject: [mvapich-discuss] mvapich2 installation with BLCR support
In-Reply-To: <4815EB57.3030505@imag.fr>
References: <4815EB57.3030505@imag.fr>
Message-ID: <20080428170716.GF7442@cse.ohio-state.edu>

On Mon, Apr 28, 2008 at 05:20:55PM +0200, Yiannis Georgiou wrote:
> Hello,
>
> I'm trying to install MVAPICH2 with BLCR support upon an infiniband cluster
> and I would like to ask if it is possible to do this installation without
> the OFED packages installed?

Yes, if you have the corresponding packages installed from other sources.

>
> Some details for the system:
>
> Debian kernel 2.6.22-3-amd64
> gcc version 4.2.3 (Debian 4.2.3-2)
> InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex (Tavor
> compatibility mode) (rev a0)
> ib_mthca: Mellanox InfiniBand HCA driver v0.08 (February 14, 2006)
> blcr version 0.6.5
>
> I am trying for mvapich2 version 1.0.2p1
>
> The problem is that I cannot install successfully the official OFED
> distribution, needed to use the script "make.mvapich2.ofa" ....but I think
> that a lot of the OFED packages can be found as debian packages....That's
> why I would like to ask which are the OFED packages that are needed for the
> installation and if there is a different procedure that I could follow
> other than changing the script "make.mvapich2.ofa" to point to the correct
> values, for the installation-configuration....

You'll need to have ibumad and ibverbs installed. Of course you'll also
need the BLCR toolkit installed. Please see section 4.4.1 of our
userguide for more information in regard to building MVAPICH2.

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2.html#x1-100004.4.1

Hope this helps.

>
> Thank you in advance for your help...
> Regards,
>
> yiannis
>
> --
>
> Yiannis Georgiou LIG Laboratory / MESCAL Project
> Yiannis.Georgiou@imag.fr http://mescal.imag.fr/ +33 (0)4.76.61.20.33
> FRANCE
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From yiannis.georgiou at imag.fr Tue Apr 29 06:37:19 2008
From: yiannis.georgiou at imag.fr (yiannis georgiou)
Date: Tue Apr 29 06:39:13 2008
Subject: [mvapich-discuss] mvapich2 installation with BLCR support
In-Reply-To: <20080428170716.GF7442@cse.ohio-state.edu>
References: <4815EB57.3030505@imag.fr> <20080428170716.GF7442@cse.ohio-state.edu>
Message-ID: <20080429123719.u6x9tt5mja8gcgk0@webmail.imag.fr>

Hello, and thanks for the answer; it helped a lot!

I finally installed the ibverbs libraries using the Debian packages
and ibumad using the source files from openfabrics.
I also had to add --disable-f77 --disable-f90 to the configure options in
the make.mvapich2.ofa script.

The installation finished successfully!

I now have another problem:

After the initialization phase I try to run the test ./cpi and I get
the following errors:

-----------------
g5k@bordeplage-9:~/mvapich2-1.0.2/examples$ mpdboot -n 2 -f ~/nodes
g5k@bordeplage-9:~/mvapich2-1.0.2/examples$ mpdtrace
bordeplage-9
bordeplage-7
g5k@bordeplage-9:~/mvapich2-1.0.2/examples$ /usr/local/mvapich2/bin/mpirun -machinefile ~/nodes -n 2 ./cpi
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
Fatal error in MPI_Init:
Other MPI error, error stack:
MPIR_Init_thread(259)...........: Initialization failed
MPID_Init(102)..................: channel initialization failed
MPIDI_CH3_Init(178).............:
MPIDI_CH3I_CM_Init(878).........: rdma_get_control_parameters
rdma_get_control_parameters(434):
rdma_open_hca(382)..............: Failed to open HCA number 0
rank 1 in job 1 bordeplage-9.bordeaux.grid5000.fr_34110 caused collective abort of all ranks
exit status of rank 1: return code 1
----------------

I also tried to run another program, previously compiled using
/usr/local/mvapich2/bin/mpicc, and I get the following:

----------------
g5k@bordeplage-9:~$ /usr/local/mvapich2/bin/mpirun -machinefile ~/nodes -n 2 BENCHS/esp-2.1.1/src-mvapich2/pchksum -t 5
Attempting to use an MPI routine before initializing MPICH
Attempting to use an MPI routine before initializing MPICH
----------------

Any ideas where those problems may come from?

Regards,
yiannis
--
Yiannis Georgiou                  LIG Laboratory / MESCAL Project
Yiannis.Georgiou@imag.fr          http://mescal.imag.fr/
+33 (0)4.76.61.20.33              FRANCE

From huanwei at cse.ohio-state.edu Tue Apr 29 09:51:37 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Tue Apr 29 09:51:47 2008
Subject: [mvapich-discuss] mvapich2 installation with BLCR support
In-Reply-To: <20080429123719.u6x9tt5mja8gcgk0@webmail.imag.fr>
Message-ID:

Hi Yiannis,

You need to increase the memlock limit (the size of memory that a user level
program is allowed to lock) on your system. Please refer to Section 8.2.4
in our user guide to change it:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2.html#x1-480008.2.4

Let us know if it works for you.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501
From koop at cse.ohio-state.edu Tue Apr 29 12:25:07 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Tue Apr 29 12:25:17 2008
Subject: [mvapich-discuss] Problem with fabric combining DDR and SDR cards
In-Reply-To: <4811FC41.7090100@noaa.gov>
Message-ID:

Craig,

So are you running with MVAPICH2? Currently MVAPICH2 requires an
additional environment variable when using cards of different types:

MV2_DEFAULT_MTU=IBV_MTU_1024

We will be adding support for cards of different speeds and types.
MVAPICH 1.0 already has this support.

Let us know if this does not help,

Matt

On Fri, 25 Apr 2008, Craig Tierney wrote:

> I have an SDR-based fabric running OFED-1.2.5.1 and MVAPICH (both
> 1.0 and 1.0.2p1). My vendor sent a DDR card as a replacement
> for a failed SDR and said 'it should just work'. I tried to use
> it, but I am not able to run jobs. I get the following error
> as codes start up:
>
> send desc error
> [0] Abort: [] Got completion with error 9, vendor code=8a, dest rank=2
> at line 513 in file ibv_channel_manager.c
> rank 0 in job 1 w347_44628 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
> The codes are able to start (for instance, HPL is able to print its headers).
> This problem happens using both 1.0 and 1.0.2p1. It does not happen
> with OpenMPI-1.2.4.
>
> Should I be able to combine DDR and SDR cards in the same fabric and
> run jobs across them? Are there any performance issues with this
> (not with things running at DDR, but running worse than SDR)?
>
> Thanks,
> Craig
> --
> Craig Tierney (craig.tierney@noaa.gov)

From Craig.Tierney at noaa.gov Wed Apr 30 12:37:36 2008
From: Craig.Tierney at noaa.gov (Craig Tierney)
Date: Wed Apr 30 12:37:49 2008
Subject: [mvapich-discuss] Problem with fabric combining DDR and SDR cards
In-Reply-To:
References:
Message-ID: <4818A050.10300@noaa.gov>

Matthew Koop wrote:
> Craig,
>
> So are you running with MVAPICH2? Currently MVAPICH2 requires an
> additional environment variable when using cards of different types:
>
> MV2_DEFAULT_MTU=IBV_MTU_1024
>
> We will be adding support for cards of different speeds and types.
> MVAPICH 1.0 already has this support.
>
> Let us know if this does not help,

I am running MVAPICH2. Specifying any IBV_MTU_* setting (256, 512, 1024, 2048)
solves the problem for a small program (HPL).

Why is this setting needed? Are there any performance issues with
setting this value? Why not just use the IBV_MTU_2048 value?

Thanks,
Craig

--
Craig Tierney (craig.tierney@noaa.gov)

From koop at cse.ohio-state.edu Wed Apr 30 22:08:16 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed Apr 30 22:08:27 2008
Subject: [mvapich-discuss] Problem with fabric combining DDR and SDR cards
In-Reply-To: <4818A050.10300@noaa.gov>
Message-ID:

Craig,

This setting is currently needed because MVAPICH2 auto-selects an MTU of 1K
for SDR cards and 2K for DDR cards. If the environment is mixed, you must
force it to a single value. We will be changing this for the next version.

You may try running with other MTUs. On our machines at OSU we found 1K to
be the best when running at SDR rates on that card, particularly for
small-to-medium sized messages.
For your application, another MTU may work better.

Matt
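
For anyone scripting job launches on a mixed fabric, the rule described in this thread (a 1K MTU auto-selected for SDR cards, 2K for DDR cards, so a job spanning both needs one forced value) can be sketched roughly as below. This is only an illustration under stated assumptions: `mtu_override` and the `AUTO_MTU` table are hypothetical names, not part of MVAPICH2, and the card rates are assumed to be supplied by hand rather than detected.

```python
# Hypothetical helper, not part of MVAPICH2. It models the behaviour described
# in this thread: MVAPICH2 auto-selects a 1K MTU for SDR cards and a 2K MTU
# for DDR cards, so a job spanning both needs MV2_DEFAULT_MTU forced.

# Auto-selected MTU per card rate, as stated in the thread.
AUTO_MTU = {"SDR": "IBV_MTU_1024", "DDR": "IBV_MTU_2048"}


def mtu_override(card_rates, forced="IBV_MTU_1024"):
    """Return the environment override a job needs, or {} if none is needed.

    card_rates: rates ("SDR" or "DDR") of the cards taking part in the job,
                supplied by the caller (assumption: no auto-detection here).
    forced:     value to force on a mixed fabric; the thread suggests 1K.
    """
    auto_choices = {AUTO_MTU[rate] for rate in card_rates}
    if len(auto_choices) <= 1:
        return {}  # homogeneous fabric: the auto-selected MTUs already agree
    return {"MV2_DEFAULT_MTU": forced}


if __name__ == "__main__":
    for rates in (["SDR", "SDR"], ["SDR", "DDR", "SDR"]):
        env = mtu_override(rates)
        setting = " ".join(f"{k}={v}" for k, v in env.items()) or "(no override)"
        print(rates, "->", setting)
```

The resulting NAME=VALUE pair is what would go into the job environment; how it gets there depends on the launcher in use. The forced value is left as a parameter because, as noted above, 1K was only the best choice on the OSU machines and another MTU may suit a given application better.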