From singh.jasjit at yahoo.co.in Fri Aug 1 11:50:40 2008
From: singh.jasjit at yahoo.co.in (jasjit singh)
Date: Fri Aug 1 11:50:54 2008
Subject: [mvapich-discuss] Intel AMD run
Message-ID: <475173.402.qm@web94005.mail.in2.yahoo.com>

Hi,

I am running PALLAS v2.2 over mvapich2-1.0.1.
We have Silverstorm's InfiniBand cards.
I am using OFED-1.2.5.3.
I have tried with both the gen2 and udapl stacks. Both give the same result for all my runs.
The OS is RHEL4-U5 2.6.9-55.ELlargesmp.

First I ran it between two Intel (Xeon) machines with the number of processes equal to two. It went through successfully.

Then I ran it between two AMD (Opteron) machines with the same number of processes. It also went through.

Thereafter I tried between one Intel machine and one AMD machine. Then it didn't run. It was stuck at the very start (output file is attached).

Has anybody tried this kind of thing earlier?

I have also tried, between Intel and AMD, a DAPL-level application that does dat_ep_post_rdma_write() continuously in a bidirectional manner. This ran fine.

So...
Does MPI have anything specific to Intel and/or AMD architectures?
Is there a workaround to make it run?
Or am I not supposed to run this across different architectures?

I tried one other variation by compiling MPI (udapl) without two flags, namely _SMP_ and RDMA_FAST_PATH.
That also ran fine.
So does it have anything to do with the RDMA_FAST_PATH flag?

Thanks in advance,
Jasjit Singh

From perkinjo at cse.ohio-state.edu Fri Aug 1 12:02:07 2008
From: perkinjo at cse.ohio-state.edu (Jonathan L.
Perkins)
Date: Fri Aug 1 12:03:02 2008
Subject: [mvapich-discuss] Problems with mvapich, gfortan and rhel4?
In-Reply-To:
References:
Message-ID: <20080801160207.GI2942@cse.ohio-state.edu>

Michael:

Thank you for using mvapich. I'm sorry to hear that you're experiencing problems with building Fortran90 programs. If possible, can you send me the configure and make logs from the mvapich installation? Also, can you send the full output from an attempt to build a Fortran90 MPI program? It would also be good if you could provide a reproducer program that we can try locally.

At this point I cannot say whether multiple versions of the compiler being present is the source of the problem or not. Are the PATH and LD_LIBRARY_PATH variables the same as at the time of the installation?

On Thu, Jul 31, 2008 at 01:18:43PM -0500, Mike Heinz wrote:
> I've been building test clusters using OFED 1.3.1, which includes
> mvapich. One of the problems I've been running into is odd problems with
> Fortran90 programs. These programs fail to compile with odd messages
> that make me think that mvapich failed to build correctly. For
> example:
>
> Fatal Error: Reading module mpi at line 4 column 61: Expected left
> parenthesis
>
> From the message, it would appear that mpi.f90 is faulty. But mpi.f90
> appears to be created by mvapich during the build process (of mvapich)
> and then deleted, which makes it hard to determine for sure.
>
> The one thing I can say is that these machines have both gcc version 3
> and gcc version 4 compilers installed and that the default gcc and
> fortran compilers are the version 3 ones. Is it possible that this is
> causing the problem?
>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From panda at cse.ohio-state.edu Fri Aug 1 13:15:01 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Aug 1 13:15:12 2008
Subject: [mvapich-discuss] MVAPICH2 Allreduce Performance
In-Reply-To: <48907E7E.1050009@inl.gov>
Message-ID:

Peter,

Thanks for reporting these performance numbers and the comparisons.

MVAPICH 0.9.9 is an older version. Several multi-core-aware collective optimizations went into the MVAPICH 1.0 series. Please check the latest MVAPICH 1.0.1 version and let us know whether you still see the performance degradation.

Similarly, the multi-core-aware collective optimizations have gone into the latest MVAPICH2 1.2 series. Please check out the latest MVAPICH2 1.2 version from the trunk (not RC1; we have added some enhancements and tuning after RC1 was released) and let us know if you still see the performance degradation.

DK

On Wed, 30 Jul 2008, Peter Cebull wrote:

> We are looking at some scalability issues for a particular application
> on one of our clusters. Specifically, I plotted the MPI_Allreduce
> performance of MVAPICH2, MVAPICH, Intel MPI, and Open MPI as measured by
> the Intel MPI Allreduce Benchmark. The plot shows average time in
> microseconds vs the number of processes from 2 to 512 for a message size
> of 4 kB.
>
> The results show MVAPICH2 performing very well up to 128 processes, but
> for 256 and 512 processes the performance drops off by an order of
> magnitude to match the performance of MVAPICH and Intel MPI. Is this
> expected behavior, and is there a way to improve the scalability for
> 256+ processes?
> I didn't see this topic in the archive, I apologize if
> it's been discussed before.
>
> We are running dual quad-core EM64t nodes, OFED 1.2, Mellanox
> Technologies MT25204 [InfiniHost III Lx HCA]. This machine is an SGI
> Altix ICE with ProPack 5 SP3. The timing data are listed below.
>
> mpich2version
> Version: mvapich2-1.0
> Device: osu_ch3:mrail
> Configure Options:
> '--prefix=/usr/local/mvapich2/mvapich2-1.0.2/intel-opt'
> '--with-device=osu_ch3:mrail' '--with-rdma=gen2' '--with-pm=mpd'
> '--enable-shared=gcc' '--enable-sharedlibs=gcc' '--disable-romio'
> '--without-mpe' 'CC=icc' 'CFLAGS=-fPIC -D_EM64T_ -D_SMP_
> -DUSE_HEADER_CACHING -DONE_SIDED -DMPIDI_CH3_CHANNEL_RNDV
> -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -I/usr/include -fPIC -O2'
> 'CXX=icpc' 'F77=ifort' 'F90=ifort' 'FFLAGS=-L/usr/lib64 -fPIC'
> CC: icc -fPIC -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED
> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM
> -I/usr/include -fPIC -O2
> CXX: icpc
> F77: ifort -L/usr/lib64 -fPIC
> F90: ifort
>
> Thanks,
> Peter
>
> # processes vs time in us
> Intel MPI 3.1
> 2 7.12
> 4 14.82
> 8 26.07
> 16 83.85
> 32 543.00
> 64 1025.87
> 128 1492.71
> 256 1957.55
> 512 2445.58
>
> MVAPICH 0.9.9
> 2 13.44
> 4 20.72
> 8 37.08
> 16 84.59
> 32 545.56
> 64 1018.50
> 128 1509.70
> 256 1959.09
> 512 2481.70
>
> MVAPICH2 1.0.2
> 2 11.76
> 4 19.16
> 8 37.26
> 16 80.09
> 32 105.88
> 64 111.21
> 128 126.11
> 256 1942.33
> 512 2434.15
>
> Open MPI 1.2.6
> 2 13.23
> 4 30.25
> 8 63.63
> 16 95.66
> 32 155.05
> 64 272.42
> 128 512.11
> 256 752.29
> 512 999.50
>
> --
> Peter Cebull
> Idaho National Laboratory
>
>

From curtisbr at cse.ohio-state.edu Fri Aug 1 14:54:51 2008
From: curtisbr at cse.ohio-state.edu (Brian Curtis)
Date: Fri Aug 1 14:54:58 2008
Subject: [mvapich-discuss] Performance differences between mvapich2-1.0 and mvapich2-1.2
In-Reply-To: <1217349938.3562.470.camel@kallies.zib.de>
References: <1217349938.3562.470.camel@kallies.zib.de>
Message-ID:
<48935BFB.50104@cse.ohio-state.edu>

Bernd,

I've looked over your configuration for MVAPICH2-1.2rc1 and I have some suggestions. The default configuration of MVAPICH2-1.2 includes --enable-fast=defopt,ndebug. By specifying --enable-fast=defopt, you lose the ndebug option, which results in the inclusion of assertions and other debug statements. The "--with-thread-package" parameter is a no-op unless you specify an option (as is, it selects pthread).

I recommend checking out our latest source from trunk and using the following configuration:

./configure --with-file-system=lustre+nfs --without-mpe --enable-sharedlibs=gcc

Please note that ROMIO is enabled by default and gen2 is selected by default for Linux. Let us know if your performance improves by using the latest source from trunk and the recommended configuration.

Brian

Bernd Kallies wrote:
> It seems to me that mvapich2-1.2rc1 is slower than previous
> versions when compiling/using defaults. I'd like to know if I forgot
> some secret preprocessor flag or configure option for 1.2.
> > I compiled the nighty build for mvapich2-1.0 as of July 28 (I guess it > is something like mvapich2-1.0.5) with the following settings: > > export CC=icc > export CXX=icpc > export F77=ifort > export F90=ifort > export CFLAGS='-D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED > -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -O2' > configure --with-device=osu_ch3:mrail --with-rdma=gen2 --with-pm=mpd > --disable-romio --enable-sharedlibs=gcc --without-mpe > > I compiled the tarball source of mvapich2-1.2rc1 with > unset CFLAGS > ./configure --enable-romio --with-file-system=lustre+nfs > --enable-fast=defopt --with-rdma=gen2 --with-thread-package > --enable-sharedlibs=gcc --without-mpe > > I get the following when running osu_alltoall with 1 task per node on > two nodes after setting MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0: > > mvapich2-1.0.5-intel: > # OSU MPI All-to-All Personalized Exchange Latency Test v3.1 > # Size Latency (us) > 1 1.62 > 2 1.71 > 4 1.66 > 8 1.64 > 16 1.68 > 32 1.74 > 64 1.97 > 128 3.04 > 256 3.42 > 512 4.01 > 1024 5.26 > 2048 6.62 > 4096 9.45 > 8192 15.20 > 16384 17.76 > 32768 23.21 > 65536 38.60 > 131072 76.32 > 262144 151.70 > 524288 296.74 > 1048576 591.68 > > mvapich2-1.2rc1-intel: > # OSU MPI All-to-All Personalized Exchange Latency Test v3.1 > # Size Latency (us) > 1 1.87 > 2 1.80 > 4 1.81 > 8 1.82 > 16 1.86 > 32 1.92 > 64 2.10 > 128 3.16 > 256 3.53 > 512 4.07 > 1024 5.33 > 2048 6.79 > 4096 9.54 > 8192 15.34 > 16384 17.48 > 32768 22.88 > 65536 38.78 > 131072 76.55 > 262144 149.74 > 524288 297.11 > 1048576 591.25 > > Other OSU benchmarks yield no visible differences between the two > builds, e.g. 
osu_mbw_mr with 2 nodes and 4 tasks per node: > > mvapich2-1.0.5-intel: > # OSU MPI Multiple Bandwidth / Message Rate Test v3.1 > # [ pairs: 4 ] [ window size: 64 ] > # Size MB/s Messages/s > 1 3.45 3447336.26 > 2 6.93 3463236.43 > 4 13.83 3458551.26 > 8 27.68 3460000.08 > 16 62.91 3931824.03 > 32 109.74 3429389.41 > 64 213.14 3330258.12 > 128 353.90 2764881.74 > 256 624.27 2438548.84 > 512 980.57 1915173.15 > 1024 1241.38 1212281.33 > 2048 1463.71 714703.42 > 4096 1612.25 393616.25 > 8192 1721.11 210096.00 > 16384 1851.29 112993.94 > 32768 2051.28 62600.09 > 65536 2062.08 31464.92 > 131072 2065.59 15759.17 > 262144 2074.04 7911.82 > 524288 2082.66 3972.35 > 1048576 2087.94 1991.22 > 2097152 2090.20 996.69 > 4194304 2075.23 494.77 > > mvapich2-1.2rc1-intel: > # OSU MPI Multiple Bandwidth / Message Rate Test v3.1 > # [ pairs: 4 ] [ window size: 64 ] > # Size MB/s Messages/s > 1 3.42 3424686.07 > 2 6.92 3459442.70 > 4 13.73 3431691.09 > 8 27.59 3449218.84 > 16 62.63 3914337.15 > 32 108.91 3403302.14 > 64 210.89 3295101.65 > 128 347.89 2717920.88 > 256 621.49 2427687.32 > 512 982.32 1918595.24 > 1024 1246.40 1217187.35 > 2048 1490.18 727625.11 > 4096 1684.54 411264.55 > 8192 1768.11 215833.58 > 16384 1852.36 113059.37 > 32768 2048.83 62525.18 > 65536 2062.01 31463.76 > 131072 2066.38 15765.20 > 262144 2074.90 7915.12 > 524288 2082.75 3972.54 > 1048576 2088.07 1991.34 > 2097152 2090.04 996.61 > 4194304 2077.47 495.31 > > I also compiled the quantum chemistry code CPMD 3.11.1 with both libs. > The code has own profiling. A benchmark run yields for a run with 64 > nodes, 1 task per node, 1 thread per task, application-defined task > pinning, MV2_NUM_PORTS=2 MV2_ENABLE_AFFINITY=0: > > mvapich2-1.0.5-intel: > ... > CPU TIME : 0 HOURS 17 MINUTES 7.53 SECONDS > ELAPSED TIME : 0 HOURS 17 MINUTES 40.26 SECONDS > ... > ================================================================ > = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS = > = SEND/RECEIVE 36385. 
BYTES 722421. = > = BROADCAST 37880. BYTES 368. = > = GLOBAL SUMMATION 393974. BYTES 10556. = > = GLOBAL MULTIPLICATION 0. BYTES 1. = > = ALL TO ALL COMM 484310. BYTES 46464. = > = PERFORMANCE TOTAL TIME = > = SEND/RECEIVE 681.133 MB/S 38.591 SEC = > = BROADCAST 87.115 MB/S 0.160 SEC = > = GLOBAL SUMMATION 1520.563 MB/S 16.410 SEC = > = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC = > = ALL TO ALL COMM 86.898 MB/S 258.959 SEC = > = SYNCHRONISATION 1.750 SEC = > ================================================================ > > mvapich2-1.2rc1-intel: > ... > CPU TIME : 0 HOURS 18 MINUTES 59.23 SECONDS > ELAPSED TIME : 0 HOURS 19 MINUTES 31.68 SECONDS > ... > ================================================================ > = COMMUNICATION TASK AVERAGE MESSAGE LENGTH NUMBER OF CALLS = > = SEND/RECEIVE 36385. BYTES 722421. = > = BROADCAST 37880. BYTES 368. = > = GLOBAL SUMMATION 393974. BYTES 10556. = > = GLOBAL MULTIPLICATION 0. BYTES 1. = > = ALL TO ALL COMM 484310. BYTES 46464. = > = PERFORMANCE TOTAL TIME = > = SEND/RECEIVE 699.651 MB/S 37.570 SEC = > = BROADCAST 87.114 MB/S 0.160 SEC = > = GLOBAL SUMMATION 1557.608 MB/S 16.020 SEC = > = GLOBAL MULTIPLICATION 0.000 MB/S 0.001 SEC = > = ALL TO ALL COMM 61.302 MB/S 367.082 SEC = > = SYNCHRONISATION 1.950 SEC = > ================================================================ > > The difference is reproducible (mvapich2-1.2rc1-intel is slower, seems > to be the reason of slow all to all comm.), also compared to > mvapich2-1.0.3 from tarball, or mvapich2-1.0.1 and mvapich-0.9.9 (both > precompiled from SGI, available from SGI). Note that the benchmarks are > run with no intra-node communication. 
>
> Sincerely, BK
>

From chai.15 at osu.edu Fri Aug 1 17:55:38 2008
From: chai.15 at osu.edu (Lei Chai)
Date: Fri Aug 1 17:55:43 2008
Subject: [mvapich-discuss] Intel AMD run
In-Reply-To: <475173.402.qm@web94005.mail.in2.yahoo.com>
References: <475173.402.qm@web94005.mail.in2.yahoo.com>
Message-ID: <4893865A.1050407@osu.edu>

Hi Jasjit,

I just tried IMB-3.0 (the new PALLAS) and mvapich2 with one Intel and one AMD machine and it ran fine. How did you compile mvapich2 and the program? If you compile on one machine and run the same program through a shared file system (e.g. NFS), then there shouldn't be any problem. If you compile two versions, one on Intel and one on AMD, and try to run them together, then you are likely to observe a hang, since we have platform-specific thresholds for performance optimizations; we therefore don't recommend doing so. If you continue to see the problem, please send us your output file (I didn't see it in your previous email).

Thanks,
Lei

From yogyas at gmail.com Sat Aug 2 03:47:08 2008
From: yogyas at gmail.com (yogeshwar sonawane)
Date: Sat Aug 2 03:47:19 2008
Subject: [mvapich-discuss] Intel AMD run
In-Reply-To: <4893865A.1050407@osu.edu>
References: <475173.402.qm@web94005.mail.in2.yahoo.com> <4893865A.1050407@osu.edu>
Message-ID:

Hi Lei,

Thanks for the reply. It is working now.
As per your suggestion, we are using the same binary, compiled on one machine and shared with the other.
Earlier we were compiling both mvapich2 & PALLAS separately on both machines.

Now, any recommendations on which machine we should compile (Intel or AMD)?
Or will either one do? Any previous observations about performance differences because of the platform-specific thresholds for performance optimizations?

One more point, as Jasjit said:

>> I tried one other variation by compiling MPI (udapl) without two flags,
>> namely _SMP_ and RDMA_FAST_PATH.
>> That also ran fine.
>> So does it have anything to do with the RDMA_FAST_PATH flag?

The above compilation was done separately on both machines, and it still works.
Is there anything related to this?
With regards,
Yogeshwar

From chai.15 at osu.edu Sat Aug 2 14:41:52 2008
From: chai.15 at osu.edu (Lei Chai)
Date: Sat Aug 2 14:42:04 2008
Subject: [mvapich-discuss] Intel AMD run
In-Reply-To:
References: <475173.402.qm@web94005.mail.in2.yahoo.com> <4893865A.1050407@osu.edu>
Message-ID: <4894AA70.1010905@osu.edu>

Yogeshwar,

> Thanks for the reply. It is working now.
> As per your suggestion, we are using the same binary, compiled on one
> machine and shared with the other.
> Earlier we were compiling both mvapich2 & PALLAS separately on both
> machines.
>
Glad to know it's working now.

> Now, any recommendations on which machine we should
> compile (Intel or AMD)?
> Or will either one do? Any previous observations about performance
> differences because of the platform-specific thresholds for
> performance optimizations?
>
We haven't done performance tuning on such heterogeneous systems. I suppose either machine will do. Please let us know if you observe interesting performance differences.

> One more point, as Jasjit said:
>
>
>>> I tried one other variation by compiling MPI (udapl) without two flags,
>>> namely _SMP_ and RDMA_FAST_PATH.
>>> That also ran fine.
>>> So does it have anything to do with the RDMA_FAST_PATH flag?
>>>
>
> The above compilation was done separately on both machines, and it
> still works.
> Is there anything related to this?
>
The _SMP_ flag shouldn't make a difference since it only controls parameters within a node. I guess most platform-specific parameters are associated with RDMA_FAST_PATH. The parameters that are not controlled by RDMA_FAST_PATH are either the same for these two systems or they don't cause a protocol mismatch.

Lei

From pasha at dev.mellanox.co.il Sun Aug 3 02:40:53 2008
From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha))
Date: Sun Aug 3 02:41:06 2008
Subject: [mvapich-discuss] Problems with mvapich, gfortan and rhel4?
In-Reply-To:
References:
Message-ID: <489552F5.9030602@dev.mellanox.co.il>

Mike Heinz wrote:
>
> The one thing I *can* say is that these machines have both gcc version
> 3 and gcc version 4 compilers installed and that the default gcc and
> fortran compilers are the version 3 ones. Is it possible that this is
> causing the problem?
The mixed gcc install may break the OFED mvapich install. Can you temporarily rename the gfortran binary to some other name (gfortran.orig) and re-run the OFED install?

Regards,
Pasha

>
> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
>

From forum.san at gmail.com Mon Aug 4 00:48:18 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Mon Aug 4 00:48:30 2008
Subject: [mvapich-discuss] mvapich2-1.0.3 + charm++ error
Message-ID:

Hi,

I've installed mvapich2-1.0.3 (make.mvapich2.ofa stack) with Intel compilers on CentOS on a Rocks 4.3 cluster.

I installed charm5.9 (a parallel programming OO language library).

The charm build is successful with the following options:

./build charm++ mpi-linux iccstatic pthreads --no-build-shared --incdir=/opt/MPI/mvapich2_intel/include --libdir=/opt/MPI/mvapich2_intel/lib --libdir=/usr/lib

But when I try to execute the tests/charm++/simplearrayhello program:

[root@local simplearrayhello]# make
../../../bin/charmc hello.ci
../../../bin/charmc -c hello.C
../../../bin/charmc -language charm++ -o hello hello.o
icpc: command line remark #10010: option '-static-libcxa' is deprecated and will be removed in a future release. See '-help deprecated'
ld: skipping incompatible /usr/lib/libpthread.so when searching for -lpthread
ld: skipping incompatible /usr/lib/libpthread.a when searching for -lpthread
ld: skipping incompatible /usr/lib/libpthread.so when searching for -lpthread
..
..
ld: skipping incompatible /usr/lib/libc.so when searching for -lc
ld: skipping incompatible /usr/lib/libc.a when searching for -lc
/opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0xa7): In function `MPIDI_CH3I_CM_Finalize':
: undefined reference to `ibv_dereg_mr'
/opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0xb9): In function `MPIDI_CH3I_CM_Finalize':
: undefined reference to `ibv_dereg_mr'
/opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0x112): In function `MPIDI_CH3I_CM_Finalize':
: undefined reference to `ibv_destroy_qp'
...
Fatal Error by charmc in directory /opt/apps/namd_intel/charm-5.9/tests/charm++/simplearrayhello
Command icpc -static-libcxa -L/opt/MPI/mvapich2_intel/lib -L/usr/lib -L/opt/MPI/mvapich2_intel/lib -L/usr/lib -o hello -L../../../bin/../lib -I../../../bin/../include ../../../bin/../lib/libldb-rand.o hello.o moduleinit3173.o ../../../bin/../lib/libmemory-default.o ../../../bin/../lib/libthreads-default.o -lck -lconv-cplus-y -lconv-core -lconv-util -L/opt/MPI/mvapich2_intel/lib -L/usr/lib -lckqt -lmpich -lpthread -lpthread -ldl -lpthread -lm returned error code 1
charmc exiting...
make: *** [hello] Error 1

It looks like charmc is not able to link libibverbs.

With MPICH2, there is no problem with charm++ execution.

The src/arch/mpi-linux/conv-mach.sh is:

CMK_CPP_CHARM='/lib/cpp -P'
CMK_CPP_C='mpicc -E'
CMK_CC='mpicc '
CMK_CXX='mpicxx '
CMK_CXXPP='mpicxx -E '
CMK_CF77='ifort'
CMK_CF90='ifort'
CMK_RANLIB='ranlib'
CMK_LIBS='-L/opt/MPI/mvapich2_intel/lib -L/usr/lib -lckqt -lmpich -lpthread'
CMK_LD_LIBRARY_PATH="-Wl,-rpath,$CHARMLIBSO/"
CMK_NATIVE_LIBS=''
CMK_NATIVE_CC='icc '
CMK_NATIVE_LD='icpc'
CMK_NATIVE_CXX='icpc '
CMK_NATIVE_LDXX='icpc'
CMK_NATIVE_CC='icc '
CMK_NATIVE_CXX='icpc '
CMK_F90LIBS='-L/usr/absoft/lib -L/opt/absoft/lib -lf90math -lfio -lU77 -lf77math '
CMK_MOD_NAME_ALLCAPS=1
CMK_MOD_EXT="mod"
CMK_F90_USE_MODDIR=1
CMK_F90_MODINC="-p

Is this setting OK, or do I need to add some more library paths (i.e.
OFED libs)?

Thank you,
Sangamesh

From koop at cse.ohio-state.edu Mon Aug 4 10:05:48 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Aug 4 10:05:59 2008
Subject: [mvapich-discuss] mvapich2-1.0.3 + charm++ error
In-Reply-To:
Message-ID:

Sangamesh,

In the past I have given these instructions for MVAPICH:

http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-June/000189.html

MVAPICH2 should be the same with the exception of mpiCC for mpicxx. The settings you have below look fine to me. If you have the proper wrapper scripts (mpicc, etc.) there should be no need to specify the OFED libraries there.

Can you verify that 'which mpicc' and 'which mpicxx' do point to the working MVAPICH2 installation?

Thanks,

Matt

On Mon, 4 Aug 2008, Sangamesh B wrote:

> Hi,
>
> I've installed mvapich2-1.0.3 (make.mvapich2.ofa stack) with Intel
> compilers on Cent OS on a Rocks 4.3 cluster.
>
> Installed charm5.9 (a parallel programming OO language library)
>
> The charm build is successful with following options:
>
> ./build charm++ mpi-linux iccstatic pthreads --no-build-shared
> --incdir=/opt/MPI/mvapich2_intel/include
> --libdir=/opt/MPI/mvapich2_intel/lib --libdir=/usr/lib
>
> But when I try to execute tests/charm++/simplearrayhello program:
>
> [root@local simplearrayhello]# make
> ../../../bin/charmc hello.ci
> ../../../bin/charmc -c hello.C
> ../../../bin/charmc -language charm++ -o hello hello.o
> icpc: command line remark #10010: option '-static-libcxa' is deprecated and
> will be removed in a future release. See '-help deprecated'
> ld: skipping incompatible /usr/lib/libpthread.so when searching for
> -lpthread
> ld: skipping incompatible /usr/lib/libpthread.a when searching for -lpthread
> ld: skipping incompatible /usr/lib/libpthread.so when searching for
> -lpthread
>
> ..
> .. > ld: skipping incompatible /usr/lib/libc.so when searching for -lc > ld: skipping incompatible /usr/lib/libc.a when searching for -lc > /opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0xa7): In > function `MPIDI_CH3I_CM_Finalize': > : undefined reference to `ibv_dereg_mr' > /opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0xb9): In > function `MPIDI_CH3I_CM_Finalize': > : undefined reference to `ibv_dereg_mr' > /opt/MPI/mvapich2_intel/lib/libmpich.a(rdma_iba_init.o)(.text+0x112): In > function `MPIDI_CH3I_CM_Finalize': > : undefined reference to `ibv_destroy_qp' > > > ... > Fatal Error by charmc in directory > /opt/apps/namd_intel/charm-5.9/tests/charm++/simplearrayhello > Command icpc -static-libcxa -L/opt/MPI/mvapich2_intel/lib -L/usr/lib > -L/opt/MPI/mvapich2_intel/lib -L/usr/lib -o hello -L../../../bin/../lib > -I../../../bin/../include ../../../bin/../lib/libldb-rand.o hello.o > moduleinit3173.o ../../../bin/../lib/libmemory-default.o > ../../../bin/../lib/libthreads-default.o -lck -lconv-cplus-y -lconv-core > -lconv-util -L/opt/MPI/mvapich2_intel/lib -L/usr/lib -lckqt -lmpich > -lpthread -lpthread -ldl -lpthread -lm returned error code 1 > charmc exiting... > make: *** [hello] Error 1 > > It looks, charmc is not able to link libibverbs. > > With MPICH2, there is no problem charm++ execution. 
> > The src/arch/mpi-linux/conv-mach.sh is: > > CMK_CPP_CHARM='/lib/cpp -P' > CMK_CPP_C='mpicc -E' > CMK_CC='mpicc ' > CMK_CXX='mpicxx ' > CMK_CXXPP='mpicxx -E ' > CMK_CF77='ifort' > CMK_CF90='ifort' > CMK_RANLIB='ranlib' > CMK_LIBS='-L/opt/MPI/mvapich2_intel/lib -L/usr/lib -lckqt -lmpich -lpthread' > CMK_LD_LIBRARY_PATH="-Wl,-rpath,$CHARMLIBSO/" > CMK_NATIVE_LIBS='' > CMK_NATIVE_CC='icc ' > CMK_NATIVE_LD='icpc' > CMK_NATIVE_CXX='icpc ' > CMK_NATIVE_LDXX='icpc' > CMK_NATIVE_CC='icc ' > CMK_NATIVE_CXX='icpc ' > CMK_F90LIBS='-L/usr/absoft/lib -L/opt/absoft/lib -lf90math -lfio -lU77 > -lf77math ' > CMK_MOD_NAME_ALLCAPS=1 > CMK_MOD_EXT="mod" > CMK_F90_USE_MODDIR=1 > CMK_F90_MODINC="-p > > Is this setting is ok or I need to add some more library paths (i.e. OFED > libs)? > > Thank you, > Sangamesh > From forum.san at gmail.com Wed Aug 6 00:58:08 2008 From: forum.san at gmail.com (Sangamesh B) Date: Wed Aug 6 00:58:21 2008 Subject: [mvapich-discuss] mvapich2-1.0.3 + charm++ error In-Reply-To: References: Message-ID: Hi Matthew, My answers are in-line. On Mon, Aug 4, 2008 at 7:35 PM, Matthew Koop wrote: > Sangamesh, > > In the past I have given these instructions for MVAPICH: > > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2006-June/000189.html > > MVAPICH2 should be the same with the exception of mpiCC for mpicxx. The > settings you have below look fine to me. If you have the proper wrapper > scripts (mpicc, etc) there should be no need to specify the OFED libraries > there. > > Can you verify that 'which mpicc' and 'which mpicxx' do point to the > working MVAPICH2 installation? > Yes. It is mvapich2's mpicc only. One more point is: After building charm++, the charmc script in the bin dir doesn't contain reference to "-lmpich". So it gave errors " undefined reference to MPI_CH3_ .." . Then I put it manually "-lmpich". The above errors disappeared but " > : undefined reference to `ibv_dereg_mr' " errors started appearing. Shall I try with mvapich2-1.0.2rc1? 
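The undefined `ibv_dereg_mr` and `ibv_destroy_qp` symbols in the log above belong to libibverbs, which does not appear on the final `icpc` link line. The sketch below reads linker output and prints the libraries that are probably missing. It assumes the verbs and connection-manager libraries are installed under their usual names (`libibverbs`, `librdmacm`); `suggest_libs` is a hypothetical helper for illustration, not part of charm++ or MVAPICH2.

```shell
#!/bin/sh
# suggest_libs: read linker output on stdin and print the -l flags
# that would probably resolve the undefined InfiniBand references.
# Hypothetical helper; the symbol-prefix-to-library mapping is an assumption
# based on the usual OFED library names.
suggest_libs() {
    awk '
        /undefined reference to `ibv_/  { need["-libverbs"] = 1 }  # verbs API
        /undefined reference to `rdma_/ { need["-lrdmacm"]  = 1 }  # RDMA CM API
        END { for (lib in need) print lib }
    ' | sort
}
```

Typical use would be `make 2>&1 | suggest_libs`. Any flag it reports could then be tried at the end of `CMK_LIBS` in conv-mach.sh (for example `-lmpich -libverbs -lpthread`). Whether further libraries are needed depends on how MVAPICH2 itself was configured.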
Thank you, Sangamesh -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080806/aff98df1/attachment.html

From michael.heinz at qlogic.com Wed Aug 6 11:43:19 2008
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Wed Aug 6 11:43:32 2008
Subject: [mvapich-discuss] Problems with mvapich, gfortan and rhel4?
In-Reply-To: <489552F5.9030602@dev.mellanox.co.il>
References: <489552F5.9030602@dev.mellanox.co.il>
Message-ID:

Pasha,

Thanks for the reply. It does appear to be a build issue, but it may be a subtle one - we're trying to provide pre-built binaries of OFED to our customers and the problem is only occurring on one OS - and I've since discovered that if I build by hand on a problem machine, the problem goes away.

So, now I'm going to dig into why our pre-builts have a problem rather than waste your time.

Thanks again.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

-----Original Message-----
From: Pavel Shamis (Pasha) [mailto:pasha@dev.mellanox.co.il]
Sent: Sunday, August 03, 2008 2:41 AM
To: Mike Heinz
Cc: mvapich-discuss@cse.ohio-state.edu; John Malantonio; John Russo
Subject: Re: [mvapich-discuss] Problems with mvapich, gfortan and rhel4?

Mike Heinz wrote:
>
> The one thing I *can* say is that these machines have both gcc version
> 3 and gcc version 4 compilers installed and that the default gcc and
> fortran compilers are the version 3 ones. Is it possible that this is
> causing the problem?

The mixed gcc install may break the OFED mvapich install. Can you temporarily rename the gfortran binary to some other name (gfortran.orig) and re-run the OFED install?
Regards,
Pasha

From manfred.muecke at univie.ac.at Fri Aug 8 13:02:56 2008
From: manfred.muecke at univie.ac.at (Manfred Muecke)
Date: Fri Aug 8 13:07:44 2008
Subject: [mvapich-discuss] MV2_CPU_MAPPING in 1.2-rc1
Message-ID: <59925.129.27.140.172.1218214976.squirrel@webmail.univie.ac.at>

Hi,

I tried to test the user defined CPU (Core) Mapping (as described in http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2rc1.html#x1-320006.8), but fail to see any effect.

I work on a Sun Cluster with two dual-core Opterons per node, Infiniband interconnect and running Solaris 10 1/06. Some time ago, I wrote a small DTrace script showing task migration of my application. Running the application with four processes on a single node (four cores) without the use of MV2_CPU_MAPPING results in each process being assigned to a different core, but with the processes migrating happily back and forth every few milliseconds. Below is a snippet of the output for a single process.

49052us pid: 4379: cpu1 -> cpu3
60327us pid: 4379: cpu3 -> cpu0
64067us pid: 4379: cpu0 -> cpu1
68705us pid: 4379: cpu1 -> cpu0
70325us pid: 4379: cpu0 -> cpu1
72929us pid: 4379: cpu1 -> cpu0
73176us pid: 4379: cpu0 -> cpu3

With MVAPICH 1.0, I succeeded in getting rid of the task migration using VIADEV_CPU_MAPPING.
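Counting CPU changes per PID makes it easier to compare runs with and without a mapping setting. The following is a minimal sketch, assuming the trace is saved one record per line in exactly the `<time>us pid: <pid>: cpuA -> cpuB` layout shown above; `count_migrations` is a hypothetical name, and the DTrace script that produces the trace is not reproduced here.

```shell
#!/bin/sh
# count_migrations: print "<pid> <number of CPU changes>" for each PID,
# reading trace records from the named files or from stdin.
# Assumed record layout: 49052us pid: 4379: cpu1 -> cpu3
count_migrations() {
    awk '
        $2 == "pid:" && $5 == "->" {
            pid = $3
            sub(/:$/, "", pid)           # strip the trailing colon from "4379:"
            if ($4 != $6) moves[pid]++   # count only real CPU changes
        }
        END { for (pid in moves) print pid, moves[pid] }
    ' "$@" | sort -n
}
```

With a working affinity setting, a pinned PID would not be expected to appear in the output at all, since only records whose source and destination CPU differ are counted.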
Now with MVAPICH2 1.2rc1 (the first version of MVAPICH2 supporting CPU mapping - thanks a lot, BTW), I tried again:

"mpdrun -np 4 -env MV2_ENABLE_AFFINITY 1 -env MV2_CPU_MAPPING 0:1:2:3 a.out"

which results in the following output (again for a single process only):

79040us pid: 4438: cpu2 -> cpu3
79662us pid: 4438: cpu3 -> cpu2
83276us pid: 4438: cpu2 -> cpu3
97615us pid: 4438: cpu3 -> cpu1
98513us pid: 4438: cpu1 -> cpu3
99471us pid: 4438: cpu3 -> cpu1
99794us pid: 4438: cpu1 -> cpu0

Again, the tasks migrate every few milliseconds. It seems like setting MV2_CPU_MAPPING has no effect on my system.

What is your preferred way of verifying this functionality? Do you have any suggestions on other parameters in my configuration to check?

Thanks,
Manfred

--
Manfred Mücke
Research Lab Computational Technologies and Applications
rlcta.univie.ac.at
Lenaugasse 2, 1080 Wien, AUSTRIA

From peter.cebull at inl.gov Fri Aug 8 16:17:15 2008
From: peter.cebull at inl.gov (Peter Cebull)
Date: Fri Aug 8 16:17:52 2008
Subject: [mvapich-discuss] Invalid Communicator in MVAPICH2 1.2.0
Message-ID: <489CA9CB.10300@inl.gov>

I'm trying out the latest trunk version of MVAPICH2 to test the improvements in MPI_Allreduce. When I try running the Intel MPI allreduce benchmark I get errors similar to the following:

Fatal error in MPI_Comm_size: Invalid communicator, error stack:
MPI_Comm_size(112): MPI_Comm_size(comm=0x5b, size=0x7fff852e0a90) failed
MPI_Comm_size(70).: Invalid communicator

I've seen these errors discussed on the mailing list but haven't seen a resolution. I've checked to make sure I'm not including an mpi.h from a previous release and I'm pretty sure I'm not. Any ideas on what's going wrong?

Here's the configuration...
$ mpiname -a
MVAPICH2 1.2.0 Unofficial Build ch3:mrail

Compilation
CC: icc -DNDEBUG -O2
CXX: icpc -DNDEBUG -O2
F77: ifort -DNDEBUG -O2
F90: ifort -DNDEBUG -O2

Configuration
--prefix=/usr/local/mvapich2/mvapich2-trunk-2008-08-06/intel-opt CC=icc F77=ifort CXX=icpc F90=ifort

Here's the PBS script to launch the job...

#!/bin/bash
#PBS -l select=64:ncpus=8:mpiprocs=8
cd $PBS_O_WORKDIR
nprocs=`cat $PBS_NODEFILE | wc -l`
mpirun_rsh -hostfile $PBS_NODEFILE -n $nprocs ./MPI1_mvapich2 Allreduce

Thanks,
Peter

--
Peter Cebull
Idaho National Laboratory

From chai.15 at osu.edu Fri Aug 8 19:07:07 2008
From: chai.15 at osu.edu (Lei Chai)
Date: Fri Aug 8 19:07:18 2008
Subject: [mvapich-discuss] MV2_CPU_MAPPING in 1.2-rc1
In-Reply-To: <59925.129.27.140.172.1218214976.squirrel@webmail.univie.ac.at>
References: <59925.129.27.140.172.1218214976.squirrel@webmail.univie.ac.at>
Message-ID: <489CD19B.1080005@osu.edu>

Manfred,

Thanks for trying the user defined cpu mapping function and sending us the feedback. mvapich and mvapich2 have so far supported cpu affinity and mapping on Linux only, not on Solaris yet. Sorry about the inconvenience. The reason you didn't observe process migration with mvapich-1.0 might be that the processes happened not to migrate during that time. Did you observe it consistently during a long-running program?

We have noted down your feedback, and will consider adding cpu affinity and mapping support for Solaris in future mvapich2 releases.

Lei

From Vlad.Cojocaru at eml-r.villa-bosch.de Thu Aug 14 11:32:21 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Thu Aug 14 11:32:43 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1 and/or mvapich2
Message-ID: <48A45005.2020407@eml-r.villa-bosch.de>

Dear mvapich users,

I tried to compile mvapich1.0.1, charm++ and namd on our new Linux-amd64 infiniband cluster using the intel 10.1.015 compilers.
With mvapich1.0.1, I managed to build the library and tested the programs in the /examples directory. Then I built charm++ and tested it with "mpirun_rsh -n 2". All tests passed. Then I built namd on top of mvapich1.0.1 and charm++. Everything seemed OK, except that the namd executable hangs without error messages. In fact it appears to still be running, but it doesn't produce any output.

If I repeat exactly the same procedure with openmpi instead of mvapich, everything works fine (however, I am not so happy about the scaling of openmpi on infiniband).

Does anyone have experience with installing namd using mvapich1.0.1? If yes, any idea why this happens? I must say that when I did the same on another cluster which had mvapich1.0.1 already compiled with the intel compilers, everything worked correctly. So it must be something in the compilation of mvapich1.0.1 on our new infiniband setup that creates the problem.

The German in the error simply says that the executable "mpiname" was not found.

Best wishes
vlad

----------------------------------error------------------------------------------------------------------------

I also tried mvapich2, but the installation fails at the mpiname application (see error below), which apparently fails to compile (no executable is found in the /env/mpiname dir). However, no error messages are printed by make and the build appears to complete correctly. So I am not sure why mpiname does not compile and why make install still tries to install it ...

/usr/bin/install -c mpiname/mpiname /sw/mcm/app/vlad/mpi/C07/mvapich2/1.2/bin/mpiname
/usr/bin/install: Aufruf von stat für "mpiname/mpiname" nicht möglich: Datei oder Verzeichnis nicht gefunden
make[1]: *** [install] Fehler 1
make[1]: Leaving directory `/sw/mcm/app/vlad/mpi/C07/mvapich2/1.2-src/src/env'
make: *** [install] Fehler 2

--
----------------------------------------------------------------------------
Dr.
Vlad Cojocaru EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg Tel: ++49-6221-533266 Fax: ++49-6221-533298 e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de http://projects.villa-bosch.de/mcm/people/cojocaru/ ---------------------------------------------------------------------------- EML Research gGmbH Amtgericht Mannheim / HRB 337446 Managing Partner: Dr. h.c. Klaus Tschira Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter http://www.eml-r.org ---------------------------------------------------------------------------- From mbozzore at platform.com Thu Aug 14 13:20:51 2008 From: mbozzore at platform.com (Mehdi Bozzo-Rey) Date: Thu Aug 14 13:19:55 2008 Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 References: <48A45005.2020407@eml-r.villa-bosch.de> Message-ID: <531893A968B34D40B36C7A6445BC828A01CF64F6@catoexm06.noam.corp.platform.com> Hello Vlad, I just recompiled NAMD and it looks ok for me (output of simple test below). I guess the problem is on the compilation side. Best regards, Mehdi Mehdi Bozzo-Rey Open Source Solution Developer Platform computing Phone : +1 905 948 4649 E-mail : mbozzore@platform.com [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile ./hosts.8 ./namd2 src/alanin Charm++> Running on MPI version: 1.2 multi-thread support: 1/1 Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work! Info: NAMD 2.6 for Linux-amd64-MPI Info: Info: Please visit http://www.ks.uiuc.edu/Research/namd/ Info: and send feedback or bug reports to namd@ks.uiuc.edu Info: Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005) Info: in all publications reporting results obtained with NAMD. 
Info: Info: Based on Charm++/Converse 50914 for mpi-linux-x86_64-gfortran-smp-mpicxx Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on tyan04.lsf.platform.com Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore Info: Running on 8 processors. Info: 8208 kB of memory in use. Info: Memory usage based on mallinfo Info: Changed directory to src Info: Configuration file is alanin TCL: Suspending until startup complete. Info: SIMULATION PARAMETERS: Info: TIMESTEP 0.5 Info: NUMBER OF STEPS 9 Info: STEPS PER CYCLE 3 Info: LOAD BALANCE STRATEGY Other Info: LDB PERIOD 600 steps Info: FIRST LDB TIMESTEP 15 Info: LDB BACKGROUND SCALING 1 Info: HOM BACKGROUND SCALING 1 Info: MAX SELF PARTITIONS 50 Info: MAX PAIR PARTITIONS 20 Info: SELF PARTITION ATOMS 125 Info: PAIR PARTITION ATOMS 200 Info: PAIR2 PARTITION ATOMS 400 Info: MIN ATOMS PER PATCH 100 Info: INITIAL TEMPERATURE 0 Info: CENTER OF MASS MOVING INITIALLY? NO Info: DIELECTRIC 1 Info: EXCLUDE SCALED ONE-FOUR Info: 1-4 SCALE FACTOR 0.4 Info: NO DCD TRAJECTORY OUTPUT Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT Info: NO VELOCITY DCD OUTPUT Info: OUTPUT FILENAME output Info: BINARY OUTPUT FILES WILL BE USED Info: NO RESTART FILE Info: SWITCHING ACTIVE Info: SWITCHING ON 7 Info: SWITCHING OFF 8 Info: PAIRLIST DISTANCE 9 Info: PAIRLIST SHRINK RATE 0.01 Info: PAIRLIST GROW RATE 0.01 Info: PAIRLIST TRIGGER 0.3 Info: PAIRLISTS PER CYCLE 2 Info: PAIRLISTS ENABLED Info: MARGIN 1 Info: HYDROGEN GROUP CUTOFF 2.5 Info: PATCH DIMENSION 12.5 Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL Info: TIMING OUTPUT STEPS 15 Info: USING VERLET I (r-RESPA) MTS SCHEME. Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS Info: RANDOM NUMBER SEED 1218734148 Info: USE HYDROGEN BONDS? NO Info: COORDINATE PDB alanin.pdb Info: STRUCTURE FILE alanin.psf Info: PARAMETER file: XPLOR format! 
(default) Info: PARAMETERS alanin.params Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS Info: SUMMARY OF PARAMETERS: Info: 61 BONDS Info: 179 ANGLES Info: 38 DIHEDRAL Info: 42 IMPROPER Info: 0 CROSSTERM Info: 21 VDW Info: 0 VDW_PAIRS Info: **************************** Info: STRUCTURE SUMMARY: Info: 66 ATOMS Info: 65 BONDS Info: 96 ANGLES Info: 31 DIHEDRALS Info: 32 IMPROPERS Info: 0 CROSSTERMS Info: 0 EXCLUSIONS Info: 195 DEGREES OF FREEDOM Info: 55 HYDROGEN GROUPS Info: TOTAL MASS = 783.886 amu Info: TOTAL CHARGE = 8.19564e-08 e Info: ***************************** Info: Entering startup phase 0 with 8208 kB of memory in use. Info: Entering startup phase 1 with 8208 kB of memory in use. Info: Entering startup phase 2 with 8208 kB of memory in use. Info: Entering startup phase 3 with 8208 kB of memory in use. Info: PATCH GRID IS 1 BY 1 BY 1 Info: REMOVING COM VELOCITY 0 0 0 Info: LARGEST PATCH (0) HAS 66 ATOMS Info: CREATING 11 COMPUTE OBJECTS Info: Entering startup phase 4 with 8208 kB of memory in use. Info: Entering startup phase 5 with 8208 kB of memory in use. Info: Entering startup phase 6 with 8208 kB of memory in use. Measuring processor speeds... Done. Info: Entering startup phase 7 with 8208 kB of memory in use. Info: CREATING 11 COMPUTE OBJECTS Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 Info: NONBONDED TABLE SIZE: 705 POINTS Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 Info: Entering startup phase 8 with 8208 kB of memory in use. Info: Finished startup with 8208 kB of memory in use. 
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG
ENERGY: 0 0.0050 0.4192 0.0368 0.4591 -210.1610 1.0506 0.0000 0.0000 0.0000 -208.1904 0.0000 -208.1877 -208.1877 0.0000
ENERGY: 1 0.0051 0.4196 0.0367 0.4585 -210.1611 1.0184 0.0000 0.0000 0.0325 -208.1905 0.1675 -208.1878 -208.1877 0.1675
ENERGY: 2 0.0058 0.4208 0.0365 0.4568 -210.1610 0.9219 0.0000 0.0000 0.1285 -208.1907 0.6632 -208.1881 -208.1877 0.6632
ENERGY: 3 0.0092 0.4232 0.0361 0.4542 -210.1599 0.7617 0.0000 0.0000 0.2845 -208.1910 1.4683 -208.1885 -208.1878 1.4683
ENERGY: 4 0.0176 0.4269 0.0356 0.4511 -210.1565 0.5386 0.0000 0.0000 0.4952 -208.1914 2.5561 -208.1890 -208.1878 2.5561
ENERGY: 5 0.0327 0.4327 0.0350 0.4480 -210.1489 0.2537 0.0000 0.0000 0.7552 -208.1917 3.8977 -208.1894 -208.1879 3.8977
ENERGY: 6 0.0552 0.4409 0.0343 0.4454 -210.1354 -0.0915 0.0000 0.0000 1.0592 -208.1920 5.4666 -208.1898 -208.1880 5.4666
ENERGY: 7 0.0839 0.4522 0.0334 0.4440 -210.1137 -0.4951 0.0000 0.0000 1.4031 -208.1922 7.2418 -208.1900 -208.1882 7.2418
ENERGY: 8 0.1162 0.4674 0.0325 0.4448 -210.0822 -0.9550 0.0000 0.0000 1.7839 -208.1923 9.2074 -208.1902 -208.1883 9.2074
ENERGY: 9 0.1492 0.4870 0.0315 0.4485 -210.0391 -1.4687 0.0000 0.0000 2.1990 -208.1925 11.3497 -208.1905 -208.1884 11.3497

WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9
WRITING COORDINATES TO OUTPUT FILE AT STEP 9
WRITING VELOCITIES TO OUTPUT FILE AT STEP 9
==========================================
WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB
End of program

From vlad.cojocaru at eml-r.villa-bosch.de Thu Aug 14 16:35:14 2008
From: vlad.cojocaru at eml-r.villa-bosch.de (Cojocaru,Vlad)
Date: Thu Aug 14 16:35:32 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
References: <48A45005.2020407@eml-r.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF64F6@catoexm06.noam.corp.platform.com>
Message-ID: <3FA65DF819C05B40B9779B89D1FEB2950D0A09@vbemail20.villa-bosch.de>

Hi Mehdi,

Did you use intel 10.1 as well? Did you build on openfabrics? What compiler flags did you pass to the mvapich build? Did you build with --enable sharedlib or without? I would be grateful if you could give me some details of how you built mvapich.

Thanks for the reply. Yes, there is something about the compilation of mvapich. As I said, I successfully compiled NAMD on a cluster that already had mvapich compiled with intel as the default mpi lib. However, on the new cluster (quad-core AMD Opterons with Mellanox infiniband) I got these problems. So it's definitely the mvapich build which fails, although I don't get any errors from make.

Any idea why the mpiname application fails to compile when compiling mvapich2?
Thanks again Best wishes vlad -----Original Message----- From: Mehdi Bozzo-Rey [mailto:mbozzore@platform.com] Sent: Thu 8/14/2008 7:20 PM To: Cojocaru,Vlad; mvapich-discuss@cse.ohio-state.edu Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 Hello Vlad, I just recompiled NAMD and it looks ok for me (output of simple test below). I guess the problem is on the compilation side. Best regards, Mehdi Mehdi Bozzo-Rey Open Source Solution Developer Platform computing Phone : +1 905 948 4649 E-mail : mbozzore@platform.com [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile ./hosts.8 ./namd2 src/alanin Charm++> Running on MPI version: 1.2 multi-thread support: 1/1 Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work! Info: NAMD 2.6 for Linux-amd64-MPI Info: Info: Please visit http://www.ks.uiuc.edu/Research/namd/ Info: and send feedback or bug reports to namd@ks.uiuc.edu Info: Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005) Info: in all publications reporting results obtained with NAMD. Info: Info: Based on Charm++/Converse 50914 for mpi-linux-x86_64-gfortran-smp-mpicxx Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on tyan04.lsf.platform.com Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore Info: Running on 8 processors. Info: 8208 kB of memory in use. Info: Memory usage based on mallinfo Info: Changed directory to src Info: Configuration file is alanin TCL: Suspending until startup complete. 
Info: SIMULATION PARAMETERS: Info: TIMESTEP 0.5 Info: NUMBER OF STEPS 9 Info: STEPS PER CYCLE 3 Info: LOAD BALANCE STRATEGY Other Info: LDB PERIOD 600 steps Info: FIRST LDB TIMESTEP 15 Info: LDB BACKGROUND SCALING 1 Info: HOM BACKGROUND SCALING 1 Info: MAX SELF PARTITIONS 50 Info: MAX PAIR PARTITIONS 20 Info: SELF PARTITION ATOMS 125 Info: PAIR PARTITION ATOMS 200 Info: PAIR2 PARTITION ATOMS 400 Info: MIN ATOMS PER PATCH 100 Info: INITIAL TEMPERATURE 0 Info: CENTER OF MASS MOVING INITIALLY? NO Info: DIELECTRIC 1 Info: EXCLUDE SCALED ONE-FOUR Info: 1-4 SCALE FACTOR 0.4 Info: NO DCD TRAJECTORY OUTPUT Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT Info: NO VELOCITY DCD OUTPUT Info: OUTPUT FILENAME output Info: BINARY OUTPUT FILES WILL BE USED Info: NO RESTART FILE Info: SWITCHING ACTIVE Info: SWITCHING ON 7 Info: SWITCHING OFF 8 Info: PAIRLIST DISTANCE 9 Info: PAIRLIST SHRINK RATE 0.01 Info: PAIRLIST GROW RATE 0.01 Info: PAIRLIST TRIGGER 0.3 Info: PAIRLISTS PER CYCLE 2 Info: PAIRLISTS ENABLED Info: MARGIN 1 Info: HYDROGEN GROUP CUTOFF 2.5 Info: PATCH DIMENSION 12.5 Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL Info: TIMING OUTPUT STEPS 15 Info: USING VERLET I (r-RESPA) MTS SCHEME. Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS Info: RANDOM NUMBER SEED 1218734148 Info: USE HYDROGEN BONDS? NO Info: COORDINATE PDB alanin.pdb Info: STRUCTURE FILE alanin.psf Info: PARAMETER file: XPLOR format! 
(default) Info: PARAMETERS alanin.params Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS Info: SUMMARY OF PARAMETERS: Info: 61 BONDS Info: 179 ANGLES Info: 38 DIHEDRAL Info: 42 IMPROPER Info: 0 CROSSTERM Info: 21 VDW Info: 0 VDW_PAIRS Info: **************************** Info: STRUCTURE SUMMARY: Info: 66 ATOMS Info: 65 BONDS Info: 96 ANGLES Info: 31 DIHEDRALS Info: 32 IMPROPERS Info: 0 CROSSTERMS Info: 0 EXCLUSIONS Info: 195 DEGREES OF FREEDOM Info: 55 HYDROGEN GROUPS Info: TOTAL MASS = 783.886 amu Info: TOTAL CHARGE = 8.19564e-08 e Info: ***************************** Info: Entering startup phase 0 with 8208 kB of memory in use. Info: Entering startup phase 1 with 8208 kB of memory in use. Info: Entering startup phase 2 with 8208 kB of memory in use. Info: Entering startup phase 3 with 8208 kB of memory in use. Info: PATCH GRID IS 1 BY 1 BY 1 Info: REMOVING COM VELOCITY 0 0 0 Info: LARGEST PATCH (0) HAS 66 ATOMS Info: CREATING 11 COMPUTE OBJECTS Info: Entering startup phase 4 with 8208 kB of memory in use. Info: Entering startup phase 5 with 8208 kB of memory in use. Info: Entering startup phase 6 with 8208 kB of memory in use. Measuring processor speeds... Done. Info: Entering startup phase 7 with 8208 kB of memory in use. Info: CREATING 11 COMPUTE OBJECTS Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 Info: NONBONDED TABLE SIZE: 705 POINTS Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 Info: Entering startup phase 8 with 8208 kB of memory in use. Info: Finished startup with 8208 kB of memory in use. 
ETITLE:  TS    BOND   ANGLE   DIHED   IMPRP      ELECT      VDW  BOUNDARY    MISC  KINETIC      TOTAL     TEMP     TOTAL2     TOTAL3  TEMPAVG
ENERGY:   0  0.0050  0.4192  0.0368  0.4591  -210.1610   1.0506    0.0000  0.0000   0.0000  -208.1904   0.0000  -208.1877  -208.1877   0.0000
ENERGY:   1  0.0051  0.4196  0.0367  0.4585  -210.1611   1.0184    0.0000  0.0000   0.0325  -208.1905   0.1675  -208.1878  -208.1877   0.1675
ENERGY:   2  0.0058  0.4208  0.0365  0.4568  -210.1610   0.9219    0.0000  0.0000   0.1285  -208.1907   0.6632  -208.1881  -208.1877   0.6632
ENERGY:   3  0.0092  0.4232  0.0361  0.4542  -210.1599   0.7617    0.0000  0.0000   0.2845  -208.1910   1.4683  -208.1885  -208.1878   1.4683
ENERGY:   4  0.0176  0.4269  0.0356  0.4511  -210.1565   0.5386    0.0000  0.0000   0.4952  -208.1914   2.5561  -208.1890  -208.1878   2.5561
ENERGY:   5  0.0327  0.4327  0.0350  0.4480  -210.1489   0.2537    0.0000  0.0000   0.7552  -208.1917   3.8977  -208.1894  -208.1879   3.8977
ENERGY:   6  0.0552  0.4409  0.0343  0.4454  -210.1354  -0.0915    0.0000  0.0000   1.0592  -208.1920   5.4666  -208.1898  -208.1880   5.4666
ENERGY:   7  0.0839  0.4522  0.0334  0.4440  -210.1137  -0.4951    0.0000  0.0000   1.4031  -208.1922   7.2418  -208.1900  -208.1882   7.2418
ENERGY:   8  0.1162  0.4674  0.0325  0.4448  -210.0822  -0.9550    0.0000  0.0000   1.7839  -208.1923   9.2074  -208.1902  -208.1883   9.2074
ENERGY:   9  0.1492  0.4870  0.0315  0.4485  -210.0391  -1.4687    0.0000  0.0000   2.1990  -208.1925  11.3497  -208.1905  -208.1884  11.3497
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9
WRITING COORDINATES TO OUTPUT FILE AT STEP 9
WRITING VELOCITIES TO OUTPUT FILE AT STEP 9
==========================================
WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB
End of program

-----Original Message-----
From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Vlad Cojocaru
Sent: August-14-08 11:32 AM
To: mvapich-discuss@cse.ohio-state.edu
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2

Dear mvapich users,

I tried to compile mvapich1.0.1, charm++ and namd on our new Linux-amd64 infiniband cluster using the intel 10.1.015 compilers. With mvapich1.0.1, I managed to build mvapich1.0.1 and tested the programs in the /examples directory. Then I built charm++ and tested it with "mpirun_rsh -n 2". All tests passed correctly. Then I built namd on top of mvapich1.0.1 and charm.

Everything seemed OK, except that the namd executable hangs without error messages. In fact it appears as if it is still running, but it doesn't produce any output. If I repeat exactly the same procedure but with openmpi instead of mvapich, everything works fine (however, I am not so happy about the scaling of openmpi on infiniband).

Does anyone have experience with installing namd using mvapich1.0.1? If yes, any idea why this happens? I must say that when I did the same on another cluster which had mvapich1.0.1 already compiled with the intel compilers, everything worked out correctly. So it must be something with the compilation of mvapich1.0.1 on our new infiniband setup that creates the problem.

The German in the error simply says that the executable "mpiname" was not found.

Best wishes
vlad

----------------------------------error------------------------------------------------------------------------
I also tried mvapich2, but the compilation fails when installing the mpiname application (see error below), which apparently fails to compile (no executable is found in the /env/mpiname dir). However, no error messages are printed by make and the build completes correctly. So I am not sure why mpiname does not compile and make install still tries to install it ...

/usr/bin/install -c mpiname/mpiname /sw/mcm/app/vlad/mpi/C07/mvapich2/1.2/bin/mpiname
/usr/bin/install: Aufruf von stat für »mpiname/mpiname« nicht möglich: Datei oder Verzeichnis nicht gefunden
make[1]: *** [install] Fehler 1
make[1]: Leaving directory `/sw/mcm/app/vlad/mpi/C07/mvapich2/1.2-src/src/env'
make: *** [install] Fehler 2

--
----------------------------------------------------------------------------
Dr. Vlad Cojocaru

EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg

Tel: ++49-6221-533266
Fax: ++49-6221-533298

e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de

http://projects.villa-bosch.de/mcm/people/cojocaru/

----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss@cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From Vlad.Cojocaru at eml-r.villa-bosch.de Fri Aug 15 04:15:40 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Fri Aug 15 04:16:01 2008
Subject: [mvapich-discuss] configure script of mvapich2 does not find timer on Debian Lenny
Message-ID: <48A53B2C.9080509@eml-r.villa-bosch.de>

Dear all,

I am wondering if anybody has tried to build mvapich2 1.2rc1 on an amd64 machine with Debian Lenny. I get a very strange error message from the configure script, "could not find timer", and configure exits. I thought gettimeofday should be in any Linux kernel, and it looks as if it is in Debian Lenny as well. I do not get the same error on Debian etch.

Does anybody have any idea how to fix this?

Best wishes
vlad

From Vlad.Cojocaru at eml-r.villa-bosch.de Fri Aug 15 05:29:15 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Fri Aug 15 05:29:35 2008
Subject: [mvapich-discuss] bug in the Makefiles of mvapich2-1.2rc1 source tree
Message-ID: <48A54C6B.8020009@eml-r.villa-bosch.de>

Dear all,

I found a strange reference to a directory called /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 in all Makefile.in files present in the source tree of mvapich2-1.2rc1 downloaded from the website. I assume that this reference should be changed to ${srcdir} in all places.

Cheers
vlad

From mbozzore at platform.com Fri Aug 15 07:45:38 2008
From: mbozzore at platform.com (Mehdi Bozzo-Rey)
Date: Fri Aug 15 07:44:45 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
References: <48A45005.2020407@eml-r.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF64F6@catoexm06.noam.corp.platform.com> <3FA65DF819C05B40B9779B89D1FEB2950D0A09@vbemail20.villa-bosch.de>
Message-ID: <531893A968B34D40B36C7A6445BC828A01CF6517@catoexm06.noam.corp.platform.com>

Hi Vlad,

No, I did not use the Intel compilers (not yet). I used gfortran. More precisely:

OS: RHEL 5.1 (Kernel 2.6.18-53.el5)

[mbozzore@tyan04 ~]$ mpicc --version
gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

[mbozzore@tyan04 ~]$ mpicxx --version
g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

[mbozzore@tyan04 ~]$ mpif77 --version
GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

[mbozzore@tyan04 ~]$ mpif90 --version
GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)

Hardware: Intel quads for the nodes, Topspin switch and HCAs for IB.

Yes, I used OFED (1.3).

I did not enable sharedlibs for that build.

I will double check, but if I remember well, everything was fine (compilation) on the mvapich2 side. What version did you use?

Cheers,

Mehdi

Mehdi Bozzo-Rey
Open Source Solution Developer
Platform computing
Phone : +1 905 948 4649
E-mail : mbozzore@platform.com

From: Cojocaru,Vlad [mailto:vlad.cojocaru@eml-r.villa-bosch.de]
Sent: August-14-08 4:35 PM
To: Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu
Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2

Hi Mehdi,

Did you use Intel 10.1 as well? Did you build on OpenFabrics? What compiler flags did you pass to the mvapich build? Did you build with --enable sharedlib or without? I would be grateful if you could give me some details of how you built mvapich.
Thanks for the reply. Yes, there is something about the compilation of mvapich. As I said, I successfully compiled NAMD on a cluster that already had mvapich compiled with Intel as the default MPI lib. However, on the new cluster (quad-core AMD Opterons with Mellanox infiniband) I got these problems. So it's definitely the mvapich build which fails, although I don't get any errors from make.

Any idea why the mpiname application fails to compile when compiling mvapich2?

Thanks again

Best wishes
vlad

From Vlad.Cojocaru at eml-r.villa-bosch.de Fri Aug 15 08:04:26 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Fri Aug 15 08:04:52 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
In-Reply-To: <531893A968B34D40B36C7A6445BC828A01CF6517@catoexm06.noam.corp.platform.com>
References: <48A45005.2020407@eml-r.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF64F6@catoexm06.noam.corp.platform.com> <3FA65DF819C05B40B9779B89D1FEB2950D0A09@vbemail20.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF6517@catoexm06.noam.corp.platform.com>
Message-ID: <48A570CA.1020607@eml-r.villa-bosch.de>

Thanks Mehdi for all the details,

I guess you mean gcc when you say gfortran ... namd is not written in Fortran but in charm++, which is an adaptation of C++.

Well, we have Debian here, so we used Debian packages to install the infiniband libs and headers (our sys administrator did that).
Then I tried to compile mvapich 1.0.1 and I found that I need the drastically change the make.mvapich.gen2 file in order to get it to build (since the defaults for $IBHOME are very strange ... we have everything in /usr/include/infiniband and /usr/lib/infiniband ). After all I managed to get it built but the namd hangs .... So I decided to try mvapich2 (1.2rc1 version) and I found lots problems. Some of them I could fix but some are very strange. For instance in the entire source tree there are lots of references to strange directories /home/daffy ... or /home/7 ... and so on .. Some of them I replaced with ${master_top_srcdir} since I figured out that one should replace them but others I don't know ... Also, when I tried to build with shared libs, the make is not able to build the mpiname application ... I could not figure out why ... So, lots of problems ....I'll try to figure them out ... However, the problems with mvapich2 look more as bugs in the Makefiiles .. So, maybe somebody would like to change those ... Cheers vlad Mehdi Bozzo-Rey wrote: > > Hi Vlad, > > > > No, I did not use the intel compilers (not yet). I used gfortran. More > precisely: > > > > OS: > > > > RHEL 5.1 (Kernel 2.6.18-53.el5) > > > > [mbozzore@tyan04 ~]$ mpicc --version > > gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) > > > > [mbozzore@tyan04 ~]$ mpicxx --version > > g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) > > > > [mbozzore@tyan04 ~]$ mpif77 --version > > GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) > > > > [mbozzore@tyan04 ~]$ mpif90 --version > > GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) > > > > Hardware: intel quads for the nodes, topspin switch and hcas for IB. > > > > > > Yes, I used OFED (1.3). > > > > I did not enable sharedlibs for that build. > > > > I will double check but if I remember well, everything was fine > (compilation) on the mvapich2 side. What version did you use ? 
> > > > Cheers, > > > > Mehdi > > > > Mehdi Bozzo-Rey > Open Source Solution Developer > Platform computing > Phone : +1 905 948 4649 > E-mail : mbozzore@platform.com > > > > > > > > *From:* Cojocaru,Vlad [mailto:vlad.cojocaru@eml-r.villa-bosch.de] > *Sent:* August-14-08 4:35 PM > *To:* Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu > *Subject:* RE: [mvapich-discuss] compile charm++ and namd with mvapich > 1.0.1and/or mvapich2 > > > > Hi Mehdi, > > Did you use intel 10.1 as well ? Did you build on openfabrics ? what > compiler flags did you pass to the mvapich build? Did you build with > --enable sharedlib or without? I would be grateful If you give me some > bits of the details how you built mvapich?. > Thanks for the reply. Yes, there is something about the compilation of > mvapich. As I said I successfully compiled NAMD on a cluster that had > already mvapich compiled with intel as the default mpi lib. However, > on the new cluster (quad cores AMD opterons with mellanox infiniband) > I got these problems. So, its definitely the mvapich build which > fails although I don't get any errors fro make. > > Any idea why the mpiname application fails to compile when compiling > mvapich2 ? > > Thanks again > > Best wishes > vlad > > > -----Original Message----- > From: Mehdi Bozzo-Rey [mailto:mbozzore@platform.com] > Sent: Thu 8/14/2008 7:20 PM > To: Cojocaru,Vlad; mvapich-discuss@cse.ohio-state.edu > Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich > 1.0.1and/or mvapich2 > > Hello Vlad, > > > I just recompiled NAMD and it looks ok for me (output of simple test > below). I guess the problem is on the compilation side. 
> > Best regards, > > Mehdi > > Mehdi Bozzo-Rey > Open Source Solution Developer > Platform computing > Phone : +1 905 948 4649 > E-mail : mbozzore@platform.com > > > [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile > ./hosts.8 ./namd2 src/alanin > Charm++> Running on MPI version: 1.2 multi-thread support: 1/1 > Charm warning> Randomization of stack pointer is turned on in Kernel, > run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable > it. Thread migration may not work! > Info: NAMD 2.6 for Linux-amd64-MPI > Info: > Info: Please visit http://www.ks.uiuc.edu/Research/namd/ > Info: and send feedback or bug reports to namd@ks.uiuc.edu > Info: > Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005) > Info: in all publications reporting results obtained with NAMD. > Info: > Info: Based on Charm++/Converse 50914 for > mpi-linux-x86_64-gfortran-smp-mpicxx > Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on > tyan04.lsf.platform.com > Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore > Info: Running on 8 processors. > Info: 8208 kB of memory in use. > Info: Memory usage based on mallinfo > Info: Changed directory to src > Info: Configuration file is alanin > TCL: Suspending until startup complete. > Info: SIMULATION PARAMETERS: > Info: TIMESTEP 0.5 > Info: NUMBER OF STEPS 9 > Info: STEPS PER CYCLE 3 > Info: LOAD BALANCE STRATEGY Other > Info: LDB PERIOD 600 steps > Info: FIRST LDB TIMESTEP 15 > Info: LDB BACKGROUND SCALING 1 > Info: HOM BACKGROUND SCALING 1 > Info: MAX SELF PARTITIONS 50 > Info: MAX PAIR PARTITIONS 20 > Info: SELF PARTITION ATOMS 125 > Info: PAIR PARTITION ATOMS 200 > Info: PAIR2 PARTITION ATOMS 400 > Info: MIN ATOMS PER PATCH 100 > Info: INITIAL TEMPERATURE 0 > Info: CENTER OF MASS MOVING INITIALLY? 
NO > Info: DIELECTRIC 1 > Info: EXCLUDE SCALED ONE-FOUR > Info: 1-4 SCALE FACTOR 0.4 > Info: NO DCD TRAJECTORY OUTPUT > Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT > Info: NO VELOCITY DCD OUTPUT > Info: OUTPUT FILENAME output > Info: BINARY OUTPUT FILES WILL BE USED > Info: NO RESTART FILE > Info: SWITCHING ACTIVE > Info: SWITCHING ON 7 > Info: SWITCHING OFF 8 > Info: PAIRLIST DISTANCE 9 > Info: PAIRLIST SHRINK RATE 0.01 > Info: PAIRLIST GROW RATE 0.01 > Info: PAIRLIST TRIGGER 0.3 > Info: PAIRLISTS PER CYCLE 2 > Info: PAIRLISTS ENABLED > Info: MARGIN 1 > Info: HYDROGEN GROUP CUTOFF 2.5 > Info: PATCH DIMENSION 12.5 > Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL > Info: TIMING OUTPUT STEPS 15 > Info: USING VERLET I (r-RESPA) MTS SCHEME. > Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS > Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS > Info: RANDOM NUMBER SEED 1218734148 > Info: USE HYDROGEN BONDS? NO > Info: COORDINATE PDB alanin.pdb > Info: STRUCTURE FILE alanin.psf > Info: PARAMETER file: XPLOR format! (default) > Info: PARAMETERS alanin.params > Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS > Info: SUMMARY OF PARAMETERS: > Info: 61 BONDS > Info: 179 ANGLES > Info: 38 DIHEDRAL > Info: 42 IMPROPER > Info: 0 CROSSTERM > Info: 21 VDW > Info: 0 VDW_PAIRS > Info: **************************** > Info: STRUCTURE SUMMARY: > Info: 66 ATOMS > Info: 65 BONDS > Info: 96 ANGLES > Info: 31 DIHEDRALS > Info: 32 IMPROPERS > Info: 0 CROSSTERMS > Info: 0 EXCLUSIONS > Info: 195 DEGREES OF FREEDOM > Info: 55 HYDROGEN GROUPS > Info: TOTAL MASS = 783.886 amu > Info: TOTAL CHARGE = 8.19564e-08 e > Info: ***************************** > Info: Entering startup phase 0 with 8208 kB of memory in use. > Info: Entering startup phase 1 with 8208 kB of memory in use. > Info: Entering startup phase 2 with 8208 kB of memory in use. > Info: Entering startup phase 3 with 8208 kB of memory in use. 
> Info: PATCH GRID IS 1 BY 1 BY 1 > Info: REMOVING COM VELOCITY 0 0 0 > Info: LARGEST PATCH (0) HAS 66 ATOMS > Info: CREATING 11 COMPUTE OBJECTS > Info: Entering startup phase 4 with 8208 kB of memory in use. > Info: Entering startup phase 5 with 8208 kB of memory in use. > Info: Entering startup phase 6 with 8208 kB of memory in use. > Measuring processor speeds... Done. > Info: Entering startup phase 7 with 8208 kB of memory in use. > Info: CREATING 11 COMPUTE OBJECTS > Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 > Info: NONBONDED TABLE SIZE: 705 POINTS > Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 > Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 > Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 > Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 > Info: Entering startup phase 8 with 8208 kB of memory in use. > Info: Finished startup with 8208 kB of memory in use. > ETITLE: TS BOND ANGLE DIHED > IMPRP ELECT VDW BOUNDARY > MISC KINETIC TOTAL TEMP > TOTAL2 TOTAL3 TEMPAVG > > ENERGY: 0 0.0050 0.4192 0.0368 > 0.4591 -210.1610 1.0506 0.0000 > 0.0000 0.0000 -208.1904 0.0000 > -208.1877 -208.1877 0.0000 > > ENERGY: 1 0.0051 0.4196 0.0367 > 0.4585 -210.1611 1.0184 0.0000 > 0.0000 0.0325 -208.1905 0.1675 > -208.1878 -208.1877 0.1675 > > ENERGY: 2 0.0058 0.4208 0.0365 > 0.4568 -210.1610 0.9219 0.0000 > 0.0000 0.1285 -208.1907 0.6632 > -208.1881 -208.1877 0.6632 > > ENERGY: 3 0.0092 0.4232 0.0361 > 0.4542 -210.1599 0.7617 0.0000 > 0.0000 0.2845 -208.1910 1.4683 > -208.1885 -208.1878 1.4683 > > ENERGY: 4 0.0176 0.4269 0.0356 > 0.4511 -210.1565 0.5386 0.0000 > 0.0000 0.4952 -208.1914 2.5561 > -208.1890 -208.1878 2.5561 > > ENERGY: 5 0.0327 0.4327 0.0350 > 0.4480 -210.1489 0.2537 0.0000 > 0.0000 0.7552 -208.1917 3.8977 > -208.1894 -208.1879 3.8977 > > ENERGY: 6 0.0552 0.4409 0.0343 > 0.4454 -210.1354 -0.0915 0.0000 > 0.0000 1.0592 -208.1920 5.4666 > -208.1898 -208.1880 5.4666 > 
> ENERGY: 7 0.0839 0.4522 0.0334
> 0.4440 -210.1137 -0.4951 0.0000
> 0.0000 1.4031 -208.1922 7.2418
> -208.1900 -208.1882 7.2418
>
> ENERGY: 8 0.1162 0.4674 0.0325
> 0.4448 -210.0822 -0.9550 0.0000
> 0.0000 1.7839 -208.1923 9.2074
> -208.1902 -208.1883 9.2074
>
> ENERGY: 9 0.1492 0.4870 0.0315
> 0.4485 -210.0391 -1.4687 0.0000
> 0.0000 2.1990 -208.1925 11.3497
> -208.1905 -208.1884 11.3497
>
> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9
> WRITING COORDINATES TO OUTPUT FILE AT STEP 9
> WRITING VELOCITIES TO OUTPUT FILE AT STEP 9
> ==========================================
> WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB
> End of program

--
----------------------------------------------------------------------------
Dr. Vlad Cojocaru

EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg

Tel: ++49-6221-533266
Fax: ++49-6221-533298

e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de

http://projects.villa-bosch.de/mcm/people/cojocaru/

----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080815/fa4822ca/attachment-0001.html

From panda at cse.ohio-state.edu Fri Aug 15 08:32:59 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Aug 15 08:33:14 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
In-Reply-To: <48A570CA.1020607@eml-r.villa-bosch.de>
Message-ID:

Vlad,

Please take a look at the detailed user guides of MVAPICH and MVAPICH2
regarding how to build and install these packages. They are available from
the following URL:

http://mvapich.cse.ohio-state.edu/support/

MVAPICH2 1.2 series has full autoconf-based configuration framework. It
should significantly help you.

DK

On Fri, 15 Aug 2008, Vlad Cojocaru wrote:

> Thanks Mehdi for all details,
>
> I guess you mean gcc when you say gfortran ... namd is not written in
> fortran but in charm++ which is an adaptation of c++...
>
> Well, we have debian here so we used Debian packages to install the
> infiniband libs and headers ... (our sys administrator did that). Then I
> tried to compile mvapich 1.0.1 and I found that I need to drastically
> change the make.mvapich.gen2 file in order to get it to build (since the
> defaults for $IBHOME are very strange ... we have everything in
> /usr/include/infiniband and /usr/lib/infiniband ). After all I managed
> to get it built but the namd hangs ....
>
> So I decided to try mvapich2 (1.2rc1 version) and I found lots of problems.
> Some of them I could fix but some are very strange. For instance in the
> entire source tree there are lots of references to strange directories
> /home/daffy ... or /home/7 ... and so on .. Some of them I replaced with
> ${master_top_srcdir} since I figured out that one should replace them
> but others I don't know ... Also, when I tried to build with shared
> libs, the make is not able to build the mpiname application ... I could
> not figure out why ...
>
> So, lots of problems .... I'll try to figure them out ... However, the
> problems with mvapich2 look more like bugs in the Makefiles .. So, maybe
> somebody would like to change those ...
>
> Cheers
> vlad

From mbozzore at platform.com Fri Aug 15 08:34:35 2008
From: mbozzore at platform.com (Mehdi Bozzo-Rey)
Date: Fri Aug 15 08:33:42 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
References: <48A45005.2020407@eml-r.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF64F6@catoexm06.noam.corp.platform.com> <3FA65DF819C05B40B9779B89D1FEB2950D0A09@vbemail20.villa-bosch.de> <531893A968B34D40B36C7A6445BC828A01CF6517@catoexm06.noam.corp.platform.com> <48A570CA.1020607@eml-r.villa-bosch.de>
Message-ID: <531893A968B34D40B36C7A6445BC828A01CF651C@catoexm06.noam.corp.platform.com>

Hello Vlad,

I also have a lot of applications / libraries in Fortran, which is why I
used gfortran (which is part of the gcc suite anyway) as the compiler for
fortran77 and fortran90. Please note that in that case you need to export
the following environment variable at compile time: F77_GETARGDECL=" ".
At run time, you will have to run the interactive fortran examples with
the following variable as well: GFORTRAN_UNBUFFERED_ALL=y, as mentioned in
the user guide (sections 7.1.5 and 7.1.6:
http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html ).
The story is that by default, I/O is buffered
(http://gcc.gnu.org/onlinedocs/gfortran/GFORTRAN_005fUNBUFFERED_005fALL.html)
and without that variable the example will appear to hang.

I also plan to use the Intel compilers and the Portland Group compilers
for some applications.

Unfortunately, I don't have access to debian boxes ... our cluster stack
is more Red Hat (or CentOS) oriented (for now) ...

I tried out 1.2rc1 as well with (from config.log):

-----------------------------
It was created by configure, which was generated by GNU Autoconf 2.59.
Invocation command line was

  $ ./configure --prefix=/home/mbozzore/mvapich2 --enable-f77 --enable-f90 --enable-cxx --enable-sharedlibs=gcc --with-ib-libpath=/opt/ofed/lib64 --with-ib-include=/opt/ofed/include/ --with-rdma=gen2
-----------------------------

Cheers,

Mehdi

Mehdi Bozzo-Rey
Open Source Solution Developer
Platform OCS5
Platform computing
Phone: +1 905 948 4649

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080815/8cf643b6/attachment-0001.html

From Vlad.Cojocaru at eml-r.villa-bosch.de Fri Aug 15 08:42:18 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Fri Aug 15 08:42:40 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
In-Reply-To:
References:
Message-ID: <48A579AA.9090304@eml-r.villa-bosch.de>

Dear Dhabaleswar, Dear Mehdi,

Thanks a lot for your answers.
I do have the user guide in front of me, but I still cannot explain all the references to /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 in the Makefile.in files in the source tree ....

It looks as if mvapich2 builds correctly after replacing those with ${master_top_srcdir}. I still have to test it though ...

Is the gfortran stuff relevant, since I do not use gfortran?

Cheers
vlad

Dhabaleswar Panda wrote:
> Vlad,
>
> Please take a look at the detailed user guides of MVAPICH and MVAPICH2
> regarding how to build and install these packages. They are available from
> the following URL:
>
> http://mvapich.cse.ohio-state.edu/support/
>
> The MVAPICH2 1.2 series has a full autoconf-based configuration framework. It
> should significantly help you.
>
> DK
>
>
> On Fri, 15 Aug 2008, Vlad Cojocaru wrote:
>
>> Thanks Mehdi for all the details,
>>
>> I guess you mean gcc when you say gfortran ... namd is not written in
>> fortran but in charm++, which is an adaptation of c++...
>>
>> Well, we have debian here, so we used Debian packages to install the
>> infiniband libs and headers (our sys administrator did that). Then I
>> tried to compile mvapich 1.0.1 and found that I needed to drastically
>> change the make.mvapich.gen2 file in order to get it to build (since the
>> defaults for $IBHOME are very strange ... we have everything in
>> /usr/include/infiniband and /usr/lib/infiniband). In the end I managed
>> to get it built, but namd hangs ....
>>
>> So I decided to try mvapich2 (version 1.2rc1) and I found lots of problems.
>> Some of them I could fix, but some are very strange. For instance, in the
>> entire source tree there are lots of references to strange directories:
>> /home/daffy ... or /home/7 ... and so on. Some of them I replaced with
>> ${master_top_srcdir}, since I figured out that one should replace them,
>> but others I don't know ... Also, when I tried to build with shared
>> libs, the make is not able to build the mpiname application ...
>> I could not figure out why ...
>>
>> So, lots of problems .... I'll try to figure them out ... However, the
>> problems with mvapich2 look more like bugs in the Makefiles. So maybe
>> somebody would like to change those ...
>>
>> Cheers
>> vlad
>>
>>
>> Mehdi Bozzo-Rey wrote:
>>
>>> Hi Vlad,
>>>
>>> No, I did not use the intel compilers (not yet). I used gfortran. More
>>> precisely:
>>>
>>> OS:
>>>
>>> RHEL 5.1 (Kernel 2.6.18-53.el5)
>>>
>>> [mbozzore@tyan04 ~]$ mpicc --version
>>> gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>
>>> [mbozzore@tyan04 ~]$ mpicxx --version
>>> g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>
>>> [mbozzore@tyan04 ~]$ mpif77 --version
>>> GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>
>>> [mbozzore@tyan04 ~]$ mpif90 --version
>>> GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>
>>> Hardware: intel quads for the nodes, topspin switch and hcas for IB.
>>>
>>> Yes, I used OFED (1.3).
>>>
>>> I did not enable sharedlibs for that build.
>>>
>>> I will double check, but if I remember well, everything was fine
>>> (compilation) on the mvapich2 side. What version did you use?
>>>
>>> Cheers,
>>>
>>> Mehdi
>>>
>>> Mehdi Bozzo-Rey
>>> Open Source Solution Developer
>>> Platform computing
>>> Phone : +1 905 948 4649
>>> E-mail : mbozzore@platform.com
>>>
>>> *From:* Cojocaru,Vlad [mailto:vlad.cojocaru@eml-r.villa-bosch.de]
>>> *Sent:* August-14-08 4:35 PM
>>> *To:* Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu
>>> *Subject:* RE: [mvapich-discuss] compile charm++ and namd with mvapich
>>> 1.0.1and/or mvapich2
>>>
>>> Hi Mehdi,
>>>
>>> Did you use intel 10.1 as well? Did you build on openfabrics? What
>>> compiler flags did you pass to the mvapich build? Did you build with
>>> --enable sharedlib or without?
>>> I would be grateful if you could give me some
>>> details of how you built mvapich.
>>> Thanks for the reply. Yes, there is something about the compilation of
>>> mvapich. As I said, I successfully compiled NAMD on a cluster that
>>> already had mvapich compiled with intel as the default mpi lib. However,
>>> on the new cluster (quad-core AMD opterons with mellanox infiniband)
>>> I got these problems. So it's definitely the mvapich build which
>>> fails, although I don't get any errors from make.
>>>
>>> Any idea why the mpiname application fails to compile when compiling
>>> mvapich2?
>>>
>>> Thanks again
>>>
>>> Best wishes
>>> vlad
>>>
>>>
>>> -----Original Message-----
>>> From: Mehdi Bozzo-Rey [mailto:mbozzore@platform.com]
>>> Sent: Thu 8/14/2008 7:20 PM
>>> To: Cojocaru,Vlad; mvapich-discuss@cse.ohio-state.edu
>>> Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich
>>> 1.0.1and/or mvapich2
>>>
>>> Hello Vlad,
>>>
>>>
>>> I just recompiled NAMD and it looks OK for me (output of a simple test
>>> below). I guess the problem is on the compilation side.
>>>
>>> Best regards,
>>>
>>> Mehdi
>>>
>>> Mehdi Bozzo-Rey
>>> Open Source Solution Developer
>>> Platform computing
>>> Phone : +1 905 948 4649
>>> E-mail : mbozzore@platform.com
>>>
>>>
>>> [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile
>>> ./hosts.8 ./namd2 src/alanin
>>> Charm++> Running on MPI version: 1.2 multi-thread support: 1/1
>>> Charm warning> Randomization of stack pointer is turned on in Kernel,
>>> run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable
>>> it. Thread migration may not work!
>>> Info: NAMD 2.6 for Linux-amd64-MPI
>>> Info:
>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>> Info: and send feedback or bug reports to namd@ks.uiuc.edu
>>> Info:
>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>> Info: in all publications reporting results obtained with NAMD.
>>> Info: >>> Info: Based on Charm++/Converse 50914 for >>> mpi-linux-x86_64-gfortran-smp-mpicxx >>> Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on >>> tyan04.lsf.platform.com >>> Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore >>> Info: Running on 8 processors. >>> Info: 8208 kB of memory in use. >>> Info: Memory usage based on mallinfo >>> Info: Changed directory to src >>> Info: Configuration file is alanin >>> TCL: Suspending until startup complete. >>> Info: SIMULATION PARAMETERS: >>> Info: TIMESTEP 0.5 >>> Info: NUMBER OF STEPS 9 >>> Info: STEPS PER CYCLE 3 >>> Info: LOAD BALANCE STRATEGY Other >>> Info: LDB PERIOD 600 steps >>> Info: FIRST LDB TIMESTEP 15 >>> Info: LDB BACKGROUND SCALING 1 >>> Info: HOM BACKGROUND SCALING 1 >>> Info: MAX SELF PARTITIONS 50 >>> Info: MAX PAIR PARTITIONS 20 >>> Info: SELF PARTITION ATOMS 125 >>> Info: PAIR PARTITION ATOMS 200 >>> Info: PAIR2 PARTITION ATOMS 400 >>> Info: MIN ATOMS PER PATCH 100 >>> Info: INITIAL TEMPERATURE 0 >>> Info: CENTER OF MASS MOVING INITIALLY? NO >>> Info: DIELECTRIC 1 >>> Info: EXCLUDE SCALED ONE-FOUR >>> Info: 1-4 SCALE FACTOR 0.4 >>> Info: NO DCD TRAJECTORY OUTPUT >>> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT >>> Info: NO VELOCITY DCD OUTPUT >>> Info: OUTPUT FILENAME output >>> Info: BINARY OUTPUT FILES WILL BE USED >>> Info: NO RESTART FILE >>> Info: SWITCHING ACTIVE >>> Info: SWITCHING ON 7 >>> Info: SWITCHING OFF 8 >>> Info: PAIRLIST DISTANCE 9 >>> Info: PAIRLIST SHRINK RATE 0.01 >>> Info: PAIRLIST GROW RATE 0.01 >>> Info: PAIRLIST TRIGGER 0.3 >>> Info: PAIRLISTS PER CYCLE 2 >>> Info: PAIRLISTS ENABLED >>> Info: MARGIN 1 >>> Info: HYDROGEN GROUP CUTOFF 2.5 >>> Info: PATCH DIMENSION 12.5 >>> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL >>> Info: TIMING OUTPUT STEPS 15 >>> Info: USING VERLET I (r-RESPA) MTS SCHEME. 
>>> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS >>> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS >>> Info: RANDOM NUMBER SEED 1218734148 >>> Info: USE HYDROGEN BONDS? NO >>> Info: COORDINATE PDB alanin.pdb >>> Info: STRUCTURE FILE alanin.psf >>> Info: PARAMETER file: XPLOR format! (default) >>> Info: PARAMETERS alanin.params >>> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS >>> Info: SUMMARY OF PARAMETERS: >>> Info: 61 BONDS >>> Info: 179 ANGLES >>> Info: 38 DIHEDRAL >>> Info: 42 IMPROPER >>> Info: 0 CROSSTERM >>> Info: 21 VDW >>> Info: 0 VDW_PAIRS >>> Info: **************************** >>> Info: STRUCTURE SUMMARY: >>> Info: 66 ATOMS >>> Info: 65 BONDS >>> Info: 96 ANGLES >>> Info: 31 DIHEDRALS >>> Info: 32 IMPROPERS >>> Info: 0 CROSSTERMS >>> Info: 0 EXCLUSIONS >>> Info: 195 DEGREES OF FREEDOM >>> Info: 55 HYDROGEN GROUPS >>> Info: TOTAL MASS = 783.886 amu >>> Info: TOTAL CHARGE = 8.19564e-08 e >>> Info: ***************************** >>> Info: Entering startup phase 0 with 8208 kB of memory in use. >>> Info: Entering startup phase 1 with 8208 kB of memory in use. >>> Info: Entering startup phase 2 with 8208 kB of memory in use. >>> Info: Entering startup phase 3 with 8208 kB of memory in use. >>> Info: PATCH GRID IS 1 BY 1 BY 1 >>> Info: REMOVING COM VELOCITY 0 0 0 >>> Info: LARGEST PATCH (0) HAS 66 ATOMS >>> Info: CREATING 11 COMPUTE OBJECTS >>> Info: Entering startup phase 4 with 8208 kB of memory in use. >>> Info: Entering startup phase 5 with 8208 kB of memory in use. >>> Info: Entering startup phase 6 with 8208 kB of memory in use. >>> Measuring processor speeds... Done. >>> Info: Entering startup phase 7 with 8208 kB of memory in use. 
>>> Info: CREATING 11 COMPUTE OBJECTS >>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 >>> Info: NONBONDED TABLE SIZE: 705 POINTS >>> Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 >>> Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 >>> Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 >>> Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 >>> Info: Entering startup phase 8 with 8208 kB of memory in use. >>> Info: Finished startup with 8208 kB of memory in use. >>> ETITLE: TS BOND ANGLE DIHED >>> IMPRP ELECT VDW BOUNDARY >>> MISC KINETIC TOTAL TEMP >>> TOTAL2 TOTAL3 TEMPAVG >>> >>> ENERGY: 0 0.0050 0.4192 0.0368 >>> 0.4591 -210.1610 1.0506 0.0000 >>> 0.0000 0.0000 -208.1904 0.0000 >>> -208.1877 -208.1877 0.0000 >>> >>> ENERGY: 1 0.0051 0.4196 0.0367 >>> 0.4585 -210.1611 1.0184 0.0000 >>> 0.0000 0.0325 -208.1905 0.1675 >>> -208.1878 -208.1877 0.1675 >>> >>> ENERGY: 2 0.0058 0.4208 0.0365 >>> 0.4568 -210.1610 0.9219 0.0000 >>> 0.0000 0.1285 -208.1907 0.6632 >>> -208.1881 -208.1877 0.6632 >>> >>> ENERGY: 3 0.0092 0.4232 0.0361 >>> 0.4542 -210.1599 0.7617 0.0000 >>> 0.0000 0.2845 -208.1910 1.4683 >>> -208.1885 -208.1878 1.4683 >>> >>> ENERGY: 4 0.0176 0.4269 0.0356 >>> 0.4511 -210.1565 0.5386 0.0000 >>> 0.0000 0.4952 -208.1914 2.5561 >>> -208.1890 -208.1878 2.5561 >>> >>> ENERGY: 5 0.0327 0.4327 0.0350 >>> 0.4480 -210.1489 0.2537 0.0000 >>> 0.0000 0.7552 -208.1917 3.8977 >>> -208.1894 -208.1879 3.8977 >>> >>> ENERGY: 6 0.0552 0.4409 0.0343 >>> 0.4454 -210.1354 -0.0915 0.0000 >>> 0.0000 1.0592 -208.1920 5.4666 >>> -208.1898 -208.1880 5.4666 >>> >>> ENERGY: 7 0.0839 0.4522 0.0334 >>> 0.4440 -210.1137 -0.4951 0.0000 >>> 0.0000 1.4031 -208.1922 7.2418 >>> -208.1900 -208.1882 7.2418 >>> >>> ENERGY: 8 0.1162 0.4674 0.0325 >>> 0.4448 -210.0822 -0.9550 0.0000 >>> 0.0000 1.7839 -208.1923 9.2074 >>> -208.1902 -208.1883 9.2074 >>> >>> ENERGY: 9 0.1492 0.4870 0.0315 >>> 0.4485 -210.0391 
-1.4687 0.0000
>>> 0.0000 2.1990 -208.1925 11.3497
>>> -208.1905 -208.1884 11.3497
>>>
>>> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9
>>> WRITING COORDINATES TO OUTPUT FILE AT STEP 9
>>> WRITING VELOCITIES TO OUTPUT FILE AT STEP 9
>>> ==========================================
>>> WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB
>>> End of program

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080815/70d66186/attachment-0001.html From mbozzore at platform.com Fri Aug 15 08:48:12 2008 From: mbozzore at platform.com (Mehdi Bozzo-Rey) Date: Fri Aug 15 08:47:21 2008 Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 References: <48A579AA.9090304@eml-r.villa-bosch.de> Message-ID: <531893A968B34D40B36C7A6445BC828A01CF651D@catoexm06.noam.corp.platform.com> Hello Vlad, If you don?t plan to use gfortran, then the gfortran things are not relevant. Cheers, Mehdi Mehdi Bozzo-Rey Open Source Solution Developer Platform OCS5 Platform computing Phone: +1 905 948 4649 From: Vlad Cojocaru [mailto:Vlad.Cojocaru@eml-r.villa-bosch.de] Sent: August-15-08 8:42 AM To: Dhabaleswar Panda Cc: Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 Dear Dhabaleswar, Dear Mehdi, Thanks a lot for your answers. I do have the user guide in front of my eyes but I still cannot explain all the references to /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 in the Makefile.in files in the source tree .... It looks as if after replacing those with ${master_top_srcdir} mvapich2 builds correctly .. I still have to test it though ... Is the gfortran stuff relevant since I do not use gfortran ? Cheers vlad Dhabaleswar Panda wrote: Vald, Please take a look at the detailed user guides of MVAPICH and MVAPICH2 regarding how to build and install these packages. They are available from the following URL: http://mvapich.cse.ohio-state.edu/support/ MVAPICH2 1.2 series has full autoconf-based configuration framework. It should significantly help you. DK On Fri, 15 Aug 2008, Vlad Cojocaru wrote: Thanks Mehdi for all details, I guess you mean gcc when you say gfortran ... namd is not written in fortran but in charm++ which is an adaptation of c++... 
Well, we have debian here so we used Debian packages to install the inifiniband libs and headers ...(our sys administrator did that). Then I tried to compile mvapich 1.0.1 and I found that I need the drastically change the make.mvapich.gen2 file in order to get it to build (since the defaults for $IBHOME are very strange ... we have everything in /usr/include/infiniband and /usr/lib/infiniband ). After all I managed to get it built but the namd hangs .... So I decided to try mvapich2 (1.2rc1 version) and I found lots problems. Some of them I could fix but some are very strange. For instance in the entire source tree there are lots of references to strange directories /home/daffy ... or /home/7 ... and so on .. Some of them I replaced with ${master_top_srcdir} since I figured out that one should replace them but others I don't know ... Also, when I tried to build with shared libs, the make is not able to build the mpiname application ... I could not figure out why ... So, lots of problems ....I'll try to figure them out ... However, the problems with mvapich2 look more as bugs in the Makefiiles .. So, maybe somebody would like to change those ... Cheers vlad Mehdi Bozzo-Rey wrote: Hi Vlad, No, I did not use the intel compilers (not yet). I used gfortran. More precisely: OS: RHEL 5.1 (Kernel 2.6.18-53.el5) [mbozzore@tyan04 ~]$ mpicc --version gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) [mbozzore@tyan04 ~]$ mpicxx --version g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) [mbozzore@tyan04 ~]$ mpif77 --version GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) [mbozzore@tyan04 ~]$ mpif90 --version GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42) Hardware: intel quads for the nodes, topspin switch and hcas for IB. Yes, I used OFED (1.3). I did not enable sharedlibs for that build. I will double check but if I remember well, everything was fine (compilation) on the mvapich2 side. What version did you use ? 
Cheers, Mehdi Mehdi Bozzo-Rey Open Source Solution Developer Platform computing Phone : +1 905 948 4649 E-mail : mbozzore@platform.com *From:* Cojocaru,Vlad [mailto:vlad.cojocaru@eml-r.villa-bosch.de] *Sent:* August-14-08 4:35 PM *To:* Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu *Subject:* RE: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 Hi Mehdi, Did you use intel 10.1 as well ? Did you build on openfabrics ? what compiler flags did you pass to the mvapich build? Did you build with --enable sharedlib or without? I would be grateful If you give me some bits of the details how you built mvapich?. Thanks for the reply. Yes, there is something about the compilation of mvapich. As I said I successfully compiled NAMD on a cluster that had already mvapich compiled with intel as the default mpi lib. However, on the new cluster (quad cores AMD opterons with mellanox infiniband) I got these problems. So, its definitely the mvapich build which fails although I don't get any errors fro make. Any idea why the mpiname application fails to compile when compiling mvapich2 ? Thanks again Best wishes vlad -----Original Message----- From: Mehdi Bozzo-Rey [mailto:mbozzore@platform.com] Sent: Thu 8/14/2008 7:20 PM To: Cojocaru,Vlad; mvapich-discuss@cse.ohio-state.edu Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 Hello Vlad, I just recompiled NAMD and it looks ok for me (output of simple test below). I guess the problem is on the compilation side. Best regards, Mehdi Mehdi Bozzo-Rey Open Source Solution Developer Platform computing Phone : +1 905 948 4649 E-mail : mbozzore@platform.com [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile ./hosts.8 ./namd2 src/alanin Charm++> Running on MPI version: 1.2 multi-thread support: 1/1 Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. 
Thread migration may not work! Info: NAMD 2.6 for Linux-amd64-MPI Info: Info: Please visit http://www.ks.uiuc.edu/Research/namd/ Info: and send feedback or bug reports to namd@ks.uiuc.edu Info: Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005) Info: in all publications reporting results obtained with NAMD. Info: Info: Based on Charm++/Converse 50914 for mpi-linux-x86_64-gfortran-smp-mpicxx Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on tyan04.lsf.platform.com Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore Info: Running on 8 processors. Info: 8208 kB of memory in use. Info: Memory usage based on mallinfo Info: Changed directory to src Info: Configuration file is alanin TCL: Suspending until startup complete. Info: SIMULATION PARAMETERS: Info: TIMESTEP 0.5 Info: NUMBER OF STEPS 9 Info: STEPS PER CYCLE 3 Info: LOAD BALANCE STRATEGY Other Info: LDB PERIOD 600 steps Info: FIRST LDB TIMESTEP 15 Info: LDB BACKGROUND SCALING 1 Info: HOM BACKGROUND SCALING 1 Info: MAX SELF PARTITIONS 50 Info: MAX PAIR PARTITIONS 20 Info: SELF PARTITION ATOMS 125 Info: PAIR PARTITION ATOMS 200 Info: PAIR2 PARTITION ATOMS 400 Info: MIN ATOMS PER PATCH 100 Info: INITIAL TEMPERATURE 0 Info: CENTER OF MASS MOVING INITIALLY? NO Info: DIELECTRIC 1 Info: EXCLUDE SCALED ONE-FOUR Info: 1-4 SCALE FACTOR 0.4 Info: NO DCD TRAJECTORY OUTPUT Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT Info: NO VELOCITY DCD OUTPUT Info: OUTPUT FILENAME output Info: BINARY OUTPUT FILES WILL BE USED Info: NO RESTART FILE Info: SWITCHING ACTIVE Info: SWITCHING ON 7 Info: SWITCHING OFF 8 Info: PAIRLIST DISTANCE 9 Info: PAIRLIST SHRINK RATE 0.01 Info: PAIRLIST GROW RATE 0.01 Info: PAIRLIST TRIGGER 0.3 Info: PAIRLISTS PER CYCLE 2 Info: PAIRLISTS ENABLED Info: MARGIN 1 Info: HYDROGEN GROUP CUTOFF 2.5 Info: PATCH DIMENSION 12.5 Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL Info: TIMING OUTPUT STEPS 15 Info: USING VERLET I (r-RESPA) MTS SCHEME. 
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS Info: RANDOM NUMBER SEED 1218734148 Info: USE HYDROGEN BONDS? NO Info: COORDINATE PDB alanin.pdb Info: STRUCTURE FILE alanin.psf Info: PARAMETER file: XPLOR format! (default) Info: PARAMETERS alanin.params Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS Info: SUMMARY OF PARAMETERS: Info: 61 BONDS Info: 179 ANGLES Info: 38 DIHEDRAL Info: 42 IMPROPER Info: 0 CROSSTERM Info: 21 VDW Info: 0 VDW_PAIRS Info: **************************** Info: STRUCTURE SUMMARY: Info: 66 ATOMS Info: 65 BONDS Info: 96 ANGLES Info: 31 DIHEDRALS Info: 32 IMPROPERS Info: 0 CROSSTERMS Info: 0 EXCLUSIONS Info: 195 DEGREES OF FREEDOM Info: 55 HYDROGEN GROUPS Info: TOTAL MASS = 783.886 amu Info: TOTAL CHARGE = 8.19564e-08 e Info: ***************************** Info: Entering startup phase 0 with 8208 kB of memory in use. Info: Entering startup phase 1 with 8208 kB of memory in use. Info: Entering startup phase 2 with 8208 kB of memory in use. Info: Entering startup phase 3 with 8208 kB of memory in use. Info: PATCH GRID IS 1 BY 1 BY 1 Info: REMOVING COM VELOCITY 0 0 0 Info: LARGEST PATCH (0) HAS 66 ATOMS Info: CREATING 11 COMPUTE OBJECTS Info: Entering startup phase 4 with 8208 kB of memory in use. Info: Entering startup phase 5 with 8208 kB of memory in use. Info: Entering startup phase 6 with 8208 kB of memory in use. Measuring processor speeds... Done. Info: Entering startup phase 7 with 8208 kB of memory in use. Info: CREATING 11 COMPUTE OBJECTS Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 Info: NONBONDED TABLE SIZE: 705 POINTS Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 Info: Entering startup phase 8 with 8208 kB of memory in use. 
Info: Finished startup with 8208 kB of memory in use. ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG ENERGY: 0 0.0050 0.4192 0.0368 0.4591 -210.1610 1.0506 0.0000 0.0000 0.0000 -208.1904 0.0000 -208.1877 -208.1877 0.0000 ENERGY: 1 0.0051 0.4196 0.0367 0.4585 -210.1611 1.0184 0.0000 0.0000 0.0325 -208.1905 0.1675 -208.1878 -208.1877 0.1675 ENERGY: 2 0.0058 0.4208 0.0365 0.4568 -210.1610 0.9219 0.0000 0.0000 0.1285 -208.1907 0.6632 -208.1881 -208.1877 0.6632 ENERGY: 3 0.0092 0.4232 0.0361 0.4542 -210.1599 0.7617 0.0000 0.0000 0.2845 -208.1910 1.4683 -208.1885 -208.1878 1.4683 ENERGY: 4 0.0176 0.4269 0.0356 0.4511 -210.1565 0.5386 0.0000 0.0000 0.4952 -208.1914 2.5561 -208.1890 -208.1878 2.5561 ENERGY: 5 0.0327 0.4327 0.0350 0.4480 -210.1489 0.2537 0.0000 0.0000 0.7552 -208.1917 3.8977 -208.1894 -208.1879 3.8977 ENERGY: 6 0.0552 0.4409 0.0343 0.4454 -210.1354 -0.0915 0.0000 0.0000 1.0592 -208.1920 5.4666 -208.1898 -208.1880 5.4666 ENERGY: 7 0.0839 0.4522 0.0334 0.4440 -210.1137 -0.4951 0.0000 0.0000 1.4031 -208.1922 7.2418 -208.1900 -208.1882 7.2418 ENERGY: 8 0.1162 0.4674 0.0325 0.4448 -210.0822 -0.9550 0.0000 0.0000 1.7839 -208.1923 9.2074 -208.1902 -208.1883 9.2074 ENERGY: 9 0.1492 0.4870 0.0315 0.4485 -210.0391 -1.4687 0.0000 0.0000 2.1990 -208.1925 11.3497 -208.1905 -208.1884 11.3497 WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9 WRITING COORDINATES TO OUTPUT FILE AT STEP 9 WRITING VELOCITIES TO OUTPUT FILE AT STEP 9 ========================================== WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB End of program -----Original Message----- From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Vlad Cojocaru Sent: August-14-08 11:32 AM To: mvapich-discuss@cse.ohio-state.edu Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2 Dear mvapich users, I tried to compile mvapich1.0.1, charm++ and namd on our new 
Linux-amd64 infiniband cluster using the intel 10.1.015 compilers. With mvapich1.0.1, I managed to build mvapich1.0.1, tested the programs in the /examples directory. Then, I bult charm++ and tested it with "mpirun_rsh -n 2" .. All tests passed correctly. Then I built namd on top of mvapich1.0.1 and charm, Everything seemed ok only that the namd executable hangs without error messages. In fact it appears as if it still runs but it doesn't produce any output. If I repeat exactly the same procedure but with openmpi instead of mvapich, everything works fine ....(however I am not so happy about the scaling of openmpi on infiniband) Does anyone have experience with installing namd using mvapich1.0.1 ? If yes, any idea why this happens? I must say when I did the same on another cluster which had mvapich1.0.1 already compiled with the intel compilers, everything worked out correcltly. So, it must be something with the compilation of mvapich1.0.1 on our new infiniband setup that creates the problem. The german in the error simply says that executable "mpiname was not found" Best wishes vlad ----------------------------------error------------------------------------------------------------------------ I also tried mvapich2 but the compilation fails when installing the mpiname application (see error below) which apparently fails to compile (no executable is found in /env/mpiname dir). However no error messages are printed by make and the build completes correctly. So I am not sure why mpiname does not compile and still make install tries to install it ... /usr/bin/install -c mpiname/mpiname /sw/mcm/app/vlad/mpi/C07/mvapich2/1.2/bin/mpiname /usr/bin/install: Aufruf von stat f?r ?mpiname/mpiname? nicht m?glich: Datei oder Verzeichnis nicht gefunden make[1]: *** [install] Fehler 1 make[1]: Leaving directory `/sw/mcm/app/vlad/mpi/C07/mvapich2/1.2-src/src/env' make: *** [install] Fehler 2 -- ---------------------------------------------------------------------------- Dr. 
Vlad Cojocaru EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg Tel: ++49-6221-533266 Fax: ++49-6221-533298 e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de http://projects.villa-bosch.de/mcm/people/cojocaru/ ---------------------------------------------------------------------------- EML Research gGmbH Amtgericht Mannheim / HRB 337446 Managing Partner: Dr. h.c. Klaus Tschira Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter http://www.eml-r.org ---------------------------------------------------------------------------- _______________________________________________ mvapich-discuss mailing list mvapich-discuss@cse.ohio-state.edu http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- ---------------------------------------------------------------------------- Dr. Vlad Cojocaru EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg Tel: ++49-6221-533266 Fax: ++49-6221-533298 e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de http://projects.villa-bosch.de/mcm/people/cojocaru/ ---------------------------------------------------------------------------- EML Research gGmbH Amtgericht Mannheim / HRB 337446 Managing Partner: Dr. h.c. Klaus Tschira Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter http://www.eml-r.org ---------------------------------------------------------------------------- -- ---------------------------------------------------------------------------- Dr. Vlad Cojocaru EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg Tel: ++49-6221-533266 Fax: ++49-6221-533298 e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de http://projects.villa-bosch.de/mcm/people/cojocaru/ ---------------------------------------------------------------------------- EML Research gGmbH Amtgericht Mannheim / HRB 337446 Managing Partner: Dr. h.c. Klaus Tschira Scientific and Managing Director: Prof. Dr.-Ing. 
Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------

From perkinjo at cse.ohio-state.edu Fri Aug 15 09:18:14 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Fri Aug 15 09:19:46 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
In-Reply-To: <48A579AA.9090304@eml-r.villa-bosch.de>
References: <48A579AA.9090304@eml-r.villa-bosch.de>
Message-ID: <20080815131812.GG24018@cse.ohio-state.edu>

Vlad:
Can you please send us your config.log and make.log files? Those references to /home/7/curtisbr should only be used when regenerating the Makefile.in files, which are already present in the tarball. Also, I do not know why mpiname is failing to build for you, but it should not affect the functionality of any MPI programs.

On Fri, Aug 15, 2008 at 02:42:18PM +0200, Vlad Cojocaru wrote:
> Dear Dhabaleswar, Dear Mehdi,
>
> Thanks a lot for your answers. I do have the user guide in front of me,
> but I still cannot explain all the references to
> /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 in the Makefile.in files
> in the source tree.
>
> It looks as if mvapich2 builds correctly after replacing those with
> ${master_top_srcdir}. I still have to test it though.
>
> Is the gfortran stuff relevant, since I do not use gfortran?
>
> Cheers
> vlad
>
> Dhabaleswar Panda wrote:
>> Vlad,
>>
>> Please take a look at the detailed user guides of MVAPICH and MVAPICH2
>> regarding how to build and install these packages. They are available from
>> the following URL:
>>
>> http://mvapich.cse.ohio-state.edu/support/
>>
>> The MVAPICH2 1.2 series has a full autoconf-based configuration framework. It
>> should significantly help you.
>>
>> DK
>>
>> On Fri, 15 Aug 2008, Vlad Cojocaru wrote:
>>
>>> Thanks Mehdi for all the details,
>>>
>>> I guess you mean gcc when you say gfortran. namd is not written in
>>> Fortran but in charm++, which is an adaptation of C++.
>>>
>>> Well, we have Debian here, so we used Debian packages to install the
>>> InfiniBand libs and headers (our sys administrator did that). Then I
>>> tried to compile mvapich 1.0.1 and found that I needed to drastically
>>> change the make.mvapich.gen2 file to get it to build (the
>>> defaults for $IBHOME are very strange; we have everything in
>>> /usr/include/infiniband and /usr/lib/infiniband). In the end I managed
>>> to get it built, but namd hangs.
>>>
>>> So I decided to try mvapich2 (version 1.2rc1) and I found lots of problems.
>>> Some of them I could fix, but some are very strange. For instance, the
>>> source tree contains many references to strange directories such as
>>> /home/daffy ... or /home/7 ... Some of them I replaced with
>>> ${master_top_srcdir}, since I figured out that one should replace them,
>>> but for others I don't know what to use. Also, when I tried to build
>>> with shared libs, make was not able to build the mpiname application.
>>> I could not figure out why.
>>>
>>> So, lots of problems. I'll try to figure them out. However, the
>>> problems with mvapich2 look more like bugs in the Makefiles, so maybe
>>> somebody would like to fix those.
>>>
>>> Cheers
>>> vlad
>>>
>>> Mehdi Bozzo-Rey wrote:
>>>
>>>> Hi Vlad,
>>>>
>>>> No, I did not use the intel compilers (not yet). I used gfortran.
>>>> More precisely:
>>>>
>>>> OS:
>>>>
>>>> RHEL 5.1 (Kernel 2.6.18-53.el5)
>>>>
>>>> [mbozzore@tyan04 ~]$ mpicc --version
>>>> gcc (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>>
>>>> [mbozzore@tyan04 ~]$ mpicxx --version
>>>> g++ (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>>
>>>> [mbozzore@tyan04 ~]$ mpif77 --version
>>>> GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>>
>>>> [mbozzore@tyan04 ~]$ mpif90 --version
>>>> GNU Fortran (GCC) 4.1.2 20071124 (Red Hat 4.1.2-42)
>>>>
>>>> Hardware: Intel quads for the nodes, Topspin switch and HCAs for IB.
>>>>
>>>> Yes, I used OFED (1.3).
>>>>
>>>> I did not enable sharedlibs for that build.
>>>>
>>>> I will double-check, but if I remember correctly, everything was fine
>>>> (compilation) on the mvapich2 side. What version did you use?
>>>>
>>>> Cheers,
>>>>
>>>> Mehdi
>>>>
>>>> Mehdi Bozzo-Rey
>>>> Open Source Solution Developer
>>>> Platform computing
>>>> Phone : +1 905 948 4649
>>>> E-mail : mbozzore@platform.com
>>>>
>>>> *From:* Cojocaru,Vlad [mailto:vlad.cojocaru@eml-r.villa-bosch.de]
>>>> *Sent:* August-14-08 4:35 PM
>>>> *To:* Mehdi Bozzo-Rey; mvapich-discuss@cse.ohio-state.edu
>>>> *Subject:* RE: [mvapich-discuss] compile charm++ and namd with mvapich
>>>> 1.0.1and/or mvapich2
>>>>
>>>> Hi Mehdi,
>>>>
>>>> Did you use Intel 10.1 as well? Did you build on OpenFabrics? What
>>>> compiler flags did you pass to the mvapich build? Did you build with
>>>> --enable sharedlib or without? I would be grateful if you could give
>>>> me some details of how you built mvapich.
>>>> Thanks for the reply. Yes, there is something about the compilation of
>>>> mvapich. As I said, I successfully compiled NAMD on a cluster that
>>>> already had mvapich compiled with Intel as the default MPI lib.
>>>> However, on the new cluster (quad-core AMD Opterons with Mellanox
>>>> InfiniBand) I got these problems. So it's definitely the mvapich build
>>>> which fails, although I don't get any errors from make.
>>>>
>>>> Any idea why the mpiname application fails to compile when compiling
>>>> mvapich2?
>>>>
>>>> Thanks again
>>>>
>>>> Best wishes
>>>> vlad
>>>>
>>>> -----Original Message-----
>>>> From: Mehdi Bozzo-Rey [mailto:mbozzore@platform.com]
>>>> Sent: Thu 8/14/2008 7:20 PM
>>>> To: Cojocaru,Vlad; mvapich-discuss@cse.ohio-state.edu
>>>> Subject: RE: [mvapich-discuss] compile charm++ and namd with mvapich
>>>> 1.0.1and/or mvapich2
>>>>
>>>> Hello Vlad,
>>>>
>>>> I just recompiled NAMD and it looks OK for me (output of a simple test
>>>> below). I guess the problem is on the compilation side.
>>>>
>>>> Best regards,
>>>>
>>>> Mehdi
>>>>
>>>> Mehdi Bozzo-Rey
>>>> Open Source Solution Developer
>>>> Platform computing
>>>> Phone : +1 905 948 4649
>>>> E-mail : mbozzore@platform.com
>>>>
>>>> [mbozzore@tyan04 Linux-amd64-MPI]$ mpirun_rsh -np 8 -hostfile ./hosts.8 ./namd2 src/alanin
>>>> Charm++> Running on MPI version: 1.2 multi-thread support: 1/1
>>>> Charm warning> Randomization of stack pointer is turned on in Kernel,
>>>> run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable
>>>> it. Thread migration may not work!
>>>> Info: NAMD 2.6 for Linux-amd64-MPI
>>>> Info:
>>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>>> Info: and send feedback or bug reports to namd@ks.uiuc.edu
>>>> Info:
>>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>>> Info: in all publications reporting results obtained with NAMD.
>>>> Info: >>>> Info: Based on Charm++/Converse 50914 for >>>> mpi-linux-x86_64-gfortran-smp-mpicxx >>>> Info: Built Thu Aug 14 13:12:02 EDT 2008 by mbozzore on >>>> tyan04.lsf.platform.com >>>> Info: 1 NAMD 2.6 Linux-amd64-MPI 8 compute-00-00.ocs5.org mbozzore >>>> Info: Running on 8 processors. >>>> Info: 8208 kB of memory in use. >>>> Info: Memory usage based on mallinfo >>>> Info: Changed directory to src >>>> Info: Configuration file is alanin >>>> TCL: Suspending until startup complete. >>>> Info: SIMULATION PARAMETERS: >>>> Info: TIMESTEP 0.5 >>>> Info: NUMBER OF STEPS 9 >>>> Info: STEPS PER CYCLE 3 >>>> Info: LOAD BALANCE STRATEGY Other >>>> Info: LDB PERIOD 600 steps >>>> Info: FIRST LDB TIMESTEP 15 >>>> Info: LDB BACKGROUND SCALING 1 >>>> Info: HOM BACKGROUND SCALING 1 >>>> Info: MAX SELF PARTITIONS 50 >>>> Info: MAX PAIR PARTITIONS 20 >>>> Info: SELF PARTITION ATOMS 125 >>>> Info: PAIR PARTITION ATOMS 200 >>>> Info: PAIR2 PARTITION ATOMS 400 >>>> Info: MIN ATOMS PER PATCH 100 >>>> Info: INITIAL TEMPERATURE 0 >>>> Info: CENTER OF MASS MOVING INITIALLY? NO >>>> Info: DIELECTRIC 1 >>>> Info: EXCLUDE SCALED ONE-FOUR >>>> Info: 1-4 SCALE FACTOR 0.4 >>>> Info: NO DCD TRAJECTORY OUTPUT >>>> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT >>>> Info: NO VELOCITY DCD OUTPUT >>>> Info: OUTPUT FILENAME output >>>> Info: BINARY OUTPUT FILES WILL BE USED >>>> Info: NO RESTART FILE >>>> Info: SWITCHING ACTIVE >>>> Info: SWITCHING ON 7 >>>> Info: SWITCHING OFF 8 >>>> Info: PAIRLIST DISTANCE 9 >>>> Info: PAIRLIST SHRINK RATE 0.01 >>>> Info: PAIRLIST GROW RATE 0.01 >>>> Info: PAIRLIST TRIGGER 0.3 >>>> Info: PAIRLISTS PER CYCLE 2 >>>> Info: PAIRLISTS ENABLED >>>> Info: MARGIN 1 >>>> Info: HYDROGEN GROUP CUTOFF 2.5 >>>> Info: PATCH DIMENSION 12.5 >>>> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL >>>> Info: TIMING OUTPUT STEPS 15 >>>> Info: USING VERLET I (r-RESPA) MTS SCHEME. 
>>>> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS >>>> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS >>>> Info: RANDOM NUMBER SEED 1218734148 >>>> Info: USE HYDROGEN BONDS? NO >>>> Info: COORDINATE PDB alanin.pdb >>>> Info: STRUCTURE FILE alanin.psf >>>> Info: PARAMETER file: XPLOR format! (default) >>>> Info: PARAMETERS alanin.params >>>> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS >>>> Info: SUMMARY OF PARAMETERS: >>>> Info: 61 BONDS >>>> Info: 179 ANGLES >>>> Info: 38 DIHEDRAL >>>> Info: 42 IMPROPER >>>> Info: 0 CROSSTERM >>>> Info: 21 VDW >>>> Info: 0 VDW_PAIRS >>>> Info: **************************** >>>> Info: STRUCTURE SUMMARY: >>>> Info: 66 ATOMS >>>> Info: 65 BONDS >>>> Info: 96 ANGLES >>>> Info: 31 DIHEDRALS >>>> Info: 32 IMPROPERS >>>> Info: 0 CROSSTERMS >>>> Info: 0 EXCLUSIONS >>>> Info: 195 DEGREES OF FREEDOM >>>> Info: 55 HYDROGEN GROUPS >>>> Info: TOTAL MASS = 783.886 amu >>>> Info: TOTAL CHARGE = 8.19564e-08 e >>>> Info: ***************************** >>>> Info: Entering startup phase 0 with 8208 kB of memory in use. >>>> Info: Entering startup phase 1 with 8208 kB of memory in use. >>>> Info: Entering startup phase 2 with 8208 kB of memory in use. >>>> Info: Entering startup phase 3 with 8208 kB of memory in use. >>>> Info: PATCH GRID IS 1 BY 1 BY 1 >>>> Info: REMOVING COM VELOCITY 0 0 0 >>>> Info: LARGEST PATCH (0) HAS 66 ATOMS >>>> Info: CREATING 11 COMPUTE OBJECTS >>>> Info: Entering startup phase 4 with 8208 kB of memory in use. >>>> Info: Entering startup phase 5 with 8208 kB of memory in use. >>>> Info: Entering startup phase 6 with 8208 kB of memory in use. >>>> Measuring processor speeds... Done. >>>> Info: Entering startup phase 7 with 8208 kB of memory in use. 
>>>> Info: CREATING 11 COMPUTE OBJECTS >>>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625 >>>> Info: NONBONDED TABLE SIZE: 705 POINTS >>>> Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 3.38813e-21 AT 7.99609 >>>> Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.27241e-16 AT 7.99609 >>>> Info: ABSOLUTE IMPRECISION IN FAST TABLE FORCE: 6.77626e-21 AT 7.99609 >>>> Info: RELATIVE IMPRECISION IN FAST TABLE FORCE: 1.1972e-16 AT 7.99609 >>>> Info: Entering startup phase 8 with 8208 kB of memory in use. >>>> Info: Finished startup with 8208 kB of memory in use. >>>> ETITLE: TS BOND ANGLE DIHED >>>> IMPRP ELECT VDW BOUNDARY >>>> MISC KINETIC TOTAL TEMP >>>> TOTAL2 TOTAL3 TEMPAVG >>>> >>>> ENERGY: 0 0.0050 0.4192 0.0368 >>>> 0.4591 -210.1610 1.0506 0.0000 >>>> 0.0000 0.0000 -208.1904 0.0000 >>>> -208.1877 -208.1877 0.0000 >>>> >>>> ENERGY: 1 0.0051 0.4196 0.0367 >>>> 0.4585 -210.1611 1.0184 0.0000 >>>> 0.0000 0.0325 -208.1905 0.1675 >>>> -208.1878 -208.1877 0.1675 >>>> >>>> ENERGY: 2 0.0058 0.4208 0.0365 >>>> 0.4568 -210.1610 0.9219 0.0000 >>>> 0.0000 0.1285 -208.1907 0.6632 >>>> -208.1881 -208.1877 0.6632 >>>> >>>> ENERGY: 3 0.0092 0.4232 0.0361 >>>> 0.4542 -210.1599 0.7617 0.0000 >>>> 0.0000 0.2845 -208.1910 1.4683 >>>> -208.1885 -208.1878 1.4683 >>>> >>>> ENERGY: 4 0.0176 0.4269 0.0356 >>>> 0.4511 -210.1565 0.5386 0.0000 >>>> 0.0000 0.4952 -208.1914 2.5561 >>>> -208.1890 -208.1878 2.5561 >>>> >>>> ENERGY: 5 0.0327 0.4327 0.0350 >>>> 0.4480 -210.1489 0.2537 0.0000 >>>> 0.0000 0.7552 -208.1917 3.8977 >>>> -208.1894 -208.1879 3.8977 >>>> >>>> ENERGY: 6 0.0552 0.4409 0.0343 >>>> 0.4454 -210.1354 -0.0915 0.0000 >>>> 0.0000 1.0592 -208.1920 5.4666 >>>> -208.1898 -208.1880 5.4666 >>>> >>>> ENERGY: 7 0.0839 0.4522 0.0334 >>>> 0.4440 -210.1137 -0.4951 0.0000 >>>> 0.0000 1.4031 -208.1922 7.2418 >>>> -208.1900 -208.1882 7.2418 >>>> >>>> ENERGY: 8 0.1162 0.4674 0.0325 >>>> 0.4448 -210.0822 -0.9550 0.0000 >>>> 0.0000 1.7839 -208.1923 9.2074 >>>> -208.1902 -208.1883 9.2074 >>>> 
>>>> ENERGY: 9 0.1492 0.4870 0.0315 >>>> 0.4485 -210.0391 -1.4687 0.0000 >>>> 0.0000 2.1990 -208.1925 11.3497 >>>> -208.1905 -208.1884 11.3497 >>>> >>>> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 9 >>>> WRITING COORDINATES TO OUTPUT FILE AT STEP 9 >>>> WRITING VELOCITIES TO OUTPUT FILE AT STEP 9 >>>> ========================================== >>>> WallClock: 4.172574 CPUTime: 4.167367 Memory: 8208 kB >>>> End of program >>>> >>>> >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: mvapich-discuss-bounces@cse.ohio-state.edu >>>> [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Vlad >>>> Cojocaru >>>> Sent: August-14-08 11:32 AM >>>> To: mvapich-discuss@cse.ohio-state.edu >>>> Subject: [mvapich-discuss] compile charm++ and namd with mvapich >>>> 1.0.1and/or mvapich2 >>>> >>>> Dear mvapich users, >>>> >>>> I tried to compile mvapich1.0.1, charm++ and namd on our new Linux-amd64 >>>> infiniband cluster using the intel 10.1.015 compilers. With >>>> mvapich1.0.1, I managed to build mvapich1.0.1, tested the programs in >>>> the /examples directory. Then, I bult charm++ and tested it with >>>> "mpirun_rsh -n 2" .. All tests passed correctly. Then I built namd on >>>> top of mvapich1.0.1 and charm, >>>> >>>> Everything seemed ok only that the namd executable hangs without error >>>> messages. In fact it appears as if it still runs but it doesn't produce >>>> any output. If I repeat exactly the same procedure but with openmpi >>>> instead of mvapich, everything works fine ....(however I am not so happy >>>> about the scaling of openmpi on infiniband) >>>> >>>> Does anyone have experience with installing namd using mvapich1.0.1 ? If >>>> yes, any idea why this happens? I must say when I did the same on >>>> another cluster which had mvapich1.0.1 already compiled with the intel >>>> compilers, everything worked out correcltly. So, it must be something >>>> with the compilation of mvapich1.0.1 on our new infiniband setup that >>>> creates the problem. 
>>>> >>>> The german in the error simply says that executable "mpiname was not >>>> found" >>>> >>>> Best wishes >>>> vlad >>>> >>>> ----------------------------------error------------------------------------------------------------------------ >>>> I also tried mvapich2 but the compilation fails when installing the >>>> mpiname application (see error below) which apparently fails to compile >>>> (no executable is found in /env/mpiname dir). However no error messages >>>> are printed by make and the build completes correctly. So I am not sure >>>> why mpiname does not compile and still make install tries to install >>>> it ... >>>> >>>> /usr/bin/install -c mpiname/mpiname >>>> /sw/mcm/app/vlad/mpi/C07/mvapich2/1.2/bin/mpiname >>>> /usr/bin/install: Aufruf von stat f?r ?mpiname/mpiname? nicht m?glich: >>>> Datei oder Verzeichnis nicht gefunden >>>> make[1]: *** [install] Fehler 1 >>>> make[1]: Leaving directory >>>> `/sw/mcm/app/vlad/mpi/C07/mvapich2/1.2-src/src/env' >>>> make: *** [install] Fehler 2 >>>> >>>> -- >>>> ---------------------------------------------------------------------------- >>>> Dr. Vlad Cojocaru >>>> >>>> EML Research gGmbH >>>> Schloss-Wolfsbrunnenweg 33 >>>> 69118 Heidelberg >>>> >>>> Tel: ++49-6221-533266 >>>> Fax: ++49-6221-533298 >>>> >>>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de >>>> >>>> http://projects.villa-bosch.de/mcm/people/cojocaru/ >>>> >>>> ---------------------------------------------------------------------------- >>>> EML Research gGmbH >>>> Amtgericht Mannheim / HRB 337446 >>>> Managing Partner: Dr. h.c. Klaus Tschira >>>> Scientific and Managing Director: Prof. Dr.-Ing. 
Andreas Reuter >>>> http://www.eml-r.org >>>> ---------------------------------------------------------------------------- >>>> >>>> >>>> _______________________________________________ >>>> mvapich-discuss mailing list >>>> mvapich-discuss@cse.ohio-state.edu >>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>>> >>>> >>> -- >>> ---------------------------------------------------------------------------- >>> Dr. Vlad Cojocaru >>> >>> EML Research gGmbH >>> Schloss-Wolfsbrunnenweg 33 >>> 69118 Heidelberg >>> >>> Tel: ++49-6221-533266 >>> Fax: ++49-6221-533298 >>> >>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de >>> >>> http://projects.villa-bosch.de/mcm/people/cojocaru/ >>> >>> ---------------------------------------------------------------------------- >>> EML Research gGmbH >>> Amtgericht Mannheim / HRB 337446 >>> Managing Partner: Dr. h.c. Klaus Tschira >>> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter >>> http://www.eml-r.org >>> ---------------------------------------------------------------------------- >>> >>> >>> >>> >> >> > > -- > ---------------------------------------------------------------------------- > Dr. Vlad Cojocaru > > EML Research gGmbH > Schloss-Wolfsbrunnenweg 33 > 69118 Heidelberg > > Tel: ++49-6221-533266 > Fax: ++49-6221-533298 > > e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de > > http://projects.villa-bosch.de/mcm/people/cojocaru/ > > ---------------------------------------------------------------------------- > EML Research gGmbH > Amtgericht Mannheim / HRB 337446 > Managing Partner: Dr. h.c. Klaus Tschira > Scientific and Managing Director: Prof. Dr.-Ing. 
Andreas Reuter
> http://www.eml-r.org
> ----------------------------------------------------------------------------

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From Vlad.Cojocaru at eml-r.villa-bosch.de Fri Aug 15 10:24:41 2008
From: Vlad.Cojocaru at eml-r.villa-bosch.de (Vlad Cojocaru)
Date: Fri Aug 15 10:25:22 2008
Subject: [mvapich-discuss] compile charm++ and namd with mvapich 1.0.1and/or mvapich2
In-Reply-To: <20080815131812.GG24018@cse.ohio-state.edu>
References: <48A579AA.9090304@eml-r.villa-bosch.de> <20080815131812.GG24018@cse.ohio-state.edu>
Message-ID: <48A591A9.4070201@eml-r.villa-bosch.de>

Dear Jonathan,

I did manage to build mvapich2 in the end, but only after I substituted those paths with ${master_top_srcdir}. Unfortunately I no longer have the log files of the failed builds, but they just said that "/home/7/curtisbr directory was not found" and then the build exited immediately with an error. If you want, I could repeat the build without replacing those lines and send you the log files.

Best
vlad

Jonathan Perkins wrote:
> Vlad:
> Can you please send us your config.log and make.log files. Those
> references to /home/7/curtisbr should only be present to create the
> Makefile.in files that are already present in the tarball. Also, I do
> not know why mpiname is failing to build for you but it should not
> affect the functionality of any mpi programs.
>
> On Fri, Aug 15, 2008 at 02:42:18PM +0200, Vlad Cojocaru wrote:
>
>> Dear Dhabaleswar, Dear Mehdi,
>>
>> Thanks a lot for your answers. I do have the user guide in front of my
>> eyes but I still cannot explain all the references to
>> /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 in the Makefile.in files
>> in the source tree ....
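
The substitution described in this message could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the whole hard-coded prefix /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1 maps to ${master_top_srcdir} in every Makefile.in, and the function name replace_srcdir_refs and the throwaway demo tree are hypothetical, not part of the mvapich2 tarball or its documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of mvapich2): replace a hard-coded build
# prefix with ${master_top_srcdir} in every Makefile.in below a directory.
replace_srcdir_refs() {
    tree=$1        # root of the source tree to patch
    old_prefix=$2  # hard-coded path left over from the packager's machine

    # Find the Makefile.in files that mention the prefix (fixed-string match).
    find "$tree" -name Makefile.in -exec grep -lF -- "$old_prefix" {} + |
    while IFS= read -r f; do
        # Keep a .bak copy; sed reads the prefix as a regex, which is
        # harmless for a plain path like this one.
        sed -i.bak "s|$old_prefix|\${master_top_srcdir}|g" "$f"
        echo "patched: $f"
    done
}

# Demo on a throwaway tree rather than a real source tree.
demo=$(mktemp -d)
mkdir -p "$demo/src/env"
printf 'top = /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1\n' \
    > "$demo/src/env/Makefile.in"
replace_srcdir_refs "$demo" /home/7/curtisbr/svn/mvapich2/mvapich2-1.2rc1
cat "$demo/src/env/Makefile.in"
rm -rf "$demo"
```

Running it on a copy of the tree first, and diffing each Makefile.in against its .bak file, keeps the change easy to review before configure and make are rerun.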
>>>> -- >>>> ---------------------------------------------------------------------------- >>>> Dr.
Vlad Cojocaru >>>> >>>> EML Research gGmbH >>>> Schloss-Wolfsbrunnenweg 33 >>>> 69118 Heidelberg >>>> >>>> Tel: ++49-6221-533266 >>>> Fax: ++49-6221-533298 >>>> >>>> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de >>>> >>>> http://projects.villa-bosch.de/mcm/people/cojocaru/ >>>> >>>> ---------------------------------------------------------------------------- >>>> EML Research gGmbH >>>> Amtgericht Mannheim / HRB 337446 >>>> Managing Partner: Dr. h.c. Klaus Tschira >>>> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter >>>> http://www.eml-r.org >>>> ---------------------------------------------------------------------------- >>>> >>>> >>>> >>>> >>>> >>> >>> >> -- >> ---------------------------------------------------------------------------- >> Dr. Vlad Cojocaru >> >> EML Research gGmbH >> Schloss-Wolfsbrunnenweg 33 >> 69118 Heidelberg >> >> Tel: ++49-6221-533266 >> Fax: ++49-6221-533298 >> >> e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de >> >> http://projects.villa-bosch.de/mcm/people/cojocaru/ >> >> ---------------------------------------------------------------------------- >> EML Research gGmbH >> Amtgericht Mannheim / HRB 337446 >> Managing Partner: Dr. h.c. Klaus Tschira >> Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter >> http://www.eml-r.org >> ---------------------------------------------------------------------------- >> >> >> > > >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > > > -- ---------------------------------------------------------------------------- Dr. 
Vlad Cojocaru EML Research gGmbH Schloss-Wolfsbrunnenweg 33 69118 Heidelberg Tel: ++49-6221-533266 Fax: ++49-6221-533298 e-mail:Vlad.Cojocaru[at]eml-r.villa-bosch.de http://projects.villa-bosch.de/mcm/people/cojocaru/ ---------------------------------------------------------------------------- EML Research gGmbH Amtgericht Mannheim / HRB 337446 Managing Partner: Dr. h.c. Klaus Tschira Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter http://www.eml-r.org ---------------------------------------------------------------------------- From forum.san at gmail.com Mon Aug 18 02:36:24 2008 From: forum.san at gmail.com (Sangamesh B) Date: Mon Aug 18 02:36:36 2008 Subject: [mvapich-discuss] problems in executing higher number process job Message-ID: Dear all, Problem No 1: Application: GROMACS 3.3.3 Parallel Library: MVAPICH2-1.0.3 Compilers: Intel C++ and Fortran 10 A parallel Gromacs-3.3.3(C application) 32 core job runs successfully on a Rocks 4.3, 33 node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores ). But if I submit same job for 64 or higher no of processes, it comes without doing anything. This is my command line: grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run Problem No 2: Application: NAMD 2.6 Parallel Library: MVAPICH2-1.0.3 Compilers: Intel C++ and Fortran 10 I built successfully charm++ with mvapich2 and intel compilers, and then compiled NAMD2. The test examples given in the NAMD distribution works fine. With the following input file( This input file is the one which is used in the NAMD website, for benchmarking. It runs/scales upto 252 processes as mentioned in NAMD website). But in my case it runs only for 8 process, 16 process, 32 process, 64 processes. But when a 128 core job submitted, it doesn't run at all. The following is the command and error. 
#mpirun -machinefile ./machfile -np 128 /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee namd_128cores Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 rank 65 in job 4 master_host_name_50238 caused collective abort of all ranks exit status of rank 65: killed by signal 9 So, in further, I built charmc with network version of charm++ library without using mvapich2. Now it works for any number process job. So, for the above two problems, I guess there is some thing problem with mvapich2 itself. Is there a solution for it? Regards, Sangamesh -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080818/a3157798/attachment.html From panda at cse.ohio-state.edu Mon Aug 18 09:06:00 2008 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Aug 18 09:06:12 2008 Subject: [mvapich-discuss] problems in executing higher number process job In-Reply-To: Message-ID: Sangamesh, Some of your earlier queries were for the uDAPL interface of MVAPICH2 running on your customized adapter. Do these problems occur on the same environment/interface? Since MVAPICH2 supports multiple interfaces, it will be good if you can indicate which interface of MVAPICH2 you are using here. DK On Mon, 18 Aug 2008, Sangamesh B wrote: > Dear all, > > Problem No 1: > > Application: GROMACS 3.3.3 > > Parallel Library: MVAPICH2-1.0.3 > > Compilers: Intel C++ and Fortran 10 > > A parallel Gromacs-3.3.3(C application) 32 core job runs successfully on a > Rocks 4.3, 33 > node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores ). > > But if I submit same job for 64 or higher no of processes, it comes without > doing > anything. 
> > This is my command line: > > grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr > mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run > > > > Problem No 2: > > Application: NAMD 2.6 > > Parallel Library: MVAPICH2-1.0.3 > > Compilers: Intel C++ and Fortran 10 > > I built successfully charm++ with mvapich2 and intel compilers, and then > compiled NAMD2. > > The test examples given in the NAMD distribution works fine. > > With the following input file( This input file is the one which is used in > the NAMD website, for benchmarking. It runs/scales upto 252 processes as > mentioned in NAMD website). But in my case it runs only for 8 process, 16 > process, 32 process, 64 processes. > > But when a 128 core job submitted, it doesn't run at all. The following is > the command and error. > > #mpirun -machinefile ./machfile -np 128 > /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee > namd_128cores > Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 > rank 65 in job 4 master_host_name_50238 caused collective abort of all > ranks > exit status of rank 65: killed by signal 9 > > > So, in further, I built charmc with network version of charm++ library > without using mvapich2. Now it works for any number process job. > > So, for the above two problems, I guess there is some thing problem with > mvapich2 itself. Is there a solution for it? > > > Regards, > Sangamesh > From jbernstein at penguincomputing.com Mon Aug 18 20:21:44 2008 From: jbernstein at penguincomputing.com (Joshua Bernstein) Date: Mon Aug 18 20:21:59 2008 Subject: [mvapich-discuss] problems in executing higher number process job In-Reply-To: References: Message-ID: <48AA1218.202@penguincomputing.com> Agreed, Generally the "OpenIB" transport provides for greater startup and reliability over large number of cores, so if you are using uDAPL, I would suggest giving openib a shot. 
-Joshua Bernstein Software Engineer Penguin Computing Dhabaleswar Panda wrote: > Sangamesh, > > Some of your earlier queries were for the uDAPL interface of MVAPICH2 > running on your customized adapter. Do these problems occur on the same > environment/interface? Since MVAPICH2 supports multiple interfaces, it > will be good if you can indicate which interface of MVAPICH2 you are using > here. > > DK > > On Mon, 18 Aug 2008, Sangamesh B wrote: > >> Dear all, >> >> Problem No 1: >> >> Application: GROMACS 3.3.3 >> >> Parallel Library: MVAPICH2-1.0.3 >> >> Compilers: Intel C++ and Fortran 10 >> >> A parallel Gromacs-3.3.3(C application) 32 core job runs successfully on a >> Rocks 4.3, 33 >> node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores ). >> >> But if I submit same job for 64 or higher no of processes, it comes without >> doing >> anything. >> >> This is my command line: >> >> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr >> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run >> >> >> >> Problem No 2: >> >> Application: NAMD 2.6 >> >> Parallel Library: MVAPICH2-1.0.3 >> >> Compilers: Intel C++ and Fortran 10 >> >> I built successfully charm++ with mvapich2 and intel compilers, and then >> compiled NAMD2. >> >> The test examples given in the NAMD distribution works fine. >> >> With the following input file( This input file is the one which is used in >> the NAMD website, for benchmarking. It runs/scales upto 252 processes as >> mentioned in NAMD website). But in my case it runs only for 8 process, 16 >> process, 32 process, 64 processes. >> >> But when a 128 core job submitted, it doesn't run at all. The following is >> the command and error. 
>> >> #mpirun -machinefile ./machfile -np 128 >> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee >> namd_128cores >> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 >> rank 65 in job 4 master_host_name_50238 caused collective abort of all >> ranks >> exit status of rank 65: killed by signal 9 >> >> >> So, in further, I built charmc with network version of charm++ library >> without using mvapich2. Now it works for any number process job. >> >> So, for the above two problems, I guess there is some thing problem with >> mvapich2 itself. Is there a solution for it? >> >> >> Regards, >> Sangamesh >> > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From forum.san at gmail.com Tue Aug 19 00:14:53 2008 From: forum.san at gmail.com (Sangamesh B) Date: Tue Aug 19 00:15:07 2008 Subject: [mvapich-discuss] problems in executing higher number process job In-Reply-To: <48AA1218.202@penguincomputing.com> References: <48AA1218.202@penguincomputing.com> Message-ID: Hi DK Sir, I'm using OpenIB. MVAPICH2 is built with OFED-1.3 and Intel compilers. This is the new cluster we built recently. The environment is different from the earlier. But earlier also we built mvapich2 for OFA interface only. We've used make.mvapich2.ofa for installation. This will not install uDAPL stack right? Thank you, Sangamesh On Tue, Aug 19, 2008 at 5:51 AM, Joshua Bernstein < jbernstein@penguincomputing.com> wrote: > Agreed, > > Generally the "OpenIB" transport provides for greater startup and > reliability over large number of cores, so if you are using uDAPL, I would > suggest giving openib a shot. > > -Joshua Bernstein > Software Engineer > Penguin Computing > > Dhabaleswar Panda wrote: > >> Sangamesh, >> >> Some of your earlier queries were for the uDAPL interface of MVAPICH2 >> running on your customized adapter. 
Do these problems occur on the same >> environment/interface? Since MVAPICH2 supports multiple interfaces, it >> will be good if you can indicate which interface of MVAPICH2 you are using >> here. >> >> DK >> >> On Mon, 18 Aug 2008, Sangamesh B wrote: >> >> Dear all, >>> >>> Problem No 1: >>> >>> Application: GROMACS 3.3.3 >>> >>> Parallel Library: MVAPICH2-1.0.3 >>> >>> Compilers: Intel C++ and Fortran 10 >>> >>> A parallel Gromacs-3.3.3(C application) 32 core job runs successfully on >>> a >>> Rocks 4.3, 33 >>> node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores ). >>> >>> But if I submit same job for 64 or higher no of processes, it comes >>> without >>> doing >>> anything. >>> >>> This is my command line: >>> >>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr >>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run >>> >>> >>> >>> Problem No 2: >>> >>> Application: NAMD 2.6 >>> >>> Parallel Library: MVAPICH2-1.0.3 >>> >>> Compilers: Intel C++ and Fortran 10 >>> >>> I built successfully charm++ with mvapich2 and intel compilers, and then >>> compiled NAMD2. >>> >>> The test examples given in the NAMD distribution works fine. >>> >>> With the following input file( This input file is the one which is used >>> in >>> the NAMD website, for benchmarking. It runs/scales upto 252 processes as >>> mentioned in NAMD website). But in my case it runs only for 8 process, 16 >>> process, 32 process, 64 processes. >>> >>> But when a 128 core job submitted, it doesn't run at all. The following >>> is >>> the command and error. 
>>> >>> #mpirun -machinefile ./machfile -np 128 >>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee >>> namd_128cores >>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 >>> rank 65 in job 4 master_host_name_50238 caused collective abort of all >>> ranks >>> exit status of rank 65: killed by signal 9 >>> >>> >>> So, in further, I built charmc with network version of charm++ library >>> without using mvapich2. Now it works for any number process job. >>> >>> So, for the above two problems, I guess there is some thing problem with >>> mvapich2 itself. Is there a solution for it? >>> >>> >>> Regards, >>> Sangamesh >>> >>> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080819/7f117cc4/attachment-0001.html From koop at cse.ohio-state.edu Tue Aug 19 14:56:47 2008 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue Aug 19 14:56:59 2008 Subject: [mvapich-discuss] problems in executing higher number process job In-Reply-To: Message-ID: Sangamesh, I'm not sure what your issue here is, however, we have run each of these sets of software in the past without any problem. I just re-verified again that NAMD works fine with that version of MVAPICH2 and compilers at 128 processes and above. Can you give your parameters that you used for building Charm++? 
(conv-mach.sh)

I've posted this in the past as a guide for MVAPICH:

cd charm-5.9
cd ./src/arch

cp -r mpi-linux-amd64 mpi-linux-amd64-mvapich
cd mpi-linux-amd64-mvapich

* edit conv-mach.h and change:

#define CMK_MALLOC_USE_GNU_MALLOC 1
#define CMK_MALLOC_USE_OS_BUILTIN 0

to

#define CMK_MALLOC_USE_GNU_MALLOC 0
#define CMK_MALLOC_USE_OS_BUILTIN 1

* make sure the MVAPICH mpicc and mpiCC are first in your path. Otherwise,
add the full path to the mpicc and mpiCC commands in conv-mach.sh

cd ../../..

./build charm++ mpi-linux-amd64-mvapich --no-build-shared

You may need to change mpiCC to mpicxx in the conv-mach.sh in
charm-5.9/src/arch/mpi-linux-amd64-mvapich

Matt

On Tue, 19 Aug 2008, Sangamesh B wrote:

> Hi DK Sir,
>
> I'm using OpenIB. MVAPICH2 is built with OFED-1.3 and Intel compilers.
>
> This is the new cluster we built recently. The environment is different from
> the earlier. But earlier also we built mvapich2 for OFA interface only.
>
> We've used make.mvapich2.ofa for installation. This will not install uDAPL
> stack right?
>
> Thank you,
> Sangamesh
>
> On Tue, Aug 19, 2008 at 5:51 AM, Joshua Bernstein <
> jbernstein@penguincomputing.com> wrote:
>
> > Agreed,
> >
> > Generally the "OpenIB" transport provides for greater startup and
> > reliability over large number of cores, so if you are using uDAPL, I would
> > suggest giving openib a shot.
> >
> > -Joshua Bernstein
> > Software Engineer
> > Penguin Computing
> >
> > Dhabaleswar Panda wrote:
> >
> >> Sangamesh,
> >>
> >> Some of your earlier queries were for the uDAPL interface of MVAPICH2
> >> running on your customized adapter. Do these problems occur on the same
> >> environment/interface? Since MVAPICH2 supports multiple interfaces, it
> >> will be good if you can indicate which interface of MVAPICH2 you are using
> >> here.
> >> > >> DK > >> > >> On Mon, 18 Aug 2008, Sangamesh B wrote: > >> > >> Dear all, > >>> > >>> Problem No 1: > >>> > >>> Application: GROMACS 3.3.3 > >>> > >>> Parallel Library: MVAPICH2-1.0.3 > >>> > >>> Compilers: Intel C++ and Fortran 10 > >>> > >>> A parallel Gromacs-3.3.3(C application) 32 core job runs successfully on > >>> a > >>> Rocks 4.3, 33 > >>> node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores ). > >>> > >>> But if I submit same job for 64 or higher no of processes, it comes > >>> without > >>> doing > >>> anything. > >>> > >>> This is my command line: > >>> > >>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr > >>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run > >>> > >>> > >>> > >>> Problem No 2: > >>> > >>> Application: NAMD 2.6 > >>> > >>> Parallel Library: MVAPICH2-1.0.3 > >>> > >>> Compilers: Intel C++ and Fortran 10 > >>> > >>> I built successfully charm++ with mvapich2 and intel compilers, and then > >>> compiled NAMD2. > >>> > >>> The test examples given in the NAMD distribution works fine. > >>> > >>> With the following input file( This input file is the one which is used > >>> in > >>> the NAMD website, for benchmarking. It runs/scales upto 252 processes as > >>> mentioned in NAMD website). But in my case it runs only for 8 process, 16 > >>> process, 32 process, 64 processes. > >>> > >>> But when a 128 core job submitted, it doesn't run at all. The following > >>> is > >>> the command and error. > >>> > >>> #mpirun -machinefile ./machfile -np 128 > >>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee > >>> namd_128cores > >>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 > >>> rank 65 in job 4 master_host_name_50238 caused collective abort of all > >>> ranks > >>> exit status of rank 65: killed by signal 9 > >>> > >>> > >>> So, in further, I built charmc with network version of charm++ library > >>> without using mvapich2. 
Now it works for any number process job.
> >>>
> >>> So, for the above two problems, I guess there is some thing problem with
> >>> mvapich2 itself. Is there a solution for it?
> >>>
> >>>
> >>> Regards,
> >>> Sangamesh
> >>>
> >>>
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss@cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
>

From twcroc at wm.edu Tue Aug 19 19:19:31 2008
From: twcroc at wm.edu (Tom Crockett)
Date: Tue Aug 19 22:34:20 2008
Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh
Message-ID: <48AB5503.5030503@wm.edu>

Hi,

I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been
experimenting with the new mpirun_rsh job launcher. In general, I much
prefer this simpler approach, and have found it to be faster and more
reliable than MPD. However, I'm having one fairly serious problem
relating to termination detection when processes abort.

Here's the scenario:

1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically
   with multiple processes per node (multi-process, multi-core).

2. One process dies, e.g., with a segmentation violation, on some random
   node.

3. The node with the offending process seems to notice this locally; all
   the sibling processes and the local mpispawn process terminate.
   However, the remaining nodes (including the master) don't seem to
   notice; their processes continue to run (or more likely stall, waiting
   on communication which will never arrive).

If I run this experiment on two nodes (for example) and look at the
process state on the master node before the process dies on the remote
node, I see two sets of "rsh" processes, with one active process and one
defunct process in each set. "ps" shows that each defunct "rsh" is a
child of an active process.
Following abnormal process termination on the remote node, there will be
only one active rsh process and one defunct rsh process, confirming that
the remote processes have cleaned up and exited. So it seems that
mpirun_rsh is not responding properly to the death of a child process.

Here's a concrete example showing the process state on the master node
following termination of the processes on the remote node:

11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
USER       PID  PPID S  NI     VSZ   RSS %MEM     TIME COMMAND
tom       6218 14345 S   0    9000  1984  0.0 00:00:00 tcsh
tom       6219  6218 S   0    1772   428  0.0 00:00:00 pbs_demu
tom       6251  6218 S   0    9368  1588  0.0 00:00:00 28027.ty
tom       6252  6251 S   0   12216  3072  0.0 00:00:00 pbsmvp2
tom       6257  6252 S   0    5288   676  0.0 00:00:00 mpirun_r
tom       6258  6257 S   0    6396   692  0.0 00:00:00 rsh
tom       6261  6260 S   0    9784  2096  0.0 00:00:00 tcsh
tom       6262  6258 Z   0       0     0  0.0 00:00:00 rsh
tom       6307  6261 S   0    5492   712  0.0 00:00:00 mpispawn
tom       6308  6307 R   0 8038032 19576  0.2 00:04:48 rand4
tom       6309  6307 R   0 8038036 14416  0.1 00:05:06 rand4
tom       6310  6307 R   0 8037904 14264  0.1 00:05:07 rand4
tom       6311  6307 R   0 8038032 14328  0.1 00:05:06 rand4

Interestingly, whether the master node detects the remote process
termination seems to depend on how the remote process dies. If I hit the
remote process with a SIGTERM, mpirun_rsh seems to notice and things get
cleaned up after a minute or two. If it terminates with something else
(e.g., a SIGSEGV), the job will sit there forever.

Finally, it's not just remote nodes that suffer from this problem. The
behavior is the same if it's a local process on the master node that
aborts -- the local rsh and its descendants disappear, but mpirun_rsh and
processes on remote nodes persist.
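
For anyone who wants to spot this state without reading the listing by eye, here is a minimal sketch. It is not part of MVAPICH2 or mpirun_rsh, and the function name find_defunct is made up for illustration. It assumes a ps listing in the same column order as the one above (user pid ppid s nice vsz rss pmem time fname) and prints the PID, parent PID and command name of every process in state Z (defunct).

```shell
#!/bin/sh
# Hypothetical helper, not part of MVAPICH2 or mpirun_rsh.
# Reads a ps listing on stdin whose columns are
#   user pid ppid s nice vsz rss pmem time fname
# and prints "pid ppid command" for every process in state Z (defunct).
# The numeric test on column 2 skips the header line.
find_defunct() {
    awk '$2 ~ /^[0-9]+$/ && $4 == "Z" { print $2, $3, $NF }'
}

# Example use on the master node, with the same -o columns as above:
#   /bin/ps -u "$USER" -o 'user pid ppid s nice vsz rss pmem time fname' | find_defunct
```

Fed the listing above, this reports only the defunct rsh (PID 6262, parent 6258). It only shows which children are waiting to be reaped; it does not explain why mpirun_rsh fails to react to them.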
Now for a few more specifics about our environment:

OS: SuSE Linux Enterprise Server 10 SP1
Compiler: PGI 7.1-4
InfiniBand: OFED 1.3
Scheduler: TORQUE 2.2.1
Hardware Platform: Dell SC1435 (Opteron 2218)

Eventually, of course, the job scheduler will time out the job and kill
the master mpirun_rsh process, which seems to clean everything up OK.
(In general, top-down kills by the scheduler seem to work fine. It's
bottom-up termination that's problematic.) But much of our workload has
very long runtimes (on the order of days to weeks), and my users don't
want to wait that long only to find out that their job actually bombed
with a segfault several days earlier.

Any thoughts on what might be causing this and how to fix it?

-Tom

--
Tom Crockett

College of William and Mary            email: twcroc@wm.edu
IT/High Performance Computing Group    phone: (757) 221-2762
Savage House                           fax:   (757) 221-2023
P.O. Box 8795
Williamsburg, VA 23187-8795

From forum.san at gmail.com Wed Aug 20 03:24:18 2008
From: forum.san at gmail.com (Sangamesh B)
Date: Wed Aug 20 03:24:36 2008
Subject: [mvapich-discuss] problems in executing higher number process job
In-Reply-To:
References:
Message-ID:

Dear Matthew,

While installing charm++, I referred to your mvapich-charm-namd post.
The charm++ configuration files look as follows:

*****************************************************************************/

#ifndef _CONV_MACH_H
#define _CONV_MACH_H

#define CMK_AMD64 1
#define CMK_CONVERSE_MPI 1

#define CMK_DEFAULT_MAIN_USES_COMMON_CODE 1

#define CMK_GETPAGESIZE_AVAILABLE 1

#define CMK_IS_HETERO 0

#define CMK_MALLOC_USE_GNU_MALLOC 0
#define CMK_MALLOC_USE_OS_BUILTIN 1

#define CMK_MEMORY_PAGESIZE 8192
#define CMK_MEMORY_PROTECTABLE 1

#define CMK_NODE_QUEUE_AVAILABLE 0

#define CMK_SHARED_VARS_EXEMPLAR 0
#define CMK_SHARED_VARS_UNAVAILABLE 1
#define CMK_SHARED_VARS_UNIPROCESSOR 0

#define CMK_SIGNAL_NOT_NEEDED 0
#define CMK_SIGNAL_USE_SIGACTION 0
#define CMK_SIGNAL_USE_SIGACTION_WITH_RESTART 1

#define CMK_THREADS_REQUIRE_NO_CPV 0

#define CMK_TIMER_USE_GETRUSAGE 0
#define CMK_TIMER_USE_SPECIAL 1
#define CMK_TIMER_USE_TIMES 0
#define CMK_TIMER_USE_RDTSC 0

#define CMK_THREADS_USE_CONTEXT 1
#define CMK_THREADS_USE_PTHREADS 0

#define CMK_TYPEDEF_INT2 short
#define CMK_TYPEDEF_INT4 int
#define CMK_TYPEDEF_INT8 long long
#define CMK_TYPEDEF_UINT2 unsigned short
#define CMK_TYPEDEF_UINT4 unsigned int
#define CMK_TYPEDEF_UINT8 unsigned long long
#define CMK_TYPEDEF_FLOAT4 float
#define CMK_TYPEDEF_FLOAT8 double

#define CMK_64BIT 1

#define CMK_WHEN_PROCESSOR_IDLE_BUSYWAIT 1
#define CMK_WHEN_PROCESSOR_IDLE_USLEEP 0

#define CMK_WEB_MODE 1
#define CMK_DEBUG_MODE 0

#define CMK_LBDB_ON 1

#endif

# cat src/arch/mpi-linux-x86_64/conv-mach.sh

# user enviorn var: MPICXX and MPICC
# or, use the definition in file $CHARMINC/MPIOPTS
if test -x "$CHARMINC/MPIOPTS"
then
  . $CHARMINC/MPIOPTS
else
  MPICXX_DEF=/data/mvapich2_intel/bin/mpicxx
  MPICC_DEF=/data/mvapich2_intel/bin/mpicc
fi

test -z "$MPICXX" && MPICXX=$MPICXX_DEF
test -z "$MPICC" && MPICC=$MPICC_DEF
test "$MPICXX" != "$MPICXX_DEF" && /bin/rm -f $CHARMINC/MPIOPTS
if test !
-f "$CHARMINC/MPIOPTS"
then
  echo MPICXX_DEF=$MPICXX > $CHARMINC/MPIOPTS
  echo MPICC_DEF=$MPICC >> $CHARMINC/MPIOPTS
  chmod +x $CHARMINC/MPIOPTS
fi

CMK_REAL_COMPILER=`$MPICXX -show 2>/dev/null | cut -d' ' -f1 `
case "$CMK_REAL_COMPILER" in
g++) CMK_AMD64="-m64 -fPIC" ;;
esac

CMK_CPP_CHARM="/lib/cpp -P"
CMK_CPP_C="$MPICC -E"
CMK_CC="$MPICC $CMK_AMD64 "
CMK_CXX="$MPICXX $CMK_AMD64 "
CMK_CXXPP="$MPICXX -E $CMK_AMD64 "

#CMK_SYSLIBS="-lmpich"
CMK_LIBS="-lckqt $CMK_SYSLIBS "
CMK_LD_LIBRARY_PATH="-Wl,-rpath,$CHARMLIBSO/"

CMK_NATIVE_CC="icc $CMK_AMD64 "
CMK_NATIVE_LD="icc $CMK_AMD64 "
CMK_NATIVE_CXX="icpc $CMK_AMD64 "
CMK_NATIVE_LDXX="icpc $CMK_AMD64 "
CMK_NATIVE_LIBS=""

# fortran compiler
CMK_CF77="f77"
CMK_CF90="f90"
CMK_F90LIBS="-L/usr/absoft/lib -L/opt/absoft/lib -lf90math -lfio -lU77 -lf77math "
CMK_F77LIBS="-lg2c "
CMK_F90_USE_MODDIR=1
CMK_F90_MODINC="-p"

CMK_QT='generic64'
CMK_RANLIB="ranlib"

All configurations are fine. I think there is no problem with installation.

Any similar hints for Gromacs?

Thank you,
Sangamesh
Consultant, HPC

On Wed, Aug 20, 2008 at 12:26 AM, Matthew Koop wrote:

> Sangamesh,
>
> I'm not sure what your issue here is, however, we have run each of these
> sets of software in the past without any problem. I just re-verified again
> that NAMD works fine with that version of MVAPICH2 and compilers at 128
> processes and above.
>
> Can you give your parameters that you used for building Charm++?
> (conv_mach.sh)
>
> I've posted this in the past as a guide for MVAPICH:
> cd charm-5.9
> cd ./src/arch
>
> cp -r mpi-linux-amd64 mpi-linux-amd64-mvapich
> cd mpi-linux-amd64-mvapich
>
> * edit conv-mach.h and change:
>
> #define CMK_MALLOC_USE_GNU_MALLOC 1
> #define CMK_MALLOC_USE_OS_BUILTIN 0
>
> to
>
> #define CMK_MALLOC_USE_GNU_MALLOC 0
> #define CMK_MALLOC_USE_OS_BUILTIN 1
>
> * make sure the MVAPICH mpicc and mpiCC are first in your path. Otherwise,
> add the full path to the mpicc and mpiCC commands in conv_mach.sh
>
> cd ../../..
> > ./build charm++ mpi-linux-amd64-mvapich --no-build-shared > > You may need to change mpiCC to mpicxx in the conv_mach.sh in > charm-5.9/src/arch/mpi-linux-amd64-mvapich > > Matt > > On Tue, 19 Aug 2008, Sangamesh B wrote: > > > Hi DK Sir, > > > > I'm using OpenIB. MVAPICH2 is built with OFED-1.3 and Intel > compilers. > > > > This is the new cluster we built recently. The environment is different > from > > the earlier. But earlier also we built mvapich2 for OFA interface only. > > > > We've used make.mvapich2.ofa for installation. This will not install > uDAPL > > stack right? > > > > Thank you, > > Sangamesh > > > > On Tue, Aug 19, 2008 at 5:51 AM, Joshua Bernstein < > > jbernstein@penguincomputing.com> wrote: > > > > > Agreed, > > > > > > Generally the "OpenIB" transport provides for greater startup > and > > > reliability over large number of cores, so if you are using uDAPL, I > would > > > suggest giving openib a shot. > > > > > > -Joshua Bernstein > > > Software Engineer > > > Penguin Computing > > > > > > Dhabaleswar Panda wrote: > > > > > >> Sangamesh, > > >> > > >> Some of your earlier queries were for the uDAPL interface of MVAPICH2 > > >> running on your customized adapter. Do these problems occur on the > same > > >> environment/interface? Since MVAPICH2 supports multiple interfaces, it > > >> will be good if you can indicate which interface of MVAPICH2 you are > using > > >> here. > > >> > > >> DK > > >> > > >> On Mon, 18 Aug 2008, Sangamesh B wrote: > > >> > > >> Dear all, > > >>> > > >>> Problem No 1: > > >>> > > >>> Application: GROMACS 3.3.3 > > >>> > > >>> Parallel Library: MVAPICH2-1.0.3 > > >>> > > >>> Compilers: Intel C++ and Fortran 10 > > >>> > > >>> A parallel Gromacs-3.3.3(C application) 32 core job runs > successfully on > > >>> a > > >>> Rocks 4.3, 33 > > >>> node cluster ( Dual processor, Quad core Intel Xeon: Total 264 cores > ). 
> > >>> > > >>> But if I submit same job for 64 or higher no of processes, it comes > > >>> without > > >>> doing > > >>> anything. > > >>> > > >>> This is my command line: > > >>> > > >>> grompp_mpi -np 64 -f run.mdp -p topol.top -c pr.gro -o run.tpr > > >>> mpirun -machinefile ./machfile1 -np 64 mdrun_mpi -v -deffnm run > > >>> > > >>> > > >>> > > >>> Problem No 2: > > >>> > > >>> Application: NAMD 2.6 > > >>> > > >>> Parallel Library: MVAPICH2-1.0.3 > > >>> > > >>> Compilers: Intel C++ and Fortran 10 > > >>> > > >>> I built successfully charm++ with mvapich2 and intel compilers, and > then > > >>> compiled NAMD2. > > >>> > > >>> The test examples given in the NAMD distribution works fine. > > >>> > > >>> With the following input file( This input file is the one which is > used > > >>> in > > >>> the NAMD website, for benchmarking. It runs/scales upto 252 processes > as > > >>> mentioned in NAMD website). But in my case it runs only for 8 > process, 16 > > >>> process, 32 process, 64 processes. > > >>> > > >>> But when a 128 core job submitted, it doesn't run at all. The > following > > >>> is > > >>> the command and error. > > >>> > > >>> #mpirun -machinefile ./machfile -np 128 > > >>> /data/apps/namd26_mvapich2/Linux-mvapich2/namd2 ./apoa1.namd | tee > > >>> namd_128cores > > >>> Charm++> Running on MPI version: 2.0 multi-thread support: 0/0 > > >>> rank 65 in job 4 master_host_name_50238 caused collective abort of > all > > >>> ranks > > >>> exit status of rank 65: killed by signal 9 > > >>> > > >>> > > >>> So, in further, I built charmc with network version of charm++ > library > > >>> without using mvapich2. Now it works for any number process job. > > >>> > > >>> So, for the above two problems, I guess there is some thing problem > with > > >>> mvapich2 itself. Is there a solution for it? 
> > >>> > > >>> > > >>> Regards, > > >>> Sangamesh > > >>> > > >>> > > >> _______________________________________________ > > >> mvapich-discuss mailing list > > >> mvapich-discuss@cse.ohio-state.edu > > >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > >> > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080820/35ed7632/attachment-0001.html From Fred.Stecher at atk.com Wed Aug 20 11:26:57 2008 From: Fred.Stecher at atk.com (Stecher, Fred) Date: Wed Aug 20 11:27:44 2008 Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh In-Reply-To: <48AB5503.5030503@wm.edu> References: <48AB5503.5030503@wm.edu> Message-ID: Tom, We use MVAPICH-1.0 which comes with mpirun_rsh. It has the same problem and we do not use a scheduler. We have to check the nodes when a run is aborted by the application. For a node that still has processes running even though they should have been aborted, we have to kill one process at a time to clear the node. I would think that this is a known problem and should be corrected soon. Fred -----Original Message----- From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Tom Crockett Sent: Tuesday, August 19, 2008 6:20 PM To: mvapich-discuss@cse.ohio-state.edu Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh Hi, I've recently installed MVAPICH2 1.2rc1 on my cluster, and have been experimenting with the new mpirun_rsh job launcher. In general, I much prefer this simpler approach, and have found it to be faster and more reliable than MPD. However, I'm having one fairly serious problem relating to termination detection when processes abort. Here's the scenario: 1. Launch an MPI job on multiple nodes via "mpirun_rsh -rsh", typically with multiple processes per node (multi-process, multi-core). 2. 
One process dies, e.g., with a segmentation violation, on some random node.

3. The node with the offending process seems to notice this locally; all the sibling processes and the local mpispawn process terminate. However, the remaining nodes (including the master) don't seem to notice; their processes continue to run (or more likely stall, waiting on communication which will never arrive).

If I run this experiment on two nodes (for example) and look at the process state on the master node before the process dies on the remote node, I see two sets of "rsh" processes, with one active process and one defunct process in each set. "ps" shows that each defunct "rsh" is a child of an active process.

Following abnormal process termination on the remote node, there will be only one active rsh process and one defunct rsh process, confirming that the remote processes have cleaned up and exited. So it seems that mpirun_rsh is not responding properly to the death of a child process.

Here's a concrete example showing the process state on the master node following termination of the processes on the remote node:

11 [ty10] /bin/ps -utom -o 'user pid ppid s nice vsz rss pmem time fname'
USER   PID  PPID S NI     VSZ   RSS %MEM     TIME COMMAND
tom   6218 14345 S  0    9000  1984  0.0 00:00:00 tcsh
tom   6219  6218 S  0    1772   428  0.0 00:00:00 pbs_demu
tom   6251  6218 S  0    9368  1588  0.0 00:00:00 28027.ty
tom   6252  6251 S  0   12216  3072  0.0 00:00:00 pbsmvp2
tom   6257  6252 S  0    5288   676  0.0 00:00:00 mpirun_r
tom   6258  6257 S  0    6396   692  0.0 00:00:00 rsh
tom   6261  6260 S  0    9784  2096  0.0 00:00:00 tcsh
tom   6262  6258 Z  0       0     0  0.0 00:00:00 rsh
tom   6307  6261 S  0    5492   712  0.0 00:00:00 mpispawn
tom   6308  6307 R  0 8038032 19576  0.2 00:04:48 rand4
tom   6309  6307 R  0 8038036 14416  0.1 00:05:06 rand4
tom   6310  6307 R  0 8037904 14264  0.1 00:05:07 rand4
tom   6311  6307 R  0 8038032 14328  0.1 00:05:06 rand4

Interestingly, whether the master node detects the remote process termination seems to depend on how the remote process dies.
If I hit the remote process with a SIGTERM, mpirun_rsh seems to notice and things get cleaned up after a minute or two. If it terminates with something else (e.g., a SIGSEGV), the job will sit there forever.

Finally, it's not just remote nodes that suffer from this problem. The behavior is the same if it's a local process on the master node that aborts -- the local rsh and its descendants disappear, but mpirun_rsh and processes on remote nodes persist.

Now for a few more specifics about our environment:

OS: SuSE Linux Enterprise Server 10 SP1
Compiler: PGI 7.1-4
InfiniBand: OFED 1.3
Scheduler: TORQUE 2.2.1
Hardware Platform: Dell SC1435 (Opteron 2218)

Eventually, of course, the job scheduler will timeout the job and kill the master mpirun_rsh process, which seems to clean everything up OK. (In general, top-down kills by the scheduler seem to work fine. It's bottom-up termination that's problematic.) But much of our workload has very long runtimes (on the order of days to weeks), and my users don't want to wait that long only to find out that their job actually bombed with a segfault several days earlier.

Any thoughts on what might be causing this and how to fix it?

-Tom

--
Tom Crockett

College of William and Mary email: twcroc@wm.edu
IT/High Performance Computing Group phone: (757) 221-2762
Savage House fax: (757) 221-2023
P.O.
Box 8795
Williamsburg, VA 23187-8795
_______________________________________________
mvapich-discuss mailing list
mvapich-discuss@cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From twcroc at wm.edu Wed Aug 20 15:42:49 2008
From: twcroc at wm.edu (Tom Crockett)
Date: Wed Aug 20 15:43:04 2008
Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh
In-Reply-To: <48AB5503.5030503@wm.edu>
References: <48AB5503.5030503@wm.edu>
Message-ID: <48AC73B9.4040705@wm.edu>

Tom Crockett wrote:
> Following abnormal process termination on the remote node, there will be
> only one active rsh process and one defunct rsh process, confirming that
> the remote processes have cleaned up and exited. So it seems that
> mpirun_rsh is not responding properly to the death of a child process.

I've been poking around in the source code for mpirun_rsh and mpispawn, and I think I've figured out what the trouble is -- I'm just not sure what to do about it. mpispawn is correctly noticing that one of its children has terminated abnormally, kills off all of its other children, and exits with a non-zero return code. So far, so good.

Unfortunately, rsh (unlike ssh) does not propagate this return code back as its own exit status, and instead exits with a return code of 0. mpirun_rsh incorrectly interprets this to mean that the remote processes have terminated normally. Instead of jumping into its cleanup procedure to kill off the remaining processes in the job, it just sits around waiting for its other children to exit, which they will never do without outside intervention.

One potential workaround would be to use ssh instead of rsh, but we much prefer to use rsh for spawning remote processes in our clusters. There are two main reasons for this: (1) rsh is simpler, faster, easier to configure, and less susceptible to breaking when users customize their personal settings, and (2) as a rule we disallow ssh and rlogin access to our compute nodes so that users will have fewer pathways to circumvent the job scheduler.

So what is really needed is either a custom version of rsh which mirrors the return status of its remote command, or else some other mechanism by which mpispawn can notify mpirun_rsh when something bad happens to one of its children. I'm curious if the former already exists somewhere?

> Interestingly, whether the master node detects the remote process
> termination seems to depend on how the remote process dies. If I hit
> the remote process with a SIGTERM, mpirun_rsh seems to notice and things
> get cleaned up after a minute or two. If it terminates with something
> else (e.g., a SIGSEGV), the job will sit there forever.

I haven't dug deeply into this behavior yet, but I conjecture that the SIGTERM is being caught by the MPI processes and is being handled in MPI-land, whereas most other signals (such as SIGSEGV) are not being trapped at the application level.

-Tom

--
Tom Crockett

College of William and Mary email: twcroc@wm.edu
IT/High Performance Computing Group phone: (757) 221-2762
Savage House fax: (757) 221-2023
P.O. Box 8795
Williamsburg, VA 23187-8795

From panda at cse.ohio-state.edu Wed Aug 20 15:56:52 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Aug 20 15:57:05 2008
Subject: [mvapich-discuss] Process Termination Detection with mpirun_rsh
In-Reply-To: <48AC73B9.4040705@wm.edu>
Message-ID:

Hi Tom and Fred,

Thanks for reporting this issue and also sending us the follow-up comments. We started some internal discussions here this morning and suspected the role of rsh vs. ssh. This seems to be the case here. We will look for a solution which can solve this rsh-related problem. We will get back to you in a few days.
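
Editorial sketch, not part of the original thread: the exit-status problem described above (rsh returning 0 regardless of what the remote command did) is often worked around by having the remote side print its own exit status as the last line of output, which the launcher then parses. The Python below is a minimal illustration of that idea under stated assumptions. It is not how mpirun_rsh or mpispawn are implemented; the marker string and the helper names `wrap_remote_command`, `split_status`, and `run_via` are hypothetical, and it assumes a POSIX `sh` is available on the remote node.

```python
import shlex
import subprocess

# Hypothetical marker string; it is not part of MVAPICH, mpirun_rsh, or rsh.
SENTINEL = "__REMOTE_EXIT_STATUS__="


def wrap_remote_command(command):
    """Build a command line that prints the exit status of `command` last.

    The command runs in a subshell so that an `exit` inside it cannot skip
    the trailing echo. Everything is handed to `sh -c`, which assumes a
    POSIX shell exists remotely even if the login shell is tcsh.
    """
    script = f'( {command} ); echo "{SENTINEL}$?"'
    return "sh -c " + shlex.quote(script)


def split_status(output):
    """Return (status, remaining_output); status is None if no marker is found."""
    lines = output.splitlines()
    for index in range(len(lines) - 1, -1, -1):
        if lines[index].startswith(SENTINEL):
            status = int(lines[index][len(SENTINEL):])
            rest = lines[:index] + lines[index + 1:]
            return status, "\n".join(rest)
    return None, output


def run_via(launcher, command):
    """Run `command` through `launcher`, e.g. ["rsh", "node01"], and recover its status."""
    completed = subprocess.run(
        launcher + [wrap_remote_command(command)],
        capture_output=True,
        text=True,
    )
    return split_status(completed.stdout)


if __name__ == "__main__":
    # Local stand-in for rsh so the sketch runs without a cluster.
    print(run_via(["sh", "-c"], "echo hello; exit 3"))
```

A launcher built this way could treat a non-zero status, or a missing marker, as a reason to start its cleanup path instead of trusting the exit code of rsh itself. For a child killed by a signal, POSIX shells conventionally report a status above 128, so a segfault would still show up as a failure.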
Thanks,

DK

From panda at cse.ohio-state.edu Wed Aug 20 23:59:31 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Aug 20 23:59:43 2008
Subject: [mvapich-discuss] MVAPICH2 1.2RC2 is available
Message-ID:

The MVAPICH team is pleased to announce the availability of MVAPICH2 1.2RC2. The following bugs are fixed in this release.

- Properly handle the scenario in shared memory broadcast code when the datatypes of different processes taking part in broadcast are different.
- Fix a bug in Checkpoint-Restart code to determine whether a connection is a shared memory connection or a network connection.
- Support non-standard path for BLCR header files.
- Increase the maximum heap size to avoid race condition in realloc().
- Use int32_t for rank for larger jobs with 32k processes or more.
- Improve mvapich2-1.2 bandwidth to the same level of mvapich2-1.0.3.
- An error handling patch for uDAPL interface. Thanks to Nilesh Awate for the patch.
- Explicitly set some of the EP attributes when on demand connection is used in uDAPL interface.

MVAPICH2 users are requested to update their installations with this latest release.

Thanks,

The MVAPICH Team

From Terrence.LIAO at total.com Thu Aug 21 17:55:56 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Thu Aug 21 17:56:22 2008
Subject: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
Message-ID:

Dear mvapich,

I got a core dump in MPI_Bcast(buffer, n, MPI_DOUBLE,...) when n is 1024*1024, i.e., an 8MB buffer, with np=8 on 8 compute nodes. I have NO problem when using np = 7. I am using the mvapich-1.0 Feb 28 2008 download on an AMD cluster (quad-core, dual-socket nodes with 16GB memory) with 4xDDR IB. mvapich is built with the pgi 7.1 compiler. Below is the gdb output. Any suggestions on how to fix this problem? Thank you very much. -- Terrence

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 182894245856 (LWP 18383)]
0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
(gdb) where
#0  0x00000036d80723e3 in memcpy () from /lib64/tls/libc.so.6
#1  0x0000000000449c09 in MPID_VIA_self_start (buf=0x2a96546010, len=8388608, src_lrank=0, tag=2, context_id=0, shandle=0x57a1e8) at viasend.c:276
#2  0x000000000044c205 in MPID_IsendContig (comm_ptr=0x5a2060, buf=0x2a96546010, len=8388608, src_lrank=0, tag=2, context_id=0, dest_grank=0, msgrep=MPID_MSGREP_RECEIVER, request=0x57a1e8, error_code=0x7fbfffe66c) at mpid_send.c:84
#3  0x0000000000435cfd in MPID_IsendDatatype (comm_ptr=0x5a2060, buf=0x2a96546010, count=1048576, dtype_ptr=0x56ac60, src_lrank=0, tag=2, context_id=0, dest_grank=0, request=0x57a1e8, error_code=0x7fbfffe66c) at mpid_hsend.c:129
#4  0x0000000000443215 in PMPI_Isend (buf=0x2a96546010, count=1048576, datatype=11, dest=0, tag=2, comm=91, request=0x7fbfffe710) at isend.c:97
#5  0x0000000000444710 in PMPI_Sendrecv (sendbuf=0x2a96546010, sendcount=1048576, sendtype=11, dest=0, sendtag=2, recvbuf=0x2a96d4bc00, recvcount=1048576, recvtype=11, source=0, recvtag=2, comm=91, status=0x7fbfffe820) at sendrecv.c:95
#6  0x000000000041c355 in intra_shmem_Bcast_Large (buffer=0x2a96546010, count=1048576, datatype=0x56ac60, nbytes=8388608, root=0, comm=0x5a2060) at intra_fns_new.c:1704
#7  0x000000000041b6b4 in intra_Bcast_Large (buffer=0x2a96546010, count=1048576, datatype=0x56ac60, nbytes=8388608, root=0, comm=0x5a2060) at intra_fns_new.c:1309
#8  0x000000000041b157 in intra_newBcast (buffer=0x2a96546010, count=1048576, datatype=0x56ac60, root=0, comm=0x5a2060) at intra_fns_new.c:1117
#9  0x0000000000412008 in PMPI_Bcast (buffer=0x2a96546010, count=1048576, datatype=11, root=0, comm=91) at bcast.c:122
#10 0x00000000004042de in main (argc=2, argv=0x7fbfffee98) at large-mpi_bcast_test.c:159
(gdb)

From panda at cse.ohio-state.edu Thu Aug 21 22:01:10 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Aug 21 22:01:22 2008
Subject: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
In-Reply-To:
Message-ID:

Hi Terrence,

Thanks for reporting this problem. After the MVAPICH 1.0 release, we had a bug-fix release of 1.0.1 on 05/30/08. After that, some more fixes have also gone into the 1.0 branch based on the feedback we have received from users.

Here are some check-ins which we believe might be related to the failure symptom you have described.

----------------------------------------------
r2179 | mamidala | 2008-03-04 18:40:24 -0500 (Tue, 04 Mar 2008) | 3 lines
checking in a fix for BLACS seg. fault problem. Problem occurs when application holds onto MPI communicators not freeing immediately
----------------------------------------------
r2783 | kumarra | 2008-06-24 23:11:04 -0400 (Tue, 24 Jun 2008) | 1 line
shared memory bcast buffer overflow.
Reported by David Kewley@Dell.
---------------------------------------------
r2805 | kumarra | 2008-06-30 13:28:54 -0400 (Mon, 30 Jun 2008) | 1 line
Do not try to use shmem broadcast if shmem_bcast shared memory initialization fails
---------------------------------------------

Can you try MVAPICH 1.0.1 release, the bugfix 1.0 branch or the trunk and let us know whether the problem persists. If the problem persists, we will take a look at this issue further.

You can get these latest versions through svn checkout or through tarballs.

FYI, daily tarballs of the 1.0 bugfix branch are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.0/

Similarly, daily tarballs of the trunk are available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich/trunk/

Thanks,

DK

From Terrence.LIAO at total.com Fri Aug 22 11:03:29 2008
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Fri Aug 22 11:03:52 2008
Subject: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
In-Reply-To:
Message-ID:

Hi, DK,

Yes, you are right. Using the new version Aug 21. The MPI_Bcast no longer core dump and can Bcast to the 2GB buffer limit.
I do have another question, How can I extend MPI buffer beyond the 2GB limit?

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com

From koop at cse.ohio-state.edu Sun Aug 24 17:20:32 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Sun Aug 24 17:20:46 2008
Subject: [mvapich-discuss] Help problem MPI_Bcast fails on np=8 with 8MB buffer
In-Reply-To:
Message-ID:

I'm glad the latest version is working for you now. The MPI buffer limit is a well-known issue with MPI. Since the datatype is an 'int' you cannot increase the number of elements. You should be able to use other datatypes to allow a larger buffer though.

Matt

On Fri, 22 Aug 2008 Terrence.LIAO@total.com wrote:

> Hi, DK,
>
> Yes, you are right. Using the new version Aug 21. The MPI_Bcast no
> longer core dump and can Bcast to the 2GB buffer limit.
> I do have another question, How can I extend MPI buffer beyond the 2GB
> limit?
>
> Thank you very much.
>
> -- Terrence

From twcroc at wm.edu Mon Aug 25 16:07:55 2008
From: twcroc at wm.edu (Tom Crockett)
Date: Mon Aug 25 16:08:10 2008
Subject: [mvapich-discuss] MVAPICH 1.2rc2: Wrong Format in maint/Version
Message-ID: <48B3111B.8070704@wm.edu>

Hi,

The contents of the file "maint/Version" in MVAPICH 1.2rc2 do not conform to the format expected by the configure script. This results in the following error messages in the configuration step:

./configure: line 1648: test: : integer expression expected
./configure: line 1649: test: : integer expression expected

Further investigation reveals that the configuration variables V1-V5 are being set to null strings, resulting in an incorrect value for NUMVERSION.

One suggested fix is to change the contents of "maint/Version" to read:

1.2.0rc2

-Tom

--
Tom Crockett

College of William and Mary email: twcroc@wm.edu
IT/High Performance Computing Group phone: (757) 221-2762
Savage House fax: (757) 221-2023
P.O. Box 8795
Williamsburg, VA 23187-8795

From twcroc at wm.edu Mon Aug 25 16:35:20 2008
From: twcroc at wm.edu (Tom Crockett)
Date: Mon Aug 25 16:35:34 2008
Subject: [mvapich-discuss] MVAPICH 1.2rc2: Wrong Format in maint/Version
In-Reply-To: <48B3111B.8070704@wm.edu>
References: <48B3111B.8070704@wm.edu>
Message-ID: <48B31788.5070808@wm.edu>

Tom Crockett wrote:
> The contents of the file "maint/Version" in MVAPICH 1.2rc2 do not
> conform to the format expected by the configure script.

Just to clarify, that should have been "MVAPICH2 1.2rc2". Too many two's for me to keep track of...

-Tom

--
Tom Crockett

College of William and Mary email: twcroc@wm.edu
IT/High Performance Computing Group phone: (757) 221-2762
Savage House fax: (757) 221-2023
P.O.
Box 8795 Williamsburg, VA 23187-8795 From perkinjo at cse.ohio-state.edu Mon Aug 25 17:09:04 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Mon Aug 25 17:10:32 2008 Subject: [mvapich-discuss] MVAPICH 1.2rc2: Wrong Format in maint/Version In-Reply-To: <48B3111B.8070704@wm.edu> References: <48B3111B.8070704@wm.edu> Message-ID: <20080825210903.GM22736@cse.ohio-state.edu> On Mon, Aug 25, 2008 at 04:07:55PM -0400, Tom Crockett wrote: > Hi, > > The contents of the file "maint/Version" in MVAPICH 1.2rc2 do not > conform to the format expected by the configure script. This results in > the following error messages in the configuration step: > > ./configure: line 1648: test: : integer expression expected > ./configure: line 1649: test: : integer expression expected > > Further investigation reveals that the configuration variables V1-V5 are > being set to null strings, resulting in an incorrect value for > NUMVERSION. > > One suggested fix is to change the contents of "maint/Version" to read: > > 1.2.0rc2 Tom: Thanks for reporting this issue. The suggested fix is the correct fix and is now reflected in trunk of our mvapich2 repository. We'll update our release tarball shortly. > > -Tom > > -- > Tom Crockett > > College of William and Mary email: twcroc@wm.edu > IT/High Performance Computing Group phone: (757) 221-2762 > Savage House fax: (757) 221-2023 > P.O. 
Box 8795 > Williamsburg, VA 23187-8795 > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From perkinjo at cse.ohio-state.edu Tue Aug 26 11:00:14 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Tue Aug 26 11:02:06 2008 Subject: [mvapich-discuss] MVAPICH 1.2rc2: Wrong Format in maint/Version In-Reply-To: <20080825210903.GM22736@cse.ohio-state.edu> References: <48B3111B.8070704@wm.edu> <20080825210903.GM22736@cse.ohio-state.edu> Message-ID: <20080826150013.GA2885@cse.ohio-state.edu> On Mon, Aug 25, 2008 at 05:09:04PM -0400, Jonathan Perkins wrote: > On Mon, Aug 25, 2008 at 04:07:55PM -0400, Tom Crockett wrote: > > Hi, > > > > The contents of the file "maint/Version" in MVAPICH 1.2rc2 do not > > conform to the format expected by the configure script. This results in > > the following error messages in the configuration step: > > > > ./configure: line 1648: test: : integer expression expected > > ./configure: line 1649: test: : integer expression expected > > > > Further investigation reveals that the configuration variables V1-V5 are > > being set to null strings, resulting in an incorrect value for > > NUMVERSION. > > > > One suggested fix is to change the contents of "maint/Version" to read: > > > > 1.2.0rc2 > > Tom: > Thanks for reporting this issue. The suggested fix is the correct fix > and is now reflected in trunk of our mvapich2 repository. We'll update > our release tarball shortly. > Our release tarball reflects this change as of yesterday evening. It can be found on our mvapich2 download page at http://mvapich.cse.ohio-state.edu/download/mvapich2/download.shtml Thanks again for pointing this out. 
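The failure mode reported in this thread can be illustrated with a small shell sketch. The actual MVAPICH2 configure logic is not shown in the thread, so split_version and check_field below are hypothetical helpers, and "1.2rc2" is only an example input, not necessarily what maint/Version contained. The sketch assumes a script that splits a version string on dots and then compares each field numerically. A missing field comes out empty, and test rejects an empty operand in a numeric comparison, which is consistent with the "integer expression expected" messages quoted above.

```shell
#!/bin/sh
# Hypothetical sketch; this is NOT the MVAPICH2 configure code.

# Print the first three dot-separated numeric fields of a version string
# as "major|minor|patch". Fields that are absent are printed as empty.
split_version() {
    # Cut the string at the first character that is neither a digit nor a
    # dot, so a suffix such as "rc2" is dropped before splitting.
    numeric=$(printf '%s\n' "$1" | sed 's/[^0-9.].*$//')
    old_ifs=$IFS
    IFS=.
    # Unquoted on purpose: word splitting on "." fills $1, $2 and $3.
    set -- $numeric
    IFS=$old_ifs
    printf '%s|%s|%s\n' "$1" "$2" "$3"
}

# Report whether a field is usable in a numeric test. An empty field makes
# test fail with an error (bash words it as "integer expression expected").
check_field() {
    if test "$1" -ge 0 2>/dev/null; then
        echo ok
    else
        echo bad
    fi
}

split_version 1.2.0rc2   # all three fields are numeric
split_version 1.2rc2     # the third field is empty
check_field ""           # an empty field is rejected
```

In this sketch, a string of the form 1.2.0rc2 yields three numeric fields, while a string with only two numeric components leaves the third field empty and fails the numeric check.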
-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From kernel at tekno-soft.it Fri Aug 29 07:26:54 2008
From: kernel at tekno-soft.it (Roberto Fichera)
Date: Fri Aug 29 07:29:10 2008
Subject: [mvapich-discuss] Races with MPI_THREAD_MULTI
In-Reply-To: <488A1024.60902@tekno-soft.it>
References: <488A1024.60902@tekno-soft.it>
Message-ID: <48B7DCFE.4000400@tekno-soft.it>

Roberto Fichera wrote:

Just an update on this issue: I managed to run this test application
using the HP-MPI (http://www.hp.com/go/mpi) implementation, and it seems
to work as expected. After ~24h of execution it completes its jobs
without any error.

> Dhabaleswar Panda wrote:
>> Hi Roberto,
>>
>> We have done several rounds of checks and do not see any difference
>> between MPICH2 1.0.7 and the TCP/IP interface of MVAPICH2 1.2. Both of these
>> should perform exactly the same. We are continuing our investigation.
>>
>> We are wondering whether you can send us a sample piece of code to reproduce
>> the problem you are seeing across these two interfaces. This will
>> help us debug this problem faster and help you solve your problem.
>>
> I've added other CCs to this email; maybe other people are interested
> in having a look.
>
> Attached you will find the test program I'm working on to reproduce
> the problem. I'm not completely sure that it works perfectly, since I wasn't
> able to complete its execution, so please let me know if I did
> something wrong in the code. The testmaster is quite simple: you
> must provide the number of jobs to simulate (say 50000) and the node
> file that the resource manager provides for its schedule. The node that
> matches the master is excluded from the slave nodes.
>
> The testmain creates a ring of threads from the assigned nodes. Walking
> the ring, a thread is started for each free node it finds, so you should
> have as many threads as there are assigned nodes, all running
> concurrently.
> To simulate work, each thread internally generates a random integer,
> sets some MPI_Info keys (host and pwd), spawns the testslave job, sends it
> the generated random number, and waits for the testslave to receive the
> number and send it back. The sent and received numbers are compared to
> verify that they match. The slave then sends an empty MPI_Send() to
> signal its termination, the thread calls MPI_Comm_disconnect() to close
> the slave connection, and finally all the MPI_Info objects are cleared.
> At this point the thread terminates.
> When the requested number of jobs has been completed, the
> application should terminate ... but without cleaning up (too tired,
> sorry ;-), so it just waits a bit and finalizes MPI.
>
> So far I haven't been able to complete a single run. Currently the
> application still crashes with the backtrace you find below. Only once
> was I able to reach 3500 jobs, but then one thread was stuck on a mutex.
> In the backtrace you can see the same race I'm getting in my
> applications.
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1087666512 (LWP 18231)] > 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > Missing separate debuginfos, use: debuginfo-install glibc.x86_64 > (gdb) info threads > 29 Thread 1121462608 (LWP 18232) 0x0000003465a0a8f9 in > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 > * 28 Thread 1087666512 (LWP 18231) 0x00000000006a3902 in > MPIDI_PG_Dup_vcr () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > 27 Thread 1142442320 (LWP 18230) 0x0000003464ecbd66 in poll () from > /lib64/libc.so.6 > 26 Thread 1098156368 (LWP 18229) 0x0000003464e9ac61 in nanosleep () > from /lib64/libc.so.6 > 1 Thread 140135980537584 (LWP 18029) main (argc=3, > argv=0x7ffffb5992d8) at testmaster.c:437 > > (gdb) bt > #0 0x00000000006a3902 in MPIDI_PG_Dup_vcr () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #1 0x0000000000668012 in SetupNewIntercomm () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #2 0x00000000006682c8 in MPIDI_Comm_accept () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #3 0x00000000006a6617 in MPID_Comm_accept () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #4 0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #5 0x00000000006a17e6 in MPID_Comm_spawn_multiple () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #6 0x00000000006783fd in PMPI_Comm_spawn () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #7 0x00000000004017de 
in NodeThread_threadMain (arg=0x120a790) at > testmaster.c:314 > #8 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 > #9 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 > (gdb) thread 29 > > [Switching to thread 29 (Thread 1121462608 (LWP 18232))]#0 > 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > (gdb) bt > #0 0x0000003465a0a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from > /lib64/libpthread.so.0 > #1 0x000000000065e2e7 in MPIDI_CH3I_Progress () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #2 0x00000000006675ca in FreeNewVC () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #3 0x0000000000668302 in MPIDI_Comm_accept () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #4 0x00000000006a6617 in MPID_Comm_accept () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #5 0x000000000065ec5f in MPIDI_Comm_spawn_multiple () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #6 0x00000000006a17e6 in MPID_Comm_spawn_multiple () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #7 0x00000000006783fd in PMPI_Comm_spawn () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #8 0x00000000004017de in NodeThread_threadMain (arg=0x120d590) at > testmaster.c:314 > #9 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 > #10 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 > (gdb) thread 27 > > [Switching to thread 27 (Thread 1142442320 (LWP 18230))]#0 > 0x0000003464ecbd66 in poll () from /lib64/libc.so.6 > (gdb) bt > #0 0x0000003464ecbd66 in poll () from /lib64/libc.so.6 > #1 0x00000000006d63bf in 
MPIDU_Sock_wait () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #2 0x000000000065e1e7 in MPIDI_CH3I_Progress () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #3 0x00000000006cf87c in PMPI_Send () from > /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 > #4 0x0000000000401831 in NodeThread_threadMain (arg=0x120a6f0) at > testmaster.c:480 > #5 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 > #6 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 > > (gdb) thread 26 > [Switching to thread 26 (Thread 1098156368 (LWP 18229))]#0 > 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6 > (gdb) bt > #0 0x0000003464e9ac61 in nanosleep () from /lib64/libc.so.6 > #1 0x0000003464e9aa84 in sleep () from /lib64/libc.so.6 > #2 0x000000000040197c in NodeThread_threadMain (arg=0x120d630) at > testmaster.c:505 > #3 0x0000003465a06407 in start_thread () from /lib64/libpthread.so.0 > #4 0x0000003464ed4b0d in clone () from /lib64/libc.so.6 > (gdb) > >> Thanks, >> >> DK >> >> On Tue, 22 Jul 2008, Roberto Fichera wrote: >> >> >>> Roberto Fichera ha scritto: >>> >>>> Dhabaleswar Panda ha scritto: >>>> >>>>> Hi Roberto, >>>>> >>>>> Thanks for your note. You are using the ch3:sock device in MVAPICH2 which >>>>> is the same as MPICH2. You are also seeing similar failure scenarios (but >>>>> in different forms) with MPICH2 1.0.7. I am cc'ing this message to mpich2 >>>>> mailing list. One of the MPICH2 developers will be able to extend help on >>>>> this issue faster. >>>>> >>>>> >>>> Thanks for that. About the mpich2 problem, I already sent an email >>>> regarding its related issue. >>>> But the strange thing is that when linking against mpich2 I don't see >>>> a so fast race as I see in the >>>> mvapich2. In the mpich2 case I had to wait 1 or 2 hours before the lock. 
>>>>
>>> Just an update about the problem I reported. After replacing all
>>> MPI_Send() calls with MPI_Ssend(), everything
>>> seems to work well with mpich2 v1.0.7. My application no longer races,
>>> at least after dispatching
>>> 50,000 jobs across 4 nodes. But when running the same application
>>> against the latest mvapich2 1.2rc1,
>>> I still get the same problem as shown below.
>>>
>>> I have another question. Since this multithreaded application has to run
>>> on a cluster with 1024 nodes equipped
>>> with Mellanox IB cards, I would really like to know whether the OpenFabrics-IB
>>> interface supports MPI_THREAD_MULTIPLE
>>> initialization and also MPI_Comm_spawn().
>>>
>>> Thanks a lot for the feedback.
>>>
>>>>> Thanks,
>>>>>
>>>>> DK
>>>>>
>>>>>
>>>>> On Fri, 18 Jul 2008, Roberto Fichera wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Hi All on the list,
>>>>>>
>>>>>> I'm trying to use mvapich2 v1.2rc1 in a multithreaded application,
>>>>>> initialized using MPI_THREAD_MULTIPLE.
>>>>>> The master application does the following: it starts several
>>>>>> threads, depending on the assigned nodes, and
>>>>>> on each node a slave application is spawned using MPI_Comm_spawn().
>>>>>> Before calling
>>>>>> MPI_Comm_spawn() I prepare an MPI_Info object, one for each
>>>>>> thread, in order to set all the keys
>>>>>> (host and wdir) needed for the desired behaviour. As soon as
>>>>>> the master application starts, it hits the race
>>>>>> immediately with 4 nodes: 1 master and 3 slaves. Below you can see the
>>>>>> status of the master application at the time of the race.
>>>>>> It seems stuck in PMIU_readline(), which never returns, so the
>>>>>> global lock is never released.
MVAPICH2 >>>>>> is compiled with: >>>>>> >>>>>> PKG_PATH=/HRI/External/mvapich2/1.2rc1 >>>>>> >>>>>> ./configure --prefix=$PKG_PATH \ >>>>>> --bindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \ >>>>>> --sbindir=$PKG_PATH/bin/linux-x86_64-gcc-glibc2.3.4 \ >>>>>> --libdir=$PKG_PATH/lib/linux-x86_64-gcc-glibc2.3.4 \ >>>>>> --enable-sharedlibs=gcc \ >>>>>> --enable-f90 \ >>>>>> --enable-threads=multiple \ >>>>>> --enable-g=-ggdb \ >>>>>> --enable-debuginfo \ >>>>>> --with-device=ch3:sock \ >>>>>> --datadir=$PKG_PATH/data \ >>>>>> --with-htmldir=$PKG_PATH/doc/html \ >>>>>> --with-docdir=$PKG_PATH/doc \ >>>>>> LDFLAGS='-Wl,-z,noexecstack' >>>>>> >>>>>> so I'm using the ch3:sock device. >>>>>> >>>>>> -----Thread 2 >>>>>> [Switching to thread 2 (Thread 1115699536 (LWP 29479))]#0 >>>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0 >>>>>> (gdb) bt >>>>>> #0 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0 >>>>>> #1 0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0 >>>>>> --->>#2 0x00000033ca408390 in pthread_mutex_lock () from >>>>>> /lib64/libpthread.so.0 >>>>>> --->>#3 0x00002aaaab382654 in PMPI_Info_set () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x0000000000417627 in ParallelWorker_setSlaveInfo (self=>>>>> optimized out>, key=0x0, value=0x33ca40ff58 >>>>>> "!\204??\r\206??\030\204??3\206??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\177\205??\177\205??\177\205??\177\205??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\033\205??\033\205??\033\205??\033\205??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\033\205??\033\205??"...) 
>>>>>> at ParallelWorker.c:664 >>>>>> #5 0x0000000000418905 in ParallelWorker_handleParallel (self=0x62ff50) >>>>>> at ParallelWorker.c:719 >>>>>> #6 0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ff50) at >>>>>> ParallelWorker.c:504 >>>>>> #7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> -----Thread 3 >>>>>> [Switching to thread 3 (Thread 1105209680 (LWP 29478))]#0 >>>>>> 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0 >>>>>> (gdb) bt >>>>>> #0 0x00000033ca40cef4 in __lll_lock_wait () from /lib64/libpthread.so.0 >>>>>> #1 0x00000033ca408915 in _L_lock_102 () from /lib64/libpthread.so.0 >>>>>> --->>#2 0x00000033ca408390 in pthread_mutex_lock () from >>>>>> /lib64/libpthread.so.0 >>>>>> --->>#3 0x00002aaaab382654 in PMPI_Info_set () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x0000000000417627 in ParallelWorker_setSlaveInfo (self=>>>>> optimized out>, key=0x0, value=0x33ca40ff58 >>>>>> "!\204??\r\206??\030\204??3\206??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\177\205??\177\205??\177\205??\177\205??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\033\205??\033\205??\033\205??\033\205??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\n\204??\033\205??\033\205??"...) 
>>>>>> at ParallelWorker.c:664 >>>>>> #5 0x0000000000418905 in ParallelWorker_handleParallel (self=0x62f270) >>>>>> at ParallelWorker.c:719 >>>>>> #6 0x000000000041b39e in ParallelWorker_threadMain (arg=0x62f270) at >>>>>> ParallelWorker.c:504 >>>>>> #7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> -----Thread 4 >>>>>> [Switching to thread 4 (Thread 1094719824 (LWP 29477))]#0 >>>>>> 0x00000033ca40d34b in read () from /lib64/libpthread.so.0 >>>>>> (gdb) bt >>>>>> #0 0x00000033ca40d34b in read () from /lib64/libpthread.so.0 >>>>>> --->>#1 0x00002aaaab3db84a in PMIU_readline () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> --->>#2 0x00002aaaab3d9d37 in PMI_Spawn_multiple () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #3 0x00002aaaab333893 in MPIDI_Comm_spawn_multiple () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x00002aaaab38bcf6 in MPID_Comm_spawn_multiple () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #5 0x00002aaaab355a10 in PMPI_Comm_spawn () from >>>>>> /home/roberto/.HRI/Proxy/HRI/External/mvapich2/1.2/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #6 0x00000000004189d8 in ParallelWorker_handleParallel (self=0x62ad40) >>>>>> at ParallelWorker.c:754 >>>>>> #7 0x000000000041b39e in ParallelWorker_threadMain (arg=0x62ad40) at >>>>>> ParallelWorker.c:504 >>>>>> #8 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #9 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> I also tried to run against MPICH2 v1.0.7, but here I got a similar >>>>>> scenery which show up after between 1 - 2 hours of execution, >>>>>> see below: >>>>>> >>>>>> 
----- thread 2 >>>>>> [Switching to thread 2 (Thread 1094719824 (LWP 1279))]#0 0x00000033c94cbd66 in poll () from /lib64/libc.so.6 >>>>>> (gdb) bt >>>>>> #0 0x00000033c94cbd66 in poll () from /lib64/libc.so.6 >>>>>> #1 0x00002aaaab5a3d2f in MPIDU_Sock_wait () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #2 0x00002aaaab52bdc7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #3 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #5 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #6 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6358e0) at ParallelWorker.c:819 >>>>>> #7 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6358e0) at ParallelWorker.c:515 >>>>>> #8 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #9 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> ----- thread 3 >>>>>> [Switching to thread 3 (Thread 1084229968 (LWP 1278))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 >>>>>> (gdb) bt >>>>>> #0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 >>>>>> #1 0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #2 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #3 0x00002aaaab56f162 in 
MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #5 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x634d20) at ParallelWorker.c:819 >>>>>> #6 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x634d20) at ParallelWorker.c:515 >>>>>> #7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> >>>>>> ----- thread 4 >>>>>> [Switching to thread 4 (Thread 1115699536 (LWP 1277))]#0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 >>>>>> (gdb) bt >>>>>> #0 0x00000033ca40a8f9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 >>>>>> #1 0x00002aaaab52bec7 in MPIDI_CH3I_Progress () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #2 0x00002aaaab5301a7 in MPIDI_CH3U_VC_WaitForClose () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #3 0x00002aaaab56f162 in MPID_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #4 0x00002aaaab5417ec in PMPI_Comm_disconnect () from /home/roberto/.HRI/Proxy/HRI/External/mpich2/1.0.7/lib/linux-x86_64-gcc-glibc2.3.4/libmpich.so.1.1 >>>>>> #5 0x00002aaaabda5a99 in ParallelWorker_destroySlave (self=0x6341a0) at ParallelWorker.c:819 >>>>>> #6 0x00002aaaabda6223 in ParallelWorker_threadMain (arg=0x6341a0) at ParallelWorker.c:515 >>>>>> #7 0x00000033ca406407 in start_thread () from /lib64/libpthread.so.0 >>>>>> #8 0x00000033c94d4b0d in clone () from /lib64/libc.so.6 >>>>>> >>>>>> where the thread 2 is poll()ing never never returns, so never signals >>>>>> the 
poll() completion, and then all the other
>>>>>> waiters on the MPIDI_CH3I_Progress() condition will never wake up.
>>>>>>
>>>>>> Is anyone else having the same problem?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Roberto Fichera.

From mark.debbage at qlogic.com Fri Aug 29 13:25:10 2008
From: mark.debbage at qlogic.com (Mark Debbage)
Date: Fri Aug 29 13:32:18 2008
Subject: [mvapich-discuss] MVAPICH 1.0.0 and stdin
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AB@AVEXCH1.qlogic.org>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: mpicat.c
Type: text/x-csrc
Size: 376 bytes
Desc: mpicat.c
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080829/e4d65762/mpicat.bin

From mark.debbage at qlogic.com Fri Aug 29 13:45:30 2008
From: mark.debbage at qlogic.com (Mark Debbage)
Date: Fri Aug 29 14:06:47 2008
Subject: [mvapich-discuss] RE: MVAPICH 1.0.0 and stdin
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AB@AVEXCH1.qlogic.org>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AC@AVEXCH1.qlogic.org>

This is a resend with in-line attachment.
Also note that the problem does not occur with MVAPICH 0.9.9. If I use MVAPICH 1.0.0 and arrange to use the "legacy" start-up mechanism then it also works reliably. For example: /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun_rsh -legacy -np 2 -hostfile hosts /home/markdebbage/support/OU/./mpicat < input This makes me think that the new source code allowing multiple MPI processes per ssh is the problem, though in this case there is just one MPI process per node. Mark. -----Original Message----- From: Mark Debbage Sent: Fri 8/29/2008 10:25 AM To: mvapich-discuss@cse.ohio-state.edu Subject: MVAPICH 1.0.0 and stdin We are having problems with stdin and MVAPICH 1.0.0 (from OFED 1.3). I am running with the mpirun process and rank 0 on the same host and expecting the stdin of the mpirun process to be available to rank 0. This works reliably if there is just one process in the job, or if all MPI processes are mapped to that same host. However, if there are MPI processes on other hosts, then stdin becomes intermittent - about 4 in 5 times it works fine, but 1 in 5 times all reads on stdin return EOF. I've attached the example source code. It is a simple MPI version of cat. I am building and running like this: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpicc mpicat.c -o mpicat markdebbage@perf-15:~/support/OU> cat hosts perf-15 perf-16 Here's a working run: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input This is rank 0 - start loop 1 2 3 4 5 6 999 This is rank 0 - end loop Here's a non-working run: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input This is rank 0 - start loop This is rank 0 - end loop markdebbage@perf-15:~/support/OU> I've tried this with OFED 1.3 running on Mellanox and QLogic adapters, and also with the PSM version of MVAPICH running on QLogic adapters. It appears that this is independent of transport. 
I also tried the -stdin option that appears on the mpirun help page.
However, that seems to be silently ignored. I can see the code in
mpirun.args that processes that option but it doesn't appear to be
connected up to anything.

Cheers,

Mark.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Only rank 0 copies stdin to stdout. */
  if (rank == 0) {
    printf("This is rank 0 - start loop\n");
    int c;
    while ((c = getchar()) != EOF) {
      putchar(c);
    }
    printf("This is rank 0 - end loop\n");
  }

  MPI_Finalize();
  return EXIT_SUCCESS;
}

From mark.debbage at qlogic.com Fri Aug 29 13:52:46 2008
From: mark.debbage at qlogic.com (Mark Debbage)
Date: Fri Aug 29 14:06:48 2008
Subject: [mvapich-discuss] RE: MVAPICH 1.0.0 and stdin
References: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AB@AVEXCH1.qlogic.org> <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AC@AVEXCH1.qlogic.org>
Message-ID: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AD@AVEXCH1.qlogic.org>

OK, this turns out to be pretty straightforward.
spawn_linear (the legacy spawner) arranges for stdin to
be propagated to just rank 0 and uses /dev/null for all
other ranks:

  if (i != 0) {
    int fd = open("/dev/null", O_RDWR, 0);
    (void) dup2(fd, STDIN_FILENO);
  }

spawn_fast (the new spawner) doesn't have any code to do
this. My guess is that the local ssh processes for the other
ranks are looking at stdin (maybe just polling it) and stealing
the stdin from rank 0.

Can you include a fix for this in your next release? Thanks,

Mark.

-----Original Message-----
From: Mark Debbage
Sent: Fri 8/29/2008 10:45 AM
To: Mark Debbage; mvapich-discuss@cse.ohio-state.edu
Subject: RE: MVAPICH 1.0.0 and stdin

This is a resend with in-line attachment. Also note that the
problem does not occur with MVAPICH 0.9.9.
If I use MVAPICH 1.0.0 and arrange to use the "legacy" start-up mechanism then it also works reliably. For example: /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun_rsh -legacy -np 2 -hostfile hosts /home/markdebbage/support/OU/./mpicat < input This makes me think that the new source code allowing multiple MPI processes per ssh is the problem, though in this case there is just one MPI process per node. Mark. -----Original Message----- From: Mark Debbage Sent: Fri 8/29/2008 10:25 AM To: mvapich-discuss@cse.ohio-state.edu Subject: MVAPICH 1.0.0 and stdin We are having problems with stdin and MVAPICH 1.0.0 (from OFED 1.3). I am running with the mpirun process and rank 0 on the same host and expecting the stdin of the mpirun process to be available to rank 0. This works reliably if there is just one process in the job, or if all MPI processes are mapped to that same host. However, if there are MPI processes on other hosts, then stdin becomes intermittent - about 4 in 5 times it works fine, but 1 in 5 times all reads on stdin return EOF. I've attached the example source code. It is a simple MPI version of cat. I am building and running like this: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpicc mpicat.c -o mpicat markdebbage@perf-15:~/support/OU> cat hosts perf-15 perf-16 Here's a working run: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input This is rank 0 - start loop 1 2 3 4 5 6 999 This is rank 0 - end loop Here's a non-working run: markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input This is rank 0 - start loop This is rank 0 - end loop markdebbage@perf-15:~/support/OU> I've tried this with OFED 1.3 running on Mellanox and QLogic adapters, and also with the PSM version of MVAPICH running on QLogic adapters. It appears that this is independent of transport. I also tried the -stdin option that appears on the mpirun help page. 
However, that seems to be silently ignored. I can see the code in
mpirun.args that processes that option but it doesn't appear to be
connected up to anything.

Cheers,

Mark.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main (int argc, char **argv)
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    printf("This is rank 0 - start loop\n");
    int c;
    while ((c = getchar()) != EOF) {
      putchar(c);
    }
    printf("This is rank 0 - end loop\n");
  }
  MPI_Finalize();
  return EXIT_SUCCESS;
}

From panda at cse.ohio-state.edu  Fri Aug 29 14:56:51 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Aug 29 14:57:04 2008
Subject: [mvapich-discuss] RE: MVAPICH 1.0.0 and stdin
In-Reply-To: <6DB5B58A8E5AB846A7B3B3BFF1B4315A01EA86AD@AVEXCH1.qlogic.org>
Message-ID:

Hi Mark,

Thanks for reporting this problem and the associated details regarding
where things are failing. We will work on a fix for this for the
upcoming 1.1 release.

Thanks,

DK

On Fri, 29 Aug 2008, Mark Debbage wrote:

> OK, this turns out to be pretty straightforward.
> spawn_linear (the legacy spawner) arranges for stdin to
> be propagated to just rank 0 and uses /dev/null for all
> other ranks:
>
>     if (i != 0) {
>         int fd = open("/dev/null", O_RDWR, 0);
>         (void) dup2(fd, STDIN_FILENO);
>     }
>
> spawn_fast (the new spawner) doesn't have any code to do
> this. My guess is that the local ssh processes for the other
> ranks are looking at stdin (maybe just polling it) and stealing
> the stdin from rank 0.
>
> Can you include a fix for this in your next release? Thanks,
>
> Mark.
>
> -----Original Message-----
> From: Mark Debbage
> Sent: Fri 8/29/2008 10:45 AM
> To: Mark Debbage; mvapich-discuss@cse.ohio-state.edu
> Subject: RE: MVAPICH 1.0.0 and stdin
>
> This is a resend with in-line attachment.
> Also note that the
> problem does not occur with MVAPICH 0.9.9. If I use MVAPICH 1.0.0
> and arrange to use the "legacy" start-up mechanism then it also
> works reliably. For example:
>
> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun_rsh -legacy -np 2 -hostfile hosts /home/markdebbage/support/OU/./mpicat < input
>
> This makes me think that the new source code allowing multiple
> MPI processes per ssh is the problem, though in this case there
> is just one MPI process per node.
>
> Mark.
>
>
> -----Original Message-----
> From: Mark Debbage
> Sent: Fri 8/29/2008 10:25 AM
> To: mvapich-discuss@cse.ohio-state.edu
> Subject: MVAPICH 1.0.0 and stdin
>
> We are having problems with stdin and MVAPICH 1.0.0 (from OFED 1.3).
> I am running with the mpirun process and rank 0 on the same host
> and expecting the stdin of the mpirun process to be available to
> rank 0. This works reliably if there is just one process in the job,
> or if all MPI processes are mapped to that same host. However, if
> there are MPI processes on other hosts, then stdin becomes
> intermittent - about 4 in 5 times it works fine, but 1 in 5 times
> all reads on stdin return EOF.
>
> I've attached the example source code. It is a simple MPI version
> of cat.
> I am building and running like this:
>
> markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpicc mpicat.c -o mpicat
>
> markdebbage@perf-15:~/support/OU> cat hosts
> perf-15
> perf-16
>
> Here's a working run:
>
> markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input
> This is rank 0 - start loop
> 1
> 2
> 3
> 4
> 5
> 6
> 999
> This is rank 0 - end loop
>
> Here's a non-working run:
>
> markdebbage@perf-15:~/support/OU> /usr/mpi/gcc/mvapich-1.0.0/bin/mpirun -machinefile hosts -np 2 ./mpicat < input
> This is rank 0 - start loop
> This is rank 0 - end loop
> markdebbage@perf-15:~/support/OU>
>
> I've tried this with OFED 1.3 running on Mellanox and QLogic adapters,
> and also with the PSM version of MVAPICH running on QLogic adapters.
> It appears that this is independent of transport. I also tried the
> -stdin option that appears on the mpirun help page. However, that
> seems to be silently ignored. I can see the code in mpirun.args that
> processes that option but it doesn't appear to be connected up to
> anything.
>
> Cheers,
>
> Mark.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> int main (int argc, char **argv)
> {
>   int rank;
>   MPI_Init(&argc, &argv);
>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>   if (rank == 0) {
>     printf("This is rank 0 - start loop\n");
>     int c;
>     while ((c = getchar()) != EOF) {
>       putchar(c);
>     }
>     printf("This is rank 0 - end loop\n");
>   }
>   MPI_Finalize();
>   return EXIT_SUCCESS;
> }
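
For illustration, the stdin policy described in this thread for
spawn_linear (rank 0 keeps the launcher's stdin, every other rank gets
/dev/null) can be sketched as a small fork-based launcher. This is not
MVAPICH source: the names stdin_to_devnull and spawn_children are
hypothetical, /dev/null is opened read-only here rather than O_RDWR,
and each child only reports whether it saw any input instead of
starting ssh or an MPI process.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_CHILDREN 64

/* Hypothetical helper: make this process's stdin read from /dev/null.
 * Returns 0 on success, -1 on failure. */
int stdin_to_devnull(void)
{
    int fd = open("/dev/null", O_RDONLY);
    if (fd < 0)
        return -1;
    if (fd != STDIN_FILENO) {
        if (dup2(fd, STDIN_FILENO) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return 0;
}

/* Hypothetical launcher: fork nprocs children. Child 0 inherits the
 * caller's stdin; every other child is switched to /dev/null first.
 * Each child does one read() on stdin and exits with 1 if it got data,
 * 0 if it got EOF or an error, 2 if the redirection failed.
 * saw_input[i] receives child i's exit status.
 * Returns 0 on success, -1 if fork or waitpid failed. */
int spawn_children(int nprocs, int saw_input[])
{
    assert(nprocs > 0 && nprocs <= MAX_CHILDREN);

    pid_t pids[MAX_CHILDREN];
    int started = 0;
    int rc = 0;

    for (int i = 0; i < nprocs; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            rc = -1;
            break;
        }
        if (pid == 0) {
            /* Same condition as the snippet quoted in the thread. */
            if (i != 0 && stdin_to_devnull() != 0)
                _exit(2);
            char buf[256];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            _exit(n > 0 ? 1 : 0);
        }
        pids[started++] = pid;
    }

    /* Reap every child that was actually started. */
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) < 0 || !WIFEXITED(status)) {
            rc = -1;
            continue;
        }
        saw_input[i] = WEXITSTATUS(status);
    }
    return rc;
}
```

With a non-empty file redirected to the caller's stdin, calling
spawn_children(2, saw) should leave saw[0] == 1 and saw[1] == 0, since
only child 0 still has the real stdin. Dropping the "i != 0" branch
lets every child compete for the same input, which is the kind of
interference suspected above for spawn_fast.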