From Fred.Stecher at atk.com Wed Jan 7 18:25:31 2009
From: Fred.Stecher at atk.com (Stecher, Fred)
Date: Wed Jan 7 18:35:39 2009
Subject: [mvapich-discuss] compile errors with mvapich-1.1rc1
Message-ID:

Hi,

I am trying to build a parallel version of a visualization program. The build stops with undefined references when linking against libmpich.a. Below is the output from the build.

Thanks,
Fred

/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(mperror.o): In function `MPIR_Errors_are_fatal':
mperror.c:(.text+0xb4): undefined reference to `__builtin_va_gparg1'
mperror.c:(.text+0xc0): undefined reference to `__builtin_va_gparg1'
mperror.c:(.text+0xcc): undefined reference to `__builtin_va_gparg1'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(mperror.o): In function `MPIR_Errors_warn':
mperror.c:(.text+0x241): undefined reference to `__builtin_va_gparg1'
mperror.c:(.text+0x24d): undefined reference to `__builtin_va_gparg1'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(mperror.o):mperror.c:(.text+0x259): more undefined references to `__builtin_va_gparg1' follow
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(ptrcvt.o): In function `MPIR_InitPointer':
ptrcvt.c:(.text+0x95e): undefined reference to `__c_mzero8'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(viainit.o): In function `viainit_on_demand_exchange':
viainit.c:(.text+0x8f5): undefined reference to `_mp_malloc'
viainit.c:(.text+0x9c4): undefined reference to `_mp_free'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(viainit.o): In function `viainit_exchange':
viainit.c:(.text+0x9eb): undefined reference to `_mp_malloc'
viainit.c:(.text+0xa81): undefined reference to `_mp_free'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(viainit.o): In function `MPID_VIA_Init':
viainit.c:(.text+0xd2e): undefined reference to `__c_mcopy4'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(malloc.o): In function `ptmalloc_init':
malloc.c:(.text+0x37e0): undefined reference to `__c_mzero8'
/data1/home/fstecher/mvapich/mvapich-1.1/mvapich-1.1rc1/usr/local/mvapich/lib/libmpich.a(debugutil.o):(.data+0x58): undefined reference to `__pgdbg_stub'
collect2: ld returned 1 exit status
make[2]: *** [/home/apps/ale3d/4.8/visit_1.11/visit1.11.0/src/exe/engine_par] Error 1
make[2]: Leaving directory `/home/apps/ale3d/4.8/visit_1.11/visit1.11.0/src/engine/main'
make[1]: *** [all] Error 1
make[1]: Leaving directory `/home/apps/ale3d/4.8/visit_1.11/visit1.11.0/src/engine'
make: *** [all] Error 1

From panda at cse.ohio-state.edu Wed Jan 7 19:23:06 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Jan 7 19:23:13 2009
Subject: [mvapich-discuss] compile errors with mvapich-1.1rc1
In-Reply-To:
Message-ID:

You seem to be using the rc1 version. Do you see the error with the final 1.1 release (the final release was made on Nov 14th)?

Thanks,
DK

From sriram at pnl.gov Mon Jan 12 12:45:57 2009
From: sriram at pnl.gov (Krishnamoorthy, Sriram)
Date: Mon Jan 12 12:59:50 2009
Subject: [mvapich-discuss] Queue pair usage in mvapich versions
Message-ID:

Is there a way to query or control the number of queue pairs used by mvapich/mvapich2 (in versions 1.1, 1.0.1, and 1.2) per SMP node?

If it is a fixed expression (either all initialized on start-up or later on-demand), could you please provide the same or point me to a reference?

I am trying to create 2*p queue pairs per SMP node, after initializing MPI. Beyond 8192 processes, this is failing in the queue pair create call.

Please cc my email id, as I am not subscribed to the mailing list.

Thanks in advance,
Sriram.K

From koop at cse.ohio-state.edu Mon Jan 12 15:21:30 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Jan 12 15:21:38 2009
Subject: [mvapich-discuss] Queue pair usage in mvapich versions
In-Reply-To:
Message-ID:

Sriram,

By default MVAPICH will set up connections "on-demand" when there are more than 64 processes in a job, so at your scale QPs should only be created when they are needed.
So only processes that communicate directly with each other will need to create QPs. If a process communicates directly with 'n' peer ranks then it will create 'n' QPs.

How many processes do you have per node? In the past (I'm not sure whether the default has changed since), the driver limited each HCA to 64K QPs. You may want to update your OFED installation if it is an older version. Thus, an AlltoAll on 8K processes with 8 processes per node would hit the limit (16 per node would hit it even sooner).

If you need extreme scalability you can try the ch_hybrid device of MVAPICH, which uses the UD transport of InfiniBand and needs very few QPs since it is connectionless.

Additionally, what is the maximum lockable memory on the node (ulimit -l)? QPs must be in pinned memory, so if that limit is not high enough QP creation could also fail.

Matt

From michael.heinz at qlogic.com Tue Jan 13 10:29:33 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Tue Jan 13 10:30:39 2009
Subject: [mvapich-discuss] Problems with hostname resolution and MPI_INIT()
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A3F@MNEXMB1.qlogic.org>

I keep running into this problem at random, and each time it brings someone down for a couple of hours before we figure it out... again.
Basically, some distros add lines like this to their hosts file:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 node01 localhost.localdomain localhost

The problem is that this causes "node01" to tell the other MPI ranks that its IP address is 127.0.0.1, which causes MPI jobs to hang in MPI_INIT().

I've seen a similar issue with distros that define 127.0.0.2.

So, I don't mind digging into the code and changing how mvapich does IP address resolution, but I can't help but think that this problem must happen all the time - before I start patching the code, is there a bug in my network configs that I should be fixing?

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From perkinjo at cse.ohio-state.edu Tue Jan 13 11:08:13 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Tue Jan 13 11:08:27 2009
Subject: [mvapich-discuss] Problems with hostname resolution and MPI_INIT()
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A3F@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A3F@MNEXMB1.qlogic.org>
Message-ID: <20090113160812.GA5610@cse.ohio-state.edu>

Michael:

Hi, my comments are inline.

On Tue, Jan 13, 2009 at 09:29:33AM -0600, Mike Heinz wrote:
> I keep running into this problem at random, and each time it brings someone down for a couple of hours before we figure it out... again.
>
> Basically, some distros add lines like this to their hosts file:
>
> # Do not remove the following line, or various programs
> # that require network functionality will fail.
> 127.0.0.1 node01 localhost.localdomain localhost

My impression is that node01 should not be listed as an entry for the localhost IP address. I believe node01 should be listed under its unique IP address on its subnet.

> The problem is that this causes "node01" to tell the other MPI ranks that its IP address is 127.0.0.1, which causes MPI jobs to hang in MPI_INIT().
>
> I've seen a similar issue with distros that define 127.0.0.2.
>
> So, I don't mind digging into the code and changing how mvapich does IP address resolution, but I can't help but think that this problem must happen all the time - before I start patching the code, is there a bug in my network configs that I should be fixing?

I think it is an issue with your network configs. Which distro(s) do you see this problem on?

> --
> Michael Heinz
> Principal Engineer, Qlogic Corporation
> King of Prussia, Pennsylvania
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From michael.heinz at qlogic.com Tue Jan 13 11:17:30 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Tue Jan 13 11:18:36 2009
Subject: [mvapich-discuss] Problems with hostname resolution and MPI_INIT()
In-Reply-To: <20090113160812.GA5610@cse.ohio-state.edu>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A3F@MNEXMB1.qlogic.org> <20090113160812.GA5610@cse.ohio-state.edu>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A45@MNEXMB1.qlogic.org>

Jonathan,

Records like these appear to be common on many distros. The example before was from a stock RHEL4 installation. Here's another example, from another machine:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 homer.dev.silverstorm.com homer localhost.localdomain localhost

Homer is the name of the box - note that, as per the comments, these lines were generated by the distro, not by a user.
For comparison, I found these lines in a completely fresh RHEL5 install:

# Do not remove the following line, or various programs
# that require network functionality will fail.
::1 localhost.localdomain localhost mheinz-linux

"mheinz-linux" is the name of the box. Meanwhile, SLES10 does something similar:

127.0.0.2 moe.dev.silverstorm.com moe

Moe is the name of the box. Again, this appears to be done by the distro, not by any user.

We can get around the problem by manually editing all the hosts files, but I'm concerned because I don't understand why the distros seem to feel this is necessary.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From michael.heinz at qlogic.com Tue Jan 13 14:40:38 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Tue Jan 13 14:41:44 2009
Subject: [mvapich-discuss] Problems with hostname resolution and MPI_INIT()
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A45@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A3F@MNEXMB1.qlogic.org> <20090113160812.GA5610@cse.ohio-state.edu> <4C2744E8AD2982428C5BFE523DF8CDCB3E74624A45@MNEXMB1.qlogic.org>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E7468401D@MNEXMB1.qlogic.org>

I want to post this to the group in case someone else runs into this problem.

Jaidev suggests that the odd /etc/hosts entries aren't required by some mysterious network service but are caused by installing Linux without specifying an IP address. This makes sense because we use DHCP across all our networks, and implies that my initial instinct (to just delete the offending records) is acceptable.
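For anyone who wants to check a node before launching a job, the hosts-file condition described in this thread can be detected with a few lines of Python. This is only a minimal sketch, not part of MVAPICH: the function name `loopback_aliases` is made up for illustration, and the `192.168.1.10 node02` entry in the sample is a hypothetical address added to show a correctly configured node.

```python
import ipaddress

# Sample hosts-file text modelled on the entries quoted in this thread.
# The 192.168.1.10 line is a hypothetical, correctly configured node.
SAMPLE = """\
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 node01 localhost.localdomain localhost
127.0.0.2 moe.dev.silverstorm.com moe
192.168.1.10 node02
"""


def loopback_aliases(hosts_text, hostname):
    """Return the loopback addresses that `hostname` is mapped to."""
    hits = []
    for raw_line in hosts_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        try:
            is_loopback = ipaddress.ip_address(address).is_loopback
        except ValueError:
            continue  # first field is not an IP address; skip the line
        # Also match the short form of fully qualified names.
        short_names = {name.split(".", 1)[0] for name in names}
        if is_loopback and (hostname in names or hostname in short_names):
            hits.append(address)
    return hits


print(loopback_aliases(SAMPLE, "node01"))
print(loopback_aliases(SAMPLE, "node02"))
```

On a real node one would pass the contents of /etc/hosts and the machine's own hostname (for example from `socket.gethostname()`); a non-empty result means the hostname resolves to a loopback address such as 127.0.0.1, 127.0.0.2 or ::1.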
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From Terrence.LIAO at total.com Wed Jan 14 08:26:47 2009
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Wed Jan 14 08:27:02 2009
Subject: [mvapich-discuss] Re: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To: <200901131602.n0DG1eBc010597@cse.ohio-state.edu>
Message-ID:

Mike,

We had the same problem before (posted in mvapich-discuss Digest, Vol 37, Issue 1) and, just like you, I dug into the mvapich code and modified get_host_id(). Our system admin told me it is normal practice to put the hostname on the 127.0.0.1 entry, just like yours. Of course, this is the culprit. Later on, to resolve a sendmail conflict, the system admin removed the hostname from the 127.0.0.1 entry and created a real-IP-to-hostname entry in /etc/hosts. With this, I no longer needed my mvapich code change.

Maybe the solution is for the mvapich install guide to recommend creating an entry for the hostname in /etc/hosts, and to warn of the potential problem if the hostname is added to the 127.0.0.1 entry. (Maybe this has already been done; I have to admit that I did not read the entire install guide.)

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498
Fax: 713.647.3638
Email: terrence.liao@total.com

From christian.guggenberger at rzg.mpg.de Wed Jan 14 08:49:55 2009
From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger)
Date: Wed Jan 14 08:50:27 2009
Subject: [mvapich-discuss] Re: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To:
References: <200901131602.n0DG1eBc010597@cse.ohio-state.edu>
Message-ID: <20090114134955.GA7765@bonnie.rzg.mpg.de>

> entry on /etc/hosts. Of course, with this, no more mvapich code change for
> me. May be the solution to this is let mvapich install guide to advise or
> recommend to create entry for hostname in /etc/hosts, and to warn the
> potential problem if added it into 127.0.0.1. (May be this is already been
> done and, well, I have to admit that I did not read the entire install
> guide. )

I agree with Terrence that the FAQ/install guide should include such advice. One could then also point out that hostnames should be RFC compliant (every now and then people stumble over underscores in hostnames ...).

cheers.
- Christian

From panda at cse.ohio-state.edu Wed Jan 14 09:40:36 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Jan 14 09:40:50 2009
Subject: [mvapich-discuss] Re: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To: <20090114134955.GA7765@bonnie.rzg.mpg.de>
Message-ID:

Thanks for these suggestions. We will add this information to the user guide.

Thanks,
DK

From michael.heinz at qlogic.com Wed Jan 14 10:04:36 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Wed Jan 14 10:05:40 2009
Subject: [mvapich-discuss] RE: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To:
References: <200901131602.n0DG1eBc010597@cse.ohio-state.edu>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74684064@MNEXMB1.qlogic.org>

Heh. I'm glad to know we aren't the only ones with this issue - but it gets even weirder. I've conclusively shown that removing the localhost record solves the problem, but adding a faulty localhost record doesn't cause the problem on a previously working machine!

So, there must be an additional circumstance that has to occur to trigger the issue.

--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From michael.heinz at qlogic.com Wed Jan 14 10:06:37 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Wed Jan 14 10:07:40 2009
Subject: [mvapich-discuss] Re: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To: <20090114134955.GA7765@bonnie.rzg.mpg.de>
References: <200901131602.n0DG1eBc010597@cse.ohio-state.edu> <20090114134955.GA7765@bonnie.rzg.mpg.de>
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3E74684066@MNEXMB1.qlogic.org>

Ugh. Yeah, we used to have a habit of naming our IB addresses things like "hostname_IB" or "hostname_INIC" which worked great until it completely scrambled the name resolution on some new machines we were using. That's when we found out that dashes are legal but underscores aren't. That was a fun week, getting all the names straightened out.
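The "dashes are legal but underscores aren't" rule can be checked mechanically. The following is a minimal Python sketch of the hostname syntax from RFC 952 as relaxed by RFC 1123 (letters, digits and hyphens per label, no leading or trailing hyphen, labels of at most 63 characters). The function name `is_valid_hostname` is an illustrative choice, and the 253-character overall limit is the commonly used bound rather than something stated in this thread.

```python
import re

# One DNS label: letters, digits and hyphens, 1 to 63 characters,
# neither starting nor ending with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_valid_hostname(name):
    """Check `name` against the RFC 952 / RFC 1123 hostname syntax."""
    if not name or len(name) > 253:
        return False
    # Every dot-separated label must match in full; an empty label
    # (for example from a trailing dot) is rejected by this sketch.
    return all(_LABEL.fullmatch(label) for label in name.split("."))


for candidate in ("hostname-IB", "hostname_IB", "homer.dev.silverstorm.com"):
    print(candidate, is_valid_hostname(candidate))
```

Running a check like this over every name in a cluster's hosts file before deployment would catch underscore names such as "hostname_IB" early.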
--
Michael Heinz
Principal Engineer, Qlogic Corporation
King of Prussia, Pennsylvania

From gabra at us.ibm.com Wed Jan 14 10:19:06 2009
From: gabra at us.ibm.com (Gregory D Abram)
Date: Wed Jan 14 10:19:19 2009
Subject: [mvapich-discuss] MPI flavor-agnostic libraries
In-Reply-To: <200901141515.n0EF78f9018070@cse.ohio-state.edu>
References: <200901141515.n0EF78f9018070@cse.ohio-state.edu>
Message-ID:

I'd like to release binary libraries that use MPI but are agnostic as to which flavor of MPI (e.g. OpenMPI, MVAPICH, LAM...) is used by the application linking the libraries. I've seen that there are some significant differences that stand in the way, for example that MPI_Comm is a pointer on OpenMPI and an integer on MVAPICH. I can see some ways that might work, but they are pretty complex - for example, I could create an intercept library that loads a real MPI library explicitly and does whatever needs to be done (for example, translating MPI_Comm parameters). Does anyone know of anything that might help?
Greg -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090114/981ace81/attachment.html From daniel.s.kokron at nasa.gov Fri Jan 23 13:24:25 2009 From: daniel.s.kokron at nasa.gov (Dan Kokron) Date: Fri Jan 23 13:44:11 2009 Subject: [mvapich-discuss] -with-coll configure option not working Message-ID: <1232735065.12534.48.camel@outfield.gsfc.nasa.gov> I attempted to compile mvapich-1.1 on an Intel cluster using the older collective routines (--with-coll=intra_fns). The build (./make.mvapich.gen2) failed with the following ... make overtake /u/dkokron/play/mvapich-1.1/bin/mpicc -D_EM64T_ -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_ -D_SMP_RNDV_ -DXRC -DCH_GEN2 -D_GNU_SOURCE -D__INTEL_COMPILER -I/usr/include -O3 -DHAVE_MPICHCONF_H -DHAVE_STDLIB_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1 -DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1 -DHAVE_SIGACTION=1 -DHAVE_SLEEP=1 -DHAVE_SYSCONF=1 -c overtake.c overtake.c(189): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(223): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(234): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(243): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(255): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(263): (col. 2) remark: LOOP WAS VECTORIZED. overtake.c(173): (col. 5) remark: LOOP WAS VECTORIZED. overtake.c(47): (col. 5) remark: LOOP WAS VECTORIZED. 
/u/dkokron/play/mvapich-1.1/bin/mpicc -D_EM64T_ -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_ -D_SMP_RNDV_ -DXRC -DCH_GEN2 -D_GNU_SOURCE -D__INTEL_COMPILER -I/usr/include -O3 -DHAVE_MPICHCONF_H -DHAVE_STDLIB_H=1 -DHAVE_UNISTD_H=1 -DHAVE_STRING_H=1 -DUSE_STDARG=1 -DHAVE_LONG_DOUBLE=1 -DHAVE_LONG_LONG_INT=1 -DHAVE_PROTOTYPES=1 -DHAVE_SIGNAL_H=1 -DHAVE_SIGACTION=1 -DHAVE_SLEEP=1 -DHAVE_SYSCONF=1 -c test.c /u/dkokron/play/mvapich-1.1/bin/mpicc -o overtake overtake.o test.o /u/dkokron/play/mvapich-1.1/lib/libmpich.a(initutil.o): In function `MPIR_Init': initutil.c:(.text+0xf2): undefined reference to `disable_shmem_barrier' initutil.c:(.text+0x11c): undefined reference to `disable_shmem_reduce' initutil.c:(.text+0x146): undefined reference to `disable_shmem_allreduce' initutil.c:(.text+0x170): undefined reference to `disable_shmem_bcast' initutil.c:(.text+0x290): undefined reference to `shmem_coll_reduce_threshold' initutil.c:(.text+0x2ad): undefined reference to `shmem_coll_allreduce_threshold' initutil.c:(.text+0x2ca): undefined reference to `shmem_coll_bcast_threshold' initutil.c:(.text+0x2d6): undefined reference to `shmem_coll_reduce_threshold' initutil.c:(.text+0x2e2): undefined reference to `shmem_coll_allreduce_threshold' initutil.c:(.text+0x316): undefined reference to `allgather_large_msg_threshold' initutil.c:(.text+0x344): undefined reference to `allgather_small_msg_threshold' initutil.c:(.text+0x36c): undefined reference to `bcast_knomial_degree' /u/dkokron/play/mvapich-1.1/lib/libmpich.a(context_util.o): In function `MPIR_Context_alloc': context_util.c:(.text+0x2f): undefined reference to `disable_shmem_allreduce' context_util.c:(.text+0x3f): undefined reference to `disable_shmem_allreduce' context_util.c:(.text+0x8b): undefined reference to `disable_shmem_allreduce' context_util.c:(.text+0xaf): undefined reference to `disable_shmem_allreduce' context_util.c:(.text+0xbf): undefined reference to `disable_shmem_allreduce' 
/u/dkokron/play/mvapich-1.1/lib/libmpich.a(context_util.o):context_util.c:(.text+0x102): more undefined references to `disable_shmem_allreduce' follow
/u/dkokron/play/mvapich-1.1/lib/libmpich.a(context_util.o): In function `MPIR_Context_alloc':
context_util.c:(.text+0x171): undefined reference to `disable_shmem_bcast'
context_util.c:(.text+0x181): undefined reference to `disable_shmem_bcast'
context_util.c:(.text+0x1bb): undefined reference to `disable_shmem_bcast'
/u/dkokron/play/mvapich-1.1/lib/libmpich.a(comm_rdma_init.o): In function `comm_rdma_init':
comm_rdma_init.c:(.text+0x563): undefined reference to `disable_shmem_barrier'

The config-mine.log file is attached.

--
Dan Kokron
Global Modeling and Assimilation Office
NASA Goddard Space Flight Center
Greenbelt, MD 20771
Daniel.S.Kokron@nasa.gov
Phone: (301) 614-5192
Fax: (301) 614-5304
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-mine.log
Type: text/x-log
Size: 22319 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090123/c759f84c/config-mine-0001.bin

From subramon at cse.ohio-state.edu Fri Jan 23 15:06:46 2009
From: subramon at cse.ohio-state.edu (Hari Subramoni)
Date: Fri Jan 23 15:06:52 2009
Subject: [mvapich-discuss] -with-coll configure option not working
In-Reply-To: <1232735065.12534.48.camel@outfield.gsfc.nasa.gov>
Message-ID:

Hi Dan,

We recommend running an unmodified version of make.mvapich.gen2 and setting the environment variable VIADEV_USE_SHMEM_COLL to 0 at runtime. There are also several options that allow you to selectively disable the newer collectives. Sections 9.7.1 through 9.7.4 of the MVAPICH 1.1 user guide describe the necessary settings. The link is below for your reference.

http://mvapich/support/mvapich_user_guide-1.1.html#x1-1350009.7.1

Thx,
Hari.
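The runtime setting suggested in the reply above might be applied roughly as follows. This is only a sketch: the host names (node1, node2), the process count, and the ./overtake binary are placeholders that do not come from the thread, and it assumes the mpirun_rsh launcher accepts NAME=value pairs before the executable. By default the script only prints the command it would run.

```shell
#!/bin/sh
# Placeholders (not from the thread): host names, process count, binary.
HOSTS="node1 node2"
NP=2
APP="./overtake"

# VIADEV_USE_SHMEM_COLL=0 is the runtime setting named in the reply above.
# Assumption: mpirun_rsh takes NAME=value pairs before the executable.
CMD="mpirun_rsh -np $NP $HOSTS VIADEV_USE_SHMEM_COLL=0 $APP"

# Print the command; launch it only when RUN=1 is set explicitly.
echo "$CMD"
if [ "${RUN:-0}" = "1" ]; then
    $CMD
fi
```

Setting RUN=1 in the environment launches the job instead of only printing the command line.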
From forum.san at gmail.com Sat Jan 31 03:03:06 2009
From: forum.san at gmail.com (Sangamesh B)
Date: Sat Jan 31 03:03:15 2009
Subject: [mvapich-discuss] cpmd job failure
Message-ID:

Hello mvapich2 team,

The CPMD (www.cpmd.org) application is installed with intel compilers on a Rocks4.3 Linux based infiniband supported cluster, mvapich2 version 1.2p1.

The 40-process job runs for some time and then fails with the following output:

LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
57 9.731E-05 7.571E-06 -1890.824133 -8.483E-07 47.38
LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
58 9.831E-05 7.265E-06 -1890.824134 -7.234E-07 47.41
LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
59 9.529E-05 6.389E-06 -1890.824135 -6.945E-07 47.36
rank 17 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 17: killed by signal 9
rank 1 in job 1 node-0-5.local_32810 caused collective abort of all ranks
exit status of rank 1: killed by signal 9

Several identical jobs fail around the same point (but not exactly at the same step).

What could be the solution for this?

Thanks,
Sangamesh

From panda at cse.ohio-state.edu Sat Jan 31 09:38:08 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sat Jan 31 09:38:13 2009
Subject: [mvapich-discuss] cpmd job failure
In-Reply-To:
Message-ID:

Thanks for reporting this. Are you running MVAPICH2 1.2p1 in the `default' mode or with any environment variables set? Can you also provide details of your platform (processor, number of cores/node, amount of memory per core, IB HCA speed, etc.)?

Thanks,

DK
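The platform details requested above could be collected on one compute node with something like the sketch below. It assumes a Linux node with /proc mounted and the ibstat utility from the InfiniBand diagnostic tools installed; the report helper is a made-up name for illustration. Any value that cannot be read is printed as "unavailable", and memory per core has to be worked out from the total and the core count.

```shell
#!/bin/sh
# report LABEL COMMAND...: print "LABEL: value", or "unavailable" when the
# command fails, is missing, or prints nothing. (Hypothetical helper.)
report() {
    label=$1
    shift
    value=$("$@" 2>/dev/null) || value="unavailable"
    printf '%s: %s\n' "$label" "${value:-unavailable}"
}

# Processor model: first "model name" entry in /proc/cpuinfo.
report "processor" sh -c "grep -m1 'model name' /proc/cpuinfo | sed 's/^[^:]*: *//'"
# Logical cores on this node.
report "cores per node" grep -c '^processor' /proc/cpuinfo
# Total memory in kB.
report "total memory (kB)" awk '/^MemTotal/ {print $2}' /proc/meminfo
# Link rate of the first InfiniBand port that ibstat lists.
report "IB HCA rate" sh -c "ibstat | sed -n 's/^[[:space:]]*Rate: *//p' | head -n 1"
```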