From ipl at dhigroup.com Wed Oct 7 07:47:23 2009
From: ipl at dhigroup.com (Iris Pernille Lohmann)
Date: Wed Oct 7 09:37:08 2009
Subject: [mvapich-discuss] problem when configuring in MVAPICH2 1.4
Message-ID: <66D0CDDB47B56E49985BE88D9E9DD450752185749A@mx7serv>

Skipped content of type multipart/related
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure1.log
Type: application/octet-stream
Size: 16366 bytes
Desc: configure1.log
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091007/a4513842/configure1-0001.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: configure2.log
Type: application/octet-stream
Size: 16431 bytes
Desc: configure2.log
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091007/a4513842/configure2-0001.obj

From perkinjo at cse.ohio-state.edu Wed Oct 7 09:54:40 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Wed Oct 7 09:55:18 2009
Subject: [mvapich-discuss] problem when configuring in MVAPICH2 1.4
In-Reply-To: <66D0CDDB47B56E49985BE88D9E9DD450752185749A@mx7serv>
References: <66D0CDDB47B56E49985BE88D9E9DD450752185749A@mx7serv>
Message-ID: <20091007135440.GA2333@cse.ohio-state.edu>

On Wed, Oct 07, 2009 at 01:47:23PM +0200, Iris Pernille Lohmann wrote:
> Dear list members,
>
> I would like to try using MPICH2 with InfiniBand on the high performance computing cluster I am using, so I have downloaded the mvapich2-1.4rc2-3476.gz and unzipped it.
>
> I have problems when running the ./configure.

Hello Iris. My response is inline.

>
> I am using Linux with OpenFabricsIB, so following the instructions of the installation guide, I configure with
> ./configure --with-rdma=gen2 >& configure1.log
>
> The configuration fails because the libibumad cannot be found; I attached the configure1.log file, and here is the error message at the end of it:
> ./configure: line 3635: enable_ftb-cr=no: command not found
> checking for the InfiniBand includes path... default
> checking for the InfiniBand library path... default
> checking for library containing umad_init... no
> configure: error: 'libibumad not found. Did you specify --with-ib-libpath=?'
> configure: error: ./configure failed for channels/mrail
> configure: error: Configure of src/mpid/ch3 failed!
>
>
> Then I configure with
> ./configure --with-rdma=gen2 -with-ib-libpath=/usr/lib64 >& configure2.log
>
> Where /usr/lib64 is the dir containing libibumad.so.1 and libibumad.so.1.0.3

Does libibumad.so exist? If the link does not exist please create it by
something like:

    ln -s /usr/lib64/libibumad.so.1.0.3 /usr/lib64/libibumad.so

I've seen this behavior before where if the ofed installation is
incomplete the required libibumad.so (and others) link was not created.

If this is the case then you should be able to configure by simply using
./configure without specifying any of the extra options.

> This configuration fails, still because libibumad cannot be found, this time with this message (and configure2.log is attached),
> ./configure: line 3635: enable_ftb-cr=no: command not found
> checking for the InfiniBand includes path... default
> checking for the InfiniBand library path... /usr/lib64
> checking for library containing umad_init... no
> configure: error: 'libibumad not found. Did you specify --with-ib-libpath=?'
> configure: error: ./configure failed for channels/mrail
> configure: error: Configure of src/mpid/ch3 failed!
>
>
> I have a feeling this is peace-of-cake for many of you, and I would be really grateful for a reply!
>
> Thanks in advance,
> Iris

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu Thu Oct 8 09:25:08 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Thu Oct 8 09:25:44 2009
Subject: [mvapich-discuss] problem when configuring in MVAPICH2 1.4
In-Reply-To: <66D0CDDB47B56E49985BE88D9E9DD4507521889F69@mx7serv>
References: <66D0CDDB47B56E49985BE88D9E9DD450752185749A@mx7serv>
	<20091007135440.GA2333@cse.ohio-state.edu>
	<66D0CDDB47B56E49985BE88D9E9DD4507521889F69@mx7serv>
Message-ID: <20091008132508.GF2462@cse.ohio-state.edu>

On Thu, Oct 08, 2009 at 03:01:27PM +0200, Iris Pernille Lohmann wrote:
> Thank you very much! It works now just with ./configure - except that
> now the include-files cannot be found, so the ofed installation may in
> fact be incomplete. I am contacting the ones responsible.

I'm assuming that there are no /usr/include/infiniband/*.h files.
Please let us know whether a re-installation solves your problem or not.

>
> Thank you again!

No problem.

>
> Cheers,
> Iris

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From vera_wx_cn at yahoo.com.cn Thu Oct 8 11:44:39 2009
From: vera_wx_cn at yahoo.com.cn (强 马)
Date: Thu Oct 8 11:45:19 2009
Subject: [mvapich-discuss] (no subject)
Message-ID: <856618.82071.qm@web15303.mail.cnb.yahoo.com>

Hi,

I am trying to test mvapich2 on the following platform:
Intel Xeon Processor 5500 nodes (Nehalem)
QDR IB
linux 2.6.18-128
icc 11.1

May I use checkpoint-restart of mvapich2 on this cluster?

From perkinjo at cse.ohio-state.edu Thu Oct 8 11:56:52 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Thu Oct 8 11:57:29 2009
Subject: [mvapich-discuss] (no subject)
In-Reply-To: <856618.82071.qm@web15303.mail.cnb.yahoo.com>
References: <856618.82071.qm@web15303.mail.cnb.yahoo.com>
Message-ID: <20091008155652.GM2462@cse.ohio-state.edu>

On Thu, Oct 08, 2009 at 11:44:39PM +0800, 强 马 wrote:
> Hi,
>
> I am trying to test mvapich2 on the following platform:
> Intel Xeon Processor 5500 nodes (Nehalem)
> QDR IB
> linux 2.6.18-128
> icc 11.1
>
> May I use checkpoint-restart of mvapich2 on this cluster?

Assuming that BLCR functions appropriately, this setup should be fine.
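Whether BLCR "functions appropriately" on a given node can be spot-checked before building. The following is only a minimal sketch under stated assumptions, not something taken from the thread: it assumes the usual BLCR command-line utilities (cr_run, cr_checkpoint, cr_restart) are expected on PATH and that the BLCR kernel modules show up in /proc/modules with names starting with "blcr". The helper name blcr_status is hypothetical.

```python
import shutil
from pathlib import Path

# BLCR user-space utilities that a working installation normally provides.
BLCR_TOOLS = ("cr_run", "cr_checkpoint", "cr_restart")


def blcr_status(proc_modules="/proc/modules"):
    """Report which BLCR tools are on PATH and whether a blcr* module is loaded.

    Hypothetical helper: it only inspects the local node and does not
    run an actual checkpoint/restart cycle.
    """
    tools = {name: shutil.which(name) for name in BLCR_TOOLS}

    loaded = False
    modules_file = Path(proc_modules)
    if modules_file.is_file():
        for line in modules_file.read_text().splitlines():
            # The first whitespace-separated field of each line is the module name.
            fields = line.split()
            if fields and fields[0].startswith("blcr"):
                loaded = True
                break

    return {"tools": tools, "module_loaded": loaded}


if __name__ == "__main__":
    status = blcr_status()
    for name, path in status["tools"].items():
        print(f"{name}: {path or 'not found'}")
    print("blcr kernel module loaded:", status["module_loaded"])
```

A node that reports missing tools or no loaded module would need its BLCR installation fixed first; a passing check is necessary but not sufficient, since it does not exercise an actual checkpoint.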
--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From perkinjo at cse.ohio-state.edu Thu Oct 8 12:39:07 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Thu Oct 8 12:39:50 2009
Subject: [mvapich-discuss] fail to link with mvapich2-1.2p1/mvapich2-1.4rc2
In-Reply-To: <4AC20EAF.9060100@cardiff.ac.uk>
References: <3B7D8CBBF8049C4C9746728929189A920F241A48@CFEVS1-IP.americas.cray.com>
	<4ABB8CC0.3010301@cardiff.ac.uk>
	<20090924165611.GH2346@cse.ohio-state.edu>
	<4AC1DFDE.7080106@cardiff.ac.uk>
	<4AC20EAF.9060100@cardiff.ac.uk>
Message-ID: <20091008163907.GP2462@cse.ohio-state.edu>

Hi Manhui. I just wanted to let you know that we recently applied the
patch that you've sent. Please let us know if the issue is resolved in
our trunk.

On Tue, Sep 29, 2009 at 02:42:07PM +0100, Manhui Wang wrote:
> I made a mistake in producing the previous uploaded patches, here
> attached is the updated one, which seems to work fine with Molpro.
>
> Manhui Wang wrote:
> > Dear MVAPICH2 developers,
> > Following last mail, I tried to see whether it is possible to
> > make some change in Molpro. But it seems to be impossible since Molpro
> > directly calls these functions, which are from yacc or bison libraries.
> > This means these name conflicts with those in yacc or bison libraries. I
> > have made two small patches(see files attached) for mvapich2, with these
> > patches Molpro works fine with Mvapich2. I would be very appreciated if
> > you can include these changes (probably you will consider more changes
> > to avoid conflicting with other programs, but this works for molpro at
> > least) in the mvapich2 development version at your earliest convenience.
> > So I can test the updated mvapich2 before the final release.
> >
> > Thank you very much.
> > Manhui
> >
> > Jonathan Perkins wrote:
> >> On Thu, Sep 24, 2009 at 04:14:08PM +0100, Manhui Wang wrote:
> >>> Dear MVAPICH2 developers,
> >>> I tried to build Molpro (see http://www.molpro.net/) with MVAPICH2
> >>> 1.2p1 (as well as mvapich2-1.4rc2) library, but failed. The reason is
> >>> that both Molpro and Mvapich use some general function names. The
> >>> source code of mvapich library (tokens.c parser.c) contains
> >>> some yacc-like parsing code. These three
> >>> functions(yylex yyparse yyerror) happen to have the same names as those
> >>> in parse files of Molpro. I renamed these three function to those with a
> >>> prefix mvapich_* in MVAPICH source code, and recompiled the mvapich2
> >>> library. Now it works fine. Could you please slightly change these
> >>> names to avoid potential conflict with other application program in next
> >>> release version? Surely, Molpro could also choose other specific
> >>> function names to avoid such problems.
> >> Thank you for the suggestion. We'll take a look at this and have it
> >> resolved before our final release.
> >>
> >>> The following is the error message:
> >>>
> >>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o): In function `yylex':
> >>> parser.c:(.text+0xbfc): multiple definition of `yylex'
> >>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x9ccf): first defined here
> >>> ld: Warning: size of symbol `yylex' changed from 3091 in ../lib/libmolpro.a(licence.o) to 3336 in /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o)
> >>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In function `yyparse':
> >>> tokens.c:(.text+0x0): multiple definition of `yyparse'
> >>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0xdb): first defined here
> >>> ld: Warning: size of symbol `yyparse' changed from 22803 in ../lib/libmolpro.a(licence.o) to 3110 in /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o)
> >>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In function `yyerror':
> >>> tokens.c:(.text+0x1e5c): multiple definition of `yyerror'
> >>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x5edd): first defined here
> >>> ld: Warning: size of symbol `yyerror' changed from 934 in ../lib/libmolpro.a(licence.o) to 26 in /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o)
> >>> failure
> >>>
> >>> Thanks
> >>> Manhui
> >>>
> >>> Yutaka Kubota wrote:
> >>>> Dear MVAPICH2 discussion Mailing list,
> >>>>
> >>>> This is Yutaka Kubota from Cray Japan.
> >>>>
> >>>> I would like to know when will release MVAPICH2 1.4 official version. If
> >>>> this plan was not decided, we will try to use RC2 or RC3 version. We
> >>>> just would like to know this plan is exist or not.
> >>>>
> >>>> Best regards
> >>>>
> >>>> Yutaka Kubota
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> mvapich-discuss mailing list
> >>>> mvapich-discuss@cse.ohio-state.edu
> >>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>> --
> >>> -----------
> >>> Manhui Wang
> >>> School of Chemistry, Cardiff University,
> >>> Main Building, Park Place,
> >>> Cardiff CF10 3AT, UK
> >>> Telephone: +44 (0)29208 76637
>
> --
> -----------
> Manhui Wang
> School of Chemistry, Cardiff University,
> Main Building, Park Place,
> Cardiff CF10 3AT, UK
> Telephone: +44 (0)29208 76637

> diff -crB mvapich2-1.4rc2-orig/src/mpid/ch3/channels/mrail/src/plpa/parser.c mvapich2-1.4rc2/src/mpid/ch3/channels/mrail/src/plpa/parser.c
> *** mvapich2-1.4rc2-orig/src/mpid/ch3/channels/mrail/src/plpa/parser.c	2009-06-01 22:56:18.000000000 +0100
> --- mvapich2-1.4rc2/src/mpid/ch3/channels/mrail/src/plpa/parser.c	2009-09-29 13:20:10.000000000 +0100
> ***************
> *** 551,557 ****
>    * easily add parameters.
>    */
>   #ifndef YY_DECL
> ! #define YY_DECL int yylex YY_PROTO(( void ))
>   #endif
>
>   /* Code executed at the beginning of each rule, after yytext and yyleng
> --- 551,557 ----
>    * easily add parameters.
>    */
>   #ifndef YY_DECL
> ! #define YY_DECL int mvapich_yylex YY_PROTO(( void ))
>   #endif
>
>   /* Code executed at the beginning of each rule, after yytext and yyleng
> ***************
> *** 714,720 ****
>   			/* We're scanning a new file or input source.  It's
>   			 * possible that this happened because the user
>   			 * just pointed yyin at a new source and called
> ! 			 * yylex().  If so, then we have to assure
>   			 * consistency between yy_current_buffer and our
>   			 * globals.  Here is the right place to do so, because
>   			 * this is the first action (other than possibly a
> --- 714,720 ----
>   			/* We're scanning a new file or input source.  It's
>   			 * possible that this happened because the user
>   			 * just pointed yyin at a new source and called
> ! 			 * mvapich_yylex().  If so, then we have to assure
>   			 * consistency between yy_current_buffer and our
>   			 * globals.  Here is the right place to do so, because
>   			 * this is the first action (other than possibly a
> ***************
> *** 827,833 ****
>   			"fatal flex scanner internal error--no action found" );
>   	} /* end of action switch */
>   		} /* end of scanning one token */
> ! 	} /* end of yylex */
>
>
>   /* yy_get_next_buffer - try to read in a new buffer
> --- 827,833 ----
>   			"fatal flex scanner internal error--no action found" );
>   	} /* end of action switch */
>   		} /* end of scanning one token */
> ! 	} /* end of mvapich_yylex */
>
>
>   /* yy_get_next_buffer - try to read in a new buffer
> ***************
> *** 1574,1580 ****
>   #if YY_MAIN
>   int main()
>   	{
> ! 	yylex();
>   	return 0;
>   	}
>   #endif
> --- 1574,1580 ----
>   #if YY_MAIN
>   int main()
>   	{
> ! 	mvapich_yylex();
>   	return 0;
>   	}
>   #endif
> diff -crB mvapich2-1.4rc2-orig/src/mpid/ch3/channels/mrail/src/plpa/tokens.c mvapich2-1.4rc2/src/mpid/ch3/channels/mrail/src/plpa/tokens.c
> *** mvapich2-1.4rc2-orig/src/mpid/ch3/channels/mrail/src/plpa/tokens.c	2009-06-01 22:56:18.000000000 +0100
> --- mvapich2-1.4rc2/src/mpid/ch3/channels/mrail/src/plpa/tokens.c	2009-09-29 13:20:05.000000000 +0100
> ***************
> *** 113,119 ****
>    * Global functions
>    */
>   int token_parse(PLPA_NAME(cpu_set_t) *cpu_set);
> ! void yyerror(char const *s);
>
>   /*
>    * Local functions
> --- 113,119 ----
>    * Global functions
>    */
>   int token_parse(PLPA_NAME(cpu_set_t) *cpu_set);
> ! void mvapich_yyerror(char const *s);
>
>   /*
>    * Local functions
> ***************
> *** 495,501 ****
>   #define YYERROR		goto yyerrorlab
>
>
> ! /* Like YYERROR except do call yyerror.  This remains here temporarily
>      to ease the transition to the new meaning of YYERROR, for GCC.
>      Once GCC version 2 has supplanted version 1, this can go.  */
>
> --- 495,501 ----
>   #define YYERROR		goto yyerrorlab
>
>
> ! /* Like YYERROR except do call mvapich_yyerror.  This remains here temporarily
>      to ease the transition to the new meaning of YYERROR, for GCC.
>      Once GCC version 2 has supplanted version 1, this can go.  */
>
> ***************
> *** 515,521 ****
>       }								\
>     else								\
>       {								\
> !       yyerror ("syntax error: cannot back up");\
>         YYERROR;							\
>       }								\
>   while (0)
> --- 515,521 ----
>       }								\
>     else								\
>       {								\
> !       mvapich_yyerror ("syntax error: cannot back up");\
>         YYERROR;							\
>       }								\
>   while (0)
> ***************
> *** 534,545 ****
>        (Current).last_column = (Rhs)[N].last_column)
>   #endif
>
> ! /* YYLEX -- calling `yylex' with the right arguments.  */
>
>   #ifdef YYLEX_PARAM
> ! # define YYLEX yylex (YYLEX_PARAM)
>   #else
> ! # define YYLEX yylex ()
>   #endif
>
>   /* Enable debugging if requested.  */
> --- 534,545 ----
>        (Current).last_column = (Rhs)[N].last_column)
>   #endif
>
> ! /* YYLEX -- calling `mvapich_yylex' with the right arguments.  */
>
>   #ifdef YYLEX_PARAM
> ! # define YYLEX mvapich_yylex (YYLEX_PARAM)
>   #else
> ! # define YYLEX mvapich_yylex ()
>   #endif
>
>   /* Enable debugging if requested.  */
> ***************
> *** 787,801 ****
>
>   #ifdef YYPARSE_PARAM
>   # if defined (__STDC__) || defined (__cplusplus)
> ! int yyparse (void *YYPARSE_PARAM);
>   # else
> ! int yyparse ();
>   # endif
>   #else /* ! YYPARSE_PARAM */
>   #if defined (__STDC__) || defined (__cplusplus)
> ! int yyparse (void);
>   #else
> ! int yyparse ();
>   #endif
>   #endif /* ! YYPARSE_PARAM */
>
> --- 787,801 ----
>
>   #ifdef YYPARSE_PARAM
>   # if defined (__STDC__) || defined (__cplusplus)
> ! int mvapich_yyparse (void *YYPARSE_PARAM);
>   # else
> ! int mvapich_yyparse ();
>   # endif
>   #else /* ! YYPARSE_PARAM */
>   #if defined (__STDC__) || defined (__cplusplus)
> ! int mvapich_yyparse (void);
>   #else
> ! int mvapich_yyparse ();
>   #endif
>   #endif /* ! YYPARSE_PARAM */
>
> ***************
> *** 813,835 ****
>
>
>   /*----------.
> ! | yyparse.  |
>   `----------*/
>
>   #ifdef YYPARSE_PARAM
>   # if defined (__STDC__) || defined (__cplusplus)
> ! int yyparse (void *YYPARSE_PARAM)
>   # else
> ! int yyparse (YYPARSE_PARAM)
>     void *YYPARSE_PARAM;
>   # endif
>   #else /* ! YYPARSE_PARAM */
>   #if defined (__STDC__) || defined (__cplusplus)
>   int
> ! yyparse (void)
>   #else
>   int
> ! yyparse ()
>
>   #endif
>   #endif
> --- 813,835 ----
>
>
>   /*----------.
> ! | mvapich_yyparse.  |
>   `----------*/
>
>   #ifdef YYPARSE_PARAM
>   # if defined (__STDC__) || defined (__cplusplus)
> ! int mvapich_yyparse (void *YYPARSE_PARAM)
>   # else
> ! int mvapich_yyparse (YYPARSE_PARAM)
>     void *YYPARSE_PARAM;
>   # endif
>   #else /* ! YYPARSE_PARAM */
>   #if defined (__STDC__) || defined (__cplusplus)
>   int
> ! mvapich_yyparse (void)
>   #else
>   int
> ! mvapich_yyparse ()
>
>   #endif
>   #endif
> ***************
> *** 1305,1319 ****
>   		      yyprefix = " or ";
>   		    }
>   		}
> ! 	      yyerror (yymsg);
>   	      YYSTACK_FREE (yymsg);
>   	    }
>   	  else
> ! 	    yyerror ("syntax error; also virtual memory exhausted");
>   	}
>         else
>   #endif /* YYERROR_VERBOSE */
> ! 	yyerror ("syntax error");
>       }
>
>
> --- 1305,1319 ----
>   		      yyprefix = " or ";
>   		    }
>   		}
> ! 	      mvapich_yyerror (yymsg);
>   	      YYSTACK_FREE (yymsg);
>   	    }
>   	  else
> ! 	    mvapich_yyerror ("syntax error; also virtual memory exhausted");
>   	}
>         else
>   #endif /* YYERROR_VERBOSE */
> ! 	mvapich_yyerror ("syntax error");
>       }
>
>
> ***************
> *** 1431,1437 ****
>   | yyoverflowlab -- parser overflow comes here.  |
>   `----------------------------------------------*/
>   yyoverflowlab:
> !   yyerror ("parser stack overflow");
>     yyresult = 2;
>     /* Fall through.  */
>   #endif
> --- 1431,1437 ----
>   | yyoverflowlab -- parser overflow comes here.  |
>   `----------------------------------------------*/
>   yyoverflowlab:
> !   mvapich_yyerror ("parser stack overflow");
>     yyresult = 2;
>     /* Fall through.  */
>   #endif
> ***************
> *** 1454,1467 ****
>
>       PLPA_CPU_ZERO(cpu_set);
>       return_value = cpu_set;
> !     ret = yyparse();
>       if (0 != ret) {
>           return ret;
>       }
>       return 0;
>   }
>
> ! void yyerror (char const *s)
>   {
>       fprintf(stderr, "ERROR: %s\n", s);
>   }
> --- 1454,1467 ----
>
>       PLPA_CPU_ZERO(cpu_set);
>       return_value = cpu_set;
> !     ret = mvapich_yyparse();
>       if (0 != ret) {
>           return ret;
>       }
>       return 0;
>   }
>
> ! void mvapich_yyerror (char const *s)
>   {
>       fprintf(stderr, "ERROR: %s\n", s);
>   }

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091008/d5c16d89/attachment-0001.bin

From wangm9 at cardiff.ac.uk Fri Oct 9 10:43:23 2009
From: wangm9 at cardiff.ac.uk (Manhui Wang)
Date: Fri Oct 9 10:44:02 2009
Subject: [mvapich-discuss] fail to link with mvapich2-1.2p1/mvapich2-1.4rc2
In-Reply-To: <20091008163907.GP2462@cse.ohio-state.edu>
References: <3B7D8CBBF8049C4C9746728929189A920F241A48@CFEVS1-IP.americas.cray.com>
	<4ABB8CC0.3010301@cardiff.ac.uk>
	<20090924165611.GH2346@cse.ohio-state.edu>
	<4AC1DFDE.7080106@cardiff.ac.uk>
	<4AC20EAF.9060100@cardiff.ac.uk>
	<20091008163907.GP2462@cse.ohio-state.edu>
Message-ID: <4ACF4C0B.10001@cardiff.ac.uk>

Hi Jonathan,
Thanks for telling me about this. I will try it.
Just a quick question: do the tarballs (e.g. mvapich2-trunk-2009-10-08.tar.gz) at
http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/
contain all the source code, or do they just contain bugfixes? It seems
no configure file exists after untarring
mvapich2-trunk-2009-10-08.tar.gz. Or am I missing something?

Thank you.
Manhui

Jonathan Perkins wrote:
> Hi Manhui. I just wanted to let you know that we recently applied the
> patch that you've sent. Please let us know if the issue is resolved in
> our trunk.
>
> On Tue, Sep 29, 2009 at 02:42:07PM +0100, Manhui Wang wrote:
>> I made a mistake in producing the previous uploaded patches, here
>> attached is the updated one, which seems to work fine with Molpro.
>>
>> Manhui Wang wrote:
>>> Dear MVAPICH2 developers,
>>> Following last mail, I tried to see whether it is possible to
>>> make some change in Molpro. But it seems to be impossible since Molpro
>>> directly calls these functions, which are from yacc or bison libraries.
>>> This means these name conflicts with those in yacc or bison libraries. I
>>> have made two small patches(see files attached) for mvapich2, with these
>>> patches Molpro works fine with Mvapich2. I would be very appreciated if
>>> you can include these changes (probably you will consider more changes
>>> to avoid conflicting with other programs, but this works for molpro at
>>> least) in the mvapich2 development version at your earliest convenience.
>>> So I can test the updated mvapich2 before the final release.
>>>
>>> Thank you very much.
>>> Manhui
>>>
>>> Jonathan Perkins wrote:
>>>> On Thu, Sep 24, 2009 at 04:14:08PM +0100, Manhui Wang wrote:
>>>>> Dear MVAPICH2 developers,
>>>>> I tried to build Molpro (see http://www.molpro.net/) with MVAPICH2
>>>>> 1.2p1 (as well as mvapich2-1.4rc2) library, but failed. The reason is
>>>>> that both Molpro and Mvapich use some general function names. The
>>>>> source code of mvapich library (tokens.c parser.c) contains
>>>>> some yacc-like parsing code. These three
>>>>> functions(yylex yyparse yyerror) happen to have the same names as those
>>>>> in parse files of Molpro. I renamed these three function to those with a
>>>>> prefix mvapich_* in MVAPICH source code, and recompiled the mvapich2
>>>>> library. Now it works fine. Could you please slightly change these
>>>>> names to avoid potential conflict with other application program in next
>>>>> release version? Surely, Molpro could also choose other specific
>>>>> function names to avoid such problems.
>>>> Thank you for the suggestion. We'll take a look at this and have it
>>>> resolved before our final release.
int mvapich_yyparse (YYPARSE_PARAM) >> void *YYPARSE_PARAM; >> # endif >> #else /* ! YYPARSE_PARAM */ >> #if defined (__STDC__) || defined (__cplusplus) >> int >> ! mvapich_yyparse (void) >> #else >> int >> ! mvapich_yyparse () >> >> #endif >> #endif >> *************** >> *** 1305,1319 **** >> yyprefix = " or "; >> } >> } >> ! yyerror (yymsg); >> YYSTACK_FREE (yymsg); >> } >> else >> ! yyerror ("syntax error; also virtual memory exhausted"); >> } >> else >> #endif /* YYERROR_VERBOSE */ >> ! yyerror ("syntax error"); >> } >> >> >> --- 1305,1319 ---- >> yyprefix = " or "; >> } >> } >> ! mvapich_yyerror (yymsg); >> YYSTACK_FREE (yymsg); >> } >> else >> ! mvapich_yyerror ("syntax error; also virtual memory exhausted"); >> } >> else >> #endif /* YYERROR_VERBOSE */ >> ! mvapich_yyerror ("syntax error"); >> } >> >> >> *************** >> *** 1431,1437 **** >> | yyoverflowlab -- parser overflow comes here. | >> `----------------------------------------------*/ >> yyoverflowlab: >> ! yyerror ("parser stack overflow"); >> yyresult = 2; >> /* Fall through. */ >> #endif >> --- 1431,1437 ---- >> | yyoverflowlab -- parser overflow comes here. | >> `----------------------------------------------*/ >> yyoverflowlab: >> ! mvapich_yyerror ("parser stack overflow"); >> yyresult = 2; >> /* Fall through. */ >> #endif >> *************** >> *** 1454,1467 **** >> >> PLPA_CPU_ZERO(cpu_set); >> return_value = cpu_set; >> ! ret = yyparse(); >> if (0 != ret) { >> return ret; >> } >> return 0; >> } >> >> ! void yyerror (char const *s) >> { >> fprintf(stderr, "ERROR: %s\n", s); >> } >> --- 1454,1467 ---- >> >> PLPA_CPU_ZERO(cpu_set); >> return_value = cpu_set; >> ! ret = mvapich_yyparse(); >> if (0 != ret) { >> return ret; >> } >> return 0; >> } >> >> ! 
void mvapich_yyerror (char const *s) >> { >> fprintf(stderr, "ERROR: %s\n", s); >> } > > -- ----------- Manhui Wang School of Chemistry, Cardiff University, Main Building, Park Place, Cardiff CF10 3AT, UK Telephone: +44 (0)29208 76637 From wangm9 at cardiff.ac.uk Fri Oct 9 11:45:44 2009 From: wangm9 at cardiff.ac.uk (Manhui Wang) Date: Fri Oct 9 11:46:23 2009 Subject: [mvapich-discuss] fail to link with mvapich2-1.2p1/mvapich2-1.4rc2 In-Reply-To: <4ACF4C0B.10001@cardiff.ac.uk> References: <3B7D8CBBF8049C4C9746728929189A920F241A48@CFEVS1-IP.americas.cray.com> <4ABB8CC0.3010301@cardiff.ac.uk> <20090924165611.GH2346@cse.ohio-state.edu> <4AC1DFDE.7080106@cardiff.ac.uk> <4AC20EAF.9060100@cardiff.ac.uk> <20091008163907.GP2462@cse.ohio-state.edu> <4ACF4C0B.10001@cardiff.ac.uk> Message-ID: <4ACF5AA8.1060906@cardiff.ac.uk> Hi Jonathan, I forgot to update the files in the source tree with: $ cd mvapich2 $ maint/updatefiles Thus configure is produced. Now I have built the newest Mvapich2, and it has resolved the previous linking problem I mentioned. Thanks, Manhui Manhui Wang wrote: > Hi Jonathan, > Thanks for telling me about this. I will try it. Just a quick > questions, do the tar balls ( eg. mvapich2-trunk-2009-10-08.tar.gz) at > http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/ > contain all the source code? Or do they just contain bugfixes? > It seems no configure file exists after untarring the > mvapich2-trunk-2009-10-08.tar.gz. Or am I missing something? > > Thank you. > Manhui > > Jonathan Perkins wrote: >> Hi Manhui. I just wanted to let you know that we recently applied the >> patch that you've sent. Please let us know if the issue is resolved in >> our trunk. >> >> On Tue, Sep 29, 2009 at 02:42:07PM +0100, Manhui Wang wrote: >>> I made a mistake in producing the previous uploaded patches, here >>> attached is the updated one, which seems to work fine with Molpro. 
>>>
>>> Manhui Wang wrote:
>>>> Dear MVAPICH2 developers,
>>>> Following last mail, I tried to see whether it is possible to
>>>> make some change in Molpro. But it seems to be impossible since Molpro
>>>> directly calls these functions, which are from yacc or bison libraries.
>>>> This means these name conflicts with those in yacc or bison libraries. I
>>>> have made two small patches(see files attached) for mvapich2, with these
>>>> patches Molpro works fine with Mvapich2. I would be very appreciated if
>>>> you can include these changes (probably you will consider more changes
>>>> to avoid conflicting with other programs, but this works for molpro at
>>>> least) in the mvapich2 development version at your earliest convenience.
>>>> So I can test the updated mvapich2 before the final release.
>>>>
>>>> Thank you very much.
>>>> Manhui
>>>>
>>>> Jonathan Perkins wrote:
>>>>> On Thu, Sep 24, 2009 at 04:14:08PM +0100, Manhui Wang wrote:
>>>>>> Dear MVAPICH2 developers,
>>>>>> I tried to build Molpro (see http://www.molpro.net/) with MVAPICH2
>>>>>> 1.2p1 (as well as mvapich2-1.4rc2) library, but failed. The reason is
>>>>>> that both Molpro and Mvapich use some general function names. The
>>>>>> source code of mvapich library (tokens.c parser.c) contains
>>>>>> some yacc-like parsing code. These three
>>>>>> functions(yylex yyparse yyerror) happen to have the same names as those
>>>>>> in parse files of Molpro. I renamed these three function to those with a
>>>>>> prefix mvapich_* in MVAPICH source code, and recompiled the mvapich2
>>>>>> library. Now it works fine. Could you please slightly change these
>>>>>> names to avoid potential conflict with other application program in next
>>>>>> release version? Surely, Molpro could also choose other specific
>>>>>> function names to avoid such problems.
>>>>> Thank you for the suggestion. We'll take a look at this and have it
>>>>> resolved before our final release.
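The renaming requested in the message above can be sketched in a single C file. This is only an illustration under stated assumptions: `app_last_error`, `lib_last_error` and `lib_parse_fail` are hypothetical stand-ins, not MVAPICH2 or Molpro code, and only the `mvapich_` prefix is taken from the patch discussed in this thread. The sketch shows how a prefix macro placed ahead of the generated parser code turns the library's `yyerror` into a distinct symbol, so it no longer collides with the application's own `yyerror`.

```c
#include <string.h>

/* Stand-in for the application's parser (hypothetical): it keeps the
   plain yacc name, as libmolpro.a does in the link errors in this thread. */
static char app_last_error[64];

void yyerror(char const *s)
{
    strncpy(app_last_error, s, sizeof app_last_error - 1);
    app_last_error[sizeof app_last_error - 1] = '\0';
}

/* Stand-in for the library's generated parser (hypothetical). With the
   macro in place, every use of yyerror up to the #undef is compiled as
   mvapich_yyerror, so the linker sees two different symbols. */
#define yyerror mvapich_yyerror

static char lib_last_error[64];

void yyerror(char const *s)            /* actually defines mvapich_yyerror */
{
    strncpy(lib_last_error, s, sizeof lib_last_error - 1);
    lib_last_error[sizeof lib_last_error - 1] = '\0';
}

int lib_parse_fail(void)               /* pretend the library hit a syntax error */
{
    yyerror("syntax error");           /* expands to mvapich_yyerror(...) */
    return 1;
}

#undef yyerror                         /* plain yyerror is the application's again */
```

Calling `lib_parse_fail()` records its message only in the library's buffer, while a later call to plain `yyerror()` reaches the application's function. For real grammars, bison's `-p`/`--name-prefix` and flex's `-P`/`--prefix` options generate this kind of renaming, which may be easier to maintain than editing generated files by hand.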
>>>>>
>>>>>> The following is the error message:
>>>>>>
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o): In
>>>>>> function `yylex':
>>>>>> parser.c:(.text+0xbfc): multiple definition of `yylex'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x9ccf): first defined here
>>>>>> ld: Warning: size of symbol `yylex' changed from 3091 in
>>>>>> ../lib/libmolpro.a(licence.o) to 3336 in
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o)
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In
>>>>>> function `yyparse':
>>>>>> tokens.c:(.text+0x0): multiple definition of `yyparse'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0xdb): first defined here
>>>>>> ld: Warning: size of symbol `yyparse' changed from 22803 in
>>>>>> ../lib/libmolpro.a(licence.o) to 3110 in
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o)
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In
>>>>>> function `yyerror':
>>>>>> tokens.c:(.text+0x1e5c): multiple definition of `yyerror'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x5edd): first defined here
>>>>>> ld: Warning: size of symbol `yyerror' changed from 934 in
>>>>>> ../lib/libmolpro.a(licence.o) to 26 in
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o)
>>>>>> failure
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o): In
>>>>>> function `yylex':
>>>>>> parser.c:(.text+0xbfc): multiple definition of `yylex'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x9ccf): first defined here
>>>>>> ld: Warning: size of symbol `yylex' changed from 3091 in
>>>>>> ../lib/libmolpro.a(licence.o) to 3336 in
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(parser.o)
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In
>>>>>> function `yyparse':
>>>>>> tokens.c:(.text+0x0): multiple definition of `yyparse'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0xdb): first defined here
>>>>>> ld: Warning: size of symbol `yyparse' changed from 22803 in
>>>>>> ../lib/libmolpro.a(licence.o) to 3110 in
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o)
>>>>>> /home/sacmw4/soft/mvapich2-1.2p1-install/lib/libmpich.a(tokens.o): In
>>>>>> function `yyerror':
>>>>>> tokens.c:(.text+0x1e5c): multiple definition of `yyerror'
>>>>>> ../lib/libmolpro.a(licence.o):parse.c:(.text+0x5edd): first defined here
>>>>>>
>>>>>> Thanks
>>>>>> Manhui
>>>>>>
>>>>>> Yutaka Kubota wrote:
>>>>>>> Dear MVAPICH2 discussion Mailing list,
>>>>>>>
>>>>>>> This is Yutaka Kubota from Cray Japan.
>>>>>>>
>>>>>>> I would like to know when will release MVAPICH2 1.4 official version. If
>>>>>>> this plan was not decided, we will try to use RC2 or RC3 version. We
>>>>>>> just would like to know this plan is exist or not.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> Yutaka Kubota
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> mvapich-discuss mailing list
>>>>>>> mvapich-discuss@cse.ohio-state.edu
>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>>>> --
>>>>>> -----------
>>>>>> Manhui Wang
>>>>>> School of Chemistry, Cardiff University,
>>>>>> Main Building, Park Place,
>>>>>> Cardiff CF10 3AT, UK
>>>>>> Telephone: +44 (0)29208 76637
>>>>>> _______________________________________________
>>>>>> mvapich-discuss mailing list
>>>>>> mvapich-discuss@cse.ohio-state.edu
>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>> ------------------------------------------------------------------------
>>>>
>>>> _______________________________________________
>>>> mvapich-discuss mailing list
>>>> mvapich-discuss@cse.ohio-state.edu
>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>> --
>>> -----------
>>> Manhui Wang
>>> School of Chemistry, Cardiff University,
>>> Main Building, Park Place,
>>> Cardiff CF10
3AT, UK
>>> Telephone: +44 (0)29208 76637
>>
>
--
-----------
Manhui Wang
School of Chemistry, Cardiff University,
Main Building, Park Place,
Cardiff CF10 3AT, UK
Telephone: +44 (0)29208 76637

From perkinjo at cse.ohio-state.edu Fri Oct 9 12:20:28 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Fri Oct 9 12:21:07 2009
Subject: [mvapich-discuss] fail to link with mvapich2-1.2p1/mvapich2-1.4rc2
In-Reply-To: <4ACF5AA8.1060906@cardiff.ac.uk>
References: <3B7D8CBBF8049C4C9746728929189A920F241A48@CFEVS1-IP.americas.cray.com> <4ABB8CC0.3010301@cardiff.ac.uk> <20090924165611.GH2346@cse.ohio-state.edu> <4AC1DFDE.7080106@cardiff.ac.uk> <4AC20EAF.9060100@cardiff.ac.uk> <20091008163907.GP2462@cse.ohio-state.edu> <4ACF4C0B.10001@cardiff.ac.uk> <4ACF5AA8.1060906@cardiff.ac.uk>
Message-ID: <20091009162028.GC2385@cse.ohio-state.edu>

On Fri, Oct 09, 2009 at 04:45:44PM +0100, Manhui Wang wrote:
> Hi Jonathan,
> I forgot to update the files in the source tree with:
> $ cd mvapich2
> $ maint/updatefiles
>
> Thus configure is produced. Now I have built the newest Mvapich2, and it
> has resolved the previous linking problem I mentioned.

I'm glad to hear that this is working now.

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091009/abb29b2a/attachment.bin

From amolins at MIT.EDU Fri Oct 9 19:56:20 2009
From: amolins at MIT.EDU (Antonio Molins)
Date: Fri Oct 9 21:04:26 2009
Subject: [mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash
Message-ID: <716732C1-054B-4DBE-9527-22698089D053@exchange.mit.edu>

Hi everybody,

I am new to the list, and have been programming in MPI for a year or so. I am running into what I think is a bug of either MVAPICH or MKL that is totally driving me nuts :(

The following code has proved good to generate a crash using MKL 10.2 update 2 (sequential version and threaded), last revision of MVAPICH, in two different clusters. Can anybody tell me what the problem is here? It does not crash always, but it does crash when the right number of MPI processes and matrix sizes are selected. Sometimes it wouldn't crash but sloooooow down several orders of magnitude, and note that the actual data being processed is exactly the same.
A

/*
 * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on linux 64-bit
 * both linked with the serial or threaded libraries
 * doing mpirun -np 36 crash 5000 10
 */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include "mpi.h"
#include "mkl_scalapack.h"

extern "C" {
/* BLACS C interface */
void Cblacs_get( int context, int request, int* value);
int Cblacs_gridinit( int* context, char * order, int np_row, int np_col);
void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* my_row, int* my_col);
int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
/* PBLAS */
void pdgemm_( char *TRANSA, char *TRANSB, int * M, int * N, int * K, double * ALPHA,
    double * A, int * IA, int * JA, int * DESCA, double * B, int * IB, int * JB, int * DESCB,
    double * BETA, double * C, int * IC, int * JC, int * DESCC );
}

#define BLOCK_SIZE 65

int main( int argc, char* argv[] )
{
    int iam, nprocs;
    MPI_Init(&argc,&argv); /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &iam);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // get done with the ones that are not part of the grid
    int blacs_pgrid_size = floor(sqrt(nprocs));
    if (iam>=blacs_pgrid_size*blacs_pgrid_size) {
        printf("Bye bye world from process %d of %d. BLACS had no place for me...\n",iam,nprocs);
        MPI_Finalize();
    }

    // start BLACS with square processor grid
    if(iam==0)
        printf("starting BLACS...");
    int ictxt,nprow,npcol,myrow,mycol;
    Cblacs_get( -1, 0, &ictxt );
    Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size );
    Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol );
    if(iam==0)
        printf("done.\n");

    double timing;
    int m,n,k,lm,ln,nbm,nbn,rounds;
    int myzero=0,myone=1;
    sscanf(argv[1],"%d",&m);
    n=m;
    k=m;
    sscanf(argv[2],"%d",&rounds);

    nbm = BLOCK_SIZE;
    nbn = BLOCK_SIZE;
    lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow);
    ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol);

    int info;
    int *ipiv = new int[lm+nbm+10000000]; //adding a "little" bit of extra space just in case
    char ta = 'N',tb = 'T';
    double alpha = 1.0, beta = 0.0;

    double* test1data = new double[lm*ln];
    double* test2data = new double[lm*ln];
    double* test3data = new double[lm*ln];

    for(int i=0;i<lm*ln;i++)
        test1data[i]=(double)(rand()%100)/10000.0;

    int *test1desc = new int[9];
    int *test2desc = new int[9];
    int *test3desc = new int[9];

    test1desc[0] = 1; // descriptor type
    test1desc[1] = ictxt; // blacs context
    test1desc[2] = m; // global number of rows
    test1desc[3] = n; // global number of columns
    test1desc[4] = nbm; // row block size
    test1desc[5] = nbn; // column block size (DEFINED EQUAL THAN ROW BLOCK SIZE)
    test1desc[6] = 0; // initial process row(DEFINED 0)
    test1desc[7] = 0; // initial process column (DEFINED 0)
    test1desc[8] = lm; // leading dimension of local array

    memcpy(test2desc,test1desc,9*sizeof(int));
    memcpy(test3desc,test1desc,9*sizeof(int));

    for(int iter=0;iter<rounds;iter++)
    {
        if(iam==0)
            printf("iter %i - ",iter);
        //test2 = test1
        memcpy(test2data,test1data,lm*ln*sizeof(double));
        //test3 = test1*test2
        timing=MPI_Wtime();
        pdgemm_(&ta,&tb,&m,&n,&k,
            &alpha,
            test1data,&myone,&myone,test1desc,
            test2data,&myone,&myone, test2desc,
            &beta,
            test3data,&myone,&myone, test3desc);
        if(iam==0)
            printf(" PDGEMM = %f |",MPI_Wtime()-timing);
        //test3 = LU(test3)
        timing=MPI_Wtime();
        pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, &info);
        if(iam==0)
            printf(" PDGETRF = %f.\n",MPI_Wtime()-timing);
    }
    delete[] ipiv;
    delete[] test1data, test2data, test3data;
    delete[] test1desc, test2desc, test3desc;

    MPI_Finalize();
    return 0;
}

--------------------------------------------------------------------------------
Antonio Molins, PhD Candidate
Medical Engineering and Medical Physics
Harvard - MIT Division of Health Sciences and Technology
--
"Y así del poco dormir y del mucho leer,
se le secó el cerebro de manera que vino
a perder el juicio".
Miguel de Cervantes
--------------------------------------------------------------------------------

Message-ID:

Thanks for the report. Your note indicates that you are using MVAPICH 1.0
version. This is a very old version (released during February '08). Can
you check whether this happens with the latest MVAPICH 1.1 version? Please
use the branch version obtained from the following URL. This has all the
bug-fixes since 1.1 release.

http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/

In the Changelog
(http://mvapich.cse.ohio-state.edu/download/mvapich/changes.shtml), I
see that a fix related to segmentation fault while using BLACS was done
for 1.0.1 version.

Let us know whether the problem persists with this latest 1.1 branch
version and we will take a look at this in-depth.

Thanks,

DK

From amolins at MIT.EDU Sat Oct 10 18:54:58 2009
From: amolins at MIT.EDU (Antonio Molins)
Date: Sat Oct 10 18:56:03 2009
Subject: [mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash
In-Reply-To:
References:
Message-ID:

Thanks so much, DK!

I was actually using MPICH 1.1, the last version available in the webpage, not the nightly build. I have been trying the one you sent me, and seems to fix the problem. Will tell you later once I try it more thoroughly.

Best,
A

On Oct 9, 2009, at 9:16 PM, Dhabaleswar Panda wrote:

> Thanks for the report. Your note indicates that you are using
> MVAPICH 1.0
> version. This is a very old version (released during February '08).
> Can
> you check whether this happens with the latest MVAPICH 1.1 version.
> Please
> use the branch version obtained from the following URL. This has all
> the
> bug-fixes since 1.1 release.
> > http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/ > > In the Changelog > (http://mvapich.cse.ohio-state.edu/download/mvapich/changes.shtml), I > see that a fix related to segmentation fault while using BLACS was > done > for 1.0.1 version. > > Let us know whether the problem persists with this latest 1.1 branch > version and we will take a look at this in-depth. > > Thanks, > > DK > > On Fri, 9 Oct 2009, Antonio Molins wrote: > >> Hi everybody, >> >> I am new to the list, and have been programming in MPI for a year >> or so. I am running into what I think is a bug of either MVAPICH or >> MKL that is totally driving me nuts :( >> >> The following code has proved good to generate a crash using MKL >> 10.2 update 2 (sequential version and threaded), last revision of >> MVAPICH, in two different clusters. Can anybody tell me what the >> problem is here? It does not crash always, but it does crash when >> the right number of MPI processes and matrix sizes are selected. >> Sometimes it wouldn't crash but sloooooow down several orders of >> magnitude, and note that the actual data being processed is exactly >> the same. 
>> >> A >> >> /* >> * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on >> linux 64-bit >> * both linked with the serial or threaded libraries >> * doing mpirun -np 36 crash 5000 10 >> */ >> >> >> >> #include >> #include >> #include >> #include >> #include "mpi.h" >> #include "mkl_scalapack.h" >> >> >> >> extern "C" { >> /* BLACS C interface */ >> void Cblacs_get( int context, int request, int* value); >> int Cblacs_gridinit( int* context, char * order, int np_row, int >> np_col); >> void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* >> my_row, int* my_col); >> int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int >> *nprocs); >> /* PBLAS */ >> void pdgemm_( char *TRANSA, char *TRANSB, int * M, int * N, int * >> K, double * ALPHA, >> double * A, int * IA, int * JA, int * DESCA, double * B, int * IB, >> int * JB, int * DESCB, >> double * BETA, double * C, int * IC, int * JC, int * DESCC ); >> } >> >> >> >> #define BLOCK_SIZE 65 >> >> >> >> int main( int argc, char* argv[] ) >> { >> int iam, nprocs; >> MPI_Init(&argc,&argv); /* starts MPI */ >> MPI_Comm_rank(MPI_COMM_WORLD, &iam); >> MPI_Comm_size(MPI_COMM_WORLD, &nprocs); >> >> >> // get done with the ones that are not part of the grid >> int blacs_pgrid_size = floor(sqrt(nprocs)); >> if (iam>=blacs_pgrid_size*blacs_pgrid_size) { >> printf("Bye bye world from process %d of %d. 
BLACS had no place for >> me...\n",iam,nprocs); >> MPI_Finalize(); >> } >> >> >> // start BLACS with square processor grid >> if(iam==0) >> printf("starting BLACS..."); >> int ictxt,nprow,npcol,myrow,mycol; >> Cblacs_get( -1, 0, &ictxt ); >> Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size ); >> Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol ); >> if(iam==0) >> printf("done.\n"); >> >> >> double timing; >> int m,n,k,lm,ln,nbm,nbn,rounds; >> int myzero=0,myone=1; >> sscanf(argv[1],"%d",&m); >> n=m; >> k=m; >> sscanf(argv[2],"%d",&rounds); >> >> >> >> nbm = BLOCK_SIZE; >> nbn = BLOCK_SIZE; >> lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow); >> ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol); >> >> >> int info; >> int *ipiv = new int[lm+nbm+10000000]; //adding a "little" bit of >> extra space just in case >> char ta = 'N',tb = 'T'; >> double alpha = 1.0, beta = 0.0; >> >> >> >> double* test1data = new double[lm*ln]; >> double* test2data = new double[lm*ln]; >> double* test3data = new double[lm*ln]; >> >> >> for(int i=0;i> test1data[i]=(double)(rand()%100)/10000.0; >> >> >> int *test1desc = new int[9]; >> int *test2desc = new int[9]; >> int *test3desc = new int[9]; >> >> >> test1desc[0] = 1; // descriptor type >> test1desc[1] = ictxt; // blacs context >> test1desc[2] = m; // global number of rows >> test1desc[3] = n; // global number of columns >> test1desc[4] = nbm; // row block size >> test1desc[5] = nbn; // column block size (DEFINED EQUAL THAN ROW >> BLOCK SIZE) >> test1desc[6] = 0; // initial process row(DEFINED 0) >> test1desc[7] = 0; // initial process column (DEFINED 0) >> test1desc[8] = lm; // leading dimension of local array >> >> >> memcpy(test2desc,test1desc,9*sizeof(int)); >> memcpy(test3desc,test1desc,9*sizeof(int)); >> >> >> >> for(int iter=0;iter> { >> if(iam==0) >> printf("iter %i - ",iter); >> //test2 = test1 >> memcpy(test2data,test1data,lm*ln*sizeof(double)); >> //test3 = test1*test2 >> timing=MPI_Wtime(); >> 
pdgemm_(&ta,&tb,&m,&n,&k, >> &alpha, >> test1data,&myone,&myone,test1desc, >> test2data,&myone,&myone, test2desc, >> &beta, >> test3data,&myone,&myone, test3desc); >> if(iam==0) >> printf(" PDGEMM = %f |",MPI_Wtime()-timing); >> //test3 = LU(test3) >> timing=MPI_Wtime(); >> pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, &info); >> if(iam==0) >> printf(" PDGETRF = %f.\n",MPI_Wtime()-timing); >> } >> delete[] ipiv; >> delete[] test1data, test2data, test3data; >> delete[] test1desc, test2desc, test3desc; >> >> >> MPI_Finalize(); >> return 0; >> } >> >> -------------------------------------------------------------------------------- >> Antonio Molins, PhD Candidate >> Medical Engineering and Medical Physics >> Harvard - MIT Division of Health Sciences and Technology >> -- >> "Y as? del poco dormir y del mucho leer, >> se le sec? el cerebro de manera que vino >> a perder el juicio". >> Miguel de Cervantes >> -------------------------------------------------------------------------------- >> >> >> >> > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -------------------------------------------------------------------------------- Antonio Molins, PhD Candidate Medical Engineering and Medical Physics Harvard - MIT Division of Health Sciences and Technology -- "Y as? del poco dormir y del mucho leer, se le sec? el cerebro de manera que vino a perder el juicio". Miguel de Cervantes -------------------------------------------------------------------------------- From panda at cse.ohio-state.edu Sat Oct 10 20:32:49 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Sat Oct 10 20:33:24 2009 Subject: [mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash In-Reply-To: Message-ID: Good to know that the 1.1 branch version fixes the problem. Let us know the outcome once you have done more extensive experiments. 
Thanks,

DK

From forum.san at gmail.com Mon Oct 12 10:29:01 2009
From: forum.san at gmail.com (Sangamesh B)
Date: Mon Oct 12 10:29:42 2009
Subject: [mvapich-discuss] mvapich2 runtime failure

Hi,

MVAPICH2 (1.2p1 and 1.4rc1) is installed with the Intel 10.1 compilers on a
Rocks 5.1 HPC Linux cluster.

The siesta-2.0.2 (Fortran) application is compiled with MKL library support.

The job fails after running for 20-30 minutes.

$ cat err.362.mvapi2_24h_12
Warning! Rndv Receiver is receiving (36864 < 46080) less than as expected
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1145)...................: MPI_Bcast(buf=0x3c14e90, count=1, dtype=USER, root=0, comm=0xc4000005) failed
MPIR_Bcast(229)...................:
MPIDI_CH3U_Receive_data_found(254): Message from rank 0 and tag 2 truncated; 46080 bytes received but buffer size is 36864
Fatal error in MPI_Bcast:
Message truncated, error stack:
MPI_Bcast(1145)........................: MPI_Bcast(buf=0x866c130, count=1, dtype=USER, root=0, comm=0xc4000006) failed
MPIR_Bcast(229)........................:
MPIDI_CH3U_Post_data_receive_found(439): Message from rank 0 and tag 2 truncated; 46080 bytes received but buffer size is 36864
rm: cannot remove `/tmp/362.1.all.q/rsh': No such file or directory

The siesta output file ends with the following error:

siesta: 27 -8036.3459 -8035.3935 -8035.4038 0.0751 -3.9174
siesta: 28 -8036.3396 -8035.4433 -8035.4554 0.0707 -3.9601
siesta: 29 -8036.3531 -8035.5953 -8035.6096 0.0709 -3.9417
rank 9 in job 1 compute-0-12.local_50891 caused collective abort of all ranks
exit status of rank 9: killed by signal 9

The HCA card is Mellanox:

# ibstat
CA 'mthca0'
        CA type: MT25204
        Number of ports: 1
        Firmware version: 1.2.0
        Hardware version: a0
        Node GUID: 0x0002c9020028de58
        System image GUID: 0x0002c9020028de5b
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 20
                Base lid: 1
                LMC: 0
                SM lid: 1
                Capability mask: 0x02510a6a
                Port GUID: 0x0002c9020028de59

We've used OFED-1.4.

The same job fails even with mvapich2-1.4rc1, at the same point.

What causes this error, and how can it be resolved? Is there a problem with
the IB setup?

The ib pingpong tests work fine for all the nodes, so the OFED drivers
should not be the problem.

Please help us to resolve the error.

Thanks in advance

From panda at cse.ohio-state.edu Mon Oct 12 11:19:25 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon Oct 12 11:20:05 2009
Subject: [mvapich-discuss] mvapich2 runtime failure

Can you try your siesta application with the latest version from the trunk
available from the following URL:

http://mvapich.cse.ohio-state.edu/nightly/mvapich2/trunk/

Several fixes have gone into this version after the RC2 release. If the
problem persists with the latest trunk version, we will take a look at it
in detail.

DK

From forum.san at gmail.com Tue Oct 13 02:14:18 2009
From: forum.san at gmail.com (Sangamesh B)
Date: Tue Oct 13 02:14:57 2009
Subject: [mvapich-discuss] mvapich2 runtime failure

Hi,

The latest links (10/10 and 09/10) did not work. The 08/10 trunk got
downloaded.
The trunk did not have a configure script, and autoconf didn't work either.

I copied the configure script and other required header files from
mvapich2-1.2p1, but that failed with the following error:

/opt/intel/cce/10.1.018/bin/icc -DHAVE_CONFIG_H -I.
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -I../../../include
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/include
  -O3 -xT -DNDEBUG
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/datatype
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/datatype
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/include
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/include
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/src/gen2
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/channels/mrail/src/gen2
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -I/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/common/locks
  -c mpidu_process_locks.c
/root/Packages/softwares_for_newcluster/mvapich2-trunk-2009-10-08/src/mpid/ch3/include/mpidpre.h(34): error: identifier "MPIR_Pint" is undefined
  typedef MPIR_Pint MPIDI_msg_sz_t;
          ^

compilation aborted for mpidu_process_locks.c (code 2)
make[4]: *** [mpidu_process_locks.o] Error 2

Which version shall I use?

Thanks

From perkinjo at cse.ohio-state.edu Tue Oct 13 08:53:54 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Tue Oct 13 08:54:32 2009
Subject: [mvapich-discuss] mvapich2 runtime failure
Message-ID: <20091013125354.GC2644@cse.ohio-state.edu>

On Tue, Oct 13, 2009 at 11:44:18AM +0530, Sangamesh B wrote:
> Hi,
>
> The latest links (10/10 and 09/10) did not work. The 08/10 trunk got
> downloaded.
> The trunk did not have a configure script, and autoconf didn't work either.

These links are working for me. Maybe you tried to access the URLs
directly while our website experienced some downtime yesterday morning.

In order to generate configure and the other required files from our svn
or nightly tarballs, you need to run the ./maint/updatefiles command from
the top level of your download. Let me know if this helps.

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From amolins at MIT.EDU Wed Oct 14 17:54:59 2009
From: amolins at MIT.EDU (Antonio Molins)
Date: Wed Oct 14 17:55:45 2009
Subject: [mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash
Message-ID: <8937BA1A-96C0-4447-BA75-21D72EE73C5F@exchange.mit.edu>

After some more testing, I ran into the same problem again. I am now using
MKL 10.2.2.025 and the version of MVAPICH 1.1 that you sent me (that is,
the mvapich-1.1-2009-10-09 nightly build).
If you use the same code I provided, % mpirun -np 36 ~/shadowfax/EMsmooth_work/bin/crash 10000 1 Exit code -5 signaled from compute-0-71 Killing remote processes...MPI process terminated unexpectedly PMGR_COLLECTIVE ERROR: unexpected value: received 1, expecting 7 @ file pmgr_collective_mpispawn.c:144 MPI process terminated unexpectedly DONE Any idea why this is happening? Best, A On Oct 10, 2009, at 8:32 PM, Dhabaleswar Panda panda@cse.ohio- state.edu wrote: > Good to know that the 1.1 branch version fixes the problem. Let us > know > the outcome once you have done more extensive experiments. > > Thanks, > > DK > > On Sat, 10 Oct 2009, Antonio Molins wrote: > >> Thanks so much, DK! >> >> I was actually using MPICH 1.1, the last version available in the >> webpage, not the nightly build. I have been trying the one you sent >> me, and seems to fix the problem. Will tell you later once I try it >> more thoroughly. >> >> Best, >> A >> >> On Oct 9, 2009, at 9:16 PM, Dhabaleswar Panda wrote: >> >>> Thanks for the report. Your note indicates that you are using >>> MVAPICH 1.0 >>> version. This is a very old version (released during February '08). >>> Can >>> you check whether this happens with the latest MVAPICH 1.1 version. >>> Please >>> use the branch version obtained from the following URL. This has all >>> the >>> bug-fixes since 1.1 release. >>> >>> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/ >>> >>> In the Changelog >>> (http://mvapich.cse.ohio-state.edu/download/mvapich/ >>> changes.shtml), I >>> see that a fix related to segmentation fault while using BLACS was >>> done >>> for 1.0.1 version. >>> >>> Let us know whether the problem persists with this latest 1.1 branch >>> version and we will take a look at this in-depth. >>> >>> Thanks, >>> >>> DK >>> >>> On Fri, 9 Oct 2009, Antonio Molins wrote: >>> >>>> Hi everybody, >>>> >>>> I am new to the list, and have been programming in MPI for a year >>>> or so. 
I am running into what I think is a bug of either MVAPICH or >>>> MKL that is totally driving me nuts :( >>>> >>>> The following code has proved good to generate a crash using MKL >>>> 10.2 update 2 (sequential version and threaded), last revision of >>>> MVAPICH, in two different clusters. Can anybody tell me what the >>>> problem is here? It does not crash always, but it does crash when >>>> the right number of MPI processes and matrix sizes are selected. >>>> Sometimes it wouldn't crash but sloooooow down several orders of >>>> magnitude, and note that the actual data being processed is exactly >>>> the same. >>>> >>>> A >>>> >>>> /* >>>> * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on >>>> linux 64-bit >>>> * both linked with the serial or threaded libraries >>>> * doing mpirun -np 36 crash 5000 10 >>>> */ >>>> >>>> >>>> >>>> #include >>>> #include >>>> #include >>>> #include >>>> #include "mpi.h" >>>> #include "mkl_scalapack.h" >>>> >>>> >>>> >>>> extern "C" { >>>> /* BLACS C interface */ >>>> void Cblacs_get( int context, int request, int* value); >>>> int Cblacs_gridinit( int* context, char * order, int np_row, int >>>> np_col); >>>> void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* >>>> my_row, int* my_col); >>>> int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int >>>> *nprocs); >>>> /* PBLAS */ >>>> void pdgemm_( char *TRANSA, char *TRANSB, int * M, int * N, int * >>>> K, double * ALPHA, >>>> double * A, int * IA, int * JA, int * DESCA, double * B, int * IB, >>>> int * JB, int * DESCB, >>>> double * BETA, double * C, int * IC, int * JC, int * DESCC ); >>>> } >>>> >>>> >>>> >>>> #define BLOCK_SIZE 65 >>>> >>>> >>>> >>>> int main( int argc, char* argv[] ) >>>> { >>>> int iam, nprocs; >>>> MPI_Init(&argc,&argv); /* starts MPI */ >>>> MPI_Comm_rank(MPI_COMM_WORLD, &iam); >>>> MPI_Comm_size(MPI_COMM_WORLD, &nprocs); >>>> >>>> >>>> // get done with the ones that are not part of the grid >>>> int blacs_pgrid_size = 
floor(sqrt(nprocs)); >>>> if (iam>=blacs_pgrid_size*blacs_pgrid_size) { >>>> printf("Bye bye world from process %d of %d. BLACS had no place for >>>> me...\n",iam,nprocs); >>>> MPI_Finalize(); >>>> } >>>> >>>> >>>> // start BLACS with square processor grid >>>> if(iam==0) >>>> printf("starting BLACS..."); >>>> int ictxt,nprow,npcol,myrow,mycol; >>>> Cblacs_get( -1, 0, &ictxt ); >>>> Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size ); >>>> Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol ); >>>> if(iam==0) >>>> printf("done.\n"); >>>> >>>> >>>> double timing; >>>> int m,n,k,lm,ln,nbm,nbn,rounds; >>>> int myzero=0,myone=1; >>>> sscanf(argv[1],"%d",&m); >>>> n=m; >>>> k=m; >>>> sscanf(argv[2],"%d",&rounds); >>>> >>>> >>>> >>>> nbm = BLOCK_SIZE; >>>> nbn = BLOCK_SIZE; >>>> lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow); >>>> ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol); >>>> >>>> >>>> int info; >>>> int *ipiv = new int[lm+nbm+10000000]; //adding a "little" bit of >>>> extra space just in case >>>> char ta = 'N',tb = 'T'; >>>> double alpha = 1.0, beta = 0.0; >>>> >>>> >>>> >>>> double* test1data = new double[lm*ln]; >>>> double* test2data = new double[lm*ln]; >>>> double* test3data = new double[lm*ln]; >>>> >>>> >>>> for(int i=0;i>>> test1data[i]=(double)(rand()%100)/10000.0; >>>> >>>> >>>> int *test1desc = new int[9]; >>>> int *test2desc = new int[9]; >>>> int *test3desc = new int[9]; >>>> >>>> >>>> test1desc[0] = 1; // descriptor type >>>> test1desc[1] = ictxt; // blacs context >>>> test1desc[2] = m; // global number of rows >>>> test1desc[3] = n; // global number of columns >>>> test1desc[4] = nbm; // row block size >>>> test1desc[5] = nbn; // column block size (DEFINED EQUAL THAN ROW >>>> BLOCK SIZE) >>>> test1desc[6] = 0; // initial process row(DEFINED 0) >>>> test1desc[7] = 0; // initial process column (DEFINED 0) >>>> test1desc[8] = lm; // leading dimension of local array >>>> >>>> >>>> memcpy(test2desc,test1desc,9*sizeof(int)); >>>> 
memcpy(test3desc,test1desc,9*sizeof(int)); >>>> >>>> >>>> >>>> for(int iter=0;iter>>> { >>>> if(iam==0) >>>> printf("iter %i - ",iter); >>>> //test2 = test1 >>>> memcpy(test2data,test1data,lm*ln*sizeof(double)); >>>> //test3 = test1*test2 >>>> timing=MPI_Wtime(); >>>> pdgemm_(&ta,&tb,&m,&n,&k, >>>> &alpha, >>>> test1data,&myone,&myone,test1desc, >>>> test2data,&myone,&myone, test2desc, >>>> &beta, >>>> test3data,&myone,&myone, test3desc); >>>> if(iam==0) >>>> printf(" PDGEMM = %f |",MPI_Wtime()-timing); >>>> //test3 = LU(test3) >>>> timing=MPI_Wtime(); >>>> pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, >>>> &info); >>>> if(iam==0) >>>> printf(" PDGETRF = %f.\n",MPI_Wtime()-timing); >>>> } >>>> delete[] ipiv; >>>> delete[] test1data, test2data, test3data; >>>> delete[] test1desc, test2desc, test3desc; >>>> >>>> >>>> MPI_Finalize(); >>>> return 0; >>>> } >>>> >>>> -------------------------------------------------------------------------------- >>>> Antonio Molins, PhD Candidate >>>> Medical Engineering and Medical Physics >>>> Harvard - MIT Division of Health Sciences and Technology >>>> -- >>>> "Y as? del poco dormir y del mucho leer, >>>> se le sec? el cerebro de manera que vino >>>> a perder el juicio". >>>> Miguel de Cervantes >>>> -------------------------------------------------------------------------------- >>>> >>>> >>>> >>>> >>> >>> >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss@cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> -------------------------------------------------------------------------------- >> Antonio Molins, PhD Candidate >> Medical Engineering and Medical Physics >> Harvard - MIT Division of Health Sciences and Technology >> -- >> "Y as? del poco dormir y del mucho leer, >> se le sec? el cerebro de manera que vino >> a perder el juicio". 
>> Miguel de Cervantes >> -------------------------------------------------------------------------------- >> >> >> >> > -------------------------------------------------------------------------------- Antonio Molins, PhD Candidate Medical Engineering and Medical Physics Harvard - MIT Division of Health Sciences and Technology -- "Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise." John W. Tukey, 1962 -------------------------------------------------------------------------------- From sangamesh.banappa at Locuz.com Thu Oct 15 06:15:01 2009 From: sangamesh.banappa at Locuz.com (sangamesh banappa) Date: Thu Oct 15 06:15:47 2009 Subject: [mvapich-discuss] mvapich2 runtime failure Message-ID: <9C1B1AB7D13E5A49B7AB2F1E7F6BBDDE28BC46@lesl-mail.locuzhyd.com> Hi, As mails from my Gmail ID are being bounced by the mvapich firewall settings, I'm continuing with this mail ID. With regard to the message: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2009-October/002535.html I am now able to download it, but because of the older autoconf version, configure could not be created. Our cluster runs CentOS 5.2 with autoconf-2.59. [root@master mvapich2-trunk-2009-10-10]# ./maint/updatefiles You have autoconf version 2.59. Building Fortran 77 interface Building Fortran 90 interface Building C++ interface Extracting the error messages... There are unused error message texts in src/mpi/errhan/errnames.txt See the file unusederr.txt for the complete list checking for perl... 
/usr/bin/perl configure: creating ./config.status config.status: creating simplemake config.status: creating checkbuilds config.status: creating getcoverage config.status: creating genstates config.status: creating clmake config.status: creating f77tof90 config.status: creating extractstrings config.status: creating extractstates config.status: creating extractfixme config.status: creating createcoverage config.status: executing default-1 commands Creating the enumeration of logging states into src/include/mpiallstates.h Inconsistent value for state of MPID_STATE_MPIDI_CH3_ISENDV: Old : MPIDI_CH3_PACKETIZED_SEND in src/mpid/ch3/channels/mrail/src/rdma/.state-cache New : MPIDI_CH3_iSendv in src/mpid/ch3/channels/mrail/src/rdma/.state-cache Inconsistent value for state of MPID_STATE_MPIDI_CH3_ISTARTRNDVMSG: Old : MPIDI_CH3_iSendv in src/mpid/ch3/channels/mrail/src/rdma/.state-cache New : MPIDI_CH3_iStartRndvMsg in src/mpid/ch3/channels/shm/src/.state-cache Inconsistent value for state of MPID_STATE_MPIDI_CH3_PORTFNSINIT: Old : MPIDI_CH3_Init in src/mpid/ch3/channels/nemesis/src/.state-cache New : MPIDI_CH3_PortFnsInit in src/mpid/ch3/channels/mrail/src/rdma/.state-cache Inconsistent value for state of MPID_STATE_MPIDI_CH3_PROGRESS_TEST: Old : MPIDI_CH3I_Progress_test in src/mpid/ch3/channels/mrail/src/rdma/.state-cache New : MPIDI_CH3_Progress_test in src/mpid/ch3/channels/ssm/src/.state-cache Inconsistent value for state of MPID_STATE_MPIDI_CH3_RMAFNSINIT: Old : MPIDI_CH3_Init in src/mpid/ch3/channels/nemesis/src/.state-cache New : MPIDI_CH3_RMAFnsInit in src/mpid/ch3/channels/mrail/src/rdma/.state-cache Inconsistent values for keys: In src/mpid/ch3/channels/nemesis/src/.state-cache, have MPID_STATE_MPIDI_CH3_PORTFNSINIT -> MPIDI_CH3_Init also to MPID_STATE_MPIDI_CH3_PORTFNSINIT -> MPIDI_CH3_PortFnsInit seen in src/mpid/ch3/channels/mrail/src/rdma/.state-cache . 
Using MPID_STATE_MPIDI_CH3_PORTFNSINIT -> MPIDI_CH3_Init Inconsistent values for keys: In src/mpid/ch3/channels/mrail/src/rdma/.state-cache, have MPID_STATE_MPIDI_CH3_PROGRESS_TEST -> MPIDI_CH3I_Progress_test also to MPID_STATE_MPIDI_CH3_PROGRESS_TEST -> MPIDI_CH3_Progress_test seen in src/mpid/ch3/channels/ssm/src/.state-cache . Using MPID_STATE_MPIDI_CH3_PROGRESS_TEST -> MPIDI_CH3I_Progress_test Inconsistent values for keys: In src/mpid/ch3/channels/mrail/src/rdma/.state-cache, have MPID_STATE_MPIDI_CH3_ISENDV -> MPIDI_CH3_iSendv also to MPID_STATE_MPIDI_CH3_ISENDV -> MPIDI_CH3_PACKETIZED_SEND seen in src/mpid/ch3/channels/mrail/src/rdma/.state-cache . Using MPID_STATE_MPIDI_CH3_ISENDV -> MPIDI_CH3_iSendv Inconsistent values for keys: In src/mpid/ch3/channels/shm/src/.state-cache, have MPID_STATE_MPIDI_CH3_ISTARTRNDVMSG -> MPIDI_CH3_iStartRndvMsg also to MPID_STATE_MPIDI_CH3_ISTARTRNDVMSG -> MPIDI_CH3_iSendv seen in src/mpid/ch3/channels/mrail/src/rdma/.state-cache . Using MPID_STATE_MPIDI_CH3_ISTARTRNDVMSG -> MPIDI_CH3_iStartRndvMsg Inconsistent values for keys: In src/mpid/ch3/channels/mrail/src/rdma/.state-cache, have MPID_STATE_MPIDI_CH3_RMAFNSINIT -> MPIDI_CH3_RMAFnsInit also to MPID_STATE_MPIDI_CH3_RMAFNSINIT -> MPIDI_CH3_Init seen in src/mpid/ch3/channels/nemesis/src/.state-cache . Using MPID_STATE_MPIDI_CH3_RMAFNSINIT -> MPIDI_CH3_RMAFnsInit Create or update the Fortran 90 tests derived from the Fortran 77 tests Subroutine LibraryAddObjects redefined at maint/smlib/libadd.smlib line 81. libdir{${MPILIBNAME}} = ROOTDIR/lib Shell variable MPID_THREAD_OUTPUT_FILES will not be added to the list of known autoconf files for src/mpid/ch3. Warning: header file mpidconf.h.in or mpidconf.h.in.in not found in src/mpid/globus/ Shell variable MPID_THREAD_OUTPUT_FILES will not be added to the list of known autoconf files for src/mpid/dcmfd. Shell variable FILE will not be added to the list of known autoconf files for src/mpid/dcmfd. 
Skipping generation of rule for qdemo because Makefile.sm already contains one Sourcefile src/util/thread/mpe_thread.c does not exist. simplemake is assuming that this file will be created by the configure step in the build directory Warning: source file mpe_thread.c or mpe_thread.c.in not found in src/util/thread/ Replacing last config dir with ../../.. Replacing src/pm/mpirun/Makefile.in Replacing src/pm/mpirun/Makefile.in Replacing src/pm/mpirun/Makefile.in Processing ./limic2-0.5.2 with 'autoreconf --install' configure.ac:4: error: Autoconf version 2.63 or higher is required configure.ac:4: the top level autom4te: /usr/bin/m4 failed with exit status: 63 aclocal: autom4te failed with exit status: 63 autoreconf: aclocal failed with exit status: 63 Creating configure in ./src/mpi/romio Found ./configure.in; executing make configure target (cd . && autoheader -I ./confdb && \ autoconf -I ./confdb ) configure.in:1: error: Autoconf version 2.62 or higher is required configure.in:1: the top level autom4te: /usr/bin/m4 failed with exit status: 63 autoheader: /usr/bin/autom4te failed with exit status: 63 make: *** [configure] Error 63 find: ./configure: No such file or directory Error! Could not build configure in . with make configure PANIC: Could not make configure from configure.in In directory . [root@master mvapich2-trunk-2009-10-10]# ls autom4te.cache CHANGELOG_MPICH2 configure.in doc INSTALL limic2-0.5.2 Makefile.in mvapich2-doxygen.in README test CHANGELOG confdb COPYRIGHT examples LICENSE.TXT maint Makefile.sm osu_benchmarks src unusederr.txt You have mail in /var/spool/mail/root [root@master mvapich2-trunk-2009-10-10]# Is it possible to build it with current autoconf version? Thanks Sangamesh -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091015/9ede15fd/attachment.html From perkinjo at cse.ohio-state.edu Thu Oct 15 15:09:06 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Thu Oct 15 15:09:54 2009 Subject: [mvapich-discuss] mvapich2 runtime failure In-Reply-To: <9C1B1AB7D13E5A49B7AB2F1E7F6BBDDE28BC46@lesl-mail.locuzhyd.com> References: <9C1B1AB7D13E5A49B7AB2F1E7F6BBDDE28BC46@lesl-mail.locuzhyd.com> Message-ID: <20091015190906.GA7934@cse.ohio-state.edu> On Thu, Oct 15, 2009 at 03:45:01PM +0530, sangamesh banappa wrote: > Hi, > > As the mails from my gmail id are getting bounced by mvapich firewall settings, I'm continuing with this mail id. > > Wrt the message: > http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2009-October/002535.html > > Now I can able to download it. But due to autoconf lower version it couldn't create configure: > > Our cluster is installed with CentOS5.2 and autoconf-2.59. Try again after installing autoconf-2.63. -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091015/bd0d770d/attachment.bin From perkinjo at cse.ohio-state.edu Thu Oct 15 18:00:21 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Thu Oct 15 18:00:59 2009 Subject: [mvapich-discuss] MVAPICH + MKL + ~100 lines of code = crash In-Reply-To: <8937BA1A-96C0-4447-BA75-21D72EE73C5F@exchange.mit.edu> References: <8937BA1A-96C0-4447-BA75-21D72EE73C5F@exchange.mit.edu> Message-ID: <20091015220021.GC2395@cse.ohio-state.edu> On Wed, Oct 14, 2009 at 05:54:59PM -0400, Antonio Molins wrote: > After some more testing, found out same problem again. 
Now using MKL > 10.2.2.025, and the version of MVAPICH 1.1 that you sent me (that is > mvapich-1.1-2009-10-09 nightly build). If you use the same code I > provided, > > % mpirun -np 36 ~/shadowfax/EMsmooth_work/bin/crash 10000 1 > Exit code -5 signaled from compute-0-71 > Killing remote processes...MPI process terminated unexpectedly > PMGR_COLLECTIVE ERROR: unexpected value: received 1, expecting 7 @ > file pmgr_collective_mpispawn.c:144 > MPI process terminated unexpectedly > DONE > > Any idea why this is happening? Not right now. Is this segfaulting? See if you can gather a backtrace (you can send it to me offline). It may also be worthwhile to recompile with the option --enable-debug added to the configure line of the make.mvapich.gen2 script, and to make sure that debug symbols are available in your application and the other libraries. Hopefully this will give us a good starting point. > > Best, > A > > On Oct 10, 2009, at 8:32 PM, Dhabaleswar Panda panda@cse.ohio- > state.edu wrote: > > > Good to know that the 1.1 branch version fixes the problem. Let us > > know > > the outcome once you have done more extensive experiments. > > > > Thanks, > > > > DK > > > > On Sat, 10 Oct 2009, Antonio Molins wrote: > > > >> Thanks so much, DK! > >> > >> I was actually using MPICH 1.1, the last version available in the > >> webpage, not the nightly build. I have been trying the one you sent > >> me, and seems to fix the problem. Will tell you later once I try it > >> more thoroughly. > >> > >> Best, > >> A > >> > >> On Oct 9, 2009, at 9:16 PM, Dhabaleswar Panda wrote: > >> > >>> Thanks for the report. Your note indicates that you are using > >>> MVAPICH 1.0 > >>> version. This is a very old version (released during February '08). > >>> Can > >>> you check whether this happens with the latest MVAPICH 1.1 version. > >>> Please > >>> use the branch version obtained from the following URL. This has all > >>> the > >>> bug-fixes since 1.1 release. 
> >>> > >>> http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/1.1/ > >>> > >>> In the Changelog > >>> (http://mvapich.cse.ohio-state.edu/download/mvapich/ > >>> changes.shtml), I > >>> see that a fix related to segmentation fault while using BLACS was > >>> done > >>> for 1.0.1 version. > >>> > >>> Let us know whether the problem persists with this latest 1.1 branch > >>> version and we will take a look at this in-depth. > >>> > >>> Thanks, > >>> > >>> DK > >>> > >>> On Fri, 9 Oct 2009, Antonio Molins wrote: > >>> > >>>> Hi everybody, > >>>> > >>>> I am new to the list, and have been programming in MPI for a year > >>>> or so. I am running into what I think is a bug of either MVAPICH or > >>>> MKL that is totally driving me nuts :( > >>>> > >>>> The following code has proved good to generate a crash using MKL > >>>> 10.2 update 2 (sequential version and threaded), last revision of > >>>> MVAPICH, in two different clusters. Can anybody tell me what the > >>>> problem is here? It does not crash always, but it does crash when > >>>> the right number of MPI processes and matrix sizes are selected. > >>>> Sometimes it wouldn't crash but sloooooow down several orders of > >>>> magnitude, and note that the actual data being processed is exactly > >>>> the same. 
> >>>> > >>>> A > >>>> > >>>> /* > >>>> * crash.cpp - crashes with ICC 11.1, MKL 10.2, MVAPICH 1.0 on > >>>> linux 64-bit > >>>> * both linked with the serial or threaded libraries > >>>> * doing mpirun -np 36 crash 5000 10 > >>>> */ > >>>> > >>>> > >>>> > >>>> #include > >>>> #include > >>>> #include > >>>> #include > >>>> #include "mpi.h" > >>>> #include "mkl_scalapack.h" > >>>> > >>>> > >>>> > >>>> extern "C" { > >>>> /* BLACS C interface */ > >>>> void Cblacs_get( int context, int request, int* value); > >>>> int Cblacs_gridinit( int* context, char * order, int np_row, int > >>>> np_col); > >>>> void Cblacs_gridinfo( int context, int* np_row, int* np_col, int* > >>>> my_row, int* my_col); > >>>> int numroc_( int *n, int *nb, int *iproc, int *isrcproc, int > >>>> *nprocs); > >>>> /* PBLAS */ > >>>> void pdgemm_( char *TRANSA, char *TRANSB, int * M, int * N, int * > >>>> K, double * ALPHA, > >>>> double * A, int * IA, int * JA, int * DESCA, double * B, int * IB, > >>>> int * JB, int * DESCB, > >>>> double * BETA, double * C, int * IC, int * JC, int * DESCC ); > >>>> } > >>>> > >>>> > >>>> > >>>> #define BLOCK_SIZE 65 > >>>> > >>>> > >>>> > >>>> int main( int argc, char* argv[] ) > >>>> { > >>>> int iam, nprocs; > >>>> MPI_Init(&argc,&argv); /* starts MPI */ > >>>> MPI_Comm_rank(MPI_COMM_WORLD, &iam); > >>>> MPI_Comm_size(MPI_COMM_WORLD, &nprocs); > >>>> > >>>> > >>>> // get done with the ones that are not part of the grid > >>>> int blacs_pgrid_size = floor(sqrt(nprocs)); > >>>> if (iam>=blacs_pgrid_size*blacs_pgrid_size) { > >>>> printf("Bye bye world from process %d of %d. 
BLACS had no place for > >>>> me...\n",iam,nprocs); > >>>> MPI_Finalize(); > >>>> } > >>>> > >>>> > >>>> // start BLACS with square processor grid > >>>> if(iam==0) > >>>> printf("starting BLACS..."); > >>>> int ictxt,nprow,npcol,myrow,mycol; > >>>> Cblacs_get( -1, 0, &ictxt ); > >>>> Cblacs_gridinit( &ictxt, "C", blacs_pgrid_size, blacs_pgrid_size ); > >>>> Cblacs_gridinfo( ictxt, &nprow, &npcol, &myrow, &mycol ); > >>>> if(iam==0) > >>>> printf("done.\n"); > >>>> > >>>> > >>>> double timing; > >>>> int m,n,k,lm,ln,nbm,nbn,rounds; > >>>> int myzero=0,myone=1; > >>>> sscanf(argv[1],"%d",&m); > >>>> n=m; > >>>> k=m; > >>>> sscanf(argv[2],"%d",&rounds); > >>>> > >>>> > >>>> > >>>> nbm = BLOCK_SIZE; > >>>> nbn = BLOCK_SIZE; > >>>> lm = numroc_(&m, &nbm, &myrow, &myzero, &nprow); > >>>> ln = numroc_(&n, &nbn, &mycol, &myzero, &npcol); > >>>> > >>>> > >>>> int info; > >>>> int *ipiv = new int[lm+nbm+10000000]; //adding a "little" bit of > >>>> extra space just in case > >>>> char ta = 'N',tb = 'T'; > >>>> double alpha = 1.0, beta = 0.0; > >>>> > >>>> > >>>> > >>>> double* test1data = new double[lm*ln]; > >>>> double* test2data = new double[lm*ln]; > >>>> double* test3data = new double[lm*ln]; > >>>> > >>>> > >>>> for(int i=0;i<lm*ln;i++) > >>>> test1data[i]=(double)(rand()%100)/10000.0; > >>>> > >>>> > >>>> int *test1desc = new int[9]; > >>>> int *test2desc = new int[9]; > >>>> int *test3desc = new int[9]; > >>>> > >>>> > >>>> test1desc[0] = 1; // descriptor type > >>>> test1desc[1] = ictxt; // blacs context > >>>> test1desc[2] = m; // global number of rows > >>>> test1desc[3] = n; // global number of columns > >>>> test1desc[4] = nbm; // row block size > >>>> test1desc[5] = nbn; // column block size (DEFINED EQUAL THAN ROW > >>>> BLOCK SIZE) > >>>> test1desc[6] = 0; // initial process row(DEFINED 0) > >>>> test1desc[7] = 0; // initial process column (DEFINED 0) > >>>> test1desc[8] = lm; // leading dimension of local array > >>>> > >>>> > >>>> 
memcpy(test2desc,test1desc,9*sizeof(int)); > >>>> memcpy(test3desc,test1desc,9*sizeof(int)); > >>>> > >>>> > >>>> > >>>> for(int iter=0;iter<rounds;iter++) > >>>> { > >>>> if(iam==0) > >>>> printf("iter %i - ",iter); > >>>> //test2 = test1 > >>>> memcpy(test2data,test1data,lm*ln*sizeof(double)); > >>>> //test3 = test1*test2 > >>>> timing=MPI_Wtime(); > >>>> pdgemm_(&ta,&tb,&m,&n,&k, > >>>> &alpha, > >>>> test1data,&myone,&myone,test1desc, > >>>> test2data,&myone,&myone, test2desc, > >>>> &beta, > >>>> test3data,&myone,&myone, test3desc); > >>>> if(iam==0) > >>>> printf(" PDGEMM = %f |",MPI_Wtime()-timing); > >>>> //test3 = LU(test3) > >>>> timing=MPI_Wtime(); > >>>> pdgetrf_(&m, &n, test3data, &myone, &myone, test3desc, ipiv, > >>>> &info); > >>>> if(iam==0) > >>>> printf(" PDGETRF = %f.\n",MPI_Wtime()-timing); > >>>> } > >>>> delete[] ipiv; > >>>> delete[] test1data, test2data, test3data; > >>>> delete[] test1desc, test2desc, test3desc; > >>>> > >>>> > >>>> MPI_Finalize(); > >>>> return 0; > >>>> } > >>>> > >>>> -------------------------------------------------------------------------------- > >>>> Antonio Molins, PhD Candidate > >>>> Medical Engineering and Medical Physics > >>>> Harvard - MIT Division of Health Sciences and Technology > >>>> -- > >>>> "Y así del poco dormir y del mucho leer, > >>>> se le secó el cerebro de manera que vino > >>>> a perder el juicio". > >>>> Miguel de Cervantes > >>>> -------------------------------------------------------------------------------- > >>>> > >>>> > >>>> > >>>> > >>> > >>> > >>> _______________________________________________ > >>> mvapich-discuss mailing list > >>> mvapich-discuss@cse.ohio-state.edu > >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> > >> -------------------------------------------------------------------------------- > >> Antonio Molins, PhD Candidate > >> Medical Engineering and Medical Physics > >> Harvard - MIT Division of Health Sciences and Technology > >> -- > >> "Y así 
del poco dormir y del mucho leer, > >> se le sec? el cerebro de manera que vino > >> a perder el juicio". > >> Miguel de Cervantes > >> -------------------------------------------------------------------------------- > >> > >> > >> > >> > > > > -------------------------------------------------------------------------------- > Antonio Molins, PhD Candidate > Medical Engineering and Medical Physics > Harvard - MIT Division of Health Sciences and Technology > -- > "Far better an approximate answer to the right question, > which is often vague, > than an exact answer to the wrong question, > which can always be made precise. " > > John W. Tukey, 1962 > -------------------------------------------------------------------------------- > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091015/84b0c7c1/attachment-0001.bin From ipl at dhigroup.com Mon Oct 19 04:36:34 2009 From: ipl at dhigroup.com (Iris Pernille Lohmann) Date: Mon Oct 19 04:39:00 2009 Subject: [mvapich-discuss] crash of runs over InfiniBand Message-ID: <66D0CDDB47B56E49985BE88D9E9DD45075219AEFBD@mx7serv> Skipped content of type multipart/alternative-------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 2148 bytes Desc: image001.gif Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091019/6468f029/image001.gif From panda at cse.ohio-state.edu Mon Oct 19 09:48:48 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Oct 19 09:49:24 2009 Subject: [mvapich-discuss] crash of runs over InfiniBand In-Reply-To: <66D0CDDB47B56E49985BE88D9E9DD45075219AEFBD@mx7serv> Message-ID: Iris - InfiniBand communication relies on pinning and registering communication buffers (the associated memory) before communication can take place. It appears that you are running out of memory that can be pinned when running applications for a longer period of time. You can carry out the second step and let us know whether the problem goes away or not. Thanks, DK On Mon, 19 Oct 2009, Iris Pernille Lohmann wrote: > Dear list members, > > I am using MVAPICH 1.4 on a linux cluster. I have made some computations on 1 and 2 nodes using mpirun_rsh. When I run a relatively small computation, the run on 2 nodes works fine, whereas with a relatively large computation, the run on 2 nodes crashes (I get no error messages). Running on 1 node works fine. > > I am thinking that it may have something to do with memory, and in the User Guide section 9.3.4 there is a description on setting the soft memlock. > > In my limits.conf the soft memlock and hard memlock are already set to 6000000. > > Could the problem be that the second step mentioned in section 9.3.4, namely to add the following to /etc/init.d/sshd: > ulimit -l > > has not been done? What does it actually mean? > > Or can it be something completely different? 
> > > Best regards, > > Iris Lohmann > > > > > > Iris Pernille Lohmann > > MSc, PhD > > Ports & Offshore Technology (POT) > > > > [cid:image001.gif@01CA50A7.0EF6B450] > > > > DHI > > Agern Allé 5 > > DK-2970 Hørsholm > > Denmark > > > > Tel: > > > > +45 4516 9200 > > Direct: > > > > 45169427 > > > > ipl@dhigroup.com > > www.dhigroup.com > > > > WATER * ENVIRONMENT * HEALTH > > > > From kubota at cray.com Tue Oct 20 23:24:23 2009 From: kubota at cray.com (Yutaka Kubota) Date: Tue Oct 20 23:25:00 2009 Subject: [mvapich-discuss] Could you upgrade "mpirun_rsh" command Message-ID: Dear MVAPICH2 discussion Mailing list, This is Yutaka Kubota from Cray Japan. I always appreciate your support. Our colleague Satoshi Isono previously reported the following: when a user changes the Linux group with the newgrp command and then submits a job with the "mpirun_rsh" command, all files written by the job are owned by the group from before the change. Bill at the University of Texas answered that we should insert "/usr/bin/sg `id -gn`" between the MPI options and the name of the executable. We have confirmed that this resolves most submission patterns. However, for some submission patterns, using the "sg" command affects the user program. Could you fix the "mpirun_rsh" command so that files written by a user who has switched groups with newgrp are owned by the new group? Best regards Yutaka Kubota, Cray Japan. From ipl at dhigroup.com Wed Oct 21 02:32:49 2009 From: ipl at dhigroup.com (Iris Pernille Lohmann) Date: Wed Oct 21 02:33:34 2009 Subject: [mvapich-discuss] crash of runs over InfiniBand In-Reply-To: References: <66D0CDDB47B56E49985BE88D9E9DD45075219AEFBD@mx7serv> Message-ID: <66D0CDDB47B56E49985BE88D9E9DD45075219F9D00@mx7serv> Hi again, No, the second step didn't help. I also tried inserting in limits.conf: * soft memlock unlimited * hard memlock unlimited on the two nodes that I am using, but with the same result - a crash. 
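One way to see which locked-memory limit a process really inherits is to query it from inside the process, since the values in limits.conf may not reach processes started through sshd (which is what the second step of section 9.3.4 is about). The following is only a minimal sketch, assuming a Linux/POSIX system; the helper names format_limit, query_memlock and report_memlock are made up for this illustration and are not part of MVAPICH2.

```cpp
#include <sys/resource.h>  // getrlimit, RLIMIT_MEMLOCK, RLIM_INFINITY (POSIX/Linux)
#include <cstdio>
#include <string>

// Hypothetical helper: render a limit given in bytes the way "ulimit -l"
// reports it, i.e. in kB, or "unlimited" for RLIM_INFINITY.
std::string format_limit(rlim_t bytes) {
    if (bytes == RLIM_INFINITY) return "unlimited";
    return std::to_string(static_cast<unsigned long long>(bytes) / 1024ULL) + " kB";
}

// Hypothetical helper: read the soft and hard locked-memory limits of the
// calling process. Returns false if getrlimit() fails.
bool query_memlock(std::string& soft, std::string& hard) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_MEMLOCK, &lim) != 0) return false;
    soft = format_limit(lim.rlim_cur);
    hard = format_limit(lim.rlim_max);
    return true;
}

// Hypothetical helper: print one line with both limits, prefixed by a label
// such as the host name, so output from several nodes can be told apart.
void report_memlock(const char* label) {
    std::string soft, hard;
    if (query_memlock(soft, hard))
        std::printf("%s: memlock soft=%s hard=%s\n", label, soft.c_str(), hard.c_str());
    else
        std::printf("%s: getrlimit(RLIMIT_MEMLOCK) failed\n", label);
}
```

Calling report_memlock() with the host name near the start of a test program, and launching that program exactly like the failing job, would show whether both nodes really see "unlimited". If a remote process still reports a small value, that would point back at the sshd step rather than at limits.conf.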
These are my findings with the test case that I am running: Mpirun_rsh/mpispawn: 1 node, 4 processors: OK 2 nodes, 4 processors (2 on each): crash 1 node, 8 processors: crash 2 nodes, 8 processors (4 on each): crash Launched on one of the computational nodes, example for the 2 nodes 8 processors, by: nohup mpirun_rsh -np 8 -hostfile hosts ./testprogram inputfile & Mpiexec/mpd: 1 node, 4 processors: OK 2 nodes, 4 processors (2 on each): OK 1 node, 8 processors: crash 2 nodes, 8 processors (4 on each): crash Launched on one of the computational nodes by, example for the 2 nodes 8 processors: mpdboot -n 2 -h hosts nohup mpiexec -n 8 ./testprogram inputfile > /dev/null 2>/dev/null & With both mpirun_rsh and mpiexec, the crashes happen after a while (during the initialization of the program: the cores have been distributed and the crash happens while reading the mesh). I should perhaps also mention that I tried the same test program with MPICH2 on 1 node only and 8 cores, with success. Perhaps this gives a clue? The cluster has 12 compute nodes, each with 2x quad core (Intel 5550 Nehalem 2.7 GHz) and 12GB RAM. When I use 8 cores I use all the cores of a node, and I assume that InfiniBand is being used in this case, even when I run on 1 node only... I really hope you have some ideas of what may be the problem, and please let me know if you need more information. Best regards and thanks, Iris -----Original Message----- From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] Sent: 19 October 2009 15:49 To: Iris Pernille Lohmann Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] crash of runs over InfiniBand Iris - InfiniBand communication relies on pinning and registering communication buffers (the associated memory) before communication can take place. It appears that you are running out of memory that can be pinned when running applications for a longer period of time. 
You can carry out the second step and let us know whether the problem goes away or not. Thanks, DK On Mon, 19 Oct 2009, Iris Pernille Lohmann wrote: > Dear list members, > > I am using MVAPICH 1.4 on a linux cluster. I have made some computations on 1 and 2 nodes using mpirun_rsh. When I run a relatively small computation, the run on 2 nodes works fine, whereas with a relatively large computation, the run on 2 nodes crashes (I get no error messages). Running on 1 node works fine. > > I am thinking that it may have something to do with memory, and in the User Guide section 9.3.4 there is a description on setting the soft memlock. > > In my limits.conf the soft memlock and hard memlock are already set to 6000000. > > Could the problem be that the second step mentioned in section 9.3.4, namely to add the following to /etc/init.d/sshd: > ulimit -l > > has not been done? What does it actually mean? > > Or can it be something completely different? > > > Best regards, > > Iris Lohmann > > > > > > Iris Pernille Lohmann > > MSc, PhD > > Ports & Offshore Technology (POT) > > > > [cid:image001.gif@01CA50A7.0EF6B450] > > > > DHI > > Agern All? 5 > > DK-2970 H?rsholm > > Denmark > > > > Tel: > > > > +45 4516 9200 > > Direct: > > > > 45169427 > > > > ipl@dhigroup.com > > www.dhigroup.com > > > > WATER * ENVIRONMENT * HEALTH > > > > From panda at cse.ohio-state.edu Wed Oct 21 07:50:58 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed Oct 21 07:51:34 2009 Subject: [mvapich-discuss] crash of runs over InfiniBand In-Reply-To: <66D0CDDB47B56E49985BE88D9E9DD45075219F9D00@mx7serv> Message-ID: Iris, Looks like there could be multiple issues here. When you run your application using 4/8 cores on a single node, InfiniBand is not used at all. It uses intra-node shared memory communication. When you are running MVAPICH2 1.4, which version are you running and how are you configuring it? Also, what is the memory requirement of your application per core? 
Looks like when you are running your application using all 8 cores, your overall memory requirement might be exceeding the available physical memory. Try to run your application with a smaller data size to see if it passes with all 8 cores. Also note that in addition to the data size, MPI library also requires some amount of memory. When you are running your application on two different nodes, can you make sure that your InfiniBand fabric is working correctly. I am assuming that you are using OpenFabrics stack. There are several IB-level tests (not MPI-level) which come with OpenFabrics. Can you run these tests to make sure that your IB setup is correct. Thanks, DK On Wed, 21 Oct 2009, Iris Pernille Lohmann wrote: > Hi again, > > No, the socond step didn't help. I also tried inserting in limits.conf: > * soft memlock unlimited > * hard memlock unlimited > On the two nodes that I am using, but with the same result - a crash. > > This is my findings with the test case that I am running: > Mpirun_rsh/mpispawn: > 1 node, 4 processors: OK > 2 nodes, 4 processors (2 on each): crash > 1 node, 8 processors: crash > 2 nodes, 8 processors (4 on each): crash > Launched on one of the computational nodes, example for the 2 nodes 8 processors, by: > nohup mpirun_rsh -np 8 -hostfile hosts ./testprogram inputfile & > > Mpiexec/mpd: > 1 node, 4 processors: OK > 2 nodes. 4 processors (2 on each): OK > 1 node, 8 processors: crash > 2 nodes, 8 processors (4 on each): crash > Launched on one of the computational nodes by, example for the 2 nodes 8 processors: > mpdboot -n 2 -h hosts > nohup mpiexec -n 8 ./testprogram inputfile > /dev/null 2>/dev/null & > > Both with mpirun_rsh and mpiexec, the crashes happens after a while (during the initialization of the program: the cores have been distributed and the crash happens during reading the mesh) > > I should perhaps also mention, that I tried the same test-program with MPICH2 on 1 node only and 8 cores, with success. 
Perhaps this gives a clue? > > The cluster is 12 compute nodes each with 2x quad core (Intel 5550 Nehalem 2.7 GHz) and 12GB RAM. When I use 8 cores I use all the cores of a node, and I assume that the Infiniband is being used in this case, even in the case where I run on 1 node only... > > I really hope you have some ideas of what may be the problem, and please let me know if you need more information. > > Best regards and thanks, > Iris > > > > -----Original Message----- > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > Sent: 19 October 2009 15:49 > To: Iris Pernille Lohmann > Cc: mvapich-discuss@cse.ohio-state.edu > Subject: Re: [mvapich-discuss] crash of runs over InfiniBand > > Iris - InfiniBand communication relies on pinning and registering > communication buffers (the associated memory) before communication can > take place. It appears that you are running out of memory that can be > pinned when running applications for a longer period of time. You can > carry out the second step and let us know whether the problem goes away or > not. > > Thanks, > > DK > > On Mon, 19 Oct 2009, Iris Pernille Lohmann wrote: > > > Dear list members, > > > > I am using MVAPICH 1.4 on a linux cluster. I have made some computations on 1 and 2 nodes using mpirun_rsh. When I run a relatively small computation, the run on 2 nodes works fine, whereas with a relatively large computation, the run on 2 nodes crashes (I get no error messages). Running on 1 node works fine. > > > > I am thinking that it may have something to do with memory, and in the User Guide section 9.3.4 there is a description on setting the soft memlock. > > > > In my limits.conf the soft memlock and hard memlock are already set to 6000000. > > > > Could the problem be that the second step mentioned in section 9.3.4, namely to add the following to /etc/init.d/sshd: > > ulimit -l > > > > has not been done? What does it actually mean? > > > > Or can it be something completely different? 
> > > > > > Best regards, > > > > Iris Lohmann > > > > > > > > > > > > Iris Pernille Lohmann > > > > MSc, PhD > > > > Ports & Offshore Technology (POT) > > > > > > > > [cid:image001.gif@01CA50A7.0EF6B450] > > > > > > > > DHI > > > > Agern Allé 5 > > > > DK-2970 Hørsholm > > > > Denmark > > > > > > > > Tel: > > > > > > > > +45 4516 9200 > > > > Direct: > > > > > > > > 45169427 > > > > > > > > ipl@dhigroup.com > > > > www.dhigroup.com > > > > > > > > WATER * ENVIRONMENT * HEALTH > > > > > > > > > > > > > > From yjlim at samboo.co.kr Fri Oct 23 04:53:41 2009 From: yjlim at samboo.co.kr (임용준) Date: Fri Oct 23 04:54:20 2009 Subject: [mvapich-discuss]ibv_recv.c error Message-ID: Hello all, Our environment is as follows: O/S : CentOS HCA : QLE7240 Driver : OFED-1.4 Application : mvapich2 (1.2 version) We use QLogic QLE7240 HCAs, a QLogic 9040 InfiniBand switch and MVAPICH2. When we run an MPI job (MVAPICH2), a server prints the error message below and the MPI job then stops. The error occurs on random servers, so I don't know what causes it. Please help. ---------------------------error message------------------------------ MPI process terminated unexpectedly Exit code -5 signaled from "node name" Killing remote processes...DONE ----------------------------------------------------------------------- Thank you Jun. From panda at cse.ohio-state.edu Fri Oct 23 08:14:42 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri Oct 23 08:15:18 2009 Subject: [mvapich-discuss]ibv_recv.c error In-Reply-To: Message-ID: Looks like you are using the OFED stack (including the ofa-gen2 interface of MVAPICH2) with a QLogic adapter. This may not be stable or efficient. You may check with your vendor regarding this. 
For QLogic adapters, you need to use the PSM interface to get the best performance. This support is available starting with MVAPICH2 1.4. You can try the latest MVAPICH2 1.4 version (RC2 or the trunk version). Please refer to the MVAPICH2 user guide to determine how to configure it for the PSM interface. Let us know if you encounter any issues.

DK

On Fri, 23 Oct 2009, 임용준 wrote:

> Hello all,
>
> Our environment is as follows:
>
> O/S : CentOS
> HCA : QLE7240
> Driver : OFED-1.4
> Application : MVAPICH2 (version 1.2)
>
> We use QLogic QLE7240 HCAs, a QLogic 9040 InfiniBand switch, and MVAPICH2.
>
> A server reports the error message below while running an MPI process
> (MVAPICH2), and the MPI process then stops.
>
> The error occurs on random servers, so I do not know what causes it.
>
> Please help.
>
> ---------------------------error message------------------------------
>
> MPI process terminated unexpectedly
> Exit code -5 signaled from "node name"
> Killing remote processes...DONE
>
> -----------------------------------------------------------------------
>
> Thank you
>
> Jun.
>

From kubota at cray.com Sun Oct 25 20:53:33 2009
From: kubota at cray.com (Yutaka Kubota)
Date: Sun Oct 25 20:54:16 2009
Subject: [mvapich-discuss] RE: Could you upgrade "mpirun_rsh" command
In-Reply-To:
References:
Message-ID:

Dear MVAPICH2 discussion mailing list,

This is Yutaka Kubota from Cray Japan Inc.

We have not yet received a reply to our request. We guess that you do not regard this as an MPI problem and may therefore have given it low priority. However, it is a high priority for us. If you do not have time to fix it, please let me know whom I should ask. If we have to fix it in our own code, we are afraid that we would have to patch the mpirun_rsh command with every MVAPICH2 version upgrade.

Yutaka Kubota, Cray Japan Inc.
-----Original Message-----
From: owner-tokyocsd@cray.com [mailto:owner-tokyocsd@cray.com] On Behalf Of Yutaka Kubota
Sent: Wednesday, October 21, 2009 12:24 PM
To: mvapich-discuss@cse.ohio-state.edu
Subject: Could you upgrade "mpirun_rsh" command

Dear MVAPICH2 discussion mailing list,

This is Yutaka Kubota from Cray Japan.

I always appreciate your support. Our colleague Satoshi Isono reported earlier that when a user changes the Linux group with the newgrp command and then submits a job with the "mpirun_rsh" command, all output files are still owned by the group from before the change. Bill at the University of Texas answered that one can insert "/usr/bin/sg `id -gn`" between the MPI options and the name of the executable. We confirmed that this resolves most submission patterns. However, in some submission patterns the "sg" command affects the user program.

Could you fix the "mpirun_rsh" command so that files written by a newgrp user no longer carry the group from before the change?

Best regards

Yutaka Kubota, Cray Japan.

From perkinjo at cse.ohio-state.edu Mon Oct 26 11:41:30 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Oct 26 11:42:07 2009
Subject: [mvapich-discuss] RE: Could you upgrade "mpirun_rsh" command
In-Reply-To:
References:
Message-ID: <20091026154130.GJ3972@cse.ohio-state.edu>

Sorry for the delay in replying. We spent some time reviewing the options for satisfying this request. At first glance this issue seems better suited to being handled by other mechanisms, but I haven't found any that handle the dynamic setting of a user's effective group id on a remote machine. It also doesn't seem trivial to set up the requested behavior using wrapper script(s) while being friendly enough for the general user.

Because of this, I think that it may be warranted to provide a command line option to mpirun_rsh that tells mpispawn to change its gid to the one provided. I'm not sure if this will make it into our upcoming release.
If not, we can release this as a patch to you in the meantime and make it generally available in the next minor or patch release for those who are interested.

On Sun, Oct 25, 2009 at 07:53:33PM -0500, Yutaka Kubota wrote:
> Dear MVAPICH2 discussion mailing list,
>
> This is Yutaka Kubota from Cray Japan Inc.
>
> We have not yet received a reply to our request. We guess that you do not regard this as an MPI problem and may therefore have given it low priority. However, it is a high priority for us. If you do not have time to fix it, please let me know whom I should ask. If we have to fix it in our own code, we are afraid that we would have to patch the mpirun_rsh command with every MVAPICH2 version upgrade.
>
> Yutaka Kubota, Cray Japan Inc.
>
>
> -----Original Message-----
> From: owner-tokyocsd@cray.com [mailto:owner-tokyocsd@cray.com] On Behalf Of Yutaka Kubota
> Sent: Wednesday, October 21, 2009 12:24 PM
> To: mvapich-discuss@cse.ohio-state.edu
> Subject: Could you upgrade "mpirun_rsh" command
>
> Dear MVAPICH2 discussion mailing list,
>
> This is Yutaka Kubota from Cray Japan.
>
> I always appreciate your support. Our colleague Satoshi Isono reported earlier that when a user changes the Linux group with the newgrp command and then submits a job with the "mpirun_rsh" command, all output files are still owned by the group from before the change. Bill at the University of Texas answered that one can insert "/usr/bin/sg `id -gn`" between the MPI options and the name of the executable. We confirmed that this resolves most submission patterns. However, in some submission patterns the "sg" command affects the user program.
>
> Could you fix the "mpirun_rsh" command so that files written by a newgrp user no longer carry the group from before the change?
>
> Best regards
>
> Yutaka Kubota, Cray Japan.
> > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091026/0b7aeb9b/attachment.bin From anthony.j.mayhall at nasa.gov Wed Oct 28 11:23:12 2009 From: anthony.j.mayhall at nasa.gov (Mayhall, Anthony J. (MSFC-ES53)[TBE-1]) Date: Wed Oct 28 11:30:39 2009 Subject: [mvapich-discuss] Timeouts for MPI_Bcast, MPI_Barrier and MPI_Win_fence Message-ID: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58421@NDMSSCC06.ndc.nasa.gov> Is it possible to specify a timeout value for MPI_Bcast, MPI_Barrier and MPI_Win_fence so that those calls would fail once the timeout value is reached? We are trying to use MPI in a real-time hardware-in-the-loop application and we need to be able to have those calls timeout so the processes will not hang. Thanks, Anthony Mayhall Davidson Technologies, Inc. (256)544-7620 From balaji at mcs.anl.gov Wed Oct 28 12:02:41 2009 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Wed Oct 28 12:03:23 2009 Subject: [mvapich-discuss] Timeouts for MPI_Bcast, MPI_Barrier and MPI_Win_fence In-Reply-To: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58421@NDMSSCC06.ndc.nasa.gov> References: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58421@NDMSSCC06.ndc.nasa.gov> Message-ID: <4AE86B21.1080407@mcs.anl.gov> On 10/28/2009 10:23 AM, Mayhall, Anthony J. (MSFC-ES53)[TBE-1] wrote: > Is it possible to specify a timeout value for MPI_Bcast, MPI_Barrier > and MPI_Win_fence so that those calls would fail once the timeout > value is reached? 
We are trying to use MPI in a real-time > hardware-in-the-loop application and we need to be able to have those > calls timeout so the processes will not hang. The current MPI-2.2 standard doesn't allow this. However, a similar issue was raised by another application scientist here at Argonne. Maybe it is something to be considered for MPI-3. -- Pavan -- Pavan Balaji http://www.mcs.anl.gov/~balaji From anthony.j.mayhall at nasa.gov Wed Oct 28 12:14:53 2009 From: anthony.j.mayhall at nasa.gov (Mayhall, Anthony J. (MSFC-ES53)[TBE-1]) Date: Wed Oct 28 12:15:33 2009 Subject: [mvapich-discuss] Timeouts for MPI_Bcast, MPI_Barrier and MPI_Win_fence In-Reply-To: <4AE86B21.1080407@mcs.anl.gov> References: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58421@NDMSSCC06.ndc.nasa.gov> <4AE86B21.1080407@mcs.anl.gov> Message-ID: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA584CA@NDMSSCC06.ndc.nasa.gov> Thanks for the quick reply. It is something that we will definitely need to have. We would need to be able to specify the timeout in at least 1ms increments. Thanks, Anthony Mayhall Davidson Technologies, Inc. (256)544-7620 -----Original Message----- From: Pavan Balaji [mailto:balaji@mcs.anl.gov] Sent: Wednesday, October 28, 2009 11:03 AM To: Mayhall, Anthony J. (MSFC-ES53)[TBE-1] Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] Timeouts for MPI_Bcast, MPI_Barrier and MPI_Win_fence On 10/28/2009 10:23 AM, Mayhall, Anthony J. (MSFC-ES53)[TBE-1] wrote: > Is it possible to specify a timeout value for MPI_Bcast, MPI_Barrier > and MPI_Win_fence so that those calls would fail once the timeout > value is reached? We are trying to use MPI in a real-time > hardware-in-the-loop application and we need to be able to have those > calls timeout so the processes will not hang. The current MPI-2.2 standard doesn't allow this. However, a similar issue was raised by another application scientist here at Argonne. Maybe it is something to be considered for MPI-3. 
-- Pavan -- Pavan Balaji http://www.mcs.anl.gov/~balaji From anthony.j.mayhall at nasa.gov Thu Oct 29 08:21:33 2009 From: anthony.j.mayhall at nasa.gov (Mayhall, Anthony J. (MSFC-ES53)[TBE-1]) Date: Thu Oct 29 08:22:13 2009 Subject: [mvapich-discuss] Mpirun_rsh Message-ID: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58863@NDMSSCC06.ndc.nasa.gov> Is it possible to run MPI jobs without having to use mpirun_rsh? If I use the -show option and try to use those commands to run jobs, they fail with connection refused errors. I need to be able to run apps with different names. How can I do that? Mpiexec? Will mpiexec work with MVAPICH properly? What is mpirun_rsh doing to setup ports, etc., before running those commands? Is it something that can be done from the command line? Thanks, Anthony Mayhall Davidson Technologies, Inc. (256)544-7620 From perkinjo at cse.ohio-state.edu Thu Oct 29 11:09:47 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Thu Oct 29 11:10:29 2009 Subject: [mvapich-discuss] Mpirun_rsh In-Reply-To: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58863@NDMSSCC06.ndc.nasa.gov> References: <76E49B2FF96CBE4CB1DB0C8D3E7D90173DAFA58863@NDMSSCC06.ndc.nasa.gov> Message-ID: <20091029150947.GB2369@cse.ohio-state.edu> On Thu, Oct 29, 2009 at 07:21:33AM -0500, Mayhall, Anthony J. (MSFC-ES53)[TBE-1] wrote: > Is it possible to run MPI jobs without having to use mpirun_rsh? If I > use the -show option and try to use those commands to run jobs, they > fail with connection refused errors. I need to be able to run apps > with different names. How can I do that? Mpiexec? Will mpiexec work > with MVAPICH properly? What is mpirun_rsh doing to setup ports, etc., > before running those commands? Is it something that can be done from > the command line? The use of mpirun_rsh is recommended for faster and scalable job start-up. One can use the MPD/mpiexec functionality. However, this will not be scalable. Most of MVAPICH/MVAPICH2 users currently use the mpirun_rsh framework. 
The MPD/mpiexec feature can be used with MVAPICH2. However, it may have performance bottlenecks. We plan to add additional features to mpirun_rsh in future releases.

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20091029/5e9b0c1a/attachment.bin

From panda at cse.ohio-state.edu Fri Oct 30 00:43:27 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Oct 30 00:44:03 2009
Subject: [mvapich-discuss] Announcing the release of MVAPICH2 1.4
Message-ID:

The MVAPICH team is pleased to announce the release of MVAPICH2 1.4 with the following NEW features:

- MPI 2.1 standard compliant
- Based on MPICH2 1.0.8p1
- Dynamic Process Management (DPM) Support with mpirun_rsh and MPD
    - Available for OpenFabrics (IB) interface
- Support for eXtended Reliable Connection (XRC)
    - Available for OpenFabrics (IB) interface
- Native support for QLogic InfiniPath
    - Provides support over PSM interface
- Support for RDMAoE (RDMA over Ethernet) with ConnectX-EN adapter
- Kernel-level single-copy intra-node communication support based on LiMIC2
    - Delivers superior intra-node performance for medium and large messages
    - Available for all interfaces (IB, iWARP, uDAPL and RDMAoE)
- Enhancement to mpirun_rsh framework for faster job startup on large clusters
    - Hierarchical ssh to nodes to speed up job startup
    - Available for all interfaces (IB, iWARP, uDAPL and RDMAoE)
- Scalable checkpoint-restart with mpirun_rsh framework
- Checkpoint-restart with intra-node shared memory (kernel-level with LiMIC2) support
- Checkpoint-restart with Fault-Tolerant Backplane (FTB-CR) Support
    - Available for OpenFabrics (IB) Interface
- K-nomial tree-based solution together with shared memory-based broadcast for scalable MPI_Bcast operation
    - Available for all interfaces (IB, iWARP, uDAPL and RDMAoE)
- Multiple CQ-based design for Chelsio 10GigE/iWARP

This release also contains multiple bug fixes since MVAPICH2-1.2p1, MVAPICH2 1.4RC1 and MVAPICH2 1.4RC2. A summary of the major fixes is available from the MVAPICH2 download page (under the Changelog link).

MVAPICH2 1.4 is being made available with OFED 1.5. It continues to deliver excellent performance. Sample performance numbers include:

OpenFabrics/Gen2 on Nehalem quad-core (2.4 GHz) with PCIe-Gen2 and ConnectX-QDR (Two-sided Operations):
    - 1.61 microsec one-way latency (4 bytes)
    - 3026 MB/sec unidirectional bandwidth
    - 5854 MB/sec bidirectional bandwidth

QLogic InfiniPath Support on Nehalem quad-core (2.4 GHz) with PCIe-Gen2 and QLogic-DDR (Two-sided Operations):
    - 2.15 microsec one-way latency (4 bytes)
    - 1603 MB/sec unidirectional bandwidth
    - 2108 MB/sec bidirectional bandwidth

OpenFabrics/Gen2-RDMAoE (RDMA over Ethernet) Support on Clovertown quad-core (2.4 GHz) with ConnectX-EN (Two-sided operations):
    - 2.99 microsec one-way latency (4 bytes)
    - 1142 MB/sec unidirectional bandwidth
    - 2282 MB/sec bidirectional bandwidth

Performance numbers for several other platforms, system configurations and operations (including intra-node shared memory and collectives) can be viewed by visiting the `Performance' section of the project's web page.

For downloading MVAPICH2 1.4, the associated user guide and accessing the SVN, please visit the following URL:

http://mvapich.cse.ohio-state.edu

All feedback, including bug reports, hints for performance tuning, patches and enhancements, is welcome. Please post it to the mvapich-discuss mailing list.

We are also happy to announce that the number of organizations using MVAPICH/MVAPICH2 and registered at the MVAPICH site has crossed 1,000 world-wide (in 52 countries). The MVAPICH team extends thanks to all these organizations.
Thanks,

The MVAPICH Team

From kallies at zib.de Fri Oct 30 13:02:42 2009
From: kallies at zib.de (Bernd Kallies)
Date: Fri Oct 30 13:03:29 2009
Subject: [mvapich-discuss] mvapich2-1.4.0 question about CPU affinity
Message-ID: <1256922162.3549.249.camel@kallies.zib.de>

Dear list members,

I ran mvapich2 v1.4.0 on clusters equipped with Intel Xeon E5472 (Harpertown/Penryn, 4 cores per socket, 2 sockets per node) and Intel Xeon X5570 (Gainestown/Nehalem, 4 cores per socket, 2 sockets per node).

I analyzed the default CPU affinity maps applied by mvapich2 1.4.0 (MV2_CPU_MAPPING is unset, MV2_ENABLE_AFFINITY is 1). For code see
https://www.hlrn.de/home/view/System/PlaceMe

It seems to me that the following maps are applied:
1) Harpertown: 0:2:4:6:1:3:5:7
2) Gainestown: 0:1:2:3:4:5:6:7

The map found for Harpertown is different from previous mvapich2 releases, but is still far from "Optimal runtime CPU binding" as written in the Changelog.

The lstopo tool of the hwloc package
http://www.open-mpi.org/projects/hwloc/
gives for a node with Harpertown CPUs:

System(15GB)
  Socket#0
    L2(6144KB)
      L1(32KB) + Core#0 + P#0
      L1(32KB) + Core#1 + P#2
    L2(6144KB)
      L1(32KB) + Core#2 + P#4
      L1(32KB) + Core#3 + P#6
  Socket#1
    L2(6144KB)
      L1(32KB) + Core#0 + P#1
      L1(32KB) + Core#1 + P#3
    L2(6144KB)
      L1(32KB) + Core#2 + P#5
      L1(32KB) + Core#3 + P#7

So, on this architecture the "optimal" affinity map for a pure MPI application is 0:1:4:5:2:3:6:7, because one has to try to minimize usage of shared L2 caches as much as possible (run 4 tasks on 0:1:4:5, not on 0:2:4:6 as mvapich2 does).

On the other hand, the map applied on Gainestown is correct (minimizes usage of shared L3 cache and NUMA node memory).
The topology map is here:

System(47GB)
  Node#0(23GB) + Socket#0 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#0
      P#8
    L2(256KB) + L1(32KB) + Core#1
      P#2
      P#10
    L2(256KB) + L1(32KB) + Core#2
      P#4
      P#12
    L2(256KB) + L1(32KB) + Core#3
      P#6
      P#14
  Node#1(23GB) + Socket#1 + L3(8192KB)
    L2(256KB) + L1(32KB) + Core#0
      P#1
      P#9
    L2(256KB) + L1(32KB) + Core#1
      P#3
      P#11
    L2(256KB) + L1(32KB) + Core#2
      P#5
      P#13
    L2(256KB) + L1(32KB) + Core#3
      P#7
      P#15

I'm wondering whether I made a mistake or misunderstood something, whether there is a bug in the mvapich2 logic that analyzes the CPU topology, or whether this logic might be improved in future mvapich2 releases to reduce the need to know a value of MV2_CPU_MAPPING that is more suitable than the default on a particular architecture.

Sincerely, BK

--
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies@zib.de

From panda at cse.ohio-state.edu Fri Oct 30 14:49:13 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Oct 30 14:49:49 2009
Subject: [mvapich-discuss] mvapich2-1.4.0 question about CPU affinity
In-Reply-To: <1256922162.3549.249.camel@kallies.zib.de>
Message-ID:

Dr. Kallies,

Thanks for your note. We are analyzing the situations you have described and we will get back to you soon.

Our long-term objective is to provide the best intelligence within the MVAPICH2 library to come up with the most efficient CPU binding for multi-core platforms. As you know, multi-core platforms come with all different configurations, cache sizes/speeds and memory sizes/speeds. Similarly, parallel applications have diverse computation and communication requirements. Thus, if some specific user-defined CPU mapping is good for a particular application and platform, it can always be applied through the user-defined CPU mapping option (MV2_CPU_MAPPING) of the MVAPICH2 library.
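
For the Harpertown case discussed in this thread, the preferred map can be derived mechanically from the cache layout that lstopo reports. The following is only a minimal Python sketch and is not part of MVAPICH2: the function name `spread_mapping` and the hand-typed lists of processor numbers are assumptions made for illustration. It sorts the cache groups by their lowest processor number and then deals processors out round-robin, so that the first ranks land on different shared caches.

```python
from itertools import zip_longest


def spread_mapping(cache_groups):
    """Build a colon-separated CPU map that spreads ranks across caches.

    cache_groups is a list of lists; each inner list holds the processor
    numbers (the P# values from lstopo) that share one cache.
    This helper is hypothetical and only illustrates the ordering idea.
    """
    # Sort processors inside each group, then order groups by lowest P#.
    groups = sorted((sorted(g) for g in cache_groups), key=lambda g: g[0])
    order = []
    # Take one processor from every group per round (round-robin).
    for one_round in zip_longest(*groups):
        order.extend(pu for pu in one_round if pu is not None)
    return ":".join(str(pu) for pu in order)


# Harpertown node from the lstopo output: four L2 caches, two cores each.
harpertown_l2 = [[0, 2], [4, 6], [1, 3], [5, 7]]
print(spread_mapping(harpertown_l2))   # 0:1:4:5:2:3:6:7

# Gainestown node, first hardware thread of each core, grouped by L3/socket.
gainestown_l3 = [[0, 2, 4, 6], [1, 3, 5, 7]]
print(spread_mapping(gainestown_l3))   # 0:1:2:3:4:5:6:7
```

The first result matches the map Dr. Kallies proposes for Harpertown, and the second matches the default he observed on Gainestown. Such a string could then be supplied as the value of MV2_CPU_MAPPING; how it is passed at launch time depends on the launcher, so the user guide should be consulted for the exact syntax.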
Best Regards,

DK

On Fri, 30 Oct 2009, Bernd Kallies wrote:

> Dear list members,
>
> I ran mvapich2 v1.4.0 on clusters equipped with Intel Xeon E5472
> (Harpertown/Penryn, 4 cores per socket, 2 sockets per node) and Intel
> Xeon X5570 (Gainestown/Nehalem, 4 cores per socket, 2 sockets per node).
>
> I analyzed the default CPU affinity maps applied by mvapich2 1.4.0
> (MV2_CPU_MAPPING is unset, MV2_ENABLE_AFFINITY is 1). For code see
> https://www.hlrn.de/home/view/System/PlaceMe
>
> It seems to me that the following maps are applied:
> 1) Harpertown: 0:2:4:6:1:3:5:7
> 2) Gainestown: 0:1:2:3:4:5:6:7
>
> The map found for Harpertown is different from previous mvapich2
> releases, but is still far from "Optimal runtime CPU binding" as
> written in the Changelog.
>
> The lstopo tool of the hwloc package
> http://www.open-mpi.org/projects/hwloc/
> gives for a node with Harpertown CPUs:
>
> System(15GB)
>   Socket#0
>     L2(6144KB)
>       L1(32KB) + Core#0 + P#0
>       L1(32KB) + Core#1 + P#2
>     L2(6144KB)
>       L1(32KB) + Core#2 + P#4
>       L1(32KB) + Core#3 + P#6
>   Socket#1
>     L2(6144KB)
>       L1(32KB) + Core#0 + P#1
>       L1(32KB) + Core#1 + P#3
>     L2(6144KB)
>       L1(32KB) + Core#2 + P#5
>       L1(32KB) + Core#3 + P#7
>
> So, on this architecture the "optimal" affinity map for a pure MPI
> application is 0:1:4:5:2:3:6:7, because one has to try to minimize usage
> of shared L2 caches as much as possible (run 4 tasks on 0:1:4:5, not on
> 0:2:4:6 as mvapich2 does).
>
> On the other hand, the map applied on Gainestown is correct (minimizes
> usage of shared L3 cache and NUMA node memory). The topology map is here:
>
> System(47GB)
>   Node#0(23GB) + Socket#0 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#0
>       P#8
>     L2(256KB) + L1(32KB) + Core#1
>       P#2
>       P#10
>     L2(256KB) + L1(32KB) + Core#2
>       P#4
>       P#12
>     L2(256KB) + L1(32KB) + Core#3
>       P#6
>       P#14
>   Node#1(23GB) + Socket#1 + L3(8192KB)
>     L2(256KB) + L1(32KB) + Core#0
>       P#1
>       P#9
>     L2(256KB) + L1(32KB) + Core#1
>       P#3
>       P#11
>     L2(256KB) + L1(32KB) + Core#2
>       P#5
>       P#13
>     L2(256KB) + L1(32KB) + Core#3
>       P#7
>       P#15
>
> I'm wondering whether I made a mistake or misunderstood something,
> whether there is a bug in the mvapich2 logic that analyzes the CPU
> topology, or whether this logic might be improved in future mvapich2
> releases to reduce the need to know a value of MV2_CPU_MAPPING that is
> more suitable than the default on a particular architecture.
>
> Sincerely, BK
>
> --
> Dr. Bernd Kallies
> Konrad-Zuse-Zentrum für Informationstechnik Berlin
> Takustr. 7
> 14195 Berlin
> Tel: +49-30-84185-270
> Fax: +49-30-84185-311
> e-mail: kallies@zib.de
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>