From benjamin.fersch at imk.fzk.de Mon Apr 6 08:31:37 2009
From: benjamin.fersch at imk.fzk.de (Benjamin Fersch)
Date: Mon Apr 6 08:31:54 2009
Subject: [mvapich-discuss] Assertion problem
Message-ID: <49D9F629.7090707@imk.fzk.de>

Dear List Members,

I'm running the WRF-ARW V3.0.1.1 weather model on our InfiniBand HPC cluster. The cluster is installed with OpenFabrics.

I usually submit my jobs on 10 to 12 nodes, and every node has 4 Opteron CPUs.

The program was compiled with PORTLAND pgi64-7.2-5 and mvapich2.

My problem is that the model does not run to completion. After about 6 hours of computation time, the following error shows up and the program stops:

wrf.exe: ch3u_rndv.c:333: MPIDI_CH3_PktHandler_RndvClrToSend: Assertion `sreq->mrail.rndv_buf_off == 0' failed

This error came up only once:

wrf.exe: ch3_rndvtransfer.c:226: MPIDI_CH3_Rndv_transfer: Assertion `rndv->protocol == VAPI_PROTOCOL_R3' failed

It seems that the crash occurs sooner as the number of nodes increases. So, could this be a memory problem, like an overflow? Running the program on a single CPU with mpirun doesn't result in a crash.

I'm not very experienced in MPI details.

Can anybody help with this problem?

Thank you!

Benjamin

output of ofed_info:

OFED-1.3.1
libibverbs:
git://git.openfabrics.org/ofed_1_3/libibverbs.git ofed_1_3
commit 40b771aa6a9c0ad092b2e20775b4723d3b173792
libmthca:
git://git.openfabrics.org/ofed_1_3/libmthca.git ofed_1_3
commit 9501e698d257949acfab2edc90812602966dbcc9
libmlx4:
git://git.openfabrics.org/ofed_1_3/libmlx4.git ofed_1_3
commit 3869d6dab7e12fe452270ca641f7dd7082b42482
libehca:
git://git.openfabrics.org/ofed_1_3/libehca.git ofed_1_3
commit fd898180cfa3b737f893f432a80b91bac3396325
libipathverbs:
git://git.openfabrics.org/ofed_1_3/libipathverbs.git ofed_1_3
commit 82be4d81859d1fd2edf830220fe65a9923b80a46
libcxgb3:
git://git.openfabrics.org/ofed_1_3/libcxgb3.git ofed_1_3
commit 6f7485feb244d8571fcab2292ef92c97bea48df0
libnes:
git://git.openfabrics.org/ofed_1_3/libnes.git ofed_1_3
commit 471fa2e5a7bb2f8946119396358c31adcc6c2fb3
libibcm:
git://git.openfabrics.org/ofed_1_3/libibcm.git ofed_1_3
commit 53ec35f544bbc1838bbadc2210909c25a954a5e2
librdmacm:
git://git.openfabrics.org/ofed_1_3/librdmacm.git ofed_1_3
commit a0ef80a1e0d5debdae48a844fbc8d09aec5b24b1
dapl1:
git://git.openfabrics.org/ofed_1_3/dapl1.git ofed_1_3
commit 7a9b58d6c50fc0a357de540ec3eb2ab2e07f8779
dapl2:
git://git.openfabrics.org/ofed_1_3/dapl2.git ofed_1_3
commit 2583f07d9d0f55eee14e0b0e6074bc6fd0712177
libsdp:
git://git.openfabrics.org/ofed_1_3/libsdp.git ofed_1_3
commit c8102dccc502930442b23de658674d386456b350
sdpnetstat:
git://git.openfabrics.org/ofed_1_3/sdpnetstat.git ofed_1_3
commit 3341620a7259c4f7bdd4180864b98e260c3dc223
srptools:
git://git.openfabrics.org/ofed_1_3/srptools.git ofed_1_3
commit e0ce2d42eeb25f8e89b8f6daaa32a630c9b64f0d
perftest:
git://git.openfabrics.org/ofed_1_3/perftest.git ofed_1_3
commit 6321b5468f7293088cc003809049c02b176130d8
qlvnictools:
git://git.openfabrics.org/ofed_1_3/qlvnictools.git ofed_1_3
commit 086f9cb80ee790d61bddaf201ecbae32a2ff21dd
tvflash:
git://git.openfabrics.org/ofed_1_3/tvflash.git ofed_1_3
commit f5e7407a7f2058448df5e5320d9843f944427429
mstflint:
git://git.openfabrics.org/ofed_1_3/mstflint.git ofed_1_3
commit 78bbd3d521a9078553a991111ffb6f76665b9ee9
qperf:
git://git.openfabrics.org/ofed_1_3/qperf.git ofed_1_3
commit 6221aabd038df0b7033e035378ca190641ed2295
management:
git://git.openfabrics.org/ofed_1_3/management.git ofed_1_3
commit d9c852406dae14e8284f9cfb1c7f495bbb55fddf
ibutils:
git://git.openfabrics.org/ofed_1_3/ibutils.git ofed_1_3
commit 7daf94fab6eaf307316326f3f49704e6080a1508
ibsim:
git://git.openfabrics.org/ofed_1_3/ibsim.git ofed_1_3
commit 55113d9f919709c7c97ea41d29991941b9c8be70

ofa_kernel-1.3.1:
Git:
git://git.openfabrics.org/ofed_1_3/linux-2.6.git ofed_kernel
commit 39e1dc833f98e5134f91fcf7f33df402adf4bc0c

# MPI
mvapich-1.0.1-2533.src.rpm
mvapich2-1.0.3-1.src.rpm
openmpi-1.2.6-1.src.rpm
mpitests-3.0-773.src.rpm

-- 
Dipl. Hydr. Benjamin Fersch

Institute for Meteorology and Climate Research (IMK-IFU)
KIT Karlsruhe Institute of Technology (FZK)
Kreuzeckbahnstraße 19
82467 Garmisch-Partenkirchen (Germany)

Phone: +49 8821 183-267
Fax: +49 8821 183-243

________________________________________________________________________

Forschungszentrum Karlsruhe GmbH, Weberstraße 5, 76133 Karlsruhe

Amtsgericht Mannheim, HRB 100302
Vorsitzende des Aufsichtsrates: MinDir'in Bärbel Brumme-Bothe
Vorstand (Geschäftsführung): Prof. Dr. Eberhard Umbach (Vorsitzender);
Dr. Alexander Kurz (stellv. Vorsitzender); Dr.-Ing. Peter Fritz;
Prof. Dr.-Ing. Detlef Löhe; Prof. Dr. Horst Hippler; Prof. Dr. Reinhard Maschuw;

From perkinjo at cse.ohio-state.edu Mon Apr 6 09:18:43 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Apr 6 09:19:09 2009
Subject: [mvapich-discuss] Assertion problem
In-Reply-To: <49D9F629.7090707@imk.fzk.de>
References: <49D9F629.7090707@imk.fzk.de>
Message-ID: <20090406131843.GA3047@cse.ohio-state.edu>

Benjamin:

Thanks for using mvapich2. I'll discuss this issue with the other developers and get back to you, either to resolve the issue or to pinpoint its cause.

-- 
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090406/eb1b9d23/attachment.bin

From maya.usatu at gmail.com Wed Apr 15 15:46:31 2009
From: maya.usatu at gmail.com (Maya Khaliullina)
Date: Wed Apr 15 15:46:45 2009
Subject: [mvapich-discuss] Profiling of osu_mbw_mr test
Message-ID:

Hello,

We are developing a model of concurrent communications for the InfiniBand network of our HPC cluster:
Node: 2xQuad Core Intel Xeon 2.33 GHz
O/S: RHEL4.5
File System: GPFS

To investigate the behaviour of the InfiniBand interconnect, we profiled the osu_mbw_mr test from OMB with the Allinea Optimization & Profiling Tool (OPT).

We use MVAPICH2 1.2.

We reduced the number of iterations in the osu_mbw_mr test to 5 and used only 2 MB messages.

We found two main variants of communication behaviour, which result in different aggregate bandwidth:

1) (pic.1) In the first case, one pair of communicating processes (2 and 6) works faster than the others and finishes earlier. The corresponding bandwidth is ~ 950 MB/sec.

2) (pic.2) In the second case, all pairs behave similarly, with an aggregate bandwidth of ~ 960 MB/sec.

Could you please explain why we see these two cases?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090416/e94d90ce/attachment.html

From arthur at mail.rb.ru Thu Apr 16 05:09:36 2009
From: arthur at mail.rb.ru (Arthur Yuldashev)
Date: Thu Apr 16 05:09:47 2009
Subject: [mvapich-discuss] service data size evaluation
Message-ID: <49E6F5D0.4080200@mail.rb.ru>

Dear developers!

Could you please suggest how to evaluate the size of the service data (in bytes) that is sent for MPI point-to-point communications as implemented in MVAPICH2 1.2?
Does it differ when using a build with BLCR support?

Best regards,
Arthur Yuldashev

From maya.usatu at gmail.com Thu Apr 16 06:59:21 2009
From: maya.usatu at gmail.com (Maya Khaliullina)
Date: Thu Apr 16 06:59:36 2009
Subject: [mvapich-discuss] Fwd: Profiling of osu_mbw_mr test
In-Reply-To:
References:
Message-ID:

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pic1.png
Type: image/png
Size: 4046 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090416/d9121509/pic1-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pic2.png
Type: image/png
Size: 4042 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090416/d9121509/pic2-0001.png

From panda at cse.ohio-state.edu Thu Apr 16 09:52:30 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Apr 16 09:52:42 2009
Subject: [mvapich-discuss] Profiling of osu_mbw_mr test
In-Reply-To:
Message-ID:

A couple of points here. When multiple processes are transferring data concurrently, you need to make sure that they start at around the same time and that the communication steps overlap. This is achieved by running the test for multiple iterations. Typically we skip some of the initial iterations and take the average of the remaining iterations.

You have reduced the number of iterations to a small value. You can try increasing the number of iterations to see the impact. Also, you may add double barriers at the beginning to make sure that the processes are almost synchronized.

Hope this helps.

DK

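As an illustration of the advice in the reply above (skip some initial iterations, then average the rest), here is a minimal Python sketch. It is not code from OMB or osu_mbw_mr: the function name, the `skip` parameter, the timings, and the choice of 1 MB = 10^6 bytes are all assumptions made for this example.

```python
def average_bandwidth_mb_s(message_bytes, iteration_seconds, skip=2):
    """Return the mean bandwidth in MB/s after dropping `skip` warm-up iterations.

    Hypothetical helper, not part of OMB; 1 MB is taken as 10**6 bytes.
    """
    if skip < 0:
        raise ValueError("skip must be non-negative")
    measured = iteration_seconds[skip:]  # drop the warm-up iterations
    if not measured:
        raise ValueError("need more iterations than warm-up iterations")
    total_bytes = message_bytes * len(measured)
    return total_bytes / sum(measured) / 1e6


# Made-up timings: one slow first iteration followed by steady ones.
timings = [0.0040, 0.0021, 0.0021, 0.0021, 0.0021]
with_warmup = average_bandwidth_mb_s(2_000_000, timings, skip=0)
steady_only = average_bandwidth_mb_s(2_000_000, timings, skip=1)
print(f"all iterations: {with_warmup:.0f} MB/s, first one skipped: {steady_only:.0f} MB/s")
```

With only 5 iterations, a single slow start-up iteration carries a large weight in the mean, which is one reason a longer run with skipped warm-up iterations may give steadier numbers.
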
From nilesh_awate at yahoo.com Fri Apr 17 03:34:24 2009
From: nilesh_awate at yahoo.com (nilesh awate)
Date: Fri Apr 17 03:34:38 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
Message-ID: <398553.26679.qm@web94108.mail.in2.yahoo.com>

Hi all,

I am using mvapich2-1.2p1 over gen2.

For debugging purposes I wanted to use the --enable-g=mem option.

The configure options are:

./configure --enable-g=mem --prefix=/home/anuj/mvapich2_1_2/
--with-rdma=gen2 --with-ib-include=/usr/local/ofed/include/
--with-ib-libpath=/usr/local/ofed/lib64/ --enable-sharedlibs=gcc
--enable-debuginfo

but the build started giving me the following errors over both ADIs:

ch3_smp_progress.c:815:58: warning: character constant too long for its type
ch3_smp_progress.c: In function `MPIDI_CH3I_SMP_init':
ch3_smp_progress.c:815: error: syntax error before ':' token
ch3_smp_progress.c:828:59: warning: character constant too long for its type
ch3_smp_progress.c:828: error: syntax error before ':' token
make[7]: Leaving directory `/tmp/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/rdma'

I looked at the code that triggers the error; it is under #if !defined(_X86_64_),
so I defined that flag to omit that part of the code. On the next compilation, however, the same error appeared in another part of the code:

ch3u_rma_sync.c:205:74: warning: character constant too long for its type
ch3u_rma_sync.c: In function `MPIDI_Win_fence':
ch3u_rma_sync.c:205: error: syntax error before ':' token
ch3u_rma_sync.c:233:27: warning: character constant too long for its type
ch3u_rma_sync.c:233: error: syntax error before ':' token

What should I do to build the mvapich libraries with debugging options (specifically memory tracing)?

Waiting for a reply,
Nilesh Awate
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090417/d49b2b25/attachment.html

From maya.usatu at gmail.com Fri Apr 17 06:14:36 2009
From: maya.usatu at gmail.com (Maya Khaliullina)
Date: Fri Apr 17 06:14:53 2009
Subject: [mvapich-discuss] Profiling of osu_mbw_mr test
In-Reply-To:
References:
Message-ID:

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pic1.png
Type: image/png
Size: 27579 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090417/0d4aea76/pic1-0001.png
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pic2.png
Type: image/png
Size: 27456 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090417/0d4aea76/pic2-0001.png

From polk678 at gmail.com Fri Apr 17 07:19:54 2009
From: polk678 at gmail.com (gossips J)
Date: Fri Apr 17 07:20:09 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
In-Reply-To: <398553.26679.qm@web94108.mail.in2.yahoo.com>
References: <398553.26679.qm@web94108.mail.in2.yahoo.com>
Message-ID:

Yes, I see the same thing.

It looks like there is a bug in the mvapich2 build process with the "--enable-g --enable-debuginfo" options.

It never builds the debug library for mvapich2.

The code contains primitive memory-allocation calls in some places, which do not compile when mvapich2 is configured for debugging.

Let us hope that this will be resolved in the next release.

Thanks,

Polk.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090417/ffc494c7/attachment.html

From panda at cse.ohio-state.edu Fri Apr 17 10:06:19 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Apr 17 10:06:33 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
In-Reply-To:
Message-ID:

Thanks for your notes. We plan to resolve this issue in the upcoming release (in a few weeks).

Thanks,

DK

From panda at cse.ohio-state.edu Fri Apr 17 12:07:13 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Apr 17 12:07:28 2009
Subject: [mvapich-discuss] service data size evaluation
In-Reply-To: <49E6F5D0.4080200@mail.rb.ru>
Message-ID:

> Dear developers!
>
> Could you please suggest how to evaluate size of service data (in bytes)
> which are sent for MPI point to point communications implemented in
> MVAPICH2 1.2?

I believe you are referring to the header size of MPI packets. It varies with the packet type. For example, for non-RDMA Fast path packets, the size is 56 bytes.

> Does it differ when using build with BLCR support?

No.

Hope this helps.
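
To put the 56-byte figure from the reply above in proportion, the following Python sketch computes what fraction of the bytes sent is user payload. It assumes exactly one 56-byte MPI header per message and ignores every other source of overhead (for example InfiniBand transport headers or control packets), so it is only a rough illustration; the function name is hypothetical and not part of MVAPICH2.

```python
HEADER_BYTES = 56  # header size quoted above for non-RDMA fast path packets


def payload_fraction(payload_bytes, header_bytes=HEADER_BYTES):
    """Fraction of transmitted bytes that is user payload.

    Assumption for this sketch: one header per message, no other overhead.
    """
    if payload_bytes < 0 or header_bytes < 0:
        raise ValueError("sizes must be non-negative")
    total = payload_bytes + header_bytes
    return payload_bytes / total if total else 0.0


for size in (8, 56, 1024):
    print(f"{size:5d} B payload -> {payload_fraction(size):.3f} of the bytes are payload")
```

Larger messages may be sent with a different protocol and packet type, which this sketch does not model.
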

Thanks,

DK

From polk678 at gmail.com Sat Apr 18 02:46:19 2009
From: polk678 at gmail.com (gossips J)
Date: Sat Apr 18 02:46:33 2009
Subject: [mvapich-discuss] mvapich2-1.2p1 over ofa for longer duration above 64 processes/tasks
Message-ID:

I would like to know whether anybody has tried to run mvapich2-1.2p1 over the OFA (gen2 and rdma_cm) interfaces with more than 64 processes/tasks, in iterations over a longer duration (with any provider drivers). If so, how many?

What was the behavior: did it run properly, deadlock, or produce some other error message?

Thanks,
Polk
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090418/5e4bc22d/attachment.html

From sampat at cse.ohio-state.edu Sat Apr 25 20:59:09 2009
From: sampat at cse.ohio-state.edu (Ajay Sampat)
Date: Sat Apr 25 23:31:18 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
In-Reply-To:
References:
Message-ID: <890245760904251759j7f4fbe4dwd5a8318302287147@mail.gmail.com>

Hello Nilesh Awate / Gossips J,

We now have a fix for the compilation problems you reported when using the memory-tracing debug options.

The fix is in the nightly tarballs from April 25th onwards, available here:
http://mvapich.cse.ohio-state.edu/nightly/mvapich2/branches/1.2/

You can also sign up to our mvapich-commit mailing list to keep up with all the commits to the MVAPICH2 bug-fix branches:
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-commit/

This fix will also be available in the upcoming MVAPICH2 release.

Thanks.

-- 
Ajay Sampat

From polk678 at gmail.com Sun Apr 26 07:17:55 2009
From: polk678 at gmail.com (gossips J)
Date: Sun Apr 26 07:18:16 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
In-Reply-To: <890245760904251759j7f4fbe4dwd5a8318302287147@mail.gmail.com>
References: <890245760904251759j7f4fbe4dwd5a8318302287147@mail.gmail.com>
Message-ID:

Do you mean that this updated version of the mvapich2 RPM will be part of OFED-1.4.1 RC4, which is being released on Monday/Tuesday, or of OFED-1.5?

-polk.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090426/a1b71c42/attachment-0001.html

From panda at cse.ohio-state.edu Sun Apr 26 20:17:57 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sun Apr 26 20:18:11 2009
Subject: [mvapich-discuss] mvapich2-1.2 compilation error for debug options
In-Reply-To:
Message-ID:

> Do you mean this updated version of mvapich2 RPM will be part of OFED-1.4.1
> RC4 releasing on Monday/Tuesday or in OFED-1.5?

As Ajay indicated, you can get an updated version of mvapich2 (1.2p1 plus this patch) from the mvapich web page (nightly tarball). This patched version is not yet a part of OFED-1.4.1. You should be able to download, compile and use it with any of the recent OFED stacks, including OFED 1.4.1. The subsequent mvapich2 version (with many new features and this patch) will be a part of OFED 1.5.

Hope this helps.

DK
> > --
> > Ajay Sampat
>

From hoot at ptpnow.com Thu Apr 30 08:16:13 2009
From: hoot at ptpnow.com (Hoot Thompson)
Date: Thu Apr 30 08:16:35 2009
Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC
Message-ID: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk>

Has there been any work done and/or experience with using mvapich2 and Intel NetEffects 10 Gigabit NIC cards as the communication fabric? If so, any setup/configuration suggestions for a Linux environment would be appreciated. No matter what I try, when I execute one of the OSU benchmarks I get the following error...

client1nccs:~/mvapich2-1.2p1/osu_benchmarks # mpiexec -n 2 ./osu_latency
0: Starting MPI
0: [ring_startup.c:301] error(22): Could not modify boot qp to RTS
1: [ring_startup.c:301] error(22): Could not modify boot qp to RTS
rank 1 in job 4 client1nccs_37672 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
rank 0 in job 4 client1nccs_37672 caused collective abort of all ranks
exit status of rank 0: killed by signal 9

Thanks in advance.....
From panda at cse.ohio-state.edu Thu Apr 30 09:15:55 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Apr 30 09:16:09 2009
Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC
In-Reply-To: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk>
Message-ID:

We tested MVAPICH2 with the original Neteffect cards a long time ago, and they were working fine. As you know, Neteffect cards and drivers have gone through multiple changes in recent years, especially after Intel's acquisition of Neteffect. You may want to check with Intel about the latest status and drivers for these cards.

Thanks,

DK

On Thu, 30 Apr 2009, Hoot Thompson wrote:

> Has there been any work done and/or experience with using mvapich2 and Intel
> NetEffects 10Gigbit NIC cards as the communication fabric? If so, any
> setup/configuration suggestions for a Linux environment would be
> appreciated. No matter what I try when I try to execute one of the OSU
> benchmarks I get the following error...
>
>
> client1nccs:~/mvapich2-1.2p1/osu_benchmarks # mpiexec -n 2 ./osu_latency
> 0: Starting MPI
> 0: [ring_startup.c:301] error(22): Could not modify boot qp to RTS
> 1: [ring_startup.c:301] error(22): Could not modify boot qp to RTS
> rank 1 in job 4 client1nccs_37672 caused collective abort of all ranks
> exit status of rank 1: killed by signal 9
> rank 0 in job 4 client1nccs_37672 caused collective abort of all ranks
> exit status of rank 0: killed by signal 9
>
>
>
> Thanks in advance.....
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From arthur at mail.rb.ru Thu Apr 30 10:08:17 2009
From: arthur at mail.rb.ru (Arthur Yuldashev)
Date: Thu Apr 30 10:08:31 2009
Subject: [mvapich-discuss] examples_collchk issue
Message-ID: <49F9B0D1.8060902@mail.rb.ru>

Hello!
It seems that we've found a bug in some of the examples provided with mvapich2-1.2. For instance, time_alltoallv.c contains the following lines of code:

    if ( argv != NULL && argv[1] != NULL )
        block_size = atoi( argv[1] );
    else
        block_size = 1;
    if ( argv != NULL && argv[2] != NULL )
        num_itr = atoi( argv[2] );
    else
        num_itr = 1;

We ran it without any command-line arguments, so argv[1] was NULL, but argv[2] pointed to one of the environment variables. atoi(argv[2]) therefore returned 0, giving 0 iterations rather than the intended 1, so no MPI_Alltoallv communication was actually performed.

Best regards,
Arthur Yuldashev

From nick.holway at gmail.com Thu Apr 30 10:08:27 2009
From: nick.holway at gmail.com (Nick Holway)
Date: Thu Apr 30 10:08:41 2009
Subject: [mvapich-discuss] Jobs run slowly with >1 job on the same nodes
Message-ID: <60fabd260904300708q7d8653asae0fadef49b94d13@mail.gmail.com>

Dear all.

I'm running a 64-bit Rocks 5.1 cluster (i.e. CentOS 5.2) with Voltaire OFED 1.4 and SGE 6.1u5. I compiled MVAPICH 1.2 with ifort 10 and configured it with F77 & F90 bindings. The nodes all have 2 quad-core Xeon CPUs.

We've compiled PMEMD and sander.MPI and see the same problem with both. When one job is run at a time (32 CPUs on 8 nodes), the job runs well with good performance. If two jobs (e.g. 32 CPUs each on the same 8 nodes) are launched at the same time, both jobs run an order of magnitude slower. A single 64-CPU run on the same nodes runs normally.

We're also seeing problems with jobs disappearing from SGE and qdel not deleting the jobs properly.

Does anyone know what might be causing the above issues? FWIW I've run the osu benchmarks and subounce on the cluster without issue.

I originally raised this on the Amber mailing list, where it was suggested that it's more likely to be a system problem than a problem with their software (http://structbio.vanderbilt.edu/archives/amber-archive/2009/1410.php).
Regards Nick From hoot at ptpnow.com Thu Apr 30 10:43:31 2009 From: hoot at ptpnow.com (Hoot Thompson) Date: Thu Apr 30 10:43:59 2009 Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC In-Reply-To: References: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk> Message-ID: <9D1D3508633A46709602B63471F50011@ptpdesk> Thanks for quick response. Can you tell me what the error message means? Hoot -----Original Message----- From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] Sent: Thursday, April 30, 2009 9:16 AM To: Hoot Thompson Cc: mvapich-discuss@cse.ohio-state.edu Subject: Re: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC We had tested MVAPICH2 and Neteffect (original cards) long back. They were working fine. As you know, Neteffect cards and drivers have gone through multiple changes in recent years, especially after Intel's aquisition of Neteffect. You may check with Intel about the latest status and drivers on these cards. Thanks, DK On Thu, 30 Apr 2009, Hoot Thompson wrote: > Has there been any work done and/or experience with using mvapich2 and > Intel NetEffects 10Gigbit NIC cards as the communication fabric? If > so, any setup/configuration suggestions for a Linux environment would > be appreciated. No matter what I try when I try to execute one of the > OSU benchmarks I get the following error... > > > client1nccs:~/mvapich2-1.2p1/osu_benchmarks # mpiexec -n 2 > ./osu_latency > 0: Starting MPI > 0: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > 1: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > rank 1 in job 4 client1nccs_37672 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > rank 0 in job 4 client1nccs_37672 caused collective abort of all ranks > exit status of rank 0: killed by signal 9 > > > > Thanks in advance..... 
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From panda at cse.ohio-state.edu Thu Apr 30 11:25:47 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Apr 30 11:26:01 2009
Subject: [mvapich-discuss] Jobs run slowly with >1 job on the same nodes
In-Reply-To: <60fabd260904300708q7d8653asae0fadef49b94d13@mail.gmail.com>
Message-ID:

It looks like CPU affinity is on here. When you submit two 32-process jobs, both get mapped to exactly the same set of cores, so both jobs run slower. Try running your applications with affinity disabled (MV2_ENABLE_AFFINITY=0). More details on this parameter are available in the MVAPICH2 user guide at the following location:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-10000011.16

Hope this helps.

DK

On Thu, 30 Apr 2009, Nick Holway wrote:

> Dear all.
>
> I'm running a 64bit Rocks 5.1 cluster (ie Centos 5.2) with Voltaire
> OFED 1.4 and SGE 6.1u5. I compiled MVAPICH 1.2 with ifort 10 and I
> configured it with F77 & F90 bindings. The nodes all have 2 quad core
> Xeon CPUs.
>
> We've compiled PMEMD and sander.MPI and see the same problem with
> both. When one job is run at a time (32 CPUs on 8 nodes) the job runs
> well with good performance. If two jobs (eg 32 on the same 8 nodes)
> are launched at the same time then both jobs run an order of magnitude
> slower. A single 64 CPU run on the same nodes runs normally.
>
> We're also seeing problems with jobs disapearing from SGE and qdel not
> deleting the jobs properly.
>
> Does anyone know what might be causing the above issues? FWIW I've run
> the osu benchmarks and subounce on the cluster without issue.
>
> I originally raised this on the Amber mailing list who suggested that
> it's more likely to be a system problem rather than with their
> software (http://structbio.vanderbilt.edu/archives/amber-archive/2009/1410.php).
>
> Regards
>
> Nick
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From polk678 at gmail.com Thu Apr 30 13:17:34 2009
From: polk678 at gmail.com (gossips J)
Date: Thu Apr 30 13:18:04 2009
Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC
In-Reply-To: <9D1D3508633A46709602B63471F50011@ptpdesk>
References: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk> <9D1D3508633A46709602B63471F50011@ptpdesk>
Message-ID:

This error seems to be specific to the provider's verbs API implementation. It looks to me as if the QP hit an error while moving to the RTS state; the reason could be a bad card type/firmware combination or a bad set of drivers.

I would suggest going through the provider's code and checking whether a newer version is out...

As DK mentioned, Intel NetEffect cards have gone through many changes, so you should get the latest driver release from Intel and check with it.

Hopefully that solves the problem.

-polk.

On 4/30/09, Hoot Thompson wrote:
>
> Thanks for quick response. Can you tell me what the error message means?
>
> Hoot
>
> -----Original Message-----
> From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu]
> Sent: Thursday, April 30, 2009 9:16 AM
> To: Hoot Thompson
> Cc: mvapich-discuss@cse.ohio-state.edu
> Subject: Re: [mvapich-discuss] mvapich2 and osu benchmarks using Intel
> NetEffects NIC
>
> We had tested MVAPICH2 and Neteffect (original cards) long back. They were
> working fine. As you know, Neteffect cards and drivers have gone through
> multiple changes in recent years, especially after Intel's aquisition of
> Neteffect.
You may check with Intel about the latest status and drivers on > these cards. > > Thanks, > > DK > > On Thu, 30 Apr 2009, Hoot Thompson wrote: > > > Has there been any work done and/or experience with using mvapich2 and > > Intel NetEffects 10Gigbit NIC cards as the communication fabric? If > > so, any setup/configuration suggestions for a Linux environment would > > be appreciated. No matter what I try when I try to execute one of the > > OSU benchmarks I get the following error... > > > > > > client1nccs:~/mvapich2-1.2p1/osu_benchmarks # mpiexec -n 2 > > ./osu_latency > > 0: Starting MPI > > 0: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > > 1: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > > rank 1 in job 4 client1nccs_37672 caused collective abort of all ranks > > exit status of rank 1: killed by signal 9 > > rank 0 in job 4 client1nccs_37672 caused collective abort of all ranks > > exit status of rank 0: killed by signal 9 > > > > > > > > Thanks in advance..... > > > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090430/2c8c3f56/attachment.html From koop at cse.ohio-state.edu Thu Apr 30 14:12:56 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Thu Apr 30 14:13:12 2009 Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC In-Reply-To: Message-ID: Hi Hoot, I think the issue here is that the card is not being detected as an iWARP device, so it is taking the InfiniBand wireup protocol path. 
You'll want to consult section 5.2.5 of the user guide for more information on using an iWARP device in MVAPICH2: http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html That said, I've been told there are still other stability problems with some of the RDMA CM connection setup with the NetEffect driver with anything less than the latest release of OFED. At least for now, the best method may be to use the uDAPL driver for NetEffect. Matt On Thu, 30 Apr 2009, gossips J wrote: > This error seems to be specific to verbs api implementation of the > providers. It looks me to QP has got error while moving to RTS state, reason > could be bad card type/FW combination or bad set of drivers. > > I would suggest to go through the providers code and see if there is any > latest version out... > > As DK mentioned, intel-neteffect cards has gone through many changes so you > should see the latest drivers release made by intel and check with it. > > hopefuly problem could be solved. > > -polk. > > On 4/30/09, Hoot Thompson wrote: > > > > Thanks for quick response. Can you tell me what the error message means? > > > > Hoot > > > > -----Original Message----- > > From: Dhabaleswar Panda [mailto:panda@cse.ohio-state.edu] > > Sent: Thursday, April 30, 2009 9:16 AM > > To: Hoot Thompson > > Cc: mvapich-discuss@cse.ohio-state.edu > > Subject: Re: [mvapich-discuss] mvapich2 and osu benchmarks using Intel > > NetEffects NIC > > > > We had tested MVAPICH2 and Neteffect (original cards) long back. They were > > working fine. As you know, Neteffect cards and drivers have gone through > > multiple changes in recent years, especially after Intel's aquisition of > > Neteffect. You may check with Intel about the latest status and drivers on > > these cards. 
> > > > Thanks, > > > > DK > > > > On Thu, 30 Apr 2009, Hoot Thompson wrote: > > > > > Has there been any work done and/or experience with using mvapich2 and > > > Intel NetEffects 10Gigbit NIC cards as the communication fabric? If > > > so, any setup/configuration suggestions for a Linux environment would > > > be appreciated. No matter what I try when I try to execute one of the > > > OSU benchmarks I get the following error... > > > > > > > > > client1nccs:~/mvapich2-1.2p1/osu_benchmarks # mpiexec -n 2 > > > ./osu_latency > > > 0: Starting MPI > > > 0: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > > > 1: [ring_startup.c:301] error(22): Could not modify boot qp to RTS > > > rank 1 in job 4 client1nccs_37672 caused collective abort of all ranks > > > exit status of rank 1: killed by signal 9 > > > rank 0 in job 4 client1nccs_37672 caused collective abort of all ranks > > > exit status of rank 0: killed by signal 9 > > > > > > > > > > > > Thanks in advance..... > > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From hoot at ptpnow.com Thu Apr 30 14:17:19 2009 From: hoot at ptpnow.com (Hoot Thompson) Date: Thu Apr 30 14:17:40 2009 Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC In-Reply-To: References: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk> <9D1D3508633A46709602B63471F50011@ptpdesk> Message-ID: <5C59F75C88F44A37B42508B1CC0EC51A@ptpdesk> thanks! 
From hoot at ptpnow.com Thu Apr 30 14:18:22 2009
From: hoot at ptpnow.com (Hoot Thompson)
Date: Thu Apr 30 14:18:46 2009
Subject: [mvapich-discuss] mvapich2 and osu benchmarks using Intel NetEffects NIC
In-Reply-To:
References: <60C951AAF9F24FAEB7688A4CEB8EFA33@ptpdesk> <9D1D3508633A46709602B63471F50011@ptpdesk>
Message-ID:

Thanks
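
A note on Arthur Yuldashev's examples_collchk report earlier in this archive (time_alltoallv.c). The C standard guarantees that argv[argc] is a null pointer, so with no command-line arguments the argv[1] test is safe. argv[2], however, then lies past the end of the array, and reading it is undefined behaviour. On many Unix systems the environment strings happen to follow argv in memory, which would match what Arthur observed. A minimal sketch of a bounds-checked alternative follows. The helper name arg_or_default is hypothetical and not part of MVAPICH2, the sketch assumes main's argc is in scope where the arguments are parsed, and it is not necessarily the fix applied in the MVAPICH2 tree.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of MVAPICH2): return argv[idx] parsed as a
 * positive int, or `fallback` when the argument is absent or is not a
 * valid positive decimal number.
 */
static int arg_or_default(int argc, char **argv, int idx, int fallback)
{
    char *end;
    long value;

    assert(idx > 0);            /* index 0 is the program name */

    /* Check idx against argc before touching argv[idx]; only indices
     * 0..argc are valid, and argv[argc] is always NULL. */
    if (argv == NULL || idx >= argc || argv[idx] == NULL)
        return fallback;

    /* strtol, unlike atoi, lets us tell "0" apart from "not a number". */
    errno = 0;
    value = strtol(argv[idx], &end, 10);
    if (errno != 0 || end == argv[idx] || *end != '\0')
        return fallback;
    if (value < 1 || value > INT_MAX)
        return fallback;

    return (int)value;
}
```

Under these assumptions, the two assignments in the example would become block_size = arg_or_default(argc, argv, 1, 1); and num_itr = arg_or_default(argc, argv, 2, 1);, so a run without arguments falls back to one iteration instead of reading past the end of argv.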