From forum.san at gmail.com Sun Feb 1 10:22:04 2009
From: forum.san at gmail.com (Sangamesh B)
Date: Sun Feb 1 10:22:14 2009
Subject: [mvapich-discuss] cpmd job failure
In-Reply-To:
References:
Message-ID:

Hello Sir,

On Sat, Jan 31, 2009 at 8:08 PM, Dhabaleswar Panda wrote:
> Thanks for reporting this. Are you running MVAPICH2 1.2p1 with the
> `default' mode or with any environment variables? Can you also indicate
> the details on your platform (processor, number of cores/node, amount of
> memory per core, IB HCA speed, etc.).
>
I'm running it in 'default' mode. I've not used any additional variables.

Intel Xeon Quad core Dual processor (8 cores/node).
4GB RAM/node (512 MB/core)
Intel compilers 10

The same job runs fine with Open MPI.

Thanks,
Sangamesh

> Thanks,
>
> DK
>
> On Sat, 31 Jan 2009, Sangamesh B wrote:
>
>> Hello mvapich2 team,
>>
>> The CPMD (www.cpmd.org) application is installed with intel
>> compilers on a Rocks4.3 Linux based infiniband supported cluster,
>> mvapich2 version 1.2p1.
>>
>> The 40 process job runs for some time and then fails with following output:
>>
>> LINE SEARCH : LAMBDA=.164E-01 PREDICTED ENERGY = -1890.824133217
>> 57 9.731E-05 7.571E-06 -1890.824133 -8.483E-07 47.38
>> LINE SEARCH : LAMBDA=.166E-01 PREDICTED ENERGY = -1890.824133946
>> 58 9.831E-05 7.265E-06 -1890.824134 -7.234E-07 47.41
>> LINE SEARCH : LAMBDA=.178E-01 PREDICTED ENERGY = -1890.824134657
>> 59 9.529E-05 6.389E-06 -1890.824135 -6.945E-07 47.36
>> rank 17 in job 1 node-0-5.local_32810 caused collective abort of all ranks
>> exit status of rank 17: killed by signal 9
>> rank 1 in job 1 node-0-5.local_32810 caused collective abort of all ranks
>> exit status of rank 1: killed by signal 9
>>
>> For several same jobs, it fails around same point (but not exactly at
>> same step).
>>
>> What could be the solution for this?
>>
>> Thanks,
>> Sangamesh
>> _______________________________________________
>> mvapich-discuss mailing list
>> mvapich-discuss@cse.ohio-state.edu
>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>
>
>

From DANIEL.M.REDIG at saic.com Mon Feb 2 16:39:22 2009
From: DANIEL.M.REDIG at saic.com (Dan)
Date: Mon Feb 2 17:42:08 2009
Subject: [mvapich-discuss] Control shouldn't reach here in prototype
Message-ID: <1233610762.29586.45.camel@redig.dhcp.saic.com>

I've looked at the source and attempted to understand this error but I
need some help. The job runs when compiled with other MPIs but not
MVAPICH2. In order to get it to run this far I needed to set the amount
of memory available for semaphores a lot higher.

[7] Abort: [15] Abort: Control shouldn't reach here in prototype, header 155
Control shouldn't reach here in prototype, header 190
 at line 276 in file ibv_recv.c
 at line 276 in file ibv_recv.c
[21] Abort: Control shouldn't reach here in prototype, header 84
 at line 276 in file ibv_recv.c
[25] Abort: Control shouldn't reach here in prototype, header 212
 at line 276 in file ibv_recv.c
rank 25 in job 3 cmfpos_36963 caused collective abort of all ranks
 exit status of rank 25: killed by signal 9
rank 15 in job 3 cmfpos_36963 caused collective abort of all ranks
 exit status of rank 15: killed by signal 9
rank 7 in job 3 cmfpos_36963 caused collective abort of all ranks
 exit status of rank 7: killed by signal 9

suggestions?

Thank you!

Dan

From sridharj at cse.ohio-state.edu Mon Feb 2 18:40:23 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Mon Feb 2 18:40:29 2009
Subject: [mvapich-discuss] Re: Problems with hostname resolution and MPI_INIT() (Mike Heinz)
In-Reply-To:
References:
Message-ID: <1233618023.29260.0.camel@t13.nowlab.cis.ohio-state.edu>

Hi,

As suggested here, we've updated the userguide
(http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-160005.2.1).
-Jaidev

On Wed, 2009-01-14 at 07:26 -0600, Terrence.LIAO@total.com wrote:
>
> Mike,
>
> We have the same problem before (posted in mvapich-discuss Digest, Vol
> 37, Issue 1) and just like you, I dug into the mvapich code and
> modified the get_host_id(). System admin told me, it is a normal
> practice to put hostname to 127.0.0.1 entry just like yours. Of course
> this the culprit. Later on, to resolve a sendmail conflict, system
> admin removed the hostname from the 127.0.0.1 entry and created the
> real IP to hostname entry on /etc/hosts. Of course, with this, no more
> mvapich code change for me. May be the solution to this is let
> mvapich install guide to advise or recommend to create entry for
> hostname in /etc/hosts, and to warn the potential problem if added it
> into 127.0.0.1. (May be this is already been done and, well, I have to
> admit that I did not read the entire install guide. )
>
> Thank you very much.
>
> -- Terrence
> --------------------------------------------------------
> Terrence Liao, Ph.D.
> Research Computer Scientist
> TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
> 1201 Louisiana, Suite 1800, Houston, TX 77002
> Tel: 713.647.3498 Fax: 713.647.3638
> Email: terrence.liao@total.com
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From koop at cse.ohio-state.edu Mon Feb 2 18:51:57 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Feb 2 18:52:02 2009
Subject: [mvapich-discuss] Control shouldn't reach here in prototype
In-Reply-To: <1233610762.29586.45.camel@redig.dhcp.saic.com>
Message-ID:

Dan,

Can you give us some additional information about the system that this
is occurring on, such as the OS, HCAs (multiple HCAs?), number of
processes, etc?

Also, would it be possible to send along the code (or a part) that
exhibits this problem?
Additionally, can you try with the MV2_USE_COALESCE=0 environment
variable set?

Thanks,

Matt

From DANIEL.M.REDIG at saic.com Mon Feb 2 19:42:19 2009
From: DANIEL.M.REDIG at saic.com (Dan)
Date: Mon Feb 2 23:09:29 2009
Subject: [mvapich-discuss] Control shouldn't reach here in prototype
In-Reply-To:
References:
Message-ID: <1233621739.29586.71.camel@redig.dhcp.saic.com>

Hi Matt,

We're running RHEL 4.6 64 bit, one Mellanox ConnectX DDR card per
system, dual quad core Opterons (8 cores total) with 16 GB memory.
We've been using Intel Cluster Toolkit including their MPI successfully
with this code. I haven't tried OpenMPI lately.

I'll look into getting a section of code to review.

Thanks!
Dan

From xmxmxie at gmail.com Wed Feb 4 07:43:40 2009
From: xmxmxie at gmail.com (Xie Min)
Date: Wed Feb 4 07:43:46 2009
Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?
Message-ID: <91bd441b0902040443n381db2c0h6c37e23886a0d711@mail.gmail.com>

We are building a cluster that uses InfiniBand as the interconnect.
Each node has two Intel Xeon E5450 CPUs (4 cores/CPU) and 16GB of memory.

We installed mvapich2-1.2 on this cluster and are using HPCC to run
some tests, but when we increase the HPCC memory size beyond a certain
point, we often hit an "Out of memory" error.

For example:
On 8 nodes we run a 64-task HPCC job, so each node runs 8 tasks (one
task per CPU core). We use "top" to view the memory usage of the HPCC
tasks. If the memory size of each task is set to 1.2/1.3GB (as listed
in the "RES" column of the "top" output), the HPCC tasks exit after
running for a while (apparently while running Linpack). Using "dmesg",
we found an "Out of memory" error.

We browsed the mvapich2 user guide and found the
"MV2_USE_LAZY_MEM_UNREGISTER" parameter, which controls whether the
Pin-Down Cache is used. We set MV2_USE_LAZY_MEM_UNREGISTER to 0 and ran
the HPCC tests again. Now, even when we set the memory size of each
HPCC task to 1.6/1.7GB (as listed in the RES column of the "top"
output), HPCC runs successfully without being killed by the OS.

Each node in our cluster has 16GB of physical memory (2GB per CPU core
on average), so we are wondering why each HPCC task can use only
1.2/1.3GB of memory when the Pin-Down Cache is enabled.

Using the OSU benchmarks, we found that if the Pin-Down Cache is
disabled, osu_latency performance decreases for long messages, so we
still want to use the Pin-Down Cache when running HPCC at large memory
sizes.

BTW, our cluster nodes have no hard disk; they boot using BOOTP and
mount a common directory from a file server using Lustre. So there is
no swap on any node either.

How can we set the mvapich2 parameters correctly to run HPCC at large
memory sizes (for example, with each HPCC task set to > 1.7GB)?

Thanks!
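
The per-node arithmetic behind the report above can be sketched in a few lines of Python. The inputs (16GB per node, 8 tasks per node, RES of about 1.3GB and 1.7GB per task) are taken from the message; the function name node_headroom_gb is made up for illustration. The result is only the memory left over after summing the tasks' resident sets. It is not a measurement of what MVAPICH2 registers or pins, and it does not by itself explain the "Out of memory" kills.

```python
def node_headroom_gb(node_mem_gb, tasks_per_node, res_per_task_gb):
    """Memory left on a node after summing the tasks' resident sets (RES).

    Hypothetical helper for illustration only; it ignores the kernel,
    the file cache, and anything not counted in RES.
    """
    return node_mem_gb - tasks_per_node * res_per_task_gb


# Figures from the report: 16GB nodes, 8 HPCC tasks per node.
for res in (1.3, 1.7):
    used = 8 * res
    left = node_headroom_gb(16, 8, res)
    print(f"RES {res}GB/task: {used:.1f}GB used, {left:.1f}GB left")
```

On nodes without swap, whatever is left has to cover the kernel, the file cache, and any memory not counted in RES, so the headroom at the failing 1.3GB setting is worth comparing against the headroom at the working 1.7GB setting.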
From stewart at cnf.cornell.edu Wed Feb 4 14:58:50 2009
From: stewart at cnf.cornell.edu (Derek Stewart)
Date: Wed Feb 4 14:58:51 2009
Subject: [mvapich-discuss] program crashing running mvapich over infiniband
Message-ID: <20090204195850.79195.qmail@mail.spidergraphics.com>

Hi all,

I was wondering if anyone would have a suggestion for this error. I am
running abinit version 5.4.4p compiled with mvapich 2-1.2p1 and gcc (GCC)
3.4.6 and gfortran 4.1.2, Linux 2.6.9-78.0.13.ELsmp 64bit.

Warning! Rndv Receiver is receiving (36680 < 1263624) less than as expected
rank 1 in job 1

c32_32836 caused collective abort of all ranks
 exit status of rank 1: killed by signal 9

Thanks,

Derek

################################
Derek Stewart, Ph. D.
Scientific Computation Associate
http://www.people.cornell.edu/pages/das248/
250 Duffield Hall
Cornell Nanoscale Facility (CNF)
Ithaca, NY 14853
stewart (at) cnf.cornell.edu
(607) 255-2856

From koop at cse.ohio-state.edu Wed Feb 4 18:25:23 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Wed Feb 4 18:25:27 2009
Subject: [mvapich-discuss] program crashing running mvapich over infiniband
In-Reply-To: <20090204195850.79195.qmail@mail.spidergraphics.com>
Message-ID:

Hi Derek,

Thanks for reporting this problem. Can you give us some additional
information about the run/system? How many processes are you running with
and what HCAs are you using?

We're also interested in trying to reproduce the problem here on our
machines. Is there a dataset that you are using that you could send to us?

Matt

On Wed, 4 Feb 2009, Derek Stewart wrote:

> Hi all,
>
> I was wondering if anyone would have a suggestion for this error. I am
> running abinit version 5.4.4p compiled with mvapich 2-1.2p1 and gcc (GCC)
> 3.4.6 and gfortran 4.1.2, Linux 2.6.9-78.0.13.ELsmp 64bit.
>
> Warning!
Rndv Receiver is receiving (36680 < 1263624) less than as expected > rank 1 in job 1 > > c32_32836 caused collective abort of all ranks > exit status of rank 1: killed by signal 9 > > > Thanks, > > Derek > > ################################ > Derek Stewart, Ph. D. > Scientific Computation Associate > http://www.people.cornell.edu/pages/das248/ > 250 Duffield Hall > Cornell Nanoscale Facility (CNF) > Ithaca, NY 14853 > stewart (at) cnf.cornell.edu > (607) 255-2856 > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Wed Feb 4 20:26:13 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Feb 4 20:26:17 2009 Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage? In-Reply-To: <91bd441b0902040443n381db2c0h6c37e23886a0d711@mail.gmail.com> Message-ID: Hi, What OS/distro are you running? Are there any changes you made, such as page size, etc from the base? I'm taking a look at this issue on our machine as well, although I'm not seeing the memory change that you reported. Matt On Wed, 4 Feb 2009, Xie Min wrote: > We are building a cluster which use Infiniband as the interconnection, > each node has two Intel Xeon E5450 CPU, 4 cores/CPU, and 16GB memory. > > We installed mvapich2-1.2 on this cluster, and are using HPCC to do > some tests, but when we enlarge the memory scale of HPCC to some > extent, we often met "Out of memory" error. > > For example: > In 8 nodes, we run 64 tasks HPCC program, so each node has 8 tasks > running in it (one task for each CPU core). We use "top" to view the > memory usage of HPCC tasks, if the memory scale of each tasks is set > to 1.2/1.3GB (list in "RES" column of "top" output), the HPCC tasks > will exit after running for a while (seems running Linpack). Using > "dmesg", we found "Out of memory" error. 
> > We browsed the user guide of mvapich2, and found > "MV2_USE_LAZY_MEM_UNREGISTER" parameter, this parameter controls if > Pin-Down Cache is used. We set MV2_USE_LAZY_MEM_UNREGISTER to 0, and > do HPCC tests again, now even we set the memory scale of each HPCC > task to 1.6/1.7GB (list in RES column of "top" output), HPCC can run > successfully without being killed by OS. > > Because each node in our cluster has 16GB physical memory (2GB for > each CPU core on average), so we are wondering why each HPCC task can > use only 1.2/1.3GB memory when Pin-Down Cache is enabled. > > Using OSU benchmarks, we found if Pin-Down Cache is disabled, > osu_latency performance will decrease on long message, so we still > want to use Pin-Down Cache when running HPCC on large memory scale. > > BTW, our cluster nodes have no harddisk, they boot using BOOTP and > mount a common directory from a file server using Lustre. So there is > no swap in each node too. > > How can we set the correct mvapich2 parameters to run HPCC in large > memory scale (such as each HPCC task can be set to > 1.7GB )? > > Thanks! > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From stewart at cnf.cornell.edu Wed Feb 4 22:02:46 2009 From: stewart at cnf.cornell.edu (Derek Stewart) Date: Wed Feb 4 22:02:46 2009 Subject: [mvapich-discuss] program crashing running mvapich over infiniband In-Reply-To: References: Message-ID: <20090205030246.88523.qmail@mail.spidergraphics.com> Hi Matthew, Thank you for the quick reply. I am currently just testing with a small run including three nodes each with Dual Core Xeon 5140 2GB DDR2 RAM. The HCA cards are Mellanox PCI-E with 128 MB onboard (MHGA28-1TC). Mvapich came with the OFED-1.4 distribution. I will put together a test case for you to try out and send it to you tomorrow. Do you have abinit up and running there? 
Best regards, Derek Matthew Koop writes: > Hi Derek, > > Thanks for reporting this problem. Can you give us some additional > information about the run/system? How many processes are you running with > and what HCAs are you using? > > We're also interested in trying to reproduce the problem here on our > machines. Is there a dataset that you are using that you could send to us? > > Matt > > On Wed, 4 Feb 2009, Derek Stewart wrote: > >> Hi all, >> >> I was wondering if anyone would have a suggestion for this error. I am >> running abinit version 5.4.4p compiled with mvapich 2-1.2p1 and gcc (GCC) >> 3.4.6 and gfortran 4.1.2, Linux 2.6.9-78.0.13.ELsmp 64bit. >> >> Warning! Rndv Receiver is receiving (36680 < 1263624) less than as expected >> rank 1 in job 1 >> >> c32_32836 caused collective abort of all ranks >> exit status of rank 1: killed by signal 9 >> >> >> Thanks, >> >> Derek >> >> ################################ >> Derek Stewart, Ph. D. >> Scientific Computation Associate >> http://www.people.cornell.edu/pages/das248/ >> 250 Duffield Hall >> Cornell Nanoscale Facility (CNF) >> Ithaca, NY 14853 >> stewart (at) cnf.cornell.edu >> (607) 255-2856 >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > ################################ Derek Stewart, Ph. D. Scientific Computation Associate http://www.people.cornell.edu/pages/das248/ 250 Duffield Hall Cornell Nanoscale Facility (CNF) Ithaca, NY 14853 stewart (at) cnf.cornell.edu (607) 255-2856 From stewart at cnf.cornell.edu Thu Feb 5 00:12:23 2009 From: stewart at cnf.cornell.edu (Derek Stewart) Date: Thu Feb 5 00:12:23 2009 Subject: [mvapich-discuss] Problem solved: program crashing running mvapich over infiniband In-Reply-To: References: Message-ID: <20090205051223.90463.qmail@mail.spidergraphics.com> Hi Matthew, I just wanted to update you and let you know I solved the problem. 
It turns out the local versions of the input file on the different nodes were not identical. So the local program which was set for a larger memory calculation was receiving much less data from the small memory versions initiated on the other nodes. Once I fixed this problem, it ran smoothly. Thanks again, Derek Matthew Koop writes: > Hi Derek, > > Thanks for reporting this problem. Can you give us some additional > information about the run/system? How many processes are you running with > and what HCAs are you using? > > We're also interested in trying to reproduce the problem here on our > machines. Is there a dataset that you are using that you could send to us? > > Matt > > On Wed, 4 Feb 2009, Derek Stewart wrote: > >> Hi all, >> >> I was wondering if anyone would have a suggestion for this error. I am >> running abinit version 5.4.4p compiled with mvapich 2-1.2p1 and gcc (GCC) >> 3.4.6 and gfortran 4.1.2, Linux 2.6.9-78.0.13.ELsmp 64bit. >> >> Warning! Rndv Receiver is receiving (36680 < 1263624) less than as expected >> rank 1 in job 1 >> >> c32_32836 caused collective abort of all ranks >> exit status of rank 1: killed by signal 9 >> >> >> Thanks, >> >> Derek >> >> ################################ >> Derek Stewart, Ph. D. >> Scientific Computation Associate >> http://www.people.cornell.edu/pages/das248/ >> 250 Duffield Hall >> Cornell Nanoscale Facility (CNF) >> Ithaca, NY 14853 >> stewart (at) cnf.cornell.edu >> (607) 255-2856 >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > ################################ Derek Stewart, Ph. D. 
Scientific Computation Associate http://www.people.cornell.edu/pages/das248/ 250 Duffield Hall Cornell Nanoscale Facility (CNF) Ithaca, NY 14853 stewart (at) cnf.cornell.edu (607) 255-2856 From koop at cse.ohio-state.edu Thu Feb 5 00:31:36 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Thu Feb 5 00:31:41 2009 Subject: [mvapich-discuss] Problem solved: program crashing running mvapich over infiniband In-Reply-To: <20090205051223.90463.qmail@mail.spidergraphics.com> Message-ID: Derek, Glad to know that the problem is solved and it is running smoothly. Let us know if you encounter any other issues. Thanks, Matt On Thu, 5 Feb 2009, Derek Stewart wrote: > Hi Matthew, > > I just wanted to update you and let you know I solved the problem. It turns > out the local versions of the input file on the different nodes were not > identical. So the local program which was set for a larger memory > calculation was receiving much less data from the small memory versions > initiated on the other nodes. Once I fixed this problem, it ran smoothly. > > Thanks again, > > Derek > > > Matthew Koop writes: > > > Hi Derek, > > > > Thanks for reporting this problem. Can you give us some additional > > information about the run/system? How many processes are you running with > > and what HCAs are you using? > > > > We're also interested in trying to reproduce the problem here on our > > machines. Is there a dataset that you are using that you could send to us? > > > > Matt > > > > On Wed, 4 Feb 2009, Derek Stewart wrote: > > > >> Hi all, > >> > >> I was wondering if anyone would have a suggestion for this error. I am > >> running abinit version 5.4.4p compiled with mvapich 2-1.2p1 and gcc (GCC) > >> 3.4.6 and gfortran 4.1.2, Linux 2.6.9-78.0.13.ELsmp 64bit. > >> > >> Warning! 
Rndv Receiver is receiving (36680 < 1263624) less than as expected
> >> rank 1 in job 1
> >>
> >> c32_32836 caused collective abort of all ranks
> >> exit status of rank 1: killed by signal 9
> >>
> >>
> >> Thanks,
> >>
> >> Derek
> >>
> >> ################################
> >> Derek Stewart, Ph. D.
> >> Scientific Computation Associate
> >> http://www.people.cornell.edu/pages/das248/
> >> 250 Duffield Hall
> >> Cornell Nanoscale Facility (CNF)
> >> Ithaca, NY 14853
> >> stewart (at) cnf.cornell.edu
> >> (607) 255-2856
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss@cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
>
>
>
> ################################
> Derek Stewart, Ph. D.
> Scientific Computation Associate
> http://www.people.cornell.edu/pages/das248/
> 250 Duffield Hall
> Cornell Nanoscale Facility (CNF)
> Ithaca, NY 14853
> stewart (at) cnf.cornell.edu
> (607) 255-2856
>

From paolo.zini at ipcf.cnr.it Fri Feb 6 05:15:25 2009
From: paolo.zini at ipcf.cnr.it (paolo.zini@ipcf.cnr.it)
Date: Fri Feb 6 05:10:41 2009
Subject: [mvapich-discuss] MVAPICH2 and Broadcom 5706
Message-ID: <000701c98843$d40fa030$57663092@zini>

Hi all.

My name is Paolo, I am in Pisa, Italy.

I have a maybe stupid question, but I am new here, so please be patient.

I have a small cluster, 16 nodes based on dual Opteron machines with a
gigabit network, used for parallel computations.

I use Suse 10 and mpich2 on it.

The problem is that the delay introduced by the TCP stack affects the
system efficiency.

I would like to improve it using MVAPICH2, but the budget available is
small.

According to tests found in the literature, using MPI with Ammasso cards
would reduce the delay to 20 - 30 microseconds.

This would be interesting for me, because the communication requirements
aren't so high...

If I understand correctly, Ammasso cards are now available only on Ebay.
The idea would be to install low cost Broadcom gigabit cards, based on
the 5706 chip.

Broadcom claims that the 5706 and 5708 are iWARP compliant, MVAPICH2
supports iWARP hardware, and MVAPICH2 has been used with gigabit nets
using Ammasso cards, but I haven't found any experience based on
Broadcom cards.

Can I install Broadcom cards and MVAPICH2 and use RDMA?

Can anyone enlighten me?

Thank you for any help.

Paolo Zini
IPCF istituto del CNR,
Area della ricerca di Pisa

From panda at cse.ohio-state.edu Fri Feb 6 08:35:04 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Feb 6 08:35:11 2009
Subject: [mvapich-discuss] MVAPICH2 and Broadcom 5706
In-Reply-To: <000701c98843$d40fa030$57663092@zini>
Message-ID:

Thanks for your note and interest in running MVAPICH2 with Broadcom
5706's iWARP support. At this point in time, we have not tested this
combination because we do not have access to these Broadcom adapters.
You may check with the Broadcom folks to see whether they have done this
testing internally or not. If Broadcom 5706's iWARP support is strictly
iWARP standard compliant, it should work.

Thanks,

DK

On Fri, 6 Feb 2009 paolo.zini@ipcf.cnr.it wrote:

> Hi all.
>
> My name is Paolo, I am in Pisa, Italy.
>
> I have a maybe stupid question, but I am new here, be patient.
>
> I have a small cluster, 16 nodes based on dual opteron machines with gigabit
> network, used for parallel computations.
>
> I use Suse 10 and mpich2 on it.
>
> The problem is that the delay introduced by the TCP stack affects the system
> efficiency.
>
> I would like to improve it, using MVAPICH2, but the budget available is
> small.
>
> According to tests found in literature, using MPI with Ammasso cards would
> reduce the delay to 20 - 30 microseconds.
>
> This would be interesting for me, because the communication requirements
> anren't so high...
>
> If I understand, now Ammasso cards are available only on Ebay.
the idea > would be to install low cost Broadcom gigabit cards, based on 5706 chip. > > Broadcom claims that 5706 and 5708 are iWARP compliant, MVAPICH2 supports > iWARP hardware, MVAPICH2 have been used with gigabit nets using Ammasso > cards, but I haven't found any experience based on Broadcom cards. > > Can I install Broadcom cards, MVAPICH2 and use RDMA? > > > > Can anyone enlighten me? > > > > Tank you for any help. > > > > Paolo Zini > IPCF istituto del CNR, > Area della ricerca di Pisa > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From xmxmxie at gmail.com Mon Feb 9 09:20:03 2009 From: xmxmxie at gmail.com (Xie Min) Date: Mon Feb 9 09:20:11 2009 Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage? In-Reply-To: References: <91bd441b0902050519j475a8290i6115710a9d655e3e@mail.gmail.com> Message-ID: <91bd441b0902090620t756e1143qb1f1fd90e6bcd9de@mail.gmail.com> The hpcc we used is HPCC 1.0.0, but we just tried HPCC 1.3.1, seems has the same problem. In the attachment we attached two hpccinf.txt files for 64 HPCC tasks, the hpccinf.txt.13 is the "RES" of about 1.3GB, while hpccinf.txt.16 is the "RES" of about 1.6/1.7GB. Whould you please try them on your systems (with MV2_USE_LAZY_MEM_UNREGISTER=1), thanks. BTW, the OFED version we used is 1.3.1, physical memory on each node is 16GB, use 8 nodes for 64 tasks. 2009/2/7 Matthew Koop : > > Thanks for the additional information. I've tried here with HPCC 1.3.1 and > I haven't been able to see any difference in the 'RES' or 'VIRT' memory > while running. > > Would it be possible to send me your hpccinf.txt file so I can more > closely try to reproduce the problem? We also have AS5 with kernel 2.6.18 > as well. 
> > Thanks, > > Matt > > On Thu, 5 Feb 2009, Xie Min wrote: > >> We use Redhat AS5, kernel is 2.6.18 with lustre 1.6.6, and we don't >> modify kernel source. >> >> We test HPCC on two clusters: >> In one cluster, each node is booted using Boot over IB, it has no >> harddisk, so NO swap space. We run 64 HPCC tasks on 8 nodes (so each >> CPU core in the node will run one HPCC task), when each HPCC task use >> 1.2/1.3G memory, it will be killed by OS because of "Out of memory" >> error. But when MV2_USE_LAZY_MEM_UNREGISTER=0, task can use 1.7G >> memory and run successfully. >> >> In another cluster, each node has harddisk, it booted from local disk, >> and it HAS space space. We run 64 HPCC tasks on 8 nodes too. When each >> HPCC use 1.3G memory, we use "top" to show the memory usage >> information, we found swap will be used when HPCC is running for a >> while, and the node begin to run very slowly and cannot respond to >> keyboard input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can >> be set to 1.7G memory scale and run successfully. >> >> I tried another mvapich2 parameters: MV2_USE_LAZY_MEM_UNREGISTER=1, >> and MV2_NDREG_ENTRIES=8. In this configuration, HPCC is still be >> killed by OS with "Out of memory" error when the memory scale of each >> task is set to 1.3GB. >> >> 2009/2/5 Matthew Koop : >> > Hi, >> > >> > What OS/distro are you running? Are there any changes you made, such as >> > page size, etc from the base? >> > >> > I'm taking a look at this issue on our machine as well, although I'm not >> > seeing the memory change that you reported. >> > >> > Matt >> > >> > >> > > -------------- next part -------------- A non-text attachment was scrubbed... Name: hpccinf.txt.13 Type: application/octet-stream Size: 1429 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090209/c3aaf3d2/hpccinf.txt.obj -------------- next part -------------- A non-text attachment was scrubbed... 
Name: hpccinf.txt.16
Type: application/octet-stream
Size: 1429 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090209/c3aaf3d2/hpccinf.txt-0001.obj

From Huizhong.Lu at USherbrooke.ca Mon Feb 9 09:45:30 2009
From: Huizhong.Lu at USherbrooke.ca (Huizhong Lu)
Date: Mon Feb 9 09:45:42 2009
Subject: [mvapich-discuss] mvapich2_intel64/0.9.8-15 ofed/1.2.5.5
Message-ID: <1234190730.4990418ac297a@www.usherbrooke.ca>

Hello to all,

We have installed mvapich2_intel64/0.9.8-15(ofed/1.2.5.5)
on our cluster.

When I run a simple MPI test program (Fortran):
   call MPI_INIT (code)
   call MPI_Get_processor_name(arg1, 20, code)
   call mpi_barrier( mpi_comm_world, code )
   call MPI_FINALIZE(code)

the code crashes and gives the following message:
rank 3 in job 1 cp564_58257 caused collective abort of all ranks
 exit status of rank 3: killed by signal 9

Does somebody have an idea about the error?

thanks,

HuiZhong

From panda at cse.ohio-state.edu Mon Feb 9 10:38:50 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon Feb 9 10:38:56 2009
Subject: [mvapich-discuss] mvapich2_intel64/0.9.8-15 ofed/1.2.5.5
In-Reply-To: <1234190730.4990418ac297a@www.usherbrooke.ca>
Message-ID:

> We have installed mvapich2_intel64/0.9.8-15(ofed/1.2.5.5)
> on our cluster.

Looks like you are using a very old version of mvapich2 (0.9.8), released
in Nov '06. The OFED version is also very old (1.2.5.5). The latest
released OFED version is 1.4 and it contains mvapich2 1.2p1. You can also
download the latest mvapich2 version from the mvapich web page. Please use
the latest versions and let us know if the problem persists.
Thanks,

DK

> When I run a simple MPI test program (Fortran):
> call MPI_INIT (code)
> call MPI_Get_processor_name(arg1, 20, code)
> call mpi_barrier( mpi_comm_world, code )
> call MPI_FINALIZE(code)
>
> the code crashes and gives the following message:
> rank 3 in job 1 cp564_58257 caused collective abort of all ranks
> exit status of rank 3: killed by signal 9
>
> Does somebody have an idea about the error?
>
> thanks,
>
> HuiZhong

From sssmolniy at mail.ru Fri Feb 13 14:09:53 2009
From: sssmolniy at mail.ru (wind wind)
Date: Fri Feb 13 14:10:02 2009
Subject: [mvapich-discuss] mvapich infiniband with LMC. How to set rank-lid?
Message-ID:

The documentation says that mvapich and IB support LMC. How do I assign a
specific MPI process to a specific LID? Opensm can create new LIDs for a
device (in LMC mode). mvapich also has a flag to enable LMC,
but the MPI process (rank) does not use the created multi-LID. How do I set
the rank-to-LID mapping?

Suggestions?

Thank you!

From koop at cse.ohio-state.edu Mon Feb 16 13:57:29 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Mon Feb 16 13:57:36 2009
Subject: [mvapich-discuss] mvapich infiniband with LMC. How to set rank-lid?
In-Reply-To:
Message-ID:

Hi,

First, are you using MVAPICH or MVAPICH2, and what version?

Right now when using LMC each port will get multiple LID values. MVAPICH
will take advantage of these various LIDs (and new paths) to increase
performance. As such, in MVAPICH2 a single process can be associated with
multiple LIDs. Is there some reason you need to do specific process-LID
mapping?

Thanks,

Matt

On Fri, 13 Feb 2009, wind wind wrote:

> The documentation says that mvapich and IB support LMC. How do I assign a
> specific MPI process to a specific LID? Opensm can create new LIDs for a
> device (in LMC mode).
> mvapich also has a flag to enable LMC,
> but the MPI process (rank) does not use the created multi-LID. How do I set
> the rank-to-LID mapping?
>
> Suggestions?
>
> Thank you!

From sssmolniy at mail.ru Mon Feb 16 15:33:22 2009
From: sssmolniy at mail.ru (wind wind)
Date: Mon Feb 16 15:33:33 2009
Subject: Re[3]: [mvapich-discuss] mvapich infiniband with LMC. How to set rank-lid?
Message-ID:

Hi

Thanks for the answer.

MVAPICH 0.9.9

For example I have

host1 ib0 base lid 10, multi lid 11
host2 ib0 base lid 15, multi lid 16

Topology:

host0 - switch0 - host1
    \             /
       switch1

I can change the routes between host0 and host1:

host0 -> switch0 -> host1
host1 <- switch1 <- host0

LMC = 1

And I want to set this mapping:

mpi proc 0 rank 0 lid 11
mpi proc 1 rank 1 lid 16

How can I do it?

From vivekg at cdac.in Tue Feb 17 07:36:08 2009
From: vivekg at cdac.in (Vivek Gavane)
Date: Tue Feb 17 09:08:15 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Job terminates with error
Message-ID:

Hello,
I have mvapich2-1.2 compiled with the following options:

/configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg
--enable-debuginfo --with-ib-include=/opt/OFED/include
--with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2

After I submit a job, the job completes but the following errors are
reported on the console:

-------------------------------------------------------------
send desc error
Exit code -5 signaled from ibc0-16
Killing remote processes...[14] Abort: [] Got completion with error 12,
vendor code=81, dest rank=0
at line 553 in file ibv_channel_manager.c
MPI process terminated unexpectedly
DONE
------------------------------------------------------------

And in the redirected output file, the following errors are reported at the
end:
-----------------------------------------
cleanupSignal 15 received.
Signal 15 received.
Signal 15 received.
Signal 15 received.
-----------------------------------------

Does anyone know the reason for this?

Thanks in advance.

--
Regards,
Vivek Gavane

From panda at cse.ohio-state.edu Tue Feb 17 09:57:36 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Tue Feb 17 09:57:40 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Job terminates with error
In-Reply-To:
Message-ID:

Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the
system is really large then it could be congestion.

Thanks,

DK

On Tue, 17 Feb 2009, Vivek Gavane wrote:

> Hello,
> I have mvapich2-1.2 compiled with the following options:
>
>
> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg
> --enable-debuginfo --with-ib-include=/opt/OFED/include
> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2
>
> After I submit a job, the job completes but the following errors are
> reported on the console:
>
> -------------------------------------------------------------
> send desc error
> Exit code -5 signaled from ibc0-16
> Killing remote processes...[14] Abort: [] Got completion with error 12,
> vendor code=81, dest rank=0
> at line 553 in file ibv_channel_manager.c
> MPI process terminated unexpectedly
> DONE
> ------------------------------------------------------------
>
> And in the redirected output file, the following errors are reported at the
> end:
> -----------------------------------------
> cleanupSignal 15 received.
> Signal 15 received.
> Signal 15 received.
> Signal 15 received.
> -----------------------------------------
>
> Does anyone know the reason for this?
>
> Thanks in advance.
> --
> Regards,
> Vivek Gavane

From pourreza at cs.umanitoba.ca Tue Feb 17 11:56:27 2009
From: pourreza at cs.umanitoba.ca (Hossein Pourreza)
Date: Tue Feb 17 12:04:06 2009
Subject: [mvapich-discuss] delay in spawning
Message-ID: <20090217165627.GA23968@zinc.cs.umanitoba.ca>

Hi,

I used to experience a very long delay before my tasks started executing with
MVAPICH-0.9.8, and I reported that problem. I was hoping to see a fix in the new
version, but I still have the same problem with MVAPICH-1.2. The configuration of
my machine is: (I use SDR InfiniBand on a PCI_X bus)


uname -v = Generic_137112-07

/usr/bin/uname -p = i386
/bin/uname -X = System = SunOS
Node = power
Release = 5.10
KernelID = Generic_137112-07
Machine = i86pc
BusType =
Serial =
Users =
OEM# = 0
Origin# = 1
NumCPU = 4

/bin/arch = i86pc
/usr/bin/arch -k = i86pc
/usr/convex/getsysinfo = unknown
hostinfo = unknown
/bin/machine = unknown

If I try to run my program with more processes than the number of cores on one
node, I experience a very long delay before my task gets started. This delay
increases with the number of processes. I run my tasks using:

mpiexec -np xx ./myprog

I am using mpd as the process placement daemon.

Any help will be greatly appreciated.

From panda at cse.ohio-state.edu Tue Feb 17 12:38:34 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Tue Feb 17 12:38:38 2009
Subject: [mvapich-discuss] delay in spawning
In-Reply-To: <20090217165627.GA23968@zinc.cs.umanitoba.ca>
Message-ID:

You are using the old MPD-based start-up. Starting with MVAPICH2 1.2, a
new scalable and robust mpirun_rsh framework (non-MPD-based) job launching
mechanism has been added. This framework reduces job start-up time
considerably and is also scalable to multi-thousand core clusters.
Please use this and you should see a considerable improvement in your job
launch time. Please refer to the MVAPICH2 user guide at the following URL
for details on using the mpirun_rsh framework:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html

If the problem persists with the new mpirun_rsh framework, let us know.

DK

On Tue, 17 Feb 2009, Hossein Pourreza wrote:

> Hi,
>
> I used to experience a very long delay before my tasks started executing with
> MVAPICH-0.9.8, and I reported that problem. I was hoping to see a fix in the new
> version, but I still have the same problem with MVAPICH-1.2. The configuration of
> my machine is: (I use SDR InfiniBand on a PCI_X bus)
>
>
> uname -v = Generic_137112-07
>
> /usr/bin/uname -p = i386
> /bin/uname -X = System = SunOS
> Node = power
> Release = 5.10
> KernelID = Generic_137112-07
> Machine = i86pc
> BusType =
> Serial =
> Users =
> OEM# = 0
> Origin# = 1
> NumCPU = 4
>
> /bin/arch = i86pc
> /usr/bin/arch -k = i86pc
> /usr/convex/getsysinfo = unknown
> hostinfo = unknown
> /bin/machine = unknown
>
> If I try to run my program with more processes than the number of cores on one
> node, I experience a very long delay before my task gets started. This delay
> increases with the number of processes. I run my tasks using:
>
> mpiexec -np xx ./myprog
>
> I am using mpd as the process placement daemon.
>
> Any help will be greatly appreciated.

From sssmolniy at mail.ru Tue Feb 17 13:45:00 2009
From: sssmolniy at mail.ru (wind wind)
Date: Tue Feb 17 14:20:24 2009
Subject: [mvapich-discuss] [mpich-discuss] mvapich and infiniband with LMC. How to set mpi rank - specific lid?
In-Reply-To:
References:
Message-ID:

Hi

How do I map a rank to a specific LID?
I use MVAPICH 0.9.9

For example I have

host0 ib0 base lid 10, multi lid 11 multi lid 12 ...
host1 ib0 base lid 15, multi lid 16 multi lid 17 ...
host2 ib0 base lid 20, multi lid 21 multi lid 22 ...

LMC = 2

Topology:

        switch0
        /     \
switch1 ------- switch2
   /     \     \
host0  host1  host2

I can set these routes using LMC:

host1 -> switch2 -> switch1 -> host0
host2 -> switch2 -> switch0 -> switch1 -> host0

How can I set this mapping:

mpi proc 0 rank 0 lid 10
mpi proc 1 rank 1 lid 16
mpi proc 2 rank 2 lid 22

Does anyone know anything about this?

Thanks in advance.

From pourreza at cs.umanitoba.ca Tue Feb 17 16:56:23 2009
From: pourreza at cs.umanitoba.ca (Hossein Pourreza)
Date: Tue Feb 17 17:15:49 2009
Subject: [mvapich-discuss] delay in spawning
In-Reply-To:
References: <20090217182157.GB23968@zinc.cs.umanitoba.ca>
Message-ID: <20090217215623.GC23968@zinc.cs.umanitoba.ca>

I checked the FAQ and there is a *similar* problem. In my case the error message
does not say anything about mpispawn; it reads:
/usr/bin/env: No such file or directory

I did, however, include the full path to mpirun_rsh and ran my program, but there
was no output (not even an error message). I ran the top command in another window
and it was showing many (maybe 10) ssh commands running.

I checked the debug output and it seems that execv is missing on my computer, which
may be causing the problem. I used your Google-enabled search tool but there is
nothing about "execv".

Also, as I mentioned earlier, running mpirun_rsh with the -v option gives an
"Unknown option" error. Again, I searched your web site and there is no posting
regarding this issue.

Thanks

On Tue, Feb 17, 2009 at 02:21:45PM -0500, Dhabaleswar Panda wrote:
> Hossein,
>
> This error and its solution are indicated in the FAQ page of the user
> guide at the following URL:
>
> http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-520009.2.3
>
> Let us know if this solves your problem.
> > FYI, MVAPICH web site (including user guide and postings on > mvapich-discuss) are enabled by Google Search. If you encounter any > problem, please do a search first and you will get an answer quickly if > this has been reported/resolved earlier. > > DK > > On Tue, 17 Feb 2009, Hossein Pourreza wrote: > > > Thanks for the reply. > > > > Situation got worse :) > > > > I am getting the following error: > > > > /usr/bin/env: No such file or directory > > > > When I ran the mpirun_rsh with -v option it gave me: > > > > Unknown option > > > > and finally with -debug option I am getting following output: > > > > $mpirun_rsh -debug -np 32 -hostfile hosts ./bin/ep.B.32 > > > > execv: No such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power02 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=0 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=1 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=2 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=3 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power03 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > 
MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=4 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=5 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=6 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=7 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power04 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=2 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=8 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=9 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=10 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=11 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power06 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=4 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=16 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=17 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=18 
MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=19 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power05 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=3 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=12 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=13 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=14 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=15 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power07 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=5 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=20 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=21 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=22 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=23 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power08 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > 
LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=6 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=24 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=25 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=26 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=27 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn execv: No > > such file or directory > > /usr/X11R6/bin/xterm -e /usr/bin/ssh -q power09 cd > > /stage/Benchmarks/NPB2.4/NPB2.4-MPI; /usr/bin/env > > LD_LIBRARY_PATH=/usr/mvapich/lib/shared:/stage/mvapich2.1/lib: > > MPISPAWN_MPIRUN_MPD=0 MPISPAWN_MPIRUN_HOST=power01 MPISPAWN_CHECKIN_PORT=33209 > > MPISPAWN_MPIRUN_PORT=33209 MPISPAWN_GLOBAL_NPROCS=32 MPISPAWN_MPIRUN_ID=11731 > > MPISPAWN_ARGC=2 MPISPAWN_ARGV_0=/usr/bin/gdb MPISPAWN_ARGV_1=./bin/ep.B.32 > > MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=7 MPISPAWN_LOCAL_NPROCS=4 > > MPISPAWN_WORKING_DIR=/stage/Benchmarks/NPB2.4/NPB2.4-MPI > > MPISPAWN_MPIRUN_RANK_0=28 MPISPAWN_VIADEV_DEFAULT_PORT_0=-1 > > MPISPAWN_MPIRUN_RANK_1=29 MPISPAWN_VIADEV_DEFAULT_PORT_1=-1 > > MPISPAWN_MPIRUN_RANK_2=30 MPISPAWN_VIADEV_DEFAULT_PORT_2=-1 > > MPISPAWN_MPIRUN_RANK_3=31 MPISPAWN_VIADEV_DEFAULT_PORT_3=-1 mpispawn > > Child exited abnormally! > > cleanupKilling remote processes...DONE > > > > On Tue, Feb 17, 2009 at 12:38:34PM -0500, Dhabaleswar Panda wrote: > > > You are using the old MPD-based start-up. Starting with MVAPICH2 1.2, a > > > new scalable and robust mpirun_rsh framework (non-MPD-based) job launching > > > mechanism has been added. 
--
Hossein Pourreza                     e-mail:
Department of Computer Science       URL: http://www.cs.umanitoba.ca/~pourreza
University of Manitoba               Phone: 204-474-8391
Winnipeg, Manitoba, Canada R3T 2N2

From sridharj at cse.ohio-state.edu Tue Feb 17 17:39:00 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Tue Feb 17 17:39:07 2009
Subject: [mvapich-discuss] delay in spawning
In-Reply-To: <20090217215623.GC23968@zinc.cs.umanitoba.ca>
References: <20090217182157.GB23968@zinc.cs.umanitoba.ca> <20090217215623.GC23968@zinc.cs.umanitoba.ca>
Message-ID: <1234910340.18758.11.camel@t13.nowlab.cis.ohio-state.edu>

Hossein,

Can you forward the exact command line you used to invoke mpirun_rsh? If
anything else is printed before or after the error message, please
forward that too. Also, from the same environment and working directory,
can you send us the output of the following commands?

$ which mpirun_rsh
$ which mpispawn
$ which env

I suspect a mixup between different versions of mpirun_rsh, since the -v
option doesn't work for you. You can also try running a simple MPI
program such as osu_latency between two nodes to see if your environment
is otherwise fine.

Note: if you use the -debug option, xterm needs to be installed / linked
at /usr/X11R6/bin/xterm.

-Jaidev

On Tue, 2009-02-17 at 15:56 -0600, Hossein Pourreza wrote:
> I checked the FAQ and there is a *similar* problem. In my case the error message
> does not say anything about mpispawn; it reads:
> /usr/bin/env: No such file or directory
>
> I did, however, include the full path to mpirun_rsh and ran my program, but there
> was no output (not even an error message). I ran the top command in another window
> and it was showing many (maybe 10) ssh commands running.
> > > > > _______________________________________________
> > > > > mvapich-discuss mailing list
> > > > > mvapich-discuss@cse.ohio-state.edu
> > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > >
> > > >

From vivekg at cdac.in Wed Feb 18 06:01:40 2009
From: vivekg at cdac.in (Vivek Gavane)
Date: Wed Feb 18 06:06:41 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error
In-Reply-To:
References:
Message-ID:

Sir,
Thank you for the reply, but the cable and switch seem to be fine. Is there any other possible reason or solution for the errors? Also, the application program gives complete and correct output, except for the errors at the end.

Thanks.
--
Regards,
Vivek Gavane

Member Technical Staff
Bioinformatics team,
Scientific & Engineering Computing Group,
National PARAM Supercomputing Facility,
Centre for Development of Advanced Computing,
Pune-411007.

Phone: +91 20 25704100 ext. 195
Direct Line: +91 20 25704195

On Tue, Feb 17, 2009, Dhabaleswar Panda said:

> Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the
> system is really large then it could be congestion.
> > Thanks, > > DK > > On Tue, 17 Feb 2009, Vivek Gavane wrote: > >> Hello, >> I have mvapich2-1.2 compiled with the following options: >> >> >> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg >> --enable-debuginfo --with-ib-include=/opt/OFED/include >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2 >> >> After I submit a job, the job completes but the following errors are >> reported on the console: >> >> ------------------------------------------------------------- >> send desc error >> Exit code -5 signaled from ibc0-16 >> Killing remote processes...[14] Abort: [] Got completion with error 12, >> vendor code=81, dest rank=0 >> at line 553 in file ibv_channel_manager.c >> MPI process terminated unexpectedly >> DONE >> ------------------------------------------------------------ >> >> And in the redirected output file, following errors are reported at the >> end: >> ----------------------------------------- >> cleanupSignal 15 received. >> Signal 15 received. >> Signal 15 received. >> Signal 15 received. >> ----------------------------------------- >> >> Do anyone know the reason for this? >> >> Thanks in advance. >> -- >> Regards, >> Vivek Gavane >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > From deamon.net at hotmail.com Wed Feb 18 10:02:34 2009 From: deamon.net at hotmail.com (=?gb2312?B?wfXWvse/?=) Date: Wed Feb 18 11:39:28 2009 Subject: [mvapich-discuss] During intra-node pingpong test, why bandwidth lowers when message size heightens? Message-ID: When I made a intra-node pingpong test in a SMP with MVAPICH2 1.0.3 and NetPipe, I found that the bandwidth lowers when messagesize heightens. Why? Hope to get an answer. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090218/1d917393/attachment.html

From chai.15 at osu.edu Wed Feb 18 13:18:20 2009
From: chai.15 at osu.edu (Lei Chai)
Date: Wed Feb 18 13:18:13 2009
Subject: [mvapich-discuss] During intra-node pingpong test, why bandwidth lowers when message size heightens?
In-Reply-To:
References:
Message-ID: <499C50EC.3020301@osu.edu>

Hi,

I believe the bandwidth only drops when the message size is very large, on the order of a few MB. For messages that large, the buffers no longer fit in cache, so the bandwidth becomes lower.

Lei

刘志强 wrote:
> When I made a intra-node pingpong test in a SMP with MVAPICH2 1.0.3
> and NetPipe, I found that the bandwidth lowers when messagesize heightens.
> Why? Hope to get an answer.
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From panda at cse.ohio-state.edu Wed Feb 18 16:24:45 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Feb 18 16:24:49 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error
In-Reply-To:
Message-ID:

Vivek,

Do you always see this error when you run this application? Do you also see it when you run the application on a different set of nodes? If it happens every time (irrespective of runs and nodes), would it be possible for you to send us a code snippet that reproduces the problem? This will help us investigate the issue further.
Thanks, DK > Sir, > Thank you for the reply but the cable and switch seems to be fine. Is > there any other reason/solution for the errors. And also the application > program is giving complete and correct output except for the errors at the > end. > > Thanks. > -- > Regards, > Vivek Gavane > > Member Technical Staff > Bioinformatics team, > Scientific & Engineering Computing Group, > National PARAM Supercomputing Facility, > Centre for Development of Advanced Computing, > Pune-411007. > > Phone: +91 20 25704100 ext. 195 > Direct Line: +91 20 25704195 > > On Tue, Feb 17, 2009, Dhabaleswar Panda said: > > > Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the > > system is really large then it could be congestion. > > > > Thanks, > > > > DK > > > > On Tue, 17 Feb 2009, Vivek Gavane wrote: > > > >> Hello, > >> I have mvapich2-1.2 compiled with the following options: > >> > >> > >> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg > >> --enable-debuginfo --with-ib-include=/opt/OFED/include > >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2 > >> > >> After I submit a job, the job completes but the following errors are > >> reported on the console: > >> > >> ------------------------------------------------------------- > >> send desc error > >> Exit code -5 signaled from ibc0-16 > >> Killing remote processes...[14] Abort: [] Got completion with error 12, > >> vendor code=81, dest rank=0 > >> at line 553 in file ibv_channel_manager.c > >> MPI process terminated unexpectedly > >> DONE > >> ------------------------------------------------------------ > >> > >> And in the redirected output file, following errors are reported at the > >> end: > >> ----------------------------------------- > >> cleanupSignal 15 received. > >> Signal 15 received. > >> Signal 15 received. > >> Signal 15 received. > >> ----------------------------------------- > >> > >> Do anyone know the reason for this? > >> > >> Thanks in advance. 
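For readers following this thread, the "error 12" in the abort line quoted above appears to be the raw completion status reported by the verbs layer. A minimal sketch for pulling the numbers out of such a line and naming the status is given below. The status names are a subset of the ibv_wc_status enum from libibverbs, where 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which is consistent with the "timeout" description. That MVAPICH2 prints this enum value directly is an assumption, and the regular expression and the helper name decode_abort_line are hypothetical, written only to fit the message format shown in this thread.

```python
import re

# Subset of ibv_wc_status values from libibverbs; only a few are listed here.
WC_STATUS_NAMES = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retry counter exceeded
}

# Hypothetical pattern, matched only against the abort line quoted in this thread.
ABORT_RE = re.compile(
    r"\[(?P<rank>\d+)\] Abort: .*?Got completion with error (?P<status>\d+),\s*"
    r"vendor code=(?P<vendor>[0-9A-Fa-f]+), dest rank=(?P<dest>\d+)"
)


def decode_abort_line(line):
    """Return the fields of an abort line as a dict, or None if it does not match."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    status = int(m.group("status"))
    return {
        "rank": int(m.group("rank")),
        "status": status,
        "status_name": WC_STATUS_NAMES.get(status, "unknown"),
        # Kept as text: whether the vendor code is decimal or hex is not stated here.
        "vendor_code": m.group("vendor"),
        "dest_rank": int(m.group("dest")),
    }


if __name__ == "__main__":
    sample = ("[14] Abort: [] Got completion with error 12, "
              "vendor code=81, dest rank=0")
    print(decode_abort_line(sample))
```

A retry-exceeded status only says that the sender gave up waiting for an acknowledgement; it does not by itself distinguish a faulty link from a peer process that has already exited.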
> >> -- > >> Regards, > >> Vivek Gavane > >> _______________________________________________ > >> mvapich-discuss mailing list > >> mvapich-discuss@cse.ohio-state.edu > >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> > > > > From subramon at cse.ohio-state.edu Wed Feb 18 21:37:38 2009 From: subramon at cse.ohio-state.edu (Hari Subramoni) Date: Wed Feb 18 21:37:45 2009 Subject: [mvapich-discuss] [mpich-discuss] mvapich and infiniband with LMC. How to set mpi rank - specific lid? In-Reply-To: Message-ID: Hi, You are using a very old version of MVAPICH. We strongly recommend that you upgrade to the latest version which gives you the best performance and more features. You can download the latest release from the following website http://mvapich.cse.ohio-state.edu/download/mvapich/ MVAPICH-1.1 allows you to choose multiple paths between end nodes by the use of the environment variable "VIADEV_USE_LMC". Please set this to '1' to enable this feature. Please refer to MVAPICH user guide available at the following link for more information. http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide-1.1.html#x1-790009.2.6 Thx, Hari. On Tue, 17 Feb 2009, wind wind wrote: > Hi > How to set rank to specific lid ? > I use MVAPICH 0.9.9 > > For example I have > host0 ib0 base lid 10, multi lid 11 multi lid 12 ... > host1 ib0 base lid 15, multi lid 16 multi lid 17 ... > host2 ib0 base lid 20, multi lid 21 multi lid 22 ... > LMC = 2 > > Topology: > switch0 > / \ > switch1 ------- switch2 > / \ \ > host0 host1 host2 > > I can to set this routs with using LMC > host1 -> switch2 -> switch1 -> host0 > host2 -> switch2 -> switch0 -> switch1 -> host0 > > How i can set this ratio: > mpi proc 0 rank 0 lid 10 > mpi proc 1 rank 1 lid 16 > mpi proc 2 rank 2 lid 22 > > Do anyone know anything about this ? > Thanks in advance. 
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From vivekg at cdac.in Fri Feb 20 01:02:30 2009
From: vivekg at cdac.in (Vivek Gavane)
Date: Fri Feb 20 01:08:20 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error
In-Reply-To:
References:
Message-ID:

Sir,
I have tried different sets of nodes over various runs, and the same error is reported each time. But when I tried a small number of cores, i.e. 8, the job never exited even though it had completed and the output file was generated. The processes also showed 99.9% CPU usage even after the complete output was generated.

The application code I am using is MEME version meme3.0.3
http://meme.nbcr.net/downloads/old_versions/

I also installed the newer version of MEME, meme_4.1.0
http://meme.nbcr.net/downloads/

It also gives the following error every time, on different sets of nodes:
-----------------------------------
Exit code -5 signaled from ibc0-27
Killing remote processes...MPI process terminated unexpectedly
DONE
-----------------------------------

The redirected output file of the application contains:
-----------------------------
cleanupSignal 15 received.
-----------------------------

Thanks.
--
Regards,
Vivek Gavane

Member Technical Staff
Bioinformatics team,
Scientific & Engineering Computing Group,
National PARAM Supercomputing Facility,
Centre for Development of Advanced Computing,
Pune-411007.

Phone: +91 20 25704100 ext. 195
Direct Line: +91 20 25704195

On Thu, Feb 19, 2009, Dhabaleswar Panda said:

> Vivek,
>
> Do you see this error always when you run this application? Do you see
> this error when you run your application on different set of nodes? If
> this happens always (irrespective of runs and nodes), will it be possible
> for you to send us a code snippet which reproduces this problem.
>> >> ----------------------------------------- >> >> >> >> Do anyone know the reason for this? >> >> >> >> Thanks in advance. >> >> -- >> >> Regards, >> >> Vivek Gavane >> >> _______________________________________________ >> >> mvapich-discuss mailing list >> >> mvapich-discuss@cse.ohio-state.edu >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> >> > >> >> > From panda at cse.ohio-state.edu Fri Feb 20 08:17:16 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Fri Feb 20 08:17:20 2009 Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error In-Reply-To: Message-ID: Thanks for providing the details and pointer to the code. We will take a look at it. Can you also indicate which version of OFED you are using and the platform details (Intel or AMD and HCA type). DK On Fri, 20 Feb 2009, Vivek Gavane wrote: > Sir, > I have tried for different set of nodes for various runs, the same > error is reported. But when I tried for small number of cores i.e 8 the > job never came out even though it was complete and the output file was > generated. Also the processes were showing 99.9% CPU usage even after > complete output was generated. > > The application code I am using is MEME version meme3.0.3 > http://meme.nbcr.net/downloads/old_versions/ > > Also I installed the newer version of MEME version meme_4.1.0 > http://meme.nbcr.net/downloads/ > > It is also giving the following error everytime on different set of nodes: > ----------------------------------- > Exit code -5 signaled from ibc0-27 > Killing remote processes...MPI process terminated unexpectedly > DONE > ----------------------------------- > > The redirected output file of the application contains: > ----------------------------- > cleanupSignal 15 received. > ----------------------------- > > Thanks. 
> >> > > >> > Thanks, > >> > > >> > DK > >> > > >> > On Tue, 17 Feb 2009, Vivek Gavane wrote: > >> > > >> >> Hello, > >> >> I have mvapich2-1.2 compiled with the following options: > >> >> > >> >> > >> >> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg > >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include > >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2 > >> >> > >> >> After I submit a job, the job completes but the following errors are > >> >> reported on the console: > >> >> > >> >> ------------------------------------------------------------- > >> >> send desc error > >> >> Exit code -5 signaled from ibc0-16 > >> >> Killing remote processes...[14] Abort: [] Got completion with error 12, > >> >> vendor code=81, dest rank=0 > >> >> at line 553 in file ibv_channel_manager.c > >> >> MPI process terminated unexpectedly > >> >> DONE > >> >> ------------------------------------------------------------ > >> >> > >> >> And in the redirected output file, following errors are reported at the > >> >> end: > >> >> ----------------------------------------- > >> >> cleanupSignal 15 received. > >> >> Signal 15 received. > >> >> Signal 15 received. > >> >> Signal 15 received. > >> >> ----------------------------------------- > >> >> > >> >> Do anyone know the reason for this? > >> >> > >> >> Thanks in advance. > >> >> -- > >> >> Regards, > >> >> Vivek Gavane > >> >> _______________________________________________ > >> >> mvapich-discuss mailing list > >> >> mvapich-discuss@cse.ohio-state.edu > >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> >> > >> > > >> > >> > > > > > > From vivekg at cdac.in Fri Feb 20 08:53:03 2009 From: vivekg at cdac.in (Vivek Gavane) Date: Fri Feb 20 08:58:13 2009 Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error In-Reply-To: References: Message-ID: Sir, I am using OFED 1.2.5 and the platform is AMD Opteron. 
We are using "MT47396 Infiniscale-III Mellanox Technologies" switch. The VERBS version is 1.1.0 Thanks. -- Regards, Vivek Gavane On Fri, Feb 20, 2009, Dhabaleswar Panda said: > Thanks for providing the details and pointer to the code. We will take a > look at it. > > Can you also indicate which version of OFED you are using and the platform > details (Intel or AMD and HCA type). > > DK > > On Fri, 20 Feb 2009, Vivek Gavane wrote: > >> Sir, >> I have tried for different set of nodes for various runs, the same >> error is reported. But when I tried for small number of cores i.e 8 the >> job never came out even though it was complete and the output file was >> generated. Also the processes were showing 99.9% CPU usage even after >> complete output was generated. >> >> The application code I am using is MEME version meme3.0.3 >> http://meme.nbcr.net/downloads/old_versions/ >> >> Also I installed the newer version of MEME version meme_4.1.0 >> http://meme.nbcr.net/downloads/ >> >> It is also giving the following error everytime on different set of nodes: >> ----------------------------------- >> Exit code -5 signaled from ibc0-27 >> Killing remote processes...MPI process terminated unexpectedly >> DONE >> ----------------------------------- >> >> The redirected output file of the application contains: >> ----------------------------- >> cleanupSignal 15 received. >> ----------------------------- >> >> Thanks. >> -- >> Regards, >> Vivek Gavane >> >> Member Technical Staff >> Bioinformatics team, >> Scientific & Engineering Computing Group, >> National PARAM Supercomputing Facility, >> Centre for Development of Advanced Computing, >> Pune-411007. >> >> Phone: +91 20 25704100 ext. 195 >> Direct Line: +91 20 25704195 >> >> On Thu, Feb 19, 2009, Dhabaleswar Panda said: >> >> > Vivek, >> > >> > Do you see this error always when you run this application? Do you see >> > this error when you run your application on different set of nodes? 
>> >> > >> >> > Thanks, >> >> > >> >> > DK >> >> > >> >> > On Tue, 17 Feb 2009, Vivek Gavane wrote: >> >> > >> >> >> Hello, >> >> >> I have mvapich2-1.2 compiled with the following options: >> >> >> >> >> >> >> >> >> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg >> >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include >> >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2 >> >> >> >> >> >> After I submit a job, the job completes but the following errors are >> >> >> reported on the console: >> >> >> >> >> >> ------------------------------------------------------------- >> >> >> send desc error >> >> >> Exit code -5 signaled from ibc0-16 >> >> >> Killing remote processes...[14] Abort: [] Got completion with error 12, >> >> >> vendor code=81, dest rank=0 >> >> >> at line 553 in file ibv_channel_manager.c >> >> >> MPI process terminated unexpectedly >> >> >> DONE >> >> >> ------------------------------------------------------------ >> >> >> >> >> >> And in the redirected output file, following errors are reported at the >> >> >> end: >> >> >> ----------------------------------------- >> >> >> cleanupSignal 15 received. >> >> >> Signal 15 received. >> >> >> Signal 15 received. >> >> >> Signal 15 received. >> >> >> ----------------------------------------- >> >> >> >> >> >> Do anyone know the reason for this? >> >> >> >> >> >> Thanks in advance. >> >> >> -- >> >> >> Regards, >> >> >> Vivek Gavane >> >> >> _______________________________________________ >> >> >> mvapich-discuss mailing list >> >> >> mvapich-discuss@cse.ohio-state.edu >> >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> > From vitto.giova at yahoo.it Sun Feb 22 11:33:41 2009 From: vitto.giova at yahoo.it (Vittorio) Date: Sun Feb 22 11:33:49 2009 Subject: [mvapich-discuss] MPI_Send over 2 GB fails Message-ID: <4de51c660902220833g48129763yae88c873153251fe@mail.gmail.com> hello! 
I'm running some performance tests of MVAPICH2 over InfiniBand. The test is very simple: it sends fixed quantities of data from one node to another. From 1 kB to 2 GB there are no problems, but as soon as I try to transfer 4 GB or more I get:

Fatal error in MPI_Send: Internal MPI error!, error stack:
MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, dest=1, tag=1, MPI_COMM_WORLD) failed
(unknown)(): Internal MPI error![cli_0]: aborting job:
Fatal error in MPI_Send: Internal MPI error!, error stack:
MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, dest=1, tag=1, MPI_COMM_WORLD) failed
(unknown)(): Internal MPI error!
rank 0 in job 11 randori_45329 caused collective abort of all ranks
exit status of rank 0: return code 1

The two machines are identical, each with a 64-bit OS and 32 GB of RAM. I also tried the program on a single machine, but I get the same error just after the 2 GB transfer.

I'm pretty sure MPI can send more than 4 GB of data, so I just can't figure out what the problem might be. Any help is really appreciated.

Thanks a lot,
Vittorio

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090222/1b023a07/attachment.html

From koop at cse.ohio-state.edu Sun Feb 22 17:01:47 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Sun Feb 22 17:01:51 2009
Subject: [mvapich-discuss] MPI_Send over 2 GB fails
In-Reply-To: <4de51c660902220833g48129763yae88c873153251fe@mail.gmail.com>
Message-ID:

Vittorio,

This is a known issue with MVAPICH2. Currently, some of the internal data structures within the library are not large enough to handle more than 2 GB of data in a single send operation.

We are planning to fix this in a future release.

Matt

On Sun, 22 Feb 2009, Vittorio wrote:

> hello!
> i'm performing some performance test of mpvapich2 on infiniband: > the test is very simple sending fixed quantities of data from one node to > another. > from 1 kB to 2 GB there are no problems but as soon as i try to transfer 4GB > and above i get > > Fatal error in MPI_Send: Internal MPI error!, error stack: > MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, > dest=1, tag=1, MPI_COMM_WORLD) failed > (unknown)(): Internal MPI error![cli_0]: aborting job: > Fatal error in MPI_Send: Internal MPI error!, error stack: > MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, > dest=1, tag=1, MPI_COMM_WORLD) failed > (unknown)(): Internal MPI error! > rank 0 in job 11 randori_45329 caused collective abort of all ranks > exit status of rank 0: return code 1 > > the two machines are equal with a 64bit OS and equipped with 32 GB of ram. > i also tried the program on a single machine, but i receive the same error > just after the 2 GB transfer. > > i'm pretty sure MPI can send more than 4 GB of data so i just can't figure > out what the problem might be. > any help is really appreciated > thanks a lot > Vittorio > From vitto.giova at yahoo.it Sun Feb 22 17:16:16 2009 From: vitto.giova at yahoo.it (Vittorio) Date: Sun Feb 22 17:16:23 2009 Subject: [mvapich-discuss] MPI_Send over 2 GB fails In-Reply-To: References: <4de51c660902220833g48129763yae88c873153251fe@mail.gmail.com> Message-ID: <4de51c660902221416v1aa83643o86094bb2721fd1df@mail.gmail.com> Thanks, i feared that i had something misconfigured, but i couldn't figure out what luckily apart from this test i don't need to send such large quantities of data thanks again Vittorio On Sun, Feb 22, 2009 at 11:01 PM, Matthew Koop wrote: > Vittorio, > > This is a known issue we have with MVAPICH2. Currently some of the > internal data structures within the library are not large enough to handle > over 2GB of data in a single send operation. 
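Given the limit described here, one possible workaround (not something proposed in this thread, only a sketch) is to split a large transfer into several sends that each stay well under 2 GB. The arithmetic for the failing call is worked through below. The helper name plan_chunks and the 1 GiB default piece size are hypothetical, and the 8-byte size of MPI_UNSIGNED_LONG is an assumption based on the 64-bit hosts mentioned in the report.

```python
def plan_chunks(total_count, elem_size, max_chunk_bytes=1 << 30):
    """Return (offset, count) pairs, in elements, that cover total_count
    elements so that no single piece is larger than max_chunk_bytes."""
    if total_count < 0 or elem_size <= 0:
        raise ValueError("total_count must be >= 0 and elem_size must be > 0")
    per_chunk = max_chunk_bytes // elem_size
    if per_chunk == 0:
        raise ValueError("max_chunk_bytes is smaller than one element")
    chunks = []
    offset = 0
    while offset < total_count:
        count = min(per_chunk, total_count - offset)
        chunks.append((offset, count))
        offset += count
    return chunks


if __name__ == "__main__":
    # The failing call from the report: count=536870912, MPI_UNSIGNED_LONG,
    # assumed here to be 8 bytes per element.
    count, elem_size = 536870912, 8
    print("total bytes:", count * elem_size)
    for offset, n in plan_chunks(count, elem_size):
        # Each pair would correspond to one send and one matching receive.
        print("offset", offset, "count", n, "bytes", n * elem_size)
```

For the call in the report this gives 4294967296 bytes in total, split into four pieces of 134217728 elements (1 GiB) each. The receiver would need to post matching receives in the same order; messages between the same pair of ranks with the same tag and communicator are delivered in order, so the pieces can be reassembled by offset.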
> > We are planning on fixing this in a future release. > > Matt > > On Sun, 22 Feb 2009, Vittorio wrote: > > > hello! > > i'm performing some performance test of mpvapich2 on infiniband: > > the test is very simple sending fixed quantities of data from one node to > > another. > > from 1 kB to 2 GB there are no problems but as soon as i try to transfer > 4GB > > and above i get > > > > Fatal error in MPI_Send: Internal MPI error!, error stack: > > MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, > > dest=1, tag=1, MPI_COMM_WORLD) failed > > (unknown)(): Internal MPI error![cli_0]: aborting job: > > Fatal error in MPI_Send: Internal MPI error!, error stack: > > MPI_Send(192): MPI_Send(buf=0x6020a0, count=536870912, MPI_UNSIGNED_LONG, > > dest=1, tag=1, MPI_COMM_WORLD) failed > > (unknown)(): Internal MPI error! > > rank 0 in job 11 randori_45329 caused collective abort of all ranks > > exit status of rank 0: return code 1 > > > > the two machines are equal with a 64bit OS and equipped with 32 GB of > ram. > > i also tried the program on a single machine, but i receive the same > error > > just after the 2 GB transfer. > > > > i'm pretty sure MPI can send more than 4 GB of data so i just can't > figure > > out what the problem might be. > > any help is really appreciated > > thanks a lot > > Vittorio > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090222/85b82b51/attachment.html From rafaarco at ugr.es Mon Feb 23 04:45:26 2009 From: rafaarco at ugr.es (Rafael Arco Arredondo) Date: Mon Feb 23 04:46:24 2009 Subject: [mvapich-discuss] Errors spawning processes with mpirun_rsh Message-ID: <1235382326.13614.24.camel@boabdilmec.ugr.es> Hello, I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and MVAPICH2 1.2p1. 
As I mentioned in another email to the list some time ago, mpirun_rsh is the only mechanism we can use to create MPI processes in our configuration.

The command issued is:
mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello

And the error reported by mpirun_rsh is:

Exit code -5 signaled from localhost
MPI process terminated unexpectedly
Killing remote processes...DONE

We also got this on some of our machines:

Child exited abnormally!
Killing remote processes...DONE

mpihello is a simple hello world program, and this happens even when the processes are launched on localhost only.

OFED 1.2 provides the underlying InfiniBand libraries, and both MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail option, without XRC, as the user's guide indicates for OFED libraries prior to version 1.3.

Any help will be greatly appreciated.

Thank you in advance,

Rafa

From sridharj at cse.ohio-state.edu Mon Feb 23 09:46:55 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Mon Feb 23 09:47:06 2009
Subject: [mvapich-discuss] Errors spawning processes with mpirun_rsh
In-Reply-To: <1235382326.13614.24.camel@boabdilmec.ugr.es>
References: <1235382326.13614.24.camel@boabdilmec.ugr.es>
Message-ID: <49A2B6DF.5050700@cse.ohio-state.edu>

Hi Rafael,

The message indicates that the application terminated with a non-zero exit code or crashed after launching. Can you check whether it leaves any core files? You may need to set the core file size limit to unlimited. For example, add "ulimit -c unlimited" to your ~/.bashrc.

Can you also give us details of the cluster and any options you've enabled with MVAPICH / MVAPICH2?

-Jaidev

On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote:
> Hello,
>
> I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and
> MVAPICH2 1.2p1. As I commented in another email to the list some time
> ago, mpirun_rsh is the only mechanism we can use to create MPI processes
> in our configuration.
>
> The command issued is:
> mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello
>
> And the error reported by mpirun_rsh is:
>
> Exit code -5 signaled from localhost
> MPI process terminated unexpectedly
> Killing remote processes...DONE
>
> We also got this on some of our machines:
>
> Child exited abnormally!
> Killing remote processes...DONE
>
> mpihello is a simple hello world and this happens even when the
> processes are launched on localhost only.
>
> OFED 1.2 is used as the underlying Infiniband libraries, and both
> MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail
> option, without XRC as indicated in the user's guide for OFED libraries
> prior to version 1.3.
>
> Any help will be kindly appreciated.
>
> Thank you in advance,
>
> Rafa

From rafaarco at ugr.es Mon Feb 23 12:08:35 2009
From: rafaarco at ugr.es (Rafael Arco Arredondo)
Date: Mon Feb 23 12:09:53 2009
Subject: [mvapich-discuss] Errors spawning processes with mpirun_rsh
In-Reply-To: <49A2B6DF.5050700@cse.ohio-state.edu>
References: <1235382326.13614.24.camel@boabdilmec.ugr.es> <49A2B6DF.5050700@cse.ohio-state.edu>
Message-ID: <1235408916.8012.39.camel@localhost>

Hi Jaidev,

Thank you for your prompt reply.

> The message indicates that the application terminated with a non zero
> error code or crashed after launching. Can you check if it leaves any
> core files? You may need to set ulimit to unlimited. For example, add
> ulimit -c unlimited in your ~/.bashrc.

Yes, a core file is generated after adding 'ulimit -c unlimited' to
$HOME/.bashrc.

> Can you also give us details of the cluster and any options you've
> enabled with MVAPICH / MVAPICH2?
It is a cluster of servers with AMD64 Opteron processors, an Infiniband
network and Sun Grid Engine 6.2 as batch scheduler (anyway this error is
reported both when SGE controls the jobs and when it doesn't, when
mpirun_rsh is directly executed from the command line).

In order to compile MVAPICH, the PathScale compiler was used (for which
the make.mvapich.gen2 script was accordingly edited), shared library
support was enabled and the flag -DXRC was removed. The rest of the
options, including the configuration files in $MVAPICH_HOME/etc, weren't
modified (i.e., default values are used).

As for MVAPICH2, it was compiled by invoking the configure script this
way:

./configure --enable-sharedlibs=gcc CC=pathcc F77=pathf90 F90=pathf90 CXX=pathCC

And then plain 'make' and 'make install'. Again, the other options
weren't changed.

MVAPICH and MVAPICH2 compile with no problems, so do programs compiled
with mpicc. However, programs crash on the initialization stage after
launching as you said.

Any ideas?

Thanks again,

Rafa

> On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote:
> > Hello,
> >
> > I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and
> > MVAPICH2 1.2p1. As I commented in another email to the list some time
> > ago, mpirun_rsh is the only mechanism we can use to create MPI processes
> > in our configuration.
> >
> > The command issued is:
> > mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello
> >
> > And the error reported by mpirun_rsh is:
> >
> > Exit code -5 signaled from localhost
> > MPI process terminated unexpectedly
> > Killing remote processes...DONE
> >
> > We also got this on some of our machines:
> >
> > Child exited abnormally!
> > Killing remote processes...DONE
> >
> > mpihello is a simple hello world and this happens even when the
> > processes are launched on localhost only.
> >
> > OFED 1.2 is used as the underlying Infiniband libraries, and both
> > MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail
> > option, without XRC as indicated in the user's guide for OFED libraries
> > prior to version 1.3.
> >
> > Any help will be kindly appreciated.
> >
> > Thank you in advance,
> >
> > Rafa

From sridharj at cse.ohio-state.edu Mon Feb 23 12:25:12 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Mon Feb 23 12:25:18 2009
Subject: [mvapich-discuss] Errors spawning processes with mpirun_rsh
In-Reply-To: <1235408916.8012.39.camel@localhost>
References: <1235382326.13614.24.camel@boabdilmec.ugr.es> <49A2B6DF.5050700@cse.ohio-state.edu> <1235408916.8012.39.camel@localhost>
Message-ID: <1235409912.29473.5.camel@t13.nowlab.cis.ohio-state.edu>

Hi Rafael,

On Mon, 2009-02-23 at 18:08 +0100, Rafael Arco Arredondo wrote:
> Hi Jaidev,
>
> Thank you for your prompt reply.
>
> > The message indicates that the application terminated with a non zero
> > error code or crashed after launching. Can you check if it leaves any
> > core files? You may need to set ulimit to unlimited. For example, add
> > ulimit -c unlimited in your ~/.bashrc.
>
> Yes, a core file is generated after adding 'ulimit -c unlimited' to
> $HOME/.bashrc.

Can you send us the backtrace from this core file -

$ gdb ./mpihello core.xyz
(gdb) bt

If you have core files from both mvapich and mvapich2 runs, we'd like to
see them. This will provide more insights.

It'll be more useful if you can compile the libraries and your
application with debug symbols:
      * For mvapich2, configure the libraries with --enable-g=dbg and
        compile your application with mpicc -g
      * For mvapich, edit make.mvapich.gen2, add -g to CFLAGS and compile
        your application with mpicc -g

-Jaidev

>
> > Can you also give us details of the cluster and any options you've
> > enabled with MVAPICH / MVAPICH2?
>
> It is a cluster of servers with AMD64 Opteron processors, an Infiniband
> network and Sun Grid Engine 6.2 as batch scheduler (anyway this error is
> reported both when SGE controls the jobs and when it doesn't, when
> mpirun_rsh is directly executed from the command line).
>
> In order to compile MVAPICH, the PathScale compiler was used (for which
> the make.mvapich.gen2 script was accordingly edited), shared library
> support was enabled and the flag -DXRC was removed. The rest of the
> options, including the configuration files in $MVAPICH_HOME/etc, weren't
> modified (i.e., default values are used).
>
> As for MVAPICH2, it was compiled by invoking the configure script this
> way:
>
> ./configure --enable-sharedlibs=gcc CC=pathcc F77=pathf90 F90=pathf90 CXX=pathCC
>
> And then plain 'make' and 'make install'. Again, the other options
> weren't changed.
>
> MVAPICH and MVAPICH2 compile with no problems, so do programs compiled
> with mpicc. However, programs crash on the initialization stage after
> launching as you said.
>
> Any ideas?
>
> Thanks again,
>
> Rafa
>
> > On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote:
> > > Hello,
> > >
> > > I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and
> > > MVAPICH2 1.2p1. As I commented in another email to the list some time
> > > ago, mpirun_rsh is the only mechanism we can use to create MPI processes
> > > in our configuration.
> > >
> > > The command issued is:
> > > mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello
> > >
> > > And the error reported by mpirun_rsh is:
> > >
> > > Exit code -5 signaled from localhost
> > > MPI process terminated unexpectedly
> > > Killing remote processes...DONE
> > >
> > > We also got this on some of our machines:
> > >
> > > Child exited abnormally!
> > > Killing remote processes...DONE
> > >
> > > mpihello is a simple hello world and this happens even when the
> > > processes are launched on localhost only.
> > >
> > > OFED 1.2 is used as the underlying Infiniband libraries, and both
> > > MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail
> > > option, without XRC as indicated in the user's guide for OFED libraries
> > > prior to version 1.3.
> > >
> > > Any help will be kindly appreciated.
> > >
> > > Thank you in advance,
> > >
> > > Rafa
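
The exchange above hinges on getting a core file out of the crashing process, which only happens if the core-size limit allows it. As a rough sketch (not part of the original messages), the Python snippet below does from inside a process roughly what `ulimit -c unlimited` attempts from the shell: it raises the soft core-size limit up to the hard limit and reports the result. The helper names `raise_core_limit` and `describe` are hypothetical; only the standard `resource` module is assumed. Whether processes started by mpirun_rsh on remote nodes see the same limit depends on how their shells are set up, which is presumably why ~/.bashrc is suggested in the thread.

```python
import resource


def raise_core_limit():
    """Raise the soft core-file size limit to the hard limit.

    An unprivileged process may raise its soft limit up to (but not
    beyond) the hard limit, so this is the most that can be done
    without extra privileges. Returns the resulting (soft, hard) pair.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def describe(limit):
    """Render an rlimit value (in bytes) in a human-readable form."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = raise_core_limit()
    print("core soft limit:", describe(soft))
    print("core hard limit:", describe(hard))
```

If the hard limit itself is 0, the soft limit stays at 0 and no core file will be written; in that case the limit has to be raised by the administrator (or in the login configuration) rather than by the job.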