From Marcos.Verissimo at uclouvain.be Mon Dec 3 00:56:50 2007
From: Marcos.Verissimo at uclouvain.be (Marcos Verissimo Alves)
Date: Mon Dec 3 00:57:34 2007
Subject: [mvapich-discuss] Questions on running mvapich on a cluster
Message-ID: <4759.88.197.232.230.1196661410.squirrel@mmp-2-1.sipr-dc.ucl.ac.be>

Hi all,

I am new to the list, and maybe the question I have is already answered somewhere else. Since, however, I could not find it in the archives, I'll put it to the more knowledgeable people on the list, because I am not the system administrator.

We have a cluster with an infiniband interconnect, and I have managed to successfully compile mvapich2-0.9.8 (quite easy!). Also, I have compiled the ab initio calculation program SIESTA and have managed to make it run in parallel correctly. However, I am sure that there must be a more intelligent (and correct) way of executing a program in parallel. So here are my questions:

1) Supposing that mvapich were adequately installed by our sysadmin, what exactly is the command to start the mpi daemon on the machines? Currently my script uses the following command:

/usr/bin/rsh -n $machine "/home/pcpm/mverissi/my_mvapich_0.9.8/bin/mpdboot -n 8 -f /home/pcpm/mverissi/mfile.$idm --ncpus=2 --mpd=/home/pcpm/mverissi/my_mvapich_0.9.8/bin/mpd --rsh=rsh"

where mfile.$idm is a file containing the machines onto which the mpd will be started. Thinking about it now, I guess it would be enough to issue the mpdboot command without the /usr/bin/rsh -n $machine part. Is that correct?

2) If the mpi daemon is started by the sysadmin on the nodes, he'll probably start it as root. If he does so, will we "mortal" users be able to run our processes? In other words, can one run a calculation using mvapich even if the owner of the mpi process is root? The reason I ask is the following.
As I guess is customary in many HPC clusters, we have a home directory in which we keep our files, available through NFS, and the calculations are executed in such a way that the (huge) files that contain the data are written to local disks with faster access, then copied to the user's home after the calculation ends. When I run the calculations, the queue system creates a directory on each local disk on the slave nodes, named like /tmp/all.q.98764 . However, the mpd console files are generally created in /tmp/ . I see that my console files have the name mpd2.console_mverissi (my username in the cluster), but I do not know if this is because I started the mpi daemon myself. So, I am not sure if the mpd2.console_xxxxxx files will be created as mpd2.console_root (if root starts the mpd) or if the console files will exist as mpd2.console_xxxxxx if root starts the mpd but user xxxxxx starts the mpi calculation. Hope I am not being extremely confusing here...

3) The last question concerns the .mpd.conf and .mpdpasswd files. To make the calculation run, the only way I could think of was to copy those files to the temporary directory that I mentioned in question 2. However, it would be nice to have mvapich find those files in the users' home dirs (even if the calculation is being run on a scratch disk) instead of the users having to copy them to temporary directories. Is there a way of doing this?

Sorry if the questions are too basic and, even worse, if they have been asked (and answered) before. It's just that our sysadmin has been quite busy lately (as is usual with all sysadmins :D ) and I'd like to get this information to pass it on to him. By the way, if this information could be included in the user's guide, it would be extremely useful.

Thanks in advance,

Marcos

--
Dr. Marcos Verissimo Alves
Post-Doctoral Fellow
Unité de Physico-Chimie et de Physique des Matériaux (PCPM)
Université Catholique de Louvain
1 Place Croix du Sud, B-1348
Louvain-la-Neuve
Belgique

------

Gort, Klaatu barada nikto. Klaatu barada nikto. Klaatu barada nikto.

From huanwei at cse.ohio-state.edu Mon Dec 3 13:50:34 2007
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Mon Dec 3 13:51:15 2007
Subject: [mvapich-discuss] Questions on running mvapich on a cluster
In-Reply-To: <4759.88.197.232.230.1196661410.squirrel@mmp-2-1.sipr-dc.ucl.ac.be>
Message-ID:

Hi Marcos,

Thanks for using mvapich2. Please see inline for more detailed answers to your questions.

> 1) Supposing that mvapich were adequately installed by our sysadmin, what
> exactly is the command to start the mpi daemon on the machines? Currently
> my script uses the following command:
>
> /usr/bin/rsh -n $machine "/home/pcpm/mverissi/my_mvapich_0.9.8/bin/mpdboot
> -n 8 -f /home/pcpm/mverissi/mfile.$idm --ncpus=2
> --mpd=/home/pcpm/mverissi/my_mvapich_0.9.8/bin/mpd --rsh=rsh"
>
> where mfile.$idm is a file containing the machines onto which the mpd will
> be started. Thinking about it now, I guess it would be enough to issue the
> mpdboot command without the /usr/bin/rsh -n $machine part. Is that
> correct?

You should be able to simply run mpdboot on one of your computing nodes (any one). More details can be found in our online user guide:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2.html

Is there any specific reason that you are using your own scripts to start the mpd ring?

> 2) If the mpi daemon is started by the sysadmin on the nodes, probably
> he'll start it as root. If he does so, will we "mortal" users be able to
> run their processes? In other words, can one run a calculation using
> mvapich even if the owner of the mpi process is root?

That is feasible. Users can use root's mpd ring by setting the environment variable MPD_USE_ROOT_MPD=1.
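
A minimal sketch of the setting described just above, assuming a bash/sh login shell and that root has already started the mpd ring on the nodes. The program name and process count in the commented line are placeholders, not taken from this thread:

```shell
# Sketch only: assumes root's mpd ring is already running on the nodes.
# Ask the mpd client tools to attach to root's ring instead of a per-user one.
export MPD_USE_ROOT_MPD=1

# Hypothetical job launch; ./my_mpi_program and -n 8 are placeholders.
# mpiexec -n 8 ./my_mpi_program
```

The variable only has to be visible in the environment of the shell (or job script) that launches the job, so exporting it in the queue system's job script is one possible place for it.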
For more details you can read the installation guide from the MPICH2 website:

http://www.mcs.anl.gov/research/projects/mpich2/

> 3) The last question concerns the .mpd.conf and .mpdpasswd files. To make
> the calculation run, the only way I could think of was to copy those files
> to the temporary directory that I mentioned in question 2. However, it
> would be nice to have mvapich finding those files in the users' home dirs
> (even if the calculation is being run in a scratch disk) instead of the
> users having to copy them to temporary directories. Is there a way of
> doing this?

Currently the mpd daemon looks in the user's home directory for those files. Are you facing any problem when you just have those files in your home directory? Of course, if you use root's mpd, those settings are in /etc/mpd.conf.

Thanks.

-- Wei

From panda at cse.ohio-state.edu Wed Dec 5 10:50:32 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Dec 5 10:51:12 2007
Subject: [mvapich-discuss] MVAPICH+IB] mvapich2-1.0.1 in IB network when openSM with LASH is running
In-Reply-To: <829ded920711280303q6429d1fdxa09d807ff97c25a0@mail.gmail.com>
Message-ID:

On Wed, 28 Nov 2007, Keshetti Mahesh wrote:

> Has anyone in the list ever tested MVAPICH in infiniband network
> in which openSM is running with LASH routing algorithm enabled?

We have not studied this combination completely.

Thanks,

DK

> I also haven't tested the above case but i can foresee a problem
> because LASH routing algorithm in openSM uses virtual
> lanes (VL) which are directly mapped with service levels (SL)
> of infiniband network. And LASH routing algorithm assigns different
> VLs (thus SLs) to different paths in the network. This SL <-> path association
> information is available only through the subnet manager (openSM) during
> connection establishment. But AFAIK, mvapich doesn't use the services of subnet
> manager for connection establishment between nodes.
> So I want to know whether anyone has thought about it and is working
> on it or not.
>
> regards,
> Mahesh
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From ywan at ed.ac.uk Fri Dec 7 11:33:36 2007
From: ywan at ed.ac.uk (Yuan Wan)
Date: Fri Dec 7 12:25:46 2007
Subject: [mvapich-discuss] build mvapich2 with smpd
Message-ID:

Hi all,

I have built mvapich2-1.0 on my cluster following the user manual. It works well.

Now I need to build another mvapich2 using 'smpd' as the startup method instead of 'mpd'. Does anyone know how to do it?

What I have tried is to replace '--with-pm=mpd' by '--with-pm=smpd --with-pmi=smpd' in make.mvapich2.ofa. The build procedure completes, but I cannot run the test code normally.

--Yuan

Yuan Wan
--
Unix Section
Information Services Infrastructure Division
University of Edinburgh

tel: 0131 650 4985
email: ywan@ed.ac.uk

2032 Computing Services, JCMB
The King's Buildings,
Edinburgh, EH9 3JZ

From Christian_Boehme at freenet.de Fri Dec 7 12:49:08 2007
From: Christian_Boehme at freenet.de (Christian Boehme)
Date: Fri Dec 7 12:49:55 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
Message-ID: <47598794.3040205@freenet.de>

Dear list,

we recently encountered a strange problem with MPI_REDUCE in our mvapich-0.9.9 installation. Please consider the following F77 program:

      program reduce_err

      implicit none
c     FORTRAN MPI-INCLUDE-file
      include 'mpif.h'
      integer ierr, nproc, myid
      real*8 x , y

      call MPI_INIT( ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      x = 0
      y = 1
      call MPI_REDUCE( y, x, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 1,
     :                 MPI_COMM_WORLD, ierr )
      write(6,*) myid, ': Value for x after reduce:', x
      call MPI_FINALIZE( ierr )

      stop
      end

Obviously, the output should be the number of processes for myid=1, and zero for all other processes.
This is also what we get when using either one process per node (only Infiniband communication) or putting all processes on one node (only shared memory):

> mpirun_rsh -np 4 gwdm001 gwdm004 gwdm002 gwdm003 reduce_err
> 3 : Value for x after reduce: 0.00000000000000
> 2 : Value for x after reduce: 0.00000000000000
> 1 : Value for x after reduce: 4.00000000000000
> 0 : Value for x after reduce: 0.00000000000000

However, when mixing the two, i.e., utilizing several nodes and more than one process on those nodes, we also get the number of processes for myid=0:

> mpirun_rsh -np 4 gwdm001 gwdm001 gwdm002 gwdm003 reduce_err
> 1 : Value for x after reduce: 4.00000000000000
> 2 : Value for x after reduce: 0.00000000000000
> 3 : Value for x after reduce: 0.00000000000000
> 0 : Value for x after reduce: 4.00000000000000

This behavior is rather unexpected and can seriously break some programs. What could be the problem? Many thanks in advance

Christian Boehme

From panda at cse.ohio-state.edu Sat Dec 8 11:31:48 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Sat Dec 8 11:32:29 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To: <47598794.3040205@freenet.de>
Message-ID:

Thanks for reporting this issue. Can you tell us which version of 0.9.9 you are using (the one available with OFED 1.2 or the one from the OSU site)? Which compiler are you using? Can you also check whether you see the same problem with the latest MVAPICH 1.0-beta (please use the latest version from the trunk)?

In the meantime, we will also investigate this issue further.

Thanks,

DK

On Fri, 7 Dec 2007, Christian Boehme wrote:

> Dear list,
>
> we recently encountered a strange problem with MPI_REDUCE in our
> mvapich-0.9.9 installation.
> [...]
>
> Many thanks in advance
>
> Christian Boehme
>

From mamidala at cse.ohio-state.edu Sun Dec 9 16:03:56 2007
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Sun Dec 9 16:04:37 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To:
Message-ID:

Hi Christian,

Can you also try the patch I am attaching to this mail and let us know how it works?

Thanks,
Amith.

On Sat, 8 Dec 2007, Dhabaleswar Panda wrote:

> Thanks for reporting this issue. Can you tell us which version of 0.9.9
> you are using (the one available with OFED 1.2 or from the OSU site).
> Which compiler are you using? Can you also check whether you see the same
> problem with the latest MVAPICH 1.0-beta (please use the latest version
> from the trunk).
>
> In the mean time, we will also investigate this issue further.
>
> Thanks,
>
> DK
>
>
> On Fri, 7 Dec 2007, Christian Boehme wrote:
>
> > Dear list,
> >
> > we recently encountered a strange problem with MPI_REDUCE in our
> > mvapich-0.9.9 installation. Please consider the following F77 program:
> >
> >       program reduce_err
> >
> >       implicit none
> > c     FORTRAN MPI-INCLUDE-file
> >       include 'mpif.h'
> >       integer ierr, nproc, myid
> >       real*8 x , y
> >
> >       call MPI_INIT( ierr )
> >       call MPI_COMM_SIZE( MPI_COMM_WORLD, nproc, ierr )
> >       call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
> >       x = 0
> >       y = 1
> >       call MPI_REDUCE( y, x, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 1,
> >      :                 MPI_COMM_WORLD, ierr )
> >       write(6,*) myid, ': Value for x after reduce:', x
> >       call MPI_FINALIZE( ierr )
> >
> >       stop
> >       end
> >
> > Obviously, the output should be the number of processes for myid=1, and
> > zero for all other processes.
> > [...]
> >
> > Many thanks in advance
> >
> > Christian Boehme
> >

-------------- next part --------------
Index: intra_fns_new.c
===================================================================
--- intra_fns_new.c (revision 1650)
+++ intra_fns_new.c (working copy)
@@ -5074,7 +5074,7 @@
     MPI_Comm shmem_comm, leader_comm;
     struct MPIR_COMMUNICATOR *comm_ptr = 0,*shmem_commptr = 0, *leader_commptr = 0;
     int local_rank = -1, global_rank = -1, local_size=0, my_rank;
-    void* local_buf=NULL, *tmpbuf=NULL;
+    void* local_buf=NULL, *tmpbuf=NULL, *tmpbuf1=NULL;
     int stride = 0, i, is_commutative;
     int leader_root, total_size=0, shmem_comm_rank;
 
@@ -5156,6 +5156,11 @@
                 MPIR_REDUCE_TAG, comm_ptr->self, &status);
     }
 
+    if (local_rank == 0){
+        MPIR_ALLOC(tmpbuf1, MALLOC(count*extent), comm_ptr, MPI_ERR_EXHAUSTED, myname);
+        tmpbuf1 = (void *)((char*)tmpbuf1 - lb);
+    }
+
     if (local_size > 1){
         MPID_SHMEM_COLL_GetShmemBuf(local_size, local_rank, shmem_comm_rank, &shmem_buf);
     }
@@ -5176,11 +5181,11 @@
         leader_root = comm_ptr->leader_rank[leader_of_root];
         if (local_size != total_size){
             if (local_size > 1){
-                mpi_errno = intra_Reduce(tmpbuf, recvbuf, count, datatype,
+                mpi_errno = intra_Reduce(tmpbuf, tmpbuf1, count, datatype,
                                          op, leader_root, leader_commptr);
             }
             else{
-                mpi_errno = intra_Reduce(sendbuf, recvbuf, count, datatype,
+                mpi_errno = intra_Reduce(sendbuf, tmpbuf1, count, datatype,
                                          op, leader_root, leader_commptr);
             }
         }
@@ -5207,19 +5212,27 @@
         MPID_SHMEM_COLL_SetGatherComplete(local_size, local_rank, shmem_comm_rank);
     }
 
+    if ((local_rank == 0) && (root == my_rank)){
+        mpi_errno = MPI_Sendrecv(tmpbuf1, count, datatype->self, rank,
+                                 MPIR_REDUCE_TAG, recvbuf, count, datatype->self, rank,
+                                 MPIR_REDUCE_TAG, comm_ptr->self, &status);
+        return MPI_SUCCESS;
+    }
+
     /* Copying data from leader to the root incase
      * leader is not the root */
     if (local_size > 1){
         /* Send the message to the root if the leader is not the
          * root of the reduce operation */
+
         if ((local_rank == 0) && (root != my_rank) && (leader_root == global_rank)){
             if (local_size == total_size){
                 mpi_errno = MPI_Send( tmpbuf, count, datatype->self, root,
                                       MPIR_REDUCE_TAG, comm->self );
             }
             else{
-                mpi_errno = MPI_Send( recvbuf, count, datatype->self, root,
+                mpi_errno = MPI_Send( tmpbuf1, count, datatype->self, root,
                                       MPIR_REDUCE_TAG, comm->self );
             }
         }

From lexa at adam.botik.ru Tue Dec 11 09:57:42 2007
From: lexa at adam.botik.ru (Alexei I. Adamovich)
Date: Tue Dec 11 09:58:26 2007
Subject: [mvapich-discuss] Perhaps small bug/fix found for mvapich-1.0-beta
Message-ID: <20071211145742.GA1518@adam.botik.ru>

Hi!

It seems I've found a small cut-and-paste-style inexactness in mpid/ch_gen2_ud/mv_buf.c (in get_rbuf_size); see the diff/patch in the attachment.

Is this one known already?

Sincerely,
Alexei I. Adamovich

-------------- next part --------------
--- mpid/ch_gen2_ud/mv_buf.c 2007-10-26 23:50:29.000000000 +0400
+++ /tmp/mv_buf.c 2007-12-11 16:06:46.000000000 +0300
@@ -100,31 +100,31 @@
 static inline mv_rbuf_size * get_rbuf_size(int rbuf_size) {
     int found = 0, i;
     mv_rbuf_size * b;
 
     for(i = 0; i < mv_rbuf_avail_num; i++) {
         if(mv_rbuf_avail[i].alloc_size == rbuf_size) {
             found = 1;
             b = &(mv_rbuf_avail[i]);
             break;
         }
     }
 
     if(!found) {
         /* TODO: re-order based on size!!! */
-        b = &(mv_rbuf_avail[mv_sbuf_avail_num]);
+        b = &(mv_rbuf_avail[mv_rbuf_avail_num]);
         b->alloc_size = rbuf_size;
         /* TODO: change if using headers */
         b->max_data_size = rbuf_size;
         b->head = NULL;
 
         mv_rbuf_avail_num++;
         D_PRINT("creating new rbuf size: %d\n", rbuf_size);
     }
 
     return b;
 }

From koop at cse.ohio-state.edu Tue Dec 11 10:17:50 2007
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Tue Dec 11 10:18:32 2007
Subject: [mvapich-discuss] Perhaps small bug/fix found for mvapich-1.0-beta
In-Reply-To: <20071211145742.GA1518@adam.botik.ru>
Message-ID:

Alexei,

Thanks for pointing this issue out. I'll apply this to the SVN trunk.

Matt

On Tue, 11 Dec 2007, Alexei I. Adamovich wrote:

> Hi!
>
> Seems, I've found a small cut-an-paste-style inexactness in
> mpid/ch_gen2_us/mv_buf.c (in get_rbuf_size), see diff/patch in attachement.
>
> Is this one known already?
>
> Sincerely,
> Alexei I. Adamovich
>

From StephanGerber at gmx.net Tue Dec 11 10:37:04 2007
From: StephanGerber at gmx.net (Stephan Gerber)
Date: Tue Dec 11 10:37:47 2007
Subject: [mvapich-discuss] port problem using MVAPICH2
Message-ID: <20071211153704.209990@gmx.net>

Dear MVAPICH users and developers,

I have some problems using MVAPICH2. To start with, MVAPICH(1) and OpenMPI are both running on our InfiniBand cluster, but neither scales as well as I would like. So I want to use MVAPICH2 to achieve better results (hopefully...).

My system is a dual-Opteron cluster with 4 nodes; each node has 2 processors, each with two cores.

I tried booting with:

mpdboot --totalnum=4 -1 --file=/Users/gerber/mac/mpd.hosts --rsh=ssh --verbose --ncpus=16

In this case I end up with the following error message:

mpdboot_n01.local (handle_mpd_output 373): from mpd on n01, invalid port info:

Does anyone know what the problem might be and how to solve it? If I use the --chkup(only) option, I see that only one node out of four is up!?
If I don't use the option --totalnum=4, mpdboot works fine, but after running mpdtrace I still see that only one host is up...

For the second boot approach I tried starting mpiexec, but I end up with the following error:

rank 3 in job 1 n01.local_39320 caused collective abort of all ranks
  exit status of rank 3: return code 13
[rdma_iba_init.c:91] Error initializing MVAPICH2 malloc library
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(230): Initialization failed
MPID_Init(81)........: channel initialization failed
(unknown)(): Other MPI error[gerber@n01]

Any help would be appreciated! Thanks in advance,

br
stephan

From Christian_Boehme at freenet.de Tue Dec 11 10:39:35 2007
From: Christian_Boehme at freenet.de (Christian Boehme)
Date: Tue Dec 11 10:40:17 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To:
References:
Message-ID: <475EAF37.4090805@freenet.de>

Hi Amith,

> Can you also try the patch I am attaching with this mail and let us know
> how it works?
>

Now that the patch seems to work, would it be possible to get a similar patch for MVAPICH2 (version 0.9.8)? Many thanks

Christian Boehme

From stoffel at sgi.com Wed Dec 12 10:45:57 2007
From: stoffel at sgi.com (Jim Stoffel)
Date: Wed Dec 12 11:00:13 2007
Subject: [mvapich-discuss] Specify hca device per node in a given run
Message-ID: <47600235.9000708@sgi.com>

Is there a way to specify the HCA device on a per-node basis for a single instance of an MPI application execution? DAPL_PROVIDER (mvapich2) or VIADEV_DEVICE (mvapich) can be used to specify the device, but the setting applies to all nodes in the run. Can the device be set for each individual node in a paramfile or a hostlist file?

I would somehow like to have data transfer through hca0 of node0 and hca1 of node1.

node0 DAPL_PROVIDER OpenIB-cma
node1 DAPL_PROVIDER OpenIB-cma-1

Please advise.
Thanks,
Jim

From stoffel at sgi.com Wed Dec 12 11:44:33 2007
From: stoffel at sgi.com (Jim Stoffel)
Date: Wed Dec 12 11:45:15 2007
Subject: [mvapich-discuss] [Fwd: Specify hca device per node in a given run]
Message-ID: <47600FF1.4040609@sgi.com>

Sorry if you receive this twice. Sending again as a member this time.

Is there a way to specify the HCA device on a per-node basis for a single instance of an MPI application execution? DAPL_PROVIDER (mvapich2) or VIADEV_DEVICE (mvapich) can be used to specify the device, but the setting applies to all nodes in the run. Can the device be set for each individual node in a paramfile or a hostlist file?

I would somehow like to have data transfer through hca0 of node0 and hca1 of node1.

node0 DAPL_PROVIDER OpenIB-cma
node1 DAPL_PROVIDER OpenIB-cma-1

Please advise.
Thanks,
Jim

From chai.15 at osu.edu Wed Dec 12 13:39:27 2007
From: chai.15 at osu.edu (LEI CHAI)
Date: Wed Dec 12 13:40:09 2007
Subject: [mvapich-discuss] Specify hca device per node in a given run
Message-ID: <10b1b410e372.10e37210b1b4@osu.edu>

Hi Jim,

To use different dapl providers on different nodes with mvapich2, we have a solution that was posted earlier; please take a look:

http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/2007-July/000981.html

Currently you cannot specify different device names (VIADEV_DEVICE) per node for mvapich, though. It will automatically detect the first working device and use it.

Lei

----- Original Message -----
From: Jim Stoffel
Date: Wednesday, December 12, 2007 10:45 am
Subject: [mvapich-discuss] Specify hca device per node in a given run

> Is there a way to specify the HCA device on a per node
> basis for a single instance of a MPI application execution?
> DAPL_PROVIDER (mvapich2) or VIADEV_DEVICE(mvapich)
> can be used to specify the device which applies to all nodes in the
> run. Can the device be set for each individual node in a paramfile
> or a hostlist file?
>
> I would somehow like to have data transfer through hca0 of node0
> and hca1 of node1.
>
> node0 DAPL_PROVIDER OpenIB-cma
> node1 DAPL_PROVIDER OpenIB-cma-1
>
> Please advise.
> Thanks,
> Jim
>

From mamidala at cse.ohio-state.edu Wed Dec 12 23:02:39 2007
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Wed Dec 12 23:03:21 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To: <475EAF37.4090805@freenet.de>
Message-ID:

Hi Christian,

Thanks for trying out the patch. I will post a patch for MVAPICH2 in the next few days.

Thanks,
Amith.

On Tue, 11 Dec 2007, Christian Boehme wrote:

> Hi Amith,
> > Can you also try the patch I am attaching with this mail and let us know
> > how it works?
> >
>
> Now that the patch seems to work, would it be possible to get a similar
> patch for MVAPICH2 (version 0.9.8)? Many thanks
>
> Christian Boehme
>

From maillistbox at 126.com Thu Dec 13 09:44:33 2007
From: maillistbox at 126.com (Eric Zhang)
Date: Thu Dec 13 09:44:09 2007
Subject: [mvapich-discuss] How to solve CQ creation problem while using mpirun_rsh -rsh?
Message-ID: <47614551.7040400@126.com>

Hi, mvapich-discuss:

We submit jobs using a command like:

mpirun_rsh -rsh -np 24 -hostfile myhostfile xhpl

But this fails with "Error in Creating CQ". I read the mvapich user guide and found that the fix it describes covers only an SSH environment -- add "ulimit -l xxx" to /etc/init.d/sshd.

Can anyone tell me how to solve this "CQ creation" problem when I want to use rsh?

Thanks in advance.

Eric Zhang
2007-12-13

From gopalakk at cse.ohio-state.edu Thu Dec 13 11:31:38 2007
From: gopalakk at cse.ohio-state.edu (Karthik Gopalakrishnan)
Date: Thu Dec 13 11:32:25 2007
Subject: [mvapich-discuss] How to solve CQ creation problem while using mpirun_rsh -rsh?
In-Reply-To: <47614551.7040400@126.com>
References: <47614551.7040400@126.com>
Message-ID: <92eddfb50712130831x30bff35dwb1185c191ae10bca@mail.gmail.com>

Hi.

ulimit -l sets the "maximum size that may be locked into memory" for all the processes started by the shell where it is run. By adding that line to /etc/init.d/sshd, you allocate more resources to sshd. If you are using rsh, you should add a similar line to /etc/init.d/xinetd. Hope this helps.

Regards,
Karthik

On 12/13/07, Eric Zhang wrote:
> Hi, mvapich-discuss:
>
> We submit the jobs using command like:
>
> mpirun_rsh -rsh -np 24 -hostfile myhostfile xhpl
>
> But this will fail because "Error in Creating CQ". I read the
> mvapich's user guide and found this problem can be fixed only in SSH
> environment -- add "ulimit -l xxx" in /etc/init.d/sshd.
>
> Anyone can tell me how to solve this "CQ creation" problem while I
> want to use rsh?
>
> Thanks in advance.
>
> Eric Zhang
> 2007-12-13
>

From ibatis2 at 163.com Fri Dec 14 07:26:25 2007
From: ibatis2 at 163.com (jetspeed)
Date: Fri Dec 14 07:31:12 2007
Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5
Message-ID: <20071214202625.9ab584ef.ibatis2@163.com>

Hi:

I want to install OFED 1.2.5 on the IBM BladeCenter in order to use the MVAPICH2 shipped with it, but unfortunately it seems that MPI still uses Ethernet to communicate, as before.

I can't download the OFED build tailored for RHEL4 by Cisco, so I wonder whether anyone has experience successfully installing OFED with InfiniBand on this hardware.

Could someone give me suggestions? Thanks in advance.

Sincerely Yours.

From maillistbox at 126.com Fri Dec 14 09:30:56 2007
From: maillistbox at 126.com (Eric Zhang)
Date: Fri Dec 14 09:30:41 2007
Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5
In-Reply-To: <20071214202625.9ab584ef.ibatis2@163.com>
References: <20071214202625.9ab584ef.ibatis2@163.com>
Message-ID: <476293A0.5090303@126.com>

Hi, mvapich-discuss:

Basically, it is very easy to install OFED on an InfiniBand machine. Just uncompress the OFED package, run build/install.sh, then follow the install guide -- that's all.

If I remember correctly, OFED ships with OpenMPI, not mvapich2?

Eric Zhang

jetspeed wrote:
> Hi:
> I want to install OFED 1.2.5 on the IBM BladeCenter to use the MvapiCH2 shipped with it,
> but unfortunately it seems that the MPI uses the Ethernet to communicate as before,
>
> I can't downlowd OFED tailored for RHEL4 by Cisco, so I wonder is there someone have the experience that successfully installed the OFED on InfiniBand.
>
> Could someone give me suggestions, Thanks in advance.
>
> Sincerely Yours.
>

From maillistbox at 126.com Fri Dec 14 09:32:39 2007
From: maillistbox at 126.com (Eric Zhang)
Date: Fri Dec 14 09:32:13 2007
Subject: [mvapich-discuss] How to solve CQ creation problem while using mpirun_rsh -rsh?
In-Reply-To: <92eddfb50712130831x30bff35dwb1185c191ae10bca@mail.gmail.com>
References: <47614551.7040400@126.com> <92eddfb50712130831x30bff35dwb1185c191ae10bca@mail.gmail.com>
Message-ID: <47629407.3020406@126.com>

Hi, Karthik:

Yeah, it works. Thanks, Karthik.

I also added "ulimit -l " to my sgeexecd script -- we use SGE as our job scheduling system -- and that works fine too.

Eric Zhang

Karthik Gopalakrishnan wrote:
> Hi.
>
> ulimit -l sets the "maximum size that may be locked into memory" for
> all the processes started by the shell where it is run. By adding that
> line in /etc/init.d/sshd, you allocate more resources to sshd. If you
> are using rsh, you should be adding a similar line to
> /etc/init.d/xinetd. Hope this helps.
>
> Regards,
> Karthik
>
> On 12/13/07, Eric Zhang wrote:
>> Hi, mvapich-discuss:
>>
>> We submit the jobs using command like:
>>
>> mpirun_rsh -rsh -np 24 -hostfile myhostfile xhpl
>>
>> But this will fail because "Error in Creating CQ". I read the
>> mvapich's user guide and found this problem can be fixed only in SSH
>> environment -- add "ulimit -l xxx" in /etc/init.d/sshd.
>>
>> Anyone can tell me how to solve this "CQ creation" problem while I
>> want to use rsh?
>>
>> Thanks in advance.
>>
>> Eric Zhang
>> 2007-12-13
>>
>

From panda at cse.ohio-state.edu Fri Dec 14 10:04:08 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Dec 14 10:04:50 2007
Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5
In-Reply-To: <476293A0.5090303@126.com>
Message-ID:

> I remember OFED ships with OpenMPI not mvapich2?

OFED includes MVAPICH, MVAPICH2 and OpenMPI. During the installation, one can select which MPI version to install through the `mpi selector' function.

DK

> Eric Zhang
>
> jetspeed wrote:
> > Hi:
> > I want to install OFED 1.2.5 on the IBM BladeCenter to use the MvapiCH2 shipped with it,
> > but unfortunately it seems that the MPI uses the Ethernet to communicate as before,
> >
> > I can't downlowd OFED tailored for RHEL4 by Cisco, so I wonder is there someone have the experience that successfully installed the OFED on InfiniBand.
> >
> > Could someone give me suggestions, Thanks in advance.
> >
> > Sincerely Yours.
> >
>

From christian.guggenberger at rzg.mpg.de Fri Dec 14 10:04:12 2007
From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger)
Date: Fri Dec 14 10:04:57 2007
Subject: [MPICH] Re: [mvapich-discuss] Fatal error in MPI_Allreduce
In-Reply-To:
References: <20071129153620.GG26051@daltons.rzg.mpg.de> <004c01c8337f$67f65840$70013b0a@thakurlaptop>
Message-ID: <20071214150412.GK11593@daltons.rzg.mpg.de>

On Fri, Nov 30, 2007 at 12:47:45PM -0600, Anthony Chan wrote:
>
> http://softwarecommunity.intel.com/isn/Community/en-US/forums/thread/30237719.aspx
>
> A.Chan
>
> On Fri, 30 Nov 2007, Rajeev Thakur wrote:
>
>> This is caused by a bug in the Intel 10.0 compiler. It cannot handle case
>> statements of the form below. It needs a "return MPI_SUCCESS" after each
>> case statement. You can fix it in your code by editing
>> src/mpi/coll/opminloc.c and adding "return MPI_SUCCESS" after each of the
>> case statements below.
>>

Thanks a lot for your answers. It looks like the issue has been fixed in version 10.1.011 of the Intel compilers.

cheers.
 - Christian

From jsquyres at cisco.com Fri Dec 14 10:18:09 2007
From: jsquyres at cisco.com (Jeff Squyres)
Date: Fri Dec 14 10:19:11 2007
Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5
In-Reply-To:
References:
Message-ID:

On Dec 14, 2007, at 10:04 AM, Dhabaleswar Panda wrote:

>> I remember OFED ships with OpenMPI not mvapich2?
>
> OFED includes MVAPICH, MVAPICH2 and OpenMPI.
During the > installation, one > can select which MPI version to be installed through the `mpi > selector' > function. FWIW, you can run the mpi-selector-menu program at any time -- not just during the OFED installation. mpi-selector-menu allows you to choose which MPI you want to use, either on a system-wide basis or on a per-user basis. See mpi-selector-menu(1) for more details, or mpi-selector(1) for even more details. (mpi-selector-menu is a simple menu-based wrapper around the back-end mpi-selector command) -- Jeff Squyres Cisco Systems From ibatis2 at 163.com Mon Dec 17 04:06:15 2007 From: ibatis2 at 163.com (jetspeed) Date: Mon Dec 17 04:11:10 2007 Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5 In-Reply-To: References: Message-ID: <20071217170615.f28e7707.ibatis2@163.com> Thanks. Maybe my OFED is installed successfully; I will check this later. The problem is that after I select the MPI version Current system default: mvapich2_gcc-0.9.8 Current user default: mvapich2_gcc-0.9.8 I then run mpdboot and get "mpdboot_node01 (handle_mpd_output 359): failed to ping mpd on node01; recvd output={}" Maybe other settings are needed for MVAPICH? On Fri, 14 Dec 2007 10:18:09 -0500 Jeff Squyres wrote: > On Dec 14, 2007, at 10:04 AM, Dhabaleswar Panda wrote: > > >> I remember OFED ships with OpenMPI not mvapich2? > > > > OFED includes MVAPICH, MVAPICH2 and OpenMPI. During the > > installation, one > > can select which MPI version to be installed through the `mpi > > selector' > > function. > > FWIW, you can run the mpi-selector-menu program at any time -- not > just during the OFED installation. mpi-selector-menu allows you to > choose which MPI you want to use, either at a system-wide basis or on > a per-user basis. See mpi-selector-menu(1) for more details, or mpi- > selector(1) for even more details.
> > (mpi-selector-menu is a simple menu-based wrapper around the back-end > mpi-selector command) > > -- > Jeff Squyres > Cisco Systems > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From StephanGerber at gmx.net Mon Dec 17 06:12:48 2007 From: StephanGerber at gmx.net (Stephan Gerber) Date: Mon Dec 17 06:13:31 2007 Subject: [mvapich-discuss] port problem using MVAPICH2 Message-ID: <20071217111248.128010@gmx.net> Dear MVAPICH users and developers, I am not sure whether my last mail failed to reach the list or simply nobody could answer my question - so i am trying again. I am using OFED 1.2, which brings mvapich, openmpi and mvapich2 with it. the system was preconfigured so i did not have to install it. all test cases and examples run fine, but when i try to use the three mpi versions in my own application i have success with mvapich and openmpi but still have problems using mvapich2. the lib of my application which uses the mpi libs compiles without error, but i struggle on the first steps with mvapich2 (my system is a dual-Opteron cluster with 4 nodes, each with 2 processors of two cores each). i tried booting with: mpdboot --totalnum=4 -1 --file=/Users/gerber/mac/mpd.hosts --rsh=ssh --verbose --ncpus=16 (of course i tried the simpler command mentioned in the user guide too) in this case i end up with the following error message: mpdboot_n01.local (handle_mpd_output 373): from mpd on n01, invalid port info: does anyone know what the problem might be and how to solve it? if i use the --chkup(only) option i see that only one node out of four is up!? but all other tests with the ib-tools look ok. i am not sure if i am missing a lib or an include path or what else?! if i don't use the option --totalnum=4, mpdboot works fine, but afterwards mpdtrace shows that there is only one host up...
for the second boot-approach i tried starting mpiexec but i end up with the following error: rank 3 in job 1 n01.local_39320 caused collective abort of all ranks exit status of rank 3: return code 13 [rdma_iba_init.c:91] Error initializing MVAPICH2 malloc library Fatal error in MPI_Init: Other MPI error, error stack: MPIR_Init_thread(230): Initialization failed MPID_Init(81)........: channel initialization failed (unknown)(): Other MPI error[gerber@n01] so i guess there is something wrong with the malloc lib or am i missing some env-variables?! any help would be really appreciated! thanx in advance br stephan From ibatis2 at 163.com Mon Dec 17 07:07:56 2007 From: ibatis2 at 163.com (jetspeed) Date: Mon Dec 17 07:12:51 2007 Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5 In-Reply-To: <20071217170615.f28e7707.ibatis2@163.com> References: <20071217170615.f28e7707.ibatis2@163.com> Message-ID: <20071217200756.f8c2e4b6.ibatis2@163.com> same as http://lists.freebsd.org/pipermail/freebsd-cluster/2007-June/000351.html On Mon, 17 Dec 2007 17:06:15 +0800 jetspeed wrote: > Thanks, maybe my OFED is installed successfully, I will check this later. > > the problem is , after I select the mpi version > > Current system default: mvapich2_gcc-0.9.8 > Current user default: mvapich2_gcc-0.9.8 > > then , I use mpdboot, and got > > "mpdboot_node01 (handle_mpd_output 359): failed to ping mpd on node01; recvd output={}" > > maybe other settings needed for Mvapich? > > On Fri, 14 Dec 2007 10:18:09 -0500 > Jeff Squyres wrote: > > > On Dec 14, 2007, at 10:04 AM, Dhabaleswar Panda wrote: > > > > >> I remember OFED ships with OpenMPI not mvapich2? > > > > > > OFED includes MVAPICH, MVAPICH2 and OpenMPI. During the > > > installation, one > > > can select which MPI version to be installed through the `mpi > > > selector' > > > function. > > > > FWIW, you can run the mpi-selector-menu program at any time -- not > > just during the OFED installation. 
mpi-selector-menu allows you to > > choose which MPI you want to use, either at a system-wide basis or on > > a per-user basis. See mpi-selector-menu(1) for more details, or mpi- > > selector(1) for even more details. > > > > (mpi-selector-menu is a simple menu-based wrapper around the back-end > > mpi-selector command) > > > > -- > > Jeff Squyres > > Cisco Systems > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From mamidala at cse.ohio-state.edu Mon Dec 17 11:02:23 2007 From: mamidala at cse.ohio-state.edu (amith rajith mamidala) Date: Mon Dec 17 11:03:04 2007 Subject: [mvapich-discuss] Strange error with MPI_REDUCE In-Reply-To: Message-ID: Hi Christian, I am attaching the patch for MVAPICH2 (0.9.8) along with this mail. Can you please try this out? Thanks, Amith. On Wed, 12 Dec 2007, amith rajith mamidala wrote: > Hi Christian, > > Thanks for trying out the patch. > I will post a patch to MVAPICH2 in the next few days. > > Thanks, > Amith. > > On Tue, 11 Dec 2007, Christian Boehme wrote: > > > Hi Amith, > > > Can you also try the patch I am attaching with this mail and let us know > > > how it works? > > > > > > > Now that the patch seems to work, would it be possible to get a similar > > patch for MVAPICH2 (version 0.9.8)? 
Many thanks > > > > Christian Boehme > > > > -------------- next part --------------
Index: reduce.c
===================================================================
--- reduce.c (revision 1676)
+++ reduce.c (working copy)
@@ -728,14 +728,14 @@
     MPI_Comm shmem_comm, leader_comm;
     MPID_Comm *shmem_commptr = 0, *leader_commptr = 0;
     int local_rank = -1, global_rank = -1, local_size=0, my_rank;
-    void* local_buf, *tmpbuf;
+    void* local_buf, *tmpbuf, *tmpbuf1;
     MPI_Aint true_lb, true_extent, extent;
     MPI_User_function *uop;
     int stride = 0, i, is_commutative, size;
     MPID_Op *op_ptr;
     MPI_Status status;
     int leader_root, total_size, shmem_comm_rank;
-    MPIU_CHKLMEM_DECL(1);
+    MPIU_CHKLMEM_DECL(2);
 #ifdef HAVE_CXX_BINDING
     int is_cxx_uop = 0;
 #endif
@@ -921,6 +921,8 @@
         global_rank = leader_commptr->rank;
         MPIU_CHKLMEM_MALLOC(tmpbuf, void *, count*(MPIR_MAX(extent,true_extent)), mpi_errno, "receive buffer");
         tmpbuf = (void *)((char*)tmpbuf - true_lb);
+        MPIU_CHKLMEM_MALLOC(tmpbuf1, void *, count*(MPIR_MAX(extent,true_extent)), mpi_errno, "receive buffer");
+        tmpbuf1 = (void *)((char*)tmpbuf1 - true_lb);
         MPIR_Nest_incr();
         mpi_errno = MPIR_Localcopy(sendbuf, count, datatype, tmpbuf,
                                    count, datatype);
@@ -956,7 +958,7 @@
         leader_root = comm_ptr->leader_rank[leader_of_root];
         if (local_size != total_size){
             MPIR_Nest_incr();
-            mpi_errno = MPIR_Reduce(tmpbuf, recvbuf, count, datatype,
+            mpi_errno = MPIR_Reduce(tmpbuf, tmpbuf1, count, datatype,
                                     op, leader_root, leader_commptr);
             MPIR_Nest_decr();
         }
@@ -978,6 +980,13 @@
         MPIDI_CH3I_SHMEM_COLL_SetGatherComplete(local_size, local_rank,
                                                 shmem_comm_rank);
     }
+    if ((local_rank == 0) && (root == my_rank)){
+        MPIR_Nest_incr();
+        mpi_errno = MPIR_Localcopy(tmpbuf1, count, datatype, recvbuf,
+                                   count, datatype);
+        MPIR_Nest_decr();
+        goto fn_exit;
+    }
 
     /* Copying data from leader to the root incase leader is
      * not the root */
@@ -991,7 +1000,7 @@
                                    MPIR_REDUCE_TAG, comm );
         }
         else{
-            mpi_errno = MPIC_Send( recvbuf, count, datatype, root,
+            mpi_errno = MPIC_Send( tmpbuf1, count, datatype, root,
                                    MPIR_REDUCE_TAG, comm );
         }
     }
From Craig.Tierney at noaa.gov Mon Dec 17 13:08:46 2007 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Mon Dec 17 13:09:31 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group Message-ID: <4766BB2E.1020701@noaa.gov> I am having problems building Mvapich (both 1.0 and 1.0.1) for OFED using the Portland Group compilers. When I execute the following: # OPEN_IB_HOME=/usr # CC=pgcc # CXX=pgCC # FC=pgf77 # F90=pgf90 # ROMIO=yes # ./make.mvapich.ofa I get the following error when building the example programs. ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm -libverbs -libumad -lpthread -lrt ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': : undefined reference to `bswap_64' ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': : undefined reference to `bswap_64' I don't get this linking error when building with Intel 9.X. Has anyone else run into this problem? Thanks, Craig -- Craig Tierney (craig.tierney@noaa.gov) From nagaraj at cse.ohio-state.edu Mon Dec 17 13:42:28 2007 From: nagaraj at cse.ohio-state.edu (Deepak Nagaraj) Date: Mon Dec 17 13:43:11 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group In-Reply-To: <4766BB2E.1020701@noaa.gov> References: <4766BB2E.1020701@noaa.gov> Message-ID: Hi Craig, On Dec 17, 2007 1:08 PM, Craig Tierney wrote: > > I get the following error when building the example programs.
> > ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED > -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM > -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 > -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm > -libverbs -libumad -lpthread -lrt > ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': > : undefined reference to `bswap_64' > ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': > : undefined reference to `bswap_64' > grep reveals that this is a macro sitting in byteswap.h. Do you have this file in your compiler's include path? Installing glibc-devel should get you this file if you are on Linux, as per Google. $ cd /usr/include $ grep -r "bswap_64" * bits/byteswap.h:# define __bswap_64(x) \ bits/byteswap.h:# define __bswap_64(x) \ byteswap.h:# define bswap_64(x) __bswap_64 (x) $ Thanks, -Deepak From Craig.Tierney at noaa.gov Mon Dec 17 15:28:54 2007 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Mon Dec 17 15:29:41 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group In-Reply-To: References: <4766BB2E.1020701@noaa.gov> Message-ID: <4766DC06.8050001@noaa.gov> Deepak Nagaraj wrote: > Hi Craig, > > On Dec 17, 2007 1:08 PM, Craig Tierney wrote: >> I get the following error when building the example programs. >> >> ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED >> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM >> -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 >> -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm >> -libverbs -libumad -lpthread -lrt >> ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': >> : undefined reference to `bswap_64' >> ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': >> : undefined reference to `bswap_64' >> > grep reveals that this is a macro sitting in byteswap.h. Do you have > this file in your compiler's include path? 
Installing glibc-devel > should get you this file if you are on Linux, as per Google. > > $ cd /usr/include > $ grep -r "bswap_64" * > bits/byteswap.h:# define __bswap_64(x) \ > bits/byteswap.h:# define __bswap_64(x) \ > byteswap.h:# define bswap_64(x) __bswap_64 (x) > $ > > Thanks, > -Deepak I was thinking that the problem was related to this. This macro does exist, and I am pretty sure that the call to pgcc is picking it up. However, the problem is that no function in the mvapich distribution calls bswap_64. The reference is coming from some of the OFED libraries. To me, it appears that I have to build the OFED libraries with pgcc (not gcc) so that it picks up the macro. I am highly reluctant to do this, because it means I need multiple versions of OFED, which makes no sense. Craig -- Craig Tierney (craig.tierney@noaa.gov) From koop at cse.ohio-state.edu Mon Dec 17 21:00:56 2007 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Dec 17 21:01:37 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group In-Reply-To: <4766DC06.8050001@noaa.gov> Message-ID: Craig, What version of PGI are you using? We've seen this issue before with versions of PGI earlier than 6.2 (we've verified with 6.2-4). If you build MVAPICH with the latest version of PGI the error will go away, as bswap_64 is implemented in this version. The error is due to bswap_64 being used in an OFED include file -- so there is no use in trying to recompile OFED with PGI. If updating to the latest version is not an option, we can potentially find a workaround, but updating to the latest version of PGI is the preferred solution. Let us know if you have any other problems. Thanks, Matt On Mon, 17 Dec 2007, Craig Tierney wrote: > Deepak Nagaraj wrote: > > Hi Craig, > > > > On Dec 17, 2007 1:08 PM, Craig Tierney wrote: > >> I get the following error when building the example programs.
> >> > >> ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED > >> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM > >> -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 > >> -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm > >> -libverbs -libumad -lpthread -lrt > >> ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': > >> : undefined reference to `bswap_64' > >> ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': > >> : undefined reference to `bswap_64' > >> > > grep reveals that this is a macro sitting in byteswap.h. Do you have > > this file in your compiler's include path? Installing glibc-devel > > should get you this file if you are on Linux, as per Google. > > > > $ cd /usr/include > > $ grep -r "bswap_64" * > > bits/byteswap.h:# define __bswap_64(x) \ > > bits/byteswap.h:# define __bswap_64(x) \ > > byteswap.h:# define bswap_64(x) __bswap_64 (x) > > $ > > > > Thanks, > > -Deepak > > I was thinking that the problem was related to this. This macro > does exist, and I am pretty sure that the call to pgcc is picking it > up. > > However, the problem is that no function in the mvapich distribution > calls this function. It is coming form some of the OFED libraries. > To me, it appears that I have to build the OFED libraries with > pgcc (not gcc) so that it picks up the macro. I am highly reluctant > to do this, because it means I need multiple versions of OFED, which > makes no sense. 
> > Craig > > > -- > Craig Tierney (craig.tierney@noaa.gov) > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Mon Dec 17 21:12:12 2007 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Dec 17 21:12:53 2007 Subject: [mvapich-discuss] Has anyone enabled InfiniBand in IBM BladeCenter JS21 by OFED 1.2.5 In-Reply-To: <20071217200756.f8c2e4b6.ibatis2@163.com> Message-ID: Hi, In regard to your first comment: "I want to install OFED 1.2.5 on the IBM BladeCenter to use the MvapiCH2 shipped with it, but unfortunately it seems that the MPI uses the Ethernet to communicate as before, " If you are using the MVAPICH2 from OFED it should use InfiniBand (or iWARP). Only the initial control messages to launch processes use normal TCP. Why does it seem like it is using Ethernet? "mpdboot_node01 (handle_mpd_output 359): failed to ping mpd on node01; recvd output={}" This is likely a setup issue not related to InfiniBand. Can you try following the MPICH2 Installer's Guide section on mpdboot (it has special troubleshooting instructions to guide you at the end of the document)? http://www.mcs.anl.gov/research/projects/mpich2/documentation/files/mpich2-doc-install.pdf You will likely want to try starting the mpd daemons by hand rather than through mpdboot to figure out setup issues, since that will give you more output. Thanks and let us know if you continue to have problems. Matt On Mon, 17 Dec 2007, jetspeed wrote: > same as http://lists.freebsd.org/pipermail/freebsd-cluster/2007-June/000351.html > > > On Mon, 17 Dec 2007 17:06:15 +0800 > jetspeed wrote: > > > Thanks, maybe my OFED is installed successfully, I will check this later.
> > > > the problem is , after I select the mpi version > > > > Current system default: mvapich2_gcc-0.9.8 > > Current user default: mvapich2_gcc-0.9.8 > > > > then , I use mpdboot, and got > > > > "mpdboot_node01 (handle_mpd_output 359): failed to ping mpd on node01; recvd output={}" > > > > maybe other settings needed for Mvapich? > > > > On Fri, 14 Dec 2007 10:18:09 -0500 > > Jeff Squyres wrote: > > > > > On Dec 14, 2007, at 10:04 AM, Dhabaleswar Panda wrote: > > > > > > >> I remember OFED ships with OpenMPI not mvapich2? > > > > > > > > OFED includes MVAPICH, MVAPICH2 and OpenMPI. During the > > > > installation, one > > > > can select which MPI version to be installed through the `mpi > > > > selector' > > > > function. > > > > > > FWIW, you can run the mpi-selector-menu program at any time -- not > > > just during the OFED installation. mpi-selector-menu allows you to > > > choose which MPI you want to use, either at a system-wide basis or on > > > a per-user basis. See mpi-selector-menu(1) for more details, or mpi- > > > selector(1) for even more details. 
> > > > > > (mpi-selector-menu is a simple menu-based wrapper around the back-end > > > mpi-selector command) > > > > > > -- > > > Jeff Squyres > > > Cisco Systems > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Mon Dec 17 21:40:01 2007 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Mon Dec 17 21:40:42 2007 Subject: [mvapich-discuss] port problem using MVAPICH2 In-Reply-To: <20071211153704.209990@gmx.net> Message-ID: Stephan, I'm trying to understand the setup you have here -- let me know if this is correct: You have 4 nodes -- 2 dual-core processors (4 cores per node). Are you running mpdboot from a compute node or from somewhere else? It looks like you may be launching from a Mac. If you are launching from something not a compute node, can you try using a compute node to boot the ring instead? Try a simpler command from one of the compute nodes: mpdboot -n 4 -f The failure in initializing the malloc library is likely due to starting the MPD ring from a node that is not in the compute cluster (by default mpdboot uses the local node as well as those in the hostfile). Let us know if this helps. If not, can you send the contents of the hostfile you are using as well? Thanks, Matt On Tue, 11 Dec 2007, Stephan Gerber wrote: > Dear MVAPICH users and developers, > > i have some problems using MVAPICH2. 
> to start with MVAPICH(1) and OPEMMPI are both running on our infinibandcluster but both do not scale as i wanted to do them. > so i want to use MVAPICH2 to achieve better results (hopefully...). > my system is a dual-Opteron cluster with 4 nodes each has 2 processors with each of them two cores. > i tried booting with: > mpdboot --totalnum=4 -1 --file=/Users/gerber/mac/mpd.hosts --rsh=ssh --verbose --ncpus=16 > in this case i end up with the follwoing error message > mpdboot_n01.local (handle_mpd_output 373): from mpd on n01, invalid port info: > does anyone know which problem might that be and how to solve it? > if i use the --chhup(only) option i see that only one node out of four is up!? > > if i dont use the option --totalnum=4 the mpdboot workes fine but still afetr using mpdtrace i see that there is only one host up... > > for the second boot-approach i tried starting mpiexec but i end up with the following error: > rank 3 in job 1 n01.local_39320 caused collective abort of all ranks > exit status of rank 3: return code 13 > [rdma_iba_init.c:91] Error initializing MVAPICH2 malloc library > Fatal error in MPI_Init: Other MPI error, error stack: > MPIR_Init_thread(230): Initialization failed > MPID_Init(81)........: channel initialization failed > (unknown)(): Other MPI error[gerber@n01] > > any help would be appreciated! > thanx in advance > br > stephan > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From Craig.Tierney at noaa.gov Tue Dec 18 12:22:57 2007 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue Dec 18 21:38:02 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group In-Reply-To: References: Message-ID: <476801F1.8080001@noaa.gov> Matthew Koop wrote: > Craig, > > What version of PGI are you using? 
We've seen this issue before with > versions of PGI previous to 6.2 (we've verified with 6.2-4). If you build > MVAPICH with the latest version of PGI the error will go away as bswap_64 > is implemented in this version. > > It is due to bswap_64 being used in an OFED include file -- so no use in > trying to recompile OFED with PGI. > > If updating to the latest version is not an option, we can potential find > a workaround, but updating to the latest version of PGI is the preferred > solution. > > Let us know if you have any other problems. We updated to PG 7.1-3. This fixed the problem. Craig > > Thanks, > > Matt > > On Mon, 17 Dec 2007, Craig Tierney wrote: > >> Deepak Nagaraj wrote: >>> Hi Craig, >>> >>> On Dec 17, 2007 1:08 PM, Craig Tierney wrote: >>>> I get the following error when building the example programs. >>>> >>>> ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED >>>> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM >>>> -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 >>>> -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm >>>> -libverbs -libumad -lpthread -lrt >>>> ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': >>>> : undefined reference to `bswap_64' >>>> ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': >>>> : undefined reference to `bswap_64' >>>> >>> grep reveals that this is a macro sitting in byteswap.h. Do you have >>> this file in your compiler's include path? Installing glibc-devel >>> should get you this file if you are on Linux, as per Google. >>> >>> $ cd /usr/include >>> $ grep -r "bswap_64" * >>> bits/byteswap.h:# define __bswap_64(x) \ >>> bits/byteswap.h:# define __bswap_64(x) \ >>> byteswap.h:# define bswap_64(x) __bswap_64 (x) >>> $ >>> >>> Thanks, >>> -Deepak >> I was thinking that the problem was related to this. This macro >> does exist, and I am pretty sure that the call to pgcc is picking it >> up. 
>> >> However, the problem is that no function in the mvapich distribution >> calls this function. It is coming form some of the OFED libraries. >> To me, it appears that I have to build the OFED libraries with >> pgcc (not gcc) so that it picks up the macro. I am highly reluctant >> to do this, because it means I need multiple versions of OFED, which >> makes no sense. >> >> Craig >> >> >> -- >> Craig Tierney (craig.tierney@noaa.gov) >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > > -- Craig Tierney (craig.tierney@noaa.gov) From koop at cse.ohio-state.edu Wed Dec 19 11:31:50 2007 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Dec 19 11:31:55 2007 Subject: [mvapich-discuss] Problem building mvapich 1.0 with Portland Group In-Reply-To: <476801F1.8080001@noaa.gov> Message-ID: Good to hear. Let us know if you encounter any other issues, Matt On Tue, 18 Dec 2007, Craig Tierney wrote: > Matthew Koop wrote: > > Craig, > > > > What version of PGI are you using? We've seen this issue before with > > versions of PGI previous to 6.2 (we've verified with 6.2-4). If you build > > MVAPICH with the latest version of PGI the error will go away as bswap_64 > > is implemented in this version. > > > > It is due to bswap_64 being used in an OFED include file -- so no use in > > trying to recompile OFED with PGI. > > > > If updating to the latest version is not an option, we can potential find > > a workaround, but updating to the latest version of PGI is the preferred > > solution. > > > > Let us know if you have any other problems. > > We updated to PG 7.1-3. This fixed the problem. 
> > Craig > > > > > > Thanks, > > > > Matt > > > > On Mon, 17 Dec 2007, Craig Tierney wrote: > > > >> Deepak Nagaraj wrote: > >>> Hi Craig, > >>> > >>> On Dec 17, 2007 1:08 PM, Craig Tierney wrote: > >>>> I get the following error when building the example programs. > >>>> > >>>> ../bin/mpicc -D_EM64T_ -D_SMP_ -DUSE_HEADER_CACHING -DONE_SIDED > >>>> -DMPIDI_CH3_CHANNEL_RNDV -DMPID_USE_SEQUENCE_NUMBERS -DRDMA_CM > >>>> -I/usr/include -O2 -L../lib -o cpi cpi.o -lm -lmpich -L/usr/lib64 > >>>> -lrdmacm -libverbs -libumad -lpthread -lrt -L/usr/lib64 -lrdmacm > >>>> -libverbs -libumad -lpthread -lrt > >>>> ../lib/libmpich.a(ibv_param.o)(.text+0xa7): In function `ntohll': > >>>> : undefined reference to `bswap_64' > >>>> ../lib/libmpich.a(ibv_param.o)(.text+0xb7): In function `htonll': > >>>> : undefined reference to `bswap_64' > >>>> > >>> grep reveals that this is a macro sitting in byteswap.h. Do you have > >>> this file in your compiler's include path? Installing glibc-devel > >>> should get you this file if you are on Linux, as per Google. > >>> > >>> $ cd /usr/include > >>> $ grep -r "bswap_64" * > >>> bits/byteswap.h:# define __bswap_64(x) \ > >>> bits/byteswap.h:# define __bswap_64(x) \ > >>> byteswap.h:# define bswap_64(x) __bswap_64 (x) > >>> $ > >>> > >>> Thanks, > >>> -Deepak > >> I was thinking that the problem was related to this. This macro > >> does exist, and I am pretty sure that the call to pgcc is picking it > >> up. > >> > >> However, the problem is that no function in the mvapich distribution > >> calls this function. It is coming form some of the OFED libraries. > >> To me, it appears that I have to build the OFED libraries with > >> pgcc (not gcc) so that it picks up the macro. I am highly reluctant > >> to do this, because it means I need multiple versions of OFED, which > >> makes no sense. 
> >>
> >> Craig
> >>
> >>
> >> --
> >> Craig Tierney (craig.tierney@noaa.gov)
> >> _______________________________________________
> >> mvapich-discuss mailing list
> >> mvapich-discuss@cse.ohio-state.edu
> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >>
> >
> >
>
>
> --
> Craig Tierney (craig.tierney@noaa.gov)
>

From pasha at dev.mellanox.co.il Thu Dec 20 10:14:48 2007
From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha))
Date: Thu Dec 20 10:19:55 2007
Subject: [mvapich-discuss] Re: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To: <200712201535.37527.jackm@dev.mellanox.co.il>
References: <200712201535.37527.jackm@dev.mellanox.co.il>
Message-ID: <476A86E8.8020308@dev.mellanox.co.il>

Adding Open MPI and MVAPICH community to the thread.

Pasha (Pavel Shamis)

Jack Morgenstein wrote:
> background: see "XRC Cleanup order issue thread" at
>
> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>
> (userspace process which created the receiving XRC qp on a given host dies before
> other processes which still need to receive XRC messages on their SRQs which are
> "paired" with the now-destroyed receiving XRC QP.)
>
> Solution: Add a userspace verb (as part of the XRC suite) which enables the user process
> to create an XRC QP owned by the kernel -- which belongs to the required XRC domain.
>
> This QP will be destroyed when the XRC domain is closed (i.e., as part of a ibv_close_xrc_domain
> call, but only when the domain's reference count goes to zero).
>
> Below, I give the new userspace API for this function. Any feedback will be appreciated.
> This API will be implemented in the upcoming OFED 1.3 release, so we need feedback ASAP.
>
> Notes:
> 1. There is no query or destroy verb for this QP. There is also no userspace object for the
>    QP. Userspace has ONLY the raw qp number to use when creating the (X)RC connection.
>
> 2. Since the QP is "owned" by kernel space, async events for this QP are also handled in kernel
>    space (i.e., reported in /var/log/messages). There are no completion events for the QP, since
>    it does not send, and all receives completions are reported in the XRC SRQ's cq.
>
>    If this QP enters the error state, the remote QP which sends will start receiving RETRY_EXCEEDED
>    errors, so the application will be aware of the failure.
>
> - Jack
> ======================================================================================
> /**
>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a receive-side only QP,
>  *      and moves the created qp through the RESET->INIT and INIT->RTR transitions.
>  *      (The RTR->RTS transition is not needed, since this QP does no sending).
>  *      The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
>  *      for actually receiving the transmissions and generating all completions on the
>  *      receiving side.
>  *
>  *      This QP is created in kernel space, and persists until the XRC domain is closed.
>  *      (i.e., its reference count goes to zero).
>  *
>  * @pd: protection domain to use. At lower layer, this provides access to userspace obj
>  * @xrc_domain: xrc domain to use for the QP.
>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>  * @attr_mask: bitmap indicating which attributes are provided in the attr struct.
>  *      used for validity checking.
>  * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the remote node. The
>  *      remote node will use xrc_rcv_qpn in ibv_post_send when sending to
>  *      XRC SRQ's on this host in the same xrc domain.
>  *
>  * RETURNS: success (0), or a (negative) error value.
>  */
>
> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>                          struct ibv_xrc_domain *xrc_domain,
>                          struct ibv_qp_attr *attr,
>                          enum ibv_qp_attr_mask attr_mask,
>                          uint32_t *xrc_rcv_qpn);
>
> Notes:
>
> 1. Although the kernel creates the qp in the kernel's own PD, we still need the PD
>    parameter to determine the device.
>
> 2. I chose to use struct ibv_qp_attr, which is used in modify QP, rather than create
>    a new structure for this purpose. This also guards against API changes in the event
>    that during development I notice that more modify-qp parameters must be specified
>    for this operation to work.
>
> 3. Table of the ibv_qp_attr parameters showing what values to set:
>
> struct ibv_qp_attr {
>         enum ibv_qp_state       qp_state;               Not needed
>         enum ibv_qp_state       cur_qp_state;           Not needed
>                 -- Driver starts from RESET and takes qp to RTR.
>         enum ibv_mtu            path_mtu;               Yes
>         enum ibv_mig_state      path_mig_state;         Yes
>         uint32_t                qkey;                   Yes
>         uint32_t                rq_psn;                 Yes
>         uint32_t                sq_psn;                 Not needed
>         uint32_t                dest_qp_num;            Yes -- this is the remote side QP for the RC conn.
>         int                     qp_access_flags;        Yes
>         struct ibv_qp_cap       cap;                    Need only XRC domain.
>                                                         Other caps will use hard-coded values:
>                                                                 max_send_wr = 1;
>                                                                 max_recv_wr = 0;
>                                                                 max_send_sge = 1;
>                                                                 max_recv_sge = 0;
>                                                                 max_inline_data = 0;
>         struct ibv_ah_attr      ah_attr;                Yes
>         struct ibv_ah_attr      alt_ah_attr;            Optional
>         uint16_t                pkey_index;             Yes
>         uint16_t                alt_pkey_index;         Optional
>         uint8_t                 en_sqd_async_notify;    Not needed (No sq)
>         uint8_t                 sq_draining;            Not needed (No sq)
>         uint8_t                 max_rd_atomic;          Not needed (No sq)
>         uint8_t                 max_dest_rd_atomic;     Yes -- Total max outstanding RDMAs expected
>                                                         for ALL srq destinations using this receive QP.
>                                                         (if you are only using SENDs, this value can be 0).
>         uint8_t                 min_rnr_timer;          default - 0
>         uint8_t                 port_num;               Yes
>         uint8_t                 timeout;                Yes
>         uint8_t                 retry_cnt;              Yes
>         uint8_t                 rnr_retry;              Yes
>         uint8_t                 alt_port_num;           Optional
>         uint8_t                 alt_timeout;            Optional
> };
>
> 4. Attribute mask bits to set:
>    For RESET_to_INIT transition:
>         IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
>
>    For INIT_to_RTR transition:
>         IB_QP_AV | IB_QP_PATH_MTU |
>         IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
>    If you are using RDMA or atomics, also set:
>         IB_QP_MAX_DEST_RD_ATOMIC
>
>
> _______________________________________________
> general mailing list
> general@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
>
>

--
Pavel Shamis (Pasha)
Mellanox Technologies

From poobahtim at gmail.com Thu Dec 20 12:29:33 2007
From: poobahtim at gmail.com (Tim Hartley)
Date: Thu Dec 20 12:29:38 2007
Subject: [mvapich-discuss] [SOLVED] pthreads and mvapich?
In-Reply-To: <1E3DCD1C63492545881FACB6063A57C101636F32@mtiexch01.mti.com>
References: <1E3DCD1C63492545881FACB6063A57C101636F32@mtiexch01.mti.com>
Message-ID:

> > I'm new to the list and am having a strange problem when running
> > pthreads in an MPI process. The MPI calls are made safely, i.e. only
> > from the initial process. The pthreads are just workers. However, on
> > a multicore system, the pthreads always get scheduled on the same
> > core. That is, multiple pthreads do not get scheduled on more than
> > one core concurrently. I'm mainly curious if anyone has run into this
> > before.
>
> It seems like the MVAPICH process affinity feature prevent your process
> from using all cores in the node. Can you try using the following flag
> to disable affinity?
> VIADEV_USE_AFFINITY=0

This solution works, thanks!
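
For anyone checking the same symptom, here is a minimal Linux-only sketch (not taken from the MVAPICH sources) that reports which cores the calling process is allowed to run on. The function names `allowed_cpu_count` and `print_allowed_cpus` are made up for this example, and it assumes glibc's `sched_getaffinity` and the `CPU_*` macros are available. If the affinity feature is what pins the process, one would expect the count reported after MPI_Init to be 1, and to stay at the full core count with VIADEV_USE_AFFINITY=0.

```c
#define _GNU_SOURCE   /* needed for sched_getaffinity and the CPU_* macros */
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: number of CPUs the calling thread may run on,
 * or -1 if the affinity mask cannot be read. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Hypothetical helper: print the ids of the allowed CPUs, prefixed by a label
 * such as "before MPI_Init" or "after MPI_Init". */
void print_allowed_cpus(const char *label)
{
    cpu_set_t set;
    int cpu;

    assert(label != NULL);
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("%s: %d cpu(s) allowed:", label, CPU_COUNT(&set));
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
}
```

Calling `print_allowed_cpus` once before and once after MPI_Init should show whether the mask shrinks. On Linux, threads created with pthread_create inherit the affinity mask of the creating thread, which would explain workers all landing on one core.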
Tim

From changquing.tang at hp.com Thu Dec 20 11:24:09 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Thu Dec 20 12:32:37 2007
Subject: [mvapich-discuss] RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To: <476A86E8.8020308@dev.mellanox.co.il>
References: <200712201535.37527.jackm@dev.mellanox.co.il> <476A86E8.8020308@dev.mellanox.co.il>
Message-ID:

Jack:
Thanks for adding this new function, this is what we need. There is one issue I want to make clear,

This new "kernel" owned QP "will be destroyed when the XRC domain is closed (i.e., as part of a ibv_close_xrc_domain call, but only when the domain's reference count goes to zero) "

If I have a MPI server processes on a node, many other MPI client processes will dynamically connect/disconnect with the server. The server use same XRC domain.

Will this cause accumulating the "kernel" QP for such application ? we want the server to run 365 days a year.

Thanks.
--CQ

From mamidala at cse.ohio-state.edu Thu Dec 20 12:45:25 2007
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Thu Dec 20 12:45:32 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To:
Message-ID:

Hi Christian,

If you are using MVAPICH1, I am attaching a minor patch related to freeing of memory with MPI_REDUCE. Can you apply this one too?

Thanks,
Amith.

On Mon, 17 Dec 2007, amith rajith mamidala wrote:

> Hi Christian,
>
> I am attaching the patch for MVAPICH2 (0.9.8) along with this mail. Can
> you please try this out?
>
> Thanks,
> Amith.
>
> On Wed, 12 Dec 2007, amith rajith mamidala wrote:
>
> > Hi Christian,
> >
> > Thanks for trying out the patch.
> > I will post a patch to MVAPICH2 in the next few days.
> >
> > Thanks,
> > Amith.
> >
> > On Tue, 11 Dec 2007, Christian Boehme wrote:
> >
> > > Hi Amith,
> > > > Can you also try the patch I am attaching with this mail and let us know
> > > > how it works?
> > > >
> > > >
> > > Now that the patch seems to work, would it be possible to get a similar
> > > patch for MVAPICH2 (version 0.9.8)? Many thanks
> > >
> > > Christian Boehme
> > >
> > >
> >
-------------- next part --------------
Index: intra_fns_new.c
===================================================================
--- intra_fns_new.c (revision 1714)
+++ intra_fns_new.c (working copy)
@@ -5194,12 +5194,14 @@
 mpi_errno = MPI_Sendrecv(tmpbuf, count, datatype->self, rank,
 MPIR_REDUCE_TAG, recvbuf, count, datatype->self, rank,
 MPIR_REDUCE_TAG, comm_ptr->self, &status);
+ FREE((char *)tmpbuf+lb);
 }
 else{
 mpi_errno = MPI_Sendrecv(sendbuf, count, datatype->self, rank,
 MPIR_REDUCE_TAG, recvbuf, count, datatype->self, rank,
 MPIR_REDUCE_TAG, comm_ptr->self, &status);
 }
+ FREE((char *)tmpbuf1+lb);
 return MPI_SUCCESS;
 }

@@ -5216,6 +5218,8 @@
 mpi_errno = MPI_Sendrecv(tmpbuf1, count, datatype->self, rank,
 MPIR_REDUCE_TAG, recvbuf, count, datatype->self, rank,
 MPIR_REDUCE_TAG, comm_ptr->self, &status);
+ FREE((char *)tmpbuf1+lb);
+ FREE((char *)tmpbuf+lb);
 return MPI_SUCCESS;
 }

@@ -5246,6 +5250,9 @@
 if ((local_rank == 0)&&(local_size > 1)){
 FREE((char *)tmpbuf+lb);
 }
+ if (local_rank == 0){
+ FREE((char *)tmpbuf1+lb);
+ }
 }
 else{

From jackm at dev.mellanox.co.il Fri Dec 21 03:31:59 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Fri Dec 21 03:43:22 2007
Subject: [mvapich-discuss] Re: [ofa-general] [RFC]
XRC -- make receiving XRC QP independent of any one user process
In-Reply-To:
References: <200712201535.37527.jackm@dev.mellanox.co.il> <476A86E8.8020308@dev.mellanox.co.il>
Message-ID: <200712211031.59761.jackm@dev.mellanox.co.il>

On Thursday 20 December 2007 18:24, Tang, Changqing wrote:
> If I have a MPI server processes on a node, many other MPI client processes will dynamically
> connect/disconnect with the server. The server use same XRC domain.
>
> Will this cause accumulating the "kernel" QP for such application ? we want the server to run 365 days
> a year.

Yes, it will. I have no way of knowing when a given receiving XRC QP is no longer needed -- except when the domain it belongs to is finally closed.

I don't see that adding a userspace "destroy" verb for this QP will help:

The only one who actually knows that the XRC QP is no longer required is the userspace process which created the QP at the remote end of the RC connection of the receiving XRC QP.

This remote process can only send a request to destroy the QP to some local process (via its own private protocol). However, you pointed out that the process which originally created the QP may not be around any more (this was the source of the problem which led to the RFC in this thread) -- and sending the destroy request to all the remote processes on that node which it communicates with is REALLY ugly.

I'm not familiar with MPI, so this may be a silly question: Can the MPI server process create a new domain for each client process, and destroy that domain when the client process is done (i.e., is this MPI server process a supervisor of resources for distributed computations (but is not a participant in these computations)?).

(Actually, what I'm asking -- is it possible to allocate a new XRC domain for a distributed computation, and destroy that domain at the end of that computation?)
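
A toy model in plain C may make the accumulation concrete. It is only a sketch of the reference-counting rule described in this thread, not OFED code: the `toy_*` names and the struct are invented for illustration, no verbs calls are made, and it assumes each client connection needs its own receive-side QP on the server node.

```c
/* Toy stand-in for an XRC domain. Hypothetical, not the OFED structure. */
struct toy_xrc_domain {
    int refcount;        /* processes that currently have the domain open */
    int kernel_rcv_qps;  /* kernel-owned receive QPs living in this domain */
};

/* A process opens the domain: one more reference. */
void toy_open_domain(struct toy_xrc_domain *d)
{
    d->refcount++;
}

/* Models the proposed ibv_alloc_xrc_rcv_qp: the QP belongs to the domain,
 * and there is no verb to destroy it individually. */
void toy_alloc_rcv_qp(struct toy_xrc_domain *d)
{
    d->kernel_rcv_qps++;
}

/* A process closes the domain. The QPs go away only when the last reference
 * is dropped. Returns how many QPs were destroyed by this call. */
int toy_close_domain(struct toy_xrc_domain *d)
{
    int destroyed = 0;

    if (d->refcount > 0 && --d->refcount == 0) {
        destroyed = d->kernel_rcv_qps;
        d->kernel_rcv_qps = 0;
    }
    return destroyed;
}

/* A long-running server that keeps one domain open while clients come and go.
 * Client disconnects release nothing, so the QP count only grows.
 * Returns the number of receive QPs still alive. */
int toy_server_after_clients(struct toy_xrc_domain *d, int clients)
{
    int i;

    for (i = 0; i < clients; i++)
        toy_alloc_rcv_qp(d);
    return d->kernel_rcv_qps;
}
```

In this model a server that has seen 1000 clients still holds 1000 receive QPs until it closes the domain itself; a per-computation domain would bound the count by the lifetime of each computation.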

-- Jack

From Christian_Boehme at freenet.de Fri Dec 21 10:01:01 2007
From: Christian_Boehme at freenet.de (Christian Boehme)
Date: Fri Dec 21 10:01:09 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To:
References:
Message-ID: <476BD52D.2070605@freenet.de>

Hi Amith,

amith rajith mamidala wrote:
> If you are using MVAPICH1, I am attaching a minor patch related to freeing
> of memory with MPI_REDUCE. Can you apply this one too?
>

The error regarding the wrong behavior of MPI_REDUCE was fixed with your previous patch. How should I test this one?

>> I am attaching the patch for MVAPICH2 (0.9.8) along with this mail. Can
>> you please try this out?
>>

Thanks a lot. However, this patch did not seem to work, the behavior of MPI_REDUCE is still as described before.

I'm gone for this year, so I won't get back to this problem before Jan, 7th. Have a good start into 2008!

Christian

From eborisch at ieee.org Fri Dec 21 10:23:22 2007
From: eborisch at ieee.org (Eric A. Borisch)
Date: Fri Dec 21 10:28:27 2007
Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2
Message-ID: <392f95800712210723o684becffi3c93db60f4d7939b@mail.gmail.com>

I seem to be running into a memory registration issue.

Observations:

1) During some transfers (MPI_Isend / MPI_Irecv / MPI_Waitall) into a local buffer on the root rank, I receive all of the data from any ranks that are running on the same machine, but only part (or none at all) of the data from ranks running on external machines. The transfer length is above the eager/rendezvous threshold.
2) Once the problem occurs, it is persistent. However, if I force MVAPICH to re-register by calling "while(dreg_evict())" at this point and then re-transfer, the correct data is received. (Same memory being transferred from / to.)
3) I've only witnessed problems occurring above the 4G (as returned by malloc()) memory range.
4) When I receive partial data from ranks, the data ends on a (4k) page bound. Data past this bound (which should have been updated) is unchanged during the transfer, yet both the sender and receiver report no errors. (This is very bad!)
5) Stepping through the code on both ends of the transfer shows the software agreeing on the (correct) length and location as far down as I can follow it.
6) Running against a compilation with no -DLAZY_MEM_UNREGISTER shows no issues. (Other than the expected performance hit.)
7) Occurs on both MVAPICH-1.0-beta (vapi_multirail) and mvapich2-1.0 (vapi)
8) The user code is also sending data out (from a different buffer) over ethernet to a remote gui from the root node.

I can't move to gen2 at this point -- we are using a vendor library for interfacing to another system, and this library uses VAPI.

uname -a output:
Linux rt2.mayo.edu 2.6.9-42.0.2.ELsmp #1 SMP Wed Aug 23 13:38:27 BST 2006 x86_64 x86_64 x86_64 GNU/Linux

Intel SE7520JR2 motherboards. 4G physical ram on each node.

It appears (perhaps this is obvious) that the assumption that memory registered (by the dreg.c code) remains registered until explicitly unregistered (again, by the dreg.c code) is being violated in some way. This, however, is wading in to uncharted (for me, at least) linux memory management waters. The user code is doing nothing to fiddle with registration in any explicit way. (With the exception of as mentioned in (2))

Please let me know what other information I can provide to resolve this. I'm still trying to put together a small test program to cause the problem, but have been unsuccessful so far.

Thanks,
Eric
--
Eric A.
Borisch
eborisch@ieee.org

From mamidala at cse.ohio-state.edu Fri Dec 21 10:35:40 2007
From: mamidala at cse.ohio-state.edu (amith rajith mamidala)
Date: Fri Dec 21 10:35:44 2007
Subject: [mvapich-discuss] Strange error with MPI_REDUCE
In-Reply-To: <476BD52D.2070605@freenet.de>
Message-ID:

Hi Christian,

> The error regarding the wrong behavior of MPI_REDUCE was fixed with your
> previous patch. How should I test this one?

Thanks for trying out the patches. This patch for MVAPICH1 is for the possible memory leaks. It doesn't affect the functionality of the code.

>
> >> I am attaching the patch for MVAPICH2 (0.9.8) along with this mail. Can
> >> you please try this out?
> >>
>
> Thanks a lot. However, this patch did not seem to work, the behavior of
> MPI_REDUCE is still as described before.

I will test this one once more and get back to you.

>
> I'm gone for this year, so I won't get back to this problem before Jan,
> 7th. Have a good start into 2008!
>

Thanks.
Amith

From changquing.tang at hp.com Fri Dec 21 12:13:26 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri Dec 21 12:14:39 2007
Subject: [mvapich-discuss] RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To: <200712211031.59761.jackm@dev.mellanox.co.il>
References: <200712201535.37527.jackm@dev.mellanox.co.il> <476A86E8.8020308@dev.mellanox.co.il> <200712211031.59761.jackm@dev.mellanox.co.il>
Message-ID:

> -----Original Message-----
> From: Jack Morgenstein [mailto:jackm@dev.mellanox.co.il]
> Sent: Friday, December 21, 2007 2:32 AM
> To: Tang, Changqing
> Cc: pasha@dev.mellanox.co.il;
> mvapich-discuss@cse.ohio-state.edu;
> general@lists.openfabrics.org; Open MPI Developers
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> On Thursday 20 December 2007 18:24, Tang, Changqing wrote:
> > If I have a MPI server processes on a node, many other MPI
> > client processes will dynamically connect/disconnect with
> > the server. The server use same XRC domain.
> >
> > Will this cause accumulating the "kernel" QP for such
> > application ? we want the server to run 365 days a year.
>
> Yes, it will. I have no way of knowing when a given
> receiving XRC QP is no longer needed -- except when the
> domain it belongs to is finally closed.
>
> I don't see that adding a userspace "destroy" verb for this
> QP will help:

This kernel QP is for receiving only, so when there is no activity on this QP, can the kernel sends a heart-beat message to check if the remote sending QP is still there (still connected) ? if not, the kernel is safe to cleanup this qp.

So whenever the RC connection is broken, kernel can destroy this QP.

>
> The only one who actually knows that the XRC QP is no longer
> required is the userspace process which created the QP at the
> remote end of the RC connection of the receiving XRC QP.
>
> This remote process can only send a request to destroy the QP
> to some local process (via its own private protocol).
> However, you pointed out that the process which originally
> created the QP may not be around any more (this was the
> source of the problem which led to the RFC in this thread) --
> and sending the destroy request to all the remote processes
> on that node which it communicates with is REALLY ugly.
>
> I'm not familiar with MPI, so this may be a silly question:
> Can the MPI server process create a new domain for each
> client process, and destroy that domain when the client
> process is done (i.e., is this MPI server process a
> supervisor of resources for distributed computations (but is
> not a participant in these computations)?).

The server could be process group across multiple nodes, there are parallel database searching engine, for example.

>
> (Actually, what I'm asking -- is it possible to allocate a
> new XRC domain for a distributed computation, and destroy
> that domain at the end of that computation?)

Yes, it could, but it makes MPI harder to manage the code. And also we have a connect/accept speed concern. We hope not to do it this way.

--CQ

>
>
> -- Jack
>

From jackm at dev.mellanox.co.il Fri Dec 21 13:09:26 2007
From: jackm at dev.mellanox.co.il (Jack Morgenstein)
Date: Fri Dec 21 13:12:09 2007
Subject: [mvapich-discuss] Re: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To:
References: <200712201535.37527.jackm@dev.mellanox.co.il> <200712211031.59761.jackm@dev.mellanox.co.il>
Message-ID: <200712212009.26816.jackm@dev.mellanox.co.il>

On Friday 21 December 2007 19:13, Tang, Changqing wrote:
> This kernel QP is for receiving only, so when there is no activity on this QP,
> can the kernel sends a heart-beat message to check if the remote sending QP
> is still there (still connected) ? if not, the kernel is safe to cleanup
> this qp.
>
> So whenever the RC connection is broken, kernel can destroy this QP.
>

This increases the XRC complexity considerably:

1. Need to have a separate kernel thread which will scan ALL xrc domains on this host for XRC receive QPs.
   This thread will need to do some form of RDMA_READ/WRITE, because otherwise it will interfere with
   the remote (sending side) operation. Furthermore, the sending-side XRC QP may not have anyone listening
   on an associated XRC SRQ qp -- it is not meant to be set up to receive. We only need an operation that
   will yield a RETRY_EXCEEDED error completion if the connection has broken.

2. This opens the door for all sorts of nasty race conditions, since we will now have a bi-directional
   protocol. For example, what if this feature is being combined with APM (valid for RC QPs), and we
   are simply in the middle of a migration, and maybe communication is temporarily interrupted.
   We will be killing off the QP without allowing any error recovery mechanism to work.

3.
The application complexity goes up -- we now need the sending-side QP to declare a memory region and send
   this region's address to the receiving side so that the receiving side (the kernel thread mentioned above)
   can periodically try to read from this region.

Still, I'll give this some thought. For example, maybe we can rdma_read some random (illegal) address -- If the connection is alive, we'll get a "remote access error" completion, while if its dead, we'll get retry exceeded (need to check that the bad rdma read request does not cause the QPs to enter an error state).

- Jack

From changquing.tang at hp.com Fri Dec 21 13:22:29 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Fri Dec 21 13:23:49 2007
Subject: [mvapich-discuss] RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To: <200712212009.26816.jackm@dev.mellanox.co.il>
References: <200712201535.37527.jackm@dev.mellanox.co.il> <200712211031.59761.jackm@dev.mellanox.co.il> <200712212009.26816.jackm@dev.mellanox.co.il>
Message-ID:

What we do for heart-beat is using zero-byte rdma_write, the message goes to the peer QP only, there is no need to post anything on remote side, no need for pinned memory.

--CQ

From chai.15 at osu.edu Fri Dec 21 18:29:37 2007
From: chai.15 at osu.edu (LEI CHAI)
Date: Fri Dec 21 18:29:44 2007
Subject: [mvapich-discuss] Odd behavior with memory registration / dreg / MVAPICH and MVAPICH2
Message-ID: <1a7dd1d151.1d1511a7dd@osu.edu>

Hi Eric,

Thanks for using mvapich/mvapich2.
The problem you reported can be solved by using the PTMALLOC feature which is supported by the gen2 device but not vapi/vapi_multirail. Not much features have been added to vapi/vapi_multirail devices for the last few releases because not many people use them. Since you cannot move to gen2, we would suggest you disable LAZY_MEM_UNREGISTER for your tests.

Thanks,
Lei

From ibatis2 at 163.com Mon Dec 24 04:05:44 2007
From: ibatis2 at 163.com (jetspeed)
Date: Mon Dec 24 04:10:21 2007
Subject: [mvapich-discuss] can't set up mpd ring between two nodes
Message-ID: <20071224170544.54fbe46e.ibatis2@163.com>

Hi all:

I installed mvapich2 , which is with the OFED 1.2.5.

1. when I use mpdboot on a machine, I got :
mpdboot_inode02 (handle_mpd_output 359): failed to ping mpd on inode02; recvd output={}

2.
when I try to use mpd to set up mpd ring, as the user guide of mpich2: mpd & on node02 mpd -h node02 -p port on node01 I got: on node01: (the latter mpd) inode01_33435 (connect_lhs 621): invalid challenge from inode02 32969: {} inode01_33435 (enter_ring 566): lhs connect failed inode01_33435 (run 233): failed to enter ring on node02: (the first mpd ) inode02_32969: mpd_uncaught_except_tb handling: exceptions.TypeError: sequence item 0: expected string, int found /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpdlib.py 733 handle_ring_listener_connection newsock.correctChallengeResponse = \ /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpdlib.py 488 handle_active_streams handler(stream,*args) /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpd 266 runmainloop rv = self.streamHandler.handle_active_streams(timeout=8.0) /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpd 240 run self.runmainloop() /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpd 1344 ? mpd.run() Has anyone encountered this problem? Thanks in advance. From pasha at dev.mellanox.co.il Mon Dec 24 09:03:09 2007 From: pasha at dev.mellanox.co.il (Pavel Shamis (Pasha)) Date: Mon Dec 24 09:03:20 2007 Subject: [mvapich-discuss] Re: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process In-Reply-To: References: <200712201535.37527.jackm@dev.mellanox.co.il> <476A86E8.8020308@dev.mellanox.co.il> Message-ID: <476FBC1D.6050900@dev.mellanox.co.il> Hi CQ, Tang, Changqing wrote: > If I have a MPI server processes on a node, many other MPI client processes will dynamically > connect/disconnect with the server. The server use same XRC domain. > > Will this cause accumulating the "kernel" QP for such application ? we want the server to run 365 days > a year. > I have some question about the scenario above. Did you call for the mpi disconnect on the both ends (server/client) before the client exit (did we must to do it?) Regards, Pasha. > > Thanks. 
> --CQ
>
>> -----Original Message-----
>> From: Pavel Shamis (Pasha) [mailto:pasha@dev.mellanox.co.il]
>> Sent: Thursday, December 20, 2007 9:15 AM
>> To: Jack Morgenstein
>> Cc: Tang, Changqing; Roland Dreier;
>> general@lists.openfabrics.org; Open MPI Developers;
>> mvapich-discuss@cse.ohio-state.edu
>> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
>> independent of any one user process
>>
>> Adding Open MPI and MVAPICH community to the thread.
>>
>> Pasha (Pavel Shamis)
>>
>> Jack Morgenstein wrote:
>>> background: see "XRC Cleanup order issue thread" at
>>>
>>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
>>>
>>> (userspace process which created the receiving XRC qp on a given host
>>> dies before other processes which still need to receive XRC messages
>>> on their SRQs which are "paired" with the now-destroyed receiving XRC
>>> QP.)
>>>
>>> Solution: Add a userspace verb (as part of the XRC suite) which
>>> enables the user process to create an XRC QP owned by the kernel --
>>> which belongs to the required XRC domain.
>>>
>>> This QP will be destroyed when the XRC domain is closed (i.e., as part
>>> of a ibv_close_xrc_domain call, but only when the domain's reference
>>> count goes to zero).
>>>
>>> Below, I give the new userspace API for this function. Any feedback
>>> will be appreciated.
>>> This API will be implemented in the upcoming OFED 1.3 release, so we
>>> need feedback ASAP.
>>>
>>> Notes:
>>> 1. There is no query or destroy verb for this QP. There is also no
>>>    userspace object for the QP. Userspace has ONLY the raw qp number
>>>    to use when creating the (X)RC connection.
>>>
>>> 2. Since the QP is "owned" by kernel space, async events for this QP
>>>    are also handled in kernel space (i.e., reported in
>>>    /var/log/messages). There are no completion events for the QP,
>>>    since it does not send, and all receives completions are reported
>>>    in the XRC SRQ's cq.
>>>
>>>    If this QP enters the error state, the remote QP which sends will
>>>    start receiving RETRY_EXCEEDED errors, so the application will be
>>>    aware of the failure.
>>>
>>> - Jack
>>>
>>> ======================================================================================
>>> /**
>>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a receive-side only QP,
>>>  *      and moves the created qp through the RESET->INIT and INIT->RTR transitions.
>>>  *      (The RTR->RTS transition is not needed, since this QP does no sending).
>>>  *      The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
>>>  *      for actually receiving the transmissions and generating all completions on the
>>>  *      receiving side.
>>>  *
>>>  *      This QP is created in kernel space, and persists until the XRC domain is closed.
>>>  *      (i.e., its reference count goes to zero).
>>>  *
>>>  * @pd: protection domain to use. At lower layer, this provides access to userspace obj
>>>  * @xrc_domain: xrc domain to use for the QP.
>>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
>>>  * @attr_mask: bitmap indicating which attributes are provided in the attr struct.
>>>  *      used for validity checking.
>>>  * @xrc_rcv_qpn: qp_num of created QP (if success). To be passed to the remote node. The
>>>  *      remote node will use xrc_rcv_qpn in ibv_post_send when sending to
>>>  *      XRC SRQ's on this host in the same xrc domain.
>>>  *
>>>  * RETURNS: success (0), or a (negative) error value.
>>>  */
>>>
>>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
>>>                          struct ibv_xrc_domain *xrc_domain,
>>>                          struct ibv_qp_attr *attr,
>>>                          enum ibv_qp_attr_mask attr_mask,
>>>                          uint32_t *xrc_rcv_qpn);
>>>
>>> Notes:
>>>
>>> 1. Although the kernel creates the qp in the kernel's own PD, we still
>>>    need the PD parameter to determine the device.
>>>
>>> 2. I chose to use struct ibv_qp_attr, which is used in modify QP,
>>>    rather than create a new structure for this purpose. This also
>>>    guards against API changes in the event that during development I
>>>    notice that more modify-qp parameters must be specified for this
>>>    operation to work.
>>>
>>> 3. Table of the ibv_qp_attr parameters showing what values to set:
>>>
>>> struct ibv_qp_attr {
>>>     enum ibv_qp_state   qp_state;             Not needed
>>>     enum ibv_qp_state   cur_qp_state;         Not needed
>>>         -- Driver starts from RESET and takes qp to RTR.
>>>     enum ibv_mtu        path_mtu;             Yes
>>>     enum ibv_mig_state  path_mig_state;       Yes
>>>     uint32_t            qkey;                 Yes
>>>     uint32_t            rq_psn;               Yes
>>>     uint32_t            sq_psn;               Not needed
>>>     uint32_t            dest_qp_num;          Yes
>>>         -- this is the remote side QP for the RC conn.
>>>     int                 qp_access_flags;      Yes
>>>     struct ibv_qp_cap   cap;                  Need only XRC domain.
>>>                                               Other caps will use hard-coded values:
>>>                                                   max_send_wr = 1;
>>>                                                   max_recv_wr = 0;
>>>                                                   max_send_sge = 1;
>>>                                                   max_recv_sge = 0;
>>>                                                   max_inline_data = 0;
>>>     struct ibv_ah_attr  ah_attr;              Yes
>>>     struct ibv_ah_attr  alt_ah_attr;          Optional
>>>     uint16_t            pkey_index;           Yes
>>>     uint16_t            alt_pkey_index;       Optional
>>>     uint8_t             en_sqd_async_notify;  Not needed (No sq)
>>>     uint8_t             sq_draining;          Not needed (No sq)
>>>     uint8_t             max_rd_atomic;        Not needed (No sq)
>>>     uint8_t             max_dest_rd_atomic;   Yes
>>>         -- Total max outstanding RDMAs expected for ALL srq
>>>            destinations using this receive QP.
>>>            (if you are only using SENDs, this value can be 0).
>>>     uint8_t             min_rnr_timer;        default - 0
>>>     uint8_t             port_num;             Yes
>>>     uint8_t             timeout;              Yes
>>>     uint8_t             retry_cnt;            Yes
>>>     uint8_t             rnr_retry;            Yes
>>>     uint8_t             alt_port_num;         Optional
>>>     uint8_t             alt_timeout;          Optional
>>> };
>>>
>>> 4. Attribute mask bits to set:
>>>    For RESET_to_INIT transition:
>>>        IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
>>>
>>>    For INIT_to_RTR transition:
>>>        IB_QP_AV | IB_QP_PATH_MTU |
>>>        IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
>>>    If you are using RDMA or atomics, also set:
>>>        IB_QP_MAX_DEST_RD_ATOMIC
>>>
>>>
>>> _______________________________________________
>>> general mailing list
>>> general@lists.openfabrics.org
>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>>>
>>> To unsubscribe, please visit
>>> http://openib.org/mailman/listinfo/openib-general
>>>
>>
>> --
>> Pavel Shamis (Pasha)
>> Mellanox Technologies
>>
>

--
Pavel Shamis (Pasha)
Mellanox Technologies

From changquing.tang at hp.com Mon Dec 24 18:49:37 2007
From: changquing.tang at hp.com (Tang, Changqing)
Date: Mon Dec 24 18:51:44 2007
Subject: [mvapich-discuss] RE: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process
In-Reply-To: <476FBC1D.6050900@dev.mellanox.co.il>
References: <200712201535.37527.jackm@dev.mellanox.co.il> <476A86E8.8020308@dev.mellanox.co.il> <476FBC1D.6050900@dev.mellanox.co.il>
Message-ID:

> -----Original Message-----
> From: Pavel Shamis (Pasha) [mailto:pasha@dev.mellanox.co.il]
> Sent: Monday, December 24, 2007 8:03 AM
> To: Tang, Changqing
> Cc: Jack Morgenstein; Roland Dreier;
> general@lists.openfabrics.org; Open MPI Developers;
> mvapich-discuss@cse.ohio-state.edu
> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> independent of any one user process
>
> Hi CQ,
> Tang, Changqing wrote:
> > If I have a MPI server processes on a node, many other MPI
> > client processes will dynamically connect/disconnect with
> > the server. The server use same XRC domain.
> >
> > Will this cause accumulating the "kernel" QP for such
> > application ? we want the server to run 365 days a year.
> >
> I have some question about the scenario above.
> Did you call MPI disconnect on both ends (server/client)
> before the client exits (do we have to do that)?

Yes, both ends will call disconnect. But for us, the
MPI_Comm_disconnect() call is not a collective call; it is just a local
operation.

--CQ

>
> Regards,
> Pasha.
> >
> > Thanks.
> > --CQ
>
> --
> Pavel Shamis (Pasha)
> Mellanox Technologies
>

From huanwei at cse.ohio-state.edu Wed Dec 26 11:22:19 2007
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Wed Dec 26 11:22:24 2007
Subject: [mvapich-discuss] can't set up mpd ring between two nodes (fwd)
In-Reply-To:
Message-ID:

Hi,

Thanks for using mvapich2.

> I installed mvapich2 , which is with the OFED 1.2.5.

So are you using the default installation that comes with the OFED
package?

> 1. when I use mpdboot on a machine, I got :
> mpdboot_inode02 (handle_mpd_output 359): failed to ping mpd on inode02; recvd output={}

Several things can cause this failure, but there are a few things to
check first:

1) Do you have another mpd running on the same set of nodes (under the
same user name)?
2) Do you have .mpd.conf in your home directory?

I also want to mention that we have already released mvapich2-1.0.1.
You can try it by downloading the software package from our website:
http://mvapich.cse.ohio-state.edu/

There is a file called README_MPICH2 in the package. You can also read
that for more details on setting up mpd rings.

Please let us know if this works.

Thanks.

-- Wei

> 2.
> when I try to use mpd to set up mpd ring, as the user guide of mpich2:
> mpd & on node02
> mpd -h node02 -p port on node01
> I got:
> on node01: (the latter mpd)
> inode01_33435 (connect_lhs 621): invalid challenge from inode02 32969: {}
> inode01_33435 (enter_ring 566): lhs connect failed
> inode01_33435 (run 233): failed to enter ring
>
> on node02: (the first mpd )
>
> inode02_32969: mpd_uncaught_except_tb handling:
> exceptions.TypeError: sequence item 0: expected string, int found
> /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpdlib.py 733 handle_ring_listener_connection
> newsock.correctChallengeResponse = \
> /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpdlib.py 488 handle_active_streams handler(stream,*args)
> /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpd 266 runmainloop
> rv = self.streamHandler.handle_active_streams(timeout=8.0)
> /usr/mpi/gcc/mvapich2-0.9.8-15/bin/mpd 240 run
> self.runmainloop()
> /us