From art702 at jaguar1.usouthal.edu Mon Mar 2 10:37:24 2009
From: art702 at jaguar1.usouthal.edu (art702@jaguar1.usouthal.edu)
Date: Mon Mar 2 10:37:31 2009
Subject: [mvapich-discuss] 'mpirun' source code
Message-ID:

Hi,

I am trying to evaluate the performance of MPI programs with the Intel VTune tool. Is there any way to add debug flags to 'mpirun', or to get the source code for mpirun so that I can recompile it with debug flags? If not, is there any other tool for evaluating the performance of MPI programs?

regards,
arvind.

From sridharj at cse.ohio-state.edu Mon Mar 2 11:58:59 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Mon Mar 2 11:59:05 2009
Subject: [mvapich-discuss] 'mpirun' source code
In-Reply-To:
References:
Message-ID: <20090302165859.GA21551@mu.cse.ohio-state.edu>

Hi,

Which version of mvapich are you using? You can use -debug with mpirun_rsh to start a gdb session. The source code is distributed along with the library, and its location depends on the version (mvapich/mvapich2).

-Jaidev

On Mon, Mar 02, 2009 at 09:37:24AM -0600, art702@jaguar1.usouthal.edu wrote:
>
> Hi,
>
> I am trying to evaluate the mpi programs performance on Intel tool(VTune ), Is there any way to add debug flags to the 'mpirun' or get source code for mpirun so that i can re-compile it adding debug flags?Else is there any other tool to evaluate performance of mpi programs?
>
> regards,
> arvind.
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
You can rent this space for only $5 a week.
From perkinjo at cse.ohio-state.edu Mon Mar 2 13:17:41 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Mar 2 13:17:51 2009
Subject: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error
In-Reply-To:
References:
Message-ID: <20090302181740.GK2993@cse.ohio-state.edu>

Vivek:

We do not have an environment set up that can easily support the installation of the MEME Suite. Is there a simpler MPI program with which this error can be reproduced? That would greatly assist us in debugging this issue.

On Fri, Feb 20, 2009 at 11:32:30AM +0530, Vivek Gavane wrote:
> Sir,
> I have tried for different set of nodes for various runs, the same
> error is reported. But when I tried for small number of cores i.e 8 the
> job never came out even though it was complete and the output file was
> generated. Also the processes were showing 99.9% CPU usage even after
> complete output was generated.
>
> The application code I am using is MEME version meme3.0.3
> http://meme.nbcr.net/downloads/old_versions/
>
> Also I installed the newer version of MEME version meme_4.1.0
> http://meme.nbcr.net/downloads/
>
> It is also giving the following error everytime on different set of nodes:
> -----------------------------------
> Exit code -5 signaled from ibc0-27
> Killing remote processes...MPI process terminated unexpectedly
> DONE
> -----------------------------------
>
> The redirected output file of the application contains:
> -----------------------------
> cleanupSignal 15 received.
> -----------------------------
>
> Thanks.
> --
> Regards,
> Vivek Gavane
>
> Member Technical Staff
> Bioinformatics team,
> Scientific & Engineering Computing Group,
> National PARAM Supercomputing Facility,
> Centre for Development of Advanced Computing,
> Pune-411007.
>
> Phone: +91 20 25704100 ext.
195 > Direct Line: +91 20 25704195 > > On Thu, Feb 19, 2009, Dhabaleswar Panda said: > > > Vivek, > > > > Do you see this error always when you run this application? Do you see > > this error when you run your application on different set of nodes? If > > this happens always (irrespective of runs and nodes), will it be possible > > for you to send us a code snippet which reproduces this problem. This will > > help us to investigate this issue further. > > > > Thanks, > > > > DK > > > >> Sir, > >> Thank you for the reply but the cable and switch seems to be fine. Is > >> there any other reason/solution for the errors. And also the application > >> program is giving complete and correct output except for the errors at the > >> end. > >> > >> Thanks. > >> -- > >> Regards, > >> Vivek Gavane > >> > >> Member Technical Staff > >> Bioinformatics team, > >> Scientific & Engineering Computing Group, > >> National PARAM Supercomputing Facility, > >> Centre for Development of Advanced Computing, > >> Pune-411007. > >> > >> Phone: +91 20 25704100 ext. 195 > >> Direct Line: +91 20 25704195 > >> > >> On Tue, Feb 17, 2009, Dhabaleswar Panda said: > >> > >> > Code 12 is a timeout -- could be a bad cable/HCA/switch leaf. If the > >> > system is really large then it could be congestion. 
> >> > > >> > Thanks, > >> > > >> > DK > >> > > >> > On Tue, 17 Feb 2009, Vivek Gavane wrote: > >> > > >> >> Hello, > >> >> I have mvapich2-1.2 compiled with the following options: > >> >> > >> >> > >> >> /configure --with-rdma=gen2 --enable-sharedlibs=gcc --enable-g=dbg > >> >> --enable-debuginfo --with-ib-include=/opt/OFED/include > >> >> --with-ib-libpath=/opt/OFED/lib64 --prefix=/home/apps/mvapich2-1.2 > >> >> > >> >> After I submit a job, the job completes but the following errors are > >> >> reported on the console: > >> >> > >> >> ------------------------------------------------------------- > >> >> send desc error > >> >> Exit code -5 signaled from ibc0-16 > >> >> Killing remote processes...[14] Abort: [] Got completion with error 12, > >> >> vendor code=81, dest rank=0 > >> >> at line 553 in file ibv_channel_manager.c > >> >> MPI process terminated unexpectedly > >> >> DONE > >> >> ------------------------------------------------------------ > >> >> > >> >> And in the redirected output file, following errors are reported at the > >> >> end: > >> >> ----------------------------------------- > >> >> cleanupSignal 15 received. > >> >> Signal 15 received. > >> >> Signal 15 received. > >> >> Signal 15 received. > >> >> ----------------------------------------- > >> >> > >> >> Do anyone know the reason for this? > >> >> > >> >> Thanks in advance. 
> >> >> --
> >> >> Regards,
> >> >> Vivek Gavane
> >> >> _______________________________________________
> >> >> mvapich-discuss mailing list
> >> >> mvapich-discuss@cse.ohio-state.edu
> >> >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >> >>
> >> >
> >>
> >>
> >
> >
> >
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From nathan.baca at gmail.com Tue Mar 3 22:45:16 2009
From: nathan.baca at gmail.com (Nathan Baca)
Date: Tue Mar 3 22:45:24 2009
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
Message-ID:

Hello,

I am seeing inconsistent mpi-io behavior when writing to a Lustre file system using mvapich2 1.2p1 and mvapich 1.1 both with romio. What follows is a simple reproducer and output. Essentially one or more of the running processes does not read or write the correct amount of data to its part of a file residing on a Lustre (parallel) file system.

I have tried both isolating the output to a single OST and striping across multiple OSTs. Both will reproduce the same result. I have tried compiling with multiple versions of both pathscale and intel compilers all with the same result.

The odd thing is that this seems to work using hpmpi 2.03 with pathscale 3.2 and intel 10.1.018. The operating system is XC 3.2.1 which is essentially rhel4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp. Lustre version is lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.

Any help figuring out what is happening is greatly appreciated.
Thanks,
Nate

program gcrm_test_io
  implicit none
  include "mpif.h"

  integer X_SIZE

  integer w_me, w_nprocs
  integer my_info

  integer i
  integer (kind=4) :: ierr
  integer (kind=4) :: fileID

  integer (kind=MPI_OFFSET_KIND) :: mylen
  integer (kind=MPI_OFFSET_KIND) :: offset
  integer status(MPI_STATUS_SIZE)
  integer count
  integer ncells
  real (kind=4), allocatable, dimension (:) :: array2
  logical sync

  call mpi_init(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr)

  call mpi_info_create(my_info, ierr)
  ! optional ways to set things in mpi-io
  ! call mpi_info_set (my_info, "romio_ds_read" , "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_ds_write", "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_cb_write", "enable" , ierr)

  x_size = 410011 ! A 'big' number, with bigger numbers it is more likely to fail
  sync = .true.   ! Extra file synchronization

  ncells = (X_SIZE * w_nprocs)

  ! Use node zero to fill it with nines
  if (w_me .eq. 0) then
    call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
    allocate (array2(ncells))
    array2(:) = 9.0
    mylen = ncells
    offset = 0 * 4
    call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, "native",MPI_INFO_NULL,ierr)
    call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
    call MPI_Get_count(status,MPI_INTEGER, count, ierr)
    if (count .ne. mylen) print*, "Wrong initial write count:", count,mylen
    deallocate(array2)
    if (sync) call MPI_FILE_SYNC (fileID,ierr)
    call MPI_FILE_CLOSE (fileID,ierr)
  endif

  ! All nodes now fill their area with ones
  call MPI_BARRIER(MPI_COMM_WORLD,ierr)
  allocate (array2( X_SIZE))
  array2(:) = 1.0
  offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
  mylen = X_SIZE
  call MPI_FILE_OPEN (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY, my_info, fileID, ierr)
  print*,"node",w_me,"starting",(offset/4) + 1,"ending",(offset/4)+mylen
  call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, "native",MPI_INFO_NULL,ierr)
  call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
  call MPI_Get_count(status,MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me
  deallocate(array2)
  if (sync) call MPI_FILE_SYNC (fileID,ierr)
  call MPI_FILE_CLOSE (fileID,ierr)

  ! Read it back on node zero to see if it is ok data
  if (w_me .eq. 0) then
    call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, my_info, fileID, ierr)
    mylen = ncells
    allocate (array2(ncells))
    call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr)
    call MPI_Get_count(status,MPI_INTEGER, count, ierr)
    if (count .ne. mylen) print*, "Wrong read count:", count,mylen
    do i=1,ncells
      if (array2(i) .ne. 1) then
        print*, "ERROR", i,array2(i), ((i-1)*4), ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB
        goto 999
      end if
    end do
    print*, "All done with nothing wrong"
999 deallocate(array2)
    call MPI_FILE_CLOSE (fileID,ierr)
    call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr)
  endif

  call mpi_finalize(ierr)

end program gcrm_test_io

1.2p1 MVAPICH 2
node 1 starting 410012 ending 820022
node 2 starting 820023 ending 1230033
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 5 starting 2050056 ending 2460066
node 0 starting 1 ending 410011
All done with nothing wrong

node 1 starting 410012 ending 820022
node 4 starting 1640045 ending 2050055
node 3 starting 1230034 ending 1640044
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
Wrong write count: 228554 410011 2
node 0 starting 1 ending 410011
Wrong read count: 1048576 2460066
ERROR 1048577 0.E+0 4194304 4.

node 1 starting 410012 ending 820022
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 2 starting 820023 ending 1230033
node 5 starting 2050056 ending 2460066
node 0 starting 1 ending 410011
Wrong read count: 1048576 2460066
ERROR 1048577 0.E+0 4194304 4.

1.1 MVAPICH
node 0 starting 1 ending 410011
node 4 starting 1640045 ending 2050055
node 3 starting 1230034 ending 1640044
node 2 starting 820023 ending 1230033
node 1 starting 410012 ending 820022
node 5 starting 2050056 ending 2460066
All done with nothing wrong

node 0 starting 1 ending 410011
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
node 1 starting 410012 ending 820022
Wrong write count: 228554 410011 2
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
Wrong read count: 1048576 2460066
ERROR 1048577 0.0000000E+00 4194304 4.00000000000000

node 0 starting 1 ending 410011
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 1 starting 410012 ending 820022
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
Wrong read count: 1229824 2460066
ERROR 1229825 0.0000000E+00 4919296 4.69140625000000

--
Nathan Baca
nathan.baca@gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090303/ee3f6aa0/attachment.html

From nilesh_awate at yahoo.com Wed Mar 4 08:52:17 2009
From: nilesh_awate at yahoo.com (nilesh awate)
Date: Wed Mar 4 08:52:26 2009
Subject: [mvapich-discuss] (no subject)
Message-ID: <22080.56777.qm@web94104.mail.in2.yahoo.com>

Hi all,

I am using mvapich2-1.2p1 with the udapl adi over a proprietary interconnect; previously we were using mvapich2-1.0.3.
I am running it on a cluster of Intel Xeon X5472 @ 3.00GHz machines (8 nodes, 8 cores each).
However, I am not able to launch more than 6 processes across the cluster; the job just gets stuck after connection establishment (as seen from debug messages in our library). I have observed the same thing with a Mellanox card (OpenIB-cma).
Launching processes the mpdboot way works fine (but, as you say, it is not recommended).

The following are the environment settings I apply before a run:

$ export PATH=/home/user/pn_mpi/mpi-bin1.2/bin/:$PATH
$ which env
/usr/bin/env
$ which mpispawn
/home/user/pn_mpi/mpi-bin1.2/bin/mpispawn

(set in ~/.bashrc)

I have read the FAQ on this but did not find much information.

Waiting for your reply,
Nilesh
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090304/be0589d1/attachment-0001.html

From sridharj at cse.ohio-state.edu Wed Mar 4 09:31:49 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Wed Mar 4 09:31:55 2009
Subject: [mvapich-discuss] (no subject)
In-Reply-To: <22080.56777.qm@web94104.mail.in2.yahoo.com>
References: <22080.56777.qm@web94104.mail.in2.yahoo.com>
Message-ID: <20090304143149.GA7285@omicron.cse.ohio-state.edu>

Nilesh,

On Wed, Mar 04, 2009 at 07:22:17PM +0530, nilesh awate wrote:
>
> But I m not able to fire more than 6 processes over a cluster
> it just get stuck after connection establishment(used debug messeges
> in our library)

Can you send us a backtrace indicating where it is getting stuck? Is the application stuck, or one of the startup components (mpirun_rsh / mpispawn)?

-Jaidev

--
You can rent this space for only $5 a week.
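
The element counts printed by the gcrm_test_io reproducer in the MPI-IO thread (Nathan Baca, Mar 3) can be turned into byte offsets with a little arithmetic. The Python sketch below does only that, assuming 4-byte reals and the rank * X_SIZE * 4 layout used in the posted program. The function and constant names are made up for illustration and do not appear in the thread, and the sketch makes no claim about the cause of the short reads and writes.

```python
# Back-of-the-envelope check of the byte offsets in the gcrm_test_io output.
# Assumptions (taken from the posted Fortran program, not from any library):
# real(kind=4) is 4 bytes, and rank r writes X_SIZE elements starting at
# byte offset r * X_SIZE * 4. All names below are hypothetical helpers.
from typing import Tuple

X_SIZE = 410011      # elements per rank, from the reproducer
REAL_BYTES = 4       # size of real(kind=4)
MIB = 1024 * 1024


def rank_byte_offset(rank: int) -> int:
    """Byte offset at which a rank's region of output.dat starts."""
    return rank * X_SIZE * REAL_BYTES


def element_range(rank: int) -> Tuple[int, int]:
    """1-based first and last element index, as printed by the program."""
    first = rank * X_SIZE + 1
    return first, first + X_SIZE - 1


def short_write_end(rank: int, elements_written: int) -> int:
    """Byte position in the file at which a short write by `rank` stopped."""
    return rank_byte_offset(rank) + elements_written * REAL_BYTES


if __name__ == "__main__":
    # Compare with "node 2 starting 820023 ending 1230033".
    print("rank 2 elements:", element_range(2))

    # "Wrong write count: 228554 410011 2": rank 2 wrote 228554 elements.
    end = short_write_end(2, 228554)
    print("rank 2 short write ends at byte", end, "=", end / MIB, "MiB")

    # "Wrong read count" values from the report, converted to bytes.
    for count in (1048576, 1229824):
        nbytes = count * REAL_BYTES
        print(count, "elements =", nbytes, "bytes =", nbytes / MIB, "MiB")
```

With the numbers from the report, rank 2's short write ends at byte 4194304, which is exactly 4 MiB and the same position at which the read-back on rank 0 stops in the runs that report a read count of 1048576. The run that reports 1229824 elements stops at byte 4919296 (4.69140625 MiB), matching the value the program printed. Whether these boundaries have anything to do with Lustre striping or ROMIO buffering is not established anywhere in the thread.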
From weikuan.yu at gmail.com Wed Mar 4 10:42:36 2009
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Wed Mar 4 10:42:53 2009
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
In-Reply-To:
References:
Message-ID: <49AEA16C.9050606@gmail.com>

Hi, Nathan,

Thanks for reporting the problem. I will be looking into it. I saw that you included three outputs from the same program. Could you please clarify whether the program ran correctly the first time but failed on later runs?

--Weikuan

Nathan Baca wrote:
> Hello,
>
> I am seeing inconsistent mpi-io behavior when writing to a Lustre file
> system using mvapich2 1.2p1 and mvapich 1.1 both with romio. What
> follows is a simple reproducer and output. Essentially one or more of
> the running processes does not read or write the correct amount of data
> to its part of a file residing on a Lustre (parallel) file system.
>
> I have tried both isolating the output to a single OST and striping
> across multiple OSTs. Both will reproduce the same result. I have tried
> compiling with multiple versions of both pathscale and intel compilers
> all with the same result.
>
> The odd thing is that this seems to work using hpmpi 2.03 with pathscale
> 3.2 and intel 10.1.018. The operating system is XC 3.2.1 which is
> essentially rhel4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp. Lustre
> version is lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.
>
> Any help figuring out what is happening is greatly appreciated.
Thanks, Nate > > program gcrm_test_io > implicit none > include "mpif.h" > > integer X_SIZE > > integer w_me, w_nprocs > integer my_info > > integer i > integer (kind=4) :: ierr > integer (kind=4) :: fileID > > integer (kind=MPI_OFFSET_KIND) :: mylen > integer (kind=MPI_OFFSET_KIND) :: offset > integer status(MPI_STATUS_SIZE) > integer count > integer ncells > real (kind=4), allocatable, dimension (:) :: array2 > logical sync > > call mpi_init(ierr) > call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr) > call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr) > > call mpi_info_create(my_info, ierr) > ! optional ways to set things in mpi-io > ! call mpi_info_set (my_info, "romio_ds_read" , "enable" , ierr) > ! call mpi_info_set (my_info, "romio_ds_write", "enable" , ierr) > ! call mpi_info_set (my_info, "romio_cb_write", "enable" , ierr) > > x_size = 410011 ! A 'big' number, with bigger numbers it is more > likely to fail > sync = .true. ! Extra file synchronization > > ncells = (X_SIZE * w_nprocs) > > ! Use node zero to fill it with nines > if (w_me .eq. 0) then > call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", > MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr) > allocate (array2(ncells)) > array2(:) = 9.0 > mylen = ncells > offset = 0 * 4 > call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, > "native",MPI_INFO_NULL,ierr) > call MPI_File_write(fileID, array2, mylen , MPI_REAL, > status,ierr) > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong initial write count:", > count,mylen > deallocate(array2) > if (sync) call MPI_FILE_SYNC (fileID,ierr) > call MPI_FILE_CLOSE (fileID,ierr) > endif > > ! All nodes now fill their area with ones > call MPI_BARRIER(MPI_COMM_WORLD,ierr) > allocate (array2( X_SIZE)) > array2(:) = 1.0 > offset = (w_me * X_SIZE) * 4 ! 
multiply by four, since it is real*4 > mylen = X_SIZE > call MPI_FILE_OPEN (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY, > my_info, fileID, ierr) > print*,"node",w_me,"starting",(offset/4) + > 1,"ending",(offset/4)+mylen > call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, > "native",MPI_INFO_NULL,ierr) > call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr) > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me > deallocate(array2) > if (sync) call MPI_FILE_SYNC (fileID,ierr) > call MPI_FILE_CLOSE (fileID,ierr) > > ! Read it back on node zero to see if it is ok data > if (w_me .eq. 0) then > call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", > MPI_MODE_RDONLY, my_info, fileID, ierr) > mylen = ncells > allocate (array2(ncells)) > call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr) > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong read count:", count,mylen > do i=1,ncells > if (array2(i) .ne. 1) then > print*, "ERROR", i,array2(i), ((i-1)*4), > ((i-1)*4)/(1024d0*1024d0) ! 
Index, value, # of good bytes,MB > goto 999 > end if > end do > print*, "All done with nothing wrong" > 999 deallocate(array2) > call MPI_FILE_CLOSE (fileID,ierr) > call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr) > endif > > call mpi_finalize(ierr) > > end program gcrm_test_io > > 1.2p1 MVAPICH 2 > node 1 starting 410012 ending 820022 > node 2 starting 820023 ending 1230033 > node 3 starting 1230034 ending 1640044 > node 4 starting 1640045 ending 2050055 > node 5 starting 2050056 ending 2460066 > node 0 starting 1 ending 410011 > All done with nothing wrong > > > node 1 starting 410012 ending 820022 > node 4 starting 1640045 ending 2050055 > node 3 starting 1230034 ending 1640044 > node 5 starting 2050056 ending 2460066 > node 2 starting 820023 ending 1230033 > Wrong write count: 228554 410011 2 > node 0 starting 1 ending 410011 > Wrong read count: 1048576 2460066 > ERROR 1048577 0.E+0 4194304 4. > > > node 1 starting 410012 ending 820022 > node 3 starting 1230034 ending 1640044 > node 4 starting 1640045 ending 2050055 > node 2 starting 820023 ending 1230033 > node 5 starting 2050056 ending 2460066 > node 0 starting 1 ending 410011 > Wrong read count: 1048576 2460066 > ERROR 1048577 0.E+0 4194304 4. 
>
>
> 1.1 MVAPICH
> node 0 starting 1 ending 410011
> node 4 starting 1640045 ending 2050055
> node 3 starting 1230034 ending 1640044
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> All done with nothing wrong
>
>
> node 0 starting 1 ending 410011
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> Wrong write count: 228554 410011 2
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> Wrong read count: 1048576 2460066
> ERROR 1048577 0.0000000E+00 4194304 4.00000000000000
>
>
> node 0 starting 1 ending 410011
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> Wrong read count: 1229824 2460066
> ERROR 1229825 0.0000000E+00 4919296 4.69140625000000
>
>
> --
> Nathan Baca
> nathan.baca@gmail.com
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From thakur at mcs.anl.gov Wed Mar 4 10:50:16 2009
From: thakur at mcs.anl.gov (Rajeev Thakur)
Date: Wed Mar 4 10:50:18 2009
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
In-Reply-To: <200903041415.n24EEdFR019274@cse.ohio-state.edu>
References: <200903041415.n24EEdFR019274@cse.ohio-state.edu>
Message-ID:

Nathan,

Can you check if it works if you add the prefix "ufs:" to the file name in all opens?

Rajeev > From: Nathan Baca > Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using > MVAPICH > To: mvapich-discuss@cse.ohio-state.edu > Message-ID: > > Content-Type: text/plain; charset="utf-8" > > Hello, > > I am seeing inconsistent mpi-io behavior when writing to a Lustre file > system using mvapich2 1.2p1 and mvapich 1.1 both with romio. > What follows is > a simple reproducer and output. Essentially one or more of the running > processes does not read or write the correct amount of data > to its part of a > file residing on a Lustre (parallel) file system. > > I have tried both isolating the output to a single OST and > striping across > multiple OSTs. Both will reproduce the same result. I have > tried compiling > with multiple versions of both pathscale and intel compilers > all with the > same result. > > The odd thing is that this seems to work using hpmpi 2.03 > with pathscale 3.2 > and intel 10.1.018. The operating system is XC 3.2.1 which is > essentially > rhel4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp. Lustre version is > lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp. > > Any help figuring out what is happening is greatly > appreciated. Thanks, Nate > > program gcrm_test_io > implicit none > include "mpif.h" > > integer X_SIZE > > integer w_me, w_nprocs > integer my_info > > integer i > integer (kind=4) :: ierr > integer (kind=4) :: fileID > > integer (kind=MPI_OFFSET_KIND) :: mylen > integer (kind=MPI_OFFSET_KIND) :: offset > integer status(MPI_STATUS_SIZE) > integer count > integer ncells > real (kind=4), allocatable, dimension (:) :: array2 > logical sync > > call mpi_init(ierr) > call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr) > call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr) > > call mpi_info_create(my_info, ierr) > ! optional ways to set things in mpi-io > ! call mpi_info_set (my_info, "romio_ds_read" , > "enable" , ierr) > ! call mpi_info_set (my_info, "romio_ds_write", > "enable" , ierr) > ! 
call mpi_info_set (my_info, "romio_cb_write", > "enable" , ierr) > > x_size = 410011 ! A 'big' number, with bigger numbers > it is more > likely to fail > sync = .true. ! Extra file synchronization > > ncells = (X_SIZE * w_nprocs) > > ! Use node zero to fill it with nines > if (w_me .eq. 0) then > call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", > MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr) > allocate (array2(ncells)) > array2(:) = 9.0 > mylen = ncells > offset = 0 * 4 > call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, > "native",MPI_INFO_NULL,ierr) > call MPI_File_write(fileID, array2, mylen , > MPI_REAL, status,ierr) > > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong initial write count:", > count,mylen > deallocate(array2) > if (sync) call MPI_FILE_SYNC (fileID,ierr) > call MPI_FILE_CLOSE (fileID,ierr) > endif > > ! All nodes now fill their area with ones > call MPI_BARRIER(MPI_COMM_WORLD,ierr) > allocate (array2( X_SIZE)) > array2(:) = 1.0 > offset = (w_me * X_SIZE) * 4 ! multiply by four, since > it is real*4 > mylen = X_SIZE > call MPI_FILE_OPEN > (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY, > my_info, fileID, ierr) > print*,"node",w_me,"starting",(offset/4) + > 1,"ending",(offset/4)+mylen > > call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, > "native",MPI_INFO_NULL,ierr) > call MPI_File_write(fileID, array2, mylen , MPI_REAL, > status,ierr) > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong write count:", > count,mylen,w_me > deallocate(array2) > if (sync) call MPI_FILE_SYNC (fileID,ierr) > call MPI_FILE_CLOSE (fileID,ierr) > > ! Read it back on node zero to see if it is ok data > if (w_me .eq. 
0) then > call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", > MPI_MODE_RDONLY, > my_info, fileID, ierr) > mylen = ncells > allocate (array2(ncells)) > call MPI_File_read(fileID, array2, mylen , > MPI_REAL, status,ierr) > call MPI_Get_count(status,MPI_INTEGER, count, ierr) > if (count .ne. mylen) print*, "Wrong read count:", > count,mylen > do i=1,ncells > if (array2(i) .ne. 1) then > print*, "ERROR", i,array2(i), ((i-1)*4), > ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB > goto 999 > end if > end do > print*, "All done with nothing wrong" > 999 deallocate(array2) > call MPI_FILE_CLOSE (fileID,ierr) > call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr) > endif > > call mpi_finalize(ierr) > > end program gcrm_test_io > > 1.2p1 MVAPICH 2 > node 1 starting 410012 ending 820022 > node 2 starting 820023 ending 1230033 > node 3 starting 1230034 ending 1640044 > node 4 starting 1640045 ending 2050055 > node 5 starting 2050056 ending 2460066 > node 0 starting 1 ending 410011 > All done with nothing wrong > > > node 1 starting 410012 ending 820022 > node 4 starting 1640045 ending 2050055 > node 3 starting 1230034 ending 1640044 > node 5 starting 2050056 ending 2460066 > node 2 starting 820023 ending 1230033 > Wrong write count: 228554 410011 2 > node 0 starting 1 ending 410011 > Wrong read count: 1048576 2460066 > ERROR 1048577 0.E+0 4194304 4. > > > node 1 starting 410012 ending 820022 > node 3 starting 1230034 ending 1640044 > node 4 starting 1640045 ending 2050055 > node 2 starting 820023 ending 1230033 > node 5 starting 2050056 ending 2460066 > node 0 starting 1 ending 410011 > Wrong read count: 1048576 2460066 > ERROR 1048577 0.E+0 4194304 4. 
>
>
> 1.1 MVAPICH
> node 0 starting 1 ending 410011
> node 4 starting 1640045 ending 2050055
> node 3 starting 1230034 ending 1640044
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> All done with nothing wrong
>
>
> node 0 starting 1 ending 410011
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> node 1 starting 410012 ending 820022
> Wrong write count: 228554 410011 2
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> Wrong read count: 1048576 2460066
> ERROR 1048577 0.0000000E+00 4194304 4.00000000000000
>
>
> node 0 starting 1 ending 410011
> node 3 starting 1230034 ending 1640044
> node 4 starting 1640045 ending 2050055
> node 1 starting 410012 ending 820022
> node 5 starting 2050056 ending 2460066
> node 2 starting 820023 ending 1230033
> Wrong read count: 1229824 2460066
> ERROR 1229825 0.0000000E+00 4919296 4.69140625000000

From Terrence.LIAO at total.com Wed Mar 4 10:54:20 2009
From: Terrence.LIAO at total.com (Terrence.LIAO@total.com)
Date: Wed Mar 4 10:54:38 2009
Subject: [mvapich-discuss] Re: MPI-IO Inconsistency over Lustre using MVAPICH (Nathan Baca)
In-Reply-To: <200903041414.n24EEdFQ019274@cse.ohio-state.edu>
Message-ID:

This is not a solution, but a report of a similar observation to "MPI-IO Inconsistency over Lustre using MVAPICH": our MPI-IO code has also been producing wrong results from time to time. This happens not only with MVAPICH but also with other MPI implementations, such as SGI's MPT. We have rewritten our code so that it does not use MPI-IO.

Thank you very much.

-- Terrence
--------------------------------------------------------
Terrence Liao, Ph.D.
Research Computer Scientist
TOTAL E&P RESEARCH & TECHNOLOGY USA, LLC
1201 Louisiana, Suite 1800, Houston, TX 77002
Tel: 713.647.3498 Fax: 713.647.3638
Email: terrence.liao@total.com
Houston HPC site: http://us-hou-spt01/sites/rt/hpc/default.aspx
Pau HPC site: http://collaboratif.ep.corp.local/sites/hpc/hpc/RD.aspx

mvapich-discuss-request@cse.ohio-state.edu
Sent by: mvapich-discuss-bounces@cse.ohio-state.edu
03/04/2009 08:14 AM
Please respond to mvapich-discuss@cse.ohio-state.edu
To mvapich-discuss@cse.ohio-state.edu
cc
Subject mvapich-discuss Digest, Vol 39, Issue 2

Send mvapich-discuss mailing list submissions to
mvapich-discuss@cse.ohio-state.edu

To subscribe or unsubscribe via the World Wide Web, visit
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
or, via email, send a message with subject or body 'help' to
mvapich-discuss-request@cse.ohio-state.edu

You can reach the person managing the list at
mvapich-discuss-owner@cse.ohio-state.edu

When replying, please edit your Subject line so it is more specific than "Re: Contents of mvapich-discuss digest..."

Today's Topics:

1. Re: Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error (Jonathan Perkins)
2. MPI-IO Inconsistency over Lustre using MVAPICH (Nathan Baca)
3. (no subject) (nilesh awate)

----------------------------------------------------------------------

Message: 1
Date: Mon, 2 Mar 2009 13:17:41 -0500
From: Jonathan Perkins
Subject: Re: [mvapich-discuss] Mvapich2-1.2 for OpenFabrics IB/iWARP : Jobterminates with error
To: Vivek Gavane
Cc: mvapich-discuss@cse.ohio-state.edu
Message-ID: <20090302181740.GK2993@cse.ohio-state.edu>
Content-Type: text/plain; charset=us-ascii

Vivek:

We do not have an environment setup that can easily support the installation of this MEME Suite. Is there a simpler MPI program that this error can be reproduced with. This will greatly assist us in debugging this issue.

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

------------------------------

Message: 2
Date: Tue, 3 Mar 2009 20:45:16 -0700
From: Nathan Baca
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
To: mvapich-discuss@cse.ohio-state.edu
Message-ID:
Content-Type: text/plain; charset="utf-8"

Hello,

I am seeing inconsistent mpi-io behavior when writing to a Lustre file system using mvapich2 1.2p1 and mvapich 1.1 both with romio. What follows is a simple reproducer and output. Essentially one or more of the running processes does not read or write the correct amount of data to its part of a file residing on a Lustre (parallel) file system.

I have tried both isolating the output to a single OST and striping across multiple OSTs. Both will reproduce the same result. I have tried compiling with multiple versions of both pathscale and intel compilers all with the same result.

The odd thing is that this seems to work using hpmpi 2.03 with pathscale 3.2 and intel 10.1.018. The operating system is XC 3.2.1 which is essentially rhel4.5. The kernel is 2.6.9-67.9hp.7sp.XCsmp. Lustre version is lustre-1.4.11-2.3_0.6_xc3.2.1_k2.6.9_67.9hp.7sp.XCsmp.

Any help figuring out what is happening is greatly appreciated.
Thanks,
Nate

program gcrm_test_io
  implicit none
  include "mpif.h"

  integer X_SIZE

  integer w_me, w_nprocs
  integer my_info

  integer i
  integer (kind=4) :: ierr
  integer (kind=4) :: fileID

  integer (kind=MPI_OFFSET_KIND) :: mylen
  integer (kind=MPI_OFFSET_KIND) :: offset
  integer status(MPI_STATUS_SIZE)
  integer count
  integer ncells
  real (kind=4), allocatable, dimension (:) :: array2
  logical sync

  call mpi_init(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,w_nprocs,ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD,w_me,ierr)

  call mpi_info_create(my_info, ierr)
  ! optional ways to set things in mpi-io
  ! call mpi_info_set (my_info, "romio_ds_read" , "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_ds_write", "enable" , ierr)
  ! call mpi_info_set (my_info, "romio_cb_write", "enable" , ierr)

  x_size = 410011 ! A 'big' number, with bigger numbers it is more likely to fail
  sync = .true. ! Extra file synchronization

  ncells = (X_SIZE * w_nprocs)

  ! Use node zero to fill it with nines
  if (w_me .eq. 0) then
    call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_CREATE+MPI_MODE_WRONLY, my_info, fileID, ierr)
    allocate (array2(ncells))
    array2(:) = 9.0
    mylen = ncells
    offset = 0 * 4
    call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, "native",MPI_INFO_NULL,ierr)
    call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
    call MPI_Get_count(status,MPI_INTEGER, count, ierr)
    if (count .ne. mylen) print*, "Wrong initial write count:", count,mylen
    deallocate(array2)
    if (sync) call MPI_FILE_SYNC (fileID,ierr)
    call MPI_FILE_CLOSE (fileID,ierr)
  endif

  ! All nodes now fill their area with ones
  call MPI_BARRIER(MPI_COMM_WORLD,ierr)
  allocate (array2( X_SIZE))
  array2(:) = 1.0
  offset = (w_me * X_SIZE) * 4 ! multiply by four, since it is real*4
  mylen = X_SIZE
  call MPI_FILE_OPEN (MPI_COMM_WORLD,"output.dat",MPI_MODE_WRONLY, my_info, fileID, ierr)
  print*,"node",w_me,"starting",(offset/4) + 1,"ending",(offset/4)+mylen
  call MPI_FILE_SET_VIEW(fileID,offset, MPI_REAL,MPI_REAL, "native",MPI_INFO_NULL,ierr)
  call MPI_File_write(fileID, array2, mylen , MPI_REAL, status,ierr)
  call MPI_Get_count(status,MPI_INTEGER, count, ierr)
  if (count .ne. mylen) print*, "Wrong write count:", count,mylen,w_me
  deallocate(array2)
  if (sync) call MPI_FILE_SYNC (fileID,ierr)
  call MPI_FILE_CLOSE (fileID,ierr)

  ! Read it back on node zero to see if it is ok data
  if (w_me .eq. 0) then
    call MPI_FILE_OPEN (MPI_COMM_SELF, "output.dat", MPI_MODE_RDONLY, my_info, fileID, ierr)
    mylen = ncells
    allocate (array2(ncells))
    call MPI_File_read(fileID, array2, mylen , MPI_REAL, status,ierr)
    call MPI_Get_count(status,MPI_INTEGER, count, ierr)
    if (count .ne. mylen) print*, "Wrong read count:", count,mylen
    do i=1,ncells
      if (array2(i) .ne. 1) then
        print*, "ERROR", i,array2(i), ((i-1)*4), ((i-1)*4)/(1024d0*1024d0) ! Index, value, # of good bytes,MB
        goto 999
      end if
    end do
    print*, "All done with nothing wrong"
999 deallocate(array2)
    call MPI_FILE_CLOSE (fileID,ierr)
    call MPI_file_delete ("output.dat",MPI_INFO_NULL,ierr)
  endif

  call mpi_finalize(ierr)
end program gcrm_test_io

1.2p1 MVAPICH 2

node 1 starting 410012 ending 820022
node 2 starting 820023 ending 1230033
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 5 starting 2050056 ending 2460066
node 0 starting 1 ending 410011
All done with nothing wrong

node 1 starting 410012 ending 820022
node 4 starting 1640045 ending 2050055
node 3 starting 1230034 ending 1640044
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
Wrong write count: 228554 410011 2
node 0 starting 1 ending 410011
Wrong read count: 1048576 2460066
ERROR 1048577 0.E+0 4194304 4.

node 1 starting 410012 ending 820022
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 2 starting 820023 ending 1230033
node 5 starting 2050056 ending 2460066
node 0 starting 1 ending 410011
Wrong read count: 1048576 2460066
ERROR 1048577 0.E+0 4194304 4.

1.1 MVAPICH

node 0 starting 1 ending 410011
node 4 starting 1640045 ending 2050055
node 3 starting 1230034 ending 1640044
node 2 starting 820023 ending 1230033
node 1 starting 410012 ending 820022
node 5 starting 2050056 ending 2460066
All done with nothing wrong

node 0 starting 1 ending 410011
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
node 1 starting 410012 ending 820022
Wrong write count: 228554 410011 2
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
Wrong read count: 1048576 2460066
ERROR 1048577 0.0000000E+00 4194304 4.00000000000000

node 0 starting 1 ending 410011
node 3 starting 1230034 ending 1640044
node 4 starting 1640045 ending 2050055
node 1 starting 410012 ending 820022
node 5 starting 2050056 ending 2460066
node 2 starting 820023 ending 1230033
Wrong read count: 1229824 2460066
ERROR 1229825 0.0000000E+00 4919296 4.69140625000000

--
Nathan Baca
nathan.baca@gmail.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090303/ee3f6aa0/attachment-0001.html

------------------------------

Message: 3
Date: Wed, 4 Mar 2009 19:22:17 +0530 (IST)
From: nilesh awate
Subject: [mvapich-discuss] (no subject)
To: MVAPICH2
Cc: Nilesh Awate
Message-ID: <22080.56777.qm@web94104.mail.in2.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Hi all,
I am using mvapich2-1.2p1 with udapl adi over proprietary interconnect previously we were using mvapich2-1.0.3.
I'm using it on an Intel Xeon X5472 @ 3.00GHz cluster (8 nodes, 8 cores each), but I'm not able to launch more than 6 processes across the cluster: the job just gets stuck after connection establishment (checked with debug messages in our library). I have observed the same thing with a Mellanox card (OpenIB-cma).

Launching processes via mpdboot works fine (but it is not recommended, as you say).

The following are the environment settings I use before a run:

$ export PATH=/home/user/pn_mpi/mpi-bin1.2/bin/:$PATH
$ which env
/usr/bin/env
$ which mpispawn
/home/user/pn_mpi/mpi-bin1.2/bin/mpispawn

(PATH is set in ~/.bashrc)

I have read the FAQ on this but didn't find much information.

Waiting for a reply,
Nilesh

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090304/be0589d1/attachment.html

------------------------------

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss@cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

End of mvapich-discuss Digest, Vol 39, Issue 2
**********************************************
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090304/b76115dc/attachment-0001.html
From robl at mcs.anl.gov Wed Mar 4 12:32:09 2009
From: robl at mcs.anl.gov (Robert Latham)
Date: Wed Mar 4 12:35:32 2009
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
In-Reply-To:
References: <200903041415.n24EEdFR019274@cse.ohio-state.edu>
Message-ID: <20090304173209.GC1408@mcs.anl.gov>

On Wed, Mar 04, 2009 at 09:50:16AM -0600, Rajeev Thakur wrote:
> Nathan,
> Can you check if it works if you add the prefix "ufs:" to the file
> name in all opens?

Two things come to my eye:
- Any chance your site will upgrade to lustre 1.6? 1.4 should perform
  *correctly* but very very slowly for you.
- You have no error checking for any of your MPI_FILE routines.
  Perhaps you omitted them for the sake of this example, but there's a
  chance that the MPI_FILE_* routine is trying to tell you what went
  wrong.

Compare the 'ierr' result with MPI_SUCCESS. If not equal, convert
the error code to something human readable with MPI_ERROR_STRING:
http://www.mpi-forum.org/docs/mpi21-report-bw/node186.htm#Node186

Thanks
==rob

--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Lab, IL USA                 B29D F333 664A 4280 315B
From weikuan.yu at gmail.com Wed Mar 4 14:21:40 2009
From: weikuan.yu at gmail.com (Weikuan Yu)
Date: Wed Mar 4 14:21:50 2009
Subject: [mvapich-discuss] MPI-IO Inconsistency over Lustre using MVAPICH
In-Reply-To: <20090304173209.GC1408@mcs.anl.gov>
References: <200903041415.n24EEdFR019274@cse.ohio-state.edu> <20090304173209.GC1408@mcs.anl.gov>
Message-ID: <49AED4C4.9070100@gmail.com>

I can't reproduce the error with 2-48 processes with mvapich2-1.2p1.
Adding a prefix "lustre:" or "ufs:" produced the same correct results.
The system I used has the following Lustre configuration.

lustre: 1.6.4
kernel: 47
build: 1.6.4-19691231180000-PRISTINE-.usr.src.linux-2.6.9-55.0.9.EL_lustre.1.6.4-2.6.9-55.0.9.EL_lustre.1.6.4custom

On the same system, I also tried mvapich-1.1. There was no error with
the Lustre ADIO driver.

Nathan, let me know where you need more help. You might want to check or
update your Lustre configuration.

Thanks,
--Weikuan

Robert Latham wrote:
> On Wed, Mar 04, 2009 at 09:50:16AM -0600, Rajeev Thakur wrote:
>> Nathan,
>> Can you check if it works if you add the prefix "ufs:" to the file
>> name in all opens?
>
> Two things come to my eye:
> - any chance your site will upgrade to lustre 1.6?
1.4 should perform > *correctly* but very very slowly for you > - You have no error checking for any of your MPI_FILE routines. > perhaps you omitted them for the sake of this example, but there's a > chance that the MPI_FILE_* routine is trying to tell you what went > wrong. > > Compare the 'ierr' result with MPI_SUCCESS. if not equal, convert > the error code to something human readable with MPI_ERROR_STRING > http://www.mpi-forum.org/docs/mpi21-report-bw/node186.htm#Node186 > > Thanks > ==rob > From Jie.Cai at cs.anu.edu.au Wed Mar 4 18:08:00 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Wed Mar 4 17:55:31 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration Message-ID: <49AF09D0.8070304@cs.anu.edu.au> We have single ConnectX dual port HCA cluster installed, and try to build a dual port multirail IB cluster. I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth However, I didn't achieve bandwidth improvement. The peak bandwidth I got for the test is 1458.93 MB/s, which is far from the expectation (2.5GB/s). Does anyone knows what's going on? -- Jie Cai From koop at cse.ohio-state.edu Wed Mar 4 18:57:05 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Mar 4 18:57:11 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration In-Reply-To: <49AF09D0.8070304@cs.anu.edu.au> Message-ID: Hi Jie, You are running into the limitation of the PCIe 1.1 bus here. Even a single port with a higher bus speed (ConnectX on PCIe 2.0) can get higher bandwidth than a single port on PCIe 1.1. I hope this helps, Matt On Thu, 5 Mar 2009, Jie Cai wrote: > We have single ConnectX dual port HCA cluster installed, and try to > build a dual port multirail IB cluster. > > I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. 
> > mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 > MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth > > However, I didn't achieve bandwidth improvement. The peak bandwidth I > got for the test is 1458.93 MB/s, which is far from the expectation > (2.5GB/s). > > Does anyone knows what's going on? > > -- > Jie Cai > > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From Jie.Cai at cs.anu.edu.au Wed Mar 4 19:13:42 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Wed Mar 4 19:01:17 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration In-Reply-To: References: Message-ID: <49AF1936.2070702@cs.anu.edu.au> Matthew Koop wrote: > Hi Jie, > > You are running into the limitation of the PCIe 1.1 bus here. Even a > single port with a higher bus speed (ConnectX on PCIe 2.0) can get higher > bandwidth than a single port on PCIe 1.1. > > I hope this helps, > > Matt > Thanks for the reply. The workstation I am using is Sun Ultra24, which has 2x 16 PCI-E 2.0 slots in it. I connect HCA in one of those slots. The theoretical system bus would be ~10GB/s (on the data sheet, didn't measure them myself yet). So, the system bus may not be the bottleneck. Is there some other factors would affect this? Jie > On Thu, 5 Mar 2009, Jie Cai wrote: > > >> We have single ConnectX dual port HCA cluster installed, and try to >> build a dual port multirail IB cluster. >> >> I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. >> >> mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 >> MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth >> >> However, I didn't achieve bandwidth improvement. The peak bandwidth I >> got for the test is 1458.93 MB/s, which is far from the expectation >> (2.5GB/s). >> >> Does anyone knows what's going on? 
>> >> -- >> Jie Cai >> >> >> >> >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> > > From panda at cse.ohio-state.edu Wed Mar 4 19:11:47 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Wed Mar 4 19:11:54 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration In-Reply-To: <49AF09D0.8070304@cs.anu.edu.au> Message-ID: I think you posted similar questions on other mailing lists and you got some answers. You need to examine multiple things to see what is happening on your system. - What is the speed of your ConnectX card - SDR or DDR? - What is your platform (Intel or AMD)? What is the memory bandwidth available on this platform? Can it support two parallel streams of SDR or DDR IB communication? - Which version of MVAPICH2 you are using? Which interface of MVAPICH2 you are using - OpenFabrics-IB or uDAPL. OpenFabrics-IB interface supports multi-rail option and you should be able to use multiple ports or adapaters. The uDAPL interface only supports single port/adapter. - How much performance you get if you use one port? Do the numbers differ when you use one port vs. another port. - You seem to be using OSU Put bandwidth test. This reports bandwidth achieved through MPI one-sided Put operations. Did you try the regular OSU bandwidth test (which shows the performance of two-sided operations)? Do you see any performance difference? If you systematically analyze the problem, you should be able to find out what is going on. DK On Thu, 5 Mar 2009, Jie Cai wrote: > We have single ConnectX dual port HCA cluster installed, and try to > build a dual port multirail IB cluster. > > I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. 
> > mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 > MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth > > However, I didn't achieve bandwidth improvement. The peak bandwidth I > got for the test is 1458.93 MB/s, which is far from the expectation > (2.5GB/s). > > Does anyone knows what's going on? > > -- > Jie Cai > > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Wed Mar 4 20:42:33 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Mar 4 20:42:43 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration In-Reply-To: <49AF1936.2070702@cs.anu.edu.au> Message-ID: Jie, Just to followup on Dr. Panda's question -- what is the speed of card (SDR or DDR?). Also, are you using uDAPL or the Gen2 interface? If you are using the uDAPL interface there is no multirail support, so you would not see higher performance from two ports. The bandwidth you report is generally the bandwidth one sees from a Gen1 interface. As I mentioned earlier, if you have a Gen2 card and bus you should see even higher (1700+ MB/sec) from a single port. You may want to verify that your card has the Gen2 firmware. Matt On Thu, 5 Mar 2009, Jie Cai wrote: > > Matthew Koop wrote: > > Hi Jie, > > > > You are running into the limitation of the PCIe 1.1 bus here. Even a > > single port with a higher bus speed (ConnectX on PCIe 2.0) can get higher > > bandwidth than a single port on PCIe 1.1. > > > > I hope this helps, > > > > Matt > > > > Thanks for the reply. The workstation I am using is Sun Ultra24, > which has 2x 16 PCI-E 2.0 slots in it. I connect HCA in one of those slots. > > The theoretical system bus would be ~10GB/s (on the data sheet, didn't > measure them myself yet). > > So, the system bus may not be the bottleneck. > Is there some other factors would affect this? 
> > Jie > > On Thu, 5 Mar 2009, Jie Cai wrote: > > > > > >> We have single ConnectX dual port HCA cluster installed, and try to > >> build a dual port multirail IB cluster. > >> > >> I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. > >> > >> mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 > >> MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth > >> > >> However, I didn't achieve bandwidth improvement. The peak bandwidth I > >> got for the test is 1458.93 MB/s, which is far from the expectation > >> (2.5GB/s). > >> > >> Does anyone knows what's going on? > >> > >> -- > >> Jie Cai > >> > >> > >> > >> > >> > >> _______________________________________________ > >> mvapich-discuss mailing list > >> mvapich-discuss@cse.ohio-state.edu > >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> > >> > > > > > From Jie.Cai at cs.anu.edu.au Wed Mar 4 21:16:57 2009 From: Jie.Cai at cs.anu.edu.au (Jie Cai) Date: Wed Mar 4 21:04:35 2009 Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration In-Reply-To: References: Message-ID: <49AF3619.5030302@cs.anu.edu.au> Thanks for the email on the suggestions. Dhabaleswar Panda wrote: > I think you posted similar questions on other mailing lists and you got > some answers. You need to examine multiple things to see what is happening > on your system. > > - What is the speed of your ConnectX card - SDR or DDR? > The HCA I am using is 4x DDR (ConnectX MHGH28-XTC). > - What is your platform (Intel or AMD)? What is the memory bandwidth > available on this platform? Can it support two parallel streams of > SDR or DDR IB communication? > I am using Intel Core2 Quad Q6600 CPU and the Intel x38 express chipset. 
The memory bandwidth is ~ 5.3 GB/s per channel, and measured bandwidth using STREAM benchmark is: ------------------------------------------------------------- Function Rate (MB/s) Avg time Min time Max time Copy: 5263.2339 0.0061 0.0061 0.0063 Scale: 5211.7318 0.0061 0.0061 0.0062 Add: 5710.2587 0.0084 0.0084 0.0084 Triad: 5734.0973 0.0084 0.0084 0.0084 ------------------------------------------------------------- Those suggest that the memory bandwidth would be sufficient to drive 2 DDR ports. > - Which version of MVAPICH2 you are using? Which interface of MVAPICH2 you > are using - OpenFabrics-IB or uDAPL. OpenFabrics-IB interface supports > multi-rail option and you should be able to use multiple ports or > adapaters. The uDAPL interface only supports single port/adapter. > I have installed MVAPICH2 1.2-p1 with default options on SUSE linux OS. So OpenFabrics-IB is using. > - How much performance you get if you use one port? The performance using 1 rail is ~1.45 GB/s, which is around the same as multi-ports. > Do the numbers differ > when you use one port vs. another port. > No difference between difference ports. > - You seem to be using OSU Put bandwidth test. This reports bandwidth > achieved through MPI one-sided Put operations. Did you try the > regular OSU bandwidth test (which shows the performance of > two-sided operations)? Do you see any performance difference? > Testing with osu_bandwidth on multirail, observed roughly same bandwidth (1459.15) > If you systematically analyze the problem, you should be able to find out > what is going on. > > DK > Matthew: I am not sure the firmware version on HCAs, but will definitely check it. In general, the hardware platform doesn't seem to be the limitation. I will get back to you once I have got more information. Thanks a lot. Best Regards, Jie > On Thu, 5 Mar 2009, Jie Cai wrote: > > >> We have single ConnectX dual port HCA cluster installed, and try to >> build a dual port multirail IB cluster. 
>> >> I have tested to run OSU put bandwidth test on the cluster with MVAPICH2. >> >> mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 >> MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth >> >> However, I didn't achieve bandwidth improvement. The peak bandwidth I >> got for the test is 1458.93 MB/s, which is far from the expectation >> (2.5GB/s). >> >> Does anyone knows what's going on? >> >> -- >> Jie Cai >> >> >> >> >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> > > From ehsan.roohi at strath.ac.uk Thu Mar 5 09:09:30 2009 From: ehsan.roohi at strath.ac.uk (Ehsan Roohi) Date: Thu Mar 5 09:25:59 2009 Subject: [mvapich-discuss] Question on PBS In-Reply-To: <200903042134.n24LY15s010128@zanamavir.ncsa.uiuc.edu> References: <2C3CF455489D6B4F95AFAB84847FA4131CF4BFB403@E2K7-MS2.ds.strath.ac.uk> from Ehsan Roohi on Wed, 4 Mar 2009 20:06:29 +0000,<200903042134.n24LY15s010128@zanamavir.ncsa.uiuc.edu> Message-ID: <2C3CF455489D6B4F95AFAB84847FA4131CF4BFB408@E2K7-MS2.ds.strath.ac.uk> Dear All, I got the following error while trying to submit my job to HPC: The .o file contains: running mpdallexit on node17 LAUNCHED mpd on node17 via RUNNING: mpd on node17 LAUNCHED mpd on node16 via node17 LAUNCHED mpd on node15 via node17 mpdboot_node17 (handle_mpd_output 373): from mpd on node16, invalid port info: /bin/sh: line 1: ssh: command not found mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possible ca uses: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option) mpiexec_node17: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possi ble causes: 1. no mpd is running on this host 2. an mpd is running but was started without a "console" (-n option) mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possible causes: 1. 
no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)

The PBS script that I use is:

#!/bin/bash

#PBS -l walltime=00:15:00
#PBS -l nodes=4:ppn=2
#PBS -V
#PBS -N testjob
# set echo # echo commands before execution; use for debugging
cd $SCR

# get executable and input files from mass storage
#msscmd cd dir1, get a.out, mget *.input
# mss doesn't keep executable bit set, so need to set it on program
#chmod +x a.out

#mvapich2-start-mpd
export NP=`wc -l ${PBS_NODEFILE} | cut -d'/' -f1`
export MPDSNP=`uniq ${PBS_NODEFILE} |wc -l| cut -d'/' -f1`
cat ${PBS_NODEFILE} | uniq > /tmp/mpd_nodefile_${USER}_$$
export MPD_NODEFILE=/tmp/mpd_nodefile_${USER}_$$
mpdboot -v -n ${MPDSNP} -f ${MPD_NODEFILE}
mpdtrace -l
rm ${MPD_NODEFILE}
rm -f /tmp/mypbsnodes${USER}_$$
export NP= `wc -l ${PBS_NODEFILE} | cut -d'/' -f1`
export MV2_SRQ_SIZE=4000
mpirun -machinefile ${PBS_NODEFILE} a.out
mpdallexit
---------------------------------------------------------------------------------------------------------------------

Would you please help me in this problem?

Thanks,
Ehsan
From perkinjo at cse.ohio-state.edu Thu Mar 5 09:58:08 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Thu Mar 5 09:58:18 2009
Subject: [mvapich-discuss] Question on PBS
In-Reply-To: <2C3CF455489D6B4F95AFAB84847FA4131CF4BFB408@E2K7-MS2.ds.strath.ac.uk>
References: <2C3CF455489D6B4F95AFAB84847FA4131CF4BFB403@E2K7-MS2.ds.strath.ac.uk> <2C3CF455489D6B4F95AFAB84847FA4131CF4BFB408@E2K7-MS2.ds.strath.ac.uk>
Message-ID: <20090305145807.GA4323@cse.ohio-state.edu>

Ehsan:
Are you able to manually ssh into each node outside of PBS? It looks
like you may have a problem with node17. See if you can ssh into and
out of this node successfully.
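
The manual check suggested above could be scripted roughly as follows. This is only a minimal sketch: it assumes the node file lists one hostname per line and that passwordless ssh is expected between nodes. The function name check_nodes and the SSH_CMD override are hypothetical, chosen for illustration, and are not part of MVAPICH2 or PBS. It only tests logging in from the current node and whether the remote shell can find ssh in its PATH (the "ssh: command not found" symptom); it does not prove that every node can reach every other node.

```shell
#!/bin/bash
# Sketch: check ssh reachability of every distinct node in a node file.
# check_nodes and SSH_CMD are illustrative names, not MVAPICH2/PBS features.

# Command used to reach a node; override SSH_CMD to try this without a cluster.
SSH_CMD=${SSH_CMD:-"ssh -o BatchMode=yes -o ConnectTimeout=5"}

check_nodes() {
    local nodefile=$1 node failed=0
    # sort -u collapses the repeated entries PBS writes when ppn > 1
    for node in $(sort -u "$nodefile"); do
        # Log in and ask the remote shell whether it can find ssh itself.
        if $SSH_CMD "$node" command -v ssh </dev/null >/dev/null 2>&1; then
            echo "OK   $node"
        else
            echo "FAIL $node"
            failed=1
        fi
    done
    return $failed
}

# Inside a PBS job one might call: check_nodes "$PBS_NODEFILE"
```

Any node reported as FAIL is a candidate for the kind of problem described above, and is worth logging into by hand to see whether the login itself fails or ssh is simply missing from the non-interactive PATH.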
On Thu, Mar 05, 2009 at 02:09:30PM +0000, Ehsan Roohi wrote: > Dear All, > > I got the following error while trying to submit my job to HPC: > > The .o file contains: > > running mpdallexit on node17 > LAUNCHED mpd on node17 via > RUNNING: mpd on node17 > LAUNCHED mpd on node16 via node17 > LAUNCHED mpd on node15 via node17 > > mpdboot_node17 (handle_mpd_output 373): from mpd on node16, invalid port info: > /bin/sh: line 1: ssh: command not found > > mpdtrace: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possible ca > uses: > 1. no mpd is running on this host > 2. an mpd is running but was started without a "console" (-n option) > mpiexec_node17: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possi > ble causes: > 1. no mpd is running on this host > 2. an mpd is running but was started without a "console" (-n option) > mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_seb09103); possible > causes: > 1. no mpd is running on this host > 2. an mpd is running but was started without a "console" (-n option) > > The PBS script that I use is: > > #!/bin/bash > > #PBS -l walltime=00:15:00 > #PBS -l nodes=4:ppn=2 > #PBS -V > #PBS -N testjob > # set echo # echo commands before execution; use for debugging > cd $SCR > > # get executable and input files from mass storage > #msscmd cd dir1, get a.out, mget *.input > # mss doesn't keep executable bit set, so need to set it on program > #chmod +x a.out > > #mvapich2-start-mpd > export NP=`wc -l ${PBS_NODEFILE} | cut -d'/' -f1` > export MPDSNP=`uniq ${PBS_NODEFILE} |wc -l| cut -d'/' -f1` > cat ${PBS_NODEFILE} | uniq > /tmp/mpd_nodefile_${USER}_$$ > export MPD_NODEFILE=/tmp/mpd_nodefile_${USER}_$$ > mpdboot -v -n ${MPDSNP} -f ${MPD_NODEFILE} > mpdtrace -l > rm ${MPD_NODEFILE} > rm -f /tmp/mypbsnodes${USER}_$$ > export NP= `wc -l ${PBS_NODEFILE} | cut -d'/' -f1` > export MV2_SRQ_SIZE=4000 > mpirun -machinefile ${PBS_NODEFILE} a.out > mpdallexit > 
---------------------------------------------------------------------------------------------------------------------
>
>
> Would you please help me in this problem?
>
> Thanks,
> Ehsan
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090305/1b383c3d/attachment.bin
From mi.zhou at STJUDE.ORG Thu Mar 5 17:56:54 2009
From: mi.zhou at STJUDE.ORG (Mi Zhou)
Date: Thu Mar 5 22:53:35 2009
Subject: [mvapich-discuss] mvapich build failure
Message-ID: <1236293814.31767.39.camel@hc-mzhou-l>

Hi,

I am trying to build mvapich-1.1 using make.mvapich.gen2 with mandatory
variables set. But it failed with some error:

viainit.c: In function ‘create_srq’:
viainit.c:427: warning: assignment makes pointer from integer without a cast
viainit.c:428: error: ‘struct ibv_srq’ has no member named ‘xrc_srq_num’
viainit.c:428: error: ‘struct ibv_srq’ has no member named ‘xrc_srq_num’
viainit.c: In function ‘xrc_init’:
viainit.c:1144: error: ‘IBV_DEVICE_XRC’ undeclared (first use in this function)
viainit.c:1144: error: (Each undeclared identifier is reported only once
viainit.c:1144: error: for each function it appears in.)
viainit.c:1161: warning: assignment makes pointer from integer without a cast
make[3]: *** [viainit.o] Error 1

Attached is the configuration log and make log.

Any idea what could be missing.

Thanks,

Mi Zhou
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: log.tar.gz
Type: application/x-compressed-tar
Size: 6341 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090305/112d9739/log.tar-0001.bin

From sridharj at cse.ohio-state.edu Thu Mar 5 23:07:09 2009
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Thu Mar 5 23:07:16 2009
Subject: [mvapich-discuss] mvapich build failure
In-Reply-To: <1236293814.31767.39.camel@hc-mzhou-l>
References: <1236293814.31767.39.camel@hc-mzhou-l>
Message-ID: <49B0A16D.6010501@cse.ohio-state.edu>

You seem to have an older version of ofed without XRC support. You can either -

* upgrade to ofed 1.3 or later or
* remove -DXRC from CFLAGS in make.mvapich.gen2

-Jaidev

On 03/05/2009 05:56 PM, Mi Zhou wrote:
> Hi,
>
> I am trying to build mvapich-1.1 using make.mvapich.gen2 with mandatory
> variables set. But it failed with some error:
>
>
> viainit.c: In function 'create_srq':
> viainit.c:427: warning: assignment makes pointer from integer without a cast
> viainit.c:428: error: 'struct ibv_srq' has no member named 'xrc_srq_num'
> viainit.c:428: error: 'struct ibv_srq' has no member named 'xrc_srq_num'
> viainit.c: In function 'xrc_init':
> viainit.c:1144: error: 'IBV_DEVICE_XRC' undeclared (first use in this function)
> viainit.c:1144: error: (Each undeclared identifier is reported only once
> viainit.c:1144: error: for each function it appears in.)
> viainit.c:1161: warning: assignment makes pointer from integer without a cast
> make[3]: *** [viainit.o] Error 1
>
>
> Attached is the configuration log and make log.
>
> Any idea what could be missing.
>
> Thanks,
>
> Mi Zhou
> St.
Jude Children's Research Hospital
>
> Email Disclaimer: www.stjude.org/emaildisclaimer

From mi.zhou at STJUDE.ORG Fri Mar 6 12:41:25 2009
From: mi.zhou at STJUDE.ORG (Mi Zhou)
Date: Fri Mar 6 13:06:14 2009
Subject: [mvapich-discuss] mvapich build failure. .
In-Reply-To: <49B0A16D.6010501@cse.ohio-state.edu>
References: <1236293814.31767.39.camel@hc-mzhou-l> <49B0A16D.6010501@cse.ohio-state.edu>
Message-ID: <1236361285.31767.41.camel@hc-mzhou-l>

I took out the "-DXRC" and it compiled.

Thank you very much!

Mi

On Thu, 2009-03-05 at 22:07 -0600, Jaidev Sridhar wrote:
> You seem to have an older version of ofed without XRC support. You can
> either -
> * upgrade to ofed 1.3 or later or
> * remove -DXRC from CFLAGS in make.mvapich.gen2
>
> -Jaidev
>
> On 03/05/2009 05:56 PM, Mi Zhou wrote:
> > Hi,
> >
> > I am trying to build mvapich-1.1 using make.mvapich.gen2 with mandatory
> > variables set. But it failed with some error:
> >
> >
> > viainit.c: In function 'create_srq':
> > viainit.c:427: warning: assignment makes pointer from integer without a cast
> > viainit.c:428: error: 'struct ibv_srq' has no member named 'xrc_srq_num'
> > viainit.c:428: error: 'struct ibv_srq' has no member named 'xrc_srq_num'
> > viainit.c: In function 'xrc_init':
> > viainit.c:1144: error: 'IBV_DEVICE_XRC' undeclared (first use in this function)
> > viainit.c:1144: error: (Each undeclared identifier is reported only once
> > viainit.c:1144: error: for each function it appears in.)
> > viainit.c:1161: warning: assignment makes pointer from integer without a cast
> > make[3]: *** [viainit.o] Error 1
> >
> >
> > Attached is the configuration log and make log.
> >
> > Any idea what could be missing.
> >
> > Thanks,
> >
> > Mi Zhou
> > St. Jude Children's Research Hospital
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer

From sssmolniy at mail.ru Fri Mar 6 14:14:32 2009
From: sssmolniy at mail.ru (Sergey)
Date: Fri Mar 6 14:48:34 2009
Subject: [mvapich-discuss] [mpich-discuss] mvapich 1.1 and multipath
Message-ID:

Hi, maybe someone knows this. How does an MPI process address a message to its destination (e.g. rank 0 sending to rank 1)? Through the LID, GID, QP ID or something else? Can I strictly define the outgoing LIDs for an MPI process?

For example, with LMC = 3, a sender of rank 0, lid 1
destination of rank 1 multilid 4
destination of rank 2 multilid 8

From kallies at zib.de Fri Mar 6 16:34:41 2009
From: kallies at zib.de (Bernd Kallies)
Date: Fri Mar 6 16:34:59 2009
Subject: [mvapich-discuss] mvapich2 warning: Rndv Receiver is receiving less than as expected
Message-ID: <1236375281.3547.224.camel@kallies.zib.de>

I'd like to report a problem with mvapich2, which I did not manage to nail down to a particular issue so far. So I'd like to report it at this stage to the mvapich2 dev team.

We use the program package cp2k (http://cp2k.berlios.de/). We compiled it with Intel compilers v10 or v11, and linked with mvapich2-1.2.0, which was generated with the same compiler. In addition, the package uses blacs, scalapack, and blas/lapack, fft.

We have an SGI ICE cluster equipped with Intel Xeon Harpertown (E5472, 8 cores per node) and 2-port 4xDDR Mellanox ConnectX adapters (Mellanox MT26418, ConnectX IB DDR, PCIe 2.0 5GT/s). We use SLES10 SP2 for x86_64, and OFED 1.3 built by SGI.
When calculating a particular time series with cp2k, the run aborts after about 30 timesteps (takes a couple of hours by using 128 MPI tasks) with the mvapich2 message

Warning! Rndv Receiver is receiving (49152 < 110592) less than as expected

with varying recv/expected message sizes.

The simulation step where this message appears is not deterministic. The only thing I know is that it appears for sure between the 27th and the 32nd time step after a restart. Sometimes a run aborts with SIGSEGV. Sometimes the run only hangs until running into some time limit. The problem cannot be reproduced by starting from restart files written by cp2k in between the runs. One always has to wait about 30 time steps. So it does not seem to depend on the input data, but on wallclock time.

The problem does not occur when using another MPI lib. I successfully tried SGI MPT v1.22, Intel MPI v3.1.026, MPICH2 v1.0.7 and v1.0.8 (using IPoIB). It only shows up with MVAPICH2. I tried v1.2.0, v1.2p1 from dev snapshot 2009-03-02, v1.2rc1, v1.2rc2. Failures occur independently from the compiler and optimization level used for the executable and self-compiled libs. I tried Intel and gcc, -O0 .. -O2. Failures are also independent from the blacs/scalapack/blas/lapack combination (I used self-compiled with Intel compilers or gcc, or what is shipped with MKL v10 or 11), and are not related to the use of shared vs. static libs. No special mvapich2 tuning environment variables are set.

The only thing I figured out so far is that the warning messages and following aborts seem to originate from MPI communication initiated by blacs routines. A recent call stack trace produced by the Intel Fortran RTE while running an executable with blacs debugging on, and using static mpi/blacs/scalapack/mkl libs gives:

Warning! Rndv Receiver is receiving (49152 < 110592) less than as expected
BLACS ERROR 'MPI error 480874510 on call to MPI_Recv'
from {-1,-1}, pnum=15, Contxt=-1, on line 8 of file 'BI_Srecv.c'.
application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine      Line     Source
cp2k.popt          0000000001A7BB4D  Unknown      Unknown  Unknown
cp2k.popt          0000000001A6B48A  Unknown      Unknown  Unknown
cp2k.popt          0000000001A6A634  Unknown      Unknown  Unknown
cp2k.popt          0000000001A5B701  Unknown      Unknown  Unknown
cp2k.popt          00000000011980E0  BI_Srecv     8        BI_Srecv.c
cp2k.popt          0000000001197AE5  BI_IdringBR  12       BI_IdringBR.c
cp2k.popt          0000000001193ACE  Czgebr2d     192      zgebr2d_.c
...
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine      Line     Source
libmlx4-rdmav2.so  00002AF569D3414D  Unknown      Unknown  Unknown
...

Another run on the same input data by using the same exe and same task geom gave

Warning! Rndv Receiver is receiving (512 < 221184) less than as expected
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine      Line     Source
cp2k.popt          0000000001A709D2  Unknown      Unknown  Unknown
cp2k.popt          0000000001A6D62D  Unknown      Unknown  Unknown
cp2k.popt          0000000001A6A5FC  Unknown      Unknown  Unknown
cp2k.popt          0000000001A5B701  Unknown      Unknown  Unknown
cp2k.popt          00000000011980E0  BI_Srecv     8        BI_Srecv.c
cp2k.popt          0000000001197AE5  BI_IdringBR  12       BI_IdringBR.c
cp2k.popt          00000000011936E5  Cdgebr2d     192      dgebr2d_.c
...
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine      Line     Source
libmlx4-rdmav2.so  00002AEF72F07A00  Unknown      Unknown  Unknown
...

Currently I have no idea how to nail down the problem to a particular mvapich2 issue. The time needed to reach the point of failure, the large number of tasks, the complexity of the numerics, and the apparent dependency on some runtime component make it hard to debug this.

Please let me know if you need further information.

Sincerely, BK

--
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr.
7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies@zib.de

From panda at cse.ohio-state.edu Fri Mar 6 20:25:18 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Mar 6 20:25:24 2009
Subject: [mvapich-discuss] mvapich2 warning: Rndv Receiver is receiving less than as expected
In-Reply-To: <1236375281.3547.224.camel@kallies.zib.de>
Message-ID:

Thanks for reporting this issue in depth. We will try to take a look at it and get back to you. In the meantime, could you try disabling the runtime environment variable MV2_USE_SHMEM_COLL (MV2_USE_SHMEM_COLL=0) and let us know whether the problem still persists or not.

Thanks,

DK

On Fri, 6 Mar 2009, Bernd Kallies wrote:
> I'd like to report a problem with mvapich2, which I did not manage to
> nail down to a particular issue so far. So I'd like to report it at this
> stage to the mvapich2 dev team.
>
> We use the program package cp2k (http://cp2k.berlios.de/). We compiled
> it with Intel compilers v10 or v11, and linked with mvapich2-1.2.0,
> which was generated with the same compiler. In addition, the package
> uses blacs, scalapack, and blas/lapack, fft.
>
> We have an SGI ICE cluster equipped with Intel Xeon Harpertown (E5472, 8
> cores per node) and 2-port 4xDDR Mellanox ConnectX adapters (Mellanox
> MT26418, ConnectX IB DDR, PCIe 2.0 5GT/s). We use SLES10 SP2 for x86_64,
> and OFED 1.3 built by SGI.
>
> When calculating a particular time series with cp2k, the run aborts
> after about 30 timesteps (takes a couple of hours by using 128 MPI
> tasks) with the mvapich2 message
>
> Warning! Rndv Receiver is receiving (49152 < 110592) less than as
> expected
>
> with varying recv/expected message sizes.
>
> The simulation step where this message appears is not deterministic. The
> only thing I know is that it appears for sure between the 27th and the
> 32nd time step after a restart. Sometimes a run aborts with SIGSEGV.
> Sometimes the run only hangs until running into some time limit.
The
> problem cannot be reproduced by starting from restart files written by
> cp2k in between the runs. One always has to wait about 30 time steps. So
> it does not seem to depend on the input data, but on wallclock time.
>
> The problem does not occur when using another MPI lib. I successfully
> tried SGI MPT v1.22, Intel MPI v3.1.026, MPICH2 v1.0.7 and v1.0.8 (using
> IPoIB). It only shows up with MVAPICH2. I tried v1.2.0, v1.2p1 from dev
> snapshot 2009-03-02, v1.2rc1, v1.2rc2. Failures occur independently from
> the compiler and optimization level used for the executable and
> self-compiled libs. I tried Intel and gcc, -O0 .. -O2. Failures are also
> independent from the blacs/scalapack/blas/lapack combination (I used
> self-compiled with Intel compilers or gcc, or what is shipped with MKL
> v10 or 11), and are not related to the use of shared vs. static libs. No
> special mvapich2 tuning environment variables are set.
>
> The only thing I figured out so far is that the warning messages and
> following aborts seem to originate from MPI communication initiated by
> blacs routines. A recent call stack trace produced by the Intel Fortran
> RTE while running an executable with blacs debugging on, and using
> static mpi/blacs/scalapack/mkl libs gives:
>
> Warning! Rndv Receiver is receiving (49152 < 110592) less than as expected
> BLACS ERROR 'MPI error 480874510 on call to MPI_Recv'
> from {-1,-1}, pnum=15, Contxt=-1, on line 8 of file 'BI_Srecv.c'.
>
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 15
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine      Line     Source
> cp2k.popt          0000000001A7BB4D  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A6B48A  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A6A634  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A5B701  Unknown      Unknown  Unknown
> cp2k.popt          00000000011980E0  BI_Srecv     8        BI_Srecv.c
> cp2k.popt          0000000001197AE5  BI_IdringBR  12       BI_IdringBR.c
> cp2k.popt          0000000001193ACE  Czgebr2d     192      zgebr2d_.c
> ...
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine      Line     Source
> libmlx4-rdmav2.so  00002AF569D3414D  Unknown      Unknown  Unknown
> ...
>
>
> Another run on the same input data by using the same exe and same task
> geom gave
>
> Warning! Rndv Receiver is receiving (512 < 221184) less than as expected
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> Image              PC                Routine      Line     Source
> cp2k.popt          0000000001A709D2  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A6D62D  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A6A5FC  Unknown      Unknown  Unknown
> cp2k.popt          0000000001A5B701  Unknown      Unknown  Unknown
> cp2k.popt          00000000011980E0  BI_Srecv     8        BI_Srecv.c
> cp2k.popt          0000000001197AE5  BI_IdringBR  12       BI_IdringBR.c
> cp2k.popt          00000000011936E5  Cdgebr2d     192      dgebr2d_.c
> ...
> forrtl: error (78): process killed (SIGTERM)
> Image              PC                Routine      Line     Source
> libmlx4-rdmav2.so  00002AEF72F07A00  Unknown      Unknown  Unknown
> ...
>
>
> Currently I have no idea how to nail down the problem to a particular
> mvapich2 issue. The time needed to reach the point of failure, the large
> number of tasks, the complexity of the numerics, and the apparent
> dependency on some runtime component make it hard to debug this.
>
> Please let me know if you need further information.
>
> Sincerely, BK
>
> --
> Dr. Bernd Kallies
> Konrad-Zuse-Zentrum für Informationstechnik Berlin
> Takustr. 7
> 14195 Berlin
> Tel: +49-30-84185-270
> Fax: +49-30-84185-311
> e-mail: kallies@zib.de

From Jie.Cai at cs.anu.edu.au Sat Mar 7 22:09:25 2009
From: Jie.Cai at cs.anu.edu.au (Jie Cai)
Date: Sat Mar 7 21:56:49 2009
Subject: [mvapich-discuss] Bandwidth on single hca dual port multirail configuration
In-Reply-To:
References:
Message-ID: <49B336E5.90200@cs.anu.edu.au>

Hi Matt,

I have reconfigured my cluster.
Each node now contains two DDR ConnectX HCAs. With the MVAPICH2 OpenFabrics-IB interface, the OSU put bandwidth has improved to 2917 MB/s.

So the problem with my old multirail configuration (single HCA, multiple ports) probably came from my not configuring the dual ports properly. That issue still has not been solved.

Do you have any experience configuring a single-HCA, dual-port IB network?

Regards,
Jie

Matthew Koop wrote:
> Jie,
>
> Just to followup on Dr. Panda's question -- what is the speed of card (SDR
> or DDR?). Also, are you using uDAPL or the Gen2 interface? If you are
> using the uDAPL interface there is no multirail support, so you would not
> see higher performance from two ports.
>
> The bandwidth you report is generally the bandwidth one sees from a Gen1
> interface. As I mentioned earlier, if you have a Gen2 card and bus you
> should see even higher (1700+ MB/sec) from a single port. You may want to
> verify that your card has the Gen2 firmware.
>
> Matt
>
> On Thu, 5 Mar 2009, Jie Cai wrote:
>
>
>> Matthew Koop wrote:
>>
>>> Hi Jie,
>>>
>>> You are running into the limitation of the PCIe 1.1 bus here. Even a
>>> single port with a higher bus speed (ConnectX on PCIe 2.0) can get higher
>>> bandwidth than a single port on PCIe 1.1.
>>>
>>> I hope this helps,
>>>
>>> Matt
>>>
>>>
>> Thanks for the reply. The workstation I am using is Sun Ultra24,
>> which has 2x 16 PCI-E 2.0 slots in it. I connect HCA in one of those slots.
>>
>> The theoretical system bus would be ~10GB/s (on the data sheet, didn't
>> measure them myself yet).
>>
>> So, the system bus may not be the bottleneck.
>> Is there some other factors would affect this?
>>
>> Jie
>>
>>> On Thu, 5 Mar 2009, Jie Cai wrote:
>>>
>>>
>>>
>>>> We have single ConnectX dual port HCA cluster installed, and try to
>>>> build a dual port multirail IB cluster.
>>>>
>>>> I have tested to run OSU put bandwidth test on the cluster with MVAPICH2.
>>>>
>>>> mpirun_rsh -ssh -np 2 node02 node01 MV2_NUM_HCAS=1 MV2_NUM_PORTS=2 MV2_NUM_QP_PER_PORT=1 ./osu_bandwidth
>>>>
>>>> However, I didn't achieve bandwidth improvement. The peak bandwidth I
>>>> got for the test is 1458.93 MB/s, which is far from the expectation
>>>> (2.5GB/s).
>>>>
>>>> Does anyone knows what's going on?
>>>>
>>>> --
>>>> Jie Cai
>>>>
>>>
>
>

From koop at cse.ohio-state.edu Sat Mar 7 23:34:47 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Sat Mar 7 23:34:54 2009
Subject: [mvapich-discuss] [mpich-discuss] mvapich 1.1 and multipath
In-Reply-To:
Message-ID:

Sergey,

Messages are sent based on LIDs and QPs. Right now there is no mechanism to set output LIDs directly. If you use the VIADEV_USE_LMC=1 (MVAPICH) or MV2_USE_HSAM=1 (MVAPICH2) it should try to spread traffic for you to maximize performance.

What is the reason for needing this fine-grained control?

Thanks,

Matt

On Fri, 6 Mar 2009, Sergey wrote:

> Hi, may be anyone know this ? How mpi process addressing the message
> to destination (rank 0 send rank 1)? Through LID, GID, QP ID or
> anything else? Can I strictly define to mpi process outgoing LIDs ?
> For example, with LMC = 3, a sender of rank 0, lid 1
> destination of rank 1 multilid 4
> destination of rank 2 multilid 8

From sssmolniy at mail.ru Sun Mar 8 17:11:43 2009
From: sssmolniy at mail.ru (Sergey)
Date: Sun Mar 8 17:11:52 2009
Subject: Re: [mvapich-discuss] [mvapich-discuss] [mpich-discuss] mvapich 1.1 and multipath
Message-ID:

Hi,

Thanks for the answer.

I need this to set a specific route for a specific source-destination pair. I am a postgraduate student and this is part of my research work. Can I take the destination LIDs used by the MPI sender process and replace them with my own specific destination LIDs? I assume they are stored in environment variables? Can an MPI process use more than one LID to transmit a message to a destination rank, or can the sender use only one LID?

Sorry for my English.

> -----Original Message-----
> From: Matthew Koop
> To: Sergey
> Date: Sat, 7 Mar 2009 23:34:47 -0500 (EST)
> Subject: Re: [mvapich-discuss] [mpich-discuss] mvapich 1.1 and multipath
>
> > Sergey,
> >
> > Messages are sent based on LIDs and QPs. Right now there is no mechanism
> > to set output LIDs directly. If you use the VIADEV_USE_LMC=1 (MVAPICH) or
> > MV2_USE_HSAM=1 (MVAPICH2) it should try to spread traffic for you to
> > maximize performance.
> >
> > What is the reason for needing this fine-grained control?
> >
> > Thanks,
> >
> > Matt
> >
> >
> > On Fri, 6 Mar 2009, Sergey wrote:
> >
> > > Hi, may be anyone know this ? How mpi process addressing the message
> > > to destination (rank 0 send rank 1)? Through LID, GID, QP ID or
> > > anything else? Can I strictly define to mpi process outgoing LIDs ?
> > > For example, with LMC = 3, a sender of rank 0, lid 1
> > > destination of rank 1 multilid 4
> > > destination of rank 2 multilid 8

From dzieko at wcss.pl Mon Mar 9 11:02:13 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Mon Mar 9 11:02:25 2009
Subject: [mvapich-discuss] mvapich 1.1 and ofed 1.3.1
Message-ID: <20090309150213.GA4617@cefeid.wcss.wroc.pl>

hello,

I've just compiled mvapich 1.1 against ofed 1.3.1 and intel compilers on intel em64t. IB net is based on mthca0's and Voltaire. script and logs are here:
https://cefeid.wcss.wroc.pl/d/tmp/mvapich-1.1/

my changes to make.mvapich.gen2 script:

diff mvapich-1.1/make.mvapich.gen2 MVAPICH-1.1/make.mvapich.gen2
18,24c18,24
< IBHOME=${IBHOME:-/usr/local/ofed}
< IBHOME_LIB=${IBHOME_LIB:-/usr/local/ofed/lib64}
< PREFIX=${PREFIX:-/usr/local/mvapich}
< export CC=${CC:-gcc}
< export CXX=${CXX:-g++}
< export F77=${F77:-g77}
< export F90=${F90:-}
---
> IBHOME=${IBHOME:-/usr}
> IBHOME_LIB=${IBHOME_LIB:-/usr/lib64}
> PREFIX=${PREFIX:-/usr/local/MVAPICH-1.1}
> export CC=icc
> export CXX=icc
> export F77=ifort
> export F90=ifort
59c59
< ROMIO="--with-romio"
---
> ROMIO="--with-romio --with-file-system=lustre+nfs"
84c84
< export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread}
---
> export LIBS="${IBHOME_LIB}/libibverbs.a ${IBHOME_LIB}/libibumad.a ${IBHOME_LIB}/libibcommon.a -lpthread"

as you see, I wanted to have IB and MPI libs statically linked in.
however hello fails with a non-obvious error (for me ;)

/usr/local/MVAPICH-1.1/bin/mpirun -np 2 -machinefile hello.mpi.hosts ./hello.mpi.x.mv11
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
MPI process terminated unexpectedly
Exit code -5 signaled from wn001
Killing remote processes...forrtl: severe (174): SIGSEGV, segmentation fault occurred
MPI process terminated unexpectedly
DONE

any hints?
thanks in advance, Pawel
--
Pawel Dziekonski
Wroclaw Centre for Networking & Supercomputing, HPC Department
Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND
phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

From kallies at zib.de Mon Mar 9 11:11:17 2009
From: kallies at zib.de (Bernd Kallies)
Date: Mon Mar 9 11:11:32 2009
Subject: [mvapich-discuss] mvapich2 warning: Rndv Receiver is receiving less than as expected
In-Reply-To:
References:
Message-ID: <1236611477.3547.247.camel@kallies.zib.de>

On Fri, 2009-03-06 at 20:25 -0500, Dhabaleswar Panda wrote:
> Thanks for reporting this issue in-depth. We will try to take a look at it
> and get back to you. In the mean time, could you try disabling the runtime
> environmental variable MV2_USE_SHMEM_COLL (MV2_USE_SHMEM_COLL=0) and let
> us know whether the problem still persists or not.

I tried several setups, without success. The problem persists with
- MV2_USE_SHMEM_COLL=0
- MV2_USE_SHARED_MEM=0
- MV2_USE_SHMEM_BCAST=0
- running on 128 nodes, 1 task per node
- MV2_USE_LAZY_MEM_UNREGISTER=0

I used mvapich2-1.2.0 as of 06-Nov-2008, as well as the dev snapshot as of 02-March-2009. The latter is 1.2p1, I guess.
My last trial (16 nodes, 8 tasks per node, mvapich2-1.2.0 static libs compiled with Intel compilers, -O1 -g -traceback, MV2_USE_LAZY_MEM_UNREGISTER=0) crashed with SIGSEGV after

Warning! Rndv Receiver is receiving (110592 < 221184) less than as expected

I used the Intel Fortran RTE to install a SIGSEGV handler, which generates a core dump. gdb where on this core dump gives:

#0 0x00002b81a2ae6bb5 in raise () from /lib64/libc.so.6
#1 0x00002b81a2ae7fb0 in abort () from /lib64/libc.so.6
#2 0x0000000001ac4db5 in for__signal_handler ()
#3
#4 0x00002b81a2ae6bb5 in raise () from /lib64/libc.so.6
#5 0x00002b81a2ae7fb0 in abort () from /lib64/libc.so.6
#6 0x0000000001ac1740 in for__issue_diagnostic ()
#7 0x0000000001ac4b53 in for__signal_handler ()
#8
#9 MPIDI_CH3I_SMP_readv_rndv_cont (recv_vc_ptr=0x231d4d0, iov=0x459bd011b0, iovlen=-1404046976, index=37369648, num_bytes_ptr=0x22b3e12) at ch3_smp_progress.c:1761
#10 0x0000000001a6d62d in MPIDI_CH3I_SMP_read_progress (pg=0x231d4d0) at ch3_smp_progress.c:493
#11 0x0000000001a6a5fc in MPIDI_CH3I_Progress (is_blocking=36820176, state=0x459bd011b0) at ch3_progress.c:184
#12 0x0000000001a5b701 in PMPI_Recv (buf=0x231d4d0, count=-1680862800, datatype=-1404046976, source=37369648, tag=36388370, comm=0, status=0x36906c0) at recv.c:156
#13 0x00000000011980e0 in BI_Srecv (ctxt=0x231d4d0, src=-1680862800, msgid=-1404046976, bp=0x23a3730) at BI_Srecv.c:8
#14 0x0000000001197ae5 in BI_IdringBR (ctxt=0x231d4d0, bp=0x459bd011b0, send=0x2aaaac4ff180, src=37369648, step=36388370) at BI_IdringBR.c:12
#15 0x00000000011936e5 in Cdgebr2d (ConTxt=36820176, scope=0x459bd011b0
, top=0x2aaaac4ff180 "", m=37369648, n=36388370, A=0x0, lda=864, rsrc=0, csrc=5) at dgebr2d_.c:192 #16 0x00000000010e78ab in PB_CInV (TYPE=0x231d4d0, CONJUG=0x459bd011b0
, ROWCOL=0x2aaaac4ff180 "", M=37369648, N=36388370, DESCA=0x0, K=13, X=0x2aaad58ce900 "\204????\223\026>W???_\211\f>??)\\\216:\026??\204&-B\n\t?[b\222?\223??=??\213@C?\021?\030/\004'$/\b???S{??\022?n\232tW\224?\024>??8Y4\235\026????\032X\226 ??A??ad!>?\236?\\g=\023?/z\017&??\020?\235\225\t_\aT >L'N*?\222?=?}??\023$\020>??U|?\r(>?\025J\231?\f?=?\206\204?&r?=?\213hf\226d\b>r\025?.\031@!?tY???X?=\r?\205???\020>??\205N??\024>"..., IX=0, JX=17748848, DESCX=0x20, XROC=0x0, XAPTR=0x0, DXA=0x0, XAFREE=0x7fff08794ac8) at PB_CInV.c:490 #17 0x00000000010ed370 in PB_CpgemmAC (TYPE=0x231d4d0, DIRECA=0x459bd011b0
, DIRECC=0x2aaaac4ff180 "", TRANSA=0x23a3730 "", TRANSB=0x22b3e12 "", M=0, N=865, K=13553, ALPHA=0x1c3dd60 "", A=0x2aaad79ac3f0 "~\215?,\021A???;m)G???nCFW\211x\230?8?5\213`??????\215?l?>?_???0??R]?\006l\237??\\J~p?l\232?\0269??\212???h\230\032?~??>?a?????>\222L??p8?>\vYS.T\b??+d\216?\017V????w??~\233>?p1?P= >(\231?X\211g?>v?\216?\0222??\022u!V{e??/\023?\036?\031???U??6???\220??s\202*h>??9??\"?>I\017\f\035!3\231??_;\0272?q>"..., IA=0, JA=0, DESCA=0x7fff08794f38, B=0x2aaad7a4e400 "\220?\205\027?M\230?\214???r7?>\200r?kS??> ??N? ?>", IB=0, JB=0, DESCB=0x7fff08794f0c, BETA=0x1c3dd58 "", C=0x2aaad58ce800 "???\220\024??=\024\200\220\230\a|\003>x5?\204l\f\005?\bo???\210???\233\004\203??'?74????\036?r?d$?\216\037>?\202\0174\226???@ ??\201\224?=?{\233]n?\032??@h?/\" >??\214?????\222?\203?/???????\201X\031?\227??c\215????3?d?\026\030?2\2237\236f\a?? ?\232\207\b??=b?x] ?\020>?A[???\r?J\205\224?E-?=d?\002T?\036\t>?]??\206\027\v>\212j]?([#??\233?\221??\002?"..., IC=0, JC=0, DESCC=0x7fff08794ee0) at PB_CpgemmAC.c:505 #18 0x00000000010c48c6 in pdgemm_ (TRANSA=0x231d4d0 "\r", TRANSB=0x459bd011b0
, M=0x2aaaac4ff180, N=0x23a3730, K=0x22b3e12, ALPHA=0x0, A=0x2aaad79ac3f0, IA=0x7fff08795034, JA=0x1c070e8, DESCA=0x21eb220, B=0x2aaad7a4e400, IB=0x7fff08795038, JB=0x7fff0879503c, DESCB=0x21eb244, BETA=0x1c3dd58, C=0x2aaad58ce800, IC=0x7fff08795040, JC=0x7fff08795044, DESCC=0x21eb268) at pdgemm_.c:490
#19 0x0000000000b6e30e in cp_fm_basic_linalg_mp_cp_fm_gemm_ ()
#20 0x0000000000ec9f7b in qs_ot_mp_qs_ot_get_p_ ()
#21 0x000000000078e1a4 in qs_ot_scf_mp_ot_scf_mini_ ()
#22 0x00000000007d81ce in qs_scf_mp_qs_scf_loop_do_ot_ ()
#23 0x00000000007d6538 in qs_scf_mp_scf_env_do_scf_ ()
#24 0x00000000007d4fb0 in qs_scf_mp_scf_ ()
#25 0x00000000006b1e34 in qs_energy_mp_qs_energies_ ()
#26 0x00000000006bca89 in qs_force_mp_qs_forces_ ()
#27 0x0000000000468847 in force_env_methods_mp_force_env_calc_energy_force_ ()
#28 0x0000000000d28a2d in integrator_mp_nve_ ()
#29 0x00000000008c7db8 in velocity_verlet_control_mp_velocity_verlet_ ()
#30 0x00000000005d4701 in md_run_mp_qs_mol_dyn_low_ ()
#31 0x00000000005d344f in md_run_mp_qs_mol_dyn_ ()
#32 0x0000000000415583 in cp2k_runs_mp_cp2k_run_ ()
#33 0x000000000041a9aa in cp2k_runs_mp_run_input_ ()
#34 0x0000000000413d17 in cp2k () at /gfs1/work/bzfbbk/CP2K/cp2k/makefiles/../src/cp2k.F:272
#35 0x0000000000412ae2 in main ()

--
Dr. Bernd Kallies
Konrad-Zuse-Zentrum für Informationstechnik Berlin
Takustr. 7
14195 Berlin
Tel: +49-30-84185-270
Fax: +49-30-84185-311
e-mail: kallies@zib.de

From perkinjo at cse.ohio-state.edu Mon Mar 9 12:37:50 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Mar 9 12:38:00 2009
Subject: [mvapich-discuss] mvapich 1.1 and ofed 1.3.1
In-Reply-To: <20090309150213.GA4617@cefeid.wcss.wroc.pl>
References: <20090309150213.GA4617@cefeid.wcss.wroc.pl>
Message-ID: <20090309163749.GC3062@cse.ohio-state.edu>

Pawel:
My reply is inline below.
On Mon, Mar 09, 2009 at 04:02:13PM +0100, Pawel Dziekonski wrote:
>
> hello,
>
> I've just compiled mvapich 1.1 against ofed 1.3.1 and intel compilers on intel
> em64t. IB net is based on mthca0's and Voltaire. script and logs are here:
> https://cefeid.wcss.wroc.pl/d/tmp/mvapich-1.1/
>
> my changes to make.mvapich.gen2 script:
>
> diff mvapich-1.1/make.mvapich.gen2 MVAPICH-1.1/make.mvapich.gen2
> 84c84
> < export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread}
> ---
> > export LIBS="${IBHOME_LIB}/libibverbs.a ${IBHOME_LIB}/libibumad.a ${IBHOME_LIB}/libibcommon.a -lpthread"
>
> as you see, I wanted to have IB and MPI libs statically linked in.

Instead of providing the paths to the static libraries in this manner, can you try just replacing the -Wl,-rpath=${IBHOME_LIB} portion with -static? Let us know if this solves your problem.

>
> however hello fails with a non-obvious error (for me ;)
>
> /usr/local/MVAPICH-1.1/bin/mpirun -np 2 -machinefile hello.mpi.hosts ./hello.mpi.x.mv11
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> forrtl: severe (174): SIGSEGV, segmentation fault occurred
> MPI process terminated unexpectedly
> Exit code -5 signaled from wn001
> Killing remote processes...forrtl: severe (174): SIGSEGV, segmentation fault occurred
> MPI process terminated unexpectedly
> DONE
>
>
> any hints?
> thanks in advance, Pawel
> --
> Pawel Dziekonski
> Wroclaw Centre for Networking & Supercomputing, HPC Department
> Politechnika Wr., pl. Grunwaldzki 9, bud.
D2/101, 50-377 Wroclaw, POLAND
> phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From michael.heinz at qlogic.com Mon Mar 9 13:37:27 2009
From: michael.heinz at qlogic.com (Mike Heinz)
Date: Mon Mar 9 13:39:03 2009
Subject: [mvapich-discuss] "Too many open files" error
Message-ID: <4C2744E8AD2982428C5BFE523DF8CDCB3FFA55865E@MNEXMB1.qlogic.org>

Hey, we're QA testing a release of OFED 1.4, including MVAPICH, and the testers just ran into the following problem: they're running Pallas across 44 nodes and, partway through the run, machines start failing with a "too many open files" error (see below).

At first blush, this sounds like a ulimit problem, and I'm trying to get access to the failing machines to test that theory, but is there some known condition where mvapich will leak file handles?
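
One way to test the ulimit theory on a failing node might be to compare the per-process descriptor limit with the number of descriptors a process actually holds. The sketch below is only an illustration and is not part of MVAPICH or Pallas: it assumes Linux (it counts the entries in /proc/self/fd), and the helper name fd_usage is made up for this example.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the calling process.

    open_fds is None when /proc/self/fd is unavailable (non-Linux systems).
    """
    # RLIMIT_NOFILE is the per-process limit that "ulimit -n" reports.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        # One directory entry per open descriptor; the listing itself
        # briefly holds one extra descriptor, so the count is approximate.
        open_fds = len(os.listdir("/proc/self/fd"))
    except OSError:
        open_fds = None
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print(f"open descriptors: {used}, soft limit: {soft}, hard limit: {hard}")
```

Sampling the same count for a running MPI rank (for example by listing /proc/<pid>/fd from outside the process) would show whether it climbs steadily toward the soft limit, which would suggest a leak rather than a limit that is simply too low for the job.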
[root@st28]# /usr/mpi/gcc/mvapich-1.1.0/bin/mpirun -np 44 -machinefile (prior test cases trimmed) #---------------------------------------------------------------- # Benchmarking Bcast # ( #processes = 8 ) # ( 36 additional processes waiting in MPI_Barrier) #---------------------------------------------------------------- #bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] 0 1000 0.05 0.07 0.05 1 1000 8.70 8.71 8.71 2 1000 8.16 8.18 8.17 4 1000 8.17 8.19 8.18 8 1000 7.83 7.84 7.83 16 1000 8.08 8.10 8.09 32 1000 8.36 8.38 8.37 64 1000 8.28 8.30 8.29 128 1000 9.02 9.03 9.03 256 1000 9.33 9.35 9.34 512 1000 10.13 10.14 10.13 1024 1000 12.33 12.35 12.33 2048 1000 14.86 14.89 14.87 4096 1000 20.21 20.23 20.22 8192 1000 33.47 33.51 33.49 16384 1000 126.25 126.32 126.27 open: Too many open files [5820] shmem_coll_init:error in opening shared memory file : 24 open: Too many open files [5820] shmem_coll_init:error in opening shared memory file : 24 open: Too many open files open: Too many open files open: Too many open files open: Too many open files [5820] shmem_coll_init:error in opening shared memory file : 24 open: Too many open files [5820] shmem_coll_init:error in opening shared memory file : 24 [0] shmem_coll_mmap:error in mmapping shared memory: 2 open: Too many open files [5820] shmem_coll_init:error in opening shared memory file : 24 -- Michael Heinz Principal Engineer, Qlogic Corporation King of Prussia, Pennsylvania -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090309/9a82772d/attachment-0001.html

From chai.15 at osu.edu Tue Mar 10 01:50:16 2009
From: chai.15 at osu.edu (Lei Chai)
Date: Tue Mar 10 01:50:22 2009
Subject: [mvapich-discuss] "Too many open files" error
In-Reply-To: <4C2744E8AD2982428C5BFE523DF8CDCB3FFA55865E@MNEXMB1.qlogic.org>
References: <4C2744E8AD2982428C5BFE523DF8CDCB3FFA55865E@MNEXMB1.qlogic.org>
Message-ID: <49B5FF98.7070809@osu.edu>

Hi Mike,

Sorry to hear you are having this problem. Could you try the following things:

- Which Pallas version are you running? The latest one is IMB-3.2; could you try it and see if the problem disappears?
- What are the "ulimit -n" and "cat /proc/sys/fs/file-max" outputs on your system?
- Could you try the env variable VIADEV_USE_SHMEM_BCAST=0?
- If the problem is still there, could you try the patch below?

Thanks,
Lei

===================================================================
--- mpid/ch_gen2/mpid_smpi.c 2009-03-10 03:16:35 UTC (rev 3233)
+++ mpid/ch_gen2/mpid_smpi.c 2009-03-10 05:30:15 UTC (rev 3234)
@@ -86,6 +86,8 @@
 #define MPID_PROGRESSION_LOCK()
 #define MPID_PROGRESSION_UNLOCK()
 
+unsigned int g_shmem_size = 0;
+unsigned int g_shmem_size_pool = 0;
 int smp_eagersize = SMP_EAGERSIZE;
 int smpi_length_queue = SMPI_LENGTH_QUEUE;
 int expect_cancel_ack = 0;
@@ -759,9 +761,8 @@
 
 void smpi_init (void)
 {
-    unsigned int i, j, size, pool, pid, wait;
+    unsigned int i, j, pool, pid, wait;
     int local_num, sh_size, pid_len, rq_len, param_len, limit_len;
-    unsigned int size_pool;
     struct stat file_status, file_status_pool;
     char *shmem_file = NULL;
     char *pool_file = NULL;
@@ -846,11 +847,11 @@
     sh_size = sizeof(struct shared_mem) + pid_len + param_len + rq_len +
         limit_len + SMPI_CACHE_LINE_SIZE * 4;
 
-    size = (SMPI_CACHE_LINE_SIZE + sh_size + pagesize +
+    g_shmem_size = (SMPI_CACHE_LINE_SIZE + sh_size + pagesize +
             (smpi.num_local_nodes * (smpi.num_local_nodes - 1) *
              (SMPI_ALIGN (smpi_length_queue + pagesize))));
 
-    size_pool =
+    g_shmem_size_pool =
         SMPI_ALIGN (sizeof (SEND_BUF_T) * smp_num_send_buffer +
                     pagesize) * smpi.num_local_nodes + SMPI_CACHE_LINE_SIZE;
 
@@ -867,7 +868,7 @@
     }
 
     /* set file size, without touching pages */
-    if (ftruncate (smpi.fd, size)) {
+    if (ftruncate (smpi.fd, g_shmem_size)) {
         /* to clean up tmp shared file */
         unlink (shmem_file);
         error_abort_all (GEN_EXIT_ERR,
@@ -886,7 +887,7 @@
     }
 
 
-    if (ftruncate (smpi.fd_pool, size_pool)) {
+    if (ftruncate (smpi.fd_pool, g_shmem_size_pool)) {
         /* to clean up tmp shared file */
         unlink (pool_file);
         error_abort_all (GEN_EXIT_ERR,
@@ -898,8 +899,8 @@
 #ifndef _X86_64_
     {
         char *buf;
-        buf = (char *) calloc (size + 1, sizeof (char));
-        if (write (smpi.fd, buf, size) != size) {
+        buf = (char *) calloc (g_shmem_size + 1, sizeof (char));
+        if (write (smpi.fd, buf, g_shmem_size) != g_shmem_size) {
             error_abort_all (GEN_EXIT_ERR,
                              "[%d] smpi_init:error in writing "
                              "shared memory file: %d\n",
@@ -910,8 +911,8 @@
 
     {
         char *buf;
-        buf = (char *) calloc (size_pool + 1, sizeof (char));
-        if (write (smpi.fd_pool, buf, size_pool) != size_pool) {
+        buf = (char *) calloc (g_shmem_size_pool + 1, sizeof (char));
+        if (write (smpi.fd_pool, buf, g_shmem_size_pool) != g_shmem_size_pool) {
             error_abort_all (GEN_EXIT_ERR,
                              "[%d] smpi_init:error in writing "
                              "shared pool file: %d\n",
@@ -959,14 +960,14 @@
         }
         usleep (10);
     }
-    while (file_status.st_size != size ||
-           file_status_pool.st_size != size_pool);
+    while (file_status.st_size != g_shmem_size ||
+           file_status_pool.st_size != g_shmem_size_pool);
 
     smpi_shmem = (struct shared_mem *)malloc(sizeof(struct shared_mem));
     smpi_malloc_assert(smpi_shmem, "smpi_init", "SMPI_SHMEM");
 
     /* mmap of the shared memory file */
-    smpi.mmap_ptr = mmap (0, size,
+    smpi.mmap_ptr = mmap (0, g_shmem_size,
                           (PROT_READ | PROT_WRITE), (MAP_SHARED), smpi.fd, 0);
     if (smpi.mmap_ptr == (void *) -1) {
         /* to clean up tmp shared file */
@@ -976,7 +977,7 @@
                          "shared memory: %d\n", MPID_MyWorldRank, errno);
     }
 
-    smpi.send_buf_pool_ptr = mmap (0, size_pool, (PROT_READ | PROT_WRITE),
+    smpi.send_buf_pool_ptr = mmap (0, g_shmem_size_pool, (PROT_READ | PROT_WRITE),
                                    (MAP_SHARED), smpi.fd_pool, 0);
 
     if (smpi.send_buf_pool_ptr == (void *) -1) {
@@ -1217,14 +1218,12 @@
         MPID_SMP_Check_incoming ();
     }
     /* unmap the shared memory file */
-    munmap (smpi.mmap_ptr, (SMPI_CACHE_LINE_SIZE +
-                            sizeof (struct shared_mem) +
-                            (smpi.num_local_nodes *
-                             (smpi.num_local_nodes -
-                              1) * (smpi_length_queue +
-                                    SMPI_CACHE_LINE_SIZE))));
-
+    munmap (smpi.mmap_ptr, g_shmem_size);
     close (smpi.fd);
+
+    munmap(smpi.send_buf_pool_ptr, g_shmem_size_pool);
+    close(smpi.fd_pool);
+
     smpi_send_fifo_ptr = smpi.send_fifo_head;
     while (smpi_send_fifo_ptr) {
         free (smpi_send_fifo_ptr);
===================================================

Mike Heinz wrote:
>
> Hey, we're QA testing a release of OFED 1.4, including MVAPICH, and
> the testers just run into the following problem - they're running
> Pallas across 44 nodes when, part way through the run when machines
> start failing with a "too many open files" error (see below).
>
> At first blush, this sounds like a ulimit problem, and I'm trying to
> get access to the failing machines to test that theory - but is there
> some known condition where mvapich will leak file handles?
From Susan.A.Schwarz at dartmouth.edu Tue Mar 10 12:03:40 2009
From: Susan.A.Schwarz at
dartmouth.edu (Susan A. Schwarz)
Date: Tue Mar 10 12:46:13 2009
Subject: [mvapich-discuss] performance problems on mpi/openmp hybrid code
Message-ID: <49B68F5C.5040405@dartmouth.edu>

I am running an MPI/OpenMP code using mvapich on dual quad-core AMD nodes on a RHEL 5.3 cluster. Initially I found that the code took longer to run over InfiniBand than when I ran it with just Ethernet connections. I found the section in the MVAPICH User and Tuning Guide about setting VIADEV_USE_AFFINITY=0 to allow the OpenMP threads to run on other CPUs. With VIADEV_USE_AFFINITY=0 set, the OpenMP section does use the other CPUs, but because the load on those CPUs is already about 50%, my code is still not running as fast as the version that uses Ethernet. Here is the structure of the Fortran code, which I am compiling with the Intel v11.0 compilers:

do i= 1 to # of iterations
   [ perform mpi-based calculation ]
   if master processor
      perform openmp-based calculation using 8 threads
      mpi_bcast(broadcast results to the other processes)
   else if not the master
      mpi_bcast(obtain results from master)
   end if
end do

So the slave processes call mpi_bcast and wait for the master process to complete the OpenMP-based calculation and broadcast the result. When I run 'top', I see that the slave processes are using 50% of each of the CPUs while waiting for the master process to complete the OpenMP section of the code. During the OpenMP section of the code, top shows the master process running with a load of at most 400%.

During the Ethernet-based run, the load on the slave processes is almost 0 and the master process has a load of 800% during the OpenMP section of the code, which is what I expected because I am using 8 threads. When I compare the elapsed times for the OpenMP section of the code, the InfiniBand version takes twice as long as the Ethernet version.
My question is: why is the load on the slave processes 50% over InfiniBand when they are doing nothing except waiting for the results to be broadcast to them, and why is my OpenMP section running at only 400% and not 800%? Is there any way to change either my code or the configuration of mvapich so this doesn't happen?

thank you,
Susan Schwarz
Research Computing
Dartmouth College

From koop at cse.ohio-state.edu Tue Mar 10 13:54:25 2009
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Tue Mar 10 13:54:31 2009
Subject: [mvapich-discuss] performance problems on mpi/openmp hybrid code
In-Reply-To: <49B68F5C.5040405@dartmouth.edu>
Message-ID:

Hi Susan,

Just to clarify -- this is my understanding: You are running 8 processes per node, one of which is the master process. At some point the master process will spawn 8 OpenMP threads. At this point we have 7 slave processes that are single-threaded and 1 master process that has 8 threads.

If that is the situation, then you will need to turn on "blocking" mode. This will prevent the slave processes from using the CPU while the master threads are working. You will want to use both VIADEV_USE_AFFINITY=0 and VIADEV_USE_BLOCKING=1.

This is an interesting hybrid mode we have not seen. I assume the slave processes are working during the other parts of the code?

The alternative is to try to use OpenMP fully on the node (and just have one master and no slaves), or 2 MPI tasks, each with 4 threads, etc. In those cases you would not need to have VIADEV_USE_BLOCKING=1.

Let us know if this helps,

Matt

On Tue, 10 Mar 2009, Susan A. Schwarz wrote:
> I am running an MPI/OpenMP code using mvapich on dual quad-core AMD nodes on a
> RHEL 5.3 cluster. Initially I found that the code took longer to run using the
> infiniband than when I ran it with just ethernet connections.
I found the > section in the MVAPICH User and Tuning Guide about setting VIADEV_USE_AFFINITY=0 > to allow the openmp threads to run on other CPUs. Now when I set > VIADEV_USE_AFFINITY=0, I find that now the openmp section is using other CPUs > but because the load on the other CPUs is about 50%, my code is still not > running as fast as the version that uses the ethernet. Here is the structure of > the fortran code which I am compiling with Intel v11.0 compilers: > > do i= 1 to # of iterations > > [ perform mpi-based calculation] > if master processor > perform openmp-based calculation using 8 threads > mpi_bcast(broadcast results to the other processes > else if not the master > mpi_bcast(obtain results from master) > end if > end do > > So the slave processors do an mpi_bcast and wait for the master process to > complete the openmp-based calculation and broadcast the result. When I run > 'top', I see that the slave processes are using 50% of each of the CPUs while > waiting for the master process to complete the openmp section of the code. > During the OpenMP section of the code, top shows the master processor running > with a load of atmost 400%. > > During the ethernet-based run, the load on the slave processes is almost 0 and > the master processor has a load of 800% during the openmp section of the code > which is what I expected because I am using 8 threads. When I compare the > elapsed times for the openmp section of the code, the infiniband version takes > twice as long as the ethernet version. > > My question is why is the load on the slave processors 50% when I am using the > infiniband when they are doing nothing except waiting for the results to be > broadcast to them and why is my openmp running only at 400% and not 800% . Is > there any way to either change my code or the configuration of mvapich so this > doesn't happen. 
> > thank you, > Susan Schwarz > Research Computing > Dartmouth College > > > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From koop at cse.ohio-state.edu Tue Mar 10 14:22:43 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Tue Mar 10 14:22:50 2009 Subject: =?koi8-r?Q?Re=3A_[mvapich-discuss]_[mvapich-discuss]_[mpich-discuss]_mvapich_1.1_and_multipath?= In-Reply-To: Message-ID: Hi Sergey, As I mentioned earlier, there currently is no method to manually set routes with MVAPICH or MVAPICH2. Using VIADEV_USE_LMC=1 (MVAPICH) or MV2_USE_HSAM=1 (MVAPICH2) will attempt to do the best it can. Thanks, Matt On Mon, 9 Mar 2009, Sergey wrote: > Hi > Thanks for answer > It need for set specific route for specific source-dest. I study in post graduate and it parts of my scientific work. > Can I take destLIDs for mpi sender process and exchange it to my specific destLIDs ? I assume it placed in enviroments ? > Can MPI process use more than one LID to transmit message to destination rank or sender may use only one LID to transmit ? > > Sorry for my english language > > > > -----Original Message----- > > From: Matthew Koop > > To: Sergey > > Date: Sat, 7 Mar 2009 23:34:47 -0500 (EST) > > Subject: Re: [mvapich-discuss] [mpich-discuss] mvapich 1.1 and multipath > > > > > Sergey, > > > > > > Messages are sent based on LIDs and QPs. Right now there is no mechanism > > > to set output LIDs directly. If you use the VIADEV_USE_LMC=1 (MVAPICH) or > > > MV2_USE_HSAM=1 (MVAPICH2) it should try to spread traffic for you to > > > maximize performance. > > > > > > What is the reason for needing this fine-grained control? > > > > > > Thanks, > > > > > > Matt > > > > > > > > > On Fri, 6 Mar 2009, Sergey wrote: > > > > > > > Hi, may be anyone know this ? How mpi process addressing the message > > > > to destination (rank 0 send rank 1)? 
Through LID, GID, QP ID or > > > > anything else? Can I strictly define to mpi process outgoing LIDs ? > > > > For example, with LMC = 3, a sender of rank 0, lid 1 > > > > destination of rank 1 multilid 4 > > > > destination of rank 2 multilid 8 > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From dzieko at wcss.pl Tue Mar 10 19:06:46 2009 From: dzieko at wcss.pl (Pawel Dziekonski) Date: Tue Mar 10 19:06:57 2009 Subject: [mvapich-discuss] mvapich 1.1 and ofed 1.3.1 In-Reply-To: <20090309163749.GC3062@cse.ohio-state.edu> References: <20090309150213.GA4617@cefeid.wcss.wroc.pl> <20090309163749.GC3062@cse.ohio-state.edu> Message-ID: <20090310230646.GB2622@cefeid.wcss.wroc.pl> On Mon, 09 Mar 2009 at 12:37:50PM -0400, Jonathan Perkins wrote: > Pawel: > My reply is inline below. > > On Mon, Mar 09, 2009 at 04:02:13PM +0100, Pawel Dziekonski wrote: > > > > hello, > > > > I've just compiled mvapich 1.1 against ofed 1.3.1 and intel compilers on intel > > em64t. IB net is based on mthca0's and Voltaire. script and logs are here: > > https://cefeid.wcss.wroc.pl/d/tmp/mvapich-1.1/ > > > > my changes to make.mvapich.gen2 script: > > > > diff mvapich-1.1/make.mvapich.gen2 MVAPICH-1.1/make.mvapich.gen2 > > 84c84 > > < export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread} > > --- > > > export LIBS="${IBHOME_LIB}/libibverbs.a ${IBHOME_LIB}/libibumad.a ${IBHOME_LIB}/libibcommon.a -lpthread" > > > > as you see, I wanted to have IB and MPI libs statically linked in. 
> > Instead of providing the paths to the static libraries in this manner > can you try just replacing the -Wl,rpath=${IBHOME_LIB} portion with > -static? Let us know if this solves your problem. In this way it will link everything static, including libc, and fail. Instead I used: export LIBS="-L/usr/lib64 -Wl,-Bstatic -libverbs -libumad -libcommon -Wl,-Bdynamic -lpthread" and it works, however test job still fails with same error. Maybe this is an OFED design? regards, Pawel -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From perkinjo at cse.ohio-state.edu Tue Mar 10 20:46:11 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Tue Mar 10 20:46:26 2009 Subject: [mvapich-discuss] mvapich 1.1 and ofed 1.3.1 In-Reply-To: <20090310230646.GB2622@cefeid.wcss.wroc.pl> References: <20090309150213.GA4617@cefeid.wcss.wroc.pl> <20090309163749.GC3062@cse.ohio-state.edu> <20090310230646.GB2622@cefeid.wcss.wroc.pl> Message-ID: <20090311004610.GA3573@cse.ohio-state.edu> On Wed, Mar 11, 2009 at 12:06:46AM +0100, Pawel Dziekonski wrote: > On Mon, 09 Mar 2009 at 12:37:50PM -0400, Jonathan Perkins wrote: > > Pawel: > > My reply is inline below. > > > > On Mon, Mar 09, 2009 at 04:02:13PM +0100, Pawel Dziekonski wrote: > > > > > > hello, > > > > > > I've just compiled mvapich 1.1 against ofed 1.3.1 and intel compilers on intel > > > em64t. IB net is based on mthca0's and Voltaire. 
script and logs are here: > > > https://cefeid.wcss.wroc.pl/d/tmp/mvapich-1.1/ > > > > > > my changes to make.mvapich.gen2 script: > > > > > > diff mvapich-1.1/make.mvapich.gen2 MVAPICH-1.1/make.mvapich.gen2 > > > 84c84 > > > < export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread} > > > --- > > > > export LIBS="${IBHOME_LIB}/libibverbs.a ${IBHOME_LIB}/libibumad.a ${IBHOME_LIB}/libibcommon.a -lpthread" > > > > > > as you see, I wanted to have IB and MPI libs statically linked in. > > > > Instead of providing the paths to the static libraries in this manner > > can you try just replacing the -Wl,rpath=${IBHOME_LIB} portion with > > -static? Let us know if this solves your problem. > > In this way it will link everything static, including libc, and fail. Instead > I used: > > export LIBS="-L/usr/lib64 -Wl,-Bstatic -libverbs -libumad -libcommon -Wl,-Bdynamic -lpthread" Thanks for the info. > > and it works, however test job still fails with same error. Maybe this is an OFED design? The output you provided does seem to point to some interaction with the way the verbs library works. The Infinihost cards require libmthca.a. I believe that you'll want to insert -lmthca before -Wl,-Bdynamic. > > regards, Pawel > > -- > Pawel Dziekonski > Wroclaw Centre for Networking & Supercomputing, HPC Department > Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND > phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: application/pgp-signature
Size: 197 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090310/92ede9e4/attachment-0001.bin

From dzieko at wcss.pl Wed Mar 11 03:26:16 2009
From: dzieko at wcss.pl (Pawel Dziekonski)
Date: Wed Mar 11 03:26:25 2009
Subject: [mvapich-discuss] error IBV_WC_RETRY_EXC_ERR, code=12
Message-ID: <20090311072616.GA23401@cefeid.wcss.wroc.pl>

Hello,

I am trying to run Linpack on my whole cluster and it fails with:

Column=091896 Fraction=0.135 Mflops=15485000.90
Column=095256 Fraction=0.140 Mflops=15490137.96
Abort signaled by rank 1036: [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91

Exit code -3 signaled from wn206
Killing remote processes...Abort signaled by rank 1946: [wn320:1946] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56

Abort signaled by rank 911: [wn190:911] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415

Abort signaled by rank 927: [wn192:927] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=297

Abort signaled by rank 660: [wn159:660] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1605

Abort signaled by rank 1188: [wn225:1188] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1503

Abort signaled by rank 665: [wn160:665] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1610

Abort signaled by rank 1046: [wn207:1046] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=101

MPI process terminated unexpectedly
Signal 15 received.
Signal 15 received.
connect: Connection timed out
Signal 15 received.
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
[...]

wnXXX are worker nodes in the cluster. Which of the nodes mentioned above could be the problem? All of them seem to work fine at first glance.

Micro-benchmarks with Linpack on pairs of nodes (covering all nodes) work fine too.
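A rough sketch of how the abort lines in the log above could be tallied per node, to see whether any host shows up more often than the others. The regular expression is written against the log format shown in this message only; the rank-to-host mapping is an assumption (the log does not say which node a destination rank runs on, so it would have to come from the machinefile), and the host name "wn012" in the usage example is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91
ABORT_RE = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<rank>\d+)\] Got completion with error "
    r"(?P<error>\w+), code=(?P<code>\d+), dest rank=(?P<dest>\d+)"
)


def parse_aborts(log_text):
    """Return (host, source_rank, dest_rank) for every abort line found."""
    return [
        (m.group("host"), int(m.group("rank")), int(m.group("dest")))
        for m in ABORT_RE.finditer(log_text)
    ]


def tally(aborts, rank_to_host=None):
    """Count how often each host appears on either end of a failed transfer.

    rank_to_host is an optional, user-supplied dict (an assumption here);
    destination ranks missing from it are simply not counted.
    """
    counts = Counter()
    for host, _rank, dest in aborts:
        counts[host] += 1
        if rank_to_host and dest in rank_to_host:
            counts[rank_to_host[dest]] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Abort signaled by rank 1036: [wn206:1036] Got completion with error "
        "IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91\n"
        "Killing remote processes...Abort signaled by rank 1946: [wn320:1946] "
        "Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56\n"
    )
    # {91: "wn012"} is a made-up mapping for illustration.
    for host, n in tally(parse_aborts(sample), {91: "wn012"}).most_common():
        print(host, n)
```

In the excerpt above every reporting node appears exactly once, so the source side alone does not single out a culprit; mapping the destination ranks back to hosts is what might reveal a shared node or switch.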
I use MVAPICH 1.1 and HPL from Intel MKL on em64t with mellanox HCAs. thanks in advance, Pawel -- Pawel Dziekonski Wroclaw Centre for Networking & Supercomputing, HPC Department Politechnika Wr., pl. Grunwaldzki 9, bud. D2/101, 50-377 Wroclaw, POLAND phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl From Susan.A.Schwarz at dartmouth.edu Wed Mar 11 08:26:42 2009 From: Susan.A.Schwarz at dartmouth.edu (Susan A. Schwarz) Date: Wed Mar 11 09:25:25 2009 Subject: [mvapich-discuss] performance problems on mpi/openmp hybrid code In-Reply-To: References: Message-ID: <49B7AE02.8080101@dartmouth.edu> Matt, Thank you for your response. The use of VIADEV_USE_AFFINITY=0 and VIADEV_USE_BLOCKING=1 have fixed my problem and my program is now running as I expected. Susan Matthew Koop wrote: > Hi Susan, > > Just to clarify -- this is my understanding: You are running 8 processes > per node, and then there is one master process per node. At some point the > master process will spawn 8 OpenMP threads. At this point we have 7 slave > processes that are single-threaded and 1 master process that has 8 > threads. > > If you are using such a situation, then you will need to turn on > "blocking" mode. This will prevent the slave processes from using the CPU > while the master threads are working. You will want to use both > VIADEV_USE_AFFINITY=0 and VIADEV_USE_BLOCKING=1 > > This is an interesting hybrid mode we have not seen. I assume the slave > processes are working during the other parts of the code? > > The alternative is to try use OpenMP fully on the node (and just have one > master and no slaves). Or 2 MPI tasks, each with 4 threads, etc. In those > cases you would not need to have VIADEV_USE_BLOCKING=1. > > Let us know if this helps, > > Matt > > On Tue, 10 Mar 2009, Susan A. Schwarz wrote: > >> I am running an MPI/OpenMP code using mvapich on dual quad-core AMD nodes on a >> RHEL 5.3 cluster. 
Initially I found that the code took longer to run using the >> infiniband than when I ran it with just ethernet connections. I found the >> section in the MVAPICH User and Tuning Guide about setting VIADEV_USE_AFFINITY=0 >> to allow the openmp threads to run on other CPUs. Now when I set >> VIADEV_USE_AFFINITY=0, I find that now the openmp section is using other CPUs >> but because the load on the other CPUs is about 50%, my code is still not >> running as fast as the version that uses the ethernet. Here is the structure of >> the fortran code which I am compiling with Intel v11.0 compilers: >> >> do i= 1 to # of iterations >> >> [ perform mpi-based calculation] >> if master processor >> perform openmp-based calculation using 8 threads >> mpi_bcast(broadcast results to the other processes >> else if not the master >> mpi_bcast(obtain results from master) >> end if >> end do >> >> So the slave processors do an mpi_bcast and wait for the master process to >> complete the openmp-based calculation and broadcast the result. When I run >> 'top', I see that the slave processes are using 50% of each of the CPUs while >> waiting for the master process to complete the openmp section of the code. >> During the OpenMP section of the code, top shows the master processor running >> with a load of atmost 400%. >> >> During the ethernet-based run, the load on the slave processes is almost 0 and >> the master processor has a load of 800% during the openmp section of the code >> which is what I expected because I am using 8 threads. When I compare the >> elapsed times for the openmp section of the code, the infiniband version takes >> twice as long as the ethernet version. >> >> My question is why is the load on the slave processors 50% when I am using the >> infiniband when they are doing nothing except waiting for the results to be >> broadcast to them and why is my openmp running only at 400% and not 800% . 
Is >> there any way to either change my code or the configuration of mvapich so this >> doesn't happen. >> >> thank you, >> Susan Schwarz >> Research Computing >> Dartmouth College >> >> >> >> >> >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > From koop at cse.ohio-state.edu Wed Mar 11 11:55:26 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Wed Mar 11 11:55:31 2009 Subject: [mvapich-discuss] error IBV_WC_RETRY_EXC_ERR, code=12 In-Reply-To: <20090311072616.GA23401@cefeid.wcss.wroc.pl> Message-ID: Hi Pawel, Is this a cluster that has been recently setup? If so, the IBV_WC_RETRY_EXC_ERR can come up during an application if there is a loose cable, bad HCA or a bad switch blade. Can you try running mpiGraph and see if it shows any problems? You can download it from: http://sourceforge.net/projects/mpigraph Just run one process per node and it will generate a picture of the "health" of the network. Dark lines will indicate a problem that is due to hardware. 
Matt On Wed, 11 Mar 2009, Pawel Dziekonski wrote: > Hello, > > I try to run Linpack on my whole cluster and it fails with: > > Column=091896 Fraction=0.135 Mflops=15485000.90 > Column=095256 Fraction=0.140 Mflops=15490137.96 > Abort signaled by rank 1036: [wn206:1036] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=91 > > Exit code -3 signaled from wn206 > Killing remote processes...Abort signaled by rank 1946: [wn320:1946] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=56 > > Abort signaled by rank 911: [wn190:911] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1415 > > Abort signaled by rank 927: [wn192:927] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=297 > > Abort signaled by rank 660: [wn159:660] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1605 > > Abort signaled by rank 1188: [wn225:1188] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1503 > > Abort signaled by rank 665: [wn160:665] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=1610 > > Abort signaled by rank 1046: [wn207:1046] Got completion with error IBV_WC_RETRY_EXC_ERR, code=12, dest rank=101 > > MPI process terminated unexpectedly > Signal 15 received. > Signal 15 received. > connect: Connection timed out > Signal 15 received. > connect: Connection timed out > connect: Connection timed out > connect: Connection timed out > [...] > > wnXXX are worker nodes in the cluster. Which one from mentioned above > could be a problem? All of then seem to work fine onthe 1st look. > > micro-benchmarks with Linpack on pairs of all nodes work fine too. > > I use MVAPICH 1.1 and HPL from Intel MKL on em64t with mellanox HCAs. > > thanks in advance, Pawel > > > > -- > Pawel Dziekonski > Wroclaw Centre for Networking & Supercomputing, HPC Department > Politechnika Wr., pl. Grunwaldzki 9, bud. 
D2/101, 50-377 Wroclaw, POLAND > phone: +48 71 3202043, fax: +48 71 3225797, http://www.wcss.wroc.pl > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From xmxmxie at gmail.com Sat Mar 14 11:06:31 2009 From: xmxmxie at gmail.com (Xie Min) Date: Sat Mar 14 11:06:41 2009 Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage? In-Reply-To: References: <91bd441b0902090620t756e1143qb1f1fd90e6bcd9de@mail.gmail.com> Message-ID: <91bd441b0903140806t70b41bd5j2aad051a6bfffa5@mail.gmail.com> Today, we do some other tests. For 128 tasks HPCC (16 nodes * 8), we can run the whole test successfully and get final result. But for 512 tasks HPCC (64 nodes * 8), HPL is freezed too when MV2_USE_LAZY_MEM_UNREGISTER=1. I attach an input file for 512 tasks HPCC (about 1.6GB for each task), maybe you can try it on your systems to see if it will produce the same problem. Thanks. 2009/3/12 Matthew Koop : > Xie, > > Thanks for sending this information along. We've spent some time > investigating the issue and came up with a patch that will hopefully > resolve your issue. I've attached it to this email and it should be > applied at the base directory. > > Please let us know if this helps the problem, > > Matt > > On Mon, 9 Feb 2009, Xie Min wrote: > >> The hpcc we used is HPCC 1.0.0, but we just tried HPCC 1.3.1, seems >> has the same problem. >> >> In the attachment we attached two hpccinf.txt files for 64 HPCC tasks, >> the hpccinf.txt.13 is the "RES" of about 1.3GB, while hpccinf.txt.16 >> is the "RES" of about 1.6/1.7GB. Whould you please try them on your >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1), thanks. >> >> BTW, the OFED version we used is 1.3.1, physical memory on each node >> is 16GB, use 8 nodes for 64 tasks. >> >> >> >> 2009/2/7 Matthew Koop : >> > >> > Thanks for the additional information. 
I've tried here with HPCC 1.3.1 and >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory >> > while running. >> > >> > Would it be possible to send me your hpccinf.txt file so I can more >> > closely try to reproduce the problem? We also have AS5 with kernel 2.6.18 >> > as well. >> > >> > Thanks, >> > >> > Matt >> > >> > On Thu, 5 Feb 2009, Xie Min wrote: >> > >> >> We use Redhat AS5, kernel is 2.6.18 with lustre 1.6.6, and we don't >> >> modify kernel source. >> >> >> >> We test HPCC on two clusters: >> >> In one cluster, each node is booted using Boot over IB, it has no >> >> harddisk, so NO swap space. We run 64 HPCC tasks on 8 nodes (so each >> >> CPU core in the node will run one HPCC task), when each HPCC task use >> >> 1.2/1.3G memory, it will be killed by OS because of "Out of memory" >> >> error. But when MV2_USE_LAZY_MEM_UNREGISTER=0, task can use 1.7G >> >> memory and run successfully. >> >> >> >> In another cluster, each node has harddisk, it booted from local disk, >> >> and it HAS space space. We run 64 HPCC tasks on 8 nodes too. When each >> >> HPCC use 1.3G memory, we use "top" to show the memory usage >> >> information, we found swap will be used when HPCC is running for a >> >> while, and the node begin to run very slowly and cannot respond to >> >> keyboard input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can >> >> be set to 1.7G memory scale and run successfully. >> >> >> >> I tried another mvapich2 parameters: MV2_USE_LAZY_MEM_UNREGISTER=1, >> >> and MV2_NDREG_ENTRIES=8. In this configuration, HPCC is still be >> >> killed by OS with "Out of memory" error when the memory scale of each >> >> task is set to 1.3GB. >> >> >> >> 2009/2/5 Matthew Koop : >> >> > Hi, >> >> > >> >> > What OS/distro are you running? Are there any changes you made, such as >> >> > page size, etc from the base? 
>> >> > >> >> > I'm taking a look at this issue on our machine as well, although I'm not >> >> > seeing the memory change that you reported. >> >> > >> >> > Matt >> >> > >> >> > >> >> >> > >> > >> > -------------- next part --------------
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
326000       Ns
1            # of NBs
80           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
16           Ps
32           Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                            Number of additional problem sizes for PTRANS
1200 10000 30000             values of N
0                            number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64     values of NB
From rafaarco at ugr.es Mon Mar 16 10:13:16 2009 From: rafaarco at ugr.es (Rafael Arco Arredondo) Date: Mon Mar 16 10:13:29 2009 Subject: [mvapich-discuss] Errors spawning processes with mpirun_rsh In-Reply-To: <1235409912.29473.5.camel@t13.nowlab.cis.ohio-state.edu> References: <1235382326.13614.24.camel@boabdilmec.ugr.es> <49A2B6DF.5050700@cse.ohio-state.edu> <1235408916.8012.39.camel@localhost> <1235409912.29473.5.camel@t13.nowlab.cis.ohio-state.edu> Message-ID: <1237212796.20856.13.camel@boabdilmec.ugr.es> Hi Jaidev, Sorry for the delay. I had some other business to deal with :).
I just found out the problem goes away when MVAPICH/MVAPICH2 are compiled with GCC instead of PathScale. I also tried compiling with PathScale and -O2 instead of the default -O3, but it also crashes.

Here is the backtrace for MVAPICH (PathScale and -O3):

#0  0x00002b6b1da52094 in _int_free (av=0x0, mem=0x2b6b1d8c8e40) at ptmalloc2/malloc.c:4346
#1  0x00002b6b1da509d7 in free (mem=0x50f950) at ptmalloc2/malloc.c:3473
#2  0x00002b6b1e05f942 in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
#3  0x00002b6b1e0ced7e in __res_vinit () from /lib64/libc.so.6
#4  0x00002b6b1e0d0325 in __res_maybe_init () from /lib64/libc.so.6
#5  0x00002b6b1e0d1ace in __nss_hostname_digits_dots () from /lib64/libc.so.6
#6  0x00002b6b1e0d6530 in gethostbyname () from /lib64/libc.so.6
#7  0x00002b6b1da622a9 in pmgr_open () at /tmp/mvapich-1.1/mpid/ch_gen2/process/pmgr_collective_client.c:859
#8  0x0000000049be4cf0 in ?? ()
#9  0x00000000000a2f1d in ?? ()
#10 0x0000000000000002 in ?? ()
#11 0x00002b6b1da92bd0 in ?? () from /usr/local/apps/mpi/mvapich-1.1_psc_dbg/lib/shared/libmpich.so.1.0
#12 0x0000000000000002 in ?? ()
#13 0x0000000000000002 in ?? ()
#14 0x1999999999999999 in ?? ()
#15 0x00007fff8d1fac70 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

And here is the one for MVAPICH2:

#0  0x00002ac12a18610e in _int_free (av=0x2ac12a3e6050, mem=0x501010) at /tmp/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/memory/ptmalloc2/mvapich_malloc.c:4387
#1  0x00002ac12a1842de in free (mem=0x501010) at /tmp/mvapich2-1.2p1/src/mpid/ch3/channels/mrail/src/memory/ptmalloc2/mvapich_malloc.c:3476
#2  0x00002ac12a111814 in DLOOP_Dataloop_create_basic_all_bytes_struct (count=2, blklens=0x7fff80b90690, disps=0x7fff80b90680, oldtypes=0x7fff80b90670, dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8, dldepth_p=0x2ac12a3d4cec, flag=0) at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_struct.c:527
#3  0x00002ac12a11094e in MPID_Dataloop_create_struct (count=2, blklens=0x7fff80b90690, disps=0x7fff80b90680, oldtypes=0x7fff80b90670, dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8, dldepth_p=0x2ac12a3d4cec, flag=0) at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_struct.c:225
#4  0x00002ac12a11039e in MPID_Dataloop_create_pairtype (type=-1946157056, dlp_p=0x2ac12a3d4ce0, dlsz_p=0x2ac12a3d4ce8, dldepth_p=0x2ac12a3d4cec, flag=0) at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/dataloop/dataloop_create_pairtype.c:74
#5  0x00002ac12a176b8f in MPID_Type_create_pairtype (type=-1946157056, new_dtp=0x2ac12a3d4c70) at /tmp/mvapich2-1.2p1/src/mpid/common/datatype/mpid_type_create_pairtype.c:177
#6  0x00002ac12a216c3a in MPIR_Datatype_init () at /tmp/mvapich2-1.2p1/src/mpi/datatype/typeutil.c:133
#7  0x00002ac12a156cdb in MPIR_Init_thread (argc=0x7fff80b908c0, argv=0x7fff80b908c8, required=0, provided=0x0) at /tmp/mvapich2-1.2p1/src/mpi/init/initthread.c:287
#8  0x00002ac12a156216 in PMPI_Init (argc=0x7fff80b908c0, argv=0x7fff80b908c8) at /tmp/mvapich2-1.2p1/src/mpi/init/init.c:135
#9  0x00000000004007be in main (argc=1, argv=0x7fff80b909b8) at /SCRATCH/rafaarco/mpi/mpihello.c:10

Anyway, the version compiled with GCC seems to work fine.
Thanks again for your help and best regards,

Rafa

On Mon, 23-02-2009 at 12:25 -0500, Jaidev Sridhar wrote:
> Hi Rafael,
>
> On Mon, 2009-02-23 at 18:08 +0100, Rafael Arco Arredondo wrote:
> > Hi Jaidev,
> >
> > Thank you for your prompt reply.
> >
> > > The message indicates that the application terminated with a non zero
> > > error code or crashed after launching. Can you check if it leaves any
> > > core files? You may need to set ulimit to unlimited. For example, add
> > > ulimit -c unlimited in your ~/.bashrc.
> >
> > Yes, a core file is generated after adding 'ulimit -c unlimited' to
> > $HOME/.bashrc.
>
> Can you send us the backtrace from this core file -
> $ gdb ./mpihello core.xyz
> (gdb) bt
>
> If you have core files from both mvapich and mvapich2 runs, we'd like to
> see them. This will provide more insights.
>
> It'll be more useful if you can compile the libraries and your
> application with debug symbols:
> * For mvapich2, configure the libraries with --enable-g=dbg and
>   compile your application with mpicc -g
> * For mvapich, edit make.mvapich.gen2, add -g to CFLAGS and compile
>   your application with mpicc -g
>
> -Jaidev
>
> >
> > > Can you also give us details of the cluster and any options you've
> > > enabled with MVAPICH / MVAPICH2?
> >
> > It is a cluster of servers with AMD64 Opteron processors, an Infiniband
> > network and Sun Grid Engine 6.2 as batch scheduler (anyway this error is
> > reported both when SGE controls the jobs and when it doesn't, when
> > mpirun_rsh is directly executed from the command line).
> >
> > In order to compile MVAPICH, the PathScale compiler was used (for which
> > the make.mvapich.gen2 script was accordingly edited), shared library
> > support was enabled and the flag -DXRC was removed. The rest of the
> > options, including the configuration files in $MVAPICH_HOME/etc, wasn't
> > modified (i.e., default values are used).
> > > > As for MVAPICH2, it was compiled by invoking the configure script this > > way: > > > > ./configure --enable-sharedlibs=gcc CC=pathcc F77=pathf90 F90=pathf90 > > CXX=pathCC > > > > And then plain 'make' and 'make install'. Again, the other options > > weren't changed. > > > > MVAPICH and MVAPICH2 compile with no problems, so do programs compiled > > with mpicc. However, programs crash on the initialization stage after > > launching as you said. > > > > Any ideas? > > > > Thanks again, > > > > Rafa > > > > > On 02/23/2009 04:45 AM, Rafael Arco Arredondo wrote: > > > > Hello, > > > > > > > > I'm having some issues with mpirun_rsh within both MVAPICH 1.1 and > > > > MVAPICH2 1.2p1. As I commented in another email to the list some time > > > > ago, mpirun_rsh is the only mechanism we can use to create MPI processes > > > > in our configuration. > > > > > > > > The command issued is: > > > > mpirun_rsh -ssh -np 2 -hostfile ./machines ./mpihello > > > > > > > > And the error reported by mpirun_rsh is: > > > > > > > > Exit code -5 signaled from localhost > > > > MPI process terminated unexpectedly > > > > Killing remote processes...DONE > > > > > > > > We also got this on some of our machines: > > > > > > > > Child exited abnormally! > > > > Killing remote processes...DONE > > > > > > > > mpihello is a simple hello world and this happens even when the > > > > processes are launched on localhost only. > > > > > > > > OFED 1.2 is used as the underlying Infiniband libraries, and both > > > > MVAPICH and MVAPICH2 were compiled with the OpenFabrics/Gen2 single-rail > > > > option, without XRC as indicated in the user's guide for OFED libraries > > > > prior to version 1.3. > > > > > > > > Any help will be kindly appreciated. 
> > > > > > > > Thank you in advance, > > > > > > > > Rafa > > > > > > > > > > > >
From ajay35 at in.com Thu Mar 19 12:55:48 2009 From: ajay35 at in.com (Ajay) Date: Thu Mar 19 13:03:19 2009 Subject: [mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error Message-ID: <1237481748.c8cd63e1bf13c5016881652983fb615a@mail.in.com>
Hello,

I am using "mvapich2-1.2p1" of "OFED-1.4.1-RC1". I am trying to run mvapich2 over uDAPL ("compat-dapl-1.2.12" with "OpenIB-CMA") and while running IMB-EXT application (of Intel MPI Benchmarks 3.2), I am getting "IBV_EVENT_QP_FATAL" event. I am getting error in "Bidir_Get" API of IMB-EXT. If I try to run only Bidir_Get of IMB-EXT then test works fine.

Basically error is I have received some data but I don't have buffers posted for same.

I found some workarounds and will like to understand more. Following are some queries:

If value of "MV2_PREPOST_DEPTH" variable (udapl_prepost_depth) is
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090319/06bbea6c/attachment.html
From yye00 at tacc.utexas.edu Thu Mar 19 13:21:17 2009 From: yye00 at tacc.utexas.edu (Yaakoub El Khamra) Date: Thu Mar 19 13:21:27 2009 Subject: [mvapich-discuss] mvapich2-1.2-2009-03-17 compilation issue Message-ID: <47a831090903191021x76cb983bie0c91b8eecd4c9c7@mail.gmail.com>
I am trying to compile mvapich2-1.2-2009-03-17 (from tarball) with the following options: ./configure --prefix=/work/01125/yye00/DebugStack/latest_mvapich2_install/ --enable-error-checking=all --enable-error-messages=all --enable-timing-level=all --enable-g=all --enable-debuginfo --enable-threads=single CC=icc CXX=icpc F77=ifort F90=ifort --with-ib-libpath=/opt/ofed/lib64/ --with-ib-include=/opt/ofed/include/ This cause the following error: ch3u_rma_sync.c(205): warning #1899: multicharacter character literal (potential portability problem) int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int)); ^
ch3u_rma_sync.c(205): error: expected a ";" int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int)); ^ ch3u_rma_sync.c(233): warning #1899: multicharacter character literal (potential portability problem) free(ranks_in_win_grp); ^ ch3u_rma_sync.c(233): error: expected a ";" free(ranks_in_win_grp); On further investigation, it seems that the preprocessing step is where this breaks down. Malloc seems to be defined as: 'Error use MPIU_Malloc' :::; and similarly free is defined as: 'Error use MPIU_Free' :::; These #defines are in src/include/mpimem.h which manages to find its way into the ch3u_rma_sync.c . Is there a way to guard against this? Should I be using different configuration flags? Thanks in advance. Regards Yaakoub El Khamra From balaji at mcs.anl.gov Thu Mar 19 15:21:59 2009 From: balaji at mcs.anl.gov (Pavan Balaji) Date: Thu Mar 19 15:22:13 2009 Subject: [mvapich-discuss] mvapich2-1.2-2009-03-17 compilation issue In-Reply-To: <47a831090903191021x76cb983bie0c91b8eecd4c9c7@mail.gmail.com> References: <47a831090903191021x76cb983bie0c91b8eecd4c9c7@mail.gmail.com> Message-ID: <49C29B57.6050907@mcs.anl.gov> The correct fix to this is for the MVAPICH2 code to use MPIU_ routines rather than malloc/free directly. Basically, MPICH2 provides wrapper functions that allow it to do memory allocation and usage debugging, that are enabled when you do --enable-g=mem or meminit (when you do --enable-g=all, everything is added). Just replacing malloc and free with MPIU_Malloc/MPIU_Free should be OK, but there are better macros available too: http://wiki.mcs.anl.gov/mpich2/index.php/Coding_Standards#Memory_Allocation_in_MPICH2. As a short term fix, you can just disable memory debugging hooks. There are other useful ones too: dbg, log, handle, handlealloc, etc (you can just look in the top-level configure.in and search for --enable-g). 
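To make the mechanism described above concrete, here is a minimal sketch of how such "poisoning" macros work. It is not the MPICH2 source: DEMO_Malloc, DEMO_Free, DEMO_Live_blocks and make_rank_list are hypothetical stand-ins for the MPIU_ wrappers, and only the macro form ('Error use ...' :::) is taken from the report in this thread. The wrappers are defined before the macros, so they still reach the real allocator; any later direct call to malloc() or free() expands to tokens that cannot compile.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for MPIU_Malloc / MPIU_Free: they wrap the real
 * allocator and count live blocks so that leaks can be detected. */
static long demo_live_blocks = 0;

static void *DEMO_Malloc(size_t nbytes)
{
    void *p = malloc(nbytes);      /* real allocator: macro not defined yet */
    if (p != NULL)
        demo_live_blocks++;
    return p;
}

static void DEMO_Free(void *p)
{
    if (p != NULL)
        demo_live_blocks--;
    free(p);                       /* real deallocator: macro not defined yet */
}

static long DEMO_Live_blocks(void)
{
    return demo_live_blocks;
}

/* From here on, a direct call expands to a multicharacter literal followed
 * by ":::", which is the kind of error reported for ch3u_rma_sync.c. */
#define malloc(a) 'Error use DEMO_Malloc' :::
#define free(a)   'Error use DEMO_Free' :::

/* Builds the list 0 .. comm_size-2 through the wrapper instead of the
 * poisoned allocator. Returns NULL when there is nothing to allocate. */
static int *make_rank_list(int comm_size)
{
    int i;
    int *ranks;

    assert(comm_size >= 1);
    if (comm_size < 2)
        return NULL;

    ranks = (int *) DEMO_Malloc((size_t) (comm_size - 1) * sizeof(int));
    if (ranks == NULL)
        return NULL;
    for (i = 0; i < comm_size - 1; i++)
        ranks[i] = i;
    return ranks;
}
```

Changing the DEMO_Malloc call inside make_rank_list back to a plain malloc call should reproduce a compile error of the same shape as the one above, which is why switching the offending lines to the wrapper routines (or building without the memory debugging hooks) makes it go away.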
-- Pavan Yaakoub El Khamra wrote: > I am trying to compile mvapich2-1.2-2009-03-17 (from tarball) with the > following options: > ./configure --prefix=/work/01125/yye00/DebugStack/latest_mvapich2_install/ > --enable-error-checking=all --enable-error-messages=all > --enable-timing-level=all --enable-g=all --enable-debuginfo > --enable-threads=single CC=icc CXX=icpc F77=ifort F90=ifort > --with-ib-libpath=/opt/ofed/lib64/ > --with-ib-include=/opt/ofed/include/ > > This cause the following error: > ch3u_rma_sync.c(205): warning #1899: multicharacter character literal > (potential portability problem) > int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int)); > ^ > > ch3u_rma_sync.c(205): error: expected a ";" > int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int)); > ^ > > ch3u_rma_sync.c(233): warning #1899: multicharacter character literal > (potential portability problem) > free(ranks_in_win_grp); > ^ > > ch3u_rma_sync.c(233): error: expected a ";" > free(ranks_in_win_grp); > > > On further investigation, it seems that the preprocessing step is > where this breaks down. Malloc seems to be defined as: 'Error use > MPIU_Malloc' :::; and similarly free is defined as: 'Error use > MPIU_Free' :::; > > These #defines are in src/include/mpimem.h which manages to find its > way into the ch3u_rma_sync.c . Is there a way to guard against this? > Should I be using different configuration flags? > Thanks in advance. 
> > > > > Regards > Yaakoub El Khamra > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Pavan Balaji http://www.mcs.anl.gov/~balaji From yye00 at tacc.utexas.edu Thu Mar 19 15:53:40 2009 From: yye00 at tacc.utexas.edu (Yaakoub El Khamra) Date: Thu Mar 19 15:54:28 2009 Subject: [mvapich-discuss] mvapich2-1.2-2009-03-17 compilation issue In-Reply-To: <49C29B57.6050907@mcs.anl.gov> References: <47a831090903191021x76cb983bie0c91b8eecd4c9c7@mail.gmail.com> <49C29B57.6050907@mcs.anl.gov> Message-ID: <47a831090903191253r6d8b58c3n9aec6a4b39297c23@mail.gmail.com> My fix for now is using --enable-g=handle,dbg,log but I full intend to get it to work with MPIU_Malloc/Free malloc and free. It would be easier to detect memory leaks that way. Thank you for pointing out the coding standards page, it will come in handy. Regards Yaakoub El Khamra On Thu, Mar 19, 2009 at 2:21 PM, Pavan Balaji wrote: > > The correct fix to this is for the MVAPICH2 code to use MPIU_ routines > rather than malloc/free directly. Basically, MPICH2 provides wrapper > functions that allow it to do memory allocation and usage debugging, > that are enabled when you do --enable-g=mem or meminit (when you do > --enable-g=all, everything is added). > > Just replacing malloc and free with MPIU_Malloc/MPIU_Free should be OK, > but there are better macros available too: > http://wiki.mcs.anl.gov/mpich2/index.php/Coding_Standards#Memory_Allocation_in_MPICH2. > > As a short term fix, you can just disable memory debugging hooks. There > are other useful ones too: dbg, log, handle, handlealloc, etc (you can > just look in the top-level configure.in and search for --enable-g). 
>
>  -- Pavan
>
> Yaakoub El Khamra wrote:
>> I am trying to compile mvapich2-1.2-2009-03-17 (from tarball) with the
>> following options:
>>  ./configure --prefix=/work/01125/yye00/DebugStack/latest_mvapich2_install/
>> --enable-error-checking=all --enable-error-messages=all
>> --enable-timing-level=all --enable-g=all --enable-debuginfo
>> --enable-threads=single CC=icc CXX=icpc F77=ifort F90=ifort
>> --with-ib-libpath=/opt/ofed/lib64/
>> --with-ib-include=/opt/ofed/include/
>>
>> This cause the following error:
>> ch3u_rma_sync.c(205): warning #1899: multicharacter character literal
>> (potential portability problem)
>>           int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int));
>>                                           ^
>>
>> ch3u_rma_sync.c(205): error: expected a ";"
>>           int *ranks_in_win_grp = (int *) malloc((comm_size - 1) * sizeof(int));
>>                                           ^
>>
>> ch3u_rma_sync.c(233): warning #1899: multicharacter character literal
>> (potential portability problem)
>>           free(ranks_in_win_grp);
>>           ^
>>
>> ch3u_rma_sync.c(233): error: expected a ";"
>>           free(ranks_in_win_grp);
>>
>>
>> On further investigation, it seems that the preprocessing step is
>> where this breaks down. Malloc seems to be defined as: 'Error use
>> MPIU_Malloc' :::; and similarly free is defined as: 'Error use
>> MPIU_Free' :::;
>>
>> These #defines are in src/include/mpimem.h which manages to find its
>> way into the ch3u_rma_sync.c . Is there a way to guard against this?
>> Should I be using different configuration flags?
>> Thanks in advance.
>> >> >> >> >> Regards >> Yaakoub El Khamra >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > -- > Pavan Balaji > http://www.mcs.anl.gov/~balaji >
From xingjinglu at gmail.com Thu Mar 19 21:55:38 2009 From: xingjinglu at gmail.com (xingjinglu) Date: Thu Mar 19 23:10:47 2009 Subject: [mvapich-discuss] hi, if mvapich support -mpitrace Message-ID: <000901c9a8fe$fd640580$f82c1080$@com>
I compiled the NPB programs with MVAPICH 1.1 using this command:

mpicc -mpitrace -L/home/paraorc/mpe/lib -ltmpe

This produced an executable, e.g. lu.A.4, but I can't run it:

mpirun_rsh -rsh -np 4 -hostfile ./hosts ./bin/lu.A.4

It fails with the error below:

MPI process terminated unexpectedly
Exit code -5 signaled from gnode21
Killing remote processes...DONE

Eric.Lu
-------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090320/b7e58034/attachment-0001.html
From panda at cse.ohio-state.edu Thu Mar 19 23:24:50 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Thu Mar 19 23:24:59 2009 Subject: [mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error In-Reply-To: <1237481748.c8cd63e1bf13c5016881652983fb615a@mail.in.com> Message-ID:
Thanks for your report. We are taking a look at it and will get back to you soon.

DK

On Thu, 19 Mar 2009, Ajay wrote:

> Hello, I am using "mvapich2-1.2p1" of "OFED-1.4.1-RC1". I am trying to run mvapich2 over uDAPL ("compat-dapl-1.2.12" with "OpenIB-CMA") and while running IMB-EXT application (of Intel MPI Benchmarks 3.2), I am getting "IBV_EVENT_QP_FATAL" event. I am getting error in "Bidir_Get" API of IMB-EXT. If I try to run only Bidir_Get of IMB-EXT then test works fine. Basically error is I have received some data but I don't have buffers posted for same. I found some workarounds and will like to understand more.
Following are some queries: If value of "MV2_PREPOST_DEPTH" variable (udapl_prepost_depth) is
>
From rafaarco at ugr.es Fri Mar 20 07:00:13 2009 From: rafaarco at ugr.es (Rafael Arco Arredondo) Date: Fri Mar 20 07:00:24 2009 Subject: [mvapich-discuss] Process management and TCP connections Message-ID: <1237546813.9561.17.camel@boabdilmec.ugr.es>
Hello,

I compiled MVAPICH 1.1 and MVAPICH2 1.2 with OpenFabrics-Gen2 support, and it works pretty well. However, I realized several TCP connections are created for process creation and control by mpirun_rsh, mpispawn and rsh/ssh itself. These connections are established over the default network interface, Gigabit Ethernet in our case, and not Infiniband.

I was wondering if it's possible to configure MVAPICH/MVAPICH2 so that connections are exclusively created over Infiniband.

Thank you in advance and best regards,

Rafa Arco
From perkinjo at cse.ohio-state.edu Fri Mar 20 08:00:10 2009 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Fri Mar 20 08:00:23 2009 Subject: [mvapich-discuss] Process management and TCP connections In-Reply-To: <1237546813.9561.17.camel@boabdilmec.ugr.es> References: <1237546813.9561.17.camel@boabdilmec.ugr.es> Message-ID: <20090320120010.GA3140@cse.ohio-state.edu>
Rafa:

Hi, this should be possible by configuring your network to use IPoIB. You'll need to specify hostnames that resolve to the IPoIB addresses instead of the ethernet addresses when you're launching your applications. Hope this helps.

On Fri, Mar 20, 2009 at 12:00:13PM +0100, Rafael Arco Arredondo wrote:
> Hello,
>
> I compiled MVAPICH 1.1 and MVAPICH2 1.2 with OpenFabrics-Gen2 support,
> and it works pretty well. However, I realized several TCP connections
> are created for process creation and control by mpirun_rsh, mpispawn and
> rsh/ssh itself. These connections are established over the default
> network interface, Gigabit Ethernet in our case, and not Infiniband.
> > I was wondering if it's possible to configure MVAPICH/MVAPICH2 so that > connections are exclusively created over Infiniband. > > Thank you in advance and best regards, > > Rafa Arco > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 197 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090320/a6fd4c25/attachment.bin From xuji at home.ipe.ac.cn Fri Mar 20 08:51:53 2009 From: xuji at home.ipe.ac.cn (xuji) Date: Fri Mar 20 08:52:02 2009 Subject: [mvapich-discuss] How to use Infiniband in Fluent Message-ID: <200903202051472818192@home.ipe.ac.cn> Hi all: Do someone know how to use infiniband in Fluent ? Or how to use mvapich in Fluent? Appreciate in advance! Best wishes! 2009-03-20 xuji -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20090320/a0571654/attachment.html From xmxmxie at gmail.com Fri Mar 20 09:08:40 2009 From: xmxmxie at gmail.com (Xie Min) Date: Fri Mar 20 09:08:51 2009 Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage? In-Reply-To: References: <91bd441b0903140806t70b41bd5j2aad051a6bfffa5@mail.gmail.com> Message-ID: <91bd441b0903200608h2afcc0b5q172a974cfb8fc91e@mail.gmail.com> Thanks for the new patch. We did several tests today, but the testing system is busy today, so we don't have enough node resource and time to do test. We didn't run the HPCC to completion yet. For 512 tasks, seems this patch works, but we just run it for 30 minutes, in which HPL was running normally. 
For 1024 tasks, in one test, the HPL is deadlock when running for about 5 minute (while using the old patch, HPL will be deadlock at the start), but our testing system is not stable today, maybe the deadlock is caused by network failure. We need to find enough resource and time to do more tests. 2009/3/19 Matthew Koop : > Xie, > > Can you try the following attached patch instead of the other patch? We > found some places where a deadlock may have been allowed to occur. > > Thank you, > Matt > > On Sat, 14 Mar 2009, Xie Min wrote: > >> Today, we do some other tests. >> >> For 128 tasks HPCC (16 nodes * 8), we can run the whole test >> successfully and get final result. >> >> But for 512 tasks HPCC (64 nodes * 8), HPL is freezed too when >> MV2_USE_LAZY_MEM_UNREGISTER=1. >> >> I attach an input file for 512 tasks HPCC (about 1.6GB for each task), >> maybe you can try it on your systems to see if it will produce the >> same problem. >> >> Thanks. >> >> 2009/3/12 Matthew Koop : >> > Xie, >> > >> > Thanks for sending this information along. We've spent some time >> > investigating the issue and came up with a patch that will hopefully >> > resolve your issue. I've attached it to this email and it should be >> > applied at the base directory. >> > >> > Please let us know if this helps the problem, >> > >> > Matt >> > >> > On Mon, 9 Feb 2009, Xie Min wrote: >> > >> >> The hpcc we used is HPCC 1.0.0, but we just tried HPCC 1.3.1, seems >> >> has the same problem. >> >> >> >> In the attachment we attached two hpccinf.txt files for 64 HPCC tasks, >> >> the hpccinf.txt.13 is the "RES" of about 1.3GB, while hpccinf.txt.16 >> >> is the "RES" of about 1.6/1.7GB. Whould you please try them on your >> >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1), thanks. >> >> >> >> BTW, the OFED version we used is 1.3.1, physical memory on each node >> >> is 16GB, use 8 nodes for 64 tasks. >> >> >> >> >> >> >> >> 2009/2/7 Matthew Koop : >> >> > >> >> > Thanks for the additional information. 
I've tried here with HPCC 1.3.1 and >> >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory >> >> > while running. >> >> > >> >> > Would it be possible to send me your hpccinf.txt file so I can more >> >> > closely try to reproduce the problem? We also have AS5 with kernel 2.6.18 >> >> > as well. >> >> > >> >> > Thanks, >> >> > >> >> > Matt >> >> > >> >> > On Thu, 5 Feb 2009, Xie Min wrote: >> >> > >> >> >> We use Redhat AS5, kernel is 2.6.18 with lustre 1.6.6, and we don't >> >> >> modify kernel source. >> >> >> >> >> >> We test HPCC on two clusters: >> >> >> In one cluster, each node is booted using Boot over IB, it has no >> >> >> harddisk, so NO swap space. We run 64 HPCC tasks on 8 nodes (so each >> >> >> CPU core in the node will run one HPCC task), when each HPCC task use >> >> >> 1.2/1.3G memory, it will be killed by OS because of "Out of memory" >> >> >> error. But when MV2_USE_LAZY_MEM_UNREGISTER=0, task can use 1.7G >> >> >> memory and run successfully. >> >> >> >> >> >> In another cluster, each node has harddisk, it booted from local disk, >> >> >> and it HAS space space. We run 64 HPCC tasks on 8 nodes too. When each >> >> >> HPCC use 1.3G memory, we use "top" to show the memory usage >> >> >> information, we found swap will be used when HPCC is running for a >> >> >> while, and the node begin to run very slowly and cannot respond to >> >> >> keyboard input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can >> >> >> be set to 1.7G memory scale and run successfully. >> >> >> >> >> >> I tried another mvapich2 parameters: MV2_USE_LAZY_MEM_UNREGISTER=1, >> >> >> and MV2_NDREG_ENTRIES=8. In this configuration, HPCC is still be >> >> >> killed by OS with "Out of memory" error when the memory scale of each >> >> >> task is set to 1.3GB. >> >> >> >> >> >> 2009/2/5 Matthew Koop : >> >> >> > Hi, >> >> >> > >> >> >> > What OS/distro are you running? Are there any changes you made, such as >> >> >> > page size, etc from the base? 
>> >> >> > >> >> >> > I'm taking a look at this issue on our machine as well, although I'm not >> >> >> > seeing the memory change that you reported. >> >> >> > >> >> >> > Matt >> >> >> > >> >> >> > >> >> >> >> >> > >> >> > >> >> >> > >> > From chai.15 at osu.edu Mon Mar 23 17:30:49 2009 From: chai.15 at osu.edu (Lei Chai) Date: Mon Mar 23 17:30:55 2009 Subject: [mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error In-Reply-To: <1237481748.c8cd63e1bf13c5016881652983fb615a@mail.in.com> References: <1237481748.c8cd63e1bf13c5016881652983fb615a@mail.in.com> Message-ID: <49C7FF89.6040009@osu.edu> Hi Ajay, Thanks for your report and detailed information. We found a bug in this part. Basically the noop message is to send credit to the remote side. So after certain number of receives (udapl_dynamic_credit_threshold/udapl_credit_notify_threshold) the process sends a noop message. Since the total number of receive buffers are 80 by default (udapl_prepost_depth), there could be as many as 8 noop messages outstanding and the remote side has to have at least 8 receive buffers for receiving these noop messages. The current value (udapl_prepost_noop_extra) is 5 and that's why it failed. So your workarounds are actually fixes. I suggest you use udapl_prepost_noop_extra=8. We will fix it in our code base also. Thanks, Lei Ajay wrote: > Hello, > > I am using "mvapich2-1.2p1" of "OFED-1.4.1-RC1". I am trying to run > mvapich2 over uDAPL ("compat-dapl-1.2.12" with "OpenIB-CMA") and while > running IMB-EXT application (of Intel MPI Benchmarks 3.2), I am > getting "IBV_EVENT_QP_FATAL" event. I am getting error in "Bidir_Get" > API of IMB-EXT. If I try to run only Bidir_Get of IMB-EXT then test > works fine. > > Basically error is I have received some data but I don't have buffers > posted for same. > > I found some workarounds and will like to understand more. Following > are some queries: > > 1. 
If value of "MV2_PREPOST_DEPTH" variable (udapl_prepost_depth) > is <= 59; then this test works fine. I don't understand how > reducing number of prepost buffer resolved this issue. Any > suggestions? > 2. If value of variable "udapl_prepost_noop_extra" is changed from > 5 to 8 then this test works fine. Subsequently, if value of > "udapl_initial_credits" is changed from 5 to 2 then test works > fine (with no change in "udapl_prepost_noop_extra" variable). > Similarly, if value of "remote_credit" is changed from 5 to 2 > inside function MRAILI_Init_vc(), then this test works fine. I > will like to know what's use of "remote_credit" variable. And, > what's use of "remote_cc" and "rdma_credit" variables? > 3. If value of variables "udapl_dynamic_credit_threshold" and > "udapl_credit_notify_threshold" are changed from 10 to >=13 then > this test works fine (basically I was trying to send NOOP after > 15 post_recvs() instead of 10). I think NOOP sends are for > synchronization; so how does less no. of NOOPs are resolving > this issue? > > Thanks in Advance. > > Regards, > Ajay > > > > Dear *mvapich-discuss!* Get Yourself a cool, short *@in.com* Email ID > now! > > ------------------------------------------------------------------------ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From panda at cse.ohio-state.edu Mon Mar 23 20:20:18 2009 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Mon Mar 23 20:20:27 2009 Subject: [mvapich-discuss] How to use Infiniband in Fluent In-Reply-To: <200903202051472818192@home.ipe.ac.cn> Message-ID: You can check with your Fluent representative about this. DK On Fri, 20 Mar 2009, xuji wrote: > Hi all: > > Do someone know how to use infiniband in Fluent ? > Or how to use mvapich in Fluent? > > Appreciate in advance! > > Best > wishes! 
>
> 2009-03-20
>
> xuji
>

From panda at cse.ohio-state.edu Mon Mar 23 20:23:22 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Mon Mar 23 20:23:31 2009
Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?
In-Reply-To: <91bd441b0903200608h2afcc0b5q172a974cfb8fc91e@mail.gmail.com>
Message-ID:

> Thanks for the new patch.
>
> We did several tests today, but the testing system is busy today, so
> we don't have enough node resource and time to do test.
> We didn't run the HPCC to completion yet.
>
> For 512 tasks, seems this patch works, but we just run it for 30
> minutes, in which HPL was running normally.

Good to know about this.

> For 1024 tasks, in one test, the HPL is deadlock when running for
> about 5 minute (while using the old patch, HPL will be deadlock at the
> start), but our testing system is not stable today, maybe the deadlock
> is caused by network failure. We need to find enough
> resource and time to do more tests.

Once you have a stable system, let us know how it works for 1024 tasks.

Thanks,

DK

> 2009/3/19 Matthew Koop :
> > Xie,
> > Can you try the following attached patch instead of the other patch? We
> > found some places where a deadlock may have been allowed to occur.
> >
> > Thank you,
> > Matt
> >
> > On Sat, 14 Mar 2009, Xie Min wrote:
> >
> >> Today, we do some other tests.
> >>
> >> For 128 tasks HPCC (16 nodes * 8), we can run the whole test
> >> successfully and get final result.
> >>
> >> But for 512 tasks HPCC (64 nodes * 8), HPL is freezed too when
> >> MV2_USE_LAZY_MEM_UNREGISTER=1.
> >>
> >> I attach an input file for 512 tasks HPCC (about 1.6GB for each task),
> >> maybe you can try it on your systems to see if it will produce the
> >> same problem.
> >>
> >> Thanks.
> >>
> >> 2009/3/12 Matthew Koop :
> >> > Xie,
> >> >
> >> > Thanks for sending this information along. We've spent some time
> >> > investigating the issue and came up with a patch that will hopefully
> >> > resolve your issue. I've attached it to this email and it should be
> >> > applied at the base directory.
> >> >
> >> > Please let us know if this helps the problem,
> >> >
> >> > Matt
> >> >
> >> > On Mon, 9 Feb 2009, Xie Min wrote:
> >> >
> >> >> The hpcc we used is HPCC 1.0.0, but we just tried HPCC 1.3.1, seems
> >> >> has the same problem.
> >> >>
> >> >> In the attachment we attached two hpccinf.txt files for 64 HPCC tasks,
> >> >> the hpccinf.txt.13 is the "RES" of about 1.3GB, while hpccinf.txt.16
> >> >> is the "RES" of about 1.6/1.7GB. Would you please try them on your
> >> >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1), thanks.
> >> >>
> >> >> BTW, the OFED version we used is 1.3.1, physical memory on each node
> >> >> is 16GB, use 8 nodes for 64 tasks.
> >> >>
> >> >> 2009/2/7 Matthew Koop :
> >> >> >
> >> >> > Thanks for the additional information. I've tried here with HPCC 1.3.1 and
> >> >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory
> >> >> > while running.
> >> >> >
> >> >> > Would it be possible to send me your hpccinf.txt file so I can more
> >> >> > closely try to reproduce the problem? We also have AS5 with kernel 2.6.18
> >> >> > as well.
> >> >> >
> >> >> > Thanks,
> >> >> >
> >> >> > Matt
> >> >> >
> >> >> > On Thu, 5 Feb 2009, Xie Min wrote:
> >> >> >
> >> >> >> We use Redhat AS5, kernel is 2.6.18 with lustre 1.6.6, and we don't
> >> >> >> modify kernel source.
> >> >> >>
> >> >> >> We test HPCC on two clusters:
> >> >> >> In one cluster, each node is booted using Boot over IB, it has no
> >> >> >> harddisk, so NO swap space. We run 64 HPCC tasks on 8 nodes (so each
> >> >> >> CPU core in the node will run one HPCC task), when each HPCC task use
> >> >> >> 1.2/1.3G memory, it will be killed by OS because of "Out of memory"
> >> >> >> error. But when MV2_USE_LAZY_MEM_UNREGISTER=0, task can use 1.7G
> >> >> >> memory and run successfully.
> >> >> >>
> >> >> >> In another cluster, each node has harddisk, it booted from local disk,
> >> >> >> and it HAS swap space. We run 64 HPCC tasks on 8 nodes too. When each
> >> >> >> HPCC use 1.3G memory, we use "top" to show the memory usage
> >> >> >> information, we found swap will be used when HPCC is running for a
> >> >> >> while, and the node begin to run very slowly and cannot respond to
> >> >> >> keyboard input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can
> >> >> >> be set to 1.7G memory scale and run successfully.
> >> >> >>
> >> >> >> I tried another mvapich2 parameters: MV2_USE_LAZY_MEM_UNREGISTER=1,
> >> >> >> and MV2_NDREG_ENTRIES=8. In this configuration, HPCC is still be
> >> >> >> killed by OS with "Out of memory" error when the memory scale of each
> >> >> >> task is set to 1.3GB.
> >> >> >>
> >> >> >> 2009/2/5 Matthew Koop :
> >> >> >> > Hi,
> >> >> >> >
> >> >> >> > What OS/distro are you running? Are there any changes you made, such as
> >> >> >> > page size, etc from the base?
> >> >> >> >
> >> >> >> > I'm taking a look at this issue on our machine as well, although I'm not
> >> >> >> > seeing the memory change that you reported.
> >> >> >> >
> >> >> >> > Matt
> >> >> >> >

From perkinjo at cse.ohio-state.edu Mon Mar 23 20:56:51 2009
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Mon Mar 23 20:57:05 2009
Subject: [mvapich-discuss] hi, if mvapich support -mpitrace
In-Reply-To: <000901c9a8fe$fd640580$f82c1080$@com>
References: <000901c9a8fe$fd640580$f82c1080$@com>
Message-ID: <20090324005649.GG3033@cse.ohio-state.edu>

Eric:
How did you compile mvapich?
I've simply removed the --without-mpe option from the
make.mvapich.gen2 script before building, in order to use the
pre-shipped version of mpe with mvapich.

If you want to use an external version of mpe, you should keep the
--without-mpe option when building mvapich. Then, when building the
app, you should be able to do something similar to what you posted,
except use -lmpe instead of -ltmpe.

On Fri, Mar 20, 2009 at 09:55:38AM +0800, xingjinglu wrote:
> Now I compile NPB program with mvapich1.1 , the cmd is: mpicc -mpitrace
> -L/home/paraorc/mpe/lib -ltmpe and I got the exe file, like lu.A.4
>
> But I can't run it
>
> Mpirun_rsh -rsh -np 4 -hostfile ./hosts ./bin/lu.A.4
>
> Shows error like below:
>
> MPI process terminated unexpectedly
>
> Exit code -5 signaled from gnode21
>
> Killing remote processes...DONE
>
> Eric.Lu

--
Jonathan Perkins
http://www.cse.ohio-state.edu/~perkinjo

From xmxmxie at gmail.com Tue Mar 24 10:05:25 2009
From: xmxmxie at gmail.com (Xie Min)
Date: Tue Mar 24 10:05:36 2009
Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?
In-Reply-To:
References: <91bd441b0903200608h2afcc0b5q172a974cfb8fc91e@mail.gmail.com>
Message-ID: <91bd441b0903240705o5975594xeb1bd3f3218205c1@mail.gmail.com>

We did some tests these days. Because we share our testing system with
others, and a complete HPL run takes more than two hours, we haven't
run the whole HPCC to completion yet.

Without any environment variable setting, we ran HPL about 10 times;
each HPL test runs for about 12 minutes. Several tests still
deadlocked: the "port_rcv_packets" counter of the IB card stopped
increasing.

One of our parallel applications met some problems; it often
segfaults. But when we set MV2_ON_DEMAND_THRESHOLD to a value equal to
the number of tasks, it runs successfully.

So we set only one environment variable, MV2_ON_DEMAND_THRESHOLD=1024,
and ran the HPL tests again about 10 times (128 * 8, 1024 tasks, "RES"
of each task is about 1.4GB); each HPL test runs for about 12 minutes.
All tests ran normally, and the "port_rcv_packets" counter of the IB
card increased continually during the test.

If on-demand connection is used, there seem to be some pthread lock
operations in vbuf.c; I am not sure whether this is related to the
deadlock.

From polk678 at gmail.com Tue Mar 24 12:12:09 2009
From: polk678 at gmail.com (gossips J)
Date: Tue Mar 24 12:12:21 2009
Subject: [mvapich-discuss] mvapich2-1.2p1 does not run successfully mpi1 application
Message-ID:

Hi Folks,
I am a student and want to know about mvapich2-1.2p1.

It does not run my MPI1 application successfully. Basically it gets
stuck somewhere in the middle of execution.

I am running this with 80 processes. I figured out that if I set the
"on demand threshold" environment setting to anything above 80, it
works fine without any issues.

Basically, what is causing this behavior?
Why does the test get stuck at some point?
How can I debug this?

If anybody can provide some insight on how to handle this with
mvapich2, that would be great.

Looking for help,

Thanks in advance,
Polk J.

From panda at cse.ohio-state.edu Tue Mar 24 12:56:15 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Tue Mar 24 12:56:24 2009
Subject: [mvapich-discuss] mvapich2-1.2p1 does not run successfully mpi1 application
In-Reply-To:
Message-ID:

Can you provide some details on your cluster (computing platform and
network adapter), your MPI1 application, how you configured
mvapich2-1.2p1 (default build or with any additional features) and
whether you are running it with any additional runtime environment
variables? Without such information, it is very hard to investigate
what is going on here.
DK

From mrepper at sandia.gov Tue Mar 24 19:30:33 2009
From: mrepper at sandia.gov (Marcus R. Epperson)
Date: Tue Mar 24 19:30:58 2009
Subject: [mvapich-discuss] [SPAM] "PMI Lookup name failed" when RDMA CM is used
Message-ID: <49C96D19.6090607@sandia.gov>

We have an Infiniband cluster which will require the use of RDMA CM, and
which uses the Slurm resource manager for job launch. I'm trying to
verify that mvapich2-1.2p1 will work with this combination but I'm not
having much luck so far.

I am able to run successfully when I don't enable mvapich2's RDMA CM
option (this won't be possible long-term though):

$ srun --mpi=none -w 'c1,c3' ./mpi_hello
  Hello, I am node c1 with rank 0
  Hello, I am node c3 with rank 1

But when I enable it I get this:

$ export MV2_USE_RDMA_CM=1
$ srun --mpi=none -w 'c1,c3' ./mpi_hello
  [1] Abort: PMI Lookup name failed
   at line 810 in file rdma_cm.c
  [0] Abort: PMI Lookup name failed
   at line 810 in file rdma_cm.c
  srun: error: c1: task 0: Exited with exit code 253
  srun: error: c3: task 1: Exited with exit code 253

I believe these nodes are configured correctly according to #6.4 here:

http://mvapich.cse.ohio-state.edu/support/user_guide_mvapich2-1.2.html#x1-300006.4

i.e. IPoIB is set up:

# pdsh -w 'c[1,3]' "ifconfig ib0 | grep inet.addr"
c1:   inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
c3:   inet addr:192.168.2.3  Bcast:192.168.2.255  Mask:255.255.255.0

and mv2.conf is present on each node:

# pdsh -w 'c[1,3]' "cat /etc/mv2.conf"
c1: 192.168.2.1
c3: 192.168.2.3

Have I missed something, or is this a bug? If it's a bug, is it with
mvapich2 or should I be looking elsewhere?

Thanks for any help,
-Marcus Epperson

From xmxmxie at gmail.com Wed Mar 25 07:21:44 2009
From: xmxmxie at gmail.com (Xie Min)
Date: Wed Mar 25 07:21:55 2009
Subject: [SPAM] Re: [mvapich-discuss] [SPAM] "PMI Lookup name failed" when RDMA CM is used
In-Reply-To: <49C96D19.6090607@sandia.gov>
References: <49C96D19.6090607@sandia.gov>
Message-ID: <91bd441b0903250421m255c6f8ciecabe75ff2c171fe@mail.gmail.com>

We hit this problem too. We set PMI_DEBUG=1 to see the output of Slurm
PMI and found that mvapich2 uses "ip " when calling PMI_KVS_Get();
please notice there is a blank in the parameter.
After deleting the blank in rdma_cm.c so that it uses "ip", Slurm PMI
seems to work.

From mrepper at sandia.gov Wed Mar 25 12:11:55 2009
From: mrepper at sandia.gov (Marcus R. Epperson)
Date: Wed Mar 25 12:12:28 2009
Subject: [SPAM] Re: [mvapich-discuss] "PMI Lookup name failed" when RDMA CM is used
In-Reply-To: <91bd441b0903250421m255c6f8ciecabe75ff2c171fe@mail.gmail.com>
References: <49C96D19.6090607@sandia.gov> <91bd441b0903250421m255c6f8ciecabe75ff2c171fe@mail.gmail.com>
Message-ID: <49CA57CB.3010504@sandia.gov>

On 03/25/2009 05:21 AM, Xie Min wrote:
> We met the problem too, so we use PMI_DEBUG=1 to see the output of Slurm PMI,
> mvapich2 use "ip " to call PMI_KVS_Get(), please notice there is
> a "blank" in the parameter.
> Delete the "blank" in rdma_cm.c, use "ip", seems Slurm PMI can work.

That worked. Thanks for the help!
-Marcus

From ajay35 at in.com Wed Mar 25 15:31:12 2009
From: ajay35 at in.com (Ajay)
Date: Wed Mar 25 15:34:10 2009
Subject: Re: [mvapich-discuss] mvapich2 over uDAPL gives QP_FATAL error
In-Reply-To: <49C7FF89.6040009@osu.edu>
Message-ID: <1238009472.9565f1cd832c9675c76672081c819342@mail.in.com>

Hi Lei,

Thanks for explanation and suggestion.

Regards,
Ajay

From panda at cse.ohio-state.edu Thu Mar 26 13:28:22 2009
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Thu Mar 26 13:28:33 2009
Subject: [mvapich-discuss] Fwd: Announcing the release of BLCR 0.8.1 (fwd)
Message-ID:

Recently, an error was reported by an MVAPICH2 user while carrying out
Checkpoint-Restart with BLCR 0.8.0. This problem has been solved in the
new BLCR 0.8.1 release.

If you are using MVAPICH2 with BLCR, please upgrade your BLCR
installation to this latest 0.8.1 release.

Thanks,

DK

---------- Forwarded message ----------
From: Paul H. Hargrove
Date: Thu, Mar 26, 2009 at 3:37 AM
Subject: Announcing the release of BLCR 0.8.1
To: "checkpoint@lbl.gov"

I am pleased to announce the release of Berkeley Lab Checkpoint/Restart
(BLCR) version 0.8.1, which is now available from the BLCR Downloads page:
http://ftg.lbl.gov/CheckpointRestart/CheckpointDownloads.shtml

Relative to 0.8.0 this release extends support to 2.6.29 kernels and
fixes several bugs. A summary of the user-visible changes in BLCR,
relative to 0.8.0, appears below in the form of an excerpt from the
NEWS file.

-Paul

NEWS:

0.8.1
-----------
March 25, 2009
Bug fix and expanded-support release.
- This release adds support for 2.6.29 kernels.  - The final known xen-specific bug (2457) has been found to be a Xen bug.   Xen 3.1.2 or newer is recomended when using BLCR.  - This release makes the following libcr API additions/changes:   + Increase CR_MAX_CALLBACK from 32 to 32 million.  - This release adds additional error checking and improved documentation   for error cases for the following libcr functions:   + cr_register_callback()   + cr_replace_callback()   + cr_register_hook()  - This release fixes the following user-visible bugs and "issues"   + 2508 - inconsistent metadata from a networked file system   + 2520 - possible deadlock when using some functions with critical sections   + 2524 - restart-time SEGV on powerpc with 2.6.15 or older kernel   + 2525 - process deadlock on checkpoint after restart 2525 - This was the cause of the issue.   + 2526 - process deadlock between omit and forward operations   + C++ scoping problem for "struct cr_rstrt_relocate_pair"   + Correct a bug that would reset the persist count across checkpoints   + Prevent "stuck" tests when TOSTOP bit is set in termios.c_lflag.   + Correct flaws in tests "prctl" and "stage0002" that could yield false     failures under certain conditions -- Paul H. Hargrove                          PHHargrove@lbl.gov Future Technologies Group                 Tel: +1-510-495-2352 HPC Research Department                   Fax: +1-510-486-6900 Lawrence Berkeley National Laboratory From koop at cse.ohio-state.edu Fri Mar 27 14:29:36 2009 From: koop at cse.ohio-state.edu (Matthew Koop) Date: Fri Mar 27 14:29:45 2009 Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage? In-Reply-To: <91bd441b0903240705o5975594xeb1bd3f3218205c1@mail.gmail.com> Message-ID: > One of our parallel applications met some problems, it often segment > fault. But when we set the > MV2_ON_DEMAND_THRESHOLD to the value equals to the tasks number, it > can run successfully. 
> > So we set only one envrionment variable MV2_ON_DEMAND_THRESHOLD=1024, > and run HPL tests again > for about 10 times (128 * 8, 1024 tasks, "RES" of each task is about > 1.4GB), each HPL test will run for about 12 minutes. > All tests runs normally, the "port_rcv_packets" counter of IB card > increase continually in the test running time. > > If on demand connection is used, seems there are some pthread lock > operations in vbuf.c, I am not sure if this will have > some relationship with the deadlock. I'm glad that things seem to be working now in terms of the memory usage. How large are these application runs that normally segfault (numbers of processes)? Can you try the following ENVs? MV2_CM_RECV_BUFFERS=8192 MV2_CM_TIMEOUT=250 We're also taking a further look within the code at possible issues. Thanks, Matt > 2009/3/24 Dhabaleswar Panda : > >> Thanks for the new patch. > >> > >> We did several tests today, but the testing system is busy today, so > >> we don't have enough node resource and time to do test. > >> We didn't run the HPCC to completion yet. > >> > >> For 512 tasks, seems this patch works, but we just run it for 30 > >> minutes, in which HPL was running normally. > > > > Good to know about this. > > > >> For 1024 tasks, in one test, the HPL is deadlock when running for > >> about 5 minute (while using the old patch, HPL will be deadlock at the > >> start), but our testing system is not stable today, maybe the deadlock > >> is caused by network failure. We need to find enough > >> resource and time to do more tests. > > > > Once you have a stable system, let us know how it works for 1024 tasks. > > > > Thanks, > > > > DK > > > >> 2009/3/19 Matthew Koop : > >> > Xie, > >> > Can you try the following attached patch instead of the other patch? We > >> > found some places where a deadlock may have been allowed to occur. > >> > > >> > Thank you, > >> > Matt > >> > > >> > On Sat, 14 Mar 2009, Xie Min wrote: > >> > > >> >> Today, we do some other tests. 
> >> >> > >> >> For 128 tasks HPCC (16 nodes * 8), we can run the whole test > >> >> successfully and get final result. > >> >> > >> >> But for 512 tasks HPCC (64 nodes * 8), HPL is freezed too when > >> >> MV2_USE_LAZY_MEM_UNREGISTER=1. > >> >> > >> >> I attach an input file for 512 tasks HPCC (about 1.6GB for each task), > >> >> maybe you can try it on your systems to see if it will produce the > >> >> same problem. > >> >> > >> >> Thanks. > >> >> > >> >> 2009/3/12 Matthew Koop : > >> >> > Xie, > >> >> > > >> >> > Thanks for sending this information along. We've spent some time > >> >> > investigating the issue and came up with a patch that will hopefully > >> >> > resolve your issue. I've attached it to this email and it should be > >> >> > applied at the base directory. > >> >> > > >> >> > Please let us know if this helps the problem, > >> >> > > >> >> > Matt > >> >> > > >> >> > On Mon, 9 Feb 2009, Xie Min wrote: > >> >> > > >> >> >> The hpcc we used is HPCC 1.0.0, but we just tried HPCC 1.3.1, seems > >> >> >> has the same problem. > >> >> >> > >> >> >> In the attachment we attached two hpccinf.txt files for 64 HPCC tasks, > >> >> >> the hpccinf.txt.13 is the "RES" of about 1.3GB, while hpccinf.txt.16 > >> >> >> is the "RES" of about 1.6/1.7GB. Whould you please try them on your > >> >> >> systems (with MV2_USE_LAZY_MEM_UNREGISTER=1), thanks. > >> >> >> > >> >> >> BTW, the OFED version we used is 1.3.1, physical memory on each node > >> >> >> is 16GB, use 8 nodes for 64 tasks. > >> >> >> > >> >> >> > >> >> >> > >> >> >> 2009/2/7 Matthew Koop : > >> >> >> > > >> >> >> > Thanks for the additional information. I've tried here with HPCC 1.3.1 and > >> >> >> > I haven't been able to see any difference in the 'RES' or 'VIRT' memory > >> >> >> > while running. > >> >> >> > > >> >> >> > Would it be possible to send me your hpccinf.txt file so I can more > >> >> >> > closely try to reproduce the problem? We also have AS5 with kernel 2.6.18 > >> >> >> > as well. 
> >> >> >> > > >> >> >> > Thanks, > >> >> >> > > >> >> >> > Matt > >> >> >> > > >> >> >> > On Thu, 5 Feb 2009, Xie Min wrote: > >> >> >> > > >> >> >> >> We use Redhat AS5, kernel is 2.6.18 with lustre 1.6.6, and we don't > >> >> >> >> modify kernel source. > >> >> >> >> > >> >> >> >> We test HPCC on two clusters: > >> >> >> >> In one cluster, each node is booted using Boot over IB, it has no > >> >> >> >> harddisk, so NO swap space. We run 64 HPCC tasks on 8 nodes (so each > >> >> >> >> CPU core in the node will run one HPCC task), when each HPCC task use > >> >> >> >> 1.2/1.3G memory, it will be killed by OS because of "Out of memory" > >> >> >> >> error. But when MV2_USE_LAZY_MEM_UNREGISTER=0, task can use 1.7G > >> >> >> >> memory and run successfully. > >> >> >> >> > >> >> >> >> In another cluster, each node has harddisk, it booted from local disk, > >> >> >> >> and it HAS space space. We run 64 HPCC tasks on 8 nodes too. When each > >> >> >> >> HPCC use 1.3G memory, we use "top" to show the memory usage > >> >> >> >> information, we found swap will be used when HPCC is running for a > >> >> >> >> while, and the node begin to run very slowly and cannot respond to > >> >> >> >> keyboard input. But when MV2_USE_LAZY_MEM_UNREGISTER=0, each task can > >> >> >> >> be set to 1.7G memory scale and run successfully. > >> >> >> >> > >> >> >> >> I tried another mvapich2 parameters: MV2_USE_LAZY_MEM_UNREGISTER=1, > >> >> >> >> and MV2_NDREG_ENTRIES=8. In this configuration, HPCC is still be > >> >> >> >> killed by OS with "Out of memory" error when the memory scale of each > >> >> >> >> task is set to 1.3GB. > >> >> >> >> > >> >> >> >> 2009/2/5 Matthew Koop : > >> >> >> >> > Hi, > >> >> >> >> > > >> >> >> >> > What OS/distro are you running? Are there any changes you made, such as > >> >> >> >> > page size, etc from the base? 
> >> >> >> >> >
> >> >> >> >> > I'm taking a look at this issue on our machine as well, although I'm not
> >> >> >> >> > seeing the memory change that you reported.
> >> >> >> >> >
> >> >> >> >> > Matt

From xmxmxie at gmail.com Sat Mar 28 11:25:17 2009
From: xmxmxie at gmail.com (Xie Min)
Date: Sat Mar 28 11:25:28 2009
Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?
In-Reply-To:
References: <91bd441b0903240705o5975594xeb1bd3f3218205c1@mail.gmail.com>
Message-ID: <91bd441b0903280825n468e23c8x69eccc60070e89bf@mail.gmail.com>

2009/3/28 Matthew Koop :
>
>> One of our parallel applications ran into some problems: it often
>> segfaults. But when we set MV2_ON_DEMAND_THRESHOLD to a value equal
>> to the number of tasks, it can run successfully.
>>
>> So we set only one environment variable, MV2_ON_DEMAND_THRESHOLD=1024,
>> and ran the HPL tests again about 10 times (128 * 8, 1024 tasks, "RES"
>> of each task is about 1.4GB); each HPL test runs for about 12 minutes.
>> All tests run normally, and the "port_rcv_packets" counter of the IB
>> card increases continually during the test running time.
>>
>> If on demand connection is used, it seems there are some pthread lock
>> operations in vbuf.c. I am not sure if this has some relationship
>> with the deadlock.
>
> I'm glad that things seem to be working now in terms of the memory usage.
> How large are these application runs that normally segfault (numbers of
> processes)?
>

One of our applications segfaults at a scale of 128 tasks; after
setting MV2_ON_DEMAND_THRESHOLD=128, it runs OK.

> Can you try the following ENVs?
>
> MV2_CM_RECV_BUFFERS=8192
> MV2_CM_TIMEOUT=250
>

We will try to find some time and node resources to test these ENVs. Thanks.

> We're also taking a further look within the code at possible issues.
>
> Thanks,
>
> Matt
>

From xmxmxie at gmail.com Mon Mar 30 23:20:40 2009
From: xmxmxie at gmail.com (Xie Min)
Date: Mon Mar 30 23:20:51 2009
Subject: [mvapich-discuss] MV2_USE_LAZY_MEM_UNREGISTER and memory usage?
In-Reply-To:
References: <91bd441b0903240705o5975594xeb1bd3f3218205c1@mail.gmail.com>
Message-ID: <91bd441b0903302020n169bcfc7o10a585b5861d8e7d@mail.gmail.com>

2009/3/28 Matthew Koop :
>
>> One of our parallel applications ran into some problems: it often
>> segfaults. But when we set MV2_ON_DEMAND_THRESHOLD to a value equal
>> to the number of tasks, it can run successfully.
>>
>> So we set only one environment variable, MV2_ON_DEMAND_THRESHOLD=1024,
>> and ran the HPL tests again about 10 times (128 * 8, 1024 tasks, "RES"
>> of each task is about 1.4GB); each HPL test runs for about 12 minutes.
>> All tests run normally, and the "port_rcv_packets" counter of the IB
>> card increases continually during the test running time.
>>
>> If on demand connection is used, it seems there are some pthread lock
>> operations in vbuf.c. I am not sure if this has some relationship
>> with the deadlock.
>
> I'm glad that things seem to be working now in terms of the memory usage.
> How large are these application runs that normally segfault (numbers of
> processes)?
>
> Can you try the following ENVs?
>
> MV2_CM_RECV_BUFFERS=8192
> MV2_CM_TIMEOUT=250

We ran several tests today; there are still some deadlocks when running
HPL with only these two ENVs set.

>
> We're also taking a further look within the code at possible issues.
>
> Thanks,
>
> Matt
>
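
For readers following the thread, the two connection-manager settings
suggested above are passed to the job as NAME=VALUE pairs. The sketch below
is only an illustration, assuming the job is launched with mpirun_rsh, which
takes such pairs on its command line before the executable. The helper name
build_mpirun_rsh_command, the hostfile name "hosts", and the binary "./xhpl"
are hypothetical and do not come from the thread.

```python
def build_mpirun_rsh_command(num_tasks, hostfile, program, env):
    """Build an mpirun_rsh argument list with NAME=VALUE settings before the program.

    Hypothetical helper for illustration; it only assembles the command,
    it does not launch anything.
    """
    cmd = ["mpirun_rsh", "-np", str(num_tasks), "-hostfile", hostfile]
    # Each runtime parameter becomes one NAME=VALUE argument.
    cmd += [f"{name}={value}" for name, value in env.items()]
    cmd.append(program)
    return cmd


# Values taken from the thread; hostfile and binary names are assumptions.
cm_env = {"MV2_CM_RECV_BUFFERS": 8192, "MV2_CM_TIMEOUT": 250}
cmd = build_mpirun_rsh_command(1024, "hosts", "./xhpl", cm_env)
print(" ".join(cmd))
```

Keeping the settings in a dictionary makes it easy to compare runs, for
example adding MV2_ON_DEMAND_THRESHOLD=1024 for one run and leaving it out
for the next.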