From mhanby at uab.edu Wed Feb 7 11:36:10 2007
From: mhanby at uab.edu (Mike Hanby)
Date: Wed Feb 7 11:36:45 2007
Subject: [mvapich-discuss] SDR or DDR
Message-ID: <42D8C30759A99B4F926167BFEC117E7310C78E@UABEXMB5.ad.uab.edu>

I need to fill in the value for LINKS=_DDR_ or _SDR_
Does anyone know how I tell whether my Infiniband cards have DDR or SDR ram?

Also, my cards are PCI Express. For IO_BUS, I would choose _PCI_EX_, correct?

tvflash -i command reports the following:

HCA #0: MT25208 Tavor Compat, Lion Cub, revision A0
  Primary image is v4.7.600 build 3.2.0.110, with label 'HCA.LionCub.A0'
  Secondary image is v4.6.000 build 3.1.0.113, with label 'HCA.LionCub.A0'

  Vital Product Data
    Product Name: Lion cub
    P/N: 99-00026-01
    E/C: Rev: B04
    S/N: TS0548X03797
    Freq/Power: PW=10W;PCIe 8X
    Date Code: 0548
    Checksum: Ok

Thanks, Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/99345b8d/attachment.html

From sweitzen at cisco.com Wed Feb 7 11:56:38 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed Feb 7 11:57:09 2007
Subject: [mvapich-discuss] SDR or DDR
In-Reply-To: <42D8C30759A99B4F926167BFEC117E7310C78E@UABEXMB5.ad.uab.edu>
References: <42D8C30759A99B4F926167BFEC117E7310C78E@UABEXMB5.ad.uab.edu>
Message-ID:

Your HCA is SDR; with DDR you will see DDR in the tvflash -i output.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/911974e3/attachment.html

From Shainer at mellanox.com Wed Feb 7 12:35:44 2007
From: Shainer at mellanox.com (Gilad Shainer)
Date: Wed Feb 7 12:33:23 2007
Subject: [mvapich-discuss] SDR or DDR
Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F617FB5@mtiexch01.mti.com>

The DDR or SDR is the link speed, 20Gb/s or 10Gb/s respectively, and not the RAM.

Gilad.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/f303100c/attachment-0001.html

From mhanby at uab.edu Wed Feb 7 13:29:40 2007
From: mhanby at uab.edu (Mike Hanby)
Date: Wed Feb 7 13:30:08 2007
Subject: [mvapich-discuss] SDR or DDR
Message-ID: <42D8C30759A99B4F926167BFEC117E7310C7C1@UABEXMB5.ad.uab.edu>

Thanks, I feel like a buffoon :-)

I compiled MVAPICH 0.9.8 on an x86_64 Rocks 4.2.1 Cluster system using Intel 9.1 compilers. I have the Topspin roll installed on the cluster, where /usr/local/topspin contains the libraries and binaries for Infiniband. I could use the mvapich included with the Topspin roll, however my users want their applications compiled using the Intel compilers, and the mvapich on the roll is compiled with GNU.

If I compile a simple helloworld mpi c program using mpicc and then run it using the command below, I get a Segmentation Fault:
$ mpirun_rsh -np 1 node1 ~/mpi_hello
bash: line 1: 12801 Segmentation fault /usr/bin/env MPIRUN_MPD=0 MPIRUN_HOST=headnode MPIRUN_PORT=41013 MPIRUN_PROCESSES=node1:' MPIRUN_RANK=0 MPIRUN_NPROCS=1 MPIRUN_ID=12669 /home/makeuser/mpi_hello

I looked through the make log and don't see any errors, just a bunch of warnings like:
graph_nbr.c(83): warning #187: use of "=" where "==" may have been intended
( (topo->type != MPI_GRAPH) && (mpi_errno = MPI_ERR_TOPOLOGY))

I've also compiled Amber9 using mpicc, mpixx and mpif77 and also get a segmentation fault when I attempt to run sander.MPI (an Amber9 binary).

Something tells me I'm doing something wrong.

Here are the steps I followed to compile:
I edit the file make.mvapich.vapi as follows:
MTHOME=/usr/local/topspin
PREFIX=/share/apps/mvapich/intel/mvapich-0.9.8-64
export CC=icc
export CXX=icpc
export F77=ifort
export F90=ifort
IO_BUS=_PCI_EX_ # For PCI Express
LINKS=_SDR_
export CFLAGS="-D${ARCH} -DUSE_INLINE -DEARLY_SEND_COMPLETION -DRDMA_FAST_PATH \
 -DVIADEV_RPUT_SUPPORT -DLAZY_MEM_UNREGISTER -D_SMP_ -D_SMP_RNDV_ \
 $SUPPRESS -D${IO_BUS} -D${LINKS} \
 ${HAVE_MPD_RING} -I${MTHOME}/include -I${MTHOME}/include/vapi $OPT_FLAG"

I also have to edit mpid/vapi/viainit.c based on an error I received:
case VAPI_PORT_ACTIVE:
#ifdef VAPI_VERSION_CODE
#if 0
#if VAPI_VERSION_CODE >= VAPI_VERSION(4,1,0)
case VAPI_CLIENT_REREGISTER:
case VAPI_RECEIVE_QUEUE_DRAINED:
case VAPI_ECC_DETECT:
case VAPI_PATH_MIG_ARMED:
#endif
#endif
#endif

...and...

case VAPI_PORT_ERROR:
#ifdef VAPI_VERSION_CODE
#if 0
#if VAPI_VERSION_CODE >= VAPI_VERSION(4,1,0)
case VAPI_SRQ_CATASTROPHIC_ERROR:
#endif
#endif
#endif

I then just run:
./make.mvapich.vapi

It appears to succeed and the directories and files get created in the --prefix location. Does anyone see anything glaringly wrong here?

Thanks, Mike
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/cc024f7f/attachment.html

From mhanby at uab.edu Wed Feb 7 17:01:56 2007
From: mhanby at uab.edu (Mike Hanby)
Date: Wed Feb 7 17:02:25 2007
Subject: [mvapich-discuss] SDR or DDR
Message-ID: <42D8C30759A99B4F926167BFEC117E7310C82C@UABEXMB5.ad.uab.edu>

Thanks Scott, I wasn't aware of that. It looks like there's also mpicc.p (pgi?).

I recompiled Amber9 using /usr/local/topspin/mpi/mpich/mpicc.i (and CC.i and f77.i), and now it appears to be running using mpirun_ssh (at least I'm not getting a bunch of segfaults).

Thanks, again
Mike

________________________________
From: Scott Weitzenkamp (sweitzen) [mailto:sweitzen@cisco.com]
Sent: Wednesday, February 07, 2007 12:48
To: Mike Hanby
Subject: RE: [mvapich-discuss] SDR or DDR

The Topspin roll should include Intel compiler support. I will admit this is not well documented; we are working to correct this. Look for mpicc.i, mpiCC.i, mpif77.i, and mpif90.i in /usr/local/topspin/mpi/mpich/bin/.

Scott Weitzenkamp
SQA and Release Manager
Server Virtualization Business Unit
Cisco Systems
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/c5579168/attachment-0001.html

From sweitzen at cisco.com Wed Feb 7 17:15:18 2007
From: sweitzen at cisco.com (Scott Weitzenkamp (sweitzen))
Date: Wed Feb 7 17:19:29 2007
Subject: [mvapich-discuss] SDR or DDR
In-Reply-To: <42D8C30759A99B4F926167BFEC117E7310C82C@UABEXMB5.ad.uab.edu>
References: <42D8C30759A99B4F926167BFEC117E7310C82C@UABEXMB5.ad.uab.edu>
Message-ID:

Yes, .p scripts are for the PGI compiler. We support Intel C/C++/F77/F90, and PGI F77/F90 (no PGI C/C++ support) with the binary RPMs.

Scott
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20070207/c94b3d23/attachment-0001.html

From stevejones at stanford.edu Wed Feb 7 19:41:25 2007
From: stevejones at stanford.edu (Steve Jones)
Date: Wed Feb 7 19:41:53 2007
Subject: [mvapich-discuss] SDR or DDR
In-Reply-To:
References: <42D8C30759A99B4F926167BFEC117E7310C82C@UABEXMB5.ad.uab.edu>
Message-ID: <1170895285.45ca71b5d02f0@webmail.stanford.edu>

Hi.

I didn't immediately notice the Topspin roll was being used here. We have PATH set in the Cisco Topspin Roll to include the Topspin MPI wrapper scripts, but it looks like I'll have to write better documentation on using it as Rocks provides MPI wrapper scripts as well.

You can also post to the Rocks list with questions about the Topspin Roll. Be sure to include Topspin in the subject line so it stands out to me.

Steve

Quoting "Scott Weitzenkamp (sweitzen)" :

> Yes, .p scripts are for PGI compiler. We support Intel C/C++/F77/F90,
> and PGI F77/F90 (no PGI C/C++ support) with the binary RPMs.
>
> Scott

From ce107 at MIT.EDU Thu Feb 8 14:13:35 2007
From: ce107 at MIT.EDU (Constantinos Evangelinos)
Date: Thu Feb 8 14:34:06 2007
Subject: [mvapich-discuss] Dual port HCA back-to-back woes
Message-ID: <200702081413.35384.ce107@mit.edu>

Hi - we have two quad socket Opteron systems, each with a Voltaire HCA 400Ex
connected directly back to back which I realise is an unusual configuration.
Since Voltaire will not support back-to-back with OpenFabrics we are running
the specific earlier Verbs-based Voltaire GridStack with the only firmware
level for the cards that Voltaire supports for single port setups. Using
minism as the session manager running on one of the nodes, I have been able
to use this back-to-back setup with a pair of HCA 410Ex-Ds (I was initially
sent by mistake) as well as the 400Exs. In that case one port is in the
PORT_ACTIVE state while the other in the PORT_INITIALIZE state as minism will
claim "Status: Port not discovered" for the 2nd port. If I start minism with
a "-p 2" argument then the roles are reversed as port 1 is not discovered.

The Voltaire distributed MVAPICH, OpenMPI and MVAPICH 0.9.8 built for a single
port work fine with this half active configuration at half the potential
speed for large messages of course.

Having recompiled MVAPICH 0.9.8 with support for SDR/dual port I cannot use it
with this setup (one port active, the other one in the initialize state) as I
get the following error:

[-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in
file viainit.c
[-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in
file viainit.c

If I start minism again with "-p 2" (on the same node or even the other node,
it does not make any difference) and trick the system to bring the second
port up to the active state as well I still have the same problem. The best
situation a lot of experimentation with recompilation upon recompilation
landed me was a setup where I could use what appeared to be both ports but
only one process on either side could be involved in MPI communications,
thereby negating any usability of such an approach (the reason I preferred
dual SDR to single DDR was to have more bandwidth between the nodes when more
than one processor on each side is communicating).

OpenMPI will work fine in the half-active configuration but will not initiate
communications successfully and hangs when both ports are tricked into
becoming active concurrently.

I realize that this is an unusual setup and it may be that minism will not be
able to support such a setup and no fault lies with either MPI
implementation. Do we know whether opensm and OpenFabrics would do any better
(if I were to take the plunge and try a completely unsupported by Voltaire
configuration)?

Thanks for any help in advance.

Constantinos
--
Dr. Constantinos Evangelinos Room 54-1518, EAPS/MIT
Earth, Atmospheric and Planetary Sciences 77 Massachusetts Avenue
Massachusetts Institute of Technology Cambridge, MA 02139
+1-617-253-5259/+1-617-253-4464 (fax) USA

From yiftahs at voltaire.com Thu Feb 8 19:39:59 2007
From: yiftahs at voltaire.com (Yiftah Shahar)
Date: Thu Feb 8 19:42:49 2007
Subject: [mvapich-discuss] Dual port HCA back-to-back woes
In-Reply-To: <200702081413.35384.ce107@mit.edu>
Message-ID: <3857BB049D83424D9DB82753D37CEA553D18C1@taurus.voltaire.com>

Hi Constantinos,

You must understand that by connecting HCAs back-to-back without a switch you are actually creating several different IB subnets. You can't have any IB traffic go from one port to other ports that are not connected directly to it (you can have the same LID in the same node (or HCA)...).

I know that when using an OpenIB-based stack with OpenSM you can bring all ports to the active state.

OpenMPI and Voltaire MPI (based on mvapich) will support multi-port/multi-HCA configurations only if all ports are connected to the same IB fabric (and as I explained, this is not the case here).
I'm not sure whether MVAPICH multi-rail supports multiple IB subnets.

Yiftah
Voltaire HPC solution

From vishnu at cse.ohio-state.edu Fri Feb 9 10:41:26 2007
From: vishnu at cse.ohio-state.edu (Abhinav Vishnu)
Date: Fri Feb 9 10:43:40 2007
Subject: [mvapich-discuss] Dual port HCA back-to-back woes
Message-ID: <20070209154124.GA9002@cse.ohio-state.edu>

Dr. Constantinos,

Thanks for using MVAPICH and reporting the problem to us.

> Hi - we have two quad socket Opteron systems, each with a Voltaire HCA 400Ex
> connected directly back to back which I realise is an unusual configuration.
> Since Voltaire will not support back-to-back with OpenFabrics we are running
> the specific earlier Verbs-based Voltaire GridStack with the only firmware
> level for the cards that Voltaire supports for single port setups. Using
> minism as the session manager running on one of the nodes, I have been able
> to use this back-to-back setup with a pair of HCA 410Ex-Ds (I was initially
> sent by mistake) as well as the 400Exs. In that case one port is in the
> PORT_ACTIVE state while the other in the PORT_INITIALIZE state as minism will
> claim "Status: Port not discovered" for the 2nd port. If I start minism with
> a "-p 2" argument then the roles are reversed as port 1 is not discovered.
> > The Voltaire distributed MVAPICH, OpenMPI and MVAPICH 0.9.8 built for a single > port work fine with this half active configuration at half the potential > speed for large messages of course. > > Having recompiled MVAPICH 0.9.8 with support for SDR/dual port I cannot use it > with this setup (one port active, the other one in the initialize state) as I > get the following error: > > [-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in > file viainit.c > [-1] Abort: malloc for alladdrs/local_addr/lid_table/qp_table at line 278 in > file viainit.c In our testing with VAPI multi-rail device, we did not encounter this problem. Not sure why this problem is occuring on your machines. Can you please let us know the MPI test you are using? Also, we have not used minism for quite some time now (~3 years). For these years, we have been using opensm distributed with IB Gold (from Mellanox) and opensm distributed with OFED for the OpenFabrics drivers. > > If I start minism again with "-p 2" (on the same node or even the other node, > it does not make any difference) and trick the system to bring the second > port up to the active state as well I still have the same problem. The best > situation a lot of experimentation with recompilation upon recompilation > landed me was a setup where I could use what appeared to be both ports but > only one process on either side could be involved in MPI communications, > thereby negating any usability of such an approach (the reason I preferred > dual SDR to single DDR was to have more bandwidth between the nodes when more > than one processor on each side is communicating). > > OpenMPI will work fine in the half-active configuration but will not initiate > communications successfully and hangs when both ports are tricked into > becoming active concurrently. 
> > I realize that this is an unusual setup and it may be that minism will not be > able to support such a setup and no fault lies with either MPI > implementation. Do we know whether opensm and OpenFabrics would do any better > (if I were to take the plunge and try a completely unsupported by Voltaire > configuration)? In our lab, we have tried running opensm on the same node, bound to different ports of the HCA. It absolutely works fine. I would strongly recommend you to download OFED from the OpenFabrics website. FYI, I am posting it here: http://www.openfabrics.org/downloads.html Please use the OFED-1.1 tarball for building the OFED modules and userspace libraries. Once this step is over, please use the make.mvapich.gen2 script in the MVAPICH-0.9.8 top directory. For using the multi-rail version, please use the make.mvapich.gen2_multirail script. For more information on building instructions, please refer to the section 4.4.1 and 4.4.4 in the following URL: http://nowlab.cse.ohio-state.edu/projects/mpi-iba/user_guide.html Please report back to us any problems during the compilation/execution of your MPI programs. Thanks again, :- Abhinav > > Thanks for any help in advance. > > Constantinos > -- > Dr. 
Constantinos Evangelinos Room 54-1518, EAPS/MIT
> Earth, Atmospheric and Planetary Sciences 77 Massachusetts Avenue
> Massachusetts Institute of Technology Cambridge, MA 02139
> +1-617-253-5259/+1-617-253-4464 (fax) USA
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From panda at cse.ohio-state.edu Fri Feb 9 17:24:36 2007
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Fri Feb 9 17:25:03 2007
Subject: [mvapich-discuss] MVAPICH 0.9.9-beta release is available
Message-ID: <200702092224.l19MOaYC006652@xi.cse.ohio-state.edu>

The MVAPICH team is pleased to announce the availability of MVAPICH
0.9.9-beta with the following NEW features:

- Message coalescing support to enable reduction of per Queue-pair
  send queues for reduction in memory requirement on large scale
  clusters. This design also increases the small message messaging
  rate significantly.

- Designs for avoiding hot-spots in networks of large-scale clusters

    - Multi-pathing support leveraging LMC mechanism
    - Multi-port support for enabling user processes to bind to
      different IB ports for balanced communication performance
      on multi-core platforms

- Multi-core optimized scalable shared memory design

- Memory Hook support provided by integration with ptmalloc2 library.
  This provides safe release of memory to the Operating System and
  is expected to benefit the memory usage of applications that
  frequently use malloc and free operations.

- Optimized, high-performance shared memory aware collective
  operations for multi-core platforms

- Shared-Memory only channel (This interface support is useful for
  running MPI jobs on multi-processor systems without using any
  high-performance network. For example, multi-core servers,
  desktops, and laptops; and clusters with serial nodes.)

A new "Multiple-pair Bandwidth and Message Rate" test is also
available as a part of OSU_Benchmarks.

For downloading MVAPICH 0.9.9-beta package and accessing the anonymous
SVN, please visit the following URL:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/

MVAPICH 0.9.9-beta is also available for OFED 1.2 testing.

All feedback, including bug reports and hints for performance tuning,
is welcome. Please post it to the mvapich-discuss mailing list.

Thanks,

MVAPICH Team

From eborisch at ieee.org Fri Feb 9 18:17:38 2007
From: eborisch at ieee.org (Eric A. Borisch)
Date: Fri Feb 9 18:18:04 2007
Subject: [mvapich-discuss] MVAPICH 0.9.9-beta release is available
In-Reply-To: <200702092224.l19MOaYC006652@xi.cse.ohio-state.edu>
References: <200702092224.l19MOaYC006652@xi.cse.ohio-state.edu>
Message-ID: <392f95800702091517p7f566481n1c27efec01bfd28f@mail.gmail.com>

Glitch during installation:

...
mvapich-0.9.9-beta/bin/tarch does not exist (or is not a regular file)!
...

tdevice isn't there either.

I initially reported this bug on 0.9.8.

Thanks,
Eric

On 2/9/07, Dhabaleswar Panda wrote:
> The MVAPICH team is pleased to announce the availability of MVAPICH
> 0.9.9-beta with the following NEW features:
>
> - Message coalescing support to enable reduction of per Queue-pair
> send queues for reduction in memory requirement on large scale
> clusters. This design also increases the small message messaging
> rate significantly.
>
> - Designs for avoiding hot-spots in networks of large-scale clusters
>
> - Multi-pathing support leveraging LMC mechanism
> - Multi-port support for enabling user processes to bind to
> different IB ports for balanced communication performance
> on multi-core platforms
>
> - Multi-core optimized scalable shared memory design
>
> - Memory Hook support provided by integration with ptmalloc2 library.
> This provides safe release of memory to the Operating System and
> is expected to benefit the memory usage of applications that
> frequently use malloc and free operations.
> > - Optimized, high-performance shared memory aware collective > operations for multi-core platforms > > - Shared-Memory only channel (This interface support is useful for > running MPI jobs on multi-processor systems without using any > high-performance network. For example, multi-core servers, > desktops, and laptops; and clusters with serial nodes.) > > A new "Multiple-pair Bandwidth and Message Rate" test is also > available as a part of OSU_Benchmarks. > > For downloading MVAPICH 0.9.9-beta package and accessing the anonymous > SVN, please visit the following URL: > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > > MVAPICH 0.9.9-beta is also available for OFED 1.2 testing. > > All feedbacks, including bug reports and hints for performance tuning, > are welcome. Please post it to the mvapich-discuss mailing list. > > Thanks, > > MVAPICH Team > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > -- Eric A. Borisch eborisch@ieee.org From surs at cse.ohio-state.edu Sat Feb 10 09:48:53 2007 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Sat Feb 10 09:49:20 2007 Subject: [mvapich-discuss] MVAPICH 0.9.9-beta release is available In-Reply-To: <392f95800702091517p7f566481n1c27efec01bfd28f@mail.gmail.com> References: <200702092224.l19MOaYC006652@xi.cse.ohio-state.edu> <392f95800702091517p7f566481n1c27efec01bfd28f@mail.gmail.com> Message-ID: <45CDDB55.1080906@cse.ohio-state.edu> Hi Eric, Thanks a lot for immediately trying out the beta! I couldn't reproduce this behavior, `make install' worked OK for me. Will it be possible for you to send us your make-mine.log and install-mine.log? Thanks, Sayantan. Eric A. Borisch wrote: > Glitch during installation: > > ... > mvapich-0.9.9-beta/bin/tarch does not exist (or is not a regular file)! > ... > > tdevice isn't there either. > > I initially reported this bug on 0.9.8. 
> > Thanks, > Eric > > On 2/9/07, Dhabaleswar Panda wrote: >> The MVAPICH team is pleased to announce the availability of MVAPICH >> 0.9.9-beta with the following NEW features: >> >> - Message coalescing support to enable reduction of per Queue-pair >> send queues for reduction in memory requirement on large scale >> clusters. This design also increases the small message messaging >> rate significantly. >> >> - Designs for avoiding hot-spots in networks of large-scale clusters >> >> - Multi-pathing support leveraging LMC mechanism >> - Multi-port support for enabling user processes to bind to >> different IB ports for balanced communication performance >> on multi-core platforms >> >> - Multi-core optimized scalable shared memory design >> >> - Memory Hook support provided by integration with ptmalloc2 library. >> This provides safe release of memory to the Operating System and >> is expected to benefit the memory usage of applications that >> frequently use malloc and free operations. >> >> - Optimized, high-performance shared memory aware collective >> operations for multi-core platforms >> >> - Shared-Memory only channel (This interface support is useful for >> running MPI jobs on multi-processor systems without using any >> high-performance network. For example, multi-core servers, >> desktops, and laptops; and clusters with serial nodes.) >> >> A new "Multiple-pair Bandwidth and Message Rate" test is also >> available as a part of OSU_Benchmarks. >> >> For downloading MVAPICH 0.9.9-beta package and accessing the anonymous >> SVN, please visit the following URL: >> >> http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ >> >> MVAPICH 0.9.9-beta is also available for OFED 1.2 testing. >> >> All feedbacks, including bug reports and hints for performance tuning, >> are welcome. Please post it to the mvapich-discuss mailing list. 
>> >> Thanks, >> >> MVAPICH Team >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> > > -- http://www.cse.ohio-state.edu/~surs From surs at cse.ohio-state.edu Mon Feb 12 12:31:24 2007 From: surs at cse.ohio-state.edu (Sayantan Sur) Date: Mon Feb 12 12:31:49 2007 Subject: [mvapich-discuss] MVAPICH 0.9.9-beta release is available In-Reply-To: <45CDDB55.1080906@cse.ohio-state.edu> References: <200702092224.l19MOaYC006652@xi.cse.ohio-state.edu> <392f95800702091517p7f566481n1c27efec01bfd28f@mail.gmail.com> <45CDDB55.1080906@cse.ohio-state.edu> Message-ID: <20070212173122.GC2300@cse.ohio-state.edu> Hi Eric, I just checked in a minor fix to the make.mvapich.vapi_multirail to keep it from removing the tarch/tdevice files. I hope make install should be working fine for you now. Thanks, Sayantan. * On Feb,3 Sayantan Sur wrote : > Hi Eric, > > Thanks a lot for immediately trying out the beta! I couldn't reproduce > this behavior, `make install' worked OK for me. Will it be possible for > you to send us your make-mine.log and install-mine.log? > > Thanks, > Sayantan. > > Eric A. Borisch wrote: > >Glitch during installation: > > > >... > >mvapich-0.9.9-beta/bin/tarch does not exist (or is not a regular file)! > >... > > > >tdevice isn't there either. > > > >I initially reported this bug on 0.9.8. > > > >Thanks, > >Eric > > > >On 2/9/07, Dhabaleswar Panda wrote: > >>The MVAPICH team is pleased to announce the availability of MVAPICH > >>0.9.9-beta with the following NEW features: > >> > >>- Message coalescing support to enable reduction of per Queue-pair > >> send queues for reduction in memory requirement on large scale > >> clusters. This design also increases the small message messaging > >> rate significantly. 
> >> > >>- Designs for avoiding hot-spots in networks of large-scale clusters > >> > >> - Multi-pathing support leveraging LMC mechanism > >> - Multi-port support for enabling user processes to bind to > >> different IB ports for balanced communication performance > >> on multi-core platforms > >> > >>- Multi-core optimized scalable shared memory design > >> > >>- Memory Hook support provided by integration with ptmalloc2 library. > >> This provides safe release of memory to the Operating System and > >> is expected to benefit the memory usage of applications that > >> frequently use malloc and free operations. > >> > >>- Optimized, high-performance shared memory aware collective > >> operations for multi-core platforms > >> > >>- Shared-Memory only channel (This interface support is useful for > >> running MPI jobs on multi-processor systems without using any > >> high-performance network. For example, multi-core servers, > >> desktops, and laptops; and clusters with serial nodes.) > >> > >>A new "Multiple-pair Bandwidth and Message Rate" test is also > >>available as a part of OSU_Benchmarks. > >> > >>For downloading MVAPICH 0.9.9-beta package and accessing the anonymous > >>SVN, please visit the following URL: > >> > >>http://nowlab.cse.ohio-state.edu/projects/mpi-iba/ > >> > >>MVAPICH 0.9.9-beta is also available for OFED 1.2 testing. > >> > >>All feedbacks, including bug reports and hints for performance tuning, > >>are welcome. Please post it to the mvapich-discuss mailing list. 
> >> > >>Thanks, > >> > >>MVAPICH Team > >> > >>_______________________________________________ > >>mvapich-discuss mailing list > >>mvapich-discuss@cse.ohio-state.edu > >>http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >> > > > > > > -- > http://www.cse.ohio-state.edu/~surs > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss -- http://www.cse.ohio-state.edu/~surs From kanojsarcar at yahoo.com Wed Feb 14 21:15:04 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Wed Feb 14 23:04:49 2007 Subject: [mvapich-discuss] mvapich2-0.9.8-branch question Message-ID: <919875.73452.qm@web32507.mail.mud.yahoo.com> Hello all, I am a mpi/mvapich newbie looking at iwarp support in mvapich2. Between 0.9.8 release and branch, in procedure rdma_get_control_parameters() in file src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_param.c, I see a difference that I am trying to understand. In the release version, the code was: if ((value = getenv("MV2_DISABLE_RDMA_FAST_PATH")) != NULL) { proc->has_adaptive_fast_path = 0; rdma_polling_set_limit = 0; } else { proc->has_adaptive_fast_path = 1; } but this got changed in the branch version to also turn off adaptive fast path in case of iwarp. I am curious about the rationale for this change. Is this to mostly support amasso card that (AFAIK) does not have user mode fastpath support? Or is there some architectural issue in current mvapich2 that requires turning off fastpath on iwarp rnics? Thanks. Kanoj PS: please cc me on responses, I am not subscribed to the list ____________________________________________________________________________________ Do you Yahoo!? Everyone is raving about the all-new Yahoo! Mail beta. 
http://new.mail.yahoo.com

From narravul at cse.ohio-state.edu Thu Feb 15 15:27:28 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Thu Feb 15 15:27:55 2007
Subject: [mvapich-discuss] mvapich2-0.9.8-branch question
In-Reply-To: <919875.73452.qm@web32507.mail.mud.yahoo.com>
Message-ID:

Hi Kanoj,

The RDMA_FAST_PATH that we use in mvapich2 involves polling on the last
byte of an expected message to flag message completions on the remote
side. Since iWARP does not guarantee that the last byte is placed last,
we have disabled this by default for the iWARP mode.

Does the RNIC you use guarantee in-order placement of RDMA bytes into the
remote buffer, or guarantee that the last byte will be placed last into
the user buffer?

Regards,
--Sundeep.


On Wed, 14 Feb 2007, Kanoj Sarcar wrote:

> Hello all,
>
> I am a mpi/mvapich newbie looking at iwarp support in
> mvapich2.
>
> Between 0.9.8 release and branch, in procedure
> rdma_get_control_parameters() in file
> src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_param.c,
> I see a difference that I am trying to understand.
>
> In the release version, the code was:
>
> if ((value = getenv("MV2_DISABLE_RDMA_FAST_PATH")) != NULL) {
>     proc->has_adaptive_fast_path = 0;
>     rdma_polling_set_limit = 0;
> } else {
>     proc->has_adaptive_fast_path = 1;
> }
>
> but this got changed in the branch version to also
> turn off adaptive fast path in case of iwarp.
>
> I am curious about the rationale for this change. Is
> this to mostly support amasso card that (AFAIK) does
> not have user mode fastpath support? Or is there some
> architectural issue in current mvapich2 that requires
> turning off fastpath on iwarp rnics?
>
> Thanks.
>
> Kanoj
>
> PS: please cc me on responses, I am not subscribed to
> the list
>
>
>
> ____________________________________________________________________________________
> Do you Yahoo!?
> Everyone is raving about the all-new Yahoo! Mail beta.
> http://new.mail.yahoo.com > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From kanojsarcar at yahoo.com Thu Feb 15 16:05:47 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Thu Feb 15 16:48:18 2007 Subject: [mvapich-discuss] mvapich2-0.9.8-branch question In-Reply-To: Message-ID: <20070215210547.4065.qmail@web32512.mail.mud.yahoo.com> Hi Sundeep, Ok, thanks, I see the rationale now. Yes, the rnic I am working with can be put into a mode to guarantee in-order placement. I think a specifiable environment variable should be able to handle this case correctly, no? In general, are there any online documents describing internals of mvapich/iwarp and the various settable parameters etc? For example, if a newbie wanted to understand the iwarp primitives used in a benchmark program (say one of the osu_benchmarks), how would he go about that? Thanks. Kanoj --- Sundeep Narravula wrote: > Hi Kanoj, > > The RDMA_FAST_PATH that we use in mvapich2 > involves polling on the last > byte of an expected message to flag message > completions on the remote > side. Since this is not guaranteed by iWARP we have > disabled this > by default for the iWARP mode. > > Does the RNIC you use guarantee inorder placement of > RDMA bytes into the > remote buffer? or guarantee that the last byte will > be placed last into > the user buffer? > > Regards, > --Sundeep. > > > On Wed, 14 Feb 2007, Kanoj Sarcar wrote: > > > Hello all, > > > > I am a mpi/mvapich newbie looking at iwarp support > in > > mvapich2. > > > > Between 0.9.8 release and branch, in procedure > > rdma_get_control_parameters() in file > > > src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_param.c, > > I see a difference that I am trying to understand. 
> > > > In the release version, the code was: > > > > if ((value = > getenv("MV2_DISABLE_RDMA_FAST_PATH")) > > != NULL) { > > proc->has_adaptive_fast_path = 0; > > rdma_polling_set_limit = 0; > > } else { > > proc->has_adaptive_fast_path = 1; > > } > > > > but this got changed in the branch version to also > > turn off adaptive fast path in case of iwarp. > > > > I am curious about the rationale for this change. > Is > > this to mostly support amasso card that (AFAIK) > does > > not have user mode fastpath support? Or is there > some > > architectural issue in current mvapich2 that > requires > > turning off fastpath on iwarp rnics? > > > > Thanks. > > > > Kanoj > > > > PS: please cc me on responses, I am not subscribed > to > > the list > > > > > > > > > ____________________________________________________________________________________ > > Do you Yahoo!? > > Everyone is raving about the all-new Yahoo! Mail > beta. > > http://new.mail.yahoo.com > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > ____________________________________________________________________________________ Now that's room service! Choose from over 150,000 hotels in 45,000 destinations on Yahoo! Travel to find your fit. http://farechase.yahoo.com/promo-generic-14795097 From narravul at cse.ohio-state.edu Thu Feb 15 17:07:51 2007 From: narravul at cse.ohio-state.edu (Sundeep Narravula) Date: Thu Feb 15 17:08:17 2007 Subject: [mvapich-discuss] mvapich2-0.9.8-branch question In-Reply-To: <20070215210547.4065.qmail@web32512.mail.mud.yahoo.com> Message-ID: Kanoj, > Yes, the rnic I am working with can be put into a mode > to guarantee in-order placement. This is good for RDMA_FAST_PATH. > I think a specifiable environment variable should be > able to handle this case correctly, no? 
We will provide an environment variable to enable RDMA_FAST_PATH.
This will be in the SVN in the next 1-2 days.

> In general, are there any online documents describing
> internals of mvapich/iwarp and the various settable
> parameters etc? For example, if a newbie wanted to
> understand the iwarp primitives used in a benchmark
> program (say one of the osu_benchmarks), how would he
> go about that?

All the information regarding the adjustable parameters of MVAPICH2 is
available in the MVAPICH2 user guide:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/user_guide.html
http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html

Cheers,
--Sundeep.

>
> Thanks.
>
> Kanoj
>
> --- Sundeep Narravula wrote:
>
> > Hi Kanoj,
> >
> > The RDMA_FAST_PATH that we use in mvapich2 involves polling on the last
> > byte of an expected message to flag message completions on the remote
> > side. Since this is not guaranteed by iWARP we have disabled this
> > by default for the iWARP mode.
> >
> > Does the RNIC you use guarantee inorder placement of RDMA bytes into the
> > remote buffer? or guarantee that the last byte will be placed last into
> > the user buffer?
> >
> > Regards,
> > --Sundeep.
> >
> >
> > On Wed, 14 Feb 2007, Kanoj Sarcar wrote:
> >
> > > Hello all,
> > >
> > > I am a mpi/mvapich newbie looking at iwarp support in
> > > mvapich2.
> > >
> > > Between 0.9.8 release and branch, in procedure
> > > rdma_get_control_parameters() in file
> > > src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_param.c,
> > > I see a difference that I am trying to understand.
> > > > > > In the release version, the code was: > > > > > > if ((value = > > getenv("MV2_DISABLE_RDMA_FAST_PATH")) > > > != NULL) { > > > proc->has_adaptive_fast_path = 0; > > > rdma_polling_set_limit = 0; > > > } else { > > > proc->has_adaptive_fast_path = 1; > > > } > > > > > > but this got changed in the branch version to also > > > turn off adaptive fast path in case of iwarp. > > > > > > I am curious about the rationale for this change. > > Is > > > this to mostly support amasso card that (AFAIK) > > does > > > not have user mode fastpath support? Or is there > > some > > > architectural issue in current mvapich2 that > > requires > > > turning off fastpath on iwarp rnics? > > > > > > Thanks. > > > > > > Kanoj > > > > > > PS: please cc me on responses, I am not subscribed > > to > > > the list > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > Do you Yahoo!? > > > Everyone is raving about the all-new Yahoo! Mail > > beta. > > > http://new.mail.yahoo.com > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > ____________________________________________________________________________________ > Now that's room service! Choose from over 150,000 hotels > in 45,000 destinations on Yahoo! Travel to find your fit. > http://farechase.yahoo.com/promo-generic-14795097 > From kanojsarcar at yahoo.com Thu Feb 15 18:20:22 2007 From: kanojsarcar at yahoo.com (Kanoj Sarcar) Date: Thu Feb 15 19:01:59 2007 Subject: [mvapich-discuss] mvapich2-0.9.8-branch question In-Reply-To: Message-ID: <622949.87861.qm@web32503.mail.mud.yahoo.com> --- Sundeep Narravula wrote: > Kanoj, > > > Yes, the rnic I am working with can be put into a > mode > > to guarantee in-order placement. > > This is good for RDMA_FAST_PATH. 
> > > I think a specifiable environment variable should > be > > able to handle this case correctly, no? > > We will be providing an option to enable > RDMA_FAST_PATH with a specifiable > environment variable. This will be in the SVN in the > next 1-2 days. Thanks; any chance this can get picked into ofed 1.2? > > > In general, are there any online documents > describing > > internals of mvapich/iwarp and the various > settable > > parameters etc? For example, if a newbie wanted to > > understand the iwarp primitives used in a > benchmark > > program (say one of the osu_benchmarks), how would > he > > go about that? > > All the information regarding the adjustable > parameters of MVAPICH2 are > available from the MVAPICH2 user-guide. > > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/user_guide.html > http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html > Okay, thanks, I will look these over. Kanoj > Cheers, > --Sundeep. > > > > > Thanks. > > > > Kanoj > > > > --- Sundeep Narravula > > > wrote: > > > > > Hi Kanoj, > > > > > > The RDMA_FAST_PATH that we use in mvapich2 > > > involves polling on the last > > > byte of an expected message to flag message > > > completions on the remote > > > side. Since this is not guaranteed by iWARP we > have > > > disabled this > > > by default for the iWARP mode. > > > > > > Does the RNIC you use guarantee inorder > placement of > > > RDMA bytes into the > > > remote buffer? or guarantee that the last byte > will > > > be placed last into > > > the user buffer? > > > > > > Regards, > > > --Sundeep. > > > > > > > > > On Wed, 14 Feb 2007, Kanoj Sarcar wrote: > > > > > > > Hello all, > > > > > > > > I am a mpi/mvapich newbie looking at iwarp > support > > > in > > > > mvapich2. 
> > > > > > > > Between 0.9.8 release and branch, in procedure > > > > rdma_get_control_parameters() in file > > > > > > > > > > src/mpid/osu_ch3/channels/mrail/src/gen2/ibv_param.c, > > > > I see a difference that I am trying to > understand. > > > > > > > > In the release version, the code was: > > > > > > > > if ((value = > > > getenv("MV2_DISABLE_RDMA_FAST_PATH")) > > > > != NULL) { > > > > proc->has_adaptive_fast_path = 0; > > > > rdma_polling_set_limit = 0; > > > > } else { > > > > proc->has_adaptive_fast_path = 1; > > > > } > > > > > > > > but this got changed in the branch version to > also > > > > turn off adaptive fast path in case of iwarp. > > > > > > > > I am curious about the rationale for this > change. > > > Is > > > > this to mostly support amasso card that > (AFAIK) > > > does > > > > not have user mode fastpath support? Or is > there > > > some > > > > architectural issue in current mvapich2 that > > > requires > > > > turning off fastpath on iwarp rnics? > > > > > > > > Thanks. > > > > > > > > Kanoj > > > > > > > > PS: please cc me on responses, I am not > subscribed > > > to > > > > the list > > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > > > Do you Yahoo!? > > > > Everyone is raving about the all-new Yahoo! > Mail > > > beta. > > > > http://new.mail.yahoo.com > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > > > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > > > > > > > > > ____________________________________________________________________________________ > > Now that's room service! Choose from over 150,000 > hotels > > in 45,000 destinations on Yahoo! Travel to find > your fit. 
> > http://farechase.yahoo.com/promo-generic-14795097 > > > > ____________________________________________________________________________________ No need to miss a message. Get email on-the-go with Yahoo! Mail for Mobile. Get started. http://mobile.yahoo.com/mail From panda at cse.ohio-state.edu Thu Feb 15 19:04:08 2007 From: panda at cse.ohio-state.edu (Dhabaleswar Panda) Date: Thu Feb 15 19:04:37 2007 Subject: [mvapich-discuss] mvapich2-0.9.8-branch question In-Reply-To: <622949.87861.qm@web32503.mail.mud.yahoo.com> from "Kanoj Sarcar" at Feb 15, 2007 03:20:22 PM Message-ID: <200702160004.l1G048HR026603@xi.cse.ohio-state.edu> > > We will be providing an option to enable > > RDMA_FAST_PATH with a specifiable > > environment variable. This will be in the SVN in the > > next 1-2 days. > > Thanks; any chance this can get picked into ofed 1.2? Yes, it will be available for ofed 1.2. An updated SRPM will be issued once this option is checked into the SVN. DK From yann.kalemkarian at bull.net Fri Feb 16 07:30:44 2007 From: yann.kalemkarian at bull.net (Yann K.) Date: Fri Feb 16 07:29:37 2007 Subject: [mvapich-discuss] [MVAPICH2] Suspend / Resume Message-ID: <45D5A3F4.7040909@bull.net> Hello everybody, While looking at the mvapich2 gen2 code, I was looking for routines handling SIGSTOP and CONT, and couldn't find any. I work with an OFED stack and couldn't find anything on handling those signals as well at that level. What happens to MPI processes being served with an lsf, mpd, or slurmd SIGSTOP signal, especially if rdma memory is pinned and already registered on the board ? Thanks for ideas Yann K. 
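
A note on the signals discussed above: POSIX does not allow SIGSTOP to be caught, blocked, or ignored, so a user-space library cannot install a handler for it, whereas SIGCONT can be caught. The following is a minimal C sketch, assuming a POSIX system. The names `try_install_handler` and `noop_handler` are hypothetical and are not part of MVAPICH2 or OFED, and the sketch says nothing about what happens to pinned or registered RDMA memory while a process is stopped.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <string.h>

/* Empty handler: we only care whether the registration is accepted. */
static void noop_handler(int signo)
{
    (void)signo;
}

/*
 * Hypothetical helper (illustration only).
 * Tries to install a handler for `signo`.
 * Returns 0 on success, or the errno value set by sigaction() on failure.
 * On success the previous disposition is restored before returning.
 */
int try_install_handler(int signo)
{
    struct sigaction sa, old;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = noop_handler;
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, &old) != 0)
        return errno;

    /* Put the original disposition back so the caller is unaffected. */
    sigaction(signo, &old, NULL);
    return 0;
}
```

On a POSIX-conforming system, `try_install_handler(SIGSTOP)` is expected to return `EINVAL` while `try_install_handler(SIGCONT)` returns 0, so a library could at most react after the process has been resumed.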
From narravul at cse.ohio-state.edu Fri Feb 16 15:02:21 2007
From: narravul at cse.ohio-state.edu (Sundeep Narravula)
Date: Fri Feb 16 15:02:49 2007
Subject: [mvapich-discuss] mvapich2-0.9.8-branch question
In-Reply-To: <622949.87861.qm@web32503.mail.mud.yahoo.com>
Message-ID:

Hi Kanoj,

> > We will be providing an option to enable
> > RDMA_FAST_PATH with a specifiable
> > environment variable. This will be in the SVN in the
> > next 1-2 days.
>
> Thanks; any chance this can get picked into ofed 1.2?

This option is now in the mvapich2-0.9.8 svn branch and already available
with the latest MVAPICH2 SRPM for OFED 1.2.

$ svn co https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich2/branches/0.9.8/ mvapich2

You can enable RDMA_FAST_PATH by having MV2_ENABLE_RDMA_FAST_PATH=1 in
your environment.

Regards,
--Sundeep.

From kanojsarcar at yahoo.com Fri Feb 16 15:52:54 2007
From: kanojsarcar at yahoo.com (Kanoj Sarcar)
Date: Fri Feb 16 15:57:27 2007
Subject: [mvapich-discuss] mvapich2-0.9.8-branch question
In-Reply-To:
Message-ID: <149820.16502.qm@web32501.mail.mud.yahoo.com>

Thanks all for your quick response to this!

Kanoj

From huanwei at cse.ohio-state.edu Sat Feb 17 11:23:42 2007
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Sat Feb 17 11:24:07 2007
Subject: [mvapich-discuss] [MVAPICH2] Suspend / Resume
In-Reply-To: <45D5A3F4.7040909@bull.net>
Message-ID:

Hi Yann,

Thanks for using mvapich2.

Could you clarify your question a bit more? Typically SIGSTOP pauses a
program and SIGCONT resumes it. Is this what you want to have?

If you want to suspend an MPI job and restart it later, may I suggest
using the checkpoint/restart function of the latest mvapich2 release.
Detailed instructions can be found at:

http://nowlab.cse.ohio-state.edu/projects/mpi-iba/download-mvapich2/mvapich2_user_guide.html

Please note that you need BLCR installed on your systems.

Let us know if we understand your question correctly.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

From yann.kalemkarian at bull.net Mon Feb 19 04:04:40 2007
From: yann.kalemkarian at bull.net (Yann K.)
Date: Mon Feb 19 04:04:03 2007
Subject: [mvapich-discuss] [MVAPICH2] Suspend / Resume
In-Reply-To:
References:
Message-ID: <45D96828.5090902@bull.net>

Wei,

Thanks for answering this one.
To clarify my point: over time, some jobs can become more important than
others and be scheduled to replace already running jobs. LSF allows this.
The currently running job must then be stopped. How does this go?
+ Does it happen painlessly with mvapich2?
+ How does the pinned memory behave?
+ Are the memory pages swapped out? How do they come back?
+ How does the OFED memory registration, which makes virtual/physical
  associations, behave then?
+ What happens technically when jobs are stopped by a batch scheduler?
+ Will the second job have the benefit of all the RAM, or will the pinned
  memory stay somehow?

Of course, I don't want to spend time checkpointing and restarting my job.
I just want to suspend it (like a suspend to disk), let its pages be
swapped out, let the other job run, and then put my first job back to
work.

Y

--
Yann Kalemkarian
HPC Software Engineer
Open Software R&D
Bull, Architect of an Open World TM
Phone: +33 4 7629 7393
www.bull.com

From huanwei at cse.ohio-state.edu Mon Feb 19 21:37:53 2007
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Mon Feb 19 21:38:20 2007
Subject: [mvapich-discuss] [MVAPICH2] Suspend / Resume
In-Reply-To: <45D96828.5090902@bull.net>
Message-ID:

Hi Yann,

Thanks for letting us know your detailed requirements for the
suspend/resume feature. The closest functionality in current mvapich2
releases is our CR support, which writes the application's memory
footprint to disk and restarts from it later. However, we are working on
the feature you mentioned (suspend and resume) and it will be available
in the next MVAPICH2 release.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501

From jonathan_follows at uk.ibm.com Thu Feb 22 13:30:51 2007
From: jonathan_follows at uk.ibm.com (Jonathan Follows)
Date: Thu Feb 22 13:31:22 2007
Subject: [mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?
Message-ID:

Hello,

I'm running on a relatively large cluster (160 nodes, dual-core
dual-socket) with IB connecting all nodes.

I recompiled MVAPICH 0.9.8 because I wanted to run under IBM's batch
scheduler, LoadLeveler, and that worked fine. The IB implementation is
with Voltaire PCIe adapters and I compiled MVAPICH using the
"make.mvapich.gen2" script with appropriate modifications. I'm using
Pathscale compilers, for example.

With anything like a "reasonable" number of nodes (sometimes even 16, but
>=64 for sure) I'm getting failures:

[chpcc022:14] Got completion with error, code=12, dest rank=78
at line 397 in file viacheck.c

I have now recompiled MVAPICH with -DON_DEMAND and, at run-time,
VIADEV_CM_TIMEOUT=5000000.

[REQUEST: the documentation is unclear but the value for this parameter
needs to be specified in microseconds, I believe]

Now my job is running, but it's probably running very badly; in due
course I plan on changing this timeout value to something less (but
greater than the default).

Just looking for now for any comments, ideas, experiences, advice?
Gratefully received of course,

Thanks,

Jonathan Follows
Deep Computing, Consulting I/T Specialist
IBM UK, Manchester [Internal 487099]
POST: c/o IBM UK Limited, NHBR-1PH, Portsmouth PO6 3AU
Tel: (+44) 1619057099 FAX: (+44) 870 1385642
Mobile: (+44) 7764660714 MOBX 273842
E-mail: Jonathan_Follows@uk.ibm.com
Text messaging: http://www.jonathanfollows.com/pageme.html

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

From gaoq at cse.ohio-state.edu Thu Feb 22 14:10:35 2007
From: gaoq at cse.ohio-state.edu (Qi Gao)
Date: Thu Feb 22 14:11:14 2007
Subject: [mvapich-discuss] MVAPICH on large clusters - timeouts - any advice?
References:
Message-ID: <020001c756b5$27b71e10$0763a8c0@GAO>

Hi Jonathan,

Thanks for using MVAPICH. We are glad to work with you to solve the
problems.

The "Got completion with error, code=12" failure is not related to the
VIADEV_CM_TIMEOUT environment variable. You can try increasing
VIADEV_DEFAULT_TIME_OUT to 22. The unit of VIADEV_DEFAULT_TIME_OUT is
specified by the IB Spec, page 340: the timeout is

    4.096 us * 2 ^ (<5-bit timeout value>)

VIADEV_CM_TIMEOUT is only used for connection setup, and its unit is
milliseconds (the default value is 500 milliseconds). Thanks for your
suggestion; we will modify the user guide to make this clearer.

Please let us know if you have any questions.

Regards,
--Qi
From aquarijen at gmail.com Thu Feb 22 16:06:03 2007
From: aquarijen at gmail.com (Aquarijen)
Date: Thu Feb 22 16:06:34 2007
Subject: [mvapich-discuss] viacheck.c error?
In-Reply-To: <45B6DF54.2070204@cse.ohio-state.edu>
References: <2e5ad1b10701191137y7e91389fv9561c47d207d61d7@mail.gmail.com>
 <20070119201522.GA19063@cse.ohio-state.edu>
 <2e5ad1b10701231238i594604e6w3c48fd23e053827a@mail.gmail.com>
 <45B6DF54.2070204@cse.ohio-state.edu>
Message-ID: <2e5ad1b10702221306m1df8118avcd50cb579bf6b303@mail.gmail.com>

Hi Sayantan and everyone,

I had been pulled onto other projects for a while - sorry it has been
so long for an update! But now I'm back on this as my first priority
to get working... I've tried a few things.
osu_latency, osu_bw and osu_bibw still fail with 2 processors - it is
the same problem. :( No user can run IMB - we get the same viacheck.c
error for it as well.

Your suggestions for cpilog and simpleio worked fine and these run
without problems now. Just none of the benchmarks...

So I thought I would try out the new mvapich 0.9.9 beta and see how it
went. I am having trouble compiling it and I think it may be a
related problem?
I have tried with icc and gcc. We have gcc (GCC) 4.0.2 20051125 (Red
Hat 4.0.2-8) and icc (ICC) 9.1 20061101.

There are warnings in viainit.c and viarecv.c, but there is an error
in viacheck.c. Here is the error in icc:
--------------------------------------------------------------------------------
icc -DHAVE_CONFIG_H -I.
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
-I/root/mvapich/mvapich-0.9.9-beta/include
-I/root/mvapich/mvapich-0.9.9-beta/include
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
-I/root/mvapich/mvapich-0.9.9-beta/mpid/util -DMPID_DEVICE_CODE
-DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1
-DMPID_DEBUG_NONE -DMPID_STAT_NONE -D_GNU_SOURCE -fPIC -D_EM64T_
-DEARLY_SEND_COMPLETION -DMEMORY_RELIABLE -DVIADEV_RPUT_SUPPORT
-D_SMP_ -D_SMP_RNDV_ -DCH_GEN2 -D_ICC_ -I/usr/ofed/include -O3
-DHAVE_MPICHCONF_H -I/root/mvapich/mvapich-0.9.9-beta
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2 -I. -c viacheck.c
viacheck.c(1036): warning #167: argument of type "unsigned char *" is
incompatible with parameter of type "char *"
    update_crc(1, v->buffer, header->dma_len),
    ^

viacheck.c(1557): warning #188: enumerated type mixed with another type
    rhandle->protocol);
    ^

viacheck.c(1749): warning #188: enumerated type mixed with another type
    rhandle->protocol);
    ^

viacheck.c(2570): error: identifier "IBV_EVENT_CLIENT_REREGISTER" is
undefined
    case IBV_EVENT_CLIENT_REREGISTER:
    ^

cm_user.h(6): warning #864: extern inline function
"odu_test_new_connection" was referenced but not defined
    inline void odu_test_new_connection(void);
    ^

compilation aborted for viacheck.c (code 2)
make[3]: *** [viacheck.o] Error 2
Exit status from make was 2
make[2]: *** [mpilib] Error 1
make[1]: *** [mpi-modules] Error 2
make: *** [mpi] Error 2
--------------------------------------------------------------------------------

and the error in gcc:
--------------------------------------------------------------------------------
gcc -DHAVE_CONFIG_H -I.
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
-I/root/mvapich/mvapich-0.9.9-beta/include
-I/root/mvapich/mvapich-0.9.9-beta/include
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
-I/root/mvapich/mvapich-0.9.9-beta/mpid/util -DMPID_DEVICE_CODE
-DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1
-DMPID_DEBUG_NONE -DMPID_STAT_NONE -fPIC -D_EM64T_
-DEARLY_SEND_COMPLETION -DMEMORY_RELIABLE -DVIADEV_RPUT_SUPPORT
-D_SMP_ -D_SMP_RNDV_ -DCH_GEN2 -I/usr/ofed/include -O3
-DHAVE_MPICHCONF_H -D_GNU_SOURCE -I/root/mvapich/mvapich-0.9.9-beta
-I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2 -I. -Wall -c
viacheck.c
viacheck.c: In function 'viadev_process_recv':
viacheck.c:1036: warning: pointer targets in passing argument 2 of
'update_crc' differ in signedness
viacheck.c: In function 'async_thread':
viacheck.c:2570: error: 'IBV_EVENT_CLIENT_REREGISTER' undeclared
(first use in this function)
viacheck.c:2570: error: (Each undeclared identifier is reported only once
viacheck.c:2570: error: for each function it appears in.)
make[3]: *** [viacheck.o] Error 1
Exit status from make was 2
make[2]: *** [mpilib] Error 1
make[1]: *** [mpi-modules] Error 2
make: *** [mpi] Error 2
--------------------------------------------------------------------------------

What am I doing wrong?

Thanks so much for any help you can give!!!!

-Jen

On 1/23/07, Sayantan Sur wrote:
> Hello Jen,
>
> The OSU benchmarks should be ideally run for 2 processes. Can you try
> osu_latency, osu_bw, osu_bibw with just 2 processes? I have a feeling
> that the cluster isn't set up quite right, otherwise these simple
> benchmarks wouldn't fail. Are other users able to run IMB on the cluster?
>
> cpilog might have compilation problems since MPE might have not been
> compiled in when MVAPICH was built. To enable MPE, use --with-mpe as a
> configure parameter in make.mvapich.gen2. (assuming you have downloaded
> MVAPICH-0.9.8 from our website).
>
> Similarly, with simpleio, MPIIO component needs to be compiled in when
> building MVAPICH. Use --with-romio as a configure parameter.
>
> Thanks,
> Sayantan.
>
> Aquarijen wrote:
> > Hi Sayantan,
> >
> > Thank you for your help. :)
> >
> > A few things about my environment. The compute nodes are 64 bit, so I
> > pointed the mvapich compilation to /usr/ofed/lib64 - I have no 32 bit
> > libs for ofed. The compute nodes have 2 processors each. When I have
> > tried jobs, I have submitted them through pbs (torque) and specified
> > that I want 1 processor per node - maui enforces this. I have tried
> > all my runs with 40 nodes, one processor per node.
> >
> > cpi, cpip and hello++ run without problems.
> >
> > osu_bw fails with the error:
> > Connection closed by 172.16.4.36^M
> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1
> > at line 2355 in file viacheck.c
> > done.
> >
> > osu_bcast fails with error:
> > [39] Abort: [b09n001.oic.ornl.gov:39] Got completion with error, code=1
> > at line 2355 in file viacheck.c
> > done.
> >
> > osu_bibw fails with:
> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1
> > at line 2355 in file viacheck.c
> > [39] Abort: [32] Abort: [24] Abort: [38] Abort:
> > [b09n001.oic.ornl.gov:39] Got completion with error, code=12, dest rank=0
> > at line 397 in file viacheck.c
> > [b09n008.oic.ornl.gov:32] Got completion with error, code=12, dest rank=0
> > at line 397 in file viacheck.c
> > [b09n016.oic.ornl.gov:24] Got completion with error, code=12, dest rank=0
> > at line 397 in file viacheck.c
> > [b09n002.oic.ornl.gov:38] Got completion with error, code=12, dest rank=0
> > at line 397 in file viacheck.c
> > [36] Abort: [b09n004.oic.ornl.gov:36] Got completion with error, code=12, dest rank=0
> > at line 397 in file viacheck.c
> > done.
> >
> > osu_latency fails with:
> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1
> > at line 2355 in file viacheck.c
> > done.
> >
> > I can't compile cpilog.c, I get:
> > [2vt@b09l02 osu_benchmarks-mvapich]$ which mpicc
> > /opt/mvapich-gcc-0.9.8/bin/mpicc
> > [2vt@b09l02 osu_benchmarks-mvapich]$ mpicc cpilog.c -o cpilog
> > cpilog.o(.text+0xd2): In function `main':
> > cpilog.c: undefined reference to `MPE_Init_log'
> > cpilog.o(.text+0xd7):cpilog.c: undefined reference to `MPE_Log_get_event_number'
> > cpilog.o(.text+0xdf):cpilog.c: undefined reference to `MPE_Log_get_event_number'
> > cpilog.o(.text+0xe7):cpilog.c: undefined reference to `MPE_Log_get_event_number'
> > cpilog.o(.text+0xef):cpilog.c: undefined reference to `MPE_Log_get_event_number'
> > cpilog.o(.text+0xf7):cpilog.c: undefined reference to `MPE_Log_get_event_number'
> > cpilog.o(.text+0xff):cpilog.c: more undefined references to `MPE_Log_get_event_number' follow
> > cpilog.o(.text+0x12e): In function `main':
> > cpilog.c: undefined reference to `MPE_Describe_state'
> > cpilog.o(.text+0x143):cpilog.c: undefined reference to `MPE_Describe_state'
> > cpilog.o(.text+0x158):cpilog.c: undefined reference to `MPE_Describe_state'
> > cpilog.o(.text+0x16d):cpilog.c: undefined reference to `MPE_Describe_state'
> > cpilog.o(.text+0x1a2):cpilog.c: undefined reference to `MPE_Start_log'
> > cpilog.o(.text+0x1c0):cpilog.c: undefined reference to `MPE_Log_event'
> > cpilog.o(.text+0x1f0):cpilog.c: undefined reference to `MPE_Log_event'
> > cpilog.o(.text+0x202):cpilog.c: undefined reference to `MPE_Log_event'
> > cpilog.o(.text+0x21e):cpilog.c: undefined reference to `MPE_Log_event'
> > cpilog.o(.text+0x230):cpilog.c: undefined reference to `MPE_Log_event'
> > cpilog.o(.text+0x2da):cpilog.c: more undefined references to `MPE_Log_event' follow
> > cpilog.o(.text+0x342): In function `main':
> > cpilog.c: undefined reference to `MPE_Finish_log'
> > collect2: ld returned 1 exit status
> >
> > I also can't compile simpleio.c. I get:
> > [2vt@b09l02 osu_benchmarks-mvapich]$ mpicc simpleio.c
> > simpleio.o(.text+0x252): In function `main':
> > simpleio.c: undefined reference to `MPI_File_open'
> > simpleio.o(.text+0x26e):simpleio.c: undefined reference to `MPI_File_write'
> > simpleio.o(.text+0x277):simpleio.c: undefined reference to `MPI_File_close'
> > simpleio.o(.text+0x2c0):simpleio.c: undefined reference to `MPI_File_open'
> > simpleio.o(.text+0x2dc):simpleio.c: undefined reference to `MPI_File_read'
> > simpleio.o(.text+0x2e5):simpleio.c: undefined reference to `MPI_File_close'
> > collect2: ld returned 1 exit status
> >
> > Intel MPI benchmarks (IMB-MPI1) fail with:
> >
> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=4
> > at line 2355 in file viacheck.c
> > done.
> >
> > I'd be happy to provide any other logs or any other info you might
> > think would help! Sorry it took me so long for this - I had a few
> > fires to put out. Now, this is #1 priority.
> >
> > Thanks for all your help!!!!
> > Jen
> >
> > On 1/19/07, Sayantan Sur wrote:
> >> Hello Jen,
> >>
> >> > I am new. And a little frustrated. :)
> >>
> >> Thanks for your post ... Hope your problems are short-lived :-)
> >>
> >> > I have compiled/installed mvapich 0.9.8 using ofed/gen2.
> >> >
> >> > I can run cpi on all my nodes just fine. The problem comes in when I
> >> > try to use any of the osu benchmark programs. They seem to compile
> >> > just fine, but when I try to run osu_bcast, I get the following error:
> >> >
> >> > [39] Abort: [b09n001.oic.ornl.gov:39] Got completion with error, code=1
> >> > at line 2355 in file viacheck.c
> >> > done.
> >> >
> >> > Where is this viacheck.c? Has anyone seen this before? I'd be happy
> >> > to provide more details if you tell me what to provide.
> >>
> >> viacheck.c is an internal file in the MVAPICH implementation.
> >> I have a couple of questions which you could answer ...
> >>
> >> 1) On how many nodes was this run attempted? I have run osu_bcast on
> >> 64-nodes/128 processes and it seems to be OK.
> >>
> >> 2) Can you run IMB (Intel MPI benchmarks)? They will also call
> >> MPI_Bcast.
> >>
> >> 3) I'm wondering if you could run the other OSU benchmarks, such as
> >> latency, bandwidth, bi-directional bandwidth?
> >>
> >> Thanks,
> >> Sayantan.
> >>
> >> --
> >> http://www.cse.ohio-state.edu/~surs
> >>
> >
>
> --
> http://www.cse.ohio-state.edu/~surs
>

--
When I play with my cat, who knows whether she is not amusing herself
with me more than I with her.
Michel de Montaigne

From vishnu at cse.ohio-state.edu Thu Feb 22 16:17:55 2007
From: vishnu at cse.ohio-state.edu (Abhinav Vishnu)
Date: Thu Feb 22 16:20:12 2007
Subject: [mvapich-discuss] viacheck.c error?
In-Reply-To: <2e5ad1b10702221306m1df8118avcd50cb579bf6b303@mail.gmail.com>
References: <2e5ad1b10701191137y7e91389fv9561c47d207d61d7@mail.gmail.com>
 <20070119201522.GA19063@cse.ohio-state.edu>
 <2e5ad1b10701231238i594604e6w3c48fd23e053827a@mail.gmail.com>
 <45B6DF54.2070204@cse.ohio-state.edu>
 <2e5ad1b10702221306m1df8118avcd50cb579bf6b303@mail.gmail.com>
Message-ID: <45DE0883.2060409@cse.ohio-state.edu>

Hi Aquarijen,

Aquarijen wrote:
> Hi Sayantan and everyone,
>
> I had been pulled onto other projects for a while - sorry it has been
> so long for an update! But now I'm back on this as my first priority
> to get working... I've tried a few things.
> osu_latency, osu_bw and osu_bibw still fail with 2 processors - it is
> the same problem. :( No user can run IMB - we get the same viacheck.c
> error for it as well.
>
Sorry to hear that you are facing these problems. Can you please send us
any further information you have about them? Please find my response to
the compile problem you reported inline.

> Your suggestions for cpilog and simpleio worked fine and these run
> without problems now. Just none of the benchmarks...
>
> So I thought I would try out the new mvapich 0.9.9 beta and see how it
> went. I am having trouble compiling it and I think it may be a
> related problem?
> I have tried with icc and gcc. We have gcc (GCC) 4.0.2 20051125 (Red
> Hat 4.0.2-8) and icc (ICC) 9.1 20061101.
>
> There are warnings in viainit.c and viarecv.c, but there is an error
> in viacheck.c. Here is the error in icc:
> --------------------------------------------------------------------------------
>
> icc -DHAVE_CONFIG_H -I.
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
> -I/root/mvapich/mvapich-0.9.9-beta/include
> -I/root/mvapich/mvapich-0.9.9-beta/include
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/util -DMPID_DEVICE_CODE
> -DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1
> -DMPID_DEBUG_NONE -DMPID_STAT_NONE -D_GNU_SOURCE -fPIC -D_EM64T_
> -DEARLY_SEND_COMPLETION -DMEMORY_RELIABLE -DVIADEV_RPUT_SUPPORT
> -D_SMP_ -D_SMP_RNDV_ -DCH_GEN2 -D_ICC_ -I/usr/ofed/include -O3
> -DHAVE_MPICHCONF_H -I/root/mvapich/mvapich-0.9.9-beta
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2 -I. -c viacheck.c
> viacheck.c(1036): warning #167: argument of type "unsigned char *" is
> incompatible with parameter of type "char *"
>     update_crc(1, v->buffer, header->dma_len),
>     ^
>
> viacheck.c(1557): warning #188: enumerated type mixed with another type
>     rhandle->protocol);
>     ^
>
> viacheck.c(1749): warning #188: enumerated type mixed with another type
>     rhandle->protocol);
>     ^
>
> viacheck.c(2570): error: identifier "IBV_EVENT_CLIENT_REREGISTER" is
> undefined
>     case IBV_EVENT_CLIENT_REREGISTER:
>     ^
>
May I ask which OpenFabrics Gen2 libraries you are using? Typically, OFED
libraries are installed in /usr/local/ofed/{lib, lib64}, depending on
your architecture (32-bit vs 64-bit), and the include files are in
/usr/local/ofed/include.

As an example, on our cluster (which has OFED-1.1 installed), the verbs.h
file in that include location defines the IBV_EVENT_CLIENT_REREGISTER
event (line 200). Can you please check the verbs.h in your include
directory and let us know whether this event is defined there?

Thanks much,

:- Abhinav
> cm_user.h(6): warning #864: extern inline function
> "odu_test_new_connection" was referenced but not defined
>     inline void odu_test_new_connection(void);
>     ^
>
> compilation aborted for viacheck.c (code 2)
> make[3]: *** [viacheck.o] Error 2
> Exit status from make was 2
> make[2]: *** [mpilib] Error 1
> make[1]: *** [mpi-modules] Error 2
> make: *** [mpi] Error 2
> --------------------------------------------------------------------------------
>
>
> and the error in gcc:
> --------------------------------------------------------------------------------
>
> gcc -DHAVE_CONFIG_H -I.
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
> -I/root/mvapich/mvapich-0.9.9-beta/include
> -I/root/mvapich/mvapich-0.9.9-beta/include
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/util -DMPID_DEVICE_CODE
> -DHAVE_UNAME=1 -DHAVE_NETDB_H=1 -DHAVE_GETHOSTBYNAME=1
> -DMPID_DEBUG_NONE -DMPID_STAT_NONE -fPIC -D_EM64T_
> -DEARLY_SEND_COMPLETION -DMEMORY_RELIABLE -DVIADEV_RPUT_SUPPORT
> -D_SMP_ -D_SMP_RNDV_ -DCH_GEN2 -I/usr/ofed/include -O3
> -DHAVE_MPICHCONF_H -D_GNU_SOURCE -I/root/mvapich/mvapich-0.9.9-beta
> -I/root/mvapich/mvapich-0.9.9-beta/mpid/ch_gen2 -I.
-Wall -c > viacheck.c > viacheck.c: In function 'viadev_process_recv': > viacheck.c:1036: warning: pointer targets in passing argument 2 of > 'update_crc' differ in signedness > viacheck.c: In function 'async_thread': > viacheck.c:2570: error: 'IBV_EVENT_CLIENT_REREGISTER' undeclared > (first use in this function) > viacheck.c:2570: error: (Each undeclared identifier is reported only once > viacheck.c:2570: error: for each function it appears in.) > make[3]: *** [viacheck.o] Error 1 > Exit status from make was 2 > make[2]: *** [mpilib] Error 1 > make[1]: *** [mpi-modules] Error 2 > make: *** [mpi] Error 2 > ----------------------------------------------------------------------------------------------------------------- > > > What am I doing wrong? > > Thanks so much for any help you can give!!!! > > -Jen > > > > On 1/23/07, Sayantan Sur wrote: >> Hello Jen, >> >> The OSU benchmarks should be ideally run for 2 processes. Can you try >> osu_latency, osu_bw, osu_bibw with just 2 processes? I have a feeling >> that the cluster isn't set up quite right, otherwise these simple >> benchmarks wouldn't fail. Are other users able to run IMB on the >> cluster? >> >> cpilog might have compilation problems since MPE might have not been >> compiled in when MVAPICH was built. To enable MPE, use --with-mpe as a >> configure parameter in make.mvapich.gen2. (assuming you have downloaded >> MVAPICH-0.9.8 from our website). >> >> Similarly, with simpleio, MPIIO component needs to be compiled in when >> building MVAPICH. Use --with-romio as a configure parameter. >> >> Thanks, >> Sayantan. >> >> Aquarijen wrote: >> > Hi Sayantan, >> > >> > Thank you for your help. :) >> > >> > A few things about my environment. The compute nodes are 64 bit, so I >> > pointed the mvapich compilation to /usr/ofed/lib64 - I have no 32 bit >> > libs for ofed. The compute nodes have 2 processors each. 
When I have >> > tried jobs, I have submitted them through pbs (torque) and specified >> > that I want 1 processor per node - maui enforces this. I have tried >> > all my runs with 40 nodes, one processor per node. >> > >> > cpi, cpip and hello++ run without problems. >> > >> > osu_bw fails with the error: >> > Connection closed by 172.16.4.36^M >> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1 >> > at line 2355 in file viacheck.c >> > done. >> > >> > osu_bcast fails with error: >> > [39] Abort: [b09n001.oic.ornl.gov:39] Got completion with error, >> code=1 >> > at line 2355 in file viacheck.c >> > done. >> > >> > osu_bibw fails with: >> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1 >> > at line 2355 in file viacheck.c >> > [39] Abort: [32] Abort: [24] Abort: [38] Abort: >> > [b09n001.oic.ornl.gov:39] Got completion with error, code=12, dest >> > rank=0 >> > at line 397 in file viacheck.c >> > [b09n008.oic.ornl.gov:32] Got completion with error, code=12, dest >> rank=0 >> > at line 397 in file viacheck.c >> > [b09n016.oic.ornl.gov:24] Got completion with error, code=12, dest >> rank=0 >> > at line 397 in file viacheck.c >> > [b09n002.oic.ornl.gov:38] Got completion with error, code=12, dest >> rank=0 >> > at line 397 in file viacheck.c >> > [36] Abort: [b09n004.oic.ornl.gov:36] Got completion with error, >> > code=12, dest rank=0 >> > at line 397 in file viacheck.c >> > done. >> > >> > osu_latency fails with: >> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1 >> > at line 2355 in file viacheck.c >> > done. 
>> > >> > I can't compile cpilog.c, I get: >> > [2vt@b09l02 osu_benchmarks-mvapich]$ which mpicc >> > /opt/mvapich-gcc-0.9.8/bin/mpicc >> > [2vt@b09l02 osu_benchmarks-mvapich]$ mpicc cpilog.c -o cpilog >> > cpilog.o(.text+0xd2): In function `main': >> > cpilog.c: undefined reference to `MPE_Init_log' >> > cpilog.o(.text+0xd7):cpilog.c: undefined reference to >> > `MPE_Log_get_event_number'cpilog.o(.text+0xdf):cpilog.c: undefined >> > reference to `MPE_Log_get_event_number'cpilog.o(.text+0xe7):cpilog.c: >> > undefined reference to >> > `MPE_Log_get_event_number'cpilog.o(.text+0xef):cpilog.c: undefined >> > reference to `MPE_Log_get_event_number'cpilog.o(.text+0xf7):cpilog.c: >> > undefined reference to >> > `MPE_Log_get_event_number'cpilog.o(.text+0xff):cpilog.c: more >> > undefined references to `MPE_Log_get_event_number' follow >> > cpilog.o(.text+0x12e): In function `main': >> > cpilog.c: undefined reference to `MPE_Describe_state' >> > cpilog.o(.text+0x143):cpilog.c: undefined reference to >> > `MPE_Describe_state' >> > cpilog.o(.text+0x158):cpilog.c: undefined reference to >> > `MPE_Describe_state' >> > cpilog.o(.text+0x16d):cpilog.c: undefined reference to >> > `MPE_Describe_state' >> > cpilog.o(.text+0x1a2):cpilog.c: undefined reference to `MPE_Start_log' >> > cpilog.o(.text+0x1c0):cpilog.c: undefined reference to `MPE_Log_event' >> > cpilog.o(.text+0x1f0):cpilog.c: undefined reference to `MPE_Log_event' >> > cpilog.o(.text+0x202):cpilog.c: undefined reference to `MPE_Log_event' >> > cpilog.o(.text+0x21e):cpilog.c: undefined reference to `MPE_Log_event' >> > cpilog.o(.text+0x230):cpilog.c: undefined reference to `MPE_Log_event' >> > cpilog.o(.text+0x2da):cpilog.c: more undefined references to >> > `MPE_Log_event' follow >> > cpilog.o(.text+0x342): In function `main': >> > cpilog.c: undefined reference to `MPE_Finish_log' >> > collect2: ld returned 1 exit status >> > >> > I also can't compile simpleio.c. 
I get: >> > [2vt@b09l02 osu_benchmarks-mvapich]$ mpicc simpleio.c >> > simpleio.o(.text+0x252): In function `main': >> > simpleio.c: undefined reference to `MPI_File_open' >> > simpleio.o(.text+0x26e):simpleio.c: undefined reference to >> > `MPI_File_write' >> > simpleio.o(.text+0x277):simpleio.c: undefined reference to >> > `MPI_File_close' >> > simpleio.o(.text+0x2c0):simpleio.c: undefined reference to >> > `MPI_File_open' >> > simpleio.o(.text+0x2dc):simpleio.c: undefined reference to >> > `MPI_File_read' >> > simpleio.o(.text+0x2e5):simpleio.c: undefined reference to >> > `MPI_File_close' >> > collect2: ld returned 1 exit status >> > >> > Intel MPI benchmarks (IMB-MPI1) fail with: >> > >> > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=4 >> > at line 2355 in file viacheck.c >> > done. >> > >> > >> > I'd be happy to provide any other logs or any other info you might >> > think would help! Sorry it took me so long for this - I had a few >> > fires to put out. Now, this is #1 priority. >> > >> > Thanks for all your help!!!! >> > Jen >> > >> > >> > >> > On 1/19/07, Sayantan Sur wrote: >> >> Hello Jen, >> >> >> >> > I am new. And a little frustrated. :) >> >> >> >> Thanks for your post ... Hope your problems are short-lived :-) >> >> >> >> > I have compiled/installed mvapich 0.9.8 using ofed/gen2. >> >> > >> >> > I can run cpi on all my nodes just fine. The problem comes in >> when I >> >> > try to use any of the osu benchmark programs. They seem to compile >> >> > just fine, but when I try to run osu_bcast, I get the following >> error: >> >> > >> >> > [39] Abort: [b09n001.oic.ornl.gov:39] Got completion with error, >> >> code=1 >> >> > at line 2355 in file viacheck.c >> >> > done. >> >> > >> >> > Where is this viacheck.c? Has anyone seen this before? I'd be >> happy >> >> > to provide more details if you tell me what to provide. >> >> >> >> viacheck.c is an internal file in the MVAPICH implementation. 
I >> have a >> couple of questions which you could answer ... >> >> >> >> 1) How many nodes was this run attempted? I have run osu_bcast on >> >> 64-nodes/128 processes and it seems to be OK. >> >> >> >> 2) Can you run IMB (Intel MPI benchmarks)? They will also call >> >> MPI_Bcast. >> >> >> >> 3) I'm wondering if you could run the other OSU benchmarks, such as >> >> latency, bandwidth, bi-directional bandwidth? >> >> >> >> Thanks, >> >> Sayantan. >> >> >> >> -- >> >> http://www.cse.ohio-state.edu/~surs >> >> >> > >> > >> >> -- >> http://www.cse.ohio-state.edu/~surs >> >> > > From vishnu at cse.ohio-state.edu Thu Feb 22 16:50:28 2007 From: vishnu at cse.ohio-state.edu (Abhinav Vishnu) Date: Thu Feb 22 16:52:46 2007 Subject: [mvapich-discuss] viacheck.c error? In-Reply-To: <2e5ad1b10702221339y2d4814cfx7341c6dccf04cf0b@mail.gmail.com> References: <2e5ad1b10701191137y7e91389fv9561c47d207d61d7@mail.gmail.com> <20070119201522.GA19063@cse.ohio-state.edu> <2e5ad1b10701231238i594604e6w3c48fd23e053827a@mail.gmail.com> <45B6DF54.2070204@cse.ohio-state.edu> <2e5ad1b10702221306m1df8118avcd50cb579bf6b303@mail.gmail.com> <45DE0883.2060409@cse.ohio-state.edu> <2e5ad1b10702221339y2d4814cfx7341c6dccf04cf0b@mail.gmail.com> Message-ID: <45DE1024.4030009@cse.ohio-state.edu> Hi Jen, > I'm not 100% sure of what information will be most helpful, but the > error output for osu_bw (as an example) is: > -------------------------------------------------- > Connection closed by 172.16.4.36^M > [0] Abort: [b09n040.oic.ornl.gov:0] Got completion with error, code=1 > at line 2355 in file viacheck.c > done. > --------------------------------------------------- > I think the problem is occurring because your ssh connection got terminated during the execution of the application. As a result, any process that tries to communicate with the process on the node that died will get the "completion with error" during data transmission. 
IMHO, your sysadmin should be able to help you find out why the ssh connection is being terminated. > > The BUILD_ID file of my ofed is: > --------------------------------------------------------------------- > OFED-1.0 > > openib-1.0 (REV=8031) > # User space > https://openib.org/svn/gen2/branches/1.0/src/userspace > # Kernel space > https://openib.org/svn/gen2/branches/1.0/ofed/tags/1.0/linux-kernel > Git: > ref: refs/heads/for-2.6.17 > commit 959eb39297e8c82f61fbfc283ad4ff11c883bf1e > > # MPI > mpi_osu-0.9.7-mlx2.1.0.tgz > openmpi-1.1b1-1.src.rpm > mpitests-1.0-0.src.rpm > --------------------------------------------------------------------------- > > > so that may be a problem - that it is ofed 1.0? > > > ------------------------------------------------------------------------------- > > enum ibv_event_type { > IBV_EVENT_CQ_ERR, > IBV_EVENT_QP_FATAL, > IBV_EVENT_QP_REQ_ERR, > IBV_EVENT_QP_ACCESS_ERR, > IBV_EVENT_COMM_EST, > IBV_EVENT_SQ_DRAINED, > IBV_EVENT_PATH_MIG, > IBV_EVENT_PATH_MIG_ERR, > IBV_EVENT_DEVICE_FATAL, > IBV_EVENT_PORT_ACTIVE, > IBV_EVENT_PORT_ERR, > IBV_EVENT_LID_CHANGE, > IBV_EVENT_PKEY_CHANGE, > IBV_EVENT_SM_CHANGE, > IBV_EVENT_SRQ_ERR, > IBV_EVENT_SRQ_LIMIT_REACHED, > IBV_EVENT_QP_LAST_WQE_REACHED > }; > ------------------------------------------------------------------------------------ > > > Soooo, I assume my new mission is to get ofed 1.1? :} > Yes, I guess this should be the safest bet. Please let us know the outcome of your experimentation. Thanks, :- Abhinav > Thanks!!!! > Jen > > From luis.kornblueh at zmaw.de Fri Feb 23 11:53:01 2007 From: luis.kornblueh at zmaw.de (Luis Kornblueh) Date: Fri Feb 23 11:53:31 2007 Subject: [mvapich-discuss] Bug in smpd Message-ID: <20070223165301.GB26201@creus.mpi.zmaw.de> Hi, sorry if this is the wrong place. I am trying to get mvapich2 running in tight integration with SGE. It is recommended to use smpd for this. There are no home directories available on our cluster. I tried to use the -smpdfile option. 
The problem is that in smpd_connect the clean call for smpd_open_xxx is not used, but rather some (it looks like) quick-hack code. So smpd is not using the command line option smpdfile. You can get the smpd daemons to come up, but mpiexec bails out. My goal is to place the smpd file in the SGE TMPDIR, which makes it available to the full job and gets it cleaned up at the end. Hope someone can help - thanks a lot, Luis -- \\\\\\ (-0^0-) --------------------------oOO--(_)--OOo----------------------------- Luis Kornblueh Tel. : +49-40-41173289 Max-Planck-Institute for Meteorology Fax. : +49-40-41173298 Bundesstr. 53 D-20146 Hamburg Email: luis.kornblueh@zmaw.de Federal Republic of Germany From greg.lindahl at qlogic.com Fri Feb 23 21:52:29 2007 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri Feb 23 21:52:58 2007 Subject: [mvapich-discuss] MVAPICH on large clusters - timeouts - any advice? In-Reply-To: References: Message-ID: <20070224025229.GA4983@localhost.localdomain> On Thu, Feb 22, 2007 at 06:30:51PM +0000, Jonathan Follows wrote: > [chpcc022:14] Got completion with error, code=12, dest rank=78 at line 397 > in file viacheck.c My guess is that you've got some bad HCAs/cables/switches in your cluster. Have you looked at the error counters? -- greg From aliva at gatech.edu Mon Feb 26 00:03:50 2007 From: aliva at gatech.edu (Aliva Pattnaik) Date: Mon Feb 26 00:04:19 2007 Subject: [mvapich-discuss] deadlock with g95/gfortran Message-ID: <1172466230.45e26a36737b2@webmail.mail.gatech.edu> Hi, I am trying to run the Fortran example problem (fpi.f) that comes with mvapich2-0.9.8. I am using g95 to compile it. But when I run it with mpiexec it deadlocks, though in the "top" output I can see the processes taking 99% of CPU time. The same situation arises when using gfortran. But I am able to run the C example problems compiled with gcc successfully. The cluster that I am using is 64-bit AMD Opteron with InfiniBand. 
I would really appreciate it if someone could help me fix this problem. Thank you very much for your help, Aliva -- From luis.kornblueh at zmaw.de Mon Feb 26 07:49:21 2007 From: luis.kornblueh at zmaw.de (Luis Kornblueh) Date: Mon Feb 26 07:49:51 2007 Subject: [mvapich-discuss] More on bug in smpd Message-ID: <20070226124921.GA12087@creus.mpi.zmaw.de> Hello, I got some more info on this. My first idea, that there is something wrong in mpiexec with respect to the processing of the command line options, is not the reason for the problem I see. I used the smpd mechanism for the default channel (base mpich2-1.0.5p3) and everything works fine - without any SGE ... When I use the mvapich2-0.9.8 ib channel I get the following error: op_read error on left context: socket connection closed, error stack: MPIDU_Socki_handle_read(623): connection closed by peer (set=0,sock=0) unable to read the cmd header on the left context, socket connection closed, error stack: MPIDU_Socki_handle_read(623): connection closed by peer (set=0,sock=0). I definitely use the same scripts, only with the differently compiled executables. Any idea what is going wrong? Cheerio, Luis -- \\\\\\ (-0^0-) --------------------------oOO--(_)--OOo----------------------------- Luis Kornblueh Tel. : +49-40-41173289 Max-Planck-Institute for Meteorology Fax. : +49-40-41173298 Bundesstr. 53 D-20146 Hamburg Email: luis.kornblueh@zmaw.de Federal Republic of Germany From huanwei at cse.ohio-state.edu Mon Feb 26 09:45:54 2007 From: huanwei at cse.ohio-state.edu (wei huang) Date: Mon Feb 26 09:46:24 2007 Subject: [mvapich-discuss] More on bug in smpd (fwd) In-Reply-To: <200702261436.l1QEaOtF025173@xi.cse.ohio-state.edu> Message-ID: Hi Luis, Thanks for letting us know more details. As you may know, the current release of mvapich2 (0.9.8) is based on mpich2-1.0.3. And we are preparing a newer mvapich2 release based on the latest mpich2 version. 
However, we are looking into the problem you reported on mvapich2-0.9.8, and will get back to you once we get something. Thanks. -- Wei > > Hello, > > > > I got some more info on this. My first idea that there is something > > wrong in mpiexec with respect the processing of the command line options > > is not the reason for the problem I see. > > > > I used the smpd mechanism for the default channel (base > > mpich2-1.0.5p3) and everything works fine - without any SGE ... When > > I use the mvapich2-0.9.8 ib channel I get the following error: > > > > op_read error on left context: socket connection closed, error stack: > > MPIDU_Socki_handle_read(623): connection closed by peer (set=0,sock=0) > > unable to read the cmd header on the left context, socket connection closed, error stack: > > MPIDU_Socki_handle_read(623): connection closed by peer (set=0,sock=0). > > > > I definitely use the same scripts only with the differently compiled > > executables. > > > > Any idea, what is going wrong? > > > > Cheerio, > > Luis > > > > -- > > \\\\\\ > > (-0^0-) > > --------------------------oOO--(_)--OOo----------------------------- > > > > Luis Kornblueh Tel. : +49-40-41173289 > > Max-Planck-Institute for Meteorology Fax. : +49-40-41173298 > > Bundesstr. 53 > > D-20146 Hamburg Email: luis.kornblueh@zmaw.de > > Federal Republic of Germany > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > From huanwei at cse.ohio-state.edu Mon Feb 26 15:06:40 2007 From: huanwei at cse.ohio-state.edu (wei huang) Date: Mon Feb 26 15:07:09 2007 Subject: [mvapich-discuss] deadlock with g95/gfortran (fwd) In-Reply-To: Message-ID: Hi Aliva, Thanks for using mvapich2. We will take a look at this issue and get back to you. Thanks. 
-- Wei > > ---------- Forwarded message ---------- > Date: Mon, 26 Feb 2007 00:03:50 -0500 > From: Aliva Pattnaik > To: mvapich-discuss@cse.ohio-state.edu > Subject: [mvapich-discuss] deadlock with g95/gfortran > > > Hi, > > I am trying to run the fortran example problem(fpi.f) that comes with mvapich2- > 0.9.8. I am using g95 to compile it. But while running it with mpiexec its > getting deadlock, though in the "Top" output I can see the processes taking 99% > of CPU time. The same situation is arising while using gfortran. But I am able > to run c example problems compiled with gcc, successfully. > > The cluster that I am using is 64 bit AMD opteron with infiniband. > > I will really appreciate if someone can help me in fixing this problem. > > Thank you very much for your help, > Aliva > > > > -- > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >
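
Editor's note on the "viacheck.c error?" thread above: the gcc failure there comes down to whether the installed verbs.h declares IBV_EVENT_CLIENT_REREGISTER in enum ibv_event_type. The following is only a minimal sketch of that check, not part of MVAPICH or OFED. The helper names enum_members and supports_event are hypothetical, and the sample text is the OFED-1.0 enum exactly as Jen posted it. Because the event is an enumerator and not a preprocessor macro, an #ifdef test in C would not detect it, so the sketch scans the header text instead.

```python
import re

# The enum as posted in the thread from the OFED-1.0 verbs.h.
OFED_1_0_ENUM = """
enum ibv_event_type {
        IBV_EVENT_CQ_ERR,
        IBV_EVENT_QP_FATAL,
        IBV_EVENT_QP_REQ_ERR,
        IBV_EVENT_QP_ACCESS_ERR,
        IBV_EVENT_COMM_EST,
        IBV_EVENT_SQ_DRAINED,
        IBV_EVENT_PATH_MIG,
        IBV_EVENT_PATH_MIG_ERR,
        IBV_EVENT_DEVICE_FATAL,
        IBV_EVENT_PORT_ACTIVE,
        IBV_EVENT_PORT_ERR,
        IBV_EVENT_LID_CHANGE,
        IBV_EVENT_PKEY_CHANGE,
        IBV_EVENT_SM_CHANGE,
        IBV_EVENT_SRQ_ERR,
        IBV_EVENT_SRQ_LIMIT_REACHED,
        IBV_EVENT_QP_LAST_WQE_REACHED
};
"""


def enum_members(header_text, enum_name):
    """Return the enumerator names declared in `enum <enum_name> { ... }`.

    Hypothetical helper: a plain text scan, not a real C parser.
    """
    pattern = r"enum\s+" + re.escape(enum_name) + r"\s*\{(.*?)\}"
    match = re.search(pattern, header_text, re.S)
    if match is None:
        return []
    # Drop C comments so they are not mistaken for enumerator names.
    body = re.sub(r"/\*.*?\*/", "", match.group(1), flags=re.S)
    body = re.sub(r"//[^\n]*", "", body)
    names = []
    for item in body.split(","):
        name = item.split("=")[0].strip()  # tolerate "NAME = value"
        if name:
            names.append(name)
    return names


def supports_event(header_text, event="IBV_EVENT_CLIENT_REREGISTER"):
    """True if ibv_event_type in the given header text declares `event`."""
    return event in enum_members(header_text, "ibv_event_type")


print(supports_event(OFED_1_0_ENUM))  # prints False for the OFED-1.0 enum
```

To check a real installation, read the header into a string and pass it to supports_event. The likely location, given the -I/usr/ofed/include flag in the gcc command above and the usual libibverbs layout, would be /usr/ofed/include/infiniband/verbs.h, but that path is an assumption and should be confirmed on the machine in question.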