From schuang at ats.ucla.edu Fri Feb 1 20:24:36 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Fri Feb 1 20:24:49 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <20080201043541.GA5879@ats.ucla.edu> References: <47A1334B.5030906@ucla.edu> <20080201043541.GA5879@ats.ucla.edu> Message-ID: <20080202012436.GA25420@ats.ucla.edu> Hi Wei, We cleaned up a few things and re-ran the mpiGraph tests. The updated results are posted here: http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html Please ignore results in my previous email. Thank you. Regards, Shao-Ching On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote: > > Hi Wei, > > We did 2 runs of mpiGraph that you suggested on 48 nodes, with one (1) > MPI process per node: > > mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out > > The results from the two runs are posted here: > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/ > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/ > > During the tests, some other users are also running jobs on some of > these 48 nodes. > > Could you please help us interpret these results, if possible? > > Thank you. > > Shao-Ching Huang > > > On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote: > > Hi Scott, > > > > We went up to 256 processes (32 nodes) and did not see the problem in few > > hundred runs (cpi). Thus, to narrow down the problem, we want to make sure > > the fabrics and system setup are ok. To diagnose this, we suggest you > > running mpiGraph program from http://sourceforge.net/projects/mpigraph. > > This test stresses the interconnects. It should fail at a much higher > > frequency than simple cpi program if there is a problem with your system > > setup. > > > > Thanks. > > > > Regards, > > Wei Huang > > > > 774 Dreese Lab, 2015 Neil Ave, > > Dept. 
of Computer Science and Engineering > > Ohio State University > > OH 43210 > > Tel: (614)292-8501 > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > > > My co-worker passed this along... > > > > > > Yes, the error happens on the cpi.c program too. It happened 2 times > > > among the 9 cases I ran. > > > > > > I was using 128 processes (on 32 4-core nodes). > > > > > > --- > > > > > > and another... > > > > > > It happens for a simple MPI program which just does MPI_Init and > > > MPI_Finalize and print out number of processors. It happened for > > > anything from 4 nodes (16 processors ) and more. > > > > > > What environment variables should we look for? > > > > > > Thanks, > > > Scott > > > > > > wei huang wrote: > > > > Hi Scott, > > > > > > > > On how many processes (and how many nodes) you ran your program? Do you > > > > have any environmental variables when you are running the program? Does > > > > the error happen on simple test like cpi? > > > > > > > > Thanks. > > > > > > > > Regards, > > > > Wei Huang > > > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > > Dept. of Computer Science and Engineering > > > > Ohio State University > > > > OH 43210 > > > > Tel: (614)292-8501 > > > > > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > > > > > >> The low level ibv tests work fine. 
> > > > > > > > _______________________________________________ > > > > mvapich-discuss mailing list > > > > mvapich-discuss@cse.ohio-state.edu > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss From huanwei at cse.ohio-state.edu Fri Feb 1 20:43:19 2008 From: huanwei at cse.ohio-state.edu (wei huang) Date: Fri Feb 1 20:43:29 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: <20080202012436.GA25420@ats.ucla.edu> Message-ID: Hi, How often do you observe the failures when running the mpiGraph test? Do all the failure happen at startup, as your simple program? Thanks. Regards, Wei Huang 774 Dreese Lab, 2015 Neil Ave, Dept. of Computer Science and Engineering Ohio State University OH 43210 Tel: (614)292-8501 On Fri, 1 Feb 2008, Shao-Ching Huang wrote: > > Hi Wei, > > We cleaned up a few things and re-ran the mpiGraph tests. The updated > results are posted here: > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html > > Please ignore results in my previous email. Thank you. 
> > Regards, > Shao-Ching > > > On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote: > > > > Hi Wei, > > > > We did 2 runs of mpiGraph that you suggested on 48 nodes, with one (1) > > MPI process per node: > > > > mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out > > > > The results from the two runs are posted here: > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/ > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/ > > > > During the tests, some other users are also running jobs on some of > > these 48 nodes. > > > > Could you please help us interpret these results, if possible? > > > > Thank you. > > > > Shao-Ching Huang > > > > > > On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote: > > > Hi Scott, > > > > > > We went up to 256 processes (32 nodes) and did not see the problem in few > > > hundred runs (cpi). Thus, to narrow down the problem, we want to make sure > > > the fabrics and system setup are ok. To diagnose this, we suggest you > > > running mpiGraph program from http://sourceforge.net/projects/mpigraph. > > > This test stresses the interconnects. It should fail at a much higher > > > frequency than simple cpi program if there is a problem with your system > > > setup. > > > > > > Thanks. > > > > > > Regards, > > > Wei Huang > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > Dept. of Computer Science and Engineering > > > Ohio State University > > > OH 43210 > > > Tel: (614)292-8501 > > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > > > > > My co-worker passed this along... > > > > > > > > Yes, the error happens on the cpi.c program too. It happened 2 times > > > > among the 9 cases I ran. > > > > > > > > I was using 128 processes (on 32 4-core nodes). > > > > > > > > --- > > > > > > > > and another... > > > > > > > > It happens for a simple MPI program which just does MPI_Init and > > > > MPI_Finalize and print out number of processors. 
It happened for > > > > anything from 4 nodes (16 processors ) and more. > > > > > > > > What environment variables should we look for? > > > > > > > > Thanks, > > > > Scott > > > > > > > > wei huang wrote: > > > > > Hi Scott, > > > > > > > > > > On how many processes (and how many nodes) you ran your program? Do you > > > > > have any environmental variables when you are running the program? Does > > > > > the error happen on simple test like cpi? > > > > > > > > > > Thanks. > > > > > > > > > > Regards, > > > > > Wei Huang > > > > > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > > > Dept. of Computer Science and Engineering > > > > > Ohio State University > > > > > OH 43210 > > > > > Tel: (614)292-8501 > > > > > > > > > > > > > > > On Wed, 30 Jan 2008, Scott A. Friedman wrote: > > > > > > > > > >> The low level ibv tests work fine. > > > > > > > > > > _______________________________________________ > > > > > mvapich-discuss mailing list > > > > > mvapich-discuss@cse.ohio-state.edu > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > > > > > > > > > _______________________________________________ > > > mvapich-discuss mailing list > > > mvapich-discuss@cse.ohio-state.edu > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From biswajit at crlindia.com Mon Feb 4 08:11:07 2008 From: biswajit at crlindia.com (Biswajit Mishra) Date: Mon Feb 4 08:11:23 2008 Subject: [mvapich-discuss] checkpointing performance Message-ID: <0AF7442124F01C49A6A93D8F04E5E3CF067C56@CHNEXVS01.VSNLXCHANGE.COM> Suppose I have build MVAPICH with checkpointing , but application doesnot take any checkpoint. (This is done by setting MV2_CKPT_INTERVAL to 0) . 
Is there any performance degradation compared with an application using MVAPICH built without checkpointing support?

From huanwei at cse.ohio-state.edu Mon Feb 4 10:33:01 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Mon Feb 4 10:33:13 2008
Subject: [mvapich-discuss] checkpointing performance
In-Reply-To: <0AF7442124F01C49A6A93D8F04E5E3CF067C56@CHNEXVS01.VSNLXCHANGE.COM>
Message-ID:

Hi Biswajit,

Thanks for trying out the CR feature. There will be a slight performance
degradation if you build MVAPICH2 with CR, even if the application does not
take any checkpoints. This is because some of the performance optimizations,
such as intra-node shared memory communication, are not currently supported
with CR enabled. We are working on adding this functionality for the next
release of MVAPICH2.

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Mon, 4 Feb 2008, Biswajit Mishra wrote:

>
> Suppose I have build MVAPICH with checkpointing , but application
> doesnot take any checkpoint. (This is done by setting
> MV2_CKPT_INTERVAL to 0) .
Is there any performance degradation in
> comparison to application using MVAPICH built without checkpointing
> support ....???
>

From peter.cebull at inl.gov Wed Feb 6 16:58:22 2008
From: peter.cebull at inl.gov (Peter Cebull)
Date: Wed Feb 6 19:23:15 2008
Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers
Message-ID: <47AA2D7E.7030109@inl.gov>

I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and am
running into an error. Our system is running SLES 10. Log files are
attached. Has anyone seen this before?

Thanks,
Peter

--
Peter Cebull
HPC User Consultant
Idaho National Laboratory
P.O. Box 1625, MS3605
Idaho Falls, ID 83415
Phone: 208-526-1909
Email: Peter.Cebull@inl.gov

-------------- next part --------------
A non-text attachment was scrubbed...
Name: config.log
Type: text/x-log
Size: 6666 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080206/b4c03258/config-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: config-mine.log
Type: text/x-log
Size: 28048 bytes
Desc: not available
Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080206/b4c03258/config-mine-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: make-mine.log Type: text/x-log Size: 192376 bytes Desc: not available Url : http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080206/b4c03258/make-mine-0001.bin From perkinjo at cse.ohio-state.edu Wed Feb 6 21:30:10 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Wed Feb 6 21:30:28 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AA2D7E.7030109@inl.gov> References: <47AA2D7E.7030109@inl.gov> Message-ID: <47AA6D32.50600@cse.ohio-state.edu> Peter Cebull wrote: > I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and am > running into an error. Our system is running SLES 10. Log files are > attached. Has anyone seen this before? > > Thanks, > Peter > > > ------------------------------------------------------------------------ > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss Peter: We're sorry that you're having trouble with the PGI compiler. We'll take a look into this issue and get back to you as soon as we can. -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From perkinjo at cse.ohio-state.edu Thu Feb 7 13:57:40 2008 From: perkinjo at cse.ohio-state.edu (Jonathan L. Perkins) Date: Thu Feb 7 13:57:50 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AA6D32.50600@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> Message-ID: <47AB54A4.8080101@cse.ohio-state.edu> Jonathan Perkins wrote: > Peter Cebull wrote: >> I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and am >> running into an error. Our system is running SLES 10. Log files are >> attached. Has anyone seen this before? 
>> >> Thanks, >> Peter >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> mvapich-discuss mailing list >> mvapich-discuss@cse.ohio-state.edu >> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > Peter: > We're sorry that you're having trouble with the PGI compiler. We'll > take a look into this issue and get back to you as soon as we can. > > Peter: Can you send a bit more information related to this error. Can you tell us which variables you set in order to use the pgi compiler. Also, are you using the make.mvapich.gen2 script for your build process? I haven't been able to reproduce this error using SuSE 10 and both PGI 7.0-7 and PGI 7.1-3. -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From peter.cebull at inl.gov Thu Feb 7 14:26:51 2008 From: peter.cebull at inl.gov (Peter Cebull) Date: Thu Feb 7 14:27:25 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AB54A4.8080101@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> Message-ID: <47AB5B7B.1030306@inl.gov> Jonathan L. Perkins wrote: > Jonathan Perkins wrote: >> Peter Cebull wrote: >>> I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and >>> am running into an error. Our system is running SLES 10. Log files >>> are attached. Has anyone seen this before? >>> >>> Thanks, >>> Peter >>> >>> >>> ------------------------------------------------------------------------ >>> >>> >>> _______________________________________________ >>> mvapich-discuss mailing list >>> mvapich-discuss@cse.ohio-state.edu >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >> >> Peter: >> We're sorry that you're having trouble with the PGI compiler. We'll >> take a look into this issue and get back to you as soon as we can. 
>> >> > > Peter: > Can you send a bit more information related to this error. Can you > tell us which variables you set in order to use the pgi compiler. > Also, are you using the make.mvapich.gen2 script for your build process? > > I haven't been able to reproduce this error using SuSE 10 and both PGI > 7.0-7 and PGI 7.1-3. > I've attached the make.mvapich.gen2 file I was using. I think all I changed were these variables IBHOME=${IBHOME:-/usr} IBHOME_LIB=${IBHOME_LIB:-/usr/lib64} PREFIX=${PREFIX:-/usr/local/mvapich/mvapich-1.0-beta/pgi-opt} export CC=${CC:-pgcc} export CXX=${CXX:-pgCC} export F77=${F77:-pgf77} export F90=${F90:-pgf90} and added some options to configure: ./configure --enable-shared-lib --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \ --enable-f77 --enable-f90 --enable-f90modules $ROMIO -lib="$LIBS" 2>&1 |tee config-mine.log Then I just executed the script. Let me know if you need any more info. Thanks, Peter -- Peter Cebull HPC User Consultant Idaho National Laboratory P.O. Box 1625, MS3605 Idaho Falls, ID 83415 Phone: 208-526-1909 Email: Peter.Cebull@inl.gov -------------- next part -------------- #!/bin/bash # Most variables here can be overridden by exporting them in the environment # before running this script. Default values have been provided if the # environment variable is not already set. source ./make.mvapich.def # The target architecture. If not exported outside of this script, # it will be found automatically or prompted for if necessary. # Supported: "_IA32_", "_IA64_", "_EM64T_", "_X86_64_" # if [ -z "$ARCH" ]; then arch fi # This is the compatibility mode. If not exported outside this # script, the user is prompted to select either mode. In # autodetection mode, different types of IB hardware can # be detected and optimized for. In compatibility mode a # common set of parameters is chosen by MVAPICH. # Supported: "AUTO_DETECT" and "COMPAT_MODE" # if [ -z "$COMPAT" ]; then prompt_compat_mode fi # Mandatory variables. 
All are checked except CXX and F90.
IBHOME=${IBHOME:-/usr}
IBHOME_LIB=${IBHOME_LIB:-/usr/lib64}
PREFIX=${PREFIX:-/usr/local/mvapich/mvapich-1.0-beta/pgi-opt}
export CC=${CC:-pgcc}
export CXX=${CXX:-pgCC}
export F77=${F77:-pgf77}
export F90=${F90:-pgf90}

if [ $ARCH = "SOLARIS" ]; then
    die_setup "MVAPICH GEN2 is not supported on Solaris."
elif [ $ARCH = "MAC_OSX" ]; then
    die_setup "MVAPICH GEN2 is not supported on MacOS."
fi

#
# Compiler specific flags. If you are using
# ICC on IA64 platform, please set COMPILER_FLAG
# to "icc"
#
COMPILER_FLAG=${COMPILER_FLAG:-}

if [ "$COMPILER_FLAG" == "icc" ]; then
    COMPILER_FLAG="-D_ICC_"
else
    COMPILER_FLAG=""
fi

# Check mandatory variable settings.
if [ -z "$IBHOME" ] || [ -z "$PREFIX" ] || [ -z "$CC" ] || [ -z "$F77" ]; then
    die_setup "Please set mandatory variables in this script."
elif [ ! -d $IBHOME ]; then
    die_setup "IBHOME directory $IBHOME does not exist."
fi

# Optional variables.
#
# Whether to enable ROMIO support. This is necessary if building the
# F90 modules.
if [ -n "$F90" ]; then
    ROMIO="--with-romio"
else
    ROMIO=${ROMIO:---without-romio}
fi

# PTMALLOC support for MVAPICH2 memory hooks. Enabling this will allow
# MVAPICH2 to release memory to the Operating System (when registration
# cache is enabled). Enabled by default. Disable with "no".
PTMALLOC=${PTMALLOC:-}

if [ "$PTMALLOC" = "no" ]; then
    PTMALLOC="-DDISABLE_PTMALLOC"
else
    PTMALLOC=""
fi

# Whether to use an optimized queue pair exchange scheme. This is not
# checked for a setting in the script. It must be set here explicitly.
# Supported: "-DUSE_MPD_RING", "-DUSE_MPD_BASIC" and "" (to disable)
HAVE_MPD_RING=${HAVE_MPD_RING:-}

# Set this to override automatic optimization setting (-O3).
OPT_FLAG=${OPT_FLAG:--O3}

export LIBS=${LIBS:--L${IBHOME_LIB} -Wl,-rpath=${IBHOME_LIB} -libverbs -libumad -lpthread}
export FFLAGS=${FFLAGS:--L${IBHOME_LIB}}
export CFLAGS=${CFLAGS:--D${ARCH} -D${COMPAT} ${PTMALLOC} -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_ -D_SMP_RNDV_ -DCH_GEN2 -D_GNU_SOURCE ${COMPILER_FLAG} ${HAVE_MPD_RING} -I${IBHOME}/include $OPT_FLAG}
export MPIRUN_CFLAGS="${MPIRUN_CFLAGS} -DLD_LIBRARY_PATH_MPI=\\\"${PREFIX}/lib/shared\\\" -DPARAM_GLOBAL=\\\"${PREFIX}/etc/mvapich.conf\\\""

# Prelogue
make distclean &>/dev/null

set -o pipefail

# Configure MVAPICH
echo "Configuring MVAPICH..."
./configure --enable-shared-lib --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
    --enable-f77 --enable-f90 --enable-f90modules $ROMIO -lib="$LIBS" 2>&1 |tee config-mine.log
ret=$?
test $ret = 0 || die "configuration."

# Build MVAPICH
echo "Building MVAPICH..."
make 2>&1 |tee make-mine.log
ret=$?
test $ret = 0 || die "building MVAPICH."

# Install MVAPICH
echo "MVAPICH installation..."
rm -f install-mine.log
make install 2>&1 |tee install-mine.log
ret=$?
test $ret = 0 || die "installing MVAPICH."

From perkinjo at cse.ohio-state.edu Thu Feb 7 14:33:56 2008
From: perkinjo at cse.ohio-state.edu (Jonathan L. Perkins)
Date: Thu Feb 7 14:34:08 2008
Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers
In-Reply-To: <47AB5B7B.1030306@inl.gov>
References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov>
Message-ID: <47AB5D24.4070800@cse.ohio-state.edu>

Peter Cebull wrote:
> Jonathan L. Perkins wrote:
>> Jonathan Perkins wrote:
>>> Peter Cebull wrote:
>>>> I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and
>>>> am running into an error. Our system is running SLES 10. Log files
>>>> are attached. Has anyone seen this before?
>>>> >>>> Thanks, >>>> Peter >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> _______________________________________________ >>>> mvapich-discuss mailing list >>>> mvapich-discuss@cse.ohio-state.edu >>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>> >>> Peter: >>> We're sorry that you're having trouble with the PGI compiler. We'll >>> take a look into this issue and get back to you as soon as we can. >>> >>> >> >> Peter: >> Can you send a bit more information related to this error. Can you >> tell us which variables you set in order to use the pgi compiler. >> Also, are you using the make.mvapich.gen2 script for your build process? >> >> I haven't been able to reproduce this error using SuSE 10 and both PGI >> 7.0-7 and PGI 7.1-3. >> > I've attached the make.mvapich.gen2 file I was using. I think all I > changed were these variables > > IBHOME=${IBHOME:-/usr} > IBHOME_LIB=${IBHOME_LIB:-/usr/lib64} > PREFIX=${PREFIX:-/usr/local/mvapich/mvapich-1.0-beta/pgi-opt} > export CC=${CC:-pgcc} > export CXX=${CXX:-pgCC} > export F77=${F77:-pgf77} > export F90=${F90:-pgf90} > > and added some options to configure: > > ./configure --enable-shared-lib --with-device=ch_gen2 --with-arch=LINUX > -prefix=${PREFIX} \ > --enable-f77 --enable-f90 --enable-f90modules $ROMIO -lib="$LIBS" > 2>&1 |tee config-mine.log > > Then I just executed the script. Let me know if you need any more info. > > Thanks, > Peter > Thanks for the info. I'm trying this out now. 
-- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From perkinjo at cse.ohio-state.edu Fri Feb 8 09:17:55 2008 From: perkinjo at cse.ohio-state.edu (Jonathan Perkins) Date: Fri Feb 8 09:18:13 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AB5D24.4070800@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> Message-ID: <47AC6493.2080302@cse.ohio-state.edu> Jonathan L. Perkins wrote: > Peter Cebull wrote: >> Jonathan L. Perkins wrote: >>> Jonathan Perkins wrote: >>>> Peter Cebull wrote: >>>>> I'm trying to build MVAPICH with PGI compilers (version 7.1-3) and >>>>> am running into an error. Our system is running SLES 10. Log files >>>>> are attached. Has anyone seen this before? >>>>> >>>>> Thanks, >>>>> Peter >>>>> >>>>> >>>>> ------------------------------------------------------------------------ >>>>> >>>>> >>>>> _______________________________________________ >>>>> mvapich-discuss mailing list >>>>> mvapich-discuss@cse.ohio-state.edu >>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss >>>> >>>> Peter: >>>> We're sorry that you're having trouble with the PGI compiler. We'll >>>> take a look into this issue and get back to you as soon as we can. >>>> >>>> >>> >>> Peter: >>> Can you send a bit more information related to this error. Can you >>> tell us which variables you set in order to use the pgi compiler. >>> Also, are you using the make.mvapich.gen2 script for your build process? >>> >>> I haven't been able to reproduce this error using SuSE 10 and both >>> PGI 7.0-7 and PGI 7.1-3. >>> >> I've attached the make.mvapich.gen2 file I was using. 
I think all I >> changed were these variables >> >> IBHOME=${IBHOME:-/usr} >> IBHOME_LIB=${IBHOME_LIB:-/usr/lib64} >> PREFIX=${PREFIX:-/usr/local/mvapich/mvapich-1.0-beta/pgi-opt} >> export CC=${CC:-pgcc} >> export CXX=${CXX:-pgCC} >> export F77=${F77:-pgf77} >> export F90=${F90:-pgf90} >> >> and added some options to configure: >> >> ./configure --enable-shared-lib --with-device=ch_gen2 >> --with-arch=LINUX -prefix=${PREFIX} \ >> --enable-f77 --enable-f90 --enable-f90modules $ROMIO >> -lib="$LIBS" 2>&1 |tee config-mine.log >> >> Then I just executed the script. Let me know if you need any more info. >> >> Thanks, >> Peter >> > > Thanks for the info. I'm trying this out now. > Using the supplied make.mvapich.gen2 (editing the IBHOME variables), I am unable to get configure to run successfully. However, when I take a pristine make.mvapich.gen2 script and manually make the changes it configures and builds fine. I suggest starting over with a pristine make.mvapich.gen2 script. If this builds fine, you should modify the configure line and simply export the variables you'd like overwritten at the shell before executing the make.mvapich.gen2 script. Let us know how this goes. -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From peter.cebull at inl.gov Fri Feb 8 12:06:14 2008 From: peter.cebull at inl.gov (Peter Cebull) Date: Fri Feb 8 12:06:52 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AC6493.2080302@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> Message-ID: <47AC8C06.6050209@inl.gov> An HTML attachment was scrubbed... 
URL: http://mail.cse.ohio-state.edu/pipermail/mvapich-discuss/attachments/20080208/a6c360f2/attachment-0001.html

From perkinjo at cse.ohio-state.edu Fri Feb 8 12:19:53 2008
From: perkinjo at cse.ohio-state.edu (Jonathan Perkins)
Date: Fri Feb 8 12:20:11 2008
Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers
In-Reply-To: <47AC8C06.6050209@inl.gov>
References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov>
Message-ID: <47AC8F39.8000908@cse.ohio-state.edu>

Peter Cebull wrote:
> I am still reproducing the failure. First, I am able to successfully
> build mvapich the following way. First I set some environment variables:
>
> export IBHOME=/usr IBHOME_LIB=/usr/lib64 CC=pgcc CXX=pgCC F77=pgf77
> export PREFIX=/usr/local/mvapich/mvapich-1.0-beta/pgi-opt
>
> Then I modify make.mvapich.gen2, changing only the configure line:
>
> *[cebupp@iceapps mvapich-1.0]$ diff make.mvapich.gen2 make.mvapich.gen2.orig
> 109,110c109,110
> < ./configure --enable-sharedlib --with-device=ch_gen2 --with-arch=LINUX
> -prefix=${PREFIX} \
> < --enable-f77 --enable-f90 --enable-f90modules $ROMIO
> -lib="$LIBS" 2>&1 |tee config-mine.log
> ---
> > ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
> > $ROMIO --without-mpe -lib="$LIBS" 2>&1 |tee config-mine.log
> *
> The pgf90 compiler gets picked up somehow, even though I haven't set
> F90. The build completes, with f90 and f90modules enabled, but
> /--without-romio/ (since I haven't set F90, --without-romio is taken by
> default).
> > Next, all I do is set F90: > > *export F90=pgf90 > * > This sets --with-romio, and that is where the build fails: > > *compiling ROMIO in directory mpi-io > /usr/local/src/mvapich-1.0/bin/mpicc -fPIC -D_EM64T_ -DAUTO_DETECT > -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_ > -D_SMP_RNDV_ -DCH_GEN2 -D_GNU_SOURCE -I/usr/include -O3 > -DHAVE_MPICHCONF_H -fPIC -D_EM64T_ -DAUTO_DETECT > -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_ > -D_SMP_RNDV_ -DCH_GEN2 -D_GNU_SOURCE -I/usr/include -O3 > -DHAVE_MPICHCONF_H -DFORTRANUNDERSCORE -D_LARGEFILE64_SOURCE > -D_FILE_OFFSET_BITS=64 -DHAVE_ROMIOCONF_H -I. > -I/usr/local/src/mvapich-1.0/romio/mpi-io -I../include > -I/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include > -I../adio/include > -I/usr/local/src/mvapich-1.0/romio/mpi-io/../../../include > -I../../../include -c close.c > PGC-S-0035-Syntax error: Recovery attempted by replacing identifier > MPI_Datarep_extent_function by '}' > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 98) > PGC-S-0040-Illegal use of symbol, MPI_Datarep_conversion_function > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 99) > PGC-W-0156-Type not specified, 'int' assumed > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 99) > PGC-S-0040-Illegal use of symbol, MPI_Datarep_conversion_function > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 100) > PGC-W-0156-Type not specified, 'int' assumed > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 100) > PGC-S-0037-Syntax error: Recovery attempted by deleting '}' > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 102) > PGC-W-0156-Type not specified, 'int' assumed > (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 102) > PGC/x86-64 Linux 7.1-3: compilation completed with severe errors > make[4]: *** [close.o] Error 2 > Make failed in directory mpi-io > make[3]: *** [mpiolib] Error 1 
> make[2]: *** [mpio] Error 2 > make[1]: *** [mpi-modules] Error 1 > make: *** [mpi] Error 2 > Failure in building MVAPICH.* > > If you are able to build successfully with PGI 7.1-3 and --with-romio, > then I don't know what could possibly be the problem. I'll poke around a > little more deeply here, but if you have any ideas please let me know. I'll continue to look into this as well. -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From perkinjo at cse.ohio-state.edu Fri Feb 8 13:45:22 2008 From: perkinjo at cse.ohio-state.edu (Jonathan L. Perkins) Date: Fri Feb 8 13:45:32 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47AC8F39.8000908@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> Message-ID: <47ACA342.9060405@cse.ohio-state.edu> Jonathan Perkins wrote: > Peter Cebull wrote: >> I am still reproducing the failure. First, I am able to successfully >> build mvapich the following way. 
>> First I set some environment variables:
>>
>> export IBHOME=/usr IBHOME_LIB=/usr/lib64 CC=pgcc CXX=pgCC F77=pgf77
>> export PREFIX=/usr/local/mvapich/mvapich-1.0-beta/pgi-opt
>>
>> Then I modify make.mvapich.gen2, changing only the configure line:
>>
>> [cebupp@iceapps mvapich-1.0]$ diff make.mvapich.gen2 make.mvapich.gen2.orig
>> 109,110c109,110
>> < ./configure --enable-sharedlib --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
>> < --enable-f77 --enable-f90 --enable-f90modules $ROMIO -lib="$LIBS" 2>&1 |tee config-mine.log
>> ---
>> > ./configure --with-device=ch_gen2 --with-arch=LINUX -prefix=${PREFIX} \
>> > $ROMIO --without-mpe -lib="$LIBS" 2>&1 |tee config-mine.log
>>
>> The pgf90 compiler gets picked up somehow, even though I haven't set
>> F90. The build completes, with f90 and f90modules enabled, but
>> --without-romio (since I haven't set F90, --without-romio is taken
>> by default).
>>
>> Next, all I do is set F90:
>>
>> export F90=pgf90
>>
>> This sets --with-romio, and that is where the build fails:
>>
>> compiling ROMIO in directory mpi-io
>> /usr/local/src/mvapich-1.0/bin/mpicc -fPIC -D_EM64T_ -DAUTO_DETECT
>> -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_
>> -D_SMP_RNDV_ -DCH_GEN2 -D_GNU_SOURCE -I/usr/include -O3
>> -DHAVE_MPICHCONF_H -fPIC -D_EM64T_ -DAUTO_DETECT
>> -DEARLY_SEND_COMPLETION -DMEMORY_SCALE -DVIADEV_RPUT_SUPPORT -D_SMP_
>> -D_SMP_RNDV_ -DCH_GEN2 -D_GNU_SOURCE -I/usr/include -O3
>> -DHAVE_MPICHCONF_H -DFORTRANUNDERSCORE -D_LARGEFILE64_SOURCE
>> -D_FILE_OFFSET_BITS=64 -DHAVE_ROMIOCONF_H -I.
>> -I/usr/local/src/mvapich-1.0/romio/mpi-io -I../include
>> -I/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include
>> -I../adio/include
>> -I/usr/local/src/mvapich-1.0/romio/mpi-io/../../../include
>> -I../../../include -c close.c
>> PGC-S-0035-Syntax error: Recovery attempted by replacing identifier
>> MPI_Datarep_extent_function by '}'
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 98)
>> PGC-S-0040-Illegal use of symbol, MPI_Datarep_conversion_function
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 99)
>> PGC-W-0156-Type not specified, 'int' assumed
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 99)
>> PGC-S-0040-Illegal use of symbol, MPI_Datarep_conversion_function
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 100)
>> PGC-W-0156-Type not specified, 'int' assumed
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 100)
>> PGC-S-0037-Syntax error: Recovery attempted by deleting '}'
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 102)
>> PGC-W-0156-Type not specified, 'int' assumed
>> (/usr/local/src/mvapich-1.0/romio/mpi-io/../adio/include/adioi.h: 102)
>> PGC/x86-64 Linux 7.1-3: compilation completed with severe errors
>> make[4]: *** [close.o] Error 2
>> Make failed in directory mpi-io
>> make[3]: *** [mpiolib] Error 1
>> make[2]: *** [mpio] Error 2
>> make[1]: *** [mpi-modules] Error 1
>> make: *** [mpi] Error 2
>> Failure in building MVAPICH.
>>
>> If you are able to build successfully with PGI 7.1-3 and --with-romio,
>> then I don't know what could possibly be the problem. I'll poke around
>> a little more deeply here, but if you have any ideas please let me know.
>
> I'll continue to look into this as well.
>

Just another suggestion. Can you try MVAPICH from our trunk and/or
mvapich2 to see if you encounter the same issue? Is there an older
version of MVAPICH that you got to work with the same configuration?
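
Since the mpicc line quoted above lists -I/usr/include ahead of the MVAPICH include directories, it may also be worth checking which copy of a header would be found first under that ordering. What follows is only a minimal sketch in POSIX shell: the function name first_header_hit is hypothetical (it is not part of MVAPICH or the PGI tools), and it only approximates the compiler's search, which typically also looks in the including file's own directory for quoted includes.

```shell
#!/bin/sh
# Hypothetical helper, not part of MVAPICH or PGI.
# Usage: first_header_hit HEADER DIR...
# Walks the directories in the order given (as a compiler walks its -I
# list) and prints the first path where HEADER exists as a regular file.
# Returns 0 on a hit, 1 if no directory contains the header.
first_header_hit() {
    header=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$header" ]; then
            printf '%s\n' "$dir/$header"
            return 0
        fi
    done
    return 1
}
```

For example, "first_header_hit mpi.h /usr/include /usr/local/src/mvapich-1.0/include" prints whichever mpi.h wins when /usr/include comes first; both paths are taken from the compile line above (the second is romio/mpi-io/../../../include normalized).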
-- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From peter.cebull at inl.gov Fri Feb 8 14:45:18 2008 From: peter.cebull at inl.gov (Peter Cebull) Date: Fri Feb 8 14:45:51 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACA342.9060405@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> Message-ID: <47ACB14E.9010308@inl.gov> Jonathan L. Perkins wrote: > > Just another suggestion. Can you try MVAPICH from our trunk and/or > mvapich2 to see if you encounter the same issue? Is there an older > version of MVAPICH that you got to work with the same configuration? > I tried mvapich-0.9.9-2008-02-05 with no luck. MVAPICH2 seemed to build okay. I've never built MVAPICH before -- our new system (SGI Altix ICE) came with gcc and Intel versions of mvapich 0.9.9, I'm just trying to build an equivalent configuration for the PGI compilers. It seems to be a problem finding include files. If I cd into romio/mpi-io and try to compile close.c, it fails. I copied the entire source tree to my desktop (OpenSUSE 10.1 and the same PGI compilers) and close.c compiled with no problem. Back on the other system if I copied the header files from the include directory into romio/mpi-io, then I could compile close.c. Peter -- Peter Cebull HPC User Consultant Idaho National Laboratory P.O. 
Box 1625, MS3605 Idaho Falls, ID 83415 Phone: 208-526-1909 Email: Peter.Cebull@inl.gov From weikuan.yu at gmail.com Fri Feb 8 15:32:53 2008 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Fri Feb 8 15:33:00 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACB14E.9010308@inl.gov> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> <47ACB14E.9010308@inl.gov> Message-ID: <47ACBC75.8030104@gmail.com> Hi, Peter, Could you please confirm the version number? is it mvapich-0.9.9 or the latest mvapich-1.0.0 from a nightly tarball? --Weikuan Peter Cebull wrote: > Jonathan L. Perkins wrote: >> >> Just another suggestion. Can you try MVAPICH from our trunk and/or >> mvapich2 to see if you encounter the same issue? Is there an older >> version of MVAPICH that you got to work with the same configuration? >> > I tried mvapich-0.9.9-2008-02-05 with no luck. MVAPICH2 seemed to build > okay. I've never built MVAPICH before -- our new system (SGI Altix ICE) > came with gcc and Intel versions of mvapich 0.9.9, I'm just trying to > build an equivalent configuration for the PGI compilers. > > It seems to be a problem finding include files. If I cd into > romio/mpi-io and try to compile close.c, it fails. I copied the entire > source tree to my desktop (OpenSUSE 10.1 and the same PGI compilers) and > close.c compiled with no problem. Back on the other system if I copied > the header files from the include directory into romio/mpi-io, then I > could compile close.c. 
> > Peter > -- Weikuan Yu <+> 1-865-574-7990 http://ft.ornl.gov/~wyu/ From peter.cebull at inl.gov Fri Feb 8 15:47:15 2008 From: peter.cebull at inl.gov (Peter Cebull) Date: Fri Feb 8 15:47:57 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACBC75.8030104@gmail.com> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> <47ACB14E.9010308@inl.gov> <47ACBC75.8030104@gmail.com> Message-ID: <47ACBFD3.20503@inl.gov> Weikuan Yu wrote: > > Could you please confirm the version number? is it mvapich-0.9.9 or > the latest mvapich-1.0.0 from a nightly tarball? > I've tried the Feb 5th tarball from http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/0.9.9, and the MVAPICH 1.0 beta dated 10/26/07: http://mvapich.cse.ohio-state.edu/download/mvapich/mvapich-1.0-beta.tar.gz. I'm starting to wonder if the build might be picking up the wrong header files. We have some header files in /usr/include (mpif.h, mpi.h, etc.), I guess from SGI's MPT. Maybe they're causing problems. Peter > > Peter Cebull wrote: >> Jonathan L. Perkins wrote: >>> >>> Just another suggestion. Can you try MVAPICH from our trunk and/or >>> mvapich2 to see if you encounter the same issue? Is there an older >>> version of MVAPICH that you got to work with the same configuration? >>> >> I tried mvapich-0.9.9-2008-02-05 with no luck. MVAPICH2 seemed to >> build okay. I've never built MVAPICH before -- our new system (SGI >> Altix ICE) came with gcc and Intel versions of mvapich 0.9.9, I'm >> just trying to build an equivalent configuration for the PGI compilers. >> >> It seems to be a problem finding include files. If I cd into >> romio/mpi-io and try to compile close.c, it fails. 
I copied the >> entire source tree to my desktop (OpenSUSE 10.1 and the same PGI >> compilers) and close.c compiled with no problem. Back on the other >> system if I copied the header files from the include directory into >> romio/mpi-io, then I could compile close.c. >> >> Peter >> > -- Peter Cebull HPC User Consultant Idaho National Laboratory P.O. Box 1625, MS3605 Idaho Falls, ID 83415 Phone: 208-526-1909 Email: Peter.Cebull@inl.gov From perkinjo at cse.ohio-state.edu Fri Feb 8 15:58:25 2008 From: perkinjo at cse.ohio-state.edu (Jonathan L. Perkins) Date: Fri Feb 8 15:58:37 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACBFD3.20503@inl.gov> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> <47ACB14E.9010308@inl.gov> <47ACBC75.8030104@gmail.com> <47ACBFD3.20503@inl.gov> Message-ID: <47ACC271.60307@cse.ohio-state.edu> Peter Cebull wrote: > Weikuan Yu wrote: >> >> Could you please confirm the version number? is it mvapich-0.9.9 or >> the latest mvapich-1.0.0 from a nightly tarball? >> > I've tried the Feb 5th tarball from > http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/0.9.9, and > the MVAPICH 1.0 beta dated 10/26/07: > http://mvapich.cse.ohio-state.edu/download/mvapich/mvapich-1.0-beta.tar.gz. In order to try our latest MVAPICH sources you can use Subversion. "svn export https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk" should do the trick. > I'm starting to wonder if the build might be picking up the wrong header > files. We have some header files in /usr/include (mpif.h, mpi.h, etc.), > I guess from SGI's MPT. Maybe they're causing problems. > > Peter >> >> Peter Cebull wrote: >>> Jonathan L. 
Perkins wrote: >>>> >>>> Just another suggestion. Can you try MVAPICH from our trunk and/or >>>> mvapich2 to see if you encounter the same issue? Is there an older >>>> version of MVAPICH that you got to work with the same configuration? >>>> >>> I tried mvapich-0.9.9-2008-02-05 with no luck. MVAPICH2 seemed to >>> build okay. I've never built MVAPICH before -- our new system (SGI >>> Altix ICE) came with gcc and Intel versions of mvapich 0.9.9, I'm >>> just trying to build an equivalent configuration for the PGI compilers. >>> >>> It seems to be a problem finding include files. If I cd into >>> romio/mpi-io and try to compile close.c, it fails. I copied the >>> entire source tree to my desktop (OpenSUSE 10.1 and the same PGI >>> compilers) and close.c compiled with no problem. Back on the other >>> system if I copied the header files from the include directory into >>> romio/mpi-io, then I could compile close.c. >>> >>> Peter >>> >> > > -- Jonathan Perkins http://www.cse.ohio-state.edu/~perkinjo From weikuan.yu at gmail.com Fri Feb 8 16:01:16 2008 From: weikuan.yu at gmail.com (Weikuan Yu) Date: Fri Feb 8 16:01:23 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACBFD3.20503@inl.gov> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> <47ACB14E.9010308@inl.gov> <47ACBC75.8030104@gmail.com> <47ACBFD3.20503@inl.gov> Message-ID: <47ACC31C.6050706@gmail.com> Peter Cebull wrote: > I've tried the Feb 5th tarball from > http://mvapich.cse.ohio-state.edu/nightly/mvapich/branches/0.9.9, and > the MVAPICH 1.0 beta dated 10/26/07: > http://mvapich.cse.ohio-state.edu/download/mvapich/mvapich-1.0-beta.tar.gz. 
> > I'm starting to wonder if the build might be picking up the wrong header > files. We have some header files in /usr/include (mpif.h, mpi.h, etc.), > I guess from SGI's MPT. Maybe they're causing problems. Thanks for the info. Sounds like a possible cause, for which you may customize the order of include paths a little to fit your system. --Weikuan From peter.cebull at inl.gov Fri Feb 8 16:42:45 2008 From: peter.cebull at inl.gov (Peter Cebull) Date: Fri Feb 8 16:43:23 2008 Subject: [mvapich-discuss] Problem Building MVAPICH-1.0-beta with PGI Compilers In-Reply-To: <47ACC2E0.3040509@cse.ohio-state.edu> References: <47AA2D7E.7030109@inl.gov> <47AA6D32.50600@cse.ohio-state.edu> <47AB54A4.8080101@cse.ohio-state.edu> <47AB5B7B.1030306@inl.gov> <47AB5D24.4070800@cse.ohio-state.edu> <47AC6493.2080302@cse.ohio-state.edu> <47AC8C06.6050209@inl.gov> <47AC8F39.8000908@cse.ohio-state.edu> <47ACA342.9060405@cse.ohio-state.edu> <47ACB14E.9010308@inl.gov> <47ACBC75.8030104@gmail.com> <47ACBFD3.20503@inl.gov> <47ACC2E0.3040509@cse.ohio-state.edu> Message-ID: <47ACCCD5.6070006@inl.gov> Jonathan L. Perkins wrote: > I'm wondering what could be leading to this failure as well. Would it > be possible to provide remote access so that we can debug this issue > on your cluster? I've cc'd an internal mailing list for the MVAPICH > group. Getting the approvals for external access to our network is not straightforward, unfortunately. The good news is that I was able to confirm that the MPI header files in /usr/include were causing problems with the build. I was able to successfully build MVAPICH 1.0-beta after temporarily moving all the mpi*.h files out of /usr/include. I'll have to ask SGI why they are there. Thanks for all the responses and help troubleshooting! Peter -- Peter Cebull HPC User Consultant Idaho National Laboratory P.O. 
Box 1625, MS3605 Idaho Falls, ID 83415 Phone: 208-526-1909 Email: Peter.Cebull@inl.gov From schuang at ats.ucla.edu Sat Feb 9 01:02:25 2008 From: schuang at ats.ucla.edu (Shao-Ching Huang) Date: Sat Feb 9 01:02:42 2008 Subject: [mvapich-discuss] Help with polled desc error In-Reply-To: References: <20080202012436.GA25420@ats.ucla.edu> Message-ID: <20080209060225.GA18723@ats.ucla.edu> Hi No failure was found in these mpiGraph runs. It's just that there is significant variation among the entries of the matrices, compared to another IB cluster of ours. http://reynolds.turb.ucla.edu/~schuang/mpiGraph/ Thanks. Shao-Ching On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote: > Hi, > > How often do you observe the failures when running the mpiGraph test? Do > all the failure happen at startup, as your simple program? > > Thanks. > > Regards, > Wei Huang > > 774 Dreese Lab, 2015 Neil Ave, > Dept. of Computer Science and Engineering > Ohio State University > OH 43210 > Tel: (614)292-8501 > > > On Fri, 1 Feb 2008, Shao-Ching Huang wrote: > > > > > Hi Wei, > > > > We cleaned up a few things and re-ran the mpiGraph tests. The updated > > results are posted here: > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html > > > > Please ignore results in my previous email. Thank you. 
> > > > > > _______________________________________________
> > > > > > mvapich-discuss mailing list
> > > > > > mvapich-discuss@cse.ohio-state.edu
> > > > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > > > >
> > > > >
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss@cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss@cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >

From lyan1 at cct.lsu.edu Tue Feb 12 15:57:59 2008
From: lyan1 at cct.lsu.edu (Le Yan)
Date: Tue Feb 12 15:58:14 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To: <20080209060225.GA18723@ats.ucla.edu>
References: <20080202012436.GA25420@ats.ucla.edu> <20080209060225.GA18723@ats.ucla.edu>
Message-ID: <1202849879.12661.68.camel@lyan1.hpc.lsu.edu>

Hi,

We have the same problem here with Mvapich2 1.0.1 on a Dell infiniband
cluster. It has 8 cores per node and is running RHEL 4.5 (kernel
2.6.9-55). The OFED library version is 1.2.

At first it seemed that any code compiled with Mvapich2 1.0.1 failed at
the MPI_INIT stage when running with more than 128 procs. But later on
we found that a code could run only if it doesn't use all 8 processors
on the same node (which explains why mpiGraph never fails, because it
uses only 1 processor per node). For example, a job running with 16
nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs
per node will not.

In addition, if the MALLOC_CHECK_ environment variable is set to 1, a
bunch of errors appear in the standard error like this:

61: malloc: using debugging hooks
61: free(): invalid pointer 0x707000!
61: Fatal error in MPI_Init:
61: Other MPI error, error stack:
61: MPIR_Init_thread(259)..: Initialization failed
61: MPID_Init(102).........: channel initialization failed
61: MPIDI_CH3_Init(178)....:
61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library

I'm not quite sure what these messages mean, but sure it looks like a
memory issue?

Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system.

Cheers,
Le


On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
> Hi
>
> No failure was found in these mpiGraph runs. It's just that there is
> significant variation among the entries of the matrices, compared to
> another IB cluster of ours.
>
> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
>
> Thanks.
>
> Shao-Ching
>
>
> On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote:
> > Hi,
> >
> > How often do you observe the failures when running the mpiGraph test? Do
> > all the failure happen at startup, as your simple program?
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Fri, 1 Feb 2008, Shao-Ching Huang wrote:
> >
> > >
> > > Hi Wei,
> > >
> > > We cleaned up a few things and re-ran the mpiGraph tests. The updated
> > > results are posted here:
> > >
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html
> > >
> > > Please ignore results in my previous email. Thank you.
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
--
Le Yan
User support
Louisiana Optical Network Initiative (LONI)
Office: 225-578-7524
Fax: 225-578-6400

From huanwei at cse.ohio-state.edu Tue Feb 12 16:41:31 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Tue Feb 12 16:41:44 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To: <1202849879.12661.68.camel@lyan1.hpc.lsu.edu>
Message-ID:

Hi,

We do not see anything abnormal in our local testing. To help us locate
the problem, could you please try the following:

1) Check that you have enough space in the /tmp directory.

2) Disable ring-based startup using:

mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out

3) If this fails, disable shared memory support using the runtime
variable MV2_USE_SHARED_MEM=0:

mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Tue, 12 Feb 2008, Le Yan wrote:

> Hi,
>
> We have the same problem here with Mvapich2 1.0.1 on a Dell infiniband
> cluster.
It has 8 cores per node and is running RHEL 4.5 (kernel > 2.6.9-55). The OFED library version is 1.2. > > At first it seemed that any code compiled with Mvapich2 1.0.1 failed at > the MPI_INIT stage when running with more than 128 procs. But later on > we found that a code could run only if it doesn't use all 8 processors > on the same node (which explains why mpiGraph never fails, because it > uses only 1 processor per node). For example, a job running with 16 > nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs > per node will not. > > In addition, if the MALLOC_CHECK_ environment variable is set to 1, a > bunch of errors appear in the standard error like this: > > 61: malloc: using debugging hooks > 61: free(): invalid pointer 0x707000! > 61: Fatal error in MPI_Init: > 61: Other MPI error, error stack: > 61: MPIR_Init_thread(259)..: Initialization failed > 61: MPID_Init(102).........: channel initialization failed > 61: MPIDI_CH3_Init(178)....: > 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library > > I'm not quite sure what these messages mean, but sure it looks like a > memory issue? > > Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system. > > Cheers, > Le > > > On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote: > > Hi > > > > No failure was found in these mpiGraph runs. It's just that there is > > significant variation among the entries of the matrices, compared to > > another IB cluster of ours. > > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/ > > > > Thanks. > > > > Shao-Ching > > > > > > On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote: > > > Hi, > > > > > > How often do you observe the failures when running the mpiGraph test? Do > > > all the failure happen at startup, as your simple program? > > > > > > Thanks. > > > > > > Regards, > > > Wei Huang > > > > > > 774 Dreese Lab, 2015 Neil Ave, > > > Dept. 
of Computer Science and Engineering
> > > Ohio State University
> > > OH 43210
> > > Tel: (614)292-8501
> > >
> > _______________________________________________
> > mvapich-discuss mailing list
> > mvapich-discuss@cse.ohio-state.edu
> > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> >
> --
> Le Yan
> User support
> Louisiana Optical Network Initiative (LONI)
> Office: 225-578-7524
> Fax: 225-578-6400
>
>
From Mike.Colonno at spacex.com Tue Feb 12 18:55:04 2008
From: Mike.Colonno at spacex.com (Mike Colonno)
Date: Tue Feb 12 18:55:20 2008
Subject: [mvapich-discuss] mvapich missing 'mpispawn' ?
In-Reply-To:
References: <1202849879.12661.68.camel@lyan1.hpc.lsu.edu>
Message-ID:

Hi folks ~

Sorry for what must be a simple thing here: I built the most recent MVAPICH and I'm testing it out. My path includes the appropriate /bin directory:

>> which mpispawn
/usr/local/mvapich/bin/mpispawn

mpif90 works great. Running a simple test program:

>> mpirun_rsh -rsh -np 2 node1 node1 bounce
/usr/bin/env: ./mpispawn: No such file or directory

I'm guessing this is a simple settings thing but nothing seems to have an effect. Built various forms of MPICH(2) several time and I've never seen this...

Thanks,

Michael R.
Colonno, Ph.D. | Chief Aerodynamic Engineer
Space Exploration Technologies
1 Rocket Road
Hawthorne, CA 90250
W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com

-- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --

From karl at tacc.utexas.edu Tue Feb 12 19:08:49 2008
From: karl at tacc.utexas.edu (Karl W. Schulz)
Date: Tue Feb 12 19:09:05 2008
Subject: [mvapich-discuss] mvapich missing 'mpispawn' ?
In-Reply-To:
References: <1202849879.12661.68.camel@lyan1.hpc.lsu.edu>
Message-ID:

Hello Mike,

If you include the full path to mpirun_rsh, I believe you will find that it will work (I suspect that mpispawn is being invoked without execvp so it's not picking up on your $PATH definition). Try:

> /usr/local/mvapich/bin/mpirun_rsh -rsh -np 2 node1 node1 bounce

Karl

-----Original Message-----
From: mvapich-discuss-bounces@cse.ohio-state.edu [mailto:mvapich-discuss-bounces@cse.ohio-state.edu] On Behalf Of Mike Colonno
Sent: Tuesday, February 12, 2008 5:55 PM
To: mvapich-discuss@cse.ohio-state.edu
Subject: [mvapich-discuss] mvapich missing 'mpispawn' ?

Hi folks ~

Sorry for what must be a simple thing here: I built the most recent MVAPICH and I'm testing it out. My path includes the appropriate /bin directory:

>> which mpispawn
/usr/local/mvapich/bin/mpispawn

mpif90 works great. Running a simple test program:

>> mpirun_rsh -rsh -np 2 node1 node1 bounce
/usr/bin/env: ./mpispawn: No such file or directory

I'm guessing this is a simple settings thing but nothing seems to have an effect. Built various forms of MPICH(2) several time and I've never seen this...

Thanks,

Michael R.
Colonno, Ph.D. | Chief Aerodynamic Engineer
Space Exploration Technologies
1 Rocket Road
Hawthorne, CA 90250
W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com

-- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --

_______________________________________________
mvapich-discuss mailing list
mvapich-discuss@cse.ohio-state.edu
http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss

From sridharj at cse.ohio-state.edu Thu Feb 14 13:39:16 2008
From: sridharj at cse.ohio-state.edu (Jaidev Sridhar)
Date: Thu Feb 14 13:39:29 2008
Subject: [mvapich-discuss] mvapich missing 'mpispawn' ?
In-Reply-To:
References: <1202849879.12661.68.camel@lyan1.hpc.lsu.edu>
Message-ID: <47B48AD4.1060606@cse.ohio-state.edu>

Hi Mike,

Thanks for bringing this issue to our attention. As Karl mentioned, you can work around this bug by specifying a path to mpirun_rsh.

We have fixed this in the trunk version of mvapich. If you'd like to try this out, you can checkout the latest code from our svn repository - https://mvapich.cse.ohio-state.edu/svn/mpi/mvapich/trunk.

Thanks,

Jaidev

On Tuesday 12 February 2008 06:55 PM, Mike Colonno wrote:
> Hi folks ~
>
> Sorry for what must be a simple thing here: I built the most recent MVAPICH and I'm testing it out. My path includes the appropriate /bin directory:
>
> >> which mpispawn
> /usr/local/mvapich/bin/mpispawn
>
> mpif90 works great. Running a simple test program:
>
> >> mpirun_rsh -rsh -np 2 node1 node1 bounce
> /usr/bin/env: ./mpispawn: No such file or directory
>
> I'm guessing this is a simple settings thing but nothing seems to have an effect. Built various forms of MPICH(2) several time and I've never seen this...
>
> Thanks,
>
> Michael R. Colonno, Ph.D.
| Chief Aerodynamic Engineer
> Space Exploration Technologies
> 1 Rocket Road
> Hawthorne, CA 90250
> W: 310 363 6263 | M: 310 570 3299 | F: 310 363 6001 | www.spacex.com
>
> -- This Email Contains Sensitive Proprietary and Confidential Information - Not for Further Distribution Without the Express Written Consent of Space Exploration Technologies --
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
From manoj at pnl.gov Mon Feb 18 18:21:36 2008
From: manoj at pnl.gov (Manojkumar Krishnan)
Date: Mon Feb 18 22:59:57 2008
Subject: [mvapich-discuss] MVAPICH2 - MPI_Comm_spawn
Message-ID:

I was wondering if there is a version of MVAPICH2 (using OpenIB s/w stack) that supports MPI_Comm_spawn. If so, please let me know.

Thanks.

-Manoj:)
---------------------------------------------------------------
Manojkumar Krishnan
High Performance Computing Group
Pacific Northwest National Laboratory
Ph: (509) 372-4206 Fax: (509) 372-4720
http://hpc.pnl.gov/people/manoj
---------------------------------------------------------------

From koop at cse.ohio-state.edu Tue Feb 19 21:55:22 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Tue Feb 19 21:55:34 2008
Subject: [mvapich-discuss] MVAPICH2 - MPI_Comm_spawn
In-Reply-To:
Message-ID:

Hi Manoj,

We currently do not support MPI_Comm_spawn when running over the OpenFabrics verbs. This support is being worked on for a future release. Currently it can be used only when compiled with TCP/IP support.

Matt

On Mon, 18 Feb 2008, Manojkumar Krishnan wrote:

>
> I was wondering if there is a version of MVAPICH2 (using OpenIB s/w stack)
> that supports MPI_Comm_spawn. If so, please let me know.
>
> Thanks.
>
> -Manoj:)
> ---------------------------------------------------------------
> Manojkumar Krishnan
> High Performance Computing Group
> Pacific Northwest National Laboratory
> Ph: (509) 372-4206 Fax: (509) 372-4720
> http://hpc.pnl.gov/people/manoj
> ---------------------------------------------------------------
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
From lyan1 at cct.lsu.edu Wed Feb 20 17:08:32 2008
From: lyan1 at cct.lsu.edu (Le Yan)
Date: Wed Feb 20 17:08:49 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To:
References:
Message-ID: <1203545312.26450.21.camel@lyan1.hpc.lsu.edu>

Hi,

Thank you for the suggestions. I apologize that I wasn't able to work on this for the past week.

I'm not sure what the experience is for other people who had the same problem, but it looks like that "-env MV2_USE_RING_STARTUP 0" did the trick for us: I've been running 10+ jobs with 256 procs on the same set of nodes, and all jobs with it being passed at the command line ran just fine, as opposed to others that failed with the default environment setting.

Hope this is helpful information.

Cheers,
Le

On Tue, 2008-02-12 at 16:41 -0500, wei huang wrote:
> Hi,
>
> We donot see anything abnormal from our local testing. In order to help us
> locating the problem, could you please try the following:
>
> 1) Check if you have enough space in the /tmp directly
>
> 2) Disable ring based start using:
>
> mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
>
> 3) If this fails, disable shared memory support using runtime variable
> MV2_USE_SHARED_MEM=0:
>
> mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
>
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept.
of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Tue, 12 Feb 2008, Le Yan wrote:
>
> > Hi,
> >
> > We have the same problem here with Mvapich2 1.0.1 on a Dell infiniband
> > cluster. It has 8 cores per node and is running RHEL 4.5 (kernel
> > 2.6.9-55). The OFED library version is 1.2.
> >
> > At first it seemed that any code compiled with Mvapich2 1.0.1 failed at
> > the MPI_INIT stage when running with more than 128 procs. But later on
> > we found that a code could run only if it doesn't use all 8 processors
> > on the same node (which explains why mpiGraph never fails, because it
> > uses only 1 processor per node). For example, a job running with 16
> > nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs
> > per node will not.
> >
> > In addition, if the MALLOC_CHECK_ environment variable is set to 1, a
> > bunch of errors appear in the standard error like this:
> >
> > 61: malloc: using debugging hooks
> > 61: free(): invalid pointer 0x707000!
> > 61: Fatal error in MPI_Init:
> > 61: Other MPI error, error stack:
> > 61: MPIR_Init_thread(259)..: Initialization failed
> > 61: MPID_Init(102).........: channel initialization failed
> > 61: MPIDI_CH3_Init(178)....:
> > 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library
> >
> > I'm not quite sure what these messages mean, but sure it looks like a
> > memory issue?
> >
> > Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system.
> >
> > Cheers,
> > Le
> >
> >
> > On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
> > > Hi
> > >
> > > No failure was found in these mpiGraph runs. It's just that there is
> > > significant variation among the entries of the matrices, compared to
> > > another IB cluster of ours.
> > >
> > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
> > >
> > > Thanks.
> > >
> > > Shao-Ching
> > >
> > > _______________________________________________
> > > mvapich-discuss mailing list
> > > mvapich-discuss@cse.ohio-state.edu
> > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > >
> > --
> > Le Yan
> > User support
> > Louisiana Optical Network Initiative (LONI)
> > Office:
225-578-7524
> > Fax: 225-578-6400
> >
> >
>
From panda at cse.ohio-state.edu Wed Feb 20 22:26:14 2008
From: panda at cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Feb 20 22:26:30 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To: <1203545312.26450.21.camel@lyan1.hpc.lsu.edu>
Message-ID:

> Hi,
>
> Thank you for the suggestions. I apologize that I wasn't able to work on
> this for the past week.
>
> I'm not sure what the experience is for other people who had the same
> problem, but it looks like that "-env MV2_USE_RING_STARTUP 0" did the
> trick for us: I've been running 10+ jobs with 256 procs on the same set
> of nodes, and all jobs with it being passed at the command line ran just
> fine, as opposed to others that failed with the default environment
> setting.

Glad to know that you are able to run jobs successfully with the above option.

> Hope this is helpful information.

Yes, it is very helpful. We will take a look at this issue further.

Thanks,

DK

> Cheers,
> Le
>
> On Tue, 2008-02-12 at 16:41 -0500, wei huang wrote:
> > Hi,
> >
> > We donot see anything abnormal from our local testing. In order to help us
> > locating the problem, could you please try the following:
> >
> > 1) Check if you have enough space in the /tmp directly
> >
> > 2) Disable ring based start using:
> >
> > mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
> >
> > 3) If this fails, disable shared memory support using runtime variable
> > MV2_USE_SHARED_MEM=0:
> >
> > mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Tue, 12 Feb 2008, Le Yan wrote:
> >
> > > Hi,
> > >
> > > We have the same problem here with Mvapich2 1.0.1 on a Dell infiniband
> > > cluster. It has 8 cores per node and is running RHEL 4.5 (kernel
> > > 2.6.9-55). The OFED library version is 1.2.
> > >
> > > At first it seemed that any code compiled with Mvapich2 1.0.1 failed at
> > > the MPI_INIT stage when running with more than 128 procs. But later on
> > > we found that a code could run only if it doesn't use all 8 processors
> > > on the same node (which explains why mpiGraph never fails, because it
> > > uses only 1 processor per node). For example, a job running with 16
> > > nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs
> > > per node will not.
> > >
> > > In addition, if the MALLOC_CHECK_ environment variable is set to 1, a
> > > bunch of errors appear in the standard error like this:
> > >
> > > 61: malloc: using debugging hooks
> > > 61: free(): invalid pointer 0x707000!
> > > 61: Fatal error in MPI_Init:
> > > 61: Other MPI error, error stack:
> > > 61: MPIR_Init_thread(259)..: Initialization failed
> > > 61: MPID_Init(102).........: channel initialization failed
> > > 61: MPIDI_CH3_Init(178)....:
> > > 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library
> > >
> > > I'm not quite sure what these messages mean, but sure it looks like a
> > > memory issue?
> > >
> > > Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system.
> > >
> > > Cheers,
> > > Le
> > >
> > >
> > > On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
> > > > Hi
> > > >
> > > > No failure was found in these mpiGraph runs. It's just that there is
> > > > significant variation among the entries of the matrices, compared to
> > > > another IB cluster of ours.
> > > >
> > > > http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
> > > >
> > > > Thanks.
> > > >
> > > > Shao-Ching
> > > >
> > > > _______________________________________________
> > > > mvapich-discuss mailing list
> > > > mvapich-discuss@cse.ohio-state.edu
> > > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
> > > >
> > > --
> > > Le Yan
> > > User support
> > > Louisiana Optical Network Initiative (LONI)
> > > Office: 225-578-7524
> > > Fax: 225-578-6400
> > >
> >
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
From panda at cse.ohio-state.edu Wed Feb 20 23:23:22 2008
From: panda at
cse.ohio-state.edu (Dhabaleswar Panda)
Date: Wed Feb 20 23:23:36 2008
Subject: [mvapich-discuss] Announcing the release of MVAPICH2 1.0.2
Message-ID:

The MVAPICH team is pleased to announce the release of MVAPICH2 1.0.2. Detailed changes and bug fixes applied to this version are available at the following URL:

http://mvapich.cse.ohio-state.edu/download/mvapich2/changes.shtml

We strongly encourage MVAPICH2 users to update their installations to this latest version.

To download the MVAPICH2 1.0.2 package, access the SVN repository, or read the user guide, please visit the following URL:

http://mvapich.cse.ohio-state.edu/

This version is also being made available through OFED 1.3.

All questions and feedback, including bug reports, hints for performance tuning, patches, and enhancements, are welcome. Please post them to the mvapich-discuss mailing list.

Thanks,

The MVAPICH Team

From sylvain.jeaugey at bull.net Thu Feb 21 05:37:09 2008
From: sylvain.jeaugey at bull.net (Sylvain Jeaugey)
Date: Thu Feb 21 05:38:48 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To:
References:
Message-ID:

Hi All,

I worked a bit on this issue and found out that the crash was located in IB_ring_based_alltoall. De-activating ring startup will definitely avoid the crash.

I did the following patch and it seems to solve the bug:

--- rdma_iba_priv.c.orig 2008-02-21 11:33:22.000000000 +0100
+++ rdma_iba_priv.c 2008-02-19 16:00:46.000000000 +0100
@@ -1258,6 +1258,7 @@
 }
 /*Now all send and recv finished*/
+ PMI_Barrier();
 }
 }

I increased MV2_DEFAULT_TIME_OUT to 24 to give myself some time before the error appears (when the timeout expires). I saw that the alltoall operation leaves some processes behind, blocked in the function (say, 10 out of 256). So I suspect that the other processes destroyed something in their next operations, which prevented the stragglers from completing the alltoall.
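To make the suspected ordering problem concrete, here is a toy sketch of why a barrier after the ring exchange would help. It is an illustration only and not MVAPICH2 code: threads stand in for MPI ranks, a per-rank atomic flag stands in for the connection resources that get released, and the names rank_ctx, rank_main and run_ring_exchange are hypothetical. It assumes a POSIX system that provides pthread barriers (build with -pthread).

```c
/* Toy model only: threads stand in for MPI ranks and an atomic flag stands
 * in for a rank's connection resources. None of this is MVAPICH2 code. */
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    int rank;
    int nprocs;
    int use_barrier;
    atomic_int *alive;          /* alive[r] == 1 while rank r can still receive */
    atomic_int *delivered;      /* number of sends that reached a live peer */
    pthread_barrier_t *barrier;
} rank_ctx;

static void *rank_main(void *arg)
{
    rank_ctx *c = arg;
    int right = (c->rank + 1) % c->nprocs;

    /* Odd ranks are stragglers: they reach the exchange about 2 ms late. */
    if (c->rank % 2) {
        struct timespec delay = { 0, 2000000L };
        nanosleep(&delay, NULL);
    }

    /* "Send" to the right-hand neighbour; it lands only if the peer is up. */
    if (atomic_load(&c->alive[right]))
        atomic_fetch_add(c->delivered, 1);

    /* The fix: no rank tears down until every rank has finished its send. */
    if (c->use_barrier)
        pthread_barrier_wait(c->barrier);

    atomic_store(&c->alive[c->rank], 0);    /* release this rank's resources */
    return NULL;
}

/* Runs one exchange and returns how many of the nprocs sends were delivered. */
int run_ring_exchange(int nprocs, int use_barrier)
{
    assert(nprocs > 0);
    pthread_t *tid = malloc((size_t)nprocs * sizeof *tid);
    rank_ctx *ctx = malloc((size_t)nprocs * sizeof *ctx);
    atomic_int *alive = malloc((size_t)nprocs * sizeof *alive);
    atomic_int delivered;
    pthread_barrier_t barrier;
    int rc;

    assert(tid && ctx && alive);
    atomic_init(&delivered, 0);
    rc = pthread_barrier_init(&barrier, NULL, (unsigned)nprocs);
    assert(rc == 0);

    for (int r = 0; r < nprocs; r++)
        atomic_init(&alive[r], 1);
    for (int r = 0; r < nprocs; r++) {
        ctx[r] = (rank_ctx){ r, nprocs, use_barrier, alive, &delivered, &barrier };
        rc = pthread_create(&tid[r], NULL, rank_main, &ctx[r]);
        assert(rc == 0);
    }
    for (int r = 0; r < nprocs; r++)
        pthread_join(tid[r], NULL);

    pthread_barrier_destroy(&barrier);
    int result = atomic_load(&delivered);
    free(alive);
    free(ctx);
    free(tid);
    return result;
}
```

Called as run_ring_exchange(8, 1), every send is delivered, because no flag is cleared before all ranks have passed the barrier. With use_barrier set to 0 the count can come up short, depending on thread scheduling, which is the same shape of failure as a fast rank freeing resources while a slow peer is still inside the alltoall.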
Hope this helps,

Sylvain

On Tue, 12 Feb 2008, wei huang wrote:

> Hi,
>
> We donot see anything abnormal from our local testing. In order to help us
> locating the problem, could you please try the following:
>
> 1) Check if you have enough space in the /tmp directly
>
> 2) Disable ring based start using:
>
> mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
>
> 3) If this fails, disable shared memory support using runtime variable
> MV2_USE_SHARED_MEM=0:
>
> mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
>
> Thanks.
>
> Regards,
> Wei Huang
>
> 774 Dreese Lab, 2015 Neil Ave,
> Dept. of Computer Science and Engineering
> Ohio State University
> OH 43210
> Tel: (614)292-8501
>
>
> On Tue, 12 Feb 2008, Le Yan wrote:
>
>> Hi,
>>
>> We have the same problem here with Mvapich2 1.0.1 on a Dell infiniband
>> cluster. It has 8 cores per node and is running RHEL 4.5 (kernel
>> 2.6.9-55). The OFED library version is 1.2.
>>
>> At first it seemed that any code compiled with Mvapich2 1.0.1 failed at
>> the MPI_INIT stage when running with more than 128 procs. But later on
>> we found that a code could run only if it doesn't use all 8 processors
>> on the same node (which explains why mpiGraph never fails, because it
>> uses only 1 processor per node). For example, a job running with 16
>> nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs
>> per node will not.
>>
>> In addition, if the MALLOC_CHECK_ environment variable is set to 1, a
>> bunch of errors appear in the standard error like this:
>>
>> 61: malloc: using debugging hooks
>> 61: free(): invalid pointer 0x707000!
>> 61: Fatal error in MPI_Init:
>> 61: Other MPI error, error stack:
>> 61: MPIR_Init_thread(259)..: Initialization failed
>> 61: MPID_Init(102).........: channel initialization failed
>> 61: MPIDI_CH3_Init(178)....:
>> 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library
>>
>> I'm not quite sure what these messages mean, but sure it looks like a
>> memory issue?
>>
>> Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system.
>>
>> Cheers,
>> Le
>>
>>
>> On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote:
>>> Hi
>>>
>>> No failure was found in these mpiGraph runs. It's just that there is
>>> significant variation among the entries of the matrices, compared to
>>> another IB cluster of ours.
>>>
>>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/
>>>
>>> Thanks.
>>>
>>> Shao-Ching
>>>
>>> _______________________________________________
>>> mvapich-discuss mailing list
>>> mvapich-discuss@cse.ohio-state.edu
>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>>>
>> --
>> Le Yan
>> User support
>> Louisiana Optical Network Initiative (LONI)
>> Office: 225-578-7524
>> Fax: 225-578-6400
>>
>>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>
>
From koop at cse.ohio-state.edu Thu Feb 21 15:02:16 2008
From: koop at cse.ohio-state.edu (Matthew Koop)
Date: Thu Feb 21 15:02:32 2008
Subject: [mvapich-discuss] Help with polled desc error
In-Reply-To:
Message-ID:

Sylvain,

Thanks. The error is a race-condition due to the freeing of resources in the on-demand path (turning that off will also avoid the issue). I've now checked in a fix into the trunk and 1.0 branches.
Matt

On Thu, 21 Feb 2008, Sylvain Jeaugey wrote:

> Hi All,
>
> I worked a bit on this issue and found out that the crash was located in
> IB_ring_based_alltoall. Deactivating ring startup will definitely avoid
> the crash.
>
> I made the following patch and it seems to solve the bug:
>
> --- rdma_iba_priv.c.orig 2008-02-21 11:33:22.000000000 +0100
> +++ rdma_iba_priv.c 2008-02-19 16:00:46.000000000 +0100
> @@ -1258,6 +1258,7 @@
>  }
>
>  /*Now all send and recv finished*/
> + PMI_Barrier();
>  }
>  }
>
> I increased MV2_DEFAULT_TIME_OUT to 24 to give me some time before the
> error appears (when the timeout expires). I saw that the alltoall
> operation is leaving some processes behind, blocked in the function (10
> out of 256, say).
>
> So, I'm suspecting that others destroyed something in their next
> operations and prevented the previous ones from completing the alltoall.
>
> Hope this helps,
>
> Sylvain
>
> On Tue, 12 Feb 2008, wei huang wrote:
>
> > Hi,
> >
> > We do not see anything abnormal in our local testing. To help us
> > locate the problem, could you please try the following:
> >
> > 1) Check that you have enough space in the /tmp directory
> >
> > 2) Disable ring-based startup using:
> >
> > mpiexec -n N -env MV2_USE_RING_STARTUP 0 ./a.out
> >
> > 3) If this fails, disable shared memory support using the runtime variable
> > MV2_USE_SHARED_MEM=0:
> >
> > mpiexec -n N -env MV2_USE_SHARED_MEM 0 ./a.out
> >
> > Thanks.
> >
> > Regards,
> > Wei Huang
> >
> > 774 Dreese Lab, 2015 Neil Ave,
> > Dept. of Computer Science and Engineering
> > Ohio State University
> > OH 43210
> > Tel: (614)292-8501
> >
> >
> > On Tue, 12 Feb 2008, Le Yan wrote:
> >
> >> Hi,
> >>
> >> We have the same problem here with Mvapich2 1.0.1 on a Dell InfiniBand
> >> cluster. It has 8 cores per node and is running RHEL 4.5 (kernel
> >> 2.6.9-55). The OFED library version is 1.2.
> >> > >> At first it seemed that any code compiled with Mvapich2 1.0.1 failed at > >> the MPI_INIT stage when running with more than 128 procs. But later on > >> we found that a code could run only if it doesn't use all 8 processors > >> on the same node (which explains why mpiGraph never fails, because it > >> uses only 1 processor per node). For example, a job running with 16 > >> nodes and 8 procs per node will fail, but one with 32 nodes and 4 procs > >> per node will not. > >> > >> In addition, if the MALLOC_CHECK_ environment variable is set to 1, a > >> bunch of errors appear in the standard error like this: > >> > >> 61: malloc: using debugging hooks > >> 61: free(): invalid pointer 0x707000! > >> 61: Fatal error in MPI_Init: > >> 61: Other MPI error, error stack: > >> 61: MPIR_Init_thread(259)..: Initialization failed > >> 61: MPID_Init(102).........: channel initialization failed > >> 61: MPIDI_CH3_Init(178)....: > >> 61: MPIDI_CH3I_CM_Init(855): Error initializing MVAPICH2 malloc library > >> > >> I'm not quite sure what these messages mean, but sure it looks like a > >> memory issue? > >> > >> Both Mvapich2 0.98 and Mvapich 1.0beta are fine on the same system. > >> > >> Cheers, > >> Le > >> > >> > >> On Fri, 2008-02-08 at 22:02 -0800, Shao-Ching Huang wrote: > >>> Hi > >>> > >>> No failure was found in these mpiGraph runs. It's just that there is > >>> significant variation among the entries of the matrices, compared to > >>> another IB cluster of ours. > >>> > >>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/ > >>> > >>> Thanks. > >>> > >>> Shao-Ching > >>> > >>> > >>> On Fri, Feb 01, 2008 at 08:43:19PM -0500, wei huang wrote: > >>>> Hi, > >>>> > >>>> How often do you observe the failures when running the mpiGraph test? Do > >>>> all the failure happen at startup, as your simple program? > >>>> > >>>> Thanks. > >>>> > >>>> Regards, > >>>> Wei Huang > >>>> > >>>> 774 Dreese Lab, 2015 Neil Ave, > >>>> Dept. 
of Computer Science and Engineering > >>>> Ohio State University > >>>> OH 43210 > >>>> Tel: (614)292-8501 > >>>> > >>>> > >>>> On Fri, 1 Feb 2008, Shao-Ching Huang wrote: > >>>> > >>>>> > >>>>> Hi Wei, > >>>>> > >>>>> We cleaned up a few things and re-ran the mpiGraph tests. The updated > >>>>> results are posted here: > >>>>> > >>>>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-8a.out_html/index.html > >>>>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-9a.out_html/index.html > >>>>> > >>>>> Please ignore results in my previous email. Thank you. > >>>>> > >>>>> Regards, > >>>>> Shao-Ching > >>>>> > >>>>> > >>>>> On Thu, Jan 31, 2008 at 08:35:41PM -0800, Shao-Ching Huang wrote: > >>>>>> > >>>>>> Hi Wei, > >>>>>> > >>>>>> We did 2 runs of mpiGraph that you suggested on 48 nodes, with one (1) > >>>>>> MPI process per node: > >>>>>> > >>>>>> mpiexec -np 48 ./mpiGraph 4096 10 10 >& mpiGraph.out > >>>>>> > >>>>>> The results from the two runs are posted here: > >>>>>> > >>>>>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-1.out_html/ > >>>>>> http://reynolds.turb.ucla.edu/~schuang/mpiGraph/mpiGraph-2.out_html/ > >>>>>> > >>>>>> During the tests, some other users are also running jobs on some of > >>>>>> these 48 nodes. > >>>>>> > >>>>>> Could you please help us interpret these results, if possible? > >>>>>> > >>>>>> Thank you. > >>>>>> > >>>>>> Shao-Ching Huang > >>>>>> > >>>>>> > >>>>>> On Thu, Jan 31, 2008 at 01:05:06PM -0500, wei huang wrote: > >>>>>>> Hi Scott, > >>>>>>> > >>>>>>> We went up to 256 processes (32 nodes) and did not see the problem in few > >>>>>>> hundred runs (cpi). Thus, to narrow down the problem, we want to make sure > >>>>>>> the fabrics and system setup are ok. To diagnose this, we suggest you > >>>>>>> running mpiGraph program from http://sourceforge.net/projects/mpigraph. > >>>>>>> This test stresses the interconnects. 
It should fail at a much higher > >>>>>>> frequency than simple cpi program if there is a problem with your system > >>>>>>> setup. > >>>>>>> > >>>>>>> Thanks. > >>>>>>> > >>>>>>> Regards, > >>>>>>> Wei Huang > >>>>>>> > >>>>>>> 774 Dreese Lab, 2015 Neil Ave, > >>>>>>> Dept. of Computer Science and Engineering > >>>>>>> Ohio State University > >>>>>>> OH 43210 > >>>>>>> Tel: (614)292-8501 > >>>>>>> > >>>>>>> > >>>>>>> On Wed, 30 Jan 2008, Scott A. Friedman wrote: > >>>>>>> > >>>>>>>> My co-worker passed this along... > >>>>>>>> > >>>>>>>> Yes, the error happens on the cpi.c program too. It happened 2 times > >>>>>>>> among the 9 cases I ran. > >>>>>>>> > >>>>>>>> I was using 128 processes (on 32 4-core nodes). > >>>>>>>> > >>>>>>>> --- > >>>>>>>> > >>>>>>>> and another... > >>>>>>>> > >>>>>>>> It happens for a simple MPI program which just does MPI_Init and > >>>>>>>> MPI_Finalize and print out number of processors. It happened for > >>>>>>>> anything from 4 nodes (16 processors ) and more. > >>>>>>>> > >>>>>>>> What environment variables should we look for? > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> Scott > >>>>>>>> > >>>>>>>> wei huang wrote: > >>>>>>>>> Hi Scott, > >>>>>>>>> > >>>>>>>>> On how many processes (and how many nodes) you ran your program? Do you > >>>>>>>>> have any environmental variables when you are running the program? Does > >>>>>>>>> the error happen on simple test like cpi? > >>>>>>>>> > >>>>>>>>> Thanks. > >>>>>>>>> > >>>>>>>>> Regards, > >>>>>>>>> Wei Huang > >>>>>>>>> > >>>>>>>>> 774 Dreese Lab, 2015 Neil Ave, > >>>>>>>>> Dept. of Computer Science and Engineering > >>>>>>>>> Ohio State University > >>>>>>>>> OH 43210 > >>>>>>>>> Tel: (614)292-8501 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Wed, 30 Jan 2008, Scott A. Friedman wrote: > >>>>>>>>> > >>>>>>>>>> The low level ibv tests work fine. 
> >>>>>>>>> > >>>>>>>>> _______________________________________________ > >>>>>>>>> mvapich-discuss mailing list > >>>>>>>>> mvapich-discuss@cse.ohio-state.edu > >>>>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> mvapich-discuss mailing list > >>>>>>> mvapich-discuss@cse.ohio-state.edu > >>>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>>>>> _______________________________________________ > >>>>>> mvapich-discuss mailing list > >>>>>> mvapich-discuss@cse.ohio-state.edu > >>>>>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>>>> > >>> _______________________________________________ > >>> mvapich-discuss mailing list > >>> mvapich-discuss@cse.ohio-state.edu > >>> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > >>> > >> -- > >> Le Yan > >> User support > >> Louisiana Optical Network Initiative (LONI) > >> Office: 225-578-7524 > >> Fax: 225-578-6400 > >> > >> > > > > _______________________________________________ > > mvapich-discuss mailing list > > mvapich-discuss@cse.ohio-state.edu > > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > > > > > _______________________________________________ > mvapich-discuss mailing list > mvapich-discuss@cse.ohio-state.edu > http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss > From bertb at aspsys.com Tue Feb 26 18:56:09 2008 From: bertb at aspsys.com (Bert Beaudin) Date: Tue Feb 26 23:06:21 2008 Subject: [mvapich-discuss] mvapich2-1.01 and mpd as root Message-ID: <1204070169.1954.4.camel@puffy.private.aspsys.com> Hello all Running mvapich2-1.01 and I was wondering if it's mpd can run as root so others can attach to it to run jobs like mpich2 can do. If so what is needed in .mpd.conf? 
Thanks,
Bert

From huanwei at cse.ohio-state.edu Wed Feb 27 00:16:51 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Wed Feb 27 00:17:07 2008
Subject: [mvapich-discuss] mvapich2-1.01 and mpd as root
In-Reply-To: <1204070169.1954.4.camel@puffy.private.aspsys.com>
Message-ID:

Hi Bert,

This should be doable. It is the same as mpich2. You can refer to mpich2's
installation guide at:

http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs

Thanks.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Tue, 26 Feb 2008, Bert Beaudin wrote:

> Hello all
> Running mvapich2-1.01 and I was wondering if its mpd can run as root so
> others can attach to it to run jobs like mpich2 can do. If so what is
> needed in .mpd.conf?
>
> Thanks,
> Bert
>
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From nilesh_awate at yahoo.com Thu Feb 28 09:27:23 2008
From: nilesh_awate at yahoo.com (nilesh awate)
Date: Thu Feb 28 09:57:07 2008
Subject: [mvapich-discuss] Send Work Queue
Message-ID: <577269.43885.qm@web94101.mail.in2.yahoo.com>

Hi all,

I'm using mvapich2 1.0.1 over OFED 1.2 (uDAPL stack).

In the mvapich source code (udapl/rdma_udpl_prive.c) I've observed the
following line:

vc->mrail.send_wqes_avail[i] = rdma_default_max_wqe - 20;

rdma_default_max_wqe is 300 by default.

But if some provider has ep_attr.max_request_dtos = 16 (for example), then
we need to set the environment variable MV2_DEFAULT_MAX_WQE (which is not
documented in the user guide) to 36 (to get exactly 16), as per the code
line above.

So could someone please explain the significance of subtracting "20" from
the default work queue size? Or is it reserving 20 slots? (If so, a
provider with 16 DTOs will be in trouble.)

Waiting for a reply,

Nilesh.

From christian.guggenberger at rzg.mpg.de Thu Feb 28 10:47:04 2008
From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger)
Date: Thu Feb 28 10:47:21 2008
Subject: [mvapich-discuss] default envs for mpiexec
Message-ID: <20080228154704.GN22803@daltons.rzg.mpg.de>

Hi,

I'd like to know if there's an easy way to add default environment
variables to mpiexec. We are experiencing trouble with SRQ on some
hardware and I would thus like to disable SRQ globally by default. My
current approach would have been to add

envToSend['MV2_USE_SRQ'] = str(0)

into mpiexec.py, but this would remove the option to explicitly enable
SRQ via 'mpiexec -env MV2_USE_SRQ 1 ...'

Does anyone on the list know of a more elegant solution?

thanks a lot,
- Christian

From huanwei at cse.ohio-state.edu Thu Feb 28 11:03:07 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Thu Feb 28 11:03:23 2008
Subject: [mvapich-discuss] default envs for mpiexec
In-Reply-To: <20080228154704.GN22803@daltons.rzg.mpg.de>
Message-ID:

Hi Christian,

mpiexec should carry your environment variables automatically. So if you
have MV2_USE_SRQ=0 in your shell environment on the node where you launch
the job, SRQ will be disabled every time you launch the application.
Please let us know if this works.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Thu, 28 Feb 2008, Christian Guggenberger wrote:

> Hi,
>
> I'd like to know if there's an easy way to add default environment
> variables to mpiexec. We are experiencing trouble with SRQ on some
> hardware and I would thus like to disable SRQ globally by default. My
> current approach would have been to add
>
> envToSend['MV2_USE_SRQ'] = str(0)
>
> into mpiexec.py, but this would remove the option to explicitly enable
> SRQ via 'mpiexec -env MV2_USE_SRQ 1 ...'
>
> Does anyone on the list know of a more elegant solution?
>
> thanks a lot,
> - Christian
>
> _______________________________________________
> mvapich-discuss mailing list
> mvapich-discuss@cse.ohio-state.edu
> http://mail.cse.ohio-state.edu/mailman/listinfo/mvapich-discuss
>

From christian.guggenberger at rzg.mpg.de Thu Feb 28 11:24:30 2008
From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger)
Date: Thu Feb 28 11:24:49 2008
Subject: [mvapich-discuss] default envs for mpiexec
In-Reply-To:
References: <20080228154704.GN22803@daltons.rzg.mpg.de>
Message-ID: <20080228162429.GP22803@daltons.rzg.mpg.de>

Hello Wei,

>
> mpiexec should carry your environment variables automatically. So if you
> have MV2_USE_SRQ=0 in your shell environment on the node where you launch
> the job, SRQ will be disabled every time you launch the application.
> Please let us know if this works.
>

this would work, of course. However, I'd like to avoid overriding these
shell environments on each system. A centralized approach (by adjusting
mpiexec etc.) would be preferable, if possible. (We usually make MPI
libraries available within AFS, so setting some defaults in the global
installation would make life easier for me ;)

cheers.
- Christian

From huanwei at cse.ohio-state.edu Thu Feb 28 11:28:48 2008
From: huanwei at cse.ohio-state.edu (wei huang)
Date: Thu Feb 28 11:29:04 2008
Subject: [mvapich-discuss] default envs for mpiexec
In-Reply-To: <20080228162429.GP22803@daltons.rzg.mpg.de>
Message-ID:

Are you willing to change the code and recompile? I can send you a patch
to disable it in code.

Regards,
Wei Huang

774 Dreese Lab, 2015 Neil Ave,
Dept. of Computer Science and Engineering
Ohio State University
OH 43210
Tel: (614)292-8501


On Thu, 28 Feb 2008, Christian Guggenberger wrote:

> Hello Wei,
>
> >
> > mpiexec should carry your environment variables automatically. So if you
> > have MV2_USE_SRQ=0 in your shell environment on the node where you launch
> > the job, SRQ will be disabled every time you launch the application.
> > Please let us know if this works.
> >
> this would work, of course. However, I'd like to avoid overriding these
> shell environments on each system. A centralized approach (by adjusting
> mpiexec etc.) would be preferable, if possible. (We usually make MPI
> libraries available within AFS, so setting some defaults in the
> global installation would make life easier for me ;)
>
> cheers.
> - Christian
>

From christian.guggenberger at rzg.mpg.de Thu Feb 28 11:46:11 2008
From: christian.guggenberger at rzg.mpg.de (Christian Guggenberger)
Date: Thu Feb 28 11:46:29 2008
Subject: [mvapich-discuss] default envs for mpiexec
In-Reply-To:
References: <20080228162429.GP22803@daltons.rzg.mpg.de>
Message-ID: <20080228164611.GQ22803@daltons.rzg.mpg.de>

Hi Wei,

> Are you willing to change the code and recompile? I can send you a patch
> to disable it in code.
>

this would be an option as well. I was thinking of preparing
mvapich2-1.0.2p1 in the near future, so I'd appreciate your patch. As the
problem with SRQ is only visible on em64t with pci-ex adapters (25208),
but not on Opteron with pci-x (23108), I could keep it enabled for the
latter.

You might be interested, so I'll describe the SRQ problem we are seeing
on em64t with 25208 adapters (OFED-1.2.5.5, SLES9 SP4): even the simplest
MPI programs occasionally hang in MPI_FINALIZE (even for intra-node-only
communication). We have not been able to track this down further, but
disabling SRQ helps.
-> pthread_cond_wait, FP=7fbfffea20
   ibv_cmd_destroy_srq, FP=7fbfffea80
   mthca_destroy_srq, FP=7fbfffea90
   ibv_destroy_srq, FP=7fbfffeaa0
   MPIDI_CH3I_RMDA_finalize, FP=7fbfffeb10
   MPIDI_CH3_Finalize, FP=7fbfffeb30
   MPID_Finalize, FP=7fbfffeb50
   PMPI_Finalize, FP=7fbfffeb90
   pmpi_finalize_, FP=7fbfffeba0

cheers.
- Christian

From chai.15 at osu.edu Thu Feb 28 14:12:08 2008
From: chai.15 at osu.edu (LEI CHAI)
Date: Thu Feb 28 14:12:33 2008
Subject: [mvapich-discuss] Send Work Queue
Message-ID: <18834b18bbaf.18bbaf18834b@osu.edu>

Hi Nilesh,

The 20 WQEs are reserved for credit information. It's true that this only
works for networks that provide more than 20 WQEs, and so far we have not
encountered a network that provides fewer than 20 WQEs. We will revise it
in the future.

Lei
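
The arithmetic in the "Send Work Queue" exchange can be sketched as follows. This is only an illustration, assuming nothing beyond the single source line quoted in Nilesh's message (available send WQEs = rdma_default_max_wqe - 20) and Lei's statement that the 20 WQEs are reserved for credit information. The names usable_send_wqes, max_wqe_for_usable and RESERVED_CREDIT_WQES are hypothetical and are not part of MVAPICH2; Python is used purely for brevity.

```python
# Hypothetical constant: per the thread, 20 send WQEs are held back
# for credit information.
RESERVED_CREDIT_WQES = 20


def usable_send_wqes(max_wqe, reserved=RESERVED_CREDIT_WQES):
    """Return the send WQEs left for data once the reservation is subtracted."""
    if max_wqe <= reserved:
        # This is the case raised in the thread: a queue of 20 or fewer
        # entries has nothing left after the subtraction.
        raise ValueError(
            f"max_wqe={max_wqe} leaves no send WQEs after reserving {reserved}"
        )
    return max_wqe - reserved


def max_wqe_for_usable(usable, reserved=RESERVED_CREDIT_WQES):
    """Return the max_wqe value that yields exactly `usable` send WQEs."""
    return usable + reserved


if __name__ == "__main__":
    print(usable_send_wqes(300))      # default value mentioned in the thread
    print(max_wqe_for_usable(16))     # the setting Nilesh computes
    try:
        usable_send_wqes(16)
    except ValueError as err:
        print("error:", err)
```

With the default of 300 this leaves 280 send WQEs, and asking for exactly 16 usable WQEs gives the value 36 that Nilesh mentions. A maximum of 16 leaves nothing after the subtraction, which is the situation he is worried about.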