1. The Linux version can use SMP parallelism for HF and DFT using the usual
System V shared memory and Unix fork/join calls. This does not require
the parallel blas library. It does require the following:
a. A machine with more than one processor and Linux compiled with SMP support,
and with the kernel parameter for maximum shared memory segment set
to some large value (the maximum amount of memory you'll want to give
G98, typically 3/4 or more of the amount of physical memory on the machine).
b. An input deck for a reasonably large job (i.e., not water STO-3G!)
with %nproc set to request more than one processor. %nproc works
by default with the Linux version (even on machines with only 1
CPU), but the default amount of shared memory allowed by the kernel
is not sufficient to run anything, so jobs with %nproc will fail
until the kernel is rebuilt with this limit increased.
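
As a rough illustration of the fork/join pattern described in note 1, the
sketch below forks worker processes that write partial results into a shared
memory region and then joins them. It is not G98's code: it is written in
Python, it uses an anonymous shared mmap in place of a System V segment, and
`parallel_sum` and its arguments are hypothetical names chosen for this example.

```python
import mmap
import os
import struct


def parallel_sum(values, nproc=2):
    """Sum `values` using `nproc` forked workers and a shared memory region.

    Illustrative only: an anonymous shared mmap stands in for a System V
    shared memory segment, and the work is a trivial sum.
    """
    slot = struct.calcsize("d")            # bytes needed for one double
    shared = mmap.mmap(-1, slot * nproc)   # anonymous map, shared with children
    pids = []
    for rank in range(nproc):
        pid = os.fork()
        if pid == 0:
            # Child: take every nproc-th element, store the partial sum, exit.
            partial = float(sum(values[rank::nproc]))
            struct.pack_into("d", shared, rank * slot, partial)
            os._exit(0)
        pids.append(pid)
    # Parent: join, i.e. wait for every worker before reading the results.
    for pid in pids:
        os.waitpid(pid, 0)
    total = sum(struct.unpack_from("d", shared, r * slot)[0] for r in range(nproc))
    shared.close()
    return total
```

The size of the shared region is fixed before the workers are forked, which is
why a kernel limit on shared memory that is too small stops such a job at
startup rather than partway through.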
2. We have not attempted to build with the Portland compiler and the parallel
blas library. We plan to use the thread features of the Portland compiler
to make our own parallel calls to the (serial) blas library routines, but
this is not in the current revision of G98.
3. Some post-SCF jobs spend most of their time in matrix multiplies,
which currently use the serial blas routines and thus are not yet parallelized.
4. Running efficiently in parallel on 2 CPUs requires twice as much memory
bandwidth as running on 1 CPU. Server vendors like SGI, IBM, and
Digital/Compaq devote considerable effort and cost to having enough
bandwidth to keep several processors going. A system with two Intel
processors but only enough bandwidth to keep one busy is not going to
run twice as fast in parallel, nor will it run two separate jobs in the
same elapsed time as a single job. The observed performance will depend
strongly on the details of the hardware.
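
The bandwidth argument in note 4 can be put into a deliberately crude model.
The sketch below assumes that each busy CPU needs a fixed amount of memory
bandwidth and that speedup is capped by how many CPUs the memory system can
feed. The function name, its parameters, and the numbers used with it are
hypothetical; real machines behave less simply.

```python
def bandwidth_limited_speedup(ncpu, bandwidth_available, bandwidth_per_cpu):
    """Upper bound on speedup relative to one CPU with all the bandwidth it needs.

    Assumes (for illustration only) that every busy CPU requires
    `bandwidth_per_cpu` and the machine supplies `bandwidth_available`,
    both in the same arbitrary units.
    """
    # Number of CPUs the memory system can keep fully busy (may be fractional).
    cpus_fed = bandwidth_available / bandwidth_per_cpu
    return min(float(ncpu), cpus_fed)
```

Under these assumptions, two CPUs on a memory system that can feed only one
give a bound of 1.0 (no gain), while doubling the available bandwidth raises
the bound to 2.0.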
5. For networks of Linux machines we do support parallelism in HF and DFT
calculations using Linda, just as for networks with other models of
workstations. Linda is required only for network parallelism, not for SMP
parallelism.
6. The use of the tuned blas library produces a substantial (serial)
performance improvement in those jobs which spend a lot of time doing
matrix multiplies. It doesn't make much difference for jobs which spend
their time doing integrals. So it will make a big difference for QCISD(T)
but not for HF.
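
Why the tuned library matters for some jobs and not others in note 6 follows
from the usual Amdahl's-law estimate: only the fraction of run time spent in
matrix multiplies gets faster. The sketch below applies that formula; the
function name and the fractions and speedup factor in the usage note are
hypothetical and are not measurements of G98.

```python
def overall_speedup(matmul_fraction, blas_speedup):
    """Amdahl's-law estimate of whole-job speedup.

    matmul_fraction: share of the original run time spent in matrix
                     multiplies (0.0 to 1.0); an assumed input.
    blas_speedup:    factor by which the tuned library speeds up those
                     multiplies; also an assumed input.
    """
    untouched = 1.0 - matmul_fraction            # time the library cannot help
    accelerated = matmul_fraction / blas_speedup  # time left in the multiplies
    return 1.0 / (untouched + accelerated)
```

With an assumed 3x faster library, a job spending 90% of its time in matrix
multiplies comes out about 2.5 times faster overall, while a job spending 5%
there gains only about 3%.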
7. We use the same compiler (Portland) and blas libraries on both Windows/NT
and Linux, and performance is similar. The differences in reported times
are a few percent and probably reflect Linux reporting system time
overhead (as other Unix versions do) while this isn't included on a
per-process basis under NT. Elapsed times are sometimes a bit better
under Linux because of better I/O, but the best summary of the situation
is that there isn't a significant performance difference for running a
single, stand-alone job under one OS or the other.
Mike Frisch