1. The Linux version can use SMP parallelism for HF and DFT using the usual System V shared memory and Unix fork/join calls. This does not require the parallel blas library. It does require the following:
a. A machine with more than one processor and Linux compiled with SMP support, and with the kernel parameter for maximum shared memory segment size set to some large value (the maximum amount of memory you'll want to give G98, typically 3/4 or more of the amount of physical memory on the machine).
b. An input deck for a reasonably large job (i.e., not water STO-3G!) with %nproc set to request more than one processor. %nproc is accepted by default in the Linux version (even on machines with only 1 CPU), but the default amount of shared memory allowed by the kernel is not sufficient to run anything, so jobs with %nproc will fail until the kernel is rebuilt with this limit increased.
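
As a rough way to check requirement (a), the sketch below compares the kernel's shared memory segment limit with three quarters of physical memory. It assumes a kernel that exposes the limit in /proc/sys/kernel/shmmax and the memory total in /proc/meminfo; where those files are not available the limit can only be inspected and changed in the kernel source, as described above. The function names and the 0.75 default are illustrative choices, not part of G98.

```python
def recommended_shmmax(physical_bytes, fraction=0.75):
    """Segment size in bytes suggested by the '3/4 of physical memory' rule of thumb."""
    return int(physical_bytes * fraction)


def shmmax_is_sufficient(shmmax_bytes, physical_bytes, fraction=0.75):
    """True if the kernel limit is at least the rule-of-thumb size."""
    return shmmax_bytes >= recommended_shmmax(physical_bytes, fraction)


def read_physical_bytes(path="/proc/meminfo"):
    """Read total physical memory; the MemTotal line is reported in kB."""
    with open(path) as fh:
        for line in fh:
            if line.startswith("MemTotal:"):
                return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found in " + path)


def read_shmmax(path="/proc/sys/kernel/shmmax"):
    """Read the current maximum shared memory segment size in bytes."""
    with open(path) as fh:
        return int(fh.read().strip())


if __name__ == "__main__":
    # Assumption: both /proc files exist on this machine.
    phys = read_physical_bytes()
    limit = read_shmmax()
    print("physical memory :", phys, "bytes")
    print("kernel shmmax   :", limit, "bytes")
    print("suggested shmmax:", recommended_shmmax(phys), "bytes")
    print("sufficient      :", shmmax_is_sufficient(limit, phys))
```

If the last line prints False, jobs using %nproc can be expected to fail in the way described in (b) until the limit is raised.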

2. We have not attempted to build with the Portland compiler and the parallel blas library. We plan to use the thread features of the Portland compiler to make our own parallel calls to the (serial) blas library routines, but this is not in the current revision of G98.

3. Some post-SCF jobs spend most of their time in matrix multiplies, which currently use the serial blas routines and thus are not yet parallelized.

4. Running efficiently in parallel on 2 CPUs requires twice as much memory bandwidth as running on 1 CPU. Server vendors like SGI, IBM, and Digital/Compaq devote considerable effort and cost to having enough bandwidth to keep several processors going. A system with two Intel processors but only enough bandwidth to keep one busy is not going to run twice as fast in parallel, nor will it run two separate jobs in the same elapsed time as a single job. The observed performance will depend strongly on the details of the hardware.
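
The bandwidth argument in point 4 can be illustrated with a deliberately crude model: the speedup is capped either by the number of processors or by the number of processors the memory system can keep fed, whichever is smaller. This is only a toy sketch; the function name and the bandwidth units are hypothetical, and real machines are also affected by caches and access patterns that the model ignores.

```python
def bandwidth_limited_speedup(n_cpus, bandwidth_available, bandwidth_per_cpu):
    """Toy upper bound on throughput relative to one fully fed processor.

    bandwidth_available: total memory bandwidth of the machine
    bandwidth_per_cpu:   bandwidth one processor needs to stay busy
    (both in the same arbitrary units)
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    if bandwidth_per_cpu <= 0:
        raise ValueError("bandwidth_per_cpu must be positive")
    # Number of processors the memory system can keep busy.
    cpus_fed = bandwidth_available / bandwidth_per_cpu
    return min(float(n_cpus), cpus_fed)


if __name__ == "__main__":
    # Two processors, memory system sized for one: no gain.
    print(bandwidth_limited_speedup(2, 1.0, 1.0))
    # Two processors, memory system sized for two: full gain.
    print(bandwidth_limited_speedup(2, 2.0, 1.0))
```

In this model a two-processor machine with bandwidth for only one processor gets a bound of 1.0, which is the same reason two separate jobs on such a machine take longer than a single job would.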

5. For networks of Linux machines we do support parallelism in HF and DFT calculations using Linda, just as for networks with other models of workstations. Linda is required only for network parallelism, not for SMP parallelism.

6. The use of the tuned blas library produces a substantial (serial) performance improvement in those jobs which spend a lot of time doing matrix multiplies. It doesn't make much difference for jobs which spend their time doing integrals. So it will make a big difference for QCISD(T) but not for HF.
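
Why the tuned library helps QCISD(T) far more than HF follows from simple arithmetic: only the fraction of run time spent in matrix multiplies gets faster. The sketch below applies that Amdahl-style estimate. The fractions and the library speedup factor in the example are made-up illustrative numbers, not measurements of G98, and the function name is hypothetical.

```python
def overall_speedup(matmul_fraction, blas_speedup):
    """Estimated whole-job speedup when only the matrix multiply part gets faster.

    matmul_fraction: fraction of the original run time spent in matrix multiplies (0..1)
    blas_speedup:    factor by which the tuned library speeds up those multiplies
    """
    if not 0.0 <= matmul_fraction <= 1.0:
        raise ValueError("matmul_fraction must be between 0 and 1")
    if blas_speedup <= 0:
        raise ValueError("blas_speedup must be positive")
    # Time left after tuning: the untouched part plus the accelerated part.
    remaining = (1.0 - matmul_fraction) + matmul_fraction / blas_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical job dominated by matrix multiplies.
    print(overall_speedup(0.9, 3.0))
    # Hypothetical job dominated by integral evaluation.
    print(overall_speedup(0.05, 3.0))
```

With these assumed numbers the first job runs about 2.5 times faster, while the second gains only a few percent.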

7. We use the same compiler (Portland) and blas libraries on both Windows/NT and Linux, and performance is similar. The differences in reported times are a few percent and probably reflect Linux reporting system time overhead (as other unix versions do), while this isn't included on a per-process basis under NT. Elapsed times are sometimes a bit better under Linux because of better I/O, but the best summary of the situation is that there isn't a significant performance difference for running a single, stand-alone job under one OS or the other.

Mike Frisch