Version 2.1, April 2005:

Changes since version 2.0:

- Added support for x86_64. As opposed to older Pentium/AMD processors,
  the "RISC" version of the modified Gram-Schmidt re-orthogonalization 
  is faster on this platform. This may be due to the increased number 
  of SSE floating point registers in the x86_64 instruction set 
  architecture.

- Fixed a cut-and-paste bug in the extended local reorthogonalization
  code in xLANBPRO. The updated norm of u_{k+1} was being compared to 
  alpha instead of beta to determine if further orthogonalization is
  necessary. This would likely have degraded performance or accuracy
  very slightly in some pathological cases.

Version 2.0, March 2004: 

Changes since version 1.2:

- Extended PROPACK to handle matrices of all four fundamental data
  types: real (real*4), double precision (real*8), complex (complex*8)
  and double complex (complex*16). A word of caution is appropriate here:
  Even if only single precision accuracy is desired in the computed
  values, doing the computation in double precision might still be the 
  fastest way to obtain the result, for the following reason: Partial 
  reorthogonalization is less effective in single precision, since the 
  factor by which the level of orthogonality can decrease between 
  reorthogonalizations is typically smaller. The rate at which it is 
  lost, on the other hand, primarily depends on the distribution of the 
  singular values and is largely independent of the precision. Therefore 
  the number of iterations between consecutive reorthogonalizations 
  decreases when going from double to single precision for the same matrix, 
  and consequently the total number of operations spent reorthogonalizing 
  increases. However, since each iteration is typically faster (depending 
  on the hardware), the total computation time might still decrease. Doing 
  the computation in single precision obviously reduces the amount of memory 
  required, which might be a driving factor in some cases.

- Added the program in compare.F to verify that the installed example 
  programs compute something that is consistent with a set of reference 
  results generated by me.

- Eugene M. Fluder, Jr. of Merck & Co., Inc. kindly contributed
  support for the IBM Power 4 platform.

- Fixed a bug that caused a non-zero starting vector passed to 
  xLANBPRO to be replaced with a random vector when N > M.

- Added support for Mac OS X/Darwin. Thanks to Felix Herrmann, 
  Department of Earth and Ocean Sciences, University of British 
  Columbia for providing access to a Mac OS X platform.
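
Illustration of the single versus double precision trade-off described in
the first item above. This is a minimal Python sketch, not PROPACK code. It
assumes a simplified model in which the level of orthogonality is reset to
the unit roundoff eps after a reorthogonalization, grows by a constant
factor per iteration, and triggers the next reorthogonalization once it
reaches sqrt(eps). The threshold sqrt(eps), the growth factor of 2 and the
name steps_between_reorth are assumptions made for the example; the real
loss rate depends on the distribution of the singular values.

```python
import math


def steps_between_reorth(eps, growth):
    """Iterations needed for the orthogonality level to grow from eps to
    sqrt(eps) when it is multiplied by `growth` in every iteration
    (simplified geometric model, an assumption of this sketch)."""
    allowed_factor = 1.0 / math.sqrt(eps)  # sqrt(eps) / eps
    return math.log(allowed_factor) / math.log(growth)


EPS_SINGLE = 2.0 ** -24  # unit roundoff, IEEE single precision
EPS_DOUBLE = 2.0 ** -53  # unit roundoff, IEEE double precision
GROWTH = 2.0             # hypothetical per-iteration loss rate

for name, eps in (("single", EPS_SINGLE), ("double", EPS_DOUBLE)):
    print(name, steps_between_reorth(eps, GROWTH))
```

Under this model the allowed deterioration factor is 1/sqrt(eps), which
gives about 12 iterations between reorthogonalizations in single precision
versus about 26.5 in double precision for the same growth rate.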

Known Problems:

- DLANSVD_IRL computes incorrect results if WHICH='S' and P>DIM/2. 
- Parallel performance is poor on distributed memory machines.
- There seems to be a subtle bug in SGEMM and CGEMM of the Goto
  BLAS library for the Itanium 2 platform. This sometimes causes 
  the singular vectors computed by slansvd_irl and clansvd_irl to 
  be incorrect. When using the Intel MKL BLAS library I have not
  observed any problems of this nature.
- The performance of the implicitly restarted version in single 
  precision is poor under SunOS using BLAS from the Sun performance 
  library. Probably BLAS related.

==============================================================================

Version 1.2, January 2004: 

Changes since version 1.1:

- Added missing documentation in dlanbpro.f and dritzvec.f and in other
  places such that at least the non-trivial parts of the code are now
  documented in some detail.

- Extended the programs in the Examples directory to handle more matrix
  formats. In addition to the Harwell-Boeing format it now handles dense
  and diagonal matrices as well as sparse matrices stored in coordinate 
  format. The routines handling matrix I/O and matrix-vector multiply 
  can be found in the file matvec.F. I stress that these routines are not 
  tuned to achieve production level performance for sparse matrix-vector 
  multiplication, but are primarily meant to illustrate the use of DLANSVD 
  and DLANSVD_IRL and to make it easy to explore the numerical properties 
  of the algorithms with test matrices without having to write new code.

- Changed the installation procedure by introducing the shell script
  "configure" that examines the OS and CPU type of the system and 
  automatically generates a (hopefully) appropriate make.inc file 
  with the system and compiler dependent options. Some manual hacking
  of make.linux_gcc_ia32 is still required to fine tune the gcc 
  optimization flags for various flavors of ia32 processors 
  (AMD processors and older Pentiums and such).

- The code has been parallelized using OpenMP. It was tested on 
  several SMP systems including ia32 (dual Xeon system) with 
  the Intel compiler version 7.1, on ia64 (SGI Altix 3300 system) with
  the Intel compiler version 8.0, on an IRIX system with the MIPSpro 
  compilers version 7.30 (SGI Origin 2000 system) and on an AIX system 
  (IBM Power 4 32 processor node) using the xlf90 compiler. 
  If you run PROPACK on larger or different SMP systems I would be 
  interested in hearing how well it scales. I am working on getting an 
  MPI version ready for public consumption, and hopefully will find the time 
  to get it into shape for version 2.1.

- Added support for the ia64 (Itanium) platform. I highly recommend
  using the Intel compilers for this platform if available (run 
  the configure script with option "-icc"), since gcc generates 
  terribly slow code for the Itanium processor.

- After experiencing endless problems with performance bugs,
  incorrect results, and linking failures I decided to include 
  all LAPACK routines used by PROPACK as source code instead 
  of relying on pre-built LAPACK libraries.  This also eliminates 
  the problem with older systems only having LAPACK version 2.0 
  installed. See known problems under version 1.1 for more info.

- Fixed a bug that prevented the singular vectors from being
  computed when an invariant subspace was found and the dimension 
  of the subspace was smaller than the requested number of singular 
  values. Thanks to Dr. Wolfgang Duemmler, Siemens AG, Erlangen, for 
  reporting this. In addition, an exit code of info == 0 was returned
  instead of info == dimension of the invariant subspace. This was
  also reported by Eugene M. Fluder, Jr., Merck & Co., Inc.

- Changed error-bound refinement using the gap theorem to be 
  more robust (pessimistic). The old version would only look at 
  the gap |\theta_i-\theta_{i+1}| when refining \theta_i, while 
  strictly speaking, 
      min( |\theta_i-\theta_{i+1}|, |\theta_i-\theta_{i-1}| )
  (minus slack from existing error bounds) should be used.
  Here I define \theta_{0}=+infinity, \theta_{n+1}=0 when
  refining the extreme Ritz values. This adds refinement to the 
  last Ritz value, which was previously missed in the case when 
  the dimension of the Krylov subspace was equal to min(m,n). This 
  solves a problem where PROPACK could get stuck when trying to 
  compute all singular values for a matrix with a tiny smallest 
  singular value.

- Fixed a bug where the last left singular vector would be reported 
  as zero when computing all singular values and vectors of a matrix
  of rank min(m,n)-1, even though it could have been computed accurately 
  from the available Lanczos bidiagonalization.
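
Illustration of the two-sided gap rule from the error-bound refinement item
above. This is a minimal Python sketch, not the PROPACK routine. It assumes
the Ritz values are sorted in decreasing order, that the "slack" is the
error bound of the neighbouring Ritz value, and that the refined bound is
the usual gap-theorem estimate bound^2/gap. The function names and these
details are assumptions of the sketch.

```python
import math


def two_sided_gaps(theta, bnd):
    """For Ritz values theta (decreasing) with error bounds bnd, return the
    smaller of the gaps to both neighbours, each reduced by the neighbour's
    bound. Conventions: theta_0 = +infinity and theta_{n+1} = 0."""
    n = len(theta)
    gaps = []
    for i in range(n):
        if i == 0:
            upper = math.inf  # no larger neighbour
        else:
            upper = abs(theta[i] - theta[i - 1]) - bnd[i - 1]
        if i == n - 1:
            lower = abs(theta[i])  # distance to theta_{n+1} = 0
        else:
            lower = abs(theta[i] - theta[i + 1]) - bnd[i + 1]
        gaps.append(min(upper, lower))
    return gaps


def refine_bounds(theta, bnd):
    """Replace a bound by bound^2/gap when the gap clearly exceeds it."""
    refined = []
    for b, gap in zip(bnd, two_sided_gaps(theta, bnd)):
        if gap > b:
            refined.append(min(b, b * b / gap))
        else:
            refined.append(b)
    return refined


theta = [5.0, 4.0, 1.0]
bnd = [0.1, 0.1, 0.1]
print(two_sided_gaps(theta, bnd))
print(refine_bounds(theta, bnd))
```

Note that the last Ritz value gets a finite gap through the convention
\theta_{n+1}=0, so it is refined as well.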


Known Problems:

- DLANSVD_IRL computes incorrect results if WHICH='S' and P>DIM/2. 
- Parallel performance is poor on distributed memory machines.

==============================================================================

Version 1.1, June 2003: 

Changes since version 1.0:

- Fixed two bugs where dgetu0 and dreorth were being called with the
  wrong number of parameters. Thanks to Jerzy Czaplicki, Institut de 
  Pharmacologie et de Biologie Structurale CNRS, Université Paul
  Sabatier, Toulouse for reporting this.

- Added experimental support for computing the smallest singular
  values in the implicitly restarted version of PROPACK. The 
  subroutine DLANSVD_IRL now takes an additional argument "WHICH", 
  which can have the values 'L' or 'S'. If WHICH is 'L' then 
  the NEIG largest singular values are computed.  If WHICH is 'S' 
  then DLANSVD_IRL attempts to compute the NEIG smallest singular 
  values by repeatedly filtering out the largest Ritz values when 
  restarting (using them as shifts) until convergence.
  NOTICE: Be aware that for large and ill-conditioned matrices the
  convergence can be very slow and the algorithm may even fail to
  converge at all.

- Added support for the Intel compilers under Linux.

- Split options for GCC and the Intel compilers into separate files
  make.linux_gcc and make.linux_intel.

- The minimum length of the integer workspace IWORK as specified in the
  interface of DLANSVD and DLANSVD_IRL was incorrect and inconsistent 
  with the length used in the example programs. Thanks to Tom Schweiger, 
  Acxiom Corporation for reporting this.

- Fixed bugs in example programs:
  o Dimensions of array arguments x and y were reversed in the Harwell-Boeing
    matrix-vector multiply subroutine atvHB(m, n, x, y) used by the example 
    program. Thanks to Hannes Schwarzl of Institute of Geophysics and Planetary
    Physics, UCLA, for reporting this.
  o The COLPTR array in HB.h should be of length NMAX+1, not NMAX.

- Changed the order in which libraries are linked with the example programs
  to ensure that the platform optimized version of the ILAENV subroutine 
  provided by a commercial LAPACK implementation is not overwritten by the
  default values in the file supplied with PROPACK. The divide-and-conquer 
  code in the DC directory is only meant as a backup for systems that have 
  an LAPACK library older than version 3.0 installed.

- Made a small modification of the divide-and-conquer SVD code in dbdsdc.f
  to manually set the SMLSIZ parameter to 25, if run in combination with
  version 2.0 of ILAENV.
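
Illustration of the two fixes to the Harwell-Boeing example code listed
above. This is a minimal Python sketch, not the Fortran from the Examples
directory. It assumes that atvHB computes a transposed product, as its name
suggests; the function name csc_rmatvec and the 0-based indexing are
assumptions of the sketch (Harwell-Boeing files use 1-based indices). For an
m-by-n matrix stored by columns, the column pointer array needs n+1 entries,
and y = A^T x takes x of length m and returns y of length n.

```python
def csc_rmatvec(m, n, colptr, rowind, values, x):
    """Return y = A^T * x for an m-by-n matrix in compressed column storage.

    colptr has n+1 entries: column j occupies positions
    colptr[j] .. colptr[j+1]-1 of rowind and values."""
    if len(colptr) != n + 1:
        raise ValueError("colptr must have n+1 entries")
    if len(x) != m:
        raise ValueError("x must have length m for the transposed product")
    y = [0.0] * n
    for j in range(n):
        s = 0.0
        for k in range(colptr[j], colptr[j + 1]):
            s += values[k] * x[rowind[k]]
        y[j] = s
    return y


# A = [[1, 0, 2],
#      [0, 3, 0]]   (m = 2, n = 3)
colptr = [0, 1, 2, 3]
rowind = [0, 1, 0]
values = [1.0, 3.0, 2.0]
print(csc_rmatvec(2, 3, colptr, rowind, values, [1.0, 1.0]))
```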

Known Problems:

- We have observed two problems when using the Intel Math Kernel Library(tm)
  (MKL) and the Intel compiler on the ia32 platform under Linux:
  1) the performance of the LAPACK routines DBDSQR and DBDSDC from MKL is 
     severely crippled (presumably) to ensure thread safety.  This is a 
     problem in MKL, not PROPACK, but we mention it since it can severely 
     reduce performance. 
  2) The LAPACK divide-and-conquer source code (DBDSDC) supplied with 
     PROPACK generates incorrect singular vectors when compiled with the 
     Intel compiler version 7.0. The version in the Intel Math Kernel 
     Library (TM) works correctly (albeit very slowly).
  To get the best performance with the Intel compiler and MKL on the ia32 
  platform we recommend using only the BLAS routines in MKL in combination 
  with either the pre-compiled LAPACK 3.0 libraries available from NETLIB
  or LAPACK 3.0 compiled with GCC from source code.

- DLANSVD_IRL computes incorrect results if WHICH='S' and P>DIM/2. 

==============================================================================

Version 1.0: Initial version.
