commit 940a707ac78de975110e17c95765e65b89aa5e10 (HEAD -> master, tag: 0.2.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 16:38:42 2017 -0500

    Version file update (0.2.2)

commit d5a5e003ea9b24bb6abf12e88862e8eb61ffb03d (origin/master, origin/HEAD, origin/1m, 1m)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 15:48:30 2017 -0500

    Fixed a trsm1m bug that affected right-side cases.
    
    Details:
    - Fixed a bug introduced in 1c732d3 that affected trsm1m_r. The result
      was nondeterministic behavior (usually segmentation faults) for certain
      problem sizes beyond the 1m instance of kc (e.g. 128 on haswell). The
      cause of the bug was my commenting out lines in bli_gemm1m_ukr_ref.c
      which explicitly directed the virtual gemm micro-kernel to use temporary
      space if the storage preference of the [real domain] gemm ukernel did
      not match the storage of the output matrix C. In the context of gemm,
      this handling is not needed because agreement between the storage pref
      and the matrix is guaranteed by a high-level optimization in BLIS.
      However, this optimization is not applied to trsm because the storage
      of C is not necessarily the same as the storage of the micro-panels of
      B--both of which are updated by the micro-kernel during a trsm
      operation. Thus, the guarantee of storage/preference agreement is not
      in place for trsm, which means we must handle that case within the
      virtual gemm micro-kernel.
    - Comment updates and a minor macro change to bli_trsm*_cntx_init() for
      3m1, 4m1a, and 1m.
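    
    A minimal sketch of the temporary-space fallback described above. All
    names (virt_ukr, ukr_rowpref, the 2x2 tile size) are hypothetical and
    are not BLIS APIs; the micro-kernel here is a stand-in that simply
    copies a precomputed product and assumes row-stored output.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    enum { MR = 2, NR = 2 };  /* hypothetical micro-tile dimensions */
    
    /* Hypothetical micro-kernel that prefers (and assumes) row storage:
       it writes an MR x NR tile with row stride NR and column stride 1. */
    static void ukr_rowpref(const double *ab, double *c_rows)
    {
        for (size_t i = 0; i < MR * NR; ++i) c_rows[i] = ab[i];
    }
    
    /* Virtual micro-kernel: if the strides of C match the preference, call
       the micro-kernel directly on C; otherwise compute into a temporary
       tile and copy the result out using the actual strides of C. */
    void virt_ukr(const double *ab, double *c, size_t rs_c, size_t cs_c)
    {
        assert(ab != NULL && c != NULL);
        if (rs_c == NR && cs_c == 1) {
            ukr_rowpref(ab, c);
            return;
        }
        double ct[MR * NR];
        ukr_rowpref(ab, ct);
        for (size_t i = 0; i < MR; ++i)
            for (size_t j = 0; j < NR; ++j)
                c[i * rs_c + j * cs_c] = ct[i * NR + j];
    }
    ```
    
    Calling virt_ukr() with rs_c = 1 and cs_c = MR exercises the temporary
    path, which is the case that gemm never sees but trsm can.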

commit e80993e71f4d571e9650a8e90ed386e32059eae5
Merge: a509fbd5 ca3a7924
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 12:30:28 2017 -0500

    Merge branch 'master' into 1m

commit ca3a7924770d6cf203cce4ca9f5482e1d0d4e961
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 2 12:09:39 2017 -0500

    README.md update.
    
    Details:
    - Updated bibtex entries for 4th BLIS paper, and added entries for 5th
      and 6th BLIS papers.

commit 6e7de6ef84babb273dc5528a9b9d01f0febe394b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 17 12:10:24 2017 -0500

    Minor updates to test/3m4m.
    
    Details:
    - Updated initial problem size and increment in Makefile.
    - Updated code in test_gemm.c to correctly query kc from context.

commit f484c6cd4389dc7ae5b972849e12e98ad5bbf9a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 17 12:07:27 2017 -0500

    Whitespace reformatting to armv8a kernels file.
    
    Details:
    - Updated formatting of function signature/header in
      kernels/armv8a/3/bli_gemm_opt_4x4.c.

commit a509fbd5ac04fafd4e51b43d2f59ca56432dc212
Merge: 69b4846a 513944e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 21 17:06:16 2017 -0600

    Merge branch 'master' into 1m

commit 69b4846ae9adb157c4171b52e159684db2867853
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 21 15:33:39 2017 -0600

    Disabled experiment-related 1m code.
    
    Details:
    - Commented out code in frame/ind/oapi/bli_l3_3m4m1m_oapi.c that was
      specifically inserted to facilitate the benchmarking of 1m block-panel
      and panel-block algorithms.
    - Updates to test/3m4m/Makefile, runme.sh script, and test_gemm.c to
      reflect changes used/needed during benchmarking.

commit 513944e4a951d8823b4de161b86ad7a965b4d99b
Merge: 8b462a0e 0e18f68c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Feb 20 10:04:33 2017 -0500

    Merge pull request #118 from devinamatthews/master
    
    Handle k=0 correctly in KNL dgemm ukernel.

commit 0e18f68cf12eb9189ba901a20040b1cdae417670
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Feb 20 09:03:21 2017 -0600

    Handle k=0 correctly in KNL dgemm ukernel.

commit 8b462a0e8c3e9252f0401940849e53cc772256fa
Merge: c362afc5 7d42fc07
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Feb 19 23:03:03 2017 -0500

    Merge pull request #117 from devinamatthews/master
    
    Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit 7d42fc0796ef0c010375fd8e59b1240ba41ce4d2
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Feb 19 21:10:55 2017 -0500

    Cast dim_t and inc_t parameters to 64-bit in KNL microkernels.

commit c362afc525bab4050581d1b0fcea2fe4d582c608
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 9 11:54:59 2017 -0600

    Added missing "level-0" BLAS [sd]cabs1_().
    
    Details:
    - Fixed issue #115 by adding implementations for scabs1_() and dcabs1_()
      to the BLAS compatibility layer. Thanks to heroxbd for pointing out
      their absence.
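    
    For reference, the reference BLAS defines these routines as the sum of
    the absolute values of the real and imaginary parts. A minimal sketch,
    assuming a hypothetical struct type dcomplex_t and function name
    my_dcabs1 (neither is the BLIS compatibility-layer code itself):
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /* Hypothetical complex type with separate real and imaginary fields. */
    typedef struct { double real; double imag; } dcomplex_t;
    
    static double abs_d(double x) { return x < 0.0 ? -x : x; }
    
    /* Returns |Re(z)| + |Im(z)|. The pointer argument mirrors the Fortran
       pass-by-reference calling convention used by BLAS routines. */
    double my_dcabs1(const dcomplex_t *z)
    {
        assert(z != NULL);
        return abs_d(z->real) + abs_d(z->imag);
    }
    ```
    
    Note that this is not the complex modulus: for z = 3 - 4i it yields 7,
    not 5.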

commit 018180c938c32efbeaaf626ba71ec5b780664db1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 8 11:20:52 2017 -0600

    Fixed a minor bug in configure (issue #114).
    
    Details:
    - Fixed a bug in the configure script whereby a non-preferred value for
      --enable-threading would cause problems in common.mk vis-a-vis detecting
      which threading model was chosen. Thanks to heroxbd for reporting this
      issue.

commit ddf45e71770c55ea4a58ca24ea4913fe5d8beb9b
Merge: a6ab91bc 78e1b16e
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jan 27 14:25:40 2017 -0600

    Merge pull request #113 from devinamatthews/knl_thread_params
    
    Change default threading parameters for KNL.

commit 78e1b16e16d589ed31b2e712115ee282097f114d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jan 27 14:22:20 2017 -0600

    Change default threading parameters for KNL.

commit 1c732d3ddc4ac0861d3b0e0dd15eb7e071615502
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 25 16:25:46 2017 -0600

    Added 1m-specific APIs for bp, pb gemm algorithms.
    
    Details:
    - Defined bli_gemmbp_cntl_create(), bli_gemmpb_cntl_create(), with the
      body of bli_gemm_cntl_create() replaced with a call to the former.
    - Defined bli_cntl_free_w_thrinfo(), bli_cntl_free_wo_thrinfo(). Now,
      bli_cntl_free() can check if the thread parameter is NULL, and if so,
      call the latter, and otherwise call the former.
    - Defined bli_gemm1mbp_cntx_init(), bli_gemm1mpb_cntx_init(), both in
      terms of bli_gemm1mxx_cntx_init(), which behaves the same as
      bli_gemm1m_cntx_init() did before, except that an extra bool parameter
      (is_pb) is used to support both bp and pb algorithms (including to
      support the anti-preference field described below).
    - Added support for "anti-preference" in context. The anti_pref field,
      when true, will toggle the boolean return value of routines such as
      bli_cntx_l3_ukr_eff_prefers_storage_of(), which has the net effect of
      causing BLIS to transpose the operation to achieve disagreement (rather
      than agreement) between the storage of C and the micro-kernel output
      preference. This disagreement is needed for panel-block implementations,
      since they induce a transposition of the suboperation immediately before
      the macro-kernel is called, which changes the apparent storage of C. For
      now, anti-preference is used only with the pb algorithm for 1m (and not
      with any other non-1m implementation).
    - Defined new functions,
        bli_cntx_l3_ukr_eff_prefers_storage_of()
        bli_cntx_l3_ukr_eff_dislikes_storage_of()
        bli_cntx_l3_nat_ukr_eff_prefers_storage_of()
        bli_cntx_l3_nat_ukr_eff_dislikes_storage_of()
      which are identical to their non-"eff" (effectively) counterparts except
      that they take the anti-preference field of the context into account.
    - Explicitly initialize the anti-pref field to FALSE in
      bli_gks_cntx_set_l3_nat_ukr_prefs().
    - Added bli_gemm_ker_var1.c, which implements a panel-block macro-kernel
      in terms of the existing block-panel macro-kernel _ker_var2(). This
      technique requires inducing transposes on all operands and swapping
      the A and B operands.
    - Changed bli_obj_induce_trans() macro so that pack-related fields are
      also changed to reflect the induced transposition.
    - Added a temporary hack to bli_l3_3m4m1m_oapi.c that allows us to easily
      specify the 1m algorithm (block-panel or panel-block).
    - Renamed the following cntx_t-related macros:
        bli_cntx_get_pack_schema_a() -> bli_cntx_get_pack_schema_a_block()
        bli_cntx_get_pack_schema_b() -> bli_cntx_get_pack_schema_b_panel()
        bli_cntx_get_pack_schema_c() -> bli_cntx_get_pack_schema_c_panel()
      and updated all instantiations. Also updated the field names in the
      cntx_t struct.
    - Comment updates.
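    
    A minimal sketch of how an anti-preference flag can invert a storage
    preference query, as described above. The types and function names
    (toy_cntx_t, prefers_storage_of, eff_prefers_storage_of) are
    hypothetical simplifications, not the actual BLIS context API.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    
    typedef enum { STORE_ROWS, STORE_COLS } storage_t;
    
    typedef struct {
        storage_t ukr_pref;   /* storage the micro-kernel prefers */
        bool      anti_pref;  /* when true, invert the query result */
    } toy_cntx_t;
    
    /* Plain query: does the micro-kernel prefer the storage of C? */
    bool prefers_storage_of(const toy_cntx_t *cntx, storage_t c_storage)
    {
        assert(cntx != NULL);
        return cntx->ukr_pref == c_storage;
    }
    
    /* "Effective" query: same answer, toggled when anti_pref is set. */
    bool eff_prefers_storage_of(const toy_cntx_t *cntx, storage_t c_storage)
    {
        bool r = prefers_storage_of(cntx, c_storage);
        return cntx->anti_pref ? !r : r;
    }
    ```
    
    With anti_pref set, a caller that transposes the operation whenever
    the effective query returns false ends up with C stored opposite to
    the micro-kernel preference.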

commit a6ab91bc61432490fadf18d596de4589645f37dd
Merge: 145a551d 7f31a630
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 30 09:26:58 2016 -0600

    Merge pull request #111 from figual/master
    
    Fixed missing cntx argument in ARMv8 microkernels.

commit 7f31a6307b7bd35f913c895947552c3a176f789b
Author: Francisco Igual <figual@ucm.es>
Date:   Sun Nov 27 14:40:47 2016 +0100

    Fixed missing cntx argument in ARMv8 microkernels.

commit 126482a3b609b9ad7026ba348f6c4bf6a29be8a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 25 18:29:49 2016 -0600

    Implemented the 1m method.
    
    Details:
    - Implemented the 1m method for inducing complex domain matrix
      multiplication. 1m support has been added to all level-3 operations,
      including trsm, and is now the default induced method when native
      complex domain gemm microkernels are omitted from the configuration.
    - Updated _cntx_init() operations to take a datatype parameter. This was
      needed for the corresponding function for 1m (because 1m requires us
      to choose between column-oriented or row-oriented execution, which
      requires us to query the context for the storage preference of the
      gemm microkernel, which requires knowing the datatype) but I decided
      that it made sense for consistency to add the parameter to all other
      cntx initialization functions as well, even though those functions
      don't use the parameter.
    - Updated bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs() to take
      a second scalar for each blocksize entry. The semantic meaning of the
      two scalars now is that the first will scale the default blocksize
      while the second will scale the maximum blocksize. This allows scaling
      the two independently, and was needed to support 1m, which requires
      scaling for a register blocksize but not the register storage
      blocksize (ie: "packdim") analogue.
    - Deprecated bli_blksz_reduce_dt_to() and defined two new functions,
      bli_blksz_reduce_def_to() and bli_blksz_reduce_max_to(), for reducing
      default and maximum blocksizes to some desired blocksize multiple.
      These functions are needed in the updated definitions of
      bli_cntx_set_blkszs() and bli_gks_cntx_set_blkszs().
    - Added support for the 1e and 1r packing schemas to packm, including
      1e/1r packing kernels.
    - Added a minor optimization to bli_gemm_ker_var2() that allows, under
      certain circumstances (specifically, real domain beta and row- or
      column-stored matrix C), the real domain macrokernel and microkernel
      to be called directly, rather than using the virtual microkernel
      via the complex domain macrokernel, which carries a slight additional
      amount of overhead.
    - Added 1m support to the testsuite.
    - Added 1m support to Makefile and runme.sh in test/3m4m. Also simplified
      some code in test_gemm.c driver.
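    
    The general idea of inducing a complex product from real arithmetic
    can be sketched for a single scalar multiplication. The expanded 2x2
    layout below is illustrative only; it is an assumption and is not
    claimed to be the exact 1e/1r packing format used by packm. The
    function name cmul_via_real is hypothetical.
    
    ```c
    #include <assert.h>
    #include <stddef.h>
    
    /* Multiply a = (ar, ai) by b = (br, bi) using only real arithmetic on
       an expanded 2x2 form of a and a 2x1 form of b:
           [ ar  -ai ] [ br ]   [ ar*br - ai*bi ]
           [ ai   ar ] [ bi ] = [ ai*br + ar*bi ]
       out[0] receives the real part, out[1] the imaginary part. */
    void cmul_via_real(double ar, double ai, double br, double bi,
                       double out[2])
    {
        assert(out != NULL);
        const double a_exp[2][2] = { { ar, -ai }, { ai, ar } };
        const double b_vec[2]    = { br, bi };
        for (size_t i = 0; i < 2; ++i)
            out[i] = a_exp[i][0] * b_vec[0] + a_exp[i][1] * b_vec[1];
    }
    ```
    
    Because the right-hand side is an ordinary real matrix-vector product,
    the same arithmetic can in principle be carried out by a real domain
    gemm microkernel.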

commit 145a551d524ae5492667a05fc248923d922df850
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 23 17:59:06 2016 -0600

    Switched to simpler trsm_r implementation.
    
    Details:
    - Disabled the implementation of trsm_r that allows the right-hand matrix
      B to be triangular, and switched to the implementation that simply
      transposes the operation (and thus the storage of C) in order to recast
      the operation as trsm_l. This avoids the need to use trsm_rl and trsm_ru
      macrokernels, which require an awkward swapping of MR and NR. For now,
      the support for trsm_r macrokernels, via separate control trees, remains.
    - Modified bli_config_macro_defs.h so that BLIS_RELAX_MCNR_NCMR_CONSTRAINTS
      is defined by default. This is mostly a safety precaution in case someone
      tries to switch back to the previous trsm_r implementation, but also
      serves as a convenience on some systems where one does not naturally
      choose blocksizes in a way that satisfies MC % NR = 0 and NC % MR = 0.

commit b3e58ee30307cf1e11529f2113acb9abbeda25af
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 23 17:58:26 2016 -0600

    Reimplemented 4x12 haswell ukernels (real only).
    
    Details:
    - Replaced permutation-based implementations in bli_gemm_asm_d4x12.c, which
      defines 4x24 single real and 4x12 double real gemm microkernels, with
      broadcast-based implementations. (The previous microkernel file has been
      moved to an 'old' subdirectory.)

commit bdc0a264d2fb5940bfd09298b1de823674a39053
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 16 14:13:08 2016 -0600

    Adjusted stride selection of ct in macrokernels.
    
    Details:
    - Updated the changes introduced in 618f433 so that the strides of the
      temporary microtile ct used in the macrokernels are determined based
      on the storage preference of the microkernel (via the new functions
      below), rather than the strides of c. In almost all cases, presently,
      this change results in no net effect, as a high-level optimization
      in the _front() functions aligns the storage of c to that of the
      microkernel's preference. However, I encountered some cases where
      this is not always the case in some development code that has yet
      to be committed, and therefore I'm generalizing the framework code
      in advance.
    - Defined two new functions in bli_cntx.c:
        bli_cntx_l3_ukr_prefers_rows_dt()
        bli_cntx_l3_ukr_prefers_cols_dt()
      which return bool_t's based on the current micro-kernel's storage
      preferences. For induced methods, the preference of the underlying
      real domain microkernel is returned.
    - Updated definition of bli_cntx_l3_ukr_dislikes_storage_of(), and
      by proxy bli_cntx_l3_ukr_prefers_storage_of(), to be in terms of
      the above functions, rather than querying the preferences of the
      native microkernel directly (which did the wrong thing for induced
      methods).
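    
    A minimal sketch of choosing the strides of ct from the microkernel
    preference rather than from the strides of c. The function name
    choose_ct_strides and its parameters are hypothetical; only the
    column-stored default (rs_ct = 1, cs_ct = MR) comes from the earlier
    commit 618f433.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    
    /* Pick row and column strides for an mr x nr temporary microtile.
       Row preference:    contiguous rows,    rs = nr, cs = 1.
       Column preference: contiguous columns, rs = 1,  cs = mr. */
    void choose_ct_strides(bool ukr_prefers_rows, size_t mr, size_t nr,
                           size_t *rs_ct, size_t *cs_ct)
    {
        assert(rs_ct != NULL && cs_ct != NULL);
        if (ukr_prefers_rows) {
            *rs_ct = nr;
            *cs_ct = 1;
        } else {
            *rs_ct = 1;
            *cs_ct = mr;
        }
    }
    ```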

commit 031978d2647cf08316858baf29c84ebba9c3133e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 16 14:04:33 2016 -0600

    Fixed inactive trsm_r blocksize constraint code.
    
    Details:
    - Changed a cpp macro that was meant to prevent using certain trsm_r code
      if BLIS_RELAX_MCNR_NCMR_CONSTRAINTS was defined. It was actually coded
      incorrectly at first. I've now fixed its location and changed its
      consequence to a compile-time #error message.

commit 6b5a4032d2e3ed29a272c7f738b7e3ed6657e556
Merge: 3b524a08 a8220e3a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 10 15:28:24 2016 -0600

    Merge pull request #109 from devinamatthews/omp_num_threads
    
    Add automatic loop thread assignment.

commit a8220e3a86433b5d76789e32ea7ca014a11b6d17
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Nov 10 14:19:34 2016 -0600

    - Fix typo in bli_cntx.c
    - Bump BLIS_DEFAULT_NR_THREAD_MAX to 4

commit c05b3862f6241486442b313eff0c8bee7b5e1274
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Nov 4 15:48:02 2016 -0500

    Add automatic loop thread assignment.
    
    - Number of threads is determined by BLIS_NUM_THREADS or OMP_NUM_THREADS, but can be overridden by BLIS_XX_NT as before.
    - Threads are assigned to loops (ic, jc, ir, and jr) automatically by weighted partitioning and heuristics, both of which are tunable via bli_kernel.h.
    - All level-3 BLAS covered.

commit 3b524a08e3fb8380e7b8b2ba835312c51a331570
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 2 17:45:18 2016 -0500

    Consolidated 3m1/4m1 gemmtrsm, trsm ukernel code.
    
    Details:
    - Consolidated the macros that define the lower and upper versions of the
      gemmtrsm microkernels into a single macro that is instantiated twice.
      Did this for both 3m1 and 4m1 microkernels.
    - Consolidated lower and upper versions of the trsm microkernels for 3m1
      and 4m1 into single files (each).

commit ead231aca635deb3db270f118454e4222c627f31
Merge: d25e6f8b 62987f60
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 2 13:03:50 2016 -0500

    Merge pull request #108 from devinamatthews/patch-2
    
    Update .travis.yml with additional tests

commit 62987f60a6a6ff0a75b31d0404f493593ce35ccc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Nov 2 11:20:37 2016 -0500

    Allow KNL to fail

commit 8f9010542c751ae3cbfe6121cb011d8985c1e00d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Nov 2 11:18:32 2016 -0500

    Fix some problems with OSX builds:
    
    - Update CPU detection for Intel archs (esp. Skylake)
    - Allow clang for the reference config

commit d25e6f8b63c57f30b8a67dffbf4995977cf9f235
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 1 14:35:15 2016 -0500

    Can disable trsm_r-specific blocksize constraints.
    
    Details:
    - Added cpp guards around the constraints in bli_kernel_macro_defs.h
      that enforce MC % NR = 0 and NC % MR = 0. These constraints are ONLY
      needed when handling right-side trsm by allowing the matrix on the
      right (matrix B) to be triangular, because it involves swapping
      register, but not cache, blocksizes (packing A by NR and B by MR)
      and then swapping the operands to gemmtrsm just before that kernel
      is called. It may be useful to disable these constraints if, for
      example, the developer wishes to test the configuration with
      a different set of cache blocksizes where only MC % MR = 0 and
      NC % NR = 0 are enforced.
    - In summary, #defining BLIS_RELAX_MCNR_NCMR_CONSTRAINTS will bypass
      the enforcement of MC % NR = 0 and NC % MR = 0.
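    
    The two sets of constraints can be sketched as a runtime check. The
    function blocksizes_ok and its relax parameter are hypothetical; in
    the library the checks are cpp guards, and relax here only stands in
    for BLIS_RELAX_MCNR_NCMR_CONSTRAINTS being defined.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    
    /* Returns true if the cache blocksizes (mc, nc) are acceptable for the
       register blocksizes (mr, nr). */
    bool blocksizes_ok(int mc, int nc, int mr, int nr, bool relax)
    {
        assert(mr > 0 && nr > 0);
        /* Always enforced: MC % MR = 0 and NC % NR = 0. */
        if (mc % mr != 0 || nc % nr != 0) return false;
        /* Skipped when the trsm_r-specific constraints are relaxed. */
        if (relax) return true;
        /* trsm_r-specific: MC % NR = 0 and NC % MR = 0. */
        return mc % nr == 0 && nc % mr == 0;
    }
    ```
    
    For example, with MR = 6 and NR = 8, the pair MC = 96, NC = 4096
    passes only when the constraints are relaxed, since 4096 % 6 != 0.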

commit 1a67e3688edb073a9d44c160e7b0798e08796b8a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 13:53:18 2016 -0500

    Bogus commit
    
    Need to trigger another Travis build.

commit 2cd82d67b372cad1bed50cfd99e524f1f40b4e24
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 13:25:50 2016 -0500

    Some fixes for .travis.yml
    
    - Switch to gcc-5 to support knl
    - Don't run tests in parallel -- it is super slow.
    - Use clang on OSX since gcc is only a zombie husk.

commit a3db4e6bdfe745083acf704ab0f51f74ea869538
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Nov 1 10:33:18 2016 -0500

    Update .travis.yml with additional tests
    
    - Test knl configuration (without running of course).
    - Test openmp and pthreads threading for auto configuration with 4 threads.
    - Test auto configuration with and without pthreads on OSX.
    - Also, run make in parallel.
    
    I don't know how the `addons:` section works on OSX; hopefully it is just ignored.

commit 8a11a2174a1a5b9426f13bbc5338dc86ab138cdd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 31 19:07:55 2016 -0500

    Updates to non-default haswell microkernels.
    
    Details:
    - Updated s and d microkernels in bli_gemm_asm_d8x6.c to relax alignment
      constraints.
    - Added missing c and z microkernels, which are based on the corresponding
      kernels in the d6x8 set.
    - This completes the d8x6 set (which may be used for situations when it
      is desirable to have a microkernel with a column preference).

commit 618f4331eba209803ecab99747872eceb1b5f091
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 31 14:40:51 2016 -0500

    Align strides of ct in macrokernels to that of c.
    
    Details:
    - Previously, rs_ct and cs_ct, the strides of the temporary microtile used
      primarily in the macrokernels' edge case handling, were unconditionally
      set to 1 and MR, respectively. However, Devin Matthews noted that this
      ought to be changed so that the strides of ct were in agreement with the
      strides of C. (That is, if C was row-stored, then ct should be accessed
      by rows as well.) The implicit assumption is that the strides of C
      have already been adjusted, via induced transposition, if the storage
      preference of the microkernel is at odds with the storage of C. So, if
      the microkernel prefers row storage, the macrokernel's interior cases
      would present row-stored (ideal) microkernel subproblems to the
      microkernel, but for edge cases, it would still see column-stored
      subproblems (not ideal). This commit fixes this issue. Thanks to Devin
      for his suggestion.

commit 630391002325a589063aec2ab0a7d89ef2e178c0
Merge: 956b3edf 216206c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 25 19:34:51 2016 -0500

    Merge pull request #105 from devinamatthews/knl
    
    Support for Intel Knight's Landing.

commit 216206c1d328a865c2192e35a4df6e9aff79a85b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:56:18 2016 -0500

    Fix up for merge to master.

commit 11eb7957abbcdf02d5e312898e094260eadb1209
Merge: cd5b6681 956b3edf
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:51:07 2016 -0500

    Merge branch 'master' into knl
    
    # Conflicts:
    #       frame/thread/bli_thread.h

commit cd5b6681838899283cd94e5427dfda206e7fbabe
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 13:49:27 2016 -0500

    Don't use %rbp in KNL packing kernels.

commit 956b3edf8eb09480f31f2e861c1b10f9ecbb2e52
Merge: b7e41d71 0662a3c1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 25 13:02:57 2016 -0500

    Merge pull request #104 from devinamatthews/misspellings
    
    Add flexible options for thread model (pthread/posix for pthreads etc.).

commit 0662a3c1b1f4644a86bf8e5073d1391808c91b4a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Oct 25 12:42:44 2016 -0500

    Add flexible options for thread model (pthread/posix for pthreads etc.).

commit b7e41d71b07d2af6d22d632c70e0c5f7ce46852c
Merge: 4bd905bd 5117d444
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 24 16:47:46 2016 -0500

    Merge pull request #103 from devinamatthews/patch-1
    
    Change .align to .p2align in Bulldozer ukernels.

commit 5117d444f7f3a2bc327f067926eaf2398212edda
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Oct 24 16:20:47 2016 -0500

    Change .align to .p2align in Bulldozer ukernels
    
    Apparently OSX doesn't allow .align directives for >16B, so I've changed these to their .p2align counterparts.

commit 4bd905bd4597e0ad7bedf31e25e779d3e2dfda29
Merge: 936d5fdc 7f32dd57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 21 14:48:44 2016 -0500

    Merge pull request #93 from ShadenSmith/config_check
    
    Adds sanity check to configuration choice.

commit 936d5fdc26c6c4dab199a8d11fde948975cfa1d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 21 14:34:27 2016 -0500

    Fixed multithreading compilation bug in 970745a.
    
    Details:
    - Moved the definition of the cpp macro BLIS_ENABLE_MULTITHREADING
      from bli_thread.h to bli_config_macro_defs.h. Also moved the
      sanity check that OpenMP and POSIX threads are not both enabled.
    - Thanks to Krzysztof Drewniak for reporting this bug.

commit 8feb0f85a674e84bec2417486e3bcea584b14c04
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 19 16:05:41 2016 -0500

    Removed auto-prototyping of malloc()/free() substitutes.
    
    Details:
    - Removed the header file, bli_malloc_prototypes.h, which automatically
      generated prototypes for the functions specified by the following
      cpp macros:
        BLIS_MALLOC_INTL
        BLIS_FREE_INTL
        BLIS_MALLOC_POOL
        BLIS_FREE_POOL
        BLIS_MALLOC_USER
        BLIS_FREE_USER
      These prototypes were originally provided primarily as a convenience
      to those developers who specified their own malloc()/free() substitutes
      for one or more of these macros. However, we generated these prototypes
      regardless, even when the default values (malloc and free) of the
      macros above were used. A problem arose under certain circumstances
      (e.g., gcc in C++ mode on Linux with glibc) when including blis.h that
      stemmed from the "throw" specification which was added to glibc's
      malloc() prototype, resulting in a prototype mismatch. Therefore, going
      forward, developers who specify their own custom malloc()/free()
      substitutes must also prototype those substitutes via bli_kernel.h.
      Thanks to Krzysztof Drewniak for reporting this bug, and Devin Matthews
      for researching the nature and potential solutions.

commit 970745a5fc7c29de3e202988e5eb104fabca4fdc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 19 15:58:03 2016 -0500

    Reorganized typedefs to avoid compiler warnings.
    
    Details:
    - Relocated membrk_t definition from bli_membrk.h to bli_type_defs.h.
    - Moved #include of bli_malloc.h from blis.h to bli_type_defs.h.
    - Removed standalone mtx_t and mutex_t typedefs in bli_type_defs.h.
    - Moved #include of bli_mutex.h from bli_thread.h to bli_type_defs.h.
    - The redundant typedefs of membrk_t and mtx_t caused a warning on some C
      compilers. Thanks to Tyler Smith for reporting this issue.

commit 28b2af8a71133ce68774e153b6e05afb05affba8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 13 14:50:08 2016 -0500

    Added disabled code to print thrinfo_t structures.
    
    Details:
    - Added cpp-guarded code to bli_thrcomm_openmp.c that allows a curious
      developer to print the contents of the thrinfo_t structures of each
      thread, for verification purposes or just to study the way thread
      information and communicators are used in BLIS.
    - Enabled some previously-disabled code in bli_l3_thrinfo.c for freeing
      an array of thrinfo_t* values that is used in the new, cpp-guarded code
      mentioned above.
    - Removed some old commented lines from bli_gemm_front.c.

commit 11eed3f683d09e65f721567b346b0f733bff9a64
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 13 14:23:23 2016 -0500

    Fixed a configure -t omp/openmp bug from fd04869.
    
    Details:
    - Forgot to update certain occurrences of "omp" in common.mk during
      commit fd04869, which changed the preferred configure option string
      for enabling OpenMP from "omp" to "openmp".

commit 9cda6057eaa16a24ac8785a9fa167df6c9edba44
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 11 13:21:26 2016 -0500

    Removed previously renamed/old files.
    
    Details:
    - Removed frame/base/bli_mem.c and frame/include/bli_auxinfo_macro_defs.h,
      both of which were renamed/removed in 701b9aa. For some reason, these
      files survived when the compose branch was merged back into master.
      (Clearly, git's merging algorithm is not perfect.)
    - Removed frame/base/bli_mem.c.prev (an artifact of the long-ago changed
      memory allocator that I was keeping around for no particular reason).

commit 22377abd84b9e560ffe1c4e4d284eb443ddb7133
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 10 13:43:56 2016 -0500

    Fixed bli_gemm() segfault on empty C matrices.
    
    Details:
    - Fixed a bug that would manifest in the form of a segmentation fault
      in bli_cntl_free() when calling any level-3 operation on an empty
      output matrix (ie: m = n = 0). Specifically, the code previously
      assumed that the entire control tree was built prior to it being
      freed. However, if the level-3 operation performs an early exit, the
      control tree will be incomplete, and this scenario is now handled.
      Thanks to Elmar Peise for reporting this bug.
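    
    A minimal sketch of freeing a tree (here simplified to a chain) that
    may have been only partially built. The node_t type and the functions
    node_create and chain_free are hypothetical and much simpler than the
    actual control tree code.
    
    ```c
    #include <assert.h>
    #include <stdlib.h>
    
    typedef struct node_s { struct node_s *sub_node; } node_t;
    
    /* Allocate a node above an existing (possibly NULL) sub-chain. */
    node_t *node_create(node_t *sub_node)
    {
        node_t *n = malloc(sizeof *n);
        if (n != NULL) n->sub_node = sub_node;
        return n;
    }
    
    /* Free a chain of any depth, including an empty one, without assuming
       it was fully built. Returns the number of nodes freed. */
    int chain_free(node_t **root_p)
    {
        assert(root_p != NULL);
        int freed = 0;
        node_t *cur = *root_p;
        while (cur != NULL) {
            node_t *next = cur->sub_node;
            free(cur);
            cur = next;
            ++freed;
        }
        *root_p = NULL;
        return freed;
    }
    ```
    
    The loop checks each link before following it, so a chain that stops
    early because of an early exit is handled the same way as a complete
    one.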

commit 0b571cd94d9b175331c9453258a6b1389a718ae8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 6 14:48:15 2016 -0500

    Fixed segfault in bli_free_align() for NULL ptrs.
    
    Details:
    - Fixed a bug in bli_free_align() caused by failing to handle NULL pointers
      up-front, which led to performing pointer arithmetic on NULL pointers in
      order to free the address immediately before the pointer. Thanks to Devin
      Matthews for reporting this bug.
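    
    A minimal sketch of an aligned allocator that saves the original
    pointer immediately before the aligned address, with a free routine
    that handles NULL up front. The names toy_malloc_align and
    toy_free_align are hypothetical; this illustrates the technique and is
    not the bli_free_align() implementation.
    
    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>
    
    /* Returns a pointer aligned to 'align', which must be a power of two
       no smaller than sizeof(void *). */
    void *toy_malloc_align(size_t size, size_t align)
    {
        assert(align >= sizeof(void *) && (align & (align - 1)) == 0);
        /* Room for the payload, worst-case padding, and the saved pointer. */
        void *raw = malloc(size + align + sizeof(void *));
        if (raw == NULL) return NULL;
        uintptr_t p = (uintptr_t)raw + sizeof(void *);
        p = (p + align - 1) & ~(uintptr_t)(align - 1);
        ((void **)p)[-1] = raw;  /* save the address malloc() returned */
        return (void *)p;
    }
    
    void toy_free_align(void *p)
    {
        /* Check for NULL before reading the slot that precedes p. */
        if (p == NULL) return;
        free(((void **)p)[-1]);
    }
    ```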

commit 4fb9b4ef2e4cf2626a6e000a41628fb823f16da8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 14:41:35 2016 -0500

    CHANGELOG update (0.2.1)

commit 866b2dde3f41760121115fb25f096d4344e8b4f9 (tag: 0.2.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 14:41:34 2016 -0500

    Version file update (0.2.1)

commit 87fddeab3c8a5ccb1bbf02e5f89db1464e459ba9
Merge: 86969873 6f71cd34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 5 13:35:01 2016 -0500

    Merge branch 'compose'

commit 6f71cd344951854e4cff9ea21bbdfe536e72611d (origin/compose, compose)
Merge: c0630c40 8d55033c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 4 15:53:46 2016 -0500

    Merge pull request #94 from flame/distcomm
    
    Implemented distributed thrinfo_t management.

commit 86969873b5b861966d717d8f9f370af39e3d9de6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 4 14:24:59 2016 -0500

    Reclassified amaxv operation as a level-1v kernel.
    
    Details:
    - Moved amaxv from being a utility operation to being a level-1v operation.
      This includes the establishment of a new amaxv kernel to live beside all
      of the other level-1v kernels.
    - Added two new functions to bli_part.c:
        bli_acquire_mij()
        bli_acquire_vi()
      The first acquires a scalar object for the (i,j) element of a matrix,
      and the second acquires a scalar object for the ith element of a vector.
    - Added integer support to bli_getsc level-0 operation. This involved
      adding integer support to the bli_*gets level-0 scalar macros.
    - Added a new test module to test amaxv as a level-1v operation. The test
      module works by comparing the value identified by bli_amaxv() to the
      value found by reference-like code local to the test module
      source file. In other words, it (intentionally) does not guarantee the
      same index is found; only the same value. This allows for different
      implementations in the case where a vector contains two or more elements
      containing exactly the same floating point value (or values, in the case
      of the complex domain).
    - Removed the directory frame/include/old/.
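    
    A minimal sketch of checking an amaxv result by value rather than by
    index, for the real domain only. The names ref_amaxv and
    amaxv_index_ok are hypothetical, and comparing absolute values (rather
    than the raw element values) is an assumption of this sketch.
    
    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>
    
    static double abs_d(double x) { return x < 0.0 ? -x : x; }
    
    /* Reference-like search: index of the first element of largest
       absolute value. */
    size_t ref_amaxv(const double *x, size_t n)
    {
        assert(x != NULL && n > 0);
        size_t best = 0;
        for (size_t i = 1; i < n; ++i)
            if (abs_d(x[i]) > abs_d(x[best])) best = i;
        return best;
    }
    
    /* Accept any index whose absolute value equals the reference maximum,
       so ties may be resolved differently by different implementations. */
    bool amaxv_index_ok(const double *x, size_t n, size_t idx_under_test)
    {
        assert(idx_under_test < n);
        return abs_d(x[idx_under_test]) == abs_d(x[ref_amaxv(x, n)]);
    }
    ```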

commit 8d55033c966feed99fcca2a58017c3ab5b1646dc (origin/distcomm)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 15:20:58 2016 -0500

    Implemented distributed thrinfo_t management.
    
    Details:
    - Implemented Ricardo Magana's distributed thread info/communicator
      management. Rather than fully construct the thrinfo_t structures, from
      root to leaf, prior to spawning threads, the threads individually
      construct their thrinfo_t trees (or, chains), and do so incrementally,
      as needed, reusing the same structure nodes during subsequent blocked
      variant iterations. This required moving the initial creation of the
      thrinfo_t structure (now, the root nodes) from the _front() functions
      to the bli_l3_thread_decorator(). The incremental "growing" of the tree
      is performed in the internal back-end (ie: _int()) function, and so is
      mostly invisible. Also, the incremental growth of the thrinfo_t tree is
      done as a function of the current and parent control tree nodes (as well
      as the parent thrinfo_t node), further reinforcing the parallel
      relationship between the two data structures.
    - Removed the "inner" communicator from thrinfo_t structure definition,
      as well as its id. Changed all APIs accordingly. Renamed
      bli_thrinfo_needs_free_comms() to bli_thrinfo_needs_free_comm().
    - Defined bli_l3_thrinfo_print_paths(), which prints the information
      in an array of thrinfo_t* structure pointers. (Used only as a
      debugging/verification tool.)
    - Deprecated the following thrinfo_t creation functions:
        bli_packm_thrinfo_create()
        bli_l3_thrinfo_create()
      because they are no longer used. bli_thrinfo_create() is now called
      directly when creating thrinfo_t nodes.
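
    A rough sketch of the "grow on demand, then reuse" idea described above.
    The tnode_t type, its fields, and tnode_sub() are hypothetical stand-ins
    rather than the actual thrinfo_t definition or API; each thread is
    assumed to own its own chain, so no locking is shown.

```c
#include <stdlib.h>

/* Hypothetical per-thread chain node. */
typedef struct tnode_s
{
    int             depth;     /* distance from the root node */
    struct tnode_s *sub_node;  /* created lazily; NULL until first needed */
} tnode_t;

/* Counts allocations so that reuse can be observed. */
static int nodes_created = 0;

/* Return the sub-node of parent, creating it the first time it is
   requested and reusing it on every later request. */
static tnode_t *tnode_sub(tnode_t *parent)
{
    if (parent->sub_node == NULL)
    {
        tnode_t *child = malloc(sizeof(*child));
        if (child == NULL)
            return NULL;
        child->depth     = parent->depth + 1;
        child->sub_node  = NULL;
        parent->sub_node = child;
        ++nodes_created;
    }
    return parent->sub_node;
}
```

    Calling tnode_sub() on the same parent in every loop iteration allocates
    only once, which mirrors the reuse of structure nodes across blocked
    variant iterations.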

commit fd04869ae4d4a3b0ebb9052557c296456bce7c0d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 14:14:11 2016 -0500

    Changed configure's 'omp' threading to 'openmp'.
    
    Details:
    - Changed the configure script so that the expected string argument to the
      -t (or --enable-threading=) option that enables OpenMP multithreading is
      'openmp'. The previous expected string, 'omp', is still supported but
      should be considered deprecated.

commit 9424af87209e4e435e2e742430945152690170b0
Merge: efa7341d c0630c40
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 27 12:51:08 2016 -0500

    Merge branch 'compose'

commit 7f32dd57c6bd41c0704341752842277dd6a4c8eb
Author: Shaden Smith <shaden@cs.umn.edu>
Date:   Sat Sep 17 11:33:57 2016 -0500

    Adds sanity check to configuration choice.

commit efa7341df0b0115926aa8a6e8a4ebfb24fdbf11e
Merge: 121c39d4 e1453f68
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 16 11:01:57 2016 -0500

    Merge pull request #92 from ShadenSmith/readme_fix
    
    Fixes broken URL in README.md

commit e1453f68f6afd90ae9a29b7a5faa46aa79bbf741
Author: Shaden Smith <ShadenTSmith@gmail.com>
Date:   Fri Sep 16 09:29:28 2016 -0500

    Fixes broken URL in README.md

commit c0630c4024b08750043a2942a3e8a037aa6b6259
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 12 13:59:02 2016 -0500

    Added debugging printf()'s to bli_l3_thrinfo.c.
    
    Details:
    - Added optional printf() statements to print out thread communicator
      info as the thrinfo_t structure is built in bli_l3_thrinfo.c.
    - Minor changes to frame/thread/bli_thrinfo.h.

commit 7b3bf1ffcd7160ccbf6c2518af6d88f6742e4977
Merge: 35509818 121c39d4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 6 15:47:13 2016 -0500

    Merge branch 'master' into compose

commit 121c39d455f2db6f7ce6802ba7f73ad5e088c68c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 5 13:11:42 2016 -0500

    Added complex gemm micro-kernels for haswell.
    
    Details:
    - Defined cgemm (3x8) and zgemm (3x4) micro-kernels for haswell-based
      architectures. As with their real domain brethren, these kernels prefer
      row storage (though this doesn't affect most users due to high-level
      optimizations in most level-3 operations that induce a transpose to
      whatever storage preference the kernel may have).

commit 35509818cbea1598b123421f81c42120889a03c3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 31 17:34:15 2016 -0500

    Added, moved some thread barriers.
    
    Details:
    - Removed thread barriers from the end of the loop bodies of
      bli_gemm_blk_var1(), bli_gemm_blk_var2(), bli_trsm_blk_var1(),
      and bli_trsm_blk_var2().
    - Moved the thread barrier at the end of bli_packm_int() to the
      end of bli_l3_packm(), and added missing barriers to that function.
    - Removed the no longer necessary (and now incorrect) ochief guard
      in bli_gemm3m3_packa() on the bli_obj_scalar_reset() on C.
    - Thanks to Tyler Smith for help with these changes.

commit abd61f9fa75d77a96d1491b3e035451ee73238fe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 30 12:34:19 2016 -0500

    Updated BLIS4 TOMS citation in README.md.

commit 701b9aa3ff028decbf90efac0dca5bd64fe26269
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 26 19:04:45 2016 -0500

    Redesigned control tree infrastructure.
    
    Details:
    - Altered control tree node struct definitions so that all nodes have the
      same struct definition, whose primary fields consist of a blocksize id,
      a variant function pointer, a pointer to an optional parameter struct,
      and a pointer to a (single) sub-node. This unified control tree type is
      now named cntl_t.
    - Changed the way control tree nodes are connected, and what computation
      they represent, such that, for example, packing operations are now
      associated with nodes that are "inline" in the tree, rather than
      off-shoot branches. The original tree for the classic Goto gemm algorithm
      was expressed (roughly) as:
    
        blk_var2 -> blk_var3 -> blk_var1 -> ker_var2
                             |           |
                             -> packb    -> packa
    
      and now, the same tree would look like:
    
        blk_var2 -> blk_var3 -> packb -> blk_var1 -> packa -> ker_var2
    
      Specifically, the packb and packa nodes perform their respective packing
      operations and then recurse (without any loop) to a subproblem. This means
      there are now two kinds of level-3 control tree nodes: partitioning and
      non-partitioning. The blocked variants are members of the former, because
      they iteratively partition off submatrices and perform suboperations on
      those partitions, while the packing variants belong to the latter group.
      (This change has the effect of allowing greatly simplified initialization
      of the nodes, which previously involved setting many unused node fields to
      NULL.)
    - Changed the way thrinfo_t tree nodes are arranged to mirror the new
      connective structure of control trees. That is, packm nodes are no longer
      off-shoot branches of the main algorithmic nodes, but rather connected
      "inline".
    - Simplified control tree creation functions. Partitioning nodes are created
      concisely with just a few fields needing initialization. By contrast, the
      packing nodes require additional parameters, which are stored in a
      packm-specific struct that is tracked via the optional parameters pointer
      within the control tree struct. (This parameter struct must always begin
      with a uint64_t that contains the byte size of the struct. This allows
      us to use a generic function to recursively copy control trees.) gemm,
      herk, and trmm control tree creation continues to be consolidated into
      a single function, with the operation family being used to select
      among the parameter-agnostic macro-kernel wrappers. A single routine,
      bli_cntl_free(), is provided to free control trees recursively, whereby
      the chief thread within a group releases the blocks associated with
      mem_t entries back to the memory broker from which they were acquired.
    - Updated internal back-ends, e.g. bli_gemm_int(), to query and call the
      function pointer stored in the current control tree node (rather than
      index into a local function pointer array). Before being invoked, these
      function pointers are first cast to a gemm_voft (for gemm, herk, or trmm
      families) or trsm_voft (for trsm family) type, which is defined in
      frame/3/bli_l3_var_oft.h.
    - Retired herk and trmm internal back-ends, since all execution now flows
      through gemm or trsm blocked variants.
    - Merged forwards- and backwards-moving variants by querying the direction
      from routines as a function of the variant's matrix operands. gemm and
      herk always move forward, while trmm and trsm move in a direction that
      is dependent on which operand (a or b) is triangular.
    - Added functions bli_thread_get_range_mdim(), bli_thread_get_range_ndim(),
      each of which takes additional arguments and hides complexity in managing
      the difference between the way ranges are computed for the four families
      of operations.
    - Simplified level-3 blocked variants according to the above changes, so that
      the only steps taken are:
      1. Query partitioning direction (forwards or backwards).
      2. Prune unreferenced regions, if they exist.
      3. Determine the thread partitioning sub-ranges.
      <begin loop>
        4. Determine the partitioning blocksize (passing in the partitioning
           direction)
        5. Acquire the current iteration's partitions for the matrices affected
           by the current variant's partitioning dimension (m, k, n).
        6. Call the subproblem.
      <end loop>
    - Instantiate control trees once per thread, per operation invocation.
      (This is a change from the previous regime in which control trees were
      treated as stateless objects, initialized with the library, and shared
      as read-only objects between threads.) This once-per-thread allocation
      is done primarily to allow threads to use the control tree as a place
      to cache certain data for use in subsequent loop iterations. Presently,
      the only application of this caching is a mem_t entry for the packing
      blocks checked out from the memory broker (allocator). If a non-NULL
      control tree is passed in by the (expert) user, then the tree is copied
      by each thread. This is done in bli_l3_thread_decorator(), in
      bli_thrcomm_*.c.
    - Added a new field to the context, an opid_t that tracks the "family"
      of the operation being executed. For example, gemm, hemm, and symm are
      all part of the gemm family, while herk, syrk, her2k, and syr2k are
      all part of the herk family. Knowing the operation's family is necessary
      when conditionally executing the internal (beta) scalar reset on
      C in blocked variant 3, which is needed for gemm and herk families,
      but must not be performed for the trmm family (because beta has only
      been applied to the current row-panel of C after the first rank-kc
      iteration).
    - Reexpressed 3m3 induced method blocked variant in frame/3/gemm/ind
      to conform with the new control tree design, and renamed the macro-
      kernel codes corresponding to 3m2 and 4m1b.
    - Renamed bli_mem.c (and its APIs) to bli_memsys.c, and renamed/relocated
      bli_mem_macro_defs.h from frame/include to frame/base/bli_mem.h.
    - Renamed/relocated bli_auxinfo_macro_defs.h from frame/include to
      frame/base/bli_auxinfo.h.
    - Fixed a minor bug whereby the storage-to-ukr-preference matching
      optimization in the various level-3 front-ends was not being applied
      properly when the context indicated that execution would be via an
      induced method. (Before, we always checked the native micro-kernel
      corresponding to the datatype being executed, whereas now we check
      the native micro-kernel corresponding to the datatype's real projection,
      since that is the micro-kernel that is actually used by induced methods.)
    - Added an option to the testsuite to skip the testing of native level-3
      complex implementations. Previously, it was always tested, provided that
      the c/z datatypes were enabled. However, some configurations use
      reference micro-kernels for complex datatypes, and testing these
      implementations can slow down the testsuite considerably.
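
    The unified node layout and the size-prefixed parameter struct described
    above can be illustrated with a small sketch. All names here (node_t,
    pack_params_t, node_copy, node_free) and the pack_schema field are
    hypothetical; this is not the actual cntl_t definition, only the shape
    implied by the description: a blocksize id, a variant function pointer,
    an optional parameter struct, and a single sub-node.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef void (*var_fp)(void);   /* stand-in for a variant function pointer */

/* Hypothetical unified tree node. */
typedef struct node_s
{
    int            bszid;     /* blocksize id */
    var_fp         var_func;  /* variant function */
    void          *params;    /* optional; begins with a uint64_t byte size */
    struct node_s *sub_node;  /* single sub-node, or NULL at the leaf */
} node_t;

/* Hypothetical parameter struct for a packing node. */
typedef struct
{
    uint64_t size;            /* must come first: byte size of this struct */
    int      pack_schema;     /* illustrative packing parameter */
} pack_params_t;

/* Free a chain of nodes along with any parameter structs they own. */
static void node_free(node_t *n)
{
    while (n != NULL)
    {
        node_t *next = n->sub_node;
        free(n->params);
        free(n);
        n = next;
    }
}

/* Recursively copy a chain. Because every parameter struct starts with
   its own byte size, the copy needs no knowledge of the struct's type. */
static node_t *node_copy(const node_t *src)
{
    if (src == NULL)
        return NULL;

    node_t *dst = malloc(sizeof(*dst));
    if (dst == NULL)
        return NULL;
    *dst = *src;
    dst->params   = NULL;
    dst->sub_node = NULL;

    if (src->params != NULL)
    {
        uint64_t size;
        memcpy(&size, src->params, sizeof(size));
        dst->params = malloc((size_t)size);
        if (dst->params == NULL) { node_free(dst); return NULL; }
        memcpy(dst->params, src->params, (size_t)size);
    }

    if (src->sub_node != NULL)
    {
        dst->sub_node = node_copy(src->sub_node);
        if (dst->sub_node == NULL) { node_free(dst); return NULL; }
    }
    return dst;
}
```

    The same generic copy is what makes a per-thread instantiation cheap to
    express: each thread can duplicate a caller-supplied chain without any
    operation-specific code.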

commit 73517f522b69de429dd7f3df60a70c068149ab28
Merge: c6f5c215 50293da3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 23 13:46:59 2016 -0500

    Merge branch 'master' into compose

commit 50293da38d5f2b7be9bbc94b9e85aacb6a10f672
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 23 13:38:36 2016 -0500

    Avoid compiling BLAS/CBLAS files when disabled.
    
    Details:
    - Updated the top-level Makefile, build/config.mk.in template, and
      configure script so that object files corresponding to source files
      belonging to the BLAS compatibility layer are not compiled (or archived)
      when the compatibility layer is disabled. (Same for CBLAS.) Thanks
      to Devin Matthews for suggesting this optimization.
    - Slight change to the way configure handles internal variables. Instead
      of converting (overwriting) some, such as enable_blas2blis and
      enable_cblas, from a "yes" or "no" to a "1" or "0" value, the latter are
      now stored in new variables that live alongside the originals (with the
      suffix "_01").  This is convenient since some values need to be
      sed-substituted into the config.mk.in template, which requires "yes" or
      "no", while some need to be written to the bli_config.h.in template,
      which requires "0" or "1".

commit c6f5c215ee793d03ea834469fc2adc53feaffc42
Merge: d52cb767 16a4c7a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 22 17:33:02 2016 -0500

    Merge branch 'master' into compose

commit 16a4c7a823d60707ed9272f5d36e5c5d54c0ba4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 19 11:38:36 2016 -0500

    Fixed bugs in bli_mutex_init() and friends.
    
    Details:
    - Fixed a couple of bugs that affected OpenMP and POSIX threads
      configurations that resulted in compiler errors and warnings due
      to type mismatch, and in the case of pthreads, a missing function
      argument. The bugs are fairly recent, introduced in a017062.

commit c8e4ef93953ba2b79fb7e0973c08469c0e28a2cd
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:13:03 2016 -0500

    Add prefetchw to 30x8 kernel.

commit 4b5a2f3d6e7ffeb5cc2be8448554f5c2083ad68f
Merge: 380736bf 9f52a587
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:09:51 2016 -0500

    Merge remote-tracking branch 'origin/knl' into knl
    
    # Conflicts:
    #       kernels/x86_64/knl/3/bli_dgemm_opt_24x8.c

commit 380736bfe955efbdd7274c90b6fd635688e83bc4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:08:28 2016 -0500

    Add (new) 30x8 KNL kernel and fix non-scatter prefetch bug.

commit 9f52a587dee855daa73c194e41b6951416544e9a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 16:03:53 2016 -0500

    Try prefetchw[t1] instead of regular prefetch for C.

commit 8945a1512d366bc6a8a85718d12cbf5de6f2898b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Aug 3 11:28:24 2016 -0500

    This version gets ~1550 GFLOPs on KNL with 16x4.

commit 6ce4c022ebdea00c2b951090e3c2e9e88735b9ce
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 16:26:36 2016 -0500

    Switch back to 24x8. I could only squeeze 24.5GFLOP out of 8x24, and scalability is not improved.

commit d52cb7671509592a8078729477b40b60380518a2
Merge: 95abea46 c31b1e7b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 27 16:04:55 2016 -0500

    Merge branch 'master' into compose

commit c31b1e7b9d659b96433a87e5aecb90e457a104cc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 27 15:58:07 2016 -0500

    Relax alignment restrictions for sandybridge ukrs.
    
    Details:
    - Relaxed the base pointer and leading dimension alignment restrictions
      in the sandybridge gemm microkernels, allowing the use of vmovups/vmovupd
      instead of vmovaps/vmovapd. These changes mimic those made to the haswell
      microkernels in e0d2fa0 and ee2c139.
    - Updated testsuite modules as well as standalone test drivers in 'test'
      directory to use DBL_MAX as the initial time candidate. Thanks to Devin
      Matthews for suggesting this change.
    - Inserted #include "float.h" into bli_system.h (to gain access to DBL_MAX).
    - Minor update (vis-a-vis contexts) to driver code in test/3m4m.

commit b8f2b55532849d45d379afbdd05a52ff6100800d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 15:22:55 2016 -0500

    Try an 8x24 kernel for the hell of it.

commit 7ede5863ae3567f7c0852efc2d5cd649ca19e0f3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 13:41:27 2016 -0600

    Allocate pack buffer on MCDRAM for KNL.

commit ad89ed2e829c7b261d8ba0998a3cb83ad576ee04
Merge: 2c9de740 81e2b05f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:45:40 2016 -0500

    Merge branch 'knl' of github.com:devinamatthews/blis into knl

commit 2c9de740edb66c4692c200731763bbd1d3171ccb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:44:54 2016 -0500

     This version gets ~26GF on one core.

commit 81e2b05f31bca4e1e1676e7b533d1868d9f9be33
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jul 27 11:39:05 2016 -0500

    Add optimized packing kernels for KNL.

commit a7d8ca97b8d835c32d90ff20a565c82733f014a8
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 15:15:13 2016 -0500

    All fixed.

commit 963d0393b023f4134bb0c682923faf9964c0e645
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 14:40:53 2016 -0500

    Add 24xk pack kernel.

commit 117b76739afba481768897d2580f8365d3345417
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 13:53:07 2016 -0500

    In the midst of debugging.

commit 8c0a4fd1d3535d608a9a309a61ffee0a73c3646f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 13:09:24 2016 -0500

    Fix some row/column confusion.

commit c44f9f96930312125b15e64c326ab5ab5cc02633
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 12:02:24 2016 -0500

    Simplify displacements -- clang assembler was badly botching EVEX compressed displacements giving false alarms for instruction length.

commit e0cce177cc1b47ec9f11ac0556241feaa3564df1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Jul 25 10:02:25 2016 -0500

    Minor fixes for 8x24 KNL kernel.

commit 65735bbedf75784c48bd11e05b3fdc98fc66b4bc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jul 24 21:50:32 2016 -0500

    Switch to 24x8 kernel, unrolled by 16.

commit 45d5dc97177117220bd9dd0abf85aafc185acad1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sun Jul 24 14:25:26 2016 -0500

    Add 24x8 "KNC-style" kernel for KNL.

commit 95abea46f86816fddfc9ff0abfa52880801461be
Merge: d0dfe5b5 a017062f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 23 15:38:33 2016 -0500

    Merge branch 'master' into compose

commit a017062fdf763037da9d971a028bb07d47aa1c8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 17:02:59 2016 -0500

    Integrated "memory broker" (membrk_t) abstraction.
    
    Details:
    - Integrated a patch originally authored and submitted by Ricardo Magana
      of HP Enterprise. The changeset inserts use of a new object type, membrk_t,
      (memory broker) that allows multiple sets of memory pools on, for example,
      separate NUMA nodes, each of which has a separate memory space.
    - Added membrk field to cntx_t and defined corresponding accessor macros.
    - Added membrk field to mem_t object and defined corresponding accessor macros.
    - Created new bli_membrk.c file, which contains the new memory broker API,
      including:
        bli_membrk_init(), bli_membrk_finalize()
        bli_membrk_acquire_[mv](), bli_membrk_release(),
        bli_membrk_init_pools(), bli_membrk_reinit_pools(),
        bli_membrk_finalize_pools(),
        bli_membrk_pool_size()
    - In bli_mem.c, changed function calls to
        bli_mem_init_pools()     -> bli_membrk_init()
        bli_mem_reinit_pools()   -> bli_membrk_reinit()
        bli_mem_finalize_pools() -> bli_membrk_finalize()
    - In bli_packv_init.c, bli_packm_init.c, changed function calls to:
        bli_mem_acquire_[mv]() -> bli_membrk_acquire_[mv]()
        bli_mem_release()      -> bli_membrk_release()
    - Added bli_mutex.c and related files to frame/thread. These files define
      abstract mutexes (locks) and corresponding APIs for pthreads, openmp, or
      single-threaded execution. This new API is employed within functions
      such as bli_membrk_acquire_[mv]() and bli_membrk_release().

commit 8ff2e069c48c12fd06b9c48c6b3aeb4ea9b0e6e1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 16:22:26 2016 -0500

    Add 4x unrolled variant for KNL microkernel.

commit 9cb2ed9b0c25f31a22c1c9719b062fa665ad7adf
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 16:10:30 2016 -0500

    Get rid of one RBX update.

commit 451bde076f0320d60cd2475cfb048ac4a2b798bb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 15:43:00 2016 -0500

    Add some more knobs to twiddle for KNL microkernel.

commit 8c6e621c099521e7a4d87e007bb8224faa5f33a3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 15:05:15 2016 -0500

    Make knl conform to new kernel dir structure.

commit ce7214c6618d6f22f4ce2ee452336236916d1f30
Merge: 119d0399 ce59f811
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 14:59:53 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl

commit ce59f81108ec9aea918a7e77030da8acfdd397ce
Merge: ff41153f 707a2b7f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 14:48:14 2016 -0500

    Merge pull request #88 from devinamatthews/32bit-dim_t
    
    Handle 32-bit dim_t in 64-bit microkernels.

commit 707a2b7faca137cca7cab7b11a12c44ddaf7ad53
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:49:44 2016 -0500

    Somehow forgot the most important microkernel.

commit 47ec045056351ac4f0791c071fa0daaa81699c8c
Merge: 08f1d6b6 ff41153f
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:45:23 2016 -0500

    Merge remote-tracking branch 'upstream/master' into 32bit-dim_t

commit 08f1d6b6fa344275de0f675f69737145ccf6646a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 13:44:37 2016 -0500

    Use 64-bit intermediate variable for k for architectures that do 64-bit loads in case dim_t is 32-bit.

commit ff41153f4eb7f38ed94bdd9a3fd81fb979f3f401
Merge: f9214ced e0d2fa0d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 13:21:03 2016 -0500

    Merge pull request #86 from devinamatthews/haswell-vmovups
    
    Remove alignment restrictions on C in haswell kernel.

commit e0d2fa0d835ab49366aeb790363bb2b571d36ed8
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 12:56:51 2016 -0500

    Relax alignment restrictions for haswell sgemm.

commit f9214ced97392861f5a0ea72abfcf6f41faf674c
Merge: 413d62ac 08666eaa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 22 12:16:39 2016 -0500

    Merge pull request #85 from devinamatthews/qopenmp
    
    Change -openmp to -fopenmp for icc.

commit ee2c139df6ad53c6aec8a67ab23b3b1912e8d259
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 12:06:03 2016 -0500

    Remove alignment restrictions on C in haswell kernel.

commit 08666eaa20d8a31f2f92f944e5bfa7c1558c53e4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 11:07:34 2016 -0500

    Change -openmp to -fopenmp for icc.

commit 119d0399428905053265f3aca1cc8cc1fde3b363
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Jul 22 10:23:31 2016 -0500

    Add 8x24 KNL kernel.

commit b58cda9eba0c1e175460aae109baf792d29ba5bf
Merge: 318f063d 413d62ac
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Jul 19 14:09:09 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl
    
    # Conflicts:
    #       frame/base/bli_threading.h
    #       frame/include/blis.h
    #       frame/thread/bli_thread.c

commit d0dfe5b5372cc7558ee9c4104b29f82eecc7ed61
Merge: 31def12e 413d62ac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 14 11:01:06 2016 -0500

    Merge branch 'master' into compose

commit 413d62aca28edabba56605a9f87d5b715831e1db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 12 15:02:52 2016 -0500

    README update (use official ACM TOMS links).

commit dfa431f696db2df4065ea454df268a2e0bc02eac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 12 14:21:19 2016 -0500

    README update (BLIS2 TOMS article now in-print).

commit 31def12e2629f187e40f93f6bae9e26a6c2660e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 30 15:19:20 2016 -0500

    First phase of control tree redesign.
    
    Details:
    - These changes constitute the first set of changes in preparation to
      revamping the structure and use of control trees in BLIS. Modifications
      in this commit don't affect the control tree code yet, but rather lay
      the groundwork.
    - Defined wrappers for the following functions, where the wrappers
      each take a direction parameter of a new enumerated type (BLIS_BWD or
      BLIS_FWD), dir_t, and execute the correct underlying function.
      - bli_acquire_mpart_*() and _vpart_*()
      - bli_*_determine_kc_[fb]()
      - bli_thread_get_range_*() and bli_thread_get_range_weighted_*()
    - Consolidated all 'f' (forwards-moving) and 'b' (backwards-moving)
      blocked variants for trmm and trsm, and renamed gemm and herk variants
      accordingly. The direction is now queried via routines such as
      bli_trmm_direct(), which determines the direction from the implied side
      and uplo parameters. For gemm and herk, it is unconditionally BLIS_FWD.
    - Defined wrappers to parameter-specific macrokernels for herk, trmm, and
      trsm, e.g. bli_trmm_xx_ker_var2(), that execute the correct underlying
      macrokernel based on the implied parameters. The same logic used to
      choose the dir_t in _direct() functions is used here.
    - Simplified the function pointer arrays in _int() functions given the
      consolidation and dir_t querying mentioned above.
    - Function signature (whitespace) reformatting for various functions.
    - Removed old code in various 'old' directories.

commit 232754feecf29452987666b9f5ebba2619bfd0b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 21 14:25:39 2016 -0500

    Fixed compiler warning in rand[vm], randn[vm].
    
    Details:
    - Fixed compiler warnings about unused variables related to the disabling
      of normalization in the structured cases of the rand[vm] and randn[vm]
      operations.

commit a89555d1605574f3685813dcc972b636dd61264d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 17 14:08:35 2016 -0500

    Added randn[vm] operations, support in testsuite.
    
    Details:
    - Defined a new randomization operation, randn, on vectors and matrices.
      The randnv and randnm operations randomize each element of the target
      object with values from a narrow range of values. Presently, those
      values are all integer powers of two, but they do not need to be powers
      of two in order to achieve the primary goal, which is to initialize
      objects that can be operated on with plenty of precision "slack"
      available to allow computations that avoid roundoff. Using this method
      of randomization makes it much more likely that testsuite residuals of
      properly-functioning operations are close to zero, if not exactly zero.
    - Updated existing randomization operations randv and randm to skip
      special diagonal handling and normalization for matrices with structure.
      This is now handled by the testsuite modules by explicitly calling a
      testsuite function that loads the diagonal (and scales off-diagonal
      elements).
    - Added support for randnv and randnm in the testsuite with a new switch
      in input.general that universally toggles between use of the classic
      randv/randm, which use real values on the interval [-1,1], and
      randnv/randnm, which use only values from a narrow range. Currently,
      the narrow range is: +/-{2^0, 2^-1, 2^-2, 2^-3, 2^-4, 2^-5, 2^-6}, as
      well as 0.0.
    - Updated testsuite modules so that a testsuite wrapper function is called
      instead of directly calling the randomization operations (such as
      bli_randv() and bli_randm()). This wrapper also takes a bool_t that
      indicates whether the object's elements should be normalized. (NOTE: As
      alluded to above, in the test modules of triangular solve operations such
      as trsv and trsm, we perform the extra step of loading the diagonal.)
    - Defined a new level-0 operation, invertsc, which inverts a scalar.
    - Updated the abval2ris and sqrt2ris level-0 macros to avoid an unlikely
      but possible divide-by-zero.
    - Updated function signature and prototype formatting in testsuite.
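
    A small sketch of drawing values from the narrow range listed above
    (0.0 and +/-2^0 through +/-2^-6). The function names and the mapping
    from random integers to values are assumptions made for illustration;
    the distribution actually used by randnv/randnm may differ.

```c
#include <stdlib.h>

/* Map an arbitrary unsigned integer onto one of 15 values:
   slot 0 -> 0.0, slots 1..7 -> +2^-e, slots 8..14 -> -2^-e, e = 0..6. */
static double narrow_value(unsigned r)
{
    unsigned slot = r % 15u;
    if (slot == 0u)
        return 0.0;

    unsigned e = (slot - 1u) % 7u;           /* exponent in 0..6 */
    double   v = 1.0 / (double)(1u << e);    /* exactly 2^-e */
    return (slot <= 7u) ? v : -v;
}

/* Fill a vector with narrow-range values using the C library's rand(). */
static void randn_fill_sketch(double *x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x[i] = narrow_value((unsigned)rand());
}
```

    Every value produced is an integer multiple of 2^-6, so sums and
    products of modest numbers of such values remain exactly representable
    in double precision, which is the precision "slack" described above.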

commit 318f063dcbd8b594969e401bc99146d24b01066a
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Jun 8 17:46:50 2016 -0500

    Add new KNL microkernel derived from Haswell.

commit 096895c5d538a7f8817603d7cf28c52e99340def
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 6 13:32:04 2016 -0500

    Reorganized code, APIs related to multithreading.
    
    Details:
    - Reorganized code and renamed files defining APIs related to multithreading.
      All code that is not specific to a particular operation is now located in a
      new directory: frame/thread. Code is now organized, roughly, by the
      namespace to which it belongs (see below).
    - Consolidated all operation-specific *_thrinfo_t object types into a single
      thrinfo_t object type. Operation-specific level-3 *_thrinfo_t APIs were
      also consolidated, leaving bli_l3_thrinfo_*() and bli_packm_thrinfo_*()
      functions (aside from a few general purpose bli_thrinfo_*() functions).
    - Renamed thread_comm_t object type to thrcomm_t.
    - Renamed many of the routines and functions (and macros) for multithreading.
      We now have the following API namespaces:
      - bli_thrinfo_*(): functions related to thrinfo_t objects
      - bli_thrcomm_*(): functions related to thrcomm_t objects.
      - bli_thread_*(): general-purpose functions, such as initialization,
        finalization, and computing ranges. (For now, some macros, such as
        bli_thread_[io]broadcast() and bli_thread_[io]barrier() use the
        bli_thread_ namespace prefix, even though bli_thrinfo_ may be more
        appropriate.)
    - Renamed thread-related macros so that they use a bli_ prefix.
    - Renamed control tree-related macros so that they use a bli_ prefix (to be
      consistent with the thread-related macros that were also renamed).
    - Removed #undef BLIS_SIMD_ALIGN_SIZE from dunnington's bli_kernel.h. This
      #undef was a temporary fix to some macro defaults which were being applied
      in the wrong order, which was recently fixed.

commit 232530e88ff99f37abcae5b6fb5319a9a375a45f
Merge: 4bcabd1b eef37f8b
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Jun 1 15:14:10 2016 -0500

    Merge commit 'refs/pull/81/head' of https://github.com/flame/blis
    
    Conflicts:
            frame/base/bli_threading_pthreads.c
            frame/base/bli_threading_pthreads.h

commit 4bcabd1bf60688c38cf562459fc5e8be8b831756
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Jun 1 13:27:28 2016 -0500

    Use spin locks instead of pthread barriers

commit eef37f8b4d81845a6ba4bf25586d32b50c3e8a68
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Sun May 29 22:28:13 2016 -0700

    use GCC intrinsic instead of pthread_mutex for atomic increment and fetch

commit 9dcd6f05c4c3ff2ce7cd87a9951a96ebef22681e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 24 13:15:32 2016 -0500

    Implemented developer-configurable malloc()/free().
    
    Details:
    - Replaced all instances of bli_malloc() and bli_free() with one of:
      - bli_malloc_pool()/bli_free_pool()
      - bli_malloc_user()/bli_free_user()
      - bli_malloc_intl()/bli_free_intl()
      each of which can be configured to call malloc()/free() substitutes,
      so long as the substitute functions have the same function type
      signatures as malloc() and free() defined by C's stdlib.h. The _pool()
      function is called when allocating blocks for the memory pools (used
      for packing buffers, primarily), the _user() function is called when
      obj_t's are created (via bli_obj_create() and friends), and the _intl()
      function is called for internal use by BLIS, such as when creating
      control tree nodes or temporary buffers for manipulating internal data
      structures. Substitutes for any of the three types of bli_malloc() may
      be specified by #defining the following pairs of cpp macros in
      bli_kernel.h:
      - BLIS_MALLOC_POOL/BLIS_FREE_POOL
      - BLIS_MALLOC_USER/BLIS_FREE_USER
      - BLIS_MALLOC_INTL/BLIS_FREE_INTL
      to be the name of the substitute functions. (Obviously, the object
      code that contains these functions must be provided at link-time.)
      These macros default to malloc() and free(). Substitute functions are
      also automatically prototyped by BLIS (in bli_malloc_prototypes.h).
    - Removed definitions for bli_malloc() and bli_free().
    - Note that bli_malloc_pool() and bli_malloc_user() are now defined in
      terms of a new function, bli_malloc_align(), which aligns memory to an
      arbitrary (power of two) alignment boundary, but does so manually,
      whereas before alignment was performed behind the scenes by
      posix_memalign(). Currently, bli_malloc_intl() is defined in terms
      of bli_malloc_noalign(), which serves as a simple wrapper to the
      designated function that is passed in (e.g. BLIS_MALLOC_INTL).
      Similarly, there are bli_free_align() and bli_free_noalign(), which
      are used in concert with their bli_malloc_*() counterparts.
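
    Illustration (not the code from this commit): a minimal sketch of how a
    configurable, manually aligned allocator could be layered on
    malloc()-compatible functions. The message does not describe how
    bli_malloc_align() aligns memory; the over-allocate-and-stash technique
    below is an assumption, and every SKETCH_/sketch_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for BLIS_MALLOC_POOL / BLIS_FREE_POOL: any pair
   of functions with the signatures of malloc() and free() may be used. */
#ifndef SKETCH_MALLOC_POOL
#define SKETCH_MALLOC_POOL malloc
#endif
#ifndef SKETCH_FREE_POOL
#define SKETCH_FREE_POOL free
#endif

typedef void* (*malloc_ft)( size_t size );
typedef void  (*free_ft)( void* p );

/* Over-allocate, round the address up to a multiple of align (a power of
   two), and stash the original pointer just below the aligned address. */
static void* sketch_malloc_align( malloc_ft f, size_t size, size_t align )
{
    assert( align != 0 && ( align & ( align - 1 ) ) == 0 );

    char* raw = f( size + align + sizeof( void* ) );
    if ( raw == NULL ) return NULL;

    uintptr_t start   = ( uintptr_t )raw + sizeof( void* );
    uintptr_t aligned = ( start + align - 1 ) & ~( ( uintptr_t )align - 1 );
    char*     p       = raw + ( aligned - ( uintptr_t )raw );

    /* memcpy avoids a misaligned pointer store when align is small. */
    memcpy( p - sizeof( void* ), &raw, sizeof( raw ) );
    return p;
}

/* Recover the pointer saved by sketch_malloc_align() and release it. */
static void sketch_free_align( free_ft f, void* p )
{
    char* raw;
    if ( p == NULL ) return;
    memcpy( &raw, ( char* )p - sizeof( void* ), sizeof( raw ) );
    f( raw );
}

/* Pool-style wrappers: the substitute functions are chosen by macro. */
static void* sketch_malloc_pool( size_t size )
{
    return sketch_malloc_align( SKETCH_MALLOC_POOL, size, 64 );
}

static void sketch_free_pool( void* p )
{
    sketch_free_align( SKETCH_FREE_POOL, p );
}
```

    Defining SKETCH_MALLOC_POOL and SKETCH_FREE_POOL before this code is
    compiled swaps in a different allocator without touching the callers.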

commit 9dd440109a9d964f5cd286e9f83c487ad703e1e4
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Sat May 21 15:21:58 2016 -0700

    fix 404 link to BuildSystem
    
    Google Code is dead.  Long live GitHub!

commit d309f20b7376a68efa3b864ad790c2021c071655
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 18 15:13:53 2016 -0500

    Added alignment switch to testsuite.
    
    Details:
    - Added a new input parameter to input.general that globally toggles
      whether testsuite tests are performed on objects whose buffers and
      leading dimensions have been aligned, and changed the implementation
      of libblis_test_mobj_create() to employ alignment (or not) regardless
      of whether row, column, or general storage is being tested.
    - Updated configure script's "--help" text to indicate default behavior
      for internal integer type size and BLAS/CBLAS integer type size
      options.

commit 32db0adc218ea4ae370164dbe8d23b41cd3526d3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 17 15:20:16 2016 -0500

    Generate prototypes for user-defined packm kernels.
    
    Details:
    - Created template prototypes for packm kernels (in bli_l1m_ker.h), and
      then redefined reference packm kernels' prototyping headers in terms of
      this template, as is already done for level-1v, -1f, and -3 kernels.
    - Automatically generate prototypes for user-defined packm kernels in
      bli_kernel_prototypes.h (using the new template prototypes in
      bli_l1m_ker.h).
    - Defined packm kernel function types in bli_l1m_ft.h, including for
      packm kernels specific to induced methods, which are now used in
      bli_packm_cxk.c and friends rather than using a locally-defined
      function type.
    - In bli_packm_cxk.c, extended the array of packm kernel function
      pointers out to index 31 (from the previous maximum of 17). This
      allows us to store the unrolled 30xk kernel in the array for use (on
      knc, for example). Note: This should have been done a long time ago.
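
    Illustration (not the code from this commit): a sketch of a kernel
    table indexed by panel dimension, with a generic fallback for
    dimensions that have no registered kernel. The kernel signature and all
    names below are hypothetical and simplified to double precision.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel type: pack a panel of A (element strides inca and
   lda) into contiguous storage p, one column of the panel at a time. */
typedef void (*packm_ker_ft)( int k, const double* a, int inca, int lda,
                              double* p );

enum { MAX_PANEL_DIM = 31 };

/* Table indexed by panel dimension; entries without a kernel stay NULL. */
static packm_ker_ft packm_kers[ MAX_PANEL_DIM + 1 ];

/* Kernel specialized for a fixed panel dimension of 30. */
static void packm_30xk( int k, const double* a, int inca, int lda,
                        double* p )
{
    for ( int l = 0; l < k; ++l )
        for ( int i = 0; i < 30; ++i )
            p[ l * 30 + i ] = a[ i * inca + l * lda ];
}

/* Fallback loop used when no specialized kernel is registered. */
static void packm_generic( int m, int k, const double* a, int inca,
                           int lda, double* p )
{
    for ( int l = 0; l < k; ++l )
        for ( int i = 0; i < m; ++i )
            p[ l * m + i ] = a[ i * inca + l * lda ];
}

/* Dispatch on the panel dimension m. */
static void packm_cxk( int m, int k, const double* a, int inca, int lda,
                       double* p )
{
    assert( m >= 0 && k >= 0 );

    packm_ker_ft f = ( m <= MAX_PANEL_DIM ) ? packm_kers[ m ] : NULL;

    if ( f != NULL ) f( k, a, inca, lda, p );
    else             packm_generic( m, k, a, inca, lda, p );
}
```

    With the table sized to hold index 31, a 30xk kernel can be registered
    as packm_kers[ 30 ] = packm_30xk and is then selected by packm_cxk().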

commit e3bd5ca64ae7c190ba689396c0de687b829a11fe
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 12 20:54:13 2016 -0500

     Fix SIMD definitions in KNL config, and a couple of fixes to C update.

commit 4fe02e3d497995d94d34d3fcf5af895084cfc8b9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu May 12 20:53:58 2016 -0500

    Move bli_kernel.h before bli_threading.h in order of inclusion in blis.h.

commit 4bcf1b35abea3f3dfc8f2fe462dcf155cf199e55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 11 16:09:49 2016 -0500

    Fixed bli_get_range_*() bugs in trsm variants.
    
    Details:
    - Fixed incorrect calls to bli_get_range_*() from within trsm blocked
      variants 1f, 2b, and 2f. The bug somehow went undetected since the
      big commit (537a1f4), and, strangely, did not manifest via the BLIS
      testsuite. The bug finally came to our attention when running the
      libflame test suite while linking to BLIS. Thanks to Kiran Varaganti
      for submitting the initial report that led to this bug.

commit 9cfa33023f123a6c17e987f72fba174ce073f0b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 11 16:02:30 2016 -0500

    Minor updates to bli_f2c.h.
    
    Details:
    - Added #undef guards to certain #define statements in bli_f2c.h,
      and renamed the file guard to BLIS_F2C_H. This helps when
      #including "blis.h" from an application or library that already
      #includes an "f2c.h" header.

commit a09a2e23eacf5328858c8318bb637c5ff3b71d08
Merge: 4dcd37eb 7c604e1c
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed May 11 10:47:11 2016 -0500

    Merge pull request #76 from devinamatthews/move_simd_defs
    
    Move default SIMD-related definitions to bli_kernel_macro_defs.h

commit 4dcd37eb1b12a6e08cc13df7b61391ef8363f5d8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue May 10 16:28:59 2016 -0500

    fixing knc simd align size

commit 619dee0daec3474b4e5a55df90a61aabcae194f2
Merge: b790b3d9 7c604e1c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 12:13:24 2016 -0500

    Merge branch 'move_simd_defs' into knl

commit 7c604e1cbc1609b6e12d3ee973c08b7af5035be4
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 12:11:55 2016 -0500

    Move default SIMD-related definitions to bli_kernel_macro_defs.h. Otherwise, configurations which customize these fail as these are now defined in bli_kernel.h.

commit b790b3d9e1820f3b691676de48c291cae083452d
Merge: 4f8c05c9 a7be2d28
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 11:49:47 2016 -0500

    Merge branch 'master' into knl

commit a7be2d28e8930b154d0da1d6929b54a96e210af6
Merge: 97b512ef 4b1e55ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 10 11:48:51 2016 -0500

    Merge pull request #74 from devinamatthews/fix_common_symbols
    
    Default-initialize all extern global variables to avoid generating common symbols.

commit 4b1e55edbfe0e1cb2e7b9428424903497cb7a841
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue May 10 10:08:47 2016 -0500

    Default-initialize all extern global variables to avoid generating common symbols. Fixes #73.

commit 97b512ef62c7e25c97ed5e9eca81cd7015b2ac91
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 6 10:24:30 2016 -0500

    Include headers from cblas.h to pull in f77_int.
    
    Details:
    - Added #include statements for certain key BLIS headers so that the
      definition of f77_int is pulled in when a user compiles application
      code with only #include "cblas.h" (and no other BLIS header). This
      is necessary since f77_int is now used within the cblas API.

commit c3a4d39d03665135f1616588b5ef7c3e9ef5688d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 4 17:22:56 2016 -0500

    Updates to haswell gemm micro-kernels.
    
    Details:
    - Added two new sets of [sd]gemm micro-kernels for haswell architectures,
      one that is 4x24/4x12 (s and d) and one that is 6x16/6x8.
    - Changed the haswell configuration to use the 6x16/6x8 micro-kernels
      by default.
    - Updated various Makefiles, in test, test/3m4m, and testsuite.

commit 0b01d355ae861754ae2da6c9a545474af010f02e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 15:21:10 2016 -0500

    Miscellaneous cleanups, fixes to recent commits.
    
    Details:
    - Fixed a typo in bli_l1f_ref.h, introduced in bbb8569, that only
      manifested when non-reference level-1f kernels were used.
    - Added an #undef BLIS_SIMD_ALIGN_SIZE to bli_kernel.h of dunnington
      configuration to prevent a compile-time warning until I can figure out
      the proper permanent fix.
    - Moved frame/1f/kernels/bli_dotxaxpyf_ref_var1.c out of the compilation
      path (into 'other' directory). _ref_var2 is used by default, which is
      the variant that is built on axpyf and dotxf instead of dotaxpyv.
    - Removed section of frame/include/bli_config_macro_defs.h pertaining to
      mixed datatype support.

commit ed7326c836f427e2f8420b015220ce293207b10c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 14:57:40 2016 -0500

    Added 'restrict' to l1v/l1f code in 'kernels' dir.
    
    Details:
    - Added 'restrict' keyword to existing kernel definitions in 'kernels'
      directory. These changes were meant for inclusion in bbb8569.

commit bbb8569b2a08c3bcd631d5a05eb389d01d94ac07
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 27 14:13:46 2016 -0500

    Use 'restrict' in all kernel APIs; wspace changes.
    
    Details:
    - Updated level-1v, level-1f kernel function types (bli_l1?_ft.h) and
      generic kernel prototypes (bli_l1?_ker.h) to use 'restrict' for all
      numerical operand pointers (ie: all pointers except the cntx_t).
    - Updated level-1f reference kernel definitions to use 'restrict' for
      all numerical operand pointers. (Level-1v reference kernel definitions
      were already updated in bdbda6e.)
    - Rewrote the level-1v and level-1f reference kernel prototypes in
      bli_l1v_ref.h and bli_l1f_ref.h, respectively, to simply #include
      bli_l1v_ker.h and bli_l1f_ker.h with redefined function base names
      (as was already being done for the level-3 micro-kernel prototypes
      in bli_l3_ref.h), rather than duplicate the signatures from the
      _ker.h files.
    - Added definitions to frame/include/bli_kernel_prototypes.h for axpbyv
      and xpbyv, which were probably meant for inclusion in bdbda6e.
    - Converted a number of instances of four spaces, as introduced in
      bdbda6e, to tabs.

commit 4ea419c72c789825e1f93a1eee88219bbf873930
Merge: f1e9be2a bdbda6e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 26 12:50:45 2016 -0500

    Merge pull request #70 from devinamatthews/daxpby
    
    Give the level1v operations some love

commit bdbda6e6acc682ab1b6ca680edebd09ae12a832c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 25 11:05:57 2016 -0500

    Give the level1v operations some love:
    
    - Add missing axpby and xpby operations (plus test cases).
    - Add special case for scal2v with alpha=1.
    - Add restrict qualifiers.
    - Add special-case algorithms for incx=incy=1.
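
    Illustration (not the code from this commit): a sketch of an axpby
    kernel, y := beta * y + alpha * x, with restrict-qualified operands and
    a separate loop for the incx=incy=1 case. The name daxpbyv_sketch() and
    its signature are hypothetical, and non-negative strides are assumed.

```c
#include <assert.h>

/* y := beta * y + alpha * x for n elements. restrict tells the compiler
   that x and y do not overlap, which helps it vectorize the loops. */
static void daxpbyv_sketch( int n,
                            double alpha, const double* restrict x, int incx,
                            double beta,        double* restrict y, int incy )
{
    assert( n >= 0 && incx >= 0 && incy >= 0 );

    if ( incx == 1 && incy == 1 )
    {
        /* Unit-stride case: contiguous accesses, no index arithmetic. */
        for ( int i = 0; i < n; ++i )
            y[ i ] = beta * y[ i ] + alpha * x[ i ];
    }
    else
    {
        /* General strided case. */
        for ( int i = 0; i < n; ++i )
            y[ i * incy ] = beta * y[ i * incy ] + alpha * x[ i * incx ];
    }
}
```

    An xpby variant is the same loop with alpha fixed at one, which saves
    one multiplication per element.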

commit f1e9be2aba1a057eedb947bbae96848597777408
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 15:34:02 2016 -0500

    Minor tweak to test/Makefile.
    
    Details:
    - Just committing a minor change to test/Makefile that has been lingering
      in my local working copy for longer than I can remember.

commit aa0bceec277938328dabeb744680623f24fb0b61
Merge: 4136553f e2784b4c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 12:01:31 2016 -0500

    Merge branch 'master' of github.com:flame/blis

commit 4136553f0d0661a668dfdb9edcd7ce1c5773dde7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 22 11:53:53 2016 -0500

    Clear level-3 cntx_t's via memset() before use.
    
    Details:
    - In all level-3 operations' _cntx_init() functions, replaced calls to
      bli_cntx_obj_init() with calls to bli_cntx_obj_clear(), and in all
      level-3 operations' _cntx_finalize() functions, removed calls to
      bli_cntx_obj_finalize(), leaving those function definitions empty.
    - Changed the definition of bli_cntx_obj_clear() so that the clearing
      occurs via a single call to memset().
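
    Illustration (not the code from this commit): clearing a structure with
    one memset() call instead of field-by-field initialization. The struct
    layout and names are hypothetical; the real cntx_t is not shown here.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a context object. */
typedef struct
{
    int   blkszs[ 4 ];
    void* ukrs[ 4 ];
    int   method;
} cntx_clear_sketch_t;

/* Zero every byte of the object in a single call. On common platforms
   all-bits-zero also yields null pointers, although the C standard does
   not strictly guarantee that representation. */
static void cntx_clear_sketch( cntx_clear_sketch_t* cntx )
{
    assert( cntx != NULL );
    memset( cntx, 0, sizeof( *cntx ) );
}
```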

commit 4f8c05c9e2ef4cbb82b35a3ebf1f0a0ac665830e
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Apr 21 10:00:59 2016 -0500

    Rearrange KNL dgemm kernel again to streamline usage of ymm register. sgemm and dgemm now both working with Intel SDE.

commit e2784b4c921f706e756df3e146e20a4cb63f53e3
Merge: dd0ab1d9 a9b6c3ab
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 20 18:34:09 2016 -0500

    Merge pull request #67 from devinamatthews/cblas-f77-int
    
    Change CBLAS integer type to f77_int

commit a9b6c3abda6222a8b240361643932e83cf726c4f
Merge: e4c54c81 dd0ab1d9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 16:00:10 2016 -0500

    Merge remote-tracking branch 'origin/master' into cblas-f77-int
    
    # Conflicts:
    #       config/haswell/bli_config.h

commit e4c54c81463c2a19c9bb6b1f0f1be3fa9d018a45
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 15:56:46 2016 -0500

    Change integer type in CBLAS function signatures to f77_int, and add proper const-correctness to BLAS layer.

commit dd0ab1d93f33abca6af9edd7b8e52da62dcfa5b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 20 14:38:23 2016 -0500

    Converted some bli_cntx query functions to macros.
    
    Details:
    - Commented out several datatype-aware query functions (those ending in
      _dt) from bli_cntx.c, as well as their prototypes in bli_cntx.h, and
      added equivalent cpp query macros to bli_cntx.h.
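
    Illustration (not the code from this commit): a datatype-aware query
    written first as a function and then as an equivalent cpp macro. The
    types and names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DT_FLOAT_SKETCH = 0, DT_DOUBLE_SKETCH, DT_COUNT_SKETCH } dt_sketch_t;

/* Hypothetical context holding one blocksize per datatype. */
typedef struct { int blkszs[ DT_COUNT_SKETCH ]; } cntx_query_sketch_t;

/* Function form: one call per query. */
static int cntx_get_blksz_dt_fn( dt_sketch_t dt,
                                 const cntx_query_sketch_t* cntx )
{
    assert( cntx != NULL );
    return cntx->blkszs[ dt ];
}

/* Macro form: expands inline at the call site. Arguments are
   parenthesized so that expressions passed in expand safely. */
#define cntx_get_blksz_dt( dt, cntx ) \
    ( (cntx)->blkszs[ (dt) ] )
```

    The macro form trades the function's argument checking for the removal
    of call overhead in code that performs these queries frequently.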
    - Added 'bli_config.h' to .gitignore.

commit 7193230f7d35edbd1d2f77842a613971f1603463
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 20 09:37:30 2016 -0500

    Work around missing VPMULLQ on KNL.

commit a30ccbc4c6a6e6460e78af6b5c530ee0d06f98fb
Merge: eb2f18e4 0e1a9821
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 19 15:04:33 2016 -0500

    Merge pull request #66 from devinamatthews/blas-configure
    
    Add configure options and generate bli_config.h automatically.

commit bd44cf13e886069bc66c10ac0db178be96629a0d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Apr 19 13:43:04 2016 -0500

    Fix copy-paste errors in KNL kernels.

commit eb2f18e4844d985715df20798f50f9cc12e3b5ad
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 19 12:50:32 2016 -0500

    More compile-time fixes to bgq gemm ukernel code.

commit 0e1a9821d860f6c1d818baf4c48d21a23726c132
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Apr 19 11:44:37 2016 -0500

    Add configure options and generate bli_config.h automatically.
    
    Options to configure have been added for:
    - Setting the internal BLIS and BLAS/CBLAS integer sizes.
    - Enabling and disabling the BLAS and CBLAS layers.
    
    Additionally, configure options which require defining macros (the above plus the threading model), write their macros to the automatically-generated bli_config.h file in the top-level build directory. The old bli_config.h files in the config dirs were removed, and any kernel-related macros (SIMD size and alignment etc.) were moved to bli_kernel.h. The Makefiles were also modified to find the new bli_config.h file.
    
    Lastly, support for OMP in clang has been added (closes #56).

commit a11eec05928ddc5c43fa5dbcd35f2edd24ff35a1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 13:13:36 2016 -0500

    Add sgemm ukernels for KNL. vpmullq is not implemented on KNL -- needs workaround.

commit ff84469a4575f1ef8a0010046fde52240a312cae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 18 12:29:09 2016 -0500

    Applied various compilation fixes to bgq kernels.

commit c38e0dab05b2dc36672eab96e1248fb7fb2d785b
Merge: bd5e2296 cbcd0b73
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:21:35 2016 -0500

    Merge remote-tracking branch 'origin/master' into knl

commit bd5e2296e98e042c31f1e8ece2c1ca8e4bdc2d4c
Merge: 4745def0 49f85177
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:15:22 2016 -0500

    Merge remote-tracking branch 'origin/knl' into knl

commit 4745def0c87377ae83ad73ac514d7de08a96b2ac
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:15:05 2016 -0500

    Add 64-bit offset vector so we can use vgatherqpd.

commit 49f85177f886f38889b60503a4e12fa7f04be1fd
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 18 10:14:11 2016 -0500

    KNL ukernel compiles with gcc.

commit cbcd0b739dc54bd14fbb46aeda267c26725cd70f
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Mon Apr 18 03:12:57 2016 -0500

    Changing ifdef for OSX pthread barriers

commit 58b2c3cf040134d1be913c585a3c6905629116c0
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Apr 16 16:12:24 2016 -0500

     Rewrite of KNL kernel in GNU extended asm syntax.

commit dd62080cea78f3a23616200d6640e52c102b2bb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 15 11:15:41 2016 -0500

    Compile-time fix to bgq l1f kernels.
    
    Details:
    - Fixed an old reference to bli_daxpyf_fusefac, which no longer exists,
      by replacing it with the axpyf fusing factor (8), and cleaned up the
      relevant section of config/bgq/bli_kernel.h.
    - Removed most of the details of the level-3 kernels from the template
      kernel code in config/template/kernels/3 and replaced it with a
      reference to the relevant kernel wiki maintained on the BLIS github
      website.

commit d5a915dd8d7a6ead42a68772e4420eb3647e6f1a
Merge: 4320b725 41694675
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 14 12:56:36 2016 -0500

    Merge branch 'master' of github.com:flame/blis

commit 4320b725a1f8fd34101470b6cf52ad504a79c517
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 14 12:51:29 2016 -0500

    Use kernel CFLAGS on "ukernels" directories.
    
    Details:
    - Updated the top-level Makefile so that the CFLAGS variable designated
      for kernel source code is applied not only to source code in
      directories named "kernels" but source code in any directory that
      contains the substring "kernels", such as "ukernels".
    - Formally disabled some code in gen-make-frag.sh script that was already
      effectively disabled. The code was related to handling "noopt" and
      "kernel" directories, which is now handled independently within the
      top-level Makefile without needing to place these source files into
      a separate makefile variable.

commit 41694675e4cb56e2e0323c7a7db48e0819606a31
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 13 15:51:08 2016 -0500

    pthreads bugfixes
    
    Getting pthreads to work on my Mac
    Implemented a pthread barrier when _POSIX_BARRIER isn't defined
    Now spawn n-1 threads instead of n threads so that master thread isn't just spinning the whole time
    Add -lpthread instead of -pthread to LDFLAGS (for clang)

commit f756dbfa0d542cbc497724981520c83abf049c4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 13 11:25:33 2016 -0500

    Removed stale #include from bgq configuration.
    
    Details:
    - Removed an old #include statement ("bli_gemm_8x8.h") from the
      bli_kernel.h file in the bgq configuration. It turns out this
      file was no longer needed even prior to 537a1f4.

commit 0bd4169ea75f690714e7d2912229932a75d8a7e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 18:08:32 2016 -0500

    Fixed context-broken dunnington/penryn kernels.
    
    Details:
    - Added missing context parameters to several instances where simpler
      kernels, or reference kernels, are called instead of executing the
      main body code contained in the kernel function in question.
    - Renamed axpyv and dotv kernel files to use "opt" instead of "int"
      substring, for consistency with level-1f kernels.

commit 7912af5db45b7372d19a9a3dfeb82df302a05628
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:32:13 2016 -0500

    CHANGELOG update (0.2.0)

commit 898614a555ea0aa7de4ca07bb3cb8f5708b6a002 (tag: 0.2.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:32:09 2016 -0500

    Version file update (0.2.0)

commit 537a1f4f85ce1aa008901857cb3182e6b4546d7f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 11 17:21:28 2016 -0500

    Implemented runtime contexts and reorganized code.
    
    Details:
    - Retrofitted a new data structure, known as a context, into virtually
      all internal APIs for computational operations in BLIS. The structure
      is now present within the type-aware APIs, as well as many supporting
      utility functions that require information stored in the context. User-
      level object APIs were unaffected and continue to be "context-free,"
      however, these APIs were duplicated/mirrored so that "context-aware"
      APIs now also exist, differentiated with an "_ex" suffix (for "expert").
      These new context-aware object APIs (along with the lower-level, type-
      aware, BLAS-like APIs) contain the address of a context as a last
      parameter, after all other operands. Contexts, or specifically, cntx_t
      object pointers, are passed all the way down the function stack into
      the kernels and allow the code at any level to query information about
      the runtime, such as kernel addresses and blocksizes, in a thread-
      friendly manner--that is, one that allows thread-safety, even if the
      original source of the information stored in the context changes at
      run-time (see next bullet for more on this "original source" of info).
      (Special thanks go to Lee Killough for suggesting the use of this kind
      of data structure in discussions that transpired during the early
      planning stages of BLIS, and also for suggesting such a perfectly
      appropriate name.)
    - Added a new API, in frame/base/bli_gks.c, to define a "global kernel
      structure" (gks). This data structure and API will allow the caller to
      initialize a context with the kernel addresses, blocksizes, and other
      information associated with the currently active kernel configuration.
      The currently active kernel configuration within the gks cannot be
      changed (for now), and is initialized with the traditional cpp macros
      that define kernel function names, blocksizes, and the like. However,
      in the future, the gks API will be expanded to allow runtime management
      of kernels and runtime parameters. The most obvious application of this
      new infrastructure is the runtime detection of hardware (and the
      implied selection of appropriate kernels). With contexts in place,
      kernels may even be "hot swapped" at runtime within the gks. Once
      execution enters a level-3 _front() function, the memory allocator will
      be reinitialized on-the-fly, if necessary, to accommodate the new
      kernels' blocksizes. If another application thread is executing with
      another (previously loaded) kernel, it will finish in a deterministic
      fashion because its kernel information was loaded into its context
      before computation began, and also because the blocks it checked out
      from the internal memory pools will be unaffected by the newer threads'
      reinitialization of the allocator.
    - Reorganized and streamlined the 'ind' directory, which contains much of
      the code enabling use of induced methods for complex domain matrix
      multiplication; deprecated bli_bsv_query.c and bli_ukr_query.c, as
      those APIs' functionality is now mostly subsumed within the global
      kernel structure.
    - Updated bli_pool.c to define a new function, bli_pool_reinit_if(),
      that will reinitialize a memory pool if the necessary pool block size
      has increased.
    - Updated bli_mem.c to use bli_pool_reinit_if() instead of
      bli_pool_reinit() in the definition of bli_mem_pool_init(), and placed
      usage of contexts where appropriate to communicate cache and register
      blocksizes to bli_mem_compute_pool_block_sizes().
    - Simplified control trees now that much of the information resides in
      the context and/or the global kernel structure:
      - Removed blocksize object pointers (blksz_t*) fields from all control
        tree node definitions and replaced them with blocksize id (bszid_t)
        values instead, which may be passed into a context query routine in
        order to extract the corresponding blocksize from the given context.
      - Removed micro-kernel function pointers (func_t*) fields from all
        control tree node definitions. Now, any code that needs these function
        pointers can query them from the local context, as identified by a
        level-3 micro-kernel id (l3ukr_t), level-1f kernel id, (l1fkr_t), or
        level-1v kernel id (l1vkr_t).
      - Removed blksz_t object creation and initialization, as well as kernel
        function object creation and initialization, from all operation-
        specific control tree initialization files (bli_*_cntl.c), since this
        information will now live in the gks and, secondarily, in the context.
    - Removed blocksize multiples from blksz_t objects. Now, we track
      blocksize multiples for each blocksize id (bszid_t) in the context
      object.
    - Removed the bool_t's that were required when a func_t was initialized.
      These bools are meant to allow one to track the micro-kernel's storage
      preferences (by rows or columns). This preference is now tracked
      separately within the gks and contexts.
    - Merged and reorganized many separate-but-related functions into single
      files. This reorganization affects frame/0, 1, 1d, 1m, 1f, 2, 3, and
      util directories, but has the most obvious effect of allowing BLIS
      to compile noticeably faster.
    - Reorganized execution paths for level-1v, -1d, -1m, and -2 operations
      in an attempt to reduce overhead for memory-bound operations. This
      includes removal of default use of object-based variants for level-2
      operations. Now, by default, level-2 operations will directly call a
      low-level (non-object based) loop over a level-1v or -1f kernel.
    - Converted many common query functions in bli_blksz.c (renamed from
      bli_blocksize.c) and bli_func.c into cpp macros, now defined in their
      respective header files.
    - Defined bli_mbool.c API to create and query "multi-bools", or
      heterogeneous bool_t's (one for each floating-point datatype), in the
      same spirit as blksz_t and func_t.
    - Introduced two key parameters of the hardware: BLIS_SIMD_NUM_REGISTERS
      and BLIS_SIMD_SIZE. These values are needed in order to compute a third
      new parameter, which may be set indirectly via the aforementioned
      macros or directly: BLIS_STACK_BUF_MAX_SIZE. This value is used to
      statically allocate memory in macro-kernels and the induced methods'
      virtual kernels to be used as temporary space to hold a single
      micro-tile. These values are now output by the testsuite. The default
      value of BLIS_STACK_BUF_MAX_SIZE is computed as
      "2 * BLIS_SIMD_NUM_REGISTERS * BLIS_SIMD_SIZE".
    - Cleaned up top-level 'kernels' directory (for example, renaming the
      embarrassingly misleading "avx" and "avx2" directories to "sandybridge"
      and "haswell," respectively) and gave more consistent and meaningful
      names to many kernel files (as well as updating their interfaces to
      conform to the new context-aware kernel APIs).
    - Updated the testsuite to query blocksizes from a locally-initialized
      context for test modules that need those values: axpyf, dotxf,
      dotxaxpyf, gemm_ukr, gemmtrsm_ukr, and trsm_ukr.
    - Reformatted many function signatures into a standard format that will
      more easily facilitate future API-wide changes.
    - Updated many "mxn" level-0 macros (ie: those used to inline double loops
      for level-1m-like operations on small matrices) in frame/include/level0
      to use more obscure local variable names in an effort to avoid variable
      shadowing. (Thanks to Devin Matthews for pointing out these gcc
      warnings, which are only output when using -Wshadow.)
    - Added a conj argument to setm, so that its interface now mirrors that
      of scalm. The semantic meaning of the conj argument is to optionally
      allow implicit conjugation of the scalar prior to being populated into
      the object.
    - Deprecated all type-aware mixed domain and mixed precision APIs. Note
      that this does not preclude supporting mixed types via the object APIs,
      where it produces absolutely zero API code bloat.
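
    Illustration (not the code from this commit): a sketch of the
    context/gks relationship described above, in which a context takes a
    private copy of the active kernel information so that later changes to
    the global structure do not affect computation already under way. All
    types, names, and the register count and size values are hypothetical;
    only the "2 * registers * size" default formula comes from the message.
    A real implementation would also need to synchronize access to the
    global structure.

```c
#include <assert.h>
#include <stddef.h>

/* Blocksize ids used to index a context, in the spirit of bszid_t. */
typedef enum { BSZ_MR = 0, BSZ_NR, BSZ_KC, BSZ_COUNT } bszid_sketch_t;

typedef void (*ukr_sketch_ft)( int k, const double* a, const double* b,
                               double* c );

/* Hypothetical context: blocksizes plus a micro-kernel address. */
typedef struct
{
    int           blkszs[ BSZ_COUNT ];
    ukr_sketch_ft gemm_ukr;
} cntx_sketch_t;

/* Hypothetical "global kernel structure": the original source of info. */
static cntx_sketch_t gks_sketch;

/* Install a kernel configuration in the global structure. */
static void gks_sketch_set( int mr, int nr, int kc, ukr_sketch_ft ukr )
{
    gks_sketch.blkszs[ BSZ_MR ] = mr;
    gks_sketch.blkszs[ BSZ_NR ] = nr;
    gks_sketch.blkszs[ BSZ_KC ] = kc;
    gks_sketch.gemm_ukr         = ukr;
}

/* Initialize a context by copying the currently active configuration. */
static void cntx_sketch_init( cntx_sketch_t* cntx )
{
    assert( cntx != NULL );
    *cntx = gks_sketch;
}

/* Query a blocksize from a context by id. */
static int cntx_sketch_get_blksz( bszid_sketch_t id,
                                  const cntx_sketch_t* cntx )
{
    assert( cntx != NULL && id < BSZ_COUNT );
    return cntx->blkszs[ id ];
}

/* Default stack buffer size, following the formula in the message. The
   two input values here are made up for the example. */
#define SKETCH_SIMD_NUM_REGISTERS 16
#define SKETCH_SIMD_SIZE          32
#define SKETCH_STACK_BUF_MAX_SIZE \
    ( 2 * SKETCH_SIMD_NUM_REGISTERS * SKETCH_SIMD_SIZE )
```

    Because cntx_sketch_init() copies rather than references the global
    structure, a context initialized before a later gks_sketch_set() call
    keeps reporting the blocksizes it started with.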

commit dd856c2cb75a2221a503a73dde27790c34b91570
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Apr 11 10:39:18 2016 -0500

    Translated MIC kernel to KNL and cleaned up a bit. Only real change is lack of swizzle modifiers for FMA instructions (used bcast from memory instead).

commit 7f27431d3fffdda99c282ec412731d0a90cb32a7
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Apr 8 10:04:39 2016 -0500

    Copy mic kernel to knl for transliteration.

commit f8f02f0334ac020021e15a415bcd33aeea01deb4
Merge: 32c92d94 d1f8e5d9
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 6 11:37:05 2016 -0500

    Merge branch 'master' into const_correctness

commit 32c92d945c55708da0eb63be1771f8c5430e3910
Merge: 62914ccb 20af937b
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Apr 6 11:36:02 2016 -0500

    Merge branch 'master' into const_correctness

commit d1f8e5d9b2ecd054ed103f4d642d748db2d4f173
Merge: 20af937b c11d28ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 5 12:21:27 2016 -0500

    Merge pull request #60 from esauvage/master
    
    sgemm µkernel for bulldozer : bug correction for k%4 != 0

commit c11d28eed89d65494bc4019f04d046520866c0ff
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Sat Apr 2 21:15:48 2016 +0200

    cgemm µkernel for bulldozer : bug correction for k%4 != 0
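
    Illustration (not the kernel from this commit): a loop over k that is
    unrolled by four must also process the k % 4 leftover iterations, which
    is the class of case the subject line refers to. The sketch uses a
    plain dot product in C with a hypothetical name; the actual kernel code
    and the nature of its bug are not shown in this log.

```c
#include <assert.h>

/* Accumulate a[i] * b[i] over k elements, four at a time, then handle
   the remainder. Skipping the second loop would silently drop up to
   three terms whenever k is not a multiple of four. */
static double dot_unroll4_sketch( int k, const double* a, const double* b )
{
    assert( k >= 0 );

    int    k_iter = k / 4;   /* number of fully unrolled iterations */
    int    k_left = k % 4;   /* leftover iterations (0 to 3)        */
    double acc    = 0.0;

    for ( int i = 0; i < k_iter; ++i )
    {
        acc += a[ 0 ] * b[ 0 ];
        acc += a[ 1 ] * b[ 1 ];
        acc += a[ 2 ] * b[ 2 ];
        acc += a[ 3 ] * b[ 3 ];
        a += 4;
        b += 4;
    }

    for ( int i = 0; i < k_left; ++i )
        acc += a[ i ] * b[ i ];

    return acc;
}
```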

commit 20af937b57f82bb3acb09418d5c0206e1b24f2c7
Merge: 36c3abb0 fc61a114
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 31 14:37:30 2016 -0500

    Merge pull request #59 from devinamatthews/fix_testsuite_makefile
    
    Fix testsuite makefile

commit fc61a1143edeba4946d4b9915f1775bb08e643fc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 31 10:53:01 2016 -0500

    Fix formatting in configure.

commit 26379b14de630e3a6c6eef5dfe87ff001558a8a6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 31 10:45:48 2016 -0500

    Adjust paths in common.mk to support building from testsuite dir.

commit 36c3abb05fecb02d4a9ab13b2b69d133adf34583
Merge: 64b41fa5 917ce754
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 31 10:26:17 2016 -0500

    Merge pull request #58 from esauvage/master
    
    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer confi…

commit 356d854fc9e34642cc46e0e02a8ceb56114878af
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 30 16:33:15 2016 -0500

    Make symlink to common.mk in build directory.

commit edbb8470044f82ef959583ee09613a5a985292b5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Wed Mar 30 16:27:11 2016 -0500

    Refactor out some definitions which moved from make_defs.mk to Makefile for use in testsuite Makefile.

commit 917ce75482a543fef46553efff6c246939761e59
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Wed Mar 30 22:03:09 2016 +0200

    cgemm & zgemm micro-kernels for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 62914ccbcdb3c594f065dcfa65bd7e7b95c79283
Merge: bbf704bf 64b41fa5
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Tue Mar 29 15:24:25 2016 -0500

    Merge branch 'master' into const_correctness

commit 64b41fa554dff44b2f9ad48901b67c63836407a8
Merge: 1b09e343 0171ad58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 29 15:19:41 2016 -0500

    Merge pull request #54 from devinamatthews/more_config_opts
    
    More config opts

commit 1b09e343dfe5b48b4842e2cb96f41c8cc249bad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 29 12:55:28 2016 -0500

    Updated gcc version from 4.8 to 4.9 in .travis.yml.

commit 0171ad58997b3a5a9b76301511dbe0751fffc940
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Mar 28 13:55:06 2016 -0500

    Add icc and clang support for Intel architectures, fixes #47. 2bd036f fixes #49 BTW.

commit 3090fff64cc87ff2519a09f38e6b8699cf3cba11
Merge: 8624e365 4ca5d5b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 28 12:36:25 2016 -0500

    Merge pull request #44 from esauvage/master
    
    sgemm micro-kernel for FMA4 instruction set

commit e6e566426ac3ded7ef87cd8ff9be98accfdc4acc
Merge: 469429ec 8624e365
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Mar 26 14:10:15 2016 -0500

    Merge branch 'master' into more_config_opts

commit 8624e36543160739d954c4dbcc5a5594458f3a12
Merge: a315833f 2bd036f1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 26 13:56:28 2016 -0500

    Merge pull request #50 from devinamatthews/fix_noopt_avx
    
    Fix configuration issue where instruction set flags are not specified for debug builds.

commit 469429ec34e5b1a172ce35596f9c7afdaacac131
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 20:45:41 2016 -0500

     Fix LD_FLAGS -> LDFLAGS.

commit 8442d65c9ead0376fc5f2dfad62fd4862ab9b2b3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 20:06:48 2016 -0500

    Replace -march=native with specific architecture flags to support cross-compiling, and add icc support for Intel architectures.

commit 76099f20be1b49ac960f7e3c5a8296bbf4e1782d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 17:22:58 2016 -0500

    Add threading option to configure.

commit ad43eab4c7899d56d8d7caa6e2d92bc0581ea5a5
Merge: 9452bdb3 2bd036f1
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 15:00:02 2016 -0500

    Merge branch 'fix_noopt_avx' into more_config_opts

commit 9452bdb3afbf2d7f898134a091d7790817e7be9c
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 14:59:50 2016 -0500

    Add options for verbose make output and static/shared linking to configure.

commit 2bd036f1f9ce1ee0864365557f66d9415dd42de3
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 12:16:49 2016 -0500

    Fix configuration issue where instruction set flags are not specified for debug builds.

commit bbf704bf7501411964a63a68f1af541f612cf92d
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 25 09:55:35 2016 -0500

     Add missing const to bli_read_nway_from_env.

commit a315833f067944fb0bc14cf60f0c7dcb5dc897b6
Merge: 1d1a426d af92773f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 24 12:30:21 2016 -0500

    Merge pull request #48 from figual/master
    
    Updated and improved ARMv8 micro-kernels.

commit af92773f4f85a2441fe0c6e3a52c31b07253d08e
Author: figual <figual@ucm.es>
Date:   Wed Mar 23 22:07:02 2016 +0100

    Updated and improved ARMv8 micro-kernels.

commit a4d7729776d17d9bdf2341eacd70b9770b9ba8d2
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Mon Mar 21 09:55:21 2016 -0500

    Set default value for debug_type variable.

commit 0e2447fa55d8c5fa2b1fc4150073512495c5f9eb
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 17 16:32:05 2016 -0500

    Add const correctness to auxinfo_t struct (microkernels need update theoretically).

commit 1d1a426d18ec03754021456862a1f4d1dfec1fbf
Merge: 5a978fff d226dfa0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 7 15:17:53 2016 -0600

    Merge pull request #46 from devinamatthews/new-config-opts
    
    Add several changes to the build system.

commit d226dfa05190eb477b33563b1edccf8603973336
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Sat Mar 5 16:18:14 2016 -0600

    Add several changes to the build system.
    
    1) Add -- options.
    2) Add -d/--enable-debug option to enable debugging symbols with and without optimization.
    3) Allow user to specify CC at configure time, and determine vendor (gcc/icc/etc.). For now configurations enforce a particular vendor.
    4) Add make V=[0,1] option to control build verbosity.

commit 5a978fffdb8f09a81c89541d541d4a6830cd70a4
Merge: adb2b4e0 63e26423
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 4 17:26:58 2016 -0600

    Merge pull request #45 from devinamatthews/high_prec_timers
    
    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday

commit 63e264239053b913164a849dd8a45829087eaddc
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 4 13:17:50 2016 -0600

    Make sure that -lrt is linked on Linux.

commit 44fddd48dc1708a956803d1948f04429ec0d8700
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Fri Mar 4 12:36:38 2016 -0600

    Add missing \.

commit 7cabd2131f953de23e7015d760b0ddfda51b1251
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Mar 3 11:43:07 2016 -0600

    Use clock_gettime(CLOCK_MONOTONIC) and mach_absolute_time instead of gettimeofday.

commit adb2b4e096c78e8b2f85fd372cf0d5eb04af5be8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Mar 2 14:48:12 2016 -0600

    Fixing guard for non implemented partitioning through packed matrices

commit 4ca5d5b1fd6f2e4a8b2e139c5405475239581e51
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Tue Mar 1 21:33:01 2016 +0100

    sgemm micro-kernel for FMA4 instruction set (bulldozer configuration), based on x86_64/avx micro-kernel

commit 627d59b5ba06866b26f46e4434a0435b600925e3
Author: Etienne Sauvage <etienne.sauvage@gmail.com>
Date:   Mon Feb 29 21:53:12 2016 +0100

    symbolic link for bulldozer configuration to kernels

commit 2dc5c0ae038ed175fab85751803ada05734d1ba1
Merge: f2809fc5 3d0fae81
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 29 12:22:51 2016 -0600

    Merge pull request #40 from tkelman/bulldozer-symlink
    
    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer

commit f2809fc5f74466c755da6a5b4632853e634060b5
Merge: f86b94f2 8624a33c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Feb 27 13:06:03 2016 -0600

    Merge pull request #39 from devinamatthews/fix_f2c_conflicts
    
    Devin's f2c type namespace update.
    
    Details:
    - Added "bla_" prefix to f2c type names to prevent conflicts with external user code.
    - Removed most of the body of bli_f2c.h, which was unused.

commit 3d0fae810d942085d8f2d389820b4e0027577db8
Author: Tony Kelman <tony@kelman.net>
Date:   Thu Feb 25 23:24:03 2016 -0800

    Add symlink from config/bulldozer/kernels to kernels/x86_64/bulldozer
    
    to fix linking issue mentioned in #37 and https://groups.google.com/forum/#!topic/blis-devel/iypwljcaeEI

commit 8624a33ccc12dff6f6c4f92992ca5636af1576a6
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Feb 25 13:51:26 2016 -0600

    Fix remaining f2c conflicts.

commit 372eef0b6c0a535bf88d4b46b72f61266e8491ba
Author: Devin Matthews <dmatthews@utexas.edu>
Date:   Thu Feb 25 12:01:58 2016 -0600

    Fixed most conflicts after hack-n-slash of bli_f2c.h, cleanup in
    progress.

commit f86b94f206e2e09fa3221cc55c3dc5b05ca4775a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 23 18:12:34 2016 -0600

    Included missing blas2blis integer def to CBLAS.
    
    Details:
    - Added #include "bli_config_macro_defs.h" to all cblas_*.c files in
      compat/cblas/src. This has the effect of defining
      BLIS_BLAS2BLIS_INT_TYPE_SIZE to the default value if bli_config.h does
      not define it. Thanks to Tony Kelman for reporting this bug.
    - In cblas_i?amax.c, changed the type of the variable 'iamax' from 'int'
      to 'f77_int'. This eliminates a compiler warning and a potential
      runtime bug and/or crash when the size of an int differs from the size
      of f77_int (as determined by BLIS_BLAS2BLIS_INT_TYPE_SIZE).

commit 0b126de1342c11c65623bcb38e258e21e9244e3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 13 16:29:12 2015 -0600

    Consolidated packm_blk_var1 and packm_blk_var2.
    
    Details:
    - Consolidated the two blocked variants for packm into a single
      implementation (packm_blk_var1) and removed the other variant.
    - Updated all induced method _cntl_init() functions in frame/cntl/ind/
      to use the new blocked variant 1.
    - Defined two new macros, bli_is_ind_packed() and bli_is_nat_packed(),
      to detect pack_t schemas for induced methods and native execution,
      respectively.

commit 30e5eb29e060b97752f702d2ea5d101d950f53b2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 13 12:14:19 2015 -0600

    Minor changes to treatment of rs, cs in bli_obj.c.
    
    Details:
    - Applied a patch submitted by Devin Matthews that:
      - implements subtle changes to handling of somewhat unusual cases of
        row and column strides to accommodate certain tensor cases, which
        includes adding dimension parameters to _is_col_tilted() and
        _is_row_tilted() macros,
      - simplifies how buffers are sized when requesting BLIS-allocated
        objects,
      - re-consolidates bli_adjust_strides_*() into one function, and
      - defines 'restrict' keyword as a "nothing" macro for C++ and pre-C99
        environments.

commit f0a4f41b5acf55b41707ec821c4c5f9076dfbc24
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 12 15:22:50 2015 -0600

    Fixed unimplemented case in core2 sgemm ukernel.
    
    Details:
    - Implemented the "beta == 0" case for general stride output for the
      dunnington sgemm micro-kernel. This case had been, up until now,
      identical to the "beta != 0" case, which does not work when the
      output matrix has nan's and inf's. It had manifested as nan residuals
      in the test suite for right-side tests of ctrsm4m1a. Thanks to Devin
      Matthews for reporting this bug.

commit 42810bbfa0b8f006ecc5128d903909ec13ea63f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 12 12:07:46 2015 -0600

    Fixed minor bugs for uncommon obj_create cases.
    
    Details:
    - Separated bli_adjust_strides() into _alloc() and _attach() flavors so
      that the latter can avoid a test performed by the former, in which the
      rs and cs are overridden and set to zero if either matrix dimension is
      zero. Actually, we also disable this overriding behavior, even for the
      _alloc() case, since keeping the original strides (probably) does not
      hurt anything. The original code has been kept commented-out, though,
      in case an unintended consequence is later discovered.
    - Fixed a typo in an error check for general stride cases where rs == cs.

commit 3e6dd11467643fbc2cb45c13cec8dd6024232833
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 3 10:30:08 2015 -0600

    Minor re-expression in quadratic partitioning code.
    
    Details:
    - Minor change to quadratic equation solution code that avoids
      recomputation of the sqrt() parameter when the compiler is not
      smart enough to perform this optimization automatically.

commit 0694b722f7e4df00efb32639095a2aca80e67f52
Merge: 3e116f0a 33557ecc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 17:24:25 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit 3e116f0a2953f50b3c068759a775ad7ffae04e49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 17:18:23 2015 -0600

    Fixed imaginary bug in quadratic partitioning code.
    
    Details:
    - Fixed a bug in the relatively new quadratic partitioning code that,
      under the right conditions, would perform sqrt() on a negative value.
      If the solution is imaginary, we discard it and use an alternate
      partition width that assumes no diagonal intersection. That alternate
      width is actually already computed, so, the fix was quite simple.
      Thanks to Devangi Parikh for reporting this bug.

commit 33557ecccaf49b2569b7f3d7bcea52c2aab94c68
Author: Jeff Hammond <jeff.science@gmail.com>
Date:   Mon Nov 2 12:18:43 2015 -0800

    add Travis CI build status icon to the README

commit 4a502fbe77bd0f701108baaa559d9cfb483f88de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 2 13:28:34 2015 -0600

    Laid groundwork for runtime memory pool resizing.
    
    Details:
    - Changed bli_pool_finalize() so that the freeing begins with the block
      at top_index instead of block 0. This allows us to use the function
      for terminal finalization as well as temporary cleanup prior to
      reinitialization. Also, clear the pool_t struct upon _pool_finalize()
      in case it is called in the terminal case with some blocks still
      checked out to threads (in which case the threads will see the new
      block size as 0 and thus release the block as intended).
    - Added bli_pool_reinit(), which calls _pool_finalize() followed by
      _pool_init() with new parameters.
    - Added bli_mem_reinit(), which is based on bli_pool_reinit().
    - Added new wrapper, _mem_compute_pool_block_sizes(), which calls
      _mem_compute_pool_block_sizes_dt().
    - Updated bli_mem_release() so that the pblk_t is freed, via
      _pool_free_block(), if the block size recorded in the mem_t at the
      time the pblk_t was acquired is now different from the value in the
      pool_t.

commit 37e55ca39bdbddaec03ad30d43e8ad2b3e549c96
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 30 18:25:04 2015 -0500

    Fixed obscure 3m1/4m1a bugs in trmm[3] and trsm.
    
    Details:
    - Fixed a family of bugs in the triangular level-3 operations for
      certain complex implementations (3m1 and 4m1a) that only manifest if
      one of the register blocksizes (PACKMR/PACKNR, actually) is odd:
      - Fixed incorrect imaginary stride computation in bli_packm_blk_var2()
        for the triangular case.
      - Fixed the incorrect computation of imaginary stride, as stored in
        the auxinfo_t struct in trmm and trsm macro-kernels.
      - Fixed incorrect pointer arithmetic in the trsm macro-kernels in the
        cases where the register blocksize for the triangular matrix is
        odd. Introduced a new byte-granular pointer arithmetic macro,
        bli_ptr_add(), that computes the correct value.
    - Added cpp macro to bli_macro_defs.h for typeof() operator, defined in
      terms of __typeof__, which is used by bli_ptr_add() macro.
    - Disabled the row- vs. column-storage optimization in bli_trmm_front()
      for singleton problems because the inherent ambiguity of whether a
      scalar is row-stored or column-stored causes the wrong parameter
      combination code to be executed (by dumb luck of our checking for
      row storage first).
    - Added commented-out debugging lines to 3m1/4m1a and reference
      micro-kernels, and trsm_ll macro-kernel.

commit 46294d80e5a79c598e200e1c8ec2a642ff839971
Merge: d3159c57 a0a7b85a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 27 12:41:23 2015 -0500

    Merge pull request #35 from figual/master
    
    Fixed incomplete code in the double precision ARMv8 microkernel.

commit a0a7b85ac3e157af53cff8db0e008f4a3f90372c
Author: Francisco Igual <figual@ucm.es>
Date:   Tue Oct 27 08:59:15 2015 +0000

    Fixed incomplete code in the double precision ARMv8 microkernel.

commit d3159c5740c9ee7f8c0b661003aab6f00646ad6f
Merge: b489152e 7e03e45b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 21 14:54:00 2015 -0500

    Merge branch 'master' of github.com:flame/blis

commit b489152e112644ec3b6d19e687231a9607f7694f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 21 14:53:17 2015 -0500

    Use vzeroall in haswell micro-kernels.

commit 7e03e45bfe6c27c4fdbf06b1caa7f49e9a5fef49
Merge: 77ddb0b1 4f88c29f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 14 13:26:07 2015 -0500

    Merge pull request #33 from xianyi/master
    
    Enable Travis CI

commit 4f88c29f9e634cbb6fb22d8c88931f0ec78ad7db
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Oct 14 12:57:50 2015 -0500

    Detect Intel Broadwell (using Haswell config).

commit 4b0ac1a9984a93f7ad4369b10fca63991107d9f5
Merge: fe3e355c 77ddb0b1
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Oct 14 12:51:05 2015 -0500

    Merge branch 'upstream_master'

commit 77ddb0b1d31ada111dadf392766ba6d9210ed9fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 13 12:53:06 2015 -0500

    Removed flop-counting mechanism.
    
    Details:
    - Removed the optional flop-counting feature introduced in commit
      7574c994.

commit 276da366187460a4c8e6e0910e79cb39ce780bfe
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 12 11:43:03 2015 -0500

    Minor formatting change to README.md.

commit d17057446f5404824478e8a6cd08f242ab75544a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 12 11:39:49 2015 -0500

    Added "Getting Started" section to README.md.
    
    Details:
    - Added section to README.md file containing links to wikis with brief
      descriptions.

commit e7e1f2f7b601b21b50e3cdad8972cb3fe11018d3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 2 16:51:52 2015 -0500

    Minor updates to CREDITS, README files.

commit 55329906ecd7ce1ab910e4d30a29354a9172e7ea
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Sep 26 20:47:19 2015 -0500

    Minor edits to README.md, testsuite.
    
    Details:
    - Fixed typos in README.md.
    - Fixed column heading alignment for testsuite when matlab output is
      enabled.
    - Minor updates to test/3m4m/runme.sh and test/3m4m/Makefile.

commit bbebdb5793a8fd6aaf257012ab0272beaa04a0de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 25 14:47:27 2015 -0500

    Replaced README with README.md.
    
    Details:
    - Replaced the old (and short) README file with a much more comprehensive
      version written in github-flavored markdown. The new file is based on
      content taken from the old Google Code homepage.

commit e2e9d64a63485461192d9c2a6dd0183a8b71013c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 24 12:14:03 2015 -0500

    Load balance thread ranges for arbitrary diagonals.
    
    Details:
    - Expanded/updated interface for bli_get_range_weighted() and
      bli_get_range() so that the direction of movement is specified in the
      function name (e.g. bli_get_range_l2r(), bli_get_range_weighted_t2b())
      and also so that the object being partitioned is passed instead of an
      uplo parameter. Updated invocations in level-3 blocked variants, as
      appropriate.
    - (Re)implemented bli_get_range_*() and bli_get_range_weighted_*() to
      carefully take into account the location of the diagonal when computing
      ranges so that the area of each subpartition (which, in all present
      level-3 operations, is proportional to the amount of computation
      engendered) is as equal as possible.
    - Added calls to a new class of routines to all non-gemm level-3 blocked
      variants:
        bli_<oper>_prune_unref_mparts_[mnk]()
      where <oper> is herk, trmm, or trsm and [mnk] is chosen based on which
      dimension is being partitioned. These routines call a more basic
      routine, bli_prune_unref_mparts(), to prune unreferenced/unstored
      regions from matrices and simultaneously adjust other matrices which
      share the same dimension accordingly.
    - Simplified herk_blk_var2f, trmm_blk_var1f/b as a result of the new
      pruning routines.
    - Fixed incorrect blocking factors passed into bli_get_range_*() in
      bli_trsm_blk_var[12][fb].c
    - Added a new test driver in test/thread_ranges that can exercise the new
      bli_get_range_*() and bli_get_range_weighted_*() under a range of
      conditions.
    - Reimplemented m and n fields of obj_t as elements in a "dim"
      array field so that dimensions could be queried via index constant
      (e.g. BLIS_M, BLIS_N). Adjusted/added query and modification
      macros accordingly.
    - Defined mdim_t type to enumerate BLIS_M and BLIS_N indexing values.
    - Added bli_round() macro, which calls C math library function round(),
      and bli_round_to_mult(), which rounds a value to the nearest multiple
      of some other value.
    - Added miscellaneous pruning- and mdim_t-related macros.
    - Renamed bli_obj_row_offset(), bli_obj_col_offset() macros to
      bli_obj_row_off(), bli_obj_col_off().

commit fe3e355c9c5a6f65b8736b009e2d501b62a83ea1
Merge: efa641e3 4dd9dd3e
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Fri Aug 21 14:38:36 2015 -0500

    Merge branch 'upstream_master'

commit efa641e36b73abee34166a252e90e28a6281d92d
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Sat Aug 22 03:15:50 2015 +0800

    Try to fix the compiling bug on travis.

commit 4dd9dd3e1de626b51bfe85d9ee65f193d60e8d38
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 21 11:52:37 2015 -0500

    Fixed minor alignment ambiguity bug in bli_pool.c.
    
    Details:
    - Fixed a typecasting ambiguity in bli_pool_alloc_block() in which
      pointer arithmetic was performed on a void* as if it were a byte
      pointer (such as char*). Some compilers may have already been
      interpreting this situation as intended, despite the sloppiness.
      Thanks to Aleksei Rechinskii for reporting this issue.
    - Redefined pointer alignment macros to typecast to uintptr_t instead of
      siz_t.

commit 12ffd568b04feda57147c13b67717416a01c82f8
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Sat Aug 22 00:24:28 2015 +0800

    Add Travis CI.

commit ecc3ebb749e0861c27deda52b5f87236ede4901b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 29 13:31:12 2015 -0500

    CHANGELOG update (0.1.8)

commit 47caa33485b91ea6f2a5e386e61210c90c5f489f (tag: 0.1.8)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 29 13:31:09 2015 -0500

    Version file update (0.1.8)

commit ef0fbbbdb6148b96938733fce72cb4ed7dad685e
Merge: fdfe14f1 d4b89136
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 9 13:54:54 2015 -0500

    Merge branch 'master' of github.com:flame/blis

commit fdfe14f1e17ba5a2f8dfa0bdb799c6b0e730211b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 9 13:52:39 2015 -0500

    Added support for Intel Haswell/Broadwell.
    
    Details:
    - Added sgemm and dgemm micro-kernels, which employ 256-bit AVX vectors
      and FMA instructions. (Complex support is currently provided by default
      induced method, 4m1a.)
    - Added a 'haswell' configuration, which uses the aforementioned kernels.
    - Inserted auto-detection support for haswell configuration in
      build/auto-detect/cpuid_x86.c.
    - Modified configure script to explicitly echo when automatic or manual
      configuration is in progress.
    - Changed beta scalar in test_gemm.c module of test suite from -1.0 to 0.9.

commit d4b891369c1eb0879ade662ff896a5b9a7fca207
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 7 10:06:53 2015 -0500

    Added 'carrizo' configuration.
    
    Details:
    - Added a new configuration for AMD Excavator-based hardware also known
      as Carrizo when referring to the entire APU. This configuration uses
      the same micro-kernels as the piledriver, but with different
      cache blocksizes.

commit 0b7255a642d56723f02d7ca1f8f21809967b8515
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 12:01:50 2015 -0500

    CHANGELOG update (0.1.7)

commit 267253de8a7be546ce87626443ee38701c1d411f (tag: 0.1.7)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 12:01:49 2015 -0500

    Version file update (0.1.7)

commit 7cd01b71b5e757a6774625b3c9f427f5e7664a76
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 19 11:31:53 2015 -0500

    Implemented dynamic allocation for packing buffers.
    
    Details:
    - Replaced the old memory allocator, which was based on statically-
      allocated arrays, with one based on a new internal pool_t type, which,
      combined with a new bli_pool_*() API, provides a new abstract data
      type that implements the same memory pool functionality but with blocks
      from the heap (i.e., malloc() or equivalent). Hiding the details of the
      pool in a separate API also allows for a much simpler bli_mem.c family
      of functions.
    - Added a new internal header, bli_config_macro_defs.h, which enables
      sane defaults for the values previously found in bli_config. Those
      values can be overridden by #defining them in bli_config.h the same
      way kernel defaults can be overridden in bli_kernel.h. This file most
      resembles what was previously a typical configuration's bli_config.h.
    - Added a new configuration macro, BLIS_POOL_ADDR_ALIGN_SIZE, which
      defaults to BLIS_PAGE_SIZE, to specify the alignment of individual
      blocks in the memory pool. Also added a corresponding query routine to
      the bli_info API.
    - Deprecated (once again) the micro-panel alignment feature. Upon further
      reflection, it seems that the goal of more predictable L1 cache
      replacement behavior is outweighed by the harm caused by non-contiguous
      micro-panels when k % kc != 0. I honestly don't think anyone will even
      miss this feature.
    - Changed bli_ukr_get_funcs() and bli_ukr_get_ref_funcs() to call
      bli_cntl_init() instead of bli_init().
    - Removed query functions from bli_info.c that are no longer applicable
      given the dynamic memory allocator.
    - Removed unnecessary definitions from configurations' bli_config.h files,
      which are now pleasantly sparse.
    - Fixed incorrect flop counts in addv, subv, scal2v, scal2m testsuite
      modules. Thanks to Devangi Parikh for pointing out these
      miscalculations.
    - Comment, whitespace changes.

commit 9848f255a3bab17d1139c391cca13ff3f1ffe6ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 11 19:14:22 2015 -0500

    Added early return to API-level _init() routines.
    
    Details:
    - Added conditional code that returns early from the API-level _init()
      routines if the API is already initialized. Actually meant for this to
      be included in 5f93cbe8.

commit 5f93cbe870f3478870e15581e7fd450dad5bba1e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 11 18:52:12 2015 -0500

    Introduced API-level initialization.
    
    Details:
    - Added API-level initialization state to _const, _error, _mem, _thread,
      _ind, and _cntl APIs. While this functionality will mostly go unused,
      adding minuscule overhead at init-time, there will be at least one
      instance in the near future where, in order to avoid an infinite loop,
      a certain portion of the initialization will call a query function that
      itself attempts to call bli_init(). API-level initialization will allow
      this later stage to verify that an earlier stage of initialization has
      completed, even if the overall call to bli_init() has not yet returned.
    - Added _is_initialized() functions for each API, setting the underlying
      bool_t during _init() and unsetting it during _finalize().
    - Comment, whitespace changes.

commit ee129c6b028bc5ac88da7c74fde72c49803742ff
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 10 12:53:28 2015 -0500

    Fixed bugs in _get_range(), _get_range_weighted().
    
    Details:
    - Fixed some bugs that only manifested in multithreaded instances of
      some (non-gemm) level-3 operations. The bugs were related to invalid
      allocation of "edge" cases to thread subpartitions. (Here, we define
      an "edge" case to be one where the dimension being partitioned for
      parallelism is not a whole multiple of whatever register blocksize
      is needed in that dimension.) In BLIS, we always require edge cases
      to be part of the bottom, right, or bottom-right subpartitions.
      (This is so that zero-padding only has to happen at the bottom, right,
      or bottom-right edges of micro-panels.) The previous implementations
      of bli_get_range() and _get_range_weighted() did not adhere to this
      implicit policy and thus produced bad ranges for some combinations of
      operation, parameter cases, problem sizes, and n-way parallelism.
    - As part of the above fix, the functions bli_get_range() and
      _get_range_weighted() have been renamed to use _l2r, _r2l, _t2b,
      and _b2t suffixes, similar to the partitioning functions. This is
      an easy way to make sure that the variants are calling the right
      version of each function. The function signatures have also been
      changed slightly.
    - Comment/whitespace updates.
    - Removed unnecessary '/' from macros in bli_obj_macro_defs.h.

commit 9135dfd69d39f3bbd75034f479f27a78dbfebcce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 5 13:37:44 2015 -0500

    Minor updates to test/3m4m files.

commit d62ceece943b20537ec4dd99f25136b9ba2ae340
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 3 12:56:45 2015 -0500

    Minor update to test/3m4m/runme.sh.
    
    Details:
    - Removed some stale script code that should have been removed
      during 590bb3b8c.

commit b6ee82a3d421c9c4f1eb6848c7c6e37aa46de799
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 3 12:14:23 2015 -0500

    Minor cleanup to bli_init() and friends.
    
    Details:
    - Spun-off initialization of global scalar constants to bli_const_init()
      and of threading stuff to bli_thread_init().
    - Added some missing _finalize() functions, even when there is nothing
      to do.

commit 1213f5cebabc1637ce9dd45c4bfa87bb93677c29
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 2 13:27:47 2015 -0500

    POSIX thread bugfixes/edits to bli_init.c, _mem.c.
    
    Details:
    - Fixed a sort-of bug in bli_init.c whereby the wrong pthread mutex
      was used to lock access to initialization/finalization actions.
      But everything worked out okay as long as bli_init() was called by
      single-threaded code.
    - Changed to static initialization for memory allocator mutex in
      bli_mem.c, and moved mutex to that file (from bli_init.c).
    - Fixed some type mismatches in bli_threading_pthreads.c that resulted
      in compiler warnings.
    - Fixed a small memory leak with allocated-but-never-freed (and unused)
      pthread_attr_t objects.
    - Whitespace changes to bli_init.c and bli_mem.c.

commit 590bb3b8c5c0389159c5a9451b6c156c5f237e8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun May 24 16:02:53 2015 -0500

    Backed-out adjusted dim changes to test/3m4m.
    
    Details:
    - Reverted most changes applied during commit ec25807b.

commit ec25807b26da943868f0d0517c3720e50181b8f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 10 13:23:50 2015 -0500

    Tweaks to test/3m4m to test with adjusted dims.
    
    Details:
    - Updated test/3m4m driver files to build test drivers that allow
      comparison of real "asm_blis" results to complex "asm_blis" results,
      except with the latter's problem sizes adjusted so that problems are
      generated with equal flop counts.

commit 426b6488580a92bf071a62dc319a9c837ce39821
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 8 15:12:21 2015 -0500

    Fixed a packing bug that manifested in trsm_r.
    
    Details:
    - Fixed a bug that caused a memory leak in the contiguous memory
      allocator. Because packm_init() was using simple aliasing when
      a subpartition object was marked as zeros by bli_acquire_mpart_*(),
      the "destination" pack object's mem_t entry was being overwritten
      by the corresponding field of the "source" object (which was likely
      NULL). This prevented the block from being released back to the
      memory allocator. But this bug only manifested when changing the
      location of packing B from outside the var1 loop to inside the
      var3 loop, and only for trsm with triangular B (side = right). The
      bug was fixed by changing the type of alias used in packm_init()
      when handling zero partition cases. Specifically, we now use
      bli_obj_alias_for_packing(), which does not clobber the destination
      (pack) object's mem_t field. Thanks to Devangi Parikh for this bug
      report.

commit c84286d5cef48f16d83831baac1f46b9856b9a36
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 4 15:39:14 2015 -0500

    More minor tweaks to test/3m4m.
    
    Details:
    - Added a line of output that forces matlab to allocate the entire array
      up-front.
    - Re-enabled real domain benchmarks in runme.sh, which were temporarily
      disabled.

commit 309717c8ebf4ef1369f15cf41340e13c25b41573
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 19:28:49 2015 -0500

    More tweaks to test/3m4m, configurations.
    
    Details:
    - Fixed incorrect number of mc_x_kc memory blocks in
      sandybridge/bli_config.h.
    - Enabled OpenMP multithreading in piledriver/bli_config.h.
    - More updates to test/3m4m driver files.

commit 4baf3b9c69b2f648be9e46e07ccc9859dd675828
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 16:44:32 2015 -0500

    Tweaked test/3m4m driver, including acml support.
    
    Details:
    - Added ACML support to test/3m4m driver Makefile and runme.sh script.

commit a32f7c49ca4ea869d2a6c66818780f4321743d67
Merge: 349e075a 4bfd1ce8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 3 08:28:11 2015 -0500

    Merge pull request #23 from xianyi/master
    
    Add auto-detecting CPU on configure stage.

commit 349e075ad6a8e2a1211d94f36d24828c9d44b052
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 2 18:12:28 2015 -0500

    Tweaks to sandybridge config, test/3m4m driver.
    
    Details:
    - Enable OpenMP support by default in sandybridge's bli_config.h.
    - Reorganized sandybridge's bli_kernel.h.
    - Updated 3m4m Makefile, runme.sh to also test MKL implementation.

commit 4bfd1ce8ca93f93d170dd2715f0a32027b417b46
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Thu Apr 2 16:40:21 2015 -0500

    Detect NEON for cortex-a9 and cortex-a15.

commit aa6eec4f43137057276fe6119bdbfb5c52682527
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Thu Apr 2 16:03:44 2015 -0500

    Detect the CPU architecture. Support ARM cores.
    
    Detect the CPU architecture using the compiler's predefined macros.
    Then, detect the CPU cores.
    
    Support detecting x86 and ARM architectures.

commit 2947cfb749c937b0f62fac36cc92f123bd45b53c
Author: Zhang Xianyi <traits.zhang@gmail.com>
Date:   Wed Apr 1 12:24:00 2015 -0500

    Add auto-detecting CPU on configure stage.
    e.g. /Path_to_BLIS/configure auto
    
    For now, it only supports detecting x86 CPUs.

commit 26a4b8f6f985597f80e0174990bf541f1d9bafac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 1 10:44:54 2015 -0500

    Implemented 3m2, 3m3 induced algorithms (gemm only).
    
    Details:
    - Defined a new "3ms" (separated 3m) pack schema and added appropriate
      support in packm_init(), packm_blk_var2().
    - Generalized packm_struc_cxk_3mi to take the imaginary stride (is_p)
      as an argument instead of computing it locally. Exception: for trmm,
      is_p must be computed locally, since it changes for triangular
      packed matrices. Also exposed is_p in interface to dt-specific
      packm_blk_var2 (and _var1, even though it does not use imaginary
      stride).
    - Renamed many functions/variables from _3mi to _3mis to indicate that
      they work for either interleaved or separated 3m pack schemas.
    - Generalized gemm and herk macro-kernels to pass in imaginary stride
      rather than compute them locally.
    - Added support for 3m2 and 3m3 algorithms to frame/ind, including 3m2-
      and 3m3-specific virtual micro-kernels.
    - Added special gemm macro-kernels to support 3m2 and 3m3.
    - Added support for 3m2 and 3m3 to testsuite.
    - Corrected the type of the panel dimension (pd_) in various macro-
      kernels from inc_t to dim_t.
    - Renamed many functions defined in bli_blocksize.c.
    - Moved most induced-related macro defs from frame/include to
      frame/ind/include.
    - Updated the _ukernel.c files so that the micro-kernel function pointers
      are obtained from the func_t objects rather than the cpp macros that
      define the function names.
    - Updated test/3m4m driver, Makefile, and run script.
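
    Illustrative sketch (not the BLIS implementation): the 3m family of
    induced methods builds on the standard identity that a complex product
    needs only three real multiplications, one of which uses the summed
    real and imaginary parts. The scalar function below is a hypothetical
    stand-in; in BLIS the same idea is applied to packed sub-panels.

```c
#include <assert.h>
#include <stddef.h>

/* Computes (a_r + i*a_i) * (b_r + i*b_i) using three real multiplications:
     p1 = a_r * b_r
     p2 = a_i * b_i
     p3 = (a_r + a_i) * (b_r + b_i)
   so that real = p1 - p2 and imag = p3 - p1 - p2. */
static void cmul_3m_sketch(double a_r, double a_i,
                           double b_r, double b_i,
                           double *c_r, double *c_i)
{
    assert(c_r != NULL && c_i != NULL);

    double p1 = a_r * b_r;
    double p2 = a_i * b_i;
    double p3 = (a_r + a_i) * (b_r + b_i);

    *c_r = p1 - p2;
    *c_i = p3 - p1 - p2;
}
```

    For example, (1 + 2i)(3 + 4i) yields a real part of -5 and an imaginary
    part of 10.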

commit ddf62ba7d2da08225b201585b85e06c967767dea
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:27:51 2015 -0500

    Refuse to free the packm thread info if it uses the single threaded version

commit 016fc587584d958a0e430a56a5e2c05022ac2f17
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:23:02 2015 -0500

    Don't free packm thread info if it is null

commit 00a443c529a60862a57b93e303a0b3212c9b1df4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 27 14:11:07 2015 -0500

    Use bli_malloc instead of malloc for the thread info paths

commit f1a6b7d02861ccebdc500ea98778cc0f6cddad17
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 18 15:37:10 2015 -0500

    Reorganized code for induced complex methods.
    
    Details:
    - Consolidated most of the code relating to induced complex methods
      (e.g. 4mh, 4m1, 3mh, 3m1, etc.) into frame/ind. Induced methods
      are now enabled on a per-operation basis. The current "available"
      (enabled and implemented) implementation can then be queried on
      an operation basis. Micro-kernel func_t objects as well as blksz_t
      objects can also be queried in a similar manner.
    - Redefined several micro-kernel and operation-related functions in
      bli_info_*() API, in accordance with above changes.
    - Added mr and nr fields to blksz_t object, which point to the mr
      and nr blksz_t objects for each cache blocksize (and are NULL for
      register blocksizes). Renamed the sub-blocksize field "sub" to
      "mult" since it is really expressing a blocksize multiple.
    - Updated bli_*_determine_kc_[fb]() for gemm/hemm/symm, trmm, and
      trsm to correctly query mr and nr (for purposes of nudging kc).
    - Introduced an enumerated opid_t in bli_type_defs.h that uniquely
      identifies an operation. For now, only level-3 id values are defined,
      along with a generic, catch-all BLIS_NOID value.
    - Reworked testsuite so that all induced methods that are enabled
      are tested (one at a time) rather than only testing the first
      available method.
    - Reformatted summary at the beginning of testsuite output so that
      blocksize and micro-kernel info is shown for each induced method
      that was requested (as well as native execution).
    - Reduced the number of columns needed to display non-matlab
      testsuite output (from approx. 90 to 80).

commit 8d5169ccda954e5f72944308a036dcb7ebfc9097
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 18 11:38:08 2015 -0500

    Fixed bug in release of mem_t buffer.
    
    Details:
    - Fixed a bug that affects all level-2 and level-3 blocked variants. The
      bug only manifested, however, if the packing of operands (A and B in
      gemm, for example) spanned multiple nodes in the control tree. Until
      recently, the main consumers of packm were level-3 operations, all of
      which packed both input operands from blocked variant 1 (B outside of
      the loop, and A within the loop). This particular usage masked a flaw
      in the code whereby bli_obj_release_pack() would always release the
      underlying mem_t buffer (provided it was allocated), even if the buffer
      was not allocated in the current variant. This has been fixed by
      replacing all calls to bli_obj_release_pack() with calls to a new
      function, bli_packm_release(), which takes the same control tree node
      argument passed into the object's corresponding call to packm_init()
      or packv_init(). bli_packm_release() then proceeds to invoke
      bli_obj_release_pack() only if the control tree node indicates that
      packing was requested. Thanks to Devangi Parikh for identifying this
      bug.

commit c0acca0f5182ba96fd39c9d10b34a896a6e74206
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 3 10:56:22 2015 -0600

    Clarified comments in testsuite input.operations.

commit 03ba9a6b17861d9e1adc0cf924439c4d7e860d19
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 24 10:33:28 2015 -0600

    Removed some 'old' directories.

commit a86db60ee270cdeb745ae7cf68f9e0becc9f522d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 23 18:42:39 2015 -0600

    Extensive renaming of 3m/4m-related files, symbols.
    
    Details:
    - Renamed all remaining 3m/4m packing files and symbols to 3mi/4mi
      ('i' for "interleaved"). Similar changes to 3M/4M macros.
    - Renamed all 3m/4m files and functions to 3m1/4m1.
    - Whitespace changes.

commit 8cf8da291a0fb2f491f410969a76ec0fbda47faf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 20 15:24:27 2015 -0600

    Minor updates to induced complex mode management.
    
    Details:
    - Relocated bli_4mh.c, bli_4mb.c, bli_4m.c, bli_3mh.c, bli_3m.c (and
      associated headers) from frame/base to frame/base/induced.
    - Added bli_xm.? to frame/base/induced, which implements
      bli_xm_is_enabled(), which detects whether ANY induced complex method
      is currently enabled.
    - The new function bli_xm_is_enabled() is now used in bli_info.c to
      detect when an induced complex method is used, so we know when to
      return blocksizes from one of the induced methods' blocksize objects.

commit 411e637ee7d1083a84f58f08938d51e63d7c3c9a
Merge: c2569b88 fc0b7712
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Fri Feb 20 20:39:25 2015 -0600

    Merge branch 'master' of http://github.com/flame/blis

commit c2569b8803d4ccc1d7b6f391713461b51443601d
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Fri Feb 20 20:38:19 2015 -0600

    Fixed a memory leak in freeing the thread infos

commit fc0b771227abf86d81f505b324f69f6e83db1d8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 20 11:47:44 2015 -0600

    Added max(mr,nr) to kc in static mem pools.
    
    Details:
    - Changed the static memory definitions to compute the maximum register
      blocksize for each datatype and add it to kc when computing the size
      of blocks of A and B. This formally accounts for the nudging of kc
      up to a multiple of mr or nr at runtime for triangular operations
      (e.g. trmm).
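
    Illustrative sketch (not the actual static memory pool macros): the
    sizing rule above can be read as padding kc by max(mr, nr) before
    multiplying out the block dimensions. The function names are
    hypothetical, and alignment padding is deliberately ignored here.

```c
#include <assert.h>
#include <stddef.h>

static size_t max_size_sketch(size_t a, size_t b)
{
    return (a > b) ? a : b;
}

/* Bytes needed for one mc-by-kc block of A when kc may be nudged upward
   at runtime to a multiple of mr or nr. */
static size_t block_a_bytes_sketch(size_t mc, size_t kc,
                                   size_t mr, size_t nr,
                                   size_t elem_size)
{
    assert(elem_size > 0);

    size_t kc_padded = kc + max_size_sketch(mr, nr);
    return mc * kc_padded * elem_size;
}
```

    With mc = 96, kc = 256, mr = 8, nr = 4 and 8-byte elements, the sketch
    reserves 96 * 264 * 8 = 202752 bytes.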

commit af32e3a608631953ef770341df10a14a991bf290
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Thu Feb 19 22:51:11 2015 -0600

    Fixed a bug where get_range_weighted would return end = 0 for small problem sizes

commit 441d47542a64e131578d00da7404c1ed387a721c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 19 17:06:10 2015 -0600

    Renamed 3m and 4m symbols/macros to 3mi and 4mi.
    
    Details:
    - Renamed several variables and macros from 3m/4m to 3mi/4mi. This is
      because those packing schemas were always implicitly "interleaved".
      This new naming scheme will make way for new schemas that separate
      instead of interleave the real and imaginary (and summed) parts.
    - Expanded the pack format sub-field of the pack schema field of the
      info_t to 4 bits (from 3). This will allow for more schema types
      going forward.
    - Removed old _cntl.c files for herk3m, herk4m, trmm3m, trmm4m.

commit 518a1756ccf02122b96fc437b538604a597df42a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 19 14:27:09 2015 -0600

    Fixed indexing bug for trmm3 via 3mh, 4mh.
    
    Details:
    - Fixed a bug that only affected trmm3 when performed via 3mh or 4mh,
      whereby micro-panels of the triangular matrix were packed with "dead
      space" between them due to failing to adjust for the fact that pointer
      arithmetic was occurring in units of complex elements while the data
      being packed consisted of real elements. It turns out that the macro-
      kernel suffered from the same bug, meaning the panels were actually
      being packed and read consistently. The only way I was able to
      discover the bug in the first place was because the packed block of A
      was overflowing into the beginning of the packed row panel of B using
      the sandybridge configuration.

commit 493087d730f01d5169434f461644e5633f48a42f
Merge: 650d2a6f 25021299
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 18 09:45:51 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit 25021299b670775df8ca9c87910c63d7e74ed946
Merge: fe2b8d39 f05a5763
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 11 20:03:21 2015 -0600

    Merge branch 'master' of github.com:flame/blis

commit fe2b8d39a445ac848686e78c7540fd046cb95492
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 11 19:33:10 2015 -0600

    Fixed an obscure bug in 3mh/3m/4mh/4m packing.
    
    Details:
    - Modified bli_packm_blk_var1.c and _var2.c to increase the triangular
      case's panel increment by 1 if it would otherwise be odd. This is
      particularly necessary in _var2.c when handling the interleaved 3m
      or ro/io/rpi pack schemas, since division of an odd number by 2 can
      happen if both the panel length and the panel packing dimension
      (register packing blocksize) are odd, thus making their product odd.
    - Modified bli_packm_init.c so that panel strides are increased by 1
      if they would otherwise be odd, even for non-3m related packing.
    - Modified the trmm and trsm macro-kernels so that triangular packed
      micro-panels are traversed with this new "increment by 1 if odd"
      policy.
    - Added sanity checks in trmm and trsm macro-kernels that would result
      in an abort() if the conditions that would lead to a "divide odd
      integer by 2" scenario ever manifest.
    - Defined bli_is_odd(), _is_even() macros in bli_scalar_macro_defs.h.
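
    Illustrative sketch (not the actual macros or packm code): the
    "increment by 1 if odd" policy can be expressed as below, so that a
    later division of the panel stride by 2 is always exact. All names are
    hypothetical.

```c
#include <assert.h>

static int is_odd_sketch(long a)  { return (a % 2) != 0; }
static int is_even_sketch(long a) { return (a % 2) == 0; }

/* Panel stride for a micro-panel of the given length and packing
   dimension, bumped by one if the product would otherwise be odd. */
static long panel_stride_sketch(long panel_len, long pack_dim)
{
    assert(panel_len >= 0 && pack_dim >= 0);

    long ps = panel_len * pack_dim;
    if (is_odd_sketch(ps))
        ps += 1;
    return ps;
}
```

    A panel length of 5 with a packing dimension of 3 gives 15, which the
    sketch rounds up to 16.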

commit 650d2a6ff2e593151a296ca86b5214afcc747afc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 9 14:59:20 2015 -0600

    Added initial support for imaginary stride.
    
    Details:
    - Added an imaginary stride field ("is") to obj_t.
    - Renamed bli_obj_set_incs() macro to bli_obj_set_strides().
    - Defined bli_obj_imag_stride() and bli_obj_set_imag_stride() and
      added invocations in key locations.
    - Added some basic error-checking related to imaginary stride.
    - For now, imaginary stride will not be exposed into the most-used
      BLIS APIs such as bli_obj_create(), and certainly not the
      computational APIs such as bli_dgemm().

commit f05a57634a7c8e3864b25b3335d1194c1ea1aeb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 8 19:40:34 2015 -0600

    Defined gemm cntl function to query ukrs func_t.
    
    Details:
    - Added a new function, bli_gemm_cntl_ukrs(), that returns the func_t*
      for the gemm micro-kernels from the leaf node of the control tree.
      This allows all the func_t* fields from higher-level nodes in the tree
      to be NULL, which makes the function that builds the control trees
      slightly easier to read.
    - Call bli_gemm_cntl_ukrs() instead of the cntl_gemm_ukrs() macro in
      all bli_*_front() functions (which is needed to apply the row/column
      preference optimization).
    - In all level-3 bli_*_cntl_init() functions, changed the _obj_create()
      function arguments corresponding to the gemm_ukrs fields in higher-
      level cntl tree nodes to NULL.
    - Removed some old her2k macro-kernels.

commit cefd3d5d2001264de17cf63dae541f890cb9daaf
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 5 11:09:12 2015 -0600

    A couple of functions were incorrectly ifdeffed away on Xeon Phi. Fixed this

commit 7574c9947d57a19f613880e3b9f62f8c8f6df4ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 4 12:11:55 2015 -0600

    Added basic flop-counting mechanism (level-3 only).
    
    Details:
    - Added optional flop counting to all level-3 front-ends, which is
      enabled via BLIS_ENABLE_FLOP_COUNT. The flop count can be
      reset at any time via bli_flop_count_reset() and queried via
      bli_flop_count(). Caveats:
      - flop counts are approximate for her[2]k, syr[2]k, trmm, and
        trsm operations;
      - flop counts ignore extra flops due to non-unit alpha;
      - flop counts do not account for situations where beta is zero.
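
    Illustrative sketch (not the BLIS implementation): a counter in the
    spirit of the mechanism above, assuming the usual 2*m*n*k estimate for
    a real-domain gemm. Consistent with the caveats listed, it ignores
    alpha and beta. All names ending in _sketch are hypothetical.

```c
#include <assert.h>

/* Running total of (approximate) floating-point operations. */
static double flop_total_sketch = 0.0;

static void flop_count_reset_sketch(void)
{
    flop_total_sketch = 0.0;
}

static double flop_count_query_sketch(void)
{
    return flop_total_sketch;
}

/* Called from a hypothetical gemm front-end: one multiply and one add
   per (i, j, p) triple gives 2*m*n*k flops. */
static void flop_count_add_gemm_sketch(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    flop_total_sketch += 2.0 * (double)m * (double)n * (double)k;
}
```

    After a reset, recording a 10 x 20 x 30 gemm leaves the counter at
    12000.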

commit ceda4f27d1f1bcf19320e09848e0f2e3b9941e6c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 29 13:22:54 2015 -0600

    Implemented bli_obj_imag_equals().
    
    Details:
    - Implemented a new function, bli_obj_imag_equals(), which compares the
      imaginary part of the first argument to the second argument, which may
      be a BLIS_CONSTANT or of a regular real datatype.

commit 81114824a05a9053229efd577a8a94a856deda93
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 6 12:15:21 2015 -0600

    Minor 4m/3m consolidation to mem_pool_macro_defs.h.
    
    Details:
    - Merged the 4m and 3m definitions in bli_mem_pool_macro_defs.h to
      reduce code and improve readability.

commit 36a9b7b7436d9423ba4de2a9f85cfcd43577b783
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Dec 17 21:53:50 2014 +0000

    reduced the default number of MC by KC blocks for bgq

commit c60619c7c3568f044a849abbab60209aa7455423
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 16 17:08:22 2014 -0600

    Minor tweaks for 3m4m test drivers.
    
    Details:
    - Changed gemm_kc blocksizes to be reduced by two-thirds instead of
      half.
    - Changed 3m4m/test_gemm.c driver to divide by 3 instead of 2 when
      computing the fixed k dimension.
    - Fixed runme.sh so that it would use multiple threads for s/dgemm
      cases.

commit c6929ba6a5e6f633a7295e979a2b8df8c7ecdb1b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 16 11:27:50 2014 -0600

    Added 4m_1b to test/3m4m test driver and script.

commit 785d480805fc0d6f4251b5499933515740b6b2a7
Merge: 9456f330 4156c088
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 12 14:34:19 2014 -0600

    Merge branch 'master' of github.com:flame/blis

commit 9456f330af4617f9ee32972d51f974aa2d84f97b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 12 14:31:57 2014 -0600

    Added 4m_1b implementation for gemm.
    
    Details:
    - Added yet another 4m-based implementation for complex domain level-3
      operations. This method, which the 3m/4m paper identifies as Algorithm
      "4m_1b", fissures the first loop around the micro-kernel so that the
      real sub-panel of the current micro-panel of B is multiplied against
      (both sub-panels of) all micro-panels of A, before doing the same for
      the imaginary sub-panel of the micro-panel of B. For now, only gemm is
      supported, and 4m_1b (labeled "4mb" within the framework) is not yet
      integrated into the test suite.
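
    Illustrative sketch (not the BLIS macro-kernel): the ordering described
    above, shown at whole-matrix granularity rather than per micro-panel.
    The real part of B is multiplied against both parts of A first, and
    only then the imaginary part of B. Row-major storage and all function
    names are assumptions made for this example.

```c
#include <assert.h>

/* c += alpha * a * b for real row-major matrices:
   a is m x k, b is k x n, c is m x n. */
static void real_gemm_acc_sketch(int m, int n, int k, double alpha,
                                 const double *a, const double *b,
                                 double *c)
{
    assert(m >= 0 && n >= 0 && k >= 0);

    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j)
            for (int p = 0; p < k; ++p)
                c[i * n + j] += alpha * a[i * k + p] * b[p * n + j];
}

/* Complex C += A * B via four real products, ordered by the part of B. */
static void gemm_4m_1b_sketch(int m, int n, int k,
                              const double *a_r, const double *a_i,
                              const double *b_r, const double *b_i,
                              double *c_r, double *c_i)
{
    /* Pass 1: real part of B against both parts of A. */
    real_gemm_acc_sketch(m, n, k,  1.0, a_r, b_r, c_r);
    real_gemm_acc_sketch(m, n, k,  1.0, a_i, b_r, c_i);

    /* Pass 2: imaginary part of B against both parts of A. */
    real_gemm_acc_sketch(m, n, k, -1.0, a_i, b_i, c_r);
    real_gemm_acc_sketch(m, n, k,  1.0, a_r, b_i, c_i);
}
```

    For 1 x 1 operands A = 1 + 2i and B = 3 + 4i with C initially zero,
    the result is C = -5 + 10i.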

commit 4156c0880d9aea4ff04a9c4fa139ba8c437d8bfb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 9 16:03:14 2014 -0600

    Fixed obscure level-2 packing / general stride bug.
    
    Details:
    - Fixed a bug in certain structured level-2 operations that manifested
      only when the structured matrix was provided to BLIS as a matrix stored
      with general stride. The bug was introduced in c472993b when the
      densify field was removed from the packm control tree node and
      associated APIs. Since then, the packed object was unconditionally
      marked with an uplo field of BLIS_DENSE. This is fine for level-3
      operations where micro-panels are always densified, but in level-2
      contexts, the underlying unblocked variant (fused or unfused) of
      structured operations (e.g. trmv) still needs to know whether to
      execute its "lower" or "upper" branches of code. Since this field
      was unconditionally being set to BLIS_DENSE, the unblocked variants
      always executed the "else" branch, which happened to be the
      "lower" case code. Thus, running an upper case produced the wrong
      answer. This most obviously manifested in the form of failures for
      trmm, trmm3, and trsm in the test suite.
      The bug was fixed by setting the packed object's uplo field to
      BLIS_DENSE only if the schema indicated that micro-panels were to be
      packed. Otherwise, we can assume we are packing to regular row or
      column storage, as is the case with level-2 packing. Thanks to
      Francisco Igual for reporting the testsuite failures and ultimately
      leading us to this bug.

commit 689f60a578b461119e9ea90c74f642b9eb79addb
Merge: bef24e67 483e4d6a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Dec 7 14:03:30 2014 -0600

    Merge pull request #21 from figual/master
    
    Adding armv8a configuration and micro-kernels.

commit 483e4d6a3fdbef9d9ab47fb674c9476c70ca9f0f
Author: Francisco D. Igual <figual@ucm.es>
Date:   Sun Dec 7 20:27:49 2014 +0100

    Adding armv8a configuration and micro-kernels.
    
    Only sgemm micro-kernel is fully functional at this point.

commit bef24e67e0f93579c2a80315348dc2e227f72a72
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 18:00:56 2014 -0600

    Fixed a type of race condition exposed by pthreads implementation.
    The lead thread of the inner thread communicator could exit the
    subproblem, move on to the next iteration of the loop, and modify
    a1_pack, b1_pack, or c1_pack while other threads were still using them.
    
    Barriers were inserted to fix this.

commit 76bde44411f0e34266bab9d666a54ef22be97320
Merge: e56e6143 f3d729e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 26 17:25:24 2014 -0600

    Merge branch 'master' of github.com:flame/blis

commit f3d729e504ec012e7dc7e02b2ecd42e004c6894d
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 22:25:24 2014 -0600

    Added static mutex to bli_init and bli_finalize

commit d71cc797866ff502ad1127527016f463267eef80
Author: Tyler Michael Smith <tms@cs.utexas.edu>
Date:   Wed Nov 26 21:35:39 2014 -0600

    Refactored bli_threading files and added support for pthreads

commit e56e61438ff7fcf25a48c0b7603f18df782b50b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 26 17:20:35 2014 -0600

    Minor cleanups to bli_threading.h and friends.
    
    Details:
    - No longer need to define BLIS_ENABLE_MULTITHREADING manually in
      bli_config.h; it now gets defined when BLIS_ENABLE_OPENMP or
      BLIS_ENABLE_PTHREADS is defined.
    - Added sanity check to prevent both BLIS_ENABLE_OPENMP and
      BLIS_ENABLE_PTHREADS from being enabled simultaneously.
    - Reorganization of bli_threading*.h header files, which led to
      simplification of threading-related part of blis.h.
    - Added "-fopenmp -lpthread" to LDFLAGS of sandybridge make_defs.mk
      file.
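
    Illustrative sketch (not the actual bli_threading.h contents): the
    derived macro and the mutual-exclusion check described above could be
    written roughly as follows. The macro names come from this commit
    message; the error text and the helper function are hypothetical.

```c
/* Refuse to build if both threading models are requested at once. */
#if defined(BLIS_ENABLE_OPENMP) && defined(BLIS_ENABLE_PTHREADS)
  #error "BLIS_ENABLE_OPENMP and BLIS_ENABLE_PTHREADS may not both be defined."
#endif

/* Derive the generic multithreading macro from either model. */
#if defined(BLIS_ENABLE_OPENMP) || defined(BLIS_ENABLE_PTHREADS)
  #define BLIS_ENABLE_MULTITHREADING
#endif

/* Hypothetical helper: returns 1 if any threading model was enabled
   at compile time, 0 otherwise. */
static int multithreading_enabled_sketch(void)
{
#ifdef BLIS_ENABLE_MULTITHREADING
    return 1;
#else
    return 0;
#endif
}
```

    Compiling with neither macro defined makes the helper return 0;
    defining exactly one of them makes it return 1.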

commit 3be2744cbe2c56d38c23fd818aa5c1f10cc7ea51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 21 12:28:08 2014 -0600

    Update to template gemm ukernel comments.
    
    Details:
    - Updated comments on alignment of a1 and b1 to match wiki.

commit 994429c6881b2ade92d9d7949bcaebfbf2cc65eb
Merge: 58796abd 694029d9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 20 13:55:35 2014 -0600

    Merge pull request #20 from TimmyLiu/master
    
    #define PASTEF773 required by cblas compatibility layer

commit 694029d9d7db857d642ab536955c0621791108c8
Author: Timmy <timmy.liu@amd.com>
Date:   Wed Nov 19 15:25:14 2014 -0600

    #define PASTEF773 required by cblas compatibility layer

commit 58796abda66b133346f8d523b39178afc336351f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 6 14:31:52 2014 -0600

    Removed KC constraint comments from _kernel.h files.
    
    Details:
    - Since 4674ca8c, the constraint that KC be a multiple of both MR and
      NR has been relaxed, and thus it was time to remove the comments
      from the top of the bli_kernel.h files of all configurations.

commit 7bbc95a54f706d43c7f7951f0e5995f86130cd52
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 29 10:52:23 2014 -0500

    Added new piledriver micro-kernels.
    
    Details:
    - Added new micro-kernels for the AMD piledriver architecture (one
      for each datatype).
    - Updates and tweaks to piledriver configuration.
    - Added 3xk packm micro-kernel support.
    - Explicitly unrolled some of the smaller packm micro-kernels.
    - Added notes to avx/sandybridge and piledriver micro-kernel files
      acknowledging the influence of the corresponding kernel code in
      OpenBLAS.

commit 59613f1d5500f6279963327db2fbc84bc9135183
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 17:21:37 2014 -0500

    Added separate micro-panel alignment for A and B.
    
    Details:
    - Changed the recently-added micro-panel alignment macros so that we now
      have two sets--one for micro-panels of matrix A and one for micro-
      panels of matrix B: BLIS_UPANEL_[AB]_ALIGN_SIZE_?.
    - Store each set of alignment values into a separate blksz_t object in
      bli_gemm_cntl_init().
    - Adjusted packm_init() to use the separate alignment values.
    - Added query routines for the new alignment values to bli_info.c.
    - Modified test suite output accordingly.

commit a8e12884ee1fddd3fd77ca5a68aa0cb857f3af57
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:35:48 2014 -0500

    CHANGELOG update (0.1.6)

commit 38ea5022e4ed846112198c4e1672fcdaeb90dc71 (tag: 0.1.6)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:35:45 2014 -0500

    Version file update (0.1.6)

commit a3e6341bdb0e28411f935d6b4708a6389663e004
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 11:13:28 2014 -0500

    Factored common code from blocksize functions.
    
    Details:
    - Split bli_determine_blocksize_[fb]() into two functions each, the
      newer ones ending with the _sub suffix. These new sub-functions are
      now called from bli_[gemm|trmm|trsm]_determine_kc_[fb](), which
      eliminates redundant code and will allow any future tweaks to the
      core sub-functions to automatically be inherited by the operation-
      specific versions.

commit 4674ca8cffb58331ff7edf23bbe0e3f6a7558489
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 23 10:50:59 2014 -0500

    Extended newly relaxed KC to hemm, symm.
    
    Details:
    - These changes were intended for the previous commit.
    - Defined bli_gemm_determine_kc_[fb](),
      which determine blocksizes for gemm-based operations, taking special
      care to "nudge" the kc dimension up to a multiple of MR or NR for
      hemm and symm operations, as needed.
    - Changed bli_gemm_blk_var3f.c to call bli_gemm_determine_kc_f()
      instead of bli_determine_blocksize_f().
    - Comment updates to bli_trmm_blocksize.c, bli_trsm_blocksize.c.

commit ab954ba6f874eaca7b001804491f866ef6b9b327
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 22 17:21:58 2014 -0500

    Relaxed constraint that KC be multiple of MR, NR.
    
    Details:
    - Relaxed a long-held requirement in register blocksizes that required
      the kernel programmer to choose a KC that was divisible by both MR
      and NR. This was very constraining on some architectures that did not
      use register blocksizes that were powers of two. The constraint is
      now enforced only for trmm and trsm, where it is needed, and it is
      now handled by "nudging" kc upward at runtime, if necessary, to be a
      multiple of MR or NR, as needed.
    - Defined bli_trmm_determine_kc_[fb]() and bli_trsm_determine_kc_[fb](),
      which determine blocksizes for trmm and trsm, taking special care to
      "nudge" the kc dimension up to a multiple of MR or NR, as needed.
    - Changed bli_trmm_blk_var3[fb].c to call bli_trmm_determine_kc_[fb]()
      instead of bli_determine_blocksize_[fb]().
    - Added safeguard to bli_align_dim_to_mult() that returns the dimension
      unmodified if the dimension multiple is zero (to avoid division by
      zero).
    - Removed cpp guard/check for KC % MR == 0 and KC % NR == 0 from
      bli_kernel_macro_defs.h.
    - Whitespace, variable name changes to bli_blocksize.c.
    - Removed old commented code from bli_gemm_cntl.c.
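
    Illustrative sketch (not the actual BLIS source): the "nudging"
    described above amounts to rounding kc up to the next multiple of MR or
    NR, with the zero-multiple safeguard returning the dimension unchanged.
    The type and function names below are hypothetical stand-ins.

```c
#include <assert.h>

/* Hypothetical stand-in for a BLIS dimension type. */
typedef long dim_sketch_t;

/* Rounds dim up to the nearest multiple of mult. If mult is zero, dim is
   returned unmodified so that no division by zero can occur. */
static dim_sketch_t align_dim_to_mult_sketch(dim_sketch_t dim,
                                             dim_sketch_t mult)
{
    assert(dim >= 0 && mult >= 0);

    if (mult == 0)
        return dim;

    return ((dim + mult - 1) / mult) * mult;
}
```

    For instance, kc = 256 with a register blocksize of 6 is nudged to 258,
    while kc = 256 with a register blocksize of 8 is left at 256.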

commit 95cdae65d6b88e043ee14bcd53cd2e800d7aecb4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Oct 22 16:30:16 2014 -0500

    Fixed bug in KNC microkernel where k=0 and beta != 1

commit e64dba5633fc49b768b5edc7762f2b5d8a4d0588
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 20 19:23:06 2014 -0500

    Re-implemented micro-panel alignment.
    
    Details:
    - This commit re-implements a feature that was removed in commit
      c2b2ab62. It was removed because, at the time, I wasn't sure how the
      micro-panel alignment feature would interact with the 4m method (when
      applied at the micro-kernel level), and so it seemed safer to disable
      the feature entirely rather than allow possible breakage. This commit
      revisits the issue and safely re-implements the feature in a way that
      is compatible with 4m, 3m, 4mh, and 3mh (and native execution).
    - Modified the static memory pool to account for micro-panel alignment
      space.
    - Modified packm_init and blocked variants to align whole micro-panels
      by a datatype-specific alignment value that may be set by the
      configuration. (If it is not set by the configuration, it will default
      to BLIS_SIZEOF_?.)
    - Modified macro-kernels so that:
      - storage stride is handled properly given the new micro-panel
        alignment behavior;
      - indexing through 3m/4m/rih-type sub-panels, as is done by trmm and
        trsm, is more robust (e.g. will work if the applicable packing
        register blocksize is odd);
      - imaginary strides are computed and stored within auxinfo_t structs,
        which allows the virtual micro-kernels to more easily determine how
        to index into the micro-panel operands.
    - Modified virtual 3m and 4m micro-kernels to use the imaginary strides
      within the auxinfo_t structs instead of panel strides.
    - Deprecated the panel stride fields from the auxinfo_t structs.
    - Updated test suite to print out the micro-panel alignment values.

commit add16b0e5402924301e7078e4ca5e3ef725bff0b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:49:24 2014 -0500

    Added 3m4m test driver subdir of 'test'.
    
    Details:
    - Added a modified test driver for [cz]gemm that will test all 3m/4m
      as well as assembly-based and OpenBLAS implementations of gemm
      in single and multithreaded modes.

commit e171504a72406c61a173241d8bccf0a5ceb10582
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:25:59 2014 -0500

    Use correct definition of bli_is_last_iter().
    
    Details:
    - As intended for previous commit, the new definition of
      bli_is_last_iter() is now disabled in favor of the old
      definition.

commit 0d954087b2b55d2f5f3c5e57d702b318ca2300f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 17 11:19:34 2014 -0500

    Minor changes and fixes.
    
    Details:
    - Redefined bli_is_last_iter() to take thread_id and num_thread
      arguments, which allows the macro to correctly compute whether a
      given iteration is the last that the thread will compute in that
      particular loop. The new definition, however, remains disabled
      (commented out) until someone can look at this more closely, as
      the new definition seems to actually hurt performance slightly.
    - Whitespace and related updates to level-3 macro-kernels.
    - Updated test suite so that performance results in the hundreds of
      gigaflops do not disrupt the column alignment of the output.
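
The difference between a per-thread "last iteration" test and a global one can be sketched as follows. This is only an illustration, not the BLIS macro: the function names are hypothetical, and the sketch assumes that thread `tid` of `nth` threads executes iterations tid, tid + nth, tid + 2*nth, and so on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not the BLIS macro). Assumes a round-robin
   partitioning: thread tid runs iterations tid, tid + nth, ... < end. */
bool is_last_iter_for_thread(int i, int end, int tid, int nth)
{
    assert(nth > 0 && tid >= 0 && tid < nth);
    /* i belongs to this thread, and its next iteration would be out of range. */
    return i < end && i % nth == tid && i + nth >= end;
}

/* Thread-unaware test: only the final iteration of the whole loop counts. */
bool is_last_iter_global(int i, int end)
{
    return i == end - 1;
}
```

With end = 10 and nth = 4, thread 0 runs iterations 0, 4, and 8, so iteration 8 is its last even though it is not the last iteration of the loop.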

commit d1e86e1876e433f54b501ec5a005b4ba7c5ce4e6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 12 13:43:47 2014 -0500

    More minor tweaks to sandybridge/avx micro-kernel.
    
    Details:
    - Re-enabled use of b_next for dgemm and cgemm micro-kernels.

commit 7b6fe4cae57cb22c09c1a97595e1a201a02cbcd2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Oct 12 12:01:51 2014 -0500

    Minor tweaks to sandybridge/avx micro-kernels.
    
    Details:
    - Changed the MC blocksize for zgemm micro-kernel from 128 to 64.
    - Removed usage of b_next in all x86_64/avx gemm micro-kernels.

commit a6a156e9feec47154e7a0fd43bcc006b1fc04aba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 10 14:26:41 2014 -0500

    Added cgemm ukernel for avx/sandybridge.
    
    Details:
    - Implemented AVX-based cgemm micro-kernel (via GNU extended inline
      assembly syntax).
    - Updated sandybridge configuration accordingly.

commit 6f8575ab2580e167a022293b76ddf0514f71b613
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 10 10:01:45 2014 -0500

    Added zgemm ukernel for avx/sandybridge.
    
    Details:
    - Implemented AVX-based zgemm micro-kernel (via GNU extended inline
      assembly syntax).
    - Updated sandybridge configuration accordingly.

commit 23ce7ee542a12ca40b4b6090ad2558d180e16d37
Merge: 99fd9a39 7a8ad47f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 9 16:41:22 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 99fd9a39718cb7281f6fb23f9fef7cca4fe514f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 9 16:38:04 2014 -0500

    Fixed two minor bugs.
    
    Details:
    - Fixed a bug in the test suite for the trsm_ukr and gemmtrsm_ukr test
      modules whereby the uplo bits of some packed matrix objects were not
      being set properly, resulting in false FAILURE results for those
      tests. Thanks to Tyler Smith for bringing this issue to my attention.
    - Fixed a bug in bli_obj_alloc_buffer() that caused an unnecessary
      "not yet implemented" abort() when creating a 1x1 object with non-unit
      strides.

commit 7a8ad47fb2d100a9da93aa8cab774fcceeaab733
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Oct 8 15:52:13 2014 -0500

    Minor changes to knc configuration, including a preference for row-major
    storage. Also fixed a bug in the knc micro-kernel where it would fail if
    k == 0.

commit 76b7c34af0c09f47d9615b18857a356acddc788a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 2 14:15:38 2014 -0500

    Fixed a bug in the pack schema-related bit macros.
    
    Details:
    - Expanded the BLIS_PACK_SCHEMA_BITS value in bli_type_defs.h to
      include all six bits presently used in the pack schema bitfield of
      the info field of obj_t structs. Prior to this commit, the macro
      constant only included the lowest five bits, which excluded the
      "is or is not packed" bit. This manifested as a strange bug in
      probably many level-2 codes that invoked packing, though we only
      observed it in ger before fixing. Thanks to Devin Matthews for
      finding and reporting this bug.
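
The effect of a mask that is one bit too narrow can be sketched as below. The shift, the mask names, and the helper functions are hypothetical and chosen only for illustration; the actual bit positions live in bli_type_defs.h and are not reproduced here. The sketch assumes, as the message above describes, that the "packed" bit is the highest of the six bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a six-bit schema field at an illustrative shift. */
#define SCHEMA_SHIFT      16
#define SCHEMA_MASK_5BIT  (0x1Fu << SCHEMA_SHIFT)  /* too narrow */
#define SCHEMA_MASK_6BIT  (0x3Fu << SCHEMA_SHIFT)  /* covers the packed bit */
#define SCHEMA_PACKED_BIT 0x20u                    /* highest of the six bits */

/* Build an info word that holds only the given six-bit schema value. */
uint32_t make_info(uint32_t schema)
{
    assert(schema <= 0x3Fu);  /* must fit in six bits */
    return schema << SCHEMA_SHIFT;
}

/* Extract the schema field using the supplied mask. */
uint32_t read_schema(uint32_t info, uint32_t mask)
{
    return (info & mask) >> SCHEMA_SHIFT;
}
```

Reading a packed schema through the five-bit mask silently drops the packed bit, so a packed object would look unpacked to any code that queries the field this way.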

commit a5763e332226598d70c47dfa9cad4578e15ef5f4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 2 13:28:17 2014 -0500

    Added extra output to bli_obj_print().
    
    Details:
    - Print extra values from info field of obj_t struct within
      bli_obj_print().

commit 9bba209fc44fbfce943ba6a51cd8278a0cb6b159
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 29 14:56:36 2014 -0500

    Fixed bug when packing anywhere besides in blk_var_1 for gemm.

commit 614a4afc9272adb47e5a8b83b39d56c2804d95d6
Merge: b541b667 4a7df04e
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Sep 26 10:49:57 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis

commit 4a7df04e8a4ffdb9561d26426afd35e4fe15b013
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 22 16:06:15 2014 -0500

    Added 30xk support for packm ukernels.
    
    Details:
    - Updated bli_kernel_*_macro_defs.h headers to include default
      definitions for 30xk packm kernels.
    - Extended function pointer arrays in bli_packm_cxk_*() out to 31 and
      included 30xk kernels.
    - Added 30xk kernels to frame/1m/packm/ukernels/bli_packm_ref_cxk_*.c.

commit b6d4bd792e0d44ce4b28afef343f5ff3ba89c285
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 22 16:02:37 2014 -0500

    Fixed missing tabs from Makefile patch.

commit 32630f9b6f0d5ba28d5b56dae4c7288a37158743
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 19 17:18:20 2014 -0500

    Comment update to virtual micro-kernels.

commit 13447cffead7c6d137a7a3ccbf9e552ed0477467
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 19 13:00:48 2014 -0500

    Minor bugfix to top-level Makefile.
    
    Details:
    - Applied a patch that allows the top-level Makefile to work on certain
      systems. The patch simply separates out the source-to-object code
      generation rules for .c and .S files into two separate rules. Thanks
      to Devin Matthews for submitting this patch.

commit e80a4537846416719c067ae08a53aeda978c572d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 10:24:20 2014 -0500

    Fixed bug introduced by bugfix in 25b258d.
    
    Details:
    - We actually need to check alignment of lda*sizeof(double) and NOT
      a+lda because in the latter case, alignment could cancel out and
      still allow the optimized code to run when it shouldn't. Thanks
      to Devin for pointing this out.
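
A small sketch of why the two checks differ, assuming that `a` points to doubles (so that a+lda advances by lda*sizeof(double) bytes) and that sizeof(double) is 8. The function names and the 32-byte alignment used in the example are hypothetical; addresses are modeled as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

bool is_aligned(uintptr_t value, size_t align)
{
    assert(align != 0);
    return value % align == 0;
}

/* Check on the address a + lda: base address plus the stride in bytes. */
bool sum_is_aligned(uintptr_t a_addr, size_t lda, size_t align)
{
    return is_aligned(a_addr + lda * sizeof(double), align);
}

/* Check on the byte size of the leading dimension alone. */
bool stride_is_aligned(size_t lda, size_t align)
{
    return is_aligned(lda * sizeof(double), align);
}
```

For a base address of 8 and lda = 3, the sum 8 + 24 = 32 is 32-byte aligned, yet neither the base nor the 24-byte stride is. The misalignments cancel, which is the case the commit above guards against.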

commit 25b258d61f9c8cee64e922f4131784b6edb196dd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 10:10:49 2014 -0500

    Fixed a non-fatal problem with bugfix in a68b316c.
    
    Details:
    - The bugfix in a68b316c was inadvertently checking alignment of the
      leading dimension itself, rather than the byte size of the leading
      dimension. Now, we simply check alignment of a+lda.

commit 96302d4fc81363410e41c3a3c43a65df44d97ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 18 09:43:40 2014 -0500

    Renamed bli_info_get_*_ukr_type() functions.
    
    Details:
    - Added _string() suffix to bli_info_get_*_ukr_type() function names.
      This makes them consistent with the bli_info_get_*_impl_string()
      functions.

commit a68b316ca4852509f84ed50e01afac486bf70f58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 17 11:10:07 2014 -0500

    Fixed alignment bugs in level-1f kernels.
    
    Details:
    - Fixed bugs whereby the level-1f dotxf, axpyxf, and dotxaxpyf kernels
      were attempting to compute problems with unaligned leading dimensions
      with optimized code, rather than (correctly) using the reference
      implementations. Thanks to Devin Matthews for reporting this bug.

commit 870761eb902e4866090d1d3446a345df3d6d4599
Merge: e9899be0 a2b59a37
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 16 18:20:49 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit e9899be09044829e23386bd73e394f1dd7778210
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 16 18:19:32 2014 -0500

    Added high-level implementations of 4m, 3m.
    
    Details:
    - Added "4mh" and "3mh" APIs, which implement the 4m and 3m methods at
      high levels, respectively. APIs for trmm and trsm were NOT added due
      to the fact that these approaches are inherently incompatible with
      implementing 4m or 3m at high levels (because the input right-hand
      side matrix is overwritten).
    - Added 4mh, 3mh virtual micro-kernels, and updated the existing 4m and
      3m so that all are stylistically consistent.
    - Added new "rih" packing kernels (both low-level and structure-aware)
      to support both 4mh and 3mh.
    - Defined new pack_t schemas to support real-only, imaginary-only, and
      real+imaginary packing formats.
    - Added various level0 scalar macros to support the rih packm kernels.
    - Minor tweaks to trmm macro-kernels to facilitate 4mh and 3mh.
    - Added the ability to enable/disable 4mh, 3m, and 3mh, and adjusted
      level-3 front-ends to check enabledness of 3mh, 3m, 4mh, and 4m (in
      that order) and execute the first one that is enabled, or the native
      implementation if none are enabled.
    - Added implementation query functions for each level-3 operation so
      that the user can query a string that describes the implementation
      that is currently enabled.
    - Updated test suite to output implementation types for each level-3
      operation, as well as micro-kernel types for each of the five micro-
      kernels.
    - Renamed BLIS_ENABLE_?COMPLEX_VIA_4M macros to _ENABLE_VIRTUAL_?COMPLEX.
    - Fixed an obscure bug when packing Hermitian matrices (regular packing
      type) whereby the diagonal elements of the packed micro-panels could
      get tainted if the source matrix's imaginary diagonal part contained
      garbage.
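
The selection order described above (3mh, 3m, 4mh, 4m, then native) can be sketched as a simple priority chain. The struct, enum, and function names here are hypothetical and do not correspond to the BLIS API; the sketch only illustrates the "first enabled wins" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types, for illustration only. */
typedef enum { IMPL_3MH, IMPL_3M, IMPL_4MH, IMPL_4M, IMPL_NATIVE } impl_t;

typedef struct {
    bool en_3mh;
    bool en_3m;
    bool en_4mh;
    bool en_4m;
} induced_flags_t;

/* Return the first enabled method in the order 3mh, 3m, 4mh, 4m;
   fall back to the native implementation if none is enabled. */
impl_t select_impl(const induced_flags_t *f)
{
    assert(f != NULL);
    if (f->en_3mh) return IMPL_3MH;
    if (f->en_3m)  return IMPL_3M;
    if (f->en_4mh) return IMPL_4MH;
    if (f->en_4m)  return IMPL_4M;
    return IMPL_NATIVE;
}
```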

commit a2b59a37f166f70a6dd5793db2530823ef590c2b
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 15 10:44:44 2014 -0500

    Fixed make defs so that they actually compile for bulldozer

commit 86fc7e40764f78ec217f50216ef4fa5b57dbfbc7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Sep 15 10:35:46 2014 -0500

    Added bulldozer configuration and updated piledriver micro-kernel

commit 0644e61a79a57f136be5f4c47b9099cff2af06e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 12:55:34 2014 -0500

    Minor updates to bli_packm_init.c.

commit 9dc9b44a057a08e20ad4d423344f0ecad54c1eb2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 12:03:28 2014 -0500

    Renamed bli_obj_pack_status() to _pack_schema().
    
    Details:
    - Renamed the bli_obj_pack_status() macro to bli_obj_pack_schema() in
      order to help avoid confusion as to what the macro returns.

commit cf5efdde0588a0d5b6ea57fe7d7be5000be06f8e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Sep 11 11:47:56 2014 -0500

    Pass pack_t schemas into ukernels via auxinfo_t.
    
    Details:
    - Modified macro-kernels to pass the pack_t schema values for matrices
      A and B into the datatype-specific functions, where they are now
      inserted into a newly-expanded auxinfo_t struct. This gives the
      micro-kernels access to the pack_t schema values embedded in the
      control trees, which determine the precise format into which the
      matrix elements are packed.
    - Updated a call to bli_packm_init_pack() in src/test_libblis.c to
      remove densify argument. Meant to include this in commit c472993b.

commit cc8d2b82775cca3c2d51bf427f4e77c8024a6d15
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 9 13:48:22 2014 -0500

    Updated old test drivers in 'test'.

commit c472993bbccb69e9ffc409c79b742426c8ad2ad4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 9 13:42:04 2014 -0500

    Removed densify argument to packm_cntl_obj_create().
    
    Details:
    - Removed the "densify" bool_t argument to bli_packm_cntl_obj_create().
      This argument was inserted very early in BLIS's development, when it
      was anticipated that the developer may sometimes wish to pack a
      Hermitian, symmetric, or triangular matrix without making it dense.
      But as it turns out, if we are packing a matrix, we always want to
      make it dense in some way or another due to the fact that the micro-
      kernel only multiplies dense micro-panels. Thus, unless/until there
      is a real need for the feature, it seems reasonable to remove it from
      the packm_cntl API.

commit 5c43ee387146cd76dc59b730dac6683a8446b834
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 8 15:19:29 2014 -0500

    Moved trmm4m/3m_cntl files to 'old' directory.
    
    Details:
    - Meant to include this in previous commit.

commit 7b2f469d5465ed73b1ca88124bc9a1987388aa27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 8 14:49:50 2014 -0500

    Retired trmm_t control tree definitions, usage.
    
    Details:
    - Replaced all trmm_t control tree instances and usage with that of
      gemm_t. This change is similar to the recent retirement of the herk_t
      control tree.
    - Tweaked packm blocked variants so that the triangular code does NOT
      assume that k is a multiple of MR (when A is triangular) or NR (when
      B is triangular). This means that bottom-right micro-panels packed for
      trmm will have different zero-padding when k is not already a multiple
      of the relevant register blocksize. While this creates a seemingly
      arbitrary and unnecessary distinction between trmm and trsm packing,
      it actually allows trmm to be handled with one control tree, instead
      of one for left and one for right side cases. Furthermore, since only
      one tree is required, it can now be handled by the gemm tree, and thus
      the trmm control tree definitions can be disposed of entirely.
    - Tweaked trmm macro-kernels so that they do NOT inflate k up to a
      multiple of MR (when A is triangular) or NR (when B is triangular).
    - Misc. tweaks and cleanups to bli_packm_struc_cxk_4m.c and _3m.c, some
      of which are to facilitate above-mentioned changes whereby k is no
      longer required to be a multiple of register blocksize when packing
      triangular micro-panels.
    - Adjusted trmm3 according to above changes.
    - Retired trmm_t control tree creation/initialization functions.

commit 576e9e9255a79dba9cd3c804267f51e0b4aa6e8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Sep 7 16:12:52 2014 -0500

    Retired herk_t control tree definitions, usage.
    
    Details:
    - Replaced all herk_t control tree instances and usage with that of
      gemm_t, since the two types presently have the same fields. This means
      that herk, her2k, syrk, and syr2k can simply use the gemm control tree
      as-is, just as hemm and symm have been doing for some time now.
    - Retired herk_t control tree creation/initialization functions.
    - Retired many _target.c and .h files into 'old' directories.

commit b2fed052c9a23d858ef0afbe220b342bce9aa7f7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 3 17:07:25 2014 -0500

    Minor code cleanup to bli_packm_struc_cxk*.c
    
    Details:
    - Realized that we don't need to track rs_p11 and cs_p11 for
      Hermitian/symmetric case of bli_packm_struc_cxk*(). They are always
      equal to rs_p and cs_p.

commit 023ce770966b3b5a98bba729c5af1f45e15ebb97
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 3 10:47:53 2014 -0500

    Minor update to packm_cxk kernels.
    
    Details:
    - Changed m and n dimension parameter names to panel_dim and panel_len,
      respectively, in packm_cxk, packm_cxk_3m, packm_cxk_4m kernel wrapper
      functions. This makes the code a little easier to read since "m" and
      "n" have connotations that are not applicable here.
    - Comment updates.

commit 189def3667d9218adbeec45e2801fd074341a679
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 1 16:23:17 2014 -0500

    Retired portions of bli_kernel_3m/4m_macro_defs.h.
    
    Details:
    - Removed sections of bli_kernel_[4m|3m]_macro_defs.h that defined
      4m/3m-specific blocksizes after realizing that this can be done in
      bli_gemm[4m|3m]_cntl.c, since that is (mostly) the only place they
      are used.
    - The maximum cache values for 4m/3m are still needed when computing mem
      pool dimensions in bli_mem_pool_macro_defs.h. As a workaround, "local"
      definitions in terms of the regular cache blocksizes are now in place.
    - Similarly, the register blocksizes for 4m/3m are still needed in
      bli_kernel_post_macro_defs.h. As a workaround, "local" definitions in
      terms of the regular register blocksizes are now in place.

commit af521ee6f2a77d61c98b833e85c09969987bc00d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 1 14:06:46 2014 -0500

    Changed semantics of blocksize extensions.
    
    Details:
    - Changed semantics of cache and register blocksize extensions so that
      the extended values are tracked, rather than just the marginal
      extensions.
    - BLIS_EXTEND_[MKN]C_? has been renamed BLIS_MAXIMUM_[MKN]C_?.
    - BLIS_EXTEND_[MKN]R_? has been renamed BLIS_PACKDIM_[MKN]R_?.
    - bli_blksz_ext_*() APIs have been renamed to bli_blksz_max_*(). Note
      that these "max" query routines grab the maximum value for cache
      blocksizes and the packdim value for register blocksizes.
    - bli_info_*() API has been updated accordingly.
    - All configurations have been updated accordingly.
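
The change in semantics amounts to storing the extended total instead of the margin. A minimal sketch, with hypothetical function names and illustrative values (128 and 32 are not taken from any configuration):

```c
#include <assert.h>

/* Old semantics: the configuration stores the marginal extension,
   and the extended value must be derived from it. */
int max_from_extension(int blksz, int extension)
{
    assert(blksz > 0 && extension >= 0);
    return blksz + extension;
}

/* New semantics: the configuration stores the extended (maximum) value
   directly; the margin, if needed, is derived from it. */
int extension_from_max(int blksz, int maximum)
{
    assert(blksz > 0 && maximum >= blksz);
    return maximum - blksz;
}
```

Under this reading, a blocksize of 128 that was previously paired with an extension of 32 would now be paired with a maximum of 160.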

commit 07f23aefd52f5ba4960dbd46e59b180a2136b8e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 31 11:58:50 2014 -0500

    Pass pack schema into packm_struc_cxk*().
    
    Details:
    - Changed the interface to the packm_struc_cxk*() kernels to include
      the pack_t schema. This allows the implementation to more easily
      determine how the micro-panel is stored (row-stored column panel
      or column-stored row panel).
    - Updated packm blocked variants to pass in the schema.
    - Updated packm_ker_t function pointer definition accordingly.

commit f032ba9b1186cb02184574d339565f53d733aa42
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 30 16:21:20 2014 -0500

    Reorganized packm implementation.
    
    Details:
    - Reorganized packm variants and structure-aware kernels so that all
      routines for a given pack format (4m, 3m, regular) reside in a single
      file.
    - Renamed _blk_var4 to _blk_var2 and generalized so that it will work
      for both 4m and 3m, and adjusted 4m/3m _cntl_init() functions
      accordingly.
    - Added a new packm_ker_t function pointer type to
      bli_kernel_type_defs.h to facilitate function pointer typecasting in
      the datatype-specific packm_blk_var2() functions.
    - Deprecated _blk_var3.
    - Fixed a bug in the triangular micro-panel packing facility that
      affected trmm and trmm3 with unit diagonals.

commit c6793cecb70788bdf2c76ab8102504ea97be9d2a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 17:14:48 2014 -0500

    Reorganized #includes for scalar macro headers.
    
    Details:
    - Reordered the #include statements in bli_scalar_macro_defs.h so that
      conventional, ri-, and ri3-based macros are grouped together.
    - Renamed bli_eqri.h (and macros within) to end with 'ris' suffix.

commit b4da8907284345be4374f87a88679c4886ab866e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 14:10:32 2014 -0500

    Whitespace, comments updates on packm_blk_var?.c.

commit 46e46a1d83da586c3dd9fd7a01eb16067abbaee1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 12:05:45 2014 -0500

    Minor updates to packm blocked, cxk_3m/4m code.
    
    Details:
    - Added 'const' qualifier to inlined packing code that handles
      micro-panel packing that is too large for an existing packm ukernel.
    - Comment updates.

commit 908dc688b5979995eaacb3aa937f241551a8df00
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 28 11:55:12 2014 -0500

    Pass pack schema into blocked packm routines.
    
    Details:
    - Rather than passing the packm blocked routines a boolean value that
      represents whether the matrix is being packed to row or column storage,
      we now pass in the pack schema itself.

commit a0ff6066e06075ab5f92b19247b39b92ed15f1bf
Merge: c4c99c48 d40b32bc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 15:56:21 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit c4c99c4813bf9817592a7899c5d33412fe22313f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 15:52:22 2014 -0500

    Renamed packm scalar from beta to kappa.
    
    Details:
    - The packm implementation (i.e. source files in frame/1m/packm and
      frame/1m/packm/ukernels), interchangeably used the names "beta" and
      "kappa" to refer to the optional scalar to be applied during packing.
      This commit renames all uses of "beta" to be "kappa", since "beta"
      sometimes evokes the scalar specifically on the output matrix of a
      level-2 or level-3 operation.

commit d40b32bc24ffbae24123e054307b3138969bb095
Merge: 9331f794 6c25c379
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 13:46:36 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 6c25c379fadb50834146e1614f7b80c093c2aad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 13:44:10 2014 -0500

    Consolidated unpackm ukernels into single file.
    
    Details:
    - Reorganized unpackm ukernels into a single file,
      bli_unpackm_ref_cxk.c, in a manner similar to what was done for packm
      ukernels in commit 4cc2b46.

commit 9331f79443223fe267676ee54c439e1ed320380c
Merge: 7fc48a7d 670b6392
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 10:54:21 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 670b63926a7f4fc694abc5b1582ef8a4f367f5a8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 24 10:46:27 2014 -0500

    Added whitespace to bli_obj_scalar_ routine calls.
    
    Details:
    - Added extra spaces to align arguments of
      bli_obj_scalar_init_detached_copy_of(). This misalignment was due to
      the fact that the function was previously named
      bli_obj_init_scalar_copy_of() and the name change, performed in
      b444489f, was done via recursive sed commands which left subsequent
      lines untouched.

commit 7fc48a7d920e07fd8e9528ab2565123f8f4e67f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 23 16:50:58 2014 -0500

    Combined 4m/3m bits into an expanded bitfield.
    
    Details:
    - Combined the 4m/3m bits into an expanded bitfield, which will encode
      the packing "format" of the micro-panels. This will allow for more
      easily and compactly encoding additional formats.
    - Other minor comment/whitespace updates to bli_type_defs.h.
    - Updated bli_obj_macro_defs.h and bli_param_macro_defs.h to use the new
      format bitfield.
    - Comment update to bli_kernel_post_macro_defs.h.
    - Whitespace changes to bli_kernel_3m_macro_defs.h, _4m_macro_defs.h.

commit ef0143cc1417e4815e4cafd5a464cc83fe7a1e86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Aug 23 14:02:27 2014 -0500

    Renamed _ri, _ri3 packm ukernels to _4m, _3m.
    
    Details:
    - Renamed packm ukernels, _cxk dispatcher, and structure-aware _cxk
      helper functions to use _4m and _3m instead of _ri and _ri3 suffixes.
    - Updated names of cpp macros that correspond to packm ukernels.

commit b0ccac116158b5ed3316d34798748ba0c6d78672
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 21 19:21:52 2014 -0500

    Cleaned up front-end layering for 4m/3m.
    
    Details:
    - Added an extra layer to level-3 front-ends (examples: bli_gemm_entry()
      and bli_gemm4m_entry()) to hide the control trees from the code that
      decides whether to execute native or 4m-based implementations. The
      layering was also applied to 3m.
    - Branch to 4m code based on the return value of bli_4m_is_enabled(),
      rather than the cpp macros BLIS_ENABLE_?COMPLEX_VIA_4M. This lays
      the groundwork for users to be able to change at runtime which
      implementation is called by the main front-ends (e.g. bli_gemm()).
    - Retired some experimental gemm code that hadn't been touched in
      months.

commit bedec95451cabfa7a8906b51018a5e0572998a5e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 21 18:25:48 2014 -0500

    Added bli_4m API for querying 4m enabled state.
    
    Details:
    - Added bli_4m.c (and header), which defines a simple API that can be
      used to query, enable, and disable 4m-based complex support in BLIS.
      The macros BLIS_ENABLE_?COMPLEX_VIA_4M are now used to initialize
      the variable that determines the state (enabled or disabled).
    - Changed bli_info*() API so that all cache and register blocksize-
      related query routines return the blksz_t objects' values as they
      exist at runtime, rather than return the values as determined by the
      configuration system (e.g. bli_kernel.h, or defaults for those values
      not specified). This sets the foundation for being able to change
      those blocksizes at runtime.

commit b541b667cabfa6d41b50ad1e49209651ee6812cc
Merge: 699a8151 dd61307f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Aug 20 14:44:51 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis
    
    Conflicts:
            frame/3/trsm/bli_trsm_blk_var2b.c
            frame/3/trsm/bli_trsm_blk_var2f.c

commit 699a8151ca3d5021e834a1784ef45dcc3a3d17cd
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Aug 20 14:43:17 2014 -0500

    Some improvements to trsm parallelism

commit dd61307f55bb6bc762fe0ef0446479d6c0536723
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 20 09:52:16 2014 -0500

    Minor update to sandybridge MC_S, KC_S.
    
    Details:
    - Changed sandybridge MC and KC for single-precision real to 128 and 384,
      respectively.
    - Updated comments in template configuration's gemm micro-kernel file
      to document the new "contiguous row preference" macro.

commit d0eec4bddd740ce360d0f655362c551287cf925b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 19 15:49:19 2014 -0500

    Added optional row preference to ukernel config.
    
    Details:
    - Added the ability for the kernel developer to indicate the gemm micro-
      kernel as having a preference for accessing the micro-tile of C via
      contiguous rows (as opposed to contiguous columns). This property may
      be encoded in bli_kernel.h as BLIS_?GEMM_UKERNEL_PREFERS_CONTIG_ROWS,
      which may be defined or left undefined. Leaving it undefined leads to
      the default assumption of column preference.
    - Changed conditionals in frame/3/*/*_front.c that induce transposition
      of the operation so that the transposition is induced only if there
      is disagreement between the storage of C and the preference of the
      micro-kernel. Previously, the only conditional that needed to be met
      was that C was row-stored, which is to say that we assumed the micro-
      kernel preferred column-contiguous access on C.
    - Added a "prefers_contig_rows" property to func_t objects, and updated
      calls to bli_func_obj_create() in _cntl.c files in order to support
      the above changes.
    - Removed the row-storage optimization from bli_trsm_front.c because
      it is actually ineffective. This is because the right-side case of
      trsm flips the A and B micro-panel operands (since BLIS only requires
      left-side gemmtrsm/trsm kernels), meaning any transposition done
      at the high level is then undone at the low level.
    - Tweaked trmm, trmm3 _front.c files to eliminate a possible redundant
      invocation of the bli_obj_swap() macro.
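
The change to the transposition conditional can be sketched as below. The enum and function names are hypothetical, and the sketch is simplified to row and column storage only (general stride is ignored).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical storage tag; general-stride storage is not modeled. */
typedef enum { COL_STORED, ROW_STORED } storage_t;

/* Previous rule: transpose whenever C is row-stored, which implicitly
   assumes the micro-kernel prefers column-contiguous access. */
bool induce_transpose_old(storage_t c_storage)
{
    assert(c_storage == COL_STORED || c_storage == ROW_STORED);
    return c_storage == ROW_STORED;
}

/* New rule: transpose only when the storage of C disagrees with the
   micro-kernel's stated preference. */
bool induce_transpose_new(storage_t c_storage, bool ukr_prefers_rows)
{
    assert(c_storage == COL_STORED || c_storage == ROW_STORED);
    return (c_storage == ROW_STORED) != ukr_prefers_rows;
}
```

The two rules agree for a column-preferring micro-kernel and differ only when the micro-kernel prefers rows.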

commit 4cc2b464f29cafbfef9295b073b857fe0752f710
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Aug 15 11:49:15 2014 -0500

    Reorganized packm ukernels.
    
    Details:
    - Previously, packm micro-kernels were organized by the implied register
      blocksize (panel dimension) assumed by the kernel, meaning conventional,
      ri, and ri3 variations of some micro-kernel size were housed in the same
      file. This commit reorganizes the micro-kernels so that all sizes reside
      in the same file for each format type (conventional, ri, and ri3).

commit fcc10054a11b6fc3976986f57feccf741596cbf6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 13 12:32:06 2014 -0500

    Tweaks to gemm4m, gemm3m virtual ukernels.
    
    Details:
    - Fixed a potential, but as-yet unobserved bug in gemm3m that would
      allow undesirable inf/NaN propagation, since C was being scaled by
      beta even if it was equal to zero.
    - In gemm3m micro-kernel, we now avoid copying C to the temporary
      micro-tile if beta is zero.
    - Rearranged computation in gemm4m so that the temporary C micro-tile
      is accessed less, and C is accessed only after the micro-kernel
      calls. This improves performance marginally in most situations.
    - Comment updates to both gemm4m and gemm3m micro-kernels.
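
The inf/NaN issue with a zero beta can be illustrated with a scalar-loop sketch. These functions are hypothetical stand-ins, not the gemm3m micro-kernel; they only show why C must be overwritten rather than scaled when beta is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Always scales C by beta. If beta is zero and c[i] is NaN or inf,
   the product 0 * c[i] is NaN and contaminates the result. */
void update_tile_naive(double *c, const double *ab, size_t n, double beta)
{
    assert(c != NULL && ab != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = beta * c[i] + ab[i];
}

/* Treats beta == 0 as "overwrite": C is never read in that case. */
void update_tile(double *c, const double *ab, size_t n, double beta)
{
    assert(c != NULL && ab != NULL);
    if (beta == 0.0) {
        for (size_t i = 0; i < n; ++i)
            c[i] = ab[i];
    } else {
        for (size_t i = 0; i < n; ++i)
            c[i] = beta * c[i] + ab[i];
    }
}
```

Skipping the read of C when beta is zero also avoids the copy into a temporary micro-tile, as noted in the second item above.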

commit cdcbacc2fa871317c8e7ef961ecc6d70ab22dc34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:45:38 2014 -0500

    Removed redundant redef of packm ukr prototypes.
    
    Details:
    - Removed redundant macro code that redefined packm ukernel prototypes
      when the previous macro was already sufficient. This helps de-clutter
      the packm ukernel prototyping headers a little bit.

commit 82dac98d9032ccb598068a55ddf23d7898491e9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:36:25 2014 -0500

    Relocated packm ukernel #includes.
    
    Details:
    - Consolidated the #include statements for packm ukernel headers from
      bli_packm_cxk.h, bli_packm_cxk_ri.h, and bli_packm_cxk_ri3.h to
      bli_packm.h.
    - Comment/whitespace updates to bli_packm_blk_var3.c, _var4.c.

commit 7f77856e25aad5fc6f172ed3e57b6351804e31a4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 12 12:20:15 2014 -0500

    Removed unused 4m/3m-related packm macro defs.
    
    Details:
    - Removed unused and unneeded s- and d-flavored macro definitions for
      packm ukernels related to the complex 4m and 3m methods, as
      implemented in BLIS.

commit bc1d86b2d4d436b1dfba2d0098501aaca9cbb8b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 19:01:20 2014 -0500

    Sandy Bridge configuration, micro-kernel update.
    
    Details:
    - Minor updates to bli_config and bli_kernel.h for sandybridge
      configuration.
    - Renamed existing AVX intrinsic-based micro-kernel file to
      bli_gemm_int_d8x4.c.
    - Added new file, bli_gemm_asm_d8x4.c, which provides assembly-based
      gemm micro-kernels for single- and double-precision real.

commit 98ec95877a95242e159b2bf0c879115a59e4c6e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 18:28:32 2014 -0500

    Corrected comment for _obj_is_[row|col]_stored().
    
    Details:
    - Fixed a mistake in the comments introduced in the previous commit for
      bli_obj_is_row_stored() and bli_obj_is_col_stored().

commit 43d5e419e1b424d2143817103dbee8ead797e8aa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 18:20:40 2014 -0500

    Reverted _obj_is_[row|col]_stored() macros.
    
    Details:
    - Rolled back recent changes to bli_obj_is_row_stored() and
      bli_obj_is_col_stored() so that those macros now only inspect the
      strides (row or column). It turns out that the more sophisticated
      definitions introduced in a51e32e are not necessary, because these
      "obj" macros are virtually never used on packed matrices, and when
      they are, they can use bli_obj_is_[row|col]_packed() macros, which
      inspect the info bitfield.

commit 45692e3ad4b7e1d05ac4302398df4efce04b4284
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 7 13:21:15 2014 -0500

    Reverted some accidental changes.
    
    Details:
    - Reverted some changes that were unintentionally included in the
      previous commit (9526ce98). Thanks to Tony Kelman for pointing
      this out. (Note: a few select changes were not reverted.)

commit 9526ce98812be908bc4915f2849b657fb6ce1b49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 6 14:13:46 2014 -0500

    Updated copyright headers of emscripten configuration files.

commit 30833ed71d56f231ddba21e632bcbbc90b12a97c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 6 12:12:03 2014 -0500

    Minor edits to configurations' make_defs.mk files.
    
    Details:
    - Redefined CFLAGS, CFLAGS_NOOPT, and CFLAGS_KERNELS so that CFLAGS_NOOPT
      is defined first and then the other two are defined in terms of
      CFLAGS_NOOPT. This textually cleans up the definitions and makes them a
      little easier to read.

commit 9d61afeae2ba70fe1df07e7546f6954ea83aed12
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 16:01:59 2014 -0500

    CHANGELOG update (0.1.5)

commit bde56d0ecfd0ec20330fac290b91a6dca0cf94e9 (tag: 0.1.5)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 16:01:58 2014 -0500

    Version file update (0.1.5)

commit 4c6ceea4be35d089630986eb5b959b9e97214077
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 4 15:49:59 2014 -0500

    Added CBLAS compatibility layer.
    
    Details:
    - Added a new section in bli_config.h files of all configurations for
      enabling CBLAS support. (Currently, the default is for the CBLAS layer
      to be disabled.)
    - Added a directory, frame/compat/cblas, to house CBLAS source code. A
      subdirectory 'f77_sub' holds subroutine wrappers corresponding to
      subroutines found in CBLAS that allow calling some BLAS routines with
      the return value passed as the last argument rather than as an actual
      (function) return value. This was probably intended to allow CBLAS to
      avoid the whole f2c debacle altogether. However, since BLIS does not
      assume the presence of a Fortran compiler, we had to provide similar
      routines in C.
    - A script, integrate-cblas-tarball.sh, is included to streamline the
      integration of future revisions of the CBLAS source code.
    - The current tarball, cblas.tgz, that was used with the above script to
      generate the present set of CBLAS source code is also included.
    - Updated blis.h to include necessary CBLAS-related headers.

commit caab62dac0fb0bd0d674118f409c81680db94d29
Merge: 383631b5 db97ce97
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Aug 3 14:36:18 2014 -0500

    Merge pull request #19 from kevinoid/fix-install-perms-error
    
    Fix permissions error installing to non-owned directory

commit db97ce979b88c051922c2f946ce52d523c7a12c6
Author: Kevin Locke <kevin@kevinlocke.name>
Date:   Sun Aug 3 12:48:04 2014 -0600

    Fix permissions error installing to non-owned directory
    
    When installing to a directory which is not owned by the installing
    user, even when the user has write permission for the directory, the
    installation can fail with an error similar to the following:
    
    Installing libblis-0.1.4-7-sandybridge.a into /usr/local/lib/
    install: cannot change permissions of ‘/usr/local/lib’: Operation not permitted
    Makefile:658: recipe for target '/usr/local/lib/libblis-0.1.4-7-sandybridge.a' failed
    make: *** [/usr/local/lib/libblis-0.1.4-7-sandybridge.a] Error 1
    
    In the example case, the error occurred because the user attempted to
    install to /usr/local and /usr/local/lib is owned by root with mode 2755
    which the Makefile unsuccessfully attempted to change to 0755.
    
    Given that installing to /usr/local is likely to be quite common and the
    ownership/permissions are the default for Debian and Debian-derived
    Linux distributions (perhaps others as well), this commit attempts to
    support that use case by using mkdir rather than install to create the
    directory (which is the same approach as Automake).
    
    Signed-off-by: Kevin Locke <kevin@kevinlocke.name>

commit 383631b514c3d42b724640f57644eea276cc418c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 31 14:51:48 2014 -0500

    Redefined bit field macros with bitshift operator.
    
    Details:
    - Redefined many of the macros that define bit fields and bit values in
      the obj_t info field using the bitshift operator (<<). This makes it
      easier to reorder bit fields, or expand existing bit fields, or add
      new fields. The bitshifting should be evaluated by the compiler at
      compile-time.
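
    A minimal sketch of the shift-based style, for illustration only. The
    field names, widths, and positions below are hypothetical and do not
    reproduce the actual obj_t info layout; the point is that masks and
    values are derived from one shift constant per field, so moving or
    widening a field means editing a single line.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: each field is described by a shift and a width. */
#define EX_DT_SHIFT      0u
#define EX_DT_NBITS      3u
#define EX_TRANS_SHIFT   ( EX_DT_SHIFT + EX_DT_NBITS )
#define EX_TRANS_NBITS   2u

/* Masks are derived from shift and width; the compiler folds these
   expressions into constants. */
#define EX_FIELD_MASK( shift, nbits ) \
        ( ( ( 1u << (nbits) ) - 1u ) << (shift) )
#define EX_DT_BITS       EX_FIELD_MASK( EX_DT_SHIFT,    EX_DT_NBITS    )
#define EX_TRANS_BITS    EX_FIELD_MASK( EX_TRANS_SHIFT, EX_TRANS_NBITS )

/* Named values of the (hypothetical) trans field. */
#define EX_TRANSPOSE     ( 1u << EX_TRANS_SHIFT )
#define EX_CONJUGATE     ( 2u << EX_TRANS_SHIFT )

/* Overwrite one field of an info word, leaving the other fields intact. */
static inline uint32_t ex_set_field( uint32_t info, uint32_t mask,
                                     uint32_t value )
{
    assert( ( value & ~mask ) == 0 ); /* value must lie inside the field */
    return ( info & ~mask ) | value;
}
```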

commit 137143345dc93cc9a83da5ba88b25bac7502de86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 31 12:12:45 2014 -0500

    Reimplemented unit blocksize fix in prev commit.
    
    Details:
    - Instead of inferring the storage format of the micro-panels from within
      the packm variants, we now pass in a bool_t value that denotes whether
      the packed matrix contains row-stored column panels or column-stored
      row panels. This value can then be tested more easily inside the main
      packm variant loop.
    - Renumbered pack_t schema values in bli_type_defs.h so that there are
      now five bits, each with different meaning:
      - 4: packed or not packed?
      - 3: packed for 3m?
      - 2: packed for 4m?
      - 1: packed to panels?
      - 0: stored by rows or columns?
    - Added new macros that test for status of above bits in schema bit
      subfield, and renamed some existing macros related to 4m/3m.
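
    The five-bit schema above could be tested along these lines. This is a
    sketch only: the macro and function names are made up for illustration,
    and the meaning of each value of bit 0 (taken here as 0 = rows,
    1 = columns) is an assumption that the log message does not state.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit positions follow the list above; the names are hypothetical. */
#define EX_PACK_BIT        ( 1u << 4 )  /* packed or not packed?          */
#define EX_PACK_3M_BIT     ( 1u << 3 )  /* packed for 3m?                 */
#define EX_PACK_4M_BIT     ( 1u << 2 )  /* packed for 4m?                 */
#define EX_PACK_PANEL_BIT  ( 1u << 1 )  /* packed to panels?              */
#define EX_PACK_RC_BIT     ( 1u << 0 )  /* assumed: 0 = rows, 1 = columns */

/* Test a single bit of a schema value. */
static inline bool ex_schema_test( unsigned schema, unsigned bit )
{
    assert( schema < 32u ); /* a schema occupies five bits */
    return ( schema & bit ) != 0;
}

static inline bool ex_is_packed( unsigned schema )
{
    return ex_schema_test( schema, EX_PACK_BIT );
}

static inline bool ex_is_3m_packed( unsigned schema )
{
    return ex_schema_test( schema, EX_PACK_3M_BIT );
}

static inline bool ex_is_panel_packed( unsigned schema )
{
    return ex_schema_test( schema, EX_PACK_PANEL_BIT );
}
```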

commit a51e32ec061941cd10119ea80115c82a40b1673f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 30 10:41:48 2014 -0500

    Fixed unit register blocksize brokenness.
    
    Details:
    - Fixed a breakdown in BLIS's ability to differentiate between row-stored
      and column-stored micro-panels when MR or NR is unit. When either
      register blocksize (or both) is equal to one, inspecting the strides of
      the affected packed micro-panel is no longer sufficient to determine
      whether the micro-panel is a row-stored column panel or a column-stored
      row panel (because both strides are unit). At that point, dimension
      information is necessary when invoking the bli_is_row_stored_f() and
      bli_is_col_stored_f() macros (and their "obj" counterparts). Thanks to
      Ilya Polkovnichenko for reporting this bug.
    - Added panel dimensions (m and n) to obj_t, which are set in
      packm_init() and then passed into the blocked variants to support the
      aforementioned update.
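
    The ambiguity can be reproduced with a small sketch. The helper names
    are hypothetical, and the stride-only tests below are an assumed
    simplification of the macros named above. A column-stored row panel
    (MR x k) has strides (1, MR) and a row-stored column panel (k x NR) has
    strides (NR, 1), so once MR or NR is one, both tests pass.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { long rs; long cs; } ex_strides_t;

/* Column-stored row panel (MR x k): each column of MR elements is
   contiguous, so rs = 1 and cs = MR. */
static inline ex_strides_t ex_row_panel_strides( long mr )
{
    assert( mr >= 1 );
    ex_strides_t s = { 1, mr };
    return s;
}

/* Row-stored column panel (k x NR): each row of NR elements is
   contiguous, so rs = NR and cs = 1. */
static inline ex_strides_t ex_col_panel_strides( long nr )
{
    assert( nr >= 1 );
    ex_strides_t s = { nr, 1 };
    return s;
}

/* Stride-only storage tests (assumed form). */
static inline bool ex_is_col_stored( ex_strides_t s ) { return s.rs == 1; }
static inline bool ex_is_row_stored( ex_strides_t s ) { return s.cs == 1; }
```

    With MR = 4 only ex_is_col_stored() holds for the row panel; with
    MR = 1 both tests hold, and the strides are identical to those of a
    column panel with NR = 1.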

commit c2732272f0ac680a0ad19fa9db5d587398a1479a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 29 16:37:18 2014 -0500

    Removed old/unused packm variants.

commit b97fa9a5a70fe0123e5eebd999b947461d38445f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:54:09 2014 -0500

    Minor usage update to build/bump-version.sh.

commit b18ba5f62d98629cdd519ff4c96fc67ec1a62fb9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:52:05 2014 -0500

    Added missing 'bla_' prefix to r_imag(), d_imag().
    
    Details:
    - Added "bla_" to f2c functions r_imag() and d_imag(). Thanks to Murtaza
      Ali for pointing out the mis-named functions.

commit af7a8e6c042cade452130a6729377f1a3ef4e19e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:20:13 2014 -0500

    CHANGELOG update (0.1.4)

commit a7537071b152ecff671f8716595d37dc09e4fd51 (tag: 0.1.4)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 27 18:20:12 2014 -0500

    Version file update (0.1.4)

commit acff74041bf02c7b9fdfa24b507bca782a4c5fce
Merge: cdb9413e 47b243ef
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 15:07:30 2014 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit cdb9413e140f8a198666250ec88fa34b5425a9c3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 15:05:15 2014 -0500

    Enabled threading for a couple more loops in TRSM
    
    JC loop is now enabled for the left-sided case
    IC loop is now enabled for the right-sided case

commit 47b243ef08f4101de3d936f2373343e67eaa4dd5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 23 13:41:13 2014 -0500

    Call setid for early return from herk/her2k.
    
    Details:
    - Added setid call (to zero imaginary parts of diagonal elements) to
      early return branches of herk_front() and her2k_front() for cases
      where alpha is zero. Thanks to Murtaza Ali for suggesting this fix.
    - Comment update.

commit 3e7b0db5b0e24f5fd66c60bacabc019885ddbec5
Merge: 2f8a357d ed3e33d5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 13:40:44 2014 -0500

    Merge branch 'master' of https://github.com/flame/blis

commit 2f8a357de5fb55163a969d888cf059f24b78125c
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 23 13:40:12 2014 -0500

    Some TRSM threading fixes/additions

commit ed3e33d548047be3283ff41268fdf716563bc542
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:40:43 2014 -0500

    Tweaked behavior of herk, her2k for BLAS compat.
    
    Details:
    - Updated herk_front() and her2k_front() to explicitly set the imaginary
      components of the diagonal entries of C to zero after the computation
      is complete. This is needed in case downstream applications read the
      full diagonal entries (i.e., including imaginary part), which could, in
      the absence of this modification, accumulate numerical error from
      subsequent rank-k/rank-2k updates.
    - Updated BLAS compatibility wrappers for herk and her2k to return early
      if:
        n == 0 || ( ( alpha == 0 || k == 0 ) && beta == 1 )
      This also results in the imaginary components of diagonal entries NOT
      being set to zero (see above), which is consistent with BLAS.
    - Updated mkherm to use setid instead of an inlined loop over the
      diagonal.
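
    The early-return condition quoted above, written out as a predicate.
    This is a sketch for herk with real alpha and beta; the function name
    is hypothetical, and argument validation is assumed to have happened
    before this point.

```c
#include <assert.h>
#include <stdbool.h>

/* True when the compatibility wrapper would return without touching C
   (and therefore without zeroing the imaginary parts of its diagonal). */
static inline bool ex_herk_compat_returns_early( int n, int k,
                                                 double alpha, double beta )
{
    assert( n >= 0 && k >= 0 ); /* negative dimensions rejected earlier */
    return n == 0 || ( ( alpha == 0.0 || k == 0 ) && beta == 1.0 );
}
```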

commit ea59a5c93cde1467a3715abc53dda4aecf961873
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:36:02 2014 -0500

    Added new level-1d operation: setid.
    
    Details:
    - Defined a new level-1d operation, setid, which sets the imaginary
      components of an object's diagonal elements to a single scalar. This
      can be useful, for example, when trying to make the diagonal of a
      Hermitian matrix real-valued.
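
    A simplified sketch of what such an operation does, restricted to the
    main diagonal of a double-precision complex matrix with general
    strides. The type and function names are hypothetical and this is not
    the BLIS implementation.

```c
#include <assert.h>

typedef struct { double real; double imag; } ex_dcomplex;

/* Set the imaginary component of every main-diagonal element of the
   m x n matrix a (row stride rs, column stride cs) to alpha. Real parts
   and off-diagonal elements are left untouched. */
static inline void ex_setid( double alpha, int m, int n,
                             ex_dcomplex *a, long rs, long cs )
{
    assert( m >= 0 && n >= 0 );
    int n_diag = ( m < n ) ? m : n;

    for ( int i = 0; i < n_diag; i++ )
        a[ i * rs + i * cs ].imag = alpha;
}
```

    Calling ex_setid( 0.0, ... ) on a Hermitian matrix whose diagonal has
    picked up small imaginary parts makes that diagonal real-valued again.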

commit 8965a965931318619ceaebd7c32edccf3022d0c7
Merge: 1785efb5 5b73e80b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:34:32 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 1785efb5420bc7b9c850a068cb5d99837071e877
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 22 14:33:01 2014 -0500

    Minor improvements to invertd and setd.
    
    Details:
    - Added missing call to invertd_check() from front-end.
    - Changed setd front-end call of scald_check() to setd_check().

commit 5b73e80b71c054c1945a06aff044ef629bc1a9a0
Merge: a41e68e0 20690fe3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 18 12:21:20 2014 -0500

    Merge pull request #16 from Maratyszcza/emscripten
    
    Emscripten port

commit a41e68e09e73b999fab0bb430a43dccfc63aab45
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 17 13:25:56 2014 -0500

    Reimplemented BLIS initialization/finalization.
    
    Details:
    - Rewrote bli_init() and bli_finalize() with OpenMP critical sections
      for thread-safety. Also added lots of explanatory comments.
    - Renamed bli_init_safe() and bli_finalize_safe() with the _auto()
      suffix, and reimplemented for simplicity. Updated all invocations
      in BLAS compatibility layer to use _auto() suffix.
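
    One way a critical section can make initialization safe to call from
    several threads is sketched below. All names are hypothetical and the
    real bli_init()/bli_finalize() logic may differ; the sketch only shows
    the check-and-set happening inside a named critical region. The pragma
    is guarded so the code also builds without OpenMP.

```c
#include <stdbool.h>

static bool ex_is_init  = false;
static int  ex_n_setups = 0;   /* counts how often setup actually ran */

static inline void ex_init( void )
{
#ifdef _OPENMP
    #pragma omp critical (ex_init_finalize)
#endif
    {
        /* Only the first caller performs the setup work. */
        if ( !ex_is_init )
        {
            ex_n_setups++;     /* stand-in for the real setup work */
            ex_is_init = true;
        }
    }
}

static inline void ex_finalize( void )
{
#ifdef _OPENMP
    #pragma omp critical (ex_init_finalize)
#endif
    {
        /* Tear down only if the library is currently initialized. */
        if ( ex_is_init )
            ex_is_init = false;
    }
}
```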

commit 36358948ea75074bda32a9f8c008f835b87d21db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 17 10:58:10 2014 -0500

    Retired frame/3/gemm/other directory.
    
    Details:
    - Removed frame/3/gemm/other directory, which contained some outdated
      and/or experimental variants.

commit c73261f17edf589e76bdbe297702a1fbbd69275f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:23:51 2014 -0500

    More minor cleanups post-copyright update.

commit 2a09d24463d358be6243b24f112fad057c2aefe0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:17:09 2014 -0500

    Reverted power7 symlinks destroyed by sed script.
    
    Details:
    - Reverted two symlinks, in kernels/power7/3/test, back to being symlinks
      after recursive-sed.sh mistakenly replaced them with copies of the
      actual files to which they referred. Meant to include this in previous
      commit.

commit 7ed415824d3b2e78541b6f64e404ca5347c06d3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:14:33 2014 -0500

    Updated copyright headers (continued).
    
    Details:
    - Inserted "at Austin" into third clause of license declarations.
      Meant to include this change in previous commit.

commit 5c2c6c85616834ff2716ece083118201d9df6dde
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 16:05:03 2014 -0500

    Updated copyright headers to contain "at Austin".
    
    Details:
    - Updated copyright headers to include "at Austin" in the name of the
      University of Texas.
    - Updated the copyright years of a few headers to 2014 (from 2011 and
      2012).

commit fcec68cda3f6e90ae055e7304e6674c1c5c8d010
Merge: 94c0df79 4a20ed1a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 11:35:34 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit 94c0df797eda377931f29a41ba6a89c0ed58daca
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 14 11:24:36 2014 -0500

    Changed order of zero dim / error checking.
    
    Details:
    - Updated level-2 and level-3 internal back-ends so that the operation's
      _check() function is called BEFORE any attempt to return early due to
      the presence of zero dimensions. This ordering makes more sense because
      (for example) object dimensions should match even if one of them is
      zero. Previously, a dimension mismatch could result in an early return
      with no error message.
    - Updated bli_check_object_buffer() so that NULL buffers result in an
      error only if the object is dimensionally non-empty (i.e., only if both
      of the object's dimensions are non-zero). This allows BLIS operations
      to be performed on dimensionally empty objects (i.e., where at least one
      dimension is zero).
    - Updated the error message associated with bli_check_object_buffer()
      to mention the newly relaxed constraint mentioned above, vis-a-vis
      non-zero dimensions.
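
    The relaxed buffer rule amounts to the following predicate. This is a
    sketch with a hypothetical name, not the body of
    bli_check_object_buffer().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A NULL buffer is an error only when both dimensions are non-zero. */
static inline bool ex_buffer_check_fails( const void *buf, long m, long n )
{
    assert( m >= 0 && n >= 0 );
    bool is_empty = ( m == 0 || n == 0 );
    return buf == NULL && !is_empty;
}
```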

commit 20690fe3018ce17c8df61ce0bffecaa7911dc3a5
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jul 13 22:50:56 2014 -0700

    Emscripten port

commit 4a20ed1a3f5e9e5232df30aa0e568e6c00c56ce1
Merge: 6a515e98 8ccdfaef
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:45:01 2014 -0500

    Merge pull request #14 from Maratyszcza/master
    
    Support "make test" for PNaCl configuration

commit 6a515e988f2ae1628258a6dec2c0e9cf2d04790f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:38:33 2014 -0500

    Implemented dsdot() and sdsdot() in compat layer.
    
    Details:
    - Replaced "not yet implemented" error messages in dsdot() and sdsdot()
      with actual implementations. (These routines are so rarely used that
      this log message will probably lead to some people learning of their
      existence for the first time.)
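
    For readers meeting these routines for the first time: both compute a
    dot product of single-precision vectors while accumulating in double
    precision. dsdot() returns the double result; sdsdot() adds a scalar
    sb and returns a float. The sketch below follows the reference BLAS
    semantics, including the convention for negative increments. The ex_
    names are hypothetical and this is not the code added by this commit.

```c
/* Accumulate x^T y in double precision, starting from acc. For a
   negative increment, traversal starts at the far end of the vector,
   as in the reference BLAS. */
static inline double ex_dot_accum( double acc, int n,
                                   const float *x, int incx,
                                   const float *y, int incy )
{
    long ix = ( incx < 0 ) ? ( long )( 1 - n ) * incx : 0;
    long iy = ( incy < 0 ) ? ( long )( 1 - n ) * incy : 0;

    for ( int i = 0; i < n; i++ )
    {
        acc += ( double )x[ ix ] * ( double )y[ iy ];
        ix += incx;
        iy += incy;
    }
    return acc;
}

/* Double-precision result of a single-precision dot product. */
static inline double ex_dsdot( int n, const float *x, int incx,
                               const float *y, int incy )
{
    return ex_dot_accum( 0.0, n, x, incx, y, incy );
}

/* sb + x^T y, accumulated in double and rounded to float at the end. */
static inline float ex_sdsdot( int n, float sb,
                               const float *x, int incx,
                               const float *y, int incy )
{
    return ( float )ex_dot_accum( ( double )sb, n, x, incx, y, incy );
}
```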

commit 255668ddd1004552c6cc65035ec6486671ce99bb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 13 17:30:44 2014 -0500

    Inserted gemv beta-scaling bug into compat layer.
    
    Details:
    - BLAS has a peculiar bug (or feature) whereby calling gemv on a vector
      y of non-zero length and a vector x of zero length results in no action.
      Given that the operation is y := beta*y + A*x, many (most?) individuals
      would expect vector y to still be scaled by beta. BLIS, when called
      natively, handles these cases intuitively (with beta scaling).
      Unfortunately, many BLAS test suites actually check for the way this
      situation is handled. Therefore, we have decided to implement this "bug"
      in the compatibility layer so as to provide "bug-for-bug" compatibility
      with BLAS.
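
    The difference between the two behaviors can be sketched with a toy
    column-major, no-transpose gemv. The function and its blas_compat flag
    are hypothetical; the quick return on a zero dimension mirrors the
    reference BLAS, while the code actually added to the compatibility
    layer may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>

/* y := beta*y + alpha*A*x for an m x n column-major matrix A. */
static inline void ex_dgemv_n( bool blas_compat, int m, int n,
                               double alpha, const double *a, int lda,
                               const double *x, double beta, double *y )
{
    /* BLAS-style quick return: y is left untouched, even by beta. */
    if ( blas_compat && ( m == 0 || n == 0 ) ) return;

    /* Native-style behavior: y is always scaled by beta. */
    for ( int i = 0; i < m; i++ )
        y[ i ] = ( beta == 0.0 ) ? 0.0 : beta * y[ i ];

    for ( int j = 0; j < n; j++ )
        for ( int i = 0; i < m; i++ )
            y[ i ] += alpha * a[ i + ( long )j * lda ] * x[ j ];
}
```

    With m = 2, n = 0 and beta = 2, the native path doubles y while the
    compatibility path leaves it unchanged.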

commit 570a154581bdb353fa13a219c7cb3c81d3dceffd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jul 12 17:51:05 2014 -0500

    Comment/formatting updates to build scripts.
    
    Details:
    - Minor updates to comments and formatting in bump-version.sh and
      update-version-file.sh scripts.

commit 26cd81990631ff799791629206e068126ff9e3a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 10 13:16:07 2014 -0500

    Added bli_info_*() query functions.
    
    Details:
    - Added a new API family, bli_info_*(), which can be used to query
      information about how BLIS was configured. Most of these values are
      returned as gint_t, with the exception of the version string which
      is char*.
    - Changed how the testsuite driver queries information about how BLIS
      was configured (from using macro constants directly to using the
      new bli_info API).
    - Removed bli_version.c and its header file.
    - Added STRINGIFY_INT() macro to bli_macro_defs.h
    - Renamed info_t type in bli_type_defs.h to objbits_t (not because of
      an actual naming conflict, but because the name 'info_t' would now be
      somewhat misleading in the presence of the new bli_info API, as the
      two are unrelated).

commit 970b43141697d8c31a033f59513bb59d7cc78ab0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 10 09:30:00 2014 -0500

    Minor bugfixes to BLAS compatibility layer.
    
    Details:
    - Changed bla_amax.c so that i?amax() routines now correctly return 0
      if ( n < 1 || incx <= 0 ).
    - Changed bla_rotg.c and bla_rotmg.c to use bli_fabs() macro instead of
      f2c's abs() macro for float and double cases.
    - Thanks to Murtaza Ali for suggesting the two fixes above.
    - Updated label of fnormv to normfv in testsuite/input.operations.
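
    The i?amax() return convention from the first item, sketched for the
    double-precision real case. The name ex_idamax is hypothetical; the
    result is the 1-based index of the first element of largest absolute
    value, and 0 signals the degenerate case.

```c
/* 1-based index of the first element with the largest absolute value,
   or 0 if n < 1 or incx <= 0. */
static inline int ex_idamax( int n, const double *x, int incx )
{
    if ( n < 1 || incx <= 0 ) return 0;

    int    best     = 1;
    double best_abs = ( x[ 0 ] < 0.0 ) ? -x[ 0 ] : x[ 0 ];

    for ( int i = 1; i < n; i++ )
    {
        double v = x[ ( long )i * incx ];
        if ( v < 0.0 ) v = -v;

        /* Strict comparison keeps the first of several equal maxima. */
        if ( v > best_abs ) { best_abs = v; best = i + 1; }
    }
    return best;
}
```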

commit 8ccdfaef4c42ad8957af8607a1a9ee29b9277d4b
Author: Marat Dukhan <maratek@gmail.com>
Date:   Tue Jul 8 23:14:36 2014 -0700

    Replicated logic from testsuite/Makefile in top-level Makefile to support make test

commit caa6507ff3724c80d60987f309b8bbc5b50a9841
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:25:27 2014 -0500

    Minor cleanup to standalone test drivers.
    
    Details:
    - Very minor code changes to standalone test drivers in 'test' directory.
    - Added *.so files to '.gitignore'.

commit 6c65e9a58fe55990ebb99ec3986443e18af35338
Merge: cb12e456 daca500d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:13:49 2014 -0500

    Merge branch 'master' of github.com:flame/blis

commit cb12e456f94c196c093e52f02a7cbca0032fc86e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 8 10:07:46 2014 -0500

    Fixed possible level-3 inf/NaN issue when beta=0.
    
    Details:
    - Redefined xpbys_mxn and xpbys_mxn_u/_l macros to employ a copy
      (instead of scaling by beta) when beta is zero. This will stamp out
      any possible infs or NaNs in the output matrix, if it happens to be
      uninitialized. Thanks to Tony Kelman for isolating this bug.
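
    Why the copy matters: in IEEE arithmetic 0 * NaN is NaN (and 0 * inf
    is NaN), so computing x + beta*y with beta = 0 lets garbage in an
    uninitialized y leak into the result. A sketch of the fixed behavior
    for a column-major block follows; the function name is hypothetical
    and this is not the actual macro definition.

```c
/* y := x + beta*y on an m x n block, with a plain copy when beta == 0. */
static inline void ex_xpbys_mxn( int m, int n,
                                 const double *x, int ldx,
                                 double beta,
                                 double *y, int ldy )
{
    for ( int j = 0; j < n; j++ )
    {
        for ( int i = 0; i < m; i++ )
        {
            double  xij = x[ i + ( long )j * ldx ];
            double *yij = &y[ i + ( long )j * ldy ];

            if ( beta == 0.0 ) *yij = xij;                 /* overwrite */
            else               *yij = xij + beta * (*yij); /* scale+add */
        }
    }
}
```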

commit daca500db5e2448ba0da8047b75eb0f88d9f40e3
Merge: ab3bc915 47023502
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Jul 3 12:52:52 2014 -0500

    Merge branch 'master' of http://github.com/flame/blis

commit 4702350278af31f662b458127777dd4d85a3192f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 3 11:48:23 2014 -0500

    Defined _ukernel_void() wrappers to micro-kernels.
    
    Details:
    - Added wrappers for micro-kernels so that users may invoke the
      micro-kernels without knowing what the function names actually are.
      This is useful when an application wishes to call the micro-kernel
      from a shared library instance of BLIS, where the application may not
      necessarily have the luxury of grabbing the micro-kernel name(s) from
      C preprocessor macros at compile-time. Also, since the wrappers use
      void* pointers, one's environment does not need to be aware of some
      BLIS types such as scomplex and dcomplex. These wrappers now join the
      level-1 and level-1f kernel wrappers, which pre-dated this commit.
    - Removed the wrapper definitions and prototypes from the micro-kernel
      test suite modules, and replaced calls to them with calls to the new
      wrappers mentioned above.

commit ab3bc9153b914fbaf259e15b66c91d628e7c8661
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Jul 3 11:19:43 2014 -0500

    Fixed a bug for TRSM when BLIS_ENABLE_MULTITHREADING is not set but the multithreading environment variables are turned on

commit b8134b720b985783ee6a582a3eb5d6c51f00d051
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Jul 2 16:02:39 2014 -0500

    Quick and dirty multithreading for TRSM
    
    Should work fine for small number of threads (up to 8 or maybe even 16).
    However, performance is as yet untested.
    This parallelizes the "JR" loop for the left sided cases
    and the "IR" loop for the right sided cases.
    
    Future work is to parallelize the outer loops as well.

commit e8ef69692831db07ddbe9485a5e504ac3f03e496
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 2 14:59:27 2014 -0500

    Added shared library support to build system.
    
    Details:
    - Modified top-level Makefile to support building shared (dynamic)
      libraries.
    - Updated most configurations' make_defs.mk files to include necessary
      compiler/linker flags needed by top-level Makefile.
    - Note that by default, all configurations presently do NOT build
      shared libraries. To enable, one must change the value of
      BLIS_ENABLE_DYNAMIC_BUILD to 'yes'.

commit b80df0f2cffb015da02e70a82b8512da9891ab67
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:52:39 2014 -0500

    Added bump-version.sh script to 'build' directory.
    
    Details:
    - Added a bash script, bump-version.sh, to aid in incrementing the BLIS
      version string.

commit 9ef1f1e21d083697fc730e48d7d9169c201f3da2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:48:17 2014 -0500

    CHANGELOG update (0.1.3)

commit 036cc634918463b1caa0fd89c9a211f2f5639af7 (tag: 0.1.3)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:48:17 2014 -0500

    Version file update (0.1.3)

commit 09d9a3bf6763932d9f571085b2cfd1b8631eccba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 13:43:26 2014 -0500

    Reverting version file to test new version script.
    
    Details:
    - Changed version file contents to 0.1.2 so that I can test out a new
      version file bumping script.

commit ebb33965981dcb2b0bdee5fc7fdf6c959420f311
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 11:22:50 2014 -0500

    Added 'version' file.

commit 2cb9a5501a3cbeb6692cf68e896087ba73b6af69
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 10:42:29 2014 -0500

    Removed 'version' from .gitignore file.

commit b40dcefc5ee31f67aa3990e2e9d2ef8ed1386a25
Merge: 7101a8ee b693b0cd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 23 10:39:05 2014 -0500

    Merge pull request #11 from Maratyszcza/stable
    
    [sc]axpy kernels for PNaCl

commit b693b0cddcfb41450e3c09a3ab97acb44c1ccdec
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 22 13:44:25 2014 -0700

    [SC]AXPY kernels for PNaCl

commit 7101a8eec0327d6c3a7eb36eb4b0fd45c1c6d162
Merge: ad48dca2 020a831b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 19 21:46:50 2014 -0500

    Merge pull request #10 from Maratyszcza/stable
    
    Portable Native Client port

commit 020a831bc5f61744cb8354886aa679b99b1285f6
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:58:26 2014 -0700

    Code clean-up in PNaCl port

commit 491be4f91ed725522f5cc7184053857c6c376ada
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:45:44 2014 -0700

    Optimized dot product kernels for PNaCl

commit 4b8e71aab80182873a2e138eb07902b8d8fd5480
Author: Marat Dukhan <maratek@gmail.com>
Date:   Thu Jun 19 00:43:25 2014 -0700

    Use AR rcs flags for PNaCl target to avoid warning

commit 031deb2a5c718d569bde842590a791b812f4cf1d
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:11:34 2014 -0700

    PNaCl configuration: use pnacl-ar instead of ar (fixes build issue on Mac)

commit 68a02976e3c3638f0a9821342e269a1743e3ace3
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:10:25 2014 -0700

    Compile pnacl configuration in GNU11 mode to avoid warning about non-standard features

commit 6f8462eb0ec278b89731e73ef583386a3371d095
Author: Marat Dukhan <maratek@gmail.com>
Date:   Wed Jun 18 03:08:46 2014 -0700

    Fix inconsistent VERBOSE macro in Makefile

commit b2ffb4de8b6872cb23537ad282e557d11dcd9c8b
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 18:41:30 2014 -0400

    Reformatted PNaCl GEMM kernels

commit 6de2d472d98baa215264a776f3d5291780a6a085
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 08:44:31 2014 -0400

    CGEMM and ZGEMM kernels for PNaCl

commit f064711a5e6fb3852c17c7520909b09dc27665f2
Author: Marat Dukhan <maratek@gmail.com>
Date:   Sun Jun 15 06:27:37 2014 -0400

    SGEMM and DGEMM kernels for PNaCl

commit ad48dca22913a363899f0bef45553898718eebb1
Merge: ee2b6792 7118f87e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 14 15:10:13 2014 -0500

    Merge pull request #9 from tkelman/memalign_windows
    
    Use _aligned_malloc instead of posix_memalign on Windows

commit 7118f87e18b4941423472afc00215c1d1f2a1fcd
Author: Tony Kelman <tony@kelman.net>
Date:   Sat Jun 14 06:53:20 2014 -0700

    Use _aligned_malloc instead of posix_memalign on Windows

commit ee2b679281ca45fb40b2198e293bc3bc3d446632
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Jun 6 12:41:55 2014 -0500

    Only include omp.h if BLIS_ENABLE_OPENMP is set

commit 19c05dfaac43c627f86e897c8c00f1f9440754aa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 5 10:54:16 2014 -0500

    CHANGELOG update (for 0.1.2).

commit 00f232f8ed1f7c41619b12ebf779ebe2c3b2d3cd (tag: 0.1.2)
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Jun 2 13:40:57 2014 -0500

    Added single-precision micro-kernel for Knights Corner aka MIC aka Xeon Phi

commit 3fc60e491426f6248c0feae88d971e4d1f88fb95
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 21 11:34:42 2014 -0500

    Fixed ldim alignment bug in core2 gemm ukernel.
    
    Details:
    - Fixed a bug in the dunnington/core2 gemm micro-kernels that resulted in
      a segmentation fault if a column-stored matrix's starting address was
      aligned, but its leading dimension was such that its second column was
      unaligned. Basically, the micro-kernel was assuming that aligned load
      instructions were safe when they actually were not. An extra condition
      that checks the alignment of cs_c (i.e., the leading dimension in the
      column storage case) has now been added. Thanks to Michael Lehn for
      reporting this bug.
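
    The kind of check described above might look like the sketch below.
    The function name is hypothetical, and the 16-byte requirement is an
    assumption based on SSE aligned loads; the actual condition lives in
    the micro-kernel and may be expressed differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_ALIGN_BYTES 16u  /* assumed: alignment needed by aligned loads */

/* Aligned loads on every column of a column-stored C are safe only if
   the base address is aligned AND the column stride, in bytes, is a
   multiple of the alignment. Checking the base address alone misses the
   case where the second column starts at an unaligned address. */
static inline bool ex_can_use_aligned_loads( const double *c, long cs_c )
{
    assert( cs_c >= 0 );
    bool base_ok   = ( ( uintptr_t )c % EX_ALIGN_BYTES ) == 0;
    bool stride_ok = ( ( ( uintptr_t )cs_c * sizeof( double ) )
                       % EX_ALIGN_BYTES ) == 0;
    return base_ok && stride_ok;
}
```

    For example, an aligned base address with cs_c = 3 fails the check,
    because the second column begins 24 bytes past the base.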

commit 77a2d8dac8b242d7a202c9aabda3927ab68cf987
Merge: 8c5d6071 21fb0893
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue May 20 09:53:19 2014 -0500

    Merge pull request #8 from tlrmchlsmth/master
    
    Added multithreading to most level-3 operations.

commit 21fb089387ee7c87f6dc53b0f60f68b48d3ff3e8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon May 19 20:38:55 2014 -0700

    Reverting changes to dunnington and reference configs
    
    Now they are unchanged from the main branch of BLIS

commit 8a0ef0e0db5880730425926f8ba56b457a2ba764
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri May 16 13:44:14 2014 -0500

    Fixed rounding error in bli_get_range_weighted

commit 0b4b1680334528b1b60bc696537600f763198e92
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri May 16 12:23:37 2014 -0500

    Fixed bug with disabling JC loop threading for right sided trmm

commit 5c048a90d8dfa1dbde4e45fbc10ffcbdfe59d960
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed May 14 16:20:06 2014 -0500

    Disabled parallelism for right-sided TRMM JC loop
    
    The loop has dependent iterations.

commit 13a4c717ed0e273359dbaf5554cc4fa70b087d71
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed May 14 14:59:04 2014 -0500

    Fixed bug with bli_get_range_weighted

commit 45957cc7745e9bb1698408d72f53ef192e960820
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue May 13 17:14:46 2014 -0500

    Allowed threading to be turned off
    
    No longer requires OpenMP to compile
    Define the following in bli_config.h in order to enable multithreading:
    BLIS_ENABLE_MULTITHREADING
    BLIS_ENABLE_OPENMP
    
    Also fixes a bug with bli_get_range_weighted

commit bd1dc98ce599d74513a553fe3b37a2ebca1c3812
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon May 12 17:26:19 2014 -0500

    Disabled multithreading of the kc loop

commit 456df0372170bd7ca2c7e2d85365a69f1f04de88
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 30 12:28:00 2014 -0500

    Replaced register blocksize hack with querying the register blocksize for determining parallelism granularity

commit f4fdfe8fc573553eb36795b79cdf681270dab71b
Merge: 31bb065b 8c5d6071
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 30 11:46:35 2014 -0500

    Merge http://github.com/flame/blis

commit 8c5d6071e24ba10a53669390a47287e86ff354ce
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 29 12:26:12 2014 -0500

    Added _check() routines for fprint[mv], rand[mv].
    
    Details:
    - Added _check() routines for fprintm, fprintv, randm, and randv.
    - Added invocations to the above routines from their respective
      front-ends.

commit 262cdabcc885bcf6636f4d8bb7d320f95e81d820
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 28 16:48:25 2014 -0500

    Changed treatment of NULL object buffers.
    
    Details:
    - Relaxed the constraint in bli_obj_attach_buffer_check(), which required
      the buffer address being attached to be non-NULL. This is acceptable
      because the user was already able to create and use objects with NULL
      buffers (via bli_obj_create_without_buffer(), which initializes the
      buffer to NULL).
    - Inserted calls to newly defined function, bli_check_object_buffer(),
      into nearly all operations' _check() or _int_check() functions. This
      allows BLIS to abort peacefully if a computational routine is called
      with an object containing a NULL buffer. By contrast, under such
      conditions, BLAS would typically fail with a segmentation fault.
    - Within operation front-ends, moved the calls to _check()/_int_check()
      so that zero dimensions are checked first (and if found, execution
      returns with trivial or no computation). This resolves issue #7. Thanks
      to Jack Poulson for reporting this bug.

commit 31bb065ba40ae0c5a614e743b8025abca012b99e
Merge: 20e24430 7c619599
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Apr 23 12:30:19 2014 -0500

    Merge http://github.com/flame/blis

commit 7c61959955c8ba78160d0ed4d1979022029d963b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 10 17:18:36 2014 -0500

    Can now query register blocksizes from blk algs.
    
    Details:
    - Added a new field to blksz_t objects that allows one to attach a
      sub-object. Doing this allows us to associate a register blocksize with
      any given cache blocksize. That way, the register blocksize can be
      queried wherever the cache blocksize would normally be accessible
      (e.g. a blocked algorithm).
    - Modified bli_gemm_cntl.c (and 4m/3m variants) so that the register
      blocksizes are attached to the cache blocksizes after they are created.
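
    The sub-object idea above can be sketched roughly as follows. The names
    my_blksz_t, my_blksz_attach_sub(), and my_blksz_reg() are hypothetical
    and are not the BLIS API; the field layout is an assumption made only
    for illustration.

```c
#include <stddef.h>

/* Hypothetical blocksize object: one value plus an optional sub-object. */
typedef struct my_blksz_s
{
    long               v;   /* blocksize value (e.g. a cache blocksize) */
    struct my_blksz_s* sub; /* attached sub-object (e.g. a register blocksize), or NULL */
} my_blksz_t;

/* Attach a register blocksize object to a cache blocksize object. */
static void my_blksz_attach_sub( my_blksz_t* cache, my_blksz_t* reg )
{
    cache->sub = reg;
}

/* Query the register blocksize through the cache blocksize.
   Returns 0 if no sub-object has been attached. */
static long my_blksz_reg( const my_blksz_t* cache )
{
    return ( cache->sub != NULL ) ? cache->sub->v : 0;
}
```

    With this layout, any code that already holds the cache blocksize
    object can reach the register blocksize without an extra parameter.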

commit 58671597d3d450817b2eda576c05ed6dadd8af6d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 10 15:35:30 2014 -0500

    Minor cleanups to level-2 _cntl.c files.
    
    Details:
    - Changed level-2 _cntl.c files so that the blocksizes for gemv are
      imported and used, rather than blocksizes being declared locally.
    - Whitespace changes to gemv_cntl.c and gemm_cntl.c files (as well as
      4m/3m variants).
    - Removed test/old/test_blis2.c.

commit 20e24430a772bc0fbaf24dec2f8c544096fd3f4e
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Tue Apr 8 17:50:44 2014 +0000

    Some fixes for the bgq kernels

commit bde697f75ec1e7f2decebee0c9bd620b4c134cd5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:43:44 2014 -0500

    Add -openmp to ldflags as well

commit c332be8cd471eeace7b4fa4ae7443088b6a68ec3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:37:50 2014 -0500

    Added -openmp flag to Xeon Phi build for convenience

commit e7ca9e4b4a24d585c9aec8293fc7bb79e4171ad0
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:31:15 2014 -0500

    Used BLIS_DEFAULT_*_MR for rounding partitioning instead of BLIS_DEFAULT_*_MC

commit 7b9b228c6fa4cfb70b1ebb855b009a036e85fac3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 16:29:10 2014 -0500

    Fix for tree barrier freeing bug

commit 5ec93bd9a76096312d51c326ccde1e9bd0a436ab
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 15:09:10 2014 -0500

    Bunch of minor fixes
    
    Removed barrier after unpackm in all level3 blocked variants
    Now there is an implicit barrier inside unpackm that only occurs if C is packed (which is usually not the case)
    
    Moved the enabling of the tree barriers into bli_config.h
    Fed the default MR and NR for double precision into bli_get_range instead of the number 8
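
    A rough sketch of why the blocking factor passed to the range
    computation matters: each thread's sub-range should begin on a multiple
    of the register blocksize. The helper my_get_range() below is
    hypothetical and is not the implementation of bli_get_range; the
    rounding policy is an assumption.

```c
/* Hypothetical helper: split [0, n) among n_way threads so that every
   boundary, except possibly the final one, is a multiple of bf. */
static void my_get_range( long n, long n_way, long work_id, long bf,
                          long* start, long* end )
{
    long n_blocks = ( n + bf - 1 ) / bf;              /* blocks of size bf, rounded up */
    long blk_per  = ( n_blocks + n_way - 1 ) / n_way; /* blocks per thread, rounded up */
    long s        = work_id * blk_per * bf;
    long e        = s + blk_per * bf;

    /* Clamp to the problem size; trailing threads may receive less work. */
    if ( s > n ) s = n;
    if ( e > n ) e = n;

    *start = s;
    *end   = e;
}
```

    Passing the actual MR or NR as bf, rather than a hard-coded 8, keeps
    the boundaries aligned for any micro-kernel shape.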

commit 575fb9b0b08f3bdb56ccde056da619d1585617c1
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 12:13:29 2014 -0500

    Changed default blocking factor to default double precision MR and NR

commit ab9c7880335c281432d5809fe0dec46753d22569
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 11:38:11 2014 -0500

    Added faster tree barriers necessary for performance for Xeon Phi
    
    Fixed up some stuff in the thread info free functions
    Disabled threading for TRSM so that it actually works when threading environment variables are set

commit ec58a7923cccac08632670caadf3cf6ff5dce766
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 10:22:48 2014 -0500

    Freeing thread info paths.
    
    Also made herk IC and JC loops do weighted partitioning

commit 2b6848b2397d6d84ca4e5f792fc51ad05e351a36
Merge: 4e3eb39a 21a0efb3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Apr 4 09:54:54 2014 -0500

    Merge http://github.com/flame/blis
    
    Conflicts:
            kernels/bgq/1/bli_axpyv_opt_var1.c
            kernels/bgq/1/bli_dotv_opt_var1.c

commit 4e3eb39aca4df0b9fdc003d468f368a2f2ba597d
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Fri Apr 4 14:50:03 2014 +0000

    Some fixes to the bgq config
    MR and NR for double complex were wrong
    Default fusing factor for double precision was wrong as well

commit 21a0efb33d7435139e9c43c1a4787a6bff533e26
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 16:38:44 2014 -0500

    Fixed follow-up to issue #6.

commit c318157a9bee8ea6e59be16f99f65d9271fe0d27
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 16:24:34 2014 -0500

    Fixed issue #6 (incorrect 'restrict' usage).
    
    Details:
    - Fixed improper usage of restrict keyword in axpyv and dotv bgq kernels.
      (However, there may be other instances of similar misuse elsewhere in
      BLIS.) Thanks to Jeff Hammond for reporting this issue.

commit b5150a1bf3bd89598e2b3aeac110eb5b44ac6c12
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 3 12:25:45 2014 -0500

    Added #include "arm_neon.h" to ARM gemm ukernel.
    
    Details:
    - Inserted #include "arm_neon.h" into gemm ukernel source file for
      arm/neon. Thanks to Jean-Michel Hautbois for suggesting this fix.

commit 2041c264517b6c590fd4f7e8253e6911b622d1c3
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Apr 3 10:30:03 2014 -0500

    Added barriers needed prior to doing scalar reset for rank-k updates.

commit 47a90e69dfde3f4f8fdf90654248a6b499fbadbc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 1 14:34:31 2014 -0500

    Attempted to fix uninitialized variable warnings.
    
    Details:
    - Added initialization statements to various macros used in level 1m and
      1m-like operations. I wasn't able to reproduce the reported behavior,
      so hopefully this takes care of it. Thanks to Jeff Hammond for the
      report.

commit d27b4f690c14b1f836f8c7a3c0e91e09d852f02e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 1 12:57:24 2014 -0500

    Use generic paths for toolchain in POWER7.
    
    Details:
    - Fixed issue #4. Thanks to Jeff Hammond for contributing changes.

commit 1584ae1c83c3a8c1af76acb46404747507650f19
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Mar 28 15:15:48 2014 -0500

    Fixed race condition involving scalar reset

commit 459dde4acc09e49380da58fb7b246db488884ad9
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 27 17:06:45 2014 -0500

    Made barrier after packing implicit.
    
    This also fixed a bug where barriers in the blocked variants were inserted after the inner packing routines,
    but not the outer packing routines.
    This allowed, for instance, computation to begin before the block of B had finished being packed.

commit 9f78ec6e7e95fcad89a167b27cad7e2d74b6d122
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 27 14:18:46 2014 -0500

    Some fixes for the internal functions,
    which were inappropriately having only the thread chief do some things.

commit a6fd48345424e097f71652be013aa897e098b41e
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Wed Mar 26 17:19:46 2014 +0000

    Added test drivers for level 3 BLAS that run tests in parallel using MPI

commit 73b3db594864be0f9be9a0eb29bf961fa9c95f29
Author: Tyler Michael Smith <tmsmith@vestalac1.ftd.alcf.anl.gov>
Date:   Wed Mar 26 15:39:05 2014 +0000

    Some fixes for the bgq configuration

commit f0824a04fc75e231c3a3d7757fa4e7294173282f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 24 15:21:42 2014 -0500

    Initial commit to enable threading in TRSM.
    
    Also enabled weighted partitioning for herk, trmm
    Fixed bug where multiple threads would try to modify the same state in the internal level 3 functions
    Correctly computed a_next and b_next for gemm, herk macrokernels
    a_next and b_next point to the current micropanels in trmm

commit 23d9eab354fbc88165889832955e126772bf8488
Merge: 5d5dc2ee fd3e32a5
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 20 16:54:35 2014 -0500

    Merge https://github.com/flame/blis

commit 5d5dc2eedef2f7c90d61371a1b457be5c06cf583
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Mar 20 16:43:36 2014 -0500

    Parallelized trmm and trmm3
    
    Also fixed bugs in packm

commit fd3e32a5f419fa412f46afe4dd1c3a26e15f3eb4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 20 13:59:48 2014 -0500

    Refined INSERT_GENTFUNC macro usage.
    
    Details:
    - Defined new INSERT_GENTFUNC macros so that the macro always takes
      exactly the number of arguments needed for the particular operation or
      variant being defined. Many operations were using INSERT_GENTFUNC
      macros that expected one auxiliary argument even though none were
      needed. Those instances have now been updated. Most of these instances
      were in the level-0 and -1v operations, as well as some operations
      defined in frame/util.

commit 9b0e715f29338a1a1d6445907d2445c35f011121
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 19 15:47:54 2014 -0500

    Minor simplifications to trmm, trsm macro-kernels.
    
    Details:
    - Simplified some code that would have allowed the diagonal of a trmm
      or trsm triangular matrix to intersect the short end of a micro-panel.
      This is disallowed via higher-level constraints on cache blocksizes, so
      this code was never needed and only served to obfuscate.
    - Updated some comments in trmm, trsm macro-kernels.

commit a3902750b9ab4923433f7e353f3669c3c419f8e4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Mar 19 12:35:17 2014 -0500

    Reorganized norm operations.
    
    Details:
    - Completely reorganized norm operations:
      - Renames:
        - fnormsc, fnormv, fnormm -> normfsc, normfv, normfm (2-norm)
        - absumv -> norm1v (vector 1-norm)
      - New operations:
        - norm1m (matrix 1-norm)
        - normiv, normim (infinity-norm)
        - amaxv (BLAS-like absolute maximum value index)
        - asumv (BLAS-like absolute sum)
    - Deprecated absumm, as it did not correspond to any actual norm.
      (However, an inlined version now exists in the testsuite module for
      randm.)

commit c0140cb752f27e99742f85d23be2181c00a1335e
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Wed Mar 19 11:21:16 2014 -0500

    Fixed packm variants 3 and 4 where every thread was trying to manipulate the same state
    
    Now just performed by the master thread.

commit fb42983bd9943711baa7d1c6496de1215bb816ef
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 16:37:28 2014 -0500

    Fixed a barrier bug and a thread decorator bug

commit aa2405f8b23d0f8d2ec04790882f2176ef2e8fd8
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 15:23:09 2014 -0500

    Fixing function pointer issues with thread decorator

commit ec8b88f93533942d3711191873310e7ff281bda6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 14:35:37 2014 -0500

    Enabled threading for packm blocked variants 3 and 4

commit 0ac534cdf657bbf04601abfe719ba2887aab5da7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 18 13:26:27 2014 -0500

    Added decorator for calling parallelized internal functions
    
    Will allow for easy support for different threading models

commit 5296f58975f7d351f88909cc80b6d0cffd73def7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 17:15:35 2014 -0500

    Fixing some bugs with herk parallelization

commit c51d0110831eb89361b4720bf7ed75edbd26ebce
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 15:00:47 2014 -0500

    Initial multithreading support for HERK

commit c720b141568d1f289146bf34ded08001f2c0dfbb
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 17 11:39:32 2014 -0500

    Switched to using environment variables to control threading.
    
    The environment variables all follow the format BLIS_X_NT,
    where X is the index of the loop as described in our paper
    Anatomy of High Performance Many-Threaded Matrix Multiplication.
    These indices are IR, JR, IC, KC, and JC.
    
    Also enabled parallelism for hemm and symm, but these are currently untested.
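
    Reading one of these variables might look like the sketch below. The
    variable names follow the BLIS_X_NT format given above (for example
    BLIS_JC_NT); the helper names and the fallback to 1 when a variable is
    unset or invalid are assumptions, not the behavior of BLIS itself.

```c
#include <stdlib.h>

/* Parse a thread-count string; fall back to 1 if unset or invalid. */
static long parse_nt( const char* s )
{
    char* endp;
    long  v;

    if ( s == NULL ) return 1;

    v = strtol( s, &endp, 10 );

    /* Reject empty strings, trailing garbage, and non-positive counts. */
    if ( endp == s || *endp != '\0' || v < 1 ) return 1;

    return v;
}

/* Read the thread count for one loop, e.g. read_loop_nt( "BLIS_JC_NT" ). */
static long read_loop_nt( const char* name )
{
    return parse_nt( getenv( name ) );
}
```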

commit 92233cf64274b27b2217c5cfffe75443ff6137a4
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 11 14:16:08 2014 -0500

    Some fixes to gemm thread info tree creation,
    Changed microkernel tests to use the new BLIS_PACKM_SINGLE_THREADED
    instead of BLIS_SINGLE_THREADED

commit 020f80c30289d8bcaa688bf600b01fae9b23b54f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Tue Mar 11 12:08:17 2014 -0500

    Added files specific to threading for gemm and packm operations

commit 8d8f4352a41926bc923e47be836365b6b726aff2
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:47:28 2014 -0500

    Added single threaded thread info data structures specifically for gemm and packm

commit 0e8677761175189583ca7d855e24b2bbdd2dada8
Merge: 2e727a02 b3bff631
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:16:21 2014 -0500

    Merge branch 'master' of https://github.com/tlrmchlsmth/blis

commit 2e727a025a8f796d2b6bd14f489d0ee72e7d1fc7
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Mon Mar 10 15:14:33 2014 -0500

    Modifying the thread info data structures
    
    This change makes each operation have its own thread info type,
    allowing more fine control of threading in operations that have different types of suboperations

commit a770590cf21a459f04bf941c58ee2afd272cc441
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 3 14:31:44 2014 -0600

    Minor fixes to sumsqv, abmaxv.
    
    Details:
    - Minor update to bli_sumsqv_unb_var1() to bring it up-to-date with
      LAPACK 3.5.0's zlassq.f, which, starting with 3.4.2, returns NaN when
      the vector (or matrix) contains a NaN.
    - Minor change to bli_abmaxv_unb_var1() to more closely mimic the
      behavior of netlib BLAS's izamax(). There, a "less than or equal to"
      operator is used in the search instead of "less than", which would
      change the element index returned if there were multiple maximum values.
    - Added macro function definitions for bli_isinf() and bli_isnan(), which
      are currently implemented in terms of isinf() and isnan() from math.h.
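
    The effect of the comparison operator on ties can be sketched as
    follows. This is a simplified real-valued illustration with a
    hypothetical function name, not the code of bli_abmaxv_unb_var1() or
    izamax(): skipping an element whenever its magnitude is "less than or
    equal to" the running maximum keeps the earliest index among equal
    maxima.

```c
#include <stddef.h>

/* Return the index of the first element of largest absolute value. */
static size_t first_abs_max( const double* x, size_t n )
{
    size_t i_max = 0;
    double a_max = -1.0; /* smaller than any absolute value */
    size_t i;

    for ( i = 0; i < n; ++i )
    {
        double a = ( x[i] < 0.0 ) ? -x[i] : x[i];

        /* "<=" skips ties, so the earliest maximum is retained.
           Using "<" here would return the last maximum instead. */
        if ( a <= a_max ) continue;

        a_max = a;
        i_max = i;
    }

    return i_max;
}
```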

commit b3bff631eadf98b15cb422fb4a8e2f855c23e8a7
Merge: 2c158fb8 e8757b03
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:53:24 2014 -0600

    Merge https://github.com/flame/blis

commit 2c158fb885c27f7b599dc1e85b57edd684f19223
Merge: e4738c48 c2b2ab62
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:46:23 2014 -0600

    Merge https://github.com/flame/blis
    
    Conflicts:
            frame/1m/packm/bli_packm_blk_var1.c

commit e8757b03a74f9891632242e9a90efb32150826f5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 27 16:40:07 2014 -0600

    Use "%ld" as int format specifier in fprintm.
    
    Details:
    - Changed "%d" to "%ld" when printing integers via bli_fprintm().
    - Meant to include this in previous commit.

commit c663ce3b5170fee7dfb5b528b650d70c8e932cac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 27 16:32:57 2014 -0600

    Fixed various bugs when C99 complex is enabled.
    
    Details:
    - Fixed various bugs in packm_*_cxk(), the 4m/3m micro-kernels, and
      elsewhere in the framework that were not yet set up to work properly
      when BLIS_ENABLE_C99_COMPLEX is defined in bli_config.h
    - Extensive changes to f2c-derived files in frame/compat/f2c to allow
      C99 complex storage. Most of these changes center around accessing
      real and imaginary components via bli_?real()/bli_?imag() accessor
      macros, and setting of values via bli_?sets() assignment macros.
      (Thanks to Vladimir Sukarev for pointing out that _ENABLE_C99_COMPLEX
      was broken.)

commit e4738c48e00b89391d9baa1fd0aa62d1ea2f95e6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 16:29:46 2014 -0600

    Added support for parallelism in gemm micro-kernel

commit bfe214b633765ed40b57b330fbb84c332663aa40
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 15:53:10 2014 -0600

    Fixed bug with parallel packing, and bug with allocating an array of thread infos
    
    In packm variant 1, the variable p_begin was incremented each iteration, causing a dependency.
    This dependency was removed, allowing each iteration to be executed in parallel.
    
    Somewhere in bli_threading.c, I was allocating an array of pointers instead of an array of structs.
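
    The p_begin dependency described above can be illustrated with a small
    sketch. It assumes a fixed panel stride ps, and the function names are
    hypothetical; the actual packm code is more involved.

```c
#include <stddef.h>

/* Before: each iteration depends on the previous one through p_begin. */
static void offsets_serial( size_t n_iter, size_t ps, size_t* out )
{
    size_t p_begin = 0;
    size_t it;

    for ( it = 0; it < n_iter; ++it )
    {
        out[it]  = p_begin;
        p_begin += ps;      /* loop-carried dependency */
    }
}

/* After: the offset is a function of the iteration index alone, so the
   iterations may be distributed among threads in any order. */
static void offsets_independent( size_t n_iter, size_t ps, size_t* out )
{
    size_t it;

    for ( it = 0; it < n_iter; ++it )
        out[it] = it * ps;
}
```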

commit 6193d9ceea552e67170dba45abde04c64271c705
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 14:09:19 2014 -0600

    Fixed bug in thread trees

commit ac5a2de1d17ffd460b00fee9757898525a09abae
Merge: 01b125e8 bd3c7ecf
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 11:59:33 2014 -0600

    Merge branch 'master' of https://github.com/tlrmchlsmth/blis

commit 01b125e815f19410e8e0611d088b84570e499e93
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Thu Feb 27 11:55:45 2014 -0600

    First pass at adding parallelism to BLIS.
    
    Added a multithreading infrastructure that should be independent of multithreading implementation in the future.
    Currently, gemm blocked variants 1f and 2f and packm blocked variant 1 are parallelized.

commit c2b2ab62707e4174892aff3ce65f36f54878fae5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 26 12:46:45 2014 -0600

    Deprecated panel stride alignment in bli_config.h.
    
    Details:
    - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE from bli_config.h of all
      configurations. It was already going unused in packm_init() since the
      recent 4m/3m commit. This setting was rarely, if ever, useful, and its
      existence only posed a potential risk for 4m/3m-based implementations.
    - Removed BLIS_CONTIG_STRIDE_ALIGN_SIZE usage from mem_pool_macro_defs.h.
    - Updated comments regarding CONTIG_STRIDE_ALIGN_SIZE in template
      micro-kernels.

commit f18aee83a5ac1b14808686fc3c5a3c846a1d99b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 25 17:58:42 2014 -0600

    CHANGELOG update (for 0.1.1).

commit fde5f1fdece19881f50b142e8611b772a647e6d2 (tag: 0.1.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 25 13:34:56 2014 -0600

    Added extensive support for configuration defaults.
    
    Details:
    - Standard names for reference kernels (levels-1v, -1f and 3) are now
      macro constants. Examples:
        BLIS_SAXPYV_KERNEL_REF
        BLIS_DDOTXF_KERNEL_REF
        BLIS_ZGEMM_UKERNEL_REF
    - Developers no longer have to name all datatype instances of a kernel
      with a common base name; [sdcz] datatype flavors of each kernel or
      micro-kernel (level-1v, -1f, or 3) may now be named independently.
      This means you can now, if you wish, encode the datatype-specific
      register blocksizes in the name of the micro-kernel functions.
    - Any datatype instance of any kernel (1v, 1f, or 3) that is left
      undefined in bli_kernel.h will default to the corresponding reference
      implementation. For example, if BLIS_DGEMM_UKERNEL is left undefined,
      it will be defined to be BLIS_DGEMM_UKERNEL_REF.
    - Developers no longer need to name level-1v/-1f kernels with multiple
      datatype chars to match the number of types the kernel WOULD take in
      a mixed type environment, as in bli_dddaxpyv_opt(). Now, one char is
      sufficient, as in bli_daxpyv_opt().
    - There is no longer a need to define an obj_t wrapper to go along with
      your level-1v/-1f kernels. The framework now provides a _kernel()
      function which serves as the obj_t wrapper for whatever kernels are
      specified (or defaulted to) via bli_kernel.h
    - Developers no longer need to prototype their kernels, and thus no
      longer need to include any prototyping headers from within
      bli_kernel.h. The framework now generates kernel prototypes, with the
      proper type signature, based on the kernel names defined (or defaulted
      to) via bli_kernel.h.
    - If the complex datatype x (of [cz]) implementation of the gemm micro-
      kernel is left undefined by bli_kernel.h, but its same-precision real
      domain equivalent IS defined, BLIS will use a 4m-based implementation
      for the datatype x implementations of all level-3 operations, using
      only the real gemm micro-kernel.
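
    The defaulting rule for undefined kernels can be sketched with the
    preprocessor. The macro names BLIS_DGEMM_UKERNEL and
    BLIS_DGEMM_UKERNEL_REF come from the text above; the function
    my_dgemm_ukernel_ref() and its trivial signature are hypothetical
    stand-ins, and the framework's actual headers may be organized
    differently.

```c
/* Hypothetical reference kernel; the real signature differs. */
static int my_dgemm_ukernel_ref( void ) { return 0; }

/* The standard name of the reference kernel is a macro constant. */
#define BLIS_DGEMM_UKERNEL_REF my_dgemm_ukernel_ref

/* A configuration's bli_kernel.h would be included at this point and may
   or may not define BLIS_DGEMM_UKERNEL. In this sketch it does not. */

/* If the kernel was left undefined, default to the reference. */
#ifndef BLIS_DGEMM_UKERNEL
#define BLIS_DGEMM_UKERNEL BLIS_DGEMM_UKERNEL_REF
#endif
```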

commit 15b51e990f1d21333b5f7af97c211756247336e5
Merge: 6363a9f6 fc04b5eb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 21 09:04:32 2014 -0600

    Merge branch 'master' of github.com:fgvanzee/blis

commit fc04b5eb69868c341ce03f5ef1f02de4b8c121b0
Merge: b29e1c2b d1813c9d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 21 09:04:13 2014 -0600

    Merge pull request #3 from figual/master
    
    New ARM armv7a kernels and Assembly file consideration in Makefile

commit d1813c9dee34410833db5061e6588ec1a6c9ecd4
Author: Francisco Igual <figual@pandaboard.(none)>
Date:   Fri Feb 21 15:14:31 2014 +0100

    Added new armv7a micro-kernels and configuration files from Werner Saar.

commit 0cd098c03a000ed9426a7e9135190696da8cadbc
Author: Francisco Igual <figual@pandaboard.(none)>
Date:   Fri Feb 21 15:12:30 2014 +0100

     o Modified Makefile to consider .S assembly microkernels.

commit 6363a9f658257fe3d814a3dce5308f807adb54a2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 19 17:00:52 2014 -0600

    Added level-3 support for complex via 4m/3m.
    
    Details:
    - Added the ability to induce complex domain level-3 operations via new
      virtual complex micro-kernels which are implemented via only real
      domain micro-kernels. Two new implementations are provided: 4m and 3m.
      4m implements complex matrix multiplication in terms of four real
      matrix multiplications, whereas 3m uses only three and thus is
      capable of even higher (than peak) performance. However, the 3m method
      has somewhat weaker numerical properties, making it less desirable
      in general.
    - Further refined packing routines, which were recently revamped, and
      added packing functionality for 4m and 3m.
    - Some modifications to trmm and trsm macro-kernels to facilitate indexing
      into micro-panels which were packed for 4m/3m virtual kernels.
    - Added 4m and 3m interfaces for each level-3 operation.
    - Various other minor changes to facilitate 4m/3m methods.
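
    The arithmetic behind the two methods can be sketched with scalars. In
    the commit above the products are real matrix multiplications; the
    scalar functions below are hypothetical illustrations, and the 3m
    formulation shown is one common arrangement that is assumed here rather
    than taken from the BLIS source.

```c
/* 4m: (ar + ai*i)(br + bi*i) using four real multiplications. */
static void cmul_4m( double ar, double ai, double br, double bi,
                     double* cr, double* ci )
{
    *cr = ar * br - ai * bi;
    *ci = ar * bi + ai * br;
}

/* 3m: the same product using three real multiplications, at the cost of
   extra additions and subtractions (the source of the weaker numerical
   properties noted above). */
static void cmul_3m( double ar, double ai, double br, double bi,
                     double* cr, double* ci )
{
    double p1 = ar * br;
    double p2 = ai * bi;
    double p3 = ( ar + ai ) * ( br + bi );

    *cr = p1 - p2;
    *ci = p3 - p1 - p2;
}
```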

commit b29e1c2b278c177e104c84ba462820ee8296df6c
Merge: ee60377e bd3c7ecf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 14 14:11:54 2014 -0600

    Merge pull request #2 from tlrmchlsmth/master
    
    Fixes and improvements to xeon phi implementation.

commit bd3c7ecfb54a9b9851c7d364f41c21e4cff52f6f
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 14:05:57 2014 -0600

    Removing changes to input.general and input.operations

commit ce066863683cb4e910270cf8ab8e138b01ff3358
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 13:40:24 2014 -0600

    Fixed more Xeon Phi bugs, especially with scattered update

commit 31134b5c7076423aee1b4f494e925f27171d97e6
Author: Tyler Smith <tms@cs.utexas.edu>
Date:   Fri Feb 14 11:19:44 2014 -0600

    Some fixes, changes, and improvements to the microkernel for the Xeon Phi

commit ee60377e467862b9d8a7205c45dce5cf66c78c46
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 14:03:31 2014 -0600

    Shifted some fields in info_t.
    
    Details:
    - Shifted the pack order, pack buffer type, and structure type fields
      to make room for an extra bit in the pack type/status field.

commit bd3ab1ad4cf42f8bc30ab262acf8eccb49bb1a08
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 09:29:55 2014 -0600

    Minor fixes to trsm consistent with prev on trmm.
    
    Details:
    - Removed use of bli_min() and bli_max() that were only being used to
      try to support situations where the diagonal would intersect the
      short end of some micro-panels, which is a situation that is disallowed
      at a higher level by various constraints on the register and cache
      blocksize. This only affected trsm_ll and trsm_lu.
    - Use panel stride as passed into the macro-kernel rather than compute
      it via k and PACKMR/PACKNR. This affects all macro-kernels of trsm.

commit 6260b0b5f8bd248f3f66e5a1c6854bdbd9d02ad0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 13 09:19:56 2014 -0600

    Fixed obscure bug in trmm_ll, trmm_lu.
    
    Details:
    - Fixed an obscure bug in left-hand trmm that would only manifest when
      non-zero register blocksize extensions (PACKMR > MR or PACKNR > NR)
      are used.
    - Removed use of bli_min() and bli_max() that were only being used to
      try to support situations where the diagonal would intersect the
      short end of some micro-panels, which is a situation that is disallowed
      at a higher level by various constraints on the register and cache
      blocksize. This only affected trmm_ll and trmm_lu.
    - Use panel stride as passed into the macro-kernel rather than compute
      it via k and PACKMR/PACKNR. This affects all macro-kernels of trmm.

commit 16915c1c1e55c660bf82141cdadf7c0860d5b464
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 11 10:54:19 2014 -0600

    Fixed an obscure bug in packm_cxk().
    
    Details:
    - Fixed a bug in packm_cxk() whereby the packm ukernel was being chosen
      from ldp, which is always equal to PACKMR or PACKNR. The problem with
      this is that the pack ukernels were implicitly assuming that the
      panel dimension of the panel being packed was equal to ldp, which
      is not the case when the register blocksize extensions are non-zero
      (i.e., when PACKMR > MR or PACKNR > NR, whichever is applicable). This
      problem has been fixed by passing ldp into the pack ukernels, which
      now walk through the packed micro-panel region by incrementing by this
      value, rather than incrementing by the inherent panel dimension value
      assumed by each packm ukernel (e.g. 4 in the case of packm_ref_4xk).
    - Also fixed a very minor edge case inefficiency whereby pack ukernels
      smaller than the default were not being used in edge cases, and instead
      those situations were being handled by scal2m. This is related to the
      issue above, because the pack ukernel itself was being chosen based on
      ldp instead of the panel dimension.
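
    The distinction between the panel dimension and ldp can be sketched as
    follows. The function pack_panel() is a hypothetical simplification and
    is not one of the BLIS packm ukernels; it copies panel_dim elements per
    column but advances the destination by ldp, which may be larger.

```c
#include <stddef.h>

/* Pack a panel_dim-by-k panel of a, where element (i,j) is stored at
   a[i*rs_a + j*cs_a], into p one column at a time. */
static void pack_panel( size_t panel_dim, size_t k,
                        const double* a, size_t rs_a, size_t cs_a,
                        double* p, size_t ldp )
{
    size_t i, j;

    for ( j = 0; j < k; ++j )
    {
        for ( i = 0; i < panel_dim; ++i )
            p[i] = a[i * rs_a + j * cs_a];

        /* Step by ldp rather than panel_dim; the two differ when the
           register blocksize extension is non-zero. */
        p += ldp;
    }
}
```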

commit b7da57b282c5a5e2208946e60309d2352f55351d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 11 10:28:23 2014 -0600

    Updated calls to packm_blk_var2() in testsuite.
    
    Details:
    - In ukernel testsuite modules, replaced calls to packm_blk_var2() with
      _var1(). Meant to include this in previous commit.

commit c255a293e25b2223c88e8800267cd06ad2a90041
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 10 14:31:24 2014 -0600

    Consolidated packm_blk_var2 and var3.
    
    Details:
    - Consolidated the functionality previously supported by packm_blk_var2()
      and packm_blk_var3() into a new variant, packm_blk_var1().
    - Updates to packm_gen_cxk(), packm_herm_cxk(), and packm_tri_cxk()
      to accommodate above changes.
    - Removed packm_blk_var3() and retired packm_blk_var2() to
      frame/1m/packm/old.
    - Updated all level-3 _cntl_init() functions so that the new, more
      versatile packm_blk_var1 is used for all level-3 matrix packing.

commit 32d8f264ae7b28155f5d7b21dcc5ecb78da2e0ab
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Feb 9 10:07:37 2014 -0600

    Refactored packm variants.
    
    Details:
    - Revised packm_blk_var2() and _var3() by encapsulating the general,
      hermitian/symmetric, and triangular panel-packing subproblems into
      separate functions: packm_gen_cxk(), packm_herm_cxk(), and
      packm_tri_cxk(), respectively. Also, homogenized the packm code as
      well as the new specialized packm_*_cxk() code to further improve
      readability.

commit 6c8067028707947fcdf4f856a272e15bb9ed91e3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 7 11:27:15 2014 -0600

    Renamed enumerated type in testsuite and modules.
    
    Details:
    - Renamed the test suite's "mt_impl_t" enumerated type to "iface_t", and
      renamed all corresponding "impl" variables to "iface".

commit 6c12598b1bc567f0b08f58aebdc753a1c1390378
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 18:26:35 2014 -0600

    Employ simpler INSERT_ macro for ref ukernels.
    
    Details:
    - Defined a new macro, INSERT_GENTFUNC_BASIC0, which takes only one
      argument--the base name of the function--and employed this macro
      in the reference micro-kernel files instead of the _BASIC macro,
      which takes one auxiliary argument. That argument was not being
      used and probably just acted to unnecessarily obfuscate.

commit 32cae66326b68706d0e695cfd60c9ca5bc32c534
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 18:06:42 2014 -0600

    Fixed some instances of sloppy 'restrict' usage.
    
    Details:
    - Fixed some technical incorrectness with some usage of the 'restrict'
      keyword in the reference trsm micro-kernels.
    - Tweak to testsuite/Makefile that causes rebuild if libblis was
      touched.

commit 7aceef7683e2a2aff3c7ec2a73508036af2e19e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 17:31:19 2014 -0600

    Updated comments in macro-kernels.
    
    Details:
    - Updated (and fixed some errors in) the "Assumptions/assertions" comment
      section of macro-kernels.
    - Changed register blocksizes of reference configuration to MR = 8 and
      NR = 4. It's always good for MR != NR in the reference configuration
      since it may help uncover bugs related to non-square micro-kernels.

commit 8fd292aa78950bcdf556605718f09d13f9575abc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 6 14:32:21 2014 -0600

    Pass panel dimensions into macro-kernels.
    
    Details:
    - Modified the interfaces to the datatype-specific macro-kernels so that:
      - pd_a and pd_b are passed in (which contain the panel dimensions of
        packed panels of a and b).
      - rs_a and cs_b are no longer passed in (they were guaranteed to be 1).
    - Modified implementations of datatype-specific macro-kernels so pd_a,
      pd_b, cs_a, and rs_b are used instead of cpp macros for MR, NR, PACKMR,
      and PACKNR, respectively.
    - Declare temporary c matrices (ct) as being maxmr-by-maxnr, which for now
      is equivalent to being mr-by-nr. maxmr and maxnr are declared in a new
      header file bli_kernel_post_macro_defs.h.

commit 3404e6657eabb017cd1580a2f1dd8e6fb13df923
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 5 11:19:10 2014 -0600

    Deprecated incremental blocksize macro const defs.
    
    Details:
    - Removed macro constant definitions related to incremental blocksizes
      from all configurations' bli_kernel.h files. This change is minor and
      is mostly a cleanup related to a previous commit.

commit 1e9afd39a63e0a58167d4439c1a0a880a4a35657
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 4 20:15:19 2014 -0600

    Comment updates (removed vestiges of "bd").

commit 5cf58f7c2d5bc0d2d94d9576f7158d8f133b7aac
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 4 09:15:19 2014 -0600

    Added early returns for "object is zeros" case.
    
    Details:
    - Added some logic to packm_init(), pack_int() and gemm_int() so that
      (a) objects marked as BLIS_ZEROS are not packed, and (b) those
      objects are not computed with. This functionality is not currently
      needed by any existing implementations, but may be used in the
      future.

commit 6bbd4be769a9b344a55abe5ddaca1a99fd29f7b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 3 13:15:25 2014 -0600

    Added 'f' on some gemm and trmm blocked variants.
    
    Details:
    - Added 'f' to some block variant files/functions to be consistent with
      other file/functions' naming convention. Here, the f indicates
      partitioning in the "forward" direction.

commit eb13cb2c6b182df5e2a9b88c76f50e2cee25b9e0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 3 11:07:01 2014 -0600

    Removed redundant non-gemm blksz_t creation.
    
    Details:
    - Removed code that creates duplicate blksz_t objects for herk, trmm,
      and trsm. Instead, the gemm blksz_t objects are accessed via extern
      and used directly. This reduces the amount of code associated with
      each of the three _cntl_init() and _cntl_finalize() functions.

commit 0a023a7d9e58e53b8c204a5f49aa8ca9afeba938
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 29 14:02:08 2014 -0600

    Introduced new level-3 front-end layer.
    
    Details:
    - Added new _front() functions for each level-3 operation. This is done
      so that the choosing of the control tree (and *only* the choosing of
      the control tree) happens in what was previously the "front end"
      (e.g. bli_gemm()). That control tree is then passed into the _front()
      function, which then performs up-front tasks such as parameter
      checking.

commit 251c5d112196d37b183e554bc9d406104aed65fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jan 28 19:40:29 2014 -0600

    Removed redundant hemm, her2k control trees.
    
    Details:
    - Removed code that generated a control tree specifically for hemm and
      symm. Instead, the gemm control tree is now configured so that it
      works for gemm, hemm, or symm.
    - Retired most her2k code, as it was not being used. (Currently, her2k is
      implemented as two invocations of herk.) I couldn't think of many
      situations where her2k variants were needed.
    - Removed some older her2k code.

commit 5a36e5bf2f59d1e85d6dbce32a07d604c5e82d11
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 27 11:13:00 2014 -0600

    Embed func_t microkernel objects in control trees.
    
    Details:
    - Modified all control tree node definitions to include a new field of
      type func_t*, which is similar to a blksz_t except that it contains
      one function pointer (each typed simply as void*) for each datatype.
      We use the func_t* to embed pointers to the micro-kernels to use for
      the leaf-level nodes of each control tree. This change is a natural
      extension of control trees and will allow more flexibility in the
      future.
    - Modified all macro-kernel wrappers to obtain the micro-kernel pointers
      from the incoming (previously ignored) control tree node and then pass
      the queried pointer into the datatype-specific macro-kernel code, which
      then casts the pointer to the appropriate type (new typedefs residing
      in bli_kernel_type_defs.h) and then uses the pointer to call the micro-
      kernel. Thus, the micro-kernel function is no longer "hard-coded" (that
      is, determined when the datatype-specific macro-kernel functions are
      instantiated by the C preprocessor).
    - Added macros to bli_kernel_macro_defs.h that build datatype-specific
      base names if they do not exist already, and then use those to build
      datatype-specific micro-kernel function names. This will allow
      developers extra flexibility if they wanted to, for example, name each
      of their datatype-specific micro-kernels differently (e.g. double
      real might be named bli_dgemm_opt_4x4() while double complex might be
      named bli_zgemm_opt_2x2()).
    - Inserted appropriate code into _cntl_init() functions that allocates
      and initializes a func_t object for the corresponding micro-kernels.
      The gemm ukernel func_t object is created once, in bli_gemm_cntl_init(),
      and then reused via extern wherever possible.

commit 6cbd6f1c7f1915180aa28939833afde48665c5ae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 24 10:38:29 2014 -0600

    Removed commented mixed domain macro-kernel code.
    
    Details:
    - Removed commented-out code from macro-kernels that was supposed to
      facilitate implementing mixed domain (complex times real) matrix
      multiplication. This functionality is still (probably) possible,
      but I'm getting tired of looking at the code every time I edit
      a macro-kernel. Plus, there are probably ways of doing it at a
      higher level, via control trees.

commit 29778be1119f1a884330d7f8dc424a2df4101d58
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 22 16:03:11 2014 -0600

    Removed b_aux field from cntl nodes.
    
    Details:
    - Removed b_aux field from all control tree node definitions. This field
      was being used in certain optimizations (incremental blocking) that were
      not actually being employed within BLIS, and are probably not employed
      by others.
    - Updated all _cntl_obj_create() function definitions and invocations
      according to above change.
    - Retired bli_gemm_blk_var4.c, which was one such function that employed
      incremental blocking, but which was never called by BLIS itself.

commit 06ac727a42ec9e832c7832745036702014638f99
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 15 16:44:52 2014 -0600

    Updated some comments in level-3 front ends.

commit d628bf1da1560f1f5126a1ddfed8714f0a4b8da3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 15 11:40:12 2014 -0600

    Consolidated pack_t enums; retired VECTOR value.
    
    Details:
    - Changed the pack_t enumerations so that BLIS_PACKED_VECTOR no longer has
      its own value, and instead simply aliases to BLIS_PACKED_UNSPEC. This
      makes room in the three pack_t bits of the info field of obj_t so that
      two values are now unused, and may be used for other future purposes.
    - Updated sloppy terminology usage in comments in level-2 front-ends.
      (Replaced "is contiguous" with more accurate "has unit stride".)

commit ddc8c1c379b4787be5954802906593d7ea144452
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 13 14:55:43 2014 -0600

    Suppress warning in Makefile (UNINSTALL_LIBS).
    
    Details:
    - Redirect errors to /dev/null when using 'find' to locate libraries that
      would be uninstalled upon executing "make uninstall-old". Before, if the
      Makefile was read before $(INSTALL_PREFIX)/lib existed, a "No such file
      or directory" message was emitted. This message was harmless, but is now
      suppressed in this situation.

commit f8f67d7251bffc05020e20527c100c8115fd5e55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 10 09:06:11 2014 -0600

    Typecast bli_getopt() return value in testsuite.
    
    Details:
    - In the test suite driver, inserted an explicit typecast of the return
      value of bli_getopt() prior to parsing. The lack of typecast caused a
      problem on at least one system whereby a return value of -1 was
      interpreted as a garbage character. Thanks to Francisco Igual for finding
      and submitting this fix.
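    
    One plausible mechanism for this class of bug, offered as an
    assumption since the commit does not name the platform or the types
    involved: if a -1 sentinel passes through a character type that is
    unsigned, it no longer compares equal to -1. The sketch below emulates
    that with an explicit unsigned char; demo_next_opt() is a hypothetical
    stand-in, not bli_getopt().

```c
#include <assert.h>

/* Hypothetical option source: yields 'g', then 'o', then -1 when done. */
static int demo_next_opt(int i)
{
    static const char opts[] = "go";
    assert(i >= 0);
    return i < 2 ? opts[i] : -1;
}

/* Holding the result in an int keeps the -1 sentinel intact. */
static int demo_count_with_int(void)
{
    int n = 0;
    int opt;
    while ((opt = demo_next_opt(n)) != -1) ++n;
    return n;
}

/* Emulates a platform where plain char is unsigned: -1 is stored as
   UCHAR_MAX (255 with 8-bit chars), so the sentinel is lost.
   Returns 1 when the stored value no longer equals -1. */
static int demo_sentinel_lost_with_unsigned_char(void)
{
    unsigned char opt = (unsigned char)demo_next_opt(2);
    int widened = opt;  /* 255, not -1 */
    return widened != -1;
}
```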

commit e7f154fe2ed3e10e2323cefe5d25c2c23ac902c4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 10 08:48:07 2014 -0600

    Applied edge case fix to arm/neon microkernel.
    
    Details:
    - Applied an edge case bugfix, courtesy of Francisco Igual, to the current
      double precision real gemm microkernel in kernels/arm/neon/3.

commit 89c76a8a51d070d263c13bfa5ace65769509f2b4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jan 9 12:08:37 2014 -0600

    Allow building outside source distribution.
    
    Details:
    - Modified build system (mostly configure and top-level Makefile) so that
      a user can build a BLIS library outside of the top-level directory of
      the source distribution.
    - Added "test" target to Makefile so that the user can run "make test",
      which will compile, link, and run the testsuite binary. This works even
      if the build directory is externally located, thanks to the test suite
      binary's new -g and -o command-line options. Also, when creating the
      test suite via the top-level Makefile, the linking is against the
      local archive, in lib/<configname>, rather than at <install_prefix>/lib.
    - Modified testsuite/Makefile so that it links against the library built
      locally, in ../lib/<configname>.
    - Added "-lm" to LDFLAGS of most configurations' make_defs.mk.
    - Various other cleanups to build system.

commit 12fa82ec12cc340ab28552997d9d50f7c98691f8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jan 8 16:09:26 2014 -0600

    Implemented bli_getopt().
    
    Details:
    - Added bli_getopt.c and .h files to frame/base. These files implement
      a custom version of getopt(), which may be used to parse command line
      options passed into a program via argc/argv. I am implementing this
      function myself, as opposed to using the version available via unistd.h,
      for portability reasons, as the only requirement is string.h (which
      is available via the standard C library).
    - Modified test suite to allow the user to specify the file name (and/or
      path) to the parameters and operations input files: -g may be used to
      specify the general input file and -o to specify the operations input
      file. If -g or -o or both are not given, default filenames are assumed
      (as well as their existence in the current directory).
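    
    A rough sketch of a getopt-style parser that depends only on string.h,
    in the spirit of the description above. It is not the bli_getopt()
    source: the name demo_getopt(), the state struct, and the behavior on
    errors are all assumptions, and grouped flags such as "-ab" are not
    handled.

```c
#include <assert.h>
#include <string.h>

/* Parser state kept by the caller instead of in globals (an assumption). */
typedef struct {
    int         optind;  /* next argv index to examine; start at 1 */
    const char* optarg;  /* argument of the last option, or NULL */
} demo_getopt_t;

/* Returns the option character, '?' on an unknown option or a missing
   argument, or -1 when no options remain. In optstring, a ':' after a
   character means that option takes an argument. */
static int demo_getopt(int argc, char* const argv[], const char* optstring,
                       demo_getopt_t* st)
{
    assert(optstring != NULL && st != NULL);
    st->optarg = NULL;
    if (st->optind >= argc) return -1;

    const char* arg = argv[st->optind];
    if (arg[0] != '-' || arg[1] == '\0') return -1;  /* not an option */

    /* ':' is a marker in optstring, never a valid option character. */
    const char* spec = (arg[1] == ':') ? NULL : strchr(optstring, arg[1]);
    st->optind += 1;
    if (spec == NULL) return '?';

    if (spec[1] == ':') {
        if (arg[2] != '\0')
            st->optarg = arg + 2;               /* attached: -gfile */
        else if (st->optind < argc)
            st->optarg = argv[st->optind++];    /* separate: -g file */
        else
            return '?';                         /* argument missing */
    }
    return (unsigned char)arg[1];
}
```

    With an optstring of "g:o:", a command line such as
    "-g input.general -o input.operations" would yield 'g' and 'o' with
    the file names in optarg, then -1.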

commit cafb58e86ea5cfb21b9eedc57ca8ebbf24252098
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 13:28:36 2014 -0600

    Updated template micro-kernels to use auxinfo_t.
    
    Details:
    - Updated template micro-kernel implementations (located in
      config/template/kernels), to adhere to the new auxinfo_t interface.
      Meant to include this change in a0331fb1.
    - Changed template configuration to use 64-bit integers (for both BLIS
      and the BLAS compatibility layer).

commit 9ab126b499c3805045020cb89a8a5848e28d3bf5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jan 6 12:13:26 2014 -0600

    Removed error checks in netlib->BLIS param mapping
    
    Details:
    - Disabled error checking in netlib-to-BLIS parameter mapping functions.
      If the char value input to these functions was not one of the defined
      values, bli_check_error_code() with the appropriate error code value
      would be called, resulting in an abort(). This was unnecessary and
      redundant since these routines are currently only used within the
      BLAS compatibility layer, and they are only called AFTER parameter
      checking has already been performed on the original BLAS char values.
      If the application tried to override xerbla() to prevent an abort()
      from being called, this error checking would still get in the way.
      Thus, instead of reporting the error situation to the framework (ie:
      calling abort()), an arbitrary BLIS parameter value is now chosen and
      the function returns normally. Thanks to Jeff Hammond for finding and
      reporting this issue.
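    
    The change above can be pictured with a small sketch. The enum and
    function names are hypothetical, and the particular fallback value is
    an arbitrary choice made for illustration, just as the commit
    describes the real fallback as arbitrary.

```c
/* Hypothetical stand-in for a BLIS transposition parameter. */
typedef enum {
    DEMO_NO_TRANSPOSE, DEMO_TRANSPOSE, DEMO_CONJ_TRANSPOSE
} demo_trans_t;

/* Map a netlib-style char to the enum. The caller is assumed to have
   validated the char already, so an unrecognized value does not abort;
   it falls through to an arbitrary default and returns normally. */
static demo_trans_t demo_map_netlib_trans(char trans)
{
    switch (trans) {
    case 'n': case 'N': return DEMO_NO_TRANSPOSE;
    case 't': case 'T': return DEMO_TRANSPOSE;
    case 'c': case 'C': return DEMO_CONJ_TRANSPOSE;
    default:            return DEMO_NO_TRANSPOSE;  /* arbitrary, no abort */
    }
}
```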

commit 2cb13600f9f9601c60e7f96f4ca159d169ade9cb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 3 12:29:13 2014 -0600

    Updated year in copyright headers to 2014.

commit 290fa54e0083c9c837188b8321b13b1b282e7b0c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 20 14:10:26 2013 -0600

    Store variable panel strides in trmm/trsm auxinfo.
    
    Details:
    - Changed the value being stored into the auxinfo_t structure in trmm
      and trsm macro-kernels. Whereas before we stored whatever value was
      provided to the macro-kernel implementation via ps_a/ps_b, now we
      store the stride that will advance to the next variable-length
      micro-panel of the triangular matrix A (left) or B (right).
    - Whitespace changes to the files affected above.

commit e3a6c7e77667fd749248df3f75f880266c3136ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 19 16:29:31 2013 -0600

    Macroized conditionals for a2/b2 in macro-kernels.
    
    Details:
    - Replaced conditional expressions in macro-kernels related to computing
      the addresses a2 and b2 (a_next and b_next) with a preprocessor macro
      invocation, bli_is_last_iter(), that tests the same condition.
    - Updated gemm_ukr module to use auxinfo_t argument.
    - Whitespace changes in test suite ukr modules.

commit a0331fb10a50393e31d16339053b75b944132da1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 19 14:50:11 2013 -0600

    Introduced auxinfo_t argument to micro-kernels.
    
    Details:
    - Removed a_next and b_next arguments to micro-kernels and replaced them
      with a pointer to a new datatype, auxinfo_t, which is simply a struct
      that holds a_next and b_next. The struct may hold other auxiliary
      information that may be useful to a micro-kernel, such as micro-panel
      stride. Micro-kernels may access struct fields via accessor macros
      defined in bli_auxinfo_macro_defs.h.
    - Updated all instances of micro-kernel definitions, micro-kernel calls,
      as well as macro-kernels (for declaring and initializing the structs)
      according to above change.
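    
    A minimal sketch of the auxinfo_t pattern, with hypothetical names
    throughout (demo_auxinfo_t and the demo_ accessor macros are not the
    contents of bli_auxinfo_macro_defs.h). It also shows a last-iteration
    test in the style of the bli_is_last_iter() macro mentioned in the
    "Macroized conditionals" commit; wrapping a_next back to the first
    micro-panel on the final iteration is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Auxiliary information handed to the micro-kernel in one struct. */
typedef struct {
    const void* a_next;  /* micro-panel of A used by the next call */
    const void* b_next;  /* micro-panel of B used by the next call */
} demo_auxinfo_t;

/* Accessor macros, so kernels never touch the fields directly. */
#define demo_auxinfo_set_next_a(aux, p) ((aux)->a_next = (p))
#define demo_auxinfo_set_next_b(aux, p) ((aux)->b_next = (p))
#define demo_auxinfo_next_a(aux)        ((aux)->a_next)
#define demo_auxinfo_next_b(aux)        ((aux)->b_next)

#define demo_is_last_iter(i, n_iter)    ((i) == (n_iter) - 1)

/* Toy micro-kernel: a dot product. A real kernel might prefetch from
   the next-panel addresses; this one only requires that aux exists. */
static double demo_dot_ukr(int k, const double* a, const double* b,
                           const demo_auxinfo_t* aux)
{
    assert(aux != NULL);
    double dot = 0.0;
    for (int p = 0; p < k; ++p) dot += a[p] * b[p];
    return dot;
}

/* Macro-kernel stand-in: visit m_iter micro-panels of A, ps_a elements
   apart, filling in the struct before each micro-kernel call. */
static double demo_macro_kernel(int m_iter, int k, const double* a,
                                int ps_a, const double* b,
                                demo_auxinfo_t* last_aux)
{
    double sum = 0.0;
    for (int i = 0; i < m_iter; ++i) {
        const double* a1 = a + i * ps_a;
        demo_auxinfo_t aux;
        /* Assumption: after the last panel, the next one is the first. */
        demo_auxinfo_set_next_a(&aux,
            demo_is_last_iter(i, m_iter) ? a : a1 + ps_a);
        demo_auxinfo_set_next_b(&aux, b);
        sum += demo_dot_ukr(k, a1, b, &aux);
        *last_aux = aux;
    }
    return sum;
}
```

    Adding a field such as a micro-panel stride would then change only
    the struct and its accessors, not every micro-kernel signature.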

commit 392428dea4001fe4384efe29f6cde32f8abeeb35
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 12 19:01:47 2013 -0600

    Added "ri" scalar macros.
    
    Details:
    - Added set of basic scalar macros that take arguments' real and
      imaginary components separately, named like the previous set except
      with the "ris" (instead of "s") suffix.
    - Redefined the previous set of scalar macros (those that take arguments
      "whole") in terms of the new "ri" set.
    - Renamed setris and getris macros to sets and gets.
    - Renamed setimag0 macros to seti0s.
    - Use bli_?1 macro instead of a local constant in bla_trmv.c, bla_trsv.c.

commit f60c8adc2f61eaba06b892f4e73000159de93056
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 10 14:39:56 2013 -0600

    Minor updates to dunnington configuration.
    
    Details:
    - Added commented alternatives to dunnington configuration's bli_kernel.h.
    - Minor reformatting of optimization flag variables in make_defs.mk.

commit 4ef20150492db254b5baf2368add62e19b0ac11b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 9 18:53:03 2013 -0600

    Tweaks to dunnington configuration (x86_64/core2).
    
    Details:
    - Updated BLIS_DEFAULT_KC_D from 256 to 384.
    - Enabled cache blocksize extension of up to 25% for MC and KC (for
      double-precision real).

commit 5ad2ce7bf5ba3ea955e6d517bfd270e02820263b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 9 18:30:49 2013 -0600

    Minor x86_64 (core2) kernel fixes.
    
    Details:
    - Fixed copy-and-paste bug whereby [scz]gemmtrsm_u_opt_d4x4 kernels
      for x86_64/core2 were calling the wrong reference code (l instead
      of u).
    - Fixed some unused variables in x86_64/core2 dotaxpyv and dotxaxpyf
      kernels.
    - Minor typecasting fix in testsuite/src/test_libblis.c.
    - Makefile updates.

commit d289f5d3a9c0e1a68a17c1c32b736e282a289c4c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 5 10:56:13 2013 -0600

    Whitespace changes to level-2 blocked variants.
    
    Details:
    - Joined some lines in level-2 blocked variants to match formatting used
      in level-3 blocked variants.
    - Streamlined implementation of bli_obj_equals() in bli_query.c.

commit b444489f100d218bc8ef29b01ff8489c358559f9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 3 16:08:30 2013 -0600

    Added new "attached" scalar representation.
    
    Details:
    - Added infrastructure to support a new scalar representation, whereby
      every object contains an internal scalar that defaults to 1.0. This
      facilitates passing scalars around without having to house them in
      separate objects. These "attached" scalars are stored in the internal
      atom_t field of the obj_t struct, and are always stored to be the same
      datatype as the object to which they are attached. Level-3 variants no
      longer take scalar arguments; however, level-3 internal back-ends still
      do; this is so that the calling function can perform subproblems such
      as C := C - alpha * A * B on-the-fly without needing to change either
      of the scalars attached to A or B.
    - Removed scalar argument from packm_int().
    - Observe and apply attached scalars in scalm_int(), and removed scalar
      from interface of scalm_unb_var1().
    - Renamed the following functions (and corresponding invocations):
    
       bli_obj_init_scalar_copy_of()
                               -> bli_obj_scalar_init_detached_copy_of()
       bli_obj_init_scalar()   -> bli_obj_scalar_init_detached()
       bli_obj_create_scalar_with_attached_buffer()
                               -> bli_obj_create_1x1_with_attached_buffer()
       bli_obj_scalar_equals() -> bli_obj_equals()
    
    - Defined new functions:
    
       bli_obj_scalar_detach()
       bli_obj_scalar_attach()
       bli_obj_scalar_apply_scalar()
       bli_obj_scalar_reset()
       bli_obj_scalar_has_nonzero_imag()
       bli_obj_scalar_equals()
    
    - Placed all bli_obj_scalar_* functions in a new file, bli_obj_scalar.c.
    - Renamed the following macros:
    
       bli_obj_scalar_buffer() -> bli_obj_buffer_for_1x1()
       bli_obj_is_scalar()     -> bli_obj_is_1x1()
    
    - Defined new macros to set and copy internal scalars between objects:
    
       bli_obj_set_internal_scalar()
       bli_obj_copy_internal_scalar()
    
    - In level-3 internal back-ends, added conditional blocks where alpha and
      beta are checked for non-unit-ness. Those values for alpha and beta are
      applied to the scalars attached to aliases of A/B/C, as appropriate,
      before being passed into the variant specified by the control tree.
    - In level-3 blocked variants, pass BLIS_ONE into subproblems instead of
      alpha and/or beta.
    - In level-3 macro-kernels, changed how scalars are obtained. Now, scalars
      attached to A and B are multiplied together to obtain alpha, while beta
      is obtained directly from C.
    - In level-3 front-ends, removed old function calls meant to provide
      future support for mixed domain/precision. These can be added back later
      once that functionality is given proper treatment. Also, removed the
      creating of copy-casts of alpha and beta since typecasting of scalars
      is now implicitly handled in the internal back-ends when alpha and
      beta are applied to the attached scalars.
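    
    The attached-scalar scheme can be sketched in a few lines for the
    real domain. Everything here is hypothetical (demo_obj_t is not
    obj_t, and the kernel is a dot product standing in for a level-3
    macro-kernel). The sketch assumes that applying a scalar means
    multiplying it into the attached one.

```c
#include <assert.h>

/* A tiny object with an attached scalar that defaults to 1.0. */
typedef struct {
    double  scalar;  /* attached scalar */
    double* buf;     /* element storage */
    int     n;       /* number of elements */
} demo_obj_t;

static demo_obj_t demo_obj_create(double* buf, int n)
{
    demo_obj_t o = { 1.0, buf, n };
    return o;
}

/* Assumption: applying a scalar multiplies it into the attached one. */
static void demo_obj_scalar_apply(demo_obj_t* o, double s)
{
    o->scalar *= s;
}

/* Kernel stand-in: c[0] := beta * c[0] + alpha * dot(a, b), where
   alpha is the product of the scalars attached to a and b, and beta
   is taken directly from c. No scalar arguments are passed in. */
static void demo_dot_kernel(const demo_obj_t* a, const demo_obj_t* b,
                            demo_obj_t* c)
{
    assert(a->n == b->n && c->n == 1);
    double alpha = a->scalar * b->scalar;
    double beta  = c->scalar;
    double dot   = 0.0;
    for (int i = 0; i < a->n; ++i) dot += a->buf[i] * b->buf[i];
    c->buf[0] = beta * c->buf[0] + alpha * dot;
}
```

    A back-end would apply a non-unit alpha to an alias of A (or B) and
    a non-unit beta to an alias of C, then call the kernel with the
    objects alone.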

commit 992de486d6f23e69a623abd15ae77d7881d13871
Merge: 9552e6ee fd4ac636
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 2 13:58:46 2013 -0600

    Unimplemented kernels now call reference.
    
    Details:
    - Updated arm, bgq, loongson3a, and x86_64 kernels so that unimplemented
      datatypes call the corresponding reference kernel. Previously, these
      kernel functions called abort() with a "not yet implemented" error
      message.

commit fd4ac636d9a55cec1476a444bd4e70def219dc8f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 2 13:50:36 2013 -0600

    Unimplemented kernels now call reference.
    
    Details:
    - Updated micro-kernels for arm, bgq, loongson3a, and x86_64 so that
      unimplemented kernel functions simply call the corresponding reference
      implementation. (Previously, these unimplemented functions would
      abort() with a "not yet implemented" message.)

commit 9552e6ee824d4345d5e908e869e071d19829819a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Nov 24 11:40:31 2013 -0600

    Removed optional scaling from packm control tree.
    
    Details:
    - Removed does_scale field from packm control tree node and
      bli_packm_cntl_obj_create() interface. Adjusted all invocations of
      _cntl_obj_create() accordingly.
    - Redefined/renamed macros that are used in aliasing so that now,
      bli_obj_alias_to() does a full alias (shallow copy) while
      bli_obj_alias_for_packing() does a partial alias that preserves the
      pack_mem-related fields of the aliasing (destination) object.
    - Removed bli_trmm3_cntl.c, .h after realizing that the trmm control tree
      will work just fine for bli_trmm3().
    - Removed some commented vestiges of the typecasting functionality needed
      to support heterogeneous datatypes.

commit e65c476284db9ef64b23191a21c2584b1083342f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Nov 19 10:05:35 2013 -0600

    Minor updates to packm_blk_var2.c and _blk_var3.c.
    
    Details:
    - Comment updates to packm_blk_var2.c and packm_blk_var3.c.
    - In packm_blk_var2(), call setm_unb_var1(), scal2m_unb_var1() directly
      instead of setm(), scal2m().

commit 9e1d0d4bca48eda54301d8976f203e2544c9df3a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 18:11:07 2013 -0600

    Added trsm_l, trsm_u ukernels for x86_64/core2.
    
    Details:
    - Added standalone trsm_l/trsm_u micro-kernels for x86_64 (core2).
      These kernels are based on the gemmtrsm_l/gemmtrsm_u micro-kernels
      that already existed in kernels/x86_64/core2-sse3/3.

commit 85e7e02ea3a9190b6fcff5d46b00d41c79cb1242
Merge: 67761e22 70720054
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 12:02:00 2013 -0600

    Merge branch 'master'. Forgot to git-pull.

commit 67761e224c92500eecf9c1540cc72bdd2fb27679
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:57:40 2013 -0600

    Attempting to fix errors in bgq build.
    
    Details:
    - Removed restrict declaration from b_cast and c_cast from
      bli_trsm_lu_ker_var2.c and bli_trsm_rl_ker_var2.c. Curiously, they
      are causing problems for xlc only in those two files and no other
      macro-kernels.
    - Fixed (hopefully) kernel function parameter type declarations in
      kernels/bgq/1f/bli_axpyf_opt_var1.c and kernels/bgq/3/bli_gemm_8x8.c.

commit 707200541d344f98cf34c9801954dbb36fbe0447
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:17:31 2013 -0600

    Syntax error fix in x86_64/core2 gemmtrsm_u ukr.

commit bbe2b84a49e7785d4d0c514cda34adfbe66478b0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 11:11:06 2013 -0600

    Updated Makefile in test, testsuite.
    
    Details:
    - Updated Makefiles in test and testsuite directories to use the new
      BLIS header installation directory scheme, which is to compile with
      -I<PREFIX>/include/blis instead of -I<PREFIX>/include.

commit 9bd7fcfd436625ca2108128086671319362f4d92
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 18 10:58:09 2013 -0600

    Outer-to-inner 'restrict' fix in macro-kernels.
    
    Details:
    - Fixed sloppy placement of 'restrict' pointer declarations in level-3
      macro-kernels. Previously, all restricted pointers were being declared
      at the outer-most function scope level. While this violates the C99
      standard, very few of the compilers used with BLIS so far have seemed
      to care. The lone exception has been IBM's xlc. Thanks to Tyler Smith
      for identifying this bug (and suggesting the fix).
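    
    As an illustration of the inner-scope placement (a sketch with
    hypothetical names, not a BLIS macro-kernel): C99 ties the guarantee
    made by a restrict-qualified pointer to the block in which the
    pointer is declared, so declaring the pointers inside the loop body
    limits each promise to one iteration's panels.

```c
#include <assert.h>

/* Adds 2 * a to c, one panel of len elements at a time. The arrays a
   and c are assumed not to overlap. */
static void demo_scale_panels(int n_panels, int len, const double* a,
                              double* c)
{
    assert(n_panels >= 0 && len >= 0);
    for (int j = 0; j < n_panels; ++j) {
        /* Declared here, not at function scope: each restrict promise
           covers only the panels touched in this iteration. */
        const double* restrict a1 = a + j * len;
        double* restrict       c1 = c + j * len;

        for (int i = 0; i < len; ++i) c1[i] += 2.0 * a1[i];
    }
}
```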

commit 50549a6a31dd26cf63a013e0ede16b2c7ce835b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Nov 17 18:31:27 2013 -0600

    Changed header install directory to include/blis.
    
    Details:
    - Changed top-level Makefile so that headers are installed to
      $(INSTALL_PREFIX)/include/blis/. (Header directories are no longer
      named by version/configuration and then symlinked.)
    - Added uninstall targets, including uninstall-old to clean out old
      library archives.
    - Added GREP makefile definitions to all configurations' make_defs.mk.

commit d70733abddfb9a95661897e1e4f3c1f3cfa7cbaa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 16 17:34:25 2013 -0600

    Added ARM kernels, configurations.
    
    Details:
    - Added kernels for ARM, and configurations for Cortex-A9 and Cortex-A15.
      Thanks to Francisco Igual for contributing these kernels and
      configurations.

commit d37c2cff62089c86983c2f79762f4b5329037373
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 13 10:47:11 2013 -0600

    Minor comment and Makefile changes.
    
    Details:
    - Added missing 'check-config' and 'check-make-defs' targets to
      testsuite/Makefile.
    - Removed unused 'test' target from top-level Makefile.
    - Comment changes to testsuite input files.

commit 19885f893a17b91ee79bead0620d0f913392d4c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 11 12:09:21 2013 -0600

    Updated some kernel comment headers.
    
    Details:
    - Updated bgq and piledriver comment headers to use BLIS copyright header
      instead of libflame.

commit 1a4d698f42981d74fe5f29b980031e1ee7dc42d5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 11 10:15:40 2013 -0600

    CHANGELOG update (for 0.1.0).

commit 089048d5895a30221b6b1976c9be93ad6443420d (tag: 0.1.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 9 17:18:00 2013 -0600

    Added object wrappers to 1f test suite modules.
    
    Details:
    - Added missing object wrappers to level-1f test suite modules. This was
      only apparent if you were configuring with something other than the
      reference configuration.
    - Commented out object-wrappers in level-1f front-ends. These were not
      working as intended unless the reference configuration was selected,
      because most kernel sets, such as those in the template set, do not
      have object wrappers.
    - Whitespace changes to template micro-kernels.
    - Comment changes to template level-1f kernel headers.

commit 9ef3752079de10124bed906b5d28479d04aa8187
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 8 17:20:47 2013 -0600

    Updated template kernels wrt KernelsHowTo wiki.
    
    Details:
    - Merged latest state of KernelsHowTo wiki into template micro-kernels
      located in config/template/kernels/3.

commit 376bbb59c8944e29c5c1ff6637920d8451370afa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 8 11:17:34 2013 -0600

    Removed support for duplication.
    
    Details:
    - Removed support for duplication from the gemmtrsm/trsm micro-kernels
      and all framework code.
    - Updated test suite modules according to above changes.

commit 68a5910974b62b4df853fae2a68cb04df9d5a19c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Nov 7 11:36:11 2013 -0600

    Added comments to testsuite/input.operations.
    
    Details:
    - Added extensive comments to the top of testsuite/input.operations,
      which describe how to edit the file.
    - Removed input.operations.0 and input.operations.1.
    - Changed input.general to test all datatypes ("sdcz") by default.

commit a98f78b715fb256a519870071bb5266130d70b21
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 6 15:32:47 2013 -0600

    Changed dim_t and inc_t to be signed integers.
    
    Details:
    - Redefined dim_t and inc_t in terms of gint_t (instead of guint_t).
      This will facilitate interoperability with Fortran in the future.
      (Fortran does not support unsigned integers.)
    - Redefined many instances of stride-related macros so that they return
      or use the absolute value of the strides, rather than the raw strides
      which may now be signed. Added new macros bli_is_row_stored_f() and
      bli_is_col_stored_f(), which assume positive (forward-oriented) strides,
      and changed the packm_blk_var[23] variants to use these macros instead
      of the existing bli_is_row_stored(), bli_is_col_stored().
    - Added/adjusted typecasting to various functions/macros, including
      bli_obj_alloc_buffer(), bli_obj_buffer_at_off(), and various pointer-
      related macros in bli_param_macro_defs.h.
    - Redefined bli_convert_blas_incv() macro so that the BLAS compatibility
      layer properly handles situations where vector increments are negative.
      Thanks to Vladimir Sukharev for pointing out this issue.
    - Changed type of increment parameters in bli_adjust_strides() from dim_t
      to inc_t. Likewise in bli_check_matrix_strides().
    - Defined bli_check_matrix_object(), which checks for negative strides.
    - Redefined bli_check_scalar_object() and bli_check_vector_object() so
      that they also check for negative stride.
    - Added instances of bli_check_matrix_object() to various operations'
      _check routines.
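    
    For context on the negative-increment case: in the reference BLAS
    convention, a vector passed with incx < 0 has its first logical
    element at the far end of the buffer, at offset (n-1)*|incx|. The
    sketch below shows that address adjustment with signed increments.
    The function names are hypothetical, and it is an assumption that
    bli_convert_blas_incv() takes this exact form.

```c
#include <assert.h>

/* Return the address of logical element 0 of an n-element BLAS vector
   whose buffer starts at x and whose increment incx may be negative. */
static const double* demo_first_elem(int n, const double* x, int incx)
{
    assert(n > 0);
    return incx < 0 ? x + (n - 1) * (-incx) : x;
}

/* Read logical element i by stepping with the signed increment from
   the adjusted starting address. */
static double demo_elem(int n, const double* x, int incx, int i)
{
    assert(i >= 0 && i < n);
    const double* x0 = demo_first_elem(n, x, incx);
    return x0[i * incx];
}
```

    With an unsigned inc_t the product i * incx could not step backward,
    which is one reason the signed types matter here.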

commit 1f8afc3e08a4312cfe810be86aedeacbc57275c5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Nov 6 10:09:10 2013 -0600

    Minor comment update to BLAS compat files.

commit 1abbf768afafc158d44e4d5c4a135cfd9e277f13
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 15:50:00 2013 -0600

    Fixed bugs in scalv and setv.
    
    Details:
    - Fixed bugs similar to those addressed in cca1e1f51dc6, whereby
      a segmentation fault may occur if beta is not the same type as
      the vector operand for scalv and setv.
    - Changed axpyv and scal2v front-ends in a similar fashion.

commit f5953259a1842ee48e5833c22ac86e68a337bfe1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Nov 4 14:43:55 2013 -0600

    Fixed a bug related to Hermitian matrix diagonals.
    
    Details:
    - Fixed a bug whereby BLIS assumed that the imaginary components of the
      diagonal elements of Hermitian matrices were already zero. This property
      is now enforced when the matrix is packed (bli_packm_blk_var2). Thanks
      to Vladimir Sukharev for reporting this bug.
    - Minor comment updates to template kernels.

commit d70f2b089dac8b9e4c19295dfa6014c36afee2ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Nov 2 17:19:40 2013 -0500

    Added scaling to abval2s, sqrt2s macros.
    
    Details:
    - Re-defined abval2s and sqrt2s macros to use scaling to avoid underflow
      and overflow from squaring the real and imaginary components. (This is
      the same technique used to fix recent bugs in invscals/invscaljs and
      inverts.)

commit c5b1ed9409ae2f71d04041eef5da9a0080b5784a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 1 10:28:04 2013 -0500

    Added new dotxaxpyf variant 2.
    
    Details:
    - Added a new variant for dotxaxpyf that is based on dotxf and axpyf
      kernels. By default, this variant is not used by any other operation.

commit 97f89fbcf202d72fc440b614708e352ea31633e2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Nov 1 10:16:39 2013 -0500

    Fixed bug in complex invscals.
    
    Details:
    - Fixed complex inversion in invscals and invscaljs whereby the
      imaginary component was being computed incorrectly.
    - Use bli_fmaxabs() instead of bli_fabs() when choosing the scalar
      in inverts, invscals, and invscaljs.
    - Changed bli_abs() and bli_fabs() macro definitions to use "<="
      operator instead of "<".

commit eda42a21d17a2742eab69ab801ed530b82488c8a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 31 18:00:44 2013 -0500

    Defined missing symbols in bla_rotg.c
    
    Details:
    - Defined local equivalents of libf2c's r_sign(), d_sign(), c_abs(), and
      z_abs(), which are needed by bla_rotg.c. Also defined r_abs() and
      d_abs() for completeness. Thanks to Vladimir Sukharev for reporting
      these bugs.

commit cca1e1f51dc67a2c3725d5c1837256831aaf70f8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 30 14:39:01 2013 -0500

    Fixed bugs in scalm and setm.
    
    Details:
    - Fixed bugs in scalm and setm that resulted in segmentation faults when
      beta is not the same type as the matrix operand. Thanks to Vladimir
      Sukharev for reporting this bug.
    - Changed axpym and scal2m front-ends in fashion similar to that of scalm
      and setm; namely, the alpha scalar is copy-cast to the type of the
      first matrix operand.
    - Changed the template and reference configurations' bli_config.h files
      so that the number of memory allocator blocks of A and B are set based
      on BLIS_MAX_NUM_THREADS.
    - Comment updates to bli_obj.c and variable rename in bla_nrm2.c.

commit 2807013a4761c2b84b3944de64d23483ad7ef2fb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 24 14:32:20 2013 -0500

    Fixed over/under-flow in complex inversion.
    
    Details:
    - Fixed the complex bli_?inverts() macros, which were inverting elements
      in an "unsafe" manner, such that very large and very small values were
      unnecessarily over/under-flowing. Thanks to Vladimir Sukharev for
      reporting this bug.
    - Comment update to bli_sumsqv_unb_var1.c.
    - Removed redundant bli_min() macro in bli_scalar_macro_defs.h.
    - Changed 1.0F to 1.0 for bli_drands() macro.
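    
    A minimal sketch of the scaling idea behind the "safe" inversion
    described above. The type zsketch_t and the functions shown are
    hypothetical stand-ins, not the actual bli_?inverts() macros, and a
    zero input is not handled.

```c
#include <assert.h>

/* Hypothetical stand-in for a double-precision complex type. */
typedef struct { double real; double imag; } zsketch_t;

/* Local absolute value helper so the sketch needs no math library. */
double abs_d( double v ) { return v < 0.0 ? -v : v; }

/* Naive inversion: conj(x) / (a*a + b*b). The squares can overflow or
   underflow even when 1/x itself is representable. */
zsketch_t invert_naive( zsketch_t x )
{
    double    denom = x.real * x.real + x.imag * x.imag;
    zsketch_t y     = { x.real / denom, -x.imag / denom };
    return y;
}

/* Scaled inversion: divide through by s = max(|a|,|b|) first, so that no
   component is squared at its full magnitude. */
zsketch_t invert_scaled( zsketch_t x )
{
    double ra = abs_d( x.real );
    double ia = abs_d( x.imag );
    double s  = ( ra > ia ? ra : ia );

    assert( s > 0.0 ); /* inverting zero is outside this sketch */

    double    ar = x.real / s;
    double    ai = x.imag / s;
    double    t  = ar * x.real + ai * x.imag; /* equals |x|^2 / s */
    zsketch_t y  = { ar / t, -ai / t };
    return y;
}
```

    For x = 1e200 + 1e200i, the naive form squares 1e200 and overflows,
    while the scaled form only ever divides values of magnitude at most
    one by a value near 2e200.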

commit 45a80c625f84edb2ade6ac25efe2b9c589d7e0df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Oct 23 12:15:25 2013 -0500

    Fixed parameter checking issue in BLAS syr[2]k.
    
    Details:
    - Fixed a minor parameter checking bug in the BLAS compatibility layer
      for [sd]syrk and [sd]syr2k. Specifically, if 'C' is passed in for the
      trans parameter of either operation, it is (a) allowed, and (b) treated
      as 'T' (whereas previously it was disallowed). Thanks to Vladimir
      Sukharev for finding and reporting this bug.

commit a091a219bda55e56817acd4930c2aa4472e53ba5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Oct 14 10:11:29 2013 -0500

    Minor fixes to piledriver configuration, ukernel.
    
    Details:
    - Applied a patch from Tyler that fixes minor staleness in the piledriver
      configuration and gemm micro-kernel.
    - Very minor changes to test suite input files.

commit dacdde27aee4fb90b14880136d7f20c6b234e2c6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 11:37:19 2013 -0500

    Added Fran's Sandy Bridge kernels/configuration.
    
    Details:
    - Added a kernel directory for kernels developed by Francisco Igual for
      the Sandy Bridge architecture, including a dgemm ukernel coded with
      AVX intrinsics.
    - Added a configuration for Sandy Bridge using values supplied by Fran.

commit 03106d650e4030d4c9831683448376f92fc52d41
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Oct 11 10:40:38 2013 -0500

    Fixed minor perf bug in gemm_ker_var2.
    
    Details:
    - Fixed a minor performance bug in bli_gemm_ker_var2.c (and the experimental
      bli_gemm_ker_var5.c) whereby the addresses for a_next and b_next are not
      computed correctly (i.e., do not wrap around) at the edge cases. Thanks to
      Tze Meng for helping me identify this bug.

commit b053337387dbdef9035be03538222670a21707ca
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 18:26:55 2013 -0500

    Added fusing factors, MR/NR to test suite output.
    
    Details:
    - Updated the test suite driver (and modules where appropriate) so that
      the level-1f fusing factors are output along with the variable dimension.
      While this is not strictly necessary, since the fusing factors are output
      in the initial parameter summary, it gives the user extra reassurance
      since the fusing factors appear alongside the variable dimension, which
      together give a complete picture of the problem size. Similar changes were
      made for outputting the register blocksizes when reporting results for the
      micro-kernel test modules.

commit be4833bd91c5a58d0bfc52daaadf7ba543a77acf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 14:20:06 2013 -0500

    Added test suite modules for level-1f, 3 kernels.
    
    Details:
    - Added test modules in test suite for level-1f kernels and level-3
      micro-kernels. (Duplication in the micro-kernels, for now, is NOT
      supported by these test modules.)
    - Added section override switches to test suite's input.operations file.
    - Added obj_t APIs for level-1f front-ends and their unblocked variants to
      facilitate the level-1f test modules. Also added front-end for dupl
      operation.
    - Added obj_t-based check routines for level-1f operations, which are
      called from the new front-ends mentioned above.
    - Added query routines for axpyf, dotxf, and dotxaxpyf that return fusing
      factors as a function of datatype, which is needed by their respective
      test modules.
    - Whitespace changes to bli_kernel.h of all existing configurations.

commit 680188d46bb15b9a1a2867638104939dc77ca2a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 13:23:37 2013 -0500

    Cleaned up old test drivers.
    
    Details:
    - Minor updates to old test drivers in preparation for our participation
      in ACM TOMS's replicated results initiative.

commit 3690bdd4f95769c935c410414112102cc3e108b1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 11:45:33 2013 -0500

    More updates to level-1f kernels for core2-sse3.
    
    Details:
    - Changed types in function signatures to match new prototypes. Meant to
      include this in previous commit.

commit 661d5120cd7071f9b0c5cefc95f99f1361370ade
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Oct 10 11:27:27 2013 -0500

    Fixed outdated fusing factor macros in 1f kernels.
    
    Details:
    - Updated level-1f kernels for x86_64 and bgq to use renamed fusing factor
      macros. Meant to include this in 5e54f46c. Thanks to Fran for pointing
      this out.

commit 73aa1e9f31d1b2a319c7e711ced6db3f9835c832
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Oct 1 17:01:18 2013 -0500

    Added section overrides to test suite.
    
    Details:
    - Added new lines of input to the test suite's input.operations file, which
      allows the user to disable entire sections (levels) of tests. Before this
      change, the user had to manually disable each operation test's "master
      switch". (This is why input.operations.0 existed: to allow a more
      convenient starting point for someone who only wanted to test one or a
      few operations.)

commit 5e54f46ccb76beab892d530b693e07c6bf6db7cf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 30 12:58:18 2013 -0500

    Added template implementations and other tweaks.
    
    Details:
    - Added a 'template' configuration, which contains stub implementations of the
      level 1, 1f, and 3 kernels with one datatype implemented in C for each, with
      lots of in-file comments and documentation.
    - Modified some variable/parameter names for some 1/1f operations. (e.g.
      renaming vector length parameter from m to n.)
    - Moved level-1f fusing factors from axpyf, dotxf, and dotxaxpyf header files
      to bli_kernel.h.
    - Modified test suite to print out fusing factors for axpyf, dotxf, and
      dotxaxpyf, as well as the default fusing factor (which are all equal
      in the reference and template implementations).
    - Cleaned up some sloppiness in the level-1f unb_var1.c files whereby these
      reference variants were implemented in terms of front-end routines rather
      than directly in terms of the kernels. (For example, axpy2v was implemented
      as two calls to axpyv rather than two calls to AXPYV_KERNEL.)
    - Changed the interface to dotxf so that it matches that of axpyf, in that
      A is assumed to be m x b_n in both cases, and for dotxf A is actually used
      as A^T.
    - Minor variable naming and comment changes to reference micro-kernels in
      frame/3/gemm/ukernels and frame/3/trsm/ukernels.

commit 97aaf220a847363b4da35935eca17790c0ef71f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 17 10:51:36 2013 -0500

    Added new kernels, configurations.
    
    Details:
    - Added various micro-kernels for the following architectures:
        Intel MIC
        IBM BG/Q
        IBM Power7
        AMD Piledriver
        Loongson 3A
      and reorganized kernels directory. Thanks to Tyler Smith, Mike Kistler,
      and Xianyi Zhang for contributing these kernels.
    - Added configurations corresponding to above architectures, and renamed
      "clarksville" configuration to "dunnington".

commit fe979c5a114c877506a5697cdab1fc8cf2bcd303
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 13 14:31:53 2013 -0500

    Removed default configuration behavior.
    
    Details:
    - Changed the configure script so that it no longer defaults to the
      reference configuration. This change is being made so that the
      developer has a firm awareness of which configuration is being used
      to configure BLIS. Thanks to Mike Kistler and Bryan Marker for this
      suggested change.

commit da77e9614f54f92f703f01e3b9bd67a83280150c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Sep 13 12:00:37 2013 -0500

    Minor improvements to static memory allocator.
    
    Details:
    - Expanded on cpp macro definitions from bli_mem.c and relocated them to
      a new header file, frame/include/bli_mem_pool_macro_defs.h. The expanded
      functionality includes computing the pool size for each datatype (using
      that datatype's cache blocksizes) and using the maximum to size the
      actual pool array. This addresses the somewhat common pitfall whereby a
      developer updates cache blocksizes in bli_kernel.h for only one datatype
      (say, single-precision real), while the memory pools are sized using the
      double-precision real values. Then, when the developer attempts to link
      to and run a level-3 BLIS routine (e.g. dgemm), the library aborts with
      a message saying the static memory pool was exhausted. Clearly, this
      message is misleading when the pool was not sized properly to begin with.
    - Removed previously disabled code in bli_kernel_macro_defs.h that was
      meant to check for size consistency among the various cache blocksizes.
      (Obviously the memory pool size-based solution mentioned above is better.)
    - Added BLIS_SIZEOF_? cpp macros to bli_type_defs.h. This seemed like a
      reasonable place to put these constants, rather than further crowd up
      bli_config.h.
    - Updated testsuite driver to output memory pool sizes for A, B, and C.
    - Minor comment updates to bli_config.h.
    - Removed 'flame' configuration. It was beginning to get out-of-date, and
      I hadn't used it in months. We can always re-create it later.
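    
    A minimal sketch of the "size the pool by the maximum over datatypes"
    idea from the first item above. All SK_ names and blocksize values are
    hypothetical; the actual macros live in bli_mem_pool_macro_defs.h and
    bli_kernel.h and may differ in form.

```c
#include <assert.h>

/* Hypothetical cache blocksizes (in elements). Suppose the developer
   enlarged only the single-precision real MC. */
#define SK_MC_S 384
#define SK_KC_S 256
#define SK_MC_D 128
#define SK_KC_D 256

/* Hypothetical element sizes in bytes. */
#define SK_SIZEOF_S 4
#define SK_SIZEOF_D 8

/* Bytes needed for one packed block of A, per datatype. */
#define SK_POOL_A_S ( SK_MC_S * SK_KC_S * SK_SIZEOF_S )
#define SK_POOL_A_D ( SK_MC_D * SK_KC_D * SK_SIZEOF_D )

/* Size the pool using the largest per-datatype requirement, so that a
   change to one datatype's blocksizes cannot leave the pool too small. */
#define SK_MAX( a, b )  ( (a) > (b) ? (a) : (b) )
#define SK_POOL_A_SIZE  SK_MAX( SK_POOL_A_S, SK_POOL_A_D )

char sk_pool_a[ SK_POOL_A_SIZE ];
```

    With these values the single-precision requirement (393216 bytes)
    exceeds the double-precision one (262144 bytes), and the pool is sized
    to the former.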

commit 631f347b7a99cb02757c534fd3ec5f723a2fdb0e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 10 17:17:28 2013 -0500

    Added ESSL and Accelerate targets to test drivers.
    
    Details:
    - Added ESSL and Accelerate (OS X) targets to standalone test drivers'
      Makefile in "test" directory. Thanks to Jeff Hammond for suggesting
      / providing this patch.

commit 7ae4d7a41d13ef5f1ceee217c000a5cf77a11128
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 10 16:35:12 2013 -0500

    Various changes to treatment of integers.
    
    Details:
    - Added a new cpp macro in bli_config.h, BLIS_INT_TYPE_SIZE, which can be
      assigned values of 32, 64, or some other value. The former two result in
      defining gint_t/guint_t in terms of 32- or 64-bit integers, while the latter
      causes integers to be defined in terms of a default type (e.g. long int).
    - Updated bli_config.h in reference and clarksville configurations according
      to above changes.
    - Updated test drivers in test and testsuite to avoid type warnings associated
      with format specifiers not matching the types of their arguments to printf()
      and scanf().
    - Inserted missing #include "bli_system.h" into blis.h (which was slated for
      inclusion in d141f9eeb6d1).
    - Added explicit typecasting of dim_t and inc_t to macros in
      bli_blas_macro_defs.h (which are used in BLAS compatibility layer).
    - Slight changes to CREDITS and INSTALL files.
    - Slight tweaks to Windows build system, mostly in the form of switching to
      Windows-style CRLF newlines for certain files.
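    
    A minimal sketch of how a macro such as BLIS_INT_TYPE_SIZE could select
    the integer typedefs described in the first item above. The default of
    64 used here is an assumption made for illustration; the layout of the
    actual bli_config.h and bli_type_defs.h may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Normally set by the configuration; 64 is assumed here for illustration. */
#ifndef BLIS_INT_TYPE_SIZE
#define BLIS_INT_TYPE_SIZE 64
#endif

#if   BLIS_INT_TYPE_SIZE == 32
typedef int32_t           gint_t;
typedef uint32_t          guint_t;
#elif BLIS_INT_TYPE_SIZE == 64
typedef int64_t           gint_t;
typedef uint64_t          guint_t;
#else
/* Any other value falls back to a default C type. */
typedef long int          gint_t;
typedef unsigned long int guint_t;
#endif
```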

commit 068437736b41d51a1f5ec47839f059bf58a20413
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 14:07:58 2013 -0500

    Fixed set-but-not-used compiler (gcc) warnings.
    
    Details:
    - Used void-casts of certain variables to appease gcc (and perhaps other
      compilers) when such variables are only used in the complex instances of
      the functions. Special thanks to Karl Rupp for suggesting a portable fix
      for these warnings.

commit 6dc85f63dcd5282340c9e00d585e97d70a21edc3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 13:48:52 2013 -0500

    Small fix to Windows defs.mk makefile fragment.
    
    Details:
    - Commented out a !include statement that was attempting to include a
      version file that does not yet exist. For now, the version string is
      hard-coded into defs.mk.

commit d141f9eeb6d1de7044b7429adf52d11c6fca620c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 13:09:16 2013 -0500

    Added Windows build system.
    
    Details:
    - Added a 'windows' directory, which contains a Windows build system
      similar to that of libflame's. Thanks to Martin for getting this up
      and running.
    - Spun off system header #includes into bli_system.h, which is included
      in blis.h
    - Added a Windows section to bli_clock.c (similar to libflame's).

commit 9b320e7406fb69e8b61a0085abe2ed89a96bdb68
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Sep 9 11:04:46 2013 -0500

    Edited bli_?lamch.c to avoid Windows keyword.
    
    Details:
    - Renamed "small" variable to "smnum" to avoid collision with Windows type
      by the same name. This change is needed in advance of the upcoming Windows
      build system.

commit 9013ad6ff2e9ace35e0cf44c32795c2f3d5be628
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 4 13:36:07 2013 -0500

    Switched integer typedefs (again) to C types.
    
    Details:
    - Redefined gint_t and guint_t in terms of the standard C types long int
      and unsigned long int, respectively.
    - Changed testsuite default max problem size to 500.
    - Changed testsuite input.operations to use square problems for level-3
      operation tests.

commit 981a60cfa07abac2e93697dfe12b0f076ab00a38
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Sep 4 12:09:11 2013 -0500

    Falling back to 32-bit integers for dim_t, etc.
    
    Details:
    - In light of recent segfaulting issues when compiling on 32-bit systems,
      I've changed the default typedef for gint_t and guint_t from int64_t and
      uint64_t to int32_t and uint32_t, respectively.
    - Disabled 64-bit integers in the blas2blis layer for the reference
      configuration.
    - Added type sizes of gint_t, guint_t, and the four floating-point datatypes
      to introductory output of the testsuite.

commit b776ddcd4338b34f172ef78da0ac1d771a771ab4
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 3 21:58:07 2013 -0500

    Applied temp fix to typecasting bug in testsuite.
    
    Details:
    - Applied a temporary fix to the typecasting bug in the testsuite driver.
      The fix involves casting both numerator and denominator to unsigned long.
      This fix is more voodoo than science, as I can't be sure why it even
      works.

commit 9ee6e125373869c4213c017ce772c38ecefba103
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Sep 3 21:53:27 2013 -0500

    Changed dimension spec for gemm in testsuite.
    
    Details:
    - Encountered a bizarre typecasting bug whereby the test suite was not
      computing the proper dimension from the problem size and dimension
      specification when the latter was set to -3. Will investigate.
      Thanks to Fran for finding this "bug".

commit e8be081e68c385ab44d0fea8dade21d40c200b79
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 28 15:52:34 2013 -0500

    Generalized matlab and file output in testsuite.
    
    Details:
    - Added a new option in input.general that allows outputting in
      matlab/octave format so that one can output in matlab format
      independently from outputting to files.
    - Adjusted input.operations according to above.
    - Added input.operations.0 and input.operations.1 with all options
      disabled and enabled, respectively.

commit d352c746e5683037d41b5061dfb5ce08e1d0843b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 27 13:41:46 2013 -0500

    Added single/real gemm micro-kernel for x86_64.
    
    Details:
    - Added a single-precision real gemm micro-kernel in
      kernels/x86_64/3/bli_gemm_opt_d4x4.c.
    - Adjusted the single-precision real register blocksizes in
      config/clarksville/bli_kernel.h to be 8x4.
    - Added a missing comment to bli_packm_blk_var2.c that was present in
      bli_packm_blk_var3.c

commit dedda523dc5dc779ecc34e6a03dc74cb8eb220de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Aug 19 12:07:41 2013 -0500

    Fixed bug in bli_acquire_mpart_t2b(), _l2r().
    
    Details:
    - Fixed a bug in bli_acquire_mpart_t2b() and bli_acquire_mpart_l2r()
      that caused incorrect partitioning when SUBPART0 was requested. This
      bug was introduced in 46d3d09d49ad. Thanks to Bryan for isolating
      this bug.
    - Removed dupl kernels from kernels/x86_64/3 directory.
    - Uncommented beta == 0 optimization code in
      kernels/x86_64/3/bli_gemm_opt_d4x4.c.

commit 12dbd2f33455e9384fe2070cbdd660fd4a7fceb5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 8 14:39:35 2013 -0500

    Moved init_safe(), finalize_safe() to BLAS compat.
    
    Details:
    - Moved the bli_init_safe() and bli_finalize_safe() function calls from the
      BLAS-like BLIS layer to the BLAS compatibility layer. Having these auto-
      initializers in the BLIS layer wasn't buying us anything because the user
      could still call the library with uninitialized global scalar constants,
      for example. Thus, we will just have to live with the constraint that
      bli_init() MUST be called before calling ANY routine with a bli_ prefix.
    - Added the missing bli_init_safe() and bli_finalize_safe() calls to the
      level-1 BLAS compatibility wrappers.

commit 8abfe55f2ae5d89df18e1b26a5a28d94b0936683
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 8 13:30:19 2013 -0500

    Miscellaneous updates.
    
    Details:
    - Changed the BLIS_HEAP_STRIDE_ALIGN_SIZE in the configurations from 16 to
      BLIS_CACHE_LINE_SIZE (typically 64).
    - Changed the use of nr in sizing of bd buffer to packnr in level-3 macro-
      kernels.
    - Reformulated gemm_ker_var2 to look more like the other level-3 macro-
      kernels, in that the interior and edge-case handling is expressed once
      inside the loops in the n and m dimensions, rather than the edge-case
      handling being "unrolled" and expressed as distinct code regions. The
      previous macro-kernel now lives in retired form in the subdirectory
      other/bli_gemm_ker_var2.c.old.
    - Updated experimental gemm_ker_var5 according to above change.
    - Fixed bug in bli_her2k.c whereby incorrect transformations were being
      applied to optimize the macro-kernel's access pattern on C when C is
      row-stored.
    - Various updates inside of test/exec_sizes.

commit 1aa05736ff49e7cc5f121acf615460fe9a87852c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Aug 7 12:27:04 2013 -0500

    Fixed bug in interface of bla_ger_check().
    
    Details:
    - Fixed the misplaced lda parameter in the function signature of
      bla_ger_check(). Thanks to Tyler for finding this bug.

commit 685aad25353fb200de4ca97a8bc0feeebde51d0f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Aug 6 12:25:51 2013 -0500

    Fixed cpp guard typos in frame/compat/check files.
    
    Details:
    - Fixed instances of BLIS_ENABLE_BLIS2BLAS that should have been
      BLIS_ENABLE_BLAS2BLIS. Thanks to Tyler for catching this.
    - Fixed various syntax errors in the code that had yet to be compiled
      due to the aforementioned bug.

commit f4ec28e723d28d998f1038f82da6986e44320ef6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Aug 1 11:24:23 2013 -0500

    Added basic OpenMP-based gemm and packm files.
    
    Details:
    - Integrated Tyler's parallelized packm_blk_var2 and gemm_ker_var2
      into the following auxiliary files
    
        frame/1m/packm/other/bli_packm_blk_var2.c
        frame/3/gemm/other/bli_gemm_ker_var2.c
    
      The routine in the first file uses a basic OpenMP parallel region to
      parallelize the packing of blocks of A and panels of B, while the
      second uses a similar parallel region to parallelize along the n
      dimension of the gemm macro-kernel.

commit f8980edf9c318453bb1962ac4939c06bf11e6d5e
Merge: 67a8b949 6e7e4523
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 26 11:14:27 2013 -0500

    Merge branch 'master' of https://code.google.com/p/blis

commit 67a8b9498d13b038deb316ac163e62c5b17da2ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 26 11:12:37 2013 -0500

    Added missing cpp kernel blocksize constraints.
    
    Details:
    - Added missing C preprocessor guards in bli_kernel_macro_defs.h that enforce
      constraints on the register blocksizes relative to the cache blocksizes.
      Thanks to Tyler for helping me stumble across this issue.

commit 6e7e452343014e8f86640874dc1dbadca4a642a1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 22 14:50:57 2013 -0500

    Fixed minor warnings and misc issues.
    
    Details:
    - Fixed various warnings output by gcc 4.6.3-1, including removing some
      set-but-not-used variables and addressing some instances of typecasting
      of pointer types to integer types of different sizes.

commit 03f6c3599743bc837a7d40eb5b415b1bf4f2a4e9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 22 12:54:32 2013 -0500

    Tightened some macros that detect datatypes.
    
    Details:
    - Modified the definitions of some macros, such as bli_is_real(), so that
      the "special" bit is taken into account so that BLIS_INT is differentiated
      from BLIS_FLOAT.
    - Whitespace changes to bli_obj_macro_defs.h.
    - Removed BLIS_SPECIAL_BIT definition from bli_type_defs.h, since it wasn't
      being used.

commit b33e2f4443b9043b554963320280ff7783773652
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jul 19 17:15:03 2013 -0500

    CHANGELOG update (for 0.0.9).

commit 0680916fdd532f7a4716b11a2515243b2c08d00f (tag: 0.0.9)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 18 18:04:34 2013 -0500

    Added BLAS error checking to compatibility layer.
    
    Details:
    - Added frame/compat/check directory, which now houses companion _check()
      routines for each of the BLAS wrappers in frame/compat. These _check()
      routines are called from the compatibility wrappers and mimic the
      error-checking present in the netlib BLAS.
    - Edited bla_xerbla.c so that xerbla() translates the operation string to
      uppercase before printing.
    - Redefined util routines in frame/compat/f2c/util in terms of level0
      macros.
    - Added prototypes for util routines, f2c routines, lsame(), and xerbla().
    - Commented out prototypes in test/test_*.c since Fortran integers are now
      int64_t by default (and the prototypes that were present in the files
      used int).
    - Removed redundant #include "bli_f2c.h" in bli_?lamch.c and bli_lsame.c,
      since blis.h was already being included.
    - Other minor changes to code in frame/compat/f2c.

commit 4e80ad28c97273db3366428ec44020da7944964d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jul 18 17:53:31 2013 -0500

    Added support for C99 complex types/arithmetic.
    
    Details:
    - Added support for C99 complex types to bli_type_defs.h and overloaded
      complex arithmetic to the scalar-level macros in include/level0. This
      includes a somewhat substantial reorganization and re-layering of much
      of the existing machinery present in the level0 macros.
    - Added new #define for BLIS_ENABLE_C99_COMPLEX to bli_config.h files,
      commented-out by default, which optionally enables the use of built-in
      C99 complex types and arithmetic.
    - Minor changes to clarksville and reference configs' make_defs.mk files.
    - Removed a macro definition from bli_param_macro_defs.h that was not being
      used (bli_proj_dt_to_real_if_imag_eq0).

commit 6072d7c848e837ba20d607f7b727438ada31bdcf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 17 12:27:45 2013 -0500

    Fixed bugs in trsm, trmm macro-kernels.
    
    Details:
    - Fixed a bug in trsm_rl_ker_var2() caused by incorrect edge case handling.
    - Fixed a bug in trsm_rl_ker_var2() and trsm_ru_ker_var2() whereby k was
      incorrectly being adjusted upward by MR, instead of NR. The rl and ru
      trmm macro-kernels were updated in a similar fashion.
    - Fixed a bug in trsm_ru_ker_var2() that was due to a missing negation on
      diagoffb when recomputing k to skip a zero region below where the
      diagonal intersects the right side of the block. The corresponding
      trmm macro-kernel was also updated.
    - Fixed a bug in trsm_ru_ker_var2() where the adjustment of k (by NR)
      needed to be placed AFTER the block that recomputes k to skip the zero
      region (if present). The other three trsm macro-kernels, as well as the
      trmm macro-kernels, were updated in the same manner, for consistency.
    - Fixed a bug in trmm_lu_ker_var2() in which the wrong dimension (n) was
      being updated to skip a zero region to the left of where the diagonal
      of A intersects the top edge of the block.
    - Comment updates to all trsm and trmm macro-kernels.
    - Comment updates to bli_packm_init.c.

commit 47410a48f9b91e94ce4c67633686ffd1f2ad0275
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 14:53:59 2013 -0500

    Added f2c'ed Givens rotation wrappers.
    
    Details:
    - Retired (for now) existing ?rot*() BLAS compatibility wrappers to 'attic'
      along with other wrappers for which no BLIS implementation exists.
    - Added f2c-generated codes for applicable datatype flavors of rot, rotg,
      rotm, and rotmg operations.

commit e5f90f3a8dbe671104bcb9d8b4e3409de01805da
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 13:40:12 2013 -0500

    Removed copynz defs from bli_kernel.h files.
    
    Details:
    - Removed COPYNZ_KERNEL definition from the bli_kernel.h files in each
      configuration. (Meant to include this in previous commit.)

commit aec12d90f596e8c04b1ad178258a1cd38108f59d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jul 10 13:33:30 2013 -0500

    Removed copynzv, copynzm and related codes.
    
    Details:
    - Removed copynzv and copynzm operation directories. These operations
      implemented a variation of copyv/m that, in the case of real source
      and complex destination operands, leaves the imaginary component
      untouched (rather than setting it to zero). I realize now that the
      special case(s) (e.g. gemm with real A and B but complex C) that I
      thought required this operation actually can be handled more simply.
    - Removed level0 scalar macros implementing copynzs, copynzjs.

commit b0a0a0f274a761788531b5d281cc3b411b7124ed
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jul 9 17:15:38 2013 -0500

    Added handling of restrict, stdint.h for non-C99.
    
    Details:
    - Removed the #include <stdint.h> from blis.h and inserted a cpp macro block
      in bli_type_defs.h that #includes <stdint.h> for C++ and C99, and otherwise
      manually typedefs the types we need (which, for now, are unconditionally
      int64_t and uint64_t).
    - Moved basic typedefs to top of bli_type_defs.h, and comment changes.
    - Added cpp macro block to bli_macro_defs.h that #defines restrict as
      nothing for C++ and non-C99.
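    
    A minimal sketch of the two preprocessor blocks described above. The
    exact conditions used in bli_type_defs.h and bli_macro_defs.h are not
    shown in this log, so the tests on __cplusplus and __STDC_VERSION__
    below are assumptions, as is the helper sk_copy().

```c
#include <assert.h>

/* Use <stdint.h> for C++ and C99; otherwise typedef what is needed. */
#if defined( __cplusplus ) || \
    ( defined( __STDC_VERSION__ ) && __STDC_VERSION__ >= 199901L )
  #include <stdint.h>
#else
  /* Assumption: the compiler offers long long as a 64-bit extension. */
  typedef long long          int64_t;
  typedef unsigned long long uint64_t;
#endif

/* Define restrict as nothing for C++ and non-C99 compilers. */
#if defined( __cplusplus ) || \
    !defined( __STDC_VERSION__ ) || __STDC_VERSION__ < 199901L
  #define restrict
#endif

/* Hypothetical routine that compiles the same way in either mode. */
void sk_copy( int64_t n, const double* restrict x, double* restrict y )
{
    int64_t i;
    for ( i = 0; i < n; ++i ) y[ i ] = x[ i ];
}
```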

commit 4b7e7970f1af4a1ab121e07657e2b78b9fcd7671
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 8 15:20:34 2013 -0500

    Migrated integer usage to stdint.h types.
    
    Details:
    - Changed the way bli_type_defs.h defines integer types so that dim_t,
      inc_t, doff_t, etc. are all defined in terms of gint_t (general signed
      integer) or guint_t (general unsigned integer).
    - Renamed Fortran types fchar and fint to f77_char and f77_int.
    - Define f77_int as int64_t if a new configuration variable,
      BLIS_ENABLE_BLIS2BLAS_INT64, is defined, and int32_t otherwise.
      These types are defined in stdint.h, which is now included in blis.h.
    - Renamed "complex" type in f2c files to "singlecomplex" and typedef'ed
      in terms of scomplex.
    - Renamed "char" type in f2c files to "character" and typedef'ed in terms
      of char.
    - Updated bla_amax() wrappers so that the return type is defined directly
      as f77_int, rather than letting the prototype-generating macro decide
      the type. This was the only use of GENTFUNC2I/GENTPROT2I-related macros,
      so I removed them. Also, changed the body of the wrapper so that a
      gint_t is passed into abmaxv, which is THEN typecast to an f77_int
      before returning the value.
    - Updated f2c code that accessed .r and .i fields of complex and
      doublecomplex types so that they use .real and .imag instead (now that
      we are using scomplex and dcomplex).

commit 372501398564fdba3d5a3db86c30bc1039b185ff
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jul 8 11:24:18 2013 -0500

    Added experimental bli_gemm_ker_var5().
    
    Details:
    - Added support for an experimental gemm macro-kernel that incrementally
      packs one micro-panel of B at a time. This is useful for certain
      special cases of gemm where m is small.
    - Minor changes to default values of clarksville configuration.
    - Defined BLIS_PACKED_BLOCKS as part of pack_t type, even though we
      do not yet have any use (or implementation support) for block storage.
    - Comment update to bli_packm_init.c.

commit 9915d667a79f23e3a2a2516247c560e9063a1646
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Jul 7 13:28:39 2013 -0500

    Defined "total" blocksize query functions.
    
    Details:
    - Defined bli_blksz_total_for_type() and bli_blksz_total_for_obj() to query
      the default blocksize plus blocksize extension (using the type or the type
      of an object).
    - Comment update in bli_packm_cxk.c.

commit 46d3d09d49aded1d9f1b468c83fce75e07d631dc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 27 13:19:56 2013 -0500

    Consolidated lower/upper her[2]k blocked variants.
    
    Details:
    - Consolidated lower and upper blocked variants for herk and her2k, and
      renamed the resulting variants, according to the same changes recently
      made to trmm and trsm.
    - Implemented support for four new subpartition types:
        BLIS_SUBPART1T
        BLIS_SUBPART1B
        BLIS_SUBPART1L
        BLIS_SUBPART1R
      which correspond to "merged" partitions that include the middle "1"
      partition as well as either the neighboring "0" or "2" partition. This is
      used to clean up code in herk/her2k var2 that attempts to partition away
      the strictly zero region above or below the diagonal of a matrix operand
      that is being marched through diagonally.
    - Added safeguards to herk macro-kernels that skip any leading or trailing
      zero region in the panel of C that is passed in. This is now needed given
      that herk/her2k var1 no longer partitions off this zero region before
      calling the macro-kernel (via bli_her[2]k_int()).
    - Updated comments and other whitespace changes to trmm/trsm macro-kernels.

commit 02002ef6f3d2746665982793db36714bd69bccc9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 24 17:08:14 2013 -0500

    Added row-storage optimizations for trmm, trsm.
    
    Details:
    - Implemented algorithmic optimizations for trmm and trsm whereby the right
      side case is now handled explicitly, rather than induced indirectly by
      transposing and swapping strides on operands. This allows us to walk through
      the output matrix with favorable access patterns no matter how it is stored,
      for all parameter combinations.
    - Renamed trmm and trsm blocked variants so that there is no longer a
      lower/upper distinction. Instead, we simply label the variants by which
      dimension is partitioned and whether the variant marches forwards or
      backwards through the corresponding partitioned operands.
    - Added support for row-stored packing of lower and upper triangular matrices
      (as provided by bli_packm_blk_var3.c).
    - Fixed a performance bug in bli_determine_blocksize_b() whereby the cache
      blocksize extensions (if non-zero) were not being used to appropriately size
      the first iteration (i.e., the bottom/right edge case).
    - Updated comments in bli_kernel.h to indicate that both MC and NC must be
      whole multiples of MR AND NR. This is needed for the case of trsm_r where,
      in order to reuse existing left-side gemmtrsm fused micro-kernels, the
      packing of A (left-hand operand) and B (right-hand operand) is done with
      NR and MR, respectively (instead of MR and NR).

commit d1e81ddc848ee47bc188735883d14582bdd0cabc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 13 11:14:21 2013 -0500

    Minor generalizing tweaks to trmm blk var1, var2.

commit 0efb7974f104206ba3985276f2180a9b14fe9f9b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 12 16:40:04 2013 -0500

    CHANGELOG update.

commit 5b641c3bab31eac6a1795b9f6e3f86c59651ca50 (tag: 0.0.8)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Jun 12 16:02:12 2013 -0500

    Use separate CFLAGS for "kernels" directories.
    
    Details:
    - Added a new "special" directory type: any source code within directories
      named "kernels" will be compiled with a separate CFLAGS_KERNELS set of
      compiler flags. This allows the developer to specify a separate set of
      flags (e.g. optimization flags) for compiling kernels while maintaining a
      standard set for regular framework code.
    - Fixed a bug in the top-level Makefile that was causing "noopt" code
      to be compiled with the standard set of compilation flags.
    - Updated make_defs.mk in reference, flame, and clarksville configurations
      according to above changes.

commit 08475e7c7653ba598665071a617d10f0d8f763c2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 11 12:18:39 2013 -0500

    Various level-3 optimizations for row storage.
    
    Details:
    - Implemented remaining two cases within bli_packm_blk_var2(), which allow
      packing from a lower or upper-stored symmetric/Hermitian matrix to column
      panels (which are row-stored). Previously one could only pack to row panels
      (which are column-stored).
    - Implemented various optimizations in the level-3 front-ends that allow more
      favorable access through row-stored matrices for gemm, hemm, herk, her2k,
      symm, syrk, and syr2k.
    - Cleaned up code in level-3 front-ends that has to do with setting target and
      execution datatypes.

commit 05a657a6b92e8d34efa5c57ae6a18a4f35ec0841
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jun 7 11:04:10 2013 -0500

    Added beta == 0 optimization to x86_64 ukernel.
    
    Details:
    - Modified x86_64 gemm microkernel so that when beta is zero, C is not read
      from memory (nor scaled by beta).
    - Fixed minor bug in test suite driver when "Test all combinations of storage
      schemes?" switch is disabled, which would result in redundant tests being
      executed for matrix-only (e.g. level-1m, level-3) operations if multiple
      vector storage schemes were specified.
    - Restored debug flags as default in clarksville configuration.
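
    The beta == 0 case described above can be sketched in plain C. This is a
    minimal illustration, not the x86_64 micro-kernel itself: the function
    name toy_update_c() and its argument list are hypothetical. The point is
    that when beta is zero, C is only written, never read, so stale contents
    of C (including NaN or Inf) cannot leak into the result.

```c
#include <stddef.h>

/* Hypothetical helper (not part of BLIS): computes
       C := beta * C + alpha * AB
   for an m x n column-major tile. ab holds the accumulated product,
   stored contiguously with leading dimension m; c has leading
   dimension ldc. */
void toy_update_c(size_t m, size_t n,
                  double alpha, const double *ab,
                  double beta, double *c, size_t ldc)
{
    size_t i, j;

    if (beta == 0.0) {
        /* C is overwritten without being loaded or scaled. */
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i + j * ldc] = alpha * ab[i + j * m];
    } else {
        /* General case: C must be read and scaled by beta. */
        for (j = 0; j < n; ++j)
            for (i = 0; i < m; ++i)
                c[i + j * ldc] = beta * c[i + j * ldc]
                               + alpha * ab[i + j * m];
    }
}
```

    Branching once outside the loops keeps the inner loops free of the test
    on beta; a real micro-kernel would make the same decision before its
    store phase.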

commit f1aa6b81cc421516dd77dd0f18f7c432724e6ef2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Jun 6 13:36:06 2013 -0500

    Whitespace changes to old test drivers.
    
    Details:
    - Replaced tabs with four spaces in places where indentation was already
      in place.

commit 9feb4c23d2e36f3d8b5417a3802c69f94b29f749
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Jun 4 14:57:46 2013 -0500

    Fixed unaligned handling in axpyf, dotxaxpyf.
    
    Details:
    - Fixed over-cautious handling of unaligned operands in vector intrinsic
      implementation of axpyf kernel.
    - Fixed over- and under-cautious handling of unaligned operands in vector
      intrinsic implementation of dotxaxpyf kernel.

commit 22b06cfcd2e3205c8325a246c2279e4b1047c066
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Jun 3 16:54:52 2013 -0500

    Updated level-1/-1f [vector intrinsic] kernels.
    
    Details:
    - Updated level-1/-1f kernels so that non-unit-stride and unaligned cases
      are handled by the reference implementation (rather than aborted).
    - Added -fomit-frame-pointer to default make_defs.mk for clarksville
      configuration.
    - Defined bli_offset_from_alignment() macro.
    - Minor edits to old test drivers.

commit 0288c827d3659bb225ac9c10f168b623ed0106a2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Jun 1 08:02:23 2013 -0500

    Updated ukernels for x86_64.
    
    Details:
    - Tweaked micro-kernels and configuration for clarksville.
    - Updated/cleaned up old test drivers in test directory.
    - Fixed syntax bug in trsv_unb_var1 and trsv_unf_var1 (introduced
      recently).

commit 85a6d1c9a52c2b27c71a3a3e341c51d7ba263749
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon May 6 11:05:08 2013 -0500

    Replaced axpys usage with subs in trsv.
    
    Details:
    - Replaced instances of axpys with alpha equal to -1 with subs.
    - Use BLIS_MAX_TYPE_SIZE to define BLIS_CONSTANT_SLOT_SIZE instead of
      sizeof(dcomplex).

commit 2d9c667f3c48a12cab64e5ad09d5fcb9f4c19d78
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 24 16:28:10 2013 -0500

    Fixed x86_64 kernel bugs and other minor issues.
    
    Details:
    - Fixed bugs in trmv_l and trsv_u due to backwards iteration resulting in
      unaligned subpartitions. We were already going out of our way a bit to
      handle edge cases in the first iteration for blocked variants, and this
      was simply the unblocked-fused extension of that idea.
    - Fixed control tree handling in her/her2/syr/syr2 that was not taking
      into account how the choice of variant needed to be altered for
      upper-stored matrices (given that only lower-stored algorithms are
      explicitly implemented).
    - Added bli_determine_blocksize_dim_f(), bli_determine_blocksize_dim_b()
      macros to provide inlined versions of bli_determine_blocksize_[fb]() for
      use by unblocked-fused variants.
    - Integrated new blocksize_dim macros into gemv/hemv unf variants for
      consistency with that of the bugfix for trmv/trsv (both of which now
      use the same macros).
    - Modified bli_obj_vector_inc() so that 1 is returned if the object is a
      vector of length 1 (i.e., 1 x 1). This fixes a bug whereby under certain
      conditions (e.g. dotv_opt_var1), an unexpected increment was returned:
      the code was expecting 1 (for purposes of performing contiguous vector
      loads) but got a value greater than 1 because the column stride of the
      object (e.g. rho) was inflated for alignment purposes (albeit
      unnecessarily since there is only one element in the object).
    - Replaced some old invocations of set0 with set0s.
    - Added alpha parameter to gemmtrsm ukernels for x86_64 and use accordingly.
    - Fixed increment bug in cleanup loop of gemm ukernel for x86_64.
    - Added safeguard to test modules so that testing a problem with a zero
      dimension does not result in a failure.
    - Tweaked handling of zero dimensions in level-2 and level-3 operations'
      internal back-ends to correctly handle cases where output operand still
      needs to be scaled (e.g. by beta, in the case of gemm with k = 0).
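
    The bli_obj_vector_inc() change described above can be illustrated with
    a simplified sketch. The toy_obj_t struct and toy_vector_inc() function
    below are hypothetical stand-ins, not the BLIS obj_t or the actual macro;
    they only show why a 1 x 1 object should report an increment of 1 even
    when one of its strides has been inflated for alignment.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a matrix/vector object. */
typedef struct {
    size_t    m, n;   /* number of rows and columns     */
    ptrdiff_t rs, cs; /* row stride and column stride   */
} toy_obj_t;

/* Returns the distance between consecutive elements of a vector.
   A 1 x 1 object has no "next" element, so 1 is returned to let
   callers treat it as contiguous regardless of padded strides. */
ptrdiff_t toy_vector_inc(const toy_obj_t *v)
{
    if (v->m == 1 && v->n == 1)
        return 1;

    /* Row vector (1 x n): elements are cs apart.
       Column vector (m x 1): elements are rs apart. */
    return (v->m == 1) ? v->cs : v->rs;
}
```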

commit d57ec42b34f8447c88adeffa95cf22f8c115ad51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 3 17:35:32 2013 -0500

    Renamed _trans_status() macro.
    
    Details:
    - Mistakenly forgot to rename the _trans_status() macro and instances in
      previous commit.

commit 9e2b227866af429a4a6fb7dbb8c457bbdda2f136
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri May 3 17:24:58 2013 -0500

    Renamed _set_trans(), _trans_status() macros.
    
    Details:
    - Renamed the following macros:
        bli_obj_set_trans()    -> bli_obj_set_onlytrans()
        bli_obj_trans_status() -> bli_obj_onlytrans_status()
      to remove ambiguity as to which bits are read/updated.

commit 2f8174509ea9f844db11ebd9389de5168e85b132
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 1 15:06:30 2013 -0500

    Unconditionally check memory pool(s) for errors.
    
    Details:
    - Changed bli_mem_acquire_m() in bli_mem.c so that we still check if the
      memory pool is exhausted before checking out and returning a block, even
      if BLIS error checking has been disabled. These errors are useful because
      they likely indicate that BLIS was improperly configured for the code
      being run.

commit 75405a2b83679b6aff38d7e7425199d623a7b0a9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed May 1 15:00:30 2013 -0500

    CHANGELOG update.

commit 6bfa96f84887dec0b4cf8be5d38dd634c2f8951d (tag: 0.0.7)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 30 19:35:54 2013 -0500

    Absorbed blocksize extensions into main objects.
    
    Details:
    - Revamped some parts of commit b6ef84fad1c9 by adding blocksize extension
      fields to the blksz_t object rather than have them as separate structs.
    - Updated all packm interfaces/invocations according to above change.
    - Generalized bli_determine_blocksize_?() so that edge case optimization
      happens if and only if cache blocksizes are created with non-zero
      extensions.
    - Updated comments in bli_kernel.h files to indicate that the edge case
      blocksize extension mechanism is now available for use.

commit bc7c8005cedbe50961ac2a99aeeabf4e9f9a8e9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 17:16:59 2013 -0500

    Added option to disable err checking in testsuite.
    
    Details:
    - Added a new line to input.general that allows one to specify the error-
      checking level to use for each BLIS experiment. The only two levels
      supported for now are "no error checking" and "full error checking".

commit 096b366ddcfe386f44419ef84d8df8be13825f86
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 16:43:43 2013 -0500

    Use cntl trees that block in n dimension.
    
    Details:
    - Updated _cntl.c files for each level-3 operation to induce blocked
      algorithms that first partition in the n dimension with a blocksize
      of NC. Typically this is not an issue since only very large problems
      exceed NC. But developers often run very large problems, and
      so this extra blocking should be the default.
    - Removed some recently introduced but now unused macros from
      bli_param_macro_defs.h.

commit b6e24b23cb4dfc488c1c9c70d596539c2287f72e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 25 12:06:12 2013 -0500

    Use PASTEMAC in macro-kernels (over MAC2 or MAC3).
    
    Details:
    - Replaced multi-type invocations of copys_mxn, xpbys_mxn, etc. (PASTEMAC2
      and PASTEMAC3) with those that only use a single type (PASTEMAC).
    - Added extra macros to bli_adds_mxn_uplo.h and bli_xpbys_mxn_uplo.h to
      accommodate above change.
    - Fixed comment typo in bli_config.h files.
    - Added .nfs* pattern to .gitignore.

commit df80acf517dde180ddcc5835c6136b2fa7556d4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 19:43:23 2013 -0500

    Fixed computation of b_next in L3 macro-kernels.
    
    Details:
    - Restructured herk_l and herk_u macro-kernels in the image of trmm
      and trsm, in that the edge cases are captured by the main loop, rather
      than trying to have "cleanup" sections that result in four distinct
      parts (interior, bottom edge, right edge, bottom-right edge) of the
      code.
    - Fixed the way b_next was being computed in the non-gemm level-3
      macro-kernels (herk, trmm, trsm). The way they are computed now matches
      that of gemm.

commit 3671528cf8efe4b445d196665143a5c50c2c6048
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 19:12:14 2013 -0500

    Fixed minor bug in computing b_next in gemm.

commit db072a5b4a039a9a668ef951333ecfb5bd3a74b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 17:49:10 2013 -0500

    Fixed rare edge case bug in herk_l macro-kernel.
    
    Details:
    - Fixed a potential bug in herk_l at the m_left edge case. If MR was
      chosen to be much larger than NR, then one could encounter edge cases
      in the MC dimension that fall entirely below the diagonal, which
      the previous implementation of the herk_l macro-kernel was not allowing
      for.

commit 1dab11e37d1cb403cbe75b73a644c00de534f104
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 17:17:11 2013 -0500

    Updated x86 gemmtrsm ukernels to use alpha.

commit 9d10d7dd9bc92a993fea7162bfa5983f75506f49
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 16:00:18 2013 -0500

    Added a_next, b_next arguments to micro-kernels.
    
    Details:
    - Added two more arguments to the gemm and gemmtrsm microkernels: the
      addresses of the next micro-panels of A and B. By passing these
      pointers into the micro-kernel, we allow the micro-kernel author to
      prefetch micro-panels of A and B as necessary (though this is
      completely optional; these addresses may also be safely ignored).
    - Updated all seven macro-kernels so that they compute and pass in
      a_next and b_next. Note that ONLY the gemm macro-kernel computes
      a_next and b_next with the precise semantics we want. I will go back
      and fix the other macro-kernels in the near future.
    - Added 'restrict' to various micro-kernels from which it was missing.

commit f3815dc84d385c514a5acaf1e925424a57be2f51
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 23 11:12:33 2013 -0500

    Added code for backward edge-case blocking.
    
    Details:
    - Edited bli_determine_blocksize_b() to include experimental (and
      currently disabled) code that computes extended blocks.
    - Updated comments related to above changes.
    - Enabled use of x86 gemmtrsm ukernel in config/flame/bli_kernel.h.

commit 4fe1435f20e8fc7dd72f795ac58c8e236e6c631b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 22 19:00:43 2013 -0500

    Updated dupl implementation to use PACKNR and NR.
    
    Details:
    - Updated frame/util/dupl/bli_dupl_unb_var1.c to utilize PACKNR and NR
      explicitly to navigate b1 so that situations where PACKNR > NR are
      supported.
    - Moved the 4x2 and 4x4 reference micro-kernels in frame/3/gemm/ukernels and
      frame/3/trsm/ukernels to kernels/c99/.
    - Updated clarksville and flame configurations.

commit 2d6f9e83799a46d52d7901e275f8fd67f0a0edc6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 21 15:10:34 2013 -0500

    Disabled blocksize checks for memory pools.
    
    Details:
    - Temporarily disabled checks that ensure that enough memory will be allocated
      by the contiguous memory allocator for all types, given that the values for
      double precision real are the ones used to allocate the space. These checks
      can easily go awry in certain situations, especially if you are developing for
      only one datatype. So for now, they are probably more trouble than they are
      worth.

commit b6ef84fad1c9884c84b7f1350a0bcdfe1737e8f2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 21 15:00:24 2013 -0500

    Allow ldim of packed micro-panels != MR, NR.
    
    Details:
    - Made substantial changes throughout the framework to decouple the leading
      dimension (row or column stride) used within each packed micro-panel from
      the corresponding register blocksize. It appears advantageous on some
      systems to use, for example, packed micro-panels of A where the column
      stride is greater than MR (whereas previously it was always equal to MR).
    - Changes include:
      - Added BLIS_EXTEND_[MNK]R_? macros, which specify how much extra padding
        to use when packing micro-panels of A and B.
      - Adjusted all packing routines and macro-kernels to use PACKMR and PACKNR
        where appropriate, instead of MR and NR.
      - Added pd field (panel dimension) to obj_t.
      - New interface to bli_packm_cntl_obj_create().
      - Renamed bli_obj_packed_length()/_width() macros to
        bli_obj_padded_length()/_width().
      - Removed local #defines for cache/register blocksizes in level-3 *_cntl.c.
      - Print out new cache and register blocksize extensions in test suite.
    - Also added new BLIS_EXTEND_[MNK]C_? macros for future use in using a larger
      blocksize for edge cases, which can improve performance at the margins.

commit 59fca58dbe678d79c1df0916b022afbeac7c48fa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 19 15:26:29 2013 -0500

    Fixed bug in compatibility layer (her2k/syr2k).
    
    Details:
    - Fixed a bug in the BLAS compatibility layer, specifically in bla_her2k.c
      and bla_syr2k.c, that caused incorrect computation to occur when the BLAS
      interface caller requests the [conjugate-]transpose case. Thanks to Bryan
      Marker for reporting the behavior that led to this bug.

commit 09eacbd1ab1380a95a0e9625726b45e43ed102d6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 19:39:13 2013 -0500

    Changed old level3 test drivers to call front-ends.
    
    Details:
    - Changed old level-3 test drivers, in 'test' directory, to always call the
      front-end object API instead of the internal back-end with the locally
      defined control tree.

commit 83e45de23e565138b8fde06fb11cfedc973b7246
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 18:33:03 2013 -0500

    Allow packm_init() to reacquire a too-small mem_t.
    
    Details:
    - Changed bli_packm_init() to react differently to a situation where a pack
      obj_t has an already-allocated mem_t entry that has a buffer that is smaller
      than what will be needed to hold the block/panel that now needs to be
      packed. Previously, this situation was treated with an abort() since I
      assumed something was horribly wrong. I have changed the code so that it now
      reacts by releasing the previous mem_t and re-acquires a new mem_t with the
      new information. (This change was done at the request of Bryan Marker to
      facilitate code generation via DxT.)

commit a6990434173b0cf651f8521194f3aef738deb7d2
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 18 13:52:47 2013 -0500

    Fixed bug in packing block of A for hemm/symm.
    
    Details:
    - Fixed a bug in bli_packm_blk_var2() that affected the packing functionality
      of hemm and symm. The bug occurs whenever attempting to pack a Hermitian or
      symmetric matrix where the block of A being packed intersects the diagonal,
      but some of its micro-panels do not intersect the diagonal and lie completely
      in the unstored region. Thanks to Francisco Igual for reporting this bug.
    - Comment updates to both _blk_var2.c and _blk_var3.c.

commit c92e7590e1934f830814ab614c794215ebe0c415
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Apr 17 20:53:29 2013 -0500

    Activated bli_packm_acquire_mpart_t2b().
    
    Details:
    - Removed the overly-paranoid bli_abort() from the end of
      bli_packm_acquire_mpart_t2b(), to allow others to experiment with
      partitioning through packed blocks of A. Also, and more importantly,
      changed an earlier check that was causing an erroneous (but
      coincidentally redundant) abort(). Also, updated some of the comments
      in bli_packm_part.c.

commit bea579e9f009a44e08008eb14d09f38748ab2b53
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 19:43:14 2013 -0500

    Allow creation of "empty" objects.
    
    Details:
    - Modified bli_obj_alloc_buffer() to allow allocating an empty buffer, and
      modified bli_adjust_strides() to explicitly handle m = n = 0.
    - Updated bli_check_matrix_strides() to allow cases where m = n = 0.

commit 7904e20f2e6908571ee5008da2a08084198eefae
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 17:37:16 2013 -0500

    Fixed "root" object bug in bli_her[2]k/syr[2]k.
    
    Details:
    - Fixed an obscure bug in the front-ends for herk, her2k, syrk, and syr2k,
      that manifested as the incorrect triangle being updated. It occurred when
      the user would pass in a matrix object that was correctly marked as
      symmetric/Hermitian and lower-stored, but whose root object was never marked
      as lower (or upper). We now alias and re-assign root status for matrix C
      within the front-ends. Note that trmm and trsm were already doing this,
      albeit for a slightly different reason (to allow the internal back-end to
      choose which algorithm to run--lower or upper--based on the uplo of the root
      object for both left and right side cases). Thanks to Bryan Marker for
      leading me to this bug.

commit 19155a768dd97b57cfb59c32fa8e54a344ec66e1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 16 11:24:03 2013 -0500

    Fixed overzealous type-checking in bli_getsc().
    
    Details:
    - Relaxed type checking in getsc so that the input object could be a constant
      and not just a proper floating-point type. (If it is a constant, default to
      extracting the dcomplex values.) Thanks to Bryan Marker for reporting this
      bug.
    - Added definition for bli_is_constant() in bli_param_macro_defs.h
    - Comment updates to various level-0 scalar routines.

commit 2ee6bbca2953d04c967685da9735b3eaf8a4b813
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 19:27:57 2013 -0500

    Fixed bug in bli_obj_is_packed() and renamed.
    
    Details:
    - This macro is used to determine whether the partitioning routines should
      call a corresponding packm_part routine instead. However, it was
      unintentionally catching matrices that were marked as "packed" by virtue
      of them simply being marked as BLIS_PACKED_UNSPEC in, say, bli_gemv().
      The macro has now been renamed to bli_obj_is_panel_packed(), and now only
      checks for row or column panel packing. (Note that I first attempted to
      fix this bug in a571af816d72.) Thanks to Bryan Marker for reporting the
      erroneous behavior that led me to this bug.

commit 99b99eebe70336b5f28039a4a084aa7f5fa7059d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 17:54:43 2013 -0500

    Removed local reference ukernel blocksize macros.
    
    Details:
    - Removed locally defined gemm microkernel blocksize macros from _mxn
      reference microkernel definition and header. Meant to include this in
      a recent/previous commit (0020ef7c8271).

commit 6a538fa7b164655f41cea5b9c8d3902438bda66b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 14:40:31 2013 -0500

    Formatting change to mods in previous commit.

commit ea079d35591e808971d2d98a1a7d9f89bc1f7c2f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 14:31:40 2013 -0500

    Set structure of objects in level-2 BLIS APIs.
    
    Details:
    - Added missing statement to set structure field of local objects in
      top-level BLIS (BLAS-like) API wrappers. Thanks to Bryan Marker for
      reporting this bug.

commit d9948c541c0446e20e249a1ccc83709ce51b7aa8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 10:21:26 2013 -0500

    Tweak to test suite function string construction.
    
    Details:
    - Fixed a minor bug in the way that the test suite would construct function
      name strings when the user anchored all parameters in input.operations.
      In this case, the test driver would mistake this situation for one where
      the operation simply had no parameters to begin with, and thus would not
      include the parameter string in the function string that is output for
      every result.

commit ca9e435c57c5c7a000d2a32681dd8070ba850abd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 15 09:59:46 2013 -0500

    Fixed a bug in reference implementation of dupl.
    
    Details:
    - Fixed a bug in reference implementation of dupl (bli_dupl_unb_var1.c),
      which resulted in incorrect duplication.
    - Updated old test drivers according to recently updated packm control tree
      creation interface.
    - Added 'restrict' to x86 gemm microkernel interface.

commit 26cbd52e364bbe439e3744101cd5a6cbcb82dffd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Apr 14 19:05:33 2013 -0500

    Modified bli_kernel.h include order in blis.h.
    
    Details:
    - Delayed #include of bli_kernel.h in blis.h to prevent a situation where
      _kernel.h includes an optimized microkernel header, which uses BLIS types
      such as dim_t and inc_t, which would precede the definition of those types
      in bli_type_defs.h.
    - Moved the #include of bli_kernel_macro_defs.h in bli_macro_defs.h to blis.h
      (immediately after that of bli_kernel.h).

commit 3414a23c38b0de45a8034b3dda2fc4b5a755e4e1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 16:53:16 2013 -0500

    CHANGELOG update.

commit ec16c52f2ecf419c749175ce0a297441c10f1c68 (tag: 0.0.6)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 16:41:16 2013 -0500

    Updated INSTALL file (now redirects to website).

commit 0020ef7c82711a7ebf08e5174f939bee2563184c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 13 15:26:35 2013 -0500

    Removed gemmtrsm-, trsm-specific blocksize macros.
    
    Details:
    - Modified gemmtrsm micro-kernel wrappers to use new aliased blocksize macros
      instead of operation-specific ones.
    - Removed local, gemmtrsm-specific blocksize macro definitions found in
      micro-kernel header files.
      (Meant to include above changes in 31b100e7bf4a.)
    - Added comments to reference gemmtrsm micro-kernel wrapper implementation.

commit 1a9f427b85bb95aaa9e54c8ff8ecad8734b361ee
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 12 15:25:54 2013 -0500

    Added/renamed alignment constants to _config.h.
    
    Details:
    - Added new memory alignment constants:
        BLIS_HEAP_STRIDE_ALIGN_SIZE   (previously assumed to be same as SYSTEM_MEM)
        BLIS_CONTIG_ADDR_ALIGN_SIZE   (previously assumed to be same as PAGE_SIZE)
        BLIS_STACK_BUF_ALIGN_SIZE     (previously not enforced)
      and renamed existing ones
        BLIS_SYSTEM_MEM_ALIGN_SIZE -> BLIS_HEAP_ADDR_ALIGN_SIZE
        BLIS_CONTIG_MEM_ALIGN_SIZE -> BLIS_CONTIG_STRIDE_ALIGN_SIZE
      to better convey what the alignment factor is used for (and what it is
      not used for).
    - Removed BLIS_ENABLE_SYSTEM_MEM_ALIGN. Dynamic memory alignment is now
      disabled by setting BLIS_HEAP_STRIDE_ALIGN_SIZE to 1.
    - Inserted instances of __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE)))
      into macro-kernels to specify stack alignment of temporary buffers.
    - Modified test suite driver to output new constants.
    - Removed bli_align_dim_to_sys() and bli_align_dim_to_cmem(). Instead, we now
      use bli_align_dim_to_size(), which takes a third argument (the desired
      alignment).
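
    The alignment helper described above can be sketched as follows. This is
    a hypothetical illustration: the name toy_align_dim_to_size() and the
    exact argument order (dimension, element size, alignment in bytes) are
    assumptions, not the BLIS definition. The dimension is rounded up so
    that dim * elem_size is a multiple of align_size; an alignment of 1
    leaves the dimension unchanged, which matches disabling alignment by
    setting the constant to 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch (not the BLIS macro): round dim up to the next
   multiple of (align_size / elem_size) elements. */
size_t toy_align_dim_to_size(size_t dim, size_t elem_size, size_t align_size)
{
    size_t mult;

    assert(elem_size > 0 && align_size > 0);

    /* Number of elements that span one alignment unit. If the
       alignment is smaller than one element, no padding is needed. */
    mult = align_size / elem_size;
    if (mult == 0)
        mult = 1;

    return ((dim + mult - 1) / mult) * mult;
}
```

    For example, with 8-byte elements and a 32-byte alignment, a dimension
    of 10 is padded to 12, while an alignment of 1 returns 10 unchanged.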

commit a77d10e87e3c0ab55ec14d74c285bc95c06285c3
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 12 11:40:55 2013 -0500

    Fixed a bug in axpyv/axpym when alpha is unit.
    
    Details:
    - Fixed bug whereby axpyv and axpym were incorrectly simplifying to a copy,
      rather than an add, when alpha = 1. Thanks to Bryan Marker for identifying
      this bug.
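
    The distinction behind this fix can be shown with a small reference
    sketch. The function toy_daxpyv() is hypothetical and is not the BLIS
    implementation; it only illustrates that y := y + alpha * x with
    alpha = 1 simplifies to an add (y := y + x), not a copy (y := x).

```c
#include <stddef.h>

/* Hypothetical reference axpyv (not BLIS code): y := y + alpha * x
   over n elements with element strides incx and incy. */
void toy_daxpyv(size_t n, double alpha,
                const double *x, ptrdiff_t incx,
                double *y, ptrdiff_t incy)
{
    size_t i;

    /* alpha == 0: y is left unchanged. */
    if (alpha == 0.0)
        return;

    if (alpha == 1.0) {
        /* Unit alpha: an add. Writing *y = *x here would be the
           copy mistake described in the commit message. */
        for (i = 0; i < n; ++i, x += incx, y += incy)
            *y += *x;
        return;
    }

    /* General case. */
    for (i = 0; i < n; ++i, x += incx, y += incy)
        *y += alpha * (*x);
}
```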

commit 0495bd1d6de5995fe2fb79b321eec79e961eb7a5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 16:39:25 2013 -0500

    Moved _POSIX_C_SOURCE def to compiler cmd line.
    
    Details:
    - Removed the #define of _POSIX_C_SOURCE in bli_config.h (for both reference
      and clarksville configurations) and added "-D_POSIX_C_SOURCE=200112L" to
      the compiler command line arguments in make_defs.mk (for both configs).
      Thanks to Devin Matthews for suggesting this change.

commit d43d1a0a2ef6de4bc57627566aef8e3fdb458b8c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 16:28:17 2013 -0500

    Appended 'f2c_' to abs, min, max macros in f2c.h.
    
    Details:
    - Renamed abs, min, max, dmin, and dmax macros in bli_f2c.h so that they
      would not conflict with anything defined by the user (or the language).
      Thanks to Devin Matthews for suggesting this fix.
    - Updated all instances of the above macros accordingly.

commit 31b100e7bf4aeaa4ceafefd2b6c3102d5fbc4cbb
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 11:11:52 2013 -0500

    Added new kernel blocksize macro aliases.
    
    Details:
    - Added new macros that alias level-3 cache and register blocksize macros
      to names that can be constructed via the PASTEMAC macro. These aliased
      macro definitions live inside bli_kernel_macro_defs.h, which is now
      #included after bli_kernel.h.
    - Modified macro-kernels to use new aliased blocksize macros instead of
      operation-specific ones.
    - Removed local, operation-specific kernel blocksize macro definitions
      (found in macro-kernel header files).

commit bd2b24ba65b36d7c07c5918a3838ce2ff57c4b48
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 10:35:39 2013 -0500

    Updated CREDITS file.

commit 79328c15410215737f3f14cd069328cf52aa11fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 11 10:32:14 2013 -0500

    Reverted testsuite object files' home to 'obj'.
    
    Details:
    - Removed 'obj' and 'lib' from .gitignore.
    - Added testsuite/obj/.gitkeep (which is an empty file).
    - Updated testsuite/Makefile accordingly.
    - Thanks to Vernon Austel for pointing out the .gitkeep trick for tracking
      empty directories in git.

commit 4afe3bfd82c03e1e97b58b7d250588a0d28541e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 9 17:45:39 2013 -0500

    Renamed/moved object scalar constant macros.
    
    Details:
    - Replaced scalar constant macro definitions in bli_const_defs.h with a single,
      simpler macro in bli_obj_macro_defs.h.
    - Updated invocations of old macros accordingly.
    - Removed bli_const_defs.h.

commit 357893f5be5c56ab7b062874005e77e614b23f06
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 9 14:48:15 2013 -0500

    Applied fix from prev commit to gemmtrsm_?_ref_4x4
    
    Details:
    - Fixed hard-coded kernels in bli_gemmtrsm_l_ref_4x4.c and
      bli_gemmtrsm_u_ref_4x4.c.

commit 54988e8dca44475610bcaee5a7bc1c40e8921402
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 19:08:43 2013 -0500

    Fixed a performance bug in trsm.
    
    Details:
    - Fixed a bug in the reference implementations of the gemmtrsm wrappers
      (bli_gemmtrsm_l_ref_mxn.c and bli_gemmtrsm_u_ref_mxn.c) whereby the
      reference gemm microkernel was hard-coded, and thus always called, even
      when GEMM_UKERNEL was defined to point to an optimized microkernel. This
      manifested as artificially low trsm performance for all problem sizes, but
      especially for small problem sizes as it only affected blocks of A that
      intersected the diagonal. Thanks to Mike Kistler of IBM for helping me
      find this bug.

commit a7252e40b5c351eef9a1df531ea0ef25cb5fb705
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 16:08:22 2013 -0500

    Generate testsuite objects in 'src'.
    
    Details:
    - Tweaked the testsuite makefile so that object files are stored in 'src'
      rather than 'obj', since (a) the top-level .gitignore dictates that
      obj directories are to be ignored, and (b) git has problems
      tracking empty directories. Now, users do not need to create their own
      obj directories within their own local clones of BLIS.

commit 803871c55b60d3c225ad9a0607fa507a9c16aab7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 15:18:42 2013 -0500

    Minor formatting changes.

commit a571af816d72727e16cad37007e7043b9d6fa362
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Apr 8 15:00:13 2013 -0500

    Fixed definition of bli_is_packed_object() macro.
    
    Details:
    - Changed the definition of bli_is_packed_object() so that it keys off of the
      value of the pack schema bits in the info field of obj_t, rather than
      comparing the obj_t buffer with that of the mem_t entry. This was the cause
      of a very low probability bug whereby uninitialized memory caused the macro
      to evaluate to TRUE even though the object in question was not packed.
      Thanks to Vernon Austel of IBM for helping discover this bug.
    - Changed an abort() in bli_packm_part() to a not-yet-implemented.

commit 3be14c32f735ecc6169d3ab6370cf8b69162acec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Apr 6 12:54:45 2013 -0500

    Updated information in testsuite output header.
    
    Details:
    - Added to the information that is echoed at the beginning of the test suite's
      output, and also re-labeled some existing information.

commit 874707c1b183a4dd9a91dbfd4ea1522384c190df
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Apr 5 17:19:43 2013 -0500

    Fixed edge case handling bug in herk macrokernels.
    
    Details:
    - Fixed a bug present in bli_herk_l_ker_var2() and bli_herk_u_ker_var2() that
      only manifests when BLIS is configured such that MR != NR. The bug involves
      incorrectly detecting edge cases, which resulted in some parts of matrix C
      potentially being skipped and not updated, depending on the problem size.
    - Updated the default values of MR and NR in config/reference/bli_kernel.h to
      8 and 4, respectively, so that I can better stress the framework on a
      day-to-day basis. (The fact that they were both equal to 4 for so long is
      why I did not stumble upon this bug much sooner.)

commit 7cbda15291d3e01300e71c286b9657b7ef0708bf
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Apr 4 15:25:43 2013 -0500

    Added reference microkernels for arbitrary MR, NR.
    
    Details:
    - Added a new set of reference gemm, gemmtrsm, and trsm micro-kernels that
      contain explicit loops over MR and NR, thus allowing them to be used
      unmodified by developers who want to build a reference library with
      custom register blocksizes.
    - Changed config/reference/bli_kernel.h to use above ukernels by default.
    - Changed interfaces of new and existing gemm, gemmtrsm, and trsm micro-kernels
      to use 'restrict' keyword.
    - Added -funroll-loops option to config/reference/make_defs.mk.
    - Updated comments in bli_kernel.h describing constraints on register and
      cache blocksizes.
    - Updated _adds_mxn.h, _copys_mxn.h, and _xpbys_mxn.h macro files so that
      single-char macros are also defined.

commit 6684b73d5501f91d24a79e26655a42819c9b3114
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Apr 2 13:06:20 2013 -0500

    Implemented amax operation and related changes.
    
    Details:
    - Implemented amax operation in BLIS.
    - Activated BLAS2BLIS routine mapping for new amax BLIS implementation.
    - Added integer support to [f]printv, [f]printm.
    - Added integer support to level-0 copys macros.
    - Updated printing of configuration information in test suite driver.
    - Comment changes to _config.h files.
    - Added comments to bla_dot.c to remind the reader what sdsdot()/dsdot()
      are used for.

commit fb68087f8727cd5fd656a742a110e54fb1c91db9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 15:10:16 2013 -0500

    More memory alignment-related tweaks.
    
    Details:
    - Renamed BLIS_MEMORY_ALIGNMENT_SIZE to BLIS_CONTIG_MEM_ALIGN_SIZE.
    - Renamed BLIS_ENABLE_MEMORY_ALIGNMENT to BLIS_ENABLE_SYSTEM_MEM_ALIGN.
    - Added BLIS_SYSTEM_MEM_ALIGN_SIZE, which controls only the alignment
      passed into posix_memalign() or equivalent.
    - Defined new function, bli_align_dim_to_cmem(), which applies the
      contiguous memory alignment (rather than the system/malloc alignment).

commit 9682ef61dbf9a8846c8b0826d4de24bc216cd641
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 14:14:53 2013 -0500

    Always define memory alignment size cpp constant.
    
    Details:
    - Removed guard around #define for memory alignment size constant.
      Memory alignment should always be enabled, and so this value should
      always be defined.

commit 3a787cccaae16531474f34398e3c0cf4f49b8cd8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 13:59:19 2013 -0500

    Renamed memory alignment macro constant.
    
    Details:
    - Renamed all occurrences of BLIS_MEMORY_ALIGNMENT_BOUNDARY to
      BLIS_MEMORY_ALIGNMENT_SIZE.

commit 37308f9a502b56d94fa52a7df71c676a46c3be3d
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 26 12:43:14 2013 -0500

    Align packed panel strides with system alignment.
    
    Details:
    - Pass panel strides through bli_align_dim_to_sys() to ensure that each
      subsequent packed panel of A and B begins at an aligned address. (The
      first panel is presumably aligned to system alignment because it is
      aligned to a page boundary, which is typically much larger.)
    - Rearranged code in packm_init_pack() to prevent additional conditional
      blocks as a result of the aforementioned change.
    - Adjusted contiguous memory allocator so that the system memory alignment
      is used to allocate enough space for each block no matter what kind of
      register blocking is used (even if register blocksize is unit and every
      row/column needs maximal padding).
    - Adjusted default blocksizes in reference configuration so that MC*KC
      and KC*NC result in identical footprints for all datatypes.
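
    A minimal sketch of the kind of rounding described above, assuming the
    alignment size is a whole multiple of the element size. The helper name
    align_dim_sketch() and its signature are hypothetical; this log does not
    show the actual implementation of bli_align_dim_to_sys().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the BLIS implementation): round a dimension,
   measured in elements, up to the next value whose byte footprint is a
   multiple of align_size. Assumes align_size is a multiple of elem_size. */
static size_t align_dim_sketch( size_t dim, size_t elem_size, size_t align_size )
{
    assert( elem_size > 0 && align_size % elem_size == 0 );

    /* Number of elements that fit in one alignment unit. */
    size_t mult = align_size / elem_size;

    /* Round dim up to the nearest multiple of mult. */
    return ( ( dim + mult - 1 ) / mult ) * mult;
}
```

    If a panel stride is rounded this way and the first panel starts at an
    aligned address, every subsequent panel also starts at an aligned address.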

commit 40a0654ada5f256beb3da80ebba015a3c71fb61f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 20:18:12 2013 -0500

    CHANGELOG update.

commit b65cdc57d9e51fa00e3c03539cfb7e045707d0f4 (tag: 0.0.5)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 20:01:49 2013 -0500

    Migrated 'bl2' prefix to 'bli'.
    
    Details:
    - Changed all filename and function prefixes from 'bl2' to 'bli'.
    - Changed the "blis2.h" header filename to "blis.h" and changed all
      corresponding #include statements accordingly.
    - Fixed incorrect association for Fran in CREDITS file.

commit 132bffcef7441f32d02cc7485aef6a0648e0ef1e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 18:49:36 2013 -0500

    Removed several 'old' directories and files.
    
    Details:
    - Removed most of the 'old' directories scattered throughout the framework,
      which includes alternate/half-baked/broken implementations.

commit 551ea4767a3ea6c263f12aaca94bc2642cee4cfa
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sun Mar 24 18:00:10 2013 -0500

    Removed #include "blis2.h" from low-level headers.
    
    Details:
    - Removed #include of "blis2.h" from various lower-level, operation-specific
      header files throughout the framework. Given that these low-level headers
      are included within blis2.h in a very specific order, #include'ing blis2.h
      within them directly is unnecessary.

commit bc7b318ed0960edeb4537797dd8c91de0d942ca9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 17:18:58 2013 -0500

    Added cpp guards to conflicting libflame typedefs.
    
    Details:
    - Added cpp guards around the definitions of dim_t, scomplex, and dcomplex.
      This is a temporary hack to allow interoperability with libflame. (Similarly
      temporary changes are being made to libflame's type definitions file.)

commit f469907503fcdc24dff0174c569170e6e756e045
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:20:15 2013 -0500

    Renamed MAX_PREFETCH_BYTE_OFFSET to MAX_PRELOAD_.
    
    Details:
    - Renamed BLIS_MAX_PREFETCH_BYTE_OFFSET to
      BLIS_MAX_PRELOAD_BYTE_OFFSET since "prefetch" is kind of a loaded word
      (e.g. "prefetch" instructions, which are different than the particular
      kind of prefetching/preloading referred to by this constant).

commit d1023bfbc6668a58a01ee4f82ded2319911e7b19
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:09:59 2013 -0500

    Removed build/old directory.

commit 718888849c48d99f83eea6b8f83bc1998cffef7e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 22 15:07:01 2013 -0500

    Deprecated 'flame' configuration.
    
    Details:
    - Removed 'flame' configuration, as it was horribly out-of-date.
    - Comment changes to bl2_blocksize.c and bl2_mem.c.

commit bba38cf4e9d28058c14483f44fa074a6d2852ad9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Mar 19 18:07:40 2013 -0500

    Added missing conjbeta argument to scald.

commit 1f82b51d06d0279dded3f2b87ba59403f3ed0af6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 15:37:20 2013 -0500

    Relocated packed mem_t dimension fields to obj_t.
    
    Details:
    - Removed the m and n (and elem_size) fields from the mem_t object, and added
      m_packed and n_packed fields to obj_t. These new fields track the same
      values as the old ones. From an abstraction standpoint, it seemed awkward
      to store those dimensions inside the mem_t.
    - Updated interfaces to bl2_mem_acquire_*() so that only a byte size argument
      is passed in, instead of m, n, and elem_size.
    - Updated bl2_packm_init_pack() and bl2_packv_init_pack() to inline the
      functionality of bl2_mem_alloc_update_m() and bl2_mem_alloc_update_v(),
      respectively.
    - Updated packm variants to access the packed length and width fields from
      their new locations.

commit 36c782857bf9b8ac1b1dac47a70f689a4407e2cc
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Mar 18 10:37:03 2013 -0500

    CHANGELOG update.

commit e7d41229d3b1674e74f47d7f29fae004a745201a (tag: 0.0.4)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 15 17:12:36 2013 -0500

    Re-implemented contiguous memory allocator.
    
    Details:
    - Completely re-wrote the contiguous memory allocator (bl2_mem.c). The new
      allocator instantiates and initializes three separate memory pool objects,
      each one associated with a separate array of contiguous memory blocks, each
      block of fixed and uniform size. (The three pools are for allocating mc-by-kc
      blocks of A, kc-by-nc panels of B, and mc-by-nc panels of C.) The pool
      objects use a stack structure internally to track which blocks in the region
      have been "checked out" to a thread and which are still available. Critical
      regions are now clearly marked and adaptable to parallel environments (e.g.
      OpenMP). Memory pools are set up when bl2_init() is called.
    - Added a new field to the packm control tree node, which indicates what kind
      of packed buffer is being allocated. The enumerated type for this argument
      is defined as packbuf_t in bl2_type_defs.h.
    - Updated level-3 _cntl.c files to pass in the appropriate value for a new
      packbuf_t argument to bl2_packm_cntl_obj_create().
    - Moved some macros called by packm_init_pack() from bl2_obj_macro_defs.h to
      bl2_mem_macro_defs.h.
    - Added BLIS_MAX_NUM_THREADS to bl2_config.h, which we use as the default
      number of blocks of A reserved for the memory allocator.
    - Deprecated bl2_align_dim(). Replaced usage with that of
      bl2_align_dim_to_mult(). Turns out that typically we don't need to align
      a dimension to the system alignment, since that value has to do with
      starting addresses, whereas the values we are dealing with are unitless
      dimensions.
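
    The stack-based pool described in the first item above can be sketched
    roughly as follows. This is an illustration only: pool_sketch_t,
    POOL_NUM_BLOCKS, and the pool_*() functions are hypothetical names, not
    the contents of bl2_mem.c, and the locking that a threaded build would
    need around checkout and release is only indicated in comments.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_NUM_BLOCKS 4  /* hypothetical block count, for illustration */

typedef struct
{
    void*  blocks[ POOL_NUM_BLOCKS ]; /* stack of available block addresses */
    int    top;                       /* index of top entry; -1 when empty  */
    size_t block_size;                /* fixed, uniform size of each block  */
} pool_sketch_t;

/* Carve a contiguous region into fixed-size blocks and push them all. */
static void pool_init( pool_sketch_t* pool, char* region, size_t block_size )
{
    pool->block_size = block_size;
    for ( int i = 0; i < POOL_NUM_BLOCKS; ++i )
        pool->blocks[ i ] = region + ( size_t )i * block_size;
    pool->top = POOL_NUM_BLOCKS - 1;
}

/* Check out a block (pop). Returns NULL if the pool is exhausted.
   In a parallel environment this would be a critical region. */
static void* pool_checkout( pool_sketch_t* pool )
{
    if ( pool->top < 0 ) return NULL;
    return pool->blocks[ pool->top-- ];
}

/* Return a block to the pool (push). Also a critical region. */
static void pool_release( pool_sketch_t* pool, void* block )
{
    assert( pool->top < POOL_NUM_BLOCKS - 1 );
    pool->blocks[ ++pool->top ] = block;
}
```

    Under this sketch, one such pool would be set up for each kind of buffer
    (blocks of A, panels of B, panels of C), each with its own block size.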

commit 1e76cae00cb0a04544aaae1ade878686b238d283
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 15 12:21:42 2013 -0500

    Perform her2k var1 loops in sequence.
    
    Details:
    - Changed variant 1 of her2k so that the two rank-k products are computed
      and accumulated in sequence rather than fused into one loop. This is
      necessary if BLIS is to be configured to provide only enough contiguous
      memory for one panel of B.

commit c95c270eba91ae4efc26603beddfd0292caa919b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 14:42:15 2013 -0600

    Enhanced tracking of dimensions for mem_t objects.
    
    Details:
    - Added new fields to mem_t struct definition to track the allocated (as
      opposed to the currently used) dimensions of the memory region. This
      allows packm_init() to be more robust in situations where memory is
      already allocated but is more than needed for the current packing job.
    - Updated logic in bl2_obj_set_buffer_with_cached_packm_mem() macro, used
      in packm_init(), to update the "currently used" dimensions of the mem_t
      object if the requested dimensions are smaller than the allocated
      dimensions.

commit e99281a0f41d482fddeffa239bfc8e13e6d13d4b
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Mar 7 14:00:10 2013 -0600

    Fixed test suite flop formulas for ops with side.
    
    Details:
    - Fixed incorrect flop counts in test suite modules for hemm, symm, trmm,
      trmm3, and trsm.
    - Comment updates in herk macro-kernels.

commit ef8cbfc44dd620fdcbdb51cdb173217194bebe31
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 2 12:47:06 2013 -0600

    Added "version" to .gitignore.
    
    Details:
    - Added "version" to .gitignore file so that the file does not show up when
      running 'git status', or accidentally get pulled into the index when
      running 'git add' or 'git add --all'.

commit e9e0747c2f6c178f53ac46ab794acbb7b8c4fea8
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Sat Mar 2 12:43:54 2013 -0600

    Removed version file from version control.
    
    Details:
    - Removed version file from version control to prevent git errors that occur
      when trying to pull new commits.

commit bb612f864e9c17dd9805e9446840f02259619469
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Mar 1 12:55:42 2013 -0600

    Updated behavior of bl2_obj_induce_trans() macro.
    
    Details:
    - Changed bl2_obj_induce_trans() so that the transposition bit is no longer
      updated as part of the macro. All current uses of the macro have been
      coupled with instances of bl2_obj_set_trans() to clear the bit.
    - Added Jed to CREDITS file.

commit f24e29b789e7314764a818ceb3063126936c986f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 18:15:41 2013 -0600

    Replaced banded/packed BLAS2 stubs with f2c code.
    
    Details:
    - Retired the blas2blis wrappers that simply called abort with a "not yet
      implemented" message. This includes all of the level-2 banded and packed
      routines.
    - Replaced the aforementioned with the corresponding netlib implementations
      having been run through f2c (with some customization).
    - Added directories named 'attic' to build/gen-make-frags/ignore_list.

commit 1454c1a14207766dfed372b8e38b47fa384f5198
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 12:38:45 2013 -0600

    Moved Fortran name-mangling macro to bl2_config.h.
    
    Details:
    - Moved the Fortran-77 name-mangling macros from bl2_blas_macro_defs.h to the
      configuration directory (bl2_config.h, specifically) given that it can be
      expected to be tweaked by some developers.

commit ede75693e5a36c6006087c4a7df834175b604504 (tag: 0.0.3)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 22 12:11:24 2013 -0600

    Implemented blas2blis compatibility layer.
    
    Details:
    - Added the blas2blis compatibility layer, located in frame/compat. This
      includes virtually all of the BLAS, including banded and packed level-2
      operations.
    
    - Defined bl2_init_safe(), bl2_finalize_safe(). The former allows a conditional
      initialization, which stores the "exit status" in an err_t, which is then
      read by the latter function to determine whether finalization should actually
      take place.
    - Added calls to bl2_init_safe(), bl2_finalize_safe() to all level-2 and
      level-3 BLAS-like wrappers.
    - Added configuration option to instruct BLIS to remain initialized whenever
      it automatically initializes itself (via bl2_init_safe()), until/unless the
      application code explicitly calls bl2_finalize().
    
    - Added INSERT_GENTFUNC* and INSERT_GENTPROT* macros to facilitate type
      templatization of blas2blis wrappers.
    - Defined level-0 scalar macro bl2_??swaps().
    - Defined level-1v operation bl2_swapv().
    - Defined some "Fortran" types to bl2_type_defs.h for use with BLAS
      wrappers.
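
    The conditional initialization described above for bl2_init_safe() and
    bl2_finalize_safe() might look roughly like the sketch below. All names
    here (sketch_err_t, lib_init_safe(), and so on) are hypothetical stand-ins;
    the real functions and the err_t values they use are not shown in this
    log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status codes standing in for err_t values. */
typedef enum { SKETCH_SUCCESS, SKETCH_ALREADY_INITIALIZED } sketch_err_t;

static bool lib_is_initialized = false;

static void lib_init( void )     { lib_is_initialized = true;  }
static void lib_finalize( void ) { lib_is_initialized = false; }

/* Initialize only if needed, and record whether this call did the work. */
static void lib_init_safe( sketch_err_t* status )
{
    assert( status != NULL );

    if ( lib_is_initialized )
    {
        *status = SKETCH_ALREADY_INITIALIZED;
        return;
    }
    lib_init();
    *status = SKETCH_SUCCESS;
}

/* Finalize only if the matching lib_init_safe() call actually initialized. */
static void lib_finalize_safe( sketch_err_t status )
{
    if ( status == SKETCH_SUCCESS ) lib_finalize();
}
```

    With this pattern, a wrapper that is called while the library is already
    initialized leaves it initialized on return.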

commit 995edf43e21c1868732dbdd7fee14b08730218bd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 21 14:30:50 2013 -0600

    Updated version file. (Forgot to in prev commit).

commit e823b08aaf7b65ecc6ddc30570709ea8a4b52aa7
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 21 12:00:17 2013 -0600

    Fixed some scalar types in BLAS-like Herm APIs.
    
    Details:
    - Some of the scalars of Hermitian operations, such as alpha in her,
      alpha and beta in herk, and beta in her2k, need to be real. These
      arguments were typed incorrectly as the complex types. This has been
      fixed. Note the issue was only present in the BLAS-like APIs for
      these operations (not the native object-based interfaces).

commit 5ece050a669e74ba4a711d1d4669239d22d45642
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 20 15:50:54 2013 -0600

    Updated version file. (Forgot to in prev commit).

commit f243034b8b430d4684680ea8eddfd246e73fefc0
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 20 14:11:36 2013 -0600

    Changed API of packm_init_pack() to use blksz_t.
    
    Details:
    - Changed the interface of packm_init_pack() so that mult_m and mult_n
      are passed in as type blksz_t* instead of dim_t.
    - Made a similar change for packv_init_pack().

commit da0c22f24107be9f33e0ea2dae52e5534b1fd0e5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Feb 15 09:59:48 2013 -0600

    Minor changes to lower levels of scalm and setm.
    
    Details:
    - Removed diagx parameter from lower-level interfaces of scalm.
    - Modified scalm_basic_check() to expect an object with a nonunit diagonal.
    - Changed setm_unb_var1() so that having an implicit unit diagonal results
      in only the strictly lower or upper triangle of the matrix being modified.

commit 2c836adadcd2a7d7f217033ac4d7fcad03d5bd55
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 10:42:56 2013 -0600

    Updated beta == zero semantics of mulsc.
    
    Details:
    - Updated beta == zero semantics of mulsc. Hopefully this is the last
      operation that needed updating.
    - Added Devin to CREDITS file.

commit 722b66c7dcaaaa1b109e7c8b1d53fd71a9af8240
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Feb 14 10:18:00 2013 -0600

    Removed some calls to setv() in test modules.
    
    Details:
    - Removed calls to setv() in test modules whose sole purpose was to
      initialize vectors to zero to ensure that nan's and inf's would not
      taint the computation. Now that beta == zero semantics have been
      updated to clear the output operand (when beta is zero), rather than
      multiply against it, these setv() calls are no longer needed.

commit e6ac623a902f776c42f85eadbf76996d9770a0db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 18:44:59 2013 -0600

    Properly implemented beta == 0 semantics.
    
    Details:
    - Changed name of set0 and set0_mxn macros to set0s and set0s_mxn,
      respectively.
    - Added code to the following operations that sets the output operand to
      zero if the corresponding scalar is zero (rather than performing the
      floating-point multiply, or in the case of setv, copying the value).
      This will prevent nan's and inf's from creeping into results from
      uninitialized memory.
      - axpy
      - dotxv
      - scalv
      - scal2v
      - setv
      - gemv
      - ger
      - hemv
      - her
      - her2
      - gemm reference ukernels
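
    The reason for these semantics can be seen in a small sketch. Under
    IEEE 754, 0.0 * NaN is NaN, so multiplying uninitialized output by a zero
    scalar does not reliably clear it. The helper axpby_sketch() below is
    hypothetical and is not one of the operations listed above; it only
    illustrates the overwrite-when-zero rule.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper (not a BLIS function): y := beta * y + alpha * x.
   When beta is exactly zero, y is overwritten instead of being multiplied,
   so NaN or Inf values already in y cannot propagate into the result. */
static void axpby_sketch( size_t n, double alpha, const double* x,
                          double beta, double* y )
{
    assert( x != NULL && y != NULL );

    for ( size_t i = 0; i < n; ++i )
    {
        if ( beta == 0.0 ) y[ i ] = alpha * x[ i ];
        else               y[ i ] = beta * y[ i ] + alpha * x[ i ];
    }
}
```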

commit aedccbc85d491e41711a0c6eb0d246d8700a199a
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 18:29:53 2013 -0600

    Fixed stale interface to packm_unb_var1().
    
    Details:
    - Removed the control tree from the interface to packm_unb_var1(), which
      I meant to do when it was un-deprecated.

commit c23135669f7a8a545e2e11ef559bf284be8bc65c
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Wed Feb 13 13:21:00 2013 -0600

    Un-deprecated packm_unb_var1.c (needed by l2 ops).
    
    Details:
    - Added bl2_packm_unb_var1() back into the mix once I realized that level-2
      operations still need this routine for packing matrices. Now, whether
      level-2 operations should be packing matrices to begin with is another
      matter. But this fixes the segmentation fault one would have gotten when
      running bl2_gemv() on a general stride matrix.

commit cf49e35f9819f9d93ebdca4703ade5abab28f6f6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 18:39:35 2013 -0600

    Removed cntl tree usage from packm implementation.
    
    Details:
    - Added new fields to obj_t info field:
      - invert_diag
      - pack_order_if_upper
      - pack_order_if_lower
      These fields allow packm_init() to embed information that begins
      in the control tree into the object so that the packm implementation
      does not need to use control trees at all. This is being done to aid
      Bryan's DxT code generation.
    - Added macros that operate on above fields.
    - Changed packm_init(), packm_blk_var2(), and packm_blk_var3() according
      to above changes.
    - Made similar (but much simpler) changes to packv.
    - Deprecated packm_blk_var1(), packm_unb_var1(), and packm_densify().
      These were part of prototype implementations and are no longer needed.

commit eb139ae256651af7820b93ef982626180195b87f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 12:39:30 2013 -0600

    Replaced bl2_abs() with _fabs() where appropriate.

commit 474bac30c99928f9e87315972bcb45c632c0b7ec
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 12:23:48 2013 -0600

    Removed level-0 macros projrs, grabis.
    
    Details:
    - Replaced instances of projrs and grabis macros with newer,
      more general-purpose getris.

commit 03a260a457c8964e4603a655cee0d40ac17affba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Feb 12 11:45:34 2013 -0600

    Restored executable permissions to scripts.
    
    Details:
    - Restored executable (0755) permissions to scripts that were touched by
      the recursive sed script that updated the copyright headers in the
      previous commit.

commit 1274e1243775e5e705114257a43176f63635227f
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 14:37:47 2013 -0600

    Updated copyright headers from 2012 to 2013.

commit 3b620cc8e90c53c79129bd9dd89ae6b77c2446f1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 13:38:07 2013 -0600

    CHANGELOG update.

commit 768fcebaa8be0eb936a6e7a02cd8a19438c79d99 (tag: 0.0.2)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Feb 11 13:20:44 2013 -0600

    Added unified test suite, and many fixes.
    
    Details:
    - Added a highly configurable, unified test suite.
    
    - Removed DUPB configuration constant from bl2_kernel.h and macro-kernel
      header files. Now, instead, DUPB is computed as (NDUP != 1) within each
      macro-kernel. This fixes a bug in trmm/trsm whereby bp was indexed into
      incorrectly when DUPB was set to FALSE but the NDUP was still non-unit.
      By encoding both pieces of information into one constant in _kernel.h,
      it seems somewhat less likely others will encounter this bug in the
      future.
    - Added level-2 cache blocksizes to _kernel.h for reference configuration,
      and defined blocksizes in _cntl.c files to these default values.
    
    - Changed semantics of her2k and syr2k such that these operations no longer
      expect the B matrix to already be conjugate-transposed (or just transposed
      for syr2k). However, these semantics are preserved for the internal
      mechanics of the implementations, including the internal back-end and all
      blocked variants.
    - Inserted checks for real-valued alpha and beta for herk/her2k and herk,
      respectively.
    
    - Relaxed general object structure constraints in _basic_check() for gemv, ger.
    - Changed her front-end to NOT copy-cast to real projection; instead, this is
      replaced by selecting either the real part or both parts within the unblocked
      algorithm implementation, depending on the value of conjh.
    - Added conjh to all _check routines for her so that the code knows when to
      verify that alpha has an imaginary component equal to zero (for her, but
      not syr).
    - Changed control tree for her to forgo packing.
    
    - Added unit diagonal support to fnormm.
    - Redefined real versions of abval2s macros in terms of fabs(), fabsf().
    - Redefined complex versions of sqrt2s macros using the actual "complex square
      root" formula.
    - Created new level-0 object-based routines, suffixed with "sc" (for "scalar").
    - Defined new level-1v, -1d, and -1m versions of add and sub operations
      (two-operand add and subtract).
    - Added new scalar macros:
      - getris: acquire real and imaginary components.
      - setris: set real and imaginary components.
      - addjs: addition with conjugated x.
      - subjs: subtraction with conjugated x.
    - Defined new utility operations:
      - absumv: element-wise sum of absolute values for vector elements.
      - absumm: element-wise sum of absolute values for matrix elements.
      - mkherm: convert existing matrix to Hermitian.
      - mksymm: convert existing matrix to symmetric.
      - mktrim: convert existing matrix to triangular.
    
    - Added various error checking routines.
    - Added bl2_clock_min_diff(), which is used to more cleanly measure the
      wall clock time of a code block.
    - Added general stride support to bl2_obj_alloc_buffer().
    - Added bl2_obj_init_scalar().
    - Updated parameter mapping in bl2_param_map.c.
    - Added support for queriable version string.
    
    - Fixed a bug in the her2k macro-kernels (which currently are simply
      implemented in terms of two invocations of herk) whereby beta was being
      applied to both the first and second rank-k updates, rather than only
      the first.
    - Fixed a bug in trmm/trsm whereby transpose and right side cases were not
      properly implemented due to erroneous assumptions regarding aliasing and
      root objects.
    - Fixed a bug in the upper triangular trsm macro-kernel in which the wrong
      MR x NR block of B was being updated.
    - Fixed a bug in the inverts macro in the double real case whereby the
      value was typecast to float before inversion. This affected non-unit cases
      of dtrsm.
    - Fixed a bug in the reference kernels for gemmtrsm whereby the minus one
      constant was being applied incorrectly.
    - Fixed a bug in the overall treatment of non-unit alpha for trsm. The code
      now mimics the rank-k strategy of gemm, whereby alpha is applied during
      the first iteration of variant 3, with BLIS_ONE passed in instead for
      subsequent iterations. This also required passing alpha into the macro-
      kernels as well as the fused gemmtrsm micro-kernels.
    - Fixed a bug in trsm_u_blk_var1 whereby the gemm macro-kernel was being
      called for blocks strictly above the diagonal. While this sounds good in
      theory, this cannot be done because gemm_ker_var2 expects row panels of
      A to be packed from top to bottom, while for trsm_u, A is actually packed
      from bottom to top due to the reverse (BR->TL) nature of the algorithm.
    - Fixed a bug in packm_cxk() whereby panel packings with unit panel
      dimensions were mishandled due to incorrect arguments to the copyv kernel.
      Also changed the copyv kernel invocation to scal2v so that these edge
      cases are properly handled when scaling is requested.
    - Fixed a bug in packv_int() whereby an uninitialized object is passed in
      instead of the source object.
    - Fixed a bug whereby level-2 code could allocate memory dynamically via
      bl2_malloc() and then attempt to free it via bl2_mm_release(). Also fixed
      a potential future bug whereby a mem_t object that is actually no longer
      "allocated" from the static pool is mistaken for being allocated due to
      failure to NULLify the buffer when the block was most recently released.
    - Fixed a bug in bl2_acquire_mpart_*() whereby the uplo field was mistakenly
      toggled when the requested subpartition needed to be "reflected" due to it
      residing in an unstored region.

commit be94fb84c0351602d7585269f29998e3bf83f899
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 4 10:55:21 2013 -0600

    Added missing 'd' to fused gemmtrsm function name.

commit 879a179e1dee36f0c56765f2ab91a26861019b34
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Jan 4 10:37:27 2013 -0600

    Added debug statements to bl2_mm_acquire_m().
    
    Details:
    - Added printf() statements to bl2_mm_acquire_m() to help debug issues
      with prematurely exhausted memory pool.
    - Removed 'd' from kernel names of reference kernels in clarksville
      configuration's bl2_kernel.h

commit 806e74beb4eafeef620a555ffbb3f6779e29c7b6
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 17:07:50 2012 -0600

    Defined Frobenius norm operations.
    
    Details:
    - Added level-0 grabis macro operation to grab imaginary component of one
      variable and copy it to the real component of another variable.
    - Defined sumsqv operation, which computes the sum of the absolute squares
      of the elements of a vector. This implementation is modeled after ?lassq
      in netlib LAPACK.
    - Defined fnormv and fnormm operations, which compute the Frobenius norm on
      vectors and matrices, respectively. These operations are treated as one-
      operand operations where the output norm value is the real projection of
      the datatype of the input operand. Both operations are implemented in terms
      of sumsqv.

commit 66e80ce1aec099b2b2b0c4f295e38add2c921383
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 17:02:55 2012 -0600

    Added GENT*R macros; tweaked bl2_machval defs.
    
    Details:
    - Added function and prototype macro-generating macros for GENTFUNCR and
      GENTPROTR, which are one-operand macros with auxiliary real projection
      types.
    - Tweaked bl2_machval files to use new macros.

commit 2fecc88ca22142020573f168da715e8e9f3dd7de
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 20 11:35:14 2012 -0600

    Fixed harmless macro bug in level-1m operations.
    
    Details:
    - Fixed some inconsistent usage of n_iter_max and n_iter in the two
      bl2_set_dims_incs_uplo_[12]m macros. The right thing ended up happening
      despite the bug, which is why I had not discovered it until now.

commit 8945db6ec9f82168cf72411ad408b4fdb44ae0d1
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 15:07:36 2012 -0600

    Renamed x86,x86_64 kernels to indicate 'd' fusing.
    
    Details:
    - Renamed x86 and x86_64 kernels to contain a 'd' before the fusing shape
      to emphasize that the fusing shape is not for all datatype instances, but
      rather just for one (that of double-precision real). Other fusing shapes
      would be proportional to their precision and domain "byte footprints".
    - Corresponding changes to config/clarksville/bl2_kernel.h.

commit 6fbbdd4e194d06096ad08c5db61127be338067db
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 18 14:34:02 2012 -0600

    More tweaks to _config.h, _kernel.h; smem tweaks.
    
    Details:
    - Moved kernel-related definitions from bl2_config.h to bl2_kernel.h.
    - Replaced #define of _GNU_SOURCE with #define of _POSIX_C_SOURCE. This
      accomplishes the same thing (enabling posix_memalign()) without enabling
      all of the GNU extensions we don't need.
    - Defined the size of the static memory pool in terms of MC, KC, and NC,
      as well as two new constants that determine how many MCxKC blocks and
      how many KCxNC blocks should be allocated (defined in bl2_config.h).
    - In the case of static memory pool exhaustion, replaced the generic
      bl2_abort() with a specific error code call.
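
    One way the pool-size definition described above might look is sketched
    below. Every name and value here is hypothetical (the SKETCH_* macros and
    pool_size_bytes() are not taken from bl2_config.h), and the formula is an
    assumption: the pool holds a fixed number of MCxKC blocks plus a fixed
    number of KCxNC blocks, with no alignment padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache blocksizes, in elements. Real values are chosen
   per configuration. */
#define SKETCH_MC 128
#define SKETCH_KC 256
#define SKETCH_NC 2048

/* Hypothetical counts of each block shape to reserve in the pool. */
#define SKETCH_NUM_MC_X_KC_BLOCKS 1
#define SKETCH_NUM_KC_X_NC_BLOCKS 1

/* Total bytes needed for the static pool, given the size of one element.
   Assumption: blocks are stored back to back with no padding. */
static size_t pool_size_bytes(size_t elem_size)
{
    assert(elem_size > 0);
    size_t a_elems = (size_t)SKETCH_NUM_MC_X_KC_BLOCKS * SKETCH_MC * SKETCH_KC;
    size_t b_elems = (size_t)SKETCH_NUM_KC_X_NC_BLOCKS * SKETCH_KC * SKETCH_NC;
    return (a_elems + b_elems) * elem_size;
}
```

    Deriving the size from the blocksizes means a configuration that changes
    MC, KC, or NC does not also have to update a separately maintained pool
    size.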

commit 5d8bdb21c48e8fb11bef6128a242122cc1470a99
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 16:07:36 2012 -0600

    Minor reordering of bl2_config.h definitions.

commit 4a83f67490136a898f558e273b76a687aed8b893
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 17 12:35:54 2012 -0600

    Consolidated configuration headers.
    
    Details:
    - Merged contents of bl2_arch.h into bl2_config.h for reference and
      clarksville configurations.
    - Updated CREDITS, INSTALL, LICENSE, README files.

commit 0670c33cc14612f636ef09ede4133404ae0af6ba
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 14 12:45:26 2012 -0600

    Fixed bug in reference gemm ukernels.
    
    Details:
    - Fixed a bug whereby, for the reference gemm ukernels, the matrix product
      was not correctly accumulated and scaled (by alpha) into the output matrix
      C. (Thanks to Fran for finding this bug.)
    - Whitespace changes to reference trsm kernels.

commit e2e7cb2fbe615be4d375bc2dce88d03d98fadc9e
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 13 18:17:54 2012 -0600

    Expanded reference packm/unpackm kernel set to 16.
    
    Details:
    - Added 10xk, 12xk, 14xk, and 16xk reference kernels for packm and
      unpackm.
    - Updated bl2_[un]packm_cxk() to silently use scal2m if "out of range"
      kernel size is requested. (Thanks to Tyler for finding this bug.)
    - Updated bl2_kernel.h to contain new _KERNEL definitions, according
      to above changes, for 'reference' and 'clarksville' configurations.
    - Updated CHANGELOG.
    - Removed "output*.m" from .gitignore.

commit 17455a8bce038dd570356ab0c5c11d9a89f20248
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 17:23:32 2012 -0600

    Minor updates towards 0.0.1.

commit 7ad4ebef38b8e6eea9b6091844ba7294ec870271 (tag: 0.0.1)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 16:18:40 2012 -0600

    Tweaks to get BLIS compiling again on clarksville.
    
    Details:
    - Updated header files and make_defs.mk in config/clarksville.
    - Fixes to bl2_mem.c (now that SMEM_M, SMEM_N are gone).
    - Moved definition of blksz_t from bl2_cntl.h to bl2_type_defs.h.
    - Shuffled include statements in blis2.h.

commit cc58ea86010b1f046134d13b546c878389df9af5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 10 14:55:12 2012 -0600

    Added template fragment.mk; updated .gitignore.

commit 714c527b0eb153b7e2040b79349edc8372f743fd
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 19:54:04 2012 -0600

    Added 'changelog' make target; other tweaks.
    
    Details:
    - Updated CHANGELOG.
    - Added 'changelog' target to Makefile that runs 'git log --decorate' and
      overwrites CHANGELOG with the output.
    - Other trivial changes.

commit e4e5404d26aded4873278e85faf6f14ac32115b5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 17:34:53 2012 -0600

    Define static memory pool size in bl2_config.h.

commit 19bb507d0de6a2bd3ce37cf616bdcd6b419ed641
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Fri Dec 7 17:18:00 2012 -0600

    Refined INSTALL text; added 'showconfig' target.
    
    Details:
    - Added 'showconfig' target to Makefile.
    - Added header files and ./config/<configname>/make_defs.mk as prerequisites
      to object file rules.
    - Added config.mk as prerequisite to library install rules.
    - Edited and added to INSTALL file.

commit 26cb659dd79636489db5a051aa60fff80273a7b9
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 15:34:53 2012 -0600

    Added auto-detection of version string (via git).
    
    Details:
    - Added build/update-version-file.sh script for auto-detecting "version"
      string and updating 'version' file accordingly. (If .git directory is
      not present, then it is assumed this copy of BLIS is a downloaded
      release, in which case 'version' file is left unchanged.)
    - Added invocation of update-version-file.sh to configure script.

commit b0ecd0ff52fa6ffc9e1d9eb44c365f7f009a6204
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 14:27:11 2012 -0600

    Wrote first draft of INSTALL file.

commit bcbe81235a35ccfdbcc2f2319a0ca6e04f75a785 (tag: 0.0.0)
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Thu Dec 6 12:42:35 2012 -0600

    Updated standalone test Makefile and other fixes.
    
    Details:
    - Major edits to test/Makefile to bring up-to-date wrt new build system;
      should no longer be broken.
    - Minor edits to top-level Makefile.
    - Fixed copy-and-paste bugs in
      - frame/1m/packm/ukernels/bl2_packm_ref_?xk.c
      - frame/1m/unpackm/ukernels/bl2_unpackm_ref_?xk.c

commit 2f272b40f43307909736327f49d17737c7a05d37
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Tue Dec 4 19:22:14 2012 -0600

    Added build system and continued reorganization.
    
    Details:
    - Added/renamed packm, unpackm kernels.
    - Added machine value routines.
    - Added param_map facility.
    - Renamed AUTHORS to CREDITS.
    - Added Makefile; continued to expand upon existing configure script.
    - #define fuse_fac macros in operation headers if not defined already
      (by the user in bl2_kernels.h).

commit 00f3498a8943be1b387f0d5c029c8c7891687ad5
Author: Field G. Van Zee <field@cs.utexas.edu>
Date:   Mon Dec 3 12:36:11 2012 -0600

    Initial commit.
