
.. _using_gpu:

=============
Using the GPU
=============

For an introductory discussion of *Graphics Processing Units* (GPUs) and their use for
intensive parallel computation, see `GPGPU <http://en.wikipedia.org/wiki/GPGPU>`_.

One of Theano's design goals is to specify computations at an
abstract level, so that the internal function compiler has a lot of flexibility
about how to carry out those computations.  One of the ways we take advantage of
this flexibility is in carrying out calculations on an Nvidia graphics card when
the device present in the computer is CUDA-enabled.

Setting Up CUDA
----------------

If you have not done so already, you will need to install Nvidia's
GPU-programming toolchain (CUDA) and configure Theano to use it.
We provide installation instructions for :ref:`Linux <gpu_linux>`,
:ref:`MacOS <gpu_macos>` and :ref:`Windows <gpu_windows>`.

Testing Theano with GPU
-----------------------

To see if your GPU is being used, cut and paste the following program into a
file and run it.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_1

.. code-block:: python

    from theano import function, config, shared, sandbox
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], T.exp(x))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', r
    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'

The program just computes the ``exp()`` of a bunch of random numbers.
Note that we use the ``shared`` function to
make sure that the input *x* is stored on the graphics device.

.. the following figures have been measured twice on BART3 on Aug 2nd 2012 with no other job running simultaneously 

If I run this program (in ``check1.py``) with ``device=cpu``, my computer takes a little over 3 seconds,
whereas on the GPU it takes about 0.64 seconds. The GPU will not always produce the exact
same floating-point numbers as the CPU. As a benchmark, a loop that calls ``numpy.exp(x.get_value())`` takes about 46 seconds.

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
    [Elemwise{exp,no_inplace}(<TensorType(float32, vector)>)]
    Looping 1000 times took 3.06635117531 seconds
    Result is [ 1.23178029  1.61879337  1.52278066 ...,  2.20771813  2.29967761
      1.62323284]
    Used the cpu

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
    Looping 1000 times took 0.638810873032 seconds
    Result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
      1.62323296]
    Used the gpu

Note that GPU operations in Theano currently require ``floatX`` to be *float32* (see also below).
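
As the two outputs above show, CPU and GPU results can differ in the last
digits of a *float32* value. A minimal sketch of a tolerance-based comparison,
using three of the values printed above (the relative tolerance ``1e-6`` is an
assumption chosen for illustration, not a value prescribed by Theano):

.. code-block:: python

    # First three values printed by the CPU and GPU runs above.
    cpu_result = [1.23178029, 1.61879337, 1.52278066]
    gpu_result = [1.23178029, 1.61879349, 1.52278066]

    # Exact equality fails because the second elements differ ...
    exactly_equal = cpu_result == gpu_result

    # ... but every pair agrees to within a relative tolerance
    # (1e-6 is an illustrative choice).
    close_enough = all(abs(c - g) <= 1e-6 * max(abs(c), abs(g))
                       for c, g in zip(cpu_result, gpu_result))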


Returning a Handle to Device-Allocated Data
-------------------------------------------

The speedup is not greater in the preceding example because the function is
returning its result as a NumPy ndarray which has already been copied from the
device to the host for your convenience.  This is what makes it so easy to swap in ``device=gpu``, but
if you don't mind less portability, you might gain a bigger speedup by changing
the graph to express a computation with a GPU-stored result.  The ``gpu_from_host``
op means "copy the input from the host to the GPU" and it is optimized away
after the ``T.exp(x)`` is replaced by a GPU version of ``exp()``.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_2

.. code-block:: python

    from theano import function, config, shared, sandbox
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', r
    print 'Numpy result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'

The output from this program is

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check2.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.34898686409 seconds
    Result is <CudaNdarray object at 0x6a7a5f0>
    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
      1.62323296]
    Used the gpu

Here we've shaved off about 45% of the run-time by simply not copying the
resulting array back to the host.
The object returned by each function call is now not a NumPy array but a
``CudaNdarray``, which can be converted to a NumPy ndarray with
``numpy.asarray()``, as the example does.


Running the GPU at Full Speed
------------------------------

To really get maximum performance in this simple example, we need to use an
:class:`out<function.Out>` instance with the flag ``borrow=True`` to tell Theano not to copy
the output it returns to us. This is because Theano pre-allocates memory for internal use
(like working buffers), and by default will never return a result that is aliased to one of
its internal buffers: instead, it will copy the buffers associated with outputs into newly
allocated memory at each function call. This ensures that subsequent function calls will
not overwrite previously computed outputs. Although this is normally what you want, our last
example was so simple that the copy had the unwanted side effect of really slowing things down.


.. 
    TODO:
    The story here about copying and working buffers is misleading and potentially not correct
    ... why exactly does borrow=True cut 75% of the runtime ???

..  TODO: Answer by Olivier D: it sounds correct to me -- memory allocations must be slow.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_using_gpu.test_using_gpu_3

.. code-block:: python

    from theano import function, config, shared, sandbox, Out
    import theano.tensor as T
    import numpy
    import time

    vlen = 10 * 30 * 768  # 10 x # cores x # threads per core
    iters = 1000

    rng = numpy.random.RandomState(22)
    x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
    f = function([], 
            Out(sandbox.cuda.basic_ops.gpu_from_host(T.exp(x)),
                borrow=True))
    print f.maker.fgraph.toposort()
    t0 = time.time()
    for i in xrange(iters):
        r = f()
    t1 = time.time()
    print 'Looping %d times took' % iters, t1 - t0, 'seconds'
    print 'Result is', r
    print 'Numpy result is', numpy.asarray(r)
    if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
        print 'Used the cpu'
    else:
        print 'Used the gpu'

Running this version of the code takes just over 0.05 seconds, that is 60x faster than
the CPU implementation!

With the flag ``borrow=False``:

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.31614613533 seconds
    Result is <CudaNdarray object at 0x77e9270>
    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
      1.62323296]
    Used the gpu

With the flag ``borrow=True``:

.. code-block:: text

    $ THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python using_gpu_solution_1.py
    Using gpu device 0: GeForce GTX 580
    [GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>)]
    Looping 1000 times took 0.0502779483795 seconds
    Result is <CudaNdarray object at 0x83e5cb0>
    Numpy result is [ 1.23178029  1.61879349  1.52278066 ...,  2.20771813  2.29967761
      1.62323296]
    Used the gpu
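
The timings reported in this section can be compared directly. The sketch
below only redoes the arithmetic on the figures printed in the outputs above,
rounded to three significant digits; the variable names are illustrative.

.. code-block:: python

    # Seconds for 1000 iterations, taken from the outputs shown above.
    cpu = 3.07             # device=cpu
    gpu_copy_back = 0.639  # device=gpu, result copied back to the host
    gpu_no_borrow = 0.316  # result left on the GPU, borrow=False
    gpu_borrow = 0.0503    # result left on the GPU, borrow=True

    speedup_copy_back = cpu / gpu_copy_back           # roughly 4.8
    speedup_borrow = cpu / gpu_borrow                 # roughly 61
    saved_by_borrow = 1 - gpu_borrow / gpu_no_borrow  # roughly 0.84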


This version of the code, which includes the flag ``borrow=True``, is slightly less safe: if we had saved
the *r* returned from one function call, we would have to remember that its value might
be overwritten by a subsequent function call.  Although ``borrow=True`` makes a dramatic difference
in this example, be careful!  The advantage of ``borrow=True`` is much weaker in larger graphs, and
there is a lot of potential for making a mistake by failing to account for the resulting memory aliasing.
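
The aliasing hazard can be illustrated without a GPU. The following is a
plain-Python analogy, not Theano's implementation: the hypothetical
``BufferedExp`` class reuses one internal buffer, and its ``borrow`` argument
mimics the choice between returning that buffer and returning a copy of it.

.. code-block:: python

    import math

    class BufferedExp(object):
        """Toy stand-in for a compiled function with one internal buffer."""

        def __init__(self, size):
            self._buffer = [0.0] * size  # pre-allocated working memory

        def __call__(self, values, borrow=False):
            for i, v in enumerate(values):
                self._buffer[i] = math.exp(v)
            # borrow=True hands out the internal buffer itself;
            # borrow=False pays for a fresh copy on every call.
            return self._buffer if borrow else list(self._buffer)

    f = BufferedExp(2)

    safe = f([0.0, 0.0])                   # an independent copy
    f([1.0, 1.0])                          # a later call leaves `safe` untouched

    borrowed = f([0.0, 0.0], borrow=True)  # an alias of the internal buffer
    f([1.0, 1.0], borrow=True)             # silently overwrites `borrowed`

After the last call, ``safe`` still holds the result of its own call, whereas
``borrowed`` now holds the result of the most recent one.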


What Can Be Accelerated on the GPU
----------------------------------

The performance characteristics will change as we continue to optimize our
implementations, and vary from device to device, but to give a rough idea of
what to expect right now:

* Only computations with *float32* data-type can be accelerated. Better
  support for *float64* is expected in upcoming hardware but *float64*
  computations are still relatively slow (Jan 2010).
* Matrix multiplication, convolution, and large element-wise operations can be
  accelerated a lot (5-50x) when arguments are large enough to keep 30
  processors busy.
* Indexing, dimension-shuffling and constant-time reshaping will be equally
  fast on GPU as on CPU.
* Summation over rows/columns of tensors can be a little slower on the GPU
  than on the CPU.
* Copying of large quantities of data to and from a device is relatively slow,
  and often cancels most of the advantage of one or two accelerated functions
  on that data.  Getting GPU performance largely hinges on making data transfer
  to the device pay off.

Tips for Improving Performance on GPU
-------------------------------------

* Consider adding ``floatX=float32`` to your ``.theanorc`` file if you plan to
  do a lot of GPU work.
* Use the Theano flag ``allow_gc=False``. See :ref:`gpu_async`.
* Prefer constructors like ``matrix``, ``vector`` and ``scalar`` to
  ``dmatrix``, ``dvector`` and ``dscalar`` because the former will give you
  *float32* variables when ``floatX=float32``.
* Ensure that your output variables have a *float32* dtype and not *float64*.
  The more *float32* variables are in your graph, the more work the GPU can do
  for you.
* Minimize transfers to the GPU device by using ``shared`` *float32* variables
  to store frequently-accessed data (see :func:`shared()<shared.shared>`).
  When using the GPU, *float32* tensor ``shared`` variables are stored on the
  GPU by default to eliminate transfer time for GPU ops using those variables.
* If you aren't happy with the performance you see, try building your functions with
  ``mode='ProfileMode'``. This should print some timing information at program
  termination. Is time being used sensibly?   If an op or Apply is
  taking more time than its share and you know something about GPU
  programming, have a look at how it's implemented in ``theano.sandbox.cuda``.
  Check the line similar to *Spent Xs(X%) in cpu op, Xs(X%) in gpu op and Xs(X%) in transfer op*.
  This can tell you if not enough of your graph is on the GPU or if there
  is too much memory transfer.
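
The flags mentioned in the tips above can also be set from Python through the
``THEANO_FLAGS`` environment variable, provided this happens before ``theano``
is first imported. A minimal sketch (the particular combination of flags is an
example, not a recommendation for every setup):

.. code-block:: python

    import os

    # Flags discussed in this section; adjust them to your setup.
    flags = {
        "floatX": "float32",
        "device": "gpu",
        "allow_gc": "False",
    }

    # THEANO_FLAGS is a comma-separated list of key=value pairs.
    os.environ["THEANO_FLAGS"] = ",".join(
        "%s=%s" % (key, value) for key, value in sorted(flags.items()))

    # Theano reads the variable when it is imported, so import it only now:
    # import theano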

.. _gpu_async:

GPU Async Capabilities
----------------------

Since Theano 0.6, we have used the asynchronous capability of
GPUs. This makes execution faster, but some errors may be raised
later than the point at which they occur. It can also cause
difficulties when profiling Theano apply nodes. There is an NVIDIA
driver feature to help with these issues: if you set the environment
variable ``CUDA_LAUNCH_BLOCKING=1``, all kernel calls will be
automatically synchronized. This reduces performance but provides good
profiling and appropriately placed error messages.

This feature interacts with Theano's garbage collection of intermediate
results. To get the most out of it, you need to disable the garbage collector,
as it inserts synchronization points in the graph. Set the Theano flag
``allow_gc=False`` to get even faster speed! This will raise memory
usage.
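
The timing problem described in this section can be mimicked on the CPU. In
the sketch below, a Python thread stands in for an asynchronously launched
kernel (an analogy only: no GPU or Theano code is involved, and
``slow_kernel`` is a hypothetical name). Timing just the launch attributes
almost no time to the work; the cost only shows up at the synchronization
point.

.. code-block:: python

    import threading
    import time

    def slow_kernel(results):
        time.sleep(0.2)          # stands in for work done on the device
        results.append("done")

    results = []
    start = time.time()
    worker = threading.Thread(target=slow_kernel, args=(results,))
    worker.start()               # returns at once, like an asynchronous launch
    launch_time = time.time() - start

    worker.join()                # explicit synchronization point
    total_time = time.time() - start

With ``CUDA_LAUNCH_BLOCKING=1``, every kernel launch behaves as if it were
immediately followed by such a synchronization, which is why per-node timings
and error locations become meaningful again.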

Changing the Value of Shared Variables
--------------------------------------

To change the value of a ``shared`` variable, e.g. to provide new data to process,
use ``shared_variable.set_value(new_value)``. For a lot more detail about this,
see :ref:`aliasing`.

-------------------------------------------

**Exercise**

Consider again the logistic regression:

.. code-block:: python
    
    import numpy
    import theano
    import theano.tensor as T
    rng = numpy.random

    N = 400
    feats = 784
    D = (rng.randn(N, feats).astype(theano.config.floatX),
    rng.randint(size=N,low=0, high=2).astype(theano.config.floatX))
    training_steps = 10000

    # Declare Theano symbolic variables
    x = T.matrix("x")
    y = T.vector("y")
    w = theano.shared(rng.randn(feats).astype(theano.config.floatX), name="w")
    b = theano.shared(numpy.asarray(0., dtype=theano.config.floatX), name="b")
    x.tag.test_value = D[0]
    y.tag.test_value = D[1]
    #print "Initial model:"
    #print w.get_value(), b.get_value()

    # Construct Theano expression graph
    p_1 = 1 / (1 + T.exp(-T.dot(x, w)-b)) # Probability of having a one
    prediction = p_1 > 0.5 # The prediction that is done: 0 or 1
    xent = -y*T.log(p_1) - (1-y)*T.log(1-p_1) # Cross-entropy
    cost = xent.mean() + 0.01*(w**2).sum() # The cost to optimize
    gw,gb = T.grad(cost, [w,b])

    # Compile expressions to functions
    train = theano.function(
                inputs=[x,y],
                outputs=[prediction, xent],
                updates={w:w-0.01*gw, b:b-0.01*gb},
                name = "train")
    predict = theano.function(inputs=[x], outputs=prediction,
                name = "predict")

    if any([x.op.__class__.__name__ in ['Gemv', 'CGemv', 'Gemm', 'CGemm'] for x in
            train.maker.fgraph.toposort()]):
        print 'Used the cpu'
    elif any([x.op.__class__.__name__ in ['GpuGemm', 'GpuGemv'] for x in
              train.maker.fgraph.toposort()]):
        print 'Used the gpu'
    else:
        print 'ERROR, not able to tell if theano used the cpu or the gpu'
        print train.maker.fgraph.toposort()

    for i in range(training_steps):
        pred, err = train(D[0], D[1])
    #print "Final model:"
    #print w.get_value(), b.get_value()

    print "target values for D"
    print D[1]

    print "prediction on D"
    print predict(D[0])

   

Modify and execute this example to run on GPU with ``floatX=float32`` and 
time it using the command line ``time python file.py``. (Of course, you may use some of your answer
to the exercise in section :ref:`Configuration Settings and Compiling Mode<using_modes>`.)

Is there an increase in speed from CPU to GPU?

Where does it come from? (Use ``ProfileMode``)

What can be done to further increase the speed of the GPU version? Put your ideas to test.


.. Note::

   * Only 32-bit floats are currently supported (development is in progress).
   * ``Shared`` variables with *float32* dtype are by default moved to the GPU memory space.

   * There is a limit of one GPU per process.
   * Use the Theano flag ``device=gpu`` to require use of the GPU device.
   * Use ``device=gpu{0, 1, ...}`` to specify which GPU if you have more than one.
 
   * Apply the Theano flag ``floatX=float32`` (through ``theano.config.floatX``) in your code.
   * ``Cast`` inputs before storing them into a ``shared`` variable.
   * Circumvent the automatic cast of *int32* with *float32* to *float64*:
    
     * Insert manual cast in your code or use *[u]int{8,16}*.
     * Insert manual cast around the mean operator (this involves division by length, which is an *int64*).
     * Notice that a new casting mechanism is being developed.
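
The upcasts mentioned in the note above follow the same pattern as NumPy's
type promotion, so they can be checked with NumPy alone. The sketch below
assumes that pattern carries over to the Theano graph, as the note describes;
the array names are illustrative.

.. code-block:: python

    import numpy

    # int32 combined with float32 is promoted to float64 ...
    wide = numpy.promote_types(numpy.int32, numpy.float32)

    # ... whereas 8- and 16-bit integers fit into float32.
    narrow = [numpy.promote_types(t, numpy.float32)
              for t in (numpy.int8, numpy.uint8, numpy.int16, numpy.uint16)]

    counts = numpy.arange(4, dtype=numpy.int32)
    weights = numpy.ones(4, dtype=numpy.float32)

    upcast = counts * weights                        # float64 result
    kept = counts.astype(numpy.float32) * weights    # manual cast: float32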

:download:`Solution<using_gpu_solution_1.py>`

-------------------------------------------


Software for Directly Programming a GPU
---------------------------------------

Leaving aside Theano, which is a meta-programmer, there are:

* **CUDA**: GPU programming API by NVIDIA based on an extension to C (CUDA C)

  * Vendor-specific

  * Numeric libraries (BLAS, RNG, FFT) are maturing.

* **OpenCL**: multi-vendor version of CUDA

  * More general, standardized.

  * Fewer libraries, less widespread.

* **PyCUDA**: Python bindings to the CUDA driver interface that give access to Nvidia's CUDA parallel
  computation API from Python

  * Convenience:

    Makes it easy to do GPU meta-programming from within Python.
 
    Abstractions to compile low-level CUDA code from Python (``pycuda.compiler.SourceModule``).

    GPU memory buffer (``pycuda.gpuarray.GPUArray``).
  
    Helpful documentation.

  * Completeness: Binding to all of CUDA's driver API.

  * Automatic error checking: All CUDA errors are automatically translated into Python exceptions.

  * Speed: PyCUDA's base layer is written in C++.

  * Good memory management of GPU objects:

    Object cleanup tied to lifetime of objects (RAII, 'Resource Acquisition Is Initialization').

    Makes it much easier to write correct, leak- and crash-free code.

    PyCUDA knows about dependencies (e.g. it won't detach from a context before all memory
    allocated in it is also freed).
     

  (This is adapted from PyCUDA's `documentation <http://documen.tician.de/pycuda/index.html>`_ 
  and Andreas Kloeckner's `website <http://mathema.tician.de/software/pycuda>`_ on PyCUDA.)


* **PyOpenCL**: PyCUDA for OpenCL

Learning to Program with PyCUDA
-------------------------------

If you are already proficient in the C programming language, you can
build on that knowledge by learning first to program a GPU with the
CUDA extension to C (CUDA C) and then to use PyCUDA to access the CUDA
API through a Python wrapper.

The following resources will assist you in this learning process:

* **CUDA API and CUDA C: Introductory**

  * `NVIDIA's slides <http://www.sdsc.edu/us/training/assets/docs/NVIDIA-02-BasicsOfCUDA.pdf>`_

  * `Stein's (NYU) slides <http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf>`_
  
* **CUDA API and CUDA C: Advanced**

  * `MIT IAP2009 CUDA <https://sites.google.com/site/cudaiap2009/home>`_
    (full coverage: lectures, leading Kirk-Hwu textbook, examples, additional resources)

  * `Course U. of Illinois <http://courses.engr.illinois.edu/ece498/al/index.html>`_
    (full lectures, Kirk-Hwu textbook)

  * `NVIDIA's knowledge base <http://www.nvidia.com/content/cuda/cuda-developer-resources.html>`_
    (extensive coverage, levels from introductory to advanced)

  * `practical issues <http://stackoverflow.com/questions/2392250/understanding-cuda-grid-dimensions-block-dimensions-and-threads-organization-s>`_
    (on the relationship between grids, blocks and threads; see also linked and related issues on same page)

  * `CUDA optimisation <http://www.gris.informatik.tu-darmstadt.de/cuda-workshop/slides.html>`_

* **PyCUDA: Introductory**

  * `Kloeckner's slides <http://www.gputechconf.com/gtcnew/on-demand-gtc.php?sessionTopic=&searchByKeyword=kloeckner&submit=&select=+&sessionEvent=2&sessionYear=2010&sessionFormat=3>`_

  * `Kloeckner's website <http://mathema.tician.de/software/pycuda>`_

* **PyCUDA: Advanced**

  * `PyCUDA documentation website <http://documen.tician.de/pycuda/>`_


The following examples give a foretaste of programming a GPU with PyCUDA. Once
you feel competent enough, you may try your hand at the corresponding exercises.

**Example: PyCUDA**


.. code-block:: python

  # (from PyCUDA's documentation)
  import pycuda.autoinit
  import pycuda.driver as drv
  import numpy
  
  from pycuda.compiler import SourceModule
  mod = SourceModule("""
  __global__ void multiply_them(float *dest, float *a, float *b)
  {
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
  }
  """)

  multiply_them = mod.get_function("multiply_them")
  
  a = numpy.random.randn(400).astype(numpy.float32)
  b = numpy.random.randn(400).astype(numpy.float32)
  
  dest = numpy.zeros_like(a)
  multiply_them(
          drv.Out(dest), drv.In(a), drv.In(b),
          block=(400,1,1), grid=(1,1))

  assert numpy.allclose(dest, a*b)
  print dest
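
The kernel above is launched with a single block of 400 threads, so
``threadIdx.x`` alone identifies each element. When an array needs more than
one block, a common pattern is a global index
``blockIdx.x * blockDim.x + threadIdx.x`` guarded by a bounds check, with the
number of blocks rounded up. The pure-Python sketch below mimics that indexing
on the CPU; ``launch_geometry`` and ``simulated_double`` are hypothetical
helper names, and 512 threads per block is an illustrative choice.

.. code-block:: python

    def launch_geometry(size, threads_per_block=512):
        """Return (number of blocks, total threads launched) for `size` items."""
        # Integer ceiling division: enough blocks to cover every element.
        blocks = (size + threads_per_block - 1) // threads_per_block
        return blocks, blocks * threads_per_block

    def simulated_double(values, threads_per_block=512):
        """Run the 'kernel body' once per simulated thread, sequentially."""
        size = len(values)
        blocks, _ = launch_geometry(size, threads_per_block)
        out = [0.0] * size
        for block_idx in range(blocks):
            for thread_idx in range(threads_per_block):
                i = block_idx * threads_per_block + thread_idx
                if i < size:  # surplus threads in the last block do nothing
                    out[i] = values[i] * 2
        return out

    # 20 elements need one block; 492 of its 512 threads stay idle.
    blocks, threads = launch_geometry(20)
    doubled = simulated_double([1.0] * 20)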

-------------------------------------------

**Exercise**

Run the preceding example.

Modify and execute to work for a matrix of shape (20, 10).

-------------------------------------------


.. _pyCUDA_theano:

**Example: Theano + PyCUDA**


.. code-block:: python

    import numpy, theano
    import theano.misc.pycuda_init
    from pycuda.compiler import SourceModule
    import theano.sandbox.cuda as cuda

    class PyCUDADoubleOp(theano.Op):
        def __eq__(self, other):
            return type(self) == type(other)
        def __hash__(self):
            return hash(type(self))
        def __str__(self):
            return self.__class__.__name__
        def make_node(self, inp):
            inp = cuda.basic_ops.gpu_contiguous(
               cuda.basic_ops.as_cuda_ndarray_variable(inp))
            assert inp.dtype == "float32"
            return theano.Apply(self, [inp], [inp.type()])
        def make_thunk(self, node, storage_map, _, _2):
            mod = SourceModule("""
        __global__ void my_fct(float * i0, float * o0, int size) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if(i<size){
            o0[i] = i0[i]*2;
        }
      }""")
            pycuda_fct = mod.get_function("my_fct")
            inputs = [ storage_map[v] for v in node.inputs]
            outputs = [ storage_map[v] for v in node.outputs]
            def thunk():
                z = outputs[0]
                if z[0] is None or z[0].shape!=inputs[0][0].shape:
                    z[0] = cuda.CudaNdarray.zeros(inputs[0][0].shape)
                grid = (int(numpy.ceil(inputs[0][0].size / 512.)),1)
                pycuda_fct(inputs[0][0], z[0], numpy.intc(inputs[0][0].size),
                           block=(512,1,1), grid=grid)
            return thunk
    

Use this code to test it:

>>> x = theano.tensor.fmatrix()
>>> f = theano.function([x], PyCUDADoubleOp()(x))
>>> xv=numpy.ones((4,5), dtype="float32")
>>> assert numpy.allclose(f(xv), xv*2)
>>> print numpy.asarray(f(xv))

-------------------------------------------

**Exercise**


Run the preceding example.

Modify and execute to multiply two matrices: *x* * *y*.

Modify and execute to return two outputs: *x + y* and *x - y*.

(Note that Theano's current *elemwise fusion* optimization is
only applicable to computations involving a single output. Hence, to gain
efficiency over the basic solution that is asked here, the two operations would
have to be jointly optimized explicitly in the code.)

Modify and execute to support *stride* (i.e. so as not to constrain the input to be *C-contiguous*).

