
.. _introduction:


************
Introduction
************

Background Questionaire
-----------------------

* Who has used Theano before?

 * What did you do with it?

* Who has used Python? numpy? scipy? matplotlib?

* Who has used iPython?

 * Who has used it as a distributed computing engine?

* Who has done C/C++ programming?

* Who has organized computation around a particular physical memory layout?

* Who has used a multidimensional array of >2 dimensions?

* Who has written a Python module in C before?

 * Who has written a program to *generate* Python modules in C?

* Who has used a templating engine?

* Who has programmed a GPU before?

 * Using OpenGL / shaders ?

 * Using CUDA (runtime? / driver?)

 * Using PyCUDA ?

 * Using OpenCL / PyOpenCL ?

 * Using cudamat / gnumpy ?

 * Other?

* Who has used Cython?


Python in one slide
-------------------

* General-purpose high-level OO interpreted language
 
* Emphasizes code readability
 
* Comprehensive standard library
 
* Dynamic type and memory management

* Built-in types: int, float, str, list, dict, tuple, object

* Slow execution

* Popular in web-dev and scientific communities


.. code-block:: python

    #######################
    # PYTHON SYNTAX EXAMPLE
    #######################
    a = 1                     # no type declaration required!
    b = (1,2,3)               # tuple of three int literals
    c = [1,2,3]               # list of three int literals
    d = {'a': 5, b: None}     # dictionary of two elements
                              # N.B. string literal, None

    print d['a']              # square brackets index
    # -> 5
    print d[(1,2,3)]          # new tuple == b, retrieves None
    # -> None
    print d[6]
    # raises KeyError Exception

    x, y, z = 10, 100, 100    # multiple assignment from tuple
    x, y, z = b               # unpacking a sequence

    b_squared = [b_i**2 for b_i in b]  # list comprehension

    def foo(b, c=3):          # function w default param c
        return a + b + c      # note scoping, indentation

    foo(5)                    # calling a function
    # -> 1 + 5 + 3 == 9       # N.B. scoping
    foo(b=6, c=2)             # calling with named args
    # -> 1 + 6 + 2 == 9

    print b[1:3]              # slicing syntax

    class Foo(object):        # Defining a class
        def __init__(self):
            self.a = 5
        def hello(self):
            return self.a

    f = Foo()                 # Creating a class instance
    print f.hello()           # Calling methods of objects
    # -> 5 

    class Bar(Foo):           # Defining a subclass
        def __init__(self, a):
            self.a = a

    print Bar(99).hello()     # Creating an instance of Bar
    # -> 99

Numpy in one slide
------------------

* Python floats are full-fledged objects on the heap

 * Not suitable for high-performance computing!

* Numpy provides a N-dimensional numeric array in Python

 * Perfect for high-performance computing.

* Numpy provides

 * elementwise computations

 * linear algebra, Fourier transforms

 * pseudorandom numbers from many distributions

* Scipy provides lots more, including

 * more linear algebra

 * solvers and optimization algorithms

 * matlab-compatible I/O

 * I/O and signal processing for images and audio

.. code-block:: python

    ##############################
    # Properties of Numpy arrays
    # that you really need to know
    ##############################

    import numpy as np          # import can rename
    a = np.random.rand(3,4,5)   # random generators
    a32 = a.astype('float32')   # arrays are strongly typed

    a.ndim                      # int: 3
    a.shape                     # tuple: (3,4,5)
    a.size                      # int: 60
    a.dtype                     # np.dtype object: 'float64'
    a32.dtype                   # np.dtype object: 'float32'

Arrays can be combined with numeric operators, standard mathematical
functions. Numpy has great `documentation <http://docs.scipy.org/doc/numpy/reference/>`_.

Training an MNIST-ready classification neural network in pure numpy might look like this:

.. code-block:: python

    #########################
    # Numpy for Training a
    # Neural Network on MNIST
    #########################

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')
    w = np.random.normal(
        avg=0,
        std=.1,
        size=(784, 500))
    b = np.zeros((500,))
    v = np.zeros((500, 10))
    c = np.zeros((10,))

    batchsize = 100
    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]

        hidin = np.dot(x_i, w) + b

        hidout = np.tanh(hidin)

        outin = np.dot(hidout, v) + c
        outout = (np.tanh(outin)+1)/2.0

        g_outout = outout - y_i
        err = 0.5 * np.sum(g_outout**2)

        g_outin = g_outout * outout * (1.0 - outout)

        g_hidout = np.dot(g_outin, v.T)
        g_hidin = g_hidout * (1 - hidout**2)

        b -= lr * np.sum(g_hidin, axis=0)
        c -= lr * np.sum(g_outin, axis=0)
        w -= lr * np.dot(x_i.T, g_hidin)
        v -= lr * np.dot(hidout.T, g_outin)


What's missing?
---------------

* Non-lazy evaluation (required by Python) hurts performance

* Numpy is bound to the CPU

* Numpy lacks symbolic or automatic differentiation

Now let's have a look at the same algorithm in Theano, which runs 15 times faster if
you have GPU (I'm skipping some dtype-details which we'll come back to).

.. code-block:: python

    #########################
    # Theano for Training a
    # Neural Network on MNIST
    #########################

    import theano as T
    import theano.tensor as TT

    x = np.load('data_x.npy')
    y = np.load('data_y.npy')

    # symbol declarations
    sx = TT.matrix()
    sy = TT.matrix()
    w = T.shared(np.random.normal(avg=0, std=.1,
        size=(784, 500)))
    b = T.shared(np.zeros(500))
    v = T.shared(np.zeros((500, 10)))
    c = T.shared(np.zeros(10))

    # symbolic expression-building
    hid = TT.tanh(TT.dot(sx, w) + b)
    out = TT.tanh(TT.dot(hid, v) + c)
    err = 0.5 * TT.sum(out - sy)**2
    gw, gb, gv, gc = TT.grad(err, [w,b,v,c])

    # compile a fast training function
    train = T.function([sx, sy], err,
        updates={
            w:w - lr * gw,
            b:b - lr * gb,
            v:v - lr * gv,
            c:c - lr * gc})

    # now do the computations
    batchsize = 100
    for i in xrange(1000):
        x_i = x[i*batchsize:(i+1)*batchsize]
        y_i = y[i*batchsize:(i+1)*batchsize]
        err_i = train(x_i, y_i)

    
Theano in one slide
-------------------

* High-level domain-specific language tailored to numeric computation

* Compiles most common expressions to C for CPU and GPU.

* Limited expressivity means lots of opportunities for expression-level optimizations

 * No function call -> global optimization

 * Strongly typed -> compiles to machine instructions

 * Array oriented -> parallelizable across cores

 * Support for looping and branching in expressions

* Expression substitution optimizations automatically draw
  on many backend technologies for best performance.

 * FFTW, MKL, ATLAS, Scipy, Cython, CUDA

 * Slower fallbacks always available

* Automatic differentiation


Project status
--------------

* Mature: theano has been developed and used since January 2008 (3.5 yrs old)

* Driven over 40 research papers in the last few years

* Good user documentation

* Active mailing list with participants from outside our lab

* Core technology for a funded Silicon-Valley startup

* Many contributors (some from outside our lab)

* Used to teach IFT6266 for two years

* Used for research at Google and Yahoo.

* Unofficial RPMs for Mandriva

* Downloads (January 2011 -  June 8 2011):

 * Pypi 780

 * MLOSS: 483

 * Assembla (`bleeding edge` repository): unknown



Why scripting for GPUs?
-----------------------

They *Complement each other*:

* GPUs are everything that scripting/high level languages are not

 * Highly parallel

 * Very architecture-sensitive

 * Built for maximum FP/memory throughput

 * So hard to program that meta-programming is easier.

* CPU: largely restricted to control

 * Optimized for sequential code and low latency (rather than high throughput)

 * Tasks (1000/sec)

 * Scripting fast enough

Best of both: scripted CPU invokes JIT-compiled kernels on GPU.


How Fast are GPUs?
------------------

* Theory

 * Intel Core i7 980 XE (107Gf/s float64) 6 cores

 * NVIDIA C2050 (515 Gf/s float64, 1Tf/s float32) 480 cores

 * NVIDIA GTX580 (1.5Tf/s float32) 512 cores

 * GPUs are faster, cheaper, more power-efficient

* Practice (our experience)

 * Depends on algorithm and implementation!

 * Reported speed improvements over CPU in lit. vary *widely* (.01x to 1000x)

 * Matrix-matrix multiply speedup: usually about 10-20x.

 * Convolution speedup: usually about 15x.

 * Elemwise speedup: slower or up to 100x (depending on operation and layout)

 * Sum: can be faster or slower depending on layout.

* Benchmarking is delicate work...

 * How to control quality of implementation?

  * How much time was spent optimizing CPU vs GPU code?

 * Theano goes up to 100x faster on GPU because it uses only one CPU core

 * Theano can be linked with multi-core capable BLAS (GEMM and GEMV)

* If you see speedup > 100x, the benchmark is probably not fair.


Software for Directly Programming a GPU
---------------------------------------

Theano is a meta-programmer, doesn't really count.

* CUDA: C extension by NVIDIA 

 * Vendor-specific

 * Numeric libraries (BLAS, RNG, FFT) maturing.

* OpenCL: multi-vendor version of CUDA

 * More general, standardized

 * Fewer libraries, less adoption.

* PyCUDA: python bindings to CUDA driver interface

 * Python interface to CUDA

 * Memory management of GPU objects

 * Compilation of code for the low-level driver

 * Makes it easy to do GPU meta-programming from within Python

* PyOpenCL: PyCUDA for PyOpenCL

