
.. _tutcomputinggrads:


=====================
Derivatives in Theano
=====================

Computing gradients
===================

Now let's use Theano for a slightly more sophisticated task: create a
function which computes the derivative of some expression ``y`` with
respect to its parameter ``x``. To do this we will use the macro ``T.grad``.
For instance, we can compute the
gradient of :math:`x^2` with respect to :math:`x`. Note that:
:math:`d(x^2)/dx = 2 \cdot x`.

Here is code to compute this gradient:

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_4

>>> from theano import pp
>>> x = T.dscalar('x')
>>> y = x**2
>>> gy = T.grad(y, x)
>>> pp(gy)  # print out the gradient prior to optimization
'((fill((x ** 2), 1.0) * 2) * (x ** (2 - 1)))'
>>> f = function([x], gy)
>>> f(4)
array(8.0)
>>> f(94.2)
array(188.40000000000001)

In the example above, we can see from ``pp(gy)`` that we are computing
the correct symbolic gradient.
``fill((x ** 2), 1.0)`` means to make a matrix of the same shape as
``x ** 2`` and fill it with 1.0.

.. note::
    The optimizer simplifies the symbolic gradient expression.  You can see
    this by digging inside the internal properties of the compiled function.

    .. code-block:: python

        pp(f.maker.env.outputs[0])
        '(2.0 * x)'

    After optimization there is only one Apply node left in the graph, which
    doubles the input.

We can also compute the gradient of complex expressions such as the
logistic function defined above. It turns out that the derivative of the
logistic is: :math:`ds(x)/dx = s(x) \cdot (1 - s(x))`.

.. figure:: dlogistic.png

    A plot of the gradient of the logistic function, with x on the x-axis
    and :math:`ds(x)/dx` on the y-axis.


.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_examples.test_examples_5

>>> x = T.dmatrix('x')
>>> s = T.sum(1 / (1 + T.exp(-x)))
>>> gs = T.grad(s, x)
>>> dlogistic = function([x], gs)
>>> dlogistic([[0, 1], [-1, -2]])
array([[ 0.25      ,  0.19661193],
       [ 0.19661193,  0.10499359]])

In general, for any **scalar** expression ``s``, ``T.grad(s, w)`` provides
the theano expression for computing :math:`\frac{\partial s}{\partial w}`. In 
this way Theano can be used for doing **efficient** symbolic differentiation
(as
the expression return by ``T.grad`` will be optimized during compilation) even for
function with many inputs. ( see `automatic differentiation <http://en.wikipedia.org/wiki/Automatic_differentiation>`_ for a description
of symbolic differentiation).

.. note::

   The second argument of ``T.grad`` can be a list, in which case the
   output is also a list. The order in both list is important, element
   *i* of the output list is the gradient of the first argument of
   ``T.grad`` with respect to the *i*-th element of the list given as second argument.
   The first argument of ``T.grad`` has to be a scalar (a tensor
   of size 1). For more information on the semantics of the arguments of
   ``T.grad`` and details about the implementation, see :ref:`this <libdoc_gradient>`.



Computing the Jacobian
======================

Theano implements :func:`theano.gradient.jacobian` macro that does all
what is needed to compute the Jacobian. The following text explains how
to do it manually.

In order to manually compute the Jacobian of some function ``y`` with
respect to some parameter ``x`` we need to use ``scan``. What we
do is to loop over the entries in ``y`` and compute the gradient of
``y[i]`` with respect to ``x``.

.. note::
    
    ``scan`` is a generic op in Theano that allows writting in a symbolic
    manner all kind of recurrent equations. While in principle, creating
    symbolic loops (and optimizing them for performance) is a hard task,
    effort is being done for improving the performance of ``scan``. More
    information about how to use this op, see :ref:`this <lib_scan>`.


>>> x = T.dvector('x')
>>> y = x**2
>>> J, updates = theano.scan(lambda i, y,x : T.grad(y[i], x), sequences = T.arange(y.shape[0]), non_sequences = [y,x])
>>> f = function([x], J, updates = updates)
>>> f([4,4])
array([[ 8.,  0.],
       [ 0.,  8.]])
    
What we did in this code, is to generate a sequence of ints from ``0`` to
``y.shape[0]`` using ``T.arange``. Then we loop through this sequence, and
at each step, we compute the gradient of element ``y[[i]`` with respect to
``x``. ``scan`` automatically concatenates all these rows, generating a
matrix, which corresponds to the Jacobian.

.. note::
    There are a few gotchas regarding ``T.grad``. One of them is that you
    can not re-write the above expression of the jacobian as
    ``theano.scan(lambda y_i,x: T.grad(y_i,x), sequences=y,
    non_sequences=x)``, even though from the documentation of scan this
    seems possible. The reason is that ``y_i`` will not be a function of
    ``x`` anymore, while ``y[i]`` still is. 


Computing the Hessian
=====================

Theano implements :func:`theano.gradient.hessian` macro that does all
that is needed to compute the Hessian. The following text explains how
to do it manually.

You can compute the Hessian manually as the Jacobian. The only
difference is that now, instead of computing the Jacobian of some expression
``y``, we compute the Jacobian of ``T.grad(cost,x)``, where ``cost`` is some
scalar. 


>>> x = T.dvector('x')
>>> y = x**2
>>> cost = y.sum()
>>> gy = T.grad(cost, x)
>>> H, updates = theano.scan(lambda i, gy,x : T.grad(gy[i], x), sequences = T.arange(gy.shape[0]), non_sequences = [gy,x])
>>> f = function([x], H, updates = updates)
>>> f([4,4])
array([[ 2.,  0.],
       [ 0.,  2.]])


Jacobian times a vector
=======================

Sometimes we can express the algorithm in terms of Jacobians times vectors,
or vectors times Jacobians. Compared to evaluating the Jacobian and then
doing the product, there are methods that computes the wanted result, while
avoiding actually evaluating the Jacobian. This can bring about significant 
performance gains. A description of one such algorithm can be found here: 

* Barak A. Pearlmutter, "Fast Exact Multiplication by the Hessian", *Neural
  Computation, 1994*

While in principle we would want Theano to identify such patterns for us,
in practice, implementing such optimizations in a generic manner can be 
close to impossible. As such, we offer special functions that
can be used to compute such expression.


R-operator
----------

The *R operator* is suppose to evaluate the product between a Jacobian and a
vector, namely :math:`\frac{\partial f(x)}{\partial x} v`. The formulation
can be extended even for `x` being a matrix, or a tensor in general, case in
which also the Jacobian becomes a tensor and the product becomes some kind
of tensor product. Because in practice we end up needing to compute such
expression in terms of weight matrices, theano supports this more generic
meaning of the operation. In order to evaluate the *R-operation* of
expression ``y``, with respect to ``x``, multiplying the Jacobian with ``v``
you need to do something similar to this:


>>> W = T.dmatrix('W')
>>> V = T.dmatrix('V')
>>> x = T.dvector('x')
>>> y = T.dot(x,W)
>>> JV = T.Rop(y, W, V)
>>> f = theano.function([W,V,x], JV)
>>> f([[1,1],[1,1]], [[2,2,],[2,2]], [0,1])
array([ 2.,  2.])


L-operator
----------

Similar to *R-operator* the *L-operator* would compute a *row* vector times
the Jacobian. The mathematical forumla would be :math:`v \frac{\partial
f(x)}{\partial x}`. As for the *R-operator*, the *L-operator* is supported
for generic tensors (not only for vectors). Similarly, it can be used as
follows:

>>> W = T.dmatrix('W')
>>> v = T.dvector('v')
>>> x = T.dvector('x')
>>> y = T.dot(x,W)
>>> VJ = T.Lop(y, W, v)
>>> f = theano.function([W,v,x], JV)
>>> f([[1,1],[1,1]], [2,2,], [0,1])
array([[ 0.,  0.],
       [ 2.,  2.]])

.. note::
    
    `v`, the evaluation point, differs between the *L-operator* and the *R-operator*.
    For the *L-operator*, the evaluation point needs to have the same shape
    as the output, while for the *R-operator* the evaluation point should
    have the same shape as the input parameter. Also the result of these two
    opeartion differs. The result of the *L-operator* is of the same shape
    as the input parameter, while the result of the *R-operator* is the same
    as the output.


Hessian times a vector
======================

If you need to compute the Hessian times a vector, you can make use of the
above defined operators to do it more efficiently than actually computing
the exact Hessian and then doing the product. Due to the symmetry of the 
Hessian matrix, you have two options that will
give you the same result, though these options might exhibit different performance, so we
suggest to profile the methods before using either of the two:


>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x**2)
>>> gy = T.grad(y, x)
>>> vH = T.grad( T.sum(gy*v), x)
>>> f = theano.function([x,v], vH)
>>> f([4,4],[2,2])
array([ 4.,  4.])


or, making use of the *R-operator*:

>>> x = T.dvector('x')
>>> v = T.dvector('v')
>>> y = T.sum(x**2)
>>> gy = T.grad(y, x)
>>> Hv = T.Rop(gy,x,v)
>>> f = theano.function([x,v], Hv)
>>> f([4,4],[2,2])
array([ 4.,  4.])

