
===============================
Making arithmetic Ops on double
===============================

Now that we have a ``double`` type, we have yet to use it to perform
computations. We'll start by defining multiplication.


.. _op_contract:

Op's contract
=============

An Op (:class:`gof.Op`) is any object which defines the
following methods:


.. function:: make_node(*inputs)

  This method is responsible for creating output Variables of a
  suitable symbolic Type to serve as the outputs of this Op's application.
  The Variables found in ``*inputs`` must be operated on using Theano's
  symbolic language to compute the symbolic output Variables. This method
  should put these outputs into an Apply instance, and return the
  Apply instance.

  This method creates an Apply node representing the application of
  the Op on the inputs provided. If the Op cannot be applied to
  these inputs, it must raise an appropriate exception.

  The inputs of the Apply instance returned by this call must be
  ordered correctly: a subsequent ``self.make_node(*apply.inputs)``
  must produce something equivalent to the first ``apply``.

.. function:: perform(node, inputs, output_storage)

  This method computes the function associated with this Op. ``node`` is
  an Apply node created by the Op's ``make_node`` method. ``inputs`` is a
  list of references to data to operate on using non-symbolic statements
  (i.e., statements in Python, NumPy or C). ``output_storage`` is a list
  of storage cells where the results of the computation must be put.

  More specifically:

    - ``node``: This is a reference to an Apply node which was previously
      obtained via the ``Op``'s ``make_node`` method. It is typically not
      used in simple Ops, but it contains symbolic information that
      could be required for complex Ops.

    - ``inputs``: This is a list of data from which the values stored in ``output_storage``
      are to be computed using non-symbolic language.

    - ``output_storage``: This is a list of storage cells where the output is to be stored.
      A storage cell is a one-element list. It is forbidden to change
      the length of the list(s) contained in ``output_storage``.  There is
      one storage cell for each output of the Op.

      The data put in ``output_storage`` must match the type of the
      symbolic output. This is a situation where the ``node`` argument
      can come in handy.

      A function Mode may allow ``output_storage`` elements to persist between
      evaluations, or it may reset ``output_storage`` cells to hold a value of
      ``None``.  It can also pre-allocate some memory for the Op to use.
      This feature can allow ``perform`` to reuse memory between
      calls, for example. If there is something preallocated in the
      ``output_storage``, it will have the correct dtype, but it may have
      the wrong shape and any stride pattern.

  The outputs of this method must be determined entirely by the inputs.
  That is to say, if it is evaluated once on inputs A and returns B, then
  whenever inputs C, equal to A, are presented again, outputs equal to
  B must be returned again.

  You must be careful about aliasing outputs to inputs, and making
  modifications to any of the inputs. See :ref:`Views and inplace
  operations <views_and_inplace>` before writing a ``perform``
  implementation that does either of these things.
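
  The storage-cell convention can be sketched in plain Python, without
  importing Theano. The function name ``perform_mul`` and the driver code
  are hypothetical; they only mimic, under the assumption that the caller
  owns the cells, how a one-element list is filled in place.

  .. code-block:: python

     def perform_mul(node, inputs, output_storage):
         # `node` is unused in this simple sketch.
         x, y = inputs
         z = output_storage[0]   # the single storage cell: a one-element list
         z[0] = float(x * y)     # write the result; never resize the cell

     # One storage cell per output; a fresh cell holds None.
     output_storage = [[None]]
     perform_mul(None, [3.0, 4.0], output_storage)
     first = output_storage[0][0]

     # The same cell may be presented again on a later call.
     perform_mul(None, [3.0, 4.0], output_storage)
     second = output_storage[0][0]

  Equal inputs give equal outputs on both calls, and the cell keeps its
  length of one throughout.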

.. function:: __eq__(other)

  ``other`` is also an Op.

  Returning ``True`` here is a promise to the optimization system
  that the other Op will produce exactly the same graph effects
  (from perform) as this one, given identical inputs. This means it
  will produce the same output values, it will destroy the same
  inputs (same destroy_map), and will alias outputs to the same
  inputs (same view_map). For more details, see
  :ref:`views_and_inplace`.

.. function:: __hash__()

  If two Op instances compare equal, then they **must** return the
  same hash value.

  Equally important, this hash value must not change during the
  lifetime of self.  Op instances should be immutable in this
  sense.
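
  A minimal sketch of a consistent ``__eq__``/``__hash__`` pair is shown
  here on a plain Python class. ``ScaleBy`` and its ``factor`` parameter
  are hypothetical and do not derive from :class:`gof.Op`; the point is
  only that the hash is built from the same fields that equality compares,
  and that those fields are never modified after construction.

  .. code-block:: python

     class ScaleBy(object):
         """Hypothetical parametrized Op-like object (not a Theano class)."""

         def __init__(self, factor):
             self.factor = factor  # treated as immutable after construction

         def __eq__(self, other):
             return type(self) == type(other) and self.factor == other.factor

         def __ne__(self, other):
             return not self.__eq__(other)

         def __hash__(self):
             # Built only from the fields that __eq__ compares.
             return hash((type(self), self.factor))

     a, b, c = ScaleBy(2.0), ScaleBy(2.0), ScaleBy(3.0)
     distinct = set([a, b, c])   # a and b collapse into one entry
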

.. function:: connection_pattern(node)

  Optional method; sometimes needed for gradient.grad to
  work correctly.

  Returns a list of lists of bools.

  Op.connection_pattern[input_idx][output_idx] is true if the
  elements of inputs[input_idx] have an effect on the elements of
  outputs[output_idx].

  The ``node`` parameter is needed to determine the number of
  inputs. Some ops such as Subtensor take a variable number of
  inputs.

  If no connection_pattern is specified, gradient.grad will
  assume that all inputs have some elements connected to some
  elements of all outputs.

  This method conveys two pieces of information that are otherwise
  not part of the theano graph:

  1) Which of the op's inputs are truly ancestors of each of the
     op's outputs. Suppose an op has two inputs, x and y, and
     outputs f(x) and g(y). y is not really an ancestor of f, but
     it appears to be so in the theano graph.
  2) Whether the actual elements of each input/output are relevant
     to a computation.
     For example, the shape op does not read its input's elements,
     only its shape metadata. d shape(x) / dx should thus raise
     a disconnected input exception (if these exceptions are
     enabled).
     As another example, the elements of the Alloc op's outputs
     are not affected by the shape arguments to the Alloc op.

  Failing to implement this function for an op that needs it can
  result in two types of incorrect behavior:
  
  1) gradient.grad erroneously raising a TypeError reporting that
     a gradient is undefined.
  2) gradient.grad failing to raise a ValueError reporting that
     an input is disconnected.

  Even if connection_pattern is not implemented correctly,
  if gradient.grad returns an expression, that expression will
  be numerically correct.
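
  For the two-input, two-output case mentioned above (outputs f(x) and
  g(y)), the returned structure could look like the following sketch. It
  is written as a free function for brevity, whereas in an Op it would be
  a method; the layout follows the ``[input_idx][output_idx]`` indexing
  described earlier.

  .. code-block:: python

     def connection_pattern(node):
         # Rows index the inputs (x, y); columns index the outputs (f(x), g(y)).
         return [[True, False],   # x affects f(x) only
                 [False, True]]   # y affects g(y) only

     pattern = connection_pattern(None)  # `node` is unused in this sketch
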

.. function:: grad(inputs, output_gradients)

  Optional (but required for the Op to work with gradient.grad()).

  If the Op being defined is differentiable, its gradient may be specified 
  symbolically in this method. Both ``inputs`` and ``output_gradients``
  are lists of symbolic Theano Variables and those must be operated on using 
  Theano's symbolic language. The grad method must return a list containing 
  one Variable for each input. Each returned Variable represents 
  the gradient with respect to that input computed based on the symbolic gradients with
  respect to each output.

  If the output is not differentiable with respect to an input
  then this method should be defined to return a variable of type
  NullType for that input. Likewise, if you have not implemented the
  grad computation for some input, you may return a variable of type
  NullType for that input. theano.gradient contains convenience methods
  that can construct the variable for you: :func:`theano.gradient.grad_undefined` and
  :func:`theano.gradient.grad_not_implemented`, respectively.

  If an element of ``output_gradients`` is of type theano.gradient.DisconnectedType,
  it means that the cost is not a function of this output. If any of the
  op's inputs participate in the computation of only disconnected outputs,
  then Op.grad should return DisconnectedType variables for those inputs.

  If the grad method is not defined, then Theano assumes it has been
  forgotten.  Symbolic differentiation will fail on a graph that
  includes this Op.

  It must be understood that the Op's grad method is not meant to return the
  gradient of the Op's output. theano.tensor.grad computes gradients; Op.grad
  is a helper function that computes terms that appear in gradients.
  
  If an Op has a single vector-valued output y and a single vector-valued input x,
  then the grad method will be passed x and a second vector z. Define J to be
  the Jacobian of y with respect to x. The Op's grad method should return
  dot(J.T, z). When theano.tensor.grad calls the grad method, it will set z to
  be the gradient of the cost C with respect to y. If this op is the only op
  that acts on x, then dot(J.T, z) is the gradient of C with respect to x.
  If there are other ops that act on x, theano.tensor.grad will have to add up
  the terms of x's gradient contributed by the other ops' grad methods.

  In practice, an op's input and output are rarely implemented as single vectors.
  Even if an op's output consists of a list containing a scalar, a sparse matrix,
  and a 4D tensor, you can think of these objects as being formed by rearranging
  a vector. Likewise for the input. In this view, the values computed by the grad
  method still represent the product dot(J.T, z) of the transposed Jacobian
  with a vector.

  In practice, it is probably not a good idea to explicitly construct the Jacobian,
  which might be very large and very sparse. However, the returned value should
  be equal to that product.
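
  The relation between the value returned by the grad method and the
  gradient of the cost can be checked numerically on a toy function. The
  sketch below uses plain Python lists rather than Theano Variables, and
  the names ``f``, ``grad_term`` and ``cost`` are hypothetical. It assumes
  a cost that is a weighted sum of the outputs, so that z equals the
  weights.

  .. code-block:: python

     def f(x):
         # Stand-in for an Op with one vector input and one vector output.
         return [x[0] * x[1], x[0] + x[1]]

     def grad_term(x, z):
         # dot(J.T, z) written out by hand, where J[i][j] = d f_i / d x_j:
         #   J = [[x[1], x[0]],
         #        [1,    1   ]]
         return [x[1] * z[0] + z[1],
                 x[0] * z[0] + z[1]]

     def cost(x, w):
         # Scalar cost C = sum_i w[i] * f(x)[i], so that dC/dy = w.
         return sum(wi * yi for wi, yi in zip(w, f(x)))

     x = [2.0, 3.0]
     w = [0.5, -1.5]               # plays the role of z = dC/dy
     analytic = grad_term(x, w)

     # Central finite differences of C with respect to each element of x.
     eps = 1e-6
     numeric = []
     for j in range(len(x)):
         hi = list(x)
         lo = list(x)
         hi[j] += eps
         lo[j] -= eps
         numeric.append((cost(hi, w) - cost(lo, w)) / (2 * eps))

  Here ``analytic`` and ``numeric`` agree up to rounding error, without
  the Jacobian ever being stored as a matrix.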

  So long as you implement this product correctly, you need not understand what
  theano.tensor.grad is doing, but for the curious the mathematical justification
  is as follows:

  In essence, the grad method must simply implement through symbolic Variables
  and operations the chain rule of differential calculus. The chain rule
  is the mathematical procedure that allows one to calculate the total derivative
  :math:`\frac{d C}{d x}` of the final scalar symbolic Variable C with respect to a
  primitive symbolic Variable x found in the list ``inputs``.
  The grad method does this using ``output_gradients`` which provides the total
  derivative :math:`\frac{d C}{d f}` of C with respect to a symbolic Variable
  that is returned by the Op (this is provided
  in ``output_gradients``), as well as the knowledge of the total derivative :math:`\frac{d f}{d x}` of the
  latter with respect to the primitive Variable (this has to be computed).

  In mathematics, the total derivative of a scalar variable (C) with respect to a vector of
  scalar variables (x), i.e. the gradient, is customarily represented as the
  row vector of the partial derivatives, whereas the total derivative of a vector of
  scalar variables (f) with respect to another (x) is customarily represented by the matrix of
  the partial derivatives, i.e. the Jacobian matrix. In this convenient setting,
  the chain rule states that the gradient of the final scalar variable C with respect
  to the primitive scalar variables in x through those in f is simply given by the matrix product:
  :math:`\frac{d C}{d x} = \frac{d C}{d f} \cdot \frac{d f}{d x}`.

  Here, the chain rule must be implemented in a similar but slightly more complex
  setting: Theano provides in the list ``output_gradients`` one gradient for each
  of the Variables returned by the Op. Where f is one such particular Variable,
  the corresponding gradient found in ``output_gradients`` and representing
  :math:`\frac{d C}{d f}` is provided with a shape similar to f and thus not
  necessarily as a row vector of scalars.  Furthermore, for each Variable x of 
  the Op's list of input variables ``inputs``, the returned gradient representing
  :math:`\frac{d C}{d x}` must have a shape similar to that of Variable x.

  If the output list of the op is :math:`[f_1, ... f_n]`, then the list 
  ``output_gradients`` is :math:`[grad_{f_1}(C), grad_{f_2}(C), ... , grad_{f_n}(C)]`.
  If ``inputs`` consists of the list :math:`[x_1, ..., x_m]`, then Op.grad
  should return the list :math:`[grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)]`,
  where :math:`(grad_{y}(Z))_i = \frac{\partial Z}{\partial y_i}` (and :math:`i` can stand for multiple dimensions).
 
  In other words, :func:`grad` does not return
  :math:`\frac{d f_i}{d x_j}`, but instead the appropriate dot product specified by the chain rule:  
  :math:`\frac{d C}{d x_j} =
  \frac{d C}{d f_i} \cdot \frac{d f_i}{d x_j}`.
  Both the partial differentiation and the multiplication have to be performed by
  :func:`grad`.


  Theano currently imposes the following constraints on the values returned by the grad method:
  
  1) They must be Variable instances.
  2) When they are types that have dtypes, they must never have an integer dtype.

  Integers are a tricky subject. Integers are the main reason for having DisconnectedType,
  NullType or zero gradient. When you have an integer as an argument to your grad method,
  recall the definition of a derivative to help you decide what value to return:

  :math:`\frac{d f}{d x} = \lim_{\epsilon \rightarrow 0} (f(x+\epsilon)-f(x))/\epsilon`.

  Suppose your function f has an integer-valued output. For most functions you're likely
  to implement in Theano, this means your gradient should be zero, because f(x+epsilon)
  = f(x) for almost all x. (The only other option is that the gradient could be undefined,
  if your function is discontinuous everywhere, like the rational indicator function.)

  Suppose your function f has an integer-valued input. This is a little trickier, because
  you need to think about what you mean mathematically when you make a variable integer-valued
  in Theano. Most of the time in machine learning we mean "f is a function of a real-valued
  x, but we are only going to pass in integer values of x". In this case, f(x+epsilon) exists,
  so the gradient through f should be the same whether x is an integer or a floating point
  variable. Sometimes what we mean is "f is a function of an integer-valued x, and f is only
  defined where x is an integer." Since f(x+epsilon) doesn't exist, the gradient is undefined.
  Finally, many times in Theano, integer-valued inputs don't actually affect the elements of
  the output, only its shape.

  If your function f has both an integer-valued input and an
  integer-valued output, then both rules have to be combined:

  - If f is defined at (x+epsilon), then the input gradient is
    defined. Since f(x+epsilon) would be equal to f(x) almost
    everywhere, the gradient should be 0 (first rule).

  - If f is only defined where x is an integer, then the gradient
    is undefined, regardless of what the gradient with respect to the
    output is.

  Examples:

  1) f(x,y) = dot product between x and y. x and y are integers.
        Since the output is also an integer, f is a step function.
        Its gradient is zero almost everywhere, so Op.grad should return
        zeros in the shape of x and y.
  2) f(x,y) = dot product between x and y. x is floating point and y is an integer.
        In this case the output is floating point. It doesn't matter that y is an integer.
        We consider f to still be defined at f(x,y+epsilon). The gradient is exactly the
        same as if y were floating point.
  3) f(x,y) = argmax of x along axis y.
        The gradient with respect to y is undefined, because f(x,y) is not defined for
        floating point y. How could you take an argmax along a fractional axis?
        The gradient with respect to x is 0, because f(x+epsilon, y) = f(x, y) almost
        everywhere.
  4) f(x,y) = a vector with y elements, each of which takes on the value x.
        The grad method should return DisconnectedType()() for y, because the elements of
        f don't depend on y. Only the shape of f depends on y. You probably also want to
        implement a connection_pattern method to encode this.
  5) f(x) = int(x) converts float x into an int. g(y) = float(y) converts an integer y into a float.
        If the final cost C = 0.5 * g(y) = 0.5 g(f(x)), then the
        gradient with respect to y will be 0.5, even if y is an
        integer. However, the gradient with respect to x will be 0,
        because the output of f is integer-valued.
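
  Example 5 can be illustrated with finite differences in plain Python.
  This is only a numerical sketch of the reasoning, not Theano code; the
  choices of ``eps`` and of the point ``x = 2.3`` are arbitrary, and ``g``
  is assumed to extend to real-valued arguments.

  .. code-block:: python

     def f(x):
         return int(x)      # float -> int, piecewise constant in x

     def g(y):
         return float(y)    # int -> float, extended here to real-valued y

     eps = 1e-4

     # Gradient of C = 0.5 * g(f(x)) with respect to x: f is flat near x = 2.3.
     x = 2.3
     dC_dx = (0.5 * g(f(x + eps)) - 0.5 * g(f(x - eps))) / (2 * eps)

     # Gradient of C = 0.5 * g(y) with respect to y, treating y as real-valued.
     y = f(x)
     dC_dy = (0.5 * g(y + eps) - 0.5 * g(y - eps)) / (2 * eps)

  The first quotient is exactly zero because ``f`` does not change in a
  neighbourhood of ``x``, while the second is close to 0.5.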


.. function:: infer_shape(node, shapes)

   Optional.

   This function is needed for shape optimization. ``shapes`` is a
   list with one tuple for each input of the Apply node (which corresponds
   to the inputs of the op).  Each tuple contains as many elements as the
   number of dimensions of the corresponding input. The value of each element
   is the shape (number of items) along the corresponding dimension of that
   specific input.

   While this might sound complicated, it is nothing more than the shape
   of each input as symbolic variables (one per dimension).

   The function should return a list with one tuple for each output.
   Each tuple should contain the corresponding output's computed shape.

   Implementing this method will allow Theano to compute the output's
   shape without computing the output itself, potentially sparing you
   a costly recomputation.
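
   As an illustration, here is a sketch of ``infer_shape`` for a
   hypothetical matrix-product Op with two matrix inputs and one output.
   Plain integers stand in for the symbolic shape variables that Theano
   would pass, and the function is written outside a class for brevity.

   .. code-block:: python

      def infer_shape(node, shapes):
          # `shapes` holds one tuple per input: here (m, k) and (k, n).
          (m, k), (k2, n) = shapes
          # The product of an (m, k) and a (k, n) matrix has shape (m, n).
          # One tuple is returned per output; this Op has a single output.
          return [(m, n)]

      out_shapes = infer_shape(None, [(3, 4), (4, 5)])
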

.. function:: make_thunk(node, storage_map, compute_map, no_recycling)

   TODO

.. function:: R_op(inputs, eval_points)

   Optional.

   This function implements the application of the R-operator on the
   function represented by your op. Let us assume that function is :math:`f`,
   with input :math:`x`. Applying the R-operator means computing the
   Jacobian of :math:`f` and right-multiplying it by :math:`v`, the evaluation
   point, namely: :math:`\frac{\partial f}{\partial x} v`.

   ``inputs`` are the symbolic variables corresponding to the value of
   the input where you want to evaluate the Jacobian, and ``eval_points``
   are the symbolic variables corresponding to the value you want to
   right-multiply the Jacobian with.

   The same conventions as for the grad method hold. If your op is not
   differentiable, you can return None. Note that in contrast to
   the method :func:`grad`, for :func:`R_op` you need to return the
   same number of outputs as there are outputs of the op. You can think
   of it in the following terms. You have all your inputs concatenated
   into a single vector :math:`x`. You do the same with the evaluation
   points (of which there are as many as inputs, and of the same shape) and
   obtain another vector :math:`v`. For each output, you reshape it into a
   vector, compute the Jacobian of that vector with respect to :math:`x` and
   multiply it by :math:`v`. As a last step you reshape each of the
   vectors you obtained (which have as many elements as the corresponding
   outputs) back to the shape of its output and return them as the
   output of the :func:`R_op` method.
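
   The product computed by the R-operator can be checked numerically as a
   directional derivative. The sketch below uses plain Python lists
   instead of Theano Variables; ``f`` and ``r_op_term`` are hypothetical
   names, and the Jacobian is written out by hand for this toy function.

   .. code-block:: python

      def f(x):
          # Stand-in for an Op with one vector input and one vector output.
          return [x[0] * x[1], x[0] + x[1]]

      def r_op_term(x, v):
          # J v written out by hand, with J = [[x[1], x[0]], [1, 1]].
          return [x[1] * v[0] + x[0] * v[1],
                  v[0] + v[1]]

      x = [2.0, 3.0]
      v = [1.0, -1.0]              # the evaluation point
      analytic = r_op_term(x, v)

      # Directional derivative of f at x along v, by central differences.
      eps = 1e-6
      plus = f([xi + eps * vi for xi, vi in zip(x, v)])
      minus = f([xi - eps * vi for xi, vi in zip(x, v)])
      numeric = [(p - m) / (2 * eps) for p, m in zip(plus, minus)]

   The result has one entry per output element, whereas the value returned
   by ``grad`` has one entry per input element.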

.. attribute:: default_output

  *Default:* None

  If this member variable is an integer, then the default
  implementation of ``__call__`` will return
  ``node.outputs[self.default_output]``, where ``node`` was returned
  by ``make_node``.  Otherwise, the entire list of outputs will be
  returned (or the output itself, when the node has a single output).

.. function:: __call__(*inputs)

  Syntactic shortcut to make_node which returns the output
  Variables of the Op.

  *Default:* this is implemented in the parent class and you do not need to change it.

.. function:: __str__()

   *Default:* python default: module_path_to_your_class.CLASSNAME

   This allows you to specify a more informative string representation of your
   Op. If an Op has parameters, it is highly recommended to have the
   ``__str__`` method include the name of the op and the Op's parameters'
   values.

.. function:: do_constant_folding(node)

   *Default:* Return True

   By default, when optimizations are enabled, Apply nodes whose
   inputs are all constants are removed during function compilation
   and replaced with a Theano constant variable. This way, the Apply
   node is not executed at each function call. If you want to force
   the execution of an op during the function call, make
   do_constant_folding return False.

   As done in the Alloc op, you can return False only in some cases by
   analyzing the graph from the node parameter.

At a bare minimum, a new Op must define ``make_node`` and ``perform``, which
have no defaults.

You can also provide a :ref:`C implementation <cop>` of
``perform()``. For more details, refer to the documentation for
:ref:`op`.


Defining an Op: ``mul``
=======================

We'll define multiplication as a *binary* operation, even though a
multiplication Op could take an arbitrary number of arguments.

First, we'll instantiate a ``mul`` Op:

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1

.. code-block:: python

   from theano import gof
   mul = gof.Op()


**make_node**

This function must take as many arguments as the operation we are
defining is supposed to take as inputs---in this example that would be
two.
This function ensures that both inputs have the ``double``
type.
Since multiplying two doubles yields a double,
this function makes an Apply node with an output Variable of type
``double``.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1

.. code-block:: python

   def make_node(x, y):
       if x.type != double or y.type != double:
           raise TypeError('mul only works on doubles')
       return gof.Apply(mul, [x, y], [double()])
   mul.make_node = make_node


The first two lines make sure that both inputs are Variables of the
``double`` type that we created in the previous section. We would not
want to multiply two arbitrary types: it would not make much sense
(and it would cause trouble when we implement this in C!).

The last line is the meat of the definition. There we create an Apply
node representing the application of Op ``mul`` to inputs ``x`` and
``y``, giving a Variable instance of type ``double`` as the output.

.. note::

   Theano relies on the fact that if you call the ``make_node`` method
   of Apply's first argument on the inputs passed as the Apply's
   second argument, the call will not fail and the returned Apply
   instance will be equivalent.  This is how graphs are copied.
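
The copying behaviour described in the note can be mimicked with plain
Python objects. ``FakeApply``, ``FakeMul`` and ``copy_node`` are
hypothetical stand-ins, not Theano classes; the sketch only shows why
``make_node`` must accept the inputs stored in the Apply node it returned.

.. code-block:: python

   class FakeApply(object):
       """Hypothetical stand-in for gof.Apply: records op, inputs, outputs."""

       def __init__(self, op, inputs, outputs):
           self.op = op
           self.inputs = list(inputs)
           self.outputs = list(outputs)

   class FakeMul(object):
       """Hypothetical stand-in for an Op with a two-argument make_node."""

       def make_node(self, x, y):
           return FakeApply(self, [x, y], ['x*y'])

   def copy_node(node):
       # Rebuild the node from its own op and inputs, as the note describes.
       return node.op.make_node(*node.inputs)

   original = FakeMul().make_node('x', 'y')
   clone = copy_node(original)

The clone is a new object with the same Op and equal inputs and outputs.
If ``make_node`` reordered or dropped inputs, the copy would no longer be
equivalent to the original.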

**perform**

This code actually computes the function.
In our example, the data in ``inputs`` will be instances of Python's
built-in type ``float`` because this is the type that ``double.filter()``
will always return, per our own definition. ``output_storage`` will
contain a single storage cell for the result of the multiplication.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1
.. code-block:: python

   def perform(node, inputs, output_storage):
       x, y = inputs[0], inputs[1]
       z = output_storage[0]
       z[0] = x * y
   mul.perform = perform

Here, ``z`` is a list of one element. By default, ``z == [None]``.

.. note::

   It is possible that ``z`` does not contain ``None``. If it contains
   anything else, Theano guarantees that whatever it contains is what
   ``perform`` put there the last time it was called with this
   particular storage. Furthermore, Theano gives you permission to do
   whatever you want with ``z``'s contents, chiefly reusing it or the
   memory allocated for it. More information can be found in the
   :ref:`op` documentation.

.. warning::

   We gave ``z`` the Theano type ``double`` in ``make_node``, which means
   that a Python ``float`` must be put there. You should not put, say, an
   ``int`` in ``z[0]`` because Theano assumes Ops handle typing properly.


Trying out our new Op
=====================

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1

In the following code, we use our new Op:

>>> import theano
>>> x, y = double('x'), double('y')
>>> z = mul(x, y)
>>> f = theano.function([x, y], z)
>>> f(5, 6)
30.0
>>> f(5.6, 6.7)
37.519999999999996

Note that there is an implicit call to
``double.filter()`` on each argument, so if we give integers as inputs
they are magically cast to the right type. Now, what if we try this?

>>> x = double('x')
>>> z = mul(x, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/u/breuleuo/hg/theano/theano/gof/op.py", line 207, in __call__
  File "<stdin>", line 2, in make_node
AttributeError: 'int' object has no attribute 'type'

Automatic Constant Wrapping
---------------------------

Well, OK. We'd like our Op to be a bit more flexible. This can be done
by modifying ``make_node`` to accept Python ``int`` or ``float`` as
``x`` and/or ``y``:

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1
.. code-block:: python

   def make_node(x, y):
       if isinstance(x, (int, float)):
           x = gof.Constant(double, x)
       if isinstance(y, (int, float)):
           y = gof.Constant(double, y)
       if x.type != double or y.type != double:
           raise TypeError('mul only works on doubles')
       return gof.Apply(mul, [x, y], [double()])
   mul.make_node = make_node

Whenever we pass a Python int or float instead of a Variable as ``x`` or
``y``, ``make_node`` will convert it to a :ref:`constant` for us. ``gof.Constant``
is a :ref:`variable` whose value we know statically.

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_op.test_op_1

>>> x = double('x')
>>> z = mul(x, 2)
>>> f = theano.function([x], z)
>>> f(10)
20.0
>>> f(3.4)
6.7999999999999998

Now the code works the way we want it to.

.. note::
   Most Theano Ops follow this convention of up-casting literal
   make_node arguments to Constants.
   This makes typing expressions more natural.  If you do
   not want a constant somewhere in your graph, you have to pass a Variable
   (like ``double('x')`` here).



Final version
=============

The above example is pedagogical.  When you define other basic arithmetic
operations ``add``, ``sub`` and ``div``, code for ``make_node`` can be
shared between these Ops. Here is a revised implementation of these four
arithmetic operators:

.. If you modify this code, also change :
.. theano/tests/test_tutorial.py:T_extending.test_extending_1

.. code-block:: python

   from theano import gof

   class BinaryDoubleOp(gof.Op):

       def __init__(self, name, fn):
           self.name = name
           self.fn = fn

       def __eq__(self, other):
           return (type(self) == type(other) and
                   self.name == other.name and
                   self.fn == other.fn)

       def __hash__(self):
           return hash(type(self)) ^ hash(self.name) ^ hash(self.fn)

       def make_node(self, x, y):
           if isinstance(x, (int, float)):
               x = gof.Constant(double, x)
           if isinstance(y, (int, float)):
               y = gof.Constant(double, y)
           if x.type != double or y.type != double:
               raise TypeError('%s only works on doubles' % self.name)
           return gof.Apply(self, [x, y], [double()])

       def perform(self, node, inp, out):
           x, y = inp
           z, = out
           z[0] = self.fn(x, y)

       def __str__(self):
           return self.name

   add = BinaryDoubleOp(name='add',
                        fn=lambda x, y: x + y)

   sub = BinaryDoubleOp(name='sub',
                        fn=lambda x, y: x - y)

   mul = BinaryDoubleOp(name='mul',
                        fn=lambda x, y: x * y)

   div = BinaryDoubleOp(name='div',
                        fn=lambda x, y: x / y)

Instead of working directly on an instance of Op, we create a subclass of
Op that we can parametrize. All the operations we define are binary. They
all work on two inputs with type ``double``. They all return a single
Variable of type ``double``. Therefore, ``make_node`` does the same thing
for all these operations, except for the Op reference ``self`` passed
as first argument to Apply.  We define ``perform`` using the function
``fn`` passed in the constructor.

This design is a flexible way to define basic operations without
duplicating code. The same way a Type subclass represents a set of
structurally similar types (see previous section), an Op subclass
represents a set of structurally similar operations: operations that
have the same input/output types, operations that only differ in one
small detail, etc. If you see common patterns in several Ops that you
want to define, it can be a good idea to abstract out what you can.
Remember that an Op is just an object which satisfies the contract
described above on this page and that you should use all the tools at
your disposal to create these objects as efficiently as possible.

**Exercise**: Make a generic DoubleOp, where the number of
arguments can also be given as a parameter.

