---------------------------------------------------------------
I Introduction to Machine Learning
---------------------------------------------------------------

1. ML setup
-----------

. To solve a problem, do not write a set of rules by hand (if you can, do that instead!), but learn from examples.

Types of ML problems:

. supervised:
    given: many pairs of examples (x_i, y_i)
        y_i = (discrete) label, or continuous value(s)
    desired output: extrapolation: new x --> y  (classification, regression)

. reinforcement learning:
    for a given x_i, we are not told the optimal label (or value) y_i;
    instead, for any proposal y, we are told how good that candidate y is
    (no information about the existence of better/worse choices ==> more difficult than supervised)
    --> exploration strategies, bandits, etc.

. unsupervised:
    given: many examples (x_i)_i
    desired outcome: clusters, or a relevant metric, or a relevant compact representation,
    or an estimate of the distribution law (density), or generating new samples according to that law...

2. Examples of tasks / datasets / etc.
--------------------------------------

Supervised setting:
. Give many examples (x_i, y_i), ask for generalization: new x --> ?

Classification: [show slides]

- image: landscape or town? dog or cat?
    . image = set of RGB pixel values: no notion of object, no prior on typical images
      (no world ever seen before)
      image = just a vector in [| 1, 256 |] ^ #pixels
    . loss:
        . if f outputs a label: number of mistakes: sum_i delta_{predicted label f(x_i) =/= y_i}
        . if f outputs a probability distribution over labels: sum_i - log p(correct label)(x_i)
          =: cross_entropy(true probability distribution || predicted one)
    . how to do such a task?
        --> dominant colors (green vs. that road shade of grey), dominant gradients
            (vertical vs. horizontal), texture... (difficult to describe)
        --> 101 dataset --> 100% solved --> how?! --> Mercedes symbol
        --> relevant feature/information extracted = not necessarily what you expect

- ImageNet:
    . historical benchmark with 1 million pictures, 1000 classes
      --> find the class (could have been: the classes, with ranking)
    . obtained by cheap labor (Amazon Mechanical Turk)
    . allowed example-intensive techniques (rise of NN)

- other examples: fish

Prediction, more generally:
- image segmentation: image --> one label per pixel
- image registration: (image 1, image 2) --> deformation that aligns them
  = vector field (one vector per pixel)
- object detection: image --> bounding box around object of interest
- text translation: text in language A --> text in language B
- speech processing: audio --> words (characters)

How to build a dataset?
. not that obvious
. no bias (or else, a known and documented one)
  --> hidden biases: Mercedes logo, ... medical datasets
. correctly sampled (frequency of labels = a lot of information)
. importance of benchmarks: the only way to compare approaches/algorithms

How to set a criterion?
. usually hand-designed, but recommended: MDL (Minimum Description Length; penalizes model complexity)
. inspired by Kolmogorov complexity
. in practice:
      cost of describing the model
    + cost of describing its parameter values    [~ number of parameters * log |dataset| / 2]
    + cost of reproducing the dataset with that model    [= - log p(dataset | model with given parameters)]
. not always obvious (e.g., problematic for text translation)

3. Parameterized families
--------------------------

Typical ML methods:
- k-nearest neighbors
  [for a given metric, find the closest already-seen examples x_i to our new x, average their associated y_i]
- kernel methods (Support Vector Machines, etc.)
  [same but with a weighted average, relying on a similarity notion properly defined with a kernel]
- boosting, ensemble methods... [mixing many weak approaches]
- random forests [set of decision trees, picked stochastically]
- ...
- neural networks

Parameterized families: the case of supervised learning:
- f_theta : x --> y
- loss (performance criterion): L(theta) = sum_i ( f_theta(x_i) - y_i )^2  (+ possibly a regularizer)
- optimization problem: find inf_theta L(theta), or any "good enough" theta
- if no particular trick applies: gradient descent:
      theta_{t+1} = theta_t - a * dL/dtheta    (a: step size)
- under weak conditions, converges to a local optimum theta_*
- use f_{theta_*} to predict: f_{theta_*}(new x)

. family expressivity vs. difficulty of the optimization problem (more parameters ==> harder)
. ML = parameter optimization

Family choice = important. Show toy dataset of points with a binary classification task.

Examples:
- linear (affine): f(x) = M x + b
  theta = (coefficients of matrix M and vector b)
- polynomials: f(x) = sum_n a_n x^n
  theta = (coefficients a_n)
- Gaussian mixture: f(x) = sum_k a_k N(mu_k, sigma_k)(x)
  theta = (locations mu_k, standard deviations sigma_k, coefficients a_k)
- kernels: f(x) = sum_i a_i k(x_i, x)
  theta = (coefficients a_i)    [x_i = samples]
- linear on predefined features: f(x) = M Phi(x) + b
  theta = (M, b)
  --> classic literature = just different hand-made choices of Phi
  --> would Phi itself be optimizable?

Neural networks, "deep learning" (as opposed to linear/kernel methods, supposedly "shallow", while actually not necessarily deep...):
. computational graph of neurons, with weights on edges
  --> a neuron = a simple function: f(inputs) = sigma( sum_i w_i a_i )
. parameters = (w_i)
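The two classification losses defined in section 2 (number of mistakes, and cross-entropy when f outputs a probability distribution) can be sketched as follows; the dog/cat labels and probability values are made-up toy data, not from the lecture.

```python
import math

def misclassification_count(predicted_labels, true_labels):
    """Number of mistakes: sum_i delta_{f(x_i) =/= y_i}."""
    return sum(1 for f_xi, y_i in zip(predicted_labels, true_labels) if f_xi != y_i)

def cross_entropy(predicted_probas, true_labels):
    """sum_i - log p(correct label)(x_i), when f outputs a distribution over labels."""
    return sum(-math.log(p_i[y_i]) for p_i, y_i in zip(predicted_probas, true_labels))

# Toy data: 3 images, labels "dog"/"cat" (illustrative values).
y_true = ["dog", "cat", "dog"]
y_pred = ["dog", "dog", "dog"]
print(misclassification_count(y_pred, y_true))  # 1 mistake

p_pred = [{"dog": 0.9, "cat": 0.1},
          {"dog": 0.4, "cat": 0.6},
          {"dog": 0.7, "cat": 0.3}]
print(round(cross_entropy(p_pred, y_true), 3))  # 0.973
```

Note that a confident wrong prediction (p near 0 on the correct label) blows up the cross-entropy, while the mistake count only increases by 1; this is why cross-entropy is the loss of choice when f outputs probabilities.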
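The "in practice" MDL criterion from section 2 (parameter cost ~ number of parameters * log|dataset|/2, plus data cost = - log p(dataset|model)) can be turned into a small worked computation. The Gaussian noise model, the toy data, and the two candidate models below are illustrative assumptions, not from the lecture.

```python
import math

def mdl_score(num_params, neg_log_likelihood, n):
    """MDL criterion: parameter cost (~ k * log(n) / 2) + data cost (- log p(data|model))."""
    return num_params * math.log(n) / 2 + neg_log_likelihood

def gaussian_nll(residuals, sigma=1.0):
    """- log p(data | model), under an *assumed* Gaussian noise model with fixed sigma."""
    n = len(residuals)
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + sum(r * r for r in residuals) / (2 * sigma**2)

# Toy dataset: roughly y = 2x, with small noise (made-up values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.0, 3.9, 6.1, 8.0]

# Model A: constant prediction y = mean(ys)   (1 parameter)
mean_y = sum(ys) / len(ys)
res_a = [y - mean_y for y in ys]

# Model B: line y = 2x   (counted here as 2 parameters: slope and intercept)
res_b = [y - 2 * x for x, y in zip(xs, ys)]

score_a = mdl_score(1, gaussian_nll(res_a), len(ys))
score_b = mdl_score(2, gaussian_nll(res_b), len(ys))
print(score_a > score_b)  # True: the line wins despite its extra parameter
```

The line pays one extra parameter (log(5)/2 nats) but reproduces the dataset far more cheaply, so its total description length is lower; this is exactly the complexity/fit trade-off MDL formalizes.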
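The gradient-descent recipe from section 3 (theta_{t+1} = theta_t - a * dL/dtheta, with L(theta) = sum_i (f_theta(x_i) - y_i)^2) can be sketched on the simplest family, a 1-D affine model f_theta(x) = m*x + b; the data and step size are made up for illustration.

```python
def loss_grad(m, b, xs, ys):
    """Gradient of L(theta) = sum_i (m*x_i + b - y_i)^2 with respect to theta = (m, b)."""
    dm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (m * x + b - y) for x, y in zip(xs, ys))
    return dm, db

# Toy dataset lying exactly on y = 2x + 1 (made-up values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = 0.0, 0.0          # theta_0: arbitrary initialization
a = 0.01                 # step size
for _ in range(5000):    # theta_{t+1} = theta_t - a * dL/dtheta
    dm, db = loss_grad(m, b, xs, ys)
    m, b = m - a * dm, b - a * db

print(round(m, 3), round(b, 3))  # converges near (2.0, 1.0)
```

Here the loss is convex, so the "local optimum" the notes mention is also global; with a richer family (e.g., a neural network) the same update rule applies but only local convergence is guaranteed, and the step size a must still be small enough not to diverge.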