---------------------------------------------------------------
I Introduction to Machine Learning
---------------------------------------------------------------
1. ML setup
-----------
. To solve a problem, do not write a set of rules by hand (if hand-made rules suffice, just use them); instead, learn from examples
Types of ML problems:
. supervised:
given: many pairs of examples (x_i, y_i) y_i = (discrete) label, or continuous value(s)
desired output: generalization: new x --> y (classification if y is discrete, regression if continuous)
. reinforcement learning:
for a given x_i, the optimal label (or value) y_i is not given;
instead, for any proposed y, we are told how good that candidate y is (with no information about whether better/worse choices exist ==> harder than supervised)
--> exploration strategies, bandits, etc.
. unsupervised:
given: many examples (x_i)_i
desired outcome: clusters, or a relevant metric, a relevant compact representation,
or an estimation of the distribution law (density), or generating new samples according to that law...
2. examples of tasks / datasets / etc.
--------------------------------------
Supervised setting:
. Give many examples (x_i, y_i), ask for generalization: new x --> ?
Classification: [show slides]
- image : landscape or town ? dog or cat ?
. image = set of RGB pixel values : no notion of object, no prior on typical images (no world ever seen before)
image = just a vector in [| 1, 256 |] ^ #pixels
. loss:
. f outputs a label : number of mistakes : sum_i 1_{f(x_i) =/= y_i}
. f outputs a probability over labels : sum_i - log p_predicted(y_i | x_i) =: cross-entropy(true label distribution || predicted one)
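The two losses above can be written down directly; a minimal sketch (function names are mine, and labels are assumed to be integer indices into the predicted probability vector):

```python
import math

def zero_one_loss(predicted_labels, true_labels):
    """Number of mistakes: sum_i 1_{f(x_i) != y_i}."""
    return sum(1 for p, y in zip(predicted_labels, true_labels) if p != y)

def cross_entropy(predicted_probs, true_labels):
    """sum_i -log p_predicted(y_i | x_i): each element of predicted_probs
    is a probability vector over labels, indexed by the true label y_i."""
    return sum(-math.log(probs[y]) for probs, y in zip(predicted_probs, true_labels))
```

Note that the 0-1 loss is piecewise constant (zero gradient almost everywhere), which is one reason the differentiable cross-entropy is preferred for training.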
. how to do such a task?
--> dominant colors (green vs that road shade of grey), dominant gradients (vertical vs horizontal), texture... (difficult to describe)
--> the "101" dataset --> reported as 100% solved --> how? --> the Mercedes symbol gave the class away
--> relevant feature/information extracted = not necessarily what you expect
- ImageNet:
. historical benchmark with 1 million pictures, 1000 classes --> find the class (could have been: the classes, with ranking)
. labels obtained through cheap crowdsourced labor (Amazon Mechanical Turk)
. enabled example-hungry techniques (rise of neural networks)
- other examples: fish
Prediction, more general:
- image segmentation: image --> one label per pixel
- image registration: (image 1, image 2) --> deformation that aligns them = vector field (one vector per pixel)
- object detection: image --> bounding box around object of interest
- text translation: text in language A --> in language B
- speech processing: audio --> words (characters)
How to build a dataset?
. not that obvious
. no bias (or, if any, a bias that is known and stated) --> beware hidden biases: Mercedes logo, ... medical datasets
. correctly sampled (the frequency of each label is itself a lot of information)
. importance of benchmarks: only way to compare approaches/algorithms
How to set a criterion?
. usually hand-designed, but recommended: MDL (Minimum Description Length; penalizes model complexity)
. inspired from Kolmogorov complexity
. in practice: cost-of-describing-the-model + cost-of-describing-its-parameter-values + cost-of-reproducing-the-dataset-with-that-model
  where cost-of-parameter-values ≈ (number of parameters) * (log |dataset|) / 2
  and cost-of-reproducing-the-dataset = - log p(dataset | model with given parameter values)
. not always obvious (ex: problematic for text translation)
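The two-part cost above can be sketched numerically. This is an illustration, not the notes' prescription: I assume i.i.d. Gaussian residuals with maximum-likelihood variance for the data-cost term, and the function name is mine:

```python
import math

def mdl_score(num_params, residuals, n):
    """Two-part MDL / BIC-style score: parameter cost + data cost.
    Parameter cost: (k/2) * log n (each parameter encoded to precision ~1/sqrt(n)).
    Data cost: negative log-likelihood of the data under the fitted model,
    here assuming i.i.d. Gaussian residuals (an assumption for this sketch)."""
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of noise variance
    neg_log_lik = 0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 0.5 * num_params * math.log(n) + neg_log_lik
```

With equal residuals, the model with fewer parameters gets the lower (better) score, which is exactly the complexity penalty at work.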
3. Parameterized families
--------------------------
ML typical methods:
- k-nearest neighbors [for a given metric, find the closest examples x_i already seen to our new x, average their associated y_i]
- kernel methods (Support Vector Machines, etc.) [same but with a weighted average, relying on a similarity notion defined properly with a kernel]
- boosting, ensemble... [mixing many weak approaches]
- random forests [ensemble of decision trees, each built with random choices]
- ...
- neural networks
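The first method in the list is simple enough to sketch in a few lines; a minimal k-nearest-neighbors classifier for the Euclidean metric (function names are mine):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """k-nearest neighbors: find the k training points x_i closest to x
    (Euclidean metric) and return the majority label among their y_i."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], x))[:k]
    labels = [train_y[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]
```

For regression one would average the y_i instead of taking a majority vote; kernel methods replace the hard top-k cutoff with kernel-defined weights.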
Parameterized families: the case of supervised learning:
- f_theta : x --> y
- loss (performance criterion): L(theta) = sum_i ( f_theta(x_i) - y_i )^2 (+ possibly a regularizer)
- optimization problem: find inf_theta L(theta) or any "good enough" theta
- if no particular trick: gradient descent: theta_{t+1} = theta_t - a * dL/dtheta (theta_t) (a: step size, a.k.a. learning rate)
- under mild conditions (e.g., suitable step sizes), converges to a local optimum theta_*
- use f_{theta_*} to predict f_{theta_*}(new x)
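The recipe above can be sketched end to end for the simplest family, f_theta(x) = w*x + b with theta = (w, b); the function name, learning rate, and step count are illustrative choices:

```python
def gradient_descent_fit(xs, ys, lr=0.01, steps=2000):
    """Minimize L(theta) = sum_i (f_theta(x_i) - y_i)^2 for f_theta(x) = w*x + b
    by plain gradient descent: theta <- theta - lr * dL/dtheta."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # exact gradient of the squared loss w.r.t. (w, b)
        dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
        db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * dw
        b -= lr * db
    return w, b
```

For this convex quadratic loss the local optimum is global; with richer families the same loop may stop at a merely local one.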
. family expressivity vs. optimization problem difficulty (more parameters ==> harder)
. ML = parameter optimization
Family choice = important. Show toy dataset of points with binary classification task. Examples:
- linear (affine) : f(x) = M x + b : theta = (coefficients of matrix M and vector b)
- polynomials : f(x) = sum_n a_n x^n : theta = (coefficients a_n)
- Gaussian mixture : sum_k a_k N(mu_k, sigma_k)(x) : theta = (locations mu_k, standard deviations sigma_k, coefficients a_k)
- kernels : f(x) = sum_i a_i k(x_i, x) : theta = (coefficients a_i) [x_i = samples]
- linear on predefined features : f(x) = M Phi(x) + b : theta = (M, b) --> classic literature = just different hand-made choices of Phi
--> would Phi be optimizable?
Neural networks, "deep learning" (as opposed to linear/kernel methods, deemed "shallow", although "deep" models are not necessarily deep...):
. computational graph of neurons, with weights on edges
--> a neuron = a simple function f(inputs) = sigma( sum_i w_i a_i )
. parameters = (w_i)
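A neuron and a tiny computational graph of neurons can be sketched directly from the definition above; the choice of sigma = tanh and the function names are mine:

```python
import math

def neuron(inputs, weights, bias=0.0, sigma=math.tanh):
    """One neuron: sigma(sum_i w_i * a_i + bias).
    The parameters are the weights (and bias); sigma is a fixed nonlinearity."""
    return sigma(sum(w * a for w, a in zip(weights, inputs)) + bias)

def mlp_forward(x, layers):
    """Tiny feed-forward network: each layer is a list of (weights, bias)
    pairs, one per neuron; the outputs of one layer become the inputs of
    the next, i.e., the edges of the computational graph."""
    a = x
    for layer in layers:
        a = [neuron(a, w, b) for w, b in layer]
    return a
```

Training such a network is exactly the earlier recipe: gradient descent on a loss with respect to all the weights at once.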