---------------------------------------------------------------
I Introduction to Machine Learning
---------------------------------------------------------------

1. ML setup
-----------

. To solve a problem, do not write a set of rules by hand (if you can, do that instead!), but learn from examples.

Types of ML problems:

. supervised:
    given: many pairs of examples (x_i, y_i)
        y_i = (discrete) label, or continuous value(s)
    desired output: extrapolation: new x --> y  (classification, regression)

. reinforcement learning:
    for a given x_i, we are not told the optimal label (or value) y_i;
    instead, for any proposal y, we are told how good that candidate y is
    (no information about the existence of better/worse choices ==> more difficult than supervised)
    --> exploration strategies, bandits, etc.

. unsupervised:
    given: many examples (x_i)_i
    desired outcome: clusters, or a relevant metric, or a relevant compact representation,
    or an estimate of the distribution law (density), or generating new samples according to that law...

2. Examples of tasks / datasets / etc.
--------------------------------------

Supervised setting:
. Give many examples (x_i, y_i), ask for generalization: new x --> ?

Classification: [show slides]

- image: landscape or town? dog or cat?
    . image = set of RGB pixel values: no notion of object, no prior on typical images
      (no world ever seen before)
      image = just a vector in [| 1, 256 |] ^ #pixels
    . loss:
        . if f outputs a label: number of mistakes: sum_i delta_{predicted label f(x_i) =/= y_i}
        . if f outputs a probability distribution over labels: sum_i - log p(correct label)(x_i)
          =: cross_entropy(true probability distribution || predicted one)
    . how to do such a task?
        --> dominant colors (green vs. that road shade of grey), dominant gradients
            (vertical vs. horizontal), texture... (difficult to describe)
        --> 101 dataset --> 100% solved --> how?! --> Mercedes symbol
        --> relevant feature/information extracted = not necessarily what you expect

- ImageNet:
    . historical benchmark with 1 million pictures, 1000 classes
      --> find the class (could have been: the classes, with ranking)
    . obtained by cheap labor (Amazon Mechanical Turk)
    . allowed example-intensive techniques (rise of NN)

- other examples: fish

Prediction, more generally:
- image segmentation: image --> one label per pixel
- image registration: (image 1, image 2) --> deformation that aligns them
  = vector field (one vector per pixel)
- object detection: image --> bounding box around object of interest
- text translation: text in language A --> text in language B
- speech processing: audio --> words (characters)

How to build a dataset?
. not that obvious
. no bias (or else, a known and documented one)
  --> hidden biases: Mercedes logo, ... medical datasets
. correctly sampled (frequency of labels = a lot of information)
. importance of benchmarks: the only way to compare approaches/algorithms

How to set a criterion?
. usually hand-designed, but recommended: MDL (Minimum Description Length; penalizes model complexity)
. inspired by Kolmogorov complexity
. in practice:
      cost of describing the model
    + cost of describing its parameter values    [~ number of parameters * log |dataset| / 2]
    + cost of reproducing the dataset with that model    [= - log p(dataset | model with given parameters)]
. not always obvious (e.g., problematic for text translation)

3. Parameterized families
--------------------------

Typical ML methods:
- k-nearest neighbors
  [for a given metric, find the closest already-seen examples x_i to our new x, average their associated y_i]
- kernel methods (Support Vector Machines, etc.)
  [same but with a weighted average, relying on a similarity notion properly defined with a kernel]
- boosting, ensemble methods... [mixing many weak approaches]
- random forests [set of decision trees, picked stochastically]
- ...
- neural networks

Parameterized families: the case of supervised learning:
- f_theta : x --> y
- loss (performance criterion): L(theta) = sum_i ( f_theta(x_i) - y_i )^2  (+ possibly a regularizer)
- optimization problem: find inf_theta L(theta), or any "good enough" theta
- if no particular trick applies: gradient descent:
      theta_{t+1} = theta_t - a * dL/dtheta    (a: step size)
- under weak conditions, converges to a local optimum theta_*
- use f_{theta_*} to predict: f_{theta_*}(new x)

. family expressivity vs. difficulty of the optimization problem (more parameters ==> harder)
. ML = parameter optimization

Family choice = important. Show toy dataset of points with a binary classification task.

Examples:
- linear (affine): f(x) = M x + b
  theta = (coefficients of matrix M and vector b)
- polynomials: f(x) = sum_n a_n x^n
  theta = (coefficients a_n)
- Gaussian mixture: f(x) = sum_k a_k N(mu_k, sigma_k)(x)
  theta = (locations mu_k, standard deviations sigma_k, coefficients a_k)
- kernels: f(x) = sum_i a_i k(x_i, x)
  theta = (coefficients a_i)    [x_i = samples]
- linear on predefined features: f(x) = M Phi(x) + b
  theta = (M, b)
  --> classic literature = just different hand-made choices of Phi
  --> would Phi itself be optimizable?

Neural networks, "deep learning" (as opposed to linear/kernel methods, supposedly "shallow", while actually not necessarily deep...):
. computational graph of neurons, with weights on edges
  --> a neuron = a simple function: f(inputs) = sigma( sum_i w_i a_i )
. parameters = (w_i)
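The two classification losses defined in section 2 (number of mistakes, and cross-entropy when f outputs a probability distribution) can be sketched as follows; the dog/cat labels and probability values are made-up toy data, not from the lecture.

```python
import math

def misclassification_count(predicted_labels, true_labels):
    """Number of mistakes: sum_i delta_{f(x_i) =/= y_i}."""
    return sum(1 for f_xi, y_i in zip(predicted_labels, true_labels) if f_xi != y_i)

def cross_entropy(predicted_probas, true_labels):
    """sum_i - log p(correct label)(x_i), when f outputs a distribution over labels."""
    return sum(-math.log(p_i[y_i]) for p_i, y_i in zip(predicted_probas, true_labels))

# Toy data: 3 images, labels "dog"/"cat" (illustrative values).
y_true = ["dog", "cat", "dog"]
y_pred = ["dog", "dog", "dog"]
print(misclassification_count(y_pred, y_true))  # 1 mistake

p_pred = [{"dog": 0.9, "cat": 0.1},
          {"dog": 0.4, "cat": 0.6},
          {"dog": 0.7, "cat": 0.3}]
print(round(cross_entropy(p_pred, y_true), 3))  # 0.973
```

Note that a confident wrong prediction (p near 0 on the correct label) blows up the cross-entropy, while the mistake count only increases by 1; this is why cross-entropy is the loss of choice when f outputs probabilities.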
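The "in practice" MDL criterion from section 2 (parameter cost ~ number of parameters * log|dataset|/2, plus data cost = - log p(dataset|model)) can be turned into a small worked computation. The Gaussian noise model, the toy data, and the two candidate models below are illustrative assumptions, not from the lecture.

```python
import math

def mdl_score(num_params, neg_log_likelihood, n):
    """MDL criterion: parameter cost (~ k * log(n) / 2) + data cost (- log p(data|model))."""
    return num_params * math.log(n) / 2 + neg_log_likelihood

def gaussian_nll(residuals, sigma=1.0):
    """- log p(data | model), under an *assumed* Gaussian noise model with fixed sigma."""
    n = len(residuals)
    return 0.5 * n * math.log(2 * math.pi * sigma**2) + sum(r * r for r in residuals) / (2 * sigma**2)

# Toy dataset: roughly y = 2x, with small noise (made-up values).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 2.0, 3.9, 6.1, 8.0]

# Model A: constant prediction y = mean(ys)   (1 parameter)
mean_y = sum(ys) / len(ys)
res_a = [y - mean_y for y in ys]

# Model B: line y = 2x   (counted here as 2 parameters: slope and intercept)
res_b = [y - 2 * x for x, y in zip(xs, ys)]

score_a = mdl_score(1, gaussian_nll(res_a), len(ys))
score_b = mdl_score(2, gaussian_nll(res_b), len(ys))
print(score_a > score_b)  # True: the line wins despite its extra parameter
```

The line pays one extra parameter (log(5)/2 nats) but reproduces the dataset far more cheaply, so its total description length is lower; this is exactly the complexity/fit trade-off MDL formalizes.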
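The gradient-descent recipe from section 3 (theta_{t+1} = theta_t - a * dL/dtheta, with L(theta) = sum_i (f_theta(x_i) - y_i)^2) can be sketched on the simplest family, a 1-D affine model f_theta(x) = m*x + b; the data and step size are made up for illustration.

```python
def loss_grad(m, b, xs, ys):
    """Gradient of L(theta) = sum_i (m*x_i + b - y_i)^2 with respect to theta = (m, b)."""
    dm = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys))
    db = sum(2 * (m * x + b - y) for x, y in zip(xs, ys))
    return dm, db

# Toy dataset lying exactly on y = 2x + 1 (made-up values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

m, b = 0.0, 0.0          # theta_0: arbitrary initialization
a = 0.01                 # step size
for _ in range(5000):    # theta_{t+1} = theta_t - a * dL/dtheta
    dm, db = loss_grad(m, b, xs, ys)
    m, b = m - a * dm, b - a * db

print(round(m, 3), round(b, 3))  # converges near (2.0, 1.0)
```

Here the loss is convex, so the "local optimum" the notes mention is also global; with a richer family (e.g., a neural network) the same update rule applies but only local convergence is guaranteed, and the step size a must still be small enough not to diverge.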