Similar to quasi-Newton methods (but not inspired by them), the CMA-ES is a second order approach estimating a positive definite matrix within an iterative procedure (more precisely: a covariance matrix, that is, on convex-quadratic functions, closely related to the inverse Hessian). This makes the method feasible on non-separable and/or badly conditioned problems. In contrast to quasi-Newton methods, the CMA-ES does not use or approximate gradients and does not even presume or require their existence. This makes the method feasible on non-smooth and even non-continuous problems, as well as on multimodal and/or noisy problems. It turns out to be a particularly reliable and highly competitive evolutionary algorithm for local optimization (Hansen & Ostermeier 2001) and, surprising at first sight, also for global optimization (Hansen & Kern 2004, CEC 2005, Hansen 2009 in BBOB-2009).
The CMA-ES has several invariance properties. Two of them, inherited from the plain evolution strategy, are (i) invariance to order preserving (i.e. strictly monotonic) transformations of the objective function value (that is, e.g., ||x||2 and 3||x||0.2-100 are equivalent objective functions with identical performance of CMA-ES), and (ii) invariance to angle preserving (rigid) transformations of the search space (including rotation, reflection, and translation), if the initial search point is transformed accordingly. Invariances are highly desirable: they imply uniform behavior on classes of functions and therefore imply the generalization of empirical results.
The CMA-ES does not require a tedious parameter tuning for its application. In fact, the choice of strategy internal parameters is not left to the user (arguably with the exception of population size λ). Finding good (default) strategy parameters is considered as part of the algorithm design, and not part of its application—the aim is to have a well-performing algorithm as is. The default population size λ is comparatively small to allow for fast convergence. Restarts with increasing population size (Auger & Hansen 2005) improve the global search performance. For the application of the CMA-ES, an initial solution, an initial standard deviation (step-size, variables should be defined such that the same standard deviations can be reasonably applied to all variables, see also here) and, possibly, the termination criteria (e.g. a function tolerance) need to be set by the user. The most common applications are model calibration (e.g. curve fitting) and shape optimisation.
The generating set adaptation (GSA) is the first evolution strategy to our knowledge that learns a problem scaling and is invariant to coordinate system rotations. The GSA implements the principle idea of CMA, to increase the likelihood of selected steps, without using the formalism of a covariance matrix. Global step-size control is done by mutative self-adaptation. The paper also provides an algorithm that learns individual step-sizes and one direction (AII). AII has a linear number of strategy parameters and hence learns a less rich model quicker.
The first CMA paper, where the covariance matrix adaptation is introduced into the (1,λ)-ES (μ=1). The paper emphasizes on the evolution path and the differences to the generating set adaptation (Hansen et al 1995). Using the formalism of a covariance matrix has three important consequences. First, path length control (cumulative step-size adaptation, CSA) can be used to control the global step-size. This will become particularly relevant when μ is chosen to be greater than 1. Second, the eigensystem, determined by the covariance matrix, can give insights into the underlying problem structure. De facto, it turns out to be particularly useful to monitor the time evolution of eigenvalues of the covariance matrix. Third, the computational expense to generate a new individual reduces from O(n3) to O(n2). Because the eigendecomposition can be postponed for Ω(n) iterations, the overall internal expenses become O(n2).
Introduction of the "multi-membered" (μ/μ,λ)-CMA-ES with intermediate recombination of all μ parental solutions. Investigation of the effect of μ for small λ on unimodal test functions, some including distorted selection. The (3/3,12)-CMA-ES is recommended as the best compromise for being fast and robust.
A comprehensive article, where weighted recombination is introduced in the (μ/μ,λ)-CMA-ES. Objectives and rationales behind the strategy are discussed as well as (default) strategy parameter choices in detail. Comparisons with the so-called correlated mutations and scale-up investigations, up to 320-D, on a considerable number of unimodal test functions are shown. On the multimodal Rastrigin function the myth that (local) adaptation to the topography of the function jeopardizes global search properties is disproved. Local adaptation allows for larger step-sizes and consequently is likely to improve global search performance.
Extension with the so-called rank-μ update of the covariance matrix which scales up the efficiency of the CMA to large population sizes. The rank-μ update reduces the time complexity, i.e. the number of generations to adapt the complete matrix roughly from 10n2 to 20n. The other way around, the learning speed, with respect to the number of objective function evaluations, depends much less on the population size if the rank-μ update is used. Reasonable population sizes range between 5 and 5n. For larger population sizes a deficiency of the step-size adaptation is discovered (that has been solved in Hansen and Kern (2004)). Simulations on a number of unimodal test functions and scale-up investigations are presented.
In this paper 1) the weighted rank-μ update of the covariance matrix is introduced. 2) the step-size adaptation problem discovered in Hansen et al (2003) is addressed by a remarkably improved setting of the step-size control parameters: for large population sizes the backward time horizon of the evolution path is diminished by increasing the parameter cσ and the step-size damping dσ is increased. 3) the influence of the population size on the performance is investigated on a number of multimodal functions, revealing that an increased population size can dramatically improve the global search performance. Comparisons to other optimization algorithms reveal that on additively decomposable (and therefore separable) functions the CMA-ES can be vastly outperformed, e.g., by differential evolution, while it reveals superior performance on non-separable functions.
The remaining strategy parameter population size, λ, that is most important for the global search performance, is removed by implementing a restart mechanism. Before each restart the population size is increased by a factor of two. Otherwise restarts are conducted independently. Tests on a benchmark function set of 25 functions provided for the session on real-parameter optimization reveal highly competitive performance on local and global optimization problems, compared to the other submitted algorithms.
The CMA is approached and analyzed from the view point of estimating a search distribution. Important differences to the so-called estimation of distribution algorithms (EDAs) are revealed. 1) in CMA the distribution of the selected steps is estimated, while in EDAs the selected points are modeled. This leads invariably to larger variances in CMA. In particular, a "pure EDA" converges prematurely on a linear function whereas a "pure CMA" (without step-size control) does not. 2) in EDAs the estimation is usually done from scratch, while in CMA the distribution is updated (at least for population size λ < 4n2). Therefore in CMA the estimation can be done reliably independent of the population size and, for a small population size, in a lesser number of function evaluations. 3) in CMA an additional global step-size adaptation is conducted, based on a different adaptation principle (cumulative step-size adaptation, or path length control) that often leads to a considerably larger population spread and prevents premature convergence even more effectively. Major concepts and design principles, namely change rates, invariance, and stationarity are discussed.
Introduces an additional negative rank-μ update of the covariance matrix using the μ worst of the λ individuals. All vectors for both, the original and the negative, rank-μ updates have the same weight. For population sizes up to λ=4n a significant speed-up compared to the original equally weighted rank-μ update can be observed. On the discus (or tablet) function the speed-up exceeds a factor of two for large dimensions and small population sizes. While positive definiteness of the covariance matrix cannot be guaranteed anymore, empirically the covariance matrix remains always positive definite. For a larger population size presumably the learning parameter β needs to be modified.
This paper introduces 1) the CMA into the (1+1)-ES using an improved variant of a success rule based step-size adaptation in place of the path length control, and 2) an incremental O(n2) Cholesky update of the covariance matrix replacing the original O(n2) covariance matrix update altogether with the iterative O(n3) eigendecomposition of the covariance matrix (the latter is usually postponed until after n/10 generations). The resulting algorithm is an elegant CMA variant and accomplishes what is expected. Nevertheless, its practical relevance is limited. The (1+1)-ES is slightly faster than its ","-counterpart, but more prone to converge to local optima. Applying the Cholesky update prohibits to utilize the evolution path for the covariance matrix update. This is a significant disadvantage as the evolution path considerably improves the performance on many functions and was addressed in Suttorp et.al. (2009).