00001 00002 This is not yet another gp system (nyagp). For one, it is not general. 00003 It does one thing, find mathematical functions, and tries to do that well. 00004 00005 So, if you're trying to steer ants on various New Mexico trails, or build your 00006 own tiny block world, you're in the wrong place. However, if you're interested 00007 in finding mathematical functions either through direct application on data or 00008 running it through a simulator, you might find what you're looking for here. 00009 00010 === Representation (sym/ + gen/) ======== 00011 00012 Mathsym has a few interesting characteristics. First and foremost is the 00013 basic representation. It uses trees, but these trees are stored in a 00014 reference counted hashtable. This means that every distinct subtree that is alive 00015 is stored once and only once. 00016 The reference counting mechanism takes care of memory management. 00017 00018 The idea of using a hashtable (for offline analysis) comes from Walter Tackett, in his 00019 1994 dissertation. The current system is just a real-time implementation of this 00020 idea, adding the reference counting for ease of use. 00021 00022 The hashtable brings overhead. It's still pretty fast, but a string based representation 00023 would run circles around it. However, by virtue of it storing every subtree only once, it 00024 is fairly tight on memory. This helps tremendously when confronted with excessively growing populations, bloat. 00025 The hashtable implementation can not stop bloat, but does make it more manageable. In a typical 00026 GP run, the number of distinct subtrees is only 10-20% of the total number of subtrees. 00027 00028 Other advantages of the hashtable are in the ability to examine the run more thoroughly. It is easy 00029 to check how many subtrees are present in the system, and for each subtree you can check the reference 00030 count. 00031 00032 The basic tree is called a Sym. A Sym is simply a tree, and has children, accessible through args(). 00033 A Sym simply contains an iterator (== decorated pointer) to an entry in the hashtable. 00034 Every time you create a Sym, it is either looked up in the hashtable or added to the hashtable. 00035 A Sym has several members: size, depth, args, etc. One interesting member is the refcount(). 00036 This returns the reference count of the Sym in the hashtable, and thus returns the number 00037 of distinct contexts in which the Sym is used. 00038 00039 Another nice thing of these hashtable Syms is that a check for equality reduces to a pointer comparison. 00040 00041 The Sym nodes are identified by a simple token, of type token_t (usually an unsigned int). It 00042 is completely generic and could conceivably be adapted to steer ants. The rest of the library 00043 is however targeted at mathematical functions purely. 00044 00045 sym/Sym.h is the file to look into for the functionality provided by Sym. The sym/ directory 00046 is where the source files are stored that are relevant for the generic Sym functionality. The 00047 'gen/' directory contains some generic functionality to build and traverse trees, independent of 00048 the function and terminal set. 00049 00050 The file sym/README.cpp documents the use of the sym library for general GP use. 00051 00052 === Function Set (fun/) === 00053 00054 The standard GP function set of binary functions: addition, multiplication, subtraction and 00055 division is NOT supported. 00056 00057 What is however supported are the functions of: 00058 00059 summation: arbitrary arity, arity zero meaning 0.0. Arity 2 is standard addition 00060 product: arbitrary arity, arity zero meaning 1.0. Arity 2 is standard multiplication 00061 inversion: 1.0 / x. Only arity 1 00062 unary minus: -x. Only arity 1 00063 00064 Plus a whole bunch of other functions (see "fun/FunDef.h") 00065 00066 The reason for this is the observation (actually from a friend of mine, thanks Luuk), 00067 that this set of functions is complete and slightly more orthogonal than a binary set. 00068 00069 The directory 'fun' contains the functionality for the function and terminal set, together 00070 with ERC's etc. fun/FunDef.cpp contains the definition of the functionality. Stuff can be 00071 added here, but best to contact me if you miss particular functions. 00072 00073 With the sym and the function set in place, some fairly nice overloading is possible. A quick tour: 00074 00075 To create a variable that reads the first value from the inputs, do: 00076 00077 Sym var = SymVar(0); 00078 00079 To create a constant of value 0.4432, do 00080 00081 Sym cnst = SymConst(0.4432); 00082 00083 The constants are also stored uniquely so that: 00084 00085 Sym cnst2 = SymConst(0.4432) 00086 00087 will lead to: 00088 00089 cnst == cnst2 00090 00091 to be true (this happens without value compare, they point to the same element in the hashtable) 00092 00093 To add two values, do 00094 00095 Sym sym = var + const; 00096 00097 This will create a tree with three nodes. Other operators are overloaded similarily. 00098 00099 === Evaluation (eval/) === 00100 00101 The second important thing is evaluation. Although Syms can be evaluated through an interpreter, 00102 this is not the fastest way to go about with it. The standard way of evaluating a Sym is to 00103 first *compile* it to a function, and then run it in your favourite environment. Compilation 00104 is done through the use of the excellent tinycc compiler, which is blazingly fast and produces 00105 pretty good functions. 00106 00107 Compilation comes in several flavours: compile a single function and retrieve a pointer to a function 00108 of signature: 00109 00110 double func(const double* x); 00111 00112 where x is the input array. Another option is to compile a bunch of functions in one go, and retrieve an array 00113 of such function pointers. The Syms are simply printed and compiled. An example: 00114 00115 double func(const double* x) { return x*x + x * 1./x; } 00116 00117 The batch version proceeds significantly more quickly than calling compile every time. The function pointers 00118 can be given to a simulation for extremely quick evaluation. 00119 00120 A third option is to compile a complete population in one go, and return a single pointer of signature 00121 00122 void func(const double* x, double* y); 00123 00124 Where 'y' is the (preallocated) output array. This allows to evaluate a complete population in one function 00125 call, storing the results in 'y'. It uses the hashtable to store every calculation only once. An example 00126 for the two function x*x + x*1./x and x + sin(x*x) is: 00127 00128 void func(const double* x, double* y) { 00129 double a0 = x; 00130 double a1 = a0 * a0; 00131 double a2 = 1.0; 00132 double a3 = a2 / a0; 00133 double a4 = a2 * a3; 00134 y[0] = a4; 00135 double a5 = sin(a1); 00136 double a6 = a0 + a5; 00137 y[1] = a6; 00138 } 00139 00140 This is the fastest way to evaluate even humongous populations quickly. You might be surprised at 00141 the amount of code re-use in a GP population. 00142 00143 The three compilation functions can be found in eval/sym_compile.h 00144 00145 A limiting factor in tinycc is that the struct TCCState that is used to hold the compilation context, 00146 is not really self-contained. This unfortunately means that with every call to 'compile' ALL previous 00147 pointers that have been produced become unsafe for use. I'm still looking at ways to circumvent this. 00148 00149 To work with mathsym, a few small changes in tccelf.c were necessary, check README.TCC for details. 00150 00151 === Interval Arithmetic (eval/) === 00152 00153 GP is pretty good at finding mathematical expressions that are numerically unsound. Take for instance 00154 the function '1 / x'. This is well defined only when x is strictly positive, but will lead to problems 00155 when x equals 0. The standard answer is to define some pseudo-arithmetical function called 'protected 00156 division' that will return some value (usually 1) when a division by zero occurs. This leads to a number 00157 of protected functions (sqrt, log, tan, etc.) which all need to be protected. Interpreting results from 00158 GP using such functions is in general hard. 00159 00160 Interval arithmetic (through another excellent library boost/numeric/interval) is used to calculate 00161 if particular functions can conceivably produce problems. This completely annihilates the use for Koza-style 00162 protected operators and is a more safe and sound method. For interval arithmetic to function, the bounds 00163 on the input variables need to be known. As for every function we can calculate a guarenteed, 00164 though not necessarily tight, output interval given the input intervals, we can check arbitrary functions 00165 for possible problems. If, for example for division, the input interval contains 0, we know that a division 00166 by zero is theoretically possible. It's then best to throw away the entire function. 00167 00168 Interval Arithmetic is accessible through the class IntervalBoundsCheck (eval/BoundsCheck.h) 00169 00170 === More generic support (gen/) === 00171 00172 The gen subdirectory contains some general utility classes for defining function sets and for 00173 creating trees. The idea is that these functions are generic and only append on the sym/ part 00174 of the library. Unfortunately, the language table currently needs an ERC function, a default 00175 implementation is hidden inside fun/FunDef.cpp. Will fix at some point. 00176 00177 gen/LanguageTable.cpp -> defines the functions/terminals that can be used 00178 gen/TreeBuilder.cpp -> can create trees based on a LanguageTable 00179 00180 === Data and Errors (regression/) === 00181 00182 The above classes are generic and apply for any type of problem where a mathematical function can be 00183 used to steer some process, run a simulation, whatever. First check the intervals, then compile the 00184 Sym(s) to a (set of) function pointer(s), and use the pointers in some way to evaluate for fitness. 00185 One particular type of problem for which support is built in is 'symbolic regression'. This type of 00186 problem involves finding an mathematical input/output relationship based on some data. 00187 00188 To enable this, regression/ introduces the class Dataset to contain the data and ErrorMeasure to calculate 00189 error. Currently supported: mean squared error, mean absolute error and mean squared error scaled (proportional 00190 to correlation squared). They use some helper classes such as Scaling and TargetInfo. 00191 00192 === EO interface (eo_interface/) === 00193 00194 Contains the classes to make it all work with EO. Check the root application 'symreg' for ways to use this 00195 00196 00197 00198