Chapter 7
Extending the Kernel
    7.1 Tree Representation
    7.2 Selection Methods
    7.3 Operators
    7.4 Miscellany
        7.4.1 Tree Generation Spaces
        7.4.2 Saved Individuals
        7.4.3 Ephemeral Random Constants
Internally, lil-gp is fairly simple. I have attempted to keep
the structure reasonably clean and modular, without going too
overboard about avoiding global variables and such. 24 C files
comprise the kernel:
main.c Initialization and cleanup.
gp.c The main evaluate-and-breed cycle, and population statistics calculation.
eval.c The tree evaluator.
tree.c Utility routines dealing with trees--counting nodes and depth, printing, generating random trees, finding subtrees, copying trees.
change.c Breeding of the new population each generation.
crossovr.c The crossover operator.
reproduc.c The reproduction operator.
mutate.c The mutation operator.
select.c Utility routines for selection methods.
tournmnt.c The tournament selection method.
bstworst.c The best, worst, and random selection methods.
fitness.c The fitness, fitness_overselect, and inverse_fitness selection methods.
genspace.c Utility routines for allocating space to grow new trees in.
exch.c The subpopulation exchange system.
populate.c Utility routines for population--copying, freeing, random generation.
ephem.c Utility routines for ephemeral random constants.
ckpoint.c Reading and writing checkpoint files.
event.c System-dependent module for tracking execution time.
pretty.c The tree pretty-printer.
individ.c Utility routines for whole individuals--printing and calculating size.
params.c The parameter database.
random.c The portable pseudorandom number generator, adapted from Numerical Recipes in FORTRAN.
memory.c The implementations of MALLOC(), FREE(), and REALLOC() that track memory usage.
output.c The output subsystem (oprintf() and oputs(), among others).
7.1 Tree Representation
A tree is stored as an array of type lnode. An lnode is a union which can hold a pointer to a function structure, a pointer to an ephemeral random constant (ERC) structure, or an integer. The tree is stored in prefix order. The first lnode is always a pointer to a function. If the function is an ERC terminal, then the next lnode in the array holds the pointer to the ERC structure. If a function is of type EXPR (the user code controls evaluation of the function's arguments), then there is an extra lnode just before the start of each child; it contains an integer, the number of lnodes used in representing the child subtree. The evaluation code uses this value to skip the child during evaluation.
Consider this expression in the symbolic regression problem:
(+ X (iflte X (* .34 X) .56))
This would be represented in lil-gp as the array:
+     | pointer to structure for function +              |
X     | pointer to structure for function X              |
iflte | pointer to structure for function iflte          |
1     | first argument to iflte takes 1 lnode to store   |
X     | pointer to structure for function X              |
4     | second argument to iflte takes 4 lnodes to store |
*     | pointer to structure for function *              |
R     | pointer to structure for ERC function            |
.34   | pointer to structure containing ERC value        |
X     | pointer to structure for function X              |
2     | third argument to iflte takes 2 lnodes to store  |
R     | pointer to structure for ERC function            |
.56   | pointer to structure containing ERC value        |
This representation, while it may seem cumbersome, has two major
advantages over a more traditional C representation (with individual
structs for each node, linked by pointers). First, it uses much
less memory--approximately 1-2 words per node versus the 4-5 it
would take otherwise. This is because the structure of the tree
is represented implicitly in the ordering of the nodes rather
than explicitly via pointers. Second, it results in much faster
tree evaluation. For instance, consider the crossover operator.
In the traditional representation crossover is performed by just
swapping two pointers. While this is very fast and easy, over
time it means that the nodes of a given tree become spread out
over the process's address space. On a system with virtual memory,
this slows evaluation (or any traversal of the tree) to a crawl
as the tree is spread across dozens of pages which must constantly
be swapped in and out. lil-gp's representation complicates crossover
somewhat, but leaves each offspring tree as a single contiguous
block of memory, able to fit on just one or two pages.
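The prefix layout and the skip counts can be illustrated with a short sketch. The type declarations below are simplified stand-ins written for this example, not lil-gp's actual definitions, and the names subtree_len and build_example are hypothetical. The walker computes how many lnodes a subtree occupies: ordinary functions are measured by recursing into each child, while children of an EXPR-type function are jumped over using the stored count.

```c
#include <assert.h>

/* Simplified, hypothetical stand-ins for the kernel's structures. */
typedef struct {
    const char *name;
    int arity;    /* number of children; 0 for terminals */
    int is_expr;  /* nonzero: each child is preceded by a skip-count lnode */
    int is_erc;   /* nonzero: followed by one lnode pointing at the ERC value */
} function;

typedef struct { double value; } erc;

typedef union {
    const function *f;  /* function or terminal */
    const erc *d;       /* ERC value, stored right after an ERC terminal */
    int s;              /* number of lnodes in the child that follows */
} lnode;

/* Number of lnodes occupied by the subtree that starts at t[0]. */
static int subtree_len(const lnode *t)
{
    const function *f = t[0].f;
    int len = 1;
    int i;

    if (f->is_erc)
        return 2;                     /* the terminal plus its value pointer */
    for (i = 0; i < f->arity; ++i) {
        if (f->is_expr) {
            int skip = t[len].s;      /* stored size of this child */
            assert(skip > 0);
            len += 1 + skip;          /* jump over the count and the child */
        } else {
            len += subtree_len(t + len);
        }
    }
    return len;
}

static const function F_ADD   = { "+",     2, 0, 0 };
static const function F_MUL   = { "*",     2, 0, 0 };
static const function F_IFLTE = { "iflte", 3, 1, 0 };
static const function F_X     = { "X",     0, 0, 0 };
static const function F_ERC   = { "R",     0, 0, 1 };
static const erc C_034 = { 0.34 };
static const erc C_056 = { 0.56 };

/* Fill t with (+ X (iflte X (* .34 X) .56)); returns the lnode count. */
static int build_example(lnode *t)
{
    int n = 0;
    t[n++].f = &F_ADD;
    t[n++].f = &F_X;
    t[n++].f = &F_IFLTE;
    t[n++].s = 1;  t[n++].f = &F_X;
    t[n++].s = 4;  t[n++].f = &F_MUL;
                   t[n++].f = &F_ERC;  t[n++].d = &C_034;
                   t[n++].f = &F_X;
    t[n++].s = 2;  t[n++].f = &F_ERC;  t[n++].d = &C_056;
    return n;
}
```

Because the whole tree lives in one array, copying a subtree amounts to copying subtree_len(t) consecutive lnodes.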
7.2 Selection Methods
A selection method is implemented with two functions: one to perform initialization and cleanup, and another to do the actual selection. The first function, when called for initialization, creates and returns a data structure called a selection context. This contains any state information needed for the selection method. It should also store a pointer to the population that the selection is being done on. This structure will be passed to the second function, which should return an index of an individual within the population.
To create a new selection method, it is suggested that you copy and modify an existing one (the random method in bstworst.c is an especially simple one). If your method of selection can be expressed as randomly selecting an individual from a set where each individual has a fixed probability of selection, then there is already an efficient implementation of the second function. See the code for the fitness, fitness_overselect, and inverse_fitness methods for examples.
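One common way to select from a set with fixed per-individual probabilities is to precompute cumulative sums once and then binary-search them for each draw. The sketch below shows that generic technique; it is not taken from the kernel source and may differ from what the fitness methods actually do. The function names are hypothetical, and the uniform random number u is passed in by the caller so the routine stays deterministic.

```c
#include <assert.h>

/* Fill cum[i] with p[0] + ... + p[i]. */
void build_cumulative(const double *p, double *cum, int n)
{
    double total = 0.0;
    int i;
    for (i = 0; i < n; ++i) {
        total += p[i];
        cum[i] = total;
    }
}

/* Return the smallest index i with u < cum[i].
 * The caller supplies u drawn uniformly from [0, cum[n-1]). */
int select_by_cumulative(const double *cum, int n, double u)
{
    int lo = 0;
    int hi = n - 1;

    assert(n > 0);
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (u < cum[mid])
            hi = mid;        /* answer is at mid or to its left */
        else
            lo = mid + 1;    /* answer is strictly to the right */
    }
    return lo;
}
```

The cumulative table would be built in the initialization function and stored in the selection context, so each selection costs O(log n) rather than a linear scan.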
Finally, you must add a record describing your selection method
to the array select_method_table at the top of select.c.
This lists the names and initialization functions of each selection
method available.
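The two-function pattern described above might look roughly like the following. This is a minimal sketch under stated assumptions: the population and sel_context types, the SELECT_INIT and SELECT_CLEAN codes, and the function signatures are all invented for illustration and should be checked against select.c and bstworst.c before use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real structures carry more fields. */
typedef struct { int size; } population;

typedef struct sel_context {
    population *p;                                /* population selected from */
    int (*select_method)(struct sel_context *);   /* the second function */
    void *data;                                   /* method-specific state */
} sel_context;

enum { SELECT_INIT, SELECT_CLEAN };

/* Second function: return the index of one individual.
 * rand() % size has a slight modulo bias, which is fine for a demo. */
static int select_uniform(sel_context *sc)
{
    return rand() % sc->p->size;
}

/* First function: builds the context on SELECT_INIT, frees it on
 * SELECT_CLEAN. Returns the new context, or NULL after cleanup. */
sel_context *select_uniform_context(int op, sel_context *sc, population *p)
{
    switch (op) {
    case SELECT_INIT:
        sc = malloc(sizeof *sc);
        if (sc == NULL)
            return NULL;
        sc->p = p;
        sc->select_method = select_uniform;
        sc->data = NULL;          /* uniform selection needs no extra state */
        return sc;
    case SELECT_CLEAN:
        free(sc);
        return NULL;
    }
    return NULL;
}
```

A caller would request a context once, call sc->select_method(sc) for every individual it needs, and then ask the same function to clean up.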
7.3 Operators
The breeding of the population is controlled by a table of breeding phases. This is built from the parameter database at the start of the run (and whenever rebuild_breeding_table() is called from user code). Each subpopulation has its own breeding table. For each phase, there is a record in the table. Each record has pointers to the four methods for the operator, the rate for that phase, and a pointer to an operator-specific structure.
To implement a new operator, it is suggested that you copy and
modify the code of an existing operator. The reproduction operator
is the simplest of the three included. Suppose that your new operator
is called "foo". You would need to provide five functions
(the following naming scheme is strongly recommended):
operator_foo_init Parses the operator's options string
and builds the operator table entry. The kernel
functions parse_o_rama() and free_o_rama() are available
to parse the options string just like the built-in operators do.
operator_foo_free Frees the operator-specific part of the
operator table entry.
operator_foo_start This is called at the start of breeding.
Selection methods should be initialized here.
operator_foo_end This is called at the end of breeding.
Selection contexts should be freed here.
operator_foo This performs the actual operation.
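The relationship between the five functions and a breeding-table record can be sketched as follows. Everything here is an assumption made for illustration: the breed_phase layout, the field names, and the function signatures are hypothetical and much simpler than the kernel's, and the options string is ignored rather than parsed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical breeding-table record: four method pointers, the rate,
 * and a pointer to operator-specific data. */
typedef struct breed_phase {
    double rate;
    void *data;
    void (*operator_free)(void *data);
    void (*operator_start)(void *data);
    void (*operator_end)(void *data);
    void (*operator_operate)(void *data);
} breed_phase;

/* Operator-specific state for the imaginary "foo" operator. */
typedef struct {
    int active;   /* nonzero between start and end of breeding */
    long calls;   /* how many times the operator has run */
} foo_data;

void operator_foo_free(void *data)  { free(data); }
void operator_foo_start(void *data) { ((foo_data *)data)->active = 1; }
void operator_foo_end(void *data)   { ((foo_data *)data)->active = 0; }

void operator_foo(void *data)
{
    foo_data *fd = data;
    assert(fd->active);   /* must only run between start and end */
    fd->calls++;          /* a real operator would breed an individual here */
}

/* Builds the table entry. Returns 0 on success, -1 on allocation failure. */
int operator_foo_init(const char *options, breed_phase *bp)
{
    foo_data *fd = malloc(sizeof *fd);
    (void)options;        /* a real operator would parse its options here */
    if (fd == NULL)
        return -1;
    fd->active = 0;
    fd->calls = 0;
    bp->rate = 0.0;       /* assumed to be filled in by the caller */
    bp->data = fd;
    bp->operator_free = operator_foo_free;
    bp->operator_start = operator_foo_start;
    bp->operator_end = operator_foo_end;
    bp->operator_operate = operator_foo;
    return 0;
}
```

Only the init function needs to be visible outside the operator's source file in this arrangement; the other four are reached through the pointers it stores.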
You then add a record to the array operator_table at the top of change.c, listing the name of the operator (a string) and the operator's initialization function (a function pointer). For example, after adding the foo operator the table would look like:
operator operator_table[] =
{
    { "crossover",    operator_crossover_init },
    { "reproduction", operator_reproduce_init },
    { "mutation",     operator_mutate_init },
    { "foo",          operator_foo_init },
    { NULL,           NULL }
};
The next field of the new population structure gives the index at which the operator should place the new individual. After adding an individual, increment the next field. If your operator produces multiple offspring with a single call then you must make sure that you don't overfill the population (the next field should not exceed the size field). The crossover operator, for instance, normally produces two offspring on each call. If it is called when there is only one more space in the population, it fills the space and throws the other offspring away.
Your operator should always add at least one individual to the
new population per call. If it does not, infinite loops may occur
when the probabilistic_operators parameter is off.
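The bookkeeping around the next and size fields can be sketched as below. The individual and population types are reduced to the bare minimum for the example, and place_offspring is a hypothetical helper, not a kernel routine.

```c
#include <assert.h>

typedef struct { int genome; } individual;   /* hypothetical stand-in */

typedef struct {
    individual *ind;   /* storage for the new population */
    int size;          /* total number of slots */
    int next;          /* index where the next offspring goes */
} population;

/* Copy up to count offspring into the new population, advancing next
 * after each one. Offspring that do not fit are discarded.
 * Returns the number actually placed. */
int place_offspring(population *newpop, const individual *kids, int count)
{
    int placed = 0;
    while (placed < count && newpop->next < newpop->size) {
        newpop->ind[newpop->next] = kids[placed];
        newpop->next++;
        placed++;
    }
    return placed;
}
```

With one free slot and two offspring, the helper places the first and drops the second, which matches the crossover behavior described above. As long as the operator is only called while next is less than size, at least one individual is added per call.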
7.4 Miscellany
This section gives a general overview of how some of the nonobvious
parts of lil-gp work and how they fit together.
7.4.1 Tree Generation Spaces
When building a new tree, lil-gp needs a contiguous block of memory to put the tree in, but can't allocate the final location of the tree because the size isn't known ahead of time. Therefore, special blocks of memory are allocated to grow the trees in. Every time an lnode is added to a tree-in-progress, the function gensp_next (or gensp_next_int) is called to enlarge the memory block if needed. Once the tree is finished and its size is known, its final location is allocated and the tree is copied from the generation space.
Currently two generation spaces are needed. If you are implementing an operator or something else that requires three or more trees to be grown simultaneously, increase GENSPACE_COUNT in defines.h.
Generation spaces are initially allocated to hold GENSPACE_START
lnodes, and grow in steps of GENSPACE_GROW lnodes. These
constants are in defines.h.
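The grow-then-copy scheme can be sketched with a small growable buffer. The names below (space, space_push_int, space_finish, DEMO_START, DEMO_GROW) are hypothetical and chosen so they are not mistaken for the kernel's own gensp_ routines, whose signatures are not reproduced here. The constants are deliberately tiny so the growth steps are easy to follow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_START 4   /* initial capacity in lnodes (demo value) */
#define DEMO_GROW  4   /* capacity added on each enlargement (demo value) */

typedef union { const void *f; int s; } lnode;   /* simplified stand-in */

typedef struct {
    lnode *data;
    int size;   /* capacity in lnodes */
    int used;   /* lnodes written so far */
} space;

int space_init(space *g)
{
    g->data = malloc(DEMO_START * sizeof *g->data);
    g->size = g->data ? DEMO_START : 0;
    g->used = 0;
    return g->data ? 0 : -1;
}

/* Make room for one more lnode, enlarging the block if it is full. */
static int space_reserve(space *g)
{
    lnode *bigger;
    if (g->used < g->size)
        return 0;
    bigger = realloc(g->data, (size_t)(g->size + DEMO_GROW) * sizeof *bigger);
    if (bigger == NULL)
        return -1;
    g->data = bigger;
    g->size += DEMO_GROW;
    return 0;
}

/* Append an integer lnode to the tree in progress. */
int space_push_int(space *g, int value)
{
    if (space_reserve(g) != 0)
        return -1;
    g->data[g->used++].s = value;
    return 0;
}

/* Copy the finished tree into an exactly sized block and reset the space. */
lnode *space_finish(space *g, int *len)
{
    lnode *final;
    assert(g->used > 0);
    final = malloc((size_t)g->used * sizeof *final);
    if (final == NULL)
        return NULL;
    memcpy(final, g->data, (size_t)g->used * sizeof *final);
    *len = g->used;
    g->used = 0;      /* the space is reused for the next tree */
    return final;
}

void space_destroy(space *g)
{
    free(g->data);
    g->data = NULL;
    g->size = g->used = 0;
}
```

The space keeps its enlarged capacity after space_finish, so later trees of similar size need no further reallocation.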
7.4.2 Saved Individuals
lil-gp tracks the best n individuals of each population, where n is a user-settable parameter. Pointers to these best individuals are passed to the user function app_end_of_evaluation(). To ensure that these pointers are always good, the kernel makes a copy of each individual in the top n and passes the address of the copy (since the original individual may not survive the breeding process).
An individual can be referenced in multiple top-n lists (best-of-gen,
best-of-run, etc.). To avoid making multiple copies, a reference
count is kept for each individual. All these saved individuals
are kept in a linked list, and once per generation a garbage collection
procedure traverses the list and frees any individuals which have
no references to them.
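A reference-counted list with a periodic sweep can be sketched as follows. The saved_ind structure and the two functions are hypothetical; an integer id stands in for the copied individual, and the caller is assumed to adjust refcount as top-n lists gain or drop entries.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct saved_ind {
    int id;                  /* stands in for the copied individual */
    int refcount;            /* number of top-n lists referring to it */
    struct saved_ind *next;
} saved_ind;

/* Add a new saved individual at the head of the list, refcount 0. */
saved_ind *saved_add(saved_ind **head, int id)
{
    saved_ind *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->id = id;
    s->refcount = 0;
    s->next = *head;
    *head = s;
    return s;
}

/* Sweep the list once, freeing every entry with no references.
 * Returns the number of entries freed. */
int saved_collect(saved_ind **head)
{
    int freed = 0;
    saved_ind **link = head;   /* points at the pointer to the current node */

    while (*link != NULL) {
        saved_ind *cur = *link;
        if (cur->refcount == 0) {
            *link = cur->next; /* unlink, then free */
            free(cur);
            freed++;
        } else {
            link = &cur->next;
        }
    }
    return freed;
}
```

Using a pointer to the link rather than a separate "previous" pointer lets the sweep remove the head and interior nodes with the same code path.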
7.4.3 Ephemeral Random Constants
Whenever a new ephemeral random constant terminal is inserted into a tree (during the generation of the initial population, or during mutation operations), the function new_ephemeral_const() is called to create the constant. It calls the user-supplied generation function to create a new value. Each ERC record stores the value, along with a reference count of how many tree nodes point to that value. The ERC records are maintained in a linked list. Once per generation, a garbage collection routine traverses the linked list and removes any ERCs which are no longer needed (those with a reference count of zero).
The ERC records are not allocated individually but in large blocks. This ensures that all the ERCs are kept on a few memory pages at most, which reduces the need for paging during evaluation (when there are many scattered references to the ERCs). When an ERC is freed by the garbage collection routine, it is added to the end of the free list. If the free list ever becomes empty, a new large block of ERC records is allocated and all of its records are added to the end of the free list.
Pointers to the blocks themselves are kept in an array so that
the blocks can be freed at the conclusion of the run. Since the
number of blocks can increase, this array is reallocated as necessary.
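The block allocation and free list described in this section can be sketched as follows. The erc_pool type, the function names, and the block size ERC_BLOCK are hypothetical, and the block size is kept tiny for demonstration. The in-use list and the reference-count sweep are omitted to keep the example focused on allocation.

```c
#include <assert.h>
#include <stdlib.h>

#define ERC_BLOCK 4   /* records per block; a real value would be far larger */

typedef struct erc_rec {
    double value;
    int refcount;
    struct erc_rec *next;   /* link within the free list */
} erc_rec;

typedef struct {
    erc_rec **blocks;       /* array of block pointers, grown as needed */
    int block_count;
    erc_rec *free_head;
    erc_rec *free_tail;
} erc_pool;

/* Append one record to the end of the free list. */
static void pool_append_free(erc_pool *p, erc_rec *r)
{
    r->next = NULL;
    if (p->free_tail != NULL)
        p->free_tail->next = r;
    else
        p->free_head = r;
    p->free_tail = r;
}

/* Allocate a new block and put all of its records on the free list. */
static int pool_add_block(erc_pool *p)
{
    erc_rec **grown;
    erc_rec *block = malloc(ERC_BLOCK * sizeof *block);
    int i;

    if (block == NULL)
        return -1;
    grown = realloc(p->blocks, (size_t)(p->block_count + 1) * sizeof *grown);
    if (grown == NULL) {
        free(block);
        return -1;
    }
    p->blocks = grown;
    p->blocks[p->block_count++] = block;
    for (i = 0; i < ERC_BLOCK; ++i)
        pool_append_free(p, &block[i]);
    return 0;
}

/* Take a record from the front of the free list, adding a block if empty. */
erc_rec *pool_get(erc_pool *p, double value)
{
    erc_rec *r;
    if (p->free_head == NULL && pool_add_block(p) != 0)
        return NULL;
    r = p->free_head;
    p->free_head = r->next;
    if (p->free_head == NULL)
        p->free_tail = NULL;
    r->value = value;
    r->refcount = 1;
    r->next = NULL;
    return r;
}

/* Return a record that is no longer referenced to the end of the free list. */
void pool_release(erc_pool *p, erc_rec *r)
{
    pool_append_free(p, r);
}

/* Free every block at the end of the run. */
void pool_destroy(erc_pool *p)
{
    int i;
    for (i = 0; i < p->block_count; ++i)
        free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL;
    p->block_count = 0;
    p->free_head = p->free_tail = NULL;
}
```

A pool starts out zero-initialized (erc_pool p = {0};), so the first pool_get call allocates the first block.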