Chapter 7

Extending the Kernel

     7.1    Tree Representation

     7.2    Selection Methods

     7.3    Operators

     7.4    Miscellany
          7.4.1     Tree Generation Spaces
          7.4.2     Saved Individuals
          7.4.3     Ephemeral Random Constants

Internally, lil-gp is fairly simple. I have attempted to keep the structure reasonably clean and modular, without going too overboard about avoiding global variables and such. 24 C files comprise the kernel:

main.c Initialization and cleanup.

gp.c The main evaluate-and-breed cycle, and population statistics calculation.

eval.c The tree evaluator.

tree.c Utility routines dealing with trees--counting nodes and depth, printing, generating random trees, finding subtrees, copying trees.

change.c Breeding of the new population each generation.

crossovr.c The crossover operator.

reproduc.c The reproduction operator.

mutate.c The mutation operator.

select.c Utility routines for selection methods.

tournmnt.c The tournament selection method.

bstworst.c The best, worst, and random selection methods.

fitness.c The fitness, fitness_overselect, and inverse_fitness selection methods.

genspace.c Utility routines for allocating space to grow new trees in.

exch.c The subpopulation exchange system.

populate.c Utility routines for population--copying, freeing, random generation.

ephem.c Utility routines for ephemeral random constants.

ckpoint.c Reading and writing checkpoint files.

event.c System-dependent module for tracking execution time.

pretty.c The tree pretty-printer.

individ.c Utility routines for whole individuals--printing and calculating size.

params.c The parameter database.

random.c The portable pseudorandom number generator, adapted from Numerical Recipes in FORTRAN.

memory.c The implementations of MALLOC(), FREE(), and REALLOC() that track memory usage.

output.c The output subsystem (oprintf() and oputs(), among others).

7.1 Tree Representation

A tree is stored as an array of type lnode. An lnode is a union that can hold a pointer to a function structure, a pointer to an ephemeral random constant (ERC) structure, or an integer. The tree is stored in prefix order. The first lnode is always a pointer to a function. If the function is an ERC terminal, then the next lnode in the array holds the pointer to the ERC structure. If a function is of type EXPR (the user code controls evaluation of the function's arguments), then there is an extra lnode just before the start of each child. It contains an integer, the number of lnodes used to represent the child subtree. The evaluation code uses this value to skip the child during evaluation.

Consider this expression in the symbolic regression problem:

(+ X (iflte X (* .34 X) .56))

This would be represented in lil-gp as the array:

+	pointer to structure for function +
X	pointer to structure for function X
iflte	pointer to structure for function iflte
1	first argument to iflte takes 1 lnode to store
X
4	second argument to iflte takes 4 lnodes to store
*	pointer to structure for function *
R	pointer to structure for ERC function
.34	pointer to structure containing ERC value
X
2	third argument to iflte takes 2 lnodes to store
R
.56
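
The layout above can be sketched in C as follows. This is only an illustration under stated assumptions: func_info, erc_info, the field names f, d, and s, and the function subtree_length() are stand-ins invented for this example, not the kernel's actual declarations. The sketch builds the 13-lnode array for the expression and walks it, checking that each stored skip count matches the real size of the child it precedes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for lil-gp's real structures. */
typedef struct { const char *name; int arity; int is_expr; int is_erc; } func_info;
typedef struct { double value; } erc_info;

typedef union {
    const func_info *f;  /* pointer to a function structure */
    const erc_info  *d;  /* pointer to an ERC structure */
    int              s;  /* number of lnodes in the child that follows */
} lnode;

static const func_info F_ADD   = { "+",     2, 0, 0 };
static const func_info F_MUL   = { "*",     2, 0, 0 };
static const func_info F_X     = { "X",     0, 0, 0 };
static const func_info F_ERC   = { "R",     0, 0, 1 };
static const func_info F_IFLTE = { "iflte", 3, 1, 0 };  /* EXPR-type function */
static const erc_info  C_034   = { 0.34 };
static const erc_info  C_056   = { 0.56 };

/* (+ X (iflte X (* .34 X) .56)) in prefix order, 13 lnodes in all. */
static const lnode example_tree[] = {
    { .f = &F_ADD },
    { .f = &F_X },
    { .f = &F_IFLTE },
    { .s = 1 }, { .f = &F_X },
    { .s = 4 }, { .f = &F_MUL }, { .f = &F_ERC }, { .d = &C_034 }, { .f = &F_X },
    { .s = 2 }, { .f = &F_ERC }, { .d = &C_056 }
};

/* Number of lnodes occupied by the subtree whose root is t[0]. */
size_t subtree_length(const lnode *t)
{
    const func_info *f = t[0].f;
    size_t n = 1;                      /* the function pointer itself */

    if (f->is_erc)
        return n + 1;                  /* one extra lnode holds the ERC pointer */

    for (int i = 0; i < f->arity; ++i) {
        if (f->is_expr) {
            /* t[n] is the skip count, the child starts at t[n + 1] */
            size_t child = subtree_length(t + n + 1);
            assert((size_t)t[n].s == child);
            n += 1 + child;
        } else {
            n += subtree_length(t + n);
        }
    }
    return n;
}
```

Calling subtree_length(example_tree) visits every node. An evaluator for an EXPR function does not need to do that for a child it decides not to evaluate: it can advance its position by the stored count in one step.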

This representation, while it may seem cumbersome, has two major advantages over a more traditional C representation (with individual structs for each node, linked by pointers). First, it uses much less memory--approximately 1-2 words per node versus the 4-5 it would take otherwise. This is because the structure of the tree is represented implicitly in the ordering of the nodes rather than explicitly via pointers. Second, it results in much faster tree evaluation. To see why, consider the crossover operator. In the traditional representation crossover is performed by just swapping two pointers. While this is very fast and easy, over time it means that the nodes of a given tree become spread out over the process's address space. On a system with virtual memory, this slows evaluation (or any traversal of the tree) to a crawl, as the tree is spread across dozens of pages which must constantly be swapped in and out. lil-gp's representation complicates crossover somewhat, but leaves each offspring tree as a single contiguous block of memory, able to fit on just one or two pages.
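
To make the trade-off concrete, here is a rough sketch of subtree replacement on a prefix-order array. It is not the kernel's crossover code. The node type, subtree_len(), and splice() are hypothetical, and the sketch assumes there are no EXPR functions, so no skip counts need to be recomputed. The offspring is assembled from three copies into one freshly allocated block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simplified node: a symbol and its number of arguments. */
typedef struct { char symbol; int arity; } node;

/* Number of array elements in the subtree rooted at t[0]. */
size_t subtree_len(const node *t)
{
    size_t n = 1;
    for (int i = 0; i < t[0].arity; ++i)
        n += subtree_len(t + n);
    return n;
}

/* Build a new tree: parent a, with the subtree at a_pos replaced by the
   subtree of b at b_pos.  The result is one contiguous malloc'd block. */
node *splice(const node *a, size_t a_len, size_t a_pos,
             const node *b, size_t b_pos, size_t *out_len)
{
    size_t cut = subtree_len(a + a_pos);   /* nodes removed from a */
    size_t ins = subtree_len(b + b_pos);   /* nodes taken from b */
    assert(a_pos + cut <= a_len);

    *out_len = a_len - cut + ins;
    node *child = malloc(*out_len * sizeof *child);
    if (child == NULL)
        return NULL;

    memcpy(child, a, a_pos * sizeof *child);                /* part before the cut */
    memcpy(child + a_pos, b + b_pos, ins * sizeof *child);  /* donated subtree */
    memcpy(child + a_pos + ins, a + a_pos + cut,            /* part after the cut */
           (a_len - a_pos - cut) * sizeof *child);
    return child;
}
```

This costs more than swapping two pointers, but the offspring ends up in one block, which is the property the paragraph above relies on.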

7.2 Selection Methods

A selection method is implemented with two functions: one to perform initialization and cleanup, and another to do the actual selection. The first function, when called for initialization, creates and returns a data structure called a selection context. This contains any state information needed for the selection method. It should also store a pointer to the population that the selection is being done on. This structure will be passed to the second function, which should return an index of an individual within the population.
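
The two-function arrangement might look roughly like the sketch below. Every name in it (individual, population, sel_context, SEL_INIT, SEL_CLEAN, select_best_context, pick_best) is made up for illustration, and the real kernel's types and signatures may well differ. It implements a "best" method and assumes, for this example only, that a larger fitness value is better.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified population types. */
typedef struct { double fitness; } individual;
typedef struct { individual *ind; int size; } population;

/* The selection context: state for one selection method on one population. */
typedef struct sel_context sel_context;
struct sel_context {
    population *pop;                  /* population being selected from */
    int (*select)(sel_context *sc);   /* the second function: returns an index */
    int best;                         /* method-specific state */
};

enum { SEL_INIT, SEL_CLEAN };

/* Second function: return the index of the chosen individual. */
int pick_best(sel_context *sc)
{
    return sc->best;
}

/* First function: creates the context on SEL_INIT, frees it on SEL_CLEAN. */
sel_context *select_best_context(int op, sel_context *sc, population *pop)
{
    if (op == SEL_CLEAN) {
        free(sc);
        return NULL;
    }

    assert(pop != NULL && pop->size > 0);
    sc = malloc(sizeof *sc);
    if (sc == NULL)
        return NULL;

    sc->pop = pop;
    sc->select = pick_best;
    sc->best = 0;
    for (int i = 1; i < pop->size; ++i)        /* find the best once, up front */
        if (pop->ind[i].fitness > pop->ind[sc->best].fitness)
            sc->best = i;
    return sc;
}
```

A caller would obtain a context with SEL_INIT, call sc->select(sc) as many times as it needs an individual, and hand the context back with SEL_CLEAN when breeding is over.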

To create a new selection method, it is suggested that you copy and modify an existing one (the random method in bstworst.c is an especially simple one). If your method of selection can be expressed as randomly selecting an individual from a set where each individual has a fixed probability of selection, then there is already an efficient implementation of the second function. See the code for the fitness, fitness_overselect, and inverse_fitness methods for examples.
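
One common way to select from a set with fixed probabilities is to precompute running totals and binary-search them with a random number. The sketch below shows that technique. It is not taken from the kernel, whose own implementation may work differently, and build_cumulative() and pick_index() are names invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Fill cum[i] with prob[0] + ... + prob[i]; return the overall total. */
double build_cumulative(const double *prob, size_t n, double *cum)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += prob[i];
        cum[i] = total;
    }
    return total;
}

/* Return the smallest index i with r < cum[i].  The caller supplies r,
   a random number in [0, total).  Runs in O(log n). */
size_t pick_index(const double *cum, size_t n, double r)
{
    assert(n > 0);
    size_t lo = 0, hi = n - 1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (r < cum[mid])
            hi = mid;        /* answer is mid or to its left */
        else
            lo = mid + 1;    /* answer is strictly to the right */
    }
    return lo;
}
```

The table is built once per generation, after which each selection is a single random draw and a short search. That is why fitness-based methods can share the same second function.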

Finally, you must add a record describing your selection method to the array select_method_table at the top of select.c. This lists the names and initialization functions of each selection method available.

7.3 Operators

The breeding of the population is controlled by a table of breeding phases. This is built from the parameter database at the start of the run (and whenever rebuild_breeding_table() is called from user code). Each subpopulation has its own breeding table. For each phase, there is a record in the table. Each record has pointers to the four methods for the operator, the rate for that phase, and a pointer to an operator-specific structure.

To implement a new operator, it is suggested that you copy and modify the code of an existing operator. The reproduction operator is the simplest of the three included. Suppose that your new operator is called "foo". You would need to provide five functions (the following naming scheme is strongly recommended):

operator_foo_init Parses the operator's options string and builds the operator table entry. The kernel functions parse_o_rama() and free_o_rama() are available to parse the options string just like the built-in operators do.

operator_foo_free Frees the operator-specific part of the operator table entry.

operator_foo_start This is called at the start of breeding. Selection methods should be initialized here.

operator_foo_end This is called at the end of breeding. Selection contexts should be freed here.

operator_foo This performs the actual operation.

You then add a record to the array operator_table at the top of change.c, listing the name of the operator (a string) and the operator's initialization method (a function pointer). For example, after adding the foo operator the table would look like:

operator operator_table[] =
{ { "crossover", operator_crossover_init },
  { "reproduction", operator_reproduce_init },
  { "mutation", operator_mutate_init },
  { "foo", operator_foo_init },
  { NULL, NULL } };
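
The trailing { NULL, NULL } entry marks the end of the table, so a lookup by name needs no separate length. The following sketch shows how such a lookup could be written. The op_entry type, the init_fn signature, the stub init functions, and find_operator() are all assumptions made for this example and are not the declarations in change.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical signature for an operator initialization function. */
typedef int (*init_fn)(const char *options);

typedef struct { const char *name; init_fn init; } op_entry;

/* Stub initializers standing in for the real operator_*_init functions. */
int init_crossover(const char *options) { (void)options; return 1; }
int init_reproduce(const char *options) { (void)options; return 2; }
int init_mutate(const char *options)    { (void)options; return 3; }
int init_foo(const char *options)       { (void)options; return 4; }

static const op_entry demo_table[] = {
    { "crossover",    init_crossover },
    { "reproduction", init_reproduce },
    { "mutation",     init_mutate },
    { "foo",          init_foo },
    { NULL, NULL }                      /* sentinel: end of table */
};

/* Return the init function registered under name, or NULL if none. */
init_fn find_operator(const op_entry *table, const char *name)
{
    for (; table->name != NULL; ++table)
        if (strcmp(table->name, name) == 0)
            return table->init;
    return NULL;
}
```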

The next field of the new population structure gives the index at which the operator should place the new individual. After adding an individual, increment the next field. If your operator produces multiple offspring with a single call then you must make sure that you don't overfill the population (the next field should not exceed the size field). The crossover operator, for instance, normally produces two offspring on each call. If it is called when there is only one more space in the population, it fills the space and throws the other offspring away.

Your operator should always add at least one individual to the new population per call. If it does not, infinite loops may occur when the probabilistic_operators parameter is off.
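
The bookkeeping for the next and size fields can be sketched like this. The individual and population types are cut down to the bare minimum and add_offspring() is a hypothetical helper, so treat it as an outline of the rule rather than kernel code.

```c
#include <assert.h>

/* Hypothetical, stripped-down types; only the fields discussed above. */
typedef struct { int id; } individual;
typedef struct { individual *ind; int size; int next; } population;

/* Store up to count offspring in the new population, stopping when it is
   full.  Returns how many were stored; the rest are thrown away. */
int add_offspring(population *newpop, const individual *kids, int count)
{
    /* The caller must not invoke an operator on a full population. */
    assert(newpop->next < newpop->size);

    int stored = 0;
    while (stored < count && newpop->next < newpop->size) {
        newpop->ind[newpop->next] = kids[stored];
        newpop->next++;                 /* advance after each individual */
        stored++;
    }
    return stored;
}
```

With count of at least 1 the return value is always at least 1, which satisfies the requirement that every call adds an individual.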

7.4 Miscellany

This section gives a general overview of how some of the nonobvious parts of lil-gp work and how they fit together.

7.4.1 Tree Generation Spaces

When building a new tree, lil-gp needs a contiguous block of memory to put the tree in, but can't allocate the final location of the tree because the size isn't known ahead of time. Therefore, special blocks of memory are allocated to grow the trees in. Every time an lnode is added to a tree-in-progress, the function gensp_next (or gensp_next_int) is called to enlarge the memory block if needed. Once the tree is finished and its size is known, its final location is allocated and the tree is copied from the generation space.

Currently there are two generation spaces needed. If you are implementing an operator or something that requires three or more trees to be grown simultaneously, increase GENSPACE_COUNT in defines.h.

Generation spaces are initially allocated to hold GENSPACE_START lnodes, and grow in steps of GENSPACE_GROW lnodes. These constants are in defines.h.
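
The grow-then-copy scheme can be sketched as below. The genspace struct, the cell type (standing in for lnode), and the functions space_init(), space_push(), and space_finish() are invented for this example. SPACE_START and SPACE_GROW play the roles of GENSPACE_START and GENSPACE_GROW, with tiny values chosen only so the growth is easy to follow.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SPACE_START 4   /* demo value, stands in for GENSPACE_START */
#define SPACE_GROW  4   /* demo value, stands in for GENSPACE_GROW */

typedef int cell;       /* stands in for lnode */

typedef struct { cell *data; size_t capacity; size_t used; } genspace;

/* Allocate the initial scratch block.  Returns 0 on success, -1 on failure. */
int space_init(genspace *g)
{
    g->data = malloc(SPACE_START * sizeof *g->data);
    g->capacity = g->data ? SPACE_START : 0;
    g->used = 0;
    return g->data ? 0 : -1;
}

/* Append one cell to the tree in progress, enlarging the block if needed. */
int space_push(genspace *g, cell c)
{
    if (g->used == g->capacity) {
        cell *bigger = realloc(g->data, (g->capacity + SPACE_GROW) * sizeof *bigger);
        if (bigger == NULL)
            return -1;
        g->data = bigger;
        g->capacity += SPACE_GROW;
    }
    g->data[g->used++] = c;
    return 0;
}

/* The tree is complete: copy it to an exact-size block and reset the
   space for reuse.  The scratch block keeps its enlarged capacity. */
cell *space_finish(genspace *g, size_t *len)
{
    assert(g->used > 0);
    cell *tree = malloc(g->used * sizeof *tree);
    if (tree == NULL)
        return NULL;
    memcpy(tree, g->data, g->used * sizeof *tree);
    *len = g->used;
    g->used = 0;
    return tree;
}
```

Because the scratch block is reused and never shrinks in this sketch, reallocation becomes rare once a few large trees have been built.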

7.4.2 Saved Individuals

lil-gp tracks the best n individuals of each population, where n is a user-settable parameter. Pointers to these best individuals are passed to the user function app_end_of_evaluation(). To ensure that these pointers are always good, the kernel makes a copy of each individual in the top n and passes the address of the copy (since the original individual may not survive the breeding process).

An individual can be referenced in multiple top-n lists (best-of-gen, best-of-run, etc.). To avoid making multiple copies, a reference count is kept for each individual. All these saved individuals are kept in a linked list, and once per generation a garbage collection procedure traverses the list and frees any individuals which have no references to them.
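
A reference-counted list with a periodic sweep could be sketched as follows. The saved struct and the functions save_copy(), add_ref(), drop_ref(), and collect_saved() are hypothetical, and an int id stands in for the copied individual.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical saved-individual record; id stands in for the copy itself. */
typedef struct saved {
    int id;
    int refcount;           /* how many top-n lists point at this copy */
    struct saved *next;
} saved;

/* Add a new saved copy to the front of the list, with no references yet. */
saved *save_copy(saved **head, int id)
{
    saved *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->id = id;
    s->refcount = 0;
    s->next = *head;
    *head = s;
    return s;
}

void add_ref(saved *s)  { ++s->refcount; }
void drop_ref(saved *s) { assert(s->refcount > 0); --s->refcount; }

/* Sweep the list once, freeing every record with no references.
   Returns the number of records freed. */
int collect_saved(saved **head)
{
    int freed = 0;
    saved **link = head;            /* pointer to the link being examined */
    while (*link != NULL) {
        saved *s = *link;
        if (s->refcount == 0) {
            *link = s->next;        /* unlink, then free */
            free(s);
            ++freed;
        } else {
            link = &s->next;
        }
    }
    return freed;
}
```

An individual that appears in both the best-of-generation and best-of-run lists would simply have a count of 2, and is freed only after both lists have dropped it.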

7.4.3 Ephemeral Random Constants

Whenever a new ephemeral random constant terminal is inserted into a tree (during the generation of the initial population, or during mutation operations), the function new_ephemeral_const() is called to create the constant. It calls the user-supplied generation function to create a new value. Each ERC record stores the value, along with a reference count of how many tree nodes point to that value. The ERC records are maintained in a linked list. Once per generation, a garbage collection routine traverses the linked list and removes any ERCs which are no longer needed (that have a reference count of zero).

The ERC records are not allocated individually but in large blocks. This ensures that all the ERCs are kept on a few memory pages at most, which reduces the need for paging during evaluation (when there are many scattered references to the ERCs). When an ERC is freed by the garbage collection routine, it is added to the end of the free list. If the free list ever becomes empty, a new large block of ERC records is allocated and all of them are added on to the end of the free list.

Pointers to the blocks themselves are kept in an array so that the blocks can be freed at the conclusion of the run. Since the number of blocks can increase, this array is reallocated as necessary.
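
The block allocation and free list described in this section can be sketched as below. It covers only the free list and the array of blocks and leaves out the in-use list and the garbage collection pass. The erc and erc_pool types and the functions erc_release(), pool_grow(), erc_new(), and pool_destroy() are invented for this example, and ERC_BLOCK is deliberately tiny so the behavior is easy to trace.

```c
#include <assert.h>
#include <stdlib.h>

#define ERC_BLOCK 4     /* records per block; demo value only */

typedef struct erc { double value; int refcount; struct erc *next; } erc;

typedef struct {
    erc *free_head, *free_tail;   /* free list, appended to at the tail */
    erc **blocks;                 /* every block allocated so far */
    size_t block_count;
} erc_pool;

/* Put a record on the end of the free list. */
void erc_release(erc_pool *p, erc *e)
{
    e->refcount = 0;
    e->next = NULL;
    if (p->free_tail != NULL)
        p->free_tail->next = e;
    else
        p->free_head = e;
    p->free_tail = e;
}

/* Allocate one more block and add all of its records to the free list. */
int pool_grow(erc_pool *p)
{
    erc *block = malloc(ERC_BLOCK * sizeof *block);
    if (block == NULL)
        return -1;
    erc **blocks = realloc(p->blocks, (p->block_count + 1) * sizeof *blocks);
    if (blocks == NULL) {
        free(block);
        return -1;
    }
    p->blocks = blocks;
    p->blocks[p->block_count++] = block;
    for (int i = 0; i < ERC_BLOCK; ++i)
        erc_release(p, &block[i]);
    return 0;
}

/* Take a record from the front of the free list, growing the pool if empty. */
erc *erc_new(erc_pool *p, double value)
{
    if (p->free_head == NULL && pool_grow(p) != 0)
        return NULL;
    erc *e = p->free_head;
    p->free_head = e->next;
    if (p->free_head == NULL)
        p->free_tail = NULL;
    e->value = value;
    e->refcount = 1;   /* assumed here: the new tree node is the first reference */
    e->next = NULL;
    return e;
}

/* End of run: free every block, then the array that tracked them. */
void pool_destroy(erc_pool *p)
{
    for (size_t i = 0; i < p->block_count; ++i)
        free(p->blocks[i]);
    free(p->blocks);
    p->blocks = NULL;
    p->block_count = 0;
    p->free_head = p->free_tail = NULL;
}
```

Since individual records are never passed to free(), the array of block pointers is the only way to return the memory at the end of the run.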