+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|                                                                             |
|  SPAM e-mail database (spambase): Machine learning using strongly-typed GP  |
|  with Open BEAGLE                                                           |
|                                                                             |
|  Copyright (C) 2001-2003                                                    |
|  by Christian Gagne <cgagne@gmail.com>                                      |
|  and Marc Parizeau <parizeau@gel.ulaval.ca>                                 |
|                                                                             |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

Getting started
===============

The example is compiled into the binary 'spambase'. Usage options are listed
by running it with the command-line argument '-OBusage'. Detailed help is
also available with the argument '-OBhelp'.

Objective
=========

Find a program that will successfully predict whether a given e-mail is spam
or not from some extracted features.

Comments
========

The evolved programs work on both floating-point and Boolean values.
The programs must return a Boolean value, which must be true if the e-mail
is spam and false otherwise. Don't expect too much from this program, as
it is quite basic and not oriented toward performance. It is there mainly
to illustrate the use of strongly-typed GP with Open BEAGLE.

Terminal set
============

IN0, IN1, ... up to IN56, the e-mail features.         [floating-point]
0 and 1, two Boolean constants.                        [Boolean]
Ephemeral constants randomly generated in [0,100].     [floating-point]

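As a rough illustration of the terminal set, the following Python sketch tags
each terminal with its type and draws an ephemeral constant in [0,100]. It is
not Open BEAGLE code: the names TERMINAL_TYPES and make_ephemeral are
hypothetical, and keeping the drawn value fixed after node creation is the
usual GP convention, assumed here rather than taken from this example.

```python
import random

FLOAT, BOOL = "floating-point", "Boolean"

# Hypothetical type table: 57 feature inputs plus the two Boolean constants.
TERMINAL_TYPES = {f"IN{i}": FLOAT for i in range(57)}
TERMINAL_TYPES.update({"0": BOOL, "1": BOOL})


def make_ephemeral(rng):
    """Draw a floating-point constant in [0, 100] once, at node creation."""
    value = rng.uniform(0.0, 100.0)
    # Assumption: the value stays fixed for the lifetime of the node.
    return ("E", FLOAT, value)


rng = random.Random(1)
node = make_ephemeral(rng)
```
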
Function set
============

AND           [Inputs: Booleans, Output: Boolean]
OR            [Inputs: Booleans, Output: Boolean]
NOT           [Input: Boolean, Output: Boolean]
+             [Inputs: floating-points, Output: floating-point]
-             [Inputs: floating-points, Output: floating-point]
*             [Inputs: floating-points, Output: floating-point]
/             [Inputs: floating-points, Output: floating-point]
<             [Inputs: floating-points, Output: Boolean]
==            [Inputs: floating-points, Output: Boolean]
if-then-else  [1st Input: Boolean, 2nd & 3rd Inputs: floating-points,
               Output: floating-point]

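To make the typing constraints of the function set concrete, here is a
minimal Python sketch of a typed expression tree with a type checker and an
evaluator. It is an illustration only and does not use the Open BEAGLE API.
The nested-tuple tree encoding, the names PRIMITIVES, type_of and evaluate,
the use of Python True/False for the Boolean constants, and the behavior of
"/" on a zero denominator (returns 1.0) are all assumptions.

```python
import operator

FLOAT, BOOL = "floating-point", "Boolean"


def protected_div(a, b):
    # Assumption: a zero denominator yields 1.0; the README does not say.
    return a / b if b != 0 else 1.0


# name -> (argument types, return type, implementation)
PRIMITIVES = {
    "AND": ((BOOL, BOOL), BOOL, lambda a, b: a and b),
    "OR": ((BOOL, BOOL), BOOL, lambda a, b: a or b),
    "NOT": ((BOOL,), BOOL, operator.not_),
    "+": ((FLOAT, FLOAT), FLOAT, operator.add),
    "-": ((FLOAT, FLOAT), FLOAT, operator.sub),
    "*": ((FLOAT, FLOAT), FLOAT, operator.mul),
    "/": ((FLOAT, FLOAT), FLOAT, protected_div),
    "<": ((FLOAT, FLOAT), BOOL, operator.lt),
    "==": ((FLOAT, FLOAT), BOOL, operator.eq),
    "if-then-else": ((BOOL, FLOAT, FLOAT), FLOAT,
                     lambda c, a, b: a if c else b),
}


def type_of(tree):
    """Return the type of a tree, raising TypeError on a typing violation."""
    if isinstance(tree, bool):
        return BOOL
    if isinstance(tree, float):
        return FLOAT
    if isinstance(tree, str) and tree.startswith("IN"):
        return FLOAT
    name, *args = tree
    arg_types, ret_type, _ = PRIMITIVES[name]
    actual = tuple(type_of(a) for a in args)
    if actual != arg_types:
        raise TypeError(f"{name} expects {arg_types}, got {actual}")
    return ret_type


def evaluate(tree, features):
    """Evaluate a tree on one e-mail, given its list of feature values."""
    if isinstance(tree, (bool, float)):
        return tree
    if isinstance(tree, str):
        return features[int(tree[2:])]  # "IN7" -> features[7]
    name, *args = tree
    return PRIMITIVES[name][2](*(evaluate(a, features) for a in args))


# A well-typed classifier: spam if IN0 < 0.5 or IN1 * 2 < IN2.
classifier = ("OR", ("<", "IN0", 0.5), ("<", ("*", "IN1", 2.0), "IN2"))
```

A tree such as ("AND", "IN0", True) is rejected by type_of because AND
receives a floating-point argument, which is the kind of program that
strongly-typed GP never generates in the first place.
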
Fitness cases
=============

A random sample of 400 e-mails drawn from the database, re-chosen for
each fitness evaluation.

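The resampling described above could look like the following Python sketch.
The function name sample_fitness_cases and the stand-in database are
hypothetical, and sampling without replacement is an assumption, since the
README does not say how the 400 e-mails are drawn.

```python
import random


def sample_fitness_cases(database, rng, size=400):
    """Draw `size` distinct e-mails; call once per fitness evaluation."""
    # Assumption: sampling is done without replacement.
    return rng.sample(database, size)


# Stand-in database of (features, is_spam) pairs with made-up values.
rng = random.Random(0)
database = [
    ([rng.random() for _ in range(57)], rng.random() < 0.4)
    for _ in range(1000)
]

cases_a = sample_fitness_cases(database, rng)
cases_b = sample_fitness_cases(database, rng)  # a fresh sample each time
```

Because the sample changes between evaluations, the same individual can
score differently from one evaluation to the next.
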
Hits
====

Number of correct outputs obtained over the 400 fitness cases.

Raw fitness
===========

Ignored (always 0).

Standardized fitness
====================

Rate of correct outputs over the fitness cases where
the desired output was 0 (non-spam).

Adjusted fitness
================

Rate of correct outputs over the fitness cases where
the desired output was 1 (spam).

Normalized fitness
==================

Rate of correct outputs obtained over all 400 fitness cases.

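The measures from Hits through Normalized fitness can be computed from one
sample as in the Python sketch below. The function name fitness_measures and
the dictionary keys are hypothetical, a non-empty sample is assumed, and
returning 0.0 when a class is absent from the sample is an assumption the
README does not cover.

```python
def fitness_measures(predictions, targets):
    """Return hits and the fitness rates for one non-empty sample."""
    pairs = list(zip(predictions, targets))
    hits = sum(p == t for p, t in pairs)
    # Correctness flags split by desired output: False = non-spam, True = spam.
    non_spam = [p == t for p, t in pairs if not t]
    spam = [p == t for p, t in pairs if t]
    return {
        "hits": hits,
        "raw": 0.0,  # ignored, always 0
        "standardized": sum(non_spam) / len(non_spam) if non_spam else 0.0,
        "adjusted": sum(spam) / len(spam) if spam else 0.0,
        "normalized": hits / len(pairs),
    }


# Four made-up cases: one spam e-mail and three non-spam e-mails.
measures = fitness_measures(
    predictions=[True, True, False, False],
    targets=[True, False, False, False],
)
```

In this toy sample the spam e-mail is classified correctly and one of the
three non-spam e-mails is not, so the two per-class rates differ even though
they come from the same run.
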
Stopping criteria
=================

Evolution stops when the best individual scores 400 hits or when the
maximum number of generations is reached.

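The two stopping criteria can be expressed as a single test, sketched below
in Python. The names should_stop and run are hypothetical, the generation
limit of 50 in the usage lines is a made-up value, and counting generations
from 0 is an assumption.

```python
def should_stop(best_hits, generation, max_generations, n_cases=400):
    """True when the best individual is perfect or the budget is spent."""
    return best_hits >= n_cases or generation >= max_generations


def run(best_hits_per_generation, max_generations):
    """Walk a stand-in sequence of best-hit counts until a criterion fires."""
    generation = 0
    for generation, best_hits in enumerate(best_hits_per_generation):
        if should_stop(best_hits, generation, max_generations):
            break
    return generation


# Stops at generation 2, where the best individual reaches 400 hits.
stopped_on_hits = run([310, 355, 400, 400], max_generations=50)
# Stops at generation 5, where the generation limit is reached.
stopped_on_limit = run([300] * 10, max_generations=5)
```
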
Reference
=========

UCI Machine Learning Repository,
http://www.ics.uci.edu/~mlearn/MLRepository.html