UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
ECML PKDD 2008
15-19 September 2008, Antwerp, Belgium
A Genetic Algorithm for
Text Classification
Rule Induction
A.Pietramala1, V.Policicchio1, P.Rullo1,2, I.Sidhu3
Università della Calabria (Rende, Italy)
1.
{a.pietramala,policicchio,rullo}@mat.unical.it
2.
3.
Exeura Srl (Rende, Italy)
Kenetica Ltd (Chicago, IL-USA)
{isidhu}@computer.org
UNIVERSITA’ DELLA CALABRIA
Outline
Dipartimento di MATEMATICA
– Motivations
– The Olex Hypothesis Language
– The Genetic Algorithm Approach (Olex-GA)
– Experimental Results and Comparative Evaluation
– Discussions
– Conclusions and Future Work
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Motivations
• Rule learning algorithms have become a successful
strategy for classifier induction.
• Rule-based classifiers provide the desirable property
of being readable and, thus, easy to understand (and,
possibly, modify).
• Genetic Algorithms (GAs) are stochastic search
methods inspired to the biological evolution.
• GAs show the capability to provide good solutions for
classical optimization tasks (e.g. TSP and Knapsack)
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Rule Induction and GAs
• Rule induction is one of the application fields of GAs.
The basic idea is that:
– Each individual in the population represents a candidate solution (a
classification rule or a classifier)
– The fitness of an individual is evaluated in terms of the predictive
accuracy.
• We propose presents a GA approach, called Olex-GA, for the induction of
rule-based text classifiers.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Olex-GA The hypothesis language
Dipartimento di MATEMATICA
• A classifier Hc (Pos,Neg) is of the form:
Hc (Pos,Neg)
Pos
c  ( t1  d  ...  tn  d ) 
( tn 1  d  ...  tn  m  d ).
c
category
ti
term (n-gram)
d
document
Neg
“if any of the terms t1,…,tn occurs in d and none of the
terms tn+1,…,tn+m occurs in d, then classify d under
category c”
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
Olex-GA
The hypothesis language
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
• The terms in Pos and Neg are chosen among the ones
belonging to the local vocabulary:
V (k , f )  UcC Vc (k , f )
• Intuitively, Vc (k, f ) is the set of the best k terms for
category c according to a given scoring function f.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
Problem statement
• The Olex-GA’s learning problem is stated as an optimization
problem:
PROBLEM MAX-F
Let a category c  C and a vocabulary V (k, f) over the training set TS be given.
Then, find two subsets of V (k, f), Pos = {t1,…,tn } and Neg = {tn+1,…,tn+m }
with Pos ≠ Ø , such that Hc (Pos, Neg) applied to TS yields a maximum value of
Fc, (over TS), for a given [0,1].
• Problem MAX-F is NP-Hard.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
A Genetic Algorithm to Solve MAX-F
• Problem MAX-F is a combinatorial optimization problem
aimed at finding a best combination of terms taken from a
given vocabulary.
• MAX-F is a typical problem for which GAs are known to be a
good candidate resolution method.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
GA-Olex
Our implementation of GA
• In the following, we describe our choices concerning:
– Population Encoding
– Fitness Function
– Evolutionary Operators
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
Olex-GA
Population Encoding
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
• Each individual represents an entire classifier.
EXAMPLE
Given a vocabulary
V ( k ,f )  { t1 ,t2 ,t3 ,t4 ,t5 }
Hc
c  ( t1  t3 )  ( t2  t4 )
K
1
0
1
0
0
0
1
0
1
0
t1
t2
t3
t4
t5
t1
t2
t3
t4
t5
• An individual is simply a binary representation of the sets Pos
and Neg of a classifier Hc (Pos, Neg) .
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
Population Encoding
• We restrict the search of both positive and negative
terms, respectively, to:
– Pos*, the set of terms belonging to Vc (k, f ) (candidate positive
terms);
– Neg*, the set of terms which occur in any document containing
some candidate positive term and not belonging to the training
set TSc of c (candidate negative terms).
• The reduction of search space allows:
– an improvement of the algorithm efficiency
– a quick convergence toward good solutions
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
Fitness Function
• The fitness of a chromosome K, representing
Hc(Pos,Neg) is the value of the F-measure resulting
from applying Hc(Pos,Neg) to the training set TS.
• This choice naturally follows from the formulation of
problem MAX-F.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
Evolutionary Operators
• We perform:
– selection via the roulette-wheel method,
– crossover by the uniform crossover scheme.
– mutation, which consists in the flipping of each single bit
with a given (low) probability.
– elitism, in order to ensure that the best individuals of the
current generation are passed to the next one without
being altered
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Olex-GA
Experimentation
We have experimentally evaluated our algorithm on two
standard benchmark corpora:
• REUTERS-21578 (R10)
– It consists of 12,902 documents
– They are manually classified with respect to 135 categories. We have
considered the subset of the 10 most populated categories.
• OHSUMED
– We used the collection consisting of the first 20,000 documents from
the 50,216 medical abstracts of the year 1991.
– The classification scheme consisted of the 23 MeSH disease
categories.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Experimental settings
• We applied the stratified holdout method:
REUTERS:
– ModApté split : 9603 documents are used to form the training
corpus (seen data) and 3299 to form the test set (unseen data).
OHSUMED:
– The first 10,000 were used as seen data and the second 10,000 as
unseen data.
In both cases, we have randomly split the set of seen data into a
– training set (70%), on which to run the GA
– and a validation set (30%), on which tuning the model parameters.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Experimental settings
Dipartimento di MATEMATICA
• GA Parameters:
Parameter
Value
Iterations
3
Population Size
500
Num of Generations
200
Cross-over Rate
1.0
Mutation Rate
0.001
Elitism Probability
0.2
• For each chromosome K in the population, we initialized
K+ at random, while we set K¡- [t] = 0, for each t  Neg*
(thus, K initially encodes a classifier Hc(Pos,Neg) with no
negative terms).
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Comparative Evaluation
• On both corpora, we carried out a direct comparison
with the following systems:
– SVM (both polynomial and radial basis function)
– Ripper (with two optimization steps)
– C4.5
– Naive Bayes
– Olex-Greedy
• The performances were evaluated using the Weka
library of ML algorithms (apart from Olex-Greedy).
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
•
Performance Comparison on Reuters
Efficacy
– SVMpoli > SVMrbf > Ripper ≈ Olex-GA > C45 > Olex-Greedy > NB
•
Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
•
Performance Comparison on OHSUMED
Efficacy
– Olex-GA > Ripper > SVMpoli > Olex-Greedy > SVMrbf ≈ NB > C45
•
Efficiency
– NB > Olex-Greedy > SVMpoli > Olex-GA > C45 > SVMrbf > Ripper
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
Discussions –
UNIVERSITA’ DELLA CALABRIA
Dipartimento di MATEMATICA
Relation to other inductive rule learners
• Conventional Rule Learners (Ripper, C4.5):
– Usually rely on a two-stage process: rule induction and rule pruning.
– Each of the above step in turn consists of several steps
• Olex-GA relies on a a single-step process which does not need
any post-induction optimization.
• With respect to Olex-Greedy, Olex-GA provides better
predictive accuracy, but is less efficient.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Conclusions
Dipartimento di MATEMATICA
• Olex-GA encodes a classifier, in a very natural and
compact way, as an individual
• Fitness of an individual is evaluated as the F-measure of
the encoded classifiers
• Experimental results point out:
– Olex-GA quickly converges to very accurate classifiers;
– Olex-GA performs at a competitive level with standard
algorithms;
– Time efficiency is lower than Olex-Greedy but higher than the
other rule learning methods, such as Ripper and C45.
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
UNIVERSITA’ DELLA CALABRIA
Future work
Dipartimento di MATEMATICA
• Extension of the proposed technique to deal with classifiers of the
form
c  ( T1  d  ... Tn  d ) 
( Tn  1  d  ... Tn  m  d )
where each Ti is a conjunction of “simple” terms:
Ti  ti  ....  tik
1
A.Pietramala, V.Policicchio, P.Rullo, I. Sidhu
A Genetic Algorithm for Text Classification Rule Induction
Scarica

Olex-GA - Dipartimento di Matematica e Informatica