Extracting Knowledge from Biomedical Data through Logic Learning Machines and Rulex
Marco Muselli
Institute of Electronics, Computer and Telecommunication Engineering
National Research Council of Italy, Genova, Italy
[email protected]
Extracting knowledge from data
Basic problem: infer some knowledge about a biological phenomenon of interest starting from a sample of data.

Type of knowledge:
• Correlation, statistical measures
• Feature ranking, analysis of relevance
• Prediction, clustering, risk analysis
• Intelligible model (rules)

[Slide graphic: vertical label "KNOWLEDGE" beside the list]
Marco Muselli
NETTAB 2012
Rule generation methods
Extract models described by a set of intelligible rules in if-then form:

If Pressure > 115 and Heart_rate < 100 then Disease = Yes

• Aggregative approach: emphasis on similarities!
• Divide-and-conquer approach: emphasis on differences!
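To make the if-then form concrete, the rule shown on this slide can be written as a small data structure and checked against patient records. This is only a minimal sketch: the `Condition` and `Rule` classes and the three patient records are hypothetical names and values introduced for illustration, and they do not reflect how Rulex stores or applies rules.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Comparison operators that a rule condition may use.
OPS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    """One test on a single input variable, e.g. Pressure > 115."""
    feature: str
    op: str
    threshold: float

    def holds(self, record: Dict[str, float]) -> bool:
        return OPS[self.op](record[self.feature], self.threshold)


@dataclass(frozen=True)
class Rule:
    """A conjunction of conditions implying one output value."""
    conditions: Tuple[Condition, ...]
    output: str
    label: str

    def fires(self, record: Dict[str, float]) -> bool:
        # The rule applies only if every condition is satisfied.
        return all(c.holds(record) for c in self.conditions)


# The rule from the slide:
# If Pressure > 115 and Heart_rate < 100 then Disease = Yes
rule = Rule(
    conditions=(
        Condition("Pressure", ">", 115),
        Condition("Heart_rate", "<", 100),
    ),
    output="Disease",
    label="Yes",
)

# Hypothetical patient records, made up for this example.
patients = [
    {"Pressure": 130, "Heart_rate": 80},   # both conditions hold
    {"Pressure": 110, "Heart_rate": 80},   # pressure too low
    {"Pressure": 130, "Heart_rate": 120},  # heart rate too high
]

fired = [rule.fires(p) for p in patients]
print(fired)  # [True, False, False]
```

Only the first record satisfies both conditions, so only that patient would be assigned Disease = Yes by this rule.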
Statistical vs. Machine learning methods
Statistical methods
• Simpler to use, with extensive accumulated experience
• Plenty of commercial and free tools available
• Limited quantity of knowledge extracted
• A priori hypotheses on probability distributions are required

Machine learning methods
• Their application is not straightforward, and experience is more limited
• Commercial tools are often extensions of statistical packages; free programs are not very user-friendly
• Substantial quantity of knowledge extracted
• No a priori hypothesis is required
Machine learning software
Commercial software
• SAS Enterprise Miner (www.sas.com/technologies/analytics/datamining/miner)
• IBM SPSS Statistics Software (www-01.ibm.com/software/analytics/spss/products/statistics)
• Salford Systems Data Mining Suite (www.salford-systems.com)
• Statistica Data Miner (www.statsoft.com/products/data-mining-solutions)

Free software
• WEKA (www.cs.waikato.ac.nz/ml/weka)
• RapidMiner (rapid-i.com)
• Orange (orange.biolab.si)
• Machine Learning & Statistical Learning in R language (cran.r-project.org/web/views/MachineLearning.html)
RULEX® Suite
The RULEX® suite, developed by Impara Srl (www.impara-ai.com), a spin-off of the National Research Council of Italy, offers new, simple and powerful tools for extracting knowledge from real-world data.

The name RULEX is a contraction of RULe EXtraction, since the suite is especially devoted to generating intelligible rules, although a wide range of statistical and machine learning approaches will be made available.

An intuitive graphical interface allows users to easily apply standard and advanced algorithms for analyzing any dataset of interest, providing solutions to classification, regression and clustering problems.

The software suite is evolving rapidly; the number and the functionality of the available tasks therefore keep growing.
RULEX GUI
[Screenshot of the Rulex GUI with labeled areas: Dataset panel, Tasks, Stage, Component panel, Source]
Logic Learning Machine
Besides standard techniques, such as:
• Logistic regression
• Decision trees
• K-nearest-neighbor
• Neural networks

Rulex offers the possibility of applying an original proprietary approach, named Logic Learning Machine (LLM), which is an efficient implementation of the switching neural network model (Muselli, 2006).
Logic Learning Machine
LLM solves classification problems by producing sets of intelligible rules capable of achieving an accuracy comparable or superior to that of the best machine learning methods.

The LLM approach is based on monotone Boolean function synthesis (Shadow Clustering) and adopts an aggregative policy: at each iteration, some patterns belonging to the same output class are clustered to produce an intelligible rule.

Since the training process occurs in a binary projected space, the application of LLM must be preceded by a discretization task that finds proper cutoffs for ordered (continuous and discrete) input variables.
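As a rough illustration of the discretization step described above, the sketch below maps an ordered variable onto the intervals defined by a set of cutoffs and then onto a binary string. The cutoff values, the function names, and the one-hot coding are all assumptions chosen for illustration. The slides do not say how Rulex selects cutoffs or which binary coding LLM uses internally.

```python
from bisect import bisect_right
from typing import List, Sequence


def interval_index(value: float, cutoffs: Sequence[float]) -> int:
    """Return the index of the interval that contains `value`.

    `cutoffs` must be sorted in ascending order; k cutoffs define
    k + 1 intervals. A value equal to a cutoff is assigned to the
    interval on its right.
    """
    return bisect_right(cutoffs, value)


def one_hot(index: int, size: int) -> List[int]:
    """Binary string with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]


def binarize(value: float, cutoffs: Sequence[float]) -> List[int]:
    """Discretize `value`, then encode its interval as a binary string."""
    return one_hot(interval_index(value, cutoffs), len(cutoffs) + 1)


# Hypothetical cutoffs for a blood pressure variable (illustration only).
pressure_cutoffs = [100.0, 115.0, 140.0]

codes = [binarize(v, pressure_cutoffs) for v in (90.0, 115.0, 120.0, 150.0)]
for code in codes:
    print(code)
# [1, 0, 0, 0]
# [0, 0, 1, 0]
# [0, 0, 1, 0]
# [0, 0, 0, 1]
```

After this step every input pattern is a fixed-length binary vector, which is the kind of representation a Boolean function synthesis method can operate on.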
An application in biomedical analysis
The functionalities of Rulex have been verified by analyzing three biomedical datasets included in the Statlog benchmark:

Diabetes: it concerns the problem of diagnosing diabetes starting from 8 input variables; all 768 patients considered are females at least 21 years old of Pima Indian heritage: 268 of them are cases and 500 are controls.

Dna: it has the aim of recognizing acceptor and donor sites in primate gene sequences of length 60 (bases); the dataset consists of 3186 sequences, subdivided into three classes: acceptor, donor, none.

Heart: it deals with the detection of heart disease from a set of 13 input variables concerning patient status; the total sample of 270 elements is formed by 120 cases and 150 controls.
An application of Rulex (results)
Five classification algorithms have been considered: Logic Learning Machine (LLM), decision trees (DT), neural networks (NN), logistic regression (LOGIT), and k-nearest-neighbor (KNN).

Results obtained on an independent test set including 30% of the data have been compared both in terms of accuracy and of quantity of knowledge extracted (number of rules and average number of conditions).
Dataset  | LLM Accuracy | LLM # Rules | LLM # Cond. | DT Accuracy | DT # Rules | DT # Cond. | NN Accuracy | LOGIT Accuracy | KNN Accuracy
Diabetes | 77.40%       | 14          | 3.00        | 73.04%      | 56         | 4.02       | 75.22%      | 77.23%         | 69.13%
Dna      | 94.01%       | 64          | 10.86       | 90.04%      | 67         | 6.26       | 88.69%      | 92.57%         | 40.68%
Heart    | 85.19%       | 19          | 5           | 81.48%      | 18         | 3.67       | 80.25%      | 83.95%         | 80.25%
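The evaluation protocol described on this slide (a 30% independent test set, accuracy, number of rules and average number of conditions) can be sketched with a few helper functions. This is an illustrative outline only: the function names, the sample size of 100, the toy rule set and the toy labels are assumptions, and none of the numbers below reproduce the reported results.

```python
import random
from typing import List, Sequence, Tuple


def holdout_split(
    n_samples: int, test_fraction: float = 0.30, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and reserve `test_fraction` of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of test patterns whose predicted class matches the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def rule_set_size(rules: Sequence[Sequence[str]]) -> Tuple[int, float]:
    """Return (number of rules, average number of conditions per rule).

    Each rule is represented here simply as a list of condition strings.
    """
    n_rules = len(rules)
    avg_conditions = sum(len(rule) for rule in rules) / n_rules
    return n_rules, avg_conditions


# Hypothetical dataset of 100 patterns: 70 for training, 30 for testing.
train_idx, test_idx = holdout_split(100)

# Toy rule set and toy predictions, made up for this example.
toy_rules = [
    ["Pressure > 115", "Heart_rate < 100"],
    ["Pressure <= 115"],
]
n_rules, avg_cond = rule_set_size(toy_rules)
acc = accuracy(["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"])

print(len(train_idx), len(test_idx))  # 70 30
print(n_rules, avg_cond)              # 2 1.5
print(acc)                            # 0.75
```

Fewer rules with fewer conditions are easier to read, which is why the table reports these two figures next to accuracy for the rule-based methods (LLM and DT).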
Conclusions
A new suite, called Rulex, for the analysis of biomedical datasets through conventional and advanced machine learning techniques has been presented. It is able to solve classification, regression and clustering problems.

An intuitive graphical interface allows users to construct complex analysis processes through the composition of elementary tasks. Facilities for displaying and managing datasets are also provided.

Besides standard methods, such as logistic regression, k-nearest-neighbor, neural networks and decision trees, Rulex makes available a new approach, Logic Learning Machines (LLM), whose models are described by intelligible rules.

Results obtained in the analysis of three biomedical datasets belonging to the Statlog benchmark point out the good quality of LLM, which achieves excellent accuracy while providing understandable knowledge about the problem at hand.
Work in progress
Version 2.0 of Rulex is currently under beta testing. Several features have been added with the intent of giving researchers a simple but powerful tool for analyzing their own datasets.

Functionalities are continuously added to Rulex to improve the versatility of the suite. Suggestions from researchers are extremely important, since they allow us to offer a product that satisfies the real needs of users.

To this end, we are looking for researchers interested in trying the Rulex suite, reporting bugs and giving us advice on improving each part of the product.

If you are interested in testing Rulex for your specific application, please send me an email ([email protected]) and we will provide you with a fully functional copy of Rulex.
Thanks for your attention!
www.impara-ai.com