Extracting Knowledge from Biomedical Data through Logic Learning Machines and Rulex

Marco Muselli
Institute of Electronics, Computer and Telecommunication Engineering
National Research Council of Italy, Genova, Italy
[email protected]
NETTAB 2012

Extracting knowledge from data

Basic problem: infer some knowledge about a biological phenomenon of interest starting from a sample of data.

Type of knowledge:
• Correlation, statistical measures
• Feature ranking, analysis of relevance
• Prediction, clustering, risk analysis
• Intelligible model (rules)

[Figure label: KNOWLEDGE]

Rule generation methods

Extract models described by a set of intelligible rules in if-then form:

If Pressure > 115 and Heart_rate < 100 then Disease = Yes

• Aggregative approach: emphasis on similarities!
• Divide-and-conquer approach: emphasis on differences!

Statistical vs. Machine learning methods

Statistical methods:
• Simpler to use, with a large body of accumulated experience
• Plenty of commercial and free tools available
• Limited quantity of knowledge extracted
• A priori hypotheses on probability distributions

Machine learning methods:
• Their application is not straightforward, and accumulated experience is more limited
• Commercial tools are often extensions of statistical packages; free programs are not so user-friendly
• Relevant quantity of knowledge extracted
• No a priori hypothesis is required

Machine learning software

Commercial software:
• SAS Enterprise Miner (www.sas.com/technologies/analytics/datamining/miner)
• IBM SPSS Statistics Software (www-01.ibm.com/software/analytics/spss/products/statistics)
• Salford Systems Data Mining Suite (www.salford-systems.com)
• Statistica Data Miner (www.statsoft.com/products/data-mining-solutions)

Free software:
• WEKA (www.cs.waikato.ac.nz/ml/weka)
• RapidMiner (rapid-i.com)
• Orange (orange.biolab.si)
• Machine Learning & Statistical Learning in R language (cran.r-project.org/web/views/MachineLearning.html)

RULEX® Suite

The name RULEX is a contraction of RULe Extraction, since the suite is especially devoted to generating intelligible rules, although a wide range of statistical and machine learning approaches will be made available.
An intuitive graphical interface makes it easy to apply standard and advanced algorithms for analyzing any dataset of interest, providing solutions to classification, regression and clustering problems.

The software suite is in rapid evolution, so the number and the functionality of the available tasks keep growing.

The RULEX® suite (contraction of RULe Extraction), developed by Impara Srl (www.impara-ai.com), a spin-off of the National Research Council of Italy, offers a new, simple and powerful tool for extracting knowledge from real-world data.

RULEX GUI

[Screenshot of the Rulex GUI. Labeled areas: Dataset panel, Tasks, Stage, Component panel, Source.]

Logic Learning Machine

Besides standard techniques, such as:
• Logistic
• Decision trees
• K-nearest-neighbor
• Neural networks
Rulex offers the possibility of applying an original proprietary approach, named Logic Learning Machine (LLM), which represents an efficient implementation of the switching neural network model (Muselli, 2006).

Logic Learning Machine

The LLM approach is based on monotone Boolean function synthesis (Shadow Clustering) and adopts an aggregative policy: at each iteration, some patterns belonging to the same output class are clustered to produce an intelligible rule.
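To make the notion of an intelligible rule concrete, the following minimal Python sketch represents an if-then rule as a conjunction of threshold conditions and applies it to a patient record. It reuses the example rule shown earlier in the talk. The names `Condition`, `Rule`, `predict` and the first-matching-rule policy are hypothetical choices made for illustration; this is neither the Rulex API nor the LLM training procedure, which the slides do not detail.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple
import operator

# Comparison operators allowed in a condition (illustrative choice).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Condition:
    """One test on a single input variable, e.g. Pressure > 115."""
    feature: str
    op: str
    threshold: float

    def holds(self, record: Dict[str, float]) -> bool:
        return OPS[self.op](record[self.feature], self.threshold)


@dataclass(frozen=True)
class Rule:
    """A conjunction of conditions (premise) leading to an output class."""
    conditions: Tuple[Condition, ...]
    output: str

    def matches(self, record: Dict[str, float]) -> bool:
        return all(c.holds(record) for c in self.conditions)

    def __str__(self) -> str:
        premise = " and ".join(
            f"{c.feature} {c.op} {c.threshold:g}" for c in self.conditions
        )
        return f"If {premise} then {self.output}"


def predict(rules: Sequence[Rule], record: Dict[str, float],
            default: Optional[str] = None) -> Optional[str]:
    """Return the output of the first matching rule (assumed policy)."""
    for rule in rules:
        if rule.matches(record):
            return rule.output
    return default


# The example rule from the "Rule generation methods" slide.
example_rule = Rule(
    conditions=(Condition("Pressure", ">", 115),
                Condition("Heart_rate", "<", 100)),
    output="Disease = Yes",
)

if __name__ == "__main__":
    print(example_rule)
    print(predict([example_rule], {"Pressure": 130, "Heart_rate": 80}))
```

Keeping each rule as an explicit list of conditions is what makes the model readable: the number of rules and the number of conditions per rule can be counted directly.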
Since the training process occurs in a binary projected space, the application of LLM must be preceded by a discretization task that finds proper cutoffs for ordered (continuous and discrete) input variables.

LLM solves classification problems by producing sets of intelligible rules whose accuracy is comparable or superior to that of the best machine learning methods.

An application in biomedical analysis

The functionalities of Rulex have been verified by analyzing three biomedical datasets included in the Statlog benchmark:

• Diabetes: the problem of diagnosing diabetes starting from 8 input variables; all 768 patients considered are females at least 21 years old of Pima Indian heritage: 268 of them are cases and 500 are controls.
• Dna: the aim is to recognize acceptor and donor sites in primate gene sequences of length 60 (bases); the dataset consists of 3186 sequences, subdivided into three classes: acceptor, donor, none.
• Heart: the detection of heart disease from a set of 13 input variables concerning patient status; the total sample of 270 elements is formed by 120 cases and 150 controls.

An application of Rulex (results)

Five classification algorithms have been considered: LLM, DT, NN, LOGIT, and KNN.

Results obtained on an independent test set including 30% of the data have been compared both in terms of accuracy and of quantity of knowledge extracted (number of rules and average number of conditions).
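The evaluation protocol just described can be sketched in plain Python as follows. The slides do not say how the 30% test set was drawn, so the random shuffle with a fixed seed is an assumption, and the function names `holdout_split`, `accuracy` and `rule_set_size` are hypothetical helpers, not part of Rulex.

```python
import random
from typing import List, Sequence, Tuple


def holdout_split(n_samples: int, test_fraction: float = 0.3,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a random hold-out split.

    Assumption: a simple random split; the slides do not specify
    whether the original split was random or stratified.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of test patterns whose predicted class is correct."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rule_set_size(rules: Sequence[Sequence[str]]) -> Tuple[int, float]:
    """Number of rules and average number of conditions per rule.

    Each rule is given as a sequence of its conditions.
    """
    n_rules = len(rules)
    avg_conditions = sum(len(r) for r in rules) / n_rules if n_rules else 0.0
    return n_rules, avg_conditions


if __name__ == "__main__":
    # 768 is the size of the Diabetes dataset described in the talk.
    train_idx, test_idx = holdout_split(768)
    print(len(train_idx), len(test_idx))
    # Toy labels and rules, invented purely to exercise the helpers.
    print(accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]))
    print(rule_set_size([["c1", "c2"], ["c1", "c2", "c3", "c4"]]))
```

Reporting the rule count and the average number of conditions next to accuracy gives a rough measure of how compact, and therefore how readable, each rule-based model is.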
            LLM                            DT                             NN          LOGIT       KNN
Dataset     Accuracy   # Rules   # Cond.   Accuracy   # Rules   # Cond.   Accuracy    Accuracy    Accuracy
Diabetes    77.40%     14        3.00      73.04%     56        4.02      75.22%      77.23%      69.13%
Dna         94.01%     64        10.86     90.04%     67        6.26      88.69%      92.57%      40.68%
Heart       85.19%     19        5         81.48%     18        3.67      80.25%      83.95%      80.25%

Conclusions

A new suite, called Rulex, for the analysis of biomedical datasets through conventional and advanced machine learning techniques has been presented. It is able to solve classification, regression and clustering problems.

An intuitive graphical interface makes it possible to construct complex analysis processes through the composition of elementary tasks. Facilities for displaying and managing datasets are also provided.

Besides standard methods, like logistic, k-nearest-neighbor, neural networks and decision trees, Rulex makes available a new approach, logic learning machines (LLM), whose models are described by intelligible rules.

Results obtained from the analysis of three biomedical datasets belonging to the Statlog benchmark point out the good quality of LLM, which achieves excellent accuracy while providing understandable knowledge about the problem at hand.

Work in progress

Functionalities are continuously added to Rulex to improve the versatility of the suite. Suggestions from researchers are extremely important, since they allow us to offer a product that satisfies the real needs of users.
To this end, we are looking for researchers interested in trying the Rulex suite, reporting bugs and giving us advice for improving each part of the product.

If you are interested in testing Rulex for your specific application, please send me an email ([email protected]) and we will provide you with a fully functional copy of Rulex.

Version 2.0 of Rulex is currently under beta testing. Several features have been added with the intent of giving researchers a simple but powerful tool for analyzing their own datasets.

Thanks for your attention!

www.impara-ai.com