Web Community Mining and Web Log Mining: Commodity Cluster Based Execution
Romeo Zitarosa
Mining di Dati Web (Web Data Mining)

Overview
- Introduction
- Web Community Mining
- Web log mining on MIS
- Parallel Data Mining on PC Cluster
- Performance Evaluation
- Conclusion

Introduction
Two applications of web mining are proposed:
1) Extracting web communities
2) Understanding the behaviour of mobile Internet users (usage mining)

Web Community Mining
Def. Web community: a collection of web pages created by individuals or associations that have common interests in a specific topic.

Proposed technique
- Starts from a set of seeds
- Based on RPA
- Creates a community chart

Authorities and Hubs
- Authority: a page with good content on a topic, linked to by many good hub pages.
- Hub: a page with a list of hyperlinks to valuable pages on a topic, i.e. one that points to good authorities.
- Community core = authorities + hubs

Web Community Mining
Algorithm:
1. Start from a seed set.
2. Apply RPA to each seed: build a web subgraph and extract hubs and authorities (using HITS).
3. Investigate how each seed derives other seeds as related pages.

Example
1. Suppose s derives t as a related page and vice versa: “s” and “t” are pointed to by similar sets of hubs.
2. Suppose s derives t as a related page but t does not derive s: “t” is pointed to by many different hubs, so “t” derives a different set of related pages.

Observation
In this way we define a symmetric derivation relationship (s.d.r.) for identifying communities.
Def. Community: a set of pages strongly connected by the s.d.r.
Two communities are related if a member of one community derives a member of the other community.

Web Community Chart
Def. A graph that consists of communities as nodes and weighted edges between nodes.
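
As a rough illustration of the symmetric derivation relationship and of the chart defined above, the sketch below groups seeds into communities and links related communities. It rests on assumptions not stated in the slides: the `derives` dictionary is made-up data, "strongly connected by the s.d.r." is simplified to connected components of mutual derivation, and the edge weight is taken to be the number of derivations crossing two communities.

```python
from collections import defaultdict

def symmetric_pairs(derives):
    """Return unordered pairs {s, t} where s derives t and t derives s."""
    pairs = set()
    for s, related in derives.items():
        for t in related:
            if s != t and s in derives.get(t, set()):
                pairs.add(frozenset((s, t)))
    return pairs

def communities(derives):
    """Group seeds into connected components of the symmetric relationship.

    Assumption: connected components stand in for "strongly connected by s.d.r.".
    """
    adj = defaultdict(set)
    for pair in symmetric_pairs(derives):
        s, t = tuple(pair)
        adj[s].add(t)
        adj[t].add(s)
    seen, result = set(), []
    for seed in derives:
        if seed in seen:
            continue
        stack, comp = [seed], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        result.append(frozenset(comp))
    return result

def chart_edges(derives, comms):
    """Link related communities.

    Assumption: weight = number of derivations that cross the two communities.
    """
    owner = {page: i for i, c in enumerate(comms) for page in c}
    weights = defaultdict(int)
    for s, related in derives.items():
        for t in related:
            if t in owner and owner[s] != owner[t]:
                key = tuple(sorted((owner[s], owner[t])))
                weights[key] += 1
    return dict(weights)

# Hypothetical derivation results: seed -> related pages it derives.
derives = {
    "a": {"b", "c"},
    "b": {"a"},
    "c": {"d"},
    "d": {"c"},
}
comms = communities(derives)
edges = chart_edges(derives, comms)
```

Here "a" and "b" derive each other, as do "c" and "d", so they form two communities; the one-way derivation from "a" to "c" makes the two communities related.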
The edge weight represents the relevance of the community.
We need a tool to browse communities.

Web Community Chart (2)
- Labels are assigned manually.
- Box = list of URLs sorted by connectivity score.
Def. Connectivity score: the number of derivation relationships from the node to the other nodes of the community.

Example

Mobile Info Search (MIS)
NTT laboratories
Goal: provide location-aware information from the Internet by collecting, structuring, filtering and organizing it.
www.kokono.net

kokono
There is a database-type resource between the user and the information sources (online maps, yellow pages, etc.).

MIS Functionalities
User Location Acquisition
- GPS, PHS, postal number
Location Oriented Robot-Based Search (kokono)
- searches for documents close to a location
- displays documents ordered by the distance between the location written in the document and the user's position
Location Oriented Meta Search
- backbone database accessed by CGI programs

Association Rule Mining
- Support, confidence
- Hierarchy => taxonomy
- The hierarchy allows finding not only rules specific to a location but also rules for the wider area that covers that location.
- Identify access patterns of MIS users.
- Prefetch information.
- Reduce access time.
- Spatial information gives valuable information to mobile users.

Sequential Rule Mining
Sequential patterns: derive how different services are used together.
Example: define the plan after checking the weather:
Submit_weather = Weather Forecast
submit_shop = Shop Info && shop_web = townpage
Submit_kokono = KOKONOSearch
Submit_map = MAP

Parallel DM and PC Cluster
Parallel Apriori
- nodes keep all candidate itemsets
- each node scans the dataset independently
- nodes communicate only at the end of the phase
Problem: too much memory used!
Solution (partial): Hash Partitioned Apriori (HPA).
- candidates are partitioned using a hash function
- each node builds candidate itemsets
- a lot of disk I/O when support is small

Parallel Algorithm for Association Rule Mining
Non Partitioned Generalized (NPGM)
Hash Partitioned (HPGM)
- reduces communication
Hierarchical HPGM (H-HPGM)
- candidates whose root is identical are allocated on the same node
H-HPGM with Fine Grain Duplicates (H-HPGM-FGD)
- uses remaining free space

Performance evaluation
Obs.: execution time increases as the support becomes small.

Conclusion
- Real web mining applications need high-performance computing systems.
- A PC cluster, with its scalable performance (and high costs), is a promising platform…
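
To make the hash-partitioning idea from the Parallel DM slides more concrete, here is a minimal single-process sketch of one counting pass over 2-itemsets. It is only an illustration under assumptions: the nodes are simulated as entries of a list, `zlib.crc32` is an arbitrary choice of hash function, and the transactions (named after the MIS services) and the threshold are made-up data.

```python
import zlib
from collections import Counter
from itertools import combinations

def owner(itemset, n_nodes):
    """Pick the node responsible for a candidate with a deterministic hash."""
    key = ",".join(itemset).encode()
    return zlib.crc32(key) % n_nodes

def hpa_count_pairs(partitions, n_nodes):
    """Simulate one hash-partitioned pass; partitions[i] is node i's local data."""
    tables = [Counter() for _ in range(n_nodes)]
    for local_transactions in partitions:
        for transaction in local_transactions:
            for pair in combinations(sorted(set(transaction)), 2):
                # On a real cluster this would be a message sent to the owner node.
                tables[owner(pair, n_nodes)][pair] += 1
    return tables

def large_itemsets(tables, min_count):
    """Each node filters its own candidates; the results are then gathered."""
    result = {}
    for table in tables:
        for itemset, count in table.items():
            if count >= min_count:
                result[itemset] = count
    return result

# Hypothetical access logs, already split across two nodes.
partitions = [
    [["weather", "shop", "map"], ["weather", "map"]],
    [["weather", "shop"], ["kokono", "map"], ["weather", "shop", "map"]],
]
tables = hpa_count_pairs(partitions, n_nodes=2)
frequent = large_itemsets(tables, min_count=3)
```

Because every candidate is counted on exactly one node, no node has to hold the full candidate set, which is the memory problem of plain Parallel Apriori noted in the slides; the price is that itemsets must be shipped to their owner during the scan.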