Web Community Mining and
Web Log Mining: Commodity
Cluster Based Execution
Romeo Zitarosa
Mining di Dati Web
Overview
Introduction
 Web Community Mining
 Web log mining on MIS
 Parallel Data Mining on PC Cluster
 Performance Evaluation
 Conclusion

Introduction

Two proposed applications of web
mining:
1) Extract web communities
2) Understand the behaviour of mobile
Internet users (usage mining)
Web Community Mining

Web Community
def: A web community is a collection of
web pages created by individuals or
associations that have common interests
in a specific topic.
Proposed technique

Starts from a set of seed pages

Based on RPA

Create a Community Chart
Authorities and Hubs

Authority: a page with good content on a
topic, linked to by many good hub pages.

Hub: a page with a list of hyperlinks to
valuable pages on a topic, i.e. one that points to
good authorities.

Community Core = Authorities + Hubs
Web Community Mining

Algorithm:
1. Start from a seed set.
2. Apply RPA to each seed:
build a web subgraph and extract
(using HITS) hubs and authorities.
3. Investigate how each seed derives other
seeds as related pages.
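
The hub and authority extraction in step 2 can be illustrated with a minimal HITS sketch. The function name `hits`, the toy link graph and the iteration count are assumptions made for illustration; the slides do not say how the subgraph is built or how many iterations are run.

```python
import math

def hits(links, iterations=50):
    """Iteratively compute hub and authority scores.

    links: dict mapping a page to the list of pages it links to.
    Returns (hubs, authorities) as dicts of normalized scores.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hubs = {p: 1.0 for p in pages}
    auths = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auths = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auths[t] += hubs[src]
        # Hub score: sum of the authority scores of the pages it points to.
        hubs = {p: sum(auths[t] for t in links.get(p, [])) for p in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auths, hubs):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hubs, auths

# Hypothetical toy subgraph: h1..h3 link to the candidate authorities a1, a2.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hubs, auths = hits(toy_links)
```

In this toy graph `a1` receives links from all three hubs, so it ends up with the highest authority score, while `h1` and `h2` outrank `h3` as hubs.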
Example
1. Suppose s derives t as a related
page and vice versa.
Then “s” and “t” are pointed to by
similar sets of hubs.
2. Suppose s derives t as a related
page but t doesn’t derive s.
Then “t” is pointed to by many different
hubs, so “t” derives a different set of
related pages.
Observation
In this way we define a symmetric
derivation relationship for identifying
communities.
Def. Community: a set of pages strongly
connected by the symmetric derivation relationship (“s.d.r.”).
Two communities are related if a
member of one community derives a
member of the other community.
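
The two definitions above can be sketched in a few lines. This is a simplification under stated assumptions: "strongly connected" is approximated here by connected components over symmetric pairs, and the names `find_communities`, `related_communities` and the toy `derives` table are hypothetical.

```python
from collections import defaultdict

def find_communities(derives):
    """Group pages connected by symmetric derivation relationships.

    derives: dict mapping a seed page to the set of pages it derives
    as related pages. Returns a list of sets of pages.
    """
    # Keep only symmetric pairs: s derives t AND t derives s.
    sym = defaultdict(set)
    for s, related in derives.items():
        for t in related:
            if s != t and s in derives.get(t, set()):
                sym[s].add(t)
                sym[t].add(s)
    # Assumption: each connected component is treated as one community.
    communities, seen = [], set()
    for start in derives:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(sym[page] - component)
        seen |= component
        communities.append(component)
    return communities

def related_communities(communities, derives):
    """Return index pairs (i, j) where a member of i derives a member of j."""
    pairs = set()
    for i, ci in enumerate(communities):
        for j, cj in enumerate(communities):
            if i != j and any(derives.get(s, set()) & cj for s in ci):
                pairs.add((i, j))
    return pairs

# Toy data: s and t derive each other, u and v derive each other,
# and t additionally derives u (a one-way relationship).
derives = {"s": {"t"}, "t": {"s", "u"}, "u": {"v"}, "v": {"u"}}
communities = find_communities(derives)
links = related_communities(communities, derives)
```

Here the one-way derivation from `t` to `u` does not merge the two groups, but it does make the first community related to the second.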
Web Community Chart

Def. A graph that consists of
communities as nodes and weighted
edges between nodes.
The weight of an edge represents the relevance
between the communities it connects.

We need a tool to browse Communities
Web Community Chart(2)

Label assigned manually

Box = list of URLs sorted by connectivity
score.

Def. Connectivity score:
the number of derivation relationships from the
node to other nodes of the community.
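
The connectivity score and the resulting ordering of a box can be sketched as follows. The function names, the tie-breaking rule and the example URLs are assumptions for illustration only.

```python
def connectivity_scores(community, derives):
    """Count, for each page, its derivation relationships to other members."""
    return {
        page: len((derives.get(page, set()) & community) - {page})
        for page in community
    }

def sorted_urls(community, derives):
    """List the community's URLs by descending connectivity score."""
    scores = connectivity_scores(community, derives)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(community, key=lambda url: (-scores[url], url))

# Hypothetical community of three pages.
community = {"a.example", "b.example", "c.example"}
derives = {
    "a.example": {"b.example", "c.example"},
    "b.example": {"a.example"},
    "c.example": set(),
}
box = sorted_urls(community, derives)
```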
Example
Mobile Info Search (MIS)

NTT laboratories

Goal: provide location-aware
information from the Internet by collecting,
structuring, filtering and organizing it.

www.kokono.net
kokono
There is a database-type resource
between the user and the information sources
(online maps, yellow pages, etc.)
MIS Functionalities

User Location Acquisition
- GPS, PHS, postal code

Location Oriented Robot-Based Search (kokono)
- search for documents close to a location
- display documents in order of the distance between
the location written in the document and the user's position

Location Oriented Meta Search
- backbone database accessed by
CGI programs.
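
The distance-ordered display described for the robot-based search can be sketched as below. This is only an illustration: the slides do not say which distance measure kokono uses, so the great-circle (haversine) formula, the function names, the search radius and the sample coordinates are all assumptions.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def search_nearby(docs, user_pos, radius_km):
    """Return (distance_km, title) pairs within the radius, nearest first.

    docs: list of (title, lat, lon), where (lat, lon) is the location
    written in the document.
    """
    hits = []
    for title, lat, lon in docs:
        d = haversine_km(user_pos[0], user_pos[1], lat, lon)
        if d <= radius_km:
            hits.append((d, title))
    return sorted(hits)

# Hypothetical documents and user position.
docs = [
    ("shop A", 35.690, 139.700),
    ("shop B", 35.682, 139.770),
    ("shop C", 34.690, 135.500),
]
results = search_nearby(docs, user_pos=(35.681, 139.767), radius_km=10)
```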
Association Rule Mining

Support, confidence

Hierarchy => Taxonomy

A hierarchy makes it possible to find rules specific not only to a location but
also to the wider area that covers that location.

Identify access patterns of MIS users.

Prefetch information.

Reduce access time.

Spatial information gives valuable information to mobile users.
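
How support, confidence and a location hierarchy interact can be shown with a small sketch. The taxonomy, the log entries and the function names are hypothetical; the idea illustrated is only that extending each transaction with its ancestors lets rules be found for the wider area as well as for the specific location.

```python
def extend_with_ancestors(transaction, parent):
    """Add every ancestor of each item according to the taxonomy."""
    extended = set(transaction)
    for item in transaction:
        while item in parent:
            item = parent[item]
            extended.add(item)
    return extended

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent + consequent) divided by support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical taxonomy (child -> parent) and access-log transactions.
parent = {"Shibuya": "Tokyo", "Shinjuku": "Tokyo"}
logs = [
    {"Shibuya", "restaurant"},
    {"Shinjuku", "restaurant"},
    {"Shinjuku", "hotel"},
    {"Shibuya", "restaurant"},
]
extended = [extend_with_ancestors(t, parent) for t in logs]

area_support = support({"Tokyo", "restaurant"}, extended)
local_support = support({"Shibuya", "restaurant"}, extended)
area_conf = confidence({"Tokyo"}, {"restaurant"}, extended)
```

With a minimum support between the two values, the rule for the wider area would be kept even though the rule for the single location would be pruned.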
Sequential Rule Mining

Sequential Patterns

Derive how different services are used together.
Example:
Define the plan after checking the weather:
Submit_weather = Weather Forecast →
submit_shop = Shop Info && shop_web = townpage →
Submit_kokono = KOKONOSearch → Submit_map = MAP
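
Counting how often such a sequence occurs in user sessions can be sketched as follows. The session data and the short action labels are hypothetical, and the sketch only checks ordered (not necessarily contiguous) occurrence, which is one common reading of a sequential pattern.

```python
def contains_sequence(session, pattern):
    """True if the pattern's steps occur in the session in the same order."""
    it = iter(session)
    # `step in it` advances the iterator, so later steps must appear later.
    return all(step in it for step in pattern)

def sequence_support(pattern, sessions):
    """Fraction of sessions that contain the pattern as a subsequence."""
    return sum(contains_sequence(s, pattern) for s in sessions) / len(sessions)

# Hypothetical sessions, one list of requested services per user visit.
sessions = [
    ["weather", "shop", "kokono", "map"],
    ["weather", "map"],
    ["shop", "weather"],
    ["weather", "news", "shop", "map"],
]
plan_support = sequence_support(["weather", "shop", "map"], sessions)
```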
Parallel DM and PC Cluster

Parallel Apriori
- every node keeps all candidate itemsets
- nodes scan the dataset independently
- nodes communicate only at the end of the phase
Problem: too much memory used!
Solution (partial): Hash Partitioned Apriori (HPA).
- candidates are partitioned using a hash function
- each node builds its own candidate itemsets
- a lot of disk I/O when the support is small
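
The hash-partitioning idea can be simulated in a single process. This is a sketch under stated assumptions, not the cluster implementation: nodes are modelled as list positions, "sending" an itemset is a dictionary update, and the CRC32-based `owner` function is simply one deterministic hash chosen for the example.

```python
import zlib
from collections import Counter
from itertools import combinations

def owner(itemset, n_nodes):
    """Deterministic hash that maps a candidate itemset to one node."""
    return zlib.crc32("|".join(itemset).encode()) % n_nodes

def hpa_pass(partitions, candidates, k):
    """One counting pass of a simulated hash-partitioned Apriori.

    partitions: one list of transactions per node (its local data).
    candidates: iterable of sorted k-item tuples.
    Returns a list with one Counter of support counts per node.
    """
    n_nodes = len(partitions)
    # Each node stores only the candidates hashed to it, not all of them.
    local = [set() for _ in range(n_nodes)]
    for c in candidates:
        local[owner(c, n_nodes)].add(c)
    counts = [Counter() for _ in range(n_nodes)]
    for transactions in partitions:  # every node scans its own data
        for t in transactions:
            for subset in combinations(sorted(t), k):
                dest = owner(subset, n_nodes)  # "send" subset to its owner
                if subset in local[dest]:
                    counts[dest][subset] += 1
    return counts

# Toy data: two nodes, each holding two transactions.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],
    [{"b", "c"}, {"a", "b", "c"}],
]
candidates = [("a", "b"), ("a", "c"), ("b", "c")]
per_node = hpa_pass(partitions, candidates, k=2)
total = sum(per_node, Counter())
```

Each candidate is counted on exactly one node, so the memory needed per node shrinks as nodes are added, at the price of moving itemsets between nodes during the scan.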
Parallel Algorithm for Association
Rule Mining

Non partitioned generalized (NPGM)

Hash Partitioned (HPGM)
- reduce communications

Hierarchical HPGM (H-HPGM)
- candidates whose root items are identical are allocated to
the same node

H-HPGM with Fine Grain Duplicates
(H-HPGM-FGD)
- uses the remaining free space to duplicate candidates
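
The allocation rule of H-HPGM can be sketched as below: a candidate is hashed on the roots of its items rather than on the items themselves. The taxonomy, the helper names and the CRC32 hash are assumptions for illustration.

```python
import zlib

def root_of(item, parent):
    """Follow child -> parent links up to the top of the taxonomy."""
    while item in parent:
        item = parent[item]
    return item

def hhpgm_node(itemset, parent, n_nodes):
    """Allocate a candidate itemset by hashing the roots of its items."""
    roots = sorted(root_of(i, parent) for i in itemset)
    return zlib.crc32("|".join(roots).encode()) % n_nodes

# Hypothetical taxonomy (child -> parent).
parent = {
    "sushi": "restaurant",
    "ramen": "restaurant",
    "restaurant": "food",
    "Shibuya": "Tokyo",
    "Shinjuku": "Tokyo",
}
nodes = [
    hhpgm_node(c, parent, n_nodes=4)
    for c in [("sushi", "Shibuya"), ("ramen", "Shinjuku"), ("restaurant", "Tokyo")]
]
```

All three candidates share the roots `food` and `Tokyo`, so they are assigned to the same node regardless of their level in the hierarchy.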
Performance evaluation
Obs.: execution time increases as the support
becomes smaller
Conclusion

Real web mining applications need
high-performance computing systems

A PC cluster, with its scalable
performance (and high costs), is a
promising platform…