Enhanced Content Delivery
Action 2: Mine the Web
Industrial Day
Rome, 10 June 2004
Action 2 - Partners
Dipartimento di Informatica, Università di Pisa
KDD & HPC Labs
ISTI-CNR, Pisa
ICAR-CNR, Cosenza
ECD - Industrial Day, Rome, 10 June 2004
Action 2 – Mine the Web
• The project: four Work Packages
(Action Coordinator: Dr. Fosca Giannotti, ISTI-CNR)
  - Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR)
    WP Coordinator: Dr. Salvatore Ruggieri, Dip. Informatica
  - Work Package 2.2. Indexing and compression (UNIPI)
    WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica
  - Work Package 2.3. Managing Terabytes (ISTI, ICAR)
    WP Coordinator: Dr. Raffaele Perego, ISTI-CNR
  - Work Package 2.4. Participatory Search Services (UNIPI)
    WP Coordinator: Prof. Maria Simi, Dip. Informatica
Action 2 – Mine the Web
• The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way complementary to Action 1
• The focus is on Delivering Enhanced Web Contents to (Communities of) Users:
  - Exploiting Web Mining to extract knowledge/models that can be used to enhance the efficacy and efficiency of the various phases of the information search process
  - Designing, validating, and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users
Motivations
• On-line data grows rapidly:
  - 50+M new pages/day (source: IBM)
  - 100+k news, articles/day (source: IBM)
  - Databases, digital libraries, etc.
• Internet use tracking produces additional interesting data:
  - Server logs, WSE logs, network traffic logs
• Goldman Sachs estimates (2002):
“between 80 and 90 percent of information on the Internet and corporate networks is unstructured”
Motivations
• The limits of the current means of access to web contents are becoming clear
  - Low precision and quality, difficulty of matching users' subjective relevance
    - over-abundance of low-quality web material
  - Low coverage and freshness
    - much relevant information is in the hidden web
    - ranking mechanisms penalize important pages that have just entered the scene
• Difficulties in
  - managing size, complexity, heterogeneity
  - identifying patterns and trends within huge amounts of unstructured contents
• Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge
Web Mining
Web Mining: exploiting Data Mining techniques with data coming from the Web
Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories
Goal: assist users or site owners in finding something useful/interesting/relevant
• User-Centric View (Client-Side)
  - discovery of documents on a subject
  - discovery of semantically related documents or document segments
  - extraction of relevant knowledge about a subject from multiple sources
• Owner-Centric View (Server-Side)
  - increasing contact / conversion efficiency (Web marketing)
  - targeted promotion of goods, services, products, ads
  - measuring effectiveness of site content / structure
  - providing dynamic personalized services or content
Web Mining Taxonomy
Web Mining
  - Web Usage Mining
  - Web Content Mining
  - Web Structure Mining
Sample Web server log entries:
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304
131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304
131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224
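Log entries like the ones on this slide are the raw material of Web Usage Mining. The following is a minimal Python sketch of how such lines could be turned into structured records. The regular expression and the field names (`host`, `time`, `path`, `status`, `size`) are illustrative assumptions, not the project's actual preprocessing code.

```python
import re
from datetime import datetime

# Pattern for access-log lines shaped like the ones on the slide.
# The trailing size field is optional because the 304 entries show none.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+|-))?'
)

def parse_log_line(line):
    """Return a dict describing one request, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("size")
    return {
        "host": match.group("host"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        # No size (or "-") means no body was transferred.
        "size": int(size) if size and size != "-" else None,
    }
```

Records of this kind can then be grouped by host and time to approximate user sessions.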
Web Mining Applications
• Web Usage Mining
  - discovering customer preference and behavior
  - Web personalization / collaborative filtering
  - adaptive Web sites / improving Web site organization
  - e-business intelligence, etc.
• Web Content Mining
  - information filtering / knowledge extraction
  - Web document categorization
  - discovery of ontologies on the Web, etc.
• Web Structure Mining
  - Finding "quality" or "authoritative" sites based on linkage and citations
    - IBM CLEVER project
    - Google
    - Etc.
Some related projects
• WebFountain - IBM
• WebBase - Stanford DBGroup
WebFountain
IBM
[Diagram: input sources (World-Wide Web, News Forums, Weblogs, etc.; Newspapers, Magazines, etc.; Customer Electronic Text) feed WebFountain, an Infrastructure for Advanced Text Analytics that finds patterns, trends and relationships in text]
Application Examples:
• Marketing
• Intelligence
• Research
WebFountain: an infrastructure for Advanced Text Analytics applications
• Cluster capacity: ½ Petabyte
• Number of pages in store: 2,000,000,000
• Number of pages crawled per day: 25,000,000
• Number of pages mined per second: 10,000
• Number of 73GB hard drives: 3674
• Number of CPUs: 1231
• Number of scientists and researchers who have contributed to WebFountain technology: 250
• Patents pending: 100
• Patents issued: 75
• Megabytes/sec traffic coming in from the Internet: 70
• Time to complete query: 5 minutes, 22 seconds
• Number of countries contributing to technology: 5
PROJECT WF INFRASTRUCTURE
[Diagram components: Internet, Intranets, News Feeds, Customer DBs, 3rd Party DBs; Crawler, Structured Data Gatherer; Data Store, Index(es), Information Miners; Application Server, Customers; Communications Infrastructure, Cluster Management System]
WebFountain: Reputation Tracking
WebBase
Stanford DBgroup
WebBase Challenges
• Archiving
• Scalability
  - crawling
  - archive distribution
  - index construction
  - storage
• Consistency
  - freshness
  - versions
• Dissemination
  - “units”
  - coordination
• IP Management
  - copy access
  - link access
  - access control
• Hidden Web
• Topic-Specific Collection Building
Action 2 – Mine the Web: application scenario
• So far, hardly any approach analyzes how a given group of users accesses the Web with the aim of exploiting usage information to provide the users of that group with enhanced access to web resources
• We think that it is possible to learn, from the usage data of a group of web users, new models and patterns that, in combination with document content and structure, may yield enhanced content access and delivery
  - better search services, better categorization and document classification services, better question answering services
Action 2 – Mine the Web
Ambitious objective:
Exploit the combination of Web data about USAGE, STRUCTURE, CONTENT originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view
Developing solutions that are:
  - Innovative w.r.t. the state of the art
  - Appropriate for the Web domain
Virtual Organizations
[Figure: a Virtual Community and the Internet]
Tracking Virtual Organizations
• Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information
• Network traffic data provide detailed information about:
  - Usage
    - Preferred sites, user sessions
  - Content
    - Accessed documents
  - Structure
    - From client sessions we can build the usage Web subgraph
    - By parsing the retrieved documents we can build the corresponding link graph
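The two graph constructions named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes a session is an ordered list of URLs and that retrieved documents are available as a `{url: html}` mapping. The function names `traffic_graph` and `link_graph` are hypothetical and are not taken from the WOS code.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

def traffic_graph(sessions):
    """Count page-to-page transitions observed inside client sessions."""
    edges = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            edges[(src, dst)] += 1
    return edges

class _LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag in a document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def link_graph(documents):
    """Build the set of hyperlink edges from a {url: html} mapping."""
    edges = set()
    for url, html in documents.items():
        collector = _LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            edges.add((url, urljoin(url, href)))
    return edges
```

Intersecting the keys of the traffic graph with the link graph gives the hyperlinks that the community actually follows, which is one simple way of combining the two views.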
Tracking Virtual Organizations
[Figure: the Virtual Community's link graph, traffic graph, and combined link and traffic graph]
We need an infrastructure: the Web Object Store (WOS)
• A Web Data Management System optimized to efficiently handle content, usage, and structure web data
  - Purpose: enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view
  - Manage large collections of
    - Web pages
    - Preprocessed usage data
    - Structure data
    collected within our virtual community
WOS and related activities
• The WOS:
  - Persistent store of objects
  - Web data management system for web content, structure and usage data
  - Management of data at many abstraction levels
  - Fast development of new applications
  - Easy C++ annotation of new persistent objects
  - Read and write data in tables
  - Etc.
• Efficient and scalable access methods:
  - IXE b-trees, full-text indexes
  - search in compressed data
  - distributed architecture
• Efficient and scalable storage:
  - IXE persistent objects
  - Enhanced compression methods
• Population:
  - Data cleaning, preprocessing, filtering
  - raw traffic data of our community
  - IXE Crawler
  - Participatory search
• Related activities:
  - Clustering Emails
  - Caching of web Documents and of Query results
  - Clustering/Pattern/Classification Web Mining algorithms
  - Efficient and scalable pattern mining and clustering algorithms
  - Clustering/categorizing query results snippets
  - Clustering XML documents
WOS applications
• Some innovative applications are currently being pursued within our project:
  - Characterization, on the basis of usage only or of usage + contents + structure, of new important emerging sites or of irrelevant sites (e.g., advertising sites)
    - crucial for steering the crawler of the community web repository towards fresh, relevant documents while avoiding unimportant ones
  - Page ranking based also on usage information, to achieve a more accurate and dynamic measurement of document relevance
  - Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
  - Caching and clustering of web search results
WOS population: usage data (WP 2.1)
• We collected proxy-level IP traffic over long periods, originating from the SERRA network (domain unipi.it)
  - The whole University of Pisa
• Many-to-many interactions
• Inter-site user sessions
• Massive data
  - Millions of HTTP requests per day
  - ~1 GB/day of raw data
WOS population: content data (WP 2.4)
• Methods to gather contents to populate the Web Object Store
  - IXE Crawler
  - Participatory Search System (main activity this year)
  - Hidden Web Search
WOS population: content data (WP 2.4)
• IXE crawler
[Diagram: crawl loop. init with the initial urls; get next url; get page from the Internet; extract urls; output: web pages]
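The loop in the diagram (init, get next url, get page, extract urls) can be sketched as follows. This is a deliberately simple, single-threaded Python illustration, not the IXE crawler: the `fetch` callable is a hypothetical parameter standing in for the real network layer, and `max_pages` is an assumed safety limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_urls(base_url, html):
    """Return absolute, fragment-free URLs linked from a page."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]

def crawl(initial_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    frontier = deque(initial_urls)      # init: seed the frontier
    seen = set(initial_urls)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()        # get next url
        html = fetch(url)               # get page
        if html is None:
            continue
        pages[url] = html
        for link in extract_urls(url, html):   # extract urls
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

Passing `fetch` as a parameter keeps the loop testable without network access; a real crawler would add politeness delays, robots.txt handling and parallel fetching.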
IXE Crawler
• Parallel/distributed crawler
• High performance through:
  - asynchronous I/O (500 connections/thread)
  - asynchronous DNS resolution
  - keep-alive connections
  - multi-threading
  - URL compression
• 9 Mb/sec transfer rate (7 times that of the nutch.org crawler)
Participatory search: the idea
• Participatory search:
  - each participant builds an index of its local contents and sends it to a central server
  - the central server implements a community search service by collecting and merging the participants' indexes
• A model that fits community needs for dedicated search services
• A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa)
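The division of labour described on this slide can be illustrated with a toy inverted index in Python. It is a sketch under simplifying assumptions (whitespace tokenization, in-memory sets, AND queries); `build_local_index` and `CommunitySearchServer` are hypothetical names and do not describe the actual system.

```python
from collections import defaultdict

def build_local_index(documents):
    """Participant side: map each term to the set of local URLs containing it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in text.lower().split():
            index[term].add(url)
    return dict(index)

class CommunitySearchServer:
    """Central side: keep the latest index of each participant and merge them."""
    def __init__(self):
        self.participant_indexes = {}

    def publish(self, participant, local_index):
        # A new publication simply replaces the participant's previous index.
        self.participant_indexes[participant] = local_index

    def _postings(self, term):
        merged = set()
        for index in self.participant_indexes.values():
            merged |= index.get(term, set())
        return merged

    def search(self, query):
        """Return the URLs that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        results = self._postings(terms[0])
        for term in terms[1:]:
            results &= self._postings(term)
        return results
```

Because each participant decides when to call `publish`, index freshness is under the participant's control and no external crawler has to be coordinated.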
Participatory Search
[Figure: three search architectures, showing the flow of Documents, Search Index and Search results. Centralized: a single CIS node. Participatory: several CI nodes and one S node. Distributed: several CIS nodes.]
C – Crawler
I – Indexer
S – Search Engine
Participatory Search: benefits
• Participants are in charge of
  - selecting what to index and publish
  - deciding when to publish (no need for coordination with an external crawler)
• Control over index update and freshness
• Publishing of Hidden Web content
Storage and access methods: compression (WP 2.2)
• Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee
[Diagram: the input s goes through the Booster, which uses A, and produces the output c']
• The better A is, the better Aboost is
• The more compressible s is, the better Aboost is
• Qualitatively, we show that
  - c' (the output of Aboost) is shorter than c (the output of A), if s is compressible
  - Time(Aboost) = Time(A), i.e. no slowdown
  - A is used as a black box
• Key components: Burrows-Wheeler Transform, Suffix Tree, and a greedy processing of them
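Of the key components listed, the Burrows-Wheeler Transform is easy to show in isolation. The sketch below is a naive Python version based on sorting all rotations (quadratic space, for illustration only); it is not the booster itself and does not use a suffix tree. The end marker `$` is an assumption and must not occur in the input.

```python
def bwt(s, end="$"):
    """Burrows-Wheeler Transform via sorted rotations of s + end marker."""
    if end in s:
        raise ValueError("end marker must not occur in the input")
    text = s + end
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, end="$"):
    """Invert the transform by repeatedly prepending and sorting columns."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(last_column[i] + table[i] for i in range(len(last_column)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]
```

The transform tends to group equal characters together, as in `bwt("banana")`, which yields `annb$aa`; this grouping is what makes the transformed string easier to compress.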
Storage and access methods (WP 2.1 and 2.2)
• Repository of URLs
  - Compressed
  - Prefix and Suffix search within URLs
• Search by hostname, path, file-ext, …
select count(*)
from …
where url LIKE 'http://%.it/%.asp'
• Up to two orders of magnitude faster than a sequential scan or a B-tree
  - Space occupancy << B-tree
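To make prefix and suffix search over URLs concrete, here is a small Python sketch that uses two sorted lists and binary search. It is a stand-in technique chosen for brevity, uncompressed and unrelated to the data structure actually used in the project; the class name `UrlIndex` is hypothetical.

```python
import bisect

class UrlIndex:
    """Prefix and suffix lookup over a static set of URLs via sorted lists."""
    def __init__(self, urls):
        self.forward = sorted(set(urls))
        # Reversed strings turn a suffix query into a prefix query.
        self.backward = sorted(url[::-1] for url in self.forward)

    @staticmethod
    def _prefix_range(sorted_keys, prefix):
        low = bisect.bisect_left(sorted_keys, prefix)
        # chr(0x10FFFF) is the largest code point, so it bounds the range above.
        high = bisect.bisect_right(sorted_keys, prefix + chr(0x10FFFF))
        return sorted_keys[low:high]

    def with_prefix(self, prefix):
        return self._prefix_range(self.forward, prefix)

    def with_suffix(self, suffix):
        matches = self._prefix_range(self.backward, suffix[::-1])
        return sorted(key[::-1] for key in matches)
```

Each lookup costs a binary search plus the size of the answer, instead of the full scan that a `LIKE` pattern with a leading wildcard typically requires.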
Storage and access methods: index compression (WP 2.3)
• Assigning DocIDs in a clever way could improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps (differences between consecutive DocIDs in a posting list).
• Clustering property: within each posting list there are dense zones (i.e. many small DGaps).
• Our problem consists of enhancing the Clustering Property of posting lists.
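The effect of DocID assignment on compressed size can be checked with a short Python sketch. It assumes a standard variable-byte code with 7 payload bits per byte, and the posting lists used in the usage note are made-up examples, not project data.

```python
def d_gaps(postings):
    """Differences between consecutive DocIDs of a sorted posting list."""
    gaps, previous = [], 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def vbyte_length(value):
    """Bytes needed by a variable-byte code carrying 7 payload bits per byte."""
    length = 1
    while value >= 128:
        value >>= 7
        length += 1
    return length

def encoded_size(postings):
    """Total size in bytes of a posting list stored as variable-byte DGaps."""
    return sum(vbyte_length(gap) for gap in d_gaps(sorted(postings)))

def reassign(postings, mapping):
    """Apply a DocID reassignment (old id -> new id) to a posting list."""
    return sorted(mapping[doc_id] for doc_id in postings)
```

For instance, the scattered list `[3, 5000, 90000, 2000000]` needs 9 bytes, while the same four documents renumbered as `[1, 2, 3, 4]` need only 4, which is the gain that a better DocID assignment is after.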
Compression Enhancement
Content delivery (WP 2.1, 2.2 and 2.3)
• Web Caching
  - Mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)
• Recommendation system
  - (On-line/Off-line) mining of web sessions aimed at profiling users and recommending related pages to them (WP 2.1, 2.3)
• Transactional Clustering
  - Clustering specialized for transactional data, aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)
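As a reference point for the caching item, the plain LRU policy that the mining work aims to improve on can be sketched in Python as below. This is only the baseline policy, with hit and miss counters added as an assumption so that alternative policies could be compared on the same request trace.

```python
from collections import OrderedDict

class LRUCache:
    """Baseline least-recently-used document cache with hit/miss counters."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # url -> cached flag, oldest first
        self.hits = 0
        self.misses = 0

    def request(self, url):
        """Record a request; return True on a cache hit, False on a miss."""
        if url in self.entries:
            self.entries.move_to_end(url)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.entries[url] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used
        return False
```

Replaying a proxy log through `request` gives the hit ratio of plain LRU on that trace.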
Content delivery (WP 2.3)
• SUGGEST: a recommendation system made up of two distinct modules
  - Offline: performs model extraction with a clustering algorithm that partitions the Usage Graph
  - Online: performs user classification and suggestion generation
• The WOS remarkably shortened implementation time (< 500 C++ lines)
  - We used three WOS objects (Citation, PageView, Session) to produce a persistent clustering structure (sCluster)
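The two-module structure can be illustrated with a toy Python sketch. It is not the SUGGEST algorithm: as a stand-in for the clustering step it builds a co-occurrence usage graph, drops edges lighter than an assumed threshold `min_weight`, and takes connected components as clusters. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_usage_graph(sessions):
    """Offline: weight each page pair by the number of sessions containing both."""
    weights = defaultdict(int)
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            weights[(a, b)] += 1
    return weights

def cluster_pages(weights, min_weight):
    """Offline: connected components of the graph restricted to heavy edges."""
    neighbors = defaultdict(set)
    for (a, b), weight in weights.items():
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)
    clusters, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            page = stack.pop()
            if page not in component:
                component.add(page)
                stack.extend(neighbors[page] - component)
        seen |= component
        clusters.append(component)
    return clusters

def suggest(clusters, active_session):
    """Online: pick the cluster that best overlaps the session, suggest the rest."""
    visited = set(active_session)
    best = max(clusters, key=lambda cluster: len(cluster & visited), default=set())
    if not best & visited:
        return []
    return sorted(best - visited)
```

The offline functions would run periodically over collected sessions, while `suggest` is cheap enough to be called on every request.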
Content delivery (WP 2.2)
Goal: Retrieve the pages which match the user needs.
This is a very difficult task, given that:
  - the size of the Web is increasing, and so is the number of answers
  - Web coverage is a problem for a single search engine
  - Web pages are heterogeneous
  - User needs are subjective and time-varying
  - the “list of keywords” paradigm for a user query may be ambiguous
SnakeT: clusters the web snippets returned by one or more search engines into hierarchically labeled folders, which are created on the fly to capture the various meanings of the answers returned for a user query
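A much simplified, single-level version of the idea can be sketched in Python: each snippet goes into a folder labeled with the most widely shared non-query term it contains. This is a toy illustration, not SnakeT's algorithm (which builds hierarchical labels); the stopword list and the `other` fallback folder are assumptions.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "with"})

def snippet_terms(snippet, query_terms):
    """Lowercased words of a snippet, minus stopwords and the query terms."""
    words = re.findall(r"[a-z]+", snippet.lower())
    return {w for w in words if w not in STOPWORDS and w not in query_terms}

def label_folders(snippets, query):
    """Group snippets into flat folders labeled by a shared frequent term."""
    query_terms = set(query.lower().split())
    per_snippet = [snippet_terms(s, query_terms) for s in snippets]
    doc_freq = Counter(term for terms in per_snippet for term in terms)
    folders = defaultdict(list)
    for snippet, terms in zip(snippets, per_snippet):
        shared = [t for t in terms if doc_freq[t] > 1]
        # Ties are broken alphabetically so that the result is deterministic.
        label = max(shared, key=lambda t: (doc_freq[t], t)) if shared else "other"
        folders[label].append(snippet)
    return dict(folders)
```

For an ambiguous query such as "jaguar", snippets about cars and snippets about the animal end up in different folders, which is the kind of disambiguation the slide describes.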
SnakeT: An example of use
Look at the DEMO
Content delivery (WP 2.1)
• Clustering of
  - E-mails (Manco)
  - XML documents (Chiara)
  - ??
Ongoing and future activities
• Work in progress
  - Pursuing our goal of exploiting USAGE, STRUCTURE, CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web
  - Implementation of additional WOS layers: compression booster, XML clustering
• Future work (medium-long term)
  - WOS, final version
  - Community-oriented ranking
  - Content (news, XML, ...) clustering
  - Cooperation with Nutch.org (Doug Cutting in Pisa next October)
  - etc.
Deployment scenarios
• Concerning the role of the WOS and of the ECD applications, three (non-exclusive) deployment scenarios can be envisaged
  - The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
  - The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to IBM's WebFountain project
  - The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase
Demo Session
• Three demos here
  - WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
  - SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
  - ANTIX: Participatory Search System (Andrea Esuli)
• Some other activities are described in the Posters
More information
• These slides, further information, documents and the full list of publications are available at:
  http://ecd.isti.cnr.it