Enhanced Content Delivery
Action 2: Mine the Web
Industrial Day, Rome, 10 June 2004

Action 2 - Partners
- Dipartimento di Informatica, Università di Pisa
- KDD & HPC Labs, ISTI-CNR, Pisa
- ICAR-CNR, Cosenza

ECD - Industrial Day, Rome, 10 June 2004

Action 2: Mine the Web
The project: four Work Packages (Action Coordinator: Dr. Fosca Giannotti, ISTI-CNR)
- Work Package 2.1. Web Mining (UNIPI, ISTI, ICAR). WP Coordinator: Dr. Salvatore Ruggieri, Dip. Informatica
- Work Package 2.2. Indexing and compression (UNIPI). WP Coordinator: Prof. Paolo Ferragina, Dip. Informatica
- Work Package 2.3. Managing Terabytes (ISTI, ICAR). WP Coordinator: Dr. Raffaele Perego, ISTI-CNR
- Work Package 2.4. Participatory Search Services (UNIPI). WP Coordinator: Prof. Maria Simi, Dip. Informatica

Action 2: Mine the Web
The main goals of the ECD Project, content enhancement and delivery, are pursued here in a way that complements Action 1.
The focus is on delivering enhanced Web contents to (communities of) users:
- Exploiting Web Mining to extract knowledge/models that can be used to enhance the efficacy and efficiency of the various phases of the information search process
- Designing, validating, and providing efficient and scalable solutions for retrieving, storing, and delivering Web contents to users

Motivations
On-line data grows rapidly:
- 50+ million new pages/day (source: IBM)
- 100+ thousand news items and articles/day (source: IBM)
- Databases, digital libraries, etc.
Internet use tracking produces additional interesting data:
- Server logs, WSE (Web search engine) logs, network traffic logs
Goldman Sachs estimates (2002): "between 80 and 90 percent of information on the Internet and corporate networks is unstructured"

Motivations
The limits of the current means of access to web contents are becoming clear:
- Low precision and quality, difficulty of matching users' subjective relevance
  - over-abundance of low-quality web material
  - low coverage and freshness
  - much relevant information lies in the hidden web
  - ranking mechanisms penalize important pages that have only just entered the scene
- Difficulties in managing size, complexity, heterogeneity
- Difficulties in identifying patterns and trends within huge amounts of unstructured contents
Web Mining plays an important role: it makes it possible to synthesize and extract valuable information and knowledge.

Web Mining
Web Mining: exploiting Data Mining techniques on data coming from the Web
Data Mining: the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other repositories
Goal: assist users or site owners in finding something useful/interesting/relevant
User-Centric View (Client-Side)
- discovery of documents on a subject
- discovery of semantically related documents or document segments
- extraction of relevant knowledge about a subject from multiple sources
Owner-Centric View (Server-Side)
- increasing contact / conversion efficiency (Web marketing)
- targeted promotion of goods, services, products, ads
- measuring effectiveness of site content / structure
- providing dynamic personalized services or content

Web Mining Taxonomy
[Figure: taxonomy tree. Web Mining branches into Web Usage Mining, Web Content Mining, and Web Structure Mining. Usage data is illustrated by the following web server log entries.]
  131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/finger.jpg HTTP/1.1" 304
  131.114.21.41 - - [27/May/2004:19:24:00 +0200] "GET /images/logokdd.jpg HTTP/1.1" 304
  131.114.21.41 - - [27/May/2004:19:24:09 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 200 131072
  131.114.21.41 - - [27/May/2004:19:24:12 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 196608
  131.114.21.41 - - [27/May/2004:19:24:13 +0200] "GET /didattica/BDM2004/TDM_intro.19.02.04.pdf HTTP/1.1" 206 338224

Web Mining Applications
Web Usage Mining
- discovering customer preference and behavior
- Web personalization / collaborative filtering
- adaptive Web sites / improving Web site organization
- e-business intelligence, etc.
Web Content Mining
- information filtering / knowledge extraction
- Web document categorization
- discovery of ontologies on the Web, etc.
Web Structure Mining
- finding "quality" or "authoritative" sites based on linkage and citations
  - IBM CLEVER project
  - Google
  - etc.

Some related projects
- WebFountain - IBM
- WebBase - Stanford DBGroup

WebFountain (IBM)
Input sources:
- World-Wide Web, news forums, weblogs, etc.
- Newspapers, magazines, etc.
- Customer electronic text
WebFountain: infrastructure for Advanced Text Analytics
- Finds patterns, trends and relationships in text
- Application examples: marketing, intelligence, research

WebFountain: an infrastructure for Advanced Text Analytics applications

  Cluster capacity                                          ½ petabyte
  Number of pages in store                                  2,000,000,000
  Number of pages crawled per day                           25,000,000
  Number of pages mined per second                          10,000
  Number of 73 GB hard drives                               3674
  Number of CPUs                                            1231
  Number of scientists and researchers who have
    contributed to WebFountain technology                   250
  Patents pending                                           100
  Patents issued                                            75
  Traffic coming in from the Internet (megabytes/sec)       70
  Time to complete query                                    5 minutes, 22 seconds
  Number of countries contributing to technology            5

[Figure: WebFountain infrastructure. Data sources shown: Internet, intranets, customer DBs, news feeds, 3rd-party DBs. Components shown: Crawler, Structured Data Gatherer, Data Store, Information Miners, Index(es), Application Server, Communications Infrastructure, Cluster Management System. Outputs go to customers.]

WebFountain: Reputation Tracking
[Figure: screenshot]

WebBase (Stanford DBGroup)

WebBase Challenges
- Archiving
- Scalability: crawling, archive distribution, index construction, storage
- Consistency: freshness, versions
- Dissemination: "units", coordination
- IP management: copy access, link access, access control
- Hidden Web
- Topic-specific collection building

Action 2: Mine the Web (application scenario)
So far, hardly any approach analyzes how a given group of users accesses the Web with the aim of exploiting usage information to give the users of that group enhanced access to web resources.
We think that it is possible to learn, from the usage data of a group of web users, new models and patterns that, in combination with document content and structure, may yield
enhanced content access and delivery:
- better search services
- better categorization and document classification services
- better question answering services

Action 2: Mine the Web
Ambitious objective: exploit the combination of Web data about USAGE, STRUCTURE, and CONTENT originated/accessed by a Virtual Organization, to improve the efficacy and efficiency of the knowledge extraction process from the users' point of view.
Developing solutions that are:
- innovative w.r.t. the state of the art
- appropriate for the Web domain

Virtual Organizations
[Figure: a virtual community connected to the Internet]

Tracking Virtual Organizations
Tracking the interaction of the virtual community with the Internet allows us to collect a variety of interesting information.
Network traffic data provide detailed information about:
- Usage: preferred sites, user sessions
- Content: accessed documents
- Structure:
  - from client sessions we can build the usage Web subgraph
  - by parsing the retrieved documents we can build the corresponding link graph

Tracking Virtual Organizations
[Figure: the virtual community's link graph, traffic graph, and combined link and traffic graph]

We need an infrastructure: the Web Object Store (WOS)
A Web Data Management System optimized to efficiently handle content, usage, and structure web data.
Purpose: enable (possibly) innovative Web IR and Web Mining research by locally providing a small, but significant, portion of the Web built according to our user-centric view.
Manage large collections of:
- Web pages
- Preprocessed usage data
- Structure data
collected within our virtual community.

WOS and related activities
WOS
- Persistent store of objects
- Web data management system for content, structure and usage data
- Management of data at many levels of abstraction
- Fast development of new applications
- Easy C++ annotation of new persistent objects
- Read and write data in tables
Efficient and scalable access methods
- IXE b-trees, full-text indexes
- search in compressed data
Efficient and scalable storage
- IXE persistent objects
- enhanced compression methods
- distributed architecture
Population
- raw traffic data of our community
- IXE Crawler
- Participatory search
Related activities
- Clustering/pattern/classification Web Mining algorithms: efficient and scalable pattern mining and clustering algorithms
- Data cleaning, preprocessing, filtering
- Caching of web documents and of query results
- Clustering e-mails
- Clustering/categorizing query result snippets
- Clustering XML documents
- Etc.

WOS applications
Some innovative applications are currently being pursued within our project:
- Characterization, on the basis of usage only or of usage + contents + structure, of new important emerging sites and of irrelevant sites (e.g., advertising sites); this is crucial for steering the crawler of the community web repository towards fresh, relevant documents and away from unimportant ones
- Page ranking based also on usage information, to achieve a more accurate and dynamic measurement of document relevance
- Recommendation of similar/related documents and keywords, on the basis of combined usage/content analysis
- Caching and clustering of web search results

WOS population: usage data (WP 2.1)
We collected long periods of proxy-level IP traffic originating from the SERRA network (domain unipi.it):
- The whole University of Pisa
- Many-to-many interactions
- Inter-site user sessions
- Massive data: millions of HTTP requests/day, ~1 GB/day of raw data

WOS population: content data (WP 2.4)
Methods to gather contents to populate the Web Object Store:
- IXE Crawler
- Participatory Search System (main activity this year)
- Hidden Web Search

WOS population: content data (WP 2.4)
[Figure: IXE crawler loop. init (initial URLs) -> get next URL -> get page (from the Internet) -> extract URLs; the fetched web pages are stored.]

IXE Crawler
Parallel/distributed crawler
High performance through:
- asynchronous I/O (500 connections/thread)
- asynchronous DNS resolution
- keep-alive connections
- multi-threading
- URL compression
9 Mb/sec transfer rate (7 times that of the nutch.org crawler)

Participatory search: the idea
Participatory search:
- each participant builds an index of its local contents and sends it to a central server
- the central server implements a community search service by collecting and merging the participants' indexes
A model that fits community needs for dedicated search services.
A trade-off between a centralized search model (e.g., Google) and a distributed approach (e.g., Gnutella, Kazaa).

Participatory Search
[Figure: three architectures compared (Centralized, Participatory, Distributed), drawn as nodes labeled with the components they run (CIS, CI, S) and connected by flows of documents, search indexes, and search results. Legend: C = Crawler, I = Indexer, S = Search Engine.]

Participatory Search: benefits
- Participants are in charge of selecting
  - what to index and to publish
  - when to publish (no need for coordination with an external crawler)
- Control over index update and freshness
- Publishing of Hidden Web content

Storage and access methods: compression (WP 2.2)
Our technique takes a poor compressor A and turns it into a compressor Aboost with a better performance guarantee.
[Figure: the Booster takes the input string s and the compressor A and outputs c']
- The better A is, the better Aboost is
- Qualitatively, we show that c' (the output of Aboost on s) is shorter than c (the output of A alone on s), if s is compressible
- Time(Aboost) = Time(A), i.e.
  no slowdown
- A is used as a black box
- The more compressible s is, the better Aboost is
Key components: Burrows-Wheeler Transform, suffix tree, and a greedy processing of them

Storage and access methods (WP 2.1 and 2.2)
Repository of URLs
- Compressed
- Prefix and suffix search within URLs
- Search by hostname, path, file-ext, …
    select count(*) from … where url LIKE 'http://%.it/%.asp'
- Up to two orders of magnitude faster than using a sequential scan or a B-tree
- Space occupancy << B-tree

Storage and access methods: index compression (WP 2.3)
- A DGap is the difference between two consecutive DocIDs in a posting list. Assigning DocIDs in a clever way could improve the compression factor of traditional variable-[bit/byte] encoding methods by increasing the number of small DGaps.
- Clustering property: within each posting list there are dense zones (i.e., many small DGaps).
- Our problem consists of enhancing the clustering property of posting lists.

Compression Enhancement
[Figure]

Content delivery (WP 2.1, 2.2 and 2.3)
- Web caching: mining of web/proxy server requests aimed at improving LRU-based document caching (WP 2.1)
- Recommendation system (on line/off line): mining of web sessions aimed at profiling users and recommending related pages to them (WP 2.1, 2.3)
- Transactional clustering: clustering specialized on transactional data, aimed at categorizing web pages, user sessions, snippet sequences, search engine results (WP 2.1, 2.2)

Content delivery (WP 2.3)
SUGGEST: a recommendation system made up of two distinct modules
- Offline: performs model extraction by means of a clustering algorithm which partitions the Usage Graph
- Online: performs user classification and suggestion generation
The WOS remarkably shortened implementation time (< 500 C++ lines).
We used three WOS objects to produce a persistent clustering structure:
- Citation
- PageView
- Session
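
The DGap argument from the index compression slide (WP 2.3) can be illustrated with a minimal sketch. It assumes a standard variable-byte code (7 payload bits per byte, high bit marking the last byte of each number); the function names and the two sample posting lists are hypothetical and are not taken from the project. The sketch does not implement any DocID assignment algorithm: it only shows why a posting list whose DocIDs are close together encodes into fewer bytes.

```python
def dgaps(doc_ids):
    """Turn a sorted posting list of DocIDs into d-gaps (first DocID kept as-is)."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def undo_dgaps(gaps):
    """Rebuild the original DocIDs by summing the gaps."""
    total, ids = 0, []
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def vbyte_encode(numbers):
    """Variable-byte encode non-negative integers.

    Each number is split into 7-bit groups, most significant group first;
    the high bit is set only on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        groups = [n & 0x7F]
        n >>= 7
        while n:
            groups.append(n & 0x7F)
            n >>= 7
        groups[0] |= 0x80  # least significant group terminates the number
        out.extend(reversed(groups))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:  # terminator byte reached
            numbers.append(n)
            n = 0
    return numbers


# Hypothetical posting lists for the same term under two DocID assignments.
scattered = [3, 1500, 40000, 90000, 200000]  # documents far apart in ID space
clustered = [100, 101, 103, 104, 107]        # same documents, IDs close together

scattered_size = len(vbyte_encode(dgaps(scattered)))
clustered_size = len(vbyte_encode(dgaps(clustered)))
print(scattered_size, clustered_size)
```

With these sample lists the scattered assignment needs 12 bytes and the clustered one 5 bytes, because every gap below 128 fits in a single byte.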

Content delivery (WP 2.2)
Goal: retrieve the pages which match the user's needs. This is a very difficult task because:
- the size of the Web is increasing, and so is the number of answers
- Web coverage is a problem for any single search engine
- Web pages are heterogeneous
- user needs are subjective and time-varying
- the "list of keywords" paradigm for a user query may be ambiguous
SnakeT: clusters the web snippets returned by one or more search engines into hierarchically labeled folders, which are created on the fly to capture the various meanings of the answers returned for a user query.

SnakeT: an example of use
[Figure: screenshot]
Look at the DEMO

Content delivery (WP 2.1)
Clustering of:
- e-mails (Manco)
- XML documents (Chiara)
- ??

Ongoing and future activities
Work in progress
- Pursuing our goal of exploiting USAGE, STRUCTURE, and CONTENT Web data to improve efficacy and efficiency in the interaction of the user with the Web
- Implementation of additional WOS layers: compression booster, XML clustering
Future work (medium-long term)
- WOS, final version
- Community-oriented ranking
- Content (news, xml, ..)
  clustering
- Cooperation with Nutch.org (Doug Cutting in Pisa next October)
- etc.

Deployment scenarios
Concerning the role of the WOS and of the ECD applications, three (non-exclusive) deployment scenarios can be envisaged:
- The WOS is a research infrastructure, in the spirit of the WebBase project at Stanford University
- The WOS is an infrastructure for web analytics services to be offered to third parties, in a spirit close to the IBM WebFountain project
- The WOS can become a product for Web Data Management Systems aimed at developing and engineering web mining ECD applications, again in a spirit close to WebBase

Demo Session
Three demos here:
- WOS: browsing usage data (Mirko Nanni, Vincenzo Bacarella)
- SnakeT: Web snippets clustering (Paolo Ferragina, Antonio Gullì)
- ANTIX: Participatory Search System (Andrea Esuli)
Some other activities are described in the posters.

More information
These slides, further information, documents, and the full list of publications are available at: http://ecd.isti.cnr.it