Site Explorer Server: an integrated client-server query system for Web sites
Giancarlo Bongiovanni, Flavio Fontana, Stefano Borghetti
Dept. of Computer Science, University of Rome "La Sapienza"
ENEA's Usability Lab

Summary:
• Introduction
• Information Retrieval Systems and keyword score
• Search engines
• Internet now and the future
• Java
• Site Explorer Server v2.0
• Conclusion and experimental results
• Future works

Information search on the Internet
The Internet is the largest and most widespread network.
• 158 million accesses in January 1999
• A forecast of 200 million in 2000
• 33 million in the United States, 1.3 million in Germany, 371 thousand in Italy
• Millions of heterogeneous users
• Only 11% of users have been using the Internet for more than 3 years
• Billions of information sources provided by the Web
• Exponential growth in the number of Web sites
• Growing network access by end users
• Growing Web browser functionality
• Improving search engine performance

The problem of searching for information on the Web
Issue: can searching for information on the Internet be a problem for particular types of users?
Today: a better scenario.
User problems related to information search:
• Many users do not know the Web information model
• Users have trouble finding a valid tool able to locate the relevant information
• Users have trouble describing the information they seek in correct and concise terms
• Users have trouble using advanced search tools (for example, Site Explorer Server is more difficult to use than a browser)

User requirements analysis
• New search and exploration tools
• A new approach to the Web, alternative to the traditional browser
• Implementation of a client/server tool, written in Java and experimentally tested, that performs information retrieval (IR) on the Web
• ENEA Site Explorer v1.1: an IR tool integrated with the browser
• Network service

Information Retrieval Systems: general structure
[Diagram elements: User, Query, Documents, Indexing, Data structure in a predefined language, Similar, Result]
• Query formulation by the user
• Indexing process
• Result formulation
Source: Gerard Salton, Introduction to Modern Information Retrieval, 1983, McGraw-Hill, Inc.

Information Retrieval Systems: query formulation
A query is formulated as a list of terms that express and summarise the topic being searched.

Boolean operators
Boolean systems combine the terms using Boolean operators:
• and
• or
• andnot
Examples:
Information and retrieval
Information or retrieval
Information andnot retrieval

Extended operators
Extended Boolean systems use additional operators:
• nearness of terms (proximity)
• cutting of terms (truncation)
• search within a particular field
Examples:
Information adj retrieval
Inform*
Information [in title]

Ranking
In ranking systems the query is formulated using natural-language phrases.
Example: "Human influence in Information Retrieval systems"

Information Retrieval Systems: indexing
Indexing is the process of analysing documents and producing a short representation of their contents. The representation is based on a keyword vector. The keywords are either chosen manually or extracted automatically.

Terms vector
Example: "Information Retrieval Data Structure & Algorithms" becomes <information, retrieval, data-structure, algorithms>

Data structures
The data structure that holds the document representation.
Examples: list, tree, index file, etc.

Inverted indexing
A file in which each record relates a particular term to the documents that contain it.
Example:
          Information   Retrieval
Doc. 1    1             1
Doc. 2    0             1
Doc. 3    1             1

Information Retrieval Systems: result formulation and presentation
Result order
• In a traditional IRS the result is a list of potentially relevant documents
• Documents ordered by relevance level
• Explicit measure of the relevance level (score)
New features
• Dynamic presentation (manipulation of the results)
• Graphical and direct presentation methods
• Multimedia integration
• Use of windows (different ways to present the results)
Sources:
Gerard Salton, Introduction to Modern Information Retrieval, 1983, McGraw-Hill, Inc.
William B. Frakes, Ricardo Baeza-Yates, Information Retrieval: Data Structures & Algorithms, 1992, Prentice Hall, Inc.
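Returning to the indexing and query-formulation slides, the inverted index and the three Boolean operators (and, or, andnot) can be sketched in a few lines of Java. This is only an illustrative sketch, not the code of Site Explorer Server: the class name InvertedIndex, its method names and the simple letters-only tokenisation are assumptions made for the example, and it uses modern Java collections rather than the JDK 1.1 API of the period.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical minimal inverted index: each term maps to the ids of the documents containing it. */
final class InvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lower-cases the text, splits it on non-letter characters and records docId for every term. */
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
            }
        }
    }

    /** Returns a fresh copy of the document ids for a term (empty if the term is unknown). */
    SortedSet<Integer> docs(String term) {
        SortedSet<Integer> found = postings.get(term.toLowerCase(Locale.ROOT));
        return found == null ? new TreeSet<Integer>() : new TreeSet<Integer>(found);
    }

    /** a and b: documents containing both terms. */
    SortedSet<Integer> and(String a, String b) {
        SortedSet<Integer> result = docs(a);
        result.retainAll(docs(b));
        return result;
    }

    /** a or b: documents containing at least one of the terms. */
    SortedSet<Integer> or(String a, String b) {
        SortedSet<Integer> result = docs(a);
        result.addAll(docs(b));
        return result;
    }

    /** a andnot b: documents containing a but not b. */
    SortedSet<Integer> andNot(String a, String b) {
        SortedSet<Integer> result = docs(a);
        result.removeAll(docs(b));
        return result;
    }
}
```

Loading three documents that match the term/document matrix of the indexing slide (documents 1 and 3 contain both terms, document 2 only "retrieval"), the query information and retrieval selects documents 1 and 3, while information andnot retrieval selects none.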
Information Retrieval Systems: score computation
Score computation aims to measure the relevance of specific terms in specific documents.

Key points in score computation:
• A method to weight the relevance of a term in the whole document collection. Examples:
  $IDF_i = \log_2 (N / n_i) + 1$   (Sparck Jones, 1972)
  $signal_k = \log_2 (totfreq_k) - noise_k$   (Dennis, 1967)
  $discvalue_k = avgsim_k - avgsim$
• Frequency normalisation for the particular document collection. Examples:
  $cfreq_{ij} = K + (1 - K) \cdot freq_{ij} / maxfreq_j$   (Croft, 1983)
  $nfreq_{ij} = \log_2 (freq_{ij} + 1) / \log_2 (length_j)$   (Harman, 1986)
• Computation of a term weight for a document: term frequency in the document * term relevance weight in the collection

Here $N$ is the number of documents in the collection, $n_i$ the number of documents containing term $i$, $freq_{ij}$ the frequency of term $i$ in document $j$, $maxfreq_j$ the highest term frequency in document $j$, $length_j$ the number of unique terms in document $j$, $totfreq_k$ the total frequency of term $k$ in the collection, and $K$ a constant.

Computing the score:
• Boolean systems: use the SOP method
• Ranking systems: use a particular formula

Search engines
[Diagram elements: Web interface (query and results), Similar, Index DB, automatic indexing system, Web pages]
New functionality in the most popular search engines:
• Site classification
• Integration of new advanced search services to find information in particular formats (pictures, sounds, MP3, e-mail, etc.)
• Migration from search service to on-line seller guides
Few search engines provide a document score.

Search engine   Position
Yahoo           1
Excite          8
Lycos           9
AltaVista       19
About           21
HotBot          23
LookSmart       25
GoTo            32
Source: Media Metrix, June 1999

Internet: from thirty years ago to today
Source for the four charts below: FIND/ITPD, III, January 1999 - NII project, supported by DOIT, MOEA

[Pie chart by region: Europe, Asia-Pacific, North America, Others; values shown: 3%, 23%, 55%, 19%]

[Pie chart of time spent on the Internet: less than 0.5 years, 0.5-1 years, 1-3 years, more than 3 years; values shown: 11%, 28%, 30%, 31%]

[Bar chart, 0% to 60% scale]
Home                        50.7%
School                      28.8%
Office                      16.8%
Computer education class    1.6%
Public access points        0.9%
Laptop computer             0.7%
Others                      0.5%

[Bar chart, 0% to 50% scale]
Five or more    5%
Four            9%
Three           26%
Two             39%
One             21%

Internet: towards tomorrow
[Bar chart: one bar per year from 1999 to 2003, on a scale from 0 to 600]
Future tracks:
• Research and technologies
• Education
• Public administration
• E-commerce
Java: main features
• Object-oriented
• Dynamic
• Portable
• Platform independence
• Multithreading
• Applet technology
• Rich networking functionality
• Oriented to the implementation of graphical user interfaces
• Oriented to the implementation of client/server systems
[Diagram: these features are related to the client and the server of Site Explorer Server v2.0]

Site Explorer Server v2.0: goals
To implement a new system that is:
• able to work directly on the Web
• able to help the user find interesting documents on the Web with a high degree of usability
• able to integrate:
  • search functions
  • an approach alternative to the browser
  • management functions
  • a user position for accessing heterogeneous Web data in a single, uniform way

Site Explorer Server v2.0: description and additional features
Site Explorer Server v2.0 is a client/server system, implemented in Java, that analyses a Web site automatically and returns, as its result, the tree structure of the site, in which the root node represents the site's home page.
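The site tree described above, with the home page as root, could be built along the following lines. This is a hedged sketch under stated assumptions, not the actual Site Explorer Server algorithm: the classes SiteNode and SiteTree are hypothetical, the links are supplied as an in-memory map instead of being extracted from pages fetched over HTTP, and a breadth-first walk is assumed so that each page appears once, under the first page found to link to it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical tree node: one page of the site plus the pages first reached from it. */
final class SiteNode {
    final String url;
    final List<SiteNode> children = new ArrayList<>();

    SiteNode(String url) {
        this.url = url;
    }
}

/** Hypothetical builder that turns a link graph into a tree rooted at the home page. */
final class SiteTree {
    /**
     * Walks the link graph breadth-first from the home page.
     * links maps a page URL to the URLs it links to (an assumed, already-extracted input).
     * A page that was already seen is skipped, so cycles and repeated links do not duplicate nodes.
     */
    static SiteNode build(String homePage, Map<String, List<String>> links) {
        SiteNode root = new SiteNode(homePage);
        Set<String> seen = new HashSet<>();
        seen.add(homePage);
        Deque<SiteNode> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            SiteNode node = queue.remove();
            List<String> targets = links.getOrDefault(node.url, Collections.<String>emptyList());
            for (String target : targets) {
                if (seen.add(target)) { // add() returns true only the first time a URL is seen
                    SiteNode child = new SiteNode(target);
                    node.children.add(child);
                    queue.add(child);
                }
            }
        }
        return root;
    }
}
```

Because the Web is a graph rather than a tree, some rule is needed to decide where a page linked from several places is shown; attaching it to the first page that reaches it is only one possible choice.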
• Focused on information search and retrieval through a keyword-search approach
• An easy information-filtering service
• A score computation service
• User management
• A network service
• An accessible (open to everybody), open and multi-platform service
[Diagram elements: User, Client interface, Site Explorer Server, Internet, Web site]

Site Explorer Server v2.0: external architecture and configuration
[Diagram elements: Users 1 to m on Windows, Unix and Mac OS; Browser with the SEJA applet; SEC; SEP; Server (SES); HTTP; HTTP Server; Internet; Web sites #1 to #n]
Technical features:
• Client/server system
• The server (SES) is a Java application
• The client (SEJA) is a Java applet
• SES and SEJA communicate through a dedicated application-layer protocol (SEP)

Site Explorer Server v2.0: operation and processes
[Diagram elements: User, Client user interface, Query, Query selector process, HTTP connection process, Web sites, Links extraction process, Contents extraction process, Keywords analysis process, Score process, Result builder, Result-display process, Result, Next page of the site]

Site Explorer Server v2.0: SES subcomponents
[Diagram elements: Main, Communicator, Function manager, User manager, Site analyser, Page analyser, Retriever, Internet, Connection request (client), Query (client), Results (client)]
Features:
• Full-text document analysis
• Link checking using connection requests
• HTML 4 oriented

Site Explorer Server v2.0: the Site Explorer Server score
Three score levels:
• Level 1 score: based only on the keyword occurrences inside the Web page.
• Level 2 score: also based on the distribution of the keywords across the whole Web site.
• Level 3 score: also based on the position of the keyword occurrences within the structure of the Web page.

Site Explorer Server v2.0: Site Explorer Java Applet GUI
[Screenshot labels: Menu bar, Tool bar, Tree structure area, Displayed result, Retrieved object in the Web site, Textual area, Multimedia area, Status bar, Status indicator]

Site Explorer Server v2.0: Site Explorer Java Applet GUI
[Screenshot: connecting to the server]

Site Explorer Server v2.0: site exploration
[Screenshots, in order:]
• Active connection indicator
• New site analysis request
• Use of a favourite site analysis request
• Use of a predefined site analysis request
• Receiving the result
• Results navigation: score level, relevant page indicator
• Results browsing

Site Explorer Server v2.0: the pilot centre
The Usability Lab (Ulab), set up in 1992 at the pilot centre of the ESPRIT III VENUS project, carries out research and development in the field of advanced visual interfaces to databases and networked multimedia information systems.
Development and test machines:
• Intel Pentium II 350 MHz / Windows 98 (Netlab)
• Intel Pentium MMX 166 MHz / Windows 95 (Fontanaulab)
• AMD K6 300 MHz / Windows 98 (Ulab)
• Sun SPARCstation 5 / Unix Solaris 2.5 (Venus)
• Sun SPARCstation 10 / Unix Solaris 2.5 (Dafne)
Software tools:
• JDK v1.1.6, JDK v1.1.7, JDK v1.1.7a, JDK v1.1.7b, JDK v1.1.8
• Edit+, Netbeans
• Java Swing v1.0.3, Java Media Framework v1.1

Site Explorer Server v2.0: conclusion and experimental results
• A robust system
• A good to excellent degree of usability
• A good response time (analysis and result building)

General user satisfaction degree, by Site Explorer Server functionality (scale 0.0 to 10.0):
Functionality          Score
Browsing               9.1
Icons                  7.6
Multimedia contents    8.7
Textual contents       7.9
Tree                   8.4
Query                  9.4
Connection             8.9

50 users selected using the ENEA/VENUS methodology:
• Random users: occasional use of the system
• Professional users: use of the system related to their work
• Expert users
Site Explorer Server v2.0: ENEA applications
• G7 Global-Inventory project: a collection of project data cards
  • Site search engine vs Site Explorer Server
• Plus - Prosoma LinkUp Service: a collection of multimedia data cards
Experimental sites: ULAB sites
Future testing:
• Virtual Lab Site
• FAD

Site Explorer Server v2.0 and other existing systems
Categories shown in the diagram:
• Applets for map navigation
• Search on a site
• Exploration and representation of a site
• Link exploration
Systems:
• PersonalSearch: an applet acting as a search engine for one site
• Virgilio: search function on a site
• Site Explorer: builds a tree for a single site
• SurfMap, JavaNavigator: map navigation and search function
• LinkBot: link analysis
• HyperSystem Net40: explores a site and gives a tree representation of it that allows navigation
• MerzeScope: an applet for navigating a graph, with a search function for a single site

Site Explorer Server v2.0: future works
• A totally modular internal architecture, so that new modules and new functions can be added in the simplest and most dynamic way.
• A user profile system based on the user's interests, continuously updated through a feedback technique.
• A new system agent that analyses Web sites automatically off-line and, using the user's profile information, suggests to the user a set of queries on specific themes.