Site Explorer Server: an integrated, client-server, query system
for Web sites
Giancarlo Bongiovanni, Flavio Fontana, Stefano Borghetti
Dept. of Computer Science, University of Rome “La Sapienza”
ENEA’s Usability Lab
Summary:
•Introduction
•Information Retrieval Systems and keyword score
•Search engines
•Internet now and the future
•Java
•Site Explorer Server v2.0
•Conclusion and experimental results
•Future works
Information search on the Internet
The Internet is the largest and most widespread network
•158 million accesses in January ’99
•A forecast of 200 million in 2000
•33 million in the United States, 1.3 million in Germany, 371 thousand in Italy
•Only 11% of users have been on the Internet for more than 3 years
•Millions of heterogeneous users
•Billions of information sources provided by the Web
•Exponential growth in the number of Web sites
•Growing network access by end users
•Growing Web browser functionality
•Improving search engine performance
The problem of searching for information on the Web
Issue: can searching for information on the Internet be a problem for particular types of users?
Today, a better scenario
User problems related to information search:
•Many users do not know the Web information model
•Users have trouble finding a valid tool able to locate the relevant information
•Users have trouble describing the information they seek in correct, concise terms
•Users have trouble using advanced search tools (e.g. Site Explorer Server is more difficult to use than a browser)
User requirements analysis
New search and exploration tools
A new approach to the Web, alternative to the traditional browser
Implementation of a client/server tool for Web IR, written in Java and tested experimentally at ENEA
Diagram labels: Site Explorer v1.1; IR; tool integrated with browser; network service
Information Retrieval Systems
General structure
Source: Gerard Salton, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
Diagram: the user formulates a query; documents pass through indexing into a data structure in a predefined language; the query is compared with that structure (SIMILAR) to produce the result.
•Query formulation by user
•Indexing process
•Result formulation
Information Retrieval Systems
Query formulation
A query is a list of terms that expresses and summarizes the topic being searched.
Boolean operators
Boolean systems combine the terms using Boolean operators:
•and
•or
•andnot
Examples:
Information and retrieval
Information or retrieval
Information andnot retrieval
Extended operators
Extended Boolean systems use additional operators:
•proximity of terms
•truncation of terms
•search restricted to a particular field
Examples:
Information adj retrieval
Inform*
Information [in title]
Ranking
In ranking systems the query is formulated as a natural-language phrase.
Example:
“Human influence in Information Retrieval systems”
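The Boolean and truncation operators can be illustrated with plain set operations. This is only a sketch: the three-document collection, the function names, and the choice of Python sets are assumptions made for illustration, not part of Site Explorer Server.

```python
# Toy collection (hypothetical data): document id -> set of index terms.
DOCS = {
    1: {"information", "retrieval"},
    2: {"retrieval"},
    3: {"information", "systems"},
}


def docs_with(term):
    """Return the ids of the documents that contain `term`."""
    return {doc_id for doc_id, terms in DOCS.items() if term in terms}


def docs_with_prefix(prefix):
    """Truncation operator (e.g. Inform*): match any term starting with `prefix`."""
    return {
        doc_id
        for doc_id, terms in DOCS.items()
        if any(term.startswith(prefix) for term in terms)
    }


info = docs_with("information")
retr = docs_with("retrieval")

and_result = info & retr                    # information and retrieval
or_result = info | retr                     # information or retrieval
andnot_result = info - retr                 # information andnot retrieval
prefix_result = docs_with_prefix("inform")  # inform*
```

Each Boolean operator maps onto one set operation (intersection, union, difference), which is why Boolean systems return an unranked set of documents rather than a scored list.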
Information Retrieval Systems
Indexing
Indexing is the process of analysing documents to produce a short representation of their contents.
Term vector
The representation is based on a keyword vector. The keywords are chosen manually or extracted automatically.
Example:
“Information Retrieval Data Structure & Algorithms”
<information, retrieval, data-structure, algorithms>
Data structures
The data structure that holds the document representations.
Example:
List, tree, index file, etc.
Inverted indexing
A file in which every record describes the relation between a document and each particular term.
Example:
          Information   Retrieval
Doc. 1    1             1
Doc. 2    0             1
Doc. 3    1             1
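The three-document example can be turned into an inverted index with a few lines of code. The sketch below assumes the term vectors are held in a Python dictionary; the names `TERMS`, `VECTORS` and `build_inverted_index` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

# Term vectors for the three-document example: 1 = term present, 0 = absent.
TERMS = ["information", "retrieval"]
VECTORS = {
    "Doc. 1": [1, 1],
    "Doc. 2": [0, 1],
    "Doc. 3": [1, 1],
}


def build_inverted_index(terms, vectors):
    """Map each term to the sorted list of documents that contain it."""
    index = defaultdict(list)
    for doc, vector in sorted(vectors.items()):
        for term, present in zip(terms, vector):
            if present:
                index[term].append(doc)
    return dict(index)


INDEX = build_inverted_index(TERMS, VECTORS)
```

The term vectors answer "which terms does this document contain?", while the inverted index answers "which documents contain this term?", which is the lookup a query needs.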
Information Retrieval Systems
Result formulation and presentation
Results order
In a traditional IRS the result is a list of potentially relevant documents
Documents ordered by relevance level
Explicit measure of relevance level (score)
New features
Dynamic presentation (results manipulation)
Graphic and direct presentation methods
Multimedia integration
Use of windows (different ways to present the results)
Sources:
Gerard Salton, Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
William B. Frakes, Ricardo Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.
Information Retrieval Systems
Score computation
Score computation measures the relevance of specific terms in specific documents.
Key points in score computation
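A method to weight the relevance of a term in the whole document collection
Examples:
$IDF_i = \log_2 \frac{N}{n_i} + 1$ (Sparck Jones, 1972)
$signal_k = \log_2(totfreq_k) - noise_k$ (Dennis, 1967)
$discvalue_k = avgsim_k - avgsim$
Frequency normalization for a particular document collection
Examples:
$cfreq_{ij} = K + (1 - K) \frac{freq_{ij}}{maxfreq_j}$ (Croft, 1983)
$nfreq_{ij} = \frac{\log_2(freq_{ij} + 1)}{\log_2(length_j)}$ (Harman, 1986)
Here $N$ is the number of documents in the collection, $n_i$ the number of documents containing term $i$, $freq_{ij}$ the frequency of term $i$ in document $j$, $maxfreq_j$ the highest frequency of any term in document $j$, $length_j$ the number of unique terms in document $j$, and $K$ a constant.
Computing the weight of a term for a document:
term frequency in the document * relevance weight of the term in the collection
Computing the score:
•Boolean system: use the SOP method
•Ranking system: use a particular formula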
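As a worked illustration of "term frequency in the document * term relevance weight in the collection", the sketch below combines the Harman normalized frequency with the Sparck Jones IDF. The function names, the default K = 0.5 and the sample numbers are assumptions chosen for the example; the slides do not say which of the formulas Site Explorer Server uses.

```python
import math


def idf(n_docs, docs_with_term):
    """IDF_i = log2(N / n_i) + 1 (Sparck Jones, 1972)."""
    return math.log2(n_docs / docs_with_term) + 1


def nfreq(freq, length):
    """nfreq_ij = log2(freq_ij + 1) / log2(length_j) (Harman, 1986)."""
    return math.log2(freq + 1) / math.log2(length)


def cfreq(freq, max_freq, k=0.5):
    """cfreq_ij = K + (1 - K) * freq_ij / maxfreq_j (Croft, 1983).

    The default K = 0.5 is an arbitrary choice for this sketch.
    """
    return k + (1 - k) * freq / max_freq


def term_weight(freq, length, n_docs, docs_with_term):
    """Normalized term frequency times collection-wide relevance weight."""
    return nfreq(freq, length) * idf(n_docs, docs_with_term)


# Hypothetical numbers: a term occurring 3 times in a document with
# 16 unique terms, present in 2 documents of an 8-document collection.
EXAMPLE_WEIGHT = term_weight(freq=3, length=16, n_docs=8, docs_with_term=2)
```

With these numbers the IDF is log2(8/2) + 1 = 3 and the normalized frequency is log2(4)/log2(16) = 0.5, so the term weight is 1.5. A rare term (small n_i) gets a higher IDF, and a long document (large length_j) dilutes the frequency component.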
Search engines
Diagram: Web pages feed an automatic indexing system that builds the index DB; queries entered through the Web interface (query and results) are matched against it (SIMILAR).
New functionality in the most popular search engines:
•Site classification
•Integration of new advanced search services for information in particular formats (pictures, sounds, MP3, e-mail, etc.)
•Few search engines provide a document score
•Migration from search service to online seller guides
Media Metrix - June 1999
S. Engine    Pos.
Yahoo        1
Excite       8
Lycos        9
Altavista    19
About        21
HotBot       23
Looksmart    25
GoTo         32
Internet
From thirty years ago to today
30 years
Pie chart by region. Regions: Europe, Asia-Pacific, North America, Others. Values shown: 3%, 23%, 55%, 19%.
Source: FIND/ITPD, III, January 1999 - NII project, supported by DOIT, MOEA
Internet
From thirty years ago to today
30 years
Pie chart: how long users have been on the Internet. Categories: less than 0.5 years, 0.5-1 years, 1-3 years, more than 3 years. Values shown: 11%, 28%, 30%, 31%.
Source: FIND/ITPD, III, January 1999 - NII project, supported by DOIT, MOEA
Internet
From thirty years ago to today
30 years
Bar chart: where users access the Internet (scale 0% to 60%).
Categories, in chart order: Home, School, Office, Computer Education Class, Public terminals, Laptop computer, Others.
Values, in chart order: 50.7%, 28.8%, 16.8%, 1.6%, 0.9%, 0.7%, 0.5%.
Source: FIND/ITPD, III, January 1999 - NII project, supported by DOIT, MOEA
Internet
From thirty years ago to today
30 years
Bar chart (scale 0% to 50%).
Categories, in chart order: Five or more, Four, Three, Two, One.
Values, in chart order: 5%, 9%, 26%, 39%, 21%.
Source: FIND/ITPD, III, January 1999 - NII project, supported by DOIT, MOEA
Internet
Towards tomorrow
Bar chart: years 1999 to 2003 on a scale from 0 to 600.
Future Tracks:
•Research and technologies
•Educational
•The Public Administration
•E-commerce
Java
Main features
•Object-oriented
•Dynamic
•Portable
•Platform independence
•Strong networking functionality
Technologies
•Applet: oriented to the implementation of graphical user interfaces (client)
•Multithread: oriented to the implementation of client/server systems (server)
Site Explorer Server v2.0
Site Explorer Server v2.0
Goals
To implement a new system:
•able to work directly on the Web
•able to help the user find interesting documents on the Web
•with a high degree of usability
•able to integrate:
search functions
an approach alternative to the browser
management functions
a single point of access for the user to heterogeneous Web data.
Site Explorer Server v2.0
Site Explorer Server v2.0 is a client/server system, implemented in Java, that automatically analyses a Web site and returns, as its result, the tree structure of the site, where the root node represents the site home page.
Additional features
•Focused on information search and retrieval with a keyword-search approach
•An easy information-filtering service
•A score computation service
•User management
A network service
An open, multi-platform service accessible to everybody
Diagram labels: User; Client interface; Site Explorer Server; Internet; Web site
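One way to obtain a site tree rooted at the home page is a breadth-first walk over the site's links. The sketch below is an assumption made for illustration: the slides do not say how Site Explorer Server builds its tree, and the link graph, page names and `build_site_tree` function are hypothetical. It works on an in-memory link graph so that it needs no network access.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/index.html": ["/about.html", "/docs.html"],
    "/about.html": ["/index.html", "/team.html"],
    "/docs.html": ["/team.html", "/faq.html"],
    "/team.html": [],
    "/faq.html": ["/index.html"],
}


def build_site_tree(links, home):
    """Breadth-first walk from the home page.

    Each page is attached to the first page that links to it, so cycles
    and repeated links are dropped and the result is a tree.
    """
    tree = {home: []}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in tree:
                tree[target] = []
                tree[page].append(target)
                queue.append(target)
    return tree


TREE = build_site_tree(LINKS, "/index.html")
```

Breadth-first order places every page at its shortest click distance from the home page, which is one reasonable reading of "tree site structure"; a depth-first walk would produce a different, usually deeper, tree.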
Site Explorer Server v2.0
External architecture and configuration
Diagram: users 1 to m (Windows, Unix, Mac-OS) each run the SEJA applet (SEC) in a browser; the applets reach the server (SES) across the Internet using SEP; the server reaches Web sites #1 to #n over HTTP.
Technical features
•Client/server system
•The server (SES) is a Java application
•The client (SEJA) is a Java applet
•SES and SEJA communicate through a dedicated application-layer protocol (SEP)
Site Explorer Server v2.0
Operation and processes
Process diagram:
•Client user interface: the user submits the query and receives the result
•Query selector process
•HTTP connection process (to the Web sites)
•Links extraction process (yields the next page of the site)
•Contents extraction process
•Keywords analysis process
•Score process
•Result builder
•Result-display process
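The links extraction and contents extraction steps can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the SES implementation (which is written in Java): the class name `PageExtractor`, the `extract` helper, the sample page and the `example.org` base URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collect outgoing links and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (absolute links, visible text) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


SAMPLE = """<html><head><title>Home</title><style>p {color: red}</style></head>
<body><h1>Information Retrieval</h1>
<p>See the <a href="docs/index.html">documentation</a>.</p></body></html>"""

LINKS, TEXT = extract(SAMPLE, "http://example.org/")
```

The extracted links would feed the next page request, and the extracted text would go on to keyword analysis and scoring.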
Site Explorer Server v2.0
SES subcomponents
Diagram: Main, Communicator, User manager, Function manager, Site analyser, Page analyser, Retriever. The client sends connection requests and queries and receives results; the server reaches the Web sites over the Internet.
Features
•Full-text document analysis
•Link checking using connection requests
•HTML 4 oriented
Site Explorer Server v2.0
The Site Explorer Server score
Three score levels:
Level 1 score. Based only on the keyword occurrences inside the Web page.
Level 2 score. Also based on the distribution of the keywords across the whole Web site.
Level 3 score. Also based on the position of the keyword occurrences inside the Web page structure.
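The three levels can be made concrete with a small sketch. Everything specific here is an assumption: the slides do not give the actual formulas, so the page model (title, heading, body), the position weights and the way each level is computed are hypothetical stand-ins that only mirror the inputs each level is described as using.

```python
import re

# Hypothetical weights for where a keyword occurs in the page structure.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def count(keyword, text):
    """Number of whole-word, case-insensitive occurrences of `keyword`."""
    pattern = r"\b%s\b" % re.escape(keyword)
    return len(re.findall(pattern, text, re.IGNORECASE))


def level1(page, keywords):
    """Level 1: keyword occurrences inside the page only."""
    full_text = " ".join(page.values())
    return sum(count(k, full_text) for k in keywords)


def level2(page, site, keywords):
    """Level 2: the page's share of all keyword occurrences in the site."""
    site_total = sum(level1(p, keywords) for p in site)
    return level1(page, keywords) / site_total if site_total else 0.0


def level3(page, keywords):
    """Level 3: occurrences weighted by their position in the page structure."""
    return sum(
        POSITION_WEIGHTS[part] * count(k, text)
        for part, text in page.items()
        for k in keywords
    )


HOME = {
    "title": "Information Retrieval",
    "heading": "Search",
    "body": "Retrieval of information on the Web",
}
CONTACT = {"title": "Contact", "heading": "Where", "body": "Information desk"}
SITE = [HOME, CONTACT]
KEYWORDS = ["information", "retrieval"]
```

With this sample site, `level1(HOME, KEYWORDS)` counts four occurrences, `level2(HOME, SITE, KEYWORDS)` gives the home page four fifths of the site's occurrences, and `level3(HOME, KEYWORDS)` rewards the two occurrences in the title more than the two in the body.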
Site Explorer Server v2.0
Site Explorer Java Applet GUI
Screenshot labels:
•Menu bar
•Toolbar
•Tree structure area
•Displayed result
•Retrieved object in Web site
•Textual area
•Multimedia area
•Status bar
•Status indicator
Site Explorer Server v2.0
Site Explorer Java Applet GUI
Connection to the server
Active connection indicator
Site Explorer Server v2.0
Site exploration
Screenshots:
•New site analysis request
•Use of a favorite site analysis request
•Use of a predefined site analysis request
•Receiving the result
•Results navigation: score level, relevant page indicator
•Results browsing
Site Explorer Server v2.0
The pilot center
The Usability Lab (ULab), established in 1992 at the pilot center of the ESPRIT III VENUS project, carries out research and development on advanced visual interfaces to databases and networked multimedia information systems.
Development and test machines:
Intel Pentium II 350 MHz / Windows 98 (Netlab)
Intel Pentium MMX 166 MHz / Windows 95 (Fontanaulab)
AMD K6 300 MHz / Windows 98 (Ulab)
Sun Sparc Station 5 / Unix Solaris 2.5 (Venus)
Sun Sparc Station 10 / Unix Solaris 2.5 (Dafne)
Software tools:
JDK v1.1.6, JDK v1.1.7, JDK v1.1.7a, JDK v1.1.7b, JDK v1.1.8
Edit+, NetBeans
Java Swing v1.0.3, Java Media Framework v1.1
Site Explorer Server v2.0
Conclusion and experimental results
•A robust system
•Good to excellent degree of usability
•Good response time (analysis and result building)
General user satisfaction degree by Site Explorer Server functionality (scale 0 to 10):
Browsing              9.1
Icons                 7.6
Multimedia contents   8.7
Textual contents      7.9
Tree                  8.4
Query                 9.4
Connection            8.9
50 users selected using the ENEA/VENUS methodology:
•Random users: occasional use of the system
•Professional users: use of the system related to their work
•Expert users
Site Explorer Server v2.0
ENEA applications
G7 Global-Inventory project
A collection of project data cards
•Site search engine vs Site Explorer Server
Plus - Prosoma LinkUp Service
A collection of multimedia data cards
Experimental sites:
ULAB sites
Future testing:
•Virtual Lab Site
•FAD
Site Explorer Server v2.0
and other existing systems
Map navigation applets
SurfMap
JavaNavigator
Site search
PersonalSearch: applet acting as a search engine for one site
Virgilio - search function on a site
Site exploration and representation
Site Explorer - builds a tree for a single site
HyperSystem Net40 - explores a site and gives a tree representation of it that supports navigation
Map navigation with search function
MerzeScope: graph-navigation applet with a search function for a single site
Link exploration
LinkBot - link analysis
Site Explorer Server v2.0
Future works
A fully modular internal architecture, so that new modules and new functions can be added in the simplest and most dynamic way.
A user profile system based on the user's interests, constantly updated through a feedback technique.
A new system agent that analyses Web sites automatically and off-line and uses the user's profile to suggest a set of queries on specific themes.