Language Independent
Collocation Extraction
(LICE)
Vidas Daudaravičius
Andrius Utka
(Vytautas Magnus University)
Mutual Information
$$MI(x; y) = \log_2 \frac{N \cdot f(x, y)}{f(x) \cdot f(y)}$$
[Figure: min, avg and max MI plotted against the sum of word frequencies in a word pair]
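The MI formula on this slide can be tried on a toy corpus. The sketch below is only an illustration: it assumes that N is the corpus size in tokens and that f(x, y) counts adjacent word pairs, neither of which the slide states, and the function name and toy sentence are made up for the example.

```python
import math
from collections import Counter


def mutual_information(tokens):
    """Score adjacent word pairs with MI = log2(N * f(x, y) / (f(x) * f(y)))."""
    n = len(tokens)                               # N: corpus size in tokens (assumption)
    word_freq = Counter(tokens)                   # f(x)
    pair_freq = Counter(zip(tokens, tokens[1:]))  # f(x, y), adjacent pairs only (assumption)
    return {
        (x, y): math.log2(n * f_xy / (word_freq[x] * word_freq[y]))
        for (x, y), f_xy in pair_freq.items()
    }


# Toy corpus, invented for illustration only.
tokens = "free trade area and free trade zone".split()
mi_scores = mutual_information(tokens)
print(mi_scores[("free", "trade")])  # equals log2(7 * 2 / (2 * 2))
```

Because the product f(x) · f(y) sits in the denominator, pairs of rare words that always occur together receive the highest scores.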
Midshipmen Abdulla Mohammed Al-Kaabi; Ahmed Suleman Al-Mamari; Ali Adam Al-Maimani; Ali Suleman AlRawahi; L P Chariandy; Feras Al-Kandari; Khalid Al-Moqbali; Khamis Ali Al-Sulaitni; Khamis
Saeed Al-Mazrouei; Majed Al-Majed; Mansour Sultan Al-Ramyan; Mohammed A Al-Mazrouei;
Mohammed Ali Al-Wahaibi; Naser Al-Mutairi; Osama Khaled Al-Ammar.
•quotations in foreign languages
•specific noun phrases
•first names and surnames preceded by titles
•names of institutions and organisations
T-score
$$T(x, y) = \frac{f(x, y) - \frac{f(x) \cdot f(y)}{N}}{\sqrt{f(x, y)}}$$
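As a worked instance of the T-score formula, the following sketch computes the score from raw counts. The function name and the counts are hypothetical, and N is assumed to be the corpus size in tokens.

```python
import math


def t_score(f_xy, f_x, f_y, n):
    """T(x, y) = (f(x, y) - f(x) * f(y) / N) / sqrt(f(x, y))."""
    expected = f_x * f_y / n  # pair frequency expected if x and y were independent
    return (f_xy - expected) / math.sqrt(f_xy)


# Hypothetical counts: the pair occurs 25 times, the words 100 and 200 times,
# in a corpus of 10,000 tokens.
t_value = t_score(f_xy=25, f_x=100, f_y=200, n=10_000)
print(t_value)  # (25 - 2) / 5
```

Unlike MI, the score grows with the observed pair frequency, so frequent pairs tend to rank higher.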
[Figure: min, avg and max Log*(T-score) plotted against the sum of word frequencies in a word pair]
“We think that there should be tighter safeguards with us being used as an example of what can go
wrong. The Law Society has done the right thing but it was one of its members who did this, so
it is bad it spent two years and two previous attempts denying us our compensation.”
•specific noun phrases
•proper nouns
•idioms
•verb phrases
Dice
•quotations in foreign languages
•specific noun phrases
•first names and surnames preceded by titles
•names of organisations and institutions
•exclamations
$$Dice(x; y) = \log_2 \frac{2 \cdot f(x, y)}{f(x) + f(y)}$$
[Figure: min, avg and max Dice plotted against the sum of the word frequencies in a word pair]
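The logarithmic Dice score can be checked with a few lines of code. This is a minimal sketch with a hypothetical function name and invented counts, not the authors' implementation.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Dice(x; y) = log2(2 * f(x, y) / (f(x) + f(y)))."""
    return math.log2(2 * f_xy / (f_x + f_y))


# Hypothetical counts: the pair occurs 4 times, each word 8 times.
partial = log_dice(f_xy=4, f_x=8, f_y=8)
# Two words that only ever occur together reach the upper bound.
perfect = log_dice(f_xy=5, f_x=5, f_y=5)
print(partial, perfect)
```

Since a pair cannot occur more often than either of its words, the score never exceeds 0, and it does not depend on the corpus size N.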
Fade in theme music. Tum-ti-tum-ti-tum-ti-tum Tum-ti-tum-ti-tum tum etc (trad arr Snoop Doggy Dogg).
Gravity Counts
$$G(x, y) = \log\left(\frac{f(x, y) \cdot n(x)}{f(x)}\right) + \log\left(\frac{f(x, y) \cdot n'(y)}{f(y)}\right)$$
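The Gravity Counts formula can be sketched in code under some assumptions. The slide does not define n(x) and n'(y); here n(x) is taken to be the number of distinct words seen immediately to the right of x, and n'(y) the number of distinct words seen immediately to the left of y. The natural logarithm, the function name and the toy sentence are likewise assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def gravity_counts(tokens):
    """G(x, y) = log(f(x, y) * n(x) / f(x)) + log(f(x, y) * n'(y) / f(y))."""
    word_freq = Counter(tokens)                   # f(x)
    pair_freq = Counter(zip(tokens, tokens[1:]))  # f(x, y) for adjacent pairs
    right_types = defaultdict(set)                # distinct right neighbours of x (assumed n(x))
    left_types = defaultdict(set)                 # distinct left neighbours of y (assumed n'(y))
    for x, y in pair_freq:
        right_types[x].add(y)
        left_types[y].add(x)
    return {
        (x, y): math.log(f_xy * len(right_types[x]) / word_freq[x])
        + math.log(f_xy * len(left_types[y]) / word_freq[y])
        for (x, y), f_xy in pair_freq.items()
    }


# Toy corpus, invented for illustration only.
tokens = "new york is new and new york".split()
gc_scores = gravity_counts(tokens)
print(gc_scores[("new", "york")])  # log(2 * 2 / 3) + log(2 * 1 / 2)
```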
•specific noun phrases
•proper nouns
•idioms
•verb phrases
[Figure: min, avg and max Gravity Counts plotted against the sum of the word frequencies in a word pair]
… he replied: “The Conservative party wants to win the next election. I want to win the next election. I
have the will to win the next election and I believe we will have a case to take to the British people that
will encourage them to believe it’s right that we carry on the job we’ve been trying to do.
Extraction of Collocational Strings
[Figure: score of each adjacent word pair in the example sentence, in sentence order: HE WILL, WILL WORK, WORK FOR, FOR A, A NEW, NEW FREE, FREE TRADE, TRADE AREA, AREA EMBRACING, EMBRACING NORTH, NORTH AMERICA, AMERICA AND, AND EUROPE, EUROPE AN, AN IDEA, IDEA PRESIDENT, PRESIDENT CLINTON, CLINTON IS, IS INTERESTED, INTERESTED IN]
He will work
for a new
Free trade area
North America and
Europe, an idea
President Clinton is
interested in
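One plausible way to turn per-pair scores into strings like those above is to cut the text wherever the score of a word pair is a local minimum, that is, lower than the scores of both neighbouring pairs. The slide does not state the boundary rule, so this rule, the function name and the scores below are all assumptions made for illustration; the scores are invented and are not the values from the chart.

```python
def split_at_local_minima(tokens, scores):
    """Split tokens into chunks, cutting where a pair score is an interior local minimum.

    scores[i] is the association score of the pair (tokens[i], tokens[i + 1]).
    """
    if len(scores) != len(tokens) - 1:
        raise ValueError("need exactly one score per adjacent word pair")
    chunks, current = [], [tokens[0]]
    for i, score in enumerate(scores):
        interior = 0 < i < len(scores) - 1
        if interior and scores[i - 1] > score < scores[i + 1]:
            chunks.append(current)  # boundary between tokens[i] and tokens[i + 1]
            current = []
        current.append(tokens[i + 1])
    chunks.append(current)
    return chunks


tokens = "he will work for a new free trade area".split()
# Hypothetical scores for illustration only.
scores = [9.0, 7.5, 2.0, 8.0, 6.0, 1.5, 12.0, 10.0]
chunks = split_at_local_minima(tokens, scores)
print([" ".join(chunk) for chunk in chunks])
```

In this sketch the first and last pairs are never treated as boundaries, because they have only one neighbouring pair to compare against.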
Extraction of Nominal Phrases from
Lithuanian Language Corpus (100m)
AC (French), Span = 1 and Span = 3
[Figure: MI and GC plotted over the words of the sentence, in sentence order:]
CHAQUE ÉTAT MEMBRE COMPARE SUR UNE PÉRIODE D AU MOINS DEUX ANS LES INDICES DE QUALITÉ DES VARIÉTÉS DE BLÉ DUR À CEUX DES VARIÉTÉS REPRÉSE AU NIVEAU RÉGIONAL
AC (English), Span = 1 and Span = 3
[Figure: MI and GC plotted over the words of the sentence, in sentence order:]
EACH MEMBER STATE SHALL COMPARE OVER A PERIOD OF AT LEAST TWO YEARS THE QUALITY INDEXES OF THE DURUM WHEAT VARIETIES WITH THOSE OF THE REPRESENTATIVE VARIETIES AT REGIONAL LEVEL
AC (Italian), Span = 1 and Span = 3
[Figure: MI and GC plotted over the words of the sentence, in sentence order:]
CIASCUNO STATO MEMBRO RAFFRONTA NELL ARCO DI UN PERIODO DI ALMENO DUE ANNI GLI INDICI DI QUALITÀ DELLE VARIETÀ DI FRUMENTO DURO CON QUELLI DELLE VARIETÀ RAPPRESENTATIVE A LIVELLO REGIONALE
AC (Maltese), Span = 1 and Span = 3
[Figure: MI and GC plotted over the words of the sentence, in sentence order:]
KULL STAT MEMBRU GĦANDU JQABBEL FUQ FIRXA MILL TA ANQAS SENTEJN L INDIĊI TAL KWALITÀ TAL VARJETAJIET TA QAMĦ TA L AWSTRALJA MA DAWK TAL VARJETAJIET RAPPREZENTATTIVI FUQ LIVELL REĠJONALI
AC (Dutch), Span = 1 and Span = 3
[Figure: MI and GC plotted over the words of the sentence, in sentence order:]
ELKE LIDSTAAT VERGELIJKT OP REGIONAAL NIVEAU OVER EEN PERIODE VAN TEN MINSTE TWEE JAAR DE KWALITEITSINDEX VAN DE DURUMTARWERA MET DIE VAN DE REPRESENTATIEVE RASSEN
Phrase Alignment
FR: CHAQUE ÉTAT MEMBRE | DES | DE BLÉ DUR | AU NIVEAU RÉGIONAL
EN: EACH MEMBER STATE SHALL | OF THE | OF THE DURUM WHEAT | AT REGIONAL LEVEL
IT: CIASCUNO STATO MEMBRO | DELLE | DI FRUMENTO DURO | A LIVELLO REGIONALE
MT: KULL STAT MEMBRU GĦANDU | TAL | TA QAMĦ TA L AWSTRALJA | FUQ LIVELL REĠJONALI
NL: ELKE LIDSTAAT | VAN DE | VAN DE DURUMTARWERA | OP REGIONAAL NIVEAU
Language Independent
Collocation Extraction
(LICE)
http://donelaitis.vdu.lt/~vidas/celex/lice.php
Vidas Daudaravičius
Andrius Utka
(Vytautas Magnus University)