Language Independent Collocation Extraction (LICE)

Vidas Daudaravičius, Andrius Utka (Vytautas Magnus University)

Mutual Information

$$ MI(x; y) = \log_2 \frac{N \, f(x, y)}{f(x) \, f(y)} $$

[Figure: max, avg and min MI plotted against the sum of word frequencies in a word pair.]

Midshipmen Abdulla Mohammed Al-Kaabi; Ahmed Suleman Al-Mamari; Ali Adam Al-Maimani; Ali Suleman AlRawahi; L P Chariandy; Feras Al-Kandari; Khalid Al-Moqbali; Khamis Ali Al-Sulaitni; Khamis Saeed Al-Mazrouei; Majed Al-Majed; Mansour Sultan Al-Ramyan; Mohammed A Al-Mazrouei; Mohammed Ali Al-Wahaibi; Naser Al-Mutairi; Osama Khaled Al-Ammar.

• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of institutions and organisations

T-score

$$ T(x, y) = \frac{f(x, y) - \frac{f(x) \, f(y)}{N}}{\sqrt{f(x, y)}} $$

[Figure: max, avg and min T-score (axis label: Log*(T-score)) plotted against the sum of word frequencies in a word pair.]

“We think that there should be tighter safeguards with us being used as an example of what can go wrong. The Law Society has done the right thing but it was one of its members who did this, so it is bad it spent two years and two previous attempts denying us our compensation.”

• specific noun phrases
• proper nouns
• idioms
• verb phrases

Dice

$$ Dice(x; y) = \log_2 \frac{2 \, f(x, y)}{f(x) + f(y)} $$

[Figure: max, avg and min Dice plotted against the sum of word frequencies in a word pair.]

Fade in theme music. Tum-ti-tum-ti-tum-ti-tum Tum-ti-tum-ti-tum tum etc (trad arr Snoop Doggy Dogg).

• quotations in foreign languages
• specific noun phrases
• first names and surnames preceded by titles
• names of organisations and institutions
• exclamations
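The MI, T-score and Dice measures can be computed from plain frequency counts. The following is a minimal sketch, not the LICE implementation: it assumes that a word pair means two adjacent tokens, that N is the number of tokens in the corpus, and that the T-score takes the standard form with a square root in the denominator. The function name `pair_scores` is hypothetical.

```python
import math
from collections import Counter


def pair_scores(tokens):
    """Score every adjacent word pair with MI, T-score and log Dice.

    Assumptions (not stated on the poster): a pair is two adjacent
    tokens, and N is the total number of tokens.
    """
    n = len(tokens)                           # N
    f = Counter(tokens)                       # f(x), f(y)
    f_xy = Counter(zip(tokens, tokens[1:]))   # f(x, y)

    scores = {}
    for (x, y), fxy in f_xy.items():
        mi = math.log2(n * fxy / (f[x] * f[y]))
        t = (fxy - f[x] * f[y] / n) / math.sqrt(fxy)
        dice = math.log2(2 * fxy / (f[x] + f[y]))
        scores[(x, y)] = {"mi": mi, "t": t, "dice": dice}
    return scores


if __name__ == "__main__":
    toy = "a b a b c".split()
    for pair, s in pair_scores(toy).items():
        print(pair, {k: round(v, 3) for k, v in s.items()})
```

All three measures need only word and pair frequencies, so the same counting code serves any language for which a tokenised corpus is available.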
Gravity Counts

$$ G(x, y) = \log \frac{f(x, y) \, n(x)}{f(x)} + \log \frac{f(x, y) \, n'(y)}{f(y)} $$

[Figure: max, avg and min Gravity Counts plotted against the sum of word frequencies in a word pair.]

… he replied: “The Conservative party wants to win the next election. I want to win the next election. I have the will to win the next election and I believe we will have a case to take to the British people that will encourage them to believe it’s right that we carry on the job we’ve been trying to do.

• specific noun phrases
• proper nouns
• idioms
• verb phrases

Extraction of Collocational Strings

[Figure: a value for each adjacent word pair (HE WILL, WILL WORK, WORK FOR, ..., IS INTERESTED, INTERESTED IN) of the sentence below; data labels range from -3,2 to 20,8.]

He will work for a new Free trade area embracing North America and Europe, an idea President Clinton is interested in

Extraction of Nominal Phrases from Lithuanian Language Corpus (100m)

[Figure: MI plot.]

[Figure: MI along the words of the same sentence in several languages, one panel per language (French, English, Italian), each with Span = 1 and Span = 3.]
[Figure, continued: panels for Maltese and Dutch, each with Span = 1 and Span = 3.]

Phrase Alignment

FR | EN | IT | MT | NL
CHAQUE ÉTAT MEMBRE | EACH MEMBER STATE | CIASCUNO STATO MEMBRO | KULL STAT MEMBRU | ELKE LIDSTAAT
- | SHALL | - | GĦANDU | -
AU NIVEAU RÉGIONAL | AT REGIONAL LEVEL | A LIVELLO REGIONALE | FUQ LIVELL REĠJONALI | OP REGIONAAL NIVEAU
DES | OF THE | DELLE | TAL | VAN DE
DE BLÉ DUR | OF THE DURUM WHEAT | DI FRUMENTO DURO | TA QAMĦ TA L AWSTRALJA | VAN DE DURUMTARWERA

Language Independent Collocation Extraction (LICE): http://donelaitis.vdu.lt/~vidas/celex/lice.php
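The Gravity Counts measure and the idea of cutting a sentence into collocational strings can be sketched together. This is an illustration under stated assumptions, not the LICE implementation: n(x) is taken to be the number of distinct words observed immediately to the right of x, n'(y) the number of distinct words observed immediately to the left of y, and the logarithm is natural. The poster does not state the rule used to cut a sentence, so the threshold rule in the hypothetical `split_into_strings` is an assumption as well.

```python
import math
from collections import Counter, defaultdict


def gravity_counts(tokens):
    """Gravity Counts for every adjacent word pair in a token list.

    Assumptions: n(x) = number of distinct right neighbours of x,
    n'(y) = number of distinct left neighbours of y, natural log.
    """
    f = Counter(tokens)                       # f(x), f(y)
    f_xy = Counter(zip(tokens, tokens[1:]))   # f(x, y)
    right = defaultdict(set)                  # right neighbours of x
    left = defaultdict(set)                   # left neighbours of y
    for x, y in f_xy:
        right[x].add(y)
        left[y].add(x)
    return {
        (x, y): math.log(fxy * len(right[x]) / f[x])
        + math.log(fxy * len(left[y]) / f[y])
        for (x, y), fxy in f_xy.items()
    }


def split_into_strings(sentence, scores, threshold):
    """Cut a sentence wherever an adjacent pair scores below threshold.

    The threshold rule is an assumption made for illustration only.
    Pairs never seen in the corpus are treated as boundaries.
    """
    if not sentence:
        return []
    strings = [[sentence[0]]]
    for x, y in zip(sentence, sentence[1:]):
        if scores.get((x, y), float("-inf")) >= threshold:
            strings[-1].append(y)   # pair is strong: extend the string
        else:
            strings.append([y])     # pair is weak: start a new string
    return strings


if __name__ == "__main__":
    corpus = "new york is big new york is old".split()
    g = gravity_counts(corpus)
    print(split_into_strings(["is", "big", "new", "york"], g, threshold=0.0))
```

Any of the pair measures on the poster could be passed to `split_into_strings` in place of Gravity Counts, since it only needs a dictionary from word pairs to scores.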