Gramsci’s authorship attribution of anonymous newspaper articles
Maurizio Lana
Histoire et informatique: Textométrie des sources historiques
6.6.2014

who we are
• Maurizio Lana
• Mirko Degli Esposti
• Emanuele Caglioti
• Dario Benedetto
• 1 scholar and 3 mathematical physicists

it’s always data
• the analysis of digitized physical-world phenomena can work equally well on
  – CT (TAC) imaging,
  – songs,
  – ECG,
  – texts,
  – …

reason for the study
• national edition of Gramsci’s works, by the Ministero dei Beni Culturali
• new work on the newspaper articles
• many anonymous articles in the journals and newspapers Gramsci wrote for: Il Grido del Popolo, Avanti!, La Città Futura
• request from the Fondazione Gramsci to start the study of the anonymous articles anew, to find new evidence of Gramsci’s writings
• this was in 2005

a little background
• the start is in 1847: V.J. Bunjakovskij, On the possibility to apply determining measures of confidence to the results of some observing sciences, particularly statistics
• 1897-98, W. Lutosławski, “On Stylometry”; “Principes de stylométrie”
• 1959, D. R. Cox and L. Brandwood, On a discriminatory problem connected with the works of Plato
• 1962, A. Ellegård, Who was Junius?
• 1964, F. Mosteller and D. Wallace, Inference and Disputed Authorship: The Federalist
• 1978, A. Kenny, The Aristotelian ethics: a study of the relationship between the Eudemian and Nicomachean ethics of Aristotle
• 1980, J.P. Benzécri, Pratique de l’analyse des données
• 1987, J. F. Burrows, Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style, “LLC”, 2, 1987, pp. 61-70

in common…
• … they all work at the level of words

the turning point
• G. Ledger, Re-counting Plato: A Computer Analysis of Plato’s Style, Oxford, Clarendon Press, 1989
• what is counted: words containing a specified letter; words ending in a specified letter; words with a specified letter in penultimate position
• that is, parts of words that are semantically and linguistically meaningless
• “I have departed from the traditional approach of stylometry by ignoring entirely meanings and grammatical functions, measuring instead the frequencies of words according to their orthographic content”

today, for me (for us)
• the key is: a latent mathematical structure of the text
• from: L. Doležel, A note on quantification in text theory, in: “Text Processing”, S. Allén ed., Stockholm, 1982, pp. 539-552
• an expression of the idea: D. Khmelev, F. Tweedie, Using Markov chains for identification of writers, “LLC”, 16, 4, 2001, pp. 299-307

today, for me (for us)
• another expression: D. Benedetto, E. Caglioti, V. Loreto, Language Trees and Zipping, “Phys. Rev. Lett.” 88, n. 4, 048702-1, 048702-4 (2002)
• take one text and compress it with Zip;
• then take another text and compress it with the compression dictionary of the first one;
• measure the difference in size: this is the measure of the relative entropy

then came the AAAC
• in 2004 the American mathematician Patrick Juola proposed the Ad-hoc Authorship Attribution Competition (AAAC), to find experimentally the best method for correctly attributing anonymous works: http://www.mathcs.duq.edu/~juola/authorship_contest.html
• the second-best scorer was Vlado Keselj, with a method based on measuring n-gram frequencies

the state of the QAA world in 2005
• in 2002 Jack Grieve, in his thesis “Quantitative Authorship Attribution: A History And An Evaluation Of Techniques”, counted at least 39 known and used methods, with 93 variants, for Quantitative AA
• the aim of the AAAC: prune the useless methods
• nevertheless: this continues to be craftsmanship, not science

in 2005 we started
• we had to prove to the Fondazione Gramsci that Quantitative AA produces good results
• we chose to use two QAA methods:
  – relative entropy (already described)
  – n-gram distances (which gave Keselj 2nd place in the AAAC)

the protocol
• phase 1: 50 certainly Gramscian texts; 50 certainly non-Gramscian texts;
  – do whatever you like in order to recognize the Gramscian texts as Gramscian and the non-Gramscian texts as non-Gramscian
• phase 2 (blind test): 40 unidentified texts, some Gramscian and some not: classify them correctly

text preparation
• deletion of:
  – quotations of any length
  – proper nouns
  – numbers
• no lemmatization: e.g. the choice of a given tense and person of a verb carries some amount of information, and we cannot evaluate it well enough to justify discarding it

n-grams
• sequences of n entities of a kind you must choose (we chose characters)
• sliding n-grams: in “final” the 3-grams are fin, ina, nal
• the right n is found by testing
• n-grams capture fragments of meaning, syntax, collocations/co-occurrences, etc.
• you have a dictionary of Gramscian n-grams
• you check the n-grams of your anonymous texts; you count the matches and the non-matches and take their algebraic sum: if it is positive the text is Gramscian, if it is negative it is not

strategy
• maximize the correct attributions
• at the same time, avoid false attributions
• = some missed attributions are ok if you don’t produce false attributions
• whoever commissioned the work must be able to trust you

strategy 2
• we don’t know if, how, and how much the “parole” of an author changes across subject matter, audience, genre, time, …
• so we decided to work on well-defined periods, leaving it to the Gramsci experts to set their boundaries
• 1st period: 1914-1921

a little maths
• with two methods at work, we could build a Cartesian plane on which the results of the two measures were plotted, after normalizing them to the range -1 / +1

phase 1 - setup

phase 2 - blind test

the day after
• we started doing the attributions (paid by the Fondazione Gramsci for it) without knowing anything about the texts, and giving periodic reports to the historians who were editing the various volumes of the national edition of Gramsci’s works
• we got the texts, normalized them, measured them, and produced a report that we sent to the Fondazione Gramsci
• the historians’ evaluation of the QAA: no proposed attribution was unacceptable, even if not every proposed attribution was accepted
• [example of report]

now we have stopped
• due to the cuts to research funds, the national edition is now at a standstill

some practical principles on AA
• no tool can ‘read’ a text and tell you: this text was written by Francesco Stella
• you can only classify the texts you chose to work on, as crunched by the tool you use
• all of the texts will be connected: you must interpret the results
• you must mix anonymous or disputed works with “control works”: same period, same genre, same language, same author, similar authors, …

be careful
• when you have proper nouns in your works, it’s easy to classify them:
• R. Clement and D. Sharp, Ngram and Bayesian Classification of Documents for Topic and Authorship, “LLC”, 2003, 18(4):423-447
• but you don’t really classify the texts, you classify the collections of proper nouns they contain

why the Gramsci case was/is difficult and strange
• the articles are very short: between 300 and 1000/1200 words
• all of these articles share subject matter, ideology, context
• there is no cross-check, and you are working for a scientific and productive undertaking (it’s not ‘simply’ an experiment)
• the tables showing the matches are sparse; nevertheless these data work well

now what
• Patrick Juola, the mathematician who proposed the AAAC, has released JGAAP, a package offering various tools for QAA:
• http://evllabs.com/jgaap/w/index.php/
• the R package stylo is impressive, and I wish we had had it when we started our work on the Gramsci texts

some references to start from
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, An example of mathematical authorship attribution, “Journal of Mathematical Physics”, 2008, 49, pp. 1-20
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, L'attribuzione dei testi gramsciani: metodi e modelli matematici, “La Matematica nella Società e nella Cultura”, 2010, 3, pp. 235-269
• M. Lana, Come scriveva Gramsci? Metodi matematici per riconoscere scritti gramsciani anonimi, “Informatica Umanistica”, 2010, 3, pp. 31-56

some references (2)
• M. Lana, Individuare scritti gramsciani anonimi in un “corpus” giornalistico. Il ruolo dei metodi quantitativi, “Studi storici: rivista trimestrale dell'Istituto Gramsci”, 52 (4), pp. 859-880
• P. Juola, Authorship Attribution, “Foundations and Trends in Information Retrieval”, Vol. 1, No. 3 (2006), pp. 233-334, http://www.conll.org/~walter/educational/material/fnt-aa.pdf
• J. Grieve, Quantitative Authorship Attribution: An Evaluation of Techniques, “LLC”, 22, pp. 251-270, http://dl.dropboxusercontent.com/u/99161057/Grieve_authorshipattribution.pdf

thanks!
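
appendix: a sketch of the two measures
• the two measures described in the slides (n-gram matching and compression-based relative entropy) can be sketched in a few lines of Python
• this is only an illustration under stated assumptions, not the code used for the Gramsci attributions: the function names are hypothetical, zlib stands in for Zip, compressing the concatenation of the two texts stands in for "compressing with the dictionary of the first one", and dividing the algebraic sum by the number of n-grams is just one possible way to bring the score into the range -1 / +1

```python
import zlib


def char_ngrams(text, n=3):
    """Sliding character n-grams: 'final' with n=3 gives fin, ina, nal."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_score(known_ngrams, text, n=3):
    """Algebraic sum of matches (+1) and non-matches (-1).

    Assumption: the sum is divided by the number of n-grams so that the
    result lies in [-1, 1]; a positive value points towards the author
    whose n-gram dictionary is `known_ngrams`.
    """
    grams = char_ngrams(text, n)
    if not grams:
        return 0.0  # text shorter than n: nothing to compare
    signed_sum = sum(1 if g in known_ngrams else -1 for g in grams)
    return signed_sum / len(grams)


def compressed_size(data):
    """Size in bytes of the zlib-compressed data (zlib stands in for Zip)."""
    return len(zlib.compress(data, 9))


def extra_bytes(reference, unknown):
    """Bytes needed for `unknown` once the compressor has seen `reference`.

    Assumption: the concatenation trick C(ref + unk) - C(ref) is used as a
    rough proxy for the relative entropy; a smaller value means the two
    texts are closer. Only meaningful for short texts, because zlib looks
    back over a window of 32 KB at most.
    """
    ref = reference.encode("utf-8")
    unk = unknown.encode("utf-8")
    return compressed_size(ref + unk) - compressed_size(ref)
```

• usage: build `known_ngrams` as `set(char_ngrams(...))` over the certainly Gramscian texts, then compute `ngram_score(known_ngrams, anonymous_text)` and `extra_bytes(gramscian_text, anonymous_text)` for each anonymous text; the two numbers, once normalized, are the kind of coordinates that can be plotted on a Cartesian plane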