it’s always data
• once physical-world phenomena are digitized, the
same analysis can work equally well on:
• CT imaging,
• songs,
• ECG,
• texts,
• …
reason for the study
• national edition of Gramsci’s works, by Ministero
dei Beni Culturali
• new work on the newspaper articles
• many anonymous newspaper articles in the
journals and newspapers Gramsci wrote for:
Il Grido del Popolo, Avanti!, La Città Futura
• request from the Fondazione Gramsci to restart
the study of the anonymous articles, to find
new evidence of Gramsci’s writings
• this was in 2005
a little background
• the start is in 1847, V.J. Bunjakovskij On the possibility to apply
determining measures of confidence to the results of some
observing sciences, particularly statistics
• 1897-98, W. Lutosławski, “On Stylometry”; “Principes de
• 1959, D. R. Cox and L. Brandwood, On a discriminatory problem
connected with the works of Plato
• 1962, A. Ellegård, Who was Junius?
• 1964, F. Mosteller and D. Wallace Inference and Disputed
Authorship: The Federalist
• 1978, A. Kenny, The Aristotelian ethics: a study of the relationship
between the Eudemian and Nicomachean ethics of Aristotle
• 1980, J.P. Benzécri Pratique de l’analyse des données
• 1987, J. F Burrows, Word Patterns and Story Shapes: The Statistical
Analysis of Narrative Style, “LLC”, 2, 1987, pp. 61-70
in common…
• … they all work at the word level
the turning point
• G. Ledger, Re-counting Plato: A Computer Analysis of
Plato’s Style, Oxford, Clarendon Press, 1989
• what is counted:
words containing a specified letter;
words ending in a specified letter;
words with a specified letter as penultimate
• that is, parts of words that are semantically and
linguistically meaningless
• “I have departed from the traditional approach of
stylometry by ignoring entirely meanings and
grammatical functions, measuring instead the
frequencies of words according to their orthographic
today, for me (for us)
• the key is:
a latent mathematical structure of the text
• from: L. Doležel, A note on quantification in
text theory, in: “Text Processing”, S. Allén ed.,
Stockholm, 1982, pp. 539-552
• an expression of the idea: D. Khmelev, F.
Tweedie, Using Markov chains for
identification of writers, “LLC”, 16, 4, 2001,
pp. 299-307
today, for me (for us)
• another expression: D. Benedetto, E. Caglioti,
V. Loreto et al., Language Trees and Zipping,
“Phys. Rev. Lett.” 88, n. 4, 048702-1, 048702-4
• take one text and compress it with Zip;
• then take another text and compress it with
the compression dictionary of the first one;
• measure the difference in size: this is an
estimate of the relative entropy
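A minimal sketch of the zipping idea, not the authors' implementation. It assumes Python's zlib as the compressor and approximates "compressing with the dictionary of the first text" by compressing the first text alone, then the first text followed by the second, and taking the difference in size. The function names and the toy strings are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (maximum compression level)."""
    return len(zlib.compress(data, 9))


def cross_cost(reference: str, sample: str) -> int:
    """Extra compressed bytes needed to encode `sample` after `reference`.

    A small value suggests that `sample` reuses patterns already present in
    `reference`. Note: zlib only looks back 32 KB, so for long references
    only their final part actually acts as the "dictionary".
    """
    ref = reference.encode("utf-8")
    both = ref + sample.encode("utf-8")
    return compressed_size(both) - compressed_size(ref)


if __name__ == "__main__":
    known = "la classe operaia deve conquistare il potere politico. " * 20
    candidate = "la classe operaia deve conquistare il potere politico. " * 2
    print(cross_cost(known, candidate))
```

In practice one would compare the cost of the same anonymous sample against several reference corpora and prefer the reference with the lowest cost.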
then came the AAAC
• in 2004 the American mathematician Patrick Juola
proposed the Ad-hoc Authorship Attribution
Competition (AAAC) to find experimentally the best
method for correctly attributing anonymous works
• the second-best scorer was Vlado Keselj, with a method
based on measuring n-gram frequencies
the state of the QAA world in 2005
• in 2002 Jack Grieve, for his thesis
“Quantitative Authorship Attribution: A
History And An Evaluation Of Techniques”,
counts at least 39 known and used methods
with 93 variants for Quantitative AA
• the aim of AAAC: prune the useless methods
• nevertheless, this remains a craft rather than
a science
in 2005 we started
• we had to prove to the Fondazione Gramsci
that the Quantitative AA produced good
• we chose to use two QAA methods:
– relative entropy (already described)
– n-gram distances (which gave Keselj 2nd place
in the AAAC)
the protocol
• phase 1: 50 certainly Gramscian texts; 50 certainly
non-Gramscian texts;
– tune the methods however you like until they recognize the
Gramscian as Gramscian and the non-Gramscian
as non-Gramscian
• phase 2 (blind test): 40 unidentified texts,
some Gramscian and some not: classify them
text preparation
• deletion of:
– quotations of any length
– proper nouns
– numbers
• no lemmatization: e.g. the choice of a given
tense and person of a verb carries information
that we cannot evaluate well enough to justify
discarding it
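A rough sketch of the deletions listed above, under loud assumptions: quotations are taken to be anything between quotation marks or guillemets, and a "proper noun" is approximated as any capitalized word that does not open a sentence. Both heuristics and the function name `prepare` are hypothetical; the slides do not say how the deletions were actually carried out, and they may well have been done by hand. No lemmatization is applied.

```python
import re

# Assumption: quotations are delimited by "...", “...” or «...».
QUOTED = re.compile(r'"[^"]*"|“[^”]*”|«[^»]*»')
# Numbers, including forms such as 1.200 or 3,5.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def prepare(text: str) -> str:
    """Drop quotations, numbers and (crudely detected) proper nouns."""
    text = QUOTED.sub(" ", text)
    text = NUMBER.sub(" ", text)
    kept = []
    sentence_start = True
    for token in text.split():
        capitalized = token[0].isupper()
        # Crude heuristic: a capitalized word that does not open a
        # sentence is treated as a proper noun and dropped.
        if sentence_start or not capitalized:
            kept.append(token)
        sentence_start = token[-1] in ".!?"
    return " ".join(kept)


if __name__ == "__main__":
    print(prepare("Nel 1916 Gramsci scrisse molti articoli."))
```

The heuristic keeps sentence-initial proper nouns and drops any capitalized common noun in mid-sentence, so real material would need manual checking.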
• sequences of n entities you must choose (we chose
• sliding n-grams: in “final” the 3-grams read fin, ina, nal
• the right n is found by testing
• n-grams capture fragments of meaning, syntax,
collocations/cooccurrences, etc.
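The sliding window can be sketched in a few lines. This assumes character n-grams, as in the “final” example; the function name is hypothetical.

```python
def sliding_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(sliding_ngrams("final", 3))  # ['fin', 'ina', 'nal']
```

A text shorter than n yields no n-grams at all, which is one reason very short articles are harder to measure.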
• you have a dictionary of Gramscian n-grams
• you check the n-grams of your anonymous texts; you
count the matches and the non-matches and take the
algebraic sum: if positive the text is Gramscian, if
negative it is not
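The matching rule above can be sketched as follows. This is a simplified illustration, assuming character n-grams, a plain set as the "dictionary", and equal weight for every match and non-match; the names and the toy texts are hypothetical and the actual method may weight n-grams differently.

```python
def ngrams(text: str, n: int) -> list[str]:
    """All overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def signed_score(anonymous: str, gramscian_ngrams: set[str], n: int) -> int:
    """Matches minus non-matches against the Gramscian n-gram dictionary."""
    grams = ngrams(anonymous, n)
    matches = sum(1 for g in grams if g in gramscian_ngrams)
    non_matches = len(grams) - matches
    return matches - non_matches


if __name__ == "__main__":
    # Toy dictionary built from a single known phrase (assumption).
    dictionary = set(ngrams("la classe operaia", 3))
    print(signed_score("classe operaia", dictionary, 3))  # 12
    print(signed_score("xyzxyzxyz", dictionary, 3))       # -7
```

A positive score reads as "Gramscian", a negative one as "not Gramscian".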
• maximize the correct attributions
• at the same time, avoid false attributions
• = some missed attributions are ok if you don’t
produce false attributions
• your client must be able to trust you
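One way to read this trade-off is as the choice of a decision threshold on the training scores. The sketch below is an assumption about how it could be done, not the procedure used in the project: the threshold is set at the highest score obtained by any known non-Gramscian text, so no known non-Gramscian text is accepted, and the missed Gramscian texts are simply counted. The function name and the scores are hypothetical.

```python
def conservative_threshold(gramscian_scores, other_scores):
    """Pick a threshold that accepts no known non-Gramscian text.

    Returns (threshold, accepted, missed), where a text is accepted
    only if its score is strictly above the threshold.
    """
    threshold = max(other_scores)
    accepted = sum(1 for s in gramscian_scores if s > threshold)
    missed = len(gramscian_scores) - accepted
    return threshold, accepted, missed


if __name__ == "__main__":
    # Hypothetical normalized scores, for illustration only.
    print(conservative_threshold([0.9, 0.4, 0.1, -0.2], [0.2, -0.5, -0.7]))
    # (0.2, 2, 2)
```

Here two Gramscian texts are missed, which is acceptable under this strategy because no false attribution is produced.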
strategy 2
• we don’t know if, how, and how much the
“parole” of an author changes across topics,
audience, genre, time, …
• so we decided to work on well-defined
periods, leaving the choice of their boundaries
to the Gramsci experts
• 1st period: 1914-1921
a little of maths
• having two methods at work, we could build a
Cartesian plane, where the results of the
measures were plotted after normalization
to the range from -1 to +1
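A sketch of how two sets of scores could be brought into the range from -1 to +1 and paired as points on the plane. The slides do not state which normalization was used; linear min-max rescaling is assumed here, and the function name and sample values are hypothetical.

```python
def rescale(values):
    """Linearly map values so that the minimum is -1 and the maximum is +1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: place them at the centre of the range.
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


if __name__ == "__main__":
    entropy_scores = [0.0, 5.0, 10.0]   # hypothetical, one per text
    ngram_scores = [-4, 0, 4]           # hypothetical, one per text
    points = list(zip(rescale(entropy_scores), rescale(ngram_scores)))
    print(points)  # [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
```

Each text becomes one point whose coordinates are its two normalized measures, so texts on which both methods agree fall in the same quadrant.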
phase 1 - setup
phase 2 – blind test
the day after
• we started producing attributions, paid for by the
Fondazione Gramsci, without knowing anything about
the texts, and gave periodic reports to the historians
editing the various volumes of the national edition of
Gramsci’s works
• we received the texts, normalized them, measured them,
and produced a report that we sent to the Fondazione Gramsci
• the historians’ evaluation of the QAA: no proposed
attribution was unacceptable, even if not every
proposed attribution was accepted
• [example of report]
now we have stopped
• due to cuts in research funding, work on the
national edition is currently halted
some practical principles on AA
• no tool can ‘read’ a text and tell you: this text was
written by Francesco Stella
• you can only classify the texts you chose to work
on, crunched by the tool you use
• all of the texts will be connected: you must
interpret the results
• you must mix anonymous or disputed works with
“control works”: same period, same genre, same
language, same author, similar authors, …
be careful
• when you have proper nouns in your works,
it’s easy to classify them:
• R. Clement and D. Sharp, Ngram and Bayesian
Classification of Documents for Topic and
Authorship, “LLC”, 2003, 18(4):423-447
• but you don’t really classify the texts, you
classify the collections of proper nouns they
why the Gramsci case was/is difficult
and strange
• articles are very short: between 300 and
1000/1200 words
• all of these articles share: topics, ideology,
• there is no way to cross-check the results, and you
work for a scientific, production-oriented initiative
(it’s not ‘simply’ an experiment)
• the tables showing the matches are sparse;
nevertheless, these data work well
now what
• Patrick Juola, the mathematician who
proposed the AAAC, has released JGAAP, a
package offering various tools for QAA:
• the R package stylo is impressive, and I
wish we had had it when we started our work on
the Gramsci texts
some references to start from
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti, An
example of mathematical authorship attribution,
“Journal Of Mathematical Physics”, 2008, 49, pp. 1 – 20
• C. Basile, D. Benedetto, E. Caglioti, M. Degli Esposti,
L'attribuzione dei testi gramsciani: metodi e modelli
matematici, “La Matematica nella Società e nella
Cultura”, 2010, 3, pp. 235 – 269
• M. Lana, Come scriveva Gramsci? Metodi matematici per
riconoscere scritti gramsciani anonimi, “Informatica
Umanistica”, 2010, 3, pp. 31-56
some references (2)
• M. Lana, Individuare scritti gramsciani anonimi in un
“corpus” giornalistico. Il ruolo dei metodi quantitativi,
“Studi storici: rivista trimestrale dell'Istituto Gramsci”,
52 (4), 859-880
• P. Juola, Authorship Attribution, “Foundations and
Trends in Information Retrieval”, Vol. 1, No. 3 (2006)
• J. Grieve, Quantitative Authorship Attribution: An
Evaluation of Techniques, LLC 22: 251-270

