Abstract
The wider availability of large-scale datasets and reproducible algorithms has boosted the application of NLP to living languages. On the other hand, dead languages benefit from the availability of curated resources both to offset the sparseness of available data and to make data accessible to researchers. We present here AGVaLex, a computational valency lexicon automatically extracted from the Ancient Greek Dependency Treebank. It contains quantitative corpus-driven morphological, syntactic and lexical information about verbs and their direct and indirect arguments and has a wide range of applications for the study of Ancient Greek. To illustrate these applications, we offer a case study that compares the semantic flexibility of transitive verb formulae in archaic Greek epic to a non-formulaic corpus, with the goal of detecting unique patterns of variation. We also illustrate the possibilities afforded by AGVaLex to scholars with a less extensive background in computational corpus-based research.
1 Verbal valency and valency lexicons
The concept of verbal valency has received much attention in different linguistic traditions (cf., e.g., the overview in Zanchi 2018). Introduced by Lucien Tesnière in the context of Dependency Grammar in 1959, the term valency refers to the extent to which verbs determine the configuration of a consistent and predictable number of participants, which Tesnière referred to as actants. Actants are contrasted with circumstants, which are free modifiers of verbs. Actants are commonly termed arguments, a term that encompasses semantic and syntactic roles. The English ditransitive verb give, for example, requires three arguments: one expressing the person giving, one expressing the object given, and one expressing the recipient. If we know that an English sentence contains an active form of the verb give, we can expect to find its three arguments realized in the sentence, as we can see in (1) or (2). The transitive verb print, on the other hand, only requires two arguments (the person or object printing, and the object being printed), so we can expect to see two arguments if a sentence contains the verb print in its active form as in (3). An adjunct like yesterday can occur with most verbs and its presence cannot be expected based only on the presence of a verb like give or print (as in 1 and 3).
(1) I gave you the phone yesterday.
(2) He gave the receipts to the customers.
(3) They printed the paper yesterday.
Different linguistic subfields have referred to concepts related to valency with different terms, each of which has a slightly different scope. While ‘valency’ was first used in linguistics in the context of Dependency Grammar by Tesnière, subcategorization was introduced with phrase structure grammars in the generative linguistics tradition, and it has been widely adopted in Natural Language Processing research. Following McGillivray (2014: 31 ff.), we adopt here an operational definition of valency, based on corpus and distributional methods, and take a theory-agnostic view on this topic. Further, we describe AGVaLex, a corpus-driven valency lexicon for ancient Greek, and illustrate its value for historical linguistics scholarship through a case study on Homeric formulae. The lexicon was created automatically from the dependency syntax annotation of the Ancient Greek Dependency Treebank 2.0 (
Our focus on ancient languages, and particularly on ancient Greek, a ‘large-corpus language’ (Mayrhofer 1980; Untermann 1983), offers us the opportunity to test the effectiveness of corpus methods on a language for which no native speakers are available. This has received an increasing level of attention in recent years, in conjunction with the development and analysis of large-scale annotated corpora for corpus languages (see for example McGillivray 2014 for an overview on Latin).
The lexicon has a number of advantages and considerable potential for reuse. Thanks to its automatic creation procedure, the lexicon can be regenerated if new annotated data become available or if the annotation is corrected, which enhances its potential applications in future research. Moreover, unlike traditional dictionaries and handmade valency lexicons, computational valency lexicons like the one we present here provide a quantitative and systematic account of the valency properties of verbs reflecting the corpus they are extracted from. They can tell us whether a verb, for example, is found with a particular argument pattern, and how many times this occurs in the corpus. They can also give us information about the distribution of these patterns across authors, genres and works, and whether there is a change over time. The lexicon provides information about the number and type of arguments of all verbs occurring in the treebank. Its entries are equipped with morpho-syntactic information, namely the case of nouns and the mood of verbs, the gender and number of nouns and adjectives, and the voice of verbs. This information is highly valuable to investigate a range of linguistic questions, from the analysis of word order patterns to the study of the constructions that occur with specific verbs. The lexicon’s valency patterns also display the lemmas of the arguments, which allows for lexical-semantics studies, for example to investigate the semantic fields of the subjects or objects of verbs and how they vary by author or work. At the same time, because of the automatic extraction procedure with which it was built, the lexicon is bound to inherit any annotation errors that were present in the treebank.
1.1 Applications to Ancient Greek
Valency lexicons are an extremely useful tool in linguistic research on verbal valency patterns. For Ancient Greek, an important recent contribution on the topic is Keersmaekers (2020), who analyses language variation in a corpus of Greek papyri, focusing on complementation and the role of tense, aspect, and modality in verbal complements. This study, however, also showcases how much corpus pre-processing is necessary for this kind of scholarship; AGVaLex will offer an important tool for researchers wishing to conduct research on similar topics without collating their own corpus.
To illustrate this application, we propose a case study taken from Rodda (2021), on linguistic variation in archaic Greek epic poetry. The language of early Greek epic relies extensively on formulae, repeated constructions with limited syntactic and semantic flexibility; generations of researchers have investigated the precise extent of this flexibility and how it relates to issues of oral performance and language change (Rodda 2021 provides an extended bibliography; see particularly Hainsworth 1968 for an important example of this approach, and Friedrich 2019 for some criticism). Our study shows how the application of a well-developed pre-existing resource such as AGVaLex allows for new approaches to this crucial question in Homeric studies.
2 Previous work
Dictionaries typically display some information about verbal valency in their lexical entries. This is usually in the form of the grammatical case of arguments and the prepositions introducing the arguments themselves. For example, the dictionary entry for the verb
Over time, dedicated valency lexicons have been created for specific languages. For example, Happ (1976) presents the only hand-made valency lexicon for Latin, which was derived from a manual analysis of 800 verbal occurrences in Cicero’s Orationes. Such resources offer high-quality information derived from a detailed manual analysis and are therefore very reliable. However, they suffer from the lack of completeness which we observed earlier, and which affects other handmade resources like traditional dictionaries.
Several large textual resources of Ancient Greek are available today, including full-text databases such as TLG (Thesaurus Linguae Graecae), and the Perseus Digital Library (Bamman & Crane 2011), manually annotated corpora such as the Ancient Greek Dependency Treebank (AGDT 2.0), PROIEL (Pragmatic Resources of Old Indo-European Languages, Haug & Jøndal 2008),2 and automatically annotated corpora such as the Diorisis Ancient Greek Corpus (Vatri & McGillivray 2018). Of these, PROIEL and the AGDT 2.0 are syntactically annotated. AGDT 2.0 follows the Dependency Grammar annotation model of the Prague Dependency Treebank for Czech (PDT 3.0; Hajič et al. 1999), which was created in the tradition of the Functional Generative Description (Sgall et al. 1986). AGDT 2.0 follows a predicate-centric approach where each word corresponds to a node in the treebank. The texts are annotated in separate yet interconnected layers. The analytical layer, focusing on syntax, comprises dependency syntactic trees, is built on the morphological layer, and serves as the foundation for the tectogrammatical layer, where semantic information such as semantic role labeling, information structure, and annotations for anaphora/ellipsis resolution is included.
The increased availability of such large syntactically annotated corpora has made it possible to conduct large-scale analyses of historical languages (Haug 2015; Eckhoff et al. 2018; Biagetti et al. 2021), and in particular to develop methods for extracting valency information automatically, thus supporting the creation of corpus-driven computational resources aiming at a systematic account of the valency behaviour of the verbs in the corpora. Typically, such resources are drawn from corpora provided with morpho-syntactic annotation. As the annotation marks predicates and their arguments, it is then possible to automatically identify them and extract them in the form of a table or database. For an overview of computational valency lexicons and a discussion of valency vs subcategorization for Latin, see McGillivray (2014: 31 ff.), and for a description of such a lexicon for Latin see McGillivray et al. (2009), McGillivray & Passarotti (2012), and Passarotti et al. (2016). Computational valency lexicons have several advantages over their manual counterparts. Because they directly rely on corpus data, they can easily show quantitative information such as the frequencies of each pattern for each verb, and link those back to the original corpus occurrences. They can also be easily expanded as the corpora they are based on grow, because they have been created programmatically.
Only one corpus-based valency lexicon is currently available for Ancient Greek, as far as we are aware: HoDeL, the Homeric Dependency Lexicon (Zanchi et al. 2018; Zanchi 2021), a project run at the University of Pavia.3 HoDeL was automatically extracted from the syntactically annotated portion of the AGDT containing the Homeric poems. As explained in its guidelines,4 for every verb in the Homeric poems, HoDeL includes its arguments, i.e. those dependents that are tagged as subjects, objects, object complements, and predicate nominals. In the guidelines, the authors point out that AGDT does not contain referential null arguments and that there are several consistency issues that affect the annotation of the treebank. Therefore, HoDeL has been manually edited to correct some annotation errors in the corpus, particularly on lemmatisation.
The online tool Myria (
3 The lexicon
AGVaLex was created from the Ancient Greek Dependency Treebank (AGDT 2.0; Celano 2019). AGDT 2.0 contains 557,922 tokens from the works listed in Table 1. AGVaLex is licensed under a Creative Commons Attribution-ShareAlike 3.0 United States License and is available on Figshare (McGillivray 2021).
Table 1
List of authors and works included in the AGDT 2.0 treebank and in AGVaLex
| Author | Title |
|---|---|
| Aeschylus | Agamemnon, Eumenides, Libation Bearers, Persians, Prometheus Bound, Seven Against Thebes, Suppliant Women |
| Aesop | Aesop’s Fables 1.1–1.50 |
| Athenaeus of Naucratis | Deipnosophistae |
| Diodorus Siculus | Bibliotheca Historica |
| Herodotus | Histories |
| Hesiod | Shield of Heracles, Theogony, Works and Days |
| Homer | Iliad, Odyssey |
| Lysias | Against Alcibiades 1 and 2, Against Pancleon, On the Murder of Eratosthenes |
| Plato | Euthyphro |
| Plutarch | Alcibiades, Lycurgus |
| Polybius | Histories |
| Pseudo Apollodorus | Bibliotheca |
| Pseudo Homer | Hymn to Demeter |
| Sophocles | Ajax, Antigone, Electra, Oedipus Tyrannus, Trachinae |
| Thucydides | History of the Peloponnesian War |
The XML files of the so-called analytical layer of annotation of the treebank contain dependency-based syntactic trees. The treebank files were first converted into a tab-separated format via a Perl script, and then imported into a MySQL database; a series of MySQL query scripts then produced several database tables making up the lexicon. The scripts were adapted from the work done to create the Latin Dependency Treebank valency lexicon described in McGillivray (2014: 31–60). Specifically, we extracted all dependents of verbal forms labelled as ‘SBJ’ (subjects), ‘OCOMP’ (object complements), ‘PNOM’ (predicate nominals) and ‘OBJ’, which includes all other arguments, i.e. nouns and pronouns in the accusative, dative and genitive cases, prepositional phrases, infinitive verbs, and subordinate clauses that can function as verbal objects such as accusative + infinitive constructions. We excluded dependents labelled as ADV (adverbials), ATR (modifiers), and ATV/ATVV (non-governed complements, such as predicative noun phrases), as these are not considered part of the verbal valency. The lexicon contains both direct children of verbal forms and indirect children reached via preposition (AUXP), conjunction (AUXC), coordination (COORD) and apposition (APOS) nodes. The specific handling of recursive relations such as those between predicates and their indirect children made the extraction of the frames non-trivial. It is important to note that, because it was extracted from an annotated corpus and because of the annotation of AGDT 2.0, the lexicon does not include referential null arguments, i.e. those arguments that are identifiable in the context but not lexically realised, and which were employed in ancient Indo-European languages (Luraghi 2003, Keydana & Luraghi 2012, Haug 2012, Sausa & Zanchi 2015).6 This means that a valency frame with zero arguments in principle could indicate either an instance of an impersonal verb or an instance of an intransitive verb with a null subject.
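The extraction logic described above can be sketched in a few lines of Python. This is an illustration only, not the Perl and MySQL pipeline used to build AGVaLex: the function names, the `<sentence>` wrapper, and the alphabetical ordering of slots in the frame string are assumptions. The tokens are a subset of AGDT sentence 2901046 (Aeschylus, Persians 703), and the positions read from the `postag` string (mood at index 4, voice at index 5, case at index 7) follow the nine-character AGDT tag layout.

```python
import xml.etree.ElementTree as ET

ARGUMENT_LABELS = {"SBJ", "OBJ", "PNOM", "OCOMP"}
BRIDGE_LABELS = {"AuxP", "AuxC", "COORD", "APOS"}  # nodes to look through
CASES = {"n": "nominative", "g": "genitive", "d": "dative",
         "a": "accusative", "v": "vocative"}
VOICES = {"a": "active", "m": "middle", "p": "passive", "e": "medio-passive"}

SAMPLE = """<sentence>
<word id="2" form="ἐπεὶ" lemma="ἐπεί" postag="c--------" head="25" relation="AuxC"/>
<word id="3" form="δέος" lemma="δέος" postag="n-s---nn-" head="7" relation="SBJ"/>
<word id="4" form="παλαιὸν" lemma="παλαιός" postag="a-s---nn-" head="3" relation="ATR"/>
<word id="5" form="σοὶ" lemma="σύ" postag="p-s----d-" head="7" relation="OBJ"/>
<word id="6" form="φρενῶν" lemma="φρήν" postag="n-p---fg-" head="3" relation="ATR"/>
<word id="7" form="ἀνθίσταται" lemma="ἀνθίστημι" postag="v3spie---" head="2" relation="ADV"/>
</sentence>"""


def collect_arguments(words, head_id):
    """Return argument tokens under head_id, looking through bridge nodes."""
    found = []
    for word in words:
        if word.get("head") != head_id:
            continue
        label = word.get("relation").split("_")[0]  # drop _CO / _AP suffixes
        if label in BRIDGE_LABELS:
            # Indirect children: recurse below the preposition, conjunction, etc.
            found.extend(collect_arguments(words, word.get("id")))
        elif label in ARGUMENT_LABELS:
            found.append(word)
    return found


def describe(word):
    """Slot description: 'infinitive' for infinitives, otherwise the case."""
    postag = word.get("postag")
    if postag[0] == "v" and postag[4] == "n":
        return "infinitive"
    return CASES.get(postag[7], "unknown")


def build_frame(words, verb):
    """Return (frame string, list of lexical fillers) for one verb token."""
    voice = VOICES.get(verb.get("postag")[5], "unknown")
    args = collect_arguments(words, verb.get("id"))
    slots = sorted(f"{a.get('relation')}[{describe(a)}]" for a in args)
    fillers = sorted(f"{a.get('relation')}[{a.get('lemma')}]" for a in args)
    return f"{voice}_{','.join(slots)}", fillers


words = ET.fromstring(SAMPLE).findall("word")
verb = next(w for w in words if w.get("id") == "7")
frame, fillers = build_frame(words, verb)
print(frame)
print(fillers)
```

Because ADV, ATR and similar labels are simply never collected, the exclusion of adjuncts needs no separate step in this sketch.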



Figure 1
A selection of six entries from AGVaLex
Citation: Journal of Greek Linguistics 24, 2 (2024) ; 10.1163/15699846-02402003
Figure 1 displays six entries from AGVaLex. Each entry (or database record) corresponds to a verbal token occurrence in the AGDT and each column corresponds to each of eight different attributes of the token, which we can categorize into three main groups:
- Metadata: the columns ‘author’, ‘title’, ‘subdoc’, and ‘sentence_id’ contain, respectively, the name of the author, the title of the work, the passage where the verb token occurs, and the identifier of the sentence in the treebank.
- Verb token attributes: the columns ‘verb’ and ‘voice’ display the verb’s lemma and voice, respectively.
- Argument patterns: the columns ‘frame’ and ‘frame_fillers’ contain the valency information, as explained in more detail below.
Let us consider the first entry in Figure 1. This entry corresponds to sentence 2901046 of the treebank, from Persians by Aeschylus, lines 703–706:
ἀλλ᾽ ἐπεὶ δέος παλαιὸν σοὶ φρενῶν ἀνθίσταται, τῶν ἐμῶν λέκτρων γεραιὰ ξύννομ᾽ εὐγενὲς γύναι, κλαυμάτων λήξασα τῶνδε καὶ γόων σαφές τί μοι λέξον·
‘Since dread long ingrained in your mind restrains you, cease, noble woman, venerable partner of my bed, from your tears and laments, speak to me with all frankness.’
Smyth 1926
The annotation of the first part of this sentence in the treebank is shown below. Each token is indicated by the XML tag ⟨word⟩, whose attributes are id (identifier of the token in the sentence), cid (another identifier of the token), form (the form of the token in the sentence), lemma (the lemma of the token), postag (the part-of-speech tag of the token), head (the syntactic head of the token), relation (the syntactic relation that holds between the token and its head), and cite (the text passage). Each attribute is followed by its value between double quotes. For example, “1” is the value of the attribute “id” in the first line.
<word id="1" form="ἀλλ̓" lemma="ἀλλά" postag="d--------" head="25" relation="AuxY" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="2" form="ἐπεὶ" lemma="ἐπεί" postag="c--------" head="25" relation="AuxC" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="3" form="δέος" lemma="δέος" postag="n-s---nn-" head="7" relation="SBJ" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="4" form="παλαιὸν" lemma="παλαιός" postag="a-s---nn-" head="3" relation="ATR" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="5" form="σοὶ" lemma="σύ" postag="p-s----d-" head="7" relation="OBJ" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="6" form="φρενῶν" lemma="φρήν" postag="n-p---fg-" head="3" relation="ATR" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="7" form="ἀνθίσταται" lemma="ἀνθίστημι" postag="v3spie---" head="2" relation="ADV" cite="urn:cts:greekLit:tlg0085.tlg002:703"/>
<word id="8" form="," lemma="," postag="u--------" head="2" relation="AuxX" cite=""/>
According to the treebank annotation guidelines (Celano 2014), which follow the foundations of Dependency Grammar and differ in some respects from the concepts of traditional grammars, the PRED (“predicate”) function is assigned to the verb of the main clause in a sentence, while any other verb receives a different label indicating its function in relation to its parent node. Subjects are tagged as ‘SBJ’, other verb arguments as ‘OBJ’, predicate nominals as ‘PNOM’, and object complements as ‘OCOMP’. All these elements can depend on a coordination node (tagged as ‘COORD’), in which case they take the suffix ‘_CO’, or on an apposition node (tagged as ‘APOS’), in which case they take the suffix ‘_AP’.
More specifically, the label ‘SBJ’ is used to mark the syntactic subject of a clause. Subjects are typically nouns or pronouns in the nominative case, but can also be participles, infinitives, and substantive clauses. Moreover, AGDT distinguishes between OBJ (direct object) and other arguments. Direct objects typically include nouns, but also infinitive clauses and substantive clauses, for example. AGDT uses OCOMP for predicative complements that complete the meaning of the object. These can include nouns or adjectives that further describe the direct object and participles when used predicatively with certain verbs such as verbs of perception. Finally, the label ‘PNOM’ is used to annotate predicate nominatives, i.e. predicate nouns or adjectives in copular constructions. For a full explanation of the annotation, see Celano (2014). In the example sentence, the verb form
The AGDT has 548,782 word tokens (i.e. individual instances of words) of which 95,841 have been tagged with the part of speech ‘verb’ or ‘participle’; these correspond to 36,964 verb types (i.e. distinct verbs). AGVaLex was extracted from this treebank and contains 71,868 entries, one for each of the verb tokens occurring with at least one argument in this corpus. Table 2 displays some basic statistics of the lexicon.
Table 2
Basic statistics of AGVaLex
| Entity | Count |
|---|---|
| Verb tokens (lexical entries) | 71,868 |
| Unique verb lemmas | 5,077 |
| Unique frames | 4,116 |
| Unique frames with lexical fillers | 43,624 |
The treebank contains texts of 15 authors and 31 works. Table 3 shows the number of lexical entries for each author.
Table 4 shows the 20 most frequent frames in the lexicon. The most frequent frame is the pattern ‘active_OBJ[accusative]’ which corresponds to constructions with accusative direct objects. Note that subjects in ancient Greek by default are not expressed if topical, so this frame includes those cases in which the predicate is, for example, inflected in the first person singular and the subject is not expressed lexically.
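Since every entry is one verb token with its frame, frequency tables of this kind reduce to counting. The following minimal sketch assumes the entries have been loaded as dictionaries keyed by the column names ‘author’, ‘verb’ and ‘frame’; the records and the placeholder verb names are invented for illustration and are not taken from the lexicon.

```python
from collections import Counter

# Hypothetical records; real entries would be read from the lexicon tables.
records = [
    {"author": "Homer", "verb": "verb_a", "frame": "active_OBJ[accusative]"},
    {"author": "Homer", "verb": "verb_b", "frame": "active_OBJ[accusative]"},
    {"author": "Homer", "verb": "verb_a", "frame": "active_SBJ[nominative]"},
    {"author": "Plato", "verb": "verb_a", "frame": "active_OBJ[accusative]"},
    {"author": "Plato", "verb": "verb_c", "frame": "active_OBJ[infinitive]"},
]


def frame_frequencies(records, author=None):
    """Count frames over all entries, or over one author's entries only."""
    selected = (r for r in records if author is None or r["author"] == author)
    return Counter(r["frame"] for r in selected)


overall = frame_frequencies(records)
homer = frame_frequencies(records, author="Homer")
print(overall.most_common(2))
print(homer.most_common(2))
```

The same function, filtered by ‘verb’ instead of ‘author’, would give the frame distribution of a single lemma.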
Table 3
Number of AGVaLex’s entries for each of the authors. Each entry corresponds to a verb token from the Ancient Greek Dependency Treebank.
| Author | Number of lexicon entries |
|---|---|
| Aeschylus | 6007 |
| Aesop | 826 |
| Athenaeus | 5766 |
| Diodorus | 3445 |
| Herodotus | 4784 |
| Hesiod | 2169 |
| Homer | 30567 |
| Lysias | 1176 |
| Plato | 745 |
| Plutarch | 2884 |
| Polybius | 3357 |
| Pseudo-Apollodorus | 150 |
| Pseudo-Homer | 450 |
| Sophocles | 6293 |
| Thucydides | 3249 |
| TOTAL | 71868 |
Table 4
Most frequent valency frames, with their frequency in the lexicon
| Frame | Count |
|---|---|
| active_OBJ[accusative] | 12557 |
| active_OBJ[accusative],SBJ[nominative] | 4512 |
| active_SBJ[nominative] | 4218 |
| active_OBJ[infinitive] | 2260 |
| active_OBJ[dative] | 1723 |
| medio-passive_OBJ[accusative] | 1624 |
| middle_OBJ[accusative] | 1543 |
| medio-passive_SBJ[nominative] | 1506 |
| active_PNOM[nominative] | 1234 |
| active_OBJ[genitive] | 1190 |
| active_PNOM[nominative],SBJ[nominative] | 1037 |
| medio-passive_OBJ[dative] | 829 |
| active_OBJ_CO[accusative] | 806 |
| active_OBJ[infinitive],SBJ[nominative] | 792 |
| active_OBJ[dative],SBJ[nominative] | 790 |
| medio-passive_OBJ[infinitive] | 750 |
| active_OBJ[accusative],OBJ[dative] | 744 |
| active_( | 682 |
| active_OBJ[dative],OBJ[accusative] | 596 |
| middle_SBJ[nominative] | 584 |
3.1 Comparison with traditional lexicographical resources
A practical way to show the usefulness of the lexicon is to compare it with a commonly used scientific dictionary. We chose to compare it with the relatively recent GE (see Section 2), rather than the older LSJ (Liddell et al., 1996), as GE highlights valency information more clearly, especially for high frequency verbs. So, for instance, the various constructions for
In order to compare the constructions listed in the dictionary with those in the lexicon, we chose a small set of 5 transitive verbs, from the larger dataset used in Section 4. These are very high frequency verbs with a wide range of constructions:
For each of these verbs, we noted down the dependency information that is given in GE, without taking note of diathesis (active vs. middle vs. passive), as the dictionary does not always break down meanings by diathesis unless a specific passive or middle meaning is involved. We then searched AGVaLex for all dependencies that are recorded for each verb, and noted which ones do not appear in the dictionary, as well as where they are attested. We made this choice because constructions that occur in a range of authors are arguably more likely to be recorded in a dictionary than constructions that are unique to one author, even in a partial sample like the one that forms the basis of the AGVaLex.
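The comparison procedure amounts to a set difference followed by two filters. A minimal sketch, assuming that the dictionary's constructions and the lexicon's dependencies have already been normalised by hand to the same labels (with diathesis ignored); the function name, the labels and the toy data are illustrative assumptions:

```python
from collections import defaultdict


def compare_with_dictionary(dictionary_constructions, lexicon_rows, min_count=10):
    """lexicon_rows: (construction, author) pairs, one per verb token."""
    counts = defaultdict(int)
    authors = defaultdict(set)
    for construction, author in lexicon_rows:
        counts[construction] += 1
        authors[construction].add(author)
    only_in_lexicon = set(counts) - set(dictionary_constructions)
    return {
        "only_in_lexicon": only_in_lexicon,
        "multi_author": {c for c in only_in_lexicon if len(authors[c]) > 1},
        "frequent": {c for c in only_in_lexicon if counts[c] >= min_count},
    }


# Toy data for a single hypothetical verb.
dictionary = {"OBJ[accusative]", "OBJ[dative]"}
rows = [
    ("OBJ[accusative]", "Homer"),
    ("OBJ[genitive]", "Homer"),
    ("OBJ[genitive]", "Herodotus"),
    ("OBJ[infinitive]", "Homer"),
]
result = compare_with_dictionary(dictionary, rows, min_count=2)
print(result)
```

The threshold is lowered to 2 here only so that the toy data produces a non-empty result; the default of 10 corresponds to the frequency cut-off used in the comparison.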
The results of this comparison are summarised in Table 5. The final column in this table contains the number of ‘collostructions’ reported for the same verb in Myria (see Section 2). There is only limited overlap between the way Myria categorises collostructions and the data in AGVaLex: while both resources record information about the lexical fillers that occur with each verbal lemma in the corpus, AGVaLex does not provide any information about the lemma’s preference for a specific construction compared to expectations. Therefore, the numbers for Myria in Table 5 are only reported for reference.
Table 5
Comparison between GE, Myria, and AGVaLex on transitive verbs
| Verb | Constructions recorded in GE | Constructions only in AGVaLex | Constructions only in AGVaLex that occur in more than one author | Constructions only in AGVaLex that occur at least 10 times | Collostructions in Myria |
|---|---|---|---|---|---|
| | 5 | 31 | 1 | 1 | 10 |
| | 9 | 27 | 7 | 2 | 12 |
| | 9 | 43 | 7 | 3 | 9 |
| | 1 | 69 | 15 | 7 | 8 |
| | 9 | 60 | 12 | 5 | 12 |
As Table 5 shows, while AGVaLex lists significantly more constructions than the dictionary, most of them are very rare and/or unique to one author, which makes them less relevant to a lexicographical resource that is meant to represent ‘standard’ Greek, with only limited reference to special usage. In addition to this, a large proportion of constructions that are unique to one author are unique to Homer, a phenomenon that sometimes has to do with Homeric syntax preserving traces of an archaic stage of development (see e.g. Hackstein 2010); for instance, as many as 48 of the 69 constructions listed for
That said, even for such a small sample as the one we tested, AGVaLex does sometimes bring useful additional information. For instance, we can hypothesise that a small cluster of constructions for
The results above, of course, are not meant to show that the valency lexicon is superior to a published dictionary. Each lexical resource has its own purpose, but the valency lexicon does prove its worth in a test of its completeness against a common lexicographical resource, as well as offering possibilities in relation to common philological aims like textual criticism, as detailed above.
In addition to the features described above, AGVaLex allows the user to retrieve summary data by construction (e.g., searching for all verbs that take the preposition
A note on the Homeric Dependency Lexicon (HoDeL), the only other available verbal valency database for Ancient Greek, as described in Section 2. Since AGVaLex covers a much broader corpus than HoDeL, it makes little sense to directly compare the number of constructions retrievable by the two tools. A comparison between the search for constructions with



Figure 2
Screenshot from HoDeL illustrating an example of its use. The site was accessed on 08/04/2024.
The user can filter this data using the left-hand side menus in order to look at, for instance, all occurrences of the verb with three arguments, and explore which case these arguments appear in (see Figure 2); individual examples are represented through a dependency tree on the right. This visualisation makes HoDeL an excellent teaching tool, and it is considerably more user-friendly than is usual in the field. On the other hand, downloading data from HoDeL is effectively impossible, making it a less useful tool for studies that aim to cover all examples of a phenomenon, as illustrated in Section 4. Together with AGVaLex’s wider coverage, this shows how the two tools are complementary in their range and use cases.
4 Case study: semantic variation in TrV+Obj formulae
4.1 Aims and context of the study
The case study introduced here aims to assess the scope of semantic variation in a sample of epic formulae, and then to compare the results with a baseline corpus (for the importance of this step see Wulff 2008). We use Distributional Semantics to quantify semantic variation. The target of analysis is a sample of formulae made of a transitive verb and its direct object in the accusative (from here on, TrV+Obj formulae), selected exclusively on the basis of frequency. These are phrases of the type
The study of formulaic variation has been a major topic in Homeric studies since at least the 1960s (Hoekstra 1965, 1969; Hainsworth 1968; Postlethwaite 1979; Friedrich 2019). Formulae allow for a limited amount of linguistic variation, a trait which they share with idioms and other multi-word expressions in everyday language (Kiparsky 1976). Most recently, the behaviour of formulae has been described under the linguistic framework of Construction Grammar (Goldberg 1995): formulae are indissoluble pairs of form and function, and as such are characterised by restrictions as to their shape as well as their meaning (Bozzone 2014, 2024; Antović & Cánovas 2016).
We use Distributional Semantics to model the range of meaning of the formulae and non-formulaic material in this case study. As a corpus-based approach, Distributional Semantics is particularly suited to the study of dead languages such as Ancient Greek, where no speaker input can be sought. In Distributional Semantics, the meaning of a word is defined as a function of its collocates in a corpus: words that share a linguistic context are also related in meaning (Harris 1954; Fabre & Lenci 2015). Shared linguistic contexts are modelled mathematically via word vectors which encode the frequency of co-occurrences between each word in the corpus and each of the others (with the possible exception of semantically empty ‘stop-words’). These vectors form a distributional space model of meaning (DSM); the distance between the vectors associated with any two words represents the similarity between the words’ meanings.10
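To make the idea concrete, the sketch below builds raw co-occurrence vectors from a toy corpus and compares them with cosine similarity. It is an illustration under simplifying assumptions: the window size, the English toy sentences and the function names are invented here, and the weighting and dimensionality reduction that a full DSM would normally apply are left out.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2, stop_words=frozenset()):
    """Map each word to a Counter of the words found within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence if t not in stop_words]
        for i, target in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
        math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


toy_corpus = [
    ["king", "drinks", "wine"],
    ["queen", "drinks", "wine"],
    ["king", "rides", "horse"],
    ["ship", "sails", "sea"],
]
vectors = cooccurrence_vectors(toy_corpus)
print(cosine_similarity(vectors["king"], vectors["queen"]))
print(cosine_similarity(vectors["king"], vectors["ship"]))
```

In this toy space ‘king’ and ‘queen’ share the contexts ‘drinks’ and ‘wine’ and therefore come out as similar, while ‘king’ and ‘ship’ share no context at all.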
For this case study, we use a DSM built from the Diorisis Ancient Greek corpus (Vatri & McGillivray 2018) using DISSECT (Dinu, Pham, & Baroni 2013). The DSM was optimised specifically for ancient Greek (Rodda, Probert, & McGillivray 2019).
4.2 Data and methods
The data on TrV+Obj formulae was extracted by running a Python script11 on texts from the Ancient Greek and Latin Dependency Treebank (AGLDT: Bamman & Crane, 2011), a syntactically parsed corpus that is part of the Perseus Project. The TrV+Obj pairs were extracted from the four main archaic Greek epic texts: Homer’s Iliad and Odyssey and Hesiod’s Theogony and Works and Days (from here on, ‘the epic corpus’). Two formular editions (Pavese & Venti 2000; Pavese & Boschetti 2003), which are designed to mark material in the target texts as formulaic or non-formulaic based on their frequency in the texts, were used to establish which of these automatically extracted phrases are properly formulaic, i.e. repeated in the traditional language; we opted to adopt the formular editions’ pre-existing definition of formularity, rather than introduce our own, in order to minimise researcher bias.12 Out of the 6764 formulaic TrV+Obj pairs that were thus extracted, only the objects of those verbs that occur at least 50 times in the epic corpus were selected, for a total of 26 verbs and 2703 tokens (ranging from 335 to 50).
The non-formulaic data for comparison was extracted from AGVaLex. All texts from the lexicon’s database were included apart from those which overlap with the epic corpus, i.e. the Iliad and the Odyssey. We looked up each of the 26 target verbs in the lexicon, and manually selected the accusative objects from the existing data.
The analysis below is not on tokens, but on types (for the reasons see Barðdal 2008), i.e. on unique object lexemes of each transitive verb. We therefore discarded any verbs that had fewer than 10 object types in either the epic corpus or the comparison corpus, which reduces the sample to 15 verbs. The final list of verbs, with their type frequency, is provided in Table 6.
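The move from tokens to types, and the filter on type frequency, can be sketched as follows. The function names and the placeholder data are assumptions made for illustration; the input is taken to be a list of (verb lemma, object lemma) pairs, one per token.

```python
def object_types(pairs):
    """Collapse (verb, object) tokens into a set of object types per verb."""
    types = {}
    for verb, obj in pairs:
        types.setdefault(verb, set()).add(obj)
    return types


def verbs_to_keep(epic_pairs, baseline_pairs, min_types=10):
    """Keep verbs with at least `min_types` object types in both corpora."""
    epic = object_types(epic_pairs)
    baseline = object_types(baseline_pairs)
    return sorted(
        verb for verb in epic
        if len(epic[verb]) >= min_types
        and len(baseline.get(verb, ())) >= min_types
    )


# Placeholder tokens; repeated pairs count as one type.
epic_pairs = [("v1", "a"), ("v1", "a"), ("v1", "b"), ("v2", "a"), ("v2", "b")]
baseline_pairs = [("v1", "a"), ("v1", "c"), ("v2", "a"), ("v2", "a")]
kept = verbs_to_keep(epic_pairs, baseline_pairs, min_types=2)
print(kept)
```

With the threshold lowered to 2 for the toy data, only the first verb survives, because the second has a single object type in the baseline corpus.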
Table 6
Target verbs and their object types in the epic and baseline corpus, ordered by token frequency in the epic corpus (not by type frequency)
| | Verb | Epic | Baseline |
|---|---|---|---|
| 1 | | 91 | 485 |
| 2 | | 56 | 80 |
| 3 | | 58 | 90 |
| 4 | | 28 | 49 |
| 5 | | 41 | 120 |
| 6 | | 49 | 24 |
| 7 | | 43 | 87 |
| 8 | | 45 | 101 |
| 9 | | 29 | 53 |
| 10 | | 13 | 13 |
| 11 | | 33 | 92 |
| 12 | | 14 | 31 |
| 13 | | 13 | 25 |
| 14 | | 19 | 13 |
| 15 | | 11 | 48 |
For each verb, therefore, we have a list of object types in the epic corpus and one in the baseline corpus, for a total of 30 lists. To assess their semantic similarity, we measured the cosine distance between the objects in each list and their respective centroids in the semantic space (see again Rodda, Probert, & McGillivray 2019 for another example of this approach). This gives us 30 distributions of distances, which can be compared to each other or assessed for the influence of other factors.
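The measurement itself can be sketched in plain Python. The sketch assumes that cosine distance is taken as one minus cosine similarity and that the centroid is the component-wise mean of the object vectors; the two-dimensional vectors are toy values, not vectors from the DSM used in the study.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(u, v):
    """One minus the cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distances_to_centroid(vectors):
    """Distance of every object vector from the centroid of its own list."""
    centre = centroid(vectors)
    return [cosine_distance(vector, centre) for vector in vectors]


# Toy object lists for one verb: a semantically tight one and a spread-out one.
tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
spread = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tight_distances = distances_to_centroid(tight)
spread_distances = distances_to_centroid(spread)
print(sum(tight_distances) / len(tight_distances))
print(sum(spread_distances) / len(spread_distances))
```

A list whose objects are close in meaning yields distances near zero, while a semantically heterogeneous list yields larger ones; it is these per-list distributions that are compared between the two corpora.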
4.3 Results
To assess the relationship between formulaic and non-formulaic verb phrases, we compared the semantic range of objects in the epic corpus vs. the baseline corpus for each verb. The results of this comparison are detailed in the boxplot in Figure 3.13
The distributions of distances for each pair were compared using the Kolmogorov-Smirnov test in R (R Core Team 2017). Two significance thresholds were set: p < 0.05 for high significance (**) and p < 0.1 for low significance (*). The results are summarised in Table 7.
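For readers who do not use R, the logic of the two-sample Kolmogorov-Smirnov comparison and of the two significance thresholds can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and the p-value uses the standard asymptotic Kolmogorov series, whereas R may compute exact p-values for small samples, so the figures produced by this sketch are not expected to match the reported ones exactly.

```python
import math


def ks_statistic(x, y):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        # Step both empirical CDFs past every observation equal to v.
        while i < n and xs[i] == v:
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic two-sided p-value (an approximation for small samples)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def significance_stars(p):
    """Thresholds from the text: ** for p < 0.05, * for p < 0.1."""
    if p < 0.05:
        return "**"
    if p < 0.1:
        return "*"
    return ""


# Toy samples (hypothetical values, not data from the study).
d_disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
d_overlap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
p_disjoint = ks_pvalue_asymptotic(d_disjoint, 3, 3)
```

The Kolmogorov-Smirnov statistic is sensitive to differences in both location and shape, which suits a comparison of whole distributions rather than of means alone.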
There are relatively few significant differences here, even with a higher than usual significance threshold. The four verbs that show a significant difference are
In other words, there is only a very limited effect of formularity on semantic range, but as far as an effect can be observed, it appears to go in the direction of constructional pre-emption: objects of formulaic phrases tend to show lower semantic similarity. This is somewhat surprising, as discussions of formulaic systems (starting with Parry 1930, 1932) have stressed the poetic utility of having a range of expressions that are similar in meaning but have different metrical shapes, a result which could be easily obtained by varying lexical items and using synonyms or near-synonyms. It is possible that the definition of formularity adopted in this study, which was based on simple repetition and did not take metre into account, does not capture subtleties in the actual relation between verbs and objects which could help explain our results. It is also possible that a different approach to the data analysis would reveal a different pattern—for instance, if we set out to look for individual clusters of closely-related words among the objects of a formulaic verb, rather than measure their semantic proximity to a centroid in the semantic space. As one reviewer suggested, the broad semantic range of many of the verbs in this study probably plays a role; it is suggestive that more semantically restricted verbs such as



Figure 3
Box-and-whiskers plots of object similarities in non-formulaic (white) vs. formulaic (dotted) TrV+Obj pairs
Citation: Journal of Greek Linguistics 24, 2 (2024); 10.1163/15699846-02402003
Table 7
Comparison between formulaic and non-formulaic distributions of objects in the semantic space
| Verb | Median similarity (formulaic) | Median similarity (baseline) | Variance (formulaic) | Variance (baseline) | Significance |
|---|---|---|---|---|---|
|  | 0.427 | 0.482 | 0.0109 | 0.0100 | * (p = 0.084) |
|  | 0.407 | 0.455 | 0.0172 | 0.0123 | (p = 0.240) |
|  | 0.447 | 0.430 | 0.0077 | 0.0130 | (p = 0.911) |
|  | 0.324 | 0.329 | 0.0116 | 0.0188 | (p = 0.538) |
|  | 0.422 | 0.468 | 0.0091 | 0.0086 | (p = 0.414) |
|  | 0.446 | 0.416 | 0.0139 | 0.0147 | (p = 0.716) |
|  | 0.371 | 0.445 | 0.0137 | 0.0103 | (p = 0.106) |
|  | 0.410 | 0.439 | 0.0133 | 0.0093 | (p = 0.723) |
|  | 0.414 | 0.394 | 0.0070 | 0.0227 | (p = 0.537) |
|  | 0.364 | 0.472 | 0.0096 | 0.0027 | ** (p = 0.034) |
|  | 0.392 | 0.456 | 0.0095 | 0.0099 | (p = 0.329) |
|  | 0.314 | 0.470 | 0.0080 | 0.0072 | (p = 0.168) |
|  | 0.270 | 0.405 | 0.0122 | 0.0062 | * (p = 0.052) |
|  | 0.453 | 0.307 | 0.0104 | 0.0027 | * (p = 0.053) |
|  | 0.413 | 0.375 | 0.0059 | 0.0118 | (p = 0.787) |
5 Discussion and conclusions
We have presented AGVaLex and illustrated, via some examples and the case study in Section 4, how it can be used to explore crucial issues in Ancient Greek linguistics. For example, we have shown how AGVaLex allows researchers to retrieve summary data by construction, which cannot be done with existing dictionaries, as well as to look for examples of specific structures, including lexical information for nouns in specific constructions with a verb. The case study focused on Homeric formularity, a topic which is primarily of interest to literary scholars, who are particularly likely to appreciate a pre-compiled dataset that can be directly applied to their work without the need for significant knowledge of the relevant databases and/or programming languages. The limited space devoted to the application of AGVaLex in Section 4 should not obscure the fact that the existence of the database in practice enabled this research in the first place: gathering the data on TrV+Obj constructions in Homer and Hesiod required weeks of work, which would have needed to be scaled up to the entirety of the baseline corpus, a practically insurmountable task. While the results of the case study should be seen as preliminary when it comes to furthering our understanding of semantic variation in formulae, they show the promising value of the Distributional Semantics approach and of the use of a comparison database to assess how formulaic behaviour differs from non-formulaic usage.
A resource such as AGVaLex, if maintained and kept up to date, can enable research that would otherwise require more time and computational power than the average literature scholar can be expected to apply. As the availability of syntactically annotated corpora expands, these resources can be integrated into AGVaLex, ensuring the widest distribution of the data. AGVaLex fits in a broader trend of making Digital Humanities (DH) resources available to researchers outside of DH, which is also illustrated by resources such as HoDeL and Syntacticus; work on formularity in various Indo-European languages has already started taking advantage of the availability of these resources (Biagetti 2023; Brigada Villa et al. 2023), a trend that the current study hopes to further promote.
Author contributions
BMcG designed and implemented the scripts for the creation of the lexicon, and conducted the quantitative analysis of the lexicon described in section 3; she wrote sections 1, 1.1, and 3. The authors jointly wrote section 2. MAR conducted the comparative analysis of dictionaries, designed the case study and conducted its analysis, and wrote sections 3.1, 4, 4.1, 4.2, 4.3, and 5.
For an introduction to the methodology followed by the TLL, see
PROIEL has an associated online treebank query tool, Syntacticus (
The FAQ for Myria contains the following explanation: “What is a collostruction? … [sic]”.
For example, in Odyssey 1.2,
Athenaeus, who is not an Ionic author, contains many quotations from Ionic sources. This construction can coordinate with the more common dative + accusative construction, as shown in this example from Herodotus (Histories 1.54, i.e. subdoc 1.54 sentence_id 346 root_id 457705 in AGVaLex):
For instance in Homer, Il. 1.323 (subdoc 1.323 sentence_id 2274288 root_id 156142 in AGVaLex):
For a formulation of this concept in Homeric studies, see Parry (1930).
We have avoided a detailed discussion of Distributional Semantics here so as not to unnecessarily encumber the case study; interested readers can find accessible explanations in Sahlgren (2006) and Rodda (2021).
All scripts for Section 4 are available at
Readers can find a useful discussion of definitions of formula, and their implications in a current cognitive linguistics perspective, in Bozzone (2023, 2024).
All figures and tables in this section are reproduced from Rodda (2021).
References
Antović, Mihailo & Cristóbal Pagán Cánovas. 2016. Construction Grammar and Oral Formulaic Theory. Oral poetics and cognitive science, ed. by Mihailo Antović & Cristóbal Pagán Cánovas, 79–98. Berlin: De Gruyter.
Bamman, David & Gregory Crane. 2011. The Ancient Greek and Latin Dependency Treebanks. Language technology for cultural heritage: Selected papers from the LaTeCH workshop series, ed. by Caroline Sporleder, Antal van den Bosch, & Kalliopi Zervanou, 79–98. Berlin: Springer.
Barðdal, Jóhanna. 2008. Productivity: Evidence from case and argument structure in Icelandic. Amsterdam: John Benjamins.
Bayerische Akademie der Wissenschaften. 2002. Thesaurus Linguae Latinae (CDROM). https://thesaurus.badw.de/en/project.html
Biagetti, Erica. 2023. Integrare Sanskrit WordNet e Vedic Treebank: Uno studio pilota sulla formularità del RigVeda tra semantica e sintassi. E pluribus unum. Prospettive sull’antico, ed. by Isabella Bossolino & Chiara Zanchi, 45–62. Pavia: Pavia University Press.
Biagetti, Erica, Chiara Zanchi, & Silvia Luraghi. 2021. Building new resources for historical linguistics. Pavia: Pavia University Press.
Bozzone, Chiara. 2014. Constructions: A new approach to formularity, discourse, and syntax in Homer. PhD dissertation, Indo-European Studies, UCLA. https://escholarship.org/uc/item/6kg0q4cx
Bozzone, Chiara. 2023. Chunks, collocations, and constructions: The Homeric formula in cognitive and linguistic perspective. New light on formulas in oral poetry and prose, ed. by David Sävborg & Bernt Ø. Thorvaldsen, 113–139. Turnhout: Brepols.
Bozzone, Chiara. 2024. Homer’s living language: Formularity, dialect, and creativity in oral-traditional poetry. Cambridge: Cambridge University Press.
Brigada Villa, Luca, Erica Biagetti, Riccardo Ginevra, & Chiara Zanchi. 2023. Combining WordNets with Treebanks to study idiomatic language: A pilot study on Rigvedic formulas through the lenses of the Sanskrit WordNet and the Vedic Treebank. Proceedings of the 12th Global Wordnet Conference, 133–139. Donostia—San Sebastian: Global Wordnet Association.
Celano, Giuseppe G.A. 2014. Guidelines for the annotation of the Ancient Greek Dependency Treebank 2.0. https://github.com/PerseusDL/treebank_data/edit/master/AGDT2/guidelines
Celano, Giuseppe G.A. 2019. The dependency treebanks for ancient Greek and Latin. Digital classical philology, ed. by Monica Berti, 279–298. Berlin: De Gruyter.
Dinu, Georgiana, Nghia The Pham, & Marco Baroni. 2013. DISSECT—DIStributional SEmantics Composition Toolkit. Proceedings of the 51st annual meeting of the Association for Computational Linguistics: System demonstrations, 31–36. Sofia: Association for Computational Linguistics.
Eckhoff, Hanne M., Silvia Luraghi, & Marco Passarotti. 2018. The added value of diachronic treebanks for historical linguistics. Diachronica 35.3.297–309.
Fabre, Cécile & Alessandro Lenci. 2015. Distributional Semantics today: Introduction to the special issue. Traitement automatique des langues 56.2.7–20.
Friedrich, Rainer. 2019. Postoral Homer: Orality and literacy in the Homeric epic. Stuttgart: Franz Steiner Verlag.
Goldberg, Adele E. 1995. Constructions: A Construction Grammar approach to argument structure. Chicago: University of Chicago Press.
Hackstein, Olav. 2010. The Greek of epic. A companion to the Ancient Greek language, ed. by Egbert J. Bakker, 401–423. Oxford: Wiley-Blackwell.
Haug, Dag T.T. 2015. Treebanks in historical linguistics research. Perspectives on historical syntax, ed. by Carlotta Viti, 187–202. Amsterdam: John Benjamins.
Hainsworth, John B. 1968. The flexibility of the Homeric formula. Oxford: Clarendon Press.
Hajič, Jan, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, & Alevtina Bémová (in cooperation with) Jiří Kárník, Jan Štěpánek, & Petr Pajas. 1999. Annotations at analytical level: Instructions for annotators. https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html
Happ, Heinz. 1976. Grundfragen einer Dependenz-Grammatik des Lateinischen. Göttingen: Vandenhoeck & Ruprecht.
Harris, Zellig S. 1954. Distributional structure. Word 10.2–3.146–162.
Haug, Dag T.T. & Marius L. Jøhndal. 2008. Creating a parallel treebank of the Old Indo-European Bible translations. Proceedings of the second workshop on language technology for cultural heritage data (LaTeCH 2008), ed. by Caroline Sporleder & Kiril Ribarov, 27–34. Association for Computational Linguistics.
Haug, Dag T.T. 2012. Syntactic conditions on null arguments in the Indo-European Bible translations. Acta Linguistica Hafniensia 44.2.129–141.
Hoekstra, Arie. 1965. Homeric modifications of formulaic prototypes: Studies in the development of Greek epic diction. Amsterdam: N.V. Noord-Hollandsche Uitgevers Maatschappij.
Hoekstra, Arie. 1969. The sub-epic stage of the formulaic tradition: Studies in the Homeric hymns to Apollo, to Aphrodite and to Demeter. Amsterdam: North-Holland Publishing Company.
Keersmaekers, Alek, Wouter Mercelis, Colin Swaelens, & Toon van Hal. 2019. Creating, enriching and valorising treebanks of Ancient Greek: The ongoing Pedalion-project. Available at https://syntaxfest.github.io/syntaxfest19/proceedings/papers/paper_68.pdf
Keersmaekers, Alek. 2020. A computational approach to the Greek papyri: Developing a corpus to study variation and change in the post-classical Greek complementation system. PhD thesis, KU Leuven. https://lirias.kuleuven.be/retrieve/590983
Keydana, Götz & Silvia Luraghi. 2012. Definite referential null objects in Vedic Sanskrit and Ancient Greek. Acta Linguistica Hafniensia 44.2.116–128.
Kiparsky, Paul. 1976. Oral poetry: Some linguistic and typological considerations. Oral literature and the formula, ed. by Benjamin A. Stolz & Richard S. Shannon, 73–125. Ann Arbor: University of Michigan.
Liddell, Henry G., Robert Scott, Henry S. Jones, & Roderick McKenzie, eds. 1996. A Greek-English lexicon. Oxford: Clarendon Press.
Luraghi, Silvia. 2003. Definite referential null objects in Ancient Greek. Indogermanische Forschungen 108.169–196.
Luraghi, Silvia. 2020. Experiential verbs in Homeric Greek. Leiden: Brill.
Luraghi, Silvia & Luz Conti. 2014. The ancient Greek partitive genitive in typological perspective. Partitive cases and related categories, ed. by Silvia Luraghi & Tuomas Huumo, 443–476. Berlin: De Gruyter.
Mayrhofer, Manfred. 1980. Zur Gestaltung des etymologischen Wörterbuchs einer ‘Großcorpus-Sprache’. Vienna: Österr. Akademie der Wissenschaften, Phil-Hist. Klasse.
McGillivray, Barbara, Marco Passarotti, & Paolo Ruffolo. 2009. The Index Thomisticus Treebank project: Annotation, parsing and valency lexicon. TAL—Traitement Automatique Des Langues 50.2.103–127.
McGillivray, Barbara & Marco Passarotti. 2012. Accessing and using a corpus-driven Latin valency lexicon. Latin linguistics in the early 21st century. Acts of the 16th international colloquium on Latin linguistics, Uppsala, June 6th–11th, 2011, ed. by Gerd V.M. Haverling. Uppsala: Uppsala Universitet.
McGillivray, Barbara. 2014. Methods in Latin Computational Linguistics. Leiden: Brill.
McGillivray, Barbara. 2021. Ancient Greek valency lexicon (AGVaLex). Figshare dataset. 10.6084/m9.figshare.14316251
McGillivray, Barbara & Alessandro Vatri. 2015. Computational valency lexica for Latin and Greek in use: A case study of syntactic ambiguity. Journal of Latin Linguistics 14.1.101–126.
Montanari, Franco. 2015. The Brill dictionary of Ancient Greek. English edition, ed. by Madeleine Goh & Chad Schroeder. Leiden: Brill.
Parry, Milman. 1930. Studies in the epic technique of oral verse-making I: Homer and Homeric style. Harvard Studies in Classical Philology 41.73–148.
Parry, Milman. 1932. Studies in the epic technique of oral verse-making II: The Homeric language as the language of an oral poetry. Harvard Studies in Classical Philology 43.1–50.
Passarotti, Marco, Berta González Saavedra, & Christophe Onambele. 2016. Latin Vallex. A treebank-based semantic valency lexicon for Latin. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2599–2606. Portorož, Slovenia: European Language Resources Association (ELRA).
Pavese, Carlo O. & Federico Boschetti. 2003. A complete formular analysis of the Homeric poems. Amsterdam: Hakkert.
Pavese, Carlo O. & Paolo Venti. 2000. A complete formular analysis of the Hesiodic poems: Introduction and formular edition. Amsterdam: Hakkert.
Postlethwaite, Norman. 1979. Formula and formulaic: Some evidence from the Homeric hymns. Phoenix 33.1–18.
R Core Team. 2017. R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. https://www.R-project.org/
Rodda, Martina A. 2021. A corpus study of formulaic variation and linguistic productivity in early Greek epic. PhD thesis, University of Oxford. https://ora.ox.ac.uk/objects/uuid:1e682001-b916-4322-adc3-52857d93b92b/files/d2514nk879
Rodda, Martina A., Philomen Probert, & Barbara McGillivray. 2019. Vector space models of ancient Greek word meaning, and a case study on Homer. TAL—Traitement Automatique Des Langues 60.3.
Sahlgren, Magnus. 2006. The word-space model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. SICS Dissertation Series 44. Stockholm: Dept. of Linguistics, Stockholm Univ.
Sausa, Eleonora & Chiara Zanchi. 2015. Non-accusative null objects in the Homeric Dependency Treebank. Proceedings of the workshop on corpus-based research in the Humanities, ed. by Marco Passarotti, Francesco Mambrini, & Caroline Sporleder, 107–116. Warsaw: Institute of Computer Science of the Polish Academy of Sciences.
Sgall, Petr, Eva Hajičová, & Jarmila Panevová. 1986. The meaning of the sentence in its semantic and pragmatic aspects. Dordrecht: D. Reidel.
Smyth, Herbert Weir. 1926. Aeschylus, with an English translation (in two volumes). 1. Persians. Cambridge, MA: Harvard University Press.
Stefanowitsch, Anatol. 2013. Collostructional analysis. The Oxford handbook of Construction Grammar, ed. by Thomas Hoffmann & Graeme Trousdale, 290–306. Oxford: Oxford University Press.
Suttle, Laura & Adele E. Goldberg. 2011. The partial productivity of constructions as induction. Linguistics 49.1237–1269.
Tesnière, Lucien. 1969. Éléments de syntaxe structurale. 2nd edition. Paris: Klincksieck.
Untermann, Jürgen. 1983. Indogermanische Restsprachen als Gegenstand der Indogermanistik. Le lingue indoeuropee di frammentaria attestazione. Die indogermanischen Restsprachen. Atti del convegno della Societa italiana di glottologia e della Indogermanische Gesellschaft. Udine, 22–24 settembre 1981, ed. by Edoardo Vineis, 11–28. Pisa: Società Italiana di Glottologia, Giardini.
Vatri, Alessandro & Barbara McGillivray. 2018. The Diorisis Ancient Greek corpus. Research Data Journal for the Humanities and Social Sciences 3.55–65.
Wulff, Stephanie. 2008. Rethinking idiomaticity: A usage-based approach. London: Continuum International Publishing.
Zanchi, Chiara, Eleonora Sausa, & Silvia Luraghi. 2018. HoDeL, a dependency lexicon for Homeric Greek: Issues and perspectives. Formal representation and the Digital Humanities, ed. by Paola Cotticelli-Kurras & Federico Giusfredi, 221–246. Cambridge: Cambridge Scholars Publishing.
Zanchi, Chiara. 2021. The Homeric Dependency Lexicon: What it is and how to use it. Journal of Greek Linguistics 21.263–297.
