1 Introduction
In a 2017 article for Digital Scholarship in the Humanities, Grace Muzny, Mark Algee-Hewitt, and Dan Jurafsky proposed a new metric for computational literary criticism called “dialogism,” rooted in Mikhail Bakhtin’s work on direct and indirect speech as a central aspect of literary discourse. Dialogism as defined in this article is a composite metric that quantifies textual features, and especially syntactical and structural as opposed to lexical features, to measure the degree to which different texts and parts of texts resemble direct speech. To quote the authors, the aim of measuring dialogism is to “examine the ways in which narrative text and direct quotation both draw on the grammatical properties of spoken dialog.”1
This chapter pursues quantitative dialogism following Muzny et al., but with two additions. First, the earlier article deals exclusively with English-language novels; here I consider the extent to which assumptions about dialogic features can be applied to a different literary language, specifically Latin. That is, when Muzny et al. write that “spoken, or conversational, language tends to be in the present tense, to use short clauses, modals, and to employ 1st- and 2nd-person pronouns” and that these grammatical features are “inescapably intertwined with literary style,” can we proceed with confidence that these are the same grammatical-stylistic features that contribute to dialogism in Latin literature? Second, in a variation on the first point, Muzny et al. similarly restrict their analysis to the novel; this chapter considers the extent to which assumptions about dialogic features can be applied to a different genre, specifically epic. Epic shares with the modern novel (in English, as in other languages) a surface resemblance that supports such an analogical reading: each is a narrative discourse populated by a variety of characters and punctuated by moments of direct (and indirect) speech delivered by these characters. For these reasons, Latin epic is an ideal candidate for extending previous work on quantitative dialogism.
Yet perhaps the most important contribution of this chapter is a demonstration of the foundational work necessary to address adequately these crosslingual and crossgeneric questions. That is, what can Latinists do in order to identify and extract at scale the kinds of “textual features” on which Muzny et al. base their analyses? What is necessary for measuring to what degree large amounts of Latin text are, for example, “in the present tense” or “employ 1st- and 2nd-person pronouns”? As one research team looking at direct-speech recognition in novels writes: “Beyond quotation marks, what are linguistic markers of direct speech?”2 Moreover, and with specific consideration of the crosslingual question from above, it cannot be assumed a priori that the features found in previous work can be readily reused, so it is incumbent upon researchers working on Latin epic to determine whether the markers that are effective in predicting direct speech in English or French or German are in fact also applicable to the language and the genre. As described below, the analyses in this chapter first use a Latin language model to identify at scale lemmas, part-of-speech tags, and morphological markers; these markers can then be collated and counted by text type (specifically, by labels indicating whether a text comprises direct speech or narrative); and finally a corpus comparison method is used to determine which lexical or syntactic features are most indicative of each text type.3 The combination of “text-as-data” approaches from natural language processing, data science, and corpus linguistics gives us renewed insight into how researchers can describe the constituent elements of direct speech in Latin epic.4
2 Background
The DICES project is the most current and comprehensive source of data about epic speech from antiquity with coverage of key examples of the genre in Latin as well as in Ancient Greek,5 and is itself in many respects a descendant of two author-specific databases, namely Deborah Beck’s Speech Presentation in Homeric Epic and Berenice Verhelst’s Direct Speech in Nonnus’ Dionysiaca.6 Moreover, quantitative analysis of Latin epic speech is found elsewhere in the non-database-driven extant literature. W.J. Dominik, for example, in Speech and Rhetoric in Statius’s Thebaid provides a series of “statistical appendices” cataloging a variety of literary phenomena, such as speech lengths for different books and distributions of speech types.7 Elsewhere in that volume, Dominik provides a comprehensive review of “stylistic elements” in the speeches of Statius’s Thebaid, including “alliteration, assonance, rhyme, rhetorical repetition, metre and rhythm.”8
All of this analysis is valuable in its own respect and yet without a systematic review of the background against which the stylistic elements sit, it can be difficult to appreciate fully how any given point of style fits in the larger generic picture. This chapter is a gesture toward establishing a background, specifically by suggesting a baseline understanding of Latin epic speech style as derived from its lexical and syntactic features so as to better contextualize all other stylistic analyses. Dominik, for example, examines how stylistic and rhetorical features contribute to “producing special effects”; my goal is to examine features of the text that produce what we might call normal effects, that is the features that characterize Latin epic speech more generally and so then allow it to be distinguished from epic narrative.9 Moreover—and what is perhaps the most interesting outcome of developing a quantitative dialogism measure—we also then have a way of seeing which parts of epic narrative exhibit most strongly the characteristic features of speech, such as apostrophe, lending data-driven evidence to what otherwise would necessarily be an assumption about how such a passage struck an ancient reader.
As noted above, much of the available work offers data about the speeches. Number of speeches. Average length of the speeches in different poets. Longest and shortest speeches in epic. Range of percentages of speech for books of different epics. And so on.10 More recent work has been interested in speech types and narratological data.11 Less attention has been given to arguments built up from the speeches themselves, insofar as they are constituted by specific words and grammatical features. Some notable exceptions arise. For example, it is not difficult to find quantitative reports about use of the vocative as a feature of epic speech,12 and with respect to the analysis presented in this chapter such studies are a useful starting point in the literature: it is helpful to know from the outset that a linguistic feature like the vocative has been noted specifically as determinative of whether a passage is direct speech.13 Still, advances in computational philology, specifically improvements in lemmatization as well as in part-of-speech tagging and morphological tagging for Latin, allow this dynamic to be reversed.14 That is, it is now possible to work not only with data about speeches, but with the speeches themselves as data.15
3 Methodology
3.1 Data
The texts used for the analyses that follow consist of files in the CLTK Latin Library text collection.16 The files are further limited to only those dating up to 175 CE.17 A custom script is then used to separate files into direct-speech passages and narrative passages using the following rules: all words found between quotation marks are extracted from a given text, the extracted text is saved to its own file and given the label SPEECH, and the original text with the extracted text now removed is saved to its own file and given the label NARRATIVE.18
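The separation rule just described can be illustrated with a minimal sketch. It assumes that speech is delimited by paired straight double quotation marks and that quotations do not nest; the function name `split_speech_narrative` and the sample sentence (a made-up narrative frame around a phrase from Luc. 9.88) are hypothetical, and this is not the code of the custom script used for the chapter.

```python
import re

# Assumption: direct speech is delimited by paired straight double quotes
# and quotations are never nested.
QUOTED = re.compile(r'"([^"]*)"')


def split_speech_narrative(text: str) -> tuple[str, str]:
    """Return (speech, narrative) for one text."""
    # Everything found between quotation marks is labelled SPEECH.
    speech = " ".join(m.strip() for m in QUOTED.findall(text))
    # The original text with the quoted spans removed is labelled NARRATIVE.
    narrative = QUOTED.sub(" ", text)
    narrative = re.sub(r"\s+", " ", narrative).strip()
    return speech, narrative


# Toy input composed for illustration only.
sample = 'Tum pater ait: "excipite, o nati, bellum ciuile." Sic fatus conticuit.'
speech, narrative = split_speech_narrative(sample)
print("SPEECH:   ", speech)
print("NARRATIVE:", narrative)
```

A real collection would also need to handle other quotation conventions and unbalanced marks, which this sketch ignores.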
3.2 Model and Annotations
The ability to extract syntactic features from Latin texts in the manner necessary for the experiments in this chapter is made possible by recent advances in Latin natural language processing (NLP). I use a pretrained Latin pipeline for use with the NLP framework spaCy called LatinCy.19 This pipeline contains components that provide word-level annotations for part-of-speech-based and morphological analyses.20 The following annotations are of interest to this study:
- texts are lemmatized using LatinCy and these lemmas are collated such that the lemma frequency for each input text is available for analysis; and
- texts are annotated with part-of-speech and morphological tags which are then collated such that total number of each tag is determined for each input text and made available for analysis.21
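The collate-and-count step can be sketched as follows. To keep the example runnable on its own, the token annotations are hard-coded and hypothetical; with spaCy they would typically be read from `token.lemma_`, `token.pos_`, and `token.morph.to_dict()`. The label scheme (`POS_…`, `MORPH_…`) is inferred from the feature names used in this chapter, and the helper names are illustrative only.

```python
from collections import Counter

# Hypothetical annotations standing in for pipeline output:
# (lemma, part-of-speech tag, morphological features) per token.
ANNOTATED = {
    "SPEECH": [
        ("ego", "PRON", {"Case": "Abl", "Number": "Sing"}),
        ("excipio", "VERB", {"Mood": "Imp", "Number": "Plur", "Person": "2"}),
    ],
    "NARRATIVE": [
        ("aio", "VERB", {"Mood": "Ind", "Number": "Sing", "Person": "3"}),
    ],
}


def tag_labels(pos, morph):
    """Turn one token's tags into labels such as POS_VERB or MORPH_MOOD_IMP."""
    labels = [f"POS_{pos.upper()}"]
    labels += [f"MORPH_{k.upper()}_{v.upper()}" for k, v in sorted(morph.items())]
    return labels


def collate(annotated):
    """Count lemmas and grammatical tags separately for each text label."""
    lemma_counts, tag_counts = {}, {}
    for label, tokens in annotated.items():
        lemma_counts[label] = Counter(lemma for lemma, _, _ in tokens)
        tag_counts[label] = Counter(
            tag for _, pos, morph in tokens for tag in tag_labels(pos, morph)
        )
    return lemma_counts, tag_counts


lemma_counts, tag_counts = collate(ANNOTATED)
print(lemma_counts["SPEECH"])
print(tag_counts["SPEECH"])
```

Keeping lemma counts and tag counts in separate tables mirrors the distinction between lexical and grammatical features drawn in the analyses.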
3.3 Methods
As a way of comparing either lexical or syntactical feature importance between different types of texts, I use here a weighted log odds measure as implemented by Julia Silge, Alex Hayes, and Tyler Schnoebelen in the tidylo R package.22 The logic behind the measure is to show “how the usage or frequency of some feature, such as words, differs across some group or set, such as documents,” while also accounting for sampling variability.23 The package encodes a method used first by political scientists to develop a text-as-data measure for comparing language of “conflict” in American partisan political texts,24 but applications as diverse as determining popular baby names by decade or identifying key phrases in the novels of Jane Austen can also be found using this method.25 In this chapter, rather than establishing how the names Jason and Jennifer surge in the 1970s or how phrases such as “so very” come to mark the style of Sense and Sensibility (especially in comparison with Pride and Prejudice), I use weighted log odds to identify both lexical and syntactical features more indicative of direct speech than narrative.
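For readers who want to see the arithmetic, here is a minimal sketch of a two-group weighted log odds calculation in the spirit of Monroe, Colaresi, and Quinn. It is not a port of tidylo: the prior is assumed to be the pooled counts across both groups, the function name is hypothetical, and the counts are toy values invented for the example.

```python
import math


def weighted_log_odds(counts_a, counts_b):
    """z-scored log odds of each feature in group A relative to group B.

    Assumes at least two distinct features across the two groups.
    """
    vocab = set(counts_a) | set(counts_b)
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())
    # Assumed prior: pooled counts of each feature across both groups.
    prior = {w: counts_a.get(w, 0) + counts_b.get(w, 0) for w in vocab}
    prior_total = n_a + n_b
    scores = {}
    for w in vocab:
        y_a = counts_a.get(w, 0) + prior[w]  # smoothed count in A
        y_b = counts_b.get(w, 0) + prior[w]  # smoothed count in B
        log_odds_a = math.log(y_a / (n_a + prior_total - y_a))
        log_odds_b = math.log(y_b / (n_b + prior_total - y_b))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the difference; rarer features get
        # larger variance and so a smaller z-score.
        variance = 1.0 / y_a + 1.0 / y_b
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy counts, invented for illustration (not the chapter's data).
SPEECH = {"ego": 30, "tu": 25, "aio": 1, "arma": 44}
NARRATIVE = {"ego": 10, "tu": 5, "aio": 40, "arma": 345}

scores = weighted_log_odds(SPEECH, NARRATIVE)
for lemma, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lemma:>5}  {z:+.2f}")
```

Positive values mark features more indicative of the first group; dividing by the standard deviation is what discounts differences that rest on very few observations.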
To create a composite epic dialogism metric, I take the following steps:
- First, I convert the reverse-sorted feature lists of weighted log odds for SPEECH texts into a lexicon by scaling the values between 1 and 0 where the closer the value is to 1 the more indicative it is of direct speech.26 This produces a word lexicon based on lexical features and a grammatical lexicon based on grammatical features.
- Secondly, a text is run through the LatinCy pipeline to get the lemma, part-of-speech tag, and morphological tags for each word.
- Next, two provisional metrics are produced:
  - First, the lemma is looked up in the word lexicon from the first step to produce a lexical epic dialogism metric; any out-of-vocabulary lemmas are assigned a zero.
  - Secondly, all of the part-of-speech or morphological tags produced for a given token are looked up in the grammatical lexicon and the mean of all matched values is taken to produce a grammatical epic dialogism metric.27
- Lastly, the composite epic dialogism metric is created by taking the average of the lexical epic dialogism metric and the grammatical epic dialogism metric.28
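The steps above can be sketched in a few lines. The lexicon entries below are the values reported in note 28 for me and excipite (Luc. 9.87–88); everything else is an assumption made for illustration: min-max scaling is one plausible reading of “scaling the values between 1 and 0,” a token with no matched tags is given a grammatical score of zero, and the function names are hypothetical.

```python
from statistics import mean


def min_max_scale(scores):
    """Assumed scaling: map raw weighted log odds onto the range 0..1."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in scores.items()}


# Lexicon entries as reported in note 28.
WORD_LEXICON = {"ego": 1.00, "excipio": 0.16}
GRAMMAR_LEXICON = {
    "POS_PRON": 0.78, "POS_VERB": 0.28,
    "MORPH_PERSON_1": 1.00, "MORPH_PERSON_2": 1.00,
    "MORPH_NUMBER_SING": 0.35, "MORPH_NUMBER_PLUR": 0.30,
    "MORPH_CASE_ABL": 0.11, "MORPH_MOOD_IMP": 0.75,
    "MORPH_TENSE_PRES": 0.36, "MORPH_VERBFORM_FIN": 0.41,
    "MORPH_VOICE_ACT": 0.32,
}


def token_dialogism(lemma, tags, word_lexicon, grammar_lexicon):
    """Return (lexical, grammatical, composite) scores for one token."""
    # Out-of-vocabulary lemmas are assigned zero.
    lexical = word_lexicon.get(lemma, 0.0)
    # Mean of all tags found in the grammatical lexicon.
    matched = [grammar_lexicon[t] for t in tags if t in grammar_lexicon]
    grammatical = mean(matched) if matched else 0.0  # assumption for no match
    return lexical, grammatical, (lexical + grammatical) / 2


me = token_dialogism(
    "ego",
    ["POS_PRON", "MORPH_PERSON_1", "MORPH_NUMBER_SING", "MORPH_CASE_ABL"],
    WORD_LEXICON, GRAMMAR_LEXICON,
)
excipite = token_dialogism(
    "excipio",
    ["POS_VERB", "MORPH_PERSON_2", "MORPH_MOOD_IMP", "MORPH_NUMBER_PLUR",
     "MORPH_TENSE_PRES", "MORPH_VERBFORM_FIN", "MORPH_VOICE_ACT"],
    WORD_LEXICON, GRAMMAR_LEXICON,
)
print([round(x, 2) for x in me])
print([round(x, 3) for x in excipite])
```

For me the sketch reproduces the reported grammatical score (0.56) and composite score (0.78). For excipite the unrounded grammatical mean is about 0.489, so the composite comes out near 0.32 rather than the reported 0.33, which appears to average the already rounded figures.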
4 Results / Findings
The first analysis undertaken is to determine using weighted log odds the key syntactic features in epic texts labelled SPEECH in comparison to those labelled NARRATIVE. The top seven features for each type are shown in Figure 9.1.



Figure 9.1
Comparison of top ranked grammatical features by log odds ratio for NARRATIVE and SPEECH epic texts
The results for NARRATIVE grammatical features align well with existing literature, which has already noted the preference for use of the third person (MORPH_PERSON_3) and the past tense (MORPH_TENSE_PAST). The signal is relatively weak beyond the top features, though, and it is not easy to see anything particularly coherent in the remaining features. The situation is reversed, however, with SPEECH features. Morphological features indicating first- and second-person verbs are expected markers of direct speech and show strong signal here, as do the imperative mood and interjections. The high ranking of pronouns among the grammatical features, especially in coordination with the prominence of specifically first- and second-person personal pronouns among the lexical features (see below, discussion of Figure 9.2), also matches previous work outside of Latin epic. By contrast, the appearance of the future tense in this list as a strong marker of Latin epic speech as opposed to narrative finds little corroboration in related crosslingual or crossgeneric work and is discussed further in the “Crosslingual and crossgeneric comparisons” section (5.1) below.
The second analysis uses the same weighted log odds approach but for the key lexical features in epic texts according to their SPEECH and NARRATIVE labels. The top twenty features for each type are shown in Figure 9.2.
Among the most heavily weighted NARRATIVE lexical features are aio (“to say”) with the closely related word inquam. This hardly comes as a surprise since they are almost by definition lexical features external to direct speech as introductory verbs. In addition, there is signal from third-person reflexive pronouns and adjectives (suus, sui), a finding in line with the top syntactic feature, namely the morphological tag covering the third person. That said, much stronger signal can again be seen when looking at SPEECH lexical features. The most heavily weighted lexical features and several other top ranked features in this category are personal pronouns or personal adjectives: ego (“I”), tu (“you” singular), nos (“we” plural), uos (“you” plural), as well as meus (“my”), noster (“our”), tuus (“your” singular), and uester (“your” plural). O is the lexical speech marker introducing the vocative and often found in conjunction with the imperative mood. Nunc (“now”) hints at the kind of present-tense perspective that the existing literature suggests characterizes dialogism.29



Figure 9.2
Comparison of top ranked lexical features by log odds ratio for NARRATIVE and SPEECH epic texts
For these two analyses, it can be seen that the weighted log odds approach has been largely successful in identifying features of NARRATIVE and especially SPEECH epic texts that align with existing work on the topic. Moreover, as shown in Figure 9.3, the epic texts can each be described individually by their top features so that authors can be compared.30



Figure 9.3
Top ranked morphological features for different authors by log odds ratio for SPEECH texts restricted to only epic
In general, the features that are indicative of epic resurface with greater or lesser signal in the author-specific charts: MORPH_PERSON_1, MORPH_PERSON_2, MORPH_MOOD_IMP, POS_INTJ, and so on. A comment is necessary though with respect to dialogic features in Ovid. One of the strongest markers of Ovid’s speech style in the Metamorphoses is auxiliary verbs, which in the case of the LatinCy model means sum (“to be”). This is curious enough to demand a look at the underlying data: in the Metamorphoses texts labelled NARRATIVE, Ovid uses a form of sum 1,194 times (or 2.50 times per 100 words) and in the texts labelled SPEECH, the poet uses the word 962 times (3.25 times per 100 words).31 Examples like this show how the dialogism metric can be used in an exploratory sense, that is, pointing Latin critics in the direction of potentially fruitful areas of investigation that would otherwise be difficult to observe directly at readerly scale.
5 Discussion
5.1 Crosslingual and Crossgeneric Comparisons
Muzny et al. ground their study in the intuition that the English novel will exhibit in its depiction of direct speech the kinds of markers shown in earlier corpus linguistics studies.32 They note general grammatical features such as tense and deixis as important literary stylistic markers as well as other more specific features: “The extent to which a part of a text exhibits the grammatical properties of dialogue (is in the present tense, uses modals, refers to the 1st and 2nd person, etc.) thus reveals its stylistic underpinnings.”33 They expand on this later in the article with specific reference to part-of-speech and morphological tagging: “Dialogue … is characterized by high-quantities of the bare verb tag, or the 1st-person pronoun tag. In contrast, non-dialogue text is characterized by higher frequencies of adjectives and determiners.”34 Other researchers call attention to features like interjections and verbal tenses as strong indicators of direct speech.35
The results based on Latin epic speech presented in the previous section align well with this earlier work on quantitative dialogism. Latin epic in its depiction of direct speech shares with the modern English or French or German models features such as conspicuous usage of first- and second-person verbs, use of interjections, and use of the imperative mood to name the most prominent instances. The high ranking of pronouns overall among the grammatical features combined with prominence of specifically first- and second-person personal pronouns among the lexical features also matches previous work outside of Latin epic. That said, one feature that stands out on the list of grammatical features indicative of Latin epic direct speech is the future tense.
Gilbert Highet notes that one of the functions of Virgil’s speeches is to “[forecast] the future” and in one section of commentary calls attention in passing to how Anchises’ speech to Aeneas at Aeneid 6.756–859 is a prophecy “as the tenses confirm,” enumerating a handful of future-tense verbs from the passage.36 Susan Adema notes in her work on Virgil’s use of verb tense in the Aeneid that the future is “not a tense typical for narrative,” and gives some basic statistical support to the finding here: of 465 future tense verbs in the Aeneid, all but 14 can be found in direct speech.37 Yet the role of the future tense in Latin epic speech more generally does not appear to receive elaborated treatment and could be an avenue for future research.38
To conclude this section on crosslingual and crossgeneric comparison, allow me to comment briefly on the decision to use lexical features in the present study. Muzny et al. use only grammatical features, eschewing the use of words as features in their dialogism metric as they are potentially a source of “overfitting to one particular genre or moment in literary history.”39 For the study in this chapter, the combination of lexical and grammatical features appears to offer a more robust description of Latin epic style. This may be because the most heavily weighted words in the lexicon exhibit precisely some of the most heavily weighted part-of-speech and morphological features, mutually reinforcing dialogistic signal when encountered in the text. For example, ego (“I”) is the top ranked lexical feature in Latin epic speech and is also a personal pronoun (POS_PRON), the third ranked grammatical feature.40 Moreover, since only one genre is under study here and without the century-spanning and subgeneric coverage of specific interest to the earlier researchers, there is less of a concern with overfitting.
5.2 Dialogistic Narrative in Latin Epic
While it is certainly of some interest simply to have a more detailed descriptive understanding of the syntactic and lexical features that constitute direct speech in Latin epic, the measure explored so far in this chapter has further literary critical value in helping to identify passages outside of speech that call attention to themselves as more or less dialogistic.41 In this section, I briefly review the idea of apostrophe as a kind of dialogistic narrative device, applying the dialogism measure to Lucan’s Bellum Civile as proof of concept that this data-driven collation of syntactic and lexical features can detect speech-like aspects of non-speech Latin text.
Lucan’s use of apostrophe (that is, “the address of a person that is not physically present [and] in the case of epic poetry, this person usually is one of the heroic characters, addressed by the epic narrator”42) has received much attention.43 Roland Mayer writes: “Nothing is so typical of Lucan’s epic technique, nothing sets him so far apart from all other poets in this genre as his tendency to abandon narrative for an editorial reflection upon events.”44 Gordon Williams speaks of “innumerable authorial interventions” and Matthew Leigh notes that Lucan uses apostrophe “with greater statistical frequency than any other Latin epicist.”45 Leigh also lays bare the point most germane to the present discussion: “He truly is talking to his characters.”46 That is, outside the direct speech of any of the epic’s characters, the narrator fills the narrative with lines that emulate direct speech.
With the dialogism measure available, this narrative quality can be demonstrated in the text. In Figure 9.4, the rolling mean of the composite epic dialogism metric is mapped through the narrative space of Bellum Civile 9. Direct speech in this book is indicated through light grey shading, and apostrophes (specifically, apostrophes in the voice of the narrator, and so outside the direct speech of other characters) are indicated with dark grey shading. The average dialogism of the book is shown through a dashed line at 0.26. Moving from left to right serves as a proxy for reading dialogistically; that is, following the black line should correlate with our readerly experience of “any span of text [that] exhibits the grammatical features characteristic of spoken dialogue.”47 The black line does generally rise in the lightly shaded areas, most noticeably during Cato’s speeches at the beginning and end of the book. What is of more interest here is the way in which upward movement can be observed during the apostrophes. Take for example the apostrophe to the dracones at Luc. 9.727–733 (corresponding on the x-axis to the dark-grey shaded area covering tokens 4,822 through 4,867) in the middle of the “snake episode” in Book 9; after a notable dip in the rolling dialogism metric in an extended period of uninterrupted (by direct speech, that is) narrative, this apostrophe helps bring the passage back to the book-level average.48



Figure 9.4
Narrative plot of Book 9 from Lucan’s Bellum Civile. The rolling mean of the composite dialogism score is indicated by the black line. Speeches are shaded in light grey, apostrophes from the narrator in dark grey. The average dialogism score (0.26) is indicated by a dashed grey line.
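The rolling mean plotted in Figure 9.4 can be sketched as below. The window size and the per-token scores are assumptions invented for the example (the chapter does not state the window used), and `rolling_mean` is a hypothetical helper rather than the plotting code behind the figure.

```python
from statistics import mean


def rolling_mean(values, window):
    """Mean of each run of `window` consecutive values, in token order."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the incoming value, drop the outgoing one.
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


# Toy composite dialogism scores for ten tokens (invented for illustration).
token_scores = [0.10, 0.15, 0.12, 0.20, 0.78, 0.65, 0.33, 0.70, 0.14, 0.11]

book_average = mean(token_scores)          # the dashed line in such a plot
smoothed = rolling_mean(token_scores, 4)   # the black line in such a plot

# Positions where the smoothed line sits above the overall average.
above = [i for i, m in enumerate(smoothed) if m > book_average]
print(round(book_average, 3), above)
```

A wider window gives a smoother line but blurs short passages such as a six-line apostrophe, so the choice of window affects how visible such moments are.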
Another example helps illustrate the dialogic weight of Lucan’s apostrophes. Beginning at Bellum Civile 9.980 (so, the fourth and last dark grey-shaded section in Figure 9.4), the narrator addresses Caesar in a sphragis-like reflection on the future readership of both men. Here is the passage printed such that any word with a composite dialogism score above the book average—so a little over half of the words—is shown in bold (Luc. 9.980–986):
o sacer et magnus uatum labor! omnia fato
eripis et populis donas mortalibus aeuum.
inuidia sacrae, Caesar, ne tangere famae;
nam, siquid Latiis fas est promittere Musis,
quantum Zmyrnaei durabunt uatis honores,
uenturi me teque legent; Pharsalia nostra
uiuet, et a nullo tenebris damnabimur aeuo.49
The density of bold-type words in this passage helps the reader to see the ways in which “even narrative (non-quotational) regions of texts contain aspects of spoken dialogue.”50
6 Conclusion
As shown in the preceding sections, computational approaches can be used to model a reader’s encounter with direct speech in Latin epic, including and perhaps even especially in cases where the encounter is not with direct speech itself but with some aspect of narrative merely evocative of direct speech. This readerly model draws on both grammatical and lexical features in Latin epic texts, both of which, due to recent advances in text analysis and natural language processing for the language, can be identified, enumerated, and quantified at scales previously not available. A specific example of dialogistic narrative, namely the rhetorical figure of apostrophe as used by Lucan in Bellum Civile 9, offers a literary critical application for the metric and hopefully can serve as a model for analyzing other dialogistic forms. At the same time, this is an experimental metric that will continue to be refined and tested, so that quantitative dialogism can be applied with confidence in a wide variety of literary critical applications in Latin, such as narrative-speech dynamics in other genres, or analyses with greater attention to text types that complicate the simplification of “narrative” vs “speech” used in this preliminary inquiry into the problem.51
Appendix
Here is a list of grammatical features with descriptions.
| Feature | Description |
| --- | --- |
| MORPH_CASE_ABL | ablative case |
| MORPH_CASE_ACC | accusative case |
| MORPH_CASE_DAT | dative case |
| MORPH_CASE_GEN | genitive case |
| MORPH_CASE_LOC | locative case |
| MORPH_CASE_NOM | nominative case |
| MORPH_CASE_VOC | vocative case |
| MORPH_GENDER_FEM | feminine gender |
| MORPH_GENDER_MASC | masculine gender |
| MORPH_GENDER_NEUT | neuter gender |
| MORPH_MOOD_GDV | gerundive mood |
| MORPH_MOOD_GER | gerund mood |
| MORPH_MOOD_IMP | imperative mood |
| MORPH_MOOD_IND | indicative mood |
| MORPH_MOOD_SUB | subjunctive mood |
| MORPH_NUMBER_PLUR | plural number |
| MORPH_NUMBER_SING | singular number |
| MORPH_PERSON_1 | first person |
| MORPH_PERSON_2 | second person |
| MORPH_PERSON_3 | third person |
| MORPH_TENSE_FUT | future tense |
| MORPH_TENSE_PAST | past tense |
| MORPH_TENSE_PQP | pluperfect tense |
| MORPH_TENSE_PRES | present tense |
| MORPH_VERBFORM_FIN | finite verb |
| MORPH_VERBFORM_INF | infinitive |
| MORPH_VERBFORM_PART | participle |
| MORPH_VERBFORM_SUP | supine |
| MORPH_VOICE_ACT | active voice |
| MORPH_VOICE_PASS | passive voice |
| POS_ADJ | adjective |
| POS_ADP | adposition |
| POS_ADV | adverb |
| POS_AUX | auxiliary verb |
| POS_CCONJ | coordinating conjunction |
| POS_DET | determiner |
| POS_INTJ | interjection |
| POS_NOUN | noun |
| POS_PART | particle |
| POS_PRON | pronoun |
| POS_PROPN | proper noun |
| POS_SCONJ | subordinating conjunction |
| POS_VERB | verb |
Muzny, Algee-Hewitt, Jurafsky 2017: ii32.
Schöch et al. 2016. That paper studies diachronic and generic trends in direct-speech content in 19th-century French novels and is notable, like Muzny et al., for establishing a computationally derived feature set (given in the chapter’s Appendix) for analysing direct speech in a literary context.
Data and code used in the analyses in this chapter will be made available to the reader in the digital appendix of this volume.
For a general discussion of “text-as-data” approaches to social science and digital humanities research, see Grimmer, Roberts, Stewart 2022.
Forstall, Finkmann, Verhelst 2022.
Beck 2012, with digital appendix available at
Dominik 1994: 275–363. Other examples include Elderkin 1906; Lipscomb 1909, addressing Greek and Latin respectively.
Dominik 1994: 236–271.
Compare Mahlberg 2013: 8–11 on “foregrounding theory and corpus norms” as a contribution of corpus stylistics to literary criticism.
The preceding list draws directly from the table of contents in Elderkin 1906.
See for example De Bakker, De Jong 2022.
Scott 1903; see also the enumeration of vocatives in Quintus of Smyrna with and without the introductory particle
Beck 2012: 13: “Expressive elements may convey nothing more than that a particular speaker is the speaker (such as first-person forms) without implying any additional feeling on his part. Besides first- and second-person forms, expressive elements also include vocatives, exclamations like
Burns 2019.
In addition to Muzny et al.’s work on dialogism, I have also found the following work helpful, at least in a comparative fashion, in considering how computational approaches can be applied to a Latin literary critical question concerning direct speech. It should be noted that much of the comparative work involves specifically the task of direct-speech recognition; so, for example, Brunner 2013; Tu, Krug, Brunner 2019. This is not exactly the task in this chapter though as the texts in this dataset are already labelled for SPEECH or NARRATIVE; that is, the work here begins from a point at which direct speech has already been recognized. At the same time, the application of machine-learning classification to this problem, and most importantly, the feature selection inherent in the direct-speech recognition task, is relevant and useful in determining a measure of dialogism as defined above. Also of note in this respect is Byszuk et al. 2020, whose work uses no upfront feature selection but rather inputs multilingual text directly into a transformer-based classifier; at the same time, the researchers conclude (103) that the model still infers relevant features, including syntactic features that are corroborated by the analyses in this chapter: “It is unclear how important are linguistic features of direct and non-direct speech for the model, but errors suggest it pays some attention to imperative mode, personal pronouns, proper names, interjections and verb forms.” See also Jannidis et al. 2018 for another deep-learning approach to this task on German novels.
Texts available from
In addition, the files for Lucan’s Bellum Civile are held out so that they can be evaluated independently of computed dialogism measures in the discussion section below on “Dialogistic narrative in Latin epic.”
A package for extracting direct-speech text from this Latin collection called inquit can be found at the following repository:
Burns 2023. On spaCy, see Honnibal, Montani 2023.
The tagging is automated and probabilistic and so not error-free; that said, it is certainly sufficient for the preliminary analyses undertaken in this chapter. I report here the accuracy scores for the relevant components based on version 3.8.0: lemmatizer, 94.9 %; POS tagger, 97.3 %; morphological tagger, 92.6 %. Additional information about model performance can be found here:
So, for example, this collate-and-count approach shows with respect to a lexical feature that the lemma ego appears 6,759 times in all of the input texts labelled SPEECH and 30,698 times in the texts labelled NARRATIVE. With respect to a syntactic feature, the morphological tag for the vocative appears 2,940 times in SPEECH and 10,515 times in NARRATIVE. Note though for these counts that the total percentage of the input texts that is labelled NARRATIVE (90.1 % of total tokens) is far larger than that labelled SPEECH (9.9 %).
Silge, Hayes, Schnoebelen 2022.
Silge 2019. To emphasize the importance of taking into account “sampling variability,” Silge adds: “We haven’t counted every feature the same number of times so how do we know which differences are meaningful?” Linguist Mark Liberman (Liberman 2014) summarizes the advantage helpfully as follows: “We want to take account of the likely sampling error in our counts, discounting differences that are probably just an accident, and enhancing differences that are genuinely unexpected given the null hypothesis that both X and Y are making random selections from the same vocabulary.”
Monroe, Colaresi, Quinn 2008.
See for example Silge 2019; Schnoebelen 2019.
The lexicons used for the analyses in this chapter, including the example in the section below of apostrophe in Lucan’s Bellum Civile, can be found in the digital appendix to this volume.
The part-of-speech tags used in the analyses are the Universal POS tagset (
An example from Lucan’s Bellum Civile, using the first word and the main verb of Cornelia’s report of Pompey’s exhortation to his sons after his death in the previous book, Luc. 9.87–88: Me cum fatalis leto damnauerit hora, excipite, o nati, bellum ciuile (“When the destined hour shall have condemned me to death, I bid you, my sons, take over civil war.” tr. Duff). Me: lexical epic dialogism for ego = 1.00; grammatical epic dialogism = 0.56; epic dialogism (i.e. the average of the last two figures) = 0.78. Note that the grammatical epic dialogism metric is itself derived from the mean of the following lexicon entries: POS_PRON: 0.78; MORPH_PERSON_1: 1.00; MORPH_NUMBER_SING: 0.35; MORPH_CASE_ABL: 0.11. Excipite: lexical epic dialogism for excipio = 0.16; grammatical epic dialogism = 0.49; epic dialogism = 0.33. The grammatical epic dialogism metric for excipite is the mean of the following lexicon entries: POS_VERB: 0.28; MORPH_PERSON_2: 1.00; MORPH_MOOD_IMP: 0.75; MORPH_NUMBER_PLUR: 0.30; MORPH_TENSE_PRES: 0.36; MORPH_VERBFORM_FIN: 0.41; MORPH_VOICE_ACT: 0.32. See Appendix for an expanded description of these POS and MORPH labels.
Muzny, Algee-Hewitt, Jurafsky 2017: ii37: “We expect our metric to give high scores to spans of text that … address events in the here and now.” Hic perhaps also contributes to this effect, though with the current lemmatization setup hic (“here”) and hic (“this”) are collapsed into a single homonymic lemma.
Note that the epic poet Lucan is not included in these analyses. His works have been held out from the data on which the lexicons are constructed so that the analysis in the section “Dialogistic Narrative in Latin Epic” below (5.2) proceeds from unseen epic-speech data.
Virgil in the Aeneid by comparison uses sum 214 times in NARRATIVE (0.54 times per 100 words) and 320 times in SPEECH (1.34 times per 100 words).
Biber 1988 in particular is cited as a touchpoint.
Muzny, Algee-Hewitt, Jurafsky 2017: ii32.
Muzny, Algee-Hewitt, Jurafsky 2017: ii37.
Schöch et al. 2016.
Highet 1972: 100.
Adema 2019: 202, 211.
In a chapter on time in ancient epic, Reitz, Finkmann 2019: 173 call attention to the idea of “prior narration,” referring to “ekphrases …, prophecies, omens, and dream visions,” as a kind of narration which is “naturally expressed in the future tense.” These “prior” narrative types, where they occur outside of direct speech, would—like apostrophe, as discussed later in this chapter—be good places to look for dialogic signal.
Muzny, Algee-Hewitt, Jurafsky 2017: ii32.
At present, the LatinCy model does not assign “person” information to pronouns; accordingly, it is only through the combination of lexical and grammatical features here that the full contribution of ego can be captured.
It is interesting that Schöch et al. 2016 report an error in their speech recognition task that amounts precisely to what I am describing as a literary critical benefit of the dialogism measure: “Several features which have been previously used to define and recognize direct speech (question / exclamation marks, interjections, verbal tenses) also cause incorrect assignments, especially in the context of homodiegetic narration, where the narrator is somewhat involved in the plot so that his narrator speech is similar to direct speech.”
Schmitz 2019: 37; an expanded definition (with bibliography) can be found at Klooster 2013: 151–152.
See Leigh 1997; Bartsch 1997: 93–98; Faber 2005; D’Alessandro Behr 2007; Asso 2009.
Mayer 1981: 148.
Williams 1978: 234; Leigh 1997: 309.
Leigh 1997: 310.
Muzny, Algee-Hewitt, Jurafsky 2017: ii50.
The “dip” here is surely a product both of this conspicuously long narrative passage and of a large number of out-of-vocabulary words; this passage of mythic herpetology contains many lemmas that are not otherwise found in the word lexicon and so have a lexical epic dialogism score of zero.
In the translation of Duff 1928: “How mighty, how sacred is the poet’s task! He snatches all things from destruction and gives to mortal men immortality. Be not jealous, Caesar, of those whom fame has consecrated; for if it is permissible for the Latin Muses to promise aught, then, as long as the fame of Smyrna’s bard endures, posterity shall read my verse and your deeds; our Pharsalia shall live on, and no age will ever doom us to oblivion.” Compare with the (less bold, 28 % of the words) narrative opening to Book 9, Luc. 9.1–4: At non in Pharia manes iacuere fauilla / nec cinis exiguus tantam conpescuit umbram; / prosiluit busto semustaque membra relinquens / degeneremque rogum sequitur conuexa Tonantis. (Duff, “But the spirit of Pompey did not linger down in Egypt among the embers, nor did that handful of ashes prison his mighty ghost. Soaring up from the burning-place, it left the charred limbs and unworthy pyre behind, and sought the dome of the Thunderer.”)
Muzny, Algee-Hewitt, Jurafsky 2017: ii37, with reference to Bakhtin’s 1935 essay “Discourse in the Novel” (reprinted as Bakhtin 2004).
With respect to the literary scope of this volume, there are encouraging developments in Ancient Greek natural language processing which bode well for related work on the epic tradition in that language; see, for example, Kostkan et al. 2023. On text types, Pinkster (2015: 1141) notes that for Latin they are “still an underdeveloped area of research”; one can imagine a path forward where ongoing research into where a given text is truly “narrative” or otherwise “argumentative,” “expository,” or another type not only informs the larger questions on epic direct speech looked at in this chapter but can itself be further developed by the large-scale text annotations at the heart of the chapter’s methodology.
Bibliography
Adema, S.M. (2019). Tenses in Vergil’s Aeneid. Narrative Style and Structure. Leiden.
Asso, P. (2009). The Intrusive Trope: Apostrophe in Lucan. Materiali e Discussioni per l’analisi Dei Testi Classici 61, pp. 161–173.
Bakhtin, M.M. (2004). Discourse in the Novel. In: J. Rivkin and M. Ryan, eds., Literary Theory: An Anthology, Second edition. Malden, MA, pp. 674–685.
De Bakker, M., and De Jong, I.J.F., eds. (2022). Speech in Ancient Greek Literature. Studies in Ancient Greek Narrative V. Leiden.
Bartsch, S. (1997). Ideology in Cold Blood. A Reading of Lucan’s Civil War. Cambridge, MA.
Beck, D. (2012). Speech Presentation in Homeric Epic. Austin, TX.
Biber, D. (1988). Variation across Speech and Writing. Cambridge. (doi:10.1017/CBO9780511621024)
Brunner, A. (2013). Automatic Recognition of Speech, Thought, and Writing Representation in German Narrative Texts. Literary and Linguistic Computing 28 (4), pp. 563–575.
Brunner, A., Tu, N.D.T., Weimer, L., and Jannidis, F. (2020). To BERT or Not to BERT – Comparing Contextual Embeddings in a Deep Learning Architecture for the Automatic Recognition of Four Types of Speech, Thought and Writing Representation. In: Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).
Burns, P.J. (2019). Building a Text Analysis Pipeline for Classical Languages. In: M. Berti, ed., Digital Classical Philology: Ancient Greek and Latin in the Digital Revolution. Berlin, pp. 159–176.
Burns, P.J. (2023). LatinCy: Synthetic Trained Pipelines for Latin NLP. https://arxiv.org/abs/2305.04365v1.
Byszuk, J., Woźniak, M., Kestemont, M., Leśniak, A., Łukasik, W., Šeļa, A., and Eder, M. (2020). Detecting Direct Speech in Multilingual Collection of 19th-Century Novels. In: Proceedings of 1st Workshop on Language Technologies for Historical and Ancient Languages, Marseille, pp. 100–104.
D’Alessandro Behr, F. (2007). Feeling History. Lucan, Stoicism, and the Poetics of Passion. Columbus, OH.
Dominik, W.J. (1994). Speech and Rhetoric in Statius’ Thebaid. Hildesheim/Zürich/New York.
Duff, J.D. (1928). Lucan, The Civil War (Pharsalia). Cambridge, MA.
Elderkin, G.W. (1906). Aspects of the Speech in the Later Greek Epic. Baltimore.
Elson, D., and McKeown, K. (2010). Automatic Attribution of Quoted Speech in Literary Narrative. Proceedings of the AAAI Conference on Artificial Intelligence 24 (1), pp. 1013–1019. (doi:10.1609/aaai.v24i1.7720)
Faber, R.A. (2005). The Adaptation of Apostrophe in Lucan’s Bellum Civile. In: C. Deroux, ed., Studies in Latin Literature and Roman History XII. Brussels, pp. 334–343.
Forstall, C.W., Finkmann, S., and Verhelst, B. (2022). Towards a Linked Open Data Resource for Direct Speech Acts in Greek and Latin Epic. Digital Scholarship in the Humanities 37, pp. 972–981. (doi:10.1093/llc/fqac006)
Grimmer, J., Roberts, M.E., and Stewart, B.M. (2022). Text as Data. A New Framework for Machine Learning and the Social Sciences. Princeton, NJ.
Highet, G. (1972). The Speeches in Vergil’s Aeneid. Princeton, NJ. (doi:10.1515/9781400869466)
Highet, G. (1974). Speech and Narrative in the Aeneid. Harvard Studies in Classical Philology 78, pp. 189–229. (doi:10.2307/311206)
Honnibal, M., and Montani, I. (2023). spaCy: Industrial-Strength Natural Language Processing in Python (version v. 3.5.1). (https://spacy.io/)
Jannidis, F., Konle, L., Zehe, A., Hotho, A., and Krug, M. (2018). Analysing Direct Speech in German Novels. In: DHd 2018. Cologne.
Klooster, J. (2013). Apostrophe in Homer, Apollonius and Callimachus. In: U.E. Eisen and P.V. Möllendorff, eds., Über Die Grenze. Berlin, pp. 151–173. (doi:10.1515/9783110331721.151)
Kostkan, J., Kardos, M., Mortensen, J.P.B., and Nielbo, K.L. (2023). OdyCy: A General-Purpose NLP Pipeline for Ancient Greek. In: S. Degaetano-Ortlieb, A. Kazantseva, N. Reiter, and S. Szpakowicz, eds., Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Dubrovnik, Croatia, pp. 128–134. (doi:10.18653/v1/2023.latechclfl-1.14; https://aclanthology.org/2023.latechclfl-1.14)
Leigh, M. (1997). Lucan: Spectacle and Engagement. Oxford.
Liberman, M. (2014). Obama’s Favored (and Disfavored) SOTU Words. Language Log. (https://languagelog.ldc.upenn.edu/nll/?p=10073)
Lipscomb, H.C. (1909). Aspects of the Speech in the Later Roman Epic. Baltimore.
Mahlberg, M. (2013). Corpus Stylistics and Dickens’s Fiction. London.
Mayer, R. (1981). Lucan: Civil War VIII. Warminster, England.
Monroe, B.L., Colaresi, M.P., and Quinn, K.M. (2008). Fightin’ Words. Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict. Political Analysis 16 (4), pp. 372–403. (doi:10.1093/pan/mpn018)
Muzny, G., Algee-Hewitt, M., and Jurafsky, D. (2017). Dialogism in the Novel. A Computational Model of the Dialogic Nature of Narration and Quotations. Digital Scholarship in the Humanities 32 (Suppl. 2), pp. ii31–ii52. (doi:10.1093/llc/fqx031)
Pinkster, H. (2015). The Oxford Latin Syntax, Vol. 2. Oxford.
Reitz, C., and Finkmann, S. (2019). Time in Ancient Epic. A Short Introduction. In: C. Reitz and S. Finkmann, eds., Structures of Epic Poetry. Berlin, pp. 171–182. (doi:10.1515/9783110492590-044)
Schmitz, T.A. (2019). Epic Apostrophe from Homer to Nonnus. Symbolae Osloenses 93 (1), pp. 37–57. (doi:10.1080/00397679.2019.1648012)
Schnoebelen, T. (2019). I Dare Say You Will Never Use tf-idf Again. Medium. (https://medium.com/@TSchnoebelen/i-dare-say-you-will-never-use-tf-idf-again-4918408b2310)
Schöch, C., Schlör, D., Popp, S., Brunner, A., Henny, U., and Tello, J.C. (2016). Straight Talk! Automatic Recognition of Direct Speech in Nineteenth-Century French Novels. In: DH2016. Kraków, Poland, pp. 346–353. (https://dh2016.adho.org/abstracts/31)
Scott, J.A. (1903). The Vocative in Homer and Hesiod. The American Journal of Philology 24 (2), pp. 192–196. (doi:10.2307/288759)
Silge, J. (2019). Introducing Tidylo. Julia Silge. (https://juliasilge.com/blog/introducing-tidylo/)
Silge, J., Hayes, A., and Schnoebelen, T. (2022). Juliasilge/Tidylo. R (version v0.2.0). (https://github.com/juliasilge/tidylo)
Tu, N.D.T., Krug, M., and Brunner, A. (2019). Automatic Recognition of Direct Speech without Quotation Marks. A Rule-Based Approach. Frankfurt/Mainz.
Verhelst, B. (2017). Direct Speech in Nonnus’ Dionysiaca. Narrative and Rhetorical Functions of the Characters’ “Varied” and “Many-Faceted” Words. Leiden. (doi:10.1163/9789004334656)
Williams, G. (1978). Change and Decline. Roman Literature in the Early Empire. Berkeley, CA.