| Accuracy scores |
the average degree of accuracy of an automated process of analysis, as measured based on verified sample data. See also, “recall and precision” |
| Ancient Greek and Latin Dependency Treebank (AGLDT) |
first and largest dependency treebank (see also, “treebank”) for Greek and Latin Texts, including, e.g., the full text of Homer’s Iliad and Odyssey. Hosted and maintained by Perseus Digital Library. |
| Apostrophe |
the address of a person not physically present, as for example when a narrator addresses a character in a poem as a form of metalepsis |
| Application Programming Interface (API) |
interface of a computer program or database oriented not towards the human user, but towards other computer programs. It contains the basic information and definitions needed for communication between computer programs |
| Authorship attribution |
branch of stylometry concerned with matching anonymous or pseudonymous texts to their true authors based on similarity of textual features |
| Bar chart |
a graphical representation for categorical data, in which values are shown by the extent of vertical or horizontal bars |
| Betweenness centrality |
quantitative measure for network graphs, representing a given node’s importance in linking other nodes. If one imagines every possible pair of nodes within the graph as connected by the shortest possible route, the number of such routes that pass through a particular node can be used to quantify this node’s centrality as a connector (or “bridge”) between parts of the graph |
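The counting procedure described in this entry can be sketched in a few lines of Python. This is only an illustration for a small undirected, unweighted graph: the function names and the toy character network are hypothetical, the scores are not normalized, and dedicated graph software uses faster algorithms.

```python
from collections import deque
from itertools import combinations


def shortest_paths(graph, source, target):
    """Return all shortest paths between two nodes (breadth-first search)."""
    dist = {source: 0}
    preds = {source: []}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                preds[neighbour] = [node]
                queue.append(neighbour)
            elif dist[neighbour] == dist[node] + 1:
                # A second, equally short route to this neighbour.
                preds[neighbour].append(node)
    if target not in dist:
        return []

    def build(node):
        if node == source:
            return [[source]]
        return [path + [node] for pred in preds[node] for path in build(pred)]

    return build(target)


def betweenness(graph):
    """Unnormalized betweenness: for every pair of nodes, each intermediate
    node receives the share of shortest paths that pass through it."""
    scores = {node: 0.0 for node in graph}
    for source, target in combinations(graph, 2):
        paths = shortest_paths(graph, source, target)
        for path in paths:
            for node in path[1:-1]:
                scores[node] += 1 / len(paths)
    return scores


# Hypothetical toy network (adjacency lists), not data from the volume.
toy_graph = {
    "Patroclus": ["Achilles"],
    "Achilles": ["Patroclus", "Agamemnon"],
    "Agamemnon": ["Achilles", "Priam"],
    "Priam": ["Agamemnon"],
}
scores = betweenness(toy_graph)
```

In this toy chain the two middle nodes lie on every route between the two ends, so they receive the highest scores.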
| Box-and-whisker plot (also box plot) |
a graphical representation of the spread of values in a data set. A collection of values is represented by a rectangular shape divided into two by a horizontal line (the “box”); above and below the box are vertical lines (the “whiskers”). The box represents the values from the 25th to the 75th percentile (i.e., the middle 50 % of all the data), the horizontal line dividing the box into two is the median (50th percentile). The whiskers can be defined in various ways. In the chapters of Berlincourt (chapter 13) and Verhelst and Forstall (chapter 18) in this volume, the authors use a conventional setting defining the whiskers as extending to 1.5 times the interquartile range (distance between 1st and 3rd quartile) beyond the box. Values beyond that range, the “outliers”, are represented as dots. |
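As a minimal sketch of the 1.5 × IQR convention described in this entry, the following Python function computes the box, the whisker limits and the outliers for a list of values. The function name and the sample numbers are hypothetical, and the "inclusive" quartile method is an assumption: plotting libraries may compute quartiles slightly differently.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker limits and outliers under the 1.5 x IQR convention."""
    # n=4 yields the three cut points Q1, median and Q3.
    # method="inclusive" is one of several quartile conventions (assumption).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    outliers = sorted(v for v in values if v < lower_limit or v > upper_limit)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }


# Hypothetical speech lengths in lines, for illustration only.
speech_lengths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = box_plot_summary(speech_lengths)
```

With these invented values, the single very long speech falls beyond the upper limit and would be drawn as a dot.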
| Bridge |
in network graphs, bridges are connections between nodes that otherwise belong to different communities (see also, “network graphs”) |
| Canonical Text Services (CTS) |
protocol for identifying and retrieving passages of literary text. Using a canonical reference system, this protocol allows for connecting databases, digital libraries, and more for automated queries (see also, “linked data”) |
| Chicago Homer |
multilingual database and reading environment that makes the distinctive features of Early Greek epic accessible to readers with and without Greek. |
| The Classical Language Toolkit (CLTK) |
a Python library offering natural language processing (NLP) tools for the languages of pre-modern Eurasia. CLTK also hosts several text corpora, collected from various open-source digital libraries, both for Latin and Greek texts. |
| Classification |
the process of assigning a given sample to one of a set of predetermined classes based on a set of features. For example, a sample consisting of epic language might be classified as “male” or “female”, or as “speech” or “narrator text”, according to features such as the frequency of certain words or grammatical forms. In computational studies, a “classifier” commonly refers to a machine learning model trained on pre-classified data, which can be used to predict to which class new samples belong |
| Cognitive Poetics |
a school of literary criticism in which literature is approached through the lens of cognition (the mental processes of reading, understanding and remembering), drawing on insights from the fields of psychology and cognitive linguistics |
| Collective speech |
speech represented as spoken by multiple speakers, often a crowd. While in some cases multiple speakers may realistically say the exact same thing simultaneously (e.g. a crowd shouting their assent or dissent), most often collective speeches represent the general sentiment of what multiple individuals within a crowd roughly say to one another (one speech representing many) |
| Community |
in network graphs communities are defined as a subset of nodes, densely connected to each other and loosely connected to nodes in other communities in the same graph (see also, “network graphs”) |
| Corpus Linguistics |
an empirical method for the study of language phenomena based on large text corpora which are analyzed with computational and statistical methods |
| Daphne treebanks |
collection of Ancient Greek dependency treebanks (see also, “treebank”) curated by F. Mambrini. |
| Delta measure |
Also known as “Burrows’ Delta”. Stylometric tool for authorship attribution designed to measure the similarities between texts in a corpus by calculating the distance between them in a multidimensional vector space based on word frequencies |
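One common formulation of the Delta measure can be sketched as follows: relative word frequencies are converted to z-scores across the corpus, and the distance between two texts is the mean absolute difference of those z-scores. The function name, the use of the population standard deviation and the frequencies are illustrative assumptions, not the settings of any chapter in this volume.

```python
import statistics


def burrows_delta(freqs, text_a, text_b):
    """Mean absolute difference of z-scored word frequencies for two texts.

    `freqs` maps each text name to a dict of relative word frequencies.
    """
    words = sorted({word for table in freqs.values() for word in table})
    total = 0.0
    used = 0
    for word in words:
        column = [table.get(word, 0.0) for table in freqs.values()]
        spread = statistics.pstdev(column)  # assumption: population std. dev.
        if spread == 0:
            continue  # a word used identically everywhere carries no signal
        mean = statistics.fmean(column)
        z_a = (freqs[text_a].get(word, 0.0) - mean) / spread
        z_b = (freqs[text_b].get(word, 0.0) - mean) / spread
        total += abs(z_a - z_b)
        used += 1
    return total / used if used else 0.0


# Hypothetical relative frequencies of two function words in three texts.
corpus = {
    "A": {"and": 0.050, "the": 0.040},
    "B": {"and": 0.051, "the": 0.041},
    "C": {"and": 0.020, "the": 0.080},
}
delta_ab = burrows_delta(corpus, "A", "B")
delta_ac = burrows_delta(corpus, "A", "C")
```

A smaller value indicates more similar word usage; here text A is closer to B than to C.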
| Diagnostic feature set |
set of features used to predict whether an entity is likely to belong to a specific category. For example, a specific set of linguistic or lexical features may be used to predict whether a given passage is speech or narrative. See also, “classification” |
| Direct Speech in Greek Epic Poetry (DSGEP) |
companion website and digital appendix to Verhelst’s 2017 book Direct Speech in Nonnus’ Dionysiaca, with data on all direct speech in Homer, Apollonius, Quintus and Nonnus’ Dionysiaca. The data presented in this database have now been integrated in the new DICES database. |
| Dirichlet prior distributions |
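in Bayesian statistics, a family of probability distributions over sets of proportions, used to express prior assumptions (for example, about expected word frequencies) before the observed data are taken into account |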
| Edge |
link between nodes in a network graph (see also, “network graph”) |
| Eidolopoiia |
rhetorical term for a speech by a deceased person |
| Embedded speech |
direct speech embedded in direct speech. The speaker of the embedding speech acts as a secondary (or tertiary) narrator |
| Epithalamium |
poetical and/or rhetorical genre comprising poems or speeches to be performed at a wedding |
| Ethopoiiai |
rhetorical exercises in which the student must compose a fitting speech for a given character or character type in a given situation. Ethopoiiai were among the standard exercises or progymnasmata for young rhetoricians in an early phase of their training |
| F1-score |
widely used metric to evaluate the accuracy of automated searches. The F1-score combines precision and recall (see also, “recall and precision”) into a single metric by giving both equal weight (harmonic mean). |
| Face |
sociological term for the social image a person creates and maintains for themself in interaction with others |
| False negatives |
see “recall and precision” |
| False positives |
see “recall and precision” |
| Function words |
words whose primary role is grammatical rather than lexical, such as articles, conjunctions, prepositions, pronouns and particles. Examples in Greek include |
| General form of address |
a form of address in the vocative which does not identify the addressee by name or by means of patronymics or ethnics. Instead, kinship and age terms are used (“father”), titles (“king”), terms of affection and esteem (“my dear”), insults (“dog”), or collective addresses (“friends”). Cf. “name-vocative”. |
| Gephi |
open-source software for creating (network) graphs. |
| Gini importance |
statistical measure evaluating the relative importance of a specific feature in a multi-feature classification experiment. It measures how much each feature contributes to reducing uncertainty (or “impurity”) in the predictions of the classifier (see also, “classification”) |
| Ground-truth dataset |
a thoroughly verified dataset which serves as a benchmark for evaluating the results of an automated process. |
| Heatmap |
a diagram where the color hue of each square represents the data values, e.g. to indicate the intensity of a relation between datapoints as in the chapters of Mambrini and Schirner (resp. indicating the amount of shared vocabulary and the frequency of co-occurring emotions) |
| Hypotaxis |
syntactic subordination; especially the phenomenon of multiple clauses which are subordinated to one another in a complex nested structure. Subordinate clauses can, for example, be relative clauses or be introduced by subordinating conjunctions |
| Hypothetical speech |
speech that is represented as hypothetical, potential or counterfactual, and not actually uttered by any character. Examples include speculating about an absent character’s reaction (X would have said Y) or predicting one’s own future reactions (when X happens, I will say Y). A well-known category of hypothetical speech is the so-called potential |
| Idiolect |
an individual person or character’s unique use of language |
| If-not situation |
recurring narrative feature in Homeric (and later) epic, describing the hypothetical outcome of an action not taken: “then X would have happened, if Y had not intervened” |
| Interquartile range (IQR) |
see “box-and-whisker plot” |
| Large language model (LLM) |
computational model, trained on vast amounts of texts, designed for natural language processing tasks such as generating “new” texts. ChatGPT is a well-known general-purpose example, but LLMs can also perform more specialized tasks such as lemmatization (see also, “lemmatized text”) |
| LatinCy |
An open language model for Latin. See “spaCy” |
| Lemmatized text |
a lemmatized text is a text in which every word (or “token”) is linked to a corresponding lemma (dictionary headword). In Latin and Greek, for example, lemmata are often more useful than the original inflected forms for tasks such as calculating frequencies or detecting repetition |
| Lemma frequency |
the sum frequency of all inflected forms of a given lemma (dictionary headword) within a text or passage |
| Lexeme |
the basic unit of meaning in a language |
| Library of Latin Texts (LLT) |
a digital library of Latin literature, hosted by Brepols Publishers, with advanced search functions and built-in analytical tools. The LLT is not open access but part of the paywalled section of the larger platform of Brepolis Databases. |
| Linear regression |
mathematical model proposing a linear correlation between two measured variables. As one variable changes, the other is expected to change proportionally. The expected relationship between the two variables can be plotted as a straight line on a graph. The distance between this linear fit and actual individual observations represents variation not explained by the model |
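A minimal sketch of an ordinary least-squares fit for two variables is given below. The function name and the data are hypothetical; the residuals computed at the end are the distances between the fitted line and the individual observations mentioned in this entry.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for paired observations."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariation / variation_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations, for illustration only.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]
slope, intercept = fit_line(xs, ys)

# Residuals: variation in each observation not explained by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
```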
| Linked Open Data (LOD) |
machine-readable structured data designed for sharing online and linking to other datasets and released under an open license. Interoperability can be facilitated by means of non-proprietary data formats and a shared system of naming and request conventions, for example, those defined for classical text passages by the CTS protocol (see also, “Canonical Text Services”) |
| Logarithmic scale |
method to visually represent numerical data spanning a broad range of values. Whereas on a linear scale each unit of distance corresponds to the same increment (e.g. 1, 2, 3, 4, …), on a logarithmic scale each unit of distance corresponds to multiplication by the base value (e.g. 1, 10, 100, 1000, …) |
| Log odds |
a logarithmic representation of probability. For example, while the probability of a given word within a sample of text must fall within the range 0 to 1, the log odds ratio is scaled to the range -∞ to ∞, so that very uncommon words are represented by large negative numbers, and common words by large positive numbers. Weighted log odds applies a further adjustment to account for differences in the sizes of samples |
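The basic transformation can be sketched as follows, assuming the natural logarithm. The function name and the probabilities are hypothetical, and the weighting step mentioned in this entry is not shown.

```python
import math


def log_odds(probability):
    """Map a probability strictly between 0 and 1 onto the whole real line."""
    return math.log(probability / (1 - probability))


# Hypothetical probabilities of one word in two samples.
p_sample_a = 0.02
p_sample_b = 0.005

# The difference of two log odds (a log odds ratio) compares the samples:
# a positive value means the word is relatively more typical of sample A.
comparison = log_odds(p_sample_a) - log_odds(p_sample_b)
```

A probability of 0.5 corresponds to log odds of 0; values below 0.5 give negative numbers and values above 0.5 give positive ones.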
| MANTO |
MANTO is an authoritative database of Greek myth, providing open access to metadata on mythological characters, places and source texts, as well as numerous types of relationships between the different entities. |
| Median |
in calculating averages for a set of values, the median is the value representing the middle of the group: 50 % of all values are higher, 50 % are lower. The median provides a popular alternative to the mean (sum of all values divided by the number of values) and the mode (value that occurs most often) |
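The three kinds of average named in this entry can be compared with Python's standard `statistics` module. The values below are hypothetical; they are chosen so that one very large value pulls the mean upwards while leaving the median and mode unchanged.

```python
import statistics

# Hypothetical speech lengths in lines, with one very long speech.
lengths = [2, 3, 3, 5, 40]

mean_length = statistics.mean(lengths)      # sum of values / number of values
median_length = statistics.median(lengths)  # middle value of the sorted list
mode_length = statistics.mode(lengths)      # most frequent value
```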
| Mertens-Pack3 (MP3) |
Papyrological database. |
| Modularity |
In network graphs, modularity calculations are used as a measure for establishing “community” structures in larger and more complex graphs (see also, “community”). When calculating the modularity measure, the number of connections (“edges”) within a community is compared to the number of connections in an equivalent randomized network (see also, “network graph”) |
| Morphological tagging |
the process, frequently performed by an NLP model, of providing a morphological analysis for every word in a text, so that in the resulting data the morphological features (such as case, number, tense, …) appear as annotations (tags) for each word |
| Name-vocatives |
a form of address in the vocative, consisting of the name of the addressee. This category also includes patronymics and ethnics. Cf. “General form of address”. |
| Narrative level |
the narrative level indicates whether a speech (or any other narrative feature) occurs at the level of the primary narrator, or embedded in language attributed to a secondary or tertiary narrator. In Homer’s Odyssey, the words of the narrator are level 0; when the narrator quotes the direct speech of Odysseus, this is level 1; when, in recounting his adventures, Odysseus himself quotes the words of the Cyclops, this is level 2 |
| Natural Language Processing (NLP) |
NLP is a subfield of computer science focusing on the computational analysis of human language. Examples of NLP include automated lemmatization and morphological tagging of Greek and Latin. NLP generally relies on machine learning to extract computational models of linguistic patterns from large text corpora. NLP tools offer possibilities for large scale automated text analysis and manipulation |
| N-gram |
a sequence of a given number (n) of adjacent items in a particular order. For text-based computational analysis, these items can be (lemmatized) words, letters, etc. Depending on the value of n, n-grams can be used, for example, to detect longer or shorter units of repetition within a text |
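Extracting n-grams from a list of items takes a single line of Python, sketched here under the assumption that the text has already been split into tokens. The function name is hypothetical.

```python
def ngrams(items, n):
    """Return all runs of n adjacent items, in order, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Opening words of Virgil's Aeneid, split into tokens by hand.
tokens = ["arma", "virumque", "cano", "Troiae"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

The same function works on letters, lemmata or any other sequence; a sequence shorter than n simply yields no n-grams.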
| Network graph |
graphical visualization of the relations or interconnections between a set of entities. Each entity is represented by a node. The connections between the entities are represented by edges |
| Node |
see “network graph” |
| Noise |
irrelevant examples that show up in an automated search, also “false positives” |
| The Oath in Archaic and Classical Greece Database |
database recording and annotating Greek oaths until 322 BC across all genres and text types, including epigraphical evidence. |
| OdyCy |
an open language model for Ancient Greek (see “spaCy”) |
| Open Greek and Latin |
open-source resource featuring a large set of digital texts in Latin and Greek, reading tools and software. |
| Optical Character Recognition (OCR) |
the automatic conversion of images of printed text (for example, scanned images of public domain books) into machine-readable text data |
| Oratio obliqua |
indirect speech |
| Oratio recta |
direct speech |
| Outlier |
statistical term for a datapoint that differs significantly from the large majority of datapoints. It can be worthwhile investigating outliers because they may point at interesting, exceptional examples. On the other hand, outliers are often excluded from further statistical calculations because their exceptional features would distort the results. See also, “box-and-whisker plot” |
| Pars epica |
a narrative part of a primarily hymnic or epideictic poem |
| Parsing |
a parser uses an NLP model to extract linguistic features from text, typically providing information for each word (or token), such as the lemma (dictionary headword), the part of speech (POS), number, case, mood, etc |
| Part of speech (POS) |
The intrinsic grammatical type of a word, such as noun, adjective, or verb. One of the tasks typically performed by a linguistic parser (see also, “parsing”) is annotating all tokens with POS tags |
| Passim |
open-source software for automatically detecting repeated sequences within texts. |
| Permutation importance |
statistical measure evaluating the relative importance of a specific feature in a multi-feature classification experiment. It measures how much the performance of a classifier worsens when the values of a given feature are rearranged at random. See also, “classification”. |
| Perseus Digital Library |
open-source digital library containing, among many other things, an extensive collection of Greek and Latin literary texts. |
| Principal Component Analysis (PCA) |
statistical method used to represent and analyze multi-variable data. PCA transforms and reorders multi-dimensional data to highlight the most salient dimensions of variance. For example, a collection of samples originally characterized by hundreds of individual word frequencies can be reduced to a two-dimensional visualization while retaining as much meaningful information as possible |
| Progymnasmata |
Set of standard writing exercises practiced in antiquity by young rhetoricians in an early phase of their training. These exercises are described extensively in several Greek rhetorical handbooks of the first centuries AD. For an example, see “ethopoiiai” |
| Python |
general-purpose computer programming language. The DICES client, spaCy, CLTK and various other digital resources used in this volume can be controlled using Python. The use of a common language makes it easier to automate complex tasks in which several tools must be used in combination |
| R |
general purpose programming language popular for statistics and digital humanities |
| Recall and precision |
when evaluating the accuracy of a classification experiment based on automated analysis, recall and precision are calculated based on the number of true positives (correctly retrieved as belonging to a target group), false positives (wrongly retrieved as belonging to a target group), true negatives (correctly rejected as not belonging to a target group) and false negatives (wrongly rejected as not belonging to a target group). The recall score gives the ratio of true positives in relation to the total number of relevant items. The precision score gives the ratio of the correctly identified items in relation to the total number of retrieved items |
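The two ratios, together with the F1-score that combines them, can be sketched as follows. The function name and the item identifiers are hypothetical: `retrieved` stands for what an automated search returned and `relevant` for what a ground-truth dataset marks as correct.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute precision, recall and F1 from two collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    false_positives = len(retrieved - relevant)
    false_negatives = len(relevant - retrieved)

    precision = (
        true_positives / (true_positives + false_positives) if retrieved else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives) if relevant else 0.0
    )
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 4 items retrieved, 5 items actually relevant.
precision, recall, f1 = precision_recall_f1(
    retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 5, 6]
)
```

In this example three of the four retrieved items are correct (precision 0.75), and three of the five relevant items were found (recall 0.6).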
| Reported speech |
depending on the theoretical framework that is being used, this term has multiple meanings. 1. Speech within speech. Throughout the volume we prefer the term embedded speech for direct reported speech (see also, “embedded speech”). The term reported speech can also refer to speech quoted by a character in oratio obliqua (indirect speech) and is used in this way in the chapter of Cesca and Romanello. Sometimes a further distinction is made between (actually) reported speech (“what a character has said”) and hypothetical speech (“what a character might say/might have said”). 2. A mere mention of a speech (see also, “speech mention”), without it being quoted directly or paraphrased. It is used in this sense in the chapters of Oughton and Burns |
| Responsibility Exchange Theory (RET) |
social psychology theory, proposed by Shereen Chaudry and George Loewenstein, analyzing social patterns of conduct in assuming and attributing responsibility in collaborative situations |
| Rolling windows |
A method for analyzing continuous data by aggregating overlapping samples (“windows”) of fixed size. For example, an epic poem may be divided into overlapping segments of five lines: the first sample comprises lines 1–5, the second, 2–6, and so on. This method produces a continuous rather than quantized metric and avoids accidentally overlooking features that occur at the boundaries of non-overlapping samples. This method is used in the chapters of Burns and Forstall and Verhelst. |
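A minimal sketch of the windowing step is shown below. The function name, the window sizes and the per-line flags are hypothetical and do not reproduce the settings of any chapter in this volume.

```python
def rolling_windows(items, size, step=1):
    """Return overlapping windows of a fixed size over a sequence."""
    return [items[i:i + size] for i in range(0, len(items) - size + 1, step)]


# Line numbers 1-7 in windows of five lines: 1-5, 2-6, 3-7.
line_windows = rolling_windows(list(range(1, 8)), size=5)

# Hypothetical flags per line (1 = line belongs to direct speech).
speech_flags = [0, 0, 1, 1, 1, 0, 0]

# Aggregating each window gives a smoothly changing measure along the poem.
speech_lines_per_window = [sum(w) for w in rolling_windows(speech_flags, size=3)]
```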
| Scatter plot |
a representation comparing two variables (x-axis and y-axis) for a large number of data points, which are represented as dots |
| Secondary narrator |
an embedded story (often in the form of direct speech) is told by a character who acts as secondary narrator and may include direct speech in his story (see also, “embedded speech”) |
| SpaCy |
a general-purpose natural language processing package for Python, which must be used in conjunction with third-party, pre-trained language models specific to a given language and task. For Latin NLP, several chapters in this volume use the latinCy spaCy model trained by Burns et al.; for Greek NLP, the spaCy model used in this volume is odyCy by Kostkan and Kardos |
| Speech act |
an act performed through speaking. Speech act theory distinguishes different actions performed through speech, e.g. to blame, to apologize. In this volume and the DICES database, the standard unit is not that of one speech act, but that of one direct speech (from the opening to the closing of quotation marks). One such speech can consist of multiple speech acts (shifting e.g. from a complaint to a request). Conversely, a single speech act can be represented by the narrator partly with direct speech, partly with indirect speech. For an approach to epic speech using speech act theory, see the chapters by Minchin and Beck |
| Speech cluster |
a conversational cluster consisting of multiple speeches, most often in the form of a dialogue |
| Speech conclusion |
language used by a narrator to conclude a direct speech and move back to a narrative mode (e.g. “This is what he said in tears. Then …”). Alternatively called “speech capping” |
| Speech framing language |
Umbrella term for both speech conclusions and speech introductions |
| Speech introduction |
language used by a narrator to introduce direct speech (alternatively called “attributive discourse”), usually identifying speaker and addressee, often also hinting at the intentions of the speaker |
| Speech mention |
mere mention of a speech without its being quoted directly or paraphrased (also “reported speech”) |
| Speech Presentation in Homeric epic |
companion website and database for the 2012 book by Deborah Beck with the same title, containing information about as well as the full text of all instances of speech representation in Homer, including indirect forms of speech representation. |
| Speech type |
while various classification systems for speeches co-exist and are referred to throughout the volume, “speech type” is used primarily to refer to a content-based typology of (epic) speeches such as (battle) exhortations, prayers or laments. The DICES database includes tags tentatively indicating applicable speech type classifications for all speeches in the corpus |
| Stacked column chart |
graphical representation for categorical data. For each category, a number of different values is being compared, which are represented stacked onto each other as the building blocks of a tall column. The columns can show either relative or absolute values. In the example in the chapter of Berlincourt, a full column represents 100 % of all speeches occurring in an epic |
| Stylometry |
the study of quantifiable aspects of style |
| Tesserae |
web-based tool for detecting allusions in Greek and Latin literature on the basis of shared vocabulary (with similar functionality as “Passim”). Secondarily, Tesserae is also the source for one of the text corpora available through CLTK. |
| Theory of Mind (ToM) |
psychological concept referring to the human ability to understand something of our own mental states, and the capacity to form intuitions about the intentions of others |
| Thesaurus Linguae Graecae (TLG) |
digital library of Greek literature, hosted by the University of California, with advanced search functions and built-in analytical tools. Only part of the extensive TLG corpus is open access. The larger part is on the paywalled section of the site. |
| Tidylo |
R library used to calculate weighted log odds. |
|
conventional term for a speech in Greek epic by an anonymous character ( |
|
| Tokenized text |
Tokenization is the process, typically performed by an NLP parser, of segmenting a text into a sequence of “tokens” for analysis. Tokens are often equivalent to words, but may also include numerals, punctuation marks, or combining forms such as the Latin enclitic ‑que, depending on the parser |
| Treebank |
dataset consisting of syntactically annotated texts. Dependency treebanks, for example, record the dependency relations among the words of each sentence in a machine-readable, tree-like structure. While some language models can generate automatic syntactic annotations, treebanks are usually meant to be authoritative and are hand-corrected by human readers to ensure accuracy |
| Trismegistos (TM) |
database for Greco-Roman epigraphical and papyrological materials. |
| True negatives |
see “recall and precision” |
| True positives |
see “recall and precision” |
| Unseen data |
data that was not part of the dataset used to build a certain metric |
| Violin plot |
graphical visualization for comparing data distributions. The width of the curved lines corresponds with the density of datapoints in each region |
| Wikidata |
open knowledge base storing structured data about the world. Part of the Wikimedia movement, it both supports and derives content from Wikipedia; like Wikipedia it is editable by the public. Wikidata stores sometimes extensive biographical details for authors as well as for historical and mythical epic characters, including known family relationships and alternative names. |
| Z-score |
standardized way of representing how much datapoints differ from the mean, expressed in units of standard deviation, e.g. allowing for a comparison of frequencies of word use in corpora of different sizes |
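The calculation can be sketched as follows: each value's distance from the mean is divided by the standard deviation of the whole set. The function name and the values are hypothetical, and the use of the population standard deviation (rather than the sample version) is an assumption.

```python
import statistics


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # assumption: population std. dev.
    return [(value - mean) / spread for value in values]


# Hypothetical frequencies of one word across eight samples.
frequencies = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(frequencies)
```

Because the results are expressed in standard deviations rather than raw counts, they can be compared across datasets measured on different scales.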
Glossary
In: Direct Speech in Greek and Latin Epic
Authors: Christopher W. Forstall and Berenice Verhelst