Translations as semantic mirrors: from parallel corpus to wordnet
Helge Dyvik
In: Advances in Corpus Linguistics
The paper reports on the project ‘From Parallel Corpus to Wordnet’ at the University of Bergen (2001–2004), which explores a method for deriving wordnet relations such as synonymy and hyponymy from data extracted from parallel corpora. The method assumes that semantically closely related words ought to have strongly overlapping sets of translations, and that words with wide meanings ought to have a larger number of translations than words with narrow meanings. Furthermore, if a word a is a hyponym of a word b (as tasty is of good), then the possible translations of a ought to be a subset of the possible translations of b.
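The three assumptions can be made concrete with plain set operations. The following is a minimal sketch, not the project's implementation: the translation sets are invented toy data rather than ENPC extracts, the function names are hypothetical, and Jaccard similarity is used here as one possible overlap measure, since the abstract does not say how overlap is quantified.

```python
# Invented toy translation sets (English -> Norwegian); NOT taken from the ENPC.
translations = {
    "good": {"god", "bra", "fin", "velsmakende"},
    "tasty": {"god", "velsmakende"},
    "nice": {"bra", "fin", "hyggelig"},
}


def overlap(a, b):
    """Jaccard similarity of two words' translation sets (an assumed measure)."""
    ta, tb = translations[a], translations[b]
    return len(ta & tb) / len(ta | tb)


def breadth(word):
    """Number of distinct translations, used as a rough proxy for width of meaning."""
    return len(translations[word])


def is_hyponym_candidate(a, b):
    """True if a's translations form a proper subset of b's translations."""
    return translations[a] < translations[b]


print(overlap("good", "nice"))
print(breadth("good"), breadth("tasty"))
print(is_hyponym_candidate("tasty", "good"))
```

With this toy data, tasty comes out as a hyponym candidate of good, while good and nice merely overlap without either set including the other.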
Based on assumptions like these, a set of definitions is formulated, defining semantic concepts such as ‘synonymy’, ‘hyponymy’, ‘ambiguity’ and ‘semantic field’ in translational terms. The definitions are implemented in a computer program which takes words with their sets of translations from the corpus as input and performs the following calculations: (1) On the basis of the input, different senses of each word are identified. (2) The senses are grouped into semantic fields based on overlapping sets of translations, such overlap being assumed to indicate semantic relatedness. (3) On the basis of the structure of a semantic field, a set of features is assigned to each individual sense in it, coding its relations to other senses in the field. (4) Based on intersections and inclusions among these feature sets, a semilattice is calculated with the senses as nodes. According to our hypothesis, hyponymy/hyperonymy, near-synonymy and other semantic relations among the senses now appear through dominance and other relations among the nodes in the semilattice. Thus, the semilattice is supposed to contain some of the semantic information we want to represent in wordnets. (5) In accordance with this assumption, thesaurus-like entries for words are generated from the information in the semilattice.
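A heavily simplified sketch of steps (2) and (4) may help to fix ideas. It rests on several assumptions that are not in the source: each word is treated as having a single sense (step 1 is skipped), translation sets stand in directly for the feature sets of step (3), and a plain inclusion ordering stands in for the semilattice. The data are invented and the function names are hypothetical.

```python
from itertools import permutations

# Invented toy input: each word is treated as one sense with a set of translations.
senses = {
    "good": {"god", "bra", "fin", "velsmakende"},
    "tasty": {"god", "velsmakende"},
    "nice": {"bra", "fin", "hyggelig"},
    "fast": {"rask", "hurtig"},
    "quick": {"rask", "kjapp"},
}


def semantic_fields(senses):
    """Group senses linked by shared translations, directly or through a chain.

    This computes connected components of the 'shares a translation' graph.
    """
    remaining = set(senses)
    fields = []
    while remaining:
        seed = remaining.pop()
        field, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            # Senses not yet placed that share a translation with the current one.
            linked = {s for s in remaining if senses[s] & senses[current]}
            remaining -= linked
            field |= linked
            frontier.extend(linked)
        fields.append(field)
    return fields


def dominance_pairs(field, senses):
    """Return sorted (upper, lower) pairs where lower's set is a proper subset of upper's."""
    return sorted(
        (upper, lower)
        for upper, lower in permutations(field, 2)
        if senses[lower] < senses[upper]
    )


for field in semantic_fields(senses):
    print(sorted(field), dominance_pairs(field, senses))
```

In this toy run the field containing good, tasty and nice yields one dominance pair, with good above tasty, while fast and quick share a field without either dominating the other.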
In the project these assumptions are tested against data from the English-Norwegian Parallel Corpus ENPC (Johansson 1997).