21. 11. 2006 Lexicon Data and their Structure

 




Lexicon structures and their data types

  • Microstructure

  • number of lexicon articles/entries/records

  • order of DatCats (data categories)

  • Mesostructure

  • Interrelation of lexicon entries

  • relation to external information

  • Macrostructure

  • order of lexicon entries

  • selection of sort key

  • sorting order is not trivial! (cf. languages that are only spoken -> IPA)


    Sorting is NOT trivial: the example „@“

  • you would expect „h@me“ close to the word „home“

  • you would expect „intern@t“ close to the word „internet“

  • you would expect „@ home“ close to the word „at“

    so where do you sort „@“?
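
The dilemma can be made concrete with a small Python sketch. It compares plain code-point order with three candidate sort keys that read „@“ as „o“, „e“ or „at“. The helper name `sort_with` and the word list are invented for illustration, and none of the mappings is claimed to be the right one:

```python
def sort_with(words, replacement):
    """Sort words by a hypothetical key that reads "@" as `replacement`."""
    return sorted(words, key=lambda w: w.lower().replace("@", replacement))

words = ["at", "hammer", "home", "internet", "zebra", "h@me", "intern@t", "@"]

# Plain code-point order: "@" (U+0040) sorts before every letter.
by_codepoint = sorted(words)

# Three candidate readings of "@"; each one files the words differently.
as_o = sort_with(words, "o")    # "h@me" lands next to "home"
as_e = sort_with(words, "e")    # "intern@t" lands next to "internet"
as_at = sort_with(words, "at")  # "@" lands next to "at"

for label, result in [("code point", by_codepoint), ("@=o", as_o),
                      ("@=e", as_e), ("@=at", as_at)]:
    print(f"{label:10} {result}")
```

Each reading satisfies one of the three expectations at the cost of the others, so the sort key has to be chosen explicitly for the lexicon at hand.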


Haus -> Häuser

Hauses -> Häuser

Hause -> Häusern

Haus -> Häuser (which form would you find in a dictionary? -> Haus)


a-declension

flamm-: -a, -ae, -ae, -am, -a, -ae, -arum, -is, -as, -is (which one would you find here? -> flamma)
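
The mapping from inflected forms to the form found in a dictionary can be sketched as a simple lookup table. This is a minimal Python illustration built only from the forms listed above; the names `PARADIGMS`, `FORM_TO_LEMMA` and `lemmatize` are hypothetical, and a real lexicon would need rules rather than an exhaustive list:

```python
# Endings of the Latin a-declension as listed in the notes
# (five singular forms followed by five plural forms).
LATIN_A_ENDINGS = ["a", "ae", "ae", "am", "a", "ae", "arum", "is", "as", "is"]

# Lemma (citation form) -> all inflected forms that belong to it.
PARADIGMS = {
    "flamma": ["flamm" + ending for ending in LATIN_A_ENDINGS],
    "Haus": ["Haus", "Hauses", "Hause", "Häuser", "Häusern"],
}

# Inverted index: inflected form -> lemma.
FORM_TO_LEMMA = {
    form: lemma for lemma, forms in PARADIGMS.items() for form in forms
}

def lemmatize(form):
    """Return the citation form for a known inflected form, else None."""
    return FORM_TO_LEMMA.get(form)

print(lemmatize("Häusern"))    # Haus
print(lemmatize("flammarum"))  # flamma
```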


Microstructure

  • words (most) (except for picture dictionaries)

  • grammatical information: syntax

  • part of speech (POS)

  • inflectional class

  • valence (how many objects a verb takes: transitive, intransitive)

  • representation of meaning (formats differ)

  • semantics

  • definition

  • corpus reference := usage examples
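
As a rough illustration, the data categories above can be modelled as the fields of one record. The sketch below uses a Python dataclass; the class name `LexiconEntry`, the field names and the sample entry are assumptions made for this example, not a standard format:

```python
from dataclasses import dataclass, field, fields
from typing import List

@dataclass
class LexiconEntry:
    """One lexicon entry; each field is one data category (DatCat)."""
    orthography: str                 # the word itself
    pos: str                         # part of speech
    inflectional_class: str = ""
    valence: str = ""
    definition: str = ""             # one possible representation of meaning
    examples: List[str] = field(default_factory=list)  # corpus references

entry = LexiconEntry(
    orthography="give",
    pos="verb",
    inflectional_class="irregular",
    valence="ditransitive",
    definition="to hand something to someone",
    examples=["She gave him the book."],
)

# The microstructure also fixes the order of the data categories.
datcat_order = [f.name for f in fields(LexiconEntry)]
print(datcat_order)
```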



Detour: CORPUS

-> collection of language material

  • texts

  • transcripts

  • speech (transcription in IPA)

  • examples: Oxford corpus, Longman corpus

-> with additional information

  • Part Of Speech

  • lemma (de-grammaticalized form of a word, i.e. its citation form)

  • transcription

  • annotations

-> with a specific structure

  • interlinear glossing

  • special markup




Other types of lexicons


  • Word frequency lexicon

  • the most frequent one first

  • Lexicon of "phrasal verbs"

  • by part of speech and a special structure

  • rhyming lexicon

  • by word ending

  • picture lexicon

  • by prototype
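
Two of these lexicon types differ from an alphabetical dictionary only in the sort key. A minimal Python sketch, assuming a toy token list invented for this example:

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat too".split()

# Word frequency lexicon: the most frequent word first.
counts = Counter(tokens)
frequency_lexicon = [word for word, _ in counts.most_common()]

# Rhyming lexicon: sort by word ending, i.e. by the reversed spelling.
rhyming_lexicon = sorted(set(tokens), key=lambda word: word[::-1])

print(frequency_lexicon[:2])
print(rhyming_lexicon)
```

Sorting by reversed spelling is only an approximation of rhyme, since it works on letters rather than on sounds.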




Problematic issues in lexicography


  • ambiguity

  • synonyms (two word forms, same meaning)

  • polysemy (one word form, two or more (slightly) different meanings)

  • homonyms (one word form, completely different meanings)


  • word search

  • languages with inflectional prefixes

  • orthographic ambiguity

  • picture lexicons?

     

  • Language change

  • new words

  • new meaning



  • Solutions to problems

  • ambiguity: enumeration

  • word search: „arbitrary“ definition

  • language change: new edition

  • more fundamental solutions




Methods of creating lexicons

  • introspection

  • look inside ( by trained linguist)

  • reflecting one's own language use

  • „social“ filter: relevance, importance, adequacy

  • Questionnaire

  • in comparative linguistics

  • typology

  • unknown language -> picture dictionary

  • point at picture (might be rude in some countries)

     

  • requirements and limitations

  • intended use: researching morphology, use in computer systems, translation

  • intended user group: experts, laypeople, translators, linguists

  • intended coverage: general, special purpose

  • available sources: availability of language experts (native speakers)

  • example questionnaire:

  • Asking questions for translation, explanation

  • Social filters apply

  • http://www.spectrum.unibielefeld.de/~ttrippel/htmd/questionnaire_short.html


     

  • corpus

     

     


Corpus based lexicon creation

  • "reflect the evidence"

  • include "words" found

  • exclude items not in corpus

  • based on corpora

  • list all words: wordlist

  • words in context: concordance

  • distribution analysis: HMM

  • flat tabular lexicon

  • generalizations in the lexicon

  • declarative lexicons
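
The first two corpus-based steps, the wordlist and the concordance, can be sketched in a few lines of Python. The function names, the tokenization rule and the sample corpus are assumptions for this illustration; the HMM-based distribution analysis is not covered here:

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def wordlist(text):
    """List every distinct word found in the corpus, alphabetically."""
    return sorted(set(tokenize(text)))

def concordance(text, keyword, width=2):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

corpus = "The house is old. The old house has a new door."

print(wordlist(corpus))
for left, word, right in concordance(corpus, "house"):
    print(f"{left:>10} [{word}] {right}")
```

A word such as „window“ never occurs in this sample corpus, so a lexicon that strictly reflects the evidence would not contain it.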




Hierarchy of lexicon and corpus types



Corpus based lexicon creation application

  • SIL Toolbox

  • Summer Institute of Linguistics

  • famous for fieldwork tools

  • language database: www.ethnologue.org

  • previously named "Shoebox"

  • future: FieldWorks

  • Interlinearization of text

  • one line "base" text

  • one line gloss

  • one line morphology

  • ....
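
The idea of aligned lines can be illustrated with a short Python sketch that pads every word so that the columns line up. This is not how Toolbox stores or renders interlinear text; the function name `interlinearize`, the sample sentence and the gloss abbreviations are assumptions for this example:

```python
def interlinearize(*tiers):
    """Align several tiers (one string per line) word by word in columns."""
    rows = [tier.split() for tier in tiers]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all tiers must have the same number of words")
    # Each column is as wide as its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        padded = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)

text = interlinearize(
    "The dogs barked",       # base text
    "DET dog-PL bark-PST",   # gloss
    "the dog-s bark-ed",     # morphology
)
print(text)
```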




Lexicon Database Applications

  • Lists

  • Table

  • Tables

  • Relational Database Management Systems (RDBMS)

  • samples

  • Corpus based lexicon management

  • Graph based lexicons




Relational Model for a Lexicon

  • table structures

  • efficient storage and retrieval in Relational Database Management Systems (RDBMS)

  • often used for technological applications

  • used for some web based lexicons

  • translation = mapping of two different columns

  • example: http://dict.tu-chemnitz.de
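
A minimal sketch of the relational idea, using Python's built-in sqlite3 module with an in-memory database. The table name, column names and sample rows are invented for this example and say nothing about how the dictionary linked above is actually implemented:

```python
import sqlite3

COLUMNS = ("german", "english")  # hypothetical language columns

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE lexicon ("
    "id INTEGER PRIMARY KEY, german TEXT NOT NULL, "
    "english TEXT NOT NULL, pos TEXT)"
)
conn.executemany(
    "INSERT INTO lexicon (german, english, pos) VALUES (?, ?, ?)",
    [("Haus", "house", "noun"), ("Flamme", "flame", "noun"),
     ("geben", "give", "verb")],
)

def translate(word, source="german", target="english"):
    """Translation = look up `word` in one column, return the other column."""
    if source not in COLUMNS or target not in COLUMNS:
        raise ValueError("unknown language column")
    # Column names are checked against COLUMNS above; the word is a parameter.
    rows = conn.execute(
        f"SELECT {target} FROM lexicon WHERE {source} = ?", (word,)
    ).fetchall()
    return [row[0] for row in rows]

print(translate("Haus"))                       # ['house']
print(translate("give", "english", "german"))  # ['geben']
```

Because both directions use the same table, the lexicon works as a German-English and an English-German dictionary at once.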




Graph based lexicon

  • Lexical information = nodes in a graph

  • microstructure = (labeled) arcs between nodes

  • crossreferences = arcs between nodes

  • mesostructure = reference to external knowledge

  • macrostructure = access structure, starting at each node
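
The graph view can be sketched with a plain dictionary of labeled arcs. This is a minimal Python illustration; the class name `GraphLexicon`, the arc labels and the sample data are assumptions for this example:

```python
from collections import defaultdict

class GraphLexicon:
    """Lexical information as nodes, connected by labeled arcs."""

    def __init__(self):
        # node -> list of (label, target node)
        self.arcs = defaultdict(list)

    def add_arc(self, source, label, target):
        self.arcs[source].append((label, target))

    def follow(self, source, label):
        """Return all nodes reachable from `source` via arcs named `label`."""
        return [t for (l, t) in self.arcs.get(source, []) if l == label]

lexicon = GraphLexicon()
# Microstructure: labeled arcs from a word node to its properties.
lexicon.add_arc("Haus", "pos", "noun")
lexicon.add_arc("Haus", "plural", "Häuser")
# Cross-references: arcs between word nodes.
lexicon.add_arc("Haus", "translation", "house")
lexicon.add_arc("house", "translation", "Haus")

# Macrostructure: access can start at any node.
print(lexicon.follow("Haus", "plural"))        # ['Häuser']
print(lexicon.follow("house", "translation"))  # ['Haus']
```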



Summary

  • Lexicon structures and data types

  • microstructure data types

  • different macrostructures

  • Lexicon creation

  • questionnaire

  • corpus

  • Lexicon representation formats

  • RDBMS

  • graphs

27.11.06 20:07
