How it works: terminology extraction process
To produce the best results and the most useful information about the input
text, terminology extraction is performed in two steps: text clean-up and
collocation extraction.
1. Text Clean-Up
The objective of the clean-up phase is to ensure that all inconsistencies are
removed from the input text. For this, TerminologyExtractor uses the following
resources:
a. The input text.
b. An exclude list provided with TerminologyExtractor. This list contains the
most common words that are usually not considered for keywords, such as
prepositions, pronouns and modal verbs.
c. A user-defined exclude list that is initially empty. As clean-up proceeds,
this list typically comes to include words that are not important for
collocation extraction, such as the proper nouns found in the input text.
d. A system dictionary, provided with TerminologyExtractor, that converts
inflected forms (plurals and verb conjugations) to canonical forms (singular
and verb infinitive).
e. A user-defined dictionary that is initially empty. This dictionary will later
include important words not found in the system dictionary. When adding the
inflected form of a word to this list, the user may also specify the
corresponding canonical form so that the canonical form and not the inflected
form is used when extracting collocations.
The word list generation tool contained in TerminologyExtractor produces two
files from this input: a word file and a non-word file. The word file
contains the list of the words found in the system dictionary or in the
user-defined dictionary. The non-word file contains all the words not found
in any of the dictionaries and exclude lists. Typically, the non-word file
should be empty when the text is clean because every word in the input text
should be either a potential keyword, in which case it should belong to one
of the dictionaries, or an unimportant word, in which case it should be in
an exclude list.
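The split into a word file and a non-word file can be sketched roughly as follows. This is only an illustration of the rules described above, not TerminologyExtractor's actual implementation: the function name generate_word_lists, the in-memory data structures, and the choice to check the exclude lists before the dictionaries are all assumptions.

```python
from collections import Counter

def generate_word_lists(tokens, system_dict, user_dict,
                        system_exclude, user_exclude):
    """Hypothetical sketch: return (word list, non-word list) for the tokens.

    The dictionaries map an inflected form to its canonical form.
    The exclude lists are plain sets of words to ignore.
    """
    lookup = {**system_dict, **user_dict}      # user entries override system ones
    excluded = system_exclude | user_exclude
    words = Counter()       # canonical form -> frequency
    non_words = Counter()   # unknown word -> frequency
    for token in tokens:
        if token in excluded:
            continue                            # unimportant word: dropped
        if token in lookup:
            words[lookup[token]] += 1           # counted under its canonical form
        else:
            non_words[token] += 1               # left for the user to review
    return words, non_words

# Example data (invented for illustration).
system_dict = {"term": "term", "terms": "term",
               "extract": "extract", "extracted": "extract"}
system_exclude = {"the", "were", "from", "by"}
tokens = ["the", "terms", "were", "extracted", "from", "XML", "by", "Acme"]

# First pass: both user-defined resources are still empty.
words, non_words = generate_word_lists(tokens, system_dict, {},
                                       system_exclude, set())
print(dict(words), dict(non_words))
```

On the first pass, "XML" and "Acme" land in the non-word list. Once "XML" is added to the user-defined dictionary and "Acme" to the user-defined exclude list, a second pass leaves the non-word list empty.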
After the text is submitted to the word list generation tool for the first
time, the non-word list will include all words not found in the system
dictionary and the system exclude list, because the user-defined dictionary
and exclude list are initially empty. Typically, the non-word list will
include the following types of words:
a. Abbreviations.
b. Proper nouns.
c. Misspelled words.
d. Words that should be in the system dictionary but are not because this
dictionary is incomplete.
Words of types (a) and (d) should be inserted in the user-defined dictionary.
Words of type (b) should be inserted in the user-defined exclude list. Words
of type (c) should be corrected in the input text. The decision as to
precisely what to do with each word is up to the user. TerminologyExtractor
provides an easy way to move words from one list to another (from the
non-word list to the exclude list, for example): all lists can be selected
from the "View" menu; TerminologyExtractor automatically launches
Notepad, and individual words can be moved from one list to another using
cut-and-paste.
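In TerminologyExtractor this triage is done by hand in Notepad. Purely as an illustration of the routing rules above, the same decisions could be expressed as follows. The function name apply_triage, the action labels, and the data structures are hypothetical and are not part of the product.

```python
def apply_triage(decisions, user_dict, user_exclude):
    """Hypothetical sketch: route each non-word according to the user's decision.

    decisions maps a non-word to a pair (action, value):
      ("dictionary", canonical_form)  abbreviations and missing dictionary words
      ("exclude", None)               proper nouns and other unimportant words
      ("correct", replacement)        misspellings to fix in the input text
    Returns the corrections that still have to be applied to the input text.
    """
    corrections = {}
    for word, (action, value) in decisions.items():
        if action == "dictionary":
            user_dict[word] = value or word   # default: word is its own canonical form
        elif action == "exclude":
            user_exclude.add(word)
        elif action == "correct":
            corrections[word] = value
        else:
            raise ValueError(f"unknown action for {word!r}: {action!r}")
    return corrections

# Example decisions (invented for illustration).
user_dict, user_exclude = {}, set()
corrections = apply_triage(
    {
        "APIs": ("dictionary", "API"),        # abbreviation, with canonical form
        "Acme": ("exclude", None),            # proper noun
        "sofware": ("correct", "software"),   # misspelled word
    },
    user_dict,
    user_exclude,
)
print(user_dict, user_exclude, corrections)
```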
At the end of the clean-up phase:
· The user-defined dictionary will contain a list of relevant terms not found in
the system dictionary.
· The user-defined exclude list will contain proper nouns and other words not
considered important.
· The word list will contain the canonical form of all words found in one of the
dictionaries, with their frequency.
· The non-word list should be empty.
· The text will be cleaner.
2. Collocation Extraction
The collocation extraction process produces word and non-word lists as well as a
list of the collocations found in the input text with their frequency.
Collocations up to a user-defined width are extracted. Only maximum-length
collocations are inserted in the output list. For example, if "testing computer
software" appears 10 times in a text and "computer software" also appears 10
times, then "computer software" will not be taken into consideration, because
it never occurs outside the longer collocation. However, if "developing
computer software" also appears 15 times, the collocation "computer software"
will have appeared 25 times in total and will be inserted in the list.
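The maximum-length rule can be illustrated with a short sketch. The source does not describe the algorithm TerminologyExtractor uses, so the approach below is an assumption chosen to reproduce the example above: a collocation is dropped when a longer collocation that contains it has exactly the same frequency. The function names, the min_count threshold, and the splitting of the text into separate segments are also hypothetical.

```python
from collections import Counter

def count_ngrams(segments, max_width):
    """Count every word sequence of length 2..max_width within each segment."""
    counts = Counter()
    for tokens in segments:
        for n in range(2, max_width + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))

def maximal_collocations(segments, max_width=3, min_count=2):
    """Keep collocations not fully accounted for by one longer collocation."""
    counts = count_ngrams(segments, max_width)
    frequent = {g: c for g, c in counts.items() if c >= min_count}
    kept = {}
    for gram, count in frequent.items():
        # Assumed rule: same frequency as a longer collocation containing it
        # means the shorter one never occurs on its own.
        subsumed = any(
            len(other) > len(gram) and other_count == count
            and contains(other, gram)
            for other, other_count in frequent.items()
        )
        if not subsumed:
            kept[" ".join(gram)] = count
    return kept

# The example from the text, with each occurrence as its own segment.
testing = [["testing", "computer", "software"]] * 10
developing = [["developing", "computer", "software"]] * 15

print(maximal_collocations(testing))
print(maximal_collocations(testing + developing))
```

With only the "testing computer software" occurrences, the shorter "computer software" is dropped. Once the "developing computer software" occurrences are added, "computer software" reaches a frequency of 25, which matches neither longer collocation, so it is kept alongside them.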