TerminologyExtractor - How it works

Chamblon Systems Inc.

How it works: terminology extraction process

To produce the best results and the most useful information about the input text, TerminologyExtractor extracts terminology in two steps: text clean-up and collocation extraction.

1. Text Clean-Up

The objective of the clean-up phase is to ensure that all inconsistencies are removed from the input text. For this, TerminologyExtractor uses the following resources:

a. The input text.

b. An exclude list provided with TerminologyExtractor. This list contains the most common words that are usually not considered as keywords, such as prepositions, pronouns and modal verbs.

c. A user-defined exclude list that is initially empty. This list will later typically include words that are not important for collocation extraction, such as the proper nouns found in the input text.

d. A system dictionary, provided with TerminologyExtractor, that converts inflected forms (plurals and verb conjugations) to canonical forms (singular and verb infinitive).

e. A user-defined dictionary that is initially empty. This list will later include important words not found in the system dictionary. When adding the inflected form of a word to this list, the user may also specify the corresponding canonical form so that the canonical form and not the inflected form is used when extracting collocations.

The word list generation tool contained in TerminologyExtractor produces two files from this input: a word file and a non-word file. The word file contains the list of the words found in the system dictionary or in the user-defined dictionary. The non-word file contains all the words not found in any of the dictionaries or exclude lists. Once the text is clean, the non-word file should be empty, because every word in the input text should be either a potential keyword, in which case it belongs in one of the dictionaries, or a non-important word, in which case it belongs in an exclude list.
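The classification described above can be sketched in a few lines of Python. This is only an illustration, not TerminologyExtractor's actual implementation: the function name generate_word_lists, the tokenizing pattern, the representation of dictionaries as inflected-to-canonical mappings, and the choice to check exclude lists before dictionaries are all assumptions made for the example.

```python
import re
from collections import Counter

def generate_word_lists(text, system_dict, user_dict, system_exclude, user_exclude):
    """Split the tokens of `text` into a word list and a non-word list.

    Assumed data shapes: dictionaries map an inflected form to its
    canonical form; exclude lists are plain sets of lowercase words.
    """
    dictionary = {**system_dict, **user_dict}    # user entries override system ones
    excluded = system_exclude | user_exclude
    words, non_words = Counter(), Counter()
    # Assumed tokenizer: lowercase alphabetic runs, allowing inner ' or -
    for token in re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower()):
        if token in excluded:
            continue                             # non-important word: ignored
        if token in dictionary:
            words[dictionary[token]] += 1        # counted under its canonical form
        else:
            non_words[token] += 1                # unknown: left for the user to sort out
    return words, non_words

system_dict = {
    "engineers": "engineer", "tested": "test", "tests": "test",
    "compilers": "compiler", "compiler": "compiler", "failed": "fail",
}
words, non_words = generate_word_lists(
    "The engineers tested the compilers. Acme compiler tests failed.",
    system_dict, {}, {"the"}, set(),
)
print(dict(words))      # {'engineer': 1, 'test': 2, 'compiler': 2, 'fail': 1}
print(dict(non_words))  # {'acme': 1}
```

On a first pass with empty user-defined lists, the proper noun "Acme" ends up in the non-word list, which is the situation the clean-up phase is meant to resolve.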

After the text is submitted to the word list generation tool for the first time, the non-word list will include all words not found in the system dictionary or the system exclude list, because the user-defined dictionary and exclude list are initially empty. Typically, the non-word list will include the following types of words:

a. Abbreviations.
b. Proper nouns.
c. Misspelled words.
d. Words that should be in the system dictionary but are not because this dictionary is incomplete.

Words of type (a) and (d) should be inserted in the user-defined dictionary. Words of type (b) should be inserted in the user-defined exclude list. Words of type (c) should be corrected in the input text. The decision as to precisely what to do with each word is up to the user. TerminologyExtractor provides an easy way to move words from one list to another (from the non-word list to the exclude list, for example): all lists can be selected from the "View" menu; TerminologyExtractor automatically launches Notepad, and individual words can be moved from one list to another using cut-and-paste.
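The routing rules above can be summarized as a small Python sketch. In TerminologyExtractor this step is done by hand with cut-and-paste, so the function triage and its decisions mapping are purely hypothetical names used to make the three destinations explicit.

```python
def triage(non_words, decisions, user_dict, user_exclude):
    """Apply the user's decision for each non-word (hypothetical helper).

    `decisions` maps a word to one of:
      ("dictionary", canonical_form)  abbreviations, missing dictionary words
      ("exclude", None)               proper nouns and other unimportant words
      ("correct", corrected_spelling) misspellings to fix in the input text
    Returns the corrections to make in the text and the words still undecided.
    """
    corrections, undecided = {}, []
    for word in non_words:
        action, value = decisions.get(word, (None, None))
        if action == "dictionary":
            user_dict[word] = value or word   # fall back to the word itself
        elif action == "exclude":
            user_exclude.add(word)
        elif action == "correct":
            corrections[word] = value
        else:
            undecided.append(word)            # stays in the non-word list
    return corrections, undecided

user_dict, user_exclude = {}, set()
corrections, undecided = triage(
    ["gui", "acme", "sofware", "refactors", "xyz"],
    {
        "gui": ("dictionary", "gui"),
        "acme": ("exclude", None),
        "sofware": ("correct", "software"),
        "refactors": ("dictionary", "refactor"),
    },
    user_dict, user_exclude,
)
print(user_dict)     # {'gui': 'gui', 'refactors': 'refactor'}
print(user_exclude)  # {'acme'}
print(corrections)   # {'sofware': 'software'}
print(undecided)     # ['xyz']
```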

At the end of the clean-up phase:

· The user-defined dictionary will contain a list of relevant terms not found in the system dictionary.

· The user-defined exclude list will contain proper nouns and other words not considered important.

· The word list will contain the canonical form of every word found in one of the dictionaries, together with its frequency.

· The non-word list should be empty.

· The input text will be cleaner, with its misspelled words corrected.

2. Collocation Extraction

The collocation extraction process produces word and non-word lists as well as a list of the collocations found in the input text, with their frequencies. Collocations up to a user-defined width are extracted. Only maximum-length collocations are inserted in the output list. For example, if "testing computer software" appears 10 times in a text and "computer software" also appears 10 times, then every occurrence of "computer software" lies inside the longer collocation, and "computer software" will not be listed separately. However, if "developing computer software" also appears 15 times, the collocation "computer software" will have appeared 25 times, more often than either longer collocation, and will be inserted in the list.
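The maximum-length rule can be illustrated with a minimal Python sketch. It rests on several assumptions that the description above does not spell out: width is measured in words, a shorter collocation is dropped exactly when some longer collocation containing it has the same frequency, and a hypothetical min_freq threshold filters out one-off word sequences. The function name extract_collocations is likewise made up for the example.

```python
from collections import Counter

def extract_collocations(phrases, max_width=3, min_freq=2):
    """Count word sequences of 2..max_width words and keep the maximal ones.

    `phrases` is a list of token lists (already cleaned and canonicalized).
    `min_freq` is a hypothetical threshold, not a documented setting.
    """
    counts = Counter()
    for tokens in phrases:
        for n in range(2, max_width + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    frequent = {gram: c for gram, c in counts.items() if c >= min_freq}

    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous run inside `longer`
        n = len(shorter)
        return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    # Assumed rule: drop a collocation when a longer one that contains it
    # has the same frequency, i.e. it never occurs outside the longer one.
    return {
        " ".join(gram): c
        for gram, c in frequent.items()
        if not any(len(other) > len(gram) and oc == c and contains(other, gram)
                   for other, oc in frequent.items())
    }

testing = [["testing", "computer", "software"]] * 10
developing = [["developing", "computer", "software"]] * 15

print(extract_collocations(testing))
# {'testing computer software': 10}
print(extract_collocations(testing + developing))
# "computer software" is now kept with frequency 25, next to both
# three-word collocations (10 and 15)
```

The pairwise comparison is quadratic in the number of frequent collocations, which is acceptable for a sketch but would need an index for large texts.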