
1. Introduction

Bonito is a graphical user interface (GUI) for the Manatee corpus manager. It lets the user formulate queries and run them against various corpora. The results are displayed clearly and can be modified in various ways, and statistics can be computed on them.

1.1 Corpus

A corpus is defined here as a sequence of so-called positions. Each position holds a single token: a word, a number, a punctuation mark, etc. In most corpora the division into individual positions is performed automatically by external tools and does not depend on the Manatee or Bonito systems in any way. The division may therefore differ from corpus to corpus.

Each position consists of a set of positional attributes. Each attribute holds one simple piece of information about the word (the word itself, its basic form, its part of speech, etc.). A position in any corpus always has at least one attribute, named word, which holds the actual word at that position. Different corpora contain different sets of attributes. Some corpora contain only the word attribute. Others contain the word, its basic form (attribute lemma) and grammatical information (attribute tag). Some corpora split the grammatical information into several more specific attributes.
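The idea of positions and positional attributes can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Manatee's actual storage format or API; the sample sentence, the tag values and the helper attribute_values are assumptions made up for the example.

```python
# Hypothetical in-memory model of a tiny corpus: one dict per position,
# one key per positional attribute. Not Manatee's real data format.
corpus = [
    {"word": "The", "lemma": "the", "tag": "DT"},
    {"word": "cats", "lemma": "cat", "tag": "NNS"},
    {"word": "sleep", "lemma": "sleep", "tag": "VBP"},
    {"word": ".", "lemma": ".", "tag": "."},
]


def attribute_values(corpus, name):
    """Return the value of one positional attribute at every position."""
    return [position[name] for position in corpus]


# Every position carries at least the "word" attribute.
assert all("word" in position for position in corpus)

words = attribute_values(corpus, "word")
lemmas = attribute_values(corpus, "lemma")
```

A corpus that has only the word attribute would simply have one key per position in this model.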

The corpus may also contain various structure tags, such as sentence, paragraph or document boundaries. Certain types of tags may carry additional information. For example, in many corpora the whole text is divided into documents by the structure tag <doc>, which usually holds the identifier of the document source.

In some annotated corpora, the grammatical information stored for each word is also called a tag. This kind of tag must not be confused with the structure tags mentioned above: grammatical tags are stored in positional attributes, not in structures.
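The difference between the two kinds of tags can be illustrated with a small Python sketch. It assumes, purely for illustration, that a structure is stored as a named range of positions with optional extra information, while the grammatical tag is an ordinary positional attribute. The Structure class, the enclosing helper and the sample values are hypothetical and do not reflect Manatee's internals.

```python
from dataclasses import dataclass, field


@dataclass
class Structure:
    """A structure tag modelled as a range of positions (hypothetical)."""
    name: str            # structure name, e.g. "doc" or "s"
    start: int           # first position covered (inclusive)
    end: int             # one past the last position covered
    attrs: dict = field(default_factory=dict)  # additional information


# Grammatical tags live in the positional attribute "tag".
positions = [
    {"word": "Hello", "tag": "UH"},
    {"word": ".", "tag": "."},
    {"word": "Bye", "tag": "UH"},
    {"word": ".", "tag": "."},
]

# Structure tags live beside the positions and span ranges of them.
structures = [
    Structure("doc", 0, 4, {"id": "example-source"}),
    Structure("s", 0, 2),
    Structure("s", 2, 4),
]


def enclosing(structures, name, pos):
    """Return the first structure called `name` that covers `pos`, or None."""
    for s in structures:
        if s.name == name and s.start <= pos < s.end:
            return s
    return None


sentence = enclosing(structures, "s", 2)
document = enclosing(structures, "doc", 2)
```

Here positions[2]["tag"] is grammatical information about a single word, whereas sentence and document describe how the text around that word is divided.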

1.2 Corpus manager

The result of a corpus query is the so-called concordance list, which contains all corpus positions that match the given query. The concordance list is displayed in KWIC (Key Word(s) In Context) format: the searched words are shown with their contexts, one below the other. The concordance list is often called simply a concordance. The term KWIC is also used for the searched word or word sequence itself.
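A minimal Python sketch of a KWIC display follows. It assumes the simplest possible query, an exact match on a single word, and a fixed context size; the functions kwic and format_kwic and the sample text are invented for the example and are not part of Bonito or Manatee, whose queries are far more expressive.

```python
def kwic(words, keyword, context=3):
    """Return one (left, match, right) triple for every position
    whose word equals `keyword`."""
    lines = []
    for i, word in enumerate(words):
        if word == keyword:
            left = " ".join(words[max(0, i - context):i])
            right = " ".join(words[i + 1:i + 1 + context])
            lines.append((left, word, right))
    return lines


def format_kwic(lines):
    """Right-align the left contexts so the keywords form one column."""
    width = max((len(left) for left, _, _ in lines), default=0)
    return "\n".join(
        f"{left.rjust(width)}  <{kw}>  {right}" for left, kw, right in lines
    )


words = "the cat sat on the mat and the dog sat on the rug".split()
concordance = kwic(words, "sat", context=2)
print(format_kwic(concordance))
```

Aligning the keywords in a single column is what makes the contexts easy to compare line by line.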

