The DIALOG corpus contains recordings and transcripts of discussion programs broadcast on Czech television stations. Seven corpora are currently available: DIALOG 2.0, DIALOG 1.2, DIALOG 1.1, DIALOG 1.0, DIALOG 0.3, DIALOG 0.2 and DIALOG 0.1m. They differ in size, program selection, and method of morphological annotation and lemmatization.
Comprised of 150 recordings and transcripts as follows:
All corpora were morphologically annotated and lemmatized, i.e. each word has been enriched with morphological information and assigned its base form (lemma). The DIALOG 1.1, DIALOG 1.0, DIALOG 0.3 and DIALOG 0.2 corpora were annotated and lemmatized automatically. The DIALOG 0.1m corpus was annotated and lemmatized manually. Jan Hajič’s system was used in all cases (see tag structure).
All transcripts of the DIALOG 0.1m corpus are included in the DIALOG 0.2 corpus, but without their manual morphological annotation and lemmatization: in the DIALOG 0.2 corpus these transcripts were annotated and lemmatized automatically.
All transcripts of the DIALOG 0.2 corpus are included in the DIALOG 0.3 corpus. All transcripts of the DIALOG 0.3 corpus are included in the DIALOG 1.0 corpus. Not all transcripts of the DIALOG 1.0 corpus are included in the DIALOG 1.1 corpus. All transcripts of the DIALOG 1.2 corpus are included in the DIALOG 2.0 corpus.
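The inclusion relations stated above can be sketched as a small lookup table. This is only an illustration: the names `INCLUDED_IN` and `supersets` are hypothetical and not part of any DIALOG tooling, and the table records only the full inclusions the text states explicitly. DIALOG 1.0 to DIALOG 1.1 is omitted because that inclusion is only partial, and nothing is assumed about relations the text does not mention.

```python
# Full-inclusion statements taken from the text above:
# every transcript of the key version is also in the value version.
# Only transcripts carry over; annotation may differ (e.g. 0.1m is manual,
# 0.2 is automatic). 1.0 -> 1.1 is left out because it is only partial.
INCLUDED_IN = {
    "0.1m": "0.2",
    "0.2": "0.3",
    "0.3": "1.0",
    "1.2": "2.0",
}


def supersets(version):
    """List the versions guaranteed to contain all transcripts of `version`."""
    result = []
    while version in INCLUDED_IN:
        version = INCLUDED_IN[version]
        result.append(version)
    return result


print(supersets("0.1m"))  # ['0.2', '0.3', '1.0']
print(supersets("1.0"))   # []
```

Following the chain shows, for example, that a transcript from DIALOG 0.1m is guaranteed to appear in DIALOG 1.0, but that no such guarantee extends to DIALOG 1.1.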