History of the DIALOG corpus


“In essence, the word is a two-sided act. It is determined equally
by whose word it is and for whom it is meant. ...
The word is territory shared between the speaker and the interlocutor.”
(V. N. Vološinov, 1929)
“It is unwise to rely on a single corpus, however large or well designed
it might be: all corpora have in-built biases, and findings should therefore
be checked in different independent corpora.”
(M. Stubbs, 2000)


The material for the DIALOG corpus comes from an archive of recordings of television discussion programs and their transcripts that the Institute of the Czech Language of the Academy of Sciences of the Czech Republic has been gathering since late 1996. Světla Čmejrková, head of the Department of Stylistics and Text Linguistics, came up with the idea while working on an interdisciplinary grant project from the Grant Agency of the Czech Republic called “Dialogue in the World of People and Machines” (1996–2001, project code 405/96/K096).

When the project was complete, the idea arose of turning the archive of recordings and transcripts into an electronic corpus. This task was taken on by the team from the junior grant project “Spoken Czech in Czech Television Discussion Programs” (2003–2005, project code B9061304, Grant Agency of the Academy of Sciences of the Czech Republic): Světla Čmejrková, Lucie Jílková (project head 2003–2004), Petr Kaderka, Jana Klímová, Kamila Mrázková, Zdeňka Svobodová (project head 2004–2005) and Nino Peterek (Institute of Formal and Applied Linguistics at Charles University’s Faculty of Mathematics and Physics).

Světla Čmejrková proposed calling the future corpus DIALOG for two reasons: the name refers to the project “Dialogue in the World of People and Machines,” and thus to the beginnings of material collection, as well as to the project’s basic theoretical and methodological starting point, namely that dialogue is the basic existential speech form and that dialogism is the basic principle of semiosis (see Valentin Vološinov’s words above).

The project was based on the idea that linguistic work requires the analysis of a large and varied body of data and that it is wrong to depend on a single source of material (compare the words of Michael Stubbs above). The compilation of a television corpus as a special corpus of spoken Czech was thus associated with the expectation that it would provide insights into the current form of public spoken Czech and into the ways dialogue is conducted in the media.

In addition to a number of studies (see the Publications section), the project resulted in the compilation of the extensive DIALOG corpus (over 2 million words), a small sample of which was posted on the internet under the name DIALOG 0.1 in late 2005 and early 2006.

The junior grant project “Spoken Czech in Public Dialogues: DIALOG Corpus Development, Publication and Research” (2007–2009, project code KJB900610701) from the Grant Agency of the Academy of Sciences of the Czech Republic continued this work. The members of this project team are Martin Havlík, Eva Havlová, Petr Kaderka (project head), Jana Klímová, Patricie Kubáčková, Nino Peterek (author of the project’s software, Institute of Formal and Applied Linguistics at Charles University’s Faculty of Mathematics and Physics), and Zdeňka Svobodová.

From the outset, the new team sought to meet a basic methodological requirement: the study of spoken communication must not be limited to the analysis of transcripts; the audiovisual recording must remain the primary material for analysis. To meet this requirement, Nino Peterek created the search tool Dialogy.Org, which allows video recordings of the passages in question to be played (see Searching the corpus for more on search methods).

In 2008 the project team made public two multimedia versions of the DIALOG corpus: the DIALOG 0.1m corpus and the DIALOG 0.2 corpus (see Corpus structure for details).

Petr Kaderka