The Turkish National Corpus (TNC) is designed to be a balanced, large-scale (50 million words), general-purpose corpus of contemporary Turkish. Its construction has benefited from previous practices and efforts in corpus building. In this sense, TNC generally follows the framework of the British National Corpus, with adjustments to the corpus design made wherever needed. Throughout the process, different types of open-source software are used for specific tasks, and the resulting corpus is a free resource for non-commercial use.
TNC, with a size of 50 million words, is a balanced and representative corpus of contemporary Turkish. It consists of samples of textual data across a wide variety of genres, covering a period of 24 years (1990-2013). The written component consists of texts produced in different domains on various topics. Transcriptions of spoken data constitute 2% of TNC’s database and comprise spontaneous, everyday conversations and speeches collected in particular communicative settings. Across this 50-million-word collection, users will be able to restrict their queries by media, text sample, domain, derived text type, sex of author, type of author, text genre, and audience of the text. In TNC Version 3.0, users will be able to conduct queries in terms of the part of speech (POS) of words and in terms of suffixes. Moreover, they will be able to search for multiword units and to send queries using regular expressions.
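As an illustration of what a metadata-restricted regular-expression query amounts to, the following is a minimal Python sketch. It does not reproduce the TNC interface or its query syntax: the `Sample` class, its field names, the `query` function, and the three toy sentences are all hypothetical and invented for this example. The pattern `\w+l[ae]r` is a rough stand-in for a suffix query on the Turkish plural suffix -lar/-ler and would overmatch on real data.

```python
import re
from dataclasses import dataclass


@dataclass
class Sample:
    """A hypothetical text sample with a few metadata fields."""
    text: str
    medium: str
    domain: str
    genre: str


# Invented toy samples for illustration only; not taken from TNC.
SAMPLES = [
    Sample("kitaplar masada duruyor", "book", "imaginative", "fiction"),
    Sample("evler çok pahalı", "periodical", "social science", "news"),
    Sample("bugün hava güzel", "spoken", "leisure", "conversation"),
]


def query(samples, pattern, **restrictions):
    """Return (sample, word) pairs for words that fully match `pattern`.

    Only samples whose metadata equal every keyword restriction
    (for example medium="book") are searched.
    """
    regex = re.compile(pattern)
    hits = []
    for sample in samples:
        # Skip samples that fail any metadata restriction.
        if not all(getattr(sample, key) == value
                   for key, value in restrictions.items()):
            continue
        # Naive whitespace tokenization; a real corpus would be pre-tokenized.
        for word in sample.text.split():
            if regex.fullmatch(word):
                hits.append((sample, word))
    return hits


# Words ending in -lar/-ler anywhere in the toy collection.
plural_hits = query(SAMPLES, r"\w+l[ae]r")

# The same pattern, restricted to samples whose medium is "book".
book_hits = query(SAMPLES, r"\w+l[ae]r", medium="book")
```

In this sketch the restriction is applied before the pattern search, so the regular expression only runs over the subcorpus the user has selected. A POS query would work the same way, provided each token also carried a part-of-speech tag.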