çuvaşlar: The Lexicostatistics and Glottochronology of Turkic Languages

The Lexicostatistics and Glottochronology of Turkic Languages

Version 3.01

v.1(10-12/2009) (first online) > v.2 (11/2011) (the dendrogram rebuilt with new data; the method of glottochronological corrections removed, and the article greatly simplified) > v.3 (03/2012) (Swadesh-200 expanded to Swadesh-215, data recalculated; the program updated; the theoretical part moved to a separate article.)

Abstract

A classical lexicostatistical study of 15 Turkic languages has been conducted using 215-word Swadesh lists, originally prepared for Wiktionary.org and then expanded and verified for possible semantic errors and borrowings. Cognates were determined using the classical trained-linguist procedure; additionally, a special PHP program was written to assist in building the lexicostatistical matrix. The phylogenetic analysis of the Turkic languages, which had been performed as a separate work using phonological, grammatical and historical evidence, was thereafter adjusted by applying the lexicostatistical evidence obtained in this study. As a result, an uncalibrated dendrogram was built and the glottochronological dates of the principal bifurcation nodes were determined, after averaging the values for each subbranch and recalculating the local ad hoc Swadesh constants.

The present lexicostatistical study of Turkic languages uses the standard Swadesh-Lees approach with some minor methodological modifications. The theoretical basis for this approach is reviewed in detail in The Fundamentals of Lexicostatistics and Glottochronology (2012).

The wordlists

The research was initially based on the 200-word Swadesh lists of the Turkic languages prepared for Wiktionary.org by several independent editors on the web, including the author of this publication. The word lists were checked for errors and semantic inconsistencies several times, to the extent that this was physically possible without perfect simultaneous knowledge of a dozen languages. Notably, some lists were edited by people presumably fluent in a particular language, whereas others (Tuvan, Altay, Khakas, Kyrgyz) were composed and verified by the author using only meticulous dictionary mining. The verification mostly focused on semantics and does not necessarily cover orthography, which has changed in the Turkic languages several times over the course of the 20th century, so occasional misspellings are thought to be irrelevant.

The collected datasets initially had some inherent inconsistencies that accompany Swadesh's semastatistical (= lexicostatistical) approach: some words had too many synonyms; some synonyms had unknown semantic connotations, which is especially true of rare or remote languages; and some defects resulted from the Anglocentric semantics of the original list (such as separate entries for "float" and "swim", but the same entry for "hot weather" and "hot water", which are not necessarily expressed by the same word in other languages). Some of these drawbacks were known as far back as 1952; however, they were still present in the Swadesh 200-word list used as the Wiktionary standard. Being aware of these finer inconsistencies, we tried to achieve strict semantic stability by thoroughly verifying the exact meaning of most lexemes.

In 11/2011-02/2012, all the lists were verified again for redundant synonymy and exact meanings using standard dictionaries of the corresponding languages and then expanded to 215 entries.
Certain lexemes with too many synonyms, such as "some", "fight", "smell", "stab", "float" were excluded from the classical Swadesh-200 and substituted by other words (e.g. "war", "(elder) brother"). Some words, such as "person", "animal", "fruit (berry)", "flower", "sea", were excluded from the latest lists because they persistently contaminated the results with multiple real or supposed borrowings.
In 02/2012, "the classical Swadeshes" were expanded to include more words that can be considered as belonging to the basic vocabulary ("do", "begin (intr)", "end (intr)", "look for", "find", "understand", "wait", "house", "come out", "run", "boat", "swamp", "steppe/desert", "door", "wheel", "word", "voice", "finger", "lip", "nets", "tomorrow", "yesterday", "(animal) wool"). As a result, the list grew to about 215 entries and may still grow longer. In the beginning, the background reason for the 200-word limitation was the attempt to maintain the compatibility with Wiktionary, in the hope that other responsible and proficient speakers would add online materials that could later be used in this study. However, relying on others turned out to be a shaky approach, and considering that the lists composed by unknown authors still require painstaking manual rechecking and editing, whereas Wikipedia has its own rules often incompatible with the requirements in this study, it turned out to be more fruitful to suspend the compatibility with Wiktionary.
The final dataset was written in a doc-file in the Wiktionary format:
wiki_swadesh_turkic_colored_prepared_for_php_borrowings_excl_2012_rechecked.doc (original Swadesh lists with cognates designated with different colors, humanly readable)

All cognates in the doc-file were marked with different colors. The colors can be chosen with or without meaning. Meaningful coloring has been used since version 3 (02.2012), because it helps to identify potential innovations that are useful in rebuilding the genetic phylogeny. Numbering and coloring are, of course, arbitrary.
A (black) - apparently, Bulgaro-Turkic
Z (green) - apparently, Bulgaro-Turkic or Turkic
F (red) - apparently, Altay-Sayan
B (blue) - apparently, Oghuz
E (magenta) - apparently, Seljuk
P (dark yellow green) - purely Yakutic or other outcast isolexemes
O (dark red) - presumably, Altay-Sayan-Great-Steppe
I (dark purple) - Great-Steppe
C (cyan, light blue) - Chuvash
H (dark blue green) - Oghuz to Great-Steppe isolexemes
G (dark blue) - other internal isolexemes
J (dark green) - other internal isolexemes

K (50% gray) - borrowing
L (25% gray) - borrowing
M (yellow) - borrowing

The program application
Additionally, a short PHP program called Swadesh Comparator 2.0 was created (2009, updated in 2012) to facilitate the calculations.
The program doesn't search for cognates (that was still done manually in a Word file); all it does is count the percentages from an input written in a format such as *A,A;B,A;A,B*A,C,A;D;A,D*, where commas separate synonyms, semicolons separate languages, and asterisks separate the words in the list (lexemes). Consequently, in the example above, A, B, C, D are four different cognates in a 2-word, 3-language wordlist. In theory, Swadesh Comparator can be used for lexicostatistics in any language family whose cognates are coded in this way.
The first "cognate" after a semicolon is supposed to be a language name, e.g. |English|. It is followed by an English translation and explanation of the word usage.
The program-compatible format can be obtained through relatively simple search-and-replace manipulations in a doc file.
The program does not run as an exe file; it must be executed with a PHP interpreter. It should probably be run on a local server, since a remote web server might hang because of the relatively high calculation load, though that depends on many circumstances.

In 2009, the program counted a hit if it saw just one match among several synonyms, which led to exaggerated similarity in the results. Since 2012, the program was significantly expanded by adding the synonymy counting module that works as described in the article on methodological procedures, The Fundamentals of Lexicostatistics and Glottochronology.
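As a rough illustration of the simplified input format and of the 2009 counting rule described above, the following Python sketch parses a coded string and counts a hit whenever two languages share at least one cognate code for a word. It is not the author's PHP program: the function names parse_wordlist and shared_percentage are hypothetical, and both the language-name fields and the 2012 synonymy-counting module are deliberately left out.

```python
def parse_wordlist(encoded):
    """Split a coded string into words -> languages -> sets of cognate codes."""
    # Asterisks separate words, semicolons separate languages,
    # commas separate synonyms within one language.
    words = [w for w in encoded.split("*") if w]
    return [[set(lang.split(",")) for lang in word.split(";")] for word in words]


def shared_percentage(table, i, j):
    """Percentage of words for which languages i and j share a cognate code.

    Assumption: this is the simple 2009 rule, where a single match among
    the synonyms counts as a hit (which tends to exaggerate similarity).
    """
    hits = sum(1 for word in table if word[i] & word[j])
    return 100.0 * hits / len(table)


# The 2-word, 3-language example from the text.
table = parse_wordlist("*A,A;B,A;A,B*A,C,A;D;A,D*")
print(shared_percentage(table, 0, 1))  # 50.0
print(shared_percentage(table, 0, 2))  # 100.0
```

Storing each language's synonyms as a set makes the "at least one match" rule a plain set intersection; a weighted synonymy count would need a different scoring function.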

For more details on the program and its input format see:
Swadesh_Comparator.txt (the php-program; rename .txt to .php, do internal adjustments if necessary, and run with a php-interpreter installed on your computer)
swadesh_input.txt (Swadesh lists already prepared for running, should be in the same folder as the php file)
wiki_swadesh_turkic_colored_prepared_for_php_borrowings_excl_2012_rechecked_subst_with_letters.doc (Swadesh lists with cognates as letters, provided merely as an example)

The lexicostatistical matrix with borrowings included

After much preliminary work with the verification of exact semantics and removing avoidable synonymy, the lexicostatistical matrix was obtained by running Swadesh Comparator 2.0 with borrowings included. The inclusion of Persian, Arabic, Russian and Mongolic borrowings may be necessary when the linguistic data should be presented as they are, providing a more accurate picture of real languages as they are used today.

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings included, raw data
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	44%
	Tuvan	44.1%	50.7%
	Khakas	49.4%	56.2%	68.7%
	Standard Altay	47.8%	50.6%	65.7%	75%
	Kyrgyz	53.3%	54.7%	59.3%	69%	72.8%
	Kazakh	52.9%	54.8%	57.5%	66.4%	68%	90.6%
	Uzbek	52.9%	49.5%	51.5%	60.8%	60.8%	76.5%	75.7%
	Uyghur	51.7%	51.2%	54.8%	61.3%	65%	78.2%	75.6%	82.7%
	Karachay	52.1%	54.8%	54.3%	63.4%	62.7%	74.1%	73.7%	68%	70.5%
	Bashkir	52.5%	53.9%	55%	64.9%	66.4%	78.5%	78%	68.7%	71.2%	72.8%
	Tatar	53.9%	55%	55.9%	66.6%	67.9%	79.8%	79.5%	69.9%	71.6%	74.7%	94%
	Turkmen	49.8%	48.9%	50.7%	58.6%	56.5%	69.1%	69.2%	70.9%	68.1%	64.7%	69.2%	65.2%
	Azeri	49%	45.1%	47.1%	53.1%	54.6%	62.5%	62.1%	62.8%	62.3%	61.7%	60.8%	63.6%	72.4%
	Turkish	47.4%	44.4%	45.1%	49.8%	50.4%	59.3%	58.6%	58.7%	59.7%	58.1%	56.5%	59.3%	66.8%	79.7%

Subsequently, this table was used to build a wave diagram of the Turkic languages based on their lexical proximity, which supposedly correlates with their actual mutual intelligibility.

A wave model of the Turkic languages

The lexicostatistical matrix with borrowings excluded
The further research is aimed at obtaining borrowing-free data that would, as Starostin advises, provide "pure" evidence for solving genetic relatedness.
Obvious borrowings from Arabic, Persian, Mongolic, Russian were now excluded. Sakha was checked for Evenk and Selkup borrowings but almost none were found, and 3 presumed Yeniseian borrowings were removed ("fly", "bird", "fear"). A couple of North Caucasian borrowings were found in Karachay-Balkar. Chuvash was checked for Tatar loanwords.
All the loanwords were denoted with gray or yellow color.
After running the program, the following table with raw data was obtained:

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	51.9%
	Tuvan	49.3%	57%
	Khakas	52.8%	61.3%	71.9%
	Standard Altay	50.9%	55.9%	69.3%	75.6%
	Kyrgyz	57.9%	59.6%	63.3%	70.3%	74.6%
	Kazakh	58.2%	59.4%	61.6%	68.1%	69.9%	92%
	Uzbek	61.1%	57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur	59.2%	59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay	57.5%	60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir	58.3%	59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar	59.4%	60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.6%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri	55.6%	51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish	54.9%	52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

As the next step, after a comprehensive linguistic, historical and geographical analysis, the following dendrogram of the Turkic languages was built using the lexicostatistical evidence obtained from the table above. All the complicated, corroborative work is fully described in The Internal Classification and Migration of Turkic Languages.

A preliminary tree of Turkic languages (without dates)


As you can see, this initial dendrogram was not built directly upon the data from the lexicostatistical matrix; rather, some of the data from the lexicostatistical analysis were used to build it. Lexicostatistical facts obtained in the present study were used in drawing conclusions about the tree topology only when these facts seemed relatively obvious, explicit and unambiguous, so that the lexicostatistical data alone seemed sufficient to support the corresponding conclusions. In other cases, additional phonological, morphological, and historical evidence was involved.
The version number in the dendrogram reflects the history of changes: many topologies were explored to build the one that ultimately corresponded to all of the evidence available.
In other words, the procedure was based on the standard approach: do a thorough, comprehensive analysis first, and deal with the glottochronological dating later. When this rule is broken, a researcher relying entirely on lexicostatistics or other superficial statistical approaches runs the risk of obtaining overlapping and entangled branches and placing them on the wrong stem.
After all the preliminary work on internal classification had been finished, and the undated and uncalibrated dendrogram with presumably correct topology had been constructed, the lexicostatistical matrix was colored to mark the positions of related branches to help with subsequent statistical averaging.

Adjusting & Averaging
As the next step, to cancel out the unsystematic errors resulting mostly from fluctuations in glottochronological rates, we should average our results over each pair of closest branches.

Adjusting Proto-Bulgaro-Turkic
We will start by adjusting Chuvash as an example. The effect of comparing many languages with Chuvash should cancel out any small statistical fluctuations in their rate of change.
This averaging should be carried out on the closest-fork basis: first we average over the closest languages, then we average these averaged results over the next closest branches, and so forth. If we fail to do this and instead average over all the languages Chuvash is related to simultaneously, we obtain a slightly different value. Even though the difference between the two approaches may seem statistically insignificant at first, it may turn out to be much larger than expected and lead to unpredictable results when the logarithm is applied, so it is best to do all the calculations in the logically correct way.

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	51.9%
	Tuvan	49.3%	57%
	Khakas	51.9%	61.3%	71.9%
	Standard Altay	51.9%	55.9%	69.3%	75.6%
	Kyrgyz	58.1%	59.6%	63.3%	70.3%	74.6%
	Kazakh	58.1%	59.4%	61.6%	68.1%	69.9%	92%
	Uzbek	60.2%	57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur	60.2%	59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay	57.5%	60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir	58.9%	59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar	58.9%	60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.6%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri	55.2%	51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish	55.2%	52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	51.9%
	Tuvan	50.6%	57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz	59.2%	59.6%	63.3%	70.3%	74.6%
	Kazakh		59.4%	61.6%	68.1%	69.9%	92%
	Uzbek		57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur		59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay	58.2%	60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar		60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.4%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish		52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	51.9%
	Tuvan	50.6%	57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz	58.7%	59.6%	63.3%	70.3%	74.6%
	Kazakh		59.4%	61.6%	68.1%	69.9%	92%
	Uzbek		57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur		59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar		60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.4%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish		52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	53.7%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		59.6%	63.3%	70.3%	74.6%
	Kazakh		59.4%	61.6%	68.1%	69.9%	92%
	Uzbek		57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur		59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar		60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.4%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish		52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	53.7%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		59.6%	63.3%	70.3%	74.6%
	Kazakh		59.4%	61.6%	68.1%	69.9%	92%
	Uzbek		57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur		59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar		60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen	55.4%	55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish		52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		59.6%	63.3%	70.3%	74.6%
	Kazakh		59.4%	61.6%	68.1%	69.9%	92%
	Uzbek		57.8%	58.2%	65.3%	66.3%	82.9%	82.8%
	Uyghur		59%	61.7%	65.7%	70.2%	83.8%	81.9%	86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		59.4%	59.9%	67.1%	69%	82%	79.9%	76.1%	78.5%	77.4%
	Tatar		60.7%	60.2%	68.2%	70.1%	83.9%	82.1%	78%	79.6%	79.2%	94.9%
	Turkmen		55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.8%	51.8%	56.4%	58.4%	66.9%	67.8%	70%	68.8%	66.9%	66%	68.4%	78.2%
	Turkish		52%	50%	53.8%	54.4%	64.9%	64.8%	67.2%	66.7%	64.2%	62.8%	65.6%	73.6%	86%

Ultimately, we have 54.5% for Chuvash-to-any-other-language using the fork-by-fork averaging, which differs slightly from the 55.9% that we would have obtained with the simplified Chuvash-to-everything-else-at-the-same-time averaging. The difference of 1.4% may later lead to noticeable temporal deviations.
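The two averaging schemes can be compared with a short Python sketch. The Chuvash percentages are taken from the borrowings-excluded matrix; the nesting in TREE is an assumption read off the sequence of tables above (including the three-way step that joins Sakha, Altay-Sayan and Great-Steppe), and the name fork_average is hypothetical. Without intermediate rounding the fork-by-fork value comes out at about 54.56%, close to the 54.5% obtained above, where every step was rounded to one decimal.

```python
# Chuvash column of the borrowings-excluded matrix (percent shared cognates).
CHUVASH = {
    "Sakha": 51.9, "Tuvan": 49.3, "Khakas": 52.8, "Standard Altay": 50.9,
    "Kyrgyz": 57.9, "Kazakh": 58.2, "Uzbek": 61.1, "Uyghur": 59.2,
    "Karachay": 57.5, "Bashkir": 58.3, "Tatar": 59.4,
    "Turkmen": 55.6, "Azeri": 55.6, "Turkish": 54.9,
}

# Assumed nesting of the branches, inferred from the averaging tables.
TREE = (
    (
        "Sakha",
        ("Tuvan", ("Khakas", "Standard Altay")),
        (
            (("Kyrgyz", "Kazakh"), ("Uzbek", "Uyghur")),
            ("Karachay", ("Bashkir", "Tatar")),
        ),
    ),
    ("Turkmen", ("Azeri", "Turkish")),
)


def fork_average(node, values):
    """Average the closest branches first, then move up the tree."""
    if isinstance(node, str):
        return values[node]
    branches = [fork_average(child, values) for child in node]
    return sum(branches) / len(branches)


fork = fork_average(TREE, CHUVASH)
flat = sum(CHUVASH.values()) / len(CHUVASH)
print(f"fork-by-fork: {fork:.2f}%  flat: {flat:.2f}%")
```

The flat average gives every language the same weight, so the large Great-Steppe group dominates; the recursive version gives every branch at a fork the same weight regardless of how many languages it contains.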
At the current stage, this value of 54.5% has only been averaged and adjusted for the non-Bulgaric Turkic languages, not for Chuvash itself, because of its stand-alone position within Bulgaric and the complete lack of surviving sibling languages with which it could be compared. Therefore, we have to accept this figure at face value at this point and assume that Chuvash per se is neither too innovative nor too archaic.
How reasonable is this latter assumption? Tatar, located in the same area, is generally rather archaic (as is evident from its close proximity to Kyrgyz and Kazakh); consequently, we may assume that Chuvash, which developed against a similar historical and geographic background, cannot be too innovative. On the other hand, Chuvash was in contact with the Tatar superstratum and the Finno-Ugric adstratum, which may have resulted in a "creolization" process and strong innovative changes. Indeed, it should be noted that Chuvash has a few Kazan Tatar borrowings even in the basic vocabulary (these were mostly tracked down and excluded from the cognate list), so these innovative features are supposed to cancel out any potential archaism of Chuvash.
Furthermore, the scanty historical evidence for the existence of other Bulgaric languages suggests that Chuvash had siblings with about the same level of phonological and lexical transformation, which means that Chuvash was once part of a bigger family, probably with a glottochronologically normal separation rate.
Therefore, altogether, we may expect the deviation of Chuvash to be rather close to zero, considering that so many factors were involved, and our assumption that Chuvash is neither too archaic nor too innovative presently seems plausible.
All in all, this leaves our calculations for Chuvash still partly subject to the Bergsland-Vogt objection, though to a lesser extent, since we have statistically corrected values on at least one side and found no immediately obvious geographical or historical reasons for Chuvash to be strongly archaic or innovative. Therefore, we may now conclude with some 80% certainty that Chuvash is most likely a glottochronologically normal language.

Adjusting Proto-Turkic
By the same token, we should average the numbers for other Turkic languages.
Again, we will start from averaging the closest internal branches using a two-by-two walk for each bifurcation:

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		59.5%	62.6%	69.2%	72.2%
	Kazakh		59.5%	62.6%	69.2%	72.2%	92%
	Uzbek		58.4%	60%	65.5%	68.3%	83.3%	82.3%
	Uyghur		58.4%	60%	65.5%	68.3%	83.3%	82.3%	86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	77.8%	78.3%	74.6%	77.1%
	Bashkir		60.1%	60.5%	67.7%	69.6%	83.0%	81%	77.1%	79.1%	78.3%
	Tatar		60.1%	60.5%	67.7%	69.6%	83.0%	81%	77.1%	79.1%	78.3%	94.9%
	Turkmen		55%	54.7%	61.2%	59.5%	71.2%	71.9%	75.9%	71.7%	69.2%	71.9%	69.8%
	Azeri		51.9%	50.9%	55.1%	56.4%	65.9%	66.3%	68.6%	67.8%	65.6%	64.4%	67%	75.9%
	Turkish		51.9%	50.9%	55.1%	56.4%	65.9%	66.3%	68.6%	67.8%	65.6%	64.4%	67%	75.9%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		59.5%	62.6%	69.2%	72.2%
	Kazakh		59.5%	62.6%	69.2%	72.2%	92%
	Uzbek		58.4%	60%	65.5%	68.3%	82.8%
	Uyghur		58.4%	60%	65.5%	68.3%	82.8%		86.3%
	Karachay		60.8%	58.7%	65.1%	65.2%	78.1%		78.5%
	Bashkir		60.1%	60.5%	67.7%	69.6%	82%		78.1%		78.3%
	Tatar		60.1%	60.5%	67.7%	69.6%	82%		78.1%		78.3%	94.9%
	Turkmen		55%	54.7%	61.2%	59.5%	71.6%		73.8%		69.2%	70.9%
	Azeri		51.9%	50.9%	55.1%	56.4%	66.1%		68.2%		65.6%	65.7%		75.9%
	Turkish		51.9%	50.9%	55.1%	56.4%	66.1%		68.2%		65.6%	65.7%		75.9%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		61.3%	71.9%
	Standard Altay		55.9%	69.3%	75.6%
	Kyrgyz		58.9%	61.3%	67.4%	70.3%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur						82.8%		86.3%
	Karachay		60.5%	59.6%	66.4%	67.4%	80.1%		78.3%
	Bashkir										78.3%
	Tatar										78.3%	94.9%
	Turkmen		55%	54.7%	61.2%	59.5%	72.7%				70.1%
	Azeri		51.9%	50.9%	55.1%	56.4%	67.2%				65.7%			75.9%
	Turkish		51.9%	50.9%	55.1%	56.4%	67.2%				65.7%			75.9%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		58.6%	70.6%
	Standard Altay		58.6%	70.6%	75.6%
	Kyrgyz		59.7%	60.5%	67.9%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur						82.8%		86.3%
	Karachay						79.2%
	Bashkir										78.3%
	Tatar										78.3%	94.9%
	Turkmen		55%	54.7%	60.4%		72.7%				70.1%
	Azeri		51.9%	50.9%	55.8%		67.2%				65.7%			75.9%
	Turkish		51.9%	50.9%	55.8%		67.2%				65.7%			75.9%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57%
	Khakas		58.6%	70.6%
	Standard Altay		58.6%	70.6%	75.6%
	Kyrgyz		59.7%	60.5%	67.9%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur						82.8%		86.3%
	Karachay						79.2%
	Bashkir										78.3%
	Tatar										78.3%	94.9%
	Turkmen		53.5%	52.8%	58.1%		70%				67.9%
	Azeri													75.9%
	Turkish													75.9%	86%

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57.8%
	Khakas			70.6%
	Standard Altay			70.6%	75.6%
	Kyrgyz		59.7%	64.2%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur						82.8%		86.3%
	Karachay						79.2%
	Bashkir										78.3%
	Tatar										78.3%	94.9%
	Turkmen		53.5%	55.5%			69%
	Azeri													75.9%
	Turkish													75.9%	86%
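Each step in the sequence of tables above replaces two sibling branches with a single merged branch whose percentage against every other language is the mean of the two siblings' percentages. A minimal Python sketch of one such step follows; the function name merge_siblings and the dictionary layout are illustrative assumptions, and the sample values are the Kyrgyz, Kazakh, Khakas and Standard Altay entries of the borrowings-excluded matrix.

```python
def merge_siblings(sim, a, b, merged):
    """Return a new table in which sibling branches a and b form one branch.

    `sim` maps frozenset({x, y}) to the shared percentage of x and y.
    The a-b value itself is dropped here: it is the percentage of the
    node being collapsed and is kept aside for dating that node.
    """
    others = {lang for pair in sim for lang in pair} - {a, b}
    result = {
        pair: value for pair, value in sim.items()
        if a not in pair and b not in pair
    }
    for other in others:
        result[frozenset({merged, other})] = (
            sim[frozenset({a, other})] + sim[frozenset({b, other})]
        ) / 2
    return result


sim = {
    frozenset({"Kyrgyz", "Kazakh"}): 92.0,
    frozenset({"Kyrgyz", "Khakas"}): 70.3,
    frozenset({"Kazakh", "Khakas"}): 68.1,
    frozenset({"Kyrgyz", "Standard Altay"}): 74.6,
    frozenset({"Kazakh", "Standard Altay"}): 69.9,
    frozenset({"Khakas", "Standard Altay"}): 75.6,
}
step = merge_siblings(sim, "Kyrgyz", "Kazakh", "Kyrgyz-Kazakh")
for pair, value in step.items():
    print(sorted(pair), round(value, 2))
```

Repeating this step from the youngest forks towards the root reproduces the two-by-two walk, apart from the rounding applied at each stage in the tables.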

Then, we finally get to the earliest separated Turkic branches, such as Oghuz-Seljuk and Sakha.

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		57.8%
	Khakas			70.6%
	Standard Altay			70.6%	75.6%
	Kyrgyz		59.7%	64.2%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur						82.8%		86.3%
	Karachay						79.2%
	Bashkir										78.3%
	Tatar										78.3%	94.9%
	Turkmen		53.5%	62.3%
	Azeri													75.9%
	Turkish													75.9%	86%

At this point, we run into certain difficulties with adjusting the Yakutic subtaxon. These are most likely due to the fact that we were unable to exclude all the borrowings from the "odd words" list in Sakha. These words may come from an unknown source, such as an unknown Yeniseian or Tungusic adstrate. Moreover, judging by (1) a nearly complete lack of historical siblings (except Dolgan) and poor dialectal differentiation, (2) the enormous geographical separation, (3) the presence of borrowings from Mongolic and (4) the genetic bottleneck that points to some kind of catastrophic event in the past, we have reason to believe that Sakha may be a strongly innovative or glottochronologically aberrant language. Therefore, we cannot exclude the possibility that Sakha may be generally younger than it looks.

Consequently, the best thing we can do is exclude Sakha from any further calculations. Knowing that the values for Oghuz-to-others and Altay-Sayan-to-Great-Steppe come from averaging over a great many language pairs, we must conclude that these values are already quite statistically robust as they are, and there is no need to "spoil" them with the data from the Yakutic branch. Thus, we will just leave the Yakutic subgroup aside, simply expanding the previously obtained values to the left:

The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
		Chuvash	Sakha	Tuvan	Khakas	Standard Altay	Kyrgyz	Kazakh	Uzbek	Uyghur	Karachay	Bashkir	Tatar	Turkmen	Azeri
	Sakha	54.5%
	Tuvan		58% (?)
	Khakas			70.6%
	Standard Altay				75.6%
	Kyrgyz			64.2%
	Kazakh						92%
	Uzbek						82.8%
	Uyghur								86.3%
	Karachay						79.2%
	Bashkir										78.3%
	Tatar											94.9%
	Turkmen			62.3%
	Azeri													75.9%
	Turkish														86%

In most other respects, the very fact that we have averaged over a great many languages, both archaic and innovative, should guard against nonsystematic glottochronological errors and fluctuations. Therefore, most other values in the table are supposed to be rather precise, robust and no longer subject to the Bergsland-Vogt objection, especially for the deepest glottochronological separation nodes, where most of the averaging was done.

Establishing glottochronological calibration points
Now that we have fluctuation-resistant values, we can proceed to a more or less correct glottochronology.
An important correction to the Swadesh-Lees method is the following: we will not use the standard global Swadesh-Lees constant (81% for Swadesh-200, 86% for Swadesh-100), but will apply a local (ad hoc) calibration instead. This approach has already been used by other authors in many studies of other language families.
To proceed any further, we must determine the calibration points, which means historical periods within the tree of the Turkic languages when each particular splitting was actually attested.
From the classical formula for the negative exponential decay, we have:

t = - k ln C

Therefore,

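k = - t / ln C
Consequently, we can now determine k for each calibration point, with t expressed in millennia elapsed since the attested separation (the present being taken as c. 2000 AD):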

Event	Date	Lexical %	Calculation (- t / ln C)	k
Turkmen-Seljuk separation before c. 980 (certain)	950 AD	75.9%	- 1.05 / ln 0.759	3.8
Uzbek-Uyghur separation after the division of the Chagatai Ulus (1370)	1370 AD	86.3%	- 0.63 / ln 0.863	4.3
Turkish-Azeri separation after the Battle of Manzikert (1071) and then, particularly, the collapse of the Seljuk Empire (1194), and then the Mongol invasion (1260)	c. 1100-1260 AD	86%	- 0.8 / ln 0.86	5.3
Kyrgyz-Kazakh separation	1450 AD	92%	- 0.55 / ln 0.92	6.6
Kyrgyz and Tatar mentioned as separate tribes as early as 730 AD, which supposedly marks the separation between Kimak tribes (Tatar, Kimak, Kypchak) and Kyrgyz some time before that date.	c. 700 AD	79.2%	- 1.3 / ln 0.792	5.6
Average Local Constant for Turkic Languages				~5.1

After averaging over all the calibration points available, we obtain a reasonably robust local glottochronological constant; therefore, we may tentatively conclude that
t ~ - 5.1 ln C
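The calibration arithmetic can be rechecked with a few lines of Python. This is only a sketch: the time spans in millennia are the numerators used in the table above, the short labels are shorthand for the events listed there, and the function names local_constant and separation_time are hypothetical. The computed constants come out close to the tabulated ones, and their mean is about 5.1.

```python
import math

# (label, millennia since the attested split, shared fraction C)
CALIBRATION_POINTS = [
    ("Turkmen-Seljuk", 1.05, 0.759),
    ("Uzbek-Uyghur", 0.63, 0.863),
    ("Turkish-Azeri", 0.80, 0.860),
    ("Kyrgyz-Kazakh", 0.55, 0.920),
    ("Kyrgyz-Tatar", 1.30, 0.792),
]


def local_constant(t, c):
    """Solve t = -k ln C for k (t in millennia, C as a fraction)."""
    return -t / math.log(c)


def separation_time(c, k):
    """Time depth in millennia for a shared fraction C and a constant k."""
    return -k * math.log(c)


constants = {name: local_constant(t, c) for name, t, c in CALIBRATION_POINTS}
k_mean = sum(constants.values()) / len(constants)

for name, k in constants.items():
    print(f"{name}: k = {k:.2f}")
print(f"mean k = {k_mean:.2f}")
```

With the averaged constant, an undated node is dated by calling, for example, separation_time(0.545, 5.1) for the Chuvash node; the result is in millennia before the reference date.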

Adjusting the dendrogram bifurcation points along the temporal axis
Ultimately, using the newly-calculated dates for each undated node, we can adjust the glottochronological dendrogram that finally looks as follows:

The glottochronological dendrogram of the Turkic languages


This dendrogram is now adjusted along the temporal axis and contains all the glottochronological values obtained in this study.

References

The Fundamentals of Lexicostatistics and Glottochronology (2009, 2012);
The Internal Classification and Migrations of Turkic Languages (2009, 2012);
en.wiktionary.org/wiki/Appendix:Swadesh_lists_for_Turkic_languages (2007-2011);
M. Dyachok, Glottochronologiya tyurkskikh yazykov (The Glottochronology of the Turkic Languages), Materials of the 2nd Scientific Conference, Novosibirsk (2001);
Anna Dybo, Khronologiya tyurkskikh yazykov i lingvisticheskiye kontakty rannikh tyurkov (The Chronology of the Turkic Languages and the Linguistic Contacts of the Early Turks) (2006);
O.A. Mudrak, Ob utochnenii klassifikatsii tyurkskikh yazykov s pomosch'yu morphologicheskoy lingvostatistiki (On the clarification of the Turkic languages classification by means of morphological linguostatistics) // Sravnitelno-istoricheskaya grammatika tyurkskikh yazykov. Regionalnyye rekonstruktsii. Moscow (2002).

10/2009 - 10/2011 - 01-03/2012


Sunday, 7 October 2012
