The
Lexicostatistics and Glottochronology of Turkic Languages
Version 3.01
v.1 (10-12/2009): first online
v.2 (11/2011): the dendrogram rebuilt with new data; the method of glottochronological corrections removed; the article greatly simplified
v.3 (03/2012): Swadesh-200 expanded to Swadesh-215, data recalculated; the program updated; the theoretical part moved to a separate article
Abstract
A classical lexicostatistical study of 15 Turkic languages has been conducted using the 215-word Swadesh lists, originally prepared for Wiktionary.org and then expanded and verified for possible semantic errors and borrowings. Cognates were determined using the classical trained-linguist procedure; in addition, a special PHP program was written to assist in building the lexicostatistical matrix. The phylogenetic analysis of the Turkic languages, which had been performed as a separate work using phonological, grammatical and historical evidence, was then adjusted by applying the lexicostatistical evidence obtained in this study. As a result, an uncalibrated dendrogram was built and the glottochronological dates of the principal bifurcation nodes were determined, after averaging the values for each subbranch and recalculating the local ad hoc Swadesh constants.
The
present lexicostatistical study of Turkic languages uses the standard
Swadesh-Lees
approach with some minor methodological modifications. The theoretical
basis for
this approach is reviewed in detail in The
Fundamentals of Lexicostatistics and Glottochronology
(2012).
The
wordlists
The research was initially based on the
200-word Swadesh lists of the Turkic languages prepared
for Wiktionary.org
by several independent editors on the web, including the author of this
publication.
The word lists were checked for errors and semantic inconsistencies several times, to the extent that this was physically possible without perfect simultaneous knowledge of a dozen languages. Notably, some lists were edited by people presumably fluent in a particular language, whereas others (Tuvan, Altay, Khakas, Kyrgyz) were composed and verified by the author through meticulous dictionary mining alone. The verification focused mostly on semantics and does not necessarily cover orthography, which has changed in Turkic languages several times over the course of the 20th century, so occasional misspellings are considered irrelevant.
The collected datasets initially had some of the inherent inconsistencies that accompany Swadesh's semastatistical (=lexicostatistical) approach: some words had too many synonyms; some synonyms had unknown semantic connotations, which is especially true of rare or remote languages; some defects resulted from the Anglocentric semantics of the original list (such as different entries for "float" and "swim", but the same entry for "hot weather" and "hot water", which are not necessarily the same word in other languages). Some of these drawbacks were known as far back as 1952; however, they were still present in the Swadesh 200-word list used as a Wiktionary standard. Being aware of these finer inconsistencies, we tried to achieve strict semantic stability by thoroughly verifying the exact meaning of most lexemes. In 11/2011-02/2012, all the lists were verified again for redundant synonymy and exact meanings using standard dictionaries of the corresponding languages and then expanded to 215 entries.
Certain lexemes with too many synonyms, such as "some", "fight", "smell", "stab", "float", were excluded from the classical Swadesh-200 and replaced with other words (e.g. "war", "(elder) brother").
Some
words, such as "person", "animal", "fruit (berry)",
"flower", "sea", were excluded from the latest lists because
they persistently contaminated the results with multiple real or
supposed borrowings.
In
02/2012, "the classical Swadeshes" were expanded to include more words
that can be considered as belonging to the basic vocabulary ("do",
"begin
(intr)", "end (intr)", "look for", "find",
"understand", "wait", "house", "come out",
"run", "boat", "swamp", "steppe/desert",
"door", "wheel", "word", "voice", "finger",
"lip", "nets", "tomorrow", "yesterday",
"(animal) wool"). As a result, the list grew to about 215 entries and
may still grow longer. Initially, the reason for the 200-word limit was an attempt to maintain compatibility with Wiktionary, in the hope that other responsible and proficient speakers would add online materials that could later be used in this study. However, relying on others turned out to be a shaky approach. Considering that lists composed by unknown authors still require painstaking manual rechecking and editing, and that Wikipedia has its own rules, often incompatible with the requirements of this study, it proved more fruitful to suspend compatibility with Wiktionary.
The
final dataset was written in a doc-file in the Wiktionary format:
wiki_swadesh_turkic_colored_prepared_for_php_borrowings_excl_2012_rechecked.doc
(original Swadesh lists with cognates designated with different
colors,
humanly readable)
All cognates in the doc-file were marked with different colors. The colors can be chosen with or without meaning. Meaningful coloring has been used since version 3 (02.2012), because it helps to identify potential innovations that are useful in rebuilding the genetic phylogeny. The numbering and coloring are, of course, arbitrary.
A (black) - apparently, Bulgaro-Turkic
Z (green) - apparently, Bulgaro-Turkic or Turkic
F (red) - apparently, Altay-Sayan
B (blue) - apparently, Oghuz
E (magenta) - apparently, Seljuk
P (dark yellow green) - purely Yakutic or other outcast isolexemes
O (dark red) - presumably, Altay-Sayan-Great-Steppe
I (dark purple) - Great-Steppe
C (cyan, light blue) - Chuvash
H (dark blue green) - Oghuz to Great-Steppe isolexemes
G (dark blue) - other internal isolexemes
J (dark green) - other internal isolexemes
K (50% gray) - borrowing
L (25% gray) - borrowing
M (yellow) - borrowing
The
program application
Additionally,
a short PHP program called Swadesh Comparator 2.0 was created
(2009, updated
in 2012) to facilitate the calculations.
The program doesn't search for cognates (that was still done manually in a Word file); all it does is assist in computing the percentages from any input written in a format such as *A,A;B,A;A,B*A,C,A;D;A,D*, where commas separate synonyms, semicolons separate languages, and asterisks separate the words in the list (lexemes). In the example above, A, B, C and D are thus four different cognates in a 2-word, 3-language wordlist. In theory, Swadesh Comparator can be used for doing lexicostatistics for any language family whose cognates are coded in this way.
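As a rough illustration of the input format just described, the following Python sketch (not the original PHP program) splits such a string into words, languages and synonym sets. The function name `parse_swadesh_input` is hypothetical, and the sketch assumes bare cognate codes only; it ignores the language-name and translation fields that the real input file also carries.

```python
def parse_swadesh_input(text):
    """Split '*A,A;B,A;A,B*A,C,A;D;A,D*' into words -> languages -> cognate sets.

    Assumed layout (taken from the description in the text):
      '*' separates words (lexemes),
      ';' separates languages within a word,
      ',' separates synonyms within a language.
    """
    table = []
    for word in text.strip().split("*"):
        if not word.strip():
            continue  # skip the empty pieces before the first and after the last '*'
        languages = []
        for entry in word.split(";"):
            # Repeated codes (e.g. 'A,A') collapse into a single cognate class.
            codes = frozenset(c.strip() for c in entry.split(",") if c.strip())
            languages.append(codes)
        table.append(languages)

    # Every word is expected to list the same number of languages.
    if len({len(langs) for langs in table}) > 1:
        raise ValueError("words list different numbers of languages")
    return table


example = "*A,A;B,A;A,B*A,C,A;D;A,D*"
table = parse_swadesh_input(example)
all_codes = set().union(*(codes for word in table for codes in word))
print(len(table), "words,", len(table[0]), "languages, cognates:", sorted(all_codes))
```

Storing each language's synonyms as a set makes the later comparison of two languages a simple set intersection.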
The
first "cognate" after a semicolon is supposed to be a language name,
e.g. |English|. It is followed by an English translation and explanation
of the
word usage.
The
program-compatible format can be obtained through relatively simple
search-and-replace
manipulations in a doc file.
The program is not a standalone exe file; it must be run with a PHP interpreter. It should probably be run on a local server (a remote web server might hang because of the relatively high calculation load, though that depends on many circumstances).
In 2009, the program counted a hit if it saw just one match among several synonyms, which led to exaggerated similarity in the results. In 2012, the program was significantly expanded by adding a synonymy-counting module that works as described in the article on methodological procedures, The Fundamentals of Lexicostatistics and Glottochronology.
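The older 2009 counting rule mentioned above (a hit whenever at least one synonym matches) can be sketched in Python as follows. This is only an illustration of that simple rule, not the 2012 synonymy-counting module, whose details are given in the separate article. The function name `any_match_percentages`, the language labels, and the decision to skip words with no entry in one of the two languages are assumptions made for the example.

```python
from itertools import combinations


def any_match_percentages(table, names):
    """Pairwise share of words in which two languages have a common cognate class.

    `table` is a list of words; each word is a list with one set of cognate
    codes per language, in the same order as `names`.
    """
    results = {}
    for i, j in combinations(range(len(names)), 2):
        compared = 0
        hits = 0
        for word in table:
            a, b = word[i], word[j]
            if not a or not b:
                continue  # assumption: a word missing in either language is not counted
            compared += 1
            if a & b:  # at least one shared cognate code counts as a hit
                hits += 1
        results[(names[i], names[j])] = 100.0 * hits / compared if compared else None
    return results


# The 2-word, 3-language example from the text, already split into sets.
table = [
    [frozenset("A"), frozenset("AB"), frozenset("AB")],
    [frozenset("AC"), frozenset("D"), frozenset("AD")],
]
names = ["Lang1", "Lang2", "Lang3"]
scores = any_match_percentages(table, names)
for pair, value in scores.items():
    print(pair, value)
```

Because a single shared synonym is enough for a hit, languages with long synonym lists score higher under this rule, which is the exaggeration the text refers to.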
For more details on the program and its input format see:
Swadesh_Comparator.txt (the PHP program; rename .txt to .php, make internal adjustments if necessary, and run with a PHP interpreter installed on your computer)
swadesh_input.txt (Swadesh lists already prepared for running; should be in the same folder as the PHP file)
wiki_swadesh_turkic_colored_prepared_for_php_borrowings_excl_2012_rechecked_subst_with_letters.doc (Swadesh lists with cognates as letters, provided merely as an example)
The lexicostatistical matrix with borrowings included
After much preliminary work on verifying the exact semantics and removing avoidable synonymy, the lexicostatistical matrix was obtained by running Swadesh Comparator 2.0 with borrowings included. Including Persian, Arabic, Russian and Mongolic borrowings may be necessary when the linguistic data need to be presented as they are, giving a more accurate picture of the languages as they are actually used today.
The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings included, raw data
| | Chuvash | Sakha | Tuvan | Khakas | Standard Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri |
| Sakha | 44% | | | | | | | | | | | | | |
| Tuvan | 44.1% | 50.7% | | | | | | | | | | | | |
| Khakas | 49.4% | 56.2% | 68.7% | | | | | | | | | | | |
| Standard Altay | 47.8% | 50.6% | 65.7% | 75% | | | | | | | | | | |
| Kyrgyz | 53.3% | 54.7% | 59.3% | 69% | 72.8% | | | | | | | | | |
| Kazakh | 52.9% | 54.8% | 57.5% | 66.4% | 68% | 90.6% | | | | | | | | |
| Uzbek | 52.9% | 49.5% | 51.5% | 60.8% | 60.8% | 76.5% | 75.7% | | | | | | | |
| Uyghur | 51.7% | 51.2% | 54.8% | 61.3% | 65% | 78.2% | 75.6% | 82.7% | | | | | | |
| Karachay | 52.1% | 54.8% | 54.3% | 63.4% | 62.7% | 74.1% | 73.7% | 68% | 70.5% | | | | | |
| Bashkir | 52.5% | 53.9% | 55% | 64.9% | 66.4% | 78.5% | 78% | 68.7% | 71.2% | 72.8% | | | | |
| Tatar | 53.9% | 55% | 55.9% | 66.6% | 67.9% | 79.8% | 79.5% | 69.9% | 71.6% | 74.7% | 94% | | | |
| Turkmen | 49.8% | 48.9% | 50.7% | 58.6% | 56.5% | 69.1% | 69.2% | 70.9% | 68.1% | 64.7% | 69.2% | 65.2% | | |
| Azeri | 49% | 45.1% | 47.1% | 53.1% | 54.6% | 62.5% | 62.1% | 62.8% | 62.3% | 61.7% | 60.8% | 63.6% | 72.4% | |
| Turkish | 47.4% | 44.4% | 45.1% | 49.8% | 50.4% | 59.3% | 58.6% | 58.7% | 59.7% | 58.1% | 56.5% | 59.3% | 66.8% | 79.7% |
Consequently, this table was used to build a wave diagram of the Turkic languages, based on their lexical proximity, which supposedly correlates with their actual mutual intelligibility.
The
lexicostatistical matrix with borrowings excluded
Further research was aimed at obtaining borrowing-free data that would, as Starostin advises, provide "pure" evidence for resolving genetic relatedness.
Obvious borrowings from Arabic, Persian, Mongolic and Russian were therefore excluded. Sakha was checked for Evenk and Selkup borrowings, but almost none were found; 3 presumed Yeniseian borrowings were removed ("fly", "bird", "fear"). A couple of North Caucasian borrowings were found in Karachay-Balkar. Chuvash was checked for Tatar loanwords. All the loanwords were marked in gray or yellow.
After
running the program, the following table with raw data was obtained:
The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
| | Chuvash | Sakha | Tuvan | Khakas | Standard Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri |
| Sakha | 51.9% | | | | | | | | | | | | | |
| Tuvan | 49.3% | 57% | | | | | | | | | | | | |
| Khakas | 52.8% | 61.3% | 71.9% | | | | | | | | | | | |
| Standard Altay | 50.9% | 55.9% | 69.3% | 75.6% | | | | | | | | | | |
| Kyrgyz | 57.9% | 59.6% | 63.3% | 70.3% | 74.6% | | | | | | | | | |
| Kazakh | 58.2% | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
| Uzbek | 61.1% | 57.8% | 58.2% | 65.3% | 66.3% | 82.9% | 82.8% | | | | | | | |
| Uyghur | 59.2% | 59% | 61.7% | 65.7% | 70.2% | 83.8% | 81.9% | 86.3% | | | | | | |
| Karachay | 57.5% | 60.8% | 58.7% | 65.1% | 65.2% | 77.8% | 78.3% | 74.6% | 77.1% | | | | | |
| Bashkir | 58.3% | 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
| Tatar | 59.4% | 60.7% | 60.2% | 68.2% | 70.1% | 83.9% | 82.1% | 78% | 79.6% | 79.2% | 94.9% | | | |
| Turkmen | 55.6% | 55% | 54.7% | 61.2% | 59.5% | 71.2% | 71.9% | 75.9% | 71.7% | 69.2% | 71.9% | 69.8% | | |
| Azeri | 55.6% | 51.8% | 51.8% | 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8% | 66.9% | 66% | 68.4% | 78.2% | |
| Turkish | 54.9% | 52% | 50% | 53.8% | 54.4% | 64.9% | 64.8% | 67.2% | 66.7% | 64.2% | 62.8% | 65.6% | 73.6% | 86% |
As
the next step, after a comprehensive linguistic, historical and
geographical analysis,
the following dendrogram of the Turkic languages was built using the
lexicostatistical
evidence obtained from the table above. All the complicated,
corroborative work
is fully described in The
Internal Classification and Migration of Turkic Languages.
As you can see, this initial dendrogram was not built directly from the data in the lexicostatistical matrix; rather, some of the data from the lexicostatistical analysis were used to build it. Lexicostatistical facts obtained in the present study were used in drawing conclusions about the tree topology only where these facts seemed relatively obvious, explicit and unambiguous, so that the lexicostatistical data alone seemed sufficient to support the corresponding conclusions. In other cases, additional phonological, morphological, and historical evidence was involved.
The
version number in the dendrogram reflects the history of changes: many
topologies were explored to build the one that ultimately corresponded
to all
of the evidence available.
In other words, the procedure was based on the standard approach: do a thorough, comprehensive analysis first, and deal with the glottochronological dating later. When this rule is broken, a researcher relying entirely on lexicostatistics or other kinds of superficial statistical approaches runs the risk of obtaining overlapping, entangled branches and placing them on the wrong stem.
After
all the preliminary work on internal classification had been finished,
and the
undated and uncalibrated dendrogram with presumably correct topology had
been
constructed, the lexicostatistical matrix was colored to mark the
positions of
related branches to help with subsequent statistical averaging.
Adjusting
& Averaging
As
the next step, to cancel out the unsystematic errors resulting mostly
from fluctuations
in glottochronological rates, we should average our results over each
pair
of closest branches.
Adjusting
Proto-Bulgaro-Turkic
We
will start by adjusting Chuvash as an example. The effect of comparing
many languages
with Chuvash should cancel out any small statistical fluctuations in
their rate
of change.
This averaging should be carried out on the closest-fork basis: first we average over the closest languages, then we average these averaged results over the next closest, more distant branches, and so forth. If we fail to do this and instead average over all the languages Chuvash is related to simultaneously, we'll obtain a slightly different value. Even though the difference between the two approaches may seem statistically insignificant at first, it may turn out to be much larger than expected and lead to unpredictable results when the logarithm is applied, so it is best to do all the calculations in the logically correct way.
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
51.9%
| | | | | | | | | | | | | |
Tuvan |
49.3%
| 57%
| | | | | | | | | | | | |
Khakas |
51.9%
| 61.3%
| 71.9%
| | | | | | | | | | | |
Standard
Altay | 55.9%
| 69.3%
| 75.6%
| | | | | | | | | | |
Kyrgyz |
58.1%
| 59.6%
| 63.3%
| 70.3%
| 74.6%
| | | | | | | | | |
Kazakh | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
Uzbek |
60.2%
| 57.8%
| 58.2%
| 65.3%
| 66.3%
| 82.9%
| 82.8% | | | | | | | |
Uyghur | 59%
| 61.7%
| 65.7%
| 70.2%
| 83.8%
| 81.9% | 86.3%
| | | | | | |
Karachay |
57.5%
| 60.8%
| 58.7%
| 65.1%
| 65.2%
| 77.8%
| 78.3% | 74.6% | 77.1% | | | | | |
Bashkir |
58.9%
| 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
Tatar | 60.7%
| 60.2%
| 68.2%
| 70.1%
| 83.9%
| 82.1% | 78% | 79.6% | 79.2%
| 94.9%
| | | |
Turkmen |
55.6%
| 55%
| 54.7%
| 61.2%
| 59.5%
| 71.2%
| 71.9% | 75.9%
| 71.7%
| 69.2%
| 71.9% | 69.8%
| | |
Azeri |
55.2%
| 51.8% | 51.8%
| 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8%.
| 66.9% |
66% | 68.4% | 78.2%
| |
Turkish | 52%
| 50%
| 53.8%
| 54.4%
| 64.9%
| 64.8% | 67.2%
| 66.7%
| 64.2%
| 62.8% | 65.6%
| 73.6%
| 86% |
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
51.9%
| | | | | | | | | | | | | |
Tuvan |
50.6%
| 57%
| | | | | | | | | | | | |
Khakas | 61.3%
| 71.9%
| | | | | | | | | | | |
Standard
Altay | 55.9%
| 69.3%
| 75.6%
| | | | | | | | | | |
Kyrgyz |
59.2%
| 59.6%
| 63.3%
| 70.3%
| 74.6%
| | | | | | | | | |
Kazakh | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
Uzbek | 57.8%
| 58.2%
| 65.3%
| 66.3%
| 82.9%
| 82.8% | | | | | | | |
Uyghur | 59%
| 61.7%
| 65.7%
| 70.2%
| 83.8%
| 81.9% | 86.3%
| | | | | | |
Karachay |
58.2%
| 60.8%
| 58.7%
| 65.1%
| 65.2%
| 77.8%
| 78.3% | 74.6% | 77.1% | | | | | |
Bashkir | 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
Tatar | 60.7%
| 60.2%
| 68.2%
| 70.1%
| 83.9%
| 82.1% | 78% | 79.6% | 79.2%
| 94.9%
| | | |
Turkmen |
55.4%
| 55%
| 54.7%
| 61.2%
| 59.5%
| 71.2%
| 71.9% | 75.9%
| 71.7%
| 69.2%
| 71.9% | 69.8%
| | |
Azeri | 51.8% | 51.8%
| 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8%.
| 66.9% |
66% | 68.4% | 78.2%
| |
Turkish | 52%
| 50%
| 53.8%
| 54.4%
| 64.9%
| 64.8% | 67.2%
| 66.7%
| 64.2%
| 62.8% | 65.6%
| 73.6%
| 86% |
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
51.9%
| | | | | | | | | | | | | |
Tuvan |
50.6%
| 57%
| | | | | | | | | | | | |
Khakas | 61.3%
| 71.9%
| | | | | | | | | | | |
Standard
Altay | 55.9%
| 69.3%
| 75.6%
| | | | | | | | | | |
Kyrgyz |
58.7%
| 59.6%
| 63.3%
| 70.3%
| 74.6%
| | | | | | | | | |
Kazakh | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
Uzbek | 57.8%
| 58.2%
| 65.3%
| 66.3%
| 82.9%
| 82.8% | | | | | | | |
Uyghur | 59%
| 61.7%
| 65.7%
| 70.2%
| 83.8%
| 81.9% | 86.3%
| | | | | | |
Karachay | 60.8%
| 58.7%
| 65.1%
| 65.2%
| 77.8%
| 78.3% | 74.6% | 77.1% | | | | | |
Bashkir | 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
Tatar | 60.7%
| 60.2%
| 68.2%
| 70.1%
| 83.9%
| 82.1% | 78% | 79.6% | 79.2%
| 94.9%
| | | |
Turkmen |
55.4%
| 55%
| 54.7%
| 61.2%
| 59.5%
| 71.2%
| 71.9% | 75.9%
| 71.7%
| 69.2%
| 71.9% | 69.8%
| | |
Azeri | 51.8% | 51.8%
| 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8%.
| 66.9% |
66% | 68.4% | 78.2%
| |
Turkish | 52%
| 50%
| 53.8%
| 54.4%
| 64.9%
| 64.8% | 67.2%
| 66.7%
| 64.2%
| 62.8% | 65.6%
| 73.6%
| 86% |
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
53.7%
| | | | | | | | | | | | | |
Tuvan | 57%
| | | | | | | | | | | | |
Khakas | 61.3%
| 71.9%
| | | | | | | | | | | |
Standard
Altay | 55.9%
| 69.3%
| 75.6%
| | | | | | | | | | |
Kyrgyz | 59.6%
| 63.3%
| 70.3%
| 74.6%
| | | | | | | | | |
Kazakh | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
Uzbek | 57.8%
| 58.2%
| 65.3%
| 66.3%
| 82.9%
| 82.8% | | | | | | | |
Uyghur | 59%
| 61.7%
| 65.7%
| 70.2%
| 83.8%
| 81.9% | 86.3%
| | | | | | |
Karachay | 60.8%
| 58.7%
| 65.1%
| 65.2%
| 77.8%
| 78.3% | 74.6% | 77.1% | | | | | |
Bashkir | 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
Tatar | 60.7%
| 60.2%
| 68.2%
| 70.1%
| 83.9%
| 82.1% | 78% | 79.6% | 79.2%
| 94.9%
| | | |
Turkmen |
55.4%
| 55%
| 54.7%
| 61.2%
| 59.5%
| 71.2%
| 71.9% | 75.9%
| 71.7%
| 69.2%
| 71.9% | 69.8%
| | |
Azeri | 51.8% | 51.8%
| 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8%.
| 66.9% |
66% | 68.4% | 78.2%
| |
Turkish | 52%
| 50%
| 53.8%
| 54.4%
| 64.9%
| 64.8% | 67.2%
| 66.7%
| 64.2%
| 62.8% | 65.6%
| 73.6%
| 86% |
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan | 57%
| | | | | | | | | | | | |
Khakas | 61.3%
| 71.9%
| | | | | | | | | | | |
Standard
Altay | 55.9%
| 69.3%
| 75.6%
| | | | | | | | | | |
Kyrgyz | 59.6%
| 63.3%
| 70.3%
| 74.6%
| | | | | | | | | |
Kazakh | 59.4% | 61.6% | 68.1% | 69.9% | 92% | | | | | | | | |
Uzbek | 57.8%
| 58.2%
| 65.3%
| 66.3%
| 82.9%
| 82.8% | | | | | | | |
Uyghur | 59%
| 61.7%
| 65.7%
| 70.2%
| 83.8%
| 81.9% | 86.3%
| | | | | | |
Karachay | 60.8%
| 58.7%
| 65.1%
| 65.2%
| 77.8%
| 78.3% | 74.6% | 77.1% | | | | | |
Bashkir | 59.4% | 59.9% | 67.1% | 69% | 82% | 79.9% | 76.1% | 78.5% | 77.4% | | | | |
Tatar | 60.7%
| 60.2%
| 68.2%
| 70.1%
| 83.9%
| 82.1% | 78% | 79.6% | 79.2%
| 94.9%
| | | |
Turkmen | 55%
| 54.7%
| 61.2%
| 59.5%
| 71.2%
| 71.9% | 75.9%
| 71.7%
| 69.2%
| 71.9% | 69.8%
| | |
Azeri | 51.8% | 51.8%
| 56.4% | 58.4% | 66.9% | 67.8% | 70% | 68.8%.
| 66.9% |
66% | 68.4% | 78.2%
| |
Turkish | 52%
| 50%
| 53.8%
| 54.4%
| 64.9%
| 64.8% | 67.2%
| 66.7%
| 64.2%
| 62.8% | 65.6%
| 73.6%
| 86% |
|
Ultimately, we have 54.5% for Chuvash-to-any-other-language using the fork-by-fork averaging, which differs slightly from the 55.9% that we would have obtained with the simplified Chuvash-to-everything-else-at-the-same-time averaging. The difference of 1.4 percentage points may later lead to noticeable temporal deviations.
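The two ways of averaging can be compared with a short Python sketch. The Chuvash column is copied from the borrowings-excluded matrix; the nesting of `tree` is an assumption read off the sequence of tables in this section (it is not an independent statement of the dendrogram), and the names `fork_average`, `tree` and `chuvash` are hypothetical.

```python
from statistics import mean

# Chuvash-to-language percentages from the borrowings-excluded matrix.
chuvash = {
    "Sakha": 51.9, "Tuvan": 49.3, "Khakas": 52.8, "Standard Altay": 50.9,
    "Kyrgyz": 57.9, "Kazakh": 58.2, "Uzbek": 61.1, "Uyghur": 59.2,
    "Karachay": 57.5, "Bashkir": 58.3, "Tatar": 59.4,
    "Turkmen": 55.6, "Azeri": 55.6, "Turkish": 54.9,
}

# Assumed grouping, inferred from the order in which cells are merged above.
tree = (
    (
        "Sakha",
        ("Tuvan", ("Khakas", "Standard Altay")),
        (
            (("Kyrgyz", "Kazakh"), ("Uzbek", "Uyghur")),
            ("Karachay", ("Bashkir", "Tatar")),
        ),
    ),
    ("Turkmen", ("Azeri", "Turkish")),
)


def fork_average(node, values):
    """Average the closest branches first, then average those averages."""
    if isinstance(node, str):
        return values[node]
    return mean([fork_average(child, values) for child in node])


fork = fork_average(tree, chuvash)
flat = mean(list(chuvash.values()))
print(f"fork-by-fork: {fork:.2f}%  all-at-once: {flat:.2f}%")
```

The sketch keeps full precision at every fork, so its fork-by-fork result comes out a few hundredths of a percent above the 54.5% obtained in the tables, where each intermediate value is rounded to one decimal; the all-at-once mean agrees with the 55.9% quoted above.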
At the current stage, this value of 54.5% has been averaged and adjusted only on the side of the non-Bulgaric Turkic languages, not for Chuvash itself, because of its stand-alone position within Bulgaric and the complete lack of surviving sibling languages with which it could be compared. Therefore, we have to accept this figure at face value for now and assume that Chuvash per se is neither too innovative nor too archaic.
How reasonable is this latter assumption? Tatar, located in the same area, is generally rather archaic (as is evident from its close proximity to Kyrgyz and Kazakh); consequently, we may assume that Chuvash, which has existed against a similar historical and geographic background, cannot be too innovative. On the other hand, Chuvash was in contact with the Tatar superstratum and the Finno-Ugric adstratum, which may have resulted in a "creolization" process and strong innovative changes. Indeed, it should be noted that Chuvash has a few Kazan Tatar borrowings even in the basic vocabulary (these were mostly tracked down and excluded from the cognate list), so these innovative features are supposed to cancel out any potential archaism of Chuvash.
Furthermore, the scanty historical evidence for the existence of other Bulgaric languages indicates that Chuvash had siblings with about the same level of phonological and lexical transformation, which means that Chuvash was once part of a bigger family, probably with a rather glottochronologically normal separation rate.
Therefore, altogether, we may expect the deviation of Chuvash to be rather close to zero, considering that so many factors were involved, and our assumption that Chuvash is neither too archaic nor too innovative presently seems plausible.
All in all, our calculations for Chuvash remain partly subject to the Bergsland-Vogt objection, though to a lesser extent, since we still have statistically corrected values on at least one side, and we found no immediately obvious geographical or historical reasons for Chuvash to be strongly archaic or innovative. Therefore, we may now conclude with some 80% certainty that Chuvash is most likely a rather glottochronologically normal language.
Adjusting
Proto-Turkic
By
the same token, we should average the numbers for other Turkic
languages.
Again,
we will start from averaging the closest internal branches using a
two-by-two
walk for each bifurcation:
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57%
| | | | | | | | | | | | |
Khakas |
61.3%
|
71.9%
| | | | | | | | | | | |
Standard
Altay |
55.9%
|
69.3%
|
75.6%
| | | | | | | | | | |
Kyrgyz |
59.5%
|
62.6%
|
69.2%
|
72.2%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
58.4%
|
60%
|
65.5%
|
68.3%
|
83.3%
|
82.3%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
60.8%
|
58.7%
|
65.1%
|
65.2%
|
77.8%
|
78.3%
|
74.6%
|
77.1%
| | | | | |
Bashkir |
60.1%
|
60.5%
|
67.7%
|
69.6%
|
83.0%
|
81%
|
77.1%
|
79.1%
|
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
55%
|
54.7%
|
61.2%
|
59.5%
|
71.2%
|
71.9%
|
75.9%
|
71.7%
|
69.2%
|
71.9%
|
69.8%
| | |
Azeri |
51.9%
|
50.9%
|
55.1%
|
56.4%
|
65.9%
|
66.3%
|
68.6%
|
67.8%
|
65.6%
|
64.4%
|
67%
|
75.9%
| |
Turkish |
86%
|
|
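One step of this two-by-two walk can be sketched in Python as follows: two sibling languages are replaced by a single merged entry whose value against every other language is the mean of the two siblings' values. The helper names `symmetric` and `merge_siblings` are hypothetical, plain arithmetic means are assumed, and the four-language excerpt is taken from the borrowings-excluded matrix.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts lookup from {(lang_a, lang_b): percent}."""
    matrix = {}
    for (a, b), value in pairs.items():
        matrix.setdefault(a, {})[b] = value
        matrix.setdefault(b, {})[a] = value
    return matrix


def merge_siblings(matrix, a, b, merged_name):
    """Return a new matrix in which siblings a and b are averaged into one entry."""
    others = [lang for lang in matrix if lang not in (a, b)]
    # Copy the part of the matrix that does not involve the two siblings.
    merged = {
        x: {y: v for y, v in matrix[x].items() if y not in (a, b)}
        for x in others
    }
    merged[merged_name] = {}
    for x in others:
        average = (matrix[a][x] + matrix[b][x]) / 2
        merged[merged_name][x] = average
        merged[x][merged_name] = average
    return merged


# Excerpt of the borrowings-excluded matrix (percent of shared cognates).
pairs = {
    ("Khakas", "Standard Altay"): 75.6,
    ("Kyrgyz", "Khakas"): 70.3,
    ("Kyrgyz", "Standard Altay"): 74.6,
    ("Kazakh", "Khakas"): 68.1,
    ("Kazakh", "Standard Altay"): 69.9,
    ("Kyrgyz", "Kazakh"): 92.0,
}
step = merge_siblings(symmetric(pairs), "Kyrgyz", "Kazakh", "Kyrgyz-Kazakh")
for lang, value in step["Kyrgyz-Kazakh"].items():
    print(lang, round(value, 2))
```

For this excerpt the merged Kyrgyz-Kazakh entry comes out at 69.2 against Khakas and 72.25 against Standard Altay, in line with the 69.2% and 72.2% shown in the table above. Repeating the same step for each remaining pair of closest branches gives the successive tables of the walk.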
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57%
| | | | | | | | | | | | |
Khakas |
61.3%
|
71.9%
| | | | | | | | | | | |
Standard
Altay |
55.9%
|
69.3%
|
75.6%
| | | | | | | | | | |
Kyrgyz |
59.5%
|
62.6%
|
69.2%
|
72.2%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
58.4%
|
60%
|
65.5%
|
68.3%
|
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
60.8%
|
58.7%
|
65.1%
|
65.2%
|
78.1%
|
78.5%
| | | | | |
Bashkir |
60.1%
|
60.5%
|
67.7%
|
69.6%
|
82%
|
78.1%
|
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
55%
|
54.7%
|
61.2%
|
59.5%
|
71.6%
|
73.8%
|
69.2%
|
70.9%
| | |
Azeri |
51.9%
|
50.9%
|
55.1%
|
56.4%
|
66.1%
|
68.2%
|
65.6%
|
65.7%
|
75.9%
| |
Turkish |
86%
|
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57%
| | | | | | | | | | | | |
Khakas |
61.3%
|
71.9%
| | | | | | | | | | | |
Standard
Altay |
55.9%
|
69.3%
|
75.6%
| | | | | | | | | | |
Kyrgyz |
58.9%
|
61.3%
|
67.4%
|
70.3%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
60.5%
|
59.6%
|
66.4%
|
67.4%
|
80.1%
|
78.3%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
55%
|
54.7%
|
61.2%
|
59.5%
|
72.7%
|
70.1%
| | |
Azeri |
51.9%
|
50.9%
|
55.1%
|
56.4%
|
67.2%
|
65.7%
|
75.9%
| |
Turkish |
86%
|
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57%
| | | | | | | | | | | | |
Khakas |
58.6%
|
70.6%
| | | | | | | | | | | |
Standard
Altay |
75.6%
| | | | | | | | | | |
Kyrgyz |
59.7%
|
60.5%
|
67.9%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
79.2%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
55%
|
54.7%
|
60.4%
|
72.7%
|
70.1%
| | |
Azeri |
51.9%
|
50.9%
|
55.8%
|
67.2%
|
65.7%
|
75.9%
| |
Turkish |
86%
|
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57%
| | | | | | | | | | | | |
Khakas |
58.6%
|
70.6%
| | | | | | | | | | | |
Standard
Altay |
75.6%
| | | | | | | | | | |
Kyrgyz |
59.7%
|
60.5%
|
67.9%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
79.2%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
53.5%
|
52.8%
|
58.1%
|
70%
|
67.9%
| | |
Azeri |
75.9%
| |
Turkish |
86%
|
|
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57.8%
| | | | | | | | | | | | |
Khakas |
70.6%
| | | | | | | | | | | |
Standard
Altay |
75.6%
| | | | | | | | | | |
Kyrgyz |
59.7%
|
64.2%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
79.2%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
53.5%
|
55.5%
|
69%
| | |
Azeri |
75.9%
| |
Turkish |
86%
|
|
Then, we finally get to the earliest separated Turkic branches, such as Oghuz-Seljuk and Sakha.
The
Lexicostatistical Matrix of Turkic Languages, Swadesh-215
(02.2012), borrowings
excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
57.8%
| | | | | | | | | | | | |
Khakas |
70.6%
| | | | | | | | | | | |
Standard
Altay |
75.6%
| | | | | | | | | | |
Kyrgyz |
59.7%
|
64.2%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
79.2%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
53.5%
|
62.3%
| | |
Azeri |
75.9%
| |
Turkish |
86%
|
|
At this point, we run into certain difficulties with adjusting the Yakutic subtaxon. These are most likely due to the fact that we were unable to exclude all the borrowings from the "odd words" list in Sakha. These words may come from an unknown source, such as an unknown Yeniseian or Tungusic adstrate. Moreover, judging by (1) a nearly complete lack of historical siblings (except Dolgan) and poor dialectal differentiation, (2) the enormous geographical separation, (3) the presence of borrowings from Mongolic and (4) the genetic bottleneck that is evidence of some kind of catastrophic event in the past, we have reasons to believe that Sakha may be a strongly innovative or glottochronologically aberrant language. Therefore, we cannot exclude the possibility that Sakha may be generally younger than it looks. Consequently, the best thing we can do is exclude Sakha from any further calculations. Knowing that the values for Oghuz-to-others and Altay-Sayan-to-Great-Steppe come from averaging over a great many language pairs, we must conclude that these values are already quite statistically robust as they are, and there is no need to "spoil" them with the data from the Yakutic branch. Thus, we will just leave the Yakutic subgroup aside, simply expanding the previously obtained values to the left:
The Lexicostatistical Matrix of Turkic Languages, Swadesh-215 (02.2012), borrowings excluded
|
|
| Chuvash | Sakha | Tuvan | Khakas | Standard
Altay | Kyrgyz | Kazakh | Uzbek | Uyghur | Karachay | Bashkir | Tatar | Turkmen | Azeri | |
Sakha |
54.5%
| | | | | | | | | | | | | |
Tuvan |
58%
(?)
| | | | | | | | | | | | |
Khakas |
70.6%
| | | | | | | | | | | |
Standard
Altay |
75.6%
| | | | | | | | | | |
Kyrgyz |
64.2%
| | | | | | | | | |
Kazakh |
92%
| | | | | | | | |
Uzbek |
82.8%
| | | | | | | |
Uyghur |
86.3%
| | | | | | |
Karachay |
79.2%
| | | | | |
Bashkir |
78.3%
| | | | |
Tatar |
94.9%
| | | |
Turkmen |
62.3%
| | |
Azeri |
75.9%
| |
Turkish |
86%
|
|
In most other respects, the very fact that we have averaged over a great
many languages, both archaic and innovative, should guard against
nonsystematic glottochronological errors and fluctuations. Most other
values in the table should therefore be fairly precise, stable, and no
longer subject to the Bergsland-Vogt objection, especially at the deepest
glottochronological separation nodes, where most of the averaging was done.
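The averaging step referred to above can be sketched as follows. This is only an illustration of collapsing all cross-group language pairs into one mean value; the language labels and percentages are made-up placeholders, not figures from this study, and the function name is hypothetical.

```python
from itertools import product
from statistics import mean

# Hypothetical pairwise cognate shares (as fractions).
# Placeholder numbers only, NOT values from the study's matrix.
pairwise = {
    ("A1", "B1"): 0.62,
    ("A1", "B2"): 0.66,
    ("A2", "B1"): 0.60,
    ("A2", "B2"): 0.68,
}


def subgroup_average(group_a, group_b, scores):
    """Mean cognate share over every cross-group language pair."""
    values = []
    for a, b in product(group_a, group_b):
        # The matrix is symmetric, so accept either key order.
        values.append(scores[(a, b)] if (a, b) in scores else scores[(b, a)])
    return mean(values)


avg = subgroup_average(["A1", "A2"], ["B1", "B2"], pairwise)
print(f"{avg:.3f}")
```

An unusually archaic or innovative language shifts only its own pairs, so its effect on the subgroup mean shrinks as the number of averaged pairs grows.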
Establishing
glottochronological calibration points
Now that we have fluctuation-resistant values, we can carry out a more or
less reliable glottochronological dating.
One important correction to the Swadesh-Lees method is the following: we
will not use the standard global Swadesh-Lees constant (81% for
Swadesh-200, 86% for Swadesh-100), but will instead apply a local (ad hoc)
calibration (gauging). This approach has already been used by other authors
in many studies of other language families.
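For comparison, the global retention rates can be restated as a coefficient k in a relation of the form t = -k ln C. The sketch below assumes the textbook two-language form of the Swadesh-Lees equation, t = ln C / (2 ln r), with r the retention rate per millennium; the companion article may formulate it differently, and the function names are illustrative.

```python
import math


def swadesh_lees_age(cognate_share, retention_per_millennium):
    """Textbook two-language form: t = ln C / (2 ln r), in millennia."""
    return math.log(cognate_share) / (2.0 * math.log(retention_per_millennium))


def equivalent_k(retention_per_millennium):
    """Coefficient k for which t = -k ln C reproduces the form above."""
    return -1.0 / (2.0 * math.log(retention_per_millennium))


k_200 = equivalent_k(0.81)  # global rate quoted for Swadesh-200
k_100 = equivalent_k(0.86)  # global rate quoted for Swadesh-100
print(round(k_200, 2), round(k_100, 2))
```

Under that assumption, the two global rates correspond to k of roughly 2.4 and 3.3, which gives a scale against which a locally calibrated constant can be compared.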
To proceed any further, we must determine the calibration points, that is,
the historical periods within the tree of the Turkic languages at which
each particular split is actually attested.
From the classical formula for negative exponential decay, we have:
t = -k ln C
where t is the time elapsed since the separation and C is the share of cognates retained. Therefore,
k = -t / ln C
Consequently, we can now determine k for each calibration point:
| Event | Date | Lexical % | -t / ln C | k |
| Turkmen-Seljuk separation before c. 980 (certain) | 950 AD | 75.9% | -1.05 / ln 0.759 | 3.8 |
| Uzbek-Uyghur separation after the division of the Chagatai Ulus (1370) | 1370 AD | 86.3% | -0.63 / ln 0.863 | 4.2 |
| Turkish-Azeri separation after the Battle of Manzikert (1071) and then, particularly, the collapse of the Seljuk Empire (1194), and then the Mongol invasion (1260) | c. 1100-1260 AD | 86% | -0.8 / ln 0.86 | 5.3 |
| Kyrgyz-Kazakh separation | 1450 AD | 92% | -0.55 / ln 0.92 | 6.6 |
| Kyrgyz and Tatar mentioned as separate tribes as early as 730 AD, which supposedly marks the separation between Kimak tribes (Tatar, Kimak, Kypchak) and Kyrgyz some time before that date. | c. 700 AD | 79.2% | -1.3 / ln 0.792 | 5.6 |
| Average Local Constant for Turkic Languages | | | | ~5.1 |
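The arithmetic in the calibration table can be rechecked with a short script. This is a minimal sketch, assuming the time values are millennia counted back from about 2000 AD (which is what the table's figures imply, e.g. 950 AD corresponds to 1.05); the variable and function names are illustrative.

```python
import math

# (label, time since separation in millennia, retained cognate share),
# copied from the calibration table.
CALIBRATION_POINTS = [
    ("Turkmen-Seljuk", 1.05, 0.759),
    ("Uzbek-Uyghur", 0.63, 0.863),
    ("Turkish-Azeri", 0.80, 0.860),
    ("Kyrgyz-Kazakh", 0.55, 0.920),
    ("Kyrgyz-Tatar", 1.30, 0.792),
]


def local_constant(t_millennia, cognate_share):
    """k = -t / ln C for one calibration point."""
    return -t_millennia / math.log(cognate_share)


constants = {name: local_constant(t, c) for name, t, c in CALIBRATION_POINTS}
average_k = sum(constants.values()) / len(constants)

for name, k in constants.items():
    print(f"{name}: k = {k:.2f}")
print(f"average k = {average_k:.2f}")
```

The recomputed values may differ from the table in the last digit because the table rounds each k before averaging; the mean still comes out at about 5.1.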
Averaging over all the available calibration points gives a reasonably
robust local glottochronological constant, so we may tentatively conclude
that
t ≈ -5.1 ln C
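With the fitted relation, any cognate share from the matrix can be turned into an approximate age. The sketch below assumes t is in millennia counted back from about 2000 AD, as the calibration table implies; the function names and the reference year are illustrative choices, not part of the original method.

```python
import math

K_TURKIC = 5.1  # averaged local constant from the calibration table


def separation_age(cognate_share, k=K_TURKIC):
    """Millennia since separation, t = -k ln C."""
    if not 0.0 < cognate_share <= 1.0:
        raise ValueError("cognate share must be in (0, 1]")
    return -k * math.log(cognate_share)


def approximate_year(cognate_share, k=K_TURKIC, reference_year=2000):
    """Calendar year implied by the age (negative values mean BC)."""
    return reference_year - 1000.0 * separation_age(cognate_share, k)


# Example with the 94.9% entry of the matrix.
age = separation_age(0.949)
print(f"{age:.2f} millennia, c. {approximate_year(0.949):.0f} AD")
```

For the 94.9% entry this gives roughly 0.27 millennia. Dates obtained this way inherit the uncertainty of k, which ranges from 3.8 to 6.6 across the individual calibration points.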
Adjusting
the dendrogram bifurcation points along the temporal axis
Finally, using the newly calculated dates for each undated node, we can
adjust the glottochronological dendrogram, which then looks as follows:
This dendrogram is now adjusted along the temporal axis and contains all
the glottochronological values obtained in this study.
References
The Fundamentals of Lexicostatistics and Glottochronology (2009, 2012);
The Internal Classification and Migrations of Turkic Languages (2009, 2012);
en.wiktionary.org/wiki/Appendix:Swadesh_lists_for_Turkic_languages (2007-2011);
M. Dyachok, Glottochronologiya tyurkskikh yazykov (The Glottochronology of the Turkic Languages), Materials of the 2nd Scientific Conference, Novosibirsk (2001);
Anna Dybo, Khronologiya tyurkskikh yazykov i lingvisticheskiye kontakty rannikh tyurkov (The Chronology of the Turkic Languages and the Linguistic Contacts of the Early Turks) (2006);
O.A. Mudrak, Ob utochnenii klassifikatsii tyurkskikh yazykov s pomosch'yu morphologicheskoy lingvostatistiki (On the clarification of the Turkic languages classification by means of morphological linguostatistics) // Sravnitelno-istoricheskaya grammatika tyurkskikh yazykov. Regionalnyye rekonstruktsii. Moscow (2002).
10/2009
- 10/2011 - 01-03/2012