Science Tribune - Article - January 1997
Amateurs' approach to linguistic affiliations
J.C. Doré and T. Ojasoo
CNRS URA 401, Muséum national d'histoire naturelle, Paris, France
E-mail : email@example.com.
E-mail : firstname.lastname@example.org.
What one can do with an instruction manual
A common complaint is that the multilingual 'Instructions for use' manuals of many domestic and do-it-yourself (DIY) appliances are so badly written that it is impossible to follow the instructions. The appliance is usually made to work by trial and error and the manuals are thrown aside. Can one find a use for them?
It occurred to us that, although the terminology, grammar, and style of these manuals could often be improved, they use words belonging to specific languages, even if these words are, at times, rather inappropriately employed and strung together. Could anything be learnt about the relationships among languages just by analyzing the frequency of occurrence of the letters of the alphabet in the words?
Analysis of the frequency of use of letters of the alphabet
We took a short instruction from the manual of a DIY heavy electrical appliance which had been translated, presumably from German, into 10 languages (a). We built a matrix (11 languages x 26 letters of the alphabet): for each language, we noted the number of times the letter "a" occurred in the paragraph, then "b", then "c", and so on. We analyzed this matrix by a descriptive statistical method called Correspondence Analysis (CA) (1) (2). Letters with accents, circumflexes, tildes and other marks were considered equivalent to letters without a mark. This meant that certain languages would not be immediately set apart from the others by their specific use of letters with marks. Spaces and punctuation were ignored.
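The counting step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure we actually used: the function name letter_counts and the dictionary layout are assumptions, and the sketch silently drops characters that do not decompose into a plain letter plus a mark (for example "ß" or "ø").

```python
import string
import unicodedata
from collections import Counter

def letter_counts(text):
    """Return 26 counts (a..z) for one text.

    Marked letters are folded onto their unmarked form (é -> e, ü -> u);
    spaces, digits and punctuation are ignored.
    """
    # NFD splits a marked letter into base letter + combining mark,
    # so keeping only a-z discards the mark and keeps the base letter.
    decomposed = unicodedata.normalize("NFD", text.lower())
    counts = Counter(ch for ch in decomposed if ch in string.ascii_lowercase)
    return [counts.get(letter, 0) for letter in string.ascii_lowercase]

# Two of the sentences from note (a), used here only as sample input.
texts = {
    "German": "Lassen Sie keine Werkzeugschlüssel stecken. Überprüfen Sie vor "
              "dem Einschalten, dass die Schlüssel und Einstellwerkzeuge "
              "entfernt sind.",
    "French": "Enlevez les clés à outils. Avant de mettre l'outil en marche, "
              "assurez-vous que les clés et outils de réglage aient été retirés.",
}

# One row per language, one column per letter: the languages x letters matrix.
matrix = {language: letter_counts(text) for language, text in texts.items()}
```

Each row of matrix then holds the 26 letter counts for one language, which is the form of table that a correspondence analysis takes as input.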
We shall not delve into the specifics of correspondence analysis. The reader need only keep in mind that CA is a statistical method for dimensionality reduction. It shows how the information within the matrix is organized once background noise has been eliminated. In essence, the raw data are converted into 2-dimensional maps that illustrate the proximities (i.e. correlations) among all items, in this particular case among languages and letters (Figure 1).
Figure 1. Lay-out of languages with respect to their differential use of letters of the alphabet in the main factorial plot of a Correspondence Analysis. This plot shows the most important and discriminating half of the information content in the matrix; 28% of this information is embodied in the main horizontal axis and 22% in the vertical axis.
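For readers who would like to experiment, a map of this kind can be approximated with the textbook formulation of correspondence analysis, namely a singular value decomposition of the standardized residuals of the count table. The sketch below assumes the NumPy library; the function name correspondence_analysis and its arguments are illustrative, and this is not necessarily the software or the exact variant used to draw Figure 1. The orientation (sign) of each axis is arbitrary.

```python
import numpy as np

def correspondence_analysis(counts, n_axes=2):
    """Simple correspondence analysis of a rows x columns count table.

    Returns (row coordinates, column coordinates, share of inertia per axis),
    all restricted to the first n_axes factorial axes.
    """
    N = np.asarray(counts, dtype=float)
    used = N.sum(axis=0) > 0              # columns never used carry no mass
    P = N[:, used] / N.sum()              # correspondence matrix
    r = P.sum(axis=1)                     # row masses (languages)
    c = P.sum(axis=0)                     # column masses (letters)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)    # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)

    rows = (U * sv) / np.sqrt(r)[:, None]     # principal row coordinates
    cols = np.zeros((N.shape[1], len(sv)))    # unused columns stay at the origin
    cols[used] = (Vt.T * sv) / np.sqrt(c)[:, None]

    share = sv ** 2 / np.sum(sv ** 2)         # fraction of total inertia per axis
    return rows[:, :n_axes], cols[:, :n_axes], share[:n_axes]
```

Plotting the first column of the row and column coordinates against the second gives the main factorial plot; the returned shares play the role of the percentages quoted in the caption. A letter that is absent from every text is left at the origin in this sketch.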
The nearer the languages are to the centre of gravity of the system, the more they conform to a certain common pattern. The more eccentric they are, the less typical they are. The closer two languages are to each other, the greater their shared tendency to prefer certain letters to others, i.e., to prefer the letters in that zone to those in opposing zones.
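The notion of eccentricity can be made concrete without running the full analysis. In correspondence analysis, the distance of a row from the centre of gravity, taken over all factorial axes, is the chi-square distance between that row's letter profile and the average profile. A minimal sketch in plain Python follows; the function name eccentricity is an assumption introduced for illustration.

```python
import math

def eccentricity(counts):
    """Chi-square distance of each row profile from the average profile.

    counts is a list of rows (one per language), each a list of letter counts.
    A value of 0 means the language uses letters exactly like the average.
    """
    total = sum(sum(row) for row in counts)
    # Average profile: overall relative frequency of each letter.
    col_mass = [sum(col) / total for col in zip(*counts)]
    distances = []
    for row in counts:
        row_total = sum(row)
        # Letters absent from every text (mass 0) are skipped.
        d2 = sum((n / row_total - c) ** 2 / c
                 for n, c in zip(row, col_mass) if c > 0)
        distances.append(math.sqrt(d2))
    return distances
```

Because this distance uses all axes, it can rank a language as eccentric even when it looks fairly central in a two-dimensional plot, which is the situation of a language that does not lie well in the plane of the figure.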
Relationships among the languages
Let us consider the cleavages due to each axis in Figure 1. Clearly, the most important distinction, made by the positive and negative coordinates of the principal horizontal axis, is between Northern European languages and those of Middle and Southern Europe (left and right-hand quadrants). English is a half-way house. The second cleavage (top and bottom quadrants) separates Germanic languages (German, Dutch) from Latin languages, with French being nearest to the cleavage line; this second axis also separates Norwegian and Danish from Swedish and Finnish. However, one should note that Finnish does not lie as well in the plane of the paper as the other Northern European languages and that a substantial part of the information on Finnish (27%) is present within the 6th factorial axis and beyond.
Interpreting correlations among languages
We were pleasantly surprised to discover how well the map of languages in Figure 1, based on an exceedingly simple, but highly objective, analysis, rendered our subjective impressions and, alas, incomplete knowledge of these different languages. (For an exhaustive classification)
Finnish, in the bottom right-hand quadrant, is the most eccentric language in Figure 1. It is characterized by the frequent use of the letters y, a, t and v, present in the same quadrant. Its alphabet does not include several letters: b, c, f, q, w (except in Gothic script), x and z, which all lie in opposing quadrants and most often at their outer periphery. (The actual quadrant depends upon relative usage of these letters by the other languages. The exception, x, is at the origin of the plot because it is absent in all the translations.) Finnish is the only language of our pool that does not belong to the family of Indo-European languages but to the Finno-Ugric family, spoken by Estonians, Hungarians, Livonians and others in Europe and by many tribes east of the Urals. Its position, apparently closest to Swedish (although one must remember that Finnish does not lie in the plane of the screen/paper), may be explained by the fact that Finland and Sweden are neighbours and also by the substantial percentage of Swedish Finns (about 7%) in Finland. Thus, geographical proximity might engender a "linguistic gradient".
Modern German is a late artificial compound of several kindred dialects (3) with a far smaller vocabulary of early borrowings from Latin or Greek than the other languages of Western Europe. In the first millennium B.C., the language of the tribe occupying much of Northern Europe was Proto-Germanic which, in the west, gave rise to Anglo-Saxon, German and Dutch and, in the north, to the Scandinavian languages (Swedish, Danish, Norwegian, and Icelandic) (4). In Figure 1, Norwegian is closer to Danish than to Swedish. This reflects the historical development of the Norwegian language which, like Icelandic, is derived from Danish. In fact, Norwegian can be considered as a 19th-century import from Denmark.
The "camp" Latin languages fall into two groups: a very tightly-knit group (Spanish and Portuguese) on the one hand, and a looser group (Italian and French) on the other, with French closest to the Germanic languages. This lay-out reflects fairly well the geographical proximity of the peoples speaking these languages and, in particular, the relative isolation of the Iberian peninsula, separated from France by the Pyrenees.
The highly central position of English is particularly interesting. "English is a vernacular of vernaculars (b)" (3). The roots of English are in Northern Germany near Denmark, which was inhabited early in the first millennium by pagan tribes called the Angles, the Saxons, and the Jutes (4). At the time of the Roman Conquest, the Britons spoke Celtic dialects, which were replaced by Latin and then by Anglo-Saxon (4). English began in the 11th century as the language spoken between Norman-French conquerors and their Anglo-Saxon serfs, who borrowed words from their masters. Despite this influx of Romance words, English words of Anglo-Saxon origin, though fewer in number, are still used today about 5 times as often. Moreover, unlike the camp Latin languages, English never crystallised. "It is, indeed, an immense formless aggregate not merely of foreign assimilations and local dialects but of occupational and household dialects and personal eccentricities" (3).
Extension to other languages
Figure 1 can be used as a mathematical model into which further information can be injected, for instance the use of the letters of the alphabet by other languages. We therefore translated the instruction into Estonian (c), a language that belongs to the Finno-Ugric family. The relative contribution of Estonian to the figure was low (22%) and was made solely to the second factorial axis. Like Finnish, it lay relatively far down this axis. Its vicinity to the Latin languages is not so much a sign of similarity with these languages as an indication of a possible movement toward the Germanic languages in the upper quadrants. It is said that about 10% of modern Estonian words are imported from German and about 10% are of Slavic origin. An analysis of all 12 languages including Estonian confirmed that this was in fact the least typical language, i.e., furthest away from the centre of gravity of the system under study.
Should readers wish to translate the instruction into any other language that uses the Latin alphabet, we would be willing to position this language in the figure.
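Positioning a new language in an existing map, without letting it alter the axes, is usually done by treating it as a supplementary row: its letter profile is projected onto the axes computed from the original languages only. The sketch below assumes NumPy and the standard SVD formulation of correspondence analysis; the function name project_supplementary_row is hypothetical, and this is an illustration rather than the exact procedure we followed. Letters that occur in the new text but in none of the original texts are dropped, which is a simplifying choice of this sketch.

```python
import numpy as np

def project_supplementary_row(active_counts, new_counts, n_axes=2):
    """Place one new row (language) on the axes of an existing analysis.

    active_counts: languages x letters table that defines the axes.
    new_counts:    letter counts of the language to be positioned.
    """
    N = np.asarray(active_counts, dtype=float)
    used = N.sum(axis=0) > 0                  # letters seen in the active table
    P = N[:, used] / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)       # row and column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)    # standardized residuals
    _, _, Vt = np.linalg.svd(S, full_matrices=False)

    # Standard column coordinates: the letters, scaled to unit inertia per axis.
    std_cols = Vt.T[:, :n_axes] / np.sqrt(c)[:, None]

    new = np.asarray(new_counts, dtype=float)[used]
    profile = new / new.sum()                 # relative letter frequencies
    # Transition formula: a row sits at the weighted average of the letters.
    return profile @ std_cols
```

Applied to a language that is already in the table, this projection returns that language's own coordinates; applied to the average letter profile, it returns the origin. The supplementary language therefore has no influence on where the other languages fall.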
Comments and Caveats
Of course, an analysis of another sentence will not give identical results. The points corresponding to the languages in Figure 1 could be called language signatures, and slight variations in a signature will fall within a circle surrounding its point. The precise location of the signature will depend upon the nature of the text (technical, literary, poetical, etc.), the style of the translator, and the time in history when it was written.
We have already successfully used the CA statistical method to classify families of proteins by counting amino-acids (6). Admittedly, just counting letters ignores letter sequences but, paradoxically, it has the advantage of stressing the isolated sounds of letters (see definition of phonemes below (d)) whatever their position in a word. Because it is thought that over the centuries more sounds have been preserved than mutated (4), our rather artificial classification might bear some relation to the evolution of different European languages from a possible common origin.
The statistical analysis of texts has been undertaken by several professional linguists (7) (8) but, to our knowledge, the use of CA to analyze letter counts is novel. There is, however, an extremely important aspect of linguistics, fundamental to the study of the origins and evolution of languages, that we have not considered, namely, syntax. We refer the interested reader to the following reference works: (4) (9) (10).
We would be interested in analysing texts that have evolved over time in different languages and would welcome receiving as many modern and ancient versions of the Lord's Prayer as possible (Latin alphabet only) by e-mail at email@example.com.
(a) Manual instruction:
German: Lassen Sie keine Werkzeugschlüssel stecken. Überprüfen Sie vor dem Einschalten, dass die Schlüssel und Einstellwerkzeuge entfernt sind.
English : Remove adjusting keys and wrenches. Form the habit of checking to see that keys and adjusting wrenches are removed from tool before turning it on.
French : Enlevez les clés à outils. Avant de mettre l'outil en marche, assurez-vous que les clés et outils de réglage aient été retirés.
Italian : Non lasci sull'apparecchio chiavi di servizio. Prima di mettere l'apparecchio in funzione, controlli che tutti le chiavi ed utensili di aggiustamento siano state tolte.
Spanish : Retire las llaves de las herramientas. Ante de contectar la herramienta, cercionese de que se hayan quitado las llaves y los utiles de ajuste.
Portuguese : Retire as chaves de ajustamento. Antes des fazer a ligaçao, verifique se as chaves e ferramentas de ajustamento foram previamente retiradas.
Dutch : Laat geen gereedchapsleutels op de machine zitten. Kontroleer voor het inschakelen of sleutels en andere hulpgereedschappen zijn verwijderd.
Swedish: Tag bort justernycklar och skruvnycklar. Gör det till en vana att kontrollera att nycklar och skruvnycklar har tagits bort fran verktyget innan det startas.
Danish: Fjern justeringsnogle eller skruenogle. Gor det til en vane altid at kontrollere at justerinsnogler og skruenogler er fjernet inden vaerktojet startes.
Norwegian: Fjern justeringsnokler og skrunokler. Gjor det til en vane a kontrollere for a se til at noklene og justeringsskruene er fjernet fra redskapet ved innkobling.
Finnish : Irrota avaimet. Ota tavaksesi tarkistaa ennen koneen käynnistystä, että olet irrottanut avaimet koneesta.
(b). Vernacular : from the Roman historian Vario's phrase "vernacula verba" (unliterary expressions used by slaves or serfs); the native language of a peasantry.
(c). Translation into Estonian : "Võtmed ja sättimise riistad eemalda. Enne kui masina käima panna, kindlaks teha, et võtmed ja sättimise riistad on ära võetud."
(d). Phoneme : One of the units of sound that are strung together to form a morpheme, roughly corresponding to the letters of the alphabet. Morphemes : Smallest meaningful pieces into which words can be divided.
1. Benzécri JP. Pratique de l'analyse des données. Linguistique et lexicologie. Dunod/Bordas, Paris, 1981.
2. Erod, CJ. Mastering the complexity of information in patient health care. Science Tribune http://www.tribunes.com/tribune/art96/erod.htm. 1996.
3. Graves R, Hodge A. The use and abuse of the English language. (Original title : The reader over your shoulder - first published in 1943). Paragon House, NY, 1970.
4. Pinker S. The language instinct. William Morrow & Co, USA, 1994.
5. Cavalli-Sforza L. Gènes, peuples & langues. Odile Jacob, Paris, 1996. Ibid : Genes, peoples and languages. Scientific Am 265, 104-110, 1991.
6. Ojasoo T, Doré JC. Taxonomy of nuclear hormone receptors and SERPINS by multivariate analysis of amino-acid composition. J Steroid Biochem Mol Biol 58, 167-181, 1996.
7. Muller C. Le vocabulaire du théâtre de Pierre Corneille : étude de statistique lexicale. In : Travaux de linguistique quantitative. Slatkine, 1979.
8. French R. Singe, un générateur aléatoire de texte. Pour la Science, June 1987.
9. Saussure F de. Course in general linguistics. McGraw-Hill, New York, 1916/1959.
10. Chomsky N. Aspects of the theory of syntax. MIT Press, Cambridge MA, 1965.