Solving the great Ubykh word mystery

# Introduction

I want to share a piece of investigative linguistics I did recently. It's a bit messy, and goes as follows. Not long ago, a couple of months perhaps, I encountered a page on Gretchen McCulloch's All Things Linguistic blog breaking down the major types of morphological structures found in the world's languages. This refers to the kinds of word structures that a language has. For example, one language's words might necessarily consist of multiple internal parts, each bit contributing some kind of meaning to the sentence, while another language's words may only consist of one part and the order in which you combine the words determines the sentence's meaning.

Exemplified with rather cute illustrations are the types of Isolating/Analytic, Agglutinative, and Fusional. (Illustrations are from SpecGram, originally found here.) Analytic languages are those with very simple word structures, like Vietnamese, where each word consists basically of one part. Agglutinative and Fusional languages are synthetic, because the meanings of words emerge from the synthesis of their multiple parts. Agglutinative languages concatenate word-parts together like lego bricks, with each bit typically contributing just one bit of meaning. An example of this can be seen in the Warlpiri noun maliki-kari-kirlangu [dog-another-possessive] 'belonging to another dog'. In fusional languages, word-parts can express multiple bits of meaning, such as the -st in the German verb wander-st [wander-you.singular.present] '(You sg.) are wandering'.

# Polysynthetic words

I encourage you at this point visit the page in question. If you do, and make it to the bottom, you'll find an example of a type of language whose words are of absolutely eye-watering levels of complexity. Ubykh, a Caucasian language formerly spoken in the Caucasus and documented meticulously by George Dumézil and the last speaker Tevfik Esenç in the 20th century, is a so-called Polysynthetic language, which in this context means that it can express in a word what corresponds to an entire sentence in most other languages. The following Ubykh verb shows (perhaps the upper limit of) just how complex these languages can get:

a-χʲa-z-batʂ’a-ʁa-w-də-tʷ-aaj-la-fa-q’a-jt’-ma-da-χ

'If only you had not been able to make him take it all out from under me again for them!'

This is a truly remarkable, and also absurd, feat of human linguistic ability, by any standard. Consisting of 16 parts, this Ubykh expression is undoubtedly more complex than any word in any language you speak or know of. So of course, it seized my interest.

If you're anything like me (let's hope not), then perhaps you might also be a little sceptical, given this rather unbelievable complexity. I mean, it's actually pretty hard to work out what proposition is being expressed here. It involves something about someone taking something out from under someone else for another someone else, but I can't properly comprehend the situation it is describing. It really makes you wonder how, and in what context, an Ubykh speaker uttered this thing. Obviously, the first thing to do to learn more is go back to the original source. The allthingslinguistic.com post didn't name an Ubykh source, so I checked the SpecGram post. All the references are fortunately listed at its end, although strangely there's nothing specifically relating to Ubykh, only a few resources on polysynthesis more broadly. Upon checking all those resources, I tragically found no trace of The Verb, just some general remarks about Caucasian languages broadly.

I thought that was rather strange. Where did it come from, then? Presumably SpecGram had to have found it somewhere. Doing a little searching online, I couldn't immediately find its original location, although I did find it cited in a range of educational resources, all of them appealing to its super-complexity:

A printed example of the Ubykh super word in question, from Holmes and Wiles (2022), An Introduction to Sociolinguistics (6th ed.)

Several other online posts mention it, too. What intrigued me about these occurrences was that they vary in spelling. This puzzled me a little as there was no obvious reason why these mentions wouldn't simply use the spelling of the original source, whatever it was. The only one of these that cites any original Ubykh material at all is Evans, who references two works by Dumézil: La Langue des Oubykhs (1931), and Documents Anatoliens sur les Langues et les Traditions du Caucase, Vol. 2: Textes Oubykhs (1962). With the sources now revealed to me, it was time to forget about any actual work I was committed to and go down the rabbit hole.

# Scouring some (French) books

I did what any habitual procrastinator with a linguistics degree would do, and trawled meticulously through those books to find the original printing of the mysterious super-mega-verb. This was a rather difficult task, for two reasons. First, these books are entirely set in French. I don't speak any French. So I had to climb that barrier somehow. Second, Ubykh has a ridiculously large set of sounds. Only two vowels, but eighty-four consonants. Naturally, many unusual and not-standardly-typeable glyphs are used to represent these sounds, which made digitally searching the files extremely difficult. The orthographic system (i.e., the set of symbols used to represent the sounds) unhelpfully varies from publication to publication.

Of course, this was no deterrent, and I persisted. In an attempt to bypass the French problem, I decided my first strategy would be to word-search for sous 'under(neath)', from the 'take it all out from under me...' part of the translation. Plenty of hits, none of them right. I tried optatif, since this is in the optative ('if only ...!') mood. Nothing. I tried a bunch of other search terms but came up empty-handed each time. I looked through each chapter for Ubykh sentences and again, nothing. I was annoyed as much as I was baffled... my only lead turned out to be a dead end. So where did The Verb come from? Clearly I had to expand my database for searching, and thus I collected every digitally available work on Ubykh I could find. One of these works, Le Verbe Oubykh, is dedicated entirely to verbs -- I was sure I'd find it in there, seeing as it's about as epic as a verb can get, but it wasn't. Text after text, search after search, I found nothing. Pieces of it could be found around, but not the whole thing. For example, I found plenty of verbs incorporating the adverbial /bɜtɕ’ɜ-/ 'under', and plenty of causative ('make X do Y') verbs, but none matching the target. Believe me, I looked in about every single place I could find; I cried my way through many, many chapters on Ubykh and relatives from obscure volumes on Caucasian languages, went down a billion rabbit holes on the phonology of these languages, and checked anything written on their syntax. Sadly these efforts were in vain. Even the Wikipedia page didn't name a source, which was most disappointing. I emailed Trey Jones, the current Editor-in-Chief of Speculative Grammarian, to see if he could recover anything, but his search ultimately was met with the same fate.

My last hope was the reference grammar from 2011, published through LINCOM and authored by Rohan S. H. Fenwick from the University of Queensland (which is interesting to me, as I've never heard of any Australian linguists working on Caucasian languages). This is a fairly comprehensive description that synthesises what's been written about Ubykh previously, putting it in a format more familiar to linguists and providing new analyses of the phonology, morphology, and syntax. It's actually a very nice grammar and contains some fantastic analyses of the published material. Very unfortunately for me, I couldn't find any occurrence of a-χʲa-z-batʂ’a-ʁa-w-də-tʷ-aaj-la-fa-q’a-jt’-ma-da-χ. This troubled me deeply -- surely if it were anywhere, it would be this grammar. I find it difficult to accept that such a fantastic exemplar of morphological complexity would not be shown at least once. As with the other sources, the pieces of The Verb are all in there but not the whole thing at once.

I was pulling out hair at this point. I could not believe my unsuccess, despite how comprehensive (I thought) my search was. After a brief period of feeling like a time-waster, my original suspicions began floating back up to the surface. What if a-χʲa-z-batʂ’a-ʁa-w-də-tʷ-aaj-la-fa-q’a-jt’-ma-da-χ was actually a fabrication? Perhaps someone has done an M. Baker and only inferred it by carefully examining the existing grammatical descriptions. But how would I even prove this? I had to think about that one. With my new hunch that it was all somehow made-up, I looked through the 2011 grammar for any evidence that someone had manually stitched together different pieces of Ubykh, such as an incorrect ordering of word-parts, or something else. To my amazement, everything seemed in perfect shape. [The following discussion is likely to be of interest only to my readers with a background in linguistics; if this is not you, feel free to head on to the next paragraph.] I looked especially at the allomorphy and ordering: there is phonologically-, morphologically- and syntactically-conditioned allomorphy in the agreement prefixes, as well as particular orderings governed by other elements of the verb, but alas, the correct allomorphs do occur and in the correct order. For example, the 3rd plural oblique prefix /ɐ-/ is introduced and governed by the benefactive derivation, occurring correctly before the associated prefix. This oblique prefix then triggers the deletion of the preceding 3rd singular absolutive prefix, which is zero when directly preceding any 3rd plural agreement marker. Or another example: there are two negation options for verbs -- a prefix /m(ɨ)-/ and suffix /-mɜ/ -- the latter of which is demanded by the pluperfect tense (/-q’ɜ:jt’/) and correctly appears in The Verb.

Clearly, if this is the work of some Ubykh-loving nerd, then that individual has done their homework and more. But who? And why? I thought I was beat, but then I had a realisation, uncovering a new lead: the Wikipedia article on Ubykh does not refer to any sources, but someone had to have added The Verb at some point. Perhaps there was a reference there historically, which has since been (accidentally?) deleted. To test this theory, I examined the article's revision history in painstaking detail, from its creation to now. What I found was fascinating, but messy, so please hold on for the next while.

# Findings and results

The verb under investigation first appeared on the Ubykh Wikipedia article in this edit from 26/08/2004, at 02:14. Most curiously, it appears in the form:

azbacr'aghawtwaaylafaq'ayt'daqh

'if only you had been able to take it all out from under me again'

— [02:14, 26/08/2004 : User@203.45.230.188]

Two things need mentioning here: (1) it's missing stuff that appears in its most recent edition, notably the causative and benefactive derivations and the negation; and (2) no source is given for this word. The first point is almost definitive proof that the huge current version has been engineered after the fact, rather than lifted straight from a source. The second point doesn't prove much, except that even this 'basic' version has gone unattributed. Still, I couldn't find even that version in any existing work.

The editor is not named, other than by their IP address, given as 203.45.230.188. To find out more I decided to review their contribution history, as any sane person would do. They had made a substantial number of edits to the Ubykh page, of various sorts, and other language-related articles (Klingon, click consonants, mora, Australian English, to name just a few). Anyway, they made an edit at 02:10 to the Talk page for Agglutinative languages, only four minutes before the edit shown above. In the 2:10 revision, one can see an entry by a user named Thefamouseccles spelling out the Ubykh word as an example in a language they are "familiar with", dated 13/10/2003 at 00:37. Reviewing all edits to this Talk page, one finds that in an edit dated 13/10/2003 at 00:38, the following first appears:

The first known appearance of the Ubykh word with which this article is concerned.

Amazingly, the verb is even simpler (and thus my suspicions appear to be substantiated). But you can clearly see that the foundations of the latest version are there:

a-z-bacr'a-gha-w-tʷ-q'a-yt'-ba

'if you had taken it out from under me'

— [00:38, 13/10/2003 : User@130.102.42.97]

While still very complex (by English standards, at least), this is really approaching something more like you'd find in one of these languages, and you could actually imagine a situation in which someone might say this. Anyway, that edit was made by another unnamed user with IP 130.102.42.97. Sifting through the revision history of Talk:Agglutinative languages for edits made by users Thefamouseccles, 203.45.230.188, and 130.102.42.97, we can establish with great precision the chronology of a-χʲa-z-batʂ’a-ʁa-w-də-tʷ-aaj-la-fa-q’a-jt’-ma-da-χ through the four revisions linked below. In all but the first case, corresponding revisions to the main Ubykh article were completed by the same user just minutes later.

This final one (orthographic differences notwithstanding) is the version that can be found on the Ubykh Wikipedia article today. I was inordinately pleased with myself for uncovering this fabrication, and for having such inarguable evidence. But the fact that the edits were anonymous was haunting me. So, I poured some more time into Wiki-diving to identify the culprit.

Examining the revision history for the Talk page of 203.45.230.188, one can find a revision from 09:33 at 10/05/2005 stating:

Hi Rohan... you should register.

— [09:33, 10/05/2005 : User@node_ue]

Rohan... as in Rohan S. H. Fenwick, the author of A Grammar of Ubykh? That seemed too coincidental to be a coincidence, but I decided to get more evidence to be sure. Visiting this edit of User talk:130.102.42.97 from 04:29, 24/04/2006, one sees the following disclaimer, and much further down the eventual statement:

This IP address, 130.102.42.97, is registered to University of Queensland and is shared by multiple users.
...
Hey thefamouseccles, you're not logged in FYI. I just wanted to let you know that there's a new comment at Talk:Ubykh phonology.

— [04:29, 24/04/2006 : User@Khoikhoi]

Elated I felt as it all clicked into place: Fenwick was associated with the University of Queensland, which suggests that they also commanded this anonymous account, as well as Thefamouseccles by the helpful admission of user Khoikhoi. Moreover, the intimate knowledge of Ubykh morphology needed to accomplish this feat can now appropriately be attributed to the person who literally wrote the book on the language.

It makes sense suddenly why The Verb doesn't appear in the published grammar: because nobody ever said it. If you find a copy of the book you'll see that every Ubykh utterance is accompanied by an attribution to the speaker who said it and the publication from which it hails. The utterance under investigation could not appear in the same way without some lying involved, and it seems RF fortunately had enough integrity to remain honest there. Their name is right on the cover of the book, so maybe that made the difference.

# Conclusions and lessons

Well, quite the journey that was! I'm glad that after weeks of work I was able to solve one of the most irrelevant linguistic mysteries: the great Ubykh word, a-χʲa-z-batʂ’a-ʁa-w-də-tʷ-aaj-la-fa-q’a-jt’-ma-da-χ, is a complete fabrication. We can thus be comforted then that no language has a word meaning 'if only you had not been able to make him take it all out from under me again for them!'. I did reach out to Rhona Fenwick a couple of months ago to get confirmation of my findings, but I got no reply.

I'd be lying if I said I wasn't extremely satisfied with my own efforts here. Truly -- nothing beats the feeling of proving someone wrong, or in this case, proving someone guilty. I feel great about getting to sleep easily at night again.

On a more serious note, I think this whole thing shows an unfortunate outcome of making things up on Wikipedia. The issue is not really with the reproduction of the factoid in a couple of fun linguistic blogs (i.e., All Things Linguistic and SpecGram), but rather with its appearance in peer-reviewed academic publications, which ought to carry and represent genuine authority as well as scientific integrity. Nick Evans would I'm sure not have assumed that the information he found on Wikipedia was false, and Holmes and Wiles would I'm sure have trusted Evans on the matter (as he is an expert on polysynthesis). This is not a dig at Fenwick by any means, just a pointing out of the consequences of a bit of fun they likely felt was harmless. (And to be clear, really no harm was done -- except to the morphology-nerd in my brain who was hoping The Verb was real.)

So let this be a cautionary tale for pop-science bloggers, academic researchers, and fact-fabricators alike. From me you'll get two finger-wagging and tsk-tsk-tsk-ing lessons:

  1. Every school teacher is vindicated: don't get your information from Wikipedia, especially to make a scientific point, as you may in fact find it to be not entirely true after all. Always double-check your sources!

  2. If you're going to fabricate linguistic facts online, you should be clever about covering your tracks. You never know when a loser nerd with too much time on his hands will come along and retrace them.

If anybody knows of any other too-good-to-be-true linguistic facts floating around, please send them to me. I could be in for a new line of work.

# Appendix: Has an error been found?

The following doesn't constitute part of the story, so I've put it here as an appendix.

Upon revisiting this whole thing to write up this post, I did in fact discover a contradiction in the construction of The Verb, and it has to do (as I predicted) with an error in the agreement morphology with respect to the morphosyntactic derivations. The following lays out my discovery. The verb is an ordinary transitive verb /tʷ/ 'take', derived into a causative one. That changes the meaning from 'he took it' to 'you made him take it'. In Ubykh, 'he' would be demoted from ergative to oblique case (i.e., from subject to indirect object), and 'it' would remain in absolutive case (i.e., remain the direct object). (In English this might sound something like 'you made it be taken from him'.) Importantly, this means that in 'you made him take it', the 3rd singular argument 'him' is now an oblique argument (in Ubykh; note that the absolutive argument 'it' is represented by a null prefix). According to Fenwick's 2011 grammar, obliques can be represented by prefixes in two positions: directly after the absolutive prefix but before the benefactive prefix /χʲɜ-/ (i.e., Oblique-1 position), or after the benefactive prefix but before a locational preverbal element such as /bɜtɕ’ɜ-/ 'under' (i.e., Oblique-2 position).

If you sketch out the dependencies in The Verb, you will actually find that the third singular oblique fails to occupy either position because they are filled by two other oblique arguments:

                  ┌───────────┐
       ┌───────┐  │        ┌──┴──┐
       ▼       │  ▼        │     │
Ø-     ɐ-     χʲɜ-z-	 bɜtɕ'ɜ-ʁɜ- w-	   dɨ-	tʷ  -ɐj  -lɜ -fɜ -q'ɜjt'-mɜ -dɜχ
3sgABS-3plOBL-BEN-1sgOBL-under- ABL-2sgERG-CAUS-take-ITER-all-POT-PLUP  -NEG-FRUSTR.OPT
▲      ✕          ✕                 ▲       │    │
│      └╌╌╌╌?╌╌╌╌╌┘                 │       └─┬──┘
└───────────┴───────────────────────┴─────────┘

The third plural oblique /ɐ-/ '(for) them' is governed by the benefactive and thus occupies Oblique-1 position, so the causee /Ø-/ 'him' cannot be appropriately displaced there. That leaves only Oblique-2 open, but that position is occupied by the first singular oblique argument governed by the ablative preverb /ʁɜ-/ 'from'. There is thus no available location for the causee argument of the causative.

As far as I can tell, there is no plausible explanation for this. The causee participant of the causative construction, though in oblique case, should be an obligatory argument and therefore necessarily represented in the verb. It's unclear from Fenwick's description whether obliques introduced by applicativisation are also obligatory, but their representation by prefixes on the verb suggests they are. There are three possibilities: (1) the registration of three oblique arguments on the verb is not possible, blocking the argument structure of the verb; (2) it is possible to have three oblique arguments but only two can 'survive', effectively deleting the loser; or (3) three oblique prefixes can occur on the verb, but the third singular oblique prefix is null and therefore simply cannot be observed.

It's not clear how option (2) would work. One would need to resort to some parochial hierarchy in which third singular obliques lose to not only Speech Act Participants but also third plural obliques in the contest for being represented on the verb. As far as I'm aware there's no independent motivation for positing that. Option (3) seems similarly implausible, on the grounds that there is no other evidence for verbs representing three obliques, and such a lack of evidence makes it impossible to demonstrate that The Verb is in fact one such case. Moreover, Fenwick even states that "verbs [with four arguments] are rare" (p. 100), which almost a priori rules out the possibility of verbs with five. Sadly we'll never truly know the answer, as there are no native speakers remaining. But I'd be willing to wager some money on being right.

# Footnotes

From page 37 of David Nash's Topics in Walrpiri Grammar (1980 MIT Dissertation).

A rather well-cited example of the last fluent speaker of a language, who died in 1992. It's difficult to say however whether he really was the last speaker, as it's in reality quite difficult to determine who counts as the 'last speaker' (see for example this excellent chapter by Nick Evans).

What really counts as a word in these languages anyway?

One example is the labialised alveolopalatal ejective affricate /tɕʷ’/, spelled by Dumézil as <č°’> in some texts and as <ć̣ʷ> in others. Another is the labialised and pharyngealised uvular ejective stop /qʷˤ’/, spelled in some cases as <qw> and in others as <q̄°’>. You can see why searching directly for Ubykh text would be challenging.

The major descriptive works are (in order of publication): ¶ La Langue des Oubykhs (Dumézil, 1931); ¶ Le Verbe Oubykh (Dumézil & Esenç, 1962); ¶ Textes Oubykhs (Dumézil, 1962); ¶ Dictionnaire de la Langue Oubykh (Vogt, 1963); ¶ Nouvelles Études Oubykh (Dumézil, 1965); ¶ Textes Oubykh (Dumézil & Esenç, 1978); ¶ A Grammar of Ubykh (Fenwick, 2011).

This fact, as it turned out, was crucial.

Mark Baker, a linguist who has written extensively on theoretical aspects of polysynthesis, is quite the fan of this method. It's problematic, because in languages we often find that there are certain constructions that "should" exist (according to grammatical generalisations made by the analyst) but don't, for one reason or another. Often, the existence of an irregular or periphrastic form suppresses the expected form. In English, to take a simple example, we should 'expect' that a word meaning "one who steals" is stealer, but this possibility is 'blocked' by the existence of thief.

see page 138 in Fenwick. R. S. H. (2011). A Grammar of Ubykh. Lincom Europa.

← blog prev