Outbacks of Sanity

Should I ever write a book not on the VM itself, but on the research being done on it (and the people doing it), I’ll call it In the Outbacks of Sanity.

I’m not sure: Either only people with a peculiar mindset are attracted to the VM, or the VM corrupts otherwise sane minds into something which “isn’t really mad, but definitely bipolar on weekends”.

Entropy Wonder

I wonder to what extent the choice of a correct or wrong transcription system affects the observed entropy of the VM text, namely the much-cited “low information content”.

Obviously, there are two major ways in which the transcription can go wrong: Either ciphertext character strings are broken up or joined at the wrong positions (is qo really one letter or two? What about dain, daiin and daiiin?), or characters which are identical are treated as different, or vice versa (c/e/cc/ch come to mind; how many different gallows are there, really?).

What would the effect on entropy be? Perhaps I should dig out the old statistics books and see what difference a larger or smaller word length and/or character set would make.
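In the meantime, here’s a minimal sketch (Python) of the kind of test I have in mind: compute first-order character entropy once per raw character, and once with a few digraphs merged into single symbols. The sample string and the digraph choices (qo, ch, sh) are stand-ins of mine, not an actual transliteration:

    from collections import Counter
    from math import log2

    def entropy(symbols):
        # Shannon entropy in bits per symbol.
        counts = Counter(symbols)
        total = sum(counts.values())
        return -sum(c / total * log2(c / total) for c in counts.values())

    def tokenize(text, digraphs=()):
        # Greedily merge the given digraphs into single symbols.
        out, i = [], 0
        while i < len(text):
            if text[i:i + 2] in digraphs:
                out.append(text[i:i + 2])
                i += 2
            else:
                out.append(text[i])
                i += 1
        return out

    sample = "qokeedyqokaindaiinshedycheody"  # stand-in for a real transliteration
    print(entropy(tokenize(sample)))                               # every character separate
    print(entropy(tokenize(sample, digraphs=("qo", "ch", "sh"))))  # digraphs merged

Run over a real transliteration, the difference between those two numbers would show how much of the famous entropy figure hangs on nothing more than where we draw the character boundaries.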

Don’t be the First to Read About…

… the results of the Voynich carbon dating. At least, you won’t be if you stick around here.

René Zandbergen has kindly provided the broadcast date of the long-awaited Voynich documentary (December 10th), which will contain the even-longer-awaited results of the carbon dating performed on the VM earlier this year.

Unfortunately, the broadcast is with the Austrian station ORF2, which my cable provider carries not (which sucketh like only a vacuum cleaner made by Microsoft could suck). Thus, this blog will only be able to mirror the information, but can’t be the premier source.

Stick to Nick Pelling’s blog to get the details and be up-to-date!

Churchill, Stalin, and Edward Elgar

The following developed out of a discussion led elsewhere with regard to a possible breaking of the Dorabella cipher. (I wonder if anybody ever followed the original point of the post, namely the link to Sarah Goslee’s site?)

I challenged the codebreaker’s method (applying a simple substitution cipher and then “processing” the resulting intermediate text), and asked him to give me a random string of characters, which I would transform, just for the kick and to show the arbitrariness of it all, into a message from Churchill to Hitler using the same processing techniques he had applied, namely the reversal of letter sequences and the heavy use of slang (“Backslang”; not being a native speaker, I replaced it with mock German).

The point was to show that those techniques would always render something readable, no matter whether the preceding substitution had been correct or not.

The message I had been given was a string of 87 characters which looked like this:

kinljtvlaqoefmeueetiiemrqrsod
vevtrlalneentsfveoetwdfmraarod
ndtwd’nmwxsdnetwfdmlisdtlsrl

First, I noticed that the letters b, c, g, h, p, and z were missing, which together make up about 14% of a typical English text. The lack of “h” was most dearly felt, because it wouldn’t allow me to write “the”, “this”, or “thespian”, so I decided that “v” was actually “h”. (Since it is derived from a simple substitution, some guesswork is surely allowed?) I couldn’t do much about the fact that the 28 characters of the last line contained only two vowels.

I then decided that Churchill would have written a mocking message to Hitler, and that to do so he would use phonetic spelling and faux-German words. I also assumed that the “l” would have to stand in for an exclamation mark at times. Furthermore, character strings of the text might have been reversed, as in Dorabella.

Having done that, I arrived at the following (bottom lines in all caps):

kinljthlaqoefmeueetiiemrqrsod
KINJ!THLAQOFEMEUEETMEIIRQRSDO

hehtrlalneentsfheoetwdfmraarod
HTHERLALTNEENSFEHOTEWDFARMARDO

ndtwd’nmwxsdnetwfdmlisdtlsrl
NTDWD’NMWXSDENT….LISTDLSR!

This parses as follows:

  • “KINJ!” — A mock address of Hitler as “King!” (deliberately misspelled, to put him in contrast with Churchill’s ‘real’ king)
  • “TH” — “The”
  • “LAQ” — Faux spelling of “lack”
  • “OF”
  • “E” — “your”
  • “MEUEET” — Mock spelling of the German word “Mut” (“courage”)
  • “MEII” — Faux “may”
  • “RQRS” — A contraction of “requires” (with mock faulty grammar)
  • “DOH” — “do”, with a pun on “duh”
  • “THER” — “their”
  • “LALTNEENS” — “latreens”
  • “FE” — “Stalin” (The chemical sign for iron is “Fe”, and it is well known that Stalin translates as “man of steel”)
  • “HOT”
  • “EWD” — A contraction of “ewe’d” (“ewe” as in “female sheep”)
  • “FARMAR” — mock spelling of “farmer”
  • “DONT”
  • “DWD” — “Dude”, a mock spelling, together with the fact that “w” originated as “double-u”, i.e. DWD = “duud”. Perhaps “dud”.
  • “NMWXS” — “Nijmegen’s”, in German “Nimwegens”. Speaking the consonants as a single string renders something like the German pronunciation.
  • “DENT”
  • “WFDM” — This one I couldn’t transcribe. A very similar string (“wdfm”) appeared in the previous line, though in a different context.
  • “LISTD” — “listed”, as in “recognized, being listed in a directory”
  • “LSR!” — “loser!” Again, speaking the consonants as a string renders the (English) pronunciation.

Which leads us to —

“‘King!’ The lack of your “mut” (courage) may requires: Do (duh!) their latreens! Stalin a hot-ewe’d farmer? Don’t, dude! Nijmegen’s dent … Listed loser!”

Churchill seems to warn Hitler not to underestimate Stalin as a farmer with “ewes in heat” (a vulnerable farmer?). “Nijmegen’s dent” probably refers to the failed Operation Market Garden, where the Allies lost large numbers of paratroopers in the area around Nijmegen; “wfdm” might be an abbreviated vow of revenge. “Listed loser” requires no further explanation.

Compare this to the supposed Dorabella solution:

B Hellcat ie a war using effin henshells! Why your antiquarian net diminuendo? Am sorry you theo o’ tis god then me so la deo da — aye

This took me about 90 minutes, plus time to write it up.

If that post title won’t attract readers, I don’t know what will.

Are You Tired of the Strokes Yet?

You probably are.

Well, in this case, let me point you to a mostly overlooked gem in Voynich research, namely Sarah Goslee’s website. Not only is she a fellow SCAdian*) (Hail from Drachenwald!), but she has also put together a few nice statistical tests on the VM. As always, caveat emptor! Honestly, I haven’t figured out what “principal coordinates ordination on Euclidean distances of row-standardized frequencies” is supposed to be, but I’ve been in the game long enough to be suitably impressed by a procedure with a name of that length.

No, seriously, I’m still struggling to understand what exactly Sarah did and what the results mean, but this has all the appearance of a very interesting and competent piece of research which so far hasn’t received the attention it deserves, IMHO.
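For what it’s worth, here’s my rough reading of that mouthful, sketched in Python/NumPy. This is textbook principal coordinates analysis (classical multidimensional scaling); whether it matches what Sarah actually computed is my assumption, and the frequency matrix is invented:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def pcoa(freq, n_axes=2):
        # Row-standardize: turn each row (say, one VM page) into relative frequencies.
        X = freq / freq.sum(axis=1, keepdims=True)
        # Euclidean distances between rows, then double-centering (Gower's method).
        D = squareform(pdist(X, metric="euclidean"))
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J
        # Leading eigenvectors, scaled by sqrt(eigenvalue), give the coordinates.
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1][:n_axes]
        return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))

    # Rows = pages, columns = glyph counts (invented numbers, purely illustrative).
    counts = np.array([[30, 5, 12], [28, 7, 10], [3, 40, 22]])
    print(pcoa(counts))

If that reading is roughly right, pages with similar glyph-frequency profiles end up close together on the resulting axes, which would make it a natural tool for hunting for groupings of pages or scribal hands.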

Hence, my usual piece of advice: Check it out, bros!

*) No, it’s not this.

Higher Stroke Count

So, these days I got around to doing a little programming to number-crunch the VM. Assuming the Stroke Theory is the only way to salvation regarding the VM (and who would doubt it?), I did the following to find out what the “constituent syllable set” might be (a code sketch of the loop follows the list):

  1. Extract a wordlist from the VM (Currier B only), and perform a word count
  2. Discard all “rare” words (roughly, those with two occurrences or fewer). This left me with a list of some 1200 words, covering about 80% of tokens.*)
  3. Compose a list of prefixes and suffixes from this vocabulary, namely the most frequent beginnings and endings of VM tokens. (For each length from 2 through 5 letters, the 32**) most frequent word prefixes and suffixes were chosen, giving a supply pool of roughly 140 beginnings and 140 endings.)
  4. Prepare a list of all words which can be created by combining each one of the prefixes with one of the suffixes.
  5. See how many of the tokens can be covered this way.
  6. Replace one of the 32 chosen prefixes and one of the suffixes with a different one from the supply.
  7. Recalculate the word coverage.
  8. If the word coverage has improved, keep the change, otherwise discard it.
  9. Repeat from step 6.
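For the curious, here is a bare-bones sketch (Python) of steps 4 through 9, as promised above. The word counts and the supply pools at the bottom are tiny placeholders of mine; the real run used the Currier B wordlist, the ~140-strong pools, and k=32:

    import random

    def coverage(prefixes, suffixes, word_counts):
        # Share of tokens whose word is exactly one chosen prefix + one chosen suffix.
        covered = sum(
            n for w, n in word_counts.items()
            if any(w.startswith(p) and w[len(p):] in suffixes for p in prefixes)
        )
        return covered / sum(word_counts.values())

    def hill_climb(pre_pool, suf_pool, word_counts, k=32, steps=9500):
        prefixes = set(random.sample(pre_pool, k))
        suffixes = set(random.sample(suf_pool, k))
        best = coverage(prefixes, suffixes, word_counts)
        for _ in range(steps):
            # Step 6: swap one chosen prefix and one chosen suffix against the supply.
            drop_p = random.choice(tuple(prefixes))
            add_p = random.choice([p for p in pre_pool if p not in prefixes])
            drop_s = random.choice(tuple(suffixes))
            add_s = random.choice([s for s in suf_pool if s not in suffixes])
            new_pre = (prefixes - {drop_p}) | {add_p}
            new_suf = (suffixes - {drop_s}) | {add_s}
            # Steps 7/8: keep the change only if coverage improves.
            score = coverage(new_pre, new_suf, word_counts)
            if score > best:
                prefixes, suffixes, best = new_pre, new_suf, score
        return prefixes, suffixes, best

    # Placeholder data, just to make the sketch runnable:
    word_counts = {"qokedy": 30, "chedy": 25, "daiin": 40, "okar": 10}
    pre_pool = ["qo", "che", "dai", "ok", "qoke", "sh"]
    suf_pool = ["dy", "edy", "in", "ar", "aiin", "ol"]
    print(hill_climb(pre_pool, suf_pool, word_counts, k=3, steps=200))

The “saturation” mentioned below is simply the point where random swaps stop finding improvements; restarts with different seeds, or swapping more than one pair per step, might squeeze out a little more coverage.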

After only a few hundred variations, the program became “saturated”. After some 9500 variations I aborted the run; by then it had arrived at a coverage of 84%, meaning that with 32 prefixes and 32 suffixes, 84% of all tokens from the reduced wordlist could be composed.***)

Here are the results in no particular order:

Prefixes:
ch qo sh lche ok ot lk yk che ota she ai otai olk qoka yt ol qote qot lch ote kch da qoke qok oka cho okai cth ke oke dai te

Suffixes:
dy kal ol ey ar eey chdy al ty edy kain iin cthy aiin eody key keey in kedy ky ain or ckhy kar chy ody ir eedy eor eol am eo ok

I think this is fairly promising. More work needs to go into what share each of the syllables contributes to the whole, and of course more testing against other languages is required. (And I need to compare this to Robert Firth’s work.) But it’s a start.

*) We’ll discriminate between words and tokens. While every character group contributes to the number of tokens, only distinct groups count as words. In other words, the sentence “The dog likes the food” consists of 5 tokens, but only 4 words. You can think of “words” as “vocabulary”, while “tokens” concern the “volume” of the text.
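In code, the distinction is a two-liner (lower-casing so that “The” and “the” count as the same word):

    sentence = "The dog likes the food".lower().split()
    print(len(sentence), len(set(sentence)))  # 5 tokens, 4 words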

**) The number 32 was chosen pretty much arbitrarily, under the assumption that 26 groups (barely enough for the Latin alphabet) wouldn’t suffice to include digits and possible special characters.

***) I tried the same with English text, but there only a coverage of around 15% could be achieved. This may have to do with the shorter English words, though; I’ll need to compare with other languages as well.