Strokes, Round 2: Fight from the inside

(Attack from the rear…)

As mentioned previously, I’ve come to the conclusion that the failure to reproduce Robert Firth’s results was at least partly due to me using the wrong transcription scheme. So, tonight I gave it a second shot, this time working with Currier’s transcription.

Opposed to what?

Every now and then I check out the Voynich category at the dmoz open directory site. (For those of you who don’t know it, dmoz tries to be for links what Wikipedia is for information.) I have submitted my little blog to their review process and hope that in due time Voynichthoughts will show up there as well.

Only today did I notice that there is a subsection “Opposing views” in the Voynich directory, and now I wonder — opposed to what? We all know that there is no universally accepted theory of the VM, and even what is considered “mainstream” in VM research is subject to debate. Methinks everybody with an opinion on the VM will be in opposition, more or less, to their peers.

The mind boggles.

A nessecary speling reform?

“… Year 2 might reform w spelling, so that which and one would take the same konsonant, wile Year 3 might well abolish y replasing it with i and Iear 4 might fiks the g/j anomali wonse and for all …”

Okay, most of you probably know about Mark Twain’s English spelling reform proposal, and while we Germans in general would welcome it if you Anglophones made your spelling match the pronunciation at least somewhat (“now” — “plough” — “rough”: which two of these sound most alike?), what’s the bearing on the VM?

Time and again I’ve noticed that people take the transcription of the VM to be the Real Thing(TM) itself. It is not, any more than the word “apple” is an actual apple.

I’m convinced (I constantly feel the urge to write “convicted”, but I sense that’d be wrong) that the crucial trick the VM author played on us does not lie in the enciphering scheme. In all probability, that will turn out to be something new and original, but not overly complicated. I’d even bet that a single page of the VM will prove sufficient to break the manuscript’s code, and that later generations will sneer at us: “What took those suckers so long to decipher something so laughably simple?”

The author’s brilliant idea was to invent a special alphabet for his creation, with just the right amount of ambiguity built in.

No matter how much number-crunching power we hurl at the VM, it will all come to naught as long as the data we feed to our computers is flawed, and that data is the transcription we use. As opposed to “conventional” codebreaking problems, where we generally have a good idea of what the ciphertext alphabet looks like, all we have for the VM are sophisticated guesses. “Garbage in, garbage out,” they say about computers, and our “garbage” is a flawed transcription.

We really can’t be sure whether EVA “r” and “s” represent the same ciphertext letter. All those different hooks above the “ch”: do they all mean the same thing? Are they different? Do they mean anything at all? Does the hook make a difference like that between “a” and “ä”, or only one like that between “e” and “é”?*) Is “iiin” one letter, two, three or four? Why does “qo” look like two letters, but behave as one?

All this has a devastating impact on any test we might want to run on the text. If we do a letter frequency count, “a” and “ä” really should go into different baskets, while “e” and “é” should not. How can you arrive at a meaningful word-length distribution if you can’t even count the letters in “daiin” reliably? Our statistical shells bounce off the alphabet fortress and detonate in our own camp, with lots of smoke and fog.
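
To make the point concrete, here is a minimal sketch in Python of how the “length” of EVA “daiin” shifts with one’s parsing assumptions. The glyph groupings below are picked purely for illustration; they are assumptions, not an established EVA parsing:

```python
import re

def count_glyphs(word: str, ligatures: list[str]) -> int:
    """Count the glyphs in a transcribed word, greedily merging each of
    the given multi-character sequences into a single glyph."""
    if ligatures:
        # Try the longest sequences first, so "iiin" wins over "iin" and "in".
        pattern = "|".join(sorted(map(re.escape, ligatures), key=len, reverse=True))
        word = re.sub(pattern, "#", word)  # collapse each match into one symbol
    return len(word)

word = "daiin"
print(count_glyphs(word, []))                     # 5: every character its own glyph
print(count_glyphs(word, ["iiin", "iin", "in"]))  # 3: the "iin" tail read as one glyph
print(count_glyphs(word, ["aiin"]))               # 2: "aiin" read as a single unit
```

The same word contributes a length of 5, 3 or 2 to a word-length distribution, depending solely on which of these (equally defensible) readings you commit to.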

This is not meant to dissuade anyone from working on the VM with statistical tests. I just want to point out that the utmost diligence is required. In particular, never forget that you’re working under assumptions which may or may not be true.

(I simply write this because these days I happened to fall into that very trap myself.)

*) In case I’m not making myself clear to the tiny and negligible community of non-speakers of German: “a” and “ä” really are two different letters, while “é” (in French) simply means that the “e” is pronounced, rather than mute. I hope these fancy characters make it to your computers…

Stroke theory, after round 1: Elmar’s corner

Okay, I’ve made a mistake, so my attacks lost their punch.

Dennis Stallings and other astute readers have pointed out to me that the hit ratio I achieved — around 40% by token*) — was much lower than what even their superficial attempts achieved (around 80%).

At first, I attributed this to the Takahashi transcription I had used, which features a number of words running together (like “cthaiinydaiin” or “cheoeesykeor”) that in all probability should each be split into two words. But I was doubtful whether those run-togethers could really be so numerous as to account for half of the possible hits I had obviously missed.

It turns out I had made a mistake at one point: Robert Firth had worked from the Currier transcriptions, while I was using EVA, assuming that the two could be unambiguously converted back and forth. I was wrong there. The translation between the two systems is “lossy”, hence an unsophisticated (i.e. “dumb”) matching system like the one I used will of course produce different results in the two domains.
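
To illustrate what “lossy” means here, the following toy sketch uses an invented two-symbol mapping (it is not the real EVA/Currier correspondence) in which two distinct symbols of one system collapse into a single symbol of the other:

```python
# An invented, illustrative mapping: two symbols of system X collapse into one in Y.
X_TO_Y = {"r": "2", "s": "2", "d": "8", "y": "9"}
Y_TO_X = {"2": "r", "8": "d", "9": "y"}  # picking "r" over "s" is arbitrary

def to_y(text: str) -> str:
    return "".join(X_TO_Y.get(c, c) for c in text)

def to_x(text: str) -> str:
    return "".join(Y_TO_X.get(c, c) for c in text)

original = "sdy"
round_trip = to_x(to_y(original))
print(original, "->", to_y(original), "->", round_trip)  # sdy -> 289 -> rdy

# A matching rule keyed on "s" fires on the original but not on the
# round-tripped text: same underlying word, different statistics.
```

Any dumb matcher run once in one domain and once after a round trip through the other will therefore disagree with itself, which is exactly the trap I walked into.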

Thus, either I adapt my programs to use Currier, or I find a real EVA equivalent to Robert’s odd and even groups.

Time for some infighting, Mr. Voynich!

*) I’m also indebted to Dennis for pointing out to me the difference between “the number of words” (usually understood to mean the number of different words) and “the number of tokens” (the total number of word occurrences). Thus, “I was very, very ignorant” amounts to 4 (different) words, but 5 tokens. “By token” would mean something like “by volume”.
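
In code, the distinction boils down to counting a list versus counting a set. A quick sketch using Dennis’s example sentence:

```python
text = "I was very, very ignorant"
tokens = text.replace(",", "").lower().split()

print(len(tokens))       # 5 tokens (every occurrence counts)
print(len(set(tokens)))  # 4 words  (duplicates collapse)
```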