A Temptation to be Unfulfilled?

The four most frequent Voynichese words in Currier B (as opposed to the letter sequences discussed in the last post) are “ol”, “ar”, “or”, and “al”, in that order, followed by “dy” and “om”.*)

In view of the Stroke theory, there is a tempting assumption (not yet a conclusion…) to be drawn from that: If “o” and “a” represent crescents open to the right and the left, respectively (like the shapes “c” and “)”), and “r” and “l” represent vertical strokes which end at the base line, or descend below it, respectively, these two EVA letters/stroke pairs could be combined to build four lower-case plaintext letters, namely “b”, “d”, “p” and “q”.

But which is which? Which are the crescents, which are the dashes? Take a look at the total number of occurrences of each VM word, and at the frequencies of the suggested letters in Latin, in per mille:

ol 416     "p" 30
ar 244     "d" 27
or 242     "b" 16
al 183     "q" 15

As so often, it is a fit, but not a watertight one.

The plaintext letters “p” and “d” have no strokes in common; likewise, “b” and “q” are disjoint. The same relationship holds for “ol/ar” and “or/al”. So, identifying “p” with “ol” and “q” with “al” would fit the bill, and also the statistics, with “ol” being about twice as frequent as “al”. Consequently, we’d assume that “ar” is really “d” and “or” is equivalent to “b”.

Alas, the latter assumption is not really borne out by the statistics, which show that “ar” and “or” share roughly the same frequency, halfway between “ol” and “al”, whereas in “reality” (whatever that word means in this context), “d” should be about as frequent as “p”, and “b” about as rare as “q”.
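To make the mismatch easier to see, here is a minimal sketch that puts both columns of the table on the same scale and prints the ratios the argument relies on. The numbers are the ones quoted above; the mapping is only the tentative identification from this post, and the names `shares` and `mapping` are mine, chosen for illustration.

```python
# Word counts and letter frequencies as quoted in the table above.
word_counts = {"ol": 416, "ar": 244, "or": 242, "al": 183}
latin_per_mille = {"p": 30, "d": 27, "b": 16, "q": 15}

# Tentative identification from the text (an assumption, not a result).
mapping = {"ol": "p", "ar": "d", "or": "b", "al": "q"}


def shares(counts):
    """Normalise counts so that the four entries sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


word_share = shares(word_counts)
letter_share = shares(latin_per_mille)

# Side-by-side shares for each suggested word/letter pair.
for word, letter in mapping.items():
    print(f"{word} {word_share[word]:.3f}   {letter} {letter_share[letter]:.3f}")

# The two comparisons discussed in the text.
ol_al = word_counts["ol"] / word_counts["al"]
p_q = latin_per_mille["p"] / latin_per_mille["q"]
ar_or = word_counts["ar"] / word_counts["or"]
d_b = latin_per_mille["d"] / latin_per_mille["b"]
print(f"ol/al = {ol_al:.2f}   p/q = {p_q:.2f}")
print(f"ar/or = {ar_or:.2f}   d/b = {d_b:.2f}")
```

The first ratio pair comes out reasonably close, the second does not, which is exactly the problem described above.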

So, while it’s a tempting thought, I can’t really corroborate it yet. It’s a possibility, but not a perfect match, and the number of occurrences is too large to blame the mismatch on the usual fluctuations. Of course, it’s possible to presume some advanced linguistic features at work here: for example, “q” might preferably stand word-initial in the plaintext, which might lead to its uppercase character being preferred in the enciphering, which in turn distorts the frequencies.**)

But that’s all moot speculation for the minute. I’ll keep this tentative identification on the back burner and see if it can be supported by any other findings as well.

*) Unfortunately, WordPress will interpret angled brackets as HTML code, thus I can’t reproduce EVA characters the usual way. For the time being, I’ll represent EVA with italics, and plaintext in straight letters. Bear with me, please.

**) As you can tell, I don’t have a firm concept of how the author might have gone about the camelcasing of the characters.

A String of Luck?

In my never-ending quest to find out the truth about the Stroke theory, I got a splendid idea from my lovely wife Sina the other night. Up to now I’d always tried to dissect the existing ciphertext into the hypothetical “syllables” (which, as the theory goes, compose the ciphertext, with each “syllable” of between one and four ciphertext characters representing the strokes with which one of the plaintext letters would have been drawn). This is a question not unrelated to the knapsack problem, and equally difficult to treat, especially since we can’t even be sure about the ciphertext character repertoire.

Now, Sina’s idea was not to look for the smallest building blocks which make up the VM, but for large ones, the idea being that long identical sequences of plaintext letters would result in long identical sequences of ciphertext.

Of course, this is easier said than done if you know neither the plaintext nor even the plaintext language. But one can try to make reasonable guesses and bludgeon them to death by number crunching. For example, it’s not unreasonable to assume the plaintext language to be Latin, especially if we start with the idea that the VM might be a “late fake.” Then, the first step is to run a sufficiently large Latin plaintext through a little program and check for characteristic sequences.

Doing so for Caesar’s “De bello gallico” yields a few suspicious features (like a large number of “Caesar”s), but also some promising ones. To name the first which caught my eye, the most frequent three-letter string in “De bello gallico”, after tossing aside all whitespace and punctuation, is “ere”. (In a plaintext of roughly 150,000 letters, it shows up 850 times.) This is useful, because the first and the third letter are the same, which makes it easier for us to look for a corresponding ciphertext sequence. So, for example, if plaintext “e” translates into the stroke sequence [Q1], and “r” into [Q2], there ought to be a correspondingly frequent ciphertext sequence [Q1][Q2][Q1], the only problem being that we know neither the length of [Q1] nor that of [Q2].
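For the curious, the “little program” need not be more than a few lines. The following is a minimal sketch of the kind of counting involved, not the actual program used for the figures above; the function name `top_ngrams` and the sample sentence are mine.

```python
from collections import Counter


def top_ngrams(text, n=3, k=5):
    """Return the k most frequent n-letter strings in text.

    Case is folded and everything that is not a letter (whitespace,
    punctuation, digits) is thrown away before counting.
    """
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    return counts.most_common(k)


# Toy input; a real run would read a complete Latin text from a file.
sample = "Gallia est omnis divisa in partes tres"
print(top_ngrams(sample, n=3, k=3))
```

Note that the n-grams are counted across word boundaries, in line with tossing aside all whitespace before counting.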

It is also useful because I had to postulate that before the actual enciphering, the plaintext was modified into CaMeLcAsEpLaInTeXt. (If it hadn’t been, Robert Firth would not have found around 50 constituent syllables, but only 25.) That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (i.e. ciphertext sub-sequences) being different, which would make them much more difficult to recognize.
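Here is a small sketch of that camelcasing step. It assumes strict alternation of upper and lower case over the letters of the text, which is only my working guess at how the scheme might have operated; `camelcase` and its `start_upper` parameter are names made up for this illustration.

```python
def camelcase(text, start_upper=True):
    """Alternate upper and lower case over the letters of text.

    Assumption: strict alternation, ignoring whitespace and punctuation.
    start_upper chooses which of the two possible phases is used.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    out = []
    for i, ch in enumerate(letters):
        # Even positions are upper case when start_upper is True,
        # odd positions when it is False.
        upper = (i % 2 == 0) == start_upper
        out.append(ch.upper() if upper else ch)
    return "".join(out)


for word in ("ere", "eer"):
    print(word, camelcase(word, True), camelcase(word, False))
```

Whichever phase “ere” falls into, its first and third letters end up in the same case, while “eer” always yields three distinct symbols.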

Now the big question is, into how many strokes were “e/E” and “r/R” broken down?

If the camelcasing happened to render “eRe”, then it’s reasonable to assume that [Q1] (“e”) has a length of two characters, with [Q2] (“R”) having three, i.e. the “suspicious” ciphertext string should have a length of 2+3+2=7 characters, with the last two being the same as the first two.

Ceteris paribus, “ErE” should occur about as often (under the assumption that the camelcasing works more or less randomly). “E” could be decomposed into either two strokes (a “C”-shaped crescent and a hyphen “-”, if the author used cursive writing as a model), or four (a vertical line and three horizontal bars, if the author used block letters). “r” is difficult to pin down and could be two strokes (a short vertical slash, and a little “comma” on top), or up to three or four strokes, if using one of the German “r”s of the time. (Though IMHO it’s rare to see Latin text written in German letters of the time.) So, “ErE” would translate into something looking like [Q1][Q2][Q1] with [Q1] being three or four letters long, and [Q2] being two or three, so the whole thing would be a string of between eight and eleven letters.
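Searching for such candidates is mechanical once the two lengths are fixed. The sketch below slides a window over the ciphertext and keeps every substring whose first and last blocks agree. It assumes the ciphertext is available as a plain string with one symbol per ciphertext character (itself an open question); `aba_candidates` is a name I made up, and the toy string is not real Voynichese.

```python
from collections import Counter


def aba_candidates(cipher, len_q1, len_q2):
    """Count substrings of the form [Q1][Q2][Q1] for fixed block lengths.

    len_q1 must be at least 1. No attempt is made to filter out trivial
    hits such as runs of one repeated character.
    """
    width = 2 * len_q1 + len_q2
    counts = Counter()
    for i in range(len(cipher) - width + 1):
        window = cipher[i:i + width]
        if window[:len_q1] == window[-len_q1:]:
            counts[window] += 1
    return counts


# Toy ciphertext, purely for illustration.
cipher = "xxabcdeabyyabcdeab"

# The "eRe" case: [Q1] of length 2, [Q2] of length 3.
print(aba_candidates(cipher, 2, 3).most_common(3))

# The "ErE" case: try every combination of the lengths discussed above.
for len_q1 in (3, 4):
    for len_q2 in (2, 3):
        hits = aba_candidates(cipher, len_q1, len_q2)
        print(len_q1, len_q2, hits.most_common(3))
```

On the real text, the interesting candidates would be the ones whose counts come anywhere near the expected frequency.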

That still gives us a lot of leeway. What else can we do to determine suitable candidates?

One thing is the absolute frequency. We have seen that “ere” occurs 850 times in 150,000 letters; in other words, one occurrence of “ere” happens about every 180 plaintext letters. Now, Currier “B” makes up about 150,000 ciphertext characters.*) (That would mean that it is equivalent to between 75,000 and 50,000 plaintext characters, assuming that each plaintext letter is represented by between two and three ciphertext characters, on average.) Since “ere” was supposedly modified into “eRe” and “ErE” with equal probability, that reduces their frequency to about 1/350 each. Likewise, with the ciphertext “bloated” compared to the plaintext by a factor of somewhere between 2 and 3, the frequencies of [Q1][Q2][Q1] are expected to drop to between 1/700 and 1/1050 per ciphertext character, or, in other words, each should show up very roughly 140 to 210 times in the ciphertext.
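Since it’s easy to slip in a chain of estimates like this, here is the same calculation as a few lines of code, starting from the raw counts. The inputs are the figures quoted above; the equal split between “eRe” and “ErE” and the bloat factors of 2 and 3 are the assumptions made in the text, and `expected_hits` is just an illustrative name.

```python
ERE_COUNT = 850          # occurrences of "ere" in the Latin sample
PLAINTEXT_LEN = 150_000  # letters in the Latin sample
CIPHER_LEN = 150_000     # approximate size of Currier "B" in characters


def expected_hits(bloat, variants=2):
    """Expected count of one camelcased variant in the ciphertext.

    bloat:    assumed ciphertext characters per plaintext letter
    variants: number of equally likely camelcased forms ("eRe", "ErE")
    """
    rate = ERE_COUNT / PLAINTEXT_LEN      # "ere" per plaintext letter
    plain_equivalent = CIPHER_LEN / bloat  # plaintext letters behind Currier B
    return plain_equivalent * rate / variants


for bloat in (2, 3):
    print(f"bloat {bloat}: about {expected_hits(bloat):.0f} hits per variant")
```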

But here we are treading on very thin ice, and the uncertainties make guesses like these almost pointless, starting with the question of what constitutes an individual ciphertext character.

Anyway, I think this is a promising lead to see whether it’s possible to match certain plaintext sequences to ciphertext strings.

*) The whole examination was limited to Currier “B” text, BTW, to avoid problems with the mix between languages, which might well indicate different enciphering schemes and would thus blur the statistics.