Don’s Fumbly Thing

(Make of that title what you will…)

Don of Tallahassee recently submitted an entry in my “Present your own theory here!” category.

But since he already sports his own website, I think I’d rather redirect interested parties to his pages than drain visitors from him by discussing his theory here. The discussion might as well be led on Don’s site as on mine.

So, please check it out and tell Don what you think of his ideas. If I understand correctly, Don’s premise is that each VM word is actually a shorthand note of an ingredient in a recipe, with the first several letters giving the amount required, the subsequent letters giving the part of the plant, then the (abbreviated) name of the plant itself, and the preparation and so on. A string of words then makes a complete recipe for a potion or an ointment or such. This “recipe structure” seems to be related to the mysterious structure underlying the VM word “grammar”. Don thinks this structure is the consequence of the compressed and formulaic notation used.

Personally I think Don is barking up the wrong tree. His system is already immensely complicated, and it seems to get more and more intricate. (Which is a bad sign in any deciphering attempt.) OTOH I have to admit that Don is basing his decipherment on very high standards and tries to arrive not at just any recipe, but at recipes that would have made sense. In other words, he modifies his decipherment until the list of ingredients and preparations might result in actually useful products, and cross-references his previous finds. While it seems to be a fairly trivial task to identify which part of the word denotes the amount, which is the plant part, which is the plant name etc., finding out which syllable value precisely stands for, say, Dandelion, and which for Clover, etc., is mostly guesswork. A daunting task.

My biggest problem with Don’s assumption is that, while they certainly are formulaic, the VM words are just not formulaic enough to warrant the “recipe idea”. For a start, we have word lengths between one and some 10 letters, short labels, long paragraphs of running text — all this is difficult to reconcile with standard recipes, which could just as well be written in the shape of tables.

But then, hey, I didn’t exactly make earth-shattering progress with my Stroke theory either, did I?

It may be Fake, but it’s not a Hoax

Some time ago, I wrote a note to myself on my smartphone*), saying essentially “The Voynich can’t be a hoax, because there’s wordwrap.” As usual, for the longest time afterwards I didn’t have a clear idea what I had meant on that fateful day.

Now, it dawned on me again, especially in the light of the observations I made lately in a little paper about the word-length distribution (which still needs to be amended for several oversights of mine that others have kindly pointed out to me).

What I apparently had wanted to say was: “The VM may be a fake (ie, not the genuine 15th-century article it pretends to be), but it’s not a hoax (ie a meaningless sequence of letters), because it exhibits all behaviour consistent with word-wrap.”

As the abovementioned article and the work of several others have shown, the word-length distribution of the text which makes up the VM exhibits all the features one observes in a “regular” (natural language) text which is subject to word-wrapping (ie, long words which would otherwise run over the right margin of the page body are moved to the beginning of the next line as a whole). Most importantly,

  • the average word length decreases as the line runs on from left to right, and
  • the first word of a line is significantly longer than average.

Both effects are due to the fact that longer words near line ends run a higher “risk” of running over the margin and being subject to word-wrap.
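Both effects can be reproduced with a toy simulation. What follows is only a minimal sketch, assuming greedy word-wrap, a made-up line width of 40 characters, and artificial “words” with random lengths between 1 and 10; none of these figures is taken from the VM itself.

```python
import random

def wrap(words, width):
    """Greedy word-wrap: a word that would overrun the margin starts a new line."""
    lines, current, used = [], [], 0
    for w in words:
        # Space needed if w is appended to the current line (one blank between words).
        needed = len(w) if not current else used + 1 + len(w)
        if current and needed > width:
            lines.append(current)
            current, used = [w], len(w)
        else:
            current.append(w)
            used = needed
    if current:
        lines.append(current)
    return lines

def mean_length_by_position(lines, max_pos=5):
    """Average word length at the 1st, 2nd, ... position within a line."""
    sums, counts = [0] * max_pos, [0] * max_pos
    for line in lines:
        for i, w in enumerate(line[:max_pos]):
            sums[i] += len(w)
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts) if c]

rng = random.Random(1)  # fixed, arbitrary seed so the run is repeatable
words = ["x" * rng.randint(1, 10) for _ in range(20000)]  # artificial "text"
lines = wrap(words, width=40)
overall_mean = sum(len(w) for w in words) / len(words)
by_position = mean_length_by_position(lines)
print(round(overall_mean, 2), [round(m, 2) for m in by_position])
```

In such a run the line-initial words come out longer than the overall average, simply because a long word is more likely to be the one that no longer fits.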

But this means that the author of the VM didn’t introduce line breaks wherever it pleased him, but carefully word-wrapped the VM text to keep the individual (ciphertext) words whole, rather than allowing them to be spread across line boundaries. A slightly stronger presumption would be that the author even had to word-wrap his text because the words are “information units” (either in the sense of plaintext words, or as “blocks” on which the enciphering algorithm worked). In either case, if the VM was gibberish and not designed to be deciphered (ergo if it didn’t matter whether the contents were still readable, because there is no content), then why bother to wrap the words at line ends, rather than introducing a line break whenever space limits dictated it?

The VM text is anything but random; it shows a high degree of structure: a lot of work went into creating it. This could have been done by a brainless Rugg automaton, or through some sensible enciphering algorithm. But work and thought also went into writing down the ciphertext and keeping the integrity of the cipher words, and to me this seems to make sense only if information was supposed to be retrieved again as well, which would suggest that information was enciphered therein in the first place.

*) Said “smartphone” is actually a stone-age Palm, back from the time when Palm was still cool.

Not quite a Theory, but a Start

Dear all,

The “submit your own theory” option of this blog has found resonance for a second time, this time with Tomi Malinen, shifting the centre of Voynichology slightly towards Finland.

Tomi writes:

Could it be possible that the VMS text lines were written bottom to top? For example, if you look at page 53r, there seem to be occurrences where the upper line yields to the characters on the line below. The best example of this on page 53r is on the third line from the bottom, at the fourth word. The gallows character is tilted up as if it yields to the gallows character below. If you think about writing the text from top to bottom on a blank page, there should be no reason to yield to characters that haven’t been written yet. Also, the text lines seem to bend more on the topmost lines compared to the bottom lines.

The page bottom in question can be found here, and this seems to be the culprit in question. Thanx for submitting your thoughts, Tomi.

In general, the assumption is that the VM was written in conventional western manner (top to bottom, left to right). Look at the last lines of the various pages (f53r among them); they are shorter than the rest, and they have a ragged right margin, while the left margin is flush. This is difficult to explain, unless you assume the text was written top to bottom.

On the other hand, your find does look like the gallows character was pushed upwards to make room for the gallows below. As a note of caution, the gallows always appear to stand a bit above the baseline, but this one is really strong… interesting find…

Anybody else with comments on it?


Numero Uno

(This is more of a working note than an actual find. It’s all very much in flux…)

In Currier “B”, the most frequent single-letter words are (along with their number of occurrences):

y    63
s    50
r    40
l    34
o    30
d    7
m    4...

Thus, there appear to be five frequent one-letter words, namely “y”, “s”, “r”, “l” and “o”; after that, the frequency drops sharply, and it’s probably more sensible to consider the counts from “d” on mostly transcription errors, since the word boundaries aren’t always well defined.
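For anyone who wants to repeat such a count, here is a minimal sketch. It assumes the transcription separates words with dots or whitespace, and the sample line is invented purely for illustration; it is not real VM text.

```python
import re
from collections import Counter

def single_letter_words(transcription):
    """Count the words that consist of exactly one transcription character."""
    # Assumption: words are separated by dots and/or whitespace.
    words = [w for w in re.split(r"[.\s]+", transcription) if w]
    return Counter(w for w in words if len(w) == 1)

sample = "daiin.y.ol.s.chedy.y.qokeedy.r.y"  # invented line, not real VM text
counts = single_letter_words(sample)
print(counts.most_common())
```

Since the result depends entirely on where the transcriber saw word boundaries, the low counts at the tail of such a list deserve the same suspicion as those for “d” and “m” above.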

Since under the dogma of the Stroke theory each ciphertext word represents one or more plaintext letters, we may safely assume that a ciphertext word boundary is equivalent to a plaintext letter boundary. Which means that single-letter ciphertext words must correspond to plaintext letters which consist of a single stroke.

Which letters can readily be drawn in a single stroke? These are “c”, “i”, “l”, “o”, “C”, “I”, and “O”, in alphabetical order, with the possible addition of “s”/“S”, and “u”/“U”. Here are their respective frequencies, in promille:

i/I    114
o/O     54
c/C     40
l       32

Note that “l” is a special case, since here only the lower-case letter can be written with a single stroke, while the uppercase “L” requires two strokes to be reasonably drawn. “I” could also be split in three strokes, if you draw top and bottom horizontal bars, so we are left with six candidates. Under duress, we could drop “O” and/or “C”, saying they require two strokes (“O” could be drawn as “()”, “C” could be rendered as “c'” to distinguish it from the lower-case “c”).

As usual, the statistics don’t quite pan out. “i” should be twice as frequent as “o”, but “y” is only slightly more prominent than the runner-up “s”. OTOH the plaintext ratio of “i”/“l” should be 3.5:1, which isn’t that far out from the “y”/“o” ratio of 2.something:1.
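The ratios quoted here follow directly from the two lists above. A small sketch to make the arithmetic explicit (the dictionary and function names are just illustrative):

```python
# Single-letter word counts in Currier "B" and plaintext frequencies in
# promille, both copied from the lists above.
cipher_counts = {"y": 63, "s": 50, "r": 40, "l": 34, "o": 30}
plain_promille = {"i": 114, "o": 54, "c": 40, "l": 32}

def ratio(table, a, b):
    """Frequency of entry a relative to entry b within one table."""
    return table[a] / table[b]

print("plain  i : o =", round(ratio(plain_promille, "i", "o"), 2))
print("cipher y : s =", round(ratio(cipher_counts, "y", "s"), 2))
print("plain  i : l =", round(ratio(plain_promille, "i", "l"), 2))
print("cipher y : o =", round(ratio(cipher_counts, "y", "o"), 2))
```

The first pair of ratios (roughly 2.1 against 1.3) is the mismatch, the second pair (roughly 3.6 against 2.1) the near miss.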

There remains a tangled skein…

A Temptation to be Unfulfilled?

The four most frequent Voynichese words (as opposed to the letter sequences discussed in the last post), are “ol”, “ar”, “or”, and “al”, in that order, in Currier B.*) (Followed by “dy” and “om”).

In view of the Stroke theory, there is a tempting assumption (not yet a conclusion…) to be drawn from that: If “o” and “a” represent crescents open to the right and the left, respectively (like the shapes “c” and “)”), and “r” and “l” represent vertical strokes which end at the base line, or descend below it, respectively, these two EVA letters/stroke pairs could be combined to build four lower-case plaintext letters, namely “b”, “d”, “p” and “q”.

But which is which? Which are the crescents, which are the dashes? Take a look at the total number of the VM word occurrences, and the frequencies of the suggested letters in Latin, in promille:

ol 416     "p" 30
ar 244     "d" 27
or 242     "b" 16
al 183     "q" 15

As so often, it is a fit, but not a watertight one.

The pair of plaintext letters “p” and “d” has no strokes in common, likewise “b” and “q” are disjoint. The same relationship holds for “ol/ar” and “or/al”. So, identifying “p” with “ol” and “q” with “al” would fit the bill, and also the statistics, with “ol” being about twice as frequent as “al”. Consequently, we’d assume that “ar” is really “d” and “or” is equivalent to “b”.

Alas, the latter assumption is not really borne out by the statistics, which show that “ar” and “or” share roughly the same frequency, halfway between “ol” and “al”, whereas in “reality” (whatever that word means in this context), “d” should be as frequent as “p”, and “b” as rare as “q”.
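To put the two columns on the same footing, one can express each entry as a share of its own column. This is only a sketch using the figures from the table above; the pairing is the tentative one just discussed, not an established result.

```python
cipher = {"ol": 416, "ar": 244, "or": 242, "al": 183}   # VM word counts from the table
latin = {"p": 30, "d": 27, "b": 16, "q": 15}            # promille figures from the table
pairing = {"ol": "p", "ar": "d", "or": "b", "al": "q"}  # tentative identification

def normalise(table):
    """Turn raw figures into shares of the column total."""
    total = sum(table.values())
    return {key: value / total for key, value in table.items()}

cipher_share = normalise(cipher)
latin_share = normalise(latin)
for word, letter in pairing.items():
    print(f"{word} {cipher_share[word]:.2f}   {letter} {latin_share[letter]:.2f}")
```

Seen this way, “ol”/“p” and “al”/“q” land close together, while “ar” falls short of “d” and “or” overshoots “b”.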

So, while it’s a tempting thought, I can’t really corroborate it yet. It’s a chance, but not a perfect match, and the number of occurrences is too large to blame it on the usual fluctuations. Of course, it’s possible to presume some advanced linguistic features at work here: for example, “q” might preferably stand word-initial in the plaintext, which might lead to its uppercase character being preferred in the enciphering, which distorts the frequencies in turn.**)

But that’s all moot speculation for the minute. I’ll keep this tentative identification on the back burner and see if it can be supported by any other finds as well.

*) Unfortunately, WordPress will interpret angled brackets as HTML code, thus I can’t reproduce EVA characters the usual way. For the time being, I’ll represent EVA with italics, and plaintext in straight letters. Bear with me, please.

**) As you can tell, I don’t have a firm concept for how the author might have gone about the camelcasing of the characters.

A String of Luck?

In my neverending quest to find out the truth about the Stroke theory, my lovely wife Sina the other night had a splendid idea. Up to now I’d always tried to dissect the existing ciphertext into the hypothetical “syllables” (which, as the theory goes, compose the ciphertext, with each “syllable” of between one and four ciphertext characters representing the strokes with which one of the plaintext letters would have been drawn); a question not unrelated to the knapsack problem, and equally difficult to treat, especially in view of the fact that we can’t even be sure about the ciphertext character repertoire.
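To illustrate why the dissection is so awkward, here is a minimal sketch that lists every way of cutting a word into “syllables” of one to four characters. The syllable repertoire is entirely hypothetical and chosen only to show the ambiguity.

```python
def segmentations(word, syllables, max_len=4):
    """Return every split of `word` into pieces that all occur in `syllables`."""
    if not word:
        return [[]]  # exactly one way to split the empty remainder
    results = []
    for n in range(1, min(max_len, len(word)) + 1):
        head = word[:n]
        if head in syllables:
            for rest in segmentations(word[n:], syllables, max_len):
                results.append([head] + rest)
    return results

toy_syllables = {"o", "l", "ol", "d", "y", "dy"}  # hypothetical, for illustration only
splits = segmentations("oldy", toy_syllables)
for split in splits:
    print("|".join(split))
```

Even this four-letter toy word has four valid readings, and the number grows quickly with word length and with every uncertainty about the repertoire.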

Now, Sina’s idea was not to look for the smallest building blocks which make up the VM, but for large ones. The idea being that long identical sequences of plaintext letters would result in long identical sequences of ciphertext.

Of course, this is easier said than done, if you know neither the plaintext nor even the plaintext language. But one can try to make reasonable guesses and bludgeon them to death by number crunching. For example, it’s not unreasonable to assume the plaintext language to be Latin, especially if we start with the idea that the VM might be a “late fake.” Then, the first step is to run a sufficiently large Latin plaintext through a little program and check for characteristic sequences.

Doing so for Caesar’s “De bello gallico” renders a few suspicious features (like a large number of “Caesar”s), but also some promising ones. To name the first which caught my eye, the most frequent three-letter string in “De bello gallico”, after tossing aside all whitespace and punctuation, is “ere”. (In plaintext of roughly 150k, it shows up 850 times.) This is useful, because the first and the third letter are the same, which makes it easier for us to look for a corresponding ciphertext sequence. So, for example, if plaintext “e” translates into the stroke sequence [Q1], and “r” into [Q2], there ought to be a correspondingly frequent ciphertext sequence [Q1][Q2][Q1], the only problem being that we know neither the length of [Q1] nor of [Q2].
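A “little program” of this kind might look as follows. This is a sketch rather than the code actually used for the counts above; the sample sentence is the well-known opening of “De bello gallico”, which is far too short to yield meaningful statistics.

```python
from collections import Counter

def letters_only(text):
    """Lower-case the text and toss aside whitespace, digits and punctuation."""
    return "".join(ch for ch in text.lower() if ch.isalpha())

def ngram_counts(text, n=3):
    """Count all overlapping n-letter strings in the cleaned text."""
    s = letters_only(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def aba_only(counts):
    """Keep only the strings whose first and last letters coincide, like "ere"."""
    return Counter({g: c for g, c in counts.items() if g[0] == g[-1]})

sample = "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae"
trigrams = ngram_counts(sample)
print(trigrams.most_common(5))
print(aba_only(trigrams).most_common(5))
```

Run over a full-length text, the second list is the one of interest, since it holds the candidates of the shape [Q1][Q2][Q1].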

It is also useful because I had to postulate that before the actual enciphering, the plaintext was modified into CaMeLcAsEpLaInTeXt. (If it hadn’t been, Robert Firth would not have found around 50 constituent syllables, but only 25.) That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (ie ciphertext sub-sequences) being different, which would make them immediately much more difficult to recognize.
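The effect is easy to see in a sketch. Strict alternation of case is assumed here purely for illustration; how the author would actually have chosen the case is an open question.

```python
def camelcase(text, start_upper=False):
    """Alternate lower and upper case letter by letter (assumed scheme)."""
    out = []
    upper = start_upper
    for ch in text:
        out.append(ch.upper() if upper else ch.lower())
        upper = not upper
    return "".join(out)

for trigram in ("ere", "eer"):
    print(trigram, camelcase(trigram), camelcase(trigram, start_upper=True))
```

For “ere” both variants keep the first and third symbol identical, whereas for “eer” all three symbols differ in either variant.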

Now the big question is, into how many strokes “e/E” and “r/R” were being broken down?

If the camelcasing happened to render “eRe”, then it’s reasonable to assume that [Q1] (“e”) has a length of two characters, with [Q2] (“R”) having three, ie the “suspicious” ciphertext string should have a length of 2+3+2=7 characters, with the last two being the same as the first two.

Ceteris paribus, “ErE” should occur about as often (under the assumption that the camelcasing works more or less randomly). “E” could be decomposed into either 2 strokes (a “C”-shaped crescent, and a hyphen “-”, if the author used cursive writing as a model), or four (a vertical line and three horizontal bars, if the author used block letters). “r” is difficult to pin down and could be two strokes (a short vertical slash, and a little “comma” on top), or up to three or four strokes, if using one of the German “r”s of the time. (Though IMHO it’s rare to see Latin text written in German letters of the time.) So, “ErE” would translate into something looking like [Q1][Q2][Q1] with [Q1] being three or four letters long, and [Q2] being two or three, so the whole thing would be a string between eight and 11 letters.
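Searching for such candidates is mechanical once the two lengths are fixed. The following is a minimal sketch; the ciphertext is assumed to be one long string of transcription characters (whether word spaces should be stripped first is left open), and the toy string is invented.

```python
from collections import Counter

def aba_candidates(cipher, q1_len, q2_len):
    """Count substrings of the shape [Q1][Q2][Q1] for the given lengths."""
    span = 2 * q1_len + q2_len
    hits = Counter()
    for i in range(len(cipher) - span + 1):
        chunk = cipher[i:i + span]
        # The last q1_len characters must repeat the first q1_len characters.
        if chunk[:q1_len] == chunk[-q1_len:]:
            hits[chunk] += 1
    return hits

toy = "abxyzab--abxyzab"            # invented string, not VM text
found = aba_candidates(toy, 2, 3)   # the "eRe" case: 2 + 3 + 2 = 7 characters
print(found.most_common(3))
```

For the “ErE” case one would simply loop over the plausible length combinations, for example [Q1] of 3 or 4 and [Q2] of 2 or 3, and compare the most frequent hits.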

That still gives us a lot of leeway. What else can we do to determine suitable candidates?

One thing is the absolute frequency. We have seen that “ere” occurs 850 times in 150,000 letters, in other words, one occurrence of “ere” happens about every 300 plaintext letters. Now, Currier “B” makes up about 150,000 ciphertext characters.*) (That would mean that it is equivalent to between 50,000 and 75,000 plaintext characters, assuming that each plaintext letter is represented by between two and three ciphertext characters, on average.) Now, since “ere” was supposedly modified into “eRe” and “ErE” with equal probability, that reduces their frequency to 1/600 each. Likewise, seeing the ciphertext “bloated” compared to the plaintext by a factor of somewhere between 2 and 3, the frequencies of [Q1][Q2][Q1] are expected to drop to between 1/1200 and 1/1800 per ciphertext character, or, in other words, they should show up very roughly 100 times in the ciphertext.

But here we are treading on very thin ice, and the uncertainties make guesses like these almost pointless, starting with the question of what constitutes an individual ciphertext character.
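The chain of estimates can be written down in a few lines. This sketch takes the “one in 300” interval from the paragraph above as given; the function and parameter names are made up for illustration.

```python
def expected_hits(cipher_chars, plain_interval, bloat, case_variants=2):
    """Expected count of one camelcased variant in the ciphertext.

    cipher_chars   -- size of the ciphertext in characters
    plain_interval -- one occurrence per this many plaintext letters
    bloat          -- ciphertext characters per plaintext letter
    case_variants  -- number of equally likely camelcased forms ("eRe", "ErE")
    """
    return cipher_chars / (plain_interval * case_variants * bloat)

for bloat in (2, 3):
    print(bloat, round(expected_hits(150_000, 300, bloat)))
```

With these inputs the estimate lands between about 83 and 125. Dividing the quoted counts directly (150,000 / 850) gives an interval nearer 176 than 300, which would raise both figures accordingly, so the interval is kept as a parameter.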

Anyway, I think this is a promising lead to see whether it’s possible to match certain plaintext sequences to ciphertext strings.

*) The whole examination was limited to Currier “B” text, BTW, to avoid problems with the mix between languages, which might well indicate different enciphering schemes and would thus blur the statistics.