It may be Fake, but it’s not a Hoax

Some time ago, I wrote a note to myself on my smartphone*), saying essentially “The Voynich can’t be a hoax, because there’s wordwrap.” As usual, for the longest time afterwards I didn’t have a clear idea what I had meant on that fateful day.

Now, it dawned on me again, especially in the light of the observations I made lately in a little paper about the word-length distribution (which still needs to be amended for several of my oversights others have kindly pointed out to me.)

What I apparently had wanted to say was: “The VM may be a fake (i.e., not the genuine 15th-century article it pretends to be), but it’s not a hoax (i.e., a meaningless sequence of letters), because it exhibits all behaviour consistent with word-wrap.”

As the abovementioned article and the work of several others have shown, the word-length distribution of the text which makes up the VM exhibits all the features one observes in a “regular” (natural language) text which is subject to word-wrapping (i.e., long words which would otherwise run over the right margin of the page body are moved to the beginning of the next line as a whole). Most importantly,

  • the average word length decreases as the line runs on from left to right, and
  • the first word of a line is significantly longer than average.

Both effects are due to the fact that longer words near line ends run a higher “risk” of running over the margin and being subjected to word-wrap.
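To see the two effects in a reference text, one can re-wrap it and average the word length at each position in the line. The following is only a minimal sketch: the function name and the line width of 60 are my own assumptions, and the greedy wrapping of Python's `textwrap` module stands in for whatever a scribe would actually have done.

```python
import textwrap
from collections import defaultdict


def mean_word_length_by_position(text, width=60):
    """Greedily word-wrap `text`, then average word length per line position.

    Position 0 is the first word of a line, position 1 the second, and so on.
    """
    lines = textwrap.wrap(
        text,
        width=width,
        break_long_words=False,   # never split a word across lines
        break_on_hyphens=False,
    )
    totals = defaultdict(lambda: [0, 0])  # position -> [sum of lengths, count]
    for line in lines:
        for pos, word in enumerate(line.split()):
            totals[pos][0] += len(word)
            totals[pos][1] += 1
    return {pos: s / n for pos, (s, n) in sorted(totals.items())}
```

Running this on a long prose text and comparing position 0 with the later positions is one way to look for the first-word effect; the same per-position averaging can be applied directly to the lines of a transcription without re-wrapping.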

But this means that the author of the VM didn’t introduce line breaks wherever it pleased him, but carefully word-wrapped the VM text to keep the individual (ciphertext) words whole, rather than allowing them to be spread across line boundaries. A slightly stronger presumption would be that the author even had to word-wrap his text because the words are “information units” (either in the sense of plaintext words, or as “blocks” on which the enciphering algorithm worked). In either case, if the VM was gibberish and not designed to be deciphered (ergo, if it didn’t matter whether the contents were still readable, because there is no content), then why bother to wrap the words at line ends, rather than introducing a line break whenever space limits dictated it?

The VM text is anything but random; it shows a high degree of structure: a lot of work went into creating it. This could have been done by a brainless Rugg automaton, or through some sensible enciphering algorithm. But work and thought also went into writing down the ciphertext and keeping the integrity of the cipher words, and to me this seems to make sense only if information was supposed to be retrieved again as well, which would suggest that information was enciphered therein in the first place.

*) Said “smartphone” is actually a stone-age Palm, back from the time when Palm was still cool.


Numero Uno

(This is more of a working note than an actual find. It’s all very much in flux…)

In Currier “B”, the most frequent single-letter words are (along with their number of occurrences):

y    63
s    50
r    40
l    34
o    30
d    7
m    4...

Thus, there appear to be five frequent one-letter words, namely “y”, “s”, “r”, “l” and “o”; after that, the frequency drops sharply, and it’s probably more sensible to consider the counts from “d” on mostly transcription errors, since the word boundaries aren’t always well defined.

Since under the dogma of the Stroke theory each ciphertext word represents one or more plaintext letters, we may safely assume that a ciphertext word boundary is equivalent to a plaintext letter boundary. Which means that single-letter ciphertext words must correspond to plaintext letters which consist of a single stroke.

Which letters can readily be drawn in a single stroke? These are “c”, “i”, “l”, “o”, “C”, “I”, and “O”, in alphabetical order, with the possible addition of “s”/“S” and “u”/“U”. Here are their respective frequencies, per mille:

i/I    114
o/O     54
c/C     40
l       32

Note that “l” is a special case, since here only the lower-case letter can be written with a single stroke, while the uppercase “L” requires two strokes to be reasonably drawn. “I” could also be split into three strokes, if you draw top and bottom horizontal bars, so we are left with six candidates. Under duress, we could drop “O” and/or “C”, saying they require two strokes (“O” could be drawn as “()”, “C” could be rendered as “c'” to distinguish it from the lower-case “c”).

As usual, the statistics don’t quite pan out. “i” should be twice as frequent as “o”, but “y” is only slightly more prominent than the runner-up “s”. OTOH the plaintext ratio of “i”/“l” should be 3.5:1, which isn’t that far out from the “y”/“o” ratio of about 2.1:1.
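For what it's worth, the ratios can be read straight off the two tables. A small sketch (the dictionary and function names are just labels I made up; the numbers are the ones listed above):

```python
# Counts of single-letter words in Currier "B", copied from the first table.
cipher_counts = {"y": 63, "s": 50, "r": 40, "l": 34, "o": 30}

# Plaintext letter frequencies per mille, copied from the second table.
plain_permille = {"i": 114, "o": 54, "c": 40, "l": 32}


def ratio(table, a, b):
    """Frequency of entry `a` relative to entry `b` within one table."""
    return table[a] / table[b]


if __name__ == "__main__":
    print("i/o:", round(ratio(plain_permille, "i", "o"), 2))
    print("y/s:", round(ratio(cipher_counts, "y", "s"), 2))
    print("i/l:", round(ratio(plain_permille, "i", "l"), 2))
    print("y/o:", round(ratio(cipher_counts, "y", "o"), 2))
```

Comparing ratios rather than raw numbers sidesteps the fact that one table holds absolute counts and the other relative frequencies.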

There remains a tangled skein…

A Temptation to be Unfulfilled?

The four most frequent Voynichese words (as opposed to the letter sequences discussed in the last post), are “ol”, “ar”, “or”, and “al”, in that order, in Currier B.*) (Followed by “dy” and “om”).

In view of the Stroke theory, there is a tempting assumption (not yet a conclusion…) to be drawn from that: If “o” and “a” represent crescents open to the right and the left, respectively (like the shapes “c” and “)”), and “r” and “l” represent vertical strokes which end at the base line, or descend below it, respectively, these two EVA letters/stroke pairs could be combined to build four lower-case plaintext letters, namely “b”, “d”, “p” and “q”.

But which is which? Which are the crescents, which are the dashes? Take a look at the total number of the VM word occurrences, and the frequencies of the suggested letters in the Latin language, per mille:

ol 416     "p" 30
ar 244     "d" 27
or 242     "b" 16
al 183     "q" 15

Like so often, it is a fit, but not a watertight one.

The pair of plaintext letters “p” and “d” has no strokes in common; likewise “b” and “q” are disjoint. The same relationship holds for “ol/ar” and “or/al”. So, identifying “p” with “ol” and “q” with “al” would fit the bill, and also the statistics, with “ol” being about twice as frequent as “al”. Consequently, we’d assume that “ar” is really “d” and “or” is equivalent to “b”.

Alas, the latter assumption is not really borne out by the statistics, which show that “ar” and “or” share roughly the same frequency, halfway between “ol” and “al”, whereas in “reality” (whatever that word means in this context), “d” should be as frequent as “p”, and “b” as rare as “q”.
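One way to put the mismatch into numbers is to normalise both columns of the table to shares of their own total and lay them side by side under the tentative identification. This is only a sketch; the names are mine, and the mapping is the speculative one from the paragraphs above, not an established result.

```python
# Word counts in Currier "B" and Latin letter frequencies (per mille),
# both copied from the table above.
cipher = {"ol": 416, "ar": 244, "or": 242, "al": 183}
plain = {"p": 30, "d": 27, "b": 16, "q": 15}

# Tentative (speculative) identification discussed in the text.
mapping = {"ol": "p", "ar": "d", "or": "b", "al": "q"}


def shares(counts):
    """Convert counts into fractions of their own total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def compare(cipher_counts, plain_counts, pairing):
    """Return {cipher word: (cipher share, share of its supposed letter)}."""
    c, p = shares(cipher_counts), shares(plain_counts)
    return {word: (c[word], p[pairing[word]]) for word in pairing}


if __name__ == "__main__":
    for word, (c_share, p_share) in compare(cipher, plain, mapping).items():
        print(f"{word} -> {mapping[word]}: {c_share:.3f} vs {p_share:.3f}")
```

On these figures “ol”/“p” and “al”/“q” land close together, while “ar” falls well short of “d” and “or” overshoots “b”.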

So, while it’s a tempting thought, I can’t really corroborate it yet. It’s a chance, but not a perfect match, and the number of occurrences is too large to blame it on the usual fluctuations. Of course, it’s possible to presume some advanced linguistic features at work here; for example, “q” might preferably stand word-initial in the plaintext, which might lead to its uppercase character being preferred in the enciphering, which distorts the frequencies in turn.**)

But that’s all moot speculation for the minute. I’ll keep this tentative identification on the back burner and see if it can be supported by any other finds as well.

*) Unfortunately, WordPress will interpret angled brackets as HTML code, thus I can’t reproduce EVA characters the usual way. For the time being, I’ll represent EVA with italics, and plaintext in straight letters. Bear with me, please.

**) As you can tell, I don’t have a firm concept for how the author might have gone about the camelcasing of the characters.

A String of Luck?

In my neverending quest to find out the truth about the Stroke theory, my lovely wife Sina the other night had a splendid idea. Up to now I’d always tried to dissect the existing ciphertext into the hypothetical “syllables” (which, as the theory goes, compose the ciphertext, with each “syllable” of between one and four ciphertext characters representing the strokes with which one of the plaintext letters would have been drawn); a question not unrelated to the knapsack problem, and equally difficult to treat, especially in view of the fact that we can’t even be sure about the ciphertext character repertoire.

Now, Sina’s idea was not to look for the smallest building blocks which make up the VM, but for large ones. The idea being that long identical sequences of plaintext letters would result in long identical sequences of ciphertext.

Of course, this is easier said than done, if you know neither the plaintext nor even the plaintext language. But one can try to make reasonable guesses and bludgeon them to death by number crunching. For example, it’s not unreasonable to assume the plaintext language to be Latin, especially if we start with the idea that the VM might be a “late fake.” Then, the first step is to run a sufficiently large Latin plaintext through a little program and check for characteristic sequences.

Doing so for Caesar’s “De bello gallico” renders a few suspicious features (like a large number of “Caesar”s), but also some promising ones. To name the first which caught my eye, the most frequent three-letter string in “De bello gallico”, after tossing aside all whitespace and punctuation, is “ere”. (In plaintext of roughly 150k, it shows up 850 times.) This is useful, because the first and the third letter are the same, which makes it easier for us to look for a corresponding ciphertext sequence. So, for example, if plaintext “e” translates into the stroke sequence [Q1], and “r” into [Q2], there ought to be a correspondingly frequent ciphertext sequence [Q1][Q2][Q1], the only problem being that we know neither the length of [Q1] nor of [Q2].
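The “little program” need not be more than a few lines. Here is a minimal sketch of the counting step; the function name and defaults are my own, and it simply lower-cases everything and throws away whatever isn't a letter, which is an assumption about the preprocessing.

```python
from collections import Counter


def top_ngrams(text, n=3, k=5):
    """Return the k most frequent n-letter strings in `text`.

    Whitespace, punctuation and digits are discarded first, so n-grams
    may span word boundaries.
    """
    letters = "".join(ch.lower() for ch in text if ch.isalpha())
    counts = Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
    return counts.most_common(k)
```

Called as `top_ngrams(open("some_latin_text.txt").read())` (the file name is a placeholder), it returns a list of `(ngram, count)` pairs, most frequent first.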

It is also useful because I had to postulate that before the actual enciphering, the plaintext was modified into CaMeLcAsEpLaInTeXt. (If it hadn’t been, Robert Firth would not have found around 50 constituent syllables, but only 25.) That means that “ere” would have been rendered “eRe” or “ErE”, but in either case the first letter is the same as the third. If the frequent string had been “eer”, that would have been nearly useless, because it’d be turned into “eEr” or “EeR” for enciphering, with all three plaintext letters (i.e., ciphertext sub-sequences) being different, making them immediately much more difficult to recognize.
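To make the camelcasing step concrete, here is a sketch that assumes the simplest possible rule, namely strict alternation of lower and upper case over the letters. That rule is purely my assumption for illustration; how the author actually chose the case is not known.

```python
def camelcase(plaintext, start_upper=False):
    """Alternate letter case strictly: lower, UPPER, lower, ... (or the reverse).

    Non-letters are dropped before alternating. The strict alternation is
    an illustrative assumption, not an established feature of the cipher.
    """
    out = []
    upper = start_upper
    for ch in plaintext:
        if not ch.isalpha():
            continue
        out.append(ch.upper() if upper else ch.lower())
        upper = not upper
    return "".join(out)
```

Under this rule `camelcase("ere")` gives “eRe” and `camelcase("ere", start_upper=True)` gives “ErE”, so the first and third letters always share a case, whereas “eer” becomes “eEr” or “EeR” with three distinct symbols.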

Now the big question is, into how many strokes “e/E” and “r/R” were being broken down?

If the camelcasing happened to render “eRe”, then it’s reasonable to assume that [Q1] (“e”) has a length of two characters, with [Q2] (“R”) having three, ie the “suspicious” ciphertext string should have a length of 2+3+2=7 characters, with the last two being the same as the first two.

Ceteris paribus, “ErE” should occur about as often (under the assumption that the camelcasing works more or less randomly). “E” could be decomposed into either 2 strokes (a “C”-shaped crescent, and a hyphen “-“, if the author used cursive writing as a model), or four (a vertical line and three horizontal bars, if the author used block letters). “r” is difficult to pin down and could be two strokes (a short vertical slash, and a little “comma” on top), or up to three or four strokes, if using one of the German “r”s of the time. (Though IMHO it’s rare to see Latin text written in German letters of the time.) So, “ErE” would translate into something looking like [Q1][Q2][Q1] with [Q1] being three or four letters long, and [Q2] being two or three, so the whole thing would be a string between eight and 11 letters.
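Searching for such candidates is mechanical once the lengths are fixed. The sketch below counts every substring of the form [Q1][Q2][Q1] for given lengths of [Q1] and [Q2]. It assumes the ciphertext is available as one plain string with one character per glyph and with word spaces already removed; the function name is made up, and the one-character-per-glyph assumption is itself shaky.

```python
from collections import Counter


def aba_candidates(cipher, q1_lengths, q2_lengths):
    """Count substrings shaped like Q1 + Q2 + Q1 in `cipher`.

    Returns a Counter keyed by (Q1, Q2). Overlapping matches are counted.
    """
    counts = Counter()
    for a in q1_lengths:
        for b in q2_lengths:
            total = 2 * a + b
            for i in range(len(cipher) - total + 1):
                chunk = cipher[i:i + total]
                # The last `a` characters must repeat the first `a`.
                if chunk[:a] == chunk[a + b:]:
                    counts[(chunk[:a], chunk[a:a + b])] += 1
    return counts
```

For the “eRe” case one would call it with `q1_lengths=[2], q2_lengths=[3]`, for “ErE” with `q1_lengths=[3, 4], q2_lengths=[2, 3]`, and then look at `.most_common()` for pairs that stand out.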

That still gives us a lot of leeway. What else can we do to determine suitable candidates?

One thing is the absolute frequency. We have seen that “ere” occurs 850 times in 150,000 letters; in other words, one occurrence of “ere” happens about every 175 plaintext letters. Now, Currier “B” makes up about 150,000 ciphertext characters.*) (That would mean that it is equivalent to between 75,000 and 50,000 plaintext characters, assuming that each plaintext letter is represented by between two and three ciphertext characters, on average.) Now, since “ere” was supposedly modified into “eRe” and “ErE” with equal probability, that reduces their frequency to about 1/350 each. Likewise, seeing the ciphertext “bloated” compared to the plaintext by a factor of somewhere between 2 and 3, the frequencies of [Q1][Q2][Q1] are expected to drop to between 1/700 and 1/1050 per ciphertext character, or, in other words, each variant should show up very roughly 140 to 210 times in the ciphertext.
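The chain of estimates is easy to get wrong by hand, so here it is spelled out as a sketch. The function and parameter names are mine; the inputs are the figures quoted above (850 hits in 150,000 plaintext letters, about 150,000 ciphertext characters, an expansion factor between 2 and 3, and two equally likely case variants).

```python
def expected_hits(plain_hits, plain_letters, cipher_chars, expansion, case_variants=2):
    """Expected count of one case variant of a trigram in the ciphertext."""
    rate_plain = plain_hits / plain_letters       # hits per plaintext letter
    rate_variant = rate_plain / case_variants     # "eRe" and "ErE" counted separately
    plain_equivalent = cipher_chars / expansion   # plaintext letters actually encoded
    return rate_variant * plain_equivalent


if __name__ == "__main__":
    for factor in (2, 3):
        hits = expected_hits(850, 150_000, 150_000, factor)
        print(f"expansion {factor}: about {hits:.0f} hits per variant")
```

With these inputs the raw rate is one “ere” per roughly 176 letters, and the estimate comes out at about 212 hits per variant for an expansion factor of 2 and about 142 for a factor of 3.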

But here we are treading on very thin ice, and the uncertainties make guesses like these almost pointless, starting with the question of what constitutes an individual ciphertext character.

Anyway, I think this is a promising lead to see whether it’s possible to match certain plaintext sequences to ciphertext strings.

*) The whole examination was limited to Currier “B” text, BTW, to avoid problems with the mix between languages, which might well indicate different enciphering schemes and would thus blur the statistics.

Countdown to Crackdown?

Until now, the VM has refused to give up its secrets because it’s so “hermetically sealed”, because there is no crack in the wall into which we could jam our crowbar to crack the cipher open: we don’t know the cipher alphabet, nor the plaintext language, have only the faintest idea about the contents, and, since the pagination was done in Arabic numerals, can’t even use that for a crib.

The first word on F17v — “fshody” in EVA

At the same time, one of the (many (some would say “countless”)) puzzling and confusing features of the VM is the fact that in the herbal section, the first (regular) word on the page, not counting the labels, is most of the time unique, and very rare in the other instances. It has for this reason been suggested that these first words or “titles”*) are the actual names of the plants depicted.

Now, aside from the fact that I have the strong conviction that VM ciphertext words don’t map 1:1 to plaintext words, I also think that wouldn’t make sense: If you have a whole page dedicated to a single plant, wouldn’t it be obvious to use the plant’s name more than once? You wouldn’t write “Dandelion.**) It’s green. It’s got long roots. It’s got yellow flowers. It develops a kind of ‘snow’ for seeding.”, but rather something along the lines of “The Dandelion has got yellow flowers which turn into what is called ‘Dandelion snow’ for seeding.” Likewise, one would assume that various pages make reference to each other as well, like “The blackberry looks like the raspberry, except it’s black.”

Both times, this would increase the use of the page titles, rather than making them unique words. (You would expect the subject of a section to show up more often, not less often than average.) Now I’ve wondered whether perhaps these first words have a completely different meaning — are they maybe simply numbers?

Let’s assume for the minute that the titles plainly number the entries in the herbal section — “25. entry: The boring dandelion,” or such. If we assume that the better part of the VM isn’t concerned with numbers, then it wouldn’t be surprising that these title words are always pretty rare. It would also explain why each title is unique — because duplicating your indices would be daft.
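Whether the titles really are that rare is something one can check mechanically against a transcription. A minimal sketch, assuming the herbal pages are available as a dictionary from page label to the list of (regular, non-label) words in reading order; that data layout and the function name are my own invention.

```python
from collections import Counter


def title_word_counts(pages):
    """For each page, return (first word, how often that word occurs overall).

    `pages` maps a page label to its list of words in reading order.
    A count of 1 means the title occurs nowhere else in the given pages.
    """
    corpus = Counter(word for words in pages.values() for word in words)
    return {
        page: (words[0], corpus[words[0]])
        for page, words in pages.items()
        if words
    }
```

Titles with a count of 1 are unique within the pages supplied; the fraction of such titles, compared with the same figure for, say, the second word of each page, would show how special the first position really is.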

And of course, it is an incredibly tempting aspect, because if this was really the case, then that feature could be a crib to crack the VM. If we can reconstruct the original page sequence before the various rebindings (and Nick Pelling and René Zandbergen et al have given us good means at hand to do so), that would mean that we had the numbers 1 through 50 or so in Voynichese before us, in plain sight — which would be a better start into cracking the rest of the cipher than we’ve had in the last century.

I’ll need to look into that.

*) I’ll simply call them that.
**) No, I don’t suggest f17v is a dandelion. Get real.

Talk is Cheap

Lately, I have posted a little treatise regarding the truth behind the myth of the irregular word-length distribution along the line.

I had developed the theory that the first word in each line has to be longer than average due to word-wrap effects which happen automatically. Such a higher-than-average word length shows up in the VM, but to my dismay I was unable to consistently verify the existence of such an effect in natural language text: Mark Twain’s “Tom Sawyer”, which I had used for a reference, failed to exhibit it.

Ger Hungerink then pointed out to me that he had had the same idea before, independently of me; he had tested it, like I had, on “Tom Sawyer” (What is it with that book?), and he did find the first-word effect where I had failed to find any.

Now, this obviously is a bit confusing — two people doing statistics on the same raw data ought to achieve at least vaguely the same results, one would think.

It turned out that the main difference was that while I had retained the complete text, Ger had removed the dialogue from it. I had at first worried about the legitimacy of this step, but became convinced that a) the VM with a high degree of probability is prose rather than dialogue, and b) dialogue will exhibit way more interjections, incomplete sentences, and also in other ways differ from regular text.

Now I did the same as Ger had done, and the result is below:


The (dys-) Functional Line

Dear all,

Once more into the breach, I decided I would take another shot at the ever-elusive properties of the Voynich Manuscript, this time setting my sights on the line statistics, where in the past curiouser and curiouser properties of the word length distribution have been observed.

I had made up my mind to examine these effects more closely, and especially answer the question whether the line is a “functional unit” (ie, whether it “plays a role”) in the enciphering of the VM or not.

Here is the resulting paper:

the_voynich_line (ca. 500kB)

It’s about twelve pages. Click “more” or see below, if you want to have the short answer.

Comments and suggestions welcome, as always!
