Are You Tired of the Strokes Yet?

You probably are.

Well, in this case, let me point you to a mostly overlooked gem in Voynich research, namely Sarah Goslee’s website. Not only is she a fellow SCAdian*) (hail from Drachenwald!), but she has also put together a few nice statistical tests on the VM. As always, caveat emptor! Honestly, I haven’t figured out what “principal coordinates ordination on Euclidean distances of row-standardized frequencies” is supposed to be, but I’ve been in the game long enough to be suitably impressed by a procedure with a name of that length.

No, seriously, I’m still struggling to understand what exactly Sarah did and what the results mean, but this has all the appearance of a very interesting and competent piece of research which, up to now, has not received the attention it deserves, IMHO.
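For what it’s worth, here is my best guess at what that mouthful of a procedure might boil down to, sketched in Python. This is purely my reconstruction from the name alone, not Sarah’s actual code, and the toy data is invented:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy stand-in data: rows might be VM pages, columns glyph counts.
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(20, 8)).astype(float)

# Row-standardize: each row becomes a frequency profile summing to 1.
freqs = counts / counts.sum(axis=1, keepdims=True)

# Euclidean distances between the row profiles.
d = squareform(pdist(freqs, metric="euclidean"))

# Classical principal coordinates analysis (PCoA): double-center the
# squared distance matrix and take the leading eigenvectors.
n = d.shape[0]
j = np.eye(n) - np.ones((n, n)) / n
b = -0.5 * j @ (d ** 2) @ j
vals, vecs = np.linalg.eigh(b)
order = np.argsort(vals)[::-1][:2]
coords = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
# 'coords' now places every page on the first two principal coordinates.
```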

Hence, my usual piece of advice: Check it out, bros!

*) No, it’s not this.


Higher Stroke Count

So, these days I got around to doing a little programming to number-crunch the VM. Assuming the Stroke Theory is the only way to salvation regarding the VM (and who would doubt it?), I did the following to find out what the “constituent syllable set” might be (see the code sketch after the list):

  1. Extract a wordlist from the VM (Currier B only), and perform a word count
  2. Discard all “rare” words (roughly, those with two occurrences or fewer). This left me with a list of some 1200 words, covering about 80% of all tokens.*)
  3. Compose a list of prefixes and suffixes from this vocabulary, namely the most frequent beginnings and endings of VM tokens. (Prefixes and suffixes with lengths of 2 through 5 letters were collected, giving a supply of roughly 140 beginnings and 140 endings; the 32**) most frequent of each served as the initial choice.)
  4. Prepare a list of all words which can be created by combining each one of the prefixes with one of the suffixes.
  5. See how many of the tokens can be covered this way.
  6. Replace one of the 32 chosen prefixes and one of the suffixes with a different one from the supply.
  7. Recalculate the word coverage.
  8. If the word coverage has improved, keep the change, otherwise discard it.
  9. Repeat from step 6.
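For the curious, here is a bare-bones sketch of that search loop, assuming the reduced wordlist is already at hand as a dict of word to token count. This is a reconstruction for illustration, not a verbatim copy of the program I actually ran:

```python
import random

def coverage(word_counts, prefixes, suffixes):
    """Fraction of tokens composable as one chosen prefix + one chosen suffix."""
    pset, sset = set(prefixes), set(suffixes)
    hits = sum(n for w, n in word_counts.items()
               if any(w[:i] in pset and w[i:] in sset for i in range(1, len(w))))
    return hits / sum(word_counts.values())

def hill_climb(word_counts, supply_p, supply_s, k=32, rounds=9500, seed=1):
    rng = random.Random(seed)
    # Supply lists assumed sorted by frequency, so the first k are the
    # most frequent (step 3).
    chosen_p, chosen_s = supply_p[:k], supply_s[:k]
    best = coverage(word_counts, chosen_p, chosen_s)
    for _ in range(rounds):
        # Swap one chosen prefix and one chosen suffix for random
        # alternatives from the supply (steps 6-7).
        trial_p = chosen_p[:]
        trial_p[rng.randrange(k)] = rng.choice(supply_p)
        trial_s = chosen_s[:]
        trial_s[rng.randrange(k)] = rng.choice(supply_s)
        score = coverage(word_counts, trial_p, trial_s)
        if score > best:  # keep the change only if coverage improved (step 8)
            best, chosen_p, chosen_s = score, trial_p, trial_s
    return best, chosen_p, chosen_s
```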

After only a few hundred variations, the program became “saturated”. After some 9,500 variations I aborted the run; by then it had arrived at a coverage of 84%, meaning that with 32 prefixes and 32 suffixes, 84% of all tokens from the reduced wordlist could be composed.***)

Here are the results in no particular order:

Prefixes:
ch qo sh lche ok ot lk yk che ota she ai otai olk qoka yt ol qote qot lch ote kch da qoke qok oka cho okai cth ke oke dai te

Suffixes:
dy kal ol ey ar eey chdy al ty edy kain iin cthy aiin eody key keey in kedy ky ain or ckhy kar chy ody ir eedy eor eol am eo ok

I think this is fairly promising. More work needs to go into what share each of the syllables contributes to the whole, and of course more testing against other languages is required. (And I need to compare this to Robert Firth’s work.) But it’s a start.

*) We’ll discriminate between words and tokens. While any character group contributes to the number of tokens, only distinct groups count as words. In other words, the sentence “The dog likes the food” consists of 5 tokens, but only 4 words. You can think of “words” as the “vocabulary”, while “tokens” concern the “volume” of the text.
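In code, the distinction is just this (a trivial Python illustration):

```python
from collections import Counter

tokens = "The dog likes the food".lower().split()
vocabulary = Counter(tokens)   # distinct words with their counts
print(len(tokens))             # 5 tokens
print(len(vocabulary))         # 4 words
```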

**) The number 32 was chosen more or less arbitrarily, on the assumption that 26 groups (barely enough for the Latin alphabet) wouldn’t suffice once digits and possible special characters are included.

***) I tried the same with English text, but there only a coverage of around 15% could be achieved. This may have to do with English words being shorter, though; I’ll need to compare with other languages as well.

Smart Force Required

Fellow Voynichero Rich SantaColoma just asked in a different post why I wouldn’t get my lazy butt up and do a little brute-force statistics regarding my Stroke theory: namely, if each plaintext letter is always represented by the same group of ciphertext letters (which I’ll call a “syllable”), why not simply count the ciphertext syllables and then do a reasonable frequency match?*)
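For illustration, the naive attack would look roughly like this; the names and the assumed plaintext letter-frequency table are hypothetical:

```python
from collections import Counter

def frequency_match(cipher_syllables, plain_freqs):
    """Naively pair ciphertext syllables with plaintext letters by rank.

    cipher_syllables: list of syllable strings from a VM transcription.
    plain_freqs: dict letter -> relative frequency in the assumed language.
    """
    cipher_ranked = [s for s, _ in Counter(cipher_syllables).most_common()]
    plain_ranked = sorted(plain_freqs, key=plain_freqs.get, reverse=True)
    # i-th most frequent syllable <-> i-th most frequent letter
    return dict(zip(cipher_ranked, plain_ranked))
```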

Actually, I had a similar idea some time ago, sat down at my computer, fired up my trustworthy interpreter, and stalled. It dawned on me that a few problems stood in the way of a brute-force statistical attack:

  1. We don’t know the plaintext language, hence we don’t know the frequency distribution of its letters,
  2. We don’t know the plaintext character set used, ie whether it was cursive (bâtarde or modern?), block writing, print letters, etc. If you think about it, this has grave consequences for the number of strokes required to compose each letter, and hence for the length of the corresponding syllable, not to mention for the relationship between different syllables.
  3. We can’t even count on the plaintext being written in a 26-letter Latin alphabet. Letters like “j”, “y” or “x” may well be missing,
  4. Special characters (digits, astrological symbols) complicate matters,
  5. We don’t know the ciphertext alphabet, ie we don’t know if daiin and daiir are really two distinct words or not; we can’t even be sure c, h, and e are different letters,
  6. Most annoyingly, we also don’t know exactly what the syllable repertoire is. VM words apparently are mostly composed of more than one syllable, but where the syllable “boundaries” run is unclear: Is qocheedy supposed to be split qo-cheedy, qoch-eedy, or perhaps even qochee-dy?**) (See the toy sketch after this list.)
  7. We only have limited statistical material, namely some 70,000 chars at the most.***)
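To make point 6 concrete, here is a toy enumeration of possible splits. The syllable inventory is invented purely for the example:

```python
def segmentations(word, syllables):
    """Yield every way to split 'word' into syllables from the inventory."""
    if not word:
        yield []
        return
    for syl in syllables:
        if word.startswith(syl):
            for rest in segmentations(word[len(syl):], syllables):
                yield [syl] + rest

inventory = {"qo", "qoch", "qochee", "ch", "cheedy", "eedy", "ee", "dy"}
for split in segmentations("qocheedy", inventory):
    print("-".join(split))
# qo-cheedy, qoch-eedy, qochee-dy, qo-ch-eedy, ... all come out as candidates
```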

It seems what is required is not a brute force, but a Smart Force(tm) attack.

*) He actually used much friendlier wording.

**) Robert Firth had an idea, but apparently was not able to find a solution which was completely satisfying for him. As always, there are a number of solutions which yield varying degrees of success, but none with a 100% match. I plan to do some analysis on the labels, which should help at least insofar as the word boundaries of the labels seem to be more clear-cut than those of words in the continuous text.

***) Out of a total of roughly 120,000 chars. But with the different Currier hands, it’s reasonable to assume that different enciphering schemes were used for Currier A and B; hence only either A or B should be used for any statistical test.

A Constant Case of Lower Case

To me, one very interesting fact about the marginalia is that there are a lot of ambiguities and uncertainties about which letter many of the shapes are supposed to represent. Yet all the reasonable options, even for the most dubious cases, appear to be lower-case letters.

Where have the capitals gone? How come, no matter how crappily the author scribbled across the tortured vellum, nothing looks like upper case?

26 + 26 + 10 < VM

We see that many words in the VM comply with a comparatively simple and straightforward grammar, but we also see that lots of words break those rules; hence, the underlying rules are either more complex than we think, or not all words must obey them.

To paraphrase this in terms of the Stroke theory: a character set of 26 capital and 26 small letters plus 10 digits apparently wasn’t enough to transcribe the VM plaintext; otherwise we’d only see 62 different “syllables” making up the VM.

Now, idly browsing the web, I came across a German astronomical manuscript from around 1500. If you take a look at f28r, you’ll notice that while the top half of the page consists almost entirely of the Latin alphabet, the bottom section is riddled with astrological symbols.

If such were the case with the VM itself, these “special characters” would require special enciphering: Some graphical elements in those symbols aren’t present in the Latin character set, which would give rise to the use of rare ciphertext letters, and their combinations would differ from the grammar of the body of the text. Hence, we’d have occurrences of unusual letters and breaches of the grammar rules.

Under this assumption, the hypothesis would be that the VM word “grammar” is not so much a grammar per se, but rather an artifact of the existence of only a limited set of syllables to begin with.

Safety in Numbers (Roman and Arab)

As we all know, there is no shortage of oddities regarding the VM. One of those is that throughout the whole manuscript, no numbers can be found.*) While it has been suggested that the structure of VM words seems to be governed by rules similar to those for the composition of Roman numerals, as of now no consistent system has been proposed which would allow one to generate VM words.**)

But what if we look at it from the Stroke theory’s point of view? Remember, this theory says that each VM ciphertext letter represents one plaintext penstroke, and hence that each VM word is the equivalent of one or two plaintext letters. But in the same manner in which the letters were enciphered, it would also be possible to encipher digits, if Arabic numerals were used. If Roman numerals were used, they would be enciphered just like their letter equivalents.

What would be the consequences?

I have suggested, in the wake of Robert Firth’s observations, that in the Stroke theory the plaintext was written with capital and small letters alternating for the most part, so that a KiNd Of CaMeLcAsE writing was produced. With around 25 capital and 25 small letter shapes, this means the better part of the VM’s vocabulary should be composed of around 50 ciphertext “syllables”. If Arabic numerals were used, this number should increase by the 10 or so shapes required for the individual digits.***)

Furthermore, since it’s impossible to write “capital” and “small” numbers, a string of digits should show up as a departure from the pattern of alternating “prefixes” and “suffixes” (which are identified in the Stroke theory as capital and small letters). Note also that many of the Arabic numerals are fairly similar to each other (6-8-9-0, for example)****), which should lead to a string of quite similar words or syllables in the ciphertext. (As is observed.)

If Roman numerals were used, they could have been written in camelcase (like MmIx rather than MMIX). Again, strings of similar or identical words should be observed, as the letters denoting the Roman numerals often resembled each other. (For example, “D” was supposed to be “one half of (the letter) ‘M’”, ie 500 is half of 1000. Same with “V” and “X”.) The total number of different syllables should not go up, though.
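Just to illustrate the camelcase idea on numerals, here’s a toy helper of my own, nothing more:

```python
def camelcase(s):
    """Alternate capital and small letters, KiNd Of CaMeLcAsE style."""
    out, upper = [], True
    for c in s:
        if c.isalpha():
            out.append(c.upper() if upper else c.lower())
            upper = not upper
        else:
            out.append(c)
    return "".join(out)

print(camelcase("MMIX"))  # -> MmIx
```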

Thus, I think it’s quite conceivable that the VM text is interspersed with numbers. They just don’t show up, because they’re enciphered like the rest of the letters.

If the VM turns out to be the very earliest book on cocktail recipes, the first round is on me.

*) With the exception of the pagination, which appears to be a later addition, though.

**) If this were possible, ie if all VM words could be explained as Roman numerals, the question is: what would that mean for the deciphering of the VM? That it was one of the earliest phone books in existence, predating the phone by some 400 years, and that unfortunately all the people’s names were dropped from it…?

***) Or fewer, perhaps. It’s difficult to see how the Stroke theory should discriminate between “o” and “0”.

****) Bear in mind, though, that in period the shapes of Arabic numerals could vary significantly with the exact time and place of their use.