Fellow Voynichero Rich SantaColoma just asked in a different post why I wouldn’t get my lazy butt up and do a little brute-force statistics on my Stroke theory. Namely, if each plaintext letter is always represented by the same group of ciphertext letters (which I’ll call a “syllable”), why not simply count the ciphertext syllables and then do a reasonable frequency match?*)
Actually, I had a similar idea some time ago, sat down at my computer, fired up my trusty interpreter, and stalled. It dawned on me that a few problems stand in the way of a brute-force statistical attack:
- We don’t know the plaintext language, hence we don’t know the frequency distribution of its letters,
- We don’t know the plaintext character set used, i.e. whether it was cursive (batarde or modern?), block writing, print letters, etc. If you look at it, this has grave consequences for the number of strokes required to compose each letter, and hence for the length of the corresponding syllable, not to mention for the relationship between different syllables,
- We can’t even count on the plaintext being written in a 26-letter Latin alphabet. Letters like “j”, “y” or “x” may well be missing,
- Special characters (digits, astrological symbols) complicate matters,
- We don’t know the ciphertext alphabet, i.e. we don’t know if daiin and daiir are really two distinct words or not; we can’t even be sure c, h, and e are different letters,
- Most annoyingly, we also don’t know exactly what the syllable repertoire is. VM words are apparently mostly composed of more than one syllable, but it is unclear where the syllable “boundaries” run: Is qocheedy supposed to be split qo-cheedy, qoch-eedy, or perhaps even qochee-dy?**)
- We only have limited statistical material, namely some 70,000 chars at the most.***)
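To put a number on the syllable boundary problem, here is a minimal Python sketch. It is an illustration only, not a method anyone has actually applied to the VM: the function names `segmentations` and `syllable_counts` and the two fixed "cut after N letters" rules are made up for the example, and the three sample words are simply the ones mentioned above.

```python
from collections import Counter


def segmentations(word, max_len=None):
    """Yield every way to cut `word` into consecutive, non-empty syllables.

    `max_len` optionally caps the syllable length (hypothetical parameter).
    """
    if not word:
        yield []
        return
    limit = len(word) if max_len is None else min(max_len, len(word))
    for i in range(1, limit + 1):
        for rest in segmentations(word[i:], max_len):
            yield [word[:i]] + rest


def syllable_counts(words, split):
    """Count syllables over `words`, given a rule `split` that cuts one word."""
    counts = Counter()
    for w in words:
        counts.update(split(w))
    return counts


all_splits = list(segmentations("qocheedy"))
two_part = [s for s in all_splits if len(s) == 2]
print(len(all_splits), len(two_part))

# Two arbitrary (assumed) splitting rules give two different "syllable" tallies.
sample = ["qocheedy", "daiin", "daiir"]
cut_after_2 = syllable_counts(sample, lambda w: [w[:2], w[2:]])
cut_after_4 = syllable_counts(sample, lambda w: [w[:4], w[4:]])
print(cut_after_2)
print(cut_after_4)
```

Even a single eight-letter word allows 128 different segmentations, seven of them into two parts, and the frequency table one ends up with depends entirely on which splitting rule was assumed beforehand.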
It seems what is required is not a brute force, but a Smart Force(tm) attack.
*) He actually used a much more friendly wording.
**) Robert Firth had an idea, but apparently was not able to find a solution which was completely satisfying for him. As always, there are a number of solutions which yield varying degrees of success, but none with a 100% match. I plan to do some analysis on the labels, which should help at least insofar as the word boundaries of the labels seem to be more clear-cut than those of words in the continuous text.
***) Out of a total of roughly 120,000 chars. But given the different Currier hands, it’s reasonable to assume that Currier A and B used different enciphering schemes, hence only one of the two should be used for any statistical test.