There is no end to the string of new theories and commenters on this blog. (Keep ’em coming, boys and girls!) Today it’s Zach, who sent me more of a general question than a theory:
Please forgive me if this has already been covered in your site and I’m just not seeing it, but hasn’t anyone tried feeding the VM into a computer and brute-forcing it? Computers are really good at trying every possible combination and weeding out the ones that don’t make sense, so it seems like they ought to be perfect for grinding away at the VM by one means or another.
Any idea whether computers have been tried?
And he immediately followed it up, for good measure, with some more detail:
Me again. I’m elaborating a bit, in case I really am
a) the first person to think of this and
b) there’s really no reason it wouldn’t work
I did see one article here on computers and frequency analysis, but the author made a good point about not knowing the language and so not knowing the frequency rules. I envision a brute-force that attacks the words rather than the letters. It would work like this:
First, your computer is fed lots and lots and lots of example texts, the more the better, so it can build a probability map that stores the likelihood of any given word being followed by (or just found near) a second word, and given those two words, what third word is likely to come next, and so on. You will never get firm answers, but this is fuzzy logic – as long as we can grade potential sentences as more or less likely to be linguistically correct and sensible, we are good to go.
Once the computer has this probability map, the rest is just a matter of brute force:
Start with one of the repeating ‘words’ in the VM.
Assign it an arbitrary English meaning.
Using that meaning, and the probability map, assign English words to each word preceding and following your starting word. Branch outward from there until you have ‘translated’ all words. Use your probability map to judge how likely it is that the ‘translation’ is actual English sentences (best if you can ignore word order and focus on word proximity, because word order falsely assumes they had the same grammar as we do). Store that probability and start over with a new guess for your first repeating word. Repeat over and over until you’ve found a high-probability match or, more likely, you’ve run out of choices.
Assuming you’ve not found a 100% match, present the 10 most probable ‘translations’ for human inspection.
Compiling tables of synonyms would also improve things; that way the computer could also consider how likely CONCEPTS are to be grouped together, since English and VM-speak are very unlikely to have a 1:1 mapping.
The core assumption is that, even if there are no direct word matches, there’s a 1:1 map of concepts between VM-speak and your target language (English, in my case). I think that’s a safe assumption, because if there’s no such mapping, then it seems translation would be impossible, like trying to translate a 1-time-pad cipher without the key.
Anyhoo, I hope you enjoy my ramblings. Thanks for listening.
Thanks, Zach, for your input.
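To make the proposal concrete, here is a minimal sketch of what Zach describes, boiled down to a toy. It is my own simplification, not a tested attack: the “probability map” is reduced to plain bigram counts with add-one smoothing, word order is kept (Zach would rather ignore it), and the function names `build_bigram_counts`, `score` and `brute_force` as well as the three-word sample corpus are made up for illustration.

```python
from collections import Counter
from itertools import permutations
import math

def build_bigram_counts(corpus_sentences):
    """Count single words and adjacent word pairs in the example texts."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def score(words, unigrams, bigrams):
    """Log-probability of a word sequence under an add-one-smoothed bigram model."""
    vocab_size = len(unigrams)
    total = 0.0
    for a, b in zip(words, words[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
    return total

def brute_force(cipher_words, vocabulary, unigrams, bigrams, top=10):
    """Try every 1:1 assignment of vocabulary words to the distinct cipher words
    and return the `top` best-scoring candidate 'translations'."""
    distinct = sorted(set(cipher_words))
    results = []
    for choice in permutations(vocabulary, len(distinct)):
        mapping = dict(zip(distinct, choice))
        candidate = [mapping[w] for w in cipher_words]
        results.append((score(candidate, unigrams, bigrams), candidate))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:top]

# Toy data only: a tiny "training corpus" and a three-word "ciphertext".
corpus = ["the cat sat", "the cat sat", "the dog sat"]
unigrams, bigrams = build_bigram_counts(corpus)
best = brute_force(["aaa", "bbb", "ccc"], ["the", "cat", "sat"], unigrams, bigrams)
print(best[0][1])
```

Note that the exhaustive loop only works for toy inputs: with N candidate words and V distinct cipher words it has to walk through N!/(N-V)! assignments, so anything real would need a smarter search than literally trying everything.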
For a start, there is actually a wealth of computer power which has been pumped (more or less in vain) into the black hole of information we call the Voynich. I myself tried a little with my Stroke theory, for example, but these efforts are dwarfed by people like Jorge Stolfi or Julian Bunn, among others. But these mostly focus on analysis of the VM, not directly on a translation. Why is this?
Well, a brute force attack is hampered by several constraints:
- Our statistical material is limited. The VM comprises some 130,000 characters, which appears to be a lot. But when you look at it, that’s only some 30,000 words. If you further take into account the different encoding schemes (aka “Currier A” and “Currier B”, resp.), which differ subtly but do differ, you’re left with only a sample of some 15,000 words, which isn’t that much.
- We don’t know the plaintext language underlying the VM. English is possible, yet some of the marginalia point to French or Spanish, the images provide hints to Italy, and some clues point to Germany, not to mention that Latin would have been the lingua franca of the era.
- We know next to nothing about the subject matter, and accordingly little about the vocabulary used.
- We’re unclear about the ciphertext alphabet. We really have no idea whether the sequence of two connected “c”s really means “two ‘c’s in a row”, or is a completely different letter. (Compare this to the case of Latin letters, where “nn” is something completely different from “m”.) We don’t know if the “drops” above some “cc” groups only modify the underlying letter(s) (compare “O” -> “Ö”), or if they make it a completely different letter (compare “O” -> “Q”).
- Some characters, like the notorious “gallows”, tend to appear only at the start of a paragraph or in the first line of a page. They may be embellishments of “regular” characters (as was often done in manuscripts of the era), but we don’t know which “regulars” they’d replace.
But there is one obstacle even greater than these, and much more fundamental: Any “brute force” attack would presume that the ciphertext words of the VM are mapped 1:1 from the plaintext words. And this is extremely unlikely for a number of reasons:
- The ciphertext alphabet seems to consist of around 17 frequent letters, plus a large number of rare “weirdos”. That maps poorly to a Latin alphabet.
- Some frequent letter groups show up almost exclusively word-initial (“qo”) or word-terminal (“dy”). That’s unknown for any Central European language.
- Word-length distribution is odd: there is a shortage of both very short and very long words, so words have a comparatively uniform length. Again, this is unusual for Central European languages.
- Overall, the words exhibit a very regular structure; check out Stolfi’s “Core-Mantle-Crust” paradigm. (Yes, it’s a tough read, but worth working through if you want to understand the VM.) They are composed according to a fairly rigid “grammar”, the like of which is unknown in European languages.
- Nobody has been able to identify particles and articles (“a”, “and”, “with”…) in the VM.
All of these differences between natural languages and the VM make it highly unlikely that the enciphering mechanism simply always turned plaintext word “A” into ciphertext word “X”, and “B” into “Y”.*) I’m convinced that one VM word is not equivalent to a plaintext word, but rather that it only represents a few letters.
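Statistics like the ones in the list above are easy to gather once you have settled on a transliteration. Here is a minimal sketch, assuming the text is already available as a list of words in some Latin-letter transliteration (which, as said, is itself an assumption about the alphabet). The helper names `word_length_distribution` and `position_profile` and the handful of sample words are only there for illustration, not real corpus data.

```python
from collections import Counter

def word_length_distribution(words):
    """Map word length -> share of all words having that length."""
    counts = Counter(len(w) for w in words)
    total = len(words)
    return {length: counts[length] / total for length in sorted(counts)}

def position_profile(words, group):
    """Count where a letter group occurs: word-initial, word-final, or medial.
    A word that consists of the group alone is counted as initial."""
    profile = Counter()
    for w in words:
        start = w.find(group)
        while start != -1:
            if start == 0:
                profile["initial"] += 1
            elif start + len(group) == len(w):
                profile["final"] += 1
            else:
                profile["medial"] += 1
            start = w.find(group, start + 1)
    return profile

# Illustrative sample only, far too small to prove anything.
sample = ["qokeedy", "qokedy", "daiin", "chedy", "shedy", "qokaiin"]
print(word_length_distribution(sample))
print(position_profile(sample, "qo"))
print(position_profile(sample, "dy"))
```

Run the same two functions over a comparable amount of Latin, Italian or German text, and you can compare the profiles side by side.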
There are other assumptions: Don of Tallahassee assumes it’s a list of highly abbreviated recipes, and David Suter presumes it could be encoded geographical coordinates. Theoretically, all these schemes could be attacked by brute computer force, but this would only make sense once the enciphering method was sufficiently clear. And exactly this is not the case: to my knowledge, no theory has been put forth which would sufficiently explain all the peculiarities we observe in the ciphertext, and hence there’s simply no starting point for a computer programme to launch.
*) There are actually two scenarios where it would be just conceivable that there is a 1:1 correspondence between plaintext words and cipher words.
One is that the VM was written with the aid of a dictionary, where all plaintext words were numbered, and in the VM their numbers were written down not in Arabic numerals, but in something like the Roman numbering system, i.e. word “259” in the dictionary would have been written “CCLIX”. While this is conceivable, up to now nobody has been able to provide a coherent numbering system which would result in the “word grammar features” mentioned above.
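As a minimal sketch of that dictionary scenario, with plain Roman numerals standing in for whatever numbering system the author might have used: the functions `to_roman` and `encode` and the tiny word list are hypothetical illustrations, not a claim about the actual VM scheme.

```python
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]

def to_roman(n):
    """Write a positive integer in standard Roman numerals."""
    if n <= 0:
        raise ValueError("Roman numerals need a positive number")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

def encode(plaintext_words, dictionary):
    """Replace each plaintext word by the Roman numeral of its
    1-based position in the dictionary."""
    index = {word: i for i, word in enumerate(dictionary, start=1)}
    return [to_roman(index[w]) for w in plaintext_words]

print(to_roman(259))  # CCLIX
print(encode(["herba", "aqua"], ["aqua", "herba", "radix"]))  # ['II', 'I']
```

Such a scheme does give every plaintext word exactly one “cipher word”, but the resulting numerals would have to show the same word structure as the VM, and that is the part nobody has managed so far.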
The second idea would be that the VM was written in an artificial language, in particular in one of the “logical” or “a priori” languages. (Check out Solresol for an example.) These artificial languages construct their words from “blocks” which do resemble the “core-mantle-crust” syllables found by Stolfi. But the first comparable logical languages date from at least two centuries after the VM was written, so their use is fairly unlikely.