Period wordlists

Dan wrote to me quite some time ago, and once again I must apologize: I’m currently fairly busy with other projects and can’t devote as much time to the VM as I should. Nevertheless, it’s high time I gave him the floor:

Yeah yeah, here’s another theory. Actually I’m not going into the theory, but simply asking if you can provide any assistance in resources I am seeking. Let me back up a bit – I’m a full time software developer of over 20 years, and I have had some insights regarding the manuscript. I’ve written software to generate various statistics about the document and have found some surprising and very obvious (once distilled down to hard numbers) patterns that further validate the insights. These are not “hunches”, or “gut feelings” or any mystical, nutty stuff. It’s simply what it is, and the analysis doesn’t lie.

I am currently running brute force deciphering attempts using additional software I have developed, based on my theory of how the document is ciphered. The main resource I am lacking at this time is simply word lists of the candidate languages the manuscript may have been written in (in its decoded form of course), and specifically, the vernacular and spelling of those languages when the manuscript was written in the 1400s.

I have always assumed the Voynich manuscript was a hoax, but when it was positively dated a few years ago I took a harder look, again with the expectation that it was a hoax but at least a hoax contemporary to the 15th century. My attempt was actually to prove (just to myself) that very thing – that it is just a contrived hoax. Unfortunately the insights and analysis I have done over the last few years have left no other option but to follow the logical progression until it peters out and comes to a dead end. I have not yet reached that point.

Thanks for your time, and again, if you know of simple word lists (or who can provide them or assist in that) of good candidate languages from the 15th century, that would be quite helpful.

This question isn’t so easy to answer. First of all, even when taking the C14 dating of the vellum as a given, we still have about a century of leeway regarding the actual production date of the manuscript. A century is a long time in which languages can change.

Secondly, languages weren’t “codified” as strictly as they are today, and pretty much everyone would write down their MSs in their local dialect, not to mention the fact that strict orthography wasn’t enforced yet either. This means that even two people from the same region writing at the same time wouldn’t necessarily employ the same spelling. (An extreme example of this is the Bayeux Tapestry (admittedly predating the VM by some 400 years), where the name of William the Conqueror is written, IIRC, in no fewer than seven different ways.) Hence, to make a long story short, any word list should be taken with a grain of salt.

I did some statistics in the past myself, and to get decent wordlists I simply went to, downloaded a few works I considered representative of the era, and ran my own little wordcount scripts on these files.
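A wordcount script of this kind could look roughly like the following sketch. It is only an illustration, not the original script: the function name `word_frequencies` is made up, the short Latin sample line stands in for a downloaded text, and the tokenizer (lowercase, runs of letters) is a simplifying assumption that ignores abbreviation marks and spelling variants.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word occurs in a plain-text document."""
    # Assumption: a "word" is any run of letters; case is ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

# Short sample line standing in for a downloaded period text.
sample = "In principio erat verbum, et verbum erat apud Deum"
freq = word_frequencies(sample)

# Print the three most frequent words with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```

Given the spelling caveats above, counts from such a script are only as good as the texts fed into it; variant spellings of the same word end up as separate entries.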

IMHO, prime candidates for the plaintext languages are Latin, English, French, German (including the various dialects like Swiss), and perhaps Spanish. But though I wouldn’t bet on it, more exotic options like Hungarian, Finnish or maybe the Lingua Franca can’t be ruled out either.

Sorry, but this is probably a less simple answer than you asked for.

The Cutest Critter

James recently asked me:

Just wondering what type of animal you think it is eating what is believed to be a Woad Plant on f25v

referring to this cute little critter.

I don’t think it’s supposed to be a real animal. My guess is it’s a little dragon; the scaly back, the comparatively short legs, the ears and the comb on the neck, and the fact that it may only have two legs seem a good match to me. Compare here for a 15th-century painter’s (Uccello’s) idea of what a dragon is supposed to look like.


Lately, I received a brief message —

Censorship should be consistent

which I feel deserves a bit of comment because it represents a widespread, but false notion from the web. (I presume the missive didn’t allude to the general state of international politics, but referred to my decision to block some user comments on my blog.)

First of all, “censorship” means the suppression of information or opinion, usually through a public body. This is definitely not the same thing as deciding to ignore a contribution.

But, secondly and more importantly, you seem to feel you are entitled to use my blog for your messages. This is simply wrong: contributing here is a privilege I grant (or withhold), as is the case with any private web page. My blog is not a public place to which anyone should have access, but a private BBQ I hold in my backyard. You are invited to drop by and join the party, but if you act inappropriately, I’ll kick you out, and, as the digital landlord, I’m the sole arbiter here of what constitutes appropriate behaviour. Simple as that. Play somewhere else.

Considering that I do the work of maintaining this site, that I’m legally responsible for its contents, and that ultimately I’ll also be judged on the merits of the contributions here, I feel this is only fair. You’re free to go anywhere else and party and voice your opinion there, and you will find I do nothing to hinder your free speech there. (That would be censorship.)

So while you’d be able to publish on your own, you prefer to freeload on the infrastructure I provide, insulting me with claims of “censorship” when I refuse to comply. This in itself should justify blocking your access.

“This conversation can serve no further purpose.”

A Plea to all Voynicheros

If you pursue a theory, please keep your website up-to-date.

As happened several times on the Voynich list during the last few weeks, readers were encouraged to test others’ deciphering schemes based on publications on certain websites, but ran into dead ends or couldn’t arrive at the same results as the original poster. Only later, or after complaining about this, were they told that the information on the website was outdated.

This is impolite, because it wastes your readers’ time, irritates them and discourages them from getting seriously engaged with your theory, makes them miss the point (of testing your theory), and does your reputation in general no good. So, it’s a win-win if you first update your website and then publicize it.


We put the “Brute” in the “Force”

There is no end to the string of new theories and commenters on this blog. (Keep ’em coming, boys and girls!) Today it’s Zach, who sent me more of a general question than a theory:

Please forgive me if this has already been covered in your site and I’m just not seeing it, but hasn’t anyone tried feeding the VM into a computer and brute-forcing it? Computers are really good at trying every possible combination and weeding out the ones that don’t make sense, so it seems like they ought to be perfect for grinding away at the VM by one means or another.

Any idea whether computers have been tried?

And followed it immediately up, for good measure, with some more detail:

Me again. I’m elaborating a bit, in case I really am
a) the first person to think of this and
b) there’s really no reason it wouldn’t work

I did see one article here on computers and frequency analysis, but the author made a good point about not knowing the language and so not knowing the frequency rules. I envision a brute-force that attacks the words rather than the letters. It would work like this:
First, your computer is fed lots and lots and lots of example texts, the more the better, so it can build a probability map that stores the likelihood of any given word being followed by (or just found near) a second word, and given those two words, what third word is likely to come next, and so on. You will never get firm answers, but this is fuzzy logic – as long as we can grade potential sentences as more or less likely to be linguistically correct and sensible, we are good to go.
Once the computer has this probability map, the rest is just a matter of brute force:
Start with one of the repeating ‘words’ in the VM.
Assign it an arbitrary English meaning.
Using that meaning, and the probability map, assign English words to each word preceding and following your starting word. Branch outward from there until you have ‘translated’ all words. Use your probability map to judge how likely it is that the ‘translation’ is actual English sentences (best if you can ignore word order and focus on word proximity, because word order falsely assumes they had the same grammar as we do). Store that probability and start over with a new guess for your first repeating word. Repeat over and over until you’ve found a high-probability match or, more likely, you’ve run out of choices.
Assuming you’ve not found a 100% match, present the 10 most probable ‘translations’ for human inspection.
Compiling tables of synonyms would also improve things; that way the computer could also consider how likely CONCEPTS are to be grouped together, since English and VM-speak are very unlikely to have a 1:1 mapping.

The core assumption is that, even if there are no direct word matches, there’s a 1:1 map of concepts between VM-speak and your target language (English, in my case). I think that’s a safe assumption, because if there’s no such mapping, then it seems translation would be impossible, like trying to translate a 1-time-pad cipher without the key.

Anyhoo, I hope you enjoy my ramblings. Thanks for listening.


Thanks, Zach, for your input.
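To make the “probability map” idea concrete, here is a minimal sketch of its simplest form, a bigram model that scores how plausible a candidate word sequence is. Everything in it is an assumption for illustration: the function names, the three-sentence toy corpus, and the add-one smoothing are not taken from Zach’s description, and a real attempt would need a far larger corpus and would also have to generate the candidates to be scored.

```python
import math
from collections import Counter

def build_bigram_model(corpus_sentences):
    """Count single words and adjacent word pairs in a training corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def log_score(candidate, unigrams, bigrams):
    """Log-probability of a candidate word sequence under the bigram model.

    Add-one smoothing keeps unseen pairs from scoring minus infinity.
    """
    vocab_size = len(unigrams)
    words = candidate.lower().split()
    score = 0.0
    for first, second in zip(words, words[1:]):
        score += math.log((bigrams[(first, second)] + 1) /
                          (unigrams[first] + vocab_size))
    return score

# Toy corpus, invented for illustration only.
corpus = [
    "take the root of the herb",
    "boil the root in water",
    "drink the water of the herb",
]
unigrams, bigrams = build_bigram_model(corpus)

# Same five words, once in a sensible order and once scrambled.
plausible = log_score("boil the herb in water", unigrams, bigrams)
scrambled = log_score("water herb boil in the", unigrams, bigrams)
print(plausible > scrambled)
```

The scoring part is the easy half. The brute-force half would have to try assignments of English words to VM words and rank them by such a score, and the number of possible assignments grows explosively with the size of the vocabulary.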

For a start, there is actually a wealth of computer power which has been pumped (more or less in vain) into the black hole of information we call the Voynich. I myself tried a little with my Stroke theory, for example, but these efforts are dwarfed by people like Jorge Stolfi or Julian Bunn, among others. But these mostly focus on analysis of the VM, not directly on a translation. Why is this?

Well, a brute force attack is hampered by several constraints:

  1. Our statistical material is limited. The VM comprises some 130,000 characters, which appears to be a lot. But when you look at it, that’s only some 30,000 words. If you further take into account the different encoding schemes (aka “Currier A” and “Currier B”, resp.), which differ subtly but do differ, you’re left with only a sample of some 15,000 words, which isn’t that much.
  2. We don’t know the plaintext language underlying the VM. English is possible, yet some of the marginalia point to French or Spanish, the images provide hints to Italy, and some clues point to Germany, not to mention that Latin would have been the lingua franca of the era.
  3. We know next to nothing about the subject matter, and accordingly little about the vocabulary used.
  4. We’re unclear about the ciphertext alphabet. We really have no idea whether a sequence of two connected “c”s means “two ‘c’s in a row” or is a completely different letter. (Compare this to Latin letters, where “nn” is something completely different from “m”.) We don’t know whether the “drops” above some “cc” groups merely modify the underlying letter(s) (compare “O” -> “Ö”) or turn them into a completely different letter (compare “O” -> “Q”).
  5. Some characters, like the notorious “gallows”, tend to appear only paragraph-initially or in the first line of a page. They may be embellishments of “regular” characters (as was often done in manuscripts of the era), but we don’t know which “regulars” they’d replace.

But there is one obstacle even greater than these, and much more fundamental: any “brute force” attack would presume that the ciphertext words of the VM are mapped 1:1 from the plaintext words. And this is extremely unlikely for a number of reasons:

  1. The ciphertext alphabet seems to consist of around 17 frequent letters, plus a large number of rare “weirdos”. That maps poorly to a Latin alphabet.
  2. Some frequent letter groups show up almost exclusively word-initial (“qo”) or word-terminal (“dy”). That’s unknown in any Central European language.
  3. Word-length distribution is odd: there is a shortage of both very short and very long words, so words have a comparatively uniform length. Again, this is unusual for Central European languages.
  4. Overall, the words exhibit a very regular structure: they are composed according to a fairly rigid “grammar”, the like of which is unknown in European languages. Check out Stolfi’s “Core-Mantle-Crust” paradigm. (Yes, it’s a tough read, but worth working through if you want to understand the VM.)
  5. Nobody has been able to identify particles and articles (“a”, “and”, “with”…) in the VM.
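Points 2 and 3 are the kind of thing anyone can check with a few lines of code, given a transliteration of the text. The following is a minimal sketch under stated assumptions: the function names are made up, and the eight EVA-style tokens are a tiny hand-picked sample for illustration, not a real excerpt; a meaningful test needs a full transliteration file.

```python
from collections import Counter

def length_distribution(words):
    """Map word length -> number of words of that length."""
    return Counter(len(w) for w in words)

def position_profile(words, group):
    """Count where a letter group occurs: word start, word end, or elsewhere.

    A word that consists of the group alone is counted as "initial".
    """
    profile = Counter()
    for w in words:
        start = w.find(group)
        while start != -1:
            if start == 0:
                profile["initial"] += 1
            elif start + len(group) == len(w):
                profile["final"] += 1
            else:
                profile["medial"] += 1
            start = w.find(group, start + 1)
    return profile

# Tiny hand-picked sample of EVA-style tokens, for illustration only.
sample = ["qokeedy", "daiin", "chedy", "qokain", "shedy", "otedy", "qol", "dal"]

print(sorted(length_distribution(sample).items()))
print(dict(position_profile(sample, "qo")))
print(dict(position_profile(sample, "dy")))
```

Running the same two functions over a wordlist of a candidate plaintext language gives a direct comparison: in a natural language one would expect a much broader spread of word lengths and letter groups that are not pinned to one position in the word.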

All of these differences between natural languages and the VM make it highly unlikely that the enciphering mechanism simply always turned plaintext word “A” into ciphertext word “X”, and “B” into “Y”.*) I’m convinced that one VM word is not equivalent to a plaintext word, but rather that it only represents a few letters.

There are other assumptions: Don of Tallahassee assumes it’s a list of highly abbreviated recipes, while David Suter presumes it could be encoded geographical coordinates. Theoretically, all these schemes could be attacked by brute computer force, but this would only make sense once the enciphering method was sufficiently clear. And exactly this is not the case: to my knowledge, no theory has been put forth which would sufficiently explain all the peculiarities we observe in the ciphertext, and hence there’s simply no starting point from which a computer programme could launch.

*) There are actually two scenarios where it would be just conceivable that there is a 1:1 correspondence between plaintext words and cipher words.

One is that the VM was written with the aid of a dictionary in which all plaintext words were numbered, and that in the VM their numbers were written down not in Arabic numerals, but in something like the Roman numbering system, i.e. word “259” in the dictionary would have been written “CCLIX”. While this is conceivable, up to now nobody has been able to provide a coherent numbering system which would result in the “word grammar features” mentioned above.
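As a minimal sketch of how such a dictionary scheme would work, assuming plain Roman numerals and a made-up three-word dictionary (both purely for illustration, and certainly not a proposed solution):

```python
# Value/symbol pairs for standard Roman numerals, largest first.
ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
         (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
         (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(number):
    """Write a positive integer in standard Roman numerals."""
    parts = []
    for value, symbol in ROMAN:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)

def encipher(plaintext, dictionary):
    """Replace each word by the Roman numeral of its dictionary position."""
    index = {word: pos for pos, word in enumerate(dictionary, start=1)}
    return " ".join(to_roman(index[w]) for w in plaintext.split())

print(to_roman(259))  # CCLIX

# Hypothetical three-word dictionary, purely for illustration.
dictionary = ["aqua", "herba", "radix"]
print(encipher("radix herba aqua", dictionary))  # III II I
```

Each plaintext word maps to exactly one cipher word, and every cipher word is built from a handful of symbols in a fixed order, which is why the idea is tempting. The open problem is finding a numbering system whose “numerals” actually look like VM words.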

The second idea would be that the VM was written in an artificial language, in particular in one of the “logical” or “a priori” languages. (Check out Solresol for an example.) These artificial languages construct their words from “blocks” which do resemble the “core-mantle-crust” syllables found by Stolfi. But the first comparable logical languages date from at least two centuries after the VM was written, so their use is fairly unlikely.

Die Antwort der Teutonen

After all the suggestions for the VM which arrived from Russia and France over the last few weeks, with Michael Hadlich it’s now another German VM aficionado’s turn to throw his intellectual hat into the ring, so to speak:

I did a graphical analysis of the words on page f76r. I found it strange that some words are written at different angles even when they stand very close together. So I drew a rectangle around each word to see if there are words with the same angle:

It seems there are correlations between single words with exactly the same angle. I’ll continue with the analysis to find some more relations. My first thought was of a Cardan grille. But the letters often have very long ascenders.

Another point is that the words are written not as a whole sentence but letter by letter and word by word. It looks like the author stopped after each word, sometimes after each letter. Only a few letters are connected with ligatures. This is not typical for a natural language and a very inefficient way to write. When you look at the technique with which the letters are written, you find that some letters are darker than others. This is because one can’t write many letters with the small amount of ink a quill holds. BUT: sometimes you can see phrases of uniform brightness with just one dark letter in the middle. That’s also very strange. Try writing with a quill and you’ll see what I mean.

I have two possible solutions for the points above:
1) The text is not the original book but a copy made by a person who did not understand the content.
2) The text was constructed using a mechanical device, let’s say a Cardan grille or a wheel as shown on page f57v. What looks like a word is in reality just a symbol for a word. Single “letters” are also symbols for words or numbers.

My guess is the use of a wheel as shown on page f57v. Maybe this wheel is not the one that has been used. It’s possible that the “real” wheel doesn’t exist anymore. But we can try to re-construct it.

Let’s have a look at the wheel on page f57v: Very interesting is the second circle / band (counted from the outside). You can see 17 single symbols written once in each quadrant (N/E/S/W). This is the only circle (band) on the whole disc with a double line (key) at the NW position. Obviously the disc is a device to set symbols in relation to words. One important question is: what mask (or Cardan grille) is used to see the selected word / symbol, and how is the wheel turned to point from a small symbol to a word symbol? I guess the mask looks like a disc too, but with cutouts at some positions to see “words” and “symbols” through these holes.

These are just my thoughts at the moment about the VMS, and I’m far away from a real solution. But I’ll keep on trying and tell you my findings.

I wanted to reply to Michael’s ideas, but due to my tardiness he has meanwhile apparently decided to take matters into his own hands and subscribed to the Voynich Mailing List, where a lively discussion is now going on.

Still, I’d like to publish his ideas here, too. Thus, if you don’t feel like subscribing to the list (though I highly recommend it if you have any interest in the VM, or simply in a bunch of quirky individuals), please do discuss Michael’s ideas here!

Letter from France Encore

And here is the next missive from France, this time from Stephanie Levavasseur. Stephanie has much confidence in either my French faculties or the abilities of Google Translate, and sent me her message in French, a language in which I consider myself a dilettante at best. Anyway, alors, mes enfants:

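J’ai trouvé ces fragments dans le livre : Pseudo-Apulée, De medicaminibus herbarum liber, SIUE Herbarius (pseudo-Apulée, Herbier) [I found these fragments in the book: Pseudo-Apuleius, De medicaminibus herbarum liber, SIUE Herbarius (Pseudo-Apuleius, Herbal)]

D’après l’analyse de la BNF qui accompagne le livre, plus de 5 personnes ont annoté le livre après l’auteur. [According to the BNF analysis accompanying the book, more than 5 people annotated the book after the author.]

Voici les pages du manuscrit dans lesquelles se trouvent les fragments: [Here are the pages of the manuscript on which the fragments are found:]

Qu’en pensez-vous ? des similitudes pour certaines lettres ? tentative de cryptage ? [What do you think? Similarities for certain letters? An attempt at encryption?]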

Now, if I make proper sense of Stephanie’s mail, she has come across the herbal by Pseudo-Apuleius and noted that, according to BNF’s analysis (whoever that is…), the various marginalia found therein were written by more than five different hands. This being somewhat similar to the way the VM marginalia were written, she wonders whether this could be a crib to the VM. (Forgive me, Stephanie, if I mangled your message too badly!)

Now, this looks interesting, especially f135, with the top line which, to me, looks vaguely like Sanskrit, and the characters below looking like regular Latin letters which have been blown to pieces. But I guess it would be necessary to know a little more about the history of this book in particular.

The others from the Pseudo-Apuleius look like “innocent” (ha!) writing exercises to me, with people writing down the alphabet, with the exception of the second part on f37v, which, at first glance, makes no sense to me, much like the VM marginalia.

Your opinions, ladies and gentlemen?