Smart Force Required

Fellow Voynichero Rich SantaColoma just asked in a different post why I wouldn’t get my lazy butt up and do a little brute-force statistics regarding my Stroke theory: namely, if each plaintext letter is always represented by the same group of ciphertext letters (which I’ll call a “syllable”), why not simply count the ciphertext syllables and then do a reasonable frequency match?*)
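
For the sake of argument, such a naive count-and-match could look like the sketch below. Everything in it is a placeholder made up for illustration: the EVA-like snippet, the syllable inventory, and the use of English letter frequencies as the comparison target; it shows the shape of the attack, nothing more.

```python
from collections import Counter

# Made-up placeholders: a scrap of EVA-like text and a guessed syllable inventory.
ciphertext_words = "daiin qokeedy chedy qokaiin shedy daiin otedy".split()
syllables = ["daiin", "qo", "keedy", "chedy", "kaiin", "shedy", "ot", "edy"]

# Count how often each candidate syllable occurs inside the ciphertext words.
counts = Counter()
for word in ciphertext_words:
    for syl in syllables:
        counts[syl] += word.count(syl)

# Rank syllables by frequency and line them up against the letter frequencies
# of some candidate language (English order used purely for illustration;
# problem 1 below is precisely that we do not know which language to pick).
cipher_ranking = [syl for syl, _ in counts.most_common()]
letters_by_frequency = list("etaoinshrdlcumwfgypbvkjxqz")

for syl, letter in zip(cipher_ranking, letters_by_frequency):
    print(f"{syl!r} -> {letter!r}")
```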

Actually, I had a similar idea some time ago, sat down at my computer, fired up my trustworthy interpreter, and stalled. It dawned on me that a few problems stood in the way of a brute-force statistical attack:

  1. We don’t know the plaintext language, hence we don’t know the frequency distribution of its letters.
  2. We don’t know the plaintext character set used, i.e. whether it was cursive (bâtarde or modern?), block writing, print letters, etc. This has grave consequences for the number of strokes required to compose each letter, and hence for the length of the corresponding syllable, not to mention for the relationships between different syllables.
  3. We can’t even count on the plaintext being written in a 26-letter Latin alphabet. Letters like “j”, “y” or “x” may well be missing.
  4. Special characters (digits, astrological symbols) complicate matters.
  5. We don’t know the ciphertext alphabet, i.e. we don’t know whether daiin and daiir are really two distinct words or not; we can’t even be sure c, h, and e are different letters.
  6. Most annoyingly, we also don’t know exactly what the syllable repertoire is. VM words apparently are mostly composed of more than one syllable, but where the syllable “boundaries” run is unclear: is qocheedy supposed to be split qo-cheedy, qoch-eedy, or perhaps even qochee-dy?**) (A small sketch of this ambiguity follows right after this list.)
  7. We only have limited statistical material, namely some 70,000 characters at the most.***)
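
To make problem 6 tangible, here is a small sketch that enumerates every way a word can be split over a candidate syllable inventory. The inventory is invented purely for illustration; the real one is exactly what we do not know.

```python
def segmentations(word, syllables):
    """Yield every way to split `word` into pieces drawn from `syllables`."""
    if not word:
        yield []
        return
    for syl in syllables:
        if word.startswith(syl):
            for rest in segmentations(word[len(syl):], syllables):
                yield [syl] + rest

# Invented inventory; with the real text we do not even know this set.
inventory = ["qo", "ch", "che", "chee", "eedy", "edy", "dy", "qoch", "qochee"]

for split in segmentations("qocheedy", inventory):
    print("-".join(split))
# e.g. qo-ch-eedy, qo-chee-dy, qoch-eedy, qochee-dy ... all equally plausible.
```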

It seems what is required is not a brute force, but a Smart Force(tm) attack.

*) He actually used a much more friendly wording.

**) Robert Firth had an idea, but apparently was not able to find a solution that completely satisfied him. As always, there are a number of solutions which yield varying degrees of success, but none with a 100% match. I plan to do some analysis on the labels, which should help at least insofar as the word boundaries of the labels seem to be more clear-cut than those of words in the continuous text.

***) Out of a total of roughly 120,000 characters. But given the different Currier hands, it’s reasonable to assume that different enciphering schemes were used for Currier A and B; hence only one of the two should be used for any statistical test.


9 thoughts on “Smart Force Required”

  1. Gotcha. Then how about this: perhaps we would be surprised if we examined a cross-sampling of many of the variables you mention, and more, for stroke counts… perhaps there are some (currently unknown) universal laws to the distribution of strokes throughout various languages and character types? Couldn’t hurt to try.

    For instance, if it turned out that cursive, secretary, and block letters, in German and Latin, all had roughly the same distribution of various strokes… such as “c”, or a forward slant, or whatever… if this “turned out” to be the case, then neither the underlying language of the VMs nor its script would matter. It would then allow a rough picture of the strokes used, to help reassemble it… and see which one it was.

    If, if, if, of course… we don’t know if there is a universal law to stroke distribution across types and languages… ULSDATYL?… ha. But even if it was mostly a mad mess of statistics, one stroke may jump out… say, “vertical stroke”, or the like… perhaps one stroke remains prominent, no matter the root text’s source and characters. If so, that may give a starting point.

    And yes, then, we would still have the universal problem you mention, determining the intended structure of the VMs text… what is a character, a word break, etc… it always comes down to reducing your trials, and if there is a way to do this, perhaps it is a crack in the door…
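
For what it’s worth, the comparison hinted at above could be prototyped along the following lines. Every table in the sketch is invented toy data; real stroke decompositions would have to come from palaeographic sources, and the letter frequencies from an actual corpus.

```python
from collections import Counter

# Entirely invented toy tables of how each letter might decompose into stroke
# types in two different letterforms; real tables would come from palaeography.
strokes_block   = {"a": ["curve", "vertical"], "e": ["curve", "horizontal"],
                   "n": ["vertical", "arch"],  "t": ["vertical", "horizontal"]}
strokes_cursive = {"a": ["loop", "vertical"],  "e": ["loop"],
                   "n": ["minim", "minim"],    "t": ["vertical", "crossbar"]}

# Made-up relative letter frequencies standing in for some candidate language.
letter_freq = {"a": 0.30, "e": 0.35, "n": 0.20, "t": 0.15}

def stroke_distribution(stroke_table, letter_freq):
    """Weight each stroke type by how often the letters containing it occur."""
    dist = Counter()
    for letter, strokes in stroke_table.items():
        for stroke in strokes:
            dist[stroke] += letter_freq.get(letter, 0.0)
    total = sum(dist.values())
    return {stroke: round(weight / total, 3) for stroke, weight in dist.items()}

# If such distributions came out similar across scripts (and languages), that
# would be the kind of "universal law" speculated about above.
print(stroke_distribution(strokes_block, letter_freq))
print(stroke_distribution(strokes_cursive, letter_freq))
```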

  2. One of the very few universal feature combinations I could make out is this: sentences start with capital letters, many capital letters feature a long vertical stroke (“I”), and most of the VM paragraphs begin with a gallows letter.

    Hence: gallows = “I”?

    OTOH, see my related abortive stroke of genius

    I’m afraid a statistical approach will fail due to the large amount of uncertainty. Going at it by deduction seems the only viable way to me. (Which I hate, because I’d much rather crunch numbers…)

  3. Well, any way you look at it, we are screwed of course… as hundreds of others, and thousands before, have been. But this sort of exercise is important, because something may jump out… something totally unexpected. Keep up the good work…

  4. I’m pondering writing a program that would “automagically” find the minimum syllable set required to compose most of the ciphertext. That would give us an indication of the “alphabet”/syllable repertoire used, and the rest ought to be fairly easy.

    But I’m still at a loss as to how to sensibly perform such an analysis. A brute-force attack checking all possible combinations seems prohibitively expensive in terms of calculation time, while more sophisticated approaches will probably be sensitive to errors in the transcription choices.

    Any suggestions welcome.
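
One hedged illustration of a cheaper alternative to exhaustive search, offered as a sketch only (toy data, a plain greedy heuristic, not anyone’s settled method): grow the inventory step by step, always adding the n-gram that claims the most still-uncovered text.

```python
from collections import Counter

def greedy_syllables(words, max_len=5, target_coverage=0.9):
    """Greedy heuristic: instead of testing every possible inventory, repeatedly
    add the n-gram that claims the most still-uncovered characters."""
    remaining = [list(w) for w in words]          # covered characters become None
    total_chars = sum(len(w) for w in words)
    covered = 0
    inventory = []

    def visible(word):
        # Render a word with already-covered characters masked out.
        return "".join(c if c is not None else "#" for c in word)

    while covered < target_coverage * total_chars:
        # Score every candidate n-gram by (occurrences x length) over uncovered text.
        scores = Counter()
        for word in remaining:
            text = visible(word)
            for n in range(1, max_len + 1):
                for i in range(len(text) - n + 1):
                    gram = text[i:i + n]
                    if "#" not in gram:
                        scores[gram] += n
        if not scores:
            break
        best, _ = scores.most_common(1)[0]
        inventory.append(best)
        # Mask out non-overlapping occurrences of the chosen syllable.
        for word in remaining:
            i = visible(word).find(best)
            while i != -1:
                for j in range(i, i + len(best)):
                    word[j] = None
                covered += len(best)
                i = visible(word).find(best)
    return inventory

# Toy input; a real run would use a full transcription of one Currier hand only.
print(greedy_syllables(["daiin", "qokeedy", "qokaiin", "chedy", "shedy", "otedy"]))
```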

  5. One idea which occurs to me would be to start with the first words of the first sentences of many pages. These can reasonably be assumed to be the beginnings of possible stroke sequences, whereas you cannot assume this for the beginning of a VMs word or sentence in the bulk of a page (that would be presumptuous).

    Look at the first two, three, four, and so on, groups of all the first page words. When a list of these is compiled and applied to the bulk of any one specific page, they can be “dropped out”… leaving the remaining stroke sequences.

    Do you get what I am saying? It is like “assuming a reasonable control” to narrow the possibles and leave the unknowns. It would work with any non-transposed system, because we can count on, with reasonable certainty, the first words of each page representing the beginnings of the sequences used.
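
A minimal sketch of that “drop out the known openings” procedure, using single characters instead of glyph groups for simplicity, and entirely invented word lists standing in for the page-initial words and for the body of one page:

```python
# Invented stand-ins for page-initial words and for the body of one page.
page_initial_words = ["pchedy", "polshedy", "tchody"]   # first words of many pages
page_body_words    = ["pcheol", "tchodaiin", "qokeedy"] # bulk of one specific page

# Compile every prefix (first 2, 3, 4, ... characters) of the page-initial words.
openings = set()
for word in page_initial_words:
    for n in range(2, len(word) + 1):
        openings.add(word[:n])

# Drop the longest matching opening from each body word, keeping the remainder.
for word in page_body_words:
    match = max((p for p in openings if word.startswith(p)), key=len, default="")
    print(f"{word}: opening={match or '-'}, remainder={word[len(match):] or '-'}")
```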

  6. I see what you’re getting at. (Of course, this assumes that page breaks are also always sentence breaks. ;-)

    Currently I’m also pondering the labels. We may be reasonably sure that label boundaries are also plaintext letter boundaries.

    But then… So much to do, so little time. (And so easily distracted… ;-)

  7. I like your idea (from some time back I think) that the labels may be “pointers” to the body text, and not contain useful information in and of themselves. I liked it because it would explain the seemingly minimal information they would be able to carry otherwise… at best one might assume they are abbreviated, if not “pointers”.

    If pointers, then most likely letters or numbers (as opposed to VMs proprietary symbols), and if so, then yes I see that they can be good examples of “clean” sequences, which demonstrate a unit of stroke information from start to end.

    No point here… just reflecting your point, and agreeing it is a good idea. These ideas of course have importance to other code candidates, too…
