Okay, since the fat lady hasn’t sung, the Stroke theory (as outlined here for your elucidation) isn’t dead yet, and the other day I had what I thought was a revelation.
As you probably recall (because I’ve been harping on about it endlessly), the Stroke theory makes it essential to discern the set of “syllables” which compose the words of the ciphertext, since each of these “syllables”, or fragments, represents one of the plaintext letters. It’s a bit akin to the knapsack problem, where you are given a number of blocks of given sizes and have to find out how to optimally fill a certain shaped space. In the case of the VM, you have scores of differently shaped spaces and must find the minimum set of building blocks.
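To make the segmentation problem concrete, here is a minimal sketch of the basic test: can a given ciphertext word be composed from a candidate syllable set at all? (The word and syllables below are toy values of my choosing, not actual results; the check is a standard dynamic-programming pass over word prefixes.)

```python
def can_compose(word, syllables):
    """Check whether `word` can be written as a concatenation of
    entries from `syllables`, by dynamic programming over prefixes."""
    # reachable[i] is True if word[:i] can be built from the syllable set
    reachable = [True] + [False] * len(word)
    for i in range(len(word)):
        if not reachable[i]:
            continue
        for s in syllables:
            if word.startswith(s, i):
                reachable[i + len(s)] = True
    return reachable[len(word)]

print(can_compose("daiin", {"d", "aiin"}))  # True
print(can_compose("daiin", {"qo", "ky"}))   # False
```

Note that a set of only single-letter syllables makes this return True for every word, which is exactly the degenerate “perfect match” trap described below.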
Now, of course one could write a program to do this by brute-force number crunching, trying ever-different syllable sets to see which one renders the best match. Alas, such a naive approach quickly hits a dead end, because the best fit to the ciphertext is trivially a set of syllables consisting of only one letter each. That would indeed guarantee a 100% perfect match, but it is useless under the assumptions of the Stroke theory. This is what held me up for quite some time.
But the epiphany I had was that the weighting system for my program should not be the percentage of covered text. Rather, whether a solution is better or worse than another should be determined by the number of syllables required to synthesize the better part of the VM’s volume. In other words: the longer the syllables, the better.
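Something along these lines is what I have in mind for the fitness function (a sketch only: the function name and the toy frequencies are mine, and the 80% cut-off is the validity threshold mentioned further down):

```python
def score(syllables, composable, freqs):
    """Fitness of a syllable set under the revised weighting:
    reward long syllables instead of raw coverage, but treat any set
    covering less than 80% of the text volume as invalid (score 0).
    `freqs` maps words to their counts; `composable` is the subset of
    words the set can actually synthesize (precomputed separately)."""
    volume = sum(f * len(w) for w, f in freqs.items())
    covered = sum(f * len(w) for w, f in freqs.items() if w in composable)
    if volume == 0 or covered / volume < 0.8:
        return 0.0
    # mean syllable length: longer syllables score higher
    return sum(len(s) for s in syllables) / len(syllables)

freqs = {"daiin": 10, "ol": 5}  # toy word frequencies
print(score({"d", "aiin"}, {"daiin", "ol"}, freqs))  # 2.5
print(score({"d", "aiin"}, {"ol"}, freqs))           # 0.0 (coverage too low)
```

The hard cut-off at 80% is what keeps the optimizer from simply inflating syllable lengths while abandoning most of the ciphertext.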
In the end, the resulting syllable set could be used as a starting point to work out how the original plaintext letters were segmented. Frequency analysis would then effectively reduce the solution of the VM to a monoalphabetic substitution cipher.
I gave this a run of about 24 hours, and noticed for one thing that this amounted to tests of only about 15,000 different syllable sets. Checking a syllable set is computationally expensive, because a set of 50 syllables can be employed in an almost infinite number of ways to compose words, and we have to test at least the 150 most frequent VM words to arrive at any useful statistics anyway.
I started from a regular syllable set with minimal syllables like “a”, “b”, etc., and allowed for random mutations of individual characters in that set: adding, removing, or changing one character at a time. While this should render a pretty good result in the long run, it also meant that many variations would have little or no effect on the result at all, reducing the number of “helpful” variations even further.
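The mutation step itself can be sketched like so (the Latin alphabet here is just a stand-in; a real run would draw from the transcription’s glyph set, and the function name is my own):

```python
import random

# Stand-in alphabet; an actual run would use the transcription's glyphs.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def mutate(syllables, rng=random):
    """One random mutation of the set: add, remove, or change a single
    character in one randomly chosen syllable."""
    pool = sorted(syllables)
    i = rng.randrange(len(pool))
    s = pool[i]
    op = rng.choice(["add", "remove", "change"])
    if op == "remove" and len(s) == 1:
        op = "add"  # never mutate a syllable down to the empty string
    if op == "add":
        pos = rng.randrange(len(s) + 1)
        s = s[:pos] + rng.choice(ALPHABET) + s[pos:]
    elif op == "remove":
        pos = rng.randrange(len(s))
        s = s[:pos] + s[pos + 1:]
    else:  # change
        pos = rng.randrange(len(s))
        s = s[:pos] + rng.choice(ALPHABET) + s[pos + 1:]
    pool[i] = s
    return set(pool)

print(mutate({"a", "b", "qo"}, random.Random(0)))
```

One weakness is visible right away: many mutations produce a syllable that never matches the ciphertext at all, which is precisely why so few variations turn out to be “helpful”.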
But anyway, it became apparent that I probably wasn’t going anywhere. After some initial progress, the program stalled: the average syllable length didn’t grow much beyond two letters or so, nor did the coverage of the VM (by volume) exceed 80%. The latter was all the more disappointing since I had set 80% as the minimum a syllable set must cover to still be considered valid. (Of course it’s trivial to create syllable sets of maximum syllable length if there is no obligation to actually cover the better part of the ciphertext.)
So, once more I hit a wall, and once more the results are tantalisingly ambiguous: a 95% “saturation value” would have given me lots of confidence in my approach; 40% would have clearly shown that I was, once and for all, on the wrong track. But 80%? It’s surprisingly close to the results of previous attempts regarding the Stroke theory. One of the problems is certainly the limited number of mutations available to the test syllable set. It may also have been wiser to choose a different transcription than Takahashi’s, which I think less and less of the more I look at it. And finally, the whole approach of composing words may be wrong; perhaps it would be wiser to dissect the existing ciphertext instead.
Someone has given me an idea…