So, these days I got around to doing a little programming to number-crunch the VM. Assuming the Stroke Theory is the only way to salvation regarding the VM (and who would doubt it?), I did the following to find out what the “constituent syllable set” might be:
1. Extract a wordlist from the VM (Currier B only) and perform a word count.
2. Discard all “rare” words (roughly, those with two or fewer occurrences). This left me with a list of some 1,200 words, covering about 80% of tokens.*)
3. Compose a supply of prefixes and suffixes from this vocabulary, namely the most frequent beginnings and endings of VM tokens with lengths of 2 through 5 letters. (This gave a supply of roughly 140 beginnings and endings each, from which an initial working set of 32**) prefixes and 32 suffixes was picked.)
4. Prepare a list of all words which can be created by combining each one of the prefixes with one of the suffixes.
5. See how many of the tokens can be covered this way.
6. Replace one of the 32 chosen prefixes and one of the suffixes with a different one from the supply.
7. Recalculate the word coverage.
8. If the word coverage has improved, keep the change; otherwise discard it.
9. Repeat from step 6.
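For the curious, the procedure above can be sketched in a few dozen lines of Python. To be clear, this is my illustration and not the original program: the function names (`build_supply`, `coverage`, `hill_climb`) and the toy defaults are mine, and on the real data the wordlist would come from a transcription of the Currier-B pages.

```python
import random
from collections import Counter

def build_supply(words, supply_size=140):
    """Collect the most frequent word beginnings and endings
    of lengths 2 through 5 as the candidate supply."""
    pres, sufs = Counter(), Counter()
    for w in words:
        for n in range(2, 6):
            if len(w) > n:                  # leave room for the other half
                pres[w[:n]] += 1
                sufs[w[-n:]] += 1
    top = lambda ctr: [g for g, _ in ctr.most_common(supply_size)]
    return top(pres), top(sufs)

def coverage(words, prefixes, suffixes):
    """Fraction of words that split into a chosen prefix followed by a
    chosen suffix (each at least 2 letters; words of length <= 3 never match)."""
    pre, suf = set(prefixes), set(suffixes)
    hits = sum(1 for w in words
               if any(w[:i] in pre and w[i:] in suf
                      for i in range(2, len(w) - 1)))
    return hits / len(words)

def hill_climb(words, pre_supply, suf_supply, k=32, steps=9500, seed=0):
    """Swap one random prefix and one random suffix per step;
    keep the swap only if the coverage improves."""
    rng = random.Random(seed)
    pre, suf = pre_supply[:k], suf_supply[:k]   # start with the k most frequent
    best = coverage(words, pre, suf)
    for _ in range(steps):
        cand_pre, cand_suf = pre[:], suf[:]
        cand_pre[rng.randrange(k)] = rng.choice(pre_supply)
        cand_suf[rng.randrange(k)] = rng.choice(suf_supply)
        c = coverage(words, cand_pre, cand_suf)
        if c > best:
            best, pre, suf = c, cand_pre, cand_suf
    return best, pre, suf
```

Since everything is parameterised, the same sketch can be run unchanged against a wordlist from any other text for comparison.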
After only a few hundred variations, the program became “saturated”. After some 9,500 variations I aborted the run, by which point it had arrived at a coverage of 84%, meaning that with 32 prefixes and 32 suffixes, 84% of all tokens from the reduced wordlist could be composed.***)
Here are the results, in no particular order:

Prefixes: ch qo sh lche ok ot lk yk che ota she ai otai olk qoka yt ol qote qot lch ote kch da qoke qok oka cho okai cth ke oke dai te

Suffixes: dy kal ol ey ar eey chdy al ty edy kain iin cthy aiin eody key keey in kedy ky ain or ckhy kar chy ody ir eedy eor eol am eo ok
I think this is fairly promising. More work needs to go into what share each of the syllables contributes to the whole, and of course more testing against other languages is required. (And I need to compare this to Robert Firth’s work.) But it’s a start.
*) We’ll discriminate between words and tokens. While every character group contributes to the number of tokens, only distinct groups count as words. In other words, the sentence “The dog likes the food” consists of 5 tokens, but only 4 words. You can think of “words” as the “vocabulary”, while “tokens” concern the “volume” of the text.
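The distinction takes only a couple of lines of Python to demonstrate, using the sentence from this footnote:

```python
from collections import Counter

tokens = "the dog likes the food".split()
vocabulary = Counter(tokens)     # only distinct character groups

print(len(tokens))       # 5 tokens ("the" is counted twice)
print(len(vocabulary))   # 4 words
```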
**) The number 32 was chosen pretty much arbitrarily, under the assumption that 26 groups (barely enough for the Latin alphabet) wouldn’t suffice to include digits and possible special characters.
***) I tried the same with English text, but there only a coverage of around 15% could be achieved. This may have to do with English’s shorter words, though; I’ll need to compare with other languages as well.