Higher Stroke Count

So, these days I got around to doing a little programming to number-crunch the VM. Assuming the Stroke Theory is the only way to salvation regarding the VM (and who would doubt it?), I did the following to find out what the “constituent syllable set” might be (a rough code sketch of the procedure follows the list):

  1. Extract a wordlist from the VM (Currier B only), and perform a word count
  2. Discard all “rare” words (roughly, those with 2 occurrences or fewer). This left me with a list of some 1200 words, covering about 80% of all tokens.*)
  3. Compose a supply of candidate prefixes and suffixes from this vocabulary, namely the most frequent beginnings and endings of VM tokens. (The 32**) most frequent word prefixes and suffixes of each length from 2 through 5 letters were chosen, giving a supply of roughly 140 beginnings and roughly 140 endings.)
  4. Pick an initial working set of 32 prefixes and 32 suffixes from this supply, and prepare a list of all words which can be created by combining one of the chosen prefixes with one of the chosen suffixes.
  5. See how many of the tokens can be covered this way.
  6. Replace one of the 32 chosen prefixes and one of the suffixes with a different one from the supply.
  7. Recalculate the word coverage.
  8. If the word coverage has improved, keep the change, otherwise discard it.
  9. Repeat from step 6.
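
For the curious, here's a rough Python sketch of this search loop. It's not my actual program, just an illustration of the idea under a few assumptions: word_counts stands for the frequency list from steps 1 and 2, prefix_pool and suffix_pool for the supply from step 3, and the initial 32+32 selection is simply drawn at random.

```python
import random

def coverage(prefixes, suffixes, word_counts):
    """Fraction of tokens whose word splits into one chosen prefix plus one
    chosen suffix. Each token is counted at most once, however many splits match."""
    covered = total = 0
    for word, count in word_counts.items():
        total += count
        if any(word[:i] in prefixes and word[i:] in suffixes
               for i in range(1, len(word))):
            covered += count
    return covered / total

def hill_climb(word_counts, prefix_pool, suffix_pool, set_size=32, steps=9500):
    """Steps 4-9: start from an arbitrary selection, then repeatedly swap one
    prefix and one suffix against the supply, keeping a swap only if it helps."""
    prefixes = set(random.sample(sorted(prefix_pool), set_size))
    suffixes = set(random.sample(sorted(suffix_pool), set_size))
    best = coverage(prefixes, suffixes, word_counts)
    for _ in range(steps):
        # Step 6: replace one chosen prefix and one chosen suffix with fresh ones.
        trial_p = (prefixes - {random.choice(sorted(prefixes))}) \
                  | {random.choice(sorted(prefix_pool - prefixes))}
        trial_s = (suffixes - {random.choice(sorted(suffixes))}) \
                  | {random.choice(sorted(suffix_pool - suffixes))}
        # Steps 7-8: recalculate coverage; keep the change only if it improved.
        score = coverage(trial_p, trial_s, word_counts)
        if score > best:
            prefixes, suffixes, best = trial_p, trial_s, score
    return prefixes, suffixes, best
```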

After only a few hundred variations, the program became “saturated”. After some 9,500 variations I aborted the run; by then it had arrived at a coverage of 84%, meaning that 84% of all tokens from the reduced wordlist could be composed from 32 prefixes and 32 suffixes.***)

Here are the results in no particular order:

Prefixes:
ch qo sh lche ok ot lk yk che ota she ai otai olk qoka yt ol qote qot lch ote kch da qoke qok oka cho okai cth ke oke dai te

Suffixes:
dy kal ol ey ar eey chdy al ty edy kain iin cthy aiin eody key keey in kedy ky ain or ckhy kar chy ody ir eedy eor eol am eo ok

I think this is fairly promising. More work needs to go into what share each of the syllables contributes to the whole, and of course more testing against other languages is required. (And I need to compare this to Robert Firth’s work.) But it’s a start.

*) We’ll distinguish between words and tokens. While every character group contributes to the number of tokens, only distinct groups count as words. In other words, the sentence “The dog likes the food” consists of 5 tokens, but only 4 words. You can think of “words” as the “vocabulary”, while “tokens” measure the “volume” of the text.
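
In code terms the distinction looks like this (lower-casing, so that both occurrences of “the” count as the same word):

```python
tokens = "The dog likes the food".lower().split()
print(len(tokens))       # 5 tokens
print(len(set(tokens)))  # 4 words: the vocabulary {'the', 'dog', 'likes', 'food'}
```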

**) The number 32 was chosen more or less arbitrarily, on the assumption that 26 groups (barely enough for the Latin alphabet) wouldn’t suffice once digits and possible special characters are included.

***) I tried the same with English text, but there only a coverage of around 15% could be achieved. This may have to do with English words being shorter, though; I’ll need to compare with other languages as well.

2 thoughts on “Higher Stroke Count”

  1. I think that “instances” would be preferable to “tokens”.

    I’d also say that I’m not too comfortable with prefixes that include “da” and “dai”, as well as “ok”, “ot”, “oka”, “ota”, “okai”, and “otai”, and with suffixes that include “in”, “iin”, and “ain”.

    This seems to be carving fairly arbitrary slices through what are very likely to be consistent sub-units of the text – “ain” / “aiin” / “aiiin”.

    Yes, the text could conceivably have been specifically constructed to look that way – but the larger the set of incidental structural attributes you need to add in to do so, the further away from the truth you seem to be slipping.

  2. Hi Nick,

    I partly agree; I’m not fully satisfied with the design of this experiment. (It was the best I could come up with in a short period of time, and I’m open to better suggestions.) It’s not a very sophisticated thing, and the “slicing” is indeed a problem:

    For example, if “oka” and “ok” are tested as prefixes, and “iin” and “aiin” as suffixes, a word like “okaiin” will trigger matches for both the “oka/iin” and the “ok/aiin” pairs. Since one word can thus score multiple hits, it’s conceivable to achieve a total coverage of more than 100% this way… (I never thought too highly of statistics anyway. :-/)
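
    To make that concrete, here is a tiny, hypothetical example of how one token scores two hits when every possible split is counted:

```python
prefixes = {"ok", "oka"}
suffixes = {"iin", "aiin"}
word = "okaiin"
# Every split point that yields a known prefix and a known suffix counts as a hit.
hits = [(word[:i], word[i:]) for i in range(1, len(word))
        if word[:i] in prefixes and word[i:] in suffixes]
print(hits)  # [('ok', 'aiin'), ('oka', 'iin')] -> one token, two hits
```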

    I need to check the whole matrix of prefixes and suffixes and see if there’s anything interesting going on. Even if the whole Stroke thing should go down the drain, perhaps we gain some more insight into VM word structure.
