Markov chain
A Markov chain, in the context of Caves of Qud, is a method of generating new text from a "corpus" of existing text. The game pulls from a file in the base game directory, LibraryCorpus.json, which is a compilation of most of the game's dialogue, descriptions, quest info, and help text. It also draws on the texts in the Corpus folder, a collection of excerpts from public domain books from Project Gutenberg. The specific titles are:
- The Machinery of the Universe: Mechanical Conceptions of Physical Phenomena
- Meteorology; or, Weather Explained
- Thought-Forms.
This method is used to generate text for white-titled books, dreaming creatures, graffiti, urn engravings, and glowcrow dialogue.
Some end-game items and conversations are marked with the ExcludeFromCorpusGeneration tag to exclude them from the Markov corpus. This restricts the corpus to texts accessible earlier in the game, so that generated text does not spoil later content.
Jason Grinblat, aka Ptychomancer, also gave a talk at the International Roguelike Celebration about Markov generation. The talk also covers Ruin of House Isner generation, which this article does not go into.
Preparing the Model
In general, the corpus is split into key-value groups, where the size of each key depends on the "order" of the Markov chain. Caves of Qud uses an order of two, so each key consists of at most two words.
As a simplified example, take the sentence "Oh, a quetzal is a pretty bird in the trogon family." The game will split this into the following pairs:
{
    ["Oh, a"]: {"quetzal"},    # This is the start of the sentence, so these are OpeningWords
    ["a quetzal"]: {"is"},
    ["is a"]: {"pretty"},
    ...
}
The already processed output is also stored in LibraryCorpus.json. Note that the file does not look like the above example: it requires additional parsing to make it usable.
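The splitting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the game's code: the names `build_model` and `opening_keys` are hypothetical, words are split on whitespace, a sentence is assumed to end at any word ending in a period, and followers are kept in a list (so repeats are preserved). It does not reproduce the actual layout of LibraryCorpus.json.

```python
from collections import defaultdict

def build_model(text, order=2):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    opening_keys = []          # keys that begin a sentence (cf. OpeningWords)
    starts_sentence = True
    for i in range(len(words) - order):
        key = " ".join(words[i:i + order])
        chain[key].append(words[i + order])
        if starts_sentence:
            opening_keys.append(key)
        # The next key opens a sentence if the word we are leaving ended one.
        starts_sentence = words[i].endswith(".")
    return dict(chain), opening_keys

chain, opening_keys = build_model(
    "Oh, a quetzal is a pretty bird in the trogon family."
)
print(chain["Oh, a"])      # ['quetzal']
print(chain["a quetzal"])  # ['is']
print(opening_keys)        # ['Oh, a']
```

With a one-sentence corpus every key has exactly one follower, which is why the generation step described in the next section can only reproduce the original sentence.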
Generation
The game chooses two starting words, usually from among the OpeningWords so that the output starts coherently. Using the one-sentence corpus above as an example, the Markov chain starts with the two words "Oh, a". The "chain" part of the name comes from the fact that once a starting point is established, the rest follows based on what is likely to occur next. Looking up "Oh, a" in the dictionary returns "quetzal", so the current output is now "Oh, a quetzal". The chain then takes the last two words (the number is set by the order) and repeats the process: "a quetzal" is looked up in the dictionary, which returns the next word, "is".
[Oh, a] => quetzal
This continues until the chain reaches a period or 300 words have been generated. With a corpus that is only one sentence long, it can only output that sentence: "Oh, a quetzal is a pretty bird in the trogon family." This is also why some sentences are repeated word for word in white-titled books: the source sentence uses such unique words or phrasing that only one word can follow each pair. For example, "Klanq puff on debt".
Things become more complex as the corpus grows. In the full corpus, there are several possible values that "is a" can lead to: "billowing", "small", etc.
{ ... ["is a"]: {pretty, billowing, measurement, heavy,...} ... }
When there are multiple possible values, one will be randomly chosen. Some possible sentences:
Oh, a quetzal [is a] billowing cloak...
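The generation loop described in this section can be sketched as follows. This is a minimal illustration, not the code in XRL.MarkovChain.cs: the function name `generate`, its parameters, and the toy `chain` table are all assumptions made for the example, and the dead-end check is an addition needed only because the toy table is so small.

```python
import random

def generate(chain, seed, max_words=300, rng=random):
    """Extend `seed` (a two-word key) until a period or `max_words` words."""
    output = seed.split()
    while len(output) < max_words and not output[-1].endswith("."):
        key = " ".join(output[-2:])      # the last two words form the next key
        choices = chain.get(key)
        if not choices:                  # dead end: nothing followed this pair
            break
        output.append(rng.choice(choices))
    return " ".join(output)

# Toy table for illustration only; the real corpus has far more entries.
chain = {
    "Oh, a": ["quetzal"],
    "a quetzal": ["is"],
    "quetzal is": ["a"],
    "is a": ["pretty", "billowing"],     # more than one follower: random pick
    "a pretty": ["bird."],
    "a billowing": ["cloak."],
}
print(generate(chain, "Oh, a"))
```

Each run prints either "Oh, a quetzal is a pretty bird." or "Oh, a quetzal is a billowing cloak.", depending on which follower of "is a" is picked.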
Limitations
The game has an optional seeding parameter for choosing the two starting words. Cryptogull, the official Discord server's helper bot, also has this parameter. This section applies to both unless specified otherwise, although the limitations only significantly affect Cryptogull, because the model is not designed to be a general text generator.
- Initial Phrasing - Because of how the data is organized, a seed must be exactly two words. There is no fuzzy searching, so the seed must match the case and punctuation of the corpus exactly. For example, the model considers "Of the" and "of the" to be two distinct phrases: the first occurs at the start of a sentence, while the second only appears somewhere in the middle.
- Corpus Size - The corpus consists only of the game's descriptions and dialogue and some public domain texts. This is about 1MB, compared to the 40GB of text used to train GPT-2. These are two completely different algorithms, but the comparison gives a sense of scale. Because the corpus is comparatively small, an arbitrary phrase will not work as a seed: that exact phrase must be in the corpus. To check whether a phrase is in the corpus, you can ctrl+F in LibraryCorpus.json to see if it appears. If you are using Cryptogull, you can use ?incorpus <word> to see all possible phrases that contain that word.
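The exact-match behavior described in the list above can be illustrated with a small sketch. The helper names `is_valid_seed` and `keys_containing` are hypothetical, as are the placeholder followers in the toy table; this is not how the game or Cryptogull's ?incorpus command is actually implemented, only a demonstration of why case and punctuation matter when keys are looked up directly.

```python
def is_valid_seed(chain, seed):
    """A seed works only if it is two words and matches a key exactly."""
    return len(seed.split()) == 2 and seed in chain

def keys_containing(chain, word):
    """List every key that contains `word` as a whole, exact-case word."""
    return sorted(key for key in chain if word in key.split())

# Toy table for illustration only; the follower words are placeholders.
chain = {
    "Of the": ["first"],    # a sentence opener
    "of the": ["second"],   # appears mid-sentence
    "Oh, a": ["quetzal"],
}

print(is_valid_seed(chain, "Of the"))       # True
print(is_valid_seed(chain, "OF THE"))       # False: case must match
print(is_valid_seed(chain, "Oh a"))         # False: the comma is part of the key
print(is_valid_seed(chain, "of the salt"))  # False: seeds are exactly two words
print(keys_containing(chain, "the"))        # ['Of the', 'of the']
```

A plain dictionary lookup is what makes "Of the" and "of the" separate entries; a more forgiving search would need to normalize case and punctuation first, which the text above says is not done.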
Further Reading
- XRL.MarkovChain.cs
- XRL.MarkovChainData.cs
- Cryptogull's markov generation module