Markov chain
A Markov chain, in the context of Caves of Qud, is a method of generating new text from a "corpus" of existing text. The game pulls from two files [[File locations|in the base game directory]]: <code>QudCorpus.txt</code>, a compilation of all of the game's dialogue, descriptions, quest info, and help text; and <code>LibraryCorpus.txt</code>, a collection of public domain books from Project Gutenberg. Specific titles include [https://www.gutenberg.org/ebooks/29444 <i>The Machinery of the Universe: Mechanical Conceptions of Physical Phenomena</i>] and [https://www.gutenberg.org/ebooks/38687 <i>Zoological Mythology; or, The Legends of Animals</i>]. This method is used to generate text for white-titled [[books]], [[telepathy|dreaming creatures]], graffiti, urn engravings, and {{favilink|glowcrow}} dialogue.

Jason Grinblat, aka Ptychomancer, also gave a talk at the Roguelike Celebration about Markov generation:
The general algorithm chops the corpus up into key-value groups, with the size of each key set by the "order" of the Markov chain. Caves of Qud uses an order of two, so each key consists of at most two words.

As a simplified example, take the sentence "Oh, a quetzal is a pretty bird in the trogon family." The game will split this into the following pairs:

<pre>
[&COh, a&y] => &Wquetzal
Oh, [&Ca quetzal&y] => &Wis
Oh, a [&Cquetzal is&y] => &Wa&y
Oh, a quetzal [&Cis a&y] => &Wpretty&y
...
</pre>
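The pairing step above can be sketched in Python. The game's actual implementation is in C# (see <code>XRL.MarkovChain.cs</code> under Further Reading); this is only an illustrative sketch, and the function name <code>build_chain</code> is hypothetical:

```python
from collections import defaultdict

def build_chain(text, order=2):
    # Split the text into words, then map each run of `order`
    # consecutive words (the key) to the list of words observed
    # to follow it in the text (the values).
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

chain = build_chain("Oh, a quetzal is a pretty bird in the trogon family.")
# ("Oh,", "a") -> ["quetzal"], ("a", "quetzal") -> ["is"], ...
```

Note that punctuation stays attached to its word ("Oh," and "family." are tokens), which is why seed phrases must match the corpus's punctuation exactly.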

<pre>
{
...
["is a"]: {pretty, billowing, measurement, heavy,...}
...
}
</pre>
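Generation then amounts to walking this table: start from a two-word seed, append a randomly chosen successor, slide the two-word window forward, and repeat until the chain dead-ends or a length limit is reached. A hedged Python sketch of the general technique, using a toy hand-built table rather than the game's corpus:

```python
import random

# Toy order-2 table: keys are two-word tuples, values are lists
# of observed successors. Illustrative data only.
chain = {
    ("is", "a"): ["pretty", "billowing", "heavy"],
    ("a", "pretty"): ["bird"],
    ("a", "billowing"): ["cloak"],
    ("a", "heavy"): ["club"],
}

def generate(chain, seed, max_words=20):
    key = tuple(seed.split())
    out = list(key)
    for _ in range(max_words):
        successors = chain.get(key)
        if not successors:  # dead end: nothing follows this pair
            break
        nxt = random.choice(successors)
        out.append(nxt)
        key = (key[1], nxt)  # slide the two-word window forward
    return " ".join(out)

print(generate(chain, "is a"))  # e.g. "is a pretty bird"
```

Because only the last two words are consulted at each step, the output is locally plausible but has no long-range coherence, which is what gives markov-generated books their dreamlike quality.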
* <b> Initial Phrasing -</b> Because of how the data is organized, seeds must be exactly two words. There is no fuzzy searching: the seed must match the corpus in case and punctuation. For example, the model considers "Of the" and "of the" to be two distinct phrases: the first occurs only at the start of a sentence, while the second appears only mid-sentence.
   
   
* <b> Corpus Size -</b> The corpus consists only of the game's descriptions and dialogue plus some public domain texts, about 1MB in total, compared to the roughly 40GB of text GPT-2 was trained on. The two are completely different text-generation techniques, but the comparison gives a sense of scale. Because of this comparatively small corpus, an arbitrary seed phrase will usually not work: the exact phrase must appear in the corpus. To check whether a phrase is in the corpus, you can <code>ctrl+F</code> through <code>QudCorpus.txt</code>. If you are using Cryptogull, you can use <code>?incorpus <word></code> to see all phrases that contain that word.
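The case-sensitivity caveat above follows directly from the keys being exact strings. A quick sketch (the sample sentence is invented, not taken from the game's corpus files):

```python
from collections import defaultdict

# Invented sample text containing both "Of the" and "of the".
corpus = "Of the many rivers of Qud, few flow north of the salt."
chain = defaultdict(list)
words = corpus.split()
for i in range(len(words) - 2):
    chain[(words[i], words[i + 1])].append(words[i + 2])

# "Of the" (sentence start) and "of the" (mid-sentence) are
# distinct keys, so seed lookup is case-sensitive.
print(("Of", "the") in chain)  # True
print(("of", "the") in chain)  # True, but a different entry
print(chain[("Of", "the")] == chain[("of", "the")])  # False
```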
 


== Further Reading ==
*<code>XRL.MarkovChain.cs</code>
*<code>XRL.MarkovChainData.cs</code>
*[https://github.com/TrashMonks/cryptogull/blob/main/bot/helpers/corpus.py Cryptogull's markov generation module]
[[Category:Guides]]
