Generating Content with Markov Chains
Posted on July 15th, 2008 in Python, Web Development
Reading up on Markov Chains opened my eyes to a way of generating content automatically, based on a probabilistic model. The content generation program would be able to produce pieces of text of variable (user-defined) length that are semi-comprehensible. The program would first go through a training phase, in which it is given the document(s) on which to base its generated text. For every word in the document(s), the program would record the word that follows it.
When the text is to be generated, the program would pick a random starting word, then produce each following word by making a (weighted) random selection from the words that followed the current word in the training document(s), using the data recorded in the training phase. The resulting text wouldn’t make sense to a reader, but at first glance it would appear to make much more sense than a collection of random words. One possible use for this kind of text would be for pages wishing to achieve a high PageRank in order to raise their advertising revenue (effectively spam pages that appear high up on Google - so don’t actually do it!).
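To make the training and selection steps concrete, here is a minimal sketch on a toy sentence. It is written for Python 3, and the names `build_table` and `sample` are made up for illustration. The point it shows is that if every follower is stored, duplicates included, a plain uniform pick from that list is already a weighted selection.

```python
from collections import defaultdict
import random


def build_table(text):
    """Map each word to the list of words that followed it (duplicates kept)."""
    table = defaultdict(list)
    words = text.split()
    # Pair every word with the word that comes directly after it.
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table


sample = "the cat sat on the mat and the cat slept"
table = build_table(sample)

# "the" was followed by "cat" twice and "mat" once.
print(table["the"])  # ['cat', 'mat', 'cat']

# A uniform pick from that list chooses "cat" about two times in three,
# so the weighting falls out of the stored duplicates.
next_word = random.choice(table["the"])
print(next_word)
```

Storing raw follower lists uses more memory than storing counts, but it keeps the selection step down to a single `random.choice` call.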
I’m sure there are plenty of other implementations of this kind of program, but I’ve written my own basic one in Python. It’s not perfect, but it gets the job done.
#!/usr/bin/env python
# Note: this is Python 2 code (print statement, dict.has_key).
from collections import defaultdict
from random import choice


class TextGenerator(object):
    def __init__(self):
        # Maps each word to the list of words seen directly after it.
        self._data = defaultdict(list)

    def train(self, file):
        words = [None, None]
        for line in open(file):
            for word in line.split():
                # Slide a two-word window: (previous word, current word).
                words[0], words[1] = words[1], word
                if words[0]:
                    self._data[words[0]].append(words[1])

    def gentext(self, num_words):
        text = []
        # Start from a random known word, capitalised.
        text.append(choice(self._data.keys()).title())
        while len(text) < num_words:
            if self._data.has_key(text[-1]):
                # Duplicates in the list make this a weighted choice.
                text.append(choice(self._data[text[-1]]))
            else:
                # No recorded follower for this word: pick any known word.
                text.append(choice(self._data.keys()))
        return ' '.join(text) + '.'


if __name__ == '__main__':
    textgen = TextGenerator()
    textgen.train('pandp.txt')
    print textgen.gentext(100)
The data I used for training the program came from documents I found on Project Gutenberg; ‘pandp.txt’, the file mentioned in the code above, is Pride and Prejudice by Jane Austen. For use on the web, however, you could train the program with a series of existing web pages on a particular topic. Here is an example paragraph produced by the program after being trained with Pride and Prejudice:
Productive way! I have the room; but as well married, but had escaped her every painful than you can, you have any reply. You have been. “What did not ask for the most attentive and she and solemn, apologising if he had marked their brother Gardiner’s curiosity; and Wickham, we know as they lived, and her affability, that other, and by her mother’s reproach him as your cousin Lydia’s unguarded temper, that had been his amusement. But little to place my aunt and Lydia’s interruption, “that when we should happen to guide us. We both him desperate.” The horses drew.
I extended this concept to generate English-sounding words (letters were generated to make up a word, rather than words being generated to make up a paragraph). In case you’re interested, here is the code.
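As a rough idea of how the letter-level version could work, here is a minimal Python 3 sketch. It is not the code referred to above. The function names, the `^` and `$` boundary markers and the tiny word list are all assumptions made for illustration. Each letter records the letters that followed it, and a word ends when the end marker is drawn or a length cap is reached.

```python
from collections import defaultdict
import random

START, END = "^", "$"  # hypothetical markers for the start and end of a word


def train_letters(words):
    """Map each letter (or START) to the letters that followed it."""
    table = defaultdict(list)
    for word in words:
        letters = [START] + list(word.lower()) + [END]
        for current, following in zip(letters, letters[1:]):
            table[current].append(following)
    return table


def generate_word(table, max_length=12):
    """Walk the table from START until END is drawn or max_length is hit."""
    letters = []
    current = START
    while len(letters) < max_length:
        current = random.choice(table[current])
        if current == END:
            break
        letters.append(current)
    return "".join(letters)


training_words = ["pride", "prejudice", "bennet", "darcy", "wickham"]
letter_table = train_letters(training_words)
word = generate_word(letter_table)
print(word)
```

With only one letter of context the results tend to be fairly rough; keying the table on the previous two or three letters should give more English-sounding output, at the cost of needing more training words.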