Generating Content with Markov Chains

Posted on July 15th, 2008 in Python, Web Development

Reading up on Markov chains opened my eyes to a way of generating content automatically from a probabilistic model. The content generation program produces semi-comprehensible pieces of text of a user-defined length. It first goes through a training phase, in which it is given the document(s) that the generated text will be based on. For every word in the document(s), the program records the word that follows it.

To generate text, the program picks a random starting word, then produces each subsequent word by making a weighted random selection from the words that followed the current word in the training document(s), using the data recorded in the training phase. The resulting text won’t make sense to a reader, but at first glance it appears far more sensible than a collection of random words. One possible use for this kind of text would be for pages wishing to achieve a high PageRank in order to raise their advertising revenue (effectively spam pages that appear high up on Google, so don’t actually do it!).
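To make the training and selection steps concrete, here is a minimal sketch on a toy sentence. The function names `build_table` and `next_word_probabilities` are made up for this illustration. The sketch assumes that followers are stored in a plain list with repeats kept, so that a uniform pick from the list behaves as a weighted pick over distinct words.

```python
from collections import Counter, defaultdict


def build_table(text):
    """Training phase: record, for every word, the words that follow it."""
    table = defaultdict(list)
    words = text.split()
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return table


def next_word_probabilities(table, word):
    """Chance of each follower when picking uniformly from the stored list."""
    followers = table[word]
    counts = Counter(followers)
    return {w: n / len(followers) for w, n in counts.items()}


table = build_table("the cat sat on the mat and the cat slept")
print(table["the"])  # ['cat', 'mat', 'cat']
print(next_word_probabilities(table, "the"))
```

Because "cat" follows "the" twice and "mat" only once, "cat" is twice as likely to be chosen after "the".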

I’m sure there are plenty of other implementations of this kind of program, but I’ve written my own basic one in Python. It’s not perfect, but it gets the job done.

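#!/usr/bin/env python3

from collections import defaultdict
from random import choice


class TextGenerator:

    def __init__(self):
        # Maps each word to the list of words seen directly after it.
        # Repeats are kept, so choice() picks a follower in proportion
        # to how often it occurred in the training text.
        self._data = defaultdict(list)

    def train(self, filename):
        previous = None
        with open(filename) as f:
            for line in f:
                for word in line.split():
                    if previous is not None:
                        self._data[previous].append(word)
                    previous = word

    def gentext(self, num_words):
        keys = list(self._data)
        current = choice(keys)
        # Capitalise the first word for display only; lookups use the
        # word exactly as it was seen during training.
        text = [current.title()]
        while len(text) < num_words:
            if current in self._data:
                current = choice(self._data[current])
            else:
                # Dead end (the word was never followed by anything):
                # restart from a random word.
                current = choice(keys)
            text.append(current)
        return ' '.join(text) + '.'


if __name__ == '__main__':
    textgen = TextGenerator()
    textgen.train('pandp.txt')
    print(textgen.gentext(100))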

The training data came from documents I found on Project Gutenberg; ‘pandp.txt’, the file mentioned in the code above, is Pride and Prejudice by Jane Austen. For use on the web, however, you could train the program with a series of existing web pages on a particular topic. Here is an example paragraph produced by the program after being trained with Pride and Prejudice:

Productive way! I have the room; but as well married, but had escaped her every painful than you can, you have any reply. You have been. “What did not ask for the most attentive and she and solemn, apologising if he had marked their brother Gardiner’s curiosity; and Wickham, we know as they lived, and her affability, that other, and by her mother’s reproach him as your cousin Lydia’s unguarded temper, that had been his amusement. But little to place my aunt and Lydia’s interruption, “that when we should happen to guide us. We both him desperate.” The horses drew.

I extended this concept to generate English-sounding words (letters are generated to make up a word, rather than words being generated to make up a paragraph). In case you’re interested, here is the code.
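The linked code is not reproduced on this page, so the following is only a rough sketch of how a letter-level version might look, not the original implementation. The names `train_letters` and `make_word`, the `END` marker, the empty-string start state and the tiny word list are all assumptions made for the example.

```python
import random
from collections import defaultdict

END = None  # hypothetical marker meaning "the word stops here"


def train_letters(words):
    """Record which letter (or END) follows each letter in the words."""
    table = defaultdict(list)
    for word in words:
        previous = ""  # the empty string stands for "start of word"
        for letter in word.lower():
            table[previous].append(letter)
            previous = letter
        table[previous].append(END)
    return table


def make_word(table, max_length=12):
    """Walk the chain from the start state until END or max_length."""
    letters = []
    current = ""
    while len(letters) < max_length:
        current = random.choice(table[current])
        if current is END:
            break
        letters.append(current)
    return "".join(letters)


sample = ["pride", "prejudice", "bennet", "darcy", "netherfield", "wickham"]
table = train_letters(sample)
print([make_word(table) for _ in range(5)])
```

With a real word list (a dictionary file, say) the output tends to look more English-like; using the previous two letters as the state instead of one would be a natural refinement.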

Web Security and Encryption - Missing a Middle Ground?

Posted on July 11th, 2008 in Security, Web Development

It occurs to me that online security is generally either very thorough or non-existent. If a website needs to be secure it uses SSL. If security is not utterly essential, there is often no security used at all: everything is transmitted in plain text. What about sites that have login forms and take some potentially confidential information from clients, but can’t afford a full SSL certificate?

SSL provides extremely strong encryption and uses a very effective protocol to help with authentication. This protocol makes use of a Certificate Authority (CA): the CA issues the certificate that the server uses in the protocol, and is trusted by the client. It is possible to generate your own certificates, but if the browser encounters a certificate that has not been issued by a trusted CA, it displays a message warning users that the site may not be secure. This feature is vital, as it warns users not to enter confidential information (such as credit card details) on a site that is not absolutely secure.
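As a rough illustration of the same trust check outside the browser, Python’s standard `ssl` module can build a client context that verifies certificates against the system’s trusted CAs. The helper name `certificate_is_trusted` is hypothetical; it is defined but not called here because it needs network access.

```python
import socket
import ssl

# Default client settings: require a certificate that chains to a trusted
# CA and matches the hostname being contacted.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)


def certificate_is_trusted(host, port=443, timeout=5):
    """Return False if the server's certificate fails verification,
    for example because it is self-signed (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

A self-signed certificate makes the handshake fail in this sketch, which is the programmatic counterpart of the browser warning described above.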

There are, however, sites where absolute security isn’t necessary. SSL certificates cost money. To give you some idea, VeriSign, one of the largest (if not the largest) CAs, charges £599 + VAT for their mid-range SSL certificate (Secure Site Pro, 128-bit encryption). To be fair, sites such as RapidSSL offer cheaper certificates, but for hobby sites or sites on very limited budgets any price is too much. More and more sites these days require you to sign in to access their full content, and I’d bet that most people re-use passwords across different sites (including for their web-based e-mail?). If this is the case, logging in to a site that doesn’t use SSL could mean that a user’s password is sent across the internet in plain text, which makes it completely vulnerable to eavesdropping attacks. If this same password provides access to an e-mail account, the user’s identity could potentially be stolen (an e-mail account can be used to gain access to other sites using the “I’ve forgotten my password” forms).

To me it seems absurd that there isn’t already a mid-level security system, one that uses regular HTTP and is free, even if it is not as secure as SSL. Even a system that didn’t bother encrypting pages, and only encrypted form data, would be useful. For the time being, I will present the way that I prevent passwords from being sent in plaintext in login forms. (Note: this is not completely secure by any means, and only works with JavaScript-enabled browsers.)

This example walks through a login page being displayed and the user logging in with their username and password. A password stored on a server shouldn’t be plaintext; for this example we will assume that an MD5 hash of the password is stored.

  1. The server generates a random salt and embeds it in some JavaScript code on the page.
  2. The user fills in the form with their username and password.
  3. When the form is submitted, some JavaScript code catches the event and pauses the submission.
  4. The JavaScript code produces the MD5 hash of the user’s password. It then concatenates the hash with the salt provided by the server. An MD5 hash is then produced of the result and submitted in place of the password.
  5. The server identifies the user by their session and recalls the salt from a session variable. It also produces an MD5 hash of the concatenation of the password hash that is retrieved from the database and the randomly generated salt. If the result of this matches the user’s input, their password is valid.
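The steps above can be sketched in a few lines. This is only a simulation under stated assumptions: both sides are written in Python here, whereas in the scheme described the client half runs as JavaScript in the browser. The function names (`md5_hex`, `client_response`, `server_check`), the example password and the use of `secrets.token_hex` for the salt are illustrative choices, not taken from the original code.

```python
import hashlib
import hmac
import secrets


def md5_hex(text):
    """Hex MD5 digest of a string (MD5 only because the post assumes it)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def client_response(password, salt):
    """Steps 3-4: what the browser would submit in place of the password."""
    return md5_hex(md5_hex(password) + salt)


def server_check(stored_hash, salt, response):
    """Step 5: recompute from the stored hash and the session's salt."""
    expected = md5_hex(stored_hash + salt)
    return hmac.compare_digest(expected, response)


# The server stores only the MD5 hash of the password (example value).
stored_hash = md5_hex("correct horse")

# Step 1: a fresh random salt, kept in the session and sent with the page.
salt = secrets.token_hex(16)

# Step 2 onwards: the user types a password and the form is submitted.
print(server_check(stored_hash, salt, client_response("correct horse", salt)))
print(server_check(stored_hash, salt, client_response("wrong guess", salt)))
```

Because the salt changes for every login page, a response captured from one session should not validate against the salt of another.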

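As I stated before, this is not entirely secure: session cookies can still be stolen and most of the transmission is still plaintext. It does, however, keep the user’s password from being read by a passive eavesdropper. An active man-in-the-middle could still tamper with the JavaScript sent to the browser, so it is not a defence against that kind of attack.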