Archive for the ‘Python’ Category

My Views on Python

Posted on July 15th, 2008 in Python | No Comments »

About a year ago, before I had been properly introduced to Python, I was of the opinion that back-end website programming should be done in PHP, and that applications should be coded in C++. I hadn’t at this stage had any kind of Computing Science or programming tuition, nor had I attempted much more than ‘hobby’ programming. When I arrived at university to study Computer Science, I was told that Python was the language that we were going to be taught in. All I knew about Python was that it could be used for CGI programming, and that it had what I considered to be odd syntax. Anyhow, after it received the whole-hearted endorsement of my lecturers I decided that I wanted learn it properly and see what all the fuss was about. 10 months on I program almost exclusively in Python; using PHP, my former ‘main’ language, makes me feel awkward and limited. Here, I present the reasons why (at least for the jobs it is suited to) Python is now my language of choice.

Syntax

The ‘unconventional’ syntax was one of the main reasons I didn’t look at Python earlier. I was used to languages with C-style syntax (C, C++, PHP, Javascript, etc.). Now, I consider the syntax to be one of the things I regard most highly about Python. Readability is a big issue in computer programming, especially if you are releasing code into the Public domain. I have found Python code to be much more instantly understandable and readable than any other language I have used. The indentation-based code blocks ensure that programmers lay out their code in a readable fashion, and also looks cleaner than C-style languages, which are littered with braces.

The syntax being so simple and so close to natural language makes Python easier to learn for beginners; it’s perfect for teaching the basics of programming to people who have never programmed before. Python still allows you to saturate your code with parentheses and cryptic variable names if that’s what you really like to do, but the basic syntax is much cleaner and certainly no less powerful than that of other languages. I also like Python’s multi-paradigm nature - it may be used as a simple imperative language, as an object-oriented language, or as a functional language.

Standard Library

Python’s ‘batteries included’ philosophy means that it has a rich and powerful standard library. Everything from regular expressions, to web protocol handling (httplib/smtplib etc.), to cryptographic hashing, to database connectivity is included. If something is missing, there will probably already be a third-party library available, which can be found with a simple Google search.

Speed

This may seem like an odd one - I’ve mentioned before that Python is not particularly fast and thus not necessarily suited to computationally expensive tasks; I stick by those claims. However, if you compare Python’s speed to other languages in the same class (‘scripting’ languages such as PHP, Perl, Ruby, etc.), it actually does pretty well. Benchmarks from the Computer Language Benchmark Game website show that Python (just vanilla CPython, not PyPy or CPython+Psyco etc.) generally performs better than PHP and Perl, and knocks Ruby out of the water. Admittedly Lua seems to be faster than Python, but that’s only really used for game scripting. On top of this, Psyco, the fantastic JIT compiler I’ve mentioned in previous posts, can allegedly speed up Python by as much as 100x! If this still isn’t quick enough, functions may be coded in C and compiled and imported as a module, which brings me on to my final topic:

Extensibility

As well as having fantastic standard library, Python is hugely extensible. Extension modules may easily be hand coded in C or C++ to eliminate bottlenecks, or bindings for entire C/C++ libraries can easily be created with SWIG or ctypes. A fantastic example of one such library is PyOpenGL, a Python wrapper around OpenGL, created using ctypes. It allows all the really hard work to be done in C and by the GPU, while allowing the flexibility of Python to used. Python may also be easily embedded in C and C++ applications and used as a scripting language (e.g. 3D software such as Maya).

I don’t think of Python as a perfect language - it isn’t right for every job and I couldn’t see myself only ever using Python, but I find programming in it more satisfying, rewarding and productive than in any of the other languages I’ve used.

Generating Content with Markov Chains

Posted on July 15th, 2008 in Python, Web Development | 2 Comments »

Reading up on Markov Chains opened my eyes to a way of generating content automatically, based on a probabilistic model. The content generation program would be able to produce pieces of text of variable (user-defined) lengths, that were semi-comprehensible. The program would first go through a training phase, where it is given the document(s) that it were to base its generated text around. For every word in the document(s) the program would record the following word.

When the text was to be generated, the program would pick a random starting word, then produce the following words by making a (weighted) random selection of the words that followed the current word in the training document(s) (based on the data recorded in the training phase). The resulting text wouldn’t make sense to a reader, but would appear at a first glance to make much more sense than just a collection of random words. One possible use for this kind of text would be for pages wishing to achieve a high PageRank in order to raise their advertising revenue (effectively spam pages that appear high up on Google - so don’t actually do it!).

I’m sure there are plenty of other implementations of this kind of program, but I’ve written my own basic one in Python. It’s not perfect, but it gets the job done.

#!/usr/bin/env python
 
from collections import defaultdict
from random import choice
 
class TextGenerator(object):
 
    def __init__(self):
        self._data = defaultdict(list)
 
    def train(self, file):
        words = [None, None]
        for line in open(file):
            for word in line.split():
                words[0], words[1] = words[1], word
                if words[0]:
                    self._data[words[0]].append(words[1])
 
    def gentext(self, num_words):
        text = []
        text.append(choice(self._data.keys()).title())
        while len(text) < num_words:
            if self._data.has_key(text[-1]):
                text.append(choice(self._data[text[-1]]))
            else:
                text.append(choice(self._data.keys()))
        return ' '.join(text) + '.'
 
if __name__ == '__main__':
    textgen = TextGenerator()
    textgen.train('pandp.txt')
    print textgen.gentext(100)

The data I used for training the program was documents I found on Project Gutenberg; ‘pandp.txt’ - the file mentioned in the code above, is Pride and Prejudice by Jane Austen, however, for use on the web you could train the program with a series of existing web pages on a particular topic. Here is an example paragraph produced by the program after being trained with Pride and Prejudice:

Productive way! I have the room; but as well married, but had escaped her every painful than you can, you have any reply. You have been. “What did not ask for the most attentive and she and solemn, apologising if he had marked their brother Gardiner’s curiosity; and Wickham, we know as they lived, and her affability, that other, and by her mother’s reproach him as your cousin Lydia’s unguarded temper, that had been his amusement. But little to place my aunt and Lydia’s interruption, “that when we should happen to guide us. We both him desperate.” The horses drew.

I extended this concept to generate English-sounding words (letters were generated to make up a word, rather than words being generated to make up a paragraph). In case you’re interested, the here is code.

Edge Detection in Python

Posted on July 13th, 2008 in Image Processing, Python | 2 Comments »

I’ve been reading up on digital image processing and wanted to have a go at implementing a Sobel edge detector. I’m fairly familiar with PIL, so I decided to implement the algorithm in Python. Python is not by any means and ideal language for digital image processing; it is far too slow for most things, but for the time being the speed doesn’t matter. Anyway, here is the basic implementation:

def sobel(img):
    if img.mode != "RGB":
        img = img.convert("RGB")
    out_image = Image.new(img.mode, img.size, None)
    imgdata = img.load()
    outdata = out_image.load()
 
    gx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
 
    for row in xrange(1, img.size[0]-1):
        for col in xrange(1, img.size[1]-1):
            pixel_gx = pixel_gy = 0
            pxval = sum(imgdata[row,col])/3
            for i in range(-1, 2):
                for j in range(-1, 2):
                    val = sum(imgdata[row+i,col+j])/3
                    pixel_gx += gx[i+1][j+1] * val
                    pixel_gy += gy[i+1][j+1] * val
            newpixel = math.sqrt(pixel_gx * pixel_gx + pixel_gy * pixel_gy)
            newpixel = 255 - int(newpixel)
            outdata[row, col] = (newpixel, newpixel, newpixel)
    return out_image

The results are surprisingly good:

Common Searching and Sorting Algorithms in Python

Posted on July 13th, 2008 in Python | 4 Comments »

I implemented some common searching and sorting algorithms in Python to improve my understanding of them in preparation for an algorithms exam. They may not be perfect implementations (unoptimised etc.) but they may help a Python-minded individual understand how they work…

Note: The selection sort and bubble sort algorithms take one argument, a list, and modify it in place as it is a mutable type. The merge sort and quicksort algorithms sorted versions of the provided lists; they don’t modify the lists in place. The search algorithms take two arguments - a list (l), and an element to be found(k). The both return the index of the found element.

Selection Sort

def selection_sort(l):
    n = len(l)
    for i in xrange(n-1):
        mini = i
        for j in xrange(i+1, n):
            if l[j] < l[mini]:
                mini = j
        l[i], l[mini] = l[mini], l[i]

Bubble Sort

def bubble_sort(l):
    n = len(l)
    for i in xrange(n-1):
        for j in xrange(n-1-i):
            if l[j] > l[j+1]:
                l[j], l[j+1] = l[j+1], l[j]

Merge Sort

def merge_sort(l):
    n = len(l)
    if n > 1:
        mid = n/2
        l1 = merge_sort(l[:mid])
        l2 = merge_sort(l[mid:])
        return merge(l1, l2)
    return l
 
def merge(l1, l2):
    result = []
    p = len(l1)
    q = len(l2)
    i = j = 0
    while i < p and j < q:
        if l1[i] <= l2[j]:
            result.append(l1[i])
            i += 1
        else:
            result.append(l2[j])
            j += 1
    result.extend(l1[i:])
    result.extend(l2[j:])
    return result

QuickSort

def quick_sort(l):
    if l == []:
        return []
    else:
        p = l[0]
        left = quick_sort([x for x in l[1:] if x < p])
        right = quick_sort([x for x in l[1:] if x >= p])
        return left + [p] + right

Sequential Search

def sequential_search(l, k):
    for i in xrange(len(l)):
        if l[i] == k:
            return i

Binary Search

def binary_search(a, k):
    count = 0
    l, r = 0, len(a)-1
    while l <= r:
        count += 1
        mid = (l+r)/2
        if k == a[mid]:
            print count
            return mid
        else:
            if k < a[mid]:
                r = mid
            else:
                l = mid+1

Python + Psyco vs Java vs C++ vs C

Posted on July 11th, 2008 in C/C++, Java, Python | 9 Comments »

In my last post, I demonstrated how Psyco can be used to speed up Python code using the example of a brute-force password cracker. To see how the speed of the Python-Psyco combination compared to the speed of other languages I ported to code to Java, C++ and C. In each language I wrote the code as if it were a regular program written in that language (i.e. I didn’t try to optimize the code) - in C++ I used C++ STL Strings, in Java I used the StringBuilder class and in C I used char arrays.

The timings for the Python+Psyco code are as follows (see last post for full details):

Password | Time taken
zzzz     | 0m0.134s
zzzzz    | 0m3.486s
zzzzzz   | 2m8.056s

Here is the Java code (I’m not a Java programmer at all - I just know the basics so I apologise if the code is poorly written):

class BrutePass {
 
    static int count = 0;
 
    public static void main( String args[] ) {
 
        if( args.length < 1 ) {
            System.out.println( "Usage: java BrutePass <password>" );
            return;
        }
 
        String pwd = args[0];
        int maxlen = 10;
 
        StringBuilder p = new StringBuilder();
 
        for( int w=0; w<maxlen; w++ )
            if( crack( w+1, 0, p, pwd ) > 0 )
                break;
 
        System.out.println( "Found password " + p.toString()
             + " with " + count + " attempts." );
 
    }
 
    public static int crack( int w, int pos, StringBuilder p, 
        String pwd ) {
 
        String chars = "abcdefghijklmnopqrstuvwxyz";
 
        for( int i=0; i<chars.length(); i++ ) {
            p.append( chars.charAt( i ) );
 
            if( pos == w-1 ) {
                count++;
                if( pwd.compareTo( p.toString() ) == 0 )
                    return 1;
            }
 
            int ret = 0;
            if( pos < w-1 )
                ret = crack( w, pos+1, p, pwd );
 
            if( ret > 0 )
                return 1;
 
            p.delete( p.length() - 1, p.length() );
        }
        return 0;
    }
}

Here are the timings for the java program:

Password | Time taken
zzzz     | 0m0.176s
zzzzz    | 0m1.334s
zzzzzz   | 0m31.369s

Much faster than the Psyco+Python version. The reason that the four-character password appears to take longer to crack in Java than in Python is that the JVM takes a while to get going. Now for the C++ code:

#include <iostream>
#include <string>
 
using namespace std;
 
int crack( int w, int pos, string& p, string& pass );
 
int main( int argc, char **argv )
{
    if( argc < 2 )
    {
        cout << "Usage: " << argv[0] << " <password>" << endl;
        return( 0 );
    }
 
    int maxlen = 10;
    if( argc > 2 )
        maxlen = atoi( argv[2] );
 
    string pwd( argv[1] );
    string p;
 
    int count;
    for( int i=0; i<maxlen; i++)
        if( count = crack( i+1, 0, p, pwd ) )
            break;
 
    cout << "Found password " << p << " with " << count << " attempts." << endl;
 
    return 0;
}
 
int crack( int w, int pos, string& p, string& pass )
{
    static int count = 0;
    string chars( "abcdefghijklmnopqrstuvwxyz" );
 
    for( int i=0; i<chars.length(); i++)
    {
        p.push_back( chars[i] );
 
        if( pos == w-1 )
        {
            if( p == pass )
                return( count );
            count++;
        }
 
        int ret = 0;
        if( pos < w-1 )
            ret = crack( w, pos+1, p, pass );
 
        if( ret )
            return( count );
 
        p.erase( p.end()-1 );
    }
    return( 0 );
}

And the timings:

Password | Time taken
zzzz     | 0m0.036s
zzzzz    | 0m0.730s
zzzzzz   | 0m18.999s

I’d heard that Java is in many cases faster than C++; clearly this is not one of those cases. Now the C version:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
unsigned long crack( int w, int pos, char *p, char *pass );
 
int main( int argc, char **argv )
{
    int maxlen = 10, i;
    unsigned long count;
    char *pwd, *p;
    if( argc < 2 )
    {
        printf( "Usage: %s <password>\n", argv[0] );
        return( 1 );
    }
 
    if( argc > 2 )
        maxlen = atoi( argv[2] );
 
    pwd = argv[1];
    p = (char *)calloc( maxlen + 1, sizeof(char) );
 
    for( i=0; i<maxlen; i++)
        if( count = crack( i+1, 0, p, pwd ) )
            break;
 
    printf( "Found password %s with %lu attempts.\n", p, count );
 
    return 0;
}
 
unsigned long crack( int w, int pos, char *p, char *pass )
{
    static unsigned long count = 0;
    int i, ret;
    char *chars = "abcdefghijklmnopqrstuvwxyz";
 
    for( i=0; i<strlen(chars); i++)
    {
        p[strlen(p)] = chars[i];
 
        if( pos == w-1 )
        {
            count++;
            if( !strcmp( p, pass ) )
                return( count );
        }
 
        ret = 0;
        if( pos < w-1 )
            ret = crack( w, pos+1, p, pass );
 
        if( ret )
            return( count );
 
        p[strlen(p)-1] = 0;
    }    
    return( 0 );
}

And the timings:

Password | Time taken
zzzz     | 0m0.022s
zzzzz    | 0m0.381s
zzzzzz   | 0m9.663s

That’s pretty much twice the speed of the C++ version! This shows how much overhead there is for using STL Strings rather than C strings.

Although Psyco provided a fantastic speed boost, it has not eliminated the need to know C as some people have claimed - the native C version of the program performed 13.25x faster than the Psyco version.

Password Cracking in Python with Psyco

Posted on July 3rd, 2008 in Python | No Comments »

A friend of mine recently implemented a brute-force password cracker in Python. I, having never looked into this kind of thing before, became curious about the implementation details. He showed me his code and it was surprisingly simple - just one recursive function and a small loop that called the function. It wasn’t actually a proper password cracker - you supplied a word and it would generate all possible words (by combining letters from a list) until the supplied word was matched. As it stands, it is useless but it wouldn’t be hard to make it produce SHA1 or MD5 hash of each word it generated - this would allow it to break hashed passwords.

The main problem with the code was that it was extremely slow. I wanted to see how much faster it would run if Psyco (a JIT complier for Python) was used. I wrote my own implementation of the original code and took a few timings:

#!/usr/bin/env python
 
import sys
 
chars = 'abcdefghijklmnopqrstuvwxyz0123456789'
 
def crack(w, pos, p, pwd):
    for c in chars:
        cr = None
        if pos < w:
            cr = crack(w, pos+1, p+c, pwd)
        if cr:
            return cr
        if p+c == pwd:
            return p+c
 
if __name__ == '__main__':
    try:
        pwd = sys.argv[1]
    except:
        print 'Usage: %s <password>' % sys.argv[0]
        sys.exit(1)
    p = None
    w = 1
    max_len = 10
 
    while not p and w < max_len:
        p = crack(w, 0, '', pwd)
        w += 1
    print 'Found', p

Here are the results of the timings for 4, 5 and 6 character passwords (all timings taken on a 2.4GHz Macbook Pro w/ 2GB RAM):

$ time ./python_brutepass zzzz
Found zzzz

real	0m0.567s
user	0m0.542s
sys	0m0.013s
$ time ./python_brutepass zzzzz
Found zzzzz

real	0m18.779s
user	0m18.632s
sys	0m0.046s
$ time ./python_brutepass zzzzzz
Found zzzzzz

real	11m9.131s
user	11m4.323s
sys	0m1.190s

11 minutes to match a 6 character word (which is only lowercase characters and numbers)?! Once you’d added in a hashing step, I imagine this would take a great deal longer still.

I added in a couple of lines at the top of the code:

import psyco
psyco.full()

and took the timings again:

$ time ./python_brutepass zzzz
Found zzzz

real	0m0.134s
user	0m0.112s
sys	0m0.014s
$ time ./python_brutepass zzzzz
Found zzzzz

real	0m3.486s
user	0m3.435s
sys	0m0.022s
$ time ./python_brutepass zzzzzz
Found zzzzzz

real	2m8.056s
user	2m6.994s
sys	0m0.254s

With Psyco enabled it ran ~5x quicker - not bad for an extra two lines of code.

Anyway, these examples show that such computationally expensive operations aren’t really possible in Python. To find a 5 character password, the maximum number of operations needed (assuming no hashing is done) is ~62193781. I worked this out as follows. If n is the length of the password (5), and l is the number of letters being used (36 in this case for lowercase letters + numbers from 0 to 9), the total number of possible permutations with repitition for each length of word is ln, but as the length of the word would not be known, the total number of permuatations with repitition is the sum of lk for k=0 to n. This is actually a geometric progression and thus can be calculated like this:

Binary Representation of Numbers in Python

Posted on July 2nd, 2008 in Python | 2 Comments »

Over the last few months I’ve found myself playing with binary numbers a great deal - for investigating various algorithms, for working out bit masks, for understanding bit-twiddling operations (such as using xor for clearing a register in assembly language), etc. At the moment, I’m predominantly a Python user (I love the clean syntax, the power and speed, and the compatibility with C), but I’ve found no easy way of representing numbers in binary form in Python (or indeed in any other language).

Clearly the ability to represent numbers in binary form would have little practical use in a programming language, but for purely academic reasons I decided to write a very basic binary class in Python. At the moment it only supports positive integers (its just plain base 2, not two’s complement), and is limited in its range of features, but I have found it to useful on occasion (mostly when using Python in the command line mode).

class Bin(object):
 
    def __init__(self, num):
        num = str(num)
        if [d for d in num if int(d) > 1]:
            raise(TypeError, 'not a binary number')
        self.__num = self.__bin_str_to_int(num)
 
    def __bin_str_to_int(num):
        return sum(2**n for n in range(len(num)) if int(num[-n-1]))
 
    def __get_bin_str(self):
        binstr = ''
        n = self.__num
        while n > 0:
            binstr = str(n % 2) + binstr
            n /= 2
        return binstr
 
    def __int__(self):
        return self.__num
 
    def __repr__(self):
        return str(self.__bin_str)
 
    __bin_str = property(fget=__get_bin_str)

Here is an example of its basic usage:

>>> from bintype import Bin
>>> x = Bin(11001100)
>>> x
11001100
>>> int(x)
204

In it’s current state it is of little use, but with some simple operator overloading (__add__, __lshift__, etc.) and support for negative numbers using two’s complement it could provide a useful way of testing basic operations on binary integers.