craigradcliffe.com

August 26, 2008

A script to generate RSS feeds for wlu.ca

Filed under: Posts — craig @ 12:41 pm

So RSS has been popular on the web for about a bajillion years and table layouts have been passé for even longer. Yet Wilfrid Laurier University’s website still uses tables and still does not have RSS feeds for its news items. I’ve griped about this before, and attempted a primitive version of a screen scraper for the website, but changes to the site tended to break that version.

Using BeautifulSoup and PyRSS2Gen, both Python libraries, I’ve created a new screen scraper that should be fairly robust thanks to BeautifulSoup’s forgiving HTML parsing engine.

So far, the scraper produces RSS feeds for the main news page and the Physics and Computer Science department news page. Any other news pages on the Laurier site can be scraped by adding a line to the script file. Let me know if there’s a WLU news page you would like me to scrape.

RSS Feeds:
Main Page
Physics and Computer Science

Code:

#!/usr/bin/env python

import sys
import os
import datetime
import time

import urllib2
import PyRSS2Gen
import BeautifulSoup

debug = False

def parsePage(url, filename):
    rss_items = []
    prefix = "http://www.wlu.ca/"
    document = urllib2.urlopen(url)
    souped_doc = BeautifulSoup.BeautifulSoup(document)
    main_title = souped_doc.head.title.contents[0]
    for table in souped_doc('table', 'news'):
        for tr in table('tr'):
            if tr.td is not None and not tr.td.has_key('colspan'):
                title = "".join([str(i) for i in tr('td')[1].a.contents])
                date = datetime.datetime(*(time.strptime(tr('td')[0].contents[0], "%b %d/%y")[0:6]))
                link = tr('td')[1].a['href']
                if not link.startswith("http"):
                    link = prefix + link
                guid = PyRSS2Gen.Guid(link)
                if debug:
                    print "Title: %s" % title
                    print "Date: %s" % date
                    print "URL: %s" % link
                    print "Guid: %s" % guid
                    print "===================="
                rss_items.append(PyRSS2Gen.RSSItem(title=title,
                                                   link=link,
                                                   description="",
                                                   guid=guid,
                                                   pubDate=date))
    output = PyRSS2Gen.RSS2(title=main_title,
                            link=url,
                            description="",
                            lastBuildDate = datetime.datetime.now(),
                            items=rss_items)
    # Close the file explicitly so the feed is flushed to disk.
    rss_file = open(filename, "w")
    output.write_xml(rss_file)
    rss_file.close()

if __name__ == "__main__":
    parsePage("http://www.wlu.ca/news_listing.php", "wlu_main.xml")
    parsePage("http://www.wlu.ca/news_listing.php?grp_id=2", "wlu_physcomp.xml")
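The two conversions inside the loop, parsing the date cell and resolving relative links, can be tried in isolation. This is only a sketch in Python 3 syntax (the script above is Python 2). The helper names parse_news_date and absolute_link and the sample inputs are made up for illustration, and urljoin is swapped in for the plain string concatenation the script uses:

```python
import datetime
import time
from urllib.parse import urljoin

PREFIX = "http://www.wlu.ca/"

def parse_news_date(text):
    """Turn a date cell such as 'Aug 26/08' into a datetime."""
    return datetime.datetime(*time.strptime(text, "%b %d/%y")[0:6])

def absolute_link(href, prefix=PREFIX):
    """Resolve a relative href against the site prefix; full URLs pass through."""
    return urljoin(prefix, href)
```

Unlike prefix + link, urljoin does not produce a double slash when the href already starts with "/".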

August 25, 2008

Announcing CryptoWorkflow Version 1.0

Filed under: Posts — craig @ 1:52 pm

I wrote a learning tool for cryptography as a directed research project for credit this summer, supervised by Dr. Angèle Hamel. It’s now at version 1.0 and available here.

From the website:

CryptoWorkflow is a tool for learning the intricacies of cryptographic algorithms. This tool allows you to step through component parts in an algorithm to see exactly what is happening to the data at each point. CryptoWorkflow also allows you to enter your predictions for the result at each step, and will check your work for correctness.

CryptoWorkflow currently has implementations of Vigenère, ElGamal, Simplified DES, and Simplified AES.

August 1, 2008

Tool for off-campus access to library resources

Filed under: Posts — craig @ 5:00 pm

For remote users of libraries who use the EZProxy software, I found a pretty good Firefox extension that rewrites a URL to use the proxy for journal article access. This is useful in that I can use Google Scholar to search for articles, click through to the article on a journal’s website (e.g. SpringerLink or ACM), and then click the extension’s button in the browser.

Normally, I would have to cut and paste the proxy address (remote.libproxy.wlu.ca in the case of Wilfrid Laurier University) into the URL to get access to the article, but with this extension I just click a button instead. A small improvement, but a good one, although ideally, Google Scholar would have some setting that would allow me to set a proxy for the journal websites.

June 25, 2008

The exec statement

Filed under: Posts — craig @ 10:48 pm

The exec statement is one of those things that programmers are no doubt really split on. exec takes a string and runs it as if it were a regular block of Python code. So, exec "import os" would, in fact, import the os module. One could also do something like this:

g = [i for i in "formatter"[-2:]]
g.reverse()
exec "import %s" % "".join(g)

which would import the re module, but that would be ludicrous. It’s for this very reason that I’m sure Python programmers are divided on the necessity of the exec statement, but it suits my purposes rather nicely.
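For the import case specifically, a module can also be loaded from a computed name without exec. A small sketch in Python 3 syntax, where exec is a function rather than a statement. importlib.import_module is the standard-library call; the variable names are only illustrative:

```python
import importlib

# Build the module name the same roundabout way: the last two letters
# of "formatter", reversed, give "re".
name = "".join(reversed("formatter"[-2:]))

# Load the module by name and bind it to a variable of our choosing.
module = importlib.import_module(name)

# The exec version still works in Python 3, but it is clearer to give it
# an explicit namespace dictionary to write into.
namespace = {}
exec("import %s" % name, namespace)
```

After this runs, both module and namespace["re"] refer to the same re module object.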

One of the problems I was faced with was how to turn strings containing variables and functions into calls to those functions. My solution was the exec statement. The steps method is an instance method of the Workflow class and provides an iterator over the steps in the algorithm. It is passed a dictionary of inputs, to which it applies the algorithm’s steps. It does this by iterating over the self.__operations list, which contains the names (as strings) of the functions called by each step in the algorithm. The code (as it stands now) is as follows:

class Workflow(object):
[ . . . ]
    def steps(self, inputs):
        """
        Runs the workflow step-by-step. Provides an
        iterator.

        Input: A dictionary of inputs.
        Output: A dictionary of outputs.
        """
        for k,v in inputs.items():
            exec "%s = %s" % (k,v)
        for op in self.__operations:
            results = []
            inputs = ", ".join(op['inputs'])
            outputs = ", ".join(op['outputs'])
            exec "%s = plugins.%s(%s)" % (outputs, op['name'], inputs)
            for i in op['outputs']:
                exec "results.append(%s)" % i
            # Yield inside the loop so each step's results are produced in turn.
            yield results

The first exec statement initializes local variables to the input values, which will later be referenced by the algorithm’s steps. The second statement, enclosed in a for loop, executes the step’s function on the step’s input values, and assigns the outputs to the variable names specified by the step’s output parameter.

This is a perfect example of Python just working. I would anticipate that in a language like Java or C, I would have had to construct a complex case statement to obtain the same result.

UPDATE: The 1.0 version of CryptoWorkflow does not use any exec statements. Instead, the program calls functions embedded in a workflow module using the __dict__ attribute.
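The exec-free approach can be sketched roughly as follows. This is not the CryptoWorkflow 1.0 source: the plugins stand-in, the shape of the operation dictionaries, and the functions shift and double are invented for illustration, and Python 3 syntax is assumed. Each step's function is looked up by name through __dict__, and values are carried in an ordinary dictionary instead of local variables:

```python
import types

# Stand-in for a module of plugin functions (hypothetical names).
plugins = types.SimpleNamespace(
    shift=lambda x, k: x + k,
    double=lambda x: x * 2,
)

class Workflow(object):
    def __init__(self, operations):
        self._operations = operations

    def steps(self, inputs):
        """Yield the list of outputs produced by each step, in order."""
        values = dict(inputs)  # variable name -> current value
        for op in self._operations:
            func = plugins.__dict__[op['name']]  # look the function up by name
            args = [values[name] for name in op['inputs']]
            result = func(*args)
            # Assumption: a step with several outputs returns a tuple.
            if len(op['outputs']) == 1:
                result = (result,)
            values.update(zip(op['outputs'], result))
            yield [values[name] for name in op['outputs']]

ops = [
    {'name': 'shift', 'inputs': ['m', 'k'], 'outputs': ['s']},
    {'name': 'double', 'inputs': ['s'], 'outputs': ['d']},
]
step_results = list(Workflow(ops).steps({'m': 3, 'k': 4}))
```

Because nothing is ever evaluated as source text, a malformed input value cannot turn into executed code, which is the main practical argument for dropping exec.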

Python tricks

Filed under: Posts — craig @ 10:28 pm

For a directed research credit, I’m working on a cryptographic learning tool that allows users to step through crypto algorithms and see the result after certain operations have been performed. For this project, I’ve needed to use a bunch of Python features to accomplish exactly what I want.

I want my application to be as flexible as possible, and with that in mind I set out to develop a plugin and extension system that would allow other users to implement other algorithms in the program. I decided to use a plugin directory for functions required by the algorithms. Every .py file dropped into that folder is automatically imported into the system for use by any algorithm. For the algorithms themselves, I decided to describe them using YAML, a very user-friendly data serialization format that has a really nice Python parser.
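A plugin directory of this kind can be sketched with the standard library alone. This is an illustration under assumptions, not the project's code: the function name load_plugins is hypothetical, Python 3 syntax is used, and the YAML side is left out because it needs a third-party parser:

```python
import importlib.util
import pathlib

def load_plugins(directory):
    """Import every .py file in `directory`; return {module name: module}."""
    modules = {}
    for path in sorted(pathlib.Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, str(path))
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # run the file's top-level code
        modules[path.stem] = module
    return modules
```

A function defined in plugins/caesar.py would then be reachable as load_plugins("plugins")["caesar"].some_function, where the directory and file names are again only examples.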

I’m going to post some of my favourite Python constructs over the next few days, starting with the exec statement.

June 21, 2008

Canada’s DMCA: A Letter to the Leader of the Official Opposition

Filed under: Posts — craig @ 12:25 am

I sent this letter today to Stéphane Dion, Leader of the Official Opposition, regarding Bill C-61: An Act to Amend the Copyright Act. The Bloc and NDP have come out in opposition to the bill, but the Liberals have not stated which way they’re voting:

Dear M. Dion,

As the Leader of the Official Opposition and the Leader of the Liberal Party of Canada, you have a unique role to play in the issue of copyright reform. Your caucus’ vote on Bill C-61 will make the difference between a Canada that protects innovation and consumers’ rights, and a Canada that is stuck in the last century. The other two opposition parties have come out in opposition to this bill, and so it rests with you and your colleagues to determine which Canada you want.

Bill C-61 is hostile to Canadians who want to get reasonable enjoyment out of their lawfully-purchased media, and who wish to take advantage of the incredible technological advances that have been made and may arise in the future. This bill is a bad bill and criminalizes technology that a vast majority of technologically-savvy Canadians use on a regular basis.

I have outlined in detail my concerns with this legislation in an email to Industry Minister Jim Prentice, which I have enclosed below. I have a great deal of respect for you as a parliamentarian and an intellectual, so I hope you will not disappoint and that you will oppose this bill.

No matter the cost of opposing this motion, I know that Canadians will be behind you one-hundred percent.

Sincerely,
Craig Radcliffe
Owner
Radcliffe Computing Services

Canada’s DMCA: A Letter to the Government

Filed under: Posts — craig @ 12:24 am

I sent this letter today in reaction to Bill C-61: An Act to Amend the Copyright Act (some links about it):

Dear Ministers Prentice and Verner,

Bill C-61 as presented by your government today is an absolute affront to the rights of Canadians to enjoy legitimately-purchased media. The exemptions for DRM technology will encourage the use of DRM technology and make criminals out of ordinary, law-abiding citizens.

Restricting the reasonable enjoyment of lawfully-purchased media will only prove a disincentive to consumers who, already in violation of the law for wanting to shift their DRM media purchase, will instead forgo purchasing the media for the freely-available and equally-illegal Internet download.

As a computer scientist, researcher, and business owner, I am appalled at the hostile attitude that this bill demonstrates towards technological innovation. Under bill C-61, software and hardware innovators would have to tread lightly around DRM, no matter how weak, for fear of breaking the law. No longer would it be legal to develop free software to play DVDs, upload songs to an iPod, or turn a video game console into a full-fledged computer. All of these are currently non-infringing uses of technology, yet all would be outlawed if this bill were to pass. Bright young minds looking to learn experientially would be stifled and criminalized for wanting to learn more about the workings of a DRM technology, and Canada’s place on the world stage of innovation would be in grave jeopardy.

Your government promised a new Copyright Act that recognized the changing times and dealt fairly with all parties involved. What you have delivered is little more than lip-service to consumers and a carte blanche for American media conglomerates to commence bullying litigation against Canadian citizens.

I urge you to retract bill C-61 and reconsider your government’s stance on consumers’ rights, the rights of researchers, and the future of Canadian innovation.

Sincerely,

Craig Radcliffe
Owner
Radcliffe Computing Services

April 8, 2008

Google Code Search

Filed under: Posts — craig @ 6:24 pm

Not sure whether or not this has been out for a while, but Google Code Search is pretty neat. I found it in a special search box while I was googling for the C function execvp.

March 25, 2008

C type declarations

Filed under: Posts — craig @ 6:30 pm

One of the most confusing and difficult aspects of reading C code is reading type declarations properly. Declarations such as:

char *(*(**foo[][8])())[];

are valid in C (and sometimes there is an appropriate use for them). These types of declarations are complicated to parse in human terms, but this site provides a hard and fast rule, which can be summarized as follows:

“go right when you can, go left when you must”

The article goes into some detail about the specifics, so if you want to know more, I would direct you there.
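To see the rule applied mechanically, here is a rough Python sketch that walks a declaration outward from the identifier. It is my own simplification rather than anything from the article: the function name explain is made up, and it only understands a one-word base type, an identifier, *, [], [N] and empty () pairs:

```python
import re

# Identifiers, "[]" or "[N]", "()", and the single characters ( ) *
TOKEN = re.compile(r"[A-Za-z_]\w*|\[\d*\]|\(\)|[()*]")

def explain(declaration):
    """Describe a simple C declaration in English."""
    tokens = TOKEN.findall(declaration)
    base_type = tokens[0]  # assumes a one-word base type such as char
    name_at = next(i for i, t in enumerate(tokens)
                   if i > 0 and (t[0].isalpha() or t[0] == "_"))
    words = [tokens[name_at], "is"]
    left, right = name_at - 1, name_at + 1
    while True:
        # Go right when you can: arrays and functions.
        while right < len(tokens) and tokens[right] != ")":
            if tokens[right] == "()":
                words.append("function returning")
            else:
                size = tokens[right][1:-1]
                words.append("array of " + size if size else "array of")
            right += 1
        # Go left when you must: pointers.
        while left > 0 and tokens[left] == "*":
            words.append("pointer to")
            left -= 1
        # Step out of one pair of grouping parentheses, if any remain.
        if left > 0 and tokens[left] == "(":
            left -= 1
            right += 1
        else:
            break
    words.append(base_type)
    return " ".join(words)
```

Calling explain("char *(*(**foo[][8])())[];") returns "foo is array of array of 8 pointer to pointer to function returning pointer to array of pointer to char".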

March 20, 2008

A neat Python quirk

Filed under: Posts — craig @ 5:23 pm

This post was originally going to be about the fact that Python doesn’t have a ternary operator (like C’s ?: operator); however, it turns out that Python 2.5 added one, in the form of the conditional expression x if condition else y.

Instead, I’ll just mention this really neat Python code idea that I dug up somewhere (can’t find the reference right now):

print ("this is false", "this is true")[i==1]

If i were 1, then "this is true" would be printed; otherwise, "this is false" would be printed.

In Python, bool is a subclass of int, so the result of i==1 (False or True) behaves as 0 or 1 when it is used as a list/tuple index. This is a perfect example of Python doing what’s expected in a given context.
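The indexing trick and the conditional expression can be compared side by side. A short sketch in Python 3 syntax; the function names and the noisy helper are made up for illustration:

```python
def label_by_index(i):
    # False selects item 0 and True selects item 1.
    return ("this is false", "this is true")[i == 1]

def label_by_conditional(i):
    # Conditional expression, available since Python 2.5.
    return "this is true" if i == 1 else "this is false"

# The tuple is built before it is indexed, so both alternatives are
# evaluated even though only one is picked.
calls = []

def noisy(value):
    calls.append(value)
    return value

picked = (noisy("no"), noisy("yes"))[True]
```

Here calls ends up holding both "no" and "yes", whereas a conditional expression would evaluate only the branch it selects. That matters when the alternatives are expensive or have side effects.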
