Saturday, 15 November 2008

random stuff

Beep, beep! I'm still (more or less) alive.

University takes me lots of time, which is my primary excuse for not blogging
for so long time. Projects are like gas - they fill up all available space^H^H^H^H^H time.

Interesting (for geeky enough definition of "interesting") stuff I've done recently:

  • Spellchecker. It's a quite smart one -- written in python and C, it uses trie (ternary search tree) for storing dictionary. It's super-fast to load data (there are no pointers, so loading it is just one read of binary file, and you can use it).

    Then you can use TST to quickly (it's quicker than hashmap) check if a word belongs to vocabulary or retrieve (it's still bleeding fast) a list of words that are no further (in Levenstein's (edit) distance) than misspelled one.

    Then it uses longest common subsequence algorithm to get parts that don't match and compares those parts using knowledge about typical spelling errors in Polish. It can correct "grzegrzułka", "zomp" and "fzhut". In summary: it's cool.

  • I'm preparing a talk about scala. It's work in progress. You can see some slides here. I'll give this talk on 3th December as a part of BIWAK

  • Oh, yes, BIWAK. We (BIT science club) have started a series of talks called BIWAK.

  • Oh, yes, science club. I've done some work on platform game with cool physics, but there is nothing cool to show off yet.

  • I've published some of my .rc files

  • Hooray new swimming pool! Hooray hiking! Hooray birthdays and weddings. Hooray real life.


Ορέστης said...

Cool stuff! Is your spellchecker faster than aspell? It could be useful for one of my old projects!

Krzysiek said...

Well I timed it, but it's biased, because I used smaller dictionary (generating tree for it took ages, anyway, but it can be improved a lot (by generating DAG with common word ends instead of tree), I just didn't have time for it).

It's marginally faster (twice) than aspell --dont-suggest when verifying correctness of input, but verifying is easy and boring.

It's slower a lot when suggesting correct spelling smartly, but it's really smart, eg. 'fzhut' is badly massacred 'wschód', so 'wschód' gets suggested correctly as the most probable word being misspelled, etc.

If you don't want suggestions to be good (read: just use Levenstein's distance), then, well, it depends on the 'radius' you set (just one change -- it's faster 10 times but suggestions suck, two changes -- it gets slower (about twice)).

It has another drawback compared to aspell -- it doesn't know about "hey, you write 'im flammable' together and 'notcool' separately" breed of typos.

And you have to remember that it's just a 2-week-hack-already-semi-orphaned-student's-project before considering using it for something serious ;-)