Here are the results:
Here is a little HOWTO. It may be useful for something practical, or for next year. Basically, you have to use mod_ext_filter (see the sed example).
Polish spelling mistakes are different from English ones. They are mostly caused by the fact that, for historical reasons and maybe for "compatibility" with other Slavic languages (is "legacy" the proper word?), a few sounds have two spellings: ż/rz, u/ó and h/ch. This makes it very easy to inject typos into Polish text.
We are talking about HTML pages. You cannot break the HTML, e.g. <a chref="..."> would be BAD. So we don't touch anything inside HTML tags, comments and entities. This makes it hard to use regexps. If I were to write this joke's filter program again, I would use flex. Instead I wrote a short and slow (2.2 seconds to process a page, which is unacceptable) automaton-based Python script and then, once it turned out how badly it performed, added lightning-fast (0.01 s/page) C code generation to it. Generating code for automata is very easy.
code is here
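Independently of the linked program, the idea can be shown in a few lines. What follows is a rough Python sketch, not the original script: the swap table, the function names (`protected_end`, `mangle`) and the `rate` parameter are all made up for illustration. It copies comments, tags and entities untouched and swaps sound-alike spellings only in the text between them. It is deliberately naive: it does not handle a ">" inside a quoted attribute value, or the contents of script and style elements.

```python
import random

# Illustrative swap table (an assumption, not the original script's list):
# each key may be replaced by a spelling that sounds the same in Polish.
SWAPS = {"rz": "ż", "ż": "rz", "ch": "h", "h": "ch", "ó": "u", "u": "ó"}
# Try two-letter keys first so "ch" is not read as "c" followed by "h".
KEYS = sorted(SWAPS, key=len, reverse=True)


def protected_end(html, i):
    """Return the index just past a comment, tag or entity starting at i,
    or i itself if position i is ordinary text."""
    if html.startswith("<!--", i):
        end = html.find("-->", i + 4)
        return len(html) if end == -1 else end + 3
    if html[i] == "<":
        end = html.find(">", i + 1)
        return len(html) if end == -1 else end + 1
    if html[i] == "&":
        j = i + 1
        while j < len(html) and (html[j].isalnum() or html[j] == "#"):
            j += 1
        if j > i + 1 and j < len(html) and html[j] == ";":
            return j + 1
    return i


def mangle(html, rate=0.2, rng=random):
    """Inject spelling mistakes into text only, leaving markup intact."""
    out = []
    i = 0
    while i < len(html):
        end = protected_end(html, i)
        if end > i:
            out.append(html[i:end])  # copy markup untouched
            i = end
            continue
        for key in KEYS:
            if html.startswith(key, i):
                # Swap with probability `rate`, otherwise keep the original.
                out.append(SWAPS[key] if rng.random() < rate else key)
                i += len(key)
                break
        else:
            out.append(html[i])  # ordinary character, nothing to swap
            i += 1
    return "".join(out)
```

With `rate=1.0` every candidate is swapped, so `mangle("morze", rate=1.0)` gives "może", while an href containing "ch" stays exactly as it was. To use something like this with mod_ext_filter you would read the page from standard input and write the result to standard output.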
Code generation itself is a bit over-engineered, too. I shouldn't have cared about beautiful indentation & proper newlines. I should have used GNU indent instead.
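To show how little is needed once layout is left to another tool, here is a hedged sketch of a generator that emits flat, unindented C from a transition table. It is not the generator from the linked code: the table format, the state names and the functions `c_char` and `generate_c` are assumptions made for this example, and the generated automaton only tracks the state instead of rewriting text.

```python
def c_char(ch):
    """Render one character as a C character literal."""
    escapes = {"'": "\\'", "\\": "\\\\", "\n": "\\n"}
    return "'" + escapes.get(ch, ch) + "'"


def generate_c(transitions, start):
    """Emit C source for an automaton over a NUL-terminated string.

    `transitions` maps a state name to {character: next state}; characters
    that are not listed leave the state unchanged. No indentation is
    produced on purpose: that job is left to an external formatter.
    """
    states = sorted(transitions)
    lines = [
        "enum state { " + ", ".join(states) + " };",
        "",
        "enum state run(const char *s)",
        "{",
        "enum state st = " + start + ";",
        "for (; *s; s++) {",
        "switch (st) {",
    ]
    for state in states:
        lines.append("case " + state + ":")
        keyword = "if"
        for ch, target in sorted(transitions[state].items()):
            lines.append(keyword + " (*s == " + c_char(ch) + ") st = " + target + ";")
            keyword = "else if"
        lines.append("break;")
    lines += ["}", "}", "return st;", "}"]
    return "\n".join(lines) + "\n"


# Hypothetical table: plain text, inside a tag, inside an entity.
TABLE = {
    "TEXT": {"<": "TAG", "&": "ENTITY"},
    "TAG": {">": "TEXT"},
    "ENTITY": {";": "TEXT"},
}
```

Calling `generate_c(TABLE, "TEXT")` returns the C source as one string, which can then be run through GNU indent to get readable output.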
Apache configuration:
ExtFilterDefine ortozawal mode=output intype=text/html cmd="/usr/local/bin/ortozawal"
ExtFilterDefine ungzip-filter mode=output intype=text/html cmd="/bin/gunzip -"
ExtFilterDefine gzipme-filter mode=output intype=text/html cmd="/bin/gzip"
ExtFilterDefine flip-image mode=output cmd="/usr/bin/convert - -flip -"
<Location "/forum">
    SetOutputFilter ungzip-filter;ortozawal;gzipme-filter
</Location>
<Location "/forum/images">
    SetOutputFilter flip-image
</Location>
The gunzip, filter, gzip chain is a filthy trick. You can probably avoid it with proper filter configuration. If you know how, please drop me a comment.