taw's blog: January 2008

Monday, January 28, 2008

PageRank - a new addition to your Data Processing Toolkit

Back when I was at Universität des Saarlandes we had a great seminar at MPII. It was called "Data Processing Tips and Tricks" and covered some important data processing techniques. These techniques varied a lot. What they had in common was their universality - you could throw pretty much any data at them and extract something. And by preprocessing your data differently, and postprocessing the results differently, you could get loads of interesting things without inventing any new algorithms.

Here's a partial list of the classic universal data processing methods, in no particular order:

Bayesian statistics
EM algorithm
Wavelets
Levenberg-Marquardt optimization
Lloyd clustering
Karhunen-Loeve Transform (aka Principal Component Analysis)
Singular Value Decomposition
multi-dimensional scaling
Algebraic reconstruction technique
Support Vector Machines
Graph Cut Optimization
Level-Set Methods
RANSAC
Neural Networks
Hidden Markov Models
Regular expressions

By simply being aware of their existence you greatly increase your chances of solving really big problems you'll be facing. For example naive Bayesian statistics was famously used for fighting spam, and in spite of its striking simplicity is much more effective than all the custom and complicated methods that preceded it.

In the last few years the data processing toolkit got one a new tool - PageRank. Pretty much everybody knows how it works for scoring websites, but the algorithm is capable of much more than that. One great example is extracting keywords from documents. It has nothing to do with the original problem of website scoring, but if you treat words as nodes (websites), create links between words that occur close to each other, and run PageRank on such a graph, you get very decent keywords. Of course you might want to add some pre- and post-processing to improve keyword quality (obviously removing HTML tags, also stemming, removing stopwords, weighting words by part of speech or whatever you feel like doing), but so does Google in determining pages' scores. And I bet you expected keyword extractors to either actually understand what's written (not possible yet) or to simply count number of occurences (really horrible results).

You can use PageRank to asses importance of countries in international trade, importance of people in organization's communication flow, and many other problems. Or you could simply throw arbitrary graphs at PageRank, look at the results and simply guess what they mean. Perhaps that will be enough to solve the problem you've been thinking about for such a long time. If not, you still have two dozen of other universally applicable techniques to choose from.

Saturday, January 26, 2008

Keyboard layout should not be a global setting

mac kitty by atomicshark from flickr (CC-NC-SA)

Most computers these days are laptops. Their computing power got quite decent, and they're even becoming reasonably powerful for moderate gaming. One of the major problems left is typing on them. Laptop keyboards, especially in smaller laptops, are tiny, have very small keys, no numeric keypad and painfully unergonomic shape which forces you to keep your hands in unnatural position. As keyboard and screen are attached to each other there's no way to make it comfortable for both your eyes and your hands. To make matters worse recently many laptops started to include a big touchpad which doubles as a mouse button, so when you try to type, and you have to keep your hands very close to each other because the keyboard is so small, you're very likely to accidentally "press mouse button" by touching the touchpad. Basically they're completely unsuitable for touch typing, and they will forever stay this way, because it's impossible to build a decent keyboard that fits in laptop form factor. That doesn't mean that all of them are equally bad, the worst one I've seen so far was Macbook's, and some in bigger laptops are only annoying instead of being actively painful.

This all means that unless the only thing you type are Google search queries, you need a real keyboard for your computer in addition to the internal keyboard. Most people don't seem to care about this, but I really like Dvorak layout. And here lies the problem because in every operating system I've ever seen keyboard layout is a global setting, not per-device setting. I want to touch type in Dvorak on external keyboard, but as touch typing on laptop keyboard is not possible I'd prefer it to stay QWERTY, so I can at least see what key I'm pressing. However as keyboard layout is global rather than per-device setting, I have to manually switch it every time I attach or detach external keyboard.

Pressing keyboard layout switcher a couple times a day is maybe not the worst usability problem out there, but could KDE/GNOME developers please improve it ? That would be really really great.

Wednesday, January 23, 2008

Really strange quirk of Ruby and Perl regular expressions

Sage's Kittens by T. Keller from flickr (CC-BY)

I've found a weird quirk of Ruby's regular expression engine. The same quirk is present in Ruby 1.8.6, Ruby 1.9.0, and Perl 5.8.8, but it is not present in Python 2.5.1. I'm going to let you decide if it's a bug or not.

I tried to replace all space at the end of a string by a single newline. The correct way to do it is str.sub(/\s*\Z/, "\n"). However I've done str.gsub(/\s*\Z/, "\n") instead. String has only one end-of-string, so there should be no way String#gsub could possibly match twice. But it does - the result is two newlines if there was any whitespace at the end, or one newline if there wasn't. I kinda see how it might be getting these results from implementation point of view, and it's not a big deal because there's a simple workaround of replacing #gsub by #sub, but it doesn't make much semantic sense to me.

Here are a few code snippets in Ruby, Perl and Python, with "foo" instead of whitespace for better readability.

# Ruby
"f".gsub(/o*\Z/, "o") # => "fo"
"fo".gsub(/o*\Z/, "o") # => "foo"
"foo".gsub(/o*\Z/, "o") # => "foo"
"fooo".gsub(/o*\Z/, "o") # => "foo"
"f".sub(/o*\Z/, "o") # => "fo"
"fo".sub(/o*\Z/, "o") # => "fo"
"foo".sub(/o*\Z/, "o") # => "fo"
"fooo".sub(/o*\Z/, "o") # => "fo"
"f".gsub(/o*\Z/, "x") # => "fx"
"fo".gsub(/o*\Z/, "x") # => "fxx"
"foo".gsub(/o*\Z/, "x") # => "fxx"
"fooo".gsub(/o*\Z/, "x") # => "fxx"

# Perl
perl -le '$_="f"; s/o*$/o/g; print $_' # => fo
perl -le '$_="fo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="foo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="fooo"; s/o*$/o/g; print $_' # => foo

# Python
re.compile(r'o*$').sub("o", "f") # => 'fo'
re.compile(r'o*$').sub("o", "fo") # => 'fo'
re.compile(r'o*$').sub("o", "foo") # => 'fo'
re.compile(r'o*$').sub("o", "fooo") # => 'fo'

For all you Ruby/Perl programmers out there, Python sub is equivalent of gsub or s///g not sub or s///.

Which behaviour makes more sense ? I think Python's much more intuitive. Should we treat it as a bug in Ruby/Perl and fix it, or accept it as an unintended feature ?

Sunday, January 20, 2008

My poor bottom left 6

tree eater by splityarn from flickr (CC-NC-SA)

My tooth hurt a bit in the last week, and it got really bad yesterday, so I called some 24h dental service. It wasn't as 24h as they claimed and I was only able to get an appointment for the next day. The bad news is - root canal treatment was required. The even worse news - it was damn 700 quid for two visits. Here goes the new computer, at least for now. It's seriously insanely expensive compared to dentistry in Poland, whene plain checkup is about 10 GBP, and root canal treatment is about 50 GBP - just Google for prices if you don't believe me. If I actually knew in advance it would cost that much I'd probably just book a flight. Too bad you cannot schedule your dental problems.

Sunday, January 13, 2008

Protecting custom SQL in Rails from SQL injections

Claw by AnasiZ from Wikimedia Commons (public domain)

The technique I'm presenting here is not an "awesome revolutionary breakthrough in Rails security" kind of thing (like XSS Shield), it's more like a small cool trick that can improve your code and make it more secure.

Most web applications are backed by some SQL-based database. If they're written in a language like PHP in which SQL is freely mixed with code, it's very easy to open an SQL injection vulnerability. There are two common solutions for making such vulnerabilities less likely. One is "prepared statements", where instead of creating complete SQL like "SELECT * FROM foo WHERE name='#{name}'" one crerates SQL with placeholders like "SELECT * FROM foo WHERE name=?", and then fills the placeholders with the actual data in a safe way within the database or database driver. The other solution is using Object-Relational Mappers instead of raw SQL, and that's what Rails is doing with Active Record. As long as you're using Active Record only with no custom SQL you don't have to worry about SQL injections at all.

Unfortunately in real applications Active Record is often not good enough. It is OK for locally manipulating small graphs of objects (basically creating, reading, updating and deleting single objects/rows at time), but you need to go back to raw SQL every time you want to compute a query on a lot of data (database views would also work, but that's not a subject today). Having to use SQL is not a bad thing, however due to Rails' lack of support for prepared statements it leaves us in a rather bad situation from security point of view.

Did I just say Rails doesn't support prepared statements ? So what are all those :conditions => ['foo > ? AND bar IN ?', frob, bang] if not prepared statements ? They are "fake" prepared statements, expanded on ActiveRecord level unlike "real" prepared statements expanded on database level. The method which fakes them is ActiveRecord::Base#sanitize_sql and we can reuse it to fake prepared statements of other kinds.

This means replacing ugly code like that:

connection.select <<SQL
  SELECT COUNT(*) FROM people WHERE last_name LIKE '#{quote_value(prefix)}%'
SQL

by far nicer:

connection.safe_select <<SQL, "#{prefix}%"
  SELECT COUNT(*) FROM people WHERE last_name LIKE ?'
SQL

The code to make that happen is pretty simple:

class ActiveRecord::ConnectionAdapters::AbstractAdapter
  [:select, :delete, :update, :execute].each do |m|
    define_method(:"safe_#{m}") do |*args|
      send(m, ActiveRecord::Base.send(:sanitize_sql, args))
    end
  end
end

Wednesday, January 09, 2008

Voter discrimination in Polish parliamentary elections

Poland has a reputation of a stable democratic country, and indeed - pretty much everybody can vote or candidate, and there have never been any vote counting problems. However, there's a massive discrimination of some classes of voters going on, one that the mainstream media never talk about. Your vote will count for less if you:

vote in a district with a high turnout
vote in a district which arbitrarily got fewer mandates per person
you were arbitrarily excluded from district's population count

All three unfortunately apply to voters living in London.

To cut the story short, here's a table of voting power depending on district. Elections to both houses have slightly different number of districts (40 vs 41) so I'll show them separately.

Upper house (Senat)
District	Mandates	Eligible voters	Actual voters	Vote weight (theoretical)	Vote weight (actual)
Warszawa I	4	1567038	1144318	63	40
Warszawa II	2	756628	460634	65	50
Poznań	2	681953	444243	72	52
Szczecin	2	837580	440897	58	52
Łódź	2	693779	422493	71	54
Bydgoszcz	2	806833	421442	61	55
Kraków	4	1431548	829767	68	55
Wrocław	3	971597	557429	76	62
Bielsko-Biała	2	598761	346544	82	66
Gdynia	3	906688	506522	81	68
Lublin	3	964553	501398	76	69
Krosno	2	697632	331401	70	70
Gliwice	2	659406	330660	74	70
Rzeszów	3	956730	493160	77	70
Sosnowiec	2	600155	326536	82	71
Kielce	3	1043578	484240	70	71
Gdańsk	3	823742	473407	89	73
Katowice	3	826292	468692	89	74
Białystok	3	951236	464432	77	74
Płock	2	668405	307810	73	75
Rybnik	2	583350	303688	84	76
Olsztyn	2	628234	301048	78	77
Piła	2	596237	300511	82	77
Nowy Sącz	2	584126	299453	84	77
Konin	2	604427	299432	81	77
Piotrków Trybunalski	2	590193	292757	83	79
Radom	2	575812	285526	85	81
Tarnów	2	552942	283424	89	81
Wałbrzych	2	567692	272550	86	85
Legnica	3	806174	405298	91	85
Zielona Góra	3	801490	396389	92	87
Toruń	3	834430	393878	88	88
Kalisz	3	782596	388566	94	89
Częstochowa	2	493644	256382	100	90
Sieradz	3	792112	383239	93	90
Koszalin	2	516655	248912	95	93
Opole	3	830903	370996	89	93
Siedlce	3	746842	368269	99	94
Chełm	3	778641	352365	95	98
Elbląg	2	504837	232096	97	100

Lower house (Sejm)
District	Mandates	Eligible voters	Actual voters	Vote weight (theoretical)	Vote weight (actual)
Warszawa I	19	1567038	1145983	74	47
Poznań	10	681953	447616	90	63
Kraków	13	929826	561404	85	65
Łódź	10	693779	422754	88	67
Warszawa II	11	756628	460582	89	67
Wrocław	14	971597	556133	88	71
Gdańsk	12	823742	472504	89	71
Katowice	12	826292	465482	89	73
Bielsko-Biała	9	598761	344763	92	74
Częstochowa	7	493644	255728	87	77
Sosnowiec	9	600155	325737	92	78
Gdynia	14	906688	503636	94	78
Bydgoszcz	12	806833	418902	91	81
Wałbrzych	8	567692	271278	86	83
Szczecin	13	837580	438829	95	83
Legnica	12	806174	404559	91	84
Rybnik	9	583350	302914	94	84
Chrzanów	8	501722	267676	97	84
Lublin	15	964553	501704	95	84
Piła	9	596237	298569	92	85
Nowy Sącz	9	584126	297902	94	85
Konin	9	604427	297645	91	85
Gliwice	10	659406	329350	93	86
Zielona Góra	12	801490	394215	91	86
Rzeszów	15	956730	491363	96	86
Piotrków Trybunalski	9	590193	291982	93	87
Kalisz	12	782596	386235	94	88
Sieradz	12	792112	382387	93	88
Radom	9	575812	284839	96	89
Tarnów	9	552942	281323	100	90
Koszalin	8	516655	246514	95	92
Białystok	15	951236	461180	96	92
Płock	10	668405	306981	91	92
Siedlce	12	746842	368142	98	92
Toruń	13	834430	392729	95	93
Kielce	16	1043578	482571	94	93
Krosno	11	697632	330228	96	94
Olsztyn	10	628234	299231	97	94
Chełm	12	778641	351500	94	96
Elbląg	8	504837	230592	97	98
Opole	13	830903	368540	96	100

So if you live in Warsaw or London, your vote only counts as 40% to 47% of someone living in Elbląg or Opole. How it happened again ?

Allocation of mandates per district was distriminatory in the first place - even dividing mandates per eligible voters a citizen in some districts (like Warsaw) are worth only 63%/74% of a citizen in other district
Counts of eligible voters exclude people living abroad. Even worse - for purposes of mandate assignment they are falsely counted as still living at their last address in Poland. So if a few thousand people moved from Legnica district to London (and are now voting in Warsaw I district), a mandate for them is not reassigned from Legnica to London. So people who stayed in Legnica get to vote for them, while people living in Warsaw or London get their votes diluted !
Mandates are allocated based on number of theoretical eligible voters, not number of people who actually voted. Just because you live in the same district as someone who didn't bother to vote doesn't give you any right of voting for that person! Only actual voters should be counted.

These three effects discriminated against voters in the past, but it only became a major issue in the last elections, with unprecedented number of people living abroad voting. Apparently they and people living in Warsaw are not worth less than half a citizen. The next parliamentary elections are only due in 2011, so there's plenty of time to fix the problem. People living abroad should either get their own district, or number of mandates for Warszawa I should be increased to take them into account. Or the best thing - allocate the mandates only after the voting, strictly proportionally to the number of valid votes.

Sunday, January 06, 2008

GNOME sucks

one pissed off pussycat by John Carleton from flickr (CC-NC-SA)

My dead Macbook returned with a new disk. I normally use KDE, but this time I thought it would be a good idea to try the Ubuntu's default GNOME desktop. And it sucks.

The thing which sucks most is virtual desktops support. If I have programs opened on different virtual desktops I cannot switch between them using alt+tab. I cannot switch between them using taskbar. And I cannot take my mouse and simply move from one virtual desktop to another over screen edge. That makes the whole virtual desktop thing pretty much useless.

The second things which sucks is lack of single place with all configuration options. I don't think it even has configuration options I want, believing instead in "one size fits all". Wake up GNOME people - configurable Windows is way more popular than one-true-way Mac for a reason, and so is configurable KDE way more popular than one-true-way GNOME.

The third thing which sucks is GNOME trying too hard to be a Mac but failing. There's even a menubar at top of the screen, but it does not host current application's menus. Instead it takes one of the most valuable parts of screen (top edge) and completely wastes it. It's even worse on a computer with tiny screen like Macbook, where top bar and bottom bar together take large part of total screenspace, leaving much less for applications.

A few more things. Applications menu does not remember my recently used programs. The only reason I would ever use Applications menu instead of Alt+F2 is if it listed the few most frequently (or most recently) used programs for simple clicking. Without that it's completely useless. Places menu is completely useless, and System menu as I said should be replaced by a single place with all config it in. Like your beloved Mac. Or Windows. Or KDE. Or anything except for GNOME.

Wednesday, January 02, 2008

24C3

TV-B-Gone from Kit by mlcastle from flickr (CC-NC-SA)

24C3 is over. Here's a quick summary of events I attended and some general impressions. Recordings of most of the events are downloadable.

Anonymity for 2015: Why not just use Tor? was a quick overview of state of web anonymizers. In case you wondered - they're all insecure against serious attackers and there's little hope for them ever being secure. Web (low latency) anonymizers are much worse than email (high latency) anonymizers.

Make Cool Things with Microcontrollers: Hacking with Microcontrollers and "Design Noir: The seedy underbelly of electronic engineering" talked about microcontrollers and cool things you can do with them. The most popular one was TV-B-Gone which is an universal TV remote with just an off button. In some modded versions it can turn off all TVs in radius of 100 meters. Apparently everyone who bought one at the Congress tested it at nearby MediaMarkt or Saturn, what really pissed them off. Another popular thing was Brainwave Machine, which allegedly affects your brainwave patterns using flashing diodes and binaural sounds. I tested it and it only made my eyes hurt by flashing diodes directly into them.

Programming DNA: A 2-bit language for engineering biology talked about progress of DNA-based engineering, and how cheap and simple DNA synthesis became in recent years, at Moore's Law kind of pace. One thing I'm wondering is if we will ever create DNA processors which do something to DNA (or RNA) linearly one nucleotide at a time. In nature there are only two such processors - one for copying DNA (or RNA) to DNA (or RNA), and one for RNA to protein synthesis. There's a lot of stuff based on local pattern matching (DNA repairs, post-transcription RNA modification, RNA interference etc.) but they seem much less interesting. It would be really awesome if we could create more DNA processors, let's say one which would synthesize different kind of polymers based on DNA/RNA, or one which would do interpret nucleotides as instructions more complex than "copy, move forward by one". In a way DNA sequencing and DNA synthesis are DNA processors, but they're purely artificial and we cannot use them from within cells. Maybe I'll talk about it more in some other post.

The second day was probably the most interesting. In started with the first round of the Ligthning Talks, where I gave a brief presentation of XSS Shield plugin for Ruby on Rails. It was released only a few days before the Congress, so I had no time to do it before, but I will definitely blog about it soon. Some other interesting talks were about UN Democracy and vulnerabilities in Mac desktop widgets. Unfortunately the rest of the lightning talks were scheduled for fourth day evening, so I was unable to see them.

Then I've seen Quantum Cryptography and Possible Attacks which was rather introductory and hand-wavy, so I'd probably be better off seing After C: D, libd and the Slate project: A clean slate for operating systems in the same time slot.

Linguistic Hacking: How to know what a text in an unknown language is about? was very introductory too. It's kinda difficult to know in advance which CCC talk is going to be introductory and which is going to be totally hardcore.

Just in Time compilers - breaking a VM: Practical VM exploiting based on CACAO was something I enjoyed a lot, as writing compilers is my hobby. Immediately afterwards there was Modelling Infectious Diseases in Virtual Realities: The "corrupted blood" plague of WoW from an epidemiological perspective which applied epidemiological models to Corrupted Blood epidemics from World of Warcraft.

Then was the rescheduled AES: side-channel attacks for the masses talk which was very interesting in spite of being a clear case of Death By PowerPoint. It wasn't the only messily structured presentation, but the Congress average was pretty good, so I'm not going to complain.

The late evening of the second day was even better, with DIY Survival: How to survive the apocalypse or a robot uprising and the absolute hit of the 24C3: Rule 34 Contest: There is porn of it. Unfortunately there are no official recordings of Rule 34 contest (no idea about unofficial ones), but participants searched for porn on various subjects ranging from the simplest like Star Wars to more difficult like Apple and Religion. Contest winner was an experienced porn seeker who used websites like Encyclopedia Dramatica and Rule 34 search engine, but participants were able to get really far with just plain Google Image search. Fittingly the first prize was an inflatable sheep.

The third day was the weakest, with From Ring Zero to UID Zero: A couple of stories about kernel exploiting being the only really good talk. haXe: hacking a programming language didn't really explain what's the point of haXe, and Space Communism: Communism or Space first? started with an interesting premise but was very boring in execution.

The last day was full of really great talks which I wasn't able to attend as I had a train at 16:10. It would be much better if they moved some of the good ones from day 4 to the least interesting day 3.

A Spotter's Guide to AACS Keys presented a gloom view of widespread censorship we would live in if Hollywood won. It is very important to keep breaking every digital restriction management system Hollywood, RIAA, Microsoft and other forces of evil keep throwing at us. If we are unable to do so, we're going to live in the fascist world where only licenced "content providers" will be allowed. In their evil world there would be no Wikipedia, no YouTube, no blogs, and IMs would be paid per message. Just look at game consoles and mobile phones, which were the most successful in implementing DRMs - is this the world you want ? I like free Internet much more.

The second talk was Overtaking Proprietary Software Without Writing Code: a few rough insights on sharpening free software, the basic premise of which was that quality is massively overrated when it comes to Open Source software popularity. But that's not all - shitty quality cheap stuff is what created the civilization. Ten thousand years ago people gave up good quality food they hunted and gathered and for which they were evolutionarily adapted, and replaced it with agricultural food which lacked adequate nutrituion, made them fall to many diseases, increased their death rates, reduced their heights, shortened their lifespans, made them susceptible to regular mass starvation events due to bad weather, and eventually enabled civilization as we know it. Gradually the quality of food and life improved, but many people in poor countries are in many ways still worse off than in the Paleolithic.

Many other social and technical changes required giving up quality in exchange for something else, usually quantity. For a recent example - people are reading blogs instead of reading newspapers and other traditional media - quality of presentation in blogs is typically far worse than in newspapers, but there are more blogs, on more subjects, presenting more points of view, and so blogs get more and more popular. Wikipedia vs professionally developed textbooks is another good example. Wikipedia is often less well edited, has typos, bad grammar and inconsistencies which would never get past even the quickest review, not to mention occasional vandalisms and pov wars. But Wikipedia is also much bigger and much more accessible than traditional textbooks and encyclopedias, and somewhat worse quality is not that big of a deal (I'm talking about small articles about less popular subjects mostly, quality of the most popular 10% of articles is usually very high). Oh and it's free, what also helps.

Browsers are another good example. The most popular browser IE is worse than any other browser ever made, but has one good feature of being preinstalled (where's antitrust law when we need it, OEMs should be free to install IE or Firefox or Opera as they see fit, instead of being de facto forced to install IE only). The second most popular browser Firefox has better quality than IE, and is gaining popularity relative to IE. But that's not a victory of quality - Opera is even better when it comes to low bug counts but it never got anywhere as popular as Firefox (or IE). Firefox is winning not because of quality but because it has very easy to write extensions, and therefore more useful extensions than any other browsers. That and good marketing.

The last talk I attended was Ruby on Rails Security, which is really a must-see for anybody doing Rails. It covers everything you need to know except of XSS Shield.

Just afterwards there was a Lego Sumo contest where people built and programmed their Lego Mindstorms robots which tried to throw each other off the table.

A few observations - I liked go corner more when it was located in the dining hall like on 23C3. Some German language talks had English descriptions (like Sex 2.0: Hacking Heteronormativity, and Hacker Jeopardy) and there was no way to tell that from the paper Fahrplan. They should make it possible to pay for tickets by cards for 25C3, cash is so 20th century. In spite of these minor glitches I think 24C3 was a really awesome event, and I'm definitely attending the next year.