Monday, January 28, 2008

PageRank - a new addition to your Data Processing Toolkit

G0321 by Piez from flickr (CC-BY)

Back when I was at Universität des Saarlandes we had a great seminar at MPII. It was called "Data Processing Tips and Tricks" and covered some important data processing techniques. These techniques varied a lot. What they had in common was their universality - you could throw pretty much any data at them and extract something. And by preprocessing your data differently, and postprocessing the results differently, you could get loads of interesting things without inventing any new algorithms.

Here's a partial list of the classic universal data processing methods, in no particular order:
  • Bayesian statistics
  • EM algorithm
  • Wavelets
  • Levenberg-Marquardt optimization
  • Lloyd clustering
  • Karhunen-Loeve Transform (aka Principal Component Analysis)
  • Singular Value Decomposition
  • multi-dimensional scaling
  • Algebraic reconstruction technique
  • Support Vector Machines
  • Graph Cut Optimization
  • Level-Set Methods
  • RANSAC
  • Neural Networks
  • Hidden Markov Models
  • Regular expressions

By simply being aware of their existence you greatly increase your chances of solving really big problems you'll be facing. For example naive Bayesian statistics was famously used for fighting spam, and in spite of its striking simplicity is much more effective than all the custom and complicated methods that preceded it.

In the last few years the data processing toolkit got one a new tool - PageRank. Pretty much everybody knows how it works for scoring websites, but the algorithm is capable of much more than that. One great example is extracting keywords from documents. It has nothing to do with the original problem of website scoring, but if you treat words as nodes (websites), create links between words that occur close to each other, and run PageRank on such a graph, you get very decent keywords. Of course you might want to add some pre- and post-processing to improve keyword quality (obviously removing HTML tags, also stemming, removing stopwords, weighting words by part of speech or whatever you feel like doing), but so does Google in determining pages' scores. And I bet you expected keyword extractors to either actually understand what's written (not possible yet) or to simply count number of occurences (really horrible results).

You can use PageRank to asses importance of countries in international trade, importance of people in organization's communication flow, and many other problems. Or you could simply throw arbitrary graphs at PageRank, look at the results and simply guess what they mean. Perhaps that will be enough to solve the problem you've been thinking about for such a long time. If not, you still have two dozen of other universally applicable techniques to choose from.

Saturday, January 26, 2008

Keyboard layout should not be a global setting

mac kitty by atomicshark from flickr (CC-NC-SA)

Most computers these days are laptops. Their computing power got quite decent, and they're even becoming reasonably powerful for moderate gaming. One of the major problems left is typing on them. Laptop keyboards, especially in smaller laptops, are tiny, have very small keys, no numeric keypad and painfully unergonomic shape which forces you to keep your hands in unnatural position. As keyboard and screen are attached to each other there's no way to make it comfortable for both your eyes and your hands. To make matters worse recently many laptops started to include a big touchpad which doubles as a mouse button, so when you try to type, and you have to keep your hands very close to each other because the keyboard is so small, you're very likely to accidentally "press mouse button" by touching the touchpad. Basically they're completely unsuitable for touch typing, and they will forever stay this way, because it's impossible to build a decent keyboard that fits in laptop form factor. That doesn't mean that all of them are equally bad, the worst one I've seen so far was Macbook's, and some in bigger laptops are only annoying instead of being actively painful.

This all means that unless the only thing you type are Google search queries, you need a real keyboard for your computer in addition to the internal keyboard. Most people don't seem to care about this, but I really like Dvorak layout. And here lies the problem because in every operating system I've ever seen keyboard layout is a global setting, not per-device setting. I want to touch type in Dvorak on external keyboard, but as touch typing on laptop keyboard is not possible I'd prefer it to stay QWERTY, so I can at least see what key I'm pressing. However as keyboard layout is global rather than per-device setting, I have to manually switch it every time I attach or detach external keyboard.

Pressing keyboard layout switcher a couple times a day is maybe not the worst usability problem out there, but could KDE/GNOME developers please improve it ? That would be really really great.

Wednesday, January 23, 2008

Really strange quirk of Ruby and Perl regular expressions

Sage's Kittens by T. Keller from flickr (CC-BY)

I've found a weird quirk of Ruby's regular expression engine. The same quirk is present in Ruby 1.8.6, Ruby 1.9.0, and Perl 5.8.8, but it is not present in Python 2.5.1. I'm going to let you decide if it's a bug or not.

I tried to replace all space at the end of a string by a single newline. The correct way to do it is str.sub(/\s*\Z/, "\n"). However I've done str.gsub(/\s*\Z/, "\n") instead. String has only one end-of-string, so there should be no way String#gsub could possibly match twice. But it does - the result is two newlines if there was any whitespace at the end, or one newline if there wasn't. I kinda see how it might be getting these results from implementation point of view, and it's not a big deal because there's a simple workaround of replacing #gsub by #sub, but it doesn't make much semantic sense to me.

Here are a few code snippets in Ruby, Perl and Python, with "foo" instead of whitespace for better readability.
# Ruby
"f".gsub(/o*\Z/, "o") # => "fo"
"fo".gsub(/o*\Z/, "o") # => "foo"
"foo".gsub(/o*\Z/, "o") # => "foo"
"fooo".gsub(/o*\Z/, "o") # => "foo"
"f".sub(/o*\Z/, "o") # => "fo"
"fo".sub(/o*\Z/, "o") # => "fo"
"foo".sub(/o*\Z/, "o") # => "fo"
"fooo".sub(/o*\Z/, "o") # => "fo"
"f".gsub(/o*\Z/, "x") # => "fx"
"fo".gsub(/o*\Z/, "x") # => "fxx"
"foo".gsub(/o*\Z/, "x") # => "fxx"
"fooo".gsub(/o*\Z/, "x") # => "fxx"

# Perl
perl -le '$_="f"; s/o*$/o/g; print $_' # => fo
perl -le '$_="fo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="foo"; s/o*$/o/g; print $_' # => foo
perl -le '$_="fooo"; s/o*$/o/g; print $_' # => foo

# Python
re.compile(r'o*$').sub("o", "f") # => 'fo'
re.compile(r'o*$').sub("o", "fo") # => 'fo'
re.compile(r'o*$').sub("o", "foo") # => 'fo'
re.compile(r'o*$').sub("o", "fooo") # => 'fo'


For all you Ruby/Perl programmers out there, Python sub is equivalent of gsub or s///g not sub or s///.

Which behaviour makes more sense ? I think Python's much more intuitive. Should we treat it as a bug in Ruby/Perl and fix it, or accept it as an unintended feature ?

Sunday, January 20, 2008

My poor bottom left 6

tree eater by splityarn from flickr (CC-NC-SA)

My tooth hurt a bit in the last week, and it got really bad yesterday, so I called some 24h dental service. It wasn't as 24h as they claimed and I was only able to get an appointment for the next day. The bad news is - root canal treatment was required. The even worse news - it was damn 700 quid for two visits. Here goes the new computer, at least for now. It's seriously insanely expensive compared to dentistry in Poland, whene plain checkup is about 10 GBP, and root canal treatment is about 50 GBP - just Google for prices if you don't believe me. If I actually knew in advance it would cost that much I'd probably just book a flight. Too bad you cannot schedule your dental problems.

Sunday, January 13, 2008

Protecting custom SQL in Rails from SQL injections

Claw by AnasiZ from Wikimedia Commons (public domain)
The technique I'm presenting here is not an "awesome revolutionary breakthrough in Rails security" kind of thing (like XSS Shield), it's more like a small cool trick that can improve your code and make it more secure.

Most web applications are backed by some SQL-based database. If they're written in a language like PHP in which SQL is freely mixed with code, it's very easy to open an SQL injection vulnerability. There are two common solutions for making such vulnerabilities less likely. One is "prepared statements", where instead of creating complete SQL like "SELECT * FROM foo WHERE name='#{name}'" one crerates SQL with placeholders like "SELECT * FROM foo WHERE name=?", and then fills the placeholders with the actual data in a safe way within the database or database driver. The other solution is using Object-Relational Mappers instead of raw SQL, and that's what Rails is doing with Active Record. As long as you're using Active Record only with no custom SQL you don't have to worry about SQL injections at all.

Unfortunately in real applications Active Record is often not good enough. It is OK for locally manipulating small graphs of objects (basically creating, reading, updating and deleting single objects/rows at time), but you need to go back to raw SQL every time you want to compute a query on a lot of data (database views would also work, but that's not a subject today). Having to use SQL is not a bad thing, however due to Rails' lack of support for prepared statements it leaves us in a rather bad situation from security point of view.

Did I just say Rails doesn't support prepared statements ? So what are all those :conditions => ['foo > ? AND bar IN ?', frob, bang] if not prepared statements ? They are "fake" prepared statements, expanded on ActiveRecord level unlike "real" prepared statements expanded on database level. The method which fakes them is ActiveRecord::Base#sanitize_sql and we can reuse it to fake prepared statements of other kinds.

This means replacing ugly code like that:
connection.select <<SQL
SELECT COUNT(*) FROM people WHERE last_name LIKE '#{quote_value(prefix)}%'
SQL
by far nicer:
connection.safe_select <<SQL, "#{prefix}%"
SELECT COUNT(*) FROM people WHERE last_name LIKE ?'
SQL


The code to make that happen is pretty simple:
class ActiveRecord::ConnectionAdapters::AbstractAdapter
[:select, :delete, :update, :execute].each do |m|
define_method(:"safe_#{m}") do |*args|
send(m, ActiveRecord::Base.send(:sanitize_sql, args))
end
end
end

Wednesday, January 09, 2008

Voter discrimination in Polish parliamentary elections

Pink buttons by kup,kup from flickr (CC-NC-SA)

Poland has a reputation of a stable democratic country, and indeed - pretty much everybody can vote or candidate, and there have never been any vote counting problems. However, there's a massive discrimination of some classes of voters going on, one that the mainstream media never talk about. Your vote will count for less if you:
  • vote in a district with a high turnout
  • vote in a district which arbitrarily got fewer mandates per person
  • you were arbitrarily excluded from district's population count
All three unfortunately apply to voters living in London.

To cut the story short, here's a table of voting power depending on district. Elections to both houses have slightly different number of districts (40 vs 41) so I'll show them separately.

Upper house (Senat)
DistrictMandatesEligible votersActual votersVote weight (theoretical)Vote weight (actual)
Warszawa I4156703811443186340
Warszawa II27566284606346550
Poznań26819534442437252
Szczecin28375804408975852
Łódź26937794224937154
Bydgoszcz28068334214426155
Kraków414315488297676855
Wrocław39715975574297662
Bielsko-Biała25987613465448266
Gdynia39066885065228168
Lublin39645535013987669
Krosno26976323314017070
Gliwice26594063306607470
Rzeszów39567304931607770
Sosnowiec26001553265368271
Kielce310435784842407071
Gdańsk38237424734078973
Katowice38262924686928974
Białystok39512364644327774
Płock26684053078107375
Rybnik25833503036888476
Olsztyn26282343010487877
Piła25962373005118277
Nowy Sącz25841262994538477
Konin26044272994328177
Piotrków Trybunalski25901932927578379
Radom25758122855268581
Tarnów25529422834248981
Wałbrzych25676922725508685
Legnica38061744052989185
Zielona Góra38014903963899287
Toruń38344303938788888
Kalisz37825963885669489
Częstochowa249364425638210090
Sieradz37921123832399390
Koszalin25166552489129593
Opole38309033709968993
Siedlce37468423682699994
Chełm37786413523659598
Elbląg250483723209697100


Lower house (Sejm)
DistrictMandatesEligible votersActual votersVote weight (theoretical)Vote weight (actual)
Warszawa I19156703811459837447
Poznań106819534476169063
Kraków139298265614048565
Łódź106937794227548867
Warszawa II117566284605828967
Wrocław149715975561338871
Gdańsk128237424725048971
Katowice128262924654828973
Bielsko-Biała95987613447639274
Częstochowa74936442557288777
Sosnowiec96001553257379278
Gdynia149066885036369478
Bydgoszcz128068334189029181
Wałbrzych85676922712788683
Szczecin138375804388299583
Legnica128061744045599184
Rybnik95833503029149484
Chrzanów85017222676769784
Lublin159645535017049584
Piła95962372985699285
Nowy Sącz95841262979029485
Konin96044272976459185
Gliwice106594063293509386
Zielona Góra128014903942159186
Rzeszów159567304913639686
Piotrków Trybunalski95901932919829387
Kalisz127825963862359488
Sieradz127921123823879388
Radom95758122848399689
Tarnów955294228132310090
Koszalin85166552465149592
Białystok159512364611809692
Płock106684053069819192
Siedlce127468423681429892
Toruń138344303927299593
Kielce1610435784825719493
Krosno116976323302289694
Olsztyn106282342992319794
Chełm127786413515009496
Elbląg85048372305929798
Opole1383090336854096100


So if you live in Warsaw or London, your vote only counts as 40% to 47% of someone living in Elbląg or Opole. How it happened again ?
  • Allocation of mandates per district was distriminatory in the first place - even dividing mandates per eligible voters a citizen in some districts (like Warsaw) are worth only 63%/74% of a citizen in other district
  • Counts of eligible voters exclude people living abroad. Even worse - for purposes of mandate assignment they are falsely counted as still living at their last address in Poland. So if a few thousand people moved from Legnica district to London (and are now voting in Warsaw I district), a mandate for them is not reassigned from Legnica to London. So people who stayed in Legnica get to vote for them, while people living in Warsaw or London get their votes diluted !
  • Mandates are allocated based on number of theoretical eligible voters, not number of people who actually voted. Just because you live in the same district as someone who didn't bother to vote doesn't give you any right of voting for that person! Only actual voters should be counted.
These three effects discriminated against voters in the past, but it only became a major issue in the last elections, with unprecedented number of people living abroad voting. Apparently they and people living in Warsaw are not worth less than half a citizen. The next parliamentary elections are only due in 2011, so there's plenty of time to fix the problem. People living abroad should either get their own district, or number of mandates for Warszawa I should be increased to take them into account. Or the best thing - allocate the mandates only after the voting, strictly proportionally to the number of valid votes.

Sunday, January 06, 2008

GNOME sucks

one pissed off pussycat by John Carleton from flickr (CC-NC-SA)

My dead Macbook returned with a new disk. I normally use KDE, but this time I thought it would be a good idea to try the Ubuntu's default GNOME desktop. And it sucks.

The thing which sucks most is virtual desktops support. If I have programs opened on different virtual desktops I cannot switch between them using alt+tab. I cannot switch between them using taskbar. And I cannot take my mouse and simply move from one virtual desktop to another over screen edge. That makes the whole virtual desktop thing pretty much useless.

The second things which sucks is lack of single place with all configuration options. I don't think it even has configuration options I want, believing instead in "one size fits all". Wake up GNOME people - configurable Windows is way more popular than one-true-way Mac for a reason, and so is configurable KDE way more popular than one-true-way GNOME.

The third thing which sucks is GNOME trying too hard to be a Mac but failing. There's even a menubar at top of the screen, but it does not host current application's menus. Instead it takes one of the most valuable parts of screen (top edge) and completely wastes it. It's even worse on a computer with tiny screen like Macbook, where top bar and bottom bar together take large part of total screenspace, leaving much less for applications.

A few more things. Applications menu does not remember my recently used programs. The only reason I would ever use Applications menu instead of Alt+F2 is if it listed the few most frequently (or most recently) used programs for simple clicking. Without that it's completely useless. Places menu is completely useless, and System menu as I said should be replaced by a single place with all config it in. Like your beloved Mac. Or Windows. Or KDE. Or anything except for GNOME.

Wednesday, January 02, 2008

24C3

TV-B-Gone from Kit by mlcastle from flickr (CC-NC-SA)
24C3 is over. Here's a quick summary of events I attended and some general impressions. Recordings of most of the events are downloadable.

Anonymity for 2015: Why not just use Tor? was a quick overview of state of web anonymizers. In case you wondered - they're all insecure against serious attackers and there's little hope for them ever being secure. Web (low latency) anonymizers are much worse than email (high latency) anonymizers.

Make Cool Things with Microcontrollers: Hacking with Microcontrollers and "Design Noir: The seedy underbelly of electronic engineering" talked about microcontrollers and cool things you can do with them. The most popular one was TV-B-Gone which is an universal TV remote with just an off button. In some modded versions it can turn off all TVs in radius of 100 meters. Apparently everyone who bought one at the Congress tested it at nearby MediaMarkt or Saturn, what really pissed them off. Another popular thing was Brainwave Machine, which allegedly affects your brainwave patterns using flashing diodes and binaural sounds. I tested it and it only made my eyes hurt by flashing diodes directly into them.

Programming DNA: A 2-bit language for engineering biology talked about progress of DNA-based engineering, and how cheap and simple DNA synthesis became in recent years, at Moore's Law kind of pace. One thing I'm wondering is if we will ever create DNA processors which do something to DNA (or RNA) linearly one nucleotide at a time. In nature there are only two such processors - one for copying DNA (or RNA) to DNA (or RNA), and one for RNA to protein synthesis. There's a lot of stuff based on local pattern matching (DNA repairs, post-transcription RNA modification, RNA interference etc.) but they seem much less interesting. It would be really awesome if we could create more DNA processors, let's say one which would synthesize different kind of polymers based on DNA/RNA, or one which would do interpret nucleotides as instructions more complex than "copy, move forward by one". In a way DNA sequencing and DNA synthesis are DNA processors, but they're purely artificial and we cannot use them from within cells. Maybe I'll talk about it more in some other post.

The second day was probably the most interesting. In started with the first round of the Ligthning Talks, where I gave a brief presentation of XSS Shield plugin for Ruby on Rails. It was released only a few days before the Congress, so I had no time to do it before, but I will definitely blog about it soon. Some other interesting talks were about UN Democracy and vulnerabilities in Mac desktop widgets. Unfortunately the rest of the lightning talks were scheduled for fourth day evening, so I was unable to see them.

Then I've seen Quantum Cryptography and Possible Attacks which was rather introductory and hand-wavy, so I'd probably be better off seing After C: D, libd and the Slate project: A clean slate for operating systems in the same time slot.

Linguistic Hacking: How to know what a text in an unknown language is about? was very introductory too. It's kinda difficult to know in advance which CCC talk is going to be introductory and which is going to be totally hardcore.

Just in Time compilers - breaking a VM: Practical VM exploiting based on CACAO was something I enjoyed a lot, as writing compilers is my hobby. Immediately afterwards there was Modelling Infectious Diseases in Virtual Realities: The "corrupted blood" plague of WoW from an epidemiological perspective which applied epidemiological models to Corrupted Blood epidemics from World of Warcraft.

Then was the rescheduled AES: side-channel attacks for the masses talk which was very interesting in spite of being a clear case of Death By PowerPoint. It wasn't the only messily structured presentation, but the Congress average was pretty good, so I'm not going to complain.

The late evening of the second day was even better, with DIY Survival: How to survive the apocalypse or a robot uprising and the absolute hit of the 24C3: Rule 34 Contest: There is porn of it. Unfortunately there are no official recordings of Rule 34 contest (no idea about unofficial ones), but participants searched for porn on various subjects ranging from the simplest like Star Wars to more difficult like Apple and Religion. Contest winner was an experienced porn seeker who used websites like Encyclopedia Dramatica and Rule 34 search engine, but participants were able to get really far with just plain Google Image search. Fittingly the first prize was an inflatable sheep.

The third day was the weakest, with From Ring Zero to UID Zero: A couple of stories about kernel exploiting being the only really good talk. haXe: hacking a programming language didn't really explain what's the point of haXe, and Space Communism: Communism or Space first? started with an interesting premise but was very boring in execution.

The last day was full of really great talks which I wasn't able to attend as I had a train at 16:10. It would be much better if they moved some of the good ones from day 4 to the least interesting day 3.

A Spotter's Guide to AACS Keys presented a gloom view of widespread censorship we would live in if Hollywood won. It is very important to keep breaking every digital restriction management system Hollywood, RIAA, Microsoft and other forces of evil keep throwing at us. If we are unable to do so, we're going to live in the fascist world where only licenced "content providers" will be allowed. In their evil world there would be no Wikipedia, no YouTube, no blogs, and IMs would be paid per message. Just look at game consoles and mobile phones, which were the most successful in implementing DRMs - is this the world you want ? I like free Internet much more.

The second talk was Overtaking Proprietary Software Without Writing Code: a few rough insights on sharpening free software, the basic premise of which was that quality is massively overrated when it comes to Open Source software popularity. But that's not all - shitty quality cheap stuff is what created the civilization. Ten thousand years ago people gave up good quality food they hunted and gathered and for which they were evolutionarily adapted, and replaced it with agricultural food which lacked adequate nutrituion, made them fall to many diseases, increased their death rates, reduced their heights, shortened their lifespans, made them susceptible to regular mass starvation events due to bad weather, and eventually enabled civilization as we know it. Gradually the quality of food and life improved, but many people in poor countries are in many ways still worse off than in the Paleolithic.

Many other social and technical changes required giving up quality in exchange for something else, usually quantity. For a recent example - people are reading blogs instead of reading newspapers and other traditional media - quality of presentation in blogs is typically far worse than in newspapers, but there are more blogs, on more subjects, presenting more points of view, and so blogs get more and more popular. Wikipedia vs professionally developed textbooks is another good example. Wikipedia is often less well edited, has typos, bad grammar and inconsistencies which would never get past even the quickest review, not to mention occasional vandalisms and pov wars. But Wikipedia is also much bigger and much more accessible than traditional textbooks and encyclopedias, and somewhat worse quality is not that big of a deal (I'm talking about small articles about less popular subjects mostly, quality of the most popular 10% of articles is usually very high). Oh and it's free, what also helps.

Browsers are another good example. The most popular browser IE is worse than any other browser ever made, but has one good feature of being preinstalled (where's antitrust law when we need it, OEMs should be free to install IE or Firefox or Opera as they see fit, instead of being de facto forced to install IE only). The second most popular browser Firefox has better quality than IE, and is gaining popularity relative to IE. But that's not a victory of quality - Opera is even better when it comes to low bug counts but it never got anywhere as popular as Firefox (or IE). Firefox is winning not because of quality but because it has very easy to write extensions, and therefore more useful extensions than any other browsers. That and good marketing.

The last talk I attended was Ruby on Rails Security, which is really a must-see for anybody doing Rails. It covers everything you need to know except of XSS Shield.

Just afterwards there was a Lego Sumo contest where people built and programmed their Lego Mindstorms robots which tried to throw each other off the table.

A few observations - I liked go corner more when it was located in the dining hall like on 23C3. Some German language talks had English descriptions (like Sex 2.0: Hacking Heteronormativity, and Hacker Jeopardy) and there was no way to tell that from the paper Fahrplan. They should make it possible to pay for tickets by cards for 25C3, cash is so 20th century. In spite of these minor glitches I think 24C3 was a really awesome event, and I'm definitely attending the next year.