Python: Sorting words with umlauts

When I tried to sort words containing German umlauts in Python, I noticed that the sorted() function puts them in the wrong order. (Note: the list words below is already in the correct order.)

>>> words=[u"aber",u"All",u"Ärger",u"ärgerlich",u"tränen",u"Tränen",u"Zauber",u"zum"]
>>> print sorted(words)
[u'All', u'Tr\xe4nen', u'Zauber', u'aber', u'tr\xe4nen', u'zum', u'\xc4rger', u'\xe4rgerlich']
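
The order above follows from the default comparison of unicode strings, which goes by code point. A small sketch (the variable names are only for illustration) shows where the letters in question sit:

```python
# Code points of a few letters; the last two are the umlauts Ä and ä.
letters = [u"A", u"Z", u"a", u"z", u"\xc4", u"\xe4"]
codepoints = [ord(ch) for ch in letters]
print(codepoints)  # [65, 90, 97, 122, 196, 228]
```

All uppercase ASCII letters come before all lowercase ones, and the umlauts come after both.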

So the umlauts are sorted to the end of the list, which is wrong according to DIN 5007, and all uppercase words are sorted before the lowercase ones, which is also wrong (have a look at your DUDEN if you don't believe it ;-) ). First I tried to solve this with the functions of the locale module:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
>>> print sorted(words, key=locale.strxfrm)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc4' in position 0: ordinal not in range(128)

So locale.strxfrm doesn't seem to support unicode strings. I tried to fix this by changing the default encoding:

>>> import sys
>>> sys.setdefaultencoding("utf_8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'setdefaultencoding'

Another annoyance in Python: setdefaultencoding is removed from the sys module at startup. One can circumvent this by reloading the module:

>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("utf_8")
>>> print sorted(words, key=locale.strxfrm)
[u'All', u'Tr\xe4nen', u'Zauber', u'aber', u'tr\xe4nen', u'zum', u'\xc4rger', u'\xe4rgerlich']

So nothing has changed! But this was only the case when I used Python on Mac OS X 10.5. Python installed in a German Windows environment yielded the right result. So sorting German words via locale gives different results depending on your OS and your Python environment. This is stupid!
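
One way to make the locale approach at least explicit is to request a German locale by name instead of relying on the environment default. The following is a minimal sketch for Python 3, where locale.strxfrm accepts text strings. The locale name de_DE.UTF-8 and the helper name sort_with_locale are assumptions, and the locale has to be installed on the system. The resulting order still depends on the collation tables of the platform.

```python
import locale

def sort_with_locale(words, name="de_DE.UTF-8"):
    """Sort words with the collation rules of the named locale.

    Returns None if the locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore the old setting

print(sort_with_locale([u"Zauber", u"\xc4rger", u"aber"]))
```

If the function returns None, the requested locale is missing, which is exactly the kind of dependency on the environment described above.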

So I started to write my own function that sorts words correctly, independent of any OS and localization setting, and finally came up with this:

# -*- coding: utf-8 -*-

def din5007(input):
	""" This function implements sort keys for the German language according to
	DIN 5007."""
	# key1: compare words lowercase and replace umlauts according to DIN 5007
	key1=input.lower()
	key1=key1.replace(u"ä", u"a")
	key1=key1.replace(u"ö", u"o")
	key1=key1.replace(u"ü", u"u")
	key1=key1.replace(u"ß", u"ss")
	# key2: sort the lowercase word before the uppercase word and sort
	# the word with umlaut after the word without umlaut
	# in case two words are the same according to key1, sort the words
	# according to key2.
	key2=input.swapcase()
	return (key1, key2)

words=[u"All", u"Tränen", u"Zauber", u"aber", u"tränen", u"zum", u"Ärger", u"ärgerlich"]

print sorted(words, key=din5007)

The result of this routine is

[u'aber', u'All', u'\xc4rger', u'\xe4rgerlich', u'tr\xe4nen', u'Tr\xe4nen', u'Zauber', u'zum']

Note that the function returns a tuple of two keys, where the second one is only used if the two words are the same according to the first key. This works because Python compares tuples element by element. It is a feature that is not widely known (I did not find it mentioned in the documentation or in the HowToSorting), although it is very useful.
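
The tie-breaking behaviour of tuple keys can be checked directly. This is a small sketch with made-up variable names; the key tuples are written out by hand in the form (lowercase without umlauts, swapped case):

```python
# Tuples are compared element by element; later elements only break ties.
first_decides = (u"all", u"aLL") < (u"arger", u"\xe4RGER")           # "all" < "arger"
tie_broken = (u"tranen", u"TR\xc4NEN") < (u"tranen", u"tR\xc4NEN")   # "T" < "t"
print(first_decides)  # True
print(tie_broken)     # True

# The same idea as a sort key: lowercase first, swapped case as tie-breaker.
ordered = sorted([u"b", u"B", u"a"], key=lambda w: (w.lower(), w.swapcase()))
print(ordered == [u"a", u"b", u"B"])  # True
```

Swapping the case for the second key is what puts the lowercase word first: the lowercase word becomes uppercase, and uppercase letters have the smaller code points.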

Python: Connecting to databases

Here I collect some information on how to connect to different kinds of databases using Python.

eXist database

eXist is an open source database management system built entirely on XML technology (a so-called native XML database).

I found that the easiest way to connect to the eXist database from Python is the XML-RPC interface. Unfortunately, the documentation only describes how to connect via XML-RPC using Java or Perl. So in the following I describe the Python way:

First, there is a useful list of all available methods of the XML-RPC interface of eXist. The method descriptions are written for Java. However, the method names and arguments can be used the same way in Python if you substitute a dictionary for each HashMap: where a method expects a HashMap as argument, pass a Python dictionary instead (often an empty one, {}, will do). So let's see how this works.

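import xmlrpclib

# global parameters for the database db
db = xmlrpclib.ServerProxy("http://dbuser:dbpass@hostname:8080/exist/xmlrpc")
xpath=u"distinct-values(for $x in //Project/Budget/Amount return concat($x/../../System/GroupName,\",\",$x/../../Name,\",\",$x))"

# NOTE: the query and retrieval calls were missing from this listing. They are
# reconstructed here assuming eXist's executeQuery, getHits and retrieve methods.
# Run the query; the empty dictionary stands in for the Java HashMap.
query_id = db.executeQuery(xmlrpclib.Binary(xpath.encode("utf-8")), {})
hits = db.getHits(query_id)
print hits

# each hit is assumed to be a string of the form "GroupName,Name,Amount"
results = []
for i in xrange(hits):
	results.append(db.retrieve(query_id, i, {}).data.split(","))

#print results

groupnames = [result[0] for result in results]

#print groupnames

#results = db.retrieveAll(result,{"indent":"yes"})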
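
To see how a Python dictionary ends up where the Java description expects a HashMap, one can build the XML-RPC request without contacting any server. This is only a sketch: the method name executeQuery and the "indent" parameter are taken as examples, and the query string is a placeholder. The import is written so that it runs on both Python 2 and Python 3.

```python
try:
    import xmlrpclib                      # Python 2
except ImportError:
    import xmlrpc.client as xmlrpclib     # Python 3

query = u"//Project/Name"  # placeholder query, for illustration only

# Build the request body locally; nothing is sent over the network.
request = xmlrpclib.dumps(
    (xmlrpclib.Binary(query.encode("utf-8")), {"indent": "yes"}),
    "executeQuery",
)
print(request)

# Decode it again to check what the server side would receive.
params, method = xmlrpclib.loads(request)
print(method)
print(params[1])
```

In the printed request the dictionary appears as a struct element and the query as base64 data, which is the XML-RPC counterpart of a Java HashMap and a byte array.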