Test database instances

OSM is a big database. PostgreSQL/PostGIS is an excellent database management system, but for local tests on my notebook I have a problem: I have only one slow hard disk, which is shared by the PostgreSQL tablespace, Tomcat, the Linux swap partition and Firefox.

Even with 4 GB of RAM this becomes a problem if the database tables grow too big to fit into memory. Especially with the Mapnik database schema, some actions require full table scans on big tables (that’s one reason why I would prefer a different database schema).

A worldwide OSM database is too big for my hard disk (or at least I don’t want to dedicate that much disk space to testing right now), so I first tried a Germany extract database (built from Geofabrik’s PBF Germany extract). But even that is too big for fast processing, as at least one query of the exploration page needs a full sequential scan of the ways table. As a result, calculating any exploration page takes several minutes, which is not acceptable for debugging (and, of course, would not be acceptable in a live environment either).

In the end I had to restrict myself to an even smaller part of the world.

To achieve that, I first thought about cutting a bounding box around Paderborn with osm2pgsql directly:

osm2pgsql -s -C 768 -c -p paderborn -d lalm -U lalm -W -H -P 5432 
          -b 8.6,51.65,8.9,51.8 --hstore --style ./default.style germany.osm.pbf

The problem here is that osm2pgsql ignores the bounding box (parameter -b) when writing the slim tables. These are written containing every node of the import file, which takes quite long and uses lots of disk space.

The second attempt works better: I first used osmosis to produce the bounding box extract and then imported the whole resulting file with osm2pgsql. The corresponding osmosis command is

osmosis --read-pbf file="germany.osm.pbf" 
        --bounding-box top=51.8 left=8.6 bottom=51.65 right=8.9 
        --write-pbf file="paderborn.osm.pbf" compress=none

I decided not to compress the paderborn.osm.pbf file, as that speeds up reading the file for osm2pgsql afterwards. The osm2pgsql import is then the same as above, except that the -b parameter can be dropped, since the bounding box now corresponds to the whole file anyway.

How to „register“ a new Describer

Adding describer classes should be easy, as I expect that to be the most rapidly evolving part of Lalm development for a long time. Generating descriptions differs between languages, and there are many strategies for implementing the generator, ranging from simple string substitutions up to highly complex Natural Language Generation engines with sophisticated grammar assemblers, sentence planning and so on.

Unfortunately Tapestry is developed following a strong mantra: Static Structure, Dynamic Behavior.

In fact that means a Tapestry template is static; even a component’s template is static and cannot change substantially. It is possible to hardcode alternatives and use a parameter or some other data source to choose one of them, but every alternative has to be hardcoded both in that template and (!) in the corresponding component’s Java class.

Therefore adding a new describer component involves the following steps:

  1. implement the describer component’s class.
  2. add it to the ConfigurableDescriberComponent.java source:
    1. put it into the describerMap hashmap by extending the initDescriberMap() method.
    2. inject the block, as is done with the other blocks.
  3. add the new block to the ConfigurableDescriberComponent.tml template.

Using the describer afterwards simply involves setting the corresponding configuration value in the lalm.properties file.
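The registration pattern from the steps above can be sketched in plain Java. This is only a sketch under assumptions: the real ConfigurableDescriberComponent injects Tapestry Block instances via annotations rather than strings, and the describer and block names used here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of the dispatch pattern described above. In the real
// ConfigurableDescriberComponent the map values correspond to Tapestry
// Block instances injected via annotations and declared in the
// ConfigurableDescriberComponent.tml template; the describer and block
// names used here are hypothetical.
public class ConfigurableDescriberComponentSketch {

    private final Map<String, String> describerMap = new HashMap<>();

    public ConfigurableDescriberComponentSketch() {
        initDescriberMap();
    }

    // step 2.1: extend this method to register a new describer block
    private void initDescriberMap() {
        describerMap.put("simple", "simpleDescriberBlock");
        describerMap.put("verbose", "verboseDescriberBlock");
        // a newly added describer is registered here:
        describerMap.put("myNewDescriber", "myNewDescriberBlock");
    }

    // resolves the configured describer name (e.g. from lalm.properties)
    // to the block that should be rendered, falling back to a default
    public String blockFor(String configuredDescriber) {
        return describerMap.getOrDefault(configuredDescriber, "simpleDescriberBlock");
    }
}
```

The lookup with a default mirrors why the template must hardcode every alternative: each string here stands for a block that also has to exist in the .tml file.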


Over the last days I worked on the configuration capabilities of Lalm. Configuration is necessary for several reasons.

Technical reasons include different database backends, different Java setups, different describers and so on. Some of them are planned, others are already realized. Let’s dive a little into the available options.

Options for Data Sources

Initially I didn’t think about supporting different data sources, but there’s one big problem with OSM data: it’s a really big bunch of bits and bytes. Even importing the database takes many hours and needs lots of processing power on the server – once for the initial import and then continuously to keep it up to date.

I haven’t dealt with the update process yet, but I do need a server where I can run the development snapshot of the Look-and-Listen-Map. Thanks to Wikimedia (Deutschland) I’m able to run it on the Toolserver, but there isn’t enough spare capacity there for „my own“ dedicated OSM database. I have to share the Mapnik-schema database used by the Wikipedia map projects for rendering. These databases contain everything I need – but sometimes in a horrible schema, so queries can take incredibly long. In my local test setup, a pretty common page request to navigate one way takes several minutes, as it currently needs one full sequential scan of the ways table for every node of the way. Of course the database schema could be changed to speed that up, but that would break the other projects on the Toolserver, so I’m happy that at least the storage behind the PostgreSQL server on the Toolserver is fast enough.

For local testing, a separate set of tables using the snapshot schema created by osmosis can be used as well.

Which one to use can be configured in the file lalm.properties, including table prefixes, authorization data and – of course – the DatabaseService implementation to use. Currently MapnikDatabaseAPI (using the Mapnik hstore schema), SnapshotDatabaseAPI (for the snapshot schema) and, last but not least, OverpassAPI (using the Overpass API as a web service combined with a local memory cache) are implemented (though not finished yet).
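The actual property names in lalm.properties are not shown in this post, so the following fragment is only an illustration of the kind of configuration meant here; all key names are assumptions.

```properties
# hypothetical lalm.properties fragment – key names are assumptions
# DatabaseService implementation to use:
#   MapnikDatabaseAPI, SnapshotDatabaseAPI or OverpassAPI
database.service=MapnikDatabaseAPI
# table prefix of the imported data (cf. the osm2pgsql -p parameter)
database.tablePrefix=paderborn
# authorization data
database.user=lalm
database.password=secret
```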

Options for Describers

How to describe the geodata as text? I have a lot of ideas, and I think it should be possible to change this according to the user’s needs, the Lalm provider’s decisions, or for testing purposes.

Describers are Tapestry components, and different implementations can be chosen via configuration.

More Ideas

Debugging options: it should be possible to show or hide debugging data, to run in production mode (or not), and more.


Routing Engine

There are several routing engines. PGRouting looks very nice for several reasons:

  • pgRouting is written in pure SQL (requiring PostGIS/PostgreSQL).
  • the routing network is configurable: it can route over streets as well as over waterways or power lines
  • costs of routing networks can be stored in a fixed table or otherwise attached by an SQL query.
    That should allow providing many different profiles.
  • costs can be direction dependent
  • support for node and way weights
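The cost-by-SQL-query idea can be illustrated with a small sketch that assembles a shortest_path() call in the style of the pgRouting versions of that time. The table and column names (ways, gid, source, target, length) are assumptions, not the actual schema, and the query is only built here, not executed against a database.

```java
// Sketch: building a pgRouting-style shortest_path() call where the edge
// costs are attached by an inner SQL query instead of a fixed column.
// Table/column names and the exact function signature are assumptions
// based on older pgRouting releases, not the actual Lalm setup.
public class RoutingQuerySketch {

    // inner query: which edges and costs the router should see;
    // swapping this expression out yields a different routing profile
    static String edgeSql(String profileCostExpression) {
        return "SELECT gid AS id, source, target, "
                + profileCostExpression + " AS cost FROM ways";
    }

    // outer query: shortest_path(edge SQL, source id, target id,
    // directed, has_reverse_cost)
    static String shortestPathSql(String edgeSql, long sourceId, long targetId) {
        return "SELECT * FROM shortest_path('" + edgeSql + "', "
                + sourceId + ", " + targetId + ", true, false)";
    }

    public static void main(String[] args) {
        // a pedestrian profile might simply use the segment length as cost
        String sql = shortestPathSql(edgeSql("length"), 1001, 2002);
        System.out.println(sql);
    }
}
```

Because the edge SQL is just a string parameter, each routing profile can be a different cost expression or even a join against its own cost table.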

The database schema of pgRouting is not well documented, but it is exposed in the source of the class Export2DB.

The schema stores OSM IDs. For ways these IDs are stored explicitly in the table „ways“, as ways are not stored in their original form but sliced into routing segments. IDs of nodes and relations are stored in their original form.

Using only this routing database will not be sufficient, as it lacks all the tags that are not relevant for routing. Its usage in combination with other modules of the portal could be completely separated (e.g. in a different Postgres schema).

Database Schema

The basic OSM data scheme is the XML exchange format. It’s simple, plain and includes all data, but it’s not optimized for any purpose.

To use the data in applications, several database schemes (and file formats) are common.

api db (osmosis)

Used by the Rails core of OSM. Optimized for editing capabilities; it is possible to include history.

pgsql simple (osmosis)

A straightforward relational mapping of nodes, ways and relations, suitable for general-purpose queries.

pgsql simple history (osmosis + history plugin)

The osmosis-history plugin provides additional fields for the simple schema to add history capabilities.

pg snapshot (osmosis)

gazetteer (osm2pgsql)

Used by Nominatim and optimized for geocoding. It is highly specialized and perhaps not well suited to other purposes, such as using the original data as they are. One table (place or placex) connects the gazetteer database to the original OSM data.

Decision and Rationale

The gazetteer format is useful and needed for the geocoding process, but as a first step we can use the MapQuest Nominatim instance and the web service available there, to avoid more load than necessary while testing.

Using the Postgres schema concept, where a schema is a kind of sub-database, I will use several database users and schemas for the portal. The main application uses the pgsql simple schema with the history additions (but without a deep history import currently).
Rationales for each schema used are given below.

Main OSM data: the main schema uses the pgsql simple schema provided by osmosis plus the additions provided by the osmosis history plugin. The deep history of OSM does not need to be imported; the history features are only necessary to allow references to old versions of ways. A cron utility could delete old OSM element versions that are no longer referenced by the application.

Routing: routing should be done with the pgRouting library. This library is pure PostgreSQL code and does not require much preprocessing. The routing module will therefore get its own dedicated schema, where the routing tables are stored and the routing functions are defined.

Import: the import process for the main database will be time- and memory-consuming. Additionally, it is not clear whether it will be possible to import into the live database directly. For now, the import user uses the basic public schema. Later it will probably be better to use two separate import schemas, where one is being imported into while the other is used by the application.

Search and geocoding: at a later stage, Nominatim should be used for geocoding and search purposes. It has a highly specialized database schema not useful for any other type of application, and additionally it requires expensive precalculations. The only connection between Nominatim and the OSM database are the OSM IDs, which can be used across schemas.