Simple Language-Defined posts in WordPress

I write blog posts in different languages for different audiences. Developers usually understand English and often even talk to each other in English; it is the lingua franca of programming.
Other articles cover more local topics, so their target audience is mainly German. For those I prefer writing in my mother tongue. Unfortunately there seems to be no simple way to define the language per article in WordPress, so I decided to write one myself.

The plugin is now available for download on GitHub.

Read on for more details on what it does, how it is done, why I did it and what could be added.

What is „Multi-Language Support“ on a blog?

The WordPress Codex contains an article about multilingual WordPress. It starts right away with a sentence that sounds strange, given that WordPress powers a big part of the web with millions of installations:

WordPress does not support a bilingual or multilingual blog out-of-the-box.

But what is a „multilingual blog“? I think there are two different interpretations.

The first is: in a multilingual blog it is possible to write articles in different languages. At first glance this is of course possible, and nothing is required to support it. Looking at what a typical web page transmits besides the directly visible text and images, things are different: any web page should tell clients (browsers, search engines, screen readers) the language of its content. WordPress does so, using a setting that holds globally for the complete blog instance, so it would declare the whole blog as German although some articles are written in English. If you have ever heard a native French speaker read an English or German text aloud without knowing how that language is pronounced, you can imagine how a screen reader behaves when it renders such an article audibly for blind and visually impaired users.

The second interpretation of a „multilingual blog“ requires more features. It is the idea that a whole blog is available in multiple languages. Here it should be possible to have the same article in more than one language, with the localized versions linked together so that readers stay on the same article when switching the language.

If WordPress didn’t support the second interpretation but worked flawlessly according to the first, I would have been happy and that would have been enough, but even that is not the case. Luckily, the article continues:

There are however Plugins developed by the WordPress community which will allow you to create a multilingual blog easily.

Why didn’t I use an existing plugin?

The same article defines even more types of multilingual plugins, considering mainly technical properties for the distinction.

Some store several languages of an article in the same post. I have two major concerns about this approach. One is the database structure. If storing several language versions of an article in the same post means storing them in the same content field, that breaks any reasonable database normalization. It gets difficult to read the database tables when problems occur, and even more difficult to extract a single-language post from the database.

As an example, qtranslate stores many languages in the same field like this (taken from the plugin page linked above):

[:en]English Text[:de]Deutsch[:]

This approach supports other fields as well, but I am not sure how it affects other plugins, which is my second concern. Can a statistics plugin still count the words of an article without supporting that syntax? I didn’t test it, and it may well work, but as a developer I don’t like the approach because it denormalizes the database by storing multiple values in the same field.
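To illustrate the concern, here is a minimal sketch of what any code has to do before it can work on a single language of such a field. The function name extract_language is hypothetical and this is not qtranslate’s own code; it only assumes the marker syntax shown above.

```php
<?php
/**
 * Hypothetical helper (not part of qtranslate): pull one language out of
 * a field in the "[:en]English Text[:de]Deutsch[:]" format.
 * Returns null if the language is not present.
 */
function extract_language(string $field, string $lang): ?string
{
    // Match "[:xx]" and capture everything up to the next "[:" marker
    // (or the end of the string if the closing marker is missing).
    $pattern = '/\[:' . preg_quote($lang, '/') . '\](.*?)(?=\[:|\z)/s';
    if (preg_match($pattern, $field, $matches) === 1) {
        return $matches[1];
    }
    return null;
}

$raw = '[:en]English Text[:de]Deutsch[:]';

// A word counter that is unaware of the syntax sees all languages at once,
// including the language codes themselves.
echo str_word_count($raw), "\n";

// Only after extracting one language does the count describe that text.
echo str_word_count((string) extract_language($raw, 'en')), "\n";
```

Every plugin that reads the content field directly would need a step like this, which is exactly the coupling I would like to avoid.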

WPGlobus, another set of plugins providing multi-language support, requires permalink settings that contain neither index.php nor the post ID. This should not be a problem for most sites, but it is a restriction I don’t like.
I didn’t test this plugin myself, but I looked into its source code. In many different places the content from the database is trimmed to the correct language. This does not look very stable across plugin sets and different configurations, which is also indicated by code parts where the names of other plugins are hard-coded in the PHP code. That should not be necessary in a good architecture.

WPML, a commercially sold plugin with a lot of features, does not produce semantically correct markup for articles, as its own news page demonstrates. The articles there are not translated and thus only available in English, but because the page is marked as being in the selected language, the English text contradicts the HTML markup, which says de_DE.

Another approach: Use Tags to encode language

Thomas Van Houtte suggests using tags to encode the language. His solution has two drawbacks: it has to be supported by the theme, and it mixes content-related tags (what the post is about) with information that is unrelated to the topic and only describes how the content is conveyed.

On top of that, his particular solution does not work for a variable list of languages, as the theme selects the language from a fixed set of tags using PHP if-then-else constructs.

Requirements for myself

My main concern is that I want to keep everything in one place but clearly denote the language of each article. The language should be embedded in the HTML source of the page, and if the theme supports it, it should be visible to the user as well.

On top of that it would be great to select articles by language.

Let’s sum this up as a list of requirements:

  1. The language of writing is independent of the topic. Categories should denote topics, and tags should be just tags: relevant buzzwords for the post. A post that doesn’t deal with „German“ should not get a tag „German“.
  2. The user should not accidentally define a language for a particular post by editing it without thinking about this decision.
  3. There should be a way to define the language when writing or editing the article.
  4. The language should be visible in the post header on the page.
  5. The language should be marked by the html lang attribute for the blog element.
  6. If possible, direct support from the theme should not be required. Language of the article is a content issue, the theme should deal with design and presentation alone.
  7. It should be possible to only show articles of a specific language.

Design decisions

  1. Selecting articles by another dimension of properties sounds like a good fit for either a private field or a custom taxonomy.
    Custom taxonomies define a set of terms to organize content. The built-in taxonomies of WordPress are Categories and Tags. In the same way a post gets a set of categories and tags, it can get a set of values from any custom taxonomy, so that’s the way to go.
  2. The user interface for selecting the language of the post should present a default selection that doesn’t fix the language to a probably wrong value. I’m not sure this is the best way to do it. The major drawback of posts with an undefined language is that whenever the blog’s language changes, the page tells the client (browser or crawler) a different language for the same content, which may then be wrong. This leads to the extension idea of providing an overview of untagged posts.
  3. We need a meta box in the admin interface that allows selecting the language. As the language is an important property of any post, this box should be visible on the first screen when editing.
  4. and 5. Research in the filter and action reference of the WordPress Codex and into the way themes are written led to the filters the_content() and the_title(). The basic idea comes from Pippin Williamson’s article Playing Nice with the “the_content” Filter.
  6. Selecting by language is natively supported by a custom taxonomy, using the tags widget configured for that taxonomy. Unfortunately this makes it necessary to define the language for all articles, as posts without a language are not listed by any of those filters. The feature idea of providing an overview of untagged posts might be a solution for this problem.

How I did it

To meet the requirements I started coding. The basics are the usual ones for a plugin: I decided to go the class-based way, so my plugin’s PHP file gets the required header that defines its metadata, followed by the plugin class itself. As that’s pretty boring and explained in many other articles elsewhere, let’s dive into the details.

The constructor of my plugin registers actions and filters that are necessary for the plugin to work. More details on each filter in the following paragraphs.

Action hook „init“

The init hook of WordPress is part of the setup. Here the localization is loaded by load_plugin_textdomain() and the custom taxonomy that defines the languages is registered.

On top of that de and en are registered as default language codes as terms of the taxonomy with wp_insert_term().
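The init handler described above could look roughly like the following sketch. The taxonomy slug post_language, the jpl_ function prefix and the languages directory are assumptions for illustration and not necessarily the names used in the plugin; the text domain jugglingPostLang is the one used elsewhere in this article.

```php
<?php
/** Language codes that are registered as default terms. */
function jpl_default_language_codes(): array
{
    return array('de', 'en');
}

/** Handler for the "init" action (assumed name). */
function jpl_init(): void
{
    // Load translations; the "/languages" sub-directory is an assumption.
    load_plugin_textdomain(
        'jugglingPostLang',
        false,
        dirname(plugin_basename(__FILE__)) . '/languages'
    );

    // Register the custom taxonomy for posts and pages.
    // 'post_language' is a placeholder slug.
    register_taxonomy('post_language', array('post', 'page'), array(
        'label'        => __('Languages', 'jugglingPostLang'),
        'public'       => true,
        'hierarchical' => false,
    ));

    // Make sure the default language codes exist as terms.
    foreach (jpl_default_language_codes() as $code) {
        // wp_insert_term() returns a WP_Error for an existing term,
        // so check for existence first.
        if (!term_exists($code, 'post_language')) {
            wp_insert_term($code, 'post_language');
        }
    }
}

// Only hook in when running inside WordPress.
if (function_exists('add_action')) {
    add_action('init', 'jpl_init');
}
```

In a class-based plugin the same calls would live in a method and be registered from the constructor with array($this, 'methodName') as the callback.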

Action hook „add_meta_boxes“

This hook is called when the meta boxes beside the content elements are set up. The handler we add here registers the same meta box on pages and posts. The box provides a select box with all languages registered in our taxonomy.

Adding the box is done by the function call to add_meta_box():

add_meta_box(
  $this->METABOX_ID, // HTML id of the meta box
  __('Language', 'jugglingPostLang'), // box title
  array( $this, 'languageSelectorContent' ), // callback that renders the box content
  'post', // show on post edit screens
  'side', // context: place the box in the side column
  'high' // priority
);

I defined the HTML id in a class variable beforehand. The third parameter refers to the callback that generates the content of the box. Setting the priority to high serves the requirement to make the setting as visible as possible to the post author.

This requires the function languageSelectorContent to generate the content. Meta boxes are placed inside the general HTML form wrapping the whole post, so the content of the box does not have to be a form of its own. Inside that function a nonce field is generated. Nonce is short for „number used once“. In general such a number prevents an action from being performed twice by accident, which in web applications can happen when the user hits F5 or presses a submit button twice before the browser refreshes its view. WordPress nonces stay valid for a limited time and can be reused within it, so their main benefit here is a different one: because a request has to carry a valid nonce, the form handler is protected against forged requests.

In short, the content generation produces roughly this HTML code, preceded by a hidden field for the nonce:

<label for="jugglingPostLang_selector">Language</label>
<select id="jugglingPostLang_selector"
        name="jugglingPostLang_selector">
  <option value="-1"
          style="display:none;"
          disabled
          selected> - not specified - </option>
  <option value="1">de</option>
</select>

The options for individual languages are generated from the taxonomy values; the code snippet shows only the one for German (de).

What’s important here is the first option. It is only generated when no language has been defined for the edited post yet, and it supports the requirement not to default to any language. On the other hand, it should not be possible to set the language of an article back to „undefined“.

If the option is shown at all (and thus no other language is selected), it is selected by default. To prevent users from choosing it again after they have picked a language, it is disabled as well.

Unfortunately, disabled select options are still visible in the dropdown although they cannot be selected. Here the style „display:none“ comes in to hide the option. That style does not affect the selected element shown in the closed box, but it does apply to the dropdown list once it is opened.
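The rules for the placeholder option can be sketched as a small function that builds the select box from a list of languages. The function name and parameters are hypothetical; in the plugin the list comes from the taxonomy terms and the output is written directly into the meta box.

```php
<?php
/**
 * Hypothetical sketch: build the language select box.
 *
 * @param array<int,string> $languages term id => language code
 * @param int|null          $selected  term id of the post's language,
 *                                     or null if none is defined yet
 */
function jpl_render_language_select(array $languages, ?int $selected): string
{
    $html = '<select id="jugglingPostLang_selector" name="jugglingPostLang_selector">' . "\n";

    // The placeholder is only generated while no language is defined.
    // It is preselected, cannot be chosen again and is hidden in the list.
    if ($selected === null) {
        $html .= '  <option value="-1" style="display:none;" disabled selected>'
               . ' - not specified - </option>' . "\n";
    }

    foreach ($languages as $id => $code) {
        $attr  = ($id === $selected) ? ' selected' : '';
        $html .= sprintf(
            '  <option value="%d"%s>%s</option>' . "\n",
            $id,
            $attr,
            htmlspecialchars($code)
        );
    }

    return $html . '</select>';
}

// New post without a language: the placeholder option is included.
echo jpl_render_language_select(array(1 => 'de', 2 => 'en'), null), "\n";
```

Once a language is stored for the post, calling the function with that term id leaves out the placeholder, so the value cannot be reset to „not specified“ through the form.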

Action hook „save_post“

When the post is saved, the language definition has to be evaluated and stored in the database. This is where the action hook on save_post comes in.

The handler gets the post’s ID and the post as parameters. For security reasons it first verifies the nonce defined in the meta box content, calling wp_verify_nonce() with the same parameters that were used to create it.

If the nonce could be verified, the handler determines whether the user is allowed to edit the post at all. If so, it connects the post with the term from the taxonomy that was selected.
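A minimal sketch of such a save handler follows, assuming a taxonomy slug post_language, a nonce field jpl_nonce and a nonce action jpl_save_language. All three names, as well as the jpl_ functions, are placeholders and not taken from the plugin; only the field name jugglingPostLang_selector appears in the markup above.

```php
<?php
/**
 * Return the submitted term id if it belongs to a known language,
 * otherwise null (for example for the "-1" placeholder).
 */
function jpl_submitted_language(array $postData, array $validTermIds): ?int
{
    $key = 'jugglingPostLang_selector';
    if (!isset($postData[$key]) || !is_scalar($postData[$key])) {
        return null;
    }
    $id = (int) $postData[$key];
    return in_array($id, $validTermIds, true) ? $id : null;
}

/** Handler for the "save_post" action (assumed name). */
function jpl_save_post(int $postId): void
{
    // 1. Verify the nonce that was rendered into the meta box.
    if (!isset($_POST['jpl_nonce'])
        || !wp_verify_nonce($_POST['jpl_nonce'], 'jpl_save_language')) {
        return;
    }

    // 2. Check that the current user may edit this post.
    if (!current_user_can('edit_post', $postId)) {
        return;
    }

    // 3. Accept only term ids that exist in the language taxonomy.
    $validIds = get_terms(array(
        'taxonomy'   => 'post_language',
        'hide_empty' => false,
        'fields'     => 'ids',
    ));
    if (is_wp_error($validIds)) {
        return;
    }

    $termId = jpl_submitted_language($_POST, array_map('intval', $validIds));
    if ($termId !== null) {
        // Replace any previous language term with the selected one.
        wp_set_object_terms($postId, $termId, 'post_language');
    }
}

// Only hook in when running inside WordPress.
if (function_exists('add_action')) {
    add_action('save_post', 'jpl_save_post');
}
```

Keeping the validation in a separate function without WordPress calls makes that part easy to test on its own.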

Filters on „the_content()“ and „the_title()“

These filters are responsible for the trickiest part of the plugin. The one requirement still missing is to inject the language as an HTML lang attribute into the title and content of a post.

Unfortunately there is no matching filter or action defined in WordPress. The closest to what I needed are the filters on the_content() and the_title(). A filter attached to one of these functions (and named after it) takes the title or content as stored in the post and passes it to the filter handler. If more than one handler is registered on the filter, the value is passed through each handler in turn.

What also makes this solution less than optimal is that the result of the_content() or the_title() may be processed further instead of being printed directly. In the backend’s post list, for example, the filters are applied to the content, but HTML is encoded as entities afterwards.

To avoid broken output, the filter handler only wraps the title or content string in the language span when it runs in the main query and not on an admin page. This is a major drawback, because the accessibility improvement the plugin aims for is lost in some cases: in the admin interface, and whenever a theme shows a title or the content outside the loop.
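The wrapping itself is simple; the conditions around it are the interesting part. The following sketch assumes the taxonomy slug post_language and uses hypothetical function names, so it shows the idea rather than the plugin’s actual code.

```php
<?php
/**
 * Wrap a title or content string in a span carrying the lang attribute.
 * Without a language the text is returned unchanged.
 */
function jpl_wrap_with_lang(string $text, ?string $lang): string
{
    if ($lang === null || $lang === '') {
        return $text;
    }
    return sprintf(
        '<span lang="%s">%s</span>',
        htmlspecialchars($lang, ENT_QUOTES),
        $text
    );
}

/** Handler for the "the_content" and "the_title" filters (assumed name). */
function jpl_filter_text($text)
{
    // Leave the text alone in the backend and outside the main loop,
    // where the result may be escaped or processed further.
    if (is_admin() || !in_the_loop() || !is_main_query()) {
        return $text;
    }

    // Look up the language term of the current post, if there is one.
    $terms = get_the_terms(get_the_ID(), 'post_language');
    $lang  = (is_array($terms) && $terms) ? $terms[0]->name : null;

    return jpl_wrap_with_lang((string) $text, $lang);
}

// Only hook in when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('the_content', 'jpl_filter_text');
    add_filter('the_title', 'jpl_filter_text');
}
```

Posts without a language term pass through untouched, so they keep inheriting the language declared for the whole blog.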

Possible extensions

Default languages

Provide a more complete list of locales by default, but let the user decide which should be presented in the select box.

This doesn’t add any functionality and it’s not a feature I need myself, but it provides better usability for blog admins. Currently additional languages can be provided in the taxonomy management page „Languages“ in the posts-menu.

Pretty styling of the admin interface

The meta box beside the article in the backend currently lists only the language code. It would be nice to add a visual representation such as a flag icon next to it.

Support right-to-left languages

Languages that are written from right to left may require additional tweaks. I don’t have any experience with this, but there is the HTML attribute dir and the corresponding CSS property „direction“. The latter should not be used on HTML pages, as the candidate specification warns: like the lang attribute, dir relates to the content, not to the presentation.

The impact of the dir attribute on how a web page is rendered is summed up in Mozilla’s API documentation on HTMLElement.dir: it affects, for example, the column order in tables, where the first column is displayed on the right for Arabic when dir=rtl is set.
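One conceivable starting point is to derive the value for dir from the language code. The sketch below is an assumption about how this extension could look, not existing plugin code, and its list of right-to-left languages is deliberately short and incomplete.

```php
<?php
/**
 * Hypothetical helper: map a language code such as "ar" or "ar-EG"
 * to a value for the HTML dir attribute.
 */
function jpl_text_direction(string $lang): string
{
    // Incomplete example list: Arabic, Persian, Hebrew, Urdu.
    $rtlLanguages = array('ar', 'fa', 'he', 'ur');

    // Only the primary subtag matters ("ar" in "ar-EG" or "ar_EG").
    $primary = strtolower(preg_split('/[-_]/', $lang)[0]);

    return in_array($primary, $rtlLanguages, true) ? 'rtl' : 'ltr';
}

echo jpl_text_direction('ar-EG'), "\n";
echo jpl_text_direction('de'), "\n";
```

The result could then be written next to the lang attribute wherever the plugin marks up a title or content.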

Provide an overview on untagged posts

Posts for which no language has been defined gracefully default to the language of the blog, as no lang attribute is overwritten in the HTML output. If the language of the blog is changed, however, all articles without a language get the wrong language in the HTML markup from then on, because their default is now a different one.

As a second drawback untagged articles are not found inside the taxonomy when that’s used to search posts by language.

To prevent this, the best way is to always specify the language of every article. Currently this requires updating every existing article when adding the plugin. An overview in the backend that lists all articles without a language would be great for checking correctness.

This can either be a distinct page in the admin menu or, even better, an additional column on the edit.php backend page that lists all posts. Such a column could display the language of each post, if set, and a filter box could filter on that column to find articles in one specific language or those that don’t have a language specified.
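The column variant could be sketched with the manage_posts_columns filter and the manage_posts_custom_column action that WordPress provides for the post list. The column key jpl_language, the taxonomy slug post_language and the function names are assumptions for illustration; the filter box is left out.

```php
<?php
/** Add a "Language" column to the post list (assumed column key). */
function jpl_add_language_column(array $columns): array
{
    $columns['jpl_language'] = 'Language';
    return $columns;
}

/** Text for one cell: the language codes, or a marker for untagged posts. */
function jpl_language_cell(array $codes): string
{
    return $codes ? implode(', ', $codes) : '(not specified)';
}

/** Print the cell content for the language column of one post. */
function jpl_print_language_column(string $column, int $postId): void
{
    if ($column !== 'jpl_language') {
        return;
    }
    // get_the_terms() returns false or a WP_Error if there are no terms.
    $terms = get_the_terms($postId, 'post_language');
    $codes = is_array($terms) ? wp_list_pluck($terms, 'name') : array();
    echo esc_html(jpl_language_cell($codes));
}

// Only hook in when running inside WordPress.
if (function_exists('add_filter')) {
    add_filter('manage_posts_columns', 'jpl_add_language_column');
    add_action('manage_posts_custom_column', 'jpl_print_language_column', 10, 2);
}
```

With this in place, untagged posts stand out in the list by their „(not specified)“ marker even before any filtering is available.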
