Drupal 7 i18n Architecture Q&A

Greg Sims wrote me an email the other day asking for some Drupal 7 internationalisation architecture advice, having seen the Christian Assemblies International website, for which I’m the tech lead. Here is his email (reproduced with permission), with my answers interspersed.

Hey John,

Thanks for responding to my query!  We are starting to load Spanish content into a Drupal 7 website.  D7 is still a little rough but it seems to have better features for a multi-national effort.  Specifically D7 has the ability to maintain all translations of a piece of content within a single node – it seems that D6 relates multiple nodes together where each node is a different language. Is cai.org on D6 or D7?

What a timely question! cai.org is currently on Drupal 6. However, believe it or not, I’m actually writing this in a plane on the way back from Berlin to my home in Valencia, Spain. In Berlin I got together with some others from our web team, and the primary thing we did was work on the Drupal 7 upgrade for a few days! D7 is definitely the way to go now.

The feature you’re referring to with “maintain all translations of a piece of content within a single node” has rather confusingly been referred to as “field translation”, “content translation” and “entity translation”. I think “entity translation” will stick and is the least confusing term. It is a very interesting development. I studied it recently and initially the concept seemed logical to me; however, I eventually concluded that there is very little value-add in the whole effort, and any explanation of the rationale or benefit of this approach is hard to come by! Gábor Hojtsy gives the main explanation of the benefits (see under heading “Translation-enabling fields”), which boils down to field sharing. If you have a common field, especially a big file field like a video, do you want to duplicate that on every node? And try to synchronise updates? Fair argument; however, there’s a fantastic field synchronisation module in D7 and it works very nicely. We’re using it in D6 and will continue to use it in D7.

So with the field problem solved, what is the actual motivation for field translation? I sense this is a large amount of work which will just create problems for i18n-related modules, because they will have to accommodate both ways of translation. Just read the list of problems this causes, explained in Gábor’s post. But I stand willing to be corrected. In your case I would really not recommend it unless you feel very strongly about it, because it’s still in alpha, I’m not sure anyone’s actually using it, and you’re no doubt going to fight with a lot of “pioneer problems”.

I noticed that you are prefixing your path names with the language codes and translating the path names as well.  I have read some material that says SEO might be better if the path names are also translated – was this your motivation? It certainly seems simpler if the path names remain largely the same when the language changes – ie) keep the paths in English.

Yes, translating path names is better for SEO, because Google (a.k.a. Skynet) ranks pages higher for a particular query if they have the keywords in the URL. Consider a search for “Water Baptism” and “Bautismo de Agua”. On the sheets I have linked to, the keywords feature in the URL. Not only is it better for the Spanish page to have its keywords in the URL, but additionally it is not competing with the English page for the term “Water Baptism”. OK, it’s not exactly going to beat the English page in the Skynet rankings considering the actual content, but nevertheless I think it’s better to have no confusion there.

Translating path names is also better for usability. If you see the URL, perhaps by hovering over a link from another website, you immediately know what it’s about. I generally try to give an international reader the feeling that a website was really made for them in their language. To do this you want to give them a complete experience, with no bits of English hanging around. This may not be achievable to perfection, but it’s the right aim to strive for. It also separates you from some websites out there which have a few poorly translated pages mingled in; on such websites I have found that any action I perform may or may not work.

I don’t see why it’s much simpler not to translate path names. Well, it might be simpler for administrators who don’t speak Spanish :-). But technically it’s easy – just install the Pathauto module and, if you like, the Transliteration module, and everything works.
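To make the idea of translated, language-prefixed paths concrete, here is a minimal sketch in Python of the kind of alias those modules produce. It is not Pathauto's or Transliteration's actual algorithm; the function names `slugify` and `localized_alias` and the `langcode/slug` pattern are assumptions chosen for illustration.

```python
import re
import unicodedata


def slugify(title: str) -> str:
    """Turn a page title into a lowercase ASCII slug (illustrative only)."""
    # Decompose accented characters so "ó" becomes "o" plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", title)
    # Drop everything that has no ASCII equivalent (combining marks, "¿", ...).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


def localized_alias(langcode: str, title: str) -> str:
    """Build a hypothetical 'langcode/slug' path alias for a translated title."""
    return f"{langcode}/{slugify(title)}"


if __name__ == "__main__":
    print(localized_alias("en", "Water Baptism"))
    print(localized_alias("es", "Bautismo de Agua"))
```

Each translation gets its own keywords in the URL, and the language prefix keeps the two pages clearly apart.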

We own www.RayStedman.es which we will likely use as a way to distinguish between Spanish and the English site at www.RayStedman.org.  It seems that this approach will appear as two different websites from Google’s perspective – I’m not sure if this is good or bad.  The Spanish site will likely “fall back” to English in the event that the Spanish content does not exist.  Any thoughts about paths would be appreciated.

The question of how to structure your domains is a good one and both options can work well. Usually I think business motives will decide it for you, in particular, availability of domains. We would never have been able to get cai.xyz for every current and hopefully future language, so we would have ended up with an array of varying domains like christianassemblies.de. How confusing for the web team, and not necessarily great for unified identity. In your case I reckon you’d have a better chance of getting raystedman.fr and so forth. Oh and by the way, all those domains cost money, which isn’t negligible. You also need an address in many countries where you register a domain (e.g. Germany and France insist on this). So if you plan on one day expanding into other languages, this is worth considering. If, on the other hand, you’re most likely just going to stick with the two languages you’ve got, then you’re already set.

I think that if the aforementioned problems are surmountable, a localised domain approach is actually superior in certain cases. This is, again, for both usability and SEO reasons. Skynet more or less tells us that a .de website will be ranked more highly for searchers in Germany than content under .org/de, whereas for searchers outside Germany the .org/de variant will be ranked more highly. Additionally, it kind of feels nicer that the website bothered to get the appropriate domain; it feels more self-contained and more like the local presence of an international organisation than the “bit where someone translated a few things”.

However, you have a rather special case with Spanish, because the majority of Spanish speakers are outside Spain itself, i.e. in Latin America and the USA, and the questions then become: “Who am I trying to target?” and “How does Skynet rank .es websites in comparison to .org websites for searchers outside Spain?” The fact is that .es refers to a country more than a language. After all, Switzerland has a domain extension (.ch) and they have three official languages. I think a lot of these things can be worked out by putting yourself in Skynet’s shoes and using a bit of logic. What they probably do with .es domains is try to determine what is local information, like your church address (if it were actually in Spain), and return that highly for Spanish searchers, and what is “general information” which would be relevant for anyone, like “what do I need to do to be saved?”, and return that highly for international searchers. At least that’s what I would do. Many organisations in South America use .com domains, I guess because if you’re an Argentinian firm and you want clients from more than just Argentina, or you feel that you may go outside your local country in the future, then a local country domain is limiting. For a US organization interested in targeting Spanish speakers in the USA, I sense that a .es domain might be a mistake.

Since cai.org has so many languages, I looked around at major international corporations to see what they do. Most of them, like Microsoft, go with the one-domain solution. But Skynet itself, for example, uses local domains.

Here is a specific problem I have been thinking about.  Google does not like duplicate content.  If our .es site falls back to English with missing Spanish content, it will likely be viewed as a duplication by Google – the same content on both .es and .org.  If the user comes to .es and we perform a redirect to .org, the user will stay in English even if they surf to a part of the site that has a Spanish translation.  We can have a language switcher block like cai.org but it would be best to automatically display Spanish if it is available.  It seems that cai.org is complete in all languages so you may not have this problem.

This is a very interesting question which I also considered in detail when laying down the architecture for cai.org. When I started using the i18n features of Drupal I felt that there was an inordinate emphasis on the ability to serve different language content at the same URL, language-independent content (how often do you have that??) and a mix of different language nodes on one page. These options are probably better for a community-edited site, but for a business-style site, i.e. one where you retain the editorial control and try to present a polished look, I feel that it’s better to try to keep the user in his language.

I believe that the duplicate content problem can be solved by the canonical tag, although I’m not sure whether that works between domains. However, the problem of a Spanish user getting “stuck” in English, or always having to check whether a Spanish option is available, is a serious one, and one that I didn’t like at all. Let’s call it “foreign language lock-in”. I feel it would be very disorientating to be on a Spanish website and then be shoved over to a foreign language on a different domain from the one you were expecting!

We did also face this problem because the cai.org website is not complete – see for example "Did Fish Grow Legs?". You’ll notice that there is only German in the language switcher block and in the links at the bottom of the page. I felt that it’s important to show in that block only the languages which we actually have a translation for. Otherwise what would you do when the person changes to, e.g., Spanish? Send them to the homepage? Show them the same English content, but with a Spanish URL? Those both seem horrible from a usability perspective. If you change languages on the Microsoft website it sends you back to the homepage in the new language – extremely irritating! By the way, the block in use on cai.org is actually a custom piece of code; the standard ones didn’t suit me. I can publish the code if you like.
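The switcher rule described above (only offer languages for which a translation of the current page actually exists) can be sketched in a few lines of Python. This is not the custom cai.org block; the data structure, the `switcher_links` name and the example paths are hypothetical.

```python
from typing import Dict, List, Tuple

# Display names for language codes (assumed values for illustration).
LANGUAGE_NAMES: Dict[str, str] = {
    "en": "English",
    "de": "Deutsch",
    "es": "Español",
}


def switcher_links(translations: Dict[str, str], current: str) -> List[Tuple[str, str]]:
    """Return (label, path) pairs for every existing translation except the current one.

    `translations` maps a language code to the path of that translation.
    Languages without a translation are simply absent from the mapping,
    so they never appear in the switcher.
    """
    return [
        (LANGUAGE_NAMES.get(code, code), path)
        for code, path in sorted(translations.items())
        if code != current
    ]


if __name__ == "__main__":
    # Hypothetical page that exists only in English and German.
    page = {
        "en": "/en/did-fish-grow-legs",
        "de": "/de/german-translation-of-the-page",
    }
    print(switcher_links(page, "en"))
```

With this rule a reader is never sent to the homepage, or to English content under a Spanish URL, when switching languages.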

I also see that some of your content is different depending on the language – not just translated but different.  We would likely not do much if any of this.

As far as I know we just translate content on cai.org; we don’t rewrite it in different languages. Can you give me an example of where you’ve seen the differing content?

Some more thoughts:
80% of the cai.org users come from search, and I would guess that this is a pretty common statistic for most big websites. When you come from search, you want to read the actual page you have searched for and seen in the search results, you don’t want anyone doing any language tricks on you.

I think we have enough content on cai.org to interest people in their native language, and I think people would be more willing to read something in their native language than a slightly more interesting-sounding page in a foreign language. It is evident that the web has the most information in English, yet if you have ended up on our Spanish site it’s obviously because you’ve chosen to browse the web in Spanish and accept the trade-off of having less information available. Bilingual people I know tend to bear this trade-off in mind when choosing which language to search the web in.

Of course we faced some tricky situations – our Swedish site didn’t have much content translated, yet basically everyone under 50 years old there speaks English. So our Swedish site editors started publishing the English pages as “Swedish” translations just to make them available to Swedish users on the Swedish site. But we stopped that because of duplicate content issues. We resolved it by adding a note on some key Swedish pages: “Lots more information is available on the English site.” That's what it is supposed to mean where it says "Vi har också översatt många av artiklarna till olika språk. Kontakta oss om du är intresserad av översättningar på engelska, tyska, franska, ryska, polska eller holländska." at the top of the Bibelstudier page. Actually, my wife and in-laws have taught me a bit of Swedish and now that I look at it, that in fact means "contact us for the translations"... I'd better get in touch with the Swedes and ask them about that. To be honest, “Lots more information is available on the English site” is a logical conclusion on any website, and I think most bilingual users instinctively know it anyway.

A Spanish translation aimed at US users is a fascinating question because a high proportion of those users might speak English anyway and be quite comfortable being shown English pages, if the content is more relevant. I don’t know the best solution. You may be able to exclusively use the “browser preference” option in Drupal to select which language content to show and this might be a way around the foreign language lock-in problem.
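Browser-preference negotiation works from the `Accept-Language` header the browser sends. The sketch below shows the general idea in Python; it is not Drupal's implementation, and the function names, the default language and the set of available translations are assumptions for illustration.

```python
from typing import List, Set, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Parse an Accept-Language header into (tag, quality) pairs, best first."""
    prefs: List[Tuple[str, float]] = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # A tag without an explicit q-value counts as 1.0.
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # Treat a malformed q-value as "not acceptable".
        prefs.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal q-values keep the order the browser sent.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def pick_language(header: str, available: Set[str], default: str = "en") -> str:
    """Choose the best available translation for this visitor, else the default."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # "es-us" falls back to "es".
        if primary in available:
            return primary
    return default


if __name__ == "__main__":
    # A page translated into Spanish: a Spanish-speaking US visitor gets Spanish.
    print(pick_language("es-US,es;q=0.8,en;q=0.5", {"en", "es"}))
    # A page that exists only in English: the same visitor falls back to English.
    print(pick_language("es-US,es;q=0.8,en;q=0.5", {"en"}))
```

Because the choice is made per page from what is available, the visitor returns to Spanish as soon as a translated page exists, which is the property needed to avoid lock-in.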

I feel I have much to learn about multi-language websites.  Are you “publishing” your site to the search engines in different languages?

I’ve never published any site to search engines. In the conventional sense of improving rankings, it’s entirely a waste of time. Skynet’s job is to spider the web, find all the pages and then rank them accordingly. They will not rank you better, nor worse (I think), for informing them about your site. After all, they are trying to give the best results to the user, and he doesn’t care who’s publishing what, he just wants the most relevant results! Of course your entire site has to be “spiderable”, i.e. you have to eventually be able to get to all the pages starting from any page of the site. Publishing, via things like sitemap.xml, is mainly relevant for two things:

  1. You have pages that for some reason don’t have any incoming links and therefore won’t be found otherwise by Skynet
  2. You want your new pages and page updates to be indexed faster.

This second reason is certainly worthwhile for news sites and perhaps for everyone, but it’s a question of effort invested vs gain. Google won't necessarily rush to index your new content just because you told them to, by the way.
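For reference, a sitemap is just an XML list of page URLs with optional last-modified dates. The Python sketch below builds one with the standard library; the `build_sitemap` name and the example.org URLs are hypothetical, and on a real Drupal site a contributed module would normally generate this file.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, str]]) -> str:
    """Build a sitemap document from (url, last_modified_date) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # lastmod tells the crawler which pages changed recently.
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical URLs and dates, one entry per translated page.
    print(build_sitemap([
        ("https://www.example.org/en/water-baptism", "2011-05-01"),
        ("https://www.example.org/es/bautismo-de-agua", "2011-05-03"),
    ]))
```

Each translation is listed as its own URL, which covers both cases above: pages without incoming links, and new or updated pages.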

I hope that helps you and gives any budding Drupal architects a better insight into the intricacies and technical challenges of an internationalised Drupal site!

Comments

I just noticed that this guy has some similar stuff to say about the domain/path selection issue.
