Language codes for Twitter

How to find tweets in a specific language?

That’s an issue for many Twitter users, including language learners and native speakers of other languages. Because of the dominance of English on the web, it’s easy to find English tweets. But finding tweets in other languages is not so straightforward.

One solution could be to tag each tweet with a language code. Using the existing language codes registered with IANA seems ideal for compatibility and ease of recognition. This coding system is widely used on the web and, in its basic form, uses a two-letter code for each language. For example, the code for English is en and the code for Maori is mi.

We could put a tag prefix symbol in front of each code so that we could search for tweets in that language. But the prefix needs to be different from the one currently used for tweet topics. Ideally, it would be a single non-alphabetic character: where # is used for topic tags, we could use something like the percent sign. So a tweet in modern Greek might be tagged with %el.

It might also be useful to flag tweets that are not written in the language's standard character set. For example, because of the limitations of some Twitter clients, we might want an additional symbol to mark a tweet that is in Greek but transliterated into the Latin alphabet, e.g. %el!
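As a rough illustration of how a client or search script might pick these tags out of a tweet, here is a minimal Python sketch. It assumes the convention suggested above: a percent sign, a lower-case two-letter code, and an optional trailing exclamation mark for transliterated text. The function name find_language_tags and the exact matching rules are assumptions made for illustration, not anything Twitter itself supports.

```python
import re

# Hypothetical convention from this post: "%" + two-letter code, with an
# optional trailing "!" marking transliterated text (e.g. "%el" or "%el!").
# The tag must start the tweet or follow whitespace, and must not run on
# into further letters or digits.
LANG_TAG = re.compile(r"(?<!\S)%([a-z]{2})(!?)(?![A-Za-z0-9])")


def find_language_tags(tweet):
    """Return a list of (code, transliterated) pairs found in the tweet."""
    return [(code, bang == "!") for code, bang in LANG_TAG.findall(tweet)]


if __name__ == "__main__":
    print(find_language_tags("Kalimera se olous %el!"))  # [('el', True)]
```

Requiring whitespace (or the start of the tweet) before the percent sign keeps something like "100%en" from being read as a tag. A tweet containing " %of" would still match, so checking each result against a list of real language codes would cut down on false positives.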

Since there is a lack of documentation on which specific characters are distinguished by Twitter’s search function, any use of a new tag prefix to denote language will require some trial-and-error testing to ensure it works effectively. If the polyglot community of Twitter users could agree on such a coding system, it would make it much easier to find relevant posts in languages other than English.

Image: Brueghel’s Tower of Babel


8 thoughts on “Language codes for Twitter”

  1. Heru Kurniawan

    It's still an issue, isn't it? Yes, I totally agree with you that a coding system would make it much easier to find relevant posts in languages other than English. I hope it can be carried out.

  2. Alex Boschmans

    Unless you do an analysis of each tweet, you cannot readily distinguish the language in tweets – however each tweet does come with the ISO language code of the user that created it.

    I know, cos I’m currently scanning keywords in twitter and differentiating tweets based on these language codes :-)

    For a demo, see http://twitalytics.dataconnect.be/twitdemo – on the right hand side of each tweet the iso language code is shown.

    There is of course (at least) one problem – tweeters who speak multiple languages. They set their code once and tweet in 3 or 4 different languages. There is no way to distinguish the tweets unless you use Bayesian or trigram analysis. And then you have those who set it to whatever just to get rid of it.

    But with it you can do *some* filtering so that most of the noise is cleaned out.

    Contact me via mail if you want more info on this.

    Regards,

    Alex

    1. Paul Left Post author

      Fascinating – thanks for this! I hadn’t realised the user’s language code was attached to every tweet. As a language learner I’d like to be able to use several languages, hence the idea of embedding the 2-letter codes within each tweet. It’d be nice if you could at least override your own default language code for individual tweets.

      Thanks again for the interesting link…

  3. Hywel

    I’ve just started looking at the language codes of tweets including the word ‘Cymraeg’ (i.e. the Welsh word for Welsh). Of 100, all were written in either Welsh or English, or a mix of the two, with one including some Scottish Gaelic too. 37 were coded as English, 20 as Dutch, 10 French, 7 Polish, 6 Icelandic, 4 German, 4 Danish, 3 Italian, 2 Norwegian, 2 Spanish, and 1 each in Esperanto, Finnish, Indonesian, Lithuanian and Portuguese. I think one can be fairly certain that these language codes were not set (at least deliberately) by the users.

  4. Paul Left Post author

    My understanding is that Twitter has a language setting for users but not for individual tweets, i.e. when a user posts a new tweet, they cannot set the language for that tweet; it inherits the language of its user. So Twitter appears to assume that every user has one native language and makes no allowance for using multiple languages. For language learners or bilingual users it’s less than ideal. That’s why my original post suggested a way to include a two-letter language code in individual tweets.

    These comments relate to tweets written using a standard Twitter client. I’m not sure what provision there is in Twitter’s API for including a language code in individual tweets – maybe it would be possible to develop a client which did allow selection of a language code for individual tweets but I don’t think this would be straightforward. Happy to be contradicted on this :-)

  5. Dafydd Tomos

    The language codes returned for individual tweets are clearly incorrect. My twitter client interface is in English, although I’m not aware that it sends any language code. My tweets in the last few days are marked as English, Danish, French and Polish.

    Rather than a problem with the code sent from the client, it might be a bug in the Search API as noted here ?

    The set of language codes supported by the API appears to be restricted to the languages officially supported by Twitter as an interface language, so it’s not much use for any unsupported languages. Twitter’s l10n support seems rather poor in general.
