MfGames.Culture API - Language Codes

This is the first part of a short series on the MfGames Culture CIL API. It is currently alpha software, but I'm looking for critiques, opinions, and general feedback. All of my work for this is in the GitHub repository in the drem-0.0.0 branch. It is licensed under the MIT license.

This page is also a form of documentation by example.

When I started working on the culture logic, I decided to hang the code off as many standards as possible. I was already very familiar with ISO 639, a standardized list of languages and the codes that identify them. You see these codes in many programs and places, such as en or fr (English and French, respectively).

Links

  1. Introduction
  2. Language Codes
  3. Country Codes

ISO 639

There are a few components to the ISO 639 code:

  • A two-character code (en and fr).
  • A three-character code (eng).

Actually, there are two versions of the three-character code: a bibliographic code and a terminologic code. These are known as the B and T codes respectively. The bibliographic code is based on the English name of the language, while the terminologic code is based on the language's name for itself.

For example, the bibliographic code for Armenian is arm while the terminologic is hye.

According to Wikipedia, the terminologic code is preferred over the bibliographic one.

Also, en and eng identify the same language, but if you treat them simply as strings, they are different.
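A minimal illustration of the problem, using nothing but the standard string comparison. The class and method names here (StringCodeDemo, SameAsStrings) are made up for this example and are not part of the library.

```csharp
using System;

public static class StringCodeDemo
{
    // A plain ordinal comparison knows nothing about ISO 639, so the
    // two- and three-character codes for English look like different values.
    public static bool SameAsStrings(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(SameAsStrings("en", "eng"));  // False
        Console.WriteLine(SameAsStrings("eng", "eng")); // True
    }
}
```

Any code that keys a dictionary or a set on the raw string would therefore treat en and eng as two separate languages.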

System.Globalization

To my surprise, there is no dedicated object in the C# base library for ISO 639 codes. There are some properties on CultureInfo in System.Globalization, but nothing that handles the equivalency of en and eng. And I haven't had a lot of success with creating non-standard languages (my xmi for Miwāfu) inside the framework.

There are some enum versions of the ISO code, but they don't have the flexibility to add custom languages.

Unable to find something already there, I created my own ISO 639 class for handling these codes. I called it LanguageCode because I didn't like how Iso639 looked. It does ignore the other standards for languages right now, but I was thinking that LanguageCode could handle all of those as separate properties.

var english1 = new MfGames.Culture.Codes.LanguageCode("eng");
var english2 = new MfGames.Culture.Codes.LanguageCode("eng", "en");

Assert.AreEqual(english1, english2);

I set it up so that ToString() returns the preferred three-character code.

var armenian = new LanguageCode("hye", "hy", "arm");

Assert.AreEqual("hye", armenian.IsoAlpha3);
Assert.AreEqual("hye", armenian.IsoAlpha3T);
Assert.AreEqual("arm", armenian.IsoAlpha3B);
Assert.AreEqual("hy", armenian.IsoAlpha2);
Assert.AreEqual("hye", armenian.ToString());

LanguageCode is an immutable object that encapsulates all the properties of an ISO 639 code except for its name. It also compares against the preferred three-character code for equivalency. I also had it intern the strings to avoid memory pressure with a larger number of codes.
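The behavior described above can be sketched roughly as follows. This is a hypothetical stand-in, not the actual MfGames.Culture source: the class name SketchLanguageCode is invented, the constructor argument order is copied from the examples above, and falling back to the T code when no B code is given is an assumption.

```csharp
using System;

// Hypothetical sketch of an immutable language code; not the library's code.
public sealed class SketchLanguageCode : IEquatable<SketchLanguageCode>
{
    public SketchLanguageCode(string alpha3T, string alpha2 = null, string alpha3B = null)
    {
        if (alpha3T == null) throw new ArgumentNullException(nameof(alpha3T));

        // Intern every string so equal codes share one string instance.
        IsoAlpha3T = string.Intern(alpha3T);
        IsoAlpha2 = alpha2 == null ? null : string.Intern(alpha2);
        // Assumption: when no B code is supplied, reuse the T code.
        IsoAlpha3B = string.Intern(alpha3B ?? alpha3T);
    }

    // Get-only properties make the object immutable after construction.
    public string IsoAlpha3T { get; }
    public string IsoAlpha3B { get; }
    public string IsoAlpha2 { get; }

    // The preferred three-character code is the terminologic one.
    public string IsoAlpha3 => IsoAlpha3T;

    // Equality looks only at the preferred three-character code.
    public bool Equals(SketchLanguageCode other) =>
        other != null && string.Equals(IsoAlpha3, other.IsoAlpha3, StringComparison.Ordinal);

    public override bool Equals(object obj) => Equals(obj as SketchLanguageCode);

    public override int GetHashCode() => StringComparer.Ordinal.GetHashCode(IsoAlpha3);

    public override string ToString() => IsoAlpha3;
}

public static class SketchLanguageCodeDemo
{
    public static void Main()
    {
        var english1 = new SketchLanguageCode("eng");
        var english2 = new SketchLanguageCode("eng", "en");
        Console.WriteLine(english1.Equals(english2));                  // True
        Console.WriteLine(new SketchLanguageCode("hye", "hy", "arm")); // hye
    }
}
```

Overriding GetHashCode alongside Equals keeps the two consistent, so codes that compare equal also behave as one key in dictionaries and sets.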

Memory

Memory is something I concern myself with. With a single code, you have:

  • The pointer to the code
  • The class overhead for LanguageCode
  • Three pointers to strings
  • Three strings in memory.

Using an interned string for the code means that the three pointers will remain, but at least I won't have a huge number of three- and two-character strings in memory.
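As a small illustration of what interning buys, the sketch below builds the same three-character code twice at run time and compares references before and after string.Intern. The InternDemo class and its Build helper are invented for this example.

```csharp
using System;

public static class InternDemo
{
    // Builds the string at run time so each call allocates a new object
    // (string literals in source are already interned by the runtime).
    public static string Build(char a, char b, char c)
    {
        return new string(new[] { a, b, c });
    }

    public static void Main()
    {
        string first = Build('e', 'n', 'g');
        string second = Build('e', 'n', 'g');

        // Same contents, but two separate objects on the heap.
        Console.WriteLine(object.ReferenceEquals(first, second)); // False

        // string.Intern returns one shared instance for equal contents.
        Console.WriteLine(object.ReferenceEquals(
            string.Intern(first), string.Intern(second)));        // True
    }
}
```

One trade-off to keep in mind is that interned strings stay in the runtime's intern pool for the lifetime of the process, which is acceptable for a small, fixed vocabulary like language codes.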

Singleton

I still wanted to potentially reduce the memory pressure even further. To do this, I created a singleton class, LanguageCodeManager, which provides access to a single shared instance of each LanguageCode.

var manager = LanguageCodeManager.Instance;
var english1 = manager.Get("eng");
var english2 = manager.Get("en");
var english3 = manager.GetIsoAlpha3("eng");
var english4 = manager.GetIsoAlpha3T("eng");

This way, you'll only have one instance of “English” regardless of how many pointers you use. Of course, if you also decide to manually create an English tag, it will continue to compare against the singleton version even though it is a separate object.
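One way such a lookup could work is a dictionary that maps every known code string to the same object. The sketch below is a guess at the idea, not the actual LanguageCodeManager; the SketchCode and SketchRegistry names, and the choice to return null for unknown codes, are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, minimal stand-in for a language code.
public sealed class SketchCode
{
    public SketchCode(string alpha3, string alpha2 = null)
    {
        Alpha3 = alpha3;
        Alpha2 = alpha2;
    }

    public string Alpha3 { get; }
    public string Alpha2 { get; }
}

// Hypothetical registry: every alias points at the same instance.
public sealed class SketchRegistry
{
    private readonly Dictionary<string, SketchCode> byAnyCode =
        new Dictionary<string, SketchCode>(StringComparer.Ordinal);

    public void Add(SketchCode code)
    {
        // Register the code under each string it can be looked up by.
        byAnyCode[code.Alpha3] = code;
        if (code.Alpha2 != null)
        {
            byAnyCode[code.Alpha2] = code;
        }
    }

    // Assumption: unknown codes yield null rather than throwing.
    public SketchCode Get(string anyCode)
    {
        SketchCode found;
        return byAnyCode.TryGetValue(anyCode, out found) ? found : null;
    }
}

public static class SketchRegistryDemo
{
    public static void Main()
    {
        var registry = new SketchRegistry();
        registry.Add(new SketchCode("eng", "en"));

        // Both lookups return the very same object.
        Console.WriteLine(object.ReferenceEquals(
            registry.Get("eng"), registry.Get("en"))); // True
    }
}
```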

I made LanguageCodeManager an injectable singleton to provide for customizations.

LanguageCodeManager.Instance = new LanguageCodeManager();
LanguageCodeManager.Instance.Add(new LanguageCode("xmi")); // Miwāfu
LanguageCodeManager.Instance.Add("xlo"); // Lorban

Assert.AreEqual(2, LanguageCodeManager.Instance.Count);

foreach (LanguageCode lc in LanguageCodeManager.Instance)
{
	Assert.IsNull(lc.IsoAlpha2);
}

This also means that most methods that use language codes actually take a LanguageCodeManager as a parameter to facilitate testing and isolation. So far, I have found that this adds a bit of overhead to many functions, but I think it gives the flexibility needed; I'm in the process of converting most of those parameters to argument objects to simplify the process.

A newly constructed LanguageCodeManager does not have any of the ISO codes; it is an empty list of codes. To add the ones stored as a manifest resource, you can call AddDefaults(). The manager initially assigned to LanguageCodeManager.Instance has these defaults already added.

Why a class?

I decided to make LanguageCode a class, despite the overhead of a class, mainly to make it easy to pass null in. The other reason is that a struct would carry at least three string pointers everywhere it is used instead of a single pointer.

Why not string?

The main reason I didn't just leave this as a string is type safety. I like passing in a language code when it is supposed to be a language code and not worrying about which of the five different strings is supposed to be the three-character code. Or if it is supposed to be a two-character code. Or something else.

var english = LanguageCodeManager.Instance.Get("eng");
var translation = GetTranslation(english, "bob");

Special

There is one LanguageCode that doesn't fit with the ISO standard, “Canonical”. This has a code of * for all of the fields and is used to do the final matching or determine the canonical name of something.

var canonical = LanguageCode.Canonical;

Assert.AreEqual("*", canonical.IsoAlpha3);

Names of Languages

One aspect of the language code that is not included in the object is the name of the language. This led into one of the more complicated parts of the library, and one of the ones I'm most unsure about, but that requires me to have country codes and language tags to explain.

Self-review

An interesting aspect of writing up this page is that I found things wrong with my API. For example, I had LanguageCode.Alpha3 when it really should have been LanguageCode.IsoAlpha3. It is a simple change, but writing this was a way of stepping back and looking over it again.
