Resource String Guidelines

1. Resource Files

1.1. Resource File Formats

1.1.1. Use a Standard Resource File Format

Ideally, use one of the established resource file formats.

Examples of well-established, standard resource file formats include...

	`.strings` - iOS and macOS resource strings file format (often as `Localizable.strings`)
	`.xlf` - XML using the XLIFF schema
	`.properties` - Java properties file format
	`.po`, `.pot` - PO (Portable Object) and POT (Portable Object Template) files used by GNU's gettext, and also by Drupal and PHP
	`.resx` - resource file format for Microsoft .NET applications
	`.rc` - Windows resource files (older format, typically used for C/C++ desktop applications)
	`.yaml` - YAML formatted resource string files for Ruby on Rails

Using a non-standard resource file format makes it necessary to implement specialized parsing and serialization code in a custom adapter for the localization automation system.

2. Resource Strings

2.1. Resource String Naming

The following guidelines should be used regarding the naming of resource strings.

2.1.1. Use Symbolic Naming

It is highly recommended that resource strings be named using a symbolic naming scheme, rather than using the source language text as the name/key.

INSTEAD OF...

"Your password has been reset." = "Your password has been reset."

USE...

PASSWORD_RESET_CONFIRMATION = "Your password has been reset."

It is a popular practice in some software frameworks to use the source language text for a resource string as its key name. This practice has the following pros and cons...

PROS

	you don't have to think up a symbolic name, since you just use the source text
	you can tell from the code literally what the string says

CONS

	the source text for a string may need to be changed in future, in which case the key would need to be changed in all the already translated resource files containing the string, as would the application code that uses the string
	a well conceived symbolic name can often be more expressive in conveying the purpose of a resource string in the context of its usage in an application, whereas with a source text key one might need to read through the text to determine the function of the string (consider the example of a privacy legal disclaimer paragraph)
	the source text can be very lengthy (a paragraph or more) for some strings, so it can start feeling quite clumsy to use a large (and possibly multi-line) block of text as the string key
	the source text can contain characters that would need to be escaped, such as quotation characters
	the source text may be multi-line and may contain new line characters that need to be escaped

Most frameworks support arbitrary resource string names, so using the source language text as the key is merely a convention, and symbolic naming is equally possible. Because of the cons listed above, it is highly recommended that symbolic naming be used consistently throughout all the components of an application, across all the technologies used.
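The first con above can be illustrated with a minimal sketch. The bundle layout, the `localize` function, and the French text are hypothetical and only stand in for whatever the localization system actually provides.

```python
# Hypothetical resource bundles keyed by symbolic names (illustration only).
BUNDLES = {
    "en": {"PASSWORD_RESET_CONFIRMATION": "Your password has been reset."},
    "fr": {"PASSWORD_RESET_CONFIRMATION": "Votre mot de passe a été réinitialisé."},
}

def localize(key, locale, source_locale="en"):
    """Return the string for key in locale, falling back to the source locale."""
    bundle = BUNDLES.get(locale, {})
    if key in bundle:
        return bundle[key]
    return BUNDLES[source_locale][key]

# Rewording the source text touches only the "en" value. The key, the calling
# code, and the existing French entry all stay as they are.
BUNDLES["en"]["PASSWORD_RESET_CONFIRMATION"] = "Your password was reset successfully."

english = localize("PASSWORD_RESET_CONFIRMATION", "en")
french = localize("PASSWORD_RESET_CONFIRMATION", "fr")
```

Had the source text been the key, the same rewording would have required a key change in every bundle and at every call site.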

2.1.2. Use Identifier Naming

In addition to the recommendation to use symbolic naming, it is further recommended to specifically use identifier naming.

INSTEAD OF...

"password reset confirmation" = "Your password has been reset."

USE...

PASSWORD_RESET_CONFIRMATION = "Your password has been reset."

OR...

passwordResetConfirmation = "Your password has been reset."

While a string name like "password reset confirmation" technically qualifies as symbolic naming, a name like that cannot be used as a naked identifier in code.

It is recommended that string names rely on a restrictive naming scheme that allows string names to be used as valid identifiers in most languages, thereby providing the maximum flexibility in how those string names may be used, either for property dereferencing with resource string bundles, or through dynamically generated methods that can be used to process strings with any substitutions they may support.

To comply with this recommendation, string names should...

	start with one of the characters "a" through "z", "A" through "Z", "$", or "_"
	beyond the first character, have zero or more additional characters "a" through "z", "A" through "Z", "0" through "9", "$", or "_"

This means that a string name should match against the following regular expression...

REGULAR EXPRESSION

/^[$_a-zA-Z][$_a-zA-Z\d]*$/

While this might be slightly restrictive for some languages that support non-ASCII characters in identifier names, this "lowest common denominator" approach should ensure string naming that is compatible with identifier naming in most popular / prevalent programming languages.
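As a sketch of how an audit script might apply this rule, the following Python function checks names against an equivalent pattern. The function name `is_identifier_name` is an assumption made for illustration; it is not part of any particular localization system.

```python
import re

# ASCII-only equivalent of the regular expression above. [0-9] is spelled out
# because \d in a Python str pattern also matches non-ASCII digits.
IDENTIFIER_NAME = re.compile(r"[$_a-zA-Z][$_a-zA-Z0-9]*")

def is_identifier_name(name):
    """Return True if name satisfies the identifier naming rules."""
    # fullmatch anchors both ends, so a trailing newline is rejected as well.
    return IDENTIFIER_NAME.fullmatch(name) is not None

candidates = [
    "PASSWORD_RESET_CONFIRMATION",
    "passwordResetConfirmation",
    "password reset confirmation",
    "2fa_prompt",
]
non_compliant = [name for name in candidates if not is_identifier_name(name)]
```

Here `non_compliant` ends up holding the two names that could not be used as naked identifiers: the one containing spaces and the one starting with a digit.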

2.2. Resource String Values

The following guidelines should be used regarding the values / content of resource strings.

2.2.1. Limit Resource String Size

The size of resource strings should generally be limited, ideally to single sentences.

When resource strings are large (a paragraph or longer), an entire string may need to be re-translated when only a portion of it changes. Translators typically charge per word and have little incentive to translate less, so they may optimize out redundancies only at the level of whole resource strings and not within them. Maintaining more granular strings helps to minimize ongoing translation costs.

2.2.2. Limit Parameterization That Has Declension Effects

The number of different parameters that have declension effects inside the same resource string should be limited, especially since numerous permutations can be required for some languages.

Consider the example of "{personA} gave {quantity} {object} to {personB}". The token {personA} is a gendered value, as is {personB}. In some languages, depending on its value, the token {object} could also be gendered. It also has pluralization declension effects, because a quantity of it is specified in the sentence. Because of the numerous possible values for these substitutions, a number of variations would need to be produced when translating to languages such as French.

The permutation effect can also come into play when not breaking up large resource strings into numerous, sufficiently small resource strings. In a large paragraph of text, or a resource string that contains multiple paragraphs of text, different substitution tokens could appear innocently in different parts of the large block of text, and the values for these different tokens may impose their own declension effects that combine to produce multiple translation permutations for the string as a whole.

EXAMPLE

You have {newTexts} new text messages, {newVoicemails} new voice-mails, and {newEmails} new emails.

In the above example, three phrases are being combined to form a sentence, and each phrase contains a count substitution that will impose a pluralization declension effect. Considering that the Russian language has three different plural forms, and if one also wished to use words rather than a number substitution for the 0 cases, this single resource string might need 64 different Russian translations (4 x 4 x 4) to cover all the possible permutations of values for the newTexts, newVoicemails, and newEmails substitution values. If the resource string were split up into three separate resource strings and presented as discrete sentences, then the number of Russian translations would be just 12 (4 + 4 + 4).
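The arithmetic can be spelled out in a few lines. The figure of four variants per count is taken from the paragraph above (three plural forms plus a worded zero case); the variable names are illustrative only.

```python
# Assumption from the text: four variants per count substitution
# (three Russian plural forms plus a worded zero case).
VARIANTS_PER_COUNT = 4
COUNT_TOKENS = ["newTexts", "newVoicemails", "newEmails"]

# One combined string: every combination of variants needs its own translation.
combined = VARIANTS_PER_COUNT ** len(COUNT_TOKENS)

# Three separate strings: each string needs only its own variants.
split = VARIANTS_PER_COUNT * len(COUNT_TOKENS)
```

The combined string grows multiplicatively with each added count token (64 here), while the split strings grow additively (12 here).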

2.2.3. Avoid HTML in Resource Strings

Embedding HTML in resource strings should be avoided.

2.2.3.1. Downsides of Embedding HTML in Resource Strings

To dispel any doubts about the importance of avoiding HTML in resource strings, consider the following downsides of embedding HTML...

2.2.3.1.1. Increased Translation Complexity

Embedded HTML can complicate the translation process by increasing the complexity of the resource strings.

This is especially true if portions of sentences are enclosed in formatting markup. Depending on the grammar of the language being translated to, it may not be obvious how to map the formatting from the canonical string to the translated string. In any event, it takes extra care to preserve the intent of the formatting markup and ensure that the formatting doesn't literally get lost in translation.

2.2.3.1.2. Increased Refactoring Effort

Having HTML embedded inside resource strings can increase refactoring effort as an application grows and is localized to more locales.

Certainly, HTML inside resource strings can complicate refactoring as text searches will need to cover resource strings as well. If HTML embedded in a resource string interacts with and has dependencies on the context in which it is used (such as relying on CSS classes), then the resource string will need to be updated along with the related changes in CSS and/or JavaScript.

Now, if the resource string is already translated for several languages, then the HTML embedded in the various translations will need to be updated as well. Alternatively, deployment will have to wait on re-translation from the canonical string in order for the translated versions of the resource string to reflect the change in the HTML embedded in the canonical string.

Having to update resource strings as part of layout change refactoring is clearly undesirable.

2.2.3.1.3. Increased Re-translation Costs

HTML embedded inside resource strings is sometimes associated with larger resource strings that have not been split up into multiple smaller strings.

If this is the case, re-translation costs can increase as a result of having to re-translate multi-sentence resource strings where only one sentence has been modified. Translators are not financially incentivized to reduce their redundant translation effort and may only perform delta optimization as granular as individual resource strings and not at the granularity of changed segments of large strings.

Translation services may also employ crude metrics for calculating their fees that take string size into account, and extra HTML will increase string size. A charge inflated by embedded HTML may slip past the person managing the relationship with the translation service. The safest policy in this case is to not have HTML inside resource strings.

2.2.3.1.4. Unnoticed Bugs Creep In

When HTML is embedded in resource strings, bugs can creep in as a result of refactoring changes.

When refactoring changes are made in the larger codebase that interact with the resource strings, but where the corresponding changes that would need to be made in the resource strings are neglected, then bugs can arise that can go undetected by the developer. This is especially true for resource strings that only appear in the UI under very specific conditions that are not easy for the developer to replicate (such as when the user account needs to be in a specific uncommon state).

Changes that would need to be made in the resource strings can be neglected in any or all the translated versions of a string. In particular, a developer is not likely to test the application for all supported locales after their change, so bugs can slip in that affect just certain locales that are not the dominant locale used by the developer during development.

2.2.3.2. Most Problematic Issues

The following problematic issues can arise when embedding HTML markup inside resource strings...

2.2.3.2.1. Inline Style in Resource Strings

To address some issue with layout, a developer may be tempted to embed inline style attributes inside the HTML that is embedded in the resource strings.

This could arise if a developer takes a lazy approach to addressing a layout issue with a language where longer average word length breaks the layout in some way. A developer may slip in a font-size inline style to quickly resolve a bug for a language such as German, for instance.

Embedding inline style in resource strings has the following issues...

	Maintenance Pain - Just as with inlining style in HTML templates, inlining CSS style in HTML inside resource strings can produce pain with maintenance of the codebase as it evolves and is refactored, since styling is now distributed rather than being consolidated in CSS files.
	Obliteration Upon Re-translation - If specific styling exceptions are placed in the translated resource strings for specific languages, those language specific modifications can easily suffer from obliteration if the canonical string is modified in future and the translator re-translates the text from the canonical string to produce new translated versions per language.

2.2.3.2.2. Inline Code or Binding Logic

In the situation of a portion of a string needing to be dynamic, it might be tempting for a developer to embed the templating language's data binding tokens inside a resource string.

This is problematic for several reasons, including...

	Maintenance Pain - With data binding logic embedded inside resource strings, the logic of the application is distributed and harder to maintain. Bugs become more likely during refactoring, when changes are made in some parts of the code but the developer misses the corresponding changes in the translated strings for all supported languages.
	Translation Correctness - When a string starts out with no dynamic parts and it is later decided that a name, quantity, or other value needs to be substituted into it, it might be tempting for the developer to simply add a data binding token / expression to the resource string and rely on the data binding facilities of the UI technology. This might work according to the grammatical rules of the developer's spoken language, but it's quite possible it will not work for all other languages.

2.2.3.2.3. Inline URLs or HREFs

In string resources that contain linked text or inline images, it might be tempting to a developer to include URLs inside resource strings, particularly where the URLs may differ by locale / language.

This situation may more likely arise in resource strings for marketing pages or legal documents. In any event, including fully resolved URLs inside resource strings should be avoided. Instead, such strings should be parameterized and the application should supply the URLs (possibly locale specific) as substitution tokens to the localization methods. This permits the URLs to be processed separately through the selection / fallback mechanism of the localization system.
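A minimal sketch of this approach follows. The string keys, the `{termsUrl}` token, the example.com URLs, and both functions are hypothetical; the point is only that the string and the URL are resolved separately, each with its own fallback.

```python
# Hypothetical resource strings: link targets are tokens, not literal URLs.
STRINGS = {
    "en": {"TERMS_NOTICE": "Read our terms at {termsUrl}."},
    "de": {"TERMS_NOTICE": "Lesen Sie unsere Bedingungen unter {termsUrl}."},
    "fr": {"TERMS_NOTICE": "Consultez nos conditions sur {termsUrl}."},
}

# Hypothetical locale-specific URLs, maintained by the application.
# There is deliberately no "fr" entry, to show the fallback.
URLS = {
    "en": {"termsUrl": "https://example.com/en/terms"},
    "de": {"termsUrl": "https://example.com/de/terms"},
}

def resolve(table, key, locale, default_locale="en"):
    """Look up key for locale, falling back to the default locale."""
    return table.get(locale, {}).get(key) or table[default_locale][key]

def terms_notice(locale):
    """Build the notice from a separately resolved template and URL."""
    template = resolve(STRINGS, "TERMS_NOTICE", locale)
    url = resolve(URLS, "termsUrl", locale)
    return template.replace("{termsUrl}", url)

german = terms_notice("de")
french = terms_notice("fr")
```

In this sketch the French notice uses the French text with the English URL, because the URL lookup falls back independently of the string lookup. A URL change then never requires touching a translated string.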

2.2.3.3. Alternatives to Carte Blanche HTML

There are at least two reasonable alternatives to allowing developers carte blanche in embedding HTML markup in resource strings.

2.2.3.3.1. HTML Subset

A very limited set of HTML tags could be supported to provide basic formatting ability without permitting layout to be implemented within HTML in resource strings.

Tags supported could include b, i, strong, p, br, ul, ol, li, and other basic tags. A downside of this approach is that it might not be entirely clear that there are limits to the HTML tags that can be used and developers may frequently hit up against the frustration of not being able to go that one step further by adding a CSS class or an inline style attribute or some other tag that could solve their immediate problem.
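One way to make the limits explicit is to have a checker report every tag or attribute that falls outside the subset. The sketch below uses Python's standard `html.parser` module. The allowlist is the set of tags named above, and the decision to reject all attributes is an assumption about the policy, not a requirement stated here.

```python
from html.parser import HTMLParser

# Tags named in the text; the exact allowlist is a policy decision.
ALLOWED_TAGS = {"b", "i", "strong", "p", "br", "ul", "ol", "li"}

class SubsetChecker(HTMLParser):
    """Collects violations of the allowed HTML subset."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"tag <{tag}> is not allowed")
        # Assumed policy: no attributes at all, which rules out class and style.
        for name, _value in attrs:
            self.violations.append(f"attribute '{name}' on <{tag}> is not allowed")

def check_subset(value):
    """Return a list of violation messages for one resource string value."""
    checker = SubsetChecker()
    checker.feed(value)
    checker.close()
    return checker.violations

clean = check_subset("Your <b>password</b> has been reset.")
styled = check_subset('<b style="font-size:10px">Hi</b>')
```

A message such as the one in `styled` tells the developer exactly which limit was hit, which may reduce the frustration described above.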

2.2.3.3.2. Wikitext / Markdown

A wikitext / markdown language could be supported to allow minimal control over formatting and logical layout.

A wikitext / markdown language, for example, may be sufficient for the purposes of legal documents. Using this approach, it would be plainly clear to a developer that they're in wikitext territory and there is no freedom to embed HTML. Furthermore, the localization system could explicitly prevent HTML by scanning for HTML tags and producing errors that would be seen by the developer, or HTML could be displayed literally and a warning produced for the developer. Audit scripts could also be employed to detect unwanted HTML inside resource strings.

Examples of such languages include...

	Creole - an effort to create a standard wiki markup language
	Markdown - used on GitHub, Stack Overflow, and some notable others
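The scan for unwanted HTML mentioned above could be sketched as follows. The pattern and the `find_html` function are illustrative assumptions; a real audit would need to decide how strictly to define a tag.

```python
import re

# Matches opening, closing, and self-closing tags such as <b>, </b>, <br/>,
# and <a href="...">. A Markdown autolink like <https://example.com> is not
# matched, because ":" cannot follow the tag name.
HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")

def find_html(strings):
    """Return {key: [tags found]} for every string that contains an HTML tag."""
    findings = {}
    for key, value in strings.items():
        tags = [match.group(0) for match in HTML_TAG.finditer(value)]
        if tags:
            findings[key] = tags
    return findings

# Hypothetical resource strings written in Markdown.
findings = find_html({
    "EULA_INTRO": "**Important:** read this carefully.",
    "EULA_LINK": 'See <a href="/terms">the terms</a>.',
    "EULA_SITE": "Visit <https://example.com> for details.",
})
```

Only `EULA_LINK` is reported here. Whether a finding becomes an error or a warning is a policy choice.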

2.2.3.4. Areas of Temptation

The areas where it is tempting to embed HTML into resource strings are those where there are often larger blocks of text to translate and where it is perceived as too onerous to separate strings from structure.

Such areas include...

2.2.3.4.1. Marketing Pages

Marketing pages can have large, multi-paragraph blocks of text that contain HTML markup for layout and styling.

Often these pages originate from a CMS and are built initially without localization in mind. It might be assumed that the content will vary so dramatically by locale that it's not feasible to create a single structure with common sentences that can be translated per locale. Even when it is feasible, the upfront cost of separating structure from strings is not perceived as having a worthwhile payoff at the time.

When it comes time to localize the content, the apparent path of least resistance might be to provide the pages, as is, to translators to translate. If the pages continue to live and evolve over time, then the pain of not investing in separating structure from strings becomes apparent.

2.2.3.4.2. Legal Documents

Much like marketing pages, legal documents like EULA and TOS documents tend to be large documents that already contain embedded HTML by the time that consideration is given to translating them to different languages.

One would be hard pressed to find a writer of a legal document who would author the document as a structure template and a set of resource strings. Deconstructing a legal document into a template and a set of resource strings would also present an obstacle to the legal team when it comes time to update portions of the document.

Furthermore, specific locales may actually impose different legal frameworks that may require a specific legal document to differ in its content from one locale to another. For these reasons, the only viable approach may be to translate such documents in their entirety. Fortunately, legal documents do not generally require token substitution.

For legal documents, any of the recommended alternatives to carte blanche HTML would be acceptable.

3. Policing and Auditing

In order to enforce the chosen policies with respect to resource strings, conformance sanity checks should be instituted.

Opportunities to perform sanity checking include...

3.1. Manual Audits

The resource string sanity checker should be runnable manually on a project's entire codebase via an audit script.

This is particularly useful as a stopgap measure before better integrations are put in place, or when first initiating compliance for legacy resource string files that were created before a continuous sanity checking process was instituted.

3.2. Build Process Integration

An obvious opportunity to sanity check resource string files is during a build process - either as part of a CI (Continuous Integration) process, or as part of a staging deploy process.

This opportunity for sanity checking is not ideal, as it doesn't catch issues all that early.

3.3. Pre-commit Hooks

Pre-commit hooks can be instituted in the source control system in order to sanity check any changes to resource string files contained in commits.

For serious violations of the compliance guidelines, commits can be blocked outright. For minor issues, a report can be sent, or such issues can be left to post-commit hooks, build process integration, or manual audits.
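The block-or-report decision could be sketched as below. The severity labels and the `decide` function are hypothetical, and which checks count as serious is a policy choice. In Git, for example, a pre-commit hook that exits with a non-zero status aborts the commit, so the returned exit code would be passed to the process exit.

```python
def decide(findings):
    """Return (exit_code, report_lines) for a list of (severity, message) pairs.

    A non-zero exit code means the commit should be blocked.
    """
    serious = [message for severity, message in findings if severity == "serious"]
    minor = [message for severity, message in findings if severity == "minor"]
    report = [f"ERROR: {m}" for m in serious] + [f"WARNING: {m}" for m in minor]
    return (1 if serious else 0), report

# Hypothetical findings from a sanity check of the changed resource files.
exit_code, report = decide([
    ("serious", "key 'password reset confirmation' is not a valid identifier"),
    ("minor", "string LEGAL_NOTICE is longer than one sentence"),
])
```

With only minor findings, the exit code stays 0 and the report can be sent without blocking the commit.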

3.4. Post-commit Hooks

Post-commit hooks can be instituted in the source control system in order to sanity check changes to resource string files after the changes have already been committed.

Post-commit hooks might be preferable to pre-commit hooks if the sanity checking is "fuzzy", meaning that it can identify only likely / possible problems and never with sufficient certainty to block a commit. Post-commit hooks may also be preferable if development speed is considered more important and presenting too many barriers to committing changes would have a negative impact on motivation.

3.5. Upon Every Save

In an ideal world, sanity checking can be performed for resource string files on every save of such files.

If the build system being used during development supports file system watchers and automated rebuilding on file saves, then resource string sanity checking can be performed automatically upon every save to a resource string file, in a similar manner to immediate sanity checking for JavaScript files or other file types that are subject to tidy / lint policies.

Since a saved file is not yet committed to the source control system, a sanity check upon save could provide a notification to the developer at best. This notification could appear in the developer's IDE (if it is possible to integrate the sanity checking in the IDE), or in a window of a separate application that runs as a daemon and acts as an assistant during development.

UIZE JavaScript Framework