Notes on the MO File Format

DongRH

1 Background

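The mo file is a file format defined by GNU gettext that lets software offer multilingual support to its users. Translators typically translate the messages one by one and save them in a text-format po file (for example with Poedit), then compile that file with msgfmt or a similar program into an mo file, which is convenient for programs to read and is loaded into memory by the software at run time.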

The workflow for adding multilingual support with the gettext tools is as follows (using a C program as the example):

Original C Sources ───> Preparation ───> Marked C Sources ───╮
                                                             │
              ╭─────────<─── GNU gettext Library             │
╭─── make <───┤                                              │
│             ╰─────────<────────────────────┬───────────────╯
│                                            │
│   ╭─────<─── PACKAGE.pot <─── xgettext <───╯   ╭───<─── PO Compendium
│   │                                            │              ↑
│   │                                            ╰───╮          │
│   ╰───╮                                            ├───> PO editor ───╮
│       ├────> msgmerge ──────> LANG.po ────>────────╯                  │
│   ╭───╯                                                               │
│   │                                                                   │
│   ╰─────────────<───────────────╮                                     │
│                                 ├─── New LANG.po <────────────────────╯
│   ╭─── LANG.gmo <─── msgfmt <───╯
│   │
│   ╰───> install ───> /.../LANG/PACKAGE.mo ───╮
│                                              ├───> "Hello world!"
╰───────> install ───> /.../bin/PROGRAM ───────╯

2 The MO file format

The format is not complicated, as shown below:

        byte
             +------------------------------------------+
          0  | magic number = 0x950412de or 0xde120495  |
             |                                          |
          4  | file format revision = 0 or 1            |
             |                                          |
          8  | number of strings                        |  == N
             |                                          |
         12  | offset of table with original strings    |  == O
             |                                          |
         16  | offset of table with translation strings |  == T
             |                                          |
         20  | size of hashing table                    |  == S
             |                                          |
         24  | offset of hashing table                  |  == H
             |                                          |
             .                                          .
             .    (possibly more entries later)         .
             .                                          .
             |                                          |
          O  | length & offset 0th string  ----------------.
      O + 8  | length & offset 1st string  ------------------.
              ...                                    ...   | |
O + ((N-1)*8)| length & offset (N-1)th string           |  | |
             |                                          |  | |
          T  | length & offset 0th translation  ---------------.
      T + 8  | length & offset 1st translation  -----------------.
              ...                                    ...   | | | |
T + ((N-1)*8)| length & offset (N-1)th translation      |  | | | |
             |                                          |  | | | |
          H  | start hash table                         |  | | | |
              ...                                    ...   | | | |
  H + S * 4  | end hash table                           |  | | | |
             |                                          |  | | | |
             | NUL terminated 0th string  <----------------' | | |
             |                                          |    | | |
             | NUL terminated 1st string  <------------------' | |
             |                                          |      | |
              ...                                    ...       | |
             |                                          |      | |
             | NUL terminated 0th translation  <---------------' |
             |                                          |        |
             | NUL terminated 1st translation  <-----------------'
             |                                          |
              ...                                    ...
             |                                          |
             +------------------------------------------+
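
The fixed part of the layout above can be read with a few lines of Python. The following is only a minimal sketch: the function name parse_mo_header and the dictionary keys are my own choices for illustration, not part of gettext. It assumes the whole file is already in memory as bytes, and it uses the magic number to decide the byte order of the remaining fields.

```python
import struct

MAGIC = 0x950412DE           # magic number as written by the producing machine
MAGIC_SWAPPED = 0xDE120495   # same magic seen with the opposite byte order


def parse_mo_header(data: bytes) -> dict:
    """Parse the 28-byte fixed header of an MO file (hypothetical helper)."""
    if len(data) < 28:
        raise ValueError("too short to be an MO file")
    # Read the magic as little-endian; its value tells us the file's byte order.
    (magic_le,) = struct.unpack_from("<I", data, 0)
    if magic_le == MAGIC:
        order = "<"          # little-endian file
    elif magic_le == MAGIC_SWAPPED:
        order = ">"          # big-endian file
    else:
        raise ValueError("bad magic number")
    # Six 32-bit fields follow the magic: revision, N, O, T, S, H.
    revision, n, o, t, s, h = struct.unpack_from(order + "6I", data, 4)
    return {
        "byte_order": order,
        "revision": revision,
        "nstrings": n,           # N in the diagram
        "orig_tab_offset": o,    # O
        "trans_tab_offset": t,   # T
        "hash_tab_size": s,      # S, may be 0
        "hash_tab_offset": h,    # H
    }
```

For a real file one would call it as parse_mo_header(open(path, "rb").read()), where path is whatever mo file is at hand.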

A few points worth noting:

The hash table is used to quickly find the index of the translation for a given string, but an mo file need not contain a hash table: its size S may be 0.
The size S of the hash table can be zero. In this case, the hash table itself is not contained in the MO file. Some people might prefer this because a precomputed hashing table takes disk space, and does not win that much speed. The hash table contains indices to the sorted array of strings in the MO file
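GNU gettext requires the original strings to be sorted in increasing lexicographical order, so that binary search can still speed up lookups when the hash table is absent.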
The first table contains descriptors for the original strings, and is sorted so the original strings are in increasing lexicographical order.
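Each string is terminated by a NUL, but a string may be followed by one or more NUL bytes (for alignment), so the length recorded in the string descriptor does not include the NUL.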
As for the strings themselves, they follow the hash file, and each is terminated with a NUL, and this NUL is not counted in the length which appears in the string descriptor. The msgfmt program has an option selecting the alignment for MO file strings. With this option, each string is separately aligned so it starts at an offset which is a multiple of the alignment value.
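The plural form and the singular form are separated by a NUL and stored in a single string record. The recorded length covers both, but only the singular form is used for the hash table lookup.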
Plural forms are stored by letting the plural of the original string follow the singular of the original string, separated through a NUL byte. The length which appears in the string descriptor includes both. However, only the singular of the original string takes part in the hash table lookup.
Strings may contain embedded NUL bytes (although existing implementations do not necessarily support this).
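
To make the notes above concrete, here is a rough Python sketch of how a reader could use the descriptor tables when no hash table is present. The names read_entry and lookup are hypothetical. The caller is assumed to have already taken the byte order, N, O and T from the header. The search compares only the part of each original string before the first NUL, which corresponds to the singular form.

```python
import struct


def read_entry(data, order, table_offset, index):
    """Return the raw bytes named by descriptor `index` of a string table."""
    length, offset = struct.unpack_from(order + "II", data,
                                        table_offset + 8 * index)
    # The stored length excludes the terminating NUL (and any padding NULs),
    # so slicing exactly `length` bytes yields the string itself.
    return data[offset:offset + length]


def lookup(data, order, n, orig_tab, trans_tab, msgid):
    """Binary search over the sorted original strings.

    Returns the raw translation bytes, or None if msgid is not present.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        # A plural entry is stored as b"singular\0plural"; only the
        # singular takes part in the comparison.
        key = read_entry(data, order, orig_tab, mid).split(b"\0", 1)[0]
        if key == msgid:
            return read_entry(data, order, trans_tab, mid)
        if key < msgid:
            lo = mid + 1
        else:
            hi = mid
    return None
```

For an entry with plural forms, the returned translation holds the forms separated by NUL bytes as well, so lookup(...).split(b"\0") gives the list of forms.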

3 The problem I ran into

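The existing gettext implementation strictly checks that the strings are in lexicographical order, and raises an error and exits if they are not. The relevant source in gettext-tools/src/read-mo.c (release v0.21) is: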

      /* Verify that the array of messages is sorted.  */
      {
        char *prev_msgid = NULL;

        for (i = 0; i < header.nstrings; i++)
          {
            char *msgid;
            size_t msgid_len;

            msgid = get_string (&bf, header.orig_tab_offset + i * 8,
                                &msgid_len);
            if (i == 0)
              prev_msgid = msgid;
            else
              {
                if (!(strcmp (prev_msgid, msgid) < 0))
                  error (EXIT_FAILURE, 0,
                         _("file \"%s\" is not in GNU .mo format: The array of messages is not sorted."),
                         filename);
              }
          }
      }

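I happened to get hold of an mo file that I wanted to decompile into a po file so I could make a few changes. Unexpectedly, its original strings were unsorted, and msgunfmt flatly refused to work (a guard against the libintl crash that out-of-order strings might cause).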
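
Before patching anything, it can help to see where a given file violates the ordering. The following Python sketch reports the offending indices instead of aborting. It is not a port of the C code above: the name unsorted_positions is hypothetical, and comparing each msgid with its immediate predecessor is my own reading of "sorted". Like strcmp, it only looks at the bytes before the first NUL.

```python
import struct


def unsorted_positions(data: bytes) -> list:
    """Return every index i whose msgid is not strictly greater than msgid i-1."""
    magic_le = struct.unpack_from("<I", data, 0)[0]
    if magic_le == 0x950412DE:
        order = "<"
    elif magic_le == 0xDE120495:
        order = ">"
    else:
        raise ValueError("not an MO file: bad magic number")
    # N (number of strings) and O (original-string table offset) sit at bytes 8..15.
    n, orig_tab = struct.unpack_from(order + "2I", data, 8)
    bad = []
    prev = None
    for i in range(n):
        length, offset = struct.unpack_from(order + "II", data, orig_tab + 8 * i)
        # strcmp stops at the first NUL, so compare only that prefix.
        msgid = data[offset:offset + length].split(b"\0", 1)[0]
        if prev is not None and not (prev < msgid):
            bad.append(i)
        prev = msgid
    return bad
```

An empty result means the original strings are strictly increasing under this reading. Any listed index marks an entry that is out of order relative to the entry just before it.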

4 Solutions

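Comment out this check:
- The good news is that libintl did not crash. The bad news is that read-mo.c currently does not support non-ASCII encodings: the Chinese wide characters simply all disappeared, so the output was useless.
- write-mo.c, on the other hand, does not care about encodings, so a Chinese mo file compiled by msgfmt cannot be decompiled by msgunfmt. Fair enough.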
- "... we ever want to implement wide characters right in MO files, where NUL bytes may accidentally appear. (No, we don’t want to have wide characters in MO files. They would make the file unnecessarily large, and the ‘wchar_t’ type being platform dependent, MO files would be platform dependent as well.)"
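In the end I reinvented the wheel after all. Fortunately the mo format is simple, and I only needed simple read operations. (The code is rather rough; please bear with me.)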
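
For illustration only, a hand-rolled reader of this kind can be very small. The sketch below is not the code referred to above. The name read_mo_pairs is hypothetical, and it assumes the strings are UTF-8 (a real file declares its charset in the header entry, which this sketch does not parse). It walks both tables in file order and never checks whether the originals are sorted.

```python
import struct


def read_mo_pairs(data: bytes, encoding: str = "utf-8"):
    """Yield (msgid, msgstr) text pairs in file order, with no sortedness check."""
    magic_le = struct.unpack_from("<I", data, 0)[0]
    if magic_le == 0x950412DE:
        order = "<"
    elif magic_le == 0xDE120495:
        order = ">"
    else:
        raise ValueError("not an MO file: bad magic number")
    # N, O and T are the three 32-bit fields starting at byte 8.
    n, orig_tab, trans_tab = struct.unpack_from(order + "3I", data, 8)

    def entry(table, i):
        length, offset = struct.unpack_from(order + "II", data, table + 8 * i)
        # Decode the bytes as text; embedded NULs (plural forms) are kept as-is.
        return data[offset:offset + length].decode(encoding)

    for i in range(n):
        yield entry(orig_tab, i), entry(trans_tab, i)
```

Because the bytes are decoded explicitly, Chinese translations survive as long as the assumed encoding matches the file.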

5 Addendum: the PO file format

A po text file consists of multiple translation entries. An entry generally has the following format:

optional white-spaces
#  translator-comments
#. extracted-comments
#: reference…
#, flag…
#| msgid previous-untranslated-string
msgid untranslated-string
msgstr translated-string

For example:

#: lib/error.c:116
msgid "Unknown system error"
msgstr "Error desconegut del sistema"

A few things worth noting:

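Comments that start with # followed by whitespace are maintained by the translator. The other kinds of comments are maintained automatically by the gettext tools (except for the fuzzy flag).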
The strings for msgid and msgstr are enclosed in double quotes ("), and special characters must be escaped with a backslash (\).
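
As a small illustration of the quoting rule, the Python sketch below writes one entry in the simple shape shown in the example above. The names po_quote and po_entry are hypothetical, and the set of escapes handled (backslash, double quote, newline, tab, carriage return) is an assumption that covers the common cases rather than a complete list. Real tools also wrap long strings over several lines, which this sketch does not attempt.

```python
def po_quote(text: str) -> str:
    """Wrap text in double quotes, escaping characters that PO syntax treats specially."""
    escapes = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\t": "\\t", "\r": "\\r"}
    return '"' + "".join(escapes.get(ch, ch) for ch in text) + '"'


def po_entry(msgid: str, msgstr: str, reference: str = None) -> str:
    """Format one msgid/msgstr pair, with an optional '#:' reference comment."""
    lines = []
    if reference:
        lines.append("#: " + reference)
    lines.append("msgid " + po_quote(msgid))
    lines.append("msgstr " + po_quote(msgstr))
    return "\n".join(lines) + "\n"
```

Calling po_entry("Unknown system error", "Error desconegut del sistema", "lib/error.c:116") reproduces the three-line example entry shown earlier in this section.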