nexmon – Blame information for rev 1

1 Q: Why does libiconv support encoding XXX? Why does libiconv not support
2 encoding ZZZ?
3  
4 A: libiconv, as an internationalization library, supports those character
5 sets and encodings which are in widespread use in at least one territory
6 of the world.
7  
8 Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
9 page "Languages, countries, and the charsets typically used for them".
10 From this table, we can conclude that the following are in active use:
11  
12 ISO-8859-1, CP1252 Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
13 English, Faroese, Finnish, French, Galician, German,
14 Icelandic, Irish, Italian, Norwegian, Portuguese,
15 Scottish, Spanish, Swedish
16 ISO-8859-2 Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
17 Slovenian
18 ISO-8859-3 Esperanto, Maltese
19 ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
20 Serbian, Ukrainian
21 ISO-8859-6 Arabic
22 ISO-8859-7 Greek
23 ISO-8859-8 Hebrew
24 ISO-8859-9, CP1254 Turkish
25 ISO-8859-10 Inuit, Lapp
26 ISO-8859-13 Latvian, Lithuanian
27 ISO-8859-15 Estonian
28 KOI8-R Russian
29 SHIFT_JIS Japanese
30 ISO-2022-JP Japanese
31 EUC-JP Japanese
32  
33 Ordered by frequency on the web (1997):
34 ISO-8859-1, CP1252 96%
35 SHIFT_JIS 1.6%
36 ISO-2022-JP 1.2%
37 EUC-JP 0.4%
38 CP1250 0.3%
39 CP1251 0.2%
40 CP850 0.1%
41 MACINTOSH 0.1%
42 ISO-8859-5 0.1%
43 ISO-8859-2 0.0%
44  
45 Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
46  
47 ISO-8859-1 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
48 English, Estonian, Faroese, Finnish, French,
49 Galician, German, Greenlandic, Icelandic,
50 Indonesian, Irish, Italian, Lithuanian, Norwegian,
51 Occitan, Portuguese, Scottish, Spanish, Swedish,
52 Walloon, Welsh
53 ISO-8859-2 Albanian, Croatian, Czech, Hungarian, Polish,
54 Romanian, Serbian, Slovak, Slovenian
55 ISO-8859-3 Esperanto
56 ISO-8859-4 Estonian, Latvian, Lithuanian
57 ISO-8859-5 Bulgarian, Byelorussian, Macedonian, Russian,
58 Serbian, Ukrainian
59 ISO-8859-6 Arabic
60 ISO-8859-7 Greek
61 ISO-8859-8 Hebrew
62 ISO-8859-9 Turkish
63 ISO-8859-14 Breton, Irish, Scottish, Welsh
64 ISO-8859-15 Basque, Breton, Catalan, Danish, Dutch, Estonian,
65 Faroese, Finnish, French, Galician, German,
66 Greenlandic, Icelandic, Irish, Italian, Lithuanian,
67 Norwegian, Occitan, Portuguese, Scottish, Spanish,
68 Swedish, Walloon, Welsh
69 KOI8-R Russian
70 KOI8-U Russian, Ukrainian
71 EUC-JP (alias eucJP) Japanese
72 ISO-2022-JP (alias JIS7) Japanese
73 SHIFT_JIS (alias SJIS) Japanese
74 U90 Japanese
75 S90 Japanese
76 EUC-CN (alias eucCN) Chinese
77 EUC-TW (alias eucTW) Chinese
78 BIG5 Chinese
79 EUC-KR (alias eucKR) Korean
80 ARMSCII-8 Armenian
81 GEORGIAN-ACADEMY Georgian
82 GEORGIAN-PS Georgian
83 TIS-620 (alias TACTIS) Thai
84 MULELAO-1 Laothian
85 IBM-CP1133 Laothian
86 VISCII Vietnamese
87 TCVN Vietnamese
88 NUNACOM-8 Inuktitut
89  
90 Hint3: The character sets supported by Netscape Communicator 4.
91  
92 Where is this documented? For the complete picture, I had to use
93 "strings netscape" and then a lot of guesswork. For a quick take,
94 look at the "View - Character set" menu of Netscape Communicator 4.6:
95  
96 ISO-8859-{1,2,5,7,9,15}
97 WINDOWS-{1250,1251,1253}
98 KOI8-R Cyrillic
99 CP866 Cyrillic
100 Autodetect Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
101 EUC-JP Japanese
102 SHIFT_JIS Japanese
103 GB2312 Chinese
104 BIG5 Chinese
105 EUC-TW Chinese
106 Autodetect Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
107  
108 UTF-8
109 UTF-7
110  
111 Hint4: The character sets supported by Microsoft Internet Explorer 4.
112  
113 ISO-8859-{1,2,3,4,5,6,7,8,9}
114 WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
115 KOI8-R Cyrillic
116 KOI8-RU Ukrainian
117 ASMO-708 Arabic
118 EUC-JP Japanese
119 ISO-2022-JP Japanese
120 SHIFT_JIS Japanese
121 GB2312 Chinese
122 HZ-GB-2312 Chinese
123 BIG5 Chinese
124 EUC-KR Korean
125 ISO-2022-KR Korean
126 WINDOWS-874 Thai
127 WINDOWS-1258 Vietnamese
128  
129 UTF-8
130 UTF-7
131 UNICODE actually UNICODE-LITTLE
132 UNICODEFEFF actually UNICODE-BIG
133  
134 and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
135  
136 We take the union of all these four sets. The result is:
137  
138 European and Semitic languages
139 * ASCII.
140 We implement this because it is occasionally useful to know or to
141 check whether some text is entirely ASCII (i.e. if the conversion
142 ISO-8859-x -> UTF-8 is trivial).
143 * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
144 We implement these because they are widely used, except ISO-8859-4,
145 which appears to have been superseded by ISO-8859-13 in the Baltic
146 countries. But it's an ISO standard anyway.
147 * ISO-8859-13
148 We implement this because it's a standard in Lithuania and Latvia.
149 * ISO-8859-14
150 We implement this because it's an ISO standard.
151 * ISO-8859-15
152 We implement this because it's increasingly used in Europe, because
153 of the Euro symbol.
154 * ISO-8859-16
155 We implement this because it's an ISO standard.
156 * KOI8-R, KOI8-U
157 We implement this because it appears to be the predominant encoding
158 on Unix in Russia and Ukraine, respectively.
159 * KOI8-RU
160 We implement this because MSIE4 supports it.
161 * KOI8-T
162 We implement this because it is the locale encoding in glibc's Tajik
163 locale.
164 * PT154
165 We implement this because it is the locale encoding in glibc's Kazakh
166 locale.
167 * RK1048
168 We implement this because it's a standard in Kazakhstan.
169 * CP{1250,1251,1252,1253,1254,1255,1256,1257}
170 We implement these because they are the predominant Windows encodings
171 in Europe.
172 * CP850
173 We implement this because it is mentioned as occurring on the web
174 in the aforementioned statistics.
175 * CP862
176 We implement this because Ron Aaron says it is sometimes used in web
177 pages and emails.
178 * CP866
179 We implement this because Netscape Communicator does.
180 * CP1131
181 We implement this because it is the locale encoding of a Belarusian
182 locale in FreeBSD and MacOS X.
183 * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
184 Mac{Hebrew,Arabic}
185 We implement these because the Sun JDK does, and because Mac users
186 don't deserve to be punished.
187 * Macintosh
188 We implement this because it is mentioned as occurring on the web
189 in the aforementioned statistics.
190 Japanese
191 * EUC-JP, SHIFT_JIS, ISO-2022-JP
192 We implement these because they are widely used. EUC-JP and SHIFT_JIS
193 are more used for files, whereas ISO-2022-JP is recommended for email.
194 * CP932
195 We implement this because it is the Microsoft variant of SHIFT_JIS,
196 used on Windows.
197 * ISO-2022-JP-2
198 We implement this because it's the common way to represent mails which
199 make use of JIS X 0212 characters.
200 * ISO-2022-JP-1
201 We implement this because it's in the RFCs, but I don't think it is
202 really used.
203 * U90, S90
204 We DON'T implement these because I have no information about what they
205 are or who uses them.
206 Simplified Chinese
207 * EUC-CN = GB2312
208 We implement this because it is the widely used representation
209 of simplified Chinese.
210 * GBK
211 We implement this because it appears to be used on Solaris and Windows.
212 * GB18030
213 We implement this because it is an official requirement in the
214 People's Republic of China.
215 * ISO-2022-CN
216 We implement this because it is in the RFCs, but I have no idea
217 whether it is really used.
218 * ISO-2022-CN-EXT
219 We implement this because it's in the RFCs, but I don't think it is
220 really used.
221 * HZ = HZ-GB-2312
222 We implement this because the RFCs recommend it for Usenet postings,
223 and because MSIE4 supports it.
224 Traditional Chinese
225 * EUC-TW
226 We implement it because it appears to be used on Unix.
227 * BIG5
228 We implement it because it is the de-facto standard for traditional
229 Chinese.
230 * CP950
231 We implement this because it is the Microsoft variant of BIG5, used
232 on Windows.
233 * BIG5+
234 We DON'T implement this because it doesn't appear to be in wide use.
235 Only the CWEX fonts use this encoding. Furthermore, the conversion
236 tables in the big5p package are not coherent: If you convert directly,
237 you get different results than when you convert via GBK.
238 * BIG5-HKSCS
239 We implement it because it is the de-facto standard for traditional
240 Chinese in Hong Kong.
241 Korean
242 * EUC-KR
243 We implement this because it appears to be the widely used
244 representation for Korean.
245 * CP949
246 We implement this because it is the Microsoft variant of EUC-KR, used
247 on Windows.
248 * ISO-2022-KR
249 We implement it because it is in the RFCs and because MSIE4 supports
250 it, but I have no idea whether it's really used.
251 * JOHAB
252 We implement this because it is apparently used on Windows as a locale
253 encoding (codepage 1361).
254 * ISO-646-KR
255 We DON'T implement this because, although it is an old ASCII variant,
256 its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
257 say it's a tilde, but Ken Lunde's "CJKV information processing" says
258 it's an overline. And it is not ISO-IR registered.
259 Armenian
260 * ARMSCII-8
261 We implement it because XFree86 supports it.
262 Georgian
263 * Georgian-Academy, Georgian-PS
264 We implement these because both appear to be used for Georgian;
265 XFree86 supports them.
266 Thai
267 * ISO-8859-11, TIS-620
268 We implement these because they seem to be the standard for Thai.
269 * CP874
270 We implement this because MSIE4 supports it.
271 * MacThai
272 We implement this because the Sun JDK does, and because Mac users
273 don't deserve to be punished.
274 Laotian
275 * MuleLao-1, CP1133
276 We implement these because XFree86 supports them. I have no idea which
277 one is used more widely.
278 Vietnamese
279 * VISCII, TCVN
280 We implement these because XFree86 supports them.
281 * CP1258
282 We implement this because MSIE4 supports it.
283 Other languages
284 * NUNACOM-8 (Inuktitut)
285 We DON'T implement this because it isn't part of Unicode yet, and
286 therefore doesn't convert to anything except itself.
287 Platform specifics
288 * HP-ROMAN8, NEXTSTEP
289 We implement these because they were the native character set on HPs
290 and NeXTs for a long time, and libiconv is intended to be usable on
291 these old machines.
292 Full Unicode
293 * UTF-8, UCS-2, UCS-4
294 We implement these. Obviously.
295 * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
296 We implement these because they are the preferred internal
297 representation of strings in Unicode aware applications. These are
298 non-ambiguous names, known to glibc. (glibc doesn't have
299 UCS-2-INTERNAL and UCS-4-INTERNAL.)
300 * UTF-16, UTF-16BE, UTF-16LE
301 We implement these, because UTF-16 is still the favourite encoding of
302 the president of the Unicode Consortium (for political reasons), and
303 because they appear in RFC 2781.
304 * UTF-32, UTF-32BE, UTF-32LE
305 We implement these because they are part of Unicode 3.1.
306 * UTF-7
307 We implement this because it is essential functionality for mail
308 applications.
309 * C99
310 We implement it because it's used for C and C++ programs and because
311 it's a nice encoding for debugging.
312 * JAVA
313 We implement it because it's used for Java programs and because it's
314 a nice encoding for debugging.
315 * UNICODE (little endian), UNICODEFEFF (big endian)
316 We DON'T implement these because they are stupid and not standardized.
317 Full Unicode, in terms of `uint16_t' or `uint32_t'
318 (with machine dependent endianness and alignment)
319 * UCS-2-INTERNAL, UCS-4-INTERNAL
320 We implement these because they are the preferred internal
321 representation of strings in Unicode aware applications.
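
The difference between the explicit byte-order names (UCS-2BE, UCS-2LE) and the machine-dependent ones (UCS-2-INTERNAL) can be illustrated with a minimal sketch. It uses Python's built-in codecs rather than libiconv, and the codec names 'utf-16-be' and 'utf-16-le' are Python's, not libiconv's; for a character in the Basic Multilingual Plane they yield the same bytes as UCS-2BE and UCS-2LE. The helper name encode_16bit is made up for this illustration.

```python
import sys

def encode_16bit(text, byteorder):
    """Encode BMP-only text as 16-bit units in the given byte order.

    byteorder is 'big', 'little', or 'native'. 'native' picks the
    machine's own order, analogous to what these notes call
    UCS-2-INTERNAL (an analogy, not libiconv's implementation).
    """
    if byteorder == "native":
        byteorder = sys.byteorder
    codec = "utf-16-be" if byteorder == "big" else "utf-16-le"
    return text.encode(codec)

euro = "\u20ac"  # EURO SIGN, U+20AC
print(encode_16bit(euro, "big").hex())     # 20ac
print(encode_16bit(euro, "little").hex())  # ac20
print(encode_16bit(euro, "native").hex())  # depends on the machine
```

The explicit names give the same bytes on every machine, which is why they are suited to files and network data, while the native form only makes sense for in-memory strings.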
322  
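The ASCII entry in the list above is motivated by checking whether a text is entirely ASCII, in which case converting it from ISO-8859-x to UTF-8 leaves the bytes unchanged. A minimal sketch of such a check, written in Python with its built-in codecs instead of libiconv (the helper name is_pure_ascii is hypothetical):

```python
def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range 0x00..0x7F."""
    return all(b < 0x80 for b in data)

plain = b"hello"
accented = "caf\u00e9".encode("iso8859_1")  # last byte is 0xE9

# For pure ASCII input, the ISO-8859-1 and UTF-8 byte sequences coincide.
assert is_pure_ascii(plain)
assert plain.decode("iso8859_1").encode("utf-8") == plain

# A single non-ASCII byte makes the conversion non-trivial.
assert not is_pure_ascii(accented)
assert accented.decode("iso8859_1").encode("utf-8") != accented
```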
323 Q: Support encodings mentioned in RFC 1345?
324 A: No, they are not in use any more. Supporting ISO-646 variants is pointless
325 since ISO-8859-* have been adopted.
326  
327 Q: Support EBCDIC?
328 A: No!
329  
330 Q: How do I add a new character set?
331 A: 1. Explain the "why" in this file, above.
332 2. You need to have a conversion table from/to Unicode. Transform it into
333 the format used by the mapping tables found on ftp.unicode.org: each line
334 contains the character code, in hex, with 0x prefix, then whitespace,
335 then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
336 counts as a comment delimiter until end of line.
337 Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
338 can include it in his collection.
339 3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
340 tools directory to generate the C code for the conversion. You may tweak
341 the resulting C code if you are not satisfied with its quality, but this
342 is rarely needed.
343 If it's a two-dimensional character set (with rows and columns), use the
344 'cjk_tab_to_h' program in the tools directory to generate the C code for
345 the conversion. You will need to modify the main() function to recognize
346 the new character set name, with the proper dimensions, but that shouldn't
347 be too hard. This yields the CCS. The CES you have to write by hand.
348 4. Store the resulting C code file in the lib directory. Add a #include
349 directive to converters.h, and add an entry to the encodings.def file.
350 5. Compile the package, and test your new encoding using a program like
351 iconv(1) or clisp(1).
352 6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
353 encoding, create the complete table as a TXT file. For a stateful encoding,
354 provide a text snippet encoded using your new encoding and its UTF-8
355 equivalent.
356 7. Update the README and man/iconv_open.3, to mention the new encoding.
357 Add a note in the NEWS file.
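
The table format described in step 2 can be sketched as follows. This is only an illustration in Python, not the code of the '8bit_tab_to_h' or 'cjk_tab_to_h' tools; the function name parse_mapping_table and the sample lines are hypothetical. Skipping lines that carry a character code but no Unicode value is an assumption of this sketch, made so that entries marked as undefined do not abort parsing.

```python
def parse_mapping_table(lines):
    """Parse lines of the form '0xXX <whitespace> 0xXXXX  # comment'.

    Returns a dict mapping character code -> Unicode code point.
    Everything from '#' to the end of the line is a comment.
    """
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: a code without a mapping is skipped
        code = int(fields[0], 16)  # int() accepts the 0x prefix in base 16
        ucs = int(fields[1], 16)
        table[code] = ucs
    return table

sample = [
    "# hypothetical 8-bit table excerpt",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A",
    "0xA4\t0x20AC\t# EURO SIGN",
]
table = parse_mapping_table(sample)
for code, ucs in sorted(table.items()):
    print(f"0x{code:02X} -> U+{ucs:04X}")
```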
358  
359 Q: What about bidirectional text? Should it be tagged or reversed when
360 converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
361 this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
362 A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
363 ISO-8859-8-E remains to be implemented.
364 On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
365 is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
366 the same as ISO-8859-8-I. I'm confused.
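
To make the "neither tagged nor reversed" position concrete: a plain charset conversion maps each stored byte to a code point, in the order the bytes are stored. The sketch below uses Python's built-in 'iso8859_8' codec, not libiconv, so it only illustrates the idea and says nothing about libiconv's internals.

```python
# Three Hebrew letters as stored ISO-8859-8 bytes: alef, bet, gimel.
stored = bytes([0xE0, 0xE1, 0xE2])

text = stored.decode("iso8859_8")

# The code points come out in the same order as the bytes went in.
# Whether that order is "visual" or "logical" is a property of the
# source text, which the codec neither knows nor changes.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)
```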
367  
368 Other character sets not implemented:
369 "MNEMONIC" = "csMnemonic"
370 "MNEM" = "csMnem"
371 "ISO-10646-UCS-Basic" = "csUnicodeASCII"
372 "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
373 "ISO-10646-J-1"
374 "UNICODE-1-1" = "csUnicode11"
375 "csWindows31Latin5"
376  
377 Other aliases not implemented (and not implemented in glibc-2.1 either):
378 From MSIE4:
379 ISO-8859-1: alias ISO8859-1
380 ISO-8859-2: alias ISO8859-2
381 KSC_5601: alias KS_C_5601
382 UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
383  
384  
385 Q: How can I integrate libiconv into my package?
386 A: Just copy the entire libiconv package into a subdirectory of your package.
387 At configuration time, call libiconv's configure script with the
388 appropriate --srcdir option and maybe --enable-static or --disable-shared.
389 Then "cd libiconv && make && make install-lib libdir=... includedir=...".
390 'install-lib' is a special (not GNU standardized) target which installs
391 only the include file - in $(includedir) - and the library - in $(libdir) -
392 and does not use other directory variables. After "installing" libiconv
393 in your package's build directory, building of your package can proceed.
394  
395 Q: Why is the testsuite so big?
396 A: Because some of the tests are very comprehensive.
397 If you don't feel like using the testsuite, you can simply remove the
398 tests/ directory.
399