WebSVN – nexmon – Blame – Rev 1 – /utilities/libiconv/NOTES

Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in wide-spread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS            1.6%
     ISO-2022-JP          1.2%
     EUC-JP               0.4%
     CP1250               0.3%
     CP1251               0.2%
     CP850                0.1%
     MACINTOSH            0.1%
     ISO-8859-5           0.1%
     ISO-8859-2           0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1                 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                                English, Estonian, Faroese, Finnish, French,
                                Galician, German, Greenlandic, Icelandic,
                                Indonesian, Irish, Italian, Lithuanian, Norwegian,
                                Occitan, Portuguese, Scottish, Spanish, Swedish,
                                Walloon, Welsh
     ISO-8859-2                 Albanian, Croatian, Czech, Hungarian, Polish,
                                Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3                 Esperanto
     ISO-8859-4                 Estonian, Latvian, Lithuanian
     ISO-8859-5                 Bulgarian, Byelorussian, Macedonian, Russian,
                                Serbian, Ukrainian
     ISO-8859-6                 Arabic
     ISO-8859-7                 Greek
     ISO-8859-8                 Hebrew
     ISO-8859-9                 Turkish
     ISO-8859-14                Breton, Irish, Scottish, Welsh
     ISO-8859-15                Basque, Breton, Catalan, Danish, Dutch, Estonian,
                                Faroese, Finnish, French, Galician, German,
                                Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                                Norwegian, Occitan, Portuguese, Scottish, Spanish,
                                Swedish, Walloon, Welsh
     KOI8-R                     Russian
     KOI8-U                     Russian, Ukrainian
     EUC-JP (alias eucJP)       Japanese
     ISO-2022-JP (alias JIS7)   Japanese
     SHIFT_JIS (alias SJIS)     Japanese
     U90                        Japanese
     S90                        Japanese
     EUC-CN (alias eucCN)       Chinese
     EUC-TW (alias eucTW)       Chinese
     BIG5                       Chinese
     EUC-KR (alias eucKR)       Korean
     ARMSCII-8                  Armenian
     GEORGIAN-ACADEMY           Georgian
     GEORGIAN-PS                Georgian
     TIS-620 (alias TACTIS)     Thai
     MULELAO-1                  Laothian
     IBM-CP1133                 Laothian
     VISCII                     Vietnamese
     TCVN                       Vietnamese
     NUNACOM-8                  Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

   Where is this documented? For the complete picture, I had to use
   "strings netscape" and then a lot of guesswork. For a quick take,
   look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R       Cyrillic
     CP866        Cyrillic
     Autodetect   Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP       Japanese
     SHIFT_JIS    Japanese
     GB2312       Chinese
     BIG5         Chinese
     EUC-TW       Chinese
     Autodetect   Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R         Cyrillic
     KOI8-RU        Ukrainian
     ASMO-708       Arabic
     EUC-JP         Japanese
     ISO-2022-JP    Japanese
     SHIFT_JIS      Japanese
     GB2312         Chinese
     HZ-GB-2312     Chinese
     BIG5           Chinese
     EUC-KR         Korean
     ISO-2022-KR    Korean
     WINDOWS-874    Thai
     WINDOWS-1258   Vietnamese
     UTF-8
     UTF-7
     UNICODE        actually UNICODE-LITTLE
     UNICODEFEFF    actually UNICODE-BIG

   and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of all these four sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and Mac OS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent mails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * U90, S90
       We DON'T implement these because I have no information about what they
       are or who uses them.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because although an old ASCII variant, its
       glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV information processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because both appear to be used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character set on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
   Full Unicode, in terms of `uint16_t' or `uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.
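   The difference between the fixed-endianness names (UCS-2BE, UCS-2LE) and
   UCS-2-INTERNAL can be sketched in a few lines of C. This is only an
   illustration: the helper names below are hypothetical and are not part of
   libiconv. The BE and LE forms always produce the same byte sequence,
   whereas the "internal" form is whatever the host stores for a uint16_t.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers for illustration only; not libiconv API. */

/* Serialize one UCS-2 code unit in big-endian order (as UCS-2BE does). */
static void put_ucs2be(uint16_t wc, unsigned char out[2])
{
    out[0] = (unsigned char)(wc >> 8);
    out[1] = (unsigned char)(wc & 0xff);
}

/* Serialize one UCS-2 code unit in little-endian order (as UCS-2LE does). */
static void put_ucs2le(uint16_t wc, unsigned char out[2])
{
    out[0] = (unsigned char)(wc & 0xff);
    out[1] = (unsigned char)(wc >> 8);
}

/* "Internal" form: simply the host's in-memory layout of a uint16_t. */
static void put_ucs2_internal(uint16_t wc, unsigned char out[2])
{
    memcpy(out, &wc, sizeof wc);
}

/* Returns 1 on a big-endian host, 0 on a little-endian host. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    unsigned char bytes[2];
    memcpy(bytes, &probe, sizeof probe);
    return bytes[0] == 0x01;
}
```

   On any given machine the internal form coincides with exactly one of the
   two fixed forms, which is why the -INTERNAL names are convenient inside an
   application but unsuitable for data exchanged between machines.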

Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: No!

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
      the format used by the mapping tables found on ftp.unicode.org: each line
      contains the character code, in hex, with 0x prefix, then whitespace,
      then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
      counts as a comment delimiter until end of line.
      Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
      can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
      tools directory to generate the C code for the conversion. You may tweak
      the resulting C code if you are not satisfied with its quality, but this
      is rarely needed.
      If it's a two-dimensional character set (with rows and columns), use the
      'cjk_tab_to_h' program in the tools directory to generate the C code for
      the conversion. You will need to modify the main() function to recognize
      the new character set name, with the proper dimensions, but that shouldn't
      be too hard. This yields the CCS. The CES you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
      directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
      iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
      encoding, create the complete table as a TXT file. For a stateful encoding,
      provide a text snippet encoded using your new encoding and its UTF-8
      equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
      Add a note in the NEWS file.
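   The table format described in step 2 can be checked with a small reader
   before feeding a table to the generator programs. The following is a
   minimal sketch, not code from the tools directory: the function name
   parse_mapping_line is hypothetical, and it assumes lines shorter than 256
   bytes and a lowercase "0x" prefix.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libiconv's tools.
 * Parses one line of a mapping table: a character code in hex with 0x
 * prefix, whitespace, then a Unicode code point in hex with 0x prefix.
 * Everything from '#' to the end of the line is treated as a comment.
 * Returns 1 and stores both values if the line holds a mapping.
 * Returns 0 for blank, comment-only, over-long or incomplete lines; in
 * that case the contents of *code and *ucs are unspecified.
 */
static int parse_mapping_line(const char *line,
                              unsigned int *code, unsigned int *ucs)
{
    char buf[256];
    char *hash;
    size_t len = strlen(line);

    if (len >= sizeof buf)
        return 0;
    memcpy(buf, line, len + 1);

    /* Cut off the comment, if any. */
    hash = strchr(buf, '#');
    if (hash != NULL)
        *hash = '\0';

    /* Both fields must be present for the line to count as a mapping. */
    return sscanf(buf, " 0x%x 0x%x", code, ucs) == 2;
}
```

   A line such as "0xA4<TAB>0x20AC<TAB># EURO SIGN" yields the pair
   (0xA4, 0x20AC), while a line that only says a code is undefined in a
   comment is skipped.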

Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8

Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.
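   Once the include file and the library are in place, the package's own code
   can call the standard iconv_open/iconv/iconv_close interface. The following
   is a minimal sketch under stated assumptions: the wrapper name
   convert_string is hypothetical, the whole input is converted in one call,
   and the second argument of iconv() is assumed to be declared as char **
   (some older installations declare it const char **). Depending on the
   platform, linking may require -liconv.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical wrapper for illustration; not part of libiconv.
 * Converts the NUL-terminated string src from encoding `from` to encoding
 * `to` and stores a NUL-terminated result in dst (capacity dstsize bytes).
 * Returns the number of bytes written, not counting the terminating NUL,
 * or (size_t)-1 if the conversion could not be opened or did not succeed.
 */
static size_t convert_string(const char *from, const char *to,
                             const char *src, char *dst, size_t dstsize)
{
    iconv_t cd;
    char *in = (char *)src;      /* iconv() does not modify the input bytes */
    size_t inleft = strlen(src);
    char *out = dst;
    size_t outleft;
    size_t rc;

    if (dstsize == 0)
        return (size_t)-1;
    outleft = dstsize - 1;       /* reserve room for the terminating NUL */

    /* Note the argument order: target encoding first, then source. */
    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &in, &inleft, &out, &outleft);
    if (rc != (size_t)-1)
        /* Flush any pending shift sequence of a stateful target encoding. */
        rc = iconv(cd, NULL, NULL, &out, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;
    *out = '\0';
    return (size_t)(out - dst);
}
```

   For example, converting the four ISO-8859-1 bytes "caf\xE9" to UTF-8 with
   this wrapper produces five bytes, because the final character takes two
   bytes in UTF-8.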

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.
