ASCII, Unicode: mappings of characters to integers (codepoints)
Encoding: how strings of characters are stored in files and in RAM at run-time
  UTF-8, UTF-16; little endian & big endian (byte order)

File contents, by byte:
  0x48 -- H
  0x65 -- e
  0x72 -- r
  0x65 -- e
  0x20 -- (space)
  ...

od -x reads this file in 16-bit (2-byte) chunks.
First chunk: 0x48, 0x65.  Does this represent the 16-bit integer 0x4865 or 0x6548?
  big endian    -- Motorola (early Macs)
  little endian -- Intel (PCs, later Macs)
  MIPS          -- configurable: you choose!

My file in UTF-16 LE:
  H    \0   e    \0   r    \0   ...
  0x48 0x00 0x65 0x00
  01001000 00000000 01100101 00000000

Viewed as a sequence of little-endian 16-bit ints:
  0000000001001000 -- 0x0048
  0000000001100101 -- 0x0065

UTF-8 (early 90's, Ken Thompson -- the same Ken Thompson as "Reflections on Trusting Trust", 1983)

Why UTF-8? Two problems with UTF-16:
  - null-termination problem: ASCII characters pick up 0x00 bytes, which C-style string handling treats as the end of the string
  - file size problem: mostly-ASCII text doubles in size

é -- codepoint 0xE9 = 11101001 binary = 233 decimal
  Split the payload bits as 00011 101001 and put them into the two-byte UTF-8 template 110xxxxx 10xxxxxx:
  11000011 10101001  =  0xC3 0xA9

BOM -- byte order mark: the codepoint U+FEFF written at the start of a file so a reader can tell which byte order was used

ADVICE WHEN WORKING WITH ENCODED TEXT
1. Read from the data source, and immediately decode into the "unicode" type of string.
2. Do whatever you need to do, always using the unicode type.
3. Just before output, encode your unicode strings into their target encoding.
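
To make the byte-order discussion above concrete, here is a small Python sketch (Python 3.8+, for the separator argument to bytes.hex) that encodes the example string "Here " in UTF-16 LE and BE and prints the raw bytes. The BOM produced by the plain "utf-16" codec depends on the machine's native byte order.

    # UTF-16 byte layouts for the example string from the byte dump above
    text = "Here "

    # UTF-16 LE: each ASCII character becomes its codepoint byte followed by 0x00
    print(text.encode("utf-16-le").hex(" "))   # 48 00 65 00 72 00 65 00 20 00

    # UTF-16 BE: the same 16-bit values with the bytes swapped
    print(text.encode("utf-16-be").hex(" "))   # 00 48 00 65 00 72 00 65 00 20

    # plain "utf-16" prepends a BOM and uses native byte order
    print(text.encode("utf-16").hex(" "))      # ff fe 48 00 ... on a little-endian machine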
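
The é example can be checked the same way. This sketch redoes the bit arithmetic by hand, then compares the result against Python's built-in UTF-8 codec.

    # UTF-8 encoding of é (codepoint U+00E9 = 233), as worked out above
    ch = "é"
    cp = ord(ch)
    print(hex(cp), bin(cp))                  # 0xe9 0b11101001

    # two-byte UTF-8 template: 110xxxxx 10xxxxxx
    # high 5 payload bits -> first byte, low 6 bits -> second byte
    byte1 = 0b11000000 | (cp >> 6)           # 110xxxxx
    byte2 = 0b10000000 | (cp & 0b111111)     # 10xxxxxx
    print(hex(byte1), hex(byte2))            # 0xc3 0xa9

    # the built-in codec agrees
    print(ch.encode("utf-8").hex())          # c3a9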
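
The three-step advice translates into only a few lines of Python. In the sketch below, the file names and the Latin-1 source encoding are made-up examples, not anything specified in the notes; the point is only the shape of the workflow: decode on input, work with str (Python 3's unicode type) in the middle, encode only at output.

    # step 1: read raw bytes and decode into unicode immediately
    with open("input_latin1.txt", "rb") as f:
        text = f.read().decode("latin-1")

    # step 2: do all processing on the unicode string
    text = text.upper()

    # step 3: encode into the target encoding only at output time
    with open("output_utf8.txt", "wb") as f:
        f.write(text.encode("utf-8"))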