UTF-1
UTF-1 is a method of transforming ISO 10646/Unicode into a stream of bytes. Its design does not provide self-synchronization, which makes substring searching and error recovery difficult. It reuses the byte values of printable ASCII characters inside multi-byte encodings, which makes it unsuitable for some uses (for instance, Unix filenames cannot contain the byte value used for the forward slash). UTF-1 is also slow to encode and decode because it requires division and multiplication by a number that is not a power of 2. Because of these issues, it did not gain acceptance and was quickly replaced by UTF-8.
| MIME / IANA | ISO-10646-UTF-1 | 
|---|---|
| Language(s) | International | 
| Current status | Obscure, of mainly historical interest. | 
| Classification | Unicode Transformation Format, extended ASCII, variable-width encoding | 
| Extends | US-ASCII | 
| Transforms / Encodes | ISO 10646 (Unicode) | 
| Succeeded by | UTF-8 | 
Design
    
Similar to UTF-8, UTF-1 is a variable-width encoding that is backwards-compatible with ASCII. Every Unicode code point is represented by either a single byte, or a sequence of two, three, or five bytes. ASCII is supported via the single-byte encodings, which, unlike those of UTF-8, also include the non-ASCII code points U+0080 through U+009F.
UTF-1 does not use the C0 and C1 control codes or the space character in multi-byte encodings: a byte in the range 0x00–0x20 or 0x7F–0x9F always stands for the corresponding code point. These 66 protected byte values were intended to keep UTF-1 compatible with ISO 2022.
UTF-1 uses "modulo 190" arithmetic (256 − 66 = 190). For comparison, UTF-8 protects all 128 ASCII characters and needs one bit for this, and a second bit to make it self-synchronizing, resulting in "modulo 64" arithmetic (8 − 2 = 6; 2^6 = 64). BOCU-1 protects only the minimal set required for MIME-compatibility (0x00, 0x07–0x0F, 0x1A–0x1B, and 0x20), resulting in "modulo 243" arithmetic (256 − 13 = 243).
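The three moduli quoted above can be reproduced by counting the protected byte values. The following is a minimal Python sketch of that arithmetic. The variable names are illustrative only and are not taken from any specification.

```python
# Byte values that UTF-1 never uses inside a multi-byte sequence:
# C0 controls and space (0x00-0x20), DEL and C1 controls (0x7F-0x9F).
utf1_protected = set(range(0x00, 0x21)) | set(range(0x7F, 0xA0))
utf1_modulus = 256 - len(utf1_protected)

# UTF-8 spends two of the eight bits of a continuation byte on marking,
# which leaves six payload bits.
utf8_modulus = 2 ** (8 - 2)

# BOCU-1 protects only the byte values listed in the text above.
bocu1_protected = {0x00, *range(0x07, 0x10), 0x1A, 0x1B, 0x20}
bocu1_modulus = 256 - len(bocu1_protected)

print(len(utf1_protected), utf1_modulus)    # 66 190
print(utf8_modulus)                         # 64
print(len(bocu1_protected), bocu1_modulus)  # 13 243
```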
| code point | UTF-8 | UTF-1 | 
|---|---|---|
| U+007F | 7F | 7F | 
| U+0080 | C2 80 | 80 | 
| U+009F | C2 9F | 9F | 
| U+00A0 | C2 A0 | A0 A0 | 
| U+00BF | C2 BF | A0 BF | 
| U+00C0 | C3 80 | A0 C0 | 
| U+00FF | C3 BF | A0 FF | 
| U+0100 | C4 80 | A1 21 | 
| U+015D | C5 9D | A1 7E | 
| U+015E | C5 9E | A1 A0 | 
| U+01BD | C6 BD | A1 FF | 
| U+01BE | C6 BE | A2 21 | 
| U+07FF | DF BF | AA 72 | 
| U+0800 | E0 A0 80 | AA 73 | 
| U+0FFF | E0 BF BF | B5 48 | 
| U+1000 | E1 80 80 | B5 49 | 
| U+4015 | E4 80 95 | F5 FF | 
| U+4016 | E4 80 96 | F6 21 21 | 
| U+D7FF | ED 9F BF | F7 2F C3 | 
| U+E000 | EE 80 80 | F7 3A 79 | 
| U+F8FF | EF A3 BF | F7 5C 3C | 
| U+FDD0 | EF B7 90 | F7 62 BA | 
| U+FDEF | EF B7 AF | F7 62 D9 | 
| U+FEFF | EF BB BF | F7 64 4C | 
| U+FFFD | EF BF BD | F7 65 AD | 
| U+FFFE | EF BF BE | F7 65 AE | 
| U+FFFF | EF BF BF | F7 65 AF | 
| U+10000 | F0 90 80 80 | F7 65 B0 | 
| U+38E2D | F0 B8 B8 AD | FB FF FF | 
| U+38E2E | F0 B8 B8 AE | FC 21 21 21 21 | 
| U+FFFFF | F3 BF BF BF | FC 21 37 B2 7A | 
| U+100000 | F4 80 80 80 | FC 21 37 B2 7B | 
| U+10FFFF | F4 8F BF BF | FC 21 39 6E 6C | 
| U+7FFFFFFF | FD BF BF BF BF BF | FD BD 2B B9 40 | 
Although modern Unicode ends at U+10FFFF, both UTF-1 and UTF-8 were designed to encode the complete 31 bits of the original Universal Character Set (UCS-4), and the last entry in this table shows this original final code point.
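The rows of the table can be reproduced with a short encoder. The Python sketch below is a reconstruction based on the modulo-190 arithmetic described above and on the range boundaries visible in the table (U+00A0, U+0100, U+4016, U+38E2E). It is not a normative implementation, and the names `utf1_encode` and `_trail` are hypothetical helpers chosen for this example.

```python
def _trail(digit: int) -> int:
    """Map a base-190 digit (0..189) to a byte outside the 66 protected values."""
    if not 0 <= digit < 190:
        raise ValueError("digit out of range")
    # Digits 0..93 land on 0x21..0x7E, digits 94..189 land on 0xA0..0xFF.
    return digit + 0x21 if digit < 0x5E else digit + 0x42


def utf1_encode(cp: int) -> bytes:
    """Encode one code point (0..0x7FFFFFFF) as UTF-1 bytes (sketch)."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("code point out of range")
    if cp < 0xA0:                      # one byte: ASCII, DEL and C1 controls
        return bytes([cp])
    if cp < 0x100:                     # two bytes: lead 0xA0, then the raw value
        return bytes([0xA0, cp])
    if cp < 0x4016:                    # two bytes: leads 0xA1..0xF5
        y = cp - 0x100
        return bytes([0xA1 + y // 190, _trail(y % 190)])
    if cp < 0x38E2E:                   # three bytes: leads 0xF6..0xFB
        y = cp - 0x4016
        return bytes([0xF6 + y // 190**2,
                      _trail(y // 190 % 190),
                      _trail(y % 190)])
    y = cp - 0x38E2E                   # five bytes: leads 0xFC and up
    return bytes([0xFC + y // 190**4,
                  _trail(y // 190**3 % 190),
                  _trail(y // 190**2 % 190),
                  _trail(y // 190 % 190),
                  _trail(y % 190)])


print(utf1_encode(0x10FFFF).hex(" ").upper())  # FC 21 39 6E 6C
```

Each multi-byte branch needs integer division and remainder by 190, which is the cost the article attributes to UTF-1 when comparing it with the shift-and-mask operations that suffice for UTF-8. The encoding of U+D7FF (F7 2F C3) also shows a trailing byte equal to 0x2F, the ASCII forward slash.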
