utf8proc
annotate README @ 3:4ee0d5f54af1
Version 1.0
- added the LUMP option, which lumps certain characters together (see lump.txt) (also used for the PostgreSQL "unifold" function)
- added the STRIPMARK option, which strips marking characters (or marks of composed characters)
- deprecated ruby method String#char_ary in favour of String#utf8chars
| author | jbe |
|---|---|
| date | Sun Sep 17 12:00:00 2006 +0200 (2006-09-17) |
| parents | aaad485d5335 |
| children | a49e32490aac |
| rev | line source |
|---|---|
| jbe@0 | 1 |
| jbe@0 | 2 Please read the LICENSE file, which ships with this software. |
| jbe@0 | 3 |
| jbe@0 | 4 |
| jbe@0 | 5 *** QUICK START *** |
| jbe@0 | 6 |
| jbe@0 | 7 For compilation of the C library call "make c-library", for compilation of |
| jbe@0 | 8 the ruby library call "make ruby-library" and for compilation of the |
| jbe@0 | 9 PostgreSQL extension call "make pgsql-library". |
| jbe@0 | 10 |
| jbe@0 | 11 "make all" can be used to build everything, but both ruby and PostgreSQL |
| jbe@0 | 12 installations are required in this case. |
| jbe@0 | 13 |
| jbe@0 | 14 |
| jbe@0 | 15 *** GENERAL INFORMATION *** |
| jbe@0 | 16 |
| jbe@0 | 17 The C library is found in this directory after successful compilation and is |
| jbe@0 | 18 named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of the |
| jbe@0 | 19 files "utf8proc.rb" and "utf8proc_native.so", which are found in the |
| jbe@0 | 20 subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so" |
| jbe@0 | 21 and resides in the "pgsql/" directory. |
| jbe@0 | 22 |
| jbe@0 | 23 Both the ruby library and the PostgreSQL extension are built as stand-alone |
| jbe@0 | 24 libraries and are therefore not dependent on the dynamic version of the |
| jbe@0 | 25 C library files, but this behaviour might change in future releases. |
| jbe@0 | 26 |
| jbe@2 | 27 The supported Unicode version is 5.0.0. |
| jbe@2 | 28 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0 |
| jbe@2 | 29 was not yet available. |
| jbe@0 | 30 |
| jbe@0 | 31 For Unicode normalizations, the following options have to be used: |
| jbe@0 | 32 Normalization Form C: STABLE, COMPOSE |
| jbe@2 | 33 Normalization Form D: STABLE, DECOMPOSE |
| jbe@0 | 34 Normalization Form KC: STABLE, COMPOSE, COMPAT |
| jbe@2 | 35 Normalization Form KD: STABLE, DECOMPOSE, COMPAT |
| jbe@0 | 36 |
| jbe@0 | 37 |
| jbe@0 | 38 *** C LIBRARY *** |
| jbe@0 | 39 |
| jbe@0 | 40 The documentation for the C library is found in the utf8proc.h header file. |
| jbe@0 | 41 "utf8proc_map" is most likely the function you will be using for mapping UTF-8 |
| jbe@0 | 42 strings, unless you want to allocate memory yourself. |
| jbe@0 | 43 |
| jbe@0 | 44 |
| jbe@0 | 45 *** RUBY API *** |
| jbe@0 | 46 |
| jbe@0 | 47 The ruby library adds the methods "utf8map" and "utf8map!" to the String |
| jbe@0 | 48 class, and the method "utf8" to the Integer class. |
| jbe@0 | 49 |
| jbe@0 | 50 The String#utf8map method does the same as the "utf8proc_map" C function. |
| jbe@0 | 51 Options for the mapping procedure are passed as symbols, e.g.: |
| jbe@2 | 52 "Hello".utf8map(:casefold) => "hello" |
| jbe@0 | 53 |
| jbe@0 | 54 The descriptions of all options are found in the C header file "utf8proc.h". |
| jbe@0 | 55 Please note that the corresponding symbols in ruby are all lowercase. |
| jbe@0 | 56 |
| jbe@0 | 57 String#utf8map! is the destructive variant, meaning that the string |
| jbe@0 | 58 is replaced by the result. |
| jbe@0 | 59 |
| jbe@0 | 60 There are shortcuts for the 4 normalization forms specified by Unicode: |
| jbe@0 | 61 String#utf8nfd, String#utf8nfd!, |
| jbe@0 | 62 String#utf8nfc, String#utf8nfc!, |
| jbe@0 | 63 String#utf8nfkd, String#utf8nfkd!, |
| jbe@0 | 64 String#utf8nfkc, String#utf8nfkc! |
| jbe@0 | 65 |
| jbe@0 | 66 The method Integer#utf8 returns a UTF-8 string containing the |
| jbe@2 | 67 Unicode character given by the code point. |
| jbe@0 | 68 0x000A.utf8 => "\n" |
| jbe@0 | 69 0x2028.utf8 => "\342\200\250" |
| jbe@0 | 70 |
| jbe@0 | 71 |
| jbe@0 | 72 *** POSTGRESQL API *** |
| jbe@0 | 73 |
| jbe@0 | 74 For PostgreSQL, an SQL function named "unifold" is supplied. This |
| jbe@0 | 75 function can be used to normalize and case-fold index fields, |
| jbe@0 | 76 e.g.: |
| jbe@0 | 77 |
| jbe@1 | 78 CREATE TABLE people ( |
| jbe@1 | 79 id serial8 primary key, |
| jbe@1 | 80 name text, |
| jbe@1 | 81 CHECK (unifold(name) NOTNULL) |
| jbe@1 | 82 ); |
| jbe@0 | 83 CREATE INDEX name_idx ON people (unifold(name)); |
| jbe@0 | 84 SELECT * FROM people WHERE unifold(name) = unifold('John Doe'); |
| jbe@0 | 85 |
| jbe@2 | 86 NOTICE: The outputs of the function can change between releases, as utf8proc |
| jbe@2 | 87 does not follow a versioning stability policy. You have to rebuild |
| jbe@2 | 88 your database indices if you upgrade to a newer version of utf8proc. |
| jbe@2 | 89 |
| jbe@2 | 90 |
| jbe@2 | 91 *** KNOWN BUGS *** |
| jbe@2 | 92 |
| jbe@2 | 93 - on Mac OS X, segfaults were reported when compiling the ruby library |
| jbe@2 | 94 with optimization (-> don't use optimization if you have problems) |
| jbe@2 | 95 |
| jbe@0 | 96 |
| jbe@0 | 97 *** TODO *** |
| jbe@0 | 98 |
| jbe@0 | 99 - detect stable code points and process segments independently in order to |
| jbe@0 | 100 save memory |
| jbe@0 | 101 - do a quick check before normalizing strings to optimize speed |
| jbe@0 | 102 - support stream processing |
| jbe@0 | 103 |
| jbe@0 | 104 |
| jbe@0 | 105 Unicode is a trademark of Unicode, Inc., and may be registered in some |
| jbe@0 | 106 jurisdictions. |
| jbe@0 | 107 |
| jbe@0 | 108 |
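
The Integer#utf8 examples and the normalization forms listed in the README can be explored without building the extension. The following is a minimal sketch that uses only Ruby core methods (Array#pack and String#unicode_normalize) as stand-ins. It does not call utf8proc, and the helper names codepoint_to_utf8 and octal_escape are hypothetical, introduced only for illustration. It assumes Ruby 2.2 or later.

```ruby
# Stand-in sketch: uses only Ruby core methods, not the utf8proc extension.

# Encode one code point as a UTF-8 string, as Integer#utf8 is described to do.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

# Show each byte as a backslash-octal escape, the notation used in the README.
def octal_escape(str)
  str.bytes.map { |byte| format("\\%03o", byte) }.join
end

# "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
decomposed = "e\u0301"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE (composed form).
composed = "\u00e9"

puts octal_escape(codepoint_to_utf8(0x2028))          # prints \342\200\250
puts decomposed.unicode_normalize(:nfc) == composed   # prints true
puts composed.unicode_normalize(:nfd) == decomposed   # prints true
# Compatibility forms additionally replace characters such as the "fi" ligature.
puts "\uFB01".unicode_normalize(:nfkc)                # prints fi
```

The composed and decomposed results correspond to what Normalization Forms C and D are defined to produce for this input. Whether utf8proc's output matches Ruby's byte for byte on every input is not verified here, since the two may target different Unicode versions.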