utf8proc
annotate README @ 14:d0bab6ca89a5
Version 1.1.6
- PostgreSQL 9.2 and 9.3 compatibility (lowercase 'c' language name)
| author | jbe | 
|---|---|
| date | Wed Nov 27 12:00:00 2013 +0100 (2013-11-27) | 
| parents | 00d2bcbdc945 | 
| children | 
| rev | line source | 
|---|---|
| jbe@0 | 1 | 
| jbe@0 | 2 Please read the LICENSE file, which ships with this software. | 
| jbe@0 | 3 | 
| jbe@0 | 4 | 
| jbe@0 | 5 *** QUICK START *** | 
| jbe@0 | 6 | 
| jbe@0 | 7 For compilation of the C library call "make c-library", for compilation of | 
| jbe@0 | 8 the ruby library call "make ruby-library" and for compilation of the | 
| jbe@0 | 9 PostgreSQL extension call "make pgsql-library". | 
| jbe@0 | 10 | 
| jbe@10 | 11 For ruby you can also create a gem-file by calling "make ruby-gem". | 
| jbe@10 | 12 | 
| jbe@0 | 13 "make all" can be used to build everything, but both ruby and PostgreSQL | 
| jbe@0 | 14 installations are required in this case. | 
| jbe@0 | 15 | 
| jbe@0 | 16 | 
| jbe@0 | 17 *** GENERAL INFORMATION *** | 
| jbe@0 | 18 | 
| jbe@7 | 19 The C library is found in this directory after successful compilation and | 
| jbe@7 | 20 is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of | 
| jbe@7 | 21 the files "utf8proc.rb" and "utf8proc_native.so", which are found in the | 
| jbe@10 | 22 subdirectory "ruby/". If you chose to create a gem-file it is placed in the | 
| jbe@10 | 23 "ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so" | 
| jbe@0 | 24 and resides in the "pgsql/" directory. | 
| jbe@0 | 25 | 
| jbe@0 | 26 Both the ruby library and the PostgreSQL extension are built as stand-alone | 
| jbe@0 | 27 libraries and are therefore not dependent on the dynamic version of the | 
| jbe@0 | 28 C library files, but this behaviour might change in future releases. | 
| jbe@0 | 29 | 
| jbe@2 | 30 The Unicode version being supported is 5.0.0. | 
| jbe@7 | 31 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as | 
| jbe@10 | 32 version 5.0.0 had not been available at the time of implementation. | 
| jbe@0 | 33 | 
| jbe@0 | 34 For Unicode normalizations, the following options have to be used: | 
| jbe@0 | 35 Normalization Form C: STABLE, COMPOSE | 
| jbe@2 | 36 Normalization Form D: STABLE, DECOMPOSE | 
| jbe@0 | 37 Normalization Form KC: STABLE, COMPOSE, COMPAT | 
| jbe@2 | 38 Normalization Form KD: STABLE, DECOMPOSE, COMPAT | 
| jbe@0 | 39 | 
| jbe@0 | 40 | 
| jbe@0 | 41 *** C LIBRARY *** | 
| jbe@0 | 42 | 
| jbe@0 | 43 The documentation for the C library is found in the utf8proc.h header file. | 
| jbe@0 | 44 "utf8proc_map" is most likely the function you will be using for mapping UTF-8 | 
| jbe@0 | 45 strings, unless you want to allocate memory yourself. | 
| jbe@0 | 46 | 
| jbe@0 | 47 | 
| jbe@0 | 48 *** RUBY API *** | 
| jbe@0 | 49 | 
| jbe@0 | 50 The ruby library adds the methods "utf8map" and "utf8map!" to the String | 
| jbe@0 | 51 class, and the method "utf8" to the Integer class. | 
| jbe@0 | 52 | 
| jbe@0 | 53 The String#utf8map method does the same as the "utf8proc_map" C function. | 
| jbe@0 | 54 Options for the mapping procedure are passed as symbols, e.g.: | 
| jbe@2 | 55 "Hello".utf8map(:casefold) => "hello" | 
| jbe@0 | 56 | 
| jbe@7 | 57 The descriptions of all options are found in the C header file | 
| jbe@7 | 58 "utf8proc.h". Please note that the corresponding symbols in ruby are all | 
| jbe@7 | 59 lowercase. | 
| jbe@0 | 60 | 
| jbe@0 | 61 String#utf8map! is the destructive variant, meaning that the string | 
| jbe@0 | 62 is replaced by the result. | 
| jbe@0 | 63 | 
| jbe@0 | 64 There are shortcuts for the 4 normalization forms specified by Unicode: | 
| jbe@0 | 65 String#utf8nfd, String#utf8nfd!, | 
| jbe@0 | 66 String#utf8nfc, String#utf8nfc!, | 
| jbe@0 | 67 String#utf8nfkd, String#utf8nfkd!, | 
| jbe@0 | 68 String#utf8nfkc, String#utf8nfkc! | 
| jbe@0 | 69 | 
| jbe@0 | 70 The method Integer#utf8 returns a UTF-8 string containing the | 
| jbe@7 | 71 Unicode character given by the code point. | 
| jbe@0 | 72 0x000A.utf8 => "\n" | 
| jbe@0 | 73 0x2028.utf8 => "\342\200\250" | 
| jbe@0 | 74 | 
| jbe@0 | 75 | 
| jbe@0 | 76 *** POSTGRESQL API *** | 
| jbe@0 | 77 | 
| jbe@7 | 78 For PostgreSQL there are two SQL functions supplied named "unifold" and | 
| jbe@7 | 79 "unistrip". These functions can be used to prepare index fields in | 
| jbe@7 | 80 order to be folded in a way where string-comparisons make more sense, e.g. | 
| jbe@7 | 81 where "bathtub" == "bath<soft hyphen>tub" | 
| jbe@7 | 82 or "Hello World" == "hello world". | 
| jbe@0 | 83 | 
| jbe@1 | 84 CREATE TABLE people ( | 
| jbe@1 | 85 id serial8 primary key, | 
| jbe@1 | 86 name text, | 
| jbe@1 | 87 CHECK (unifold(name) NOTNULL) | 
| jbe@1 | 88 ); | 
| jbe@0 | 89 CREATE INDEX name_idx ON people (unifold(name)); | 
| jbe@0 | 90 SELECT * FROM people WHERE unifold(name) = unifold('John Doe'); | 
| jbe@0 | 91 | 
| jbe@7 | 92 The function "unistrip" removes character marks like accents or diaeresis, | 
| jbe@7 | 93 while "unifold" keeps them. | 
| jbe@7 | 94 | 
| jbe@7 | 95 NOTICE: The outputs of the function can change between releases, as | 
| jbe@7 | 96 utf8proc does not follow a versioning stability policy. You have to | 
| jbe@7 | 97 rebuild your database indices if you upgrade to a newer version | 
| jbe@7 | 98 of utf8proc. | 
| jbe@2 | 99 | 
| jbe@2 | 100 | 
| jbe@0 | 101 *** TODO *** | 
| jbe@0 | 102 | 
| jbe@0 | 103 - detect stable code points and process segments independently in order to | 
| jbe@0 | 104 save memory | 
| jbe@0 | 105 - do a quick check before normalizing strings to optimize speed | 
| jbe@0 | 106 - support stream processing | 
| jbe@0 | 107 | 
| jbe@0 | 108 | 
| jbe@7 | 109 *** CONTACT *** | 
| jbe@0 | 110 | 
| jbe@7 | 111 If you find any bugs or experience difficulties in compiling this software, | 
| jbe@10 | 112 please contact us: | 
| jbe@0 | 113 | 
| jbe@10 | 114 Project page: http://www.public-software-group.org/utf8proc | 
| jbe@7 | 115 | 
| jbe@9 | 116 |
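
The Integer#utf8 example and the four normalization forms described in the README above can be cross-checked without utf8proc installed. The following is a minimal sketch that uses only Ruby's standard library (Array#pack and String#unicode_normalize) as a stand-in; it does not call the utf8proc API, and the helper names "codepoint_to_utf8" and "octal_escapes" are hypothetical, chosen only for this illustration. Results from utf8proc itself may differ, since Ruby's built-in tables follow a newer Unicode version than 5.0.0.

```ruby
# Stand-in sketch using only Ruby's standard library, NOT the utf8proc API.

# Hypothetical helper: encode one code point as a UTF-8 string,
# comparable to what the README describes for Integer#utf8.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

# Hypothetical helper: show the bytes of a string as octal escapes,
# the notation used in the README's Integer#utf8 example.
def octal_escapes(string)
  string.bytes.map { |byte| format("\\%o", byte) }.join
end

# U+2028 (LINE SEPARATOR) is encoded as three bytes in UTF-8.
puts octal_escapes(codepoint_to_utf8(0x2028))

# The four normalization forms, via Ruby's built-in normalizer.
composed   = "\u00E9"    # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent
ligature   = "\uFB01"    # LATIN SMALL LIGATURE FI

p composed.unicode_normalize(:nfd) == decomposed   # canonical decomposition
p decomposed.unicode_normalize(:nfc) == composed   # canonical composition
p ligature.unicode_normalize(:nfkd)                # compatibility decomposition
p ligature.unicode_normalize(:nfkc)                # compatibility composition
```

The compatibility forms (KD, KC) replace the ligature by the two letters "f" and "i", while the canonical forms (D, C) leave it unchanged. This is the difference the COMPAT option makes in the option lists above.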