utf8proc
annotate README @ 9:951e73a98021
Version 1.1.3
- Added a function utf8proc_version returning a string containing the version number of the library.
- Included a target libutf8proc.dylib for MacOSX.
- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)
author | jbe |
---|---|
date | Fri May 01 12:00:00 2009 +0200 (2009-05-01) |
parents | fcfd8c836c64 |
children | 00d2bcbdc945 |
rev | line source |
---|---|
jbe@0 | 1 |
jbe@0 | 2 Please read the LICENSE file, which is shipped with this software. |
jbe@0 | 3 |
jbe@0 | 4 |
jbe@0 | 5 *** QUICK START *** |
jbe@0 | 6 |
jbe@0 | 7 For compilation of the C library call "make c-library", for compilation of |
jbe@0 | 8 the ruby library call "make ruby-library" and for compilation of the |
jbe@0 | 9 PostgreSQL extension call "make pgsql-library". |
jbe@0 | 10 |
jbe@0 | 11 "make all" can be used to build everything, but both ruby and PostgreSQL |
jbe@0 | 12 installations are required in this case. |
jbe@0 | 13 |
jbe@9 | 14 Alternatively, a gem file "utf8proc-1.1.3.gem" is provided for ruby. |
jbe@4 | 15 |
jbe@0 | 16 |
jbe@0 | 17 *** GENERAL INFORMATION *** |
jbe@0 | 18 |
jbe@7 | 19 The C library is found in this directory after successful compilation and |
jbe@7 | 20 is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of |
jbe@7 | 21 the files "utf8proc.rb" and "utf8proc_native.so", which are found in the |
jbe@0 | 22 subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so" |
jbe@0 | 23 and resides in the "pgsql/" directory. |
jbe@0 | 24 |
jbe@0 | 25 Both the ruby library and the PostgreSQL extension are built as stand-alone |
jbe@0 | 26 libraries and are therefore not dependent on the dynamic version of the |
jbe@0 | 27 C library files, but this behaviour might change in future releases. |
jbe@0 | 28 |
jbe@2 | 29 The Unicode version being supported is 5.0.0. |
jbe@7 | 30 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as |
jbe@7 | 31 version 5.0.0 was not yet available. |
jbe@0 | 32 |
jbe@0 | 33 For Unicode normalizations, the following options have to be used: |
jbe@0 | 34 Normalization Form C: STABLE, COMPOSE |
jbe@2 | 35 Normalization Form D: STABLE, DECOMPOSE |
jbe@0 | 36 Normalization Form KC: STABLE, COMPOSE, COMPAT |
jbe@2 | 37 Normalization Form KD: STABLE, DECOMPOSE, COMPAT |
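
To illustrate what the four normalization forms do to a string, here is a
minimal sketch. It uses Python's standard "unicodedata" module as a stand-in,
not utf8proc itself. The helper name "normalize_all" is hypothetical, and
Python may be built against a newer Unicode version than 5.0.0, so results
for less common characters could differ.

```python
import unicodedata


def normalize_all(text):
    """Return the four Unicode normalization forms of text as a dict."""
    return {
        form: unicodedata.normalize(form, text)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }


# Precomposed "e with acute" (U+00E9) followed by the "fi" ligature (U+FB01).
forms = normalize_all("\u00e9\ufb01")

# NFC keeps the precomposed letter and the ligature.
# NFD splits the letter into "e" + combining acute; the ligature is untouched.
# NFKC and NFKD additionally replace the ligature by the letters "f" and "i".
for name, value in forms.items():
    print(name, [hex(ord(ch)) for ch in value])
```

The compatibility forms (KC, KD) correspond to adding the COMPAT option in the
list above; the composed and decomposed forms correspond to COMPOSE and
DECOMPOSE.
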
jbe@0 | 38 |
jbe@0 | 39 |
jbe@0 | 40 *** C LIBRARY *** |
jbe@0 | 41 |
jbe@0 | 42 The documentation for the C library is found in the utf8proc.h header file. |
jbe@0 | 43 "utf8proc_map" is most likely the function you will be using for mapping UTF-8 |
jbe@0 | 44 strings, unless you want to allocate memory yourself. |
jbe@0 | 45 |
jbe@0 | 46 |
jbe@0 | 47 *** RUBY API *** |
jbe@0 | 48 |
jbe@0 | 49 The ruby library adds the methods "utf8map" and "utf8map!" to the String |
jbe@0 | 50 class, and the method "utf8" to the Integer class. |
jbe@0 | 51 |
jbe@0 | 52 The String#utf8map method does the same as the "utf8proc_map" C function. |
jbe@0 | 53 Options for the mapping procedure are passed as symbols, e.g.: |
jbe@2 | 54 "Hello".utf8map(:casefold) => "hello" |
jbe@0 | 55 |
jbe@7 | 56 The descriptions of all options are found in the C header file |
jbe@7 | 57 "utf8proc.h". Please note that the corresponding symbols in ruby are all |
jbe@7 | 58 lowercase. |
jbe@0 | 59 |
jbe@0 | 60 String#utf8map! is the destructive variant: the string itself is |
jbe@0 | 61 replaced by the result. |
jbe@0 | 62 |
jbe@0 | 63 There are shortcuts for the 4 normalization forms specified by Unicode: |
jbe@0 | 64 String#utf8nfd, String#utf8nfd!, |
jbe@0 | 65 String#utf8nfc, String#utf8nfc!, |
jbe@0 | 66 String#utf8nfkd, String#utf8nfkd!, |
jbe@0 | 67 String#utf8nfkc, String#utf8nfkc! |
jbe@0 | 68 |
jbe@0 | 69 The method Integer#utf8 returns a UTF-8 string containing the |
jbe@7 | 70 Unicode character given by the code point. |
jbe@0 | 71 0x000A.utf8 => "\n" |
jbe@0 | 72 0x2028.utf8 => "\342\200\250" |
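
The byte values in the example above can be cross-checked without the ruby
library. The following sketch uses Python's built-in UTF-8 encoder; the
function name "utf8" is only chosen to mirror Integer#utf8 and is not part of
utf8proc.

```python
def utf8(code_point):
    """Encode a single Unicode code point as a UTF-8 byte string."""
    return chr(code_point).encode("utf-8")


# U+000A (line feed) is a single byte; U+2028 (line separator) needs three.
print(utf8(0x000A))
print(utf8(0x2028))
```

The three bytes for U+2028 are E2 80 A8 in hexadecimal, which is the same
sequence as the octal escapes \342\200\250 shown above.
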
jbe@0 | 73 |
jbe@0 | 74 |
jbe@0 | 75 *** POSTGRESQL API *** |
jbe@0 | 76 |
jbe@7 | 77 For PostgreSQL there are two SQL functions supplied, named "unifold" and |
jbe@7 | 78 "unistrip". These functions can be used to fold index fields in such a |
jbe@7 | 79 way that string comparisons make more sense, e.g. |
jbe@7 | 80 where "bathtub" == "bath<soft hyphen>tub" |
jbe@7 | 81 or "Hello World" == "hello world". |
jbe@0 | 82 |
jbe@1 | 83 CREATE TABLE people ( |
jbe@1 | 84 id serial8 primary key, |
jbe@1 | 85 name text, |
jbe@1 | 86 CHECK (unifold(name) NOTNULL) |
jbe@1 | 87 ); |
jbe@0 | 88 CREATE INDEX name_idx ON people (unifold(name)); |
jbe@0 | 89 SELECT * FROM people WHERE unifold(name) = unifold('John Doe'); |
jbe@0 | 90 |
jbe@7 | 91 The function "unistrip" removes character marks like accents or diaeresis, |
jbe@7 | 92 while "unifold" keeps them. |
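
The idea behind the two functions can be sketched outside the database. The
code below is only a rough approximation built on Python's standard
"unicodedata" module. The names "fold" and "strip_marks" are hypothetical, and
the exact set of options that "unifold" and "unistrip" apply is defined by the
utf8proc sources, not by this sketch.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"


def fold(text):
    """Approximate folding: drop soft hyphens, case-fold, normalize to NFKC."""
    text = text.replace(SOFT_HYPHEN, "")
    return unicodedata.normalize("NFKC", text.casefold())


def strip_marks(text):
    """Approximate stripping: fold, decompose, then drop all mark characters."""
    decomposed = unicodedata.normalize("NFKD", fold(text))
    # General categories Mn, Mc and Me cover accents, diaereses and similar.
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


print(fold("bath\u00adtub") == fold("bathtub"))
print(fold("Hello World") == fold("hello world"))
print(strip_marks("Caf\u00e9"))
```

With this approximation, folding alone keeps the accent in "Café", while
stripping reduces it to "cafe".
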
jbe@7 | 93 |
jbe@7 | 94 NOTICE: The outputs of the functions can change between releases, as |
jbe@7 | 95 utf8proc does not follow a versioning stability policy. You have to |
jbe@7 | 96 rebuild your database indices if you upgrade to a newer version |
jbe@7 | 97 of utf8proc. |
jbe@2 | 98 |
jbe@2 | 99 |
jbe@0 | 100 *** TODO *** |
jbe@0 | 101 |
jbe@0 | 102 - detect stable code points and process segments independently in order to |
jbe@0 | 103 save memory |
jbe@0 | 104 - do a quick check before normalizing strings to optimize speed |
jbe@0 | 105 - support stream processing |
jbe@0 | 106 |
jbe@0 | 107 |
jbe@7 | 108 *** CONTACT *** |
jbe@0 | 109 |
jbe@7 | 110 If you find any bugs or experience difficulties in compiling this software, |
jbe@7 | 111 please contact me: |
jbe@0 | 112 |
jbe@9 | 113 Project page: http://www.flexiguided.de/publications.utf8proc.en.html |
jbe@9 | 114 Contact form: http://www.flexiguided.de/contactform.en.html |
jbe@7 | 115 |
jbe@9 | 116 Jan Behrens |
jbe@9 | 117 |