utf8proc
diff README @ 0:a0368662434c
Version 0.1
| author | jbe | 
|---|---|
| date | Fri Jun 02 12:00:00 2006 +0200 (2006-06-02) | 
| parents | |
| children | 61a89ecc2fb9 | 
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README	Fri Jun 02 12:00:00 2006 +0200
@@ -0,0 +1,92 @@
+
+Please read the LICENSE file, which ships with this software.
+
+
+*** QUICK START ***
+
+For compilation of the C library call "make c-library", for compilation of
+the ruby library call "make ruby-library" and for compilation of the
+PostgreSQL extension call "make pgsql-library".
+
+"make all" can be used to build everything, but both ruby and PostgreSQL
+installations are required in this case.
+
+
+*** GENERAL INFORMATION ***
+
+The C library is found in this directory after successful compilation and is
+named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
+files "utf8proc.rb" and "utf8proc_native.so", which are found in the
+subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
+and resides in the "pgsql/" directory.
+
+Both the ruby library and the PostgreSQL extension are built as stand-alone
+libraries and are therefore not dependent on the dynamic version of the
+C library files, but this behaviour might change in future releases.
+
+The Unicode version being supported is 4.1.0.
+
+For Unicode normalizations, the following options have to be used:
+Normalization Form C: STABLE, COMPOSE
+Normalization Form D: STABLE
+Normalization Form KC: STABLE, COMPOSE, COMPAT
+Normalization Form KD: STABLE, COMPAT
+
+
+*** C LIBRARY ***
+
+The documentation for the C library is found in the utf8proc.h header file.
+"utf8proc_map" is most likely the function you will be using for mapping UTF-8
+strings, unless you want to allocate memory yourself.
+
+
+*** RUBY API ***
+
+The ruby library adds the methods "utf8map" and "utf8map!" to the String
+class, and the method "utf8" to the Integer class.
+
+The String#utf8map method does the same as the "utf8proc_map" C function.
+Options for the mapping procedure are passed as symbols, e.g.:
+"Hello".utf8map(:stable, :casefold) => "hello"
+
+The descriptions of all options are found in the C header file "utf8proc.h".
+Please note that the corresponding symbols in ruby are all lowercase.
+
+String#utf8map! is the destructive variant, meaning that the string
+is replaced by the result.
+
+There are shortcuts for the 4 normalization forms specified by Unicode:
+String#utf8nfd, String#utf8nfd!,
+String#utf8nfc, String#utf8nfc!,
+String#utf8nfkd, String#utf8nfkd!,
+String#utf8nfkc, String#utf8nfkc!
+
+The method Integer#utf8 returns a UTF-8 string containing the
+Unicode character given by the code point.
+0x000A.utf8 => "\n"
+0x2028.utf8 => "\342\200\250"
+
+
+*** POSTGRESQL API ***
+
+For PostgreSQL there is an SQL function supplied named "unifold". This
+function can be used to prepare index fields in order to be normalized and
+case-folded, e.g.:
+
+CREATE TABLE people (id serial8 primary key, name text);
+CREATE INDEX name_idx ON people (unifold(name));
+SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
+
+
+*** TODO ***
+
+- detect stable code points and process segments independently in order to
+  save memory
+- do a quick check before normalizing strings to optimize speed
+- support stream processing
+
+
+Unicode is a trademark of Unicode, Inc., and may be registered in some
+jurisdictions.
+
+
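
The Integer#utf8 examples in the README (0x000A.utf8 => "\n", 0x2028.utf8 => "\342\200\250") follow directly from the standard UTF-8 encoding rules. As a minimal sketch, assuming only plain Ruby and no utf8proc installation, the hypothetical helper codepoint_to_utf8 below (not part of the library, and not necessarily how utf8proc implements it) reproduces both results by hand:

```ruby
# Hypothetical helper, NOT part of utf8proc: encodes a single Unicode code
# point as a UTF-8 string, mirroring what the README describes for
# Integer#utf8.
def codepoint_to_utf8(cp)
  # Reject values outside the Unicode range and UTF-16 surrogates.
  if cp < 0 || cp > 0x10FFFF || (0xD800..0xDFFF).cover?(cp)
    raise RangeError, "not a Unicode scalar value: #{cp}"
  end

  bytes =
    if cp < 0x80          # 1 byte:  0xxxxxxx
      [cp]
    elsif cp < 0x800      # 2 bytes: 110xxxxx 10xxxxxx
      [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]
    elsif cp < 0x10000    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    else                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
      [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
       0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)]
    end

  # Pack the raw bytes and label the result as UTF-8.
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end

# Print the two code points from the README; the second one is shown as
# octal escapes to match the README's notation.
puts codepoint_to_utf8(0x000A).inspect
puts codepoint_to_utf8(0x2028).bytes.map { |b| format("\\%03o", b) }.join
# prints "\n" and then \342\200\250
```

Ruby's built-in Array#pack gives the same string via [0x2028].pack("U"); the explicit version only makes the byte layout visible.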