utf8proc

annotate README @ 0:a0368662434c

Version 0.1
author jbe
date Fri Jun 02 12:00:00 2006 +0200 (2006-06-02)
parents
children 61a89ecc2fb9
rev   line source
jbe@0 1
jbe@0 2 Please read the LICENSE file, which ships with this software.
jbe@0 3
jbe@0 4
jbe@0 5 *** QUICK START ***
jbe@0 6
jbe@0 7 To compile the C library, run "make c-library"; to compile the ruby
jbe@0 8 library, run "make ruby-library"; and to compile the PostgreSQL extension,
jbe@0 9 run "make pgsql-library".
jbe@0 10
jbe@0 11 "make all" can be used to build everything, but both ruby and PostgreSQL
jbe@0 12 installations are required in this case.
jbe@0 13
jbe@0 14
jbe@0 15 *** GENERAL INFORMATION ***
jbe@0 16
jbe@0 17 The C library is found in this directory after successful compilation and is
jbe@0 18 named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
jbe@0 19 files "utf8proc.rb" and "utf8proc_native.so", which are found in the
jbe@0 20 subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
jbe@0 21 and resides in the "pgsql/" directory.
jbe@0 22
jbe@0 23 Both the ruby library and the PostgreSQL extension are built as stand-alone
jbe@0 24 libraries and are therefore not dependent on the dynamic version of the
jbe@0 25 C library files, but this behaviour might change in future releases.
jbe@0 26
jbe@0 27 The supported Unicode version is 4.1.0.
jbe@0 28
jbe@0 29 To obtain the Unicode normalization forms, use the following options:
jbe@0 30 Normalization Form C: STABLE, COMPOSE
jbe@0 31 Normalization Form D: STABLE
jbe@0 32 Normalization Form KC: STABLE, COMPOSE, COMPAT
jbe@0 33 Normalization Form KD: STABLE, COMPAT
jbe@0 34
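As an illustration of how the four forms differ, here is a minimal sketch.
It records the option sets from the table above as lowercase symbols (the
constant name NORMALIZATION_OPTIONS is made up for this example) and then
prints the code points each form yields for one sample string. The
normalization itself is done by Ruby's built-in String#unicode_normalize,
used purely as a stand-in; utf8proc is not called, and its output may
differ in detail.

```ruby
# Option sets per normalization form, copied from the table above.
# NORMALIZATION_OPTIONS is a hypothetical name used only in this sketch.
NORMALIZATION_OPTIONS = {
  nfc:  [:stable, :compose],
  nfd:  [:stable],
  nfkc: [:stable, :compose, :compat],
  nfkd: [:stable, :compat]
}.freeze

# Precomposed A WITH RING ABOVE (U+00C5) followed by the "fi" ligature (U+FB01).
sample = "\u00C5\uFB01"

NORMALIZATION_OPTIONS.each do |form, options|
  # Stand-in: Ruby's standard library normalizer, NOT utf8proc.
  normalized = sample.unicode_normalize(form)
  code_points = normalized.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{form}: options #{options.inspect} -> #{code_points}"
end
```

The COMPOSE option corresponds to the composed forms (C and KC), and the
COMPAT option to the compatibility forms (KC and KD), which also replace
characters such as the ligature by their plain equivalents.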
jbe@0 35
jbe@0 36 *** C LIBRARY ***
jbe@0 37
jbe@0 38 The documentation for the C library is found in the utf8proc.h header file.
jbe@0 39 "utf8proc_map" is most likely the function you will be using for mapping UTF-8
jbe@0 40 strings, unless you want to allocate memory yourself.
jbe@0 41
jbe@0 42
jbe@0 43 *** RUBY API ***
jbe@0 44
jbe@0 45 The ruby library adds the methods "utf8map" and "utf8map!" to the String
jbe@0 46 class, and the method "utf8" to the Integer class.
jbe@0 47
jbe@0 48 The String#utf8map method does the same as the "utf8proc_map" C function.
jbe@0 49 Options for the mapping procedure are passed as symbols, e.g.:
jbe@0 50 "Hello".utf8map(:stable, :casefold) => "hello"
jbe@0 51
jbe@0 52 The descriptions of all options are found in the C header file "utf8proc.h".
jbe@0 53 Please note that the corresponding symbols in ruby are all lowercase.
jbe@0 54
jbe@0 55 String#utf8map! is the destructive variant: the string itself is replaced
jbe@0 56 by the result.
jbe@0 57
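The difference between a non-destructive and a destructive mapping method
can be sketched without utf8proc installed. The helper names fold_sketch
and fold_sketch! below are hypothetical. They cover only the case-folding
part of the "Hello" example, using Ruby's built-in String#downcase(:fold)
as a stand-in for utf8map(:stable, :casefold).

```ruby
# Hypothetical stand-in for the case-folding part of utf8map(:stable, :casefold).
# Uses only the Ruby standard library, not utf8proc.
def fold_sketch(str)
  str.downcase(:fold)
end

# Destructive counterpart, mirroring the utf8map / utf8map! pair:
# the contents of the given string are replaced by the result.
def fold_sketch!(str)
  str.replace(fold_sketch(str))
end

name = "Hello".dup            # unfrozen copy, so it can be modified in place
folded = fold_sketch(name)    # returns a new string, name is unchanged
unchanged = name.dup          # snapshot of name before the destructive call
fold_sketch!(name)            # name itself now holds the folded text

puts folded
puts unchanged
puts name
```

The non-destructive call leaves the original string available for display,
while the destructive call avoids keeping two copies around when only the
mapped text is needed.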
jbe@0 58 There are shortcuts for the 4 normalization forms specified by Unicode:
jbe@0 59 String#utf8nfd, String#utf8nfd!,
jbe@0 60 String#utf8nfc, String#utf8nfc!,
jbe@0 61 String#utf8nfkd, String#utf8nfkd!,
jbe@0 62 String#utf8nfkc, String#utf8nfkc!
jbe@0 63
jbe@0 64 The method Integer#utf8 returns a UTF-8 string containing the Unicode
jbe@0 65 character for the given code point:
jbe@0 66 0x000A.utf8 => "\n"
jbe@0 67 0x2028.utf8 => "\342\200\250"
jbe@0 68
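For readers without the native extension at hand, the two results above can
be reproduced with Ruby's standard Array#pack and its "U" directive. The
helper name utf8_sketch is hypothetical and is not part of utf8proc. The
octal escapes in the second example are simply the three bytes of the
encoded character.

```ruby
# Hypothetical stand-in for Integer#utf8, built on the standard library.
# Array#pack with the "U" directive encodes a code point as UTF-8.
def utf8_sketch(code_point)
  [code_point].pack("U")
end

newline = utf8_sketch(0x000A)
line_separator = utf8_sketch(0x2028)

# Show each string as octal byte escapes, the notation used in this README.
octal = ->(str) { str.bytes.map { |byte| format("\\%03o", byte) }.join }

puts octal.call(newline)
puts octal.call(line_separator)
```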
jbe@0 69
jbe@0 70 *** POSTGRESQL API ***
jbe@0 71
jbe@0 72 For PostgreSQL, an SQL function named "unifold" is supplied. This
jbe@0 73 function can be used to normalize and case-fold the fields used in an
jbe@0 74 index, e.g.:
jbe@0 75
jbe@0 76 CREATE TABLE people (id serial8 primary key, name text);
jbe@0 77 CREATE INDEX name_idx ON people (unifold(name));
jbe@0 78 SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
jbe@0 79
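The idea behind the index and the query above is that the stored values and
the search term pass through the same folding function, so that spellings
differing only in case or in Unicode representation end up under the same
key. The sketch below mimics this in plain Ruby. The name fold_key and the
choice of NFKC plus case folding are assumptions made for illustration;
this README does not state which options "unifold" uses internally.

```ruby
# Hypothetical folding function: NFKC normalization plus case folding,
# done with the Ruby standard library (an assumption, not utf8proc).
def fold_key(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

people = [
  { id: 1, name: "John Doe" },
  { id: 2, name: "JOHN DOE" },
  { id: 3, name: "Zoe\u0308" }   # "e" followed by a combining diaeresis
]

# Plays the role of CREATE INDEX ... ON people (unifold(name)).
index = people.group_by { |row| fold_key(row[:name]) }

# Plays the role of WHERE unifold(name) = unifold('...').
john_ids = index.fetch(fold_key("john doe"), []).map { |row| row[:id] }
zoe_ids  = index.fetch(fold_key("ZO\u00CB"), []).map { |row| row[:id] }

p john_ids
p zoe_ids
```

Folding both sides of the comparison is what allows the database to answer
the query from the index instead of scanning every row.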
jbe@0 80
jbe@0 81 *** TODO ***
jbe@0 82
jbe@0 83 - detect stable code points and process segments independently in order to
jbe@0 84 save memory
jbe@0 85 - do a quick check before normalizing strings to optimize speed
jbe@0 86 - support stream processing
jbe@0 87
jbe@0 88
jbe@0 89 Unicode is a trademark of Unicode, Inc., and may be registered in some
jbe@0 90 jurisdictions.
jbe@0 91
jbe@0 92
