utf8proc

diff README @ 0:a0368662434c
Version 0.1
author: jbe
date: Fri Jun 02 12:00:00 2006 +0200 (2006-06-02)
children: 61a89ecc2fb9
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/README	Fri Jun 02 12:00:00 2006 +0200
     1.3 @@ -0,0 +1,92 @@
     1.4 +
     1.5 +Please read the LICENSE file, which is shipped with this software.
     1.6 +
     1.7 +
     1.8 +*** QUICK START ***
     1.9 +
    1.10 +To compile the C library, run "make c-library"; to compile the ruby
    1.11 +library, run "make ruby-library"; and to compile the PostgreSQL
    1.12 +extension, run "make pgsql-library".
    1.13 +
    1.14 +"make all" can be used to build everything, but both ruby and PostgreSQL
    1.15 +installations are required in this case.
    1.16 +
    1.17 +
    1.18 +*** GENERAL INFORMATION ***
    1.19 +
    1.20 +After successful compilation, the C library is found in this directory as
    1.21 +"libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
    1.22 +files "utf8proc.rb" and "utf8proc_native.so", which are found in the
    1.23 +subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
    1.24 +and resides in the "pgsql/" directory.
    1.25 +
    1.26 +Both the ruby library and the PostgreSQL extension are built as stand-alone
    1.27 +libraries and are therefore not dependent on the dynamic version of the
    1.28 +C library files, but this behaviour might change in future releases.
    1.29 +
    1.30 +The supported Unicode version is 4.1.0.
    1.31 +
    1.32 +For Unicode normalizations, the following options have to be used:
    1.33 +Normalization Form C:  STABLE, COMPOSE
    1.34 +Normalization Form D:  STABLE
    1.35 +Normalization Form KC: STABLE, COMPOSE, COMPAT
    1.36 +Normalization Form KD: STABLE, COMPAT
    1.37 +
    1.38 +
    1.39 +*** C LIBRARY ***
    1.40 +
    1.41 +The documentation for the C library is found in the utf8proc.h header file.
    1.42 +"utf8proc_map" is most likely the function you will be using for mapping UTF-8
    1.43 +strings, unless you want to allocate memory yourself.
    1.44 +
    1.45 +
    1.46 +*** RUBY API ***
    1.47 +
    1.48 +The ruby library adds the methods "utf8map" and "utf8map!" to the String
    1.49 +class, and the method "utf8" to the Integer class.
    1.50 +
    1.51 +The String#utf8map method does the same as the "utf8proc_map" C function.
    1.52 +Options for the mapping procedure are passed as symbols, e.g.:
    1.53 +"Hello".utf8map(:stable, :casefold) => "hello"
    1.54 +
    1.55 +The descriptions of all options are found in the C header file "utf8proc.h".
    1.56 +Please note that the corresponding symbols in ruby are all lowercase.
    1.57 +
    1.58 +String#utf8map! is the destructive variant: the string itself is replaced
    1.59 +by the result.
    1.60 +
    1.61 +There are shortcuts for the 4 normalization forms specified by Unicode:
    1.62 +String#utf8nfd,  String#utf8nfd!,
    1.63 +String#utf8nfc,  String#utf8nfc!,
    1.64 +String#utf8nfkd, String#utf8nfkd!,
    1.65 +String#utf8nfkc, String#utf8nfkc!
    1.66 +
    1.67 +The method Integer#utf8 returns a UTF-8 string containing the Unicode
    1.68 +character for the given code point.
    1.69 +0x000A.utf8 => "\n"
    1.70 +0x2028.utf8 => "\342\200\250"
    1.71 +
    1.72 +
    1.73 +*** POSTGRESQL API ***
    1.74 +
    1.75 +For PostgreSQL, an SQL function named "unifold" is supplied. This
    1.76 +function can be used to normalize and case-fold index fields, so that
    1.77 +lookups ignore case and equivalent encodings, e.g.:
    1.78 +
    1.79 +CREATE TABLE people (id serial8 primary key, name text);
    1.80 +CREATE INDEX name_idx ON people (unifold(name));
    1.81 +SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
    1.82 +
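The idea behind indexing on a folded value can be sketched outside the
database. The snippet below is an assumption about the general technique,
not the implementation of "unifold": it combines Ruby's built-in NFKC
normalization with String#downcase(:fold), and the name fold_key is
hypothetical. The exact options used by "unifold" are not stated here.

```ruby
# Sketch only: builds a comparison key by normalizing and case-folding.
# This approximates the kind of key an index on unifold(name) would store;
# the real function may use different options.
def fold_key(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

# A hash keyed by the folded name plays the role of the index.
people = ["John Doe", "Ren\u00E9e Smith"]
index = people.each_with_object({}) { |name, idx| idx[fold_key(name)] = name }

# Lookups fold the search term the same way, so case and the choice between
# precomposed and decomposed accents no longer matter.
puts index[fold_key("JOHN DOE")]
puts index[fold_key("RENE\u0301E SMITH")]
```

As in the SQL example above, both the stored value and the search term go
through the same function, which is what makes the comparison reliable.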
    1.83 +
    1.84 +*** TODO ***
    1.85 +
    1.86 +- detect stable code points and process segments independently in order to
    1.87 +  save memory
    1.88 +- do a quick check before normalizing strings to optimize speed
    1.89 +- support stream processing
    1.90 +
    1.91 +
    1.92 +Unicode is a trademark of Unicode, Inc., and may be registered in some
    1.93 +jurisdictions.
    1.94 +
    1.95 +