utf8proc
Version 0.1
| author | jbe | 
|---|---|
| date | Fri Jun 02 12:00:00 2006 +0200 (2006-06-02) | 
Please read the LICENSE file, which ships with this software.

*** QUICK START ***

To compile the C library, run "make c-library"; to compile the Ruby library,
run "make ruby-library"; and to compile the PostgreSQL extension, run
"make pgsql-library".

"make all" builds everything, but requires both Ruby and PostgreSQL to be
installed.

*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" (static) and "libutf8proc.so" (shared). The Ruby library
consists of the files "utf8proc.rb" and "utf8proc_native.so", which are found
in the subdirectory "ruby/". The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the C library
files, but this behaviour might change in future releases.

The supported Unicode version is 4.1.0.

To perform the Unicode normalizations, use the following options:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, COMPAT
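
As an illustration of what the four forms do, the following sketch applies
each of them to a short sample string. It uses Ruby's built-in
String#unicode_normalize as a stand-in rather than utf8proc itself, so it
shows the behaviour of the forms and not the utf8proc options. The sample
string is an arbitrary choice.

```ruby
# Stand-in illustration: Ruby's built-in String#unicode_normalize,
# NOT the utf8proc API. The sample string is an arbitrary choice.
sample = "\u00E9\uFB01"  # precomposed "e with acute" followed by the "fi" ligature

forms = {
  nfc:  sample.unicode_normalize(:nfc),   # composed, ligature kept
  nfd:  sample.unicode_normalize(:nfd),   # decomposed, ligature kept
  nfkc: sample.unicode_normalize(:nfkc),  # composed, ligature expanded to "fi"
  nfkd: sample.unicode_normalize(:nfkd)   # decomposed, ligature expanded to "fi"
}

# Print the code points of each result so the differences are visible.
forms.each do |name, str|
  hex = str.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{name}: #{hex}"
end
```

The compatibility forms (KC and KD) are the ones that replace the ligature
with plain letters, which corresponds to the COMPAT option in the list above.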

*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping UTF-8
strings, unless you want to allocate memory yourself.

*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:stable, :casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Please note that the corresponding symbols in Ruby are all lowercase.

String#utf8map! is the destructive variant: the string itself is replaced by
the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point:
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
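
The octal escapes in the second result are the three UTF-8 bytes of U+2028.
The sketch below reproduces that encoding with core Ruby's Array#pack as a
stand-in, so it runs without the utf8proc library. The helper name
"encode_utf8" is hypothetical and not part of the utf8proc API.

```ruby
# Stand-in: Array#pack("U") from core Ruby, NOT the utf8proc method.
# "encode_utf8" is a hypothetical helper name used only for illustration.
def encode_utf8(code_point)
  [code_point].pack("U")  # "U" packs an integer as a UTF-8 character
end

line_separator = encode_utf8(0x2028)

# Show each byte as an octal escape, the notation used in the example above.
puts line_separator.bytes.map { |b| format("\\%03o", b) }.join
# prints \342\200\250

puts encode_utf8(0x000A) == "\n"
# prints true
```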

*** POSTGRESQL API ***

For PostgreSQL, an SQL function named "unifold" is supplied. It can be used
to normalize and case-fold indexed fields, e.g.:

  CREATE TABLE people (id serial8 primary key, name text);
  CREATE INDEX name_idx ON people (unifold(name));
  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
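
The idea behind such a lookup key can be sketched outside the database. The
following uses core Ruby as a stand-in and assumes that the key is built by
NFKC normalization followed by case folding. That combination is an
assumption made for illustration; the exact options used by "unifold" are not
stated here. The helper name "fold_key" is hypothetical.

```ruby
# Hypothetical stand-in for the idea behind "unifold". The NFKC + case-fold
# combination is an assumption, not the extension's documented behaviour.
def fold_key(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

names = ["John Doe", "JOHN DOE", "Jos\u00E9", "JOSE\u0301"]
names.each { |name| puts "#{name.inspect} -> #{fold_key(name).inspect}" }

# Different spellings of the same name map to the same key.
puts fold_key("John Doe") == fold_key("JOHN DOE")
puts fold_key("Jos\u00E9") == fold_key("JOSE\u0301")
```

Indexing the folded key rather than the raw column is what lets the SELECT
above match names regardless of case or of how accented characters were
entered.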

*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing
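
One simple form the quick check could take is sketched below in core Ruby. It
is not how utf8proc works or is planned to work. It relies only on the fact
that pure ASCII text is left unchanged by all four normalization forms, and
the helper name "normalize_with_quick_check" is hypothetical.

```ruby
# Hypothetical sketch of a "quick check", using core Ruby rather than
# utf8proc. Pure ASCII text is unchanged by all four normalization forms,
# so the full normalization pass can be skipped for it.
def normalize_with_quick_check(text, form = :nfc)
  return text if text.ascii_only?  # fast path: nothing to normalize
  text.unicode_normalize(form)     # slow path: full normalization
end

ascii_input = "plain ascii"
puts normalize_with_quick_check(ascii_input).equal?(ascii_input)  # same object, untouched
puts normalize_with_quick_check("e\u0301") == "\u00E9"            # composed on the slow path
```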

Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.
