Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, run "make c-library". To compile the Ruby library,
run "make ruby-library". To compile the PostgreSQL extension, run
"make pgsql-library".

"make all" builds everything, but requires both Ruby and PostgreSQL to be
installed.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory under
the names "libutf8proc.a" and "libutf8proc.so". The Ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the C
library, but this behaviour might change in future releases.

The supported Unicode version is 4.1.0.

For Unicode normalization, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, COMPAT

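The option sets above can be written down as a small Ruby table. The sketch
below is only an illustration: "NORMALIZATION_OPTIONS" and
"normalize_with_form" are hypothetical names, not part of utf8proc. It calls
String#utf8map (see the Ruby API section) when the utf8proc Ruby library is
loaded, and otherwise falls back to Ruby's built-in String#unicode_normalize,
which follows a newer Unicode version than 4.1.0.

```ruby
# Option sets for each normalization form, as listed in this README, written
# as the lowercase symbols that String#utf8map accepts.
NORMALIZATION_OPTIONS = {
  nfc:  [:stable, :compose],
  nfd:  [:stable],
  nfkc: [:stable, :compose, :compat],
  nfkd: [:stable, :compat]
}.freeze

# Hypothetical helper (not part of utf8proc): normalizes str to the given form.
# Uses String#utf8map if the utf8proc Ruby library is loaded; otherwise falls
# back to Ruby's built-in String#unicode_normalize (an assumption made so the
# sketch runs without utf8proc installed).
def normalize_with_form(str, form)
  options = NORMALIZATION_OPTIONS.fetch(form)
  if str.respond_to?(:utf8map)
    str.utf8map(*options)
  else
    str.unicode_normalize(form)
  end
end

decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI

normalize_with_form(decomposed, :nfc)  # "\u00E9", one precomposed code point
normalize_with_form("\u00E9", :nfd)    # "e\u0301", decomposed again
normalize_with_form(ligature, :nfkc)   # "fi", compatibility mapping applied
normalize_with_form(ligature, :nfc)    # "\uFB01", ligature is kept
```

The forms without COMPAT leave compatibility characters such as the ligature
untouched, while the forms without COMPOSE leave the text decomposed.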

*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:stable, :casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Please note that the corresponding symbols in Ruby are all lowercase.

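As a rough illustration of the naming rule and of case folding, the sketch
below converts uppercase option names to Ruby symbols and folds a string.
"C_OPTION_NAMES", "option_symbol" and "casefold" are hypothetical names. The
uppercase spelling CASEFOLD is inferred from the :casefold symbol, and the
fallback to Ruby's built-in String#downcase(:fold) is an assumption so that
the sketch also runs without utf8proc.

```ruby
# Option names mentioned in this README, in their uppercase spelling.
# CASEFOLD is inferred from the :casefold symbol used in the example above.
C_OPTION_NAMES = %w[STABLE COMPOSE COMPAT CASEFOLD].freeze

# Hypothetical helper: converts an uppercase option name to its Ruby symbol.
def option_symbol(c_name)
  c_name.downcase.to_sym
end

# Hypothetical helper: case-folds str with utf8proc if it is loaded, and with
# Ruby's built-in full case folding otherwise.
def casefold(str)
  if str.respond_to?(:utf8map)
    str.utf8map(:stable, :casefold)
  else
    str.downcase(:fold)
  end
end

C_OPTION_NAMES.map { |name| option_symbol(name) }
# => [:stable, :compose, :compat, :casefold]
casefold("Hello")
# => "hello"
```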
String#utf8map! is the destructive variant: the string itself is replaced by
the result.

There are shortcuts for the four normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point:
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"

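To see where the octal escapes in the second example come from, the same
encoding can be reproduced with Ruby's core library alone. This is only a
stand-in for Integer#utf8; "codepoint_to_utf8" is a hypothetical name and not
part of utf8proc.

```ruby
# Hypothetical stand-in for Integer#utf8, built on Ruby's core Array#pack.
# The "U" directive encodes an integer code point as UTF-8.
def codepoint_to_utf8(codepoint)
  [codepoint].pack("U")
end

line_separator = codepoint_to_utf8(0x2028)   # U+2028 LINE SEPARATOR

line_separator.bytes                         # => [226, 128, 168]
line_separator.bytes.map { |b| b.to_s(8) }   # => ["342", "200", "250"]
codepoint_to_utf8(0x000A)                    # => "\n"
```

The three bytes 0xE2 0x80 0xA8 are the UTF-8 encoding of U+2028; written in
octal they give the "\342\200\250" shown above.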

*** POSTGRESQL API ***

For PostgreSQL, an SQL function named "unifold" is supplied. This function
can be used to prepare index fields so that they are normalized and
case-folded, e.g.:

CREATE TABLE people (id serial8 primary key, name text);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

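The idea behind comparing folded values can be sketched in Ruby. The exact
options applied by "unifold" are not stated here, so the combination below
(NFKC normalization followed by full case folding, using Ruby's built-in
methods) is an assumption for illustration only, and "fold_key" is a
hypothetical name.

```ruby
# Hypothetical approximation of a "normalize and case-fold" comparison key.
# Assumption: NFKC followed by full case folding; the real unifold function
# may use different options.
def fold_key(str)
  str.unicode_normalize(:nfkc).downcase(:fold)
end

fold_key("John Doe") == fold_key("JOHN DOE")     # => true
fold_key("Jose\u0301") == fold_key("JOS\u00C9")  # => true
fold_key("John Doe") == fold_key("Jane Doe")     # => false
```

Because both sides of the comparison go through the same function, strings
that differ only in case or in composed versus decomposed spelling produce
the same key, which is what allows the index on unifold(name) to be used.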

*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing

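The quick-check item could look roughly like the following sketch. It is not
how utf8proc works today; "normalize_with_quick_check" is a hypothetical
name, and Ruby's built-in normalization methods stand in for the library.
The sketch covers normalization only, not case folding.

```ruby
# Hypothetical sketch of the "quick check" idea: skip the normalization pass
# when a cheaper test already shows that the string is normalized.
def normalize_with_quick_check(str, form = :nfc)
  # Pure ASCII text is unchanged by all four normalization forms.
  return str if str.ascii_only?
  # Ruby's built-in check, used here as a stand-in for a utf8proc quick check.
  return str if str.unicode_normalized?(form)
  str.unicode_normalize(form)
end

plain = "plain ascii"
normalize_with_quick_check(plain).equal?(plain)  # => true, same object returned
normalize_with_quick_check("e\u0301")            # => "\u00E9"
```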

Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.