
Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the ruby
library, call "make ruby-library"; and to compile the PostgreSQL extension,
call "make pgsql-library".

For ruby you can also create a gem file by calling "make ruby-gem".

"make all" builds everything, but it requires both ruby and PostgreSQL to
be installed.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem file, it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT
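
To make the difference between the four normalization forms concrete, here
is a small sketch. It deliberately uses Ruby's built-in
String#unicode_normalize rather than utf8proc, so it only illustrates what
the forms do to a string; it is not the utf8proc API, and the variable names
are made up for this example.

```ruby
# Illustration only: Ruby's built-in String#unicode_normalize is used here
# as a stand-in; none of this calls utf8proc.

composed   = "\u00E9"    # "e with acute" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"    # LATIN SMALL LIGATURE FI (compatibility character)

# Form C composes, Form D decomposes.
nfc = decomposed.unicode_normalize(:nfc)
nfd = composed.unicode_normalize(:nfd)

# The K forms additionally apply compatibility mappings.
nfkc = ligature.unicode_normalize(:nfkc)
nfkd = ligature.unicode_normalize(:nfkd)

# Without compatibility mapping, the ligature is left untouched.
nfc_ligature = ligature.unicode_normalize(:nfc)

# Print the code points of each result for inspection.
{ "NFC" => nfc, "NFD" => nfd, "NFKC" => nfkc, "NFKD" => nfkd }.each do |form, s|
  puts "#{form}: " + s.codepoints.map { |c| format("U+%04X", c) }.join(" ")
end
```

The K forms correspond to adding the COMPAT option in the table above.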


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!
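
The difference between a plain method and its destructive "!" counterpart
can be sketched as follows. The example assumes nothing about utf8proc
itself: it uses Ruby's built-in unicode_normalize and unicode_normalize!
as stand-ins, which follow the same naming convention.

```ruby
# Illustration only: built-in methods stand in for the utf8proc shortcuts.

original = "e\u0301"                         # decomposed, two code points
copy     = original.unicode_normalize(:nfc)  # returns a new string

puts original.length   # the receiver is unchanged
puts copy.length       # the copy is composed

in_place = "e\u0301".dup            # dup gives a string that may be modified
in_place.unicode_normalize!(:nfc)   # replaces the receiver's content

puts in_place.length
```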

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
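
For comparison, the same conversion can be sketched in plain Ruby without
utf8proc. The helper name "codepoint_to_utf8" is hypothetical and exists
only for this example; it relies on Array#pack with the "U" directive.

```ruby
# Hypothetical helper, not part of the utf8proc ruby library.
# Array#pack with the "U" directive encodes a code point as UTF-8.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

newline        = codepoint_to_utf8(0x000A)
line_separator = codepoint_to_utf8(0x2028)

p newline
# Show the individual bytes in octal, as in the example above.
puts line_separator.bytes.map { |b| format("\\%o", b) }.join
```

The second line printed shows the three bytes of U+2028 in octal, which
should match the escape sequence shown for 0x2028.utf8.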


*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way that makes string comparisons more meaningful,
e.g. where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
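
The kind of folding described above can be approximated in plain Ruby. This
is only a rough sketch under stated assumptions: the helper names
"fold_like" and "strip_like" are hypothetical, the steps are guesses at the
general idea, and the result is not guaranteed to match what "unifold" and
"unistrip" actually return.

```ruby
# Rough approximation only. Helper names and steps are assumptions made for
# illustration; this is not the implementation of the PostgreSQL extension.

SOFT_HYPHEN = "\u00AD"

# Folds a string so that invisible hyphens and case differences disappear.
def fold_like(text)
  without_hyphen = text.delete(SOFT_HYPHEN)
  normalized     = without_hyphen.unicode_normalize(:nfkc)
  normalized.downcase(:fold)
end

# Like fold_like, but additionally removes nonspacing marks such as accents.
def strip_like(text)
  decomposed = fold_like(text).unicode_normalize(:nfd)
  unmarked   = decomposed.gsub(/\p{Mn}/, "")
  unmarked.unicode_normalize(:nfc)
end

puts fold_like("bath\u00ADtub") == fold_like("bathtub")
puts fold_like("Hello World") == fold_like("hello world")
puts strip_like("Caf\u00E9")
```

With these helpers, the folded form keeps the accent of "Café" while the
stripped form drops it, mirroring the distinction between the two functions.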

NOTICE: The output of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc