utf8proc
Version 1.1.3
- Added a function utf8proc_version returning a string containing the version number of the library.
- Included a target libutf8proc.dylib for MacOSX.
- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)
| author | jbe | 
|---|---|
| date | Fri May 01 12:00:00 2009 +0200 (2009-05-01) | 

Please read the LICENSE file, which is shipped with this software.

*** QUICK START ***

To compile the C library, call "make c-library"; to compile the Ruby
library, call "make ruby-library"; and to compile the PostgreSQL
extension, call "make pgsql-library".

"make all" can be used to build everything, but both Ruby and PostgreSQL
installations are required in this case.

For Ruby, a gem file "utf8proc-1.1.3.gem" is provided as an alternative.

*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation, as
the files "libutf8proc.a" and "libutf8proc.so". The Ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not yet available.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT
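
To see why a normalization form has to be chosen before strings are
compared, the following minimal C sketch may help. It does not call
utf8proc at all; it only spells out the two byte sequences for "é" that
composition and decomposition lead to. The names "composed", "decomposed"
and "same_bytes" are made up for this illustration.

```c
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (composed form). */
static const char composed[] = "\xC3\xA9";

/* "é" as "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form). */
static const char decomposed[] = "e\xCC\x81";

/* Returns 1 if both NUL-terminated strings consist of the same bytes. */
static int same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Both strings display as the same character, yet same_bytes(composed,
decomposed) returns 0, because one is two bytes long and the other three.
Mapping both inputs to the same normalization form first removes this
difference.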

*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.

*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Note that the corresponding symbols in Ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
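
The three bytes shown for 0x2028 follow directly from the UTF-8 encoding
rules. As a rough illustration in C, the conversion from a code point to
bytes could look like the sketch below. "encode_utf8" is a hypothetical
helper written for this README; it is not a function of utf8proc, and the
Ruby method may treat invalid code points differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of utf8proc): writes the UTF-8 encoding
 * of the code point cp to dst, which must have room for 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate
 * (U+D800..U+DFFF) or lies above U+10FFFF.
 */
static size_t encode_utf8(uint32_t cp, unsigned char *dst)
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx + 3 continuation bytes */
        dst[0] = (unsigned char)(0xF0 | (cp >> 18));
        dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For 0x2028 this writes the bytes 0xE2 0x80 0xA8, which are the octal
escapes \342 \200 \250 shown above; for 0x000A it writes the single
byte 0x0A.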

*** POSTGRESQL API ***

For PostgreSQL, two SQL functions are supplied, named "unifold" and
"unistrip". These functions can be used to prepare index fields so that
they are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks like accents or diaereses,
while "unifold" keeps them.
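
The idea of folding strings before comparing them can be illustrated with
a deliberately tiny C sketch. "toy_fold" is a made-up name and is NOT how
"unifold" is implemented: it only drops soft hyphens (U+00AD) and
lowercases ASCII letters, which is just enough to cover the two
comparisons mentioned above. The real functions handle the full Unicode
range.

```c
#include <stddef.h>

/*
 * Toy stand-in for the folding idea (not the real "unifold" algorithm):
 * copies the NUL-terminated UTF-8 string src to dst, dropping every
 * SOFT HYPHEN (U+00AD, bytes 0xC2 0xAD) and lowercasing ASCII letters.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the folded string.
 */
static size_t toy_fold(const char *src, char *dst)
{
    size_t n = 0;
    for (size_t i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        if (c == 0xC2 && (unsigned char)src[i + 1] == 0xAD) {
            i++;                 /* skip the second byte of the soft hyphen */
            continue;
        }
        if (c >= 'A' && c <= 'Z')
            c = (unsigned char)(c - 'A' + 'a');
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return n;
}
```

With this sketch, folding "bath\xC2\xADtub" and "bathtub" yields
byte-identical results, as does folding "Hello World" and "hello world".
An index built on the folded value therefore finds all spellings with a
single equality comparison.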

NOTICE: The output of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.

*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing
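
As an illustration of the quick check item, the simplest such test is
probably an ASCII scan: a string made up only of ASCII characters is
already in all four normalization forms, so normalization could be
skipped for it. The C sketch below shows this one special case only;
"is_all_ascii" is a hypothetical name, and a check eventually added to
the library may well work differently.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of utf8proc): returns 1 if the first
 * len bytes of s are all ASCII (below 0x80), otherwise 0.
 * Pure ASCII text is unchanged by NFC, NFD, NFKC and NFKD.
 */
static int is_all_ascii(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80)
            return 0;    /* non-ASCII byte: full normalization is needed */
    }
    return 1;
}
```

A caller would run the full normalization only when is_all_ascii returns
0. Note that this does not apply to other options such as case folding,
which do change ASCII text.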

*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact me:

Project page: http://www.flexiguided.de/publications.utf8proc.en.html
Contact form: http://www.flexiguided.de/contactform.en.html

Jan Behrens
