utf8proc

annotate README @ 10:00d2bcbdc945

Version 1.1.4

- replaced C++ style comments for compatibility reasons
- added typecasts to suppress compiler warnings
- removed redundant source files for ruby-gemfile generation
- Changed copyright notice for Public Software Group e. V.
- Minor changes in the README file
author jbe
date Wed Aug 19 12:00:00 2009 +0200 (2009-08-19)
parents 951e73a98021
children
rev   line source
jbe@0 1
jbe@0 2 Please read the LICENSE file, which is shipped with this software.
jbe@0 3
jbe@0 4
jbe@0 5 *** QUICK START ***
jbe@0 6
jbe@0 7 For compilation of the C library call "make c-library", for compilation of
jbe@0 8 the ruby library call "make ruby-library" and for compilation of the
jbe@0 9 PostgreSQL extension call "make pgsql-library".
jbe@0 10
jbe@10 11 For ruby you can also create a gem-file by calling "make ruby-gem".
jbe@10 12
jbe@0 13 "make all" can be used to build everything, but both ruby and PostgreSQL
jbe@0 14 installations are required in this case.
jbe@0 15
jbe@0 16
jbe@0 17 *** GENERAL INFORMATION ***
jbe@0 18
jbe@7 19 The C library is found in this directory after successful compilation and
jbe@7 20 is built as "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
jbe@7 21 the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
jbe@10 22 subdirectory "ruby/". If you choose to create a gem-file, it is placed in the
jbe@10 23 "ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
jbe@0 24 and resides in the "pgsql/" directory.
jbe@0 25
jbe@0 26 Both the ruby library and the PostgreSQL extension are built as stand-alone
jbe@0 27 libraries and are therefore not dependent on the dynamic version of the
jbe@0 28 C library files, but this behaviour might change in future releases.
jbe@0 29
jbe@2 30 The Unicode version being supported is 5.0.0.
jbe@7 31 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
jbe@10 32 version 5.0.0 was not yet available at the time of implementation.
jbe@0 33
jbe@0 34 For Unicode normalizations, the following options have to be used:
jbe@0 35 Normalization Form C: STABLE, COMPOSE
jbe@2 36 Normalization Form D: STABLE, DECOMPOSE
jbe@0 37 Normalization Form KC: STABLE, COMPOSE, COMPAT
jbe@2 38 Normalization Form KD: STABLE, DECOMPOSE, COMPAT
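
To illustrate what the four normalization forms do, here is a minimal sketch. It deliberately uses Ruby's built-in String#unicode_normalize instead of utf8proc, so it only shows the expected effect of each form, not the utf8proc options themselves. The sample string and the helper "hex_codepoints" are assumptions made for this example.

```ruby
# Sketch only: Ruby's standard String#unicode_normalize stands in for
# utf8proc here, to show how the four normalization forms differ.

# Hypothetical helper: render a string as a list of code points.
def hex_codepoints(str)
  str.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
end

# "A with ring above" (precomposed) followed by the "fi" ligature.
sample = "\u00C5\uFB01"

nfc  = sample.unicode_normalize(:nfc)   # canonical composition
nfd  = sample.unicode_normalize(:nfd)   # canonical decomposition
nfkc = sample.unicode_normalize(:nfkc)  # compatibility + composition
nfkd = sample.unicode_normalize(:nfkd)  # compatibility + decomposition

puts "NFC:  #{hex_codepoints(nfc)}"   # U+00C5 U+FB01
puts "NFD:  #{hex_codepoints(nfd)}"   # U+0041 U+030A U+FB01
puts "NFKC: #{hex_codepoints(nfkc)}"  # U+00C5 U+0066 U+0069
puts "NFKD: #{hex_codepoints(nfkd)}"  # U+0041 U+030A U+0066 U+0069
```

The compatibility forms (KC, KD) additionally replace the ligature by the plain letters "f" and "i", which corresponds to adding COMPAT in the option lists above.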
jbe@0 39
jbe@0 40
jbe@0 41 *** C LIBRARY ***
jbe@0 42
jbe@0 43 The documentation for the C library is found in the utf8proc.h header file.
jbe@0 44 "utf8proc_map" is most likely the function you will be using for mapping UTF-8
jbe@0 45 strings, unless you want to allocate memory yourself.
jbe@0 46
jbe@0 47
jbe@0 48 *** RUBY API ***
jbe@0 49
jbe@0 50 The ruby library adds the methods "utf8map" and "utf8map!" to the String
jbe@0 51 class, and the method "utf8" to the Integer class.
jbe@0 52
jbe@0 53 The String#utf8map method does the same as the "utf8proc_map" C function.
jbe@0 54 Options for the mapping procedure are passed as symbols, e.g.:
jbe@2 55 "Hello".utf8map(:casefold) => "hello"
jbe@0 56
jbe@7 57 The descriptions of all options are found in the C header file
jbe@7 58 "utf8proc.h". Please note that the corresponding symbols in ruby are all
jbe@7 59 lowercase.
jbe@0 60
jbe@0 61 String#utf8map! is the destructive variant: the string itself is replaced
jbe@0 62 by the result.
jbe@0 63
jbe@0 64 There are shortcuts for the 4 normalization forms specified by Unicode:
jbe@0 65 String#utf8nfd, String#utf8nfd!,
jbe@0 66 String#utf8nfc, String#utf8nfc!,
jbe@0 67 String#utf8nfkd, String#utf8nfkd!,
jbe@0 68 String#utf8nfkc, String#utf8nfkc!
jbe@0 69
jbe@0 70 The method Integer#utf8 returns a UTF-8 string containing the
jbe@7 71 Unicode character for the given code point.
jbe@0 72 0x000A.utf8 => "\n"
jbe@0 73 0x2028.utf8 => "\342\200\250"
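
For comparison, the same encoding step can be sketched without utf8proc by using Ruby's standard Array#pack with the "U" directive. The helper name "codepoint_to_utf8" is hypothetical and not part of the library; it is only meant to show which bytes Integer#utf8 is expected to produce.

```ruby
# Sketch only: plain Ruby, no utf8proc required.
# Hypothetical helper that encodes a single code point as a UTF-8 string.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

line_feed      = codepoint_to_utf8(0x000A)
line_separator = codepoint_to_utf8(0x2028)

# U+2028 needs three bytes in UTF-8: E2 80 A8 (octal 342 200 250).
puts line_separator.bytes.map { |b| format("%02X", b) }.join(" ")  # E2 80 A8
puts line_feed == "\n"                                             # true
```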
jbe@0 74
jbe@0 75
jbe@0 76 *** POSTGRESQL API ***
jbe@0 77
jbe@7 78 For PostgreSQL, two SQL functions are supplied, named "unifold" and
jbe@7 79 "unistrip". These functions can be used to prepare index fields by
jbe@7 80 folding them so that string comparisons make more sense, e.g.
jbe@7 81 where "bathtub" == "bath<soft hyphen>tub"
jbe@7 82 or "Hello World" == "hello world".
jbe@0 83
jbe@1 84 CREATE TABLE people (
jbe@1 85 id serial8 primary key,
jbe@1 86 name text,
jbe@1 87 CHECK (unifold(name) NOTNULL)
jbe@1 88 );
jbe@0 89 CREATE INDEX name_idx ON people (unifold(name));
jbe@0 90 SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
jbe@0 91
jbe@7 92 The function "unistrip" removes character marks like accents or diaeresis,
jbe@7 93 while "unifold" keeps them.
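
The difference between folding and stripping can be approximated in plain Ruby. This is a rough sketch under stated assumptions, not the implementation of the SQL functions: the exact option set used by "unifold" and "unistrip" is not given in this README, and the helpers "fold_sketch" and "strip_sketch" are hypothetical names built on Ruby's standard library.

```ruby
# Sketch only: approximates the idea behind "unifold" and "unistrip"
# with Ruby's standard library. The real functions may behave differently.

SOFT_HYPHEN = "\u00AD"

# Assumed folding: compatibility composition, drop soft hyphens, fold case.
def fold_sketch(str)
  str.unicode_normalize(:nfkc).delete(SOFT_HYPHEN).downcase(:fold)
end

# Assumed stripping: like folding, but decompose first and remove
# nonspacing marks (accents, diaeresis) before folding case.
def strip_sketch(str)
  str.unicode_normalize(:nfkd)
     .delete(SOFT_HYPHEN)
     .gsub(/\p{Mn}/, "")
     .downcase(:fold)
end

puts fold_sketch("bath\u00ADtub") == fold_sketch("bathtub")      # true
puts fold_sketch("Hello World") == fold_sketch("hello world")    # true
puts strip_sketch("Caf\u00E9") == strip_sketch("cafe")           # true
puts fold_sketch("Caf\u00E9") == fold_sketch("cafe")             # false
```

The last two lines show the distinction described above: only the stripping variant treats the accented and unaccented spellings as equal.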
jbe@7 94
jbe@7 95 NOTICE: The outputs of the functions can change between releases, as
jbe@7 96 utf8proc does not follow a versioning stability policy. You have to
jbe@7 97 rebuild your database indices if you upgrade to a newer version
jbe@7 98 of utf8proc.
jbe@2 99
jbe@2 100
jbe@0 101 *** TODO ***
jbe@0 102
jbe@0 103 - detect stable code points and process segments independently in order to
jbe@0 104 save memory
jbe@0 105 - do a quick check before normalizing strings to optimize speed
jbe@0 106 - support stream processing
jbe@0 107
jbe@0 108
jbe@7 109 *** CONTACT ***
jbe@0 110
jbe@7 111 If you find any bugs or experience difficulties in compiling this software,
jbe@10 112 please contact us:
jbe@0 113
jbe@10 114 Project page: http://www.public-software-group.org/utf8proc
jbe@7 115
jbe@9 116
