utf8proc

view README @ 6:d04d3a9b486e

Version 1.0.3

- Fixed a bug in the Ruby library that caused an error when splitting an empty string at grapheme cluster boundaries (method String#utf8chars).
author jbe
date Fri Mar 16 12:00:00 2007 +0100 (2007-03-16)
parents a49e32490aac
children fcfd8c836c64
Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the Ruby
library, call "make ruby-library"; and to compile the PostgreSQL extension,
call "make pgsql-library".

"make all" can be used to build everything, but both Ruby and PostgreSQL
installations are required in this case.

For Ruby, a gem file "utf8proc-1.0.1.gem" is provided as an alternative.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" (static) and "libutf8proc.so" (shared). The Ruby library
consists of the files "utf8proc.rb" and "utf8proc_native.so", which are
found in the subdirectory "ruby/". The PostgreSQL extension is named
"utf8proc_pgsql.so" and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0
was not yet available.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT
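
To see what the four forms do to an actual string, here is a minimal Ruby
sketch. It deliberately uses Ruby's built-in String#unicode_normalize
(available since Ruby 2.2) as a stand-in rather than utf8proc itself, so the
results follow the Unicode tables shipped with your Ruby, not Unicode 5.0.0.
The helper name normalization_forms is hypothetical.

```ruby
# Stand-in for illustration: uses Ruby's built-in String#unicode_normalize,
# not the utf8proc API. normalization_forms is a hypothetical helper name.
def normalization_forms(str)
  {
    nfc:  str.unicode_normalize(:nfc),   # canonical decomposition, then composition
    nfd:  str.unicode_normalize(:nfd),   # canonical decomposition only
    nfkc: str.unicode_normalize(:nfkc),  # compatibility decomposition, then composition
    nfkd: str.unicode_normalize(:nfkd)   # compatibility decomposition only
  }
end

# "e" + U+0301 COMBINING ACUTE ACCENT, followed by U+FB01 LATIN SMALL LIGATURE FI
forms = normalization_forms("e\u0301\uFB01")

# Print each result as a list of code points so the differences are visible.
forms.each do |name, value|
  code_points = value.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{name}: #{code_points}"
end
```

Only the compatibility forms (KC and KD) replace the ligature with the plain
letters "f" and "i"; the canonical forms (C and D) leave it untouched and
differ only in whether "e" and the accent are combined into U+00E9.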


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Please note that the corresponding symbols in Ruby are all lowercase.

String#utf8map! is the destructive variant: the string itself is replaced by
the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
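
The result for 0x2028 is written with octal escapes, one per byte. The
following sketch uses only Ruby's core Array#pack, not utf8proc, to show
where those three bytes come from. The helper names codepoint_to_utf8 and
octal_escapes are hypothetical and exist only for this illustration.

```ruby
# Illustration only: Array#pack("U") from the Ruby core library encodes a
# code point as UTF-8. This is not the implementation of Integer#utf8.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

# Render every byte of a string as a three-digit octal escape such as \342.
def octal_escapes(str)
  str.bytes.map { |byte| format("\\%03o", byte) }.join
end

line_separator = codepoint_to_utf8(0x2028)  # U+2028 LINE SEPARATOR

puts line_separator.bytes.map { |byte| format("%02X", byte) }.join(" ")  # E2 80 A8
puts octal_escapes(line_separator)                                       # \342\200\250
```

U+2028 lies in the range U+0800 to U+FFFF, so its UTF-8 encoding takes three
bytes, whereas U+000A is below U+0080 and is encoded as the single byte 0x0A.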


*** POSTGRESQL API ***

For PostgreSQL, a SQL function named "unifold" is supplied. This function
can be used to prepare index fields so that they are normalized and
case-folded, e.g.:

  CREATE TABLE people (
    id    serial8 primary key,
    name  text,
    CHECK (unifold(name) NOTNULL)
  );
  CREATE INDEX name_idx ON people (unifold(name));
  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

NOTICE: The outputs of the function can change between releases, as utf8proc
        does not follow a versioning stability policy. You have to rebuild
        your database indices if you upgrade to a newer version of utf8proc.
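
The lookup pattern used in the SQL (store the text as entered, index and
compare a folded key) can also be sketched outside the database. The Ruby
below is only an approximation: it assumes the key is NFKC normalization
followed by Unicode case folding, built from Ruby's own
String#unicode_normalize and String#downcase(:fold) (Ruby 2.4 or later).
The exact set of options that unifold applies is not stated in this README,
and fold_key and lookup are hypothetical names.

```ruby
# Approximation for illustration only: NOT the unifold implementation.
# Assumes "normalized and case-folded" means NFKC plus Unicode case folding.
def fold_key(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

# Plays the role of: WHERE unifold(name) = unifold(query)
def lookup(index, query)
  index.fetch(fold_key(query), [])
end

people = ["John Doe", "Anna Wei\u00DF"]  # original spellings are kept as stored

# Plays the role of: CREATE INDEX name_idx ON people (unifold(name))
name_idx = people.group_by { |name| fold_key(name) }

p lookup(name_idx, "JOHN DOE")                      # different case
p lookup(name_idx, "\uFF2A\uFF4F\uFF48\uFF4E Doe")  # fullwidth "John"
p lookup(name_idx, "ANNA WEISS")                    # sharp s folds to "ss"
```

The keys in this sketch depend on the Unicode tables of the Ruby version in
use, which mirrors the caveat in the NOTICE: stored keys are only comparable
with keys produced by the same version of the folding function.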


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.
