Please read the LICENSE file, which is shipped with this software.


*** QUICK START ***

To compile the C library, run "make c-library"; to compile the ruby
library, run "make ruby-library"; and to compile the PostgreSQL extension,
run "make pgsql-library".

"make all" builds everything, but requires both ruby and PostgreSQL
installations.


*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library, but this behaviour might change in future releases.

The supported Unicode version is 4.1.0.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, COMPAT
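
The table above can be sketched in C as a lookup from form name to option
mask. This is only an illustration of how the options combine: the OPT_*
names and their bit values are placeholders invented for this sketch, and
"options_for_form" is a hypothetical helper, not part of the library. The
real option constants are defined in utf8proc.h.

```c
#include <string.h>

/* Placeholder flag values for this sketch only; the real option constants
 * and their values are defined in utf8proc.h. */
enum {
    OPT_STABLE  = 1 << 0,
    OPT_COMPAT  = 1 << 1,
    OPT_COMPOSE = 1 << 2
};

/* Hypothetical helper: return the option mask for "NFC", "NFD", "NFKC" or
 * "NFKD" as listed in the table, or -1 for an unknown form name. */
int options_for_form(const char *form)
{
    if (strcmp(form, "NFD") == 0)
        return OPT_STABLE;
    if (strcmp(form, "NFC") == 0)
        return OPT_STABLE | OPT_COMPOSE;
    if (strcmp(form, "NFKD") == 0)
        return OPT_STABLE | OPT_COMPAT;
    if (strcmp(form, "NFKC") == 0)
        return OPT_STABLE | OPT_COMPOSE | OPT_COMPAT;
    return -1;
}
```

Options are combined with bitwise OR, so the KC form simply adds the COMPAT
bit to the mask used for form C.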


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:stable, :casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Note that the corresponding symbols in ruby are all lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd, String#utf8nfd!,
String#utf8nfc, String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point:
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
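
To show where the bytes in the second example come from, here is a minimal
C sketch of UTF-8 encoding for a single code point. It is not the library's
implementation; the function name "encode_utf8" and the choice to reject
surrogates are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into dst and returns the number of bytes written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char dst[4])
{
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        dst[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xC0 | (cp >> 6));
        dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not encodable */
        dst[0] = (unsigned char)(0xE0 | (cp >> 12));
        dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        dst[0] = (unsigned char)(0xF0 | (cp >> 18));
        dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

For 0x2028 this writes the three bytes 0xE2 0x80 0xA8, which are the octal
escapes \342 \200 \250 shown above; for 0x000A it writes the single byte
"\n".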


*** POSTGRESQL API ***

For PostgreSQL, an SQL function named "unifold" is supplied. This function
can be used to normalize and case-fold index fields, e.g.:

CREATE TABLE people (id serial8 primary key, name text);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.