utf8proc: README annotate

annotate README @ 3:4ee0d5f54af1

Version 1.0

- added the LUMP option, which lumps certain characters together (see lump.txt) (also used for the PostgreSQL "unifold" function)
- added the STRIPMARK option, which strips marking characters (or marks of composed characters)
- deprecated ruby method String#char_ary in favour of String#utf8chars

author	jbe
date	Sun Sep 17 12:00:00 2006 +0200 (2006-09-17)
parents	aaad485d5335
children	a49e32490aac

rev	line source
jbe@0	1
jbe@0	2 Please read the LICENSE file, which is shipped with this software.
jbe@0	3
jbe@0	4
jbe@0	5 * QUICK START *
jbe@0	6
jbe@0	7 To compile the C library, run "make c-library"; to compile the ruby
jbe@0	8 library, run "make ruby-library"; and to compile the PostgreSQL
jbe@0	9 extension, run "make pgsql-library".
jbe@0	10
jbe@0	11 "make all" can be used to build everything, but both ruby and PostgreSQL
jbe@0	12 installations are required in this case.
jbe@0	13
jbe@0	14
jbe@0	15 * GENERAL INFORMATION *
jbe@0	16
jbe@0	17 The C library is found in this directory after successful compilation and is
jbe@0	18 named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
jbe@0	19 files "utf8proc.rb" and "utf8proc_native.so", which are found in the
jbe@0	20 subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
jbe@0	21 and resides in the "pgsql/" directory.
jbe@0	22
jbe@0	23 Both the ruby library and the PostgreSQL extension are built as stand-alone
jbe@0	24 libraries and are therefore not dependent on the dynamic version of the
jbe@0	25 C library files, but this behaviour might change in future releases.
jbe@0	26
jbe@2	27 The Unicode version being supported is 5.0.0.
jbe@2	28 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0
jbe@2	29 was not yet available.
jbe@0	30
jbe@0	31 For Unicode normalizations, the following options have to be used:
jbe@0	32 Normalization Form C: STABLE, COMPOSE
jbe@2	33 Normalization Form D: STABLE, DECOMPOSE
jbe@0	34 Normalization Form KC: STABLE, COMPOSE, COMPAT
jbe@2	35 Normalization Form KD: STABLE, DECOMPOSE, COMPAT
jbe@0	36
jbe@0	37
jbe@0	38 * C LIBRARY *
jbe@0	39
jbe@0	40 The documentation for the C library is found in the utf8proc.h header file.
jbe@0	41 "utf8proc_map" is most likely the function you will be using for mapping UTF-8
jbe@0	42 strings, unless you want to allocate memory yourself.
jbe@0	43
jbe@0	44
jbe@0	45 * RUBY API *
jbe@0	46
jbe@0	47 The ruby library adds the methods "utf8map" and "utf8map!" to the String
jbe@0	48 class, and the method "utf8" to the Integer class.
jbe@0	49
jbe@0	50 The String#utf8map method does the same as the "utf8proc_map" C function.
jbe@0	51 Options for the mapping procedure are passed as symbols, e.g.:
jbe@2	52 "Hello".utf8map(:casefold) => "hello"
jbe@0	53
jbe@0	54 The descriptions of all options are found in the C header file "utf8proc.h".
jbe@0	55 Please note that the corresponding symbols in ruby are all lowercase.
jbe@0	56
jbe@0	57 String#utf8map! is the destructive variant: the string itself is
jbe@0	58 replaced by the result.
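
The difference between String#utf8map and String#utf8map! follows the usual
ruby convention for "bang" methods. The sketch below illustrates that
convention only. It does not load utf8proc: "sketch_map" and "sketch_map!"
are hypothetical names, and ruby's built-in String#unicode_normalize is
swapped in for the actual mapping step.

```ruby
# Stand-in for the mapping step (NOT utf8proc): ruby's built-in NFC
# normalization is used so that the sketch runs without the library.
def sketch_map(str)
  str.unicode_normalize(:nfc)
end

# Destructive variant: overwrites the receiver's contents in place via
# String#replace and returns the very same String object.
def sketch_map!(str)
  str.replace(sketch_map(str))
end
```

With s = "e\u0301".dup ("e" followed by a combining acute accent),
sketch_map(s) returns a new one-character string and leaves s unchanged,
whereas sketch_map!(s) changes s itself.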
jbe@0	59
jbe@0	60 There are shortcuts for the 4 normalization forms specified by Unicode:
jbe@0	61 String#utf8nfd, String#utf8nfd!,
jbe@0	62 String#utf8nfc, String#utf8nfc!,
jbe@0	63 String#utf8nfkd, String#utf8nfkd!,
jbe@0	64 String#utf8nfkc, String#utf8nfkc!
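
As a rough illustration of how these shortcuts relate to the option sets
listed under GENERAL INFORMATION, the sketch below spells the four forms out
as lowercase option symbols. It rests on assumptions: "normalize_sketch" and
"NORMALIZATION_OPTIONS" are hypothetical names, and it assumes that
String#utf8map accepts several option symbols at once. If utf8proc is not
loaded, it falls back to ruby's built-in String#unicode_normalize, which is a
different implementation.

```ruby
# Option sets per normalization form, taken from the table under
# GENERAL INFORMATION, written as lowercase symbols for the ruby API.
NORMALIZATION_OPTIONS = {
  nfc:  [:stable, :compose],
  nfd:  [:stable, :decompose],
  nfkc: [:stable, :compose, :compat],
  nfkd: [:stable, :decompose, :compat]
}.freeze

# Hypothetical helper, not part of utf8proc.
def normalize_sketch(str, form)
  options = NORMALIZATION_OPTIONS.fetch(form)  # raises KeyError for unknown forms
  if str.respond_to?(:utf8map)
    # Assumption: utf8map takes multiple option symbols as arguments.
    str.utf8map(*options)
  else
    # Fallback using core ruby only, so the sketch runs without utf8proc.
    str.unicode_normalize(form)
  end
end
```

For example, normalize_sketch("e\u0301", :nfc) composes the two code points
into the single code point U+00E9.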
jbe@0	65
jbe@0	66 The method Integer#utf8 returns a UTF-8 string containing the Unicode
jbe@2	67 character for the given code point:
jbe@0	68 0x000A.utf8 => "\n"
jbe@0	69 0x2028.utf8 => "\342\200\250"
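
The two results above can be reproduced without utf8proc. The sketch below
uses core ruby's Array#pack; "codepoint_to_utf8" and "octal_escape" are
hypothetical helper names. It is not the library's implementation, only a
way to check which bytes are expected.

```ruby
# Encode one Unicode code point as a UTF-8 string using core ruby only.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

# Render every byte of a string as a backslash followed by three octal
# digits, which is the notation used for the 0x2028 example above.
def octal_escape(str)
  str.bytes.map { |byte| "\\%03o" % byte }.join
end
```

codepoint_to_utf8(0x2028) yields the three bytes 0xE2 0x80 0xA8, which
octal_escape renders as \342\200\250.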
jbe@0	70
jbe@0	71
jbe@0	72 * POSTGRESQL API *
jbe@0	73
jbe@0	74 For PostgreSQL, an SQL function named "unifold" is supplied. This
jbe@0	75 function can be used to prepare index fields so that they are normalized
jbe@0	76 and case-folded, e.g.:
jbe@0	77
jbe@1	78 CREATE TABLE people (
jbe@1	79 id serial8 primary key,
jbe@1	80 name text,
jbe@1	81 CHECK (unifold(name) NOTNULL)
jbe@1	82 );
jbe@0	83 CREATE INDEX name_idx ON people (unifold(name));
jbe@0	84 SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
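
The idea behind the index and the query above is to compare folded keys
instead of the raw strings. The ruby sketch below mimics that idea with core
ruby only. It is an analogue under stated assumptions, not the mapping that
"unifold" performs: "fold_key", "build_index" and "lookup" are hypothetical
names, and NFKC normalization followed by String#downcase(:fold) is swapped
in for the real function.

```ruby
# Rough stand-in for a folding function (NOT unifold itself):
# compatibility composition followed by Unicode case folding.
def fold_key(str)
  str.unicode_normalize(:nfkc).downcase(:fold)
end

# Group names by their folded key, similar in spirit to an index on
# unifold(name).
def build_index(names)
  index = Hash.new { |hash, key| hash[key] = [] }
  names.each { |name| index[fold_key(name)] << name }
  index
end

# Look up all stored names whose folded key equals that of the query.
def lookup(index, query)
  index.fetch(fold_key(query), [])
end
```

With an index built from "John Doe", "JOHN DOE" and "Jane Roe", a lookup for
"john doe" returns the first two names.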
jbe@0	85
jbe@2	86 NOTICE: The output of the function can change between releases, as utf8proc
jbe@2	87 does not follow a versioning stability policy. You have to rebuild
jbe@2	88 your database indices if you upgrade to a newer version of utf8proc.
jbe@2	89
jbe@2	90
jbe@2	91 * KNOWN BUGS *
jbe@2	92
jbe@2	93 - on Mac OS X, segfaults have been reported when compiling the ruby library
jbe@2	94 with optimization (if you run into problems, compile without optimization)
jbe@2	95
jbe@0	96
jbe@0	97 * TODO *
jbe@0	98
jbe@0	99 - detect stable code points and process segments independently in order to
jbe@0	100 save memory
jbe@0	101 - do a quick check before normalizing strings to optimize speed
jbe@0	102 - support stream processing
jbe@0	103
jbe@0	104
jbe@0	105 Unicode is a trademark of Unicode, Inc., and may be registered in some
jbe@0	106 jurisdictions.
jbe@0	107
jbe@0	108