utf8proc: README annotate

annotate README @ 9:951e73a98021

Version 1.1.3

- Added a function utf8proc_version returning a string containing the version number of the library.
- Included a target libutf8proc.dylib for MacOSX.
- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)

author	jbe
date	Fri May 01 12:00:00 2009 +0200 (2009-05-01)
parents	fcfd8c836c64
children	00d2bcbdc945

rev	line source
jbe@0	1
jbe@0	2 Please read the LICENSE file, which ships with this software.
jbe@0	3
jbe@0	4
jbe@0	5 * QUICK START *
jbe@0	6
jbe@0	7 For compilation of the C library call "make c-library", for compilation of
jbe@0	8 the ruby library call "make ruby-library" and for compilation of the
jbe@0	9 PostgreSQL extension call "make pgsql-library".
jbe@0	10
jbe@0	11 "make all" can be used to build everything, but both ruby and PostgreSQL
jbe@0	12 installations are required in this case.
jbe@0	13
jbe@9	14 For ruby, a gem file "utf8proc-1.1.3.gem" is provided as an alternative.
jbe@4	15
jbe@0	16
jbe@0	17 * GENERAL INFORMATION *
jbe@0	18
jbe@7	19 The C library is found in this directory after successful compilation and
jbe@7	20 is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
jbe@7	21 the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
jbe@0	22 subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
jbe@0	23 and resides in the "pgsql/" directory.
jbe@0	24
jbe@0	25 Both the ruby library and the PostgreSQL extension are built as stand-alone
jbe@0	26 libraries and are therefore not dependent on the dynamic version of the
jbe@0	27 C library files, but this behaviour might change in future releases.
jbe@0	28
jbe@2	29 The supported Unicode version is 5.0.0.
jbe@7	30 Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
jbe@7	31 version 5.0.0 was not yet available.
jbe@0	32
jbe@0	33 For Unicode normalizations, the following options have to be used:
jbe@0	34 Normalization Form C: STABLE, COMPOSE
jbe@2	35 Normalization Form D: STABLE, DECOMPOSE
jbe@0	36 Normalization Form KC: STABLE, COMPOSE, COMPAT
jbe@2	37 Normalization Form KD: STABLE, DECOMPOSE, COMPAT
jbe@0	38
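The difference between the four forms can be tried out without building
utf8proc. The following is only a sketch: it uses ruby's built-in
String#unicode_normalize (ruby 2.2 or newer) as a stand-in, not the utf8proc
options listed above, and the sample strings are chosen for illustration.

```ruby
# Stand-in: ruby's built-in normalizer, NOT utf8proc.
composed   = "\u00E9"    # "e with acute" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature   = "\uFB01"    # LATIN SMALL LIGATURE FI (compatibility character)

nfc  = decomposed.unicode_normalize(:nfc)   # canonical composition
nfd  = composed.unicode_normalize(:nfd)     # canonical decomposition
nfkc = ligature.unicode_normalize(:nfkc)    # compatibility mapping + composition
nfkd = ligature.unicode_normalize(:nfkd)    # compatibility mapping + decomposition

p nfc.codepoints   # => [233]
p nfd.codepoints   # => [101, 769]
p nfkc             # => "fi"
p nfkd             # => "fi"

# The canonical forms (C and D) leave compatibility characters untouched.
p ligature.unicode_normalize(:nfc) == ligature   # => true
```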
jbe@0	39
jbe@0	40 * C LIBRARY *
jbe@0	41
jbe@0	42 The documentation for the C library is found in the utf8proc.h header file.
jbe@0	43 "utf8proc_map" is most likely the function you will be using for mapping UTF-8
jbe@0	44 strings, unless you want to allocate memory yourself.
jbe@0	45
jbe@0	46
jbe@0	47 * RUBY API *
jbe@0	48
jbe@0	49 The ruby library adds the methods "utf8map" and "utf8map!" to the String
jbe@0	50 class, and the method "utf8" to the Integer class.
jbe@0	51
jbe@0	52 The String#utf8map method does the same as the "utf8proc_map" C function.
jbe@0	53 Options for the mapping procedure are passed as symbols, e.g.:
jbe@2	54 "Hello".utf8map(:casefold) => "hello"
jbe@0	55
jbe@7	56 The descriptions of all options are found in the C header file
jbe@7	57 "utf8proc.h". Please note that the corresponding symbols in ruby are all
jbe@7	58 lowercase.
jbe@0	59
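If the native extension is not at hand, the effect of case folding can be
approximated with ruby's built-in String#downcase(:fold) (ruby 2.4 or newer).
This is a stand-in for illustration only; it is not the utf8proc gem, and its
results are not guaranteed to match those of the :casefold option.

```ruby
# Stand-in: ruby's built-in Unicode case folding, NOT String#utf8map.
folded = "Hello".downcase(:fold)
p folded   # => "hello"

# Folding both sides makes differently-cased spellings compare equal.
same = "HELLO World".downcase(:fold) == "hello world".downcase(:fold)
p same     # => true
```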
jbe@0	60 String#utf8map! is the destructive variant: the string itself is replaced
jbe@0	61 by the result.
jbe@0	62
jbe@0	63 There are shortcuts for the 4 normalization forms specified by Unicode:
jbe@0	64 String#utf8nfd, String#utf8nfd!,
jbe@0	65 String#utf8nfc, String#utf8nfc!,
jbe@0	66 String#utf8nfkd, String#utf8nfkd!,
jbe@0	67 String#utf8nfkc, String#utf8nfkc!
jbe@0	68
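The split between the plain and the "!" variants follows the usual ruby
convention. A minimal sketch of that convention, assuming ruby's built-in
String#unicode_normalize and String#unicode_normalize! as stand-ins for the
utf8proc shortcuts:

```ruby
# Stand-in: built-in methods, NOT String#utf8nfc / String#utf8nfc!.
text = "e\u0301".dup   # "e" + COMBINING ACUTE ACCENT; dup gives a mutable string

copy = text.unicode_normalize(:nfc)   # non-destructive: returns a new string
p text.length   # => 2  (receiver unchanged)
p copy.length   # => 1

text.unicode_normalize!(:nfc)         # destructive: receiver is replaced
p text.length   # => 1
```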
jbe@0	69 The method Integer#utf8 returns a UTF-8 string containing the Unicode
jbe@7	70 character given by the code point.
jbe@0	71 0x000A.utf8 => "\n"
jbe@0	72 0x2028.utf8 => "\342\200\250"
jbe@0	73
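The same encoding step can be reproduced with core ruby, which is useful for
checking the byte sequences shown above. The helper name codepoint_to_utf8 is
hypothetical and is not part of the utf8proc ruby library.

```ruby
# Hypothetical helper built on Array#pack; NOT the library's Integer#utf8.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")   # "U" encodes an integer as a UTF-8 character
end

newline        = codepoint_to_utf8(0x000A)
line_separator = codepoint_to_utf8(0x2028)

p newline                # => "\n"
p line_separator.bytes   # => [226, 128, 168]

# The same bytes written as octal escapes, as in the example above.
octal = line_separator.bytes.map { |byte| format("\\%o", byte) }.join
puts octal               # prints \342\200\250
```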
jbe@0	74
jbe@0	75 * POSTGRESQL API *
jbe@0	76
jbe@7	77 For PostgreSQL there are two SQL functions supplied named "unifold" and
jbe@7	78 "unistrip". These functions can be used to prepare index fields so that
jbe@7	79 they are folded in a way where string comparisons make more sense, e.g.
jbe@7	80 where "bathtub" == "bath<soft hyphen>tub"
jbe@7	81 or "Hello World" == "hello world".
jbe@0	82
jbe@1	83 CREATE TABLE people (
jbe@1	84 id serial8 primary key,
jbe@1	85 name text,
jbe@1	86 CHECK (unifold(name) NOTNULL)
jbe@1	87 );
jbe@0	88 CREATE INDEX name_idx ON people (unifold(name));
jbe@0	89 SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
jbe@0	90
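To get a feeling for the kind of folding meant here, the two comparisons
mentioned above can be imitated in plain ruby. This is a rough approximation
under stated assumptions: the helper approx_fold is hypothetical, it only
normalizes, drops the soft hyphen and folds case, and it is not the code used
by the PostgreSQL extension, whose exact option set is documented in the
sources.

```ruby
# Hypothetical approximation of the folding idea; NOT the "unifold" function.
def approx_fold(text)
  soft_hyphen = "\u00AD"
  text.unicode_normalize(:nfkc)   # compatibility normalization
      .delete(soft_hyphen)        # drop the ignorable soft hyphen
      .downcase(:fold)            # Unicode case folding
end

p approx_fold("bath\u00ADtub") == approx_fold("bathtub")      # => true
p approx_fold("Hello World")   == approx_fold("hello world")  # => true
```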
jbe@7	91 The function "unistrip" removes character marks like accents or diaereses,
jbe@7	92 while "unifold" keeps them.
jbe@7	93
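Removing character marks can be sketched in plain ruby by decomposing the
string, deleting all combining marks and composing again. The helper
approx_strip_marks is hypothetical and only illustrates the idea; it is not
the implementation of "unistrip" and does no case folding.

```ruby
# Hypothetical approximation of mark stripping; NOT the "unistrip" function.
def approx_strip_marks(text)
  text.unicode_normalize(:nfkd)   # split base characters from their marks
      .gsub(/\p{M}/, "")          # remove all characters of category Mark
      .unicode_normalize(:nfc)    # recompose whatever is left
end

stripped = approx_strip_marks("na\u00EFve caf\u00E9")
p stripped   # => "naive cafe"
```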
jbe@7	94 NOTICE: The outputs of these functions can change between releases, as
jbe@7	95 utf8proc does not follow a versioning stability policy. You have to
jbe@7	96 rebuild your database indices if you upgrade to a newer version
jbe@7	97 of utf8proc.
jbe@2	98
jbe@2	99
jbe@0	100 * TODO *
jbe@0	101
jbe@0	102 - detect stable code points and process segments independently in order to
jbe@0	103 save memory
jbe@0	104 - do a quick check before normalizing strings to optimize speed
jbe@0	105 - support stream processing
jbe@0	106
jbe@0	107
jbe@7	108 * CONTACT *
jbe@0	109
jbe@7	110 If you find any bugs or experience difficulties in compiling this software,
jbe@7	111 please contact me:
jbe@0	112
jbe@9	113 Project page: http://www.flexiguided.de/publications.utf8proc.en.html
jbe@9	114 Contact form: http://www.flexiguided.de/contactform.en.html
jbe@7	115
jbe@9	116 Jan Behrens
jbe@9	117