utf8proc: fcfd8c836c64 README


Version 1.1.1

- Added a new PostgreSQL function 'unistrip', which behaves like 'unifold', but also removes all character marks (e.g. accents).
- Changed license from BSD to MIT style.
- Added a new function 'utf8proc_codepoint_valid' to the C library.
- Changed compiler flags in Makefile from -g -O0 to -O2
- The ruby script, which was used to build the utf8proc_data.c file, is now included in the distribution.

author	jbe
date	Sun Jul 22 12:00:00 2007 +0200 (2007-07-22)

Please read the LICENSE file, which ships with this software.

*** QUICK START ***

To compile the C library, run "make c-library"; to compile the ruby library,
run "make ruby-library"; and to compile the PostgreSQL extension, run
"make pgsql-library".

"make all" builds everything, but requires both ruby and PostgreSQL
installations.

For ruby, a gem file "utf8proc-1.1.1.gem" is provided as an alternative.

*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory as
"libutf8proc.a" and "libutf8proc.so". The ruby library consists of the files
"utf8proc.rb" and "utf8proc_native.so", which are found in the subdirectory
"ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so" and resides in
the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
version 5.0.0 was not yet available.

For Unicode normalizations, the following options have to be used:
  Normalization Form C:  STABLE, COMPOSE
  Normalization Form D:  STABLE, DECOMPOSE
  Normalization Form KC: STABLE, COMPOSE, COMPAT
  Normalization Form KD: STABLE, DECOMPOSE, COMPAT
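
The option combinations listed above can be captured in a small lookup. The
following is only a sketch: the OPT_* flags and the function
"options_for_form" are hypothetical stand-ins defined locally for
illustration. The real option constants and their values are documented in
"utf8proc.h" and should be used in actual code.

```c
#include <assert.h>
#include <string.h>

/* Stand-in flags for illustration only; the real option constants and
 * their values are defined in utf8proc.h. */
enum {
    OPT_STABLE    = 1 << 0,
    OPT_COMPOSE   = 1 << 1,
    OPT_DECOMPOSE = 1 << 2,
    OPT_COMPAT    = 1 << 3
};

/* Return the option set for "NFC", "NFD", "NFKC" or "NFKD",
 * or -1 if the name is not one of the four normalization forms. */
int options_for_form(const char *form)
{
    assert(form != NULL);
    if (strcmp(form, "NFC") == 0)
        return OPT_STABLE | OPT_COMPOSE;
    if (strcmp(form, "NFD") == 0)
        return OPT_STABLE | OPT_DECOMPOSE;
    if (strcmp(form, "NFKC") == 0)
        return OPT_STABLE | OPT_COMPOSE | OPT_COMPAT;
    if (strcmp(form, "NFKD") == 0)
        return OPT_STABLE | OPT_DECOMPOSE | OPT_COMPAT;
    return -1;
}
```

The compatibility forms (KC, KD) differ from their canonical counterparts
only by the additional COMPAT option.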

*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.

*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
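
To show where the bytes in the second example come from, here is a minimal C
sketch of the standard UTF-8 encoding of a single code point. The function
name "encode_utf8" is hypothetical and this is not the code used by the
library. Rejecting surrogates and values above 0x10FFFF is a choice made for
this sketch; it is not a statement about how Integer#utf8 treats such input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies above 0x10FFFF. */
size_t encode_utf8(uint32_t cp, uint8_t *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x2028 this yields the three bytes 0xE2 0x80 0xA8, which are the octal
escapes \342 \200 \250 shown in the ruby example.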

*** POSTGRESQL API ***

For PostgreSQL, two SQL functions named "unifold" and "unistrip" are
supplied. These functions can be used to prepare index fields so that they
are folded in a way where string comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
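
The two comparisons mentioned before the SQL example can be illustrated
outside the database with a toy C function. This is only a sketch of the
idea: "toy_fold" is a hypothetical name, it handles nothing but the soft
hyphen (U+00AD, UTF-8 bytes 0xC2 0xAD) and ASCII letters, and it is not the
algorithm used by "unifold", which works on full Unicode data.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy fold: copy src to dst, dropping U+00AD SOFT HYPHEN (UTF-8 bytes
 * C2 AD) and lowercasing ASCII letters. All other bytes are copied
 * unchanged. dst must have room for strlen(src) + 1 bytes. */
void toy_fold(const char *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    const unsigned char *s = (const unsigned char *)src;
    size_t j = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (s[i] == 0xC2 && s[i + 1] == 0xAD) {
            i++;                 /* skip both bytes of the soft hyphen */
            continue;
        }
        dst[j++] = (s[i] < 0x80) ? (char)tolower(s[i]) : (char)s[i];
    }
    dst[j] = '\0';
}
```

With this function, "bath<soft hyphen>tub" and "bathtub" fold to the same
byte string, as do "Hello World" and "hello world", so a plain byte
comparison of the folded values treats them as equal.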

The function "unistrip" removes character marks like accents or diaeresis,
while "unifold" keeps them.

NOTICE: The outputs of these functions can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.

*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing

*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact me:

Jan Behrens <jan.behrens.n4272.expires-2008-06@flexiguided.de>
http://www.flexiguided.de/publications.utf8proc.en.html