utf8proc: aaad485d5335 README

Version 0.3

- changed normalization from NFC to NFKC for the PostgreSQL unifold function
- added support for marking the beginning of a grapheme cluster with 0xFF (option: CHARBOUND)
- added the ruby method String#chars, which returns an array of UTF-8 encoded grapheme clusters
- added the NLF2LF transformation to the PostgreSQL unifold function
- added the DECOMPOSE option; if you use neither COMPOSE nor DECOMPOSE, no normalization is performed (unlike in previous versions)
- now using integer constants rather than C strings for character properties
- fixed (hopefully) a problem with the ruby library on Mac OS X, which occurred when compiler optimization was switched on

author	jbe
date	Fri Aug 04 12:00:00 2006 +0200 (2006-08-04)
parents	61a89ecc2fb9
children	a49e32490aac

Please read the LICENSE file, which ships with this software.

*** QUICK START ***

To compile the C library, call "make c-library"; to compile the ruby library,
call "make ruby-library"; to compile the PostgreSQL extension, call
"make pgsql-library".

"make all" builds everything, but requires both ruby and PostgreSQL to be
installed.

*** GENERAL INFORMATION ***

After successful compilation, the C library is found in this directory under
the names "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the C library,
but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0
was not yet available.

For Unicode normalization, the following options have to be used:
  Normalization Form C:   STABLE, COMPOSE
  Normalization Form D:   STABLE, DECOMPOSE
  Normalization Form KC:  STABLE, COMPOSE, COMPAT
  Normalization Form KD:  STABLE, DECOMPOSE, COMPAT
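
The four forms differ in two respects: whether characters end up composed or
decomposed, and whether compatibility mappings are applied. The following
sketch illustrates this difference. It does not use utf8proc at all; it relies
on Ruby's built-in String#unicode_normalize (Ruby 2.2 or later) as a stand-in,
and the sample string is an arbitrary choice made for this illustration.

```ruby
# Stand-in illustration using Ruby's built-in normalizer, NOT utf8proc.
# "\u00E9" is a precomposed e-acute; "\uFB01" is the "fi" ligature, which
# only has a compatibility decomposition.
sample = "\u00E9\uFB01"

nfc  = sample.unicode_normalize(:nfc)   # composed, no compatibility mapping
nfd  = sample.unicode_normalize(:nfd)   # decomposed, no compatibility mapping
nfkc = sample.unicode_normalize(:nfkc)  # composed, with compatibility mapping
nfkd = sample.unicode_normalize(:nfkd)  # decomposed, with compatibility mapping

# Print the code points of each form for comparison.
{ "NFC" => nfc, "NFD" => nfd, "NFKC" => nfkc, "NFKD" => nfkd }.each do |name, str|
  puts "#{name}: " + str.codepoints.map { |c| format("U+%04X", c) }.join(" ")
end
```

Only the two compatibility forms split the ligature into "f" and "i", and only
the two decomposed forms separate the accent from the base letter.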

*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping UTF-8
strings, unless you want to allocate memory yourself.

*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Please note that the corresponding symbols in ruby are all lowercase.

String#utf8map! is the destructive variant: the string itself is replaced by
the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
  String#utf8nfd,  String#utf8nfd!,
  String#utf8nfc,  String#utf8nfc!,
  String#utf8nfkd, String#utf8nfkd!,
  String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character for the given code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"
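
For readers without the ruby library installed, the same code-point-to-UTF-8
encoding can be reproduced with plain Ruby. The sketch below uses Array#pack
with the "U" directive; the helper name codepoint_to_utf8 is hypothetical and
not part of utf8proc.

```ruby
# Hypothetical helper (not part of utf8proc): encode a single Unicode code
# point as a UTF-8 string using Ruby's built-in Array#pack.
def codepoint_to_utf8(code_point)
  [code_point].pack("U")
end

lf = codepoint_to_utf8(0x000A)  # LINE FEED, one byte in UTF-8
ls = codepoint_to_utf8(0x2028)  # LINE SEPARATOR, three bytes in UTF-8

puts lf.inspect
# Show the bytes in octal, matching the notation used in this README.
puts ls.bytes.map { |b| format("\\%o", b) }.join  # prints \342\200\250
```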

*** POSTGRESQL API ***

For PostgreSQL, an SQL function named "unifold" is supplied. This function can
be used to prepare normalized and case-folded index fields, e.g.:

  CREATE TABLE people (
    id    serial8 primary key,
    name  text,
    CHECK (unifold(name) NOTNULL)
  );
  CREATE INDEX name_idx ON people (unifold(name));
  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
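
The idea behind such an index is that two strings match whenever their folded
keys are equal. The sketch below imitates that comparison in plain Ruby
(2.4 or later). It is an approximation under stated assumptions, not the
implementation of "unifold": the helper name fold_key is hypothetical, it uses
Ruby's built-in NFKC normalization and case folding instead of utf8proc, and
it leaves out other transformations such as newline conversion.

```ruby
# Hypothetical helper, NOT the actual "unifold" implementation: build a
# comparison key by applying NFKC normalization followed by case folding,
# using only Ruby's built-in string methods.
def fold_key(text)
  text.unicode_normalize(:nfkc).downcase(:fold)
end

stored  = "\uFF2Aohn Doe"  # begins with FULLWIDTH LATIN CAPITAL LETTER J
queried = "JOHN DOE"

# The raw strings differ, but their folded keys are equal.
puts stored == queried
puts fold_key(stored) == fold_key(queried)
```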

NOTICE: The output of the function can change between releases, as utf8proc
        does not follow a versioning stability policy. You have to rebuild
        your database indices if you upgrade to a newer version of utf8proc.

*** KNOWN BUGS ***

- On Mac OS X, segfaults were reported when the ruby library was compiled
  with optimization (-> don't use optimization if you have problems).

*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.