utf8proc

diff README @ 7:fcfd8c836c64
Version 1.1.1

- Added a new PostgreSQL function 'unistrip', which behaves like 'unifold', but also removes all character marks (e.g. accents).
- Changed license from BSD to MIT style.
- Added a new function 'utf8proc_codepoint_valid' to the C library.
- Changed compiler flags in Makefile from -g -O0 to -O2
- The ruby script, which was used to build the utf8proc_data.c file, is now included in the distribution.
author: jbe
date: Sun Jul 22 12:00:00 2007 +0200 (2007-07-22)
parents: a49e32490aac
children: 951e73a98021
     1.1 --- a/README	Fri Mar 16 12:00:00 2007 +0100
     1.2 +++ b/README	Sun Jul 22 12:00:00 2007 +0200
     1.3 @@ -11,14 +11,14 @@
     1.4  "make all" can be used to build everything, but both ruby and PostgreSQL
     1.5  installations are required in this case.
     1.6  
     1.7 -For ruby there is alternatively provided a gem-file "utf8proc-1.0.1.gem".
     1.8 +For ruby there is alternatively provided a gem-file "utf8proc-1.1.1.gem".
     1.9  
    1.10  
    1.11  *** GENERAL INFORMATION ***
    1.12  
    1.13 -The C library is found in this directory after successful compilation and is
    1.14 -named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of the
    1.15 -files "utf8proc.rb" and "utf8proc_native.so", which are found in the
    1.16 +The C library is found in this directory after successful compilation and
    1.17 +is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
    1.18 +the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
    1.19  subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
    1.20  and resides in the "pgsql/" directory.
    1.21  
    1.22 @@ -27,8 +27,8 @@
    1.23  C library files, but this behaviour might change in future releases.
    1.24  
    1.25  The Unicode version being supported is 5.0.0.
    1.26 -Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0
    1.27 -      had not been available yet.
    1.28 +Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
    1.29 +      version 5.0.0 had not been available yet.
    1.30  
    1.31  For Unicode normalizations, the following options have to be used:
    1.32  Normalization Form C:  STABLE, COMPOSE
    1.33 @@ -53,8 +53,9 @@
    1.34  Options for the mapping procedure are passed as symbols, i.e:
    1.35  "Hello".utf8map(:casefold) => "hello"
    1.36  
    1.37 -The descriptions of all options are found in the C header file "utf8proc.h".
    1.38 -Please notice that the according symbols in ruby are all lowercase.
    1.39 +The descriptions of all options are found in the C header file
    1.40 +"utf8proc.h". Please notice that the according symbols in ruby are all
    1.41 +lowercase.
    1.42  
    1.43  String#utf8map! is the destructive function in the meaning that the string
    1.44  is replaced by the result.
    1.45 @@ -66,16 +67,18 @@
    1.46  String#utf8nfkc, String#utf8nfkc!
    1.47  
    1.48  The method Integer#utf8 returns a UTF-8 string, which is containing the
    1.49 -unicode char given by the code point. 
    1.50 +unicode char given by the code point.
    1.51  0x000A.utf8 => "\n"
    1.52  0x2028.utf8 => "\342\200\250"
    1.53  
    1.54  
    1.55  *** POSTGRESQL API ***
    1.56  
    1.57 -For PostgreSQL there is a SQL function supplied named "unifold". This
    1.58 -function can be used to prepare index fields in order to be normalized and
    1.59 -case-folded, i.e.:
    1.60 +For PostgreSQL there are two SQL functions supplied named "unifold" and
    1.61 +"unistrip". These functions can be used to prepare index fields in
    1.62 +order to be folded in a way where string-comparisons make more sense, e.g.
    1.63 +where "bathtub" == "bath<soft hyphen>tub"
    1.64 +or "Hello World" == "hello world".
    1.65  
    1.66  CREATE TABLE people (
    1.67    id    serial8 primary key,
    1.68 @@ -85,9 +88,13 @@
    1.69  CREATE INDEX name_idx ON people (unifold(name));
    1.70  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
    1.71  
    1.72 -NOTICE: The outputs of the function can change between releases, as utf8proc
    1.73 -        does not follow a versioning stability policy. You have to rebuild
    1.74 -        your database indicies, if you upgrade to a newer version of utf8proc.
    1.75 +The function "unistrip" removes character marks like accents or diaeresis,
    1.76 +while "unifold" keeps them.
    1.77 +
    1.78 +NOTICE: The outputs of the function can change between releases, as
    1.79 +        utf8proc does not follow a versioning stability policy. You have to
    1.80 +        rebuild your database indices, if you upgrade to a newer version
    1.81 +        of utf8proc.
    1.82  
    1.83  
    1.84  *** TODO ***
    1.85 @@ -98,7 +105,11 @@
    1.86  - support stream processing
    1.87  
    1.88  
    1.89 -Unicode is a trademark of Unicode, Inc., and may be registered in some
    1.90 -jurisdictions.
    1.91 +*** CONTACT ***
    1.92  
    1.93 +If you find any bugs or experience difficulties in compiling this software,
    1.94 +please contact me:
    1.95  
    1.96 +Jan Behrens <jan.behrens.n4272.expires-2008-06@flexiguided.de>
    1.97 +http://www.flexiguided.de/publications.utf8proc.en.html
    1.98 +
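
The difference between "unifold" and "unistrip" described in the new README text can be approximated in plain Ruby. The following is only a sketch using Ruby's standard library, not utf8proc itself: the names fold_sketch and strip_sketch are hypothetical, and the exact set of mappings the real SQL functions apply is an assumption here (compatibility normalization, case folding, removal of the soft hyphen, and for "unistrip" removal of combining marks).

```ruby
# Approximation with Ruby's standard library only; NOT the utf8proc implementation.
SOFT_HYPHEN = "\u00AD"

# Rough stand-in for "unifold": normalize, case-fold, drop soft hyphens.
def fold_sketch(str)
  str.unicode_normalize(:nfkc).downcase(:fold).delete(SOFT_HYPHEN)
end

# Rough stand-in for "unistrip": additionally remove character marks.
# Decomposing first (NFKD) separates base letters from their combining marks,
# so that the marks (Unicode category M) can be deleted.
def strip_sketch(str)
  without_marks = str.unicode_normalize(:nfkd).gsub(/\p{M}/, "")
  fold_sketch(without_marks)
end

puts fold_sketch("Hello World") == fold_sketch("hello world")
puts fold_sketch("bath#{SOFT_HYPHEN}tub") == "bathtub"
puts strip_sketch("Café") == "cafe"
puts fold_sketch("Café") == "cafe"
```

Under these assumptions the first three comparisons hold, while the last one does not, because the folding step alone keeps the accent.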
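
The Integer#utf8 examples kept as context in the diff ("\n" for 0x000A and "\342\200\250" for 0x2028) can be cross-checked without the extension. This sketch uses Ruby's built-in Array#pack with the "U" directive instead of utf8proc; the helper name codepoint_to_utf8 is hypothetical.

```ruby
# Encode a single code point as UTF-8 using core Ruby (not utf8proc).
def codepoint_to_utf8(codepoint)
  [codepoint].pack("U")
end

# Show each byte in octal, matching the escape notation used in the README.
def octal_bytes(str)
  str.bytes.map { |byte| format("%o", byte) }
end

p codepoint_to_utf8(0x000A) == "\n"
p octal_bytes(codepoint_to_utf8(0x2028))
```

The second call yields the octal byte values 342, 200 and 250, which agrees with the escaped string shown in the README.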