Please read the LICENSE file, which ships with this software.


*** QUICK START ***

To compile the C library, call "make c-library"; to compile the Ruby
library, call "make ruby-library"; and to compile the PostgreSQL extension,
call "make pgsql-library".

"make all" can be used to build everything, but both Ruby and PostgreSQL
installations are required in this case.

For Ruby, a gem file "utf8proc-1.0.1.gem" is provided as an alternative.


*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation and is
named "libutf8proc.a" and "libutf8proc.so". The Ruby library consists of the
files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the Ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as version 5.0.0
      was not yet available.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.


*** RUBY API ***

The Ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
  "Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file "utf8proc.h".
Please note that the corresponding symbols in Ruby are all lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
  0x000A.utf8 => "\n"
  0x2028.utf8 => "\342\200\250"


*** POSTGRESQL API ***

For PostgreSQL, an SQL function named "unifold" is supplied. This function
can be used to prepare index fields so that they are normalized and
case-folded, e.g.:

  CREATE TABLE people (
    id    serial8 primary key,
    name  text,
    CHECK (unifold(name) NOTNULL)
  );
  CREATE INDEX name_idx ON people (unifold(name));
  SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

NOTICE: The outputs of the function can change between releases, as utf8proc
        does not follow a versioning stability policy. You have to rebuild
        your database indices if you upgrade to a newer version of utf8proc.
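
The byte sequence shown for 0x2028.utf8 in the RUBY API section follows from
the standard UTF-8 encoding rules. The following C sketch spells those rules
out. It is not the code utf8proc uses; the function name "encode_utf8" is
hypothetical, and rejecting surrogates and values above 0x10FFFF is an
assumption about what counts as a valid code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of utf8proc): encodes one Unicode code point
 * as UTF-8 into out. Returns the number of bytes written (1 to 4), or 0 if
 * the code point is a surrogate or lies above 0x10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xC0 | (cp >> 6));
        out[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are rejected */
        out[0] = (unsigned char) (0xE0 | (cp >> 12));
        out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char) (0xF0 | (cp >> 18));
        out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode code space */
}
```

For 0x2028 this writes the three bytes 0xE2 0x80 0xA8, which are the octal
escapes \342 \200 \250 shown in the Ruby example.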


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


Unicode is a trademark of Unicode, Inc., and may be registered in some
jurisdictions.