utf8proc is a library for processing UTF-8 encoded Unicode strings. Its features include Unicode normalization, stripping of default ignorable characters, case folding and detection of grapheme cluster boundaries. A special character mapping is also available; for example, it converts the characters “Hyphen” (U+2010), “Minus” (U+2212) and “Hyphen-Minus” (U+002D, ASCII minus) all into the ASCII minus sign, so that they compare as equal.
The library can be used in C programs, and most of the functionality is also available as a Ruby library. For PostgreSQL there is an extension that provides a function for preparing strings for use in case-insensitive indices or for comparing two strings for equality.
The currently supported Unicode version is 5.0.0.
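The hyphen mapping described above can be illustrated with a minimal, self-contained C sketch. It does not call utf8proc: the names `sketch_decode`, `sketch_lump` and `sketch_equal_lumped` are hypothetical, the decoder is simplified (it does not reject overlong encodings or surrogates), and only the three characters named above are handled, whereas the library's own mapping may cover more.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence from s (at most len bytes) into *cp.
   Returns the number of bytes consumed, or 0 on malformed input.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t sketch_decode(const unsigned char *s, size_t len, uint32_t *cp) {
    size_t n, i;
    if (len == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */
    if (n > len) return 0;          /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return n;
}

/* Hypothetical lump rule: map HYPHEN (U+2010) and MINUS SIGN (U+2212)
   to HYPHEN-MINUS (U+002D); every other code point is left unchanged. */
static uint32_t sketch_lump(uint32_t cp) {
    return (cp == 0x2010 || cp == 0x2212) ? 0x002D : cp;
}

/* Returns 1 if the two NUL-terminated UTF-8 strings are equal after
   lumping, 0 otherwise. Malformed input is treated as unequal. */
int sketch_equal_lumped(const char *a, const char *b) {
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    size_t la = strlen(a), lb = strlen(b);
    while (la > 0 && lb > 0) {
        uint32_t ca, cb;
        size_t na = sketch_decode(pa, la, &ca);
        size_t nb = sketch_decode(pb, lb, &cb);
        if (na == 0 || nb == 0) return 0;
        if (sketch_lump(ca) != sketch_lump(cb)) return 0;
        pa += na; la -= na;
        pb += nb; lb -= nb;
    }
    return la == 0 && lb == 0;
}
```

With this sketch, `sketch_equal_lumped("co\xE2\x80\x90op", "co-op")` treats the U+2010 spelling and the ASCII spelling as the same string. In real code the library's own mapping would be used, typically combined with normalization and case folding.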
Package for RubyGems
- Wrong treatment of COMBINING GREEK YPOGEGRAMMENI and any characters that have it as part of their decomposition (requires normalization before and after casefolding, refer to chapter 3 of the Unicode standard)
- Currently, only Unicode v5.0.0 is supported
- Code-cleanup needed
- UTF-8 decoding and encoding should be done in an independent step
- Import script for Unicode data requires code-cleanup
- 2013-11-27: Version 1.1.6 released
- PostgreSQL 9.2 and 9.3 compatibility (lower case 'c' language name)
- 2009-10-16: Version 1.1.5 released
- Use RSTRING_PTR() and RSTRING_LEN() instead of RSTRING()->ptr and RSTRING()->len for Ruby 1.9 compatibility (and #define them if they do not exist)
- Patches for compatibility with Microsoft Visual Studio
- Fixes to make utf8proc usable in C++ programs
- 2009-08-19: Version 1.1.4 released
- Replaced C++ style comments for compatibility reasons
- Added typecasts to suppress compiler warnings
- Removed redundant source files for ruby-gemfile generation
- Changed copyright notice for Public Software Group e. V.
- Minor changes in the README file
- Changes in version 1.1.3
- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)
- Added a function utf8proc_version returning a string containing the version number of the library.
- Included a target libutf8proc.dylib for Mac OS X.
- Changes in version 1.1.2
- Fixed a serious bug in the data file generator, which caused characters to be treated incorrectly when stripping default ignorable characters or calculating grapheme cluster boundaries.
- Changes in version 1.1.1
- Changed license from BSD to MIT style.
- Added a new function utf8proc_codepoint_valid to the C library.
- Changed compiler flags in Makefile from -g -O0 to -O2
- The ruby script, which was used to build the utf8proc_data.c file, is now included in the distribution.
- Added a new PostgreSQL function unistrip, which behaves like unifold, but also removes all character marks (e.g. accents).
- Changes in version 1.0.3
- Fixed a bug in the Ruby library that caused an error when splitting an empty string at grapheme cluster boundaries (method String#utf8chars).
- Changes in version 1.0.2
- added support for PostgreSQL version 8.2
- included a check in Integer#utf8, which raises an exception if the given code point is invalid because it is too high (this check was previously missing)
- Changes in version 1.0.1
- included a gem file for the ruby version of the library
- Changes in version 1.0
- added the LUMP option, which lumps certain characters together (see lump.txt) (also used for the PostgreSQL unifold function)
- added the STRIPMARK option, which strips marking characters (or marks of composed characters)
- deprecated ruby method String#char_ary in favour of String#utf8chars
- Changes in version 0.3
- added support to mark the beginning of a grapheme cluster with 0xFF (option: CHARBOUND)
- added the Ruby method String#chars, which returns an array of UTF-8 encoded grapheme clusters
- added NLF2LF transformation in postgresql unifold function
- added the DECOMPOSE option; if you use neither COMPOSE nor DECOMPOSE, no normalization will be performed (different from previous versions)
- using integer constants rather than C-strings for character properties
- fixed (hopefully) a problem with the Ruby library on Mac OS X, which occurred when compiler optimization was switched on
- changed normalization from NFC to NFKC for postgresql unifold function
- Changes in version 0.2
- added -fpic compiler flag in Makefile
- fixed bug in the C code for the ruby library (usage of non-existent function)
- changed behaviour of PostgreSQL function to return NULL in case of invalid input, rather than raising an exceptional condition
- improved efficiency of PostgreSQL function (no transformation to C string is done)
- 2006-06-02: First release v0.1
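
Two changelog entries above, the utf8proc_codepoint_valid function (version 1.1.1) and the range check in Integer#utf8 (version 1.0.2), concern rejecting invalid code points. The following is a minimal sketch of that idea, not the library's code: the names `sketch_codepoint_valid` and `sketch_encode_utf8` are hypothetical, and "valid" is assumed here to mean a Unicode scalar value (at most U+10FFFF and not a surrogate). The library's own rules may be stricter.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a code point is valid if it is in the range 0..0x10FFFF
   and is not a UTF-16 surrogate (U+D800..U+DFFF). */
int sketch_codepoint_valid(int32_t cp) {
    if (cp < 0 || cp > 0x10FFFF) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return 1;
}

/* Encode cp as UTF-8 into out (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp is invalid,
   in which case out is left untouched. */
size_t sketch_encode_utf8(int32_t cp, unsigned char out[4]) {
    if (!sketch_codepoint_valid(cp)) return 0;
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Checking validity before encoding means a caller can tell an out-of-range value such as 0x110000 apart from a successful encode by the zero return value, which mirrors the exception raised on the Ruby side.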