utf8proc is a library for processing UTF-8 encoded Unicode strings. Its features include Unicode normalization, stripping of default ignorable characters, case folding, and detection of grapheme cluster boundaries. A special character mapping is also available; it converts, for example, the characters “Hyphen” (U+2010), “Minus” (U+2212) and “Hyphen-Minus” (U+002D, ASCII minus) all into the ASCII minus sign, so that they compare as equal.
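To make the hyphen mapping described above concrete, here is a minimal, self-contained sketch in C. It is not utf8proc's implementation and does not call the library; the function name lump_hyphens is hypothetical, and the sketch handles only the two non-ASCII characters named above, assuming NUL-terminated UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of utf8proc: copies the NUL-terminated
 * UTF-8 string src into dst, replacing U+2010 (bytes E2 80 90) and
 * U+2212 (bytes E2 88 92) with the ASCII minus sign '-'.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or (size_t)-1 if dst is too small. */
size_t lump_hyphens(const char *src, char *dst, size_t dst_size)
{
    assert(src != NULL && dst != NULL);
    if (dst_size == 0)
        return (size_t)-1;

    const unsigned char *s = (const unsigned char *)src;
    size_t out = 0;

    while (*s != 0) {
        /* Always keep one byte free for the terminating NUL. */
        if (out + 1 >= dst_size)
            return (size_t)-1;

        /* The short-circuit evaluation never reads past the NUL byte. */
        if (s[0] == 0xE2 &&
            ((s[1] == 0x80 && s[2] == 0x90) ||   /* U+2010 HYPHEN     */
             (s[1] == 0x88 && s[2] == 0x92))) {  /* U+2212 MINUS SIGN */
            dst[out++] = '-';
            s += 3;
        } else {
            dst[out++] = (char)*s++;             /* copy the byte unchanged */
        }
    }
    dst[out] = '\0';
    return out;
}
```

After this mapping, two strings that differ only in which of the three hyphen-like characters they use can be compared byte for byte, for instance with strcmp. The library's own mapping covers more characters than this sketch does.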

The library can be used in C programs, and most of its functionality is also available as a Ruby library. For PostgreSQL there is an extension that provides a function for preparing strings for use in case-insensitive indexes or for comparing two strings for equality.

The currently supported Unicode version is 5.0.0.


Package for RubyGems


Open issues

There is currently a development fork of utf8proc on GitHub called libmojibake. See its project description for more information.