TYPO3CMS  8
CharsetConverter Class Reference
Inheritance: CharsetConverter implements SingletonInterface.

Public Member Functions

 parse_charset ($charset)
 
 conv ($inputString, $fromCharset, $toCharset, $useEntityForNoChar=false)
 
 convArray (&$array, $fromCharset, $toCharset, $useEntityForNoChar=false)
 
 utf8_encode ($str, $charset)
 
 utf8_decode ($str, $charset, $useEntityForNoChar=false)
 
 utf8_to_entities ($str)
 
 entities_to_utf8 ($str)
 
 utf8_to_numberarray ($str)
 
 UnumberToChar ($unicodeInteger)
 
 utf8CharToUnumber ($str, $hex=false)
 
 initCharset ($charset)
 
 initUnicodeData ($mode=null)
 
 initCaseFolding ($charset)
 
 initToASCII ($charset)
 
 substr ($charset, $string, $start, $len=null)
 
 strlen ($charset, $string)
 
 crop ($charset, $string, $len, $crop= '')
 
 strtrunc ($charset, $string, $len)
 
 conv_case ($charset, $string, $case)
 
 convCaseFirst ($charset, $string, $case)
 
 convCapitalize ($charset, $string)
 
 specCharsToASCII ($charset, $string)
 
 sb_char_mapping ($str, $charset, $mode, $opt= '')
 
 utf8_substr ($str, $start, $len=null)
 
 utf8_strlen ($str)
 
 utf8_strtrunc ($str, $len)
 
 utf8_strpos ($haystack, $needle, $offset=0)
 
 utf8_strrpos ($haystack, $needle)
 
 utf8_char2byte_pos ($str, $pos)
 
 utf8_byte2char_pos ($str, $pos)
 
 utf8_char_mapping ($str, $mode, $opt= '')
 
 euc_strtrunc ($str, $len, $charset)
 
 euc_substr ($str, $start, $charset, $len=null)
 
 euc_strlen ($str, $charset)
 
 euc_char2byte_pos ($str, $pos, $charset)
 
 euc_char_mapping ($str, $charset, $mode, $opt= '')
 

Public Attributes

 $noCharByteVal = 63
 
 $parsedCharsets = []
 
 $caseFolding = []
 
 $toASCII = []
 
 $twoByteSets
 
 $fourByteSets
 
 $eucBasedSets
 
 $synonyms
 
 $charSetArray
 

Detailed Description

Notes on UTF-8

Functions working on UTF-8 strings:

  • strchr/strstr
  • strrchr
  • substr_count
  • implode/explode/join

Functions nearly working on UTF-8 strings:

  • strlen: returns the length in BYTES; if you need the length in CHARACTERS, use utf8_strlen
  • trim/ltrim/rtrim: the second parameter 'charlist' won't work for characters not contained in 7-bit ASCII
  • strpos/strrpos: they return the BYTE position; if you need the CHARACTER position, use utf8_strpos/utf8_strrpos
  • htmlentities: charset support for UTF-8 only since PHP 4.3.0
  • preg_*: UTF-8 support is compiled into PHP by default nowadays but could be unavailable; the u pattern modifier is required

Functions NOT working on UTF-8 strings:

  • str*cmp
  • stristr
  • stripos
  • substr
  • strrev
  • split/spliti
  • ...

Class for conversion between charsets.
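
The byte/character distinction in the notes above can be seen in a short sketch. It assumes the mbstring extension is loaded and PHP 7.0 or later (for the \u{...} escapes); the variable names are illustrative only.

```php
<?php
// "é" (U+00E9) takes two bytes in UTF-8; escapes keep this file encoding-neutral.
$word = "caf\u{00E9}";

$byteLength = strlen($word);              // counts bytes
$charLength = mb_strlen($word, 'UTF-8');  // counts characters

$haystack = "\u{00E9}a";
$bytePos = strpos($haystack, 'a');                 // byte offset of "a"
$charPos = mb_strpos($haystack, 'a', 0, 'UTF-8');  // character offset of "a"

printf("%d bytes, %d characters\n", $byteLength, $charLength);
printf("byte position %d, character position %d\n", $bytePos, $charPos);
```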

Definition at line 54 of file CharsetConverter.php.

Member Function Documentation

conv (   $inputString,
  $fromCharset,
  $toCharset,
  $useEntityForNoChar = false 
)

Convert from one charset to another charset.

Parameters
string $inputString: Input string
string $fromCharset: From charset (the current charset of the string)
string $toCharset: To charset (the output charset wanted)
bool $useEntityForNoChar: If set, then characters that are not available in the destination character set will be encoded as numeric entities
Returns
string Converted string
See also
convArray()

Definition at line 313 of file CharsetConverter.php.

References CharsetConverter\utf8_decode(), and CharsetConverter\utf8_encode().

Referenced by CharsetConverter\convArray().
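
Since conv() references utf8_encode() and utf8_decode(), conversions between two non-UTF-8 charsets presumably pass through UTF-8 as a pivot. The following is a minimal sketch of that idea using mbstring instead of the csconvtbl tables; convViaUtf8() is a hypothetical helper, not part of this class, and it does not cover $useEntityForNoChar.

```php
<?php
// Hypothetical helper: convert between charsets with UTF-8 as the pivot.
// Assumes mbstring knows both charset names.
function convViaUtf8(string $input, string $fromCharset, string $toCharset): string
{
    $from = strtolower($fromCharset);
    $to = strtolower($toCharset);
    if ($from === $to) {
        return $input;
    }
    // Step 1: source charset -> UTF-8 (skipped if the input already is UTF-8).
    $utf8 = $from === 'utf-8' ? $input : mb_convert_encoding($input, 'UTF-8', $from);
    // Step 2: UTF-8 -> target charset (skipped if UTF-8 is the target).
    return $to === 'utf-8' ? $utf8 : mb_convert_encoding($utf8, $to, 'UTF-8');
}

// 0xE9 is "é" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 A9.
echo bin2hex(convViaUtf8("\xE9", 'iso-8859-1', 'utf-8')), "\n";
```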

conv_case (   $charset,
  $string,
  $case 
)

Translates all characters of a string into their respective case values. Unlike strtolower() and strtoupper(), this method is locale-independent. Note that the string length may change, e.g. lower case German "ß" (sharp S) becomes upper case "SS". Real case folding is language-dependent; this method ignores this fact. Unit-tested by Kasper.

Parameters
string $charset: Character set of string
string $string: Input string to convert case for
string $case: Case keyword: "toLower" means lowercase conversion, anything else is uppercase (use "toUpper")
Returns
string The converted string
See also
strtolower(), strtoupper()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strtolower() or mb_strtoupper() directly

Definition at line 1221 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().
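
As a migration aid, the deprecation note can be sketched as a thin wrapper around the mbstring functions. convCaseReplacement() is a hypothetical name; the "toLower"/anything-else convention is taken from the parameter description above. mbstring's treatment of length-changing mappings such as "ß" can differ between PHP versions, so results for such characters are worth checking.

```php
<?php
// Hypothetical drop-in replacement for the deprecated conv_case().
function convCaseReplacement(string $charset, string $string, string $case): string
{
    // "toLower" lowercases; any other keyword uppercases, as documented above.
    return $case === 'toLower'
        ? mb_strtolower($string, $charset)
        : mb_strtoupper($string, $charset);
}

echo convCaseReplacement('utf-8', "\u{00C4}B", 'toLower'), "\n";
```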

convArray (   &$array,
  $fromCharset,
  $toCharset,
  $useEntityForNoChar = false 
)

Convert all elements in ARRAY with type string from one charset to another charset. NOTICE: Array is passed by reference!

Parameters
array $array: Input array, possibly multidimensional
string $fromCharset: From charset (the current charset of the string)
string $toCharset: To charset (the output charset wanted)
bool $useEntityForNoChar: If set, then characters that are not available in the destination character set will be encoded as numeric entities
Returns
void
See also
conv()

Definition at line 346 of file CharsetConverter.php.

References CharsetConverter\conv().
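
The in-place, by-reference conversion of nested arrays can be sketched as follows. This is not the class's implementation: convertStringsInPlace() is a hypothetical helper that uses array_walk_recursive() and mbstring, and it leaves non-string values untouched.

```php
<?php
// Hypothetical helper: convert every string leaf of a (nested) array in place.
function convertStringsInPlace(array &$array, string $fromCharset, string $toCharset): void
{
    array_walk_recursive($array, function (&$value) use ($fromCharset, $toCharset) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, $toCharset, $fromCharset);
        }
    });
}

$data = ['title' => "\xE9", 'nested' => [1, "\xFC"]]; // ISO-8859-1 bytes for é and ü
convertStringsInPlace($data, 'iso-8859-1', 'UTF-8');
echo bin2hex($data['title']), ' ', bin2hex($data['nested'][1]), "\n";
```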

convCapitalize (   $charset,
  $string 
)

Capitalize the given string

Parameters
string $charset
string $string
Returns
string
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_convert_case() directly

Definition at line 1255 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

convCaseFirst (   $charset,
  $string,
  $case 
)

Equivalent of lcfirst/ucfirst but using character set.

Parameters
string $charset
string $string
string $case: can be 'toLower' or 'toUpper'
Returns
string

Definition at line 1237 of file CharsetConverter.php.

crop (   $charset,
  $string,
  $len,
  $crop = '' 
)

Truncates a string and pre-/appends a string. Unit tested by Kasper

Parameters
string $charset: The character set
string $string: Character string
int $len: Length (in characters)
string $crop: Crop signifier
Returns
string The shortened string
See also
substr(), mb_strimwidth()

Definition at line 1176 of file CharsetConverter.php.
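
A minimal sketch of the "pre-/appends" behaviour, built on mb_substr(). The sign convention is an assumption: a positive $len keeps the start of the string and appends the crop signifier, a negative $len keeps the end and prepends it. cropSketch() is a hypothetical name.

```php
<?php
// Hypothetical sketch of crop(); the meaning of a negative $len is an assumption.
function cropSketch(string $charset, string $string, int $len, string $crop = ''): string
{
    // Nothing to do if the string already fits.
    if ($len === 0 || mb_strlen($string, $charset) <= abs($len)) {
        return $string;
    }
    return $len > 0
        ? mb_substr($string, 0, $len, $charset) . $crop    // keep the start, append
        : $crop . mb_substr($string, $len, null, $charset); // keep the end, prepend
}

echo cropSketch('utf-8', "\u{00E4}bcdef", 3, '...'), "\n";
echo cropSketch('utf-8', "\u{00E4}bcdef", -2, '...'), "\n";
```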

entities_to_utf8 (   $str)

Converts numeric entities (UNICODE, e.g. decimal (&#1234;) or hexadecimal (&#x1b;)) to UTF-8 multibyte chars. All string-HTML entities (like &amp; or &pound;) will be converted as well.

Parameters
string $str: Input string, UTF-8
Returns
string Output string

Definition at line 536 of file CharsetConverter.php.

References CharsetConverter\substr(), and CharsetConverter\UnumberToChar().

Referenced by CharsetConverter\utf8_to_numberarray().
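
For comparison, PHP's built-in html_entity_decode() performs a similar decoding of numeric and named entities into UTF-8. The sketch below shows that built-in, not this method; the sample input is made up for illustration.

```php
<?php
// Decimal and hexadecimal numeric entities plus two named entities.
$input = '&#1234; &#x4d2; &amp; &pound;';

// Decode into UTF-8; both numeric forms refer to U+04D2.
$decoded = html_entity_decode($input, ENT_QUOTES, 'UTF-8');

echo bin2hex($decoded), "\n";
```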

euc_char2byte_pos (   $str,
  $pos,
  $charset 
)

Translates a character position into an 'absolute' byte position.

Parameters
string $str: EUC multibyte character string
int $pos: Character position (negative values start from the end)
string $charset: The charset
Returns
int Byte position
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, former internal function only

Definition at line 1723 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction(), and CharsetConverter\strlen().

Referenced by CharsetConverter\euc_substr().

euc_char_mapping (   $str,
  $charset,
  $mode,
  $opt = '' 
)

Maps all characters of a string in the EUC charset family.

Parameters
string $str: EUC multibyte character string
string $charset: The charset
string $mode: Mode: 'case' (case folding) or 'ascii' (ASCII transliteration)
string $opt: For mode 'case', the conversion: 'toLower' or 'toUpper'
Returns
string The converted string

Definition at line 1771 of file CharsetConverter.php.

References CharsetConverter\initCaseFolding(), CharsetConverter\initToASCII(), and CharsetConverter\substr().

Referenced by CharsetConverter\specCharsToASCII().

euc_strlen (   $str,
  $charset 
)

Counts the number of characters of a string in the EUC charset family.

Parameters
string $str: EUC multibyte character string
string $charset: The charset
Returns
int The number of characters
See also
strlen()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strlen() directly

Definition at line 1693 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

euc_strtrunc (   $str,
  $len,
  $charset 
)

Cuts a string in the EUC charset family short at a given byte length.

Parameters
string $str: EUC multibyte character string
int $len: The byte length
string $charset: The charset
Returns
string The shortened string
See also
mb_strcut()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strcut() directly

Definition at line 1624 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction(), and CharsetConverter\substr().

euc_substr (   $str,
  $start,
  $charset,
  $len = null 
)

Returns a part of a string in the EUC charset family.

Parameters
string $str: EUC multibyte character string
int $start: Start position (character position)
string $charset: The charset
int $len: Length (in characters)
Returns
string the substring
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_substr() directly

Definition at line 1662 of file CharsetConverter.php.

References CharsetConverter\euc_char2byte_pos(), GeneralUtility\logDeprecatedFunction(), and CharsetConverter\substr().

initCaseFolding (   $charset)

This function initializes the folding table for a charset other than UTF-8. This function is automatically called by the case folding functions.

Parameters
string $charset: Charset for which to initialize case folding.
Returns
int Returns FALSE on error, or a TRUE value on success: 1 = table already loaded, 2 = cached version, 3 = table parsed (and cached). Access: private.

Definition at line 1031 of file CharsetConverter.php.

References GeneralUtility\getFileAbsFileName(), CharsetConverter\initCharset(), CharsetConverter\initUnicodeData(), CharsetConverter\utf8_decode(), and GeneralUtility\writeFileToTypo3tempDir().

Referenced by CharsetConverter\euc_char_mapping(), and CharsetConverter\sb_char_mapping().

initCharset (   $charset)

This will initialize a charset for use if it is defined in the 'typo3/sysext/core/Resources/Private/Charsets/csconvtbl/' folder. This function is automatically called by the conversion functions.

PLEASE SEE: http://www.unicode.org/Public/MAPPINGS/

Parameters
string $charset: The charset to be initialized. Always use a lowercase charset name; it must exactly match a filename in the csconvtbl/ folder ([charset].tbl).
Returns
int Returns 1 if already loaded, 2 if the charset conversion table was found and parsed, and FALSE if the charset conversion table was not found. Access: private.

Definition at line 725 of file CharsetConverter.php.

References ExtensionManagementUtility\extPath(), GeneralUtility\getFileAbsFileName(), CharsetConverter\substr(), GeneralUtility\trimExplode(), CharsetConverter\UnumberToChar(), GeneralUtility\validPathStr(), and GeneralUtility\writeFileToTypo3tempDir().

Referenced by CharsetConverter\initCaseFolding(), CharsetConverter\initToASCII(), CharsetConverter\utf8_decode(), and CharsetConverter\utf8_encode().

initToASCII (   $charset)

This function initializes the to-ASCII conversion table for a charset other than UTF-8. This function is automatically called by the ASCII transliteration functions.

Parameters
string $charset: Charset for which to initialize conversion.
Returns
int Returns FALSE on error, or a TRUE value on success: 1 = table already loaded, 2 = cached version, 3 = table parsed (and cached). Access: private.

Definition at line 1093 of file CharsetConverter.php.

References GeneralUtility\getFileAbsFileName(), CharsetConverter\initCharset(), CharsetConverter\initUnicodeData(), CharsetConverter\utf8_decode(), and GeneralUtility\writeFileToTypo3tempDir().

Referenced by CharsetConverter\euc_char_mapping(), and CharsetConverter\sb_char_mapping().

initUnicodeData (   $mode = null)

This function initializes all UTF-8 character data tables.

PLEASE SEE: http://www.unicode.org/Public/UNIDATA/

Parameters
string $mode: Mode ("case", "ascii", ...)
Returns
int Returns FALSE on error, or a TRUE value on success: 1 = table already loaded, 2 = cached version, 3 = table parsed (and cached). Access: private.

Definition at line 793 of file CharsetConverter.php.

References ExtensionManagementUtility\extPath(), GeneralUtility\getFileAbsFileName(), GeneralUtility\trimExplode(), CharsetConverter\UnumberToChar(), GeneralUtility\validPathStr(), and GeneralUtility\writeFileToTypo3tempDir().

Referenced by CharsetConverter\initCaseFolding(), CharsetConverter\initToASCII(), and CharsetConverter\utf8_char_mapping().

parse_charset (   $charset)

Normalize - changes input character set to lowercase letters.

Parameters
string $charset: Input charset
Returns
string Normalized charset

Definition at line 289 of file CharsetConverter.php.

sb_char_mapping (   $str,
  $charset,
  $mode,
  $opt = '' 
)

Maps all characters of a string in a single byte charset.

Parameters
string $str: The string
string $charset: The charset
string $mode: Mode: 'case' (case folding) or 'ascii' (ASCII transliteration)
string $opt: For mode 'case', the conversion: 'toLower' or 'toUpper'
Returns
string The converted string

Definition at line 1311 of file CharsetConverter.php.

References CharsetConverter\initCaseFolding(), and CharsetConverter\initToASCII().

Referenced by CharsetConverter\specCharsToASCII().

specCharsToASCII (   $charset,
  $string 
)

Converts special chars (like æøåÆØÅ, umlauts etc) to ascii equivalents (usually double-bytes, like æ => ae etc.)

Parameters
string $charset: Character set of string
string $string: Input string to convert
Returns
string The converted string

Definition at line 1268 of file CharsetConverter.php.

References CharsetConverter\euc_char_mapping(), CharsetConverter\sb_char_mapping(), and CharsetConverter\utf8_char_mapping().
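
The kind of mapping described here (one special character to one or two ASCII letters) can be illustrated with a lookup table and strtr(). The table below is a tiny, hypothetical excerpt for UTF-8 input only, and toAsciiSketch() is not part of this class, which instead dispatches to the *_char_mapping methods listed above.

```php
<?php
// Hypothetical sketch: table-driven ASCII transliteration for UTF-8 input.
function toAsciiSketch(string $utf8): string
{
    // Illustrative excerpt only; these replacements are assumptions.
    $map = [
        "\u{00E6}" => 'ae', // æ
        "\u{00F8}" => 'oe', // ø
        "\u{00E5}" => 'aa', // å
        "\u{00FC}" => 'ue', // ü
        "\u{00DF}" => 'ss', // ß
    ];
    // strtr() replaces each listed byte sequence; unlisted characters pass through.
    return strtr($utf8, $map);
}

echo toAsciiSketch("\u{00E6}ble \u{00FC}ber"), "\n";
```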

strlen (   $charset,
  $string 
)

Counts the number of characters. Unit-tested by Kasper (single byte charsets only)

Parameters
string $charset: The character set
string $string: Character string
Returns
int The number of characters
See also
strlen()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strlen() directly

Definition at line 1159 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

Referenced by CharsetConverter\euc_char2byte_pos(), CharsetConverter\utf8_char2byte_pos(), CharsetConverter\utf8_decode(), CharsetConverter\utf8_encode(), CharsetConverter\utf8_to_entities(), and CharsetConverter\utf8_to_numberarray().

strtrunc (   $charset,
  $string,
  $len 
)

Cuts a string short at a given byte length.

Parameters
string $charset: The character set
string $string: Character string
int $len: The byte length
Returns
string The shortened string
See also
mb_strcut()

Definition at line 1198 of file CharsetConverter.php.
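
Cutting at a byte length matters because a naive substr() can split a multibyte character. The sketch below contrasts it with mb_strcut(), the function named under "See also"; it assumes mbstring is available and the sample string is illustrative.

```php
<?php
// "a" (1 byte) + "é" (2 bytes) + "€" (3 bytes) = 6 bytes in UTF-8.
$text = "a\u{00E9}\u{20AC}";

// Naive byte cut: ends in the middle of "é" and yields invalid UTF-8.
$naive = substr($text, 0, 2);

// mb_strcut() backs off to the previous character boundary.
$safe = mb_strcut($text, 0, 2, 'UTF-8');

var_dump(mb_check_encoding($naive, 'UTF-8'), $safe);
```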

substr (   $charset,
  $string,
  $start,
  $len = null 
)

Returns a part of a string. Unit-tested by Kasper (single byte charsets only)

Parameters
string $charset: The character set
string $string: Character string
int $start: Start position (character position)
int $len: Length (in characters)
Returns
string The substring
See also
substr(), mb_substr()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_substr() directly

Definition at line 1143 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

Referenced by CharsetConverter\entities_to_utf8(), CharsetConverter\euc_char_mapping(), CharsetConverter\euc_strtrunc(), CharsetConverter\euc_substr(), CharsetConverter\initCharset(), CharsetConverter\utf8_char_mapping(), CharsetConverter\utf8_decode(), CharsetConverter\utf8_encode(), CharsetConverter\utf8_strtrunc(), CharsetConverter\utf8_substr(), CharsetConverter\utf8_to_entities(), CharsetConverter\utf8_to_numberarray(), and CharsetConverter\utf8CharToUnumber().

UnumberToChar (   $unicodeInteger)

Converts a UNICODE number to a UTF-8 multibyte character. Algorithm based on a script found at http://czyborra.com/utf/. Unit-tested by Kasper.

The binary representation of the character's integer value is thus simply spread across the bytes and the number of high bits set in the lead byte announces the number of bytes in the multibyte sequence:

bytes | bits | representation
    1 |    7 | 0vvvvvvv
    2 |   11 | 110vvvvv 10vvvvvv
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    5 |   26 | 111110vv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv
    6 |   31 | 1111110v 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv 10vvvvvv

Parameters
int$unicodeIntegerUNICODE integer
Returns
string UTF-8 multibyte character string
See also
utf8CharToUnumber()

Definition at line 638 of file CharsetConverter.php.

Referenced by CharsetConverter\entities_to_utf8(), CharsetConverter\initCharset(), and CharsetConverter\initUnicodeData().
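
The byte layout described for this method can be sketched directly with bit shifts. This is an illustration rather than the class's code: codePointToUtf8() is a hypothetical name, and the sketch stops at four bytes, which covers every code point valid in present-day UTF-8.

```php
<?php
// Hypothetical sketch: encode one Unicode code point as UTF-8 (1 to 4 bytes).
function codePointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp >= 0x200000) {
        throw new InvalidArgumentException('Code point outside the 21-bit range');
    }
    if ($cp < 0x80) {        // 7 bits: 0vvvvvvv
        return chr($cp);
    }
    if ($cp < 0x800) {       // 11 bits: 110vvvvv 10vvvvvv
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 16 bits: 1110vvvv 10vvvvvv 10vvvvvv
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    // 21 bits: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codePointToUtf8(0x20AC)), "\n"; // euro sign
```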

utf8_byte2char_pos (   $str,
  $pos 
)

Translates an 'absolute' byte position into a character position. Unit tested by Kasper.

Parameters
string $str: UTF-8 string
int $pos: Byte position
Returns
int Character position
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, former internal function only

Definition at line 1532 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

utf8_char2byte_pos (   $str,
  $pos 
)

Translates a character position into an 'absolute' byte position. Unit tested by Kasper.

Parameters
string $str: UTF-8 string
int $pos: Character position (negative values start from the end)
Returns
int Byte position

Definition at line 1484 of file CharsetConverter.php.

References CharsetConverter\strlen().

Referenced by CharsetConverter\utf8_substr().
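
One way to picture the translation: the byte position of character N equals the byte length of the first N characters. The sketch below uses mbstring for that and covers non-negative positions only; charToBytePos() is a hypothetical helper, not this method's implementation.

```php
<?php
// Hypothetical sketch: byte offset of the character at position $pos (>= 0).
function charToBytePos(string $str, int $pos): int
{
    // Take the first $pos characters, then measure them in bytes.
    return strlen(mb_substr($str, 0, $pos, 'UTF-8'));
}

// "ä" is 2 bytes and "€" is 3 bytes, so "x" starts at byte 5.
echo charToBytePos("\u{00E4}\u{20AC}x", 2), "\n";
```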

utf8_char_mapping (   $str,
  $mode,
  $opt = '' 
)

Maps all characters of a UTF-8 string.

Parameters
string $str: UTF-8 string
string $mode: Mode: 'case' (case folding) or 'ascii' (ASCII transliteration)
string $opt: For mode 'case', the conversion: 'toLower' or 'toUpper'
Returns
string The converted string

Definition at line 1562 of file CharsetConverter.php.

References CharsetConverter\initUnicodeData(), and CharsetConverter\substr().

Referenced by CharsetConverter\specCharsToASCII().

utf8_decode (   $str,
  $charset,
  $useEntityForNoChar = false 
)

Converts $str from UTF-8 to $charset

Parameters
string $str: String in UTF-8 to convert to local charset
string $charset: Charset, lowercase. Must be found in csconvtbl/ folder.
bool $useEntityForNoChar: If set, then characters that are not available in the destination character set will be encoded as numeric entities
Returns
string Output string, converted to local charset

Definition at line 425 of file CharsetConverter.php.

References CharsetConverter\initCharset(), CharsetConverter\strlen(), CharsetConverter\substr(), and CharsetConverter\utf8CharToUnumber().

Referenced by CharsetConverter\conv(), CharsetConverter\initCaseFolding(), and CharsetConverter\initToASCII().

utf8_encode (   $str,
  $charset 
)

Converts $str from $charset to UTF-8

Parameters
string $str: String in local charset to convert to UTF-8
string $charset: Charset, lowercase. Must be found in csconvtbl/ folder.
Returns
string Output string, converted to UTF-8

Definition at line 364 of file CharsetConverter.php.

References CharsetConverter\initCharset(), CharsetConverter\strlen(), and CharsetConverter\substr().

Referenced by CharsetConverter\conv().

utf8_strlen (   $str)

Counts the number of characters of a string in UTF-8. Unit-tested by Kasper and works 100% like strlen() / mb_strlen()

Parameters
string $str: UTF-8 multibyte character string
Returns
int The number of characters
See also
strlen()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strlen() directly

Definition at line 1396 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

utf8_strpos (   $haystack,
  $needle,
  $offset = 0 
)

Find position of first occurrence of a string, both arguments are in UTF-8.

Parameters
string $haystack: UTF-8 string to search in
string $needle: UTF-8 string to search for
int $offset: Position to start the search
Returns
int The character position
See also
strpos()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strpos() directly

Definition at line 1455 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

utf8_strrpos (   $haystack,
  $needle 
)

Find position of last occurrence of a char in a string, both arguments are in UTF-8.

Parameters
string $haystack: UTF-8 string to search in
string $needle: UTF-8 character to search for (single character)
Returns
int The character position
See also
strrpos()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strrpos() directly

Definition at line 1470 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction().

utf8_strtrunc (   $str,
  $len 
)

Truncates a string in UTF-8 short at a given byte length.

Parameters
string $str: UTF-8 multibyte character string
int $len: The byte length
Returns
string The shortened string
See also
mb_strcut()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_strcut() directly

Definition at line 1422 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction(), and CharsetConverter\substr().

utf8_substr (   $str,
  $start,
  $len = null 
)

Returns a part of a UTF-8 string. Unit-tested by Kasper and works 100% like substr() / mb_substr() for full range of $start/$len

Parameters
string $str: UTF-8 string
int $start: Start position (character position)
int $len: Length (in characters)
Returns
string The substring
See also
substr()
Deprecated:
since TYPO3 v8, will be removed with TYPO3 v9, use mb_substr() directly

Definition at line 1359 of file CharsetConverter.php.

References GeneralUtility\logDeprecatedFunction(), CharsetConverter\substr(), and CharsetConverter\utf8_char2byte_pos().

utf8_to_entities (   $str)

Converts all chars > 127 to numeric entities.

Parameters
string $str: Input string
Returns
string Output string

Definition at line 492 of file CharsetConverter.php.

References CharsetConverter\strlen(), CharsetConverter\substr(), and CharsetConverter\utf8CharToUnumber().
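
Replacing every character above 127 with a numeric entity can be sketched with a UTF-8-aware regular expression. This is an illustration under assumptions: utf8ToNumericEntities() is a hypothetical name, the entities are written in decimal form, and the code point is obtained through a UCS-4BE conversion so that the sketch also runs on PHP 7.0.

```php
<?php
// Hypothetical sketch: turn all non-ASCII characters into decimal numeric entities.
function utf8ToNumericEntities(string $str): string
{
    $result = preg_replace_callback('/[^\x00-\x7F]/u', function (array $match): string {
        // UCS-4BE is the code point as a 32-bit big-endian integer.
        $codePoint = unpack('N', mb_convert_encoding($match[0], 'UCS-4BE', 'UTF-8'))[1];
        return '&#' . $codePoint . ';';
    }, $str);

    // preg_replace_callback() returns null on invalid UTF-8; fall back to the input.
    return $result ?? $str;
}

echo utf8ToNumericEntities("a\u{20AC}b"), "\n";
```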

utf8_to_numberarray (   $str)

Converts all chars in the input UTF-8 string into integer numbers returned in an array. All HTML entities (like &amp; or &pound; or &#123; or &#x3f5d;) will be detected as characters. Also, instead of integer numbers the real UTF-8 char is returned.

Parameters
string $str: Input string, UTF-8
Returns
array Output array with the char numbers

Definition at line 576 of file CharsetConverter.php.

References CharsetConverter\entities_to_utf8(), CharsetConverter\strlen(), and CharsetConverter\substr().

utf8CharToUnumber (   $str,
  $hex = false 
)

Converts a UTF-8 multibyte character to a UNICODE number. Unit-tested by Kasper.

Parameters
string $str: UTF-8 multibyte character string
bool $hex: If set, then a hex. number is returned.
Returns
int UNICODE integer
See also
UnumberToChar()

Definition at line 684 of file CharsetConverter.php.

References CharsetConverter\substr().

Referenced by CharsetConverter\utf8_decode(), and CharsetConverter\utf8_to_entities().
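
The reverse direction of UnumberToChar() can be sketched by reading the lead byte to find the sequence length and then collecting six bits from each continuation byte. utf8ToCodePoint() is a hypothetical helper limited to sequences of up to four bytes; it does not validate continuation bytes and does not cover the $hex option.

```php
<?php
// Hypothetical sketch: decode the first UTF-8 character of $char to its code point.
function utf8ToCodePoint(string $char): int
{
    if ($char === '') {
        throw new InvalidArgumentException('Empty input');
    }
    $lead = ord($char[0]);
    if ($lead < 0x80) {                   // 0vvvvvvv: plain ASCII
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {        // 110vvvvv: one continuation byte
        $extra = 1;
        $codePoint = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {  // 1110vvvv: two continuation bytes
        $extra = 2;
        $codePoint = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) {  // 11110vvv: three continuation bytes
        $extra = 3;
        $codePoint = $lead & 0x07;
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte');
    }
    if (strlen($char) < $extra + 1) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence');
    }
    for ($i = 1; $i <= $extra; $i++) {
        // Each continuation byte (10vvvvvv) contributes its low six bits.
        $codePoint = ($codePoint << 6) | (ord($char[$i]) & 0x3F);
    }
    return $codePoint;
}

echo utf8ToCodePoint("\xE2\x82\xAC"), "\n"; // bytes of the euro sign
```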

Member Data Documentation

$caseFolding = []

Definition at line 75 of file CharsetConverter.php.

$charSetArray
Initial value:
= [
'af' => ''

Definition at line 214 of file CharsetConverter.php.

$eucBasedSets
Initial value:
= [
'gb2312' => 1

Definition at line 109 of file CharsetConverter.php.

$fourByteSets
Initial value:
= [
'ucs-4' => 1

Definition at line 99 of file CharsetConverter.php.

$noCharByteVal = 63

Definition at line 61 of file CharsetConverter.php.

$parsedCharsets = []

Definition at line 68 of file CharsetConverter.php.

$synonyms
Initial value:
= [
'us' => 'ascii'

Definition at line 122 of file CharsetConverter.php.

$toASCII = []

Definition at line 82 of file CharsetConverter.php.

$twoByteSets
Initial value:
= [
'ucs-2' => 1
]

Definition at line 89 of file CharsetConverter.php.