TYPO3 CMS main
TYPO3\CMS\IndexedSearch\Lexer Class Reference

Public Member Functions

array split2Words (string $wordString)
addWords (array &$words, string &$wordString, int $start, int $len)
array|bool get_word (string &$str, int $pos = 0)
bool utf8_is_letter (string &$str, int &$len, int $pos = 0)
string|null charType (int $cp)
int|string utf8_ord (string &$str, int &$len, int $pos = 0, bool $hex = false)

Protected Attributes

const CHARTYPE_NUMBER = 'num'
const CHARTYPE_ALPHA = 'alpha'
const CHARTYPE_CJK = 'cjk'
array $lexerConf

Detailed Description

Lexer class for indexed_search. A lexer splits text into words.

Definition at line 23 of file Lexer.php.

Member Function Documentation

◆ addWords()

TYPO3\CMS\IndexedSearch\Lexer::addWords(array &$words, string &$wordString, int $start, int $len)

Adds a word to the word array. Use this function to make sure CJK sequences are split up correctly.

Parameters
    array   $words       Array of accumulated words
    string  $wordString  Complete input string from which to extract the word
    int     $start       Start position of the word in the input string
    int     $len         Length of the word, counted from the start position

Definition at line 84 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
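
The following is a minimal standalone sketch of one way a word could be appended so that CJK sequences end up split. It assumes that "split up" means emitting each CJK character as its own entry, which this page does not state. The function name appendWord() and the use of the \p{Han} script property are illustrative assumptions, not the code in Lexer.php.

```php
<?php
/**
 * Hypothetical helper (not part of TYPO3): appends the word found at
 * byte offset $start with byte length $len to $words.
 * Assumption: a word starting with a Han character is emitted one
 * character at a time; any other word is appended unchanged.
 */
function appendWord(array &$words, string $text, int $start, int $len): void
{
    $word = substr($text, $start, $len);
    if (preg_match('/^\p{Han}/u', $word) === 1) {
        // Split the UTF-8 string into single characters.
        foreach (preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $words[] = $char;
        }
    } else {
        $words[] = $word;
    }
}

$words = [];
$text = 'abc 中文';
appendWord($words, $text, 0, 3); // "abc" occupies bytes 0..2
appendWord($words, $text, 4, 6); // "中文" occupies bytes 4..9 (3 bytes each)
print_r($words);
```

Offsets and lengths are in bytes, matching the byte-oriented parameters documented above.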

◆ charType()

string|null TYPO3\CMS\IndexedSearch\Lexer::charType(int $cp)

Determine the type of character

Parameters
    int  $cp  Unicode code point to evaluate
Returns
    string|null  Main type of the character: num, alpha or cjk (Chinese / Japanese / Korean)

Definition at line 224 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_CJK, and TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_NUMBER.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
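
To illustrate the kind of classification described here, the sketch below maps a code point to one of the three type strings used by the CHARTYPE_* constants. The function name classifyCodePoint() and the ranges are deliberately simplified assumptions (ASCII digits, ASCII letters, and the CJK Unified Ideographs block only). The actual method is expected to cover more ranges than this.

```php
<?php
/**
 * Hypothetical, simplified classifier (not the TYPO3 implementation).
 * Returns 'num', 'alpha', 'cjk', or null when no range matches.
 */
function classifyCodePoint(int $cp): ?string
{
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';   // ASCII digits 0-9
    }
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha'; // ASCII letters A-Z and a-z
    }
    if ($cp >= 0x4E00 && $cp <= 0x9FFF) {
        return 'cjk';   // CJK Unified Ideographs block
    }
    return null;        // anything else is left unclassified in this sketch
}

var_dump(classifyCodePoint(0x35));   // string(3) "num"
var_dump(classifyCodePoint(0x61));   // string(5) "alpha"
var_dump(classifyCodePoint(0x4E2D)); // string(3) "cjk"
var_dump(classifyCodePoint(0x20));   // NULL
```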

◆ get_word()

array|bool TYPO3\CMS\IndexedSearch\Lexer::get_word(string &$str, int $pos = 0)

Gets the first word in a given UTF-8 string. Initial non-letters are skipped.

Parameters
    string  $str  Input string (reference)
    int     $pos  Starting position in the input string
Returns
    array|bool  Array with the start position at index 0 and the length at index 1, or FALSE if no word was found

Definition at line 126 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
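
The return shape (index 0: start, index 1: length, or FALSE) can be illustrated with a small standalone function. This sketch uses a Unicode-aware regular expression instead of utf8_is_letter(), so it only approximates the documented behavior. The name findFirstWord() and the definition of a word as a run of \p{L} and \p{N} characters are assumptions.

```php
<?php
/**
 * Hypothetical stand-in for get_word() (not the TYPO3 implementation).
 * Skips leading non-letters and returns [byteOffset, byteLength] of the
 * first word at or after $pos, or false if there is none.
 */
function findFirstWord(string $str, int $pos = 0): array|false
{
    // PREG_OFFSET_CAPTURE reports the byte offset of the match.
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return false;
    }
    return [$m[0][1], strlen($m[0][0])];
}

var_dump(findFirstWord('  foo bar'));    // [2, 3]
var_dump(findFirstWord('  foo bar', 5)); // [6, 3]
var_dump(findFirstWord('...'));          // false
```

A caller can advance through a string by passing start + length of the previous result as the next $pos.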

◆ split2Words()

array TYPO3\CMS\IndexedSearch\Lexer::split2Words(string $wordString)

Splits a string into words. Used for indexing; can also be used to find the words in a query.

Parameters
    string  $wordString  String with UTF-8 content to process
Returns
    array  Array of words in UTF-8

Definition at line 49 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\get_word().
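
As a rough illustration of what splitting a UTF-8 string into words involves, the sketch below lowercases the input unless case sensitivity is requested (mirroring the 'casesensitive' => false default shown under $lexerConf) and then collects runs of letters and digits. The function splitIntoWords() is hypothetical. It ignores the 'printjoins' setting and any CJK handling, so its output is not guaranteed to match split2Words().

```php
<?php
/**
 * Hypothetical, simplified word splitter (not the TYPO3 implementation).
 * Assumption: a word is a run of Unicode letters and digits.
 */
function splitIntoWords(string $text, bool $caseSensitive = false): array
{
    if (!$caseSensitive) {
        // Prefer the multibyte-aware function when the mbstring extension is loaded.
        $text = function_exists('mb_strtolower')
            ? mb_strtolower($text, 'UTF-8')
            : strtolower($text);
    }
    preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);
    return $matches[0];
}

print_r(splitIntoWords('Hello, World 42!'));
```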

◆ utf8_is_letter()

bool TYPO3\CMS\IndexedSearch\Lexer::utf8_is_letter(string &$str, int &$len, int $pos = 0)

See if a character is a letter (or a string of letters or non-letters).

Parameters
    string  $str  Input string (reference)
    int     $len  Byte length of the character sequence (reference, return value)
    int     $pos  Starting position in the input string
Returns
    bool  TRUE if a letter (or word) was found

Definition at line 151 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_ord()

int|string TYPO3\CMS\IndexedSearch\Lexer::utf8_ord(string &$str, int &$len, int $pos = 0, bool $hex = false)

Converts a UTF-8 multibyte character to a Unicode code point.

Parameters
    string  $str  UTF-8 multibyte character string (reference)
    int     $len  Length of the character in bytes (reference, return value)
    int     $pos  Starting position in the input string
    bool    $hex  If set, a hexadecimal number is returned
Returns
    int|string  Unicode code point

Definition at line 252 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
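
The conversion from a UTF-8 byte sequence to a code point follows the standard UTF-8 bit layout, which the sketch below decodes by hand. The function decodeUtf8CodePoint() is a hypothetical standalone version that assumes well-formed input. The exact string format returned by utf8_ord() when $hex is set is not shown on this page, so the plain dechex() output used here is an assumption.

```php
<?php
/**
 * Hypothetical standalone decoder (not the TYPO3 implementation).
 * Returns the code point of the UTF-8 character starting at byte $pos
 * and writes the number of bytes it occupies to $len.
 * Assumes $str is well-formed UTF-8 and $pos is a character boundary.
 */
function decodeUtf8CodePoint(string $str, int &$len, int $pos = 0, bool $hex = false): int|string
{
    $lead = ord($str[$pos]);
    if ($lead < 0x80) {                  // 0xxxxxxx: 1 byte (ASCII)
        $len = 1;
        $cp = $lead;
    } elseif (($lead & 0xE0) === 0xC0) { // 110xxxxx: 2 bytes
        $len = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) { // 1110xxxx: 3 bytes
        $len = 3;
        $cp = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) { // 11110xxx: 4 bytes
        $len = 4;
        $cp = $lead & 0x07;
    } else {                             // stray continuation byte: pass through
        $len = 1;
        $cp = $lead;
    }
    // Each continuation byte (10xxxxxx) contributes its low 6 bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return $hex ? dechex($cp) : $cp;
}

$len = 0;
var_dump(decodeUtf8CodePoint('A', $len), $len);               // int(65), int(1)
var_dump(decodeUtf8CodePoint("\u{4E2D}", $len), $len);        // int(20013), int(3)
var_dump(decodeUtf8CodePoint("\u{00E9}", $len, 0, true));     // string(2) "e9"
```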

Member Data Documentation

◆ $lexerConf

array TYPO3\CMS\IndexedSearch\Lexer::$lexerConf
protected
Initial value:
= [
    'printjoins' => [
        46,
        45,
        95,
        58,
        47,
        39,
    ],
    'casesensitive' => false,
]

Definition at line 30 of file Lexer.php.
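
The 'printjoins' entries are numeric code points. The short sketch below converts the six values from the initial value above into their characters. All six are ASCII, so chr() is sufficient. The reading that these are characters allowed to join the parts of a word is an assumption based on the key name, not something this page states.

```php
<?php
// Code points copied from the initial value of $lexerConf['printjoins'].
$printJoins = [46, 45, 95, 58, 47, 39];

// All values are below 128, so chr() yields the matching ASCII character.
$chars = array_map('chr', $printJoins);

echo implode(' ', $chars), PHP_EOL; // prints: . - _ : / '
```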

◆ CHARTYPE_ALPHA

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_ALPHA = 'alpha'
protected

◆ CHARTYPE_CJK

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_CJK = 'cjk'
protected

Definition at line 28 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().

◆ CHARTYPE_NUMBER

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_NUMBER = 'num'
protected

Definition at line 25 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().