Public Member Functions
array	split2Words ($wordString)

	addWords (&$words, &$wordString, $start, $len)

array|bool	get_word (&$str, $pos=0)

bool	utf8_is_letter (&$str, &$len, $pos=0)

string|null	charType ($cp)

int|string	utf8_ord (&$str, &$len, $pos=0, $hex=false)

Public Attributes
bool	$debug = false

string	$debugString = ''

array	$lexerConf

Protected Attributes
const	CHARTYPE_NUMBER = 'num'

const	CHARTYPE_ALPHA = 'alpha'

const	CHARTYPE_CJK = 'cjk'

Detailed Description

Lexer class for indexed_search. A lexer splits text into words.

Definition at line 26 of file Lexer.php.

Member Function Documentation

◆ addWords()

TYPO3\CMS\IndexedSearch\Lexer::addWords ( &$words, &$wordString, $start, $len )

Adds a word to the word array. Use this function to make sure CJK sequences are split up in the right way.

Parameters

array	$words	Array of accumulated words
string	$wordString	Complete input string from which to extract the word
int	$start	Start position of the word in the input string
int	$len	Length of the word from the start position

Definition at line 114 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
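
This page does not say how a CJK sequence is split. One common approach, assumed here purely for illustration, is to form overlapping pairs of consecutive characters (ABCDE becomes AB BC CD DE), so that index and query are split the same way. The helper name `sketchCjkPairs` is hypothetical and the code below is a minimal standalone sketch, not the code from Lexer.php.

```php
<?php
// Hypothetical helper, not part of TYPO3: split a run of CJK characters
// into overlapping two-character pairs, e.g. ABCDE => AB BC CD DE.
function sketchCjkPairs(string $run): array
{
    // Split the UTF-8 string into single characters (empty array on failure).
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $count = count($chars);
    if ($count <= 1) {
        return $chars;              // zero or one character: nothing to pair
    }
    $pairs = [];
    for ($i = 0; $i < $count - 1; $i++) {
        $pairs[] = $chars[$i] . $chars[$i + 1];
    }
    return $pairs;
}
```

A single character is returned unchanged in this sketch so that one-character input still yields a searchable token.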

◆ charType()

string|null TYPO3\CMS\IndexedSearch\Lexer::charType ( $cp )

Determine the type of character

Parameters

int $cp Unicode number to evaluate

Returns: string|null Type of char; the main type: num, alpha or CJK (Chinese / Japanese / Korean)

Definition at line 261 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_CJK, and TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_NUMBER.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
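
To illustrate what classifying a code point into a main type can look like, here is a minimal standalone sketch. The function name `sketchCharType` is hypothetical, and the ranges are a small illustrative subset chosen for this example (ASCII digits and letters plus three well-known CJK blocks); the ranges actually used in Lexer.php are not shown on this page and may differ. The returned strings match the values of the CHARTYPE_* constants listed above.

```php
<?php
// Hypothetical classifier, not the TYPO3 implementation.
// The ranges below are an illustrative subset chosen for this sketch.
function sketchCharType(int $cp): ?string
{
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';    // ASCII digits 0-9
    }
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha';  // ASCII letters A-Z and a-z
    }
    if (($cp >= 0x3040 && $cp <= 0x30FF)        // Hiragana and Katakana
        || ($cp >= 0x4E00 && $cp <= 0x9FFF)     // CJK Unified Ideographs
        || ($cp >= 0xAC00 && $cp <= 0xD7AF)) {  // Hangul Syllables
        return 'cjk';
    }
    return null;         // no main type in this sketch
}
```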

◆ get_word()

array|bool TYPO3\CMS\IndexedSearch\Lexer::get_word ( &$str, $pos = 0 )

Gets the first word in a given UTF-8 string (initial non-letters are skipped).

Parameters

string	$str	Input string (reference)
int	$pos	Starting position in input string

Returns: array|bool Array with the start position at index 0 and the length at index 1, or FALSE if no word has been found

Definition at line 162 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
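
The return shape described above (start and length, or FALSE) can be sketched as follows. This is a hypothetical stand-in named `sketchGetWord`, not the TYPO3 code: it uses a Unicode regular expression instead of the character-by-character scan that utf8_is_letter() implies, and treating digits as word characters is an assumption of this sketch. Offsets and lengths are in bytes.

```php
<?php
// Hypothetical stand-in for get_word(), not the TYPO3 code: find the first
// run of letters or digits at or after byte offset $pos.
function sketchGetWord(string $str, int $pos = 0)
{
    // PREG_OFFSET_CAPTURE reports byte offsets, also in /u (UTF-8) mode.
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return false;                       // no word found
    }
    return [$m[0][1], strlen($m[0][0])];    // [start, byte length]
}

// Usage: skip the two leading spaces and report where "hello" sits.
$first = sketchGetWord('  hello world');    // [2, 5]
```

A caller can continue scanning by passing start plus length as the next `$pos`.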

◆ split2Words()

array TYPO3\CMS\IndexedSearch\Lexer::split2Words ( $wordString )

Splits a string into words. Used for indexing; can also be used to find words in a query.

Parameters

string $wordString String with UTF-8 content to process.

Returns: array Array of words in UTF-8

Definition at line 69 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\addWords(), debug(), and TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_is_letter()

bool TYPO3\CMS\IndexedSearch\Lexer::utf8_is_letter ( &$str, &$len, $pos = 0 )

Checks whether a character is a letter (or a string of letters or non-letters).

Parameters

string	$str	Input string (reference)
int	$len	Byte length of the character sequence (reference, return value)
int	$pos	Starting position in input string

Returns: bool Whether a letter (or word) was found

Definition at line 187 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_ord()

int|string TYPO3\CMS\IndexedSearch\Lexer::utf8_ord ( &$str, &$len, $pos = 0, $hex = false )

Converts a UTF-8 multibyte character to a Unicode code point.

Parameters

string	$str	UTF-8 multibyte character string (reference)
int	$len	The length of the character (reference, return value)
int	$pos	Starting position in input string
bool	$hex	If set, a hexadecimal number is returned

Returns: int|string Unicode code point

Definition at line 289 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
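
The conversion described above can be sketched outside the class. The following is a minimal standalone illustration of decoding one UTF-8 sequence into a code point; `sketchUtf8Ord` is a hypothetical name and this is not the code from Lexer.php. It assumes well-formed UTF-8 input, reports the byte length through a reference parameter in the same way as the documented `$len`, and the plain `dechex()` string returned for `$hex` is an assumption about the format.

```php
<?php
// Hypothetical helper, not part of TYPO3: decode the UTF-8 character that
// starts at byte offset $pos and report its byte length in $len.
function sketchUtf8Ord(string $str, &$len, int $pos = 0, bool $hex = false)
{
    $lead = ord($str[$pos]);
    if ($lead < 0x80) {                     // 0xxxxxxx: one byte (ASCII)
        $len = 1;
        $cp = $lead;
    } elseif (($lead & 0xE0) === 0xC0) {    // 110xxxxx: two bytes
        $len = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {    // 1110xxxx: three bytes
        $len = 3;
        $cp = $lead & 0x0F;
    } else {                                // 11110xxx: four bytes (assumed valid)
        $len = 4;
        $cp = $lead & 0x07;
    }
    // Each continuation byte (10xxxxxx) contributes six more bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return $hex ? dechex($cp) : $cp;        // hex format is an assumption
}
```

Because `$len` is filled in on every call, a caller can advance through a string by adding it to `$pos` after each decoded character.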

Member Data Documentation

◆ $debug

bool TYPO3\CMS\IndexedSearch\Lexer::$debug = false

Debugging flag.

Definition at line 37 of file Lexer.php.

◆ $debugString

string TYPO3\CMS\IndexedSearch\Lexer::$debugString = ''

If $debug is set, this string is filled with HTML output highlighting search and non-search words (for backend display).

Definition at line 43 of file Lexer.php.

◆ $lexerConf

array TYPO3\CMS\IndexedSearch\Lexer::$lexerConf

Initial value:

= array( 
        'printjoins' => [
            46, 
            45, 
            95, 
            58, 
            47, 
            39, 
        ],
        'casesensitive' => false, 
        'removeChars' => [],
     )

Configuration of the lexer.

Definition at line 49 of file Lexer.php.
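
The 'printjoins' entries are numeric code points. As a reading aid, this short sketch converts the six values from the initial value above into the characters they stand for; it only decodes the numbers and makes no claim about how the lexer uses them.

```php
<?php
// The 'printjoins' code points copied from the initial value above.
$printjoins = [46, 45, 95, 58, 47, 39];

// All six values are below 128, so chr() yields the matching ASCII character.
$chars = array_map('chr', $printjoins);

echo implode(' ', $chars), PHP_EOL;   // prints: . - _ : / '
```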

◆ CHARTYPE_ALPHA

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_ALPHA = 'alpha'

protected

Definition at line 29 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

◆ CHARTYPE_CJK

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_CJK = 'cjk'

protected

Definition at line 31 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().

◆ CHARTYPE_NUMBER

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_NUMBER = 'num'

protected

Definition at line 28 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().
