TYPO3CMS 11.5
TYPO3\CMS\IndexedSearch\Lexer Class Reference

Public Member Functions

array split2Words ($wordString)
 
 addWords (&$words, &$wordString, $start, $len)
 
array|bool get_word (&$str, $pos=0)
 
bool utf8_is_letter (&$str, &$len, $pos=0)
 
string|null charType ($cp)
 
int|string utf8_ord (&$str, &$len, $pos=0, $hex=false)
 

Public Attributes

bool $debug = false
 
string $debugString = ''
 
array $lexerConf
 

Protected Attributes

const CHARTYPE_NUMBER = 'num'
 
const CHARTYPE_ALPHA = 'alpha'
 
const CHARTYPE_CJK = 'cjk'
 

Detailed Description

Lexer class for indexed_search. A lexer splits text into words.

Definition at line 26 of file Lexer.php.

Member Function Documentation

◆ addWords()

TYPO3\CMS\IndexedSearch\Lexer::addWords (&$words, &$wordString, $start, $len)

Adds a word to the word array. Use this function to make sure CJK sequences are split up correctly.

Parameters
array   $words       Array of accumulated words
string  $wordString  Complete input string from which to extract the word
int     $start       Start position of the word in the input string
int     $len         Length of the word, counted from the start position

Definition at line 114 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
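
CJK text has no spaces between words, so a lexer cannot rely on separators there. One common indexing approach is to break a CJK run into overlapping character pairs (bigrams). The sketch below illustrates that approach only; the helper name `sketchCjkBigrams` is hypothetical, and whether addWords() splits in exactly this way should be checked against Lexer.php.

```php
<?php
// Hypothetical helper: split a CJK run into overlapping two-character words.
// This illustrates the bigram technique, not the class's actual code.
function sketchCjkBigrams(string $run): array
{
    // Split the string into single UTF-8 characters
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) < 2) {
        return $chars; // an empty run or a lone character is returned as-is
    }
    $pairs = [];
    for ($i = 0; $i < count($chars) - 1; $i++) {
        // Pair each character with its right-hand neighbour
        $pairs[] = $chars[$i] . $chars[$i + 1];
    }
    return $pairs;
}

$pairs = sketchCjkBigrams('東京都庁'); // ['東京', '京都', '都庁']
```

Because the index and the search query would be split the same way, a query for a pair matches wherever those two characters appear next to each other.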

◆ charType()

string|null TYPO3\CMS\IndexedSearch\Lexer::charType ($cp)

Determine the type of character

Parameters
int  $cp  Unicode number to evaluate
Returns
string|null  Type of character; the main type: num, alpha or CJK (Chinese / Japanese / Korean)

Definition at line 261 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_CJK, and TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_NUMBER.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
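
As a rough illustration of mapping a code point to one of the three types, the sketch below tests a handful of well-known Unicode ranges. The function name `sketchCharType` and the selection of ranges are assumptions made for this example; the real method covers more scripts, and its exact ranges are defined in Lexer.php.

```php
<?php
// Hypothetical stand-in for Lexer::charType(); the ranges are illustrative only.
function sketchCharType(int $cp): ?string
{
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';
    }
    // Basic Latin letters A-Z and a-z
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha';
    }
    // Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables
    if (($cp >= 0x3040 && $cp <= 0x30FF)
        || ($cp >= 0x4E00 && $cp <= 0x9FFF)
        || ($cp >= 0xAC00 && $cp <= 0xD7AF)) {
        return 'cjk';
    }
    // Anything else (whitespace, punctuation, ...) has no type in this sketch
    return null;
}
```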

◆ get_word()

array|bool TYPO3\CMS\IndexedSearch\Lexer::get_word (&$str, $pos = 0)

Get the first word in a given utf-8 string (initial non-letters will be skipped)

Parameters
string  $str  Input string (reference)
int     $pos  Starting position in input string
Returns
array|bool  Array with 0: start, 1: len; or FALSE if no word has been found

Definition at line 162 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
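
To make the return shape concrete, the sketch below mimics the documented contract (a start/length pair, or FALSE) with a PCRE match instead of the class's own character tests. The name `sketchGetWord` is hypothetical, and treating offsets as byte offsets and digits as word characters are assumptions of this example.

```php
<?php
// Hypothetical stand-in for Lexer::get_word(), built on PCRE rather than
// the class's own character tests. Offsets and lengths are in bytes.
function sketchGetWord(string $str, int $pos = 0)
{
    // \p{L} = any letter, \p{N} = any number; /u enables UTF-8 mode
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) === 1) {
        return [$m[0][1], strlen($m[0][0])]; // 0: start, 1: len
    }
    return false; // no word found
}

$first = sketchGetWord('  foo, bar');     // [2, 3], the word "foo"
$second = sketchGetWord('  foo, bar', 5); // [7, 3], the word "bar"
```

A caller can walk through a string by passing start + len of the previous result as the next `$pos`.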

◆ split2Words()

array TYPO3\CMS\IndexedSearch\Lexer::split2Words ($wordString)

Splitting string into words. Used for indexing, can also be used to find words in query.

Parameters
string  $wordString  String with UTF-8 content to process.
Returns
array  Array of words in UTF-8

Definition at line 69 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\addWords(), debug(), and TYPO3\CMS\IndexedSearch\Lexer\get_word().
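
The input/output relationship can be pictured with a simplified, standalone sketch. It is not the class's algorithm: the name `sketchSplit2Words` is hypothetical, the lowercasing step is an assumption about what a case-insensitive lexer does, and the sketch ignores the printjoins and removeChars settings and any CJK handling. It requires the mbstring extension.

```php
<?php
// Hypothetical stand-in for Lexer::split2Words(); not the class's algorithm.
function sketchSplit2Words(string $wordString, bool $caseSensitive = false): array
{
    if (!$caseSensitive) {
        // Assumption: a case-insensitive lexer lowercases its input first
        $wordString = mb_strtolower($wordString, 'UTF-8');
    }
    // Collect every run of letters and numbers as one word
    preg_match_all('/[\p{L}\p{N}]+/u', $wordString, $matches);
    return $matches[0];
}

$words = sketchSplit2Words('Hello, Wörld! TYPO3 rocks.');
// ['hello', 'wörld', 'typo3', 'rocks']
```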

◆ utf8_is_letter()

bool TYPO3\CMS\IndexedSearch\Lexer::utf8_is_letter (&$str, &$len, $pos = 0)

See if a character is a letter (or a string of letters or non-letters).

Parameters
string  $str  Input string (reference)
int     $len  Byte-length of character sequence (reference, return value)
int     $pos  Starting position in input string
Returns
bool  TRUE if a letter (or word) was found

Definition at line 187 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), TYPO3\CMS\IndexedSearch\Lexer\CHARTYPE_ALPHA, and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_ord()

int|string TYPO3\CMS\IndexedSearch\Lexer::utf8_ord (&$str, &$len, $pos = 0, $hex = false)

Converts a UTF-8 multibyte character to a UNICODE codepoint

Parameters
string  $str  UTF-8 multibyte character string (reference)
int     $len  The length of the character (reference, return value)
int     $pos  Starting position in input string
bool    $hex  If set, a hexadecimal number is returned
Returns
int|string  Unicode code point

Definition at line 289 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
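
For readers unfamiliar with UTF-8 decoding, the sketch below shows the standard way to turn a multibyte sequence into a code point while reporting its byte length through a reference parameter. The name `sketchUtf8Ord` is hypothetical; the sketch assumes well-formed UTF-8 input and leaves out the `$hex` option.

```php
<?php
// Hypothetical stand-in for Lexer::utf8_ord(); assumes well-formed UTF-8 input.
function sketchUtf8Ord(string $str, int &$len, int $pos = 0): int
{
    $lead = ord($str[$pos]);
    if ($lead < 0x80) {                   // 0xxxxxxx: single byte (ASCII)
        $len = 1;
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {        // 110xxxxx: two bytes
        $len = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {  // 1110xxxx: three bytes
        $len = 3;
        $cp = $lead & 0x0F;
    } else {                              // 11110xxx: four bytes
        $len = 4;
        $cp = $lead & 0x07;
    }
    // Each continuation byte (10xxxxxx) contributes six more bits
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return $cp;
}

$len = 0;
$cp = sketchUtf8Ord("a\u{20AC}", $len, 1); // euro sign at byte offset 1
// $cp is 0x20AC and $len is 3
```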

Member Data Documentation

◆ $debug

bool TYPO3\CMS\IndexedSearch\Lexer::$debug = false

Debugging option.

Definition at line 37 of file Lexer.php.

◆ $debugString

string TYPO3\CMS\IndexedSearch\Lexer::$debugString = ''

If set, the debugString is filled with HTML output highlighting search / non-search words (for backend display)

Definition at line 43 of file Lexer.php.

◆ $lexerConf

array TYPO3\CMS\IndexedSearch\Lexer::$lexerConf
Initial value:
= [
    'printjoins' => [46, 45, 95, 58, 47, 39],
    'casesensitive' => false,
    'removeChars' => [],
]

Configuration of the lexer.

Definition at line 49 of file Lexer.php.
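
The printjoins entries are numeric character codes. The short snippet below decodes them so the configured characters are easier to read; it only converts the values listed in the initial value and makes no claim about how the lexer applies them.

```php
<?php
// Values copied from the initial value of $lexerConf['printjoins']
$printjoins = [46, 45, 95, 58, 47, 39];

// All six values are ASCII, so chr() turns each code into its character
$chars = array_map('chr', $printjoins);
// $chars is ['.', '-', '_', ':', '/', "'"]
```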

◆ CHARTYPE_ALPHA

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_ALPHA = 'alpha'
protected

◆ CHARTYPE_CJK

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_CJK = 'cjk'
protected

Definition at line 31 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().

◆ CHARTYPE_NUMBER

const TYPO3\CMS\IndexedSearch\Lexer::CHARTYPE_NUMBER = 'num'
protected

Definition at line 28 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\charType().