TYPO3 12.4 API: Lexer

Lexer

Lexer class for indexed_search. A lexer splits the text into words.

Internal

Constants

CHARTYPE_ALPHA = 'alpha'
CHARTYPE_CJK = 'cjk'
CHARTYPE_NUMBER = 'num'

Properties

$debug : bool: Debugging options:
$debugString : string: If set, the debugString is filled with HTML output highlighting search / non-search words (for backend display)
$lexerConf : array<string|int, mixed>: Configuration of the lexer:

Methods

addWords() : mixed: Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.
charType() : string|null: Determine the type of character
get_word() : array<string|int, mixed>|bool: Get the first word in a given utf-8 string (initial non-letters will be skipped)
split2Words() : array<string|int, mixed>: Splitting string into words.
utf8_is_letter() : bool: See if a character is a letter (or a string of letters or non-letters).
utf8_ord() : int|string: Converts a UTF-8 multibyte character to a UNICODE codepoint

CHARTYPE_ALPHA


    protected mixed CHARTYPE_ALPHA = 'alpha'

CHARTYPE_CJK


    protected mixed CHARTYPE_CJK = 'cjk'

CHARTYPE_NUMBER


    protected mixed CHARTYPE_NUMBER = 'num'

$debug

Debugging options:


    public bool $debug = false

$debugString

If set, the debugString is filled with HTML output highlighting search / non-search words (for backend display)


    public string $debugString = ''

$lexerConf

Configuration of the lexer:


    public array<string|int, mixed> $lexerConf = [
        'printjoins' => [
            46, // .
            45, // -
            95, // _
            58, // :
            47, // /
            39, // '
        ],
        'casesensitive' => false, // Set if case-sensitive indexing is wanted
        'removeChars' => [],
    ]
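The `printjoins` entries are decimal code points. As a small illustration, the following sketch maps the default values back to their characters. It uses `chr()`, which is only valid here because every default value is below 128; a custom configuration with non-ASCII code points would need a UTF-8 aware conversion instead.

```php
<?php
declare(strict_types=1);

// Default 'printjoins' values, copied from the $lexerConf listing.
$printjoins = [46, 45, 95, 58, 47, 39];

// chr() is sufficient because all default values are ASCII (below 128).
$chars = array_map(static fn (int $cp): string => chr($cp), $printjoins);

echo implode(' ', $chars), PHP_EOL;
```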

addWords()

Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.


    public addWords(array<string|int, mixed> &$words, string &$wordString, int $start, int $len) : mixed

Parameters

$words : array<string|int, mixed>: Array of accumulated words
$wordString : string: Complete Input string from where to extract word
$start : int: Start position of word in input string
$len : int: The Length of the word string from start position
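This page does not say how a CJK sequence is split. One common technique for text without word separators is to emit overlapping two-character words (bigrams). The sketch below shows that technique only. The helper name `cjkBigrams()` is hypothetical, and it is an assumption that `addWords()` behaves similarly.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of the Lexer API:
 * split a run of CJK characters into overlapping two-character words.
 */
function cjkBigrams(string $run): array
{
    // Split the UTF-8 string into single characters.
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // invalid UTF-8 input
    }
    $count = count($chars);
    if ($count <= 1) {
        return $chars; // a lone character is kept as its own word
    }
    $words = [];
    for ($i = 0; $i < $count - 1; $i++) {
        $words[] = $chars[$i] . $chars[$i + 1];
    }
    return $words;
}

print_r(cjkBigrams('東京都'));
```

Overlapping pairs let a query match any two adjacent characters of the indexed text, at the cost of a larger word list.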

charType()

Determine the type of character


    public charType(int $cp) : string|null

Parameters

$cp : int: Unicode number to evaluate

Return values

string|null: The main type of the character: num, alpha or cjk (Chinese / Japanese / Korean)
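To illustrate classification by code point, here is a deliberately simplified sketch. The function name `classifyCodePoint()` and the chosen ranges (ASCII digits, ASCII letters, Hiragana, Katakana and CJK Unified Ideographs) are assumptions made for this example. The ranges that `charType()` actually checks are not listed on this page and are likely broader.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical, simplified classifier; not the Lexer implementation.
 * Returns 'num', 'alpha', 'cjk', or null for anything else.
 */
function classifyCodePoint(int $cp): ?string
{
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';
    }
    // ASCII letters A-Z and a-z
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha';
    }
    // Hiragana and Katakana (U+3040..U+30FF), CJK Unified Ideographs (U+4E00..U+9FFF)
    if (($cp >= 0x3040 && $cp <= 0x30FF) || ($cp >= 0x4E00 && $cp <= 0x9FFF)) {
        return 'cjk';
    }
    return null;
}

var_dump(classifyCodePoint(0x41));   // 'A'
var_dump(classifyCodePoint(0x6771)); // '東'
```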

get_word()

Get the first word in a given utf-8 string (initial non-letters will be skipped)


    public get_word(string &$str[, int $pos = 0 ]) : array<string|int, mixed>|bool

Parameters

$str : string: Input string (reference)
$pos : int = 0: Starting position in input string

Return values

array<string|int, mixed>|bool: Array with index 0: start and index 1: len, or FALSE if no word has been found
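A return value of the form `[start, len]` or `FALSE` lends itself to a scanning loop. The sketch below uses a hypothetical stand-in, `firstWord()`, with the same return shape, so it runs without TYPO3. Treating letters and digits as word characters, and measuring start and length in bytes, are assumptions of this example.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in with the same return shape as get_word():
 * [start, len] in bytes, or false when no word is found.
 */
function firstWord(string $str, int $pos = 0): array|false
{
    // Letters and digits count as word characters in this sketch.
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return false;
    }
    // PREG_OFFSET_CAPTURE reports byte offsets; strlen() gives the byte length.
    return [$m[0][1], strlen($m[0][0])];
}

$text = '  héllo wörld';
$words = [];
$pos = 0;
while (($hit = firstWord($text, $pos)) !== false) {
    [$start, $len] = $hit;
    $words[] = substr($text, $start, $len);
    $pos = $start + $len; // continue right after the word just found
}

print_r($words);
```

Compare the result with `!== false` rather than a loose check, so that a valid array is never mistaken for "no word found".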

split2Words()

Splitting string into words.


    public split2Words(string $wordString) : array<string|int, mixed>

Used for indexing, can also be used to find words in query.

Parameters

$wordString : string: String with UTF-8 content to process.

Return values

array<string|int, mixed>: Array of words in UTF-8

utf8_is_letter()

See if a character is a letter (or a string of letters or non-letters).


    public utf8_is_letter(string &$str, int &$len[, int $pos = 0 ]) : bool

Parameters

$str : string: Input string (reference)
$len : int: Byte-length of character sequence (reference, return value)
$pos : int = 0: Starting position in input string

Return values

bool: TRUE if a letter (or word) was found

utf8_ord()

Converts a UTF-8 multibyte character to a UNICODE codepoint


    public utf8_ord(string &$str, int &$len[, int $pos = 0 ][, bool $hex = false ]) : int|string

Parameters

$str : string: UTF-8 multibyte character string (reference)
$len : int: The length of the character (reference, return value)
$pos : int = 0: Starting position in input string
$hex : bool = false: If set, then a hex. number is returned

Return values

int|string: UNICODE codepoint (as a hex number if $hex is set)
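As an illustration of how a UTF-8 byte sequence maps to a code point and a byte length, here is a minimal standalone decoder. The function `decodeUtf8At()` is hypothetical and returns both values as an array instead of using a reference parameter. It does not validate continuation bytes, so it is a sketch rather than a replacement for `utf8_ord()`.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: decode the UTF-8 character starting at byte $pos.
 * Returns [code point, byte length]. Continuation bytes are not validated.
 */
function decodeUtf8At(string $str, int $pos = 0): array
{
    if ($pos < 0 || $pos >= strlen($str)) {
        throw new InvalidArgumentException('Position is outside the string.');
    }
    $byte = ord($str[$pos]);
    if ($byte < 0x80) {
        return [$byte, 1]; // plain ASCII
    }
    // The lead byte encodes the sequence length and the top bits of the code point.
    if (($byte & 0xE0) === 0xC0) {
        $len = 2;
        $cp = $byte & 0x1F;
    } elseif (($byte & 0xF0) === 0xE0) {
        $len = 3;
        $cp = $byte & 0x0F;
    } elseif (($byte & 0xF8) === 0xF0) {
        $len = 4;
        $cp = $byte & 0x07;
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte.');
    }
    if ($pos + $len > strlen($str)) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence.');
    }
    // Each continuation byte contributes its low six bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return [$cp, $len];
}

[$cp, $len] = decodeUtf8At('東');
echo dechex($cp), ' ', $len, PHP_EOL;
```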
