TYPO3 main API: Lexer

Lexer

Lexer class for indexed_search. A lexer splits text into words.

Internal

Constants

CHARTYPE_ALPHA = 'alpha'
CHARTYPE_CJK = 'cjk'
CHARTYPE_NUMBER = 'num'

Properties

$lexerConf : array<string|int, mixed>

Methods

addWords() : void: Add a word to the word array. Use this function to make sure CJK sequences are split up correctly.
charType() : string|null: Determine the type of a character
get_word() : array<string|int, mixed>|bool: Get the first word in a given UTF-8 string (initial non-letters will be skipped)
split2Words() : array<string|int, mixed>: Split a string into words.
utf8_is_letter() : bool: See if a character is a letter (or a string of letters or non-letters).
utf8_ord() : int|string: Convert a UTF-8 multibyte character to a Unicode code point

CHARTYPE_ALPHA


    protected mixed CHARTYPE_ALPHA = 'alpha'

CHARTYPE_CJK


    protected mixed CHARTYPE_CJK = 'cjk'

CHARTYPE_NUMBER


    protected mixed CHARTYPE_NUMBER = 'num'

$lexerConf


    protected array<string|int, mixed> $lexerConf = [
        'printjoins' => [
            46, // .
            45, // -
            95, // _
            58, // :
            47, // /
            39, // '
        ],
        'casesensitive' => false,
    ]
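
The printjoins entries are code points in the ASCII range. As a quick way to see which characters they stand for, the following standalone sketch maps them back with PHP's chr(). The variable names are illustrative and not part of the Lexer class, and how the lexer uses these characters is not shown here.

```php
<?php
// Code points copied from the default $lexerConf['printjoins'] value.
$printjoins = [46, 45, 95, 58, 47, 39];

// chr() turns a byte value (0-255) into the corresponding one-byte string.
$printjoinChars = array_map('chr', $printjoins);

echo implode(' ', $printjoinChars), PHP_EOL; // prints: . - _ : / '
```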

addWords()

Add a word to the word array. Use this function to make sure CJK sequences are split up correctly.


    public addWords(array<string|int, mixed> &$words, string &$wordString, int $start, int $len) : void

Parameters

$words : array<string|int, mixed>: Array of accumulated words (reference)
$wordString : string: Complete input string from which to extract the word (reference)
$start : int: Start position of the word in the input string
$len : int: Length of the word, counted from the start position

charType()

Determine the type of character


    public charType(int $cp) : string|null

Parameters

$cp : int: Unicode code point to evaluate

Return values

string|null: Type of char; the main type: num, alpha or CJK (Chinese / Japanese / Korean)
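
To make the three character types concrete, here is a minimal sketch of a code point classifier. It is not the Lexer's own implementation: the function name sketchCharType() is hypothetical, and the ranges are a small selection of well-known Unicode blocks chosen for illustration, so the real method may well cover more ranges or draw the lines differently.

```php
<?php
/**
 * Hypothetical classifier, not the Lexer's own code: maps a code point to
 * 'num', 'alpha', 'cjk', or null using a few well-known Unicode blocks.
 */
function sketchCharType(int $cp): ?string
{
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';
    }
    // ASCII letters A-Z and a-z
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha';
    }
    // Latin-1 letters, skipping the multiplication and division signs
    if ($cp >= 0xC0 && $cp <= 0xFF && $cp !== 0xD7 && $cp !== 0xF7) {
        return 'alpha';
    }
    // Hiragana and Katakana, CJK Unified Ideographs, Hangul Syllables
    if (($cp >= 0x3040 && $cp <= 0x30FF)
        || ($cp >= 0x4E00 && $cp <= 0x9FFF)
        || ($cp >= 0xAC00 && $cp <= 0xD7AF)) {
        return 'cjk';
    }
    // Anything else (whitespace, punctuation, unlisted blocks) has no type here.
    return null;
}

var_dump(sketchCharType(0x4E2D)); // string(3) "cjk"
```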

get_word()

Get the first word in a given UTF-8 string (initial non-letters will be skipped)


    public get_word(string &$str[, int $pos = 0 ]) : array<string|int, mixed>|bool

Parameters

$str : string: Input string (reference)
$pos : int = 0: Starting position in input string

Return values

array<string|int, mixed>|bool: Array with the start position at index 0 and the length at index 1, or FALSE if no word has been found
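
The start/length return shape is easiest to read next to substr(). The sketch below imitates that contract with a regular expression instead of calling the Lexer. The function sketchGetWord() is hypothetical, it treats any run of Unicode letters or digits as a word, and it assumes both values are byte offsets counted from the beginning of the string.

```php
<?php
/**
 * Hypothetical stand-in for get_word(): returns [start, len] in bytes for the
 * first run of letters or digits at or after $pos, or false if there is none.
 */
function sketchGetWord(string $str, int $pos = 0)
{
    // With PREG_OFFSET_CAPTURE, $m[0] is [matched text, byte offset].
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return false;
    }
    return [$m[0][1], strlen($m[0][0])];
}

$text = '  --foo bar';
$hit = sketchGetWord($text);
if ($hit !== false) {
    [$start, $len] = $hit;
    echo substr($text, $start, $len), PHP_EOL; // prints: foo
}
```

Because FALSE and a match at position 0 must be told apart, the result is compared with !== false rather than tested for truthiness.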

split2Words()

Splitting string into words.


    public split2Words(string $wordString) : array<string|int, mixed>

Used for indexing; can also be used to find words in a query.

Parameters

$wordString : string: String with UTF-8 content to process.

Return values

array<string|int, mixed>: Array of words in UTF-8
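
As a rough picture of the input and output shapes, the following sketch splits a UTF-8 string into words with a single regular expression. It is a deliberate simplification and not the Lexer's algorithm: sketchSplit2Words() is a hypothetical name, and the print-join characters, the case sensitivity setting and the CJK handling are all left out.

```php
<?php
/**
 * Hypothetical simplification of split2Words(): every run of Unicode letters
 * or digits becomes one word. Print-join characters and CJK handling are
 * deliberately left out.
 */
function sketchSplit2Words(string $wordString): array
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    preg_match_all('/[\p{L}\p{N}]+/u', $wordString, $matches);
    // On invalid UTF-8 the match fails; fall back to an empty list.
    return $matches[0] ?? [];
}

print_r(sketchSplit2Words('Hello, world! 42 times'));
```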

utf8_is_letter()

See if a character is a letter (or a string of letters or non-letters).


    public utf8_is_letter(string &$str, int &$len[, int $pos = 0 ]) : bool

Parameters

$str : string: Input string (reference)
$len : int: Byte-length of character sequence (reference, return value)
$pos : int = 0: Starting position in input string

Return values

bool: TRUE if a letter (or word) was found

utf8_ord()

Converts a UTF-8 multibyte character to a Unicode code point


    public utf8_ord(string &$str, int &$len[, int $pos = 0 ][, bool $hex = false ]) : int|string

Parameters

$str : string: UTF-8 multibyte character string (reference)
$len : int: The length of the character (reference, return value)
$pos : int = 0: Starting position in input string
$hex : bool = false: If set, a hexadecimal number is returned

Return values

int|string: Unicode code point
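
The conversion follows directly from the UTF-8 byte layout: the lead byte tells how many bytes the character occupies, and each continuation byte contributes six more bits. The sketch below works through that by hand. It is an illustration rather than the Lexer's code: sketchUtf8Ord() is a hypothetical name, the input is assumed to be well-formed UTF-8, and the exact string format the real method returns for $hex is not specified here, so plain dechex() output is used as a placeholder.

```php
<?php
/**
 * Hypothetical stand-in for utf8_ord(): decodes the character starting at
 * byte offset $pos and reports its byte length through $len.
 */
function sketchUtf8Ord(string $str, ?int &$len, int $pos = 0, bool $hex = false)
{
    $lead = ord($str[$pos]);
    if ($lead < 0x80) {                    // 0xxxxxxx: single-byte ASCII
        $len = 1;
        $cp = $lead;
    } elseif (($lead & 0xE0) === 0xC0) {   // 110xxxxx: two-byte sequence
        $len = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {   // 1110xxxx: three-byte sequence
        $len = 3;
        $cp = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) {   // 11110xxx: four-byte sequence
        $len = 4;
        $cp = $lead & 0x07;
    } else {                               // stray continuation or invalid byte
        $len = 1;
        $cp = $lead;
    }
    // Each continuation byte (10xxxxxx) carries six payload bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    // Assumption: bare hex digits; the real return format may differ.
    return $hex ? dechex($cp) : $cp;
}

$len = null;
echo sketchUtf8Ord("\xE4\xB8\xAD", $len), ' ', $len, PHP_EOL; // prints: 20013 3
```

Returning the byte length through $len lets a caller advance its position by exactly one character per call.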
