Lexer

Lexer class for indexed_search. A lexer splits the text into words.

Internal

Table of Contents

Constants

CHARTYPE_ALPHA  = 'alpha'
CHARTYPE_CJK  = 'cjk'
CHARTYPE_NUMBER  = 'num'

Properties

$lexerConf  : array<string|int, mixed>

Methods

addWords()  : void
Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.
charType()  : string|null
Determine the type of character
get_word()  : array<string|int, mixed>|bool
Get the first word in a given UTF-8 string (initial non-letters will be skipped)
split2Words()  : array<string|int, mixed>
Splitting string into words.
utf8_is_letter()  : bool
See if a character is a letter (or a string of letters or non-letters).
utf8_ord()  : int|string
Converts a UTF-8 multibyte character to a Unicode code point

Constants

CHARTYPE_ALPHA

protected mixed CHARTYPE_ALPHA = 'alpha'

CHARTYPE_CJK

protected mixed CHARTYPE_CJK = 'cjk'

CHARTYPE_NUMBER

protected mixed CHARTYPE_NUMBER = 'num'

Properties

$lexerConf

protected array<string|int, mixed> $lexerConf = [
    'printjoins' => [
        46, // .
        45, // -
        95, // _
        58, // :
        47, // /
        39, // '
    ],
    'casesensitive' => false
]
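
The 'printjoins' values are decimal code points. The following minimal sketch maps them back to the characters they stand for; `$printJoins` is a local copy made for illustration and is not read from the class.

```php
<?php
// Local copy of the default 'printjoins' list (illustrative, not read from the Lexer).
$printJoins = [46, 45, 95, 58, 47, 39];

// All values are below 128, so chr() yields the matching ASCII character.
$chars = array_map('chr', $printJoins);

echo implode(' ', $chars), PHP_EOL; // prints: . - _ : / '
```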

Methods

addWords()

Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.

public addWords(array<string|int, mixed> &$words, string &$wordString, int $start, int $len) : void
Parameters
$words : array<string|int, mixed>

Array of accumulated words

$wordString : string

Complete input string from which to extract the word

$start : int

Start position of word in input string

$len : int

Length of the word string from the start position
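
This page does not say how CJK sequences are split. The sketch below is one possible reading, not the class's actual code: `addWordsSketch()` and its `$isCjk` flag are hypothetical, `$start` and `$len` are assumed to be byte offsets, and the choice of overlapping two-character pieces for CJK runs is an assumption.

```php
<?php
/**
 * Hypothetical stand-in for addWords(): appends the byte range
 * [$start, $start + $len) of $wordString to $words.
 * Assumption: a CJK run is stored as overlapping two-character pieces.
 */
function addWordsSketch(array &$words, string $wordString, int $start, int $len, bool $isCjk = false): void
{
    $word = substr($wordString, $start, $len); // byte-based extraction
    if (!$isCjk) {
        $words[] = $word;
        return;
    }
    // Split the run into UTF-8 characters.
    $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) < 2) {
        $words[] = $word;
        return;
    }
    // Emit each pair of neighbouring characters.
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $words[] = $chars[$i] . $chars[$i + 1];
    }
}

$words = [];
addWordsSketch($words, 'hello world', 6, 5);    // plain word
addWordsSketch($words, '東京都', 0, 9, true);    // 3 characters, 9 bytes
print_r($words); // world, 東京, 京都
```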

charType()

Determine the type of character

public charType(int $cp) : string|null
Parameters
$cp : int

Unicode code point to evaluate

Return values
string|null

Main type of the character: 'num', 'alpha' or 'cjk' (Chinese / Japanese / Korean); null when none of these applies
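
To illustrate the idea of classifying a code point, here is a deliberately simplified sketch. `charTypeSketch()` is hypothetical, and its ranges (ASCII digits, ASCII letters, the CJK Unified Ideographs block) are assumptions chosen for brevity; the real method very likely covers more ranges.

```php
<?php
/**
 * Simplified, hypothetical classifier. The ranges are illustrative only.
 */
function charTypeSketch(int $cp): ?string
{
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';   // ASCII digits 0-9
    }
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha'; // ASCII letters A-Z and a-z
    }
    if ($cp >= 0x4E00 && $cp <= 0x9FFF) {
        return 'cjk';   // CJK Unified Ideographs block
    }
    return null;        // none of the ranges above
}

var_dump(charTypeSketch(0x37));   // '7'  -> "num"
var_dump(charTypeSketch(0x61));   // 'a'  -> "alpha"
var_dump(charTypeSketch(0x6771)); // '東' -> "cjk"
var_dump(charTypeSketch(0x20));   // ' '  -> NULL
```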

get_word()

Get the first word in a given UTF-8 string (initial non-letters will be skipped)

public get_word(string &$str[, int $pos = 0 ]) : array<string|int, mixed>|bool
Parameters
$str : string

Input string (reference)

$pos : int = 0

Starting position in input string

Return values
array<string|int, mixed>|bool

Array with 0: start position, 1: length; FALSE if no word has been found
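
The return shape (start and length, or FALSE) can be sketched with a regular expression. `firstWordSketch()` is a hypothetical helper, not the class's implementation. It treats any run of Unicode letters or digits as a word and reports byte offsets, which is an assumption about how positions are counted.

```php
<?php
/**
 * Hypothetical helper: find the first run of letters/digits at or after $pos.
 * Returns [start, length] in bytes, or false if there is no such run.
 */
function firstWordSketch(string $str, int $pos = 0)
{
    // \p{L} matches any letter, \p{N} any number; offsets are reported in bytes.
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) === 1) {
        return [$m[0][1], strlen($m[0][0])];
    }
    return false;
}

var_dump(firstWordSketch('  --hello world'));    // [4, 5]
var_dump(firstWordSketch('  --hello world', 9)); // [10, 5]
var_dump(firstWordSketch('!!!'));                // false
```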

split2Words()

Splitting string into words.

public split2Words(string $wordString) : array<string|int, mixed>

Used for indexing; can also be used to find words in a query.

Parameters
$wordString : string

String with UTF-8 content to process.

Return values
array<string|int, mixed>

Array of words in UTF-8
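
As a rough picture of what splitting a string into words produces, consider this sketch. `splitWordsSketch()` is hypothetical; lowercasing the input is assumed from the 'casesensitive' => false default, and the sketch ignores 'printjoins' and any special CJK handling.

```php
<?php
/**
 * Hypothetical splitter: returns every run of letters/digits in $wordString.
 * Assumption: input is lowercased unless $caseSensitive is true.
 */
function splitWordsSketch(string $wordString, bool $caseSensitive = false): array
{
    if (!$caseSensitive) {
        // Prefer the multibyte-aware function when the mbstring extension is loaded.
        $wordString = function_exists('mb_strtolower')
            ? mb_strtolower($wordString, 'UTF-8')
            : strtolower($wordString);
    }
    preg_match_all('/[\p{L}\p{N}]+/u', $wordString, $matches);
    return $matches[0];
}

print_r(splitWordsSketch('Hello, World! PHP 8')); // hello, world, php, 8
```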

utf8_is_letter()

See if a character is a letter (or a string of letters or non-letters).

public utf8_is_letter(string &$str, int &$len[, int $pos = 0 ]) : bool
Parameters
$str : string

Input string (reference)

$len : int

Byte-length of character sequence (reference, return value)

$pos : int = 0

Starting position in input string

Return values
bool

TRUE if a letter (or word) was found

utf8_ord()

Converts a UTF-8 multibyte character to a Unicode code point

public utf8_ord(string &$str, int &$len[, int $pos = 0 ][, bool $hex = false ]) : int|string
Parameters
$str : string

UTF-8 multibyte character string (reference)

$len : int

The length of the character (reference, return value)

$pos : int = 0

Starting position in input string

$hex : bool = false

If set, a hexadecimal number is returned

Return values
int|string

Unicode code point (as a hexadecimal string if $hex is set)
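
The conversion from a UTF-8 byte sequence to a code point follows the standard UTF-8 bit layout. The sketch below shows that decoding under the assumption of well-formed input; `utf8OrdSketch()` is hypothetical, and the exact format of the hexadecimal return value in the real method is not stated on this page.

```php
<?php
/**
 * Hypothetical decoder: reads one UTF-8 character at byte offset $pos.
 * Sets $len to the number of bytes consumed. Assumes well-formed input.
 */
function utf8OrdSketch(string $str, ?int &$len, int $pos = 0, bool $hex = false)
{
    $byte = ord($str[$pos]);
    if ($byte < 0x80) {                   // 0xxxxxxx: 1 byte
        $len = 1;
        $cp = $byte;
    } elseif (($byte & 0xE0) === 0xC0) {  // 110xxxxx: 2 bytes
        $len = 2;
        $cp = $byte & 0x1F;
    } elseif (($byte & 0xF0) === 0xE0) {  // 1110xxxx: 3 bytes
        $len = 3;
        $cp = $byte & 0x0F;
    } elseif (($byte & 0xF8) === 0xF0) {  // 11110xxx: 4 bytes
        $len = 4;
        $cp = $byte & 0x07;
    } else {                              // invalid lead byte: the sketch gives up
        $len = 1;
        $cp = 0;
    }
    // Each continuation byte (10xxxxxx) contributes its low 6 bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return $hex ? dechex($cp) : $cp;
}

$len = 0;
var_dump(utf8OrdSketch('€', $len));          // int(8364), i.e. U+20AC
var_dump($len);                              // int(3)
var_dump(utf8OrdSketch('a€', $len, 1, true)); // string "20ac"
```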


        