Lexer
Lexer class for indexed_search. A lexer splits the text into words.
Table of Contents
Constants
- CHARTYPE_ALPHA = 'alpha'
- CHARTYPE_CJK = 'cjk'
- CHARTYPE_NUMBER = 'num'
Properties
- $debug : bool
- Debugging options:
- $debugString : string
- If set, the debugString is filled with HTML output highlighting search / non-search words (for backend display)
- $lexerConf : array<string|int, mixed>
- Configuration of the lexer:
Methods
- addWords() : mixed
- Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.
- charType() : string|null
- Determine the type of character
- get_word() : array<string|int, mixed>|bool
- Get the first word in a given utf-8 string (initial non-letters will be skipped)
- split2Words() : array<string|int, mixed>
- Splitting string into words.
- utf8_is_letter() : bool
- See if a character is a letter (or a string of letters or non-letters).
- utf8_ord() : int|string
- Converts a UTF-8 multibyte character to a UNICODE codepoint
Constants
CHARTYPE_ALPHA
protected
mixed CHARTYPE_ALPHA = 'alpha'
CHARTYPE_CJK
protected
mixed CHARTYPE_CJK = 'cjk'
CHARTYPE_NUMBER
protected
mixed CHARTYPE_NUMBER = 'num'
Properties
$debug
Debugging options:
public
bool $debug = false
$debugString
If set, the debugString is filled with HTML output highlighting search / non-search words (for backend display)
public
string $debugString = ''
$lexerConf
Configuration of the lexer:
public
array<string|int, mixed> $lexerConf = [
    'printjoins' => [
        46, // .
        45, // -
        95, // _
        58, // :
        47, // /
        39, // '
    ],
    'casesensitive' => false, // Set if case-sensitive indexing is wanted
    'removeChars' => [],
]
Methods
addWords()
Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.
public
addWords(array<string|int, mixed> &$words, string &$wordString, int $start, int $len) : mixed
Parameters
- $words : array<string|int, mixed>
  Array of accumulated words
- $wordString : string
  Complete input string from which to extract the word
- $start : int
  Start position of the word in the input string
- $len : int
  Length of the word string from the start position
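The description above does not say how CJK sequences are split. One common approach for text without word delimiters is to index overlapping two-character pairs (bigrams). The following is a minimal sketch under that assumption; `cjkBigrams()` is a hypothetical helper for illustration, not a method of this class.

```php
<?php
// Hypothetical helper: split a run of CJK characters into overlapping
// two-character pairs. The bigram strategy is an assumption, not taken
// from the Lexer documentation.
function cjkBigrams(string $run): array
{
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $run, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || count($chars) === 0) {
        return [];
    }
    // A single character is kept as a word of its own.
    if (count($chars) === 1) {
        return [$chars[0]];
    }
    $pairs = [];
    for ($i = 0; $i < count($chars) - 1; $i++) {
        $pairs[] = $chars[$i] . $chars[$i + 1];
    }
    return $pairs;
}

// Accumulate the pairs into a word array, as addWords() does with $words.
$words = [];
foreach (cjkBigrams("\u{65E5}\u{672C}\u{8A9E}") as $pair) { // 日本語
    $words[] = $pair;
}
```

Overlapping pairs let a query match any two adjacent characters, at the cost of a larger index.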
charType()
Determine the type of character
public
charType(int $cp) : string|null
Parameters
- $cp : int
  Unicode number to evaluate
Return values
string|null —Type of char; the main type: num, alpha or CJK (Chinese / Japanese / Korean)
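To illustrate how a code point can be mapped to one of the three main types, here is a simplified sketch. The function name `charTypeSketch()` and the specific ranges are assumptions chosen for illustration; they cover far fewer scripts than a real lexer would need.

```php
<?php
// Simplified classifier. The ranges are illustrative assumptions only.
function charTypeSketch(int $cp): ?string
{
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return 'num';
    }
    // ASCII letters A-Z and a-z
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return 'alpha';
    }
    // Hiragana/Katakana, CJK Unified Ideographs, Hangul Syllables
    if (($cp >= 0x3040 && $cp <= 0x30FF)
        || ($cp >= 0x4E00 && $cp <= 0x9FFF)
        || ($cp >= 0xAC00 && $cp <= 0xD7AF)) {
        return 'cjk';
    }
    // Anything else is treated as "no known type".
    return null;
}
```

Returning `null` for unclassified code points mirrors the `string|null` return type documented above.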
get_word()
Get the first word in a given utf-8 string (initial non-letters will be skipped)
public
get_word(string &$str[, int $pos = 0 ]) : array<string|int, mixed>|bool
Parameters
- $str : string
  Input string (reference)
- $pos : int = 0
  Starting position in input string
Return values
array<string|int, mixed>|bool —0: start, 1: len or FALSE if no word has been found
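The return shape (start and length, or FALSE) can be sketched with a regular expression. `firstWordSketch()` is a hypothetical stand-in, not the method itself: it assumes a word is any run of Unicode letters or digits and ignores the `printjoins` configuration.

```php
<?php
// Hypothetical stand-in: find the first run of letters/digits at or after
// byte offset $pos and return [start, length] in bytes, or false.
function firstWordSketch(string $str, int $pos = 0)
{
    if ($pos < 0 || $pos > strlen($str)) {
        return false;
    }
    // With PREG_OFFSET_CAPTURE each match is [text, byteOffset].
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos) !== 1) {
        return false;
    }
    [$word, $start] = $m[0];
    return [$start, strlen($word)];
}

$input = '  -- hello world';
$first = firstWordSketch($input);      // leading non-letters are skipped
$second = firstWordSketch($input, 10); // continue after the first word
```

Both values are byte offsets, so `substr($input, $first[0], $first[1])` recovers the word.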
split2Words()
Splitting string into words.
public
split2Words(string $wordString) : array<string|int, mixed>
Used for indexing; can also be used to find words in a query.
Parameters
- $wordString : string
  String with UTF-8 content to process.
Return values
array<string|int, mixed> —Array of words in utf-8
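As a rough illustration of the input and output of word splitting, the sketch below lowercases the text unless case-sensitive handling is requested and then collects every run of letters or digits. `splitToWordsSketch()` and its `$caseSensitive` parameter are hypothetical; the sketch assumes the mbstring extension is available and does not handle CJK text or `printjoins` specially.

```php
<?php
// Hypothetical stand-in for word splitting on UTF-8 input.
function splitToWordsSketch(string $wordString, bool $caseSensitive = false): array
{
    if (!$caseSensitive) {
        // Fold case so "Hello" and "hello" index as the same word.
        $wordString = mb_strtolower($wordString, 'UTF-8');
    }
    // Collect every run of Unicode letters or digits.
    if (preg_match_all('/[\p{L}\p{N}]+/u', $wordString, $matches) === false) {
        return [];
    }
    return $matches[0];
}

$indexed = splitToWordsSketch('Hello, World! 42');
$exact = splitToWordsSketch('Hello, World! 42', true);
```

Applying the same function to both indexed content and search queries keeps the two word lists comparable.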
utf8_is_letter()
See if a character is a letter (or a string of letters or non-letters).
public
utf8_is_letter(string &$str, int &$len[, int $pos = 0 ]) : bool
Parameters
- $str : string
  Input string (reference)
- $len : int
  Byte-length of character sequence (reference, return value)
- $pos : int = 0
  Starting position in input string
Return values
bool —letter (or word) found
utf8_ord()
Converts a UTF-8 multibyte character to a UNICODE codepoint
public
utf8_ord(string &$str, int &$len[, int $pos = 0 ][, bool $hex = false ]) : int|string
Parameters
- $str : string
  UTF-8 multibyte character string (reference)
- $len : int
  The length of the character (reference, return value)
- $pos : int = 0
  Starting position in input string
- $hex : bool = false
  If set, a hexadecimal number is returned
Return values
int|string —UNICODE codepoint
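The conversion from a UTF-8 byte sequence to a code point follows from the encoding's bit layout. The sketch below is a hypothetical `utf8OrdSketch()`, not the method itself: it reports the byte length through `$len`, omits the `$hex` option, and does not validate continuation bytes.

```php
<?php
// Hypothetical decoder: returns the code point of the character starting at
// byte offset $pos and writes its byte length to $len. Returns 0 on failure.
function utf8OrdSketch(string $str, ?int &$len = null, int $pos = 0): int
{
    $len = 0;
    if ($pos < 0 || $pos >= strlen($str)) {
        return 0; // nothing to decode
    }
    $lead = ord($str[$pos]);
    if ($lead < 0x80) {            // 0xxxxxxx: single-byte (ASCII)
        $len = 1;
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) { // 110xxxxx: two bytes
        $need = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) { // 1110xxxx: three bytes
        $need = 3;
        $cp = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) { // 11110xxx: four bytes
        $need = 4;
        $cp = $lead & 0x07;
    } else {
        $len = 1;
        return 0; // stray continuation byte or invalid lead byte
    }
    if ($pos + $need > strlen($str)) {
        $len = 1;
        return 0; // truncated sequence
    }
    // Each continuation byte contributes its low six bits.
    for ($i = 1; $i < $need; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    $len = $need;
    return $cp;
}

$cpE = utf8OrdSketch("a\u{00E9}", $lenE, 1); // U+00E9, two bytes
$cpJ = utf8OrdSketch("\u{8A9E}", $lenJ);     // U+8A9E, three bytes
```

Returning the byte length alongside the code point lets a caller advance `$pos` to the next character without rescanning.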