TYPO3\CMS\IndexedSearch\Lexer Class Reference

Public Member Functions
	__construct ()

	split2Words ($wordString)

	addWords (&$words, &$wordString, $start, $len)

	get_word (&$str, $pos=0)

	utf8_is_letter (&$str, &$len, $pos=0)

	charType ($cp)

	utf8_ord (&$str, &$len, $pos=0, $hex=FALSE)

Public Attributes
	$debug = FALSE

	$debugString = ''

	$csObj

	$lexerConf

Detailed Description

This file is part of the TYPO3 CMS project.

It is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, either version 2 of the License, or any later version.

For the full copyright and license information, please read the LICENSE.txt file that was distributed with this source code.

The TYPO3 project - inspiring people to share!

Lexer class for indexed_search. A lexer splits the text into words.

Author: Kasper Skårhøj <kasperYYYY@typo3.com>

Definition at line 27 of file Lexer.php.

Constructor & Destructor Documentation

◆ __construct()

TYPO3\CMS\IndexedSearch\Lexer::__construct ( )

Constructor: Initializes the charset class

Returns: void

Todo: Define visibility

Definition at line 66 of file Lexer.php.

References TYPO3\CMS\Core\Utility\GeneralUtility\makeInstance().

Member Function Documentation

◆ addWords()

TYPO3\CMS\IndexedSearch\Lexer::addWords ( &$words, &$wordString, $start, $len )

Add a word to the word array. This function should be used to make sure CJK sequences are split up in the right way.

Parameters

array	&$words	Array of accumulated words
string	&$wordString	Complete input string from which to extract the word
integer	$start	Start position of the word in the input string
integer	$len	Length of the word from the start position

Returns: void

Todo: Define visibility

Definition at line 122 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().

◆ charType()

TYPO3\CMS\IndexedSearch\Lexer::charType ( $cp )

Determine the type of character

Parameters

integer $cp Unicode code point to evaluate

Returns: array Type of character; index 0 holds the main type: num, alpha or CJK (Chinese / Japanese / Korean)

Todo: Define visibility

Definition at line 264 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
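
The classification described above can be illustrated with a minimal sketch. This is not the Lexer::charType() source: the function name sketchCharType, the returned strings 'num', 'alpha' and 'cjk', the empty array for unclassified code points, and the choice of Unicode blocks are all assumptions made for illustration. The block boundaries themselves are the standard Unicode ranges.

```php
<?php
/**
 * Hypothetical stand-in for a charType()-style classifier.
 * Returns an array whose index 0 is the main type, or an empty
 * array when the code point is neither digit, letter nor CJK.
 */
function sketchCharType(int $cp): array
{
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return ['num'];
    }

    // ASCII letters plus Latin-1 letters (excluding the multiplication
    // sign U+00D7 and the division sign U+00F7); an assumed, narrow set.
    $isAsciiLetter = ($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A);
    $isLatin1Letter = $cp >= 0xC0 && $cp <= 0xFF && $cp !== 0xD7 && $cp !== 0xF7;
    if ($isAsciiLetter || $isLatin1Letter) {
        return ['alpha'];
    }

    // A few well-known CJK blocks; a real lexer may cover more.
    $cjkBlocks = [
        [0x3040, 0x309F], // Hiragana
        [0x30A0, 0x30FF], // Katakana
        [0x4E00, 0x9FFF], // CJK Unified Ideographs
        [0xAC00, 0xD7AF], // Hangul Syllables
    ];
    foreach ($cjkBlocks as [$first, $last]) {
        if ($cp >= $first && $cp <= $last) {
            return ['cjk'];
        }
    }

    return [];
}

echo sketchCharType(0x4E2D)[0], "\n"; // prints: cjk
```

Keeping the ranges in a table makes it easy to extend the sketch with further blocks without touching the control flow.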

◆ get_word()

TYPO3\CMS\IndexedSearch\Lexer::get_word ( &$str, $pos = 0 )

Get the first word in a given UTF-8 string (initial non-letters will be skipped).

Parameters

string	&$str	Input string (reference)
integer	$pos	Starting position in input string

Returns: array with index 0: start, index 1: len; or FALSE if no word has been found

Todo: Define visibility

Definition at line 169 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().
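
The return contract above (start and length, or FALSE) can be sketched as follows. This sketch swaps in a Unicode-aware regular expression instead of scanning character by character, so it is not how Lexer::get_word() itself works, and it ignores the printjoins configuration. The names sketchGetWord and sketchSplitWords are hypothetical, and the loop is only a guess at how a caller such as split2Words() might consume the result. Offsets and lengths are in bytes here, which is an assumption.

```php
<?php
/**
 * Find the first run of letters or digits at or after byte offset $pos.
 * Returns [start, len] in bytes, or false if no word is found.
 */
function sketchGetWord(string $str, int $pos = 0)
{
    // \p{L} = any letter, \p{N} = any number; /u enables UTF-8 mode.
    $found = preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos);
    if ($found !== 1) {
        return false;
    }
    // With PREG_OFFSET_CAPTURE each match is [text, byteOffset].
    return [$m[0][1], strlen($m[0][0])];
}

/**
 * Collect all words by repeatedly asking for the next word
 * and advancing past it.
 */
function sketchSplitWords(string $str): array
{
    $words = [];
    $pos = 0;
    while (($hit = sketchGetWord($str, $pos)) !== false) {
        [$start, $len] = $hit;
        $words[] = substr($str, $start, $len);
        $pos = $start + $len;
    }
    return $words;
}

print_r(sketchSplitWords('Hello, world 42!')); // Hello, world, 42
```

Returning the position and length rather than the word itself lets the caller decide how to cut and post-process the substring, which matches the documented array shape.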

◆ split2Words()

TYPO3\CMS\IndexedSearch\Lexer::split2Words ( $wordString )

Splits a string into words. Used for indexing; can also be used to find words in a query.

Parameters

string $wordString String with UTF-8 content to process.

Returns: array Array of words in UTF-8

Todo: Define visibility

Definition at line 78 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\addWords(), debug(), and TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_is_letter()

TYPO3\CMS\IndexedSearch\Lexer::utf8_is_letter ( &$str, &$len, $pos = 0 )

See if a character is a letter (or a string of letters or non-letters).

Parameters

string	&$str	Input string (reference)
integer	&$len	Byte-length of character sequence (reference, return value)
integer	$pos	Starting position in input string

Returns: boolean TRUE if a letter (or word) was found

Todo: Define visibility

Definition at line 194 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), TYPO3\CMS\Core\Utility\GeneralUtility\inList(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_ord()

TYPO3\CMS\IndexedSearch\Lexer::utf8_ord ( &$str, &$len, $pos = 0, $hex = FALSE )

Converts a UTF-8 multibyte character to a Unicode code point.

Parameters

string	&$str	UTF-8 multibyte character string (reference)
integer	&$len	The length of the character (reference, return value)
integer	$pos	Starting position in input string
boolean	$hex	If set, then a hex. number is returned

Returns: integer UNICODE codepoint

Todo: Define visibility

Definition at line 291 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
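
The conversion described above follows the standard UTF-8 bit layout, which a minimal sketch can make concrete. This is not the Lexer::utf8_ord() source: the function name sketchUtf8Ord is hypothetical, the $hex option is left out, and the input is assumed to be well-formed UTF-8 with $pos pointing at the first byte of a character.

```php
<?php
/**
 * Decode the UTF-8 character that starts at byte offset $pos.
 *
 * @param string $str UTF-8 encoded input (assumed well-formed)
 * @param int    $len Set to the byte length of the decoded character
 * @param int    $pos Byte offset of the character's first byte
 * @return int Unicode code point
 */
function sketchUtf8Ord(string $str, &$len, int $pos = 0): int
{
    $lead = ord($str[$pos]);

    if ($lead < 0x80) {                     // 0xxxxxxx: one byte (ASCII)
        $len = 1;
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {          // 110xxxxx: two bytes
        $len = 2;
        $cp = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) {    // 1110xxxx: three bytes
        $len = 3;
        $cp = $lead & 0x0F;
    } else {                                // 11110xxx: four bytes
        $len = 4;
        $cp = $lead & 0x07;
    }

    // Each continuation byte (10xxxxxx) contributes six payload bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    return $cp;
}

$len = 0;
$cp = sketchUtf8Ord("Sk\u{E5}rh\u{F8}j", $len, 2);
printf("U+%04X, %d bytes\n", $cp, $len); // prints: U+00E5, 2 bytes
```

Reporting the byte length through the reference parameter lets a caller step through a string one character at a time by adding $len to $pos after each call.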

Member Data Documentation

◆ $csObj

TYPO3\CMS\IndexedSearch\Lexer::$csObj

Definition at line 47 of file Lexer.php.

◆ $debug

TYPO3\CMS\IndexedSearch\Lexer::$debug = FALSE

Todo: Define visibility

Definition at line 33 of file Lexer.php.

◆ $debugString

TYPO3\CMS\IndexedSearch\Lexer::$debugString = ''

Todo: Define visibility

Definition at line 39 of file Lexer.php.

◆ $lexerConf

TYPO3\CMS\IndexedSearch\Lexer::$lexerConf

Initial value:

= array(
        'printjoins' => array(46, 45, 95, 58, 47, 39),
        'casesensitive' => FALSE,
        'removeChars' => array(45)
    )

Todo: Define visibility

Definition at line 53 of file Lexer.php.
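
The numeric entries in the initial value are easier to read once mapped to characters. The sketch below assumes the numbers are character codes, which fits their ASCII range but is not stated on this page, and uses a hypothetical helper name, sketchCodesToChars. Because all listed values are below 128, PHP's single-byte chr() is sufficient.

```php
<?php
/**
 * Map a list of character codes (all below 128 here) to characters.
 */
function sketchCodesToChars(array $codes): array
{
    return array_map('chr', $codes);
}

// Values copied from the documented initial value of $lexerConf.
$printjoins = [46, 45, 95, 58, 47, 39];
$removeChars = [45];

echo implode(' ', sketchCodesToChars($printjoins)), "\n";  // prints: . - _ : / '
echo implode(' ', sketchCodesToChars($removeChars)), "\n"; // prints: -
```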
