TYPO3 CMS  TYPO3_6-2
TYPO3\CMS\IndexedSearch\Lexer Class Reference
Inherited by tx_indexedsearch_lexer.

Public Member Functions

 __construct ()
 
 split2Words ($wordString)
 
 addWords (&$words, &$wordString, $start, $len)
 
 get_word (&$str, $pos=0)
 
 utf8_is_letter (&$str, &$len, $pos=0)
 
 charType ($cp)
 
 utf8_ord (&$str, &$len, $pos=0, $hex=FALSE)
 

Public Attributes

 $debug = FALSE
 
 $debugString = ''
 
 $csObj
 
 $lexerConf
 

Detailed Description

This file is part of the TYPO3 CMS project.

It is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, either version 2 of the License, or any later version.

For the full copyright and license information, please read the LICENSE.txt file that was distributed with this source code.

The TYPO3 project - inspiring people to share!

Lexer class for indexed_search. A lexer splits the text into words.

Author
Kasper Skårhøj <kasperYYYY@typo3.com>

Definition at line 27 of file Lexer.php.

Constructor & Destructor Documentation

◆ __construct()

TYPO3\CMS\IndexedSearch\Lexer::__construct ( )

Constructor: Initializes the charset class

Returns
void
Todo:
Define visibility

Definition at line 66 of file Lexer.php.

References TYPO3\CMS\Core\Utility\GeneralUtility\makeInstance().

Member Function Documentation

◆ addWords()

TYPO3\CMS\IndexedSearch\Lexer::addWords (&$words, &$wordString, $start, $len)

Add word to word-array. This function should be used to make sure CJK sequences are split up in the right way.

Parameters
array $words: Array of accumulated words (reference)
string $wordString: Complete input string from which to extract the word (reference)
integer $start: Start position of the word in the input string
integer $len: Length of the word string from the start position
Returns
void
Todo:
Define visibility

Definition at line 122 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().

◆ charType()

TYPO3\CMS\IndexedSearch\Lexer::charType ($cp)

Determine the type of character.

Parameters
integer $cp: Unicode code point to evaluate
Returns
array Type of char; index-0: the main type: num, alpha or CJK (Chinese / Japanese / Korean)
Todo:
Define visibility

Definition at line 264 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
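
To make the documented return shape concrete, here is a minimal standalone sketch of a code point classifier. It is not the code from Lexer.php: the function name `sketchCharType`, the exact type strings, the empty array for unclassified input, and the small set of ranges are all assumptions chosen for illustration. A real classifier would cover many more scripts.

```php
<?php
/**
 * Hypothetical classifier for illustration only; not the TYPO3 implementation.
 * Returns an array whose index 0 is the main type, or an empty array
 * when the code point matches none of the ranges below (an assumption).
 */
function sketchCharType($cp) {
    // ASCII digits 0-9
    if ($cp >= 0x30 && $cp <= 0x39) {
        return array('num');
    }
    // ASCII letters only, to keep the sketch short
    if (($cp >= 0x41 && $cp <= 0x5A) || ($cp >= 0x61 && $cp <= 0x7A)) {
        return array('alpha');
    }
    // A few well-known CJK blocks
    if (($cp >= 0x3040 && $cp <= 0x30FF)     // Hiragana and Katakana
        || ($cp >= 0x4E00 && $cp <= 0x9FFF)  // CJK Unified Ideographs
        || ($cp >= 0xAC00 && $cp <= 0xD7AF)  // Hangul Syllables
    ) {
        return array('cjk');
    }
    return array();
}

var_dump(sketchCharType(0x35));   // digit "5"
var_dump(sketchCharType(0x4E2D)); // a CJK ideograph
```

Returning an array rather than a plain string mirrors the documented "index-0: the main type" wording, which leaves room for further indexes.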

◆ get_word()

TYPO3\CMS\IndexedSearch\Lexer::get_word (&$str, $pos = 0)

Get the first word in a given UTF-8 string (initial non-letters will be skipped).

Parameters
string $str: Input string (reference)
integer $pos: Starting position in input string
Returns
array 0: start, 1: len; or FALSE if no word has been found
Todo:
Define visibility

Definition at line 169 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\split2Words().

◆ split2Words()

TYPO3\CMS\IndexedSearch\Lexer::split2Words ($wordString)

Splits a string into words. Used for indexing; can also be used to find words in a query.

Parameters
string $wordString: String with UTF-8 content to process
Returns
array Array of words in utf-8
Todo:
Define visibility

Definition at line 78 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\addWords(), debug(), and TYPO3\CMS\IndexedSearch\Lexer\get_word().
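
The reference list above suggests a loop in which a word finder reports a start offset and a length, and each hit is appended to the result array. The following standalone sketch shows that general shape only. It swaps in a Unicode-aware regular expression for the character-by-character scanning, ignores `$lexerConf` and CJK handling entirely, and uses the hypothetical names `sketchGetWord` and `sketchSplit2Words`; none of it is taken from Lexer.php.

```php
<?php
/**
 * Hypothetical word finder: returns array(start, len) in bytes for the
 * first run of letters or digits at or after $pos, or FALSE if none.
 */
function sketchGetWord(&$str, $pos = 0) {
    if (preg_match('/[\p{L}\p{N}]+/u', $str, $m, PREG_OFFSET_CAPTURE, $pos)) {
        // $m[0][0] is the matched text, $m[0][1] its byte offset
        return array($m[0][1], strlen($m[0][0]));
    }
    return FALSE;
}

/**
 * Hypothetical splitter: collects words by calling the finder repeatedly,
 * resuming right after the previous word each time.
 */
function sketchSplit2Words($wordString) {
    $words = array();
    $pos = 0;
    while (($found = sketchGetWord($wordString, $pos)) !== FALSE) {
        list($start, $len) = $found;
        $words[] = substr($wordString, $start, $len);
        $pos = $start + $len;
    }
    return $words;
}

print_r(sketchSplit2Words('Indexed search, TYPO3 6.2!'));
// Indexed, search, TYPO3, 6, 2
```

Note that this sketch splits "6.2" into two words because it treats every non-letter, non-digit character as a separator. How the real lexer treats such joining characters is governed by its configuration and is not reproduced here.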

◆ utf8_is_letter()

TYPO3\CMS\IndexedSearch\Lexer::utf8_is_letter (&$str, &$len, $pos = 0)

See if a character is a letter (or a string of letters or non-letters).

Parameters
string $str: Input string (reference)
integer $len: Byte-length of character sequence (reference, return value)
integer $pos: Starting position in input string
Returns
boolean TRUE if a letter (or word) was found
Todo:
Define visibility

Definition at line 194 of file Lexer.php.

References TYPO3\CMS\IndexedSearch\Lexer\charType(), TYPO3\CMS\Core\Utility\GeneralUtility\inList(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_ord().

Referenced by TYPO3\CMS\IndexedSearch\Lexer\get_word().

◆ utf8_ord()

TYPO3\CMS\IndexedSearch\Lexer::utf8_ord (&$str, &$len, $pos = 0, $hex = FALSE)

Converts a UTF-8 multibyte character to a Unicode code point.

Parameters
string $str: UTF-8 multibyte character string (reference)
integer $len: The length of the character (reference, return value)
integer $pos: Starting position in input string
boolean $hex: If set, a hexadecimal number is returned
Returns
integer Unicode code point
Todo:
Define visibility

Definition at line 291 of file Lexer.php.

Referenced by TYPO3\CMS\IndexedSearch\Lexer\addWords(), and TYPO3\CMS\IndexedSearch\Lexer\utf8_is_letter().
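
As a rough illustration of what decoding a UTF-8 character into a code point involves, the following standalone sketch reads one character starting at a byte offset and reports its byte length by reference. The name `sketchUtf8Ord` is hypothetical and this is not the code from Lexer.php. It assumes well-formed UTF-8 input, and the format of the hexadecimal result (a plain `dechex()` string) is likewise an assumption.

```php
<?php
/**
 * Hypothetical decoder for illustration only; not the TYPO3 implementation.
 * Decodes the UTF-8 character starting at byte offset $pos and stores its
 * byte length in $len. Assumes well-formed UTF-8 input.
 */
function sketchUtf8Ord(&$str, &$len, $pos = 0, $hex = FALSE) {
    $byte = ord($str[$pos]);
    if ($byte < 0x80) {                    // 0xxxxxxx: single byte (ASCII)
        $len = 1;
        $cp = $byte;
    } elseif (($byte & 0xE0) === 0xC0) {   // 110xxxxx: two-byte sequence
        $len = 2;
        $cp = $byte & 0x1F;
    } elseif (($byte & 0xF0) === 0xE0) {   // 1110xxxx: three-byte sequence
        $len = 3;
        $cp = $byte & 0x0F;
    } else {                               // 11110xxx: four-byte sequence
        $len = 4;
        $cp = $byte & 0x07;
    }
    // Each continuation byte (10xxxxxx) contributes six more bits.
    for ($i = 1; $i < $len; $i++) {
        $cp = ($cp << 6) | (ord($str[$pos + $i]) & 0x3F);
    }
    // Assumed hex format: lowercase digits without a prefix.
    return $hex ? dechex($cp) : $cp;
}

$text = "a\xC3\xA9\xE4\xB8\xAD"; // "a", U+00E9, U+4E2D
$len = 0;
echo sketchUtf8Ord($text, $len, 0), "\n"; // 97
echo sketchUtf8Ord($text, $len, 1), "\n"; // 233
echo sketchUtf8Ord($text, $len, 3), "\n"; // 20013
```

Returning the length by reference lets a caller advance through the string one character at a time by adding `$len` to the current position.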

Member Data Documentation

◆ $csObj

TYPO3\CMS\IndexedSearch\Lexer::$csObj

Definition at line 47 of file Lexer.php.

◆ $debug

TYPO3\CMS\IndexedSearch\Lexer::$debug = FALSE
Todo:
Define visibility

Definition at line 33 of file Lexer.php.

◆ $debugString

TYPO3\CMS\IndexedSearch\Lexer::$debugString = ''
Todo:
Define visibility

Definition at line 39 of file Lexer.php.

◆ $lexerConf

TYPO3\CMS\IndexedSearch\Lexer::$lexerConf
Initial value:
= array(
    'printjoins' => array(46, 45, 95, 58, 47, 39),
    'casesensitive' => FALSE,
    'removeChars' => array(45)
)
Todo:
Define visibility

Definition at line 53 of file Lexer.php.
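
The integers in `printjoins` and `removeChars` are easier to read as characters. Assuming they are character code points (which the code point handling elsewhere on this page suggests, but which this page does not state), the short sketch below prints them. The helper variable names are hypothetical.

```php
<?php
// Values copied from the documented initial value of $lexerConf.
$lexerConf = array(
    'printjoins' => array(46, 45, 95, 58, 47, 39),
    'casesensitive' => FALSE,
    'removeChars' => array(45)
);

// chr() is sufficient here because every listed value is below 128 (ASCII).
$printjoins = array_map('chr', $lexerConf['printjoins']);
$removeChars = array_map('chr', $lexerConf['removeChars']);

echo implode(' ', $printjoins), "\n";  // . - _ : / '
echo implode(' ', $removeChars), "\n"; // -
```

Under that assumption, `printjoins` lists the period, hyphen, underscore, colon, slash and apostrophe, and `removeChars` lists only the hyphen.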