TYPO3 13.4 API: Indexer

Indexer

Indexing class for TYPO3 frontend

Internal

Properties

$conf : array<string|int, mixed>: Configuration set internally (see init functions for required keys and their meaning)
$content_md5h : string: Content hash of the TYPO3 page
$defaultIndexingDataPayload : array<string|int, mixed>
$excludeSections : string: HTML code blocks to exclude from indexing
$external_parsers : array<string|int, mixed>: Supported Extensions for external files
$externalFileCounter : int
$file_phash_arr : array<string|int, mixed>: Hash array for files
$flagBitMask : int
$forceIndexing : bool: If set, indexing is forced (despite content hashes, mtime etc).
$freqMax : float
$freqRange : int
$hash : array<string|int, mixed>: Hash array, contains phash and phash_grouping
$indexerConfig : array<string|int, mixed>: Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search
$indexExternalUrl_content : string
$indexingDataStringDto : IndexingDataAsString
$maxExternalFiles : int: Max number of external files to index.
$tstamp_minAge : int: If set, this is the minimum number of seconds that must pass before a document can be indexed again. This applies regardless of mtime.
$wordcount : int

Methods

__construct() : mixed
analyzeBody() : void: Calculates relevant information for body content
analyzeHeaderinfo() : void: Calculates relevant information for header content
bodyDescription() : string: Extracts the sample description text from the content array.
charsetEntity2utf8() : void: Convert character set and HTML entities in the value of input content array keys
checkContentHash() : array<string|int, mixed>|true: Check content hash in phash table
checkExternalDocContentHash() : bool: Checks the content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result).
checkWordList() : void: Adds new words to db
convertHTMLToUtf8() : string: Converts a HTML document to utf-8
embracingTags() : bool: Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections.
extractBaseHref() : string: Extracts the "base href" from content string.
extractHyperLinks() : array<int, array{tag: string, href: string, localPath: string}>: Extracts all links to external documents from the HTML content string
extractLinks() : void: Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.
fileContentParts() : array<string|int, mixed>: Creates an array with pointers to divisions of document.
freqMap() : int: Maps frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax treated as 1, and back again.
getHTMLcharset() : string: Extract the charset value from HTML meta tag.
getIndexStatus() : IndexStatus: Check the mtime / tstamp of the currently indexed page/file (based on phash)
getRootLineFields() : void: Adding values for root-line fields.
getUrlHeaders() : array<string, string>|false: Getting HTTP request headers of URL
indexAnalyze() : array<string|int, mixed>: Analyzes content to use for indexing.
indexExternalUrl() : void: Indexes the HTML content of an external URL
indexRegularDocument() : void: Indexing a regular document given as $file (relative to public web path, local file)
indexTypo3PageContent() : void: Start indexing of the TYPO3 page
init() : void
initializeExternalParsers() : void
is_grlist_set() : bool: Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)
log_setTSlogMessage() : void
processWordsInArrays() : IndexingDataAsArray: Processes the words in the array returned by the split*Content functions. Values are ensured to be unique.
readFileContent() : IndexingDataAsString|null: Reads the content of an external file being indexed.
removeOldIndexedFiles() : void: Removes records for the indexed file, $phash
removeOldIndexedPages() : void: Removes records for the indexed page, $phash
setExtHashes() : array{phash_grouping: string, phash: string}: Get search hash, external files
setT3Hashes() : void: Get search hash, T3 pages
splitHTMLContent() : IndexingDataAsString: Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.
splitRegularContent() : IndexingDataAsString: Splits non-HTML content (from external files for instance)
submit_grlist() : void: Stores gr_list in the database.
submit_section() : void: Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but different for external files.
submitFile_section() : void: Stores file section for a file IF it does not exist
submitFilePage() : void: Updates db with information about the file
submitPage() : void: Updates db with information about the page (TYPO3 page, not external media)
submitWords() : void: Submits RELATIONS between words and phash
typoSearchTags() : bool: Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.
update_grlist() : void: Checks if a grlist entry for this hash exists and, if not, writes one.
updateParsetime() : void: Update parse time for phash row.
updateRootline() : void: Update section rootline for the page
updateSetId() : void: Update SetID of the index_phash record.
updateTstamp() : void: Update tstamp for a phash row.
addSpacesToKeywordList() : string: Makes sure that keywords are space-separated. This is important for displaying them properly as part of the fulltext index.
createLocalPath() : string: Checks if the file is local
createLocalPathFromAbsoluteURL() : string: Attempts to create a local file path from the absolute URL without schema.
createLocalPathFromRelativeURL() : string: Attempts to create a local file path from the relative URL.
createLocalPathUsingAbsRefPrefix() : string: Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.
createLocalPathUsingDomainURL() : string: Attempts to create a local file path by matching a current request URL.
isAllowedLocalFile() : bool: Checks if the path points to the file inside the website
isRelativeURL() : bool: Checks if URL is relative.
milliseconds() : int: Gets the current Unix time in milliseconds.

$conf

Configuration set internally (see init functions for required keys and their meaning)


        public
            array<string|int, mixed>
    $conf
     = []

$content_md5h

Content hash of the TYPO3 page


        public
            string
    $content_md5h
     = ''

$defaultIndexingDataPayload


        public
            array<string|int, mixed>
    $defaultIndexingDataPayload
     = ['title' => '', 'description' => '', 'keywords' => '', 'body' => '']

$excludeSections

HTML code blocks to exclude from indexing


        public
            string
    $excludeSections
     = 'script,style'

$external_parsers

Supported Extensions for external files


        public
            array<string|int, mixed>
    $external_parsers
     = []

$externalFileCounter


        public
            int
    $externalFileCounter
     = 0

$file_phash_arr

Hash array for files


        public
            array<string|int, mixed>
    $file_phash_arr
     = []

$flagBitMask


        public
            int
    $flagBitMask

$forceIndexing

If set, indexing is forced (despite content hashes, mtime etc).


        public
            bool
    $forceIndexing
     = false

$freqMax


        public
            float
    $freqMax
     = 0.1

$freqRange


        public
            int
    $freqRange
     = 32000

$hash

Hash array, contains phash and phash_grouping


        public
            array<string|int, mixed>
    $hash
     = []

$indexerConfig

Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search


        public
            array<string|int, mixed>
    $indexerConfig
     = []

$indexExternalUrl_content


        public
            string
    $indexExternalUrl_content
     = ''

$indexingDataStringDto


        public
            IndexingDataAsString
    $indexingDataStringDto

$maxExternalFiles

Max number of external files to index.


        public
            int
    $maxExternalFiles
     = 0

$tstamp_minAge

If set, this is the minimum number of seconds that must pass before a document can be indexed again. This applies regardless of mtime.


        public
            int
    $tstamp_minAge
     = 0

$wordcount


        public
            int
    $wordcount
     = 0

__construct()


    public
                    __construct(TimeTracker $timeTracker, Lexer $lexer, RequestFactory $requestFactory, ConnectionPool $connectionPool, ExtensionConfiguration $extensionConfiguration) : mixed

Parameters

$timeTracker : TimeTracker
$lexer : Lexer
$requestFactory : RequestFactory
$connectionPool : ConnectionPool
$extensionConfiguration : ExtensionConfiguration

analyzeBody()

Calculates relevant information for body content


    public
                    analyzeBody(array<string|int, mixed> &$retArr, IndexingDataAsArray $indexingDataDto) : void

Parameters

$retArr : array<string|int, mixed>: Index array, passed by reference
$indexingDataDto : IndexingDataAsArray

analyzeHeaderinfo()

Calculates relevant information for header content


    public
                    analyzeHeaderinfo(array<string|int, mixed> &$retArr, array<string|int, mixed> $content, int $offset) : void

Parameters

$retArr : array<string|int, mixed>: Index array, passed by reference
$content : array<string|int, mixed>: Standard content array
$offset : int: Bit-wise priority to type

bodyDescription()

Extracts the sample description text from the content array.


    public
                    bodyDescription(IndexingDataAsString $indexingDataDto) : string

Parameters

$indexingDataDto : IndexingDataAsString

Return values

string

charsetEntity2utf8()

Convert character set and HTML entities in the value of input content array keys


    public
                    charsetEntity2utf8(IndexingDataAsString $indexingDataDto) : void

Parameters

$indexingDataDto : IndexingDataAsString

checkContentHash()

Check content hash in phash table


    public
                    checkContentHash() : array<string|int, mixed>|true

Return values

array<string|int, mixed>|true —

Returns TRUE if the page needs to be indexed (that is, there was no result), otherwise the phash value (in an array) of the phash record to which the grlist_record should be related!

checkExternalDocContentHash()

Checks the content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result).


    public
                    checkExternalDocContentHash(string $hashGr, string $content_md5h) : bool

Parameters

$hashGr : string: phash value to check (phash_grouping)
$content_md5h : string: Content hash to check

Return values

bool

checkWordList()

Adds new words to db


    public
                    checkWordList(array<string|int, mixed> $wordListArray) : void

Parameters

$wordListArray : array<string|int, mixed>: Word List array (where each word has information about position, etc.).

convertHTMLToUtf8()

Converts a HTML document to utf-8


    public
                    convertHTMLToUtf8(string $content[, string $charset = '' ]) : string

Parameters

$content : string
$charset : string = ''

Return values

string

embracingTags()

Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections.


    public
                    embracingTags(string $string, string $tagName, string|null &$tagContent, string|null &$stringAfter, string|null &$paramList) : bool

Parameters

$string : string: String to search in
$tagName : string: Tag name, e.g. "script"
$tagContent : string|null: Passed by reference: Content inside found tag
$stringAfter : string|null: Passed by reference: Content after found tag
$paramList : string|null: Passed by reference: Attributes of the found tag.

Return values

bool
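The by-reference contract described above can be sketched outside TYPO3 as follows. This is a minimal, hypothetical stand-alone function (embracingTagsSketch is not part of the Indexer API) that uses a regular expression; the real method may locate the tags differently. It follows the method description in treating $stringAfter as the original string with the matched element removed.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not the TYPO3 method itself.
 * Finds the first <tagName ...>...</tagName> pair, case-insensitively.
 */
function embracingTagsSketch(
    string $string,
    string $tagName,
    ?string &$tagContent,
    ?string &$stringAfter,
    ?string &$paramList
): bool {
    $tag = preg_quote($tagName, '/');
    $pattern = '/<' . $tag . '\b([^>]*)>(.*?)<\/' . $tag . '\s*>/is';
    if (preg_match($pattern, $string, $m, PREG_OFFSET_CAPTURE) !== 1) {
        // No match: the by-reference arguments are left untouched.
        return false;
    }
    [$whole, $offset] = $m[0];
    $paramList = trim($m[1][0]);   // attributes of the opening tag
    $tagContent = $m[2][0];        // text between opening and closing tag
    // Original string with the matched element cut out.
    $stringAfter = substr($string, 0, $offset) . substr($string, $offset + strlen($whole));
    return true;
}

$html = '<head><title lang="en">Hello</title><meta name="x"></head>';
$found = embracingTagsSketch($html, 'title', $content, $rest, $params);
```

After the call, $content holds the title text, $params the attribute string, and $rest the markup without the title element.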

extractBaseHref()

Extracts the "base href" from content string.


    public
                    extractBaseHref(string $html) : string

Parameters

$html : string

Return values

string
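As an illustration of what extracting the "base href" can look like, here is a small sketch with a hypothetical function name (extractBaseHrefSketch). The pattern is an assumption and only handles the common case of a <base> element with an href attribute; it returns an empty string when none is present.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: returns the href of the first <base> element, or ''. */
function extractBaseHrefSketch(string $html): string
{
    // Accepts quoted or unquoted attribute values.
    $pattern = '/<base\s[^>]*?href\s*=\s*["\']?([^"\'>\s]*)/i';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : '';
}

$base = extractBaseHrefSketch('<head><base href="https://example.org/sub/"></head>');
$none = extractBaseHrefSketch('<head><title>No base</title></head>');
```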

extractHyperLinks()

Extracts all links to external documents from the HTML content string


    public
                    extractHyperLinks(string $html) : array<int, array{tag: string, href: string, localPath: string}>

Parameters

$html : string

Return values

array<int, array{tag: string, href: string, localPath: string}>

extractLinks()

Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.


    public
                    extractLinks(string $content) : void

Parameters

$content : string

fileContentParts()

Creates an array with pointers to divisions of document.


    public
                    fileContentParts(string $ext, string $absFile) : array<string|int, mixed>

Parameters

$ext : string: File extension
$absFile : string: Absolute filename (must exist and be validated OK before calling function)

Return values

array<string|int, mixed> —

Array of pointers to sections that the document should be divided into

freqMap()

Maps frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax treated as 1, and back again.


    public
                    freqMap(float $freq) : int

Parameters

$freq : float: Frequency

Return values

int —

Frequency in range.
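One plausible reading of the forward mapping is a linear scale that saturates at $freqMax. The sketch below is a hypothetical stand-alone function (freqMapSketch), not the TYPO3 implementation; it takes the documented defaults of $freqMax (0.1) and $freqRange (32000) as parameters and does not cover the reverse direction.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: scales a frequency in [0;1] to [0;$freqRange].
 * Assumption: everything at or above $freqMax maps to the top of the range.
 */
function freqMapSketch(float $freq, float $freqMax = 0.1, int $freqRange = 32000): int
{
    if ($freq >= $freqMax) {
        return $freqRange;
    }
    // Linear scaling below the cap; negative input is clamped to zero.
    return (int) round(max(0.0, $freq) / $freqMax * $freqRange);
}

$half = freqMapSketch(0.05); // half of freqMax
$capped = freqMapSketch(0.9); // far above freqMax
```

Storing the result as a bounded integer keeps the value compact while still separating rare words from frequent ones.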

getHTMLcharset()

Extract the charset value from HTML meta tag.


    public
                    getHTMLcharset(string $content) : string

Parameters

$content : string

Return values

string
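A charset lookup of this kind can be sketched with a single regular expression. The function name getHtmlCharsetSketch and the pattern are assumptions for illustration; the sketch accepts both the http-equiv form and the short meta charset form, and returns an empty string if no charset is declared.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: returns the charset named in a <meta> tag, or ''. */
function getHtmlCharsetSketch(string $content): string
{
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_\-]+)/i';
    if (preg_match($pattern, $content, $m) === 1) {
        return $m[1];
    }
    return '';
}

$short = getHtmlCharsetSketch('<meta charset="utf-8">');
$long = getHtmlCharsetSketch(
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
);
```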

getIndexStatus()

Check the mtime / tstamp of the currently indexed page/file (based on phash)


    public
                    getIndexStatus(int $mtime, string $phash) : IndexStatus

Parameters

$mtime : int: mtime value to test against limits and indexed page (usually this is the mtime of the cached document)
$phash : string: "phash" used to select any already indexed page to see what its mtime is.

Return values

IndexStatus

getRootLineFields()

Adding values for root-line fields.


    public
                    getRootLineFields(array<string|int, mixed> &$fieldArray) : void

rl0, rl1 and rl2 are standard. A hook might add more.

Parameters

$fieldArray : array<string|int, mixed>: Field array, passed by reference

getUrlHeaders()

Getting HTTP request headers of URL


    public
                    getUrlHeaders(string $url) : array<string, string>|false

Parameters

$url : string: The URL

Return values

array<string, string>|false —

If no answer, returns FALSE. Otherwise, an array where HTTP headers are keys

indexAnalyze()

Analyzes content to use for indexing.


    public
                    indexAnalyze(IndexingDataAsArray $indexingDataDto) : array<string|int, mixed>

Parameters

$indexingDataDto : IndexingDataAsArray

Return values

array<string|int, mixed> —

Index array (the same structure that analyzeHeaderinfo() and analyzeBody() receive by reference)

indexExternalUrl()

Indexes the HTML content of an external URL


    public
                    indexExternalUrl(string $externalUrl) : void

Parameters

$externalUrl : string: URL, e.g. "https://typo3.org/"

indexRegularDocument()

Indexing a regular document given as $file (relative to public web path, local file)


    public
                    indexRegularDocument(string $file[, bool $force = false ][, string $contentTmpFile = '' ][, string $altExtension = '' ]) : void

Parameters

$file : string: Relative Filename, relative to public web path. It can also be an absolute path as long as it is inside the lockRootPath. Finally, if $contentTmpFile is set, this value can be anything, most likely a URL
$force : bool = false: If set, indexing is forced (despite content hashes, mtime etc).
$contentTmpFile : string = '': Temporary file with the content to read it from (instead of $file). Used when the $file is a URL.
$altExtension : string = '': File extension for temporary file.

indexTypo3PageContent()

Start indexing of the TYPO3 page


    public
                    indexTypo3PageContent() : void

init()


    public
                    init([array<string|int, mixed>|null $configuration = null ]) : void

Parameters

$configuration : array<string|int, mixed>|null = null: will be used to set $this->conf, otherwise $this->conf MUST be set with proper values prior to this call

initializeExternalParsers()


    public
                    initializeExternalParsers() : void

is_grlist_set()

Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)


    public
                    is_grlist_set(string $phash_x) : bool

Parameters

$phash_x : string

Return values

bool

log_setTSlogMessage()


    public
                    log_setTSlogMessage(string $msg[, string $logLevel = LogLevel::INFO ]) : void

Parameters

$msg : string
$logLevel : string = LogLevel::INFO

processWordsInArrays()

Processes the words in the array returned by the split*Content functions. Values are ensured to be unique.


    public
                    processWordsInArrays(IndexingDataAsString $input) : IndexingDataAsArray

Parameters

$input : IndexingDataAsString

Return values

IndexingDataAsArray

readFileContent()

Reads the content of an external file being indexed.


    public
                    readFileContent(string $fileExtension, string $absoluteFileName, string|int $sectionPointer) : IndexingDataAsString|null

The content from the external parser MUST be returned in utf-8!

Parameters

$fileExtension : string: File extension, eg. "pdf", "doc" etc.
$absoluteFileName : string: Absolute filename of file (must exist and be validated OK before calling function)
$sectionPointer : string|int: Pointer to section (zero for all file types other than PDF, where it indicates the pages into which the document should be split).

Return values

IndexingDataAsString|null

removeOldIndexedFiles()

Removes records for the indexed file, $phash


    public
                    removeOldIndexedFiles(string $phash) : void

Parameters

$phash : string: phash value to flush

removeOldIndexedPages()

Removes records for the indexed page, $phash


    public
                    removeOldIndexedPages(string $phash) : void

Parameters

$phash : string: phash value to flush

setExtHashes()

Get search hash, external files


    public
                    setExtHashes(string $file[, array<string|int, mixed> $subinfo = [] ]) : array{phash_grouping: string, phash: string}

Parameters

$file : string: File name / path which identifies it on the server
$subinfo : array<string|int, mixed> = []: Additional content identifying the (subpart of) content. For instance; PDF files are divided into groups of pages for indexing.

Return values

array{phash_grouping: string, phash: string}
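The relationship between the two returned hashes can be illustrated with a short sketch: phash_grouping depends on the file alone, while phash also depends on $subinfo, so sections of one file share a grouping hash. The function name extHashesSketch, the use of md5() over serialize(), and the shape of the $subinfo array are all assumptions, not the confirmed TYPO3 behaviour.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: derives a grouping hash from the file name
 * and a section hash from the file name plus sub-information.
 *
 * @return array{phash_grouping: string, phash: string}
 */
function extHashesSketch(string $file, array $subinfo = []): array
{
    $identity = ['file' => $file];
    $grouping = md5(serialize($identity));   // same for every section of the file
    $identity['subinfo'] = $subinfo;
    return [
        'phash_grouping' => $grouping,
        'phash' => md5(serialize($identity)), // differs per section
    ];
}

// Two page ranges of the same PDF (the 'key' entry is illustrative only).
$first = extHashesSketch('fileadmin/manual.pdf', ['key' => '1-5']);
$second = extHashesSketch('fileadmin/manual.pdf', ['key' => '6-10']);
```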

setT3Hashes()

Get search hash, T3 pages


    public
                    setT3Hashes() : void

splitHTMLContent()

Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.


    public
                    splitHTMLContent(string $content) : IndexingDataAsString

Parameters

$content : string: HTML content to index. To some degree expected to be made by TYPO3 (i.e. splitting the header by ":")

Return values

IndexingDataAsString

splitRegularContent()

Splits non-HTML content (from external files for instance)


    public
                    splitRegularContent(string $content) : IndexingDataAsString

Parameters

$content : string

Return values

IndexingDataAsString

submit_grlist()

Stores gr_list in the database.


    public
                    submit_grlist(string $hash, string $phash_x) : void

Parameters

$hash : string: Search result record phash
$phash_x : string: Actual phash of current content

submit_section()

Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but different for external files.


    public
                    submit_section(string $hash, string $hash_t3) : void

Parameters

$hash : string: phash of TYPO3 parent search result record
$hash_t3 : string: phash of the file indexation search record

submitFile_section()

Stores file section for a file IF it does not exist


    public
                    submitFile_section(string $hash) : void

Parameters

$hash : string: phash value of file

submitFilePage()

Updates db with information about the file


    public
                    submitFilePage(array<string|int, mixed> $hash, string $file, array<string|int, mixed> $subinfo, string $ext, int $mtime, int $ctime, int $size, string $content_md5h, IndexingDataAsString $indexingDataDto) : void

Parameters

$hash : array<string|int, mixed>: Array with phash and phash_grouping keys for file
$file : string: File name
$subinfo : array<string|int, mixed>: Array of "static_page_arguments" for files: This is for instance the page index for a PDF file (other document types it will be a zero)
$ext : string: File extension determining the type of media.
$mtime : int: Modification time of file.
$ctime : int: Creation time of file.
$size : int: Size of file in bytes
$content_md5h : string: Content HASH value.
$indexingDataDto : IndexingDataAsString

submitPage()

Updates db with information about the page (TYPO3 page, not external media)


    public
                    submitPage() : void

submitWords()

Submits RELATIONS between words and phash


    public
                    submitWords(array<string|int, mixed> $wordList, string $phash) : void

Parameters

$wordList : array<string|int, mixed>
$phash : string

typoSearchTags()

Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.


    public
                    typoSearchTags(string &$body) : bool

Parameters

$body : string: HTML Content, passed by reference

Return values

bool —

Returns TRUE if a TYPO3SEARCH_ tag was found, otherwise FALSE.
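The marker handling can be sketched as follows, assuming the usual comment markers <!--TYPO3SEARCH_begin--> and <!--TYPO3SEARCH_end-->. The function typoSearchTagsSketch is a hypothetical stand-alone approximation: it keeps only the text that follows a begin marker up to the next marker and leaves the body untouched when no marker is present.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: reduces $body to the parts enclosed by search markers. */
function typoSearchTagsSketch(string &$body): bool
{
    $marker = '<!--\s?TYPO3SEARCH_(?:begin|end)\s?-->';
    if (preg_match('/' . $marker . '/', $body) !== 1) {
        return false; // no markers: index the whole body
    }
    // Capture everything after each begin marker up to the next marker or the end.
    preg_match_all(
        '/<!--\s?TYPO3SEARCH_begin\s?-->(.*?)(?=' . $marker . '|\z)/s',
        $body,
        $m
    );
    $body = implode('', $m[1]);
    return true;
}

$page = '<nav>menu</nav><!--TYPO3SEARCH_begin--><p>main</p>'
    . '<!--TYPO3SEARCH_end--><footer>imprint</footer>';
$found = typoSearchTagsSketch($page);
```

Here navigation and footer are dropped, so only the main content would reach the word index.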

update_grlist()

Checks if a grlist entry for this hash exists and, if not, writes one.


    public
                    update_grlist(string $phash, string $phash_x) : void

Parameters

$phash : string: phash of the search result that should be found
$phash_x : string: The real phash of the current content. The two values differ when a page viewed with a user login turns out to contain exactly the same content as another, already indexed version of the page; this is the whole reason for the grlist table.

updateParsetime()

Update parse time for phash row.


    public
                    updateParsetime(string $phash, int $parsetime) : void

Parameters

$phash : string
$parsetime : int

updateRootline()

Update section rootline for the page


    public
                    updateRootline() : void

updateSetId()

Update SetID of the index_phash record.


    public
                    updateSetId(string $phash) : void

Parameters

$phash : string

updateTstamp()

Update tstamp for a phash row.


    public
                    updateTstamp(string $phash[, int $mtime = 0 ]) : void

Parameters

$phash : string
$mtime : int = 0

addSpacesToKeywordList()

Makes sure that keywords are space-separated. This is important for displaying them properly as part of the fulltext index.


    protected
                    addSpacesToKeywordList(string $keywordList) : string

Parameters

$keywordList : string

Return values

string
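A normalisation like this might look as follows. The sketch assumes a comma-separated keyword list and a hypothetical function name (addSpacesToKeywordListSketch); the exact output format of the TYPO3 method, such as any leading or trailing space, is not specified here.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: rewrites "a,b , c" as "a, b, c". */
function addSpacesToKeywordListSketch(string $keywordList): string
{
    $keywords = array_filter(
        array_map('trim', explode(',', $keywordList)),
        static fn (string $keyword): bool => $keyword !== ''
    );
    return implode(', ', $keywords);
}

$normalised = addSpacesToKeywordListSketch('typo3,search , index');
```

With a space after every comma, a fulltext tokenizer sees separate words instead of one long comma-joined token.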

createLocalPath()

Checks if the file is local


    protected
                    createLocalPath(string $sourcePath) : string

Parameters

$sourcePath : string

Return values

string —

Absolute path to file if file is local, else empty string

createLocalPathFromAbsoluteURL()

Attempts to create a local file path from the absolute URL without schema.


    protected
                    createLocalPathFromAbsoluteURL(string $sourcePath) : string

Parameters

$sourcePath : string

Return values

string

createLocalPathFromRelativeURL()

Attempts to create a local file path from the relative URL.


    protected
                    createLocalPathFromRelativeURL(string $sourcePath) : string

Parameters

$sourcePath : string

Return values

string

createLocalPathUsingAbsRefPrefix()

Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.


    protected
                    createLocalPathUsingAbsRefPrefix(string $sourcePath) : string

Parameters

$sourcePath : string

Return values

string

createLocalPathUsingDomainURL()

Attempts to create a local file path by matching a current request URL.


    protected
                    createLocalPathUsingDomainURL(string $sourcePath) : string

Parameters

$sourcePath : string

Return values

string

isAllowedLocalFile()

Checks if the path points to the file inside the website


    protected
            static        isAllowedLocalFile(string $filePath) : bool

Parameters

$filePath : string

Return values

bool

isRelativeURL()

Checks if URL is relative.


    protected
            static        isRelativeURL(string $url) : bool

Parameters

$url : string

Return values

bool
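A relative-URL check can be sketched with parse_url(). The function isRelativeUrlSketch is hypothetical, and treating root-relative paths (starting with "/") as not relative is an assumption about the intended behaviour.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: TRUE for URLs such as "fileadmin/a.pdf" or "../a.pdf". */
function isRelativeUrlSketch(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed URL
    }
    $hasScheme = ($parts['scheme'] ?? '') !== '';
    $hasHost = ($parts['host'] ?? '') !== '';
    $rootRelative = str_starts_with($parts['path'] ?? '', '/');
    return !$hasScheme && !$hasHost && !$rootRelative;
}

$relative = isRelativeUrlSketch('fileadmin/docs/a.pdf');
$absolute = isRelativeUrlSketch('https://example.org/a.pdf');
```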

milliseconds()

Gets the current Unix time in milliseconds.


    protected
                    milliseconds() : int

Return values

int
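A millisecond timestamp of this kind is typically derived from microtime(). The following sketch uses a hypothetical function name (millisecondsSketch) and shows how two readings could yield an integer duration, for instance a parse time.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: current Unix time in whole milliseconds. */
function millisecondsSketch(): int
{
    return (int) round(microtime(true) * 1000);
}

$start = millisecondsSketch();
usleep(20000); // simulate roughly 20 ms of work
$elapsed = millisecondsSketch() - $start;
```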
