Indexer

Indexing class for TYPO3 frontend

Internal

Table of Contents

Properties

$conf  : array<string|int, mixed>
Configuration set internally (see init functions for required keys and their meaning)
$content_md5h  : string
Content of TYPO3 page
$defaultIndexingDataPayload  : array<string|int, mixed>
$excludeSections  : string
HTML code blocks to exclude from indexing
$external_parsers  : array<string|int, mixed>
Supported Extensions for external files
$externalFileCounter  : int
$file_phash_arr  : array<string|int, mixed>
Hash array, contains phash and phash_grouping
$flagBitMask  : int
$forceIndexing  : bool
$freqMax  : float
$freqRange  : int
$hash  : array<string|int, mixed>
$indexerConfig  : array<string|int, mixed>
Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search
$indexExternalUrl_content  : string
$indexingDataStringDto  : IndexingDataAsString
$maxExternalFiles  : int
Max number of external files to index.
$tstamp_minAge  : int
If set, this tells a minimum limit before a document can be indexed again. This is regardless of mtime.
$wordcount  : int

Methods

__construct()  : mixed
analyzeBody()  : void
Calculates relevant information for body content
analyzeHeaderinfo()  : void
Calculates relevant information for header content
bodyDescription()  : string
Extracts the sample description text from the content array.
charsetEntity2utf8()  : void
Convert character set and HTML entities in the value of input content array keys
checkContentHash()  : array<string|int, mixed>|true
Check content hash in phash table
checkExternalDocContentHash()  : bool
Check content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result).
checkWordList()  : void
Adds new words to db
convertHTMLToUtf8()  : string
Converts an HTML document to UTF-8
embracingTags()  : bool
Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, e.g., for finding the <title> of a document or removing <script> sections.
extractBaseHref()  : string
Extracts the "base href" from content string.
extractHyperLinks()  : array<int, array{tag: string, href: string, localPath: string}>
Extracts all links to external documents from the HTML content string
extractLinks()  : void
Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.
fileContentParts()  : array<string|int, mixed>
Creates an array with pointers to divisions of document.
freqMap()  : int
Maps frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax as 1, and back.
getHTMLcharset()  : string
Extract the charset value from HTML meta tag.
getIndexStatus()  : IndexStatus
Check the mtime / tstamp of the currently indexed page/file (based on phash)
getRootLineFields()  : void
Adding values for root-line fields.
getUrlHeaders()  : array<string, string>|false
Getting HTTP request headers of URL
indexAnalyze()  : array<string|int, mixed>
Analyzes content to use for indexing.
indexExternalUrl()  : void
Index External URLs HTML content
indexRegularDocument()  : void
Indexing a regular document given as $file (relative to public web path, local file)
indexTypo3PageContent()  : void
Start indexing of the TYPO3 page
init()  : void
initializeExternalParsers()  : void
is_grlist_set()  : bool
Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)
log_setTSlogMessage()  : void
processWordsInArrays()  : IndexingDataAsArray
Processes words in the array from the split*Content functions. Values are ensured to be unique.
readFileContent()  : IndexingDataAsString|null
Reads the content of an external file being indexed.
removeOldIndexedFiles()  : void
Removes records for the indexed page, $phash
removeOldIndexedPages()  : void
Removes records for the indexed page, $phash
setExtHashes()  : array{phash_grouping: string, phash: string}
Get search hash, external files
setT3Hashes()  : void
Get search hash, T3 pages
splitHTMLContent()  : IndexingDataAsString
Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.
splitRegularContent()  : IndexingDataAsString
Splits non-HTML content (from external files for instance)
submit_grlist()  : void
Stores gr_list in the database.
submit_section()  : void
Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but differ for external files.
submitFile_section()  : void
Stores file section for a file IF it does not exist
submitFilePage()  : void
Updates db with information about the file
submitPage()  : void
Updates db with information about the page (TYPO3 page, not external media)
submitWords()  : void
Submits RELATIONS between words and phash
typoSearchTags()  : bool
Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.
update_grlist()  : void
Checks if a grlist entry for this hash exists and, if not, writes one.
updateParsetime()  : void
Update parse time for phash row.
updateRootline()  : void
Update section rootline for the page
updateSetId()  : void
Update SetID of the index_phash record.
updateTstamp()  : void
Update tstamp for a phash row.
addSpacesToKeywordList()  : string
Makes sure that keywords are space-separated. This is important for their proper displaying as a part of fulltext index.
createLocalPath()  : string
Checks if the file is local
createLocalPathFromAbsoluteURL()  : string
Attempts to create a local file path from the absolute URL without schema.
createLocalPathFromRelativeURL()  : string
Attempts to create a local file path from the relative URL.
createLocalPathUsingAbsRefPrefix()  : string
Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.
createLocalPathUsingDomainURL()  : string
Attempts to create a local file path by matching a current request URL.
isAllowedLocalFile()  : bool
Checks if the path points to the file inside the website
isRelativeURL()  : bool
Checks if URL is relative.
milliseconds()  : int
Gets the unixtime as milliseconds.

Properties

$conf

Configuration set internally (see init functions for required keys and their meaning)

public array<string|int, mixed> $conf = []

$content_md5h

Content of TYPO3 page

public string $content_md5h = ''

$defaultIndexingDataPayload

public array<string|int, mixed> $defaultIndexingDataPayload = ['title' => '', 'description' => '', 'keywords' => '', 'body' => '']

$excludeSections

HTML code blocks to exclude from indexing

public string $excludeSections = 'script,style'

$external_parsers

Supported Extensions for external files

public array<string|int, mixed> $external_parsers = []

$externalFileCounter

public int $externalFileCounter = 0

$file_phash_arr

Hash array, contains phash and phash_grouping

public array<string|int, mixed> $file_phash_arr = []

$forceIndexing

public bool $forceIndexing = false

$hash

public array<string|int, mixed> $hash = []

$indexerConfig

Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search

public array<string|int, mixed> $indexerConfig = []

$indexExternalUrl_content

public string $indexExternalUrl_content = ''

$maxExternalFiles

Max number of external files to index.

public int $maxExternalFiles = 0

$tstamp_minAge

If set, this tells a minimum limit before a document can be indexed again. This is regardless of mtime.

public int $tstamp_minAge = 0

Methods

analyzeBody()

Calculates relevant information for body content

public analyzeBody(array<string|int, mixed> &$retArr, IndexingDataAsArray $indexingDataDto) : void
Parameters
$retArr : array<string|int, mixed>

Index array, passed by reference

$indexingDataDto : IndexingDataAsArray

analyzeHeaderinfo()

Calculates relevant information for header content

public analyzeHeaderinfo(array<string|int, mixed> &$retArr, array<string|int, mixed> $content, int $offset) : void
Parameters
$retArr : array<string|int, mixed>

Index array, passed by reference

$content : array<string|int, mixed>

Standard content array

$offset : int

Bit-wise priority to type

checkContentHash()

Check content hash in phash table

public checkContentHash() : array<string|int, mixed>|true
Return values
array<string|int, mixed>|true

Returns TRUE if the page needs to be indexed (that is, there was no result), otherwise the phash value (in an array) of the phash record to which the grlist_record should be related!

checkExternalDocContentHash()

Check content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result).

public checkExternalDocContentHash(string $hashGr, string $content_md5h) : bool
Parameters
$hashGr : string

phash value to check (phash_grouping)

$content_md5h : string

Content hash to check

Return values
bool

checkWordList()

Adds new words to db

public checkWordList(array<string|int, mixed> $wordListArray) : void
Parameters
$wordListArray : array<string|int, mixed>

Word List array (where each word has information about position, etc.).

convertHTMLToUtf8()

Converts an HTML document to UTF-8

public convertHTMLToUtf8(string $content[, string $charset = '' ]) : string
Parameters
$content : string
$charset : string = ''
Return values
string

embracingTags()

Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, e.g., for finding the <title> of a document or removing <script> sections.

public embracingTags(string $string, string $tagName, string|null &$tagContent, string|null &$stringAfter, string|null &$paramList) : bool
Parameters
$string : string

String to search in

$tagName : string

Tag name, e.g. "script"

$tagContent : string|null

Passed by reference: Content inside found tag

$stringAfter : string|null

Passed by reference: Content after found tag

$paramList : string|null

Passed by reference: Attributes of the found tag.

Return values
bool
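
The by-reference contract described above can be illustrated with a small standalone function. This is a minimal sketch, not the TYPO3 implementation: the name embracingTagsSketch() is hypothetical, and the regex-based matching is an assumption chosen for brevity.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for illustration only; not the TYPO3 implementation.
function embracingTagsSketch(
    string $string,
    string $tagName,
    ?string &$tagContent,
    ?string &$stringAfter,
    ?string &$paramList
): bool {
    $tag = preg_quote($tagName, '/');
    // First <tag ...>...</tag> pair, case-insensitive, spanning line breaks.
    $pattern = '/<' . $tag . '(\s[^>]*)?>(.*?)<\/' . $tag . '\s*>/is';
    if (preg_match($pattern, $string, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return false;
    }
    $paramList = trim($m[1][0]);   // attributes of the opening tag
    $tagContent = $m[2][0];        // text between opening and closing tag
    // Original string with the whole matched element cut out.
    $stringAfter = substr($string, 0, $m[0][1])
        . substr($string, $m[0][1] + strlen($m[0][0]));
    return true;
}

$html = '<html><title lang="en">Hello</title><p>x</p></html>';
if (embracingTagsSketch($html, 'title', $content, $rest, $attributes)) {
    echo $content, "\n"; // the embraced title text
}
```

The three output arguments are only meaningful when the function returns TRUE, so callers would normally check the return value first.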

extractBaseHref()

Extracts the "base href" from content string.

public extractBaseHref(string $html) : string
Parameters
$html : string
Return values
string
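
As an illustration of what extracting the "base href" means, the following sketch pulls the href attribute out of the first <base> element. The function name baseHrefSketch() and the regex approach are assumptions made for this example; the actual method may parse the markup differently.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function baseHrefSketch(string $html): string
{
    // Look for href="..." or href='...' inside the first <base ...> tag.
    if (preg_match('/<base\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m) === 1) {
        return $m[2];
    }
    return ''; // no <base href> present
}

echo baseHrefSketch('<head><base href="https://example.org/sub/"></head>'), "\n";
```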

extractHyperLinks()

Extracts all links to external documents from the HTML content string

public extractHyperLinks(string $html) : array<int, array{tag: string, href: string, localPath: string}>
Parameters
$html : string
Return values
array<int, array{tag: string, href: string, localPath: string}>

extractLinks()

Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.

public extractLinks(string $content) : void
Parameters
$content : string

fileContentParts()

Creates an array with pointers to divisions of document.

public fileContentParts(string $ext, string $absFile) : array<string|int, mixed>
Parameters
$ext : string

File extension

$absFile : string

Absolute filename (must exist and be validated OK before calling function)

Return values
array<string|int, mixed>

Array of pointers to sections that the document should be divided into

freqMap()

Maps frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax as 1, and back.

public freqMap(float $freq) : int
Parameters
$freq : float

Frequency

Return values
int

Frequency in range.
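
One way to read the forward direction of this mapping is as a clamp followed by a linear scale. The sketch below is an interpretation under stated assumptions, not the class's actual formula: freqMapSketch() is a hypothetical name, $freqMax and $freqRange are passed in explicitly instead of being read from the object, and the reverse mapping ("and back") is not covered.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; the real method reads
// $this->freqMax and $this->freqRange and may use a different formula.
function freqMapSketch(float $freq, float $freqMax, int $freqRange): int
{
    if ($freqMax <= 0.0) {
        throw new InvalidArgumentException('$freqMax must be positive');
    }
    // Treat $freqMax as "full scale": anything at or above it becomes 1.0.
    $normalized = max(0.0, min(1.0, $freq / $freqMax));
    // Spread the normalized value over the integer range.
    return (int) round($normalized * $freqRange);
}

echo freqMapSketch(0.25, 0.5, 1000), "\n"; // half of full scale
```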

getHTMLcharset()

Extract the charset value from HTML meta tag.

public getHTMLcharset(string $content) : string
Parameters
$content : string
Return values
string
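
For illustration, a charset declaration can be located in either the short <meta charset="..."> form or the older http-equiv form with one regular expression. This is a sketch under assumptions: htmlCharsetSketch() is a hypothetical name, and returning an empty string when nothing is found is a guess at a sensible fallback.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function htmlCharsetSketch(string $content): string
{
    // Matches charset=... anywhere inside a <meta ...> tag, quoted or not.
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9._:-]+)/i';
    if (preg_match($pattern, $content, $m) === 1) {
        return $m[1];
    }
    return '';
}

echo htmlCharsetSketch('<meta charset="UTF-8">'), "\n";
```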

getIndexStatus()

Check the mtime / tstamp of the currently indexed page/file (based on phash)

public getIndexStatus(int $mtime, string $phash) : IndexStatus
Parameters
$mtime : int

mtime value to test against limits and indexed page (usually this is the mtime of the cached document)

$phash : string

"phash" used to select any already indexed page to see what its mtime is.

Return values
IndexStatus

getRootLineFields()

Adding values for root-line fields.

public getRootLineFields(array<string|int, mixed> &$fieldArray) : void

rl0, rl1 and rl2 are standard. A hook might add more.

Parameters
$fieldArray : array<string|int, mixed>

Field array, passed by reference

getUrlHeaders()

Getting HTTP request headers of URL

public getUrlHeaders(string $url) : array<string, string>|false
Parameters
$url : string

The URL

Return values
array<string, string>|false

If no answer, returns FALSE. Otherwise, an array where HTTP headers are keys

indexAnalyze()

Analyzes content to use for indexing.

public indexAnalyze(IndexingDataAsArray $indexingDataDto) : array<string|int, mixed>
Parameters
$indexingDataDto : IndexingDataAsArray
Return values
array<string|int, mixed>

Index Array (whatever that is...)

indexExternalUrl()

Index External URLs HTML content

public indexExternalUrl(string $externalUrl) : void
Parameters
$externalUrl : string

URL, e.g. "https://typo3.org/"

indexRegularDocument()

Indexing a regular document given as $file (relative to public web path, local file)

public indexRegularDocument(string $file[, bool $force = false ][, string $contentTmpFile = '' ][, string $altExtension = '' ]) : void
Parameters
$file : string

Relative Filename, relative to public web path. It can also be an absolute path as long as it is inside the lockRootPath. Finally, if $contentTmpFile is set, this value can be anything, most likely a URL

$force : bool = false

If set, indexing is forced (despite content hashes, mtime etc).

$contentTmpFile : string = ''

Temporary file with the content to read it from (instead of $file). Used when the $file is a URL.

$altExtension : string = ''

File extension for temporary file.

indexTypo3PageContent()

Start indexing of the TYPO3 page

public indexTypo3PageContent() : void

init()

public init([array<string|int, mixed>|null $configuration = null ]) : void
Parameters
$configuration : array<string|int, mixed>|null = null

will be used to set $this->conf, otherwise $this->conf MUST be set with proper values prior to this call

initializeExternalParsers()

public initializeExternalParsers() : void

is_grlist_set()

Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)

public is_grlist_set(string $phash_x) : bool
Parameters
$phash_x : string
Return values
bool

log_setTSlogMessage()

public log_setTSlogMessage(string $msg[, string $logLevel = LogLevel::INFO ]) : void
Parameters
$msg : string
$logLevel : string = LogLevel::INFO

readFileContent()

Reads the content of an external file being indexed.

public readFileContent(string $fileExtension, string $absoluteFileName, string|int $sectionPointer) : IndexingDataAsString|null

The content from the external parser MUST be returned in utf-8!

Parameters
$fileExtension : string

File extension, e.g. "pdf", "doc" etc.

$absoluteFileName : string

Absolute filename of file (must exist and be validated OK before calling function)

$sectionPointer : string|int

Pointer to section (zero for all file types other than PDF, which will have an indication of the pages into which the document should be split).

Return values
IndexingDataAsString|null

removeOldIndexedFiles()

Removes records for the indexed page, $phash

public removeOldIndexedFiles(string $phash) : void
Parameters
$phash : string

phash value to flush

removeOldIndexedPages()

Removes records for the indexed page, $phash

public removeOldIndexedPages(string $phash) : void
Parameters
$phash : string

phash value to flush

setExtHashes()

Get search hash, external files

public setExtHashes(string $file[, array<string|int, mixed> $subinfo = [] ]) : array{phash_grouping: string, phash: string}
Parameters
$file : string

File name / path which identifies it on the server

$subinfo : array<string|int, mixed> = []

Additional content identifying the (subpart of) content. For instance; PDF files are divided into groups of pages for indexing.

Return values
array{phash_grouping: string, phash: string}
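
The relationship between the two returned hashes can be sketched as follows: the grouping hash depends only on the file, while the phash also depends on the sub-part being indexed. This is an assumption-laden illustration, not the actual hashing scheme. The function name extHashesSketch(), the use of md5() over serialize(), and the 'pages' key in the usage lines are all hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch for illustration; not the TYPO3 implementation.
 *
 * @param array<string|int, mixed> $subinfo
 * @return array{phash_grouping: string, phash: string}
 */
function extHashesSketch(string $file, array $subinfo = []): array
{
    // Same file => same grouping hash, whichever part of it is indexed.
    $identity = ['file' => $file];
    $grouping = md5(serialize($identity));

    // Adding the sub-part information makes each part's hash distinct.
    $identity['subinfo'] = $subinfo;
    $phash = md5(serialize($identity));

    return ['phash_grouping' => $grouping, 'phash' => $phash];
}

$first = extHashesSketch('fileadmin/manual.pdf', ['pages' => '1-10']);
$second = extHashesSketch('fileadmin/manual.pdf', ['pages' => '11-20']);
var_dump($first['phash_grouping'] === $second['phash_grouping']); // same file
var_dump($first['phash'] === $second['phash']);                   // different parts
```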

setT3Hashes()

Get search hash, T3 pages

public setT3Hashes() : void

splitHTMLContent()

Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.

public splitHTMLContent(string $content) : IndexingDataAsString
Parameters
$content : string

HTML content to index. To some degree expected to be made by TYPO3 (i.e. splitting the header by ":")

Return values
IndexingDataAsString

submit_grlist()

Stores gr_list in the database.

public submit_grlist(string $hash, string $phash_x) : void
Parameters
$hash : string

Search result record phash

$phash_x : string

Actual phash of current content

submit_section()

Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but differ for external files.

public submit_section(string $hash, string $hash_t3) : void
Parameters
$hash : string

phash of TYPO3 parent search result record

$hash_t3 : string

phash of the file indexation search record

submitFile_section()

Stores file section for a file IF it does not exist

public submitFile_section(string $hash) : void
Parameters
$hash : string

phash value of file

submitFilePage()

Updates db with information about the file

public submitFilePage(array<string|int, mixed> $hash, string $file, array<string|int, mixed> $subinfo, string $ext, int $mtime, int $ctime, int $size, string $content_md5h, IndexingDataAsString $indexingDataDto) : void
Parameters
$hash : array<string|int, mixed>

Array with phash and phash_grouping keys for file

$file : string

File name

$subinfo : array<string|int, mixed>

Array of "static_page_arguments" for files: this is, for instance, the page index for a PDF file (for other document types it will be zero)

$ext : string

File extension determining the type of media.

$mtime : int

Modification time of file.

$ctime : int

Creation time of file.

$size : int

Size of file in bytes

$content_md5h : string

Content HASH value.

$indexingDataDto : IndexingDataAsString

submitPage()

Updates db with information about the page (TYPO3 page, not external media)

public submitPage() : void

submitWords()

Submits RELATIONS between words and phash

public submitWords(array<string|int, mixed> $wordList, string $phash) : void
Parameters
$wordList : array<string|int, mixed>
$phash : string

typoSearchTags()

Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.

public typoSearchTags(string &$body) : bool
Parameters
$body : string

HTML Content, passed by reference

Return values
bool

Returns TRUE if a TYPO3SEARCH_ tag was found, otherwise FALSE.
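
The effect of the marker comments can be illustrated with a standalone sketch that keeps only the text following a TYPO3SEARCH_begin marker up to the next marker. The function name typoSearchTagsSketch() is hypothetical, and the handling of edge cases (for example text before the first marker) is an assumption that may differ from the real method.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function typoSearchTagsSketch(string &$body): bool
{
    $parts = preg_split(
        '/<!--\s?TYPO3SEARCH_(begin|end)\s?-->/',
        $body,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    if ($parts === false || count($parts) === 1) {
        return false; // no marker found, $body is left untouched
    }
    $kept = '';
    // Layout of $parts: text, marker, text, marker, text, ...
    for ($i = 1; $i < count($parts); $i += 2) {
        if ($parts[$i] === 'begin') {
            $kept .= $parts[$i + 1]; // keep only text that follows "begin"
        }
    }
    $body = $kept;
    return true;
}

$body = 'menu<!--TYPO3SEARCH_begin-->Main text<!--TYPO3SEARCH_end-->footer';
typoSearchTagsSketch($body);
echo $body, "\n";
```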

update_grlist()

Checks if a grlist entry for this hash exists and, if not, writes one.

public update_grlist(string $phash, string $phash_x) : void
Parameters
$phash : string

phash of the search result that should be found

$phash_x : string

The real phash of the current content. The two values differ when a page with user login turns out to contain exactly the same content as another, already indexed version of the page; this is in fact the whole reason for the grlist table.

updateParsetime()

Update parse time for phash row.

public updateParsetime(string $phash, int $parsetime) : void
Parameters
$phash : string
$parsetime : int

updateRootline()

Update section rootline for the page

public updateRootline() : void

updateSetId()

Update SetID of the index_phash record.

public updateSetId(string $phash) : void
Parameters
$phash : string

updateTstamp()

Update tstamp for a phash row.

public updateTstamp(string $phash[, int $mtime = 0 ]) : void
Parameters
$phash : string
$mtime : int = 0

addSpacesToKeywordList()

Makes sure that keywords are space-separated. This is important for their proper displaying as a part of fulltext index.

protected addSpacesToKeywordList(string $keywordList) : string
Parameters
$keywordList : string
Tags
see
https://forge.typo3.org/issues/14959
Return values
string
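
The normalization described here can be sketched as splitting on commas, trimming each keyword, and joining with a comma followed by a space. This is a minimal illustration under assumptions: spaceSeparatedKeywordsSketch() is a hypothetical name, and dropping empty entries is a choice made for this example only.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function spaceSeparatedKeywordsSketch(string $keywordList): string
{
    // Trim every keyword and drop empty entries such as those from ",,".
    $keywords = array_filter(
        array_map('trim', explode(',', $keywordList)),
        static fn (string $keyword): bool => $keyword !== ''
    );
    // A space after each comma lets the fulltext index see separate words.
    return implode(', ', $keywords);
}

echo spaceSeparatedKeywordsSketch('typo3,search , index'), "\n";
```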

createLocalPath()

Checks if the file is local

protected createLocalPath(string $sourcePath) : string
Parameters
$sourcePath : string
Return values
string

Absolute path to file if file is local, else empty string

createLocalPathFromAbsoluteURL()

Attempts to create a local file path from the absolute URL without schema.

protected createLocalPathFromAbsoluteURL(string $sourcePath) : string
Parameters
$sourcePath : string
Return values
string

createLocalPathFromRelativeURL()

Attempts to create a local file path from the relative URL.

protected createLocalPathFromRelativeURL(string $sourcePath) : string
Parameters
$sourcePath : string
Return values
string

createLocalPathUsingAbsRefPrefix()

Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.

protected createLocalPathUsingAbsRefPrefix(string $sourcePath) : string
Parameters
$sourcePath : string
Return values
string

createLocalPathUsingDomainURL()

Attempts to create a local file path by matching a current request URL.

protected createLocalPathUsingDomainURL(string $sourcePath) : string
Parameters
$sourcePath : string
Return values
string

isAllowedLocalFile()

Checks if the path points to the file inside the website

protected static isAllowedLocalFile(string $filePath) : bool
Parameters
$filePath : string
Return values
bool

isRelativeURL()

Checks if URL is relative.

protected static isRelativeURL(string $url) : bool
Parameters
$url : string
Return values
bool
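
A relative URL check can be approximated with PHP's parse_url(): no scheme, no host, and a path that does not start with a slash. The sketch below is one plausible reading, not the actual method; isRelativeUrlSketch() is a hypothetical name and the treatment of malformed input is an assumption. It uses str_starts_with(), which requires PHP 8.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function isRelativeUrlSketch(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed input is not treated as relative
    }
    $hasScheme = isset($parts['scheme']) && $parts['scheme'] !== '';
    $hasHost = isset($parts['host']);
    $startsWithSlash = str_starts_with($parts['path'] ?? '', '/');
    return !$hasScheme && !$hasHost && !$startsWithSlash;
}

var_dump(isRelativeUrlSketch('fileadmin/doc.pdf'));
var_dump(isRelativeUrlSketch('https://example.org/doc.pdf'));
```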

milliseconds()

Gets the unixtime as milliseconds.

protected milliseconds() : int
Return values
int
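
A millisecond timestamp of this kind is typically derived from microtime(). The sketch below shows that idea and how two such values give an elapsed time, as one might pass to updateParsetime(). The function name millisecondsSketch() is hypothetical, and the rounding is an assumption.

```php
<?php
declare(strict_types=1);

// Hypothetical helper for illustration; not the TYPO3 implementation.
function millisecondsSketch(): int
{
    // microtime(true) returns seconds as a float with sub-second precision.
    return (int) round(microtime(true) * 1000);
}

$start = millisecondsSketch();
usleep(2000); // stand-in for the work being timed
$elapsedMs = millisecondsSketch() - $start;
echo $elapsedMs, " ms\n";
```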

        