Indexer
Indexing class for TYPO3 frontend
Table of Contents
Properties
- $conf : array<string|int, mixed>
  Configuration set internally (see init functions for required keys and their meaning)
- $content_md5h : string
  Content of TYPO3 page
- $defaultIndexingDataPayload : array<string|int, mixed>
- $excludeSections : string
  HTML code blocks to exclude from indexing
- $external_parsers : array<string|int, mixed>
  Supported Extensions for external files
- $externalFileCounter : int
- $file_phash_arr : array<string|int, mixed>
  Hash array, contains phash and phash_grouping
- $flagBitMask : int
- $forceIndexing : bool
- $freqMax : float
- $freqRange : int
- $hash : array<string|int, mixed>
  Hash array, contains phash and phash_grouping
- $indexerConfig : array<string|int, mixed>
  Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search
- $indexExternalUrl_content : string
- $indexingDataStringDto : IndexingDataAsString
- $maxExternalFiles : int
  Max number of external files to index.
- $tstamp_minAge : int
  If set, this tells a minimum limit before a document can be indexed again. This is regardless of mtime.
- $wordcount : int
Methods
- __construct() : mixed
- analyzeBody() : void
  Calculates relevant information for body content
- analyzeHeaderinfo() : void
  Calculates relevant information for header content
- bodyDescription() : string
  Extracts the sample description text from the content array.
- charsetEntity2utf8() : void
  Convert character set and HTML entities in the value of input content array keys
- checkContentHash() : array<string|int, mixed>|true
  Check content hash in phash table
- checkExternalDocContentHash() : bool
  Check content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result)
- checkWordList() : void
  Adds new words to db
- convertHTMLToUtf8() : string
  Converts a HTML document to utf-8
- embracingTags() : bool
  Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections.
- extractBaseHref() : string
  Extracts the "base href" from content string.
- extractHyperLinks() : array<int, array{tag: string, href: string, localPath: string}>
  Extracts all links to external documents from the HTML content string
- extractLinks() : void
  Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.
- fileContentParts() : array<string|int, mixed>
  Creates an array with pointers to divisions of document.
- freqMap() : int
  Maps a frequency from a real number in [0;1] to an integer in [0;$this->freqRange], treating anything above $this->freqMax as 1, and back.
- getHTMLcharset() : string
  Extract the charset value from HTML meta tag.
- getIndexStatus() : IndexStatus
  Check the mtime / tstamp of the currently indexed page/file (based on phash)
- getRootLineFields() : void
  Adding values for root-line fields.
- getUrlHeaders() : array<string, string>|false
  Getting HTTP request headers of URL
- indexAnalyze() : array<string|int, mixed>
  Analyzes content to use for indexing.
- indexExternalUrl() : void
  Index External URLs HTML content
- indexRegularDocument() : void
  Indexing a regular document given as $file (relative to public web path, local file)
- indexTypo3PageContent() : void
  Start indexing of the TYPO3 page
- init() : void
- initializeExternalParsers() : void
- is_grlist_set() : bool
  Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)
- log_setTSlogMessage() : void
- processWordsInArrays() : IndexingDataAsArray
  Processes the words in the array from the split*Content functions. Values are ensured to be unique.
- readFileContent() : IndexingDataAsString|null
  Reads the content of an external file being indexed.
- removeOldIndexedFiles() : void
  Removes records for the indexed file, $phash
- removeOldIndexedPages() : void
  Removes records for the indexed page, $phash
- setExtHashes() : array{phash_grouping: string, phash: string}
  Get search hash, external files
- setT3Hashes() : void
  Get search hash, T3 pages
- splitHTMLContent() : IndexingDataAsString
  Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.
- splitRegularContent() : IndexingDataAsString
  Splits non-HTML content (from external files for instance)
- submit_grlist() : void
  Stores gr_list in the database.
- submit_section() : void
  Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but different for external files.
- submitFile_section() : void
  Stores file section for a file IF it does not exist
- submitFilePage() : void
  Updates db with information about the file
- submitPage() : void
  Updates db with information about the page (TYPO3 page, not external media)
- submitWords() : void
  Submits RELATIONS between words and phash
- typoSearchTags() : bool
  Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.
- update_grlist() : void
  Checks if a grlist entry for this hash exists and, if not, writes one.
- updateParsetime() : void
  Update parse time for phash row.
- updateRootline() : void
  Update section rootline for the page
- updateSetId() : void
  Update SetID of the index_phash record.
- updateTstamp() : void
  Update tstamp for a phash row.
- addSpacesToKeywordList() : string
  Makes sure that keywords are space-separated. This is important for displaying them properly as part of the fulltext index.
- createLocalPath() : string
  Checks if the file is local
- createLocalPathFromAbsoluteURL() : string
  Attempts to create a local file path from the absolute URL without scheme.
- createLocalPathFromRelativeURL() : string
  Attempts to create a local file path from the relative URL.
- createLocalPathUsingAbsRefPrefix() : string
  Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.
- createLocalPathUsingDomainURL() : string
  Attempts to create a local file path by matching a current request URL.
- isAllowedLocalFile() : bool
  Checks if the path points to the file inside the website
- isRelativeURL() : bool
  Checks if URL is relative.
- milliseconds() : int
  Gets the unixtime as milliseconds.
Properties
$conf
Configuration set internally (see init functions for required keys and their meaning)
public
array<string|int, mixed>
$conf
= []
$content_md5h
Content of TYPO3 page
public
string
$content_md5h
= ''
$defaultIndexingDataPayload
public
array<string|int, mixed>
$defaultIndexingDataPayload
= ['title' => '', 'description' => '', 'keywords' => '', 'body' => '']
$excludeSections
HTML code blocks to exclude from indexing
public
string
$excludeSections
= 'script,style'
$external_parsers
Supported Extensions for external files
public
array<string|int, mixed>
$external_parsers
= []
$externalFileCounter
public
int
$externalFileCounter
= 0
$file_phash_arr
Hash array, contains phash and phash_grouping
public
array<string|int, mixed>
$file_phash_arr
= []
$flagBitMask
public
int
$flagBitMask
$forceIndexing
public
bool
$forceIndexing
= false
$freqMax
public
float
$freqMax
= 0.1
$freqRange
public
int
$freqRange
= 32000
$hash
Hash array, contains phash and phash_grouping
public
array<string|int, mixed>
$hash
= []
$indexerConfig
Indexer configuration, coming from TYPO3's system configuration for EXT:indexed_search
public
array<string|int, mixed>
$indexerConfig
= []
$indexExternalUrl_content
public
string
$indexExternalUrl_content
= ''
$indexingDataStringDto
public
IndexingDataAsString
$indexingDataStringDto
$maxExternalFiles
Max number of external files to index.
public
int
$maxExternalFiles
= 0
$tstamp_minAge
If set, this tells a minimum limit before a document can be indexed again. This is regardless of mtime.
public
int
$tstamp_minAge
= 0
$wordcount
public
int
$wordcount
= 0
Methods
__construct()
public
__construct(TimeTracker $timeTracker, Lexer $lexer, RequestFactory $requestFactory, ConnectionPool $connectionPool, ExtensionConfiguration $extensionConfiguration) : mixed
Parameters
- $timeTracker : TimeTracker
- $lexer : Lexer
- $requestFactory : RequestFactory
- $connectionPool : ConnectionPool
- $extensionConfiguration : ExtensionConfiguration
analyzeBody()
Calculates relevant information for body content
public
analyzeBody(array<string|int, mixed> &$retArr, IndexingDataAsArray $indexingDataDto) : void
Parameters
- $retArr : array<string|int, mixed>
  Index array, passed by reference
- $indexingDataDto : IndexingDataAsArray
analyzeHeaderinfo()
Calculates relevant information for header content
public
analyzeHeaderinfo(array<string|int, mixed> &$retArr, array<string|int, mixed> $content, int $offset) : void
Parameters
- $retArr : array<string|int, mixed>
  Index array, passed by reference
- $content : array<string|int, mixed>
  Standard content array
- $offset : int
  Bit-wise priority to type
bodyDescription()
Extracts the sample description text from the content array.
public
bodyDescription(IndexingDataAsString $indexingDataDto) : string
Parameters
- $indexingDataDto : IndexingDataAsString
Return values
string
charsetEntity2utf8()
Convert character set and HTML entities in the value of input content array keys
public
charsetEntity2utf8(IndexingDataAsString $indexingDataDto) : void
Parameters
- $indexingDataDto : IndexingDataAsString
checkContentHash()
Check content hash in phash table
public
checkContentHash() : array<string|int, mixed>|true
Return values
array<string|int, mixed>|true
  TRUE if the page needs to be indexed (that is, there was no result); otherwise the phash value (in an array) of the phash record to which the grlist_record should be related.
checkExternalDocContentHash()
Check content hash for external documents. Returns TRUE if the document needs to be indexed (that is, there was no result)
public
checkExternalDocContentHash(string $hashGr, string $content_md5h) : bool
Parameters
- $hashGr : string
  phash value to check (phash_grouping)
- $content_md5h : string
  Content hash to check
Return values
bool
checkWordList()
Adds new words to db
public
checkWordList(array<string|int, mixed> $wordListArray) : void
Parameters
- $wordListArray : array<string|int, mixed>
  Word List array (where each word has information about position, etc.).
convertHTMLToUtf8()
Converts a HTML document to utf-8
public
convertHTMLToUtf8(string $content[, string $charset = '' ]) : string
Parameters
- $content : string
- $charset : string = ''
Return values
string
embracingTags()
Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns FALSE if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections.
public
embracingTags(string $string, string $tagName, string|null &$tagContent, string|null &$stringAfter, string|null &$paramList) : bool
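A minimal sketch of the behaviour described above, assuming a regex-based match. firstEmbracedTag() is a hypothetical standalone function written for illustration, not the class method, and edge cases such as a missing closing tag are not handled.

```php
<?php
// Hypothetical helper: finds the first <tag ...>...</tag> pair (case-insensitive).
function firstEmbracedTag(
    string $html,
    string $tagName,
    ?string &$tagContent,
    ?string &$stringAfter,
    ?string &$paramList
): bool {
    $tag = preg_quote($tagName, '/');
    $pattern = '/<' . $tag . '\b([^>]*)>(.*?)<\/' . $tag . '\s*>/is';
    if (preg_match($pattern, $html, $m, PREG_OFFSET_CAPTURE) !== 1) {
        return false; // no complete tag pair found
    }
    $paramList = trim($m[1][0]);   // attributes of the opening tag
    $tagContent = $m[2][0];        // text between opening and closing tag
    // Original string with the whole matched element cut out.
    $stringAfter = substr($html, 0, $m[0][1])
        . substr($html, $m[0][1] + strlen($m[0][0]));
    return true;
}

$content = $after = $params = null;
$found = firstEmbracedTag('<head><title>Hi</title></head>', 'title', $content, $after, $params);
var_dump($found, $content, $after, $params);
```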
Parameters
- $string : string
  String to search in
- $tagName : string
  Tag name, e.g. "script"
- $tagContent : string|null
  Passed by reference: Content inside found tag
- $stringAfter : string|null
  Passed by reference: Content after found tag
- $paramList : string|null
  Passed by reference: Attributes of the found tag.
Return values
bool
extractBaseHref()
Extracts the "base href" from content string.
public
extractBaseHref(string $html) : string
Parameters
- $html : string
Return values
string
extractHyperLinks()
Extracts all links to external documents from the HTML content string
public
extractHyperLinks(string $html) : array<int, array{tag: string, href: string, localPath: string}>
Parameters
- $html : string
Return values
array<int, array{tag: string, href: string, localPath: string}>
extractLinks()
Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.
public
extractLinks(string $content) : void
Parameters
- $content : string
fileContentParts()
Creates an array with pointers to divisions of document.
public
fileContentParts(string $ext, string $absFile) : array<string|int, mixed>
Parameters
- $ext : string
  File extension
- $absFile : string
  Absolute filename (must exist and be validated OK before calling function)
Return values
array<string|int, mixed>
  Array of pointers to sections that the document should be divided into
freqMap()
Maps a frequency from a real number in [0;1] to an integer in [0;$this->freqRange], treating anything above $this->freqMax as 1, and back.
public
freqMap(float $freq) : int
Parameters
- $freq : float
  Frequency
Return values
int
  Frequency in range.
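As an illustration of the forward direction of this mapping, here is a minimal sketch. It assumes linear scaling in which $freqMax lands exactly on $freqRange and larger values are clamped; the actual formula used by the class is not given on this page, and the reverse mapping ("and back") is not covered. mapFrequency() is a hypothetical standalone function whose defaults mirror the documented property defaults.

```php
<?php
// Hypothetical helper, not the class method.
function mapFrequency(float $freq, float $freqMax = 0.1, int $freqRange = 32000): int
{
    // Assumption: scale linearly so that $freqMax maps to $freqRange.
    $scaled = $freq / $freqMax * $freqRange;
    // Anything above $freqMax is clamped to the top of the range.
    return (int)min($scaled, $freqRange);
}

echo mapFrequency(0.05), "\n"; // half of $freqMax
echo mapFrequency(0.5), "\n";  // above $freqMax, clamped to $freqRange
```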
getHTMLcharset()
Extract the charset value from HTML meta tag.
public
getHTMLcharset(string $content) : string
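A minimal sketch of charset sniffing from a meta tag, assuming a single regular expression that accepts both the http-equiv form and the short `<meta charset>` form. sniffMetaCharset() is a hypothetical standalone function; the patterns the class itself uses are not documented here.

```php
<?php
// Hypothetical helper: returns the charset named in the first matching
// <meta> tag, or an empty string when none is found.
function sniffMetaCharset(string $html): string
{
    $pattern = '/<meta\b[^>]*\bcharset\s*=\s*["\']?([A-Za-z0-9_.:-]+)/i';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return '';
}

echo sniffMetaCharset('<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'), "\n";
echo sniffMetaCharset('<meta charset="utf-8">'), "\n";
```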
Parameters
- $content : string
Return values
string
getIndexStatus()
Check the mtime / tstamp of the currently indexed page/file (based on phash)
public
getIndexStatus(int $mtime, string $phash) : IndexStatus
Parameters
- $mtime : int
  mtime value to test against limits and indexed page (usually this is the mtime of the cached document)
- $phash : string
  "phash" used to select any already indexed page to see what its mtime is.
Return values
IndexStatus
getRootLineFields()
Adding values for root-line fields.
public
getRootLineFields(array<string|int, mixed> &$fieldArray) : void
rl0, rl1 and rl2 are standard. A hook might add more.
Parameters
- $fieldArray : array<string|int, mixed>
  Field array, passed by reference
getUrlHeaders()
Getting HTTP request headers of URL
public
getUrlHeaders(string $url) : array<string, string>|false
Parameters
- $url : string
  The URL
Return values
array<string, string>|false
  If no answer, returns FALSE. Otherwise, an array where HTTP headers are keys
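To illustrate the documented return shape (header names as keys, one string per header), the sketch below flattens a PSR-7 style header array, where each name maps to a list of values. flattenHeaders() is a hypothetical helper; how the class actually builds its result is not stated on this page.

```php
<?php
// Hypothetical helper: turns ['Name' => ['v1', 'v2']] into ['Name' => 'v1,v2'].
function flattenHeaders(array $headers): array
{
    $flat = [];
    foreach ($headers as $name => $values) {
        // (array) also accepts a header given as a single string.
        $flat[(string)$name] = implode(',', (array)$values);
    }
    return $flat;
}

$headers = flattenHeaders([
    'Content-Type' => ['text/html; charset=utf-8'],
    'Vary' => ['Accept-Encoding', 'Cookie'],
]);
print_r($headers);
```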
indexAnalyze()
Analyzes content to use for indexing.
public
indexAnalyze(IndexingDataAsArray $indexingDataDto) : array<string|int, mixed>
Parameters
- $indexingDataDto : IndexingDataAsArray
Return values
array<string|int, mixed>
  Index Array (whatever that is...)
indexExternalUrl()
Index External URLs HTML content
public
indexExternalUrl(string $externalUrl) : void
Parameters
- $externalUrl : string
  URL, e.g. "https://typo3.org/"
indexRegularDocument()
Indexing a regular document given as $file (relative to public web path, local file)
public
indexRegularDocument(string $file[, bool $force = false ][, string $contentTmpFile = '' ][, string $altExtension = '' ]) : void
Parameters
- $file : string
  Relative Filename, relative to public web path. It can also be an absolute path as long as it is inside the lockRootPath. Finally, if $contentTmpFile is set, this value can be anything, most likely a URL
- $force : bool = false
  If set, indexing is forced (despite content hashes, mtime etc).
- $contentTmpFile : string = ''
  Temporary file with the content to read it from (instead of $file). Used when the $file is a URL.
- $altExtension : string = ''
  File extension for temporary file.
indexTypo3PageContent()
Start indexing of the TYPO3 page
public
indexTypo3PageContent() : void
init()
public
init([array<string|int, mixed>|null $configuration = null ]) : void
Parameters
- $configuration : array<string|int, mixed>|null = null
  Will be used to set $this->conf, otherwise $this->conf MUST be set with proper values prior to this call
initializeExternalParsers()
public
initializeExternalParsers() : void
is_grlist_set()
Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)
public
is_grlist_set(string $phash_x) : bool
Parameters
- $phash_x : string
Return values
bool
log_setTSlogMessage()
public
log_setTSlogMessage(string $msg[, string $logLevel = LogLevel::INFO ]) : void
Parameters
- $msg : string
- $logLevel : string = LogLevel::INFO
processWordsInArrays()
Processes the words in the array from the split*Content functions. Values are ensured to be unique.
public
processWordsInArrays(IndexingDataAsString $input) : IndexingDataAsArray
Parameters
- $input : IndexingDataAsString
Return values
IndexingDataAsArray
readFileContent()
Reads the content of an external file being indexed.
public
readFileContent(string $fileExtension, string $absoluteFileName, string|int $sectionPointer) : IndexingDataAsString|null
The content from the external parser MUST be returned in utf-8!
Parameters
- $fileExtension : string
  File extension, e.g. "pdf", "doc" etc.
- $absoluteFileName : string
  Absolute filename of file (must exist and be validated OK before calling function)
- $sectionPointer : string|int
  Pointer to section (zero for all file types other than PDF, which will have an indication of the pages into which the document should be split).
Return values
IndexingDataAsString|null
removeOldIndexedFiles()
Removes records for the indexed file, $phash
public
removeOldIndexedFiles(string $phash) : void
Parameters
- $phash : string
  phash value to flush
removeOldIndexedPages()
Removes records for the indexed page, $phash
public
removeOldIndexedPages(string $phash) : void
Parameters
- $phash : string
  phash value to flush
setExtHashes()
Get search hash, external files
public
setExtHashes(string $file[, array<string|int, mixed> $subinfo = [] ]) : array{phash_grouping: string, phash: string}
Parameters
- $file : string
  File name / path which identifies it on the server
- $subinfo : array<string|int, mixed> = []
  Additional content identifying the (subpart of) content. For instance, PDF files are divided into groups of pages for indexing.
Return values
array{phash_grouping: string, phash: string}
setT3Hashes()
Get search hash, T3 pages
public
setT3Hashes() : void
splitHTMLContent()
Splits HTML content and returns an associative array, with title, a list of meta tags, and a list of words in the body.
public
splitHTMLContent(string $content) : IndexingDataAsString
Parameters
- $content : string
  HTML content to index. To some degree expected to be made by TYPO3 (i.e. splitting the header by ":")
Return values
IndexingDataAsString
splitRegularContent()
Splits non-HTML content (from external files for instance)
public
splitRegularContent(string $content) : IndexingDataAsString
Parameters
- $content : string
Return values
IndexingDataAsString
submit_grlist()
Stores gr_list in the database.
public
submit_grlist(string $hash, string $phash_x) : void
Parameters
- $hash : string
  Search result record phash
- $phash_x : string
  Actual phash of current content
submit_section()
Stores section. $hash and $hash_t3 are the same for TYPO3 pages, but different for external files.
public
submit_section(string $hash, string $hash_t3) : void
Parameters
- $hash : string
  phash of TYPO3 parent search result record
- $hash_t3 : string
  phash of the file indexation search record
submitFile_section()
Stores file section for a file IF it does not exist
public
submitFile_section(string $hash) : void
Parameters
- $hash : string
  phash value of file
submitFilePage()
Updates db with information about the file
public
submitFilePage(array<string|int, mixed> $hash, string $file, array<string|int, mixed> $subinfo, string $ext, int $mtime, int $ctime, int $size, string $content_md5h, IndexingDataAsString $indexingDataDto) : void
Parameters
- $hash : array<string|int, mixed>
  Array with phash and phash_grouping keys for file
- $file : string
  File name
- $subinfo : array<string|int, mixed>
  Array of "static_page_arguments" for files. This is, for instance, the page index for a PDF file (for other document types it will be zero).
- $ext : string
  File extension determining the type of media.
- $mtime : int
  Modification time of file.
- $ctime : int
  Creation time of file.
- $size : int
  Size of file in bytes
- $content_md5h : string
  Content HASH value.
- $indexingDataDto : IndexingDataAsString
submitPage()
Updates db with information about the page (TYPO3 page, not external media)
public
submitPage() : void
submitWords()
Submits RELATIONS between words and phash
public
submitWords(array<string|int, mixed> $wordList, string $phash) : void
Parameters
- $wordList : array<string|int, mixed>
- $phash : string
typoSearchTags()
Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.
public
typoSearchTags(string &$body) : bool
Parameters
- $body : string
  HTML Content, passed by reference
Return values
bool
  Returns TRUE if a TYPO3SEARCH_ tag was found, otherwise FALSE.
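A minimal sketch of marker-based filtering, assuming the HTML comment markers `<!--TYPO3SEARCH_begin-->` and `<!--TYPO3SEARCH_end-->` and keeping only the text between complete begin/end pairs. keepMarkedContent() is a hypothetical standalone function; the class method may treat unpaired markers differently.

```php
<?php
// Hypothetical helper: reduces $body to the content between marker pairs.
// Returns false and leaves $body untouched when no complete pair is present.
function keepMarkedContent(string &$body): bool
{
    $pattern = '/<!--\s?TYPO3SEARCH_begin\s?-->(.*?)<!--\s?TYPO3SEARCH_end\s?-->/s';
    if (!preg_match_all($pattern, $body, $matches)) {
        return false;
    }
    // Concatenate every marked section in document order.
    $body = implode('', $matches[1]);
    return true;
}

$html = 'menu<!--TYPO3SEARCH_begin-->Main text<!--TYPO3SEARCH_end-->footer';
$found = keepMarkedContent($html);
var_dump($found, $html);
```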
update_grlist()
Checks if a grlist entry for this hash exists and, if not, writes one.
public
update_grlist(string $phash, string $phash_x) : void
Parameters
- $phash : string
  phash of the search result that should be found
- $phash_x : string
  The real phash of the current content. The two values are different when a page with user login turns out to contain the exact same content as another already indexed version of the page; this is in fact the whole reason for the grlist table.
updateParsetime()
Update parse time for phash row.
public
updateParsetime(string $phash, int $parsetime) : void
Parameters
- $phash : string
- $parsetime : int
updateRootline()
Update section rootline for the page
public
updateRootline() : void
updateSetId()
Update SetID of the index_phash record.
public
updateSetId(string $phash) : void
Parameters
- $phash : string
updateTstamp()
Update tstamp for a phash row.
public
updateTstamp(string $phash[, int $mtime = 0 ]) : void
Parameters
- $phash : string
- $mtime : int = 0
addSpacesToKeywordList()
Makes sure that keywords are space-separated. This is important for displaying them properly as part of the fulltext index.
protected
addSpacesToKeywordList(string $keywordList) : string
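A minimal sketch of the normalisation described above, assuming a comma-separated input list. spaceSeparateKeywords() is a hypothetical standalone function, and dropping empty entries is an assumption rather than documented behaviour.

```php
<?php
// Hypothetical helper: 'a,b , c' becomes 'a, b, c'.
function spaceSeparateKeywords(string $keywordList): string
{
    // Split on commas and strip surrounding whitespace from each keyword.
    $keywords = array_map('trim', explode(',', $keywordList));
    // Assumption: empty entries (e.g. from a trailing comma) are dropped.
    $keywords = array_filter($keywords, static fn (string $k): bool => $k !== '');
    return implode(', ', $keywords);
}

echo spaceSeparateKeywords('typo3,search , index'), "\n";
```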
Parameters
- $keywordList : string
Return values
string
createLocalPath()
Checks if the file is local
protected
createLocalPath(string $sourcePath) : string
Parameters
- $sourcePath : string
Return values
string
  Absolute path to the file if it is local, else an empty string
createLocalPathFromAbsoluteURL()
Attempts to create a local file path from the absolute URL without scheme.
protected
createLocalPathFromAbsoluteURL(string $sourcePath) : string
Parameters
- $sourcePath : string
Return values
string
createLocalPathFromRelativeURL()
Attempts to create a local file path from the relative URL.
protected
createLocalPathFromRelativeURL(string $sourcePath) : string
Parameters
- $sourcePath : string
Return values
string
createLocalPathUsingAbsRefPrefix()
Attempts to create a local file path by matching absRefPrefix. This requires TSFE. If TSFE is missing, this function does nothing.
protected
createLocalPathUsingAbsRefPrefix(string $sourcePath) : string
Parameters
- $sourcePath : string
Return values
string
createLocalPathUsingDomainURL()
Attempts to create a local file path by matching a current request URL.
protected
createLocalPathUsingDomainURL(string $sourcePath) : string
Parameters
- $sourcePath : string
Return values
string
isAllowedLocalFile()
Checks if the path points to the file inside the website
protected
static isAllowedLocalFile(string $filePath) : bool
Parameters
- $filePath : string
Return values
bool
isRelativeURL()
Checks if URL is relative.
protected
static isRelativeURL(string $url) : bool
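One way to test for a relative URL is sketched below. It assumes "relative" means no scheme, no host and no leading slash in the path; this definition is an assumption, and looksRelative() is a hypothetical standalone function rather than the class method. It needs PHP 8 for str_starts_with().

```php
<?php
// Hypothetical helper built on PHP's parse_url().
function looksRelative(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false) {
        return false; // seriously malformed URLs are treated as not relative here
    }
    $hasScheme = ($parts['scheme'] ?? '') !== '';
    $hasHost = ($parts['host'] ?? '') !== '';
    $rootRelative = str_starts_with($parts['path'] ?? '', '/');
    return !$hasScheme && !$hasHost && !$rootRelative;
}

var_dump(looksRelative('fileadmin/docs/manual.pdf'));
var_dump(looksRelative('https://example.org/manual.pdf'));
```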
Parameters
- $url : string
Return values
bool
milliseconds()
Gets the unixtime as milliseconds.
protected
milliseconds() : int
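A minimal sketch of obtaining the Unix time in milliseconds and using it to time a step, assuming PHP's microtime(). currentMilliseconds() is a hypothetical standalone function, and the usleep() call only stands in for the work being measured.

```php
<?php
// Hypothetical helper, not the class method.
function currentMilliseconds(): int
{
    // microtime(true) returns seconds since the Unix epoch as a float.
    return (int)round(microtime(true) * 1000);
}

$start = currentMilliseconds();
usleep(20000); // stand-in for the work being timed (20 ms)
$elapsed = currentMilliseconds() - $start;
echo "Elapsed: {$elapsed} ms\n";
```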