TYPO3CMS 10.4
TYPO3\CMS\IndexedSearch\Hook\CrawlerHook Class Reference

Public Member Functions

 crawler_init (&$pObj)
 
array crawler_execute ($params, &$pObj)
 
 crawler_execute_type1 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type2 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type3 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type4 ($cfgRec, &$session_data, $params, &$pObj)
 
 cleanUpOldRunningConfigurations ()
 
string checkUrl ($url, $urlLog, $baseUrl)
 
array indexExtUrl ($url, $pageId, $rl, $cfgUid, $setId)
 
 indexSingleRecord ($r, $cfgRec, $rl=null)
 
array getUidRootLineForClosestTemplate ($id)
 
int generateNextIndexingTime ($cfgRec)
 
bool checkDeniedSuburls ($url, $url_deny)
 
 addQueueEntryForHook ($cfgRec, $title)
 
 deleteFromIndex ($id)
 
 processCmdmap_preProcess ($command, $table, $id, $value, $pObj)
 
 processDatamap_afterDatabaseOperations ($status, $table, $id, $fieldArray, $pObj)
 

Public Attributes

int $secondsPerExternalUrl = 3
 
int $instanceCounter = 0
 
string $callBack = self::class
 

Protected Member Functions

Indexer initializeIndexer ($id, $type, $sys_language_uid, $MP, $uidRL, $queryArguments=[], $freeIndexUid=0, $freeIndexSetId=0)
 
 indexAsTYPO3Page (Indexer $indexer, $title, $content, $mtime, $crdate=0, $recordUid=0)
 

Private Attributes

object $pObj
 

Detailed Description

Crawler hook for indexed search. Works with the "crawler" extension.

This is a TYPO3-internal hook implementation and not part of TYPO3's Core API.

Definition at line 36 of file CrawlerHook.php.

Member Function Documentation

◆ addQueueEntryForHook()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::addQueueEntryForHook ($cfgRec, $title)

Adds an entry to the queue for the hook.

Parameters
    array   $cfgRec  Configuration record
    string  $title   Title/URL

Definition at line 762 of file CrawlerHook.php.

◆ checkDeniedSuburls()

bool TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::checkDeniedSuburls ($url, $url_deny)

Checks whether $url begins with any of the URLs in the $url_deny list and, if so, returns TRUE.

Parameters
    string  $url       URL to test
    string  $url_deny  String where URLs are separated by line breaks. If any of these strings is the first part of $url, the function returns TRUE (to indicate that descent is denied)
Returns
    bool TRUE if there is a matching URL (hence, do not index!)

Definition at line 743 of file CrawlerHook.php.

References TYPO3\CMS\Core\Utility\GeneralUtility\trimExplode().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().
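The prefix test described for checkDeniedSuburls() can be illustrated with a minimal, stand-alone sketch. It is not the TYPO3 implementation: the function name isDeniedSubUrl() is hypothetical, and plain PHP string functions stand in for GeneralUtility::trimExplode(), which the real method references.

```php
<?php
/**
 * Hypothetical stand-alone sketch, not the TYPO3 implementation.
 * Returns true if $url starts with any non-empty line of $urlDeny.
 */
function isDeniedSubUrl(string $url, string $urlDeny): bool
{
    // Split on any line-break style, trim each entry, and drop empty lines.
    $lines = preg_split('/\R/', $urlDeny) ?: [];
    $prefixes = array_filter(array_map('trim', $lines), 'strlen');

    foreach ($prefixes as $prefix) {
        // A deny entry matches when it is the first part of the URL.
        if (strncmp($url, $prefix, strlen($prefix)) === 0) {
            return true;
        }
    }
    return false;
}
```

Under these assumptions, a URL below one of the listed prefixes is reported as denied, while blank lines in the list are ignored.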

◆ checkUrl()

string TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::checkUrl ($url, $urlLog, $baseUrl)

Checks whether an input URL is allowed to be indexed. This depends on whether it is already present in the URL log.

Parameters
    string  $url      URL string to check
    array   $urlLog   Array of already indexed URLs (the input URL is looked up here and must not exist already)
    string  $baseUrl  Base URL of the indexing process (the input URL must be "inside" the base URL!)
Returns
    string Returns the URL if OK, otherwise an empty string

Definition at line 586 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().
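The two documented conditions for checkUrl() (the URL lies inside the base URL and is not yet in the URL log) can be sketched as follows. This is a simplified illustration under those two assumptions only: the name filterIndexableUrl() is hypothetical, and any further normalisation the real method may perform is left out.

```php
<?php
/**
 * Hypothetical stand-alone sketch, not the TYPO3 implementation.
 * Returns $url if it may be indexed, otherwise an empty string.
 */
function filterIndexableUrl(string $url, array $urlLog, string $baseUrl): string
{
    // Condition 1: the URL must lie "inside" the base URL (prefix match).
    $insideBase = $baseUrl !== '' && strncmp($url, $baseUrl, strlen($baseUrl)) === 0;
    // Condition 2: the URL must not have been logged already.
    $alreadyLogged = in_array($url, $urlLog, true);

    return ($insideBase && !$alreadyLogged) ? $url : '';
}
```

Returning an empty string instead of FALSE mirrors the documented return type, so callers can test the result with a plain string comparison.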

◆ cleanUpOldRunningConfigurations()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::cleanUpOldRunningConfigurations ( )

Looks up all old index configurations that are finished and need to be reset and marked as done.

Definition at line 487 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_init().

◆ crawler_execute()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute ($params, &$pObj)

Callback function for execution of a log element.

Parameters
    array   $params  Params from log element. Must contain $params['indexConfigUid']
    object  $pObj    Parent object (tx_crawler lib)
Returns
    array Result array

Definition at line 188 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type1(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type2(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type4().

◆ crawler_execute_type1()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type1 ($cfgRec, &$session_data, $params, &$pObj)

Indexing records from a table

Parameters
    array   $cfgRec        Indexing Configuration Record
    array   $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call.
    array   $params        Parameters from the log queue.
    object  $pObj          Parent object (from "crawler" extension!)

Definition at line 262 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\Core\Utility\MathUtility\forceIntegerInRange(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().
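The by-reference $session_data parameter shared by the crawler_execute_type*() methods lets one call leave progress information for the next. The following is a generic sketch of that pattern, not code from CrawlerHook: the function processBatch() and the 'offset' key are hypothetical.

```php
<?php
/**
 * Hypothetical batch step: processes up to $limit items per call and
 * remembers its position in $sessionData (passed by reference).
 */
function processBatch(array $items, array &$sessionData, int $limit): array
{
    // 'offset' is an illustrative key, not one used by CrawlerHook.
    $offset = $sessionData['offset'] ?? 0;
    $batch = array_slice($items, $offset, $limit);
    // Writing to the referenced array makes the progress visible to the caller.
    $sessionData['offset'] = $offset + count($batch);
    return $batch;
}

$session = [];
$first = processBatch(['a', 'b', 'c'], $session, 2);  // ['a', 'b']
$second = processBatch(['a', 'b', 'c'], $session, 2); // ['c']
```

Because the array is passed by reference, the caller can persist $session between script instances without the function returning it explicitly.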

◆ crawler_execute_type2()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type2 ($cfgRec, &$session_data, $params, &$pObj)

Indexing files from fileadmin

Parameters
    array   $cfgRec        Indexing Configuration Record
    array   $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call.
    array   $params        Parameters from the log queue.
    object  $pObj          Parent object (from "crawler" extension!)

Definition at line 329 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\Core\Utility\GeneralUtility\get_dirs(), TYPO3\CMS\Core\Core\Environment\getPublicPath(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\initializeIndexer(), TYPO3\CMS\Core\Utility\PathUtility\stripPathSitePrefix(), and TYPO3\CMS\Core\Utility\GeneralUtility\trimExplode().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_execute_type3()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type3 ($cfgRec, &$session_data, $params, &$pObj)

Indexing External URLs

Parameters
    array   $cfgRec        Indexing Configuration Record
    array   $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call.
    array   $params        Parameters from the log queue.
    object  $pObj          Parent object (from "crawler" extension!)

Definition at line 388 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\checkDeniedSuburls(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\checkUrl(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexExtUrl().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_execute_type4()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type4 ($cfgRec, &$session_data, $params, &$pObj)

Page tree indexing type

Parameters
    array   $cfgRec        Indexing Configuration Record
    array   $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call.
    array   $params        Parameters from the log queue.
    object  $pObj          Parent object (from "crawler" extension!)

Definition at line 428 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, and TYPO3\CMS\Backend\Utility\BackendUtility\getRecord().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_init()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_init (&$pObj)

Initialization of the crawler hook. This function is called for each instance of the crawler. It must check whether something is scheduled to happen and, if so, put one or more entries in the crawler's log to start processing. In practice, it selects indexing configurations and evaluates whether any of them need to run.

Parameters
    object  $pObj  Parent object (tx_crawler lib)

Definition at line 65 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\cleanUpOldRunningConfigurations(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\generateNextIndexingTime().

◆ deleteFromIndex()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::deleteFromIndex ($id)

Deletes all data stored by indexed search for a given page.

Parameters
    int  $id  UID of the page for which all pHash entries are deleted

Definition at line 778 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\processCmdmap_preProcess(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\processDatamap_afterDatabaseOperations().

◆ generateNextIndexingTime()

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::generateNextIndexingTime ($cfgRec)

Generates the Unix timestamp for the next visit.

Parameters
    array  $cfgRec  Index configuration record
Returns
    int The next timestamp

Definition at line 717 of file CrawlerHook.php.

References $GLOBALS, and TYPO3\CMS\Core\Utility\MathUtility\forceIntegerInRange().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_init().

◆ getUidRootLineForClosestTemplate()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::getUidRootLineForClosestTemplate ($id)

Gets the rootline for the closest TypoScript template root. Uses the same algorithm as Web > Template, Object browser.

Parameters
    int  $id  The page id to traverse the rootline back from
Returns
    array Array containing the UID values of the rootline.

Definition at line 692 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type1(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type2(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

◆ indexAsTYPO3Page()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::indexAsTYPO3Page (Indexer $indexer, $title, $content, $mtime, $crdate = 0, $recordUid = 0)
protected

Indexing records as the content of a TYPO3 page.

Parameters
    Indexer  $indexer
    string   $title      Title equivalent
    string   $content    The main content to index
    int      $mtime      Last modification time, in seconds
    int      $crdate     The creation date of the content, in seconds
    int      $recordUid  The record UID that the content comes from (for registration with the indexed rows)

Definition at line 972 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Indexer\indexTypo3PageContent().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

◆ indexExtUrl()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::indexExtUrl ($url, $pageId, $rl, $cfgUid, $setId)

Indexing External URL

Parameters
    string  $url     URL, http://....
    int     $pageId  Page id to relate indexing to.
    array   $rl      Rootline array to relate indexing to
    int     $cfgUid  Configuration UID
    int     $setId   Set ID value
Returns
    array URLs found on this page

Definition at line 611 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\initializeIndexer().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().

◆ indexSingleRecord()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::indexSingleRecord ($r, $cfgRec, $rl = null)

◆ initializeIndexer()

Indexer TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::initializeIndexer ($id, $type, $sys_language_uid, $MP, $uidRL, $queryArguments = [], $freeIndexUid = 0, $freeIndexSetId = 0)
protected

Initializing the "combined ID" of the page (phash) being indexed (or for which external media is attached)

Parameters
    int     $id                The page uid, &id=
    int     $type              The page type, &type=
    int     $sys_language_uid  sys_language uid, typically &L=
    string  $MP                The MP variable (Mount Points), &MP=
    array   $uidRL             Rootline array of only UIDs.
    array   $queryArguments    Array of GET variables to register with this indexing
    int     $freeIndexUid      Free index UID
    int     $freeIndexSetId    Set id - an integer identifying the "set" of indexing operations.
Returns
    Indexer

Definition at line 927 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type2(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexExtUrl(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

◆ processCmdmap_preProcess()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::processCmdmap_preProcess ($command, $table, $id, $value, $pObj)

DataHandler hook function for on-the-fly indexing of database records

Parameters
    string       $command  DataHandler command
    string       $table    Table name
    string       $id       Record ID. For a new record, it is a string pointing to an index inside \TYPO3\CMS\Core\DataHandling\DataHandler::substNEWwithIDs
    mixed        $value    Target value (ignored)
    DataHandler  $pObj     DataHandler calling object

Definition at line 838 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\deleteFromIndex().

◆ processDatamap_afterDatabaseOperations()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::processDatamap_afterDatabaseOperations ($status, $table, $id, $fieldArray, $pObj)

DataHandler hook function for on-the-fly indexing of database records

Parameters
    string       $status      Status "new" or "update"
    string       $table       Table name
    string       $id          Record ID. For a new record, it is a string pointing to an index inside \TYPO3\CMS\Core\DataHandling\DataHandler::substNEWwithIDs
    array        $fieldArray  Field array of updated fields in the operation
    DataHandler  $pObj        DataHandler calling object

Definition at line 855 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\$pObj, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\deleteFromIndex(), TYPO3\CMS\Backend\Utility\BackendUtility\getRecord(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

Member Data Documentation

◆ $callBack

string TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$callBack = self::class

Definition at line 52 of file CrawlerHook.php.

◆ $instanceCounter

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$instanceCounter = 0

Counts up for each added URL (type 3)

Definition at line 48 of file CrawlerHook.php.

◆ $pObj

◆ $secondsPerExternalUrl

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$secondsPerExternalUrl = 3

Number of seconds to use as interval between queued indexing operations of URLs / files (types 2 & 3)

Definition at line 42 of file CrawlerHook.php.
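
One way to read $secondsPerExternalUrl together with $instanceCounter is that each queued operation is scheduled a fixed interval after the previous one. The sketch below illustrates that spacing only. It assumes the scheduled time is the start time plus the counter multiplied by the interval, which this reference does not state, and the function scheduleTimes() is hypothetical.

```php
<?php
/**
 * Hypothetical sketch: spaces $count queued operations $secondsPerItem apart.
 * Assumes scheduled time = start time + counter * interval.
 */
function scheduleTimes(int $startTime, int $count, int $secondsPerItem = 3): array
{
    $times = [];
    for ($counter = 0; $counter < $count; $counter++) {
        // The counter plays the role of $instanceCounter in this illustration.
        $times[] = $startTime + $counter * $secondsPerItem;
    }
    return $times;
}
```

With the default interval of 3 seconds, three operations starting at timestamp 1000 would be scheduled at 1000, 1003, and 1006.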