TYPO3CMS 9.5
TYPO3\CMS\IndexedSearch\Hook\CrawlerHook Class Reference

Public Member Functions

 __construct ()
 
 crawler_init (&$pObj)
 
array crawler_execute ($params, &$pObj)
 
 crawler_execute_type1 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type2 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type3 ($cfgRec, &$session_data, $params, &$pObj)
 
 crawler_execute_type4 ($cfgRec, &$session_data, $params, &$pObj)
 
 cleanUpOldRunningConfigurations ()
 
string checkUrl ($url, $urlLog, $baseUrl)
 
array indexExtUrl ($url, $pageId, $rl, $cfgUid, $setId)
 
 indexSingleRecord ($r, $cfgRec, $rl=null)
 
array getUidRootLineForClosestTemplate ($id)
 
int generateNextIndexingTime ($cfgRec)
 
bool checkDeniedSuburls ($url, $url_deny)
 
 addQueueEntryForHook ($cfgRec, $title)
 
 deleteFromIndex ($id)
 
 processCmdmap_preProcess ($command, $table, $id, $value, $pObj)
 
 processDatamap_afterDatabaseOperations ($status, $table, $id, $fieldArray, $pObj)
 

Public Attributes

int $secondsPerExternalUrl = 3
 
int $instanceCounter = 0
 
string $callBack = self::class
 

Detailed Description

Crawler hook for indexed search. Works with the "crawler" extension.

This is a TYPO3-internal hook implementation and not part of TYPO3's Core API.

Definition at line 32 of file CrawlerHook.php.

Constructor & Destructor Documentation

◆ __construct()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::__construct ( )

The constructor

Definition at line 53 of file CrawlerHook.php.

References $GLOBALS.

Member Function Documentation

◆ addQueueEntryForHook()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::addQueueEntryForHook (   $cfgRec,
  $title 
)

Adds an entry to the queue for the hook.

Parameters
array  $cfgRec  Configuration record
string $title   Title/URL

Definition at line 759 of file CrawlerHook.php.

◆ checkDeniedSuburls()

bool TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::checkDeniedSuburls (   $url,
  $url_deny 
)

Checks whether $url starts with any of the URLs in the $url_deny "list" and, if so, returns TRUE.

Parameters
string $url       URL to test
string $url_deny  String where URLs are separated by line breaks. If any of these strings is the first part of $url, the function returns TRUE (to indicate that descent is denied)
Returns
bool TRUE if there is a matching URL (hence, do not index!)

Definition at line 740 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().
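
The prefix test described for checkDeniedSuburls() can be illustrated with a small stand-alone sketch. This is not the TYPO3 implementation: the function name isDeniedSuburl() is hypothetical, and the handling of blank lines and surrounding whitespace is an assumption.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-alone helper, not the TYPO3 implementation.
 * Returns true if $url begins with any non-empty line of $urlDeny.
 */
function isDeniedSuburl(string $url, string $urlDeny): bool
{
    // Split on any line-break style; fall back to an empty list on error.
    $lines = preg_split('/\R/', $urlDeny) ?: [];
    foreach ($lines as $line) {
        // Assumption: surrounding whitespace and blank lines are ignored.
        $prefix = trim($line);
        if ($prefix !== '' && strncmp($url, $prefix, strlen($prefix)) === 0) {
            return true; // $prefix is the first part of $url: deny
        }
    }
    return false;
}
```

With a deny list of "http://example.org/private/", the URL http://example.org/private/a.html would be denied, while http://example.org/public/a.html would not.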

◆ checkUrl()

string TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::checkUrl (   $url,
  $urlLog,
  $baseUrl 
)

Checks whether an input URL is allowed to be indexed. This depends on whether it is already present in the URL log.

Parameters
string $url      URL string to check
array  $urlLog   Array of already indexed URLs (the input URL is looked up here and must not exist already)
string $baseUrl  Base URL of the indexing process (the input URL must be "inside" the base URL!)
Returns
string Returns the URL if OK, otherwise FALSE

Definition at line 588 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().
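
The two conditions named for checkUrl() (the URL must be inside the base URL and must not already be in the log) can be sketched as follows. The function name checkUrlSketch() is hypothetical, and any further normalization the real method may perform is not shown.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the two documented checks, not the TYPO3 code.
 *
 * @param string   $url     URL to check
 * @param string[] $urlLog  URLs that were already indexed
 * @param string   $baseUrl Base URL the input must start with
 * @return string|false The URL if it may be indexed, otherwise false
 */
function checkUrlSketch(string $url, array $urlLog, string $baseUrl)
{
    // "Inside" the base URL is interpreted here as a plain prefix match.
    $insideBase = $baseUrl !== '' && strncmp($url, $baseUrl, strlen($baseUrl)) === 0;
    // The URL must not have been logged before.
    $alreadySeen = in_array($url, $urlLog, true);

    return ($insideBase && !$alreadySeen) ? $url : false;
}
```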

◆ cleanUpOldRunningConfigurations()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::cleanUpOldRunningConfigurations ( )

Looks up all old index configurations that are finished and need to be reset and marked as done.

Definition at line 489 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_init().

◆ crawler_execute()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute (   $params,
  &$pObj
)

Callback function for execution of a log element

Parameters
array  $params  Params from log element. Must contain $params['indexConfigUid']
object $pObj    Parent object (tx_crawler lib)
Returns
array Result array

Definition at line 189 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type1(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type2(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type4().
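
Since crawler_execute() references the four crawler_execute_type*() methods, a dispatch on the configuration type is a plausible reading. The sketch below only illustrates that pattern. The function dispatchIndexingSketch(), the 'type' key, the $configs lookup table, and the shape of the returned array are all assumptions and not taken from the TYPO3 source.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical dispatcher: resolves the handler name for a queued element.
 * $configs maps a configuration uid to a record with an assumed 'type' key.
 */
function dispatchIndexingSketch(array $params, array $configs): array
{
    // The documentation states that indexConfigUid must be present.
    if (!isset($params['indexConfigUid'])) {
        throw new InvalidArgumentException('indexConfigUid is required');
    }
    $uid = (int)$params['indexConfigUid'];
    if (!isset($configs[$uid])) {
        return ['handled' => false, 'handler' => null];
    }

    $handlers = [
        1 => 'crawler_execute_type1', // records from a table
        2 => 'crawler_execute_type2', // files from fileadmin
        3 => 'crawler_execute_type3', // external URLs
        4 => 'crawler_execute_type4', // page tree
    ];
    $type = (int)($configs[$uid]['type'] ?? 0);

    return [
        'handled' => isset($handlers[$type]),
        'handler' => $handlers[$type] ?? null,
    ];
}
```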

◆ crawler_execute_type1()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type1 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj
)

Indexing records from a table

Parameters
array  $cfgRec        Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params        Parameters from the log queue.
object $pObj          Parent object (from "crawler" extension!)

Definition at line 262 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\Core\Utility\MathUtility\forceIntegerInRange(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().
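
The by-reference $session_data parameter of the crawler_execute_type*() methods lets one call leave state behind for the next. A minimal sketch of that pattern follows. The function processBatch() and the 'items' and 'offset' keys are hypothetical and only stand in for whatever the real methods store.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in for one indexing step. Because $sessionData is
 * passed by reference, the updated offset is visible to the caller and
 * can be handed to the next invocation.
 */
function processBatch(array $cfgRec, array &$sessionData, int $batchSize): array
{
    $offset = $sessionData['offset'] ?? 0;
    $batch = array_slice($cfgRec['items'], $offset, $batchSize);
    // Remember how far we got for the next call.
    $sessionData['offset'] = $offset + count($batch);

    return $batch;
}

$cfgRec = ['items' => ['a', 'b', 'c', 'd', 'e']];
$sessionData = [];
$first = processBatch($cfgRec, $sessionData, 2);  // first two items
$second = processBatch($cfgRec, $sessionData, 2); // continues after them
```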

◆ crawler_execute_type2()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type2 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj
)

Indexing files from fileadmin

Parameters
array  $cfgRec        Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params        Parameters from the log queue.
object $pObj          Parent object (from "crawler" extension!)

Definition at line 329 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\Core\Core\Environment\getPublicPath(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), and TYPO3\CMS\Core\Utility\PathUtility\stripPathSitePrefix().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_execute_type3()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type3 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj
)

Indexing External URLs

Parameters
array  $cfgRec        Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params        Parameters from the log queue.
object $pObj          Parent object (from "crawler" extension!)

Definition at line 390 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\checkDeniedSuburls(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\checkUrl(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexExtUrl().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_execute_type4()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_execute_type4 (   $cfgRec,
  &$session_data,
  $params,
  &$pObj
)

Page tree indexing type

Parameters
array  $cfgRec        Indexing Configuration Record
array  $session_data  Session data for the indexing session, which is spread over multiple instances of the script. Passed by reference, so changes to it are saved for the next call!
array  $params        Parameters from the log queue.
object $pObj          Parent object (from "crawler" extension!)

Definition at line 430 of file CrawlerHook.php.

References $GLOBALS, and TYPO3\CMS\Backend\Utility\BackendUtility\getRecord().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute().

◆ crawler_init()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::crawler_init ( &$pObj )

Initialization of the crawler hook. This function is called for each instance of the crawler, and it must check whether something is timed to happen and, if so, put one or more entries in the crawler's log to start processing. In practice, indexing configurations are selected and evaluated to see whether any of them needs to run.

Parameters
object $pObj  Parent object (tx_crawler lib)

Definition at line 69 of file CrawlerHook.php.

References $GLOBALS, TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\cleanUpOldRunningConfigurations(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\generateNextIndexingTime().

◆ deleteFromIndex()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::deleteFromIndex (   $id)

Deletes all data stored by indexed search for a given page

Parameters
int $id  UID of the page for which all pHash entries are deleted

Definition at line 775 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\processCmdmap_preProcess(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\processDatamap_afterDatabaseOperations().

◆ generateNextIndexingTime()

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::generateNextIndexingTime (   $cfgRec)

Generates the Unix timestamp for the next visit.

Parameters
array $cfgRec  Index configuration record
Returns
int The next timestamp

Definition at line 714 of file CrawlerHook.php.

References $GLOBALS, and TYPO3\CMS\Core\Utility\MathUtility\forceIntegerInRange().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_init().

◆ getUidRootLineForClosestTemplate()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::getUidRootLineForClosestTemplate (   $id)

Gets the rootline for the closest TypoScript template root. The algorithm is the same as the one used in Web > Template, Object browser.

Parameters
int $id  The page id to traverse the rootline back from
Returns
array Array containing the uid values of the rootline.

Definition at line 689 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type1(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type2(), TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

◆ indexExtUrl()

array TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::indexExtUrl (   $url,
  $pageId,
  $rl,
  $cfgUid,
  $setId 
)

Indexing External URL

Parameters
string $url     URL, http://....
int    $pageId  Page id to relate indexing to.
array  $rl      Rootline array to relate indexing to
int    $cfgUid  Configuration UID
int    $setId   Set ID value
Returns
array URLs found on this page

Definition at line 611 of file CrawlerHook.php.

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type3().

◆ indexSingleRecord()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::indexSingleRecord (   $r,
  $cfgRec,
  $rl = null 
)

Indexing Single Record

Parameters
array $r       Record to index
array $cfgRec  Configuration Record
array $rl      Rootline array to relate indexing to

Definition at line 657 of file CrawlerHook.php.

References $GLOBALS, and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\getUidRootLineForClosestTemplate().

Referenced by TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\crawler_execute_type1(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\processDatamap_afterDatabaseOperations().

◆ processCmdmap_preProcess()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::processCmdmap_preProcess (   $command,
  $table,
  $id,
  $value,
  $pObj 
)

DataHandler hook function for on-the-fly indexing of database records

Parameters
string      $command  DataHandler command
string      $table    Table name
string      $id       Record ID. For a new record, it is a string pointing to an index inside \TYPO3\CMS\Core\DataHandling\DataHandler::substNEWwithIDs
mixed       $value    Target value (ignored)
DataHandler $pObj     DataHandler calling object

Definition at line 835 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\deleteFromIndex().

◆ processDatamap_afterDatabaseOperations()

TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::processDatamap_afterDatabaseOperations (   $status,
  $table,
  $id,
  $fieldArray,
  $pObj 
)

DataHandler hook function for on-the-fly indexing of database records

Parameters
string      $status      Status "new" or "update"
string      $table       Table name
string      $id          Record ID. For a new record, it is a string pointing to an index inside \TYPO3\CMS\Core\DataHandling\DataHandler::substNEWwithIDs
array       $fieldArray  Field array of updated fields in the operation
DataHandler $pObj        DataHandler calling object

Definition at line 852 of file CrawlerHook.php.

References TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\deleteFromIndex(), TYPO3\CMS\Backend\Utility\BackendUtility\getRecord(), and TYPO3\CMS\IndexedSearch\Hook\CrawlerHook\indexSingleRecord().

Member Data Documentation

◆ $callBack

string TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$callBack = self::class

Definition at line 48 of file CrawlerHook.php.

◆ $instanceCounter

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$instanceCounter = 0

Counts up for each added URL (type 3)

Definition at line 44 of file CrawlerHook.php.

◆ $secondsPerExternalUrl

int TYPO3\CMS\IndexedSearch\Hook\CrawlerHook::$secondsPerExternalUrl = 3

Number of seconds to use as interval between queued indexing operations of URLs / files (types 2 & 3)

Definition at line 38 of file CrawlerHook.php.
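
Taken together, $instanceCounter and $secondsPerExternalUrl suggest a staggered schedule in which each queued URL is set a fixed interval after the previous one. The sketch below shows one way such a schedule could be computed. The class StaggeredQueueSketch, its $queue array, the enqueue() method, and the formula "now + counter * interval" are assumptions for illustration, not the documented behaviour.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of staggered scheduling, not the TYPO3 class.
 */
final class StaggeredQueueSketch
{
    /** @var int Seconds between two queued operations (default from the docs) */
    public $secondsPerExternalUrl = 3;

    /** @var int Counts up for each added URL */
    public $instanceCounter = 0;

    /** @var array[] Hypothetical in-memory queue */
    public $queue = [];

    /**
     * Queues $url and returns the timestamp it was scheduled for.
     */
    public function enqueue(string $url, int $now): int
    {
        $this->instanceCounter++;
        // Assumed formula: each entry is pushed one interval further out.
        $scheduledAt = $now + $this->instanceCounter * $this->secondsPerExternalUrl;
        $this->queue[] = ['url' => $url, 'scheduledAt' => $scheduledAt];

        return $scheduledAt;
    }
}

$sketch = new StaggeredQueueSketch();
$now = 1000;
$t1 = $sketch->enqueue('http://example.org/a', $now);
$t2 = $sketch->enqueue('http://example.org/b', $now);
```

Under these assumptions, the two URLs are scheduled 3 seconds apart, so the external site is not hit by all requests at once.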