HtmlParser

Functions for parsing HTML.

You are encouraged to use this class in your own applications.

Table of Contents

Constants

VOID_ELEMENTS  = 'area|base|br|col|command|embed|hr|img|input|keygen|meta|param|source|track|wbr'

Methods

bidir_htmlspecialchars()  : string
Converts htmlspecialchars forth ($dir=1) AND back ($dir=-1)
compileTagAttribs()  : string
Compiling an array with tag attributes into a string
get_tag_attributes()  : array<string|int, mixed>
Returns an array with all attributes as keys. Attribute names contain only lowercase a-z. If an attribute is empty (shorthand), the value for the key is empty; you can check whether it existed with isset().
getFirstTag()  : string
Returns the first tag in $str. Everything from the beginning of $str is actually returned, so make sure the tag is the first thing in the string.
getFirstTagName()  : string
Returns the NAME of the first tag in $str
HTMLcleaner()  : string
Function that can clean up HTML content according to configuration given in the $tags array.
HTMLparserConfig()  : array<string|int, mixed>
Converts TSconfig into an array for the HTMLcleaner function.
prefixRelPath()  : string
Internal sub-function for ->prefixResourcePath()
prefixResourcePath()  : string
Prefixes the relative paths of href/src/action attributes in the tags [td,table,body,img,input,form,link,script,a] in $content with $main_prefix or an alternative given by $alternatives.
removeFirstAndLastTag()  : string
Removes the first and last tag in the string. Anything before the first tag and after the last tag is also removed.
split_tag_attributes()  : array<string|int, mixed>
Returns an array with the 'components' from an attribute list.
splitIntoBlock()  : array<string|int, mixed>
Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block content.
splitIntoBlockRecursiveProc()  : string
Splitting content into blocks *recursively* and processing tags/content with call back functions.
splitTags()  : array<string|int, mixed>
Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block content.
stripEmptyTags()  : string
Strips empty tags from HTML.
stripEmptyTagsIfConfigured()  : string
Strips the configured empty tags from the HTML code.

Constants

VOID_ELEMENTS

public mixed VOID_ELEMENTS = 'area|base|br|col|command|embed|hr|img|input|keygen|meta|param|source|track|wbr'

Methods

bidir_htmlspecialchars()

Converts htmlspecialchars forth ($dir=1) AND back ($dir=-1)

public bidir_htmlspecialchars(string $value, int $dir) : string
Parameters
$value : string

Input value

$dir : int

Direction: forth ($dir=1, or $dir=2 to preserve existing entities) or back ($dir=-1)

Return values
string

Output value
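
The three directions can be sketched in plain PHP. This is a minimal standalone illustration, assuming the method maps onto PHP's built-in htmlspecialchars() and htmlspecialchars_decode(); the function name bidirHscSketch is hypothetical and is not part of HtmlParser.

```php
<?php
declare(strict_types=1);

// Hypothetical standalone sketch; not the class's actual implementation.
function bidirHscSketch(string $value, int $dir): string
{
    switch ($dir) {
        case 1:
            // Forth: encode every special character, including the "&" of existing entities.
            return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
        case 2:
            // Forth, but leave existing entities untouched (double_encode = false).
            return htmlspecialchars($value, ENT_COMPAT, 'UTF-8', false);
        case -1:
            // Back: decode the special-character entities again.
            return htmlspecialchars_decode($value, ENT_COMPAT);
        default:
            // Any other direction is assumed to leave the value unchanged.
            return $value;
    }
}

$input = 'Tom &amp; Jerry <b>';
echo bidirHscSketch($input, 1), PHP_EOL;
echo bidirHscSketch($input, 2), PHP_EOL;
echo bidirHscSketch($input, -1), PHP_EOL;
```

The difference between $dir=1 and $dir=2 only shows up when the input already contains entities: direction 1 encodes the ampersand of "&amp;" a second time, direction 2 does not.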

compileTagAttribs()

Compiling an array with tag attributes into a string

public compileTagAttribs(array<string|int, mixed> $tagAttrib[, array<string|int, mixed> $meta = [] ]) : string
Parameters
$tagAttrib : array<string|int, mixed>

Tag attributes

$meta : array<string|int, mixed> = []

Meta information about these attributes (like if they were quoted)

Internal
Return values
string

Imploded attributes, eg: 'attribute="value" attrib2="value2"'
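
The shape of the return value can be illustrated with a small standalone sketch. The helper name compileAttribsSketch is hypothetical; it ignores the $meta array entirely, and escaping the values is an assumption of this sketch rather than documented behaviour of the method.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: implodes name => value pairs into an attribute string.
// The real method also takes $meta (e.g. quoting information), omitted here.
function compileAttribsSketch(array $tagAttrib): string
{
    $parts = [];
    foreach ($tagAttrib as $name => $value) {
        // Assumption: values are escaped and always wrapped in double quotes.
        $parts[] = $name . '="' . htmlspecialchars((string)$value, ENT_COMPAT, 'UTF-8') . '"';
    }
    return implode(' ', $parts);
}

echo compileAttribsSketch(['attribute' => 'value', 'attrib2' => 'value2']), PHP_EOL;
```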

get_tag_attributes()

Returns an array with all attributes as keys. Attribute names contain only lowercase a-z. If an attribute is empty (shorthand), the value for the key is empty; you can check whether it existed with isset().

public get_tag_attributes(string $tag[, bool $deHSC = false ]) : array<string|int, mixed>

Compared to the method in GeneralUtility::get_tag_attributes this method also returns meta data about each attribute, e.g. if it is a shorthand attribute, and what the quotation is. Also, since all attribute keys are lower-cased, the meta information contains the original attribute name.

Parameters
$tag : string

Tag: $tag is either a whole tag (e.g. '<TAG OPTION ATTRIB=VALUE>') or the parameter list (e.g. ' OPTION ATTRIB=VALUE>')

$deHSC : bool = false

If set, the attribute values are de-htmlspecialchar'ed. Should actually always be set!

Return values
array<string|int, mixed>

array(Tag attributes,Attribute meta-data)

getFirstTag()

Returns the first tag in $str. Everything from the beginning of $str is actually returned, so make sure the tag is the first thing in the string.

public getFirstTag(string $str) : string


Parameters
$str : string

HTML string with tags

Return values
string

getFirstTagName()

Returns the NAME of the first tag in $str

public getFirstTagName(string $str[, bool $preserveCase = false ]) : string
Parameters
$str : string

HTML tag (The element name MUST be separated from the attributes by a space character! Just whitespace will not do)

$preserveCase : bool = false

If set, the tag is NOT converted to uppercase; its case is preserved.

Tags
see
getFirstTag()
Return values
string

Tag name, in upper case unless $preserveCase is set

HTMLcleaner()

Function that can clean up HTML content according to configuration given in the $tags array.

public HTMLcleaner(string $content[, array<string|int, mixed> $tags = [] ][, mixed $keepAll = 0 ][, int $hSC = 0 ][, array<string|int, mixed> $addConfig = [] ]) : string

To initialize the $tags array to allow a list of tags (in this case <B>, <I>, <U> and <A>), set it like this: $tags = array_flip(explode(',','b,a,i,u')). If the value of the $tags[$tagname] entry is an array, advanced processing of the tags is initialized. These are the options:

$tags[$tagname] = Array(
'overrideAttribs' => ''		If set, this string is preset as the attributes of the tag
'allowedAttribs' =>   '0' (zero) = no attributes allowed, '[commalist of attributes]' = only allowed attributes. If blank, all attributes are allowed.
'fixAttrib' => Array(
'[attribute name]' => Array (
'set' => Force the attribute value to this value.
'unset' => Boolean: If set, the attribute is unset.
'default' =>	 If no attribute exists by this name, this value is set as default value (if this value is not blank)
'always' =>	 Boolean. If set, the attribute is always processed. Normally an attribute is processed only if it exists
'trim,intval,lower,upper' =>	 All booleans. If any of these keys are set, the value is passed through the respective PHP-functions.
'range' => Array ('[low limit]','[high limit, optional]')		Setting integer range.
'list' => Array ('[value1/default]','[value2]','[value3]')		Attribute must be in this list. If not, the value is set to the first element.
'removeIfFalse' =>	 Boolean/'blank'.	If set, then the attribute is removed if it is 'FALSE'. If this value is set to 'blank' then the value must be a blank string (that means a 'zero' value will not be removed)
'removeIfEquals' =>	 [value]	If the attribute value matches the value set here, then it is removed.
'casesensitiveComp' => 1	If set, then the removeIfEquals and list comparisons will be case sensitive. Otherwise not.
)
),
'protect' => '',	Boolean. If set, the tag <> is converted to &lt; and &gt;
'remap' => '',		String. If set, the tagname is remapped to this tagname
'rmTagIfNoAttrib' => '',	Boolean. If set, then the tag is removed if no attributes happened to be there.
'nesting' => '',	Boolean/'global'. If set to TRUE, this tag must have starting and ending tags in the correct order. Any tags not in this order will be discarded. Thus '</B><B><I></B></I></B>' will be converted to '<B><I></B></I>'. If the value is 'global', true nesting in relation to other tags marked for 'global' nesting control is preserved. This means that if <B> and <I> are set for global nesting, the string '</B><B><I></B></I></B>' is converted to '<B></B>'
)
Parameters
$content : string

The HTML content being processed. This is also the result being returned.

$tags : array<string|int, mixed> = []

An array where each key is a tag name in lowercase. Only tags present as keys in this array are preserved. The value of each key can be an array with a vast number of options to configure.

$keepAll : mixed = 0

Boolean/'protect'. If set, all tags are kept regardless of whether they are present as keys in the $tags array. If 'protect', the preserved tags have their <> converted to &lt; and &gt;.

$hSC : int = 0

Values -1, 0, 1, 2: 0 = disabled; 1 = the content BETWEEN tags is htmlspecialchar()'ed; -1 = the opposite; 2 = the content is HSC'ed BUT with preservation of real entities (e.g. "&amp;" or "&ecirc;").

$addConfig : array<string|int, mixed> = []

Configuration array sent along as $conf to the internal functions

Return values
string

Processed HTML content
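
As an illustration of the $tags options described above, the following sketch builds a configuration array in plain PHP. The specific choices (which attributes to allow, remapping b to strong) are examples only, and the commented-out call assumes an HtmlParser instance is available as $parser, which is not shown here.

```php
<?php
declare(strict_types=1);

// Allow <b>, <a>, <i> and <u>; tags missing from the array are not preserved.
$tags = array_flip(explode(',', 'b,a,i,u'));

// Advanced processing for <a>: keep only href/target and constrain target.
$tags['a'] = [
    'allowedAttribs' => 'href,target',
    'fixAttrib' => [
        'target' => [
            // A value outside this list is set to the first element.
            'list' => ['_self', '_blank'],
        ],
    ],
    // Drop the tag when no attributes are left.
    'rmTagIfNoAttrib' => 1,
];

// Remap <b> to <strong> and require correctly ordered start/end tags.
$tags['b'] = [
    'remap' => 'strong',
    'nesting' => 1,
];

print_r(array_keys($tags));

// Assumed usage with an existing HtmlParser instance in $parser:
// $clean = $parser->HTMLcleaner($content, $tags, 0, 0);
```

Entries that stay scalar (here i and u) simply mark the tag as allowed; only array values trigger the advanced processing.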

HTMLparserConfig()

Converts TSconfig into an array for the HTMLcleaner function.

public HTMLparserConfig(array<string|int, mixed> $TSconfig[, array<string|int, mixed> $keepTags = [] ]) : array<string|int, mixed>
Parameters
$TSconfig : array<string|int, mixed>

TSconfig for HTMLcleaner

$keepTags : array<string|int, mixed> = []

Array of tags to keep (?)

Internal
Return values
array<string|int, mixed>

prefixRelPath()

Internal sub-function for ->prefixResourcePath()

public prefixRelPath(string $prefix, string $srcVal[, string $suffix = '' ]) : string
Parameters
$prefix : string

Prefix string

$srcVal : string

Relative path/URL

$suffix : string = ''

Suffix string

Internal
Return values
string

Output path, prefixed if no scheme in input string

prefixResourcePath()

Prefixes the relative paths of href/src/action attributes in the tags [td,table,body,img,input,form,link,script,a] in $content with $main_prefix or an alternative given by $alternatives.

public prefixResourcePath(string $main_prefix, string $content[, array<string|int, mixed> $alternatives = [] ][, string $suffix = '' ]) : string
Parameters
$main_prefix : string

Prefix string

$content : string

HTML content

$alternatives : array<string|int, mixed> = []

Array with alternative prefixes for certain tags: key => value pairs where the keys are the tag element names in uppercase

$suffix : string = ''

Suffix string (put after the resource).

Return values
string

Processed HTML content

removeFirstAndLastTag()

Removes the first and last tag in the string. Anything before the first tag and after the last tag is also removed.

public removeFirstAndLastTag(string $str) : string
Parameters
$str : string

String to process

Return values
string
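
The described behaviour can be sketched with two string searches. This is a minimal standalone illustration; the function name removeFirstAndLastTagSketch is hypothetical, and returning an empty string when no tag pair is found is an assumption of the sketch.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: keep only what lies between the first ">" and the last "<".
function removeFirstAndLastTagSketch(string $str): string
{
    $start = strpos($str, '>');  // end of the first tag
    $end = strrpos($str, '<');   // start of the last tag
    if ($start === false || $end === false || $end <= $start) {
        // Assumption: without a surrounding tag pair nothing is returned.
        return '';
    }
    return substr($str, $start + 1, $end - $start - 1);
}

echo removeFirstAndLastTagSketch('junk<div class="x">Hello <b>world</b></div>tail'), PHP_EOL;
```

Note that text before the first tag ("junk") and after the last tag ("tail") is dropped along with the tags themselves.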

split_tag_attributes()

Returns an array with the 'components' from an attribute list.

public split_tag_attributes(string $tag) : array<string|int, mixed>

The result is normally analyzed by get_tag_attributes(). Removes the tag name if found.

The difference between this method and the one in GeneralUtility is that this method actually determines more information on the attribute, e.g. if the value is enclosed by a " or ' character. That's why this method returns two arrays, the "components" and the "meta-information" of the "components".

Parameters
$tag : string

The tag or attributes

Internal
Tags
see
GeneralUtility::split_tag_attributes()
Return values
array<string|int, mixed>

splitIntoBlock()

Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block content.

public splitIntoBlock(string $tag, string $content[, bool $eliminateExtraEndTags = false ]) : array<string|int, mixed>

Use ->removeFirstAndLastTag() to process the content if needed.

Parameters
$tag : string

List of tags, comma separated.

$content : string

HTML-content

$eliminateExtraEndTags : bool = false

If set, excessive end tags are ignored - you should probably set this in most cases.

Tags
see
splitTags()
see
removeFirstAndLastTag()
Return values
array<string|int, mixed>

Even numbers in the array are outside the blocks, Odd numbers are block-content.
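
The even/odd layout of the result can be illustrated with a simplified standalone splitter. This sketch handles a single tag name and no nesting, which the real method is not limited to; the function name splitIntoBlockSketch is hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical, simplified splitter: one tag name, no nested blocks.
function splitIntoBlockSketch(string $tag, string $content): array
{
    $name = preg_quote($tag, '/');
    // The capture group makes preg_split() keep each block in the result.
    $pattern = '/(<' . $name . '\b[^>]*>.*?<\/' . $name . '>)/is';
    $parts = preg_split($pattern, $content, -1, PREG_SPLIT_DELIM_CAPTURE);
    return $parts === false ? [$content] : $parts;
}

$parts = splitIntoBlockSketch('b', 'Intro <b>bold</b> middle <b>more</b> end');
foreach ($parts as $index => $part) {
    // Even index: text outside the blocks. Odd index: a block including its tags.
    $kind = $index % 2 === 0 ? 'outside' : 'block';
    echo $index, ' ', $kind, ': ', $part, PHP_EOL;
}
```

Because the odd entries still contain their opening and closing tags, removeFirstAndLastTag() is the natural follow-up when only the inner content is needed.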

splitIntoBlockRecursiveProc()

Splitting content into blocks *recursively* and processing tags/content with call back functions.

public splitIntoBlockRecursiveProc(string $tag, string $content, object &$procObj, string $callBackContent, string $callBackTags[, int $level = 0 ]) : string
Parameters
$tag : string

Tag list, see splitIntoBlock()

$content : string

Content, see splitIntoBlock()

$procObj : object

Object where call back methods are.

$callBackContent : string

Name of call back method for content; "function callBackContent($str,$level)"

$callBackTags : string

Name of call back method for tags; "function callBackTags($tags,$level)"

$level : int = 0

Indent level

Tags
see
splitIntoBlock()
Return values
string

Processed content

splitTags()

Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block content.

public splitTags(string $tag, string $content) : array<string|int, mixed>

Use ->removeFirstAndLastTag() to process the content if needed.

Parameters
$tag : string

List of tags

$content : string

HTML-content

Tags
see
splitIntoBlock()
see
removeFirstAndLastTag()
Return values
array<string|int, mixed>

Even numbers in the array are outside the blocks, Odd numbers are block-content.

stripEmptyTags()

Strips empty tags from HTML.

public stripEmptyTags(string $content[, string $tagList = '' ][, bool $treatNonBreakingSpaceAsEmpty = false ][, bool $keepTags = false ]) : string
Parameters
$content : string

The content to be stripped of empty tags

$tagList : string = ''

The comma-separated list of tags to be stripped. If empty, all empty tags will be stripped.

$treatNonBreakingSpaceAsEmpty : bool = false

If TRUE, tags containing only &nbsp; entities will be treated as empty.

$keepTags : bool = false

If true, the provided tags will be kept instead of stripped.

Return values
string

The stripped content
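
A rough standalone sketch of the idea, assuming a regular-expression approach: the function name stripEmptyTagsSketch is hypothetical, and the $tagList and $keepTags parameters of the real method are left out.

```php
<?php
declare(strict_types=1);

// Hypothetical sketch: removes element pairs that contain nothing but whitespace
// (and optionally &nbsp; entities). Tag filtering is not implemented here.
function stripEmptyTagsSketch(string $content, bool $treatNbspAsEmpty = false): string
{
    $filler = $treatNbspAsEmpty ? '(?:\s|&nbsp;)*' : '\s*';
    $pattern = '/<([a-z][a-z0-9]*)\b[^>]*>' . $filler . '<\/\1\s*>/i';
    // Repeat so that tags emptied by an earlier pass are removed as well.
    do {
        $content = preg_replace($pattern, '', $content, -1, $count);
    } while ($content !== null && $count > 0);
    return (string)$content;
}

echo stripEmptyTagsSketch('<p>Text</p><p> </p><div><span></span></div>'), PHP_EOL;
echo stripEmptyTagsSketch('<p>&nbsp;</p><p>Kept</p>', true), PHP_EOL;
```

The loop matters for nested cases: the inner span is removed in the first pass, which leaves an empty div that the second pass removes.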

stripEmptyTagsIfConfigured()

Strips the configured empty tags from the HTML code.

protected stripEmptyTagsIfConfigured(string $value, array<string|int, mixed> $configuration) : string
Parameters
$value : string
$configuration : array<string|int, mixed>
Return values
string

        