HtmlParser
Functions for parsing HTML.
You are encouraged to use this class in your own applications
Table of Contents
Constants
- VOID_ELEMENTS = 'area|base|br|col|command|embed|hr|img|input|keygen|meta|param|source|track|wbr'
Methods
- bidir_htmlspecialchars() : string
- Converts htmlspecialchars forth ($dir=1) AND back ($dir=-1)
- compileTagAttribs() : string
- Compiling an array with tag attributes into a string
- get_tag_attributes() : array<string|int, mixed>
- Returns an array with all attributes as keys. Attributes are only lowercase a-z. If an attribute is empty (shorthand), then the value for the key is empty. You can check if it existed with isset().
- getFirstTag() : string
- Returns the first tag in $str. Actually, everything from the beginning of $str is returned, so make sure the tag is the first thing in the string.
- getFirstTagName() : string
- Returns the NAME of the first tag in $str
- HTMLcleaner() : string
- Function that can clean up HTML content according to configuration given in the $tags array.
- HTMLparserConfig() : array<string|int, mixed>
- Converts TSconfig into an array for the HTMLcleaner function.
- prefixRelPath() : string
- Internal sub-function for ->prefixResourcePath()
- prefixResourcePath() : string
- Prefixes the relative paths of the href/src/action attributes in the tags [td,table,body,img,input,form,link,script,a] in $content with $main_prefix or an alternative given by $alternatives
- removeFirstAndLastTag() : string
- Removes the first and last tag in the string. Anything before the first tag and after the last tag is also removed.
- split_tag_attributes() : array<string|int, mixed>
- Returns an array with the 'components' from an attribute list.
- splitIntoBlock() : array<string|int, mixed>
- Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block-content.
- splitIntoBlockRecursiveProc() : string
- Splitting content into blocks *recursively* and processing tags/content with call back functions.
- splitTags() : array<string|int, mixed>
- Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block-content.
- stripEmptyTags() : string
- Strips empty tags from HTML.
- stripEmptyTagsIfConfigured() : string
- Strips the configured empty tags from the HTML code.
Constants
VOID_ELEMENTS
public
mixed VOID_ELEMENTS = 'area|base|br|col|command|embed|hr|img|input|keygen|meta|param|source|track|wbr'
Methods
bidir_htmlspecialchars()
Converts htmlspecialchars forth ($dir=1) AND back ($dir=-1)
public
bidir_htmlspecialchars(string $value, int $dir) : string
Parameters
- $value : string
-
Input value
- $dir : int
-
Direction: forth ($dir=1, or $dir=2 to preserve existing entities) AND back ($dir=-1)
Return values
string —Output value
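The three directions can be approximated with PHP's built-in functions. The sketch below is not the class's implementation: the helper name `hscSketch` is hypothetical, and mapping `$dir=2` to `double_encode = false` is an assumption based on the "preserving entities" wording above.

```php
<?php
// Hypothetical stand-in for bidir_htmlspecialchars(); uses only PHP built-ins.
function hscSketch(string $value, int $dir): string
{
    switch ($dir) {
        case 1:
            // Forth: encode &, <, > and double quotes.
            return htmlspecialchars($value, ENT_COMPAT, 'UTF-8');
        case 2:
            // Forth, but leave existing entities such as &amp; untouched (assumption).
            return htmlspecialchars($value, ENT_COMPAT, 'UTF-8', false);
        case -1:
            // Back: decode the entities again.
            return htmlspecialchars_decode($value, ENT_COMPAT);
        default:
            // Any other direction: return the input unchanged.
            return $value;
    }
}

echo hscSketch('a < b &amp; c', 1), "\n";  // a &lt; b &amp;amp; c
echo hscSketch('a < b &amp; c', 2), "\n";  // a &lt; b &amp; c
echo hscSketch('a &lt; b', -1), "\n";      // a < b
```

Note how `$dir=1` double-encodes the existing `&amp;`, while `$dir=2` leaves it alone.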
compileTagAttribs()
Compiling an array with tag attributes into a string
public
compileTagAttribs(array<string|int, mixed> $tagAttrib[, array<string|int, mixed> $meta = [] ]) : string
Parameters
- $tagAttrib : array<string|int, mixed>
-
Tag attributes
- $meta : array<string|int, mixed> = []
-
Meta information about these attributes (like if they were quoted)
Return values
string —Imploded attributes, eg: 'attribute="value" attrib2="value2"'
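A minimal sketch of what "imploding" attributes means, written with plain PHP. The function name `compileAttribsSketch` is hypothetical; it ignores `$meta`, always uses double quotes, and assumes the values are already escaped, none of which is confirmed for the real method.

```php
<?php
// Hypothetical sketch: join attribute name/value pairs into one string.
function compileAttribsSketch(array $tagAttrib): string
{
    $parts = [];
    foreach ($tagAttrib as $name => $value) {
        // Always quote with double quotes here; the real method can
        // consult $meta to decide on quoting instead.
        $parts[] = $name . '="' . $value . '"';
    }
    return implode(' ', $parts);
}

echo compileAttribsSketch(['attribute' => 'value', 'attrib2' => 'value2']), "\n";
// attribute="value" attrib2="value2"
```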
get_tag_attributes()
Returns an array with all attributes as keys. Attributes are only lowercase a-z. If an attribute is empty (shorthand), then the value for the key is empty. You can check if it existed with isset().
public
get_tag_attributes(string $tag[, bool $deHSC = false ]) : array<string|int, mixed>
Compared to the method in GeneralUtility::get_tag_attributes this method also returns meta data about each attribute, e.g. if it is a shorthand attribute, and what the quotation is. Also, since all attribute keys are lower-cased, the meta information contains the original attribute name.
Parameters
- $tag : string
-
Tag: $tag is either a whole tag (eg '<TAG OPTION ATTRIB=VALUE>') or the parameter list (eg ' OPTION ATTRIB=VALUE>')
- $deHSC : bool = false
-
If set, the attribute values are de-htmlspecialchar'ed. Should actually always be set!
Return values
array<string|int, mixed> —array(Tag attributes,Attribute meta-data)
getFirstTag()
Returns the first tag in $str. Actually, everything from the beginning of $str is returned, so make sure the tag is the first thing in the string.
public
getFirstTag(string $str) : string
Parameters
- $str : string
-
HTML string with tags
Return values
string
getFirstTagName()
Returns the NAME of the first tag in $str
public
getFirstTagName(string $str[, bool $preserveCase = false ]) : string
Parameters
- $str : string
-
HTML tag (The element name MUST be separated from the attributes by a space character! Just whitespace will not do)
- $preserveCase : bool = false
-
If set, then the tag is NOT converted to uppercase; its case is preserved.
Return values
string —Tag name in upper case
HTMLcleaner()
Function that can clean up HTML content according to configuration given in the $tags array.
public
HTMLcleaner(string $content[, array<string|int, mixed> $tags = [] ][, mixed $keepAll = 0 ][, int $hSC = 0 ][, array<string|int, mixed> $addConfig = [] ]) : string
Initializing the $tags array: to allow a list of tags (in this case <B>, <I>, <U> and <A>), set it like this: $tags = array_flip(explode(',','b,a,i,u')). If the value of the $tags[$tagname] entry is an array, advanced processing of the tags is initialized. These are the options:
$tags[$tagname] = Array(
    'overrideAttribs' => '' If set, this string is preset as the attributes of the tag
    'allowedAttribs' => '0' (zero) = no attributes allowed, '[commalist of attributes]' = only allowed attributes. If blank, all attributes are allowed.
    'fixAttrib' => Array(
        '[attribute name]' => Array (
            'set' => Force the attribute value to this value.
            'unset' => Boolean: If set, the attribute is unset.
            'default' => If no attribute exists by this name, this value is set as default value (if this value is not blank)
            'always' => Boolean. If set, the attribute is always processed. Normally an attribute is processed only if it exists
            'trim,intval,lower,upper' => All booleans. If any of these keys are set, the value is passed through the respective PHP-functions.
            'range' => Array ('[low limit]','[high limit, optional]') Setting integer range.
            'list' => Array ('[value1/default]','[value2]','[value3]') Attribute must be in this list. If not, the value is set to the first element.
            'removeIfFalse' => Boolean/'blank'. If set, then the attribute is removed if it is 'FALSE'. If this value is set to 'blank' then the value must be a blank string (that means a 'zero' value will not be removed)
            'removeIfEquals' => [value] If the attribute value matches the value set here, then it is removed.
            'casesensitiveComp' => 1 If set, then the removeIfEquals and list comparisons will be case sensitive. Otherwise not.
        )
    ),
    'protect' => '', Boolean. If set, the tag <> is converted to &lt; and &gt;
    'remap' => '', String. If set, the tagname is remapped to this tagname
    'rmTagIfNoAttrib' => '', Boolean. If set, then the tag is removed if no attributes happened to be there.
    'nesting' => '', Boolean/'global'. If set TRUE, then this tag must have starting and ending tags in the correct order. Any tags not in this order will be discarded. Thus '</B><B><I></B></I></B>' will be converted to '<B><I></B></I>'. If the value is 'global', then true nesting in relation to other tags marked for 'global' nesting control is preserved. This means that if <B> and <I> are set for global nesting then this string '</B><B><I></B></I></B>' is converted to '<B></B>'.
)
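As an illustration of the two forms, the following sketch builds a `$tags` array using only option keys listed above. The concrete values (allowed attributes, the `target` list, the `strong` remap) are invented for the example and are not taken from any real configuration.

```php
<?php
// Simple form: allow only <b>, <a>, <i> and <u>.
// array_flip() turns the tag names into keys; the values are just list indexes.
$tags = array_flip(explode(',', 'b,a,i,u'));

// Advanced form: replace the value for a tag with an options array.
// All option values below are made up for illustration.
$tags['a'] = [
    'allowedAttribs' => 'href,target',
    'fixAttrib' => [
        'target' => [
            // Any other value is replaced by the first element, '_self'.
            'list' => ['_self', '_blank'],
        ],
    ],
];
$tags['b'] = [
    'remap' => 'strong', // rename the tag
    'nesting' => true,   // discard start/end tags that are out of order
];

print_r(array_keys($tags)); // keys: b, a, i, u
```

An array like this would then be passed as the `$tags` argument of HTMLcleaner().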
Parameters
- $content : string
-
The HTML content being processed. The processed content is also what is returned.
- $tags : array<string|int, mixed> = []
-
An array where each key is a tagname in lowercase. Only tags present as keys in this array are preserved. The value of the key can be an array with a vast number of options to configure.
- $keepAll : mixed = 0
-
Boolean/'protect'. If set, then all tags are kept regardless of the tags present as keys in the $tags array. If 'protect', then the preserved tags have their <> converted to &lt; and &gt;
- $hSC : int = 0
-
Values -1, 0, 1, 2: 0 = disabled; 1 = the content BETWEEN tags is htmlspecialchar()'ed; -1 = the opposite; 2 = the content is HSC'ed BUT with preservation of real entities (eg. "&amp;" or "&#234;")
- $addConfig : array<string|int, mixed> = []
-
Configuration array sent along as $conf to the internal functions
Return values
string —Processed HTML content
HTMLparserConfig()
Converts TSconfig into an array for the HTMLcleaner function.
public
HTMLparserConfig(array<string|int, mixed> $TSconfig[, array<string|int, mixed> $keepTags = [] ]) : array<string|int, mixed>
Parameters
- $TSconfig : array<string|int, mixed>
-
TSconfig for HTMLcleaner
- $keepTags : array<string|int, mixed> = []
-
Array of tags to keep (?)
Return values
array<string|int, mixed>
prefixRelPath()
Internal sub-function for ->prefixResourcePath()
public
prefixRelPath(string $prefix, string $srcVal[, string $suffix = '' ]) : string
Parameters
- $prefix : string
-
Prefix string
- $srcVal : string
-
Relative path/URL
- $suffix : string = ''
-
Suffix string
Return values
string —Output path, prefixed if no scheme in input string
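The rule "prefixed if no scheme in input string" can be sketched with PHP's built-in `parse_url()`. The helper name `prefixRelPathSketch` and the example URLs are hypothetical, and the real method may apply further checks; in particular, leaving the suffix off when nothing is prefixed is an assumption of this sketch.

```php
<?php
// Hypothetical sketch of "prefix only when the input has no scheme".
function prefixRelPathSketch(string $prefix, string $srcVal, string $suffix = ''): string
{
    // parse_url() yields a string for the scheme component only if one is present.
    $scheme = parse_url($srcVal, PHP_URL_SCHEME);
    if (is_string($scheme)) {
        return $srcVal; // already absolute: leave untouched
    }
    return $prefix . $srcVal . $suffix;
}

echo prefixRelPathSketch('https://example.org/', 'images/logo.png'), "\n";
// https://example.org/images/logo.png
echo prefixRelPathSketch('https://example.org/', 'https://cdn.example.org/x.png'), "\n";
// https://cdn.example.org/x.png
```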
prefixResourcePath()
Prefixes the relative paths of the href/src/action attributes in the tags [td,table,body,img,input,form,link,script,a] in $content with $main_prefix or an alternative given by $alternatives
public
prefixResourcePath(string $main_prefix, string $content[, array<string|int, mixed> $alternatives = [] ][, string $suffix = '' ]) : string
Parameters
- $main_prefix : string
-
Prefix string
- $content : string
-
HTML content
- $alternatives : array<string|int, mixed> = []
-
Array with alternative prefixes for certain of the tags. key=>value pairs where the keys are the tag element names in uppercase
- $suffix : string = ''
-
Suffix string (put after the resource).
Return values
string —Processed HTML content
removeFirstAndLastTag()
Removes the first and last tag in the string. Anything before the first tag and after the last tag is also removed.
public
removeFirstAndLastTag(string $str) : string
Parameters
- $str : string
-
String to process
Return values
string
split_tag_attributes()
Returns an array with the 'components' from an attribute list.
public
split_tag_attributes(string $tag) : array<string|int, mixed>
The result is normally analyzed by get_tag_attributes(). Removes the tag name if found.
The difference between this method and the one in GeneralUtility is that this method actually determines more information on the attribute, e.g. if the value is enclosed by a " or ' character. That's why this method returns two arrays, the "components" and the "meta-information" of the "components".
Parameters
- $tag : string
-
The tag or attributes
Return values
array<string|int, mixed>
splitIntoBlock()
Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block-content.
public
splitIntoBlock(string $tag, string $content[, bool $eliminateExtraEndTags = false ]) : array<string|int, mixed>
Use ->removeFirstAndLastTag() to process the content if needed.
Parameters
- $tag : string
-
List of tags, comma separated.
- $content : string
-
HTML-content
- $eliminateExtraEndTags : bool = false
-
If set, excessive end tags are ignored - you should probably set this in most cases.
Return values
array<string|int, mixed> —Even numbers in the array are outside the blocks, Odd numbers are block-content.
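To make the even/odd convention concrete, here is a sketch that sorts the entries of such an array by index parity. The input array is written by hand to resemble the documented shape for the tag list 'b'; it is not output captured from splitIntoBlock(). That block entries include their surrounding tags is an assumption, suggested by the hint to use removeFirstAndLastTag(). The helper name `classifyParts` is hypothetical.

```php
<?php
// Hypothetical helper: separate "outside" entries (even index)
// from "block" entries (odd index).
function classifyParts(array $parts): array
{
    $result = ['outside' => [], 'block' => []];
    foreach (array_values($parts) as $index => $part) {
        $kind = ($index % 2 === 1) ? 'block' : 'outside';
        $result[$kind][] = $part;
    }
    return $result;
}

// Hand-written example data, not real output of the class.
$parts = [
    'Some ',        // 0: outside the blocks
    '<b>bold</b>',  // 1: block content
    ' text',        // 2: outside the blocks
];

print_r(classifyParts($parts));
```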
splitIntoBlockRecursiveProc()
Splitting content into blocks *recursively* and processing tags/content with call back functions.
public
splitIntoBlockRecursiveProc(string $tag, string $content, object &$procObj, string $callBackContent, string $callBackTags[, int $level = 0 ]) : string
Parameters
- $tag : string
-
Tag list, see splitIntoBlock()
- $content : string
-
Content, see splitIntoBlock()
- $procObj : object
-
Object where call back methods are.
- $callBackContent : string
-
Name of call back method for content; "function callBackContent($str,$level)"
- $callBackTags : string
-
Name of call back method for tags; "function callBackTags($tags,$level)"
- $level : int = 0
-
Indent level
Return values
string —Processed content
splitTags()
Returns an array with the $content divided by tag-blocks specified with the list of tags, $tag. Even numbers in the array are outside the blocks, odd numbers are block-content.
public
splitTags(string $tag, string $content) : array<string|int, mixed>
Use ->removeFirstAndLastTag() to process the content if needed.
Parameters
- $tag : string
-
List of tags
- $content : string
-
HTML-content
Return values
array<string|int, mixed> —Even numbers in the array are outside the blocks, Odd numbers are block-content.
stripEmptyTags()
Strips empty tags from HTML.
public
stripEmptyTags(string $content[, string $tagList = '' ][, bool $treatNonBreakingSpaceAsEmpty = false ][, bool $keepTags = false ]) : string
Parameters
- $content : string
-
The content to be stripped of empty tags
- $tagList : string = ''
-
The comma separated list of tags to be stripped. If empty, all empty tags will be stripped
- $treatNonBreakingSpaceAsEmpty : bool = false
-
If TRUE, tags containing only &nbsp; entities will be treated as empty.
- $keepTags : bool = false
-
If true, the provided tags will be kept instead of stripped.
Return values
string —the stripped content
stripEmptyTagsIfConfigured()
Strips the configured empty tags from the HTML code.
protected
stripEmptyTagsIfConfigured(string $value, array<string|int, mixed> $configuration) : string
Parameters
- $value : string
- $configuration : array<string|int, mixed>