Class HtmlRewriter
- Direct Known Subclasses:
RewriteContext
The user can sequentially examine and rewrite each token in the source HTML document. As each token in the document is seen, the user has two choices:
- modify the current token.
- don't modify the current token.
Parsing is implemented lazily, meaning, for example, that unless the user actually asks for attributes of an HTML tag, this parser does not have to spend the time breaking up the attributes.
This class is used by HTML filters to maintain the state of the document and allow the filters to perform arbitrary rewriting.
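The lazy-parsing idea can be illustrated with a small stand-alone sketch. It is not the HtmlRewriter or LexHTML implementation: the class name LazyTagSketch, its parseCount field, and the whitespace-only attribute split (which ignores quoted values containing spaces) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a lazily parsed tag; not the HtmlRewriter implementation.
class LazyTagSketch {
    private final String raw;          // text between '<' and '>'
    private List<String> attributes;   // stays null until first requested
    int parseCount = 0;                // how many times the attribute split actually ran

    LazyTagSketch(String raw) {
        this.raw = raw.trim();
    }

    // Cheap: looks only at the first word and never touches the attributes.
    String name() {
        int space = raw.indexOf(' ');
        return (space < 0 ? raw : raw.substring(0, space)).toLowerCase();
    }

    // The more expensive work is deferred until a caller asks for it,
    // and the result is cached so later calls do not repeat it.
    List<String> attributes() {
        if (attributes == null) {
            parseCount++;
            attributes = new ArrayList<>();
            int space = raw.indexOf(' ');
            if (space >= 0) {
                // Simplification: split on whitespace only.
                for (String part : raw.substring(space + 1).trim().split("\\s+")) {
                    if (!part.isEmpty()) {
                        attributes.add(part);
                    }
                }
            }
        }
        return attributes;
    }
}
```

A filter that only inspects tag names would never trigger the attribute split in this model, which is the saving the paragraph above describes.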
- Version:
- @(#)HtmlRewriter.java 2.6
- Author:
- Colin Stevens (colin.stevens@sun.com)
-
Field Summary
Fields
lex - The parser for the source HTML document.
sb - Storage holding the resultant HTML document.
-
Constructor Summary
Constructors
HtmlRewriter(String str) - Creates a new HtmlRewriter that will operate on the given string.
HtmlRewriter(LexHTML lex) - Creates a new HtmlRewriter from the given HTML parser.
-
Method Summary
Modifier and Type, Method - Description
boolean accumulate(boolean accumulate) - Turns on or off the automatic accumulation of each token.
void append - Instead of modifying an existing token, this method allows the user to completely replace the current token with arbitrary new content.
void appendToken() - Appends the current token to the resultant HTML document.
get - Returns the value that the specified case-insensitive key maps to in the attributes for the current tag.
getArgs() - Gets the arguments of the current token as a string.
getBody() - Gets the body of the current token as a string.
getMap() - Returns a copy of the StringMap of attributes.
getTag() - Gets the current tag's name.
getToken() - Gets the raw string making up the entire current token, including the angle brackets or comment delimiters, if applicable.
int getType() - Gets the type of the current token.
boolean isSingleton() - Checks whether the current tag is a singleton.
keys() - Returns an enumeration of the keys in the current tag's attributes.
void killToken() - Tells this HtmlRewriter not to append the current token to the resultant HTML document.
boolean nextTag() - A convenience method built on top of nextToken.
boolean nextToken() - Advances to the next token in the source HTML document.
void pushback() - Puts the current token back.
void put - Maps the given case-insensitive key to the specified value in the current tag's attributes.
static String quote - Helper method to quote an attribute's value when the value is being written to the resultant HTML document.
void remove - Removes the given case-insensitive key and its corresponding value from the current tag's attributes.
void reset() - Forgets all the tokens that have been appended to the resultant HTML document so far, including the current token.
void setSingleton(boolean singleton) - Makes the current tag a singleton.
void setTag - Changes the current tag's name.
void setType(int type) - Sets the type of the current token.
int tagCount() - Returns the count of tags seen so far.
int tokenCount() - Returns the count of tokens seen so far.
toString() - Returns the "new" rewritten HTML document.
-
Field Details
-
lex
The parser for the source HTML document. -
sb
Storage holding the resultant HTML document.
-
-
Constructor Details
-
HtmlRewriter
Creates a new HtmlRewriter from the given HTML parser.
- Parameters:
lex - The HTML parser.
-
HtmlRewriter
Creates a new HtmlRewriter that will operate on the given string.
- Parameters:
str - The HTML document.
-
-
Method Details
-
toString
Returns the "new" rewritten HTML document. This is normally called once all of the tokens have been processed, and the user wants to send on this rewritten document.
At any time, this method can be called to return the current state of the HTML document. The return value is the result of processing the source document up to this point in time; the unprocessed remainder of the source document is not considered.
Due to the implementation, calling this method may be expensive. Specifically, calling this method a second (or further) time for a given HtmlRewriter may involve copying temporary strings around. The pessimal case would be to call this method every time a new token is appended.
-
nextToken
public boolean nextToken()
Advances to the next token in the source HTML document.
The other purpose of this function is to "do the right thing", which is to append the token we just processed to the resultant HTML document, unless the user has already appended something else.
A sample program follows. This program changes all <img> tags to <form> tags, deletes all <table> tags, capitalizes and bolds each string token, and passes all other tokens through unchanged, to illustrate how nextToken interacts with some of the other methods in this class.
```java
HtmlRewriter hr = new HtmlRewriter(str);
while (hr.nextToken()) {
    switch (hr.getType()) {
    case LexHTML.TAG:
        if (hr.getTag().equals("img")) {
            // Change the tag name w/o affecting the attributes.
            hr.setTag("form");
        } else if (hr.getTag().equals("table")) {
            // Eliminate the entire "table" token.
            hr.killToken();
        }
        break;
    case LexHTML.STRING:
        // Append a new sequence in place of the existing token.
        hr.append("<b>" + hr.getToken().toUpperCase() + "</b>");
        break;
    }
    // Any tokens we didn't modify get copied through unchanged.
}
```
- Returns:
true if there are tokens left to process, false otherwise.
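The "append unless the user already appended something" rule can be modeled in a few lines. The sketch below is a hypothetical stand-alone model, not the library's code: the class AutoAppendSketch, its handled flag, and the use of a pre-split token list instead of a real parser are all assumptions.

```java
import java.util.List;

// Hypothetical model of the automatic pass-through performed by nextToken.
class AutoAppendSketch {
    private final List<String> tokens;
    private final StringBuilder out = new StringBuilder();
    private int index = -1;
    private boolean handled = true;   // nothing is pending before the first token

    AutoAppendSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    boolean nextToken() {
        if (!handled) {
            // The caller neither appended nor killed the token: copy it through.
            out.append(tokens.get(index));
        }
        handled = true;
        if (index + 1 >= tokens.size()) {
            return false;
        }
        index++;
        handled = false;
        return true;
    }

    String getToken() {
        return tokens.get(index);
    }

    // Replaces the current token with new content.
    void append(String s) {
        out.append(s);
        handled = true;
    }

    // Drops the current token from the output.
    void killToken() {
        handled = true;
    }

    @Override
    public String toString() {
        return out.toString();
    }
}
```

In this model a loop that calls neither append nor killToken reproduces its input unchanged, which mirrors the pass-through behavior described for nextToken.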
-
nextTag
public boolean nextTag()
A convenience method built on top of nextToken. Advances to the next HTML tag. All intervening strings and comments between the last tag and the new current tag are copied through unchanged. This method can be used when the caller wants to process only HTML tags, without having to manually check the type of each token to see if it is actually a tag.
- Returns:
true if there are tokens left to process, false otherwise.
-
getType
public int getType()
Gets the type of the current token.
- Returns:
- The type.
-
setType
public void setType(int type) Sets the type of the current token. -
isSingleton
public boolean isSingleton()
Checks whether the current tag is a singleton. A singleton tag ends in "/", as in <br />.
-
setSingleton
public void setSingleton(boolean singleton)
Makes the current tag a singleton. A singleton tag ends in "/", as in <br />.
-
getToken
Gets the raw string making up the entire current token, including the angle brackets or comment delimiters, if applicable.
- Returns:
- The current token.
-
getTag
Gets the current tag's name. The name returned is converted to lower case.
- Returns:
- The lower-cased tag name, or null if the current token does not have a tag name.
-
setTag
Changes the current tag's name. The tag's attributes are not changed.
- Parameters:
tag - New tag name.
-
getBody
Gets the body of the current token as a string.
- Returns:
- The body.
-
getArgs
Gets the arguments of the current token as a string.
- Returns:
- The arguments.
-
get
Returns the value that the specified case-insensitive key maps to in the attributes for the current tag. For keys that were present in the tag's attributes without a value, the value returned is the empty string. In other words, for the tag <table border rows=2>:
- get("border") returns the empty string "".
- get("rows") returns 2.
Surrounding single and double quote marks that occur in the literal tag are removed from the values reported. So, for the tag <a href="/foo.html" target=_top onclick='alert("hello")'>:
- get("href") returns /foo.html.
- get("target") returns _top.
- get("onclick") returns alert("hello").
- Parameters:
key - The key to look up in the current tag's attributes.
- Returns:
- The value to which the specified key is mapped, or null if the key was not in the attributes.
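The lookup rules above (case-insensitive keys, an empty string for valueless attributes, surrounding quotes stripped, null for missing keys) can be reproduced with a short stand-alone parser. This is only a sketch under stated assumptions: AttributeLookupSketch and its parse method are hypothetical names, the parser does not handle whitespace around "=", and it is not how HtmlRewriter itself breaks up attributes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical attribute parser that reproduces the documented lookup rules.
class AttributeLookupSketch {
    static Map<String, String> parse(String args) {
        // A case-insensitive comparator makes get("BORDER") and get("border") equivalent.
        Map<String, String> map = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        int i = 0;
        int n = args.length();
        while (i < n) {
            while (i < n && Character.isWhitespace(args.charAt(i))) {
                i++;
            }
            int keyStart = i;
            while (i < n && args.charAt(i) != '=' && !Character.isWhitespace(args.charAt(i))) {
                i++;
            }
            if (i == keyStart) {
                break;   // end of input, or a stray '=' this sketch does not handle
            }
            String key = args.substring(keyStart, i);
            String value = "";   // attributes without a value map to the empty string
            if (i < n && args.charAt(i) == '=') {
                i++;
                if (i < n && (args.charAt(i) == '"' || args.charAt(i) == '\'')) {
                    // Quoted value: keep what lies between the matching quote marks.
                    char quote = args.charAt(i++);
                    int end = args.indexOf(quote, i);
                    if (end < 0) {
                        end = n;
                    }
                    value = args.substring(i, end);
                    i = Math.min(end + 1, n);
                } else {
                    // Unquoted value: runs until the next whitespace.
                    int valueStart = i;
                    while (i < n && !Character.isWhitespace(args.charAt(i))) {
                        i++;
                    }
                    value = args.substring(valueStart, i);
                }
            }
            map.put(key, value);
        }
        return map;
    }
}
```

Calling get on the returned map with a key that was never parsed yields null, matching the documented return value for absent attributes.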
-
-
put
Maps the given case-insensitive key to the specified value in the current tag's attributes.
The value can be retrieved by calling get with a key that is case-insensitive equal to the given key.
If the attributes already contained a mapping for the given key, the old value is forgotten and the new specified value is used. The case of the prior key is retained in that case. Otherwise the case of the new key is used and a new mapping is made.
- Parameters:
key - The new key. May not be null.
value - The new value. May not be null.
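The key-case rule for put is easy to misread, so here is a minimal model of it. CaseRetainingMapSketch is a hypothetical class written for this illustration; it is not the StringMap class that HtmlRewriter uses, and its linear key search is chosen for brevity rather than speed.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical map that keeps the spelling of the first key it saw.
class CaseRetainingMapSketch {
    private final Map<String, String> entries = new LinkedHashMap<>();

    // Finds the stored key that matches, ignoring case, or null if there is none.
    private String existingKey(String key) {
        for (String k : entries.keySet()) {
            if (k.equalsIgnoreCase(key)) {
                return k;
            }
        }
        return null;
    }

    void put(String key, String value) {
        String prior = existingKey(key);
        // Reuse the prior key's case when the mapping already exists.
        entries.put(prior != null ? prior : key, value);
    }

    String get(String key) {
        String prior = existingKey(key);
        return prior == null ? null : entries.get(prior);
    }

    void remove(String key) {
        String prior = existingKey(key);
        if (prior != null) {
            entries.remove(prior);
        }
    }

    Set<String> keys() {
        return entries.keySet();
    }
}
```

In this model, putting "href" after "HREF" replaces the value but leaves the stored key spelled "HREF", which is the retention behavior the description refers to.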
-
remove
Removes the given case-insensitive key and its corresponding value from the current tag's attributes. This method does nothing if the key is not in the attributes.
- Parameters:
key - The key that needs to be removed. Must not be null.
-
keys
Returns an enumeration of the keys in the current tag's attributes. The elements of the enumeration are the string keys. The keys can be passed to get to get the values of the attributes.
- Returns:
- An enumeration of the keys.
-
append
Instead of modifying an existing token, this method allows the user to completely replace the current token with arbitrary new content.
This method may be called multiple times while processing the current token to add more and more data to the resultant HTML document. Before and/or after calling this method, the appendToken method may also be called explicitly in order to add the current token to the resultant HTML document.
Following is sample code illustrating how to use this method to put bold tags around all the <a> tags.
```java
HtmlRewriter hr = new HtmlRewriter(str);
while (hr.nextTag()) {
    if (hr.getTag().equals("a")) {
        hr.append("<b>");
        hr.appendToken();
    } else if (hr.getTag().equals("/a")) {
        hr.appendToken();
        hr.append("</b>");
    }
}
```
The calls to appendToken are necessary. Otherwise, the HtmlRewriter could not know where and when to append the existing token in addition to the new content provided by the user.
- Parameters:
str - The new content to append. May be null, in which case no new content is appended (the equivalent of appending "").
-
appendToken
public void appendToken()
Appends the current token to the resultant HTML document. If the caller has changed the current token using the setTag, put, or remove methods, those changes will be reflected.
By default, this method is automatically called after each token is processed unless the user has already appended something to the resultant HTML document. Therefore, if the user appends something and also wants to append the current token, or if the user wants to append the current token a number of times, this method must be called.
-
killToken
public void killToken()
Tells this HtmlRewriter not to append the current token to the resultant HTML document. Even if the user hasn't appended anything else, the current token will be ignored rather than appended.
-
accumulate
public boolean accumulate(boolean accumulate)
Turns on or off the automatic accumulation of each token.
After each token is processed, the current token is appended to the resultant HTML document unless the user has already appended something else. By setting accumulate to false, this behavior is turned off. The user must then explicitly call appendToken to cause the current token to be appended.
Turning off accumulation takes effect immediately, while turning on accumulation takes effect on the next token. In other words, whether the user turns this setting off or on, the current token will not be added to the resultant HTML document unless the user explicitly calls appendToken.
Following is sample code that illustrates how to use this method to extract the contents of the <head> of the source HTML document.
```java
HtmlRewriter hr = new HtmlRewriter(str);
// Don't accumulate tokens until we see the <head> below.
hr.accumulate(false);
while (hr.nextTag()) {
    if (hr.getTag().equals("head")) {
        // Start remembering the contents of the HTML document,
        // not including the <head> tag itself.
        hr.accumulate(true);
    } else if (hr.getTag().equals("/head")) {
        // Return everything accumulated so far.
        return hr.toString();
    }
}
```
This method can be called any number of times while processing the source HTML document.
- Parameters:
accumulate - true to automatically accumulate tokens in the resultant HTML document, false to require that the user explicitly accumulate them.
- Returns:
- The previous accumulate setting.
-
reset
public void reset()
Forgets all the tokens that have been appended to the resultant HTML document so far, including the current token.
-
pushback
public void pushback()
Puts the current token back. The next time nextToken is called, it will be the current token again, rather than advancing to the next token in the source HTML document.
This is useful when a code fragment needs to read an indefinite number of tokens but, once some distinguished token is found, needs to push that token back so that normal processing can occur on that token.
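A one-token pushback can be modeled with a single flag. The sketch below is a hypothetical stand-alone cursor, not the HtmlRewriter implementation; the names PushbackCursorSketch and skipUntil, and the pre-split token list, are assumptions for illustration.

```java
import java.util.List;

// Hypothetical token cursor with one token of pushback.
class PushbackCursorSketch {
    private final List<String> tokens;
    private int index = -1;
    private boolean pushedBack = false;

    PushbackCursorSketch(List<String> tokens) {
        this.tokens = tokens;
    }

    boolean nextToken() {
        if (pushedBack) {
            // Deliver the current token again instead of advancing.
            pushedBack = false;
            return true;
        }
        if (index + 1 >= tokens.size()) {
            return false;
        }
        index++;
        return true;
    }

    String getToken() {
        return tokens.get(index);
    }

    void pushback() {
        if (index >= 0) {
            pushedBack = true;
        }
    }

    // Helper fragment: consumes tokens until the stop token, then hands it back
    // so that the caller's normal loop sees it as the next token.
    static int skipUntil(PushbackCursorSketch cursor, String stop) {
        int skipped = 0;
        while (cursor.nextToken()) {
            if (cursor.getToken().equals(stop)) {
                cursor.pushback();
                break;
            }
            skipped++;
        }
        return skipped;
    }
}
```

After skipUntil returns, the caller's next nextToken call yields the stop token itself, so the main processing loop does not need to know that a helper ran ahead.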
-
tokenCount
public int tokenCount()
Returns the count of tokens seen so far.
-
tagCount
public int tagCount()
Returns the count of tags seen so far.
-
quote
Helper method to quote an attribute's value when the value is being written to the resultant HTML document. Values set by the put method are automatically quoted as needed. This method is provided in case the user is dynamically constructing a new tag to be appended with append and needs to quote some arbitrary values.
The quoting algorithm is as follows:
- If the string contains double-quotes, put single quotes around it.
- If the string contains any "special" characters, put double-quotes around it.
This algorithm is, of course, insufficient for complicated strings that include both single and double quotes. In that case, it is the user's responsibility to escape the special characters in the string using the HTML special symbols like &quot; or &#34;.
- Returns:
- The quoted string, or the original string if it did not need to be quoted.
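The two-step quoting rule can be written out as follows. This is a sketch, not the library's quote method: the text does not say which characters count as "special", so the isSpecial helper below (whitespace, single quote, "=", "<", ">") is an assumption, as are the class and method names. How an empty string should be treated is also not stated, so the sketch leaves it unquoted.

```java
// Hypothetical re-statement of the quoting algorithm described above.
class QuoteSketch {
    // Assumption: the documentation does not list the "special" characters,
    // so this set is a guess chosen for illustration.
    private static boolean isSpecial(char c) {
        return Character.isWhitespace(c) || c == '\'' || c == '=' || c == '<' || c == '>';
    }

    static String quote(String value) {
        // Step 1: a value containing double quotes is wrapped in single quotes.
        if (value.indexOf('"') >= 0) {
            return "'" + value + "'";
        }
        // Step 2: a value containing any special character is wrapped in double quotes.
        for (int i = 0; i < value.length(); i++) {
            if (isSpecial(value.charAt(i))) {
                return "\"" + value + "\"";
            }
        }
        // Otherwise the value is returned unchanged.
        return value;
    }
}
```

As the description warns, a value containing both kinds of quote mark comes out of step 1 still broken, so the caller would need to escape it before quoting.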
-
getMap
Returns a copy of the StringMap of attributes.
-