Last updated: 2024-08-14 19:50:31
The HTMLRewriter class allows developers to build comprehensive and expressive HTML parsers within CDNetworks Edge Cloud Apps Functions. It provides a streamlined JavaScript API for parsing and transforming HTML, enabling developers to build highly functional applications with a jQuery-like experience directly within the functions.
To use the HTMLRewriter class, instantiate it once within your Worker script and attach various handlers using the on and onDocument functions.
new HTMLRewriter().on('*', new ElementHandler()).onDocument(new DocumentHandler());
The HTMLRewriter API consistently uses the following types for various properties and methods:
Content
Content inserted into the output stream should be a string.
ContentOptions
{ html: Boolean }
Controls how the HTMLRewriter treats inserted content. If the html boolean is set to true, content is treated as raw HTML. If the html boolean is set to false or not provided, content is treated as text and proper HTML escaping is applied.
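To illustrate the difference between the two modes, the sketch below appends the same string twice, once with no ContentOptions and once with html set to true. The handler name BannerHandler, the helper addBanner, the header selector, and the markup are hypothetical and chosen only for illustration.

```javascript
// Hypothetical handler: appends the same string as escaped text and as raw HTML.
class BannerHandler {
  element(element) {
    const snippet = '<b>Sale</b>';
    // No ContentOptions: the string is treated as text and HTML-escaped.
    element.append(snippet);
    // { html: true }: the string is treated as raw HTML and emitted as markup.
    element.append(snippet, { html: true });
  }
}

// Assumes `response` carries an HTML body; `header` is an illustrative selector.
function addBanner(response) {
  return new HTMLRewriter().on('header', new BannerHandler()).transform(response);
}
```

With this handler, the first call should show up in the page as literal text, while the second should render as bold markup.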
The HTMLRewriter utilizes two types of handlers: element handlers and document handlers.
An element handler responds to incoming elements when attached using the .on function of an HTMLRewriter instance. The element handler should respond to element, comments, and text. The following example processes div elements using an ElementHandler class.
class ElementHandler {
  element(element) {
    // An incoming element, such as `div`
    console.log(`Incoming element: ${element.tagName}`);
  }
  comments(comment) {
    // An incoming comment
  }
  text(text) {
    // An incoming piece of text
  }
}
async function handleRequest(req) {
  const res = await fetch(req);
  return new HTMLRewriter().on('div', new ElementHandler()).transform(res);
}
A document handler represents the incoming HTML document. A number of functions can be defined on a document handler to query and manipulate a document’s doctype, comments, text, and end. Unlike an element handler, a document handler’s doctype, comments, text, and end functions are not scoped by a particular selector. These functions are called for all content on the page, including content outside the top-level HTML tag:
class DocumentHandler {
  doctype(doctype) {
    // An incoming doctype, such as <!DOCTYPE html>
  }
  comments(comment) {
    // An incoming comment
  }
  text(text) {
    // An incoming piece of text
  }
  end(end) {
    // The end of the document
  }
}
All functions defined on both element and document handlers can return either void or a Promise<void>. Making your handler function async allows you to access external resources such as an API via fetch, Workers KV, Durable Objects, or the cache.
class UserElementHandler {
  async element(element) {
    let response = await fetch(new Request('/user'));
    // fill in user info using response
  }
}
async function handleRequest(req) {
  const res = await fetch(req);
  // run the user element handler via HTMLRewriter on a div with ID `user_info`
  return new HTMLRewriter().on('div#user_info', new UserElementHandler()).transform(res);
}
The element argument, used only in element handlers, represents a DOM element. Several properties and methods are available on an element to query and manipulate it:
tagName
The name of the tag, such as “h1” or “div”. This property can be assigned different values to modify an element’s tag.
attributes (read-only)
An iterator over the [name, value] pairs of the tag’s attributes.
removed
Indicates whether the element has been removed or replaced by one of the previous handlers.
namespaceURI
Represents the namespace URI of an element.
getAttribute(name)
Returns the value for a given attribute name on the element, or null if it is not found.
hasAttribute(name)
Returns a boolean indicating whether an attribute exists on the element.
setAttribute(name, value)
Sets an attribute to a provided value, creating the attribute if it does not exist.
removeAttribute(name)
Removes the attribute.
before(content, contentOptions)
Inserts content before the element. Refer to Global types for more information on Content and ContentOptions.
after(content, contentOptions)
Inserts content right after the element.
prepend(content, contentOptions)
Inserts content right after the start tag of the element.
append(content, contentOptions)
Inserts content right before the end tag of the element.
replace(content, contentOptions)
Removes the element and inserts content in its place.
setInnerContent(content, contentOptions)
Replaces the content of the element.
remove()
Removes the element with all its content.
removeAndKeepContent()
Removes the start and end tags of the element but keeps its inner content intact.
onEndTag(handler)
Registers a handler that is invoked when the end tag of the element is reached.
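As a rough sketch of how the attribute methods above fit together, the handler below upgrades insecure links, adds a rel attribute when one is missing, and strips an inline click handler. The class name LinkHandler, the helper secureLinks, and the rewriting policy itself are assumptions made for illustration.

```javascript
// Hypothetical handler: tidies up anchor elements using the attribute methods.
class LinkHandler {
  element(element) {
    const href = element.getAttribute('href');
    if (href === null) return; // getAttribute returns null when the attribute is absent

    // Upgrade plain-HTTP links to HTTPS.
    if (href.startsWith('http://')) {
      element.setAttribute('href', 'https://' + href.slice('http://'.length));
    }
    // Add rel only if the author did not set one.
    if (!element.hasAttribute('rel')) {
      element.setAttribute('rel', 'noopener');
    }
    // Drop inline click handlers.
    element.removeAttribute('onclick');
  }
}

// Assumes `response` carries an HTML body.
function secureLinks(response) {
  return new HTMLRewriter().on('a', new LinkHandler()).transform(response);
}
```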
The endTag argument, used only in handlers registered with element.onEndTag, is a limited representation of a DOM element.
name
The name of the tag, such as “h1” or “div”. This property can be assigned different values to modify an element’s tag.
before(content, contentOptions)
Inserts content right before the end tag.
after(content, contentOptions)
Inserts content right after the end tag. Refer to Global types for more information on Content and ContentOptions.
remove()
Removes the element with all its content.
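The following sketch shows one way onEndTag and the endTag argument might be combined: the element handler marks the start of a section and registers an end tag handler that marks its end. The class name SectionHandler, the helper markSections, and the marker comments are hypothetical.

```javascript
// Hypothetical handler: wraps the inner content of each matched element in marker comments.
class SectionHandler {
  element(element) {
    // Inserted right after the start tag.
    element.prepend('<!-- section start -->', { html: true });
    // Runs later, when the end tag of this element is reached.
    element.onEndTag((endTag) => {
      // Inserted right before the end tag.
      endTag.before('<!-- section end -->', { html: true });
    });
  }
}

// Assumes `response` carries an HTML body; `section` is an illustrative selector.
function markSections(response) {
  return new HTMLRewriter().on('section', new SectionHandler()).transform(response);
}
```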
Due to CDNetworks’ zero-copy streaming parsing, text chunks are not equivalent to text nodes in the lexical tree. A lexical tree text node can be represented by multiple chunks as they arrive from the origin server.
Consider the markup: <div>Hey. How are you?</div>. The Worker script may not receive the entire text node from the origin at once. Instead, the text element handler will be invoked for each received part of the text node. For example, the handler might be invoked with “Hey. How ”, then “are you?”. When the last chunk arrives, the text’s lastInTextNode property will be set to true. Developers should ensure these chunks are concatenated together.
removed
Indicates whether the element has been removed or replaced by one of the previous handlers.
text (read-only)
The text content of the chunk. Could be empty if the chunk is the last chunk of the text node.
lastInTextNode (read-only)
Specifies whether the chunk is the last chunk of the text node.
before(content, contentOptions)
Inserts content before the element. Refer to Global types for more information on Content and ContentOptions.
after(content, contentOptions)
Inserts content right after the element.
replace(content, contentOptions)
Removes the element and inserts content in its place.
remove()
Removes the element with all its content.
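One possible way to concatenate chunks is to buffer them until lastInTextNode is true and only then emit the transformed text. The sketch below does this to upper-case a text node; the class name UppercaseTextHandler, the helper shout, and the transformation are illustrative assumptions.

```javascript
// Hypothetical handler: buffers text chunks and rewrites each complete text node.
class UppercaseTextHandler {
  constructor() {
    this.buffer = '';
  }

  text(chunk) {
    // Accumulate every chunk, including a possibly empty final one.
    this.buffer += chunk.text;

    if (chunk.lastInTextNode) {
      // The text node is complete: emit the transformed text in place of the last chunk.
      chunk.replace(this.buffer.toUpperCase());
      this.buffer = '';
    } else {
      // Partial chunk: remove it from the output, its text lives on in the buffer.
      chunk.remove();
    }
  }
}

// Assumes `response` carries an HTML body; `h1` is an illustrative selector.
function shout(response) {
  return new HTMLRewriter().on('h1', new UppercaseTextHandler()).transform(response);
}
```

Buffering trades a little memory for correctness: a transformation applied per chunk could miss a pattern that straddles a chunk boundary.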
The comments function on an element handler allows developers to query and manipulate HTML comment tags.
class ElementHandler {
  comments(comment) {
    // An incoming comment element, such as <!-- My comment -->
  }
}
comment.removed
Indicates whether the element has been removed or replaced by one of the previous handlers.
comment.text
The text of the comment. This property can be assigned different values to modify the comment’s text.
before(content, contentOptions)
Inserts content before the element. Refer to Global types for more information on Content and ContentOptions.
after(content, contentOptions)
Inserts content right after the element.
replace(content, contentOptions)
Removes the element and inserts content in its place.
remove()
Removes the element with all its content.
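As an example of using these members, the sketch below strips comments from the output while keeping any whose text starts with "[if". That exception, the class name StripCommentsHandler, and the helper stripComments are assumptions for illustration only.

```javascript
// Hypothetical handler: removes HTML comments, except ones that look like conditional comments.
class StripCommentsHandler {
  comments(comment) {
    // comment.text holds the text between <!-- and -->.
    if (comment.text.trim().startsWith('[if')) {
      return; // leave this comment untouched
    }
    comment.remove();
  }
}

// Assumes `response` carries an HTML body; `*` attaches the handler to every element.
function stripComments(response) {
  return new HTMLRewriter().on('*', new StripCommentsHandler()).transform(response);
}
```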
The doctype function on a document handler allows developers to query a document’s doctype.
class DocumentHandler {
  doctype(doctype) {
    // An incoming doctype element, such as
    // <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
  }
}
doctype.name (read-only)
The doctype name.
doctype.publicId (read-only)
The quoted string in the doctype after the PUBLIC atom.
doctype.systemId (read-only)
The quoted string in the doctype after the SYSTEM atom or immediately after the publicId.
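Because the doctype properties are read-only, a typical use is simply to record them. The sketch below stores the three properties on the handler so later code can inspect them; the class name DoctypeRecorder and the hasPublicId helper are hypothetical, and the check assumes publicId is null or empty when the doctype has no PUBLIC identifier.

```javascript
// Hypothetical document handler: records the doctype for later inspection.
class DoctypeRecorder {
  constructor() {
    this.info = null; // stays null if the document has no doctype
  }

  doctype(doctype) {
    this.info = {
      name: doctype.name,
      publicId: doctype.publicId,
      systemId: doctype.systemId,
    };
  }

  // Assumption: a missing PUBLIC identifier is reported as null or an empty string.
  hasPublicId() {
    return this.info !== null && Boolean(this.info.publicId);
  }
}

// Attach with: new HTMLRewriter().onDocument(new DoctypeRecorder())
```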
The end function on a document handler allows developers to append content to the end of a document.
class DocumentHandler {
  end(end) {
    // The end of the document
  }
}
append(content, contentOptions)
Inserts content after the end of the document. Refer to Global types for more information on Content and ContentOptions.
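A minimal sketch of the end handler in use: the document handler below appends a trailing HTML comment once the whole document has been processed. The class name FooterHandler, the helper addFooter, and the comment text are hypothetical.

```javascript
// Hypothetical document handler: appends a marker after the end of the document.
class FooterHandler {
  end(end) {
    // { html: true } keeps the comment as markup instead of escaping it.
    end.append('<!-- rewritten at the edge -->', { html: true });
  }
}

// Assumes `response` carries an HTML body.
function addFooter(response) {
  return new HTMLRewriter().onDocument(new FooterHandler()).transform(response);
}
```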
Selectors are patterns used to select specific elements within an HTML document. The HTMLRewriter supports a wide range of CSS selectors. Here are some examples:
Selector | Description |
---|---|
* | Any element. |
E | Any element of type E. |
E:nth-child(n) | An E element, the nth child of its parent. |
E:first-child | An E element, the first child of its parent. |
E:nth-of-type(n) | An E element, the nth sibling of its type. |
E:first-of-type | An E element, the first sibling of its type. |
E:not(s) | An E element that does not match the selector s. |
E.warning | An E element belonging to the class warning. |
E#myid | An E element with ID equal to myid. |
E[foo] | An E element with a foo attribute. |
E[foo="bar"] | An E element whose foo attribute value is exactly equal to “bar”. |
E[foo~="bar"] | An E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to “bar”. |
E[foo^="bar"] | An E element whose foo attribute value begins exactly with the string “bar”. |
E[foo$="bar"] | An E element whose foo attribute value ends exactly with the string “bar”. |
E[foo*="bar"] | An E element whose foo attribute value contains the substring “bar”. |
E F | An F element descendant of an E element. |
E > F | An F element child of an E element. |
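To show how several of these selector forms might be combined in one rewriter, the sketch below keeps a small list of selector and handler pairs and attaches each with on. The rules array, the attachRules helper, and the attribute values are hypothetical.

```javascript
// Hypothetical rule list: each entry pairs a selector from the table with an element handler.
const rules = [
  // E[foo^="bar"]: links whose href begins with "http://"
  ['a[href^="http://"]', { element(el) { el.setAttribute('rel', 'noopener'); } }],
  // E:not(s) with an attribute selector: images that have no alt attribute
  ['img:not([alt])', { element(el) { el.setAttribute('alt', ''); } }],
  // E > F with :first-child: the first li directly inside a ul
  ['ul > li:first-child', { element(el) { el.setAttribute('data-first', 'true'); } }],
];

// Attaches every rule to the given rewriter and returns the rewriter.
function attachRules(rewriter, ruleList) {
  for (const [selector, handler] of ruleList) {
    rewriter.on(selector, handler);
  }
  return rewriter;
}

// Inside a function: attachRules(new HTMLRewriter(), rules).transform(response)
```

Keeping the rules in a plain array makes it straightforward to exercise each handler on its own, apart from the rewriter.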
If a handler throws an exception, parsing is immediately halted, the transformed response body is errored with the thrown exception, and the untransformed response body is canceled (closed). If the transformed response body was already partially streamed back to the client, the client will see a truncated response.
async function handle(request) {
  let oldResponse = await fetch(request);
  let newResponse = new HTMLRewriter()
    .on('*', {
      element(element) {
        throw new Error('A really bad error.');
      },
    })
    .transform(oldResponse);
  // At this point, an expression like `await newResponse.text()`
  // will throw `new Error("A really bad error.")`.
  // Thereafter, any use of `newResponse.body` will throw the same error,
  // and `oldResponse.body` will be closed.
  // Alternatively, this will produce a truncated response to the client:
  return newResponse;
}
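One way to avoid a truncated response is to catch exceptions inside the handler itself so they never reach the rewriter. The sketch below wraps a handler function in a try/catch; the safe wrapper, the rewriteSafely helper, and the choice to log and continue are assumptions, not part of the HTMLRewriter API. It relies on handlers being allowed to return a Promise<void>.

```javascript
// Hypothetical wrapper: runs a handler function and reports, rather than rethrows, its errors.
// `log` defaults to console.error and can be swapped out, for example in tests.
function safe(handlerFn, log = console.error) {
  return async (token) => {
    try {
      // Works for both synchronous and async handler functions.
      await handlerFn(token);
    } catch (err) {
      // Swallow the error so parsing continues with the next token.
      log('handler failed:', err);
    }
  };
}

// Assumes `response` carries an HTML body; `riskyElementFn` is any element handler function.
function rewriteSafely(response, riskyElementFn) {
  return new HTMLRewriter()
    .on('*', { element: safe(riskyElementFn) })
    .transform(response);
}
```

Note that any changes the handler made before throwing are not rolled back, so this approach suits handlers whose partial effects are harmless.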