gstarwd

HtmlCleaner API

Create cleaner instance:

Constructor or method            Purpose
HtmlCleaner()                    Create a cleaner with the default tag information provider.
HtmlCleaner(ITagInfoProvider)    Create a cleaner with a custom tag information provider.

Set cleaner properties in order to tune its behavior:

Set cleaner transformations:

Constructor or method                                          Purpose
CleanerTransformations()                                       Create a collection of transformations.
TagTransformation(String, String, boolean)                     Create a single tag transformation.
CleanerTransformations.addTransformation(TagTransformation)    Add a tag transformation to the transformations collection.
TagTransformation.addAttributeTransformation(String, String)   Specify an attribute transformation for the tag transformation.
HtmlCleaner.setTransformations(CleanerTransformations)         Set the cleaner transformations.

Clean HTML with instance of HtmlCleaner:

Search cleaned DOM and modify its structure:

Serialize DOM nodes:

Providing custom tag info set

HtmlCleaner implements a default HTML tag set and rules for balancing tags, similar to browser behavior. However, you are free to implement the ITagInfoProvider interface, or extend one of its implementations, to provide a custom tag info set. The easiest way to do that is to write an XML configuration file that describes all tags and their dependencies, and pass it to ConfigFileTagProvider like this:

HtmlCleaner cleaner = new HtmlCleaner(new ConfigFileTagProvider(myConfigFile));

 

Perhaps the best starting point is the default tag ruleset description file, which is the basis for DefaultTagProvider.

For example, you may not want the rule that inserts an implicit TBODY before TR in an HTML table. To remove it, find the <tag name="tr"... element in the XML and remove tbody from its req-enclosing-tags section.
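The manual edit described above can also be scripted. The following is a minimal sketch using only the JDK's DOM parser. It assumes that req-enclosing-tags is a child element of the tag element and holds a comma-separated list of tag names. The class name RulesetEdit and the inline XML fragment are hypothetical and much simpler than the real ruleset file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RulesetEdit {

    /** Parses an XML string into a DOM document. */
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse ruleset XML", e);
        }
    }

    /**
     * Removes tagToDrop from the req-enclosing-tags list of the first
     * <tag name="tagName"> element and returns the list that remains.
     * Assumption: the list is comma-separated text inside the element.
     */
    public static String dropEnclosingTag(Document doc, String tagName, String tagToDrop) {
        NodeList tags = doc.getElementsByTagName("tag");
        for (int i = 0; i < tags.getLength(); i++) {
            Element tag = (Element) tags.item(i);
            if (!tagName.equals(tag.getAttribute("name"))) {
                continue;
            }
            NodeList req = tag.getElementsByTagName("req-enclosing-tags");
            if (req.getLength() == 0) {
                return "";
            }
            Element reqElement = (Element) req.item(0);
            List<String> kept = new ArrayList<>();
            for (String name : reqElement.getTextContent().split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty() && !trimmed.equals(tagToDrop)) {
                    kept.add(trimmed);
                }
            }
            String remaining = String.join(",", kept);
            reqElement.setTextContent(remaining);
            return remaining;
        }
        return "";
    }

    public static void main(String[] args) {
        // Hypothetical, simplified fragment; the real file has more attributes and child elements.
        String xml = "<tags><tag name=\"tr\">"
                + "<req-enclosing-tags>table,tbody</req-enclosing-tags>"
                + "</tag></tags>";
        Document doc = parse(xml);
        System.out.println(dropEnclosingTag(doc, "tr", "tbody"));
    }
}
```

For a one-off change, editing the XML file by hand as described is simpler. A script like this is only worth it when the same change has to be applied repeatedly.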

Setting cleaner transformations

The following code snippet demonstrates how to set the transformations from the example:

...
HtmlCleaner cleaner = new HtmlCleaner(...);
...
CleanerTransformations transformations = new CleanerTransformations();

// Source tag only, no destination tag: cfoutput tags are skipped.
TagTransformation tt = new TagTransformation("cfoutput");
transformations.addTransformation(tt);

// Turn c:block into div; false means the source attributes are not preserved.
tt = new TagTransformation("c:block", "div", false);
transformations.addTransformation(tt);

// Turn font into span; true means the source attributes are preserved.
tt = new TagTransformation("font", "span", true);
// Drop the size and face attributes from the result.
tt.addAttributeTransformation("size");
tt.addAttributeTransformation("face");
// Build the style attribute from the source tag's style, face and size values.
tt.addAttributeTransformation("style", "${style};font-family=${face};font-size=${size};");
transformations.addTransformation(tt);
...
cleaner.setTransformations(transformations);
...
TagNode node = cleaner.clean(...);
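To make the `${...}` placeholders in the style template easier to follow, here is a minimal sketch of that kind of substitution in plain Java. It is not HtmlCleaner's implementation. The class name AttributeTemplateSketch and the method expand are hypothetical, and replacing a missing attribute with an empty string is an assumption. The template string is taken verbatim from the snippet above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeTemplateSketch {

    // Matches placeholders of the form ${name}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /**
     * Replaces each ${name} in the template with the value of the source
     * attribute of that name. Assumption: a missing attribute becomes "".
     */
    public static String expand(String template, Map<String, String> sourceAttributes) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String value = sourceAttributes.getOrDefault(matcher.group(1), "");
            // quoteReplacement keeps '$' and '\' in attribute values literal.
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // Attributes of a hypothetical <font face="Arial" size="2" style="color:red"> tag.
        Map<String, String> font = new LinkedHashMap<>();
        font.put("face", "Arial");
        font.put("size", "2");
        font.put("style", "color:red");

        // prints "color:red;font-family=Arial;font-size=2;"
        System.out.println(expand("${style};font-family=${face};font-size=${size};", font));
    }
}
```

Under these assumptions, the font tag in the example would end up as a span whose style attribute combines the original style with the old face and size values.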
 
