HtmlCleaner API
Create cleaner instance:
HtmlCleaner() | Create cleaner with default tag information provider.
HtmlCleaner(ITagInfoProvider) | Create cleaner with custom tag information provider.
Set cleaner properties in order to tune its behavior:
Set cleaner transformations:
CleanerTransformations() | Create collection of transformations.
TagTransformation(String, String, boolean) | Create single tag transformation.
CleanerTransformations. | Add tag transformation to transformations collection.
TagTransformation. | Specify attribute transformation for the tag transformation.
HtmlCleaner. | Set cleaner transformations.
Clean HTML with instance of HtmlCleaner:
class HtmlCleaner: clean(String) | Clean HTML that comes from various sources.
Search cleaned DOM and modify its structure:
class TagNode: getAttributeByName(String) | Work with node (tag) attributes.
class TagNode: getChildTagList() | Find and modify nodes.
HtmlCleaner.setInnerHtml(TagNode, String) | Cleans given portion of HTML and stores it in specified tag node.
Serialize DOM nodes:
SimpleXmlSerializer(CleanerProperties) | Create various kinds of XML serializers.
class XmlSerializer: writeXmlToStream(TagNode, OutputStream, String) | Serialize node to different outputs.
DomSerializer.createDOM(TagNode) | Create common DOM objects out of cleaned HTML.
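The calls listed in the overview above can be strung together roughly as follows. This is only a minimal sketch: it assumes the HtmlCleaner jar is on the classpath and that the release in use still exposes `writeXmlToStream` as named in the overview (later releases may prefer differently named equivalents). The class name `CleanAndSerializeSketch`, the helper names, and the sample HTML are made up for illustration.

```java
import java.io.ByteArrayOutputStream;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleXmlSerializer;
import org.htmlcleaner.TagNode;

public class CleanAndSerializeSketch {

    /** Cleans the HTML and returns the href of the first <a> element, or null if none exists. */
    static String firstLinkHref(String html) {
        try {
            HtmlCleaner cleaner = new HtmlCleaner();            // default tag info provider
            TagNode root = cleaner.clean(html);                 // tolerant of unclosed tags
            TagNode link = root.findElementByName("a", true);   // true = search recursively
            return link == null ? null : link.getAttributeByName("href");
        } catch (Exception e) {
            // Some releases declare IOException on clean(String); wrap it either way.
            throw new RuntimeException(e);
        }
    }

    /** Cleans the HTML and serializes the resulting tree as XML text. */
    static String toXml(String html) {
        try {
            HtmlCleaner cleaner = new HtmlCleaner();
            CleanerProperties props = cleaner.getProperties();
            TagNode root = cleaner.clean(html);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            new SimpleXmlSerializer(props).writeXmlToStream(root, out, "UTF-8");
            return out.toString("UTF-8");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Deliberately malformed input: neither <a> nor <p> is closed.
        String html = "<p>Hello <a href='http://example.com/'>link";
        System.out.println(firstLinkHref(html));
        System.out.println(toXml(html));
    }
}
```

Both helpers build a fresh cleaner per call to keep the sketch short; in real code a single configured instance would normally be reused.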
Providing custom tag info set
HtmlCleaner implements a default HTML tag set and rules for balancing tags, similar to browsers' behavior. However, the user is free to implement the ITagInfoProvider interface, or to extend one of its implementations, in order to provide a custom tag info set.
The easiest way to do that is to write an XML configuration file that describes all tags and their dependencies, and to use ConfigFileTagProvider like:
HtmlCleaner cleaner = new HtmlCleaner(new ConfigFileTagProvider(myConfigFile));
Perhaps the best starting point is the default tag ruleset description file. It is the basis for DefaultTagProvider.
For example, someone may not like the rule that an implicit TBODY is inserted before TR in an HTML table. To remove it, find the <tag name="tr"... element in the XML and remove tbody from the req-enclosing-tags section.
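One way to check whether such a change took effect is to clean a small table and look for a tbody element in the result. The sketch below assumes the HtmlCleaner jar is on the classpath; the class name `ImplicitTbodySketch`, the helper `hasTbody`, and the optional config file path passed on the command line are hypothetical. With the default provider, the rule described above suggests the check should report true; with an edited ruleset file it should report false.

```java
import java.io.File;

import org.htmlcleaner.ConfigFileTagProvider;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.ITagInfoProvider;
import org.htmlcleaner.TagNode;

public class ImplicitTbodySketch {

    /** Returns true if the cleaned tree contains a tbody element. */
    static boolean hasTbody(String html, ITagInfoProvider provider) {
        try {
            // A null provider means "use the library's default tag set".
            HtmlCleaner cleaner = (provider == null)
                    ? new HtmlCleaner()
                    : new HtmlCleaner(provider);
            TagNode root = cleaner.clean(html);
            return root.findElementByName("tbody", true) != null;
        } catch (Exception e) {
            // Some releases declare IOException on clean(String); wrap it either way.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>x</td></tr></table>";

        // Default rules.
        System.out.println("default provider: " + hasTbody(html, null));

        // Optional: path to an edited copy of the ruleset file (hypothetical argument).
        if (args.length > 0) {
            ITagInfoProvider custom = new ConfigFileTagProvider(new File(args[0]));
            System.out.println("custom provider: " + hasTbody(html, custom));
        }
    }
}
```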
Setting cleaner transformations
The following code snippet demonstrates how to set the transformations from the example:
...
HtmlCleaner cleaner = new HtmlCleaner(...);
...
CleanerTransformations transformations = new CleanerTransformations();

// Single-argument form: no destination tag is given for cfoutput
TagTransformation tt = new TagTransformation("cfoutput");
transformations.addTransformation(tt);

// Turn c:block into div; the boolean is false, so source attributes are not preserved
tt = new TagTransformation("c:block", "div", false);
transformations.addTransformation(tt);

// Turn font into span, preserving source attributes (boolean is true)
tt = new TagTransformation("font", "span", true);
tt.addAttributeTransformation("size");
tt.addAttributeTransformation("face");
// Build the new style value from the tag's existing attributes
tt.addAttributeTransformation("style", "${style};font-family=${face};font-size=${size};");
transformations.addTransformation(tt);
...
cleaner.setTransformations(transformations);
...
TagNode node = cleaner.clean(...);
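The `${style}`, `${face}`, and `${size}` placeholders in the last attribute transformation stand for the tag's existing attribute values. The post does not show how HtmlCleaner expands them, so the following is only an illustration, using nothing but the JDK, of how such a template might be filled in. The class name `AttributeTemplateSketch` and the method `expand` are hypothetical, and replacing a missing attribute with an empty string is an assumption of this sketch rather than documented library behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AttributeTemplateSketch {

    // Matches ${name} and captures the attribute name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with the attribute's value, or "" if it is absent (assumption). */
    static String expand(String template, Map<String, String> attributes) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = attributes.get(m.group(1));
            // quoteReplacement keeps '$' and '\' in attribute values literal.
            m.appendReplacement(out, Matcher.quoteReplacement(value == null ? "" : value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Attributes of a hypothetical <font style="color:red" face="Arial" size="12"> tag.
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("style", "color:red");
        attrs.put("face", "Arial");
        attrs.put("size", "12");

        String template = "${style};font-family=${face};font-size=${size};";
        System.out.println(expand(template, attrs));
        // prints: color:red;font-family=Arial;font-size=12;
    }
}
```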