Transactions of the Japanese Society for Artificial Intelligence
Online ISSN : 1346-8030
Print ISSN : 1346-0714
ISSN-L : 1346-0714
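Paper
Case-Based Automatic Recognition of Semantic and Logical Structures in Series-Type HTML Documents
Toward Automatic Conversion from HTML to XML
Masayuki Umehara, Koji Iwanuma, Hidetomo Nabeshima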

2002, Volume 17, Issue 6, pp. 690-698

Abstract

Recognizing and extracting semantic and logical structures in HTML documents is an important and difficult task for intelligent document processing. In this paper, we show that alignment, used within a case-based reasoning framework, is an appropriate tool for recognizing the semantic structures inherently embedded in a series of HTML documents. Specifically, given a series of HTML documents and one example document whose semantic structures have been explicitly marked by a user, alignment identifies the semantic structures in the rest of the series by matching the text-block sequence of each HTML document against the text-block sequence of the example document. Several important properties of text documents, such as the continuity and sequentiality of text, can be handled by alignment in a natural way.
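The matching of text-block sequences described above can be illustrated with a small sketch. This is not the authors' algorithm: it is a generic global (Needleman-Wunsch style) alignment, and the `similarity` function, the `GAP` penalty, the label names, and the sample blocks are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

GAP = -0.5  # hypothetical penalty for leaving a block unmatched


def similarity(a: str, b: str) -> float:
    """Hypothetical block similarity in [-1, 1], based on character overlap."""
    return 2.0 * SequenceMatcher(None, a, b).ratio() - 1.0


def align(example, target):
    """Globally align a labeled example with an unlabeled document.

    example: list of (label, text) pairs marked up by the user.
    target:  list of text blocks from another document in the series.
    Returns (label, target_text) pairs for the blocks that were matched,
    in document order.
    """
    n, m = len(example), len(target)
    # score[i][j] = best score for example[:i] aligned with target[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * GAP
    for j in range(1, m + 1):
        score[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(example[i - 1][1], target[j - 1]),
                score[i - 1][j] + GAP,  # example block has no counterpart
                score[i][j - 1] + GAP,  # target block has no counterpart
            )
    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        s = similarity(example[i - 1][1], target[j - 1])
        if score[i][j] == score[i - 1][j - 1] + s:
            pairs.append((example[i - 1][0], target[j - 1]))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + GAP:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    example = [
        ("title", "Weather report"),
        ("date", "Date: 2002-01-01"),
        ("body", "Sunny with light winds"),
    ]
    target = [
        "Weather report",
        "Date: 2002-01-02",
        "ADVERTISEMENT",
        "Cloudy with light winds",
    ]
    for label, text in align(example, target):
        print(label, "->", text)
```

Because the alignment only moves forward through both sequences, the order of blocks is preserved, and a block with no counterpart (such as an inserted advertisement) is skipped at the cost of a gap penalty rather than forcing a wrong match.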

Alignment can significantly improve the capability of the case-based transformation method, which transforms a spatial and/or temporal series of HTML documents into machine-readable XML formats. Moreover, alignment dramatically eases the construction of transformation examples. Through an experimental evaluation on 47 pages from 8 series of HTML documents, we show that the case-based method using alignment achieved highly accurate transformation into XML formats.
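As a rough illustration of the final transformation step, the sketch below serializes text blocks into XML once each block has been given a semantic label. The function name `blocks_to_xml`, the root tag `document`, and the label names are hypothetical, and the abstract does not specify the output schema the authors used.

```python
import xml.etree.ElementTree as ET


def blocks_to_xml(labeled_blocks, root_tag="document"):
    """Serialize (label, text) pairs as child elements of a single root.

    Labels are assumed to be valid XML element names.
    """
    root = ET.Element(root_tag)
    for label, text in labeled_blocks:
        child = ET.SubElement(root, label)
        child.text = text  # special characters are escaped on serialization
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    labeled = [
        ("title", "Weather report"),
        ("date", "Date: 2002-01-02"),
        ("body", "Cloudy with light winds"),
    ]
    print(blocks_to_xml(labeled))
```

A flat list of labeled elements is the simplest possible target. A nested schema would need the labels to carry hierarchy information, which is outside the scope of this sketch.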

© 2002 JSAI (The Japanese Society for Artificial Intelligence)