Metaparser
By Wolfgang Schwarz
PhilPapers' home page harvester consists of a crawler, a paper detector, a pre-processor,
and a meta-data extractor.
The crawler regularly checks the tracked pages to look for new
links. When it finds one, it calls the paper detector, which tries to
guess whether the linked resource is an academic article (or book)
rather than, say, the department's homepage, a course handout or a CV.
This is currently done by combining a Bayesian classifier applied to
the document content with other heuristics such as file type and
features of the URL. The results are very reliable: more than 99% of
irrelevant links are recognized as such, and there are practically no
false negatives (articles dismissed as junk).
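To give a rough idea of how a text classifier can be combined with file-type and URL heuristics, here is a minimal sketch in Python; the word lists, weights, and threshold are purely illustrative assumptions, not the values used by the actual system.

```python
# Illustrative sketch only: combine a crude naive-Bayes-style text score
# with file-type and URL heuristics. All words, weights and thresholds
# here are made up for the example.
import math
import re

ARTICLE_WORDS = {"abstract": 0.9, "bibliography": 0.8, "forthcoming": 0.8}
JUNK_WORDS = {"syllabus": 0.9, "office hours": 0.9, "curriculum vitae": 0.8}

def bayes_score(text):
    """Log-odds that the text is an academic article, from keyword evidence."""
    text = text.lower()
    log_odds = 0.0
    for word, p in ARTICLE_WORDS.items():
        if word in text:
            log_odds += math.log(p / (1 - p))
    for word, p in JUNK_WORDS.items():
        if word in text:
            log_odds -= math.log(p / (1 - p))
    return log_odds

def looks_like_paper(url, content_type, text):
    """Combine the text classifier with file-type and URL heuristics."""
    score = bayes_score(text)
    if content_type in ("application/pdf", "application/msword"):
        score += 1.0          # papers are usually PDF or Word files
    if re.search(r"(cv|vita|syllabus|teaching)", url, re.I):
        score -= 2.0          # URL suggests a CV or course page
    return score > 0.0
```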
Documents harvested from authors' personal pages vary greatly in
format and layout. For instance, while most documents have the author
name(s) right above or below the title, some contain them only at the
very end of the paper, and a substantial number (over 10%) of papers
do not specify the author at all. It is therefore important that
every feature of a text that might be relevant for extracting
meta-data be preserved. To this end, each paper is first converted
into an XML document that specifies a) the precise location, font
size, etc. of every word in the paper, and b) information about the
source page, the original file type, the text of the anchor that led
to it, etc. This is what I referred to as 'pre-processing'. A large
number of tools are currently employed for this task, including
pdftohtml for processing regular PDF files, a modified version of
Google's ocropus package for OCR processing of scanned documents, and
a xulrunner application (based on the Mozilla web browser) written
from scratch to process HTML pages. All third-party software currently
employed is open-source.
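As an illustration of this pre-processing step, the sketch below runs pdftohtml in XML mode and wraps the resulting position and font information together with information about the source page; the enriched output format shown here is an illustrative assumption, not the system's actual schema.

```python
# Illustrative sketch of pre-processing a PDF, assuming pdftohtml's "-xml"
# output (one <text> element per fragment with position and font info).
# The enriched <document> format below is a made-up example schema.
import subprocess
import xml.etree.ElementTree as ET

def preprocess_pdf(pdf_path, source_url, anchor_text):
    subprocess.run(["pdftohtml", "-xml", "-i", pdf_path, "out"], check=True)
    pages = ET.parse("out.xml").getroot()

    doc = ET.Element("document", {
        "source-url": source_url,      # page on which the link was found
        "anchor-text": anchor_text,    # text of the anchor that led here
        "filetype": "pdf",
    })
    for page in pages.iter("page"):
        for fragment in page.iter("text"):
            word = ET.SubElement(doc, "word", {
                "top": fragment.get("top"),        # vertical position
                "left": fragment.get("left"),      # horizontal position
                "height": fragment.get("height"),  # proxy for font size
                "font": fragment.get("font", ""),
            })
            word.text = "".join(fragment.itertext())
    return ET.tostring(doc, encoding="unicode")
```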
The meta-data extractor takes the XML document created by the
pre-processor and tries to estimate the author(s), title, and abstract of
the document. As a first step, this involves chunking the content into
consecutive strings of words that might constitute a title, an author
line or a paragraph. These chunks are then classified based on
features such as font size, length, position in the document, presence
of keywords ('abstract'), etc. In the current prototype, this is done
by simply assigning a score to each feature, which does not properly
take into account dependencies and independencies between them. In the
final version, the classification will probably be carried out by a
Maximum Entropy Classifier or a Support Vector Machine. The final step
of the extraction is to filter out the author name(s) from chunks that
were classified accordingly. (If there are none, the person associated
with the source page is chosen.) The meta-data extraction currently
gets about 92% of papers right.
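The following sketch illustrates the kind of score-based chunk classification described above; the features, weights, and labels are illustrative assumptions rather than the values used in the prototype.

```python
# Illustrative sketch of score-based chunk classification. Features,
# weights and labels are made up; the production system may instead use
# a Maximum Entropy classifier or an SVM.
def classify_chunk(chunk, doc):
    """chunk: dict with 'text', 'font_size', 'position' (0.0 = top of document).
    doc: dict with document-level statistics such as 'median_font_size'."""
    scores = {"TITLE": 0.0, "AUTHOR": 0.0, "ABSTRACT": 0.0, "OTHER": 0.0}
    words = chunk["text"].split()

    if chunk["font_size"] > doc["median_font_size"] * 1.2:
        scores["TITLE"] += 2.0        # titles are usually set in a larger font
    if chunk["position"] < 0.1:
        scores["TITLE"] += 1.0        # titles and author lines sit near the top
        scores["AUTHOR"] += 1.0
    if 1 <= len(words) <= 4 and all(w[:1].isupper() for w in words):
        scores["AUTHOR"] += 2.0       # short runs of capitalized words
    if chunk["text"].lower().startswith("abstract"):
        scores["ABSTRACT"] += 3.0     # explicit 'abstract' keyword
    if len(words) > 40:
        scores["OTHER"] += 2.0        # long chunks are ordinary paragraphs

    return max(scores, key=scores.get)
```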
At each step, the system keeps track of difficulties it encounters,
from which it calculates a 'confidence' score. Documents with low
confidence can thereby be presented to an administrator for
confirmation before adding them to the main corpus.
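A minimal sketch of such confidence tracking might look as follows; the penalty factors and the review threshold are illustrative assumptions.

```python
# Illustrative sketch of confidence tracking across the pipeline. The
# penalty factors and the 0.75 review threshold are made-up values.
class Confidence:
    def __init__(self):
        self.score = 1.0
        self.problems = []

    def penalize(self, reason, factor):
        """Record a difficulty and lower the overall confidence."""
        self.problems.append(reason)
        self.score *= factor

    def needs_review(self, threshold=0.75):
        return self.score < threshold

# Example: penalties accumulated while processing one document
conf = Confidence()
conf.penalize("OCR required for scanned document", 0.9)
conf.penalize("no author chunk found, using page owner", 0.8)
if conf.needs_review():
    print("queue for administrator confirmation:", conf.problems)
```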