Workhorse document scraping, scheduling and general utility package.
Workhorse provides the following:
- A scheduler which runs Java-coded tasks on an execution schedule
normally controlled through an XML specification
(the workhorse.scheduler package).
- The ability to scrape documents through an XML-specified "skeleton"
which specifies how data is to be passed to a Java-coded scraper. The
skeleton is intended to be much easier to use than XSLT and provides
a bridge to "real" coding in Java, while hiding the details of the document
structure from that code
(the workhorse.scraper package).
- A "Tag Rationalizer" package aimed at taking input that is XML-like but
not XML-compliant, such as HTML documents, and using heuristic rules
to produce a well-formed document which will pass through an XML parser,
while retaining the structure of the original
(the workhorse.tagrat package).
- An extensive utility library which supports
the above tasks and is usable in its own right. It includes a logger, enhanced
string parsing, uniform handling of parameters, pluggable pattern matching,
enhanced exceptions, a general object caching mechanism, an abstracted
clock mechanism, command line parsing and a simplified abstraction
for XML document parsing
(the workhorse.util and workhorse.util.* packages).
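To make the "Tag Rationalizer" idea concrete, here is a minimal sketch of one heuristic such a tool might apply: closing unclosed elements, dropping stray close tags, and self-closing HTML void elements. The class name TagBalancerSketch, the method balance and the list of void elements are all hypothetical and are not taken from the workhorse.tagrat package, whose actual rules are not described in this overview. The sketch only balances tags; it does not quote attribute values or escape entities, which real input would also need.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT the workhorse.tagrat API.
final class TagBalancerSketch {
    // Assumed subset of HTML elements that never take a closing tag.
    private static final Set<String> VOID =
            Set.of("br", "hr", "img", "input", "meta", "link");

    // Matches a simple opening or closing tag; attributes are kept verbatim.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            out.append(html, pos, m.start()); // copy text between tags unchanged
            pos = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            String attrs = m.group(3);
            if (closing) {
                if (!open.contains(name)) {
                    continue; // stray close tag with no matching open: drop it
                }
                // Close any elements left open inside the one being closed.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append(">");
                }
                out.append("</").append(open.pop()).append(">");
            } else if (VOID.contains(name) || attrs.endsWith("/")) {
                // Emit void or already self-closed elements in XML empty-tag form.
                String a = attrs.endsWith("/")
                        ? attrs.substring(0, attrs.length() - 1) : attrs;
                out.append("<").append(name)
                   .append(a.replaceAll("\\s+$", "")).append("/>");
            } else {
                out.append("<").append(name).append(attrs).append(">");
                open.push(name);
            }
        }
        out.append(html.substring(pos));
        // Close whatever is still open at end of input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append(">");
        }
        return out.toString();
    }
}
```

Under these assumptions, TagBalancerSketch.balance("<p>a<br>b") yields "<p>a<br/>b</p>", and balance("<b><i>x</b>") yields "<b><i>x</i></b>". Element names are lower-cased as a side effect, which is one of many choices a real rationalizer would need to make configurable.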
As you can see, this is a software package to be used by Java programmers.
To that end, much of the functionality is configurable, extensible and pluggable through:
- Documented APIs, used by extending base classes and implementing
defined interfaces. The APIs have been designed with extensibility
and implementation replacement as key features. This means care
has been taken to abstract and provide interfaces where appropriate.
- Setting properties, which includes the ability to provide
local implementations for several pluggable interfaces.
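As a rough illustration of how a property can select a local implementation of a pluggable interface, the sketch below loads a class named in a Properties object and falls back to a default. Every name here (TimeSource, SystemTimeSource, FixedTimeSource, PluginLoader and the property key example.timesource.class) is hypothetical; the actual Workhorse interfaces and property names are not given in this overview.

```java
import java.util.Properties;

// Hypothetical names throughout; none of this is the Workhorse API.
interface TimeSource {
    long now();
}

// Default implementation backed by the system clock.
final class SystemTimeSource implements TimeSource {
    public long now() {
        return System.currentTimeMillis();
    }
}

// A local replacement, e.g. for repeatable tests.
final class FixedTimeSource implements TimeSource {
    public long now() {
        return 42L;
    }
}

final class PluginLoader {
    // Assumed property key naming the implementation class.
    static final String KEY = "example.timesource.class";

    // Instantiates the class named by KEY, or the default if KEY is unset.
    static TimeSource load(Properties props) {
        String cls = props.getProperty(KEY, SystemTimeSource.class.getName());
        try {
            return Class.forName(cls)
                    .asSubclass(TimeSource.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + cls, e);
        }
    }
}
```

With an empty Properties object, PluginLoader.load returns a SystemTimeSource; setting the key to FixedTimeSource.class.getName() swaps in the fixed implementation without changing any calling code.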
© Copyright 2005-2022, Robert L. McQueer