WorkHorse


Workhorse is a document scraping, scheduling and general utility package.

Workhorse provides the following things:

A scheduler which runs Java-coded tasks on an execution schedule, normally controlled through an XML specification (the workhorse.scheduler package).

The ability to scrape documents through an XML-specified "skeleton" which describes how data is to be passed to a Java-coded scraper. The skeleton is intended to be much easier to use than XSLT and provides a bridge to "real" coding in Java, while hiding the details of the document structure from that code (the workhorse.scraper package).

A "Tag Rationalizer" package which takes input that is XML-like but not XML-compliant, such as HTML documents, and uses heuristic rules to produce a well-formed document which will pass through an XML parser while retaining the structure of the original (the workhorse.tagrat package).

An extensive utility library which supports the above tasks and is usable in its own right. It includes a logger, enhanced string parsing, uniform handling of parameters, pluggable pattern matching, enhanced exceptions, a general object caching mechanism, an abstracted clock mechanism, command line parsing and a simplified abstraction for XML document parsing (the workhorse.util and workhorse.util.* packages).
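
As a rough illustration of the scheduler idea in the first item above, the sketch below reads a tiny XML schedule and runs named Java tasks at fixed intervals, using only the JDK. The XML layout (schedule, task, name, everySeconds) and the class ScheduleSketch are assumptions made for this example. They are not the workhorse.scheduler API or its actual schedule format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical names throughout; this is NOT the workhorse.scheduler API.
final class ScheduleSketch {

    /** One schedule entry: which task to run and how often. */
    static final class Entry {
        final String taskName;
        final long periodSeconds;

        Entry(String taskName, long periodSeconds) {
            this.taskName = taskName;
            this.periodSeconds = periodSeconds;
        }
    }

    /** Reads <task name="..." everySeconds="..."/> elements (an assumed layout). */
    static List<Entry> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagName("task");
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            entries.add(new Entry(e.getAttribute("name"),
                    Long.parseLong(e.getAttribute("everySeconds"))));
        }
        return entries;
    }

    /** Schedules every entry whose name maps to a known Runnable; others are skipped. */
    static ScheduledExecutorService start(List<Entry> entries, Map<String, Runnable> tasks) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        for (Entry entry : entries) {
            Runnable task = tasks.get(entry.taskName);
            if (task != null) {
                pool.scheduleAtFixedRate(task, 0, entry.periodSeconds, TimeUnit.SECONDS);
            }
        }
        return pool;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<schedule><task name=\"hello\" everySeconds=\"1\"/></schedule>";
        Runnable hello = () -> System.out.println("hello task ran");
        ScheduledExecutorService pool =
                start(parse(xml), Collections.singletonMap("hello", hello));
        Thread.sleep(2500); // give the task time to fire
        pool.shutdownNow();
    }
}
```

Keeping the schedule in XML and the work in plain Runnable objects mirrors the split described above: timing can change without recompiling the task code.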

As you can see, this is a software package to be used by Java programmers. To that end, much of the functionality is configurable, extensible and pluggable through:

Documented APIs, used by extending base classes and implementing defined interfaces. The APIs have been designed with extensibility and implementation replacement as key features, so care has been taken to abstract and provide interfaces where appropriate.

Setting properties, which includes the ability to provide local implementations for several pluggable interfaces.
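
The two mechanisms above can be pictured with a small sketch: an interface with a default implementation, and a property that names a replacement class to be loaded by reflection. The interface PatternMatcher, the classes SubstringMatcher, RegexMatcher and PluginLoader, and the property key example.matcher.class are all invented for this illustration. Workhorse's real interfaces and property names may look quite different.

```java
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical interface and property key; not taken from the Workhorse API.
interface PatternMatcher {
    boolean matches(String pattern, String input);
}

/** Default used when no property overrides it: plain substring search. */
final class SubstringMatcher implements PatternMatcher {
    @Override
    public boolean matches(String pattern, String input) {
        return input.contains(pattern);
    }
}

/** A "local implementation" that treats the pattern as a regular expression. */
final class RegexMatcher implements PatternMatcher {
    @Override
    public boolean matches(String pattern, String input) {
        return Pattern.compile(pattern).matcher(input).find();
    }
}

final class PluginLoader {
    /** Assumed property key naming the implementation class. */
    static final String KEY = "example.matcher.class";

    /** Instantiates the class named by the property, or the default if unset. */
    static PatternMatcher load(Properties props) throws ReflectiveOperationException {
        String className = props.getProperty(KEY, SubstringMatcher.class.getName());
        return Class.forName(className)
                .asSubclass(PatternMatcher.class) // fails fast on a wrong type
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        System.out.println(load(props).getClass().getSimpleName());

        // Swap the implementation without touching the calling code.
        props.setProperty(KEY, RegexMatcher.class.getName());
        PatternMatcher matcher = load(props);
        System.out.println(matcher.getClass().getSimpleName());
        System.out.println(matcher.matches("a.c", "xxabcxx"));
    }
}
```

Callers depend only on the interface, so a replacement class can be supplied through configuration alone.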