Published: June 30, 2011 - 02:00

Content Annotation and Enrichment with Post Filter Transformations

The standard task of the filter service in our architecture is to provide content extraction and enrichment mechanisms for a set of known MIME types (e.g. Microsoft Word documents, HTML files, etc.). Beyond that, Fabasoft Mindbreeze Enterprise builds further enrichment intelligence into its content extraction pipeline, which can load powerful, tailored Post Filter Transformation plugins. An example of HTML filtering can be found here, but for more specific use cases you can extend the pipeline as needed.

In the case of filters, the processing methods are determined by the type of the input. Post filter transformation services, in contrast, provide tools for processing and transforming the document model produced by the filters, in many cases based on the semantic information extracted during filtering.
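The split between the two stages can be illustrated with a small Python sketch. It is not the Fabasoft Mindbreeze Enterprise API: the `Document` class, the `FILTERS` table, and the `run_pipeline` function are hypothetical names chosen only to show that filters are selected by input type, while transformations operate on the resulting document model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical document model produced by a filter."""
    content: str = ""
    metadata: Dict[str, str] = field(default_factory=dict)


def filter_plain_text(raw: bytes) -> Document:
    """A filter: it knows how to handle exactly one kind of input."""
    return Document(content=raw.decode("utf-8"),
                    metadata={"mimetype": "text/plain"})


# Filters are looked up by the MIME type of the raw input (assumed registry).
FILTERS: Dict[str, Callable[[bytes], Document]] = {
    "text/plain": filter_plain_text,
}


def add_word_count(doc: Document) -> Document:
    """A post filter transformation: it only sees the document model."""
    doc.metadata["word_count"] = str(len(doc.content.split()))
    return doc


def run_pipeline(raw: bytes, mimetype: str,
                 transformations: List[Callable[[Document], Document]]) -> Document:
    doc = FILTERS[mimetype](raw)        # step 1: type-specific extraction
    for transform in transformations:   # step 2: type-independent enrichment
        doc = transform(doc)
    return doc
```

Calling `run_pipeline(b"some text", "text/plain", [add_word_count])` would first extract the text and then let the transformation annotate the result, without the transformation ever touching the raw input.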

Post filter transformation services can plug into the Fabasoft Mindbreeze Enterprise infrastructure via dedicated extension points. They can be installed and activated on demand for certain filter services, which makes their deployment flexible and easy.

Now let us take a look at a few possible deployment scenarios of Fabasoft Mindbreeze where post filter transformation plugins can play a key role.

Intelligent deduplication

Especially when crawling web content, it is a common problem that a large number of web pages with identical content, or with only irrelevant differences, are harvested. If each of these documents is indexed individually, the search results are flooded with identical hits, sometimes making it extremely hard for the user to find relevant items among the results.

Several deduplication techniques have been developed specifically for web search solutions. Using fuzzy (near-duplicate) hashing techniques, like the TextProfileSignature from the Apache Solr project, we can compute a key for each document based on the textual content extracted by the filter.

Using a post filter transformation plugin, we compute hash values for specific areas of each document. The plugin ensures that only one item with a given content hash can be found in the index by setting this hash value as the document's unique identifier.

This plugin is already available at
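To make the idea concrete, here is a minimal Python sketch of a fuzzy content signature used as a unique key. It is loosely in the spirit of a text-profile signature (tokenize, quantize token frequencies, drop rare tokens, hash the result) but is a simplified illustration, not the Apache Solr implementation and not the plugin mentioned above. The function names and the `quant` and `min_token_len` parameters are assumptions made for this example.

```python
import hashlib
import re
from collections import Counter


def content_signature(text: str, quant: int = 2, min_token_len: int = 2) -> str:
    """Fuzzy signature: small textual differences should map to the same key."""
    tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
              if len(t) >= min_token_len]
    counts = Counter(tokens)
    # Round each frequency down to a multiple of `quant`; rare tokens drop to 0.
    profile = {tok: (n // quant) * quant for tok, n in counts.items()}
    profile = {tok: n for tok, n in profile.items() if n > 0}
    # Sort by descending frequency, then by token, for a stable ordering.
    ordered = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    canonical = "\n".join(f"{tok} {n}" for tok, n in ordered)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(documents):
    """Keep one document per signature, using the signature as the unique key."""
    index = {}
    for doc in documents:
        key = content_signature(doc["content"])
        index.setdefault(key, doc)  # a later document with the same key is ignored
    return index
```

Because rare tokens are discarded, two pages that differ only in a footer or a timestamp tend to produce the same key. The flip side is that very short documents, where every token is rare, all collapse to the same empty profile; a real plugin would need a fallback for that case.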

Extracted content annotation phase for document categorization

The extracted information can be used in many ways for document categorization. As an illustrative example, a mail search solution based on Fabasoft Mindbreeze Enterprise can be customized with post filter transformation services that recognize mails with a certain addressee and sender and add a special label to them at indexing time. This label can later be used in search expressions. In this way, mails sent to a given mailing list, for example, can be easily retrieved. Moreover, statistical or dictionary-driven algorithms can be plugged in.
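A rule-based labeling step of this kind could look roughly like the following Python sketch. The mail is represented as a plain dictionary with `from`, `to`, and `labels` keys; this representation, the `LabelRule` class, and the example addresses are all hypothetical and only serve to illustrate the matching logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabelRule:
    """Hypothetical rule: attach `label` when addressee and/or sender match."""
    label: str
    addressee: Optional[str] = None
    sender: Optional[str] = None

    def matches(self, mail: dict) -> bool:
        recipients = {a.lower() for a in mail.get("to", [])}
        sender = mail.get("from", "").lower()
        if self.addressee is not None and self.addressee.lower() not in recipients:
            return False
        if self.sender is not None and self.sender.lower() != sender:
            return False
        return True


def apply_label_rules(mail: dict, rules: List[LabelRule]) -> dict:
    """Add the label of every matching rule to the mail's metadata."""
    labels = mail.setdefault("labels", [])
    for rule in rules:
        if rule.matches(mail) and rule.label not in labels:
            labels.append(rule.label)
    return mail
```

With a rule such as `LabelRule(label="dev-list", addressee="dev@example.com")`, every mail addressed to that list would carry the `dev-list` label, which a search expression could then filter on.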

Customizing access rights

In the case of a Portal Search Solution based on Fabasoft Mindbreeze Enterprise, post filter transformation plugins can generate access control lists for the crawled documents based on various metadata harvested during filtering (for example, URL patterns or processed microformats). Here, a custom plugin allows fine-tuning both which metadata is used for access control and how it is turned into access control lists.
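As a rough illustration of the URL pattern approach, the Python sketch below maps a document URL to a list of principals. The rule table, the `group:` naming scheme, the example host `portal.example.com`, and the restrictive default are assumptions for this example; they do not reflect how Fabasoft Mindbreeze Enterprise represents access control lists.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical rules: the first matching URL pattern decides who may read the page.
ACL_RULES: List[Tuple[re.Pattern, List[str]]] = [
    (re.compile(r"^https://portal\.example\.com/hr/"), ["group:hr"]),
    (re.compile(r"^https://portal\.example\.com/public/"), ["group:everyone"]),
]


def build_acl(url: str,
              rules: Sequence[Tuple[re.Pattern, List[str]]] = ACL_RULES,
              default: Sequence[str] = ("group:admins",)) -> List[str]:
    """Return the principals allowed to see the document at `url`."""
    for pattern, principals in rules:
        if pattern.search(url):
            return list(principals)
    # No rule matched: fall back to a restrictive default.
    return list(default)
```

Falling back to a restrictive default when no pattern matches is a deliberate choice in this sketch, so that an unanticipated URL does not become visible to everyone by accident.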