The key to high-quality web search results is harvesting detailed site information with the Fabasoft Mindbreeze Web Connector. In this article we briefly summarize how we build a contextualized and enriched understanding of your web content.
Phase 1: Setting the Frame
Where is the content that needs to be crawled?
The first phase of the web harvesting process is to tailor the web crawler so that it crawls exactly the pages that are relevant. The Mindbreeze Web Connector provides various options for fine-tuning the harvesting sources.
Some of them are:
- Setting multiple crawling roots,
- Regular expression-based selection for the URLs to be crawled,
- Regular expression-based selection for URL blacklisting,
- Detection and/or configuration of seed lists and crawling depth.
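To make the options above more concrete, here is a minimal Python sketch of how crawling roots, include and exclude patterns, and a depth limit could combine into a single decision about whether a URL gets crawled. The `CrawlScope` class, its field names and the example URLs are hypothetical and only illustrate the idea; they are not the Mindbreeze Web Connector configuration.

```python
import re
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse


@dataclass
class CrawlScope:
    """Hypothetical crawl scope (illustration only, not the connector's API)."""
    roots: List[str]
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)
    max_depth: int = 3

    def __post_init__(self):
        # Compile the patterns once and remember the hosts of all crawling roots.
        self._include = [re.compile(p) for p in self.include_patterns]
        self._exclude = [re.compile(p) for p in self.exclude_patterns]
        self._root_hosts = {urlparse(r).netloc for r in self.roots}

    def allows(self, url: str, depth: int) -> bool:
        """Return True if `url`, found `depth` links away from a root, is in scope."""
        if depth > self.max_depth:
            return False
        if urlparse(url).netloc not in self._root_hosts:
            return False
        # The blacklist wins over the include patterns.
        if any(p.search(url) for p in self._exclude):
            return False
        # An empty include list means "everything under the roots".
        if self._include and not any(p.search(url) for p in self._include):
            return False
        return True


scope = CrawlScope(
    roots=["https://www.example.com/"],
    include_patterns=[r"^https://www\.example\.com/(products|blog)/"],
    exclude_patterns=[r"\.pdf$", r"/login"],
    max_depth=2,
)
print(scope.allows("https://www.example.com/blog/post-1", depth=1))
print(scope.allows("https://www.example.com/blog/file.pdf", depth=1))
```

In this sketch the exclude patterns are checked before the include patterns, so a blacklisted URL is never crawled even if it also matches an include pattern.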
Phase 2: Adjusting the Focus
What is your relevant content? What needs to be emphasized?
The next challenging task is to filter the relevant information from the crawled documents. We need to extract information such as the title, an identifier of the search hit, or the textual content. The possibilities of data extraction do not end there, however. Several technologies are emerging for the semantic markup of HTML documents, for example the Microdata format defined in the HTML5 standard, the metadata terms of the Dublin Core Metadata Initiative, or the popular and often discussed Microformats. These provide tools for adding semantic information to HTML elements, such as marking them as contact information, a geographical position or access control information.
The metadata extraction mechanism of our Filter Suite and Connectors aims to provide flexible tools for extracting such semantic data and already supports parsing a large subset of microformats.
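As an illustration of what such extraction involves, the following Python sketch collects Microdata `itemprop` values from an HTML fragment using only the standard library. It is a simplified stand-in, not the Filter Suite's implementation: the class and function names are hypothetical, nested items are not handled, and only a few of the attribute-based value rules from the Microdata specification are covered.

```python
from html.parser import HTMLParser

# Elements whose property value comes from an attribute rather than their text.
ATTRIBUTE_VALUES = {"meta": "content", "a": "href", "area": "href",
                    "link": "href", "img": "src", "source": "src", "embed": "src"}


class ItempropExtractor(HTMLParser):
    """Collects `itemprop` values into a dict of name -> list of values."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._name = None    # property currently being captured, if any
        self._tag = None     # tag that opened the captured property
        self._depth = 0      # nesting depth of that same tag
        self._buffer = []

    def _add(self, name, value):
        self.properties.setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self._name is not None:
            # Already inside a property: only track nesting of the same tag.
            if tag == self._tag:
                self._depth += 1
            return
        name = attr.get("itemprop")
        if name is None:
            return
        if tag in ATTRIBUTE_VALUES:
            value = attr.get(ATTRIBUTE_VALUES[tag])
            if value:
                self._add(name, value)
            return
        self._name, self._tag, self._depth, self._buffer = name, tag, 1, []

    def handle_endtag(self, tag):
        if self._name is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Normalize whitespace in the collected text content.
            self._add(self._name, " ".join("".join(self._buffer).split()))
            self._name = None

    def handle_data(self, data):
        if self._name is not None:
            self._buffer.append(data)


def extract_itemprops(html):
    parser = ItempropExtractor()
    parser.feed(html)
    parser.close()
    return parser.properties


sample = """
<div itemscope itemtype="https://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <a itemprop="url" href="https://example.com/jane">Profile</a>
  <meta itemprop="jobTitle" content="Editor">
</div>
"""
print(extract_itemprops(sample))
```

The resulting dictionary is the kind of structured metadata that can be attached to an indexed document alongside its title and text.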
The harvested information is then added to the indexed document, annotating it with semantically enriched context. These technologies enable a whole new palette of possible use cases for search-based content and websites, far beyond traditional full-text searches. For example:
- Search for people and contact information on websites,
- Location-based search, for example search for documents related to a given street address,
- Access control of search results based on metadata,
- Or even search-driven landing pages that take the user's specific context into account.
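The access control use case can be sketched in a few lines of Python. The example assumes that each search hit carries a hypothetical `allowed_groups` metadata field harvested from the page; the field name, the hit structure and the `visible_hits` function are illustrative assumptions, not part of any Mindbreeze API.

```python
def visible_hits(hits, user_groups):
    """Keep only hits whose 'allowed_groups' metadata shares a group with the user.

    Hits without any access control metadata are hidden (deny by default).
    """
    groups = set(user_groups)
    return [
        hit for hit in hits
        if groups & set(hit.get("metadata", {}).get("allowed_groups", []))
    ]


hits = [
    {"title": "Press release", "metadata": {"allowed_groups": ["public", "staff"]}},
    {"title": "Internal roadmap", "metadata": {"allowed_groups": ["staff"]}},
    {"title": "Untagged page", "metadata": {}},
]

for hit in visible_hits(hits, ["public"]):
    print(hit["title"])
```

Denying access when the metadata is missing is a deliberate choice in this sketch: a page that was never annotated should not become visible by accident.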
Phase 3: Usability & Style for your Users
How can we access the search results?
The Fabasoft Mindbreeze InApp client is perhaps the best example of how easy and adaptable it can be to display website search results on the site itself. With just a few lines of code, one can integrate and customize the Mindbreeze InApp client to include exactly the UI widgets that are needed and to match the look and feel of the hosting site.
The web indices built by the Mindbreeze Web Connector are also accessible from the Fabasoft Mindbreeze Enterprise client as well as from the Fabasoft Mindbreeze Mobile client.
Using the Fabasoft Mindbreeze Enterprise client, a search can cover sources from the intranet (mail, file shares, etc.) combined with the company's website, providing users with a common access point to all their relevant content.
With the Fabasoft Mindbreeze Mobile client, the search results can be displayed on a wide range of mobile devices, making website search easily available from just about anywhere.