This post covers the typical steps for connecting (crawling, indexing, querying) a custom data source, such as a self-built line-of-business application, to an enterprise search infrastructure. You will learn about ways to search-enable your data sources, as well as how to enhance the experience for your users. This is the typical five-step approach:
- Crawl all relevant information
- Verify the results
- Customize how results are displayed
- Enforce access rights
- Provide actions for results
Crawl all relevant information
The common approach to connecting and crawling a custom data source is to use the Application Programming Interface (API) provided by the enterprise search vendor. You implement a so-called crawler that accesses the data source, transforms the data into the structure the search engine needs to support your use cases, and submits the data. This approach provides maximum flexibility and, depending on the product, lets you implement very powerful and, ideally, easy-to-use functions. But there is also a big pain point: you have to set up a development project to even get started.
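The crawler pattern can be sketched as follows. This is a minimal, illustrative example: `SearchIndexStub`, the field names and the record layout are assumptions standing in for whatever your vendor's indexing API actually looks like.

```python
# Minimal crawler sketch: fetch records from a custom source, map them
# to the schema the search engine expects, and submit them.
# SearchIndexStub and all field names are illustrative, not a vendor API.

class SearchIndexStub:
    """Stand-in for a vendor indexing client (e.g. a push/REST API)."""
    def __init__(self):
        self.documents = []

    def submit(self, doc):
        self.documents.append(doc)

def fetch_records():
    # A real crawler would page through a database or web service here.
    yield {"id": 1, "subject": "Invoice 4711", "body": "Amount due: 120 EUR"}
    yield {"id": 2, "subject": "Order 0815", "body": "Shipped on 2012-03-01"}

def crawl(index):
    for record in fetch_records():
        # Transform source records into the index schema.
        index.submit({
            "key": f"lob:{record['id']}",   # unique key per source object
            "title": record["subject"],
            "content": record["body"],
        })

index = SearchIndexStub()
crawl(index)
print(len(index.documents))  # 2
```

The essential points are the same regardless of vendor: a stable unique key per source object, and a deliberate mapping from source fields to index fields.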
Another approach is to use "Extract, Transform and Load" (ETL) mechanisms to ease the process of getting data into the index. ETL tools allow you to graphically configure a process for extracting data from the data source, transforming it to a different (and possibly varying) schema, and loading it into a target system.
ETL tools also support combining multiple data sources, as well as data cleansing and normalization, to name just a few possibilities. Being able to use your enterprise search system as the target of your ETL process allows for great flexibility in accessing a multitude of data sources. ETL tools also allow for a more explorative approach, since you can validate the data in the graphical user interface and adapt as necessary. One ETL product I have found to be worth spending time on is Talend Open Studio. It ships with 500+ components and can be extended via Talend Exchange, which is maintained by Talend's community.
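What an ETL tool configures graphically boils down to the extract-transform-load pattern below. The sketch is illustrative, assuming a source with inconsistent date formats that need normalizing (a typical cleansing step); the row layout and field names are made up for the example.

```python
# Sketch of the extract-transform-load pattern an ETL tool lets you
# configure graphically; all data and field names are illustrative.
from datetime import datetime

def extract():
    # Rows as they come from the source, with inconsistent formats.
    return [
        {"id": "A1", "title": "  Quarterly Report ", "date": "01.03.2012"},
        {"id": "A2", "title": "Sales Figures", "date": "2012-03-15"},
    ]

def transform(rows):
    for row in rows:
        date = row["date"]
        # Normalize day.month.year dates to ISO 8601 (cleansing).
        if "." in date:
            date = datetime.strptime(date, "%d.%m.%Y").date().isoformat()
        yield {
            "key": row["id"],
            "title": row["title"].strip(),   # trim stray whitespace
            "date": date,
        }

def load(docs, target):
    # In practice this would call the search engine's indexing API.
    target.extend(docs)

index = []
load(transform(extract()), index)
print(index[0]["date"])  # 2012-03-01
```

In a tool like Talend Open Studio each of these stages corresponds to components you wire together on a canvas instead of writing by hand.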
Verify the results
After the first crawl run, verify that all data was indexed as intended. Try behaving like a typical user and query for common use cases. Involve your users in a feedback loop during this verification process. Also use the available log files to check whether all required metadata is present or whether the ETL job needs to be adapted.
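Part of that verification can be automated. The sketch below checks a set of indexed documents for missing metadata fields; the required-field list and the documents are assumptions for illustration.

```python
# Sketch: verify that every indexed document carries the metadata the
# use cases need; the field list and documents are illustrative.
REQUIRED_FIELDS = {"key", "title", "date"}

indexed = [
    {"key": "A1", "title": "Quarterly Report", "date": "2012-03-01"},
    {"key": "A2", "title": "Sales Figures"},   # missing "date"
]

def missing_metadata(docs):
    """Return (key, missing fields) for every incomplete document."""
    problems = []
    for doc in docs:
        missing = REQUIRED_FIELDS - doc.keys()
        if missing:
            problems.append((doc["key"], sorted(missing)))
    return problems

print(missing_metadata(indexed))  # [('A2', ['date'])]
```

A report like this tells you directly which ETL mapping needs to be adapted before the next crawl run.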
Customize how results are displayed
To enable your users to get the most out of the results, you should refine how results are displayed. Display only the data needed for the specific task; define formatting options for values such as dates; and adapt the facets available for drill-down. All these steps provide your users with a convenient interface.
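As a small illustration of the first two points, the sketch below shapes a raw hit into what the result list should actually show: it selects only the fields the task needs and renders the date in a readable format. The hit structure and field choices are assumptions for the example.

```python
# Sketch: shape a raw search hit into a display-friendly result;
# the hit structure and chosen fields are illustrative.
from datetime import date

def format_result(hit):
    # Show only what the task needs, with a human-readable date.
    d = date.fromisoformat(hit["date"])
    return {
        "title": hit["title"],
        "modified": d.strftime("%d %B %Y"),
    }

hit = {"key": "A1", "title": "Quarterly Report",
       "date": "2012-03-01", "raw_blob": "..."}
print(format_result(hit))  # {'title': 'Quarterly Report', 'modified': '01 March 2012'}
```

Most enterprise search products let you configure this kind of field selection and formatting declaratively rather than in code, but the decisions you make are the same.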
Enforce access rights
One really important step is to enforce secure access: everyone must see only what he or she is supposed to see. Search engines either use Access Control Lists (ACLs) to determine the access rights of each user or perform live access checks against the data source at query time. Adding and updating ACLs can also be done with an ETL tool.
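The ACL variant can be sketched as follows: each document carries the set of groups allowed to see it, and results are filtered against the querying user's groups. The group names and the simple set-based model are assumptions for illustration; real products use richer ACL formats.

```python
# Sketch: attach an ACL to each document at index time and filter
# results against the user's groups; the model is illustrative.

documents = [
    {"key": "A1", "title": "Public Memo", "acl": {"everyone"}},
    {"key": "A2", "title": "Salary List", "acl": {"hr"}},
]

def search(user_groups, docs):
    # A user may see a document if any of their groups is in its ACL.
    return [d["key"] for d in docs if d["acl"] & user_groups]

print(search({"everyone"}, documents))        # ['A1']
print(search({"everyone", "hr"}, documents))  # ['A1', 'A2']
```

Filtering at the engine rather than in the user interface is essential: otherwise restricted titles and snippets still leak through result counts and previews.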
Provide actions for results
Finally, after finding the results, your users need to be able to act on them. Being able to execute further actions, such as sending, editing or deleting the result object, lets users work through their tasks quickly and without many context switches.
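One simple way to offer such actions is a registry that maps result types to the operations available on them, as sketched below. The types and action names are made up for the example.

```python
# Sketch: offer actions per result type so users can act on a hit
# without switching applications; the registry is illustrative.

ACTIONS = {
    "mail": ["open", "reply", "forward", "delete"],
    "document": ["open", "edit", "delete"],
}

def actions_for(result):
    # Unknown types still get a safe default action.
    return ACTIONS.get(result["type"], ["open"])

print(actions_for({"key": "M1", "type": "mail"}))     # ['open', 'reply', 'forward', 'delete']
print(actions_for({"key": "X9", "type": "unknown"}))  # ['open']
```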
In our opinion, step one, crawling all relevant information while keeping all semantics, is by far the most elaborate part of integrating a custom data source. We have found that by using an ETL tool we can eliminate, or at least reduce to a minimum, the project-specific implementation work for most customers.
To start integrating your custom data source into Fabasoft Mindbreeze Enterprise, see the Fabasoft Mindbreeze Enterprise downloads page. We recommend taking a look at Talend Open Studio to define your ETL jobs.