Semantic Pipeline Introduction

Discover how the Semantic Pipeline enhances your search experience by enriching metadata, recognizing entities, and transforming data for smarter, more intuitive results!

The Semantic Pipeline

The Semantic Pipeline enriches the documents in your index. You can extract additional metadata using vocabularies, patterns, and the Property Expression Language, and plugins can derive further information from each document.

Core Features

Out of the box, you can:

Translate metadata values
Label documents using text classification
Call LLMs
Much more

Precomputed Synthesized Metadata

One of the most powerful features in the Semantic Pipeline is Precomputed Synthesized Metadata. For instance:

A numeric value, such as a file size, can be sorted into predefined groups, which makes it easier to filter on.

A flat filter facet, such as a URL, can be transformed into a hierarchical navigation system, making exploration easier.
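The two transformations above can be sketched in a few lines. This is plain Python, not the Property Expression Language, and the size boundaries, labels, and function names are illustrative assumptions rather than Mindbreeze defaults.

```python
from urllib.parse import urlparse

# Hypothetical size groups: (exclusive upper bound in bytes, label).
SIZE_GROUPS = [
    (1024 * 1024, "Small (< 1 MB)"),
    (100 * 1024 * 1024, "Medium (< 100 MB)"),
]


def size_group(size_in_bytes: int) -> str:
    """Map an exact file size to one of a few predefined groups."""
    for upper_bound, label in SIZE_GROUPS:
        if size_in_bytes < upper_bound:
            return label
    return "Large (>= 100 MB)"


def url_hierarchy(url: str) -> list[str]:
    """Turn a flat URL into a list of nested levels, from host down to the file."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    levels = [parsed.netloc]
    for segment in segments:
        # Each level extends the previous one by a single path segment.
        levels.append(levels[-1] + "/" + segment)
    return levels
```

Each entry returned by url_hierarchy could serve as one level of a hierarchical facet, while size_group replaces many distinct numeric values with a handful of filterable labels.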

Entity Recognition

Entity Recognition extracts entities, such as product names, from unstructured and semi-structured documents. In this example, all identified product names are stored in a metadata field called ProductName. Entity Recognition also records the exact location of each match.

For example, by opening a brochure in the preview, we can see exactly where each product name has been identified.

CSV Transformation

CSV Transformation allows us to attach columns from a CSV file to results from Mindbreeze. For example, for each result we can look up the author in the CSV file and attach that author's role.

After applying CSV Transformation, a role column populated from the CSV data is available alongside the author column. We can test this by selecting Educator, which corresponds to Keenan Whitney: filtering by Educator displays all documents created by Keenan Whitney.

Item Transformation

Item Transformation allows us to group files by type. By default, file types are presented as a simple flat list. With a groups configuration, related file types are collected under a common group instead.

For example, clicking Documents reveals all the file types under this category. The same applies to other types, such as Spreadsheets, providing a more streamlined and intuitive experience.
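As a rough illustration of the grouping idea, the sketch below maps file extensions to groups in plain Python. The group contents and the fallback group "Other" are assumptions made for this example; the actual groups configuration in Mindbreeze may look different.

```python
from collections import defaultdict

# Hypothetical groups configuration: group name -> file extensions.
FILE_TYPE_GROUPS = {
    "Documents": {"pdf", "doc", "docx", "txt"},
    "Spreadsheets": {"xls", "xlsx", "csv"},
}


def group_for(extension: str) -> str:
    """Return the group a file extension belongs to, or "Other" if none matches."""
    ext = extension.lower().lstrip(".")
    for group, extensions in FILE_TYPE_GROUPS.items():
        if ext in extensions:
            return group
    return "Other"


def group_file_types(extensions):
    """Collect a flat list of extensions into {group: sorted extensions}."""
    grouped = defaultdict(set)
    for ext in extensions:
        grouped[group_for(ext)].add(ext.lower().lstrip("."))
    return {group: sorted(types) for group, types in grouped.items()}
```

Calling group_file_types(["pdf", "xlsx", "docx"]) places pdf and docx under Documents and xlsx under Spreadsheets, which mirrors what clicking a group reveals in the filter.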

Language Detection

The Language Detector automatically identifies the language of results by analyzing their content. No metadata is required within the file to indicate its language to Mindbreeze.

For example:

Selecting "de" displays German results.
Selecting "fr" brings up French results.
Selecting "en" provides English results.

Precomputed Synthesized Metadata and Property Expression Language

Precomputed Synthesized Metadata is among the most powerful tools within the Semantic Pipeline. It has its own scripting language, the Property Expression Language, which lets us create and modify metadata through user-defined expressions that are fully evaluated during index reinversion.

Entity Recognition Using Regex

Entity Recognition extracts information from semi-structured and unstructured data using regular expression (regex) rules. Take the SM87A microphone as an example product name:

The first Entity Recognition rule matches one or more letters at the start of a word (SM).
The second rule matches one or more digits (87).
An optional suffix of one or more letters at the end of the word (A) is also matched.

When combined, these rules accurately extract the product name and store it in a new metadata field labeled ProductName.
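The combined rule can be approximated with a single regular expression. The sketch below uses Python's re module, and the exact pattern is an assumption reconstructed from the description above, not the rule syntax used in the Mindbreeze configuration.

```python
import re

# Assumed combination of the three rules:
#   \b               start at a word boundary
#   [A-Za-z]+        one or more letters        (SM)
#   \d+              one or more digits         (87)
#   (?:[A-Za-z]+)?   optional suffix of letters (A)
#   \b               end at a word boundary
PRODUCT_NAME = re.compile(r"\b[A-Za-z]+\d+(?:[A-Za-z]+)?\b")


def find_product_names(text):
    """Return (match, start, end) for every product name found in the text."""
    return [(m.group(), m.start(), m.end()) for m in PRODUCT_NAME.finditer(text)]
```

The start and end offsets play the role of the matching locations, and the matched strings are what would be stored in the ProductName field.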

CSV Transformation in Action

CSV Transformation enriches results within a Mindbreeze InSpire index. By mapping the Author field of a result to the Name column in the CSV file, we link each author to the matching CSV row. This gives access to the additional information in that row, such as the author's role.
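The lookup can be pictured as a simple join between a result and a CSV row. This is a minimal Python sketch, not the Mindbreeze implementation: the Name and Author fields and the sample row come from the example above, while the Role column header, the function names, and the behavior for unmatched authors are assumptions.

```python
import csv
import io

# Sample CSV content; in practice this would be read from a file.
ROLES_CSV = """Name,Role
Keenan Whitney,Educator
"""


def load_rows_by_name(csv_text):
    """Index the CSV rows by their Name column for fast lookup."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Name"]: row for row in reader}


def enrich(result, rows_by_name):
    """Return a copy of the result with Role attached when the Author matches a Name."""
    enriched = dict(result)
    row = rows_by_name.get(result.get("Author"))
    if row is not None:
        enriched["Role"] = row["Role"]
    # Assumption: results whose author is not in the CSV are left unchanged.
    return enriched
```

With this data, enriching a result whose Author is Keenan Whitney adds Role = Educator, which is what makes the Educator filter possible.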

Item Transformers

Item Transformers are a series of plugins that are typically more complex and tailored to specific tasks. Examples include:

Language Detection plug-in
I18n Translation plug-in

Semantic Processing

Semantic Processing includes powerful features such as:

Language detection
Named Entity Recognition

Execution Phases in the Semantic Pipeline

The phases are executed in sequence, so each phase can only use metadata created by the phases before it:

If you create metadata in the Entity Recognition phase, it will be available in the CSV Transformation phase.
However, metadata created in the CSV Transformation phase cannot be used during Entity Recognition.
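The effect of phase order can be shown with a toy pipeline. Everything in this sketch is hypothetical, including the Category field, the lookup table that stands in for a CSV file, and the simplified substring check in place of real Entity Recognition rules. It only illustrates that a phase sees metadata from earlier phases and not from later ones.

```python
# Hypothetical CSV data keyed by product name.
PRODUCT_CATEGORIES = {"SM87A": "Microphone"}


def entity_recognition(metadata):
    """Create ProductName from the content (simplified to a substring check)."""
    if "SM87A" in metadata.get("content", ""):
        return {"ProductName": "SM87A"}
    return {}


def csv_transformation(metadata):
    """Look up a category for ProductName, if an earlier phase has set it."""
    category = PRODUCT_CATEGORIES.get(metadata.get("ProductName"))
    return {"Category": category} if category else {}


PHASES = [entity_recognition, csv_transformation]


def run_pipeline(metadata, phases=PHASES):
    """Run the phases in order; each phase sees metadata from all earlier phases."""
    metadata = dict(metadata)
    for phase in phases:
        metadata.update(phase(metadata))
    return metadata
```

Run in the default order, the document ends up with both ProductName and Category. If the two phases are swapped, csv_transformation runs before ProductName exists and Category is never set.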

Index Reinvert and Reindex

The Semantic Pipeline operates on documents in the index during the inversion phase. Every document goes through this phase when inserted or updated.

This means that changes to settings are not automatically applied to documents that were already in the index before the change. To apply new settings to all documents:

Perform a reinvert, which processes all documents again.
Alternatively, perform a reindex, which downloads all documents from the data source again (this takes longer).

A reinvert is also necessary when making metadata available for filtering or grouping results.

Advanced Filtering and Grouping

Only metadata that is aggregatable is available for advanced filtering, grouping, sorting, and mapping.