Search Engine
Introduction#
The full text search (FTS) is an integral part of imperia and is used to find any words or combinations of words in the documents of a project. It serves as a way of searching the available documents on a target system. Documents on a production system may also be searched. This is useful if the capabilities of the search integrated into imperia's user interface are not sufficient. Documents that have already left the workflow are indexed in the FTS.
By default, the FTS processes ASCII documents (HTML, ASP, etc.). With the help of additional programs, it is also possible to search Microsoft Office (*.doc, *.xls, etc.) and PDF documents. The search may be restricted to certain areas by grouping, which can be limited to directory or meta levels. The search can be adjusted by using different search and result screens for different user groups. Domains enable you to manage multiple search indexes. In this way, for example, the databases of systems with multiple clients can be indexed and searched separately.
Basically, the full text search indexes the visible HTML text of a document. It is also possible to explicitly include non-visible meta field content and to exclude visible HTML code from the indexing.
Entering Search Queries/Search Syntax#
The input for searches is done via templates. More information on creating search templates can be found in chapter imperia's Full Text Search Templates in the programming documentation.
There are also some syntax rules for entering search terms:
- A search query consists of one or more words.
- Combining multiple search words refines the search and provides more precise results.
- Grouping function, logical operators, and the ability to search for specific phrases are available.
- The search requested may also be filtered according to the contents of certain meta fields.
The syntax of these special search variants is described in the following sections. The spacing between the individual search words in the text of the retrieved documents plays no role in the calculation of relevance.
Case-sensitive#
imperia's FTS ignores case in search terms.
Special Characters and Punctuation#
In general, the full text search ignores special characters and punctuation, since they are generally not indexed. For this reason, there are no exceptions for fixed terms with special characters such as C++. There are, however, some exceptions in the use of special characters in a query. The special characters ", (, ), :, \ and | can be used to run a specific search (see below).
Wildcards#
Use the asterisk “*” in search terms as a wildcard for one or more letters. The more wildcards a query contains, the more resources its processing requires. Therefore, by default, the full text search is configured so that a maximum of one wildcard per keyword is possible. For performance reasons, wildcard searches are also limited to single-word searches (querymode = strict), since a search for 'a*', for example, potentially has hundreds of hits.
This setting can be adapted by changing the value of querymode, see also Directives in Option Blocks.
If wildcards are used in conjunction with other words under the default setting, the results will not be accurate, because the FTS ignores the wildcard. Instead of documents that contain the “term” and words that start with “a” and end in “e”, a search for a*e AND term returns documents that contain “a” or “e” and the “term”.
Wildcards that are not part of a search word (space before and after) are ignored by the full text search.
It is not possible to replace one or more words in phrases with wildcards (see Phrase Search).
It is possible to change the wildcard settings, so that more wildcards can be used within a word, as well as a combination of keywords with or without wildcards. Keep in mind that this has a negative effect on the performance.
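The matching rule described in this section (an asterisk stands for one or more letters) can be illustrated with a short Python sketch. It is not imperia's implementation: the function name is hypothetical, and treating "one or more letters" as one or more word characters is an assumption made for this example.

```python
import re

def wildcard_to_regex(term):
    """Translate a search term with '*' wildcards into an anchored regex.

    Assumption: each '*' stands for one or more word characters.
    """
    # Escape the literal parts so that only '*' has a special meaning.
    literal_parts = [re.escape(part) for part in term.split("*")]
    return re.compile(r"\w+".join(literal_parts) + r"\Z", re.IGNORECASE)

pattern = wildcard_to_regex("a*e")
for word in ["Apple", "able", "about", "ae"]:
    # match() anchors at the start, \Z anchors at the end of the word.
    print(word, bool(pattern.match(word)))
```

In this sketch, “Apple” and “able” match the term a*e, while “about” (wrong ending) and “ae” (no letter between “a” and “e”) do not.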
Note
Make sure to set the right values for minValidCharacters, see also Directives in Option Blocks.
Furthermore, all other parsing and processing rules also apply for querymode.
Alternative Spelling/Hyphenation#
A function for finding alternative spellings of a word (with or without a hyphen, written separately or together) is currently not implemented. Words written separately are also indexed separately. It is not possible to search specifically for hyphenated spellings.
Umlauts#
imperia's full text search does not distinguish between alternative spellings of umlauts. The terms “Bürgermeister” and “Buergermeister” return identical results.
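One way to picture this equivalence is a normalization step that folds umlauts into their two-letter transcriptions before terms are compared. The following Python sketch only illustrates the idea; the mapping table and the function name are assumptions and are not taken from imperia.

```python
# Hypothetical mapping of umlauts to their two-letter transcriptions.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def normalize_term(term):
    """Return a lowercase form of a term with umlauts transcribed."""
    return term.translate(UMLAUT_MAP).lower()

# Both spellings end up as the same normalized key.
print(normalize_term("Bürgermeister"))
print(normalize_term("Buergermeister"))
```

Lowercasing is included because the FTS also ignores case in search terms.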
Phonetic Similarity#
The full text search ignores phonetic similarity. This also applies to terms like “graphik” and “grafik”, etc.
Stop Words#
imperia's full text search provides the possibility to define a list of stop words. The words from this list do not appear in an index. The full text search also ignores stop words in search queries. A request that contains a word from the stop word list and other search words gives the same result as a request without the stop word. It is irrelevant whether the individual keywords are linked by AND or OR, or whether it is a phrase search. If a search consists solely of stop words, it returns no results.
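The effect of stop words on a query can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the stop word list and both function names are hypothetical, and operators and phrases are left out.

```python
# Hypothetical stop word list; real lists are configured per project.
STOP_WORDS = {"the", "is", "a"}

def strip_stop_words(terms):
    """Drop stop words from a list of query terms, ignoring case."""
    return [term for term in terms if term.lower() not in STOP_WORDS]

def effective_query(terms):
    """Return the terms that are actually searched, or None if none remain."""
    remaining = strip_stop_words(terms)
    return remaining or None

print(effective_query(["the", "mayor"]))  # same as searching for "mayor" alone
print(effective_query(["The", "is"]))     # only stop words: nothing to search
```

A query that shrinks to nothing corresponds to the case in which the search returns no results.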
In the result template, there is a possibility to give separate feedback for the use of any stop word in searches.
Stop word lists can be configured globally or per domain. A combination of the two variants is also possible. For information on how to create and manage stop word lists, see Create and Manage Blacklists and basis tags.
Phrase Search#
Put one or more search terms in double quotation marks to identify them as a search phrase. The search will then return only results that contain all the words in the phrase in exactly the same order. The punctuation in the original document is irrelevant.
Categorized Search#
Restriction of the search to specific categories can be done by defining meta groups in the index creation, which are then put in the search form as selection options. For details, see section
URL Search#
A targeted URL search, similar to the Google search engine, can also be used in imperia's FTS. For this purpose, define a site meta field in documents. Keywords that use the syntax 'site:www.myurl.en' will return the relevant results. Syntactically, this is a filter on the contents of a meta field. See also Filtering by Meta Contents.
Stemming#
Currently, the search for alternative words with the same root is not implemented.
Grouping#
By bracketing "()", you can group one or more words in search terms. Pay attention to proper bracketing. In the absence of opening or closing parentheses, grouping does not work. Instead, the full text search interprets the brackets as individual components of the search term.
Filtering by Meta Contents#
In order to filter searches for specific meta field contents use:
metaFieldName:term
or
metaFieldName:|term term2|
The second example, with the pipe character, contains an implicit AND operation of the two terms.
Combine and Exclude Keywords, Logical Operators#
With the AND and OR operators parts of the search term can be linked together:
- The link “word1 AND word2” provides only hits, in which both search words are included.
- When “word1 OR word2” is used you get hits, in which one of the two keywords is included.
- If you want to exclude certain words from your search, put NOT in front of the keyword.
Upper and lower case is irrelevant for the AND, OR and NOT operators. Language-specific synonyms for the logical operators can be specified in the search configuration (file site/config/fts.conf). In addition, the following special characters function as “shortcuts” for logical operators:
Operator | Shortcut |
---|---|
AND | + |
OR | ? |
NOT | -, ! |
Logical operators can be combined arbitrarily. By default, multiple search words without an explicit operator are combined with OR. This may be changed in the search configuration (see the defaultOperator directive in Directives in Option Blocks).
A special case is searching for words with a hyphen. When searching for “Handball-WM”, the FTS interprets the query as: (handball OR wm) OR "handball wm" OR "handballwm". If the “AND” operator is set as the default, the following is searched: (handball AND wm) OR "handball wm" OR "handballwm".
Displaying Search Results#
The presentation of search results is done via their own templates. The special syntax for creating result templates is explained in the chapter Result Templates in the programming documentation.
By default, the hits are sorted by relevance in descending order. More sorting options can be enabled by configuring result filtering. Refer to the sortBy directive in UserSort: Determining the Hits' Order.
When displaying a single hit, a configurable amount of context in which the search term is highlighted can also be displayed.
Displaying Context#
When displaying search hits the full text search also shows context for each hit. The position of the keyword in a paragraph, as well as the word count for this excerpt, are definable.
If the search query is composed of several words, the first found section of text, in which the words occur most, and does not exceed the defined word count, is displayed. Even if a document contains all of the search words, they are not necessarily displayed in the context, if the word count is not sufficient.
Furthermore, the first paragraph is not necessarily the one that contains the search phrase.
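The selection of the context excerpt can be pictured as a sliding window over the words of a document. The Python sketch below is an illustration only: the function name is hypothetical, and counting distinct search words per window is an assumption about what "occur most" means.

```python
def best_context(words, search_terms, max_words):
    """Return the first window of at most max_words words that contains
    the largest number of distinct search terms."""
    terms = {term.lower() for term in search_terms}
    best_start, best_hits = 0, -1
    # Slide a window of max_words words over the text.
    for start in range(max(1, len(words) - max_words + 1)):
        window = words[start:start + max_words]
        hits = len(terms & {word.lower() for word in window})
        if hits > best_hits:  # strict ">" keeps the first best window
            best_start, best_hits = start, hits
    return words[best_start:best_start + max_words]

text = ("the mayor opened the new town hall while "
        "the council met in the old town hall").split()
print(best_context(text, ["council", "old"], 5))
print(best_context(text, ["council", "old"], 3))
```

With a word count of 5 the excerpt contains both search words. With a word count of 3 no window can hold both, so only one of them appears in the excerpt, which mirrors the behavior described above.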
Highlighting#
If you want the results to be highlighted, the necessary instructions are entered in the result template. There are special highlighting modes and CSS classes for highlighting search terms with wildcards. Refer to the chapter Result Templates in the programming documentation.
imperia provides a default CSS file for the presentation of highlights. By defining a local CSS class in the result template, you can replace this standard highlighting design with your own formatting. For more information, read the section highlight in Directives in option blocks.
Result Filters#
Usually, imperia sends results to the requesting client directly after processing a query. An open interface plug-in allows results to be edited before presentation. In this way, special methods of sorting or manipulation of the metadata's representation can be implemented for the retrieved documents.
Result filter plug-ins should be stored in site/modules/core/Dynamic/FTSResultFilter. The directory contains some examples. The methods available for the representation of the hit list are described in the POD of the site/modules/core/Imperia/FTS/ResultFilterPlugin.pm module.
For information on the activation of result filters in the FTS, please read the section resultFilterPlugin in Directives in option blocks.
Calculating Relevance#
The calculation of the relevance of a document in a hit list is done in several steps. When indexing the FTS determines a relevance value for each word contained in a text. This is done on the basis of how many documents contain the queried word and how common it is in those documents. The final relevance of individual words in an index is obtained from the relation to the average. Words like “and”, “is”, etc. obtain a comparatively minor relevance, since they occur frequently in many documents.
For a full text search query, the relevance of the retrieved documents is identified from the product of the relevance of a search term and its frequency in each document. If a search term consists of multiple words, documents' relevance is calculated from the sum of the values of individual words.
Documents' relevance can be influenced by awarding bonuses in two ways:
- bonuses can be based on the position of a word in a document. Refer to section <bonus> in Tags and Directives, or
- bonuses can be date-based. Refer to Datebonus: sorting hits with date-based bonuses.
Although the relevance value is ultimately determined as an absolute number, the output is given as a percent. Here, the document with the highest value is the starting point for the conversion.
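The query-time part of this calculation can be written down as a short Python sketch. It assumes that the per-word relevance values have already been determined during indexing; the concrete numbers, file names and function names below are hypothetical and only serve to illustrate the arithmetic.

```python
def document_score(term_counts, word_relevance, query_terms):
    """Sum of (word relevance x frequency in the document) over all query terms."""
    return sum(
        word_relevance.get(term, 0.0) * term_counts.get(term, 0)
        for term in query_terms
    )

def to_percent(scores):
    """Scale absolute scores so that the best document is reported as 100 percent."""
    top = max(scores.values(), default=0.0)
    if top <= 0:
        return {doc: 0.0 for doc in scores}
    return {doc: 100.0 * score / top for doc, score in scores.items()}

# Hypothetical per-word relevance values from the index.
word_relevance = {"handball": 2.0, "wm": 0.5}
# Hypothetical term frequencies per document.
documents = {
    "a.html": {"handball": 3, "wm": 1},
    "b.html": {"handball": 1, "wm": 4},
}
scores = {
    name: document_score(counts, word_relevance, ["handball", "wm"])
    for name, counts in documents.items()
}
print(scores)
print(to_percent(scores))
```

Bonuses are not modeled here; they would enter the calculation through the frequency values.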
For more information read the section UserSort: Determining the Hits' Order.
Configuring imperia's Full Text Search#
The full text search uses two scripts: site/bin/fts_index.pl for indexing and cgi-bin/fts_search.pl for processing queries.
There is also an additional script (fts_conf_convert.pl) that helps convert existing configurations from older versions of imperia's FTS to the new format. The entire configuration is controlled by the configuration files /site/config/index.conf and /site/config/fts.conf. A sample configuration can be found in /site/config/index.conf.sample, which also contains a brief description of the parameters' functions.
For a quick introduction, see Quick Start Guide. Each script is discussed in detail in later sections.
Prerequisites#
imperia's full text search requires the following Perl modules:
- BerkeleyDB >= 4.0 (recommended version; at least version 3.0 is required)
- BerkeleyDB::Btree
- Config::General (*)
- File::Spec (*)
- File::Copy (*)
- File::Path (part of the interpreter since Perl 5.8.X)
- Tree::RedBlack (*)
The modules marked with an asterisk are included in imperia's distribution; the rest have to be installed before using the full text search. These Perl modules can be obtained for free from www.cpan.org. If one of the required modules is missing, this is indicated by a corresponding error message.
The Compress::Zlib is an optional module. This module is only needed if you want to compress the index's cache.
If you want to index and search other file types in addition to HTML and plain text files, such as PDF documents, additional parser programs must be installed.
Currently the following parser plug-ins are available:
- For Microsoft Office documents, the catdoc program, which can be obtained at http://www.wagner.pp.ru/~vitus/software/catdoc/ (state of the source: June 2013). Use version 0.94x, which is under the GPL 2 license. This is a suite of parsers, which includes the following programs:
  - DOC - parser for Word documents
  - XLS - parser for Excel documents
  - PPT - parser for PowerPoint documents
- For PDF files, you need XPDF, a parser based on a program suite called xpdf. The parser is available at the following URLs: http://www.xpdf.com/ (state of the source: June 2013) and http://www.foolabs.com/xpdf/. This includes two programs:
  - pdftotext
  - pdfinfo
Quick Start Guide#
Setting up the FTS is done in three steps:
- configuration
- indexing
- template creation.
Among other things, through the configuration it is determined whether documents should be discoverable via search. Templates are used to enter search criteria, search parameters and display search results. A search index is created when indexing. The search itself describes the way in which searched words can be linked together.
There are default values for all configurable parameters, which are used in the configuration examples. Example templates can be found in site/fts/templates. If you do not need a special configuration, follow these steps:
- Configuration: in the site/config directory you will find example configuration files for indexing (index.conf.sample) and output of the search (fts.conf.sample). Copy these files to index.conf and fts.conf.
- Indexing: run the site/bin/fts_index.pl script with parameters -b and -i to create an index.
  Note
  For large projects, this process may take some time to complete.
- Now you can enter search terms into the prepared templates by calling the fts_search.pl search script from the cgi-bin directory in your browser: http://your_server.en/cgi-bin/fts_search.pl or http://your_server.en/cgi-bin/fts_search.pl?ADV=1 for a more refined search form with additional search options.
Full Text Search Configuration#
imperia's FTS is divided into two areas - creating a search index and searching for terms in the search index, including the hits' display.
This functional division is also reflected in the configuration, which is done in two separate files. Existing search configurations from older versions of imperia can be migrated with a script. Refer to Indexing Configuration. If you want to manage multiple indexes, you can do this by defining multiple domains. This allows, for example, the use of a search configuration for multiple target systems.
Domains can be used to search different parts of databases separately. If, for example, you want to make part of the available data on a target system searchable only for your employees, the appropriate domains have to be configured. In most cases, however, it is sufficient to use the default domain setting (default).
General Syntax Conventions#
The instruction syntax for the FTS configuration is based on the syntax for configuring the Apache web server. Some instructions are written in angle brackets. These “tags” form blocks in which more tags or directives can be nested. Similar to markup languages, or the Apache configuration, there are opening and closing tags.
Example:
<option>
Directives
</option>
With some tags, a parameter is set in the opening tag. The equal sign and the quotation marks around the parameter are optional. Example:
<directory = "path/to/directory">
Directives
</directory>
or
<directory path/to/directory>
Directives
</directory>
Furthermore, there are directives used in the following way:
Directive="Value"
Example:
chmap = "UTF-8"
Again, the use of the equal sign and quotes is optional. Exceptions are cases where parameters or directives contain spaces. There, quotation marks are required to ensure that the entire expression is evaluated.
In general, however, the use of quotation marks is recommended in order to avoid misinterpretation. The above example would not work if just a simple change is made.
The following is wrong:
<directory path/to/directory >
Directives
</directory>
The space before the closing angle bracket is interpreted as part of the path, i.e. "path/to/directory " instead of "path/to/directory". In contrast, the variant works without a problem, if the path is put in quotation marks.
Correct:
<directory "path/to/directory" >
Directives
</directory>
In this case, the parser simply ignores the extra spaces. Upper- and lowercase letters are also not considered by the interpreter.
To comment out a line, put a hash sign (#) at the beginning of that line. Do not use umlauts or special characters, except dash (-) and underscore (_), in value assignments. This does not apply to regular expressions.
Regular Expressions in the Configuration#
Regular expressions in the configuration files are closely based on Perl in functionality. A variety of Perl syntax options are at your disposal, which can be written within a search pattern (between slashes). Modifiers (single line, multiline, ignore case, etc.) may not be used. In general, the defined patterns are executed with the “ignore case” option. File and directory names are a special case. Here, depending on the operating system, case sensitivity may be relevant.
Indexing Configuration (index.conf)#
The indexing is configured in the site/config/index.conf file. The configuration is read and processed by the fts_index.pl script.
Important
Changes to the indexing configuration are only active when a new index is created.
Below you can find a list of tags and directives. The individual instructions are listed respectively in the highest structural level in which they can be used.
Tags and Directives#
<domain>
The “domain” tag makes it possible to describe several independent configurations for different indexes within one configuration file. With a few exceptions, which set global settings and can therefore also be defined at the top level, note all other tags and directives within a domain section.
The use of domains is flexible. Different databases or even the same database with different settings can be indexed. For example, you can set up different domains for employees and customers. When creating a search index for customers, you can exclude areas that are only for internal business use, while the index used for the staff would include non-public areas.
There is also the option to filter by grouping information, which makes the data separation secure. In “hidden” groups, information is accessible only to authorized users by manipulating the parameter that calls the search script.
Note
The “default” domain in the index.conf.sample is an independent domain and does not serve as a collection of default values for other domains.
Syntax:
<domain= "domain_name">
[Tags and directives to configure the index for this domain]
</domain>
Example:
<domain = "secure">
#limited index; only intranet
</domain>
<domain = "web">
#general index; public web content
</domain>
<settype>
The “settype” directive defines which file types are to be indexed. The structure corresponds to a multi-valued list. The following keys are available:
- HTML: HTML files can be indexed with this plug-in.
- TXT: TXT files can be indexed with this plug-in.
- XPDF: PDF files can be indexed with this plug-in.
- DOC: DOC files can be indexed with this plug-in.
- PPT: PPT files can be indexed with this plug-in.
- XLS: XLS files can be indexed with this plug-in.
- NULL: Creates a meta index.
Syntax:
<setType>
Type = [regular expression]
</setType>
The regular expression describes file extensions.
Example:
<setType>
HTML = "\..?html?$"
TXT = "\.txt$"
</setType>
Note
The PDF, DOC, PPT and XLS document types are indexed by external parsers. This tag can also be used within a domain section.
For multilanguage documents where the file name is something like index.html.en, the search pattern must be adjusted accordingly.
<setType>
HTML = "\.html\..*$"
TXT = "\.txt$"
</setType>
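Because such patterns are easy to get wrong, it can help to try them against sample file names before a new index is created. The following Python sketch uses the re module as a stand-in for the Perl-style matching and applies the patterns case-insensitively. The helper name, the sample file names and the pattern for two-letter language suffixes are assumptions made for this illustration.

```python
import re

def file_type(filename, set_type):
    """Return the first key whose pattern matches the file name, else None."""
    for key, pattern in set_type.items():
        if re.search(pattern, filename, re.IGNORECASE):
            return key
    return None

# Patterns in the style of a setType block.
plain = {"HTML": r"\..?html?$", "TXT": r"\.txt$"}
# Hypothetical variant that also accepts a two-letter language suffix.
multilang = {"HTML": r"\.html(\.[a-z]{2})?$", "TXT": r"\.txt$"}

for name in ["index.html", "index.shtml", "notes.txt", "index.html.en"]:
    print(name, file_type(name, plain), file_type(name, multilang))
```

In this sketch, index.html.en is not recognized by the plain patterns, because they require the name to end directly after the extension, while the adjusted variant classifies it as HTML.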
<bonus>
The bonus tag influences the relevance calculation of the FTS, by giving a term extra weight, depending on its position in a document's structure. Bonuses can be awarded to certain structural elements and meta fields of a document. These are then added when determining the frequency of a term within a document.
If you have set a bonus of two for top-level categories, for example, the counter adds three to the term's frequency score every time the term appears within a top-level category (+1 for the term, +2 for the bonus).
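The arithmetic of this example can be reproduced with a small Python sketch. The function name and the way occurrences are represented are assumptions; header1 is used here merely as an illustrative element that carries the bonus of two.

```python
def weighted_frequency(occurrences, bonus):
    """Add 1 per occurrence of a term plus the bonus of the element it appears in.

    occurrences: one entry per occurrence, naming the structural element
                 (or None for plain body text).
    bonus:       mapping of element name to bonus value.
    """
    return sum(1 + bonus.get(element, 0) for element in occurrences)

bonus = {"header1": 2}
# The term appears once in a first-level heading and twice in plain text:
# (1 + 2) + 1 + 1 = 5
print(weighted_frequency(["header1", None, None], bonus))
```

Without any bonus the same term would have a frequency score of 3.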
Bonuses can be assigned for the following elements:
- title (<title></title>)
- bold (<b></b>)
- italics (<i></i>)
- header1 (<h1></h1>)
- header2 (<h2></h2>)
- header3 (<h3></h3>)
- header4 (<h4></h4>)
- header5 (<h5></h5>)
- header6 (<h6></h6>)
- keywords (imperia meta field or pageMetaComment keywords)
- description (imperia meta field or pageMetaComment description)
Important
The above selection of structural elements is fully available only in HTML documents. For PDF files, bonuses can only be awarded to the title. For Microsoft Office® documents, no bonuses can be awarded.
The possible values are integers and decimals. Use dot as decimal point.
Syntax:
<bonus>
Structural element = number
</bonus>
Example:
<bonus>
italics = 1.2
title = 2
bold = 1.3
header1 = 1.2
header2 = 1.2
header3 = 1.2
</bonus>
Domain-specific bonuses can also be defined.
<exclude>
This tag is used to exclude the contents of individual meta fields or files that contain a particular meta field from indexing.
Syntax:
<exclude>
[Directives]
</exclude>
For more information about exclude blocks, see Directives within Exclude Blocks. This tag can also be used within Domain, Directory, directoryMatch and File blocks.
<include>
This tag is used to allow indexing of files that contain a particular meta field.
Syntax:
<include>
[Directives]
</include>
For more information about include blocks, see Directives within Include Blocks. This tag can also be used within Domain, Directory, directoryMatch and File blocks.
<pageMetaComment>
This tag is used to define a meta field that will be searchable and whose contents can be output in a template. Delimit the meta field's contents from the rest of a document's text using the start and stop directives.
Syntax:
<pageMetaComment = "meta_name">
start = "<!--startcomment-->"
stop = "<!--stopcomment-->"
</pageMetaComment>
Example:
<pageMetaComment = "FSK 0">
start = "<!--start_FSK 0-->"
stop = "<!--stop_FSK 0-->"
</pageMetaComment>
This tag can also be used within Directory, directoryMatch and File blocks.
blacklist
This directive specifies one or more stop word lists, valid for all configured domains. The words listed in such a blacklist are not indexed. In order for the system to ignore stop words in searches, the directive must also be included in the search configuration (site/config/fts.conf).
Syntax:
blacklist = IDENTIFIER
IDENTIFIER specifies a stop word list. The list has to be entered in the system using the site/bin/fts_blacklist.pl script. See also Create and Manage Blacklists.
If you want to use multiple stop word lists, use the directive multiple times, once for each stop word list.
This directive can also be used inside of domain blocks, then the stop word list is valid only for the domain in question.
PDFINFO, PDFTOTEXT, XLS2CSV, CATDOC, CATPPT
If the parser program is not within the environment path, use an absolute path to the program. This tag can be used inside of domain blocks.
Tags within Domain Blocks#
<metaGroup>
This tag is used to group files based on the value of a meta field. These groups can be passed as a filter for searches in the search script (see Transfer Parameters and Variables in Search Template in the programmers manual). In a metaGroup block you define one or more meta fields with related content. Documents that have the corresponding meta fields and contents are then part of the group.
For more information on defining meta fields, see Transfer Parameters and Variables in Search Template in the programming documentation.
Syntax:
<metaGroup>
<meta meta_field_name>
[group definition(s)]
</meta>
</metaGroup>
Example:
<metaGroup>
<meta charset>
<groupname "CZECH">
match = "ISO-8859-2"
</groupname>
<groupname "RU">
match = "ISO-8859-5|KOI.+"
</groupname>
<groupname "INTL">
match = "UTF-8"
</groupname>
<groupname "WESTERN">
match = "ISO-8859-1"
</groupname>
</meta>
</metaGroup>
Meta groups can be within a directory block.
<option>
The option tag contains various directives that control indexing. For a description of each directive, see Directives' Options.
Syntax:
<option>
[Directives]
</option>
Example:
<option>
verbosityLevel = 1
followSymLinks = ON
</option>
<directory>
The directory tag is used primarily to determine the directories to be indexed. You can also use it as a container for other parameters. The indexing of documentDir, which corresponds to the DOCUMENT-ROOT of your system, is enabled by default.
Syntax:
<directory "path/to/directory">
[Directives]
</directory>
Absolute paths start with a forward slash or a drive letter followed by a colon. All other paths are evaluated relative to documentDir.
Tip
If you want to disable documentDir indexing, use the following:
<directory "">
index = 0
</directory>
Examples:
The following example is an absolute path. The path corresponds to the default value of documentDir (see Directives' Options). Alternatively, you could also refer to it with an empty string (<directory="">).
<directory /imperia/htdocs>
[Directives]
</directory>
The following example shows a relative path. The FTS directory is located directly under the documentDir directory.
<directory FTS>
[Directives]
</directory>
For a description of each directive, see Tags and Directives within Directory Blocks.
<directoryMatch>
The <directoryMatch> directive allows the use of a regular expression instead of a certain directory name.
Note
The matching of a pattern cannot span multiple path components.
Syntax:
<directoryMatch = "[regular_expression]">
[Directives]
</directoryMatch>
Example:
<directoryMatch = ".+print$">
index = 0
</directoryMatch>
By this definition, the following directories would not be indexed:
- news_print
- products_print
But the following directories are indexed:
- news_print/doc
- news_print/pdf
- products_print/doc
- products_print/pdf
- print
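The behavior in this example follows from two facts: the pattern is compared with a single path component only, and ".+" requires at least one character before "print". A minimal Python sketch of this reading, assuming the pattern is applied to the last component of a directory path and using a hypothetical helper name, looks as follows:

```python
import re

def directory_matches(path, pattern=r".+print$"):
    """Check the pattern against the last path component only."""
    name = path.rstrip("/").split("/")[-1]
    return re.search(pattern, name, re.IGNORECASE) is not None

for path in ["news_print", "products_print", "news_print/doc", "print"]:
    # True means the block applies (here: index = 0, not indexed).
    print(path, directory_matches(path))
```

The directories news_print and products_print match and are therefore not indexed, while news_print/doc and print do not match.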
<files>
Directives can be applied to specific files using this tag. Use a regular expression. The directives are applied to files whose names match the pattern.
Syntax:
<files = "regular expression">
[Directives]
</files>
Examples:
<files = "parse_this.html">
index = 2
</files>
The parse_this.html file will be indexed.
Note
The directive index=2 enables indexing. Within file blocks, the value 2 is treated like 1, yes or true. Read also Tags and Directives within Directory Blocks.
Files with the extension .text will be indexed. For an explanation of the directives used in the examples refer to Tags and Directives within Directory Blocks.
This tag can be used within directory and directoryMatch blocks.
Directives within Exclude Blocks#
fileMetaKey
Use a regular expression to define a meta field, the presence of which would exclude a document from indexing.
Syntax:
fileMetaKey = "regular expression"
Example:
<exclude>
fileMetaKey "^mod.*"
fileMetaKey "^layout_test$"
metaKey "imperia"
</exclude>
The example defines the following exclusion criteria:
- Files containing a meta field whose name begins with “mod” are not indexed.
- Files containing a meta field layout_test are not indexed.
- Meta fields whose names contain the string “imperia” are not indexed.
metaKey
Use a regular expression to define a meta field that should not be indexed.
Syntax:
metaKey = "regular expression"
Example:
metaKey = "regular_expression"
<excludeByValue>
<condition>
metaKey "^page_type$"
metaValue "^overview$"
</condition>
</excludeByValue>
The example defines the following exclusion criteria: Pages having a meta field "page_type" with value "overview" are excluded from indexing, everything else is included.
Directives within Include Blocks#
fileMetaKey
Use a regular expression to define a meta field, the presence of which would include a document in the indexing.
Syntax:
fileMetaKey = "regular expression"
Example:
<include>
fileMetaKey "^mod.*"
fileMetaKey "^layout_test$"
</include>
The example defines the following criteria:
- Files containing a meta field whose name begins with “mod” are indexed.
- Files containing a meta field layout_test are indexed.
Tags and Directives within Directory Blocks#
<pageMeta>
This tag is used to define which meta tags' and imperia meta fields' contents will be included in the indexing. Furthermore, you can define a name under which they will be referenced.
Syntax:
<pageMeta>
Name_of_imperia_meta_field = meta_alias_in_search_template
Name_of_HTML_meta_tag = meta_alias_in_search_template
</pageMeta>
Example:
In PDF files, some meta information is stored in addition to the actual content. However, since this information is not available in text format, it cannot be accessed easily. The <pageMeta> directive allows you to selectively index the meta information of a PDF document. The “pdfinfo” program provides the following meta information about a PDF document:
Title: Microsoft Word - MyTitle
Author: John Doe
Creator: PScript5.dll Version 5.2.2
Producer: Acrobat Distiller 6.0 (Windows)
CreationDate: Fri Jul 8 10:28:06 2005
ModDate: Tue Aug 9 11:18:17 2005
Tagged: no
Pages: 23
Encrypted: no
Page size: 595 x 842 pts (A4)
File size: 343800 bytes
Optimized: no
PDF version: 1.5
The following example shows how to include the title, creation date and modification date of a PDF file in the indexing:
<pagemeta>
title = dc_title
creationdate = page_time
moddate = mod_date
</pagemeta>
Indexed meta tags' and imperia meta fields' content can be output in a result template. If a document contains such meta tags and meta fields, the respective contents appear in the appropriate place in the results list.
Searches on meta fields or meta tags return results even when the content is not part of the visible text of a document, as is the case with meta tags.
This tag can be used in directoryMatch and file blocks.
<pageMetaCS>
Since all directives in the index configuration are interpreted in a case-insensitive manner, a workaround has to be used in order to index meta tags, the names of which contain uppercase letters.
The idea is to give an alias to the original name of the meta tag:
<pageMetaCS>
<alias>
originalName = "_group_test_ID"
aliasName = "_group_test_id"
</alias>
</pageMetaCS>
dirGroup
The dirGroup directive combines directories and directory contents into groups. One directory can belong to several groups. Defined groups can be used as a filter with the GROUP parameter when calling the search script (see also Search Templates in the programming documentation).
Syntax:
dirGroup = [group name(s)]
Example:
<directory ="products_de">
index = yes
dirGroup = german
</directory>
<directory ="news_de">
index = yes
dirGroup = german
</directory>
<directory = "products_en">
index= yes
dirGroup = english
</directory>
<directory = "news_en">
index= yes
dirGroup = english
</directory>
In the above example, the directories products_de and news_de belong to the german group, and the directories products_en and news_en belong to the english group. The same applies to the respective subdirectories.
Several groups are specified as a space-separated list:
<directory ="products_de">
index = yes
dirGroup = assortment german products
</directory>
This directive is also applicable within directoryMatch blocks. Note the special behavior of directoryMatch (see Tags within Domain Blocks).
Example:
<directoryMatch = ".+_de">
index = yes
dirGroup = german
</directoryMatch>
<directoryMatch = ".+_en">
index = yes
dirGroup = english
</directoryMatch>
All directories whose name ends in “_de” are part of the german group. All directories whose name ends in “_en” belong to the english group.
In contrast to the previous example, the subdirectories are not included in this case.
index
The index directive controls the indexing. You can specify whether files or directories are to be indexed. For directories, you can also control whether subdirectories should be included. The settings of parent directories can be overridden by directives for subdirectories (directory) and files (file).
To activate recursive indexing, set one of the following values:
- yes
- true
- 1
To exclude a directory or group of files from indexing, use one of the following values:
- no
- false
- 0
Note
Within directory tags “yes” applies as a default value, if no index directive is set.
If only the directory itself is to be indexed, use a value other than those listed above. In this case, no subdirectories are searched, unless otherwise specified.
Example:
index = yes-no-descend
Note
This option is not available in file blocks. There, every value that does not explicitly turn the option off turns it on.
The index directive is also applicable within directoryMatch and file blocks.
Note
Within file blocks, the index directive refers only to files, not directories in which they are located.
chmap
This directive specifies a default character set for directories. The parser evaluates this information when creating an index. Possible values are all character sets supported by the Perl module Locale::Recode.
Syntax:
chmap = "character set"
Example:
chmap= "ISO-8859-1"
This directive can also be used in directoryMatch blocks.
inherit
This directive allows the settings of a parent directory to be inherited by its subdirectories, even if the directory cannot or should not be indexed recursively. The “inherit” settings span one directory level. Directives that are set explicitly in a subdirectory and contradict the inherited settings override them.
Note
The setting for “index” cannot be inherited.
Syntax:
inherit
Example:
<directory = "news_de">
index = "0"
chmap = "ISO-8859-1"
inherit
</directory>
<directory = "news_de/economics">
index = "1"
</directory>
<directory = "news_de/politics">
index = "1"
</directory>
<directory = "news_de/sports">
index = "not-recursive"
inherit
</directory>
<directory = "news_de/weather">
index = "1"
chmap = "utf-8"
</directory>
<directory = "news_de/sports/fly_fishing">
index = "1"
</directory>
<directory = "news_de/sports/handball">
index = "1"
</directory>
In this example, the news_de directory should not be indexed. However, its content is declared to be ISO-8859-1 encoded. The “inherit” directive is set so that this setting is inherited by the subdirectories. The subdirectories news_de/economics, news_de/politics and news_de/sports therefore have the same setting for “chmap” as the parent directory.
The news_de/weather directory is an exception, as it has a different setting for “chmap”. As the news_de/sports directory has the setting index = "not-recursive", it will not be searched recursively. Because of that, an additional inherit directive has to be used so that news_de/sports/fly_fishing and news_de/sports/handball are also treated as ISO-8859-1 encoded. Without this additional inherit directive, the character set setting would not be inherited.
“Inherit” can be used in exclude, file, metaGroup, pageMetaComment and pageMeta blocks when they are within a directory block.
Directives' Options#
cacheSize
This directive specifies the number of words that are contained in a cache file. If the value is set to 0 or the directive is removed, no cache is created. Values below 100 are not recommended, since they produce very small files and, depending on the cluster size of the file system, waste hard disk space.
Syntax:
cacheSize = [number]
Example:
cacheSize = 150
cacheMapLevels
This directive specifies how the cache is saved on the file system. The default value is 6, which allows for 1.7594524e+48 files to be in the cache. If the collection is larger, the value of cacheMapLevels has to be increased accordingly.
Syntax:
cachemaplevels = [number]
Example:
cachemaplevels = 6
Important
If the value of cacheMapLevels is changed, it is mandatory to rebuild all FTS indexes.
compress
This directive is used to select the compression plug-in. To obtain a list of available plug-ins, call the site/bin/fts_index.pl script with the -t option.
Note
Changing the compression plug-in may require an adaptation of the cacheSize setting.
The default value is 0 (no compression plug-in).
cacheDir
Use this directive to specify a directory where the FTS cache file is saved.
Syntax:
cacheDir = "/absolute/path"
Example:
cacheDir = "/usr/local/share/imperia/site/fts/index/mandant1/cache"
Important
- If this variable is set, make sure that the directory is used by a single index or domain.
- However, under certain circumstances, for example when multiple domains access overlapping data sets, it may even be desirable that at least parts of the caches are shared between domains.
- If in doubt contact imperia's support.
For each domain the system automatically creates a subdirectory in which the actual cache is stored.
Important
This only applies if the default value is used.
The default value is dataDir/cache.
dataDir
This directive holds the absolute path to the location where the generated index is stored. When an index is created, a directory named after the respective domain is created below dataDir for each configured domain. The actual index is stored in this subdirectory.
Syntax:
dataDir = "/absolute/path"
Example:
dataDir = "/usr/local/share/imperia/site/fts/index/mandant1"
The default value is site/fts/index of your imperia installation. This value must be the same in the index (index.conf) and search (fts.conf) configurations.
dataTempDir
Use this directive to set a temporary working directory for index creation. Enter an absolute path as the value. For each configured domain, the temporary files for an index are created in a subdirectory named after the domain.
Syntax:
dataTempDir = "/absolute/path"
Example:
dataTempDir = "/usr/local/share/imperia/site/fts/index/tmp"
The default value is site/fts/index/tmp.
documentDir
Define an absolute path to the directory containing documents to be indexed.
Syntax:
documentDir = "/absolute/path"
Example:
documentDir = "/usr/local/share/imperia/htdocs"
The default value is imperia's document root.
followSubDir
Use this directive to enable or disable recursive indexing.
Syntax:
followSubDir = [ON|OFF]
The default value is ON.
followSymlink
This directive is used to control treatment of symbolic links.
Syntax:
followSymlink = [ON|OFF]
The default value is ON.
maxDepth
This directive controls the maximum directory depth considered during index creation.
Syntax:
maxDepth = [number]
The default value is 100 (directory levels).
memUseControl
This directive takes a numeric value that specifies the maximum amount of memory used for index creation. It affects overall performance and memory usage.
Syntax:
memUseControl = [number]
The default value is 6400000 (bytes).
verbosityLevel
This directive controls the amount of information that is issued during the index creation.
Syntax:
verbosityLevel = [0-3]
Possible values:
value | description |
---|---|
0 | “Minimal”, only the respective steps of the index creation are displayed. |
1 | “Normal”, the respective steps of the indexing and indexed files, as well as any error in creating the file list or parsing, are displayed. |
2 | “Detail”, in addition to the usual messages, omitted directories and SR files statistics are displayed. |
3 | In addition to omitted files in the directories processed, a list of currently indexed directories is displayed. |
Tags and Directives in MetaGroup Blocks#
<meta>
This tag is used to set groups for individual meta fields within MetaGroup blocks. The following meta fields are available for each document:
Field | description |
---|---|
charset | This describes the character set in which a document is originally encoded - not to be confused with imperia's internal data management character set (UTF-8). |
filename | This is the file name of the document. |
title | The document's title, provided this field has been explicitly enabled by the appropriate configuration. |
Note
For documents that were not created with imperia, for example, PDF files or Microsoft Office documents generated outside imperia, the title field is empty.
Syntax:
<meta = "MetaFieldName">
[group definition(s)]
</meta>
Groups can be defined for each meta field available in a document. Write the name of the field in the opening tag of a meta block. Then define one or more groups for the meta field. The name of a group is determined by the groupname tag.
<groupname>
This tag is used to specify the name of a meta group. Group names can be chosen freely, but cannot contain spaces or special characters, except “-” and “_”. To get dynamically named groups, use a regular expression. See also Dynamically Named Meta Groups. Within the tags, define a search pattern for meta field contents.
Syntax:
<groupname "String | regular expression">
[search pattern]
</groupname>
The search pattern is defined with the following directive.
match
Specify a regular expression that is evaluated against the appropriate meta field during indexing. If the content of the relevant meta field matches the pattern in this directive, the document is considered part of the corresponding group.
Example :
<metaGroup>
<meta charset>
<groupname "CZECH">
match = "ISO-8859-2"
</groupname>
<groupname "RU">
match = "ISO-8859-5|KOI.+"
</groupname>
<groupname "INTL">
match = "UTF-8"
</groupname>
<groupname "WESTERN">
match = "ISO-8859-1"
</groupname>
</meta>
</metaGroup>
Explanation:
- CZECH: all documents whose character set contains “ISO-8859-2” or “iso-8859-2”.
- RU: all documents whose character set contains “ISO-8859-5” or “KOI” followed by at least one further character.
- INTL: all documents whose character set contains “UTF-8” or “utf-8”.
- WESTERN: all documents whose character set contains “ISO-8859-1” or “iso-8859-1”.
Note
When evaluating regular expressions case sensitivity is generally not considered. Read also Regular Expressions in the Configuration.
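The group assignment in the example above can be pictured with a short Python sketch. It assumes that a pattern only has to occur somewhere in the field content (as the word “contains” in the explanation suggests) and that matching ignores case. The function name groups_for is hypothetical, and imperia itself evaluates Perl regular expressions, not Python ones.

```python
import re

# Group name -> pattern, copied from the metaGroup example above.
CHARSET_GROUPS = {
    "CZECH": "ISO-8859-2",
    "RU": "ISO-8859-5|KOI.+",
    "INTL": "UTF-8",
    "WESTERN": "ISO-8859-1",
}


def groups_for(charset, groups=CHARSET_GROUPS):
    """Return the names of all groups whose pattern occurs in charset."""
    return [
        name
        for name, pattern in groups.items()
        # Assumption: unanchored, case-insensitive matching.
        if re.search(pattern, charset, re.IGNORECASE)
    ]
```

Under these assumptions, groups_for("iso-8859-2") yields only CZECH, while a bare "KOI" matches nothing because the pattern KOI.+ requires at least one more character. An unanchored pattern such as ISO-8859-1 would also match ISO-8859-15; anchor the pattern with ^ and $ if that is not intended.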
Dynamically Named Meta Groups#
It is also possible to use a dynamic component in a group name. This dynamic component is derived from the content of the meta field that determines the group membership. In this way, groups are defined independently. The syntax for generating dynamic group names is based on the functionality of regular expressions in Perl:
<metaGroup>
<meta MetaFieldName>
<groupname "GroupName_${1}">
match = "(pattern1)"
match = "(pattern2)"
</groupname>
</meta>
</metaGroup>
For part of the matched text to be added as a suffix to the name of the group, the pattern must be enclosed in parentheses (grouping in regular expressions). It can then be referenced in the group name with ${1}.
Example:
<metaGroup>
<meta title>
<groupname "planet-${1}">
match = "(weather)"
match = "(forecast)"
match = "THIS IS NOT WHAT YOU WANT"
</groupname>
</meta>
</metaGroup>
Based on the contents of the title meta field, this definition specifies three cases:
- Documents containing the word "weather" in the title belong to the group "planet-weather".
- Documents containing the word "forecast" in the title belong to the group "planet-forecast".
- Documents with the phrase "THIS IS NOT WHAT YOU WANT" in the title belong to the group "planet-".
The third case is deliberately incorrect, is listed for demonstration purposes only, and does not provide the desired result. In its match directive, the parentheses are missing, so the pattern does not contribute a dynamic name. Moreover, the pattern contains spaces, which are not allowed in group names.
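The three cases can be reproduced with a small Python sketch. It is only a model of the behavior described above: the function name dynamic_group is hypothetical, the patterns are tried in the order listed, and a pattern without parentheses is assumed to contribute an empty string for ${1}.

```python
import re


def dynamic_group(name_template, patterns, field_value):
    """Build a group name from the first pattern that matches field_value.

    Returns None if no pattern matches. A pattern without a capture
    group contributes an empty string for ${1} (assumption).
    """
    for pattern in patterns:
        match = re.search(pattern, field_value, re.IGNORECASE)
        if match is None:
            continue
        captured = match.group(1) if match.re.groups >= 1 else ""
        return name_template.replace("${1}", captured)
    return None


PATTERNS = ["(weather)", "(forecast)", "THIS IS NOT WHAT YOU WANT"]

group_a = dynamic_group("planet-${1}", PATTERNS, "weather report for Monday")
group_b = dynamic_group("planet-${1}", PATTERNS, "five-day forecast")
group_c = dynamic_group("planet-${1}", PATTERNS, "THIS IS NOT WHAT YOU WANT")
```

Here group_c ends up as the incomplete name "planet-", which is exactly the unwanted result the third match line demonstrates.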
Search Configuration (fts.conf)#
The settings in fts.conf determine how the FTS searches the search index for a query. If you are using an older version of imperia's full text search, you can use the administration script fts_conf_convert.pl to convert old FTS configurations to the new FTS format. The necessary files are created automatically.
An example configuration file with the minimum requirements is available in site/config under the name fts.conf.sample. All parameters have default values, which are also used in the sample file.
Note
Thanks to the default values, the search can be used without an fts.conf file or even with an empty one. However, in this case the default configuration may not be sufficient for your purposes.
Basic Tags#
blacklist
This directive allows you to specify one or more stop word lists that are valid for all configured domains of the full text search. Stop words used as keywords in a search query are ignored, and the full text search does not index them. The directive must also be included in the indexing configuration (site/config/index.conf).
Syntax:
blacklist = IDENTIFIER
IDENTIFIER specifies a stop word list. The list has to be entered in the system using the site/bin/fts_blacklist.pl script. See also Create and Manage Blacklists.
If you want to use multiple stop word lists, use the directive multiple times, once for each stop word list.
This directive can also be used inside domain blocks. The stop word list is then valid only for the domain in question.
<domain>
The settings for searches within an index are grouped together in a block with the domain tag. The fts.conf file may contain several domain blocks. In this way, multiple sets of different search settings can be managed in one file.
These sets can refer to either a single or to several different databases. Each set, or each configured domain, requires its own search script (see also Creating Search Scripts).
All other settings are listed in a domain block.
Syntax:
<domain = "domain_name">
[Directives]
</domain>
The default value is default or an empty string.
Tags in Domain Blocks#
<option>
The option tag groups all index-related directives into a block.
Syntax:
<option>
[Directives]
</option>
For descriptions and examples of the directives in option blocks, see Directives in Option Blocks.
<lang>
Within a lang block, you define language-specific synonyms for keyword components. In this way, for example, the logical operators for combining multiple search words can be translated into different languages. The CGI parameter lang is used to pass the desired language version, which then activates the translations set in the configuration.
Syntax:
<lang "language code">
Word = translation
</lang>
Example:
<lang "DE">
or = oder
not = nicht
and = und
</lang>
This example shows the replacement of the logical operators OR, NOT and AND with their German translations.
Note
This also applies when OR, NOT and AND are configured as stop words.
<map>
The map tag is used to specify how system paths of found documents are displayed in the results. In the links of the results list, the path to the respective directory is replaced by the specified value.
Syntax:
<map "absolute/path">
to = "[URL]"
</map>
Example:
<map "/usr/local/share/imperia/htdocs">
to = "http://your_server.net"
</map>
By default, the directory set in system.conf under DOCUMENT-ROOT is replaced by ABS-DOC-ROOT.
Note
When mapping, long paths are compared before short ones.
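Why the comparison order matters becomes clearer with a small Python sketch of prefix mapping. The second mapping entry (the shop subdirectory and its URL) and the function name map_path are hypothetical and only serve the illustration; the simple prefix test is likewise an assumption, not imperia's implementation.

```python
# First entry taken from the example above; the second is hypothetical.
MAPPINGS = {
    "/usr/local/share/imperia/htdocs": "http://your_server.net",
    "/usr/local/share/imperia/htdocs/shop": "http://shop.your_server.net",
}


def map_path(path, mappings=MAPPINGS):
    """Replace the longest matching path prefix with its configured URL."""
    # Longest prefixes first, so that a more specific mapping wins.
    for prefix in sorted(mappings, key=len, reverse=True):
        if path.startswith(prefix):
            return mappings[prefix] + path[len(prefix):]
    return path  # no mapping applies


shop_url = map_path("/usr/local/share/imperia/htdocs/shop/item.html")
news_url = map_path("/usr/local/share/imperia/htdocs/news/a.html")
```

If the shorter prefix were tried first, the document in the shop directory would be mapped to http://your_server.net/shop/item.html and the more specific mapping would never take effect.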
<output>
Directives that control the output of search results are placed within an output block.
Syntax:
<output>
[Directives]
</output>
A sample configuration can be found in site/config/fts.conf.sample. The directives for controlling the output are explained in the section on Output directives.
Template Tags
The template tags specify the location of the FTS templates on a target system. More than one template may be given as an alternative. The variant is selected with a parameter passed to the search script.
If only one template is needed, but it has to be accessible through direct links, the template should be placed under the DOCUMENT ROOT of the target system to be searched.
Syntax:
<[template control tag]>
[number] = "/path/to/template"
</[template control tag]>
The number is the key by which the relevant template is referenced when the search script is called. The path can be an absolute path or a path relative to a valid cgi-bin directory.
<standardHTMLTemplate>
Inside this tag, specify templates for basic searches.
Example:
<standardHTMLTemplate>
1 = "/usr/local/share/imperia/htdocs/search/search-form.default.html"
</standardHTMLTemplate>
<advancedHTMLTemplate>
Inside this tag, specify templates for refined searches.
Example:
<advancedHTMLTemplate>
1 = "/imperia/site/fts/templates/search-form.default.html"
</advancedHTMLTemplate>
<resultHTMLPage>
Inside this tag, specify result templates.
Example:
<resultHTMLPage>
1 = "/imperia/site/fts/templates/search-form.default.html"
</resultHTMLPage>
<templateDynamicGroups> and <template>
If meta groups are defined for index creation, the search template can contain a select box that is automatically filled with the existing groups. Give the select field in the search template the name GROUP, enclosed in double quotes. Use the <templateDynamicGroups> tag to define which groups are available as options in which templates, and under what names.
Syntax:
<templateDynamicGroups>
<template [template number]>
groupname = "text for option"
</template>
</templateDynamicGroups>
The <template> tag determines in which search template the options are generated automatically. Replace the placeholder [template number] with the number of the template as specified in the template tags (see above). As the meta group name, enter the name used in the index configuration. The text for the option appears in the select box of the search template. Here is an example:
Excerpt from fts.conf:
<templateDynamicGroups>
<template 1>
newsFb = "Football News"
newsHb = "Handball News"
newsEh = "Hockey News"
_all_ = "Results from all groups"
</template>
</templateDynamicGroups>
A special feature is the group name _all_. With this option, imperia considers all existing groups.
In the corresponding search template, the following select field is defined:
<select name="GROUP" multiple="multiple">
<option>Dummy</option>
</select>
When a search template is called using the search script from the cgi-bin, imperia replaces the dummy option with the groups from the fts.conf file. The search template then contains the following code:
<select name="GROUP" multiple="multiple">
<option value="newsFb">Football News</option>
<option value="newsHb">Handball News</option>
<option value="newsEh">Hockey News</option>
<option value="_all_">Results from all groups</option>
</select>
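The replacement of the dummy option can be modeled with a few lines of Python. This is a sketch of the idea only: the function name render_group_options is hypothetical, and the HTML escaping is an addition for safety that the documentation does not mention.

```python
from html import escape

# Group name -> option text, as in the fts.conf excerpt above.
TEMPLATE_1_GROUPS = {
    "newsFb": "Football News",
    "newsHb": "Handball News",
    "newsEh": "Hockey News",
    "_all_": "Results from all groups",
}


def render_group_options(groups):
    """Render one <option> element per configured group, in config order."""
    return "\n".join(
        f'<option value="{escape(name)}">{escape(label)}</option>'
        for name, label in groups.items()
    )


options_html = render_group_options(TEMPLATE_1_GROUPS)
```

The option value is the group name exactly as written in the configuration, which is why the spelling in fts.conf and in the index configuration has to agree.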
Directives in Option Blocks#
The following directives must be placed within an option block.
verbosityLevel
This directive controls the level of detail of the information that is output at the end of a search in the index. The default value is 1.
Note
At the moment, the messages displayed do not differ between levels.
dataDir
This directive specifies the directory where the index is located.
Important
The values in index.conf and fts.conf must be identical.
Syntax:
dataDir = "/absolute/path"
The system automatically creates a subdirectory of the directory specified with dataDir for each configured domain.
Example:
dataDir = "/usr/local/share/imperia/site/fts/index/mandant1"
The default value is the site/fts/index directory.
showContext
This directive defines the number of words of context displayed for each hit. If the value is set to 0, the context display is disabled.
Syntax:
showContext = [number]
Example:
showContext = 100
The default value is 100.
percentPrecision
This directive specifies the number of decimal places used when displaying the value of the template variable PAGE_PERCENT. The default value is 0.
phpEnable
Enable this directive if PHP code has to be processed in a search template before it is sent to the browser.
Syntax:
phpEnable = [Value]
Values that enable PHP processing:
- yes
- true
- 1
Values that disable PHP processing:
- no
- false
- 0
Example:
phpEnable = "yes"
By default, PHP processing is disabled.
phpPath
This directive specifies the path to the PHP interpreter, if not specified in the environment path.
Syntax:
phpPath = "absolute/path/to/interpreter/Binary"
Example:
phpPath = "/usr/local/bin/php"
The default value is php.
queryMode
Normally, for performance reasons, only one wildcard may be used in a search and the search must consist of only one word. To change this default behavior, use the queryMode directive. In addition, you can activate a search mode that allows searching meta groups without specifying a search word. The following table describes the available settings.
Parameter | Explanation |
---|---|
strict | Only one wildcard allowed within a search query (default). |
loose | Allows having several search terms with a wildcard. |
extraloose | Allows having two or more wildcards in one search term. |
partial | Search terms are treated as if they had wildcards at both ends of the string. |
metaonly | Allows to search for meta groups without specifying search words. The search term can then consist solely of meta group names. As a result, all documents, belonging to the meta group in question, are displayed. |
minValidCharacters
This directive specifies how many characters, not counting wildcards, a search term must contain to be valid. The default value, 0, allows a search term that consists of a wildcard alone.
highlight
When the context display is activated, search keywords in the text can be highlighted (highlight = 1).
The full text search normally highlights the whole word, even if only part of the word actually appears in the search term, for example when a wildcard is used. To highlight these parts separately, set the highlight directive to partial. The highlighting is defined with CSS classes in the default FTS style sheet. If you want to use your own style sheet, use the following classes for highlight formatting:
- highlight
- highlight_partial
- highlight_meta
additionalUserVars
Use this directive to extend the list of CGI parameters in automatically generated self-referential calls of the search. These additional parameters are used, for example, when the search script is called from the automatically generated links in result lists.
Syntax:
additionalUserVars = "comma-separated list"
Example:
additionalUserVars = "var1,var2,var3"
groupBool
In the index configuration, you can use the metaGroup and dirGroup directives to define groups that act as filter criteria for search queries. They are passed to the search script with the GROUP parameter (see also Transfer Parameters and Variables in Search Template in the programming documentation).
The Boolean logic of these filter groups is controlled with the groupBool directive. With AND as the value, a result must belong to all specified groups. With any other value, a result only has to belong to one of the specified groups.
Syntax:
groupBool = [VALUE]
Examples:
groupBool = "AND"
groupBool = "and"
These are examples of logical AND operations. Results must belong to all groups specified with GROUP.
groupBool = "OR"
groupBool = "any"
These are examples of logical OR operations. Results only have to belong to one of the groups specified with GROUP.
This filtering is active only if more than one group is passed as a filter criterion with the GROUP parameter when calling the script. The default value is AND.
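The difference between the two modes corresponds to a subset test versus an intersection test. The following Python sketch illustrates this; the function name passes_group_filter is hypothetical, and treating the value case-insensitively follows the "AND"/"and" examples above.

```python
def passes_group_filter(doc_groups, requested_groups, group_bool="AND"):
    """Decide whether a document passes the GROUP filter.

    doc_groups:       groups the document belongs to
    requested_groups: groups passed with the GROUP parameter
    group_bool:       value of the groupBool directive
    """
    doc_groups = set(doc_groups)
    requested_groups = set(requested_groups)
    if group_bool.upper() == "AND":
        # The document must belong to every requested group.
        return requested_groups <= doc_groups
    # Any other value: belonging to one requested group is enough.
    return bool(requested_groups & doc_groups)
```

For a document that belongs only to the german group and a request for german and english, the AND mode rejects the document, while a value such as "OR" or "any" accepts it.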
defaultOperator
This directive determines the default logical operator that links several search words if the user does not specify an operator. The default value is OR.
resultFilterPlugin
This directive activates filters that process results before output. Result filter plug-ins must be located in site/modules/core/Dynamic/FTSResultFilter.
Syntax:
resultFilterPlugIn = PlugIn1,PlugIn2,PlugInN
Several plug-ins are specified as a comma-separated list without spaces. The list also defines the order of execution: the FTS applies the first filter listed to the search results first, then the second, and so on. The output is the filtered result.
Three result filters are available:
- DateBonus: This plug-in sorts hits according to date-based bonus criteria. The date-based bonuses awarded are configured with a number of other directives. Read also DateBonus: Sorting Hits with Date-based Bonuses.
- UserSort: This filter determines the sort order of the results list. The default is to sort by descending relevance. Alternative sorting methods are selected with the CGI parameter SORT, which is described in Transfer Parameters and Variables in Search Template in the programming documentation. Use the sortBy directive to determine the default sorting method. Read also UserSort: Determining the Hits' Order.
  Note: How imperia calculates the relevance of a document is described in Calculating Relevance.
- XpdfConvert: This filter is useful when you index PDF documents and want to display the creation or last modification date on the results page. The pdfinfo program, part of the XPDF package, provides both dates as a string in English date format. The exact format may vary depending on the software used to generate the PDF document. The XpdfConvert filter converts these date format variants to the DD.MM.YYYY hh:mm format.
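To illustrate the kind of conversion XpdfConvert performs, the following Python sketch converts one pdfinfo date variant, the one shown in the pageMeta example, to DD.MM.YYYY hh:mm. The function name convert_pdfinfo_date is hypothetical, and the sketch handles only this single input format, whereas the filter itself is described as handling several variants.

```python
from datetime import datetime


def convert_pdfinfo_date(value):
    """Convert e.g. 'Fri Jul 8 10:28:06 2005' to '08.07.2005 10:28'.

    Assumes the default (English) locale for day and month names.
    Raises ValueError for any other date format.
    """
    parsed = datetime.strptime(value, "%a %b %d %H:%M:%S %Y")
    return parsed.strftime("%d.%m.%Y %H:%M")


creation = convert_pdfinfo_date("Fri Jul 8 10:28:06 2005")
modified = convert_pdfinfo_date("Tue Aug 9 11:18:17 2005")
```

With the values from the pdfinfo listing, creation becomes 08.07.2005 10:28 and modified becomes 09.08.2005 11:18.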
Directives in Parser Option Blocks#
Parser options are a sub-level of the regular options. A special feature is that they are passed directly to all parser plug-ins. As a result, an extension to the configuration is possible, if new plug-ins have to be added.
Example:
<option>
dataDir = "/absolute/path"
<parseroption>
ignoreMeta = "title|author"
</parseroption>
</option>
ignoreMeta
This option is evaluated by the HTML parser. It makes it possible to ignore the content of metadata in the HTML header. This is particularly important when, as in the following example, both a meta tag “title” and an HTML title are present.
<html>
<head>
<title>Page Title</title>
<meta name="title" content="Page Title">
</head>
If ignoreMeta="title" is not used, the title will be indexed twice and displayed twice in a row in the search results.
The value of the field is evaluated as a regular expression.
Example:
ignoreMeta = "title|author"
keepphptag
This option is evaluated by the HTML parser. Without this option, the HTML parser removes all <? ... ?> sections before the rest of the file is processed.
Example:
<option>
<parseroption>
keepphptag = 1
</parseroption>
</option>
process_ssi
If the option process_ssi is set to 1, SSI includes in documents are indexed. By default, this option is turned off, which means that parts of websites included via SSI are not indexed.
Example:
<option>
<parseroption>
process_ssi = 1
</parseroption>
</option>
Directives in Output Blocks#
CURRENT_PAGE
This directive allows you to specify which results page is displayed first.
Syntax:
CURRENT_PAGE = [number]
Example:
CURRENT_PAGE = 1
The default value is 1.
MAX_PAGE_LIST
This specifies the maximum number of references to other result pages that are displayed per page.
Syntax:
MAX_PAGE_LIST = [number]
Example:
MAX_PAGE_LIST = 100
The default value is 100.
MAX_PAGES
This indicates the maximum number of generated result pages.
Syntax:
MAX_PAGES = [number]
Example:
MAX_PAGES = 100
The default value is 100.
PER_PAGE
This indicates the number of displayed results per page.
Syntax:
PER_PAGE = [number]
Example:
PER_PAGE = 10
The default value is 10.
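How PER_PAGE and MAX_PAGES interact can be sketched with a little arithmetic. The following Python example assumes that MAX_PAGES simply caps the number of result pages, so that hits beyond PER_PAGE × MAX_PAGES are not paged. This reading of the directives and the function name page_count are assumptions, not documented behavior.

```python
import math


def page_count(total_hits, per_page=10, max_pages=100):
    """Number of result pages, capped at max_pages (assumed semantics)."""
    return min(math.ceil(total_hits / per_page), max_pages)


few = page_count(95)      # 95 hits, 10 per page
many = page_count(5000)   # more hits than 10 * 100 can show
```

With the default values, 95 hits produce 10 pages, and 5000 hits are limited to 100 pages, that is, at most 1000 displayed hits under this assumption.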
UserSort: Determining the Hits' Order#
The result filter UserSort controls the sorting of search results. The sorting method for a particular search is set with the CGI parameter SORT (see Transfer Parameters and Variables in Search Template in the programming documentation). The default sorting method is determined by the sortBy directive described below.
sortBy
This directive controls the default sorting method of the hit list. The following table shows the available sorting methods:
Parameter | Explanation |
---|---|
arelevance | ascending sort by relevance |
drelevance | descending order by relevance (default) |
amodified | ascending order by modification date |
dmodified | descending order by modification date |
asize | ascending sort by size |
dsize | descending sort by size |
awords | ascending sort by word count |
dwords | descending sort by word count |
disableUserSort
Set this directive to 1 to restrict custom sort options.
DateBonus: Sorting Hits with Date-based Bonuses#
imperia offers a result filter that adds relevance bonuses based on how recent a document is. If the filter is activated, a maximum bonus, an interval and a step size have to be defined for the date-based bonus. The most recent document in the hit list gets the maximum bonus. As documents get older, the bonus decreases by the specified step size for each defined interval.
Note
Note that the DateBonus plug-in has to be listed in the resultFilterPlugin directive; see Directives in Option Blocks.
bonus_date_max
Use the bonus_date_max directive to define the relevance bonus for the most recent document in the hit list. Specify a value between 1 and 100.
Example:
bonus_date_max = 50
bonus_date_step
Use this directive to specify by how much the bonus should decrease at the expiration of each interval. Possible values are between 1 and 100.
Example:
bonus_date_step = 10
bonus_date_period
This directive specifies the interval in which the relevance bonus decreases. Enter a number and an optional time unit. Time units have a short and a long notation, listed in the following table.
short notation | long notation | description |
---|---|---|
mi | minutes | minutes |
d | days | days |
w | weeks | weeks |
mo | months | months |
y | years | years |
If no time unit is specified, the FTS automatically selects months. Together, these three directives determine how old a document can be and still receive a date-based relevance bonus. Example:
bonus_date_max = 50
bonus_date_step = 10
bonus_date_period = 1 d
In this example, documents that are up to one day old get a relevance bonus of 50, which corresponds to the maximum value. Documents that are between one and two days old receive a bonus of 40 (50 - 10). With each passing day, the bonus is reduced by a further 10 until it is depleted. Documents that are older than 5 days get no bonus.
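The arithmetic of this example can be written down as a short Python sketch. It follows the worked example, in which the bonus depends on the absolute age of a document. The function name date_bonus and the handling of the interval boundaries (a document that is exactly one day old still counts as being in the first interval) are assumptions.

```python
import math


def date_bonus(age_days, bonus_max=50, step=10, period_days=1.0):
    """Date-based relevance bonus for a document of the given age.

    bonus_max, step and period_days stand for bonus_date_max,
    bonus_date_step and bonus_date_period (converted to days).
    """
    # Number of fully elapsed intervals before the current one.
    elapsed = max(0, math.ceil(age_days / period_days) - 1)
    return max(0, bonus_max - step * elapsed)
```

With the values from the example, the function returns 50 for ages up to one day, 40 for ages between one and two days, 10 for ages between four and five days, and 0 for anything older than five days.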
bonus_date_field
By default, the full text search uses the meta field __imperia_modified to determine the last revision date of a document and calculate the date-based bonus. However, any meta field that contains a date in the form of a UNIX timestamp can be used as a basis for the calculation. This meta field is set with the bonus_date_field directive. The following options are available:
Parameter | Explanation |
---|---|
file_modified | Date of last revision in imperia (default) |
fs_file_modified | Modified date on the file system level |
METAFIELDNAME | Any meta field that contains a Unix timestamp. For this field to be indexed, a PageMeta directive has to be defined for it in the index configuration (see Tags and Directives within Directory Blocks). |
Keep in mind that the modification date on the file system level also changes when, for example, the file is reparsed. A new file modification date therefore does not necessarily mean that the contents of the file have changed.
Creating an Index#
The index is created by the fts_index.pl script, located in site/bin. Call the script from the command line. The following call parameters are available:
Parameter | Meaning |
---|---|
-h, --help | This parameter displays the help page. |
-i, --install | Installs a previously created index. The associated files are copied from a temporary directory to the directory specified with the dataDir directive. |
-b, --build | Triggers the index creation for the specified domains. |
-t, --test | Starts testing the index configuration and returns a list of available compression plug-ins. |
The domains to be indexed may also be specified in the call. Each listed domain has to be defined in a domain block in the index.conf file, located in site/config.
Example:
perl fts_index.pl -bi mandant1 mandant2 mandant3
The index script reads the configuration file and builds an index for each specified domain. If no domains are specified in the call, the script uses the default domain (default). If there is no configuration file in site/config, the script sets all parameters to their default values.
The index is generated in a temporary directory and only becomes available once the script is called with the option -i or --install.
Index creation and installation can also be performed in a single operation, as with the combined -bi options in the example above. For large data sets, indexing can take a considerable amount of time. It is recommended to rebuild the index periodically. This can be scheduled and automated, for example with a cron job.
Creating Search Scripts#
When a search is started, the search script reads the search settings for the index of the relevant domain from the fts.conf file, located in site/config. If multiple domains are configured, each of them needs its own search script.
To create a domain-specific search script, proceed as follows:
Important
The installation script for hotfixes may replace the existing version of cgi-bin/fts_search.pl on the system and install the original version of the script. Therefore, always make domain-specific adaptations in a copy of the file and use the copy instead of the original file.
- Copy the cgi-bin/fts_search.pl script.
- Open the copy in an editor and locate the following line (line 28):
my $domain = 'default';
- Change the line as follows:
my $domain = '[Domain_Name]';
Example:
my $domain = 'mandant1';
- Save the file.
Result: Search queries for this domain can now be directed to the edited copy of the search script.
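For several domains, the copy-and-edit steps can be scripted. The following Python sketch is only an illustration: the helper name `make_domain_script` and the target naming scheme `fts_search_<domain>.pl` are assumptions, and it relies on the line `my $domain = 'default';` being present exactly as shown above.

```python
import re
from pathlib import Path


def make_domain_script(original, domain):
    """Copy the search script and set its $domain variable (hypothetical helper)."""
    source = Path(original)
    # Assumed naming scheme: fts_search.pl -> fts_search_<domain>.pl
    target = source.with_name(f"{source.stem}_{domain}{source.suffix}")
    # latin-1 maps every byte to one character, so the rest of the
    # script is written back unchanged whatever its real encoding is.
    text = source.read_text(encoding="latin-1")
    pattern = re.compile(r"^(\s*my \$domain = )'default';", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}'{domain}';", text, count=1
    )
    if count != 1:
        raise ValueError("line \"my $domain = 'default';\" not found")
    # The original file is left untouched; only the copy is written.
    target.write_text(new_text, encoding="latin-1")
    return target
```

A call such as `make_domain_script("cgi-bin/fts_search.pl", "mandant1")` would leave the original script unchanged and write the adapted copy next to it, which keeps the adaptation safe from hotfix installations.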
Create and Manage Blacklists#
To create a stop word list for imperia's full text search, proceed as follows:
1. Creating a stop word list file
First, create a standard ASCII text file with any file extension and place it anywhere in imperia's directory structure.
Tip
It is also possible to manage this file with an imperia document. Note, however, that the text file generated in this case is accessible via the web server. It may therefore be necessary to restrict access to this file, for example through password protection.
In the file, each stop word has to be listed on a separate line, as shown in the following example:
is
by
at
and
in
to
on
with
the
one
Apart from the actual stop words, no further instructions are needed in the file. Once created, the list can be replaced, changed, or removed at any time.
2. Generating a stop word list entry
A stop word list file cannot be directly handled by imperia's full text search. To make a stop word list file available to the system, call the site/bin/fts_blacklist.pl script:
perl fts_blacklist.pl -b IDENTIFIER path/to/stopwordlist.extension
The IDENTIFIER, specified with the -b parameter, is used later when creating the index and activating the stop word list. If the path is entered without a leading slash, it is interpreted as relative to the script; otherwise it is interpreted as absolute. The script generates a file named IDENTIFIER in a directory parallel to the search index. This file is serialized with the Perl module Storable and can be processed by the FTS.
Further script parameters are used to manage existing stop word lists. To see the available parameters, call the script with the option -h or --help on the command line.
Example:
- perl fts_blacklist.pl -b mylist ../config/mylist.conf
This call generates a stop word list entry under the identifier mylist from the file site/config/mylist.conf. An existing stop word list entry with this identifier is replaced by the new entry.
- perl fts_blacklist.pl -a -b mylist /home/wwwrun/imperia/stopwords/extra_words.txt
This call adds the words from the stop word list file extra_words.txt, located in /home/wwwrun/imperia/stopwords, to the stop word list entry under the identifier mylist.
- perl fts_blacklist.pl -d -b mylist
This call deletes the stop word list entry that is available under the identifier mylist.
3. Stop word lists embedded in the configuration
Stop word lists are enabled by stop word entries in the indexing (site/config/index.conf) and search (site/config/fts.conf) configuration files. This is done via the blacklist directive in both files.
Syntax:
blacklist = IDENTIFIER
- The IDENTIFIER is the name of the stop word list entry that was used when site/bin/fts_blacklist.pl was called with the parameter -b.
- Use the directive multiple times if you want to use several stop word lists.
- With multiple full text search domains, the stop word lists can be applied to all or only to individual domains.
- Set the blacklist directive within a domain block to make a stop word list valid only for this domain, or outside the domain blocks in the configuration file to make it valid for all domains.
Important
When stop word lists are embedded in the indexing configuration, they take effect after the next indexing. The same applies to changes to existing, already integrated stop word lists.
However, stop word lists affect search requests immediately after their integration into the search configuration. The search ignores the stop words immediately, regardless of whether they are still in the index or not.
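The effect of a stop word list on a search request can be pictured with a small sketch. The Python code below is not imperia's implementation. The functions `load_stop_words` and `filter_query`, the whitespace tokenization, and the case-insensitive comparison are assumptions made for illustration. It reads a file in the one-word-per-line format described above and drops the listed words from a query.

```python
from pathlib import Path


def load_stop_words(path):
    """Read a stop word list file with one word per line."""
    lines = Path(path).read_text(encoding="ascii").splitlines()
    # Blank lines are skipped; words are lowercased (assumption).
    return {line.strip().lower() for line in lines if line.strip()}


def filter_query(query, stop_words):
    """Return the query terms that are not stop words."""
    return [term for term in query.split() if term.lower() not in stop_words]


if __name__ == "__main__":
    # The stop words from the example file above.
    words = {"is", "by", "at", "and", "in", "to", "on", "with", "the", "one"}
    print(filter_query("the history of search engines in Europe", words))
```

In this sketch, the query terms "the" and "in" are discarded before the search, whether or not they still occur in the index.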
Full Text Search Suggestions#
imperia's FTS may also display suggestions in the search template. Once the minimum number of characters, usually 3 or 4, has been entered, an Ajax query is sent to the internal search index and the first N results are offered for selection.
To enable suggestions, the indexing configuration (index.conf) and the search parameters (fts.conf) have to be adjusted. In addition, the search template has to be extended and, if necessary, the Ajax scripts have to be adapted for different domains.
Adjustments to the index.conf#
In the index.conf file, an additional index with actual words must be created. To do so, create a “suggest” option within the “suggest” directive of a domain.
Example:
<suggest>
rule = "^\S{3,}$" # fallback
</suggest>
The rule statement controls which terms enter the index.
- If the value does not exist or is set to false, no index is created for the suggestions. Thus, no suggestions can be made.
- If rule is set to true, all words with at least 3 characters are added to the index by default. This is internally defined by the regular expression "^\S{3,}$".
- Assigning a different regular expression controls the contents of the index.
rule = "^\S{4,}$"
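To see which terms a given rule would admit, the regular expression can be tried outside imperia. The following Python sketch only illustrates the pattern semantics. The helper `terms_for_suggest_index` is hypothetical, and the sketch assumes that the rule is applied to each word individually and that this simple pattern behaves the same in Python as in imperia.

```python
import re


def terms_for_suggest_index(words, rule=r"^\S{3,}$"):
    """Return the words that match the rule (hypothetical helper)."""
    pattern = re.compile(rule)
    return [word for word in words if pattern.search(word)]


words = ["a", "to", "the", "index", "search"]
print(terms_for_suggest_index(words))               # default rule: 3+ characters
print(terms_for_suggest_index(words, r"^\S{4,}$"))  # stricter rule: 4+ characters
```

With the default rule, "the", "index", and "search" pass; with the stricter rule, only "index" and "search" remain.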
keepbase64imagecontent
keepbase64imagecontent in the index.conf is an option for the HTML parser. With this option set to a true value, the HTML parser also tries to parse base64 image content. In all other cases, the base64 content is not parsed.
keepbase64imagecontent 0
Adjustments to the fts.conf#
The “suggest” section in the fts.conf file must be created or edited. The basic configuration is done using the following parameters:
- minChars
This parameter defines the minimum number of characters that must be entered before the script submits suggestions. Usually 3 or 4 characters are sufficient.
- frequency
This parameter controls how often the input is checked for changes. By default, the FTS uses the prototype library, where frequency is set in seconds (default is 1 second, but smaller intervals are also possible).
Another option, if you need the input to be checked at smaller intervals, is to use jQuery, where frequency is set in milliseconds (for example, 1000).
Note
For this option to work, both the fts.conf and search-advanced.default.html files have to be adjusted.
- type
The parameter type allows you to define a different library for autocompletion. By default, the FTS uses Prototype.Autocompleter. Another option is to use jQuery.Autocomplete (option jquery).
- jqalias
The parameter jqalias allows you to set an alias for the jQuery “$” namespace, so that both prototype and jQuery can be used at the same time. Default is $.
- maxSuggestions
The parameter maxSuggestions controls the number of suggestions.
- ajaxURI
The ajaxURI parameter defines which script is called from the search template.
- formname
The formname parameter holds the name of the form that contains the search box in a template.
Example:
minChars = 4
frequency = 0.5
maxSuggestions = 10
ajaxURI = "ajax_fts_suggestions.pl"
formname = "search"
jqalias = "jq"
type = "prototype"
Adjustments in a Template#
The suggest function can be added to each search box in an FTS template. To do so, extend the form with the Ajax functions, as in the following example:
<form id="search_form" action="/cgi-bin/fts_search.pl" method="post">
<!-- The following code is wrapped for representation purposes -->
<input id="search-value" type="text" name="search" value="Suchbegriff"
style="width:90%"
onFocus="if (value == 'Suchbegriff') {value =''}" onBlur="if
(value == '') {value = 'Suchbegriff'}" />
<div id="ifts_suggestions_search-value" class="ifts_suggestions"
style="display:none"></div>
<script language="JavaScript" type="text/javascript">
document.observe('dom:loaded', function () {
new Ajax.Autocompleter("search-value", "ifts_suggestions_search-value",
"/cgi-bin/ajax_fts_suggestions.pl", {
paramName: "suggest",
parameters: $($('search-value').form.id).serialize(),
minChars: 3,
frequency: 0.1,
afterUpdateElement: function () {$('search-value').form.submit()}
});
});
</script>
<input type="submit" id="submit" name="submit" value="start"
style="width:auto" />
</form>
In result pages that are evaluated by the search engine, a SUGGESTIONS tag may be added for a search field, for example one with the ID “SEARCH”. In this way, the JavaScript described above is included automatically.
<!--SUGGESTIONS:SEARCH-->
The example above can be implemented as follows:
<div class="searchextended" style="display:none;">
<h2>Advanced Search:</h2>
<fieldset>
<label class="left">with all words</label>
<input type="text" id="SEARCH_AND"
name="SEARCH_AND"
value="[!--SEARCH_REQUESTED--]"
size="30" /><br />
<!--SUGGESTIONS:SEARCH_AND-->
<label class="left">with the exact phrase</label>
<input type="text" size="30"
id="SEARCH_PHRASE"
name="SEARCH_PHRASE" /><br />
<!--SUGGESTIONS:SEARCH_PHRASE-->
<label class="left">with any of the words</label>
<input type="text" size="30"
id="SEARCH_OR"
name="SEARCH_OR" /><br />
<!--SUGGESTIONS:SEARCH_OR-->
<label class="left">without words</label>
<input type="text" size="30"
id="SEARCH_NOT"
name="SEARCH_NOT" />
<!--SUGGESTIONS:SEARCH_NOT-->
</fieldset>
</div>
Note
The code above is wrapped for representation purposes.
Important
If you want to use the <!--SUGGESTIONS:INPUT_ID--> variable, the ID of the search field may only contain alphanumeric characters and underscores.
To make the basic Ajax functions available, the following JavaScript libraries should be included in the template:
<script language="JavaScript" type="text/javascript"
src="/imperia/iwl/jscript/dist/prototype.js"></script>
<script language="JavaScript" type="text/javascript"
src="/imperia/iwl/jscript/dist/effects.js"></script>
<script language="JavaScript" type="text/javascript"
src="/imperia/iwl/jscript/dist/controls.js"></script>
Note
For this functionality to work on the live system, the files must be copied to the live system.
The source code above is wrapped for representation purposes.
Adjustments to the Ajax Script#
By default, the ajax_fts_suggestions.pl script is available. If only the “default” domain is used, no adjustments are needed.
If different domains are used, the ajax_fts_suggestions.pl script has to be copied, for example to a file named ajax_fts_suggestions_MYDOMAIN.pl. In the copy, replace the line:
my $domain = 'default';
with:
my $domain = 'MYDOMAIN';