Work with case-level indexing options

Use the indexing options when generating indexes for use with document searches. These settings apply at the case level and override portal options set on the Settings page.

Note: As an alternative to making individual changes to the indexing settings for a case, use the Clone Settings option. The option clones ALL of the default portal settings or ALL of the settings from a selected source case, and applies them to the case. For the changes to take effect, rebuild the indexes for the case.

Important: If you make any changes to the indexing options, you must rebuild the case indexes for the changes to take effect. For more information, see Add jobs.

View indexing settings for a case

To view the default indexing options for a case:

On the Portal Home page, under Portal Management, click Cases and Servers.

On the Cases page, click the name of the case whose settings you want to view. For portal administrators, when organization security is enabled, the list of available items depends on membership in a provider or client organization. To understand how organizations are managed in Nuix Discover, see Organizations. For a summary of how organization security affects portal access for each user category, see Portal security table.

In the navigation pane, click one of the Indexing pages.

For a description of the settings on each page, see Indexing Option Settings.

Add indexing settings for a case

You can add settings for the following indexing options:

Noise Words

Thesaurus

Stemming Rules

File Type Rules

To add indexing settings for a case:

On the Portal Home page, under Portal Management, click Cases and Servers.

On the Cases page, click the name of the case for which you want to add settings. For portal administrators, when organization security is enabled, the list of available items depends on membership in a provider or client organization. To understand how organizations are managed in Nuix Discover, see Organizations. For a summary of how organization security affects portal access for each user category, see Portal security table.

In the navigation pane, click one of the Indexing pages.

Click Add.

In the dialog box that appears, type the information for the new setting.

Note: For a description of the settings, see Indexing Option Settings.

Click Save.

Note: For Thesaurus entries, click the name in the Entries column to add synonyms for that entry.

Edit indexing settings for a case

You can edit settings for the following indexing options:

Options

Alpha Standard

Alpha Extended

File Type Rules

To edit indexing settings for a case:

On the Portal Home page, under Portal Management, click Cases and Servers.

On the Cases page, click the name of the case whose settings you want to edit. For portal administrators, when organization security is enabled, the list of available items depends on membership in a provider or client organization. To understand how organizations are managed in Nuix Discover, see Organizations. For a summary of how organization security affects portal access for each user category, see Portal security table.

In the navigation pane, click one of the Indexing pages.

Edit the settings as follows:

For Options, edit the settings, and then click Save.

For Alpha Standard and Alpha Extended, select a character and click Edit. In the Edit characters dialog box, select a Treat as option and click Save.

For File Type Rules, click the Name of a rule and, on the Properties page, modify the settings. Then click Save.

Note: For a description of the settings, see Indexing Option Settings.

Click Save.

Delete indexing settings for a case

You can delete settings for the following indexing options:

Noise Words

Thesaurus

Stemming Rules

File Type Rules

To delete indexing settings for a case:

On the Portal Home page, under Portal Management, click Cases and Servers.

On the Cases page, click the name of the case whose settings you want to delete. For portal administrators, when organization security is enabled, the list of available items depends on membership in a provider or client organization. To understand how organizations are managed in Nuix Discover, see Organizations. For a summary of how organization security affects portal access for each user category, see Portal security table.

In the navigation pane, click one of the Indexing pages.

Select the check box next to the setting to delete.

Click Delete, and then click OK in the confirmation message.

Note: For a description of the settings, see Indexing Option Settings.

Indexing Option Settings

The following table describes the options used to configure the index-based search engine for concept analysis.

The portal options apply to each new case by default. Use the case-level indexing options to change the options for a specific case. Use the Clone Settings option to clone the default portal indexing settings or the indexing settings from a selected source case and apply them to another case. For more information, see Clone Settings.

Each option below is listed in three parts: the option name, a description of its settings, and its default value.

Document threshold

Type the maximum number of documents for an index. If the number of documents is higher than the maximum, the application creates additional indexes.

Default: 800000
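As a rough illustration of the document threshold, the following Python sketch estimates how many indexes a case needs. It assumes that each index is filled up to the threshold before another one is created; the source only states that additional indexes are created, so the function name `index_count` and the even-fill model are illustrative assumptions, not the application's actual logic.

```python
import math


def index_count(document_count: int, threshold: int = 800_000) -> int:
    """Estimate the number of indexes for a case.

    Assumption: every index holds at most `threshold` documents and
    indexes are filled one after another.
    """
    if threshold <= 0:
        raise ValueError("threshold must be a positive number")
    # A case always has at least one index, even when it is empty.
    return max(1, math.ceil(document_count / threshold))


# 2,000,000 documents with the default threshold of 800,000
estimated = index_count(2_000_000)
```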

Exclusion filter

Do not include files with these extensions in the index. To add or delete extensions to filter, type the extension in the text box. Separate each extension with a space. Use an asterisk as a wildcard for the file name. You can also specify a file name and use the asterisk as a wildcard for the extension, as in the default value DRVSPACE.*.

Default: *.7Z *.accdb *.ace *.ai *.aif *.aiff *.arc *.arj *.avi *.bag *.bin *.bmp *.bz2 *.cab *.chi *.chm *.class *.com *.db *.dll *.dwc *.eps *.exe *.exp *.gif *.gzip *.ha *.hlp *.hqx *.hyp *.idb *.ilk *.iso *.ivi *.ivt *.ix *.jar *.jpeg *.jpg *.lbr *.lbx *.lib *.lnk *.lzh *.mar *.mdb *.mdbx *.mdi *.mov *.mp3 *.mpe *.mpeg *.mpg *.mpq *.ms *.msi *.nls *.obj *.ocx *.opt *.orig *.pch *.pdb *.pea *.psd *.psp *.pst *.q *.qic *.ram *.rar *.res *.rm *.rmi *.rpm *.sfx *.snd *.snp *.sqz *.swf *.sys *.tar *.td0 *.tif *.tiff *.tlh *.tmp *.trg *.ttf *.vbx *.wav *.wma *.wmv *.word *.wpg *.xcr *.xfd *.zip *.zlib *.zoo DRVSPACE.*
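To see how a space-separated exclusion filter could be evaluated, consider the following Python sketch. It uses the standard `fnmatch` module and assumes that matching is case-insensitive; the source does not state how the application compares case, and the helper names `parse_filter` and `is_excluded` are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_filter(value: str) -> list[str]:
    """Split the text-box value into individual patterns (space-separated)."""
    return value.split()


def is_excluded(file_name: str, patterns: list[str]) -> bool:
    """Return True if the file name matches any exclusion pattern.

    Assumption: matching ignores case, so both sides are lowercased.
    """
    name = file_name.lower()
    return any(fnmatchcase(name, pattern.lower()) for pattern in patterns)


# A shortened version of the default value, for illustration only.
patterns = parse_filter("*.7Z *.exe *.zip DRVSPACE.*")
```

In this sketch, `*.exe` excludes any file with that extension, and `DRVSPACE.*` excludes a specific file name with any extension.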

Binary files

The search engine does not index unrecognized binary files as documents. Examples of binary files include executable programs, fragments of documents recovered through an undelete process, or blocks of unallocated or recovered data obtained through computer forensics. Content in these files is stored in a variety of formats, such as plain text, Unicode text, or fragments of .doc or .xls files. A binary file can contain many different fragments with different encoding. Indexing such a file as if it were a simple text file would miss most of the content.

Select from one of the following options:

Index binary files: Index binary files as plain text.

Index and skip binary files: Do not index binary files.

Filter binary files: Index text in binary files.

Filter binary Unicode: Index Unicode in binary files.

Note: The file name extension does not always indicate the type of data in the file. Users can change the extension to, for example, .doc in an attempt to force the application to index a binary file. However, a check of the file header determines that the file is binary. To index the content of all files, select an appropriate option such as Index binary files.

Default: Index and skip binary files

Filtering

Additional filter options include:

Extract blocks as HTML: Add comments identifying the original location of each sequence of text that was filtered from the original data.

Note: Each binary file is first divided into blocks, and the text is extracted from each block. For example, if you specify a block size of 100 KB, then a 1,000 KB file is indexed as 10 separate blocks.

Filter all documents: Use filtering to index all documents.

Filter failed documents: Instead of skipping the indexing of corrupt or encrypted documents, attempt to extract text from the documents.

Overlap blocks: Prevent text that crosses a block boundary from being missed in the filtering process.

Filter failed documents

This option is not selected by default.
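The block arithmetic in the note above, and the reason for the Overlap blocks option, can be sketched in Python as follows. The function `split_blocks` and the `overlap` size are illustrative assumptions; the source does not say how large the overlap is or how the application implements it.

```python
def split_blocks(data: bytes, block_size: int, overlap: int = 0) -> list[bytes]:
    """Divide binary data into blocks of at most `block_size` bytes.

    With overlap > 0, each block starts `overlap` bytes before the end of
    the previous block, so text that crosses a boundary appears whole in
    at least one block (provided it is no longer than the overlap).
    """
    if block_size <= 0 or not 0 <= overlap < block_size:
        raise ValueError("need block_size > 0 and 0 <= overlap < block_size")
    step = block_size - overlap
    blocks = []
    start = 0
    while start < len(data):
        blocks.append(data[start:start + block_size])
        if start + block_size >= len(data):
            break  # this block already reaches the end of the data
        start += step
    return blocks


# A 1,000 KB file with a 100 KB block size yields 10 blocks.
kb = 1024
block_total = len(split_blocks(bytes(1000 * kb), 100 * kb))

# A word that straddles the boundary between two 100-byte blocks.
sample = b"a" * 98 + b"secret" + b"b" * 96
```

Without overlap, the word `secret` in `sample` is cut in two at byte 100. With an overlap of 10 bytes, the second block starts at byte 90 and contains the whole word.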

Word auto break

A word break determines where the indexer splits a sequence of characters into separate words.

CJK: Insert a word break around Chinese, Japanese, and Korean Unicode ranges. Enabling CJK increases the size of the index.

By case: Insert a word break where an uppercase letter follows a lowercase letter. For example, myDocument is indexed as my document.

By length: Set the maximum length of a word by inserting word breaks into long sequences of letters.

On digit: Insert a word break where a digit follows a letter. For example, C3PO is indexed as C 3 PO.

Overlap words: When the By length option causes a word break, overlap the two words to prevent the indexer from missing broken words. For example, if the By length option is set to 10, the word internationalization is split into internatio and nalization. The word internationalization is also indexed.

No default
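The word break options can be approximated with the following Python sketch. The function names are hypothetical. To reproduce the C3PO example, `break_on_digit` breaks at every boundary between a letter and a digit, in both directions; this is an assumption. Likewise, `overlap_words` here only adds the unbroken word to the output, because the source does not describe the overlap in more detail.

```python
import re


def break_by_case(word: str) -> list[str]:
    """Break where an uppercase letter follows a lowercase letter."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()


def break_on_digit(word: str) -> list[str]:
    """Break at boundaries between letters and digits (assumed both ways)."""
    pattern = r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])"
    return re.sub(pattern, " ", word).split()


def break_by_length(word: str, max_len: int, overlap_words: bool = False) -> list[str]:
    """Split a long word into pieces of at most `max_len` characters."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(word) <= max_len:
        return [word]
    parts = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    if overlap_words:
        parts.append(word)  # assumption: the whole word is indexed as well
    return parts
```

For example, `break_by_length("internationalization", 10)` produces two ten-character pieces, and passing `overlap_words=True` also keeps the full word.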

Max File Size To Index

Maximum file size for content files to be indexed. Files with a file size above this limit will have the file indexing status of “Excluded by Size.”

Default: 20971520 bytes (20 MB) for existing cases. A value that is already set in this field does not change on upgrade.

For new cases, the default value is 104857600 bytes (100 MB).

Entries to skip

To skip entries while indexing, select any of the following options:

Numeric values: Skips numeric values. Selecting this option prevents users from searching on numeric ranges such as dates and credit card numbers.

If this option is not selected: A content search for a numeric range, for example (10~~20), returns documents containing numbers from 10 through 20, inclusive.

If this option is selected: You cannot perform a content search for a numeric range, for example (10~~20), but you can search for documents that include specific numbers, such as 10 and 20.

File name field: Skips character strings in a file name. For example, if a file is named Secondary.doc, and the word Secondary appears nowhere in the document, then a content search for the word Secondary does not return this document.

File name field path: This option is related to the File name field option. When you select File name field, File name field path is automatically selected and disabled. After File name field is cleared, File name field path is enabled and selected.

If you select File name field, the character strings in the path for a file are not indexed. For example, a full path and file name for a document is \\SERVER\CSMAgentData\images\MyCase\RootLevel\Appraisals\Secondary.doc. If the word Appraisals appears nowhere in the document, a content search for the word Appraisals does not return this document.

Document properties: Skips document metadata. For example, if a document has Acme Incorporated listed as the company in the document properties, and the word Acme appears nowhere in the document, a content search for the word Acme does not return this document.

Default: File name field, File name field path, and Document properties

Date recognition

Select one of the following options to disable date recognition or to enable date recognition based on the date format. Examples appear with the options.

Disable: Do not recognize dates in text as text is indexed; dates are indexed as plain text.

M-DD-YYYY

DD-M-YYYY

YYYY-M-DD

Default: Disable recognition of dates

Sensitivity

Case: Take capitalization into account when indexing words. In a case-sensitive index, APPLE, Apple, and apple are three different words. This option is not recommended because most users want to retrieve a document containing Apple in a search for apple.

Accent: Take accents into account when indexing words. Select this option with caution because you can miss documents that omit or add an accent.

No default
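The effect of the Case and Accent options can be pictured as a choice of normalization applied to each word before it is stored. The following Python sketch uses the standard `unicodedata` module; the function `index_key` and its parameters are hypothetical and only model the behavior described above.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks, for example é becomes e."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_key(word: str, case_sensitive: bool = False, accent_sensitive: bool = False) -> str:
    """Return the form of a word that a hypothetical index would store."""
    if not accent_sensitive:
        word = strip_accents(word)
    if not case_sensitive:
        word = word.casefold()
    return word


# With both options off, all three spellings share one index entry.
keys_insensitive = {index_key(w) for w in ("APPLE", "Apple", "apple")}
# With Case selected, they become three different words.
keys_sensitive = {index_key(w, case_sensitive=True) for w in ("APPLE", "Apple", "apple")}
```

With Accent selected, a search for resume would no longer match a document that contains only résumé, which is why the option calls for caution.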

Default character encoding

Plain text files, some older word processor files, and HTML files written in languages other than English use character encoding to specify the meaning of characters in the range from 128 to 255. For example, a Russian document might have CP1251 encoding, which uses these characters for Cyrillic letters.

Based on an analysis of the contents, the index-based search engine tries to detect the type of encoding. Therefore, Auto-detect is recommended. Select another option only if auto-detection does not work for documents that do not specify their encoding.

Default: Auto-detect
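The following Python sketch shows why the encoding matters for characters in the 128 to 255 range. It encodes a Russian word as CP1251 and then decodes the same bytes with two different encodings. The sample word is illustrative only and does not come from the application.

```python
# "Привет" (Russian for "hello") stored as CP1251: one byte per letter,
# all of them in the 128-255 range.
original = "Привет"
raw = original.encode("cp1251")

# Decoding with the correct encoding recovers the Cyrillic text.
as_cp1251 = raw.decode("cp1251")

# Decoding the same bytes as Latin-1 succeeds without an error,
# but yields unrelated accented Latin characters.
as_latin1 = raw.decode("latin-1")
```

Because the wrong encoding does not raise an error, a mismatch shows up only as unsearchable or unreadable text in the index.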

Hyphens

Select from the following options to determine how to handle hyphens. By default, all of the rules are applied.

Ignore: For example, index first-class as firstclass.

Treat as searchable characters: For example, index first-class as first-class.

Treat as spaces: For example, index first-class as first and class.

Apply all rules

To illustrate, when you select Apply all rules, three entries are in the index for any text that includes a hyphen. Using the example above, the text “first-class” is indexed as:

first-class

firstclass

first class

A search for any of those three terms returns the document.

Default: Apply all rules
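The hyphen rules can be summarized in a short Python sketch. The function `hyphen_index_terms` and its rule names are hypothetical labels for the four options described above.

```python
def hyphen_index_terms(text: str, rule: str = "all") -> list[str]:
    """Return the index entries produced for hyphenated text under each rule."""
    searchable = text                    # Treat as searchable characters
    ignored = text.replace("-", "")      # Ignore
    spaced = text.replace("-", " ")      # Treat as spaces
    forms = {
        "ignore": [ignored],
        "searchable": [searchable],
        "spaces": [spaced],
        "all": [searchable, ignored, spaced],  # Apply all rules
    }
    if rule not in forms:
        raise ValueError(f"unknown hyphen rule: {rule}")
    return forms[rule]


entries = hyphen_index_terms("first-class")
```

With the default rule, `entries` holds the same three forms listed above, at the cost of a larger index than any single rule would produce.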

Noise Words

A noise word is a word such as the or if that is so common that it is not useful in searches. Use the Noise Words setting to manage the list of words to ignore during indexing.

The words in the noise words list do not have to be in any order and can include wildcard characters such as * and ?. In addition, a noise word does not have to be an actual word. A noise word can consist of one or more characters, for example, ae or 123.

Note that the word and is reserved as a noise word and behaves slightly differently than many other noise words. For example, a search for 'test and car' using single quotes looks for any document that includes the words test and car. The search for 'test car' using single quotes looks for documents that include the words test and car next to each other, with no words in between. The search for "test and car" using double quotes treats the word and as a "standard" word and not as a keyword. However, since and is a noise word, the search is interpreted as "test ? car," where the question mark is any word. For this search to return a document hit, there must be a word between test and car.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

See Default Noise Words.
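As a minimal sketch of how a noise words list with wildcards might be applied during indexing, consider the following Python example. It assumes that * matches any number of characters, that ? matches exactly one character, and that matching ignores case; the source does not spell out these rules. The helper names and the pattern `19??` are hypothetical.

```python
from fnmatch import fnmatchcase


def is_noise(word: str, noise_words: list[str]) -> bool:
    """Return True if the word matches any entry in the noise words list."""
    lowered = word.lower()
    return any(fnmatchcase(lowered, entry.lower()) for entry in noise_words)


def index_terms(text: str, noise_words: list[str]) -> list[str]:
    """Keep only the words that are not noise words."""
    return [word for word in text.split() if not is_noise(word, noise_words)]


# Entries need no particular order and need not be real words.
noise_words = ["the", "if", "and", "ae", "123", "19??"]
kept = index_terms("the cat and the hat in 1999", noise_words)
```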

Alpha Standard

Using the alphabet indexing options, you can configure how to treat many common characters. A character set is a defined list of characters recognized by computer hardware and software. Each character is represented by a number. The ASCII or Standard Character Set uses the numbers 0 through 127 to represent all English characters and special control characters.

The Alpha Standard page displays the following information:

Characters: Displays the character to configure.

Codes: Displays the numeric code for the character. The alphabet settings affect only characters in the 33-127 range. The 33-127 range refers to the ASCII code, which is the most common code for text files on computers. For example, the letter A is always represented by the number 65. The first 32 characters are used as control characters.

Treat as: Displays the following configuration options for the character:

Letter: A searchable character. All of the characters in the alphabet (a-z and A-Z) and all of the digits (0-9) are classified as letters by default.

Hyphen: Hyphen characters can receive special processing. By default, only the "-" is defined as a hyphen.

Space: A character that causes a word break. For example, classifying the period as a space character processes U.S.A. as three separate words: U, S, and A.

Ignore: A character that is disregarded in processing text. For example, classifying the period as ignore instead of space processes U.S.A. as one word: USA.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

See Default Alpha Standard Options.

In most scenarios, the default behavior will suffice, although some situations may require that punctuation characters be treated as letters.

Important: Do not configure the ampersand (&) character as a letter. The ampersand character is not supported in searches.
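The difference between the Letter, Space, and Ignore classifications can be sketched with a small tokenizer in Python. The function `tokenize` and the `char_types` mapping are hypothetical, and the Hyphen classification is left out because its processing depends on the separate Hyphens option.

```python
def tokenize(text: str, char_types: dict[str, str]) -> list[str]:
    """Split text into words according to per-character classifications.

    Characters missing from `char_types` default to "letter" when they are
    alphanumeric and to "space" otherwise.
    """
    words = []
    current = []
    for ch in text:
        kind = char_types.get(ch, "letter" if ch.isalnum() else "space")
        if kind == "letter":
            current.append(ch)
        elif kind == "space":
            if current:
                words.append("".join(current))
                current = []
        elif kind != "ignore":
            raise ValueError(f"unsupported classification: {kind}")
        # "ignore": drop the character without ending the current word
    if current:
        words.append("".join(current))
    return words


period_as_space = tokenize("U.S.A.", {".": "space"})
period_ignored = tokenize("U.S.A.", {".": "ignore"})
```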

Alpha Extended

Using the alphabet indexing options, you can configure how to treat many common characters. A character set is a defined list of characters recognized by computer hardware and software. Each character is represented by a number. The Extended Character Set is based on a combination of standards: ISO 8859-1, Microsoft Windows Latin-1, and the Unicode standard.

The Alpha Extended page displays the following information:

Characters: Displays the character to configure.

Codes: Displays the numeric code for the character.

Treat as: Displays the following configuration options for the character:

Letter: A searchable character.

Ignore: A character that is disregarded in processing text.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

All characters are set to Ignore and can be set to Letter on a case-by-case basis.

For hit highlighting to work properly, if you set one of the extended characters to index as a letter, you must add a filename filter that sets *.txt files to be indexed as UTF-8.

Thesaurus

Use the thesaurus indexing options to configure how to treat synonyms during searching. The thesaurus contains entries with synonym groups. A synonym group is a group of words or phrases that the system treats as equivalent when performing a search. For example, an entry called fast that includes the words fast, quick, speedy, rapid, and immediate in the synonym group will find any of the words in the group when performing a search for the word fast.

When adding to the thesaurus, add the entry, then add synonyms to the entry.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

No default

All thesaurus entries and synonyms must contain at least one character.
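A thesaurus entry and its synonym group can be modeled as a simple mapping, as in the following Python sketch. It covers only the case described above, where a search for the entry name finds any word in the group. The data structure, the function names, and the restriction to single words are assumptions made for illustration.

```python
# One entry named "fast" with its synonym group, as in the example above.
thesaurus = {
    "fast": ["fast", "quick", "speedy", "rapid", "immediate"],
}


def expand_term(term: str, entries: dict[str, list[str]]) -> list[str]:
    """Return the synonym group for a term, or the term alone if no entry exists."""
    return entries.get(term.lower(), [term.lower()])


def matches(document: str, term: str, entries: dict[str, list[str]]) -> bool:
    """Return True if the document contains the term or one of its synonyms."""
    words = set(document.lower().split())
    return any(synonym in words for synonym in expand_term(term, entries))
```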

Stemming Rules

Stemming is the use of linguistic analysis to determine the root form of a word. In the application, stemming extends a search to cover grammatical variations on a word. For example, a stemming rule can ensure that you can search for verbs with standard suffixes, for example, a search for fish also finds fishing. A search for apply also finds applying, applies, and applied.

The Stemming Rules page displays the following information:

Minimum character count: Specifies how many characters a word must contain for a stemming rule to apply to the word. This field can be blank or contain a number between 1 and 256.

Find suffix: Specifies grammatical variations at the end of a word that you want to treat differently when searching.

Replace with: Specifies the text string you want to substitute for the suffix, for the purposes of searching only. Note that stemming rules do not replace text in the documents being indexed and searched. This field can be blank.

The ordering of the stemming rules affects the processing of the stemming rules configuration. The first applicable stemming rule in the list is applied to a word, and any matching rules following the applied rule are ignored.

You cannot reorder stemming rules. All new stemming rules are added to the end of the existing list of rules. Be sure to add stemming rules in the order of precedence for processing, with the highest precedence rules added first.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

See Default Stemming Rules.
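The first-match behavior of stemming rules can be sketched in Python as follows. The rules shown are a subset of the default rules, in their listed order. The sketch assumes that the minimum character count applies to the length of the whole word; the source does not say whether the suffix is included in the count, and the function `stem` is hypothetical.

```python
# (minimum character count, find suffix, replace with)
RULES = [
    (3, "ies", "y"),
    (3, "ing", ""),
    (4, "ness", ""),
    (0, "ss", "ss"),
    (3, "s", ""),
    (4, "ied", "y"),
    (4, "ed", ""),
]


def stem(word: str, rules: list[tuple[int, str, str]] = RULES) -> str:
    """Apply the first applicable rule and ignore all later rules."""
    for min_chars, suffix, replacement in rules:
        if len(word) >= min_chars and word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule applies


stems = {word: stem(word) for word in ("apply", "applies", "applied", "applying")}
```

In this sketch the `ss` rule precedes the `s` rule, so a word such as class keeps its ending instead of losing the final s. This is why the order in which rules are added matters.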

File Type Rules

On the File Type Rules page, manage rules that determine how to handle particular file types. The File Type Rules page displays the following information:

Name: Identifies each rule and must be unique within each case and within the rules for the default indexing template.

File type: The file type selected from the list of file types the application supports.

File name filters: The list of file name filters to be indexed as the specified file type for this rule. You can associate file name filters with one of the following supported file types:

ANSI Text

DOS Text

UTF8 Text

XyWrite

Filtered Binary

HTML

XML

WordPerfect 4.2

WordStar

MBOX Email

MIME Document

IFilter

Note: At least one entry is required. File name filters without the * wildcard character are applied literally.

Override other file type detection: If selected, this rule applies to all files with the extensions specified in the File name filters column. The override applies even if those file extensions are normally handled differently.

The ordering of the file type entries affects the processing of the file types configuration. File type entries are processed in the following order:

Entries with the option set to override all other file type detection rules are processed first, according to their order in the list of entries.

Entries without the override option set are processed next, according to their order in the list of entries.

The first entry found that applies to a given file is used. All subsequent entries are ignored for that file.

You cannot reorder file type entries. All new file type entries are added to the end of the existing list of entries. Add file type entries in the order of precedence for processing, according to the precedence rules listed above.

Note: For indexing setting changes to take effect for a case, you must rebuild the indexes for that case. For instructions, see Index and analyze documents.

The following file type rule is configured by default.

Name: CSV

File type: ANSI Text

File name filters: *.csv, *.mbx, *.mbox, *.dbf

Override other file type detection: Yes
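The processing order for file type entries can be sketched in Python as follows: override entries first, then the remaining entries, each group in list order, with the first match winning. The class `FileTypeRule`, the function `resolve_rule`, the case-insensitive matching, and the rule named "CSV as HTML" are illustrative assumptions; only the CSV rule comes from the defaults above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class FileTypeRule:
    name: str
    file_type: str
    filters: list[str]
    override: bool = False


def resolve_rule(file_name: str, rules: list[FileTypeRule]) -> Optional[FileTypeRule]:
    """Return the first applicable rule, checking override rules first."""
    ordered = [r for r in rules if r.override] + [r for r in rules if not r.override]
    name = file_name.lower()
    for rule in ordered:
        # A filter without a wildcard must equal the whole file name.
        if any(fnmatchcase(name, f.lower()) for f in rule.filters):
            return rule
    return None


rules = [
    FileTypeRule("CSV as HTML", "HTML", ["*.csv"]),  # hypothetical, no override
    FileTypeRule("CSV", "ANSI Text", ["*.csv", "*.mbx", "*.mbox", "*.dbf"], override=True),
]
chosen = resolve_rule("export.csv", rules)
```

Although the hypothetical rule appears first in the list, the default CSV rule is applied to `export.csv` because it has the override option set.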

Default Noise Words

about, after, all, also, an, and, another, any, are, as, at, be, because, been, before, being, both, but, by, came, can, come, could, did, do, each, even, for, further, furthermore, get, got, had, has, have, he, her, here, hi, him, himself, how, however, if, in, indeed, into, is, it, its, just, like, made, many, me, might, more, moreover, most, much, must, my, never, not, now, of, on, only, or, other, our, out, over, said, same, see, she, should, since, some, still, such, take, than, that, the, their, them, then, there, therefore, these, they, this, those, through, thus, too, under, up, very, was, way, we, well, were, what, when, where, which, while, who, will, with, would, you, your

Default Alpha Standard Options

Character  Character Type    Character  Character Type    Character  Character Type
!          Space             /          Space             =          Space
"          Space             0          Letter            >          Space
#          Space             1          Letter            ?          Space
$          Space             2          Letter            @          Space
%          Space             3          Letter            [          Space
&          Space             4          Letter            \          Space
'          Space             5          Letter            ]          Space
(          Space             6          Letter            ^          Space
)          Space             7          Letter            _          Space
*          Space             8          Letter            `          Space
+          Space             9          Letter            {          Space
,          Space             :          Space             |          Space
-          Hyphen            ;          Space             }          Space
.          Space             <          Space             ~          Space

Default Stemming Rules

The following stemming rules are configured by default.

Minimum character count    Find suffix    Replace with
3                          ies            y
3                          ing            (blank)
4                          ness           (blank)
0                          ss             ss
3                          s              (blank)
4                          ion            (blank)
4                          ism            (blank)
4                          ly             (blank)
3                          eed            ee
4                          ied            y
4                          ed             (blank)
3                          ed             e
4                          er             (blank)
4                          ful            (blank)
4                          able           (blank)
4                          ible           (blank)
3                          v              f
4                          e              (blank)
3                          dd             d
3                          gg             g
3                          ll             l
3                          mm             m
3                          nn             n
3                          pp             p
3                          rr             r
3                          ss             s
3                          tt             t