This article discusses the way Exchange 2000 uses full-text indexing with multiple languages.
About Full-Text Indexing and Exchange 2000
Exchange 2000 can create and manage full-text indexes for fast searches. When Exchange 2000 indexes content in multiple languages, it uses filtering tools that interpret each file's format. Exchange 2000 also supports multilingual features, such as international languages and locales.
Earlier versions of Exchange Server searched every item in every folder, so the search times increased as the database size increased. With full-text indexing, every word in the mailbox or public folder store is indexed, making faster searching possible. Microsoft Outlook users can search documents in the information store as easily as they can search for e-mail messages. They can also search attachments that have the following file types:
- .doc
- .xls
- .ppt
- .html
- .htm
- .asp
- .txt
- .eml (embedded MIME messages)
Binary attachments are not indexed.
For additional information about creating and managing full-text indexes in Exchange 2000, see the Full Text Indexing topic in the Exchange 2000 product documentation.
Full-Text Indexing Components
When indexes process content that contains multiple languages, they identify locales by using a locale identifier property, Locale ID, which each document has. For languages that are supported by the Windows 2000 Index Service, there are word breakers and stemmers for the locale in the Locale ID property value.
Word breakers are language utilities that identify words in a document. During indexing, the word breaker identifies where the words are located in the sentence. There is a word-breaking module for each of the supported languages. These modules are used to index the document regardless of the locale of the Exchange 2000 server. If the Locale ID property value is not supported (Swahili, for example), a neutral word breaker is used to index the document. Word breakers are available in Dutch, English, French, German, Italian, Japanese, Korean, Spanish, Swedish, Thai, Chinese (Simplified), and Chinese (Traditional).
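The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the selection logic, not Exchange 2000 code: the function name select_word_breaker, the "Neutral" label, and the use of plain language names in place of numeric Locale ID values are all assumptions made for the example.

```python
# Languages that the article lists as having a word breaker.
SUPPORTED_WORD_BREAKERS = {
    "Dutch", "English", "French", "German", "Italian", "Japanese",
    "Korean", "Spanish", "Swedish", "Thai",
    "Chinese (Simplified)", "Chinese (Traditional)",
}

# Hypothetical label for the neutral word breaker.
NEUTRAL = "Neutral"


def select_word_breaker(document_language):
    """Return the word breaker to use for a document.

    document_language stands in for the language that the document's
    Locale ID property resolves to (an assumption; the real property
    is a numeric identifier). None means the property is not set.
    """
    if document_language in SUPPORTED_WORD_BREAKERS:
        return document_language
    # Unsupported or missing language: fall back to the neutral breaker.
    return NEUTRAL
```

For example, select_word_breaker("German") selects the German word breaker, while select_word_breaker("Swahili") falls back to the neutral one.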
Stemmers take a word and generate grammatically correct variations of that word. Each language requires its own stemmer. For example, for the word "swam," the English stemmer generates "swim," "swam," "swum," "swimming," and "swims."
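To make the effect of stemming concrete, the following Python sketch expands a word into its related forms by using a small hand-written table. A real stemmer applies language rules and dictionaries instead of a lookup table; the table contents and the name expand_word are assumptions made for this illustration.

```python
# Toy inflection table: each group holds the forms of one verb.
# A real stemmer derives these forms from language rules instead.
INFLECTION_GROUPS = [
    ["swim", "swam", "swum", "swimming", "swims"],
]


def expand_word(word):
    """Return every known form of word, or the word itself if unknown."""
    lowered = word.lower()
    for group in INFLECTION_GROUPS:
        if lowered in group:
            # Return a copy so that callers cannot modify the table.
            return list(group)
    return [lowered]
```

With this table, expand_word("swam") returns all five forms, so a search for any one form can match documents that contain the others.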
The information is then normalized and a noise filter file is applied, according to language. A noise filter file contains words that are not indexed, such as "a," "an," "and," "the," and single letters of the alphabet. For example, "the" is skipped in English-language normalization because "the" is considered a noise word in English.
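A minimal Python sketch of noise-word filtering follows. It assumes that normalization lowercases each word, which the article does not state, and it uses only the English noise words named above; the actual noise filter files contain more entries. The name remove_noise_words is hypothetical.

```python
# Noise words named in the article; real noise files are longer.
ENGLISH_NOISE_WORDS = {"a", "an", "and", "the"}


def remove_noise_words(words, noise_words=ENGLISH_NOISE_WORDS):
    """Return the lowercased words that would still be indexed."""
    kept = []
    for word in words:
        lowered = word.lower()  # assumed normalization step
        # Single letters of the alphabet are treated as noise.
        if len(lowered) == 1 and lowered.isalpha():
            continue
        if lowered in noise_words:
            continue
        kept.append(lowered)
    return kept
```

Under these assumptions, the word list "The", "cat", "and", "a", "dog" is reduced to "cat" and "dog" before indexing.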
The Language Setting of Individual Messages
Full-text indexing uses the language setting of individual messages to determine which word breaker to use. MAPI messages have a Locale ID property value that is determined by the language setting in Microsoft Office on the client computer. If full-text indexing cannot find a word breaker to match the Locale ID property value, the index uses the neutral word breaker.
If a message is created by using Distributed Authoring and Versioning (DAV), full-text indexing uses the Accept-Language header to determine which locale to use.
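The article does not describe how Exchange 2000 interprets the Accept-Language header. As an illustration only, the following Python sketch picks the preferred language tag by using the standard HTTP quality-value rules (a missing q parameter counts as 1, and q=0 means "not acceptable"). The function name preferred_language is hypothetical.

```python
def preferred_language(accept_language):
    """Return the language tag with the highest q-value, or None."""
    best_tag = None
    best_q = 0.0
    for item in accept_language.split(","):
        parts = [part.strip() for part in item.split(";")]
        tag = parts[0]
        if not tag or tag == "*":
            continue  # skip empty entries and the wildcard
        q = 1.0  # HTTP default when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as unacceptable
        # Strict comparison keeps the first tag when weights are equal.
        if q > best_q:
            best_tag, best_q = tag, q
    return best_tag
```

For the header value "de-DE,de;q=0.9,en;q=0.8", this sketch returns "de-DE".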
The Language Setting of Attachments
If an attachment is an Office document, full-text indexing uses the language setting that is used in the document; otherwise, it uses the neutral word breaker.
The Language Setting of the Exchange 2000 Server
If a message is not a MAPI message, its Locale ID property is not set. Full-text indexing uses the system locale setting for the server to determine which word breaker to use.
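Taken together, the preceding sections suggest an order of precedence for choosing the indexing locale of a message. The Python sketch below shows one reading of that order; the exact precedence inside Exchange 2000 is not documented here, so treat the ordering, the function name resolve_indexing_locale, and the parameter names as assumptions.

```python
def resolve_indexing_locale(locale_id=None, accept_language=None,
                            server_locale="en-US"):
    """Return (locale, source) for a message that is being indexed.

    locale_id: Locale ID value of a MAPI message, if set.
    accept_language: language taken from the Accept-Language header
        of a message created through DAV, if present.
    server_locale: system locale of the server (hypothetical default).
    """
    if locale_id:
        return locale_id, "message Locale ID"
    if accept_language:
        return accept_language, "Accept-Language header"
    # Neither value is available: use the server's system locale.
    return server_locale, "server system locale"
```

A message with no Locale ID and no Accept-Language value resolves to the server locale, which is why the server language matters for non-MAPI content.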
To verify the server language, follow these steps:
- Click Start, point to Settings, and then click Control Panel.
- Click Date, Time, Language, and Regional Options.
- Click Regional and Language Options, and then verify the locale and language settings.
- Click OK.
Full-text indexing works best when the query language of the client computer matches the language of the files that are being indexed. The server language is sometimes used as the query language if the language of the client computer is not known. Because of this, it is best for the server language to match the language of the majority of the documents that are on the server.
The Language Setting of the Client Computer
When the index is queried, it is assumed that the query is written in the language that is used on the client computer. Therefore, if the client computer's Locale ID property is set to German, it is assumed that all queries from that computer are written in German. If the Locale ID values of the message and the query are different, the search results are not predictable.
The language of the Exchange 2000 server is not important in this scenario.
The language of the client computer that sends the query determines which language is used. If the language is not defined, the server language is used.
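The query-side rule can be sketched in Python as follows. The sketch assumes that locales can be compared as simple strings and that a missing client language is represented by None; both function names are hypothetical.

```python
def resolve_query_locale(client_locale, server_locale):
    """Use the client's language; fall back to the server's if unknown."""
    return client_locale if client_locale else server_locale


def results_predictable(message_locale, client_locale, server_locale):
    """Return True when the query and the indexed message share a locale."""
    query_locale = resolve_query_locale(client_locale, server_locale)
    return message_locale == query_locale
```

For example, a message indexed as Japanese and queried from a Japanese client yields predictable results even if the server uses U.S. English, whereas a German message queried from a French client does not.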
Error messages that are generated by the indexing process are delivered in the default locale language for the Exchange 2000 server. Supported languages include English, French, German, Spanish, and Japanese. This language set can be extended by using Microsoft SharePoint Portal Server, which adds support for Korean, Chinese (Simplified), and Chinese (Traditional).
For additional information about SharePoint Portal Server, visit the following Microsoft Web site:
Microsoft SharePoint Technologies
Examples of Full-Text Indexing Behavior in Mixed-Language Scenarios
Client and Server Both Use U.S. Language Settings
An Outlook client who uses U.S. English language settings composes and sends an e-mail message from a computer that uses U.S. English language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the U.S. English language word breaker. The Locale ID property is set to U.S. English because of the Office settings on the client computer. Queries that are sent from the client computer are successfully processed.
Client Uses Hebrew Language Settings
An Outlook client who uses U.S. English language settings composes and sends an e-mail message from a computer that uses Hebrew language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the U.S. English word breaker. The Locale ID property is set to U.S. English because of the Office settings on the client computer. Queries that are sent from the client computer are not successfully processed because the Hebrew message does not have the correct word breaker applied.
Client Uses the Japanese Language Settings
An Outlook client who uses Japanese language settings composes and sends an e-mail message from a computer that uses U.S. English language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the Japanese word breaker. The Locale ID property is set to Japanese because of the Office settings on the client computer. Queries from the client computer are successfully processed because the query's Locale ID property and the Locale ID property that was used to index the message are both set to Japanese.