
XGEN: Full-Text Indexing with Multiple Languages



This article was previously published under Q325624



Summary

This article discusses the way Exchange 2000 uses full-text indexing with multiple languages.



About Full-Text Indexing and Exchange 2000

Exchange 2000 can create and manage full-text indexes for fast searches. When Exchange 2000 indexes content in multiple languages, it uses filtering tools that interpret each file's format. Exchange 2000 also supports multiple-language features, such as international languages and locales.

Earlier versions of Exchange Server searched every item in every folder, so search times increased as the database size increased. With full-text indexing, every word in the mailbox or public folder store is indexed, which makes faster searches possible. Microsoft Outlook users can search documents in the information store as easily as they can search for e-mail messages. They can also search attachments that have the following file types:
  • .doc
  • .xls
  • .ppt
  • .html
  • .htm
  • .asp
  • .txt
  • .eml (embedded MIME messages)
Binary attachments are not indexed.
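
As a rough illustration of the list above, the following Python sketch checks whether an attachment's file name has one of the listed extensions. The function name and the extension-only check are assumptions made for this example; the filtering tools in Exchange 2000 interpret the file's format and are not shown here.

```python
import os

# Extensions that the article lists as searchable attachment types.
INDEXED_EXTENSIONS = {
    ".doc", ".xls", ".ppt", ".html", ".htm", ".asp", ".txt", ".eml",
}


def is_indexed_attachment(filename: str) -> bool:
    """Return True if the file name ends in one of the listed extensions.

    Hypothetical helper: it only compares the extension, case-insensitively.
    """
    _, extension = os.path.splitext(filename)
    return extension.lower() in INDEXED_EXTENSIONS
```

For example, is_indexed_attachment("Report.DOC") returns True, and is_indexed_attachment("setup.exe") returns False.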

For additional information about creating and managing full-text indexes in Exchange 2000, see the Full Text Indexing topic in the Exchange 2000 product documentation.



Full-Text Indexing Components

When full-text indexing processes content that contains multiple languages, it identifies the locale of each document by using the document's locale identifier property, Locale ID. For languages that are supported by the Windows 2000 Index Service, word breakers and stemmers are available for the locale that is specified in the Locale ID property value.

Word breakers are language utilities that identify words in a document. During indexing, the word breaker identifies where the words are located in the sentence. There is a word-breaking module for each of the supported languages. These modules are used to index the document regardless of the locale of the Exchange 2000 server. If the Locale ID property value is not supported (Swahili, for example), a neutral word breaker is used to index the document. Word breakers are available in Dutch, English, French, German, Italian, Japanese, Korean, Spanish, Swedish, Thai, Chinese (Simplified), and Chinese (Traditional).
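
The fallback rule in the preceding paragraph can be sketched in Python as follows. This is a minimal sketch under stated assumptions: languages are represented by their names rather than by numeric Locale ID values, and the function name and the "neutral" label are hypothetical.

```python
from typing import Optional

# Languages that the article lists as having a word breaker.
WORD_BREAKER_LANGUAGES = {
    "Dutch", "English", "French", "German", "Italian", "Japanese",
    "Korean", "Spanish", "Swedish", "Thai",
    "Chinese (Simplified)", "Chinese (Traditional)",
}

NEUTRAL_WORD_BREAKER = "neutral"  # hypothetical label for the neutral word breaker


def select_word_breaker(document_language: Optional[str]) -> str:
    """Pick the word breaker for a document's language.

    Falls back to the neutral word breaker when the language is
    missing or is not in the supported list.
    """
    if document_language in WORD_BREAKER_LANGUAGES:
        return document_language
    return NEUTRAL_WORD_BREAKER
```

With this sketch, select_word_breaker("Japanese") returns "Japanese", and select_word_breaker("Swahili") returns "neutral". The server locale is deliberately not a parameter, because the article states that the language modules are used regardless of the locale of the Exchange 2000 server.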

Stemmers take a word and generate grammatically correct variations of that word. Each language requires its own stemmer. For example, for the word swam, the English stemmer generates swim, swam, swum, swimming, and swims.
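
To make the stemmer example concrete, the following Python sketch expands a word into its variations by using a small lookup table. The table and the function name are assumptions for illustration only; a real stemmer generates the variations from the rules of the language instead of from a fixed list.

```python
# Hypothetical lookup table that holds only the example from the article.
WORD_FAMILIES = [
    {"swim", "swam", "swum", "swimming", "swims"},
]


def expand_word(word: str) -> set:
    """Return all known variations of a word, or only the word itself."""
    lowered = word.lower()
    for family in WORD_FAMILIES:
        if lowered in family:
            return set(family)
    return {lowered}
```

A query for "swam" expands to all five forms, so a message that contains "swimming" can still match.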

The information is then normalized and a noise filter file is applied, according to language. A noise filter file contains words that are not indexed, such as "a," "an," "and," "the," and single letters of the alphabet. For example, "the" is skipped in English-language normalization because "the" is considered a noise word in English.
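
Noise-word filtering can be sketched in Python as shown below. The word list contains only the English examples that are named in the preceding paragraph, and the function name is hypothetical; the actual noise filter files contain more entries and differ by language.

```python
import string

# English noise words named in the article, plus single letters of the alphabet.
ENGLISH_NOISE_WORDS = {"a", "an", "and", "the"} | set(string.ascii_lowercase)


def remove_noise_words(words):
    """Return the words that would still be indexed after noise filtering."""
    return [word for word in words if word.lower() not in ENGLISH_NOISE_WORDS]
```

For example, remove_noise_words(["The", "server", "and", "a", "mailbox"]) returns ["server", "mailbox"].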



The Language Setting of Individual Messages

Full-text indexing uses the language setting of individual messages to determine which word breaker to use. MAPI messages have a Locale ID property value that is determined by the language setting in Microsoft Office on the client computer. If full-text indexing cannot find a word breaker to match the Locale ID property value, the index uses the neutral word breaker. If a message is created by using Distributed Authoring and Versioning (DAV), full-text indexing uses the Accept-Language header to determine which locale to use.



The Language Setting of Attachments

If an attachment is an Office document, full-text indexing uses the language setting that is used in the document; otherwise, it uses the neutral word breaker.



The Language Setting of the Exchange 2000 Server

If a message is not a MAPI message, its Locale ID property is not set. Full-text indexing uses the system locale setting for the server to determine which word breaker to use.
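
The way that full-text indexing chooses a locale for a message can be summarized in a short Python sketch. The class, the field names, and the use of language tags such as "ja-JP" in place of numeric Locale ID values are all assumptions made for this example. The order of the checks is one reading of the rules in this article, not a documented algorithm, and the sketch reads only the first entry of the Accept-Language header.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Stands in for the Locale ID property of a MAPI message (hypothetical field).
    mapi_locale: Optional[str] = None
    # Accept-Language header of a message that was created by using DAV.
    accept_language: Optional[str] = None


def index_locale(message: Message, server_locale: str) -> str:
    """Return the locale that decides which word breaker indexes the message."""
    if message.mapi_locale:
        return message.mapi_locale
    if message.accept_language:
        # Take the first language tag and drop any ";q=" weight.
        first_tag = message.accept_language.split(",")[0].split(";")[0].strip()
        if first_tag:
            return first_tag
    # No locale on the message: fall back to the server's system locale.
    return server_locale
```

For example, index_locale(Message(mapi_locale="ja-JP"), "en-US") returns "ja-JP", and index_locale(Message(), "en-US") returns "en-US". If no word breaker matches the returned locale, the neutral word breaker is used.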

To verify the server language, follow these steps:
  1. Click Start, point to Settings, and then click Control Panel.
  2. Click Date, Time, Language, and Regional Options.
  3. Click Regional and Language Options, and then verify the locale and language settings.
  4. Click OK.
Full-text indexing works best when the query language of the client computer matches the language of the files that are being indexed. The server language is sometimes used as the query language if the language of the client computer is not known. Because of this, it is best for the server language to match the language of the majority of the documents that are on the server.



The Language Setting of the Client Computer

When the index is queried, it is assumed that the query is written in the language that is used on the client computer. Therefore, if the client computer's Locale ID property is set to German, it is assumed that all queries from that computer are written in German. If the Locale ID values of the message and the query are different, the search results are not predictable.

The language of the Exchange 2000 server is not important in this scenario. The language of the client computer that sends the query determines which language is used. If the language is not defined, the server language is used.
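
The query-side rule can be sketched in the same way. The function names and the language tags are hypothetical, and the sketch models only the rule that is stated above: the client language is used when it is known, and the server language is used otherwise.

```python
from typing import Optional


def query_locale(client_locale: Optional[str], server_locale: str) -> str:
    """Return the language in which a query is assumed to be written."""
    return client_locale if client_locale else server_locale


def results_are_predictable(
    message_locale: str,
    client_locale: Optional[str],
    server_locale: str,
) -> bool:
    """Return True when the query locale matches the locale used for indexing."""
    return message_locale == query_locale(client_locale, server_locale)
```

For example, a message that was indexed as "de-DE" and queried from a client computer whose locale is "de-DE" gives a match, even if the server locale is "en-US". The same message queried from a client computer with an unknown locale is compared against "en-US", and the results are not predictable.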

Error messages that are generated by the indexing process are delivered in the default locale language for the Exchange 2000 server. Supported languages include English, French, German, Spanish, and Japanese. This language set can be extended by using Microsoft SharePoint Portal Server, which adds support for Korean, Chinese (Simplified), and Chinese (Traditional).

For additional information about SharePoint Portal Server, visit the following Microsoft Web site:

Microsoft SharePoint Technologies



Examples of Full-Text Indexing Behavior in Mixed-Language Scenarios

Client and Server Both Use U.S. English Language Settings

An Outlook client who uses U.S. English language settings composes and sends an e-mail message from a computer that uses U.S. English language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the U.S. English language word breaker. The Locale ID property is set to U.S. English because of the Office settings on the client computer. Queries that are sent from the client computer are successfully processed.

Client Uses Hebrew Language Settings

An Outlook client who uses U.S. English language settings composes and sends an e-mail message from a computer that uses Hebrew language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the U.S. English word breaker. The Locale ID property is set to U.S. English because of the Office settings on the client computer. Queries that are sent from the client computer are not successfully processed because the Hebrew message is indexed with the U.S. English word breaker, which is not the correct word breaker for its content.

Client Uses Japanese Language Settings

An Outlook client who uses Japanese language settings composes and sends an e-mail message from a computer that uses U.S. English language settings. The message is submitted to an Exchange 2000 server that runs on a Windows 2000 Server computer with U.S. English language settings. In this scenario, full-text indexing indexes the message by using the Japanese word breaker. The Locale ID property is set to Japanese because of the Office settings on the client computer. Queries from the client computer are successfully processed because the query's Locale ID property and the Locale ID property that was used to index the message are both set to Japanese.





Keywords: KB325624, kbinfo


Article Info
Article ID : 325624
Revision : 4
Created on : 2/28/2007
Published on : 2/28/2007