Class DefaultICUTokenizerConfig
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with
the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
WARNING: This API is experimental and might change in incompatible ways in the next release.
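To make the UAX#29 baseline concrete, the sketch below walks word boundaries with the JDK's java.text.BreakIterator. This is a stand-in, not the ICU iterator this class uses: the JDK implementation only approximates UAX#29 and has none of the dictionary tailorings listed above. The class name WordBreakSketch and the helper words are hypothetical names chosen for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: uses the JDK BreakIterator, not com.ibm.icu.text.BreakIterator.
class WordBreakSketch {

  /** Returns the segments between word boundaries that contain a letter or digit. */
  static List<String> words(String text) {
    BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
    it.setText(text);
    List<String> out = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String piece = text.substring(start, end);
      // Boundaries also delimit spaces and punctuation; keep word-like pieces only.
      if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
        out.add(piece);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    System.out.println(words("Hello, world 42")); // prints [Hello, world, 42]
  }
}
```

A tokenizer built on this kind of iterator additionally has to assign a type to each segment, which is the role of getType in this class.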
Field Summary
Fields
- static final String WORD_EMOJI: Token type for words that appear to be emoji sequences
- static final String WORD_HANGUL: Token type for words containing Korean hangul
- static final String WORD_HIRAGANA: Token type for words containing Japanese hiragana
- static final String WORD_IDEO: Token type for words containing ideographic characters
- static final String WORD_KATAKANA: Token type for words containing Japanese katakana
- static final String WORD_LETTER: Token type for words that contain letters
- static final String WORD_NUMBER: Token type for words that appear to be numbers
Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS
Constructor Summary
Constructors
- DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config.
Method Summary
Methods
- boolean combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script): Returns a break iterator capable of processing a given script.
- String getType(int script, int ruleStatus): Returns a token type value for a given script and BreakIterator rule status.
Field Details
WORD_IDEO
Token type for words containing ideographic characters
WORD_HIRAGANA
Token type for words containing Japanese hiragana
WORD_KATAKANA
Token type for words containing Japanese katakana
WORD_HANGUL
Token type for words containing Korean hangul
WORD_LETTER
Token type for words that contain letters
WORD_NUMBER
Token type for words that appear to be numbers
WORD_EMOJI
Token type for words that appear to be emoji sequences
Constructor Details
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the first time the class is referenced, break iterators will be initialized.
Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise text is segmented according to UAX#29 defaults. If this is true, all Han, Hiragana, and Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it will be tokenized as syllables.
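The constructor is typically used by handing the config to an ICUTokenizer. The following is a minimal sketch, assuming the Lucene ICU analysis module and ICU4J are on the classpath. The class name IcuConfigSketch and the helper tokenize are hypothetical; the exact terms and types produced depend on the Lucene and ICU versions in use, so no output is shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

// Hypothetical helper class, for illustration only.
class IcuConfigSketch {

  /** Tokenizes text with the given flags and returns "term/type" strings. */
  static List<String> tokenize(String text, boolean cjkAsWords, boolean myanmarAsWords) {
    List<String> out = new ArrayList<>();
    try (Tokenizer tokenizer =
        new ICUTokenizer(new DefaultICUTokenizerConfig(cjkAsWords, myanmarAsWords))) {
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
      tokenizer.setReader(new StringReader(text));
      tokenizer.reset(); // required before the first incrementToken()
      while (tokenizer.incrementToken()) {
        out.add(term.toString() + "/" + type.type());
      }
      tokenizer.end();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }

  public static void main(String[] args) {
    // Dictionary-based segmentation for both CJK and Myanmar text.
    for (String token : tokenize("Lucene 検索エンジン", true, true)) {
      System.out.println(token);
    }
  }
}
```

Passing false for cjkAsWords in the same helper makes it easy to compare dictionary-based segmentation with the UAX#29 defaults on the same input.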
Method Details
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by:
combineCJ in class ICUTokenizerConfig
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a break iterator capable of processing a given script.
Specified by:
getBreakIterator in class ICUTokenizerConfig
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by:
getType in class ICUTokenizerConfig