public static class Tokenizer.Builder extends TokenizerBase.Builder

Fields inherited from class TokenizerBase.Builder:
characterDefinitions, connectionCosts, doubleArrayTrie, insertedDictionary, mode, partOfSpeechFeature, penalties, readingFeature, resolver, split, tokenFactory, tokenInfoDictionary, totalFeatures, unknownDictionary, userDictionary

| Constructor and Description |
|---|
| Builder() Creates a default builder |
| Modifier and Type | Method and Description |
|---|---|
| Tokenizer | build() Creates the custom tokenizer instance |
| Tokenizer.Builder | isSplitOnNakaguro(boolean split) Splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT) |
| Tokenizer.Builder | kanjiPenalty(int lengthThreshold, int penalty) Sets a custom kanji penalty |
| protected void | loadDictionaries() |
| Tokenizer.Builder | mode(TokenizerBase.Mode mode) Sets the tokenization mode |
| Tokenizer.Builder | otherPenalty(int lengthThreshold, int penalty) Sets a custom non-kanji penalty |
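Before the method details, a minimal usage sketch may help. It assumes the Kuromoji IPADIC artifact is on the classpath, and that the enclosing Tokenizer class provides a tokenize(String) method and a Token type with getSurface(), neither of which is part of this Builder excerpt:

```java
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

import java.util.List;

public class BasicUsage {
    public static void main(String[] args) {
        // Builder() creates a default builder; build() creates the tokenizer
        Tokenizer tokenizer = new Tokenizer.Builder().build();

        // tokenize(...) and getSurface() come from the Tokenizer and Token
        // classes, which are outside this Builder excerpt
        List<Token> tokens = tokenizer.tokenize("お寿司が食べたい。");
        for (Token token : tokens) {
            System.out.println(token.getSurface());
        }
    }
}
```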
Methods inherited from class TokenizerBase.Builder:
userDictionary, userDictionary

public Tokenizer.Builder mode(TokenizerBase.Mode mode)
The tokenization mode defines how the input text is segmented. Available modes are as follows:
- TokenizerBase.Mode.NORMAL: The default mode
- TokenizerBase.Mode.SEARCH: Uses a heuristic to segment compound nouns (複合名詞) into their parts
- TokenizerBase.Mode.EXTENDED: Same as SEARCH, but emits unigram tokens for unknown terms
See kanjiPenalty and otherPenalty for how to adjust the costs used by the SEARCH and EXTENDED modes.

Parameters:
mode - tokenization mode

public Tokenizer.Builder kanjiPenalty(int lengthThreshold, int penalty)
This is an expert feature used with TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED that sets a length threshold and an additional cost applied during the Viterbi search.
The additional cost applies to kanji candidate tokens longer than the specified length threshold.
You would usually not need to change this.

Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long kanji candidate tokens

public Tokenizer.Builder otherPenalty(int lengthThreshold, int penalty)
This is an expert feature used with TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED that sets a length threshold and an additional cost applied during the Viterbi search.
The additional cost applies to non-kanji candidate tokens longer than the specified length threshold.
You would usually not need to change this.

Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long non-kanji candidate tokens

public Tokenizer.Builder isSplitOnNakaguro(boolean split)
Splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT). This feature is off by default.
This is an expert feature sometimes used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes.

Parameters:
split - whether to split unknown words on the middle dot

public Tokenizer build()
Creates the custom tokenizer instance.

Overrides:
build in class TokenizerBase.Builder

protected void loadDictionaries()

Overrides:
loadDictionaries in class TokenizerBase.Builder
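Putting the expert settings together, a search-oriented configuration might be sketched as below. The threshold and penalty values are illustrative assumptions, not recommendations from this documentation, and tokenize(...) comes from the enclosing Tokenizer class rather than the Builder:

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class SearchModeConfig {
    public static void main(String[] args) {
        Tokenizer tokenizer = new Tokenizer.Builder()
            .mode(TokenizerBase.Mode.SEARCH)  // segment compound nouns heuristically
            .kanjiPenalty(2, 3000)            // illustrative: penalize kanji tokens longer than 2
            .otherPenalty(7, 1700)            // illustrative: penalize non-kanji tokens longer than 7
            .isSplitOnNakaguro(true)          // split unknown words on U+30FB
            .build();

        // In SEARCH mode, a compound noun such as this is segmented into its parts
        System.out.println(tokenizer.tokenize("関西国際空港"));
    }
}
```

Because every setter returns Tokenizer.Builder, the calls chain fluently and can appear in any order before build().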