public static class Tokenizer.Builder extends TokenizerBase.Builder
Fields inherited from class TokenizerBase.Builder: characterDefinitions, connectionCosts, doubleArrayTrie, insertedDictionary, mode, partOfSpeechFeature, penalties, readingFeature, resolver, split, tokenFactory, tokenInfoDictionary, totalFeatures, unknownDictionary, userDictionary
| Constructor and Description |
|---|
| Builder() Creates a default builder |
| Modifier and Type | Method and Description |
|---|---|
| Tokenizer | build() Creates the custom tokenizer instance |
| Tokenizer.Builder | isSplitOnNakaguro(boolean split) Predicate that splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT) |
| Tokenizer.Builder | kanjiPenalty(int lengthThreshold, int penalty) Sets a custom kanji penalty |
| protected void | loadDictionaries() |
| Tokenizer.Builder | mode(TokenizerBase.Mode mode) Sets the tokenization mode |
| Tokenizer.Builder | otherPenalty(int lengthThreshold, int penalty) Sets a custom non-kanji penalty |
Methods inherited from class TokenizerBase.Builder: userDictionary, userDictionary
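A minimal usage sketch of the builder. The package name and the Token#getSurface() accessor are assumptions based on the IPADIC distribution of Kuromoji; they are not stated in this page.

```java
// Assumes the IPADIC flavour of Kuromoji (com.atilika.kuromoji.ipadic)
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class BuilderExample {
    public static void main(String[] args) {
        // Build a tokenizer with default settings (NORMAL mode)
        Tokenizer tokenizer = new Tokenizer.Builder().build();

        // Print the surface form of each token
        for (Token token : tokenizer.tokenize("お寿司が食べたい。")) {
            System.out.println(token.getSurface());
        }
    }
}
```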
public Tokenizer.Builder mode(TokenizerBase.Mode mode)
The tokenization mode defines how the input text is segmented. Available modes are as follows:
TokenizerBase.Mode.NORMAL
- The default mode
TokenizerBase.Mode.SEARCH
- Uses a heuristic to segment compound nouns (複合名詞) into their parts
TokenizerBase.Mode.EXTENDED
- Same as SEARCH, but emits unigram tokens for unknown terms
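The effect of the modes above can be sketched as follows. The package names are assumptions based on the IPADIC distribution of Kuromoji, and the exact segmentation depends on the dictionary in use.

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class ModeExample {
    public static void main(String[] args) {
        // SEARCH mode heuristically splits compound nouns into their parts
        Tokenizer searchTokenizer = new Tokenizer.Builder()
                .mode(TokenizerBase.Mode.SEARCH)
                .build();

        // NORMAL mode would typically keep 関西国際空港 ("Kansai
        // International Airport") as a single token; SEARCH mode
        // tends to emit its parts (関西 / 国際 / 空港) as well
        for (Token token : searchTokenizer.tokenize("関西国際空港")) {
            System.out.println(token.getSurface());
        }
    }
}
```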
See kanjiPenalty and otherPenalty for how to adjust the costs used by the SEARCH and EXTENDED modes.
Parameters:
mode - tokenization mode

public Tokenizer.Builder kanjiPenalty(int lengthThreshold, int penalty)
This is an expert feature used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. It sets a length threshold and an additional cost used when running the Viterbi search.
The additional cost is applicable for kanji candidate tokens longer than the length threshold specified.
This is an expert feature and you usually would not need to change this.
Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long kanji candidate tokens

public Tokenizer.Builder otherPenalty(int lengthThreshold, int penalty)
This is an expert feature used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. It sets a length threshold and an additional cost used when running the Viterbi search.
The additional cost is applicable for non-kanji candidate tokens longer than the length threshold specified.
This is an expert feature and you usually would not need to change this.
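As a sketch, the two penalty setters can be tuned together. The values below are illustrative only, and the package names are assumptions based on the IPADIC distribution of Kuromoji.

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class PenaltyExample {
    public static void main(String[] args) {
        // Illustrative values: penalize kanji candidates longer than
        // 2 characters and non-kanji candidates longer than 7 characters
        Tokenizer tokenizer = new Tokenizer.Builder()
                .mode(TokenizerBase.Mode.SEARCH)
                .kanjiPenalty(2, 3000)
                .otherPenalty(7, 1700)
                .build();

        tokenizer.tokenize("日本経済新聞")
                 .forEach(t -> System.out.println(t.getSurface()));
    }
}
```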
Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long non-kanji candidate tokens

public Tokenizer.Builder isSplitOnNakaguro(boolean split)
This is an expert feature sometimes used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. This feature is off by default.
Parameters:
split - predicate to indicate split on middle dot

public Tokenizer build()
build in class TokenizerBase.Builder
protected void loadDictionaries()
loadDictionaries in class TokenizerBase.Builder