public static class Tokenizer.Builder extends TokenizerBase.Builder
Fields inherited from class TokenizerBase.Builder: characterDefinitions, connectionCosts, doubleArrayTrie, insertedDictionary, mode, partOfSpeechFeature, penalties, readingFeature, resolver, split, tokenFactory, tokenInfoDictionary, totalFeatures, unknownDictionary, userDictionary
| Constructor and Description |
|---|
| Builder() Creates a default builder |
| Modifier and Type | Method and Description |
|---|---|
| Tokenizer | build() Creates the custom tokenizer instance |
| Tokenizer.Builder | isSplitOnNakaguro(boolean split) Predicate that splits unknown words on the middle dot character (U+30FB KATAKANA MIDDLE DOT) |
| Tokenizer.Builder | kanjiPenalty(int lengthThreshold, int penalty) Sets a custom kanji penalty |
| protected void | loadDictionaries() |
| Tokenizer.Builder | mode(TokenizerBase.Mode mode) Sets the tokenization mode |
| Tokenizer.Builder | otherPenalty(int lengthThreshold, int penalty) Sets a custom non-kanji penalty |
Methods inherited from class TokenizerBase.Builder: userDictionary, userDictionary
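A minimal usage sketch of the builder. The package name and the Token#getSurface() accessor are assumptions based on the IPADIC distribution of Kuromoji; they are not stated in this page.

```java
// Assumes the IPADIC flavour of Kuromoji (com.atilika.kuromoji.ipadic)
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class BuilderExample {
    public static void main(String[] args) {
        // Build a tokenizer with default settings (NORMAL mode)
        Tokenizer tokenizer = new Tokenizer.Builder().build();

        // Print the surface form of each token
        for (Token token : tokenizer.tokenize("お寿司が食べたい。")) {
            System.out.println(token.getSurface());
        }
    }
}
```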
public Tokenizer.Builder mode(TokenizerBase.Mode mode)
The tokenization mode defines how the input text is segmented. Available modes are as follows:
TokenizerBase.Mode.NORMAL
- The default mode
TokenizerBase.Mode.SEARCH
- Uses a heuristic to segment compound nouns (複合名詞) into their parts
TokenizerBase.Mode.EXTENDED
- Same as SEARCH, but emits unigram tokens for unknown terms
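The effect of the modes above can be sketched as follows. The package names are assumptions based on the IPADIC distribution of Kuromoji, and the exact segmentation depends on the dictionary in use.

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class ModeExample {
    public static void main(String[] args) {
        // SEARCH mode heuristically splits compound nouns into their parts
        Tokenizer searchTokenizer = new Tokenizer.Builder()
                .mode(TokenizerBase.Mode.SEARCH)
                .build();

        // NORMAL mode would typically keep 関西国際空港 ("Kansai
        // International Airport") as a single token; SEARCH mode
        // tends to emit its parts (関西 / 国際 / 空港) as well
        for (Token token : searchTokenizer.tokenize("関西国際空港")) {
            System.out.println(token.getSurface());
        }
    }
}
```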
See kanjiPenalty and otherPenalty for how to adjust the costs used by the SEARCH and EXTENDED modes.
Parameters:
mode - tokenization mode

public Tokenizer.Builder kanjiPenalty(int lengthThreshold, int penalty)
This is an expert feature used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. It sets a length threshold and an additional cost used when running the Viterbi search.
The additional cost is applicable for kanji candidate tokens longer than the length threshold specified.
This is an expert feature and you usually would not need to change this.
Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long kanji candidate tokens

public Tokenizer.Builder otherPenalty(int lengthThreshold, int penalty)
This is an expert feature used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. It sets a length threshold and an additional cost used when running the Viterbi search.
The additional cost is applicable for non-kanji candidate tokens longer than the length threshold specified.
This is an expert feature and you usually would not need to change this.
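As a sketch, the two penalty setters can be tuned together. The values below are illustrative only, and the package names are assumptions based on the IPADIC distribution of Kuromoji.

```java
import com.atilika.kuromoji.TokenizerBase;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class PenaltyExample {
    public static void main(String[] args) {
        // Illustrative values: penalize kanji candidates longer than
        // 2 characters and non-kanji candidates longer than 7 characters
        Tokenizer tokenizer = new Tokenizer.Builder()
                .mode(TokenizerBase.Mode.SEARCH)
                .kanjiPenalty(2, 3000)
                .otherPenalty(7, 1700)
                .build();

        tokenizer.tokenize("日本経済新聞")
                 .forEach(t -> System.out.println(t.getSurface()));
    }
}
```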
Parameters:
lengthThreshold - length threshold applicable for this penalty
penalty - cost added to Viterbi nodes for long non-kanji candidate tokens

public Tokenizer.Builder isSplitOnNakaguro(boolean split)
This is an expert feature sometimes used with the TokenizerBase.Mode.SEARCH and TokenizerBase.Mode.EXTENDED modes. This feature is off by default.
Parameters:
split - predicate to indicate split on middle dot

public Tokenizer build()
build in class TokenizerBase.Builder
protected void loadDictionaries()
loadDictionaries in class TokenizerBase.Builder