2. Module list
- jieba.setLogLevel(log_level)
- class jieba.Tokenizer(dictionary=None)
Bases:
object
- __init__(dictionary=None)
Creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time.
jieba.dt is the default tokenizer; all global segmentation functions are mappings of this tokenizer's methods.
- Parameters
dictionary – the file name of a custom dictionary; see `jieba/dict.txt` for the file format.
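A minimal sketch of running the default tokenizer and an independent one side by side (`userdict.txt` is a hypothetical dictionary file in the `jieba/dict.txt` format):

```python
import jieba

# The global functions delegate to the default tokenizer jieba.dt.
print("/".join(jieba.cut("我来到北京清华大学")))

# An independent Tokenizer keeps its own dictionary, so two
# vocabularies can coexist in one process. "userdict.txt" is a
# hypothetical file in the same format as jieba/dict.txt.
custom = jieba.Tokenizer(dictionary="userdict.txt")
print("/".join(custom.cut("我来到北京清华大学")))
```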
- static gen_pfdict(f)
- initialize(dictionary=None)
- check_initialized()
- calc(sentence, DAG, route)
- get_DAG(sentence)
- cut(sentence, cut_all=False, HMM=True, use_paddle=False)
Segments a sequence without POS tagging.
The main function that segments an entire sentence containing Chinese characters into separate words.
- Parameters
sentence (str/unicode) – the string to be segmented.
cut_all – whether to use full mode (True) or accurate mode (False).
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle mode.
- Returns
A generator that iterates through the segmented words; each element is str/unicode.
- Return type
generator[str/unicode]
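A short usage sketch; the outputs in the comments follow jieba's README and may vary with the dictionary version:

```python
import jieba

sentence = "我来到北京清华大学"

# Accurate mode (default)
print("/".join(jieba.cut(sentence, cut_all=False)))
# e.g. 我/来到/北京/清华大学

# Full mode: every word the dictionary can find is emitted
print("/".join(jieba.cut(sentence, cut_all=True)))
# e.g. 我/来到/北京/清华/清华大学/华大/大学

# lcut(*args, **kwargs) is the list-returning counterpart of cut
words = jieba.lcut(sentence)
```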
- cut_for_search(sentence, HMM=True)
Finer segmentation for search engines; suitable for building an inverted index, with relatively fine granularity.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
- Returns
A generator of the segmentation results.
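A usage sketch, with illustrative output following jieba's README:

```python
import jieba

sentence = "小明硕士毕业于中国科学院计算所"

# Long words are additionally cut into shorter ones so that an
# inverted index can match more queries.
print("/".join(jieba.cut_for_search(sentence)))
# e.g. 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所

# lcut_for_search(*args, **kwargs) returns the same results as a list
tokens = jieba.lcut_for_search(sentence)
```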
- lcut(*args, **kwargs)
- lcut_for_search(*args, **kwargs)
- get_dict_file()
- load_userdict(f)
Loads a personalized dictionary to improve the detection rate.
- Parameters
f – a plain-text file containing words and their occurrences. Can be a file-like object or the path of the dictionary file, whose encoding must be utf-8.
Structure of the dict file (one entry per line): `word freq word_type`. The word type may be omitted.
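A sketch of loading a user dictionary (`userdict.txt` is a hypothetical file; frequency and word type may both be omitted):

```python
import io
import jieba

# Hypothetical userdict.txt, one entry per line:
#   云计算 5 n
#   八一双鹿 3 nz
jieba.load_userdict("userdict.txt")

# A file-like object is accepted as well:
jieba.load_userdict(io.StringIO("云计算 5 n\n八一双鹿 3 nz\n"))
```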
- add_word(word, freq=None, tag=None)
Add a word to the dictionary.
freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.
- del_word(word)
Convenience function for deleting a word.
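A sketch of adding and removing dictionary entries at runtime:

```python
import jieba

jieba.add_word("石墨烯")                     # freq computed automatically
jieba.add_word("凱特琳", freq=42, tag="nz")  # explicit frequency and tag
print(jieba.lcut("凱特琳是一个自定义词"))

jieba.del_word("凱特琳")                     # remove the entry again
```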
- suggest_freq(segment, tune=False)
Suggests a word frequency to force the characters in a word to be joined or split.
- Parameters
segment – the segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
tune – if True, tune the word frequency.
Note that HMM may affect the final result. If the result doesn't change, set HMM=False.
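Both directions, following the example in jieba's README (HMM=False avoids the HMM re-joining the characters):

```python
import jieba

# Force "中将" to be split into "中" / "将" in this context:
jieba.suggest_freq(("中", "将"), tune=True)
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
# e.g. 如果/放到/post/中/将/出错/。

# Force "台中" to be kept as one word:
jieba.suggest_freq("台中", tune=True)
print("/".join(jieba.cut("「台中」正确应该不会被切开", HMM=False)))
# e.g. 「/台中/」/正确/应该/不会/被/切开
```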
- tokenize(unicode_sentence, mode='default', HMM=True)
Tokenizes a sentence and yields tuples of (word, start, end).
- Parameters
unicode_sentence – the str (unicode) to be segmented.
mode – "default" or "search"; "search" is for finer segmentation.
HMM – whether to use the Hidden Markov Model.
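A usage sketch; the offsets in the comment follow jieba's README:

```python
import jieba

for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
    print(word, start, end)
# e.g. 永和 0 2 / 服装 2 4 / 饰品 4 6 / 有限公司 6 10

# mode="search" additionally yields finer-grained tokens
for word, start, end in jieba.tokenize("永和服装饰品有限公司", mode="search"):
    print(word, start, end)
```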
- set_dictionary(dictionary_path)
- jieba.get_FREQ(k, d=None)
- jieba.add_word(word, freq=None, tag=None)
Add a word to the dictionary.
freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.
- jieba.calc(sentence, DAG, route)
- jieba.cut(sentence, cut_all=False, HMM=True, use_paddle=False)
Segments a sequence without POS tagging.
The main function that segments an entire sentence containing Chinese characters into separate words.
- Parameters
sentence (str/unicode) – the string to be segmented.
cut_all – whether to use full mode (True) or accurate mode (False).
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle mode.
- Returns
A generator that iterates through the segmented words; each element is str/unicode.
- Return type
generator[str/unicode]
- jieba.lcut(*args, **kwargs)
- jieba.cut_for_search(sentence, HMM=True)
Finer segmentation for search engines; suitable for building an inverted index, with relatively fine granularity.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
- Returns
A generator of the segmentation results.
- jieba.lcut_for_search(*args, **kwargs)
- jieba.del_word(word)
Convenience function for deleting a word.
- jieba.get_DAG(sentence)
- jieba.get_dict_file()
- jieba.initialize(dictionary=None)
- jieba.load_userdict(f)
Loads a personalized dictionary to improve the detection rate.
- Parameters
f – a plain-text file containing words and their occurrences. Can be a file-like object or the path of the dictionary file, whose encoding must be utf-8.
Structure of the dict file (one entry per line): `word freq word_type`. The word type may be omitted.
- jieba.set_dictionary(dictionary_path)
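A sketch of switching the main dictionary before first use, following jieba's README (`data/dict.txt.big` is assumed to be a local copy of the larger dictionary distributed with jieba):

```python
import jieba

# Point jieba at a different main dictionary; it takes effect when
# the tokenizer (re)initializes.
jieba.set_dictionary("data/dict.txt.big")
jieba.initialize()  # optional: load the dictionary eagerly
print(jieba.lcut("自然语言处理"))
```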
- jieba.suggest_freq(segment, tune=False)
Suggests a word frequency to force the characters in a word to be joined or split.
- Parameters
segment – the segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
tune – if True, tune the word frequency.
Note that HMM may affect the final result. If the result doesn't change, set HMM=False.
- jieba.tokenize(unicode_sentence, mode='default', HMM=True)
Tokenizes a sentence and yields tuples of (word, start, end).
- Parameters
unicode_sentence – the str (unicode) to be segmented.
mode – "default" or "search"; "search" is for finer segmentation.
HMM – whether to use the Hidden Markov Model.
- jieba.enable_parallel(processnum=None)
Changes the module's cut and cut_for_search functions to their parallel versions.
Note that this only works with the default tokenizer dt; custom Tokenizer instances are not supported.
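A sketch of parallel segmentation (it is based on process forking, so per jieba's README it is unavailable on Windows; `big_corpus.txt` is a hypothetical input file):

```python
import jieba

jieba.enable_parallel(4)   # segment with 4 worker processes
with open("big_corpus.txt", encoding="utf-8") as f:
    words = jieba.lcut(f.read())
jieba.disable_parallel()   # restore the serial implementation
```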
- jieba.disable_parallel()
- jieba.posseg.load_model()
- class jieba.posseg.POSTokenizer(tokenizer=None)
Bases:
object
- __init__(tokenizer=None)
- initialize(dictionary=None)
- load_word_tag(f)
- makesure_userdict_loaded()
- cut(sentence, HMM=True)
- lcut(*args, **kwargs)
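A minimal sketch of a standalone POS tokenizer (tokenizer=None wraps the default jieba.dt; a custom jieba.Tokenizer could be passed instead):

```python
import jieba
import jieba.posseg

ptok = jieba.posseg.POSTokenizer(tokenizer=None)
for pair in ptok.cut("我爱北京天安门"):
    print(pair.word, pair.flag)  # word plus its POS flag
```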
- jieba.posseg.initialize(dictionary=None)
- jieba.posseg.cut(sentence, HMM=True, use_paddle=False)
Segments a sequence with POS tagging.
Global cut function that supports parallel processing.
Note that this only works with the default tokenizer dt; custom POSTokenizer instances are not supported.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle-mode segmentation. Paddle mode is loaded lazily: the enable_paddle interface installs paddlepaddle-tiny and imports the related code.
- Returns
A generator of the segmentation results.
- jieba.posseg.lcut(sentence, HMM=True, use_paddle=False)
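Module-level usage, following jieba's README (lcut returns the same pairs as a list):

```python
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱北京天安门"):
    print(word, flag)
# e.g. 我 r / 爱 v / 北京 ns / 天安门 ns

pairs = pseg.lcut("我爱北京天安门")  # list-returning counterpart
```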