2. Module List

jieba.setLogLevel(log_level)
class jieba.Tokenizer(dictionary=None)

Bases: object

__init__(dictionary=None)

Construct a new custom tokenizer; this makes it possible to use different dictionaries at the same time.

jieba.dt is the default Tokenizer, and all global segmentation functions are mappings of this instance's methods.

Parameters

dictionary – path to a custom dictionary file; for the file format, see `jieba/dict.txt`
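To illustrate the use of multiple dictionaries, here is a minimal sketch that uses the default tokenizer alongside a second, independent Tokenizer (the file name my_dict.txt is hypothetical and must follow the `jieba/dict.txt` format):

    import jieba

    default_tk = jieba.dt                                    # default tokenizer backing the module-level functions
    custom_tk = jieba.Tokenizer(dictionary="my_dict.txt")    # hypothetical custom dictionary file

    # Each tokenizer keeps its own vocabulary, so the same text may be cut differently.
    print(default_tk.lcut("我来到北京清华大学"))
    print(custom_tk.lcut("我来到北京清华大学"))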

static gen_pfdict(f)
initialize(dictionary=None)
check_initialized()
calc(sentence, DAG, route)
get_DAG(sentence)
cut(sentence, cut_all=False, HMM=True, use_paddle=False)

Segment a sentence without POS tagging.

The main function that segments an entire sentence containing Chinese characters into separate words.

Parameters
  • sentence (str/unicode) – the string to be segmented.

  • HMM – whether to use the Hidden Markov Model.

Returns

A generator that yields the segmented words as str/unicode.

Return type

generator[str/unicode]
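A usage sketch (the sample sentence is arbitrary):

    import jieba

    # Default (accurate) mode: returns a lazy generator of words.
    words = jieba.cut("我来到北京清华大学", cut_all=False, HMM=True)
    print("/".join(words))

    # Full mode: enumerate every word the dictionary can find in the sentence.
    print("/".join(jieba.cut("我来到北京清华大学", cut_all=True)))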

cut_for_search(sentence, HMM=True)

Finer segmentation for search engines. Suitable for segmentation when building a search engine's inverted index; the granularity is relatively fine.

Parameters
  • sentence – the string to be segmented

  • HMM – whether to use the HMM model

Returns

A `generator` over the segmentation results
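A usage sketch (the sample sentence is arbitrary):

    import jieba

    # Search-engine mode: long words are additionally split into shorter dictionary words,
    # which is useful when building an inverted index.
    words = jieba.cut_for_search("小明硕士毕业于中国科学院计算所")
    print(", ".join(words))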

lcut(*args, **kwargs)
get_dict_file()
load_userdict(f)

Load a personalized dictionary to improve the detection rate.

Parameters
  • f – A plain text file that contains words and their occurrences.

    Can be a file-like object, or the path of the dictionary file, whose encoding must be utf-8.

Structure of the dict file: word1 freq1 word_type1, word2 freq2 word_type2 … The word type may be omitted.
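A sketch of loading a user dictionary (the file name userdict.txt and its entries are hypothetical, but they follow the structure described above):

    import jieba

    # userdict.txt, UTF-8 encoded, one entry per line, e.g.:
    #   云计算 5
    #   创新办 3 i
    #   凱特琳 nz
    jieba.load_userdict("userdict.txt")

    # Words from the user dictionary are now preferred during segmentation.
    print(jieba.lcut("李小福是创新办主任"))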

add_word(word, freq=None, tag=None)

Add a word to the dictionary.

freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.

del_word(word)

Convenience function for deleting a word.
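A sketch of adjusting the dictionary at runtime with add_word and del_word (the words are arbitrary examples):

    import jieba

    jieba.add_word("石墨烯")                    # freq is calculated so the word can be cut out
    jieba.add_word("台中", freq=69, tag="ns")   # freq and tag may also be given explicitly
    jieba.del_word("自定义词")                  # this word will no longer be produced as a token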

suggest_freq(segment, tune=False)

Suggest a word frequency to force the characters in a word to be joined or split.

Parameters
  • segment – The segments that the word is expected to be cut into.

    If the word should be treated as a whole, use a str.

  • tune – If True, tune the word frequency.

Note that HMM may affect the final result. If the result doesn’t change, set HMM=False.
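A sketch of both directions, splitting and joining (the sentences are arbitrary; note the HMM=False advice above):

    import jieba

    # Force "中将" to be split into "中" / "将" by passing the expected segments as a tuple.
    jieba.suggest_freq(("中", "将"), tune=True)
    print(jieba.lcut("如果放到post中将出错。", HMM=False))

    # Force "台中" to be kept as a whole word by passing a single str.
    jieba.suggest_freq("台中", tune=True)
    print(jieba.lcut("「台中」正确应该不会被切开", HMM=False))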

tokenize(unicode_sentence, mode='default', HMM=True)

Tokenize a sentence and yield tuples of (word, start, end).

Parameters
  • sentence: the str(unicode) to be segmented.

  • mode: “default” or “search”, “search” is for finer segmentation.

  • HMM: whether to use the Hidden Markov Model.
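A usage sketch (the sample sentence is arbitrary):

    import jieba

    # Default mode: each tuple is (word, start_offset, end_offset) in the original string.
    for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
        print("%s\t start: %d \t end: %d" % (word, start, end))

    # Search mode additionally reports the finer-grained sub-words.
    for tk in jieba.tokenize("永和服装饰品有限公司", mode="search"):
        print(tk)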

set_dictionary(dictionary_path)
jieba.get_FREQ(k, d=None)
jieba.add_word(word, freq=None, tag=None)

Add a word to the dictionary.

freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.

jieba.calc(sentence, DAG, route)
jieba.cut(sentence, cut_all=False, HMM=True, use_paddle=False)

Segment a sentence without POS tagging.

The main function that segments an entire sentence containing Chinese characters into separate words.

Parameters
  • sentence (str/unicode) – the string to be segmented.

  • HMM – whether to use the Hidden Markov Model.

Returns

A generator that yields the segmented words as str/unicode.

Return type

generator[str/unicode]

jieba.lcut(*args, **kwargs)
jieba.cut_for_search(sentence, HMM=True)

Finer segmentation for search engines. Suitable for segmentation when building a search engine's inverted index; the granularity is relatively fine.

Parameters
  • sentence – the string to be segmented

  • HMM – whether to use the HMM model

Returns

A `generator` over the segmentation results

jieba.del_word(word)

Convenience function for deleting a word.

jieba.get_DAG(sentence)
jieba.get_dict_file()
jieba.initialize(dictionary=None)
jieba.load_userdict(f)

Load a personalized dictionary to improve the detection rate.

Parameters
  • f – A plain text file that contains words and their occurrences.

    Can be a file-like object, or the path of the dictionary file, whose encoding must be utf-8.

Structure of the dict file: word1 freq1 word_type1, word2 freq2 word_type2 … The word type may be omitted.

jieba.set_dictionary(dictionary_path)
jieba.suggest_freq(segment, tune=False)

Suggest a word frequency to force the characters in a word to be joined or split.

Parameters
  • segment – The segments that the word is expected to be cut into.

    If the word should be treated as a whole, use a str.

  • tune – If True, tune the word frequency.

Note that HMM may affect the final result. If the result doesn’t change, set HMM=False.

jieba.tokenize(unicode_sentence, mode='default', HMM=True)

Tokenize a sentence and yield tuples of (word, start, end).

Parameters
  • sentence: the str(unicode) to be segmented.

  • mode: “default” or “search”, “search” is for finer segmentation.

  • HMM: whether to use the Hidden Markov Model.

jieba.enable_parallel(processnum=None)

Change the module's cut and cut_for_search functions to their parallel versions.

Note that this only works with dt, the default Tokenizer; custom Tokenizer instances are not supported.

jieba.disable_parallel()
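
A sketch of toggling parallel mode; parallel segmentation relies on multiprocessing and is not available on Windows, and the corpus file name here is hypothetical:

    import jieba

    jieba.enable_parallel(4)      # replace jieba.cut / jieba.cut_for_search with parallel versions
    with open("big_corpus.txt", encoding="utf-8") as f:   # hypothetical large input file
        words = list(jieba.cut(f.read()))
    jieba.disable_parallel()      # restore the single-process functions
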
jieba.posseg.load_model()
class jieba.posseg.pair(word, flag)

Bases: object

__init__(word, flag)
encode(arg)
class jieba.posseg.POSTokenizer(tokenizer=None)

Bases: object

__init__(tokenizer=None)
initialize(dictionary=None)
load_word_tag(f)
makesure_userdict_loaded()
cut(sentence, HMM=True)
lcut(*args, **kwargs)
jieba.posseg.initialize(dictionary=None)
jieba.posseg.cut(sentence, HMM=True, use_paddle=False)

Segment a sentence with POS tagging.

Global cut function that supports parallel processing.

Note that this only works with dt, the default POSTokenizer; custom POSTokenizer instances are not supported.

Parameters
  • sentence – the string to be segmented

  • HMM – whether to use the HMM model

  • use_paddle – whether to use paddle-mode segmentation; paddle mode uses lazy loading: paddlepaddle-tiny is installed through the enable_paddle interface, which also imports the related code

Returns

A `generator` over the segmentation results

jieba.posseg.lcut(sentence, HMM=True, use_paddle=False)
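
A sketch of POS-tagged segmentation with jieba.posseg (the sample sentence is arbitrary):

    import jieba.posseg as pseg

    # cut yields pair(word, flag) objects; flag is the part-of-speech tag.
    for p in pseg.cut("我爱北京天安门"):
        print(p.word, p.flag)

    # lcut returns the same results as a list.
    print(pseg.lcut("我爱北京天安门"))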