2. Module list
- jieba.setLogLevel(log_level)
- class jieba.Tokenizer(dictionary=None)
Bases:
object
- __init__(dictionary=None)
Creates a new custom tokenizer, which makes it possible to use different dictionaries at the same time.
jieba.dt is the default tokenizer; all global segmentation functions are mappings of this tokenizer's methods.
- Parameters
dictionary – the file name of a custom dictionary; see `jieba/dict.txt` for the file format.
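A minimal sketch of running the default tokenizer and an independent one side by side (`userdict.txt` is a hypothetical dictionary file in the `jieba/dict.txt` format):

```python
import jieba

# The global functions delegate to the default tokenizer jieba.dt.
print("/".join(jieba.cut("我来到北京清华大学")))

# An independent Tokenizer keeps its own dictionary, so two
# vocabularies can coexist in one process. "userdict.txt" is a
# hypothetical file in the same format as jieba/dict.txt.
custom = jieba.Tokenizer(dictionary="userdict.txt")
print("/".join(custom.cut("我来到北京清华大学")))
```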
- static gen_pfdict(f)
- initialize(dictionary=None)
- check_initialized()
- calc(sentence, DAG, route)
- get_DAG(sentence)
- cut(sentence, cut_all=False, HMM=True, use_paddle=False)
Segments a sequence without POS tagging.
The main function that segments an entire sentence containing Chinese characters into separate words.
- Parameters
sentence (str/unicode) – the string to be segmented.
cut_all – whether to use full mode (True) or accurate mode (False).
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle mode.
- Returns
A generator that iterates through the segmented words; each element is str/unicode.
- Return type
generator[str/unicode]
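A short usage sketch; the outputs in the comments follow jieba's README and may vary with the dictionary version:

```python
import jieba

sentence = "我来到北京清华大学"

# Accurate mode (default)
print("/".join(jieba.cut(sentence, cut_all=False)))
# e.g. 我/来到/北京/清华大学

# Full mode: every word the dictionary can find is emitted
print("/".join(jieba.cut(sentence, cut_all=True)))
# e.g. 我/来到/北京/清华/清华大学/华大/大学

# lcut(*args, **kwargs) is the list-returning counterpart of cut
words = jieba.lcut(sentence)
```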
- cut_for_search(sentence, HMM=True)
Finer segmentation for search engines; suitable for building an inverted index, with relatively fine granularity.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
- Returns
A generator of the segmentation results.
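A usage sketch, with illustrative output following jieba's README:

```python
import jieba

sentence = "小明硕士毕业于中国科学院计算所"

# Long words are additionally cut into shorter ones so that an
# inverted index can match more queries.
print("/".join(jieba.cut_for_search(sentence)))
# e.g. 小明/硕士/毕业/于/中国/科学/学院/科学院/中国科学院/计算/计算所

# lcut_for_search(*args, **kwargs) returns the same results as a list
tokens = jieba.lcut_for_search(sentence)
```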
- lcut(*args, **kwargs)
- lcut_for_search(*args, **kwargs)
- get_dict_file()
- load_userdict(f)
Loads a personalized dictionary to improve the detection rate.
- Parameters
f – a plain-text file containing words and their occurrences. Can be a file-like object or the path of the dictionary file, whose encoding must be utf-8.
Structure of the dict file (one entry per line): `word freq word_type`. The word type may be omitted.
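A sketch of loading a user dictionary (`userdict.txt` is a hypothetical file; frequency and word type may both be omitted):

```python
import io
import jieba

# Hypothetical userdict.txt, one entry per line:
#   云计算 5 n
#   八一双鹿 3 nz
jieba.load_userdict("userdict.txt")

# A file-like object is accepted as well:
jieba.load_userdict(io.StringIO("云计算 5 n\n八一双鹿 3 nz\n"))
```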
- add_word(word, freq=None, tag=None)
Add a word to the dictionary.
freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.
- del_word(word)
Convenience function for deleting a word.
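A sketch of adding and removing dictionary entries at runtime:

```python
import jieba

jieba.add_word("石墨烯")                     # freq computed automatically
jieba.add_word("凱特琳", freq=42, tag="nz")  # explicit frequency and tag
print(jieba.lcut("凱特琳是一个自定义词"))

jieba.del_word("凱特琳")                     # remove the entry again
```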
- suggest_freq(segment, tune=False)
Suggests a word frequency to force the characters in a word to be joined or split.
- Parameters
segment – the segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
tune – if True, tune the word frequency.
Note that HMM may affect the final result. If the result doesn't change, set HMM=False.
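Both directions, following the example in jieba's README (HMM=False avoids the HMM re-joining the characters):

```python
import jieba

# Force "中将" to be split into "中" / "将" in this context:
jieba.suggest_freq(("中", "将"), tune=True)
print("/".join(jieba.cut("如果放到post中将出错。", HMM=False)))
# e.g. 如果/放到/post/中/将/出错/。

# Force "台中" to be kept as one word:
jieba.suggest_freq("台中", tune=True)
print("/".join(jieba.cut("「台中」正确应该不会被切开", HMM=False)))
# e.g. 「/台中/」/正确/应该/不会/被/切开
```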
- tokenize(unicode_sentence, mode='default', HMM=True)
Tokenizes a sentence and yields tuples of (word, start, end).
- Parameters
unicode_sentence – the str (unicode) to be segmented.
mode – "default" or "search"; "search" is for finer segmentation.
HMM – whether to use the Hidden Markov Model.
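A usage sketch; the offsets in the comment follow jieba's README:

```python
import jieba

for word, start, end in jieba.tokenize("永和服装饰品有限公司"):
    print(word, start, end)
# e.g. 永和 0 2 / 服装 2 4 / 饰品 4 6 / 有限公司 6 10

# mode="search" additionally yields finer-grained tokens
for word, start, end in jieba.tokenize("永和服装饰品有限公司", mode="search"):
    print(word, start, end)
```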
- set_dictionary(dictionary_path)
- jieba.get_FREQ(k, d=None)
- jieba.add_word(word, freq=None, tag=None)
Add a word to the dictionary.
freq and tag can be omitted; freq defaults to a calculated value that ensures the word can be cut out.
- jieba.calc(sentence, DAG, route)
- jieba.cut(sentence, cut_all=False, HMM=True, use_paddle=False)
Segments a sequence without POS tagging.
The main function that segments an entire sentence containing Chinese characters into separate words.
- Parameters
sentence (str/unicode) – the string to be segmented.
cut_all – whether to use full mode (True) or accurate mode (False).
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle mode.
- Returns
A generator that iterates through the segmented words; each element is str/unicode.
- Return type
generator[str/unicode]
- jieba.lcut(*args, **kwargs)
- jieba.cut_for_search(sentence, HMM=True)
Finer segmentation for search engines; suitable for building an inverted index, with relatively fine granularity.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
- Returns
A generator of the segmentation results.
- jieba.lcut_for_search(*args, **kwargs)
- jieba.del_word(word)
Convenience function for deleting a word.
- jieba.get_DAG(sentence)
- jieba.get_dict_file()
- jieba.initialize(dictionary=None)
- jieba.load_userdict(f)
Loads a personalized dictionary to improve the detection rate.
- Parameters
f – a plain-text file containing words and their occurrences. Can be a file-like object or the path of the dictionary file, whose encoding must be utf-8.
Structure of the dict file (one entry per line): `word freq word_type`. The word type may be omitted.
- jieba.set_dictionary(dictionary_path)
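A sketch of switching the main dictionary before first use, following jieba's README (`data/dict.txt.big` is assumed to be a local copy of the larger dictionary distributed with jieba):

```python
import jieba

# Point jieba at a different main dictionary; it takes effect when
# the tokenizer (re)initializes.
jieba.set_dictionary("data/dict.txt.big")
jieba.initialize()  # optional: load the dictionary eagerly
print(jieba.lcut("自然语言处理"))
```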
- jieba.suggest_freq(segment, tune=False)
Suggests a word frequency to force the characters in a word to be joined or split.
- Parameters
segment – the segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
tune – if True, tune the word frequency.
Note that HMM may affect the final result. If the result doesn't change, set HMM=False.
- jieba.tokenize(unicode_sentence, mode='default', HMM=True)
Tokenizes a sentence and yields tuples of (word, start, end).
- Parameters
unicode_sentence – the str (unicode) to be segmented.
mode – "default" or "search"; "search" is for finer segmentation.
HMM – whether to use the Hidden Markov Model.
- jieba.enable_parallel(processnum=None)
Changes the module's cut and cut_for_search functions to their parallel versions.
Note that this only works with the default tokenizer dt; custom Tokenizer instances are not supported.
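A sketch of parallel segmentation (it is based on process forking, so per jieba's README it is unavailable on Windows; `big_corpus.txt` is a hypothetical input file):

```python
import jieba

jieba.enable_parallel(4)   # segment with 4 worker processes
with open("big_corpus.txt", encoding="utf-8") as f:
    words = jieba.lcut(f.read())
jieba.disable_parallel()   # restore the serial implementation
```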
- jieba.disable_parallel()
- jieba.posseg.load_model()
- class jieba.posseg.POSTokenizer(tokenizer=None)
Bases:
object
- __init__(tokenizer=None)
- initialize(dictionary=None)
- load_word_tag(f)
- makesure_userdict_loaded()
- cut(sentence, HMM=True)
- lcut(*args, **kwargs)
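A minimal sketch of a standalone POS tokenizer (tokenizer=None wraps the default jieba.dt; a custom jieba.Tokenizer could be passed instead):

```python
import jieba
import jieba.posseg

ptok = jieba.posseg.POSTokenizer(tokenizer=None)
for pair in ptok.cut("我爱北京天安门"):
    print(pair.word, pair.flag)  # word plus its POS flag
```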
- jieba.posseg.initialize(dictionary=None)
- jieba.posseg.cut(sentence, HMM=True, use_paddle=False)
Segments a sequence with POS tagging.
Global cut function that supports parallel processing.
Note that this only works with the default tokenizer dt; custom POSTokenizer instances are not supported.
- Parameters
sentence – the string to be segmented.
HMM – whether to use the Hidden Markov Model.
use_paddle – whether to use paddle-mode segmentation. Paddle mode is loaded lazily: the enable_paddle interface installs paddlepaddle-tiny and imports the related code.
- Returns
A generator of the segmentation results.
- jieba.posseg.lcut(sentence, HMM=True, use_paddle=False)
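Module-level usage, following jieba's README (lcut returns the same pairs as a list):

```python
import jieba.posseg as pseg

for word, flag in pseg.cut("我爱北京天安门"):
    print(word, flag)
# e.g. 我 r / 爱 v / 北京 ns / 天安门 ns

pairs = pseg.lcut("我爱北京天安门")  # list-returning counterpart
```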