Install Cygwin, Python, and jieba (结巴分词, a Chinese word segmentation library).
You need a Python script:

    import sys
    sys.path.append('../')  # also look for jieba in the parent directory

    import jieba
    import jieba.analyse
    from optparse import OptionParser

    USAGE = 'usage: python extract_tags_with_weight.py [file name] -k [top k] -w [with weight=1 or 0]'

    # Parse the command line: -k is the number of keywords, -w toggles weights
    parser = OptionParser(USAGE)
    parser.add_option('-k', dest='topK')
    parser.add_option('-w', dest='withWeight')
    opt, args = parser.parse_args()

    if len(args) < 1:
        print(USAGE)
        sys.exit(1)

    file_name = args[0]

    # Default to the top 10 keywords when -k is not given
    topK = 10 if opt.topK is None else int(opt.topK)

    # Weights are printed only for -w 1 (compare with ==, not "is")
    withWeight = opt.withWeight is not None and int(opt.withWeight) == 1

    with open(file_name, 'rb') as f:
        content = f.read()

    tags = jieba.analyse.extract_tags(content, topK=topK, withWeight=withWeight)

    if withWeight:
        # Each tag is a (word, tf-idf weight) pair
        for tag in tags:
            print('tag: %s\t\t weight: %f' % (tag[0], tag[1]))
    else:
        print(','.join(tags))

Copy the script and save it as a .py file, for example: extract_tags.py
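Test text: 2.txt, placed in the same directory as the script.
Contents of 2.txt: 98 occurrences of "6061" and 2 occurrences of "铝板" (aluminum plate).

    $ python extract_tags.py 2.txt -w 1 -k 1000

This gives the tf-idf value of "铝板" in this text: 0.228315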
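To reproduce the test, the sample file can be generated with a few lines of Python. This is only a sketch: the function name write_sample and the use of single spaces between tokens are assumptions, since the note does not say how the 98 and 2 occurrences are laid out inside 2.txt.

```python
def write_sample(path="2.txt", n_code=98, n_word=2):
    """Write n_code copies of "6061" and n_word copies of "铝板" to path.

    Assumption: tokens are separated by single spaces; the original note
    does not describe the layout of 2.txt.
    """
    tokens = ["6061"] * n_code + ["铝板"] * n_word
    # Write the file as UTF-8 so the Chinese word is stored in a known encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(tokens))
    return tokens


if __name__ == "__main__":
    write_sample()
```

Running it in the script's directory creates a 2.txt with 100 tokens, 2 of which are "铝板".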
tf (term frequency) * idf = tf-idf
tf = tf-idf / idf = 0.228315 / 11.415771 = 2%
So the density of "铝板" in 2.txt is 2%, which matches its 2 occurrences out of 100 words.
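The same arithmetic can be checked in Python. This is a minimal sketch: keyword_density is a hypothetical helper name, and the two numbers are the tf-idf and idf values quoted in this note, not values read from jieba at run time.

```python
def keyword_density(tfidf, idf):
    """Recover term frequency (keyword density) from a tf-idf weight and an idf."""
    if idf <= 0:
        raise ValueError("idf must be positive")
    return tfidf / idf


# Values taken from the note: tf-idf of "铝板" and the idf used for it
tf = keyword_density(0.228315, 11.415771)
print("density: {:.2%}".format(tf))  # prints "density: 2.00%"
```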
The txt file is UTF-8 encoded.