Search resource list
fenci
- Java code for Chinese word segmentation based on IKAnalyzer2012; can remove stop words.
LDA_java
- Java source code for LDA (Latent Dirichlet Allocation); supports word segmentation and stop word removal.
toolkit_for_words_En
- Handles stop words and same-stem words in English text without changing the structure of the article. Suitable for preprocessing in text classification, text clustering, and recommendation.
ExcludeStopWord
- After a passage of Chinese text has been segmented into words, removes the stop words from the document according to a stop word list.
Project
- Mini search engine: loads 50 text files, then counts how often the entered keywords occur. We will design and implement a mini search engine that searches a set of 50 documents against a set of sample queries. The data structures used for storage are vectors and arrays. The algorithm in this
stopwords
- This file contains English stop words. Using these words can help in analyzing content and deleting irrelevant content.
R4
- R4 short-text dataset in English, drawn from the datasets used in many papers; already stemmed, with stop words removed and the text cleaned.
eliminate
- Text file for eliminating stop words during feature selection.
WordSplit.java
- Dictionary-based word segmentation implemented in Java; effectively removes stop words and punctuation, and can recognize personal names.
SplitWords
- Lucene-based document word segmentation program: removes stop words, counts word frequencies, and computes word weights.
stopwords_en.txt
- English stop words for machine learning
InfoRetri
- Naive Bayes text classification (the English description names libsvm instead), including stop word removal, word segmentation, feature extraction, and classification.
ReadFiles
- Segments Chinese text into words, removes stop words, and computes tf-idf values.
stopword
- This code shows how stop words are removed and displays the documents after the stop words have been removed.
THULAC_lite_java_v1
- Chinese text word segmentation and word frequency statistics: segments the text and removes stop words. Supports UTF-8 encoding only.
JavaCODE
- Java source code that removes stop words for text classification and can return the result as a file.
stemming
- stemming of text file stop words
FileDemo
- Example of segmenting a file: outputs Chinese words with part-of-speech tags, with stop words already removed.
EnglishChuLi
- Text preprocessing program written in Python, including the implementation code for every step: punctuation removal, stop word removal, similarity computation, PCA dimensionality reduction, clustering, and visualization. Runs in PyCharm with a Python 3 development environment.
kctp
- Preprocesses data, including word segmentation, symbol removal, and stop word removal.
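
Most of the resources listed above share the same preprocessing steps: split text into tokens, drop punctuation, remove stop words, and count word frequencies. The following is a minimal Python sketch of those steps for English text, not the code of any listed resource. The stop word set and the function names (`tokenize`, `remove_stop_words`, `term_frequencies`) are hypothetical, and Chinese text would first need a dedicated word segmenter in place of the regular expression used here.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer
# and are usually loaded from a text file with one word per line.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters (drops punctuation)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop word set, preserving order."""
    return [t for t in tokens if t not in stop_words]


def term_frequencies(text, stop_words=STOP_WORDS):
    """Count how often each remaining word occurs after stop word removal."""
    return Counter(remove_stop_words(tokenize(text), stop_words))


sample = "The removal of stop words is a common step in text classification."
tokens = remove_stop_words(tokenize(sample))
freqs = term_frequencies(sample)
print(tokens)
print(freqs.most_common(3))
```

Keeping the stop words in a set makes each membership check constant time on average, which matters once the list grows to several hundred entries.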