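中文文本聚類常用停用詞表對比研究 (A Comparative Study of Common Stopword Lists for Chinese Text Clustering) | Taiwan Citation Index - Humanities and Social Sciences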


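Title:	中文文本聚類常用停用詞表對比研究 (A Comparative Study of Common Stopword Lists for Chinese Text Clustering)
Journal:	數據分析與知識發現 (Data Analysis and Knowledge Discovery)
Authors:	官琴／鄧三鴻／王昊
Publication date:	2017
Volume/issue:	2017(3)
Pages:	72-80
Keywords:	Text clustering; Stopwords; K-means; Stopword list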
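Citation counts:	Times cited: journal articles (1), doctoral dissertations (0), books (0), book chapters (0); excluding self-citations: 1; co-citations: 0; views: 1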

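[Objective] Through comparative experiments, this study examines how different stopword lists affect different types of text data, in order to offer guidance on building and using stopword lists. [Methods] The Baidu stopword list, the Harbin Institute of Technology stopword list, and the stopword list of the Machine Intelligence Laboratory of Sichuan University were selected. Texts from three different corpora were processed with Chinese word segmentation, the TF-IDF feature evaluation function, and the vector space model (VSM). Clustering experiments were run with a K-means implementation written in Java, and the clustering results were evaluated with three metrics: precision (P), recall (R), and F1. [Results] The effect of a stopword list differs markedly across types of text data. The length and content structure of the list are the direct factors behind its effect, and two-character stopwords have the most pronounced effect. [Limitations] The types and number of experimental texts were limited. The stopword lists were compared only by word count and content, and no experiments were run with stopwords grouped by category. [Conclusions] The stopword list has a large impact on the accuracy of text clustering, so building or choosing a suitable Chinese stopword list is extremely important. At the same time, adding ever more stopwords does not keep improving the clustering results.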
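The preprocessing named in the abstract (stopword filtering on segmented text, TF-IDF weighting, vector space model) can be sketched roughly as follows. This is a minimal Python illustration, not the authors' implementation (the abstract says their clustering code was written in Java). The sample tokens, the two-word stopword set, and the function names are hypothetical, and the TF-IDF variant used here (relative term frequency times log(N/df)) is an assumption, since the record does not state which variant the paper used.

```python
import math
from collections import Counter


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def tfidf_vectors(docs):
    """Turn token lists into TF-IDF vectors over a shared vocabulary (VSM).

    Assumed weighting: (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / length) * math.log(n_docs / doc_freq[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical input: documents that have already been word-segmented.
stopwords = {"的", "了"}
segmented = [
    ["我", "喜歡", "的", "電影"],
    ["電影", "的", "票房"],
    ["股票", "的", "行情", "了"],
]
filtered = [remove_stopwords(doc, stopwords) for doc in segmented]
vocab, vectors = tfidf_vectors(filtered)
```

Under this particular weighting, a term that occurs in every document already receives weight log(1) = 0, so a stopword list matters most for words that are frequent but not universal.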


[Objective] This paper compares how different stopword lists affect the processing of different types of text data, aiming to inform the construction and use of stopword lists. [Methods] We used the stopword lists from Baidu, Harbin Institute of Technology, and the Machine Intelligence Laboratory of Sichuan University. First, we processed texts from three corpora with Chinese word segmentation, the TF-IDF feature evaluation function, and the vector space model (VSM). Second, we clustered the texts with a K-means algorithm implemented in Java and evaluated the results with precision (P), recall (R), and F1. [Results] The stopword lists had clearly different effects on different types of text data. The length and content structure of a list directly influenced the clustering results, and two-character stopwords had the most pronounced effect. [Limitations] The types and number of texts were limited. The stopword lists were compared only by word count and content, and stopwords were not grouped by category for analysis. [Conclusions] The stopword list has a significant impact on text clustering accuracy, so it is extremely important to build or choose an appropriate Chinese stopword list. However, continuing to add stopwords does not always improve the clustering results.
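The clustering and evaluation step can be sketched in the same hedged way. The record does not say how P, R, and F1 were computed for clusters, so the code below assumes one common convention: each true class is matched to the cluster that gives it the highest F1, and the per-class scores are averaged with weights proportional to class size. The K-means is a plain Euclidean version with random seeding. All names and the toy data are hypothetical, and this is a Python sketch rather than the Java implementation the abstract mentions.

```python
import random
from collections import Counter


def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means with squared Euclidean distance; returns a label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centroids[j])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_prf(true_labels, cluster_labels):
    """Class-size-weighted P, R, F1, matching each class to its best-F1 cluster."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    p_sum = r_sum = f_sum = 0.0
    for cls, n_cls in class_sizes.items():
        best_f, best_p, best_r = 0.0, 0.0, 0.0
        for clu, n_clu in cluster_sizes.items():
            overlap = joint[(cls, clu)]
            if overlap == 0:
                continue
            p, r = overlap / n_clu, overlap / n_cls
            f = 2 * p * r / (p + r)
            if f > best_f:
                best_f, best_p, best_r = f, p, r
        weight = n_cls / n
        p_sum += weight * best_p
        r_sum += weight * best_r
        f_sum += weight * best_f
    return p_sum, r_sum, f_sum


# Toy data: two well-separated groups with known (hypothetical) class labels.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
truth = ["sports", "sports", "finance", "finance"]
labels = kmeans(points, k=2)
precision, recall, f1 = cluster_prf(truth, labels)
```

Running the same evaluation on vectors built with different stopword lists is the kind of comparison the abstract describes. Other matching schemes, such as assigning each cluster its majority class, would give different numbers.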


Journal articles
1.	Luhn, H. P.(1958)。The automatic creation of literature abstracts。IBM Journal of Research and Development，2(2)，159-165。
2.	Luhn, H. P.(1957)。A Statistical Approach to Mechanized Encoding and Searching of Literary Information。IBM Journal of Research and Development，1(4)，309-317。
3.	Francis, W. N.、Kučera, H.、Mackie, A. W.(1982)。Frequency Analysis of English Usage。Frequency Analysis of English Usage Lexicon & Grammar，18，64-70。
4.	Lo, T. W.、He, B.、Ounis, I.(2005)。Automatically Building a Stopword List for an Information Retrieval System。Journal of Digital Information Management，3(1)，3-8。
5.	熊文新、宋柔(2007)。信息檢索用戶查詢語句的停用詞過濾。計算機工程，33(6)，195-197。
6.	周欽強、孫炳達、王義(2005)。文本自動分類系統文本預處理方法的研究。計算機應用研究，2005(2)，85-86。
7.	Tomov, D. T.(2001)。Some Critical Remarks on the Stop Word Lists of ISI Publications。Journal of Documentation，57(6)，798-808。
8.	化柏林(2007)。知識抽取中的停用詞處理技術。現代圖書情報技術，2007(8)，48-51。
9.	Fox, C.(1990)。A Stop List for General Text。ACM SIGIR Forum，24(1/2)，19-21。
10.	陳欣、張菁、李曉光(2011)。一種面向中文敏感網頁識別的文本分類方法。測控技術，30(5)，27-31。
11.	顧益軍、樊孝忠、王建華(2005)。中文停用詞表的自動選取。北京理工大學學報，25(4)，337-340。
12.	崔彩霞(2008)。停用詞的選取對文本分類效果的影響研究。太原師範學院學報：自然科學版，7(4)，91-93。
13.	王素格、魏英杰(2008)。停用詞表對中文文本情感分類的影響。情報學報，27(2)，175-179。
14.	黃磊、伍雁鵬、朱群峰(2014)。關鍵詞自動提取方法的研究與改進。計算機科學，41(6)，204-207。
15.	孫國菊、張杰(2005)。中文文本分類的特徵選取評價。哈爾濱理工大學學報，10(1)，76-78。
16.	于娟、尹積棟、費庶(2013)。基於句法結構分析的同義詞識別方法研究。現代圖書情報技術，2013(9)，35-40。
17.	費洪曉、康松林、朱小娟(2005)。基於詞頻統計的中文分詞的研究。計算機工程與應用，41(7)，67-68。

Conference papers
1.	Feldman, R.、Dagan, I.(1995)。Knowledge Discovery in Textual Databases (KDT)。International Conference on Knowledge Discovery and Data Mining，112-117。
2.	Yang, B. Y.、Pedersen, J. O.(2010)。A Comparative Study on Feature。International Conference on Machine Learning。
3.	Silva, C.、Ribeiro, B.(2003)。The Importance of Stop Word Removal on Recall Values in Text Categorization。The International Joint Conference on Neural Networks，20-24。
4.	Zou, F.、Wang, F. L.、Deng, X.(2006)。Automatic Construction of Chinese Stop Word List。The International Conference on Applied Computer Science，16-18。
5.	Makrehchi, M.、Kamel, M. S.(2008)。Automatic Extraction of Domain-Specific Stopwords from Labeled Documents。European Conference on IR Research。Glasgow。222-233。

Research reports
1.	Ahonen-Myka, H.、Heinonen, O.、Klemettinen, M.(1997)。Applying Data Mining Techniques in Text Analysis。Department of Computer Science, University of Helsinki。

Theses and dissertations
1.	江兆中(2010)。基於語境和停用詞驅動的中文自動分詞研究(碩士論文)。合肥工業大學，合肥。
2.	周姚(2011)。基於雲計算的文本挖掘技術研究(碩士論文)。國防科學技術大學，長沙。
3.	華林森(2014)。中文文本情感分類研究(碩士論文)。重慶大學，重慶。
4.	李梅(2010)。改進的K均值算法在中文文本聚類中的研究(碩士論文)。安徽大學，合肥。
5.	胡曉輝(2008)。基於團結構的文本分類技術研究(碩士論文)。江西師範大學，南昌。

Books
1.	Frakes, W. B.、Baeza-Yates, R.(1992)。Information Retrieval Data Structures and Algorithms。Prentice Hall。
2.	Van Rijsbergen, C. J.(1975)。Information Retrieval。London：Butterworths。

Other
1.	數據堂。文本分類語料庫（復旦）測試語料，http://www.datatang.com/datares/go.aspx?dataid=615059。
2.	數據堂。中文文本分類語料，http://www.datatang.com/data/11971/。
3.	數據堂。停用詞集合，http://www.datatang.com/data/19300/。
