Topic Extraction Based on Techniques of Term Extraction and Term Clustering__Taiwan Citation Index - Humanities and Social Sciences

:::

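Title:	Topic Extraction Based on Techniques of Term Extraction and Term Clustering
Journal:	International Journal of Computational Linguistics & Chinese Language Processing
Author:	林頌堅
Author (romanized):	Lin, Sung-chen
Publication date:	2004
Volume/issue:	9:1
Pages:	97-111
Subject keywords:	Topic extraction (主題抽取); Term extraction (術語抽取); Term clustering (術語叢集)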
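Related counts:	Cited by: journal articles (1), doctoral dissertations (0), books (0), book chapters (0); excluding self-citations: 1; co-citations: 0; views: 29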

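This paper addresses the problem of topic extraction and proposes a series of techniques based on natural language processing. These techniques extract important terms from academic papers and cluster them according to their co-occurrence relations; the resulting term groups represent the important topics of a field, giving researchers an outline of the field and helping to clarify their information needs. We applied the proposed method to papers from the ROCLING conference. The results show that the method extracts both Chinese and English terms in computational linguistics, and that the resulting term clusters represent important topics in the field. This preliminary study verifies the feasibility of the proposed method. The important topics include machine translation, speech processing, information retrieval, grammar models and parsing, word segmentation, and statistical language models. The results also show that computational linguistics research is closely tied to practical applications.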

In this paper, we propose a series of natural language processing techniques for extracting important topics in a given research field. Topics, as defined in this paper, are the important research problems, theories, and technical methods of the examined field, and we represent them with groups of relevant terms. The terms are extracted from the texts of papers published in the field, including titles, abstracts, and bibliographies, because these texts convey important research information and are relevant to knowledge in that field. The topics can provide a clear outline of the field for researchers and are also useful for identifying users’ information needs when applied to information retrieval. To facilitate topic extraction, key terms in both Chinese and English are extracted from papers and clustered into groups of terms that frequently co-occur with each other. First, a PAT-tree is generated that stores all possible character strings appearing in the texts of the papers. Character strings are retrieved from the PAT-tree as candidate terms and are tested using statistical information about each string to filter out implausible candidates. The statistical information for a string includes (1) the total frequency count of the string in all the input papers, (2) the sum of the average and the standard deviation of the string's frequency in each paper, and (3) the complexity of the characters adjacent to the front and rear of the string. The total frequency count of the string and the sum of its average frequency and standard deviation are used to measure the importance of the corresponding term to the field. The complexity of the adjacent characters is a criterion for determining whether the string is a complete token of a term. The lower the complexity of the adjacent characters, the more likely the string is a partial token of other terms. Finally, if the leftmost or rightmost part of a string is a stop word, the string is also filtered out.
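The filtering criteria described above can be illustrated with a minimal Python sketch. It is not the author's implementation: brute-force substring counting stands in for the PAT-tree, "complexity" is assumed here to mean the Shannon entropy of the neighbouring characters, and the function names, thresholds, and parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict
from statistics import mean, pstdev


def entropy(counter):
    """Shannon entropy (in bits) of a Counter of neighbouring characters."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def string_statistics(docs, max_len=4):
    """For every substring of up to max_len characters, collect its
    per-document frequencies and its left/right neighbour characters.
    (Assumption: plain enumeration replaces the PAT-tree.)"""
    per_doc = defaultdict(lambda: [0] * len(docs))
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for d, text in enumerate(docs):
        for i in range(len(text)):
            for n in range(1, max_len + 1):
                if i + n > len(text):
                    break
                s = text[i:i + n]
                per_doc[s][d] += 1
                if i > 0:
                    left[s][text[i - 1]] += 1
                if i + n < len(text):
                    right[s][text[i + n]] += 1
    return per_doc, left, right


def extract_terms(docs, stop_words, min_total=3, min_spread=1.0,
                  min_complexity=0.5, max_len=4):
    """Keep candidate strings that pass all filters; thresholds are hypothetical."""
    per_doc, left, right = string_statistics(docs, max_len)
    terms = []
    for s, freqs in per_doc.items():
        if len(s) < 2:
            continue
        total = sum(freqs)                    # (1) total frequency count
        spread = mean(freqs) + pstdev(freqs)  # (2) average + standard deviation
        # (3) adjacent-character complexity, assumed to be neighbour entropy
        complexity = min(entropy(left[s]), entropy(right[s]))
        if total < min_total or spread < min_spread:
            continue
        if complexity < min_complexity:
            continue  # likely a partial token of a longer term
        if any(s.startswith(w) or s.endswith(w) for w in stop_words):
            continue  # leftmost or rightmost part is a stop word
        terms.append(s)
    return terms
```

Taking the minimum of the left and right entropies means a string is rejected when either side is too predictable, which is one plausible reading of the completeness criterion; the source does not specify how the two sides are combined.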
The extracted terms are then clustered into term groups according to their co-occurrences. The clustering algorithm combines several techniques to obtain multiple clustering results, including the clique algorithm and a group merging procedure. When the clique algorithm is performed, the latent semantic indexing technique is used to estimate the relevance between two terms, compensating for the sparseness of term co-occurrences in the papers. Two term groups are further merged into a new one when their members are similar, because such clusters may represent the same topic. The above techniques were applied to the proceedings of ROCLING to uncover topics in the field of computational linguistics. The results show that key terms in both Chinese and English were extracted successfully, and that the clustered groups represented the topics of computational linguistics. This initial study therefore demonstrated the feasibility of the proposed techniques. The extracted topics included “machine translation,” “speech processing,” “information retrieval,” “grammars and parsers,” “Chinese word segmentation,” and “statistical language models.” From the results, we can observe that there is a close relation between basic research and applications in computational linguistics.
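The clustering stage can be sketched in Python along the following lines. This is an illustration under stated assumptions rather than the paper's algorithm: raw document-level co-occurrence counts replace the latent semantic indexing relevance estimate, cliques are enumerated with the basic Bron-Kerbosch recursion, group similarity is assumed to be Jaccard overlap, and all function names and thresholds are hypothetical.

```python
from itertools import combinations


def cooccurrence_graph(doc_terms, min_cooccur=2):
    """Link two terms when they appear together in at least
    min_cooccur documents (assumption: no LSI smoothing)."""
    counts = {}
    for terms in doc_terms:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    graph = {}
    for (a, b), c in counts.items():
        if c >= min_cooccur:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


def merge_groups(groups, min_overlap=0.5):
    """Repeatedly merge two groups whose Jaccard overlap reaches
    min_overlap (assumed similarity measure)."""
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            a, b = groups[i], groups[j]
            if len(a & b) / len(a | b) >= min_overlap:
                groups[i] = a | b
                del groups[j]
                merged = True
                break
    return groups
```

A typical call chain would be `merge_groups(maximal_cliques(cooccurrence_graph(doc_terms)))`, where `doc_terms` is a list holding the extracted terms of each paper. Because cliques may overlap, a term can belong to more than one group before merging.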

Journal articles
1.	Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.
2.	(1993). Introduction to the Special Issue on Computational Linguistics Using Large Corpora. Computational Linguistics, 19(1), 1-24.

Conference papers
1.	Chien, Lee Feng (1997). PAT-Tree-Based keyword extraction for Chinese information retrieval. The 20th ACM SIGIR Conference on Research and Development in Information Retrieval, 50-58.
2.	Wayne, C. L. (2000). Topic Detection and Tracking in English and Chinese. 165-172.
3.	Yang, Y.; Pierce, T.; Carbonell, J. (1998). A Study on Retrospective and On-Line Event Detection. Australia. 28-36.
4.	Zhang, Jian; Gao, Jian-Feng; Zhou, Ming (2000). Extraction of Chinese Compound Words: An Experimental Study on a Very Large Corpus. Hong Kong. 132-139.

Theses and dissertations
1.	Tabah, Albert N. (1996). Information Epidemics and the Growth of Physics, Canada.

Books
1.	Kowalski, Gerald J.; Maybury, M. T. (2000). Document and term clustering. Information storage and retrieval systems: theory and implementation. Boston, MA.

Other
1.	Su, Keh-Yih; Wu, M. W.; Chang, Jyun-Sheng (1994). A Corpus-based Approach to Automatic Compound Extraction, New Mexico.
2.	簡立峰; Chen, Chun-Liang; 盧文祥; Chang, Yuan-Lu (1999). Recent Results on Domain-Specific Term Extraction From Online Chinese Text Resources.
3.	Hatzivassiloglou, V.; Gravano, Luis; Maganti, A. (2000). An Investigation of Linguistic Features and Clustering Algorithms for Topical Document Clustering.
4.	Lenders, Winfried (2001). Past and Future Goals of Computational Linguistics.

:::
