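Title: A Study of Automatic Classification of Chinese Verbs Based on Morphological Rules and the Similarity Method (以構詞律與相似法為本的中文動詞自動分類研究)
Journal: International Journal of Computational Linguistics & Chinese Language Processing
Authors: 曾慧馨, 劉昭麟, 高照明, 陳克健
Publication date: 2002
Volume/issue: 7:1
Pages: 1-27
Keywords: morphological rules; similarity method; Chinese verbs; automatic classification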
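This paper combines two methods to predict the syntactic category of unknown verbs. The first is a rule-based method, which induces from the training corpus the morphological regularities by which unknown verbs are formed. It decides in two main ways: (1) by the key character that an unknown verb contains, and (2) by the verb's compositional pattern.

The key-character method first divides verbs by length into four groups: two-character words, three-character words, four-character words, and words of five or more characters. Observation of real corpus data showed that verbs of different lengths differ in structure, so the corpus is grouped by word length. For example, two rules, 「好」 and 「出」, that determine a verb's category can be trained from three-character words, whereas unknown verbs of other lengths have no such rules; likewise, the 「化」 rule does not apply to two-character verbs.

The second part of the rule-based method decides the category from the compositional pattern. While examining unknown verbs, we found that some of them are composed in highly regular ways, so we generalized the compositions of the unknown verbs in the training corpus and obtained nine patterns. Over ten experiments, the rule-based method could handle about 23.19% of unknown verbs on average, and 91.67% of its guesses were correct.

The second method, the similarity method, predicts the category of an unknown verb from examples similar to it. It mainly uses HowNet and the Academia Sinica Chinese sentence-structure treebank 1.0 (中央研究院中文句結構樹資料庫1.0) as tools for measuring semantic and categorial similarity. The similarity between the unknown verb and known verbs is computed, and the unknown verb receives the category of the example with the highest similarity.

An advantage of the similarity method is that, when the similarity is high, the similar words it finds can predict not only the syntactic category but also the semantic and structural class. When two words are highly similar, their syntactic category, semantic class, and structure must also be similar. Over ten experiments, the similarity method predicted verb categories with an accuracy of about 71.05%.

The strength of the rule-based method is its high accuracy; its weakness is that it can handle only a limited number of unknown verbs. The similarity method can handle most unknown verbs, but it is less accurate than the rule-based method. Finally, we combine the two methods to predict the classification of unknown verbs and apply both to the final test corpus: the accuracy of the rule-based method is 87.25%, that of the similarity method is 65.04%, and the accuracy of the two combined is 70.80%.

* Institute of Information Science, Academia Sinica. 曾慧馨, e-mail: huihsin@iis.sinica.edu.tw; 陳克健, e-mail: kchen@iis.sinica.edu.tw + 政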
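The key-character lookup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table `KEY_CHAR_RULES`, the function names, and the category labels `CAT_A`, `CAT_B`, `CAT_C` are hypothetical placeholders, the key character is assumed to be the final character of the word, and the 「化」 rule is assumed to apply to every group except two-character words. The actual rules and categories are not given in this record.

```python
from typing import Dict, Optional

# Hypothetical rule table: word-length group -> {final character -> category}.
# The category labels are placeholders, not the categories used in the paper.
KEY_CHAR_RULES: Dict[int, Dict[str, str]] = {
    2: {},                                            # no 化 rule for two-character verbs
    3: {"好": "CAT_A", "出": "CAT_B", "化": "CAT_C"},  # 好/出 rules exist only here
    4: {"化": "CAT_C"},
    5: {"化": "CAT_C"},                               # group for five or more characters
}


def length_group(word: str) -> int:
    """Map a word to its length group; five or more characters share one group."""
    return min(len(word), 5)


def classify_by_key_char(word: str) -> Optional[str]:
    """Return a category if a key-character rule fires, otherwise None."""
    if not word:
        return None
    rules = KEY_CHAR_RULES.get(length_group(word), {})
    # Assumption: the key character is the last character of the word.
    return rules.get(word[-1])
```

Returning `None` when no rule fires keeps the low coverage of the rule-based method explicit, so a caller can fall back to another classifier for the remaining words.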
In this paper we present a hybrid approach to the automatic classification of Chinese unknown verbs. The first method of the hybrid approach uses a set of morphological rules, summarized from the training data (the set of compound verbs extracted from the Sinica Corpus), to determine the category of an unknown compound verb. If the morphological rules are not applicable, instance-based categorization using the k-nearest neighbor method is employed instead. We observed that some suffix morphemes occur frequently in compound verbs and also uniquely determine the syntactic categories of the resulting compound verbs. By processing the training data, 15 suffix rules with coverage over 2% and category prediction accuracy higher than 80% were derived. In addition to this type of morphological rule, reduplication rules are also useful for category prediction, for example well-known Chinese reduplication patterns such as “aa” in two-character words and “aab” and “abb” in three-character words. For instance, “喝喝茶” has the same category as “喝茶,” and “研究研究” has the same category as “研究.” In total, nine reduplication patterns were generated. In experiments on the training data, the overall accuracy of the morphological rule classifier is 91.67%, but its coverage is only 23.19%. Because the coverage of the morphological rule classifier is low, an instance-based categorization method is employed to take care of the uncovered cases. Instance-based categorization uses similar examples to predict the category of an unknown verb. Lexical similarity is measured by both semantic similarity and syntactic similarity. The semantic similarity between two words is measured by the semantic distance between their HowNet definitions, and the syntactic similarity is measured by the distance between their syntactic categories.
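As a rough illustration of the reduplication rules, the sketch below detects a few patterns and maps a reduplicated word back to a base form whose category can be looked up. The function names and the `lexicon` dictionary are hypothetical. Only four patterns are covered: the “abab” pattern is inferred from the 研究研究 example, and reducing “abb” to its first two characters is an assumption. The full set of nine patterns is not listed in this record.

```python
from typing import Dict, Optional, Tuple


def reduplication_base(word: str) -> Optional[Tuple[str, str]]:
    """Return (pattern, base form) if the word is reduplicated, else None."""
    n = len(word)
    if n == 2 and word[0] == word[1]:
        return ("aa", word[0])
    if n == 3 and word[0] == word[1] and word[1] != word[2]:
        return ("aab", word[1:])       # 喝喝茶 -> 喝茶
    if n == 3 and word[1] == word[2] and word[0] != word[1]:
        return ("abb", word[:2])       # assumption: base is the first two characters
    if n == 4 and word[:2] == word[2:] and word[0] != word[1]:
        return ("abab", word[:2])      # 研究研究 -> 研究
    return None


def classify_by_reduplication(word: str, lexicon: Dict[str, str]) -> Optional[str]:
    """Give a reduplicated word the category of its base form, if the base is known."""
    match = reduplication_base(word)
    if match is None:
        return None
    _, base = match
    return lexicon.get(base)           # None when the base is not in the lexicon
```

With a toy lexicon such as `{"喝茶": "CAT_A", "研究": "CAT_B"}` (placeholder labels), `classify_by_reduplication("研究研究", lexicon)` returns the category stored for 研究.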
The distance between two syntactic categories is the cosine measure of their grammatical feature vectors derived from the Sinica Treebank. The category of an unknown verb is predicted to be the same as that of the examples most similar to it according to the above similarity criteria. When tested on the training data, instance-based categorization reaches its best accuracy, 71.05%, when the similar examples are drawn from both unknown verbs and verbs in the dictionary (known verbs). Both the morphological rule classifier and instance-based categorization have the advantage of not only predicting the syntactic categories of unknown words but also recognizing their morphological structures and major semantic classes. The advantage of the morphological rule classifier is its higher accuracy; that of instance-based categorization is its higher coverage. However, each method has its own drawback: the former cannot be applied to most unknown verbs, while the latter suffers from lower accuracy. For the open test, 1000 unknown verbs unseen during training were tested. The accuracy of the morphological rule classifier is 87.25%, and that of instance-based categorization is 65.04%. Finally, the overall accuracy of the hybrid approach is 70.80%.
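The cosine measure and the rule-first fallback can be sketched as below. This is a simplified illustration, not the paper's implementation: each word is represented by a single hypothetical feature vector standing in for the combined HowNet and Sinica Treebank similarity, only the single most similar example is used (k = 1), and all function names are assumptions.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_category(
    query: Sequence[float],
    examples: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Return the category of the example most similar to the query vector."""
    best_cat: Optional[str] = None
    best_sim = -math.inf
    for vec, cat in examples:
        sim = cosine(query, vec)
        if sim > best_sim:
            best_sim, best_cat = sim, cat
    return best_cat


def hybrid_classify(
    word: str,
    rule_classifier: Callable[[str], Optional[str]],
    query: Sequence[float],
    examples: List[Tuple[Sequence[float], str]],
) -> Optional[str]:
    """Try the rule classifier first; fall back to the nearest example if no rule fires."""
    category = rule_classifier(word)
    if category is not None:
        return category
    return nearest_category(query, examples)
```

Trying the rules first reflects the trade-off reported in the abstract: the rules are more accurate but cover fewer verbs, so the similarity step handles whatever the rules leave uncovered.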
 
 
 
 