:::

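Title: Applying Google Word Associations to Multi-Domain Document Classification (Google文字關聯在多領域文件分類上的應用)
Author: Ping-I Chen (陳棅易)
Institution: National Central University
Department: Graduate Institute of Information Management
Advisor: 林熙禎
Degree: Doctoral
Publication date: 2011
Subject keywords: keyword sequence; information retrieval; classification; similarity distance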
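Traditional document classification requires first downloading all documents to a local computer, then extracting candidate keywords through a keyword-importance calculation to form each document's representative sequence, and finally classifying the documents with a document-vector comparison algorithm. However, as information on the Web has matured, users routinely browse documents and web pages that span many different knowledge domains. Training keywords for every domain in order to extract representative sequences for cross-domain classification wastes considerable resources and is inefficient. Moreover, because information is updated and expanded without limit, the sequence dimensionality of each domain grows extremely large and consumes substantial computation and storage resources. This thesis builds on our own improved GCD algorithm: the importance of each word is computed from the ratio of the number of web pages Google holds for each keyword, and the results are assembled into a keyword network (WANET). A sequence-extraction algorithm then finds the K most representative keywords in the word network (K ≤ 4) to serve as the representative sequence. Because these representative sequences are so short, traditional vector comparison algorithms are not applicable in this setting. We therefore also developed a search-engine-based algorithm, the Google Purity measurement, as the basis for vector comparison. Because every algorithm in the system computes from search-engine page counts, the system can classify across domains in real time. Our experiments also confirm that when the specialized terms in the documents to be classified are seldom used in other domains, very high classification precision can be achieved. The only drawback of our system is that its frequent Google queries make overall execution less efficient than traditional vector comparison. However, we do not need to collect a training set in advance, and the vectors do not grow without bound as documents are added, so in the long run our method will be more efficient than traditional approaches. We believe that further improvements can raise both overall precision and computational efficiency, helping users organize the information they have learned more effectively, and that the same algorithms can find useful information to recommend to users in real time as a reading aid.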
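The abstract does not give the formula of the improved GCD algorithm, so the following is not the thesis's method. As a hedged point of reference, it sketches the well-known Normalized Google Distance of Cilibrasi and Vitányi, which scores the relatedness of two terms purely from search-engine page counts. The page counts and the index size `n` are hypothetical inputs supplied by the caller; no real search API is called.

```python
import math

def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance from page counts.

    fx, fy : page counts for each term alone (assumed inputs)
    fxy    : page count for pages containing both terms
    n      : assumed total number of indexed pages, larger than fx and fy
    Smaller values mean the terms co-occur more often, i.e. are more related.
    """
    if min(fx, fy, fxy) <= 0:
        # Terms that never appear, or never co-occur, are treated as unrelated.
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Hypothetical counts: two terms that always co-occur vs. two that rarely do.
always_together = normalized_google_distance(1000, 1000, 1000, 10**6)
rarely_together = normalized_google_distance(1000, 1000, 10, 10**6)
```

Because the score depends only on counts returned at query time, no document collection or training set has to be stored locally, which is the property the abstract relies on for real-time cross-domain classification.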
Classifying information automatically and efficiently has become increasingly important in recent years. Search engines let us collect knowledge of all kinds to improve decision making, and document classification systems help manage the resulting knowledge repository. Such systems typically construct a keyword vector, often containing thousands of words, to represent each knowledge domain, so the computational complexity of the classification algorithm is very high. Users must also download all the documents before keywords can be extracted and the documents classified. In this thesis, we describe a new algorithm called “Word AdHoc Network” and use it to extract the most important keyword sequence for each document. Each keyword sequence is composed of no more than four keywords. We also introduce a new similarity measurement algorithm, called “Google Purity,” which calculates the similarity between the extracted keyword sequences so that similar documents are classified together. With this system, information from different knowledge domains can be classified at the same time, and all processing runs in real time without any pre-established keyword repository. Our experiments show that classification is highly accurate when the specialized terms in a document are seldom used in other domains. The only weakness of our system is that its execution time is longer than that of the cosine method. However, no time is spent selecting training data, and the vector for each domain stays at no more than four keywords. This new system can improve the efficiency of document classification and make it more usable in Web-based information management.
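To illustrate why an exact-match vector comparison such as the cosine method struggles with sequences of at most four keywords, here is a minimal sketch. It is an illustration under stated assumptions, not code from the thesis: the keyword sequences are hypothetical, and each sequence is treated as a bag of words.

```python
import math
from collections import Counter

def cosine_similarity(seq_a, seq_b):
    """Cosine similarity between two keyword sequences treated as term-count vectors."""
    a, b = Counter(seq_a), Counter(seq_b)
    # Only terms present in both sequences contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical 4-keyword sequences for three documents on related topics.
doc1 = ["keyword", "extraction", "search", "engine"]
doc2 = ["term", "mining", "web", "retrieval"]
doc3 = ["keyword", "extraction", "web", "retrieval"]

sim_disjoint = cosine_similarity(doc1, doc2)  # no shared words
sim_partial = cosine_similarity(doc1, doc3)   # two of four words shared
```

With so few terms per document, two related documents often share no exact keyword and score zero, which is the gap a search-engine-based measure such as Google Purity is meant to fill.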