:::

Record Details

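Title: Research on Methods for Identifying Discourse-Level Coordinate Text Blocks (篇章級並列關係文本塊識別方法研究)
Journal: Data Analysis and Knowledge Discovery (數據分析與知識發現)
Authors: Pei Jingjing (裴晶晶); Le Xiaoqiu (樂小虬)
Publication date: 2019
Volume/issue: 2019(5)
Pages: 51-56
Subject keywords: Coordinate relationship; Text representation; Text block; Deep learning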
Related counts:
  • Times cited: journals (0), doctoral dissertations (0), books (0), book chapters (0)
  • Excluding self-citations: 0
  • Co-citations: 0
  • Views: 1
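[Objective] To identify text blocks in scientific papers that are distributed across different paragraphs and stand in a coordinate relationship both semantically and in their visual layout, to capture the textual features of coordinate relationships, and to provide a pre-trained model for recognizing coordinate-relationship knowledge objects. [Methods] Taking the paragraph as the processing unit, layout (visual) features are added on top of character vectors and word vectors, so that coordinate text at different levels is represented with multi-dimensional features. A convolutional neural network (CNN) model is then trained as a text classifier on the annotated data, yielding a recognition model for coordinate-relationship text blocks. [Results] In experiments on a manually annotated dataset of scientific papers, the classification precision for coordinate text blocks reached 96%, about 3% higher than the baseline model, and recall was about 2% higher. [Limitations] The method applies only to HTML web page text; other text formats require further research and experiments. [Conclusions] Taking the paragraph as the processing unit and combining multiple features, a convolutional neural network model can efficiently identify discourse-level coordinate text blocks and can serve as a pre-trained model for recognizing coordinate-relationship knowledge objects.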
[Objective] This paper proposes a method that uses semantic and layout features to identify coordinate text blocks distributed across different paragraphs. It also provides a pre-trained model for recognizing coordinate knowledge objects. [Methods] First, we used each paragraph as a processing unit and added layout features to the character and word vectors. Then, we concatenated these multi-dimensional features to represent each paragraph. Finally, we trained a convolutional neural network (CNN) model on the annotated data and obtained a recognition model for coordinate-relationship text blocks. [Results] On manually annotated scientific papers, the proposed approach achieved a precision of 96%, which was about 3% higher than that of the baseline model. The recall was also improved by about 2%. [Limitations] Our model works only with HTML files. More research is needed to examine it with other data formats. [Conclusions] The proposed method can effectively identify coordinate text blocks in discourses and can be used as a pre-trained model for recognizing coordinate knowledge objects.
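
The pipeline described in the abstract (paragraph as unit, word and character vectors plus layout features, concatenation, CNN classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the dimensions, the random stand-in embeddings, the two layout features, and the function names `represent`, `conv_max_pool`, and `classify` are all assumptions chosen for illustration, and the weights are untrained, so the output score carries no meaning beyond showing the data flow.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

# Hypothetical sizes; the abstract does not state the real ones.
WORD_DIM, CHAR_DIM, LAYOUT_DIM = 4, 3, 2
KERNEL, FILTERS = 2, 5
INPUT_DIM = WORD_DIM + CHAR_DIM + LAYOUT_DIM


def rand_vec(n):
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


def embed(token, dim, table):
    """Look up (or lazily create) a random stand-in embedding."""
    if token not in table:
        table[token] = rand_vec(dim)
    return table[token]


def char_vector(word, table):
    """Average the character embeddings of a word."""
    vecs = [embed(c, CHAR_DIM, table) for c in word]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def represent(paragraph, layout, word_table, char_table):
    """One row per token: word vector + character vector + layout features."""
    rows = []
    for w in paragraph.split():
        row = embed(w, WORD_DIM, word_table) + char_vector(w, char_table) + list(layout)
        rows.append(row)
    return rows


def conv_max_pool(rows, filters):
    """1-D convolution over token windows, ReLU, then max-over-time pooling."""
    pooled = []
    for weights in filters:
        best = 0.0  # starting at 0 acts as the ReLU floor
        for i in range(len(rows) - KERNEL + 1):
            window = [x for row in rows[i:i + KERNEL] for x in row]
            best = max(best, sum(w * x for w, x in zip(weights, window)))
        pooled.append(best)
    return pooled


def classify(rows, filters, out_weights):
    """Sigmoid score that the paragraph belongs to a coordinate text block."""
    pooled = conv_max_pool(rows, filters)
    z = sum(w * p for w, p in zip(out_weights, pooled))
    return 1.0 / (1.0 + math.exp(-z))


word_table, char_table = {}, {}
filters = [rand_vec(KERNEL * INPUT_DIM) for _ in range(FILTERS)]
out_weights = rand_vec(FILTERS)

# Hypothetical layout features: (starts with a list marker, rendered in bold).
layout = (1.0, 0.0)
rows = represent("first we collect the data", layout, word_table, char_table)
score = classify(rows, filters, out_weights)
```

Repeating the paragraph-level layout vector on every token row is one simple way to realize the concatenation; the abstract does not say whether the authors attach layout features per token or once per paragraph.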
 
 
 
 