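Cross-lingual document clustering organizes documents written in different languages into clusters according to content or topic. This paper uses cross-lingual word similarity computation to extend the monolingual Generalized Vector Space Model (GVSM) to cross-lingual document representation, yielding the Cross-Lingual Generalized Vector Space Model (CLGVSM), and compares the performance of different similarity measures for document clustering. It also proposes a feature selection algorithm suited to GVSM. Experiments show that when the GVSM is built with the SOCPMI word similarity measure, cross-lingual document clustering outperforms LSA.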
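To make the GVSM idea concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the commonly used GVSM form, in which document similarity is the bilinear product x^T S y normalized in cosine style, with S a term-term similarity matrix over a joint bilingual vocabulary. The vocabulary, the values in `SIM`, and the function names are hypothetical; in CLGVSM the entries of S would come from a cross-lingual word similarity measure such as SOCPMI.

```python
import math

# Hypothetical joint vocabulary: two English and two Chinese terms.
VOCAB = ["bank", "river", "银行", "河流"]

# Hypothetical symmetric term-term similarity matrix S (toy values, not
# SOCPMI output). Cross-lingual pairs such as "bank"/"银行" score high.
SIM = [
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.9],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.9, 0.1, 1.0],
]


def bilinear(x, sim, y):
    """Return x^T S y for vectors and a matrix stored as Python lists."""
    return sum(
        x[i] * sim[i][j] * y[j]
        for i in range(len(x))
        for j in range(len(y))
    )


def gvsm_similarity(x, y, sim):
    """Cosine-style GVSM similarity: x^T S y / sqrt((x^T S x)(y^T S y))."""
    denom = math.sqrt(bilinear(x, sim, x) * bilinear(y, sim, y))
    return bilinear(x, sim, y) / denom if denom else 0.0


def vsm_cosine(x, y):
    """Plain VSM cosine, i.e. GVSM with S set to the identity matrix."""
    n = len(x)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return gvsm_similarity(x, y, identity)


# Term-frequency vectors over VOCAB (assumed toy documents).
doc_en = [2.0, 0.0, 0.0, 0.0]        # English document: "bank" twice
doc_zh_bank = [0.0, 0.0, 1.0, 0.0]   # Chinese document: "银行" once
doc_zh_river = [0.0, 0.0, 0.0, 1.0]  # Chinese document: "河流" once

if __name__ == "__main__":
    print(vsm_cosine(doc_en, doc_zh_bank))
    print(gvsm_similarity(doc_en, doc_zh_bank, SIM))
    print(gvsm_similarity(doc_en, doc_zh_river, SIM))
```

Under the plain vector space model the English and Chinese documents share no terms, so their cosine is zero. With the similarity matrix, the document about "bank" scores much closer to the one about "银行" than to the one about "河流", which is the property a clustering algorithm can then exploit.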
This paper clarifies the definition of alignment from the viewpoint of linguistic similarity. Many alignment algorithms with very high precision have been proposed, but they target languages that belong to the Occidental family. We propose a new method for alignment between languages that do not belong to the same language family. In contrast to most previously proposed methods, which rely heavily on statistics, our method uses linguistic knowledge to overcome the problems of the statistical model. Experimental results confirm that the algorithm can align over 85% of word pairs while maintaining a comparably high precision rate, even when a small corpus is used in training.