Research Title
Web Documents Similarity Using K-Shingle Tokens and MinHash Technique
Author / Editor / Publisher
مهدي عبادي مانع الموسوي
Citation Information
مهدي عبادي مانع الموسوي, Web Documents Similarity Using K-Shingle Tokens and MinHash Technique, Time 03/10/2018 17:34:06: College of Information Technology
Abstract Description
Document similarity
Full Abstract
Abstract: Web search engines rely on effective data mining techniques to discard similar documents from their search results. In massive data mining, document similarity techniques are important for detecting mirror pages and similar articles in a large web repository, which avoids showing two near-identical web pages at the top of the search results. One document similarity approach is based on K-shingles: a K-shingle is a unique sequence of K consecutive words (K is a positive integer) that can be used to find the similarity between two documents. Large web documents can be represented as sets of long bit vectors, where 1 means the shingle is found in the document and 0 means it is not. Two near-identical documents should have many shingles in common. The similarity ratio between two documents is calculated using a measure such as Jaccard similarity. Jaccard similarity works well for comparing pairs of sets and computing a similarity score in a small dataset. For large datasets, MinHash and Locality-Sensitive Hashing (LSH) address the scaling problem by providing a small signature matrix that gives a fast approximation of the true Jaccard similarity in less time. In this study, we apply Jaccard similarity, MinHash, and LSH based on K-shingles to different numbers of documents. The results show that MinHash and LSH produce more accurate results in less time for large documents. In the experiments, the chosen K-shingle is applied to collections ranging from 100, 200, 300 up to 1000 documents, and the number of hash functions is set to 10, 20, and 30. The average similarity time is <5 sec. False positives and false negatives were minimal with respect to the true clustering of the documents.
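The pipeline described in the abstract (word-level K-shingles, exact Jaccard similarity, MinHash signatures, and LSH banding) can be illustrated with a minimal Python sketch. This is not the author's implementation: the function names, the MD5-based shingle IDs, the random linear hash family modulo a Mersenne prime, the sample sentences, and the choices K = 2, 30 hash functions, and 10 bands are all assumptions made for illustration.

```python
import hashlib
import random
from collections import defaultdict

# Assumption: a large prime modulus for the hash family (2**61 - 1 is a Mersenne prime).
PRIME = (1 << 61) - 1

def k_shingles(text, k):
    """Return the set of K-shingles: tuples of k consecutive words."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def shingle_id(shingle):
    """Map a shingle to a stable 32-bit integer (illustrative choice: MD5 prefix)."""
    digest = hashlib.md5(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def make_hash_params(n, seed=0):
    """Draw n random (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]

def minhash_signature(shingles, params):
    """One signature column: the minimum hash value per hash function.
    Assumes the shingle set is non-empty."""
    ids = [shingle_id(s) for s in shingles]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def lsh_candidate_pairs(signatures, bands):
    """Split each signature into bands; documents sharing any band become candidates."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for doc, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc)
    pairs = set()
    for docs in buckets.values():
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                pairs.add((docs[i], docs[j]))
    return pairs

if __name__ == "__main__":
    # Hypothetical sample documents, not taken from the study's dataset.
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "completely different text about hashing and data mining",
    ]
    shingle_sets = [k_shingles(d, 2) for d in docs]
    params = make_hash_params(30, seed=1)
    sigs = [minhash_signature(s, params) for s in shingle_sets]
    print("exact Jaccard(0, 1):  ", jaccard(shingle_sets[0], shingle_sets[1]))
    print("MinHash estimate(0, 1):", estimate_similarity(sigs[0], sigs[1]))
    print("LSH candidate pairs:  ", lsh_candidate_pairs(sigs, bands=10))
```

In this sketch, more hash functions give a tighter MinHash estimate at the cost of longer signatures, while the number of bands trades false positives against false negatives in the LSH step.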