Detect/Remove Duplicate Images from a Dataset for Deep Learning


L. Manjusha, V. Suryanarayana


Deduplication compresses data by removing duplicate copies of information. It supports data management by efficiently reducing storage space and, in turn, energy consumption, and it improves storage utilization and reliability. As abundant data is generated from many sources, the need for such a system grows, since it efficiently reduces the number of duplicate copies of data. The technique optimizes a system by computing hash values of files. This paper aims to achieve that goal by detecting and eliminating duplicate data. The proposed system is a simple, easy-to-use framework that simplifies retrieval of data from storage and computes hash values using SHA (secure hash algorithm) and dHash (difference hash algorithm). Data in the form of text, images, audio, and video can be examined with the proposed approach. This paper also proposes new hash functions for indexing local image descriptors. These functions are first applied and evaluated as a range-neighbour algorithm, and we show that they obtain results similar to several state-of-the-art algorithms. In the context of near-duplicate image retrieval, we integrate the proposed hash functions into a bag-of-words approach. Because most other methods use a k-means-based vocabulary, they require an off-line learning stage, and the highest performance is obtained when the vocabulary is learned on the searched database. For applications where images are often added to or removed from the searched dataset, the learning stage must be repeated regularly to maintain high recall. We show that our hash functions in a bag-of-words approach achieve recall similar to a bag of words with a k-means vocabulary learned on the searched dataset, but our method does not require any learning stage.
It is thus well suited to near-duplicate image retrieval applications where the dataset evolves regularly, as there is no need to update the vocabulary to guarantee the best performance.
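To illustrate the two kinds of hashing named above, the following is a minimal pure-Python sketch (the function names and the 9x8 grid representation are ours, not the paper's): exact duplicates are detected by comparing SHA digests of the raw file bytes, while near duplicates are detected by comparing dHash values under a small Hamming-distance threshold. In practice an imaging library would first resize each image to the 9x8 grayscale grid assumed here.

```python
import hashlib

def exact_fingerprint(data: bytes) -> str:
    """SHA-256 digest of the raw file bytes: byte-identical files collide."""
    return hashlib.sha256(data).hexdigest()

def dhash(pixels) -> int:
    """64-bit difference hash of an image already reduced to a 9x8
    grayscale grid (8 rows of 9 brightness values in 0-255).

    Each bit records whether a pixel is brighter than its right
    neighbour, so the hash is stable under rescaling and small edits.
    """
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits; a small distance flags a near duplicate."""
    return bin(h1 ^ h2).count("1")
```

Two byte-identical files share the same SHA-256 digest, while two images whose dHash values differ in only a few of the 64 bits would be flagged as near duplicates even though their bytes differ.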
Keyterms: duplicate, k-means, bag of words, vocabulary
