
Data Deduplication Approaches
Concepts, Strategies, and Challenges
- 1st Edition - November 25, 2020
- Imprint: Academic Press
- Editors: Tin Thein Thwel, G. R. Sinha
- Language: English
- Paperback ISBN: 978-0-12-823395-5
- eBook ISBN: 978-0-12-823633-8
In the age of data science, the rapidly increasing volume of data is a major concern for computing and data storage applications, and duplicated or redundant data is a central challenge in data science research. Data Deduplication Approaches: Concepts, Strategies, and Challenges shows readers the methods that can be used to eliminate multiple copies of the same file, as well as duplicated segments or chunks of data within files. As the amount of duplicated data keeps growing, deduplication has become an especially useful field of research for storage environments, in particular persistent data storage. The book gives an overview of the concepts and background of data deduplication, then demonstrates in technical detail the strategies and challenges of real-time implementations for big data, data science, and data backup and recovery. It also covers future research directions, case studies, and real-world applications of data deduplication, focusing on reduced storage, backup, recovery, and reliability.
- Includes data deduplication methods for a wide variety of applications
- Includes concepts and implementation strategies that help readers apply the suggested methods
- Provides a robust set of methods and helps readers choose judiciously among them for their own applications
- Focuses on reduced storage, backup, recovery, and reliability, which are the most important aspects of implementing data deduplication approaches
- Includes case studies
Biomedical engineers and researchers in biomedical engineering, applied informatics, and data science
Students and researchers in artificial intelligence, data analytics, and data science
- Cover image
- Title page
- Table of Contents
- Copyright
- Dedication
- List of contributors
- About the editors
- Preface
- Acknowledgement
- 1. Introduction to data deduplication approaches
- Abstract
- 1.1 Introduction
- 1.2 Methods of data deduplication
- 1.3 Classic research and classification of methods
- 1.4 File chunking and metadata
- 1.5 Implementation strategies
- 1.6 Performance evaluation and concluding remarks
- References
- 2. Data deduplication concepts
- Abstract
- 2.1 History
- 2.2 Need of data deduplication
- 2.3 Techniques for data redundancy removal
- 2.4 Problems with existing techniques
- 2.5 Redundant arrays of independent disks
- 2.6 Direct attached storage
- 2.7 Storage area network
- 2.8 Network attached storage
- 2.9 Comparison between direct attached storage, network attached storage, and storage area network
- 2.10 Data deduplication techniques
- 2.11 Benefits of data deduplication
- 2.12 How data deduplication operates
- 2.13 Hashing
- 2.14 Deduplication taxonomy
- 2.15 Deduplication versus compression
- 2.16 Challenges in data deduplication
- References
- 3. Concepts, strategies, and challenges of data deduplication
- Abstract
- 3.1 Deduplication approaches
- 3.2 Required components for data deduplication approaches
- 3.3 Centered on granularity for elimination of data duplication
- 3.4 Centered on location for elimination of data duplication
- 3.5 Centered on time for elimination of data duplication
- 3.6 Comparative discussion on different studied and prevailing data deduplication approaches and its challenges
- 3.7 Summary
- References
- 4. Existing mechanisms for data deduplication
- Abstract
- 4.1 Introduction
- 4.2 Classification of data deduplication techniques
- 4.3 Data deduplication in the cloud
- 4.4 Deduplication ratio
- 4.5 Importance of data deduplication
- 4.6 Deduplication for big data
- 4.7 Conclusion
- References
- 5. Classification criteria for data deduplication methods
- Abstract
- 5.1 Introduction
- 5.2 Granularity
- 5.3 Technique to handle duplicates
- 5.4 Locality assumptions for efficiency
- 5.5 Place
- 5.6 Time
- 5.7 Data format awareness
- 5.8 Indexing and techniques to find duplicates
- 5.9 Scope
- 5.10 Data type
- 5.11 Storage type
- 5.12 Conclusion
- References
- 6. File chunking approaches
- Abstract
- 6.1 Introduction
- 6.2 Materials and methods
- 6.3 File-level chunking
- 6.4 Implementation of file chunking
- 6.5 Case study: Deduplicator
- 6.6 Case study: Duplicates Cleaner
- 6.7 Conclusion
- 6.8 Bibliographic note
- 6.9 Supporting GitHub repositories and blogs
- References
- 7. Study of data deduplication for file chunking approaches
- Abstract
- 7.1 Introduction
- 7.2 Related literature
- 7.3 Conclusion
- References
- 8. Essentials of data deduplication using open-source toolkit
- Abstract
- 8.1 Introduction
- 8.2 Basic deduplication structure
- 8.3 Implementation using Python
- 8.4 Record linkage toolkit
- 8.5 Summary
- References
- 9. Efficient data deduplication scheme for scale-out distributed storage
- Abstract
- 9.1 Introduction
- 9.2 Distributed storage system
- 9.3 Related work
- 9.4 Overview of capacity optimization for scale-out distributed storage
- 9.5 Bloom filter array–based data deduplication scheme for scale-out distributed storage
- 9.6 Ensuring reliability in deduplication data by erasure-coded replication
- 9.7 Summary
- References
- 10. Identification of duplicate bug reports in software bug repositories: a systematic review, challenges, and future scope
- Abstract
- 10.1 Introduction
- 10.2 Motivation
- 10.3 Duplicate bug detection
- 10.4 Systematic review
- 10.5 Conclusion, challenges, and future scope
- References
- 11. A survey and critical analysis on energy generation from datacenter
- Abstract
- 11.1 Introduction
- 11.2 Datacenter framework
- 11.3 Power supply among different components of datacenter
- 11.4 Power distribution among different components of datacenter
- 11.5 Significance of efficient energy consumption models
- 11.6 Energy consumption reduction approaches
- 11.7 Conclusion
- References
- 12. Review of MODIS EVI and NDVI data for data mining applications
- Abstract
- 12.1 Introduction
- 12.2 MODIS vegetation indices
- 12.3 MODIS sinusoidal tiling system
- 12.4 MODIS file naming conversion
- 12.5 Data conversion
- 12.6 Quality assurance
- 12.7 Techniques to prepare EVI time series data set
- 12.8 Data mining–based land cover change detection
- 12.9 Summary
- References
- 13. Performance modeling for secure migration processes of legacy systems to the cloud computing
- Abstract
- 13.1 Data migration in cloud computing
- 13.2 Literature review
- 13.3 Proposed work
- 13.4 Proposed encryption approach
- 13.5 Result and conclusion
- References
- 14. DedupCloud: an optimized efficient virtual machine deduplication algorithm in cloud computing environment
- Abstract
- 14.1 Introduction
- 14.2 Motivation
- 14.3 Literature review
- 14.4 Data deduplication on cloud storage systems
- 14.5 DedupCloud: proposed methodology for data deduplication in cloud
- 14.6 Conclusion
- References
- 15. Data deduplication for cloud storage
- Abstract
- 15.1 Introduction
- 15.2 Cloud storage
- 15.3 Data deduplication for cloud storage
- 15.4 Conclusion
- References
- 16. Data duplication using Amazon Web Services cloud storage
- Abstract
- 16.1 Introduction
- 16.2 The workflow of data deduplication
- 16.3 Deduplication in Amazon Web Services
- 16.4 How to deduplicate
- 16.5 Integrate and deduplicate datasets using AWS Lake Formation FindMatches
- 16.6 Additional services and benefits
- 16.7 Comparison of Cloud backup services with AWS, GCP, Azure
- 16.8 Key terms and definitions
- References
- 17. Game-theoretic analysis of encrypted cloud data deduplication
- Abstract
- 17.1 Introduction
- 17.2 Related work review and open research problems
- 17.3 Preliminaries and notations
- 17.4 Game-theoretic analysis of server-controlled deduplication
- 17.5 Game-theoretic analysis of client-controlled deduplication
- 17.6 Conclusion and future work
- Acknowledgment
- References
- 18. Data deduplication applications in cognitive science and computer vision research
- Abstract
- 18.1 Introduction
- 18.2 Redundancy and dimensionality reduction
- 18.3 Interactive deduplication
- 18.4 Image-specific data deduplication
- 18.5 Cognitive science load and dimensionality problem
- 18.6 Conclusion
- References
- Index
- Edition: 1
- Published: November 25, 2020
- Imprint: Academic Press
- No. of pages: 404
- Language: English
- Paperback ISBN: 9780128233955
- eBook ISBN: 9780128236338
Tin Thein Thwel
Tin Thein Thwel, PhD, is a Professor at Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar. She received her PhD in Information Technology from the University of Computer Studies, Yangon (UCSY), Myanmar. She serves as a reviewer and technical committee member of the International Conference on Computer and Applications (ICCA) in the areas of data deduplication, cyber security, data mining, and information retrieval. She has 16 years of university-level teaching experience, and her research interests include data deduplication, cyber security, data mining and data science, information retrieval, and distributed computing.
Affiliations and expertise
Professor, Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar
G. R. Sinha
Dr. G. R. Sinha is a Professor at Myanmar Institute of Information Technology (MIIT), Mandalay, Myanmar.
He has 255 research papers, book chapters, and books to his credit, including Analysis of Medical Modalities for Improved Diagnosis in Modern Healthcare, Biomedical Signal Processing for Healthcare Applications, Brain and Behavior Computing, and Data Science and Its Applications (Chapman and Hall/CRC Press); Advances in Biometrics (Springer); and Cognitive Informatics, Volumes 1 and 2, AI-Based Brain Computer Interfaces, and Data Deduplication Approaches (Elsevier Academic Press). He was Dean of Faculty and an Executive Council Member of CSVTU and has served as Distinguished Speaker in the field of Digital Image Processing for the Computer Society of India. His research interests include applications of machine learning and artificial intelligence in medical image analysis, biomedical signal analysis, computer-aided diagnosis, computer vision, and cognitive science.
Affiliations and expertise
Adjunct Professor, International Institute of Information Technology Bengaluru (IIITB), Bangalore, Karnataka, India