JavaScript

Data structure: creating Storage class to store duplicate files

A data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. In JavaScript, a data structure is a value that refers to zero or more other values that represent data. Arrays and objects are built-in data structures and can be used to build more complex ones.

We’ll create a data class for our Duplicate File Finder app to store the results of the FileWalker class.

Before creating the data class, we need to decide whether a received file is a duplicate of another file. We consider a file a duplicate if the following properties match those of another file:

  • File size
  • File extension
  • File type
/* We'll compare the following file properties
   to find duplicate files */
if (size === anotherFile.size &&
    type === anotherFile.type &&
    ext === anotherFile.ext) {
  // the file is considered a duplicate
}

Note: this is pseudocode, not the actual code.
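
As a runnable counterpart to the pseudocode above, here is a minimal sketch that compares two plain objects. The helper name isSameFile and the record shape {size, type, ext} are assumptions made for illustration; they are not part of the FileWalker or Storage classes.

```javascript
// Hypothetical helper: returns true when two file records share
// the same size, type and extension.
function isSameFile(a, b) {
  return a.size === b.size &&
         a.type === b.type &&
         a.ext === b.ext;
}

// Made-up sample records for illustration only.
const first  = { size: 1024, type: 'hash1', ext: '.txt' };
const second = { size: 1024, type: 'hash1', ext: '.txt' };
const third  = { size: 2048, type: 'hash1', ext: '.txt' };

console.log(isSameFile(first, second)); // true
console.log(isSameFile(first, third));  // false
```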

Let’s start creating our data class, Storage, to store files:

const path = require('path');

class Storage {
 constructor () {
  this.files = {};
  this.dupFiles = {};
 }
}

The Storage class has two properties:

  • this.files stores the first file seen for each hash, extension and size (the “not duplicate” files)
  • this.dupFiles stores duplicate files

Next, we’ll create the add method:

add (file, stat, hash){
 
}

The add method has three parameters: file, stat and hash. Recall that the hash event of our FileWalker class gives us the following:

  • file The full path of the file, e.g. D:\>BrainBell\file.txt
  • stat The Stat object of the file
  • buffer The 4 KB file chunk
  • hash The 512-bit Whirlpool hash of buffer

The FileWalker class doesn’t provide the file type. Files usually store information such as their type in their header, so we’ll use the received hash in place of the file type, since it is the hash of the buffer read from the file header.
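
To make the idea concrete, here is a minimal sketch of how a header chunk could be hashed with Node's built-in crypto module. It is not the FileWalker implementation; the function name hashHeader and the SHA-512 fallback are assumptions. The fallback is there because Whirlpool support depends on the OpenSSL build that Node.js is linked against.

```javascript
const crypto = require('crypto');

// Hypothetical helper: hash the first chunk of a file (its "header").
// Tries Whirlpool first and falls back to SHA-512 (also 512 bits)
// when the local OpenSSL build does not provide Whirlpool.
function hashHeader(buffer) {
  let hasher;
  try {
    hasher = crypto.createHash('whirlpool');
  } catch (err) {
    hasher = crypto.createHash('sha512');
  }
  return hasher.update(buffer).digest('hex');
}

// Made-up sample buffers standing in for real file headers.
const headerA = Buffer.from('%PDF-1.7 sample header');
const headerB = Buffer.from('%PDF-1.7 sample header');
const headerC = Buffer.from('PK sample zip header');

console.log(hashHeader(headerA) === hashHeader(headerB)); // true
console.log(hashHeader(headerA) === hashHeader(headerC)); // false
console.log(hashHeader(headerA).length); // 128 hex characters = 512 bits
```

Identical header bytes always produce the same hash, which is why the hash can act as a rough fingerprint of the file type and leading content.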

We’ll build a nested JavaScript object, written as an object literal: {hash:{extension:{size:file}}}. The following examples show how we'll structure the received files:

Example: dupFiles data structure:

{
 hash1:{
  txt:{
   1024:["D:\>a.txt","D:\>b.txt"]
  },
  php:{
   1024:["D:\>a.php","D:\>b.php"]
  }
 }
}

Example: files data structure:

{
 hash1:{
  txt:{
   1024:"D:\>a.txt"
  },
  php:{
   1024:"D:\>a.php"
  }
 },
 hash2:{
  txt:{
   1034:"D:\>c.txt",
   1029:"D:\>d.txt"
  }
 }
}

In JavaScript, an object like this is a data structure that maps names to values, also called a dictionary. We’ll use the following technique to build the above data structure and find duplicate files.

The files object structure is:

  • files[hash]
    The files object stores unique hash
  • files[hash][ext]
    The hash object stores unique file extension ext
  • files[hash][ext][size]
    The ext object stores unique file size size
  • files[hash][ext][size] = file
    The size key stores a single file for comparison. For example, suppose file1.txt is stored in the files object as {hash:{txt:{430:file1.txt}}}. If another file, file2.txt, has the same hash, extension and size, {hash:{txt:{430:file2.txt}}}, we consider file2.txt a duplicate of file1.txt.
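
The lookup described above can be sketched as a small function. The name findExisting is hypothetical and the sample data is made up; optional chaining (?.) requires Node.js 14 or later. Two details are worth noting: path.extname returns the extension with its leading dot (".txt"), and object keys are always strings, so the numeric size 430 is stored under the key "430".

```javascript
// Hypothetical helper: returns the stored file for a given
// hash/extension/size combination, or undefined if none exists.
function findExisting(files, hash, ext, size) {
  return files[hash]?.[ext]?.[size];
}

// Made-up sample shaped like the files object described above.
const files = {
  hash1: {
    '.txt': { 430: 'file1.txt' }
  }
};

console.log(findExisting(files, 'hash1', '.txt', 430)); // file1.txt
console.log(findExisting(files, 'hash1', '.txt', 999)); // undefined
console.log(findExisting(files, 'hash2', '.txt', 430)); // undefined
console.log(Object.keys(files.hash1['.txt']));          // [ '430' ]
```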

Let’s start creating the add method to implement the above technique for the Storage class:

add (file, stat, hash){
 let ext  = path.extname(file),
 size = stat.size,
 hashExist = this.files[hash];
 
 if (hashExist === undefined){
  this.files[hash] = {};
  this.files[hash][ext] = {};
  this.files[hash][ext][size] = file;
  return
 }
}

The add method extracts the file extension and size, then looks up the hash in the files object. If the hash does not exist, it creates a new entry using the file properties hash, ext and size as keys.

Next, if the hash already exists in the files object, we look up the file extension:

...
if (hashExist === undefined){
 this.files[hash] = {};
 this.files[hash][ext] = {};
 this.files[hash][ext][size] = file;
 return
}

let extExist = hashExist[ext]; 
if (extExist === undefined) {
 hashExist[ext] = {};
 hashExist[ext][size] = file;
 return;
}
...

The add method looks up the extension ext in the existing hash object. If ext does not exist, it stores the remaining information in the existing hash object.

Next, if the ext already exists, we look up the file size:

...
if (extExist === undefined) {
 hashExist[ext] = {};
 hashExist[ext][size] = file;
 return;
}
let sizeExist = extExist[size];
if (sizeExist === undefined){
 extExist[size] = file;
 return;
}
...

The add method looks up the size and, if it does not exist, assigns file to it.

Now we can consider the received file a duplicate of the existing file, because all three lookups, hashExist, extExist and sizeExist, found a match. We’ll add this duplicate file to dupFiles. Let’s look at the dupFiles object structure:

  • dupFiles[hash]
    The dupFiles object stores each unique hash
  • dupFiles[hash][ext]
    The hash object stores each unique file extension ext
  • dupFiles[hash][ext][size]
    The ext object stores each unique file size size
  • dupFiles[hash][ext][size] = [file]
    Unlike the files object, the size key in dupFiles stores an array of matching files, for example: {hash:{txt:{430:[file1.txt,file2.txt,file3.txt]}}}.

Now that we have the matching file sizeExist and the received file, we’ll add sizeExist to dupFiles if it is not already there, and then add the received file. Let’s complete the add method by adding the duplicate files to dupFiles:

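...
// First duplicate for this hash: create the whole branch
let hashDExist = this.dupFiles[hash];
if (hashDExist === undefined){
 this.dupFiles[hash] = {};
 this.dupFiles[hash][ext] = {};
 this.dupFiles[hash][ext][size] = [sizeExist,file];
 return;
}
// Hash is known, but not this extension
let extDExist = hashDExist[ext];
if (extDExist === undefined){
 hashDExist[ext] = {};
 hashDExist[ext][size] = [sizeExist,file];
 return;
}
// Hash and extension are known, but not this size
let sizeDExist = extDExist[size];
if (sizeDExist === undefined){
 extDExist[size] = [sizeExist,file];
 return;
}
// The original file is already listed, so add only the new file
sizeDExist.push(file);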
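
The three existence checks above follow the same pattern: look up a key and create it if it is missing. As an alternative sketch (not the code used in this tutorial), that pattern could be factored into a small helper. The names getOrCreate and addDuplicate are hypothetical, and the file names are made up.

```javascript
// Hypothetical helper: return obj[key], creating it with
// makeDefault() first when the key is missing.
function getOrCreate(obj, key, makeDefault) {
  if (obj[key] === undefined) {
    obj[key] = makeDefault();
  }
  return obj[key];
}

// Hypothetical function: record `file` as a duplicate of `original`.
function addDuplicate(dupFiles, hash, ext, size, original, file) {
  const byExt  = getOrCreate(dupFiles, hash, () => ({}));
  const bySize = getOrCreate(byExt, ext, () => ({}));
  // The first duplicate seeds the list with the original file.
  const list   = getOrCreate(bySize, size, () => [original]);
  list.push(file);
}

const dupFiles = {};
addDuplicate(dupFiles, 'hash1', '.txt', 1024, 'a.txt', 'b.txt');
addDuplicate(dupFiles, 'hash1', '.txt', 1024, 'a.txt', 'c.txt');
console.log(JSON.stringify(dupFiles));
// {"hash1":{".txt":{"1024":["a.txt","b.txt","c.txt"]}}}
```

The step-by-step version in the tutorial is longer but makes each case explicit, which is easier to follow when learning the structure.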

That’s it. We’ve created the Storage class. Let’s combine the code chunks:

The Storage class

//Storage.js
const path = require('path');

module.exports = class Storage {
 constructor () {
  this.files = {};
  this.dupFiles = {};
 }
 add (file, stat, hash){
  let ext  = path.extname(file),
  size = stat.size,
  hashExist = this.files[hash];  
  if (hashExist === undefined){
   this.files[hash] = {};
   this.files[hash][ext] = {};
   this.files[hash][ext][size] = file;
   return
  }

  let extExist = hashExist[ext]; 
  if (extExist === undefined) {
   hashExist[ext] = {};
   hashExist[ext][size] = file;
   return;
  }

  let sizeExist = extExist[size];
  if (sizeExist === undefined){
   extExist[size] = file;
   return;
  }
  
  let hashDExist = this.dupFiles[hash];
  if (hashDExist === undefined){
   this.dupFiles[hash] = {}
   this.dupFiles[hash][ext] = {}
   this.dupFiles[hash][ext][size] = [sizeExist,file];
   return;
  }

  let extDExist = hashDExist[ext];
  if (extDExist === undefined){
   hashDExist[ext] = {}
   hashDExist[ext][size] = [sizeExist,file];
   return;
  }

  let sizeDExist = extDExist[size];
  if (sizeDExist === undefined){
   extDExist[size] = [sizeExist,file];
   return ;
  }
  sizeDExist.push(file); 
 }
 getDuplicateFiles(){
  return this.dupFiles;
 }
 getFiles(){
  return this.files;
 }
}

That is the complete Storage class. We’ve added two more methods that return the files and dupFiles objects. In the next tutorial, I’ll show you how to use this class in our Duplicate File Finder app, inside the walkerHelper.js file, and how to display the results in the app’s user interface.
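
Until then, here is a minimal sketch of how the nested object returned by getDuplicateFiles() could be flattened into a list of duplicate groups. The function name listDuplicateGroups and the sample data are assumptions for illustration; the app's actual display code may look quite different.

```javascript
// Hypothetical helper: flatten {hash:{ext:{size:[files]}}} into
// an array of groups, one group per set of duplicate files.
function listDuplicateGroups(dupFiles) {
  const groups = [];
  for (const [hash, byExt] of Object.entries(dupFiles)) {
    for (const [ext, bySize] of Object.entries(byExt)) {
      for (const [size, files] of Object.entries(bySize)) {
        // Object keys are strings, so convert size back to a number.
        groups.push({ hash, ext, size: Number(size), files });
      }
    }
  }
  return groups;
}

// Made-up sample shaped like the dupFiles object described above.
const sample = {
  hash1: {
    '.txt': { 1024: ['a.txt', 'b.txt'] },
    '.php': { 1024: ['a.php', 'b.php'] }
  }
};

for (const group of listDuplicateGroups(sample)) {
  console.log(`${group.ext} (${group.size} bytes): ${group.files.join(', ')}`);
}
// .txt (1024 bytes): a.txt, b.txt
// .php (1024 bytes): a.php, b.php
```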