near-duplicate-docs

Find similar documents with ease and accuracy

A simple library that efficiently searches for near-duplicates in massive sets, libraries, or databases of almost any kind of text

  • 99%
    Quality Score
  • 100%
    Maintenance Score
  • 0.03
    Popularity Score
  • 1052
    Package Downloads
Key features

Open-source JavaScript/TypeScript library released under the MIT license

Easy to Use

You can start comparing documents and get results in 3 steps

Developed with TypeScript

It's typed and well-designed with SOLID principles in mind

High Code Coverage

A handful of tests are run on every new commit or release

Scalable

Suitable for big sets of text documents

All the cool stuff you can do with modern TypeScript & React
Our Goal

To Help JS Developers Build Amazing And Profitable Stuff

As independent creators, we love creating cool things like websites, web apps, tools, templates, articles, and more. We've merged our passions for coding and writing with the goal of helping other people find prosperity through coding just like us.

So what are we doing about it?

  • We constantly look for things that can be created once and then given away repeatedly;

  • We create and support apps that can replace unreasonably expensive alternatives;

  • We make profits by offering affordable upgrades, support, affiliate products, and other goodies;

  • We build our apps and our business in public for maximum transparency;

  • We believe that the ongoing digital transformation must be accessible to all.

near-duplicate-docs

Exactly how easy is it to use the library?

If the texts need no other processing before comparison, you can find duplicate or near-duplicate documents in 3 steps.


const { makeDuplicatesFinder } = require('near-duplicate-docs');

// Step 1: Create a finder instance
const finder = makeDuplicatesFinder({
  minSimilarity: 0.75,
  shinglesSize: 5,
  shinglesType: "word",
  signatureLength: 100,
  rowsPerBand: 5,
});

// Step 2: Pass the documents' ids and texts
finder.add(document1.id, document1.text);
finder.add(document2.id, document2.text);
finder.add(document3.id, document3.text);
finder.add(documentN.id, documentN.text);


// Step 3: Run the search
const duplicates = finder.search();

console.log(duplicates);

// Result:
// {
//   document1: [[0.95, "document3"]],
//   documentN: [[0.76, "document2"], [0.80, "document3"]]
// }
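
The option names in Step 1 (`shinglesSize`, `shinglesType`, `minSimilarity`) suggest a shingling-based similarity measure, although this page does not describe the library's internals. As a rough illustration only, the sketch below shows what word shingles of size 5 and a Jaccard similarity score usually mean. The helpers `wordShingles` and `jaccard` are hypothetical names introduced here and are not part of the near-duplicate-docs API.

```javascript
// Illustrative sketch only: NOT the library's implementation.

// Hypothetical helper: split a text into overlapping word shingles.
function wordShingles(text, size) {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const shingles = new Set();
  for (let i = 0; i + size <= words.length; i++) {
    shingles.add(words.slice(i, i + size).join(" "));
  }
  return shingles;
}

// Hypothetical helper: Jaccard similarity = |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  let shared = 0;
  for (const s of a) {
    if (b.has(s)) shared++;
  }
  const union = a.size + b.size - shared;
  return union === 0 ? 0 : shared / union;
}

const a = wordShingles("the quick brown fox jumps over the lazy dog", 5);
const b = wordShingles("the quick brown fox jumps over the sleepy dog", 5);
const similarity = jaccard(a, b);

// 3 shared shingles out of 7 distinct ones
console.log(similarity.toFixed(2)); // 0.43
```

Under this assumed measure, a single changed word in a short sentence already drops the score well below a `minSimilarity` of 0.75, because every shingle that contains the changed word differs.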


// Asynchronous variant: the same three steps, using promises
const { makeAsyncDuplicatesFinder } = require('near-duplicate-docs');

// Step 1: Create a finder instance
const finder = makeAsyncDuplicatesFinder({
  minSimilarity: 0.75,
  shinglesSize: 5,
  shinglesType: "char",
  signatureLength: 100,
  rowsPerBand: 5,
});

const promises = [];

// Step 2: Pass the documents' ids and texts
promises.push(finder.add(document1.id, document1.text));
promises.push(finder.add(document2.id, document2.text));
promises.push(finder.add(document3.id, document3.text));
promises.push(finder.add(documentN.id, documentN.text));

// Step 3: Run the search once all texts have been added
Promise.all(promises)
  .then(() => finder.search())
  .then(duplicates => console.log(duplicates))
  .catch(error => console.error(error));

// Result:
// {
//   document1: [[0.95, "document3"]],
//   documentN: [[0.76, "document2"], [0.80, "document3"]]
// }
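
Both examples print a result in the same shape: an object keyed by document id, where each value is a list of `[similarity, otherId]` pairs. A minimal sketch of turning that object into a flat list sorted by similarity could look like this. `toSortedPairs` is a hypothetical helper, not part of the library, and the sample data is copied from the Result shown on this page.

```javascript
// Hypothetical helper (not part of near-duplicate-docs): flatten the
// search result into { id, otherId, similarity } records, best match first.
function toSortedPairs(duplicates) {
  const pairs = [];
  for (const [id, matches] of Object.entries(duplicates)) {
    for (const [similarity, otherId] of matches) {
      pairs.push({ id, otherId, similarity });
    }
  }
  return pairs.sort((x, y) => y.similarity - x.similarity);
}

// Sample data copied from the Result shown on this page.
const sample = {
  document1: [[0.95, "document3"]],
  documentN: [[0.76, "document2"], [0.80, "document3"]],
};

for (const { id, otherId, similarity } of toSortedPairs(sample)) {
  console.log(`${id} ~ ${otherId}: ${similarity}`);
}
```

A flat, sorted list like this is convenient when the closest matches should be reviewed first, for example before deciding which documents to merge or delete.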

Latest Articles

Have fun and check out some of the latest articles in the Stream. They are free and publicly available, so click on one that piques your interest and start reading now.
