Metadata-Version: 2.1
Name: Threaded-Sparse-TFIDF
Version: 0.2
Summary: Multithreading TF-IDF vectorization for similarity search using sparse matrices for computations.
Home-page: https://github.com/AmanPriyanshu/Threaded-Sparse-TFIDF
Author: Aman Priyanshu
Author-email: amanpriyanshusms2001@gmail.com
License: BSD 2-clause
Platform: UNKNOWN
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows :: Windows 7
Classifier: Operating System :: Microsoft :: Windows :: Windows 8
Classifier: Operating System :: Microsoft :: Windows :: Windows 8.1
Classifier: Operating System :: Microsoft :: Windows :: Windows 10
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tqdm (>=4)
Requires-Dist: nltk (>=3)
Requires-Dist: numpy

# Threaded-Sparse-TFIDF
Creating a repository for multithreading TF-IDF vectorization for similarity search using sparse matrices for computations. 

## Usage:
```py
from TF_IDF import TF_IDF_Vectorizer

tf_idf = TF_IDF_Vectorizer(use_cached=True, print_output=False)
_, ranking = tf_idf.get_similarity_score("science fiction super hero movie", num_workers=k)
```

## Performance:
### Image:

![image](https://github.com/AmanPriyanshu/Threaded-Sparse-TFIDF/blob/main/performance.png)

### Table:
|num_workers|time              |partition_size     |
|-----------|------------------|-------------------|
|1.0        |1.1117637634277344|6.778499999999999  |
|2.0        |0.8195240020751953|3.4149000000000003 |
|3.0        |0.7357232332229614|2.2773             |
|4.0        |0.7232689380645752|1.7081             |
|5.0        |0.7375946760177612|1.3555999999999997 |
|6.0        |0.7682486534118652|1.1307000000000003 |
|7.0        |0.7640876531600952|0.9618             |
|8.0        |0.7513441801071167|0.8506             |
|9.0        |0.7795052766799927|0.7587             |
|10.0       |0.8141436100006103|0.6807             |
|11.0       |0.8003325223922729|0.6195000000000002 |
|12.0       |0.8441393852233887|0.5697             |
|13.0       |0.8490614175796509|0.5258000000000002 |
|14.0       |0.9322290658950806|0.48739999999999994|
|15.0       |0.8824400186538697|0.45729999999999993|

## Data
A subset of the **Information Retrieval Dataset - Internet Movie Database (IMDB)** specifically movies after the year 2007.


