Metadata-Version: 2.4
Name: advanced-text-processing
Version: 0.2.0
Summary: A powerful Named Entity Recognition and Resolution library with semantic matching
Author: Gaurav Dadhich
License: MIT
Project-URL: Homepage, https://github.com/dadhichgaurav1/advanced-text-processing
Project-URL: Documentation, https://github.com/dadhichgaurav1/advanced-text-processing#readme
Project-URL: Repository, https://github.com/dadhichgaurav1/advanced-text-processing
Project-URL: Issues, https://github.com/dadhichgaurav1/advanced-text-processing/issues
Project-URL: Changelog, https://github.com/dadhichgaurav1/advanced-text-processing/blob/main/CHANGELOG.md
Keywords: ner,named-entity-recognition,entity-resolution,nlp,semantic-matching,canonicalization,wikidata,wordnet,entity-linking,text-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: spacy>=3.7.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: transformers>=4.30.0
Requires-Dist: spacy-wordnet>=0.1.0
Requires-Dist: nltk>=3.3
Requires-Dist: requests>=2.31.0
Requires-Dist: requests-ratelimiter>=0.4.0
Requires-Dist: cachetools>=5.3.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: hnswlib>=0.7.0
Requires-Dist: rapidfuzz>=3.0.0
Requires-Dist: jellyfish>=1.0.0
Requires-Dist: dedupe>=2.0.0
Requires-Dist: recordlinkage>=0.15.0
Requires-Dist: flashtext>=2.7
Requires-Dist: autofaiss>=2.17.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.4.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: myst-parser>=2.0.0; extra == "docs"
Provides-Extra: all
Requires-Dist: advanced-text-processing[dev,docs]; extra == "all"
Dynamic: license-file

# Advanced Text Processing

[![PyPI version](https://img.shields.io/pypi/v/advanced-text-processing.svg)](https://pypi.org/project/advanced-text-processing/)
[![Python Versions](https://img.shields.io/pypi/pyversions/advanced-text-processing.svg)](https://pypi.org/project/advanced-text-processing/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

A **Named Entity Recognition (NER)** and **Entity Resolution** library for text processing. It combines NLP models (spaCy, Transformers) with knowledge bases (Wikidata, WordNet) to provide entity extraction, canonicalization, and semantic matching.

## 🚀 Features

- **Advanced Entity Resolution**:
  - **Mode A (Sequential)**: Fast, early-stopping pipeline for high-confidence matches.
  - **Mode B (Parallel)**: Aggregates multiple signals (fuzzy, semantic, contextual) for difficult cases.
- **Semantic Matching**: Maps inputs to canonical schemas using sentence embeddings (SentenceTransformers).
- **Alias Retrieval**: Automatically fetches aliases from Wikidata and synonyms from WordNet.
- **Canonicalization**:
  - Entities (e.g., "Apple" -> "Apple Inc.")
  - Relationships (e.g., "relies on" -> "depends_on")
  - Properties (e.g., "birth date" -> "date_of_birth")
- **Flexible Candidate Generation**: Supports exact lookup, fulltext blocking, and ANN search (FAISS/hnswlib).
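
The difference between the two resolution modes can be pictured with a small, self-contained sketch. This is not the library's implementation. The alias table `KNOWN`, the scorer functions, the threshold, and the weights are all hypothetical, and `difflib` stands in for whichever fuzzy matcher the library actually uses.

```python
from difflib import SequenceMatcher

# Hypothetical alias table (illustration only): canonical name -> normalized aliases.
KNOWN = {
    "Apple Inc.": ["apple inc", "apple", "aapl"],
    "Microsoft Corporation": ["microsoft corporation", "microsoft", "msft"],
}


def normalize(text):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(text.lower().replace(".", "").split())


def exact_score(mention, aliases):
    """1.0 if the normalized mention is a known alias, else 0.0."""
    return 1.0 if normalize(mention) in aliases else 0.0


def fuzzy_score(mention, aliases):
    """Best character-level similarity between the mention and any alias."""
    m = normalize(mention)
    return max(SequenceMatcher(None, m, a).ratio() for a in aliases)


def token_score(mention, aliases):
    """Best Jaccard overlap between the mention's tokens and any alias's tokens."""
    tokens = set(normalize(mention).split())
    return max(
        len(tokens & set(a.split())) / len(tokens | set(a.split())) for a in aliases
    )


def resolve_sequential(mention, threshold=0.9):
    """Mode A idea: run cheap scorers first and stop at the first confident match."""
    for stage, scorer in (("exact", exact_score), ("fuzzy", fuzzy_score)):
        score, name = max(
            (scorer(mention, aliases), name) for name, aliases in KNOWN.items()
        )
        if score >= threshold:
            return name, score, stage
    return None


def resolve_parallel(mention, weights=(0.2, 0.5, 0.3)):
    """Mode B idea: score every candidate with all signals and combine them."""

    def combined(aliases):
        signals = (
            exact_score(mention, aliases),
            fuzzy_score(mention, aliases),
            token_score(mention, aliases),
        )
        return sum(w * s for w, s in zip(weights, signals))

    score, name = max((combined(aliases), name) for name, aliases in KNOWN.items())
    return name, score


if __name__ == "__main__":
    print(resolve_sequential("Apple Inc"))  # stops at the exact stage
    print(resolve_sequential("Aple Inc"))   # falls through to the fuzzy stage
    print(resolve_parallel("Aple Inc"))     # weighted combination of all signals
```

The sequential variant never pays for the later scorers when an early one is already confident. The parallel variant always computes every signal, which costs more but lets weak signals reinforce one another on ambiguous mentions.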

## 📦 Installation

```bash
pip install advanced-text-processing
```

After installation, download the required spaCy model and NLTK data:

```bash
# Download spaCy model
python -m spacy download en_core_web_lg

# Download NLTK data
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```
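
To confirm that the packages are importable after these steps, a quick check along the following lines may help. It is a sketch using only the standard library: the helper name `missing_modules` and the `REQUIRED` list are illustrative, and the check relies on spaCy models being installed as regular Python packages. It does not verify that the NLTK corpora were downloaded.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Illustrative list: the two libraries plus the spaCy model package from the step above.
REQUIRED = ["spacy", "nltk", "en_core_web_lg"]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All required packages found.")
```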

See [Installation Guide](docs/INSTALLATION.md) for detailed instructions.

## ⚡ Quick Start

### Named Entity Recognition

```python
from ner_lib import recognize_entities

text = "Apple Inc. was founded by Steve Jobs in Cupertino."
result = recognize_entities(text)

for entity in result['entities']:
    print(f"{entity['text']} ({entity['type']})")
# Output:
# Apple Inc. (ORG)
# Steve Jobs (PERSON)
# Cupertino (GPE)
```

### Entity Canonicalization

```python
from ner_lib import canonicalize_entity

# Canonicalize an entity mention
result = canonicalize_entity("apple inc", mode="progressive")
print(f"Canonical: {result['canonical_name']}")
print(f"Aliases: {result['aliases']}")
# Output:
# Canonical: Apple Inc.
# Aliases: ['Apple', 'AAPL', 'Apple Computer', ...]
```

### Relationship Canonicalization

```python
from ner_lib import Config, canonicalize_relationship

# Configure semantic matching
config = Config()
config.semantic_matching.enabled = True
config.semantic_matching.canonical_relationships = ["depends_on", "created_by"]

# Canonicalize a relationship phrase
result = canonicalize_relationship("relies heavily on", config=config)
print(f"Canonical: {result['canonical_name']}")
# Output: Canonical: depends_on
```
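
Conceptually, semantic matching of this kind picks the canonical label whose embedding is closest to the input phrase. The sketch below illustrates only that nearest-label step and is not the library's code. `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model, and `nearest_label` and `min_score` are hypothetical names chosen for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a sentence-embedding model: lowercase word counts.

    Underscores act as separators, so "depends_on" becomes {"depends", "on"}.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_label(phrase, labels, embed=toy_embed, min_score=0.3):
    """Return (label, score) for the closest label, or (None, score) below min_score."""
    query = embed(phrase)
    score, label = max((cosine(query, embed(label)), label) for label in labels)
    return (label, score) if score >= min_score else (None, score)


if __name__ == "__main__":
    labels = ["depends_on", "created_by"]
    print(nearest_label("relies heavily on", labels))
    print(nearest_label("purchased from", labels))
```

With a real embedding model passed in as `embed`, phrases that share no words with a label can still match it. The toy embedding here only rewards literal word overlap, so it is useful for following the control flow and nothing more.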

## 📚 Documentation

- [Installation Guide](docs/INSTALLATION.md)
- [API Reference](docs/API.md)
- [System Architecture](docs/ARCHITECTURE.md)
- [Contributing Guide](CONTRIBUTING.md)
- [Changelog](CHANGELOG.md)

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details on how to get started.

## 📄 License

This project is licensed under the MIT License; see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgements

This library stands on the shoulders of giants. We gratefully acknowledge the following open-source projects:

- **[spaCy](https://spacy.io/)**: For industrial-strength NLP.
- **[Sentence Transformers](https://www.sbert.net/)**: For state-of-the-art text embeddings.
- **[Wikidata](https://www.wikidata.org/)**: For the comprehensive knowledge base.
- **[NLTK](https://www.nltk.org/)** & **[WordNet](https://wordnet.princeton.edu/)**: For lexical database support.

See [ACKNOWLEDGEMENTS.md](ACKNOWLEDGEMENTS.md) for the full list of dependencies and credits.
