Metadata-Version: 2.1
Name: data_prep_toolkit_transforms
Version: 0.2.3.dev2
Summary: Data Preparation Toolkit Transforms using Ray
Author-email: Maroun Touma <touma@us.ibm.com>
License: Apache-2.0
Keywords: transforms,data preprocessing,data preparation,llm,generative,ai,fine-tuning,llmapps
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: data-prep-toolkit>=0.2.3.dev1
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"
Provides-Extra: ray
Requires-Dist: data-prep-toolkit[ray]>=0.2.3.dev1; extra == "ray"
Requires-Dist: networkx==3.3; extra == "ray"
Requires-Dist: colorlog==6.8.2; extra == "ray"
Requires-Dist: func-timeout==4.3.5; extra == "ray"
Requires-Dist: emerge-viz==2.0.0; extra == "ray"
Provides-Extra: all
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: bs4==0.0.2; extra == "all"
Requires-Dist: transformers==4.38.2; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: parameterized; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: docling-core==2.3.0; extra == "all"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "all"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: fasttext==0.9.2; extra == "all"
Requires-Dist: langcodes==3.3.0; extra == "all"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "all"
Requires-Dist: numpy==1.26.4; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: sentence-transformers==3.0.1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: docling-core==2.3.0; extra == "all"
Requires-Dist: docling-ibm-models==2.0.3; extra == "all"
Requires-Dist: deepsearch-glm==0.26.1; extra == "all"
Requires-Dist: docling==2.3.1; extra == "all"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: nltk==3.9.1; extra == "all"
Requires-Dist: transformers==4.38.2; extra == "all"
Requires-Dist: torch<=2.4.1,>=2.2.2; extra == "all"
Requires-Dist: pandas==2.2.2; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: transformers==4.38.2; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: mmh3>=4.1.0; extra == "all"
Requires-Dist: xxhash==3.4.1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: pyyaml>=6.0.2; extra == "all"
Requires-Dist: boto3>=1.34.69; extra == "all"
Requires-Dist: kubernetes>=30.1.0; extra == "all"
Requires-Dist: polars==1.9.0; extra == "all"
Requires-Dist: disjoint-set>=0.8.0; extra == "all"
Requires-Dist: scipy<2.0.0,>=1.14.1; extra == "all"
Requires-Dist: numpy<1.29.0; extra == "all"
Requires-Dist: sentencepiece>=0.2.0; extra == "all"
Requires-Dist: mmh3>=4.1.0; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: mmh3==4.1.0; extra == "all"
Requires-Dist: xxhash==3.4.1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: duckdb>=0.10.1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "all"
Requires-Dist: data_prep_connector>=0.2.3; extra == "all"
Provides-Extra: proglang-select
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "proglang-select"
Provides-Extra: header-cleanser
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "header-cleanser"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "header-cleanser"
Provides-Extra: license-select
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "license-select"
Provides-Extra: code-quality
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "code-quality"
Requires-Dist: bs4==0.0.2; extra == "code-quality"
Requires-Dist: transformers==4.38.2; extra == "code-quality"
Provides-Extra: code2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "code2parquet"
Requires-Dist: parameterized; extra == "code2parquet"
Requires-Dist: pandas; extra == "code2parquet"
Provides-Extra: code-profiler
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "code-profiler"
Requires-Dist: parameterized; extra == "code-profiler"
Requires-Dist: pandas; extra == "code-profiler"
Requires-Dist: aiolimiter==1.1.0; extra == "code-profiler"
Requires-Dist: altair==5.3.0; extra == "code-profiler"
Requires-Dist: annotated-types==0.7.0; extra == "code-profiler"
Requires-Dist: anyio==4.4.0; extra == "code-profiler"
Requires-Dist: appnope==0.1.4; extra == "code-profiler"
Requires-Dist: asttokens==2.4.1; extra == "code-profiler"
Requires-Dist: attrs==23.2.0; extra == "code-profiler"
Requires-Dist: blinker==1.8.2; extra == "code-profiler"
Requires-Dist: cachetools==5.3.3; extra == "code-profiler"
Requires-Dist: certifi==2024.7.4; extra == "code-profiler"
Requires-Dist: charset-normalizer==3.3.2; extra == "code-profiler"
Requires-Dist: click==8.1.7; extra == "code-profiler"
Requires-Dist: comm==0.2.2; extra == "code-profiler"
Requires-Dist: contourpy==1.2.1; extra == "code-profiler"
Requires-Dist: cycler==0.12.1; extra == "code-profiler"
Requires-Dist: debugpy==1.8.1; extra == "code-profiler"
Requires-Dist: decorator==5.1.1; extra == "code-profiler"
Requires-Dist: Deprecated==1.2.14; extra == "code-profiler"
Requires-Dist: executing==2.0.1; extra == "code-profiler"
Requires-Dist: fonttools==4.53.0; extra == "code-profiler"
Requires-Dist: gitdb==4.0.11; extra == "code-profiler"
Requires-Dist: GitPython==3.1.43; extra == "code-profiler"
Requires-Dist: h11==0.14.0; extra == "code-profiler"
Requires-Dist: htbuilder==0.6.2; extra == "code-profiler"
Requires-Dist: httpcore==1.0.5; extra == "code-profiler"
Requires-Dist: httpx==0.27.0; extra == "code-profiler"
Requires-Dist: httpx-sse==0.4.0; extra == "code-profiler"
Requires-Dist: ibm-generative-ai==3.0.0; extra == "code-profiler"
Requires-Dist: idna==3.7; extra == "code-profiler"
Requires-Dist: ipykernel==6.29.4; extra == "code-profiler"
Requires-Dist: ipython==8.25.0; extra == "code-profiler"
Requires-Dist: jedi==0.19.1; extra == "code-profiler"
Requires-Dist: Jinja2==3.1.4; extra == "code-profiler"
Requires-Dist: jsonschema==4.22.0; extra == "code-profiler"
Requires-Dist: jsonschema-specifications==2023.12.1; extra == "code-profiler"
Requires-Dist: jupyter_client==8.6.2; extra == "code-profiler"
Requires-Dist: jupyter_core==5.7.2; extra == "code-profiler"
Requires-Dist: kiwisolver==1.4.5; extra == "code-profiler"
Requires-Dist: markdown-it-py==3.0.0; extra == "code-profiler"
Requires-Dist: MarkupSafe==2.1.5; extra == "code-profiler"
Requires-Dist: matplotlib==3.9.0; extra == "code-profiler"
Requires-Dist: matplotlib-inline==0.1.7; extra == "code-profiler"
Requires-Dist: mdurl==0.1.2; extra == "code-profiler"
Requires-Dist: more-itertools==10.3.0; extra == "code-profiler"
Requires-Dist: nest-asyncio==1.6.0; extra == "code-profiler"
Requires-Dist: networkx==3.3; extra == "code-profiler"
Requires-Dist: numpy==1.26.4; extra == "code-profiler"
Requires-Dist: packaging==24.0; extra == "code-profiler"
Requires-Dist: pandas==2.2.2; extra == "code-profiler"
Requires-Dist: parso==0.8.4; extra == "code-profiler"
Requires-Dist: pexpect==4.9.0; extra == "code-profiler"
Requires-Dist: pillow==10.3.0; extra == "code-profiler"
Requires-Dist: platformdirs==4.2.2; extra == "code-profiler"
Requires-Dist: prompt_toolkit==3.0.45; extra == "code-profiler"
Requires-Dist: protobuf==5.27.2; extra == "code-profiler"
Requires-Dist: psutil==5.9.8; extra == "code-profiler"
Requires-Dist: ptyprocess==0.7.0; extra == "code-profiler"
Requires-Dist: pure-eval==0.2.2; extra == "code-profiler"
Requires-Dist: pyarrow==16.1.0; extra == "code-profiler"
Requires-Dist: pydantic==2.7.4; extra == "code-profiler"
Requires-Dist: pydantic_core==2.18.4; extra == "code-profiler"
Requires-Dist: pydeck==0.9.1; extra == "code-profiler"
Requires-Dist: Pygments==2.18.0; extra == "code-profiler"
Requires-Dist: pyparsing==3.1.2; extra == "code-profiler"
Requires-Dist: python-dateutil==2.9.0.post0; extra == "code-profiler"
Requires-Dist: pytz==2024.1; extra == "code-profiler"
Requires-Dist: pyzmq==26.0.3; extra == "code-profiler"
Requires-Dist: referencing==0.35.1; extra == "code-profiler"
Requires-Dist: regex==2024.5.15; extra == "code-profiler"
Requires-Dist: requests==2.32.3; extra == "code-profiler"
Requires-Dist: rich==13.7.1; extra == "code-profiler"
Requires-Dist: rpds-py==0.18.1; extra == "code-profiler"
Requires-Dist: seaborn==0.13.2; extra == "code-profiler"
Requires-Dist: six==1.16.0; extra == "code-profiler"
Requires-Dist: smmap==5.0.1; extra == "code-profiler"
Requires-Dist: sniffio==1.3.1; extra == "code-profiler"
Requires-Dist: st-annotated-text==4.0.1; extra == "code-profiler"
Requires-Dist: stack-data==0.6.3; extra == "code-profiler"
Requires-Dist: streamlit==1.37.0; extra == "code-profiler"
Requires-Dist: tenacity==8.4.2; extra == "code-profiler"
Requires-Dist: toml==0.10.2; extra == "code-profiler"
Requires-Dist: toolz==0.12.1; extra == "code-profiler"
Requires-Dist: tornado==6.4.1; extra == "code-profiler"
Requires-Dist: traitlets==5.14.3; extra == "code-profiler"
Requires-Dist: tree-sitter==0.21.3; extra == "code-profiler"
Requires-Dist: tree-sitter-cpp==0.22.1; extra == "code-profiler"
Requires-Dist: tree-sitter-java==0.21.0; extra == "code-profiler"
Requires-Dist: tree-sitter-languages==1.10.2; extra == "code-profiler"
Requires-Dist: tree-sitter-php==0.22.5; extra == "code-profiler"
Requires-Dist: typing_extensions==4.12.2; extra == "code-profiler"
Requires-Dist: tzdata==2024.1; extra == "code-profiler"
Requires-Dist: urllib3==2.2.2; extra == "code-profiler"
Requires-Dist: uuid; extra == "code-profiler"
Requires-Dist: wcwidth==0.2.13; extra == "code-profiler"
Requires-Dist: wrapt==1.16.0; extra == "code-profiler"
Requires-Dist: plotly==5.15.0; extra == "code-profiler"
Provides-Extra: doc-quality
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "doc-quality"
Provides-Extra: doc-chunk
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "doc-chunk"
Requires-Dist: docling-core==2.3.0; extra == "doc-chunk"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "doc-chunk"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "doc-chunk"
Provides-Extra: html2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "html2parquet"
Requires-Dist: trafilatura==1.12.0; extra == "html2parquet"
Provides-Extra: pii-redactor
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "pii-redactor"
Requires-Dist: presidio-analyzer>=2.2.355; extra == "pii-redactor"
Requires-Dist: presidio-anonymizer>=2.2.355; extra == "pii-redactor"
Requires-Dist: flair>=0.14.0; extra == "pii-redactor"
Requires-Dist: pandas>=2.2.2; extra == "pii-redactor"
Provides-Extra: lang-id
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "lang-id"
Requires-Dist: fasttext==0.9.2; extra == "lang-id"
Requires-Dist: langcodes==3.3.0; extra == "lang-id"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "lang-id"
Requires-Dist: numpy==1.26.4; extra == "lang-id"
Provides-Extra: text-encoder
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "text-encoder"
Requires-Dist: sentence-transformers==3.0.1; extra == "text-encoder"
Provides-Extra: pdf2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "pdf2parquet"
Requires-Dist: docling-core==2.3.0; extra == "pdf2parquet"
Requires-Dist: docling-ibm-models==2.0.3; extra == "pdf2parquet"
Requires-Dist: deepsearch-glm==0.26.1; extra == "pdf2parquet"
Requires-Dist: docling==2.3.1; extra == "pdf2parquet"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "pdf2parquet"
Provides-Extra: hap
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "hap"
Requires-Dist: nltk==3.9.1; extra == "hap"
Requires-Dist: transformers==4.38.2; extra == "hap"
Requires-Dist: torch<=2.4.1,>=2.2.2; extra == "hap"
Requires-Dist: pandas==2.2.2; extra == "hap"
Provides-Extra: tokenization
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "tokenization"
Requires-Dist: transformers==4.38.2; extra == "tokenization"
Provides-Extra: ededup
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "ededup"
Requires-Dist: mmh3>=4.1.0; extra == "ededup"
Requires-Dist: xxhash==3.4.1; extra == "ededup"
Provides-Extra: fdedup
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "fdedup"
Requires-Dist: pyyaml>=6.0.2; extra == "fdedup"
Requires-Dist: boto3>=1.34.69; extra == "fdedup"
Requires-Dist: kubernetes>=30.1.0; extra == "fdedup"
Requires-Dist: polars==1.9.0; extra == "fdedup"
Requires-Dist: disjoint-set>=0.8.0; extra == "fdedup"
Requires-Dist: scipy<2.0.0,>=1.14.1; extra == "fdedup"
Requires-Dist: numpy<1.29.0; extra == "fdedup"
Requires-Dist: sentencepiece>=0.2.0; extra == "fdedup"
Requires-Dist: mmh3>=4.1.0; extra == "fdedup"
Provides-Extra: profiler
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "profiler"
Requires-Dist: mmh3==4.1.0; extra == "profiler"
Requires-Dist: xxhash==3.4.1; extra == "profiler"
Provides-Extra: doc-id
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "doc-id"
Provides-Extra: filter
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "filter"
Requires-Dist: duckdb>=0.10.1; extra == "filter"
Provides-Extra: resize
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "resize"
Provides-Extra: web2parquet
Requires-Dist: data-prep-toolkit>=0.2.3.dev1; extra == "web2parquet"
Requires-Dist: data_prep_connector>=0.2.3; extra == "web2parquet"

# DPK Python Transforms

## Installation

The [transforms](https://github.com/IBM/data-prep-kit/blob/dev/transforms/README.md) are delivered as a standard Python library available on PyPI and can be installed with pip:

`python -m pip install data-prep-toolkit-transforms`
or
`python -m pip install "data-prep-toolkit-transforms[ray]"`


Installing the Python transforms also installs `data-prep-toolkit`.

Installing the Ray transforms also installs `data-prep-toolkit[ray]`.

## List of Transforms in current package

Note: This list includes the transforms that have been part of the release since data-prep-toolkit-transforms:0.2.1. It may not always be up to date. Users are encouraged to raise an issue on GitHub if they find components that are missing from the list, or packages that are listed below but absent from the release they get from PyPI.

* code
    * [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/README.md)
    * [header_cleanser (Not available on macOS)](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/header_cleanser/python/README.md)
    * [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
    * [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
* language
    * [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/python/README.md)
    * [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/python/README.md)
    * [lang_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/lang_id/python/README.md)
    * [pdf2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/python/README.md)
    * [text_encoder](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/python/README.md)
    * [pii_redactor](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pii_redactor/python/README.md)
* universal
    * [ededup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/python/README.md)
    * [filter](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/filter/python/README.md)
    * [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/resize/python/README.md)
    * [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/python/README.md)
    * [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/python/README.md)
    * [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)
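
Because the list above may lag behind a release, one way to see what an installed release actually packages is to read the extras it declares (its `Provides-Extra` metadata entries). The following is a minimal sketch using only the standard library; `declared_extras` is a hypothetical helper, and it assumes that each transform is exposed as an extra of a similar name, which may not hold for every release.

```python
from importlib.metadata import PackageNotFoundError, metadata


def declared_extras(dist_name: str) -> list[str]:
    """Return the sorted extras a distribution declares, or [] if it is not installed."""
    try:
        meta = metadata(dist_name)
    except PackageNotFoundError:
        return []
    # get_all returns None when the header is absent, so fall back to an empty list.
    return sorted(meta.get_all("Provides-Extra") or [])


# Print one extra per line for the transforms package.
for extra in declared_extras("data-prep-toolkit-transforms"):
    print(extra)
```

An extra reported this way can usually be installed on its own, for example `python -m pip install "data-prep-toolkit-transforms[pdf2parquet]"`, which should pull in only that transform's dependencies.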
   
## Release notes

### 0.2.3.dev1 
* code_profiler
### 0.2.3.dev0 
* fdedup
### 0.2.2.dev3 
* web2parquet
### 0.2.2.dev2
* pdf2parquet now supports HTML, DOCX, PPTX, ... in addition to PDF




 
