Metadata-Version: 2.2
Name: data_prep_toolkit_transforms
Version: 1.0.1.dev1
Summary: Data Preparation Toolkit Transforms using Ray
Author-email: Maroun Touma <touma@us.ibm.com>
License: Apache-2.0
Keywords: transforms,data preprocessing,data preparation,llm,generative,ai,fine-tuning,llmapps
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
Requires-Dist: data-prep-toolkit>=0.2.4.dev0
Provides-Extra: dev
Requires-Dist: twine; extra == "dev"
Requires-Dist: pytest>=7.3.2; extra == "dev"
Requires-Dist: pytest-dotenv>=0.5.2; extra == "dev"
Requires-Dist: pytest-env>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.3.2; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: moto==5.0.5; extra == "dev"
Requires-Dist: markupsafe==2.0.1; extra == "dev"
Provides-Extra: ray
Requires-Dist: data-prep-toolkit[ray]>=0.2.4.dev0; extra == "ray"
Requires-Dist: networkx==3.3; extra == "ray"
Requires-Dist: colorlog==6.8.2; extra == "ray"
Requires-Dist: func-timeout==4.3.5; extra == "ray"
Requires-Dist: emerge-viz==2.0.0; extra == "ray"
Provides-Extra: all
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "all"
Requires-Dist: timeout-timer==0.2.0; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: bs4==0.0.2; extra == "all"
Requires-Dist: transformers>=4.38.2; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: parameterized; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "all"
Requires-Dist: parameterized>=0.9.0; extra == "all"
Requires-Dist: pandas>=2.2.2; extra == "all"
Requires-Dist: aiolimiter==1.1.0; extra == "all"
Requires-Dist: altair==5.3.0; extra == "all"
Requires-Dist: annotated-types==0.7.0; extra == "all"
Requires-Dist: anyio==4.4.0; extra == "all"
Requires-Dist: appnope==0.1.4; extra == "all"
Requires-Dist: asttokens==2.4.1; extra == "all"
Requires-Dist: attrs==23.2.0; extra == "all"
Requires-Dist: blinker==1.8.2; extra == "all"
Requires-Dist: cachetools==5.3.3; extra == "all"
Requires-Dist: certifi==2024.7.4; extra == "all"
Requires-Dist: charset-normalizer==3.3.2; extra == "all"
Requires-Dist: click==8.1.7; extra == "all"
Requires-Dist: comm==0.2.2; extra == "all"
Requires-Dist: contourpy==1.2.1; extra == "all"
Requires-Dist: cycler==0.12.1; extra == "all"
Requires-Dist: debugpy==1.8.1; extra == "all"
Requires-Dist: decorator==5.1.1; extra == "all"
Requires-Dist: Deprecated==1.2.14; extra == "all"
Requires-Dist: executing==2.0.1; extra == "all"
Requires-Dist: fonttools==4.53.0; extra == "all"
Requires-Dist: gitdb==4.0.11; extra == "all"
Requires-Dist: GitPython==3.1.43; extra == "all"
Requires-Dist: h11==0.14.0; extra == "all"
Requires-Dist: htbuilder==0.6.2; extra == "all"
Requires-Dist: httpcore==1.0.5; extra == "all"
Requires-Dist: httpx==0.27.0; extra == "all"
Requires-Dist: httpx-sse==0.4.0; extra == "all"
Requires-Dist: ibm-generative-ai==3.0.0; extra == "all"
Requires-Dist: idna==3.7; extra == "all"
Requires-Dist: ipykernel==6.29.4; extra == "all"
Requires-Dist: ipython==8.25.0; extra == "all"
Requires-Dist: jedi==0.19.1; extra == "all"
Requires-Dist: Jinja2==3.1.4; extra == "all"
Requires-Dist: jsonschema==4.22.0; extra == "all"
Requires-Dist: jsonschema-specifications==2023.12.1; extra == "all"
Requires-Dist: jupyter_client==8.6.2; extra == "all"
Requires-Dist: jupyter_core==5.7.2; extra == "all"
Requires-Dist: kiwisolver==1.4.5; extra == "all"
Requires-Dist: markdown-it-py==3.0.0; extra == "all"
Requires-Dist: MarkupSafe==2.1.5; extra == "all"
Requires-Dist: matplotlib==3.9.0; extra == "all"
Requires-Dist: matplotlib-inline==0.1.7; extra == "all"
Requires-Dist: mdurl==0.1.2; extra == "all"
Requires-Dist: more-itertools==10.3.0; extra == "all"
Requires-Dist: nest-asyncio==1.6.0; extra == "all"
Requires-Dist: networkx==3.3; extra == "all"
Requires-Dist: numpy==1.26.4; extra == "all"
Requires-Dist: packaging==24.0; extra == "all"
Requires-Dist: parso==0.8.4; extra == "all"
Requires-Dist: pexpect==4.9.0; extra == "all"
Requires-Dist: pillow>=10.3.0; extra == "all"
Requires-Dist: platformdirs==4.2.2; extra == "all"
Requires-Dist: prompt_toolkit==3.0.45; extra == "all"
Requires-Dist: protobuf==5.27.2; extra == "all"
Requires-Dist: psutil==5.9.8; extra == "all"
Requires-Dist: ptyprocess==0.7.0; extra == "all"
Requires-Dist: pure-eval==0.2.2; extra == "all"
Requires-Dist: pyarrow==16.1.0; extra == "all"
Requires-Dist: pydantic>=2.7.4; extra == "all"
Requires-Dist: pydantic_core>=2.18.4; extra == "all"
Requires-Dist: pydeck==0.9.1; extra == "all"
Requires-Dist: Pygments==2.18.0; extra == "all"
Requires-Dist: pyparsing==3.1.2; extra == "all"
Requires-Dist: python-dateutil==2.9.0.post0; extra == "all"
Requires-Dist: pytz==2024.1; extra == "all"
Requires-Dist: pyzmq==26.0.3; extra == "all"
Requires-Dist: referencing==0.35.1; extra == "all"
Requires-Dist: regex==2024.5.15; extra == "all"
Requires-Dist: requests==2.32.3; extra == "all"
Requires-Dist: rich==13.7.1; extra == "all"
Requires-Dist: rpds-py==0.18.1; extra == "all"
Requires-Dist: seaborn==0.13.2; extra == "all"
Requires-Dist: six==1.16.0; extra == "all"
Requires-Dist: smmap==5.0.1; extra == "all"
Requires-Dist: sniffio==1.3.1; extra == "all"
Requires-Dist: st-annotated-text==4.0.1; extra == "all"
Requires-Dist: stack-data==0.6.3; extra == "all"
Requires-Dist: streamlit==1.37.0; extra == "all"
Requires-Dist: tenacity==8.4.2; extra == "all"
Requires-Dist: toml==0.10.2; extra == "all"
Requires-Dist: toolz==0.12.1; extra == "all"
Requires-Dist: tornado==6.4.1; extra == "all"
Requires-Dist: traitlets==5.14.3; extra == "all"
Requires-Dist: tree-sitter==0.21.3; extra == "all"
Requires-Dist: tree-sitter-cpp==0.22.1; extra == "all"
Requires-Dist: tree-sitter-java==0.21.0; extra == "all"
Requires-Dist: tree-sitter-languages==1.10.2; extra == "all"
Requires-Dist: tree-sitter-php==0.22.5; extra == "all"
Requires-Dist: typing_extensions==4.12.2; extra == "all"
Requires-Dist: tzdata==2024.1; extra == "all"
Requires-Dist: uuid; extra == "all"
Requires-Dist: wcwidth==0.2.13; extra == "all"
Requires-Dist: wrapt==1.16.0; extra == "all"
Requires-Dist: plotly==5.15.0; extra == "all"
Requires-Dist: presidio-analyzer>=2.2.355; extra == "all"
Requires-Dist: presidio-anonymizer>=2.2.355; extra == "all"
Requires-Dist: flair>=0.14.0; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: mmh3==4.1.0; extra == "all"
Requires-Dist: xxhash==3.4.1; extra == "all"
Requires-Dist: fasttext>=0.9.2; platform_system != "Windows" and extra == "all"
Requires-Dist: langcodes>=3.3.0; extra == "all"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "all"
Requires-Dist: numpy==1.26.4; extra == "all"
Requires-Dist: docling-core==2.3.0; extra == "all"
Requires-Dist: docling-ibm-models==2.0.3; extra == "all"
Requires-Dist: deepsearch-glm==0.26.1; extra == "all"
Requires-Dist: docling==2.3.1; extra == "all"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "all"
Requires-Dist: docling-core==2.3.0; extra == "all"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "all"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "all"
Requires-Dist: sentence-transformers>=3.0.1; extra == "all"
Requires-Dist: nltk>=3.9.1; extra == "all"
Requires-Dist: transformers>=4.38.2; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: requests; extra == "all"
Requires-Dist: polars>=1.9.0; extra == "all"
Requires-Dist: textstat; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: fasttext>=0.9.3; platform_system != "Windows" and extra == "all"
Requires-Dist: langcodes>=3.5.0; extra == "all"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "all"
Requires-Dist: numpy<1.29.0,>=1.26.4; extra == "all"
Requires-Dist: duckdb>=0.10.1; extra == "all"
Requires-Dist: mmh3>=4.1.0; extra == "all"
Requires-Dist: xxhash==3.4.1; extra == "all"
Requires-Dist: pyyaml>=6.0.2; extra == "all"
Requires-Dist: boto3>=1.34.69; extra == "all"
Requires-Dist: kubernetes>=30.1.0; extra == "all"
Requires-Dist: polars!=1.10.0,!=1.11.0,!=1.12.0,>=1.9.0; extra == "all"
Requires-Dist: disjoint-set>=0.8.0; extra == "all"
Requires-Dist: scipy<2.0.0,>=1.12.1; extra == "all"
Requires-Dist: numpy<1.29.0; extra == "all"
Requires-Dist: sentencepiece>=0.2.0; extra == "all"
Requires-Dist: mmh3>=4.1.0; extra == "all"
Requires-Dist: nltk==3.9.1; extra == "all"
Requires-Dist: transformers>=4.38.2; extra == "all"
Requires-Dist: torch<=2.5.1,>=2.2.2; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: transformers>=4.38.2; extra == "all"
Requires-Dist: data_prep_connector>=0.2.3; extra == "all"
Requires-Dist: nltk>=3.9.1; extra == "all"
Requires-Dist: requests; extra == "all"
Requires-Dist: transformers; extra == "all"
Requires-Dist: pandas; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: GPUtil; extra == "all"
Provides-Extra: language
Requires-Dist: presidio-analyzer>=2.2.355; extra == "language"
Requires-Dist: presidio-anonymizer>=2.2.355; extra == "language"
Requires-Dist: flair>=0.14.0; extra == "language"
Requires-Dist: pandas; extra == "language"
Requires-Dist: fasttext>=0.9.2; platform_system != "Windows" and extra == "language"
Requires-Dist: langcodes>=3.3.0; extra == "language"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "language"
Requires-Dist: numpy==1.26.4; extra == "language"
Requires-Dist: docling-core==2.3.0; extra == "language"
Requires-Dist: docling-ibm-models==2.0.3; extra == "language"
Requires-Dist: deepsearch-glm==0.26.1; extra == "language"
Requires-Dist: docling==2.3.1; extra == "language"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "language"
Requires-Dist: docling-core==2.3.0; extra == "language"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "language"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "language"
Requires-Dist: sentence-transformers>=3.0.1; extra == "language"
Requires-Dist: nltk>=3.9.1; extra == "language"
Requires-Dist: transformers>=4.38.2; extra == "language"
Requires-Dist: pandas; extra == "language"
Requires-Dist: requests; extra == "language"
Requires-Dist: polars>=1.9.0; extra == "language"
Requires-Dist: textstat; extra == "language"
Requires-Dist: pandas; extra == "language"
Requires-Dist: fasttext>=0.9.3; platform_system != "Windows" and extra == "language"
Requires-Dist: langcodes>=3.5.0; extra == "language"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "language"
Requires-Dist: numpy<1.29.0,>=1.26.4; extra == "language"
Requires-Dist: duckdb>=0.10.1; extra == "language"
Requires-Dist: mmh3>=4.1.0; extra == "language"
Requires-Dist: xxhash==3.4.1; extra == "language"
Requires-Dist: pyyaml>=6.0.2; extra == "language"
Requires-Dist: boto3>=1.34.69; extra == "language"
Requires-Dist: kubernetes>=30.1.0; extra == "language"
Requires-Dist: polars!=1.10.0,!=1.11.0,!=1.12.0,>=1.9.0; extra == "language"
Requires-Dist: disjoint-set>=0.8.0; extra == "language"
Requires-Dist: scipy<2.0.0,>=1.12.1; extra == "language"
Requires-Dist: numpy<1.29.0; extra == "language"
Requires-Dist: sentencepiece>=0.2.0; extra == "language"
Requires-Dist: mmh3>=4.1.0; extra == "language"
Requires-Dist: nltk==3.9.1; extra == "language"
Requires-Dist: transformers>=4.38.2; extra == "language"
Requires-Dist: torch<=2.5.1,>=2.2.2; extra == "language"
Requires-Dist: pandas; extra == "language"
Requires-Dist: transformers>=4.38.2; extra == "language"
Requires-Dist: data_prep_connector>=0.2.3; extra == "language"
Requires-Dist: mmh3==4.1.0; extra == "language"
Requires-Dist: xxhash==3.4.1; extra == "language"
Requires-Dist: nltk>=3.9.1; extra == "language"
Requires-Dist: requests; extra == "language"
Requires-Dist: transformers; extra == "language"
Requires-Dist: pandas; extra == "language"
Requires-Dist: psutil; extra == "language"
Requires-Dist: GPUtil; extra == "language"
Provides-Extra: proglang-select
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "proglang-select"
Provides-Extra: header-cleanser
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "header-cleanser"
Requires-Dist: scancode-toolkit==32.1.0; platform_system != "Darwin" and extra == "header-cleanser"
Requires-Dist: timeout-timer==0.2.0; extra == "header-cleanser"
Provides-Extra: license-select
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "license-select"
Provides-Extra: code-quality
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "code-quality"
Requires-Dist: bs4==0.0.2; extra == "code-quality"
Requires-Dist: transformers>=4.38.2; extra == "code-quality"
Provides-Extra: code2parquet
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "code2parquet"
Requires-Dist: parameterized; extra == "code2parquet"
Requires-Dist: pandas; extra == "code2parquet"
Provides-Extra: profiler
Requires-Dist: mmh3==4.1.0; extra == "profiler"
Requires-Dist: xxhash==3.4.1; extra == "profiler"
Provides-Extra: resize
Provides-Extra: doc-chunk
Requires-Dist: docling-core==2.3.0; extra == "doc-chunk"
Requires-Dist: pydantic<2.10.0,>=2.0.0; extra == "doc-chunk"
Requires-Dist: llama-index-core<0.12.0,>=0.11.22; extra == "doc-chunk"
Provides-Extra: doc-quality
Provides-Extra: html2parquet
Requires-Dist: trafilatura==1.12.0; extra == "html2parquet"
Provides-Extra: lang-id
Requires-Dist: fasttext>=0.9.2; platform_system != "Windows" and extra == "lang-id"
Requires-Dist: langcodes>=3.3.0; extra == "lang-id"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "lang-id"
Requires-Dist: numpy==1.26.4; extra == "lang-id"
Provides-Extra: pdf2parquet
Requires-Dist: docling-core==2.3.0; extra == "pdf2parquet"
Requires-Dist: docling-ibm-models==2.0.3; extra == "pdf2parquet"
Requires-Dist: deepsearch-glm==0.26.1; extra == "pdf2parquet"
Requires-Dist: docling==2.3.1; extra == "pdf2parquet"
Requires-Dist: filetype<2.0.0,>=1.2.0; extra == "pdf2parquet"
Provides-Extra: text-encoder
Requires-Dist: sentence-transformers>=3.0.1; extra == "text-encoder"
Provides-Extra: pii-redactor
Requires-Dist: presidio-analyzer>=2.2.355; extra == "pii-redactor"
Requires-Dist: presidio-anonymizer>=2.2.355; extra == "pii-redactor"
Requires-Dist: flair>=0.14.0; extra == "pii-redactor"
Requires-Dist: pandas; extra == "pii-redactor"
Provides-Extra: filter
Requires-Dist: duckdb>=0.10.1; extra == "filter"
Provides-Extra: doc-id
Provides-Extra: hap
Requires-Dist: nltk==3.9.1; extra == "hap"
Requires-Dist: transformers>=4.38.2; extra == "hap"
Requires-Dist: torch<=2.5.1,>=2.2.2; extra == "hap"
Requires-Dist: pandas; extra == "hap"
Provides-Extra: ededup
Requires-Dist: mmh3>=4.1.0; extra == "ededup"
Requires-Dist: xxhash==3.4.1; extra == "ededup"
Provides-Extra: fdedup
Requires-Dist: pyyaml>=6.0.2; extra == "fdedup"
Requires-Dist: boto3>=1.34.69; extra == "fdedup"
Requires-Dist: kubernetes>=30.1.0; extra == "fdedup"
Requires-Dist: polars!=1.10.0,!=1.11.0,!=1.12.0,>=1.9.0; extra == "fdedup"
Requires-Dist: disjoint-set>=0.8.0; extra == "fdedup"
Requires-Dist: scipy<2.0.0,>=1.12.1; extra == "fdedup"
Requires-Dist: numpy<1.29.0; extra == "fdedup"
Requires-Dist: sentencepiece>=0.2.0; extra == "fdedup"
Requires-Dist: mmh3>=4.1.0; extra == "fdedup"
Provides-Extra: tokenization
Requires-Dist: transformers>=4.38.2; extra == "tokenization"
Provides-Extra: web2parquet
Requires-Dist: data_prep_connector>=0.2.3; extra == "web2parquet"
Provides-Extra: similarity
Requires-Dist: nltk>=3.9.1; extra == "similarity"
Requires-Dist: transformers>=4.38.2; extra == "similarity"
Requires-Dist: pandas; extra == "similarity"
Requires-Dist: requests; extra == "similarity"
Provides-Extra: extreme-tokenized
Requires-Dist: polars>=1.9.0; extra == "extreme-tokenized"
Provides-Extra: readability
Requires-Dist: textstat; extra == "readability"
Requires-Dist: pandas; extra == "readability"
Provides-Extra: code-profiler
Requires-Dist: data-prep-toolkit>=0.2.3; extra == "code-profiler"
Requires-Dist: parameterized>=0.9.0; extra == "code-profiler"
Requires-Dist: pandas>=2.2.2; extra == "code-profiler"
Requires-Dist: aiolimiter==1.1.0; extra == "code-profiler"
Requires-Dist: altair==5.3.0; extra == "code-profiler"
Requires-Dist: annotated-types==0.7.0; extra == "code-profiler"
Requires-Dist: anyio==4.4.0; extra == "code-profiler"
Requires-Dist: appnope==0.1.4; extra == "code-profiler"
Requires-Dist: asttokens==2.4.1; extra == "code-profiler"
Requires-Dist: attrs==23.2.0; extra == "code-profiler"
Requires-Dist: blinker==1.8.2; extra == "code-profiler"
Requires-Dist: cachetools==5.3.3; extra == "code-profiler"
Requires-Dist: certifi==2024.7.4; extra == "code-profiler"
Requires-Dist: charset-normalizer==3.3.2; extra == "code-profiler"
Requires-Dist: click==8.1.7; extra == "code-profiler"
Requires-Dist: comm==0.2.2; extra == "code-profiler"
Requires-Dist: contourpy==1.2.1; extra == "code-profiler"
Requires-Dist: cycler==0.12.1; extra == "code-profiler"
Requires-Dist: debugpy==1.8.1; extra == "code-profiler"
Requires-Dist: decorator==5.1.1; extra == "code-profiler"
Requires-Dist: Deprecated==1.2.14; extra == "code-profiler"
Requires-Dist: executing==2.0.1; extra == "code-profiler"
Requires-Dist: fonttools==4.53.0; extra == "code-profiler"
Requires-Dist: gitdb==4.0.11; extra == "code-profiler"
Requires-Dist: GitPython==3.1.43; extra == "code-profiler"
Requires-Dist: h11==0.14.0; extra == "code-profiler"
Requires-Dist: htbuilder==0.6.2; extra == "code-profiler"
Requires-Dist: httpcore==1.0.5; extra == "code-profiler"
Requires-Dist: httpx==0.27.0; extra == "code-profiler"
Requires-Dist: httpx-sse==0.4.0; extra == "code-profiler"
Requires-Dist: ibm-generative-ai==3.0.0; extra == "code-profiler"
Requires-Dist: idna==3.7; extra == "code-profiler"
Requires-Dist: ipykernel==6.29.4; extra == "code-profiler"
Requires-Dist: ipython==8.25.0; extra == "code-profiler"
Requires-Dist: jedi==0.19.1; extra == "code-profiler"
Requires-Dist: Jinja2==3.1.4; extra == "code-profiler"
Requires-Dist: jsonschema==4.22.0; extra == "code-profiler"
Requires-Dist: jsonschema-specifications==2023.12.1; extra == "code-profiler"
Requires-Dist: jupyter_client==8.6.2; extra == "code-profiler"
Requires-Dist: jupyter_core==5.7.2; extra == "code-profiler"
Requires-Dist: kiwisolver==1.4.5; extra == "code-profiler"
Requires-Dist: markdown-it-py==3.0.0; extra == "code-profiler"
Requires-Dist: MarkupSafe==2.1.5; extra == "code-profiler"
Requires-Dist: matplotlib==3.9.0; extra == "code-profiler"
Requires-Dist: matplotlib-inline==0.1.7; extra == "code-profiler"
Requires-Dist: mdurl==0.1.2; extra == "code-profiler"
Requires-Dist: more-itertools==10.3.0; extra == "code-profiler"
Requires-Dist: nest-asyncio==1.6.0; extra == "code-profiler"
Requires-Dist: networkx==3.3; extra == "code-profiler"
Requires-Dist: numpy==1.26.4; extra == "code-profiler"
Requires-Dist: packaging==24.0; extra == "code-profiler"
Requires-Dist: parso==0.8.4; extra == "code-profiler"
Requires-Dist: pexpect==4.9.0; extra == "code-profiler"
Requires-Dist: pillow>=10.3.0; extra == "code-profiler"
Requires-Dist: platformdirs==4.2.2; extra == "code-profiler"
Requires-Dist: prompt_toolkit==3.0.45; extra == "code-profiler"
Requires-Dist: protobuf==5.27.2; extra == "code-profiler"
Requires-Dist: psutil==5.9.8; extra == "code-profiler"
Requires-Dist: ptyprocess==0.7.0; extra == "code-profiler"
Requires-Dist: pure-eval==0.2.2; extra == "code-profiler"
Requires-Dist: pyarrow==16.1.0; extra == "code-profiler"
Requires-Dist: pydantic>=2.7.4; extra == "code-profiler"
Requires-Dist: pydantic_core>=2.18.4; extra == "code-profiler"
Requires-Dist: pydeck==0.9.1; extra == "code-profiler"
Requires-Dist: Pygments==2.18.0; extra == "code-profiler"
Requires-Dist: pyparsing==3.1.2; extra == "code-profiler"
Requires-Dist: python-dateutil==2.9.0.post0; extra == "code-profiler"
Requires-Dist: pytz==2024.1; extra == "code-profiler"
Requires-Dist: pyzmq==26.0.3; extra == "code-profiler"
Requires-Dist: referencing==0.35.1; extra == "code-profiler"
Requires-Dist: regex==2024.5.15; extra == "code-profiler"
Requires-Dist: requests==2.32.3; extra == "code-profiler"
Requires-Dist: rich==13.7.1; extra == "code-profiler"
Requires-Dist: rpds-py==0.18.1; extra == "code-profiler"
Requires-Dist: seaborn==0.13.2; extra == "code-profiler"
Requires-Dist: six==1.16.0; extra == "code-profiler"
Requires-Dist: smmap==5.0.1; extra == "code-profiler"
Requires-Dist: sniffio==1.3.1; extra == "code-profiler"
Requires-Dist: st-annotated-text==4.0.1; extra == "code-profiler"
Requires-Dist: stack-data==0.6.3; extra == "code-profiler"
Requires-Dist: streamlit==1.37.0; extra == "code-profiler"
Requires-Dist: tenacity==8.4.2; extra == "code-profiler"
Requires-Dist: toml==0.10.2; extra == "code-profiler"
Requires-Dist: toolz==0.12.1; extra == "code-profiler"
Requires-Dist: tornado==6.4.1; extra == "code-profiler"
Requires-Dist: traitlets==5.14.3; extra == "code-profiler"
Requires-Dist: tree-sitter==0.21.3; extra == "code-profiler"
Requires-Dist: tree-sitter-cpp==0.22.1; extra == "code-profiler"
Requires-Dist: tree-sitter-java==0.21.0; extra == "code-profiler"
Requires-Dist: tree-sitter-languages==1.10.2; extra == "code-profiler"
Requires-Dist: tree-sitter-php==0.22.5; extra == "code-profiler"
Requires-Dist: typing_extensions==4.12.2; extra == "code-profiler"
Requires-Dist: tzdata==2024.1; extra == "code-profiler"
Requires-Dist: uuid; extra == "code-profiler"
Requires-Dist: wcwidth==0.2.13; extra == "code-profiler"
Requires-Dist: wrapt==1.16.0; extra == "code-profiler"
Requires-Dist: plotly==5.15.0; extra == "code-profiler"
Provides-Extra: gneissweb-classification
Requires-Dist: fasttext>=0.9.3; platform_system != "Windows" and extra == "gneissweb-classification"
Requires-Dist: langcodes>=3.5.0; extra == "gneissweb-classification"
Requires-Dist: huggingface-hub<1.0.0,>=0.21.4; extra == "gneissweb-classification"
Requires-Dist: numpy<1.29.0,>=1.26.4; extra == "gneissweb-classification"
Provides-Extra: rep-removal
Requires-Dist: nltk>=3.9.1; extra == "rep-removal"
Requires-Dist: requests; extra == "rep-removal"
Requires-Dist: transformers; extra == "rep-removal"
Requires-Dist: pandas; extra == "rep-removal"
Requires-Dist: psutil; extra == "rep-removal"
Requires-Dist: GPUtil; extra == "rep-removal"

# DPK Python Transforms

## Installation

The [transforms](https://github.com/IBM/data-prep-kit/blob/dev/transforms/README.md) are delivered as a standard Python library available on PyPI and can be installed with pip:

`python -m pip install "data-prep-toolkit-transforms[all]"`
or
`python -m pip install "data-prep-toolkit-transforms[ray,all]"`
or
`python -m pip install "data-prep-toolkit-transforms[language]"`


Installing the Python transforms also installs `data-prep-toolkit`.

Installing the Ray transforms also installs `data-prep-toolkit[ray]`.
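
To check what a given extra pulled in, the requirements recorded in the installed package metadata can be grouped by extra. The following is a minimal sketch using only the standard library; the helper names `group_by_extra` and `installed_extras` are hypothetical and not part of the toolkit.

```python
import re
from importlib import metadata

# Matches the `extra == "name"` clause inside a requirement's environment marker.
EXTRA_RE = re.compile(r"""extra\s*==\s*['"]([^'"]+)['"]""")


def group_by_extra(requirements):
    """Group requirement strings by the extra named in their marker.

    Requirements without an extra marker are stored under the empty string.
    """
    groups = {}
    for req in requirements:
        spec, _, marker = req.partition(";")
        match = EXTRA_RE.search(marker)
        extra = match.group(1) if match else ""
        groups.setdefault(extra, []).append(spec.strip())
    return groups


def installed_extras(dist_name="data-prep-toolkit-transforms"):
    """Return the grouped requirements of an installed distribution, or None."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return None
    return group_by_extra(reqs)


if __name__ == "__main__":
    extras = installed_extras()
    if extras is None:
        print("data-prep-toolkit-transforms is not installed")
    else:
        for name in sorted(extras):
            print(name or "(base)", len(extras[name]), "requirements")
```

This only reports what the package declares for each extra; it does not verify that those dependencies are actually present in the environment.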

## List of Transforms in current package

Note: This list includes the transforms that have been part of the release since data-prep-toolkit-transforms 0.2.1. It may not always be up to date. Users are encouraged to raise an issue on GitHub when they discover components that are missing from this list, or packages that are listed below but are not in the release they get from PyPI.

* code
    * [code2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code2parquet/python/README.md)
    * [header_cleanser (Not available on macOS)](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/header_cleanser/python/README.md)
    * [code_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_quality/python/README.md)
    * [proglang_select](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/proglang_select/python/README.md)
    * [code_profiler](https://github.com/IBM/data-prep-kit/blob/dev/transforms/code/code_profiler/README.md)
* language
    * [doc_chunk](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_chunk/README.md)
    * [doc_quality](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/doc_quality/README.md)
    * [lang_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/lang_id/README.md)
    * [pdf2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pdf2parquet/README.md)
    * [text_encoder](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/text_encoder/README.md)
    * [pii_redactor](https://github.com/IBM/data-prep-kit/blob/dev/transforms/language/pii_redactor/python/README.md)
* universal
    * [ededup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/ededup/README.md)
    * [fdedup](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/fdedup/README.md)
    * [filter](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/filter/python/README.md)
    * [resize](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/resize/python/README.md)
    * [tokenization](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/tokenization/README.md)
    * [doc_id](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/doc_id/README.md)
    * [web2parquet](https://github.com/IBM/data-prep-kit/blob/dev/transforms/universal/web2parquet/README.md)
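
To install only a few of the transforms listed above, a requirement string can be built from their names. This sketch assumes that each transform has an extra named after it with underscores replaced by hyphens (for example `doc_chunk` becomes `doc-chunk`), which is what the `Provides-Extra` entries in the package metadata suggest. The helpers `extra_name` and `requirement_for` are hypothetical, not part of the toolkit.

```python
PACKAGE = "data-prep-toolkit-transforms"


def extra_name(transform):
    """Convert a transform name such as 'doc_chunk' to its assumed extra name."""
    return transform.strip().lower().replace("_", "-")


def requirement_for(transforms, package=PACKAGE):
    """Build a pip requirement string selecting one extra per transform."""
    extras = sorted({extra_name(t) for t in transforms})
    if not extras:
        return package
    # No spaces between extras, so the string stays a single shell argument.
    return f"{package}[{','.join(extras)}]"


print(requirement_for(["doc_chunk", "lang_id", "filter"]))
# prints: data-prep-toolkit-transforms[doc-chunk,filter,lang-id]
```

The resulting string can be passed, quoted, to `python -m pip install`.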
   
## Release notes:

### 1.0.1.dev1
* Added GneissWeb transforms
* fdedup fix for Windows

### 1.0.1.dev0
* PR #979 (code_profiler)

### 1.0.0.a6
* Added Profiler
* Added Resize

### 1.0.0.a5
* Added PII Redactor
* Relaxed fasttext requirement to >= 0.9.2

### 1.0.0.a4
* Added missing Ray implementation for lang_id, doc_quality, tokenization, and filter
* Added Ray notebooks for lang_id, doc_quality, tokenization, and filter

### 1.0.0.a3
* Added code_profiler

### 1.0.0.a2
* Relaxed dependency on pandas (use the latest version or whatever the application has installed)
* Relaxed dependency on requests (use the latest version or whatever the application has installed)



 
