Metadata-Version: 2.4
Name: prepo
Version: 0.1.3
Summary: A package for preprocessing pandas DataFrames
Home-page: https://github.com/erikhox/prepo
Author: Erik Hoxhaj
Author-email: erik.hoxhaj@outlook.com
Project-URL: Bug Reports, https://github.com/erikhox/prepo/issues
Project-URL: Source, https://github.com/erikhox/prepo
Project-URL: Documentation, https://github.com/erikhox/prepo#readme
Keywords: pandas preprocessing data-science feature-engineering
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.0.0
Requires-Dist: numpy>=1.18.0
Requires-Dist: scikit-learn>=0.22.0
Requires-Dist: scipy>=1.4.0
Requires-Dist: python-dateutil>=2.8.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: pytest-cov>=2.10.0; extra == "dev"
Requires-Dist: flake8>=3.8.0; extra == "dev"
Requires-Dist: black>=20.8b1; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Prepo

A Python package for preprocessing pandas DataFrames, with a focus on automatic data type detection, cleaning, and scaling.

## Features

- **Automatic Data Type Detection**: Automatically identifies column types (numeric, categorical, temporal, etc.)
- **Data Cleaning**: Handles missing values, standardizes null representations
- **Outlier Removal**: Identifies and removes outliers from numeric columns
- **Feature Scaling**: Supports multiple scaling methods (standard, robust, minmax)
- **Time Series Detection**: Detects whether a DataFrame represents time series data

## Installation

```bash
pip install prepo
```
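
The package also declares a `dev` extra (pytest, pytest-cov, flake8, black), as listed in its metadata. To install it for local development:

```bash
pip install prepo[dev]
```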

## Usage

```python
import pandas as pd
from prepo import FeaturePreProcessor

# Create a processor instance
processor = FeaturePreProcessor()

# Load your data
df = pd.read_csv('data/raw/your_data.csv')

# Process the data
processed_df = processor.process(
    df, 
    drop_na=True,           # Drop rows with missing values
    scaler_type='standard', # Scale numeric features using standard scaling
    remove_outlier=True     # Remove outliers
)

# Save the processed data
processed_df.to_csv('data/processed/processed_data.csv', index=False)
```
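The three `scaler_type` options correspond to standard transforms. A minimal sketch of what each does to a numeric column, written in plain pandas to illustrate the math (prepo's internal implementation may differ in detail):

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0, 100.0])  # note the outlier at 100

# 'standard': z-score, (x - mean) / std
standard = (s - s.mean()) / s.std()

# 'robust': (x - median) / IQR, far less sensitive to the outlier
iqr = s.quantile(0.75) - s.quantile(0.25)
robust = (s - s.median()) / iqr

# 'minmax': rescale linearly into [0, 1]
minmax = (s - s.min()) / (s.max() - s.min())
```

Robust scaling is usually the better default when `remove_outlier=False`, since a single extreme value inflates both the mean and the standard deviation used by standard scaling.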

## Data Type Detection

The package automatically detects the following data types:

- **temporal**: Date and time columns
- **binary**: Columns with only two unique values
- **percentage**: Columns with values between 0 and 1, or whose names contain "perc", "rating", etc.
- **price**: Columns whose names contain "price", "cost", "revenue", etc.
- **id**: Columns whose names start or end with "id"
- **numeric**: General numeric columns
- **integer**: Numeric columns with integer values
- **categorical**: Columns with a low ratio of unique values to total values
- **string**: Short text columns
- **text**: Long text columns
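
As a concrete illustration of the heuristics above, here is a toy DataFrame whose columns exercise several of the detected categories (the column names and data are invented for the example; the exact detection output is not shown here):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [101, 102, 103, 104],         # id: name ends with "id"
    "price": [9.99, 24.50, 3.00, 15.75],     # price: name contains "price"
    "rating": [0.8, 0.65, 0.9, 0.4],         # percentage: values in [0, 1]
    "signup_date": pd.to_datetime(
        ["2023-01-05", "2023-02-11", "2023-03-02", "2023-04-20"]
    ),                                       # temporal: datetime column
    "is_active": [True, False, True, True],  # binary: only two unique values
    "country": ["DE", "FR", "DE", "FR"],     # categorical: few unique values
})
```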

## Project Structure

```
prepo/
├── data/               # Data directory
│   ├── raw/            # Raw data files
│   ├── processed/      # Processed data files
│   └── test/           # Test data files
├── src/                # Source code
│   └── prepo/          # Main package
│       ├── __init__.py        # Package initialization
│       └── preprocessor.py    # Core preprocessing functionality
├── tests/              # Test directory
│   ├── __init__.py     # Test package initialization
│   └── test_preprocessor.py  # Tests for preprocessor
├── examples/           # Example scripts
│   └── basic_usage.py  # Basic usage example
├── README.md           # Project documentation
├── LICENSE             # License information
└── setup.py            # Package installation script
```

## License

This project is licensed under the MIT License - see the LICENSE file for details.
