Metadata-Version: 2.4
Name: hypersets
Version: 0.0.2
Summary: Fast, efficient alternative to Hugging Face load_dataset using DuckDB for querying, sampling and transforming remote datasets
Project-URL: Homepage, https://github.com/omarkamali/hypersets
Project-URL: Repository, https://github.com/omarkamali/hypersets
Project-URL: Issues, https://github.com/omarkamali/hypersets/issues
Project-URL: Documentation, https://github.com/omarkamali/hypersets#readme
Author-email: Omar Kamali <hypersets@omarkama.li>
License: MIT
License-File: LICENSE
Keywords: datasets,duckdb,huggingface,machine-learning,parquet
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Requires-Dist: datasets>=2.14.0
Requires-Dist: duckdb>=0.9.0
Requires-Dist: huggingface-hub>=0.16.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyarrow>=12.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: requests>=2.28.0
Requires-Dist: typing-extensions>=4.5.0
Provides-Extra: datasets
Requires-Dist: datasets>=2.0.0; extra == 'datasets'
Provides-Extra: dev
Requires-Dist: black>=22.0.0; extra == 'dev'
Requires-Dist: flake8>=4.0.0; extra == 'dev'
Requires-Dist: isort>=5.10.0; extra == 'dev'
Requires-Dist: mypy>=0.950; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# Hypersets

**Efficient SQL interface for HuggingFace datasets using DuckDB.**

Hypersets is a library for working with massive datasets without downloading them in full. Query terabytes of data with simple SQL while downloading only what you need.

_Hypersets is currently in pre-alpha. Use at your own risk._

## ✨ Features

- 🚀 **Fast metadata retrieval** - Get dataset info without downloading
- 💾 **Memory-only operation** - No disk caching unless requested  
- 🎯 **Efficient querying** - SQL interface with DuckDB optimization
- 📊 **Download tracking** - See exactly how much data you're saving
- 🧠 **Smart caching** - Avoid repeated API calls
- 🔄 **Multiple formats** - Output as pandas DataFrame or HuggingFace Dataset
- ⚡ **Rate limit handling** - Built-in exponential backoff for 429 errors (see the sketch after this list)
- 🛡️ **Proper error handling** - Clear exceptions for common issues
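
The exponential backoff mentioned above boils down to a retry loop around HTTP requests. Below is a minimal sketch of that pattern with `requests`; it is not Hypersets' internal code, and the function name and constants are illustrative only.

```python
import time

import requests


def get_with_backoff(url: str, max_retries: int = 5, base_delay: float = 1.0) -> requests.Response:
    """Retry a GET request with exponential backoff when rate limited (HTTP 429)."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor Retry-After when the server provides it, otherwise back off exponentially.
        delay = float(response.headers.get("Retry-After", base_delay * (2 ** attempt)))
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```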

## 🚦 Validation Status

What has been tested and confirmed so far:
- **Dataset info retrieval**: Fast YAML frontmatter parsing
- **Efficient querying**: DuckDB SQL with HTTP optimization and 429 retry logic
- **Smart caching**: 1000x+ speedup on repeated calls  
- **Download tracking**: 99.9% data savings demonstrated on real datasets (0.04 GB downloaded from a 59 GB dataset for simple operations)
- **Multiple formats**: pandas DataFrame and HuggingFace Dataset support
- **Error handling**: Proper exceptions and retry logic for production use
- **Memory efficiency**: Handles TB-scale datasets using only megabytes to gigabytes of RAM and bandwidth

## 📦 Installation

```bash
pip install hypersets
```

## 🎯 Quick Start

```python
import hypersets as hs

# Get dataset info without downloading
info = hs.info("omarkamali/wikipedia-monthly") 
print(f"Dataset size: {info.estimated_total_size_gb:.1f} GB")
print(f"Configs: {len(info.config_names)}")
print(f"Available configs: {info.config_names[:5]}")

# Query with SQL - only downloads what's needed
result = hs.query(
    "SELECT title, LENGTH(text) as text_length FROM dataset LIMIT 10",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Convert to pandas for analysis
df = result.to_pandas()
print(f"Retrieved {len(df)} articles")
```

## 🚀 Core API

### Dataset Information
```python
# Get comprehensive dataset metadata
info = hs.info("omarkamali/wikipedia-monthly")
print(f"Total files: {info.total_parquet_files}")
print(f"Size estimate: {info.estimated_total_size_gb:.1f} GB")

# List available configurations
configs = hs.list_configs("omarkamali/wikipedia-monthly")
print(f"Available configs: {configs[:10]}")  # First 10

# Clear cached metadata
hs.clear_cache()
```

### SQL Querying
```python
# Basic querying
result = hs.query(
    "SELECT title, url FROM dataset WHERE LENGTH(text) > 10000 LIMIT 100",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)

# Aggregation queries
count = hs.count(
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en"
)
print(f"Total articles: {count:,}")

# Advanced analytics
stats = hs.query(
    """
    SELECT 
        COUNT(*) as total_articles,
        AVG(LENGTH(text)) as avg_length,
        MAX(LENGTH(text)) as max_length
    FROM dataset
    """,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
```

### Sampling & Exploration
```python
# Random sampling with DuckDB optimization
sample = hs.sample(
    n=1000,
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    columns=["title", "url", "LENGTH(text) as text_length"]
)

# Quick data preview
preview = hs.head(
    n=5,
    dataset="omarkamali/wikipedia-monthly", 
    config="latest.en",
    columns=["title", "url"]
)

# Schema inspection
schema = hs.schema(
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en"
)
print(f"Columns: {[col['name'] for col in schema.columns]}")
```

### Output Formats
```python
result = hs.query("SELECT * FROM dataset LIMIT 100", ...)

# As pandas DataFrame
df = result.to_pandas()
print(df.head())

# As HuggingFace Dataset
hf_dataset = result.to_hf_dataset()
print(hf_dataset.features)

# Query result metadata
print(f"Shape: {result.shape}")
print(f"Columns: {result.columns}")
```

### Download Tracking
```python
# Enable download tracking to see data savings
result = hs.query(
    "SELECT title FROM dataset LIMIT 1000",
    dataset="omarkamali/wikipedia-monthly",
    config="latest.en",
    track_downloads=True
)

# Check savings
if result.download_stats:
    stats = result.download_stats
    print(f"Total dataset: {stats.total_dataset_size_gb:.1f} GB")
    print(f"Downloaded: {stats.estimated_downloaded_gb:.2f} GB")
    print(f"Savings: {stats.savings_percentage:.1f}%")
```

## 📁 Examples

Explore our comprehensive examples to see Hypersets in action:

### 🏃 Quick Demo
```bash
python examples/demo.py
```
**Complete feature demonstration** - Shows all Hypersets capabilities with real datasets.

### 📚 Basic Usage
```bash
python examples/basic_usage.py  
```
**Learn the fundamentals** - Dataset info, querying, sampling, caching, and output formats.

### 🔬 Advanced Queries
```bash  
python examples/advanced_queries.py
```
**Sophisticated analytics** - Text analysis, pattern matching, quality metrics, and performance optimization.

## 🏗️ Architecture

Hypersets consists of four core components:

1. **Dataset Info Retriever** - Discovers parquet files, configs, and schema from YAML frontmatter
2. **DuckDB Mount System** - Mounts remote parquet files as virtual tables with HTTP optimization
3. **Query Interface** - Clean API with SQL support, download tracking, and multiple output formats
4. **Smart Caching** - TTL-based caching of dataset metadata to avoid repeated API calls

All components include proper 429 rate limit handling with exponential backoff.
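
Conceptually, the mount step is similar to pointing DuckDB's `httpfs` extension at the dataset's remote parquet files and exposing them as a view. The snippet below is a minimal sketch of that idea, not Hypersets' actual implementation; the parquet URLs are placeholders for files that would be discovered from the dataset metadata.

```python
import duckdb

# Sketch of the "mount" idea: expose remote parquet files as a virtual table.
con = duckdb.connect()                       # in-memory database, nothing touches disk
con.execute("INSTALL httpfs; LOAD httpfs;")  # enable HTTP(S) range reads
con.execute("SET memory_limit = '4GB';")     # cap DuckDB's working memory

# Placeholder URLs; in practice these come from the dataset's metadata.
parquet_urls = [
    "https://huggingface.co/datasets/<repo>/resolve/main/<config>/part-00000.parquet",
    "https://huggingface.co/datasets/<repo>/resolve/main/<config>/part-00001.parquet",
]

files_sql = ", ".join(f"'{url}'" for url in parquet_urls)
con.execute(f"CREATE OR REPLACE VIEW dataset AS SELECT * FROM read_parquet([{files_sql}])")

# DuckDB pushes column selections and limits down to the HTTP layer,
# so only the required byte ranges of each file are fetched.
df = con.execute("SELECT title FROM dataset LIMIT 10").df()
```

Because parquet is columnar, this is where the download savings come from: a query that touches a few columns or a few row groups pulls only those byte ranges instead of whole files.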

## 🔧 Advanced Configuration

### Memory Management
```python
# Configure DuckDB memory limit (default: 4GB)
result = hs.query(
    "SELECT * FROM dataset LIMIT 1000",
    dataset="large/dataset",
    memory_limit="8GB"  # Increase for large datasets
)

# For extremely large datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10000", 
    dataset="massive/dataset",
    memory_limit="16GB",  # More memory
    threads=8             # More threads
)

# Memory-efficient column selection
result = hs.query(
    "SELECT id, title FROM dataset LIMIT 100000",  # Only select needed columns
    dataset="large/dataset",
    memory_limit="2GB"  # Can use less memory
)
```

**Memory Limit Guidelines:**
- **Default (4GB)**: Good for most datasets up to ~50GB
- **8GB**: For large datasets (50-200GB) or complex queries  
- **16GB+**: For massive datasets (200GB+) or heavy aggregations
- **Column selection**: Always select only needed columns for better memory efficiency

### Custom Caching
```python
# Cache with custom TTL (Time To Live)
info = hs.info("dataset", cache_ttl=3600)  # 1 hour

# Disable caching for fresh data
info = hs.info("dataset", use_cache=False)
```

### Authentication
```python
# Use HuggingFace token for private datasets
result = hs.query(
    "SELECT * FROM dataset LIMIT 10",
    dataset="private/dataset",
    token="hf_your_token_here"
)
```

### Performance Tuning
```python
# Optimize for your use case
result = hs.query(
    "SELECT * FROM dataset USING SAMPLE 10000",
    dataset="large/dataset",
    memory_limit="6GB",    # Adequate memory
    threads=4,            # Balanced parallelism  
    track_downloads=True  # Monitor efficiency
)

# For aggregation-heavy workloads
stats = hs.query(
    """
    SELECT 
        category,
        COUNT(*) as count,
        AVG(LENGTH(text)) as avg_length
    FROM dataset 
    GROUP BY category
    """,
    dataset="large/dataset",
    memory_limit="12GB",  # More memory for grouping
    threads=8            # More threads for aggregation
)
```

## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make changes and add tests
4. Run tests: `pytest tests/`
5. Submit a pull request

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Acknowledgments

- **DuckDB** for incredible SQL analytics on remote data
- **Parquet** for providing the de facto standard for columnar data storage
- **HuggingFace** for democratizing access to datasets
- **The open source community** for inspiration and feedback

## Contributors

[Omar Kamali](https://omarkama.li)

## 📝 Citation

If you use Hypersets in your research, please cite:

```bibtex
@misc{hypersets,
    title={Hypersets: Efficient dataset transfer, querying and transformation},
    author={Omar Kamali},
    year={2025},
    url={https://github.com/omarkamali/hypersets},
    note={Project developed under Omneity Labs}
}
```


---

**🚀 Ready to query terabytes of data efficiently?** Start with `examples/demo.py` to see Hypersets in action! 