Metadata-Version: 2.4
Name: smolcrawl
Version: 0.1.6
Summary: Crawls and indexes websites
Author-email: Bill Chambers <contact@learnbybuilding.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/bllchmbrs/smolcrawl
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: crawlee[all]>=0.6.7
Requires-Dist: diskcache>=5.6.3
Requires-Dist: loguru>=0.7.3
Requires-Dist: markdownify>=1.1.0
Requires-Dist: python-dotenv>=1.1.0
Requires-Dist: readabilipy>=0.3.0
Requires-Dist: tantivy>=0.22.2
Requires-Dist: typer>=0.15.2
Dynamic: license-file

# SmolCrawl

A lightweight web crawler and indexer for creating searchable document collections from websites.

## Overview

SmolCrawl is a Python-based tool that helps you:
- Crawl websites and extract content
- Convert HTML content to readable markdown
- Index pages for efficient searching
- Query indexed content with relevance scoring

Perfect for creating local knowledge bases, documentation search, or personal research collections.

## Features

- **Simple Web Crawling**: Easily crawl and extract content from target websites
- **Content Extraction**: Automatically extracts meaningful content from HTML using readability algorithms
- **Markdown Conversion**: Converts HTML content to clean, readable markdown format
- **Fast Indexing**: Uses Tantivy (Rust-based search library) for performant full-text search
- **Caching**: Implements disk-based caching to avoid redundant crawling
- **CLI Interface**: Simple command-line interface for all operations

## Installation

```bash
# Clone the repository
git clone https://github.com/bllchmbrs/smolcrawl.git
cd smolcrawl

# Install the package
pip install -e .
```

## Requirements

- Python 3.11 or higher
- Dependencies are automatically installed with the package

## Usage

### Crawl a Website

```bash
smolcrawl crawl https://example.com
```

### Index a Website

```bash
smolcrawl index https://example.com my_index_name
```

### List Available Indices

```bash
smolcrawl list_indices
```

### Query an Index

```bash
smolcrawl query my_index_name "your search query" --limit 10 --score_threshold 0.5
```

### Delete an Index

```bash
smolcrawl delete_index my_index_name
```

## Configuration

SmolCrawl uses environment variables for configuration:

- `STORAGE_PATH`: Path to store data (default: `./data`)
- `CACHE_PATH`: Path for caching (default: `./data/cache`)

You can set these in a `.env` file in the project root.
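For example, a `.env` file overriding both defaults might look like this (the paths shown are illustrative; any writable directory works):

```bash
# .env — loaded via python-dotenv at startup
STORAGE_PATH=./my_data
CACHE_PATH=./my_data/cache
```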

## Project Structure

```
smolcrawl/
├── src/smolcrawl/
│   ├── __init__.py    # CLI and entry points
│   ├── crawl.py       # Web crawling functionality
│   ├── db.py          # Indexing and search functionality
│   └── utils.py       # Utility functions
├── data/              # Storage for indices and cache (gitignored)
├── .gitignore
└── pyproject.toml     # Project metadata and dependencies
```

## How It Works

1. **Crawling**: Uses Crawlee's `BeautifulSoupCrawler` to fetch web pages and extract links
2. **Content Processing**: Extracts meaningful content with ReadabiliPy, then converts it to markdown
3. **Indexing**: Stores extracted content in a Tantivy index for efficient searching
4. **Searching**: Performs full-text search on indexed content with relevance ranking
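To make steps 3 and 4 concrete, here is a toy, standard-library-only sketch of the index-then-search idea. SmolCrawl itself delegates this to Tantivy, which is far faster and uses proper relevance scoring (BM25); this is only an illustration, and the URLs and scoring are made up for the example.

```python
from collections import Counter, defaultdict

def tokenize(text: str) -> list[str]:
    # Crude whitespace tokenizer; a real engine normalizes much more.
    return [t.lower() for t in text.split()]

def build_index(docs: dict[str, str]) -> dict[str, Counter]:
    # Inverted index: token -> {url: term frequency}
    index: dict[str, Counter] = defaultdict(Counter)
    for url, body in docs.items():
        for token in tokenize(body):
            index[token][url] += 1
    return index

def search(index: dict[str, Counter], query: str, limit: int = 10):
    # Naive relevance: sum of term frequencies across query tokens.
    scores: Counter = Counter()
    for token in tokenize(query):
        for url, tf in index.get(token, Counter()).items():
            scores[url] += tf
    return scores.most_common(limit)

docs = {
    "https://example.com/a": "web crawler indexing pages",
    "https://example.com/b": "cooking recipes and pages",
}
index = build_index(docs)
print(search(index, "crawler pages"))
# → [('https://example.com/a', 2), ('https://example.com/b', 1)]
```

The real pipeline works the same way in spirit: documents go in once at index time, and queries are answered from the index without re-fetching any pages.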

## Responsible Crawling

Web crawlers can place real load on the sites they visit. When crawling websites, please be mindful and respectful of the website owners and their resources.

- **Check `robots.txt`**: Always check a website's `robots.txt` file (`https://example.com/robots.txt`) before crawling. Respect the rules outlined there regarding which paths are allowed or disallowed for crawling.
- **Rate Limiting**: Avoid overwhelming the target server with too many requests in a short period. Implement delays between requests if necessary (SmolCrawl does not currently have built-in rate limiting).
- **Identify Yourself**: Consider setting a descriptive User-Agent string to identify your crawler, although SmolCrawl does not currently support custom User-Agents.
- **Crawl During Off-Peak Hours**: If possible, schedule crawls during times when the website is likely to have lower traffic.
- **Use Caching**: Take advantage of SmolCrawl's caching feature to avoid re-downloading content unnecessarily.
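Since SmolCrawl does not check `robots.txt` for you, you can do it yourself with Python's standard library before starting a crawl. The sketch below parses an inline `robots.txt` for clarity; in practice you would call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()` to fetch the real file. The `MyCrawler` agent name is a placeholder.

```python
import urllib.robotparser

# Example robots.txt content; normally fetched from the target site.
rules = """\
User-agent: *
Disallow: /private/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Check individual URLs before crawling them.
print(rp.can_fetch("MyCrawler", "https://example.com/docs/"))      # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))  # False
```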

Misusing web crawlers can lead to your IP address being blocked and can negatively impact the performance and availability of the website for others. Use SmolCrawl ethically and responsibly.

## License

MIT — see [LICENSE.md](LICENSE.md) for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
