Metadata-Version: 2.1
Name: raccy
Version: 1.3.0
Summary: Web Scraping Library Based on Selenium
Home-page: https://github.com/danielafriyie/raccy
Author: Daniel Afriyie
Author-email: danielafriyie98@gmail.com
License: Apache License, Version 2.0
Platform: UNKNOWN
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: selenium (>=3.141.0)
Requires-Dist: wget (==3.2)
Requires-Dist: requests (==2.26.0)

# RACCY

### OVERVIEW
Raccy is a multithreaded web scraping library based on Selenium with a
built-in Object Relational Mapper (ORM). It can be used for web automation, web scraping, and
data mining. Currently, the ORM feature supports only SQLite databases.
Some of the features in this library are inspired by Django ORM and Scrapy.

### REQUIREMENTS
- Python 3.7+ 
- Works on Linux, Windows, and Mac

### ARCHITECTURE OVERVIEW
* **UrlDownloaderWorker:** responsible for downloading the urls of item(s) to be scraped and enqueuing them in ItemUrlQueue

* **ItemUrlQueue:** receives item urls from UrlDownloaderWorker and queues them
    for feeding to CrawlerWorker(s)

* **CrawlerWorker:** fetches item web pages, scrapes or extracts data from them, and enqueues the data in DatabaseQueue

* **DatabaseQueue:** receives scraped item data from CrawlerWorker(s) and queues it
    for feeding to DatabaseWorker.

* **DatabaseWorker:** receives scraped data from DatabaseQueue and stores it in a persistent database.
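
The data flow above can be sketched with the standard library alone. This is only an illustration of how the workers and queues hand data to each other, not raccy's actual implementation; the function names, example urls, and sentinel mechanism here are stand-ins:

```python
import queue
import threading

# Illustrative stand-ins for ItemUrlQueue and DatabaseQueue
url_queue = queue.Queue()
db_queue = queue.Queue()
SENTINEL = object()  # signals "no more work"


def url_downloader():
    # Plays the role of UrlDownloaderWorker: discover item urls
    for page in range(1, 4):
        url_queue.put('https://example.com/page/%d' % page)
    url_queue.put(SENTINEL)


def crawler():
    # Plays the role of CrawlerWorker: fetch pages, extract data
    while True:
        url = url_queue.get()
        if url is SENTINEL:
            break
        db_queue.put({'url': url, 'quote': 'scraped text'})
    db_queue.put(SENTINEL)


def database_worker(store):
    # Plays the role of DatabaseWorker: persist scraped data
    while True:
        item = db_queue.get()
        if item is SENTINEL:
            break
        store.append(item)


store = []
workers = [
    threading.Thread(target=url_downloader),
    threading.Thread(target=crawler),
    threading.Thread(target=database_worker, args=(store,)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(store))  # all 3 items flowed through both queues
```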

### INSTALL

```shell
pip install raccy
```

### TUTORIAL

```python
from raccy import (
    model, UrlDownloaderWorker, CrawlerWorker, DatabaseWorker, WorkersManager
)
from selenium import webdriver
from shutil import which

config = model.Config()
config.DATABASE = model.SQLiteDatabase('quotes.sqlite3')


class Quote(model.Model):
    quote_id = model.PrimaryKeyField()
    quote = model.TextField()
    author = model.CharField(max_length=100)


class UrlDownloader(UrlDownloaderWorker):
    start_url = 'https://quotes.toscrape.com/page/1/'
    max_url_download = 10

    def job(self):
        url = self.driver.current_url
        self.url_queue.put(url)
        self.follow(xpath="//a[contains(text(), 'Next')]", callback=self.job)


class Crawler(CrawlerWorker):

    def parse(self, url):
        self.driver.get(url)
        quotes = self.driver.find_elements_by_xpath("//div[@class='quote']")
        for q in quotes:
            quote = q.find_element_by_xpath(".//span[@class='text']").text
            author = q.find_element_by_xpath(".//span/small").text

            data = {
                'quote': quote,
                'author': author
            }
            self.log.info(data)
            self.db_queue.put(data)


class Db(DatabaseWorker):

    def save(self, data):
        Quote.objects.create(**data)


def get_driver():
    # Assumes chromedriver.exe is in the current directory or on PATH
    driver_path = which('.\\chromedriver.exe')
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument("--start-maximized")
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    return driver


if __name__ == '__main__':
    manager = WorkersManager()
    manager.add_driver(get_driver)
    manager.start()
    print('Done scraping...........')

```

### Author

* **Afriyie Daniel**

Hope you enjoy using it!


