Metadata-Version: 2.0
Name: xlseries
Version: 0.2.2
Summary: Python package to scrape time series data from excel files.
Home-page: https://github.com/abenassi/xlseries
Author: Agustin Benassi
Author-email: agusbenassi@gmail.com
Maintainer: Agustin Benassi
Maintainer-email: agusbenassi@gmail.com
License: GPLv3+
Download-URL: https://github.com/abenassi/xlseries/archive/master.zip
Keywords: xlseries excel time series data opendata scraper
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Topic :: Office/Business
Requires-Dist: arrow (==0.5.4)
Requires-Dist: asn1crypto (==0.24.0)
Requires-Dist: certifi (==2019.6.16)
Requires-Dist: cffi (==1.12.3)
Requires-Dist: coverage (==4.5.4)
Requires-Dist: cryptography (==2.7)
Requires-Dist: et-xmlfile (==1.0.1)
Requires-Dist: jdcal (==1.4.1)
Requires-Dist: mock (==3.0.5)
Requires-Dist: nose (==1.3.7)
Requires-Dist: numpy (==1.11.3)
Requires-Dist: openpyxl (==2.5.0)
Requires-Dist: pandas (<0.21,>=0.20)
Requires-Dist: Parsley (==1.3)
Requires-Dist: pycallgraph (==1.0.1)
Requires-Dist: pycparser (==2.19)
Requires-Dist: python-dateutil (==2.8.0)
Requires-Dist: pytz (==2019.2)
Requires-Dist: PyYAML (==5.1.2)
Requires-Dist: six (==1.12.0)
Requires-Dist: Unidecode (==0.4.21)
Requires-Dist: xlrd (==1.0.0)

xlseries
========

A Python package to scrape `time
series <https://en.wikipedia.org/wiki/Time_series>`__ from *any* Excel
file and return them as `pandas <http://pandas.pydata.org/pandas-docs/dev/index.html>`__ `data frames <http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html>`__.

Installation
------------

If you want to install in developer mode, `clone the repository <https://github.com/abenassi/xlseries.git>`__ and follow these instructions:

**If you are using Anaconda as your Python distribution**

1. ``conda create -n xlseries python=3.6`` *Create new environment*
2. ``cd project_directory``
3. ``source activate xlseries`` *Activate the environment*
4. ``pip install -e .`` *Install the package in developer mode*
5. ``pip install -r requirements.txt`` *Install dependencies*
6. ``source deactivate`` *Deactivate when you are done*

**If you are using a standard Python installation**

1. ``cd project_directory``
2. ``virtualenv venv`` *Create new environment*
3. ``source venv/bin/activate`` *Activate the environment*
4. ``pip install -e .`` *Install the package in developer mode*
5. ``pip install -r requirements.txt`` *Install dependencies*
6. ``deactivate`` *Deactivate when you are done*

If you just want to use the package, skip cloning and the developer-mode
install and run ``pip install xlseries`` in your environment.

Quick start
-----------

.. code:: python

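    from xlseries import XlSeries

    # XlSeries takes a path to an excel file or an openpyxl.Workbook instance
    xl = XlSeries("path_to_excel_file")
    # get_data_frames takes a path to a JSON parameters file or a parameters dictionary
    dfs = xl.get_data_frames("path_to_json_parameters")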

With test case number 1:

.. code:: python

    from xlseries import XlSeries
    from xlseries.utils.path_finders import get_orig_cases_path, get_param_cases_path

    # this will only work if you clone the repo with all the test files
    path_to_excel_file = get_orig_cases_path(1)
    path_to_json_parameters = get_param_cases_path(1)

    xl = XlSeries(path_to_excel_file)
    dfs = xl.get_data_frames(path_to_json_parameters)

Or pass only the critical parameters as a dictionary:

.. code:: python

    parameters_dictionary = {
        "headers_coord": ["B1","C1"],
        "data_starts": 2,
        "frequency": "M",
        "time_header_coord": "A1"
    }
    dfs = xl.get_data_frames(parameters_dictionary)
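
A missing critical key is an easy mistake when the dictionary is written
by hand. The sketch below checks for the four critical parameters before
``get_data_frames`` is called. ``missing_critical_params`` is a
hypothetical helper written for this example and is not part of
``xlseries``.

.. code:: python

    # The four critical parameter names used throughout this README.
    CRITICAL_PARAMS = ("headers_coord", "data_starts", "frequency",
                       "time_header_coord")


    def missing_critical_params(params):
        """Return the critical parameter names absent from ``params``.

        Hypothetical helper for illustration, not an xlseries function.
        """
        return [name for name in CRITICAL_PARAMS if name not in params]


    parameters_dictionary = {
        "headers_coord": ["B1", "C1"],
        "data_starts": 2,
        "frequency": "M",
        "time_header_coord": "A1"
    }

    # Empty list: every critical parameter is present.
    missing = missing_critical_params(parameters_dictionary)

    # A dictionary with only the frequency is missing the other three.
    incomplete = missing_critical_params({"frequency": "M"})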

You can specify which worksheet you want to scrape (otherwise the first
one is used):

.. code:: python

    dfs = xl.get_data_frames(parameters_dictionary, ws_name="my_worksheet")

You can ask an ``XlSeries`` object for a template dictionary of the
critical parameters you need to fill in:

.. code:: python

    >>> params = xl.critical_params_template()
    >>> params
    {'data_starts': 2,
     'frequency': 'M',
     'headers_coord': ['B1', 'C1', 'E1-G1'],
     'time_header_coord': 'A1'}
    >>> params["headers_coord"] = ["B1","C1"]
    >>> dfs = xl.get_data_frames(params, ws_name="my_worksheet")

If this doesn't work and you want to see exactly where the scraping is
failing, fill out all the parameters and try again to see where the
exception is raised:

.. code:: python

    >>> params = xl.complete_params_template()
    >>> params
    {'alignment': u'vertical',
     'blank_rows': False,
     'continuity': True,
     'data_ends': None,
     'data_starts': 2,
     'frequency': 'M',
     'headers_coord': ['B1', 'C1', 'E1-G1'],
     'missing_value': [None, '-', '...', '.', ''],
     'missings': False,
     'series_names': None,
     'time_alignment': 0,
     'time_composed': False,
     'time_header_coord': 'A1',
     'time_multicolumn': False}
    >>> params["headers_coord"] = ["B1","C1"]
    >>> params["data_ends"] = 256
    >>> params["missings"] = True
    >>> dfs = xl.get_data_frames(params, ws_name="my_worksheet")
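
When many template values are overridden, a typo in a key name silently
adds a new entry instead of changing the intended one. A minimal sketch
of a guard against that follows. ``override_params`` is a hypothetical
helper and is not part of ``xlseries``, and the ``template`` dictionary
is a trimmed stand-in for the output of ``complete_params_template()``.

.. code:: python

    def override_params(template, **overrides):
        """Return a copy of ``template`` with ``overrides`` applied.

        Hypothetical helper: raises KeyError if an override names a
        parameter that the template does not contain.
        """
        unknown = sorted(set(overrides) - set(template))
        if unknown:
            raise KeyError("unknown parameters: {}".format(", ".join(unknown)))
        params = dict(template)  # shallow copy, the template is left untouched
        params.update(overrides)
        return params


    # Trimmed stand-in for xl.complete_params_template()
    template = {
        "headers_coord": ["B1", "C1", "E1-G1"],
        "data_ends": None,
        "missings": False,
    }

    params = override_params(template,
                             headers_coord=["B1", "C1"],
                             data_ends=256,
                             missings=True)

The resulting ``params`` dictionary can then be passed to
``get_data_frames`` as in the example above.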

-  **Excel file**: At this stage of development, the Excel file should
   not be more *complicated* than the `7 test cases <https://github.com/abenassi/xlseries#test-cases>`__.

-  **Parameters**: Together with the Excel file, some parameters about
   the series must be provided. These can be passed to
   ``get_data_frames()`` as a path to a *JSON file* or as a *Python
   dictionary*. ``xlseries`` uses about 14 parameters to characterize the
   time series of a spreadsheet, but only 4 of them are *critical* most
   of the time: the rest can be guessed by the package. The only
   difference between specifying more or fewer parameters than the 4
   critical ones is the total time that ``xlseries`` needs to complete
   the task (more parameters, less time). Go to the `parameters <https://github.com/abenassi/xlseries#parameters>`__ section for a more detailed
   explanation of how to use them, and of when you need to specify more
   than the basic 4 (``headers_coord``, ``data_starts``, ``frequency``
   and ``time_header_coord``).
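
To move between the two forms, a parameters dictionary can be written to
a JSON file with the standard library. This is a minimal sketch that
assumes the JSON file holds the same keys as the dictionary in a single
flat object. The exact layout ``xlseries`` expects is not shown here, so
compare the result with the parameter files in the repository before
relying on it. The file name ``parameters.json`` is arbitrary.

.. code:: python

    import json
    import os
    import tempfile

    parameters_dictionary = {
        "headers_coord": ["B1", "C1"],
        "data_starts": 2,
        "frequency": "M",
        "time_header_coord": "A1"
    }

    # Write the dictionary to a JSON file in a temporary directory.
    path_to_json_parameters = os.path.join(tempfile.mkdtemp(), "parameters.json")
    with open(path_to_json_parameters, "w") as f:
        json.dump(parameters_dictionary, f, indent=4)

    # Read it back to confirm that nothing was lost in the round trip.
    with open(path_to_json_parameters) as f:
        loaded = json.load(f)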

Take a look at this `IPython notebook
template <https://github.com/abenassi/xlseries/blob/master/docs/notebooks/Example%20use%20case.ipynb>`__ to get started!

If you want to dig into the test cases and get an idea of how far
``xlseries`` can go at the moment, check out this `IPython notebook with
the 7 test cases <https://github.com/abenassi/xlseries/blob/master/docs/notebooks/Test%20cases.ipynb>`__.

For more details, go to the official repository on GitHub:
https://github.com/abenassi/xlseries


