Metadata-Version: 2.1
Name: hdx-python-scraper
Version: 1.3.3
Summary: HDX Python scraper utilities
Home-page: https://github.com/OCHA-DAP/hdx-python-scraper
Author: Michael Rans
Author-email: rans@email.com
License: MIT
Keywords: HDX,scrapers,library
Platform: any
Classifier: Development Status :: 5 - Production/Stable
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Unix
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Requires-Python: >=3.6
Description-Content-Type: text/markdown; charset=UTF-8
Requires-Dist: hdx-python-api (>=5.3.1)
Requires-Dist: gspread
Requires-Dist: olefile
Requires-Dist: regex
Provides-Extra: pandas
Requires-Dist: pandas (>=1.2.4) ; extra == 'pandas'

[![Build Status](https://github.com/OCHA-DAP/hdx-python-scraper/workflows/build/badge.svg)](https://github.com/OCHA-DAP/hdx-python-scraper/actions?query=workflow%3Abuild) [![Coverage Status](https://coveralls.io/repos/github/OCHA-DAP/hdx-python-scraper/badge.svg?branch=main&ts=1)](https://coveralls.io/github/OCHA-DAP/hdx-python-scraper?branch=main)

The HDX Python Scraper Library makes it easy to develop code that assembles data from one or more
sources.

If you have humanitarian-related data, please upload your datasets to HDX.

## Usage

The library has detailed API documentation which can be found
here: <http://ocha-dap.github.io/hdx-python-scraper/>. The code for the
library is here: <https://github.com/ocha-dap/hdx-python-scraper>.

To use the optional functions for outputting data from Pandas to JSON, Excel etc., install with:

    pip install hdx-python-scraper[pandas]

## Scraper Framework Configuration

The following is an example of how the framework is set up:

    today = datetime.now()
    adminone = AdminOne(configuration)
    population_lookup = dict()
    headers, columns, sources = get_tabular(configuration, ['AFG'], adminone, 'national', downloader, today=today, scrapers=['population'], population_lookup=population_lookup)
    assert headers == [['Population'], ['#population']]
    assert columns == [{'AFG': 38041754}]
    assert sources == [('#population', '2020-10-01', 'World Bank', 'https://data.humdata.org/organization/world-bank-group')]
    results = get_tabular(configuration, ['AFG'], adminone, 'national', downloader, today=today, scrapers=['who'], population_lookup=population_lookup)
    assert results['headers'] == [['CasesPer100000', 'DeathsPer100000', 'Cases2Per100000', 'Deaths2Per100000'], ['#affected+infected+per100000', '#affected+killed+per100000', '#affected+infected+2+per100000', '#affected+killed+2+per100000']]
    assert results['values']  == [{'AFG': '96.99'}, {'AFG': '3.41'}, {'AFG': '96.99'}, {'AFG': '3.41'}]
    assert results['sources'] == [('#affected+infected+per100000', '2020-08-06', 'WHO', 'tests/fixtures/WHO-COVID-19-global-data.csv'), ('#affected+killed+per100000', '2020-08-06', 'WHO', 'tests/fixtures/WHO-COVID-19-global-data.csv'), ('#affected+infected+2+per100000', '2020-08-06', 'WHO', 'tests/fixtures/WHO-COVID-19-global-data.csv'), ('#affected+killed+2+per100000', '2020-08-06', 'WHO', 'tests/fixtures/WHO-COVID-19-global-data.csv')]

The first parameter is a configuration. More on this later.

The second parameter is a list of country iso3s. The third is an AdminOne object. More about this can be found in 
the [HDX Python Country](https://github.com/OCHA-DAP/hdx-python-country) library, but briefly that class accepts a 
configuration as follows:

country_name_mappings defines the country name overrides we use (i.e. where we deviate from the names in the OCHA
countries and territories file).

    country_name_mappings:
      PSE: "occupied Palestinian territory"
      BOL: "Bolivia"

admin_info defines the admin level 1 names and pcodes. 

    admin_info:
      - {pcode: AF01, name: Kabul, iso2: AF, iso3: AFG, country: Afghanistan}
      - {pcode: AF02, name: Kapisa, iso2: AF, iso3: AFG, country: Afghanistan}

adm_mappings defines mappings from country name to iso3 and from admin 1 name to pcode.

    adm_mappings:
      - "Congo DR": "COD"
        "CAR": "CAF"
        "oPt": "PSE"
      - "Nord-Ouest": "HT09"
        "nord-ouest": "HT09"
    ...

adm1_name_replacements defines some find and replaces that are done to improve automatic admin1 name matching to 
pcode.

    adm1_name_replacements:
      " urban": ""
      "sud": "south"
      "ouest": "west"
    ...

adm1_fuzzy_ignore defines admin 1 names to ignore.

    adm1_fuzzy_ignore:
      - "nord"
    ...

The fourth parameter specifies which scraper set to process, corresponding to a name in the configuration beginning
with scraper_ (but without needing the scraper_ prefix in the parameter).

The population scraper is a special case, so it is read first so that populations can be used by subsequent 
scrapers in calculations.

### Configuration Parameter

The framework is configured by passing in a configuration. Typically this will come from a yaml file such as 
config/project_configuration.yml.

The configuration should have a mapping from the internal dictionaries to the tabs in the spreadsheet or keys in 
the JSON output file(s):

    tabs:
      world: "WorldData"
      regional: "RegionalData"
      national: "NationalData"
      subnational: "SubnationalData"
      covid_series: "CovidSeries"
      covid_trend: "CovidTrend"
      sources: "Sources"

Then the locations of Google spreadsheets are defined for prod (production), test and scratch:

    googlesheets:
      prod: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_PROD/edit"
      test: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_TEST/edit"
      scratch: "https://docs.google.com/spreadsheets/d/SPREADSHEET_KEY_SCRATCH/edit"

The JSON outputs are then specified. Under the key “additional”, subsets of the full JSON can be saved as separate
files.

    json:
      additional_json:
        - name: "Other"
          source: "Some org"
          source_url: "https://data.humdata.org/organization/world-bank-group"
          format: "json"
          url: "https://raw.githubusercontent.com/mcarans/hdx-python-scraper/master/tests/fixtures/additional_json.json"
          jsonpath: "[*]"
      filepath: "test_tabular_all.json"
      additional:
        - filepath: "test_tabular_population.json"
          tabs:
            - tab: "national"
              key: "cumulative"
              filters:
                "#country+code": "{{countries_to_save}}"
              hxltags:
                - "#country+code"
                - "#country+name"
                - "#population"
        - filepath: "test_tabular_population_2.json"
          tabs:
            - tab: "national"
              key: "cumulative"
              filters:
                "#country+code":
                  - "AFG"
              hxltags:
                - "#country+code"
                - "#country+name"
                - "#population"
        - filepath: "test_tabular_other.json"
          remove:
            - "national"

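The row filtering performed for the additional outputs can be pictured with a small sketch. Here `subset_rows` is a hypothetical helper, not part of the library: it keeps only rows matching the configured filters and only the configured HXL tag columns.

```python
def subset_rows(rows, filters, hxltags):
    """Keep rows passing all filters, reduced to the given HXL tag columns."""
    out = []
    for row in rows:
        keep = True
        for tag, allowed in filters.items():
            value = row.get(tag)
            if isinstance(allowed, list):
                if value not in allowed:
                    keep = False
            elif value != allowed:
                keep = False
        if keep:
            out.append({tag: row[tag] for tag in hxltags if tag in row})
    return out

rows = [
    {"#country+code": "AFG", "#country+name": "Afghanistan", "#population": 38041754},
    {"#country+code": "PSE", "#country+name": "State of Palestine", "#population": 4685306},
]
# mirrors filters: {"#country+code": ["AFG"]} with hxltags restricted to two columns
subset = subset_rows(rows, {"#country+code": ["AFG"]},
                     ["#country+code", "#population"])
```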
Next comes additional sources. This allows additional HXL hashtags to be associated with a dataset date, source 
and url. In the ones below, the metadata is obtained from datasets on HDX.

    additional_sources:
      - indicator: "#food-prices"
        dataset: "wfp-food-prices"
      - indicator: "#vaccination-campaigns"
        dataset: "immunization-campaigns-impacted"

scraper_tabname defines a set of mini scrapers that use the framework and produce data for the tab tabname, e.g.

    scraper_national:
    …

#### Examples

It is helpful to look at a few example mini scrapers to see how they are configured:

The economicindex mini scraper reads the dataset “covid-19-economic-exposure-index” on HDX, taking from it the 
dataset source and date and using the url of the dataset in HDX as the source url (used by the DATA link in the 
visual). The scraper framework finds the first resource that is of format “xlsx”, reads the “economic exposure” 
sheet and looks for the headers in row 1. Note that it is possible to specify a specific resource name using the 
key "resource" instead of searching for the first resource of a particular format.

adm_cols defines the column or columns in which to look for admin information. As this is a national level 
scraper, it uses the “Country” column. input_cols specifies the column(s) to use for values, in this case 
“Covid 19 Economic exposure index”. output_columns and output_hxltags define the header name(s) and HXL tag(s) to 
use for the input_cols.

     economicindex:
        format: "xlsx"
        dataset: "covid-19-economic-exposure-index"
        sheet: "economic exposure"
        headers: 1
        adm_cols:
          - "Country"
        input_cols:
          - "Covid 19 Economic exposure index"
        output_columns:
          - "EconomicExposure"
        output_hxltags:
          - "#severity+economic+num"

The population mini scraper configuration directly provides metadata for source, source_url and the download 
location given by url only taking the source date from the dataset. The scraper pulls subnational data so adm_cols 
defines both a country column “alpha_3” and an admin 1 pcode column “ADM1_PCODE”. 

input_transforms defines operations to be performed on each value in the column. In this case, the value is 
converted to an int or float where possible.

     population:
        source: "Multiple Sources"
        source_url: "https://data.humdata.org/search?organization=worldpop&q=%22population%20counts%22"
        format: "csv"
        url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vS3_uBOV_uRDSxOggfBus_pkCs6Iw9lm0nAxzwG14YF_frpm13WPKiM1oNnQ9zrUA/pub?gid=1565793974&single=true&output=csv"
        dataset: "global-humanitarian-response-plan-covid-19-administrative-boundaries-and-population-statistics"
        headers: 1
        adm_cols:
          - "alpha_3"
          - "ADM1_PCODE"
        input_cols:
          - "POPULATION"
        input_transforms:
          "POPULATION": "get_numeric_if_possible(POPULATION)"
        output_columns:
          - "Population"
        output_hxltags:
          - "#population"

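The numeric coercion used in input_transforms can be sketched as follows. This is a hypothetical reimplementation of what a function like `get_numeric_if_possible` might do, not the library's actual code:

```python
def to_numeric_if_possible(value):
    """Return an int or float when the string parses as one, else the value unchanged."""
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            try:
                return float(value)
            except ValueError:
                return value
    return value
```

Applied to a "POPULATION" column, `"38041754"` would become the int `38041754`, `"3.41"` the float `3.41`, and a non-numeric string such as `"N/A"` would pass through unchanged.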
The travel mini scraper reads values from the “info” and “published” columns of the source. append_cols defines 
any columns where if the same admin appears again (in this case, the same country), then that data is appended to 
the existing. keep_cols defines any columns where if the same admin appears again, the existing value is kept 
rather than replaced.

     travel:
        format: "csv"
        dataset: "covid-19-global-travel-restrictions-and-airline-information"
        headers: 1
        adm_cols:
          - "iso3"
        input_cols:
          - "info"
          - "published"
        append_cols:
          - "info"
        keep_cols:
          - "published"
        output_columns:
          - "TravelRestrictions"
          - "TravelRestrictionsPublished"
        output_hxltags:
          - "#severity+travel"
          - "#severity+date+travel"

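The append versus keep semantics can be sketched with a hypothetical merge helper (`merge_rows` is illustrative only, not the library's API):

```python
def merge_rows(rows, adm_col, append_cols, keep_cols):
    """Merge rows by admin: append_cols concatenate, keep_cols retain the first value."""
    merged = {}
    for row in rows:
        adm = row[adm_col]
        if adm not in merged:
            merged[adm] = dict(row)
            continue
        existing = merged[adm]
        for col in append_cols:
            existing[col] = existing[col] + row[col]
        # columns in keep_cols keep existing[col]; the new value is discarded
    return merged

rows = [
    {"iso3": "AFG", "info": "Border closed. ", "published": "2020-07-01"},
    {"iso3": "AFG", "info": "Flights suspended.", "published": "2020-08-01"},
]
merged = merge_rows(rows, "iso3", ["info"], ["published"])
```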
The gam mini scraper reads from a spreadsheet that has a multiline header (headers defined as rows 3 and 4). 
Experimentation is often needed with row numbers since in my experience, they are sometimes offset from the real 
row numbers seen when opening the spreadsheet. date_col defines a column that contains date information and 
date_type specifies in what form the date information is held, e.g. as a date, a year or an int. For each admin, 
the scraper will obtain the data (in this case the “National Point Estimate”) for the latest date up to the 
current date (unless ignore_future_date is set to False, in which case future dates are allowed).

     gam:
        format: "xlsx"
        dataset: "world-global-expanded-database-on-severe-wasting"
        sheet: "Trend"
        headers:
          - 3
          - 4
        adm_cols:
          - "ISO"
        date_col: "Year*"
        date_type: "year"
        input_cols:
          - "National Point Estimate"
        output_columns:
          - "Malnutrition Estimate"
        output_hxltags:
          - "#severity+malnutrition+num+national"

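The "latest date up to the current date" selection can be sketched per admin. The helper below is hypothetical and uses a year as the date_type, as in the gam scraper:

```python
def latest_per_admin(rows, adm_col, date_col, value_col, today_year,
                     ignore_future_date=True):
    """For each admin, return the value at the latest year not after today_year."""
    latest = {}  # adm -> (year, value)
    for row in rows:
        year = int(row[date_col])
        if ignore_future_date and year > today_year:
            continue  # skip future dates unless explicitly allowed
        adm = row[adm_col]
        if adm not in latest or year > latest[adm][0]:
            latest[adm] = (year, row[value_col])
    return {adm: value for adm, (year, value) in latest.items()}

rows = [
    {"ISO": "AFG", "Year*": "2018", "National Point Estimate": "9.5"},
    {"ISO": "AFG", "Year*": "2020", "National Point Estimate": "5.1"},
    {"ISO": "AFG", "Year*": "2030", "National Point Estimate": "1.0"},  # future row
]
result = latest_per_admin(rows, "ISO", "Year*", "National Point Estimate", 2021)
```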
The covidtests mini scraper gets “new_tests” and “new_tests_per_thousand” for the latest date where the 
date_condition is satisfied, which is that “new_tests” has a value greater than zero.

     covidtests:
        format: "xlsx"
        dataset: "total-covid-19-tests-performed-by-country"
        headers: 1
        sheet: 1
        date_col: "date"
        date_type: "date"
        date_condition: "new_tests is not None and new_tests > 0"
        adm_cols:
          - "iso_code"
        input_cols:
          - "new_tests"
          - "new_tests_per_thousand"
        output_columns:
          - "New Tests"
          - "New Tests Per Thousand"
        output_hxltags:
          - "#affected+tested"
          - "#affected+tested+per1000"

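One way such a condition could be evaluated per row is sketched below: the column names become variables and the expression is standard Python. This is an illustrative assumption about the mechanism, not the library's actual implementation, and evaluating configuration-supplied expressions presumes the configuration is trusted:

```python
def passes_condition(row, condition):
    """Evaluate a Python condition string with the row's columns as variables."""
    return eval(condition, {}, row)  # trusted configuration only

condition = "new_tests is not None and new_tests > 0"
row_ok = {"new_tests": 1274, "new_tests_per_thousand": 0.03}
row_bad = {"new_tests": None, "new_tests_per_thousand": None}
```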
The oxcgrt mini scraper reads from a data source that has HXL tags, which can be used instead of the header 
names provided use_hxl is set to True. By default, all the HXLated columns are read, with the admin-related ones 
inferred and the rest taken as values unless defined as a date_col.

     oxcgrt:
        format: "csv"
        dataset: "oxford-covid-19-government-response-tracker"
        headers: 1
        use_hxl: True
        date_col: "#date"
        date_type: "date"

In the imperial mini scraper, output_columns and output_hxltags are defined which specify which columns and HXL 
tags in the HXLated file should be used rather than using all HXLated columns.

     imperial:
        format: "xlsx"
        dataset: "imperial-college-covid-19-projections"
        sheet: 1
        headers: 1
        use_hxl: True
        output_columns:
          - "Imp: Total Cases(min)"
          - "Imp: Total Cases(max)"
          - "Imp: Total Deaths(min)"
          - "Imp: Total Deaths(max)"
        output_hxltags:
          - "#affected+infected+min+imperial"
          - "#affected+infected+max+imperial"
          - "#affected+killed+min+imperial"
          - "#affected+killed+max+imperial"

The idmc mini scraper reads 2 HXLated columns defined in input_cols. In input_transforms, a cast to int is 
performed if the value is not None, otherwise it is set to 0. process_cols defines new column(s) that can be 
combinations of the other columns in input_cols. In this case, process_cols specifies a new column which sums the 
2 columns in input_cols. That new column is given a header and a HXL tag (in output_columns and output_hxltags).

     idmc:
        format: "csv"
        dataset: "idmc-internally-displaced-persons-idps"
        headers: 1
        use_hxl: True
        date_col: "#date+year"
        date_type: "year"
        input_cols:
          - "#affected+idps+ind+stock+conflict"
          - "#affected+idps+ind+stock+disaster"
        input_transforms:
          "#affected+idps+ind+stock+conflict": "int(#affected+idps+ind+stock+conflict) if #affected+idps+ind+stock+conflict else 0"
          "#affected+idps+ind+stock+disaster": "int(#affected+idps+ind+stock+disaster) if #affected+idps+ind+stock+disaster else 0"
        process_cols:
          - "#affected+idps+ind+stock+conflict + #affected+idps+ind+stock+disaster"
        output_columns:
          - "TotalIDPs"
        output_hxltags:
          - "#affected+displaced"

The needs mini scraper takes data for the latest available date for each country. subsets allows the definition of 
multiple indicators by way of filters. A filter is defined for each indicator (in this case there is one) which 
contains one or more conditions in Python syntax. Column names can be used directly, and any that are not already 
specified in input_cols or date_col should be listed under the key filter_cols.

     needs:
        format: "xlsx"
        dataset: "global-humanitarian-overview-2020-figures"
        sheet: "Raw Data"
        headers: 1
        adm_cols:
          - "Country Code"
        date_col: "Year"
        date_type: "year"
        filter_cols:
          - "Metric"
          - "PiN Value for Dataviz"
        subsets:
          - filter: "Metric == 'People in need' and PiN Value for Dataviz == 'yes'"
            input_cols:
              - "Value"
            output_columns:
              - "PeopleInNeed"
            output_hxltags:
              - "#affected+inneed"

The population mini scraper matches admin names only using exact matching (adm_exact is set to True) rather than 
the default which tries fuzzy matching in the event of a failure to match exactly. In this case, the fuzzy 
matching was incorrect. Note: a future fix in HDX Python Country may render this flag unnecessary as it will only 
fuzzy match if the input has more than 3 characters. 

     population:
        source: "World Bank"
        source_url: "https://data.humdata.org/organization/world-bank-group"
        format: "xls"
        url: "http://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=excel&dataformat=list"
        sheet: "Data"
        headers: 3
        adm_cols:
          - "Country Code"
        adm_exact: True
        date_col: "Year"
        date_type: "year"
        date_condition: "Value != ''"
        input_cols:
          - "Value"
        output_columns:
          - "Population"
        output_hxltags:
          - "#population"

The covid tests mini scraper applies a prefilter to the data that only processes rows where the value in the column
"new_tests" is not None and is greater than zero. If "new_tests" was not specified in input_cols or date_col, then
it would need to be under a key filter_cols.

    covidtests:
        source: "Our World in Data"
        format: "xlsx"
        dataset: "total-covid-19-tests-performed-by-country"
        url: "tests/fixtures/owid-covid-data.xlsx"
        headers: 1
        sheet: 1
        prefilter: "new_tests is not None and new_tests > 0"
     ...

The sadd mini scraper reads data from the dataset “covid-19-sex-disaggregated-data-tracker”. It filters that data 
using data from another file, the url of which is defined in external_filter. Specifically, it cuts down the sadd 
data to only include countries listed in the “#country+code+v_iso2” column of the external_filter file.

     sadd:
         format: "csv"
         dataset: "covid-19-sex-disaggregated-data-tracker"
         headers: 1
         external_filter:
           url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vR9PhPG7-aH0EkaBGzXYlrO9252gqs-UuKIeDQr9D3pOLBOdQ_AoSwWi21msHsdyT7thnjuhSY6ykSX/pub?gid=434885896&single=true&output=csv"
           hxltags:
             - "#country+code+v_iso2"
     ...

The fsnwg mini scraper first applies a sort to the data it reads. The reverse sort is based on the keys 
“reference_year” and “reference_code”. adm_cols defines a country column adm0_pcod3 and three admin 1 level 
columns (“adm1_pcod2”, “adm1_pcod3”, “adm1_name”) which are examined consecutively until a match with the internal 
admin 1 is made. 

date_col is composed of the two columns “reference_year” and “reference_code” (the same two columns used for 
sorting earlier). sum_cols is used to sum values in a column. For example, the formula 
“get_fraction_str(phase3, population)” takes the sum of all phase 3 values for an admin 1 and divides it by the 
sum of all population values for that admin 1. 

mustbepopulated determines if values are included or not. If it is True, then only when all columns in input_cols 
for the row are populated is the value included. This means that when using multiple summed columns together in a 
formula, the number of values that were summed in each column will be the same. The last formula uses 
“#population” which is replaced by the population for the admin (which has been previously obtained by another 
mini scraper).

     fsnwg:
        format: "xlsx"
        dataset: "cadre-harmonise"
        headers: 1
        sheet: 1
        sort:
          reverse: True
          keys:
            - "reference_year"
            - "reference_code"
        adm_cols:
          - "adm0_pcod3"
          - - "adm1_pcod2"
            - "adm1_pcod3"
            - "adm1_name"
        date_col:
          - "reference_year"
          - "reference_code"
        date_type: "int"
        filter_cols:
          - "chtype"
        subsets:
          - filter: "chtype == 'current'"
            input_cols:
              - "phase3"
              - "phase4"
              - "phase5"
              - "phase35"
              - "population"
            input_transforms:
              "phase3": "float(phase3)"
              "phase4": "float(phase4)"
              "phase5": "float(phase5)"
              "phase35": "float(phase35)"
              "population": "float(population)"
            sum_cols:
              - formula: "get_fraction_str(phase3, population)"
                mustbepopulated: True
              - formula: "get_fraction_str(phase4, population)"
                mustbepopulated: True
              - formula: "get_fraction_str(phase5, population)"
                mustbepopulated: True
              - formula: "get_fraction_str(phase35, population)"
                mustbepopulated: True
              - formula: "get_fraction_str(population, #population)"
                mustbepopulated: True
            output_columns:
              - "FoodInsecurityCHP3"
              - "FoodInsecurityCHP4"
              - "FoodInsecurityCHP5"
              - "FoodInsecurityCHP3+"
              - "FoodInsecurityCHAnalysed"
            output_hxltags:
              - "#affected+ch+food+p3+pct"
              - "#affected+ch+food+p4+pct"
              - "#affected+ch+food+p5+pct"
              - "#affected+ch+food+p3plus+pct"
              - "#affected+ch+food+analysed+pct"

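The summing logic with mustbepopulated can be sketched as follows. Both helpers are hypothetical stand-ins (`fraction_str` is a minimal substitute for `get_fraction_str`, not the library's function):

```python
def fraction_str(numerator, denominator):
    """Minimal stand-in for get_fraction_str: formatted ratio, or '' if no denominator."""
    return "%.4f" % (numerator / denominator) if denominator else ""

def sum_populated(rows, cols):
    """Sum each column, excluding any row where any of the columns is blank."""
    totals = {col: 0.0 for col in cols}
    for row in rows:
        if any(row.get(col) in (None, "") for col in cols):
            continue  # mustbepopulated: skip rows with any blank input
        for col in cols:
            totals[col] += float(row[col])
    return totals

rows = [
    {"phase3": "1000", "population": "10000"},
    {"phase3": "", "population": "5000"},    # excluded from both sums
    {"phase3": "500", "population": "10000"},
]
totals = sum_populated(rows, ["phase3", "population"])
share = fraction_str(totals["phase3"], totals["population"])
```

Because the second row is dropped from both sums, the numerator and denominator of the fraction are built from the same set of rows, which is the point of mustbepopulated.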
The who_subnational mini scraper defines two values in input_ignore_vals which, if found, are ignored. Since 
mustbepopulated is True, only when all columns in input_cols for the row are populated and do not contain 
either “-2222” or “-4444” is the value included in the sum of any column used in sum_cols. 

     who_subnational:
        source: "WHO"
        format: "csv"
        url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vRfjaIXE1hvEIXD66g6cuCbPrGdZkx6vLIgXO_znVbjQ-OgwfaI1kJPhxhgjw2Yg08CmtBuMLAZkTnu/pub?gid=337443769&single=true&output=csv"
        headers: 1
        adm_cols:
          - "iso"
          - "Admin1"
        date_col: "Year"
        date_type: "year"
        filter_cols:
          - "Vaccine"
        subsets:
          - filter: "Vaccine == 'HepB1'"
            input_cols:
              - "Numerator"
              - "Denominator"
            input_ignore_vals:
              - "-2222"
              - "-4444"
            input_transforms:
              Numerator: "float(Numerator)"
              Denominator: "float(Denominator)"
            sum_cols:
              - formula: "get_fraction_str(Numerator, Denominator)"
                mustbepopulated: True
            output_columns:
              - "HepB1 Coverage"
            output_hxltags:
              - "#population+hepb1+pct+vaccinated"

The access mini scraper provides different sources for each HXL tag by providing dictionaries instead of strings 
in source and source_url. It maps specific HXL tags by key to sources or falls back on a “default_source” and 
“default_url” for all unspecified HXL tags.

     access:
        source:
          "#access+visas+pct": "OCHA"
          "#access+travel+pct": "OCHA"
          "#event+year+previous+num": "Aid Workers Database"
          "#event+year+todate+num": "Aid Workers Database"
          "#event+year+previous+todate+num": "Aid Workers Database"
          "#activity+cerf+project+insecurity+pct": "OCHA"
          "#activity+cbpf+project+insecurity+pct": "OCHA"
          "#population+education": "UNESCO"
          "default_source": "Multiple sources"
        source_url:
          "#event+year+previous+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
          "#event+year+todate+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
          "#event+year+previous+todate+num": "https://data.humdata.org/dataset/security-incidents-on-aid-workers"
          "default_url": "https://docs.google.com/spreadsheets/d/e/2PACX-1vRSzJzuyVt9i_mkRQ2HbxrUl2Lx2VIhkTHQM-laE8NyhQTy70zQTCuFS3PXbhZGAt1l2bkoA4_dAoAP/pub?gid=1565063847&single=true&output=csv"
        format: "csv"
        url: "https://docs.google.com/spreadsheets/d/e/2PACX-1vRSzJzuyVt9i_mkRQ2HbxrUl2Lx2VIhkTHQM-laE8NyhQTy70zQTCuFS3PXbhZGAt1l2bkoA4_dAoAP/pub?gid=1565063847&single=true&output=csv"
        headers: 1
        use_hxl: True
        input_transforms:
          "#access+visas+pct": "get_numeric_if_possible(#access+visas+pct)"
          "#access+travel+pct": "get_numeric_if_possible(#access+travel+pct)"
          "#activity+cerf+project+insecurity+pct": "get_numeric_if_possible(#activity+cerf+project+insecurity+pct)"
          "#activity+cbpf+project+insecurity+pct": "get_numeric_if_possible(#activity+cbpf+project+insecurity+pct)"
          "#population+education": "get_numeric_if_possible(#population+education)"

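The per-tag source resolution amounts to a dictionary lookup with a fallback. The lookup helper below is a hypothetical sketch; the keys mirror the access configuration above:

```python
source = {
    "#access+visas+pct": "OCHA",
    "#event+year+todate+num": "Aid Workers Database",
    "default_source": "Multiple sources",
}

def source_for(hxltag):
    """Return the source mapped to the HXL tag, falling back on default_source."""
    return source.get(hxltag, source["default_source"])
```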
The field adm_vals allows overriding the country ISO3 codes and admin 1 pcodes for the specific mini scraper:

      gam:
        source: "UNICEF"
        format: "xlsx"
        url: "tests/fixtures/unicef_who_wb_global_expanded_databases_severe_wasting.xlsx"
        sheet: "Trend"
        headers:
          - 3
          - 4
        flatten:
          - original: "Region {{1}} Region Name"
            new: "Region Name"
          - original: "Region {{1}} Point Estimate"
            new: "Region Point Estimate"
        adm_cols:
          - "ISO"
          - "Region Name"
        adm_vals:
          - ["AFG"]
          - ["AF09", "AF24"]
        date_col: "Year*"
        date_type: "year"
        input_cols:
          - "Region Point Estimate"
        output_cols:
          - "Malnutrition Estimate"
        output_hxltags:
          - "#severity+malnutrition+num+subnational"

The date_level field enables reading of data containing dates where the latest date must be used at a particular level
such as per country or per admin 1. For example, we might want to produce a global value by summing the latest 
available value per country as shown below. We have specified a date_col and sum_cols. The library will apply 
a sort by date (since we have not specified one) to ensure correct results. This would also happen if we had
used process_cols or append_cols. 

      ourworldindata:
        source: "Our World in Data"
        format: "csv"
        url: "tests/fixtures/ourworldindata_vaccinedoses.csv"
        headers: 1
        use_hxl: True
        adm_cols:
          - "#country+code"
        date_col: "#date"
        date_type: "date"
        date_level: "national"
        input_cols:
          - "#total+vaccinations"
        sum_cols:
          - formula: "number_format(#total+vaccinations, format='%.0f')"
            mustbepopulated: False
        output_cols:
          - "TotalDosesAdministered"
        output_hxltags:
          - "#capacity+doses+administered+total"

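The date_level behaviour can be sketched as: take the latest value per country, then sum across countries. The helper is hypothetical, and ISO date strings are compared lexically, which orders them chronologically:

```python
def sum_latest_per_country(rows, adm_col, date_col, value_col):
    """Sum, over countries, each country's value at its latest date."""
    latest = {}  # country -> (date, value)
    for row in rows:
        adm = row[adm_col]
        if adm not in latest or row[date_col] > latest[adm][0]:
            latest[adm] = (row[date_col], float(row[value_col]))
    return sum(value for _, value in latest.values())

rows = [
    {"#country+code": "AFG", "#date": "2021-01-01", "#total+vaccinations": "100"},
    {"#country+code": "AFG", "#date": "2021-02-01", "#total+vaccinations": "250"},
    {"#country+code": "PSE", "#date": "2021-01-15", "#total+vaccinations": "40"},
]
total = sum_latest_per_country(rows, "#country+code", "#date", "#total+vaccinations")
```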
An example of combining adm_vals, date_col and date_level is getting the latest global value in which global data has
been treated as a special case of national data by using OWID_WRL in the #country+code field:

      ourworldindata:
        source: "Our World in Data"
        format: "csv"
        url: "tests/fixtures/ourworldindata_vaccinedoses.csv"
        headers: 1
        use_hxl: True
        adm_cols:
          - "#country+code"
        adm_vals:
          - "OWID_WRL"
        date_col: "#date"
        date_type: "date"
        date_level: "national"
        input_cols:
          - "#total+vaccinations"
        output_cols:
          - "TotalDosesAdministered"
        output_hxltags:
          - "#capacity+doses+administered+total"

If columns need to be summed and the latest date chosen overall not per admin unit, then we can specify 
single_maxdate as shown below:

      cerf_global:
        format: "csv"
        url: "tests/fixtures/full_pfmb_allocations.csv"
        dataset: "cerf-covid-19-allocations"
        force_date_today: True
        headers: 1
        date_col: "AllocationYear"
        date_type: "year"
        single_maxdate: True
        filter_cols:
          - "FundType"
          - "GenderMarker"
        subsets:
          ...
          - filter: "FundType == 'CBPF' and GenderMarker == '0'"
            input_cols:
              - "Budget"
            input_transforms:
              Budget: "float(Budget)"
            sum_cols:
              - formula: "Budget"
                mustbepopulated: False
            output_cols:
              - "CBPFFundingGM0"
            output_hxltags:
              - "#value+cbpf+funding+gm0+total+usd"
           ...
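The single_maxdate behaviour can be sketched as: keep only rows at the overall maximum date, then sum them (contrast with the per-admin selection shown earlier). The helper is hypothetical:

```python
def sum_at_max_date(rows, date_col, value_col):
    """Sum values from only those rows carrying the overall latest date."""
    max_date = max(row[date_col] for row in rows)
    return sum(float(row[value_col]) for row in rows
               if row[date_col] == max_date)

rows = [
    {"AllocationYear": 2020, "Budget": "1000000"},
    {"AllocationYear": 2021, "Budget": "250000"},
    {"AllocationYear": 2021, "Budget": "750000"},
]
total = sum_at_max_date(rows, "AllocationYear", "Budget")
```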

