Metadata-Version: 2.1
Name: omicidx
Version: 0.3.4
Summary: The OmicIDX project collects, reprocesses, and then republishes metadata from multiple public genomics repositories, including the NCBI SRA, BioSample, and GEO databases. Publication is via the BigQuery cloud data warehouse platform, a set of performant search and retrieval APIs, and a set of JSON-format files for easy incorporation into other projects.
Home-page: https://github.com/seandavi/omicidx
License: MIT
Keywords: genomics,bioinformatics,open data,API
Author: Sean Davis
Author-email: seandavi@gmail.com
Requires-Python: >=3.7,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Provides-Extra: eralchemy
Requires-Dist: Click
Requires-Dist: biopython (==1.75)
Requires-Dist: boto3 (>=1.9,<2.0)
Requires-Dist: elasticsearch (>=7.0,<8.0)
Requires-Dist: google-bigquery (>=0.14.0,<0.15.0)
Requires-Dist: google-cloud-bigquery (>=1.18,<2.0)
Requires-Dist: google-cloud-pubsub (>=0.45.0,<0.46.0)
Requires-Dist: google-cloud-storage (>=1.18,<2.0)
Requires-Dist: psycopg2 (>=2.8,<3.0)
Requires-Dist: pydantic
Requires-Dist: requests (>=2.22,<3.0)
Requires-Dist: sd_cloud_utils
Requires-Dist: sqlalchemy (>=1.3,<2.0)
Requires-Dist: ujson (>=1.35,<2.0)
Project-URL: Repository, https://github.com/seandavi/omicidx
Description-Content-Type: text/markdown


# New process


## Steps

- Download XML
- Create basic JSON
- Upload JSON to S3
- Munge basic JSON to Parquet
- Munge Parquet to:
    - experiment joined
    - sample joined
    - run joined
    - study with aggregates
    - Include aggregates in the Spark jobs:
        - number of samples, experiments, and runs
        - sample, experiment, and run accessions (as arrays)
- Save munged Spark data (JSON, Parquet)
- Create Elasticsearch index mappings
- Drop existing Elasticsearch mappings
- Load Elasticsearch index mappings
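The study-with-aggregates step can be sketched in plain Python (the real pipeline computes this in Spark over the joined Parquet tables; the record field names here are illustrative assumptions, not the actual schema):

```python
import json

def aggregate_study(study_accession, runs):
    """Roll run-level records up to one study-level record with aggregates.

    `runs` is a list of dicts with (assumed) keys: 'sample_accession',
    'experiment_accession', and 'run_accession'.
    """
    samples = sorted({r["sample_accession"] for r in runs})
    experiments = sorted({r["experiment_accession"] for r in runs})
    run_accessions = sorted({r["run_accession"] for r in runs})
    return {
        "study_accession": study_accession,
        # counts of distinct samples, experiments, and runs
        "sample_count": len(samples),
        "experiment_count": len(experiments),
        "run_count": len(run_accessions),
        # accessions carried along as arrays
        "sample_accessions": samples,
        "experiment_accessions": experiments,
        "run_accessions": run_accessions,
    }

# Toy input: three runs spanning two samples and two experiments.
runs = [
    {"sample_accession": "SRS001", "experiment_accession": "SRX001", "run_accession": "SRR001"},
    {"sample_accession": "SRS001", "experiment_accession": "SRX001", "run_accession": "SRR002"},
    {"sample_accession": "SRS002", "experiment_accession": "SRX002", "run_accession": "SRR003"},
]
record = aggregate_study("SRP001", runs)
print(json.dumps(record))
```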


## lambda

```
zip lambdas.zip lambda_handlers.py sra_parsers.py
```

```
aws lambda create-function \
    --function-name sra_to_json \
    --zip-file fileb://lambdas.zip \
    --handler lambda_handlers.lambda_return_full_experiment_json \
    --runtime python3.6 \
    --role arn:aws:iam::377200973048:role/lambda_s3_exec_role
```


### Invoke

```
aws lambda invoke --function-name sra_to_json --log-type Tail --payload '{"accession":"SRX000273"}' /tmp/abc.txt
```
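The same invocation can be done from Python with boto3 (a sketch; the function name and payload mirror the CLI call above, and AWS credentials/region are assumed to be configured):

```python
import json

def build_payload(accession):
    """Serialize the invocation payload; the 'accession' key matches
    what the CLI example passes to the handler."""
    return json.dumps({"accession": accession}).encode("utf-8")

if __name__ == "__main__":
    import boto3  # assumed available, with AWS credentials configured
    client = boto3.client("lambda")
    resp = client.invoke(
        FunctionName="sra_to_json",
        LogType="Tail",
        Payload=build_payload("SRX000273"),
    )
    # The function's JSON result comes back in the streaming Payload.
    print(resp["Payload"].read().decode("utf-8"))
```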

### Concurrency

The account default is 1,000 concurrent executions total; reserving concurrency for a specific function caps its parallelism (and carves that capacity out of the shared pool):

```
aws lambda put-function-concurrency --function-name sra_to_json --reserved-concurrent-executions 20
```

### Timeout and memory

```
aws lambda update-function-configuration --function-name sra_to_json --timeout 15
```

Memory can be adjusted with the same command via `--memory-size`.


### Logging

Logs can be tailed with [awslogs](https://github.com/jorgebastida/awslogs):

```
awslogs get /aws/lambda/sra_to_json ALL --watch
```


## DynamoDB

Count the items in the `sra_experiment` table:

```
aws dynamodb scan --table-name sra_experiment --select "COUNT"
```
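The same count can be computed from Python with boto3 (a sketch; a paginated Scan is assumed, since a single Scan call stops after 1 MB of data):

```python
def total_count(pages):
    """Sum item counts across Scan pages; `pages` is any iterable of
    Scan responses, each carrying a 'Count' key."""
    return sum(page["Count"] for page in pages)

if __name__ == "__main__":
    import boto3  # assumed available, with AWS credentials configured
    dynamodb = boto3.client("dynamodb")
    paginator = dynamodb.get_paginator("scan")
    # Select="COUNT" returns per-page counts without the item data.
    pages = paginator.paginate(TableName="sra_experiment", Select="COUNT")
    print(total_count(pages))
```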

## GEO

```
python -m omicidx.geometa --gse=GSE10
```

This prints newline-delimited JSON to stdout, one line per entity.
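That newline-delimited output is easy to consume downstream; a sketch (the sample lines stand in for real `omicidx.geometa` output, and their fields are illustrative, not the actual schema):

```python
import json

def parse_entities(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for piped `python -m omicidx.geometa --gse=GSE10` output;
# the GSM accession below is a made-up example.
sample_output = '{"accession": "GSE10"}\n{"accession": "GSM1000"}\n'
entities = list(parse_entities(sample_output.splitlines()))
print(len(entities))
```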


