
adex-data-scraper
The adex-data-scraper is a pip-installable, runnable package that can be used to import data from our archives (primarily astron-vo) into the ADEX backend.
See also: https://git.astron.nl/astron-sdc/adex-backend-django
Architecture
adex-data-scraper lives between the astron-vo datasources and the ADEX-cache database.
It is (currently) a pip-installable program that can be run manually.
See SDC Architecture Focus Project for more detailed architecture information.
Building (automatic)
The CI/CD pipeline builds and uploads the software to the ASTRON gitlab package registry.
Deploying (manually)
For example on sdc@dop814.astron.nl (this is our sdc-dev test machine):
# only once
> cd ~
> mkdir adex-data-scraper
> cd adex-data-scraper
# create the virtual environment with ONE of the following two commands
> virtualenv env3.9 -p python3.9   # on Ubuntu
> python3 -m venv env3.9           # on CentOS
# for every deployment
> cd ~/adex-data-scraper
> source env3.9/bin/activate
> pip install --upgrade pip
> pip install adex-data-scraper --extra-index-url https://git@git.astron.nl/api/v4/projects/349/packages/pypi/simple --upgrade
Running
To test whether the installation works, run the help command and check the reported version:
> adex_data_scraper -h
--- adex-data-scraper (version 14 feb 2023) ---
Getting help
> adex_data_scraper -h
usage: main.py [-h] [--datasource DATASOURCE] [--data_host DATA_HOST]
[--connector CONNECTOR] [--limit LIMIT]
[--batch_size BATCH_SIZE]
[--adex_backend_host [ADEX_BACKEND_HOST]]
[--adex_resource ADEX_RESOURCE] [--collection COLLECTION]
[--clear_resource] [--clear_collection] [--simulate_post]
[--adex_token ADEX_TOKEN] [-v] [--argfile [ARGFILE]]
options:
-h, --help show this help message and exit
--datasource DATASOURCE
where should the data be imported from? Options: vo,
postgres
--data_host DATA_HOST
service/table, either VO or postgres. Examples:
https://vo.astron.nl/tap/apertif_dr1.continuum_images,
postgres:postgres@localhost:5432/alta
--connector CONNECTOR
Connector class containing the translation scheme from
vo to adex
--limit LIMIT max records to fetch from VO, for dev/test purposes
--batch_size BATCH_SIZE
number of records to post as a batch to ADEX
--adex_backend_host [ADEX_BACKEND_HOST]
location of the adex-backend-django application
--adex_resource ADEX_RESOURCE
resource/table to update, options are:
primary_dp/create, ancillary_dp/create,
activity/create
--collection COLLECTION
can be used as filter in ADEX backend and ADEX
frontend
--clear_resource Delete all the data from the adex_resource
--clear_collection Delete all the data for this collection from the
adex_resource, works in concert with the '--
collection' parameter.
--simulate_post If true, then no data is posted to ADEX
--adex_token ADEX_TOKEN
Token to login
-v, --verbose More information about atdb_services at run time.
--argfile [ARGFILE] Ascii file with arguments (overrides all other
arguments
Development
adex-data-scraper can be used to fill the ADEX cache database with data from ASTRON VO.
The examples directory in this repo shows some 'argument files' for different datasets and communication paths.
See this diagram for the current development
Connectors
The implementation of the connectors needs to follow the ADEX datamodel and the datamodel in astron-vo. Both datamodels are still under development.
Visualisation in adex-labs
This is a visualisation of the apertif-dr1 example in the adex-labs engineering frontend.
Examples
The examples folder contains a series of argument files that show how to use the adex-data-scraper.
The examples differ mainly in the servers that they use.
These 2 are highlighted as an example of a primary and an ancillary dataset that belong together:
adex_data_scraper --argfile examples\vo\apertif_dr1_continuum_images_localhost.args
adex_data_scraper --argfile examples\postgres\ancillary_apertif_inspectionplots_localhost.args
How To add new VO collections to ADEX
This is an example of how to add the ALMA ivoa.obscore collection to ADEX.
Most of the work is in the adex-data-scraper (the current repo), with one small change in the adex-backend-django configuration files for adex-labs and adex-next to enable the new collection.
- use a VO application like Topcat to find or query datasets. This example uses Topcat:
- look for the Table Access Protocol (TAP service). In Topcat: VO => TAP query => keyword 'ALMA'
- select a table: alma.ivoa.obscore
- ADQL query: SELECT TOP 1000 * FROM ivoa.obscore
- double click 'Table List'
This shows the fields that you need to translate with a 'connector' to ADEX.
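The same query can also be prepared outside Topcat. The sketch below only builds the URL for a synchronous TAP request; it does not contact the service. It assumes the standard TAP `sync` endpoint and its `REQUEST`, `LANG`, `QUERY` and `MAXREC` parameters, and takes the ALMA service URL from the argument file in this chapter. The function name `build_tap_sync_url` is hypothetical and is not part of adex-data-scraper.

```python
from typing import Optional
from urllib.parse import urlencode


def build_tap_sync_url(service_url: str, adql: str, maxrec: Optional[int] = None) -> str:
    """Build the URL of a synchronous TAP query (<service>/sync) for an ADQL statement."""
    params = {"REQUEST": "doQuery", "LANG": "ADQL", "QUERY": adql}
    if maxrec is not None:
        # MAXREC caps the number of returned rows on the service side.
        params["MAXREC"] = str(maxrec)
    return service_url.rstrip("/") + "/sync?" + urlencode(params)


# Service URL as used in the ALMA example of this README.
url = build_tap_sync_url(
    "http://jvo.nao.ac.jp/skynode/do/tap/alma",
    "SELECT TOP 1000 * FROM ivoa.obscore",
)
print(url)
```

Opening the printed URL in a browser or with any HTTP client should return the same table that Topcat shows, provided the service is reachable.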
create argument file (alma_obscore_sdc.args)
Create your argument file (see also the 'Examples' chapter):
--datasource=vo
--connector=ALMA.Obscore
--data_host=http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore
--batch_size=1000
--adex_backend_host=https://sdc.astron.nl/adex_backend/
--adex_resource=primary_dp/create
--adex_token=6b85509349313c7bdb16bd706d43ee5eb1cfb5da
--clear_collection
--collection=alma_obscore
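To illustrate how such a file maps onto command line options, here is a minimal sketch that reads one option per line and feeds the result to argparse. It is not the parser used by adex-data-scraper: the helper `parse_argfile_lines`, the skipping of blank lines and `#` comments, and the reduced set of options are assumptions made for this example.

```python
import argparse


def parse_argfile_lines(lines):
    """Turn argument-file lines into an argv-style list, skipping blanks and '#' comments."""
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]


# Only a subset of the real options, enough to show the mechanism.
parser = argparse.ArgumentParser(prog="argfile_sketch")
parser.add_argument("--datasource")
parser.add_argument("--connector")
parser.add_argument("--data_host")
parser.add_argument("--batch_size", type=int)
parser.add_argument("--collection")
parser.add_argument("--clear_collection", action="store_true")

example = """\
--datasource=vo
--connector=ALMA.Obscore
--data_host=http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore
--batch_size=1000
--clear_collection
--collection=alma_obscore
"""

args = parser.parse_args(parse_argfile_lines(example.splitlines()))
print(args.connector, args.batch_size, args.clear_collection)
```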
Most arguments are default or obvious, but some need a bit more explanation.
--connector=ALMA.Obscore
This refers to a file ALMA.py and the class Obscore in that file, in the vo.connectors directory.
You will need to create that file; this is the 'connector' (see next chapter).
--data_host=http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore
This is a combination of the 'service URL' and the table name (indicated in red in the topcat screenshot)
--collection=alma_obscore
You can freely choose this name. This is the name that appears in the Collection dropdown menu. You need to use the same name in the adex backend configuration (see 'adex-backend-django configuration')
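The two composite arguments described above can be taken apart as follows. This is a sketch of the naming convention only, under the assumption that the table name is everything after the last slash of a VO data_host and that the connector is written as file name, dot, class name. The helper names are hypothetical and the scraper itself may split these values differently.

```python
def split_data_host(data_host: str):
    """Split a VO data_host into (service URL, table name) at the last slash."""
    service_url, table = data_host.rsplit("/", 1)
    return service_url, table


def split_connector(connector: str):
    """Split 'ALMA.Obscore' into the connector file name (without .py) and the class name."""
    module_name, class_name = connector.split(".", 1)
    return module_name, class_name


service_url, table = split_data_host(
    "http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore"
)
module_name, class_name = split_connector("ALMA.Obscore")
print(service_url, table, module_name, class_name)
```

The service URL obtained this way is the one that appears in the "records fetched from" lines when the scraper runs.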
write the connector (ALMA.Obscore)
This translates the results from the ADQL query to the ADEX format. Not every VO service uses the same field names for similar information (like ra, dec), so you need to look at the VO table result or schema in Topcat to see which names this service uses. Also, not all the ADEX fields will be available in every service; those can be left out.
The 'translate()' function is an overridden function, which means that its name, arguments and returned results are given and should not be changed. Don't change the keys in the dict, only the identifiers in the 'row'.
For example, ADEX expects the Right Ascension to be named 'ra', given in decimal degrees and returned as a float in the payload json.
The ALMA service names the Right Ascension 's_ra' and returns it as a string, so you need to convert it to fit ADEX: float(row['s_ra'])
class Obscore:

    def translate(self, row, args):
        """
        parse the specific row that comes from the VO adql query,
        and translate it into the standard json payload for posting to the ADEX backend REST API

        :param row: the results from the ADQL query to a VO service
        :param args: the commandline arguments, but only args.collection is currently used
        :return: ADEX record as json structure
        """
        payload = dict(
            pid=row['data_id'],
            name=row['target_name'],
            dp_type=row['dataproduct_type'],
            format="fits",
            locality="online",
            access_url=row['access_url'],
            ra=float(row['s_ra']),
            dec=float(row['s_dec']),
            equinox="2000.0",
            release_date=row['obs_release_date'],
            data_provider="ALMA",
            sky_footprint=row['s_region'],
            dataset_id=str(row['data_id']),
            activity=None,
            collection=args.collection,
        )
        return payload
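A connector like this can be tried out without a VO service or an ADEX backend. The sketch below repeats the connector so that it runs on its own and feeds it a single hand-written row. All values in `row` are made up for illustration and are not real ALMA records; `SimpleNamespace` stands in for the parsed command line arguments.

```python
from types import SimpleNamespace


class Obscore:
    def translate(self, row, args):
        """Translate one ivoa.obscore row into an ADEX payload (same mapping as above)."""
        return dict(
            pid=row["data_id"],
            name=row["target_name"],
            dp_type=row["dataproduct_type"],
            format="fits",
            locality="online",
            access_url=row["access_url"],
            ra=float(row["s_ra"]),    # string from the service becomes a float
            dec=float(row["s_dec"]),
            equinox="2000.0",
            release_date=row["obs_release_date"],
            data_provider="ALMA",
            sky_footprint=row["s_region"],
            dataset_id=str(row["data_id"]),
            activity=None,
            collection=args.collection,
        )


# Hypothetical row; the field names follow the connector, the values are invented.
row = {
    "data_id": "EXAMPLE-0001",
    "target_name": "example_target",
    "dataproduct_type": "IMAGE",
    "access_url": "http://example.org/example.fits",
    "s_ra": "187.70593",
    "s_dec": "12.39112",
    "obs_release_date": "2020-01-01T00:00:00",
    "s_region": "Polygon ICRS 187.7 12.3 187.8 12.3 187.8 12.4 187.7 12.4",
}
args = SimpleNamespace(collection="alma_obscore")

payload = Obscore().translate(row, args)
print(type(payload["ra"]).__name__, payload["collection"])
```

Note that `float()` raises an error when a field is empty or missing, so a service with incomplete rows may need extra handling in its connector.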
adex-backend-django configuration
The frontend applications (adex-labs and adex-gui) get their configuration from adex-backend. This is the directory where the frontend configuration files are kept. https://git.astron.nl/astron-sdc/adex-backend-django/-/tree/main/adex_backend/adex_backend/configuration?ref_type=heads
Look for the "collections" tag in the configuration and add the new alma_obscore collection to it:
"collections": [
{ "name" : "linc_skymap", "dp_types": ['qa-skymap']},
{ "name" : "linc_visibilities", "dp_types": ['die-calibrated-visibilities'], "distinct_field" : "dataset_id"},
{ "name" : "apertif-dr1", "dp_types": ['science-skymap']},
{ "name" : "lotts-dr2", "dp_types": ['skymap']},
{ "name" : "lofar-skyimage", "dp_types": ['skyimage']},
{ "name" : "alma_obscore", "dp_types": ['IMAGE','CUBE']}
],
alma_obscore is the name you chose for your collection, as defined in the argument file.
IMAGE and CUBE are values of the dataproduct_type field in the ALMA ivoa.obscore collection.
These can be mapped onto the ADEX dp_type field (see previous chapter).
By adding them to the configuration, the frontend(s) know to offer these values as filter options once the alma_obscore collection is selected.
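The relation between the connector output and this configuration can be checked with a few lines of Python. This is an illustrative sketch, not code from adex-backend-django: the configuration is reduced to a plain Python list with two of the entries shown above, and the helper `dp_types_for` is hypothetical.

```python
# Two entries copied from the "collections" configuration, written as Python dicts.
collections = [
    {"name": "apertif-dr1", "dp_types": ["science-skymap"]},
    {"name": "alma_obscore", "dp_types": ["IMAGE", "CUBE"]},
]


def dp_types_for(collections, name):
    """Return the dp_types configured for a collection, or an empty list if it is unknown."""
    for entry in collections:
        if entry["name"] == name:
            return entry["dp_types"]
    return []


# A record as a connector might produce it (only the relevant fields).
record = {"collection": "alma_obscore", "dp_type": "IMAGE"}

# The frontend can only filter on a dp_type that is listed for the collection.
is_filterable = record["dp_type"] in dp_types_for(collections, record["collection"])
print(is_filterable)
```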
execute
> adex_data_scraper --argfile ./alma_obscore_sdc.args
Now the ADEX database will be filled with records in batches of 1000.
1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.483034
1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:15.901474
1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.173940
1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.729760
1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
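The fetch and post pattern visible in this output can be sketched as follows. This is an assumption about the general technique, not the scraper's actual code: `batched` and `post_in_batches` are hypothetical helpers, `post` is any callable that sends one batch, and `simulate_post` mimics the behaviour described for the `--simulate_post` option.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def post_in_batches(records, batch_size, post, simulate_post=False):
    """Send records batch by batch and return the number of records handled."""
    total = 0
    for batch in batched(records, batch_size):
        if not simulate_post:
            post(batch)  # hypothetical: would be an HTTP POST to the ADEX backend
        total += len(batch)
    return total


sent = []
count = post_in_batches(range(2500), 1000, sent.append)
print(count, [len(batch) for batch in sent])
```

Posting in batches keeps each request to the backend bounded in size, at the cost of more round trips for large collections.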