    adex-data-scraper

    The adex-data-scraper is a pip-installable, runnable package that imports data from our archives (primarily astron-vo) into the ADEX backend.

    See also: https://git.astron.nl/astron-sdc/adex-backend-django

    Architecture

    adex-data-scraper lives between the astron-vo datasources and the ADEX-cache database. It is (currently) a pip installable program that can be run manually.

    See SDC Architecture Focus Project for more detailed architecture information.

    Building (automatic)

    The CI/CD pipeline builds and uploads the software to the ASTRON gitlab package registry.

    Deploying (manually)

    For example on sdc@dop814.astron.nl (this is our sdc-dev test machine)

    # only once
    > cd ~
    > mkdir adex-data-scraper
    > cd adex-data-scraper
    > virtualenv env3.9 -p python3.9   # on Ubuntu
    > python3 -m venv env3.9           # on CentOS
    
    # for every deployment
    > cd ~/adex-data-scraper
    > source env3.9/bin/activate
    > pip install --upgrade pip
    > pip install adex-data-scraper --extra-index-url https://git@git.astron.nl/api/v4/projects/349/packages/pypi/simple --upgrade
    

    Running

    To test if it works (look at the version)

    > adex_data_scraper -h
    --- adex-data-scraper (version 14 feb 2023) ---

    Getting help

      > adex_data_scraper -h
    usage: main.py [-h] [--datasource DATASOURCE] [--data_host DATA_HOST]
                   [--connector CONNECTOR] [--limit LIMIT]
                   [--batch_size BATCH_SIZE]
                   [--adex_backend_host [ADEX_BACKEND_HOST]]
                   [--adex_resource ADEX_RESOURCE] [--collection COLLECTION]
                   [--clear_resource] [--clear_collection] [--simulate_post]
                   [--adex_token ADEX_TOKEN] [-v] [--argfile [ARGFILE]]
    
    options:
      -h, --help            show this help message and exit
      --datasource DATASOURCE
                            where should the data be imported from? Options: vo,
                            postgres
      --data_host DATA_HOST
                            service/table, either VO or postgres. Examples:
                            https://vo.astron.nl/tap/apertif_dr1.continuum_images,
                            postgres:postgres@localhost:5432/alta
      --connector CONNECTOR
                            Connector class containing the translation scheme from
                            vo to adex
      --limit LIMIT         max records to fetch from VO, for dev/test purposes
      --batch_size BATCH_SIZE
                            number of records to post as a batch to ADEX
      --adex_backend_host [ADEX_BACKEND_HOST]
                            location of the adex-backend-django application
      --adex_resource ADEX_RESOURCE
                            resource/table to update, options are:
                            primary_dp/create, ancillary_dp/create,
                            activity/create
      --collection COLLECTION
                            can be used as filter in ADEX backend and ADEX
                            frontend
      --clear_resource      Delete all the data from the adex_resource
      --clear_collection    Delete all the data for this collection from the
                            adex_resource, works in concert with the '--
                            collection' parameter.
      --simulate_post       If true, then no data is posted to ADEX
      --adex_token          ADEX_TOKEN
                            Token to login
      -v, --verbose         More information about atdb_services at run time.
      --argfile [ARGFILE]   Ascii file with arguments (overrides all other
                            arguments
    
    

    Development

    adex-data-scraper can be used to fill the ADEX cache database with data from ASTRON VO. The examples directory in this repo shows some 'argument files' for different datasets and communication paths.

    See this diagram for the current development

    Connectors

    The implementation of the connectors needs to follow both the ADEX datamodel and the datamodel in astron-vo. Both datamodels are still under development.

    Visualisation in adex-labs

    This is a visualisation of the apertif-dr1 example in adex-labs engineering frontend.

    Examples

    The examples folder contains a series of argument files that show how to use the adex-data-scraper. The examples differ mainly in the servers they use.

    These 2 are highlighted as an example of a primary and ancillary dataset that belong together:

    adex_data_scraper --argfile examples\vo\apertif_dr1_continuum_images_localhost.args
    adex_data_scraper --argfile examples\postgres\ancillary_apertif_inspectionplots_localhost.args
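
As a rough illustration of what an argument file amounts to, the sketch below turns text with one option per line into an argument list and hands it to argparse. The helper `argfile_to_argv` and the reduced parser are hypothetical and are not taken from the adex-data-scraper source; only the option names come from the help text above.

```python
import argparse


def argfile_to_argv(text):
    """Turn argument-file text (one option per line) into an argv-style list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_parser():
    # Hypothetical, reduced parser: only a few of the options from the help text.
    parser = argparse.ArgumentParser(prog="adex_data_scraper")
    parser.add_argument("--datasource")
    parser.add_argument("--connector")
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--collection")
    parser.add_argument("--clear_collection", action="store_true")
    return parser


# Example argument-file content, written inline instead of read from disk.
example_text = """
--datasource=vo
--connector=ALMA.Obscore
--batch_size=1000
--clear_collection
--collection=alma_obscore
"""

args = build_parser().parse_args(argfile_to_argv(example_text))
```

Each line of the file behaves like one command-line token, which is why flags such as `--clear_collection` stand alone and valued options use the `--name=value` form.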

    How To add new VO collections to ADEX

    This is an example of how to add the ALMA ivoa.obscore collection to ADEX.

    Most of the work is in the adex-data-scraper (the current repo), with one small change in the adex-backend-django configuration files for adex-labs and adex-next to enable the new collection.

    • use a VO application like Topcat to find or query datasets. This example uses Topcat:
      • look for the Table Access Protocol (TAP service). In Topcat: VO => TAP query => keyword 'ALMA'
      • select a table: alma.ivoa.obscore
      • ADQL Query: SELECT TOP 1000 * FROM ivoa.obscore
      • double click 'Table List'

    This shows the fields that you need to translate with a 'connector' to ADEX.
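
The same query can also be expressed outside Topcat as a plain HTTP request. The sketch below only builds the request URL, following the IVOA TAP convention of a `/sync` endpoint with `LANG` and `QUERY` parameters; it does not contact the service. The function name `build_tap_sync_url` is hypothetical, and whether adex-data-scraper builds its requests this way is an assumption.

```python
from urllib.parse import urlencode


def build_tap_sync_url(service_url, adql):
    """Build a synchronous TAP query URL for a service URL and an ADQL string."""
    params = {
        "REQUEST": "doQuery",  # required by TAP 1.0, still accepted by later versions
        "LANG": "ADQL",
        "QUERY": adql,
    }
    return service_url.rstrip("/") + "/sync?" + urlencode(params)


url = build_tap_sync_url(
    "http://jvo.nao.ac.jp/skynode/do/tap/alma",
    "SELECT TOP 1000 * FROM ivoa.obscore",
)
```

Opening the resulting URL in a browser or with an HTTP client should return the rows as a VOTable, assuming the service is reachable.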

    create argument file (alma_obscore_sdc.args)

    Create your argument file (also see 'examples' chapter)

    --datasource=vo
    --connector=ALMA.Obscore
    --data_host=http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore
    --batch_size=1000
    --adex_backend_host=https://sdc.astron.nl/adex_backend/
    --adex_resource=primary_dp/create
    --adex_token=6b85509349313c7bdb16bd706d43ee5eb1cfb5da
    --clear_collection
    --collection=alma_obscore

    Most arguments are default or obvious, but some need a bit more explanation.

    --connector=ALMA.Obscore

    This refers to the class Obscore in the file ALMA.py, in the vo.connectors directory. You will need to create that file; it is the 'connector' (see next chapter).
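
A "file.Class" string like this can be resolved to a class at run time with `importlib`. The sketch below shows one way to do that; the function `load_connector` is hypothetical, and the actual lookup inside adex-data-scraper may differ. It is demonstrated on a standard-library package so that it runs without this repo installed.

```python
import importlib


def load_connector(spec, package="vo.connectors"):
    """Resolve 'Module.ClassName' to a class inside the given package."""
    module_name, class_name = spec.rsplit(".", 1)
    module = importlib.import_module(f"{package}.{module_name}")
    return getattr(module, class_name)


# Stand-in demonstration: 'decoder.JSONDecoder' inside the 'json' package.
# With this repo on the path, load_connector("ALMA.Obscore") would look for
# the class Obscore in vo/connectors/ALMA.py in the same way.
decoder_class = load_connector("decoder.JSONDecoder", package="json")
```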

    --data_host=http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore

    This is a combination of the 'service URL' and the table name (indicated in red in the topcat screenshot)
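
Since the table name is the last path segment, the two parts can be separated again by splitting on the final slash. This is a minimal sketch with a hypothetical helper name; how adex-data-scraper splits the value internally is an assumption.

```python
def split_data_host(data_host):
    """Split '<service URL>/<table name>' at the last slash."""
    service_url, table_name = data_host.rsplit("/", 1)
    return service_url, table_name


service_url, table_name = split_data_host(
    "http://jvo.nao.ac.jp/skynode/do/tap/alma/ivoa.obscore"
)
```

Here `service_url` becomes `http://jvo.nao.ac.jp/skynode/do/tap/alma` and `table_name` becomes `ivoa.obscore`.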

    --collection=alma_obscore

    You can freely choose this name. This is the name that appears in the Collection dropdown menu. You need to use the same name in the adex backend configuration (see 'adex-backend-django configuration')

    write the connector (ALMA.Obscore)

    This translates the results from the ADQL query to ADEX format. Not every VO service uses the same field names for similar information (like ra, dec), so you need to look at the VO table result or the schema in Topcat to see which names this service uses. Also, not all the ADEX fields will be available in every service; those can be left out.

    The 'translate()' function is an overridden function, which means that its name, arguments and returned results are given and should not be changed. Don't change the keys in the dict, only the identifiers used to index the 'row'.

    For example, ADEX expects the Right Ascension to be named 'ra', in decimal degrees, and passed as a float in the payload json.

    The ALMA service returns Right Ascension named as 's_ra' and returns it as a string. So you need to convert it so that it fits ADEX: float(row['s_ra'])

    class Obscore():
    
        def translate(self, row, args):
            """
            parse the specific row that comes from the VO adql query,
            and translate it into the standard json payload for posting to the ADEX backend REST API
    
            :param row: the results from the ADQL query to a VO service
            :param args: the commandline arguments, but only args.collection is currently used
            :return: ADEX record as json structure
            """
            payload = dict(
                pid=row['data_id'],
                name=row['target_name'],
                dp_type=row['dataproduct_type'],
                format="fits",
                locality="online",
                access_url=row['access_url'],
                ra=float(row['s_ra']),
                dec=float(row['s_dec']),
                equinox="2000.0",
    
                release_date=row['obs_release_date'],
                data_provider="ALMA",
    
                sky_footprint=row['s_region'],
    
                dataset_id=str(row['data_id']),
                activity=None,
                collection=args.collection,
            )
    
            return payload
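
A connector can be tried out without contacting the VO service or ADEX by passing it a hand-written row and a stand-in for the command-line arguments. The sketch below uses a reduced copy of the connector (`ObscoreSketch`, a hypothetical name) and made-up row values; real rows come from the ADQL query result.

```python
import json
from argparse import Namespace


class ObscoreSketch:
    """Reduced copy of the connector above, kept short for the example."""

    def translate(self, row, args):
        return dict(
            pid=row['data_id'],
            name=row['target_name'],
            dp_type=row['dataproduct_type'],
            ra=float(row['s_ra']),    # the service returns strings
            dec=float(row['s_dec']),
            collection=args.collection,
        )


# Made-up row for illustration only.
row = {
    'data_id': 'example-id',
    'target_name': 'example-target',
    'dataproduct_type': 'IMAGE',
    's_ra': '187.25',
    's_dec': '-12.5',
}
args = Namespace(collection='alma_obscore')

payload = ObscoreSketch().translate(row, args)
# json.dumps raises TypeError if a value cannot be serialised to JSON.
payload_json = json.dumps(payload)
```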

    adex-backend-django configuration

    The frontend applications (adex-labs and adex-gui) get their configuration from adex-backend. The frontend configuration files are kept in this directory: https://git.astron.nl/astron-sdc/adex-backend-django/-/tree/main/adex_backend/adex_backend/configuration?ref_type=heads

    Look for the "collections" tag in the configuration, and add the new alma_obscore collection to it.

        "collections": [
            { "name" : "linc_skymap", "dp_types": ['qa-skymap']},
            { "name" : "linc_visibilities", "dp_types": ['die-calibrated-visibilities'], "distinct_field" : "dataset_id"},
            { "name" : "apertif-dr1", "dp_types": ['science-skymap']},
            { "name" : "lotts-dr2", "dp_types": ['skymap']},
            { "name" : "lofar-skyimage", "dp_types": ['skyimage']},
            { "name" : "alma_obscore", "dp_types": ['IMAGE','CUBE']}
        ],

    The alma_obscore is the name you chose for your collection, as defined in the argument file.

    IMAGE and CUBE are values of the dataproduct_type field in the ALMA ivoa.obscore collection. These can be mapped onto the ADEX dp_type field (see previous chapter). By adding them to the configuration, the frontends know to offer these values as filter options once the alma_obscore collection is selected.
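
To make the relation between a collection and its filter options concrete, the sketch below looks up the dp_types for a selected collection in a shortened copy of the "collections" list. The function `dp_types_for` is hypothetical and is not the frontend's actual code.

```python
# Shortened copy of the "collections" configuration shown above.
collections = [
    {"name": "apertif-dr1", "dp_types": ["science-skymap"]},
    {"name": "alma_obscore", "dp_types": ["IMAGE", "CUBE"]},
]


def dp_types_for(collections, name):
    """Return the dp_type filter options configured for a collection name."""
    for entry in collections:
        if entry["name"] == name:
            return entry["dp_types"]
    return []  # unknown collection: no filter options


alma_options = dp_types_for(collections, "alma_obscore")
```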

    execute

    > adex_data_scraper --argfile ./alma_obscore_sdc.args

    Now the ADEX database will be filled with records in batches of 1000.

    1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
    1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.483034
    1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
    1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:15.901474
    1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
    1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.173940
    1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
    1000 posted to https://sdc.astron.nl/adex_backend/ in 0:00:17.729760
    1000 records fetched from http://jvo.nao.ac.jp/skynode/do/tap/alma
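
The batch-wise behaviour in the log above can be pictured with a small generator that cuts a stream of records into lists of at most `batch_size` items. This is only a sketch of the general technique; the function `batched` is hypothetical, and the real scraper may instead page through the VO service one batch at a time.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 2500 stand-in records give two full batches and one partial batch.
batch_sizes = [len(batch) for batch in batched(range(2500), 1000)]
```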