This page describes how ESAP and the Rucio system (as the main data management component of the scientific data lake) are connected. The connection between the two has been implemented in two ways:
- Query functionality. The main interface from ESAP offers the possibility to query data on the scientific data lake.
- Data access. Access to data in the scientific data lake from a notebook has been implemented in the Data Lake as a Service (DLaaS) Jupyter notebook. The shopping basket client implements functionality that can be executed in this notebook to stage data from the scientific data lake to the DLaaS notebook.
Each of these modes is described in more detail in the following sections.
Querying Rucio
Rucio queries are an example of the Query Service Category.
Rucio querying is configured through the RUCIO_HOST environment variable. To use this functionality, the Rucio instance should support OIDC tokens, and the tokens consumed by both services should use the same audience (in the case of ESCAPE: Rucio). There is also an implementation using X509 AAI tokens, which uses the RUCIO_AUTH_TOKEN and RUCIO_AUTH_HOST environment variables.
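The environment variables mentioned above can be checked before the query functionality is used. The following is a minimal sketch and not ESAP's actual configuration code: the helper name check_rucio_env is hypothetical, and only the three variable names come from this page.

```python
import os

# Variable names taken from this page; everything else is illustrative.
REQUIRED = ("RUCIO_HOST",)
X509_ONLY = ("RUCIO_AUTH_TOKEN", "RUCIO_AUTH_HOST")


def check_rucio_env(env=None):
    """Report which Rucio-related variables are set (hypothetical helper)."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # True means the variable is present and non-empty.
    return {name: bool(env.get(name)) for name in REQUIRED + X509_ONLY}
```

Calling check_rucio_env() without arguments inspects the process environment; passing a dictionary makes the check easy to try out without changing any real settings.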
With this set up, users can query data in the data lake by first selecting their scope (in Rucio, the scope is the top-level namespace) and then searching by name and type (DIDs, i.e. Data IDentifiers, or Replicas, although replicas do not seem to be implemented). An example query is shown here.
In this result, the Type column can be FILE, DATASET or CONTAINER, which are the three types of DID implemented in Rucio. In essence, both DATASETs and CONTAINERs are collections of files that can be accessed as a single object (think of it as a flexible way of defining directories).
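To make the relation between the three DID types concrete, here is a small stand-alone model. It does not use the Rucio client; the DID class, the list_files helper and the scope and file names are all made up for illustration. It relies only on the Rucio convention that a DID is written as scope:name, that datasets group files, and that containers group datasets or other containers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DID:
    """Illustrative stand-in for a Rucio DID (not the Rucio client type)."""
    scope: str
    name: str
    type: str  # "FILE", "DATASET" or "CONTAINER"
    children: List["DID"] = field(default_factory=list)


def list_files(did):
    """Return the scope:name of every file reachable from this DID."""
    if did.type == "FILE":
        return [f"{did.scope}:{did.name}"]
    files = []
    for child in did.children:
        files.extend(list_files(child))
    return files


# Hypothetical example data.
f1 = DID("demo_scope", "image_001.fits", "FILE")
f2 = DID("demo_scope", "image_002.fits", "FILE")
dataset = DID("demo_scope", "night_1", "DATASET", [f1, f2])
container = DID("demo_scope", "survey", "CONTAINER", [dataset])
```

Here list_files(container) walks through the container and its dataset and yields the two file DIDs, which is the sense in which a collection can be handled as a single object.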
When the data has been found, the user can click on the check marks in the Basket column to add the items to the basket. Do not forget to confirm the choice by clicking the Save Basket button in the top bar of ESAP.
The shopping basket now contains references to each of the items selected (note that each DID is added as a single item, irrespective of its Type). An example shopping basket is shown below.
Data Lake as a Service and the ESAP shopping basket
To access the data sets that have been added to the shopping basket, the Python shopping basket client can be used.
Data from a specific Source can be selected by using a connector.
In addition to selecting specific data, connectors can be extended with functionality to obtain the data from that source. For access to data in the data lake, the RucioConnector can be used, which expects to be executed inside the Data Lake as a Service notebook.
When starting the notebook, the first step is to install the shopping basket client:
!pip install git+https://git.astron.nl/astron-sdc/esap-userprofile-python-client.git
The next step is to import the ESAP shopping basket client and the connector:
import esap_client
from esap_client.connectors import Rucio as RucioConnector
To download the basket, the following three commands can be used:
rucio_connector = RucioConnector()
my_shopping_client = esap_client.shopping_client.ShoppingClient(host=HOST_URL, connectors=[rucio_connector])
my_basket = my_shopping_client.get_basket(filter_archives=True)
The first line sets up the RucioConnector. The second line sets up a shopping client to access the shopping basket from the ESAP instance running at HOST_URL (this is the root URL; for the ESAP demo instance this URL is https://sdc-dev.astron.nl). The connectors argument takes a list of connectors for data selection. Since we are only interested in Rucio data, this list only contains a Rucio connector. The third line takes the data from the shopping basket and puts it in the my_basket variable. The filter_archives argument is set to True so that only entries with source rucio are saved in the variable.
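The effect of this filtering can be pictured with plain Python. This is only a sketch of the idea: the layout of basket entries is not described on this page, so the dictionaries and the "archive" key below are assumptions, and filter_by_source is a hypothetical helper rather than part of the shopping basket client.

```python
def filter_by_source(items, source="rucio"):
    """Keep only the basket entries whose (assumed) "archive" field matches."""
    return [item for item in items if item.get("archive") == source]


# Hypothetical basket content mixing two sources.
example_basket = [
    {"archive": "rucio", "name": "demo_scope:image_001.fits"},
    {"archive": "other_archive", "name": "something_else"},
    {"archive": "rucio", "name": "demo_scope:night_1"},
]

rucio_only = filter_by_source(example_basket)
```

After this call rucio_only holds the two entries marked as coming from Rucio, in their original order.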
Now we will use the convenience staging functions that are part of the RucioConnector. These read the configuration of the DLaaS notebook and create a rule in Rucio to copy the data to the storage element that is mounted on the DLaaS service. The data is returned as an object which, depending on the type of the DID, is a SingleItemDID (for DIDs of type FILE) or a MultipleItemDID (for DIDs of types DATASET and CONTAINER). Those objects are in essence a string and a list, and are the types that are used by the DLaaS notebook. The retrieval is started by the command
rucio_connector.retrieve(my_basket)
Running
rucio_connector.get_staging_status()
will return a list of the items in the shopping basket, together with their status in Rucio (most often REPLICATING for files that are being copied and OK for files that have been staged; ideally no files are in the STUCK state). For practical use, the connector can be asked to block until all files are available:
rucio_connector.block_till_staged()
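Conceptually, blocking until everything is staged amounts to polling the status until every item reports OK. The sketch below shows that pattern in isolation; it is not the connector's implementation. The function wait_until_staged is hypothetical, and it assumes a callable that returns a plain list of status strings, which may differ from what get_staging_status() actually returns.

```python
import time


def wait_until_staged(get_status, poll_seconds=10.0, timeout_seconds=3600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until all items are OK (illustrative sketch)."""
    deadline = clock() + timeout_seconds
    while True:
        statuses = list(get_status())
        # A STUCK item will not recover by waiting, so fail early.
        if any(status == "STUCK" for status in statuses):
            raise RuntimeError("at least one item is STUCK")
        if all(status == "OK" for status in statuses):
            return statuses
        if clock() >= deadline:
            raise TimeoutError("staging did not finish in time")
        sleep(poll_seconds)
```

The timeout and the early exit on STUCK are design choices of this sketch, added so that a notebook cell does not hang indefinitely when a transfer cannot complete.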
Note that staging small files can still take a considerable amount of time, probably because of the overhead of queueing and managing transfers. Getting the files that are part of the connector (including the unpacking of DATASETs and CONTAINERs) can be done using
the_dids = rucio_connector.getDIDs()
Now the files can be handled as any other file:
for did in the_dids:
    # each entry is used as a path to a staged file
    with open(did) as didhandler:
        do_cool_science(didhandler.read())
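Since staged items are described above as being in essence either a string (a single file) or a list (a dataset or container), a small helper can treat both shapes uniformly. This is a sketch under that assumption only: iter_paths is a hypothetical helper, and it is exercised here with plain Python strings and lists rather than the real SingleItemDID and MultipleItemDID objects.

```python
def iter_paths(staged):
    """Yield file paths from a mix of string-like and list-like entries."""
    for item in staged:
        if isinstance(item, (list, tuple)):
            # List-like entry: descend into it and yield its members.
            yield from iter_paths(item)
        else:
            # String-like entry: use its string form as the path.
            yield str(item)


# Hypothetical paths standing in for staged files.
example = ["/data/image_001.fits", ["/data/night_1/a.fits", "/data/night_1/b.fits"]]
flat = list(iter_paths(example))
```

The resulting flat list can then be looped over in the same way as the_dids in the example above.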