This page describes how ESAP is connected to Rucio, the main data management component of the scientific data lake. The connection has been implemented in two ways:

1. Query functionality. The main ESAP interface makes it possible to query data on the scientific data lake.
2. Data access. Access to data lake data from a notebook has been implemented in the Data Lake as a Service (DLaaS) Jupyter notebook. The shopping basket client provides functionality that can be executed in this notebook to stage data from the scientific data lake to the DLaaS notebook.

Each of these modes is described in more detail in the following sections.

## Querying Rucio
Rucio queries are an example of the [Query Service Category](/Service-Categories/Query).

Rucio querying is configured through the `RUCIO_HOST` environment variable. To use this functionality, the Rucio instance must support OIDC tokens, and the tokens consumed by both services must use the same audience (in the case of ESCAPE: Rucio). An implementation using X509 AAI tokens relies on the `RUCIO_AUTH_TOKEN` and `RUCIO_AUTH_HOST` environment variables instead.
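
As a minimal sketch of how the configuration described above might be checked before starting a service, the snippet below inspects the environment variables named in this section. The helper `missing_rucio_settings` and the mode labels `"oidc"` and `"x509"` are hypothetical and not part of ESAP or Rucio; only the variable names come from the text above, and the grouping of variables per mode is an assumption.

```python
import os

# Variable names are taken from the text above; grouping them by "mode"
# is an assumption made for this illustration only.
REQUIRED_VARIABLES = {
    "oidc": ["RUCIO_HOST"],
    "x509": ["RUCIO_AUTH_TOKEN", "RUCIO_AUTH_HOST"],
}


def missing_rucio_settings(mode, environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES[mode] if not environ.get(name)]


# Check the current process environment for both (hypothetical) modes.
for mode in REQUIRED_VARIABLES:
    print(mode, "is missing:", missing_rucio_settings(mode))
```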

Once this is set up, users can query data in the data lake by first selecting their scope (in Rucio, the scope is the top-level namespace) and then searching by name and type (DIDs, i.e. Data IDentifiers, or Replicas; the Replicas option does not appear to be implemented). An example query is shown here.

In this result, the Type column can be `FILE`, `DATASET` or `CONTAINER`, the three types of DID implemented in Rucio. In essence, both `DATASET`s and `CONTAINER`s are collections of files that can be accessed as a single object (think of them as a flexible way of defining directories).

Once the data has been found, the user can click the check marks in the `Basket` column to add the items to the basket. Do not forget to confirm the selection by clicking the `Save Basket` button in the top bar of ESAP.
The shopping basket now contains references to each of the selected items (note that each DID is added as a single item, irrespective of its `Type`). An example shopping basket is shown below.

## Data Lake as a Service and the ESAP shopping basket

To access the data sets that have been added to the shopping basket, the Python shopping basket client can be used.
Data from a specific `Source` is selected by using a connector.
In addition to selecting specific data, connectors can be extended with functionality for obtaining the data from that source. For access to data in the data lake, the `RucioConnector` can be used, which expects to be executed inside the Data Lake as a Service notebook.

When starting the notebook, the first step is to install the shopping basket client:
```plaintext
!pip install git+https://git.astron.nl/astron-sdc/esap-userprofile-python-client.git
```
The next step is to import the ESAP shopping basket client and the connector:
```python
import esap_client
from esap_client.connectors import Rucio as RucioConnector
```
To download the basket, the following three commands can be used:
```python
rucio_connector = RucioConnector()
my_shopping_client = esap_client.shopping_client.ShoppingClient(host=HOST_URL, connectors=[rucio_connector])
my_basket = my_shopping_client.get_basket(filter_archives=True)
```
The first line sets up the `RucioConnector`. The second line sets up a shopping client to access the shopping basket of the ESAP instance running at `HOST_URL` (this is the root URL; for the ESAP demo instance it is `https://sdc-dev.astron.nl`). The `connectors` argument takes a list of connectors for data selection. Since we are only interested in Rucio data, this list contains only a Rucio connector. The third line retrieves the data from the shopping basket and stores it in the `my_basket` variable. The `filter_archives` argument is set to `True` so that only entries with source `rucio` are stored in the variable.

Next, we use the convenience staging functions that are part of the `RucioConnector`. These read the configuration of the DLaaS notebook and create a rule in Rucio to copy the data to the storage element that is mounted on the DLaaS service. The data is returned as an object whose class depends on the type of the DID: a `SingleItemDID` (for DIDs of type `FILE`) or a `MultipleItemDID` (for DIDs of types `DATASET` and `CONTAINER`). These objects are in essence a string and a list, respectively, and are the types used by the DLaaS notebook. The retrieval is started by the command
```python
rucio_connector.retrieve(my_basket)
```
Running
```python
rucio_connector.get_staging_status()
```
will return a list of the items in the shopping basket, together with their status in Rucio (most often `REPLICATING` for files that are being copied and `OK` for files that have been staged; ideally no files are in the `STUCK` state). For practical use, the connector can be asked to block until all files are available:
```python
rucio_connector.block_till_staged()
```
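To illustrate what blocking until everything is staged amounts to, here is a minimal polling sketch. It does not reproduce the connector's implementation: `wait_until_staged`, `fetch_statuses` and the simulated status source are hypothetical stand-ins, and only the state names `REPLICATING`, `OK` and `STUCK` come from the text above.

```python
import time


def wait_until_staged(fetch_statuses, poll_interval=5.0, max_polls=None):
    """Poll until every item reports OK; raise if any item is STUCK.

    fetch_statuses is any callable returning a {did: state} mapping.
    """
    polls = 0
    while True:
        statuses = fetch_statuses()
        stuck = sorted(did for did, state in statuses.items() if state == "STUCK")
        if stuck:
            raise RuntimeError(f"staging is stuck for: {', '.join(stuck)}")
        if all(state == "OK" for state in statuses.values()):
            return statuses
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError("staging did not finish in time")
        time.sleep(poll_interval)


def make_fake_source(snapshots):
    """Simulated status source: returns successive snapshots, then repeats the last."""
    remaining = list(snapshots)

    def fetch():
        return remaining.pop(0) if len(remaining) > 1 else remaining[0]

    return fetch


# Example run against simulated data (the DID names are made up).
source = make_fake_source([
    {"scope:file_a": "REPLICATING", "scope:file_b": "OK"},
    {"scope:file_a": "OK", "scope:file_b": "OK"},
])
print(wait_until_staged(source, poll_interval=0.01))
```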
Note that staging even small files can take a considerable amount of time, probably because of the overhead of queueing and managing transfers. The files that are part of the connector (including those obtained by unpacking `DATASET`s and `CONTAINER`s) can be retrieved using
```python
the_dids = rucio_connector.getDIDs()
```
Now the files can be handled like any other file:
```python
# do_cool_science is a placeholder for your own analysis function.
for did in the_dids:
    with open(did) as didhandler:
        do_cool_science(didhandler.read())
```
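Because a `MultipleItemDID` is in essence a list while a `SingleItemDID` is in essence a string, a loop that opens each entry directly may need to flatten collections first. The following is a minimal sketch under that assumption, using plain strings and lists as stand-ins for the connector's objects; `iter_file_paths` is a hypothetical helper and not part of the shopping basket client.

```python
def iter_file_paths(dids):
    """Yield individual paths from a mix of string-like and list-like entries."""
    for did in dids:
        if isinstance(did, str):
            # Stand-in for a SingleItemDID: one file path.
            yield did
        else:
            # Stand-in for a MultipleItemDID: a collection, possibly nested
            # (a CONTAINER can hold DATASETs).
            yield from iter_file_paths(did)


# Made-up paths for illustration only.
example = ["/data/a.fits", ["/data/b.fits", ["/data/c.fits"]]]
print(list(iter_file_paths(example)))
```

With such a helper, the loop above could iterate over `iter_file_paths(the_dids)` instead of `the_dids`, so that every entry passed to `open` is a single path.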