diff --git a/ESAP_DIOS.tex b/ESAP_DIOS.tex index b5f78602b31878f3448d0d55675aeca62bebdef3..01afa5594d2101840969b642581f94b51f9e3a06 100644 --- a/ESAP_DIOS.tex +++ b/ESAP_DIOS.tex @@ -1,38 +1,99 @@ - \documentclass[pu]{escape} +\documentclass[pu]{escape} \input{meta} \setDocNumber{WP5-1} \setDocDate{\vcsDate} -\setDocTitle{The datalake in ESAP} +\setDocTitle{The Data Lake in \glsentrytext{ESAP}} +\setWorkPackage{WP5} +\setLeadAuthor{Yan Grange (ASTRON)} +\setOtherAuthors{John Swinbank (ASTRON)} +\setDueDate{2021-11-12} +\setDueMonth{} + +\newacronym{DAC21}{DAC~21}{Data Access Challenge 2021} +\newacronym{DLaaS}{DLaaS}{Data Lake as a Service} +\newacronym{DID}{DID}{Data IDentifier} +\newacronym{FUSE}{FUSE}{Filesystem in Userspace} +\newacronym{OIDC}{OIDC}{OpenID Connect} \begin{document} -\section{context} -The Scientific Datalake (hereafter referred to as ``the datalake'') is the name of an infrastructure being under investigation in Work Package 2 of the ESCAPE project. An essential component of the datalake is the Rucio system which takes care of making sure data is distributed over the different storage systems based on rules. The principal method to access the back end is a implemented as a REST API. The Rucio clients implement the communication with this API. +\maketitle + +\printglossary[title=List of Abbreviations] +\clearpage +\tableofcontents +\clearpage + +\section{Context} + +\Acrshort{ESCAPE} Work Package 2, \gls{DIOS}, is developing a scientific ``data lake'': a distributed system for storage and management of bulk scientific datasets. +This data lake is based on Rucio\footnote{\url{https://rucio.cern.ch/}}, which provides policy-based data distribution across heterogeneous storage systems. +This system is principally accessed through a \Acrshort{REST} \Acrshort{API}, which is used for communication by Rucio clients. 
+ +\Acrshort{ESCAPE} Work Package 5, \gls{ESAP}, is developing a toolkit which enables the construction of ``science platforms'': systems which integrate data discovery, access, and analysis, across multiple different resources, presenting the results to the end user through a consistent interface. +\gls{ESAP} is built around the concept of a ``shopping basket'', which contains the current working set of data of interest to the user, potentially harvested from and manipulated using a variety of different archives and systems. + +This document describes plans for integrating the Data Lake with \Acrshort{ESAP}. +It particularly focuses on \gls{DAC21}, a major WP2-driven integration exercise scheduled for November 2021, but also provides ideas for future work. + +\section{\glsentrylong{DLaaS}} + +For the scientific end user, access to data in the data lake should be as transparent as possible. +To this end, the WP2 team at CERN have developed the \gls{DLaaS} system, which provides access to the data lake through a Jupyter\footnote{\url{https://jupyter.org/}} notebook plugin\footnote{\url{https://escape-notebook.cern.ch}}. +This notebook provides means to `stage' data sets to an area that is mounted over a networked \gls{FUSE} mount in the notebook, meaning applications can access the data as if it were on local disk. + +\section{\glsentrylong{DAC21}} + +In November 2021, as part of the \gls{DAC21} exercise, various partners within Work Package 2 of ESCAPE will demonstrate how the data lake can be used to perform typical use cases from their domains and instruments. +This section describes the integration expected between \gls{ESAP} and the data lake in support of \gls{DAC21}. 
+ +\subsection{Features and requirements} +\label{sec:dac21:features} + +\subsubsection{\glsentrytext{DLaaS} integration} + +Within the context of \gls{DAC21}, we anticipate that \gls{ESAP} will be used to perform analysis that requires data from multiple sources, including data stored in the data lake. +To this end, ESAP should provide the capability to query for data in the data lake, put it in the shopping basket, and access this data in the \gls{DLaaS} notebook for further processing. + +The consequent requirements on \gls{ESAP} are: + +\begin{enumerate} + \item Query the data lake from \gls{ESAP} by scope and \gls{DID}\footnote{A \gls{DID} is the name that identifies an entry in the data lake.}. \label{dac:qry} + \item Read the Rucio data found in the previous step by using the \gls{ESAP} Python shopping basket plugin, and provide that to the \gls{DLaaS} functionality in the notebook. \label{dac:acs} +\end{enumerate} + +Item \ref{dac:qry} has already been implemented. +The main implementation work should therefore focus on item \ref{dac:acs}. +At a minimum, this can be implemented by combining the shopping basket Python library, provided by WP5, with the \gls{DLaaS} notebook. +This would require copy-pasting the content of the \gls{JSON} into the \gls{DLaaS} query field. +Ideally, however, the \gls{DLaaS} service in the notebook would be able to directly read data from the shopping basket using the key \texttt{Rucio}. +The data would then be made directly available to the user for processing. + +\subsubsection{Multiple \glsentrytext{ESAP} instances} -For the scientific end user, access to data in the datalake should be as transparent as possible. To this end, the datalake as a Service (DLaaS) Jupyter lab notebook, which implements access to the datalake by a Jupyter plugin, was developed at CERN. 
This notebook provides means to `stage' data sets to an area that is mounted over a network FUSE mount in the notebook, meaning applications can access the data as if it were on local disk. +Since the goal of \gls{ESAP} is to be a system that can be deployed and adapted by different projects, and the data lake has the ambition to support access by a broad range of users, \gls{DAC21} provides an important opportunity to demonstrate the flexibility of both \gls{ESAP} and \gls{DIOS}. +We therefore suggest that multiple projects should deploy \gls{ESAP} during \gls{DAC21}. +As per \cref{sec:dac21:delivery}, hardware provisioning and deployment during \gls{DAC21} is a WP2 responsibility. +However, WP5 will provide support by ensuring that the \gls{ESAP} codebase is technically capable of supporting multiple independent deployments of \gls{ESAP}. -\section{DAC21} -In November, different partners within work package 2 of ESCAPE will demonstrate how the datalake can be used to perform typical use cases which are coming from their domain and instrument. This exercise has been named Data Access Challenge \'21 (hereafter DAC21). The scope of this document is to provide insight on what functionality is expected for DAC21, and what potential further development could be done to integrate the datalake and ESAP even more. +\subsubsection{\glsentrytext{OIDC} authentication} -The major interactions between ESAP and the datalake will be aimed at performing analysis that requires data from multiple sources, where one of those is the datalake. To this end, ESAP should provide the possibility to query data in the datalake, put it in the ``shopping basket'', and access this data in the DLaaS notebook for further processing. +For \gls{DAC21} to be executed successfully from the \gls{ESAP} perspective, it is essential that \gls{OIDC} token-based authentication --- at least for the Rucio catalogue --- is in place. 
+For data access, \gls{OIDC} tokens are the preferred option, but if multiple storage systems do not support them, data access using X.509 certificates can be a fall-back. -In practice, the implementation requirements on ESAP are: -\begin{itemize} - \item[1] Query the Datalake from ESAP by scope and DID\footnote{A DID is the name under which identifies an entry, file \textit{dataset} or \textit{container}, in the datalake.}. \label{dac:qry} - \item[2] Read the Rucio data found in the previous step by using the ESAP python shopping basket plugin, and feed that to the DLaaS functionality in the notebook. \label{dac:acs} - \item[3] Optionally: having multiple ESAP instances participating in DAC21 could add value. \label{dac:mult} -\end{itemize} +\subsection{Delivery and deployment} +\label{sec:dac21:delivery} -Item \ref{dac:qry} has already been implemented. The main implementation work should therefore focus on item \ref{dac:acs}. The minimal way this needs to be implemented is by combining the shopping basked python library with he DLaaS notebook. This would require copy-pasting the content of the JSON in the DLaaS query field. Ideally however the DLaaS service in the notebook would be able to directly read the data with key Rucio from the shopping basket. The user can then directly make it available for processing. +Code which provides the features described in \cref{sec:dac21:features} will be committed by WP5 members to \gls{ESAP} code repositories on \texttt{git.astron.nl} in advance of the start of \gls{DAC21}. +WP5 members will provide best-efforts support to members of WP2 or other \gls{DAC21} participants in using \gls{ESAP}, both generally and with specific attention to this functionality. 
-Since the goal of ESAP is to be a system that can be deployed and adapted by different projects, and the datalake has the ambition to support access by a broad range of users, the work described as item \ref{dac:mult} is aimed at demonstrating the flexibility of both ESAP and DIOS. However this work depends on projects willing to demonstrate the deployment of ESAP on their own infrastructure. +WP2 members and other \gls{DAC21} participants are responsible for deploying and using \gls{ESAP} in the context of \gls{DAC21}, including provisioning the hardware systems on which those deployments are performed. +\section{Subsequent work} -\section{requirements} -For DAC21 to be executed successfully from the ESAP perspective, it is essential that OIDC token-based authentication (at least for the Rucio catalogue) is in place. For data access OIDC tokes are the preferred option, but if this is not supported by multiple storage systems data access using X509 certificates can be a fall-back. +After \gls{DAC21}, further development could focus on offering querying functionality that is more complex than just file names. +This depends on the implementation of custom metadata for data in the Rucio installation by the WP2 team. +When this is available, there is an excellent opportunity to demonstrate how \gls{ESAP} and \gls{DIOS} can interoperate to support enhanced data findability. -\section{After DAC21} -After DAC2 further development could focus on offering querying functionality that is more complex than just file names. This does depend on the implementation of custom metadata for data in the Rucio installation. If this is the case, it would make it a very good opportunity to demonstrate how the datalake can support the findability of the data. 
- \end{document} diff --git a/Makefile b/Makefile index e9993259d2ee2b3f41a5a89ad185f67546b74f44..b31ccd54afb3fd73da4fbcc8612713b3fb6e8583 100644 --- a/Makefile +++ b/Makefile @@ -3,6 +3,7 @@ export TEXMFHOME ?= astron-texmf/texmf $(DOCHANDLE).pdf: ESAP_DIOS.tex meta.tex xelatex -jobname=$(subst .,_,$(DOCHANDLE)) ESAP_DIOS.tex + makeglossaries $(subst .,_,$(DOCHANDLE)) xelatex -jobname=$(subst .,_,$(DOCHANDLE)) ESAP_DIOS.tex xelatex -jobname=$(subst .,_,$(DOCHANDLE)) ESAP_DIOS.tex