Verified Commit d993760c authored by John Swinbank

Merge branch 'grange-swinbank-comments' into swinbank-cleanup

parents 1d585267 030980bb
2 merge requests: !2 Grange swinbank comments, !1 Reword, clarify, clean-up
...@@ -5,5 +5,9 @@ ESAP_DIOS.ist ...@@ -5,5 +5,9 @@ ESAP_DIOS.ist
ESAP_DIOS.log ESAP_DIOS.log
ESAP_DIOS.out ESAP_DIOS.out
ESAP_DIOS.pdf ESAP_DIOS.pdf
ESAP_DIOS.glg
ESAP_DIOS.gls
ESAP_DIOS.tex.bak
ESAP_DIOS.toc
ESAP_DIOS.run.xml ESAP_DIOS.run.xml
meta.tex meta.tex
...@@ -15,6 +15,8 @@ ...@@ -15,6 +15,8 @@
\newacronym{DID}{DID}{Data IDentifier} \newacronym{DID}{DID}{Data IDentifier}
\newacronym{FUSE}{FUSE}{Filesystem in Userspace} \newacronym{FUSE}{FUSE}{Filesystem in Userspace}
\newacronym{OIDC}{OIDC}{OpenID Connect} \newacronym{OIDC}{OIDC}{OpenID Connect}
\newacronym{FTS}{FTS}{File Transfer Service}
\newacronym{FAAI}{FAAI}{Federated Authentication and Authorization Infrastructure}
\begin{document} \begin{document}
...@@ -27,14 +29,21 @@ ...@@ -27,14 +29,21 @@
\section{Context} \section{Context}
\Acrshort{ESCAPE} Work Package 2, \gls{DIOS}, is developing a scientific ``data lake'': a distributed system for storage and management of bulk scientific datasets. \Acrshort{ESCAPE} Work Package 2, \gls{DIOS}, is developing a ``scientific data lake'': a distributed system for storage and management of bulk scientific data sets\footnote{Note that this definition differs from the usage of the term ``data lake'' in cloud environments.
This data lake is based on the Rucio\footnote{\url{https://rucio.cern.ch/}}, which provides policy based data distribution across heterogeneous storage systems. For brevity we will use the term ``data lake'' for the scientific data lake throughout this document.}.
This data lake consists of several components:
\begin{enumerate}
\item \gls{FTS}\footnote{\url{https://fts.web.cern.ch/fts/}}, which manages file transfers.
\item Indigo \gls{IAM}\footnote{\url{https://indigo-iam.github.io}}, the \gls{FAAI} service, which provides user and group management.
\item Rucio\footnote{\url{https://rucio.cern.ch/}}, which provides policy-based data distribution across heterogeneous storage systems.
\end{enumerate}
The primary interface to the data lake is through Rucio.
This system is principally accessed through a \Acrshort{REST} \Acrshort{API}, which is used for communication by Rucio clients. This system is principally accessed through a \Acrshort{REST} \Acrshort{API}, which is used for communication by Rucio clients.
\Acrshort{ESCAPE} Work Package 5, \gls{ESAP}, is developing a toolkit which enables the construction of ``science platforms'': systems which integrate data discovery, access, and analysis, across multiple different resources, presenting the results to the end user through a consistent interface. \Acrshort{ESCAPE} Work Package 5, \gls{ESAP}, is developing a toolkit which enables the construction of ``science platforms'': systems which integrate data discovery, access, and analysis, across multiple different resources, presenting the results to the end user through a consistent interface.
\gls{ESAP} is built around the concept of a ``shopping basket'', which contains the current working set of data of interest to the user, potentially harvested from and manipulated using a variety of different archives and systems. \gls{ESAP} is built around the concept of a ``shopping basket'', which contains the current working set of data of interest to the user, potentially harvested from and manipulated using a variety of different archives and systems.
This document describes plans for integrating the Data Lake with \Acrshort{ESAP}. This document describes plans for integrating the data lake with \Acrshort{ESAP}.
It particularly focuses on \gls{DAC21}, a major WP2-driven integration exercise scheduled for November 2021, but also provides ideas for future work. It particularly focuses on \gls{DAC21}, a major WP2-driven integration exercise scheduled for November 2021, but also provides ideas for future work.
\section{\glsentrylong{DLaaS}} \section{\glsentrylong{DLaaS}}
...@@ -54,13 +63,13 @@ This section describes the integration expected between \gls{ESAP} and the data ...@@ -54,13 +63,13 @@ This section describes the integration expected between \gls{ESAP} and the data
\subsubsection{\glsentrytext{DLaaS} integration} \subsubsection{\glsentrytext{DLaaS} integration}
Within the context of \gls{DAC21}, we anticipate that \gls{ESAP} will be used to perform analysis that requires data from multiple sources, including data stored in the data lake. Within the context of \gls{DAC21}, we anticipate that \gls{ESAP} will be used to perform analysis that requires data from multiple sources, including data stored in the data lake.
To this end, ESAP should provide the capability to query for data in the data lake, put it in the shopping basket, and access this data in the \gls{DLaaS} notebook for further processing. To this end, ESAP should provide the capability to query for data in the data lake, put it in the shopping basket together with data from other sources, and access this data in the \gls{DLaaS} notebook for further processing.
The consequent requirements on \gls{ESAP} are: The consequent requirements on \gls{ESAP} are:
\begin{enumerate} \begin{enumerate}
\item Query the data lake from \gls{ESAP} by scope and \gls{DID}\footnote{A \gls{DID} is the name that identifies an entry in the data lake.}. \label{dac:qry} \item Query the data lake from \gls{ESAP} by scope and \gls{DID}\footnote{A \gls{DID} is the name that identifies an entry in the data lake, which could either be a single file, or a collection of files.}. \label{dac:qry}
\item Read the Rucio data found in the previous step by using the \gls{ESAP} Python shopping basket plugin, and provide that to the \gls{DLaaS} functionality in the notebook. \label{dac:acs} \item Read the Rucio data found in the previous step by using the \gls{ESAP} Python shopping basket plugin, and provide that to the \gls{DLaaS} functionality in a notebook. \label{dac:acs}
\end{enumerate} \end{enumerate}
Item \ref{dac:qry} has already been implemented. Item \ref{dac:qry} has already been implemented.
...@@ -80,7 +89,12 @@ However, WP5 will provide support by ensuring that the \gls{ESAP} codebase is te ...@@ -80,7 +89,12 @@ However, WP5 will provide support by ensuring that the \gls{ESAP} codebase is te
\subsubsection{\glsentrytext{OIDC} authentication} \subsubsection{\glsentrytext{OIDC} authentication}
For \gls{DAC21} to be executed successfully from the \gls{ESAP} perspective, it is essential that \gls{OIDC} token-based authentication --- at least for the Rucio catalogue --- is in place. For \gls{DAC21} to be executed successfully from the \gls{ESAP} perspective, it is essential that \gls{OIDC} token-based authentication --- at least for the Rucio catalogue --- is in place.
For data access \gls{OIDC} tokens are the preferred option, but if this is not supported by multiple storage systems data access using X.509 certificates can be a fall-back. This means that applying rules, querying the data in the data lake, and other catalogue-focused activities need to be possible using authentication with an \gls{OIDC} token.
For direct data access --- uploads and downloads between local and remote storage --- \gls{OIDC} tokens are still the strongly preferred option.
However, the data lake is heterogeneous by design, consisting of different storage middleware systems.
Implementing \gls{OIDC} authentication on all of these systems would require action from development teams that are not part of the project.
Therefore, data access using X.509 certificates can serve as a fall-back.
\subsection{Delivery and deployment} \subsection{Delivery and deployment}
\label{sec:dac21:delivery} \label{sec:dac21:delivery}
......