\begin{document}
\section{Context}
The Scientific Datalake (hereafter referred to as ``the datalake'') is the name of an infrastructure under investigation in Work Package 2 of the ESCAPE project. An essential component of the datalake is the Rucio system, which takes care of distributing data over the different storage systems based on rules. The principal method to access the back end is a REST API; the Rucio clients implement the communication with this API.
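As a minimal sketch of what talking to the Rucio REST API looks like, the snippet below builds (but does not send) a DID search request. The host name and token are placeholders, and the endpoint path and authentication header follow Rucio's REST conventions but should be checked against the deployed server version.

```python
import urllib.request

# Placeholder values: a real deployment has its own host, and the token
# would be obtained beforehand (e.g. via OIDC authentication).
RUCIO_HOST = "https://rucio.example.org"
TOKEN = "eyJ..."  # placeholder bearer token

def build_list_dids_request(scope: str, name_pattern: str) -> urllib.request.Request:
    """Build (but do not send) a request asking the Rucio REST API for
    DIDs in `scope` whose names match `name_pattern`."""
    url = f"{RUCIO_HOST}/dids/{scope}/dids/search?name={name_pattern}&type=dataset"
    return urllib.request.Request(url, headers={"X-Rucio-Auth-Token": TOKEN})

req = build_list_dids_request("myscope", "obs2021_*")
print(req.full_url)
```

Sending the request requires a live Rucio server; in practice the Rucio Python clients wrap exactly this kind of call.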
For the scientific end user, access to data in the datalake should be as transparent as possible. To this end, the Datalake as a Service (DLaaS) JupyterLab notebook, which implements access to the datalake through a Jupyter plugin, was developed at CERN. This notebook provides means to `stage' data sets to an area that is mounted in the notebook over a network FUSE mount, meaning applications can access the data as if it were on local disk.
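The effect of the FUSE mount is that staged data is read with ordinary file I/O. The sketch below simulates this with a temporary directory standing in for the mounted area, since an actual mount requires the DLaaS environment; the path layout shown is made up.

```python
import os
import tempfile

# A temporary directory stands in for the FUSE-mounted staging area of the
# DLaaS notebook (the real mount path is deployment specific).
with tempfile.TemporaryDirectory() as mount_point:
    staged_file = os.path.join(mount_point, "myscope", "obs2021_001.dat")
    os.makedirs(os.path.dirname(staged_file))
    with open(staged_file, "wb") as f:
        f.write(b"simulated staged data")

    # Once staged, applications read the file exactly like a local one.
    with open(staged_file, "rb") as f:
        data = f.read()
print(len(data))
```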
\section{DAC21}
In November, different partners within Work Package 2 of ESCAPE will demonstrate how the datalake can be used to perform typical use cases from their domain and instrument. This exercise has been named the Data Access Challenge '21 (hereafter DAC21). The scope of this document is to provide insight into what functionality is expected for DAC21, and what further development could be done to integrate the datalake and ESAP even more closely.
The major interactions between ESAP and the datalake will be aimed at performing analyses that require data from multiple sources, one of which is the datalake. To this end, ESAP should provide the possibility to query data in the datalake, put it in the ``shopping basket'', and access this data in the DLaaS notebook for further processing.
In practice, the implementation requirements on ESAP are:
\begin{itemize}
\item[1] Query the datalake from ESAP by scope and DID\footnote{A DID is the name that identifies an entry (file, \textit{dataset}, or \textit{container}) in the datalake.}. \label{dac:qry}
\item[2] Read the Rucio data found in the previous step by using the ESAP Python shopping basket plugin, and feed that to the DLaaS functionality in the notebook. \label{dac:acs}
\item[3] Optionally: having multiple ESAP instances participating in DAC21 could add value. \label{dac:mult}
\end{itemize}
Item \ref{dac:qry} has already been implemented. The main implementation work should therefore focus on item \ref{dac:acs}. The minimal way to implement this is by combining the shopping basket Python library with the DLaaS notebook. This would require copy-pasting the content of the JSON into the DLaaS query field. Ideally, however, the DLaaS service in the notebook would be able to read the data stored under the \texttt{rucio} key directly from the shopping basket. The user can then directly make it available for processing.
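The step of picking the Rucio entries out of the basket could look like the sketch below. The basket JSON shown is a hypothetical layout, not the actual ESAP schema, and the helper name is invented; the point is only that items carrying a \texttt{rucio} key are filtered out and turned into \texttt{scope:name} DIDs.

```python
import json
from typing import List

# Hypothetical shopping-basket export; the real ESAP basket schema may differ.
basket_json = """
{
  "items": [
    {"rucio": {"scope": "myscope", "name": "obs2021_001"}},
    {"rucio": {"scope": "myscope", "name": "obs2021_002"}},
    {"vo": {"ivoid": "ivo://example/some-table"}}
  ]
}
"""

def extract_rucio_dids(basket: dict) -> List[str]:
    """Collect `scope:name` DIDs from basket items that carry a `rucio`
    key, skipping entries that came from other services."""
    dids = []
    for item in basket.get("items", []):
        entry = item.get("rucio")
        if entry is not None:
            dids.append(f"{entry['scope']}:{entry['name']}")
    return dids

dids = extract_rucio_dids(json.loads(basket_json))
print(dids)  # ['myscope:obs2021_001', 'myscope:obs2021_002']
```

The resulting DID list is what the DLaaS staging functionality would then be fed, instead of the user copy-pasting JSON by hand.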
Since the goal of ESAP is to be a system that can be deployed and adapted by different projects, and the datalake has the ambition to support access by a broad range of users, the work described as item \ref{dac:mult} is aimed at demonstrating the flexibility of both ESAP and DIOS. However, this work depends on projects willing to demonstrate the deployment of ESAP on their own infrastructure.
\section{Requirements}
For DAC21 to be executed successfully from the ESAP perspective, it is essential that OIDC token-based authentication (at least for the Rucio catalogue) is in place. For data access, OIDC tokens are the preferred option, but if this is not supported by multiple storage systems, data access using X.509 certificates can be a fallback.
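Attaching an OIDC token to a request is a one-line matter once the token exists, which is why it is the preferred option. The sketch below shows a generic bearer-token header; the host, token value, and function name are placeholders, the exact header a Rucio server expects may differ (e.g. \texttt{X-Rucio-Auth-Token}), and an X.509 fallback would instead be configured at the TLS transport layer rather than per request.

```python
import urllib.request
from typing import Optional

def authenticated_request(url: str, oidc_token: Optional[str] = None) -> urllib.request.Request:
    """Attach an OIDC bearer token when one is available. Without a token,
    the caller would fall back to X.509 client certificates, which are
    supplied to the TLS layer rather than as an HTTP header."""
    headers = {}
    if oidc_token:
        headers["Authorization"] = f"Bearer {oidc_token}"
    return urllib.request.Request(url, headers=headers)

req = authenticated_request(
    "https://rucio.example.org/dids/myscope/dids/search",  # placeholder host
    oidc_token="eyJhbGciOi...",  # placeholder token
)
print(req.get_header("Authorization"))
```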
\section{After DAC21}
After DAC21, further development could focus on offering querying functionality that is more complex than just file names. This depends on the implementation of custom metadata for data in the Rucio installation. If that is in place, it would be a very good opportunity to demonstrate how the datalake can support the findability of the data.
\end{document}