Add reworked vision section

Commit 5bedf555 authored Dec 18, 2022 by John Swinbank
--- a/contents/2-vision.tex
+++ b/contents/2-vision.tex
 \section{The \glsentrytext{ESAP} Vision}
 \label{sec:vision}
+
+This section presents a brief overview of the vision for and design of \pgls{ESAP}.
+It supplements and expands upon earlier discussions \cite{ESCAPE-GA, ESCAPE-D5.2, ESCAPE-D5.3} to describe current thinking about how \pgls{ESAP} can best meet its goals.
+Note that not all of the capabilities described in this section are currently available in the \pgls{ESAP} codebase: see \cref{sec:current} for a description of the current state of the art, and \cref{sec:future} for future development plans.
+
+\subsection{High-Level Summary}
+\label{sec:vision:summary}
+
+\pgls{ESAP} may be conveniently described as a \emph{toolkit} for building \emph{science platforms}.
+By unpacking and explaining those terms, we can best summarize the goals we have been working towards in developing \gls{ESAP}.
+
+By \emph{science platform}, we mean an interactive web-based environment which supports the full lifecycle of the scientific analysis process: it enables scientists to discover data, access it, perform interactive data analysis and use specialist data visualization tools, run bulk processing workflows, and ultimately publish their results to a persistent archive.
+
+By \emph{toolkit}, we mean that \pgls{ESAP} is not, in itself, intended to be deployed independently.
+Rather, \pgls{ESAP} provides a collection of software components, design patterns, and best practices that enable \glspl{ESFRI} or other groups that have a library of bespoke services (for example data archives, or interactive notebook environments) to make them available to a user community in a coherent, consistent, and integrated way.
+
+Using the \pgls{ESAP} toolkit, \gls{ESCAPE} project partners can assist users in engaging with the services provided by the other \gls{ESCAPE} work packages by:
+
+\begin{itemize}
+
+\item{providing a flexible interface for querying and retrieving data from a variety of archives and data repositories, with particular emphasis on those which are stored in or accessible through the services provided by \gls{ESCAPE} \glspl{WP} 2 (\Acrshort{DIOS}: \Acrlong{DIOS}) and 4 (\Acrshort{CEVO}: \Acrlong{CEVO}), as well as the citizen science platforms addressed through \gls{WP}6;}
+
+\item{enabling users to explore software repositories, such as the \gls{WP}3 \gls{OSSR}, to identify and select analysis tools and workflows which are appropriate to their needs;}
+
+\item{helping users to identify interactive data analysis and batch computing facilities which are accessible to them;}
+
+\item{facilitating the staging of data, software, and workflows to compute facilities, providing access to those facilities for end users, and subsequently retrieving the results of processing.}
+
+\end{itemize}
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=0.66\textwidth]{figures/ESCAPE/esap-interfaces.pdf}
+\end{center}
+\caption{The relationship between major \glsentrytext{ESAP} components, described in the text, and a selection of the services provided by various other \gls{ESCAPE} work packages.}
+\label{fig:vision:interfaces}
+\end{figure}
+
+This relationship between \pgls{ESAP} and the capabilities exposed by the other \gls{ESCAPE} work packages --- and the wider service infrastructure within which it exists --- is illustrated in \cref{fig:vision:interfaces} and explored further in subsequent sections.
+
+By design, \pgls{ESAP} is extensible: rather than attempting to anticipate every possible type of data repository, software, compute system, or other service provider, the platform will provide generic interfaces through which it can be extended to encompass new functionality.
+
+In short, our approach is not to attempt to provide a single, integrated platform to which all researchers must adapt, but rather a set of functionalities from which various communities and research infrastructures can assemble an analysis platform geared to their specific needs.
+Deploying such a science platform at the scale of a system like \pgls{EOSC} provides a natural opportunity to integrate with the data and computing fabric this environment encompasses while simultaneously accessing the tools, techniques, and expertise other research domains bring to that environment.
+At the same time, we expect that instances of \pgls{ESAP} may usefully be deployed in other contexts, from providing services to just a few users within a small project, to supporting major pieces of infrastructure; it must therefore be capable of operating effectively at a range of scales.
+
+\subsection{Conceptual Model}
+\label{sec:vision:model}
+
+\pgls{ESAP}, in and of itself, provides no compute or analysis capabilities (beyond a simple ability to view tabular data and preview images).
+Rather, it acts as a broker between users and the various query and analysis services which are available to them.
+These might include, for example:
+
+\begin{itemize}
+
+\item{bulk data query systems, which can help the user locate and access data files (images, visibility data, etc.) in archives, data lakes, or similar bulk storage systems;}
+
+\item{tabular data query systems, which can help the user find relevant entries in source catalogues and similar relational systems;}
+
+\item{\gls{IDA} systems, which provide the user compute and visualization tools in a convenient environment with access to relevant datasets (for example, a Jupyter \autocite{jupyter:2016} notebook, or containerized analysis application);}
+
+\item{bulk data processing systems, which provide batch (non-interactive) processing of data at-scale in \gls{HPC} or \gls{HTC} environments;}
+
+\item{scientific software repositories, which provide access to specialist analysis tools and workflows;}
+
+\end{itemize}
+
+A given instance of \pgls{ESAP} is configured with information about available services\footnote{This configuration is instance-specific: for example, a central \gls{EOSC} installation of \pgls{ESAP} might provide access to a wide range of services, spanning the entire \gls{EOSC}, while an institutional or project-level system may only be configured with information about local resources.}.
+When a user connects, the \pgls{ESAP} instance will:
+
+\begin{itemize}
+
+\item{help the user select services which are relevant to them (for example, by clearly presenting the available services, by making clear what science cases those services support, by taking account of the user's access privileges, etc.);}
+
+\item{facilitate authentication and authorization with the various services, as necessary;}
+
+\item{provide a consistent and convenient way for the user to access services (for example, by providing the user with a single way to enter a particular query, and then automatically translating that to the requirements of each individual service);}
+
+\item{mediate data flow between services (for example, by enabling the user to locate data with an archive query, dispatch the data to the processing facility, and schedule processing of the data on a bulk data processing system).}
+
+\end{itemize}
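+
+The idea of accepting a query once and translating it for each service can be sketched as follows.
+This is a minimal illustration in Python and not the \gls{ESAP} implementation: the \texttt{ConeSearch} class, both translation functions, and the table name are hypothetical.
+The parameter names \texttt{RA}, \texttt{DEC}, and \texttt{SR} follow the \gls{IVOA} Simple Cone Search convention, and the ADQL form assumes a table with columns named \texttt{ra} and \texttt{dec}.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConeSearch:
    """A service-independent query: a circle on the sky, in degrees."""
    ra: float
    dec: float
    radius: float


def to_adql(query: ConeSearch, table: str) -> str:
    """Render the query as ADQL for a hypothetical TAP-style catalogue service.

    Assumes the target table has columns named ``ra`` and ``dec``.
    """
    return (
        f"SELECT * FROM {table} WHERE 1=CONTAINS("
        f"POINT('ICRS', ra, dec), "
        f"CIRCLE('ICRS', {query.ra}, {query.dec}, {query.radius}))"
    )


def to_http_params(query: ConeSearch) -> dict:
    """Render the same query as key/value parameters for a cone-search style archive."""
    return {"RA": query.ra, "DEC": query.dec, "SR": query.radius}


# The user enters the query once; each service receives its own dialect.
query = ConeSearch(ra=10.68, dec=41.27, radius=0.1)
print(to_adql(query, "example.catalogue"))
print(to_http_params(query))
```
+
+Keeping the user-facing query separate from its per-service renderings means that supporting a new archive only requires one more translation function.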
+
+This relationship is illustrated schematically in \cref{fig:vision:model}: this shows the end user communicating directly with \pgls{ESAP}, which mediates their interactions with a range of other services, deployed across a variety of different infrastructures.
+
+Note that the user communicates with a single \pgls{ESAP} instance, while that instance mediates interactions with a range of different services from a variety of infrastructure providers.
+The current \pgls{ESAP} system provides no concept of federation between instances; however, this is a topic that we will return to in \cref{sec:future:fed}.
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=0.66\textwidth]{figures/ESCAPE/esap-overview.pdf}
+\end{center}
+\caption{\glsentrytext{ESAP} in its environment.}
+\label{fig:vision:model}
+\end{figure}
+
+\subsection{Major Functionality}
+\label{sec:vision:capabilities}
+
+\subsubsection{User Interface}
+\label{sec:vision:capabilities:ui}
+
+\gls{ESAP} is primarily a web application: the central hub (the “\gls{API} Gateway”, or “back-end”) runs on one or more servers, and users interact with it by making \gls{HTTP} requests.
+The work package will provide a customizable front-end application (“ESAP-GUI”) which runs in the browser and communicates with the back-end.
+This separation of concerns is illustrated in \cref{fig:vision:capabilities:ui}.
+In principle, it may be possible to support alternative \glspl{GUI} which communicate with the same back-end.
+Providing such alternatives is out of scope for this work package, but leaves room for future extension of the work if appropriate.
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=0.66\textwidth]{figures/ESCAPE/esap-high-level-architecture.pdf}
+\end{center}
+\caption{The high-level architecture of \glsentrytext{ESAP}.}
+\label{fig:vision:capabilities:ui}
+\end{figure}
+
+\subsubsection{Authentication and Authorization}
+\label{sec:vision:capabilities:aa}
+
+Users may be asked to log in to access \pgls{ESAP} itself, or to use some or all of the services mediated by a given \gls{ESAP} instance.
+
+This step is not required: if both the owner of the \pgls{ESAP} instance and the owner of any services being accessed make them available to the general public, then \pgls{ESAP} need not force the user to log in.
+In general, however, users are expected to log in before using the data orchestration services (\cref{sec:vision:capabilities:orch}).
+
+\pgls{ESAP} as delivered by this work package will provide for user authentication through the \gls{ESCAPE} \gls{IAM} service\footnote{\url{https://iam-escape.cloud.cnaf.infn.it/login}}.
+Where possible, \pgls{ESAP} is designed to be flexible and adaptable to other systems, but explicit support for other systems is outside the scope of this work package.
+
+\subsubsection{Data Orchestration within \glsentrytext{ESAP}}
+\label{sec:vision:capabilities:orch}
+
+The fundamental \pgls{ESAP} workflow is that the user will query one or more archives to identify data of interest, then dispatch that data to \gls{IDA} or bulk processing systems for processing.
+
+To support this model, \pgls{ESAP} maintains a per-user list of active data items: the “shopping basket”.
+This basket is persistent: (a representation of) the data the user has selected is serialized as \gls{JSON}, and the results are stored in a database.
+Note that the basket is not generally expected to contain a complete representation of the data in question (it will not store multi-\gls{GB} images or query results), but rather it will contain sufficient metadata that the data can be fetched and manipulated on demand (for example, it will store the query which produces the result in question, or a path or other identifier which enables data to be fetched from the “data lake” or other storage).
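+
+As a rough sketch of this idea, a basket entry might hold only a reference to the data plus a little metadata, and round-trip cleanly through \gls{JSON}.
+The field names, archive names, and identifiers below are assumptions made for illustration; they are not the schema used by \gls{ESAP}.
+
```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class BasketItem:
    """One entry in a hypothetical shopping basket."""
    archive: str    # which archive or storage system the item refers to
    kind: str       # for example "query" or "file"
    reference: str  # the query text, path, or identifier; never the data itself
    metadata: dict = field(default_factory=dict)


basket = [
    BasketItem("example-vo-service", "query",
               "SELECT TOP 10 * FROM example.catalogue"),
    BasketItem("example-data-lake", "file", "scope:example/image-0001.fits",
               {"size_bytes": 2_500_000_000}),
]

# Serialize for storage in a database column, then restore.
serialized = json.dumps([asdict(item) for item in basket])
restored = [BasketItem(**record) for record in json.loads(serialized)]

print(len(serialized), "bytes of JSON describe", len(basket), "items")
```
+
+The serialized form stays small however large the referenced image is, because only the identifier needed to fetch it on demand is stored.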
+
+Services integrated with the \pgls{ESAP} system will be able to edit, augment, and update the contents of the user's shopping basket.
+
+This shopping basket metaphor extends to include services, such as \gls{IDA} or batch compute facilities, and workflows from the \pgls{OSSR} and other repositories: as they move through the system, users will be able to identify services or software of interest, and store them for later use.
+
+\subsubsection{Data Discovery and Staging}
+\label{sec:vision:capabilities:data}
+
+\pgls{ESAP} provides a uniform interface which enables users to dispatch queries to a multiplicity of archive services.
+These include both federated, multi-facility systems such as the \gls{VO} and facility- or \gls{ESFRI}-specific archives.
+They also include the “data lake” being developed as part of the \gls{DIOS} system in \gls{ESCAPE} \gls{WP}2.
+
+The data discovery system adapts itself dynamically to the type of archive being queried.
+For example, it is possible to query astronomical archives by using astronomy-specific parameters such as the celestial position where appropriate.
+
+When data of interest to the user has been located, it is possible, where appropriate, to arrange for the data to be “staged”: that is, moved from the archive to storage which is available with low latency from an appropriate analysis system.
+
+\subsubsection{\glsentrytext{SAMP}}
+\label{sec:vision:capabilities:samp}
+
+\pgls{ESAP} provides support for the \gls{IVOA} \gls{SAMP} \autocite{2012ivoa.spec.1104T}.
+This makes it possible for users of other \gls{SAMP}-compliant tools --- including TOPCAT \autocite{topcat:2005}, Aladin \autocite{aladin:2000} and Astropy \autocite{astropy:2018} --- as well as archive interfaces like ESASky \autocite{esasky:2020} to exchange data with \pgls{ESAP}.
+This means that users can take advantage of the advanced querying and data manipulation capabilities provided by these tools and facilities in conjunction with the possibilities offered by \pgls{ESAP}, maximizing interoperability and avoiding duplication of effort.
+
+\subsubsection{\glsentrydesc{IDA}}
+\label{sec:vision:capabilities:ida}
+
+\gls{IDA} describes a scientist interacting with a dataset in real time to perform their analyses.
+That is, they type commands or manipulate controls, and observe the results that are produced or the figures that are displayed.
+Contrast this with batch processing, discussed in \cref{sec:vision:capabilities:batch}.
+
+The processes and tools required for \gls{IDA} differ substantially from field to field and from facility to facility.
+For example, the way that data from the \gls{SKA} will be analyzed is very different to the processes applied to data from the \gls{LHC}.
+It is therefore essential that \gls{ESAP} implement a flexible capability for interfacing with a variety of \gls{IDA} services.
+
+The architecture described in \cref{sec:vision:capabilities:ui}, together with the data orchestration system described in \cref{sec:vision:capabilities:orch}, is designed to make this possible.
+Specifically, \pgls{ESAP} provides an \gls{API} through which \gls{IDA} systems can access the “shopping basket”, both to retrieve data items and to provide (appropriately authenticated) updates from the \gls{IDA} system as the user saves their analysis.
+The expectation is that the \gls{IDA} system will write substantial data products (such as output images) to bulk storage (such as the \gls{DIOS} data lake), and return references to them to \pgls{ESAP} for further analysis.
+
+\subsubsection{Batch Data Processing}
+\label{sec:vision:capabilities:batch}
+
+Batch data processing describes a situation which is in many ways similar to \gls{IDA} (\cref{sec:vision:capabilities:ida}), but with a number of significant differences:
+
+\begin{itemize}
+
+\item{the work is carried out asynchronously: the user submits a job, and then returns some time later to examine the results;}
+
+\item{the user does not interact with the computing systems while processing takes place;}
+
+\item{processing generally happens at scale, perhaps being distributed over multiple computing systems.}
+
+\end{itemize}
+
+\pgls{ESAP} supports this by:
+
+\begin{itemize}
+
+\item{providing a generic \gls{API} for interacting with batch compute systems, combined with one or more adaptations of this interface to specific systems;}
+
+\item{providing a user interface for asynchronous processing, where \gls{ESAP} tracks the progress of user jobs, and notifies the submitter when they are complete.}
+
+\end{itemize}
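+
+The two points above, a generic interface with system-specific adaptations and a tracker that notifies the submitter, can be sketched as follows.
+All class and function names are hypothetical, and the in-memory adapter is a toy stand-in for a real batch system; it is not how \gls{ESAP} or any particular scheduler is implemented.
+
```python
from __future__ import annotations

from abc import ABC, abstractmethod
from enum import Enum


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


class BatchBackend(ABC):
    """Generic interface; each concrete batch system would supply an adapter."""

    @abstractmethod
    def submit(self, workflow: str, inputs: list[str]) -> str: ...

    @abstractmethod
    def status(self, job_id: str) -> JobState: ...


class InMemoryBackend(BatchBackend):
    """Toy adapter: every status poll moves the job one step forward."""

    def __init__(self) -> None:
        self._jobs: dict[str, JobState] = {}

    def submit(self, workflow: str, inputs: list[str]) -> str:
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = JobState.QUEUED
        return job_id

    def status(self, job_id: str) -> JobState:
        current = self._jobs[job_id]
        if current is JobState.QUEUED:
            self._jobs[job_id] = JobState.RUNNING
        elif current is JobState.RUNNING:
            self._jobs[job_id] = JobState.DONE
        return current


def wait_and_notify(backend: BatchBackend, job_id: str, notify) -> None:
    """Poll until the job is done, then notify the submitter."""
    while backend.status(job_id) is not JobState.DONE:
        pass  # a real tracker would sleep between polls or react to events
    notify(f"{job_id} is complete")


messages: list[str] = []
backend = InMemoryBackend()
job = backend.submit("example-workflow", ["scope:example/image-0001.fits"])
wait_and_notify(backend, job, messages.append)
print(messages)
```
+
+Because the tracker depends only on the abstract interface, adding support for another batch system means writing a new adapter and leaving the tracking logic untouched.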
+
+\subsubsection{Service and Software Discovery}
+\label{sec:vision:capabilities:discovery}
+
+\pgls{ESAP} provides deep integration with the \gls{OSSR}, and other repositories of software and services if appropriate.
+This will make it possible for users to discover capabilities which are of relevance to them.
+In particular, \pgls{ESAP} helps users discover software workflows and compute and storage infrastructure that can be used to execute both \gls{IDA} and batch processing tasks (as described in \cref{sec:vision:capabilities:ida,sec:vision:capabilities:batch}).
+
+The user is provided with a range of help in identifying software and services which are of relevance to their needs.
+That is, based on metadata sourced from the \gls{OSSR}, \pgls{ESAP} helps the user make informed decisions based on criteria such as (but not limited to):
+
+\begin{itemize}
+
+\item{software that is capable of processing the types of data stored in their shopping basket (\cref{sec:vision:capabilities:orch});}
+\item{software that is appropriate for the type of analysis they wish to perform (addressing particular science goals, capable of being executed in batch or interactive mode, etc.);}
+\item{services that are capable of executing the workflow or software package which the user has selected;}
+\item{services that are local to the storage location of bulk data, or which can instantiate efficient bulk data movement.}
+
+\end{itemize}
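+
+The first two criteria in the list above amount to filtering repository metadata against the contents of the basket and the requested execution mode.
+A minimal sketch of such a filter follows; the record fields and package names are invented for illustration and do not reflect the actual \gls{OSSR} metadata schema.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareRecord:
    """Hypothetical metadata for one package in a software repository."""
    name: str
    input_types: frozenset  # data types the package can read
    modes: frozenset        # "batch", "interactive", or both


def suitable_software(catalogue, basket_types, mode):
    """Names of packages that read every basket data type and run in the given mode."""
    return [
        record.name
        for record in catalogue
        if basket_types <= record.input_types and mode in record.modes
    ]


catalogue = [
    SoftwareRecord("imager", frozenset({"visibility"}), frozenset({"batch"})),
    SoftwareRecord("source-finder", frozenset({"image"}),
                   frozenset({"batch", "interactive"})),
    SoftwareRecord("table-viewer", frozenset({"catalogue", "image"}),
                   frozenset({"interactive"})),
]

# A basket holding only images, with interactive analysis requested.
print(suitable_software(catalogue, {"image"}, "interactive"))
```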
+
+\subsubsection{Managed Database}
+\label{sec:vision:capabilities:db}
+
+A managed database service provides users with the capability to define and use their own relational databases directly within the \gls{ESAP} system.
+It is possible to directly load the results of queries against external archives into the user's database space, and then to submit complex \gls{SQL} queries to the database system.
+This provides the user with advanced data analysis capabilities --- for example, the ability to perform complex catalogue cross-matching --- without requiring that they set up and administer their own database system.
+Further, it opens the prospect of integrating \pgls{ESAP} with external \gls{SQL} federation services such as Trino\footnote{\url{https://trino.io}} or openLooKeng\footnote{\url{https://openlookeng.io}}.
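+
+To make the cross-matching use case concrete, the sketch below loads two small catalogues into a database and joins them on position.
+It uses an in-memory SQLite database from the Python standard library purely as a stand-in; the table names, column names, and coordinates are invented, and nothing here describes the database technology used by \gls{ESAP}.
+The match is a simple box in right ascension and declination, which ignores the $\cos(\mathrm{dec})$ factor and wrap-around at $360^\circ$.
+
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE optical (id TEXT, ra REAL, dec REAL);
    CREATE TABLE radio   (id TEXT, ra REAL, dec REAL);
    INSERT INTO optical VALUES ('opt-1', 150.0010, 2.2000),
                               ('opt-2', 150.5000, 2.7000);
    INSERT INTO radio   VALUES ('rad-1', 150.0012, 2.2001),
                               ('rad-2', 151.0000, 3.0000);
""")

tolerance = 0.001  # degrees; an arbitrary value chosen for this example

# Pair every optical source with any radio source inside the tolerance box.
matches = conn.execute(
    """
    SELECT o.id, r.id
    FROM optical AS o
    JOIN radio AS r
      ON ABS(o.ra - r.ra) < :tol AND ABS(o.dec - r.dec) < :tol
    """,
    {"tol": tolerance},
).fetchall()

print(matches)
conn.close()
```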
+
+\subsubsection{Provenance and \glsentryplural{PID}}
+\label{sec:vision:capabilities:provenance}
+
+Processing, controlled and mediated through \pgls{ESAP}, will result in \emph{advanced} data products: refined, augmented, or reduced versions of the input data.
+These data products, taken together with the workflows that have been used to produce them and resulting scientific publications, form the \emph{research objects} which are the fundamental outputs of the scientific community.
+In order to facilitate \gls{FAIR} access to data, \gls{ESAP} provides mechanisms for tracking the provenance of these research objects and will assist users in providing them with \glspl{PID} \autocite{2018-EC-FAIR}.
+
+\subsection{Extensibility and Supported Services}
+\label{sec:vision:extensibility}
+
+As described in \cref{sec:vision:model,sec:vision:capabilities:ui} above, the \pgls{ESAP} system is designed to be intrinsically extensible: the core API Gateway provides generic interfaces into which additional services can be integrated with minimal effort.
+
+However easy it is to integrate services with \pgls{ESAP}, it is clearly impossible for the \gls{ESCAPE} team to integrate \emph{all possible} services: there are simply too many domain-specific tools in use in the scientific community for this to be practical.
+Instead, the team has focused on:
+
+\begin{itemize}
+
+\item{providing a number of service integrations which demonstrate key capabilities and facilitate the expressed science goals and use cases of \gls{ESCAPE}-affiliated \glspl{ESFRI};}
+
+\item{providing documentation and examples to make it possible for new services to be quickly and easily integrated with \pgls{ESAP} without direct intervention from the \gls{ESCAPE} team.}
+
+\end{itemize}
+
+The detailed list of service integrations which will be supplied in the core \pgls{ESAP} delivery by \gls{ESCAPE} \gls{WP}5 is still under development.
+However, we expect to provide at least:
+
+\begin{itemize}
+
+\item{data query and data discovery based on major \gls{ESFRI} archives;}
+\item{integration with Rucio-based \autocite{rucio:2019} data lake systems;}
+\item{\gls{VO} query capabilities;}
+\item{\gls{SAMP} integration;}
+\item{integration with Jupyter-based \gls{IDA} facilities, probably based around BinderHub \autocite{binder:2018} and/or Rosetta\footnote{\url{https://rosetta.oats.inaf.it/main/}; \url{https://github.com/sarusso/Rosetta}};}
+\item{integration with at least one batch computing service, probably through DIRAC \autocite{dirac:2018}.}
+
+\end{itemize}
+
+The results of these activities are described in \cref{sec:current}.
--- a/contents/7-future.tex
+++ b/contents/7-future.tex
 \section{Future Prospects}
 \label{sec:future}
+
+\subsection{Federation}
+\label{sec:future:fed}