This page presents a high-level overview of the goals and scope of ESAP, ESCAPE's ESFRI Science Analysis Platform. ESAP is the primary product of ESCAPE Work Package 5.
- High Level Goals
- Conceptual Model
- Major Capabilities
High Level Goals
ESAP will provide a flexible system for analysing data available through the European Open Science Cloud (EOSC). It will unite the services provided in the other ESCAPE work packages by:
- Providing a flexible interface for querying and retrieving data from a variety of data repositories, including the WP2 Data Lake and WP4 Virtual Observatory;
- Enabling users to explore the software repositories, like the WP3 OSSR, to identify and select analysis tools and workflows which are appropriate to their needs;
- Helping users identify interactive data analysis and batch computing facilities which are accessible to them;
- Facilitating the staging of data, software, and workflows to compute facilities, the provision of access for end users, and the subsequent retrieval of results.
ESAP will be, by design, extensible: rather than attempting to anticipate every possible type of data repository, software, compute system, or other service provider, the platform will provide generic interfaces through which it can be extended to encompass new functionality.
Note that ESAP is a software product, not an operational service. That is, the deliverable documented here consists of code and documentation which, when deployed on appropriate infrastructure and supplied with an appropriate configuration, can be used to provide an operational service. Neither that infrastructure, nor staffing to operate and maintain an ESAP service, are provided by this work package. Rather, it is our expectation that:
- Future EOSC projects — perhaps including EOSC-Future — will provide resourcing and expertise to deploy and maintain one or more ESAP services within the context of EOSC;
- ESAP may be usefully deployed in other contexts — from providing services to just a few users within a small project, to supporting major pieces of infrastructure — and will be capable of operating effectively at a range of scales.
In other words, we do not think of the project as working simply towards a single, unified ESAP system, but rather as providing a “science platform toolbox” which can form the basis of multiple independent deployments.
Conceptual Model
ESAP, in itself, provides no compute or analysis capabilities (beyond a simple ability to view tabular data and preview images). Rather, it acts as a broker between users and the various query and analysis services which are available to them. These services might include, for example:
- Bulk data query systems, which can help the user locate and access data files (images, visibility data, etc) in archives, data lakes, or similar bulk storage systems.
- Tabular data query systems, which can help the user find relevant entries in source catalogues and similar relational systems.
- Interactive data analysis (IDA) systems, which provide the user compute and visualization tools in a convenient environment with access to relevant datasets (for example, a Jupyter notebook).
- Bulk data processing systems, which provide batch (non-interactive) processing of data at-scale in HPC or HTC environments.
A given instance of ESAP is configured with information about available services[1]. When a user connects, the ESAP instance should:
- Help the user select services which are relevant to them (for example, by clearly presenting the available services; by making clear what science cases those services support; by taking account of the user's access privileges; etc);
- Provide a consistent and convenient way for the user to access services (for example, by providing the user with a single way to enter a particular query, and then automatically translating that to the requirements of each individual service);
- Mediate data flow between services (for example, by enabling the user to locate data with an archive query, and schedule processing of that data on a bulk data processing system).
This relationship is shown schematically in the figure below. Note that the user communicates with a single ESAP instance, while that instance mediates interactions with a range of different services from a variety of infrastructure providers.
Major Capabilities
This section outlines high-level considerations on the design and scope of major sections of ESAP. Where possible, it should avoid specifying architectural or implementation choices.
ESAP is primarily a web application: the central hub (the “back-end”) runs on one or more servers, and users interact with it by making HTTP requests. The work package will provide a customizable front-end application (“ESAP-GUI”) which runs in the browser and communicates with the back-end.
In principle, it may be possible to support alternative GUIs which communicate with the same back-end. Providing such alternatives is out of scope for this work package.
Authentication and Authorization
Users may be required to log in to access ESAP itself, or to use some or all of the services mediated by a given ESAP instance.
This step is not required: if both the owner of an ESAP instance and the owner of any services being accessed make them available to the general public, then ESAP need not force the user to log in.
ESAP as delivered by this work package will provide for user authentication through the ESCAPE IAM service[2]. Where possible, ESAP should be designed to be flexible and adaptable to other systems. However, explicit support for other systems is outside the scope of this work package.
Data Management: the “Shopping Basket”
The fundamental workflow envisioned for ESAP is that the user will query one or more archives to identify data of interest, then dispatch that data to IDA or bulk processing systems for processing.
To support this model, ESAP will maintain a per-user list of active data items: the “shopping basket”. This basket is persistent: (a representation of) the data the user has selected is serialized as JSON, and the results are stored in a database[3].
Each user has access to a single basket, coupled to their current session; users do not have the capability to switch between multiple baskets[4].
When any action returning results is completed (e.g. an archive search, or results returned from IDA), the user should have the option of adding those results (or a subset of them) to their basket.
The user may see the contents of their basket, and — if they wish — remove items from it. Further, it should be possible to edit items within the basket. For example, given a source list, it should be possible for the user to:
- Manually select and remove specific sources from that list;
- Filter the list through an IDA or similar service, and replace the list in their shopping basket with the filtered list.
In normal use, the basket is expected to contain only structured or tabular data; it does not directly contain binary blobs (images, MeasurementSets, etc). Instead, the basket may contain paths, URLs, or similar identifiers so that binary data can be retrieved based on the contents of the basket[5].
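The basket model above can be made concrete with a small sketch. The entry schema here is invented purely for illustration (it is not the actual ESAP schema): the point is that an entry is plain structured data, serialized to JSON for storage, and any bulk binary data is carried only as a URL reference.

```python
import json

# Hypothetical basket entry: structured metadata plus a reference to the
# binary data, which itself stays in the archive's bulk storage.
basket_item = {
    "kind": "image-reference",
    "archive": "example-archive",                      # illustrative name
    "url": "https://archive.example.org/img/123.fits",
    "preview": "https://archive.example.org/img/123-thumb.png",
}

serialized = json.dumps(basket_item)   # what would be stored per user
restored = json.loads(serialized)      # ...and read back on later visits

assert restored == basket_item         # round-trips losslessly
```

Because the blob is referenced rather than embedded, the stored basket stays small regardless of the size of the underlying data products.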
In addition to data, it is also possible to store software and workflows in the shopping basket. We envision the user querying the OSSR to find tools which are of interest to their science case, then storing them in the basket until they are ready to dispatch them to compute or IDA infrastructure. We note that this raises questions — or opportunities — around how to provide the appropriate metadata, so that relevant tools are matched with the appropriate data and compatible infrastructure. We anticipate addressing these questions in conjunction with the other ESCAPE work packages.
Data Query
For the purposes of this discussion:
- A dataset is a queryable resource[6]; that is, a list of images, data files, sources, or some other collection which a user may wish to search.
- An archive is a facility, research infrastructure, or organization which provides access to one or more datasets. Those datasets may be logically connected (e.g. providing catalogues from different surveys undertaken by the same instrument), or entirely unrelated (e.g. a virtual observatory archive collating catalogues from many different instruments).
The user should be able to specify query parameters using a consistent interface which is appropriate to the systems being queried. For example, when searching for astronomical sources, they should be able to specify parameters like right ascension and declination, and ESAP will map or transform those to query parameters which may be specific to a particular archive.
ESAP should be able to query archives with broadly arbitrary schemata. That is, it should not make assumptions that all datasets have a celestial position, or similar.
Datasets should, however, be logically grouped. For example, a user wishing to search specifically by celestial position should rapidly be able to identify datasets against which such a search can be run. In particular, they should not be expected to manually inspect each dataset available through the ESAP instance to see if it can be used to answer the particular query they have in mind.
The user should be able to query multiple datasets simultaneously. For example, when performing an astronomical “cone search” (specified by a celestial position and search radius), the user should have the option of sending that search to all known datasets which support cone searches, or of selecting particular datasets that they wish to query from a list. When results are returned from multiple datasets, it is not necessary for ESAP to attempt to combine them (a general system for cross-matching arbitrary datasets is certainly out of scope!), but the results should be presented to the user in a consistent and digestible way.
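The query-translation and fan-out behaviour described above can be sketched as a set of per-archive adapters behind a single generic query. Everything here is illustrative (the class names, the project-specific parameter names): the only real convention used is that IVOA simple cone search takes RA/DEC/SR request parameters in degrees.

```python
from dataclasses import dataclass

@dataclass
class ConeSearchQuery:
    """One generic query, as entered once by the user."""
    ra_deg: float      # right ascension, degrees
    dec_deg: float     # declination, degrees
    radius_deg: float  # search radius, degrees

class VOConeSearchAdapter:
    # IVOA simple cone search uses RA/DEC/SR parameters, in degrees.
    def translate(self, q: ConeSearchQuery) -> dict:
        return {"RA": q.ra_deg, "DEC": q.dec_deg, "SR": q.radius_deg}

class ProjectDBAdapter:
    # A hypothetical project-specific archive with its own names/units.
    def translate(self, q: ConeSearchQuery) -> dict:
        return {"ra": q.ra_deg, "dec": q.dec_deg,
                "radius_arcsec": q.radius_deg * 3600.0}

def fan_out(query, adapters):
    """Translate one user query into each archive's native parameters."""
    return {name: a.translate(query) for name, a in adapters.items()}

params = fan_out(ConeSearchQuery(150.0, 2.2, 0.1),
                 {"vo": VOConeSearchAdapter(), "project": ProjectDBAdapter()})
```

The user enters the position and radius once; each adapter owns the mapping to its archive's parameter names and units.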
Binary blobs (images, etc) should not be returned from archives to the ESAP instance (and note that the shopping basket, above, has no method of storing them). Where appropriate, query results may contain URLs that point to preview images (thumbnails, postage-stamps, etc) which ESAP may display in-line with results.
Some archives may support, encourage, and/or require asynchronous queries. That is, the user submits a query to a queue, and returns to collect results at some future point. Support for asynchronous queries is not required in early versions of ESAP, but the system should be designed in such a way that asynchronous queries can be added in future.
SAMP
SAMP — the IVOA's Simple Application Messaging Protocol — provides a mechanism for compliant implementations to share data within a local system. ESAP should support data exchange with other applications on the user's system using SAMP. This exchange should be bi-directional (i.e., SAMP-enabled applications on the user's system can both read data from ESAP, and write data to it).
For example, a user should be able to query an archive and obtain a tabular source listing (as described above), displayed in the ESAP-GUI running in their browser. The user should then be able to transmit that source listing to a SAMP-enabled tool like TOPCAT. The user should be able to view and modify the listing in TOPCAT. The user should be able to transmit the modified listing back to ESAP. The modified listing should appear in the shopping basket, either replacing or in addition to the original listing.
Alternatively, the user should be able to load tabular data directly into TOPCAT from some other source, and then transmit that data to ESAP, where it will appear as a new entry in their shopping basket.
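To make the ESAP-to-TOPCAT direction concrete, the sketch below builds the SAMP notification a client might broadcast to ask listeners to load a table. `table.load.votable` is a standard SAMP mtype with `url`, `table-id`, and `name` parameters; the URL and identifiers here are invented for illustration, and a real client (e.g. one built on astropy's SAMP support) would first register with a running SAMP hub.

```python
def make_table_load_message(url: str, table_id: str, name: str) -> dict:
    """Build a SAMP notification asking listeners to load a VOTable.

    The table itself is passed by URL: the receiving tool (e.g. TOPCAT)
    fetches the serialized table from that location.
    """
    return {
        "samp.mtype": "table.load.votable",
        "samp.params": {
            "url": url,            # where the VOTable can be fetched
            "table-id": table_id,  # stable id, used for later updates
            "name": name,          # human-readable label
        },
    }

msg = make_table_load_message(
    "https://esap.example.org/basket/sources.vot",  # hypothetical endpoint
    "esap-basket-1",
    "ESAP source listing",
)
```

The reverse direction (TOPCAT to ESAP) works the same way: ESAP registers as a SAMP client, receives a `table.load.votable` notification, and fetches the table from the URL in the message.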
Interactive Data Analysis
The basic workflow for an IDA system is expected to be:
- ESAP assists the user in identifying an IDA system which meets their needs.
- (If applicable) an appropriate environment is made available on the IDA system[7].
- The contents of the user's shopping basket (or, perhaps, a subset thereof) is made available on the IDA system.
- The user is linked to the configured environment, and carries out their analysis.
- When the analysis is complete, results should — at the user's discretion — be made available in the user's basket.
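The workflow above can be sketched schematically. Every name in this snippet is a stand-in for a real integration (the service entry, the `accepts` predicate, the session structure are all invented for illustration); the structural points it shows are that only references travel to the IDA system, and that only items the service can actually handle are staged.

```python
def dispatch_to_ida(basket_items, ida_service):
    """Stage suitable basket items to an IDA service; return a session."""
    staged = [item for item in basket_items if ida_service["accepts"](item)]
    return {
        "service": ida_service["name"],
        "environment": ida_service.get("default_environment"),
        "inputs": staged,   # references only; no binary payloads
        "results": [],      # filled in if the user publishes results back
    }

# A hypothetical configured IDA service entry:
jupyter = {
    "name": "example-jupyter",
    "default_environment": "astro-notebook",
    "accepts": lambda item: item["kind"] in {"table", "image-reference"},
}

session = dispatch_to_ida(
    [{"kind": "table", "url": "https://example.org/sources.vot"},
     {"kind": "unknown-blob", "url": "https://example.org/x"}],
    jupyter,
)
```

Here the unsupported item is silently filtered; a real implementation would instead surface that to the user, per the capability-matching discussion below.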
Several aspects of this workflow require further discussion, and are expanded upon below.
In the simplest case, the IDA system provides only a single capability: it provides access to a specialist analysis tool (for example, a particular type of image display tool, or a cross-matching service).
Question: are single-purpose services like this included in the OSSR catalogue? If so, do we perform service discovery there, or do we only use services explicitly listed in the ESAP configuration?
In this case, the user needs to be able to locate the service based on its capability and (perhaps) by its proximity to the data. Where applicable, the user should be able to select which items from their basket to send to the tool (perhaps they only want to visualize one image, but have many in their basket). ESAP should intelligently assist the user in sending only appropriate data (i.e., it should not be possible for the user to send tabular data to a tool which is only capable of image display).
Note that the basket does not store binary blobs (like images); instead, only a URL or similar locator is available for transmission to the IDA service. It follows that the tool itself — or some intermediate middleware — must be able to access that location and transform it into a form that can be usefully processed.
Software and Environment Selection
More complex is the case where the service is capable of executing customizable payloads (for example, a compute system which is capable of executing a variety of different software depending on user needs). The expectation is that the user is able to select software from a catalogue provided by ESCAPE WP3 (OSSR), which can be stored in the ESAP shopping basket (as described above).
Question: this is a good story for the “single EOSC-hosted ESAP”, but what about small, project-level deployments, which we've explicitly said above are supported? Do they all connect to one central OSSR? If so, how do they make project-specific software available with quick turnaround? Are there smaller, project-level, OSSR-compliant repositories? Something else?
The user is guided in choosing software that meets their needs based on catalogue metadata, and ESAP will then suggest services which are capable of executing that software. For some services, provision will also need to be made for customizing the environment in which the software runs[8].
Tabular Data Exchange
An additional complexity is the form in which data is exchanged between ESAP and the service. It is explicit above that bulk binary data is only transferred by reference. There is still an open question about transferring tabular data: one could imagine this being done either by reference (for example, by sending the query that was used to generate the data) or by value (transferring the data itself). Considerations in favour of transferring by reference:
- Simplicity of implementation.
- Reduces the amount of data to transport in the case that the table is of non-trivial length.
Considerations in favour of transferring by value:
- Does not require that the recipient service be capable of replaying the query. Query replay may be straightforward for some archives (e.g. VO services), but may require bespoke code on the recipient service for specialist or infrastructure-specific archives.
- Enables processing of data which was not generated by a query. This would include data uploaded by SAMP, for example.
- Enables workflows where data is round-tripped through multiple services (consider an IDA session being used to filter data for later batch processing).
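The two forms discussed above can be modelled as a tagged payload, which makes the trade-off mechanical: a recipient that cannot replay queries must insist on by-value payloads. The field names below are illustrative only.

```python
def by_reference(service_url: str, query: str) -> dict:
    """Exchange by reference: send what is needed to regenerate the table."""
    return {"form": "reference", "service": service_url, "query": query}

def by_value(rows: list) -> dict:
    """Exchange by value: send the table contents themselves."""
    return {"form": "value", "rows": rows}

def can_accept(payload: dict, supports_replay: bool) -> bool:
    """A recipient without query-replay support needs by-value data."""
    return payload["form"] == "value" or supports_replay

ref = by_reference("https://vo.example.org/tap", "SELECT ra, dec FROM cat")
val = by_value([{"ra": 150.0, "dec": 2.2}])
```

A hybrid is also conceivable: send the reference plus a by-value copy, letting capable recipients re-run the query for freshness while others use the embedded rows.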
It seems clear that returning tabular data to ESAP from the IDA service will have to be done by value, unless that service is capable of independently creating and publishing queryable catalogues. Note that this implies some level of cooperation from the service itself: a service which is not ESAP aware obviously cannot return data to ESAP. See also the discussion on Extensibility.
Binary Data Exchange
As above, transfer of binary data from ESAP to the analysis service is done by reference; this is relatively straightforward.
However, the user may wish to store further binary artefacts generated during their analysis session. Broadly, there are two ways in which this could be achieved:
- The analysis service makes these artefacts available for direct download to the user's local system, or
- The analysis service publishes those artefacts to bulk storage accessible both to the service and to ESAP, and then returns the location to ESAP.
The latter is obviously substantially more complex to implement, as it requires pre-negotiation of shared storage between ESAP and the analysis service. As with Tabular Data Exchange, this implies that the analysis service must be ESAP-aware.
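Under the second option, what flows back to ESAP is again only a reference: the sketch below shows the shape of the record an ESAP-aware analysis service might hand back after publishing an artefact to pre-negotiated shared storage. The store URL and record fields are invented for illustration.

```python
def publish_artefact(shared_store_url: str, artefact_name: str) -> dict:
    """Stand-in for real publication: upload to shared storage, then
    return the location record that ESAP adds to the user's basket."""
    location = f"{shared_store_url}/{artefact_name}"
    # (a real implementation would perform the upload here)
    return {"kind": "binary-reference", "url": location}

basket_entry = publish_artefact("https://storage.example.org/esap",
                                "cleaned-image.fits")
```

This keeps the basket's no-binary-blobs rule intact: the artefact lives in bulk storage, and the basket holds only its locator.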
Bulk Data Processing
Broadly, considerations around bulk data processing are the same as those around IDA: again, ESAP must help the user select software and its environment, locate a service that is close to the data and capable of executing the software, and facilitate the exchange of data with that service. The additional complexity is that this processing is carried out asynchronously: the user should be able to log out or move on to other tasks while their data processing is being carried out, and then have ESAP notify them when processing is complete. Note this speaks to requirements around persistence of the shopping basket.
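The asynchronous pattern above can be sketched as follows. The job states, the inbox, and all names are illustrative; the structural point is that the job outlives the user's session, its state is checked rather than awaited, and the user is notified only once processing completes.

```python
import enum

class JobState(enum.Enum):
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"

class BulkJob:
    def __init__(self, inputs):
        self.inputs = inputs            # references from the shopping basket
        self.state = JobState.QUEUED
        self.results = []

    def advance(self, new_state, results=()):
        """Called as the processing service reports progress."""
        self.state = new_state
        self.results.extend(results)

def notify_if_done(job, user_inbox):
    """ESAP-side check: tell the user once their processing completes."""
    if job.state is JobState.DONE:
        user_inbox.append({"job": job, "results": job.results})

job = BulkJob(["https://example.org/data/obs1.ms"])
inbox = []
notify_if_done(job, inbox)              # nothing yet: job still queued
job.advance(JobState.DONE, ["https://example.org/out/obs1.fits"])
notify_if_done(job, inbox)              # now the user is notified
```

Because results (like inputs) are references into bulk storage, they can be dropped straight into the persistent shopping basket when the notification fires.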
In general, it seems impossible for ESAP to provide full provenance information given the nature of its workflow. In particular, while ESAP can track the queries used to select an input dataset, it cannot track whatever IDA is performed by the user (which may include filtering or otherwise modifying the input data). Suggestion: the shopping basket should track the source of each entry (that is, what query was run to generate it, etc), and record what other services the data has been round-tripped through.
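The suggested per-entry tracking might look like the following sketch (field names invented for illustration): each basket entry carries the query that produced it plus an append-only history of the services it has been round-tripped through.

```python
def new_entry(data_ref: str, source_query: str) -> dict:
    """Create a basket entry that remembers where its data came from."""
    return {"data": data_ref, "query": source_query, "history": []}

def record_roundtrip(entry: dict, service_name: str) -> dict:
    """Append a service to the entry's processing history."""
    entry["history"].append(service_name)
    return entry

entry = new_entry("https://example.org/sources.vot",
                  "cone search: ra=150.0 dec=2.2 sr=0.1")
record_roundtrip(entry, "example-ida-filter")   # hypothetical IDA service
```

This records *that* an entry passed through a given service, not *what* was done there; full provenance of interactive analysis remains out of reach, as noted above.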
Extensibility
The preceding discussion of capabilities is intentionally generic: it refers to archives, IDA, or batch computing, without specifying particular implementations. This is because ESAP is intended to be extensible: rather than attempting to anticipate every variation on the above services that may be available in some infrastructure — or, indeed, new service types that aren't described by the above — ESAP will provide generic interfaces which are adaptable to the details of a particular service implementation. For example, the same basic archive query capability could be adapted to communicate with IVOA services, other major data service providers (e.g. Zooniverse), and with project-specific databases associated with major infrastructure. Similarly, IDA capabilities could be used to address Jupyter notebooks, bespoke visualization tooling, or other analysis environments as required.
ESAP should be designed to make extensibility as self-contained as possible. For example, in the ideal case, it should be possible to package and install adapters to specific services independently as “plug-ins”, and then simply switch them on in ESAP configuration. In practice, some editing of the ESAP code may be required, but this should be minimized where possible.
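A plug-in mechanism of this kind might be sketched as follows; the registry, decorator, and adapter names are all illustrative, not the actual ESAP API. Adapters register themselves under a service type, and a deployment enables them purely through configuration.

```python
ADAPTERS = {}  # service type -> adapter class

def register(service_type: str):
    """Decorator: make an adapter class discoverable by service type."""
    def wrap(cls):
        ADAPTERS[service_type] = cls
        return cls
    return wrap

@register("vo-cone-search")
class VOConeSearchAdapter:
    """Adapter for one concrete kind of archive service."""
    def query(self, params: dict) -> str:
        return f"querying VO cone-search with {sorted(params)}"

# Deployment configuration just names the adapters to switch on:
enabled = ["vo-cone-search"]
active = {name: ADAPTERS[name]() for name in enabled}
```

Packaged as separate installable modules, such adapters could be added to a deployment without editing the ESAP code itself, which is the ideal case described above.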
Note that in some cases, adapters may be needed at the service, as well as (or instead of) extensions to ESAP itself. Wherever possible, though, these should be avoided: the aim of ESAP is to integrate with existing services without the necessity of modifying them. Note that this may be impossible if the service has to return data to ESAP; see above.
Question: to what extent should service types also be extensible? This seems less important, but I guess we might find a use case. I'm inclined to rule it out for now, though.
ESAP as provided by ESCAPE WP5 will be accompanied by adapters which integrate with a range of standard/widely used services. These should include at least the following:
- Virtual Observatory services. Note: we need to define this a bit more carefully. Does this mean cone search? Sending arbitrary ADQL? Interrogating the VO registry? The Registry of Registries?
- Rucio. Note: somebody more familiar with Rucio than I am will need to help flesh out what this actually means.
- Jupyter. Note: we need to define this a bit more carefully. Jupyter notebooks? JupyterLab? Is it mediated through Binder? JupyterHub? etc
- Bulk computing:
- At least one bulk computing service will be supported, but this remains to be selected. Note: we usually mention DIRAC in this context, but no decision has actually been taken. This likely depends on the availability of expertise.
This list may be extended or amended over the course of ESAP development, driven largely by the emergent needs of WP5 members and other ESCAPE partners.
ESAP configuration is (currently) held in an SQLite database. ESAP does not support a service discovery mechanism: the expectation is that available services are configured before ESAP is launched, and changes to that configuration may require the ESAP system to be reinitialized. Note: I guess if you are willing to go in and change the SQLite database or use Django admin, you can probably change this on the fly if you are determined to do so...
Future extensions to ESAP may focus on service discovery or better support for dynamic reconfiguration, but these are regarded as beyond the scope of our current work.
This configuration is instance-specific: for example, a central EOSC installation of ESAP might provide access to a wide range of services, spanning the entire EOSC, while an institutional or project-level system may only be configured with information about local resources.
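As a concrete (and purely illustrative) sketch of this arrangement, service configuration can be thought of as rows in an SQLite table read once at start-up; the schema below is invented for the example and is not the actual ESAP schema.

```python
import sqlite3

# An in-memory database stands in for the instance's configuration file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE services (name TEXT, type TEXT, url TEXT)")
conn.execute(
    "INSERT INTO services VALUES (?, ?, ?)",
    ("example-archive", "vo-cone-search", "https://vo.example.org/scs"),
)
conn.commit()

# At launch, the instance loads whichever services it has been given:
services = conn.execute("SELECT name, type, url FROM services").fetchall()
```

Changing the set of services then means changing these rows and reinitializing, consistent with the static-configuration model described above.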
[3] We note this raises a number of questions about the lifecycle of the data. How long is it persisted for? Are old shopping baskets expired? Is some time-to-live assumed for query results? How is storage allocated? However, for the purposes of this vision document, we regard these as implementation details.
[4] One could imagine that some science cases would be well served by enabling the user to switch between multiple shopping baskets. However, we do not envision supporting that in early versions of ESAP.
[5] In practice, the current implementation of the shopping basket makes it possible to store binary blobs by serializing them to JSON. We can envision circumstances in which this might be convenient, but caution against its use in general.
[6] The current ESAP implementation also introduces the concept of a catalogue, which is closely related to the dataset; for the purposes of this document, which describes the user's view of the system, we do not regard the distinction as useful.
[7] Some systems may provide for user-selectable environments; others may not. Both should be supported.
[8] For example, a Jupyter notebook service could let the user choose an environment with certain libraries installed.