## Zooniverse - Integrating Machine Learning
This directory contains resources for the _Integrating Machine Learning_ tutorial. This tutorial forms part of a series of advanced guides for managing Zooniverse projects through Python. While they can be completed independently, you may want to work through them in the following order (these are also all available as Interactive Analysis workflows in the ESAP GUI):
1. [Advanced Project Building](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-project-building)
2. [Advanced Aggregation with Caesar](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar)
3. [Integrating Machine Learning (current)](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-integrating-machine-learning)
Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects. _Caesar_ also provides a powerful way of collecting and analysing volunteer classifications (aggregation). Machine learning models can be used with _Caesar_ to make classification workflows more efficient, such as for implementing advanced subject retirement rules, filtering subjects prior to being shown to volunteers, or setting up an active learning cycle so that volunteer classifications help train the machine learning model.
For guides on creating a Zooniverse project through the web interface or by using Python, take a look at the _Advanced Project Building_ tutorial above and the links therein. For an introduction to _Caesar_, take a look at the _Advanced Aggregation with Caesar_ tutorial above and the links therein. Note that this tutorial does not cover the basics of machine learning, for which there are various guides online and in print.
The advanced tutorial presented here includes demonstrations of using Python for:
* Advanced retirement rules using machine learning
* Option 1: pre-classifying with machine learning in preparation for volunteer classifications
* Option 2: "on the fly" retirement decisions made after both machine learning and volunteer classifications
* Using machine learning to filter out uninteresting subjects prior to volunteer classification
* Setting up active learning: volunteer classifications train the machine learning model, which in turn handles the "boring" subjects and leaves the more challenging/interesting subjects for volunteers.
You can find the code for this tutorial in the `notebooks` folder and the associated data in the `data` folder.
As with the _Advanced Project Building_ tutorial, this tutorial makes use of example material (subjects, metadata, classifications) from the [_SuperWASP Variable Stars_](https://www.zooniverse.org/projects/ajnorton/superwasp-variable-stars) Zooniverse project, which involves classifying light curves (how brightness varies over time) of stars.
A recorded walkthrough of this advanced tutorial is available [here](https://youtu.be/o9SzgsZvOCg?t=8218) as part of the [First ESCAPE Citizen Science Workshop](https://indico.in2p3.fr/event/21939/).
The ESAP Archives (accessible via the ESAP GUI) include data retrieval from the Zooniverse Classification Database using the ESAP Shopping Basket. For a tutorial on loading Zooniverse data from a saved shopping basket into a notebook and performing simple aggregation of the classification results, see [here](https://git.astron.nl/astron-sdc/escape-wp5/workflows/muon-hunters-example/-/tree/master) (also available as an Interactive Analysis workflow).
### Setup
#### Option 1: ESAP workflow as a remote notebook instance
You may need to install the `panoptes_client`, `pandas`, and `boto3` packages. `boto3` is a package that implements the Amazon Web Service (AWS) Python SDK (see the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) for more detail), and is used in the second half of the tutorial.
```
!python -m pip install panoptes_client
!python -m pip install pandas
!python -m pip install boto3
```
#### Option 2: Local computer
1. Install Python 3: the easiest way to do this is to download the Anaconda build from https://www.anaconda.com/download/. This will pre-install many of the packages needed for the aggregation code.
2. Open a terminal and run: `pip install panoptes_client` and `pip install boto3`.
3. Download the [Integrating Machine Learning](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-integrating-machine-learning/) tutorial into a suitable directory.
#### Option 3: Google Colab
Google Colab is a service that runs Python code in the cloud.
1. Sign into Google Drive.
2. Make a copy of the [Integrating Machine Learning](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-integrating-machine-learning/) tutorial in your own Google Drive.
3. Right click the `MachineLearning.ipynb` file > Open with > Google Colaboratory.
1. If this is not an option, click "Connect more apps", search for "Google Colaboratory", enable it, and refresh the page.
4. Run the following in the notebook:
1. `!pip install panoptes_client` and `!pip install boto3` to install the required packages,
2. `from google.colab import drive; drive.mount('/content/drive')` to mount Google Drive, and
3. `import os; os.chdir('/content/drive/MyDrive/zooniverse-integrating-machine-learning/')` to change the current working directory to the example folder (adjust if you have renamed the example folder).
### Other Useful Resources
Here is a list of additional resources that you may find useful when building your own Zooniverse citizen science project.
* [_Zooniverse_ website](http://zooniverse.org) - Interested in Citizen Science? Create a **free** _Zooniverse_ account, browse other projects for inspiration, contribute yourself as a citizen scientist, and build your own project.
* [Zooniverse project builder help pages](https://help.zooniverse.org) - A great resource with practical guidance, tips and advice for building great Citizen Science projects. See the ["Building a project using the project builder"](https://youtu.be/zJJjz5OEUAw?t=7633) recorded tutorial for more information.
* [_Caesar_ web interface](https://caesar.zooniverse.org) - An online interface for the _Caesar_ advanced retirement and aggregation engine. See the ["Introducing Caesar"](https://youtu.be/zJJjz5OEUAw?t=10830) recorded tutorial for tips and advice on how to use Caesar to supercharge your _Zooniverse_ project.
* [The `panoptes_client` documentation](https://panoptes-python-client.readthedocs.io/en/v1.1/) - A comprehensive reference for the Panoptes Python Client.
* [The `panoptes_aggregation` documentation](https://aggregation-caesar.zooniverse.org/docs) - A comprehensive reference for the Panoptes Aggregation tool.
* [The `aggregation-for-caesar` GitHub](https://github.com/zooniverse/aggregation-for-caesar) - A collection of external reducers for _Caesar_ and offline use.
* [Amazon Web Services (AWS)](https://aws.amazon.com) - A cloud computation provider that can be used to create your own SQS queue like the one detailed in this tutorial. You can register for a free account and **some** services are available free of charge **up to specific usage limits**. Be careful you don't exceed these limits or you may end up with a bill to pay!
* [AWS Simple Queue Service (SQS)](https://aws.amazon.com/sqs/) - Information about the SQS message queueing system that can be used together with _Caesar_ to implement computationally intensive extraction and reduction tasks and apply them to your project's classifications.
%% Cell type:markdown id: tags:
# Zooniverse - Integrating Machine Learning
%% Cell type:markdown id: tags:
Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects. _Caesar_ also provides a powerful way of collecting and analysing volunteer classifications (aggregation). Machine learning models can be used with _Caesar_ to make classification workflows more efficient: in this notebook we will examine how machine learning can be used to implement advanced subject retirement rules and filter subjects prior to being shown to volunteers, as well as setting up an active learning cycle so that volunteer classifications help train the machine learning model.
## Table of Contents
1. [Setup](#Setup)
2. [Advanced Retirement Using Machine Learning](#Advanced-Retirement-Using-Machine-Learning)
1. [Option 1: Make machine learning predictions for all subjects "up front"](#Option-1:-Make-machine-learning-predictions-for-all-subjects-"up-front")
2. [Option 2: Make machine learning predictions for each subject "on the fly"](#Option-2:-Make-machine-learning-predictions-for-each-subject-"on-the-fly")
3. [Filtering Subjects Using Machine Learning](#Filtering-Subjects-Using-Machine-Learning)
4. [Active Learning](#Active-Learning)
%% Cell type:markdown id: tags:
## Setup
You may need to install the `panoptes_client`, `pandas`, and `boto3` packages. If you do, then run the code in the next cell. Otherwise, skip it. `boto3` is a package that implements the Amazon Web Service (AWS) Python SDK (see the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) for more detail), and is used in the second half of the tutorial.
%% Cell type:code id: tags:
``` python
!python -m pip install panoptes_client
!python -m pip install pandas
!python -m pip install boto3
```
%% Cell type:code id: tags:
``` python
from panoptes_client import Panoptes, Project, Subject, SubjectSet
import pandas as pd
import os
import getpass
import glob
```
%% Cell type:markdown id: tags:
## Advanced Retirement Using Machine Learning
Trained machine learning models can take up a lot of disk space, and making predictions with them can be computationally intensive. Using machine learning to predict classifications for specific subjects is not the sort of analysis that _Caesar_ can run by itself (even if you did add the code to the [`aggregation_for_caesar` repository](https://github.com/zooniverse/aggregation-for-caesar)).
However, that doesn't mean that we can't use machine learning models and _Caesar_ **together** to make our classification workflow more efficient. For example, you might want to use a machine learning model to predict the class of every subject in a subject set and then use less stringent retirement criteria if the machine and human classifiers agree.
As a concrete example, we might say that if the first three volunteers all agree with the machine prediction then we can retire the subject, but we would otherwise demand that seven volunteers view and classify the subject.
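For clarity, here is that criterion written as a plain Python function (a hypothetical sketch; in practice this logic is expressed as retirement rules in the _Caesar_ UI rather than as code):
```python
def shouldRetire(volunteerAnswers, mlPrediction):
    """Hypothetical sketch of the retirement rule described above."""
    # Retire early if the first three volunteers all agree with the machine
    if len(volunteerAnswers) >= 3 and all(a == mlPrediction for a in volunteerAnswers[:3]):
        return True
    # Otherwise demand seven classifications before retiring
    return len(volunteerAnswers) >= 7
```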
There are a couple of easy ways to couple a machine learning model with your Zooniverse project and we'll quickly demo them here.
1. The first option we have is to preclassify all of our subjects with machine learning and add the machine learning score to each as a **hidden subject metadatum** that _Caesar_ can recognise and use to execute specific retirement rules.
2. The second option is to make machine learning predictions for subjects **after** they have been classified and make an "on the fly" retirement decision. The second approach is much more complicated, but it does mean that you can benefit from the latest version of your machine learning model and you can compute very complicated retirement criteria that are difficult or impossible to express as _Caesar_ rules.
%% Cell type:markdown id: tags:
### Option 1: Make machine learning predictions for all subjects "up front"
If you have a trained machine learning model and a pool of data, then it's straightforward to predict a machine score for all of your subjects and add a new hidden metadatum to each one.
For this tutorial, we have already trained a simple convolutional neural network (CNN) to predict the class (i.e. category) of a variable star's light curve by looking at the same images the volunteers see. We can hence load a table of its predictions for our subjects, where each row corresponds to a different subject and each column gives the predicted probability of that subject belonging to a given class.
%% Cell type:code id: tags:
``` python
dataDirectory = ?? # e.g. '../data/'
predictions = pd.read_csv(os.path.join(dataDirectory, "predictions.csv")).set_index("image")
predictions
```
%% Cell type:markdown id: tags:
Now let's create a new subject set (as we did in the _Advanced Project Building_ tutorial) and add the machine learning prediction to each subject as a hidden metadatum. We can call that metadatum whatever we like, so how about `#ml_prediction`? We'll simplify things by defining the machine prediction for a subject to be the class with the highest score, so for the zeroth row, our machine prediction is "Pulsator". Rather than storing the prediction as a word though, we'll map it to its corresponding position in the list of options for task 1 of our project. Recall that the order is:
```python
["Pulsator", "EA/EB type", "EW type", "Rotator", "Unkown", "Junk"]
```
%% Cell type:code id: tags:
``` python
questionOrder = ["Pulsator", "EA/EB type", "EW type", "Rotator", "Unknown", "Junk"]
```
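%% Cell type:markdown id: tags:
As a quick sanity check of this mapping (a hypothetical check, assuming the predictions table loaded above), we can look up the highest-scoring class for the zeroth row and its corresponding option index:
%% Cell type:code id: tags:
``` python
topClass = predictions.iloc[0].idxmax()        # e.g. "Pulsator"
topClassIndex = questionOrder.index(topClass)  # e.g. 0
print(topClass, topClassIndex)
```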
%% Cell type:markdown id: tags:
Let's start with all the parts we've seen before: authenticating with the Panoptes API and finding our project.
%% Cell type:code id: tags:
``` python
username = input("Enter Panoptes Username:")
password = getpass.getpass("Enter Panoptes Password:")
```
%% Cell type:code id: tags:
``` python
panoptesClient = Panoptes.connect(username=username, password=password)
```
%% Cell type:code id: tags:
``` python
projectId = ?? # You can look this up in the Project Builder
project = Project.find(projectId)
```
%% Cell type:code id: tags:
``` python
print("Name:", project.display_name)
print("Description:", project.description)
print("Introduction:", project.introduction)
print("Number of Subjects:", project.subjects_count)
print("Number of Classifications:", project.classifications_count)
```
%% Cell type:markdown id: tags:
Here's where we create our new subject set, give it a name, and link it to our project.
%% Cell type:code id: tags:
``` python
subjectSet = SubjectSet() # Define new subject set
subjectSet.display_name = "Machine Learning Demo Set" # Give it a name
subjectSet.links.project = project # Set it to link to our project
response = subjectSet.save() # Save this (currently empty) subject set
```
%% Cell type:markdown id: tags:
Now we'll acquire a file list of all the subjects to be added to the subject set.
%% Cell type:code id: tags:
``` python
imageFiles = glob.glob(os.path.join(dataDirectory, "demoset_smallSample_1/*jpg"))
imageFiles
```
%% Cell type:markdown id: tags:
We'll use those files to create a list of new `Subjects` and add some metadata. First let's process the machine predictions like we discussed above.
We'll add those files to the subject set along with some metadata. First, let's process the machine predictions like we discussed above, identifying the class with the highest score for each subject.
%% Cell type:code id: tags:
``` python
imagePredictions = []
for imageFile in imageFiles:
    imageRow = int(os.path.basename(imageFile)[:-4])
    imagePredictions.append(
        (
            imageRow,  # Subject name/number
            predictions.iloc[imageRow].idxmax(),  # Class of variable star with the highest prediction score
            questionOrder.index(predictions.iloc[imageRow].idxmax()),  # Corresponding index
        )
    )
imagePredictions
```
%% Cell type:code id: tags:
``` python
newSubjects = []
for imageFile, imagePrediction in zip(imageFiles, imagePredictions):
    newSubject = Subject()
    newSubject.links.project = project
    newSubject.add_location(imageFile)
    newSubject.metadata = {
        "Origin": "Python ML demo",
        "image": os.path.basename(imageFile),
        "#ml_prediction": imagePrediction[2],
    }
    newSubject.save()
    newSubjects.append(newSubject)
```
%% Cell type:markdown id: tags:
And finally, we assign our newly uploaded `Subject`s to the `SubjectSet` we already created.
%% Cell type:code id: tags:
``` python
subjectSet.add(newSubjects)
```
%% Cell type:markdown id: tags:
Now we just have to **set up appropriate retirement rules in the _Caesar_ UI**, shown [here](https://youtu.be/o9SzgsZvOCg?t=9003) (02:30:03-02:36:03) as part of the recorded demonstration of this tutorial. Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects and aggregating volunteer classifications. These involve the use of "extractors" and "reducers": extractors extract the data from each classification into a more useful data format, while reducers "reduce" (aggregate) the extracted data for each subject together.
The input to a reducer is the output of one or more extractors.
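For illustration, _Caesar_ rules are written as nested arrays of operations. A hypothetical rule implementing the scheme above might look something like the following, where `agree_count` is an assumed reducer key counting classifications that match the hidden `#ml_prediction` metadatum:
```python
# Hypothetical Caesar retirement rule (illustrative only): retire once at
# least 3 classifications agree with the machine prediction
["gte", ["lookup", "agree_count", 0], ["const", 3]]
```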
%% Cell type:markdown id: tags:
### Option 2: Make machine learning predictions for each subject "on the fly"
For this option, we want to make subject retirement decisions following machine learning predictions made _after_ volunteer classification. To do this, we'll need a way of outputting extract data for processing with machine learning, and a way of uploading the processed results back to _Caesar_ for the retirement decisions.
#### External reducers
While _Caesar_ contains multiple inbuilt reducers to choose from, _Caesar_ also defines the concept of an _External_ reducer. When you set up an external reducer in the _Caesar_ UI, the system will send the results from any associated extractors to an HTTP endpoint that you specify. You can then receive those extract data through the HTTP endpoint and process them appropriately.
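As a minimal sketch (assuming the [Flask](https://flask.palletsprojects.com) web framework; the route name and processing step here are hypothetical), an external reducer endpoint might look something like this:
```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/reducer", methods=["POST"])
def receiveExtracts():
    extracts = request.get_json()  # extract data POSTed by Caesar
    # ... process the extracts, e.g. run a machine learning model ...
    return "", 204  # acknowledge receipt with no content
```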
#### Placeholder reducers
Once processing is complete, you need a way to send the results back to _Caesar_. The system defines the concept of a _Placeholder_ reducer for this purpose. Your external reducer should send its answer back to the placeholder reducer, allowing _Caesar_ to process any associated rules or effects.
#### SQS reducers
The one downside of a basic external reducer is that you need to run some sort of webserver that can listen out for messages from _Caesar_ and process them when they arrive. This can be a barrier to entry for some research teams, so _Caesar_ defines one more reducer category, the _SQS Reducer_.
_SQS_ stands for "Simple Queue Service". It's a facility provided by Amazon Web Services that maintains a queue of messages in the cloud. _Caesar_ sends messages to this queue and your code can consume them. All the web hosting is handled by Amazon, so you don't have to worry about running a server yourself.
We'll **set up an SQS reducer in the _Caesar_ UI** to send message extracts to us, shown [here](https://youtu.be/o9SzgsZvOCg?t=9975) (02:46:15-02:56:36) as part of the recorded demonstration of this tutorial. Then we can use a specially written client to grab those messages from the queue and process them.
%% Cell type:code id: tags:
``` python
from ML_Demo import SQSClient
```
%% Cell type:markdown id: tags:
**IMPORTANT**: Note that for you to use the SQS reducer facility you will need to set up an Amazon AWS account, set up a queue and get a set of access credentials. This process is free and reasonably straightforward, but beyond the scope of this tutorial. Once you've done that, you can export your credentials to your computer's environment like this.
%% Cell type:code id: tags:
``` python
os.environ["AWS_ACCESS_KEY_ID"] = ?? # You will need an AWS account to get your ID
os.environ["AWS_SECRET_ACCESS_KEY"] = ?? # You will need an AWS account to get your Secret
```
%% Cell type:markdown id: tags:
Every SQS queue has a unique URL that you can use to connect to it and retrieve messages. You must pass this URL to the `SQSClient` constructor when you instantiate the SQS client.
%% Cell type:code id: tags:
``` python
queueUrl = ?? # You can set this up with an AWS account e.g. "https://sqs.us-east-1.amazonaws.com/123456123456/MySQSQueue"
sqs = SQSClient(queueUrl)
```
%% Cell type:markdown id: tags:
The client provides a `getMessages` method that will retrieve a batch of messages from the queue and (optionally) delete them from the queue. The method returns three objects. The first is the set of messages that were received, with any duplicates removed, the second is the raw set of messages before deduplication and the final object is a list of unique message IDs for the messages that were received.
%% Cell type:code id: tags:
``` python
messages, receivedMessages, receivedMessageIds = sqs.getMessages(delete=False)
```
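%% Cell type:markdown id: tags:
`SQSClient` is a helper provided by this tutorial's `ML_Demo` module. For reference, a rough sketch of what a `getMessages`-style helper might do internally with `boto3` is shown below (illustrative only; the real client may differ):
%% Cell type:code id: tags:
``` python
import json

import boto3


def getMessagesSketch(queueUrl, delete=False):
    """Hypothetical sketch of polling an SQS queue for Caesar extracts."""
    # Credentials and region are taken from the environment
    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queueUrl,
        MaxNumberOfMessages=10,  # SQS maximum batch size per call
        WaitTimeSeconds=10,      # use long polling to reduce empty responses
    )
    rawMessages = response.get("Messages", [])
    received = [json.loads(m["Body"]) for m in rawMessages]
    receivedIds = [m["MessageId"] for m in rawMessages]
    if delete:
        # Remove processed messages so they are not delivered again
        for m in rawMessages:
            sqs.delete_message(QueueUrl=queueUrl, ReceiptHandle=m["ReceiptHandle"])
    # Standard SQS queues deliver "at least once", so deduplicate on message ID
    unique = {i: r for i, r in zip(receivedIds, received)}
    return list(unique.values()), received, list(unique.keys())
```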
%% Cell type:markdown id: tags:
We can examine our messages more easily if we convert to a `pandas.DataFrame`.
%% Cell type:code id: tags:
``` python
messagesFrame = pd.DataFrame(messages)
messagesFrame
```
%% Cell type:markdown id: tags:
For every classification we get a row of corresponding data. The information about the volunteer's answer is in the `data` column. Since we set _Caesar_ to use a question extractor, the format of each entry can be interpreted as:
```python
{ '<answer>' : <number_of_matching_responses> }
```
For example, in row `0` of the table we can see that for subject `52665118` the classification contained one response for the answer `'0'`, which corresponds to a Pulsator.
If we have a (partially) trained machine learning model, we could ask it to make its own predictions for each of the subjects we just got classifications for, but for now, let's just look up the classification (if it exists) in our table of predictions. We can use the `subject_id` field in our message table to get the metadata for each subject and then try to retrieve its machine learning prediction.
%% Cell type:code id: tags:
``` python
machineMatchesHuman = []
for _, (subjectId, data) in messagesFrame[["subject_id", "data"]].iterrows():
    # The extract data is a dict of {answer: count}; grab the answer key
    answer = next(iter(data.keys()))
    if len(answer) == 0:
        # Empty answer: treat as a non-match
        machineMatchesHuman.append(False)
        continue
    answer = int(answer)
    try:
        subject = Subject.find(subjectId)
    except Exception:
        # Subject lookup failed: treat as a non-match
        machineMatchesHuman.append(False)
        continue
    if "image" in subject.metadata:
        # Strip the ".jpg" extension to recover the row in the predictions table
        lookup = int(subject.metadata["image"][:-4])
        if lookup in predictions.index:
            print(
                subjectId,
                subject.metadata["image"][:-4],
                predictions.loc[lookup].idxmax(),  # machine prediction
                questionOrder[answer],             # volunteer answer
            )
            # Record whether the volunteer's answer matches the machine prediction
            machineMatchesHuman.append(
                answer == questionOrder.index(predictions.loc[lookup].idxmax())
            )
            continue
    machineMatchesHuman.append(False)
machineMatchesHuman
```
%% Cell type:markdown id: tags:
Perhaps we decide to retire all subjects for which the machine and human predictions match. In that case, we now need to send a message back to _Caesar_ with our decision for the subjects we want to retire. If we don't want to do anything, we don't have to send a message.
We can use the Panoptes Python API to send a message back to the placeholder reducer we set up in the Caesar UI. Here's how we do it.
%% Cell type:code id: tags:
``` python
pan = Panoptes(login="interactive", redirect_url='https://caesar.zooniverse.org/auth/zooniverse/callback')
```
%% Cell type:code id: tags:
``` python
workflowId = ?? # You can look this up in the Project Builder or you could use: project.links.workflows[0]
response = pan.put(endpoint="https://caesar.zooniverse.org",
                   path=f"workflows/{workflowId}/subject_reductions/receiver",
                   json={
                       'reduction': {
                           'subject_id': int(messagesFrame.subject_id.iloc[-1]),
                           'data': {"retire": True}
                       }
                   })
```
%% Cell type:markdown id: tags:
If the reduction was successfully submitted, _Caesar_ will send a `dict` in response.
%% Cell type:code id: tags:
``` python
response
```
%% Cell type:markdown id: tags:
Success!
#### Aside: Avoid entering credentials
To avoid having to enter credentials to send reductions back to _Caesar_, you can register with the Panoptes authentication system and get a _Client ID_ and a _Client Secret_. These are just special strings of unguessable characters that the Panoptes Python API can use instead of your usual credentials to authenticate.
To get your _Client ID_ and _Client Secret_ visit [https://panoptes.zooniverse.org/oauth/applications](https://panoptes.zooniverse.org/oauth/applications) and click on _New Application_. Once you have those details you can export them to your computer's environment just like you did for the Amazon credentials, but with different names.
%% Cell type:code id: tags:
``` python
os.environ["PANOPTES_CLIENT_ID"] = ??
os.environ["PANOPTES_CLIENT_SECRET"] = ??
```
%% Cell type:markdown id: tags:
## Filtering Subjects Using Machine Learning
Although machine learning algorithms are very good at performing **specific tasks**, there are some things that human beings still tend to do better. For example, human beings are much more likely to spot unusual or unexpected features in images or other types of subjects.
However, there are many data sets (with more arriving every day) that are simply too large to be processed by human beings, even using a citizen science approach.
Machine learning can help here by filtering out subjects that are "not interesting". Such subjects are typically very common in the data sets that were used to train the machine learning models and are therefore very easily and confidently classified using those models.
Commonly cited examples of how machine learning has been used to filter subjects that are shown to volunteers on Zooniverse are ecology-focused "camera trap" projects. Volunteers are asked to identify any animals they see in the images they are shown. Machine learning models detect "empty" images very accurately and it is not useful for volunteers to classify images with no animals in them. Machine learning can be used very effectively to remove empty images from Zooniverse subject sets to let volunteers focus on classifying animals.
We'll use an example from our _SuperWASP Variable Stars_ CNN, in which we'll select only those light curves for which our model is "confused". We'll define "confusing" images as those for which the machine learning algorithm outputs a response greater than 0.4 for more than one category.
For this demonstration, we've preselected those subjects and you can find the images in the `data/demoset_confused` folder.
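%% Cell type:markdown id: tags:
As an illustration of how such a subset could be selected (a hypothetical sketch, assuming the first six columns of the predictions table hold the per-class scores), the following applies the "more than one class above 0.4" criterion directly:
%% Cell type:code id: tags:
``` python
# Count, for each subject, how many class scores exceed 0.4 and keep those
# with more than one such class (the "confused" subjects)
classScores = predictions.iloc[:, :6]
confusedMask = (classScores > 0.4).sum(axis=1) > 1
classScores[confusedMask]
```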
%% Cell type:code id: tags:
``` python
confusedPredictions = pd.read_csv(os.path.join(dataDirectory, "confused_predictions.csv")).set_index("image")
```
%% Cell type:code id: tags:
``` python
confusedPredictions
```
%% Cell type:markdown id: tags:
Let's make a quick plot of some of these "confusing" images.
%% Cell type:code id: tags:
``` python
from ML_Demo import plotConfusedBatch
```
%% Cell type:code id: tags:
``` python
plotConfusedBatch(confusedPredictions.iloc[:, :6], os.path.join(dataDirectory, "demoset_confused"))
```
%% Cell type:markdown id: tags:
I think we can agree that the answers certainly aren't obvious, but as humans (maybe experts) we can probably get all of them right. Let's use our normal techniques to make a new "confused" subject set and add our confused subjects to it.
%% Cell type:code id: tags:
``` python
subjectSet = SubjectSet()
subjectSet.display_name = "Confused Demo Set"
subjectSet.links.project = project
response = subjectSet.save()
```
%% Cell type:code id: tags:
``` python
newSubjects = []
for image, imagePrediction in confusedPredictions.iloc[:, :6].iterrows():
    newSubject = Subject()
    newSubject.links.project = project
    newSubject.add_location(os.path.join(dataDirectory, "demoset_confused", f"{image}.jpg"))
    newSubject.metadata = {
        "Origin": "Python ML demo",
        "image": f"{image}.jpg",
        "ml_prediction": dict(imagePrediction),
    }
    newSubject.save()
    newSubjects.append(newSubject)
```
%% Cell type:code id: tags:
``` python
subjectSet.add(newSubjects)
```
%% Cell type:markdown id: tags:
## Active Learning
We're now in a position to implement a toy demonstration of a technique called "active learning". In active learning, model predictions are used to select subjects that are likely to provide the most useful information to improve the model's performance if they were labelled and used for further training of the model. This is particularly useful when your available data set is largely unlabelled.
For example, if you have a method of obtaining a level of confidence in a prediction made by your model, such as the uncertainty value predicted by a Bayesian neural network alongside each class prediction, then subjects with the highest predicted uncertainties would likely be the most useful for active learning. Our "confusing subjects" selection probably isn't exactly the right approach, but it's a reasonable attempt.
Now that we've created a new subject set with confusing images, let's create a special "Advanced" workflow to process them. Like before, we can use an external reducer on _Caesar_ to send classifications from that workflow to our SQS queue once they're classified by volunteers.
%% Cell type:code id: tags:
``` python
messages, receivedMessages, receivedMessageIds = sqs.getMessages(delete=True)
messagesFrame = pd.DataFrame(messages)
messagesFrame
```
%% Cell type:markdown id: tags:
Once we collect enough classifications from our "Advanced" workflow, we can use those classifications to further train our model. We can then use this model to make predictions for more subjects, select a new set of the most confusing subjects, and use them to create another subject set for the "Advanced" workflow.
Repeating this cycle over and over again is the basis of active learning with Zooniverse.
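%% Cell type:markdown id: tags:
Schematically, one round of the cycle might look like the sketch below (every helper here, such as `retrainModel`, `predictScores`, `selectConfused`, and `uploadSubjectSet`, is hypothetical and stands in for steps demonstrated earlier in this tutorial, as are the names `model`, `unlabelledPool`, and `minimumBatchSize`):
%% Cell type:code id: tags:
``` python
messages, _, _ = sqs.getMessages(delete=True)      # gather new volunteer labels
if len(messages) >= minimumBatchSize:              # wait until we have enough
    model = retrainModel(model, messages)          # further train the CNN
    scores = predictScores(model, unlabelledPool)  # predict for the remaining pool
    confused = selectConfused(scores)              # pick the most confusing subjects
    uploadSubjectSet(project, confused)            # send them to the "Advanced" workflow
```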