Commit 7277b075 authored by James Pearson's avatar James Pearson

Add recording links to notebook

%% Cell type:markdown id: tags:
# Zooniverse - Integrating Machine Learning
%% Cell type:markdown id: tags:
Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects. _Caesar_ also provides a powerful way of collecting and analysing volunteer classifications (aggregation). Machine learning models can be used with _Caesar_ to make classification workflows more efficient: in this notebook we will examine how machine learning can be used to implement advanced subject retirement rules and filter subjects prior to being shown to volunteers, as well as setting up an active learning cycle so that volunteer classifications help train the machine learning model.
## Table of Contents
1. [Setup](#Setup)
2. [Advanced Retirement Using Machine Learning](#Advanced-Retirement-Using-Machine-Learning)
    1. [Option 1: Make machine learning predictions for all subjects "up front"](#Option-1:-Make-machine-learning-predictions-for-all-subjects-"up-front")
    2. [Option 2: Make machine learning predictions for each subject "on the fly"](#Option-2:-Make-machine-learning-predictions-for-each-subject-"on-the-fly")
3. [Filtering Subjects Using Machine Learning](#Filtering-Subjects-Using-Machine-Learning)
4. [Active Learning](#Active-Learning)
%% Cell type:markdown id: tags:
## Setup
You may need to install the `panoptes_client`, `pandas`, and `boto3` packages. If you do, then run the code in the next cell. Otherwise, skip it. `boto3` is a package that implements the Amazon Web Services (AWS) Python SDK (see the [documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html) for more detail), and is used in the second half of the tutorial.
%% Cell type:code id: tags:
``` python
!python -m pip install panoptes_client
!python -m pip install pandas
!python -m pip install boto3
```
%% Cell type:code id: tags:
``` python
from panoptes_client import Panoptes, Project, Subject, SubjectSet
import pandas as pd
import os
import getpass
import glob
```
%% Cell type:markdown id: tags:
## Advanced Retirement Using Machine Learning
Trained machine learning models can take up a lot of disk space, and running them can be computationally intensive. Using machine learning to predict classifications for specific subjects is not the sort of analysis that _Caesar_ can run by itself (even if you did add the code to the [`aggregation_for_caesar` repository](https://github.com/zooniverse/aggregation-for-caesar)).
However, that doesn't mean that we can't use machine learning models and _Caesar_ **together** to make our classification workflow more efficient. For example, you might want to use a machine learning model to predict the class of every subject in a subject set and then use less stringent retirement criteria if the machine and human classifiers agree.
As a concrete example, we might say that if the first three volunteers all agree with the machine prediction then we can retire the subject, but we would otherwise demand that seven volunteers view and classify the subject.
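A rule like this can be sketched as a small decision function. The function name and thresholds below are illustrative only, not part of the _Caesar_ API:

```python
def should_retire(human_answers, ml_prediction, agree_limit=3, default_limit=7):
    """Decide whether a subject can be retired.

    human_answers : volunteer answer indices, in classification order
    ml_prediction : the machine's predicted answer index
    """
    # Retire early if the first few volunteers all agree with the machine
    if len(human_answers) >= agree_limit and all(
        answer == ml_prediction for answer in human_answers[:agree_limit]
    ):
        return True
    # Otherwise fall back to the usual, more demanding retirement limit
    return len(human_answers) >= default_limit

should_retire([0, 0, 0], ml_prediction=0)       # True: three early agreements
should_retire([0, 1, 0, 0, 0], ml_prediction=0) # False: still needs seven views
```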
There are a couple of easy ways to couple a machine learning model with your Zooniverse project and we'll quickly demo them here.
1. The first option we have is to preclassify all of our subjects with machine learning and add the machine learning score to each as a **hidden subject metadatum** that _Caesar_ can recognise and use to execute specific retirement rules.
2. The second option is to make machine learning predictions for subjects **after** they have been classified and make an "on the fly" retirement decision. This approach is much more complicated, but it does mean that you can benefit from the latest version of your machine learning model and you can compute very complicated retirement criteria that are difficult or impossible to express as _Caesar_ rules.
%% Cell type:markdown id: tags:
### Option 1: Make machine learning predictions for all subjects "up front"
If you have a trained machine learning model and a pool of data, then it's straightforward to predict a machine score for all of your subjects and add a new hidden metadatum to each one.
For this tutorial, we have already trained a simple convolutional neural network (CNN) to predict the class (i.e. category) of a variable star's light curve by looking at the same images the volunteers see. We can hence load a table of its predictions for our subjects, where each row corresponds to a different subject and each column gives the predicted probability of that subject belonging to a given class.
%% Cell type:code id: tags:
``` python
dataDirectory = ?? # e.g. '../data/'
predictions = pd.read_csv(os.path.join(dataDirectory, "predictions.csv")).set_index("image")
predictions
```
%% Cell type:markdown id: tags:
Now let's create a new subject set (as we did in the _Advanced Project Building_ tutorial) and add the machine learning prediction to each subject as a hidden metadatum. We can call that metadatum whatever we like, so how about `#ml_prediction`? We'll simplify things by defining the machine prediction for a subject to be the class with the highest score, so for the zeroth row, our machine prediction is "Pulsator". Rather than storing the prediction as a word though, we'll map it to its corresponding position in the list of options for task 1 of our project. Recall that the order is:
```python
["Pulsator", "EA/EB type", "EW type", "Rotator", "Unknown", "Junk"]
```
%% Cell type:code id: tags:
``` python
questionOrder = ["Pulsator", "EA/EB type", "EW type", "Rotator", "Unknown", "Junk"]
```
%% Cell type:markdown id: tags:
Let's start with all the parts we've seen before: authenticating with the Panoptes API and finding our project.
%% Cell type:code id: tags:
``` python
username = input("Enter Panoptes Username:")
password = getpass.getpass("Enter Panoptes Password:")
```
%% Cell type:code id: tags:
``` python
panoptesClient = Panoptes.connect(username=username, password=password)
```
%% Cell type:code id: tags:
``` python
projectId = ?? # You can look this up in the Project Builder
project = Project.find(projectId)
```
%% Cell type:code id: tags:
``` python
print("Name:", project.display_name)
print("Description:", project.description)
print("Introduction:", project.introduction)
print("Number of Subjects:", project.subjects_count)
print("Number of Classifications:", project.classifications_count)
```
%% Cell type:markdown id: tags:
Here's where we create our new subject set, give it a name, and link it to our project.
%% Cell type:code id: tags:
``` python
subjectSet = SubjectSet() # Define new subject set
subjectSet.display_name = "Machine Learning Demo Set" # Give it a name
subjectSet.links.project = project # Set it to link to our project
response = subjectSet.save() # Save this (currently empty) subject set
```
%% Cell type:markdown id: tags:
Now we'll acquire a file list of all the subjects to be added to the subject set.
%% Cell type:code id: tags:
``` python
imageFiles = glob.glob(os.path.join(dataDirectory, "demoset_smallSample_1/*jpg"))
imageFiles
```
%% Cell type:markdown id: tags:
We'll add those files to the subject set along with some metadata. First, let's process the machine predictions like we discussed above, identifying the class with the highest score for each subject.
%% Cell type:code id: tags:
``` python
imagePredictions = []
for imageFile in imageFiles:
    imageRow = int(os.path.basename(imageFile)[:-4])
    imagePredictions.append(
        (
            imageRow, # Subject name/number
            predictions.iloc[imageRow].idxmax(), # Class of variable star with the highest prediction score
            questionOrder.index(predictions.iloc[imageRow].idxmax()), # Corresponding index
        )
    )
imagePredictions
```
%% Cell type:code id: tags:
``` python
newSubjects = []
for imageFile, imagePrediction in zip(imageFiles, imagePredictions):
    newSubject = Subject()
    newSubject.links.project = project
    newSubject.add_location(imageFile)
    newSubject.metadata = {
        "Origin": "Python ML demo",
        "image": os.path.basename(imageFile),
        "#ml_prediction": imagePrediction[2],
    }
    newSubject.save()
    newSubjects.append(newSubject)
```
%% Cell type:markdown id: tags:
And finally, we assign our newly uploaded `Subject`s to the `SubjectSet` we already created.
%% Cell type:code id: tags:
``` python
subjectSet.add(newSubjects)
```
%% Cell type:markdown id: tags:
Now we just have to **set up appropriate retirement rules in the _Caesar_ UI**, shown [here](https://youtu.be/o9SzgsZvOCg?t=9003) (02:30:03-02:36:03) as part of the recorded demonstration of this tutorial. _Caesar_ aggregates volunteer classifications using "extractors" and "reducers": extractors pull the data from each classification into a more useful format, while reducers "reduce" (aggregate) the extracted data for each subject together.
The input to a reducer is the output of one or more extractors.
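To make that data flow concrete, here is a minimal sketch of what a question extractor and a counting reducer do. The classification structure shown is simplified for illustration, not the exact Caesar schema:

```python
from collections import Counter

def question_extract(classification):
    # An extractor turns one classification into a small, useful record:
    # here, the chosen answer with a count of one.
    return {classification["annotations"][0]["value"]: 1}

def count_reduce(extracts):
    # A reducer aggregates all the extracts for a single subject,
    # here by summing the per-answer counts.
    totals = Counter()
    for extract in extracts:
        totals.update(extract)
    return dict(totals)

classifications = [{"annotations": [{"value": a}]} for a in ["0", "0", "3"]]
count_reduce([question_extract(c) for c in classifications])  # {'0': 2, '3': 1}
```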
%% Cell type:markdown id: tags:
### Option 2: Make machine learning predictions for each subject "on the fly"
For this option, we want to make subject retirement decisions following machine learning predictions made _after_ volunteer classification. To do this, we'll need a way of outputting extract data for processing with machine learning, and a way of uploading the processed results back to _Caesar_ for the retirement decisions.
#### External reducers
While _Caesar_ contains multiple inbuilt reducers to choose from, _Caesar_ also defines the concept of an _External_ reducer. When you set up an external reducer in the _Caesar_ UI, the system will send the results from any associated extractors to an HTTP endpoint that you specify. You can then receive those extract data through the HTTP endpoint and process them appropriately.
#### Placeholder reducers
Once processing is complete, you need a way to send the results back to _Caesar_. The system defines the concept of a _Placeholder_ reducer for this purpose. Your external reducer should send its answer back to the placeholder reducer, allowing _Caesar_ to process any associated rules or effects.
#### SQS reducers
The one downside of a basic external reducer is that you need to run some sort of webserver that can listen for messages from _Caesar_ and process them when they arrive. This can be a barrier to entry for some research teams, so _Caesar_ defines one more reducer category, the _SQS Reducer_.
_SQS_ stands for "Simple Queue Service". It's a facility provided by Amazon Web Services that maintains a queue of messages in the cloud. _Caesar_ sends messages to this queue and your code can consume them. All the hosting is handled by Amazon, so you don't have to run a server yourself.
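Under the hood, consuming such a queue with `boto3` looks roughly like this. This is a sketch, not the actual client implementation used below; it assumes AWS credentials are configured and that each message body is a JSON-encoded extract, as the SQS reducer sends:

```python
import json

def fetch_extracts(queue_url, max_messages=10):
    """Pull one batch of Caesar extract messages from an SQS queue (sketch)."""
    import boto3  # imported here so the sketch can be read without AWS set up

    sqs = boto3.client("sqs")
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # SQS caps a single receive at 10
        WaitTimeSeconds=20,                # long polling avoids empty responses
    )
    return [parse_body(m["Body"]) for m in response.get("Messages", [])]

def parse_body(body):
    # Each SQS message body arrives as a JSON string; decode it to a dict.
    return json.loads(body)
```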
We'll **set up an SQS reducer in the _Caesar_ UI** to send message extracts to us, shown [here](https://youtu.be/o9SzgsZvOCg?t=9975) (02:46:15-02:56:36) as part of the recorded demonstration of this tutorial. Then we can use a specially written client to grab those messages from the queue and process them.
%% Cell type:code id: tags:
``` python
from ML_Demo import SQSClient
```
%% Cell type:markdown id: tags:
**IMPORTANT**: Note that for you to use the SQS reducer facility you will need to set up an Amazon AWS account, set up a queue and get a set of access credentials. This process is free and reasonably straightforward, but beyond the scope of this tutorial. Once you've done that, you can export your credentials to your computer's environment like this.
%% Cell type:code id: tags:
``` python
os.environ["AWS_ACCESS_KEY_ID"] = ?? # You will need an AWS account to get your ID
os.environ["AWS_SECRET_ACCESS_KEY"] = ?? # You will need an AWS account to get your Secret
```
%% Cell type:markdown id: tags:
Every SQS queue has a unique URL that you can use to connect to it and retrieve messages. You must pass this URL to the `SQSClient` constructor when you instantiate the SQS client.
%% Cell type:code id: tags:
``` python
queueUrl = ?? # You can set this up with an AWS account e.g. "https://sqs.us-east-1.amazonaws.com/123456123456/MySQSQueue"
sqs = SQSClient(queueUrl)
```
%% Cell type:markdown id: tags:
The client provides a `getMessages` method that will retrieve a batch of messages from the queue and (optionally) delete them from the queue. The method returns three objects: the first is the set of messages that were received, with any duplicates removed; the second is the raw set of messages before deduplication; and the final object is a list of unique message IDs for the messages that were received.
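Deduplication matters because standard SQS queues guarantee at-least-once delivery, so the same message can be delivered more than once. One plausible way to drop duplicates (a sketch, not the actual `SQSClient` code) is to key on each message's `MessageId`:

```python
def dedupe(raw_messages):
    # Keep the first occurrence of each MessageId, preserving order.
    seen = set()
    unique = []
    for message in raw_messages:
        if message["MessageId"] not in seen:
            seen.add(message["MessageId"])
            unique.append(message)
    return unique

batch = [{"MessageId": "a"}, {"MessageId": "b"}, {"MessageId": "a"}]
dedupe(batch)  # [{'MessageId': 'a'}, {'MessageId': 'b'}]
```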
%% Cell type:code id: tags:
``` python
messages, receivedMessages, receivedMessageIds = sqs.getMessages(delete=False)
```
%% Cell type:markdown id: tags:
We can examine our messages more easily if we convert to a `pandas.DataFrame`.
%% Cell type:code id: tags:
``` python
messagesFrame = pd.DataFrame(messages)
messagesFrame
```
%% Cell type:markdown id: tags:
For every classification we get a row of corresponding data. The information about the volunteer's answer is in the `data` column. Since we set _Caesar_ to use a question extractor, the format of each entry can be interpreted as:
```python
{ '<answer>' : <number_of_matching_responses> }
```
For example, in row `0` of the table we can see that for subject `52665118` the classification contained one response for the answer `'0'`, which corresponds to a Pulsator.
If we have a (partially) trained machine learning model, we could ask it to make its own predictions for each of the subjects we just got classifications for, but for now, let's just look up the classification (if it exists) in our table of predictions. We can use the `subject_id` field in our message table to get the metadata for each subject and then try to retrieve its machine learning prediction.
%% Cell type:code id: tags:
``` python
machineMatchesHuman = []
for _, (subjectId, data) in messagesFrame[["subject_id", "data"]].iterrows():
    answer = next(iter(data.keys()))
    if len(answer) == 0:
        machineMatchesHuman.append(False)
        continue
    answer = int(answer)
    try:
        subject = Subject.find(subjectId)
    except Exception: # Subject lookup failed, so we can't compare predictions
        machineMatchesHuman.append(False)
        continue
    if "image" in subject.metadata:
        lookup = int(subject.metadata["image"][:-4])
        if lookup in predictions.index:
            print(
                subjectId,
                subject.metadata["image"][:-4],
                predictions.loc[lookup].idxmax(),
                questionOrder[answer],
            )
            machineMatchesHuman.append(
                answer == questionOrder.index(predictions.loc[lookup].idxmax())
            )
            continue
    machineMatchesHuman.append(False)
machineMatchesHuman
```
%% Cell type:markdown id: tags:
Perhaps we decide to retire all subjects for which the machine and human predictions match. In that case, we now need to send a message back to _Caesar_ with our decision for the subjects we want to retire. If we don't want to do anything, we don't have to send a message.
We can use the Panoptes Python API to send a message back to the placeholder reducer we set up in the Caesar UI. Here's how we do it.
%% Cell type:code id: tags:
``` python
pan = Panoptes(login="interactive", redirect_url='https://caesar.zooniverse.org/auth/zooniverse/callback')
```
%% Cell type:code id: tags:
``` python
workflowId = ?? # You can look this up in the Project Builder or you could use: project.links.workflows[0]
response = pan.put(endpoint="https://caesar.zooniverse.org",
                   path=f"workflows/{workflowId}/subject_reductions/receiver",
                   json={
                       'reduction': {
                           'subject_id': int(messagesFrame.subject_id.iloc[-1]),
                           'data': {"retire": True}
                       }
                   })
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
If the reduction was successfully submitted, _Caesar_ will send a `dict` in response. If the reduction was successfully submitted, _Caesar_ will send a `dict` in response.
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
response response
``` ```
%% Cell type:markdown id: tags:
Success!
#### Aside: Avoid entering credentials
To avoid having to enter credentials to send reductions back to _Caesar_, you can register with the Panoptes authentication system and get a _Client ID_ and a _Client Secret_. These are just special strings of unguessable characters that the Panoptes Python API can use instead of your usual credentials to authenticate.
To get your _Client ID_ and _Client Secret_, visit [https://panoptes.zooniverse.org/oauth/applications](https://panoptes.zooniverse.org/oauth/applications) and click on _New Application_. Once you have those details, you can export them to your computer's environment just like you did for the Amazon credentials, but with different names.
%% Cell type:code id: tags:
``` python
os.environ["PANOPTES_CLIENT_ID"] = ??
os.environ["PANOPTES_CLIENT_SECRET"] = ??
```
%% Cell type:markdown id: tags:
## Filtering Subjects Using Machine Learning
Although machine learning algorithms are very good at performing **specific tasks**, there are some things that human beings still tend to do better. For example, human beings are much more likely to spot unusual or unexpected features in images or other types of subjects.
However, there are many data sets (with more arriving every day) that are simply too large to be processed by human beings, even using a citizen science approach.
Machine learning can help here by filtering out subjects that are "not interesting". Such subjects are typically very common in the data sets that were used to train the machine learning models and are therefore very easily and confidently classified by those models.
Commonly cited examples of how machine learning has been used to filter the subjects shown to volunteers on Zooniverse are ecology-focused "camera trap" projects. Volunteers are asked to identify any animals they see in the images they are shown. Machine learning models detect "empty" images very accurately, and it is not useful for volunteers to classify images with no animals in them. Machine learning can therefore be used very effectively to remove empty images from Zooniverse subject sets, letting volunteers focus on classifying animals.
We'll use an example from our _SuperWASP Variable Stars_ CNN, in which we'll select only those light curves about which our model is "confused". We'll define "confusing" images as those for which the machine learning algorithm outputs a response greater than 0.4 for more than one category.
For this demonstration, we've preselected those subjects; you can find the images in the `data/demoset_confused` folder.
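%% Cell type:markdown id: tags:
The "more than one category above 0.4" rule can be sketched directly in pandas. This is a hypothetical illustration — the class names and scores below are made up, not real SuperWASP model output.
%% Cell type:code id: tags:
``` python
import pandas as pd

# Toy predictions table: one row per image, one column of model scores per class.
predictions = pd.DataFrame(
    {"pulsator": [0.90, 0.45, 0.50], "EA/EB": [0.05, 0.42, 0.10], "EW": [0.05, 0.13, 0.41]},
    index=pd.Index(["img1", "img2", "img3"], name="image"),
)

# "Confused" = more than one class scores above the 0.4 threshold.
confused = predictions[(predictions > 0.4).sum(axis=1) > 1]
print(list(confused.index))  # img2 and img3 each have two scores above 0.4
```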
%% Cell type:code id: tags:
``` python
confusedPredictions = pd.read_csv(os.path.join(dataDirectory, "confused_predictions.csv")).set_index("image")
```
%% Cell type:code id: tags:
``` python
confusedPredictions
```
%% Cell type:markdown id: tags:
Let's make a quick plot of some of these "confusing" images.
%% Cell type:code id: tags:
``` python
from ML_Demo import plotConfusedBatch
```
%% Cell type:code id: tags:
``` python
plotConfusedBatch(confusedPredictions.iloc[:, :6], os.path.join(dataDirectory, "demoset_confused"))
```
%% Cell type:markdown id: tags:
I think we can agree that the answers certainly aren't obvious, but as humans (maybe experts) we can probably get all of them right. Let's use our normal techniques to make a new "confused" subject set and add our confused subjects to it.
%% Cell type:code id: tags:
``` python
subjectSet = SubjectSet()
subjectSet.display_name = "Confused Demo Set"
subjectSet.links.project = project
response = subjectSet.save()
```
%% Cell type:code id: tags:
``` python
newSubjects = []
for image, imagePrediction in confusedPredictions.iloc[:, :6].iterrows():
    newSubject = Subject()
    newSubject.links.project = project
    newSubject.add_location(os.path.join(dataDirectory, "demoset_confused", f"{image}.jpg"))
    newSubject.metadata = {
        "Origin": "Python ML demo",
        "image": f"{image}.jpg",
        "ml_prediction": dict(imagePrediction),
    }
    newSubject.save()
    newSubjects.append(newSubject)
```
%% Cell type:code id: tags:
``` python
subjectSet.add(newSubjects)
```
%% Cell type:markdown id: tags:
## Active Learning
We're now in a position to implement a toy demonstration of a technique called "active learning". In active learning, model predictions are used to select the subjects that would likely provide the most useful information for improving the model's performance if they were labelled and used for further training. This is particularly useful when your available data set is largely unlabelled.
For example, if you have a method of obtaining a level of confidence in a prediction made by your model, such as the uncertainty value predicted by a Bayesian neural network alongside each class prediction, then subjects with the highest predicted uncertainties would likely be the most useful for active learning.
Our "confusing subjects" selection probably isn't exactly the right approach, but it's a reasonable attempt.
Now that we've created a new subject set with confusing images, let's create a special "Advanced" workflow to process them. As before, we can use an external reducer on _Caesar_ to send classifications from that workflow to our SQS queue once volunteers have classified the subjects.
%% Cell type:code id: tags:
``` python
messages, receivedMessages, receivedMessageIds = sqs.getMessages(delete=True)
messagesFrame = pd.DataFrame(messages)
messagesFrame
```
%% Cell type:markdown id: tags:
Once we collect enough classifications from our "Advanced" workflow, we can use those classifications to further train our model. We can then use this model to make predictions for more subjects, select a new set of the most confusing subjects, and use them to create another subject set for the "Advanced" workflow.
Repeating this cycle over and over again is the basis of active learning with Zooniverse.
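%% Cell type:markdown id: tags:
Putting the pieces together, one full iteration of the cycle might look like the pseudocode below. Every function name here is a placeholder for illustration, not a real Panoptes or _Caesar_ call.
%% Cell type:code id: tags:
``` python
# Pseudocode sketch of one active-learning round (all helper names are placeholders).
def active_learning_round(model, unlabelledPool, nSelect=100):
    scores = model.predict(unlabelledPool)            # class probabilities per subject
    confusing = selectMostUncertain(scores, nSelect)  # e.g. more than one score > 0.4
    subjectSet = uploadToZooniverse(confusing)        # new set for the "Advanced" workflow
    labels = collectVolunteerLabels(subjectSet)       # via Caesar reducers + SQS
    model.fit(confusing, labels)                      # further training
    return model
```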
%% Cell type:code id: tags:
``` python
```