diff --git a/README.md b/README.md index 5e7dda785f7d4f727c526d135f33aabf601bbaaa..7d2f074c85c7cf96eaa6df934931a42dbb52a70b 100644 --- a/README.md +++ b/README.md @@ -1,92 +1,63 @@ -# Zooniverse Advanced Aggregation with Caesar +## Zooniverse - Advanced Aggregation with Caesar +This directory contains resources for the _Advanced Aggregation with Caesar_ tutorial. This tutorial forms part of a series of advanced guides for managing Zooniverse projects through Python. While they can be done independently, for best usage you may want to complete them in the following order (these are also all available as Interactive Analysis workflows in the ESAP GUI): +1. [Advanced Project Building](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-project-building) +2. [Advanced Aggregation with Caesar (current)](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar) +3. [Integrating Machine Learning](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-integrating-machine-learning) +Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects (as opposed to the default method of retiring after that subject has been classified a certain number of times). _Caesar_ also provides a powerful way of collecting and analysing volunteer classifications (aggregation). -## Getting started +For guides on creating a Zooniverse project through the web interface or by using Python, take a look at the _Advanced Project Building_ tutorial above and the links therein. For an introduction to _Caesar_ and using its [web interface](https://caesar.zooniverse.org), see the [Zooniverse _Caesar_ help guide](https://help.zooniverse.org/next-steps/caesar-realtime-data-processing/). A [recorded introduction](https://youtu.be/zJJjz5OEUAw?t=10830) and [accompanying slides](https://indico.in2p3.fr/event/21939/contributions/89043/) are available as part of the [First ESCAPE Citizen Science Workshop](https://indico.in2p3.fr/event/21939/). -To make it easy for you to get started with GitLab, here's a list of recommended next steps. +The advanced tutorial presented here includes two notebooks. The first, `Working_with_data.ipynb`, demonstrates using Python for: +* Details of classification, subject, and workflow exports from Zooniverse +* Configuring aggregation extractors and reducers +* Running extractors to extract the data from each classification into a more useful data format +* Running reducers to aggregate the data -Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)! +The second, `plotting_functions.ipynb`, demonstrates plotting example cluster data produced by the Zooniverse's point reducer code to visualise aggregation results. You can find the code for the tutorial in the `notebooks` folder and the data that were used in the `data` folder. A minimal sketch of the command-line flow behind the first notebook is given below. -## Add your files +This tutorial makes use of example material (subjects, metadata, classifications) from the [_Penguin Watch_](https://www.zooniverse.org/projects/penguintom79/penguin-watch) Zooniverse project, which involves counting the numbers of penguin adults, chicks and eggs in images to help understand their lives and environment. 
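+The first notebook drives everything through the `panoptes_aggregation` command-line tool. As a quick orientation, here is a minimal sketch of the three-step flow it follows on the example data. The commands mirror those used in the notebook; the file passed to the `reduce` step is illustrative only, since its name depends on the extractor used and the `-o` label you chose, and one reduce call is made per extractor/reducer pair:
+```
+# 1. Configure: build extractor/reducer config files for workflow 6465, version 52.76
+panoptes_aggregation config data/penguin-watch-workflows.csv 6465 -v 52.76 -d aggregation_results
+# 2. Extract: flatten each classification into one csv file per extractor
+panoptes_aggregation extract data/penguin-watch-classifications-trim.csv aggregation_results/Extractor_config_workflow_6465_V52.76.yaml -o example -d aggregation_results
+# 3. Reduce: aggregate the extracts for each subject (extracted file name shown here is illustrative)
+panoptes_aggregation reduce aggregation_results/point_extractor_by_frame_example.csv aggregation_results/Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml -o example -d aggregation_results
+```
+Each sub-command also accepts `-h` for full help text; the `Working_with_data.ipynb` notebook walks through these steps in detail.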
-- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files -- [ ] [Add files using the command line](https://docs.gitlab.com/ee/gitlab-basics/add-file.html#add-a-file-using-the-command-line) or push an existing Git repository with the following command: +A recorded walkthrough of this advanced tutorial is available [here](https://youtu.be/o9SzgsZvOCg?t=3840). -``` -cd existing_repo -git remote add origin https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar.git -git branch -M main -git push -uf origin main -``` - -## Integrate with your tools - -- [ ] [Set up project integrations](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar/-/settings/integrations) - -## Collaborate with your team - -- [ ] [Invite team members and collaborators](https://docs.gitlab.com/ee/user/project/members/) -- [ ] [Create a new merge request](https://docs.gitlab.com/ee/user/project/merge_requests/creating_merge_requests.html) -- [ ] [Automatically close issues from merge requests](https://docs.gitlab.com/ee/user/project/issues/managing_issues.html#closing-issues-automatically) -- [ ] [Enable merge request approvals](https://docs.gitlab.com/ee/user/project/merge_requests/approvals/) -- [ ] [Automatically merge when pipeline succeeds](https://docs.gitlab.com/ee/user/project/merge_requests/merge_when_pipeline_succeeds.html) - -## Test and Deploy - -Use the built-in continuous integration in GitLab. - -- [ ] [Get started with GitLab CI/CD](https://docs.gitlab.com/ee/ci/quick_start/index.html) -- [ ] [Analyze your code for known vulnerabilities with Static Application Security Testing(SAST)](https://docs.gitlab.com/ee/user/application_security/sast/) -- [ ] [Deploy to Kubernetes, Amazon EC2, or Amazon ECS using Auto Deploy](https://docs.gitlab.com/ee/topics/autodevops/requirements.html) -- [ ] [Use pull-based deployments for improved Kubernetes management](https://docs.gitlab.com/ee/user/clusters/agent/) -- [ ] [Set up protected environments](https://docs.gitlab.com/ee/ci/environments/protected_environments.html) - -*** +The ESAP Archives (accessible via the ESAP GUI) include data retrieval from the Zooniverse Classification Database using the ESAP Shopping Basket. For a tutorial on loading Zooniverse data from a saved shopping basket into a notebook and performing simple aggregation of the classification results, see [here](https://git.astron.nl/astron-sdc/escape-wp5/workflows/muon-hunters-example/-/tree/master) (also available as an Interactive Analysis workflow). -# Editing this README +### Setup -When you're ready to make this README your own, just edit this file and use the handy template below (or feel free to structure it however you want - this is just a starting point!). Thank you to [makeareadme.com](https://www.makeareadme.com/) for this template. +#### Option 1: ESAP workflow as a remote notebook instance -## Suggestions for a good README -Every project is different, so consider which of these sections apply to yours. The sections used in the template are suggestions for most open source projects. Also keep in mind that while a README can be too long and detailed, too long is better than too short. If you think your README is too long, consider utilizing another form of documentation rather than cutting out information. - -## Name -Choose a self-explaining name for your project. 
- -## Description -Let people know what your project can do specifically. Provide context and add a link to any reference visitors might be unfamiliar with. A list of Features or a Background subsection can also be added here. If there are alternatives to your project, this is a good place to list differentiating factors. - -## Badges -On some READMEs, you may see small images that convey metadata, such as whether or not all the tests are passing for the project. You can use Shields to add some to your README. Many services also have instructions for adding a badge. - -## Visuals -Depending on what you are making, it can be a good idea to include screenshots or even a video (you'll frequently see GIFs rather than actual videos). Tools like ttygif can help, but check out Asciinema for a more sophisticated method. - -## Installation -Within a particular ecosystem, there may be a common way of installing things, such as using Yarn, NuGet, or Homebrew. However, consider the possibility that whoever is reading your README is a novice and would like more guidance. Listing specific steps helps remove ambiguity and gets people to using your project as quickly as possible. If it only runs in a specific context like a particular programming language version or operating system or has dependencies that have to be installed manually, also add a Requirements subsection. - -## Usage -Use examples liberally, and show the expected output if you can. It's helpful to have inline the smallest example of usage that you can demonstrate, while providing links to more sophisticated examples if they are too long to reasonably include in the README. - -## Support -Tell people where they can go to for help. It can be any combination of an issue tracker, a chat room, an email address, etc. +You may need to install the `panoptes_aggregation` and `pandas` packages. +``` +!python -m pip install panoptes_aggregation +!python -m pip install pandas +``` -## Roadmap -If you have ideas for releases in the future, it is a good idea to list them in the README. +#### Option 2: Local computer -## Contributing -State if you are open to contributions and what your requirements are for accepting them. +1. Install Python 3: the easiest way to do this is to download the Anaconda build from https://www.anaconda.com/download/. This will pre-install many of the packages needed for the aggregation code. +2. Open a terminal and run: `pip install panoptes_aggregation` +3. Download the [Advanced Aggregation with Caesar](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar) tutorial into a suitable directory. -For people who want to make changes to your project, it's helpful to have some documentation on how to get started. Perhaps there is a script that they should run or some environment variables that they need to set. Make these steps explicit. These instructions could also be useful to your future self. +#### Option 3: Google Colab -You can also document commands to lint the code or run tests. These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something. Having instructions for running tests is especially helpful if it requires external setup, such as starting a Selenium server for testing in a browser. +Google Colab is a service that runs Python code in the cloud. -## Authors and acknowledgment -Show your appreciation to those who have contributed to the project. +1. Sign into Google Drive. +2. 
Make a copy of the [Advanced Aggregation with Caesar](https://git.astron.nl/astron-sdc/escape-wp5/workflows/zooniverse-advanced-aggregation-with-caesar) tutorial in your own Google Drive. +3. Right-click the `Working_with_data.ipynb` file > Open with > Google Colaboratory. + 1. If this is not an option, click "Connect more apps", search for "Google Colaboratory", enable it, and refresh the page. +4. Run the following in the notebook (these commands are also collected into a single example cell at the end of this README): + 1. `!pip install panoptes_aggregation --quiet` to install the needed packages (it will take a few minutes to finish), + 2. `from google.colab import drive; drive.mount('/content/drive')` to mount Google Drive, and + 3. `import os; os.chdir('/content/drive/MyDrive/zooniverse-advanced-aggregation-with-caesar/')` to change the current working directory to the example folder (adjust if you have renamed the example folder). -## License -For open source projects, say how it is licensed. +### Other Useful Resources -## Project status -If you have run out of energy or time for your project, put a note at the top of the README saying that development has slowed down or stopped completely. Someone may choose to fork your project or volunteer to step in as a maintainer or owner, allowing your project to keep going. You can also make an explicit request for maintainers. +Here is a list of additional resources that you may find useful when building your own Zooniverse citizen science project. +* [_Zooniverse_ website](http://zooniverse.org) - Interested in Citizen Science? Create a **free** _Zooniverse_ account, browse other projects for inspiration, contribute yourself as a citizen scientist, and build your own project. +* [Zooniverse project builder help pages](https://help.zooniverse.org) - A great resource with practical guidance, tips and advice for building great Citizen Science projects. See the ["Building a project using the project builder"](https://youtu.be/zJJjz5OEUAw?t=7633) recorded tutorial for more information. +* [_Caesar_ web interface](https://caesar.zooniverse.org) - An online interface for the _Caesar_ advanced retirement and aggregation engine. See the ["Introducing Caesar"](https://youtu.be/zJJjz5OEUAw?t=10830) recorded tutorial for tips and advice on how to use Caesar to supercharge your _Zooniverse_ project. +* [The `panoptes_client` documentation](https://panoptes-python-client.readthedocs.io/en/v1.1/) - A comprehensive reference for the Panoptes Python Client. +* [The `panoptes_aggregation` documentation](https://aggregation-caesar.zooniverse.org/docs) - A comprehensive reference for the Panoptes Aggregation tool. +* [The `aggregation-for-caesar` GitHub](https://github.com/zooniverse/aggregation-for-caesar) - A collection of external reducers for _Caesar_ and offline use. 
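+For convenience, the three Colab setup commands listed under Option 3 above can also be run together as a single notebook cell. This is only a sketch: the path in the last line assumes the example folder keeps its default name and sits at the top level of your Google Drive, so adjust it if your copy differs.
+```
+# Install the aggregation code (takes a few minutes)
+!pip install panoptes_aggregation --quiet
+
+# Mount your Google Drive so the notebook can see the example folder
+from google.colab import drive
+drive.mount('/content/drive')
+
+# Move into the example folder (adjust the path if you renamed or moved it)
+import os
+os.chdir('/content/drive/MyDrive/zooniverse-advanced-aggregation-with-caesar/')
+```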
\ No newline at end of file diff --git a/Working_with_data.pdf b/Working_with_data.pdf deleted file mode 100755 index b94d1ea82dc2506c697b948945e76c73086e9176..0000000000000000000000000000000000000000 Binary files a/Working_with_data.pdf and /dev/null differ diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-classifications-trim.csv b/data/penguin-watch-classifications-trim.csv similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-classifications-trim.csv rename to data/penguin-watch-classifications-trim.csv diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects-trim.csv b/data/penguin-watch-subjects-trim.csv similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects-trim.csv rename to data/penguin-watch-subjects-trim.csv diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860262.png b/data/penguin-watch-subjects/20860262.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860262.png rename to data/penguin-watch-subjects/20860262.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860263.png b/data/penguin-watch-subjects/20860263.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860263.png rename to data/penguin-watch-subjects/20860263.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860264.png b/data/penguin-watch-subjects/20860264.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860264.png rename to data/penguin-watch-subjects/20860264.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860265.png b/data/penguin-watch-subjects/20860265.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860265.png rename to data/penguin-watch-subjects/20860265.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860266.png b/data/penguin-watch-subjects/20860266.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860266.png rename to data/penguin-watch-subjects/20860266.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860267.png b/data/penguin-watch-subjects/20860267.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860267.png rename to data/penguin-watch-subjects/20860267.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860268.png b/data/penguin-watch-subjects/20860268.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860268.png rename to data/penguin-watch-subjects/20860268.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860269.png b/data/penguin-watch-subjects/20860269.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860269.png rename to data/penguin-watch-subjects/20860269.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860270.png b/data/penguin-watch-subjects/20860270.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860270.png rename to data/penguin-watch-subjects/20860270.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860271.png b/data/penguin-watch-subjects/20860271.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860271.png 
rename to data/penguin-watch-subjects/20860271.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860272.png b/data/penguin-watch-subjects/20860272.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860272.png rename to data/penguin-watch-subjects/20860272.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860273.png b/data/penguin-watch-subjects/20860273.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860273.png rename to data/penguin-watch-subjects/20860273.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860274.png b/data/penguin-watch-subjects/20860274.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860274.png rename to data/penguin-watch-subjects/20860274.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860275.png b/data/penguin-watch-subjects/20860275.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860275.png rename to data/penguin-watch-subjects/20860275.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860276.png b/data/penguin-watch-subjects/20860276.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860276.png rename to data/penguin-watch-subjects/20860276.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860277.png b/data/penguin-watch-subjects/20860277.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860277.png rename to data/penguin-watch-subjects/20860277.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860278.png b/data/penguin-watch-subjects/20860278.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860278.png rename to data/penguin-watch-subjects/20860278.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860279.png b/data/penguin-watch-subjects/20860279.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860279.png rename to data/penguin-watch-subjects/20860279.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860280.png b/data/penguin-watch-subjects/20860280.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860280.png rename to data/penguin-watch-subjects/20860280.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860281.png b/data/penguin-watch-subjects/20860281.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860281.png rename to data/penguin-watch-subjects/20860281.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860282.png b/data/penguin-watch-subjects/20860282.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860282.png rename to data/penguin-watch-subjects/20860282.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860283.png b/data/penguin-watch-subjects/20860283.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860283.png rename to data/penguin-watch-subjects/20860283.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860284.png b/data/penguin-watch-subjects/20860284.png similarity index 100% rename 
from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860284.png rename to data/penguin-watch-subjects/20860284.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860285.png b/data/penguin-watch-subjects/20860285.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860285.png rename to data/penguin-watch-subjects/20860285.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860286.png b/data/penguin-watch-subjects/20860286.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860286.png rename to data/penguin-watch-subjects/20860286.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860287.png b/data/penguin-watch-subjects/20860287.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860287.png rename to data/penguin-watch-subjects/20860287.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860288.png b/data/penguin-watch-subjects/20860288.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860288.png rename to data/penguin-watch-subjects/20860288.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860289.png b/data/penguin-watch-subjects/20860289.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860289.png rename to data/penguin-watch-subjects/20860289.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860290.png b/data/penguin-watch-subjects/20860290.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860290.png rename to data/penguin-watch-subjects/20860290.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860291.png b/data/penguin-watch-subjects/20860291.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860291.png rename to data/penguin-watch-subjects/20860291.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860292.png b/data/penguin-watch-subjects/20860292.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860292.png rename to data/penguin-watch-subjects/20860292.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860293.png b/data/penguin-watch-subjects/20860293.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860293.png rename to data/penguin-watch-subjects/20860293.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860294.png b/data/penguin-watch-subjects/20860294.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860294.png rename to data/penguin-watch-subjects/20860294.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860295.png b/data/penguin-watch-subjects/20860295.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860295.png rename to data/penguin-watch-subjects/20860295.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860296.png b/data/penguin-watch-subjects/20860296.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860296.png rename to data/penguin-watch-subjects/20860296.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860297.png 
b/data/penguin-watch-subjects/20860297.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860297.png rename to data/penguin-watch-subjects/20860297.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860298.png b/data/penguin-watch-subjects/20860298.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860298.png rename to data/penguin-watch-subjects/20860298.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860299.png b/data/penguin-watch-subjects/20860299.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860299.png rename to data/penguin-watch-subjects/20860299.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860300.png b/data/penguin-watch-subjects/20860300.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860300.png rename to data/penguin-watch-subjects/20860300.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860301.png b/data/penguin-watch-subjects/20860301.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860301.png rename to data/penguin-watch-subjects/20860301.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860302.png b/data/penguin-watch-subjects/20860302.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860302.png rename to data/penguin-watch-subjects/20860302.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860303.png b/data/penguin-watch-subjects/20860303.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860303.png rename to data/penguin-watch-subjects/20860303.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860304.png b/data/penguin-watch-subjects/20860304.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860304.png rename to data/penguin-watch-subjects/20860304.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860305.png b/data/penguin-watch-subjects/20860305.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860305.png rename to data/penguin-watch-subjects/20860305.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860306.png b/data/penguin-watch-subjects/20860306.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860306.png rename to data/penguin-watch-subjects/20860306.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860307.png b/data/penguin-watch-subjects/20860307.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860307.png rename to data/penguin-watch-subjects/20860307.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860308.png b/data/penguin-watch-subjects/20860308.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860308.png rename to data/penguin-watch-subjects/20860308.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860309.png b/data/penguin-watch-subjects/20860309.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860309.png rename to data/penguin-watch-subjects/20860309.png diff 
--git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860310.png b/data/penguin-watch-subjects/20860310.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860310.png rename to data/penguin-watch-subjects/20860310.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860311.png b/data/penguin-watch-subjects/20860311.png similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-subjects/20860311.png rename to data/penguin-watch-subjects/20860311.png diff --git a/Penguin-Watch-Example-data-dumps/penguin-watch-workflows.csv b/data/penguin-watch-workflows.csv similarity index 100% rename from Penguin-Watch-Example-data-dumps/penguin-watch-workflows.csv rename to data/penguin-watch-workflows.csv diff --git a/notebooks - pdf versions/Working_with_data.pdf b/notebooks - pdf versions/Working_with_data.pdf new file mode 100755 index 0000000000000000000000000000000000000000..370c42244096ffbdfdf4e78b9f850eeb30305c30 Binary files /dev/null and b/notebooks - pdf versions/Working_with_data.pdf differ diff --git a/notebooks - pdf versions/plotting_functions.pdf b/notebooks - pdf versions/plotting_functions.pdf new file mode 100755 index 0000000000000000000000000000000000000000..70e63fe2730d2ba3edb88277b15f7f972676ce43 Binary files /dev/null and b/notebooks - pdf versions/plotting_functions.pdf differ diff --git a/Working_with_data.ipynb b/notebooks/Working_with_data.ipynb similarity index 51% rename from Working_with_data.ipynb rename to notebooks/Working_with_data.ipynb index 130da16ba8fc34cd4889967e3a857415f9e6ba3d..5cf43afceea17bf540422a09e9cffbe71fb2dc2b 100755 --- a/Working_with_data.ipynb +++ b/notebooks/Working_with_data.ipynb @@ -6,95 +6,65 @@ "id": "HrVPzm6KzWma" }, "source": [ - "# Working with Zooniverse data\n", + "# Zooniverse - Advanced Aggregation with _Caesar_: Working with Data\n", "\n", - "For this workshop we will be working with the Zooniverse's data aggregation code `panoptes-aggregation`. This code is designed to work directly with the data dumps exported from projects `Data Export` page.\n", + "Zooniverse's _Caesar_ advanced retirement and aggregation engine allows for the setup of more advanced rules for retiring subjects, and provides a powerful way of aggregating volunteer classifications (collecting the results for analysis). Aggregation involves the use of \"extractors\" and \"reducers\": extractors extract the data from each classification into a more useful data format, while reducers \"reduce\" (aggregate) the extracted data for each subject together.\n", "\n", - "General documentation for the aggregation code can be found on https://aggregation-caesar.zooniverse.org/docs\n", - "\n", - "## Installing the code\n", - "\n", - "### Local computer\n", - "\n", - "Install python 3, the easiest way to do this is to download the Anaconda build from: https://www.anaconda.com/download/. This will pre-install many of the packages needed for the aggregation code.\n", - "\n", - "Open a terminal and run: `pip install panoptes-aggregation[gui]`\n", - "\n", - "Download the folder containing the [example data](https://drive.google.com/drive/folders/1rLPs3jFsop9dw7Gst6AF0bjhFw6EZ1jc?usp=sharing)\n", + "In this notebook, we will see how to set up, configure, and run external extractors and reducers. 
As an example, we will focus on clustering point data from the _Penguin Watch_ Zooniverse project, with results plotted in the `plotting_functions.ipynb` notebook.\n", "\n", - "### Google colab\n", "\n", + "For this tutorial we will be working with the Zooniverse's data aggregation code `panoptes_aggregation` (_Panoptes_ is the platform behind Zooniverse). This code is designed to work directly with the data exported from a project's `Data Export` page within the Zooniverse's Project Builder.\n", "\n", - "Google colab is a service that runs python code in the cloud\n", "\n", + "General documentation for the aggregation code can be found on https://aggregation-caesar.zooniverse.org/docs\n", "\n", - "1. Sign into google drive\n", - "2. Make a copy of the [example data](https://drive.google.com/drive/folders/1rLPs3jFsop9dw7Gst6AF0bjhFw6EZ1jc?usp=sharing) in your own google drive (easiest way is to download and re-uplaod the folder)\n", - "3. Right click the `Working_with_data.ipynb` file > Open with > Google Colaboratory\n", - " 1. if this is not an option click \"Connect more apps\", search for \"Google Colaboratory\", enable it, and refresh the page.\n", - "4. Run the cell below to install the needed packages (it will take a few mins to finish)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "r6BVpClIzSCl" - }, - "outputs": [], - "source": [ - "!pip install panoptes_aggregation --quiet" + "## Table of Contents\n", + "\n", + "1. [Setup](#Setup)\n", + "2. [Zooniverse data exports](#Zooniverse-data-exports)\n", + "3. [How the aggregation code works](#How-the-aggregation-code-works)\n", + "4. [Processing the example Penguin Watch data](#Processing-the-example-Penguin-Watch-data)\n", + " 1. [Extractor](#Edit-the-extractor-config-file)\n", + " 2. [Reducer](#Edit-the-reducer-config-files)\n", + " 3. [Plot the results](#Plot-the-results)\n", + " 4. [Understanding what the files contain](#Understanding-what-the-extraction-and-reduction-files-contain)\n", + "5. [Other things to try](#Other-things-to-try)" ] }, { "cell_type": "markdown", "metadata": { - "id": "tmaEMsgO0Klw" + "id": "HrVPzm6KzWma" }, "source": [ - "5. Run the cell below to mount google drive" + "## Setup\n", + "\n", + "You may need to install the `panoptes_aggregation` and `pandas` packages. If you do, then run the code in the next cell. Otherwise, skip it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "id": "amNr3p6G0E4l" + "id": "HrVPzm6KzWma" }, "outputs": [], "source": [ - "from google.colab import drive\n", - "drive.mount('/content/drive')" ] }, { "cell_type": "markdown", "metadata": { - "id": "7SAl_jSJ078S" - }, - "source": [ - "6. 
Run the cell below to change director to the example folder (adjust if you have renamed the example folder)" + "!python -m pip install panoptes_aggregation\n", + "!python -m pip install pandas" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "executionInfo": { - "elapsed": 713, - "status": "ok", - "timestamp": 1606147054963, - "user": { - "displayName": "Coleman Krawczyk", - "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14Ggyy0vPodL74vKupL7ewNWOXnrC-ItpID-aDM5v=s64", - "userId": "17600869181314191626" - }, - "user_tz": 0 - }, - "id": "7K_K1NS10rra" + "id": "HrVPzm6KzWma" }, "outputs": [], "source": [ + "import pandas as pd\n", + "import getpass\n", + "import glob\n", "import os\n", - "os.chdir('/content/drive/MyDrive/agg_workshop_2020/')" + "import io" ] }, { @@ -105,14 +75,14 @@ "source": [ "## Zooniverse data exports\n", "\n", - "Zooniverse projects provide a large amount of data to research teams. These data can be exported from the `Data Export` tab on a project's `Lab` page.\n", + "Zooniverse projects provide a large amount of data to research teams. The following data can be exported from the `Data Export` tab on a project's `Lab` page.\n", "\n", "### Classification export\n", "\n", - "This `csv` file has one row for every classification submitted for a project. This files has the following columns:\n", + "This `csv` file has one row for every classification submitted for a project. This file has the following columns:\n", "\n", "* `classification_id`: A unique ID number assigned to each classification\n", - "* `user_name`: The name of the user that submitted the classification. Non logged-in users are assigned a unique name based on (a hashed version of) their IP address.\n", + "* `user_name`: The name of the user that submitted the classification. Non logged-in users are assigned a unique name based on (a hashed version of) their IP address.\n", "* `user_id`: User ID number is provided for logged-in users\n", "* `user_ip`: A hashed version of the user's IP address (original IP addresses are not provided for privacy reasons)\n", "* `workflow_id`: The ID number for the workflow the classification was made on\n", @@ -122,13 +92,13 @@ "* `gold_standard`: Identifies if the classification was made on a gold standard subject\n", "* `expert`: Identifies if the classification was made in \"expert\" mode\n", "* `metadata`: A `JSON` blob containing additional metadata about the classification (e.g. browser size, browser user agent, classification duration, etc...)\n", - "* `annotations`: A `JSON` blob with the annotations made for each task in the workflow. The exact shape of this blob is dependent on the shape of the workflow.\n", - "* `subject_data`: A `JSON` blob with the metadata associated with the subject that was classified. The exact shape of this blob is dependent on the metadata uploaded to each subject\n", + "* `annotations`: A `JSON` blob with the annotations made for each task in the workflow. The exact shape of this blob is dependent on the shape of the workflow.\n", + "* `subject_data`: A `JSON` blob with the metadata associated with the subject that was classified. The exact shape of this blob is dependent on the metadata uploaded to each subject\n", "* `subject_ids`: The ID number for the subject classified\n", "\n", "### Subject export\n", "\n", - "This `csv` file has one row for every subject uploaded to a project. This file has the following columns:\n", + "This `csv` file has one row for every subject uploaded to a project. 
This file has the following columns:\n", "\n", "* `subject_id`: A unique ID number assigned to each subject as they are uploaded\n", "* `project_id`: The ID number for the project\n", @@ -142,7 +112,7 @@ "\n", "### Workflows export\n", "\n", - "This `csv` file has the information for every major version of a workflow. This file has the following columns:\n", + "This `csv` file has the information for every major version of a workflow. This file has the following columns:\n", "\n", "* `workflow_id`: The ID number for the workflow\n", "* `display_name`: The display name for the workflow\n", @@ -157,31 +127,56 @@ "* `strings`: A `JSON` blob containing all the text associated with the workflow\n", "* `minor_version`: The minor workflow version number (changes when text is edited on the workflow)\n", "\n", - "All other columns are not typically used and are for experimental or more advanced workflow setups.\n", - "\n", + "All other columns are not typically used and are for experimental or more advanced workflow setups." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RYa1sVVzyA8m" + }, + "source": [ "## How the aggregation code works\n", "\n", - "Aggregation is done in a three step process.\n", + "Aggregation is done in a three step process:\n", "\n", "1. Configure `extractors` and `reducers`\n", " 1. Check over configuration files and edit them as needed\n", "2. `Extract` the data from each classification into a more useful data format (i.e. flatten and normalize the data)\n", - "3. `Reduce` the the extracts for each subject together (i.e. aggregate the data)\n", - "\n", + "3. `Reduce` the extracts for each subject together (i.e. aggregate the data)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RYa1sVVzyA8m" + }, + "source": [ "## Processing the example Penguin Watch data\n", "\n", - "If you are comfortable using the command line, open a terminal and navigate the folder containing the example data. If you are working with the anaconda build on Windows you can using the anaconda launcher to boot up a `Jupyter lab` tab in your web browser. Use the left panel to navigate to the folder with the example data then launch a terminal.\n", + "This tutorial makes use of example material (subjects, metadata, classifications) from the [_Penguin Watch_](https://www.zooniverse.org/projects/penguintom79/penguin-watch) Zooniverse project, which involves counting the numbers of penguin adults, chicks and eggs in images to help understand their lives and environment. Volunteers are provided with marker tools to place points over each of the above categories (as well as an 'other' category), using a different marker tool for each category, often resulting in the positions of multiple points being stored for each image (point data). In addition to this point marking task, there is also a question asked to volunteers, as well as a \"shortcut\" checkbox (see below).\n", "\n", - "A selection of each of the above export files for the Penguin Watch project has been provided for this workshop. You should make a new folder called `aggregation` to direct the output of these scripts. 
These steps outline how to use the command line to run all of the aggregation scripts, but a GUI is available by running `panoptes_aggregation_gui` from the terminal.\n", + "A selection of each of the above export files for the Penguin Watch project has been provided in the `data` folder.\n", + "\n", + "You should make a new folder called `aggregation_results` to direct the output of the following scripts in this tutorial. These steps outline how to use the command line to run all of the aggregation scripts.\n", + "\n", + "For more detailed breakdowns of the commands used in this tutorial, see the [`panoptes_aggregation` documentation](https://aggregation-caesar.zooniverse.org/Scripts.html).\n", "\n", "### Run configuration script\n", "\n", - "The auto-config script will detect the shape of a project's workflow and select the default extractor and reducers to use. For this example we want to configuration files for workflow `6465` version `52.76`:" + "The `panoptes_aggregation` command contains three sub-commands: `config`, `extract`, and `reduce`:\n", + "- `config` creates configuration files for _panoptes_ data extraction and reduction based on a workflow export\n", + "- `extract` extracts data from _panoptes_ classifications based on the workflow\n", + "- `reduce` reduces data from panoptes classifications based on the extracted data\n", + "\n", + "To begin, we shall use `config` to generate the configuration files we'll need.\n", + "\n", + "The auto-config script will detect the shape of a project's workflow and select the default extractor and reducers to use. For this example, the `penguin-watch-workflows.csv` file contains data for multiple Penguin Watch workflows and for multiple versions of these workflows, so for now let's focus on generating configuration files for workflow `6465` version `52.76`:" ] }, { "cell_type": "code", - "execution_count": 1, + "execution_count": 3, "metadata": { "id": "E6mLcIouyA8n" }, @@ -191,24 +186,24 @@ "output_type": "stream", "text": [ "Saving Extractor config to:\n", - "/Volumes/Work/agg_workshop_2020/aggregation_results/Extractor_config_workflow_6465_V52.76.yaml\n", + "/home/james/git_projects/zooniverse-advanced-aggregation-with-caesar/aggregation_results/Extractor_config_workflow_6465_V52.76.yaml\n", "\n", "Saving Reducer config to:\n", - "/Volumes/Work/agg_workshop_2020/aggregation_results/Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml\n", + "/home/james/git_projects/zooniverse-advanced-aggregation-with-caesar/aggregation_results/Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml\n", "\n", "Saving Reducer config to:\n", - "/Volumes/Work/agg_workshop_2020/aggregation_results/Reducer_config_workflow_6465_V52.76_question_extractor.yaml\n", + "/home/james/git_projects/zooniverse-advanced-aggregation-with-caesar/aggregation_results/Reducer_config_workflow_6465_V52.76_question_extractor.yaml\n", "\n", "Saving Reducer config to:\n", - "/Volumes/Work/agg_workshop_2020/aggregation_results/Reducer_config_workflow_6465_V52.76_shortcut_extractor.yaml\n", + "/home/james/git_projects/zooniverse-advanced-aggregation-with-caesar/aggregation_results/Reducer_config_workflow_6465_V52.76_shortcut_extractor.yaml\n", "\n", "Saving task key look up table to:\n", - "/Volumes/Work/agg_workshop_2020/aggregation_results/Task_labels_workflow_6465_V52.76.yaml\n" + "/home/james/git_projects/zooniverse-advanced-aggregation-with-caesar/aggregation_results/Task_labels_workflow_6465_V52.76.yaml\n" ] } ], "source": [ - "!panoptes_aggregation 
config Penguin-Watch-Example-data-dumps/penguin-watch-workflows.csv 6465 -v 52 -m 76 -d aggregation_results" + "!panoptes_aggregation config data/penguin-watch-workflows.csv 6465 -v 52.76 -d aggregation_results" ] }, { @@ -217,12 +212,12 @@ "id": "Z_PSnDnbyA8p" }, "source": [ - "See `panoptes_aggregation config -h` for help text explaining each of these inputs.\n", + "See `!panoptes_aggregation config -h` for help text explaining each of these inputs.\n", "\n", - "This will crate four new files:\n", + "This will create five new files (that you can open with a text editor):\n", "\n", "* `Extractor_config_workflow_6465_V52.yaml`: The configuration file for the extractors\n", - "* `Reducer_config_workflow_6465_V52.76_shortcut_extractor.yaml`: The configuration file for the shourtcut reducer\n", + "* `Reducer_config_workflow_6465_V52.76_shortcut_extractor.yaml`: The configuration file for the shortcut reducer\n", "* `Reducer_config_workflow_6465_V52_question_extractor.yaml`: The configuration file for the question reducer\n", "* `Reducer_config_workflow_6465_V52_point_extractor_by_frame.yaml`: The configuration file for the point reducer\n", "* `Task_labels_workflow_6465_V52.76.yaml`: A file with a look up table that matches the workflow task keys with the text associated with them" @@ -239,7 +234,14 @@ "source": [ "### Edit the extractor config file\n", "\n", - "Task `T4` was never used in the final project so it can be removed from the config file. Today we are not intrested in task `T0` tool `3` (\"other\") so we will remove it and it's subtask form the config file. The final file should look like:\n", + "Next, we'll use the `extract` sub-command to extract data based on the extractor configuration file after we've had a look at it.\n", + "\n", + "In the extractor config file, there are a number of extractors listed:\n", + "1. The point extractor `T0` corresponds to the task involving marking the penguins, whose tools `0`, `1`, `2`, and `3` correspond to marker tools for \"Adults\", \"Chicks\", \"Eggs\", and \"Other\", respectively.\n", + "2. The question extractor `T1` corresponds to the question \"Have you marked all the animals?\" asked after marking points.\n", + "3. The shortcut extractor `T2` corresponds to the \"This image is too dark or blurry\" checkbox that allows volunteers to skip that image.\n", + "\n", + "Task `T4` was never used in the final project so **it can be removed from the config file**. Today we are not interested in task `T0` tool `3` (\"Other\") so **we will remove it and its subtask from the config file**. The final file should look like:\n", "\n", "```yaml\n", "extractor_config:\n", @@ -256,27 +258,38 @@ "workflow_id: 6465\n", "workflow_version: '52.76'\n", "\n", - "```\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EgQe6CeuyA8p", + "jupyter": { + "outputs_hidden": true + } + }, + "source": [ + "#### Side Note: Converting this extractor config for use in _Caesar_\n", + "\n", + "If you have a config file that you would like converting into an extractor for use in the _Caesar_ web interface, it is worth noting that all of the extractors available within `panoptes_aggregation` can be used as \"external extractors\" in _Caesar_. 
The following demonstrates how to set this up, though such an external extractor won't be used in this tutorial.\n", "\n", - "#### Converting this extractor config for use on Caesar\n", + "Note: For question tasks you will typically want to use the built-in question extractor.\n", "\n", - "All of the extractors available in within `panoptes_aggregation` can be used as \"external extractors\" on Caesar.\n", + "To begin, if you have not done so already, add a workflow to _Caesar_ in order to create an extractor: log in to the [_Caesar_ web interface](https://caesar.zooniverse.org/) using your Zooniverse account, go to the \"Workflow\" or \"Project\" tab and add yours by clicking `+Add` (IDs can be found in the Project Builder), then select a workflow.\n", "\n", - "1. On the \"Extractors\" tab in Caesar click \"Create Extractor\" and select \"External\"\n", - "2. Enter a unique \"key\" to reference this extractor later\n", - " - This value will be used in the reducer later and will show up in the data export\n", + "1. On the \"Extractors\" tab in _Caesar_ click \"Create Extractor\" and select \"External\".\n", + "2. Enter a unique \"key\" (e.g. `advanced_aggregation_example`) to reference this extractor later. This key will be used in the reducer later and will show up in the `Data Export`.\n", "3. Enter the \"URL\" as `https://aggregation-caesar.zooniverse.org/extractors/<extractor name>?task=<task ID>`\n", - " - The `<task ID>` value is found next to the `task:` key for each extractor in the config file\n", - " - For our example this would be \n", + " - The `<task ID>` value is found next to the `task:` key for each extractor in the config file.\n", + " - For our example this would be written as\n", " ```\n", " https://aggregation-caesar.zooniverse.org/extractors/point_extractor_by_frame?task=T0&tools=[0,1,2]\n", " ```\n", - "4. Enter the \"Minimum workflow version\"\n", - " - All version above this number will be passed through this extractor\n", - " - `52.76` for the example above (the value next to `workflow_version:`)\n", - "5. Click \"Create External extractor\"\n", - "\n", - "Note: For question tasks you will typically want to use the built in question extractor" + "4. Enter the \"Minimum workflow version\".\n", + " - All versions above this number will be passed through this extractor.\n", + " - For our example this would be `52.76` (the value next to `workflow_version:`).\n", + "5. Click \"Create External extractor\"." ] }, { @@ -285,14 +298,14 @@ "source": [ "### Run the extractors\n", "\n", - "The extraction script will create one `csv` file for each type of extractor being used. In this case there will be two files created, one for `point_extractor_by_frame` and one for `question_extractor`.\n", + "The extraction script will create one `csv` file for each type of extractor being used. In this case there will be three files created, one each for `point_extractor_by_frame`, `question_extractor` and `shortcut_extractor`.\n", "\n", - "See `panoptes_aggregation extract -h` for help text explaining each of these inputs.\n" + "See `!panoptes_aggregation extract -h` for help text explaining each of these inputs."
] }, { "cell_type": "code", - "execution_count": 2, + "execution_count": 5, "metadata": { "id": "rHXYi2NDyA8p" }, @@ -306,7 +319,7 @@ } ], "source": [ - "!panoptes_aggregation extract Penguin-Watch-Example-data-dumps/penguin-watch-classifications-trim.csv aggregation_results/Extractor_config_workflow_6465_V52.76.yaml -o example -d aggregation_results" + "!panoptes_aggregation extract data/penguin-watch-classifications-trim.csv aggregation_results/Extractor_config_workflow_6465_V52.76.yaml -o example -d aggregation_results" ] }, { @@ -317,38 +330,55 @@ "source": [ "### Edit the reducer config files\n", "\n", - "There are no configuration parameters for the question reducer so that files does not need to be edited, but for the point reducer we should switch it from the default `DBSCAN` reducer to the `HDBSCAN` one. We are making this switch since the Penguin Watch subjects have a large depth-of-field that causes the point clusters to be different densities across the image.\n", + "Now, we'll use the `reduce` sub-command to reduce data based on the extracted data and the reducer configuration files, once again after we've had a look at them.\n", + "\n", + "Thankfully, there are no configuration parameters for the question reducer or shortcut reducer so those files do not need to be edited.\n", + "\n", + "For the point reducer (`Reducer_config_workflow_6465_V52.76_point_extractor_by_frame.yaml` for this example), by default the reducer uses a clustering algorithm `DBSCAN` to aggregate the data, but we should **edit the file to switch it from the default `DBSCAN` reducer to an `HDBSCAN` one**. We are making this switch since the Penguin Watch subjects have a large depth-of-field that causes the clusters of points to be different densities across the image, which `HDBSCAN` can account for.\n", + "\n", + "As before, we are not interested in task `T0` tool `3` (\"Other\") so we will **remove it from the reducer config file** as well, i.e. removing the following:\n", + "```\n", + "details:\n", + " T0_tool3:\n", + " - question_reducer\n", + "```\n", "\n", - "We can also use this config file to set the `min_cluster_size` and `min_samples` keywords. Here are some good values to start with:\n", + "We can also use this config file to **set the `min_cluster_size` and `min_samples` keywords**. Here are some good values to start with:\n", "\n", "```yaml\n", "reducer_config:\n", " point_reducer_hdbscan:\n", " min_cluster_size: 4\n", " min_samples: 3\n", - "```\n", + "```" ] }, { "cell_type": "markdown", "metadata": { "id": "9a-NpnFhyA8q" }, "source": [ + "#### Side Note: Converting the reducer config for use in Caesar\n", "\n", - "#### Converting the reducer config for use on Caesar\n", + "Continuing on from the previous side note, we can also convert our reducer config file for use in the _Caesar_ web interface. Like before, all of the reducers available within `panoptes_aggregation` can be used as \"external reducers\" in _Caesar_.\n", "\n", - "All of the reducers available in within `panoptes_aggregation` can be used as \"external extractors\" on Caesar.\n", + "Note: For question tasks you will typically want to use the built-in stats reducer.\n", "\n", - "1. On the \"Reducers\" tab in Caesar click \"Crate Reducer\" and select \"External\"\n", - "2. 
Enter a unique \"key\" to reference this reducer later\n", - " - This value is used by the rules later and will show up in the data export\n", + "After selecting a workflow in the [_Caesar_ web interface](https://caesar.zooniverse.org/):\n", + "\n", + "1. On the \"Reducers\" tab in _Caesar_ click \"Create Reducer\" and select \"External\".\n", + "2. Enter a unique \"key\" to reference this reducer later. This key is used by the rules later and will show up in the `Data Export`.\n", "3. Enter the \"URL\" as `https://aggregation-caesar.zooniverse.org/reducers/<reducer name>?<param 1>=<value 1>&<param 2>=<value 2>&<etc ...>`\n", - " - For this example \n", + " - For our example this would be written as\n", " ```\n", " https://aggregation-caesar.zooniverse.org/reducers/point_reducer_hdbscan?min_cluster_size=4&min_samples=3\n", " ```\n", "4. Expand the \"Filters\" section\n", - "5. Fill in the \"Extractor keys\" section as a list \n", - " - This the key picked in step 2 of the extractor config\n", - "6. Pick how you want \"Repeated Classifications\" to be handled\n", - " - Defaults to \"keep first\"\n", - " - \"keep all\" can be useful at the testing/debugging stage\n", - "7. Click \"Create External reducer\"\n", - "\n", - "Note: For question tasks you will typically want to use the built in stats reducer" + "5. Fill in the \"Extractor keys\" section as a `list`: this is the key picked in step 2 of converting the extractor config above.\n", + " - Following the example above, this would be `[\"advanced_aggregation_example\"]`.\n", + "6. Pick how you want \"Repeated Classifications\" to be handled. The default is \"keep first\", though \"keep all\" can be useful at the testing/debugging stage.\n", + "7. Click \"Create External reducer\"." ] }, { @@ -358,12 +388,12 @@ "### Run the reducers\n", "See `panoptes_aggregation reduce -h` for help text explaining each of these inputs.\n", "\n", - "Note: By default only if a volunteer classifies the same subject multiple times only the first one is used. This can be changed with the `-F` flag on the command line (e.g. `-F all` to keep all, `-F first` to keep first, `-F last` to keep last)" + "Note: By default, if a volunteer classifies the same subject multiple times only the first one is used. This can be changed with the `-F` flag on the command line (e.g. `-F all` to keep all, `-F first` to keep first, `-F last` to keep last)" ] }, { "cell_type": "code", - "execution_count": 3, + "execution_count": 9, "metadata": { "id": "wnL-m82oyA8q" }, @@ -382,7 +412,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": 10, "metadata": { "id": "iw991ZbxyA8r" }, @@ -401,7 +431,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 12, "metadata": { "id": "5mxjVTSeyA8r" }, @@ -410,7 +440,7 @@ "name": "stdout", "output_type": "stream", "text": [ - "Reducing: 100% |###############################################| Time: 0:00:01\n" + "Reducing: 100% |###############################################| Time: 0:00:00\n" ] } ], @@ -426,9 +456,7 @@ "source": [ "### Plot the results\n", "\n", - "The final step is examining the results of the point clustering. A `jupyter notebook` named `plotting_functions.ipynb` is included with the shared files, this can be run by either opening the notebook directly or running the command `jupyter lab` in the folder containing the file, and opening it in your web browser. 
User `shift+enter` to run the code in each of the cells.\n", - "\n", - "Note: you will have to adjust the file paths in the \"Reading in the data\" and \"Plotting all the images\" sections to match your file set up.\n", "\n", + "The final step is examining the results of the point clustering. The notebook `plotting_functions.ipynb` is included in the `notebooks` folder for this purpose.\n", "\n", "### Understanding what the extraction and reduction files contain\n", "\n", @@ -455,12 +483,19 @@ "* `data.frame0.T0_tool*_clusters_y` : The weighted `y` position for each **cluster** found\n", "* `data.frame0.T0_tool*_clusters_var_x` : The weighted `x` variance of points in each **cluster** found\n", "* `data.frame0.T0_tool*_clusters_var_y` : The weighted `y` variance of points in each **cluster** found\n", - "* `data.frame0.T0_tool*_clusters_var_x_y` : The weighted `x-y` covariance of points in each **cluster** found\n", - "\n", + "* `data.frame0.T0_tool*_clusters_var_x_y` : The weighted `x-y` covariance of points in each **cluster** found" ] }, { "cell_type": "markdown", "metadata": { "id": "LD96TM4qyA8r" }, "source": [ "## Other things to try\n", "\n", - "* Play around with changing the `min_cluster_size` and `min_samples` parameters to see how they change the detected clusters\n", - "* Read the various `csv` files into you favorite programming language and explore the data\n" + "* Play around with changing the `min_cluster_size` and `min_samples` parameters to see how they change the detected clusters (visualising the results in the `plotting_functions.ipynb` notebook).\n", + "* Read the various `csv` files into your favourite programming language and explore the data." ] }, { @@ -480,7 +515,7 @@ "provenance": [] }, "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -494,7 +529,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.9.13" } }, "nbformat": 4, diff --git a/plotting_functions.ipynb b/notebooks/plotting_functions.ipynb similarity index 99% rename from plotting_functions.ipynb rename to notebooks/plotting_functions.ipynb index 797ea34d71da6f01b2ce6b52149a4cda61ccf522..e60a7c91b04165d6c7fad04df9320a496034006d 100755 --- a/plotting_functions.ipynb +++ b/notebooks/plotting_functions.ipynb @@ -4,13 +4,28 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Plotting penguin clusters\n", + "# Zooniverse - Advanced Aggregation with _Caesar_: Plotting penguin clusters\n", "\n", - "This notebook covers provides functions that can be used to plot the cluster data produced by the Zooniverse's point reducer code. The reducer code provides the original point data, probabilities of the points belonging in a cluster, the location of identified clusters, the covariance of the identified clusters, and the lifetimes of identified clusters.\n", + "This notebook covers functions that can be used to plot the cluster data produced by the Zooniverse's point reducer code, following the example presented in the `Working_with_data.ipynb` notebook. 
The output reducer `.csv` file provides the original point data, probabilities of the points belonging to a cluster, the location of identified clusters, the covariance of the identified clusters, and the lifetimes of identified clusters.\n", "\n", - "## Import packages used for plotting\n", + "Note:\n", + "- covariance = a measure of the strength of the correlation (joint variability) between two random variables.\n", + "- cluster lifetime = a measure of how confident the clustering algorithm is of that cluster (longer lifetime = more confident).\n", "\n", - "These packages are used for plotting (`matplotlib`, `seaborn`), reading in images (`skimage`), reading in data tables (`pandas`), and working with arrays (`scipy` and `numpy`)." + "## Table of Contents\n", + "\n", + "1. [Setup](#Setup)\n", + "2. [Define functions](#Define-functions)\n", + " 1. [Using covariance values to draw a 2-$\sigma$ ellipse](#Using-covariance-values-to-draw-a-2-$\sigma$-ellipse)\n", + " 2. [Plotting the original subject image](#Plotting-the-original-subject-image)\n", + " 3. [Plotting the data and clusters on the image](#Plotting-the-data-and-clusters-on-the-image)\n", + "3. [Reading in the data](#Reading-in-the-data)\n", + "4. [Plotting one image](#Plotting-one-image)\n", + "5. [Plotting all the images](#Plotting-all-the-images)\n", + "\n", + "## Setup\n", + "\n", + "First, let's import some Python packages to be used for plotting (`matplotlib`, `seaborn`), reading in images (`skimage`), reading in data tables (`pandas`), and working with arrays (`scipy` and `numpy`). As data in the `.csv` files are stored in JSON format, we'll also import the `unjson_dataframe` function from `panoptes_aggregation` to make them easier to handle." ] }, { @@ -33,9 +48,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Using covariance values to draw a 2-$\sigma$ ellipse\n", + "## Define functions\n", "\n", - "The reduction code provides the covariance measurements for the identified clusters, this information can be used to create uncertainty ellipses. The eigenvalues the covariance matrix give the semi-major and semi-minor axes of the ellipse, and the angle between the eigenvectors gives the angle of the ellipse. The `matplotlib` ellipse object is returned once these values have been calculated." + "### Using covariance values to draw a 2-$\sigma$ ellipse\n", + "\n", + "The reduction code provides the covariance measurements for the identified clusters: this information can be used to create uncertainty ellipses. The square roots of the eigenvalues of the covariance matrix give the semi-major and semi-minor axes of the ellipse, and the angle the first eigenvector makes with the x-axis gives the angle of the ellipse. The `matplotlib` ellipse object is returned once these values have been calculated."
] }, { @@ -45,12 +62,9 @@ "outputs": [], "source": [ "def cov_to_ellipse(cov, pos, nstd=1, **kwargs):\n", - " eigvec, eigval, V = sl.svd(cov, full_matrices=False)\n", - " # the angle the first eigenvector makes with the x-axis\n", - " theta = np.degrees(np.arctan2(eigvec[1, 0], eigvec[0, 0]))\n", - " # full width and height of ellipse, not radius\n", - " # the eigenvalues are the variance along the eigenvectors\n", - " width, height = 2 * nstd * np.sqrt(eigval)\n", + " eigvec, eigval, V = sl.svd(cov, full_matrices=False) # the eigenvalues are the variance along the eigenvectors\n", + " theta = np.degrees(np.arctan2(eigvec[1, 0], eigvec[0, 0])) # the angle the first eigenvector makes with the x-axis\n", + " width, height = 2 * nstd * np.sqrt(eigval) # full width and height of ellipse (not radius)\n", " return patches.Ellipse(xy=pos, width=width, height=height, angle=theta, **kwargs)" ] }, @@ -58,9 +72,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Plotting the original subject image\n", + "### Plotting the original subject image\n", "\n", - "This function takes in the `path` to an image and uses `matplotlib` to display it. It is displayed using the standard image coordinate system with the origin in the upper left of the image. This function ensures that the image is plotted in the same size and aspect ratio as the original image." + "This function takes in the `path` to an image and uses `matplotlib` to display it. It is displayed using the standard image coordinate system with the origin in the upper left of the image. This function ensures that the image is plotted with the same size and aspect ratio as the original image." ] }, { @@ -70,18 +84,19 @@ "outputs": [], "source": [ "def display_image_in_actual_size(path):\n", - " im_data = io.imread(path)\n", + " im_data = io.imread(path) # read in the image\n", " dpi = 100\n", " height, width, depth = im_data.shape\n", - " # What size does the figure need to be in inches to fit the image?\n", + " \n", + " # Determine the figure size (in inches) needed to fit the image:\n", " figsize = width / float(dpi), height / float(dpi)\n", - " # Create a figure of the right size with one axes that takes up the full figure\n", + " \n", + " # Create a figure of the right size with one axes that takes up the full figure:\n", " fig = plt.figure(figsize=figsize)\n", " ax = fig.add_axes([0, 0, 1, 1])\n", - " # Hide spines, ticks, etc.\n", - " ax.axis('off')\n", - " # Display the image.\n", - " ax.imshow(im_data)\n", + " \n", + " ax.axis('off') # hide spines, ticks, etc.\n", + " ax.imshow(im_data) # display the image\n", " return fig, ax" ] }, @@ -89,13 +104,13 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Plotting the data and clusters on the image\n", + "### Plotting the data and clusters on the image\n", "\n", - "The final plotting function puts everything together and plots the original data points and the identified clusters on the original image. It takes in the `path` to the original image, the `aggregate` data for the image (assumed to a row form a `pandas` data frame).\n", + "The final plotting function puts everything together and plots the original data points and the identified clusters on the original image. 
It takes in the `path` to the original image, and the `aggregate` data for the image (assumed to be a row from a `pandas` data frame).\n", "\n", - "Each of the four point tools are given a unique color, and the saturation of the points is proportional to the probability of the data point belonging to a cluster (e.g. the closer to grey a point is the more likely it is an outlier).\n", + "Each of the point marker tools are given a unique color, and the saturation of the points is proportional to the probability of the data point belonging to a cluster (i.e. the closer to grey a point is the more likely it is an outlier).\n", "\n", - "The transparency of the ellipses is proportional to the lifetime of the identified cluster (e.g. shorter lived clusters are more transparent).\n", + "The transparency of the ellipses is proportional to the lifetime of the identified cluster (i.e. shorter lived clusters are more transparent).\n", "\n", "The final cluster counts for each point tool are displayed in the figure's legend." ] @@ -107,6 +122,8 @@ "outputs": [], "source": [ "def plot_points(path, aggregate):\n", + " # Create dictionary containing labels, colours, and initial \n", + " # counts (zero) for each point marker tool (penguin category):\n", " labels = {\n", " 0: {\n", " 'label': 'Adults',\n", @@ -124,7 +141,11 @@ " 'count': 0\n", " }\n", " }\n", + " \n", + " # Plot the original subject image:\n", " fig, ax = display_image_in_actual_size(path)\n", + " \n", + " # Plot points and cluster ellipses for each point marker tool:\n", " for tool in labels.keys():\n", " cluster_x = f'data.frame0.T0_tool{tool}_clusters_x'\n", " cluster_y = f'data.frame0.T0_tool{tool}_clusters_y'\n", @@ -135,6 +156,7 @@ " points_x = f'data.frame0.T0_tool{tool}_points_x'\n", " points_y = f'data.frame0.T0_tool{tool}_points_y'\n", " points_prob = f'data.frame0.T0_tool{tool}_cluster_probabilities'\n", + " # Plot cluster ellipses:\n", " if isinstance(aggregate[cluster_x], list) and isinstance(aggregate[cluster_y], list):\n", " plot_props = labels[tool]\n", " probs = np.array(aggregate[cluster_prob])\n", @@ -156,6 +178,7 @@ " alpha = 0.7\n", " if prob / max_probs <= 0.1:\n", " alpha = 0.3\n", + " # Use covariance values to draw a 2-sigma ellipse:\n", " ellipse = cov_to_ellipse(\n", " cov,\n", " pos,\n", @@ -167,6 +190,7 @@ " )\n", " ax.add_artist(ellipse)\n", " labels[tool]['count'] += 1\n", + " # Plot points:\n", " if isinstance(aggregate[points_x], list) and isinstance(aggregate[points_y], list):\n", " plot_props = labels[tool]\n", " colors = [sns.desaturate(plot_props['color'], max(prob, 0.4)) for prob in aggregate[points_prob]]\n", @@ -177,7 +201,7 @@ " s=3,\n", " label='{0} ({1})'.format(plot_props['label'], plot_props['count'])\n", " )\n", - " plt.legend(loc=1)\n", + " plt.legend(loc=1) # add the legend to the figure\n", " return fig, ax" ] }, @@ -189,9 +213,8 @@ "\n", "Now that the general plotting functions are defined, we can read in the data. 
These lines of code should be adjusted to point to the directories where your data is kept.\n", "\n", - "`path_reduction`: The file path to the Zooniverse's data reduction file\n", - "\n", - "`base_path_images`: the file path to the original Penguin Watch images (note: the `{0}.png` at the end of this path is important and should not be changed)" + "- `path_reduction`: The file path to the Zooniverse's data reduction file (the `.csv` file produced by the point reducer).\n", + "- `base_path_images`: The file path to the original Penguin Watch images (note: the `{0}.png` at the end of this path is important and should not be changed)." ] }, { @@ -202,10 +225,10 @@ "source": [ "path_reduction = 'aggregation_results/point_reducer_hdbscan_example.csv'\n", "\n", - "reductions = pandas.read_csv(path_reduction)\n", - "unjson_dataframe(reductions)\n", + "reductions = pandas.read_csv(path_reduction) # read in the data from the .csv file\n", + "unjson_dataframe(reductions) # convert the data to a pandas DataFrame\n", "\n", - "base_path_images = 'Penguin-Watch-Example-data-dumps/penguin-watch-subjects/{0}.png'" + "base_path_images = 'data/penguin-watch-subjects/{0}.png'" ] }, { @@ -214,7 +237,7 @@ "source": [ "## Plotting one image\n", "\n", - "Here is example code for plotting and displaying one image. The `iloc[0]` on the first line tells the code to \"grab the first row\", change the `0` to a different number (up to `49`) to see a different subject." + "Here is example code for plotting and displaying one image. The `iloc[0]` on the first line tells the code to \"grab the first row\": you can change the `0` to a different number (up to `49`) to see a different subject." ] }, { @@ -248,7 +271,7 @@ "source": [ "## Plotting all the images\n", "\n", - "Make a new folder in the `aggregation` directory called `PW_clusters`. This `for` loop goes through every row of the reduced data, creates the plot, and saves the images to the newly created folder." + "Make a new folder in the `aggregation_results` directory called `PW_clusters`. This `for` loop goes through every row of the reduced data, creates the plot, and saves the images to the newly created folder." ] }, { @@ -258,6 +281,7 @@ "outputs": [], "source": [ "output_path = 'aggregation_results/PW_clusters/{0}_clusters.png'\n", + "\n", "for sdx, reduction in reductions.iterrows():\n", " output_name = output_path.format(reduction.subject_id)\n", " image_path = base_path_images.format(reduction.subject_id)\n", @@ -276,7 +300,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, @@ -290,7 +314,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.9.7" } }, "nbformat": 4, diff --git a/plotting_functions.pdf b/plotting_functions.pdf deleted file mode 100755 index 0ea96c4193c3532f03b03f9859071be031c0f965..0000000000000000000000000000000000000000 Binary files a/plotting_functions.pdf and /dev/null differ
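As a closing illustration of the "Other things to try" suggestion above (reading the various `csv` files into your favourite programming language), here is a minimal sketch of inspecting the point-reducer output directly with `pandas`. The file path follows the example layout used in `plotting_functions.ipynb`, and the mapping of tool numbers to penguin categories (0 = Adults, 1 = Chicks, 2 = Eggs) is an assumption based on the labels used in `plot_points`; adjust both to match your own files and workflow.

```python
# Minimal sketch: count clusters per point tool in the reducer output.
# Assumptions: the CSV path matches the tutorial layout, and tools 0-2
# correspond to Adults/Chicks/Eggs (check your own workflow's tool order).
import json

import pandas

path_reduction = 'aggregation_results/point_reducer_hdbscan_example.csv'
reductions = pandas.read_csv(path_reduction)

for tool, label in [(0, 'Adults'), (1, 'Chicks'), (2, 'Eggs')]:
    column = f'data.frame0.T0_tool{tool}_clusters_x'
    if column not in reductions.columns:
        continue  # no clusters were found for this tool in any subject
    # Cluster positions are stored as JSON-encoded lists, one row per subject
    counts = reductions[column].dropna().apply(lambda value: len(json.loads(value)))
    print(f'{label}: {counts.sum()} clusters across {len(counts)} subjects')
```

Parsing the JSON-encoded columns with `json.loads` keeps this sketch free of extra dependencies; the `unjson_dataframe` function from `panoptes_aggregation` (used in the notebook) performs the equivalent conversion for the whole data frame in place.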