{ "cells": [ { "cell_type": "markdown", "id": "ea0786c2", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# A quick introduction to machine learning\n", "\n", "\n", "- These notebooks are a brief, hands-on introduction to machine learning.\n", "- We will review some of the nomenclature, principles, and applications from [Valentina's presentation](https://github.com/oceanhackweek/ohw-tutorials/tree/OHW22/01-Tue/01-machine-learning-intro)." ] }, { "cell_type": "markdown", "id": "2a4c4301", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## ML will solve all of our problems, right?\n", "\n", "" ] }, { "cell_type": "markdown", "id": "987fda81", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## What is Machine Learning (ML)?\n", "\n", "**Caveat:** I'm not a Statistician, Mathematician, or ML expert. I only play one online. You can find my work on movies like \"How to get by with little to no data\" or \"Oh gosh, the PI wants some buzz-words in the report\" and \"Fuzzy logic no longer does it, we need ML → AI → DL\"\n", "\n", "What is ML (a personal point of view):\n", "\n", "* Focus on practical problems\n", "* Learn from the data and/or make predictions with it\n", "* Middle ground between statistics and optimization techniques\n", "* We have fast computers now, right? Let them do the work! ([Must see JVP talk on this](https://www.youtube.com/watch?app=desktop&v=Iq9DzN6mvYA).)" ] }, { "cell_type": "markdown", "id": "00cc5ce5", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Oversimplified take:** Fit a model to data and use it to make predictions. (This is how scikit-learn designed its API, BTW.)" 
] }, { "cell_type": "markdown", "id": "b2cf395e", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Vocabulary\n", "\n", "\n", "- **parameters:** Variables that define the model and control its behavior.\n", "\n", "- **model:** Set of mathematical equations used to approximate the data.\n", "\n", "- **labels/classes:** Quantity/category that we want to predict.\n", "\n", "- **features:** Observations (information) used as predictors of labels/classes.\n", "\n", "- **training:** Use **features** and known **labels/classes** to fit the **model** and estimate its **parameters** (full circle, right? But why stop now?).\n", "\n", "Please check out [this awesome lecture](https://docs.google.com/presentation/d/1Fa9SuyK9DIpd-MkJJjGqjCbAa-sHtr3qufC9MhmewDQ/edit) on ML for climate science." ] }, { "cell_type": "markdown", "id": "ed4313fa", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- **hyper-parameters:** Variables that influence the **training** and the **model** but are not estimated during training.\n", "- **unsupervised learning:** Extract information and structure from the data without **training** with known **labels**. We will see clustering and Principal Component Analysis (PCA).\n", "\n", "- **supervised learning:** Fit a model to data with known **labels** to \"train\" it for making predictions. Examples: regression, classification, spam detection, recommendation systems. We'll see k-nearest neighbors (KNN), a classification method, in this tutorial." ] }, { "cell_type": "markdown", "id": "97fe276c", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Unsupervised: PCA\n", "\n", "The dataset we will use consists of Red, Green, Blue composites (**features**) from photos of plastic pellets. We also have some extra information on the pellet size, shape, etc.\n", "\n", "The **labels** are the yellowing index. The goal is to predict the yellowing based on the pellet's image, broken down into its RGB info." 
] }, { "cell_type": "code", "execution_count": 1, "id": "6f449af4", "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "048ff44d", "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "

| image | r | g | b | size (mm) | color | description | erosion | erosion index | yellowing | yellowing index |
|---|---|---|---|---|---|---|---|---|---|---|
| cl1_p11_moca2_deixa5_a0001 | 152 | 150 | 143 | 4.021 | transparent | sphere | high erosion | 3 | low | 1 |
| cl1_p12_lagoinha_deixa1_g0006 | 221 | 218 | 219 | 4.244 | white | light erosion | low erosion | 1 | low | 1 |
| cl1_p12_lagoinha_deixa1_g0007 | 140 | 137 | 129 | 3.946 | white | not erosion | low erosion | 1 | low | 1 |
| cl1_p12_lagoinha_deixa1_g0008 | 188 | 178 | 146 | 3.948 | white | moderate erosion | high erosion | 3 | moderate | 2 |
| cl1_p12_lagoinha_deixa2_h0004 | 207 | 200 | 189 | 6.043 | white | light erosion | low erosion | 1 | moderate | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| cl1_p6_moca2_deixa3_a0006 | 186 | 193 | 155 | 4.546 | transparent | cylinder | moderate erosion | 2 | low | 1 |
| cl1_p8_moca2_deixa5_b0001 | 169 | 168 | 106 | 3.082 | transparent | sphere | low erosion | 1 | low | 1 |
| cl1_p8_moca2_deixa5_b0003 | 191 | 189 | 152 | 3.932 | white | sphere | low erosion | 1 | low | 1 |
| cl1_p8_moca2_deixa5_b0004 | 181 | 156 | 70 | 3.230 | white | sphere | moderate erosion | 3 | moderate | 2 |
| cl1_p9_moca2_deixa5_b0001 | 193 | 192 | 198 | 3.763 | transparent | sphere | high erosion | 3 | low | 1 |

127 rows × 10 columns
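
To connect the vocabulary to the table above: the `r`, `g`, `b` columns play the role of **features** and `yellowing index` is the **label**. The sketch below is only an illustration, not the notebook's own code. It rebuilds the first five displayed rows by hand (the real notebook loads all 127 rows from a file that is not shown here) and runs scikit-learn's fit/predict pattern with a 1-nearest-neighbor classifier, which is an assumed choice made for the example.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# First five rows of the displayed table, copied by hand for illustration.
df = pd.DataFrame(
    {
        "r": [152, 221, 140, 188, 207],
        "g": [150, 218, 137, 178, 200],
        "b": [143, 219, 129, 146, 189],
        "yellowing index": [1, 1, 1, 2, 2],
    },
    index=pd.Index(
        [
            "cl1_p11_moca2_deixa5_a0001",
            "cl1_p12_lagoinha_deixa1_g0006",
            "cl1_p12_lagoinha_deixa1_g0007",
            "cl1_p12_lagoinha_deixa1_g0008",
            "cl1_p12_lagoinha_deixa2_h0004",
        ],
        name="image",
    ),
)

# Features: the predictors. Labels: the quantity we want to predict.
features = df[["r", "g", "b"]]
labels = df["yellowing index"]

# n_neighbors is a hyper-parameter: we choose it, it is not estimated in training.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(features, labels)

# Predicting on the training rows only demonstrates the API;
# it says nothing about how well the model generalizes.
predicted = model.predict(features)
print(predicted)
```

With so few rows this is a toy; a real evaluation would hold out part of the data before calling `fit`.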

| | TS 1 | TS 2 | TS 3 |
|---|---|---|---|
| PC 1 | 0.106617 | -0.020035 | 0.796085 |
| PC 2 | 0.038558 | -0.015377 | 0.045291 |
| PC 3 | 0.118255 | -0.029110 | -0.071213 |
| PC 4 | 0.032419 | -0.036895 | -0.025253 |
| PC 5 | 0.037738 | -0.033506 | 0.003937 |
| ... | ... | ... | ... |
| PC 123 | 0.037121 | 0.082580 | 0.065741 |
| PC 124 | 0.023213 | 0.049475 | 0.000340 |
| PC 125 | 0.027663 | 0.022751 | -0.036524 |
| PC 126 | -0.032039 | -0.082355 | 0.004122 |
| PC 127 | 0.076820 | -0.015290 | -0.064717 |

127 rows × 3 columns
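
The code that produced the PCA output is not part of this excerpt, so the following is a minimal sketch under stated assumptions rather than a reproduction of it. It takes the RGB values of the ten pellet rows displayed earlier, standardizes them, and projects them onto two principal components with scikit-learn's `PCA`. The standardization step and `n_components=2` are choices made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# r, g, b values of the ten pellet rows visible in the displayed table.
rgb = np.array(
    [
        [152, 150, 143],
        [221, 218, 219],
        [140, 137, 129],
        [188, 178, 146],
        [207, 200, 189],
        [186, 193, 155],
        [169, 168, 106],
        [191, 189, 152],
        [181, 156, 70],
        [193, 192, 198],
    ],
    dtype=float,
)

# Put each channel on zero mean and unit variance (an assumed preprocessing step).
scaled = StandardScaler().fit_transform(rgb)

# No labels are passed to fit_transform: PCA is unsupervised.
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

print(scores.shape)                   # one row per pellet, one column per component
print(pca.explained_variance_ratio_)  # fraction of variance captured by each component
```

Because the three color channels tend to vary together, the first component usually captures most of the variance, which is what makes PCA useful for compressing correlated features.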