About the Iris Dataset
The Iris dataset is one of the most famous datasets in machine learning, originally introduced by statistician and biologist Ronald Fisher in 1936. It has become a standard benchmark for testing machine learning algorithms due to its simplicity and well-defined structure.
Dataset Overview
The dataset contains 150 samples of iris flowers from three different species:
- Setosa (50 samples)
- Versicolor (50 samples)
- Virginica (50 samples)
Each sample is described by four features:
- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)
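The structure described above can be checked directly. The snippet below is a minimal sketch that assumes the copy of the dataset bundled with scikit-learn (`sklearn.datasets.load_iris`); the data actually delivered to models in a Crunch may be packaged differently.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled copy of the Iris dataset stands in
# for whatever data source the competition actually provides.
iris = load_iris()
X, y = iris.data, iris.target

# 150 samples, each with 4 feature measurements in centimetres.
print(X.shape)
print(iris.feature_names)

# Count how many samples belong to each species.
species_counts = Counter(str(iris.target_names[label]) for label in y)
print(species_counts)
```

The printed shape and counts should match the overview: 150 rows, 4 columns, and 50 samples per species.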
The Prediction Task
The goal is to predict the species of an iris flower based on its physical measurements. This is a multi-class classification problem where models must distinguish between the three species using only the four feature measurements. In this example Crunch implementation, there are three phases:
- Training Phase: Models receive historical iris measurements with known species labels
- Prediction Phase: Models must predict species for new iris measurements (without labels)
- Scoring Phase: Predictions are evaluated against ground truth to determine model performance
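The three phases can be sketched locally as follows. This is an illustration under stated assumptions, not the Crunch interface: the train/holdout split, the `DecisionTreeClassifier`, and the variable names are all choices made for this example, and scoring uses scikit-learn's `accuracy_score`.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Assumption: a stratified 70/30 split simulates "historical" data with
# labels and "new" measurements whose labels are hidden from the model.
X_train, X_new, y_train, y_truth = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Training phase: fit on measurements with known species labels.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Prediction phase: predict species for unseen measurements (no labels given).
predictions = model.predict(X_new)

# Scoring phase: compare predictions with the held-back ground truth.
score = accuracy_score(y_truth, predictions)
print(f"accuracy: {score:.3f}")
```

Any classifier exposing `fit` and `predict` could be substituted for the decision tree; the phase boundaries stay the same.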
Scoring of the prediction
Each model's predictions are scored by comparing the predicted species with the ground-truth species, using accuracy_score from the sklearn library. The score is the fraction of samples whose species is predicted correctly.
Payout definitions
The payout is distributed to the top 3 models based on the accuracy score:
- 1st place: 50% of the prize pool
- 2nd place: 30% of the prize pool
- 3rd place: 20% of the prize pool
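As a worked example of this split, the sketch below ranks models by accuracy and assigns the 50/30/20 shares. The function name `distribute_payout`, the model names, the scores, and the prize pool of 1000 are hypothetical. The source does not say how ties are handled; here tied models simply keep their input order, which is an assumption.

```python
# Percentage of the prize pool for 1st, 2nd, and 3rd place.
PAYOUT_PERCENTAGES = (50, 30, 20)


def distribute_payout(scores: dict[str, float], prize_pool: float) -> dict[str, float]:
    """Return the payout for each of the top 3 models by accuracy score."""
    # Highest accuracy first; sorted() is stable, so ties keep input order
    # (an assumption, since the tie-breaking rule is not specified).
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return {
        name: prize_pool * percentage / 100
        for (name, _), percentage in zip(ranked, PAYOUT_PERCENTAGES)
    }


# Hypothetical accuracy scores for four competing models.
scores = {"model_a": 0.96, "model_b": 0.98, "model_c": 0.91, "model_d": 0.93}
payouts = distribute_payout(scores, prize_pool=1000.0)
print(payouts)  # {'model_b': 500.0, 'model_a': 300.0, 'model_d': 200.0}
```

Models ranked fourth or lower, such as `model_c` here, receive nothing.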
Base and Quickstarter Models
In the next section we will walk through the implementation of the Model Package and Quickstarter Models, which illustrates the core prediction task the Coordinator is trying to solve.