1. Learning Machine Learning Together -- Data_exploration

Published: 2024-03-28
Prerequisites
  • Basic familiarity with Pandas
  • Basic familiarity with Pyplot

Outline

Data exploration

In this notebook, we will learn some basic Python functions that help you gain a first overview whenever you face an unknown data set.

You will work with a data set that consists of climbing data of Mount Rainier. Mount Rainier is a 4,392 meters high stratovolcano in Washington, USA, and is considered difficult to summit. You must download the data set from Blackboard.

We start with importing the required packages: numpy, pandas, and matplotlib.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Section 1: Data Loading

Now we load the data set into a pandas data frame, which is the standard representation of a matrix with row and column names in Python.

When using Google Colab, you first need to upload the data set to your notebook. This can be done with files.upload() from Google Colab’s own Python package google.colab.

# only use when running this notebook in Google Colab, ignore this cell when running this notebook on your own machine
from google.colab import files

climbing_data = files.upload()
# load data as pandas data frames
climbing = pd.read_csv('climbing_statistics.csv', parse_dates=['Date'])

It is important to tell the read_csv function which columns contain dates via the parse_dates argument, so that the dates are loaded as datetime objects instead of strings. This is necessary to enable sorting rows by date columns and other operations related to date and time.
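The effect of parse_dates can be seen on a small in-memory CSV; the rows below are a hypothetical miniature stand-in for the climbing data, not taken from the real file.

```python
from io import StringIO

import pandas as pd

# hypothetical miniature CSV mimicking the layout of the climbing data
csv_text = """Date,Route,Attempted,Succeeded
11/27/2015,Disappointment Cleaver,2,0
11/21/2015,Disappointment Cleaver,3,0
"""

# without parse_dates the Date column stays a column of plain strings
as_strings = pd.read_csv(StringIO(csv_text))
# with parse_dates it becomes a proper datetime column
as_dates = pd.read_csv(StringIO(csv_text), parse_dates=['Date'])

print(as_strings['Date'].dtype)  # object -- plain strings
print(as_dates['Date'].dtype)    # datetime64[ns]
```

Only the datetime version sorts chronologically; string dates would sort lexicographically instead.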

The first step should always be to see the first few rows of your data set. This can easily be done with .head() when the object preceding it is a pandas data frame.

In the parentheses of .head(), you can specify how many rows you would like to see (default is 5), and the function .tail() lets you see the last few rows.

We print here the first 10 rows.

climbing.head(10)

To get an overview of the size of the data set, we look at the .shape attribute of the data frame.

# get shape of data
climbing.shape
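The interplay of .head(), .tail(), and .shape can be sketched on a small hypothetical frame (the column names below merely imitate the climbing data):

```python
import pandas as pd

# hypothetical frame standing in for the climbing data
df = pd.DataFrame({'Attempted': range(12), 'Succeeded': [0] * 12})

print(df.head(3))  # first 3 rows
print(df.tail(3))  # last 3 rows
print(df.shape)    # (12, 2) -- (number of rows, number of columns)
```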

Section 2: Data Cleaning

However, something has gone wrong in the rows with indices 5 and 6 (among others). The Date and the Route taken to the summit are exactly the same, so why has this not been summarised in one single row? Such issues are very common in real data sets, and dealing with these inconsistencies is part of so-called data cleaning.

We will present two different methods for how this data set can be cleaned, i.e., made ready to be used by machine learning models.

Method 1: Using higher-level Pandas operations

# aggregating the rows when Date and Route are identical
climbing_clean_0 = (climbing.groupby(['Date', 'Route'])
                            .sum()
                            .sort_values(['Date', 'Route'], ascending=False)
                            .reset_index())
# recalculate the success percentage
climbing_clean_0['Success Percentage'] = climbing_clean_0['Succeeded'] / climbing_clean_0['Attempted']

Now check the data frame after aggregating the rows by Date and Route:

climbing_clean_0.head(10)
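To see the groupby aggregation in isolation, here is a minimal sketch on a hypothetical frame in which one Date/Route pair is split over two entries (the values are made up for illustration):

```python
import pandas as pd

# hypothetical rows: the first two share the same Date and Route
raw = pd.DataFrame({
    'Date': pd.to_datetime(['2015-11-27', '2015-11-27', '2015-11-21']),
    'Route': ['Disappointment Cleaver'] * 3,
    'Attempted': [2, 3, 4],
    'Succeeded': [0, 2, 1],
})

# collapse duplicate Date/Route pairs by summing their counts
clean = (raw.groupby(['Date', 'Route'])[['Attempted', 'Succeeded']]
            .sum()
            .sort_values(['Date', 'Route'], ascending=False)
            .reset_index())
# recompute the percentage from the aggregated counts
clean['Success Percentage'] = clean['Succeeded'] / clean['Attempted']

print(clean)
```

The two 2015-11-27 rows collapse into one with Attempted = 5 and Succeeded = 2; summing the raw Success Percentage values would be meaningless, which is why it is recalculated afterwards.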

Method 2: Looping over the rows

# create a sorted copy from climbing dataframe, with indices 0, 1, ..., n-1
climbing_clean_1 = climbing.sort_values(['Date', 'Route'], ascending=False, ignore_index=True)

# aggregating the rows when Date and Route are identical
for index, row in climbing_clean_1.iterrows():
    # skip the first row
    if index == 0:
        continue

    # check if Date and Route are identical with predecessor
    if (climbing_clean_1.loc[index-1, 'Date'] == climbing_clean_1.loc[index, 'Date'] and \
        climbing_clean_1.loc[index-1, 'Route'] == climbing_clean_1.loc[index, 'Route']):

        # aggregate Attempted and Succeeded
        climbing_clean_1.loc[index, 'Attempted'] += climbing_clean_1.loc[index-1, 'Attempted']
        climbing_clean_1.loc[index, 'Succeeded'] += climbing_clean_1.loc[index-1, 'Succeeded']

    # recalculate the Success Percentage
    climbing_clean_1.loc[index, 'Success Percentage'] = climbing_clean_1.loc[index, 'Succeeded'] / climbing_clean_1.loc[index, 'Attempted']
# check
climbing_clean_1.tail(20)

You can see that we were able to sum up the number of Attempted summits, but both rows are still in our data frame. Pandas has a one-line command to delete the first of these duplicates and keep only the last.

climbing_clean_1 = climbing_clean_1.drop_duplicates(subset=['Date', 'Route'], keep='last')
# check
climbing_clean_1.head(10)
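The keep='last' behaviour of drop_duplicates can be sketched on a hypothetical frame in which, as in our loop above, the later duplicate row already holds the accumulated totals:

```python
import pandas as pd

# hypothetical frame: the second duplicate row holds the running totals
df = pd.DataFrame({
    'Date': ['2015-11-27', '2015-11-27', '2015-11-21'],
    'Route': ['DC', 'DC', 'DC'],
    'Attempted': [2, 5, 4],  # 5 already contains the accumulated sum
})

# keep only the last row of each Date/Route group
deduped = df.drop_duplicates(subset=['Date', 'Route'], keep='last')
print(deduped)  # the partial row with Attempted=2 is gone
```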

Now we have the data frame that we wanted. The indices can simply be reset with .reset_index().

climbing_clean_1 = climbing_clean_1.reset_index(drop=True)  # with drop=True, drop the old index and create a new one
# check
climbing_clean_1.head(10)

Check whether the results from Method 1 match the results from Method 2:

# Any differences will be shown
climbing_clean_0.compare(climbing_clean_1)
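For reference, the behaviour of DataFrame.compare can be seen on a tiny hypothetical example: it returns only the differing cells, and an empty frame when the two inputs are identical.

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [1, 9, 3]})

print(a.compare(b))        # one row showing self=2 vs other=9
print(a.compare(a).empty)  # True -- identical frames yield an empty frame
```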

As expected, both methods lead to the same result, so we continue with one of the two (equivalent) cleaned data sets.

# define cleaned data set to be used in the following
climbing_clean = climbing_clean_0

We can see that the number of rows has decreased (by a factor of 4) after data cleaning.

# get shape of data
climbing_clean.shape

Section 3: More Data Cleaning

Next, we explore other inconsistencies in the data set manually. After reviewing some random rows by hand, we find that on some dates there were more Succeeded summits than Attempted summits. This must be a mistake, so we delete these rows.

# deleting rows, where number of successes is higher than number of attempts
mistake_rows = climbing_clean[climbing_clean['Succeeded'] > climbing_clean['Attempted']]
# check
mistake_rows
# delete these rows
climbing_clean = climbing_clean.drop(index=mistake_rows.index)
# deleted?
climbing_clean[climbing_clean['Succeeded'] > climbing_clean['Attempted']]

Success! All rows that had a higher number of successes than attempts are deleted.

# reset index again
climbing_clean = climbing_clean.reset_index(drop=True)  # with drop=True, drop the old index and create a new one

# check
climbing_clean.head()
# the size of the data set has decreased again
climbing_clean.shape

Section 4: Descriptive Statistics

It’s always a good idea to compute some basic statistics of our data in the beginning. We start with simple means and standard deviations.

climbing_clean.describe()
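What describe() reports can be illustrated on a hypothetical numeric frame: for each numeric column it returns the count, mean, standard deviation, min, quartiles, and max.

```python
import pandas as pd

# hypothetical numeric columns imitating the climbing data
df = pd.DataFrame({'Attempted': [2.0, 4.0, 6.0], 'Succeeded': [1.0, 2.0, 3.0]})

stats = df.describe()     # count, mean, std, min, quartiles, max per column
print(stats.loc['mean'])  # Attempted 4.0, Succeeded 2.0
print(stats.loc['std'])   # sample standard deviations
```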

Section 5: Visualization

We can also plot the attempts and successes over time. Pandas has a special data type for dates, which it calls datetime, and since we passed parse_dates to read_csv when loading, our Date column should already be stored as one.

Let’s first check which data type the values in our column Date are actually stored as.

# check data types
climbing_clean.dtypes

fig, ax = plt.subplots(figsize=(14, 10))

for col in ['Attempted', 'Succeeded']:
    ax.bar(climbing_clean['Date'], climbing_clean[col], alpha=0.5, label=col)
ax.legend(fontsize=14)

# rotates and right aligns the x labels, and moves the bottom of the axes up to make room for them
fig.autofmt_xdate();