Linear Regression and Logistic Regression with TensorFlow (Part 6) - EW帮帮网

6. The Peculiar Learning Rate

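Now that you understand what a neuron is, let's discuss what it means for a neuron (and, more generally, a neural network) to learn. This lets us introduce concepts such as hyperparameters and the learning rate. In almost all neural network problems, learning means finding the weights (remember that a neural network is composed of many neurons, each with its own weights) and biases that minimize the network's cost function, usually denoted J.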

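In calculus there are several methods for finding the minimum of a given function analytically. Unfortunately, in practically all neural network applications the number of weights is too large for these methods to be usable. Numerical methods must be used instead, the best known being gradient descent. It is the easiest method to understand, and it is the basis for understanding the more complex algorithms that come later. We briefly outline how it works, because it is the best algorithm in machine learning for introducing readers to the concept of the learning rate.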

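The idea of gradient descent is to find the minimum of a function by repeatedly following the negative gradient. Algorithmically, this update rule is expressed as

$$W \leftarrow W - \alpha \nabla W$$

where α is the step size, which dictates how much weight is given to the new gradient ∇W. The idea is to take many small steps, each in the direction of −∇W. Note that ∇W is itself a function of W, so the actual step changes at every iteration. Each step makes a small update to the weight matrix W. This iterative process of updating is called learning the weight matrix W.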

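Given a generic function J(w), where w is the weight vector, the location of the minimum in weight space (that is, the value of w for which J(w) is smallest) can be found with an algorithm based on the following steps:

Iteration 0: choose a random initial guess w_0.
Iteration n + 1 (with n starting from 0): the weights at iteration n + 1, w_{n+1}, are updated from the weights of the previous iteration, w_n, using the formula

$$w_{n+1} = w_n - \gamma \nabla J(w_n)$$

Here ∇J(w) denotes the gradient of the cost function, a vector whose components are the partial derivatives of the cost function with respect to each component of the weight vector w:

$$\nabla J(w) = \left( \frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, \ldots \right)$$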
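The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the book's code: the helper `gradient_descent`, the toy cost J(w) = ||w − target||² / 2, and the fixed starting point (used instead of a random one so that the run is reproducible) are all assumptions made for this example.

```python
import numpy as np

def gradient_descent(grad_J, w_init, gamma, n_iterations):
    """Apply w_{n+1} = w_n - gamma * grad_J(w_n) a fixed number of times."""
    w = np.asarray(w_init, dtype=float)
    history = [w.copy()]
    for _ in range(n_iterations):
        w = w - gamma * grad_J(w)
        history.append(w.copy())
    return np.array(history)

# Hypothetical toy cost J(w) = ||w - target||^2 / 2, whose gradient is w - target.
target = np.array([2.0, 0.5])

def grad_J(w):
    return w - target

# Fixed starting point (the text suggests a random one) and an arbitrary gamma.
history = gradient_descent(grad_J, w_init=[0.0, 0.0], gamma=0.5, n_iterations=20)
print(history[-1])  # last estimate of the weights
```

Each row of `history` is one estimate w_n, so the array can be inspected or plotted to see how the series approaches the minimum.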

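To decide when to stop, we could check when J(w) stops changing by much. In other words, you could define a threshold ϵ and stop at any iteration q > k (where k is an integer you need to find) that satisfies |J(w_{q+1}) − J(w_q)| < ϵ for all q > k. The problem with this approach is that it is complicated, and the check is expensive in Python (remember, you will perform this step a great many times), so people usually let the algorithm run for a fixed number of iterations and check the final result. If the result is not what they hoped for, they increase that fixed number. By how much? That depends on your problem. What you do is choose a number of iterations (for example, 10,000 or 1,000,000) and let the algorithm run. At the same time, you plot the cost function against the number of iterations and check that the number you chose is sensible. Later in this chapter I give examples of choosing the number of iterations. For now, you only need to know that you stop the algorithm after a certain number of iterations.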
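To make the two stopping strategies concrete, here is a small sketch that combines them: a fixed iteration budget with an optional early exit when the change in J falls below ϵ. The toy cost, the values of `gamma`, `epsilon`, and `max_iterations` are all assumptions chosen for illustration. Note also that the loop stops the first time the change drops below the threshold, which is a simplification of the "for all q > k" condition in the text.

```python
def J(w):
    # Hypothetical toy cost with its minimum at w = 3.
    return 0.5 * (w - 3.0) ** 2

def grad_J(w):
    return w - 3.0

gamma = 0.1              # learning rate (illustrative value)
epsilon = 1e-8           # threshold on the change of J between iterations
max_iterations = 10_000  # fixed budget, used as a safety net

w = 0.0
costs = [J(w)]
for _ in range(max_iterations):
    w = w - gamma * grad_J(w)
    costs.append(J(w))
    # Early exit: J changed by less than epsilon in the last step.
    if abs(costs[-1] - costs[-2]) < epsilon:
        break

iterations_used = len(costs) - 1
print(iterations_used, w)
```

The list `costs` is exactly what you would plot against the iteration number to judge whether the budget was sensible.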

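Note: Why the algorithm converges to the minimum is beyond the scope of this book and would distract the reader from the learning goal. The goal of this chapter is for you to understand the effect of the learning rate and the consequences of choosing one that is too small or too large.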

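We assume here that the cost function is differentiable. This is often not the case, but a discussion of the issue is beyond the scope of this book. In such cases, people tend to take a practical approach. In practice the implementation works well, so this kind of theoretical problem is usually ignored. Remember that in deep learning models the cost function is a very complicated function, and studying it is nearly impossible.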

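After a certain number of iterations, the series w_n will hopefully converge to the location of the minimum. The parameter γ is called the learning rate, and it is one of the most important parameters of the neural network learning process.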

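Note: To distinguish it from the weights, the learning rate is called a hyperparameter. We will encounter many more hyperparameters. Their values are not obtained through training but are usually set at the outset. In contrast, the parameters w and b are obtained through training.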

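The word "hopefully" was chosen for good reason. It is possible that the algorithm will not converge to the minimum. It is also possible that w_n oscillates between values without converging, or diverges altogether. If you choose γ too large or too small, your model will not converge (or will converge very slowly). To understand why, let's look at a practical example and see how the method behaves with different learning rates.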

Learning Rate in a Practical Example

Consider a dataset of m = 30 observations, with y generated by the following code.

#List3-16

import numpy as np

m = 30

w0 = 2

w1 = 0.5

x = np.linspace(-1, 1, m)

y = w0 + w1 * x

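As the cost function, we choose the classical mean squared error (MSE), here with the extra factor of 1/2 that also appears in the code below:

$$J(w_0, w_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( y^{(i)} - f(w_0, w_1, x^{(i)}) \right)^2$$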

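Here the superscript (i) denotes the i-th observation. Remember that with the subscript i (as in x_i) we indicate the i-th feature. In our notation, $x_j^{(i)}$ is the j-th feature of the i-th observation. In this example we have only one feature, so we do not need the subscript j. The cost function is easy to implement: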

np.average((y - hypothesis(x, w0, w1))**2) / 2

where we have defined

def hypothesis(x, w0, w1): return w0 + w1*x

Our goal is to find the values of w_0 and w_1 that minimize J(w_0, w_1).

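To apply the gradient descent method, we must compute the series w_{0,n} and w_{1,n}. We have the following equations:

$$w_{0,n+1} = w_{0,n} - \gamma \frac{\partial J(w_{0,n}, w_{1,n})}{\partial w_0}$$

$$w_{1,n+1} = w_{1,n} - \gamma \frac{\partial J(w_{0,n}, w_{1,n})}{\partial w_1}$$

Computing the partial derivatives, with J taken as half the mean squared error as in the cost code above, simplifies these equations to

$$w_{0,n+1} = w_{0,n} - \frac{\gamma}{m} \sum_{i=1}^{m} \left( f(w_{0,n}, w_{1,n}, x^{(i)}) - y^{(i)} \right)$$

$$w_{1,n+1} = w_{1,n} - \frac{\gamma}{m} \sum_{i=1}^{m} \left( f(w_{0,n}, w_{1,n}, x^{(i)}) - y^{(i)} \right) x^{(i)}$$

because ∂f(w_0, w_1, x^{(i)})/∂w_0 = 1 and ∂f(w_0, w_1, x^{(i)})/∂w_1 = x^{(i)}. If we want to implement the gradient descent algorithm in code, these are the equations that must be implemented in Python.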
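Hand-derived gradients are easy to get wrong, so it can help to compare them against finite differences. The sketch below does this for the example dataset, assuming the cost is half the mean squared error (matching the "/2" in the cost code). The function names `cost`, `gradient`, and `numerical_gradient`, the test point (0.3, −1.2), and the step `h` are choices made for this illustration only.

```python
import numpy as np

m = 30
x = np.linspace(-1, 1, m)
y = 2 + 0.5 * x

def hypothesis(x, w0, w1):
    return w0 + w1 * x

def cost(w0, w1):
    # Half the mean squared error (assumed convention, see the lead-in).
    return np.mean((y - hypothesis(x, w0, w1)) ** 2) / 2

def gradient(w0, w1):
    # Analytic partial derivatives: mean residual, and mean residual times x.
    residual = hypothesis(x, w0, w1) - y
    return np.array([np.mean(residual), np.mean(residual * x)])

def numerical_gradient(w0, w1, h=1e-6):
    # Central finite differences as an independent check.
    d0 = (cost(w0 + h, w1) - cost(w0 - h, w1)) / (2 * h)
    d1 = (cost(w0, w1 + h) - cost(w0, w1 - h)) / (2 * h)
    return np.array([d0, d1])

analytic = gradient(0.3, -1.2)
numeric = numerical_gradient(0.3, -1.2)
print(analytic, numeric)
```

If the two vectors disagree beyond rounding error, the derivation or its implementation contains a mistake.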

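Note: The point of deriving the equations above is to show how quickly gradient descent becomes complicated, even in a very simple example. In the next section we build our first model with TensorFlow. The best aspect of this library is that all of these formulas are computed automatically, so you do not have to work anything out by hand. Implementing equations like the ones here and debugging them takes a long time, and it is practically impossible for large neural networks with many neurons.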

I have omitted the complete code from the book, because it would take up too much space.

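It is instructive to check how the model behaves as the learning rate changes. Figures 3-18, 3-19, and 3-20 show contour lines of the cost function, with the series (w_{0,n}, w_{1,n}) plotted on top to visualize its convergence (or lack of it). We consider γ = 0.8 (Figure 3-18), γ = 2 (Figure 3-19), and γ = 0.05 (Figure 3-20). The successive estimates w_n are shown as dots, and the minimum is marked with a circle.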

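In the first case (Figure 3-18), convergence is good: in only eight steps the method converges toward the minimum. With γ = 2 (Figure 3-19), the steps are too large (remember that each step is given by −γ∇J(w), so the larger γ is, the larger the step) and the method cannot get close to the minimum. It oscillates around the minimum without ever reaching it. In this case, the model never converges. In the last case, γ = 0.05 (Figure 3-20), learning is so slow that many more steps are needed to reach the minimum. In some cases the cost function is so flat around the minimum that the method needs a great many iterations to converge, and in practice you cannot reach the true minimum in a reasonable amount of time. In Figure 3-20, 300 iterations are plotted, yet the method has not come close to the minimum.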
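For this particular dataset the three behaviors can be checked with a short calculation. Assuming the cost is half the mean squared error and the update uses the mean residual (as in the listing's update step), the fact that mean(x) = 0 decouples the two weights: at every step the error in w_0 is multiplied by (1 − γ), and the error in w_1 by (1 − γ·mean(x²)). The sketch below only evaluates these factors; the dictionary `factors` is a name introduced for this illustration.

```python
import numpy as np

m = 30
x = np.linspace(-1, 1, m)

# Because mean(x) = 0 here, the two updates decouple:
#   the w0 error is multiplied by (1 - gamma) at every step,
#   the w1 error is multiplied by (1 - gamma * mean(x**2)).
mean_x2 = np.mean(x ** 2)

factors = {}
for gamma in (0.8, 2.0, 0.05):
    factors[gamma] = (1 - gamma, 1 - gamma * mean_x2)
    print(gamma, factors[gamma])
```

A factor with magnitude well below 1 means fast convergence, a factor close to 1 means slow convergence, and a factor of −1 (the w_0 factor for γ = 2) means the estimate jumps from one side of the minimum to the other without getting closer.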

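Note: Choosing the right learning rate is of paramount importance when you code the learning part of a neural network. If the learning rate is too large, the method bounces around the minimum without reaching it. If it is too small, the algorithm becomes so slow that you will struggle to find the minimum in a reasonable amount of time. A typical sign of a learning rate that is too large is that the cost function becomes nan ("not a number," in Python terminology). Printing the cost function at regular intervals during training is a good way to check for this kind of problem. It gives you the chance to stop the process and avoid wasting time (as soon as you see nan appear). Concrete examples appear in later chapters.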
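The monitoring advice above might look like the following sketch. It uses a deliberately simple toy cost J(w) = w·w rather than a neural network, and the function `train`, its parameters, and the two learning rates are assumptions made for illustration. The check treats inf the same way as nan, since in this toy problem the cost overflows to inf before any nan appears.

```python
import math

def train(gamma, n_iterations=2000, print_every=200):
    """Gradient descent on the toy cost J(w) = w*w; stops if J is not finite."""
    w = 1.0
    for n in range(n_iterations):
        loss = w * w
        if math.isnan(loss) or math.isinf(loss):
            print(f"iteration {n}: loss = {loss}, stopping early")
            return n
        if n % print_every == 0:
            print(f"iteration {n}: loss = {loss:.3e}")
        w = w - gamma * 2.0 * w  # the gradient of w*w is 2*w
    return None

stopped_small = train(gamma=0.1)  # shrinks steadily, never stops early
stopped_large = train(gamma=1.5)  # |w| doubles every step until J overflows
```

The return value is the iteration at which training was aborted, or `None` if the run completed normally.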

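In deep learning, every iteration costs time, and you have to repeat the process many times. Choosing the right learning rate is a key part of designing a good model, because it makes training much faster (or makes it impossible).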

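Figure 3-18. Gradient descent with good convergence behavior

Figure 3-19. Gradient descent when the learning rate is too large. The method cannot converge to the minimum.

Figure 3-20. Gradient descent when the learning rate is too small. The method is very slow and needs many iterations to reach the minimum.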

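Sometimes it is effective to change the learning rate during the learning process. You start with a larger value to approach the minimum quickly, and then gradually reduce it to make sure you settle close to the minimum. I discuss this approach later.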
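As a rough illustration of this idea, the sketch below applies an inverse-time decay, γ_n = γ_0 / (1 + decay · n), to the example dataset. The text does not prescribe a particular schedule, so the schedule itself and the constants `gamma_0` and `decay` are assumptions chosen only to show the mechanism.

```python
import numpy as np

m = 30
x = np.linspace(-1, 1, m)
y = 2 + 0.5 * x

def gradient(w):
    # Gradient of half the mean squared error for the line w[0] + w[1] * x.
    residual = w[0] + w[1] * x - y
    return np.array([np.mean(residual), np.mean(residual * x)])

gamma_0 = 1.5  # deliberately large initial learning rate (illustrative)
decay = 0.1    # hypothetical decay constant

w = np.zeros(2)
for n in range(200):
    gamma_n = gamma_0 / (1 + decay * n)  # inverse-time decay
    w = w - gamma_n * gradient(w)

print(w)  # estimate of (w0, w1) after 200 decayed steps
```

The early steps are large and move quickly toward the minimum, while the later, smaller steps refine the estimate.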

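Note: There is no fixed rule for choosing the learning rate. It depends on the model, the cost function, the starting point, and so on. A good rule of thumb is to start with γ = 0.05 and then see how the cost function behaves. It is common to plot J(w) against the number of iterations to check that it decreases, and how fast.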

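A good way to check convergence is to plot the cost function against the number of iterations. This lets you inspect its behavior. Figure 3-21 shows the cost function for the three learning rates. You can clearly see that with γ = 0.8 it goes to zero very quickly, indicating that we have reached the minimum. With γ = 2 it does not decrease at all; it stays close to its initial value. Finally, with γ = 0.05 it starts to decrease, but much more slowly than in the first case.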

Figure 3-21. Cost function vs. number of iterations (only the first eight points are shown)

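Here are the conclusions we can draw from Figure 3-21:

- γ = 0.05 → J is decreasing, which is good, but after eight iterations it has not reached a plateau. We need more iterations, until J no longer changes appreciably.

- γ = 2 → J does not decrease. We should check the learning rate; trying a smaller value is a good starting point.

- γ = 0.8 → The cost function decreases quickly and then stays at a constant value. This is a good sign that we have reached the minimum.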
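The three conclusions can also be checked numerically, without plotting. The following sketch reruns gradient descent on the example dataset for eight steps per learning rate and prints the first and last cost values. It assumes the half-MSE cost and a start at (0, 0) as in the listing; the helper names `cost` and `run` are introduced here for illustration.

```python
import numpy as np

m = 30
x = np.linspace(-1, 1, m)
y = 2 + 0.5 * x

def cost(w):
    # Half the mean squared error for the line w[0] + w[1] * x.
    return np.mean((w[0] + w[1] * x - y) ** 2) / 2

def run(gamma, n_steps=8):
    """Return the cost at the start and after each of n_steps updates."""
    w = np.zeros(2)
    costs = [cost(w)]
    for _ in range(n_steps):
        residual = w[0] + w[1] * x - y
        w = w - gamma * np.array([np.mean(residual), np.mean(residual * x)])
        costs.append(cost(w))
    return costs

histories = {gamma: run(gamma) for gamma in (0.8, 2.0, 0.05)}
for gamma, costs in histories.items():
    print(f"gamma={gamma}: J start={costs[0]:.4f}, J after 8 steps={costs[-1]:.4f}")
```

Comparing the last value with the first for each γ reproduces the qualitative picture: a fast drop, no progress, and a slow decrease.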

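Remember that the absolute value of the cost function is not relevant. What matters is its behavior. We can multiply the cost function by a positive constant without changing the location of its minimum, so do not look at the absolute value; check how fast the cost function changes and how it behaves. In addition, the cost function will almost never reach zero, so do not expect it to. The minimum of J is almost always not zero (it depends on the function itself). In the section on regression, you will see an example in which the cost function does not reach zero.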

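Note: When training your models, remember to always check the cost function against the number of iterations (or epochs). This gives you an effective way to judge whether training is working and useful, and it gives you hints on how to optimize it.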

Now that we have laid the groundwork, we will use a neuron to solve two simple problems: linear regression and logistic regression.

#List3-17

# Graphical illustration of gradient descent

import numpy as np

import random

import matplotlib.pyplot as plt

import matplotlib as mpl

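Plotting function

We will use the following function for some of the plotting; you can ignore it for now. However, the cell below must be run before the rest of the code.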

def myplot(x, y, name, xlab, ylab):
    # Draw y against x as a solid black line with labeled axes.
    # Note: the `name` argument is accepted but not used in the body.
    plt.rc('font', family='arial')
    plt.rc('xtick', labelsize='x-small')
    plt.rc('ytick', labelsize='x-small')
    fig = plt.figure(figsize=(8, 5))
    ax = fig.add_subplot(1, 1, 1)
    plt.tick_params(labelsize=16)
    ax.plot(x, y, ls='solid', color='black')
    ax.set_xlabel(xlab, fontsize=16)
    ax.set_ylabel(ylab, fontsize=16)
    # Adjust the layout once the figure and its labels exist.
    plt.tight_layout()

Plotting gradient descent

Plot with good convergence

import numpy as np

import matplotlib.pyplot as plt

# The data to fit

m = 30

theta0_true = 2

theta1_true = 0.5

x = np.linspace(-1,1,m)

y = theta0_true + theta1_true * x

def cost_func(theta0, theta1):
    # The cost function, J(theta0, theta1) describing the goodness of fit.
    theta0 = np.atleast_3d(np.asarray(theta0))
    theta1 = np.atleast_3d(np.asarray(theta1))
    return np.average((y-hypothesis(x, theta0, theta1))**2, axis=2)/2

def hypothesis(x, theta0, theta1):
    # Our "hypothesis function", a straight line.
    return theta0 + theta1*x

# First construct a grid of (theta0, theta1) parameter pairs and their

# corresponding cost function values.

theta0_grid = np.linspace(-1,4,101)

theta1_grid = np.linspace(-5,5,101)

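# theta0 varies along columns and theta1 along rows, which is the layout
# that np.meshgrid(theta0_grid, theta1_grid) and ax.contour expect.
J_grid = cost_func(theta0_grid[np.newaxis,:,np.newaxis],
                   theta1_grid[:,np.newaxis,np.newaxis])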

# Let's start with the plotting

fig, ax = plt.subplots(figsize=(12, 8))

plt.rc('font', family='arial')

plt.rc('xtick', labelsize='x-small')

plt.rc('ytick', labelsize='x-small')

plt.tick_params(labelsize=16)

#CHECK:https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#numpy.newaxis

# A labeled contour plot for the RHS cost function

X, Y = np.meshgrid(theta0_grid, theta1_grid)

contours = ax.contour(X, Y, J_grid, 30, colors='k')

ax.clabel(contours)

# The target parameter values indicated on the cost function contour plot

ax.scatter([theta0_true]*2,[theta1_true]*2,s=[50,10], color=['k','w'])

# Take N steps with learning rate alpha down the steepest gradient,

# starting at (theta0, theta1) = (0, 0).

N = 8

alpha = 0.8  # learning rate; 0.8 matches the value used in the text and the legend

theta = [np.array((0,0))]
# cost_func returns a (1, 1) array for scalar inputs; keep only the scalar.
J = [cost_func(*theta[0])[0,0]]
for j in range(N-1):
    last_theta = theta[-1]
    this_theta = np.empty((2,))
    this_theta[0] = last_theta[0] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y))
    this_theta[1] = last_theta[1] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y) * x)
    theta.append(this_theta)
    J.append(cost_func(*this_theta)[0,0])

# Annotate the cost function plot with coloured points indicating the

# parameters chosen and red arrows indicating the steps down the gradient.

# Also plot the fit function on the LHS data plot in a matching colour.

colors = ['b', 'g', 'm', 'c', 'orange']

for j in range(1,N):
    ax.annotate('', xy=theta[j], xytext=theta[j-1],
                arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},
                va='center', ha='center')
# cmap has no effect without color values, so it is omitted here.
ax.scatter(*zip(*theta), s=80, lw=0)

# Labels, titles and a legend.

ax.set_xlabel(r'$w_0$', fontsize = 16)

ax.set_ylabel(r'$w_1$', fontsize = 16)

ax.set_title('Cost function', fontsize = 16)

plt.show()

Figure 3-22

Plot with poor convergence

J1=J

import numpy as np

import matplotlib.pyplot as plt

# The plot: LHS is the data, RHS will be the cost function.

fig, ax = plt.subplots(figsize=(12, 8))

plt.rc('font', family='arial')

plt.rc('xtick', labelsize='x-small')

plt.rc('ytick', labelsize='x-small')

plt.tick_params(labelsize=16)

# A labeled contour plot for the RHS cost function

X, Y = np.meshgrid(theta0_grid, theta1_grid)

contours = ax.contour(X, Y, J_grid, 30, colors='k')

ax.clabel(contours)

# The target parameter values indicated on the cost function contour plot

ax.scatter([theta0_true]*2,[theta1_true]*2,s=[50,10], color=['k','w'])

# Take N steps with learning rate alpha down the steepest gradient,

# starting at (theta0, theta1) = (0, 0).

N = 8

alpha = 2

theta = [np.array((0,0))]
# cost_func returns a (1, 1) array for scalar inputs; keep only the scalar.
J = [cost_func(*theta[0])[0,0]]
for j in range(N-1):
    last_theta = theta[-1]
    this_theta = np.empty((2,))
    this_theta[0] = last_theta[0] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y))
    this_theta[1] = last_theta[1] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y) * x)
    theta.append(this_theta)
    J.append(cost_func(*this_theta)[0,0])

# Annotate the cost function plot with coloured points indicating the

# parameters chosen and red arrows indicating the steps down the gradient.

# Also plot the fit function on the LHS data plot in a matching colour.

colors = ['b', 'g', 'm', 'c', 'orange']

for j in range(1,N):
    ax.annotate('', xy=theta[j], xytext=theta[j-1],
                arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},
                va='center', ha='center')
# cmap has no effect without color values, so it is omitted here.
ax.scatter(*zip(*theta), s=80, lw=0)

# Labels, titles and a legend.

ax.set_xlabel(r'$w_0$', fontsize = 16)

ax.set_ylabel(r'$w_1$', fontsize = 16)

ax.set_title('Cost function', fontsize = 16)

plt.show()

Figure 3-23

Plot with a learning rate that is too small

J2=J

import numpy as np

import matplotlib.pyplot as plt

# The plot: LHS is the data, RHS will be the cost function.

fig, ax = plt.subplots(figsize=(12, 8))

plt.rc('font', family='arial')

plt.rc('xtick', labelsize='x-small')

plt.rc('ytick', labelsize='x-small')

plt.tick_params(labelsize=16)

# First construct a grid of (theta0, theta1) parameter pairs and their

# corresponding cost function values.

theta0_grid = np.linspace(-1,4,101)

theta1_grid = np.linspace(-5,5,101)

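# theta0 varies along columns and theta1 along rows, which is the layout
# that np.meshgrid(theta0_grid, theta1_grid) and ax.contour expect.
J_grid = cost_func(theta0_grid[np.newaxis,:,np.newaxis],
                   theta1_grid[:,np.newaxis,np.newaxis])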

# A labeled contour plot for the RHS cost function

X, Y = np.meshgrid(theta0_grid, theta1_grid)

contours = ax.contour(X, Y, J_grid, 30, colors='k')

ax.clabel(contours)

# The target parameter values indicated on the cost function contour plot

ax.scatter([theta0_true]*2,[theta1_true]*2,s=[50,10], color=['k','w'])

# Take N steps with learning rate alpha down the steepest gradient,

# starting at (theta0, theta1) = (0, 0).

N = 30

alpha = 0.05

theta = [np.array((0,0))]
# cost_func returns a (1, 1) array for scalar inputs; keep only the scalar.
J = [cost_func(*theta[0])[0,0]]
for j in range(N-1):
    last_theta = theta[-1]
    this_theta = np.empty((2,))
    this_theta[0] = last_theta[0] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y))
    this_theta[1] = last_theta[1] - alpha / m * np.sum(
                                    (hypothesis(x, *last_theta) - y) * x)
    theta.append(this_theta)
    J.append(cost_func(*this_theta)[0,0])

# Annotate the cost function plot with coloured points indicating the

# parameters chosen and red arrows indicating the steps down the gradient.

# Also plot the fit function on the LHS data plot in a matching colour.

colors = ['b', 'g', 'm', 'c', 'orange']

for j in range(1,N):
    ax.annotate('', xy=theta[j], xytext=theta[j-1],
                arrowprops={'arrowstyle': '->', 'color': 'r', 'lw': 1},
                va='center', ha='center')
# cmap has no effect without color values, so it is omitted here.
ax.scatter(*zip(*theta), s=80, lw=0)

# Labels, titles and a legend.

ax.set_xlabel(r'$w_0$', fontsize = 16)

ax.set_ylabel(r'$w_1$', fontsize = 16)

ax.set_title('Cost function', fontsize = 16)

plt.show()

fig.savefig('Figure_1-12'+'.pdf', format='pdf', dpi=300,bbox_inches='tight')

fig.savefig('Figure_1-12'+'.png', format='png', dpi=300,bbox_inches='tight')

Figure 3-24

Cost function for different learning rates

J3=J

plt.rc('font', family='arial')

#plt.rc('font',**{'family':'serif','serif':['Palatino']})

plt.rc('xtick', labelsize='x-small')

plt.rc('ytick', labelsize='x-small')

plt.tight_layout()

fig = plt.figure(figsize=(10,6))

ax = fig.add_subplot(1, 1, 1)

#x = np.linspace(1., 8., 30)

# Raw strings keep the backslash in \gamma from being read as an escape.
ax.plot(J1, ls='solid', color = 'black', label=r'$\gamma=0.8$')

ax.plot(J2, ls='dashed', color = 'black', label=r'$\gamma=2.0$')

ax.plot(J3, ls='dotted', color = 'black', label=r'$\gamma=0.05$')

ax.set_xlabel('Iterations', fontsize = 16)

ax.set_ylabel('Cost function $J$', fontsize = 16)

plt.xlim(0,8)

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., fontsize = 16)

Figure 3-25
