
A Visual Approach to Gradient Descent and Other Optimization Algorithms, by Julien Pascal (December 2022)


Photo by Kristen Munk from Pexels: https://www.pexels.com/photo/photo-of-person-walking-on-unpaved-pathway-2599546/

If you are like me, equations do not speak for themselves. To understand them, I need to see what they do on a concrete example. In this blog post, I apply this visualization principle to popular optimization algorithms used in machine learning.

Nowadays, the Adam algorithm is a very popular choice. Adam adds momentum and self-tuning of the learning rate to the plain-vanilla gradient descent algorithm. But what exactly are momentum and self-tuning?

Below is a visual preview of what these concepts refer to:

Behavior of several optimization algorithms. Source: author's calculations

To keep things simple, I use the different optimization algorithms on the bivariate linear regression model:

y = a + bx

The variable y represents a quantity we try to predict/explain using another variable x. The unknown parameters are the intercept a and the slope b.

To fit the model to the data, we minimize the mean squared difference between the model and the data, which can be written compactly as follows:

Loss(a, b) = (1/m)·||y − a − b·x||²

(assuming we have m observations and using the Euclidean norm)

By changing the values of a and b, we can hopefully improve the fit of the model to the data. A nice feature of the bivariate regression model is that we can plot the value of the loss function as a function of the unknown parameters a and b. Below is a surface plot of the loss function, with the black dot marking the minimum of the loss.

Loss function of the OLS problem. Source: author's calculations

We can also visualize the loss function using a contour plot, where the lines are level sets (points such that Loss(a, b) = constant). Below, the white point marks the minimum of the loss function.

Contour plot of the OLS loss function. Source: author's calculations

The plain-vanilla gradient descent algorithm consists in taking a step of size η in the direction of steepest descent, which is given by the opposite of the gradient. Mathematically, the update rule looks as follows:
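In standard notation, writing θ = (a, b) for the parameter vector and t for the iteration counter, the step is:

θ_t = θ_{t−1} − η·∇Loss(θ_{t−1})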

In the next plot, I show one trajectory implied by the gradient descent algorithm. Points represent the values of a and b across iterations, while arrows are the gradients of the loss function, telling us where to move in the next iteration.

A key feature is that the gradient descent algorithm may create oscillations between level sets. In a perfect world, we would instead move smoothly in the direction of the minimum. As we will see, adding momentum is one way to smooth the trajectory toward the minimum value.
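To make the update rule concrete, here is a minimal NumPy sketch of plain gradient descent on the OLS loss. It is a self-contained toy illustration (the sample size, noise level, and learning rate are chosen for this sketch only), not the PyTorch implementation listed at the end of the post:

import numpy as np

# Toy data: y = a + b*x with true values a = -2, b = 2, plus small noise
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 1000)
X = np.column_stack((np.ones_like(x), x))
y = -2.0 + 2.0 * x + rng.normal(0, 0.001, 1000)

def grad(beta):
    # Gradient of Loss(a, b) = (1/m) * ||y - X @ beta||^2
    return (2 / len(y)) * X.T @ (X @ beta - y)

eta = 0.10                      # learning rate
beta = np.array([9.0, 2.0])     # initial guess (intercept, slope)
for _ in range(100):
    beta = beta - eta * grad(beta)  # step against the gradient

print(beta)  # ends up close to (-2, 2)

With these toy values, the slope coordinate overshoots and oscillates from one iteration to the next, while the intercept coordinate converges more slowly, mirroring the oscillations discussed above.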

Gradient descent. Source: author's calculations

Momentum refers to the tendency of moving objects to keep moving in the same direction. In practice, we can add momentum to gradient descent by taking into account previous values of the gradient. This can be done as follows:
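One common way to write this (notation varies across references) is to keep a running direction d that accumulates past gradients:

d_t = γ·d_{t−1} + ∇Loss(θ_{t−1})
θ_t = θ_{t−1} − η·d_t

with d_0 = 0 and γ between 0 and 1.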

The higher the value of γ, the more the past values of the gradient are taken into account in the current update.

In the next plot, I show the trajectories implied by the gradient descent algorithm with momentum (in blue) and without momentum (in white).

Momentum reduces the fluctuations along the slope dimension. The large swings up and down tend to cancel out once the averaging effect of momentum starts to kick in. As a result, with momentum we move faster in the direction of the true value.

Gradient descent with momentum (blue) and without momentum (white). Source: author's calculations

Momentum is a nice twist to gradient descent. Another line of improvement consists in introducing a learning rate that is tailored to each parameter (in our example: one learning rate for the slope, one learning rate for the intercept).

But how should we choose such a coefficient-specific learning rate? Note that the previous plots show that the gradient does not necessarily point toward the minimum, at least not during the first iterations.

Intuitively, we would like to give less weight to the moves in the up/down direction and more weight to the moves in the left/right direction. The RMSprop updating rule embeds this desired property:
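A standard statement of the RMSprop update, matching the three-step description below, is:

g_t = ∇Loss(θ_{t−1})
v_t = γ·v_{t−1} + (1 − γ)·g_t²
θ_t = θ_{t−1} − η·g_t / (√v_t + ε)

where the square and the square root are applied element by element, and ε is a small constant that prevents division by zero.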

The first line simply defines g to be the gradient of the loss function. The second line calculates a running average of the squared gradient. In the third line, we take a step in the direction given by the gradient, rescaled by the square root of the running average of past squared gradients.

In our example, because the squared gradient tends to be large for the slope coefficient, we take small steps in that direction. The opposite holds for the intercept coefficient (small squared gradients, large moves).

RMSprop (blue) and gradient descent (white). Source: author's calculations

The Adam optimization algorithm has momentum, as well as the adaptive learning rate of RMSprop. Below is almost what Adam does:
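In one standard, slightly simplified formulation (leaving aside the bias correction discussed next):

g_t = ∇Loss(θ_{t−1})
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
θ_t = θ_{t−1} − η·m_t / (√v_t + ε)

with m_0 = v_0 = 0, and β₁, β₂ between 0 and 1.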

The updating rule is very similar to that of RMSprop. The key difference is momentum: the direction of change is given by a running average of past gradients.

The exact Adam updating rule uses "bias-corrected" values of m and v. In the first step, Adam initializes m and v to zero. To correct for this initialization bias, the authors suggest using reweighted versions of m and v:
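With t denoting the iteration counter, the bias-corrected quantities and the resulting update are:

m̂_t = m_t / (1 − β₁^t)
v̂_t = v_t / (1 − β₂^t)
θ_t = θ_{t−1} − η·m̂_t / (√v̂_t + ε)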

Below, we see that the trajectory induced by Adam is quite similar to the one given by RMSprop, but with a slower start.

Adam (blue) and gradient descent (white). Source: author's calculations

The next plot shows the trajectories induced by the four optimization algorithms described above.

Key results are as follows:

  • Gradient descent with momentum fluctuates less than gradient descent without momentum.
  • Adam and RMSprop take a different route, moving more slowly in the slope dimension and faster in the intercept dimension.
  • As expected, Adam displays some momentum: while RMSprop starts turning left toward the minimum, Adam has a harder time turning because of the accumulated momentum.

Behavior of several optimization algorithms. Source: author's calculations

Below is the same graph, but in 3D:

Behavior of several optimization algorithms (3D view). Source: author's calculations

In this blog post, my intention was for the reader to build an intuitive understanding of the key optimization algorithms used in machine learning.

Below you can find the code that was used to produce the graphs in this post. Do not hesitate to modify the learning rate and/or the loss function to see how this affects the different trajectories.

The following block of code loads the dependencies, defines the loss function, and plots it (surface and contour plots):

# A. Dependencies 
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator

plot_scale = 1.25
plt.rcParams["figure.figsize"] = (plot_scale*16, plot_scale*9)

import numpy as np
import pandas as pd
import random
import scipy.stats
from itertools import product
import os
import time
from math import sqrt
import seaborn as sns; sns.set()
from tqdm import tqdm as tqdm
import datetime
from typing import Tuple
class Vector: pass
from scipy.stats import norm
import torch
from torch import nn
from torch.utils.data import DataLoader
import copy
import matplotlib.ticker as mtick
from torchcontrib.optim import SWA
from numpy import linalg as LA
import imageio as io #create gif

# B. Create OLS problem
b0 = -2.0 # intercept
b1 = 2.0 # slope
beta_true = (b0, b1)
nb_vals = 1000 # number of draws

mu, sigma = 0, 0.001 # mean and standard deviation
shocks = np.random.normal(mu, sigma, nb_vals)

# covariates
x0 = np.ones(nb_vals) # constant
x1 = np.random.uniform(-5, 5, nb_vals)
X = np.column_stack((x0, x1))

# Data
y = b0*x0 + b1*x1 + shocks

# Closed-form OLS estimate, as a sanity check
A = np.linalg.inv(np.matmul(np.transpose(X), X))
B = np.matmul(np.transpose(X), y)
np.matmul(A, B)

X_torch = torch.from_numpy(X).float()
y_torch = torch.from_numpy(y).float()

# Loss function and gradient (for plotting)
def loss_function_OLS(beta_hat, X, y):
    loss = (1/len(y))*np.sum(np.square(y - np.matmul(X, beta_hat)))
    return loss

def grad_OLS(beta_hat, X, y):
    mse = loss_function_OLS(beta_hat, X, y)
    G = (2/len(y))*np.matmul(np.transpose(X), np.matmul(X, beta_hat) - y)
    return G, mse

# C. Plots for the loss function
min_val = -10.0
max_val = 10.0

delta_grid = 0.05
x_grid = np.arange(min_val, max_val, delta_grid)
y_grid = np.arange(min_val, max_val, delta_grid)
X_grid, Y_grid = np.meshgrid(x_grid, y_grid)

Z = np.zeros((len(y_grid), len(x_grid)))

for (y_index, y_value) in enumerate(y_grid):
    for (x_index, x_value) in enumerate(x_grid):
        beta_local = np.array((x_value, y_value))
        Z[y_index, x_index] = loss_function_OLS(beta_local, X, y)

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Plot the surface
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.2)

ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')

# Loss evaluated at the true parameters (height of the black dot marking the minimum)
loss_at_min = loss_function_OLS(np.array(beta_true), X, y)
ax.scatter([b0], [b1], [loss_at_min], s=100, c='black', linewidth=0.5)

x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min

plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

plt.ylabel('Slope')
plt.xlabel('Intercept')

fig.colorbar(surf, shrink=0.5, aspect=5)

filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
plt.scatter([b0], [b1], s=100, c='white', linewidth=0.5)
plt.ylabel('Slope')
plt.xlabel('Intercept')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)

filename = "IMGS/contour_loss.png"
plt.savefig(filename)
plt.show()

The next block of code defines functions so that we can solve the OLS problem using PyTorch. Here, using PyTorch is overkill, but the advantage is that we can use the pre-coded minimization algorithms (torch.optim):

def loss_OLS(model, y, X):
    """
    Loss function for OLS
    """
    R_squared = torch.square(y.unsqueeze(1) - model(X[:,1].unsqueeze(1)))
    return torch.mean(R_squared)

def set_initial_values(model, w, b):
    """
    Function to set the weight and bias to certain values
    """
    with torch.no_grad():
        for name, param in model.named_parameters():
            if 'linear_relu_stack.0.weight' in name:
                param.copy_(torch.tensor([w]))
            elif 'linear_relu_stack.0.bias' in name:
                param.copy_(torch.tensor([b]))

def create_optimizer(model, optimizer_name, lr, momentum):
    """
    Function to define an optimizer
    """
    if optimizer_name == "Adam":
        optimizer = torch.optim.Adam(model.parameters(), lr)
    elif optimizer_name == "SGD":
        optimizer = torch.optim.SGD(model.parameters(), lr)
    elif optimizer_name == "SGD-momentum":
        optimizer = torch.optim.SGD(model.parameters(), lr, momentum)
    elif optimizer_name == "Adadelta":
        optimizer = torch.optim.Adadelta(model.parameters(), lr)
    elif optimizer_name == "RMSprop":
        optimizer = torch.optim.RMSprop(model.parameters(), lr)
    else:
        raise ValueError("optimizer unknown")
    return optimizer

def train_model(optimizer_name, initial_guess, true_value, lr, momentum):
    """
    Function to train a model
    """
    # initialize a model
    model = NeuralNetwork().to(device)
    #print(model)

    set_initial_values(model, initial_guess[0], initial_guess[1])

    for name, param in model.named_parameters():
        print(name, param)

    model.train()

    nb_epochs = 100
    use_scheduler = False
    freq_scheduler = 100
    freq_gamma = 0.95
    true_b = torch.tensor([true_value[0], true_value[1]])

    print(optimizer_name)
    optimizer = create_optimizer(model, optimizer_name, lr, momentum)

    # Store the mean loss by epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=freq_gamma)
    loss_epochs = torch.zeros(nb_epochs)
    list_perc_abs_error = [] # store absolute value of the percentage error
    list_perc_abs_error_i = [] # store index i
    list_perc_abs_error_loss = [] # store loss
    list_norm_gradient = [] # store norm of the gradient
    list_gradient = [] # store the gradient itself
    list_beta = [] # store parameters

    calculate_variance_grad = False

    freq_loss = 1
    freq_display = 10

    for i in tqdm(range(0, nb_epochs)):

        optimizer.zero_grad()

        # Calculate the loss
        loss = loss_OLS(model, y_torch, X_torch)
        loss_epochs[[i]] = float(loss.item())

        # Store the loss
        with torch.no_grad():
            # Extract weight and bias
            b_current = np.array([k.item() for k in model.parameters()])
            b_current_ordered = np.array((b_current[1], b_current[0])) # reorder (bias, weight)
            list_beta.append(b_current_ordered)
            perc_abs_error = np.sum(np.square(b_current_ordered - true_b.detach().numpy()))
            list_perc_abs_error.append(np.median(perc_abs_error))
            list_perc_abs_error_i.append(i)
            list_perc_abs_error_loss.append(float(loss.item()))

        # Calculate the gradient
        loss.backward()

        # Store the gradient
        with torch.no_grad():
            grad = np.zeros(2)
            for (index_p, p) in enumerate(model.parameters()):
                grad[index_p] = p.grad.detach().data
            # reorder (bias, weight)
            grad_ordered = np.array((grad[1], grad[0]))
            list_gradient.append(grad_ordered)

        # Take a gradient step
        optimizer.step()

        if i % freq_display == 0: # Monitor the loss
            loss, current = float(loss.item()), i
            print(f"loss: {loss:>7f}, percentage abs. error {list_perc_abs_error[-1]:>7f}, [{current:>5d}/{nb_epochs:>5d}]")
        if (i % freq_scheduler == 0) & (i != 0) & (use_scheduler == True):
            scheduler.step()
            print("i : {}. Decreasing learning rate: {}".format(i, scheduler.get_last_lr()))

    return model, list_beta, list_gradient

def create_gif(filenames, output_name):
    """
    Function to create a gif, using a list of images
    """
    with io.get_writer(output_name, mode='I') as writer:
        for filename in filenames:
            image = io.imread(filename)
            writer.append_data(image)

    # Remove files, except the final one
    for index_file, filename in enumerate(set(filenames)):
        if index_file < len(filenames) - 1:
            os.remove(filename)

# Define a neural network with a single node
# Get cpu or gpu device for training
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

nb_nodes = 1
# Define model
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(1, nb_nodes)
        )

    def forward(self, x):
        out = self.linear_relu_stack(x)
        return out

Minimization using gradient descent:

lr = 0.10 # learning rate
alpha = lr
init = (9.0, 2.0) # initial guess
true_value = [-2.0, 2.0] # true value for the parameters

# I. Solve
optimizer_name = "SGD"
momentum = 0.0
model_SGD, list_beta_SGD, list_gradient_SGD = train_model(optimizer_name, init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom = 1 # to increase/decrease the length of the vectors on the plot
max_index_plot = 30 # when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad)) in enumerate(zip(list_beta_SGD, list_gradient_SGD)):
    if index > max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
    else:
        label_1 = ""
    # Point
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    # Arrow for the gradient
    plt.arrow(bb[0], bb[1], - zoom * alpha * grad[0], - zoom * alpha * grad[1], color='white')
    # create file name and append it to a list
    filename = "IMGS/path_SGD_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_SGD.png"
plt.savefig(filename)
create_gif(filenames, "SGD.gif")
plt.show()

Minimization using gradient descent with momentum:

optimizer_name = "SGD-momentum"
momentum = 0.2

# I. Solve
model_momentum, list_beta_momentum, list_gradient_momentum = train_model(optimizer_name, init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom = 1 # to increase/decrease the length of the vectors on the plot
max_index_plot = 30 # when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_momentum, grad_momentum)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_momentum, list_gradient_momentum)):
    if index > max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "SGD-momentum"
    else:
        label_1 = ""
        label_2 = ""
    # Points
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrows
    #plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha * grad[0], - zoom * alpha * grad[1], color='white')
    plt.arrow(bb_momentum[0], bb_momentum[1], - zoom * alpha * grad_momentum[0], - zoom * alpha * grad_momentum[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_SGD_momentum_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_SGD_momentum.png"
plt.savefig(filename)
create_gif(filenames, "SGD_momentum.gif")
plt.show()

Minimization using RMSprop:

optimizer_name = "RMSprop"
momentum = 0.0

# I. Solve
model_RMSprop, list_beta_RMSprop, list_gradient_RMSprop = train_model(optimizer_name, init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom = 1 # to increase/decrease the length of the vectors on the plot
max_index_plot = 30 # when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_RMSprop, grad_RMSprop)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_RMSprop, list_gradient_RMSprop)):
    if index > max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "RMSprop"
    else:
        label_1 = ""
        label_2 = ""
    # Points
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrow
    plt.arrow(bb_RMSprop[0], bb_RMSprop[1], - zoom * alpha * grad_RMSprop[0], - zoom * alpha * grad_RMSprop[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_RMSprop_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_RMSprop.png"
plt.savefig(filename)
create_gif(filenames, "RMSprop.gif")
plt.show()

Minimization using Adam:

optimizer_name = "Adam"
momentum = 0.0

# I. Solve
model_Adam, list_beta_Adam, list_gradient_Adam = train_model(optimizer_name, init, true_value, lr, momentum)

# II. Create gif
filenames = []
zoom = 1 # to increase/decrease the length of the vectors on the plot
max_index_plot = 30 # when to stop plotting

# Plot contour
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))

# Add points and arrows
for (index, (bb, grad, bb_Adam, grad_Adam)) in enumerate(zip(list_beta_SGD, list_gradient_SGD, list_beta_Adam, list_gradient_Adam)):
    if index > max_index_plot:
        break
    if index == 0:
        label_1 = "SGD"
        label_2 = "Adam"
    else:
        label_1 = ""
        label_2 = ""
    # Points
    plt.scatter([bb[0]], [bb[1]], s=10, c='white', linewidth=5.0, label=label_1)
    plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, c='blue', linewidth=5.0, alpha=0.5, label=label_2)
    # Arrow
    plt.arrow(bb_Adam[0], bb_Adam[1], - zoom * alpha * grad_Adam[0], - zoom * alpha * grad_Adam[1], color="blue")
    # create file name and append it to a list
    filename = "IMGS/path_Adam_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()
    plt.savefig(filename)

filename = "IMGS/path_Adam.png"
plt.savefig(filename)
create_gif(filenames, "Adam.gif")
plt.show()

Creating the "master plot" with the four trajectories together:

max_iter = 100
freq_plot = 1 # plot every iteration (assumed value; not defined in the original snippet)
filenames = []
cp = plt.contour(X_grid, Y_grid, np.sqrt(Z), colors='black', linestyles='dashed', linewidths=1, alpha=0.5)
plt.clabel(cp, inline=1, fontsize=10)
cp = plt.contourf(X_grid, Y_grid, np.sqrt(Z))
colors = ["white", "blue", "green", "red"]

# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index % freq_plot == 0:
        if index == 0:
            label_1 = "SGD"
            label_2 = "SGD-momentum"
            label_3 = "RMSprop"
            label_4 = "Adam"
        else:
            label_1, label_2, label_3, label_4 = "", "", "", ""
        plt.scatter([bb_SGD[0]], [bb_SGD[1]], s=10, linewidth=5.0, label=label_1, color=colors[0])
        plt.scatter([bb_momentum[0]], [bb_momentum[1]], s=10, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
        plt.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=10, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
        plt.scatter([bb_Adam[0]], [bb_Adam[1]], s=10, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
    if index > max_iter:
        break
    # create file name and append it to a list
    filename = "IMGS/img_{}.png".format(index)
    filenames.append(filename)
    plt.xlabel('cst')
    plt.ylabel('slope')
    plt.legend()

    # save frame
    plt.savefig(filename)
#plt.close() # build gif

create_gif(filenames, "compare_optim_algos.gif")

Creating the 3D "master plot":

max_iter = 100
fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Plot the surface
surf = ax.plot_surface(X_grid, Y_grid, Z, cmap=cm.coolwarm, linewidth=0, antialiased=False, alpha=0.1)
ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter('{x:.02f}')
ax.view_init(60, 35)

colors = ["black", "blue", "green", "red"]
x_min = -10
x_max = -x_min
y_min = x_min
y_max = -x_min

filenames = [] # start a fresh list of frames for the 3D gif

# Add points:
for (index, (bb_SGD, bb_momentum, bb_RMSprop, bb_Adam)) in enumerate(zip(list_beta_SGD, list_beta_momentum, list_beta_RMSprop, list_beta_Adam)):
    if index == 0:
        label_1 = "SGD"
        label_2 = "SGD-momentum"
        label_3 = "RMSprop"
        label_4 = "Adam"
    else:
        label_1, label_2, label_3, label_4 = "", "", "", ""
    ax.scatter([bb_SGD[0]], [bb_SGD[1]], s=100, linewidth=5.0, label=label_1, color=colors[0])
    ax.scatter([bb_momentum[0]], [bb_momentum[1]], s=100, linewidth=5.0, alpha=0.5, label=label_2, color=colors[1])
    ax.scatter([bb_RMSprop[0]], [bb_RMSprop[1]], s=100, linewidth=5.0, alpha=0.5, label=label_3, color=colors[2])
    ax.scatter([bb_Adam[0]], [bb_Adam[1]], s=100, linewidth=5.0, alpha=0.5, label=label_4, color=colors[3])
    if index > max_iter:
        break
    # create file name and append it to a list
    filename = "IMGS/img_{}.png".format(index)
    filenames.append(filename)
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    plt.ylabel('Slope')
    plt.xlabel('Intercept')
    plt.legend()
    # save frame
    plt.savefig(filename)

filename = "IMGS/surface_loss.png"
plt.savefig(filename)
plt.show()

create_gif(filenames, "surface_compare_optim_algos.gif")

  • Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).
  • Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." International Conference on Machine Learning. PMLR, 2013.

A really good series of videos on this topic:
