Gradient Descent Neural Network sigmoid
Here’s a simplified neural network architecture:
Input layer: One neuron with input \(x\).
Hidden layer: One neuron with weight \(w\) and bias \(b\), applying the sigmoid activation function.
Output layer: One neuron with weight \(v\) and bias \(c\), applying the linear (identity) activation function.
The network’s output \(y\) can be expressed as:
\[y = v \cdot \text{sigmoid}(w \cdot x + b) + c\]
Your goal is to use gradient descent to train the network by updating the weights and biases to minimize a cost function \(J(\theta)\), where \(\theta\) represents all the network parameters (\(w\), \(b\), \(v\), and \(c\)).
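As a minimal sketch of the architecture described above, the forward pass \(y = v \cdot \text{sigmoid}(w \cdot x + b) + c\) can be written for a single scalar input. The function name `forward` and the parameter values in the usage line are illustrative assumptions, not part of the original text.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b, v, c):
    """Return (h, y): the hidden activation and the network output for scalar x."""
    h = sigmoid(w * x + b)  # hidden layer: sigmoid activation
    y = v * h + c           # output layer: linear (identity) activation
    return h, y


# Illustrative parameter values (assumed for this example only)
h, y = forward(x=1.0, w=0.0, b=0.0, v=2.0, c=0.5)
print(h, y)  # sigmoid(0) = 0.5, so y = 2.0 * 0.5 + 0.5 = 1.5
```

Returning the hidden activation alongside the output is a convenience: backpropagation reuses it when computing the gradients.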
Initialization:
Initialize the parameters randomly or with predetermined values.
Set the learning rate \(\alpha\) (e.g., \(\alpha = 0.1\)).
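The initialization step might look like the following sketch. The helper name `init_params`, the uniform range \([-1, 1]\), and the seed are assumptions chosen for illustration; the text only asks for random or predetermined starting values.

```python
import random


def init_params(seed=None, low=-1.0, high=1.0):
    """Draw w, b, v, c uniformly from [low, high] (range is an assumption)."""
    rng = random.Random(seed)  # a fixed seed makes runs reproducible
    return {name: rng.uniform(low, high) for name in ("w", "b", "v", "c")}


params = init_params(seed=0)
alpha = 0.1  # learning rate, matching the example value in the text
print(params, alpha)
```

Passing the same seed returns the same starting point, which helps when comparing learning rates.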
Training:
For each training example, compute the predicted output \(y\) using the current network parameters:
\[y = v \cdot \text{sigmoid}(w \cdot x + b) + c\]
Compute the cost function \(J(\theta)\) (e.g., mean squared error) between the predicted output \(y\) and the actual target output \(y_{\text{target}}\):
\[J(\theta) = \frac{1}{2}(y - y_{\text{target}})^2\]
Compute the gradients of the cost function with respect to the parameters \(\theta\) using backpropagation. For example:
\(\frac{\partial J}{\partial v} = (y - y_{\text{target}}) \cdot \text{sigmoid}(w \cdot x + b)\)
\(\frac{\partial J}{\partial c} = y - y_{\text{target}}\)
\(\frac{\partial J}{\partial w} = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial \text{sigmoid}} \cdot \frac{\partial \text{sigmoid}}{\partial(w \cdot x + b)} \cdot \frac{\partial(w \cdot x + b)}{\partial w} = (y - y_{\text{target}}) \cdot v \cdot \text{sigmoid}(w \cdot x + b)\,(1 - \text{sigmoid}(w \cdot x + b)) \cdot x\)
\(\frac{\partial J}{\partial b} = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial \text{sigmoid}} \cdot \frac{\partial \text{sigmoid}}{\partial(w \cdot x + b)} \cdot \frac{\partial(w \cdot x + b)}{\partial b} = (y - y_{\text{target}}) \cdot v \cdot \text{sigmoid}(w \cdot x + b)\,(1 - \text{sigmoid}(w \cdot x + b))\)
Update the parameters using gradient descent:
\(v = v - \alpha \cdot \frac{\partial J}{\partial v}\)
\(c = c - \alpha \cdot \frac{\partial J}{\partial c}\)
\(w = w - \alpha \cdot \frac{\partial J}{\partial w}\)
\(b = b - \alpha \cdot \frac{\partial J}{\partial b}\)
Repeat the four steps above (forward pass, cost, gradients, parameter update) over multiple iterations and training examples until the cost function stops decreasing appreciably, at which point the network is considered trained.
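The training procedure above can be sketched in plain Python using the same symbols (\(w\), \(b\), \(v\), \(c\), \(\alpha\)). This is an illustrative sketch under stated assumptions, not a reference implementation: the function names, the starting parameter values, the epoch count, and the two training pairs are all chosen for the example. Note that because \(y = v \cdot h + c\) with \(h = \text{sigmoid}(w \cdot x + b)\), the gradient with respect to \(v\) carries a factor of \(h\).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost_and_grads(x, y_target, w, b, v, c):
    """Return J and (dJ/dw, dJ/db, dJ/dv, dJ/dc) for one training example."""
    h = sigmoid(w * x + b)       # hidden activation
    y = v * h + c                # predicted output
    dy = y - y_target            # dJ/dy for J = 0.5 * (y - y_target)^2
    dz = dy * v * h * (1.0 - h)  # chain rule back through the sigmoid
    cost = 0.5 * dy ** 2
    return cost, (dz * x, dz, dy * h, dy)


def train(data, w, b, v, c, alpha=0.1, epochs=2000):
    """Per-example gradient descent; returns final parameters and mean cost per epoch."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y_target in data:
            cost, (dw, db, dv, dc) = cost_and_grads(x, y_target, w, b, v, c)
            # Step each parameter against its gradient
            w -= alpha * dw
            b -= alpha * db
            v -= alpha * dv
            c -= alpha * dc
            total += cost
        history.append(total / len(data))
    return (w, b, v, c), history


# Assumed toy data and starting values, for illustration only
data = [(0.0, 0.0), (1.0, 1.0)]
params, history = train(data, w=0.5, b=0.1, v=0.3, c=0.2)
print("first epoch cost:", history[0])
print("last epoch cost:", history[-1])
```

Comparing the first and last entries of `history` gives a quick check that the cost is shrinking; a finite-difference comparison against `cost_and_grads` is a common way to confirm the analytic gradients.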
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sig):
    # Expects the sigmoid output (not the pre-activation input)
    return sig * (1 - sig)

# Initialize random weights and biases
input_size = 1
hidden_size = 1
output_size = 1
np.random.seed(1)
weights_input_hidden = np.random.rand(input_size, hidden_size)
weights_hidden_output = np.random.rand(hidden_size, output_size)
bias_hidden = np.random.rand(1, hidden_size)
bias_output = np.random.rand(1, output_size)
# Define learning rate
learning_rate = 0.01
# Training data
X = np.array([[0], [1]])
y = np.array([[0], [1]])
# Training loop
epochs = 1000
for epoch in range(epochs):
    # Forward pass
    hidden_input = np.dot(X, weights_input_hidden) + bias_hidden
    hidden_output = sigmoid(hidden_input)
    final_input = np.dot(hidden_output, weights_hidden_output) + bias_output
    predicted_output = final_input  # Linear activation for output layer
    # Calculate error as target minus prediction (the negative of dJ/dy),
    # which is why the updates below add rather than subtract
    error = y - predicted_output
    # Backpropagation
    output_delta = error
    hidden_error = output_delta.dot(weights_hidden_output.T)
    hidden_delta = hidden_error * sigmoid_derivative(hidden_output)
    # Update weights and biases (gradients are summed over the batch)
    weights_hidden_output += hidden_output.T.dot(output_delta) * learning_rate
    weights_input_hidden += X.T.dot(hidden_delta) * learning_rate
    bias_output += np.sum(output_delta, axis=0, keepdims=True) * learning_rate
    bias_hidden += np.sum(hidden_delta, axis=0, keepdims=True) * learning_rate

# Testing the trained network
new_data = np.array([[2.5]])
hidden_layer_activation = sigmoid(np.dot(new_data, weights_input_hidden) + bias_hidden)
output = np.dot(hidden_layer_activation, weights_hidden_output) + bias_output
print("Input:", new_data)
print("Predicted Output:", output)
Input: [[2.5]]
Predicted Output: [[0.94234077]]
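The loop above runs for a fixed number of epochs. One way to follow the "repeat until the cost converges" criterion is to record the cost each epoch and stop once it changes by less than a tolerance. The sketch below is an assumption-laden variant, not the original code: the function name `train_until_converged`, the tolerance `tol`, the cap `max_epochs`, and the use of `np.random.default_rng` are all illustrative choices, so its numbers will differ from the output shown above.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def train_until_converged(X, y, learning_rate=0.01, max_epochs=20000, tol=1e-9, seed=1):
    """Batch gradient descent that stops once the cost changes by less than tol."""
    rng = np.random.default_rng(seed)
    W1 = rng.random((X.shape[1], 1))  # input -> hidden weight
    b1 = rng.random((1, 1))           # hidden bias
    W2 = rng.random((1, 1))           # hidden -> output weight
    b2 = rng.random((1, 1))           # output bias
    costs = []
    for epoch in range(max_epochs):
        # Forward pass
        hidden = sigmoid(X @ W1 + b1)
        pred = hidden @ W2 + b2
        diff = pred - y  # dJ/dy for J = 0.5 * sum((pred - y)^2)
        costs.append(0.5 * float(np.sum(diff ** 2)))
        # Stop when the cost has effectively stopped changing
        if epoch > 0 and abs(costs[-2] - costs[-1]) < tol:
            break
        # Backpropagation (computed with the pre-update weights)
        hidden_delta = (diff @ W2.T) * hidden * (1 - hidden)
        W2 -= learning_rate * (hidden.T @ diff)
        b2 -= learning_rate * np.sum(diff, axis=0, keepdims=True)
        W1 -= learning_rate * (X.T @ hidden_delta)
        b1 -= learning_rate * np.sum(hidden_delta, axis=0, keepdims=True)
    return (W1, b1, W2, b2), costs


X = np.array([[0.0], [1.0]])
y = np.array([[0.0], [1.0]])
params, costs = train_until_converged(X, y)
print("epochs run:", len(costs))
print("initial cost:", costs[0], "final cost:", costs[-1])
```

Here the error is defined as prediction minus target, so the updates subtract the gradient; this is equivalent to the sign convention used in the loop above.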