Linear Regression

These are notes I took while summarizing what I learned from Andrew Ng’s machine learning course on Coursera.

Introduction

Regression is a technique for modeling relationships among variables. Typically, there’s one dependent variable y and one or more independent variables. This relationship is usually expressed as a regression function.

Linear regression, as the name suggests, models the relationship using a linear function. Depending on how many independent variables there are, we have simple linear regression (one independent variable) or multivariate linear regression (more than one independent variable).

The hypothesis of linear regression can be described by the following equation,

$$h_\theta(x) = \theta^T x = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n$$

The x values are called features (with x0 = 1 by convention), and the theta values are the parameters. Given a set of training samples, we need to choose theta to fit them.

To measure how well we fit the training examples, we define the cost function of linear regression as below,

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

m is the number of training samples, h(x) is the predicted value, and y is the observed output for a sample. The cost function is the average squared error over all samples, divided by 2.
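As a minimal sketch of the hypothesis and cost function described above, the snippet below uses plain Python lists so it runs without extra libraries (in practice a numerical library would be the usual choice). The names predict and cost and the toy data are assumptions made for illustration; they are not from the course.

```python
def predict(theta, x):
    """Hypothesis h(x): dot product of theta and x, where x[0] is the constant 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def cost(theta, X, y):
    """Squared-error cost: sum of squared errors divided by 2m."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)


# Made-up toy data following y = 1 + 2*x; each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

cost_at_fit = cost([1.0, 2.0], X, y)   # parameters that reproduce y exactly
cost_at_zero = cost([0.0, 0.0], X, y)  # all-zero parameters fit poorly
print(cost_at_fit, cost_at_zero)
```

The parameters that generated the data give a cost of zero, while the all-zero parameters give a clearly larger cost, which is the quantity the fitting procedure tries to reduce.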

This is essentially an optimization problem: we need to choose the parameters theta such that the cost defined by the cost function is minimized.

Over-fitting and Regularization

Fitting the regression parameters minimizes the error on the training samples. However, we can try too hard, so that the regression function doesn’t generalize well; i.e., the hypothesis produces high error for inputs outside the training set. This problem is known as overfitting.

Two commonly used techniques to address overfitting are reducing the number of features and regularization.

Regularization adds a term to the cost function that penalizes large theta values, which tends to produce smoother curves.

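$$J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]$$

Here lambda is the regularization parameter, which controls how strongly large theta values are penalized.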

Note that by convention, the regularization term excludes the j=0 case, so theta 0 is not penalized.
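The regularized cost can be sketched as follows, again in plain Python. The function name regularized_cost, the parameter name lam (for lambda), and the toy data are assumptions for illustration only.

```python
def regularized_cost(theta, X, y, lam):
    """Squared-error cost plus a penalty on theta[1:], all divided by 2m."""
    m = len(y)
    squared_error = sum(
        (sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    )
    penalty = lam * sum(t ** 2 for t in theta[1:])  # theta[0] is excluded
    return (squared_error + penalty) / (2 * m)


# Made-up toy data following y = 1 + 2*x; each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

unpenalized = regularized_cost([1.0, 2.0], X, y, lam=0.0)
penalized = regularized_cost([1.0, 2.0], X, y, lam=3.0)
print(unpenalized, penalized)
```

With lam set to 0 this reduces to the unregularized cost. With a positive lam, even a perfect fit pays a price for a large theta 1, while theta 0 never contributes to the penalty.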

Given the hypothesis and its cost function, there are many ways to fit the parameters theta (i.e., solve the optimization problem), including conjugate gradient, BFGS, and L-BFGS. The most commonly used technique is gradient descent.

Gradient Descent

The idea of gradient descent is to start with some initial values of theta (for example, random ones), evaluate the cost, and keep updating theta according to the rule below to reduce the cost until we reach a minimum.

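$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \quad \text{(updated simultaneously for all } j = 0, \ldots, n\text{)}$$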

Alpha is called the learning rate. It can be proven that if we choose a sufficiently small alpha, the cost will converge to a minimum. However, we don’t want alpha to be too small in practice, because convergence then takes longer. Typically, we try a range of alpha values (0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1) and plot the cost against the number of iterations to see how fast it converges.
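Here is a rough sketch of unregularized gradient descent for linear regression, run with a few of the alpha values listed above. It assumes a start from all zeros rather than random values, a fixed number of iterations instead of a convergence test, and made-up toy data; the names gradient_descent, history, and results are illustrative.

```python
def cost(theta, X, y):
    """Squared-error cost: sum of squared errors divided by 2m."""
    m = len(y)
    return sum(
        (sum(t * xj for t, xj in zip(theta, xi)) - yi) ** 2
        for xi, yi in zip(X, y)
    ) / (2 * m)


def gradient_descent(X, y, alpha, num_iters):
    """Run batch gradient descent and record the cost after every update."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n  # assumption: start from zeros rather than random values
    history = []
    for _ in range(num_iters):
        errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
                  for xi, yi in zip(X, y)]
        gradient = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
                    for j in range(n)]
        # Simultaneous update: every theta_j uses the same old theta.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
        history.append(cost(theta, X, y))
    return theta, history


# Made-up toy data following y = 1 + 2*x; each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

results = {}
for alpha in (0.01, 0.03, 0.1, 0.3):
    theta, history = gradient_descent(X, y, alpha, num_iters=500)
    results[alpha] = (theta, history)
    print(alpha, history[-1], theta)
```

On this small data set the larger of these alpha values reach a much lower cost in the same number of iterations. Plotting each history list against the iteration number gives the convergence curves mentioned above.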

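For linear regression with regularization, the above equation is essentially the following,

$$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right], \quad j = 1, \ldots, n$$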

The update for theta j (with j >= 1) can be rewritten by grouping the theta j terms,

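$$\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$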
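To check that the two ways of writing the regularized update agree, here is a small sketch that performs one step in each form and compares the results. The function names, the starting theta, and the values of alpha and lam are all assumptions chosen for illustration.

```python
def gradients(theta, X, y):
    """Unregularized gradient: (1/m) * sum(error * x_j) for every j."""
    m = len(y)
    errors = [sum(t * xj for t, xj in zip(theta, xi)) - yi
              for xi, yi in zip(X, y)]
    return [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]


def step_original(theta, X, y, alpha, lam):
    """One update with the penalty added to the gradient (theta[0] unpenalized)."""
    m = len(y)
    grad = gradients(theta, X, y)
    new_theta = [theta[0] - alpha * grad[0]]
    new_theta += [theta[j] - alpha * (grad[j] + lam / m * theta[j])
                  for j in range(1, len(theta))]
    return new_theta


def step_rewritten(theta, X, y, alpha, lam):
    """Same update, written as a shrink of theta_j followed by the usual step."""
    m = len(y)
    grad = gradients(theta, X, y)
    new_theta = [theta[0] - alpha * grad[0]]
    new_theta += [theta[j] * (1 - alpha * lam / m) - alpha * grad[j]
                  for j in range(1, len(theta))]
    return new_theta


# Made-up toy data following y = 1 + 2*x; each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]

theta = [0.5, 1.0]
a = step_original(theta, X, y, alpha=0.1, lam=1.0)
b = step_rewritten(theta, X, y, alpha=0.1, lam=1.0)
print(a, b)
```

The rewritten form makes the effect of regularization visible: on every iteration theta j is first multiplied by a factor slightly below 1, then moved by the ordinary gradient step.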

Feature Scaling and Mean Normalization

When we do gradient descent, the values of different features normally differ in scale, which can make convergence slow. For example, feature A may have values in the range [1, 10], while feature B ranges over [-10000, 10000].

It’s good to have feature values on similar scales and centered around 0 (i.e., with a mean of approximately 0).

The former can be achieved using feature scaling: divide every value of a feature by a number (such as its range or standard deviation) so that the values fall approximately within [-1, 1]. The latter is accomplished using mean normalization, which replaces each value with (X – mean). This doesn’t apply to X0, which is always 1. The two are often combined as (X – mean) / range.
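A minimal sketch of combining mean normalization with feature scaling is shown below. It assumes the range (max minus min) as the scaling factor, applies to the raw features before the constant x0 column is added, and uses made-up values matching the ranges of features A and B above; the name normalize_features is illustrative.

```python
def normalize_features(X):
    """Return (x - mean) / range for every column, plus the means and ranges used."""
    columns = list(zip(*X))
    means = [sum(c) / len(c) for c in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    ranges = [(max(c) - min(c)) or 1.0 for c in columns]
    scaled = [[(v - mu) / r for v, mu, r in zip(row, means, ranges)]
              for row in X]
    return scaled, means, ranges


# Made-up raw features: column A spans [1, 10], column B spans [-10000, 10000].
raw = [[1.0, -10000.0], [5.5, 0.0], [10.0, 10000.0]]

scaled, means, ranges = normalize_features(raw)
with_x0 = [[1.0] + row for row in scaled]  # add x0 = 1 after normalizing
print(scaled)
```

The means and ranges are returned because the same values have to be applied to any new input before making a prediction with the fitted theta.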

Normal Equation

Besides using optimization algorithms to fit theta iteratively, it turns out we can also compute the theta values analytically, in closed form.

Without regularization, the closed-form solution (known as the normal equation) is as below,

$$\theta = (X^T X)^{-1} X^T y$$

While this method doesn’t require choosing a learning rate or iterating, it becomes more computationally expensive as n (the number of features) gets large, because of the matrix multiplication and inversion. In addition, the inverse may not even exist. This is typically due to redundant features (some features are not linearly independent) or to having too many features for too few samples.

With regularization, the closed-form solution is the following,

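$$\theta = (X^T X + \lambda L)^{-1} X^T y$$

where L is the (n+1) x (n+1) identity matrix with its top-left entry set to 0, so that theta 0 is not regularized.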

Note that with lambda > 0 the matrix being inverted here is always invertible, even if the matrix in the equation without regularization is not.
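Both closed-form solutions can be sketched in plain Python as below. Instead of forming the inverse explicitly, this sketch solves the equivalent linear system with a small Gaussian elimination routine, which is a substitution made here to keep the example free of external libraries. The names solve and normal_equation, the singularity tolerance, and the toy data (including a deliberately redundant feature) are assumptions for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (or nearly so)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def normal_equation(X, y, lam=0.0):
    """Solve (X^T X + lam * L) theta = X^T y, where L skips theta_0."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    for j in range(1, n):  # add lam on the diagonal, except for j = 0
        A[j][j] += lam
    return solve(A, b)


# Made-up toy data following y = 1 + 2*x; each row is [x0, x1] with x0 = 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
theta_exact = normal_equation(X, y)

# Redundant feature: the third column is exactly twice the second.
X_redundant = [[1.0, 0.0, 0.0], [1.0, 1.0, 2.0], [1.0, 2.0, 4.0]]
try:
    normal_equation(X_redundant, y)
    singular_detected = False
except ValueError:
    singular_detected = True
theta_reg = normal_equation(X_redundant, y, lam=1.0)
print(theta_exact, singular_detected, theta_reg)
```

Without regularization the redundant feature makes the system singular, so the solver gives up. With lam set to 1 the same data yields a solution, which illustrates the invertibility remark above.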
