As I mentioned in my previous post, a linear decision boundary is rarely enough to describe a real-life decision boundary.
The idea behind non-linear logistic regression is to "grasp" the more complex mathematical relationship between our set of X features and their respective y labels. One way is to find a polynomial relation, a technique also known as "feature mapping".
For example: instead of just mapping x1 and x2 to y, consider mapping x1, x2, x1^2*x2, x1^2*x2^2, x1^3*x2, x1^3*x2^2, and so on, to our y label.
To visualize that, let's consider a labeled data set with two features, x1 and x2, and a label y, representing two ranges of QA test results (0 to 2) and a pass / no-pass label for that product (1 / 0). Something like:
qa test #1, qa test #2, pass?
0.455131, 1.656942, 1
0.434386, 1.367466, 1
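As a side note, if you want to play with the same data outside the browser, loading such a CSV in Python might look like the sketch below (the column names are my own guesses based on the header row above, not something the post defines):

```python
import io

import pandas as pd

# Hypothetical stand-in for the post's CSV: the two sample rows shown above,
# with assumed column names.
csv_text = (
    "qa_test_1,qa_test_2,pass\n"
    "0.455131,1.656942,1\n"
    "0.434386,1.367466,1\n"
)

df = pd.read_csv(io.StringIO(csv_text))
X = df[["qa_test_1", "qa_test_2"]].to_numpy()  # feature matrix, shape (n, 2)
y = df["pass"].to_numpy()                      # 1 = pass, 0 = no-pass
```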
Let's drag our CSV to MLplayground.org:
We can clearly see our decision boundary is far from linear. It has some symmetry to it, which makes it a good candidate for a polynomial kernel. But just for the fun of it, let's try running logistic regression without feature mapping:
Even with an optimized cost, the linear decision boundary seems random and meaningless. The algorithm simply can't find a linear mathematical relationship between x1, x2 and y.
Now let's try feature mapping. We'll start with a second-degree polynomial. This means our X matrix just got much bigger: from x1 and x2 to x1, x2, x1*x2, x1*x2^2, x1^2*x2, x1^2*x2^2 (we can think of x1 as x1*x2^0, and vice versa).
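Outside MLplayground, a feature map like this is easy to sketch in numpy. One caveat: the list above shows six terms for degree 2, while the version below emits the full grid of x1^i * x2^j terms, (degree+1)^2 − 1 columns including x1^2 and x2^2 on their own; treat the exact term set as an illustrative choice, not the tool's definition:

```python
import numpy as np

def map_features(x1, x2, degree):
    """Expand two feature columns into the monomials x1^i * x2^j
    for i, j = 0..degree, skipping the constant (0, 0) term,
    which the model's intercept already covers."""
    cols = [(x1 ** i) * (x2 ** j)
            for i in range(degree + 1)
            for j in range(degree + 1)
            if (i, j) != (0, 0)]
    return np.column_stack(cols)

# Degree 2 turns the two raw columns into 8 polynomial columns.
X_mapped = map_features(np.array([0.455131]), np.array([1.656942]), 2)
print(X_mapped.shape)  # (1, 8)
```

Running plain (linear) logistic regression on the mapped columns is what gives the model its curved boundary: the model is still linear in its parameters, just over more columns.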
We can now actually "see" the underlying boundary, with a training accuracy of 78.8%. We can try to improve the results by using a higher-order polynomial, while adjusting the alpha and #iterations parameters for an optimal cost. For degrees 4, 6 and 10 we get:
However, there's a caveat: using higher orders doesn't scale computationally, keeping in mind that in a real-life scenario you will probably want to grasp a relation between more than 2 features. Another problem with higher-degree polynomials is overfitting, which makes our decision boundary look "custom tailored" to the specific training data.
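To get a feel for why higher degrees don't scale, count the columns: with n raw features and all monomials of total degree up to d, there are C(n + d, d) terms (constant included). A quick sketch of how fast that explodes:

```python
from math import comb

# Number of monomials of total degree <= d over n raw features
# (constant term included): C(n + d, d).
for n in (2, 10, 100):
    for d in (2, 10, 20):
        print(f"n={n:3d}  d={d:2d}  features={comb(n + d, d):,}")
```

With just 2 features and degree 2 the count stays tiny (6 terms), but at 100 features and degree 10 the mapped matrix would need trillions of columns, which is why the approach is only practical for a handful of raw features.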
With degree 20, in addition to tremendous slowness, we even see accuracy degradation:
What can we do next, without jumping to more exotic algorithms like neural networks? One option is a Gaussian kernel (not yet supported in simdfied). The idea is to grasp more random-looking, unstructured boundaries based on the Gaussian distribution, with a more scalable computation.
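Since simdfied doesn't support it yet, here's a hedged numpy sketch of the idea (names and the landmark choice are my own illustrative assumptions, not the library's API): replace the raw features with Gaussian similarities to a set of landmark points, then feed those similarity columns to a plain linear logistic regression.

```python
import numpy as np

def gaussian_features(X, landmarks, sigma=0.5):
    """f_k(x) = exp(-||x - l_k||^2 / (2 * sigma^2)):
    the Gaussian similarity of each sample to each landmark point."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

# A common starting choice is to use every training point as a landmark,
# which yields one similarity feature per training sample.
X = np.array([[0.455131, 1.656942],
              [0.434386, 1.367466]])
F = gaussian_features(X, landmarks=X)
print(F.shape)  # (2, 2); the diagonal is exactly 1 (distance 0 to itself)
```

Each feature now peaks at 1 near its landmark and decays smoothly to 0 away from it, which is what lets a linear model over these columns trace unstructured, blob-like boundaries; sigma controls how quickly the similarity falls off.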