
Perceptron

  1. Definition
    • For a training example, the input is \textbf{x} = (x_1, x_2, ..., x_d) and the output is y \in \{+1, -1\}.
    • Hypothesis: h(\textbf{x}) = sign((\sum_{i=1}^d w_i x_i) - \theta) = sign(\sum_{i=0}^d w_i x_i) = sign(\textbf{w}^T\textbf{x}), where x_0 = 1 and w_0 = -\theta (a small sketch of this equivalence follows the outline).
  2. Conceptual Explanation
    • A line / plane / hyperplane (the hypothesis) that separates the points (training examples) inside a 2D / 3D / N-D space.
    • (If the features are N-dimensional, then the problem space is N-dimensional and the separating hyperplane is (N-1)-dimensional.)
  3. Learning Algorithm
    • Perceptron Learning Algorithm
      • Overall: start from some \textbf{w}_0 and correct its mistakes on \mathcal{D}.
      • Implementation: for t = 0, 1, ... (a runnable sketch follows the outline)
        • Find a mistake of \textbf{w}_t, i.e. an example (\textbf{x}_{n(t)}, y_{n(t)}) with sign(\textbf{w}_t^T \textbf{x}_{n(t)}) \ne y_{n(t)}.
          • Visit the examples in the naive cycle (1, ..., N) or in a precomputed random cycle.
        • Correct the mistake by \textbf{w}_{t+1} \leftarrow \textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}.
        • Repeat until no more mistakes (a full cycle without encountering a mistake).
        • Return the last \textbf{w} (call it \textbf{w}_{PLA}) as g.
      • Conceptual Explanation:
        • Assuming there is a hyperplane that separates the two output classes, the perceptron learning algorithm tries to find the normal vector \textbf{w}_{PLA} of such a hyperplane by taking dot products of the current \textbf{w} with the training examples \textbf{x}.
        • The goal is a \textbf{w} that forms an acute angle with the positive examples and an obtuse angle with the negative examples.
      • Theorem: the update never makes things worse on the corrected example n = n(t):
        • y_n\textbf{w}_{t+1}^T\textbf{x}_n = y_n(\textbf{w}_{t} + y_{n}\textbf{x}_{n})^T\textbf{x}_n = y_n\textbf{w}_t^T\textbf{x}_n + ||\textbf{x}_n||^2 \ge y_n\textbf{w}_t^T\textbf{x}_n.
  4. Linear Separability
    • Definition: if \mathcal{D} allows some \textbf{w} to make no mistake (a necessary condition for PLA to halt), call such \mathcal{D} linearly separable.
    • PLA Fact: \textbf{w}_t gets more aligned with \textbf{w}_f (a perfect weight vector that separates \mathcal{D}).
      • \textbf{w}_f^T\textbf{w}_{t+1} = \textbf{w}_f^T(\textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}) \ge \textbf{w}_f^T\textbf{w}_{t} + \min_n y_n\textbf{w}_f^T\textbf{x}_n > \textbf{w}_f^T\textbf{w}_{t}.
      • The inner product of \textbf{w}_f and \textbf{w}_t keeps growing (though this alone could just mean \textbf{w}_t is getting longer, hence the next fact).
    • PLA Fact: \textbf{w}_t does not grow too fast
      • ||\textbf{w}_{t+1}||^2 = ||\textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}||^2 \le ||\textbf{w}_{t}||^2 + \max_{n}||y_n\textbf{x}_n||^2, since \textbf{w} is only updated on a mistake, where the cross term 2y_{n(t)}\textbf{w}_t^T\textbf{x}_{n(t)} \le 0.
    • Theorem: (For simplicity, w and x are no longer marked as vectors.)
      • R^2 = \max_{n}||x_n||^2
      • \rho = \min_{n} y_n\frac{w_f^T}{||w_f||}x_n
      • Starting from w_0 = 0, after T corrections: \frac{w_f^T w_T}{||w_f|| \, ||w_T||} \ge \sqrt{T}\frac{\rho}{R}, so PLA halts after at most R^2/\rho^2 corrections (derivation sketched after the outline).
  5. Non-Linearly Separable Data
    • PLA is guaranteed to work as long as \mathcal{D} is linearly separable, but we do not know up front whether it is, and even then we are not sure how long halting takes (\rho depends on the unknown \textbf{w}_f).
    • Line with Noise Tolerance
      • \textbf{w}_g \leftarrow \text{argmin}_{\textbf{w}} \sum_{n=1}^N [[ y_n \ne sign(\textbf{w}^T\textbf{x}_n) ]], i.e. the weights with the fewest mistakes on \mathcal{D} (an NP-hard problem in general).
      • Pocket Algorithm: greedily keep the best weights seen so far "in the pocket". Slower than PLA, since it stores an extra weight vector and re-checks all the examples after each update (a sketch follows the outline).
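
A minimal sketch of the hypothesis from the Definition section: the threshold form sign((\sum_{i=1}^d w_i x_i) - \theta) and the augmented form sign(\textbf{w}^T\textbf{x}) agree once x_0 = 1 and w_0 = -\theta are prepended. The function names and toy numbers below are my own illustration; numpy is assumed.

import numpy as np

def h_threshold(w, theta, x):
    # sign((sum_i w_i x_i) - theta): the threshold form of the hypothesis
    return np.sign(np.dot(w, x) - theta)

def h_augmented(w_aug, x_aug):
    # sign(w^T x) after prepending x_0 = 1 and w_0 = -theta
    return np.sign(np.dot(w_aug, x_aug))

w, theta = np.array([0.5, -1.2, 2.0]), 0.3   # made-up weights and threshold
x = np.array([1.0, 0.7, -0.4])               # one made-up example
w_aug = np.concatenate(([-theta], w))        # w_0 = -theta
x_aug = np.concatenate(([1.0], x))           # x_0 = 1
assert h_threshold(w, theta, x) == h_augmented(w_aug, x_aug)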
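
A runnable sketch of the Perceptron Learning Algorithm following the naive cycle from section 3. The data layout (an (N, d+1) matrix with x_0 = 1 already prepended) and the max_updates safeguard are my own choices rather than part of the algorithm.

import numpy as np

def pla(X, y, max_updates=10_000):
    # X: (N, d+1) array, each row an example with x_0 = 1 prepended
    # y: (N,) array of labels in {+1, -1}
    w = np.zeros(X.shape[1])                  # start from w_0 = 0
    updates = 0
    while updates < max_updates:
        mistake_found = False
        for x_n, y_n in zip(X, y):            # naive cycle over the examples
            if np.sign(w @ x_n) != y_n:       # a mistake of w_t
                w = w + y_n * x_n             # w_{t+1} <- w_t + y_{n(t)} x_{n(t)}
                updates += 1
                mistake_found = True
        if not mistake_found:                 # a full cycle with no mistakes: halt
            break
    return w                                  # w_PLA, returned as g

On linearly separable data the loop halts with zero mistakes; max_updates only guards against accidentally feeding it non-separable data.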
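
The bound in section 4 follows by chaining the two PLA facts, starting from w_0 = 0 and counting T corrections; a sketch of the derivation in the same (non-bold) notation as the theorem:

\begin{aligned}
w_f^T w_T &\ge w_f^T w_0 + T \min_n y_n w_f^T x_n = T\rho\,||w_f|| && \text{(Fact 1, applied T times)} \\
||w_T||^2 &\le ||w_0||^2 + T \max_n ||x_n||^2 = T R^2 && \text{(Fact 2, applied T times)} \\
\frac{w_f^T w_T}{||w_f||\,||w_T||} &\ge \frac{T\rho\,||w_f||}{||w_f||\,\sqrt{T}\,R} = \sqrt{T}\,\frac{\rho}{R}.
\end{aligned}

Since the left-hand side is a cosine and therefore at most 1, \sqrt{T}\,\rho/R \le 1, i.e. T \le R^2/\rho^2.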
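
A minimal pocket-algorithm sketch under the same augmented-data convention as the PLA sketch above; the function names, the random pick of a mistake, and the fixed update budget are my own choices.

import numpy as np

def count_mistakes(w, X, y):
    # number of examples with sign(w^T x_n) != y_n
    return int(np.sum(np.sign(X @ w) != y))

def pocket(X, y, max_updates=1_000, seed=0):
    # X: (N, d+1) augmented examples, y: labels in {+1, -1}
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_mistakes = w.copy(), count_mistakes(w, X, y)
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                    # already perfect: stop early
            return w
        n = rng.choice(wrong)                  # pick one mistaken example at random
        w = w + y[n] * X[n]                    # ordinary PLA correction
        mistakes = count_mistakes(w, X, y)     # extra full pass: the reason pocket is slower
        if mistakes < best_mistakes:           # keep the better weights in the pocket
            best_w, best_mistakes = w.copy(), mistakes
    return best_w

Returning best_w rather than the last w is exactly the "keep the best weights in the pocket" idea; the per-update pass over all examples is the overhead mentioned above.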
