
Perceptron

  1. Definition
    • For a training example, the input is \textbf{x} = (x_1, x_2, ..., x_d) and the output is y \in \{+1, -1\}.
    • Hypothesis: h(\textbf{x}) = sign((\sum_{i=1}^d w_i x_i) - \theta) = sign(\sum_{i=0}^d w_i x_i) = sign(\textbf{w}^T\textbf{x}), where x_0 = 1 and w_0 = -\theta (a small sketch of this equivalence follows the outline).
  2. Conceptual Explanation
    • A line / plane / hyperplane (the hypothesis) that separates the points (training examples) inside a 2D / 3D / N-D space.
    • (If the features are N-dimensional, then the problem space is N-dimensional and the separating hyperplane is (N-1)-dimensional.)
  3. Learning Algorithm
    • Perceptron Learning Algorithm
      • Overall: start from some \textbf{w}_0 and correct its mistakes on \mathcal{D}.
      • Implementation: for t = 0, 1, ... (a runnable sketch follows the outline)
        • Find a mistake of \textbf{w}_t, i.e. an example (\textbf{x}_{n(t)}, y_{n(t)}) with sign(\textbf{w}_t^T \textbf{x}_{n(t)}) \ne y_{n(t)}.
          • Visit the examples in the naive cycle (1, ..., N) or in a precomputed random cycle.
        • Correct the mistake by \textbf{w}_{t+1} \leftarrow \textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}.
        • Repeat until no more mistakes (a full cycle without encountering a mistake).
        • Return the last \textbf{w} (call it \textbf{w}_{PLA}) as g.
      • Conceptual Explanation:
        • Assuming there is a hyperplane that separates the two output classes, the perceptron learning algorithm tries to find the normal vector \textbf{w}_{PLA} of such a hyperplane by taking dot products of the current \textbf{w} with the training examples \textbf{x}.
        • The goal is a \textbf{w} that forms an acute angle with the positive examples and an obtuse angle with the negative examples.
      • Theorem: the update never makes things worse on the corrected example n = n(t):
        • y_n\textbf{w}_{t+1}^T\textbf{x}_n = y_n(\textbf{w}_{t} + y_{n}\textbf{x}_{n})^T\textbf{x}_n = y_n\textbf{w}_t^T\textbf{x}_n + ||\textbf{x}_n||^2 \ge y_n\textbf{w}_t^T\textbf{x}_n.
  4. Linear Separability
    • Definition: if \mathcal{D} allows some \textbf{w} to make no mistake (a necessary condition for PLA to halt), call such \mathcal{D} linearly separable.
    • PLA Fact: \textbf{w}_t gets more aligned with \textbf{w}_f (a perfect weight vector that separates \mathcal{D}).
      • \textbf{w}_f^T\textbf{w}_{t+1} = \textbf{w}_f^T(\textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}) \ge \textbf{w}_f^T\textbf{w}_{t} + \min_n y_n\textbf{w}_f^T\textbf{x}_n > \textbf{w}_f^T\textbf{w}_{t}.
      • The inner product of \textbf{w}_f and \textbf{w}_t keeps growing (though this alone could just mean \textbf{w}_t is getting longer, hence the next fact).
    • PLA Fact: \textbf{w}_t does not grow too fast
      • ||\textbf{w}_{t+1}||^2 = ||\textbf{w}_{t} + y_{n(t)}\textbf{x}_{n(t)}||^2 \le ||\textbf{w}_{t}||^2 + \max_{n}||y_n\textbf{x}_n||^2, since \textbf{w} is only updated on a mistake, where the cross term 2y_{n(t)}\textbf{w}_t^T\textbf{x}_{n(t)} \le 0.
    • Theorem: (For simplicity, w and x are no longer marked as vectors.)
      • R^2 = \max_{n}||x_n||^2
      • \rho = \min_{n} y_n\frac{w_f^T}{||w_f||}x_n
      • Starting from w_0 = 0, after T corrections: \frac{w_f^T w_T}{||w_f|| \, ||w_T||} \ge \sqrt{T}\frac{\rho}{R}, so PLA halts after at most R^2/\rho^2 corrections (derivation sketched after the outline).
  5. Non-Linearly Separable Data
    • PLA is guaranteed to work as long as \mathcal{D} is linearly separable, but we do not know up front whether it is, and even then we are not sure how long halting takes (\rho depends on the unknown \textbf{w}_f).
    • Line with Noise Tolerance
      • \textbf{w}_g \leftarrow \text{argmin}_{\textbf{w}} \sum_{n=1}^N [[ y_n \ne sign(\textbf{w}^T\textbf{x}_n) ]], i.e. the weights with the fewest mistakes on \mathcal{D} (an NP-hard problem in general).
      • Pocket Algorithm: greedily keep the best weights seen so far "in the pocket". Slower than PLA, since it stores an extra weight vector and re-checks all the examples after each update (a sketch follows the outline).
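
A minimal sketch of the hypothesis from the Definition section: the threshold form sign((\sum_{i=1}^d w_i x_i) - \theta) and the augmented form sign(\textbf{w}^T\textbf{x}) agree once x_0 = 1 and w_0 = -\theta are prepended. The function names and toy numbers below are my own illustration; numpy is assumed.

import numpy as np

def h_threshold(w, theta, x):
    # sign((sum_i w_i x_i) - theta): the threshold form of the hypothesis
    return np.sign(np.dot(w, x) - theta)

def h_augmented(w_aug, x_aug):
    # sign(w^T x) after prepending x_0 = 1 and w_0 = -theta
    return np.sign(np.dot(w_aug, x_aug))

w, theta = np.array([0.5, -1.2, 2.0]), 0.3   # made-up weights and threshold
x = np.array([1.0, 0.7, -0.4])               # one made-up example
w_aug = np.concatenate(([-theta], w))        # w_0 = -theta
x_aug = np.concatenate(([1.0], x))           # x_0 = 1
assert h_threshold(w, theta, x) == h_augmented(w_aug, x_aug)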
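
A runnable sketch of the Perceptron Learning Algorithm following the naive cycle from section 3. The data layout (an (N, d+1) matrix with x_0 = 1 already prepended) and the max_updates safeguard are my own choices rather than part of the algorithm.

import numpy as np

def pla(X, y, max_updates=10_000):
    # X: (N, d+1) array, each row an example with x_0 = 1 prepended
    # y: (N,) array of labels in {+1, -1}
    w = np.zeros(X.shape[1])                  # start from w_0 = 0
    updates = 0
    while updates < max_updates:
        mistake_found = False
        for x_n, y_n in zip(X, y):            # naive cycle over the examples
            if np.sign(w @ x_n) != y_n:       # a mistake of w_t
                w = w + y_n * x_n             # w_{t+1} <- w_t + y_{n(t)} x_{n(t)}
                updates += 1
                mistake_found = True
        if not mistake_found:                 # a full cycle with no mistakes: halt
            break
    return w                                  # w_PLA, returned as g

On linearly separable data the loop halts with zero mistakes; max_updates only guards against accidentally feeding it non-separable data.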
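
The bound in section 4 follows by chaining the two PLA facts, starting from w_0 = 0 and counting T corrections; a sketch of the derivation in the same (non-bold) notation as the theorem:

\begin{aligned}
w_f^T w_T &\ge w_f^T w_0 + T \min_n y_n w_f^T x_n = T\rho\,||w_f|| && \text{(Fact 1, applied T times)} \\
||w_T||^2 &\le ||w_0||^2 + T \max_n ||x_n||^2 = T R^2 && \text{(Fact 2, applied T times)} \\
\frac{w_f^T w_T}{||w_f||\,||w_T||} &\ge \frac{T\rho\,||w_f||}{||w_f||\,\sqrt{T}\,R} = \sqrt{T}\,\frac{\rho}{R}.
\end{aligned}

Since the left-hand side is a cosine and therefore at most 1, \sqrt{T}\,\rho/R \le 1, i.e. T \le R^2/\rho^2.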
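
A minimal pocket-algorithm sketch under the same augmented-data convention as the PLA sketch above; the function names, the random pick of a mistake, and the fixed update budget are my own choices.

import numpy as np

def count_mistakes(w, X, y):
    # number of examples with sign(w^T x_n) != y_n
    return int(np.sum(np.sign(X @ w) != y))

def pocket(X, y, max_updates=1_000, seed=0):
    # X: (N, d+1) augmented examples, y: labels in {+1, -1}
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_mistakes = w.copy(), count_mistakes(w, X, y)
    for _ in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:                    # already perfect: stop early
            return w
        n = rng.choice(wrong)                  # pick one mistaken example at random
        w = w + y[n] * X[n]                    # ordinary PLA correction
        mistakes = count_mistakes(w, X, y)     # extra full pass: the reason pocket is slower
        if mistakes < best_mistakes:           # keep the better weights in the pocket
            best_w, best_mistakes = w.copy(), mistakes
    return best_w

Returning best_w rather than the last w is exactly the "keep the best weights in the pocket" idea; the per-update pass over all examples is the overhead mentioned above.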
