Slow Feature Analysis (SFA) was originally proposed to learn slowly varying features from generic input signals that vary rapidly over time (P. Berkes 2005; L. Wiskott 1999). Computational neuroscientists observed long ago that, while primary sensory receptors, like the retinal receptors in an animal's eye, are sensitive to very small changes in the environment and thus vary on a very fast time scale, the internal representation of the environment in the brain varies on a much slower time scale. This observation is called the temporal slowness principle. SFA, being the state-of-the-art model of how the temporal slowness principle is implemented, is a hypothesis for the functional organization of the visual cortex (and possibly other sensory areas of the brain). Put in a very practical way, we have some "process" in our brain that behaves very similarly to what SFA dictates (L. Wiskott et al. 2011).
Quite beautifully, it is possible to show reductions between SFA and two other dimensionality reduction algorithms used in machine learning: Laplacian Eigenmaps (a nonlinear, graph-based dimensionality reduction algorithm) and Fisher Discriminant Analysis (a standard supervised dimensionality reduction algorithm). SFA can be applied fruitfully in ML, and there have been many applications of the algorithm to ML-related tasks. The key concept behind SFA (and LDA) is that it projects the data onto a subspace where the distance between points with the same label is minimized, while the distance between points with different labels is maximized.
The high-level idea of using SFA for classification is the following. One can think of the training set as an input sequence $x(i) \in \mathbb{R}^d$, $i \in [n]$, where each $x(i)$ belongs to one of $K$ different classes. The goal is to learn $K-1$ functions $g_j(x(i))$, $j \in [K-1]$, such that the output $y(i) = [g_1(x(i)), \cdots, g_{K-1}(x(i))]$ is very similar for training samples of the same class and largely different for samples of different classes. Once these functions are learned, they are used to map the training set into a low-dimensional vector space. When a new data point arrives, it is mapped to the same vector space, where classification can be done with higher accuracy.
Now we introduce the minimization problem in its most general form, as it is commonly stated for classification (P. Berkes 2005). Let $T_k$ be the set of training samples of class $k$, and let $a=\sum_{k=1}^{K} \binom{|T_k|}{2}$. For all $j \in [K-1]$, minimize

$$\Delta(y_j) = \frac{1}{a} \sum_{k=1}^{K} \sum_{\substack{i<i' \\ i,i' \in T_k}} \big( g_j(x(i)) - g_j(x(i')) \big)^2$$

with the following constraints:
$\frac{1}{n} \sum_{k=1}^{K}\sum_{i\in T_k} g_j( x(i)) = 0 $
$\frac{1}{n} \sum_{k=1}^{K}\sum_{i \in T_k} g_j( x(i))^2 = 1 $
$ \frac{1}{n} \sum_{k=1}^{K}\sum_{i \in T_k} g_j( x(i))g_v( x(i)) = 0 \quad \forall v < j $
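To make the objective and the constraints concrete, here is a small NumPy sketch (not part of the original formulation) that evaluates them for a given matrix of outputs $y(i)$; the function and variable names are purely illustrative.

```python
import numpy as np
from itertools import combinations

def sfa_objective_and_constraints(Y, labels):
    """Y: (n, K-1) matrix whose rows are the outputs y(i); labels: (n,) class labels.
    Returns the within-class slowness objective and the constraint values."""
    labels = np.asarray(labels)
    n = Y.shape[0]
    # Objective: average squared difference over all same-class pairs (the 1/a factor).
    diffs = []
    for k in np.unique(labels):
        idx = np.where(labels == k)[0]
        for i, j in combinations(idx, 2):
            diffs.append((Y[i] - Y[j]) ** 2)
    objective = np.mean(diffs, axis=0)      # one value per feature g_j
    zero_mean = Y.mean(axis=0)              # constraint 1: should be ~0
    unit_variance = (Y ** 2).mean(axis=0)   # constraint 2: should be ~1
    decorrelation = (Y.T @ Y) / n           # constraint 3: off-diagonal should be ~0
    return objective, zero_mean, unit_variance, decorrelation
```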
For some beautiful theoretical reasons, the QSFA algorithm is in practice an algorithm for finding the solution of the generalized eigenvalue problem $AW = \Lambda BW$.
Here $W$ is the matrix of the eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues. For SFA, $A$ and $B$ are defined as $A := \dot{X}^T \dot{X}$ and $B := X^TX$, where $\dot{X}$ is the matrix of the "derivatives" of the data: for each pair of elements with the same label we compute the difference between the two vectors (computationally, it suffices to sample $O(n)$ such pairs uniformly from the set of all possible derivatives).
It is possible to see that the slow feature space we are looking for is spanned by the columns of $W$ associated with the $K-1$ smallest eigenvalues in $\Lambda$.
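Classically, this amounts to a call to an off-the-shelf generalized eigensolver. The following NumPy/SciPy sketch assumes $X$ and $\dot{X}$ are given as dense arrays; the small ridge term is only there to keep $B$ positive definite, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def slow_feature_directions(X, X_dot, n_components):
    """Solve A w = lambda B w with A = X_dot^T X_dot and B = X^T X,
    returning the eigenvectors associated with the smallest eigenvalues."""
    A = X_dot.T @ X_dot
    B = X.T @ X + 1e-10 * np.eye(X.shape[1])   # small ridge keeps B positive definite
    # eigh solves the symmetric generalized eigenvalue problem A w = lambda B w,
    # with eigenvalues returned in ascending order.
    eigvals, eigvecs = eigh(A, B)
    return eigvecs[:, :n_components], eigvals[:n_components]
```

The slow feature map is then $y(i) = W^T x(i)$, with $W$ the $K-1$ returned eigenvectors.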
In (Kerenidis and Luongo 2018) we show how, using a "QuantumBLAS" (i.e. a set of quantum algorithms that we can use to perform linear-algebraic operations), we can perform the following algorithm. The intuition behind this algorithm is that the derivative matrix of the data can be pre-computed on the non-whitened data, as one might do classically to spare a matrix multiplication; with a quantum computer this is not a problem anyway, since we know how to perform matrix multiplication efficiently. As in the classical algorithm, we have to do some preprocessing of our data. For the quantum case, the preprocessing consists of the following steps (a classical sketch of these steps follows the list):
Polynomially expand the data with a polynomial of degree 2 or 3.
Normalize and scale the rows of the dataset $X$.
Create $\dot{X}$ by sampling from the distribution of possible pairs of rows of $X$ with the same label.
Create the QRAM data structures for $X$ and $\dot{X}$.
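A minimal classical sketch of these four steps, assuming scikit-learn for the polynomial expansion and the row normalization, might look as follows; the function and variable names are illustrative, and the QRAM loading of step 4 is only a comment here, since it is a quantum data structure.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, normalize

def preprocess(X, labels, degree=2, n_pairs=None, rng=None):
    """Polynomially expand and normalize X, then build the derivative matrix
    X_dot by sampling pairs of rows that share the same label."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    # Step 1: polynomial expansion of degree 2 (or 3).
    X = PolynomialFeatures(degree=degree, include_bias=False).fit_transform(X)
    # Step 2: normalize and scale the rows.
    X = normalize(X)
    # Step 3: sample O(n) same-label pairs and take their differences.
    n = X.shape[0]
    n_pairs = n_pairs or n
    rows = []
    for _ in range(n_pairs):
        k = rng.choice(np.unique(labels))
        i, j = rng.choice(np.where(labels == k)[0], size=2, replace=False)
        rows.append(X[i] - X[j])
    X_dot = np.array(rows)
    # Step 4: in the quantum algorithm, X and X_dot are now loaded into QRAM.
    return X, X_dot
```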
Note that all these operations take time at most $O(nd\log(nd))$ in the size of the training set, which is time we would need to spend anyway, even just to collect the data classically.
To use our algorithm for classification, one uses QSFA to map one cluster at a time, along with the new test point, into the slow feature space, and then performs any distance-based classification algorithm there (QFDC, swap tests, and so on). The quantum algorithm is the following:
Require: Matrices $X \in \mathbb{R}^{n \times d}$ and $\dot{X} \in \mathbb{R}^{n \times d}$ in QRAM, and parameters $\epsilon, \theta, \delta, \eta > 0$.
Ensure: A state $\ket{\bar{Y}}$ such that $| \ket{Y} - \ket{\bar{Y}} | \leq \epsilon$, where $\ket{Y}$ is proportional to the projection of the whitened data onto the slow feature space.
Create the state $\ket{X}$ using the QRAM that stores the dataset.
(Whitening algorithm) Map $\ket{X}$ to $\ket{\bar{Z}}$ with $| \ket{\bar{Z}} - \ket{Z} | \leq \epsilon$ and $Z = XB^{-1/2}$, using quantum access to the QRAM.
(Projection onto the slow feature space) Project $\ket{\bar{Z}}$ onto the slow eigenspace of $A$ using threshold $\theta$ and precision $\delta$ (i.e. onto the span of the eigenvectors of $A$ with eigenvalues smaller than $\theta$, up to precision $\delta$).
Perform amplitude amplification and estimation on the register $\ket{0}$ with the unitary $U$ implementing steps 1 to 3, to obtain $\ket{\bar{Y}}$ with $| \ket{\bar{Y}} - \ket{Y} | \leq \epsilon $ and an estimator $ \bar{ {|| Y ||} } $ with multiplicative error $\eta$.
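To make the steps concrete, here is a classical NumPy emulation of the quantity the quantum routine prepares (whitening followed by projection onto the slow eigenspace); it is only a sketch of the linear algebra, not of the quantum procedure, and it assumes the slow eigenspace is taken from $A$ expressed in the whitened coordinates.

```python
import numpy as np

def qsfa_output_classically(X, X_dot, theta):
    """Emulate steps 1-3: whiten X with B^{-1/2}, then project the whitened data
    onto the eigenvectors with eigenvalues at most theta."""
    B = X.T @ X
    # Step 2: whitening matrix B^{-1/2}, via a (pseudo-inverted) eigendecomposition.
    evals_B, evecs_B = np.linalg.eigh(B)
    inv_sqrt = np.where(evals_B > 1e-12, 1.0 / np.sqrt(np.clip(evals_B, 1e-12, None)), 0.0)
    B_inv_sqrt = evecs_B @ np.diag(inv_sqrt) @ evecs_B.T
    Z = X @ B_inv_sqrt                       # whitened data, Z = X B^{-1/2}
    # Step 3: slow eigenspace of A in the whitened coordinates (assumption).
    A_white = (X_dot @ B_inv_sqrt).T @ (X_dot @ B_inv_sqrt)
    evals_A, evecs_A = np.linalg.eigh(A_white)
    slow = evecs_A[:, evals_A <= theta]      # eigenvectors below threshold theta
    return Z @ slow @ slow.T                 # projection of Z onto the slow eigenspace
```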
Overall, the algorithm is summarized in the following theorem.
Let $X = \sum_i \sigma_i u_iv_i^T \in \mathbb{R}^{n\times d}$ and its derivative matrix $\dot{X} \in \mathbb{R}^{n \log n \times d}$ be stored in QRAM, and let $\epsilon, \theta, \delta, \eta > 0$. There exists a quantum algorithm that produces as output a state $\ket{\bar{Y}}$ with $| \ket{\bar{Y}} - \ket{Y} | \leq \epsilon$, in the running time given in (Kerenidis and Luongo 2018), and an estimator $\bar{||Y||}$ with $| \bar{||Y||} - ||Y|| | \leq \eta ||Y||$ at the cost of an additional $1/\eta$ factor.
A prominent advantage of SFA compared to other algorithms is that it is almost hyperparameter-free. The only parameters to choose appear in the preprocessing of the data, e.g. the initial PCA dimension and the nonlinear expansion, which consists of a choice of a polynomial of (usually low) degree $p$. Another advantage is that it is guaranteed to find the optimal solution within the considered function space (Escalante-B and Wiskott 2012). In our experiment, using QSFA with a quantum classifier, we reached 98.5% accuracy in digit recognition: 98.5% of 10,000 test images of digits were classified correctly, given a training set of 60,000 digits.
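As a sketch of how the final classification could look classically, one can project the training clusters and the new point with the slow feature map $W$ and assign the label of the nearest class centroid, as a stand-in for QFDC or swap-test based classifiers; all names below are illustrative.

```python
import numpy as np

def classify_in_slow_feature_space(W, X_train, labels, x_test):
    """Project the training set and a test point with the slow feature map W
    (columns = slow directions) and return the label of the closest centroid."""
    labels = np.asarray(labels)
    Y_train = X_train @ W
    y_test = x_test @ W
    classes = np.unique(labels)
    centroids = np.array([Y_train[labels == k].mean(axis=0) for k in classes])
    distances = np.linalg.norm(centroids - y_test, axis=1)
    return classes[np.argmin(distances)]
```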