Here I explain the work of (Kapoor, Wiebe, & Svore, 2016). There, they basically applied aplitude amplification tecniques to two different version of the perceptron algorithm. With the first approach - that we describe in this post - the authors were able to gain a quadratic speedup w.r.t the number of elements in the training set. In the second approach, the authors leveraged the description of the perceptron in the so-called *version space* (the dual of the usual feature space descrption of the perceptron). This allowed them to gain a quadratic improvement w.r.t statistical efficiency: perhaps a more interesting gain than a quadratic speedup with the number of elements in the training set. We will see this the second model of quantum perceptron in another post, since the technique used is basically the same.

Let’s briefly introduce the perceptron. We are given a training set $\mathbb{T} = \{ \phi_1 … \phi_N\} $, $\phi_i \in \mathbb{R}^D$ of labeled vectors that belongs to two different classes: $y_i \in \{0,1\}$. For the sake of simplicity, we assume those vectors to be linearly separable. While in practice this is not always the case, it there are statistical guarantees that we still will be able to lern a good-enough separating hyperplane which approximate the best separating hyperplane $w^*$. A (classical) perceptron find an hyperplane $w$ that separates the data of the two classes. More formally, we want to find a $w$ such that $w^T x_i * y_i \leq 0 \quad \forall i \in \left[N\right]$ is true. The intuition for the classical algorithm is the following: start by an initial guess for $w$, and then update you guess by adding to the vector describing the hyperplane $w$ the misclassified vector, and then normalize to keep the norm of $w$ constant. In this way you rotate your current guess of $w$ until it correctly classify the training vector.

We say that the two classes are separated by a *margin* of $\delta$. Recall that the margin is defined (“a priori” w.r.t the dataset) as:

We can think of the margin as a measure of the training set which tells us how much to rotate $w$ each time to change the label of a misclassified vector. For the record, it is possible to prove that the perceptron makes at most $\frac{1}{\gamma^2}$ mistakes for points $\phi_i$ that are separated with angular margin $\gamma$.

In the quantum version of the algorithm we want to perform amplitude amplification to the perceptron, so the idea is to use amplitude amplification **find quicker** the misclassified vectors in the training set.

A quick recap on Grover-like algorithms. In order to apply amplitude amplification to a problem you need build two unitary operators that combined gives you the Grover iterate:

where $U_{init} = 2\ket{\psi}\bra{\psi}$ is the reflection about the mean, and is the change of phase of the “good” solution you are targeting in your problem.

By applying for a certain number of times the Grover iterate to the quantum state generated by querying an oracle (in tihs case our quantum memory), we can teak the probability of sampling a misclassified vector from your quantum computer.

More formally, this is the statement of the theorem:

*Let $A$ be any quantum algorithm that uses no measurements, and let $f : \{0,1 \}^n \to \{0, 1\}$ be any Boolean function. There exists a quantum algorithm that given the initial success probability $a > 0$ of $A$, finds a good solution with certainty using a number of applications of $A$ and $A^{-1}$ which is in $\Theta(1/\sqrt{a})$ in the worst case.*

In this post we see how to build the quantum circuit for $A$ and for the boolean function $f$ to suits our needs of quantizing a perceptron.

An underlying assumption that we do in the classical analysis of the algorithm is that we have sequential access to the training set, (like an array). In the quantum algorithm we will drop this assumption, and instead assume that we have access to random samples of the dataset. As you might have imagined, we will query the elements of the training set in superposition, at the cost of introducing the possibility of extracting the same training element multiple times.

Recall that classically, the cost of training a perceptron in the original array-model is:

However, as is shown in the paper, if we are allowed to sample the vectors from the training set the running time of a classical lerner stretches by a logarithmic factor (for reason related to the coupon collector problem) to:

Obviously, this is proven duly in the paper. (Kapoor, Wiebe, & Svore, 2016)

Let’s see how to speed things up with quantum. If you are already accustomed to quantum computation, perhaps I already gave you a hint on the quantum state we are going to create for our algorithm. We assume to have access to the following oracles:

And its inverse $U^{\dagger}$. With $U$ (by linearity) We are going to build a uniform superposition of the elements of the training set:

What does $\ket{\phi_i}$ means in practice? If we have to store a floating point vector $\phi^i \in \mathbb{R}^d$, we can store the m-bit binary representation of the $d$ floating point numbers, and add one qubit to store the label of the vector $y^i$ (map $-1$ to 0 for a negative labels). The authors note that you can interpret this qubit-string as an unsigned integer. They also note that in this way we map each element in the training set to a basis of our Hilber space.

Now that we have our data loaded in our quantum computer, we craft the unitary operator that allows us to test if the perceptron correctly assign a training vector. As in the classical algorithm, we start by a random guess of the weight vector $w_0$. Each time we find a misclassified vector we update our model by adjusting $w_t$, our current guess for $w_*$ as in the classical algorithm.

Said simply, we need a quantum circuit implementing the perceptron algorithm for a given weight $w$, and “plug it” into amplitude amplification theorem. (Brassard, Hoyer, Mosca, & Tapp, 2002). The unitary operator that want to implement to apply amplitude amplification just change the sign of misclassified vectors. We can therefore write it as such:

Let me explain this. We define $f_w(\phi, y)$ to be the boolean function of the perceptron function, that given a weight vector $w$ and the class of $\phi$ tells $0$ if the vector is currently well classified according to the label $y$, and return $+1$ if the vector is misclassified. This will allow us to change the sign of just the misclassified vectors. The unitary implementation of an oracle like this would can be plugged into the circuit for amplitude amplification and gives us an algorithm to do a quantum perceptron. As it is known, we can easily build the quantum circuit from a classical boolean circuit. Therefore we can assume to have the quantum circuit perceptron algorithm (for a given model $w$). We want this quantum circuit computes the following mapping:

The unitary operator that we need to implement in order to get a quantum version of $F_w$ can be built in the following way:

This represent the first part of the ingredients that we need for amplitude amplification. The second part consist in $U_{init}$, which in this case is $U_\text{init}=2\ket{\psi}\bra{\psi} - I$, with $\ket{\psi}=\frac{1}{\sqrt{N}}\sum_{j=1}^N = \ket{j}$

The grover iterate is defined $G=U_{init}U_{targ}$. This is the circuit we need to apply amplitude amplification to a problem. If you don’t believe me, you should check the detailed proof of the paper :)

The main result of this section is the following theorem:

*Given a training set that consists of unit vectors $\phi_0, … ,\phi_N$ that are separated by a margin of $\gamma$ in feature space, the number of applications of $F_w$ needed to infer a perceptron model w, such that $P(\exists j : f_w(\psi_j) = 1) \leq \epsilon$ using a quantum computer is $N_\text{quant}$ where:*

.

*The number of queries to $f_w$ needed in the classical setting, $N_\text{class}$, where the training vectors are found by sampling uniformly from the training data is bounded by:*

.

- Access to oracle $U$ storing $N$ input string
- Error parameter $\epsilon$.

- An hyperplane approximaitng $w^*$

- Create random vector $w$
*For*$k=1 … \lceil \log_{3/4} (\gamma^2\epsilon) \rceil$*For*$j = 1 … \lceil \log_c (1/\sin(2sin^{-1}(1\sqrt{N)})) \rceil $- Sample uniformly in integer $m \in [0…\lceil c^j \rceil ]$
- Prepare query register $\ket{\psi}=\sum_{i=1}^N\ket{i}\ket{0}$
- Perform $Q^m\ket{\psi}$
- Measure the first index register and get $\to i$.
- If $f_{w_t}(\phi_i, y_i) =1$ then update $F_{w_t}$

- Return $w_t$ to the user.

We show how to apply amplitude amplification on the dataset to find all the misclassified vectors with a quantum version of the perceptron circuit. At each iteration, we update our circuit $F_{w_t}$ with the new model of the perceptron and we contrinue untill no misclassified vectors are left in the trainingset. A couple of sentences on the algorithm. The two loop in the algorithm assure we to be able to find the correct number of times to apply $G$ with an exponential search among the space of parameters. Its is basically a trick to preserve the quadratic speedup without knowing in advance the right number of misclassified vectors for a given perceptron. Anyway, is a trick described properly in the paper and in (Brassard, Hoyer, Mosca, & Tapp, 2002)

Now the user can take the model $w$ and use in its classical algorithm, eventually with a classical computer. As explained in the paper the second part of the paper might trigger even more the interest of a machine learning pratcitioner. But that’s for another post. :)

- Kapoor, A., Wiebe, N., & Svore, K. (2016). Quantum perceptron models. In
*Advances in Neural Information Processing Systems*(pp. 3999–4007). - Brassard, G., Hoyer, P., Mosca, M., & Tapp, A. (2002). Quantum amplitude amplification and estimation.
*Contemporary Mathematics*,*305*, 53–74.