Privacy Policy

June 13, 2016

A primer on Projective Simulation: a (quantum) ML algorithm.

In these days, I had the chance to read about a recent algorithm for (quantum) machine learning. Its name is Projective Simulation (PS), proposed by Briegel and Cuevas [1] in 2012. PS melts together ideas from neural networks (whence takes the idea of a graph-like network of objects to model the used to model the memory used in PS)  and reinforcement learning (wherefrom it takes the idea of training the algorithm using rewards and punishments as a function of the environment).  PS connect aspects of research such as artificial intelligence, quantum information, and quantum computation.

Among the various features that make this algorithm interesting, one of the things I like most is that a steps of PS can be executed on a quantum physical system - such as a quantum computer - even gaining computational efficiency over classical computers using quantum interference [11], or exploiting topological structure of certain kind of graph [12].

PS uses a graph-like representation of the memory of an agent. This representation is so general that allows us to think of the neural networks as a physical implementation of the same kind of memory - called episodic and compositional memory -  used by PS. In fact, both algorithms share the concepts that any input received is accompanied by a certain spatiotemporal excitation pattern within the nodes of a network, where similar input cause the same excitation. But PS is way more than this...

A digression on freedom and information-processing machines

PS algorithm was proposed with a broader scope than being a working quantum Machine Learning (ML) algorithm. In fact, PS play a relevant role in the philosophical debate on the existence of free will. Briegel et al. [7], instead of claiming the presence or the absence of free will in mankind, claim that programmable structure, if (a) sufficiently complex and organized, and (b) capable of developing a specific kind of memory, can exhibit behaviors of creativity and free-will.  PS, other than competing with other Reinforcement Learning (RL) algorithms, is proposed as a tool: a fundamental information-theoretic concept that can be used to define a notion of freedom (creativity, intelligence), compatible with the laws of physics.

If we accept that free will is compatible with physical law, we also have to accept that it must be possible, in principle, to build a machine that would exhibit similar forms of freedom as the one we usually ascribe to humans and certain animals. [7]

Think of this algorithm as the "proof" of the claims of the authors relatively the existence of complex system exhibiting free will. We will see how with PS we can  create agents  whose behavior can (hopefully, unarguably) considered by common sense as creative or intelligent.

To better understand the design choices behind the algorithm, we will define Intelligence as the capability of an agent to perceive and act on its environment in a way that maximizes its chances of success. Creativity (a manifestation of intelligence) as the capability of an agent to deal with unprecedented situations, and relate a given situation to other conceivable situations [7][1]. Learning is described as a modification of the molecular details of a neural network of an agent due to experience.

The definition of learning is purposely similar to what happens in the brain of some animals, as discovered by some recent results in neuroscience [8]. For instance, the behavior of this poor Aplysia can be largely described as a stimulus-reflex circuit, where the structure of the circuit change over time [9].
ECM-like memory animal. LOL

PS can be used to analyze emerging behavior of an agent under specific conditions, so to test its behavior in simple environments, such as very simple games.

Authors remark that PS does not try to:

    explain how the brain works,
    explain the nature of consciousness,
    explain the nature of human freedom.

Let's see what PS is about now.

Key concepts in PS

The model of memory used by the intelligent agents in PS is called Episodic and Compositional Memory. ECM is deliberately similar to other notion of memory in other fields of science. Based on the concept of episodic memory, (see this article by Tulving and Ingvar), ECM is basically a stochastic network of clips. We define:

  •     percepts:  pieces of data that represent information from the outer world. These are the possible inputs of our algorithm. Formally, a percept is a tuple $( s = (s_1, s_2, ...,s_N) \in S $ , where $S$ is the percept space $S_1 \times ... \times S_n$. Each $s_i = 1, ..., |S_i|$.
  •     actions: actions the agent can perform in the external world. These are represented by tuples in the action (or actuator) space: $a = (a_1, a_2, ..., A_k) \in A = A_1 \times ... A_k$ where $a_k = 1, ... |A_k| $ . Imagine $a_1$ as the state of "doing a jump", $a_2$ the state of "walking" and so on. These are the output of a single call to EC memory.
  •     clips: these are nodes in the network of EC memory. A clip is meant to represent the fundamental unit of the episodic memory I told you before. Percept clips are those clips $c$ that gets stimulated by percepts $s$ according to a specific probability distribution $I(c|s)$, while action clips $a$ are clips that, if stimulated, trigger an action. One of the most important features in ECM is the possibility to have remembered or fictitious percepts ⓢ$ := \mu(s) \in \mu(S)$ or actions ⓐ . Fictitious or remembered percepts or actions are stored inside clips, as sequences. Each clip has a length $L \geq 1$, which means that $c$ is composed by of $(c^1, c^2, ..., c^L)$ where each $c^i \in \mu(S) \cup \mu(A)$. (In this article we will only use clips with $L=1$).
  •     edges : are objects representing directed arc between clips (they have a starting clip and a receiving clip), and they contain data useful to the execution of the algorithm, such as weights, glowing tags, and emotion tags.
  •     emotions: peace of data (called emotions tag) attached to the edge. Tags are represented as tuples  $e = (e_1, e_2, ..., e_k)$ in the emotional space  $E \equiv E_1 \times ... E_k = E, e_k = 1, ... |E_k| $.

The weight of an edge between clip $c_i$ and $c_j$ is at time $t$ is stored in the weight matrix $h^t(c_i, c_j)$. The transition probabilities between clips are directly proportional to the weight of the edges. At the beginning of the execution of the algorithm, every percept clip is directly connected to every action clip. For the first part of our journey into PS, there will be no further connections between clips, and there are no further layers of clips (this will be generalized in the further sections). The reward function is $\Lambda : S \times A \to \mathbb{R}$.

The gist of PS is that a percept excites a percept clips, this excitation will start a random walk in the episodic memory, going through a chain of clips, eventually triggering an action clip. The transition probability between states of this stochastic process is obtained as a function of the weight matrix.
Fig 2 from [1] - EC memory: a network of clips in PS

Emotions and reflection:
The state of emotional tags can change during the execution of the algorithm, according to a feedback function. It may seems that this is a similar concept of the reward function being used to update of the transition probability between clips, but it is not. The reward function is defined externally: it is dependent on the external environment. This function instead, is totally independent of it, and it is a parameter of the algorithm. Emotional tags can be thought as remembered rewards for previous actions. Seen from the purpose they serve, they are more similar to clips than to transition probability.

Emotions represent a "higher abstraction" over the "lower level" of transition probability. We will use them to introduce the idea of reflection. When the reflection parameter $R$ is $> 1$, the agent keeps computing the output, but in such a way that action clips are detached from their actual execution in the environment. This simulation is repeated until a certain condition on the emotion tags of the selected path has been satisfied, or it is stopped after $R -1 \in \mathbb{N}$ round of simulation into episodic memory. With a name that perfectly aid the intuition, $R$ is called reflection time for the agent. If the agent is unable to select a satisfying action, it executes in the environment one of the previously selected actions.

Emotion's space can be as complex as needed, but in the example, I found on literature it is limited to the funny space of $\{$ ☺ ,☹  $\}$.

I stress that during reflection, action clips are only virtually excited, and they do not trigger any real action. It is thanks to this capability of the system that during the period of reflection an agent can project itself into "conceivable future situations", before triggering the real actuator, so to "think" their possible outcome. We could say that the emotions attached to the actions represent the state of belief of the agent for the right action given a specific percept.



If the rewards are delayed (which is often the case in real world application), one can use afterglowing (lol) : a technique for distributing rewards on recently used clips or edges. This is achieved by tagging each edge with a glowing factor $g$, whose value is set to $1$ each time it is used, and to $g^{t+1}(c_i, c_j) = g^t(c_i, c_j)(1- \eta)$ otherwise. Clip glowing (assign values to clips instead of edges) gives slightly different results on complex clip networks. The assumption behind afterglowing is that action in the past contribute less than recent actions for rewards we get (i.e., compare the importance for the victory between the first and the last move in a chess game). For more information take a look at [2].

PS without clip composition

This section is used just to describe how PS works in its basic configuration, using only the tools described so far. The advanced features of the ECM memory are explained in the next sections.  Here we assume that percept clips are directly connected to action clips, without any middle layer of clips.

Ladies and gentlemen let me introduce you to Projective Simulation algorithm!

  1.     The input of the algorithm is a percept. A percept stimulates a percept clip.
  2.     The excitation of the percept clip start a stochastic process: a random walk on the network of clips. The initial clip of the walk is the the percept clip excited by the pecepts. The hopping probability at time $t$ is $p^{t}(s,a)$ and is initially uniform among all the clips. The random walk terminates once an action clip is reached. If the reflection parameter $R$ is set ($>1$), and if the emotion attached to that action is negative, than it engages reflection, and start computing further random walks (i.e. samples from the random variable of the stationary probability distribution). This process  is repeated at most $R-1$ times: until a boolean function $f : E \to \{true, false \}$is true(otherwise the last action is taken). Weights for the calculation of the hopping probability can be defined by the following policies:
    •     standard function $$p^{(t)}(c_i, c_j) = \frac{h^{(t)}(c_i, c_j)}{\sum_k h^{(t)}(c_i, c_k)} $$
    •     softmax function $$p^{(t)}(c_i, c_j) = \frac{e^{h^{(t)}(c_i, c_j)}}{\sum_k e^{h^{(t)}(c_i, c_k)}} $$ (i.e. it gives higher probability to stringer edges - enhancing the exploitation in exploitation/exploitation paradigm)
    •    $\epsilon$-Greedy algorithm, as in classical RL.
  3.     Selected action is executed in the real world and the eventual reward is collected and taken into account. As we do in ML, while updating our transition probability, we should model the act of forgetting (i.e. giving less weight to lessons learned in the past and letting our agent learn from newer experience). To do that, we use the forgetting factor $\gamma$. Forgetting and hopping probability update (for rewarded and penalized edges) can be compressed in a single formula: $$h^{t+1}(s,a) -h^t(s,a) = - \gamma[g^t(s,a)-] + \lambda \delta(s,s^n)\delta(a,a^n) $$. Where $\delta$ is the Kroneker's delta, and $\lambda =\Lambda^t(s^t,a^t)$. The emotion tag associated with that action is changed according to the reward received. If the reward is positive, the emotion is set to 😄, otherwise is set to 😄. Note that in this is just a toy case, and the emotion space can be way more complex.

Note that our formula the entries of the matrix $h(c_j,c_i)$ are always greater than 1.  These steps are iterated on and on, mimicking the continuous iteration of an agent within its environment.

Basically, SP is a continuous interaction with the environment, where each step comprises a call to the memory of an agent. Each call is a sample over a stationary distribution of an irreducible, aperiodic, and irreversible Markov Chain over the clip's space. The states of the Markov Chain may evolve over time, according to the feedback received from the environment.

Invasion Game

To have a taste of PS in action, we will focus on some variations of a game called Invasion Game. It's not the aim of this post to dig into the application of PS, but the study of this game really helps intuition.

Imagine you have two robots facing each other across a fence with holes.

          v <----attacker

---   ---   ---  ---  ---

          ^ <---- defender

The attacker have to cross the fence in one of the holes, while the defender has to block the attack moving to the hole the attacker wants to go, preventing him from crossing. At each round, before the attack, the attacker show a symbol which is consistent with the direction he wants to go.

The defender, who has no prior information on the meaning of the symbol, have to learn, while playing, what is the meaning of the symbols and block the right hole. If the defender is able to block one attempt, it gets a reward of $1$, otherwise, it gets $0$. After the rewards have been collected by the agent, the battle starts again, with the two robots facing each other. That's exactly what has been done in [1], where the defender was programmed with a PS algorithm.

Let's make the game formally fit our model:

    percepts: symbols shown by attacker: $ S \equiv S_1 = \{ \rightarrow, \leftarrow \}$ (attack right, attack left)
    action space: $A \equiv A_1=  \{ +, - \} $ (move right, move left)
    emotions space: $\{$ ☺ ,☹  $\}$

To measure how good is the algorithm in playing this game, we use the blocking efficiency, it shows us what is the expected reward for our agent at time $t$. Since we are dealing with non-deterministic processes we expect to be something averaged, and we must take into account the non-determinism of (a) the percept we receive from the external world which may change over time and (b) the conditional probability of choosing the right action given a specific percept at a specific time. This whole concept is nicely subsumed into this formula:

$$r^t = \sum_{s \in S} P^t(s)P^t(a^*_s | s) $$

Where $P^t(s)$ is the probability of receiving the percept $s$ at time $t$ and $P^t(a^*_s | s)$ is the probability at time $t$ of selecting the rewarded action ($a^*$) given the percept $s$ has been received.

For the following graphs, blocking efficiency has been empirically calculated, averaging the results among 1000 matches (i.e. you will see fluctuations in the curves).

Basic version of PS in practice

Imagine to train our agent to play Invasion Game, setting initially the reflection parameter at $R=1$, with percept clips directly connected to action clips. Needless to say, PS can play Invasion Game as many other RL algorithms. Here we are no interested in the learning capabilities of a PS agent, but instead, we will test the behavior of the defender in specific "exceptional cases". We will try to mess up the environment in two different ways, and see how our agent react.

The simplest mischief we can do to our robot is this: wait until it has learned to defend with high efficiency, and then change the meaning of the direction of the arrows (i.e.suddenly $\leftarrow$ means right and $\rightarrow$ means left). What happens to the blocking efficiency of our robot?

Fig 5 from [1] - Blocking efficiency of the agent. The meaning of the arrows changed at time $t=250$.

Here we clearly see that, when we change ($t=250$), each second curve is less steep than the first. That's because the agent has to forget what he has learned first, and then re-learn everything from scratch. This happens for 3 different $\gamma$ (the forgetting factor). Given the Bayesian structure of the algorithm itself, this is something we should expect: also other RL algorithm behave similarly.

Given this configuration of the game, there is another mischief we can do to our defender robot: expanding its percept space with edges of another color ( keeping the same direction of the arrows ). This would update our percept space to $S= S_1 \times S_2 = \{ \rightarrow, \leftarrow \} \times \{red, blue \}$. How does the efficiency of our agent evolve?

    Fig 6 from [1] - Blocking efficiency with arrows introducing arrows of different color at time $t=200$.

    Ah ah. For an agent totally unaware of the semantic of an arrow, we would expect to re-learn the correct behavior from scratch, and that's exactly what we see from the two curves. It is the "proof" that from the robot's perspective, we are using totally different symbols.

    And what now if we turn reflection on? Given the same measure of efficiency we can plot results for $R=1$ and $R=2$ (without symbol modification):

    Fig 7 from [1] - Blocking efficiency with 2 different Reflection's value.

    Reflection has a concrete impact on the learning phase of an agent. And this is by itself a nice result. The structure of EC memory installed in the agent is

    Fig 4 from [1] - The EC memory of the defending agent in Invasion Game.

    We have just started exploring the feature of PS. Wouldn't we love if an "intelligent" agent can find some sort of similarities between the two arrows pointing in the similar direction, but with different colors? Yes! After all, isn't this our definition of creativity?

    So be surprised then, because this behavior is exactly what we get using two advanced feature of the ECM.  Starting to be impressed? You should... :v

    Learning how to learn

    Mimicking what happen in biological brains, we can enhance our model of memory by adding new percept clips during the lifetime of an agent. These new percept clips can be connected at the same time to action clip (as before) but also to other percept clips. Practically this means that the walk on the graph could start from percept with other incoming edges from other percept clips.

    The steps of the algorithm are similar to the previous case, but now we have to update the transition probability of all the edges in the sequence of clips which lead from the percept to the action. Without digging into the details (that you can find in [1][12]), these are the modified steps of the algorithm:

        The input of the algorithm is a percept. A percept stimulates a percept clip.
        A random walk from the percept clip $s$ to an action clip $a$ select a sequence of memory clips $\Gamma$. The length of the sequence is called deliberation length. The length $D$ of the chain $(s, s^D)$ is called "deliberation length", and it is roughly "how long does it take to think". The probability for the walk to go from clip $c$ to clip $c'$ is proportional to the ratio between the weight of the edge from clip $c$ to clip $c'$ and the sum of all the weights of the edges starting from $c$. If $(s,a)$ is rewarded sufficiently, and if its emotion is suitable (i.e. 😄), the action is executed. Otherwise, this step is iterated to most other $R-1$ times, and then a random action is taken.
        If the action is rewarded, also edges in the associative memory from $s$ to $a$ will be rewarded by a factor $K$ called growth rate of associative connections (direct connections of percept clips with action clip are increased by $1$, as usual). Forgetting factor works as usual, with the requirement that weights of the compositional sequences of clips (i.e. not direct sequences from percept clip to action clip) are dumped towards $K$.

    Let's play again Invasion Game. This time, we will repeat the experiment of changing the color of the arrows shown by the attacker. We will have a slightly different EC memory, depicted in the image below.

    Fig 12 From [1] - Evolution of the ECM while learning
    In the picture, (a) is the initial state of the memory, (b) is the memory after it has been trained with red arrows (please note the slightly thick arc from the blue arrows to the red ones: its weight is $K$.), and (c) the associative memory in action.

    This is an example of how a newly excited percept clip (blue arrow) could excite another clip in episodic memory (red arrow), from which strong links to specific action clips had been built-up by previous experience. This capability can be used to speed the learning time of an agent, as we see in the graph below. The agent has been trained for until $t=200$ to play with red arrows, and then the attacker switch to blue arrows.
    Fig 11 From [1] - The bigger the K, the steeper the learning curve.

    This match our previous definition of intelligence: we would expect an intelligent agent to learn how to play faster with the new symbols, since their "semantics" (in terms of rewards) is already known for similar symbols. Our agent, in fact, knows already how to play! That's exactly what happens using EC memory with higher deliberation length.

    This structure resembles an associative memory, and this is an example of associative learning.

    Combining actions

    We can generalize our model creating new clips by composing pre-existing ones. For instance, we can compose together two action clips. There are a few requirements from merging, though:

    1.     Both action clips should be sufficiently rewarded for the same percept: there is a threshold level of reward for two actions to be considered sufficiently rewarded.
    2.     The actions are similar, e.g action vectors only differ in two components, and are semantically compatible (i.e. you cannot combine "jump" and "stand still" in your agent, or "go right" with "go left").
    3.     The newly composed action clip does not exist already.

    This feature is what allows us to do action clip compositions, which we will see applied in a 3D version of Invasion Game.

    Now the attacker have to cross an imaginary grid-like plane, and the defender can move over the plane, so to block the attacker.

    Our percept space is: $S \equiv S_1 = \{ \rightarrow, \leftarrow, \nearrow, \searrow, \nwarrow, \swarrow, \uparrow, \downarrow \}$

    Our action space $A \equiv A_1 \times A_2 = \{ (+, 0), (-, 0) \} \times \{ (0, +), (0, -) \} $, where each component of the tuple fix an axes, and the sign is ment to point the verse.

    Beware, we will now give the partial reward to our agent if he can match half of the direction of the attacker. For instance, if the attacker decides to move in diagonal $\nearrow$, the defender gets partial rewards if it chooses one among up or right.

    This time [1], the agent has first been trained only with attacks along the axis. When he is presented with attacks on the $\nearrow$ direction, he will soon realize that there are two action clips equally rewarded for that action, which are semantically compatible. So the agent might think it could merge two actions into one, by creating a new action clip which activates the two action in the real world simultaneously, and sees what happen.

    This is what happens in his brain in terms of clips. The bigger the edge, the bigger is the transition probability.

    Fig 16 from [1] - Example of action clip composition where our agent learns to move in diagonal.

    Needless to say, our new favorite ML algorithm does not let us down, showing how the defender can discover new "behavior" which were not previously given.

    Fig 17 from [1] - Agent's performance for the various threshold level of association.

    In the graph above, the agent has been trained for $t<0$ (not shown) with only attacks along the axes, while for $t>0$ the agent is faced with only attacks along the $\nearrow$ direction. As you can see, the partial reward is set to $1$, while the total reward for a diagonal move is set to $4$.


    Comparison with Reinforcement Learning algorithms

    PS has been compared to another algorithm of reinforcement learning in a recent thesis [2]. The comparison was on real implementations of games and other tasks.  Java code can be found in  [3]. Moreover, since the interface for the algorithm of PS and RL is very similar, a comparison of their respective computational complexity was possible [2]. Both algorithms, in fact, expose a function $getAction$ - the function that gives you an action given an input- and a function $getReward$ - which distribute rewards in the model.
    • The complexity of $getAction$ of both algorithms is $O(|A|)$, but goes to $O(|A|*R)$ if glowing or dumping are enabled.
    • $getAction$ can be implemented using the same selection function ($\epsilon$-Greedy, SoftMax, and the plain probability function).
    • For some games, the use of emotion has been useful.
    • Glowing can be compared to eligibility traces. The same concept for distributing delayed rewards on the previous action in RL.
    • The complexity of $getReward$ is more efficient in the RL case.
    • In  a  well-defined  world,  the  RL algorithms are guaranteed to find an optimal policy[2], but PS seems to be more suitable in a more complex environment, where rules are unknown and subject to changes.

    There are two algorithms in literature that resemble PS: experience replay, and a dyna-algorithm. Those are "conceptually" similar, but with totally different structure, assumptions, complexity, and features.


    The quantum world

    he only way for an agent to exploit quantum mechanics to interact with a classical environment is to use quantum mechanics for internal state representation. To do that, we translate clips into vectors in a Hilbert space. Clips are a composite structure of remembered or fictitious percepts or action, and we will capture this composite structure by means of a tensor product of Hilbert spaces. The steps for running PS on a quantum machine are those:

    • A classical percept $s \in S$ is translated into a state in the quantum memory of the agent. This state is described by the density operator $\rho(s)$.
    • The physical system will evolve according to an Hamiltonian (which we will specify later).
    • A quantum measure on this quantum system will lead us to a specific action.
    We define the probability of transition from a clip to another by the Born's rule: nonorthogonal vector/clips are connected by a  probability $0 \leq p= \left| \langle c_i|c_j \rangle \right|^2 \leq 1$ that the jump of the excitation during the walk may occur. That's the natural choice of embedding probability in a quantum world, so to give raise to our beloved quantum interference. As you may imagine, the initial state could be a superposition of multiple initial states, and this might lead to a speedup.

    Now the tough part: how to translate in the quantum physical word a conceptual thing such as a walk on a graph? The Hamiltonian for such operation is pretty complex and has been found here.

    $$ H = \sum_{ \{ j,k \} \in E} \lambda_{jk}(\hat{c}_k^{\dagger}\hat{c}_{j} + \hat{c}_k \hat{c}^{\dagger}_j ) + \sum_{j \in V} \epsilon \hat{c}_j^{\dagger}\hat{c}_j    $$

    If you are curious about the operator $c$ and $c_j$, take a look at my previous blog post here: those are exactly the same operation used to probabilistiscally move the excitation along a chain of qubits.

    However, there's a problem with this definition: our Hamiltonian describe the evolution of a reversible system: it means that the transition probability between two state is symmetric. While this quantum environment is perfect for undirected graphs, we have hard time modeling directed graphs. [1] We thus have to model a reversible, irreducible, and aperiodic Markov chain.

    The quantum speedup, which has been showed in [11]  to calculate reflection via the Groover's algorithm. But this is a whole new story worth another single blog post I will be glad to write in another post.

    Moreover, a recently, a broader class of graphs for which there is an exponential speedup of the hitting time of the Markov chain has been found. That's cool, since the core of our algorithm is a random walk on a graph-like structure, and the hitting time (in our case) is the average time for which the excitation goes from the first percept clip to the action. To date, we know that discrete quantum walks on hypercubes are exponentially faster on quantum computers, and this class of graph has been extended to embedded hypercubes on certain graphs [12].


    Straight from [2], we have: 

    Damping$0 \leq \gamma \leq 1$$\mathbb{R}$0 or $\frac{1}{10}$Forgetting factor
    Reflection$ 1 \leq R $$\mathbb{N}$1Number of reflection cycle
    Glowing$ \eta \leq 1$$\mathbb{R}$1Glowing factor for weighting rewards
    Associative growth rate$K > 0$$\mathbb{R}$-Growth rate of associative connections of composite paths



    PS can be thought as a generalization of RL. The job of updating transition probability between edges can be done by Bayesian updating, which is basically a RL algorithm.

    The ECM memory is an important part of the PS model. Is what allows an agent to do reflection, which is a kind simulation made by the agent of it's action in the world.  Emotions and ECM are what allows the agent to detach from primary experience and to project themselves into conceivable situations (fictitious memory), without taking any real action.

    A further generalization of PS scheme can be found in [10]. 

    To sum everything up, learning in PS is achieved in three different ways:

    • modifying via bayesian updating the transition probability between the clips of the network (aided by an RL algorithm)
    • creating new clips when new percepts are received
    • creating new clips from existing ones according to a compositional principle.

    Much of the "magic" of this algorithm is embedded into our definition of clips, but it is something it must be specified case by case. That is because PS agents are defined on a more conceptual level as agents whose internal states represent episodic and compositional memory and whose deliberation comprises association-driven hops between memory sequences (the clips). [13]

    I am not aware yet of any implementation in quipper or LiQI|> of the algorithm. We could, in fact, write the Hamiltonian of the physical system and give it as an input to the GSE algorithm, and run our software on our favorite quantum programming language [5] [6]. GSE is an algorithm for simulating a physical system (given it's Hamiltonian) on a gate-based model of quantum computer, efficiently. There is already an implementation of GSE, based on the work of [4].

    Speaking about simulating physics with computers,  I built an Hamiltonian with my own bare hands, and you can play with it here!

    I would like to close this article with a quote [7] which I believe subsume the Zeitgeist among many scientists nowadays:

     If we accept that free will is compatible with physical law, we also have to accept that it must be possible, in principle, to build a machine that would exhibit similar forms of freedom as the one we usually ascribe to humans and certain animals

    I do not claim copyright for any of the pictures in this post. They all belongs to the authors of [1].

    Ocio: I would appreciate if you can send me an email if you find any mistake. Feedback is always welcome, I want to improve! This post may get updated over time as I learn new things or I am not satisfied with my previous explanations.

    Sybreed - God is an automation (piano cover by a random guy on youtube)


    [1] Briegel, Hans J., and Gemma De las Cuevas. "Projective simulation for artificial intelligence." Scientific reports 2 (2012). - 
    [2] Bjerland, Øystein Førsund. "Projective Simulation compared to reinforcement learning" -
    [3] Java Implementation of PS algorithm by Bjerland, Øystein Førsund -
    [4] Simulation of Electronic Structure Hamiltonians Using Quantum Computers
    James D. Whitfield, Jacob Biamonte, Alán Aspuru-Guzik:
    [5] Green, Alexander S., et al. "An introduction to quantum programming in Quipper." Reversible Computation. Springer Berlin Heidelberg, 2013. 110-124.
    [6] Liquid User Manual :
    [7] Briegel, Hans J. "On creative machines and the physical origins of freedom." Scientific reports 2 (2012). - 
    [8] Kandel, Eric R. "The molecular biology of memory storage: a dialogue between genes and synapses." Science 294.5544 (2001): 1030-1038.
    [9] Antonov, Igor, et al. "Activity-dependent presynaptic facilitation and Hebbian LTP are both required and interact during classical conditioning in Aplysia." Neuron 37.1 (2003): 135-147
    [10] Projective simulation with generalization: Alexey  A.  Melnikov, Adi  Makmal,  Vedran  Dunjko, and  Hans  J.  Briegel.
    [11] Paparo, Giuseppe Davide, et al. "Quantum speedup for active learning agents." Physical Review X 4.3 (2014): 031002.
    [12]  Makmal, Adi, et al. "Quantum walks on embedded hypercubes." Physical Review A 90.2 (2014): 022314.
    [13] Hines, Andrew P., and P. C. E. Stamp. "Quantum walks, quantum gates, and quantum computers." Physical Review A 75.6 (2007): 062321.