| Prev: Haskell | Up: Contents | Next: Testing |
To summarise, optimisation of the simulator by a variety of methods and algorithms has lead to a hundred-fold increase in execution speeds and a ten-fold decrease in memory requirements over my original simulator code in certain situations.
The benchmark machine for my tests was a Cyrix P166+ machine with 32Mb of memory. The Cyrix processors are considered to have a better integer performance than an equivalent Intel processor but worse floating point performance.
Some of the results have proved surprising. I had expected that it would be possible to simulate concurrent gates concurrently. Alas the laws of quantum mechanics had something to say about that idea! I had to rewrite my threads to use concurrency in other areas of simulation. I had expected my cache-hit increasing 'complex' algorithm to reduce execution times for all reasonably sized circuits. However it only produces increases the performance in low memory situations (i.e. large simulations). I had also expected that multi-threading would not lead to any increase in performance on a single processor machine. However there is a remarkable increase in performance using multiple threads when the memory is low.
The first gains in execution time came from using profilers on simulations in order to spot areas of the simulation that were occupying the most processor time, and thus which areas I should concentrate on optimising. The results from the first profile showed that the simulator was spending approximately 85% of its time creating and destroying vectors and less than 10% of its performing the essential matrix calculations.
This inefficiency was due to returning vectors on the stack. Consider the following code.
Main( )
{
Matrix m = ...;
Vector v = ...
Vector r;
r = m.Multiply( v );
}
// Method to multiply a matrix by a vector
Vector Matrix::Multiply( Vector &v )
{
Vector r;
// Multiply performed here
return r;
}
C++ compilers in translate this to the following sequence of instruction. Extra code generated by the compiler is highlighted.
Main( )
{
Matrix m = ...;
Vector v = ...
Create vector v
Vector r;
Create vector r
r = m.Multiply( v );
perform the multiply, result returned on the stack in vector 'temp'
destroy contents of 'r'
copy 'temp' in to 'r'
destroy contents of 'temp'
}
// Method to multiply a matrix by a vector
Vector Matrix::Multiply( Vector &v )
{
Vector r;
create vector r
// Multiply performed here
return r;
create vector 'temp'
copy r in to 'temp'
destroy 'r'
}
Now consider what happens if we pass the v in to the function by reference.
Main( )
{
Matrix m = ...;
Vector v = ...
Create vector v
Vector r;
Create vector r
m.Multiply( v, r );
perform the multiply, result returned on the stack in vector 'r'
}
// Method to multiply a matrix by a vector, returning the result in r
Matrix::Multiply( Vector &v, Vector &r )
{
// Multiply performed here. Results placed in 'r'
return ;
}
The second code segment above removes the need for the temporary variables and copies. It saves two vector creations, two vector copies and three vector destructions. As the multiplication operations performed by the simulation are generally quite small this results in a substantial increase in the efficiency of the simulation algorithm. Further efficiencies were gained by moving vector creation for temporary vectors outside of the inner loops of the simulation.
Optimisation based on the results of the profiler have allowed me reverse the original profile results. Now, the system spends less than 5% of its time creating and destroying objects and over 85% of its time performing the matrix calculations, resulting in an increase of performance by a factor of ten.
In the description of the simple simulation algorithm above it was shown that the simulation of every gate in the system required access to every element in the state vector. As the state vector is likely to be many megabytes in length the system will not be able to cache the data in the processor cache. Further, if the size of the state vector exceeds the physical memory available to the simulation then the system will have to continuously page the state vector in and out of memory. This continuous paging will result in a severe degradation in system performance.
The solution to the above problems is to localise the memory access as much as possible. In the simple simulation the loop which iterates over the gates is the outermost loop. In the 'complex' memory localising method we move the gate loop inwards.
An recap of the simulation code is shown below.
For each gate in the system.
For BaseAmplitude = 0 .. 2n-1
If BaseAmplitude does not have a bit set which our gate uses
Perform calculations...
End if
End for
End for
Care must be taken reordering the loops to ensure that we preserve the ordering of dependent calculations within the simulation. Moving the gate-processing loop inside the loop that iterates over the amplitudes would not necessarily preserve this ordering. To preserve the order we need to perform some analysis on the dependencies within the simulation.
Suppose that we have two sequential gates as shown in Figure 21. The input of the second gate is entirely dependent on the output of the first gate. In this case it is possible to move the gate-processing loop inwards provided that we still ensure that the calculations of the first gate occur before those of the first gate.
Now imagine that we have two parallel gates as shown in Figure 22. Remember that the operation of a single gate affects the state of the entire system, not just the state of the bit(s) that it is working on. Because of this it is not possible to move the gate-processing loop inwards without some extra work to ensure that the correct ordering of calculations is preserved. As was mentioned in the description of the simple simulation algorithm, it does not matter which of the two gates is simulated first. However it does matter that the actions of the gates appear to be atomic. For instance, it would be incorrect if we allowed the first gate to perform a partial computation, use the resultant state as the input to the second gate and then allowed the first gate to use the result from the second gate to complete its calculations. (This is why my first attempt at a multi-threaded algorithm, described later, failed.)
Imagine the situation where the first gate is connected to bits 0 and 1 of the circuit and the second gate is connected to bits 2 and 3. Also imagine that we have (arbitrarily) decided to simulate the first gate before the second one. The trace of the accesses to the state vector is shown below. The simulation of each gate in this circuit requires two matrix multiplies, each involving a vector with two elements.
| Matrix Multiply | First Gate | Second Gate |
| 1 | Element 00 | Element 00 |
| 1 | Element 01 | Element 10 |
| 2 | Element 10 | Element 01 |
| 2 | Element 11 | Element 11 |
As you can see, the accesses of the matrix multiplies of the second gate are dependent upon the results from both of the outputs of the matrix multiplies of the first gate.
Now imagine if we extended the circuit to operate on three bits, as Figure 23, where the third bit in the circuit plays no part in the area of the simulation that we are looking at.
The simulation of each gate now requires four matrix multiplies. The traces of accesses to the state vector for the simulation of this system would be as follows.
| Matrix Multiply | First Gate | Second Gate |
| 1 | Element 000 | Element 000 |
| 1 | Element 001 | Element 010 |
| 2 | Element 010 | Element 001 |
| 2 | Element 011 | Element 011 |
| 3 | Element 100 | Element 100 |
| 3 | Element 101 | Element 110 |
| 4 | Element 110 | Element 101 |
| 4 | Element 111 | Element 111 |
As you can see, again the first two matrix multiplies of the second gate are dependent upon the output of the first two multiplies of the second gate. Notice, however, that they are not dependent upon the results of the third and fourth matrix multiplies of the first gate. Thus we could schedule the first and second multiplies of the second gate to occur before the third and fourth multiplies of the first gate.
If we were to schedule the first matrix operation of the second gate before the second operation of the first gate then we would localise the memory accesses. The first four matrix multiplies would access the only four elements of the state vector only. Previously, when we had to compute the entirety of the first gate before the second gate, the first four matrix operations required access to all eight elements of the state vector.
The above example can be turned in to a generic algorithm. If we consider a group of gates that act on an arbitrary number of bits, then we must perform the iteration of the 'Base Amplitude' over the bits that these gates act on in some defined order. The iteration over the rest of the bits in the circuit can be performed any order. The algorithm for simulating a group of gates in the circuit in order to localise memory accesses is as follows.
For BaseAmplitude = 0 .. 2n-1
If BaseAmplitude does not have a bit set which our group of gates use
For each gate in the group
For GroupAmplitude = 0..2n-1
If GroupAmplitude has only bits set which the rest of the group uses
Perform calculations.
End if
End for
End for
End if
End for
Notice that the expense of moving in the gate-processing loop is that we have to add another loop inside of it. This new innermost loop appears to be exponential and it would seem that the cost of localising memory accesses is to raise the processing requirements from 2n to (2n)2. However this is not the case. The innermost if statement ensures that the calculations for each gate are still performed the same number of times for each gate. Further, by including some clever bit masking and setting logic we can ensure that the innermost if statement is only called c*2n times, where c << 2n. (Effectively we reason that a failure on an if statement is caused by a single bit being set when it should not be and we can use this to increment the for counters in non-linear jumps). Despite these steps the localisation does require extra control overheads than the simple algorithm. A comparison of the two algorithms is shown later.
Modern microprocessors have a level-1 cache of around 16Kb and a level-2 cache of around 256Kb. In order to make the most of these caches we must ensure that the localisation takes in to account their sizes. For that reason the entire set of computations in a complex simulation are partitioned in to groups that work on small areas of around 256Kb. These groups are then partitioned further in to sub-groups that work in areas of around 16Kb.
The class CGateGroup was designed in order to represent a set of gates. The class has the following functionality.
The class CQList was created to help in the implementation of CGateGroup. This class implements a polymorphic list that can order its elements, subject to a partial or total ordering imposed on its elements.
Three methods in the class CMetaverse are used to the partitioning and simulation.
The method PartitionGroup deserves a fuller account. Given a set of gates it produces a list of groups. The groups are ordered in to computation order - that is if a group X precedes group Y then none of the inputs of the gates of X depend upon the outputs of the gates of Y.
Suppose for instance that we wished to partition a circuit in to groups with a localisation size of 512 bytes. 512 bytes equals 32 complex numbers at 16 bytes per complex number. In order to make sure that we localise access to the state vector to 32 (i.e. 25) elements at a time we must limit the number of bits that the gates within each group uses to 5.
The first thing to do is to order the input list of gates to partition in to a simulation order which attempts to keep related gates as close together as possible. For instance suppose that we have four gates A, B, C and D, and the input of B depends on the output of A (A <; B, or A "precedes" B), B < C and A < D. The dependency graph for this is shown in Figure 24. The ordering method will then output the list {A, B, C, D} or {A, D, B, C}. However it would not output {A, B, D, C}, even though this list preserves the partial ordering, as the related gates B and C are not as close together as possible. (Having said that, in order to reduce the ordering time the algorithm is a heuristic and is not guaranteed to produce the correct output all of the time).
Now we attempt to group the ordered list. The first gate in our imaginary list to partition is as in Figure 25. We now wish to add to this group the maximum number of gates subject to the ordering and size constraints. This gate already uses 2 bits so we have at most 3 more bits to play with. By the ordering property above the next gate in the list will be gate B. This uses the same bits as gate A so the localisation is not going to be affected by adding this gate to the group. Therefore we add it. Now we come to gate C - this uses three pins, none of which are in common with the other two gates. We add this gate to the group, leaving us no more bits to play with. Gate D uses the same bits as the other gates, and so it is added. Gates E and F cannot be added without exceeding our quota of bits and so they are left to be placed in the next group.
As you can see the partitioning algorithm is very dependent upon the partial ordering method of the CQList.
The cache hit increasing 'complex' algorithm orders the circuit to be evaluated in to a set of non-overlapping groups. Dependency information is kept on the groups and one can consider the whole to be analogous to a series of transactions to be executed in a database. It was my original intention that separate threads should calculate concurrent groups. This was part of the reason for creating the complex algorithm in the first place.
I designed and implemented an algorithm to perform the above task and the algorithm worked exactly as intended. Unfortunately, using the algorithm resulted in a sizeable fraction of the automatic tests returning incorrect results. I investigated the matter and determined that the cause of the problem lay in the nature of quantum circuits. I believe it was a reasonable thought that two concurrent gates could be simulated concurrently. In classical circuit the output of the two gates would not have interfered with each other. However in a quantum circuit the two gates do interfere. Another way of describing the state of two or more superpositioned bits is calling them 'entangled', i.e. affecting one of the bits produces an effect on the other. Here we see interference working, and this is why my original algorithm failed.
The above algorithm sought to introduce concurrency in the processing of gates. There was another plan for introducing many threads that I had also planned for, and that was to introduce concurrency in the sheer number of matrix multiplies required to simulate a gate. The simulation of a two bit gate in an n bit circuit requires 2n-2 non-overlapping matrix multiplies, and so there is plenty of scope for parallelisation.
Each matrix multiplication is likely to be small and the cost of thread synchronisation would make it inefficient to allocate each multiplication to a separate thread. Further, the non-linear ordering of the matrix operations make it hard to divide the multiplications in to groups without decreasing efficiency. However the cache hit increasing algorithm groups many gates together. The matrices multiplies involved in simulating a group of gates are much less inefficient to thread and it is these which are allocated to separate threads. In the example shown in the description of the memory localising algorithm above, and recapped below, the first two matrix multiplies of the first and second gates would go to one thread, the last two to another, as theses sets of operations do not interfere with each other.
| Matrix Multiply | First Gate | Second Gate | Thread |
| 1 | Element 000 | Element 000 | 1 |
| 1 | Element 001 | Element 010 | 1 |
| 2 | Element 010 | Element 001 | 1 |
| 2 | Element 011 | Element 011 | 1 |
| 3 | Element 100 | Element 100 | 2 |
| 3 | Element 101 | Element 110 | 2 |
| 4 | Element 110 | Element 101 | 2 |
| 4 | Element 111 | Element 111 | 2 |
The task for controlling the threads is called CCalculationThreads. The complex algorithm creates one when it has determined that the user requires more than one thread in the simulation. A parameter passed in the construction of a CCalculationThreads object gives the number of threads to create. The pseudo-code for the algorithm is summarised below.
New CCalculationThreads( NumberOfThreads, ThreadPriority )
...
for each sub-group
for each base amplitude
queue amplitude to process the gates for
end for
wait for queue to empty
end for
The pseudo-code for an algorithm that queues amplitudes is listed below.
wait for free slot to queue amplitude
add the amplitude to the queue
signal data in queue to process
The pseudo-code for an algorithm that processes queued amplitudes is listed below.
wait for data in queue to process
calculate gates for this amplitude
signal free slot in amplitude queue
Semaphore objects are used to handle the queue lengths. A mutex object controls access to the CCalculationThreads object. Each gate calculation requires two vectors to store temporary data for the calculation. This temporary data is created at the time of the creation of the CCalculationThreads object as the continuous creation of temporary vectors would result in a large amount of time being spent creating and destroying objects.
A trace of the percentage of processor time occupied by each thread in a quad threaded simulation is shown in Figure 26 below. There are two extra threads. The 'Main Thread' is the thread that is responsible for starting the simulation. It starts up a separate thread, the 'controller' for the purposes of running and controlling the simulation. After this second thread has started the first thread handles the simulation progress dialogue box. Hence the thread performs a large amount of initial work as the dialogue box is drawn. The controller thread then starts to prepare the simulation. It is responsible for initialising the vectors for the simulation, starting off the calculation threads and then queuing the amplitudes to process. You can see it peak at the start as it initialises, then occasionally require the processor to send data to the calculation threads throughout the simulation. The four other threads are used to perform the calculations. These oscillate around 25% processor time, as one would expect.
The slightly erratic behaviour towards the end of the graph is due to the uneven partitioning of the gates to be processed. The partitioning of the complex algorithm breaks this 18 bit circuit in to effectively non-overlapping groups of 7, 7 and 4 bits. Each of these groups is simulated separately. The 7 bit groups have 14 gates and the 4-bit group has 8 gates. The algorithm allocates work to each thread in sizes which are proportional to 2n, where n is the number of gates in the group. Thus, towards the end of the simulation smaller parcels of work are being handed out. The consequence is that the controlling thread has to perform proportionally more work in comparison to the other threads when with smaller groups have to be simulated. Further, the algorithm for allocating work to the threads does not attempt to divide the work equally amongst the threads. If a thread can complete its allocated task before the controlling thread allocates the next task then it is possible that it may get allocated the next task, and the one after, and so on. One thread, therefore, may perform all of the work. This is why one thread above has managed to peak at about 42% of processor time towards the end of the simulation.
I do not consider this to be a bug in the code as I have yet to see any adverse side effects from the above behaviour. Indeed, one could argue that the above behaviour has distinct benefits in a multiprocessor environment as we would like as much work as possible to be performed by one processor. This would reduce the number of cache misses that would occur if the work were switched to another processor. Also, the amount of work that needs to be performed by the simulation is always the same, as is the amount of thread synchronisation required and both are independent of the size of the groups.
Sparse vectors were implemented by reorganising the original vector implementation. A new class, CBaseComplexVector, was added to provide a default implementation for vectors. It provides default methods for setting and returning the length of vectors and performing mathematical operations on the vector and its elements. It expects derived classes to provide methods for setting and returning the individual elements of the vector. The class CComplexVector now represents statically allocated vectors - the old name was kept to preserve compatibility. CSparseComplexVector was added to represent sparse vectors. The updated class hierarchy for vectors is shown in Figure 27.
My sparse vector implementation is quite simple and works by dividing the vector that it is to represent in to an array of sections of a pre-determined size. Each of these sections is in itself a small static vector and is represented by a CComplexVector. Initially empty vectors represent these sections and thus require a minimal amount of memory. An attempt to read from an empty vector returns the complex value of 0. An attempt to write a value 0 to an empty vector is ignored. An attempt to write a value of anything other than 0 to an empty vector results in the vector being allocated and the corresponding element being set. Reading or writing to a non-empty vector causes the corresponding element to be returned or set, respectively.
At the moment the vector does not automatically shrink itself. The code to achieve this is relatively straightforward and has already been implemented, though it has not been enabled. My purpose in using sparse vectors was to reduce the maximum memory requirements of the application, not the average requirements.
The section size is determined at compile time. Smaller section sizes result in a greater memory overhead and generally a higher execution time though it could reduce the amount of memory required to represent the vector. A comparison of the amount of memory required to simulate a quantum Fourier transform is shown below.
The run time of the Quantum Fourier Transform is shown in Figure 28 for the benchmark system when there was plenty of memory available for the simulation. The difference between the simple, complex single-threaded and complex multi-threaded algorithms does not very greatly under these conditions. The complex single-threaded algorithm is about 5% slower than the simple algorithm and the complex multi-threaded algorithm is around 10% slower at its worst. As these variations are small in comparison to the exponential time of the simulation separate graphs for their performances are not shown.
The performance is as expected for the Fourier transform at around O( n22n ), where n is the number of input bits to the circuit. Thus adding an extra bit to the transform multiplies the processor work required by a factor of around 2.2. The workspace requires 2n complex numbers at 16 bytes per complex number.
The situation becomes more interesting at the memory requirements of the simulation start to exceed the physical memory available. The benchmark machine has a total of 32Mb of physical memory. The Windows NT operating likes to keep around 16Mb for its kernel, a disk cache and a set load loaded libraries. Thus there is only 16Mb left for the simulator and its workspace. Leaving a megabyte for the simulator code means that the entire simulation workspace cannot entirely fit in to memory and a small amount of paging is inevitable.
Figure 29 shows the timings for simple, complex single-threaded and complex multi-threaded simulations on a 20 bit Fourier transform. The complex algorithms have to perform more work as they have to order the calculations of the gates. In order for the complex algorithms to be quicker the processor time saved by the reductions in the number of cache misses and page faults has to compensate for this loss. Here the amount of paging is not high enough to allow the complex single threaded algorithm to regain the processor time required for the ordering. However the complex multi-threaded algorithm can regain this 'lost' processing time.
I was not expecting to see any gains using multiple threads on a single processor machine. In fact I was expecting a five to ten percent performance degradation as the multiple threads require extra controlling. Locks, semaphores and mutex's are all used within the simulator to control access to resources, queue requests for calculations and queue any free threads. I was therefore surprised to see the above result and sought an explanation.
The reason for the extra speed comes stems from NT switching threads when one is blocked after causing a page fault. Imagine a thread that requires access to a memory location that swapped out at that time. NT will queue a request for the required memory page. However this will take upwards of 15ms to arrive from the hard disk. When running a single threaded simulation the operating system has no choice but to block waiting for the page to arrive. However in a multi-threaded simulation it is possible that some other thread does not require a swapped out page and is able to run. If so NT can switch tasks to that thread, thereby saving the 15ms which, on modern microprocessors, equates to several million processor instructions.
The run times become even more interesting when the amount of memory required greatly exceeds the amount of physical memory available. The simple simulation algorithm makes no attempt to localise memory accesses in order to reduce the number of page faults. Thus the simulation of each gate requires access to all of the complex numbers in the simulation workspace. For a 21 bit Fourier transform, timings for which are shown in Figure 30, the simulator forces the reading and writing of 32Mb of data for each of the 273 gates. That is approximately 8.5 Gb reading and writing, and would take a modern PC hard disk approximately an hour if the accesses were sequential. Unfortunately, the memory access is not sequential (see 'Funny Counting', above) and so the time spent waiting for the hard disk is far greater than this.
The amount of pure processor time required for the simulation of the 21 bit Fourier transform should be 2.2 times more than a 20 bit transform, and thus around 2,000 seconds. The simple simulation requires around 72,000 seconds, i.e. the system spends thirty-five seconds waiting for the hard disk for every one second it spends calculating the circuit's outputs. The single-threaded complex simulation is far more efficient at around 15,000 seconds. The quad-threaded complex simulation requires a mere 6,000 seconds in comparison.
The conclusion is that the simple algorithm is more efficient when the system's physical memory requirements are sufficiently large to fully satisfy the workspace required for a simulation. The extra overheads of the ordering of the complex algorithm and the communication and synchronisation of the threaded algorithms means that the two algorithms are around five and ten percent slower, respectively. However the simple algorithm is unable to scale to situations where the memory requirements of the simulation are not met by the system. The complex algorithms have been designed to scale better and perform well. The multi-threaded algorithm performs excellently and exceeds my original expectations.
The comparisons below in Figure 31 show the difference between the observed maximum memory requirements of the simulator when performing a Quantum factorisation. It should be noted that the amount of space required by the sparse vector depends to some degree upon the random number that the Quantum factorisation algorithm uses, and can extend to as much as twice the values recorded here. I have chosen use this particular circuit to demonstrate the algorithm as it a 'real world' application. The Quantum Fourier transform by itself, as used in the comparisons above, requires the sparse vectors to use maximum amount of memory at some time, and so does not offer any opportunity for gains. The sizes of the sparse vector section used in these tests were 128 elements.
As you can see signification gains can be achieved by using sparse vectors. This is, however, accompanied by a 30% slowdown in execution times.
COMPARISON OF SIZE OF SPARSE VECTOR SECTIONS
As you can see in Figure 32 changing the size of the sections in the sparse vectors does not greatly make a difference to the maximum amount of memory that is required by a simulation. The tests were performed on various sizes of quantum factorisations that were run with a set 'random' number and number to factorise. A smaller section size will offer a greater memory reductions where the complex values that represent the system's state are sharply defined and widely spaced. A larger section size offers less computational overhead. The 128-element section size provides good compromise between these two ideals.
| Prev: Haskell | Up: Contents | Next: Testing |
This web page (c) 2000 Jon Marshall. Last updated 3rd June 2000