… some parallel processing tests using Microsoft Accelerator
Posted by admin in Uncategorized
It has been a long time again since my last post.
Today I wanted to explore the realm of parallel processing a little. As I already stated in some previous posts, the current trend of going multicore is somewhat out of phase with current programming techniques: we have multicore hardware but we rarely use it. Most applications available on the market use a single core, and some have only a few modules optimized for multicore.
Surprisingly, the most parallel applications nowadays are in graphics processing. Latest-generation GPUs are actually multicore processors, but specialized for simple operations (and thus very fast). For the general-purpose software developer this processing powerhouse is out of reach because of its different programming model: there is a lack of higher-level abstractions over the GPU, and the specialization itself imposes limitations.
There are some APIs available from NVIDIA and AMD (read ATI) in the form of CUDA and Stream, but for me, as a C# developer, they are not attractive since they offer a C++ API.
Fortunately there is Microsoft Accelerator, available for download from Microsoft Research. It is a .NET abstraction over the GPU that allows some basic computations to be forwarded to the GPU instead of the CPU. Naturally I wanted to explore it a little and, especially, to run some tests. Below are my test example, some results and some remarks.
using System;
using Microsoft.Research.DataParallelArrays;

namespace MSAccelerator
{
    class Program
    {
        public const int dimension = 5000;
        public const int repetitions = 1000000;

        static void Main(string[] args)
        {
            ParallelArrays.InitGPU();

            float[] matrix1 = new float[dimension];
            float[] matrix2 = new float[dimension];
            float[] matrix3 = new float[dimension];

            // Step 1: fill the two source arrays with random data.
            using (Timer t = new Timer())
            {
                Console.Out.WriteLine("Starting generating the source matrices");
                Random r = new Random((int)DateTime.Now.Ticks);
                for (int i = 0; i < dimension; i++)
                {
                    matrix1[i] = (float)(r.NextDouble() * 1000);
                    matrix2[i] = (float)(r.NextDouble() * 1000);
                }
                Console.Out.WriteLine("Finished generating the source matrices. Time taken (ms) :" + t.Milliseconds.ToString());
            }

            // Step 2: element-wise addition on the CPU, repeated to get a measurable run time.
            using (Timer t = new Timer())
            {
                Console.Out.WriteLine("\nStarting adding the two the source matrices");
                for (int k = 0; k < repetitions; k++)
                {
                    for (int i = 0; i < dimension; i++)
                    {
                        matrix3[i] = matrix1[i] + matrix2[i];
                    }
                }
                Console.Out.WriteLine("Finished adding the two the source matrices. Time taken (ms) :" + t.Milliseconds.ToString());
            }

            // Step 3: transfer the data into structures that can be used by the Accelerator.
            DisposableFloatParallelArray pmatrix1 = new DisposableFloatParallelArray(matrix1);
            DisposableFloatParallelArray pmatrix2 = new DisposableFloatParallelArray(matrix2);
            DisposableFloatParallelArray pmatrix3 = new DisposableFloatParallelArray(matrix3);

            // Step 4: the same addition through the Accelerator.
            // The result is not read back into a managed array here.
            using (Timer t = new Timer())
            {
                Console.Out.WriteLine("\nStarting adding the two the source matrices using the GPU");
                for (int k = 0; k < repetitions; k++)
                {
                    object result = ParallelArrays.Add(pmatrix1, pmatrix2);
                }
                Console.Out.WriteLine("Finished adding the two the source matrices using the GPU. Time taken (ms) :" + t.Milliseconds.ToString());
            }

            Console.In.Read();
        }
    }

    // Minimal timer: reports the milliseconds elapsed since construction.
    public class Timer : IDisposable
    {
        private DateTime startTime;

        public Timer() { startTime = DateTime.Now; }

        public int Milliseconds
        {
            get
            {
                TimeSpan ts = DateTime.Now.Subtract(startTime);
                return (int)(ts.Ticks / 10000); // 10,000 ticks per millisecond
            }
        }

        public void Dispose() { }
    }
}
The code above needs some explanation. The test adds two one-dimensional float arrays. The array size was set to 5,000 because of memory constraints: I am unable to pinpoint the cause right now, but going over 10,000 elements results in a null pointer exception inside the Accelerator implementation. It might have something to do with the translation of the data into the vertex matrix, or with a bad implementation in the library. Because of this I chose to repeat the operation in order to get a long-running test (1,000,000 repetitions). Yes, this is quite high, but at 10,000 repetitions the execution took under 15 ms, which is about the resolution of DateTime itself, so I had to choose a bigger repetition count.
In the bigger picture there are 4 steps: fill the matrices with random data, compute using a standard CPU-bound algorithm, put the data into structures usable by the Accelerator, and perform the operations using the Accelerator.
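The timer-resolution problem described above could also be approached with a higher-resolution clock instead of more repetitions. The following is only a sketch using `System.Diagnostics.Stopwatch` from the .NET base class library; the names `StopwatchTimer` and `StopwatchTimerDemo` are hypothetical and not part of the original test, and the numbers it produces are not comparable with the results reported in this post.

```csharp
using System;
using System.Diagnostics;

// Hypothetical alternative to the post's DateTime-based Timer class.
public sealed class StopwatchTimer : IDisposable
{
    private readonly Stopwatch watch = Stopwatch.StartNew();

    // Elapsed time with sub-millisecond precision where the hardware supports it.
    public double Milliseconds
    {
        get { return watch.Elapsed.TotalMilliseconds; }
    }

    public void Dispose() { watch.Stop(); }
}

public static class StopwatchTimerDemo
{
    // Times the same CPU-side element-wise addition as the post.
    public static double TimeAdd(int dimension, int repetitions)
    {
        float[] a = new float[dimension];
        float[] b = new float[dimension];
        float[] c = new float[dimension];

        using (StopwatchTimer t = new StopwatchTimer())
        {
            for (int k = 0; k < repetitions; k++)
            {
                for (int i = 0; i < dimension; i++)
                {
                    c[i] = a[i] + b[i];
                }
            }
            return t.Milliseconds;
        }
    }

    public static void Main()
    {
        Console.WriteLine("High resolution clock: " + Stopwatch.IsHighResolution);
        Console.WriteLine("Time taken (ms): " + TimeAdd(5000, 10000));
    }
}
```

With a clock like this, a run of 10,000 repetitions may already be measurable, although a longer run still helps to average out noise.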
And now the results:
- CPU-bound algorithm: 36,909 ms
- GPU-bound algorithm: 265 ms
This results in a 139x increase in speed. Of course this is algorithm and hardware dependent.
Test machine specs: Core 2 Duo 6750 2.66 GHz, 4 GB DDR2 RAM, Sapphire ATI 4850 512 MB at stock voltages/frequencies.
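For reference, the speedup figure is simply the ratio of the two measured times. A tiny sketch of that arithmetic follows; the class name `SpeedupDemo` is hypothetical, and the inputs are the two timings reported above.

```csharp
using System;
using System.Globalization;

public static class SpeedupDemo
{
    // Speedup = baseline time divided by the time of the faster variant.
    public static double Speedup(double baselineMs, double acceleratedMs)
    {
        return baselineMs / acceleratedMs;
    }

    public static void Main()
    {
        double speedup = Speedup(36909, 265);
        // Prints "Speedup: 139.3x"
        Console.WriteLine("Speedup: " + speedup.ToString("F1", CultureInfo.InvariantCulture) + "x");
    }
}
```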
One more thing to add here is the obvious limitation of GPU processing: the operations are only possible on scalar values (int, float, double, bool). Also, the context passing from CPU to GPU is costly, and you gain tangible benefits only on arrays larger than 1 billion elements.
[UPDATE]: I also found a C# implementation of CUDA. I have not had time to look at it, but it looks promising. Maybe in the future I will do some test comparisons between it and MS Accelerator.



Cheers
Try matrix multiplication instead (since it is computationally heavier).
I have a GeForce Go 7400 and my computation limit is 4096 elements.
Which is very low. =(
I think it's not fair to declare the ParallelArrays only once:
DisposableFloatParallelArray pmatrix1 = new DisposableFloatParallelArray(matrix1);
DisposableFloatParallelArray pmatrix2 = new DisposableFloatParallelArray(matrix2);
DisposableFloatParallelArray pmatrix3 = new DisposableFloatParallelArray(matrix3);
//….
for loop
Also, after the calculation with the normal for loop you have the result in the matrix3 array.
In the GPU calculation you DON'T have a managed float array after the calculation, so you can't use it to display the result or anything else.
When you want to show that one algorithm is faster than another, both have to do the same work.
So put the declaration of the ParallelArrays into the loop, and write the result back into a managed float array.
Then you'll see that it's much slower than the normal calculation.
(Sorry, my English is bad.)
Hi JLafleur
Unfortunately I am missing your point. The idea was to have the same operations in the timed loops. I was not trying to also have the data available somewhere. Maybe the operations should look like:
for (int i = 0; i < dimension; i++)
{
float temp = matrix1[i] + matrix2[i];
}
and
for (int k = 0; k < repetitions; k++)
{
ParallelArrays.Add(pmatrix1, pmatrix2);
}
I was only pointing out raw power, not access latencies.
Hi,
Your benchmark numbers are a little incorrect.
Microsoft Accelerator uses deferred execution, just like LINQ. So I think the actual DirectX / GPU code in the example above is never executed.
object result = ParallelArrays.Add(pmatrix1, pmatrix2);
The statement above will not perform the actual summation on the GPU unless you call the target.ToArray method.
Create a target object as a DX9Target, then call target.ToArray inside the loop.
I think it will degrade the performance dramatically.
NOTE: The GPU is best for matrix computations with at least 1K elements.
Here is a link to a paper by Microsoft Research about Accelerator
http://research.microsoft.com/pubs/70250/tr-2005-184.pdf
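The LINQ analogy in the comment above can be illustrated with plain LINQ to Objects. This sketch does not use Accelerator at all and makes no claim about how Accelerator is implemented; it only shows what deferred execution means in C#. The class name `DeferredExecutionDemo` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Counts how many element additions have actually run.
    public static int Evaluations;

    // Describes an element-wise sum; nothing is computed until the result is enumerated.
    public static IEnumerable<float> BuildSum(float[] a, float[] b)
    {
        return a.Select((x, i) =>
        {
            Evaluations++;
            return x + b[i];
        });
    }

    public static void Main()
    {
        float[] a = { 1f, 2f, 3f };
        float[] b = { 10f, 20f, 30f };

        IEnumerable<float> query = BuildSum(a, b);
        Console.WriteLine("After building the query: " + Evaluations); // prints 0

        float[] result = query.ToArray(); // forces the additions to run
        Console.WriteLine("After ToArray: " + Evaluations);            // prints 3
        Console.WriteLine("First element: " + result[0]);              // prints 11
    }
}
```

Timing only the line that builds the query would measure almost nothing, which is the concern the comment raises about the GPU loop in the post.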