
# Sparse Vector Implementations

This repository compares several implementations of sparse vectors.

## What is a sparse vector?

A sparse vector is a vector in which most elements are zero.

That makes it cheaper to store, because the many zero elements do not need to be stored.

This comes at a cost, though: we may have to trade memory savings against computation time.

## Implementations overview

* Hashmap
* Index Array
* Compressed Index Array
* Binary Heap

### Index Array

> **NOTE**
>
> Due to a poor choice of names and laziness, the implementation can only be found in the branch `compressed_indices_2`

We can omit all zero elements by storing an index array alongside the nonzero values. Each value is associated with an index from the index array. This layout only saves memory when at least 50% of the elements are zero, because every stored element needs an 8-byte index in addition to its 8-byte value. Since I used `usize` to store the indices, which has the same size as a `u64` on 64-bit architectures, the required memory is:

```
mem(N) = non_zero_elements * (8 Bytes + 8 Bytes)
```
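
To make the 50% break-even point concrete, here is a small sketch that compares both layouts. It assumes `f64` values (8 bytes each) and ignores any spare `Vec` capacity; the function names `dense_bytes` and `index_array_bytes` are made up for this example.

```rust
use std::mem::size_of;

/// Bytes needed for a dense vector of `len` values (assumption: `f64` values).
fn dense_bytes(len: usize) -> usize {
    len * size_of::<f64>()
}

/// Bytes needed for the index-array layout: one `usize` index plus one value
/// per nonzero element. Spare `Vec` capacity is not counted.
fn index_array_bytes(non_zero: usize) -> usize {
    non_zero * (size_of::<usize>() + size_of::<f64>())
}

fn main() {
    let len = 1_000;
    // Below, at, and above the 50% break-even point (on a 64-bit target).
    for non_zero in [100, 500, 900] {
        println!(
            "{} nonzero of {}: dense {} B, index array {} B",
            non_zero,
            len,
            dense_bytes(len),
            index_array_bytes(non_zero)
        );
    }
}
```

On a 64-bit target both layouts need the same amount of memory when exactly half of the elements are nonzero; with fewer nonzero elements the index array wins.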

One significant downside is the cost of finding the matching entry when performing calculations such as the dot product. For this I used a binary search, which gives a nice speedup.
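
The following is a minimal sketch of this layout, not the code from the branch. It assumes `f64` values and indices kept in ascending order; the names `IndexArrayVector`, `from_dense`, `get` and `dot` are hypothetical.

```rust
/// Sparse vector stored as two parallel arrays, sorted by index.
struct IndexArrayVector {
    indices: Vec<usize>, // positions of the nonzero elements, ascending
    values: Vec<f64>,    // values[i] belongs to position indices[i]
}

impl IndexArrayVector {
    /// Builds the sparse form from a dense slice, skipping zeros.
    fn from_dense(dense: &[f64]) -> Self {
        let mut indices = Vec::new();
        let mut values = Vec::new();
        for (i, &v) in dense.iter().enumerate() {
            if v != 0.0 {
                indices.push(i);
                values.push(v);
            }
        }
        Self { indices, values }
    }

    /// Looks up one position with a binary search over the sorted indices.
    fn get(&self, index: usize) -> f64 {
        match self.indices.binary_search(&index) {
            Ok(pos) => self.values[pos],
            Err(_) => 0.0, // not stored, so the element is zero
        }
    }

    /// Dot product: walk our own entries and binary-search the other vector.
    fn dot(&self, other: &Self) -> f64 {
        self.indices
            .iter()
            .zip(&self.values)
            .map(|(&i, &v)| v * other.get(i))
            .sum()
    }
}

fn main() {
    let a = IndexArrayVector::from_dense(&[0.0, 2.0, 0.0, 0.0, 3.0]);
    let b = IndexArrayVector::from_dense(&[1.0, 4.0, 0.0, 5.0, 0.5]);
    println!("a . b = {}", a.dot(&b));
}
```

Each lookup costs O(log n) in the number of stored elements, so a dot product over m entries of one vector costs roughly O(m log n).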

### Hashmap Implementation

> **NOTE**
>
> Due to a poor choice of names and laziness, the implementation can only be found in the branch `hashmap`

This implementation uses a hashmap to associate each value with its index in the vector. In theory, this should be as memory-efficient as the index array method above.

In practice, this method requires significantly more memory, since a hashmap allocates more memory than it fills in order to reduce collisions.

It has one significant benefit: speed in calculations. Looking up values in a hashmap is generally faster than performing a binary search. Inserting and deleting are also O(1) operations on average.
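
As a rough illustration, a hashmap-backed sparse vector could look like the sketch below. It uses `std::collections::HashMap` with `f64` values; the type `HashMapVector` and its methods are hypothetical and not taken from the `hashmap` branch.

```rust
use std::collections::HashMap;

/// Sparse vector that maps each nonzero position to its value.
struct HashMapVector {
    entries: HashMap<usize, f64>,
}

impl HashMapVector {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Stores a value; writing zero removes the entry to keep the map sparse.
    fn set(&mut self, index: usize, value: f64) {
        if value == 0.0 {
            self.entries.remove(&index);
        } else {
            self.entries.insert(index, value);
        }
    }

    /// Average O(1) lookup; positions that are not stored read as zero.
    fn get(&self, index: usize) -> f64 {
        self.entries.get(&index).copied().unwrap_or(0.0)
    }

    /// Dot product: iterate over the smaller map and look up the larger one.
    fn dot(&self, other: &Self) -> f64 {
        let (small, large) = if self.entries.len() <= other.entries.len() {
            (self, other)
        } else {
            (other, self)
        };
        small.entries.iter().map(|(&i, &v)| v * large.get(i)).sum()
    }
}

fn main() {
    let mut a = HashMapVector::new();
    a.set(1, 2.0);
    a.set(4, 3.0);
    let mut b = HashMapVector::new();
    b.set(0, 1.0);
    b.set(1, 4.0);
    b.set(4, 0.5);
    println!("a . b = {}", a.dot(&b));
}
```

Iterating over the smaller map keeps the number of lookups low when the two vectors have very different numbers of nonzero elements.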

> **NOTE**
>
> Two implementations of the dot product can be found:
>
> One implemented with a simple loop and one with a binary search. From testing I can say that the simple loop variant is significantly faster than the crude binary search.

### Compressed Index Array

In order to reduce the size required to store the indices of each value we can compress them by only storing the relative offset to the previous value:


| Uncompressed Index | 0 | 7 | 13 | 33 | 45 | 47 | 48 | 57 | ... | 34567 |
| -------------------- | --- | --- | ---- | ---- | ---- | ---- | ---- | ---- | ----- | ------- |
| Compressed Index   | 0 | 7 | 6  | 20 | 12 | 2  | 1  | 9  | ... | 23    |

This yields smaller values, so we can safely reduce the number of bits used to store each index, as long as every offset fits into the smaller type.

In this implementation I reduced the index size from 64 to 16 bits. This makes memory usage a lot smaller, but computation gets a lot heavier, since all indices have to be decompressed on the fly. A possible improvement would be to cache the uncompressed values. This may be worth investigating further.
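
The compression step from the table above can be sketched as follows. The function names `compress` and `decompress` are made up for this example, and returning `None` when a gap does not fit into 16 bits is an assumption; how the actual implementation handles that case is not described here.

```rust
/// Delta-encodes ascending indices into 16-bit offsets.
/// Returns `None` if the input is not ascending or a gap exceeds `u16::MAX`
/// (assumption: the real implementation may handle large gaps differently).
fn compress(indices: &[usize]) -> Option<Vec<u16>> {
    let mut previous = 0usize;
    let mut offsets = Vec::with_capacity(indices.len());
    for &index in indices {
        let gap = index.checked_sub(previous)?;
        if gap > u16::MAX as usize {
            return None;
        }
        offsets.push(gap as u16);
        previous = index;
    }
    Some(offsets)
}

/// Rebuilds the absolute indices with a running sum over the offsets.
fn decompress(offsets: &[u16]) -> Vec<usize> {
    let mut current = 0usize;
    offsets
        .iter()
        .map(|&offset| {
            current += usize::from(offset);
            current
        })
        .collect()
}

fn main() {
    // The first eight indices from the table above.
    let indices = [0, 7, 13, 33, 45, 47, 48, 57];
    let offsets = compress(&indices).expect("all gaps fit into 16 bits");
    println!("{:?}", offsets);
    println!("{:?}", decompress(&offsets));
}
```

Because every index depends on all offsets before it, random access requires summing from the start, which is where the extra computation comes from.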

### Binary Heap

The implementation can be found in the `main` branch.

The binary heap has the advantage of being fast: inserting, removing and looking up values all take logarithmic time.
We again use the indices to sort the values of the vector into the binary heap.

## Comparison

The following values were measured with a randomly initialized vector of 10^10 elements, of which 2% were nonzero. The dot product implementation was single-threaded and run in release mode on hyperthreaded Intel hardware.


| Implementation         | Size on Heap (GB) | Runtime of dot product (s) |
| :----------------------- | ------------------- | ---------------------------- |
| Naive                  | 80                | N/A                        |
| Index Array            | 3.6               | 6.254261896                |
| Hashmap                | 5.4               | 0.732189927                |
| Compressed Index Array | 2.0               | > 120                      |
| Binary Heap            | 1.3               | 2.089960966                |

Licensed under GPLv2 or later, same as the entire repository