Yesterday I saw one of Code Report's YouTube videos on language comparisons and started speculating about what an optimal solution to the showcased problem would look like. Another YouTube user replied to the second of my two comments, at which point I discovered that both my original comment and a third one I had just written in reply were nowhere to be found. So this is my way of giving up and recording my thoughts in a more reliable place.

The original video

This was Connor’s usual formula of taking a relatively easy programming problem and solving it in different languages, this time C++, Python, APL and BQN. The problem statement is as follows: we are given a list nums of \(n\) integers, as well as two additional integer parameters \(k\) and \(m\), and we must return the result of applying the following process \(k\) times:

  • find the first occurrence of the smallest element of nums

  • multiply it by \(m\)

Connor’s solutions were a literal implementation of the loop above in all four languages, as the video focuses on the expressivity of the languages rather than on the specific problem, which I appreciate. However, I was also left unsatisfied (as was another commenter) by the time complexity of the solution: surely it’s possible to do better.
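
For reference, here is a minimal C++ sketch of mine of that literal loop (the name get_final_state_naive and the in-place signature are my own choices, mirroring the versions further down; this is not Connor's actual code):

#include <vector>
#include <algorithm>
#include <cstdint>

void get_final_state_naive(std::vector<uint64_t> &inout_vals, uint64_t k, uint64_t m)
{
	while (k--)
	{
		// min_element returns an iterator to the first occurrence of the smallest value
		auto const min_it = std::ranges::min_element(inout_vals);
		*min_it *= m;
	}
}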

Doing better

User @Roibarkan first gave this solution in Python, using a heap-based priority queue of (value, index) tuples pointing into the original array.

import heapq

def get_final_state(nums, k, m):
  # min-heap of (value, index) pairs; heapify works in place and returns None
  H = [(x, i) for i, x in enumerate(nums)]
  heapq.heapify(H)
  val, idx = heapq.heappop(H)  # current minimum and its index
  for _ in range(k):
    nums[idx] *= m
    # push the updated value back and retrieve the new minimum
    val, idx = heapq.heappushpop(H, (nums[idx], idx))
  return nums

I translated it into C++, mostly out of the same kind of curiosity that Connor must have had when making the original video, and was happy to see that it was still relatively short. The version shown here isn’t the original, as I integrated some suggestions by Roibarkan, but the idea is there.

#include <vector>
#include <algorithm>
#include <ranges>
#include <numeric>
#include <cstdint>

void get_final_state(std::vector<uint64_t> &inout_vals, uint64_t k, uint64_t m)
{
	// std::make_heap builds a max-heap, so the comparisons are reversed to keep
	// the smallest value (and, on ties, the smallest index) on top
	auto const cmp = [&](auto i, auto j){
		if (inout_vals[i] != inout_vals[j])
			return inout_vals[i] > inout_vals[j];
		return i > j;
	};
	auto perm = std::vector<uint64_t>(inout_vals.size());
	std::ranges::iota(perm, 0); // requires C++23
	std::ranges::make_heap(perm, cmp);
	while (k--)
	{
		std::ranges::pop_heap(perm, cmp); // move the minimum's index to the back
		inout_vals[perm.back()] *= m;
		std::ranges::push_heap(perm, cmp); // reinsert it with its new value
	}
}
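
A quick usage sketch, with input values I picked arbitrarily (the expected output comes from tracing the process by hand):

#include <cassert>

int main()
{
	auto vals = std::vector<uint64_t>{2, 1, 3, 5, 6, 4};
	get_final_state(vals, 5, 2);
	// five rounds of "multiply the first minimum by 2" yield this
	assert((vals == std::vector<uint64_t>{8, 4, 6, 5, 6, 4}));
}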

How good is this solution? Well, let’s compare it to the ones in the video and go from there. Connor’s solutions are all \(O(nk)\) in time, and somewhere between \(O(1)\) and \(O(n)\) in additional space. Our heap-based version trivially uses \(O(n)\) additional space, since the only allocation is the permutation, and its time is \(O(n + k \log(n))\), because std::make_heap is linear while std::push_heap and std::pop_heap are both \(O(\log(n))\). That’s pretty good, and depending on the relative sizes of \(k\) and \(n\) it could be as good as it gets.

The second iteration

User @Qhartb then suggested that there was probably something to be gained by partitioning the elements by their logarithm in base \(m\), which ultimately led me to the following solution:

  • keep the permutation that sorts the input in perm, so that the ith element of perm is the index of the ith smallest element of inout_vals.

  • maintain a prefix pre of perm with the invariant that the largest element indexed by pre is smaller than \(m\) times the smallest element.

  • at each step,

    • if \(k\) is smaller than or equal to the length of pre, then multiply by \(m\) the first \(k\) elements indexed by pre and terminate. Otherwise,

    • multiply by \(m\) each element indexed by pre (this is correct because once the first element indexed by pre is multiplied by \(m\), it becomes larger than any element indexed by pre, so the smallest element of inout_vals corresponds to the second element of pre, and so on).

    • Subtract the length of pre from \(k\).

    • Find the largest prefix of perm that satisfies pre's invariant by scanning the remaining elements until one larger than or equal to the largest element currently indexed by pre (after the multiplication) is found.

    • Merge the range found this way into pre (note that the elements in pre have all been multiplied by \(m\) and are still sorted relative to each other, but not relative to the rest of the range).

There is a big caveat here: the sort has to be stable, because the original algorithm explicitly multiplies the first occurrence of the minimum value by \(m\). (For example, with nums = {2, 2}, \(k = 1\) and \(m = 3\), the correct result is {6, 2}, not {2, 6}.) This was also a mistake I made in the first translation of my heap-based implementation, because my comparison didn’t look at the indices to break ties between equal values.

Now with that out of the way, here is the implementation in C++.

#include <vector>
#include <algorithm>
#include <ranges>
#include <numeric>
#include <cstdint>

void get_final_state(std::vector<uint64_t> &inout_vals, uint64_t k, uint64_t m)
{
	// order by value, breaking ties by index so the sort is stable
	auto const cmp = [&](auto i, auto j){
		if (inout_vals[i] != inout_vals[j])
			return inout_vals[i] < inout_vals[j];
		return i < j;
	};
	auto perm = std::vector<uint64_t>(inout_vals.size());
	std::ranges::iota(perm, 0); // requires C++23
	std::ranges::sort(perm, cmp);
	// [left, right) is the prefix pre described above; assumes a non-empty input
	auto left = perm.begin();
	auto right = perm.begin() + 1;
	while (k > static_cast<uint64_t>(std::distance(left, right)))
	{
		std::for_each(left, right, [&](auto i){ inout_vals[i] *= m; });
		k -= std::distance(left, right);
		auto mid = right;
		// extend pre up to the first element at least as large as its current maximum
		right = std::find_if(mid, perm.end(), [&](auto i){ return inout_vals[i] >= inout_vals[*(mid - 1)]; });
		// pre was multiplied wholesale, so it has to be re-merged with the extension
		std::inplace_merge(left, mid, right, cmp);
	}
	// fewer steps left than elements in pre: multiply the first k and stop
	std::for_each(left, left + k, [&](auto i){ inout_vals[i] *= m; });
}
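
As a sanity check, the two implementations can be pitted against each other on random inputs. Here is a sketch along those lines, assuming the heap version and this one have been renamed get_final_state_heap and get_final_state_merge (hypothetical names; both are called get_final_state above):

#include <vector>
#include <algorithm>
#include <random>
#include <cassert>
#include <cstdint>

int main()
{
	auto rng = std::mt19937{42};
	auto dist = std::uniform_int_distribution<uint64_t>{1, 1000};
	for (int round = 0; round < 1000; ++round)
	{
		auto a = std::vector<uint64_t>(100);
		std::ranges::generate(a, [&]{ return dist(rng); });
		auto b = a;
		get_final_state_heap(a, 50, 2);   // the heap-based version
		get_final_state_merge(b, 50, 2);  // the sort-and-merge version
		assert(a == b); // both must agree on every input
	}
}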

That was quite a bit of work, and it could be all for nothing. Did we actually improve the time complexity? In short, no. But also maybe, kind of? I claim that the time complexity of this algorithm is \(O(n \log(n) + k)\), and that’s incomparable to \(O(n + k \log(n))\), because we don’t know the relative sizes of \(k\) and \(n\). If they are roughly the same, then we have neither improved nor degraded the performance; if \(k\) is smaller than \(n\), we’ve degraded it; and if \(n\) is smaller, we’ve improved it. In a vacuum there’s no way to tell, which is kind of the problem with such an abstract formulation: there is just no objective metric to judge how good a solution is (as long as it’s not \(O(nk)\) :P).

For the sake of argument, let’s say that \(k\) and \(n\) are roughly the same size:

I’m pretty confident that the constant factors of the second implementation are lower than those of the first, simply because the multiply loop is tighter; as soon as the pre range gains a few elements, that’s going to make a difference over pushing and popping from a heap repeatedly.

This also goes for the sort: I hope that the standard library sort is faster than a very naive heap sort, which by extension makes it faster than the repeated heap usage of the first solution.

But now the proof that this last solution is \(O(n \log(n) + k)\) is a bit overdue, so let’s get on with it.

  • the \(n \log(n)\) part is just the sort, nothing special.

  • for the loop part, we prove the following statements:

    • at most \(k\) iterations of the outer loop are performed.

    • at most \(k\) multiplications are performed in total.

    • the sum of the sizes of the ranges fed to std::inplace_merge is at most the number of multiplications performed, plus \(n\).

    • the total number of iterations of std::find_if is at most \(n\) plus the number of iterations of the outer loop.

The first two are trivial because \(k\) decreases by the number of multiplications performed on each iteration, and at least one multiplication is performed per iteration of the loop.

The third follows from the fact that each merged range consists of the old pre, whose elements have all just been multiplied, plus the newly scanned extension; every element joins pre at most once, so the extensions account for the extra \(n\).

The last statement is true for a similar reason: each call to std::find_if scans the elements that are about to join pre, plus at most one element that fails the predicate. All the mentioned operations are linear, so the total operation count for the loop is \(O(n + k)\).
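
Tallying everything up gives the claimed bound:

\[ \underbrace{O(n \log(n))}_{\text{sort}} + \underbrace{O(k)}_{\text{multiplications}} + \underbrace{O(n + k)}_{\text{merges}} + \underbrace{O(n + k)}_{\text{scans}} = O(n \log(n) + k) \]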

Dreaming of perfection

We could realistically push the asymptotic cost down to \(O(n + k)\) if the C++ standard library offered a std::radix_sort with a custom projection (a radix sort can easily be made stable), though if I really wanted to, I could write one myself.
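
For what it’s worth, here is a sketch of what such a sort could look like: a stable LSD radix sort that produces the permutation directly, processing one byte per pass (radix_argsort is a made-up name; the tie-breaking by index falls out of stability for free, since the permutation starts out in index order):

#include <vector>
#include <numeric>
#include <ranges>
#include <iterator>
#include <cstdint>

std::vector<uint64_t> radix_argsort(std::vector<uint64_t> const &vals)
{
	auto perm = std::vector<uint64_t>(vals.size());
	std::ranges::iota(perm, 0);
	auto buffer = std::vector<uint64_t>(perm.size());
	for (int shift = 0; shift < 64; shift += 8) // one pass per byte, least significant first
	{
		// histogram of the current byte, offset by one to ease the prefix sum
		uint64_t counts[257] = {};
		for (auto i : perm)
			++counts[((vals[i] >> shift) & 0xff) + 1];
		// turn the histogram into starting offsets for each bucket
		std::partial_sum(std::begin(counts), std::end(counts), std::begin(counts));
		// stable scatter: indices with equal bytes keep their relative order
		for (auto i : perm)
			buffer[counts[(vals[i] >> shift) & 0xff]++] = i;
		perm.swap(buffer);
	}
	return perm;
}

Substituting this for std::ranges::sort above would make the preprocessing \(O(n)\), at the cost of eight full passes over the data, so whether it wins in practice would still have to be measured.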