This year I participated in the APL Problem Solving Competition organized by Dyalog Ltd, I will document my thoughts about the problems I solved here.
the first problem starts off pretty tame: write a dfn that converts a DNA sequence to RNA, where both input and outputs are character arrays.
The solution doesn’t need much explanation, it’s just replacing the
'T'
s with 'U'
s.
rna ← {'U'@('T'∘=)⍵}
I had a little more fun with the next one, as I avoided writing down two separate look up tables by arranging my data properly:
the statement calls for a dfn that returns the reverse complement of a DNA sequence.
{⌽⍵}
'ACGT'
:
{'ACGT'⍳⌽⍵}
{(⊂'ACGT'⍳⌽⍵)⌷⌽'ACGT'}
With a little massaging we end up with the final solution.
revc ← {'ACGT'(⊂⍤⍳⌷⌽⍤⊣)⌽⍵}
This task calls for a dfn that returns the protein coded by a RNA sequence, terminating at the first stop codon. Setting up a lookup table is a logical first step:
t ← ↑('UUU' 'F')('CUU' 'L')('AUU' 'I')('GUU' 'V')
t⍪← ↑('UUC' 'F')('CUC' 'L')('AUC' 'I')('GUC' 'V')
...
t⍪← ↑('UGG' 'W')('CGG' 'R')('AGG' 'R')('GGG' 'G')
r_cod amm ← ↓⍉t
The helper function r_trs
does the heavy lifting,
splitting the sequence in codons of 3 (x←{↓(⌊3÷⍨≢⍵)3⍴⍵}r
),
looking up each one (y←r_cod⍳x
) and assinging the
corresponding amino-acid (amm⌷⍨∘⊂y
).
the whole function can be written tacitly, and the final solution
just applies it to it’s input and truncates the result at the first
'$'
.
r_trs ← amm⌷⍨∘⊂r_cod⍳{↓(⌊3÷⍨≢⍵)3⍴⍵}
prot ← {'$'((¯1+⍳⍨)↑⊢)r_trs⍵}
This was a parsing problem, of which I’ve done entirely too many doing Advent of Code in APL last year: it all boils down to partitioning and enclosing and partition-enclosing.
We need to write a dfn that accepts the path to a fasta file and returns a list of header-sequence pairs corresponding to the DNA sequences stored in the file.
For reference, a fasta file looks something like this:
>Rosalind_99
AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGA
TTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG
>Rosalind_2748
ATCAGGCTACCGTGTTTGCGGACGGGGGCTTAATCTAGCTTCTCATCTCAGCGACGTCTC
CTTGTTGGCACAGCGGTGGCAGGAGGTCCCCGCCGAGGAGCACATCGACCTTTCGGTGTA
...
The lines starting with >
are headers, and the
following lines contain the DNA sequences.
The solution is fairly pedestrian:
⊃⎕NGET⍵1
('>'=⊃¨)
, and use
them to partition-enclose the input.,/1↓⍵
concatenates all but the header⊂' '(1↓(¯1+⍳⍨)↑⊢)⊃⍵
gets the header up to the first
space character.readFASTA ← {{(⊂' '(1↓(¯1+⍳⍨)↑⊢)⊃⍵),,/1↓⍵}¨(('>'=⊃¨)⊂⊢)⊃⎕NGET⍵1}
The task is to write a function that takes a fasta file, extracts the first (only) DNA sequence, and returns all the proteins coded by it, beginning with a ‘M’ amino-acid and ending with a stop codon. I approached the problem by writing a helper function that would turn a sequence of DNA into the amino-acid sequences of it’s 6 reading frames.
Our final solution will look like this (for some definition of
crf
and aas
):
orf ← {crf⍤aas⊃⌽⊃readFASTA⍵}
readFASTA
, which will give us the
sequences.⊃⌽⊃
takes the last element in the first header-sequence
pair.aas
will be defined later, it outputs the amino-acid
sequences for the six reading frames of the DNA sequence.crf
will be defined later, it accepts a list of
amino-acid sequences and outputs their subsequences that begin with
M
and end right before a $
.aas
Here I had the choice between the easy and intuitive solution by composition, or the “smart” solution that probably has no real benefit in this case.
I obviously chose the latter.
What I could have done was to extract the reading frames with the
help of revc
, then convert each DNA sequence into RNA with
rna
, finally transcribe into amino-acids with
r_trs
: It would have looked something like this:
aas ← r_trs¨rna¨(,0 1 2∘.↓,⍥⊂∘revc⍨)
however, RNA and DNA are dual representations of the same data, so I
opted to convert my lookup table into DNA representation, and write a
function to parallel r_trs
: this saves us the RNA
conversion, and for big inputs it may perform better (though probably
not by a significant amount).
d_cod ← 'T'@('U'∘=)¨r_cod
d_trs ← amm⌷⍨∘⊂d_cod⍳{↓(⌊3÷⍨≢⍵)3⍴⍵}
aas ← d_trs¨(,0 1 2∘.↓,⍥⊂∘revc⍨)
In hindsight, I could have been a little smarter still: observing
that the possible codons are just 3-digit numbers in the base 4 number
system with digits 'ACGT'
, we can leverage the decode
primitive.
{1+4⊥⍉¯1+'ACGT'⍳(⌊3÷⍨≢⍵)3⍴⍵}
can substitute the lookup
into d_cod
, and d_trs
can be rewritten as
d_trs ← amm⌷⍨∘⊂{1+4⊥⍉¯1+'ACGT'⍳(⌊3÷⍨≢⍵)3⍴⍵}
, provided that
amm
contains the amino-acid sorted by the respective
codon’s numeric value.
There is no more need for the d_cod
or
r_cod
tables at all! we could even parametrise the solution
and pass either 'ACGT'
or 'ACGU'
to get both
r_trs
and d_trs
. The final definition would
have looked something like this:
trs ← {amm[1+4⊥⍉¯1+⍺⍳(⌊3÷⍨≢⍵)3⍴⍵]}
aas ← 'ACGT'∘trs¨(,0 1 2∘.↓,⍥⊂∘revc⍨)
But we would still be doing duplicate work in the lookups, as well as giving up some opportunities for semplification: this is what it would look like after addressing those issues.
amm ← 'KNKNTTTTRSRSIIMIQHQHPPPPRRRRLLLLEDEDAAAAGGGGVVVV$Y$YSSSS$CWCLFLF'
aas ← {{amm[1+4⊥⍉(⌊3÷⍨≢⍵)3⍴⍵]}¨,0 1 2∘.↓,⍥⊂∘(5-⌽)¯1+'ACGT'⍳⍵}
A nice insight comes from this layout: the last base is ignored in
most codons, as can be observed in 4 4 4⍴amm
(many rows are
constant).
crf
Speculation aside, to complete the task we would like - given a list
of the amino-acids coded by the reading frames - to extract any
subsequence of one of them, which starts with 'M'
and ends
just before a '$'
.
Now I find myself in a bit of an embarassing situation: I have two solutions to this task, which diverge at this point. I definitely submitted the first one, but I am not sure if I ended up overriding my submission with the second solution; as a result, I am not sure which version ended up being graded. I will just describe them both and worry about it some other time.
We define a 1-modifier _f
which takes a monadic function
and applies it to each value in ⍵
, then (expecting
⍺⍺
to return a list for each application) concatenates the
results together. This is heavily inspired by rust’s
flat_map
.
_f ← {⊃,/⍺⍺¨⍵}
To construct the solution:
$
:
'$'∘≠⊆⊢
is a tacit function that splits on
$
-'$'≠⊢/
returns 0
if the sequence ends in
$
, ¯1
otherwise. In the latter case, we need
to drop the last chunk from the partitioning.(-'$'≠⊢/)↓'$'∘≠⊆⊢
puts them together to give the chunks
that end just before $
.'M'∘(=⊂⊢)
partition-encloses at the positions of
M
.,⍨\⍤⌽
performs a reverse-catenate-scan in reverse (what
a mouthful) which yields the subsequences starting with M
and ending at the end of the chunk.What’s left is to do some plumbing with _f
and make sure
the results are unique:
crf ← {∪'M'∘(,⍨\⍤⌽(=⊂⊢))_f((-'$'≠⊢/)↓'$'∘≠⊆⊢)_f⍵}
The second solution is arguably a little more interesting, as it doesn’t directly translate from something you might write in a scalar language.
'$'
:s d←(,/,(⊂(+\≢¨))),∘'$'¨⍵
$
s, which are available as the
cumulative sum of the lengths.M
and $
in the
merged input with 'M$'(⍸=)¨⊂s
.M
s to the corresponding
$
’s position with binary search:
i←{↑⍺(⍵[1+⍵⍸⍺])}/ ...
$
at the end of input, so
each M
finds a corresponding $
.b e←¯1+↓(~i[2;]∊d)/i
∪b↓¨e↑¨⊂s
.Putting it all together, we get the following:
crf ← {
s d←(,/,(⊂(+\≢¨))),∘'$'¨⍵
b e←¯1+↓(~i[2;]∊d)/i←⊃{↑⍺(⍵[1+⍵⍸⍺])}/'M$'(⍸=)¨⊂s
∪b↓¨e↑¨⊂s
}
Looking back on it, the last line might have been better expressed as
∪b{s[⍺+⍳⍵-⍺]}¨e
to avoid materializing too many big arrays,
but the performance implications of it depend on the implementation.
The task is to write a dfn that computes or validates a vehicle identification number; there are quite a few details involved, but the long and short of it is that characters in the VIN contribute to a total score according to their value and position, and the score modulo 11 determines the correct value of the 9th digit.
The solution handles a few cases:
¯1
.vinc ← ⎕D,⎕A~'IOQ'
vin ← {
~×/⍵∊vinc: ¯1
16=≢⍵: (calc(⊣@9)⊢)8(↑,'0',↓)⍵
17=≢⍵: (calc=9∘⌷)⍵
¯1
}
We delegate the messy part to the calc
helper function,
that takes a 17 character vector and returns the character that, placed
at position 9, would make the VIN valid.
calc
I set up the tables:
num
contains the numeric values to substitute to the
charactersmul
contains the values that digits at each position
must be multiplied by.num ← ((65∘≤×1+9|83∘≤+-∘65)+(57∘≥×-∘48))⎕ucs vinc
mul ← 10 0@8 9⊢17↑2↓⌽,⍨⍳10
calc ← {(⎕D,'X')⌷⍨1+11|mul+.×num⌷⍨⊂vinc⍳⍵}
The procedure is as follows:
num⌷⍨⊂vinc⍳⍵
mul +.× ...
(⎕D,'X')⌷⍨1+11| ...
the tables defined above look very bad and are not at all insightful, what I should really have written is:
num ← (⍳9),(⍳8),((⍳9)~6 8),1↓⍳9
mul ← (⊢,10 0 9,⊢)⌽1↓⍳8
… or even just the raw numbers, it’s not like we’re starved for hard disk space.
This task asked for a function that sorts strings representing software versions according to their release number. The structure of the input is either:
"[a-zA-Z]+-[a-zA-Z]+-[0-9]+\.[0-9]+\.[0-9]+"
The solution leverages total array ordering, and the bulk of the problem just boils down to parsing the numbers in the string.
parts ← (-∘~⍨/'.'⎕vfi⊃)@3('-'∘≠⊆⊢)
sortVersions ← {1=≡⍵: ,⊂⍵ ⋄ ((⊂∘⍋parts¨)⌷⊢)⍵}
If the input is a flat character array, then just return it wrapped
in a 1-element list. If it’s a list of strings, split each entry by
-
into the fields (org
, package
,
version
). Then parse the three numbers in
version
using ⎕vfi
: this suggests a nice
extension of the problem statement: if one of the fields in
version
is not a number, ⎕vfi
will return a
0
verification bit and a 0
result: this means
that the expression result-~verification
returns
result
if valid, ¯1
otherwise.
This allows us to sort versions like foo-bar-10.x.0
in
between foo-bar-9.9.9
and foo-bar-10.0.0
,
which is kind of nice to have.
With that out of the way, we have a list of entries in the
form:
org package (major minor patch)
.
Applying ⍋
gives us a sorting permutation, which by the
rules of total array ordering conforms to the problem statement.
The task is to write a dfn that, given a sorted list of coin denominations (positive integers) as left argument and a total amount (a positive integer) as right argument, returns a matrix, where each row contains the coefficients of a linear combination of the elements in the left argument which sums to the right argument. Of course, the matrix should be exhaustive and not contain any duplicates.
Another way to look at this is that it returns the tallest matrix
res
that satisfies the following constraints:
⍵∧.=res+.×⍺
res≡∪res
The solution I submitted is a classic recursive one, with just a little bit of care put into the performance of it:
makeChange ← {
0≠⍵|⍨∨/⍺: (0,≢⍺)⍴0
1=≢⍺: ⍪⍵÷⊃⍺
⊃⍪/(((⊂¯1↓⍺)∇¨⍵-(⊃⌽⍺)×⊢),¨⊢)0,⍳⌊⍵÷⊃⌽⍺
}
It is a well known mathematical fact that for a finite set \(A\) of integers, there exists a linear
combination in \(A\) with integer
coefficients that sums to \(n\) if and
only if \(\text{GCD}(A) | n\) (where
the GCD of a finite set is given by the pairwise GCD reduction of the
elements in some arbitrary ordering, and \(|\) is the divisibility symbol). This
allows us to immediately return (0,≢⍺)⍴0
if ⍵
is not divisible by ∨/⍺
, which prunes some
branches.
The second line handles the general base case, and it is only
reached if there is exactly one trivial solution. We can avoid checking
divisibility because the first line catches the case where
0≠⍵|⊃⍺
.
The third line does all the recursing:
0,⍳⌊⍵÷⊃⌽⍺
determines how many coins of the largest
denomination can be taken away from ⍵
.⍵-(⊃⌽⍺)×⊢
calculates the amount left over from each
operation.(⊂¯1↓⍺)∇¨⍵-(⊃⌽⍺)×⊢
recurses on each case, removing the
largest denomination.(((⊂¯1↓⍺)∇¨⍵-(⊃⌽⍺)×⊢),¨⊢) ...
gives a list of matrices of
the solutions⊃⍪/
concatenates the solutions together.After some thinking, I came up with a more interesting solution for
this problem, which I didn’t submit as the competition had already
closed: the idea is to reduce the amount of recursive calls as much as
possible, by allowing the function to take an array right argument and
process values of ⍵
in parallel, sharing the recursive
subproblems between them.
makeChange ← {
1≡≢⍺: (0=⍵|⍨⊃⍺)↑¨⍪¨⍵÷⊃⍺
d←0,¨⍳¨⌊⍵÷i←⊢/⍺
u←∪∊r←⍵-i×d
s←(¯1↓⍺)∇u
⊃⍣(0≡≢⍴⍵)⊢r{⊃⍪/s[u⍳⍺],¨⍵}¨d
}
⍺
only has one element, we provide
the trivial solution to each element in ⍵
as a
0 1
or 1 1
matrix.d←0,¨⍳¨⌊⍵÷i←⊢/⍺
generates a list of lists, each
containing the amounts of coins of the biggest denomination that can be
taken from each element of ⍵
.r←⍵-i×d
generates the nested list of amounts left over
from each operationu←∪∊r
flattens and deduplicates this list, giving the
list of subproblems to recurse on.s←(¯1↓⍺)∇u
gives the list of solutions to each
subproblem{⊃⍪/s[u⍳⍺],¨⍵}
looks up each element of ⍺
in the list of subproblems and gives the corresponding solution, then
joins it to ⍵
and concatenates the results.⊃⍣(0≡≢⍴⍵)
discloses the result if the right argument
was a scalar, so that the result is conformant with the problem
statement.The task is to write a procedure partition
that returns
sliding windows of the right argument, with the size, strides and
starting position given by the left argument.
partition ← {
1 0∊⍨≡⍺: (⊂,⍺)∇⍵
(≢⍴⍵)<≢⊃⍺: 0↑⊂(⊃⍺)⍴⍬
s m o←(≢⍴⍵)(⊢,⍨1⍴⍨-∘≢)¨3↑⍺,⍬⍬
,⊂⍤((⊃⍺)∘⍴)⍤(≢s)⊢⍵[(¯1+⍳s)∘.+⍨(⊂o)+(⊂m)ׯ1+⍳0⌈1+⌊m÷⍨(⍴⍵)-o+s-1]
}
We start by parsing the left argument, case by case.
⍺
is simple, we enclose it (it will be used to
construct s
later): (⊂,⍺)∇⍵
⍺
is nested:
≢⊃⍺
is greater than the rank of ⍵
, we
are being asked for a window that can’t be found in the array, and we
return the empty list, however we also set the fill element to be the
desired shape, to maintain the invariant
(⊃⍺)≡⍴⊃⍺partition⍵
.s m o
where:
s
designates the shape of the windows.m
designates the strides of the windows along each
dimension.o
designates the origin of the first window.⍵
, or omitted completely. in this
case, the implied value for the omitted leading dimensions is
1
. To make sure our arrays are in this format:
3↑⍺,⍬⍬
: Pad ⍺
with instances of
⍬
up to 3 elements:(≢⍴⍵)(⊢,⍨1⍴⍨-∘≢)¨ ...
: Prepend 1
s to each
element, up to ≢⍴⍵
.Then, our strategy is to generate indices into ⍵
to
build up the resulting windows:
(⍴⍵)-o+s-1
: The counts of the distinct (non-trivial)
origins for a window of shape s
along each axis: each
window has to start at a position pos
such that:
∧/ (o ≤ pos) , pos ≤ (⍴⍵)-s
. If no windows fit along an
axis, the corresponding component is negative.0⌈1+⌊m÷⍨ ...
: Divide by the strides to get the counts
of windows actually in the result: clamp negative components up to
0.(⊂o)+(⊂m)ׯ1+⍳ ...
: Generate the origin of each window
as the sum of the base position o
and an offset.(¯1+⍳s)∘.+⍨ ...
: Sum each window’s origin with each
offset in the shape of the window with outer product. the commute is
used to make sure the rank ≢s
cells of the result
correspond to windows.⍵[ ... ]
: Index into ⍵
with the generated
indices.⊂⍤((⊃⍺)∘⍴)⍤(≢s) ...
: Pick the cells of rank
≢s
(each corresponding to a window) and enclose them after
giving them the requested shape, finally unravel the result.I didn’t feel satisfied with this solution, mainly because it uses way too many enclosed arrays as indices. Here it is, rewritten to convert to scalar indices in ravel order. I took the opportunity to modify a couple other things I wasn’t happy with:
⎕IO
into account instead of assuming
⎕IO←1
.⊆
to conditionally enclose ⍺
instead
of using recursion.I ← {⍵⊤⎕IO-⍨⍳×/⍵}
partition ← {
(≢⍴⍵)<≢⊃a←⊆⍺: 0↑⊂(⊃a)⍴⍬
s m o←(-r)↑¨(⊂(r←≢sw←⍴⍵)⍴1),¨3↑a,⍬⍬
or←sw⊥m×⍤0 1I 0⌈1+⌊m÷⍨sw-s+o-⎕IO
⊂⍤¯1⊢((≢or),⊃a)⍴(⊂⎕IO+(or+sw⊥o-⎕IO)∘.+sw⊥I s)⌷,⍵
}
The result is a little longer, but arguably simpler:
0⌈1+⌊m÷⍨sw-s+o-⎕IO
is exactly as before, but accounts
for ⎕IO
.I ← {⍵⊤⎕IO-⍨⍳×/⍵}
defines a function equivalent to
{⍉⎕IO-⍨↑,⍳⍵}
but doesn’t use nested arrays. That is, it
generates a matrix whose columns are the 0-based indices into an array
of shape ⍵
, given in ravel order.or←sw⊥m×⍤0 1I ...
generates the 0-based indices into
,⍵
of the origins of the windows. (the pattern
sw⊥ ...
is used to convert a matrix whose columns are
0-based indices (or offsets) into ⍵
to a list of
corresponding indices (or offsets) into ,⍵
)sw⊥I s
gives the offset in ,⍵
of each
element in a window from the origin of the same window.(⊂⎕IO+(or+sw⊥o-⎕IO)∘.+sw⊥I s)⌷,⍵
builds up windows of
indices and uses them to pick from ,⍵
⊂⍤¯1⊢((≢or),⊃a)⍴ ...
adjusts the shape of the windows
and encloses the results.If nothing else, this shows my issue with ⎕IO
: as soon
as you start doing arithmetic on indices it gets in the way. If
⎕IO
was always 0
, the solution could be
simplified to:
I ← {⍵⊤⍳×/⍵}
partition ← {
(≢⍴⍵)<≢⊃a←⊆⍺: 0↑⊂(⊃a)⍴⍬
s m o←(-r)↑¨(⊂(r←≢sw←⍴⍵)⍴1),¨3↑a,⍬⍬
or←sw⊥m×⍤0 1I 0⌈1+⌊m÷⍨sw-s+o
⊂⍤¯1⊢((≢or),⊃a)⍴(⊂(or+sw⊥o)∘.+sw⊥I s)⌷,⍵
}
It also exemplifies how keeping the data representation flat can
simplify the reasoning, as well as improving the performance: the new
version is about 28x faster than the old one on
3 3 3 partition 25 25 25⍴⎕A
.