GenomicRangeQuery at app.codility.com-programmers in C Plus Plus Explained - Find the minimal nucleotide from a range of sequence DNA
Foreword: Category of Article - Technology, Computers and Software, Algorithm
By: Chrysanthus Date Published: 29 Jan 2024
Task
Detected time complexity : O(N + M)
Lesson 5 at app.codility.com/programmers
The solution and its explanation are given below.
Problem
A DNA sequence can be represented as a string consisting of the letters A, C, G and T, which correspond to the types of successive nucleotides in the sequence. Each nucleotide has an impact factor, which is an integer. Nucleotides of types A, C, G and T have impact factors of 1, 2, 3 and 4, respectively. You are going to answer several queries of the form: What is the minimal impact factor of nucleotides contained in a particular part of the given DNA sequence?
The DNA sequence is given as a non-empty string S = S[0]S[1]...S[N-1] consisting of N characters. There are M queries, which are given in non-empty arrays P and Q, each consisting of M integers. The K-th query (0 <= K < M) requires you to find the minimal impact factor of nucleotides contained in the DNA sequence between positions P[K] and Q[K] (inclusive).
For example, consider string S = CAGCCTA and arrays P, Q such that:
P[0] = 2 Q[0] = 4
P[1] = 5 Q[1] = 5
P[2] = 0 Q[2] = 6
The answers to these M = 3 queries are as follows:
The part of the DNA between positions 2 and 4 contains nucleotides G and C (twice), whose impact factors are 3 and 2 respectively, so the answer is 2.
The part between positions 5 and 5 contains a single nucleotide T, whose impact factor is 4, so the answer is 4.
The part between positions 0 and 6 (the whole string) contains all nucleotides, in particular nucleotide A whose impact factor is 1, so the answer is 1.
Write a function:
vector<int> solution(string &S, vector<int> &P, vector<int> &Q);
that, given a non-empty string S consisting of N characters and two non-empty vectors P and Q consisting of M integers, returns a vector consisting of M integers specifying the consecutive answers to all queries.
Result vector should be returned as a vector of integers.
For example, given the string S = CAGCCTA and vectors P, Q such that:
P[0] = 2 Q[0] = 4
P[1] = 5 Q[1] = 5
P[2] = 0 Q[2] = 6
the function should return the values [2, 4, 1], as explained above.
Write an efficient algorithm for the following assumptions:
N is an integer within the range [1..100,000];
M is an integer within the range [1..50,000];
each element of vectors P and Q is an integer within the range [0..N - 1];
P[K] <= Q[K], where 0 <= K < M;
string S consists only of upper-case English letters A, C, G, T.
Note:
A slice is a sequence of elements that form a subset (subgroup) of the given vector (array).
Brute Force Approach
#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
using namespace std;
vector<int> solution(string &S, vector<int> &P, vector<int> &Q) {
int M = P.size();
unordered_map<char, int> mp = {{'A', 1}, {'C', 2}, {'G', 3}, {'T', 4}};
vector<int> answers(M, 0);
for(int K=0; K<M; K++) {
int min = 5;
for (int i=P[K]; i<=Q[K]; i++) { //scan the slice
if (mp[S[i]] < min) {
min = mp[S[i]];
answers[K] = min;
}
}
}
return answers;
}
int main()
{
vector<int> P{2, 5, 0};
vector<int> Q{4, 5, 6};
string str = "CAGCCTA";
vector<int> ret = solution(str, P, Q);
cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl;
vector<int> P2{0};
vector<int> Q2{0};
string str2 = "CAGCCTA";
vector<int> ret2 = solution(str2, P2, Q2);
cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl;
return 0;
}
The output is:
2 4 1
2 0 0
The scoring and detected time complexity for this solution() function, at app.codility.com/programmers is:
Task score : 62% - Correctness : 100% ; Performance : 0%
Detected time complexity: O(N x M)
The correctness is 100% and the speed (time taken) is 0% (too slow). The reasons for the 0% performance (from app.codility.com/programmers) are given in the following table:
almost_all_same_letters | X TIMEOUT ERROR |
GGGGGG..??..GGGGGG..??..GGGGGG | Killed. Hard limit reached: 6.000 sec. |
large_random large random string, length | X TIMEOUT ERROR Killed. Hard limit reached: 6.000 sec. |
extreme_large all max ranges | X TIMEOUT ERROR Killed. Hard limit reached: 6.000 sec. |
N x M as number of operations for the solution() function, can be compared (is similar) to N2 operations (especially when N = M). Imagine a grid of N columns by M rows where each cell in the grid represents an operation. N is for the length of the given string, S, and M is for the given number of queries. So, the total number of operations for the above solution() function, is N x M. Remember, it is the maximum possible number of operations that is considered, when dealing with time complexity.
Smart Solution
Multiplication (N x M) is continuous addition (continuous summing). Prefix-sums can reduce time complexity from O(N2) to O(N). So, prefix-sums can be used somewhere, somehow in the solution() function code, to enable the number of operations go down from (N x M) to (N + M). After all, this problem is on the lesson (tutorial) topic of prefix-sums.
If the slice sum from the prefix-sums for the number of occurrences of each nucleotide type, in the string, is obtained for each query, then the total number of operations can be reduced from (N x M) to (N + M). However, here, two dimensional array is needed and not one dimensional array.
In the problem example, N is 7 and M is 3. With that, 7 x 3 = 21 operations and N + M = 7 + 3 = 10 operations.
Illustration
The illustration is done here with the above sample case of S = CAGCCTA consisting of N = 7 characters and M = 3 queries, which are:
P[0] = 2 Q[0] = 4
P[1] = 5 Q[1] = 5
P[2] = 0 Q[2] = 6
A two dimensional array of occurrences has to be created with all elements initialized to 0, as the following table shows:
Number of Occurrences of Nucleotide Types in given String, Cells initialized to 0 | |||||||||
---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 (C) | 2 (A) | 3 (G) | 4 (C) | 5 (C) | 6 (T) | 7 (A) | |
nucleotide | |||||||||
0 (A) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
1 (C) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
2 (G) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
3 (T) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The code segment to produce this two dimensional array with all elements initialized to 0 is:
int occurrences[4][N+1];
for (int j=0; j<4; j++) {
for (int i=0; i<N+1; i++) {
occurrences[j][i] = 0;
}
}
The time of execution for this code segment is not included in the overall time complexity. The first column of zeros is the prefix sums preceding zero.
Type A nucleotide has an impact factor of 1. However, in the table its row index is 0 = 1 - 1. Type C nucleotide has an impact factor of 2. However, in the table its row index is 1 = 2 - 1. Type G nucleotide has an impact factor of 3. However, in the table its row index is 2 = 3 - 1. Type T nucleotide has an impact factor of 4. However, in the table its row index is 3 = 4 - 1.
Next in the program, the number of occurrences of each nucleotide type in the string, S = "CAGCCTA" has to be put into the two dimensional array (table). The table becomes:
Number of Occurrences of Nucleotides in a 2D Array | |||||||||
---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 (C) | 2 (A) | 3 (G) | 4 (C) | 5 (C) | 6 (T) | 7 (A) | |
nucleotide | |||||||||
0 (A) | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | |
1 (C) | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | |
2 (G) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | |
3 (T) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
'A' occurs 1 time at index 1 (zero based indexing) in the given string, but occurs at index 2 = 1+1 above. 'A' also occurs once at index 6 in the given string, but occurs at index 7 = 6+1 above. C occurs 1 time at index 0 in the given string, 1 time at index 3 in the given string, and 1 time at index 4 in the given string. However, C occurs 1 time at index 1 = 0+1 above, 1 time at index 4 = 3+1 above, and 1 time at index 5 = 4+1 above. G occurs 1 time at index 2 in the given string, but occurs at index 3 = 2+1 above. T occurs 1 time at index 5 in the given string, but occurs at index 6 = 5+1 above. Every other cell remains at the initialized value of 0.
The code segment to produce the above 2D array is:
for (int i=0; i < N+1; i++) {
if(S[i] == 'A') occurrences[0][i+1] = 1;
else if(S[i] == 'C') occurrences[1][i+1] = 1;
else if(S[i] == 'G') occurrences[2][i+1] = 1;
else if(S[i] == 'T') occurrences[3][i+1] = 1;
}
The subscript, [i+1] is to compensate for the prefix sums preceding zero. This affects the sub-scripting of the slice sums - see below.
Next in the program, the prefix sums for the number of occurrences of each nucleotide type for the string, S = "CAGCCTA" are obtained for each row. The 2D table (array) becomes:
Prefix Sums for Number of Occurrences of Nucleotides in a 2D Array | |||||||||
---|---|---|---|---|---|---|---|---|---|
index | 0 | 1 (C) | 2 (A) | 3 (G) | 4 (C) | 5 (C) | 6 (T) | 7 (A) | |
nucleotide | |||||||||
0 (A) | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 2 | |
1 (C) | 0 | 1 | 1 | 1 | 2 | 3 | 3 | 3 | |
2 (G) | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | |
3 (T) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
The prefix sums is not done the conventional way here. Here, it is done within the row. The code segment for this table is:
for (int j=0; j<4; j++) {
for (int i=1; i<N+1; i++) {
occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1]; //prefix summing; equivalent to occurrences[j][i] += occurrences[j][i-1];
}
}
Slice Sum
Consider the query: P[0] = 2 . . . Q[0] = 4:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion (part) of the string, S = "CAGCCTA".
The impact factors are as follows: A= 1; C=2; G=3; T=4. Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.
Still on the query: P[0] = 2 . . . Q[0] = 4: if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval is 2 = 3 - 1 (in row 1). Two is greater than 0. So C occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 2 for C = 2. Here, the impact factor for C being equal to 2 and the slice sum being equal to 2 is coincidence.
Consider the query: P[1] = 5 . . . Q[1] = 5:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor is 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 0 = 1 - 1 (in row 0). So A does not occur at all in this portion of the string, S = "CAGCCTA".
Now that A is not found in the range P[0] = 2 . . . Q[0] = 4, in the string , S = "CAGCCTA", if C is found in that range, then 2 for C will be the minimal impact factor, since 3 for G and 4 for T are higher impact factors.
Still on the query: P[1] = 5 . . . Q[1] = 5, if C occurs at least once, then since C has the bigger impact factor after A, the minimal impact factor will be 2 for C. This assumes that A did not occur at all, in the range. If the slice sum for C in this query is greater than 0, then it means C occurs at least once. From the above table, the slice sum for C in this interval (inclusive) is 0 = 3 - 3 (in row 1). So C does not occur at all in this portion (part) of the string, S = "CAGCCTA".
Still on the query: P[1] = 5 . . . Q[1] = 5, if G occurs at least once, then since G has the bigger impact factor after C, the minimal impact factor will be 3 for G. This assumes that neither A nor C occurred at all, in the range. If the slice sum for G in this query is greater than 0, then it means G occurs at least once. From the above table, the slice sum for G in this interval is 0 = 1 - 1 (in row 2). So G does not occur at all in this portion of the string, S = "CAGCCTA".
Still on the query: P[1] = 5 . . . Q[1] = 5, if T occurs at least once, then since T has the bigger impact factor after G, the minimal impact factor will be 4 for T. This assumes that neither A nor C nor G occurred at all, in the range. If the slice sum for T in this query is greater than 0, then it means T occurs at least once. From the above table, the slice sum for T in this interval (inclusive) is 1 - 0 = 1 (in row 3). One is greater than 0. So T occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 4 for T = 4.
Consider the query: P[2] = 0 . . . Q[2] = 6:
If A occurs at least once, then since A has the smallest impact factor, the minimal impact factor will be 1. If the slice sum for A in this query is greater than 0, then it means A occurs at least once. From the above table, the slice sum for A in this interval is 2 - 0 = 2 (in row 0). So A occurs at least once, in this portion of the string, S = "CAGCCTA". And the answer for the query is 1 for A = 1.
The code segment for this slice sum is:
for (int K=0; K<M; K++){
if (occurrences[0][Q[K]+1] - occurrences[0][P[K]] > 0) answers[K] = 1; //for A
else if (occurrences[1][Q[K]+1] - occurrences[1][P[K]] > 0) answers[K] = 2; //for C
else if (occurrences[2][Q[K]+1] - occurrences[2][P[K]] > 0) answers[K] = 3; //for G
else if (occurrences[3][Q[K]+1] - occurrences[3][P[K]] > 0) answers[K] = 4; //for T
}
Complete Program
#include <iostream>
#include <vector>
#include <string>
using namespace std;
vector<int> solution(string &S, vector<int> &P, vector<int> &Q) {
int N = S.size();
int M = P.size();
vector<int> answers(M);
int occurrences[4][N+1];
for (int j=0; j<4; j++) {
for (int i=0; i<N+1; i++) {
occurrences[j][i] = 0;
}
}
for (int i=0; i < N+1; i++) {
if (S[i] == 'A') occurrences[0][i+1] = 1;
else if (S[i] == 'C') occurrences[1][i+1] = 1;
else if (S[i] == 'G') occurrences[2][i+1] = 1;
else if (S[i] == 'T') occurrences[3][i+1] = 1;
}
for (int j=0; j<4; j++) {
for (int i=1; i<N+1; i++) {
occurrences[j][i] = occurrences[j][i] + occurrences[j][i-1];
}
}
for (int K=0; K<M; K++){
if (occurrences[0][Q[K]+1] - occurrences[0][P[K]] > 0) answers[K] = 1;
else if(occurrences[1][Q[K]+1] - occurrences[1][P[K]] > 0) answers[K] = 2;
else if(occurrences[2][Q[K]+1] - occurrences[2][P[K]] > 0) answers[K] = 3;
else if(occurrences[3][Q[K]+1] - occurrences[3][P[K]] > 0) answers[K] = 4;
}
return answers;
}
int main()
{
vector<int> P{2, 5, 0};
vector<int> Q{4, 5, 6};
string str = "CAGCCTA";
vector<int> ret = solution(str, P, Q);
cout << ret[0] <<' '<< ret[1] <<' '<< ret[2] << endl;
vector<int> P2{0};
vector<int> Q2{0};
string str2 = "CAGCCTA";
vector<int> ret2 = solution(str2, P2, Q2);
cout << ret2[0] <<' '<< ret2[1] <<' '<< ret2[2] << endl;
return 0;
}
The output is:
2 4 1
2 0 0
At app.codility.com/programmers, the scoring and detected time complexity for this smart approach is:
Task score : 100% - Correctness : 100% ; Performance : 100%
Detected time complexity: O(N+M)
where N is for the number of characters in the string, and M is the number of queries.