For future study abord, I decide to practise wiriting in English, starting from technical articles.

Repeated DNA Sequences

Problem: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG".

When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.)

For example,

Given s ="AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

Output: ["AAAAACCCCC", "CCCCCAAAAA"]

This problem can recognized as the string matching problem from the first look. The instinctive solution is to try every potional string pairs exhaustivity, but this should be very slow. There are \(N\) length DNA, which means the potional substring is \(N-10\), as a result, there are going to be need \(10(N-10)^2\) compares. It's \(O(n^2)\) !

Actually, there are some other possbile solutions, which is quicker than the \(O(n^2)\).

Observing this problem carefully, we can find that there are only 4 different characters that needed to be compared. Why shouldn't we make a map, mapping the character to the binary code, then we can compare numbers instead of characters.

A \(10\) length DNA sequence can be mapped into a integer and we only need to compare the integer.

More specifically, our maps are: A=00,C=01,G=10 and T=11, and the substring AAAAACCCCC can be represented as 0101010101=341. This will be much more convient for us to compare substrings.

Mappings is the core idea for this problem and the rest is implementing:

#!cpp
vector<string> findRepeatedDnaSequences(string s) 
{
    vector<string> reval;
    unordered_set<int> tmp;

    if(s.size() <= 10)
        return reval;

    unordered_set<int> container;
    int i,j;
    int num = init(s);
    container.insert(num);

    for(i = 1, j = 10; j < s.size(); ++i,++j)
    {
        num = num << 2;
        num = num | map(s[j]);
        num = num & 1048575;

        if(container.find(num) != container.end() && tmp.find(num) == tmp.end())
        {
            reval.push_back(s.substr(i,10));
            tmp.insert(num);
        }
        container.insert(num);
    }
    return reval;
}

int init(string s)
{
    int reval = 0;
    for(int i = 0;i < 10; ++i)
    {
        reval = reval << 2;
        reval = reval | map(s[i]);
    }
    return reval;
}

int map(char DNA)
{
    if (DNA == 'A')
        return 0;
    else if(DNA == 'C')
        return 1;
    else if(DNA == 'G')
        return 2;
    else if(DNA == 'T')
        return 3;
    else
        return 0;
}

Number of 1 Bits

Problem: Write a function that takes an unsigned integer and returns the number of 1 bits it has (also known as the Hamming weight).

For example, the 32-bit integer 11 has binary representation 00000000000000000000000000001011, so the function should return 3.)

This should be easy, we only need to ues the shift operation to check out how many 1 in the given integer. We need 32 shifts at most.

The code is easy and short:

#!cpp
int hammingWeight(uint32_t n) {
    uint32_t m = n;
    int count = 0;
    while(m > 0)
    {
        count += (m & 1);
        m = m >> 1;
    }
    return count;
}

Reverse Bits

Problem: Reverse bits of a given 32 bits unsigned integer.

For example

Given input 43261596 (represented in binary as 00000010100101000001111010011100),

return 964176192 (represented in binary as 00111001011110000010100101000000).

This should be easy too.

Due to the fact that X & 1 = X and X | 0 = X, in which X can be both 0 and 1.

We can use X & 1 = X to get the last bit of the given integer and use X | 0 = X to give the bit to new integer.

We use two different ways beacuse when we 'cut off'(using X & 1 = X) the last bit of given integer, we should make sure that other bits become 0(using X & 0 = 0), and when we 'attach'(using X | 0 = X) the bit to the new integer, we also should make sure the other bits are uneffected(using X | 0 = 0).

The codes are following:

#!cpp
uint32_t reverseBits(uint32_t n) {
    uint32_t x = n;
    uint32_t y = 0;
    uint32_t t;

    int i = 0;
    while(true)
    {
        t = x & 1;
        y = y | t;

        i += 1;
        if(i >= 32)
        {
            break;
        }
        x = x >> 1;
        y = y << 1;
    }

    return y;
    }
}

Flyaway is the owner of this blog.

Solution Report for LeetCode 3

Repeated DNA Sequences

Number of 1 Bits

Reverse Bits

Reading Time

Published

Category

Tags

Contact