On 16th page of Computer Science Distilled: Learn the Art of Solving Computational Problems
we face the following example:
A biologist is studying a DNA segment related to a genetic disease. The segment is made of 23 base pairs, where 9 must be A-T, 14 must be G-C. She wants to run a simulation task on every possible DNA segment having these numbers of base pairs. How many simulation tasks is she looking at?
The author solves the problem as such:
Number of permutations: 23!/(9! × 14!) = 817, 190
Number of orientations: 2^23
Thus we have:
Number of possible sequences: 817, 190 × 2^23 ≈ 7 trillion
but this calculation is not accurate; building blocks of DNAs are called base pairs for a good reason! They come in pairs. I think it would be best to explain using a simplified example.
Let's say that very same scientist now wants to simulate all possible sequences of a DNA segment which is only 4 base pairs long, with 1 base pair being C-G and 3 base pairs being A-T, How many times should she run the simulation?
Using the same methood as the author we have:
Number of permutations: 4!/(3! × 1!) = 4
Number of orientations: 2^4 = 16
Thus we have:
Number of possible sequences: 4 × 2^23 ≈ 64
However the scientist has to run the simulation only 32 times (if she wants to run the simulation only once for every possible sequence). Here I have listed all 64 possible sequences (for assurance):
1. AAAC AACA ACAA CAAA
2. TAAC TACA TCAA ATAC ATCA AATC AACT ACAT ACTA CAAT CATA CTAA
3. ATTC ATCT ACTT TATC TACT TTAC TTCA TCTA TCAT CTTA CTAT CATT
4. TTTC TTCT TCTT CTTT
5. AAAG AAGA AGAA GAAA
6. TAAG TAGA TGAA ATAG ATGA AATG AAGT AGAT AGTA GAAT GATA GTAA
7. ATTG ATGT AGTT TATG TAGT TTAG TTGA TGTA TGAT GTTA GTAT GATT
8. TTTG TTGT TGTT GTTT
I have put the sequences in 8 groups for ease of use, let's re-write the 1st group in full:
Did you notice? the complementary strand of group one is the same as group 8!
Group 8 is not a new set of possible sequences, indeed it is the opposing strand of group 1, the same is true about group 2-7, 3-6, and 4-5.
The method used by the author is correct, however when working with DNAs, since they are double-stranded we should divide the final result by 2, in our example, the scientist will have to run simulation 32 times and in author's example this number would roughly be 3.5 trillion.
If it were the case that DNAs didn't have heads and tails, following sequences would practically be the same:
That would've meant that we have to divide our final result by 2 yet once again, however that is not the case and DNAs do have heads and tails. the head is called the 5' end and the tail is 3' end. since enzymes only do their work from head to tail, the sequences mentioned above are not the same.
I personally respect the author for taking the effort to bring examples outside the scope of an every-day programmer, not only it proves the vast use of mathematics in all sciences but also makes the book more friendly to anyone (even a biology student who never has nor ever will write a program).