Imagine solving a puzzle with 100 pieces, each piece a centimeter in size, something like this:
The genome is considerably larger than this puzzle, and a typical NGS run produces many more than 100 pieces to assemble.
A few things are immediately obvious:
1. The size of each piece (a 'read' in NGS terminology) matters; larger pieces make solving the puzzle much easier, because there are fewer of them to place and each one carries more context
2. A few pieces seem to fit in multiple places of the puzzle; they are very similar to one another (like the pieces containing black hair in this puzzle) but differ ever so slightly (maybe in only a few pixels or, in NGS terms, a few nucleotides)
3. Some pieces are repetitive (like the 'blue sky' in this puzzle), which adds to the difficulty of assembling the puzzle; they are equivalent to the repeat sequences encountered in NGS reads
4. In this puzzle, no two pieces overlap. However, NGS reads will often overlap with each other.
Together, these are the core issues in the genome assembly problem, i.e. reconstructing a genome from its fragments when you have no 'reference genome' to consult.
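To make points 3 and 4 concrete, here is a minimal Python sketch of how overlapping reads can be matched up, and how a repeat makes that matching ambiguous. The function names `overlap` and `successors`, the `min_len` threshold, and the toy reads are all assumptions made up for illustration; real assemblers use far more efficient data structures and must also cope with sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` characters exists.
    """
    max_len = min(len(a), len(b))
    # Try the longest possible overlap first and shrink until one matches.
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def successors(read: str, reads: list[str], min_len: int = 3) -> list[tuple[str, int]]:
    """All other reads that could follow `read`, with their overlap lengths."""
    hits = []
    for other in reads:
        if other == read:
            continue
        k = overlap(read, other, min_len)
        if k > 0:
            hits.append((other, k))
    return hits


# Two reads that overlap cleanly: 'ATGCGT' ends with 'GCGT', which starts 'GCGTAC'.
print(overlap("ATGCGT", "GCGTAC"))  # 4

# A repeat ('ACAC') makes the next piece ambiguous: two reads fit equally well.
toy_reads = ["TTACAC", "ACACGG", "ACACTT"]
print(successors("TTACAC", toy_reads))  # [('ACACGG', 4), ('ACACTT', 4)]
```

In the second example the read ending in the repeat has two equally good continuations, which is the 'blue sky' problem from the puzzle: overlap alone cannot say which piece comes next.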