With the thrive of data acquisition methods, computerized research has become increasingly more common. However, in order to match the huge data-processing demands, the design of new algorithms along with their optimization on specific hardware platforms has become a necessity. This scenario is particularly true in the case of comparative genomics, where massive DNA sequences are being published daily, and their processing presents many computational bottlenecks.
The comparison of DNA sequences is a central problem with direct impact on human health, and therefore its computational acceleration is of wide interest. However, due to its arbitrary nature, the parallel acceleration of sequence comparison poses computational challenges such as including heterogeneous granularity, unpredictable load, etc. In order to achieve high performance, algorithms must be tailored to the underlying hardware model, which may represent different computational approaches and often require even the redesign of the algorithms themselves.
This thesis addresses a computational tour of the sequence comparison problem by making use of hardware and algorithmic optimizations in single core machines, shared memory systems and Graphic Processing Units. The first contribution features a formal framework that enables unlimited search space size in strictly linear time. The second contribution describes a parallelization using shared memory machines that achieves high sensitivity in metagenomic sequences. The third contribution describes the overcoming of the data parallelism model in GPUs for the irregular pairwise sequence comparison. Lastly, the use of Machine-Learning-aided schedulers is explored to improve resource allocation and throughput in supercomputers dedicated to sequence comparison.