TY - GEN
T1 - A fast sorting algorithm for aptamer identification using deep sequencing
AU - Xiao, Yiou
AU - Mehrotra, Kishan G.
AU - Allis, Damian G.
AU - Borer, Phillip N.
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/10/10
Y1 - 2014/10/10
N2 - In recent years, with the advent of fast sequencing technology, the genomic database is growing rapidly. Researchers in the bioinformatics field are expecting faster and more accurate tools to effectively analyze the gigantic data sets. In the context of aptamer search, the goal is to search for the over-represented DNA sequences from the randomly generated aptamer libraries. Hash functions are widely used in substring comparison, sequence alignment and clustering tools. We have developed a light-weight tool that takes advantage of the hash functions to reduce the size of genomic data and conducts η-neighbor searches on the centroid sequence. This greatly improves the efficiency of the search compared with existing tools. Furthermore, the prior calculation of hash values of η-neighbors decreases the searching overhead. In a dataset of 2.23 million sequences, the proposed algorithm accurately counts the frequency of the Human α-Thrombin aptamer sequences in less than 40 seconds, whereas the current script-based method takes 2 hours and 18 minutes.
AB - In recent years, with the advent of fast sequencing technology, the genomic database is growing rapidly. Researchers in the bioinformatics field are expecting faster and more accurate tools to effectively analyze the gigantic data sets. In the context of aptamer search, the goal is to search for the over-represented DNA sequences from the randomly generated aptamer libraries. Hash functions are widely used in substring comparison, sequence alignment and clustering tools. We have developed a light-weight tool that takes advantage of the hash functions to reduce the size of genomic data and conducts η-neighbor searches on the centroid sequence. This greatly improves the efficiency of the search compared with existing tools. Furthermore, the prior calculation of hash values of η-neighbors decreases the searching overhead. In a dataset of 2.23 million sequences, the proposed algorithm accurately counts the frequency of the Human α-Thrombin aptamer sequences in less than 40 seconds, whereas the current script-based method takes 2 hours and 18 minutes.
UR - http://www.scopus.com/inward/record.url?scp=84911164346&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84911164346&partnerID=8YFLogxK
U2 - 10.1109/ASONAM.2014.6921671
DO - 10.1109/ASONAM.2014.6921671
M3 - Conference contribution
AN - SCOPUS:84911164346
T3 - ASONAM 2014 - Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
SP - 759
EP - 763
BT - ASONAM 2014 - Proceedings of the 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
A2 - Wu, Xindong
A2 - Ester, Martin
A2 - Xu, Guandong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2014
Y2 - 17 August 2014 through 20 August 2014
ER -