FM-Index Reveals the Reverse Suffix Array

Jun 2020

FM-Index Reveals the Reverse Suffix Array

Informatics Institute faculty member Prof. Dr. Oğuzhan Külekçi's coauthored paper titled “FM-index reveals the reverse suffix array, A. Ganguly, D. Gibney, S. Hooshmand, M.O. Kulekci, S.V. Thankachan” has been accepted to CPM’2020 – 31th Annual Symposium on Combinatorial Pattern Matching.

This work has been partially supported by the research fund of Istanbul Technical University under project number MGA-2019-42224.

Abstract:
Given a text T[1,n] over an alphabet Σ of size σ, the suffix array of T stores the lexicographic order of the suffixes of T. The suffix array needs Θ(nlogn) bits of space compared to the nlogσ bits needed to store T itself. A major breakthrough [FM–Index, FOCS’00] in the last two decades has been encoding the suffix array in near-optimal number of bits (≈ log σ bits per character). One can decode a suffix array value using the FM-Index in logO(1) n time.

We study an extension of the problem in which we have to also decode the suffix array values of the reverse text. This problem has numerous applications such as in approximate pattern matching [Lam et al., BIBM’ 09]. Known approaches maintain the FM–Index of both the forward and the reverse text which drives up the space occupancy to 2n log σ bits (plus lower order terms). This brings in the natural question of whether we can decode the suffix array values of both the forward and the reverse text, but by using n log σ bits (plus lower order terms). We answer this question positively, and show that given the FM–Index of the forward text, we can decode the suffix array value of the reverse text in near logarithmic average time. Additionally, our experimental results are competitive when compared to the standard approach of maintaining the FM–Index for both the forward and the reverse text. We believe that applications that require both the forward and reverse text will benefit from our approach.