Subcaterpillar Isomorphism: Subtree Isomorphism Restricted Pattern Trees To Caterpillars

In this paper, we investigate subcaterpillar isomorphism, the problem of determining, for a rooted labeled caterpillar P and a rooted labeled tree T, whether or not there exists a subtree of T that is isomorphic to P. We design two algorithms that solve subcaterpillar isomorphism for a caterpillar P and a tree T in (i) O(p + tDhσ) time and O(Dh) space and in (ii) O(p + tDσ) time and O(D(h + H)) space, respectively. Here, p is the number of vertices in P, t is the number of vertices in T, h is the height of P, H is the height of T, σ is the number of distinct labels (the alphabet size) and D is the degree of T. Furthermore, we give experimental results of the two algorithms for artificial data and real data.


I. INTRODUCTION
Pattern matching for tree-structured data such as HTML and XML documents for web mining or DNA and glycan data for bioinformatics is one of the fundamental tasks for information retrieval or query processing. As such pattern matching for rooted labeled unordered trees (trees, for short), subtree isomorphism is the problem of determining, for a pattern tree P and a text tree T, whether or not there exists a subtree of T that is isomorphic to P. It is known that subtree isomorphism can be solved in O(p^{1.5} t / log p) time [10], where p is the number of vertices in P and t is the number of vertices in T. On the other hand, it cannot be solved in O(t^{2−ε}) time for any ε (0 < ε < 1) under SETH [1].
In this paper, we focus on subcaterpillar isomorphism, that is, subtree isomorphism when P is a rooted labeled caterpillar (a caterpillar, for short) (cf. [3]). A caterpillar is an unordered tree that becomes a rooted path after all of its leaves are removed. Caterpillars provide a structural restriction under which computing the edit distance [8] and solving the inclusion problem [7] for unordered trees become tractable.
It is known that the problem of computing the edit distance between unordered trees is MAX SNP-hard [11]. This statement also holds even if the two trees are binary, the maximum height is at most 3 or the cost function is the unit cost function [2], [4]. On the other hand, we can compute the edit distance between caterpillars in O(n + H^2 σ^3) time under a general cost function and in O(n + H^2 σ) time under the unit cost function, where n is the total number of vertices of the two caterpillars, H is the maximum height of the two caterpillars and σ is the number of distinct labels in the two caterpillars [8]¹.
It is known that the inclusion problem of determining whether or not a text tree T can be transformed into a pattern tree P by deleting vertices in T is NP-complete [6]. This statement also holds even if P is a caterpillar [6]. On the other hand, if both P and T are caterpillars, then we can solve the inclusion problem in O(p + t + (h + H)σ) time, where h is the height of P and H is the height of T [7]².
In this paper, we design two algorithms to solve subcaterpillar isomorphism in (i) O(p + tDhσ) time and O(Dh) space and (ii) O(p + tDσ) time and O(D(h + H)) space, respectively. Here, D is the degree of T. Since there may exist many positions in T that match P when P is much smaller than T, the above algorithms also output all such positions. Hence, under the assumption that p < t, h ≪ t and h < H, the algorithm (i) runs in O(tDσ) time and O(Dh) space and the algorithm (ii) runs in O(tDσ) time and O(DH) space.
Note that neither algorithm uses a maximum cardinality matching algorithm for bipartite graphs [5], which is essential for the subtree isomorphism algorithm [10]. Also, the proof of SETH-hardness in [1] cannot be applied when the pattern tree P is a caterpillar.
Furthermore, by implementing the algorithms (i) and (ii), we give experimental results of the two algorithms for artificial data and real data. We confirm that the algorithm (ii) is faster than the algorithm (i) for artificial data with a large number of matching positions, in line with the theoretical results, whereas the algorithm (i) is faster than the algorithm (ii) for real data.

II. PRELIMINARIES
A tree is a connected graph without cycles. For a tree T = (V, E), we denote V and E by V(T) and E(T). We sometimes denote v ∈ V(T) by v ∈ T. A rooted tree is a tree with one vertex r chosen as its root, which we denote by r(T).

¹ The time complexity stated in [8] is O(H^2 λ^3) time and O(H^2 λ) time, where λ is the maximum number of leaves in the two caterpillars. Since O(λ^3) and O(λ) in them correspond to the time complexity of computing the multiset edit distances under the general and the unit cost functions (cf. [9]), we can replace λ with σ by storing the labels occurring in the leaves. Also, in order to compare with the time complexity of this paper, we add O(n) for the initialization of the algorithm, which includes the above storing.

² The time complexity stated in [7] is O((h + H)σ) time. In order to compare with the time complexity of this paper, we add O(p + t) for the initialization of the algorithm.
For each vertex v in a rooted tree with the root r, let UP_r(v) be the unique path from v to r. The parent of v (≠ r), which we denote by par(v), is its adjacent vertex on UP_r(v) and the ancestors of v (≠ r) are the vertices on UP_r(v) \ {v}. We denote u < v if v is an ancestor of u, and we denote u ≤ v if either u < v or u = v. The parent and the ancestors of the root r are undefined. We say that u is a child of v if v is the parent of u, and u is a descendant of v if v is an ancestor of u. We denote the set of all children of v by ch(v). Two vertices with the same parent are called siblings. A leaf is a vertex having no children and we denote the set of all the leaves in T by lv(T). We call a vertex that is not a leaf an internal vertex.
For a rooted tree T = (V, E) and a vertex v ∈ T, the complete subtree of T at v, denoted by T(v), is the rooted tree consisting of v and all of its descendants in T, with root v. The height h(v) of a vertex v is defined as |UP_r(v)| − 1 (so that h(r) = 0) and the height h(T) of T is the maximum height over all vertices v ∈ T. The degree d(v) of a vertex v is the number of the children of v, and the degree d(T) of T is the maximum degree over all vertices in T.
We say that a rooted tree is ordered if a left-to-right order among siblings is given; unordered otherwise. For a fixed finite alphabet Σ, we say that a tree is labeled over Σ if each vertex is assigned a symbol from Σ. We denote the label of a vertex v by l(v), and sometimes identify v with l(v). In this paper, we simply call a rooted labeled unordered tree over Σ a tree.
In this paper, we often represent a rooted labeled unordered tree as a rooted labeled ordered tree under a fixed order of siblings. Then, for a rooted labeled ordered tree T, a vertex v in T and its children v_1, ..., v_i, the postorder traversal of T(v) is obtained by first visiting T(v_k) (1 ≤ k ≤ i) and then visiting v. The postorder number of v ∈ T is the number of vertices preceding v in the postorder traversal of T.
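The postorder traversal and the height h(v) defined above can be illustrated with a minimal Python sketch. The nested-tuple representation `(label, [children])` and the function name `postorder_numbers` are assumptions made for illustration only; they do not appear in the paper. The postorder number follows the definition above (the number of vertices visited before v), and the height is the distance to the root.

```python
def postorder_numbers(tree):
    """Return (label, postorder number, height) for every vertex of `tree`.

    `tree` is a hypothetical nested tuple (label, [children]) with the
    children listed in the fixed sibling order.
    """
    records = []

    def visit(node, height):
        label, children = node
        for child in children:          # first visit T(v_1), ..., T(v_i)
            visit(child, height + 1)
        # then visit v itself; len(records) vertices precede it
        records.append((label, len(records), height))

    visit(tree, 0)                      # the root has height 0
    return records


if __name__ == "__main__":
    example = ("a", [("b", [("d", [])]), ("c", [])])
    for record in postorder_numbers(example):
        print(record)
```

The recursion mirrors the definition directly; for very deep trees an explicit stack would avoid Python's recursion limit.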
Definition 1: Let T and S be trees.
1) We say that T is a subtree of S, denoted by T ⪯ S, if T is a tree such that V(T) ⊆ V(S) and E(T) ⊆ E(S).
2) We say that T and S are isomorphic, denoted by T ≅ S, if T ⪯ S and S ⪯ T.
3) We say that T is a subtree isomorphism of S, denoted by T ⊑ S, if there exists a tree S′ ⪯ S such that T ≅ S′.
In this paper, we deal with the subtree isomorphism problem of determining whether or not P ⊑ T for trees P and T. We call P a pattern tree and T a text tree. Then, the following theorem holds.
Theorem 1 (Shamir & Tsur [10]): Let P and T be trees where p = |P| and t = |T|. Then, the problem of determining whether or not P ⊑ T is solvable in O(p^{1.5} t / log p) time.
As a restricted form of trees, we introduce a rooted labeled caterpillar (a caterpillar, for short) as follows.

Definition 2: We say that a tree is a caterpillar (cf. [3]) if it is transformed to a rooted path after removing all the leaves in it. For a caterpillar C, we call the remaining rooted path the backbone of C and denote it by bb(C).
It is obvious that r(C) = r(bb(C)) and V(C) = V(bb(C)) ∪ lv(C) for a caterpillar C, that is, every vertex in a caterpillar is either a leaf or an element of the backbone.
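As a small illustration of Definition 2, the following Python sketch tests whether a tree is a caterpillar and, if so, returns its backbone. The nested-tuple representation `(label, [children])` and the function name `backbone` are hypothetical. The sketch assumes that the tree has at least one internal vertex, and it lists the backbone from the root downwards.

```python
def backbone(tree):
    """Return the backbone labels of a caterpillar, root first, or None.

    `tree` is a hypothetical nested tuple (label, [children]).
    A tree is a caterpillar if removing all leaves leaves a rooted path,
    i.e. every internal vertex has at most one internal child.
    """
    label, children = tree
    if not children:
        return None  # single vertex: no internal vertex (case left open here)
    path = [label]
    internal = [c for c in children if c[1]]   # children that are not leaves
    while internal:
        if len(internal) > 1:
            return None  # two internal children: what remains is not a path
        label, children = internal[0]
        path.append(label)
        internal = [c for c in children if c[1]]
    return path


if __name__ == "__main__":
    cat = ("a", [("b", [("c", [])]), ("d", [])])
    print(backbone(cat))                       # backbone labels, root first
    not_cat = ("a", [("b", [("c", [])]), ("d", [("e", [])])])
    print(backbone(not_cat))
```

Every vertex that is not on the returned path is a leaf, which matches the observation that V(C) is the union of the backbone and lv(C).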

III. ALGORITHMS FOR SUBCATERPILLAR ISOMORPHISM
In this section, we focus on subcaterpillar isomorphism, that is, subtree isomorphism when P is a caterpillar. In other words, we focus on the problem of whether or not P ⊑ T for a caterpillar P and a tree T. Throughout this section, we use the following notation. For a pattern caterpillar P, we refer to the backbone of P as a sequence ⟨v_1, ..., v_n⟩ such that (v_i, v_{i+1}) ∈ E(P) and v_n = r(P). We denote the children of v_i by ch(v_i).
On the other hand, for a text tree T, we refer to the vertices in T as w_1, ..., w_m in postorder traversal. We denote the height of w_j by h(w_j) and the set of children of w_j by ch(w_j).
Let P be a pattern caterpillar and T a text tree such that P ⊑ T. Also let P′ ⪯ T be a subcaterpillar in T such that P ≅ P′ and bb( [...]

Example 1: Consider a pattern caterpillar P and a text tree T in Figure 1. Here, the number assigned to every vertex in T denotes the postorder number. Also v_i denotes the backbone.

To design the algorithm to determine subcaterpillar isomorphism, we use a multiset of labels in order to compare two sets of vertices. A multiset on Σ is a mapping S : Σ → N. For a set V of vertices, we denote the multiset of labels occurring in V by V. Then, it is necessary for the subcaterpillar isomorphism to check whether or not ch( [...]

Then, we design the algorithm SUBCATISO in Algorithm 1 to determine whether or not P ⊑ T. Here, the algorithm SUBCATISO outputs all of the matching positions if P ⊑ T, and no matching point is output if P ⋢ T.

procedure SUBCATISO(P, T)
  /* P : caterpillar such that bb(P) = ⟨v_1, ..., v_n⟩ */
  /* T : tree consisting of vertices w_1, ..., w_m in postorder traversal */
  for i = 1 to n − 1 do match[i] ← ∅;
  [...]

Since h( [...]

Theorem 2: Let P be a caterpillar and T a tree. Then, the algorithm SUBCATISO correctly outputs all of the matching positions of P in T in O(p + tDhσ) time and O(Dh) space.
Proof: First, we show the correctness of the algorithm SUBCATISO. Every matching point of P in T is an internal vertex of T. The algorithm SUBCATISO first stores the candidate j of the matching point corresponding to v_1 to match[1] if l(v_1) = l(w_j) and ch(v_1) ⊆ ch(w_j) as multisets of labels (line 14).
Then, for the current j, the algorithm SUBCATISO removes the candidate k from match[i] if w_j is an ancestor of w_k (line 8) and stores k to match[i + 1] if l(v_{i+1}) = l(w_j), ch(v_{i+1}) ⊆ ch(w_j) and i < n − 1 (line 12). If i = n − 1, then the algorithm SUBCATISO outputs k (line 10).
Hence, every output k at line 10 satisfies that l(v_i) = l(par^{i−1}(w_k)) and ch(v_i) = ch(par^{i−1}(w_k)) for every i (1 ≤ i ≤ n), where par^0(v) = v and par^{i+1}(v) = par(par^i(v)). As a result, the algorithm SUBCATISO outputs all of the matching points of P in T.
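The candidate bookkeeping described in this correctness argument can be sketched in Python as follows. This is not a transcription of Algorithm 1, whose full listing is not reproduced here; it is a minimal sketch under stated assumptions. The names `flatten` and `subcat_iso`, the nested-tuple tree representation, and the use of `collections.Counter` for label multisets are all hypothetical. In particular, the ancestor test of line 8 is replaced by a comparison of heights: since the vertices are processed in postorder and a candidate is always removed when its next ancestor is visited, w_j is the i-th ancestor of a stored candidate w_k exactly when h(w_k) − i = h(w_j).

```python
from collections import Counter


def flatten(tree, height=0, out=None):
    """Postorder list of (label, height, child-label multiset) records.

    `tree` is a hypothetical nested tuple (label, [children]).
    """
    if out is None:
        out = []
    label, children = tree
    for child in children:
        flatten(child, height + 1, out)
    out.append((label, height, Counter(c[0] for c in children)))
    return out


def subcat_iso(bb, text):
    """Return the postorder positions (1-based) of w_k matched with v_1.

    bb   : [(label, Counter of child labels)] for v_1, ..., v_n (v_n = root)
    text : output of flatten() for the text tree
    """
    n = len(bb)

    def fits(i, j):
        # l(v_i) = l(w_j) and ch(v_i) included in ch(w_j) as label multisets
        label, need = bb[i - 1]
        w_label, _, have = text[j]
        return label == w_label and all(have[a] >= c for a, c in need.items())

    match = [[] for _ in range(n + 1)]      # match[1..n-1] hold candidates
    found = []
    for j, (_, h_j, _) in enumerate(text):
        for i in range(n - 1, 0, -1):       # downwards, so a moved k is not rescanned
            keep = []
            for k in match[i]:
                if text[k][1] - i != h_j:   # w_j is not the i-th ancestor of w_k
                    keep.append(k)
                elif fits(i + 1, j):
                    if i == n - 1:
                        found.append(k + 1)         # whole backbone matched
                    else:
                        match[i + 1].append(k)      # advance the candidate
                # otherwise the candidate is dropped
            match[i] = keep
        if fits(1, j):
            if n == 1:
                found.append(j + 1)
            else:
                match[1].append(j)
    return found


if __name__ == "__main__":
    pattern_bb = [("b", Counter("c")), ("a", Counter(["b", "c"]))]
    text_tree = ("r", [("a", [("b", [("c", []), ("d", [])]), ("c", [])]),
                       ("a", [("b", [("c", [])])])])
    print(subcat_iso(pattern_bb, flatten(text_tree)))
```

Scanning every match[i] at every text vertex is what makes the factor h appear in the running time of this variant.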
Next, consider the complexity of the algorithm SUBCATISO. As preprocessing, it is necessary to store ch(v_i) for v_i in P and [...]

In order to reduce the searching time in match[i] for every i (1 ≤ i ≤ n − 1) of the algorithm SUBCATISO, we design another algorithm SUBCATISO2 in Algorithm 2.
The main difference between the algorithms SUBCATISO and SUBCATISO2 is that, in SUBCATISO2, the index i used to access the array match is determined by height[h_{j−1}], without accessing match[i] for every i (1 ≤ i ≤ n − 1).
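The idea of indexing candidates by height can be sketched in Python as follows. This is a simplified sketch and not Algorithm 2 itself: the names `flatten`, `subcat_iso2`, `height` and `current`, the nested-tuple tree representation and the use of `collections.Counter` are assumptions for illustration. The sketch keeps one bucket of candidates per height and the index current(k) of the backbone vertex last matched for candidate k; the array match is omitted here because the buckets already hold the same candidates. For an internal vertex w_j, the bucket consulted is the one for height h(w_j) + 1, which is the height of its children.

```python
from collections import Counter, defaultdict


def flatten(tree, height=0, out=None):
    """Postorder list of (label, height, child-label multiset) records."""
    if out is None:
        out = []
    label, children = tree
    for child in children:
        flatten(child, height + 1, out)
    out.append((label, height, Counter(c[0] for c in children)))
    return out


def subcat_iso2(bb, text):
    """Height-bucket variant; returns 1-based postorder positions for v_1.

    bb   : [(label, Counter of child labels)] for v_1, ..., v_n (v_n = root)
    text : output of flatten() for the text tree
    """
    n = len(bb)

    def fits(i, j):
        label, need = bb[i - 1]
        w_label, _, have = text[j]
        return label == w_label and all(have[a] >= c for a, c in need.items())

    height = defaultdict(list)   # height[h]: candidates whose last match has height h
    current = {}                 # current[k]: index i of the last matched v_i
    found = []
    for j, (_, h_j, _) in enumerate(text):
        # Only candidates waiting directly below w_j are touched.
        for k in height.pop(h_j + 1, []):
            i = current.pop(k)
            if fits(i + 1, j):
                if i == n - 1:
                    found.append(k + 1)
                else:
                    current[k] = i + 1
                    height[h_j].append(k)
        if fits(1, j):
            if n == 1:
                found.append(j + 1)
            else:
                current[j] = 1
                height[h_j].append(j)
    return found


if __name__ == "__main__":
    pattern_bb = [("b", Counter("c")), ("a", Counter(["b", "c"]))]
    text_tree = ("r", [("a", [("b", [("c", []), ("d", [])]), ("c", [])]),
                       ("a", [("b", [("c", [])])])])
    print(subcat_iso2(pattern_bb, flatten(text_tree)))
```

Compared with scanning every match[i], each candidate is inspected only when its next ancestor is visited, at the cost of keeping one bucket per height of T.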

Example 3: We apply the algorithm SUBCATISO2 to the pattern caterpillar P and the text tree T in Example 1 in Figure 1. Then, Table II [...]
1) For j = 6, by lines 3 and 13, 6 is stored to match[1] and height[3], and current(6) is set to 1.
2) For j = 8, by lines 6 and 7, 6 is selected as k ∈ height[3]. Since h(w_6) = 3 = 2 + 1 = h_8 + current(6), 6 is deleted from match[1]. By line 13, 6 is stored to match[2] and height[2], and current(6) is set to 2. By line 18, 8 is stored to match[1] and height[2], and current(8) is set to 1.
3) For j = 9, by lines 6 and 7, 6 and 8 are selected as k ∈ height[2]. For k = 6, by line 9, i is set to 2 = current(6) and 6 is deleted from match[2]. By lines 13 and 14, 6 is output. Also, for k = 8, by line 9, i is set to 1 = current(8) and 8 is deleted from match[1]. By line 13, 8 is stored to match[2] and height[1], and current(8) is set to 2.
4) For j = 16, by lines 3 and 13, 16 is stored to height[2] and match[1], and current(16) is set to 1.
5) For j = 17, by lines 6 and 7, 16 is selected as k ∈ height[2]. By line 9, i is set to 1 = current(16) and 16 is deleted from match[1]. By line 13, 16 is stored to match[2] and height[1], and current(16) is set to 2.
6) For j = 18, by lines 6 and 7, 8 and 16 are selected as k ∈ height[1]. For k = 8, by line 9, i is set to 2 = current(8) and 8 is deleted from match[2]. By lines 13 and 14, 8 is output. Also, for k = 16, by the same reason, 16 is output.

Corollary 1: Let P be a caterpillar and T a tree such that p < t, h ≪ t and h < H. Then, the algorithm SUBCATISO determines whether or not P ⊑ T in O(tDσ) time and O(Dh) space. Also the algorithm SUBCATISO2 determines whether or not P ⊑ T in O(tDσ) time and O(DH) space.

IV. EXPERIMENTAL RESULTS
In this section, we give experimental results of the algorithms SUBCATISO and SUBCATISO2 for both the artificial data and the real data. The experiments were run on a computer with Ubuntu 18.04.4, an Intel Xeon E5-1650 v3 CPU (3.50GHz) and 3.8GB of RAM.

A. Artificial data
First, in order to investigate the efficiency of the algorithm SUBCATISO2, we adopt a binary caterpillar P_k with height k and a unique label, that is, a caterpillar in which every internal vertex has exactly two children, and a complete binary tree T_{2k} with height 2k and a unique label, that is, a tree in which every internal vertex has exactly two children and the height of every leaf is exactly 2k. It is obvious that P_k ⊑ T_{2k}.
Note that the algorithm SUBCATISO2 is more efficient than the algorithm SUBCATISO when the number of the matching points of P in T is large. Then, Table III illustrates the running time of the algorithms SUBCATISO and SUBCATISO2 for P_k and T_{2k} and the number (#match) of matching points of P_k for T_{2k} for 4 ≤ k ≤ 11. Table III shows that the algorithm SUBCATISO2 is faster than the algorithm SUBCATISO for P_k and T_{2k} for larger k.
The number of the matching points of P_{k+1} is about 4 times that of P_k. On the other hand, the running time of P_8 (resp., P_9, P_10, P_11) by the algorithm SUBCATISO is about 5.5 times (resp., about 6.5 times, about 8.5 times, about 8.7 times) that of P_7 (resp., P_8, P_9, P_10). Also the running time of P_8 (resp., P_9, P_10, P_11) by the algorithm SUBCATISO2 is about 5 times (resp., about 5.2 times, about 5.6 times, about 6.2 times) that of P_7 (resp., P_8, P_9, P_10).
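The factor of about 4 can be checked by a short count, sketched below in Python. The count is derived from the definitions of P_k and T_{2k} only and is not read from Table III. It assumes that a matching point is the vertex of T matched with v_1, that the backbone of P_k consists of k vertices, and that all labels are equal; the function name `match_count` is hypothetical. Under these assumptions, v_1 can be placed on any internal vertex of T_{2k} that has at least k − 1 ancestors, that is, on any vertex at height d with k − 1 ≤ d ≤ 2k − 1, and there are 2^d vertices at height d.

```python
def match_count(k):
    """Number of internal vertices of T_2k at heights k-1, ..., 2k-1.

    Equals 4**k - 2**(k - 1) under the assumptions stated above.
    """
    return sum(2 ** d for d in range(k - 1, 2 * k))


if __name__ == "__main__":
    for k in range(4, 12):
        ratio = match_count(k + 1) / match_count(k)
        print(k, match_count(k), round(ratio, 3))
```

Since the count grows like 4^k, consecutive ratios approach 4 as k increases, which agrees with the observation above.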

B. Real data
Next, we give experimental results for caterpillars and trees in real data. We deal with data for N-glycans and all-glycans from KEGG³, CSLOGS⁴, dblp⁵ and TPC-H, Auction, Nasa, Protein and University from UW XML Repository⁶. In particular, we deal with the largest 51,546 trees (1%) in dblp (referred to as dblp 1%). As pattern caterpillars, we deal with non-isomorphic caterpillars in TPC-H, caterpillars obtained by deleting the root in Auction and non-isomorphic caterpillars obtained by deleting the root in Nasa, Protein, and University. Note that we use all the trees as text trees in TPC-H, Auction, Nasa, Protein and University.

Table IV illustrates the information of such caterpillars and trees. Here, #, n, d and h are the number of caterpillars and trees, the average number of vertices, the average degree and the average height. Then, Table V illustrates the total and average running time (msec.) of the algorithms SUBCATISO and SUBCATISO2 applied to the data in Table IV by regarding caterpillars as pattern caterpillars and trees as text trees. Here, #cat denotes the number of pattern caterpillars and #tree denotes the number of text trees. Also the average running time is obtained by dividing the total running time by the total number of pairs, that is, (#cat)×(#tree). Furthermore, Table VI illustrates the number (#(P ⊑ T)) of pairs such that P ⊑ T and its ratio in the total number of pairs, and the total and average number (#match) of matching points when P ⊑ T.

In contrast to Table III, Table V shows that the algorithm SUBCATISO is faster than the algorithm SUBCATISO2 for real data, in the setting of Corollary 1. One of the reasons is that the number of the matching points for real data is much smaller than that for artificial data. In fact, Table VI shows that the average number of matching points for real data when P ⊑ T is less than 2.
Table V also shows that the average running time of both algorithms is largest for Nasa and next largest for SwissProt among all the data. The reason is that both the average number of vertices of text trees in Table IV and the average number of matching points in Table VI are larger than those for the other data. Table VI shows that, for CSLOGS, TPC-H, Auction and University, the number of matching points is always exactly one when P ⊑ T. Then, for the other data, namely N-glycans, all-glycans, dblp 1%, SwissProt, Nasa and Protein, Table VII illustrates the histograms of the number of matching points and the maximum value (max) of matching points when P ⊑ T. Table VII shows that the number of cases with more than one matching point is largest for SwissProt and next largest for all-glycans among all the data. On the other hand, the maximum value of matching points is by far the largest for Nasa and next largest for SwissProt among all the data.

V. CONCLUSION
In this paper, we have investigated subcaterpillar isomorphism and designed two algorithms, SUBCATISO running in O(p + tDhσ) time and O(Dh) space and SUBCATISO2 running in O(p + tDσ) time and O(D(h + H)) space, where p = |P|, t = |T|, h = h(P), H = h(T), D = d(T) and σ = |Σ|. We have also given experimental results for artificial data and real data.
Then, in line with Theorems 2 and 3, we have confirmed that the algorithm SUBCATISO2 is faster than the algorithm SUBCATISO for artificial data in which the number of the matching points of P in T is large. On the other hand, we have confirmed that the algorithm SUBCATISO is faster than the algorithm SUBCATISO2 for real data. One of the reasons is that, in the setting of Corollary 1, the running time spent on the array height[h] in the algorithm SUBCATISO2 cannot be absorbed when the number of the matching points is not large.
The reason why we cannot apply the SETH-hardness to subcaterpillar isomorphism is that a caterpillar has a unique backbone. Then, it is a future work to extend a caterpillar to a tree with a bounded number of backbones, in order to avoid the SETH-hardness of subtree isomorphism [1]. Also it is a future work to extend the algorithms in this paper to unrooted subcaterpillar isomorphism as in [10].