String Subsequence Kernels for Text Classification

Michael Giuffrida
Yale University

CPSC 490: Senior Project
Spring 2012

Advisor: Dana Angluin

Abstract

This paper explores the string subsequence kernel, a kernel function whose feature space is generated by subsequences of strings. This kernel compares two strings based on the number of occurrences of common substrings they contain, where each common substring is weighted based on how contiguous that substring is within the string. Although a recursive definition of the string subsequence kernel that uses dynamic programming methods exists, its computation is still expensive. An approximation, the string subsequence kernel with lambda pruning, is also considered, and a novel extension of the approximation algorithm that includes caching is investigated. Once the accuracies of the string subsequence kernel are presented, the runtimes for the approximation kernels are measured and compared with the exact algorithm over a variety of parameters.