How to get a substring at word boundaries in PHP?


How do we get a substring that ends at a word boundary in PHP? For example, for a string $a = "ab cc dde ffg ff";, I would like to get its substring of length at most 7, starting from position 0, without breaking any word. A simple substr($a, 0, 7) returns "ab cc d", which breaks the word dde.
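To make the problem concrete, here is a minimal snippet that runs the naive call on the example string. Only the built-in substr() is used; the variable name $cut is just for illustration.

```php
<?php
$a = "ab cc dde ffg ff";

// substr() cuts at a fixed offset and knows nothing about words.
$cut = substr($a, 0, 7);

print($cut . "\n"); // prints "ab cc d": the word "dde" is cut after its first letter
```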

Here, we treat words as runs of characters separated by spaces. Let c be the last character of the substring. Then c is the end of a word if both of the following hold:

  • c is not a space
  • c is the original string’s last character, or the character after c is a space

If c is not the end of a word, we drop characters from the end of the substring, one at a time, until the last remaining character is the end of a word or nothing is left.

Following this analysis, we can write the PHP code as follows. The conditions are expressed as early returns, which keeps the implementation short.

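// Returns true if the character at $pos is the last character of a word.
function is_word_end($str, $pos) {
  // Positions past the end of the string are never word ends.
  if ($pos >= strlen($str)) return false;
  // A space is a separator, not part of a word.
  if ($str[$pos] == ' ') return false;
  // If a non-space character follows, the word continues.
  if ($pos + 1 < strlen($str) && $str[$pos+1] != ' ') return false;

  return true;
}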

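// Returns the longest prefix of $str, at most $len characters long,
// that ends at a word end, or "" if no such prefix exists.
function word_substr($str, $len) {
  // Shrink the prefix until its last character is a word end.
  while ($len > 0 && ! is_word_end($str, $len-1)) {
    $len--;
  }
  if ($len > 0) return substr($str, 0, $len);
  else return "";
}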

Here, is_word_end() checks whether the character at a given position of the string is the end of a word. word_substr() finds the longest prefix of the string, at most $len characters long, that ends at the end of a word. If no such prefix exists, it returns an empty string.

The following test of these functions

$a = "ab cc dde ffg ff";

for ($len = 1; $len < 20; ++$len) {
  print($len);
  print(": ");
  print(word_substr($a, $len));
  print("\n");
}

prints the expected results:

1: 
2: ab
3: ab
4: ab
5: ab cc
6: ab cc
7: ab cc
8: ab cc
9: ab cc dde
10: ab cc dde
11: ab cc dde
12: ab cc dde
13: ab cc dde ffg
14: ab cc dde ffg
15: ab cc dde ffg
16: ab cc dde ffg ff
17: ab cc dde ffg ff
18: ab cc dde ffg ff
19: ab cc dde ffg ff
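As a possible alternative to the character-by-character loop, the same result can probably be obtained with the built-in strrpos() and rtrim(). The sketch below is not from the original code: the function name word_substr_strrpos is hypothetical, and it keeps the same assumptions (words are separated by plain spaces, and lengths are counted in bytes).

```php
<?php
// Hypothetical alternative to word_substr(); the name is an assumption.
function word_substr_strrpos(string $str, int $len): string {
    if ($len <= 0) {
        return "";
    }
    if ($len >= strlen($str)) {
        // The whole string fits; only trailing spaces need to go.
        return rtrim($str, ' ');
    }
    // Look one character past the limit, so that a space sitting right
    // after the prefix counts as a boundary.
    $cut = strrpos(substr($str, 0, $len + 1), ' ');
    if ($cut === false) {
        // The first word is longer than $len, so no whole word fits.
        return "";
    }
    // Drop the partial word, then any spaces left at the end.
    return rtrim(substr($str, 0, $cut), ' ');
}

$a = "ab cc dde ffg ff";
foreach ([1, 2, 7, 9, 15, 16, 19] as $len) {
    print($len . ": " . word_substr_strrpos($a, $len) . "\n");
}
```

For the lengths sampled here, this should print the same lines as the test above. The idea is that the last space within the first $len + 1 characters marks where the final complete word ends, so no loop over single characters is needed.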

Eric Ma

Eric is a systems guy. Eric is interested in building high-performance and scalable distributed systems and related technologies. The views or opinions expressed here are solely Eric's own and do not necessarily represent those of any third parties.