UIZE JavaScript Framework

2014-05-25 - NEW MODULE: Uize.Str.Whitespace

The new Uize.Str.Whitespace module provides methods for testing if strings contain whitespace characters, if they contain non-whitespace characters, if they are only whitespace or non-whitespace characters, and for finding the first index or last index of whitespace or non-whitespace characters.

1. Whitespace Characters

The Uize.Str.Whitespace module defines whitespace characters as any character from the set of characters listed in the following table...

Whitespace Characters
Code   Escape  Description
9      \t      Horizontal Tab (HT)
10     \n      Line Feed (LF)
11     \x0b    Vertical Tab (VT)
12     \f      Form Feed (FF)
13     \r      Carriage Return (CR)
32     \x20    Space
160    \xa0    Non-breaking Space
8192   \u2000  En Quad
8193   \u2001  Em Quad
8194   \u2002  En Space
8195   \u2003  Em Space
8196   \u2004  Three-per-em Space
8197   \u2005  Four-per-em Space
8198   \u2006  Six-per-em Space
8199   \u2007  Figure Space
8200   \u2008  Punctuation Space
8201   \u2009  Thin Space
8202   \u200a  Hair Space
8203   \u200b  Zero-width Space
8232   \u2028  Line Separator
8233   \u2029  Paragraph Separator
12288  \u3000  Ideographic Space
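
One way to picture this definition in code is a lookup keyed by character code. The following is a minimal sketch built from the table above, not the module's actual implementation. The names whitespaceCodes, whitespaceLookup, and isWhitespaceChar are hypothetical and are not part of the Uize.Str.Whitespace module.

```javascript
// Hypothetical sketch: the character codes are copied from the table above.
var whitespaceCodes = [
	9,10,11,12,13,32,160,
	8192,8193,8194,8195,8196,8197,8198,8199,8200,8201,8202,8203,
	8232,8233,12288
];

// Build a lookup object so that testing a character is a single property access.
var whitespaceLookup = {};
for (var codeNo = 0; codeNo < whitespaceCodes.length; codeNo++) {
	whitespaceLookup [whitespaceCodes [codeNo]] = true;
}

// Returns true if the character at position charPos of sourceStr is in the table.
// An out-of-range position yields NaN from charCodeAt, which is not in the lookup.
function isWhitespaceChar (sourceStr,charPos) {
	return whitespaceLookup [sourceStr.charCodeAt (charPos)] === true;
}
```

With this sketch, isWhitespaceChar ('a\tb',1) is true and isWhitespaceChar ('a\tb',0) is false.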

The Uize.Str.Whitespace module provides the following methods for dealing with whitespace characters...

Uize.Str.Whitespace.isWhitespace - tests if the source string is only whitespace characters
Uize.Str.Whitespace.hasWhitespace - tests if the source string contains any whitespace characters
Uize.Str.Whitespace.indexOfWhitespace - finds the first whitespace character and returns its index
Uize.Str.Whitespace.lastIndexOfWhitespace - finds the last whitespace character and returns its index

2. Non-whitespace Characters

Non-whitespace characters are defined simply as any characters that don't fit the definition for whitespace characters.

The Uize.Str.Whitespace module provides the following methods for dealing with non-whitespace characters...

Uize.Str.Whitespace.isNonWhitespace - tests if the source string is only non-whitespace characters
Uize.Str.Whitespace.hasNonWhitespace - tests if the source string contains any non-whitespace characters
Uize.Str.Whitespace.indexOfNonWhitespace - finds the first non-whitespace character and returns its index
Uize.Str.Whitespace.lastIndexOfNonWhitespace - finds the last non-whitespace character and returns its index
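
To make the difference between the "is" and "has" style tests concrete, here is a small standalone sketch that derives all four answers for a string. It does not call the module, and the function name classify is hypothetical. How the real methods treat an empty string is not stated here, so the sketch's handling of that case is an assumption.

```javascript
// The whitespace characters from the table in section 1, as one string.
var whitespaceChars =
	'\t\n\x0b\f\r \xa0' +
	'\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
	'\u2028\u2029\u3000';

// Hypothetical helper: counts whitespace characters, then derives the four answers.
// Counting every character is for illustration only; it has no early exit.
function classify (sourceStr) {
	var whitespaceCount = 0;
	for (var charNo = 0; charNo < sourceStr.length; charNo++) {
		if (whitespaceChars.indexOf (sourceStr.charAt (charNo)) > -1) whitespaceCount++;
	}
	var nonWhitespaceCount = sourceStr.length - whitespaceCount;
	return {
		hasWhitespace:whitespaceCount > 0,
		hasNonWhitespace:nonWhitespaceCount > 0,
		// Assumption: an empty string is neither "only whitespace" nor "only non-whitespace".
		isWhitespace:whitespaceCount > 0 && nonWhitespaceCount === 0,
		isNonWhitespace:nonWhitespaceCount > 0 && whitespaceCount === 0
	};
}

var mixed = classify ('a b');
// mixed.hasWhitespace and mixed.hasNonWhitespace are true,
// mixed.isWhitespace and mixed.isNonWhitespace are false
```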

3. Benefits Over Using Regular Expressions

While it is possible to use regular expressions to detect whitespace and non-whitespace characters in strings, the Uize.Str.Whitespace module offers some key benefits.

3.1. Improved Performance

By avoiding the use of regular expressions, the Uize.Str.Whitespace module can achieve improved performance in performance-critical applications such as parser implementations.

In addition to avoiding regular expressions, the methods of the Uize.Str.Whitespace module also achieve improved performance by implementing optimized handling for the special case of single character source strings that avoids looping.
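
The single character special case might look something like the following sketch. This is only an illustration of the idea under the assumption that a lookup object is used; it is not taken from the module's source, and the names whitespaceLookup and hasWhitespaceSketch are hypothetical.

```javascript
// The whitespace characters from the table in section 1, as one string.
var whitespaceChars =
	'\t\n\x0b\f\r \xa0' +
	'\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
	'\u2028\u2029\u3000';

// Lookup keyed by character, built once up front.
var whitespaceLookup = {};
for (var lookupCharNo = 0; lookupCharNo < whitespaceChars.length; lookupCharNo++) {
	whitespaceLookup [whitespaceChars.charAt (lookupCharNo)] = true;
}

// Hypothetical sketch of a "has whitespace" test with a single character fast path.
function hasWhitespaceSketch (sourceStr) {
	var length = sourceStr.length;
	// Special case: a one character string is itself the lookup key, so no loop is needed.
	if (length === 1) return whitespaceLookup [sourceStr] === true;
	for (var charNo = 0; charNo < length; charNo++) {
		if (whitespaceLookup [sourceStr.charAt (charNo)] === true) return true;
	}
	return false;
}
```

A parser that tests one character at a time would always take the first branch in this sketch.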

3.2. Convenient Index Methods

The various index type methods of the Uize.Str.Whitespace module provide a more convenient and semantically elegant way of finding the index of whitespace or non-whitespace characters in a string.

To illustrate this, consider the following example of how an index could be obtained using a whitespace matcher regular expression versus using the Uize.Str.Whitespace module...

BEFORE

var regExp = /\s/g;
regExp.exec (sourceStr);
var whitespacePos = regExp.lastIndex - 1;

Using a regular expression, we have to create the regular expression and assign it to a local variable. Then, we call the exec method on the regular expression instance. Finally, we compute the index of the matched whitespace character by using the regular expression instance's lastIndex property. In order for this property to have a meaningful value, the regular expression instance must be created with the "g" flag.

All of this is not so intuitive. In contrast, using the Uize.Str.Whitespace.indexOfWhitespace static method produces a statement that is easy to read and make sense of...

AFTER

var whitespacePos = Uize.Str.Whitespace.indexOfWhitespace (sourceStr);
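
One detail of the regular expression approach is easy to miss: when exec fails to find a match on a regular expression with the "g" flag, it resets lastIndex to 0, so the subtraction yields -1. The sketch below wraps the BEFORE code in a function to show this. The function name regExpIndexOfWhitespace is hypothetical.

```javascript
// Hypothetical wrapper around the regular expression approach shown above.
function regExpIndexOfWhitespace (sourceStr) {
	var regExp = /\s/g;
	regExp.exec (sourceStr);
	// After a successful match, lastIndex is one past the matched character.
	// After a failed match, exec resets lastIndex to 0, so the result is -1.
	return regExp.lastIndex - 1;
}

var foundPos = regExpIndexOfWhitespace ('ab cd');   // 2
var missingPos = regExpIndexOfWhitespace ('abcd');  // -1
```

The "not found" result therefore works, but only as a side effect of how lastIndex is reset, which is part of what makes this approach hard to read.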

3.3. Start Position

The index type methods of the Uize.Str.Whitespace module provide an easy and understandable way to specify a start position for a search for whitespace or non-whitespace characters.

Consider the following example of how a start position for a search can be achieved using a regular expression versus using the Uize.Str.Whitespace module...

BEFORE

var regExp = /\s/g;
regExp.lastIndex = startPos;
regExp.exec (sourceStr);
var whitespacePos = regExp.lastIndex - 1;

Using a regular expression, we have to set the start position as the value of its lastIndex property before we call its exec method. Combining this with the other steps we need to perform, we end up with something that is far less elegant than just using the Uize.Str.Whitespace.indexOfWhitespace static method and specifying the start position using the optional second argument...

AFTER

var whitespacePos = Uize.Str.Whitespace.indexOfWhitespace (sourceStr,startPos);
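
For readers curious how a forward scan with a start position can work without a regular expression, here is a minimal loop-based sketch. It is not the module's implementation. The function name indexOfWhitespaceSketch, the -1 result for "not found", and the treatment of a missing or negative start position are all assumptions made for this example.

```javascript
// The whitespace characters from the table in section 1, as one string.
var whitespaceChars =
	'\t\n\x0b\f\r \xa0' +
	'\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
	'\u2028\u2029\u3000';

// Hypothetical sketch: scans forward from startPos (inclusive) for a whitespace character.
function indexOfWhitespaceSketch (sourceStr,startPos) {
	// Assumption: a missing or negative start position means "start at index 0".
	var charNo = startPos > 0 ? Math.floor (startPos) : 0;
	for (; charNo < sourceStr.length; charNo++) {
		if (whitespaceChars.indexOf (sourceStr.charAt (charNo)) > -1) return charNo;
	}
	// Assumption: -1 signals "not found", as with String.prototype.indexOf.
	return -1;
}

var firstPos = indexOfWhitespaceSketch ('ab cd ef');     // 2
var secondPos = indexOfWhitespaceSketch ('ab cd ef',3);  // 5
```

The start position is treated as inclusive here, which matches what setting lastIndex does in the regular expression version.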

3.4. Backwards Scanning

The Uize.Str.Whitespace.lastIndexOfWhitespace and Uize.Str.Whitespace.lastIndexOfNonWhitespace methods support backwards scanning to find the last whitespace or non-whitespace character in a source string.

This can be achieved with regular expressions by applying a bit of trickery, but there can be a performance cost. Consider the following example of how a last index of whitespace could be obtained using a regular expression versus using the Uize.Str.Whitespace module...

BEFORE

var regExp = /\s\S*$/g;
var match = regExp.exec (sourceStr);
var whitespacePos = match ? regExp.lastIndex - match [0].length : -1;

In order to achieve a backwards scan for the last whitespace character using a regular expression, we have to create a regular expression that matches a whitespace character, followed by any number of non-whitespace characters, and that is anchored to the end of the source string. Now, because our match could contain more than one character, we need to subtract the length of the first element in the match array from the value of the lastIndex property.

We don't need to deal with this kind of trickery if we just use the dedicated Uize.Str.Whitespace.lastIndexOfWhitespace static method...

AFTER

var whitespacePos = Uize.Str.Whitespace.lastIndexOfWhitespace (sourceStr);
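
To check the arithmetic in the regular expression version, the sketch below wraps it in a function and compares it with reading the index property of the match array, which exec sets to the position where the match starts. Both function names are hypothetical. Either way, the pattern itself remains the tricky part.

```javascript
// Hypothetical wrapper around the regular expression approach shown above.
function regExpLastIndexOfWhitespace (sourceStr) {
	var regExp = /\s\S*$/g;
	var match = regExp.exec (sourceStr);
	// lastIndex is the end of the match, so subtracting the match length gives its start.
	return match ? regExp.lastIndex - match [0].length : -1;
}

// Same pattern, but reading the start of the match from match.index instead.
function regExpLastIndexViaMatchIndex (sourceStr) {
	var match = /\s\S*$/.exec (sourceStr);
	return match ? match.index : -1;
}

var viaLastIndex = regExpLastIndexOfWhitespace ('ab cd ef');    // 5
var viaMatchIndex = regExpLastIndexViaMatchIndex ('ab cd ef');  // 5
```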

3.4.1. Backwards Scanning and Start Position

While backwards scanning for whitespace or non-whitespace characters using regular expressions is awkward enough, backwards scanning from a start position is even clumsier.

One way to accomplish this would be to create a slice of the source string that terminates at the desired start position for the scan. Then, the previously mentioned approach to backwards scanning using regular expressions could be applied.

BEFORE

var regExp = /\s\S*$/g;
var match = regExp.exec (sourceStr.slice (0,startPos + 1));
var whitespacePos = match ? regExp.lastIndex - match [0].length : -1;

Having to create a temporary slice of the source string and then use a tricky regular expression match on that slice is quite unfortunate from a performance perspective. The approach to supporting start position that is implemented in the Uize.Str.Whitespace.lastIndexOfWhitespace and Uize.Str.Whitespace.lastIndexOfNonWhitespace methods is better suited to performance-critical situations.

AFTER

var whitespacePos = Uize.Str.Whitespace.lastIndexOfWhitespace (sourceStr,startPos);
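
A backwards scan from a start position needs neither a slice nor a regular expression if it is written as a plain loop. The following sketch shows the idea. It is not the module's implementation. The function name lastIndexOfWhitespaceSketch, the -1 result for "not found", and the handling of a missing or too-large start position are assumptions made for this example.

```javascript
// The whitespace characters from the table in section 1, as one string.
var whitespaceChars =
	'\t\n\x0b\f\r \xa0' +
	'\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
	'\u2028\u2029\u3000';

// Hypothetical sketch: scans backwards from startPos (inclusive) for a whitespace character.
function lastIndexOfWhitespaceSketch (sourceStr,startPos) {
	var lastPos = sourceStr.length - 1;
	// Assumption: a missing or too-large start position means "start at the last character".
	var charNo =
		typeof startPos === 'number' && startPos < lastPos ? Math.floor (startPos) : lastPos;
	for (; charNo >= 0; charNo--) {
		if (whitespaceChars.indexOf (sourceStr.charAt (charNo)) > -1) return charNo;
	}
	// Assumption: -1 signals "not found", as with String.prototype.lastIndexOf.
	return -1;
}

var lastPosFound = lastIndexOfWhitespaceSketch ('ab cd ef');   // 5
var earlierPos = lastIndexOfWhitespaceSketch ('ab cd ef',4);   // 2
```

No temporary string is created in this sketch: the loop simply starts at the requested position and walks towards index 0.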

4. Comprehensively Documented and Tested

The Uize.Str.Whitespace module is comprehensively documented and has exhaustive unit tests in the Uize.Test.Uize.Str.Whitespace test module.