April 2009 Archives

Tue Apr 28 10:23:05 UTC 2009

Regular Expresions ... ISO dates

I've been doing some data validation recently, and this afternoon wanted to get a decent date field set up. I like ISO 8601 dates, they sort well in most default locales, and for the application in question I've gone for YYYY-MM-DD as the template.

This is trivially easy to detect as distinct from plain text, with

    \d{4}-\d{2}-\d{2}

but of course this does not do any validation of the content i.e. 9999−99-99 is seen as correct.

It's easy to improve on this expression to match the range of numbers for months 01 to 12 ⇒

    (0[1-9]|1[012])

, but actually making it correct is much harder. We need to know which months have only 30 days as a maximum, and to be able to detect leap-years that permit February to get to 29 days.

I had a quick google around – there are quite a few regexp cargo cult code libraries around the place, many don't match quite the same date format (yy/dd/mm is horrible), and most of they are simply not explained.

Then I found a nice article from Michael Ash (http://bit.ly/mashregex) that talked about ways to extend the simple versions, and explained the best bit, leap-year detection! I didn't need everything from his examples (I didn't want to make a leading zero optional, nor allow non-hyphen separators), and I only needed to validate from year 2000 to 2099, so I ended up with a much less complex expression.

The leap-year detection was easy enough, Michael had spotted a nice pattern that enumerates all the possible leap-years in a century (specifically, one where the initial year is also leap, which 2000 was) :-

00 04 08 12 16
20 24 28 32 36
40 44 48 52 56
60 64 68 72 76
80 84 88 92 96
This pattern reduces down to two parts –
    ([02468][048]|[13579][26])

The building blocks look like this :-

  • Which months have 31 days in them? January, March, May, July, August, October and December
    • So only these months can have “day 31” in any year
      • “all day 31”
    (0[13578]|1[02])-31
  • Which months have at least 30 days in them? Everything except February
    • So only 11 months can have “day 30” or “day 29” in any year (i.e. whether the year is leap or not)
      • “all days 29 or 30”
    (0[1,3-9]|1[012])-(30|29)
  • All months can have days 01 through to 28, in any year
    • “all days ≤ 28”
    (0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-8])
  • February can have “day 29” if we are in a leap year
    • “all Feb 29s”
    ([02468][048]|[13579][26])-02-29

Three of these four building blocks are valid for any year, and can therefore be grouped together. The pseudo-code looks like :-

  • starts with “20”
    • followed by leap-years, and is “Feb 29”
    • OR
    • followed by any two numbers (doesn't need to exclude leap-years)
      • is any day ≤ 28
      • OR
      • is any valid day 29 or 30
      • OR
      • is any valid day 31

And the final expression (tested with Visual REGEXP, which is already packaged for Ubuntu as visual-regexp) ..

20(([02468][048]|[13579][26])-02-29|\d\d-((0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-8])|(0[1,3-9]|1[012])-(30|29)|(0[13579]|1[02])-31))

Formatted for readability:

20
(
    ([02468][048]|[13579][26])-02-29
    |
    \d\d-
    (
        (0[1-9]|1[012])-(0[1-9]|1[0-9]|2[0-8])
        |
        (0[1,3-9]|1[012])-(30|29)
        |
        (0[13579]|1[02])-31
    )
)

Posted by Jim Cheetham | Permanent Link