Why do we need serial dates in the Transitional form of IS 29500?
October 22, 2009
As a member of ISO/IEC JTC1 SC34 Working Group 4 (which Mr Norbert Bollow of the Swiss mirror committee somewhat bizarrely refers to as “so-called”) and someone directly implicated in his recent blog post, I thought it might be useful to help him understand the situation more clearly.
I have been heavily involved in spreadsheets over the last 14 years working at Datawatch. For the last 9 years, I have been in charge of the Monarch and Monarch Data Pump products, which have interacted heavily with spreadsheets, both from an input and output perspective. We supported (and still support older versions – the file format specifications are not available for later versions) Lotus 1-2-3 as well as Excel, stretching back to Excel 2.1 and Lotus versions well before that.
The Monarch product is primarily used in conjunction with Excel, with approximately 95% of our users reading, writing, appending and updating Excel spreadsheets, both the older binary formats and the new OOXML format.
We have a large user base of around 500,000 users worldwide and have gained fairly detailed knowledge of how people use and abuse spreadsheets in various ways, as well as how many other vendors import and export spreadsheets from their applications in the 18 years since the first release of Monarch.
Sowing the seeds – versioning
Because ECMA376-1 had no versioning scheme, applications created to consume and create OOXML documents were unable to distinguish between ECMA376-1 and future versions. This should have been addressed in the original specification and certainly at the BRM. It was not, which casts doubt on the unimpeachable sagacity which some seem to attribute to decisions made at the BRM. This sacred cow status, especially surrounding ISO8601 dates, is not a healthy thing; it should be subject to scrutiny and review, especially with far more time available to analyze the ramifications of changes than was available at the BRM.
Sowing the seeds – anyone for spreadsheets?
Another aspect of the BRM and indeed much of the process around OOXML is the lack of spreadsheet experts involved. Practically all those involved in the process are really XML and document specialists. Their background, depending on age, is almost always SGML and XML, not VisiCalc, Lotus 1-2-3, Quattro Pro, Excel, Gnumeric and Calc.
Spreadsheets, then, became second-class citizens in the process, with few people showing them the care and attention showered upon the word processing aspects of the specification.
The XML experts came from the viewpoint of XML Schema, which, many may be surprised to learn, does not itself fully implement ISO8601 dates. It wisely uses a tightly defined subset of ISO8601. Many advocated that ISO8601 dates should be used within OOXML documents. This approach is eminently sensible, since it is much simpler to consume documents with XML technology if the date data can be easily consumed and processed by common XML tools.
However, spreadsheets have paid very little attention to XML. Their file formats have historically been extremely terse and efficient, and they have their own design goals, which are distinctly different from those of word processors. One thing that spreadsheets always do is store and process dates as serial date values. Almost every single spreadsheet file in existence contains serial date values.
The Leap Year Bug – or not
Mr Bollow refers to the reintroduction of the leap year bug which was introduced by Lotus 1-2-3 and replicated in Microsoft Excel. The very fact I say “introduced by Lotus 1-2-3” gives you an idea how venerable this bug actually is.
This bug has existed for an exceedingly long time and anyone that deals with spreadsheets is well aware of it. In fact, it has really ceased to be a bug and become expected behaviour. I can’t claim this is a good thing, but that is the way that it is.
Now, when you start to deviate from expected behaviour that has existed for decades, you will run into problems. The number of spreadsheet consuming and producing applications is gigantic and changing the ground rules is not an option if you want any sort of interoperability. The leap year bug is not an intrinsic issue with serial dates themselves, but an application issue introduced by Lotus 1-2-3 which has become accepted practice.
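To make the bug concrete, here is a minimal Python sketch of how a consumer might turn a serial date into a calendar date. It assumes the widely used 1900 date system, in which serial 1 is 1 January 1900 and serial 60 is the 29 February 1900 that never existed. The function name `serial_to_date` is my own invention for illustration, not something taken from any specification or product.

```python
from datetime import date, timedelta
from typing import Optional


def serial_to_date(serial: int) -> Optional[date]:
    """Convert a 1900-system serial day number to a calendar date.

    Returns None for serial 60, the 29 February 1900 that never existed.
    """
    if serial < 1:
        raise ValueError("serial dates in this sketch start at 1")
    if serial == 60:
        return None  # the fictitious leap day inherited from Lotus 1-2-3
    # Serials before the phantom day count from 31 Dec 1899; later ones are
    # shifted by one day, so the effective epoch becomes 30 Dec 1899.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)


for n in (1, 59, 60, 61, 40108):
    print(n, serial_to_date(n))
```

The whole bug is confined to one phantom day and a one-day shift for the first two months of 1900. Every consumer that knows the convention handles it in a couple of lines, which is why it behaves more like a convention than a defect in practice.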
If Mr Bollow is saying that ISO8601 dates must be used everywhere without exception, then I am assuming he also advocates throwing Unix time over the side in favour of an ISO8601 implementation. Good luck with that. Oh and OpenFormula too. (More later).
Well, maybe semi-reintroduced
In addition, Mr Bollow fails to make the distinction between the two forms of OOXML: Strict, the purer form of the standard, and Transitional, which is meant to help support “legacy” information and provide a transition vehicle to Strict.
Serial dates for spreadsheet cell values were not allowed in the Strict form and have not been reintroduced in the Strict form by Working Group 4. As far as I am aware, there is absolutely no intention to do so. In addition, many thought that serial dates were still allowed in the Transitional form, and it came as some shock to many when I pointed out, back in February, that they were not.
Another important point is that this only affects spreadsheet cell values. ISO8601 functionality is not being excised by these changes; in fact, ISO8601 dates are still allowed in spreadsheet cell values in the Transitional form. Personally, I don’t think this is wise. To fully avoid data loss issues, ISO8601 dates in spreadsheet cell values should not be allowed in Transitional, but should be the only allowed date values in Strict.
There are many places where the ISO8601 date specification is used and will continue to be used in spreadsheets in the Transitional form, such as the WEEKNUM function, which has arguments to specify ISO 8601 week numbering.
Can’t we just leave it as it is and let users/vendors sort out the mess?
I have heard this argument, and it immediately marks out anyone who makes it as a complete dilettante with respect to the workings of finance.
In contrast to word processing documents, dates are far more pervasive within spreadsheets and in general, far more critical. Most financial analysis is date-based, reporting is always date based and calculations of financial instruments are mostly date based. It is safe to say that an extremely high proportion of spreadsheets contain dates and that the integrity of those dates is critical.
In ECMA376-1 all dates in spreadsheets were treated as serial dates, so any reading and writing of dates was using this format. With the enforcement of ISO 8601 Dates in the current specification (§18.17.4, §184.108.40.206, §220.127.116.11) and the primary example (§18.104.22.168) featuring the use of ISO 8601 Dates and the newly introduced d attribute (from §18.18.11 ST_CellType), all conforming applications must write dates in SpreadsheetML cells in ISO 8601 format.
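As a rough illustration of the two representations, the sketch below parses two hand-written cell fragments with Python's standard library. The fragments are my own simplified examples, not excerpts from the specification: the style index `s="1"` is a hypothetical stand-in for whatever date number format a real workbook would define, and `describe` is a helper name I made up.

```python
import xml.etree.ElementTree as ET

# Transitional SpreadsheetML namespace as used by ECMA376-1 documents.
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"

# A date stored the ECMA376-1 way: a plain number, shown as a date via a style.
SERIAL_CELL = f'<c xmlns="{NS}" r="A1" s="1"><v>40108</v></c>'

# The same date stored as an ISO 8601 string, flagged with t="d".
ISO_CELL = f'<c xmlns="{NS}" r="A1" t="d"><v>2009-10-22T00:00:00Z</v></c>'


def describe(cell_xml: str) -> tuple:
    """Return (cell type, raw value text) for a single <c> element."""
    cell = ET.fromstring(cell_xml)
    cell_type = cell.get("t", "n")  # a missing t attribute means numeric
    value = cell.find(f"{{{NS}}}v").text
    return cell_type, value


print(describe(SERIAL_CELL))
print(describe(ISO_CELL))
```

Both fragments are schema-valid, because the cell value is just text. Nothing but the `t` attribute tells a consumer that the second value is no longer a number.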
This semantic change (no schema-enforced change exists) means that all existing applications fail to open IS29500 transitional spreadsheet documents correctly. The observed behaviour of applications ranges from an inability to open the document to silent data loss.
This is a huge problem: if you combine the inability to distinguish between instance documents of two versions of the specification with a semantic change, you have a recipe for disaster. To simplify, imagine the chaos that would ensue if you silently changed the currency you use for accounting, but didn’t tell any of your finance staff when.
The silent data loss encountered is made even more problematic, in that the value may be parsed as an incorrect date, instead of a null value or other failure. This means that there is little to alert the user of problems. Dependent formulas will not fail with obvious errors such as divide by zero; they will fail only where additional logic takes date boundaries into account. Visual recognition of the failure will usually be required.
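Here is a sketch of one way such a wrong-but-valid date could arise. I do not know how any of the tested applications parse values internally, so the lenient parser below is purely hypothetical: it mimics the C `atoi` habit of reading leading digits and ignoring the rest. The names `lenient_number` and `legacy_read_date` are mine, and the epoch assumes the 1900 date system.

```python
import re
from datetime import date, timedelta
from typing import Optional

EPOCH = date(1899, 12, 30)  # effective 1900-system epoch for serials above 60


def lenient_number(text: str) -> Optional[int]:
    """Hypothetical atoi-style parse: keep leading digits, ignore the rest."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def legacy_read_date(cell_text: str) -> Optional[date]:
    """Treat the cell text as a serial date, as an ECMA376-1 consumer would."""
    serial = lenient_number(cell_text)
    if serial is None:
        return None
    return EPOCH + timedelta(days=serial)


print(legacy_read_date("40108"))                 # serial date, read correctly
print(legacy_read_date("2009-10-22T00:00:00Z"))  # ISO string, read as serial 2009
```

Under these assumptions the ISO 8601 string for 22 October 2009 comes back as a perfectly plausible date in 1905. No exception, no null, no warning: exactly the kind of failure that only a human looking at the sheet will catch.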
Some other scenarios to consider are when using spreadsheet files that are linked to other spreadsheets, or products that perform lookups to spreadsheets. This means that patching of older applications would have to be absolute and in sync across all organizations involved. For example, a spreadsheet in one country could be used as a lookup for data by spreadsheets in another country. This would mean that all branches, divisions and subsidiaries of an organization may need to ensure that all application software that consumed spreadsheets is patched in sync to avoid data loss.
The applications I tested are easily available ones, but we also need to consider larger applications such as ERP and BI software where spreadsheets are used bi-directionally. This means that bad data in the local spreadsheet could be propagated to an enterprise-wide system. Patching such enterprise-wide systems is an extremely costly undertaking. Even patching Office suites is a very costly undertaking, in terms of testing and rollout, even if there is no additional software cost from the application vendor.
But how many ECMA376-1 consuming applications can there be?
Another argument is that this is a storm in a teacup: the primacy and purity of ISO8601 dates is more important than pandering to the handful of applications out there that might encounter a later version and produce errors.
There are many of them out there: obviously Office 2007, which is fairly popular, I’ve heard, but also a huge number from smaller vendors such as SAP, Oracle, Lawson and IBM (e.g. support for Excel 2007 in Cognos) – people like that. As mentioned before, most ERP (Enterprise Resource Planning) and BI (Business Intelligence) vendors deal with XLSX files, some of them bi-directionally, meaning bad data could be propagated globally throughout these enterprise-level systems.
But these are just the tip of the iceberg; there are many bespoke in-house systems, especially in the financial space, that rely heavily on Excel files.
But won’t vendors have been slow to adopt the new formats, I hear you cry? The answer, in this case, is no. The benefit of the OOXML format is that it has been much easier and quicker for vendors to implement support than the old binary Excel formats, which were horrible. The documentation was obviously much better and the XML nature meant it was far easier to implement support on different platforms, which is key for enterprise vendors that run on a huge gamut of different OS and technology platforms.
In addition, there has been a lot of pressure from users to support the new formats, as the size of spreadsheets was greatly expanded with the introduction of OOXML. Previously, Excel spreadsheets were limited to 65,536 rows. Enough for any spreadsheet, you may say, but in my experience, they always wanted more.
We frequently had product enhancement requests to allow Monarch to export many hundreds of thousands of rows into spreadsheets, using tricks such as populating one sheet, moving on to the next when the limit was reached and so on.
I wonder about the wisdom of million row spreadsheets, but users will always seek to push the envelope.
Behaviour of existing applications when encountering ISO 8601 Dates
Some testing (earlier this year) was performed on easily available applications, to see what the scenario of using ISO8601 dates in an instance document would look like. The following implementations fail when opening a file which contains no changes introduced in IS29500, except for ISO 8601 dates, with the t attribute of the cell set to “d”. Only Datawatch Monarch 10 will work without error, under the (unlikely) condition that the ISO-8601 date string only includes the date portion. All other tested implementations fail.
Office 2007 SP1
Warning dialog appears “Excel found unreadable content in <>. Do you want to recover the contents of this workbook …” On clicking Yes, the file is loaded but all dates are removed.
Office 2007 SP2 Beta
No warning dialog appears, dates are silently corrupted, but still exist within the file as valid, but incorrect dates.
OpenOffice 3.0.1 Calc
Similar behaviour to Office 2007 SP2 Beta
Similar behaviour to Office 2007 SP2 Beta
Apple iWork ’09 Numbers
Similar behaviour to Office 2007 SP2 Beta
Similar behaviour to Office 2007 SP2 Beta
Excel Mobile 6.1
Similar behaviour to Office 2007 SP2 Beta
Datawatch Monarch V9 / Monarch Data Pump V9
File cannot be opened
Datawatch Monarch V10 / Monarch Data Pump V10
File can be opened correctly if only the date portion of an ISO-8601 date string exists. If it is the long form, an error message warning of corrupt data appears, informing the user that it will be imported as nulls. The problem can be rectified by changing the field type from date to character. Note that Monarch is often used in lights-out operation and Data Pump is only used in lights-out operation.
Cleaning up the mess
So, the “so-called” Working Group 4 were faced with the following set of problems:
- For an existing ECMA 376-1 consuming application, there was no way to distinguish a later version, so the application would happily read and process any future instance, unaware of any changes (especially purely semantic ones!)
- For an existing ECMA 376-1 consuming application, there was no way to distinguish between a document of conformance class strict versus one of conformance class transitional.
- Changes to implement ISO 8601 dates in SpreadsheetML had not been thought out well at all in the BRM process.
- Changes to implement ISO 8601 dates per se had not been thought out well at all in the BRM process (i.e. no subsetting as per the XML Schema spec)
- Many assumed that serial dates were still allowed in the transitional form, which one could easily assume based on the lack of strong typing (the cell value, which is the target container for dates, is a string, not a date, with an optional attribute to indicate an ISO8601 date). In addition, there is a large amount of text in the OOXML specification referring to serial dates.
- The catastrophic silent data loss issue proven to exist in many applications designed for ECMA376-1.
We all know the various stories about the financial catastrophes that can occur with errors in spreadsheets. Compounding this enormously at the file format level would not be popular amongst organisations such as EUSPRIG or indeed, anyone using spreadsheets at all, which is just about everyone.
So what did Working Group 4 decide to do about this?
Let’s take a look at the Scope statement for IS29500:
"ISO/IEC 29500 defines a set of XML vocabularies for representing word-processing documents, spreadsheets and presentations. On the one hand, the goal of ISO/IEC 29500 is to be capable of faithfully representing the preexisting corpus of word-processing documents, spreadsheets and presentations that had been produced by the Microsoft Office applications (from Microsoft Office 97 to Microsoft Office 2008, inclusive) at the date of the creation of ISO/IEC 29500. It also specifies requirements for Office Open XML consumers and producers. On the other hand, the goal is to facilitate extensibility and interoperability by enabling implementations by multiple vendors and on multiple platforms."
(For anyone that was wondering, Office 2008 was the Mac version.)
Although the scope statement’s “preexisting corpus” references only Microsoft Office, the goal certainly applies to the huge corpus of documents produced by applications other than Office, but consumable by Office, too.
- Since the Transitional form is meant to help deal with the transition of legacy documents, it was decided to make best efforts to provide compatibility with ECMA376-1 in the Transitional form of OOXML, so that existing applications worked properly. This involved clarifying or reintroducing, depending on your point of view, the use of serial dates for SpreadsheetML cell values.
- Since the Strict form is the ideal form of the specification (ISO8601 dates only etc), where applications should strive to end up over time, it was decided to change the namespace, so that applications designed for ECMA376-1 would not be able to read Strict documents, avoiding data loss issues. In the absence of an existing versioning system, this was the only way to prevent existing applications from processing future version files that would likely create compatibility issues.
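The effect of the namespace change can be sketched in a few lines of Python. The two namespace URIs below are the Transitional and Strict SpreadsheetML namespaces as I understand them; do check them against the published standard before relying on them. The function names are hypothetical and the workbook fragments are stripped down to a bare root element.

```python
import xml.etree.ElementTree as ET

# Namespace an ECMA376-1 consumer was written against.
TRANSITIONAL_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
# Namespace of the Strict form (assumption: verify against the standard).
STRICT_NS = "http://purl.oclc.org/ooxml/spreadsheetml/main"


def root_namespace(xml_text: str) -> str:
    """Return the namespace URI of the document's root element, or ''."""
    tag = ET.fromstring(xml_text).tag  # ElementTree form: '{uri}localname'
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def legacy_consumer_accepts(xml_text: str) -> bool:
    """An ECMA376-1 consumer only recognises the namespace it was built for."""
    return root_namespace(xml_text) == TRANSITIONAL_NS


transitional_doc = f'<workbook xmlns="{TRANSITIONAL_NS}"/>'
strict_doc = f'<workbook xmlns="{STRICT_NS}"/>'

print(legacy_consumer_accepts(transitional_doc))
print(legacy_consumer_accepts(strict_doc))
```

An old consumer needs no patch to get this right: it simply finds no elements it recognises in a Strict document and fails loudly, which is far preferable to reading it and silently mangling the dates.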
In addition, some members of Working Group 4 are determined to consider the implementation details of ISO8601 dates in spreadsheets, and more widely, possibly using a subsetting approach like that found in XML Schema.
There certainly needs to be a definition of which forms of ISO 8601 elements should be used, for example, possibly specifying that “Complete Representation in Extended Format” should be used for dates and times, with separators explicitly defined and so on. Other considerations might be the expansion of the range of valid dates to years less than zero and greater than 9999. ISO 8601 allows for a fair degree of ambiguity, so honing down the allowable forms would make implementers’ lives much easier.
There is also a wealth of other aspects of ISO8601 that would need to be excluded, such as recurring time intervals.
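To give a feel for what such a subset might look like in practice, here is a small Python sketch that accepts only one form, a complete date and time in extended format with explicit separators, and rejects everything else. The choice of this exact form is my assumption for illustration; Working Group 4 has not decided on any particular subset, and the function names are hypothetical.

```python
import re
from datetime import datetime

# Exactly YYYY-MM-DDThh:mm:ss, with the separators spelled out.
_COMPLETE_EXTENDED = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}")


def parse_strict_subset(text: str) -> datetime:
    """Parse the one permitted form; raise ValueError for anything else."""
    if not _COMPLETE_EXTENDED.fullmatch(text):
        raise ValueError(f"not in the permitted ISO 8601 subset: {text!r}")
    # strptime also rejects impossible dates such as 30 February.
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%S")


def is_accepted(text: str) -> bool:
    """Convenience wrapper: True if the text parses under the subset."""
    try:
        parse_strict_subset(text)
        return True
    except ValueError:
        return False


samples = [
    "2009-10-22T14:30:00",          # complete, extended format
    "20091022T143000",              # basic format, no separators
    "2009-W43-4",                   # week date
    "2009-10-22",                   # date only
    "R5/2009-10-22T00:00:00/P1D",   # recurring time interval
]
for sample in samples:
    print(sample, is_accepted(sample))
```

All five samples are legitimate ISO 8601 in one form or another, yet only the first passes. A sentence or two in the specification pinning down the permitted form would spare every implementer from guessing which of the others they must also handle.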
In the final analysis, the venerable Leap Year bug, now, somewhat strangely, elevated to accepted behaviour, is far less dangerous than the silent data loss that disallowing serial dates in spreadsheet cells could cause.
Rob Weir (CoChair – OASIS ODF TC, Member – OASIS ODF Adoption TC, Member -OASIS ODF Interoperability and Conformance TC, Member – INCITS V1, Chief ODF Architect, IBM)
This is an interesting post, but there are a few issues that I need to address here:
“If you guessed “Microsoft”, you may advance to the head of the class.”
Alas Rob, it was Lotus that thrust this onto the world when they were the dominant spreadsheet and the minnow Excel had to play ball!
“The “legacy reasons” argument is entirely bogus. Microsoft could have easily have defined the XML format to require correct dates and managed the compatibility issues when loading/saving files in Excel. A file format is not required to be identical to an application’s internal representation.”
That may well be true, but I would imagine that would cause a large technical burden when managing backward compatibility with fixes such as the Compatibility Pack, as well as for the tens of thousands of developers reading and writing BIFF8 (the older Excel native binary format) who likely consumed, processed and exported serial dates.
Spreadsheets historically did not have date engines that could deal natively with ISO8601 dates and I doubt any do now. They could, of course, parse them in and out, but it is not a trivial amount of work to put in the plumbing, and why take the performance hit? Serial dates are great for date diffing and grouping, which is one of the most common operations, i.e. how old is this debt, what transactions are in this quarter etc.
In addition, this argument cuts both ways: applications could convert serial dates into ISO 8601 dates if they so wished. Anyway, as of today, we have to clean up the mess as best we can.
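A short Python sketch shows why serial dates suit this kind of work: ageing a debt is a subtraction and a quarter is a numeric range. The invoice labels and dates are invented for illustration, and the epoch assumes the 1900 date system for dates after February 1900.

```python
from datetime import date

EPOCH = date(1899, 12, 30)  # assumed epoch; valid for dates from March 1900 on


def to_serial(d: date) -> int:
    """Calendar date to serial day number."""
    return (d - EPOCH).days


# Hypothetical debts, stored the way a spreadsheet would: as plain integers.
debts = {
    "invoice-a": to_serial(date(2009, 7, 15)),
    "invoice-b": to_serial(date(2009, 10, 1)),
}
today = to_serial(date(2009, 10, 22))

# "How old is this debt?" is integer subtraction.
ages = {label: today - due for label, due in debts.items()}

# "What falls in this quarter?" is a numeric range check.
q4_start = to_serial(date(2009, 10, 1))
q4_end = to_serial(date(2009, 12, 31))
in_q4 = [label for label, due in debts.items() if q4_start <= due <= q4_end]

print(ages)
print(in_q4)
```

Once the values are loaded, no date parsing happens at all. That is the performance argument in miniature.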
Allowing serial dates in OOXML also makes it easier to interoperate with the forthcoming OpenFormula specification, which reasonably eschews ISO8601 dates in favour of serial dates and datetimes as input. BTW OpenFormula looks excellent and I must commend the work of Dave Wheeler and the rest of the OpenFormula SC.
As per the latest OpenFormula draft of May 9, 2009:
“A Date is a subtype of number; the number is the number of days from a particular date called the epoch. Thus, a date when presented as a general-purpose number is also called a serial number.” …
“A DateTime is also a subtype of number, and for purposes of formulas it is simply the date plus the time of day.”
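The two definitions quoted above translate almost directly into code. In this Python sketch the whole part of the number is the day count and the fraction is the time of day. The draft only speaks of “a particular date called the epoch”, so the epoch used here (30 December 1899, matching the 1900 date system for modern dates) is my assumption, as are the function names.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1899, 12, 30)  # assumed epoch, not mandated by the quoted text


def serial_to_datetime(serial: float) -> datetime:
    """Whole part = days since the epoch, fractional part = time of day."""
    return EPOCH + timedelta(days=serial)


def datetime_to_serial(moment: datetime) -> float:
    """Inverse conversion: elapsed time since the epoch, measured in days."""
    return (moment - EPOCH) / timedelta(days=1)


print(serial_to_datetime(40108))      # a Date: midnight on that day
print(serial_to_datetime(40108.75))   # a DateTime: three quarters through the day
print(datetime_to_serial(datetime(2009, 10, 22, 6, 0)))
```

A Date is simply a DateTime whose fraction happens to be zero, so formulas can treat both as ordinary numbers.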
I do hope Mr Bollow is pursuing the OpenFormula SC with the same vigour for their anti-ISO8601 activities; maybe we can convince him together!
Joel Spolsky (former Excel Program Manager)
This explains the infamous leap year issue that Lotus created and Excel had to stomach.
The problem is that the horse has bolted; we now have to figure out how to do the best with what we have.
Jesper Lund Stocholm (SC34/WG4 Member, Danish Standards)
Alex Brown (SC34/WG1 Convenor, SC34/WG4 Member, British Standards)
ISO 8601 date discussion at Copenhagen WG4 meeting.