We currently have a tool called reftest, which compares the rendering of HTML testcases against reference files – usually written in HTML, but they could just be a container document enclosing a PNG.
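For anyone who hasn’t looked at reftest, the pairing of testcase and reference is driven by a simple manifest; it looks roughly like this (the filenames are invented, and the exact syntax may have evolved):

    # reftest.list: each line pairs a testcase with its reference.
    # "==" means the two pages must render identically;
    # "!=" means they must render differently.
    == green-background.html green-background-ref.html
    != bug12345-testcase.html bug12345-notref.html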
It would be really cool if reftest were a Tinderbox test, with the tinderbox building Firefox with the particular options reftest needs. We could then enhance it to also run various HTML and CSS testsuites on the web (e.g. Hixie’s) and also do Save Page As… on various popular websites and add those files into the mix.
If we could get to a position where our latest rendering engine were being automatically and regularly checked against thousands of testcases and real-world pages, we’d immediately be able to see the effect on the web of any changes we made. We’d be much better placed to catch regressions.
Anyone feel like making this happen?
Errm … for me, the obvious short answer to the question above is “no”! :)
But this did remind me of something else I saw recently: Acid2.
I was surprised when FF 2.0.0.1 failed to render the page! :(
David: Recent (as of about a week ago) trunk builds of Firefox 3 pass Acid2. But it took one of the largest patches Mozilla has ever seen to get it working. The change was far too risky and too late for Firefox 2, but you can rest assured that Firefox 3 will pass with flying colors.
Would be cool if we could feed reporter data into this somehow, and track regressions. Perhaps by comparing 1.0/1.5/2.0’s renderings of a reported page against each other and against 3.0’s.
I can make the data available by feed or some other mechanism, if someone creates the functionality to use it.
Indeed. If reporter were to automatically download and save commonly-reported pages, the result could be used as a testcase for fixing the bug and then added to the test suite.
I’m not sure we want to compare across browser versions; we’d expect there to be quite a lot of (desired) changes.
davel had this set up running on buildbot and mirroring to tinderbox before he left. Not sure what happened to that machine, though — I can’t seem to find it on tinderbox anymore.
I definitely think we need to convert Hixie’s existing tests to work with reftest. Using larger, more complex (but static) pages for the tests is an idea that hadn’t crossed my mind; I thought the tests had to be as minimized as possible, but this might not be the case. Maybe dbaron can weigh in on this.
Also, what’s sort of scary is that in the short time that davel had his test tinderbox running, and with only about five existing tests in the testsuite, we found a regression (bug 358433). I can’t imagine how many more we’d find with a larger set of testcases.
I’m not totally sure how davel hacked his tinderbox to run reftest. I know it was decided that using the `make check’ target was not appropriate for this test… Maybe someone could get a hold of davel and find out how he was doing it.
I don’t know about “a lot of”, but it might well be that most of the changes are desired.
Still, if the tool could produce a list of the pages whose rendering changed, and maybe generate a web page showing (scaled-down versions of) the two renderings side by side, then humans could go down the list and try to ascertain which changes are desired improvements and which are regressions. Here I’m thinking not so much of active developers as of web developers who _follow_ Mozilla development but maybe don’t have such an active role in it.
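Just to make the side-by-side page idea concrete, here is a very rough sketch of what such a report generator might look like; everything in it (directory names, the byte-for-byte comparison) is made up for illustration:

    #!/usr/bin/env python
    # Sketch: build an HTML page showing old vs. new renderings side by side.
    # Assumes two directories of screenshots with matching filenames; a real
    # tool would compare decoded pixels and tolerate tiny differences rather
    # than comparing files byte-for-byte.
    import os

    OLD_DIR = "shots-old"   # e.g. renderings from the last release
    NEW_DIR = "shots-new"   # e.g. renderings from a trunk nightly

    rows = []
    for name in sorted(os.listdir(OLD_DIR)):
        new_path = os.path.join(NEW_DIR, name)
        if not os.path.exists(new_path):
            continue
        with open(os.path.join(OLD_DIR, name), "rb") as f:
            old_bytes = f.read()
        with open(new_path, "rb") as f:
            new_bytes = f.read()
        if old_bytes == new_bytes:
            continue                    # rendering unchanged; skip it
        rows.append('<tr><td>%s</td>'
                    '<td><img src="%s/%s" width="400"></td>'
                    '<td><img src="%s/%s" width="400"></td></tr>'
                    % (name, OLD_DIR, name, NEW_DIR, name))

    with open("comparison.html", "w") as out:
        out.write("<table border='1'><tr><th>Page</th><th>Old</th><th>New</th></tr>\n")
        out.write("\n".join(rows) + "\n</table>\n")

A human could then scroll down comparison.html and flag anything that looks like a regression rather than an intended improvement.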
At least in theory, if the tool can detect the changes, it could also try to pin down when each change occurred, i.e. what the last browser version was that produced the old rendering. (That should be O(log n), where n is the number of browser versions under consideration, I think. Periodically, when you determine that you’ve fixed pretty much all of the outstanding rendering regressions, you would tell the tool not to compare against anything older than a certain date anymore.)
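Spelled out, the O(log n) claim is just a binary search over the ordered list of versions; something like this sketch, where render_page() is a stand-in for whatever actually produces a screenshot of a page with a given build (not a real API):

    def last_version_with_old_rendering(versions, render_page, url):
        """versions is ordered oldest to newest.  Returns the most recent
        version whose rendering of url still matches the oldest version's,
        assuming the rendering changed at most once within the range."""
        baseline = render_page(versions[0], url)
        lo, hi = 0, len(versions) - 1
        if render_page(versions[hi], url) == baseline:
            return versions[hi]        # nothing changed across the whole range
        # Invariant: versions[lo] matches the baseline, versions[hi] does not.
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if render_page(versions[mid], url) == baseline:
                lo = mid
            else:
                hi = mid
        return versions[lo]

Each iteration halves the range, so only O(log n) renderings are needed per page.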
Then, if the humans have any doubt about whether a change is desired (as I suspect they often would; it’s not always obvious which rendering is technically correct when the difference is subtle), they could look at the list of checkins in that timeframe and try to determine whether any were _intended_ to have the effect shown in the screenshot. If not, they could call the developers’ attention to that one.
Sure, there would be some false alarms. No system is perfect.