Even a chimp can write code

Thursday, April 15, 2010

Of Crashing and Sometimes Burning

I don't feel we've narrated the story about error and exception handling in Silverlight very well. We started out in v1.0 with a relatively sucky error reporting and diagnostics story and only did real work in building a robust internal infrastructure and adding diagnostics support incrementally, starting with Silverlight 2. We've since made big strides.

Our core tenet is to always provide a robust and reliable operating experience for apps.

What conditions can cause crashes?

Crashes are rare occurences. Ideally we'd avoid them altogether. They are symptoms of one of the following conditions:

  1. A bug in the platform
  2. A security mitigation being acted out
  3. The app doing an illegal reentrant call

Crashes due to platform bugs
These are rare because few bugs of that severity ever get past our quality gates. This remedy is used within Silverlight only in situations where there's a reasonable chance the platform is in uncertain state and going ahead might be fraught with problems. Within this category, there are two variants: shutting down the app and then the runtime; and tearing down the host process.

Of the two variants, the former is the more common approach. It is typical for Silverlight to notify the app via Application.UnhandledException, and most times followed by raising the OnError DOM event for the OBJECT tag prior to shutdown. If you don't see either of these happening, then that might also indicate a bug, and you should report it.

In some situations Silverlight may not consider it tenable to continue processing. Instead it takes recourse to a failfast - aka a forceful teardown. Examples include when the internal state machine appears to be corrupt or required data is missing (an ExecutionEngineException thrown but not necessarily).

If you think you've found a platform bug that is causing crashes, we'd love to get our hands on a simplified repro app with sources. If the bug does not reproduce reliably, we'd appreciate crash dumps.

We will fix most of these issues as and when they're reported.

Crashes due to an in-built security mitigation
When Silverlight detects an Access Violation (aka segfault) or a Stack Overflow, or a Buffer Overflow, the runtime's built-in failfast routines kick in. These are deliberately designed to disallow app code to catch/suppress/rethrow these exceptions, and instead prevent any exploit from taking effect. Doing that would be folly... therein lie known attack vectors. Even in the most benign situations Silverlight assumes past performance is an indicator of future performance (no, we wouldn't be good investors with that mindset!) and finds its extreme action justifiable.

We will not "fix" issues in this category.

Crashes due to reentrancy
Silverlight does not deal well with reentrant behavior. Simply put, if Silverlight calls app code in a synchronous callout, it expects the call to be blocking, and it does not expect to be called back via that or another avenue until the blocking call exits. But when it finds itself in this situation, Silverlight will teardown with extreme prejudice, bringing down app, runtime and host (browser) together with a Null AccessViolation. A Null AV is different from a regular AV.

Here's how to tell if this is happening to you. The simple way is to look for "Null AccessViolation" in your dump. A more contrived way requires symbols and involves looking at the callstack:

KernelBase.dll!DebugBreak()  Line 81             C
npctrl.dll!CWindowsServices::DebugBreak() Line 4270 + 0x8 bytes C++
npctrl.dll!DisplayDebugDialog(unsigned int uClass=1, unsigned short * pFileName=0x5ef4ef78, int iLine=64, int iValue=0, unsigned short * pTestString=0x5ef36dec, unsigned short * pMessage=0x037d9a0c) Line 926 C++
npctrl.dll!XcpVTrace(unsigned int uClass=1, unsigned short * pFileName=0x5ef4ef78, int iLine=64, int iValue=0, unsigned short * pTestString=0x5ef36dec, unsigned short * pMessageString=0x5ef31e78, void * pVArgs=0x037d9c00) Line 1036 C++
npctrl.dll!CWindowsServices::XcpTrace(unsigned int uClass=1, unsigned short * pFilename=0x5ef4ef78, int iLine=64, int iValue=0, unsigned short * pTestString=0x5ef36dec, unsigned short * pMessageString=0x5ef31e78, ...) Line 7258 C++
npctrl.dll!CReentrancyGuard::CheckReentrancy(int bNullAvOnReentrancy=1) Line 64 + 0x36 bytes C++
npctrl.dll!CXcpDispatcher::OnReentrancyProtectedWindowMessage(HWND__ * hwnd=0x004b039a, unsigned int msg=1026, unsigned int wParam=0, long lParam=0) Line 900 + 0xa bytes C++
npctrl.dll!CXcpDispatcher::WindowProc(HWND__ * hwnd=0x004b039a, unsigned int msg=1026, unsigned int wParam=0, long lParam=0) Line 807 + 0x18 bytes C++
user32.dll!_InternalCallWinProc@20() Line 106 Asm
user32.dll!UserCallWinProcCheckWow(_ACTIVATION_CONTEXT * pActCtx=0x00000000, long (HWND__ *, unsigned int, unsigned int, long)* pfn=0x5f10bc90, HWND__ * hwnd=0x004b039a, _WM_VALUE msg=1026, unsigned int wParam=0, long lParam=0, void * pww=0x00e267f8, int fEnableLiteHooks=1) Line 163 + 0x12 bytes C
user32.dll!DispatchMessageWorker(tagMSG * pmsg=0x5f10bc90, int fAnsi=0) Line 2591 + 0x1e bytes C
user32.dll!DispatchMessageW(const tagMSG * lpMsg=0x037d9dec) Line 999 C
user32.dll!DialogBox2(HWND__ * hwnd=0x002902ac, HWND__ * hwndOwner=0x000b0924, int fDisabled=0, int fOwnerIsActiveWindow=0) Line 1150 C
user32.dll!InternalDialogBox(void * hModule=0x760b0000, DLGTEMPLATE * lpdt=0x04923388, HWND__ * hwndOwner=0x000708fe, int (HWND__ *, unsigned int, unsigned int, long)* pfnDialog=0x7611eec8, long lParam=58564704, unsigned int fSCDLGFlags=0) Line 1314 + 0x9 bytes C
user32.dll!SoftModalMessageBox(_MSGBOXDATA * lpmb=0x00000030) Line 1237 + 0x18 bytes C
user32.dll!MessageBoxWorker(_MSGBOXDATA * pMsgBoxParams=0x037da060) Line 791 C
user32.dll!MessageBoxIndirectW(const tagMSGBOXPARAMSW * lpmbp=0x037da0d4) Line 528 + 0xe bytes C
The presence of CReentrancyGuard::CheckReentrancy indicates the reentrancy guard; message 1026 indicates WM_INTERNAL_TICK the veritable sign of the apocalypse reentrancy; and the presence of a (non-Silverlight, in this case from NTUser) MessageBox at the bottom indicates an actor that pumps messages including Silverlight messages. Note that Silverlight's own MessageBox will not cause this reentrancy - it and the reentrancy guard were specifically designed to play well with each other. So I recommend using that any place you'd consider using an alert.

Now consider that SizeChanged and LayoutUpdated events are raised synchronously as part of a layout pass. (I've previously expressed my firm preference for all things async -- I'd have loved for these events to be async but lost that argument since it'd have broken WPF compat and eliminated a couple scenarios. But I digress.) Now if you were to use the DOM Bridge within a SizeChanged event handler to, say, do an alert(), that'd align all the stars. The alert would pump messages; Silverlight would be in a blocking call, not expecting any messages in its internal queue; its little "reentrancy guard" would see this as reentrant behavior, and initiate adverse action terminating with a Null AV.

Silverlight's behavior in this situation was intended to be moral equivalent of a shock collar. We wanted app authors to detect these situations at development time and fix their code, so it never shipped with reentrant behavior.

This condition is the most common cause for crashes in Silverlight, and is indicative of programming error in the app.

We're constantly evaluating whether this sort of positive reinforcement through operant conditioning actually works. If you have anecdotes or feedback, please drop a line. Bear in mind that we have done a poor job at advertising this behavior, and I do sincerely apologize. So the fact that it took you a good while longer than average to debug such an issue is unfortunate but explained away by that.

Where to look for crash dumps?
While developing apps on Windows, with Windows Error Reporting (Watson) you should see the path in the UI shown upon a crash. Alternately look under %Temp% for dirs with "WER" prefix, and look for a .mdmp file prefixed by process name. If this is an out of browser app, you should see "sllauncher.exe.mdmp" instead of one for the browser's executable.

Further reading

With the context from this post, I hope you will see the MSDN doc on Silverlight Error Handling in new light. Some of that content originated from my original spec on the topic written circa Silverlight 2. It will help you understand control flow. In addition, this topic page will help you understand how best to handle exceptions in your own Silverlight app code.

A word on diagnostics

If we could rewrite the past and change something, we'd have tweaked our initial conditions to one that enforced error reporting best practices in our internal development of the runtime. This would have eliminated having to perform triage and surgery on a rapidly evolving runtime as we added better diagnostics. There are still too many E_FAILs in our code that bubble up and result in the totally unhelpful "Unknown error" and the wildly melodramatic "Catastrophic error". There's a bit of dark humor in the backstory, but that's for another time.

Our work is not done though. We continue to seek feedback from app authors and fix our oversights. Keeping Silverlight easy to develop against is a giant priority for our team. Making sure you have great diagnostics and get past issues fast is one way we can make that happen.


Email this | Bookmark this


Post a Comment | Home | Inference: my personal blog