I figured that if the camera has a DR of, let's say, 12 stops, then a 7-stop underexposed image should still capture plenty of tones to examine.
The results were not at all what I expected. The image was almost utterly black, with the whitest patch barely registering on my Eizo monitor (it reads about 0,0,5 in PSCS5, 16-bit file).
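For what it's worth, that reading is roughly what a standard gamma curve predicts. A minimal sketch, assuming the patch sits near middle gray (18%) at nominal exposure and a plain sRGB transfer curve (real in-camera tone curves differ, so treat the numbers as illustrative):

```python
# Why a deeply underexposed patch reads near black after standard
# rendering. Assumes an 18% mid-gray starting point and the sRGB
# transfer curve; actual camera tone curves will differ somewhat.

def srgb_encode(linear):
    """Map a linear-light value in [0, 1] to sRGB-encoded [0, 1]."""
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055

middle_gray = 0.18
for stops_under in range(8):
    lin = middle_gray / 2 ** stops_under
    print(f"{stops_under} stops under: {round(255 * srgb_encode(lin))}/255")
```

Seven stops under comes out around 5/255, which is essentially what you measured: the raw file may still hold usable data down there, but the default rendering buries it at the bottom of the scale.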
You tested the dynamic range of an entire system: camera, AND (standard) processing, AND display.
To see what the camera alone can achieve, you'll have to equalize things on the processing and display side, i.e. try to make both images look the same on the display. If you can do that, then you're within the camera's DR.
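Concretely, that means pushing the underexposed raw data back up before any tone curve is applied. A rough sketch, assuming you can export demosaiced linear data from your converter (e.g. dcraw's -4 linear 16-bit output); the `linear_img` array and the 7-stop figure are just placeholders:

```python
import numpy as np

def push(linear_img, stops):
    """Brighten linear raw data by `stops` EV so an underexposed frame
    can be compared to a normal one at the same display brightness."""
    return np.clip(linear_img * 2.0 ** stops, 0.0, 1.0)

# e.g. bring the 7-stop underexposed test frame back to nominal:
# equalized = push(linear_img, 7)
```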
Forgetting the ND, if I shoot my Colour Checker using a wide range of full shutter-speed steps [...] and then look at the sequence of images, I can see detail in about 12 or 13 different images.
That seems more correct; try to correct each image (brighten the underexposed ones, and vice versa) to see where there is still enough detail and tone. As said, textures can help you see when noise starts to drown image content on the underexposure side; on the other side, try to see when false or washed-out colors begin to show (note that the raw converter used still plays a role: some handle noise better than others, others handle overexposure better, etc.).
DR is the difference between these two limits.
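If you want a number to back up your eyes, the same judgment can be roughed out numerically: measure each patch's signal-to-noise ratio, call a step usable while the SNR stays above some threshold and the patch isn't clipped, and count the usable steps. A toy sketch with synthetic data (the read-noise level and the SNR > 3 criterion are assumptions; some published tests use SNR = 1, which yields bigger numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_snr(patch):
    """Signal-to-noise ratio of a uniform patch of linear data."""
    return patch.mean() / patch.std()

# Synthetic stand-in for the bracketed sequence: one gray patch per
# full-stop step, with a made-up constant read-noise floor.
read_noise = 0.001
frames = [np.clip(rng.normal(0.18 * 2.0 ** ev, read_noise, (50, 50)), 0, 1)
          for ev in range(-10, 4)]

# A step counts if the patch isn't clipped and noise doesn't drown it.
usable = [f.mean() < 0.98 and patch_snr(f) > 3.0 for f in frames]
print("usable stops:", sum(usable))  # a rough DR estimate
```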
I think that most discussions of practical DR involve shooting a RAW file of a scene that actually contains 9 or 10 stops of dynamic range. This could be something like shooting from an interior looking out a window at a sunlit area. You set the exposure to just avoid clipping the highlights, or even slightly overexpose them, but only to the point where they can be pulled back from the RAW file while keeping acceptable quality/detail. Then look down into the shadows (which will appear black) and attempt to use exposure compensation or a tone curve to bring detail up out of the darkness, and see how far down you can go.
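One way to picture the "bring detail up out of the darkness" step is a gain that is strongest at black and rolls off toward white, so the protected highlights aren't pushed into clipping. A hypothetical sketch (a simple monotonic curve of my own invention, not any particular converter's implementation):

```python
import numpy as np

def lift_shadows(linear_img, stops):
    """Hypothetical shadow push: roughly 2**stops gain near black,
    rolling off smoothly so highlights compress instead of clipping."""
    m = 2.0 ** stops
    return m * linear_img / (1.0 + (m - 1.0) * linear_img)

tones = np.array([0.001, 0.01, 0.18, 0.9])  # deep shadow .. highlight
print(lift_shadows(tones, 3))  # shadows gain ~3 EV; highlights barely move
```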
You won't get (exactly) the same results with that method, because having the highlights and shadows in the same scene means that the lens will produce some veiling glare, depending on the placement and quantity of the highlights (and on the lens and aperture used, of course). This has the double consequence of raising the shadows (remember pre-exposure in the darkroom?), making them easier to record, and of flattening their contrast, making them less legible.
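Toy numbers make the flattening visible. Suppose the glare adds a uniform 0.1% of the highlight level to everything (a made-up figure; real glare varies with the lens, its coatings, and the scene):

```python
# How a uniform veiling-glare floor compresses deep shadow steps.
glare = 0.001  # assumed: glare = 0.1% of the highlight level
for stops_down in (6, 8, 10, 12):
    true = 2.0 ** -stops_down
    seen = true + glare
    print(f"{stops_down} stops down: true {true:.6f}, with glare {seen:.6f}")
```

The 12-stop shadow is lifted well above its true level (easier to record), but the ratio between the 10- and 12-stop patches collapses from 4:1 to about 1.6:1 (less legible), which is exactly the double consequence above.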
On one hand, it's much more of a real-world experiment, because in practice DR always matters within a single scene, and I would think that with real-world lenses veiling glare can often be the limiting factor.
On the other hand, it throws many more variables into the equation and makes results much less repeatable.