Each 6MP "frame" has a rectangular array of photosites, spaced 2X apart in both dimensions, where X is the nominal (full resolution) pixel spacing. The frames are then interleaved, after shifting one frame relative to the other by 1X in both dimensions, so that the photosites of one array fit into the gaps between photosites in the other array. The final full resolution sensor has 12MP in a non-rectangular array, sort of like a brick wall with staggered rows of square bricks. Does that make sense? It is similar to super-resolution, like what you would get if you had a single 6MP sensor and combined two images, shifting the sensor by 1/2 pixel in both dimensions between images. Except it is not really super-resolution, because there really are 12 million physical photosites.
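To make the geometry concrete, here is a minimal sketch (my own illustration, not anything from the manufacturer) that generates photosite coordinates for two tiny frames laid out this way. The grid size and pitch value are arbitrary; the point is just that frame B sits at a (1X, 1X) offset inside frame A's gaps, giving the staggered "brick wall" pattern.

```python
import numpy as np

# Nominal full-resolution pixel pitch X (arbitrary units) and a toy grid size
# standing in for a real 6MP frame.
X = 1.0
rows, cols = 4, 6

# Frame A: rectangular grid with 2X spacing in both dimensions.
frame_a = np.array([(2 * X * r, 2 * X * c)
                    for r in range(rows) for c in range(cols)])

# Frame B: same grid, shifted by 1X in both dimensions so its photosites
# land in the gaps between frame A's photosites.
frame_b = frame_a + X

# Interleaved sensor: twice as many photosites, in a staggered (non-rectangular)
# arrangement rather than a denser rectangular grid.
combined = np.vstack([frame_a, frame_b])
print(combined.shape)   # (48, 2) here; (12M, 2) for the real sensor
```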
The HDR aspect requires operating the two frames at very different sensitivities, apparently by electronically controlling the exposure time. Dark areas in the final image are constructed from pixels in the frame that had the full exposure time (for low noise), while bright areas are constructed from pixels in the frame that had a reduced exposure time (to prevent saturation). Using the pixels this way of course prevents using them for the full resolution mode, so you get a 6MP HDR image.
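Here is a rough sketch of that kind of combine, again just my own illustration under stated assumptions: a long-exposure frame and a short-exposure frame of the same scene, with the short one exposed for 1/ratio as long. The function and variable names (combine_hdr, ratio, sat_level) are hypothetical, not the camera's actual processing.

```python
import numpy as np

def combine_hdr(long_frame, short_frame, ratio=8.0, sat_level=0.95):
    """Merge a long and a short exposure of the same scene (values in 0..1)."""
    # Bright areas: the long exposure clips, so take the short-exposure pixels
    # and scale them up by the exposure ratio to put them on the same scale.
    # Dark areas: keep the long-exposure pixels, which have less noise.
    saturated = long_frame >= sat_level
    return np.where(saturated, short_frame * ratio, long_frame)

# Toy example: a brightness gradient that exceeds the sensor's range.
scene = np.linspace(0.0, 4.0, 10)                 # "true" radiance
long_frame = np.clip(scene, 0.0, 1.0)             # full exposure: highlights clip
short_frame = np.clip(scene / 8.0, 0.0, 1.0)      # 1/8 exposure: highlights survive
print(combine_hdr(long_frame, short_frame))       # recovers the gradient past 1.0
```

Since each output pixel draws on a pair of photosites (one from each frame), the result has half the pixel count of the full-resolution mode, which is why the HDR image comes out at 6MP.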