Hannes,
Sorry to take so long to reply to this. It's a subtle one, and I wanted to check the figures first, which I'm afraid took a little while to get around to.
This behaviour is essentially a bug in the result-printing part of the simple host in the Vamp SDK, although the (legitimate) behaviour of the SDK code that leads to the bug is also poorly documented. Sonic Visualiser appears to have the correct output here. I will aim to get the bug fixed and the behaviour properly documented in the 1.4 SDK release.
The cause lies in the handling of frame timestamps for plugins that take frequency-domain input. Sonic Visualiser feeds these plugins frames starting from the frame that is centred on the first input audio sample, and that frame is timestamped at time zero.
The simple host, in contrast, begins with a frame that starts at the first input audio sample. This is technically legitimate, but not such a good thing to do, because samples earlier in the file than half the frame size are not properly represented. The simple host uses a PluginInputDomainAdapter to handle the conversion to the frequency domain if the plugin requests it; for the first frame, it feeds the adapter the time-domain samples starting at the first input audio sample, with a timestamp of zero. The adapter recognises that the timestamp for frequency-domain input should refer to the centre of the frame, and makes that adjustment. However, the host is not aware that the adjustment has happened, and so prints out the results with the unadjusted timestamps. That's the bug.
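To make the mismatch concrete, here is a minimal sketch in Python. It is only a model of the arithmetic described above, not SDK code: the function names, and the block and step sizes in the usage line, are hypothetical and chosen purely for illustration. Timestamps are counted in sample frames.

```python
def centre_timestamp(frame_start, block_size):
    """Timestamp (in sample frames) of the centre of a block
    that starts at frame_start."""
    return frame_start + block_size // 2


def simple_host_times(n_blocks, step_size, block_size):
    """Model of the simple host: for each block, return a pair
    (timestamp the host prints, timestamp the adapter passes on).

    Assumption: the first block starts at the first audio sample,
    and the adapter moves each timestamp to the centre of its block.
    """
    rows = []
    for i in range(n_blocks):
        start = i * step_size
        printed = start                                  # host is unaware of the adjustment
        adjusted = centre_timestamp(start, block_size)   # what the plugin is told
        rows.append((printed, adjusted))
    return rows


def centred_host_times(n_blocks, step_size, block_size):
    """Model of a host that centres its first block on the first
    audio sample: return (block start, timestamp) for each block."""
    rows = []
    for i in range(n_blocks):
        start = i * step_size - block_size // 2          # first block starts before sample 0
        rows.append((start, centre_timestamp(start, block_size)))
    return rows
```

With a hypothetical block size of 1024 and step of 512, simple_host_times(2, 512, 1024) gives [(0, 512), (512, 1024)]: each printed timestamp is half a block earlier than the adjusted one. centred_host_times(2, 512, 1024) gives [(-512, 0), (0, 512)], so the first block is timestamped at zero.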
The results shown in SV and those returned by the simple host will match for plugins that use time-domain input. They should also match for plugins with frequency-domain input whose outputs are timestamped explicitly by the plugin using "adjusted" timestamps, for example the Onsets output of the Simple Percussion Onset Detector example plugin. Where they differ (with SV being correct and the simple host wrong) is for plugins with frequency-domain input whose outputs are timestamped implicitly by the host, for example the chromagram. The discrepancy is particularly marked for the chromagram because of its long frame size; the difference of 0.74 seconds you mention is half of that frame size.
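As a rough check of that figure, the offset is just half the frame size divided by the sample rate. The values below are assumptions, not taken from your report: a 44100 Hz sample rate and a 65536-sample frame, picked only because they are consistent with a 0.74 second difference. Substitute the actual values for your file and plugin settings.

```python
def half_frame_offset_seconds(block_size, sample_rate):
    """Offset between a frame's start and its centre, in seconds."""
    return (block_size / 2) / sample_rate


# Assumed values (hypothetical): 65536-sample frame at 44100 Hz.
offset = half_frame_offset_seconds(65536, 44100)
print(round(offset, 2))  # 0.74
```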
Chris