I think it's easier than you'd expect if you render the sounds in a second pass.
For each sound you would do something like this.
Say you want a sound to start 10 seconds into the video.
First you render the normal video to a temp file. Then you add the sound with an offset:
ffmpeg -y -i [tempvideofile] -itsoffset 00:00:10 -i /path/to/sound/file.ogg -async 1 final.mp4
That is the concept for a single audio file. For many sounds, though, we don't want to create a new stream for every sound. So we could use sox (a command-line program on Linux; we may need another way for the final mod, but for this showcase it's enough).
First we delay every sound:
sox sound1.ogg /tmp/replay/delayed_001.ogg pad 10
sox sound2.ogg /tmp/replay/delayed_002.ogg pad 12
sox sound3.ogg /tmp/replay/delayed_003.ogg pad 15
and so on.
Then we mix them together:
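If you have a list of sounds with their start times, the delay step could be scripted roughly like this. It is only a sketch: the function name delay_sounds and the "offset file" input format are made up for illustration, and it assumes the output directory already exists.

```shell
#!/bin/sh
# delay_sounds OUTDIR (hypothetical helper, not part of sox)
# Reads lines of the form "<offset-seconds> <sound-file>" from stdin and
# writes one padded copy per line to OUTDIR/delayed_001.ogg, delayed_002.ogg, ...
delay_sounds() {
    outdir=$1
    n=0
    while read -r offset src; do
        [ -n "$src" ] || continue      # skip blank or incomplete lines
        n=$((n + 1))
        out=$(printf '%s/delayed_%03d.ogg' "$outdir" "$n")
        # "pad N" puts N seconds of silence in front of the sound
        sox "$src" "$out" pad "$offset" || return 1
    done
}

# Usage (assumes /tmp/replay exists):
# printf '%s\n' '10 sound1.ogg' '12 sound2.ogg' '15 sound3.ogg' | delay_sounds /tmp/replay
```

With the three example lines this runs the same three sox commands as above.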
sox -m /tmp/replay/delayed_001.ogg /tmp/replay/delayed_002.ogg /tmp/replay/delayed_003.ogg /tmp/replay/sound.ogg
(This may need to be split into several sox calls when there are too many sounds.)
Finally, merge it with the old video file:
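The mix step can also be wrapped so you don't have to type every file name. This is a rough sketch, not a finished solution: mix_delayed is a made-up helper name, it assumes the delayed_*.ogg naming from above, and it does not do the batching for very large numbers of sounds. As far as I know sox -m wants at least two input files, so a single sound is just copied.

```shell
#!/bin/sh
# mix_delayed DIR OUT (hypothetical helper)
# Mixes every DIR/delayed_*.ogg into the single file OUT.
mix_delayed() {
    dir=$1
    out=$2
    set -- "$dir"/delayed_*.ogg        # the glob expands in sorted order
    if [ ! -e "$1" ]; then
        echo "no delayed sounds found in $dir" >&2
        return 1
    fi
    if [ "$#" -eq 1 ]; then
        cp "$1" "$out"                 # only one sound, nothing to mix
    else
        sox -m "$@" "$out"             # mix all delayed sounds into one file
    fi
}

# Usage:
# mix_delayed /tmp/replay /tmp/replay/sound.ogg
```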
ffmpeg -y -i [tempvideofile] -i /tmp/replay/sound.ogg final.mp4
I hope this helps.
Sincerely, MrBesen