One of the bitter lessons in computer vision is the false assumption that solving all the intermediate representations—segmentation, 3D meshes, depth, etc.—is needed to make a computer understand the scene.
In the end, it is the final policy that takes action that matters.