Your Multimodal Speech Model Says I Have a Face… · DeepSignal