I’m trying to pretrain a Wav2Vec2 model based on the example given here.

I was initially getting a contrastive loss like the graph on the left, which seemed very slow to improve, so I increased the learning rate and got the graph on the right after only a few steps.

I’m not familiar with the nuts and bolts of contrastive loss, but this came as a bit of a surprise, and I was wondering if anyone could help me understand it.

The batch size (with accumulation) is 32, the number of epochs is 20, and the number of warmup steps is 1200 for both attempts.
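
For reference, here is a minimal sketch of how the learning rate would ramp up over the 1200 warmup steps, assuming a linear warmup from zero (the scheduler the example actually uses may differ). The function name `linear_warmup_lr` and the two peak learning rates are hypothetical placeholders, since the real values are not given here.

```python
def linear_warmup_lr(step: int, peak_lr: float, warmup_steps: int = 1200) -> float:
    """Learning rate at `step`, assuming a linear ramp from 0 to `peak_lr`."""
    if step >= warmup_steps:
        # Any decay after warmup is deliberately not modelled in this sketch.
        return peak_lr
    return peak_lr * step / warmup_steps


if __name__ == "__main__":
    # Hypothetical peak learning rates; the actual ones are not stated in the post.
    for peak_lr in (1e-4, 1e-3):
        for step in (10, 100, 1200):
            lr = linear_warmup_lr(step, peak_lr)
            print(f"peak={peak_lr:g} step={step:4d} lr={lr:.2e}")
```

Under this assumption, the learning rate in the first few steps is only a small fraction of the peak in both runs, but it scales directly with the peak value chosen.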