How is scale calculated for AR apps to preserve real-world size?

I have an ML model that analyzes the camera video stream and returns the screen coordinates of an object (something like MediaPipe Pose). I also have a real-world-sized 3D glTF model that I want to render on top of one of those coordinates and adjust on each frame. In addition, I know the canonical real-world distance between two of the coordinates returned by the model, and I can measure the screen distance between those same coordinates.

Given all of this, what is the best way to calculate the scale of the glTF model so that it looks natural among the other objects on camera?
I have tried many things, but I am struggling to convert the canonical real-world distance into a screen distance so that I can compute the proportion.
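To make the question concrete, here is a minimal sketch of the proportion I have in mind (all names are my own placeholders). It assumes a simple pinhole-style relationship: at the object's depth, real-world size maps to on-screen size by a single pixels-per-meter factor derived from the two tracked landmarks.

```python
def estimate_screen_size(screen_dist_px: float,
                         real_dist_m: float,
                         model_size_m: float) -> float:
    """Estimate how many pixels the model should span on screen.

    screen_dist_px: measured pixel distance between the two tracked landmarks
    real_dist_m:    known real-world distance between those landmarks
    model_size_m:   real-world size of the glTF model
    """
    # Pixels per meter at the object's (assumed shared) depth
    px_per_meter = screen_dist_px / real_dist_m
    return model_size_m * px_per_meter

# e.g. landmarks 200 px apart on screen and 0.5 m apart in reality:
# a 0.3 m model should then span roughly 200 / 0.5 * 0.3 = 120 px
print(estimate_screen_size(200.0, 0.5, 0.3))
```

Is something along these lines the right approach, and if so, how do I turn that pixel span into a scale factor for the rendered glTF model (which is authored in meters)?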
Thank you in advance!