One piece of advice you've surely come across while developing a production-grade machine learning model is:
Make sure that the training and production datasets have the same distribution
Now what does that actually mean? Can you come up with the simplest example of training and production datasets that have different distributions? What exactly do we mean by distribution?
Imagine studying for an exam
You know beforehand that there will be 10 questions on the paper. There are 5 chapters in total, and there will be 2 questions from each chapter. Well, at least that's what you observed in the large number of mock tests that you took while preparing for the finals.
Not having the time to learn all 5 chapters, you decided to skip a difficult chapter. You figured that you'd still score 80%.
However, to your surprise, the finals actually had 6 questions from the chapter that you skipped. To your utter disappointment, you could score only 40%.
In this case, the mock tests and the finals had "different distributions."
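The arithmetic behind the exam story can be sketched in a few lines. The chapter names and the split of the remaining 4 final-exam questions across the studied chapters are assumptions made for illustration; the story only fixes the 6 questions from the skipped chapter.

```python
# Questions per chapter. The mock-test split comes from the story; the exact
# split of the final across chapters 1-4 is a hypothetical choice.
mock_questions = {"ch1": 2, "ch2": 2, "ch3": 2, "ch4": 2, "ch5": 2}
final_questions = {"ch1": 1, "ch2": 1, "ch3": 1, "ch4": 1, "ch5": 6}

# Chapter 5 is the one that was skipped.
studied = {"ch1", "ch2", "ch3", "ch4"}


def expected_score(questions, studied_chapters):
    """Fraction of questions that come from chapters that were studied."""
    total = sum(questions.values())
    answerable = sum(n for ch, n in questions.items() if ch in studied_chapters)
    return answerable / total


mock_score = expected_score(mock_questions, studied)
final_score = expected_score(final_questions, studied)
print(f"mock: {mock_score:.0%}, final: {final_score:.0%}")
```

Your preparation is identical in both cases. Only the mix of questions changes, and that alone moves the score from 80% to 40%.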
How would the above example look in a real-world dataset?
Imagine that you have a training set and a test set. You get 95% accuracy on the training set (which you also split for cross-validation). A 95% accuracy means that the model gets 5% of the training examples wrong.
Trying to get that last 5% right led to overfitting and lower cross-validation accuracy, so you settled for 95%.
However, it is important to investigate the nature of the 5% that you're getting wrong. That 5% represents data points that the machine learning algorithm deemed acceptable to get wrong because it gets so many other things right.
What if your entire test set consisted of data that looked like the 5% your algorithm got wrong in the training set? Then your test set accuracy would be close to 0%.
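Here is a toy sketch of that scenario. Everything in it is hypothetical: the "model" is a fixed threshold rule standing in for a trained classifier, and the data is constructed by hand so that 5 of the 100 training examples belong to a group the rule gets wrong.

```python
def predict(x):
    """Hypothetical 'trained model': a simple threshold rule."""
    return 1 if x > 0 else 0


def accuracy(examples):
    """Fraction of (x, label) pairs that the rule predicts correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)


# 95 typical examples whose labels follow the rule exactly.
typical = [(x, 1 if x > 0 else 0) for x in range(-47, 49) if x != 0]

# 5 atypical examples: large positive x, but the true label is 0.
atypical = [(x, 0) for x in range(100, 105)]

training_set = typical + atypical

# A test set drawn only from the atypical group.
test_set = [(x, 0) for x in range(105, 125)]

train_acc = accuracy(training_set)
test_acc = accuracy(test_set)
print(f"train: {train_acc:.0%}, test: {test_acc:.0%}")
```

The rule itself is unchanged between the two evaluations. The drop comes entirely from the test set being made up of the kind of example that was rare during training.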
Now you have a simple conceptual grasp of what it means for the training and test sets to have different distributions.
This should also give you an incentive to investigate which examples your algorithm is getting wrong, and whether those examples change drastically each time you make an optimization.
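One way to do that investigation is to compare the sets of misclassified examples before and after a change. The sketch below is a minimal illustration; the function names and the label and prediction lists are made up for the example.

```python
def error_indices(y_true, y_pred):
    """Indices of the examples where the prediction differs from the label."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p}


def compare_errors(y_true, pred_before, pred_after):
    """Split errors into those fixed, newly introduced, and unchanged."""
    before = error_indices(y_true, pred_before)
    after = error_indices(y_true, pred_after)
    return {
        "fixed": sorted(before - after),
        "new": sorted(after - before),
        "persistent": sorted(before & after),
    }


# Hypothetical labels and predictions from two versions of a model.
y_true      = [1, 0, 1, 1, 0, 0, 1, 0]
pred_before = [1, 0, 0, 1, 0, 1, 1, 0]  # wrong at indices 2 and 5
pred_after  = [1, 0, 1, 1, 0, 1, 0, 0]  # wrong at indices 5 and 6

report = compare_errors(y_true, pred_before, pred_after)
print(report)
```

Both versions get 6 out of 8 right, so the accuracy number alone would suggest nothing changed. The report shows that one error was fixed and a different one was introduced.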
This is one of the reasons it's advisable to keep a gold dataset that you always test against, in addition to your usual training/validation/test sets.
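A gold dataset check can be as simple as scoring every candidate model against the same fixed list of examples. The sketch below assumes a tiny hand-made gold set and two hypothetical model versions written as plain functions; in practice the gold set would be curated from real data and kept unchanged between experiments.

```python
# Hypothetical gold set: fixed (input, label) pairs that never change.
GOLD_SET = [(-3, 0), (-1, 0), (2, 1), (5, 1), (100, 0)]


def gold_accuracy(predict, gold_set=GOLD_SET):
    """Accuracy of a prediction function on the fixed gold set."""
    return sum(predict(x) == y for x, y in gold_set) / len(gold_set)


def model_v1(x):
    """Hypothetical first model: positive inputs are class 1."""
    return 1 if x > 0 else 0


def model_v2(x):
    """Hypothetical revised model: very large inputs are class 0."""
    return 1 if 0 < x < 50 else 0


v1_score = gold_accuracy(model_v1)
v2_score = gold_accuracy(model_v2)
print(f"v1: {v1_score:.0%}, v2: {v2_score:.0%}")
```

Because the gold set stays the same across experiments, a change in these scores reflects a change in the model rather than a change in the data it was evaluated on.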