Google’s John Mueller Discusses TF-IDF Algo by @martinibuster


Google’s John Mueller discussed the role of TF-IDF in Google’s algorithm. He explained what it is and offered a better way to optimize web pages for ranking.

What is TF-IDF?

Wikipedia has a concise definition of what TF-IDF is:

“…tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection… The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general.”

The key thing to focus on is that TF-IDF is a metric tied to the entire “collection” or “corpus.” That means every document in the set being analyzed, not only the pages that contain a specific word or phrase. In the case of web search, the score for a word on one page depends on how many of the other pages in the index also contain that word. It is a statistical measure, calculated from word counts.
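To make the corpus dependence concrete, here is a minimal sketch of the textbook calculation that the Wikipedia definition describes. It is an illustration only, not a description of how Google scores pages. The function name `tf_idf`, the three-sentence sample corpus, and the choice of a plain logarithm for the inverse document frequency are all assumptions made for the example; real systems use many weighting variants.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Score `term` for one document relative to the whole corpus.

    Hypothetical helper for illustration: `doc_tokens` is a list of words,
    `corpus` is a list of such lists.
    """
    if not doc_tokens:
        return 0.0
    # Term frequency: share of this document's words that are `term`.
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # Document frequency: how many documents in the corpus contain `term`.
    docs_with_term = sum(1 for tokens in corpus if term in tokens)
    if docs_with_term == 0:
        return 0.0
    # Inverse document frequency: shrinks as more documents contain the term.
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Toy corpus (made up for this example).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]

first_doc = corpus[0]
print(round(tf_idf("the", first_doc, corpus), 3))  # 0.0   (in every document)
print(round(tf_idf("cat", first_doc, corpus), 3))  # 0.068 (in two of three)
print(round(tf_idf("mat", first_doc, corpus), 3))  # 0.183 (in one of three)
```

In this toy corpus, “the” appears twice in the first document yet scores zero because every document contains it, while “mat” appears once and scores highest because no other document uses it. Changing the other documents changes the score even when the first document stays the same.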

That part about “some words appear more frequently in general” is about how TF-IDF is used to catch and remove commonly