Last Update: 2007-03-22
Ohohlfeld.com Banner


Zipf Distributions

21st March 2007

This draft aims to provide some informations about Zipf's Law and describes two easy experiments. It's a collection of different informations. It's not intended as an full introduction or overview article.

Zipf's law is a power law Pn ~ 1/na, whereas a is supposed to be close to 1. It's named after the American linguist George Kingsley Zipf, "who studied statistical occurrences in different languages" [11]. Pn denotes the frequency of a word ranked nth [11].

Zipf distributions often appear when human choice is involved, just for instance how many links one puts on a web page, which words to put into a text or the selection of the place of residence. A good introduction can be found in [7] and [9]. Here are some informations published at Wikipedia:

Originally, Zipf's law stated that, in a corpus of natural language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table. So, the most frequent word will occur approximately twice as often as the second most frequent word, which occurs twice as often as the fourth most frequent word, etc. The term has come to be used to refer to any of a family of related power law probability distributions.
[..] The "law" was publicized by Harvard linguist George Kingsley Zipf Zipf's law is thus an empirical law, not a theoretical one. Zipfian distributions are commonly observed, in many kinds of phenomena. The causes of Zipfian distributions in real life are a matter of some controversy, however. The fact that Zipfian distributions arise in randomly-generated texts with no linguistic structure suggests that the law as applied to languages may in part be a statistical artifact. (See [10])
Source: Wikipedia: Zipf's Law: [9]

Zipf Distribution in Languages

Plot showing the Zipf distribution of the words taken from the book Faust in three languages

A Zipf distribution can be found in languages (as studied by George Kingsley Zipf), namely in word and (pairs of) letter frequencies. The latter effect is used by entropy encoding. Frequent symbols will get a short code, whereas rare symbols get longer code words. A well known example is morse code: the letter e is frequently used in the English language and therefore has a very short code word (It's coded with a dot, a short mark - it's the shortest possible representation). Moreover, the most frequent words, or the Zipf coefficient can be used for language recognition as explained later.

Goethe's Faust is used for this experiment, as it can be retrieved freely from Project Gutenberg in various languages. The plot above shows the word frequencies of the different texts over the word rank. One can see that a small fraction of words is used very frequently, whereas the bigger part contains words which are less often used - in fact, most of the words are used only once! Furthermore, the word frequencies of the three texts follow nearly the same distribution.

The most frequent words in the English version of Faust are: the, and, to, a, of, in, with, my, is, me, ... The most frequent words in the German version are: und, ich, die, der, nicht, das, ein, ist, zu, ... This list contains particles, prepositions, personal pronouns, possessive pronouns and other words which are not related to the content of the analysed text. Therefore, word frequencies can be used for a rough estimation of the text language (e.g. by comparing coefficients [0]), but not for text classification (e.g. this is a news report dealing with computer science).

A frequency list for the English language can be found here. This list contains the most common words in TV and movie scripts as well as in books from project Gutenberg.

The above plot shows the frequency of pairs of two letters (2-gram). As one can see, even pairs of letters follow a Zipf distribution. The most frequent 2-letter-pairs in the English version of Faust I are the following (ordered by their respective frequency): th, he, in, re, st, er, an, ou, nd, es, ar, is, to, en, on, ea, ...

Zipf Distribution in Hypertexts - The Web

Zipf distributions can be found in the structure of hypertext documents (such as HTML) as well. This can be demonstrated by writing a simple web crawler that counts the number of outgoing links on each page as well as the hosts and URLs it has seen together with their respective frequency (e.g. how often a certain host such as foo.org occurred during the crawling process).

URL frequency plot

Host frequency plot

The two plots above show the host and URL frequency over their rank. The data was captured using a breadth-first search. (Due to internal links, a depth-first search run can take longer to converge to a Zipf distribution.) One can see that the popularity of hosts or hypertext documents follows a Zipf distribution. In other words, a small number of URLs or hosts is more frequently used that others. Just for instance, the hosts google.com or yahoo.com are (much) more often used than ohohlfeld.com.

Number of outgoing links plot

The above plot is taken from the same crawl run and shows the number of outgoing links per visited web page. The data was filtered, links to images (img tags) or to already known URLs were excluded. The plot shows that the number of outgoing links follows the Zipf distribution as well (see also [6] p. 4). In other words, only a small fraction of web pages have a large number of outgoing links.

Pitkow summarized that the requested file popularity as well as the number of requests to a server follow a Zipf distribution [5].

More Examples

Zipf distributions can be found in many more place, e.g. the last names of people (see [7] p. 9).

Popularity of Products

Another popular example is The Long Tail, as coined by Chris Anderson in 2004. Similar to the popularity of web pages, the popularity of goods follows a similar distribution. This can be found in the book, movie or music marked, just for instance. A small number of novels has a much higher sales volume than others. This enables new business ideas: "Anderson argued that products that are in low demand or have low sales volume can collectively make up a market share that rivals or exceeds the relatively few current bestsellers and blockbusters, if the store or distribution channel is large enough." (Source: Wikipedia: The Long Tail ). The Internet can simplify the distribution of Long Tail products. For example, the digital distribution of music reduces storage costs amongst others and makes it possible to sell a huge number of less frequently bought tracks.

Size of Cities

Furthermore, also the number of inhabitants follows a Zipf distributions (see [1]). There is a small number of very big cities but a large number of smaller cities or villages.

Zipf Distribution in Bibliography

Poosala (see [7] p. 9) gives examples for Zipf distributions in bibliography:
  • Number of publications of authors
  • Books by number of pages
  • Citation Frequency of an article

Implications for P2P-Systems

Modern peer-to-peer Systems are build upon a Distributed Hash Table (DHT), which is - as a normal hash table - a key to value mapping. However, this has some disadvantages when the frequency of keys is Zipf distributed. A group of people is working on a open-source P2P search engine called Yacy. Yacy is using a Chord like DHT. Informations are stored basically by mapping tokens (extracted words from a web page) to URLs. The hashed token will be the key, key := h(token). Every peer node is responsible for a certain interval of the hash space. As some words occur more often than others, some nodes will have to hold more values (URLs) than others. Examples for popular words would be www or Linux. When the index is not shared in such a way that multiple nodes are responsible for such a key, a single node has to handle a high load.

A P2P network topology that adapts better to Zipf distributed values is the BubbleStorm topology. BubbleStorm is a random graph and solves the Rendezvous Problem (a query wants to find a match) by propagation of messages (bubbles) till intersection. The propagation is implemented by Flooding. As Flooding can be very inefficient (cf. the problems of the Gnutella network) and is restricted by more modern networks (cf. super nodes in the Fasttrack network), the size of the bubbles will be dynamically adapted to the actual network size.

Further reading

© 2001-2009 by Oliver Hohlfeld, M.Sc. | Imprint empty
Name
E-Mail Address
Subject
Message
visit my project