
Web Languages Hit Parade

June 1997

The Babel team (a joint initiative from Alis Technologies and the Internet Society) announces the first major study of the actual distribution of languages on the Internet.

To what extent is the Internet, and more precisely the Web, dominated by the English language? How does this perception translate into real figures? Which other languages occupy a significant share of the Web?

Until now, no major study of language distribution had been conducted. That is no longer the case, and the study will be updated twice a year.

Methodology

To ensure, as much as possible, the validity of the results, the Babel team has developed a rigorous method for exploring the Web.

Finding the Machines
The process starts with a random survey of the Internet. A random number generator produces numbers, each of which is treated as an IP address, and a program using the ICMP protocol (ping) determines whether a machine exists at that address. From a sample of more than 30 million potential addresses, we uncovered close to 60,000 machines.
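The sampling step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Babel team's program: the function names, the seeded `random.Random` and the `is_alive` callback are all hypothetical. The callback stands in for the ICMP echo (ping) probe, which normally needs raw-socket privileges and is therefore left to the caller.

```python
import ipaddress
import random
from typing import Callable, Iterator, List


def random_ipv4_addresses(count: int, rng: random.Random) -> Iterator[ipaddress.IPv4Address]:
    """Yield `count` addresses drawn uniformly from the 32-bit IPv4 space."""
    for _ in range(count):
        # Every 32-bit number is treated as a candidate IP address.
        yield ipaddress.IPv4Address(rng.getrandbits(32))


def survey(
    count: int,
    is_alive: Callable[[ipaddress.IPv4Address], bool],
    rng: random.Random,
) -> List[ipaddress.IPv4Address]:
    """Return the sampled addresses for which the probe reports a machine.

    `is_alive` is a hypothetical stand-in for an ICMP echo probe.
    """
    return [addr for addr in random_ipv4_addresses(count, rng) if is_alive(addr)]


if __name__ == "__main__":
    # Stub probe for demonstration only: it does not touch the network.
    def fake_probe(addr: ipaddress.IPv4Address) -> bool:
        return int(addr) % 500 == 0

    found = survey(10_000, fake_probe, random.Random(1997))
    print(f"{len(found)} of 10000 sampled addresses answered the stub probe")
```

Passing the probe as a parameter keeps the sampling logic testable without network access; a real survey would substitute a function that sends an echo request and waits for the reply.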

NOTE: from the total number of available addresses, the number of addresses surveyed, and the total number of machines found, we can estimate the number of accessible machines on the Internet today to be approximately 7,166,000. This figure obviously excludes the many machines hidden behind firewalls, which do not respond to a ping; most of these are probably not Web servers visible to the general public.
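The extrapolation in this note is a simple proportion, sketched below. The sketch assumes that the "total number of available addresses" is the full 32-bit IPv4 space (2^32); the source does not state the value it used, and the function and constant names are hypothetical.

```python
IPV4_ADDRESS_SPACE = 2 ** 32  # assumption: the whole 32-bit address space


def extrapolate(found: int, sampled: int, population: int = IPV4_ADDRESS_SPACE) -> float:
    """Scale a count observed in a random sample up to the whole population.

    found      -- machines (or servers) detected in the sample
    sampled    -- number of random addresses that were probed
    population -- total number of addresses the sample was drawn from
    """
    if sampled <= 0:
        raise ValueError("sampled must be positive")
    return found * population / sampled


if __name__ == "__main__":
    # Rounded figures from the text, so the result is only indicative.
    print(f"{extrapolate(60_000, 30_000_000):,.0f} machines (rough estimate)")
```

With the rounded figures quoted in the text (30 million addresses tried, 60,000 machines found) the function returns about 8.6 million, the same order of magnitude as the published 7,166,000. The gap is expected, because the exact sample sizes are not given.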
Finding the Servers
The next step consists of finding Web servers; not all machines are servers! A second program takes each machine from the list, and determines if it hosts an HTTP server. More than 8,000 machines responded positively, and it is on these machines that the last step of the process, the linguistic analysis, took place.
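A minimal sketch of such a check, assuming that "hosts an HTTP server" means "answers an HTTP request on TCP port 80 within a timeout". The port, the timeout value, the `HEAD` request and the function name are assumptions made for illustration; the source does not describe how its second program works.

```python
import socket


def has_http_server(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if `host` answers a minimal HTTP request on `port`.

    One connection is attempted; a refused connection, a timeout, or a
    reply that does not look like HTTP all count as "no server".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            reply = sock.recv(16)  # enough to see the start of the status line
    except OSError:  # covers refused connections and timeouts
        return False
    return reply.startswith(b"HTTP/")


if __name__ == "__main__":
    # 127.0.0.1 is used only so the example does not probe a third party.
    print(has_http_server("127.0.0.1", timeout=1.0))
```

The single attempt with a fixed timeout mirrors the trade-off discussed in the error analysis: a longer timeout misses fewer slow servers but lengthens the survey.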

NOTE: from the total number of available addresses, the number of addresses surveyed, and the total number of servers found, we can estimate the number of accessible Web servers on the Internet today to be approximately 1,007,000.
Analyzing the Pages
The linguistic analysis program retrieves pages published by each server (as a first step, we limited ourselves to home pages, without exploring the rest of the site) and removes the HTML tagging. If more than 500 characters remain, the analysis program submits the text to automatic language identification software. Using the most advanced language recognition, Internet and cryptanalysis techniques, this software identifies the document's language and character set. The detection software, SILC, can recognize 17 of the world's most used languages (see the list of languages for details) in a wide variety of encodings (character sets).
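The tag-stripping and 500-character filter can be sketched as follows, using Python's standard `html.parser` module. This is an illustration, not the Babel team's preprocessing: skipping the contents of `script` and `style` elements and collapsing whitespace are assumptions added here, and the names are hypothetical. The language identification itself (SILC) is not reproduced.

```python
from html.parser import HTMLParser
from typing import List, Optional

MIN_TEXT_CHARS = 500  # threshold stated in the text: more than 500 characters


class _TextExtractor(HTMLParser):
    """Collect character data while ignoring tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Assumption: runs of whitespace are collapsed before counting.
        return " ".join(" ".join(self._chunks).split())


def text_for_analysis(html: str) -> Optional[str]:
    """Return the page text if more than 500 characters remain, else None."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = parser.text()
    return text if len(text) > MIN_TEXT_CHARS else None
```

A page for which `text_for_analysis` returns `None` would simply be left out of the language count, which is how the 3,239 analysed home pages come to be fewer than the servers found.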
Verifying
This last step (very boring for the Babel team) consisted of manually visiting a sample of the pages with a browser and comparing the automatic detection with a visual identification. Close to 200 pages have been verified so far, which confirms the general reliability of the detection software and the process but also reveals a few flaws. Correction factors have been estimated from these (weak) statistics and are taken into account in the next-to-last column of the table. An analysis of error sources can also be found in an appendix.

Preliminary Hit Parade (Limited to Home Pages)

Of the 3,239 home pages containing more than 500 characters, the most frequently found languages on the Web are shown in the table below, in decreasing order. The last column gives the estimated number of significant Web servers (more than 500 characters of text) in each language, extrapolated to the total number of IP addresses from the number of addresses surveyed and the number of servers found in that language.

Rank  Language                      Number of pages  Percentage  Corrected percentage  Estimated number of servers
1     English                                 2,722       84.0%                 82.3%                      332,778
2     German                                    147        4.5%                  4.0%                       17,971
3     Japanese                                  101        3.1%                  1.6%                       12,348
4     French                                     59        1.8%                  1.5%                        7,213
5     Spanish                                    38        1.2%                  1.1%                        4,646
6     Swedish                                    35        1.1%                  0.6%                        4,279
7     Italian                                    31        1.0%                  0.8%                        3,790
8     Portuguese                                 21        0.7%                  0.7%                        2,567
9     Dutch                                      20        0.6%                  0.4%                        2,445
10    Norwegian                                  19        0.6%                  0.3%                        2,323
11    Finnish                                    14        0.4%                  0.3%                        1,712
12    Czech                                      11        0.3%                  0.3%                        1,345
13    Danish                                      9        0.3%                  0.3%                        1,100
14    Russian                                     8        0.3%                  0.1%                          978
15    Malay                                       4        0.1%                  0.1%                          489
      none or unknown (correction)                                               5.6%
      Total                                   3,239        100%                  100%                      395,984
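The uncorrected percentage and the estimated number of servers can be recomputed from the page counts, as in the sketch below. It assumes that each estimate is the page count multiplied by one common scale factor, taken here from the Total row (395,984 servers for 3,239 pages, about 122 servers per sampled home page). The function and variable names are hypothetical. The corrected percentages are not reproduced, because the correction factors themselves are not published here, and a recomputed percentage may differ from the published one in the last digit.

```python
from typing import Dict, Tuple

# Home pages per language, copied from the table.
PAGE_COUNTS: Dict[str, int] = {
    "English": 2722, "German": 147, "Japanese": 101, "French": 59,
    "Spanish": 38, "Swedish": 35, "Italian": 31, "Portuguese": 21,
    "Dutch": 20, "Norwegian": 19, "Finnish": 14, "Czech": 11,
    "Danish": 9, "Russian": 8, "Malay": 4,
}
TOTAL_ESTIMATED_SERVERS = 395_984  # "Total" row of the table


def shares_and_estimates(
    page_counts: Dict[str, int], total_servers: int
) -> Dict[str, Tuple[float, int]]:
    """Map each language to (percentage of pages, estimated servers)."""
    total_pages = sum(page_counts.values())
    servers_per_page = total_servers / total_pages  # common scale factor
    return {
        language: (
            round(100 * pages / total_pages, 1),
            round(pages * servers_per_page),
        )
        for language, pages in page_counts.items()
    }


if __name__ == "__main__":
    rows = shares_and_estimates(PAGE_COUNTS, TOTAL_ESTIMATED_SERVERS)
    for language, (share, servers) in rows.items():
        print(f"{language:<12}{share:>6.1f}%{servers:>10,}")
```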

The complete list of the 3,239 pages visited, with the language assigned to each page, is available here.

This work is still in progress. We are increasing our site coverage, improving preprocessing, and, above all, extending our analysis of the sites beyond home pages. And we continue to verify manually a proportion of the pages to ensure reliability. Final results, covering all pages of the visited sites, will be available here shortly.


Appendixes

List of Languages Handled by the Detection Software
 1. German      7. French     13. Portuguese
 2. English     8. Italian    14. Russian
 3. Chinese     9. Japanese   15. Serbo-Croatian
 4. Danish     10. Malay      16. Swedish
 5. Spanish    11. Dutch      17. Czech
 6. Finnish    12. Norwegian

Error Sources

In spite of all the care we took, some unavoidable sources of error may skew our results, in certain cases in ways that are hard to quantify. We list the identified sources here so that the reader can judge their importance and assess their impact on the validity of the final result.

The first source is, for the time being, the restriction of the analysis to the home page of each server. As any non-anglophone Internet surfer knows, one or several home pages in another language are very often hidden behind a hyperlink on an English home page, especially if this other language is dominant in the geographic location of the server. We are trying to eliminate this obvious source of error by exploring the sites beyond their home pages.

Other, more subtle sources of error have to do with the way machines, and among them HTTP servers, are discovered. Our detection method is based on the echo of an ICMP packet. The Internet is not a 100% reliable network: packets get lost, and the probability of such a loss increases with distance. Machines that are remote from Montreal have a slightly greater chance of not being detected, so our sample may over-represent nearby regions. Please note that we mean network distance, which is different from geographic distance. We believe that this source of error is negligible.

Server detection faces a similar problem: a single connection is attempted, with a maximum response timeout. Remote servers, slow servers, overloaded servers, and servers situated beyond congested networks may therefore be missed. This again carries the risk of a biased sample, but we believe that the timeout period is long enough to reduce this risk to a minimum.

Finally, the analysis step is also a source of errors. The detection software is not perfect, and it sometimes encounters pages that are not in any language (e.g. a directory of cryptic filenames, or a list of users of an SGI machine). Our verification step allows us to quantify those errors, to estimate correction factors, and to adjust the results accordingly. We are currently fine-tuning the preprocessing program to reduce these problems further.




Tango multilingual browser ensures the correct display of all the languages of Babel. © 1997, Alis Technologies Inc.
