|
The Babel team (a joint initiative from Alis Technologies and the Internet
Society) announces the first major study of the actual distribution of
languages on the Internet.
To what extent is the Internet, and more precisely the Web, dominated
by the English language? How does this perception translate into real
figures? Which other languages account for a significant portion of the Web?
Until now, no major study had been conducted on language distribution.
That is no longer the case. Furthermore, the study will be updated twice a
year.
Methodology
To ensure, as much as possible, the validity of the results, the Babel team
has developed a rigorous method for exploring the Web.
Finding the Machines
The process starts with a random survey of the Internet by means of a random
number generator. Each number is considered an IP address, and a program
using the ICMP protocol (ping) determines if a machine exists at that
address. From a sample of more than 30 million potential addresses,
we uncovered close to 60,000 machines.
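The sampling step described above can be sketched as follows. This is a minimal illustration, not the Babel team's actual program: the function names, the injectable `probe` parameter, and the Linux-style `ping` flags (`-c` for count, `-W` for the timeout in seconds) are all assumptions made for the example.

```python
import ipaddress
import random
import subprocess


def random_ipv4(rng):
    """Draw a uniformly random 32-bit number and read it as an IPv4 address."""
    return ipaddress.IPv4Address(rng.getrandbits(32))


def ping_once(address, timeout_s=2):
    """Return True if one ICMP echo request gets a reply.

    Assumption: a Linux-style `ping` binary is on the PATH.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), str(address)],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def survey(sample_size, probe=ping_once, seed=None):
    """Probe `sample_size` random addresses and return those that answered."""
    rng = random.Random(seed)
    found = []
    for _ in range(sample_size):
        address = random_ipv4(rng)
        if probe(address):
            found.append(address)
    return found
```

Passing a different `probe` function makes it possible to try the sampling logic without sending any packets. Note that this sketch samples with replacement and does not exclude reserved address ranges.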
NOTE: From the total number of available addresses,
the number of addresses surveyed, and the total number of machines
found, we can estimate the number of accessible machines on the Internet
today to be approximately 7,166,000. This figure obviously excludes the large
number of machines hidden behind firewalls that do not respond to a ping;
most of these are probably not Web servers visible to the general public.
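The extrapolation in this note is a simple proportion, sketched below. The function name is hypothetical, and treating all 2^32 32-bit values as "available addresses" is an assumption; the text does not state which total was used. Because the sample figures above are given only as rounded values, the usage example uses toy numbers rather than the study's.

```python
# Assumption: every 32-bit value counts as an available address.
IPV4_ADDRESS_SPACE = 2 ** 32


def extrapolate(found_in_sample, sample_size, population=IPV4_ADDRESS_SPACE):
    """Scale a count observed in a uniform random sample up to the population."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return found_in_sample * population / sample_size


# Toy numbers: 5 responders among 100 probed, out of 1,000 possible addresses.
print(extrapolate(5, 100, population=1000))  # prints 50.0
```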
Finding the Servers
The next step consists of finding Web servers; not all machines are
servers! A second program takes each machine from the list, and determines if it hosts an HTTP server.
More than 8,000 machines responded positively, and it is on these machines
that the last step of the process, the linguistic analysis, took place.
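A server check of this kind might look like the sketch below. The text does not say how the second program decides that a machine hosts an HTTP server, so the choice of a `HEAD /` request on port 80, the timeout value, and the function name are assumptions for illustration only.

```python
import http.client


def hosts_http_server(host, port=80, timeout_s=10.0):
    """Return True if `host` answers an HTTP request on `port` in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("HEAD", "/")
        conn.getresponse()  # any well-formed HTTP response counts as a server
        return True
    except (OSError, http.client.HTTPException):
        # Refused connection, timeout, or a reply that is not valid HTTP.
        return False
    finally:
        conn.close()
```

Any status code, including an error status, is counted as a positive response here, since the question is only whether something speaks HTTP at that address.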
NOTE: From the total number of available addresses,
the number of addresses surveyed, and the total number of servers
found, we can estimate the number of accessible Web servers on the Internet
today to be approximately 1,007,000.
Analyzing the Pages
The linguistic analysis program retrieves pages published by each server
(as a first step, we limited ourselves to home pages, without exploring
the rest of the site) and removes the HTML tagging. If more than 500 characters remain,
the analysis program submits the text to automatic language
identification software. Using the most advanced language recognition,
Internet, and cryptanalysis techniques, this program is able to identify the
document's language and character set. The detection software, SILC,
can recognize 17 of the world's most used languages (see the
list of languages for details) in a wide
variety of encodings (character sets).
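The preprocessing described above (strip the tagging, then keep only pages with more than 500 characters) can be sketched with Python's standard HTML parser. This is an illustration under stated assumptions, not the Babel team's program: the names are hypothetical, skipping `script` and `style` content and collapsing whitespace are choices made for the example, and the language identification itself (SILC) is not reproduced.

```python
from html.parser import HTMLParser

MIN_CHARS = 500  # threshold quoted in the text


class _TextExtractor(HTMLParser):
    """Collect text content, ignoring tags and script/style bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the visible text of `markup` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.parts).split())


def is_significant(markup, min_chars=MIN_CHARS):
    """True if more than `min_chars` characters remain after stripping."""
    return len(strip_html(markup)) > min_chars
```

Only pages for which `is_significant` returns true would be passed on to the language identification step.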
Verifying
This last step (very boring for the Babel team) consisted of manually
visiting a sample of the pages in a browser and comparing the
automatic detection with a visual identification. Close to 200 pages
have been verified so far, which confirms the general reliability
of the detection software and the process but also reveals a few
flaws. Correction factors have been estimated from these (weak) statistics
and are taken into account in the next-to-last column of the table. An
analysis of error sources can also be found in an appendix.
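The text does not spell out how the correction factors were derived. One simple possibility, shown here purely as an assumption, is to multiply each language's detected page count by the fraction of its manually checked pages that were confirmed, and to pool the remainder under "none or unknown". The function name, the rule, and the toy numbers are all hypothetical.

```python
def corrected_shares(detected_counts, confirmed_fraction):
    """Rescale detected page counts by the fraction confirmed manually.

    Returns shares of the whole sample; pages judged misdetected are
    pooled under the key "none or unknown".
    """
    total = sum(detected_counts.values())
    if total == 0:
        raise ValueError("no pages to correct")
    shares = {}
    for language, count in detected_counts.items():
        # Hypothetical rule: no verification data means no correction.
        factor = confirmed_fraction.get(language, 1.0)
        shares[language] = count * factor / total
    shares["none or unknown"] = 1.0 - sum(shares.values())
    return shares


# Toy numbers, not taken from the study.
shares = corrected_shares({"en": 80, "de": 20}, {"en": 0.9, "de": 0.5})
```

This simplified rule never moves a page from one language to another; it only moves pages into the unknown pool.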
Preliminary Hit Parade (Limited to Home Pages)
From 3,239 home pages containing more than 500
characters, the most frequently found languages on the Web, in decreasing
order, are shown in the table below. The last column gives the estimated
number of significant Web servers (more than 500 characters of text)
in each language. It is extrapolated from the number of servers found in
that language, in proportion to the ratio between the total number of IP
addresses and the number of addresses surveyed.
| Ranking | Language | Number of pages | Percentage | Corrected percentage | Estimated number of servers |
| 1 | English | 2,722 | 84.0% | 82.3% | 332,778 |
| 2 | German | 147 | 4.5% | 4.0% | 17,971 |
| 3 | Japanese | 101 | 3.1% | 1.6% | 12,348 |
| 4 | French | 59 | 1.8% | 1.5% | 7,213 |
| 5 | Spanish | 38 | 1.2% | 1.1% | 4,646 |
| 6 | Swedish | 35 | 1.1% | 0.6% | 4,279 |
| 7 | Italian | 31 | 1.0% | 0.8% | 3,790 |
| 8 | Portuguese | 21 | 0.7% | 0.7% | 2,567 |
| 9 | Dutch | 20 | 0.6% | 0.4% | 2,445 |
| 10 | Norwegian | 19 | 0.6% | 0.3% | 2,323 |
| 11 | Finnish | 14 | 0.4% | 0.3% | 1,712 |
| 12 | Czech | 11 | 0.3% | 0.3% | 1,345 |
| 13 | Danish | 9 | 0.3% | 0.3% | 1,100 |
| 14 | Russian | 8 | 0.3% | 0.1% | 978 |
| 15 | Malay | 4 | 0.1% | 0.1% | 489 |
| none or unknown (correction) | | | | 5.6% | |
| Total | | 3,239 | 100% | 100% | 395,984 |
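The Percentage and Estimated number of servers columns can be cross-checked with a few lines of arithmetic. In this sketch the function names are hypothetical, and the scale factor is back-derived from the table's own Total row (395,984 servers for 3,239 pages) rather than taken from any figure stated in the text.

```python
def percentage(pages, total_pages):
    """Share of the sample in percent, rounded to one decimal place."""
    return round(100 * pages / total_pages, 1)


def estimated_servers(pages, servers_per_sampled_page):
    """Scale a per-language page count up to an Internet-wide estimate."""
    return round(pages * servers_per_sampled_page)


TOTAL_PAGES = 3239
# Back-derived from the Total row; the source does not state this factor.
SCALE = 395984 / TOTAL_PAGES

print(percentage(2722, TOTAL_PAGES))   # English share: 84.0
print(estimated_servers(101, SCALE))   # Japanese estimate: 12348
```

The same factor applies to every row, which is what "calculated in proportion" implies: each page in the sample stands for roughly 122 servers on the Internet as a whole.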
The complete list of the 3,239 pages visited, with the language assigned to
each page, is available here.
This work is still in progress. We are increasing our site coverage, improving
preprocessing, and, above all, extending our analysis of the sites beyond
home pages. We also continue to verify a proportion of the pages manually
to ensure reliability. Final results, covering all pages of the visited
sites, will be available here shortly.
Appendixes
List of Languages Handled by the Detection Software
| 1. German | 7. French | 13. Portuguese |
| 2. English | 8. Italian | 14. Russian |
| 3. Chinese | 9. Japanese | 15. Serbo-Croatian |
| 4. Danish | 10. Malay | 16. Swedish |
| 5. Spanish | 11. Dutch | 17. Czech |
| 6. Finnish | 12. Norwegian | |
Error Sources
Despite all the care we took, unavoidable sources of error may distort
our results, in some cases in ways that are hard to quantify. We
list the identified sources here so that the reader can judge their
importance and assess their impact on the validity of the final result.
The first source is, for the time being, the restriction of the analysis
to the home page of each server. Any non-anglophone
Internet surfer knows that one (or several) home pages in another language are very
often hidden behind a hyperlink on an English home page, especially if
this other language is dominant in the geographic location of the server. We
are working to eliminate this obvious source of error by exploring the sites beyond
home pages.
Other, more subtle sources of error have to do with the way we uncover
machines and, among them, HTTP servers. Our detection method is based on the
echo of an ICMP packet; the Internet is not a 100% reliable network, and
packets get lost, with the probability of such a loss increasing with distance.
Machines remote from Montreal have a slightly greater chance of not
being detected, which may make our sample more representative of
nearby regions. Please note that we mean network distance, which is different from
geographic distance. We believe that this source of error is negligible.
Server detection faces a similar problem: a single connection is attempted, with a maximum
response timeout. Remote servers situated beyond congested networks,
whether slow or overloaded, may therefore be missed. Again, this carries the risk
of a biased sample, but we believe that the timeout period is long enough to
reduce this risk to a minimum.
Finally, the analysis step is also a source of errors. The detection software is
not perfect, and it sometimes encounters pages that are not in any language (e.g.
a directory of cryptic filenames, or a list of users of an SGI machine). Our
verification step allows us to quantify those errors, to estimate correction factors,
and to adjust results accordingly. We are currently working on fine-tuning
the preprocessing program to further reduce these problems.
|