Internet Archive


October 20, 2021

The Internet Archive is a non-profit digital library founded in the United States in 1996 by Brewster Kahle, who also founded Alexa Internet. It offers access to archived web and multimedia materials. Its headquarters are in the Richmond District of San Francisco, California, and its stated mission is "universal access to all knowledge". The Archive provides digital materials such as websites, web pages, images, music, video, audio, software, moving images, and millions of books, which it stores permanently and makes available free of charge. As of October 2012, its holdings had reached 10 PB (that is, 10,240 TB). The Archive is also an advocate for a free and open Internet.

Data source

The Archive's data is collected automatically by its own web crawlers. Its web archive, the Wayback Machine, has captured more than 150 billion web pages.

Funding situation

The Archive's annual budget is about 10 million U.S. dollars, funded by its web crawling services, partnerships, sponsorships, and the Kahle/Austin Foundation. Only a few dozen employees work at the headquarters; most of the staff work in the book scanning centers. The Archive also has a data center in Redwood City.


The Archive is a member of the International Internet Preservation Consortium and was officially designated a library by the State of California in 2007. Its holdings are varied; for example, as of early 2015 the Internet Archive had collected a total of 2,400 MS-DOS games.


In 1996, Brewster Kahle founded the Internet Archive at the same time as the for-profit company Alexa Internet. The Archive began collecting and storing data in October of that year, but the data remained inaccessible until the Wayback Machine was developed in 2001. At the end of 1999, the scope of collection was expanded. In August 2012, the Archive announced that it had added BitTorrent as a download option for more than 1.3 million existing files. Because each download is served from two Archive data centers, this became the fastest way to retrieve data from the Archive. On November 6, 2013, a fire broke out at the Archive's headquarters in the Richmond District, damaging equipment and some nearby apartments, with losses estimated at US$600,000.

Web Archive

Wayback Machine

The Wayback Machine is one of the Internet Archive's most important services. Its name comes from the "WABAC machine" in the American animated series The Rocky and Bullwinkle Show. The Wayback Machine lets people search for and view archived versions of web pages. In some countries and regions the term has become so common that "Wayback Machine" and "Internet Archive" are even used as synonyms.
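As a rough illustration of how an archived copy is addressed, the sketch below builds a link to an archived page from a page URL and a date. It assumes the commonly seen address pattern https://web.archive.org/web/<14-digit timestamp>/<original URL>, which is not stated in this article, and the function name wayback_url is hypothetical. The service typically redirects to the capture nearest the requested time, so the link may not correspond to an exact snapshot.

```python
from datetime import datetime

# Assumed address pattern for archived pages; not taken from the article.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, when: datetime) -> str:
    """Return a link to the archived copy of page_url near the given time."""
    # Timestamps in these links are written as YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Build a link for a placeholder site as it might have looked in late 2001.
    print(wayback_url("http://example.com/", datetime(2001, 10, 24, 12, 0, 0)))
```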


Archive-It is a service that helps organizations and individuals build web archives. Once the URL of a target website has been entered and saved, and the site's robots.txt file permits access by the Internet Archive's crawler, the page becomes part of the Wayback Machine. As of March 2014, Archive-It had more than 275 partner organizations in 46 U.S. states and 16 other countries, and its online archive held more than 7.4 billion web pages.
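The robots.txt condition can be sketched with Python's standard urllib.robotparser module. This is only an illustration of how a crawler might consult a site's rules, not the Archive's actual implementation. The user-agent name ia_archiver, the sample rules, and the helper may_archive are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site: the archive crawler may fetch
# everything except /private/, and all other crawlers are blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow: /
"""


def may_archive(robots_txt: str, page_url: str, agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt rules let `agent` fetch page_url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, page_url)


if __name__ == "__main__":
    for url in ("http://example.org/index.html", "http://example.org/private/a.html"):
        print(url, may_archive(SAMPLE_ROBOTS_TXT, url))
```

In this sketch a page is eligible for archiving only when the check returns True for the crawler's user agent; a real crawler would first download robots.txt from the target site.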

Book Collection

The Internet Archive collects digitized books from around the world, including special collections from major libraries and cultural heritage institutions. It operates 33 book scanning centers in 5 countries, with financial support from libraries and foundations. As of July 2013, the Archive held 4.4 million books, with more than 15 million downloads per month. Earlier, as of November 2008, it held 1 million online texts totaling 0.5 PB, comprising raw camera images, cropped and skewed images, PDF files, and raw OCR data.

Number of texts in each language

Number of texts in each era

Image data

In addition to the above, the Internet Archive holds a large amount of digital media, all of it either in the public domain in the United States or released under Creative Commons licenses. These media files are organized into collections by media type (moving images, audio, text, and so on) and divided into sub-collections according to various criteria. For example, materials provided by the Metropolitan Museum of Art form one sub-collection, which now contains more than 140,000 items. Each main collection also includes a "community" sub-collection (previously called "open source") for contributions from the public.

Audio Collection

Audio files
