Skip to main content

2024 | OriginalPaper | Buchkapitel

The Open Web Index

Crawling and Indexing the Web for Public Use

verfasst von : Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, Benno Stein

Erschienen in: Advances in Information Retrieval

Verlag: Springer Nature Switzerland

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Only few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search fully depend on the terms and conditions of the few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index.
The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. Our goal is to establish the Open Web Index along with associated data products as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index—for example, by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Formal, T., Piwowarski, B., Clinchant, S.: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, SIGIR 2021, pp. 2288-2292. Association for Computing Machinery, New York (2021), ISBN 9781450380379 Formal, T., Piwowarski, B., Clinchant, S.: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, SIGIR 2021, pp. 2288-2292. Association for Computing Machinery, New York (2021), ISBN 9781450380379
2.
Zurück zum Zitat Fröbe, M., et al.: The Information Retrieval Experiment Platform. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023) Fröbe, M., et al.: The Information Retrieval Experiment Platform. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023)
4.
Zurück zum Zitat Gao, L., et al.: The Pile: An 800GB Dataset of Diverse Text for Language Modeling (Dec 2020) Gao, L., et al.: The Pile: An 800GB Dataset of Diverse Text for Language Modeling (Dec 2020)
5.
Zurück zum Zitat Goel, S., Broder, A.Z., Gabrilovich, E., Pang, B.: Anatomy of the long tail: ordinary people with extraordinary tastes. In: Davison, B.D., Suel, T., Craswell, N., Liu, B. (eds.) Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, 4-6 February 2010, pp. 201–210. ACM, New York (2010) Goel, S., Broder, A.Z., Gabrilovich, E., Pang, B.: Anatomy of the long tail: ordinary people with extraordinary tastes. In: Davison, B.D., Suel, T., Craswell, N., Liu, B. (eds.) Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM 2010, 4-6 February 2010, pp. 201–210. ACM, New York (2010)
6.
Zurück zum Zitat Gollub, T., Potthast, M., Stein, B.: Shaping the Information Nutrition Label. In: Albakour, D., Corney, D., Gonzalo, J., Martinez, M., Poblete, B., Valochas, A. (eds.) 2nd International Workshop on Recent Trends in News Information Retrieval (NewsIR 2018) at ECIR. CEUR Workshop Proceedings, vol. 2079, pp. 9–11 (Mar 2018), ISSN 1613-0073 Gollub, T., Potthast, M., Stein, B.: Shaping the Information Nutrition Label. In: Albakour, D., Corney, D., Gonzalo, J., Martinez, M., Poblete, B., Valochas, A. (eds.) 2nd International Workshop on Recent Trends in News Information Retrieval (NewsIR 2018) at ECIR. CEUR Workshop Proceedings, vol. 2079, pp. 9–11 (Mar 2018), ISSN 1613-0073
7.
Zurück zum Zitat Granitzer, M., Voigt, S., et al.: Impact and Development of an Open Web Index for Open Web Search. J. Assoc. Inform. Sci. Technol. (2023) Granitzer, M., Voigt, S., et al.: Impact and Development of an Open Web Index for Open Web Search. J. Assoc. Inform. Sci. Technol. (2023)
8.
Zurück zum Zitat Guha, R.V., Brickley, D., MacBeth, S.: Schema.org: evolution of structured data on the web: big data makes common schemas even more necessary. Queue 13(9), 10–37 (2015), ISSN 1542-7730 Guha, R.V., Brickley, D., MacBeth, S.: Schema.org: evolution of structured data on the web: big data makes common schemas even more necessary. Queue 13(9), 10–37 (2015), ISSN 1542-7730
9.
Zurück zum Zitat Kamphuis, C., Hasibi, F., Lin, J., de Vries, A.P.: REBL: entity linking at scale. In: Alonso, O., Baeza-Yates, R., King, T.H., Silvello, G. (eds.) Proceedings of the Third International Conference on Design of Experimental Search & Information Retrieval Systems, San Jose, CA, USA, 30-31 August 2022. CEUR Workshop Proceedings, vol. 3480, pp. 68–75. CEUR-WS.org (2022) Kamphuis, C., Hasibi, F., Lin, J., de Vries, A.P.: REBL: entity linking at scale. In: Alonso, O., Baeza-Yates, R., King, T.H., Silvello, G. (eds.) Proceedings of the Third International Conference on Design of Experimental Search & Information Retrieval Systems, San Jose, CA, USA, 30-31 August 2022. CEUR Workshop Proceedings, vol. 3480, pp. 68–75. CEUR-WS.org (2022)
10.
Zurück zum Zitat Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, pp. 39-48. Association for Computing Machinery, New York (2020), ISBN 9781450380164 Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2020, pp. 39-48. Association for Computing Machinery, New York (2020), ISBN 9781450380164
11.
Zurück zum Zitat Koster, M., Illyes, G., Zeller, H., Sassman, L.: RFC 9309 Robots Exclusion Protocol (2022) Koster, M., Illyes, G., Zeller, H., Sassman, L.: RFC 9309 Robots Exclusion Protocol (2022)
12.
Zurück zum Zitat Kreutzer, J., et al.: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (2021) Kreutzer, J., et al.: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (2021)
13.
Zurück zum Zitat Lewandowski, D.: The web is missing an essential part of infrastructure: an open web index. Commun. ACM 62(4), 24 (2019)CrossRef Lewandowski, D.: The web is missing an essential part of infrastructure: an open web index. Commun. ACM 62(4), 24 (2019)CrossRef
14.
Zurück zum Zitat Li, H., Su, Y., Cai, D., Wang, Y., Liu, L.: A Survey on Retrieval-Augmented Text Generation. arXiv preprint arXiv:2202.01110 (2022) Li, H., Su, Y., Cai, D., Wang, Y., Liu, L.: A Survey on Retrieval-Augmented Text Generation. arXiv preprint arXiv:​2202.​01110 (2022)
15.
Zurück zum Zitat Lin, J., et al.: Supporting interoperability between open-source search engines with the common index file format. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2149–2152 (2020) Lin, J., et al.: Supporting interoperability between open-source search engines with the common index file format. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2149–2152 (2020)
17.
Zurück zum Zitat Middleton, S.E., Kordopatis-Zilos, G., Papadopoulos, S., Kompatsiaris, Y.: Location extraction from social media: geoparsing, location disambiguation, and geotagging. ACM Trans. Inform. Syst. (TOIS) 36(4), 1–27 (2018)CrossRef Middleton, S.E., Kordopatis-Zilos, G., Papadopoulos, S., Kompatsiaris, Y.: Location extraction from social media: geoparsing, location disambiguation, and geotagging. ACM Trans. Inform. Syst. (TOIS) 36(4), 1–27 (2018)CrossRef
18.
Zurück zum Zitat Mühleisen, H., Bizer, C.: Web data commons - extracting structured data from two large web corpora. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M. (eds.) WWW 2012 Workshop on Linked Data on the Web, Lyon, France, 16 April 2012. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012) Mühleisen, H., Bizer, C.: Web data commons - extracting structured data from two large web corpora. In: Bizer, C., Heath, T., Berners-Lee, T., Hausenblas, M. (eds.) WWW 2012 Workshop on Linked Data on the Web, Lyon, France, 16 April 2012. CEUR Workshop Proceedings, vol. 937. CEUR-WS.org (2012)
19.
Zurück zum Zitat Overwijk, A., Xiong, C., Liu, X., VandenBerg, C., Callan, J.: ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information (Dec 2022) Overwijk, A., Xiong, C., Liu, X., VandenBerg, C., Callan, J.: ClueWeb22: 10 Billion Web Documents with Visual and Semantic Information (Dec 2022)
20.
Zurück zum Zitat Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020) Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020)
22.
Zurück zum Zitat Scells, H., Zhuang, S., Zuccon, G.: Reduce, reuse, recycle: green information retrieval research. In: Amigó, E., Castells, P., Gonzalo, J., Carterette, B., Culpepper, J.S., Kazai, G. (eds.) SIGIR 2022: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11 - 15 July 2022, pp. 2825–2837. ACM (2022) Scells, H., Zhuang, S., Zuccon, G.: Reduce, reuse, recycle: green information retrieval research. In: Amigó, E., Castells, P., Gonzalo, J., Carterette, B., Culpepper, J.S., Kazai, G. (eds.) SIGIR 2022: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11 - 15 July 2022, pp. 2825–2837. ACM (2022)
24.
Zurück zum Zitat van Hulst, J.M., Hasibi, F., Dercksen, K., Balog, K., de Vries, A.P.: REL: an entity linker standing on the shoulders of giants. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2197–2200. ACM, Virtual Event China (Jul 2020), ISBN 978-1-4503-8016-4 van Hulst, J.M., Hasibi, F., Dercksen, K., Balog, K., de Vries, A.P.: REL: an entity linker standing on the shoulders of giants. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2197–2200. ACM, Virtual Event China (Jul 2020), ISBN 978-1-4503-8016-4
25.
Zurück zum Zitat Wiegmann, M., Wolska, M., Schröder, C., Borchardt, O., Stein, B., Potthast, M.: Trigger warning assignment as a multi-label document classification problem. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12113–12134. Association for Computational Linguistics, Toronto, Canada (Jul 2023) Wiegmann, M., Wolska, M., Schröder, C., Borchardt, O., Stein, B., Potthast, M.: Trigger warning assignment as a multi-label document classification problem. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12113–12134. Association for Computational Linguistics, Toronto, Canada (Jul 2023)
Metadaten
Titel
The Open Web Index
verfasst von
Gijs Hendriksen
Michael Dinzinger
Sheikh Mastura Farzana
Noor Afshan Fathima
Maik Fröbe
Sebastian Schmidt
Saber Zerhoudi
Michael Granitzer
Matthias Hagen
Djoerd Hiemstra
Martin Potthast
Benno Stein
Copyright-Jahr
2024
DOI
https://doi.org/10.1007/978-3-031-56069-9_10

Premium Partner