2024 | Original Paper | Book Chapter

The Open Web Index

Crawling and Indexing the Web for Public Use

Authors: Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, Benno Stein

Published in: Advances in Information Retrieval

Publisher: Springer Nature Switzerland

Abstract

Only a few search engines index the Web at scale. Third parties who want to develop downstream applications based on web search depend entirely on the terms and conditions of these few vendors. The public availability of the large-scale Common Crawl does not alleviate the situation, as it is often cheaper to crawl and index only a smaller collection focused on a downstream application scenario than to build and maintain an index for a general collection the size of the Common Crawl. Our goal is to improve this situation by developing the Open Web Index.

The Open Web Index is a publicly funded basic infrastructure from which downstream applications will be able to select and compile custom indexes in a simple and transparent way. We aim to establish the Open Web Index, along with associated data products, as a new open web information intermediary. In this paper, we present our first prototype for the Open Web Index and our plans for future developments. In addition to the conceptual and technical background, we discuss how the information retrieval community can benefit from and contribute to the Open Web Index, for example by providing resources, by providing pre-processing components and pipelines, or by creating new kinds of vertical search engines and test collections.

Title: The Open Web Index
Authors: Gijs Hendriksen, Michael Dinzinger, Sheikh Mastura Farzana, Noor Afshan Fathima, Maik Fröbe, Sebastian Schmidt, Saber Zerhoudi, Michael Granitzer, Matthias Hagen, Djoerd Hiemstra, Martin Potthast, Benno Stein
Publisher: Springer Nature Switzerland
Book: Advances in Information Retrieval
Print ISBN: 978-3-031-56068-2
Electronic ISBN: 978-3-031-56069-9
Copyright year: 2024
DOI: https://doi.org/10.1007/978-3-031-56069-9_10
