Latin American and Caribbean Contemporary Art Web Archive collection derivatives

Web archive derivatives of the Latin American and Caribbean Contemporary Art Web Archive collection from the Ivy Plus Libraries Confederation. The derivatives were created with the Archives Unleashed Toolkit and Archives Unleashed Cloud.

The ivy-11576-parquet.tar.gz derivatives are in the Apache Parquet format, which is a columnar storage format. These derivatives are generally small enough to work with on your local machine, and can be easily converted to Pandas DataFrames. See this notebook for examples.

Domains
    .webpages().groupBy(ExtractDomainDF($"url").alias("url")).count().sort($"count".desc)
Produces a DataFrame with the following columns: domain, count

Web Pages
    .webpages().select($"crawl_date", $"url", $"mime_type_web_server", $"mime_type_tika", RemoveHTMLDF(RemoveHTTPHeaderDF(($"content"))).alias("content"))
Produces a DataFrame with the following columns: crawl_date, url, mime_type_web_server, mime_type_tika, content

Web Graph
    .webgraph()
Produces a DataFrame with the following columns: crawl_date, src, dest, anchor

Image Links
    .imageLinks()
Produces a DataFrame with the following columns: src, image_url

Binary Analysis
    Audio
    Images
    PDFs
    Presentation program files
    Spreadsheets
    Text files
    Word processor files

The ivy-11576-auk.tar.gz derivatives are the standard set of web archive derivatives produced by the Archives Unleashed Cloud.

Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.
Raw Network file, which can also be loaded into Gephi. You will have to use that network program to lay it out yourself.
Full text file. In it, each website within the web archive collection will have its full text presented on one line, along with information about when it was crawled, the name of the domain, and the full URL of the content.
Domains count file. A text file containing the frequency count of domains captured within your web archive.
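
As a rough illustration of the Domains derivative described above, the following Python sketch counts URLs per host and sorts by descending count. It is only an approximation: count_domains, load_derivative, and the sample URLs are hypothetical names introduced here, and the toolkit's own ExtractDomainDF may normalise hosts differently. The load_derivative helper assumes pandas and a Parquet engine (pyarrow or fastparquet) are installed, and its path argument stands for wherever the extracted tarball places a Parquet file.

```python
from collections import Counter
from urllib.parse import urlparse


def load_derivative(path):
    """Read one Parquet derivative into a pandas DataFrame.

    Assumes pandas plus a Parquet engine (pyarrow or fastparquet) is
    installed; `path` is a hypothetical location inside the extracted tarball.
    """
    import pandas as pd  # imported lazily so the rest of the sketch needs only the stdlib

    return pd.read_parquet(path)


def count_domains(urls):
    """Count URLs per host and return (domain, count) pairs, most frequent first.

    Approximates the groupBy/count/sort pipeline; the toolkit's own
    ExtractDomainDF may normalise hosts differently.
    """
    hosts = (urlparse(u).hostname for u in urls)
    counts = Counter(h for h in hosts if h)  # skip strings with no host part
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample standing in for the "url" column of the Web Pages derivative.
sample_urls = [
    "http://example.org/a",
    "http://example.org/b",
    "https://example.com/",
]
print(count_domains(sample_urls))
```

With a real derivative, the same function could be applied to the url column of a loaded DataFrame, for example count_domains(load_derivative(path)["url"]), assuming the column names listed above.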

Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field Value
PID https://www.doi.org/10.5281/zenodo.3633118
PID https://www.doi.org/10.5281/zenodo.3633117
URL https://figshare.com/articles/Latin_American_and_Caribbean_Contemporary_Art_Web_Archive_collection_derivatives/11785425
URL http://dx.doi.org/10.5281/zenodo.3633118
URL https://zenodo.org/record/3633118
URL http://dx.doi.org/10.5281/zenodo.3633117
Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field Value
Access Right Open Access
Attribution

Description: Authorships and contributors

Field Value
Author Ruest, Nick, 0000-0003-1891-1112
Author Sala, Christine
Author Abrams, Samantha
Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field Value
Collected From Zenodo; figshare; Datacite
Hosted By Zenodo; figshare
Publication Date 2020-01-31
Publisher Zenodo
Additional Info
Field Value
Language UNKNOWN
Resource Type Dataset
keyword Art, Caribbean
keyword Art, Latin American
keyword Arts & Humanities
system:type dataset
Management Info
Field Value
Source https://science-innovation-policy.openaire.eu/search/dataset?datasetId=dedup_wf_001::2e6208f0eb14a000d51ebf058d8883f2
Author jsonws_user
Last Updated 3 January 2021, 23:12 (CET)
Created 3 January 2021, 23:12 (CET)