19 items found

Groups: Datacite Tags: NLP

  • dataset

    CodiEsp-abstracs: Abstracts from Lilacs and Ibecs with ICD10 codes

    JSON file with abstracts from Lilacs and Ibecs with the ICD10 codes (ICD10-CM and ICD10-PCS) associated with them (CIE10 in Spanish). These databases have MeSH terms...
  • dataset

    CodiEsp codes: list of valid CIE10 codes for the CodiEsp task

    This compressed folder contains two files: + codiesp-D_codes.tsv: list of CIE10-Diagnósticos terms (2018 version) with their description in Spanish and in English....
  • dataset

    dedup_wf_001--ef0fc5ab66a6d95fe2db127c359d6cda

    This release contains data sets for experiments with document-level machine translation. The data sets have been used in previous studies and are provided here for replicability and...
  • dataset

    dedup_wf_001--b719e38f8d8cb6229c959d496ae1b5d1

    To analyze the impact on model quality of reducing the number of dimensions, strictly controlled training runs of word embeddings are performed on Wikipedia corpora of...
  • dataset

    Transfer fine-tuned BERT models by paraphrases

    BERT models transfer fine-tuned with phrasal paraphrases. transferFT_bert-base-uncased.pkl is based on the bert-base-uncased model; transferFT_bert-large-uncased.pkl is based...
  • dataset

    Gold Standard Corpus, Ontologies, And Entity-Quality Ontology Annotations For...

    This data set includes a gold-standard corpus of evolutionary phenotype descriptions (in the form of character state descriptions pulled from a variety of phylogenetic...
  • dataset

    Security Bug Conversations

    This dataset will be released as part of the following publication. Benjamin S. Meyers, Nuthan Munaiah, Andrew Meneely, and Emily Prud'hommeaux. Pragmatic...
  • dataset

    MeSpEn_Parallel-Corpora

    MeSpEn is a resource of heterogeneous health-related documents in Spanish and English, useful for building parallel corpora for training and evaluating Spanish <->...
  • publication

    Deep Learning Approaches to Text Production

    Text production is a key component of many NLP applications. In data-driven approaches, it is used, for instance, to generate dialogue turns from dialogue moves, to verbalise the...
  • publication

    Adapting text mining tools to noisy text

    Invited talk given at the Text Mining for Science Studies Workshop, Berlin
  • publication

    Curation Technologies for a Cultural Heritage Archive: "Project Tongilbu"

    We are developing a platform for generic curation technologies, using various NLP procedures, that is specifically targeted at, but not limited to, document collections that are...
  • publication

    Data Discovery on Siren

    Tutorial given on May 4th, 2020 at the Knowledge Graph Conference
  • publication

    dedup_wf_001--0d7b92f2f6bde9215ae00f45018692e2

    POSTDATA focuses on poetry analysis, the publication of poetic resources, and their exploration, applying Digital Humanities methods. This is a trans-domain project, as it...
  • publication

    The Knowledge Graph that Listens

    Enterprises that are building Knowledge Graphs are rapidly getting a grip on unstructured data with current advances in Natural Language Processing (NLP) techniques. But there...
  • publication

    Adapted TextRank for Term Extraction

    Automatic Term Extraction is a fundamental Natural Language Processing task often used in many knowledge acquisition processes. It is a challenging NLP task due to its high...
  • publication

    Semantically Aware Text Categorisation for Metadata Annotation

    In this paper we illustrate a system aimed at solving a longstanding and challenging problem: acquiring a classifier to automatically annotate bibliographic records by starting...
  • publication

    Machine Learning for ontologies: the KNOWMAK experience

    Webinar "Can machine learning technologies be useful to create or complete ontologies in agriculture?" as part of the CGIAR Ontologies Communities of Practice Platform for Big...
  • software

    Extracting Terms Concerning AI Based On Web Of Science Data

    This is the accompanying code for extracting AI-related terms from the titles and abstracts in Web of Science data.
  • dataset

    Magi Practical Web Article Corpus

    This corpus contains 10 million Chinese articles, totalling more than 10 billion words, which have been extracted from the Internet and refined such that only the main body...