Seminar for the partners of the NoRDF chair:
Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost
Lihu Chen, PhD student, Télécom Paris, Institut Polytechnique de Paris
State-of-the-art NLP systems represent inputs with word embeddings, but these are brittle when faced with Out-of-Vocabulary (OOV) words. To address this issue, we follow the principle of mimick-like models and generate vectors for unseen words by learning the behavior of pre-trained embeddings from the surface form of words alone. We present a simple contrastive learning framework, LOVE, which extends the word representation of an existing pre-trained language model (such as BERT) and makes it robust to OOV words with only a few additional parameters. Extensive evaluations demonstrate that our lightweight model achieves similar or even better performance than prior competitors, both on original datasets and on corrupted variants. Moreover, it can be used in a plug-and-play fashion with FastText and BERT, where it significantly improves their robustness.
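To make the mimick-plus-contrastive idea concrete, here is a minimal, illustrative PyTorch sketch, not the authors' implementation: a tiny encoder builds a word vector from hashed character trigrams of the surface form alone and is trained with an InfoNCE-style contrastive loss to match frozen pre-trained target vectors (in LOVE, positive pairs also come from augmented surface forms such as typos). All names, sizes, and the hashing scheme here are assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class NgramEncoder(nn.Module):
    """Toy mimick-style encoder: embeds a word from hashed character
    trigrams of its surface form, so no vocabulary lookup is needed."""
    def __init__(self, n_buckets: int = 100_000, dim: int = 300):
        super().__init__()
        self.emb = nn.EmbeddingBag(n_buckets, dim, mode="mean")
        self.n_buckets = n_buckets

    def forward(self, words):
        ids, offsets = [], []
        for w in words:
            offsets.append(len(ids))
            padded = f"<{w}>"  # boundary markers, as in FastText-style n-grams
            ids += [hash(padded[i:i + 3]) % self.n_buckets
                    for i in range(len(padded) - 2)]
        return self.emb(torch.tensor(ids), torch.tensor(offsets))

def contrastive_loss(pred, target, tau=0.07):
    """InfoNCE: pull each predicted vector toward its pre-trained target
    embedding; the other words in the batch act as in-batch negatives."""
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = pred @ target.t() / tau
    return F.cross_entropy(logits, torch.arange(len(pred)))

# Hypothetical training step: in practice 'targets' would be frozen
# FastText/BERT vectors; random tensors here are only stand-ins.
encoder = NgramEncoder()
words = ["misspeling", "misspelling", "robustness"]  # typo + clean forms
targets = torch.randn(3, 300)
loss = contrastive_loss(encoder(words), targets)
loss.backward()
```

Once trained, such an encoder can produce a vector for any unseen or corrupted word from its characters alone, which is what allows the plug-and-play use with FastText and BERT described in the abstract.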
Neuro Symbolic Approaches for Logical Reasoning on Natural Language
Chadi Helwé, PhD student, Télécom Paris, Institut Polytechnique de Paris
Nowadays, transformer-based models achieve impressive performance on many natural language processing tasks, but it remains unclear to what extent they can reason on natural language. In a survey paper, we charted the successes and limitations of such models. One major limitation is negation: performance drops sharply when negation is included in textual entailment examples. To address this, we developed TINA (Textual Inference with Negation Augmentation), a principled technique for negated data augmentation that can be combined with the unlikelihood loss function. Our approach improves the performance of different models on textual entailment datasets with negation without sacrificing performance on datasets without negation. We also developed LogiTorch, a PyTorch-based library that bundles logical reasoning benchmarks, models, and utility functions, allowing researchers to load logical reasoning datasets and train logical reasoning models in just a few lines of code. Finally, we are developing MAFALDA (Multi-level Annotated FALlacy DAtaset), a benchmark for evaluating how well models detect and classify fallacies.
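As a hedged illustration of the objective TINA builds on, below is a short PyTorch sketch of the unlikelihood loss of Welleck et al. (2019), adapted here to a classification setting; this adaptation, the label names, and the weighting are assumptions for illustration, not TINA's exact formulation.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits: torch.Tensor, negative_targets: torch.Tensor) -> torch.Tensor:
    """Unlikelihood objective (Welleck et al., 2019): penalize the
    probability mass the model assigns to outputs it should NOT predict.

    logits:           (batch, num_classes) raw scores
    negative_targets: (batch,) class indices the model must avoid
    """
    probs = F.softmax(logits, dim=-1)
    # Probability assigned to the forbidden class for each example.
    p_neg = probs.gather(1, negative_targets.unsqueeze(1)).squeeze(1)
    # -log(1 - p) grows as the model puts mass on the forbidden class.
    return -torch.log((1.0 - p_neg).clamp_min(1e-8)).mean()

# Hypothetical usage: combine cross-entropy on gold labels with
# unlikelihood on labels that a negated example rules out.
logits = torch.randn(4, 3)              # e.g., entail / neutral / contradict
gold = torch.tensor([0, 1, 2, 0])       # gold labels
forbidden = torch.tensor([2, 0, 0, 1])  # labels invalidated by negation
loss = F.cross_entropy(logits, gold) + 0.5 * unlikelihood_loss(logits, forbidden)
```

The intuition is that cross-entropy pulls probability mass toward the gold label, while the unlikelihood term explicitly pushes mass away from labels that the negated premise or hypothesis rules out.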