Research Statement

Research Vision on Structure-aware Semantics Understanding

The semantics of the world can essentially be organized in structured form, and data of different modalities come with structural representations. For example, language understanding in almost all NLP applications can be viewed as a hierarchy with different levels. For other modalities (e.g., vision), the key likewise lies in comprehending semantic structure, for instance through the scene graph representation. The following figure exemplifies linguistic syntax structures in NLP (dependency trees and constituency grammar) and the visual scene graph structure in CV. In addition, world knowledge has been represented in structured formats such as knowledge graphs, and human-level reasoning largely proceeds in a structured manner.

Thus, the essence of semantics understanding for language, vision, and other modalities lies in understanding their intrinsic semantic structures, which motivates my research angle of Structure-aware Intelligence Learning (SAIL). Following the SAIL idea, I divide my research into three key branches: structure-aware NLP, structure-aware MM, and structure-aware LLM. I started with deep learning-based semantics understanding in NLP, where I explored structure-aware NLP, and later extended the SAIL idea to structure-aware MM. The recent rise and great success of large language models (LLMs) have revealed the potential of reaching artificial general intelligence (AGI) along this path. Accordingly, I have most recently been integrating structural awareness into LLMs for semantics understanding, i.e., structure-aware LLM. The ultimate goal is to realize human-level AGI across universal modalities by modeling the semantic structures of the world. For SAIL-based AGI to align well with human society, the following targets should also be achieved: efficacy, interpretability, robustness (generalizability), efficiency (scalability), and trustworthiness. The following figure summarizes and illustrates the big picture of my research goal.

My research scope covers Natural Language Processing (NLP) and the intersection of NLP and Computer Vision (CV), i.e., Vision-Language Learning or Multimodal Machine Learning (MM). Building on structure-aware NLP and structure-aware MM, I integrate the SAIL idea into language models (LMs) for semantics understanding, i.e., structure-aware LM.

Research Interests

My research is organized into the following blocks, with selected publications [View complete publications]:

▶ A. Structure-aware NLP

  • Sentence-level Structural Modeling
    • Linguistic Parsing and POS Tagging
    • Syntax Parsing and Grammar Induction
    • Structured Information Extraction (IE), e.g., Named Entity Recognition (NER), Relation Extraction and Event Extraction
    • Structured Sentiment Analysis
    • Semantic Parsing, Semantic Role Labeling (SRL)
    • Structure-guided Text Generation (Conditioned Text Generation, Machine Translation, Summarization)
    • Coreference Chain Resolution
    • Syntax-aided Semantics Modeling
    • Universal Structured NLP
  • Dialogue-level Structural Modeling
    • Conversation Discourse Structure Parsing
    • Conversational Information Extraction
    • Conversational Semantic Role Labeling
    • Conversational Sentiment Analysis
  • Document-level Structural Modeling
    • Document-level Discourse Structure Parsing
    • Document-level Information Extraction
    • Document-level Sentiment Analysis

▶ B. Structure-aware MM

  • Structure Parsing
    • Multimodal Grammar Induction
    • Text/Visual/Video Scene Graph (SG) Parsing
  • Structure-based Multimodal Applications
    • Multimodal Sentiment Analysis
    • Multimodal Information Extraction
    • Multimodal Machine Translation
    • Vision Captioning
    • Cross-modal Retrieval
    • Vision-Language/Video Event Extraction (Situation Recognition, SRL)
    • Audio/Speech Modeling
    • Image/Video/3D Modeling
    • Text-to-Vision Generation

▶ C. Structure-aware LM

  • Language Modeling
    • Structure-aided Language Modeling
    • KG-enriched Language Modeling
    • Multimodal Language Modeling
    • Universal Language Modeling
  • LM-empowered Machine Learning
    • Prompt Learning/Tuning
    • In-context Learning
    • Instruction Tuning
    • Reasoning