09 Exploring the genotype-phenotype associations of colorectal cancer using vector space model
  1. N Deng1,
  2. Du NK1,
  3. YN Feng1,
  4. ZY Wang1,
  5. HL Duan1,
  6. F Liu1,2
  1. Department of Biomedical Engineering, Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou, China
  2. General Hospital of Ningxia Medical University, Yinchuan, China


Background Colorectal cancer is a malignant tumour that endangers human lives. With the rapid development of molecular medicine, a great deal of research involving clinical and omics data has been published. Mining genotype-phenotype associations from these data is increasingly recognised as an effective approach to early-stage prediction of colorectal cancer.

Methods In this study, we propose a literature text-mining method for associating biomedical objects using the Vector Space Model (VSM). For each article, biomedical objects were represented as vectors in the VSM. Gene symbols were treated as genotype objects, and the MeSH terms annotated to the article were treated as phenotype objects. A TF-IDF algorithm was then used to quantify the correlation between genotype and phenotype objects.
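The weighting step described in Methods can be illustrated with a minimal sketch. The abstract does not state how the TF-IDF weights are combined into a correlation, so everything below is an assumption made for illustration: the toy `articles` corpus, the helper names `tfidf_vectors` and `association_scores`, the natural-log IDF, and the choice to score a gene-term pair by summing the products of their TF-IDF weights over the articles they share. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: each article lists the gene symbols found in its
# text and the MeSH terms annotated to it. Repeats stand in for term frequency.
articles = [
    {"genes": ["APC", "APC", "KRAS"], "mesh": ["Adenoma", "Colonic Polyps"]},
    {"genes": ["KRAS", "TP53"], "mesh": ["Neoplasm Metastasis"]},
    {"genes": ["APC"], "mesh": ["Adenoma"]},
    {"genes": ["TP53", "TP53"], "mesh": ["Neoplasm Metastasis", "Adenoma"]},
]


def tfidf_vectors(docs):
    """Return {object: {article_index: tf-idf weight}} for a list of token lists."""
    n_docs = len(docs)
    # Document frequency: number of articles in which each object appears.
    doc_freq = Counter(obj for doc in docs for obj in set(doc))
    vectors = defaultdict(dict)
    for i, doc in enumerate(docs):
        for obj, count in Counter(doc).items():
            tf = count / len(doc)                  # relative frequency in the article
            idf = math.log(n_docs / doc_freq[obj])  # rarer objects weigh more
            vectors[obj][i] = tf * idf
    return vectors


def association_scores(corpus):
    """Score each (gene, MeSH term) pair over the articles they share.

    Assumed scoring rule: sum of the products of the two TF-IDF weights.
    """
    gene_vecs = tfidf_vectors([a["genes"] for a in corpus])
    mesh_vecs = tfidf_vectors([a["mesh"] for a in corpus])
    scores = {}
    for gene, gvec in gene_vecs.items():
        for term, mvec in mesh_vecs.items():
            shared = gvec.keys() & mvec.keys()
            if shared:
                scores[(gene, term)] = sum(gvec[i] * mvec[i] for i in shared)
    return scores


scores = association_scores(articles)
# Print the pairs from strongest to weakest association.
for (gene, term), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gene:5s} {term:20s} {score:.4f}")
```

In this toy corpus a gene and a MeSH term that never appear in the same article receive no score, and a pair that co-occurs in several articles outranks one that co-occurs only once. A real pipeline would also need gene-symbol recognition and a score threshold, neither of which is specified in the abstract.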

Results A total of 473 242 articles related to colorectal cancer were retrieved from the MEDLINE database. From these we obtained 77 clinical terms and 490 genes highly related to colorectal cancer, yielding 2125 associations between the clinical terms and genes. Biological pathway analysis using the KEGG database showed that the mined genotype-phenotype associations cover all stages of colorectal cancer development, and a number of the associations relate to the early stage. These findings may be a useful complement to cancer translational research.

Conclusion Our study provides a biomedical literature mining method that can support cancer translational research, for example the construction of a precision medicine knowledge base, biomarker prediction and evaluation, and knowledge discovery in texts.

Acknowledgements This work was supported by the National Key Research and Development Program of China (No. 2016YFC0901703) and the Public Projects of Zhejiang Province, China (No. 2017C33064).
