About

Fun facts

  1. Leaving academia to apply my knowledge in real-world applications

    • Focus on NLP, data mining, and graphs
  2. Current side-projects

    • Unisearch: A vector-based search engine demo. The underlying neural network is based on CLIP, but trained on title-content and text-image pair datasets. The text encoder also supports 9 languages (English, Chinese, Spanish, Italian, Japanese, Korean, Vietnamese, German, French). Image search works well, but text search still needs some fine-tuning.

    • Today Headlines: A news aggregation website that uses experimental graph clustering across 5 different regions (US, Singapore, Taiwan, Malaysia). The aim is to optimize for speed; neither statistical nor neural network models are required.

  3. Active maintainer of these PyPI packages:

    • fastlangid: The only language detection library that supports Simplified Chinese, Traditional Chinese, and Cantonese

    • h5record: An easy-to-use, large-scale dataset format for PyTorch

  4. Multilingual: I speak 4 languages (English, Chinese, Cantonese, Malay)

  5. Published work:

Archived but interesting projects

  1. 2019 YouTube trending visualization

  2. 2019 ML metrics collection service: a Weights & Biases-like service

  3. 2018 Paper citation prediction

Contact me about any interesting research or idea exploration work

zhirui09400@icloud.com