Text matching in pharmacy: A Summer NLP project (part 1)
Introduction
- A friend invited me to work with their team on a data analytics project (freelance, one month). My motivations were to (1) practice NLP, (2) fill my free time before flights and (3) earn a bit of money. I had done two NLP projects before: one on text classification in the financial domain (which required a skill set similar to this project's), and one on learning embeddings for source code.
- Background: I was given two tables, each intended to record all medicines in Vietnam’s pharmaceutical market.
- The first table comes from an official government source. It is written mostly in Vietnamese and has ~45K rows and ~20 columns.
- The second table comes from a market research agency that also captured the market manually. It has ~90K rows and ~25 (slightly different) columns, and is written in (very concise and “notational”) English.
- They overlap and complement each other quite a bit.
- It is of business interest to “match” items from these two tables. Doing so would create a larger (more rows) and richer (more columns) dataset about the market, which would eventually be useful for (1) market trend analysis and (2) ad hoc lookup of a specific medicine.
- Research question: How can identical items from both tables be grouped together effectively? A matching group contains at least one item from each table, and all items in the group refer to the same real-life medicine.
- Remarks
- Optimization metrics include precision, recall, development time, and system reusability.
- This problem is somewhat analogous to Coreference Resolution, because the goal there is also to find groups of things that refer to the same real-world entity. Therefore, we can cast it as a classification problem: deciding whether two items, one from each table, refer to the same actual medicine.
- The data is quite messy.
- Of the ~20 and ~25 columns in the two tables, six are explicitly shared:
- product name (e.g., “GILOBA PHYTOSOME” or “Giloba”)
- manufacturer name (e.g., “MEGA LIFESCIENCES” or “Mega Lifesciences Public Company Limited”)
- molecule (e.g., “GINKGO BILOBA” or “Cao ginkgo billoba (dưới dạng ginkgo biloba phytosome) 40mg”)
- pack strength (e.g., “40MG” or “40mg”)
- pack form (e.g., “SOFTCAP” or “Viên nang mềm”)
- unit pack (e.g., “30” or “Hộp 3 vỉ x 10 viên”)
- The shared columns are usually written in very different ways, and the syntax varies a lot.
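To make the pairwise framing from the remarks concrete, here is a minimal sketch. It is not the method used in this project: the function names, the dictionary keys, the token-overlap score, and the 0.8 threshold are all assumptions made for illustration, and the hand-written rule in `is_match` only stands in for a trained classifier. Which table each example value comes from is also assumed, and the second government row is made up.

```python
import re
import unicodedata
from itertools import product


def normalize(text):
    """Lowercase, strip diacritics, and reduce punctuation to single spaces."""
    text = text.lower().replace("đ", "d")  # "đ" has no Unicode decomposition
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def token_overlap(a, b):
    """Overlap coefficient: shared tokens divided by the smaller token set."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def is_match(gov_row, mkt_row, threshold=0.8):
    """Hypothetical rule standing in for a trained pairwise classifier."""
    if normalize(gov_row["strength"]) != normalize(mkt_row["strength"]):
        return False
    score = (
        token_overlap(gov_row["product"], mkt_row["product"])
        + token_overlap(gov_row["manufacturer"], mkt_row["manufacturer"])
    ) / 2
    return score >= threshold


def group_matches(gov_rows, mkt_rows):
    """Merge matched cross-table pairs into groups with a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Brute-force comparison of every cross-table pair (fine for a toy example).
    for (i, g), (j, m) in product(enumerate(gov_rows), enumerate(mkt_rows)):
        if is_match(g, m):
            parent[find(("gov", i))] = find(("mkt", j))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


gov_rows = [
    {"product": "Giloba",
     "manufacturer": "Mega Lifesciences Public Company Limited",
     "strength": "40mg"},
    {"product": "Thuốc ví dụ",  # made-up row for illustration
     "manufacturer": "Công ty Ví Dụ",
     "strength": "500mg"},
]
mkt_rows = [
    {"product": "GILOBA PHYTOSOME",
     "manufacturer": "MEGA LIFESCIENCES",
     "strength": "40MG"},
]

print(group_matches(gov_rows, mkt_rows))  # [[('gov', 0), ('mkt', 0)]]
```

Because only cross-table pairs are ever linked, every resulting group automatically contains at least one item from each table. Comparing all pairs would mean roughly 45K × 90K comparisons on the real data, so a real system would need some way to narrow down candidate pairs first.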
To be continued…