Today I was sitting in an easy algorithms and data structures class. While I enjoyed the positive learning atmosphere there, the class was too easy. So I decided to think of something to do for the next classes, and I came up with the following (nerdy) question: How do we model the algorithmic thinking skill of good CS students? That initial question took me 3 hours straight of brainstorming, contemplating, and reading up. And it was a good evening. Below is a messy recap of the ideas that came to my mind this evening.

Firstly, I found a succinct definition of “modelling” in a YouTube video by a pharmaceutical scientist. They say the following:

A model is a simple representation of reality that helps us to understand how something works. But models can also help us to understand unknowns or predict what might happen.

I learned from Dr. KinHo Chan (at Fulbright) that science has four functions: describe, explain, predict, and control. So the quote above is basically saying that a model is the main outcome of science! As a bonus, the video lists some surprising examples of models in real life, including maps (I love maps!), charts, music sheets, and timetables. Then I realized I also love looking at these things, and now I can explain my interest in them: I like the dense amount of information packed into those models!

Secondly, I arbitrarily created the following algorithmic-thinking task: determine the Big-O time complexity of an algorithm (given in pseudocode or in a formal programming language). That was what the prof taught my class today, and I wondered how it might be “automated”.

I first reflected on how I solve that problem myself. I usually visualize the data being mutated throughout the process, thereby understanding how the process runs, and then do some rough calculations on the total number of steps needed. That works for me, but it might not be the best way to do it.
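To make that rough step-counting concrete, here is a tiny sketch (my own illustration, not a general method; bubble sort is just an assumed example): count the comparisons an algorithm makes for growing input sizes and check the counts against a candidate complexity like n².

```python
# Count the basic operations of a simple algorithm for growing input sizes,
# then compare the counts against a candidate complexity: here n^2 for bubble sort.

def bubble_sort_steps(data):
    """Sort a copy of `data` with bubble sort and return the number of comparisons."""
    arr = list(data)
    steps = 0
    for i in range(len(arr)):
        for j in range(len(arr) - 1 - i):
            steps += 1                       # one comparison = one "basic step"
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
    return steps

for n in (10, 100, 1000):
    steps = bubble_sort_steps(range(n, 0, -1))   # worst case: reversed input
    print(n, steps, steps / n**2)                # the ratio settles near 0.5, i.e. O(n^2)
```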

Notice that, if a computer follows these steps, we say it is doing symbolic reasoning. So, to give the computer the ability to solve this task, do we want the statistical learning paradigm or the symbolic one? I personally prefer the computer to represent its knowledge about this task symbolically, but it can acquire that knowledge however works best. This is actually called Symbolic Regression (a field I worked in quite a bit for my undergrad capstone). Anyway, I wondered whether representing all the knowledge about the world (including the algorithmic kind) in a computer program would work. Then many challenges arise: (1) What programming language should we choose – Python or machine code? (2) How do we provide training “data” for the computer to learn that knowledge? And (3) given such data, how might the computer update its “knowledge” according to the data?
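As a toy illustration of the symbolic flavour I have in mind (a brute-force sketch under my own assumptions – the leaves, operators, and target formula are all made up for the example, and this is not the method from my capstone), here is a symbolic-regression-style search that recovers a readable formula from data:

```python
# Minimal symbolic-regression sketch: brute-force search over tiny expression
# trees for a formula that fits the observed (x, y) pairs. The learned
# "knowledge" is a readable symbolic expression, not a weight matrix.

import itertools

# Observations secretly generated by y = x*x + 1
data = [(x, x * x + 1) for x in range(-3, 4)]

leaves = ["x", "1", "2"]
ops = ["+", "-", "*"]

def candidates(depth):
    """Yield expression strings whose tree depth is at most `depth`."""
    yield from leaves
    if depth == 0:
        return
    for op, left, right in itertools.product(ops, candidates(depth - 1), candidates(depth - 1)):
        yield f"({left} {op} {right})"

def fits(expr):
    """Check whether the expression reproduces every observation exactly."""
    try:
        return all(eval(expr, {"x": x}) == y for x, y in data)
    except Exception:
        return False

best = next(expr for expr in candidates(2) if fits(expr))
print("Recovered formula:", best)   # an expression equivalent to x*x + 1
```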

For questions (1) and (3), I think abstraction and instantiation must be two of the required abilities of such an algorithm. That seems like the best way to learn math (2 + 2 is 4, but the abstraction is the addition of 2 numbers) and to further understand the complex real world; see the small sketch below for what I mean. This looks like symbolic learning, but who knows whether statistical models can learn and silently represent these symbolic rules inside their networks.
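Here is a toy sketch of abstraction from instances (a hypothetical illustration of my own, loosely in the spirit of anti-unification; the facts and the '?' placeholder are made up for the example): wherever two concrete facts disagree, introduce a variable, and keep the resulting template as the abstraction that each fact instantiates.

```python
# Abstraction from instances: wherever two concrete facts disagree, introduce a
# variable; the result is a template that each original fact instantiates.

def generalize(fact_a, fact_b):
    """Return a template with '?' wherever the two facts differ."""
    return tuple(a if a == b else "?" for a, b in zip(fact_a, fact_b))

fact1 = (2, "+", 2, "=", 4)
fact2 = (3, "+", 1, "=", 4)
print(generalize(fact1, fact2))                      # ('?', '+', '?', '=', 4)

fact3 = (3, "+", 5, "=", 8)
print(generalize(generalize(fact1, fact2), fact3))   # ('?', '+', '?', '=', '?'), i.e. "the addition of 2 numbers"
```

Of course this naive template forgets the constraint that the result must equal the sum; capturing that relation is exactly where the hard part of abstraction lies.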

For (2), I think self-supervised data or real environment data is best, because it does not require tedious labelling efforts yet still seems close to how humans learn naturally. Interactive NLP is one example of this.
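A small sketch of what I mean by self-supervised data (my own toy example, with a made-up sentence): the “labels” are carved out of the raw data itself, for instance predicting the next word from the preceding words, so no human labelling is needed.

```python
# Self-supervised "data": build (context, target) pairs straight from raw text,
# where the target is simply the next word. No human annotation required.

text = "the quick brown fox jumps over the lazy dog"
words = text.split()

examples = [(tuple(words[:i]), words[i]) for i in range(1, len(words))]

for context, target in examples[:3]:
    print(context, "->", target)
# ('the',) -> quick
# ('the', 'quick') -> brown
# ('the', 'quick', 'brown') -> fox
```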

Fully answering these three questions is definitely over my reach in an evening. Finally, I reflected on how we humans could ever do that. And I realized we all start with a magic set of instructions coded in a sperm and an egg: 23 chromosomes from each, containing genes and DNA (I'm not really sure of the difference between those). And actually, Dr. Zador at CSHL has the same idea: AI needs a genome, as he said. Currently, the closest thing to a genome that the best chatbots like ChatGPT have is the network architecture itself (the Transformer). Everything else is nurture… Is that too little genomic information to compete with even a rat?