Narendran Thangarajan

Search Engineer in 2024: Things to know

Sun, 31 Dec 2023 10:00:00 +0000

This blog is a living document, and I will continue to update it as I continue to learn more.

For a long time now, I’ve found connecting people to the right information truly meaningful and deeply energizing. Since high school, I’ve loved organizing information and things and serving them to people in a way that leaves them more knowledgeable and fulfilled – more on this in a separate personal blog. I’ve recently realized that information search at companies like Google and Uber (where I’ve worked previously) is literally a planet-scale incarnation of the same theme, and that is what drives me every day to gain skills and use them to solve relevant, meaningful problems. Search is about connecting a user’s verbalized intent (aka “query”) to the information/product (aka “document”) that will fully satisfy their need. The goal of this article is to take that previous line and unpack it component by component. Typical search systems have three components: retrieval, query understanding and re-ranking.

Retrieval

At the highest level of abstraction, search is about finding the documents which “match” the given query. This is technically called retrieval. There are typically two ways you can match a document based on the query.

Lexical Matching is based on the query and document having a lexical, aka word/token-level, match. Conceptually, lexical matching needs a map of query-like tokens pointing to documents. So a document like D1: “Dogs are animals” would become 3 different mappings: dog:D1, are:D1, animals:D1 (assuming “Dogs” is normalized to “dog” at indexing time). Now, a user query of “cat” would lead to no matches, while “dog” would retrieve D1 as a lexical match. In practice, we would have millions of documents, and hence the map (or index) looks more like dog:[D11, D13, D4, D9], where the order of the linked documents under a token is typically based on document popularity.
Semantic Matching involves augmenting the document with semantically matching tokens and indexing the document for those tokens as well. In the symbolic approach, we can add semantically relevant terms to the document (e.g. relevance feedback). With the wide adoption of neural networks, the more recent approach is to represent the document as a vector/embedding which captures the semantics of the full document, and to use the same neural encoder to get the query embedding. Then we use nearest neighbour search algorithms to find the N closest document embeddings to the given query embedding. Semantic matching also helps with precision since it naturally factors in the context in which the words occur, e.g. “milk chocolate” vs. “chocolate milk”.
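
To make the two matching styles above concrete, here is a minimal sketch in plain Python. Everything in it is illustrative: the `tokenize` helper uses a crude plural-stripping rule as a stand-in for real stemming, document D2 is made up, and the two-dimensional "embeddings" are hand-written toy vectors rather than the output of any real encoder. A production system would use a proper analyzer and an approximate nearest neighbour index instead of the brute-force scan shown here.

```python
import math
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip a trailing plural 's'.

    The plural rule is a crude stand-in for real stemming (an assumption).
    """
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?\"'")
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        if word:
            tokens.append(word)
    return tokens


def build_index(docs):
    """Build an inverted index: token -> list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for token in set(tokenize(text)):
            index[token].append(doc_id)
    return index


def lexical_search(index, query):
    """Return ids of documents sharing at least one token with the query."""
    matches = []
    for token in tokenize(query):
        for doc_id in index.get(token, []):
            if doc_id not in matches:
                matches.append(doc_id)
    return matches


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbours(query_vec, doc_vecs, n):
    """Brute-force scan for the n document ids closest to the query embedding."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:n]]


# Toy corpus (D2 is a made-up document) and hand-written toy embeddings.
docs = {"D1": "Dogs are animals", "D2": "My dog likes chocolate milk"}
index = build_index(docs)
doc_vecs = {"D1": [0.9, 0.1], "D2": [0.2, 0.8]}

print(lexical_search(index, "dog"))
print(lexical_search(index, "cat"))
print(nearest_neighbours([1.0, 0.0], doc_vecs, 1))
```

The lexical path only ever finds documents that share a token with the query, while the embedding path always returns its N closest documents, whether or not they are actually relevant. That is one reason a filtering or re-ranking step usually follows.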

Both matching techniques have their pros and cons: for instance, lexical matching works well irrespective of domain but has a fixed upper limit in terms of quality, while semantic matching brings in intelligence but its quality is limited by embedding quality (and transitively by the finetuning dataset size). If your domain has smaller queries and documents compared to web search, I would personally start with lexical matching first, and progressively add semantic retrieval as needed by the business. In more advanced deployments, we would see both retrieval strategies used simultaneously, with a blending layer to merge the search results from both before going to the re-ranking stage.

Also, a few things apply to both retrieval paradigms: challenges with scaling up index sizes, live indexing, filters to skip documents (user-selected, like a time range, or machine-inferred, like the query needing only “job” or “person” documents), and a maximum limit on how many documents we want to match before considering retrieval done.

Things to know in 2024:

Hybrid retrieval techniques are getting widely implemented. E.g. native support in Apache Solr, hybrid indexing in Pinecone. However, more commonly seen are custom solutions like Elasticsearch for sparse + Vespa/Weaviate/Milvus/Qdrant for dense retrieval. If there is a good overlap of candidates from both retrieval sources, then a classic, score-agnostic ranking technique called Reciprocal Rank Fusion (RRF) has also proven to work very well. If you want to find a good middle ground: starting with lexical matching and using SPLADE v2 for semantic expansion can take you a long way towards hybrid retrieval. Read up on the original SPLADE technique.
Two-tower siamese networks continue to be the industry SOTA in building embeddings for embedding-based semantic retrieval. Graph embeddings, say for users/documents, are used as input features on the document tower. E.g. Personalized Retrieval at Etsy.
LLM-based search experiences would be a step function capability in helping users. E.g. giving them product reviews, generating semantic tags for products, reducing the need for enterprises to build KG ontologies and maintain algorithms to convert unstructured information in the wild to map to their ontologies. Additionally, even with semantic retrieval, it is still non-trivial to answer questions like “healthy snacks for kids”, “thanksgiving dinner ideas” etc. This is a place where LLMs can directly help.
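
Since Reciprocal Rank Fusion comes up in the list above, here is a minimal sketch of how it merges two ranked lists using only rank positions, never the raw scores. The constant `k=60` is the value commonly used in the literature, and the document ids and both input rankings are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document gets 1 / (k + rank) from every list it appears in
    (rank starts at 1), and the contributions are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a lexical retriever and a dense retriever.
lexical_results = ["D1", "D2", "D3"]
dense_results = ["D2", "D4", "D1"]

print(reciprocal_rank_fusion([lexical_results, dense_results]))
```

Documents that appear in both lists (D1 and D2 here) float to the top, which is why the technique works best when the two candidate sets overlap well.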

Query Understanding

Given that the user query comes from… humans, expect it to be imperfect and non-canonical. Once the query is corrected, we need to enrich it with information needed for effective retrieval. Following are some steps you could expect to see in a query understanding system.

Spelling correction: Refer to SymSpell.
Multi-lingual complexities: Handling accent marks (diacritics) in languages like Portuguese and Spanish (e.g. azúcar), normalizing queries from different Japanese writing systems.
Compounding and decompounding
Expansions
Category classification: Many search systems today could retrieve results from multiple domains/verticals/categories.
Entity classification: Person vs. Job Title, Food vs. Restaurant, Celebrity vs. Product etc.
Named Entity Recognition: Segmenting the different parts of the query into different facets. E.g. “Naren who worked in Google” should be a Person entity search for the token “Naren” and employer facet being “Google”.
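
As a small illustration of the multi-lingual normalization step in the list above, here is a sketch using Python's standard `unicodedata` module. The function names are my own, and a real system would choose these rules per language rather than applying them blindly.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks so that "azúcar" also matches "azucar".

    Caution: this is only safe for languages where accents are optional
    in queries. For Japanese it would also strip voicing marks and change
    the word, so apply it per language.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_query(query):
    """Canonicalize width variants, case and whitespace.

    NFKC folds compatibility forms, e.g. half-width katakana and
    full-width Latin letters, into their standard forms.
    """
    text = unicodedata.normalize("NFKC", query)
    return " ".join(text.casefold().split())


print(fold_accents("azúcar"))
print(normalize_query("  ｶﾀｶﾅ  "))
print(normalize_query("Ｇｏｏｇｌｅ   Jobs"))
```

Whatever normalization runs on the query has to run on the indexed tokens as well, otherwise the two sides stop matching.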

Once done, we have corrected and enriched the query with a lot of metadata to help with both increasing the radius of retrieval (aka recall) and simultaneously scoping down to increase the relevance of results retrieved (aka precision).

Tools and Techniques to know in 2024:

LLMs with their instruction following + world knowledge capabilities make excellent query understanders. Depending on the use case, we could cache the query understanding result + run the LLM inference online for a different subset of queries. Note that we will have to run the same inferences on the document side as well to be able to match and boost.
Rethinking eCommerce search to be done on unstructured, textual data instead of structured data. Position paper by Instacart. However, I think relying on the LLM to generate product IDs is not a good idea, and we should expect inaccuracies.

Re-ranking

The final step after query understanding and retrieval is deciding the final order of results to show to the user. This is usually an optional step, and is typically needed to apply a different preference order. For instance, in eCommerce search, the goal for both the user and the platform is for the user to purchase from the shop or purchase the item being shown, which means ranking by probability of purchase, or by some combination of query-document relevance and ordering probability, both work well. Typically, there is also filtering to remove documents based on relevance conditions. Also, it’s not uncommon to have more than one re-ranking step.

Tools and Techniques to know in 2024:

XGBoost models are continuing to be strong baselines, with deep learning models used to eke out the final few percentage points in NDCG.
LLMs can help with evaluating ranking model performances, and also for generating relevance labels.
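
For readers who have not computed NDCG (mentioned in the list above) by hand, here is a minimal sketch. It uses the linear-gain variant, where the gain is the relevance label itself; another common variant uses 2^rel - 1 instead. The relevance labels in the example are made up.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each label is discounted by log2(position + 1)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """DCG of the ranking as served, divided by the DCG of the ideal ordering.

    `relevances` holds the graded labels of the results in the order shown.
    """
    ranked = relevances if k is None else relevances[:k]
    ideal = sorted(relevances, reverse=True)[:len(ranked)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg([3, 2, 1]))  # already in the ideal order
print(ndcg([1, 2, 3]))  # best result shown last
```

A perfectly ordered list scores 1.0, and pushing the most relevant result down the page lowers the score. The "final few percentage points" are measured on this scale.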

Closing Thoughts

Finally, like in any real-world project, you will run into a ton of real-world idiosyncrasies, which you get progressively better at handling with more experience, like the following:

Users usually prefer previously clicked documents (exact and similar). E.g. reordering the same meal, preferring the same 2% fat milk product, preferring Wikipedia for informational needs etc. So you might want to build that capability into all information surfaces. This is typically called personalization, and it significantly improves user satisfaction, especially for head/high-traffic queries.
In every layer and each component in that layer, we run into coverage vs. quality tradeoffs though in theory we need 100% on both. The choice in the tradeoff is highly dependent on the kind of product (ecommerce shopping, home assistant, information search results page, job search etc.), the kind of queries users are issuing on your platform and the number of documents we have to serve from.
While addressing the coverage vs. quality tradeoff, we inevitably run into machine learning solutions which come with their own complexities. In practice, it’s useful to have override systems which can be updated in real-time. This is usually implemented in a configuration language which takes effect without a server restart. E.g. the query classification ML model in your job search could classify “Naren” as a “Job Title” while in reality it should be “Person”, and an override lets you correct that immediately. Secondly, given that the rules of an ML model are driven by real-world user data, it is important to have a continuous update and continuous monitoring setup for relevant models.
Once your information search surface gets popular, it becomes a good surface to show sponsored information related to the user and the user’s intent – and a precondition here is convincing ourselves that sponsored results are win-win-win for the sponsor, the search platform and the user (who discovers new products). However, remember that anything inserted in the middle of a relevance-ordered list decreases user-perceived relevance, and hence it’s critical to continuously assess where we are in the tradeoff spectrum and adjust the triggering logic of sponsored items accordingly. In particular, it’s important to set up hold-out experiments to understand the long-term effects of such changes.
Like any engineered product, there is a tradeoff between adding complexity vs. maintainability. But adding complexity is required to bring out magical user experiences. My advice here would be to keep adding complexity as needed and periodically assess the direction and perform simplifications.
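
The override idea from the list above can be sketched in a few lines. This is only an illustration under my own assumptions: the class name, the `reload_overrides` method and the toy model are all hypothetical, and a real deployment would load the override table from a config service or file watcher rather than a Python dict.

```python
class QueryClassifier:
    """Wraps an ML classifier with a hot-swappable table of manual overrides."""

    def __init__(self, model, overrides=None):
        self.model = model
        self.overrides = dict(overrides or {})

    def reload_overrides(self, overrides):
        """Swap in a new override table at runtime, without a server restart."""
        self.overrides = dict(overrides)

    def classify(self, query):
        key = query.strip().lower()
        # Manual overrides win over the model's prediction.
        if key in self.overrides:
            return self.overrides[key]
        return self.model(query)


def toy_model(query):
    # Hypothetical stand-in for a learned model that makes the mistake
    # described in the text: it labels every query as a job title.
    return "Job Title"


classifier = QueryClassifier(toy_model)
print(classifier.classify("Naren"))   # the model's (wrong) answer

classifier.reload_overrides({"naren": "Person"})
print(classifier.classify("Naren"))   # corrected by the override
```

Overrides like this are a stopgap. The corrected examples should also flow back into the training data so the model itself improves over time.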

Will keep adding more to this document as I keep learning. Please let me know if I am missing any other component here.

Thank you!

Turning 30

Sat, 01 Aug 2020 10:00:00 +0000

Today, I turn 30.

I have been getting a lot of “You are 30 now! How do you feel?” and my answer has been simply “older”. This blog is an attempt to understand what getting “older” and especially what turning 30 means. This essay might make more sense if you are already like 25+ and I hope it is useful otherwise too.

Almost a decade back, I had written a similar blog about turning 20 which made for a fun read a few days back. Hopefully this blog is less embarrassing than that.

[12 minute read]

Thousands of Decisions and their Consequences

Turning 30 does mark a significant milestone in your life.

It concludes the first decade of a completely independent life. At this point, you have made thousands of decisions (large and small) affecting things ranging from yourself to your family to your career and to the society at large. Besides MAKING those decisions, by the age of 30 you’ve had the opportunity to witness and respond to the CONSEQUENCES of thousands of such decisions. No, I do not exaggerate when I say “thousands” of decisions. Imagine the gamut of decisions you have taken since your early 20s, which in your teens someone else was deciding for you:

On Education and Work:

Which university should I apply to?
Got admits! Which one to go to?
Should I get an education loan or see what luck has for me with on-campus jobs?
Which team should I join for the group project?
Hmm, I don’t have a side hustle yet and I don’t have money for next quarter’s tuition – should I silently go back home after this quarter?
Everyone in college looks so cool and confident. What should I do to feel less awkward?
Should I get a 6-inch from Subway and look cool in front of my new friends or get a foot-long and secretly hide half of it for dinner?
Which career path should I take after graduation?
Which shirt should I wear for that job interview?
Which of these opportunities promises a happy life and secure future?
Yay! First salary! Should I spend it on traveling or save it up for rainy days?
Where should I put this $100 - retirement account, stock/options trading, mortgage payments, better health insurance, 6-month emergency fund?
Do I need external monitors?
It’s 9 am, check email or get some work done?
Should we hire X or not?
Suggest this design or that?
Attend this meeting or skip?
It’s 3 pm, should I have a second coffee now?
…

On Personal Life:

What should I cook for dinner tonight?
Who should I watch the next Game of Thrones episode with?
Which class do I try? Piano/Triathlon/Tennis/Skiing/Swimming?
What to do this weekend? Movie/Clubbing/Sleepover/Board Games/Birthday Party?
Best friend is getting married! Vegas or Los Angeles for Bachelor Party?
Should I get married?
Wow, she looks good, could she be the One?
Repeat previous question N times.
Woohoo! I am getting married. Should I invite 10 people or 1000 people to my wedding?
Where should I go for a vacation this July 4th weekend?
Skyscanner.com or Kayak.com?
Canoeing or Kayaking?
AirBnb or Holiday Inn? Car camping or RV?
Is Top of the Rock worth the money?
Damn, I broke my legs! Drop-in at my doctor’s place or urgent care or emergency or call 911?
Should I get a pet?
Should I have a kid?
Which of the 8 different genetic tests do I want to take?
Huggies or Pampers or Cloth diapers?
Cocomelon or Word Party?
Disney or Universal Studios or both?
Should I sell the crib and donate all the baby clothes now?
Should I have a second kid?
Should I rent a bigger place then?
OMG $3000! Should I consider buying a house now or will the market crash due to a second deadly wave of CoViD?
Which school district do I want?
…

Well, you get it. Hopefully you are convinced now that I was not exaggerating about the “thousands” of decisions. And yes, you made all those decisions! How cool is that? And all these decisions, though simple and innocent in appearance, have significant repercussions on your life. These decisions define who you are. And that is exactly why turning 30 is a significant milestone in the formation of your own identity.

Now, nobody really trains you to make these decisions. You sort of stumble upon them and figure it out on the way. By the time you are 30, you have automatic responses to many of the above questions and spend your precious mental energy on the remaining new and exciting questions. No big deal, life moves on and you stumble on more unseen experiences. And as you might expect such a way of life will have ups and downs by design leading to moments of pride and pondering. On that note, I would like to contemplate on the top two life lessons from my decade of independent decision making: Logic Can Fail and Embrace Grey.

Learning #1: Logic Can Fail

Logic fails in surprising ways. By logic, I mean being too smart.

This is how I would describe my 25-year-old self: I have lived for 25 long years now and amassed tons of personal and vicarious life experiences. I have read voraciously from verified, reliable sources. I am trained in logical and critical thinking. I am up-to-date on politics, sports, finance, technology, science etc. I can articulate my points with impeccable logic backed by undeniable science and powerful personal/historical anecdotes.

In a nutshell, I felt simply invincible. Each time I crushed an argument and left the other person at a loss for words, the implicit feedback I got only reinforced my already strong convictions. I was right. I was right EVERY SINGLE TIME. Except that… I utterly failed. Turns out, each time the argument ended, the other person had simply given up on me or, in the case of my parents, punted the discussion to a later time when I would be more mature. You can be right, and still fail spectacularly.

Logic Fails… Due to Lack of Full Information

This is probably the biggest class of logic failures.

Logic provides a guarantee of correctness when provided with complete information. But in real life you never have complete information on anything you make decisions on. Even when you have all the relevant people in the room, there is no reason they should share all their knowledge, their feelings and more importantly, their deep-rooted insecurities with you. And many times you just won’t have the time and resources to hunt down all the relevant information. Logic can only do so much in this setting.

Turns out, as you grow older you naturally build an extensive graph of probabilities and useful mental models of what can and cannot happen in life. This makes decision making more accurate and easier over time. You stop judging people and find yourself saying “it happens” often. So, keep learning from your own and others’ lives, and when it comes to decision making, stay open to the fact that you could be missing/misunderstanding critical details.

Logic Fails… Due to Band-Aid Situations

There are times when you feel you have all the information in hand, but you simply don’t want or don’t have the time to fix the problem. This shows up a lot in relationships in life and work.

In engineering, there is a method of problem solving called Root Cause Analysis which involves identifying the root cause of a problem and then performing corrective actions to prevent re-occurrence. Some problems in life could take months of Root Cause Analysis and then could take years to fix. Yes, you read it right, YEARS. And depending on the thing or (more likely) the person you are acting upon, this could be completely worth the effort. However, you will often run into cases when you should deliberately accept failure of logic owing to lack of time where you disagree, shake hands and move on without fixing the root cause. When the problem surfaces again, you do the same – just patch it and move on. I call these band-aid situations.

Applying a band-aid does not mean lack of courage or perseverance on your part. In fact, more often than you think, this is good for both yourself and the others involved.

Logic Fails… Due to Deep-Rooted Beliefs

In the last decade, I’ve moved from making fun of deep-rooted beliefs my parents espoused, to convincing myself that I would recommend the same to my next generation (and be that “out-dated, unscientific, illogical” parent for the next couple of decades). I like to call this class of logic failures “subconscious safety nets” due to their ability to protect you from your conscious, rational self.

On the topic of buying Gold, following used to be a very typical conversation between me and my dad.

Me: "Dad, sooo.. have you sold Gold before?"
Dad: "Nope. In fact, I personally don't know anyone who ever did that."

Me: "So where do I sell gold when I need to sell it?"
Dad: "What do you mean? You NEVER, EVER sell your gold. You only keep buying it whenever you can."

Me: "Wait, that simply does not make sense. Why would I buy it then?"
Dad: "I cannot imagine how you can think of SELLING gold! Don't even think of touching it."

And I couldn’t fathom how I would have a chunk of my net worth sitting in a safe deposit box with no apparent purpose. My parents, as they usually do, would just accept the ridicule and move on to a different topic. In a few years, this topic vanished from our conversations. Then I witnessed something.. which proved to be a turning point.

There was a 30-something affluent, happy couple named Raj and Riya. They had been happily wed for 4 years and had a beautiful 2-year-old daughter. They had a wonderful house, business was booming, and things simply could not get better. And then, suddenly, Raj’s behavior started changing, and it changed rapidly with every passing day. He became short-tempered, overreacted, and a dreaded dementia started setting in. He was not open to treatment and started making blatant mistakes in his business and investments. He made bigger mistakes to cover up previous mistakes. When he finally agreed to visit a doctor, to everyone’s horror, he was diagnosed with Alzheimer’s. This had a huge impact on his confidence, and he turned to gambling to find the much needed acceptance. Before Riya could react, all their hard-earned money including their sprawling, coveted primary residence at the city center… was simply gone. Riya finally made peace with the situation, moved out with their daughter to her parents’ place and started a new life from scratch. Raj continued visiting her, asking for more money to gamble. After being told off a few times, Raj stopped bothering her. And in a few days, she learnt that he was… dead. Amid the devastation the family was going through, Riya was purging all their bank accounts to ensure they could not be misused further by Raj’s business partners. To her surprise, there was one account which was still untouched.

The account was their safety deposit box with.. Gold.

This experience, though second-hand, was powerful enough to etch in my mind the importance of having an absolutely untouchable asset, an asset which can endure catastrophes of our own pristine minds and protect families and generations to come from our own selves. As you can see, you can replace Gold with any financial asset here, but in my family’s case and in Raj’s case Gold turned out to be the traditional choice for the untouchable asset.

One other subconscious safety net where I’ve gone from mockery -> ignoring it -> outright rejection -----> acceptance is the indispensable role of God in life.

Following are some of my own quotes from the past decade in chronological order on the concept of God.

"I do not care if there is a God or not. Why are people so obsessed about it?"
"Seek scientific truth. Where is God? Do you see Him? Has anyone seen Him at all?"
"If there is no evidence, then God simply does not exist. The end."
"Have you even listened to Richard Dawkins"
"See, why do I need a God while science can and will fully explain the universe?"
"If there was a God, He wouldn't let his people suffer. Would He?"
"Why do you have to bribe God in temples. Don't you understand this is just a business transaction at this point?"
"I've never thought harm to an ant, why did God break my ankle and make me bed-ridden for months?"
"The concept of God blinds children from looking further into the truth!"
"Ok, God might apply to you. I do not need Him though. That said, I won't scoff at people who need Him."
"Damn! I get it now. God is a crucial part of a healthy society -- especially one that is 7bn+ people large."

Let’s leave it at that. If you are curious to know how the last transition happened, we should meet over coffee/dinner. That by itself warrants another essay.

So, when you find someone, especially smart and well-meaning elders, with an irrational belief or unexplainable practice – give them an opportunity to explain it to the best of their ability. There is a chance, however small, that there is a subconscious safety net waiting for you to explore.

Learning #2: Embrace Grey

By the time you are 30, most of the choices you are faced with are not as stark as black and white. If they are black and white, and you are still confused, talk to a buddy, get a cup of coffee or.. simply go to sleep and decide the next day. You got this.

Let’s talk about the Grey between the black and white.

On Conflicts: There are no Heroes and Villains: The famous quote from Aesop goes “A man is known by the company he keeps” and there is a lot of truth to it. You should assume people around you to be as smart/stupid and good-intentioned as you are. In that setting, when your friends John and Jane have a conflict, it is highly unlikely that one of them is a clear villain, so do not make haste to label them as such, even though it is easy and interesting to do so. The truth is usually somewhere in between a Hero and a Villain. As you navigate the space of evidence and decisions to resolve the conflict, you will come across many, many choices of Grey. The real skill lies in assigning each of those choices a different shade of grey, making a choice and articulating it effectively. So, get comfortable thinking and communicating in Grey.
On Measuring: There is no One Metric to rule them all: Step count alone does not determine your fitness and year-over-year profit alone does not determine long-term growth of a company. Many times, society will lead you to believe that there is a single magic number to base our decisions on and provide you products and services around that magic number. This spans across various walks of life like health, personal finance, health of relationships, nurturing kids, measuring work, optimizing investing, buying a home, tracking CoViD, and even general happiness in life. When society throws such a number at you, put it in perspective with other numbers around it (in space/time) and deliberately move it to a grey zone. I promise your decisions will be much better than otherwise.
On being emotional: Being Dispassionate vs. Passionate: One of the recurring themes of my 20s was experimenting between living a dispassionate vs. passionate life. You go about your life as usual - you meet your friends, you listen to their joys and sorrows, you go to work, you face ups and downs. But you can perceive these events either by deeply connecting with the event/person or by dispassionately passing through those events emotionally unaffected. In my early 20s I was heavily influenced by Stoicism and practiced living a dispassionate life. It worked really, really well in the USA’s individualistic culture for a long time. I had lots of very good friends and maintained an excellent relationship with my family. I went from being single to getting married, and then one day suddenly the ideal of dispassionate living stopped working for me – the arrival of my daughter changed things quite drastically. I cared so deeply for my daughter that I simply couldn’t have my brain regulate my heart every time something unnecessary was enforced on her. I’ve moved considerably to the passionate side of the scale and I feel more human now.

"Discovering more joy does not, I’m sorry to say, save us from the inevitability of hardship and heartbreak. In fact, we may cry more easily, but we will laugh more easily, too. Perhaps we are just more alive. Yet as we discover more joy, we can face suffering in a way that ennobles rather than embitters. We have hardship without becoming hard. We have heartbreak without being broken."

- Archbishop Desmond Tutu, in The Book of Joy: Lasting Happiness in a Changing World.

On thinking styles: Analytical vs. Creative: Analysis and creation fundamentally require two different mindsets. Analysis requires you to observe a phenomenon carefully from different angles, attend to its details, understand patterns, and come out with an insight which informs your next step. Creation requires you to make a wide range of observations, completely disconnect and let your subconscious brain make connections which did not exist before. To be a successful and happy individual in life, you need both. Success, to some extent, can be chased by over-leveraging analytical thinking. However, creativity is what lets your originality and authentic self shine through, leading to exponential jumps in career and in personal life. I am convinced at this point that you can’t be creative and analytical at the same time. The Grey here is to allot time for both creative and analytical thinking in a given time interval so that on the whole your thinking style looks Grey.

So, get comfortable with grey and when given options A and B, choose C. Your identity is pretty much where you fall in this infinite-dimensional spectrum of grey. As you grow older you will add more dimensions and higher resolution to each of the spectrums, and continue taking a stand on where you fall within each of them making life richer and easier in the process.

Closing Thoughts: Living Fully

With a human life expectancy of ~79 years (CDC), turning 30 is a moment when you realize that life is short and long at the same time. I’ve always lived in competitive environments from my early childhood. I’ve forever been driven to reach some “goal” at any point in my life – get 100/100 marks, get that gold medal, get into a top university, get hired by a top company, get promoted every 2 years, etc. If I reach the goal I am awesome and acknowledged, otherwise I am a disaster and should rethink the purpose of my life. Interestingly, society keeps manufacturing and surfacing these “goals” and has well-established, evolved multi-agent processes and reward structures around them. I kept my head down and slogged tirelessly to see the end of this tunnel so I can finally start to… you know.. live, whatever that means. Somewhere in this daisy-chained rat race, my parents felt I was doing great on my own and let me drive by myself. Wait, what?! Where are you going?

"The journey is the destination" - Dan Eldon

By the time you are 30, you know a lot more about what you really want and what the world can offer. You realize that the world can give you a much richer and fuller life if you simply pause occasionally, look around, take a detour and smell those byway flowers rather than repeatedly devise optimized route plans and drive straight to the made-up destinations. It’s the journey that matters. Note that this does not mean you become less ambitious, in fact surprisingly you become highly ambitious on a lot of things.

Looking forward to the next decade of life!

The Differentiable Bread Toaster: A Discontinuity in Modern Software Systems

Wed, 20 May 2020 10:00:00 +0000

In the last couple of years, I’ve had the opportunity to improve some highly-used software systems applying recent developments in machine learning (ML). Amid all the joys and the noise in developing and deploying ML models in production, there is a nagging discontinuity in both the thought-process and actual software development which has felt unnatural to me. The discontinuity is between the Java/C++/Python/Go programs we write as part of standard software development and the artifacts of ML i.e. learned models. In this post, I will quickly describe the problem and a potential opportunity to address it using differentiable programming.

[8 minute read]

What is Differentiable Programming

Differentiable programming allows you to write computer programs that can be differentiated with respect to their inputs. This has been possible for decades now, either via numerical approximations to differentiation (which suffer from inaccuracies) or using symbolic differentiation techniques like those in Python’s SymPy package and in Wolfram (which suffer from a lack of the speed and flexibility required in practice). Recent improvements in Automatic Differentiation have allowed libraries like AutoDiff, Autograd, and JAX (which is used in this post) to address these concerns, enabling differentiation as a first-class tool in mainstream software development.
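
To make the difference between numerical approximation and automatic differentiation tangible, here is a small pure-Python sketch. It is not how JAX works internally: it is a toy forward-mode implementation using dual numbers, written only for illustration, and the `Dual` class, `derivative` and `finite_difference` are names I made up. It supports just addition and multiplication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A number that carries its value and its derivative together."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u * v' + u' * v
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, i.e. with derivative 0."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def derivative(f, x):
    """Exact derivative of f at x, obtained by seeding dx/dx = 1."""
    return f(Dual(float(x), 1.0)).deriv


def finite_difference(f, x, h=1e-3):
    """Numerical approximation; its error depends on the step size h."""
    return (f(x + h) - f(x)) / h


def squared(x):
    return x * x


print(derivative(squared, 4))           # exact: 2 * 4
print(finite_difference(squared, 4.0))  # close to 8, but off by roughly h
```

The automatic version applies the chain rule operation by operation as the program runs, so it is exact up to floating-point precision, while the finite-difference version has to trade truncation error against round-off error when choosing `h`.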

Following is a very simple example of a differentiable program. By the end of this post, I will give an idea of how composing such simple programs enables us to build software systems which were hitherto hard to build or even inconceivable.

from jax import grad

def squared(x):
  return x**2

# grad only accepts floating-point inputs, hence 4.0 rather than 4.
print(squared(4.0)) #  x^2 = 16
print(grad(squared)(4.0)) #  First order derivative = 2*x = 8
print(grad(grad(squared))(4.0)) #  Second order derivative = 2 for all values of x.
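
To make the earlier remark about numerical approximations concrete, here is a minimal sketch in plain Python (no JAX needed) that compares a forward finite difference against the known derivative of squared. The helper name forward_difference and the step sizes are my own choices for illustration, not something from a library.

```python
def squared(x):
    return x ** 2


def forward_difference(f, x, h):
    """Approximate f'(x) with the forward difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


exact = 8.0  # d/dx x^2 = 2x, evaluated at x = 4

# A large step suffers from truncation error, while a tiny step suffers from
# floating-point cancellation. Neither recovers the derivative exactly.
errors = {}
for h in (1e-1, 1e-4, 1e-8, 1e-15):
    approx = forward_difference(squared, 4.0, h)
    errors[h] = abs(approx - exact)
    print(f"h={h:g}  approx={approx!r}  abs error={errors[h]:.3g}")
```

The error shrinks as the step gets smaller and then grows again once the step approaches machine precision, which is the trade-off automatic differentiation avoids.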

Current Applications and State of Differentiable Programs

Differentiation has been largely applied in scientific computing for decades now and following is a small sample from the application domains.

Finance: Black Scholes equation for pricing options.
Physics: Equation of motion, Heat dissipation, Brownian motion.
Biology: Population growth models, Neuron action potentials.
Chemistry: Rate equation to predict rate of change of concentration of reactants and products.
Archaeology: Carbon dating.
Web: Content ranking, Click prediction.

And we have lots of tools to perform differentiation using computers. Today, there are Domain Specific Languages like MATLAB, R, Julia & Wolfram, and General-purpose Programming Languages like C++ and Python which have extensions to do differentiation. However, when thought about in the context of mainstream software development, they suffer from two problems.

Firstly, to make the discussion easier, let’s pick a specific area and programming language, say Supervised ML and Python. There is a dichotomy in thought and code when we think about where our classical programs end and supervised “intelligence” begins today. For instance, where do TensorFlow SavedModels and pickled PyTorch models fit in our software? At best, we can think of them as opaque learned functions with rigid expectations on inputs and outputs. As a software developer if I have to modify their functionality a little bit, there is a completely different set of tools and terminology I have to work with – let’s call it the Software 2.0 workflow. This dichotomy shows through right from the code, to the developers’ perception, the surrounding workflow, inter-team coordination and bubbles up all the way to the job listings from companies (like SWE, SWE-ML, ML-SWE, Research Engineer, Applied Scientist in ML etc.) and university courses/programs.

Secondly, today’s Supervised ML frameworks are too complex and have too many assumptions built in. The latter can be worked around by using more granular components of a framework like TensorFlow. The argument on complexity still remains. When I got introduced to ML, I was astounded at the level of complexity in systems like TensorFlow and PyTorch. What are they? Libraries/runtimes/programming languages? Even before I answered those questions, I had gotten used to the workflow and deferred thinking about it. Over time I realized that bringing together the core pieces required for mainstream machine learning is very, very hard. As this blog from the Julia team puts it, machine learning is heavy on numerics, derivatives and parallelism, which explains the complexity in developing the ML systems of today.

Against that backdrop, it is exciting to see a coherent suite for ML coming together in Python. There is always NumPy to take care of the numerics. Recent developments in well-thought-out Automatic Differentiation libraries can take care of the derivatives and parallelism aspects while also allowing programming flexibility by differentiating through constructs like variables, loops and conditionals. Woohoo! Ok, what can we do with this capability?
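
To give a feel for what “differentiating through loops and conditionals” means, here is a minimal sketch of forward-mode automatic differentiation using dual numbers in plain Python. It is not how JAX is implemented, and the names Dual, derivative, power_by_loop and piecewise are made up for this illustration. Each value carries its derivative along with it, so ordinary Python control flow just works.

```python
class Dual:
    """A value paired with its derivative (forward-mode automatic differentiation)."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(x):
        """Treat plain numbers as constants, i.e. with derivative 0."""
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def derivative(f, x):
    """Evaluate f'(x) by seeding the input with derivative 1."""
    return f(Dual(x, 1.0)).deriv


def power_by_loop(x, n=3):
    """x**n computed with an ordinary Python loop."""
    result = 1.0
    for _ in range(n):
        result = result * x
    return result


def piecewise(x):
    """x**2 for positive x, 3*x otherwise: an ordinary Python conditional."""
    if x.value > 0:
        return x * x
    return 3.0 * x


print(derivative(power_by_loop, 2.0))  # d/dx x^3 = 3x^2
print(derivative(piecewise, 4.0))      # branch taken: x^2, so the derivative is 2x
print(derivative(piecewise, -1.0))     # branch taken: 3x, so the derivative is 3
```

Only addition and multiplication are supported here; a real library covers far more operations and also offers reverse mode, which is what makes gradients of many-parameter programs cheap.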

Using Differentiation as a first-class programming primitive

To some approximation, software developers observe how humans communicate and exchange in the real world and model those behaviors in computer programs. In this paradigm, f(x,y) = x + y will always return x + y irrespective of the actual real-world function that f has to approximate. And when the underlying assumptions on real-world behaviors do change, we have channels like monitoring, product forums, brand analysis etc. to detect a shift in expectations, re-design, re-implement, rinse and repeat.

But what if programs could automatically adapt and rewrite themselves over time? Just as humans grow intelligent by composing knowledge extracted from daily observations on top of existing knowledge, can we continually feed soft rules into the program and make it increasingly aware of its purpose in life? Yes, we can. A ~20 line artificial neuron (code) could arguably be the smallest unit of such a composable, learning program. And here is a tiny demo where we learn the rule “given three numbers, always return the second number”. The differentiable program learns it with just 20 examples! Composing this idea of an artificial neuron with some structural assumptions, we get differentiable data structures and programming constructs like Attention, RNNs, Pooling techniques and even CNNs. These could be considered differentiable equivalents of hashmaps, loops, conditionals and image filters. So in recent ML trends, we have been exploiting differentiable image filters to make sense of images and differentiable hashmaps to keep track of the words in a document.
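
As a rough illustration of the “return the second number” rule, here is a sketch of a single linear neuron trained on 20 random examples. It is not the code linked above: it uses plain Python with hand-written gradients instead of an automatic differentiation library, and the learning rate, step count, input range and random seed are all assumptions picked for the example.

```python
import random


def predict(weights, bias, xs):
    """A single linear neuron: weighted sum of the inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, xs)) + bias


def train(examples, learning_rate=0.1, steps=2000):
    """Fit the neuron with full-batch gradient descent on mean squared error."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    n = len(examples)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for xs, target in examples:
            error = predict(weights, bias, xs) - target
            # d/dw_i of error**2 is 2 * error * x_i; d/db is 2 * error.
            for i, x in enumerate(xs):
                grad_w[i] += 2.0 * error * x / n
            grad_b += 2.0 * error / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias


rng = random.Random(0)  # fixed seed so the run is repeatable
inputs = [[rng.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(20)]
examples = [(xs, xs[1]) for xs in inputs]  # the rule: return the second number

weights, bias = train(examples)
print([round(w, 3) for w in weights], round(bias, 3))
print(round(predict(weights, bias, [5.0, -2.0, 9.0]), 3))
```

If training goes as expected, the weight on the second input approaches 1 and everything else approaches 0, so the neuron has effectively written the program “return xs[1]” on its own.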

We don’t need to stop with the differentiable versions of data structures and programming constructs.

Let’s look at programming from the other direction i.e. from top to bottom. I have a physical process I want to emulate, and I have very little understanding of how it works. However, I can observe how it operates under different inputs. To help our imagination, let’s consider a simple bread toaster (because… why not?) which can take in a varying number of bread slices, say N, and toasts them in an unknown f(N) seconds. I have been manually powering down this toaster all these days and I really want to automate it. Let’s consider three hypotheses about f(N).

f(N) = 30 * N // It says 30 seconds on the toaster’s manual, no learning needed.
f(N) = 30 * N // It’s not in the manual, needs a learning program.
f(N) = 7th order Taylor approximation of sin(N) // It’s not in the manual, needs some biased learning program.
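
For readers who want to see what the third hypothesis looks like as code, here is a small sketch of the 7th order Taylor approximation of sin about 0, i.e. x - x^3/3! + x^5/5! - x^7/7!. The function name taylor_sin and the sample inputs are my own choices for illustration.

```python
import math


def taylor_sin(x, order=7):
    """Sum the Taylor series of sin(x) about 0 up to the given odd order."""
    total = 0.0
    for k in range(1, order + 1, 2):  # k = 1, 3, 5, 7
        sign = -1.0 if (k // 2) % 2 else 1.0  # signs alternate: +, -, +, -
        total += sign * x ** k / math.factorial(k)
    return total


# Compare the polynomial with the true sine at a few points.
for n in (0.5, 1.0, 2.0, 4.0):
    print(n, taylor_sin(n), math.sin(n))
```

The polynomial tracks sin closely near 0 and drifts away for larger inputs, which is why a program that has to learn it needs some built-in bias about the shape of the function.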

Our differentiable bread toaster (code) can handle all these cases now. And we can improve its accuracy further by adding well-known physical phenomena like the equation for heat dissipation + structure of the grill inside the toaster and Automatic Differentiation would effortlessly differentiate it for you. Similar ideas have been seen in robotics with differentiable physics engines, improving computer vision with differentiable modules for image transformation + camera calibration and dramatically improving computer graphics rendering using differentiable ray tracers. Tesla’s AutoPilot can be framed as a gigantic differentiable program where we plug in high-precision algorithms or learned models for different aspects of driving and compose them in a principled manner with appropriate differentiable data structures and algorithms to output steering, braking and acceleration. As we can see, the possibilities are endless.

Let’s build!

I strongly believe differentiable programming can address the aforementioned discontinuity in modern software systems by seamlessly blending hand-coded and learned rules. Plus, we can leverage all the existing hard-earned, well-tested knowledge on algorithms and data structures to tackle grander challenges in software engineering. It’s time to differentiate our bread toasters and build the next “AutoPilot”.

If our small minds, for some convenience, divide this glass of wine, this universe, into parts -- physics, biology, geology, astronomy, psychology, and so on -- remember that nature does not know it! - Richard P. Feynman

References

JAX, Engineering Applications of Differential equation, Differential programming languages, Software 2.0, Tailor-made ML language, Differentiable Programming

MongoDB vs. RethinkDB

Thu, 04 Jun 2015 16:00:00 +0000

As a student, I prefer building prototypes before building the full-fledged systems. So when it comes to choice of data storage, firstly, I would prefer having no schema enforced on my data since it will almost always change with time. Secondly, once the system is up and running, I would expect robust data collection, efficient querying and easy-to-use analytics systems to make sense of the data. Ever since MongoDB was initially released in 2009, I have always used it for all my personal and research projects. However nowadays, I have started seeing a lot of blog posts which complain about the performance of MongoDB. So in this blog post, I attempt to study and compare document-based databases MongoDB and the recent favorite, RethinkDB, since I am starting to consider RethinkDB for my future projects (though it has been around since 2012). One common paradigm which I like in both is that they store data in the same way it is (most likely) used in the application layer, i.e. as JSON documents, reducing impedance mismatch in data representation.

Philosophy

RethinkDB, which has been in active development for three years now, was designed with a bottom-up perspective with the design goals of ease of use, high availability and high scalability in mind. From the MongoDB architecture guide, we can understand that MongoDB was designed top-down by taking an existing system (like MySQL) and enhancing it with dynamic schemas while still providing the relational database features like indexes and updates.

Data Model

Firstly, let’s talk about the physical data model. In MongoDB, data is stored as BSON documents while in RethinkDB, data is stored as JSON documents. One of the advantages of BSON documents is that in addition to the existing JSON datatypes (string, number, object, array, boolean and null) it provides int, long and double types. The full list of BSON types can be found here. This enables more granular comparison (and hence sorting) of data. For instance, since MongoDB comes with an inbuilt date type, it is straightforward to construct date range queries, which I have found to be very useful while working with Twitter data for my recent projects.
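
As a sketch of what such a date range query looks like, the snippet below builds the filter document using MongoDB’s $gte and $lt comparison operators. The helper name date_range_filter, the field name created_at and the collection name tweets are hypothetical and only there for illustration.

```python
from datetime import datetime


def date_range_filter(field, start, end):
    """Build a MongoDB filter matching start <= field < end."""
    return {field: {"$gte": start, "$lt": end}}


# Hypothetical field name and dates, for illustration only.
query = date_range_filter("created_at",
                          datetime(2015, 5, 1),
                          datetime(2015, 6, 1))
print(query)
# With pymongo this dict would be passed as the filter, e.g.
# db.tweets.find(query), where `tweets` is a hypothetical collection.
```

Because the driver stores Python datetime values as BSON dates, the comparison happens on real dates rather than on strings.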

@coffeemug: RethinkDB implements an extended version of JSON, so it supports additional data types (like dates, geometric primitives, etc.) It isn't necessarily better or worse than BSON -- there are a few nuanced tradeoffs (e.g. JSON library implementations in languages with smaller ecosystems are better), but from the functionality perspective both approaches are very similar.

Secondly, talking about conceptual data model, both RethinkDB and MongoDB are pretty simple. From MongoDB’s documentation:

A MongoDB deployment hosts a number of databases. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs. Documents have dynamic schema.

Similarly, a RethinkDB deployment has a bunch of databases. Each database has a bunch of tables, where each table contains related documents (which have dynamic schema).

Query Language

Instead of introducing a new query language, MongoDB keeps it simple by hooking directly into the programming language from which it is called. RethinkDB, however, has introduced a new query language called ReQL, which is inspired by functional languages like Haskell. If you love the functional programming paradigm, you will love ReQL as well since it provides the same simplicity and power. Also, the functional paradigm makes it easy to parallelize execution across multiple cores/servers/datacenters.

Performance

The scripts used for performing the following performance study can be found here. The following readings were taken on a mid-2013 MacBook Air with a 1.7 GHz Intel Core i7 (2 cores), 8 GB of 1600 MHz DDR3 memory, 256 KB L2 cache (per core), 4 MB L3 cache and a 250 GB APPLE SSD. The system ran OS X 10.10.2 (Yosemite), and MongoDB version 2.6.5 was accessed with pymongo version 2.6.3. To access RethinkDB, version 2.0.0-2 of the Python rethinkdb package was used on RethinkDB 2.0.2. To make the results glanceable, I have plotted the average execution time with a blue line in all the plots below. All the performance numbers are in microseconds (us).
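
The measurement approach can be sketched roughly as follows. This is not the linked script, just a minimal stand-in under my own assumptions: the helper names time_operation and summarize are made up, and the workload is a placeholder where a real run would issue a database insert or read.

```python
import statistics
import time


def time_operation(operation, iterations=1000):
    """Run `operation` repeatedly and return each latency in microseconds."""
    latencies_us = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        latencies_us.append((time.perf_counter() - start) * 1e6)
    return latencies_us


def summarize(latencies_us):
    """Average and variance of the measured latencies."""
    return {
        "avg_us": statistics.mean(latencies_us),
        "variance": statistics.pvariance(latencies_us),
    }


# Stand-in workload; a real run would call an insert or a find here.
samples = time_operation(lambda: sum(range(100)), iterations=1000)
print(summarize(samples))
```

Keeping every individual latency, not just the running total, is what makes it possible to plot the readings and report the variance alongside the average.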

Write Performance

MongoDB writes (Avg. : 37130 us)	RethinkDB writes (Avg. : 12680 us)

As we can see clearly, there is a remarkable difference in the write performance. Besides the higher average latency, MongoDB has a higher variance of 1744 us as compared to 1468 us in RethinkDB. This is where MongoDB’s write concern gives us a choice. The following plots show the performance differences with the three different write concerns in MongoDB. I haven’t tested the replica write concern since I was working on a single-server configuration.

MongoDB Unacknowledged (115.00 us : memory, maybe)	MongoDB Acknowledged (265.2 us : memory)	MongoDB Journaled (37130 us : disk)

So the right way to compare with RethinkDB’s write performance would be with the Journaled write concern which actually persists data to disk (which is RethinkDB’s default write behaviour). In terms of speed of persistent writes, RethinkDB wins. Thanks to the recent comment from RethinkDB cofounder @coffeemug, I realized that RethinkDB also offers more flexibility when it comes to writes.

@coffeemug: RethinkDB also offers a way to configure the write mode (see the `durability` option in http://rethinkdb.com/api/javascript/run/ and `write_acks` option in http://rethinkdb.com/docs/consistency/). Everything is fully configurable, and there is full support for high availability in case of hardware failure.
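
To summarize the write settings discussed above, here they are written out as plain data. The w and j keys are MongoDB’s write concern options and "hard"/"soft" are the values of RethinkDB’s durability option; the dictionary names and the helper persists_before_ack are made up for this sketch, and the one-line descriptions are my own paraphrase.

```python
# Write-durability settings discussed above, written out as plain data.
MONGODB_WRITE_CONCERNS = {
    "unacknowledged": {"w": 0},        # fire and forget
    "acknowledged": {"w": 1},          # server confirms receipt (in memory)
    "journaled": {"w": 1, "j": True},  # confirmed after the journal write
}

RETHINKDB_DURABILITY = {
    "hard": "acknowledge after the write is committed to disk (default)",
    "soft": "acknowledge as soon as the write is held in memory",
}


def persists_before_ack(concern):
    """True when a MongoDB write is acknowledged only after journaling."""
    return bool(concern.get("j", False))


for name, concern in MONGODB_WRITE_CONCERNS.items():
    print(name, concern, "durable:", persists_before_ack(concern))
```

Only the journaled setting is comparable to RethinkDB’s default "hard" durability, which is why that pairing is the fair one for the timing comparison.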

Read performance

For studying the read performance in both databases, the generated primary key fields were used for reading a single document. MongoDB generates an _id field and RethinkDB generates an id field when a primary key is not part of the input document.

Update: The following readings have been updated after the RethinkDB team reported the problem with using `find` in MongoDB. Based on that, the script used for measuring read timings has been updated.

MongoDB reads (cached) (Avg. : 270.77 us)	MongoDB reads (not cached) (Avg. : 452.54 us)

In either case, the first call to find() takes longer since it involves OS-level operations like getting a file handle to the appropriate BSON file. However, since we are working with 1000 iterations, the effect of the first call is averaged out. In real-world deployments, MongoDB will have huge memory and cache at its disposal. So, to be fair, I felt it was necessary to show how MongoDB performs in both cases: when data is in cache and when data has been flushed out of cache.

RethinkDB reads (cached) (Avg. : 558.4 us)	RethinkDB reads (not cached) (Avg. : 740.2 us)

In terms of read performance, MongoDB beats RethinkDB by a wide margin, though I do not clearly understand why. Since map-reduce queries can be considered a specialized kind of read query, let’s talk about them here. RethinkDB parallelizes map-reduce operations across shards and CPU cores whenever possible (source: RethinkDB documentation). However, MongoDB’s map-reduce queries are painfully slow since they use only one core even if you have 32 cores (source: experience). In case you have a use case that executes map-reduce on MongoDB, I would suggest translating it to aggregation queries, which run on MongoDB’s (relatively recent and much faster) C++ based aggregation framework.
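
As a sketch of what such a translation looks like, the snippet below counts documents per user in two ways: a plain-Python map-reduce that mirrors the map and reduce steps, and the equivalent aggregation pipeline using the $group and $sum operators. The documents, the field name user and the helper map_reduce_count are hypothetical and only serve the illustration.

```python
from collections import Counter

# Hypothetical documents and field names, for illustration only.
tweets = [
    {"user": "alice", "text": "hello"},
    {"user": "bob", "text": "hi"},
    {"user": "alice", "text": "again"},
]


def map_reduce_count(docs, key):
    """Plain-Python map-reduce: emit (key, 1) per document, then sum per key."""
    emitted = [(doc[key], 1) for doc in docs]  # map step
    counts = Counter()
    for k, one in emitted:                     # reduce step
        counts[k] += one
    return dict(counts)


# The same computation expressed as an aggregation pipeline. With pymongo
# it would be passed to collection.aggregate(pipeline).
pipeline = [
    {"$group": {"_id": "$user", "count": {"$sum": 1}}},
]

print(map_reduce_count(tweets, "user"))
print(pipeline)
```

The pipeline is just data, so the database can plan and execute it natively instead of running user-supplied map and reduce functions.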

Real time feeds

With the rise of social media, we have a lot of sources which provide real-time data. To make better sense of the high-volume, high-velocity data we see every day, we need some processing to be set up on the arrival of each data point (aka event). The processing can be sending the data point to a front-end which in turn plots it on a map, like the Tweet Ping dashboard, or using the data point to recalculate Bollinger Bands in a chart showing real-time trends. With the advent of Apache Spark, real-time machine learning has become much easier as well. All that said, such processing should be done without much overhead when using high-velocity data. For instance, if you use the most basic sample hose streaming API from Twitter to get a 1% sample of all global tweets, you would get roughly 70 tweets per second.
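
To illustrate per-event processing, here is a minimal sketch that recalculates Bollinger Bands every time a new data point arrives. It assumes the common definition (a moving average plus and minus a number of standard deviations over a fixed window); the class name, the default window of 20, the default of 2 standard deviations and the sample values are my own choices for the example.

```python
from collections import deque
import statistics


class BollingerBands:
    """Rolling mean +/- k standard deviations over the last `window` values."""

    def __init__(self, window=20, num_std=2.0):
        self.values = deque(maxlen=window)  # old points fall off automatically
        self.num_std = num_std

    def update(self, value):
        """Add one data point and return (lower, middle, upper)."""
        self.values.append(value)
        middle = statistics.mean(self.values)
        spread = self.num_std * statistics.pstdev(self.values)
        return middle - spread, middle, middle + spread


bands = BollingerBands(window=3)
for price in [10.0, 12.0, 14.0, 16.0]:
    print(bands.update(price))
```

Each update here recomputes the statistics over the whole window, which is fine for small windows; at high event rates one would keep running sums instead.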

MongoDB’s advised way of going about it is to use tailable cursors, which were inspired by the tail -f command in UNIX systems that we generally use to monitor log files. However, the catch is that tailable cursors can be created only for special collections called capped collections. Making a collection capped is possible only when we create the collection. This basically takes two lines in Python (shown here with the pymongo 2.x API used in this post).

db.create_collection("capped_collection", capped=True, size=1000)

cur = db.capped_collection.find(tailable=True)

Tailable cursors and capped collections work only for insertions. If you are interested in all operations on the collection, the way to go is to poll the operations log, called the oplog. There are tools built around this, like mongo-watch, to make our lives easier. And the oplog approach doesn’t require any special collections.

RethinkDB provides a very intuitive way to get notifications on changes. ReQL’s composability lets us chain operations in any order, giving us much more flexibility than is possible with MongoDB. A simple ReQL snippet like

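import rethinkdb as r  # driver 2.0.x, as used in this post
connection = r.connect("localhost", 28015)  # default host and port
for change in r.db("test_db").table("test_table").changes().run(connection):
    print(change)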

gives us real-time notification. You can take a look at the documentation on this feature to appreciate the level of flexibility it provides.

Hopefully, this blog post gave an overall idea of where RethinkDB and MongoDB stand and whether they are suitable for your project. In my case, given the tolerance to lossy writes on machine failure and the need for faster reads & writes than RethinkDB’s, my projects are going to stay with MongoDB, while I wait for RethinkDB to become more impressive in terms of performance.