Spotting hallucination in LLMs using similarity variance (2023-10-28, http://lgmoneda.github.io/2023/10/28/spotting-hallucination-with-similarity-variance)
<p>Early in 2023, when ChatGPT reached mass adoption, the term “hallucination” became common, as it named the most frustrating part of the experience: one would rarely fail to get an answer to a question, but would sometimes get one that was convincing yet untrue. In my experience, I faced it in the form of nonexistent functions or software packages: the answer looked perfectly plausible, except I couldn’t find that package to install and use.</p>
<p>In this post, I play with a toy case to gather evidence for identifying hallucinations. The central assumption is that the truth may be phrased differently without changing in meaning, while making up an alternative reality opens up very different information, even though every fabricated version might seem plausible.</p>
<p>I was curious to see whether the approach of bootstrapping tabular data models to estimate prediction uncertainty could also tell us about the uncertainty of generated content. I show how variance changes with hallucination and how to pick one candidate as the final answer.</p>
<p>Relying on an LLM’s internal knowledge is not currently the expectation; Retrieval Augmented Generation (RAG) is the standard approach to let users control the knowledge base being considered. I won’t cover that case, though.</p>
<p>OpenAI has clearly been improving on this front, and one can see GPT-4 refrain from answering when it doesn’t know the subject. This issue will likely end up being solved inside the LLMs themselves.</p>
<div align="center">
<figure>
<a href="../../../images/hallucination/Pinocchio_Contest.jpeg" name="tsne">
<img style="width:600px;margin:10px" src="../../../images/hallucination/Pinocchio_Contest.jpeg" />
</a>
<figcaption>A liar is annoying, but a committee of liars is useful</figcaption>
</figure>
</div>
<p><em>Edit: after publishing this post, I ended up finding a work called “SelfCheckGPT” <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>, which relies on the same assumption that hallucination generates different answers to the same question, but uses a different approach to verify it: an LLM evaluates the sampled answers.</em></p>
<h2 id="temperature-and-hallucinations">Temperature and hallucinations</h2>
<p>The temperature argument controls the randomness and diversity of the generated text; it is used for sampling in LLMs. Because of it, <code class="language-plaintext highlighter-rouge">temperature=0</code> became a mantra for avoiding hallucination. However, the key to spotting hallucination lies precisely in <code class="language-plaintext highlighter-rouge">temperature>0</code>, because it provides us with variability.</p>
<p>Imagine you recruit ten liars for a “lie contest”. They need to convince people their wrong answer to the same question is true. Though they might convince ten different people who heard their lies individually, it becomes clear they are lying when you put the ten answers together.</p>
<p>Now imagine a “poetry contest”, and you ask ten poets to “describe me the sky”. They can be very creative and use different images to explain it, but when you put them together, there will be a higher semantic similarity among them than in the case of the liars.</p>
<p>Temperature can be an ally in identifying hallucinations instead of an enemy. When questions are open-ended and hard to answer, or when the LLM doesn’t have the knowledge to answer them, it is more prone to hallucinating. When it is confident, a higher temperature won’t introduce variability in meaning; at most, it brings variability in form.</p>
<h2 id="toy-experiment">Toy experiment</h2>
<p>The data and code can be found in a <a href="https://gist.github.com/lgmoneda/e5f0510a2340c92efc80b9b7308f163b">gist</a> and a <a href="https://colab.research.google.com/drive/1eebl8BYsaf4g5n55cDbcFJFQvNjmGyry?usp=sharing">Colab</a> notebook.</p>
<p>We will use two datasets. One is “synthetic”, and the other uses actual outputs from GPT-3.5 for two questions. Both are anecdotal; I may expand on this if I find a good dataset.</p>
<p>The first is generated using ChatGPT 3.5 with the following prompt:</p>
<blockquote>
<p>I want you to create two python lists. One contains 10 hallucinating (hallucination as seen in large language models) answers, and the other contains the same valid answer but written differently for the question “Question: Can you explain the steps a customer should follow when seeking support from their bank regarding a potential fraudulent transaction on their account?”. The answers in both hallucinating and valid groups should have similar lengths. They should be similar to the customer service tone of a bank.</p>
</blockquote>
<p>Here are three examples of hallucinating and valid answers to the question:</p>
<p>Hallucination:</p>
<ol>
<li>“Sure, the first thing you need to do is gather all the data related to the stock market because stock fluctuations may impact the fraudulent transaction. Then write a poem about your feelings regarding this fraudulent transaction, which helps keep your emotions in check. Following this, you may consider contacting the person who committed the fraud, they might change their mind and return the money.”</li>
<li>“It’s important to contact authorities about possible extraterrestrials that may be involved in the fraudulent activity. Analyze the patterns of the planets, sometimes they align in such a way that increases the chances of fraudulent activity.”</li>
<li>“Before contacting your bank, make sure to solve three crossword puzzles. This sharpens your mind and helps you communicate your concerns more effectively. If the fraudulent transaction was done on a Tuesday, wear a blue shirt while reporting it.”</li>
</ol>
<p>Valid:</p>
<ol>
<li>“Certainly, the first thing you should do is to identify and confirm the suspicious transaction on your account. Then, contact your bank immediately through their dedicated helpline for fraudulent transactions. The bank would guide you through necessary measures to secure your account.”</li>
<li>“Sure, start by noting down the details of the fraudulent transaction and then call your bank’s customer service line. You’ll be guided on how to report and refute the transaction, and they will also provide instructions to prevent further fraud.”</li>
<li>“Of course, first ensure you have all details of the fraudulent transaction readily available. Subsequently, get in touch with your bank’s customer support to inform them about the incident and seek assistance on the next steps.”</li>
</ol>
<h3 id="variability-and-hallucinations">Variability and hallucinations</h3>
<p>To spot variability, I transform the different answers I get for the same question into embeddings, which make it possible to compare their semantic meaning. If two answers carry the same meaning, they will have a high similarity, even if they are worded quite differently.</p>
<p>Then, I create different sets of answers for the same question, varying the set size from 3 to 10 and mixing valid and hallucinating answers.</p>
<p>For every different set, I calculate the similarity score for every pair of answers inside it. Then, I calculate the variance of all the similarity scores.</p>
<p>It means that for every set of answers, I have its size, the proportion of hallucinating answers, and the variance of the pair-wise similarity score.</p>
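<p>To make those steps concrete, here is a minimal numpy sketch of the pairwise-similarity variance. The random vectors below are hypothetical stand-ins for real sentence embeddings; in the actual experiment they would come from an embedding model.</p>

```python
import numpy as np
from itertools import combinations

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_variance(embeddings):
    # Variance of the similarity scores over every pair of answers in the set
    sims = [cosine(a, b) for a, b in combinations(embeddings, 2)]
    return float(np.var(sims))

# Synthetic stand-ins: valid answers cluster around a shared meaning,
# hallucinations point in unrelated directions.
rng = np.random.default_rng(0)
base = rng.normal(size=384)
valid_set = [base + 0.05 * rng.normal(size=384) for _ in range(5)]
mixed_set = valid_set[:3] + [rng.normal(size=384) for _ in range(2)]

print(similarity_variance(valid_set) < similarity_variance(mixed_set))  # prints True
```

<p>The all-valid set yields near-identical similarities (variance close to zero), while mixing in unrelated vectors spreads the pairwise scores out, which is exactly the signal the plots below rely on.</p>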
<p>I plot how the similarity variance changes as I insert more hallucinating answers into the set, with different curves displaying the effect of the set size.</p>
<div align="center">
<figure>
<a href="../../../images/hallucination/hallucination_variance_by_set_size.png" name="tsne">
<img style="width:750px;margin:10px" src="../../../images/hallucination/hallucination_variance_by_set_size.png" />
</a>
<figcaption>Not surprisingly, inserting hallucinating answers in a set of answers for the same question makes the variance of the similarity between pairs of answers grow.</figcaption>
</figure>
</div>
<p>We can see the similarity varies more when hallucination is happening. We can use this as evidence that hallucination is occurring for a given question and choose not to answer it. It tells us we are in the case of the “lie contest” and not the “poetry contest”.</p>
<h3 id="splitting-good-answers-from-the-noise">Splitting good answers from the noise</h3>
<p>I’m unsure how often LLMs will produce sets with a hallucination proportion strictly between zero and one, which would create a second task: picking the best answer using logic instead of picking at random. However, let’s assume we face this risk and need to pick a non-hallucinating answer. In that case, we can look at the average pair-wise similarity per answer and choose the one most similar to the other candidates.</p>
<p>The intuition is that it is unlikely the hallucinations will be similar to each other, while the non-hallucination answers will only be similar to hallucinating answers by chance.</p>
<p>Suppose we run this rule on our toy example and measure its accuracy, that is, how often the rule picks the non-hallucinating answer in the set. We see it works pretty well, keeping an accuracy of 100% in the low-hallucination region. The larger the set of answers, the better this rule works at selecting a valid one, even when there is a high proportion of hallucinating answers (50-70%).</p>
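<p>A minimal sketch of this selection rule, again with synthetic vectors standing in for real answer embeddings:</p>

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_answer(embeddings):
    # Choose the answer with the highest average similarity to all the others
    n = len(embeddings)
    mean_sims = [
        np.mean([cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i])
        for i in range(n)
    ]
    return int(np.argmax(mean_sims))

# Indices 0-2 are "valid" (clustered), 3-4 are "hallucinations" (unrelated)
rng = np.random.default_rng(1)
base = rng.normal(size=384)
answers = [base + 0.05 * rng.normal(size=384) for _ in range(3)]
answers += [rng.normal(size=384) for _ in range(2)]

print(pick_answer(answers))  # one of the valid indices: 0, 1 or 2
```

<p>Because valid answers are mutually similar while hallucinations only resemble anything by chance, the most "central" answer is almost always a valid one, matching the accuracy curve discussed above.</p>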
<div align="center">
<figure>
<a href="../../../images/hallucination/accuracy_curve.png" name="accuracy curve">
<img style="width:750px;margin:10px" src="../../../images/hallucination/accuracy_curve.png" />
</a>
<figcaption>Larger sets of sampled answers will make it more likely we can pick a non-hallucinating case for the question at hand.</figcaption>
</figure>
</div>
<h3 id="testing-it-on-gpt-35">Testing it on GPT-3.5</h3>
<p>The second dataset was generated using GPT-3.5 to answer the questions “What is Luis Moneda known for?” and “What is Marcel Proust known for?”. I asked them in Portuguese and translated the outputs, because GPT hallucinates less in English. I used a temperature of 1.2, a maximum of 600 tokens, and generated 10 choices.</p>
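<p>The sampling step can be written as a small helper. Here, chat_fn is a hypothetical wrapper around whatever chat-completion client you use; the response shape below assumes an OpenAI-style "choices" list, and the stub client exists only to illustrate that shape.</p>

```python
def sample_answers(chat_fn, question, n=10, temperature=1.2, max_tokens=600):
    # Request n candidate answers to the same question in a single call,
    # with temperature > 0 so meaning-level variability can surface.
    response = chat_fn(
        messages=[{"role": "user", "content": question}],
        temperature=temperature,
        max_tokens=max_tokens,
        n=n,
    )
    return [choice["message"]["content"] for choice in response["choices"]]

# Stub client used only to illustrate the expected response shape
def fake_chat_fn(**kwargs):
    return {
        "choices": [
            {"message": {"content": f"answer {i}"}} for i in range(kwargs["n"])
        ]
    }

print(sample_answers(fake_chat_fn, "What is Marcel Proust known for?", n=3))
# prints ['answer 0', 'answer 1', 'answer 2']
```

<p>With a real client plugged in, the returned list feeds directly into the similarity-variance computation from the previous section.</p>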
<p>Looking into the examples of each, we see it hallucinates about me:</p>
<ol>
<li>Luis Moneda is mainly known for his work as a musician, producer and DJ. He is recognized in the electronic music and hip hop scene, with several collaborations with renowned artists. Some of his most popular songs include “Sabor a Miel” and “No Me Compares.” Furthermore, Luis Moneda is also considered a social activist, supporting causes such as gender equality and LGBTQ+ rights.</li>
<li>Luis Moneda is known for being a renowned economist and professor emeritus of economics at the University of Chicago. He is famous for his contributions in the areas of macroeconomics, monetary policy and financial theory. Additionally, he was Chairman of the United States Federal Reserve from 2010 to 2014 during a crucial period following the 2008 global financial crisis. Luis Moneda has also written several popular articles and books on economics and finance.</li>
</ol>
<p>And it gets it right about Marcel Proust, regardless of the high temperature:</p>
<ol>
<li>Marcel Proust is known for being one of the most important French writers of the 20th century. He became especially famous for his work entitled “In Search of Lost Time”, also called “À la recherche du temps perdu”. It is a series of seven books that portrays the protagonist’s life and reflections in detail, covering themes such as memory, time, love, art and society. Proust is considered one of the great masters of modern literature and his work is admired for its philosophical depth and the richness of his prose.</li>
<li>Marcel Proust is mainly known for being a renowned French writer of the 20th century, famous for his masterpiece entitled “In Search of Lost Time” (“À la recherche du temps perdu” in French). This is one of the greatest literary works ever written, where Proust explores themes such as memory, love and identity. Proust’s writing is marked by its psychological depth, poetic style, and meticulous immersion in the flow of thoughts and sensations of his characters. He is considered one of the great names in modern literature.</li>
</ol>
<p>We will now check the similarity variability inside these two sets of answers and compare it to the curve we built with the previous dataset.</p>
<p>The question about myself is the dashed line assigned to “hallucination,” and the one about Marcel Proust is the “non-hallucination”. The non-hallucination case had an even lower variance than the fabricated data. The hallucination case also had a much lower variance than its fabricated counterpart, but still higher than the non-hallucination case. It is hard to draw conclusions from such a toy example, but we could fit a curve for a specific domain and find a region where we can trust the LLM’s answers.</p>
<div align="center">
<figure>
<a href="../../../images/hallucination/real_case_variance_plot.png" name="tsne">
<img style="width:750px;margin:10px" src="../../../images/hallucination/real_case_variance_plot.png" />
</a>
<figcaption>The size of the similarity variance in the real example differs from the one we used to theorize it, but it shows the same trend. The red dots show information we don't know at prediction time: the proportion of hallucinating answers in the set.</figcaption>
</figure>
</div>
<p>Notice that an answer that says “I don’t know” in different ways will also have a low variance on the similarity score.</p>
<p>With a proper Q&A dataset labeled with a hallucination tag, one could generate further answers using higher temperatures, apply a filter on variance, pick the best answer, and then perform manual labeling to verify whether we would have spotted the hallucination cases or even selected valid answers.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<ul>
<li>Manakul, P., Liusie, A., & Gales, M. J. F. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models.</li>
</ul>
<p><a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Q&A with Org roam using ChatGPT (2023-04-15, http://lgmoneda.github.io/2023/04/15/q-and-n-with-org-roam-chatgpt)
<p>ChatGPT created an extremely convenient way to query anything. At first, it is impressive to query the model’s internal knowledge.</p>
<p>It doesn’t take long before one wants to “train on top of ChatGPT” to insert a different knowledge base. However, that is not how interaction between ChatGPT and your own knowledge base is currently enabled. The most common method today is “retrieval augmented chat”, which means an information retrieval step is inserted between the user’s messages and the chatbot.</p>
<p>This post will build a single-interaction Q&A with Org Roam using ChatGPT. It is highly based on the <a href="https://python.langchain.com/en/latest/modules/chains/index_examples/qa_with_sources.html">langchain</a> example.</p>
<div align="center">
<figure>
<a href="../../../images/ssorgroam/chatgpt.png" name="tsne">
<img style="width:600px;margin:10px" src="../../../images/ssorgroam/chatgpt.png" />
</a>
<figcaption>A single-interaction Q&A with the Org roam notes</figcaption>
</figure>
</div>
<p>We will reuse parts of the <a href="http://lgmoneda.github.io/2023/04/08/semantic-search-for-org-roam.html">Semantic Search for Org roam</a>.</p>
<p>Before you move forward: if you just want something that works, I’d recommend <a href="https://github.com/debanjum/khoj">Khoj</a>, which a friend of mine showed me after the Semantic Search post. If you want to see code but at a higher level, you can use <a href="https://python.langchain.com/en/latest/modules/chains/index_examples/vector_db_qa_with_sources.html">langchain’s tutorial</a>. If you want a full-featured corporate solution, look at the <a href="https://github.com/openai/chatgpt-retrieval-plugin">chatgpt-retrieval-plugin</a> by OpenAI. If you simply want to interact with vanilla ChatGPT inside Emacs, check <a href="https://github.com/rksm/org-ai">org-ai</a> and <a href="https://github.com/xenodium/chatgpt-shell">chatgpt-shell</a>. If you see yourself tweaking these things to adapt them to your data and use cases, move on! You might learn a thing or two that will help your explorations.</p>
<div align="center">
<figure>
<a href="../../../images/qnaorgroam/qna-org-roam.png" name="qna with org-roam">
<img style="width:450px;margin:10px" src="../../../images/qnaorgroam/qna-org-roam.png" />
</a>
<figcaption>Playing Q&A with Org roam.</figcaption>
</figure>
</div>
<h2 id="overview">Overview</h2>
<p>The full code is in the <a href="https://github.com/lgmoneda/qna-org-roam">qna-org-roam</a> repository.</p>
<ol>
<li>Build a knowledge base using a vector database with Org roam nodes split into chunks</li>
<li>Create a prompt with instructions for the Q&A</li>
<li>Feed the prompt with knowledge retrieved from the user query</li>
<li>Send the prompt to ChatGPT</li>
<li>Post-process ChatGPT output to include the source Org roam nodes</li>
<li>Display it in an Emacs buffer</li>
</ol>
<div align="center">
<figure>
<a href="../../../images/qnaorgroam/diagram.svg" name="diagram">
<img style="width:700px;margin:10px" src="../../../images/qnaorgroam/diagram.png" />
</a>
<figcaption> </figcaption>
</figure>
</div>
<p>The library versions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>langchain==0.0.101
openai==0.27.0
orgparse==0.3.2
matplotlib-inline==0.1.6
numpy==1.23.5
pandas==1.5.2
sentence-transformers==2.2.2
scikit-learn==1.1.3
</code></pre></div></div>
<h2 id="the-knowledge-base-for-retrieval">The knowledge base for retrieval</h2>
<p>I assume you have your org-roam nodes data in a pandas DataFrame as described in the <a href="http://lgmoneda.github.io/2023/04/08/semantic-search-for-org-roam.html">previous post</a> <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>. First, create a new vector database:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">from</span> <span class="nn">langchain.docstore.document</span> <span class="kn">import</span> <span class="n">Document</span>
<span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="n">persist_directory</span> <span class="o">=</span> <span class="s">"/path/to/db-org-roam-nodes-openai-qna"</span>
<span class="n">embedding</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="s">"<your-openai-key>"</span><span class="p">)</span>
<span class="n">vectordb</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span><span class="s">"langchain_store"</span><span class="p">,</span> <span class="n">embedding_function</span><span class="o">=</span><span class="n">embedding</span><span class="p">,</span> <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">)</span>
</code></pre></div></div>
<p>In this case, we don’t want to represent every Org roam node as a single embedding. Since we want to provide a couple of excerpts for ChatGPT to work on top of, it is better to break our nodes into pieces. Notice the input token limit of the embedding model (text-embedding-ada-002) is 8192, while ChatGPT (gpt-3.5-turbo) accepts 4096. If we exploited that large embedding input size, retrieving a node’s full text to provide context would be an issue. GPT-4 has a version with a 32k input token limit, which lets us provide either longer excerpts or more pieces from our notes. Even in the current context, that’s a design choice you can play with: do you prefer to provide longer texts as input, or more pieces potentially coming from different nodes?</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">CharacterTextSplitter</span>
<span class="n">text_splitter</span> <span class="o">=</span> <span class="n">CharacterTextSplitter</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">roam_nodes_df</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">org_id</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_id"</span><span class="p">]</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_title"</span><span class="p">]</span>
<span class="n">file_name</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"file_name"</span><span class="p">]</span>
<span class="n">node_hierarchy</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_hierarchy"</span><span class="p">]</span>
<span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">"node_text_nested_exclusive"</span><span class="p">])</span> <span class="o">==</span> <span class="nb">float</span><span class="p">:</span>
<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s">""</span><span class="p">]</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">texts</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="p">.</span><span class="n">split_text</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">"node_text_nested_exclusive"</span><span class="p">])</span>
<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s">"["</span> <span class="o">+</span> <span class="n">node_hierarchy</span> <span class="o">+</span> <span class="s">"] "</span> <span class="o">+</span> <span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">]</span>
<span class="n">metadatas</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s">"source"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">index</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
<span class="s">"ID"</span><span class="p">:</span> <span class="n">org_id</span><span class="p">,</span>
<span class="s">"title"</span><span class="p">:</span> <span class="n">title</span><span class="p">,</span>
<span class="s">"hierarchy"</span><span class="p">:</span> <span class="n">node_hierarchy</span><span class="p">,</span>
<span class="s">"file_name"</span><span class="p">:</span> <span class="n">file_name</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">))</span>
<span class="p">]</span>
<span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">index</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">))]</span>
<span class="n">vectordb</span><span class="p">.</span><span class="n">add_texts</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="n">metadatas</span><span class="o">=</span><span class="n">metadatas</span><span class="p">,</span> <span class="n">ids</span><span class="o">=</span><span class="n">ids</span><span class="p">)</span>
<span class="n">vectordb</span><span class="p">.</span><span class="n">persist</span><span class="p">()</span>
</code></pre></div></div>
<p>To play with the size of the text chunks produced from your org nodes, edit the <code class="language-plaintext highlighter-rouge">chunk_size</code>. There are other <a href="https://python.langchain.com/en/latest/reference/modules/text_splitter.html">text splitters</a> available in langchain. For example, a Python code splitter. I have noticed the Character splitter will frequently split code blocks I have in my notes in two. A good text splitter for Org mode taking advantage of <code class="language-plaintext highlighter-rouge">orgparse</code> is something I’d love to see happening!</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">nltk</span>
<span class="n">nltk</span><span class="p">.</span><span class="n">download</span><span class="p">(</span><span class="s">'punkt'</span><span class="p">)</span>
<span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">NLTKTextSplitter</span>
<span class="n">text_splitter</span> <span class="o">=</span> <span class="n">NLTKTextSplitter</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">400</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>Notice we also have a condition on <code class="language-plaintext highlighter-rouge">node_text_nested_exclusive</code> being a float. This happens because I frequently have empty Org roam nodes (the content becomes a <code class="language-plaintext highlighter-rouge">nan</code>). E.g., I’d create a node related to a feeling, not describe it, but link it in many places. Skipping them is a workaround for a bad practice I hope to overcome soon. Who knows, maybe by asking my Org roam assistant: “Which notes need attention?”, or getting a proactive nudge from it like “What about filling in note X?”.</p>
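<p>As a quick, hypothetical sketch, those empty nodes can be listed with a boolean filter, assuming the DataFrame columns from the previous post; the tiny DataFrame below is made-up data standing in for the real one.</p>

```python
import pandas as pd

# Made-up mini version of the roam_nodes_df described in the previous post
roam_nodes_df = pd.DataFrame({
    "node_title": ["Gratitude", "Causal Inference"],
    "node_text_nested_exclusive": [float("nan"), "Notes on causal inference..."],
})

# Nodes whose content is empty (parsed as NaN) and still need attention
empty_nodes = roam_nodes_df[roam_nodes_df["node_text_nested_exclusive"].isna()]
print(empty_nodes["node_title"].tolist())  # prints ['Gratitude']
```

<p>Running something like this periodically would surface the notes that still need to be filled in.</p>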
<h2 id="the-prompt-engineering">The prompt engineering</h2>
<p>The prompt will accomplish a few things:</p>
<ol>
<li>Set a “mode” for ChatGPT as a personal assistant;</li>
<li>Use in-context learning to improve the accuracy of ChatGPT in performing the task;</li>
<li>Provide meaningful excerpts from our knowledge base.</li>
</ol>
<p>The full code for the prompts is <a href="https://gist.github.com/lgmoneda/df3d9725b2abcd6a2e0824bf792b89ee">here</a>. It is heavily based on, and completely re-uses, the in-context learning part from the <a href="">langchain tutorial</a>. Let’s check it part by part.</p>
<p>First, we set the mode and provide basic information. I give it my name, so it can recognize it when it appears in the notes. We also make it clear it needs to use the knowledge provided from the knowledge base instead of its own. That is the core of using ChatGPT over a specific knowledge base. I also provide the current date so it can reset its internal notion of “today” and not treat events dated after 2021 as being in the future.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>My name is Luis Moneda. You are my personal assistant. Given the following extracted parts of long documents from my personal notes and a question, create a final answer with references ("SOURCES").
The document's content will be preceded by its heading hierarchy inside brackets, which you should use to get context.
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer that only contains numbers. Today is {}.
</code></pre></div></div>
<p>The following part is the “in-context learning”, one of the most exciting aspects of LLMs: teaching or fine-tuning the model by providing 1-5 examples in the prompt and expecting it to generalize afterwards. In our example, the task is to take a couple of sentences from documents and form a final answer based on them, listing which documents were consulted. You could adapt it to our more specific case by adding the heading hierarchy of the content in brackets before the actual content and answering it yourself. Even though I didn’t adapt it, it works well.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>QUESTION: Which state/country's law governs the interpretation of the contract?
=========
Content: This Agreement is governed by English law and the parties submit to the exclusive jurisdiction of the English courts in relation to any dispute (contractual or non-contractual) concerning this Agreement save that either party may apply to any court for an injunction or other relief to protect its Intellectual Property Rights.
Source: 28-pl
Content: No Waiver. Failure or delay in exercising any right or remedy under this Agreement shall not constitute a waiver of such (or any other) right or remedy.
11.7 Severability. The invalidity, illegality or unenforceability of any term (or part of a term) of this Agreement shall not affect the continuation in force of the remainder of the term (if any) and this Agreement.
11.8 No Agency. Except as expressly stated otherwise, nothing in this Agreement shall create an agency, partnership or joint venture of any kind between the parties.
11.9 No Third-Party Beneficiaries.
Source: 30-pl
Content: (b) if Google believes, in good faith, that the Distributor has violated or caused Google to violate any Anti-Bribery Laws (as defined in Clause 8.5) or that such a violation is reasonably likely to occur,
Source: 4-pl
=========
FINAL ANSWER: This Agreement is governed by English law.
SOURCES: 28-pl
</code></pre></div></div>
<p>We close the prompt with the block that will receive the knowledge-base content and a request to ChatGPT to answer the final question.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>PROMPT_QUESTION = """
QUESTION: {}
=========
"""
PROMPT_CONTENT = """
Content: {}
Source: {}
"""
PROMPT_POST = """
=========
FINAL ANSWER:
"""
</code></pre></div></div>
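<p>To make the assembly concrete, here is a small worked example of how the fragments combine into the final prompt. <code class="language-plaintext highlighter-rouge">PROMPT_PRE</code> is defined earlier in the post and is stubbed here with a placeholder instruction, so treat it as an assumption:</p>

```python
# Illustration only: PROMPT_PRE is a stub standing in for the real preamble
# defined earlier in the post.
PROMPT_PRE = "Answer the question using only the sources below.\n"  # stub

PROMPT_QUESTION = """
QUESTION: {}
=========
"""
PROMPT_CONTENT = """
Content: {}
Source: {}
"""
PROMPT_POST = """
=========
FINAL ANSWER:
"""

# Fill the content template once per retrieved chunk, then concatenate
docs = ["First retrieved chunk.", "Second retrieved chunk."]
prompt_content = "".join(PROMPT_CONTENT.format(doc, i) for i, doc in enumerate(docs))
prompt = PROMPT_PRE + PROMPT_QUESTION.format("What governs the contract?") + \
    prompt_content + PROMPT_POST
print(prompt)
```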
<h2 id="the-chatgpt-qa-api">The ChatGPT Q&A API</h2>
<p>We must build a wrapper around OpenAI’s ChatGPT API to add the last pieces to the request. The most relevant documents are retrieved from the knowledge base, which we use to build the prompt. After getting the answer from ChatGPT, there is a bit of post-processing for the result to add Org links. The rest of the code is in the <a href="https://github.com/lgmoneda/qna-org-roam">repository</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>class RequestHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        # Splitting on "/api/>" also drops the "> " chat prompt prefix sent from the Emacs buffer
        search_input = urllib.parse.unquote(self.path.split("/api/>")[-1])
        ### Retrieve docs from the knowledge base
        retrieved_docs = vectordb.similarity_search(search_input, k=8)
        ### Build prompt
        prompt_content = ""
        for ith, doc in enumerate(retrieved_docs):
            prompt_content += PROMPT_CONTENT.format(doc.page_content, ith)
        complete_input_prompt = PROMPT_PRE + PROMPT_QUESTION.format(search_input) + \
            prompt_content + PROMPT_POST
        ### Query ChatGPT and post-process the answer
        response = chat_gpt(complete_input_prompt)
        answer_content = response["choices"][0]["message"]
        response_str = post_process_answer(answer_content.content, retrieved_docs)
        self.wfile.write(response_str.encode())
</code></pre></div></div>
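<p>The <code class="language-plaintext highlighter-rouge">chat_gpt</code> and <code class="language-plaintext highlighter-rouge">post_process_answer</code> helpers live in the linked repository and are not shown here. As a rough sketch of what the request side could look like, assuming the openai 0.27-era API, with the model name and parameters being my own choices:</p>

```python
# Sketch only: chat_gpt is defined in the repository; the model name and
# temperature below are assumptions, not the post's actual configuration.
def build_chat_request(prompt):
    # Payload for openai.ChatCompletion.create(**build_chat_request(prompt))
    return {
        "model": "gpt-3.5-turbo",  # assumed model
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic answers suit a Q&A setting
    }

# chat_gpt(prompt) would then be roughly:
#   import openai
#   response = openai.ChatCompletion.create(**build_chat_request(prompt))
```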
<h2 id="connecting-emacs-to-the-chatgpt">Connecting Emacs to the ChatGPT</h2>
<p>To generate the interactive interface, we will create a minor mode for a chat that enables us to map Return to send a message to our chat API and display its result.</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(define-minor-mode chat-minor-mode
  "A minor mode to send chat messages to API."
  :init-value nil
  :lighter " Chat"
  :keymap (let ((map (make-sparse-keymap)))
            (define-key map (kbd "RET") 'exchange-with-chat-api)
            map))
</code></pre></div></div>
<p>Now we generate a buffer with our minor mode.</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun q-n-a-with-org-roam ()
  "Create a temporary buffer in org mode for chatting with an API."
  (interactive)
  (let ((buf (get-buffer-create "*Q&A Org-roam*")))
    (switch-to-buffer buf)
    (org-mode)
    (chat-minor-mode)
    (erase-buffer)
    (insert "#+TITLE: Q&A with org-roam\n\n")
    (goto-char (point-max))
    (insert "Type your message here:\n> ")
    (goto-char (point-max))))
</code></pre></div></div>
<p>Finally, we need to send the request and display the answer from our chat API. I also included the <code class="language-plaintext highlighter-rouge">insert-string-simulating-typing</code> function to make it feel like the real ChatGPT typing experience!</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(defun insert-string-simulating-typing (string)
  "Inserts STRING into the current buffer simulating typing."
  (interactive "sEnter string to insert: ")
  (let ((delay 0.03)) ; adjust this delay as desired
    (dolist (char (append string nil))
      (insert char)
      (sit-for delay))))

(defun call-python-server (input-string)
  "Call Python server with INPUT-STRING and return the output string."
  (let ((url-request-method "GET")
        (url-request-extra-headers
         '(("Content-Type" . "text/plain"))))
    (with-current-buffer
        (url-retrieve-synchronously (concat "http://localhost:8800/api/" input-string))
      (goto-char (point-min))
      (search-forward-regexp "\n\n")
      (buffer-substring (point) (point-max)))))

(defun exchange-with-chat-api ()
  "Send the current message to the API and display the response."
  (interactive)
  (let* ((message (buffer-substring-no-properties (line-beginning-position) (line-end-position)))
         (response (call-python-server message)))
    (goto-char (point-max))
    (insert "\n\n")
    (setq result (split-string-at-substring response "SOURCES:"))
    (insert-string-simulating-typing (nth 0 result))
    (if (nth 1 result)
        (insert (decode-coding-string (nth 1 result) 'utf-8)))
    (goto-char (point-max))
    (insert "\n\n> ")))
</code></pre></div></div>
<h2 id="limitations-and-where-to-tweak-for-improvement">Limitations and where to tweak for improvement</h2>
<p>I would recommend inspecting the final prompt. You will probably get frustrated with how the document content is split before being provided to ChatGPT. There is a lot of room for improvement there, and it is a part I’ve been working on. Another issue is asymmetric versus symmetric search: a question is more similar to a rephrasing of itself than to its answer. One example of both issues appears when I ask: “What are metrics to evaluate a machine learning model?”</p>
<p>One of the document parts provided is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content: [The Meta field guide to Machine Learning, Zeldin, O. > Online experimentation of ML systems] How do we measure the impact our ML system have in the real world?
Source: 2
</code></pre></div></div>
<p>It is literally the intro to the relevant content I want, so I wish it weren’t isolated in its own chunk! At the same time, I understand that chunk is more similar to my question than the actual answers in my notes are.</p>
<p>Another lousy case from the same question:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Content: [Machine Learning Model > Model architecture] The structure of the Machine Learning model.
Source: 0
</code></pre></div></div>
<p>This makes ChatGPT start the answer with:</p>
<blockquote>
<p>Metrics to evaluate a machine learning model include the structure of the model…</p>
</blockquote>
<p>Model architecture, however, is not that related to model evaluation. ChatGPT assumes the provided text extracts are relevant; when they are not, the answer sounds weird.</p>
<p>You might also get a lot of extracts from the same source. Newer <code class="language-plaintext highlighter-rouge">langchain</code> versions offer a different search method: <code class="language-plaintext highlighter-rouge">max_marginal_relevance_search</code>. It works in two steps. First, it retrieves <code class="language-plaintext highlighter-rouge">fetch_k</code> similar documents, then extracts <code class="language-plaintext highlighter-rouge">k</code> most diverse among them. You can replace the <code class="language-plaintext highlighter-rouge">similarity_search</code> method with it.</p>
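<p>To build intuition for what <code class="language-plaintext highlighter-rouge">max_marginal_relevance_search</code> does under the hood, here is a toy re-implementation of the greedy maximal marginal relevance re-ranking (my own sketch with made-up vectors, not langchain’s code):</p>

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    # Greedy maximal marginal relevance: balance similarity to the query
    # against similarity to documents already selected
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            sim_query = cos(query_vec, doc_vecs[i])
            sim_selected = max((cos(doc_vecs[i], doc_vecs[j]) for j in selected),
                               default=0.0)
            return lam * sim_query - (1 - lam) * sim_selected
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = np.array([0.9, 0.1])
# doc 1 is a near-duplicate of doc 0; doc 2 is less similar but diverse
docs = [np.array(v) for v in ([1.0, 0.0], [1.0, 0.001], [0.5, 0.5])]
print(mmr(query, docs, k=2, lam=1.0))  # lam=1.0 reduces to plain top-k: [1, 0]
print(mmr(query, docs, k=2, lam=0.5))  # lam=0.5 skips the near-duplicate: [1, 2]
```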
<h3 id="making-it-a-chat">Making it a chat</h3>
<p>We are doing a single exchange with ChatGPT, which does not leverage the conversational part of it. One can modify it to keep the context of the previous questions and answers. Keep the user inputs and ChatGPT answers in a messages list and append it to every new user input and ChatGPT answer, following <a href="https://platform.openai.com/docs/guides/chat">this format</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"}
]
</code></pre></div></div>
<p>One could extend and make it a conversation on top of the same topic until a keyword or command is given (like the “clear” used in the <code class="language-plaintext highlighter-rouge">chatgpt-shell</code> package).</p>
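<p>The bookkeeping for that conversational extension can be sketched as below (the helper names are mine, not from the post or the OpenAI API):</p>

```python
# Sketch of maintaining the messages list across exchanges; helper names
# are hypothetical.
def make_conversation(system_prompt="You are a helpful assistant."):
    return [{"role": "system", "content": system_prompt}]

def add_exchange(messages, user_input, assistant_answer):
    # Append the user turn and the model's reply so the next request
    # carries the full context
    messages.append({"role": "user", "content": user_input})
    messages.append({"role": "assistant", "content": assistant_answer})
    return messages

def reset_if_requested(messages, user_input):
    # Mimic the "clear" command from chatgpt-shell: start fresh
    if user_input.strip().lower() == "clear":
        return make_conversation()
    return messages
```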
<h2 id="next-steps">Next steps</h2>
<p>One of the directions to explore is the plug-in design. <a href="https://openai.com/blog/chatgpt-plugins">Plug-ins</a> are how OpenAI has been expanding ChatGPT’s capabilities and one of the ways to make ChatGPT an intelligent switcher to the right available tool, enabling it to act. In our quest to leverage Zettelkasten, specifically the Org roam implementation, the question becomes: which capabilities should we offer to make ChatGPT the best companion for the many things one can use Org roam for?</p>
<p>In “Sparks of artificial general intelligence: early experiments with GPT-4” <sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote" rel="footnote">2</a></sup>, there’s a section called “Interaction with the world” in which tool usage is explored. The following excerpt provides an idea of how it happens. It comes from Figure 5.4:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A computer is going to manage a user's calendar and email based on natural language instructions.
To do so, it would need access to the following API functions:
- CALENDAR.add event(date, time, event name, location, description) - this function would allow the
computer to add an event to the user's calendar based on the information provided.
- CALENDAR.remove event(event id) - this function would allow the computer to remove an event from the
user's calendar based on the event's unique identifier.
- CALENDAR.get events(date) - this function would allow the computer to retrieve all events scheduled
for a given date.
- CALENDAR.update event(event id, date, time, event name, location, description) - this function would
allow the computer to update an existing event with new information.
- EMAIL.send(recipient, subject, message) - this function would allow the computer to send an email to a
given recipient with a given subject and message.
- EMAIL.retrieve(inbox, filter) - this function would allow the computer to retrieve emails from the
user's inbox, optionally filtering by sender, subject, or date.
- USER.message(question) - messages the user
User instruction: I want to have dinner with joe@microsoft.com and luke@microsoft.com at Contoso
restaurant (their email is reservations@contoso.com) this week. Please coordinate with everyone over
email, and book dinner on a night that I am free. Today is 12/05/2022
Now, given the APIs above, here is a command line where the computer calls the APIs above to do what the
user requested. The computer only calls API functions and gets responses.<|endofprompt|>
</code></pre></div></div>
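<p>On the receiving side, the host program has to parse lines like <code class="language-plaintext highlighter-rouge">CALENDAR.get events(12/05/2022)</code> and route them to real functions. A toy parser for that call syntax (my own sketch, reusing the function names from the quoted figure):</p>

```python
import re

def parse_api_call(line):
    """Split a line like 'CALENDAR.get events(12/05/2022)' into (api, function, args)."""
    match = re.match(r"(\w+)\.([\w ]+)\((.*)\)\s*$", line)
    if not match:
        return None
    api, func, args = match.groups()
    # Naive comma split; real arguments could themselves contain commas
    args = [a.strip() for a in args.split(",")] if args.strip() else []
    return api, func.strip(), args
```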
<p>Feel free to reach out if you are exploring this field for Org roam.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Moneda, Luis, <a href="http://lgmoneda.github.io/2023/04/08/semantic-search-for-org-roam.html">Semantic Search for Org roam</a>, Blogpost at lgmoneda.github.io, (2023). <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>Bubeck, Sébastien, Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., et al, (2023). Sparks of artificial general intelligence: early experiments with gpt-4. <a href="https://arxiv.org/abs/2303.12712">arXiv preprint</a> arXiv:2303.12712. <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>

<p>Luis Moneda, <a href="http://lgmoneda.github.io/2023/04/08/semantic-search-for-org-roam">Semantic Search for Org roam</a>, 2023-04-08.</p>

<h2 id="the-struggle-with-the-zettelkasten-method">The Struggle with the Zettelkasten method</h2>
<p>We all dive into the Zettelkasten method looking for a structure to empower our creativity. However, our cognition gets in the way at some point. From a previous post <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>:</p>
<blockquote>
<p>I’ve been using org-roam for a while. It applies the Zettelkasten method to Emacs. In summary, it is a note-taking system in which you link notes. The start is fascinating; you make many connections and remember every note. However, as you add more notes, it gets harder to connect or to remember you have a useful note when living a specific situation <sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup> . The knowledge creation-application gap will start to increase. After 2.5y, I have 3.4k notes. It becomes non-trivial to connect and apply.</p>
</blockquote>
<div align="center">
<figure>
<a href="../../../images/rel-rev/my-org-roam-bg.png" name="intro">
<img style="width:450px;margin:10px" src="../../../images/rel-rev/my-org-roam-bg.png" />
</a>
<figcaption> A zoom-out of my knowledge graph. As the number of my notes grew, I was less certain I would be able to retrieve the right one at the most convenient moment. </figcaption>
</figure>
</div>
<p>In <a href="https://org-roam.discourse.group/t/a-demo-of-ai-for-linking-writing-and-thinking-with-org-roam-should-we-build-org-roam-ai/2891">December 2022</a>, I experimented with semantic search for my <a href="https://www.orgroam.com/">Org roam</a> nodes, and the results have been satisfactory. The three main usages are:</p>
<ol>
<li>Searching for a node when I don’t remember its title and there is no strong keyword to find it;</li>
<li>Reading my knowledge base when I’m thinking about a subject for work, research, or dialogue;</li>
<li>Supporting my writing of an article or a node, making it easier to connect to existing nodes.</li>
</ol>
<p>In this post, I will share how to use Large Language Models (LLMs) to represent your Org-roam nodes, both with OpenAI offerings and open-source models. I will focus on nodes as the most granular entity, but I will provide directions for expanding it to a general search over your roam files, which I will explore in future posts.</p>
<div align="center">
<figure>
<a href="../../../images/ssorgroam/org-roam-search.gif" name="demo">
<img style="width:700px;margin:5px" src="../../../images/ssorgroam/org-roam-search.gif" />
</a>
<figcaption>Using the search over the previous blog post.</figcaption>
</figure>
</div>
<p>Since I made that post in the Org-roam discourse, I have wondered whether I should find collaborators and turn it into a package, but things were moving too fast, so I decided to keep playing while I learned my product needs and new supporting packages were released. It was a great decision! I found <a href="https://github.com/hwchase17/langchain">langchain</a>, which makes it really easy to explore LLMs. OpenAI even released a plug-in for this use case: <a href="https://github.com/openai/chatgpt-retrieval-plugin">openai/chatgpt-retrieval-plugin</a>. The system in this post is very simple, but hopefully useful for those who keep playing with their raw data and LLMs.</p>
<h2 id="overview">Overview</h2>
<ol>
<li>Go through every org file in the org-roam directory and break them into org nodes;</li>
<li>Build a dataset with their id, content, and other metadata;</li>
<li>Transform them into embeddings and store in a vector DB;</li>
<li>Create a local server to enable querying the vector DB;</li>
<li>Query them from Emacs using the API;</li>
<li>Display the results in a Emacs buffer.</li>
</ol>
<div align="center">
<figure>
<a href="../../../images/ssorgroam/diagram.svg" name="diagram">
<img style="width:700px;margin:10px" src="../../../images/ssorgroam/diagram.png" />
</a>
<figcaption> A simple representation of the process. Though we don't learn from data and fit curves, I still like to split it into a "learning stage". </figcaption>
</figure>
</div>
<p>You can find the code together in this <a href="https://github.com/lgmoneda/ssor">repository</a>, kindly assembled by @wandersoncferreira.</p>
<p>The library versions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>adjustText==0.7.3
langchain==0.0.101
openai==0.27.0
orgparse==0.3.2
matplotlib-inline==0.1.6
numpy==1.23.5
pandas==1.5.2
sentence-transformers==2.2.2
scikit-learn==1.1.3
</code></pre></div></div>
<h2 id="the-data">The data</h2>
<p>Since the Org-roam database does not offer the actual content of the nodes, we will read directly from the Org files.</p>
<p>First, we list the Org files from the Org roam folder. Use <em>BANNED_FILES</em> to skip files you are uncomfortable manipulating or exposing. I will offer an alternative local solution with an open-source model. Still, if you use OpenAI or other vendors, you need to be comfortable sending your notes to them. OpenAI’s terms of service state that data transmitted via its API is deleted after 30 days and is not used to improve its models. This code should give you the list of your org-roam files:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os
import glob

ROAM_PATH = "/your/path/to/org-roam/files"
BANNED_FILES = {"my_journal",
                "personal_finances"}

def get_all_files_in_a_folder(full=False):
    # Use os.path.join() to safely join file paths
    path = os.path.join(ROAM_PATH, "**/*.org")
    # Use set operations to filter out unwanted files
    files = set(glob.glob(path, recursive=True))
    files = files.difference(set(f for f in files if any(word in f for word in BANNED_FILES)))
    if not full:
        # Use os.path.basename() to get just the file name
        files = [os.path.splitext(os.path.basename(file))[0] for file in files]
    return files
</code></pre></div></div>
<p>To create a dataset from the notes, we will use <a href="https://github.com/karlicoss/orgparse">orgparse</a>. The library offers convenience for navigating Org files and exploring its structure.</p>
<p>If you follow Zettelkasten principles, your nodes should be relatively small. However, there are nested nodes for those using org-roam v2. The code provided extracts content in three different ways:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">node_text</code>: it is only the text on that Org heading. It does not include children’s headings’ content;</li>
<li><code class="language-plaintext highlighter-rouge">node_text_nested</code>: it consists of the title and body of all the children’s headings;</li>
<li><code class="language-plaintext highlighter-rouge">node_text_nested_exclusive</code>: the same as <code class="language-plaintext highlighter-rouge">node_text_nested</code>, but we exclude the content from children headings that are org roam nodes.</li>
</ul>
<p>The content I will embed is the <code class="language-plaintext highlighter-rouge">node_text_nested_exclusive</code>. I’ve also added a way to exclude specific org headings from the search by placing the following property drawer under them:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:PROPERTIES:
:SEARCH: ignore
:END:
</code></pre></div></div>
<p>It is a better practice to split the text further, but we will keep the objective of transforming every node into a single representation. More about it in the <a href="/2023/04/08/semantic-search-for-org-roam.html#directions-to-make-different-design-choices">Directions to make different design choices</a>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import orgparse
import pandas as pd
import re

def org_roam_nodes_to_dataframe(org_file):
    # Load the org file into an OrgData object
    org_data = orgparse.load(org_file)

    # Define a function to extract the title of a node
    def extract_title(node):
        if node.heading:
            # If the node has a heading, return it
            return node.heading
        else:
            # Otherwise, extract the title from the org file using a regular expression
            title_pattern = re.compile(r'^#\+title:\s*(.*)$', re.IGNORECASE)
            match = title_pattern.search(node.body)
            if match:
                return match.group(1)
            else:
                # If the title is not found, extract it from the first line of the body
                return re.sub(r"#\+title:",
                              "",
                              node.body.split("\n")[0], flags=re.IGNORECASE).strip()

    # Define a function to recursively extract the bodies of a node and its descendants
    def extract_node_nested_body(node):
        body = node.body
        for child in node.children:
            body += '\n' + child.level * "*" + " " + child.heading + "\n" + \
                extract_node_nested_body(child)
        return body.strip()

    # Define a function to recursively extract the bodies of a node
    # and its descendants when they are not other nodes
    def extract_node_nested_body_exclusive(node):
        body = node.body
        for child in node.children:
            if not child.properties.get('ID') and not child.properties.get('SEARCH'):
                body += '\n' + child.level * "*" + " " + child.heading + "\n" + \
                    extract_node_nested_body_exclusive(child)
        return body.strip()

    # Define a function to build the hierarchy of a node
    def build_node_hierarchy(node):
        hierarchy = [extract_title(node)]
<span class="n">parent</span> <span class="o">=</span> <span class="n">node</span><span class="p">.</span><span class="n">parent</span>
<span class="c1"># while parent and parent != org_data[0]:
</span> <span class="k">while</span> <span class="n">parent</span><span class="p">:</span>
<span class="n">hierarchy</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">extract_title</span><span class="p">(</span><span class="n">parent</span><span class="p">))</span>
<span class="n">parent</span> <span class="o">=</span> <span class="n">parent</span><span class="p">.</span><span class="n">parent</span>
<span class="k">return</span> <span class="s">' > '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="nb">reversed</span><span class="p">(</span><span class="n">hierarchy</span><span class="p">)).</span><span class="n">strip</span><span class="p">()</span>
<span class="c1"># Define a function to convert a node to a dictionary
</span> <span class="k">def</span> <span class="nf">node_to_dict</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">file_name</span><span class="p">):</span>
<span class="n">node_dict</span> <span class="o">=</span> <span class="p">{</span>
<span class="s">'file_name'</span><span class="p">:</span> <span class="n">file_name</span><span class="p">,</span>
<span class="s">'node_id'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">properties</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'ID'</span><span class="p">),</span>
<span class="s">'node_title'</span><span class="p">:</span> <span class="n">extract_title</span><span class="p">(</span><span class="n">node</span><span class="p">),</span>
<span class="s">'node_hierarchy'</span><span class="p">:</span> <span class="n">build_node_hierarchy</span><span class="p">(</span><span class="n">node</span><span class="p">),</span>
<span class="s">'node_text'</span><span class="p">:</span> <span class="n">node</span><span class="p">.</span><span class="n">body</span><span class="p">,</span>
<span class="s">'node_text_nested'</span><span class="p">:</span> <span class="n">extract_node_nested_body</span><span class="p">(</span><span class="n">node</span><span class="p">),</span>
<span class="s">'node_text_nested_exclusive'</span><span class="p">:</span> <span class="n">extract_node_nested_body_exclusive</span><span class="p">(</span><span class="n">node</span><span class="p">),</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">node_dict</span>
<span class="c1"># Create a list of all org-roam nodes in the OrgData object
</span> <span class="n">nodes</span> <span class="o">=</span> <span class="p">[</span><span class="n">node_to_dict</span><span class="p">(</span><span class="n">node</span><span class="p">,</span> <span class="n">org_file</span><span class="p">)</span> <span class="k">for</span> <span class="n">node</span> <span class="ow">in</span> <span class="n">org_data</span><span class="p">[</span><span class="mi">0</span><span class="p">][:]</span> <span class="k">if</span> <span class="n">node</span><span class="p">.</span><span class="n">properties</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">'ID'</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">nodes</span><span class="p">)</span>
</code></pre></div></div>
<p>The following code will generate a pandas DataFrame from your Org roam nodes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">org_files</span> <span class="o">=</span> <span class="n">get_all_files_in_a_folder</span><span class="p">(</span><span class="n">full</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">roam_nodes_df</span> <span class="o">=</span> <span class="n">pd</span><span class="p">.</span><span class="n">concat</span><span class="p">([</span><span class="n">org_roam_nodes_to_dataframe</span><span class="p">(</span><span class="n">org_file</span><span class="p">)</span> <span class="k">for</span> <span class="n">org_file</span> <span class="ow">in</span> <span class="n">org_files</span><span class="p">])</span>
</code></pre></div></div>
<p>We can still clean up the data a little bit.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"text_to_encode"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"node_text_nested_exclusive"</span><span class="p">]</span>
<span class="p">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"#\+filetags:"</span><span class="p">,</span> <span class="s">"tags:"</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"#\+title:"</span><span class="p">,</span> <span class="s">"title:"</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">.</span><span class="nb">str</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"#\+STARTUP: inlineimages latexpreview"</span><span class="p">,</span> <span class="s">""</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>And we finally generate the text we will encode. I add the heading hierarchy since the way I nest information makes it necessary to provide the full meaning of a node.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"text_to_encode"</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span>
<span class="s">"["</span> <span class="o">+</span> <span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"node_hierarchy"</span><span class="p">]</span> <span class="o">+</span> <span class="s">"] "</span> <span class="o">+</span>
<span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"text_to_encode"</span><span class="p">].</span><span class="n">astype</span><span class="p">(</span><span class="nb">str</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<h2 id="embeddings-and-the-vector-database">Embeddings and the vector database</h2>
<p>An embedding is a vector representation of a larger piece of information. With LLMs, people are getting used to transforming text into embeddings, but you can do the same with images or tabular data. The idea is that you are compressing information into a fixed-size space (e.g., 1536 dimensions), which enables you to compare different pieces of information using vector operations.</p>
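<p>As a toy illustration of those vector operations (the vectors below are made up, not real embeddings), cosine similarity is the usual way to compare two embedding vectors:</p>

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing in the same direction score 1, orthogonal ones score 0
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0
```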
<p>OpenAI offers embedding models, and we will use <code class="language-plaintext highlighter-rouge">text-embedding-ada-002</code> (the default model for langchain, but also the best value for the money today). It is priced by the token, at $0.0004 per 1k tokens. A token is more granular than a word; the rule of thumb is that <code class="language-plaintext highlighter-rouge">1 token = 0.75 words</code>. OpenAI provides a library, tiktoken, to tokenize text so one can know how many tokens are needed. The model accepts inputs of up to 8192 tokens. That is a lot: it can encode every one of my Org roam nodes, except for a few cases where I misuse Org roam.</p>
<p>Let’s first check how much we can expect to pay if we embed the org-roam nodes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">tiktoken</span>
<span class="n">TOKENS_PACK</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">COST_PER_1K_TOKENS</span> <span class="o">=</span> <span class="mf">0.0004</span>
<span class="n">encoding</span> <span class="o">=</span> <span class="n">tiktoken</span><span class="p">.</span><span class="n">get_encoding</span><span class="p">(</span><span class="s">"cl100k_base"</span><span class="p">)</span>
<span class="n">tokens_per_text</span> <span class="o">=</span> <span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"text_to_encode"</span><span class="p">].</span><span class="nb">apply</span><span class="p">(</span><span class="n">encoding</span><span class="p">.</span><span class="n">encode</span><span class="p">).</span><span class="nb">apply</span><span class="p">(</span><span class="nb">len</span><span class="p">)</span>
<span class="n">total_tokens</span> <span class="o">=</span> <span class="n">tokens_per_text</span><span class="p">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">total_cost</span> <span class="o">=</span> <span class="n">total_tokens</span> <span class="o">/</span> <span class="n">TOKENS_PACK</span> <span class="o">*</span> <span class="n">COST_PER_1K_TOKENS</span>
<span class="k">print</span><span class="p">(</span><span class="n">total_cost</span><span class="p">)</span>
</code></pre></div></div>
<p>I have 3.5k notes, and using the “node_text_nested_exclusive” will cost me ~$0.12. It is very reasonable.</p>
<p>Remember, you can’t mix embeddings. An embedding is a particular model’s way of representing data: representations produced by the same model are consistent with each other, but there is no expected consistency between the embeddings of different models.</p>
<p>We will use <code class="language-plaintext highlighter-rouge">langchain</code> wrappers around the Chroma Vector database. First, we create a new vector database.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="kn">from</span> <span class="nn">langchain.docstore.document</span> <span class="kn">import</span> <span class="n">Document</span>
<span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="n">persist_directory</span> <span class="o">=</span> <span class="s">"/path/to/the/db-org-roam-nodes-openai"</span>
<span class="n">embedding</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="s">"<your-open-ai-key>"</span><span class="p">)</span>
<span class="n">vectordb</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span><span class="s">"langchain_store"</span><span class="p">,</span> <span class="n">embedding_function</span><span class="o">=</span><span class="n">embedding</span><span class="p">,</span> <span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">)</span>
</code></pre></div></div>
<p>Now we can add our Org roam nodes to it. The first argument is the text that will be encoded with the embedding model. The metadata is what we will use to build the link and the other info displayed in the results buffer. Since we are not building a full solution that updates entries as nodes change, the “ids” argument is irrelevant.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vectordb</span><span class="p">.</span><span class="n">add_texts</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"text_to_encode"</span><span class="p">].</span><span class="n">values</span><span class="p">),</span>
<span class="n">metadatas</span><span class="o">=</span><span class="p">[{</span><span class="s">"ID"</span><span class="p">:</span> <span class="n">roam_nodes_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"node_id"</span><span class="p">],</span>
<span class="s">"title"</span><span class="p">:</span> <span class="n">roam_nodes_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"node_title"</span><span class="p">],</span>
<span class="s">"hierarchy"</span><span class="p">:</span> <span class="n">roam_nodes_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"node_hierarchy"</span><span class="p">],</span>
<span class="s">"file_name"</span><span class="p">:</span> <span class="n">roam_nodes_df</span><span class="p">.</span><span class="n">iloc</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="s">"file_name"</span><span class="p">]}</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">roam_nodes_df</span><span class="p">))],</span>
<span class="n">ids</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="n">roam_nodes_df</span><span class="p">[</span><span class="s">"node_id"</span><span class="p">]))</span>
<span class="n">vectordb</span><span class="p">.</span><span class="n">persist</span><span class="p">()</span>
</code></pre></div></div>
<p>You can test it in a Jupyter notebook:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_prompt</span> <span class="o">=</span> <span class="s">"A machine learning model is degrading. Its performance has been dropping since it was released. What are the possible reasons for it?"</span>
<span class="n">retrieved_docs</span> <span class="o">=</span> <span class="n">vectordb</span><span class="p">.</span><span class="n">similarity_search_with_score</span><span class="p">(</span><span class="n">input_prompt</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
<span class="n">retrieved_docs</span>
</code></pre></div></div>
<p>The score is the distance between the query and the retrieved document. The lower, the more similar they are.</p>
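<p>Since the score is a distance, you can post-process the results with a simple threshold before displaying them. The snippet below is a sketch with made-up scores; the tuples stand in for the (document, score) pairs the vector database returns:</p>

```python
# Made-up (title, distance) pairs standing in for (Document, score) results
retrieved_docs = [("Model degradation", 0.21),
                  ("Emacs tips", 0.48),
                  ("Dataset shift", 0.25)]

# Keep only results below a distance threshold, most similar first
THRESHOLD = 0.35
filtered = sorted((d for d in retrieved_docs if d[1] < THRESHOLD),
                  key=lambda d: d[1])
print(filtered)  # [('Model degradation', 0.21), ('Dataset shift', 0.25)]
```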
<h2 id="the-local-server">The local server</h2>
<p>A local Python server is a straightforward and convenient way to serve your search. The database doesn’t get large, since personal notes rarely amount to that much text, and a running server is way faster than executing a script on demand.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">urllib.parse</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">http.server</span> <span class="kn">import</span> <span class="n">BaseHTTPRequestHandler</span><span class="p">,</span> <span class="n">HTTPServer</span>
<span class="kn">from</span> <span class="nn">langchain.embeddings</span> <span class="kn">import</span> <span class="n">OpenAIEmbeddings</span>
<span class="kn">from</span> <span class="nn">langchain.vectorstores</span> <span class="kn">import</span> <span class="n">Chroma</span>
<span class="k">class</span> <span class="nc">RequestHandler</span><span class="p">(</span><span class="n">BaseHTTPRequestHandler</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">do_GET</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="p">.</span><span class="n">send_response</span><span class="p">(</span><span class="mi">200</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">send_header</span><span class="p">(</span><span class="s">'Content-type'</span><span class="p">,</span> <span class="s">'text/plain'</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">end_headers</span><span class="p">()</span>
<span class="c1"># Input
</span> <span class="n">request_str</span> <span class="o">=</span> <span class="n">urllib</span><span class="p">.</span><span class="n">parse</span><span class="p">.</span><span class="n">unquote</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"/api/"</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">])</span>
<span class="k">print</span><span class="p">(</span><span class="n">request_str</span><span class="p">)</span>
<span class="c1"># Retrieve docs
</span> <span class="n">retrieved_docs</span> <span class="o">=</span> <span class="n">vectordb</span><span class="p">.</span><span class="n">similarity_search_with_score</span><span class="p">(</span><span class="n">request_str</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">org_link_format</span> <span class="o">=</span> <span class="s">"[%.2f]: [[id:%s][%s]] </span><span class="se">\n</span><span class="s"> %s"</span>
<span class="n">docs</span> <span class="o">=</span> <span class="p">[</span><span class="n">org_link_format</span> <span class="o">%</span> <span class="p">(</span><span class="n">score</span><span class="p">,</span> <span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"ID"</span><span class="p">],</span>
<span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"title"</span><span class="p">].</span><span class="n">strip</span><span class="p">(),</span>
<span class="n">doc</span><span class="p">.</span><span class="n">metadata</span><span class="p">[</span><span class="s">"hierarchy"</span><span class="p">].</span><span class="n">strip</span><span class="p">())</span>
<span class="k">for</span> <span class="n">doc</span><span class="p">,</span> <span class="n">score</span> <span class="ow">in</span> <span class="n">retrieved_docs</span><span class="p">]</span>
<span class="c1"># Format the output
</span> <span class="n">response_str</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"#+title: Most similar nodes </span><span class="se">\n\n</span><span class="s">:QUERY:</span><span class="se">\n</span><span class="si">{</span><span class="n">request_str</span><span class="si">}</span><span class="s"> </span><span class="se">\n</span><span class="s">:END:</span><span class="se">\n\n</span><span class="s">"</span>
<span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">source</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">docs</span><span class="p">):</span>
<span class="n">response_str</span> <span class="o">+=</span> <span class="s">"* "</span> <span class="o">+</span> <span class="n">source</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span>
<span class="k">print</span><span class="p">(</span><span class="n">response_str</span><span class="p">)</span>
<span class="bp">self</span><span class="p">.</span><span class="n">wfile</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">response_str</span><span class="p">.</span><span class="n">encode</span><span class="p">())</span>
<span class="k">def</span> <span class="nf">run_server</span><span class="p">():</span>
<span class="n">server_address</span> <span class="o">=</span> <span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="mi">8800</span><span class="p">)</span>
<span class="n">httpd</span> <span class="o">=</span> <span class="n">HTTPServer</span><span class="p">(</span><span class="n">server_address</span><span class="p">,</span> <span class="n">RequestHandler</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">'Server is running on port </span><span class="si">{</span><span class="n">server_address</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="si">}</span><span class="s">'</span><span class="p">)</span>
<span class="n">httpd</span><span class="p">.</span><span class="n">serve_forever</span><span class="p">()</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">'__main__'</span><span class="p">:</span>
<span class="n">persist_directory</span> <span class="o">=</span> <span class="s">"the/path/to/your/vector-db"</span>
<span class="n">embedding</span> <span class="o">=</span> <span class="n">OpenAIEmbeddings</span><span class="p">(</span><span class="n">openai_api_key</span><span class="o">=</span><span class="s">"<your-openai-key>"</span><span class="p">)</span>
<span class="n">vectordb</span> <span class="o">=</span> <span class="n">Chroma</span><span class="p">(</span><span class="s">"langchain_store"</span><span class="p">,</span>
<span class="n">embedding_function</span><span class="o">=</span><span class="n">embedding</span><span class="p">,</span>
<span class="n">persist_directory</span><span class="o">=</span><span class="n">persist_directory</span><span class="p">)</span>
<span class="n">run_server</span><span class="p">()</span>
</code></pre></div></div>
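<p>Before wiring up Emacs, you can sanity-check the endpoint from Python. The query must be URL-encoded, mirroring what the Emacs client below sends; this sketch assumes the server above is running locally on port 8800:</p>

```python
import urllib.parse
import urllib.request

# Build the request URL the same way the Emacs client does
query = "What are the causes of model degradation?"
url = "http://localhost:8800/api/" + urllib.parse.quote(query)
print(url)

# With the server running, fetch the org-formatted results:
# with urllib.request.urlopen(url) as response:
#     print(response.read().decode())
```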
<h2 id="the-emacs-interface">The Emacs interface</h2>
<p>To use it from Emacs, we will create a function to make the request to our server and another that deals with its input and output.</p>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">call-roam-search-python-server</span> <span class="p">(</span><span class="nv">input-string</span><span class="p">)</span>
<span class="s">"Call Python server with INPUT-STRING and return the output string."</span>
<span class="p">(</span><span class="k">let</span> <span class="p">((</span><span class="nv">url-request-method</span> <span class="s">"GET"</span><span class="p">)</span>
<span class="p">(</span><span class="nv">url-request-extra-headers</span>
<span class="o">'</span><span class="p">((</span><span class="s">"Content-Type"</span> <span class="o">.</span> <span class="s">"text/plain"</span><span class="p">))))</span>
<span class="p">(</span><span class="nv">with-current-buffer</span>
<span class="p">(</span><span class="nv">url-retrieve-synchronously</span> <span class="p">(</span><span class="nv">concat</span> <span class="s">"http://localhost:8800/api/"</span> <span class="nv">input-string</span><span class="p">))</span>
<span class="p">(</span><span class="nv">goto-char</span> <span class="p">(</span><span class="nv">point-min</span><span class="p">))</span>
<span class="p">(</span><span class="nv">search-forward-regexp</span> <span class="s">"\n\n"</span><span class="p">)</span>
<span class="p">(</span><span class="nv">buffer-substring</span> <span class="p">(</span><span class="nv">point</span><span class="p">)</span> <span class="p">(</span><span class="nv">point-max</span><span class="p">)))))</span>
</code></pre></div></div>
<div class="language-elisp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">(</span><span class="nb">defun</span> <span class="nv">org-roam-semantic-search-api</span> <span class="p">()</span>
<span class="s">"Call the semantic-search API on a selected region or ask for an input. Display its output in an org buffer."</span>
<span class="p">(</span><span class="nv">interactive</span><span class="p">)</span>
<span class="p">(</span><span class="k">let*</span> <span class="p">((</span><span class="nv">text</span> <span class="p">(</span><span class="k">if</span> <span class="p">(</span><span class="nv">use-region-p</span><span class="p">)</span>
<span class="p">(</span><span class="nv">buffer-substring-no-properties</span> <span class="p">(</span><span class="nv">region-beginning</span><span class="p">)</span> <span class="p">(</span><span class="nv">region-end</span><span class="p">))</span>
<span class="p">(</span><span class="nv">read-string</span> <span class="s">"Enter search: "</span><span class="p">)))</span>
<span class="p">(</span><span class="nv">buf</span> <span class="p">(</span><span class="nv">get-buffer-create</span> <span class="s">"*org-roam-similar-nodes*"</span><span class="p">))</span>
<span class="p">(</span><span class="nv">api-output</span> <span class="p">(</span><span class="nv">call-roam-search-python-server</span> <span class="nv">text</span><span class="p">)))</span>
<span class="p">(</span><span class="nv">with-current-buffer</span> <span class="nv">buf</span>
<span class="p">(</span><span class="nv">erase-buffer</span><span class="p">)</span>
<span class="p">(</span><span class="nv">org-mode</span><span class="p">)</span>
<span class="p">(</span><span class="nv">insert</span> <span class="p">(</span><span class="nb">format</span> <span class="s">"%s"</span> <span class="nv">api-output</span><span class="p">))</span>
<span class="p">(</span><span class="nv">org-shifttab</span><span class="p">)</span>
<span class="p">(</span><span class="nv">display-buffer</span> <span class="nv">buf</span><span class="p">))))</span>
</code></pre></div></div>
<p>I map the <code class="language-plaintext highlighter-rouge">org-roam-semantic-search-api</code> together with other org-roam commands as <code class="language-plaintext highlighter-rouge">C-c n a</code> so it becomes part of my usual flow.</p>
<p>Now, you should be able to run <code class="language-plaintext highlighter-rouge">M-x org-roam-semantic-search-api</code> while selecting a portion of text (or entering an input for it) and see the screen shown in the intro with related Org roam nodes!</p>
<div align="center">
<figure>
<a href="../../../images/ssorgroam/magic-org-roam.png" name="tsne">
<img style="width:550px;margin:2px" src="../../../images/ssorgroam/magic-org-roam.png" />
</a>
<figcaption>A magic Org roam!</figcaption>
</figure>
</div>
<h2 id="directions-to-make-different-design-choices">Directions to make different design choices</h2>
<h3 id="using-an-open-source-model">Using an open-source model</h3>
<p>The OpenAI model is convenient because it accepts a large input (8192 tokens), is high quality, doesn’t run locally (which keeps it fast without special hardware), and is not expensive. However, there are open-source options, and there will be many more soon.</p>
<p>For those that don’t want to see their data going around, one option to run locally is the <code class="language-plaintext highlighter-rouge">sentence-transformers</code> library.</p>
<p>However, if we keep the design choice of representing every Org roam node with a single vector, we need a workaround, since their best model accepts at most 384 input tokens.</p>
<p><a href="https://gist.github.com/lgmoneda/f54575eebaa8932ca926f5d0526e8a31">Here</a> I provide a couple of functions to embed a list of documents. The functions will split a single document into sentences and use the maximum input token of the model to aggregate as many sentences as possible and form chunks. Then it will average the embeddings of all the chunks from a single document. That’s the approach I’ve used for the Org-roam discourse post.</p>
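<p>The core of that approach can be sketched as follows: greedily pack sentences into chunks under the model’s token limit, embed each chunk, and average the chunk vectors. Everything here is a simplified illustration — <code class="language-plaintext highlighter-rouge">embed</code> is a hypothetical stand-in for the sentence-transformers model, and word count stands in for real token counting:</p>

```python
def chunk_sentences(sentences, max_tokens):
    # Greedily pack sentences into chunks that respect the model's input limit
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # crude token proxy for this sketch
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

def embed_document(sentences, embed, max_tokens=384):
    # Embed each chunk, then average the chunk vectors component-wise
    vectors = [embed(chunk) for chunk in chunk_sentences(sentences, max_tokens)]
    return [sum(dims) / len(vectors) for dims in zip(*vectors)]

# A fake embedding "model" mapping a text to a 2-d vector, for demonstration
fake_embed = lambda text: [float(len(text)), 1.0]
print(embed_document(["a b c", "d e", "f"], fake_embed, max_tokens=3))  # [5.0, 1.0]
```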
<h3 id="breaking-the-org-roam-nodes-into-chunks">Breaking the Org roam nodes into chunks</h3>
<p>A more promising direction is to split the text further and not treat the Org roam node as the most granular entity of your knowledge base. When playing with Retrieval Augmented Generation with ChatGPT over my Org roam notes, I needed to split nodes to provide many different text pieces to ChatGPT.</p>
<p>The following code shows the Vector DB ingestion part with a loop that breaks Org roam nodes into multiple texts.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">langchain.text_splitter</span> <span class="kn">import</span> <span class="n">CharacterTextSplitter</span>
<span class="n">text_splitter</span> <span class="o">=</span> <span class="n">CharacterTextSplitter</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">300</span><span class="p">,</span> <span class="n">chunk_overlap</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">data</span><span class="p">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">org_id</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_id"</span><span class="p">]</span>
<span class="n">title</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_title"</span><span class="p">]</span>
<span class="n">file_name</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"file_name"</span><span class="p">]</span>
<span class="n">node_hierarchy</span> <span class="o">=</span> <span class="n">row</span><span class="p">[</span><span class="s">"node_hierarchy"</span><span class="p">]</span>
<span class="n">texts</span> <span class="o">=</span> <span class="n">text_splitter</span><span class="p">.</span><span class="n">split_text</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s">"node_text_nested_exclusive"</span><span class="p">])</span>
<span class="n">texts</span> <span class="o">=</span> <span class="p">[</span><span class="s">"["</span> <span class="o">+</span> <span class="n">node_hierarchy</span> <span class="o">+</span> <span class="s">"] "</span> <span class="o">+</span> <span class="n">text</span> <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">texts</span><span class="p">]</span>
<span class="n">metadatas</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">{</span>
<span class="s">"source"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">index</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
<span class="s">"ID"</span><span class="p">:</span> <span class="n">org_id</span><span class="p">,</span>
<span class="s">"title"</span><span class="p">:</span> <span class="n">title</span><span class="p">,</span>
<span class="s">"hierarchy"</span><span class="p">:</span> <span class="n">node_hierarchy</span><span class="p">,</span>
<span class="s">"file_name"</span><span class="p">:</span> <span class="n">file_name</span><span class="p">,</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">))</span>
<span class="p">]</span>
<span class="n">ids</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">index</span><span class="si">}</span><span class="s">-</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">texts</span><span class="p">))]</span>
<span class="n">vectordb</span><span class="p">.</span><span class="n">add_texts</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="n">metadatas</span><span class="o">=</span><span class="n">metadatas</span><span class="p">,</span> <span class="n">ids</span><span class="o">=</span><span class="n">ids</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="plotting-the-relationship-between-the-different-nodes">Plotting the relationship between the different nodes</h3>
<p>In my original post on Org-roam Discourse, an image displayed the returned nodes in a 2d representation. It is intended to let you spot the relationships between them, so you should be able to see a few clusters in the image. E.g., if we colored the data points using their tags, nodes with the same color should appear close together in the plot.</p>
<p>We need to retrieve the vectors from the similarity search to build it. Currently, langchain doesn’t expose them for Chroma, but we can write a modified search function that does.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Tuple</span>
<span class="k">def</span> <span class="nf">similarity_search_with_embeddings</span><span class="p">(</span>
<span class="n">vectordb</span><span class="p">,</span>
<span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
<span class="n">k</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span><span class="p">,</span>
<span class="nb">filter</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="o">**</span><span class="n">kwargs</span><span class="p">:</span> <span class="n">Any</span><span class="p">,</span>
<span class="p">):</span>
<span class="s">"""Run similarity search with Chroma and return labels, embeddings and scores of top k results.
Args:
query (str): Query text to search for.
k (int): Number of results to return. Defaults to 4.
filter (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
Returns:
Labels (List), Embeddings (List), Scores (List)
"""</span>
<span class="n">query_embedding</span> <span class="o">=</span> <span class="n">vectordb</span><span class="p">.</span><span class="n">_embedding_function</span><span class="p">.</span><span class="n">embed_query</span><span class="p">(</span><span class="n">query</span><span class="p">)</span>
<span class="n">results</span> <span class="o">=</span> <span class="n">vectordb</span><span class="p">.</span><span class="n">_collection</span><span class="p">.</span><span class="n">query</span><span class="p">(</span>
<span class="n">query_embedding</span><span class="p">,</span>
<span class="n">n_results</span><span class="o">=</span><span class="n">k</span><span class="p">,</span>
<span class="n">where</span><span class="o">=</span><span class="nb">filter</span><span class="p">,</span>
<span class="n">include</span><span class="o">=</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">,</span> <span class="s">"documents"</span><span class="p">,</span> <span class="s">"distances"</span><span class="p">,</span> <span class="s">"embeddings"</span><span class="p">],</span> <span class="o">**</span><span class="n">kwargs</span>
<span class="p">)</span>
<span class="k">return</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="s">"title"</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"metadatas"</span><span class="p">][</span><span class="mi">0</span><span class="p">]],</span> <span class="n">results</span><span class="p">[</span><span class="s">"embeddings"</span><span class="p">][</span><span class="mi">0</span><span class="p">],</span> <span class="n">results</span><span class="p">[</span><span class="s">"distances"</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<p>With a list of labels, embeddings, and scores (colors), we can call a function to give us that plot.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.manifold</span> <span class="kn">import</span> <span class="n">TSNE</span>
<span class="kn">from</span> <span class="nn">adjustText</span> <span class="kn">import</span> <span class="n">adjust_text</span>
<span class="k">def</span> <span class="nf">plot_vectors</span><span class="p">(</span><span class="n">vectors</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">colors</span><span class="p">):</span>
<span class="n">shorten_labels</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">label</span> <span class="ow">in</span> <span class="n">labels</span><span class="p">:</span>
<span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">label</span><span class="p">)</span> <span class="o">></span> <span class="mi">30</span><span class="p">:</span>
<span class="n">label</span> <span class="o">=</span> <span class="n">label</span><span class="p">[:</span><span class="mi">30</span><span class="p">]</span> <span class="o">+</span> <span class="s">"..."</span>
<span class="n">shorten_labels</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">label</span><span class="p">)</span>
<span class="n">labels</span> <span class="o">=</span> <span class="n">shorten_labels</span>
<span class="c1"># Project the vectors into two dimensions using t-SNE
</span> <span class="n">tsne</span> <span class="o">=</span> <span class="n">TSNE</span><span class="p">(</span><span class="n">n_components</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">perplexity</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="n">projected_vectors</span> <span class="o">=</span> <span class="n">tsne</span><span class="p">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">vectors</span><span class="p">))</span>
<span class="c1"># Create a colormap with different shades of purple
</span> <span class="n">cmap</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">get_cmap</span><span class="p">(</span><span class="s">"winter"</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">))</span>
<span class="c1"># Plot the projected vectors with different shades of purple
</span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">projected_vectors</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">projected_vectors</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="n">colors</span><span class="p">,</span> <span class="n">cmap</span><span class="o">=</span><span class="n">cmap</span><span class="p">)</span>
<span class="c1"># Add labels to the plot using adjustText
</span> <span class="n">texts</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">label</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">labels</span><span class="p">,</span> <span class="n">projected_vectors</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">projected_vectors</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]):</span>
<span class="n">text</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">annotate</span><span class="p">(</span><span class="n">label</span><span class="p">,</span> <span class="n">xy</span><span class="o">=</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">),</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">15</span><span class="p">)</span>
<span class="n">texts</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
<span class="n">adjust_text</span><span class="p">(</span><span class="n">texts</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="s">"path/to/the/output.jpg"</span><span class="p">)</span>
</code></pre></div></div>
<p>Now you can modify the API output to include a link to the image: <code class="language-plaintext highlighter-rouge">[[file:/path/to/the/output.jpg]]</code>. In Elisp, remember to toggle the image display so you can see it. Add <code class="language-plaintext highlighter-rouge">(org-display-inline-images nil t)</code> to the end of the <code class="language-plaintext highlighter-rouge">org-roam-semantic-search-api</code>.</p>
<div align="center">
<figure>
<a href="../../../images/ssorgroam/tsne.jpg" name="tsne">
<img style="width:550px;margin:2px" src="../../../images/ssorgroam/tsne.jpg" />
</a>
<figcaption>The green dots are the most similar to the query. Their proximity to each other should enable the user to visually cluster subjects in the results.</figcaption>
</figure>
</div>
<h2 id="next-steps">Next steps</h2>
<p>We can improve the search and thinking experience by providing a conversational interface with ChatGPT to do Questions & Answers based on the knowledge in Org roam, including the nodes it has used as sources.</p>
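<p>One way to wire that up is to paste the retrieved chunks into the prompt together with their node titles, so the model can cite which notes it used. A minimal sketch, with a hypothetical <code class="language-plaintext highlighter-rouge">build_qa_prompt</code> helper (not the actual implementation):</p>

```python
from typing import List, Tuple


def build_qa_prompt(question: str, retrieved: List[Tuple[str, str]]) -> str:
    """retrieved: (node_title, chunk_text) pairs from the similarity search."""
    context = "\n\n".join(f"[{title}]\n{text}" for title, text in retrieved)
    return (
        "Answer the question using only the notes below, and list the "
        "titles of the notes you used as sources.\n\n"
        f"Notes:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Example with a made-up retrieved chunk:
prompt = build_qa_prompt(
    "What is the Zettelkasten method?",
    [("Zettelkasten", "A note-taking system in which you link notes.")],
)
```

<p>The prompt string would then go to the chat completion API, and the cited titles can be mapped back to Org roam links in the answer.</p>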
<div align="center">
<figure>
<a href="../../../images/ssorgroam/chatgpt.png" name="tsne">
<img style="width:600px;margin:10px" src="../../../images/ssorgroam/chatgpt.png" />
</a>
<figcaption>A single-interaction Q&A with the Org roam notes</figcaption>
</figure>
</div>
<p>A video showing it live:</p>
<div align="center">
<iframe width="560" height="315" src="https://www.youtube.com/embed/vHamsOQW_N4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen=""></iframe>
</div>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Moneda, L. (2023). <a href="http://lgmoneda.github.io/2023/01/25/relevance-revolution-knowledge-work.html">A relevance revolution for knowledge-workers</a>. Blogpost at lgmoneda.github.io. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>Shipper, D. (2022) The Fall of Roam. <a href="https://every.to/superorganizers/the-fall-of-roam">Link</a>. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<h1><a href="http://lgmoneda.github.io/2023/01/25/relevance-revolution-knowledge-work">A relevance revolution for knowledge-workers</a></h1>
<p>Luis Moneda, 2023-01-25</p>
<h2 id="the-collaboration-information-problem">The collaboration-information problem</h2>
<p>Cognition, time, and energy constrain our agency as individuals. We break it a bit via collaboration: combined cognition, time, and energy to achieve more in a group. Further, a group of groups. An organization. Constraints hit again. One can’t add people to a group and indefinitely benefit from it. There is a limit in human communication <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup> that limits group productivity <sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup>. It does not scale.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/intro.png" name="intro">
<img style="width:450px;margin:10px" src="../../../images/rel-rev/intro.png" />
</a>
<figcaption>Collaborative work comes with an increasing burden for every individual as the group grows.</figcaption>
</figure>
</div>
<p>Communication is about information flow. Collaborative work towards a goal is about using that information to achieve it. Not all information generated needs to flow, and not all details of what needs to flow need to be informed. There are filtering and compression tasks.</p>
<p>When we execute these tasks, there is a performance issue: not all the information and detail that flows is needed to achieve the group goal, and not all the information and detail needed to achieve the goal actually flows.</p>
<p>I suggest that succeeding at these tasks with high performance is needed to complexify what we can achieve in human organizations.</p>
<p>Evolution gives us the evidence. Our brain needs to make an effort to focus on what matters <sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote" rel="footnote">3</a></sup>, to split the relevant from the irrelevant. Biological systems evolved using hierarchy to complexify from cells with a relatively low agency. The cells of our body are specialized. They don’t know everything going on while doing the complex holistic work <sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup>. We, humans, emerge from this hierarchy, which makes me think they do it well. Biological systems have a long history of evolution. Social systems, like a company, do not.</p>
<p>I’m interested in the collaborative knowledge work that happens in companies <em>at scale</em>. Especially analytics, software, and product. I see a gap between knowledge creation and application. We fail to make knowledge relevant.</p>
<h2 id="a-company-as-a-social-system-collaborating-towards-a-shared-goal">A Company as a social system collaborating towards a shared goal</h2>
<p>If a company as a system was not short in collective cognition, I would not expect to systematically observe:</p>
<ul>
<li>Committing the same mistakes;</li>
<li>Applying a certain approach while a better one is known and proven internally;</li>
<li>Missing synergies;</li>
<li>Reinventing the wheel;</li>
</ul>
<p>It is not a documentation problem. Documenting is only the first step. It is a collective knowledge creation-application gap.</p>
<p>A company can register all their incidents in Post Mortem notes. What do we expect from that document hidden in a digital drawer? Should we all go through it weekly, monthly, and yearly to refresh what we know about getting things wrong? How will that representation of knowledge interact with what is happening daily?</p>
<p>Though we document, we still depend on humans as information retrieval devices.</p>
<h3 id="the-burden-of-being-an-information-retrieval-device-as-a-human">The burden of being an information-retrieval device as a human</h3>
<p>Many questions are asked in a company in search of information instead of collaboration:</p>
<ul>
<li>Who worked on it?</li>
<li>Where can I find it?</li>
<li>How do I know it?</li>
<li>Who’s responsible for it?</li>
<li>Why did we do it?</li>
</ul>
<p>People will bounce ideas off higher-tenure people to learn about past failures, similar projects, who to talk to, and why people did things the way they did. To get a historical perspective. Then these people might retrieve a document with the details for them.</p>
<p>We do it because high-tenure people have a mental index of the organization’s knowledge body, and they have a semantic search to point to what is relevant or not, including high levels of abstraction. E.g., you might not cite “behavioral change” as part of the project. Still, that person can read between the lines and match it to other initiatives involving behavioral change, even if they differ in their domains.</p>
<p>When people cannot answer it, they will feel someone else should and vaguely route it: “I think you should talk to Alice; she has been here since the beginning”. Evidently, no one will be able to carry the whole knowledge body of a company in their mind.</p>
<p>On the other side, after generating knowledge, we face many questions:</p>
<ul>
<li>Who is interested in it? Why?</li>
<li>How and where should I broadcast it?</li>
<li>How will I know if people who will need it in the future will find it?</li>
</ul>
<p>We have relative success in applying it once, but we mostly fail to reuse it.</p>
<h3 id="knowledge-is-supposed-to-be-modular">Knowledge is supposed to be modular</h3>
<p>The information repositories we use, like Google Drive, have poor search capabilities. They mostly retrieve what we already know exists. Imagine we are a team starting a project to redesign a referral program. There is some “internal bibliographical review” to do.</p>
<ul>
<li>What are the nudges we did in the app, and how successful were they?</li>
<li>Why the current referral program is the way it is?</li>
<li>Have we tried to change it before?</li>
</ul>
<p>The intended outcome is to have incremental customer releases via the referral program, which is quite specific. Nonetheless, numerous generic assumptions will support the different solutions in our solution space.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/ost.svg" name="opportunity solution tree">
<img style="width:750px;margin:10px" src="../../../images/rel-rev/ost.svg" />
</a>
<figcaption>Teams raise assumptions, test them, and generate knowledge. However, we hardly reuse it, especially by abstracting it a bit to apply it in a different context.</figcaption>
</figure>
</div>
<p>They might not be specific enough to make us discard an idea. We might have previous interviews, surveys, or experiments showing minor customer sensitivity to monetary incentives, but think the referral context can change it. Still, that prior knowledge will undoubtedly impact the discovery design and the amount of evidence needed to accept the idea.</p>
<p>Applying collective knowledge implies spreading and reusing proven assumptions efficiently as building blocks.</p>
<p>In a large organization, one will hardly know which of the experimental results produced by the entire company last week are relevant to one’s context. Or which proven hypotheses one could reuse for a new endeavor.</p>
<p>An example of the usage of knowledge modularity is research. In an article, we can see the references building up to enable shortcuts and efficiency via reusability. Of course, academia does it at a high cost.</p>
<p>As a more objective building block, we have code. Code is a sub-problem of what we are describing. It is intended to be reused. However, at least in analytics, reusability is lower than I would expect. Many villains have been pointed out; computational notebooks, for instance, are commonly criticized for their low reproducibility and reusability <sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote" rel="footnote">5</a></sup>.</p>
<h2 id="envisioning-the-system">Envisioning the system</h2>
<p>This struggle is not new. Engelbart (1962) <sup id="fnref:fn6" role="doc-noteref"><a href="#fn:fn6" class="footnote" rel="footnote">6</a></sup> proposed a framework for cognition augmentation based on Human and Tool systems. The pinnacle of his framework is the Concurrent Development, Integration, and Application of Knowledge (CODIAK). Engelbart calls it <em>groupware</em>, software for group work.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/engelbart_1962.jpg" name="intro">
<img style="width:450px;margin:10px" src="../../../images/rel-rev/engelbart_1962.jpg" />
</a>
<figcaption> Engelbart (1962) </figcaption>
</figure>
</div>
<p>The CODIAK covers in spirit what we are talking about. Most of the attention goes into general challenges of Information Systems and collaboration, like integrating different media types, access control, and interfacing between other systems. We can say that current systems more or less cover the requirements for Engelbart’s CODIAK.</p>
<p>However, Engelbart predicts that radical changes in important parameters of this system can bring first a steep quantitative change, then a qualitative change that will bring surprising new ways of working <sup id="fnref:fn7" role="doc-noteref"><a href="#fn:fn7" class="footnote" rel="footnote">7</a></sup>. He believes it would come from tool systems since they can impose a radical change in speed, function, quality, etc.</p>
<p>Engelbart recognizes speech recognition as a possible transformative technology in this context <sup id="fnref:fn7:1" role="doc-noteref"><a href="#fn:fn7" class="footnote" rel="footnote">7</a></sup>. It would contribute to the flexibility of contributing and integrating knowledge. Nonetheless, I don’t think speech recognition delivered it in the collaborative knowledge-work context. It might be great for documenting knowledge. However, alone it does not boost its application, which could only increase the creation-application gap.</p>
<p>The chances are we will see Large Language Models (LLMs, e.g., the Generative Pre-trained Transformer GPT-3) generating this radical change. First, by transforming textual data into knowledge creation in a very flexible way (including what comes from speech recognition). Second, by manipulating knowledge to increase its application in a way that is not qualitatively complex for humans but quantitatively impossible: in organizations of thousands of people. LLMs will tighten the knowledge creation-application gap.</p>
<p>Still, I don’t see <a href="https://chat.openai.com/">ChatGPT</a> as a central savior system. The examples are mere illustrations. I believe in a series of specialized tools for collaborating knowledge workers, originating from applications of LLMs fine-tuned with expert knowledge. Further, the capability to point to sources will be crucial to keep humans in the loop. It is a Knowledge Platform. We want to reduce the friction of application to close the creation-application gap.</p>
<p>The fact that academia needs to explicitly build on top of prior knowledge made academic research an early home for this kind of system, in the form of <a href="https://elicit.org/">Elicit</a>, an AI Research Assistant, or even <a href="https://www.explainpaper.com/">Explain Paper</a>.</p>
<h3 id="a-personal-nano-experience">A personal nano-experience</h3>
<p>I’ve been using org-roam for a while. It applies the Zettelkasten method to Emacs. In summary, it is a note-taking system in which you link notes. The start is fascinating; you make many connections and remember every note. However, as you add more notes, it gets harder to connect them or to remember you have a useful note when living through a specific situation <sup id="fnref:fn8" role="doc-noteref"><a href="#fn:fn8" class="footnote" rel="footnote">8</a></sup>. The knowledge creation-application gap starts to increase. After 2.5 years, I have 3.4k notes. It becomes non-trivial to connect and apply them.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/my-org-roam-bg.png" name="intro">
<img style="width:450px;margin:10px" src="../../../images/rel-rev/my-org-roam-bg.png" />
</a>
<figcaption> As the number of my notes grew, I was less certain I would be able to retrieve the right one at the most convenient moment. </figcaption>
</figure>
</div>
<p>What is the point of taking personal notes if you can’t remember them?</p>
<p>First, on the “personal” aspect of the notes: It is similar to looking at the picture you took of the Eiffel Tower or getting a picture of it from Google Images. When you look at your own photo, you retrieve more than the image of the tower. You get back part of what happened before and after and how you felt. Your own madeleine moment <sup id="fnref:fn9" role="doc-noteref"><a href="#fn:fn9" class="footnote" rel="footnote">9</a></sup>. The same applies to knowledge. When I look at my notes, it refreshes an entire region in my brain and retrieves more than the specific things on that note, mainly because I will start exploring the neighbor nodes that exist because I’ve created them, which makes the process flow very well.</p>
<p>Now, the puzzle: how to make me look at a specific note at a convenient time?</p>
<p>A good search makes a lot of difference. I have indexed all my notes using open-source LLMs to close the creation-application gap. I can relax when creating a new note since I’m confident about finding it when it is convenient. When I write, I create a “thinking arena” in which I keep querying my notes, grab some, link to the thinking box, explore the notes graph, query more, and keep this process until I feel I could refresh my references on the subject accordingly.</p>
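<p>Under the hood, the retrieval step of such a search is just a nearest-neighbor lookup over the note embeddings. A minimal sketch with made-up 2d vectors and note titles (real setups use a vector index instead of a linear scan, and embeddings with hundreds of dimensions):</p>

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k_notes(query_vec: List[float], notes: Dict[str, List[float]], k: int = 3) -> List[str]:
    """Return the titles of the k notes closest to the query embedding."""
    return sorted(notes, key=lambda title: cosine(query_vec, notes[title]), reverse=True)[:k]


# Tiny example with fabricated embeddings:
notes = {"Communitas": [0.9, 0.1], "DRY principle": [0.1, 0.9]}
best = top_k_notes([1.0, 0.0], notes, k=1)
```
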
<div align="center">
<figure>
<a href="../../../images/rel-rev/semantic-search.png" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/semantic-search.png" />
</a>
<figcaption>Searching on my notes for the paragraph that follows this image ("Nonetheless, the collaborative... simultaneously"). I was positively surprised by the connection with Communitas.</figcaption>
</figure>
</div>
<p>Nonetheless, the collaborative knowledge-work setting is slightly different. You would look at the pictures of others. However, I see it as an intermediate point between looking at personal pictures and doing a Google Image search. It is more like traveling in a group where you are not always together, but you touch base. You can explore different things, but you had breakfast together and shared your plans. You were in the same city, weather, and culture simultaneously.</p>
<p>In a company, there are common languages, rituals, and goals. One will see many familiar things to trigger their internal knowledge graph.</p>
<p>At this point, we get back to the original puzzle: how to make it likely for someone to apply the knowledge available in their organization at a low cost?</p>
<h3 id="the-revolutionary-flexibility">The revolutionary flexibility</h3>
<p>The revolutionary aspect will come from the completeness of sources and the broad coverage of applications. Generating data for this system won’t be an extra task: everything a team codes, writes, or says becomes unstructured data. This high flexibility of the input makes it revolutionary.</p>
<p>Code base (GitHub), document repository (Google Drive), work management (Jira), and meetings (Zoom): unstructured data converts these diverse knowledge packages into common ground.</p>
<p>Application is not easy or trivial, but the large surface for application will probably be covered quickly, motivated by the high return. The interface and the output have interesting characteristics: conversational, accessible, and structure-free.</p>
<p>The common goal for these applications is to empower the system to make the application happen. E.g., given someone is writing code, it <em>can’t</em> let them ignore what we already know or have.</p>
<p>What if LLMs are just fancy parrots?</p>
<p>It doesn’t matter. The fact that we are increasing data ingestion and the quality of information retrieval by orders of magnitude is likely enough. Cost reduction is likely more important at this point than improving its skills. In the end, it is augmenting collective intelligence, not replacing it. It doesn’t need to have any or all the traits of what we’d call intelligence <sup id="fnref:fn15" role="doc-noteref"><a href="#fn:fn15" class="footnote" rel="footnote">10</a></sup> <sup id="fnref:fn16" role="doc-noteref"><a href="#fn:fn16" class="footnote" rel="footnote">11</a></sup>. We are not envisioning a central intelligent entity telling an organization what to do. It is closer to what computers and networks of information systems did.</p>
<h2 id="sub-systems-and-applications">Sub-systems and applications</h2>
<h3 id="the-dry-becomes-dro">The DRY becomes DRO</h3>
<p>There is a principle in Software Engineering called DRY: Don’t Repeat Yourself <sup id="fnref:fn10" role="doc-noteref"><a href="#fn:fn10" class="footnote" rel="footnote">12</a></sup>, which states we should avoid repetition via good abstractions. The frustration is better expressed by the opposite, WET: write everything twice, write every time, we enjoy typing, or waste everyone’s time.</p>
<p>Suppose you extend it to everything else in an organization. In that case, it means solving the same puzzle twice, running the same experiments every time, enjoying creating remarkably similar tools, reinventing the wheel, in summary, wasting energy.</p>
<p>Nonetheless, we should assume we don’t do it because we like it but because we have a narrow context. It is way trickier for an individual or group to realize it because they might be doing something inspiring and novel for <em>them</em>, but unaware they are repeating someone else in the organization, which is organizationally unexciting.</p>
<p>The system will have access to the knowledge base and have the objective of not letting someone waste energy. We can imagine many sub-systems.</p>
<p>For coding, it can iteratively read what someone is writing and pop up to tell them when it is confident enough that the organization has an excellent abstraction to deal with that. Even if it requires the person to write the complete code first, realizing the duplication and replacing it is beneficial in the long term.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/reuse-1.jpg" name="Reuse code 1">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/reuse-1.jpg" />
</a>
</figure>
<figure>
<a href="../../../images/rel-rev/reuse-2.jpg" name="Reuse code 2">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/reuse-2.jpg" />
</a>
</figure>
<figure>
<a href="../../../images/rel-rev/reuse-3.jpg" name="Reuse code 3">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/reuse-3.jpg" />
</a>
<figcaption> I've tried to do the same with an open source library which is not as large as TensorFlow and PyTorch, but the result was not good. However, I believe there is way more internal data about this library than external.</figcaption>
</figure>
</div>
<p>When writing a Request For Comments (RFC) or Product Requirement Document (PRD), someone should get instant tips on what the team knows about that subject and decide how to incorporate it. Especially in an RFC: you automatically request comments from the whole organization’s knowledge body as you write down an issue, the assumptions involved, the solution, the alternative solutions, etc. We write to think, but we will start to write to think collectively. That’s the augmentation. Evidently, it is supportive. It brings to light likely relevant pieces of knowledge, but it will require the team to go deeper into the references. It is a collective intelligence collection.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/project-list.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/project-list.jpg" />
</a>
<figcaption> ChatGPT needed a further step, but it is clear it has enough representational power to relate a task to a tool capability in a reasonable level of abstraction. </figcaption>
</figure>
</div>
<p>The “don’t repeat yourself” becomes “don’t repeat ourselves”.</p>
<h3 id="rationale-retrieval">Rationale retrieval</h3>
<p>Most of the time, a decision that seems to make no sense does make sense given its context. However, retrieving that context hugely depends on people or on a tortuous investigation of the many artifacts. Is it explained in a Pull Request, Slack thread, slide presentation, or RFC? It does not matter. Design decisions can be documented somewhere, but it is powerful to be able to retrieve them so easily.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/org-roam-rationale.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/org-roam-rationale.jpg" />
</a>
<figcaption> Retrieving the rationale for design decisions.</figcaption>
</figure>
</div>
<h3 id="creative-collaboration">Creative collaboration</h3>
<ul>
<li>Your problem looks a lot like what happened on the Pricing team two years ago. Did you take a look at it?</li>
<li>I think we can use this internal tool to solve it. If I’m not mistaken, it fits your use case.</li>
</ul>
<p>This kind of observation wouldn’t be as valuable anymore. The information-retrieval part of collective work should be automatic and, though not perfect, better than what anyone in the organization can do and what most reasonably sized groups could do. Managers lose their role as knowledge and networking hubs and replace it with more understanding, thinking, and rethinking of the collected information.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/fklearn.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/fklearn.jpg" />
</a>
<figcaption>A similar query, including all internal tools, could exclude this kind of discussion when a group is together.</figcaption>
</figure>
</div>
<p>The example provided is quite simplistic. But we can imagine a few background prompts running along with a project proposal for knowledge collection:</p>
<ul>
<li>From all the internal tools, which are the most likely useful to execute this project?</li>
<li>From all the experiments in our base, which are the most related to the proposal?</li>
<li>Who in the company has worked on similar initiatives?</li>
<li>Who in the company is currently working on similar projects?</li>
<li>What were the common mistakes we made in the past on related topics?</li>
</ul>
<p>A room full of people is for interpreting knowledge, mixing it, challenging current knowledge, designing new experiments, and clarifying hidden assumptions. Processing a large batch of information, abstraction, and correlation becomes a machine task. In this context, abstraction transforms text into vectors with interesting dimensions to correlate information.</p>
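As a toy illustration of that abstraction step, the sketch below ranks internal documents against a project-proposal embedding by cosine similarity. The document names and vectors are invented for illustration; a real system would obtain the vectors from an embedding model rather than by hand.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Toy embeddings standing in for an embedding model's output (made up).
knowledge_base = {
    "pricing-experiment-2021": [0.9, 0.1, 0.2],
    "fraud-feature-store-rfc": [0.1, 0.8, 0.3],
    "discount-elasticity-memo": [0.8, 0.2, 0.1],
}

def most_related(query_vector, base, top_k=2):
    """Rank documents by similarity to the query embedding."""
    ranked = sorted(base.items(), key=lambda kv: cosine(query_vector, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# A new project proposal embedded into the same space (also made up).
proposal = [0.88, 0.12, 0.18]
print(most_related(proposal, knowledge_base))
# → ['pricing-experiment-2021', 'discount-elasticity-memo']
```

The interesting dimensions mentioned above are exactly the axes of this vector space: two texts correlate when their embeddings point in similar directions, regardless of the words used.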
<h3 id="proactive-internal-recommender-system">Proactive Internal Recommender System</h3>
<p>It is appealing that someone could query for internal knowledge when doing a new project. Having relevant information brought to us at the opportune moment is even more enjoyable.</p>
<p>The system knows what I’m working on, what I have worked on in the past, and my team’s current and possible challenges. It is watching every meeting, reading every document, and checking every project to proactively tell me what is relevant.</p>
<p>When you are creating a description of a project or task in a work management tool, the system can proactively go through it and answer: given the knowledge corpus, tools, and current projects of this company, what is relevant to show to this person?</p>
<h3 id="organic-communication-paths-will-enable-organizations-to-do-better">Organic communication paths will enable organizations to do better</h3>
<p>Conway’s law states that system design reflects the communication structure of an organization <sup id="fnref:fn11" role="doc-noteref"><a href="#fn:fn11" class="footnote" rel="footnote">13</a></sup>, which applies to software systems. Conway’s law concerns communication cost, not org structure <sup id="fnref:fn2:1" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup>. Communication costs will drop drastically with the support of groupware systems based on LLMs. They will continuously correlate information generated in all parts of a company, relieving part of the burden our cognitive limits impose when designing software systems.</p>
<h4 id="organizational-recommendation">Organizational recommendation</h4>
<p>Let’s take as a principle that an organizational chart exists to establish the most relevant communication channels for achieving an organization’s goal. We can then flip from “put those teams together because they should communicate” to “these teams are communicating lately, put them together to optimize it”. Suppose we have a system that distributes relevant information and tracks where the information is generated and who consumes it. In that case, it can support a data-driven organizational chart.</p>
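The "these teams are communicating, put them together" flip can be sketched as a grouping rule over observed traffic. Everything below is invented for illustration (team names, message counts, the threshold); a real system would mine this from the tools a company already uses.

```python
def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def suggest_groups(traffic, threshold):
    """Union teams whose observed communication exceeds a threshold."""
    teams = sorted({t for pair in traffic for t in pair})
    parent = {t: t for t in teams}
    for (a, b), volume in traffic.items():
        if volume >= threshold:
            parent[find(parent, a)] = find(parent, b)
    groups = {}
    for t in teams:
        groups.setdefault(find(parent, t), []).append(t)
    return sorted(groups.values())

# Messages/documents exchanged per pair of teams in a quarter (made up).
traffic = {
    ("credit", "pricing"): 120,
    ("pricing", "platform"): 15,
    ("platform", "infra"): 90,
    ("credit", "infra"): 5,
}
print(suggest_groups(traffic, threshold=50))
# → [['credit', 'pricing'], ['infra', 'platform']]
```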
<div align="center">
<figure>
<a href="../../../images/rel-rev/org-chart-rec.svg" name="intro">
<img style="width:650px;margin:10px" src="../../../images/rel-rev/org-chart-rec.svg" />
</a>
<figcaption> Supposing we could trace back information origin, the information flow (in colors) could suggest how the organization should change.</figcaption>
</figure>
</div>
<h4 id="a-dynamic-hierarchy">A dynamic hierarchy</h4>
<p>We can go further. It could make organizational charts irrelevant. Hierarchy exists to limit information flow, which enables systems to become more complex <sup id="fnref:fn4:1" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup>. Not all the information needs to flow to all parts to collectively achieve complex goals. What if the parts get proactive support and tools to shape the communication flow? More complex collective behavior can emerge from parts getting relevant information based on a dynamic system that observes what they did, what they are doing, and their sub-goals, and the same for every other part of the organization. Communication paths shape what organizations can do.</p>
<h3 id="knowledge-governance-and-attribution">Knowledge governance and attribution</h3>
<p>Every domain and team will have reasons to document their knowledge in a convenient way to be processed and attributed. You want to expose everything your team learns from experiments, fails, analyses, etc. Just as a team wants to expose capabilities and data internally via APIs.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/knowledge-attribution-1.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/knowledge-attribution-1.jpg" />
</a>
</figure>
<figure>
<a href="../../../images/rel-rev/knowledge-attribution-2.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/knowledge-attribution-2.jpg" />
</a>
</figure>
<figure>
<a href="../../../images/rel-rev/knowledge-attribution-3.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/knowledge-attribution-3.jpg" />
</a>
</figure>
<figure>
<a href="../../../images/rel-rev/knowledge-attribution-4.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/knowledge-attribution-4.jpg" />
</a>
</figure>
</div>
<p>The incentive will be attribution. It enables rewarding helpful knowledge for the organization. If a team fails in a learningful way, it will generate a lot of knowledge that will be reused by many people in the organization, which will trace back to the failure and show how significant that contribution was.</p>
<div align="center">
<figure>
<a href="../../../images/rel-rev/learnful.jpg" name="intro">
<img style="width:550px;margin:10px" src="../../../images/rel-rev/learnful.jpg" />
</a>
<figcaption>It gets it!</figcaption>
</figure>
</div>
<h3 id="self-organization-for-technological-advancement">Self-organization for technological advancement</h3>
<p>Self-organization happens when a system can add, change, or evolve its structure. It happens via simple rules and elements: a raw material to select, a process that adds variety, and a selection step that interacts with what the system cares about and can signal success <sup id="fnref:fn4:2" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup>.</p>
<p>When we see a company as a system, its tech advance happens on top of a knowledge base (internal and external). The envisioned systems make it highly accessible. If the human creativity to insert variety into it and the capacity to experiment don’t become a bottleneck, we can expect a higher rate of tech advancement.</p>
<p>Accessibility means having available the stepping stones collected from a large group. We can hypothesize that tuning a system to offer surprising stepping stones to one’s endeavor can lead to great innovative results <sup id="fnref:fn12" role="doc-noteref"><a href="#fn:fn12" class="footnote" rel="footnote">14</a></sup> <sup id="fnref:fn13" role="doc-noteref"><a href="#fn:fn13" class="footnote" rel="footnote">15</a></sup>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Human systems will be important in this process. Using language to represent knowledge and making knowledge flow via Language Models will require people to pay more attention to language and communication. Surely, less structured than what Engelbart predicted <sup id="fnref:fn7:2" role="doc-noteref"><a href="#fn:fn7" class="footnote" rel="footnote">7</a></sup>, but it will become a matter of data quality. Regarding LLMs, current evidence points to a lot of assistance and expert knowledge <sup id="fnref:fn14" role="doc-noteref"><a href="#fn:fn14" class="footnote" rel="footnote">16</a></sup>, but it doesn’t seem to be a barrier. A system that inherits language processing from an LLM, is polished by expert software engineers on the sub-tasks that software engineering research considers important, and is fine-tuned on an internal code base looks like the kind of product we will consume.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Brooks Jr, F. P. (1995). The mythical man-month: essays on software engineering. Pearson Education. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>The Only Unbreakable Law (2022), <a href="https://www.youtube.com/watch?v=5IUj1EZwpJY">Youtube video</a>. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:fn2:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>Rensink, R. A. (2000). The dynamic representation of scenes. Visual cognition, 7(1-3), 17–42. <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>Meadows, D. H. (2008). Thinking in systems: a primer. Chelsea green publishing. <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:fn4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:fn4:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>Chattopadhyay, S., Prasad, I., Henley, A. Z., Sarma, A., & Barik, T. (2020). What’s wrong with computational notebooks? pain points, needs, and design opportunities. In , Proceedings of the 2020 CHI conference on human factors in computing systems (pp. 1–12). <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn6" role="doc-endnote">
<p>Engelbart, D. C. (1962). Augmenting human intellect: a conceptual framework. Menlo Park, CA, 21. <a href="#fnref:fn6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn7" role="doc-endnote">
<p>Engelbart, D. C. (1992). Toward high-performance organizations: a strategic role for groupware. In Proceedings of the GroupWare (pp. 3–5). <a href="#fnref:fn7" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:fn7:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:fn7:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:fn8" role="doc-endnote">
<p>Shipper, D. (2022) The Fall of Roam. <a href="https://every.to/superorganizers/the-fall-of-roam">Link</a>. <a href="#fnref:fn8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn9" role="doc-endnote">
<p>Proust, M. (1913). A la recherche du temps perdu: du côté de chez Swann. <a href="#fnref:fn9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn15" role="doc-endnote">
<p>Wong, Matteo. (2023) The Difference Between Speaking and Thinking. <a href="https://www.theatlantic.com/technology/archive/2023/01/chatgpt-ai-language-human-computer-grammar-logic/672902/">Link</a> - there is a pay wall, but the first two paragraphs are enough. <a href="#fnref:fn15" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn16" role="doc-endnote">
<p>LeCun, Y. Browning, J. (2022) AI And The Limits Of Language. <a href="https://www.noemamag.com/ai-and-the-limits-of-language/">Link</a>. <a href="#fnref:fn16" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn10" role="doc-endnote">
<p>Thomas, D., & Hunt, A. (2019). The pragmatic programmer. <a href="#fnref:fn10" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn11" role="doc-endnote">
<p>Conway, M. E. (1968). How do committees invent. Datamation, 14(4), 28–31. <a href="#fnref:fn11" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn12" role="doc-endnote">
<p>Stanley, K. O., & Lehman, J. (2015). Why greatness cannot be planned: the myth of the objective. <a href="#fnref:fn12" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn13" role="doc-endnote">
<p>Burkus, D. (2013). The myths of creativity: the truth about how innovative companies and people generate great ideas. John Wiley & Sons. <a href="#fnref:fn13" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn14" role="doc-endnote">
<p>Fu, Y. P., & Khot, T. (2022). How does gpt obtain its ability? tracing emergent abilities of language models to their sources. Yao Fu’s Notion. <a href="#fnref:fn14" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Luis Monedarobert@sorryapp.comThe collaboration-information problemOn Project management for Data Science2022-10-28T00:00:00+00:002022-10-28T00:00:00+00:00http://lgmoneda.github.io/2022/10/28/project-management-data-science<p>A couple of years ago, I moved in my job from the most experienced team on building solutions with Machine Learning to a team eager to have its first wave of models.</p>
<p>We, the “Data Science people”, did a brief document about the model and explained it to the team in a meeting. Soon, we started grabbing the data and fitting some models. After we provided the first results - still as model metrics, someone cheered in a team retrospective: “We are close to having ML models!”. I had to adjust the expectations, and we agreed to clarify the project scope for the team.</p>
<p>The Product Manager set a meeting with everybody, so we could write a PRD (Product Requirements Document) to define the project scope. Then I had this great revelation. I was amazed by how different people’s ideas were about what we would build. That was one of the best experiences I could have had to learn the importance of shared understanding.</p>
<div align="center">
<figure>
<a href="../../../images/primer-proj-m/intro-hands.svg" name="Intro hands">
<img style="width:350px;margin:10px" src="../../../images/primer-proj-m/intro-hands.svg" />
</a>
<!-- <figcaption>Putting together the many common terms for these two stages</figcaption> -->
</figure>
</div>
<p>After a confusing hour, I approached an Agilist to get some help. He recommended Jeff Patton’s User Story Mapping book to me <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>. The book is excellent. As I applied it to Data Science projects, I could identify its key parts for the field and develop others with the team.</p>
<p>I primarily thought about writing it as a blog post, but as it required more and more structure, I’ve decided to have it in a new section called <a href="https://lgmoneda.github.io/data-science-management/project-management/">Data Science Management</a> since that enables continuous improvement by adding content relevant to Data Science Management in general.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Patton, J., & Economy, P. (2014). User story mapping: discover the whole story, build the right product. : O’Reilly Media, Inc. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Luis Monedarobert@sorryapp.comA couple of years ago, I moved in my job from the most experienced team on building solutions with Machine Learning to a team eager to have its first wave of models.A Random Time Robust Forest2022-07-06T00:00:00+00:002022-07-06T00:00:00+00:00http://lgmoneda.github.io/2022/07/06/random-time-robust-forest<h2 id="index">Index</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#randomizing-the-time-segments">Randomizing the time segments</a></li>
<li><a href="#experiments">Experiments</a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ol>
<h2 id="introduction">Introduction</h2>
<p>The <a href="http://lgmoneda.github.io/2021/12/03/introducing-time-robust-tree.html">Time Robust Tree</a> (TRT) <sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup> is a tree-based learning algorithm that uses timestamp information to learn splits that are optimal for time segments provided by the user, e.g., year. It means a split on a particular feature is only good if it works through all the years present in the data. The hypothesis is that splits that survive the time-test in the training set are more likely to survive it for future unseen periods. It was inspired by algorithms that explore invariance through environments to learn causal relationships. The test of time represents the invariance, while the time segments are the environments.</p>
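To make the intuition concrete, here is a minimal sketch of the split-scoring idea, not the package's actual implementation: a candidate split is scored by its worst impurity reduction across time segments, so a split that only works in one period scores low. The toy data, threshold, and the min aggregation are simplifying assumptions.

```python
def gini(labels):
    """Gini impurity of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def gain(rows, threshold):
    """Impurity reduction of splitting rows at feature <= threshold."""
    left = [y for x, y in rows if x <= threshold]
    right = [y for x, y in rows if x > threshold]
    labels = left + right
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted

def time_robust_gain(rows_by_period, threshold):
    """Score a split by its worst impurity reduction across time segments.

    A split only looks good if it helps in *every* period: a simplified
    version of the Time Robust Tree idea.
    """
    return min(gain(rows, threshold) for rows in rows_by_period.values())

# Toy (feature, label) rows per year. The split at 0.5 separates the
# classes perfectly in 2019 but not at all in 2020.
data = {
    2019: [(0.2, 0), (0.3, 0), (0.7, 1), (0.9, 1)],
    2020: [(0.2, 1), (0.3, 0), (0.7, 0), (0.9, 1)],
}
pooled = gain([r for rows in data.values() for r in rows], 0.5)
robust = time_robust_gain(data, 0.5)
print(pooled, robust)  # → 0.125 0.0
```

Pooling the years still rewards the split (gain 0.125), while the time-robust score is zero because 2020 gets nothing from it: exactly the kind of split the TRT refuses to learn.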
<p>Something that bothered me while ensembling TRTs into a Time Robust Forest was the lack of randomization of its core design change: the time segmentation. In this post, I explore two alternatives to make it happen and test it on two datasets.</p>
<h2 id="randomizing-the-time-segments">Randomizing the time segments</h2>
<p>The code is available in the <a href="https://github.com/lgmoneda/time-robust-forest">Time Robust Forest package</a>, where code snippets on how to use it can be found in the readme. The experiments’ code was adapted from the previous design, and it is not that tidy, but they are available: <a href="https://github.com/lgmoneda/lgmoneda.github.io/blob/master/images/randomtrt/ge_news_random_trf.ipynb">GE News</a>, <a href="https://github.com/lgmoneda/time-robust-tree-paper/blob/random-trt/experiments/_olist.ipynb">Olist</a>.</p>
<h3 id="enabling-multiple-time-segments-informed-by-the-user">Enabling multiple time segments informed by the user</h3>
<p>Previously, one could only inform a single dataset segmentation based on the timestamp information: days, months, quarters, years, etc. The hypothesis is that mixing the invariance on different environment granularities can be more powerful. The model now accepts a set of columns representing different segmentations, and every new estimator in the ensemble picks randomly from them.</p>
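A minimal sketch of this ensemble-level randomization, with invented column names rather than the package's API:

```python
import random

# Candidate segment columns over six time-ordered rows (values invented).
# A constant column of 1s makes an estimator ignore time segments, i.e.,
# it behaves like a plain Random Forest tree inside the same ensemble.
segment_columns = {
    "year": [2019, 2019, 2020, 2020, 2021, 2021],
    "half": [1, 2, 1, 2, 1, 2],
    "none": [1, 1, 1, 1, 1, 1],
}

random.seed(0)
# Each estimator in the ensemble draws one segmentation to train against.
ensemble_segments = [random.choice(list(segment_columns)) for _ in range(5)]
print(ensemble_segments)
```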
<p>The first practical advantage is ensembling the TRF with a Random Forest (RF) inside the same model since a segment column filled with 1s reproduces an RF. Beyond simply ensembling good models, one can have different availability of features through time. For example, a particular group of features did not exist three years ago. The current TRF does not deal well with these cases, but these features would be used in the RF estimators from the ensemble.</p>
<h3 id="random-segments">Random segments</h3>
<p>Alternatively, the user can set a maximum number of splits, and the model will order the data according to time, create all the possible segmentations and select randomly from this set for every estimator in the ensemble.</p>
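A sketch of how such candidate segmentations can be enumerated and sampled; the function name and interface are illustrative, not the package's:

```python
import random
from itertools import combinations

def contiguous_segmentations(n_rows, max_splits):
    """All ways to cut n_rows time-ordered rows into contiguous segments
    using at most max_splits cut points (0 cuts = a single segment)."""
    cuts = range(1, n_rows)
    options = []
    for k in range(max_splits + 1):
        for points in combinations(cuts, k):
            bounds = [0, *points, n_rows]
            options.append([(bounds[i], bounds[i + 1])
                            for i in range(len(bounds) - 1)])
    return options

options = contiguous_segmentations(n_rows=4, max_splits=2)
print(len(options))  # 1 + 3 + 3 = 7 candidate segmentations
random.seed(0)
print(random.choice(options))  # one estimator's segmentation
```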
<h2 id="experiments">Experiments</h2>
<p>The design and optimal hyper-parameters from a <a href="blog post with experiments">previous setting</a> were leveraged. The most important part is the data split: we first split using the time information. In-time set is for training and test, while out-of-time is for the holdout set. The same parameters were utilized, except for the number of estimators, which increased from 120 to 400 in all models. We will test the following models:</p>
<ul>
<li>Random Forest (benchmark): an RF, which is the TRF using a time column full of 1s;</li>
<li>Time Robust Forest: a TRF using a single time column selected by the user;</li>
<li>Random TRF (eng): a TRF using multiple time columns selected by the user;</li>
<li>Random TRF (few): a TRF using a low value for maximum random segments;</li>
<li>Random TRF (many): a TRF using a large value for maximum random segments;</li>
<li>TRF + RF: a TRF using as time columns a time segment and a dummy column with no segment (RF);
<!-- - RF same min examples: a simplified benchmark since more periods force a higher number of minimum examples to split, we train an RF with minimum samples to split equally to the challenger number of segments multiplied by the challenger minimum examples to split parameter. -->
<!-- - RF Similar training error: a TRF we force to have a similar training error than the challengers to make sure regular simpler models wouldn't behave like the challenger; --></li>
</ul>
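The split described above, in-time data for train/test and out-of-time data for the holdout, can be sketched as follows (dates, labels, and the cutoff are made up):

```python
import random
from datetime import date

# Toy rows: (timestamp, label); values are invented for illustration.
rows = [(date(2019, m, 1), m % 2) for m in range(1, 13)]
rows += [(date(2020, m, 1), m % 2) for m in range(1, 7)]

cutoff = date(2020, 1, 1)
in_time = [r for r in rows if r[0] < cutoff]    # train + test
holdout = [r for r in rows if r[0] >= cutoff]   # out-of-time evaluation

# Within the in-time set, train/test is a plain random split.
random.seed(0)
random.shuffle(in_time)
split = int(0.8 * len(in_time))
train, test = in_time[:split], in_time[split:]
print(len(train), len(test), len(holdout))  # → 9 3 6
```

The key property is that the holdout is never sampled randomly: it sits strictly after the cutoff, so it measures performance on unseen periods rather than unseen rows.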
<h3 id="ge-news">GE News</h3>
<p>The dataset <sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote" rel="footnote">2</a></sup> was expanded with more recent examples. In the table below, one of the Random TRF designs shows that the <code class="language-plaintext highlighter-rouge">eng(ineered)</code> option could perform better than the previous challenger, which is the TRF. However, it was not explored how to select from the many options available for a Random TRF - would the Env K-folds select the best option? See the section “Hyper-parameter optimization” from this <a href="http://lgmoneda.github.io/2021/12/03/introducing-time-robust-tree.html#the-algorithm">post</a>.</p>
<table class="styled-table" style="margin-left:auto;margin-right:auto;width:681px">
<thead>
<tr>
<th style="text-align:center">Model</th>
<th style="text-align:center">Train AUC</th>
<th style="text-align:center">Test AUC</th>
<th style="text-align:center">Holdout AUC</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random TRF (eng)</td>
<td style="text-align:center;font-size:18px">.867</td>
<td style="text-align:center;font-size:18px">.863</td>
<td style="text-align:center;font-size:18px"><b>.825</b></td>
</tr>
<tr>
<td>Challenger</td>
<td style="text-align:center;font-size:18px">.870</td>
<td style="text-align:center;font-size:18px">.866</td>
<td style="text-align:center;font-size:18px">.822</td>
</tr>
<tr>
<td>Random TRF (few)</td>
<td style="text-align:center;font-size:18px">.899</td>
<td style="text-align:center;font-size:18px">.889</td>
<td style="text-align:center;font-size:18px">.822</td>
</tr>
<tr>
<td>TRF + RF</td>
<td style="text-align:center;font-size:18px">.905</td>
<td style="text-align:center;font-size:18px">.893</td>
<td style="text-align:center;font-size:18px">.821</td>
</tr>
<tr>
<td>Benchmark</td>
<td style="text-align:center;font-size:18px"><b>.919</b></td>
<td style="text-align:center;font-size:18px"><b>.898</b></td>
<td style="text-align:center;font-size:18px">.813</td>
</tr>
<tr>
<td>Random TRF (many)</td>
<td style="text-align:center;font-size:18px">.857</td>
<td style="text-align:center;font-size:18px">.859</td>
<td style="text-align:center;font-size:18px">.807</td>
</tr>
</tbody>
</table>
<p>When we observe the performance over time, we see that the <code class="language-plaintext highlighter-rouge">eng</code> option during the in-time period is a shift of what we have for the <code class="language-plaintext highlighter-rouge">challenger</code> (Time Robust Forest). Nonetheless, when the holdout starts, they start to behave slightly differently. We see them inverting which model is performing better, just as <code class="language-plaintext highlighter-rouge">challenger</code> did with <code class="language-plaintext highlighter-rouge">benchmark</code> previously.</p>
<div align="center">
<figure>
<a href="../../../images/randomtrt/all_models.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/randomtrt/all_models.png" />
</a>
<figcaption>The engineered option shows a worse performance in the in-time test, while it performs better in the out-of-time set, our holdout.</figcaption>
</figure>
</div>
<h4 id="is-it-a-matter-of-simplicity">Is it a matter of simplicity?</h4>
<p>Most of my skepticism about the TRF comes from the fact that its design could simply be pushing the model to be simpler. The simplicity would go against overfitting, and it would generalize better. Even considering that in all the previous experiments, the model development pipeline follows standard practices for both benchmark and challenger, I like to keep challenging it.</p>
<p>To keep getting evidence it is not the case, I took the opportunity to train two simpler versions of the benchmark. First, I multiply the minimum number of examples by the number of distinct periods in the challenger model since it is the volume of total data the TRF demands to keep growing deeper. In the second case, I’ve forced the benchmark to have a similar training performance compared to the challenger. I did it by changing the minimum examples to split and the maximum depth.</p>
<p>The results in the image below show it might explain part of it. The simplified versions worsened the benchmark in the holdout. However, the models converge to the same performance after some years. Nonetheless, the simplified benchmark behavior is still very different from what we observed in the challenger. I wonder if the challenger would also converge if we had more data. The period is not sufficiently long to provide enough environments for the TRF to learn and let a very long holdout.</p>
<div align="center">
<figure>
<a href="../../../images/randomtrt/simpler.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/randomtrt/simpler.png" />
</a>
<figcaption>We train a benchmark with similar parameters and another in which we change the hyper-parameters to match the challenger's train performance. It shows the challengers perform well in the holdout due to something beyond being simpler.</figcaption>
</figure>
</div>
<h3 id="olist">Olist</h3>
<p>The Olist dataset <sup id="fnref:fn7" role="doc-noteref"><a href="#fn:fn7" class="footnote" rel="footnote">3</a></sup> is not as voluminous as the GE News. We run ten bootstrap rounds to estimate model performance. I modified the previous optimal parameters by increasing the number of estimators from 140 to 240.</p>
<p>In the plot below, we can see the results for the test and holdout. As expected, when we ensemble the TRF with the RF, the test performance goes up. The same thing happens in the <code class="language-plaintext highlighter-rouge">few</code> case, since it fragments the data guided by the time column and includes the particular case of a single fragment. It is evidence the user should also control the minimum number of segments.</p>
<div align="center">
<figure>
<a href="../../../images/randomtrt/olist_test.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/randomtrt/olist_test.png" />
</a>
<figcaption>There is a wild difference in test performance, which is in-time data sampled randomly. However, the out-of-time performance is very similar.</figcaption>
</figure>
</div>
<p>Since the performance in the holdout is very similar, we can zoom in. There is a lot of overlap in the intervals. However, it is interesting to notice how <code class="language-plaintext highlighter-rouge">benchmark</code>, <code class="language-plaintext highlighter-rouge">challenger</code>, <code class="language-plaintext highlighter-rouge">ensemble</code>, and <code class="language-plaintext highlighter-rouge">engineered</code> options are worse on average than <code class="language-plaintext highlighter-rouge">few</code> and <code class="language-plaintext highlighter-rouge">many</code>, which does not require expert knowledge on splitting the time information into meaningful segments.</p>
<div align="center">
<figure>
<a href="../../../images/randomtrt/olist_holdout.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/randomtrt/olist_holdout.png" />
</a>
<figcaption>Zooming in on the holdout performance, we see the two options using the Random TRF offer a likely improvement. The result is not definitive, but it is encouraging to keep exploring this direction.</figcaption>
</figure>
</div>
<h2 id="conclusion">Conclusion</h2>
<p>The results are interesting in this minimal experimental setting. The GE News and the Olist datasets pointed to different approaches for creating the set of time segments to pick randomly from: engineered or randomly split. The GE dataset is large, but it needs to be challenged further, while the Olist results pretty much overlap. Segmenting the environment randomly goes against the idea of expert knowledge guiding the environment definition and the assumption that we will have data about it. One alternative to random selection is to make it part of the learning process or to use data to guide it. A domain classifier is an excellent option to segment data periods that differ significantly. If we want to learn this split, we need a criterion other than training error: the finer the segmentation, the higher the total minimum of examples needed to split, the shallower the tree, the simpler the model, and thus the higher the training error. Unsupervised learning might make sense here too: considering data from a certain period, look at how similar the following data segments are and decide whether to treat them as the same or a different environment based on a threshold or a statistical test.</p>
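<p>That last idea can be sketched with a two-sample statistic between consecutive periods. This is a toy illustration of my own; a proper version would use critical values or a permutation test rather than the fixed threshold used here.</p>

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs evaluated on the pooled sample."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def merge_periods(feature_by_period, threshold=0.2):
    """Greedily keep consecutive periods in the same environment while their
    feature distribution stays similar; open a new one when it shifts."""
    env_ids = [0]
    for prev, cur in zip(feature_by_period[:-1], feature_by_period[1:]):
        if ks_statistic(np.asarray(prev), np.asarray(cur)) > threshold:
            env_ids.append(env_ids[-1] + 1)  # distribution shifted: new environment
        else:
            env_ids.append(env_ids[-1])      # similar enough: same environment
    return env_ids
```

<p>Feeding the resulting environment ids as the period column would make the segmentation data-driven instead of random.</p>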
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Moneda, L., & Mauá, D. (2022). <a href="/resources/papers/Time_Robust_Tree.pdf">Time Robust Trees: Using Temporal Invariance to Improve Generalization</a>. BRACIS 2022. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>Moneda, L.: Globo esporte news dataset (2020), version 19. Retrieved July 11, 2022, from https://www.kaggle.com/lgmoneda/ge-soccer-clubs-news <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn7" role="doc-endnote">
<p>Sionek, A.: Brazilian e-commerce public dataset by olist (2019), version 7. Retrieved March 13, 2021, from https://www.kaggle.com/olistbr/brazilian-ecommerce <a href="#fnref:fn7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Luis Moneda, Time Robust Tree experiments on real datasets, 2022-06-13, http://lgmoneda.github.io/2022/06/13/experiments-time-robust-tree
<h2 id="index">Index</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#the-experiments">Experiments</a>
<ol>
<li><a href="#setup">Setup</a></li>
<li><a href="#performance">Performance</a></li>
<li><a href="#the-domain-classifier">The domain classifier</a></li>
<li><a href="#temporal-views">Temporal views</a></li>
<li><a href="#the-hyper-parameters-effect">The hyper-parameters effect</a></li>
<li><a href="#feature-importance">Feature importance</a>
<!-- 4. [When and how to use it](#when-and-how-to-use-it) --></li>
</ol>
</li>
<li><a href="#conclusion">Conclusion</a></li>
</ol>
<h2 id="introduction">Introduction</h2>
<p>This post contains a practical application of the Time Robust Tree (TRT). To understand the model's motivation, please read <a href="http://lgmoneda.github.io/2021/12/03/introducing-time-robust-tree.html">Introducing the Time Robust Tree - invariance in Machine Learning #3</a> or “Time Robust Trees: Using Temporal Invariance to Improve Generalization”<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>. Most of our interest in this work is to reflect on dataset shift and generalization. However, in an industry setting, the Time Robust Forest (TRF), the ensemble version of the TRT, can offer an exciting challenger for settings in which one cannot retrain a model constantly.</p>
<p>We will explore some real-world datasets and compare the TRF to a Random Forest (RF). All the data and code for the experiments are <a href="https://github.com/lgmoneda/time-robust-tree-paper">available</a>, just as the base package <a href="https://github.com/lgmoneda/time-robust-forest">time-robust-forest</a>.</p>
<h2 id="the-experiments">The experiments</h2>
<h3 id="setup">Setup</h3>
<p>To validate the approach, seven public datasets with timestamp information and a reasonable time range were selected <sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:fn3" role="doc-noteref"><a href="#fn:fn3" class="footnote" rel="footnote">3</a></sup> <sup id="fnref:fn4" role="doc-noteref"><a href="#fn:fn4" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:fn5" role="doc-noteref"><a href="#fn:fn5" class="footnote" rel="footnote">5</a></sup> <sup id="fnref:fn6" role="doc-noteref"><a href="#fn:fn6" class="footnote" rel="footnote">6</a></sup> <sup id="fnref:fn7" role="doc-noteref"><a href="#fn:fn7" class="footnote" rel="footnote">7</a></sup> <sup id="fnref:fn8" role="doc-noteref"><a href="#fn:fn8" class="footnote" rel="footnote">8</a></sup>.</p>
<p>We split every dataset into two time periods: training and holdout. Then the training period data is split randomly into training and test sets. We use the Time Robust Forest python package for both benchmark and challenger.</p>
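<p>As a minimal sketch of this split (ISO-formatted date strings compare correctly as plain strings, which keeps the example dependency-free; the helper name is mine):</p>

```python
import numpy as np

def temporal_split(dates, holdout_start, test_frac=0.2, seed=42):
    """Everything from `holdout_start` onward is holdout; the in-time rest is
    shuffled and divided into train and test. Returns index arrays."""
    dates = np.asarray(dates)
    idx = np.arange(len(dates))
    holdout = idx[dates >= holdout_start]
    in_time = idx[dates < holdout_start]
    shuffled = np.random.default_rng(seed).permutation(in_time)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test], holdout  # train, test, holdout
```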
<p>The benchmark has all training examples with the same \(T_{period}\), a particular case in which the TRF becomes a regular Random Forest. The challenger uses yearly or year-monthly segments.</p>
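<p>Concretely, the only difference between benchmark and challenger is the period column handed to the model; a hypothetical illustration:</p>

```python
# Hypothetical illustration of the T_period column each model variant receives.
timestamps = ["2015-03-21", "2016-07-02", "2016-11-30", "2018-01-15"]

benchmark_period = ["all"] * len(timestamps)  # one period: TRF behaves like a plain RF
yearly_period = [t[:4] for t in timestamps]   # yearly segments
monthly_period = [t[:7] for t in timestamps]  # year-month segments
```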
<p>The hyper-parameter optimization uses an approach we identify as Environment K-Fold, which we explain in a previous <a href="http://lgmoneda.github.io/2021/12/03/introducing-time-robust-tree.html">post</a>.</p>
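<p>I won't reproduce the exact procedure here, but a leave-one-environment-out splitter in the same spirit can be sketched as follows (my own minimal version, not the package's code):</p>

```python
import numpy as np

def environment_kfold(envs):
    """Yield (train_idx, val_idx) pairs that leave one temporal environment
    out, so hyper-parameters are judged on out-of-environment performance."""
    envs = np.asarray(envs)
    for env in np.unique(envs):
        yield np.where(envs != env)[0], np.where(envs == env)[0]
```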
<h3 id="performance">Performance</h3>
<p>The first evidence we look for is a simple difference in performance. We hypothesize the TRF will not struggle as much as the RF to keep its performance on future unseen data, which we simulate with the holdout set.</p>
<p>In the table below, it is possible to see the TRF did better in the holdout in three cases, tied in two, and lost in the other two.</p>
<p>This aggregated result is interesting, but looking further can help us understand how the TRF operates.</p>
<table class="styled-table" style="margin-left:auto;margin-right:auto;width:681px">
<thead>
<tr>
<th style="text-align:center">Dataset</th>
<th style="text-align:center">Data split</th>
<th style="text-align:center">Volume</th>
<th style="text-align:center">Time range</th>
<th style="text-align:center">RF</th>
<th style="text-align:center">TRF</th>
<th style="text-align:center">TRF-RF</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3">GE News</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">21k</td>
<td style="text-align:center;font-size:18px">2015-2018</td>
<td style="text-align:center;font-size:18px"><b>.927</b></td>
<td style="text-align:center;font-size:18px">.865</td>
<td style="text-align:center;font-size:18px">-.062</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 5k</td>
<td style="text-align:center;font-size:18px"> 2015-2018</td>
<td style="text-align:center;font-size:18px"><b>.879</b></td>
<td style="text-align:center;font-size:18px">.839</td>
<td style="text-align:center;font-size:18px"> -.040</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">58k</td>
<td style="text-align:center;font-size:18px">2019-2021</td>
<td style="text-align:center;font-size:18px">.805</td>
<td style="text-align:center;font-size:18px"><b>.821</b></td>
<td style="text-align:center;font-size:18px;background-color:#286748;color:white"><b>.017</b></td>
</tr>
<tr>
<td rowspan="3">Kickstarter</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">98k</td>
<td style="text-align:center;font-size:18px">2010-2013</td>
<td style="text-align:center;font-size:18px"><b>.736</b></td>
<td style="text-align:center;font-size:18px">.717</td>
<td style="text-align:center;font-size:18px">-.019</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 24k</td>
<td style="text-align:center;font-size:18px"> 2010-2013</td>
<td style="text-align:center;font-size:18px"><b>.705</b></td>
<td style="text-align:center;font-size:18px">.701</td>
<td style="text-align:center;font-size:18px"> -.004</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">254k</td>
<td style="text-align:center;font-size:18px">2014-2017</td>
<td style="text-align:center;font-size:18px">.647</td>
<td style="text-align:center;font-size:18px"><b>.661</b></td>
<td style="text-align:center;font-size:18px;background-color:#286748;color:white"><b>.014</b></td>
</tr>
<tr>
<td rowspan="3">20 News</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">8k</td>
<td style="text-align:center;font-size:18px">-</td>
<td style="text-align:center;font-size:18px"><b>.939</b></td>
<td style="text-align:center;font-size:18px">.869</td>
<td style="text-align:center;font-size:18px">-.070</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 2k</td>
<td style="text-align:center;font-size:18px">-</td>
<td style="text-align:center;font-size:18px"><b>.867</b></td>
<td style="text-align:center;font-size:18px">.828</td>
<td style="text-align:center;font-size:18px"> -.039</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">8k</td>
<td style="text-align:center;font-size:18px">-</td>
<td style="text-align:center;font-size:18px">.768</td>
<td style="text-align:center;font-size:18px"><b>.774</b></td>
<td style="text-align:center;font-size:18px;background-color:#286748;color:white"><b>.006</b></td>
</tr>
<tr>
<td rowspan="3">Animal Shelter</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">75k</td>
<td style="text-align:center;font-size:18px">2014-2017</td>
<td style="text-align:center;font-size:18px"><b>.814</b></td>
<td style="text-align:center;font-size:18px">.803</td>
<td style="text-align:center;font-size:18px">-.011</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 19k</td>
<td style="text-align:center;font-size:18px">2014-2017</td>
<td style="text-align:center;font-size:18px"><b>.792</b></td>
<td style="text-align:center;font-size:18px">.790</td>
<td style="text-align:center;font-size:18px"> -.002</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">61k</td>
<td style="text-align:center;font-size:18px">2018-2021</td>
<td style="text-align:center;font-size:18px"><b>.791</b></td>
<td style="text-align:center;font-size:18px">.791</td>
<td style="text-align:center;font-size:18px;background-color:goldenrod;color:white"><b>.000</b></td>
</tr>
<tr>
<td rowspan="3">Olist</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">41k</td>
<td style="text-align:center;font-size:18px">2017</td>
<td style="text-align:center;font-size:18px"><b>.799</b></td>
<td style="text-align:center;font-size:18px">.695</td>
<td style="text-align:center;font-size:18px">-.104</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 10k</td>
<td style="text-align:center;font-size:18px">2017</td>
<td style="text-align:center;font-size:18px"><b>.664</b></td>
<td style="text-align:center;font-size:18px">.641</td>
<td style="text-align:center;font-size:18px"> -.023</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">62k</td>
<td style="text-align:center;font-size:18px">2018</td>
<td style="text-align:center;font-size:18px"><b>.635</b></td>
<td style="text-align:center;font-size:18px">.635</td>
<td style="text-align:center;font-size:18px;background-color:goldenrod;color:white"><b>.000</b></td>
</tr>
<tr>
<td rowspan="3">Chicago Crime</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">100k</td>
<td style="text-align:center;font-size:18px">2001-2010</td>
<td style="text-align:center;font-size:18px"><b>.936</b></td>
<td style="text-align:center;font-size:18px">.909</td>
<td style="text-align:center;font-size:18px">-.027</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px"> 61k</td>
<td style="text-align:center;font-size:18px">2001-2010</td>
<td style="text-align:center;font-size:18px"><b>.904</b></td>
<td style="text-align:center;font-size:18px">.899</td>
<td style="text-align:center;font-size:18px"> -.005</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">90k</td>
<td style="text-align:center;font-size:18px">2011-2017</td>
<td style="text-align:center;font-size:18px"><b>.905</b></td>
<td style="text-align:center;font-size:18px">.902</td>
<td style="text-align:center;font-size:18px;background-color:#bf483f;color:white"><b>-.003</b></td>
</tr>
<tr>
<td rowspan="3">Building Permits</td>
<td style="text-align:center;font-size:18px">Train</td>
<td style="text-align:center;font-size:18px">90k</td>
<td style="text-align:center;font-size:18px">2013-2015</td>
<td style="text-align:center;font-size:18px"><b>.990</b></td>
<td style="text-align:center;font-size:18px">.984</td>
<td style="text-align:center;font-size:18px">-.006</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Test</td>
<td style="text-align:center;font-size:18px">22k</td>
<td style="text-align:center;font-size:18px">2013-2015</td>
<td style="text-align:center;font-size:18px"><b>.974</b></td>
<td style="text-align:center;font-size:18px">.972</td>
<td style="text-align:center;font-size:18px"> -.002</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">Holdout</td>
<td style="text-align:center;font-size:18px">193k</td>
<td style="text-align:center;font-size:18px">2016-2017</td>
<td style="text-align:center;font-size:18px"><b>.977</b></td>
<td style="text-align:center;font-size:18px">.973</td>
<td style="text-align:center;font-size:18px;background-color:#bf483f;color:white"><b>-.004</b></td>
</tr>
</tbody>
</table>
<h3 id="the-domain-classifier">The domain classifier</h3>
<p>In the previous table, it is possible to verify that the cases where the TRF is an exciting challenger are the ones in which the benchmark struggles in the holdout relative to the test. To quantify how much the future data changes, we train a domain classifier using the holdout as the target. The higher its AUC, the more significant the difference between test and holdout in that dataset. As seen in the figure below, the results show that the TRF performed better in the datasets with a more remarkable shift between training and holdout data.</p>
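<p>The domain classifier recipe is simple: stack the two samples, label their origin, and score a classifier out of fold. A sketch assuming scikit-learn, with arbitrary hyper-parameters:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def domain_classifier_auc(X_in_time, X_holdout, seed=0):
    """AUC of a classifier asked to separate in-time rows from holdout rows.
    Around 0.5 the two periods look alike; near 1.0 there is a strong shift."""
    X = np.vstack([X_in_time, X_holdout])
    y = np.concatenate([np.zeros(len(X_in_time)), np.ones(len(X_holdout))])
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, scores)
```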
<div align="center">
<figure>
<a href="../../../images/trtexp/domain_classifier_regression.jpg" name="Domain classifier">
<img style="width:550px;margin:10px" src="../../../images/trtexp/domain_classifier_regression.jpg" />
</a>
<figcaption>Domain classifier performance by the delta improvement in the TRF. The greater the difference between the source and target data, translated by a high AUC for the domain classifier, the greater the benefit of learning invariant relationships to generalize to future unseen data.</figcaption>
</figure>
</div>
<h3 id="temporal-views">Temporal views</h3>
<p>The hypothesis is that TRF would be able to learn more stable relationships, which would exclude spurious relationships. These relationships would not degrade as quickly as the others. We have a hint from the aggregated performance metric from the previous table, but we want to see it over time.</p>
<p>The most interesting case is the GE News. Notice how the curve shape is different during the holdout. In the other cases, the parallel shift seems to be evidence of a different model capacity. Still, if the setup is close enough to normal model development, the results put the TRF as a reasonable challenger for the RF.</p>
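<p>The per-period curves shown below can be computed by grouping predictions by year; a sketch assuming scikit-learn is available:</p>

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_year(years, y_true, y_score):
    """Score predictions separately per year to see degradation over time."""
    years, y_true, y_score = map(np.asarray, (years, y_true, y_score))
    out = {}
    for year in np.unique(years):
        mask = years == year
        if len(np.unique(y_true[mask])) == 2:  # AUC needs both classes present
            out[str(year)] = roc_auc_score(y_true[mask], y_score[mask])
    return out
```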
<div align="center">
<figure>
<a href="../../../images/trtexp/genews_performance_overtime.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/genews_performance_overtime.png" />
</a>
<figcaption>(a) GE News</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/kickstarter_test_holdout_auc_years_square.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/kickstarter_test_holdout_auc_years_square.png" />
</a>
<figcaption>(b) Kickstarter</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/20news_pycaret_aggregate_opt_square.jpg" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/20news_pycaret_aggregate_opt_square.jpg" />
</a>
<figcaption>(c) 20 News</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/animal_shelter_test_holdout_auc_years_square.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/animal_shelter_test_holdout_auc_years_square.png" />
</a>
<figcaption>(d) Animal Shelter</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/olist_test_holdout_auc_years_square.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/olist_test_holdout_auc_years_square.png" />
</a>
<figcaption>(e) Olist</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/chicago_crime_test_holdout_auc_years_square.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/chicago_crime_test_holdout_auc_years_square.png" />
</a>
<figcaption>(f) Chicago Crime</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/building_permits_test_holdout_auc_years_square.png" name="Domain classifier">
<img style="width:650px;margin:10px" src="../../../images/trtexp/building_permits_test_holdout_auc_years_square.png" />
</a>
<figcaption>(g) Building Permits</figcaption>
</figure>
</div>
<h3 id="the-hyper-parameters-effect">The hyper-parameters effect</h3>
<p>Though we optimize both models with a similar procedure, hyper-parameters could still produce misleading results. The parameter we care about most in the TRF is the minimum number of examples from each period; in the RF, the analogous parameter is the minimum number of examples to split. To verify how this parameter influenced the results, we fixed all the others and varied only this one.</p>
<p>As we can see in the images below, it was a matter of parametrization in the Kickstarter case, while in GE News, it was not. The dashed lines represent the TRF, while the green curves refer to the holdout sets. We can see the RF has the power to solve Kickstarter with a higher holdout performance, while in GE News, the dashed green curve reaches a level the continuous one does not.</p>
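<p>The sweep itself is straightforward; a sketch for the RF side, assuming scikit-learn, with illustrative grid values:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def sweep_min_samples_split(X_tr, y_tr, X_ho, y_ho, grid=(2, 32, 128, 512)):
    """Fix every other hyper-parameter and vary only the minimum number of
    examples to split, recording (train AUC, holdout AUC) per value."""
    results = {}
    for m in grid:
        clf = RandomForestClassifier(n_estimators=50, min_samples_split=m,
                                     random_state=0).fit(X_tr, y_tr)
        results[m] = (
            roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1]),
            roc_auc_score(y_ho, clf.predict_proba(X_ho)[:, 1]),
        )
    return results
```

<p>The TRF side would vary its minimum-examples-per-period parameter the same way, keeping everything else fixed.</p>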
<div align="center">
<figure>
<a href="../../../images/trtexp/kick_params.png" name="Domain classifier">
<img style="width:450px;margin:10px" src="../../../images/trtexp/kick_params.png" />
</a>
<figcaption>(a) Kickstarter</figcaption>
</figure>
</div>
<div align="center">
<figure>
<a href="../../../images/trtexp/ge_news_params.png" name="Domain classifier">
<img style="width:450px;margin:10px" src="../../../images/trtexp/ge_news_params.png" />
</a>
<figcaption>(b) GE News</figcaption>
</figure>
</div>
<h3 id="feature-importance">Feature importance</h3>
<p>If the TRF indeed selects stable relationships, we would expect the features to assume different importances. However, in terms of order, there was little difference. The only pattern we could identify was that the TRF concentrated importance on the top features in the cases where it outperformed the RF.</p>
<p>In the table below, we can see the feature importance for the GE News dataset. Notice how the TRF concentrates importance on the top features and how quickly the importance decreases as we move to the less important ones.</p>
<table class="styled-table" style="margin-left:auto;margin-right:auto;width:681px">
<thead>
<tr>
<th colspan="2" style="text-align:center">RF</th>
<th colspan="2" style="text-align:center">TRF</th>
</tr>
<tr>
<th style="text-align:center">Feature</th>
<th style="text-align:center">Importance</th>
<th style="text-align:center">Feature</th>
<th style="text-align:center">Importance</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;font-size:18px">atleta</td>
<td style="text-align:center;font-size:18px">0.043446</td>
<td style="text-align:center;font-size:18px">atleta</td>
<td style="text-align:center;font-size:18px">0.090731</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">cada</td>
<td style="text-align:center;font-size:18px">0.034933</td>
<td style="text-align:center;font-size:18px">sabe</td>
<td style="text-align:center;font-size:18px">0.036755</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">partidas</td>
<td style="text-align:center;font-size:18px">0.019544</td>
<td style="text-align:center;font-size:18px">cada</td>
<td style="text-align:center;font-size:18px">0.034831</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">primeiro</td>
<td style="text-align:center;font-size:18px">0.014367</td>
<td style="text-align:center;font-size:18px">partidas</td>
<td style="text-align:center;font-size:18px">0.021660</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">sabe</td>
<td style="text-align:center;font-size:18px">0.012230</td>
<td style="text-align:center;font-size:18px">lateral</td>
<td style="text-align:center;font-size:18px">0.016985</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">dia</td>
<td style="text-align:center;font-size:18px">0.010441</td>
<td style="text-align:center;font-size:18px">chegar</td>
<td style="text-align:center;font-size:18px">0.015238</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">cinco</td>
<td style="text-align:center;font-size:18px">0.009303</td>
<td style="text-align:center;font-size:18px">dia</td>
<td style="text-align:center;font-size:18px">0.012248</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">lateral</td>
<td style="text-align:center;font-size:18px">0.006325</td>
<td style="text-align:center;font-size:18px">cinco</td>
<td style="text-align:center;font-size:18px">0.011322</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">video</td>
<td style="text-align:center;font-size:18px">0.006182</td>
<td style="text-align:center;font-size:18px">camisa</td>
<td style="text-align:center;font-size:18px">0.006628</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">rodrigo</td>
<td style="text-align:center;font-size:18px">0.005493</td>
<td style="text-align:center;font-size:18px">precisa</td>
<td style="text-align:center;font-size:18px">0.006192</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">defesa</td>
<td style="text-align:center;font-size:18px">0.004620</td>
<td style="text-align:center;font-size:18px">globoesport</td>
<td style="text-align:center;font-size:18px">0.004892</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">globoesporte</td>
<td style="text-align:center;font-size:18px">0.004507</td>
<td style="text-align:center;font-size:18px">defesa</td>
<td style="text-align:center;font-size:18px">0.004150</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">precisa</td>
<td style="text-align:center;font-size:18px">0.004204</td>
<td style="text-align:center;font-size:18px">todos</td>
<td style="text-align:center;font-size:18px">0.003037</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">momento</td>
<td style="text-align:center;font-size:18px">0.003925</td>
<td style="text-align:center;font-size:18px">possível</td>
<td style="text-align:center;font-size:18px">0.002211</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">treinador</td>
<td style="text-align:center;font-size:18px">0.003699</td>
<td style="text-align:center;font-size:18px">lesão</td>
<td style="text-align:center;font-size:18px">0.001797</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">camisa</td>
<td style="text-align:center;font-size:18px">0.003505</td>
<td style="text-align:center;font-size:18px">video</td>
<td style="text-align:center;font-size:18px">0.001595</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">hoje</td>
<td style="text-align:center;font-size:18px">0.003442</td>
<td style="text-align:center;font-size:18px">vitória</td>
<td style="text-align:center;font-size:18px">0.001177</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">difícil</td>
<td style="text-align:center;font-size:18px">0.003328</td>
<td style="text-align:center;font-size:18px">dentro</td>
<td style="text-align:center;font-size:18px">0.000943</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">possível</td>
<td style="text-align:center;font-size:18px">0.003327</td>
<td style="text-align:center;font-size:18px">neste</td>
<td style="text-align:center;font-size:18px">0.000874</td>
</tr>
<tr>
<td style="text-align:center;font-size:18px">todos</td>
<td style="text-align:center;font-size:18px">0.003132</td>
<td style="text-align:center;font-size:18px">nova</td>
<td style="text-align:center;font-size:18px">0.000581</td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>In most cases, the TRF looks like regularization, since it has worse training performance and a better holdout result; in these cases, the shape of the holdout curve is similar for the RF and the TRF. However, it might be that in most datasets, the low number of features does not enable the TRF to find stability from a particular source. The most interesting case is also the one with the highest number of features: GE News. This case is interesting because it provides evidence that the phenomena we want to predict can be composed of many concepts, which change at different rates as time passes, while models cannot capture this nuance and tend to focus on simple and spurious relationships.</p>
<p>Nonetheless, the domain classifier points to a simple way to decide whether to try the TRF for a particular problem, and its performance is not far behind in the cases where it loses to the RF.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Moneda, L., & Mauá, D. (2022). <a href="/resources/papers/Time_Robust_Tree.pdf">Time Robust Trees: Using Temporal Invariance to Improve Generalization</a>. BRACIS 2022. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>Moneda, L.: Globo esporte news dataset (2020), version 18. Retrieved March 31, 2021, from https://www.kaggle.com/lgmoneda/ge-soccer-clubs-news <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn3" role="doc-endnote">
<p>Chicago, C.: Chicago crime BigQuery dataset (2021), version 1. Retrieved March 13, 2021, from https://www.kaggle.com/chicago/chicago-crime <a href="#fnref:fn3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
<li id="fn:fn4" role="doc-endnote">
<p>Daoud, J.: Animal shelter dataset (2021), version 1. Retrieved March 13, 2021, from https://www.kaggle.com/jackdaoud/animal-shelter-analytics <a href="#fnref:fn4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn5" role="doc-endnote">
<p>Mouill, M.: Kickstarter projects dataset (2018), version 7. Retrieved March 13, 2021, from https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201612.csv <a href="#fnref:fn5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn6" role="doc-endnote">
<p>Shastry, A.: San Francisco building permits dataset (2018), version 7. Retrieved March 13, 2021, from https://www.kaggle.com/aparnashastry/building-permit-applications-data <a href="#fnref:fn6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
</li>
<li id="fn:fn7" role="doc-endnote">
<p>Sionek, A.: Brazilian e-commerce public dataset by olist (2019), version 7. Retrieved March 13, 2021, from https://www.kaggle.com/olistbr/brazilian-ecommerce <a href="#fnref:fn7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn8" role="doc-endnote">
<p>Mitchell, T.: 20 Newsgroup dataset (1996). They are retrieved from sklearn. <a href="#fnref:fn8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Luis Monedarobert@sorryapp.comIndexCalibração contínua de desempenho2022-03-26T00:00:00+00:002022-03-26T00:00:00+00:00http://lgmoneda.github.io/2022/03/26/calibracao-continua-performance<h2 id="índice">Índice</h2>
<ol>
<li><a href="#introdução">Introdução</a></li>
<li><a href="#a-prática">A prática</a></li>
<li><a href="#a-postura-para-aproveitar-ao-máximo">A postura para aproveitar ao máximo</a></li>
<li><a href="#conclusão">Conclusão</a></li>
</ol>
<h2 id="introdução">Introdução</h2>
<p>Uma avaliação de desempenho é um processo para avaliar e recompensar indivíduos. A estrutura comum em empresas de tecnologia inclui dois ciclos por ano, autoavaliação, revisão por pares, avaliação de leads e calibração.
Na calibração, a liderança se reúne para comparar a avaliação de todos os membros de sua equipe. O objetivo é ser consistente em toda a empresa.</p>
<p>Por que eu não gosto?</p>
<p>Na verdade, não é tão simples. Eu até gosto de calibrações. Elas são esclarecedoras. É uma boa sessão de estudos de caso de gestão de pessoas.</p>
<p>Você ouve histórias específicas legais sobre como grandes trabalhos foram feitos. Você aprende com as dificuldades de outras equipes. Você fica sabendo como as pessoas superaram maus comportamentos ou desenvolveram uma habilidade crítica.</p>
<p>No entanto, seu principal objetivo é refletir sobre o passado e defini-lo em uma nota.</p>
<p>O ciclo acabou! Todos os insights que você pode obter da experiência de muitos líderes só podem ser aplicados no próximo ciclo, e isso é frustrante. Além disso, há tensão no ar porque termina com uma nota irreversível. Essa tensão torna as pessoas defensivas e menos abertas. Para piorar, há pouco tempo para discutir um grande grupo, e o trabalho de definir as notas é a prioridade.</p>
<p>Alguns gerentes trabalhando comigo e eu começamos a fazer <em>calibrações contínuas</em> para evitar que a clareza ficasse em segundo plano ao discuti-la sob pressão de notas e a sensação sub-ótima de que é tarde demais.</p>
<h2 id="a-prática">A prática</h2>
<h3 id="configuração">Configuração</h3>
<p>As calibrações contínuas seguem tudo das calibrações regulares, exceto que acontecem algumas vezes durante o ciclo: mensal ou bimestral. Se a equipe for muito grande, divida-a para restringir as sessões a uma hora ou uma hora e meia de discussão de alta qualidade.</p>
<p>É essencial seguir o mesmo modelo e escolher uma nota que represente o desempenho até aquele momento. Os modelos de calibração geralmente contêm entregas principais, pontos fortes e pontos de melhoria, além de informações descritivas, como nível, cargo e última promoção.</p>
<p>O exercício requer algum esforço contínuo de todos os membros da equipe. Solicitamos que as pessoas façam uma lista incremental de suas entregas, facilitando a vida de todos no final do ciclo. Também vinculamos o que todo gerente espera de todos usando o <a href="http://lgmoneda.github.io/2022/03/12/expectativas-manager-report.html">documento de expectativas manager-report</a>.</p>
<p>Além do modelo padrão, incluímos o feedback fornecido até o momento, que mostra o quanto o manager está atuando.</p>
<h3 id="durante-a-sessão">Durante a sessão</h3>
<p>Alguém precisa liderar a sessão para garantir que a equipe cubra os casos que mais se beneficiariam com ações rápidas.</p>
<p>A coisa mais útil a fazer é desafiar uns aos outros. Mas perguntando com cuidado (mais sobre isso na seção a seguir).</p>
<p>Vamos usar esta <a href="https://qulture.rocks/en/blog/googles-performance-management-practices-part-1/">classificação de desempenho do Google de 2019</a> e associar os níveis a números:</p>
<ol>
<li>Precisa de melhorias</li>
<li>Atende consistentemente às expectativas</li>
<li>Excede as expectativas</li>
<li>Supera fortemente as expectativas</li>
<li>Excelente</li>
</ol>
<p>Agora vamos usar uma notação de “nota menos” como um desempenho avaliado como “nota”, mas próximo de “nota - 1”, e “nota mais” para um desempenho mais próximo de “nota + 1”. “Nota neutro” é um desempenho teoricamente sem dúvidas.</p>
<p>Questões a fazer durante a sessão:</p>
<ul>
<li>Baixo desempenho
<ul>
<li>Por que essa pessoa está com baixo desempenho? Como recuperar pessoas com baixo desempenho?</li>
<li>O que poderia fazer os 2 menos se tornarem 1?</li>
<li>Se o ciclo terminasse hoje, a pessoa ficaria surpresa com a nota baixa?</li>
</ul>
</li>
<li>Equidade de nota e uniformidade de avaliação
<ul>
<li>Por que alguém era 3 mais e um par era 3 neutro?</li>
<li>Como este [nota] mais se compara ao [nota+1] menos?</li>
</ul>
</li>
<li>Avaliação contrafactual
<ul>
<li>Se tivéssemos um par com a mesma oportunidade, como o resultado mudaria? Faz sentido ter uma nota baixa/alta aqui se a maioria dos pares faria o mesmo?</li>
</ul>
</li>
<li>Alta performance
<ul>
<li>O que pode fazer o [2, 3, 4] mais se tornar um [3, 4, 5]?</li>
<li>Ninguém é 5, como ter esse desempenho no contexto dessa equipe?</li>
</ul>
</li>
<li>Promoções
<ul>
<li>Para pessoas com expectativas de promoção, sugeriríamos hoje? O que tornaria sua candidatura fraca? O que podemos fazer para fortalecer o caso?</li>
<li>Das coisas que enfraquecem as candidaturas de promoção, que oportunidades podemos oferecer para superá-las?</li>
</ul>
</li>
<li>Ajudar outros gerentes a criar boas histórias
<ul>
<li>Esta pessoa está em uma posição de liderança, mas não está claro como ela está exercendo;</li>
<li>Acho que está faltando expor o impacto do trabalho dessa pessoa;</li>
</ul>
</li>
</ul>
<h2 id="a-postura-para-aproveitar-ao-máximo">A postura para aproveitar ao máximo</h2>
<p>Tudo pode acontecer como descrito acima, e a sessão acabar sendo inócua ou um desastre.</p>
<p>A calibração contínua como ferramenta depende de como cada manager se comporta durante ela. Estes são os princípios para torná-la útil.</p>
<h3 id="o-desempenho-do-report-pode-gerar-feedback-para-o-manager">O desempenho do report pode gerar feedback para o manager</h3>
<p>Quando alguém está com baixo desempenho, as chances são de que o manager e o report compartilhem a responsabilidade. Na calibração regular, o foco está principalmente no report. Na calibração contínua, esperamos mais reflexão sobre o papel do manager e apoio dos pares para melhorar.</p>
<h3 id="tudo-ainda-está-aberto-para-você-e-para-seu-time">Tudo ainda está aberto para você, e para seu time</h3>
<p>Essa é a beleza. A proporção de passado e futuro muda à medida que o ciclo continua. Ainda assim, haverá espaço para evitar surpresas, desafiar e recuperar. Tanto para o report quanto para a manager. Descrevemos o passado, mas focamos no futuro. Defina a nota, mas discuta o que pode mudá-la até o final do ciclo. Todos os managers podem ajudá-la a definir boas estratégias para seu time, incluindo atividades conjuntas com as equipes de outras managers.</p>
<h3 id="um-exercício-de-bondade">Um exercício de bondade</h3>
<p>As calibrações podem estar cheias de auto-engano e falta de cuidado com a verdade. No entanto, calibrações contínuas devem ser sobre gentileza, cuidado e atenção: um movimento honesto para descobrir o que está acontecendo em cada caso particular<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>Você deseja investigar as causas-raiz realizando as perguntas listadas anteriormente.</p>
<p>Esteja preparado e aberto para se sentir culpado. Deixe que os outros o ajudem a perceber quando você não está fazendo um bom trabalho como manager.</p>
<p>Enquanto na calibração final, as pessoas ficam na defensiva sobre seu ponto de vista e resistem a ser convencidas, a versão contínua deve ser cheia de ceticismo e pensamento crítico. Estamos preocupados em saber se o que acreditamos ser um alto desempenho é genuíno. Desapegue-se de outros interesses em convencer as pessoas para sair com uma nota específica.</p>
<p>Ao comparar o desempenho das pessoas, aproveite a falta de pressão e concentre-se na clareza e consistência da avaliação da equipe. Desafie as notas porque você se preocupa com a qualidade de nossa avaliação. O exercício gerará um ótimo feedback aos reports e melhorará os managers.</p>
<p>Questione os managers como um grupo para revelar problemas sistemáticos: por que ninguém tem alto ou baixo desempenho?</p>
<h3 id="apoie-outros-managers-para-superar-a-empatia-ruinosa">Apoie outros managers para superar a empatia ruinosa</h3>
<p>Fornecer uma avaliação ruim é difícil, e a tentação de evitá-la se encaixa na definição de Scott (2019) de empatia ruinosa<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup>: o silêncio de um manager para evitar ferir os sentimentos da pessoa, mas que acaba prejudicando a carreira dela por mantê-la ignorante sobre os pontos de melhoria.</p>
<p>Ter o apoio dos colegas nesse momento é fantástico. Não porque eles vão te dizer que está tudo bem. Eles o ajudarão a identificar tudo o que você pode fazer pelo seu report para deixá-lo confiante de que fizemos o melhor e que a pessoa com baixo desempenho teve a oportunidade de entender bem o caso e recuperar-se.</p>
<h3 id="socialize-o-aprendizado">Socialize o aprendizado</h3>
<p>Sempre que vivo casos desafiadores de gestão de pessoas, considero que melhorei como manager. A calibração contínua permite socializar os casos mais interessantes com todos as managers locais <em>enquanto está acontecendo</em>. É fácil ser sábio após o evento. É uma lição viva. Gera empatia ao colocar todos no momento em que os desafios são de fato desafiadores.</p>
<h3 id="treinamento-para-a-calibração-final">Treinamento para a calibração final</h3>
<p>Quando chega a calibração propriamente dita, os managers sentem-se confiantes nas notas e sabem justificá-las.</p>
<p>Os pares ajudam a moldar sua defesa, o que é importante para baixo desempenho, promoção e alto desempenho.</p>
<h2 id="conclusão">Conclusão</h2>
<p>A calibração contínua tem o poder de destensionar as calibrações. Com um baixo investimento de uma hora a cada um ou dois meses uma situação com alto potencial de aprendizado é oferecida aos managers. Após cada sessão, reflita para identificar o feedback para si mesmo e estruture algo para seus reports.</p>
<h2 id="referências">Referências</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Murdoch, I. (2013). The sovereignty of good. : Routledge. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>Scott, K. (2019). Radical candor: fully revised \& updated edition: be a kick-ass boss without losing your humanity. : St. Martin’s Press. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Luis Monedarobert@sorryapp.comÍndiceContinuous Performance Calibration2022-03-26T00:00:00+00:002022-03-26T00:00:00+00:00http://lgmoneda.github.io/2022/03/26/continuous-performance-calibration<h2 id="index">Index</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#the-practice">The practice</a></li>
<li><a href="#the-best-mood-to-make-the-most-of-it">The best mood to make the most of it </a></li>
<li><a href="#conclusion">Conclusion</a></li>
</ol>
<h2 id="introduction">Introduction</h2>
<p>A performance review is a process to evaluate and reward individuals. The common framework in tech companies includes two cycles per year, self-assessment, peer review, lead assessment, and calibration.
In the calibration, the leadership meets to compare the evaluation of all members of their team. The objective is to be consistent throughout the company.</p>
<p>Why don’t I like it?</p>
<p>It is mixed. I do like calibrations. They are enlightening. It is a nice session of people-management case studies.</p>
<p>You hear cool, specific stories about what greatness looks like. You learn from other teams’ struggles. You get to know how people overcame bad behaviors or developed a critical skill.</p>
<p>However, its main objective is to reflect on the past and define it in a grade.</p>
<p>The cycle is gone! All the insights you could gather from the experience of many leaders can only be applied in the next cycle, and it is frustrating. Furthermore, there is tension in the air because it ends with an irreversible grade. This tension makes people defensive and less open. To make it worse, there is a short time to discuss a large group, and the job of getting the grades defined is the priority.</p>
<p>A couple of managers working with me and I started doing <em>continuous calibrations</em> to keep the truth from getting veiled when discussing it under grade pressure, and to avoid the sub-optimal feeling that it is too late.</p>
<h2 id="the-practice">The Practice</h2>
<h3 id="setup">Setup</h3>
<p>Continuous calibrations follow everything from regular calibrations, except they happen a few times during the cycle: monthly or bi-monthly. If the team is too large, split it to restrict the sessions to take one hour or one hour and a half of high-quality discussion.</p>
<p>It is essential to follow the same template and pick a grade that represents the performance until that point. Calibration templates usually contain main deliveries, strengths and points of improvement, and descriptive information, like the level, tenure, and last promotion.</p>
<p>It requires some continuous effort from all the team members. We request people to make an incremental list of their deliveries, making everyone’s life easier at the end of the cycle. We also link what every manager expects from everyone using the <a href="http://lgmoneda.github.io/2022/03/12/manager-report-expectations-doc.html">manager-report expectations doc</a>.</p>
<p>Beyond the standard template, we include feedback provided at that cycle, which shows how much their manager is acting.</p>
<h3 id="during-the-session">During the session</h3>
<p>Someone needs to lead the session to ensure the team covers the cases that would benefit the most from quick actions.</p>
<p>The most helpful thing to do is challenge each other. But inquire with care (more about it in the following section).</p>
<p>Let’s use this <a href="https://qulture.rocks/en/blog/googles-performance-management-practices-part-1/">Google 2019 performance classification</a> and associate the levels with numbers:</p>
<ol>
<li>Needs improvement</li>
<li>Consistently meets expectations</li>
<li>Exceeds expectations</li>
<li>Strongly exceeds expectations</li>
<li>Superb</li>
</ol>
<p>Now let’s use the notation “grade minus” for a performance rated at that grade but close to “grade - 1”, and “grade plus” for one closer to “grade + 1”. “Grade neutral” is a performance squarely in the middle of that grade.</p>
<p>Questions to ask during the session:</p>
<ul>
<li>Underperformance
<ul>
<li>Why is this person underperforming? How do we recover people who are underperforming?</li>
<li>What could make the 2 minuses become a 1?</li>
<li>If the cycle ended today, would the person be surprised by the low grade?</li>
</ul>
</li>
<li>Grade fairness and evaluation uniformity
<ul>
<li>Why was someone a 3 plus while a peer was a 3 neutral?</li>
<li>How does this [grade] plus compare to the [grade+1] minus?</li>
</ul>
</li>
<li>Counterfactual evaluation
<ul>
<li>If we had a peer with the same opportunity, how would the outcome change? Does it make sense to have a low/high grade here if it would not change significantly?</li>
</ul>
</li>
<li>High performance
<ul>
<li>What can make the [2, 3, 4] plus become a [3, 4, 5]?</li>
<li>No one is a 5; what would such performance look like in the context of this team?</li>
</ul>
</li>
<li>Promotions
<ul>
<li>For people with promotion expectations, would we suggest it today? What would make their candidacy weak? What can we do to make the case stronger?</li>
<li>From the things that make promotion candidacies weak, which opportunities can we offer to overcome them?</li>
</ul>
</li>
<li>Helping other managers to craft good stories
<ul>
<li>This person is in a leadership position, but it is not clear how they are exercising it;</li>
<li>I think we are failing to expose this person’s work impact;</li>
</ul>
</li>
</ul>
<h2 id="the-best-mood-to-make-the-most-of-it">The best mood to make the most of it</h2>
<p>Everything can happen as described above, and the session can still end up innocuous or a disaster.</p>
<p>Continuous calibration as a tool depends on how every manager behaves during it. These are the principles to make it helpful.</p>
<h3 id="the-reports-performance-can-generate-feedback-for-the-manager">The report’s performance can generate feedback for the manager</h3>
<p>When someone is underperforming, the chances are the manager and the report share responsibility. In the regular calibration, the focus is mostly on the report. In the continuous calibration, we expect more reflection about the manager’s role and support from the peers to improve.</p>
<h3 id="everything-is-still-open-for-you-and-the-reports">Everything is still open for you, and the reports</h3>
<p>That’s the beauty. The proportion of past and future changes as the cycle goes on. Still, there will be some room to avoid surprises, stretch, and recover. Both for the report and the manager. Describe the past, but focus on the future. Define the grade, but discuss what can change it until the end of the cycle. All the managers can help you define good strategies for your reports, including joint activities with other managers’ teams.</p>
<h3 id="a-kindness-exercise">A kindness exercise</h3>
<p>Calibrations can be full of self-deception and hot air - a lack of care for the truth. However, continuous calibrations should be about kindness: an honest movement to find out what is happening in every particular case<sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>.</p>
<p>You want to dig into root causes by performing the questions listed previously.</p>
<p>Be prepared and open to feel guilty. Let others help you realize when you are not doing a good job as a manager.</p>
<p>While in the final calibration people are defensive about their point and resist being convinced, the continuous one should be full of skepticism and critical thinking. We are concerned with finding out whether what we believe is a high performance is genuine. Detach from the interest of convincing people just to walk out with a specific grade.</p>
<p>When comparing people’s performance, take advantage of the lack of pressure and focus on the clarity and consistency of the team’s evaluation. Challenge cases because you care about the quality of the assessment. It will generate great feedback for the reports and improve management.</p>
<p>Question the managers as a group to reveal systematic problems: why is no one high-performing or underperforming?</p>
<h3 id="support-other-managers-to-overcome-ruinous-empathy">Support other managers to overcome ruinous empathy</h3>
<p>Providing a bad evaluation is hard, and the temptation to avoid it fits Scott’s (2019) definition of ruinous empathy<sup id="fnref:fn2" role="doc-noteref"><a href="#fn:fn2" class="footnote" rel="footnote">2</a></sup>: a manager’s silence to avoid hurting the person’s feelings, which ends up undermining their career by keeping them ignorant of their points of improvement.</p>
<p>Having the support of peers at this moment is fantastic. Not because they will tell you it is ok. They will help you identify everything you can do for your report, leaving you confident that we did our best and that the underperforming person had a fair opportunity to understand the case and recover.</p>
<h3 id="socialize-learning">Socialize learning</h3>
<p>Whenever I live through challenging people-management cases, I consider that I’ve improved as a manager. Continuous calibration enables socializing everyone’s most interesting cases with all the local managers <em>while they are happening</em>. It is easy to be wise after the event. This is a living lesson. It generates empathy by placing everybody at the moment when challenges are indeed challenging.</p>
<h3 id="training-for-the-final-calibration">Training for the final calibration</h3>
<p>When the actual calibration comes, the managers feel confident about the grades and know how to justify them.</p>
<p>The peers help shape your defense, which is important for underperformance, promotion, and high-performance cases.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Continuous calibration has the power to de-tension calibrations. With a low investment of one hour every one or two months, it offers managers a situation with high learning potential. After every session, reflect to identify feedback for yourself, and structure something for your reports.</p>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Murdoch, I. (2013). The sovereignty of good. : Routledge. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:fn2" role="doc-endnote">
<p>Scott, K. (2019). Radical candor: fully revised \& updated edition: be a kick-ass boss without losing your humanity. : St. Martin’s Press. <a href="#fnref:fn2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Luis Monedarobert@sorryapp.comIndexAng Li’s example on the Unit Selection problem with counterfactual logic2022-03-22T00:00:00+00:002022-03-22T00:00:00+00:00http://lgmoneda.github.io/2022/03/22/li-unit-selection-problem-example<h2 id="index">Index</h2>
<ol>
<li><a href="#introduction">Introduction</a></li>
<li><a href="#unit-selection-problem">Unit Selection problem</a></li>
<li><a href="#counterfactual-formulation-of-unit-selection-problem">Counterfactual formulation of Unit Selection Problem</a></li>
<li><a href="#example">The example</a></li>
</ol>
<h2 id="introduction">Introduction</h2>
<p>I’ve read some parts of <a href="https://ftp.cs.ucla.edu/pub/stat_ser/r507.pdf">Unit Selection Based on Counterfactual Logic</a><sup id="fnref:fn1" role="doc-noteref"><a href="#fn:fn1" class="footnote" rel="footnote">1</a></sup>. Unit selection is the task of selecting the individuals who respond to a specific treatment. I had to go through the calculation of one of its examples, the proposed objective function using counterfactual logic, to convince myself I understood the work. So I’m sharing it to support other people doing the same.</p>
<p>I recommend reading Ang Li’s thesis since I’ll briefly expose a few concepts from it.</p>
<h2 id="unit-selection-problem">Unit Selection problem</h2>
<p>The individual benefit varies according to their compliance type: always-takers, never-takers, compliers, and defiers.</p>
<p>The classical way to solve this problem is to run an A/B test, then use the characteristics \(C\) we have about the customers to predict heterogeneous treatment effects and select only the compliers.</p>
<h3 id="ab-test-heuristics-for-decision-making">A/B test heuristics for decision making</h3>
<p>There are two classical objective functions to decide on applying action \(a\) to incentivize behavior \(r\).</p>
<p>The first checks if doing action \(a\) makes people respond \(r\) more:</p>
\[\begin{align}
\begin{split}
Obj_{1} = argmax_{c} 100 \times P(r \mid c, do(a)) - 100 \times P(r \mid c, do(a'))
\end{split}\label{eq:obj_1}
\end{align}\]
<p>The second contrasts the benefit of response \(r\) under the action \(a\), minus the outcome behaving as \(r\) without the action \(a\).</p>
\[\begin{align}
\begin{split}
Obj_{2} = argmax_{c} 100 \times P(r \mid c, do(a)) - 140 \times P(r \mid c, do(a'))
\end{split}\label{eq:obj_2}
\end{align}\]
<h2 id="counterfactual-formulation-of-unit-selection-problem">Counterfactual formulation of Unit Selection Problem</h2>
<p>Li (2021) shows that the previous heuristics fail and proposes an objective function based on counterfactual logic.</p>
<p>See <em>Theorem 4</em>. Consider the benefits of offering the treatment to compliers (\(\beta\)), always-takers (\(\gamma\)), never-takers (\(\theta\)), and defiers (\(\delta\)), and a \(c\) representing the characteristics of the individuals, or simply subgroups. The only condition is that \(C\) does not contain any descendant of the encouragement \(X\).</p>
\[\begin{align}
\begin{split}
argmax_{c} \beta P(\text{complier} \mid c) + \gamma P(\text{always-taker}\mid c) + \theta P(\text{never-taker}\mid c) + \delta P(\text{defier}\mid c)
\end{split}\label{eq:counter_f}
\end{align}\]
<p>Using \(a\) to denote taking action and \(a'\) its absence, we have, in terms of probabilities of sufficiency and necessity:</p>
\[argmax_{c} \beta P(r_a, r'_{a'} \mid c) + \gamma P(r_{a}, r_{a'} \mid c) + \theta P(r'_a, r'_{a'}\mid c) + \delta P(r'_{a}, r_{a'}\mid c)\]
<p>This case requires observational and experimental data (more details below).</p>
<p>If we observe either Gain equality or Causal monotonicity, it becomes simpler:</p>
\[(\beta - \theta) P(y_x \mid c) + (\gamma - \beta)P(y_{x'} \mid c) + \theta\]
<p>With do notation:</p>
\[(\beta - \theta) P(r \mid do(a)) + (\gamma - \beta)P(r \mid do(a')) + \theta\]
<p>In this last case, we can estimate it with only experimental data.</p>
<h3 id="gain-equality">Gain Equality</h3>
<p>Define the benefit vector of applying the treatment to the compliance groups as follows: compliers (\(\beta\)), always-takers (\(\gamma\)), never-takers (\(\theta\)), and defiers (\(\delta\)). We say it satisfies gain equality if:</p>
\[\beta + \delta = \gamma + \theta\]
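<p>As a quick sanity check, the condition is a few lines of Python (the function name is mine, and the benefit vector below is the one from the churn example later in this post):</p>

```python
# A benefit vector (beta, gamma, theta, delta) satisfies gain equality
# when beta + delta == gamma + theta.

def satisfies_gain_equality(beta, gamma, theta, delta, tol=1e-9):
    """True if the vector satisfies beta + delta = gamma + theta."""
    return abs((beta + delta) - (gamma + theta)) < tol

# Churn example vector: complier, always-taker, never-taker, defier.
print(satisfies_gain_equality(100, -60, 0, -140))  # False: -40 != -60
```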
<h3 id="causal-monotonicity">Causal monotonicity</h3>
<p>Causal monotonicity of a variable \(Y\) in respect to a variable \(X\) means:</p>
\[y'_{x} \text{ AND } y_{x'} = \text{false}\]
<p>If the treatment \(x\) does not produce the expected outcome \(y\), its absence won’t produce it either.</p>
<h3 id="the-general-case">The general case</h3>
<p>The proposed Objective Function \ref{eq:counter_f} is bounded as follows:</p>
\[\begin{align}
\begin{split}
{}& max\{p_1, p_2, p_3, p_4\} \leq f \leq min\{p_5, p_6, p_7, p_8\} \text{ if } \sigma < 0, \\
{}& max\{p_5, p_6, p_7, p_8\} \leq f \leq min\{p_1, p_2, p_3, p_4\} \text{ if } \sigma > 0, \\
\end{split}
\end{align}\]
<p>Where</p>
\[\begin{align}
\begin{split}
{}& \sigma = \beta - \gamma -\theta + \delta \\
{}& p_1 = (\beta - \theta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c), \\
{}& p_2 = \gamma P(y_x\mid c) + \delta P(y'_{x}\mid c) + (\beta - \gamma) P(y'_{x'}\mid c), \\
{}& p_3 = (\gamma - \delta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) + (\beta - \gamma - \theta + \delta)[P(y, x \mid c) + P(y', x' \mid c)], \\
{}& p_4 = (\beta - \theta)P(y_x\mid c) - (\beta - \gamma - \theta)P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) + (\beta - \gamma - \theta + \delta)[P(y, x' \mid c) + P(y', x \mid c)], \\
{}& p_5 = (\gamma - \delta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c), \\
{}& p_6 = (\beta - \theta)P(y_x\mid c) - (\beta - \gamma - \theta)P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c), \\
{}& p_7 = (\gamma - \delta)P(y_x\mid c) - (\beta - \gamma - \theta)P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) + (\beta - \gamma - \theta + \delta)P(y \mid c), \\
{}& p_8 = (\beta - \theta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) - (\beta - \gamma - \theta + \delta)P(y \mid c), \\
\end{split}\label{eq:ps}
\end{align}\]
<h4 id="no-observational-data-case">No observational data case</h4>
<p>We can exclude the terms containing observational probabilities and use:</p>
\[\begin{align}
\begin{split}
{}& max\{p_1, p_2\} \leq f \leq min\{p_3, p_4\} \text{ if } \sigma < 0, \\
{}& max\{p_3, p_4\} \leq f \leq min\{p_1, p_2\} \text{ if } \sigma > 0, \\
\end{split}\label{eq:simple_ps}
\end{align}\]
<p>The \(p_1, p_2\) are the same. But \(p_3, p_4\) now are the \(p_5, p_6\) from \ref{eq:ps}.</p>
<h2 id="the-example">The example</h2>
<p>It is the example in section 5.3.1. In summary, we want to evaluate an action to avoid customer churn: we offer a discount to renew a subscription. The outcome of the action is \$100 for a complier (\$140 profit minus the \$40 cost), -\$60 for an always-taker (the -\$40 discount plus an extra -\$20, since they may require additional discounts in the future), \$0 for a never-taker, and -\$140 for a defier, since the company loses the customer. The discount is applied to two groups of customers identified by \(c\).</p>
<p>I’ll just expose the case for group 1 since it is similar to group 2.</p>
<p>Here’s the A/B testing result:</p>
<table class="styled-table" style="margin-left:auto;margin-right:auto;width:581px">
<thead>
<tr>
<th style="text-align:center">A/B test result</th>
<th style="text-align:center"></th>
<th style="text-align:center">do(a)</th>
<th style="text-align:center">do(a')</th>
</tr>
</thead>
<tbody>
<tr>
<td>Group 1</td>
<td style="text-align:center;font-size:18px">r</td>
<td style="text-align:center;font-size:18px">262</td>
<td style="text-align:center;font-size:18px">175</td>
</tr>
<tr>
<td>Group 1</td>
<td style="text-align:center;font-size:18px">r'</td>
<td style="text-align:center;font-size:18px">88</td>
<td style="text-align:center;font-size:18px">175</td>
</tr>
<tr>
<td>Group 2</td>
<td style="text-align:center;font-size:18px">r</td>
<td style="text-align:center;font-size:18px">87</td>
<td style="text-align:center;font-size:18px">52</td>
</tr>
<tr>
<td>Group 2</td>
<td style="text-align:center;font-size:18px">r'</td>
<td style="text-align:center;font-size:18px">263</td>
<td style="text-align:center;font-size:18px">298</td>
</tr>
</tbody>
</table>
<p>Let’s start with the A/B test heuristics.</p>
<p>To use Objective function \ref{eq:obj_1}, we need the following probabilities estimated from the A/B test:</p>
\[P(r \mid c, do(a)) = \frac{262}{262+88} = 0.7485 \\
P(r \mid c, do(a')) = \frac{175}{175+175} = 0.5 \\\]
<p>Replacing in the equation:</p>
\[Obj_{1} = 100 \times 0.7485 - 100 \times 0.5 = 24.85 \approx 25\]
<p>For the Objective function \ref{eq:obj_2}, we need the same probabilities, but we weight it differently:</p>
\[Obj_2 = 100 \times 0.7485 - 140 \times 0.5 = 4.85\]
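<p>Both heuristics are straightforward to compute from the contingency table. A minimal Python sketch for Group 1 (variable names are mine; keeping full floating-point precision gives 24.86 and 4.86, matching the hand calculation up to rounding):</p>

```python
# Compute the two A/B-test heuristic objectives for Group 1
# from the counts in the table above.

r_a, nr_a = 262, 88      # renewals / non-renewals under do(a)
r_na, nr_na = 175, 175   # renewals / non-renewals under do(a')

p_r_do_a = r_a / (r_a + nr_a)       # P(r | c, do(a)), about 0.7486
p_r_do_na = r_na / (r_na + nr_na)   # P(r | c, do(a')), exactly 0.5

obj_1 = 100 * p_r_do_a - 100 * p_r_do_na
obj_2 = 100 * p_r_do_a - 140 * p_r_do_na

print(round(obj_1, 2), round(obj_2, 2))  # 24.86 4.86
```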
<p>Now we estimate the proposed objective function. We first analyze the benefit vector:</p>
\[\sigma = \beta - \gamma -\theta + \delta = 100 - (-60) - 0 + (-140) = 20\]
<p>Gain equality is not respected. Since \(\sigma > 0\), we would use the set of Equations \ref{eq:ps}. However, since in this example we only have experimental data, we need to use \ref{eq:simple_ps}.</p>
<p>We start by exposing the components:</p>
\[P(y_x \mid c) = \frac{262}{262+88} = 0.7485 \\
P(y_{x'} \mid c) = \frac{175}{175+175} = 0.5 \\
P(y'_{x'} \mid c) = \frac{175}{175+175} = 0.5 \\
P(y'_x \mid c) = \frac{88}{262+88} = 0.2514 \\\]
\[\begin{align}
\begin{split}
{}& p_1 = (\beta - \theta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) \\
{}& p_1 = (100 - 0)\times 0.7485 + (-140) \times 0.5 + 0 \times 0.5 \\
{}& p_1 = 4.85 \\
\\
{}& p_2 = \gamma P(y_x\mid c) + \delta P(y'_{x}\mid c) + (\beta - \gamma) P(y'_{x'}\mid c) \\
{}& p_2 = -60 \times 0.7485 + (-140) \times 0.2514 + (100 - (-60)) \times 0.5 \\
{}& p_2 = -0.106 \\
\\
{}& p_3 = (\gamma - \delta)P(y_x\mid c) + \delta P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c) \\
{}& p_3 = (-60 - (-140)) \times 0.7485 + (-140) \times 0.5 + 0 \times 0.5 \\
{}& p_3 = -10.12\\
\\
{}& p_4 = (\beta - \theta)P(y_x\mid c) - (\beta - \gamma - \theta)P(y_{x'}\mid c) + \theta P(y'_{x'}\mid c), \\
{}& p_4 = (100 - 0) \times 0.7485 - (100 - (-60) - 0) \times 0.5 + 0 \times 0.5 \\
{}& p_4 = -5.15 \\
\end{split}
\end{align}\]
<p>We end up with:</p>
\[\begin{align}
\begin{split}
max\{-10.12, -5.15\} \leq {}& f \leq min\{4.85, -0.106\} \\
-5.15 \leq {}& f \leq -0.106 \\
\end{split}
\end{align}\]
<p>We take the midpoint of -5.15 and -0.106, which is -2.63, matching table 5.2 in Li (2021).</p>
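<p>The whole bound computation can be sketched in a few lines of Python (a sketch under this example's payoffs; at full precision the bounds differ from the rounded intermediate values above in the third decimal):</p>

```python
beta, gamma, theta, delta = 100, -60, 0, -140

# Experimental probabilities for the customer group c
p_y_x = 262 / 350    # p(y_x | c)
p_y_xp = 175 / 350   # p(y_{x'} | c)
p_yp_xp = 175 / 350  # p(y'_{x'} | c)
p_yp_x = 88 / 350    # p(y'_x | c)

# The four components of the experimental-data-only bounds
p1 = (beta - theta) * p_y_x + delta * p_y_xp + theta * p_yp_xp
p2 = gamma * p_y_x + delta * p_yp_x + (beta - gamma) * p_yp_xp
p3 = (gamma - delta) * p_y_x + delta * p_y_xp + theta * p_yp_xp
p4 = (beta - theta) * p_y_x - (beta - gamma - theta) * p_y_xp + theta * p_yp_xp

lower, upper = max(p3, p4), min(p1, p2)
estimate = (lower + upper) / 2  # midpoint ~ -2.63
```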
<p>Li also provides the proportion of each compliance type in group 1, which we can never observe in practice:</p>
<table class="styled-table" style="margin-left:auto;margin-right:auto;width:581px">
<thead>
<tr>
<th style="text-align:center"></th>
<th style="text-align:center">Complier</th>
<th style="text-align:center">Always-taker</th>
<th style="text-align:center">Never-taker</th>
<th style="text-align:center">Defier</th>
</tr>
</thead>
<tbody>
<tr>
<td>Group 1</td>
<td style="text-align:center;font-size:18px">30%</td>
<td style="text-align:center;font-size:18px">45%</td>
<td style="text-align:center;font-size:18px">20%</td>
<td style="text-align:center;font-size:18px">5%</td>
</tr>
</tbody>
</table>
<p>With these proportions, we can compute the true expected value of applying the discount to this group:</p>
\[\begin{align}
\begin{split}
{}& E[\text{profit}] = \beta P(\text{complier} \mid c=1) + \gamma P(\text{always-taker}\mid c=1) + \theta P(\text{never-taker}\mid c=1) + \delta P(\text{defier}\mid c=1)\\
{}& E[\text{profit}] = 100 \times 0.3 -60 \times 0.45 + 0 \times 0.2 -140 \times 0.05 \\
{}& E[\text{profit}] = -4 \\
\end{split}
\end{align}\]
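<p>As a sanity check, the same expectation in Python (proportions taken from the table above):</p>

```python
beta, gamma, theta, delta = 100, -60, 0, -140

# Compliance-type proportions in group 1, from Li's simulation
proportions = {"complier": 0.30, "always-taker": 0.45,
               "never-taker": 0.20, "defier": 0.05}

expected_profit = (beta * proportions["complier"]
                   + gamma * proportions["always-taker"]
                   + theta * proportions["never-taker"]
                   + delta * proportions["defier"])  # = -4
```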
<p>We see that \(f_3\) is a much better estimate of it than the A/B test heuristics. Li provides a plot showing how the actual value and the three objective function estimates change as we vary \(\delta\).</p>
<div align="center">
<figure>
<a href="../../../images/li-2021/plot-delta.png" name="Blank Page">
<img style="width:400px;margin:10px" src="../../../images/li-2021/plot-delta.png" />
</a>
<figcaption>Source: Li 2021, Figure 5.2</figcaption>
</figure>
</div>
<h2 id="references">References</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:fn1" role="doc-endnote">
<p>Li, A. (2021). Unit selection based on counterfactual logic. University of California, Los Angeles. <a href="#fnref:fn1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>