
Data Pipeline

2023-10-06

When you want to prepare application data for analysis, you need to build a process called a data pipeline: it collects, prepares, transforms, and transfers data from an application to a storage destination such as a data lake or data warehouse. It's important to think carefully about the requirements for this pipeline, because they determine which pattern and tools fit best.

There are generally three patterns for data pipelines:

  1. Extract Transform Load (ETL): In this pattern, data is first collected and filtered if necessary. Then, it's aggregated and processed before being stored in a data warehouse. ETL works well when data consistency is crucial, for example with historical data. However, it has downsides like slower speed, complexity, and limited scalability. Popular tools for this pattern include Apache Spark, AWS Glue, and Azure Data Factory. (A minimal ETL sketch is shown after this list.)

  2. Extract Load Transform (ELT): Similar to ETL, data is first collected, but instead of processing it right away, the raw data is stored first. Then, it's transformed within the data warehouse. ELT is suitable for situations that need more flexibility or when the data isn't fully structured. However, it requires a data warehouse with robust transformation capabilities, which adds to management efforts. Most popular solutions support this pattern except AWS Glue.

  3. Extract Transform Load Transform (ETLT): This approach is a hybrid, aiming to balance ETL's consistency and ELT's flexibility. Data is partially pre-processed, then stored, and finally processed again to its desired format. While it offers some consistency and speed benefits, it demands more planning and effort during the design stage. It's useful for scenarios requiring complex data transformations.
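
To make the ETL pattern more concrete, here is a minimal sketch in Python using pandas and SQLite; the file names, column names, and aggregation are made up for illustration:

import sqlite3
import pandas as pd

# Extract: read raw application data (hypothetical export file)
orders = pd.read_csv("orders_export.csv")  # assumed columns: order_id, country, amount, created_at

# Transform: filter and aggregate before loading
orders = orders[orders["amount"] > 0]  # drop refunds / invalid rows
daily_revenue = (
    orders
    .assign(day=pd.to_datetime(orders["created_at"]).dt.date)
    .groupby(["day", "country"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "revenue"})
)

# Load: write the processed result into the warehouse (SQLite stands in here)
with sqlite3.connect("warehouse.db") as conn:
    daily_revenue.to_sql("daily_revenue", conn, if_exists="replace", index=False)

In an ELT setup, the same raw orders table would be loaded first, and the aggregation would run inside the warehouse instead.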

TOML

2023-10-05

YAML, JSON, INI, and TOML are popular choices for configuration files, and they are replacing XML for good. While the first three are familiar and straightforward, TOML (Tom's Obvious Minimal Language) is an interesting choice. Let's take a closer look.

The official description of TOML says:

"TOML aims to be a simple configuration file format that's easy to read because it has clear rules. TOML is designed to easily turn into a data structure, like a table or a dictionary, in many programming languages. TOML should be easy to change into data structures in many different programming languages."

TOML lets you create objects, which are called tables here, using a simple syntax that avoids the need for nesting objects. For example:

[parent-object]
field1 = "the value"

[parent-object.child-object]
field2 = "another value"

The same goes for arrays of objects:

[[user]]
id = "1"
name = "user1"

[[user]]
id = "2"
name = "user2"

Another interesting feature is its support for date and time:

ldt1 = 1979-05-27T07:32:00 
ldt2 = 1979-05-27T00:32:00.999999

In terms of file size, TOML generally falls between JSON and YAML.
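
As a quick check of how these constructs map onto native data structures, here is a small sketch using Python's built-in tomllib (Python 3.11+); the keys mirror the examples above:

import tomllib

config = tomllib.loads("""
[parent-object]
field1 = "the value"

[parent-object.child-object]
field2 = "another value"

[[user]]
id = "1"
name = "user1"
ldt1 = 1979-05-27T07:32:00
""")

# Tables become nested dicts, arrays of tables become lists of dicts,
# and TOML date-times are parsed into datetime objects.
print(config["parent-object"]["child-object"]["field2"])            # "another value"
print(config["user"][0]["name"], type(config["user"][0]["ldt1"]))   # "user1" <class 'datetime.datetime'>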

Personal Productivity System

2023-10-04

Organizing your life isn't a one-size-fits-all solution. Everyone has their own journey to find what works for them. During my journey, I've tried different methods. Some didn't work, and some did. Here's what didn't work for me:

Methods That Didn't Work:

  1. Kanban/Scrum for Personal Use:

    • These methods are great for projects but require a lot of effort when you have various projects, chores, and daily tasks.
    • When the board gets filled with too many tasks, it becomes overwhelming. Tools like Trello didn't help me manage my tasks effectively.
  2. GTD (Getting Things Done):

    • GTD works well for structured projects, but it started to break down for me over time.
    • The "Sometimes/Maybe Later" category became a mess with too many projects and tasks without clear start dates.
  3. Bullet Journaling:

    • While I liked the idea of a paper-based system, it didn't work for me because many of my projects are digital.

What I'm Looking For:

Through these experiences, I've figured out what I need in an organizational system:

  • Quick Capture: I want a system that lets me record ideas quickly without delay.

  • Inbox: During idea capture, I don't want to worry about when or where to do something. I need a simple inbox to collect my ideas.

  • Projects and Tags: Projects are great for grouping tasks by goals, while tags help categorize tasks by context for better planning.

  • Scalability: The system should be easy for everyday tasks but also flexible for bigger projects.

  • Centralization: Using multiple tools for personal organization is a hassle. I want one simple system to stick with.

  • Resilience: Sometimes, I step away from my system, and when I return, I want to pick up where I left off, not start over.

What Works for Me:

Currently, I'm using a simple to-do app (Todoist or TickTick). It supports projects and tags. I don't use a "someday/maybe" folder; instead, I schedule all the tasks I'm actively working on and keep the rest under an "Unscheduled" filter. I have a special task type called "milestones" to remind me of my overall direction, and I can adjust priorities as needed. For prioritization, I use the Eisenhower matrix. I find the "Quick Capture" feature in my to-do app very helpful. I also experiment with retrospectives and use project comments to store project-specific notes.

Reflecting On 30 Days Of Daily Posting

2023-10-03

30 days ago, I started an experiment where I committed to posting daily notes on topics I'm interested in. Here's a brief look back on this period:

  • It encouraged me to research areas that interest me.
  • It motivated me to stay updated on industry trends.
  • It pushed me to dive deeper into each topic and double-check the facts I thought I knew.
  • Sometimes, it's disappointing to discard a note halfway through when I realize I don't like it.
  • Other times, it's discouraging to admit my lack of expertise in a specific topic, but I'm trying to post anyway, as it's the area I'm currently investigating.

Overall, I've decided to continue this experiment for at least three more months, and I'll share another update then. Thank you guys for your support.

GPT Function Calling

2023-10-02

I've been exploring the GPT API, and one of its cool features is Function Calling. Basically, you describe functions to the model using a JSON Schema, and the model can respond with a structured JSON object naming the function it wants to call and the arguments to pass; your code then executes the call.

You can do some interesting things with this feature. For example, you can make a program that can understand and run code that you provide. It's similar to how a Code Interpreter works.

You can also use this to make your chatbot work with other programs. This means you can create a chat-based interface for your app. And if you combine it with a code interpreter, the API can even create code that works with your existing software.
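
To illustrate the flow, here is a hedged sketch using the openai Python package as it looked in late 2023 (the ChatCompletion API with a functions parameter); the get_weather function and its schema are made up, and parameter names differ in newer SDK versions:

import json
import openai

# Describe a (hypothetical) function to the model with a JSON Schema
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call a function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    # The model returns the function name and JSON-encoded arguments;
    # it's up to our code to actually execute the call.
    args = json.loads(message["function_call"]["arguments"])
    print("Model wants to call get_weather with:", args)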

Customer Segmentation

2023-10-01

Customer segmentation is the process of categorizing a company's customers into groups based on common characteristics. This enables companies to effectively and appropriately tailor their marketing efforts to each group.

There are four primary types of customer segmentation:

  1. Demographic segmentation: This method involves dividing customers into groups based on shared characteristics such as age, gender, income, occupation, education level, marital status, and location.

  2. Psychographic segmentation: In this approach, customers are grouped based on their lifestyle, interests, values, and attitudes.

  3. Behavioral segmentation: This method classifies customers into different groups based on their purchase history, usage patterns, brand loyalty, and responses to marketing campaigns.

  4. Geographic segmentation: Here, customers are divided into groups based on their location, which can include country, region, city, or neighborhood.

Customer segmentation offers various benefits, including optimizing your marketing strategy and defining specific marketing channels that target each segment. It also helps identify ways to improve products tailored to specific segments and even test various pricing options.

Segmentation can be carried out in various ways, such as through surveys, cold calls, collecting membership data, insights from customer support interactions, purchase history analysis, online analytics, and machine learning.
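
As a small illustration of the machine-learning route, here is a hedged sketch that clusters customers into behavioral segments with scikit-learn's k-means; the toy features (orders per year, average order value) and the choice of three segments are made up:

import numpy as np
from sklearn.cluster import KMeans

# Toy behavioral data: [orders per year, average order value in $]
customers = np.array([
    [2, 20], [3, 25], [1, 15],       # occasional, low-value
    [12, 40], [15, 35], [10, 45],    # regular, mid-value
    [30, 120], [25, 150], [28, 90],  # frequent, high-value
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
segments = kmeans.fit_predict(customers)

for point, segment in zip(customers, segments):
    print(point, "-> segment", segment)

In practice you would scale the features and pick the number of segments based on the data, but the idea is the same: let the clusters suggest the groups instead of defining them up front.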

Here are some application examples:

  • Implementing different pricing strategies for students.
  • Offering family discounts.
  • Conducting age and gender-specific marketing campaigns (Netflix serves as a great example).
  • Developing distinct products for various cultural groups.
  • Sending customized messages based on how customers discovered a service.

Standard Text Interface

2023-09-30

The Portable Operating System Interface (POSIX) is a family of standards that defines a set of APIs (Application Programming Interfaces) and conventions for building and interacting with operating systems.

POSIX is designed to enhance the portability of applications. Essentially, this standard defines what a Unix-like operating system is. Among various components, such as Error Codes and Inter-Process Communication standards, it includes a list of utilities that are familiar to us, such as cd, ls, mkdir, and many more. These utilities have shaped how people interact with operating systems using text for decades.

It appears that we are witnessing a resurgence of text-based interfaces in the form of LLMs. Technologies like ChatGPT plugins, Microsoft Copilot 365, and the recently updated Bard indicate that LLMs might serve as text-based interfaces for a range of services and applications. I'm wondering if we will eventually establish a set of standards to define the interaction between LLMs and extensions, similar to how POSIX standardized Unix-like systems in its time.

Several factors could contribute to the emergence of such standards. Some of them:

1. User Demands: In a competitive market with multiple chat-based services that support third-party plugins, having a set of standards would enable compatibility across platforms or easy switching between them.

2. Technology Maturity: As these interfaces become more mature, and their applications span various domains, standardization may naturally evolve. The absence of disruptive changes and widespread usage can lead to the establishment of these standards.

Apple And AI

2023-09-29

In its latest event, Apple didn't mention AI even once, in contrast to its closest competitor, Google. While it refrained from making any AI announcements, the new generations of the iPhone and Apple Watch boast more powerful Neural Engines. I believe Apple will take a different approach from what we currently see in the market: instead of further enhancing the existing cloud-based Siri experience, they will shift towards on-device processing. This strategy aligns with their strong stance on security and privacy, and we've already seen them test on-device Siri processing on the Apple Watch. I'm curious what they could offer with offline AI and have a few thoughts:

  1. Context-aware on-device search: Imagine being able to search across all types of files, including images, documents, and videos, and retrieve information in any format simply by asking Siri.

  2. Context-aware writing assistance: With training based on your email history, typing suggestions could become context-aware, offering email responses that align with your ongoing conversations.

  3. Deeper integration with other applications: It would be fascinating to enable any app to leverage an API that creates a "skill" for the local assistant, much like how you can extend ChatGPT with extensions. This could potentially open up new niches for apps centered around AI interaction.

Homeostasis

2023-09-28

Homeostasis is a self-regulating process that enables biological systems to maintain stability while adapting to changing environmental conditions. It describes how an organism can keep its internal environment relatively constant, allowing it to adapt and survive in a frequently challenging environment.

Homeostasis consists of several key components:

  1. Receptor: As the name suggests, receptors detect changes in the external or internal surroundings. They initiate a cascade of reactions to uphold homeostasis.

  2. Control Center: Also referred to as the integration center, the control center receives information from the receptors and processes it.

  3. Effector: Effectors respond according to the instructions received from the control center, either reducing or enhancing the stimulus as needed.

Homeostasis Diagram

The concept of homeostasis finds widespread application in software engineering across various domains and industries. Here are some notable examples:

  • Configuration as Code: Technologies like Kubernetes, Terraform, and CloudFormation adopt an approach where users declare the desired system state, and the system autonomously determines how to achieve and maintain it.

  • Elasticity: Systems can dynamically scale up or down in response to workload fluctuations, ensuring they can efficiently perform their tasks.

  • Self-Healing: Container orchestrators such as Kubernetes attempt to restart a malfunctioning service if it stops responding to health checks or exhibits unusual behavior.

The concept of homeostasis closely aligns with the idea of a desired state and a declarative approach to programming. A straightforward and widely used example is markup languages, where developers specify the desired page state, and the browser is responsible for rendering it as closely as possible to that desired state.
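
The software analogy is essentially a control loop: a receptor observes the current state, a control center compares it with the desired state, and an effector acts on the difference. Here is a minimal, hypothetical sketch of such a reconciliation loop; the observe/scale functions are stand-ins, not a real orchestrator API:

import time

DESIRED_REPLICAS = 3  # the declared, desired state

def observe_running_replicas() -> int:
    # Receptor: in a real system this would query the orchestrator or infrastructure.
    return 2  # hypothetical measurement

def scale_to(count: int) -> None:
    # Effector: in a real system this would start or stop instances.
    print(f"scaling to {count} replicas")

def reconcile_once() -> None:
    # Control center: compare actual state with desired state and act on the gap.
    actual = observe_running_replicas()
    if actual != DESIRED_REPLICAS:
        scale_to(DESIRED_REPLICAS)

if __name__ == "__main__":
    for _ in range(3):  # a real controller would loop forever
        reconcile_once()
        time.sleep(1)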

Quality Gates

2023-09-27

A quality gate is a critical checkpoint in the software development lifecycle that assesses whether software meets specific criteria. Its primary goal is to identify and fix as many issues as possible before releasing the software. Quality gates may include, but are not limited to, the following checks:

  • Build: Checking if the software builds and compiles without any errors.
  • Linting: Ensuring that the codebase adheres to accepted best practices.
  • Tests: Including both functional tests and coverage reports.

Typical locations for implementing quality gates are:

  • Local Environment: Usually implemented with pre-commit hooks, this allows for early issue detection during code commit. Among other checks, it's an excellent place to enforce code style using tools like Prettier and validate naming conventions for the branch.

  • PR Validation: These checks duplicate those in the local environment, in case a developer skips pre-commit hooks with the --no-verify option, and add PR-specific validations. For example, Azure DevOps can check whether an associated work item exists or whether a description was provided for the PR.

  • Main Branch Actions: This is the best place to run extensive integration and automation tests in addition to the previous checks. It ensures that the software continues to meet quality standards after merging into the main branch.

This setup works exceptionally well with temporary teams, such as contractors or outsourcers, to ensure that the codebase complies with defined standards.
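
As an example of the first gate, here is a hedged sketch of a local pre-commit hook written in Python (saved as .git/hooks/pre-commit and made executable); the linter and test commands are placeholders for whatever your project actually uses:

#!/usr/bin/env python3
"""A minimal pre-commit quality gate: lint and test before allowing the commit."""
import subprocess
import sys

CHECKS = [
    ["ruff", "check", "."],  # placeholder linter command
    ["pytest", "-q"],        # placeholder test command
]

def main() -> int:
    for command in CHECKS:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"pre-commit gate failed: {' '.join(command)}")
            return result.returncode  # non-zero exit aborts the commit
    return 0

if __name__ == "__main__":
    sys.exit(main())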

Model First VS Query First

2023-09-26

SQL is a good example of an abstraction that works in most cases (I assume the 80/20 rule is applicable here). But, like most abstractions, it cracks under pressure, and instead of writing readable, well-structured queries, developers find themselves writing dynamic SQL, tweaking indices, and investigating execution plans.

I think query-first data modeling, as used in Apache Cassandra, is more transparent compared to model-first, used in SQL:

  • It doesn't try to hide the physical nature of the query and insists on picking a good index up front, so it doesn't falter under high loads and can handle huge workloads.
  • It doesn't presume complete data integrity, and thus techniques like partitioning don't seem alien. CQL, for instance, insists on picking a good partition key when modeling your data, presuming partitioning from the beginning.
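
To show what "query-first" means in practice, here is a hedged sketch of CQL statements (held as plain Python strings; the schema and names are made up): the table is shaped around the query you intend to run, and the partition key is chosen up front:

# The query we want to serve: "all messages in a chat, newest first".
QUERY = """
SELECT * FROM messages_by_chat
WHERE chat_id = ?
ORDER BY sent_at DESC;
"""

# The table is modeled from that query: chat_id is the partition key
# (decides which node stores the data), sent_at is the clustering key
# (defines on-disk ordering inside the partition).
SCHEMA = """
CREATE TABLE messages_by_chat (
    chat_id  uuid,
    sent_at  timeuuid,
    author   text,
    body     text,
    PRIMARY KEY ((chat_id), sent_at)
) WITH CLUSTERING ORDER BY (sent_at DESC);
"""

print(SCHEMA)
print(QUERY)
# In practice these would be executed with a driver such as the DataStax
# cassandra-driver; here they only illustrate the modeling approach.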

Effort Estimation

2023-09-25

I didn't realize until now that there are so many ways to do effort estimation in software engineering. Some of them:

  • Story Points: This is a classic one. The team decides on a minimum point and then compares the complexity of other tasks to this minimum. Usually, they use the Fibonacci Sequence and Planning Poker for this.
  • T-Shirt Sizes: This method uses a set of predefined sizes like XS, S, M, L, XL, etc., to estimate the complexity.
  • Ideal Days: This is straightforward. The team estimates how many ideal workdays they need to finish a task.
  • Function Point Analysis (FPA): This feels more academic. It considers factors like the number of external inputs and outputs and internal interfaces.
  • The Matrix Method: This is a visual method. It uses time on the X-axis and complexity on the Y-axis.

Explicit Error Handling

2023-09-24

When I first started using Go, it took me some time to become familiar with its error handling approach. In comparison to the traditional control flow approach, where exceptions are handled in a separate block (try/catch), modern languages like Go and Rust use a different approach called explicit error handling. In this approach, the error is one of the return values, and the developer is expected to check and handle it right away.

For example, in Go:

value, err := someFunction()
if err != nil {
    // handle the error
}

In Rust, you would use Result<T, E>, an enum with two variants: Ok(T) and Err(E):

match some_function() {
    Ok(value) => {
        // use the value
    },
    Err(e) => {
        // handle the error
    },
}

Both languages also have a similar concept called "panic" mode, which represents unrecoverable errors that should interrupt the execution.

Explicit error handling arguably makes error-handling code more readable, because the handling sits right next to the place where the error occurs. With the control-flow approach, the catch block might live in a separate function, be buried deep beneath many other calls that can produce the error, or, even worse, be hidden behind a generic exception.

Red Ocean Blue Ocean

2023-09-23

"Red Ocean, Blue Ocean" is the concept from the business strategy book "Blue Ocean Strategy" by W. Chan Kim and RenƩe Mauborgne that aims to help companies grow under different market conditions and adjust their actions accordingly to market "temperature":

  • Red Ocean: an already established market with competitors. How to gain an advantage?
    • Better cost;
    • Better quality;
    • Focus on a specific niche;
    • Better branding;
    • Relationships;
  • Blue Ocean: an unknown market space, where demand is created rather than fought over. How to win:
    • Innovate
    • Create new demand
    • Attract new customers

The Blue Ocean almost always looks more appealing due to the lack of competitors; however, you need to be a visionary to spot one. While the Red Ocean may seem rough, knowing the rules and the market might help you secure your share.

The Paradox Of Choice

2023-09-22

In his book "The Paradox of Choice," Barry Schwartz mentioned that having too many choices could lead to less satisfaction and greater regret. The paradox is related to the following characteristics:

  • Choice Overload: Studies have shown that once a certain threshold is reached, there is a decrease in interest. This is particularly evident in retail, where stores know how many different brands are enough to keep customers interested but not too many to overwhelm them.

  • Escalation of Expectations: The more choices you have, the more you tend to believe that there must be "the best one" among them.

  • Regret and Opportunity Costs: This is closely tied to the previous point; people tend to experience more regret when they have to choose between various options.

How to cope with the Paradox Of Choice? Schwartz divides decision-makers into "maximizers" - those who constantly seek the best possible option, and "satisficers" - those who are content with a good enough option. By proactively limiting the number of choices to those that have been proven to be "good enough," you can reduce decision anxiety for a significant portion of consumer-based choices.

WAL And SQLite

2023-09-21

PocketBase, an open-source backend, made an interesting choice for its persistent storage: SQLite. In the industry, SQLite is usually used for simple client-side storage because it's lightweight, portable, and needs little setup. However, it lacks some features (for example, RIGHT and FULL OUTER JOINs in older versions), doesn't support multiple concurrent writers, and has no user management. This choice raises some questions.

Here's their response:

PocketBase uses embedded SQLite (in WAL mode) and has no plans to support other databases. For most queries, SQLite (in WAL mode) performs better than traditional databases like MySQL, MariaDB, or PostgreSQL, especially for read operations. If you need replication and disaster recovery, you can consider using Litestream.

Basically, they're writing into a WAL (Write-Ahead Logging) instead of using a rollback journal to address some critical issues with SQLite:

  1. Increased Concurrency: Using WAL allows multiple readers and a writer to work at the same time. Readers don't block writers, and writers don't block readers, so reading and writing can happen at the same time.

  2. Improved Performance: Writing a commit to the end of the WAL file is faster than writing to the middle of the main database file. This makes transactions faster with WAL.

  3. Crash Recovery: It's more robust if there's a crash. Changes are first written to the WAL file and then transferred to the database file, reducing the risk of database corruption.

  4. Disk I/O Reduction: With WAL, disk operations are more sequential, which reduces the time spent on disk seek operations.
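
Enabling WAL mode is a one-line pragma. Here is a minimal sketch with Python's built-in sqlite3 module; the database file name is arbitrary:

import sqlite3

conn = sqlite3.connect("app.db")

# Switch the journal mode to WAL; the setting is persistent for the database file.
mode = conn.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
print(mode)  # prints "wal" once the switch succeeds

conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
conn.commit()
conn.close()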

Hofstadter's Law

2023-09-20

Hofstadter's Law addresses the common problem of accurately estimating the time it will take to complete a task.

Ā "It always takes longer than you expect, even when you take into account Hofstadter's Law."

Bill Gates' interpretation of this law especially resonates with me:

"Most people overestimate what they can do in one year and underestimate what they can do in ten years."

  • You overestimate what you can achieve in a gym in one month, and underestimate progress for six.
  • You overestimate your ability to learn something in a month, but underestimate what you can learn in one year.
  • You overestimate your saving ability for a particular paycheck, but underestimate what you can save in one year.

While planning something, I try to take this law into account and be more humble in my short-term goals, and more ambitious with long-term ones.

Not Invented Here Syndrome

2023-09-19

Not Invented Here (NIH) syndrome is a tendency to build in-house software instead of utilizing existing options. In its simplest form, it's a constant need to reinvent the wheel. Here are some notable examples:

  1. Netscape Navigator: Netscape decided to rewrite its entire codebase for Netscape Navigator 5.0, believing that starting from scratch would enable them to leapfrog the competition. Unfortunately, the project took much longer than expected, and by the time Netscape 6.0 (5.0 was skipped altogether) was released in 2000, Internet Explorer had taken over the browser market. Netscape's market share never recovered.

  2. Digg v4: Social news aggregator Digg decided to rewrite its entire codebase for version 4, moving away from MySQL and Memcache to Cassandra. The move was not well-received by users, and numerous bugs and performance issues led to a mass exodus to competitors like Reddit. The company's value plummeted, and they were eventually sold for a fraction of their peak value.

  3. Rewriting Quake by id Software: John Carmack, a co-founder of id Software, decided to rewrite the Quake game engine from scratch in C++, moving away from C. The rewrite ended up taking much longer than anticipated and led to numerous bugs and stability issues, damaging the game's reputation.

  4. Friendster: One of the first social networking sites, Friendster, faced scalability issues as more users joined. Instead of improving and optimizing their existing platform, they decided to rewrite the entire codebase. The result was a buggy, slow platform that frustrated users and led to a rapid decline in the user base.

  5. HealthCare.gov: When the U.S. government launched HealthCare.gov in 2013, it was a disaster due to numerous technical issues. Despite the government's massive resources, the site suffered from poor performance and frequent crashes. A key reason for the site's issues was that the government insisted on custom-building much of the site's functionality rather than using proven existing solutions.

Cloud IDEs

2023-09-18

I was working in StackBlitz and thinking about the potential future of Cloud Integrated Development Environments (IDEs). While they provide a convenient way to quickly set up configured development environments, I believe they won't entirely replace on-device development environments for everyday engineering tasks. They may, however, find their niche in replacing Virtual Desktop Environment (VDE) software, especially in situations where contractual restrictions prevent storing codebases locally, as is often the case for consulting companies.

Here are some reasons why I think local development environments will continue to dominate:

  1. Freedom of Tooling Choice: In your local environment, you have the freedom to select and customize tools and plugins that align perfectly with your workflow. You can even use proprietary tools if necessary, which can be challenging to integrate into online IDEs.

  2. Databases: While it's possible to create a development database in the cloud for certain use cases, having a local database can be indispensable. Whether you need it for testing migrations or simply for experimenting with data, a local environment offers greater flexibility. GitHub Codespaces does allow the use of Docker images, but this can add to your bill, leading to the next point.

  3. Pricing: Cloud IDEs often come with a price tag. GitHub Codespaces, for instance, bills on a per-core, per-hour basis, while CodeSandbox charges $15 per month per editor. AWS Cloud9's pricing is tied to the underlying EC2 instance usage. Paying a monthly fee for a tool that offers a subset of the capabilities available in your local environment may not be cost-effective for many developers.

Tokenization

2023-09-17

Tokenization is the process of breaking down text into components, known as tokens. Each token might represent an individual word, subword, or phrase. This process is required to make the data more manageable and suitable for various NLP tasks (text mining, ML, or text analysis). Let's take a look at the BERT-like model tokenization process:

  • Normalization (any cleanup of the text that is deemed necessary, such as removing spaces or accents, Unicode normalization, etc.)
  • Pre-tokenization (splitting the input into words)
  • Running the input through the model (using the pre-tokenized words to produce a sequence of tokens)
  • Post-processing (adding the special tokens of the tokenizer, generating the attention mask and token type IDs)

Example:

"Hello how are U today?" - Input
    |
    v
"hello how are u today?" - Case normalization
    |
    v
["hello", "how", "are", "u", "td", "##ay", "?"] - Subword tokenization
    |
    v
["CLS", "hello", "how", "are", "u", "td","##ay","?","SEP"] - Assigning special tokens

Here, "today" has been split into "td" and "##ay". This technique is known as Subword Tokenization, often used in models like BERT to handle out-of-vocabulary words.

Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. (Source)

Special tokens serve specific roles in BERT-like models: the CLS token represents the entire context of the input for tasks like classification, while the SEP token separates different sentences or segments within the same input.
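
Here is a hedged sketch of the same pipeline using the Hugging Face transformers library with a BERT tokenizer; the exact subword splits depend on the model's vocabulary, so they may differ from the example above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello how are U today?"

# Normalization + pre-tokenization + subword tokenization
print(tokenizer.tokenize(text))

# Full encoding adds the special tokens and returns the attention mask
# and token type IDs used by the model.
encoded = tokenizer(text)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
print(encoded["attention_mask"], encoded["token_type_ids"])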

If It Is Not Deployed It Does Not Exist

2023-09-16

Expanding on my previous post, I firmly believe in the concept - if it's not deployed, it doesn't exist. This statement was, I think, first put forward by @jasonfried.

Throughout my career, I've grappled with a simple yet profound realization - ideas that haven't been brought to life, even in the form of a Minimum Viable Product (MVP), don't actually exist. Whether it's a modification required for a project, a new hobby I want to explore, or a fresh product idea, there needs to be a tangible version of it. This allows us to evaluate whether it's worth investing more time into. If it's not, then it should be discarded.

RxJs And Angular

2023-09-15

I've been using Angular for a couple of years now, and one pattern that keeps recurring when I dive into a codebase is how quickly application logic can become unreadable with the reactive extensions library and, more specifically, NgRx. While it offers a set of tools for managing asynchronous data that arguably no other stream-processing library can match, it also creates a callback-hell-like tangle that makes much of the logic far harder to comprehend than an imperative approach would be. I still think it's too much overhead for most client-side applications, and this is exactly why I don't use it in my projects.

Validated Learning

2023-09-14

According to Product-Led Growth, the main value of a startup is validated learning: the intelligence a company gains after shipping products, which it can use to adjust its strategy for further advantage.

I tend to extend this idea to the individual scale, where I see a person as a company in themselves. In some way, a person also uses resources to achieve their objectives, operating in a (hopefully fair) market. And along the way they earn valuable insights, from both wins and setbacks.

It's surprisingly interesting to look at personal experiences through these lenses:

  • What unique problems am I facing?
  • How can I leverage this experience to my advantage?
  • What research can I conduct to gain more expertise and knowledge in a subject I'm interested in?
  • What is the Minimum Viable Product of a change I want to implement in my life?

Model Fine Tuning

2023-09-13

There are plenty of pre-trained models available that could cover most of your needs. However, there are situations when a base model needs further calibration to serve well. Some examples:

  • Adapting a model to a specific task. Sometimes, the performance of general-purpose models isn't enough for your particular problem.
  • Fighting overfitting. When a model is trained on a limited dataset, it might perform poorly on unfamiliar data. Further fine-tuning can help overcome this problem.
  • Knowledge transfer. You can use another model that is already trained to perform well on a specific task to transfer its knowledge to your model.

NLTK

2023-09-12

When working with NLP, one interesting tool that might help you with a wide range of tasks is NLTK. Some of its capabilities:

  • Tokenization
  • Stemming
  • Part-of-speech Tagging
  • Named Entity Recognition
  • Sentence parsing
  • Concordance
  • Frequency Distribution
  • Text Classification
  • Sentiment Analysis

One extremely useful feature that has saved me a good portion of time is its large collection of text corpora for training your models.
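
A hedged sketch of a few of these capabilities with NLTK; the resource downloads are needed on first run, and the sentence is arbitrary:

import nltk

# One-time downloads of the required resources
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "NLTK makes many NLP tasks surprisingly approachable."

# Tokenization
tokens = nltk.word_tokenize(sentence)
print(tokens)

# Part-of-speech tagging
print(nltk.pos_tag(tokens))

# Frequency distribution over the tokens
print(nltk.FreqDist(tokens).most_common(3))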

My CMS Setup

2023-09-11

I'm a big fan of Obsidian.md. I have been managing my notes there for a good couple of years now, and when I started my blog, I knew this would be my editing tool of choice for the following reasons:

  • Integration with my knowledge base,
  • An editing setup that I'm used to,
  • Great Markdown support.

My blog is hosted on Vercel and backed by persistent Redis, but I haven't found a straightforward way to integrate my favorite tools together, so it was time to reinvent the wheel. My publications, along with media, live in the 'posts' folder, following the naming convention yyyy-mm-dd-title.md. The blog itself retrieves the list of articles from an index record inside the database; it's just a list of posts along with metadata. One interesting field is the checksum, which I use to determine whether a post has been edited. To keep my posts in sync with the database, I created a script that does the following (a rough sketch is shown after the list):

  • If a post is not in the index, create it.
  • If a post is in the index, compare the checksum, and edit it if needed.
  • If a post is in the index but not in my folder, remove it.
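
A rough sketch of that sync logic, assuming the index is a dict of post name to metadata; the create/update/remove functions are hypothetical stand-ins for the real Redis operations:

import hashlib
from pathlib import Path

POSTS_DIR = Path("posts")

def checksum(path: Path) -> str:
    # Hash the file contents to detect edits.
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical stand-ins for the real index/Redis operations.
def create_post(name: str) -> None: print("create", name)
def update_post(name: str) -> None: print("update", name)
def remove_post(name: str) -> None: print("remove", name)

def sync(index: dict) -> None:
    local_posts = {p.name: checksum(p) for p in POSTS_DIR.glob("*.md")}

    for name, digest in local_posts.items():
        if name not in index:
            create_post(name)                      # not in the index: create it
        elif index[name]["checksum"] != digest:
            update_post(name)                      # checksum changed: edit it

    for name in set(index) - set(local_posts):
        remove_post(name)                          # gone locally: remove it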

The key of the actual post is its name, so I don't need the index to access a particular publication directly. All images are embedded into a post as base64. This approach does the job for my small website, but storing images on a CDN would be preferable.

My vault is stored on GitHub, and the script is deployed as an Action. I have the Git integration plugin installed for the vault, so publishing a post is basically a commit.

While it's not an ideal solution, it works for me, letting me use the tools I love and making the process a bit more enjoyable.

Why Naive Bayes Method Is Naive

2023-09-10

Bayes' theorem is a fundamental building block of probability theory and allows us, in simple terms, to express and update our beliefs given new information.

The formula:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

  • P(A|B) is the probability of event A given event B is true.
  • P(B|A) is the probability of event B given event A is true.
  • P(A) and P(B) are the probabilities of events A and B respectively.

One interesting application of this theorem is Sentiment Analysis. That specific use case of Bayes' theorem is called the Naive Bayes method. Let's find out why.

For this problem, you need a dataset: for example, the IMDB 50K dataset I mentioned in my previous post, or your own collection of text samples labeled with sentiments such as "positive," "negative," or "neutral."

The algorithm:

  1. Calculate the prior probabilities based on the sentiment distribution: P(positive), P(negative), P(neutral).
  2. Then, for each word, calculate the sentiment probability based on its occurrences in the text snippets: P(love|positive), P(terrible|negative), etc.
  3. Based on that information, we can now define a posterior classifier that updates the sentiment probability:
    P(positive|text) = \frac{P(love|positive) \cdot P(terrible|positive) \cdot \ldots \cdot P(positive)}{P(text)}
    Then the same for neutral and negative.
  4. P(text) is calculated using the law of total probability:
    P(text) = P(text|positive) \cdot P(positive) + P(text|negative) \cdot P(negative) + P(text|neutral) \cdot P(neutral)

P(text|positive), P(text|negative), and P(text|neutral) are calculated using a simplification called bag of words, where we basically assume that all words in the sentence are independent and their only feature is frequency:

P(text|positive) = P(love|positive) \cdot P(this|positive) \cdot P(weather|positive)

And the same for the rest of the sentiment labels.

The bag of words simplification is exactly why this method is called "naive." It might seem like a shallow assumption, but it turns out to be extremely efficient in practice.
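
A hedged sketch of the whole thing with scikit-learn, which implements exactly this bag-of-words plus multinomial Naive Bayes combination; the tiny training set here is made up, so treat it as an illustration rather than a usable model:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled samples standing in for a real dataset like IMDB 50K
texts = [
    "I love this movie, it was wonderful",
    "What a great and touching story",
    "Terrible plot and awful acting",
    "I hated every minute of it",
]
labels = ["positive", "positive", "negative", "negative"]

# CountVectorizer builds the bag-of-words counts, MultinomialNB applies Bayes' theorem
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["what a wonderful story"]))  # likely ['positive']
print(model.predict(["awful, I hated it"]))       # likely ['negative']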

First Steps In Machine Learning

2023-09-09

I recently began diving into Machine Learning. My plan is to start with a hands-on approach and dig into theory on demand. After some research and chatting with GPT, I selected the Sentiment Analysis problem and the Naive Bayes algorithm as the first ones to try, for the following reasons:

  • Low complexity level. Bayes' Theorem is not necessarily intuitive to me, but it's relatively easy to understand and still powerful enough to produce good results. The idea is to build something simple, get familiar with the tools, and test the waters. I was choosing between Naive Bayes and k-NN as my first target.
  • Available datasets. For my project, I'm using the IMDB 50K dataset, which contains a collection of positive and negative comments along with their sentiment scores. There are plenty of other datasets available for training a model for your specific needs.
  • Well-known problem. Sentiment Analysis has been widely used for a while now to gain measurable insights from various forms of text.

On Typescript

2023-09-08

A hot discussion about dropping TypeScript support has been happening in recent days, and it seems like a major shift in the field. But is it?

Some thoughts:

  • Skipping generics "gymnastics," especially when designing a library, might seem like a boost to developer productivity, and it might feel like one, but whether it is an improvement in the long run is still a big question. I'm looking forward to seeing examples.
  • Rather than rolling back to JavaScript to skip the compilation step, why not move forward towards native TS support (for example, Bun)?
  • Most client code will be using TypeScript anyway, so I don't expect major changes here.

Overall, it's a good opportunity to take a step back and critically assess whether the tools we are using are still doing a good job.

Evergreen Notes

2023-09-07

In his blog, Andy Matuschak coined the concept of Evergreen notes, a note-taking approach aimed at keeping information structured and promoting insight. It is based on the "Zettelkasten" method, originally invented by Niklas Luhmann and described in "How to Take Smart Notes" by Sönke Ahrens.

In the Zettelkasten system, notes are stored as atomic pieces of information connected with links. The original version was a simple box with physically sorted cards.

That system allowed linking and structuring information in various ways, long before databases were invented.

Evergreen notes take this idea further, leveraging note-taking software such as Obsidian, adding the following adjustments to the predecessor:

  • Network over hierarchy. Instead of structuring notes hierarchically, splitting a complex topic into smaller subcategories, Evergreen notes promote an associative taxonomy, where various concepts can be connected to other concepts, forming a graph-like structure.
  • Inbox. Similarly to GTD, new information should be placed into an inbox for further processing. The implementation of the inbox is up to the user; for example, I'm using Todoist as my general-purpose inbox for both tasks and notes.
  • Spaced repetition. A process for revisiting existing information should be set up. For Obsidian users, there are multiple plugins that can help with that: the core Random Note plugin and the community, flashcard-based Spaced Repetition plugin are worth mentioning.
  • Map Of Content (MOC). MOCs allow connecting various notes in a single place using links. This way, you can outline a topic that you're interested in.

This structure allows maintaining a quite impressive amount of knowledge; a good example is the knowledge graph of Obsidian's CEO, Steph Ango, where each dot represents a note.

Expected Value

2023-09-06

Expected Value is used in statistics, gambling, and other domains to measure the expected profitability of a given decision. It is calculated by multiplying each possible outcome by the probability of that outcome occurring, then summing these results:

EV = ∑ [P(x) * X]

  • P(x) is the probability of each outcome,
  • X is the value of each outcome,
  • ∑ indicates the sum across all possible outcomes.
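
In code the formula is just a weighted sum. A minimal sketch, with a made-up set of outcomes:

# Possible outcomes as (probability, value) pairs; probabilities should sum to 1.
outcomes = [
    (0.70,  100.0),   # likely, modest payoff
    (0.25,    0.0),   # nothing happens
    (0.05, -500.0),   # rare, costly failure
]

expected_value = sum(p * x for p, x in outcomes)
print(expected_value)  # 0.7*100 + 0.25*0 + 0.05*(-500) = 45.0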

Suppose a person wants to buy a car and is trying to choose between a Toyota Camry and a Honda Civic. Some common metrics to consider (the list is shortened for simplicity): initial price, operating costs, resale value, and reliability. The comparison below is just an example; it can be extended with other metrics that might impact the decision, such as customer satisfaction or personal preferences.

Factor                        | Honda Civic                                    | Toyota Camry
Initial Price                 | $15,000                                        | $18,000
Operating Costs (5 years)     | $14,700                                        | $16,400
Resale Value (after 5 years)  | -$9,000                                        | -$10,800
Reliability                   | High (Hondas are known for their reliability)  | High (Toyotas are known for their reliability)
Total Cost over 5 years       | $20,700                                        | $23,600

Starting Personal Blog In 2023

2023-09-05

I've been considering starting a personal blog for some time now. However, with the recent advancements in AI, I've started to question whether it's worthwhile to undertake such a project in 2023. Despite my doubts, I've decided that the answer is yes, for the following reasons:

  • Still Valuable: Unique content will always have its place. Even in an age of AI-generated content, there's an irreplaceable value in creative, original thoughts and ideas.
  • Personal Corner: As social networks become more regulated and controlled, having your own space where you can freely express and share your thoughts becomes increasingly important and valuable.
  • Continuous Learning: A blog isn't just a platform for sharing, it's also a tool for learning. The research and thought that go into blog posts foster a continuous learning process.
  • Personal Brand: Despite the proliferation of social media influencers and AI-generated content, having a personal blog remains an excellent way to build and control your personal brand. It gives you the freedom to shape your online identity.
  • Discipline: The commitment required in maintaining a blog, such as consistent writing, is a good way to develop personal discipline. It encourages regular self-reflection and can help foster a habit of critical thinking.

Feedback Loops

2023-09-04

In her book "Thinking in Systems," Donella H. Meadows defines a system as an interconnected set of elements that is coherently organized in a way that achieves something. A system is more than the sum of its parts.

A system consists of three main parts:

  1. Elements: These are the components or parts which can't be broken down into further parts within the context of the system.
  2. Interconnections: These are the relationships that hold the elements together. They include physical flows, as well as flows of information or influence.
  3. Function or Purpose: This is the behavior the system exhibits or the goals it is trying to accomplish. This is often not written or spoken anywhere but is a consequence of the interplay between the system's elements and their interconnections.

Examples of systems in daily life:

  • Wardrobe: The elements are pieces of clothing. Interactions involve the flow of clothes - choosing, wearing, washing, and returning items. The function is to provide appropriate attire. The feedback loop involves maintaining order and finding items easily. Understanding this system optimizes wardrobe management.
  • Nutrition: The elements are meals. Interactions involve the process of buying, preparing, and consuming food. The function is to nourish and provide energy. The feedback loop can be the impact on health and wellness. Understanding this system can lead to healthier food choices and improved well-being.