I committed to posting daily:
2023-10-06
When you want to prepare application data for analysis, you need to build a process called a data pipeline. This process collects, prepares, transforms, and transfers the data from an application to a storage destination such as a data lake. It's important to think carefully about the requirements for this data pipeline, because each pattern comes with its own trade-offs.
There are generally three patterns for data pipelines:
Extract Transform Load (ETL): In this pattern, data is first collected and filtered if necessary. Then, it's aggregated and processed before being stored in a data warehouse. ETL works well when data consistency is crucial, like historical data. However, it has downsides like slower speed, complexity, and limited scalability. Popular tools for this pattern include Apache Spark, AWS Glue, and Azure Data Factory.
Extract Load Transform (ELT): Similar to ETL, data is first collected, but instead of processing it right away, the raw data is stored first. Then, it's transformed within the data warehouse. ELT is suitable for situations that need more flexibility or when the data isn't fully structured. However, it requires a data warehouse with robust transformation capabilities, which adds to management efforts. Most popular solutions support this pattern except AWS Glue.
Extract Transform Load Transform (ETLT): This approach is a hybrid, aiming to balance ETL's consistency and ELT's flexibility. Data is partially pre-processed, then stored, and finally processed again to its desired format. While it offers some consistency and speed benefits, it demands more planning and effort during the design stage. It's useful for scenarios requiring complex data transformations.
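To make the contrast concrete, here is a minimal, hypothetical sketch of the ETL pattern in Python; the table names and the use of sqlite3 as both source and warehouse are assumptions for illustration, not a reference to any specific tool:

import sqlite3

def extract(source: sqlite3.Connection) -> list[tuple]:
    # Collect raw rows from the application's storage
    return source.execute("SELECT user_id, amount FROM orders").fetchall()

def transform(rows: list[tuple]) -> list[tuple]:
    # Filter and aggregate before loading: the "T" happens outside the warehouse
    return [(user_id, round(amount, 2)) for user_id, amount in rows if amount > 0]

def load(warehouse: sqlite3.Connection, rows: list[tuple]) -> None:
    warehouse.executemany("INSERT INTO fact_orders (user_id, amount) VALUES (?, ?)", rows)
    warehouse.commit()

In ELT, the transform step would instead run as SQL inside the warehouse after the raw rows are loaded; ETLT splits it into a light pre-load cleanup plus a heavier in-warehouse pass.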
2023-10-05
YAML, JSON, INI, and TOML are popular choices for configuration files, and they are replacing XML for good. While the first three are familiar and straightforward, TOML (Tom's Obvious, Minimal Language) is an interesting choice. Let's take a closer look.
The official description of TOML says:
"TOML aims to be a minimal configuration file format that's easy to read due to obvious semantics. TOML is designed to map unambiguously to a hash table. TOML should be easy to parse into data structures in a wide variety of languages."
TOML lets you create objects, which are called tables here, using a simple syntax that avoids deeply nested structures. For example:
[parent-object]
field1 = "the value"
[parent-object.child-object]
field2 = "another value"
The same goes for arrays of objects:
[[user]]
id = "1"
name = "user1"
[[user]]
id = "2"
name = "user2"
Another interesting feature is its support for date and time:
ldt1 = 1979-05-27T07:32:00
ldt2 = 1979-05-27T00:32:00.999999
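Python 3.11+ can parse all of the above with the standard-library tomllib module, which makes the "maps to a hash table" claim easy to verify (a minimal sketch):

import tomllib

doc = """
[[user]]
id = "1"
name = "user1"

[[user]]
id = "2"
name = "user2"
"""

data = tomllib.loads(doc)
# The array of tables becomes a plain list of dictionaries
print(data["user"][0]["name"])  # -> user1

Note that tomllib also parses TOML date-time values like ldt1 above directly into datetime.datetime objects.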
In terms of file size, TOML generally falls between JSON and YAML.
2023-10-04
Organizing your life isn't a one-size-fits-all solution. Everyone has their own journey to find what works for them. During my journey, I've tried different methods. Some didn't work, and some did.
Methods That Didn't Work:
Kanban/Scrum for Personal Use
GTD (Getting Things Done)
Bullet Journaling
What I'm Looking For:
Through these experiences, I've figured out what I need in an organizational system:
Quick Capture: I want a system that lets me record ideas quickly without delay.
Inbox: During idea capture, I don't want to worry about when or where to do something. I need a simple inbox to collect my ideas.
Projects and Tags: Projects are great for grouping tasks by goals, while tags help categorize tasks by context for better planning.
Scalability: The system should be easy for everyday tasks but also flexible for bigger projects.
Centralization: Using multiple tools for personal organization is a hassle. I want one simple system to stick with.
Resilience: Sometimes, I step away from my system, and when I return, I want to pick up where I left off, not start over.
What Works for Me:
Currently, I'm using a simple to-do app like Todoist or TickTick that supports projects and tags. I don't use a "someday/maybe" folder; instead, I schedule all tasks I'm actively working on and keep the rest under an "Unscheduled" filter. I have a special task type called "milestones" to remind me of my overall direction, and I can adjust priorities as needed. For prioritization, I use the Eisenhower matrix. I find the "Quick Capture" feature in my to-do app very helpful. I also experiment with retrospectives and use project comments to store project-specific notes.
2023-10-03
30 days ago, I started an experiment where I committed to posting daily notes on topics I'm interested in. Here's a brief look back at this period:
Overall, I've decided to continue this experiment for at least three more months, and I'll share another update then. Thank you guys for your support.
2023-10-02
I've been exploring the GPT API, and one of its cool features is called Function Calling. Basically, you describe the functions your code exposes to the model using a JSON Schema, and the model can respond with a structured JSON request to call one of them.
You can do some interesting things with this feature. For example, you can make a program that can understand and run code that you provide. It's similar to how a Code Interpreter works.
You can also use this to make your chatbot work with other programs. This means you can create a chat-based interface for your app. And if you combine it with a code interpreter, the API can even create code that works with your existing software.
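Here's a minimal sketch of how this looks with the openai Python SDK (pre-1.0, as of this writing); the get_weather function and its schema are invented for illustration:

import openai

# Describe a (hypothetical) function to the model using JSON Schema
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    functions=functions,
)

message = response.choices[0].message
if message.get("function_call"):
    # The model returns the function name plus JSON-encoded arguments;
    # actually executing the function is your code's responsibility.
    print(message["function_call"]["name"])
    print(message["function_call"]["arguments"])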
2023-10-01
Customer segmentation is the process of categorizing a company's customers into groups based on common characteristics. This enables companies to effectively and appropriately tailor their marketing efforts to each group.
There are four primary types of customer segmentation:
Demographic segmentation: This method involves dividing customers into groups based on shared characteristics such as age, gender, income, occupation, education level, marital status, and location.
Psychographic segmentation: In this approach, customers are grouped based on their lifestyle, interests, values, and attitudes.
Behavioral segmentation: This method classifies customers into different groups based on their purchase history, usage patterns, brand loyalty, and responses to marketing campaigns.
Geographic segmentation: Here, customers are divided into groups based on their location, which can include country, region, city, or neighborhood.
Customer segmentation offers various benefits, including optimizing your marketing strategy and defining specific marketing channels that target each segment. It also helps identify ways to improve products tailored to specific segments and even test various pricing options.
Segmentation can be carried out in various ways, such as through surveys, cold calls, collecting membership data, insights from customer support interactions, purchase history analysis, online analytics, and machine learning.
Here are some application examples:
2023-09-30
The Portable Operating System Interface (POSIX) is a family of standards that defines a set of APIs (Application Programming Interfaces) and conventions for building and interacting with operating systems.
POSIX is designed to enhance the portability of applications. Essentially, this standard defines what a Unix-like operating system is. Among various components, such as Error Codes and Inter-Process Communication standards, it includes a list of utilities that are familiar to us, such as cd, ls, mkdir, and many more. These utilities have shaped how people interact with operating systems using text for decades.
It appears that we are witnessing a resurgence of text-based interfaces in the form of LLMs. Technologies like ChatGPT plugins, Microsoft Copilot 365, and the recently updated Bard indicate that LLMs might serve as text-based interfaces for a range of services and applications. I'm wondering if we will eventually establish a set of standards to define the interaction between LLMs and extensions, similar to how POSIX standardized Unix-like systems in its time.
Several factors could contribute to the emergence of such standards. Some of them:
1. User Demands: In a competitive market with multiple chat-based services that support third-party plugins, having a set of standards would enable compatibility across platforms or easy switching between them.
2. Technology Maturity: As these interfaces become more mature, and their applications span various domains, standardization may naturally evolve. The absence of disruptive changes and widespread usage can lead to the establishment of these standards.
2023-09-29
At its latest event, Apple didn't mention AI even once, in contrast to its closest competitor, Google. While Apple refrained from making any AI announcements, the new generations of iPhone and Apple Watch boast more powerful Neural Engines. I believe Apple will take a different approach from what we currently observe in the market: instead of further enhancing its existing cloud-based Siri experience, it will shift toward on-device processing. This strategy aligns with Apple's strong stance on security and privacy, and we've already seen them testing on-device Siri processing with the Apple Watch. I'm curious about what they could offer with offline AI and have a few thoughts:
Context-aware on-device search: Imagine being able to search across all types of files, including images, documents, and videos, and retrieve information in any format simply by asking Siri.
Context-aware writing assistance: With training based on your email history, typing suggestions could become context-aware, offering email responses that align with your ongoing conversations.
Deeper integration with other applications: It would be fascinating to enable any app to leverage an API that creates a "skill" for the local assistant, much like how you can extend ChatGPT with extensions. This could potentially open up new niches for apps centered around AI interaction.
2023-09-28
Homeostasis is a self-regulating process that enables biological systems to maintain stability while adapting to changing environmental conditions. It describes how an organism can keep its internal environment relatively constant, allowing it to adapt and survive in a frequently challenging environment.
Homeostasis consists of several key components:
Receptor: As the name suggests, receptors detect changes in the external or internal surroundings. They initiate a cascade of reactions to uphold homeostasis.
Control Center: Also referred to as the integration center, the control center receives information from the receptors and processes it.
Effector: Effectors respond according to the instructions received from the control center, either reducing or enhancing the stimulus as needed.
The concept of homeostasis finds widespread application in software engineering across various domains and industries. Here are some notable examples:
Configuration as Code: Technologies like Kubernetes, Terraform, and CloudFormation adopt an approach where users declare the desired system state, and the system autonomously determines how to achieve and maintain it.
Elasticity: Systems can dynamically scale up or down in response to workload fluctuations, ensuring they can efficiently perform their tasks.
Self-Healing: Container orchestrators such as Kubernetes attempt to restart a malfunctioning service if it stops responding to health checks or exhibits unusual behavior.
The concept of homeostasis closely aligns with the idea of a desired state and a declarative approach to programming. A straightforward and widely used example is markup languages, where developers specify the desired page state, and the browser is responsible for rendering it as closely as possible to that desired state.
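The receptor / control center / effector triad maps neatly onto the reconciliation loops these systems run. A toy sketch in Python (all names, stub values, and the polling interval are invented for illustration):

import time

DESIRED_REPLICAS = 3  # the declared, desired state

def read_running_replicas() -> int:
    # Receptor: in a real system this would query the cluster; stubbed here
    return 2

def scale_to(count: int) -> None:
    # Effector: apply whatever change the control center decided on
    print(f"scaling to {count} replicas")

def control_loop() -> None:
    while True:
        observed = read_running_replicas()  # receptor detects the current state
        if observed != DESIRED_REPLICAS:    # control center compares it to the desired state
            scale_to(DESIRED_REPLICAS)      # effector corrects the deviation
        time.sleep(5)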
2023-09-27
A quality gate is a critical checkpoint in the software development lifecycle that assesses whether software meets specific criteria. Its primary goal is to identify and fix as many issues as possible before releasing the software. Quality gates may include, but are not limited to, the following checks:
Typical locations for implementing quality gates are:
Local Environment: Usually implemented with pre-commit hooks, this allows for early issue detection during code commit. Among other checks, it's an excellent place to enforce code style using tools like Prettier and validate naming conventions for the branch.
PR Validation: These checks duplicate those in the local environment, in case a developer skips pre-commit hooks using the --no-verify option. They also add PR-specific validations. For example, Azure can check if the associated work item was created or if a description was provided for the PR.
Main Branch Actions: This is the best place to run extensive integration and automation tests in addition to the previous checks. It ensures that the software continues to meet quality standards after merging into the main branch.
This setup works exceptionally well with temporary teams, such as contractors or outsourcers, to ensure that the codebase complies with defined standards.
2023-09-26
SQL is a good example of an abstraction that works in most cases (I assume the 80/20 rule is applicable here). But, like most abstractions, it cracks under pressure, and instead of writing readable, well-structured queries, developers find themselves writing dynamic SQL, tweaking indices, and investigating execution plans.
I think query-first data modeling, as used in Apache Cassandra, is more transparent than the model-first approach used in SQL:
2023-09-25
I didn't realize until now that there are so many ways to do effort estimation in software engineering. Some of them:
2023-09-24
When I first started using Go, it took me some time to become familiar with its error handling approach. In comparison to the traditional control flow approach, where exceptions are handled in a separate block (try/catch), modern languages like Go and Rust use a different approach called explicit error handling. In this approach, the error is one of the return values, and the developer is expected to check and handle it right away.
For example, in Go:
value, err := someFunction()
if err != nil {
// handle the error
}
In Rust, you would use Result<T, E>, an enum with two variants: Ok(value) and Err(error):
match some_function() {
Ok(value) => {
// use the value
},
Err(e) => {
// handle the error
},
}
Both languages also have a similar concept called "panic" mode, which represents unrecoverable errors that should interrupt the execution.
Explicit error handling arguably helps developers write more readable error-handling code because it sits right next to the place where the error occurs, as opposed to the control flow approach, where the catch block might be in a separate function or buried deep within many other function calls that might produce the error, or, even worse, where the error is hidden by a generic exception.
2023-09-23
"Red Ocean, Blue Ocean" is a concept from the business strategy book "Blue Ocean Strategy" by W. Chan Kim and RenƩe Mauborgne that aims to help companies grow under different market conditions and adjust their actions according to the market's "temperature":
2023-09-22
In his book "The Paradox of Choice," Barry Schwartz mentioned that having too many choices could lead to less satisfaction and greater regret. The paradox is related to the following characteristics:
Choice Overload: Studies have shown that once a certain threshold is reached, there is a decrease in interest. This is particularly evident in retail, where stores know how many different brands are enough to keep customers interested but not too many to overwhelm them.
Escalation of Expectations: The more choices you have, the more you tend to believe that there must be "the best one" among them.
Regret and Opportunity Costs: This is closely tied to the previous point; people tend to experience more regret when they have to choose between various options.
How to cope with the Paradox Of Choice? Schwartz divides decision-makers into "maximizers" - those who constantly seek the best possible option, and "satisficers" - those who are content with a good enough option. By proactively limiting the number of choices to those that have been proven to be "good enough," you can reduce decision anxiety for a significant portion of consumer-based choices.
2023-09-21
PocketBase, an open-source backend, made an interesting choice for its persistent storage: SQLite. In the industry, SQLite is usually used for simple client-side storage because it's lightweight, portable, and doesn't need much setup. However, it lacks some features, like RIGHT and FULL OUTER JOINs, and doesn't support multiple concurrent writers or user management. This choice raises some questions.
Here's their response:
PocketBase uses embedded SQLite (in WAL mode) and has no plans to support other databases. For most queries, SQLite (in WAL mode) performs better than traditional databases like MySQL, MariaDB, or PostgreSQL, especially for read operations. If you need replication and disaster recovery, you can consider using Litestream.
Basically, they're writing into a WAL (Write-Ahead Logging) instead of using a rollback journal to address some critical issues with SQLite:
Increased Concurrency: Using WAL allows multiple readers and a writer to work at the same time. Readers don't block writers, and writers don't block readers, so reading and writing can happen at the same time.
Improved Performance: Writing a commit to the end of the WAL file is faster than writing to the middle of the main database file. This makes transactions faster with WAL.
Crash Recovery: It's more robust if there's a crash. Changes are first written to the WAL file and then transferred to the database file, reducing the risk of database corruption.
Disk I/O Reduction: With WAL, disk operations are more sequential, which reduces the time spent on disk seek operations.
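Enabling WAL is a one-line pragma. For example, from Python's standard sqlite3 module:

import sqlite3

conn = sqlite3.connect("app.db")
# Switch the journal mode; SQLite replies with the mode now in effect
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # -> "wal"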
2023-09-20
Hofstadter's Law addresses the common problem of accurately estimating the time it will take to complete a task.
"It always takes longer than you expect, even when you take into account Hofstadter's Law."
Bill Gates' interpretation of this law especially resonates with me:
"Most people overestimate what they can do in one year and underestimate what they can do in ten years."
While planning something, I try to take this law into account and be more humble in my short-term goals, and more ambitious with long-term ones.
2023-09-19
Not Invented Here (NIH) syndrome is a tendency to build in-house software instead of utilizing existing options. In its simplest form, it's a constant need to reinvent the wheel. Here are some notable examples:
Netscape Navigator: Netscape decided to rewrite its entire codebase for Netscape Navigator 5.0, believing that starting from scratch would enable them to leapfrog the competition. Unfortunately, the project took much longer than expected, and by the time Netscape 6.0 (5.0 was skipped altogether) was released in 2000, Internet Explorer had taken over the browser market. Netscape's market share never recovered.
Digg v4: Social news aggregator Digg decided to rewrite its entire codebase for version 4, moving away from MySQL and Memcache to Cassandra. The move was not well-received by users, and numerous bugs and performance issues led to a mass exodus to competitors like Reddit. The company's value plummeted, and they were eventually sold for a fraction of their peak value.
Rewriting Quake by id Software: John Carmack, a co-founder of id Software, decided to rewrite the Quake game engine from scratch in C++, moving away from C. The rewrite ended up taking much longer than anticipated and led to numerous bugs and stability issues, damaging the game's reputation.
Friendster: One of the first social networking sites, Friendster, faced scalability issues as more users joined. Instead of improving and optimizing their existing platform, they decided to rewrite the entire codebase. The result was a buggy, slow platform that frustrated users and led to a rapid decline in the user base.
HealthCare.gov: When the U.S. government launched HealthCare.gov in 2013, it was a disaster due to numerous technical issues. Despite the government's massive resources, the site suffered from poor performance and frequent crashes. A key reason for the site's issues was that the government insisted on custom-building much of the site's functionality rather than using proven existing solutions.
2023-09-18
I was working in StackBlitz and thinking about the potential future of Cloud Integrated Development Environments (IDEs). While they provide a convenient way to quickly set up configured development environments, I believe they won't entirely replace on-device development environments for everyday engineering tasks. They may, however, find their niche in replacing Virtual Desktop Environment (VDE) software, especially in situations where contractual restrictions prevent storing codebases locally, as is often the case for consulting companies.
Here are some reasons why I think local development environments will continue to dominate:
Freedom of Tooling Choice: In your local environment, you have the freedom to select and customize tools and plugins that align perfectly with your workflow. You can even use proprietary tools if necessary, which can be challenging to integrate into online IDEs.
Databases: While it's possible to create a development database in the cloud for certain use cases, having a local database can be indispensable. Whether you need it for testing migrations or simply for experimenting with data, a local environment offers greater flexibility. GitHub Codespaces does allow the use of Docker images, but this can add to your bill, leading to the next point.
Pricing: Cloud IDEs often come with a price tag. GitHub Codespaces, for instance, bills on a per-core, per-hour basis, while CodeSandbox charges $15 per month per editor. AWS Cloud9's pricing is tied to the underlying EC2 instance usage. Paying a monthly fee for a tool that offers a subset of the capabilities available in your local environment may not be cost-effective for many developers.
2023-09-17
Tokenization is the process of breaking down text into components, known as tokens. Each token might represent a word, a subword, or a phrase. This process makes the data more manageable and suitable for various NLP tasks (text mining, ML, or text analysis). Let's take a look at the BERT-like model tokenization process:
Example:
"Hello how are U today?" - Input
|
v
"hello how are u today?" - Case normalization
|
v
["hello", "how", "are", "u", "td", "##ay", "?"] - Subword tokenization
|
v
["CLS", "hello", "how", "are", "u", "td","##ay","?","SEP"] - Assigning special tokens
Here, today has been split into td and ##ay. This technique is known as Subword Tokenization, often used in models like BERT to handle out-of-vocabulary words.
Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". Both "annoying" and "ly" as stand-alone subwords would appear more frequently, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". This is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. (Source: the Hugging Face documentation)
Special tokens are used for classification in BERT-like models. The CLS token represents the entire context of the input for tasks like classification, while the SEP token separates different sentences or contexts within the same text.
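You can reproduce a pipeline like this with the Hugging Face transformers library (a minimal sketch; the exact subword splits depend on the model's vocabulary, so a common word like "today" may well survive intact):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Case normalization + subword tokenization
print(tokenizer.tokenize("Hello how are U today?"))

# Encoding also assigns the special tokens
encoded = tokenizer("Hello how are U today?")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # [CLS] ... [SEP]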
2023-09-16
Expanding on my previous post, I firmly believe in the concept: if it's not deployed, it doesn't exist. This statement was, I think, first put forward by @jasonfried.
Throughout my career, I've grappled with a simple yet profound realization - ideas that haven't been brought to life, even in the form of a Minimum Viable Product (MVP), don't actually exist. Whether it's a modification required for a project, a new hobby I want to explore, or a fresh product idea, there needs to be a tangible version of it. This allows us to evaluate whether it's worth investing more time into. If it's not, then it should be discarded.
2023-09-15
I've been using Angular for a couple of years now, and one pattern that keeps repeating when I dive into a codebase is how quickly and easily application logic becomes unreadable with the reactive extensions library, or, to be more specific, NgRx. While it offers a large set of tools for managing asynchronous data that arguably no other stream-processing framework can match, it creates a callback hell that makes most of the logic far more complicated to comprehend than it would be with an imperative approach. I still think it's too much overhead for most client-side applications, and this is exactly why I don't use it in my projects.
2023-09-14
According to Product-Led Growth, the main value of a startup is validated learning: the intelligence a company gains after shipping products, which can be used to adjust the company's strategy for further advantage.
I tend to extend this idea to the individual scale, where I see a specific person as a company in itself. In some ways, a person also uses resources to achieve their objectives, operating in a hopefully fair market. And during this process, the person earns valuable insights, whether from wins or setbacks.
It's surprisingly interesting to look at personal experiences through these lenses:
2023-09-13
There are plenty of pre-trained models available that could cover most of your needs. However, there are situations when a base model needs to be further calibrated in order to serve well. Some examples:
2023-09-12
When working with NLP, one interesting tool that might help you with a wide range of tasks is NLTK. Some of its capabilities:
One extremely useful feature that has saved me a good portion of time is its large collection of text corpora for training your models.
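For example, loading one of the bundled corpora takes just a few lines (a small sketch; the corpus is downloaded once on first use):

import nltk
nltk.download("movie_reviews")  # one-time download
from nltk.corpus import movie_reviews

# 2,000 movie reviews pre-labeled as "pos" or "neg": ready-made training data
print(len(movie_reviews.fileids()))  # -> 2000
print(movie_reviews.categories())    # -> ['neg', 'pos']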
2023-09-11
I'm a big fan of Obsidian.md. I have been managing my notes there for a good couple of years now, and when I started my blog, I knew this would be my editing tool of choice for the following reasons:
My blog is hosted on Vercel and backed by persistent Redis, but I haven't found a straightforward way to integrate my favorite tools together, so it's time to reinvent the wheel.
My publications, along with media, live in the 'posts' folder, following the naming convention yyyy-mm-dd-title.md. On the other hand, my blog retrieves the list of articles from an index record inside the database; it's just a list of posts along with metadata. One interesting field is checksum, which I use to determine if a post has been edited. To keep my posts in sync with the database, I created a script that does the following:
The key of the actual post is its name, so I don't need the index for directly accessing a particular publication. All images are embedded into a post as base64. This approach does the job for my small website, but storing images on a CDN would be preferable.
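A condensed sketch of the sync idea in Python (every key name, helper, and connection detail here is hypothetical, reconstructed from the description above):

import hashlib
import json
import pathlib

import redis  # redis-py

r = redis.Redis()  # connection settings omitted
POSTS = pathlib.Path("posts")

def sync() -> None:
    # The index record: a list of {key, checksum} entries for all known posts
    index = json.loads(r.get("index") or "[]")
    known = {entry["key"]: entry["checksum"] for entry in index}
    for path in POSTS.glob("*.md"):
        checksum = hashlib.sha256(path.read_bytes()).hexdigest()
        if known.get(path.stem) != checksum:
            # The post's key is simply its file name
            r.set(path.stem, path.read_text())
            known[path.stem] = checksum
    r.set("index", json.dumps([{"key": k, "checksum": c} for k, c in known.items()]))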
My vault is stored on GitHub, and the script is deployed as an Action. I have a git integration plugin installed for the vault, so publishing a post is basically a commit.
While it's not an ideal solution, it works for me, allowing me to use the tools I love and making the process a bit more enjoyable.
2023-09-10
Bayes' theorem is a fundamental block of probability theory and allows us, in simple terms, to express and update our beliefs given new information.
The formula:
P(A|B) = P(B|A) * P(A) / P(B)
One interesting application of this theorem is sentiment analysis. That specific use case of Bayes' theorem is called the Naive Bayes method. Let's find out why.
For this problem, I need a dataset: for example, IMDB 50K, which I mentioned in my previous post, or you can create your own using text samples with sentiment labels such as "positive," "negative," or "neutral".
The algorithm:
For the calculation, we use a simplification called bag of words, where we basically assume that all words in the sentence are independent and their only feature is frequency:
And the same for the rest of the sentiment labels.
The bag of words simplification is exactly why this method is called "naive." It might seem like a shallow assumption, but it turns out to be extremely efficient in practice.
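A compact version of this with scikit-learn's CountVectorizer (the bag of words) and MultinomialNB (toy data for illustration; swap in a real dataset like IMDB 50K):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great movie, loved it", "awful plot, boring", "fantastic acting", "terrible and dull"]
labels = ["positive", "negative", "positive", "negative"]

vectorizer = CountVectorizer()  # bag of words: word frequencies only
X = vectorizer.fit_transform(texts)

model = MultinomialNB()
model.fit(X, labels)
print(model.predict(vectorizer.transform(["boring but fantastic acting"])))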
2023-09-09
I recently began diving into Machine Learning. My plan is to start with a hands-on approach and dig into theory on demand. After some research and chatting with GPT, I selected the Sentiment Analysis problem and the Naive Bayes algorithm as the first ones to try, for the following reasons:
2023-09-08
A hot discussion about dropping TypeScript support has been happening in recent days, and it seems like a major shift in the field. But is it?
Some thoughts:
Overall, it's a good opportunity to take a step back and critically assess whether the tools we are using are still doing a good job.
2023-09-07
In his blog, Andy Matuschak coined the concept of Evergreen notes: a note-taking approach aimed at keeping information structured and promoting insight. It is based on the "Zettelkasten" method, originally invented by Niklas Luhmann and described in "How to Take Smart Notes" by Sƶnke Ahrens.
In the Zettelkasten system, notes are stored as atomic pieces of information, connected with links. The original version was a simple box with cards, physically sorted. That system allowed linking and structuring information in various ways, long before databases were invented.
Evergreen notes take this idea further, leveraging note-taking software such as Obsidian, adding the following adjustments to the predecessor:
This structure allows maintaining a quite impressive amount of knowledge; the knowledge graph of Obsidian's CEO, Steph Ango, where each dot represents a note, is a good example.
2023-09-06
Expected Value is used in statistics, gambling, and other domains to measure the expected profitability of a given decision. It is calculated by multiplying each possible outcome by the probability of that outcome occurring, then summing these results:
EV = ∑ [P(x) * x]
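A quick worked example: a gamble that pays $100 with probability 0.3 and nothing otherwise has EV = 0.3 * 100 + 0.7 * 0 = $30, so paying more than $30 to play is a losing proposition on average.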
Suppose a person wants to buy a car and is trying to choose between a Toyota Camry and a Honda Civic. Some common metrics to consider (the list is shortened for simplicity): initial price, operating costs, resale value, and reliability. The comparison below is just an example; it can be extended with other metrics that might impact a decision. For instance, you could weight your decision with customer satisfaction or personal preferences.
| Factor | Honda Civic | Toyota Camry |
| --- | --- | --- |
| Initial Price | $15,000 | $18,000 |
| Operating Costs (5 years) | $14,700 | $16,400 |
| Resale Value (after 5 years) | -$9,000 | -$10,800 |
| Reliability | High (Hondas are known for their reliability) | High (Toyotas are known for their reliability) |
| Total Cost over 5 years | $20,700 | $23,600 |
2023-09-05
I've been considering starting a personal blog for some time now. However, with the recent advancements in AI, I've started to question whether it's worthwhile to undertake such a project in 2023. Despite my doubts, I've decided that the answer is yes, for the following reasons:
2023-09-04
In the book "Thinking in Systems," Donella H. Meadows defines a system as an interconnected set of elements that is coherently organized in a way that achieves something. A system is more than the sum of its parts.
A system consists of three main parts:
Examples of systems in daily life: