## Brainstorm

- [x] Summary of COBRA
    - Summarize workflow of data, model, evaluation
    - Break up into 4 parts
        - [x] [[#Overview]]
        - Data ingestion, cleaning, and preparation
        - Model training, evaluation, and storage
        - Results

## Drafts

### Overview [Part 1 out of 4]

Studying for the AWS Certified Machine Learning exam reminds me of when I created an ML & MLOps feature for COBRA--my first big-boy job back in 2016--a project for the Department of Justice through Leidos. Although the terminology and technology have evolved since then, the vast majority of the principles have not changed.

To start, the platform I augmented was designed to help underfunded government agencies with the growing burden of civil litigation. More specifically, the platform gave lawyers the ability to sift through documents and label them as relevant or not relevant to a variety of court case matters. Normally it would take many people many months to go through all the exchanged documents, but recent rulings have allowed the use of statistical techniques to tag documents more quickly.

Going back to the AWS exam material, a typical ML workflow has the following steps, and so did the document classification feature I created:

1. Data Ingestion, Cleaning, and Storage
2. Model Training and Evaluation
3. Optimization

I'll get to each of these in the next set of posts.

### Data Ingestion, Cleaning, and Storage

Welcome to part 2 of this series on an MLOps system I created back before managed services popped up! Next up: Data Ingestion, Cleaning, and Storage.

**Background Info**

First up is the data you feed into your AI/ML models, whether for training, fine-tuning, or use in production. Regardless of the domain or stage of the model lifecycle, you'll want your data to be consistent; otherwise you'll lose confidence in your model's results. Every data science and AI/ML project needs data, and it needs to be good. This is especially true when working with text documents in NLP.
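To make that consistency point concrete, here's a toy sketch of the kind of deterministic text normalization I mean: the same raw text should always produce the same cleaned form, no matter when it is processed. This is an illustration, not code from the project; the function name and cleaning choices are mine.

```python
import re
import unicodedata

def normalize(text):
    """Deterministically clean raw text so repeated ingestions of the
    same document always yield the same representation."""
    # fold visually-identical Unicode variants (e.g. fancy quotes, ligatures)
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # collapse all runs of whitespace (newlines, tabs, doubles) to single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

# Two byte-different copies of the "same" sentence...
a = normalize("Exhibit  A:\n  The contract was SIGNED.")
b = normalize("exhibit a: the contract was signed.")
# ...end up identical, so downstream features stay consistent.
```

If cleaning like this isn't deterministic, two uploads of the same document can produce different features, and model scores stop being comparable across runs.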
The problem with working with text documents in machine learning is that computers only work with numbers. In order to have computers understand human language, it's necessary to create a consistent numeric representation of words: a term we now call word embeddings. I won't go into detail on the various document-to-vector techniques, but at the time in 2016, the word-embedding techniques of the early 2010s (TF-IDF, Word2Vec, GloVe, FastText) were falling out of fashion, and more contextually aware techniques were on the horizon (ELMo, BERT, GPT).

We had several major requirements at the time:

- flexible enough to accommodate future vector types
- free/cheap enough for under-funded government agencies to use

**Storage**

When I first joined the e-discovery project, Solr had already been chosen as the main workhorse for document-related data. For each electronic document ingested, I stored its raw form, as well as the result of every stage of data cleaning and feature engineering, as separate properties on the same Solr document. We had other storage services, but only Solr is relevant to this post. Given the requirements above, I settled on storing TF-IDF and Word2Vec vectors in Solr. Another nice feature I leveraged was Solr's dynamic schema and flexible properties.

**Ingestion**

It had already been decided that users would upload files to a network drive. To ingest those files more easily, we adopted Apache NiFi into our tech stack to orchestrate and visually monitor the flow and cleaning of text data.

**Cleaning**

Within Apache NiFi, units of work (called processors) ran on each document as it passed through each processing step. These processors handled various aspects of NLP cleaning, such as stemming, lemmatization, tokenization, and more.

### Model Training and Evaluation

For model training, I worked with a set of requirements similar to those of most applied ML applications:
- Different models/algorithms perform differently on different topics
- Each ML model can change its internal weights through tuning with additional training data
- Internal weights are probabilistic, which means training order matters
- Documents can be re-evaluated multiple times, whether or not they have ground truth
- More documents gain ground truth as experts review them for different document tags
- Experts can disagree on ground truth
- Updates to a model can be rolled back by loading an older set of internal weights

### Optimization
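That last requirement--rolling back a model by loading an older set of internal weights--can be sketched as a minimal version store. This is an illustrative Python design, not the actual COBRA implementation; the class and method names, and the `matter-42/logreg` key, are mine.

```python
class ModelWeightStore:
    """Keeps every saved set of weights per (matter, model) key so a bad
    tuning run can be rolled back to any earlier snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of immutable weight snapshots

    def save(self, key, weights):
        """Append a new snapshot and return its version number."""
        snapshots = self._versions.setdefault(key, [])
        snapshots.append(dict(weights))  # copy so callers can't mutate history
        return len(snapshots) - 1

    def load(self, key, version=-1):
        """Fetch a snapshot copy; version=-1 gives the latest."""
        return dict(self._versions[key][version])

    def rollback(self, key, version):
        """Re-save an older snapshot as the newest, preserving full history."""
        return self.save(key, self.load(key, version))

store = ModelWeightStore()
v0 = store.save("matter-42/logreg", {"w": [0.1, 0.2], "b": 0.0})
v1 = store.save("matter-42/logreg", {"w": [0.4, 0.1], "b": 0.3})
store.rollback("matter-42/logreg", v0)  # latest is now the v0 weights again
```

Because every tuning run appends rather than overwrites, training-order effects and expert disagreements can be undone without losing the audit trail.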