Saturday, July 18, 2015

Step by step process for extracting, formatting, and inserting text from any source to flashcards

I'm using flashcards for studying Spanish vocab. Each day I add 10 new flashcards by taking phrases from a Spanish phrase book. It takes me 5-10 minutes to do this every day, because I have to clean up the data, and format it correctly. This is rather discouraging, because I'd rather spend those 5-10 minutes actually studying the flashcards. 

So today I decided to take all of the phrases out the book, systematically clean it and format it, then bulk insert it. I now have 2 months of daily flashcards prepared, which means I can focus on the important part - studying the flashcards instead of making them.

Here's the step by step process I used. 

Step 1. Get the text from the source (ebook, website, pdf, etc..). 

My source was an ebook. I used a converter to turn it into a .txt file.

Step 2. Figure out the pattern in the text. 

This will be specific to your book. I use for testing regex, and Notepad++ find/replace with the regular expressions checked.

Here's the pattern in the book I'm using




<English>.<Spanish> {0 or more of these additional words}


<ignore everything until the next number>

<separator> is either <newline> or punctuation (?, ., !)

Sentences can have word1/word2 in them. These are often opposites, like inside/outside. These should be considered separate sentences. 

For example, I went outside/inside. This should be parsed into "I went outside" "I went inside"

Sentences can have (modifying word) in it. Remove modifiers.

Sentences can end with . . .. Remove these. 

Step 3. Cleanup and normalization using regex

  • Replace . . . with two<newline>s

find=\. \. \.


  • When there is punctuation immediately followed by a character, insert two <newline> between them



  • Remove words in parens ex: (word)



  • Remove parens split across multiple lines 

find = \(.+\r\n\r\n\)

replace = \r\n\r\n 

  •  Remove apostrophes because these mess with English pronunciation in

  • Remove the section headers. These are in ALL CAPS

find=\r\n([A-Z ]+)\r\n


  • Remove the phrase number. For example 344. English=Spanish, remove 344.



Get rid of commas, because that's the delimiter I use when importing into

Now parse out the definitions.  Right now the pattern is like this

<English><newline><newline><Spanish><newline><newline><Pronounciation>. I want <English>,<Spanish>

find=([\S /]+)\r\n\r\n([\S /]+)\r\n\r\n([\S /]+)


There are some irregularities, like multiple pronounications in a row. Do this:

1. While there are anomolies

   find=([\S /]+)\r\n\r\n([\S /]+)\r\n\r\n([\S /]+)

   replace= English=\1 Spanish=\2 Pronounce=\3

   Manually look through English= to see if it's wrong 

Now remove all newlines

Step 4 - Copy and paste into excel

Step 5 - break data into groups of 10. Here's the VBA macro in Excel:

Sub InsertDateRows()

Dim Date1 As Date

Date1 = DateValue("July 28, 2015")

Dim i As Integer

Dim TheEnd As Integer

TheEnd = MyData.UsedRange.Rows.Count * 2

For i = 1 To TheEnd Step 13



    Range("A" & i).Value = "Spanish " & Date1

    Date1 = DateAdd("d", 1, Date1)



End Sub

For some reason the For loop in VBA has a static condition. So when i say "Go from 1 to Count" it gets the count at the beginning of the loop, and never updates it! That's why i'm multiplying the count by 2, to make sure we actually loop over all the words 

Step 6- Insert each group of 10 into cram using import data 

No comments:

Post a Comment