Sunday, April 15, 2012

Scrape Blu-Ray/DVD subtitles and create a .srt on Mac OS X

For a couple of years I have been trying to find a way to take a subtitle track from a DVD or Blu-Ray (these are usually stored as images rather than text data) and convert it into a text subtitle track that can then be inserted into an m4v or other video format. In my view there were only a few options, each with a problem:

1) D-Subtitler. - Problem: it's PowerPC only, not Intel.
2) Avidemux. - Problem: it doesn't use OCR text recognition, so almost every character has to be input manually. It also does a bad job with Blu-Ray subtitles (you have to scale them down via BDSup2Sub, and Avidemux has a hard time recognizing different glyphs of the same character).
3) Run VMware/Parallels in order to run a Windows program to OCR the subtitles into a .srt.

Yesterday I stumbled upon Subtitle Edit, made by Nikolaj Olsson. It is a very good (and open source) .NET app for OCRing subtitles. It uses either Tesseract or Microsoft Office Document Imaging for the OCR, and either Microsoft Word or Hunspell for spellchecking.

So I wondered how this would work with the Mono Framework on the Mac (Mono is designed to run .NET apps by Just-In-Time compiling the IL). I looked through the Subtitle Edit source code and found that it already had some flags for Mac/Linux, including how to execute Tesseract.

I tried Subtitle Edit with the Forrest Gump Blu-Ray, using the .sub file extracted with MakeMKV, and it did an AMAZING job (though there were some unrecognizable characters). Out of 1500 subtitle frames, I only had to correct a handful.

Here are the steps to get Subtitle Edit working on Mac OS X 10.7 (Lion; steps may vary for Snow Leopard, etc.):


  1. Install Xcode from the Mac App Store.
  2. Run from a terminal: sudo xcode-select -switch /Applications/Xcode.app (I needed this because MacPorts complained about xcodebuild failing).
  3. Install the Xcode Command Line tools (previously known as the UNIX and System tools) by going to the Xcode menu -> Preferences -> Downloads -> Command Line Tools
  4. Install the Mono Framework for the Mac (2.10.9 was the stable version at the time of this post; I installed the Mono SDK, but I believe the runtime version should be fine).
  5. Install MacPorts (for installing additional open source packages -- if you have another package source you should be fine).
  6. Run from a terminal: sudo port selfupdate (To update MacPorts to the latest version).
  7. Run from a terminal: sudo port install tesseract (The character recognition software).
  8. Download Subtitle Edit (version 3.2.7 was the current release at the time of this post).
  9. Subtitle Edit expects the tesseract data folder to be in a certain location, with the proper config files for Subtitle Edit, so we have to create the proper config paths (steps 10 through 13).
  10. Run from a terminal: sudo mkdir /usr/local/share/tesseract
  11. Run from a terminal: sudo ln -s /opt/local/share/tessdata /usr/local/share/tesseract/tessdata
  12. Run from a terminal: sudo cp PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER/Tesseract/tessdata/eng.traineddata /opt/local/share/tessdata
  13. Run from a terminal: sudo cp PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER/Tesseract/tessdata/configs/hocr /opt/local/share/tessdata/configs
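Steps 10 through 13 can also be collected into one small shell function. This is only a sketch, not something shipped with Subtitle Edit: the function name setup_tessdata and its three arguments are made up for this example, and the defaults are simply the paths used in the steps above. It has to be run with enough privileges to write to those paths.

```shell
# Sketch of steps 10-13. The function name and arguments are hypothetical:
#   $1 = Subtitle Edit folder
#   $2 = MacPorts tessdata dir (default taken from the steps above)
#   $3 = dir Subtitle Edit looks in (default taken from the steps above)
setup_tessdata() {
    se_dir=$1
    port_tessdata=${2:-/opt/local/share/tessdata}
    se_share=${3:-/usr/local/share/tesseract}

    # Fail early if the files to copy are not where we expect them.
    for f in "$se_dir/Tesseract/tessdata/eng.traineddata" \
             "$se_dir/Tesseract/tessdata/configs/hocr"; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
    done

    # Step 10 (plus making sure the configs dir exists).
    mkdir -p "$se_share" "$port_tessdata/configs" || return 1

    # Step 11: create the link only if nothing is there yet,
    # so running the function a second time is harmless.
    [ -e "$se_share/tessdata" ] || [ -L "$se_share/tessdata" ] ||
        ln -s "$port_tessdata" "$se_share/tessdata" || return 1

    # Steps 12 and 13.
    cp "$se_dir/Tesseract/tessdata/eng.traineddata" "$port_tessdata/" || return 1
    cp "$se_dir/Tesseract/tessdata/configs/hocr" "$port_tessdata/configs/" || return 1
}
```

Sourced into a root shell, setup_tessdata PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER would then use the default paths. Passing the second and third arguments lets you try it against a scratch directory first.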
That's it! All you have to do now is run SubtitleEdit.exe by opening up a terminal and typing:

mono PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER/SubtitleEdit.exe

The first launch may take some time as the Mono framework is JITing the IL. Once you see the main window for Subtitle Edit pop up, you should be able to open up your sup/sub file and go through the conversion process (the screens should look very similar to the ones on the Subtitle Edit website).
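If the launch fails, it helps to know whether the prerequisites are actually reachable from the terminal. Below is a minimal sketch of a launcher that checks first. The function names missing_tools and run_subtitle_edit are invented for this example, and it assumes that mono and tesseract need to be on your PATH (MacPorts normally puts /opt/local/bin there).

```shell
# Hypothetical helper: print every given command that is NOT on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Hypothetical launcher: $1 is the Subtitle Edit folder.
run_subtitle_edit() {
    missing=$(missing_tools mono tesseract)
    if [ -n "$missing" ]; then
        echo "not found on PATH: $missing" >&2
        return 1
    fi
    if [ ! -f "$1/SubtitleEdit.exe" ]; then
        echo "no SubtitleEdit.exe in $1" >&2
        return 1
    fi
    # Same command as above, just with the folder passed in.
    mono "$1/SubtitleEdit.exe"
}
```

Usage would be run_subtitle_edit PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER. It either starts Subtitle Edit or tells you which piece is missing.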


Here are some useful tools I used in the process:
- LG Blu-Ray Drive from Microcenter (Blu-Ray+ROM/DVD+ROM was $65.00 and Blu-Ray ROM/DVD+RW was $79.00)
- MakeMKV - Extracts Blu-Ray video/audio to an MKV (Matroska Media Container) file.
- MKVTools - To extract the subtitle track from the MKV file to a .sup file.
- Subtitle Edit - To convert the subtitle 'images' to a .srt text file.
- Handbrake - To convert the MKV to m4v so it is playable on the AppleTV/iPad/iPod/iPhone/Xbox 360. (At this point Handbrake can import the .srt text subtitle track, or you can OCR and correct the subtitle text while converting and use Subler to later insert the .srt into the m4v file).
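Before handing the .srt to Handbrake or Subler, a quick sanity check is to count how many subtitle entries ended up in the file and compare that with the number of frames Subtitle Edit reported. A .srt is plain text in which each entry has a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, so counting those lines is enough. The function name count_srt_cues is made up for this sketch, which is not part of any of the tools above.

```shell
# Hypothetical helper: count the entries in an .srt by counting its
# timing lines ("00:00:01,000 --> 00:00:04,000").
count_srt_cues() {
    grep -c -E '^[0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3} --> [0-9]{2}:[0-9]{2}:[0-9]{2},[0-9]{3}' "$1"
}
```

For example, count_srt_cues movie.srt prints a single number. If it is far off from what you expected, some frames were probably skipped or merged during OCR.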

13 comments:

  1. Nice :)

    Do you think it possible to get "mplayer" working as video player?

    Replies
    1. Maybe, maybe not. To get that to work I may have to rewrite the Windows Forms UI with Interface Builder for the Mac. It would also be nice to package SE as an actual Mac .app file on your site, and to get hunspell or OS X native spellchecking working, but all of that would take some work. I'll see how much time I have on my hands to pitch in.

  2. very interesting... just one question: did you get the source code version or the installer version? since in the second case I don't see the way you can obtain the .exe under OSX...

    Replies
    1. The binary zip file. The source code is not needed. Mono is able to run .NET-only executables on Mac OS X or Linux.

  3. Doesn't seem to work. When I run mono PATH_TO_YOUR_SUBTITLE_EDIT_FOLDER/SubtitleEdit.exe I get an exception about some font family not being available :(

    Replies
    1. What OS are you running? I don't think I've tried on 10.8 yet, but I should check up on that. I will update this post if there are any changes needed. Do you have the exact exception?

    2. I just verified. I'm running the latest stable Mono, MacOS X 10.8.2, and the latest version of MacPorts with tesseract. I'm having no issue currently.

    3. You need to install X11 in order to fix the font problem. Open /Applications/Utilities/X11 and follow the instructions.

  4. Hi Ryan. I've been using your instructions to run SE in Mono and it's working well. --It would be nice to have spellchecking with hunspell, though.

    --I wonder if it's possible to get hunspell to work with the program in OS X. I have tried sym links in the SE directory to different locations where libhunspell is, but no luck so far. Do you have any suggestions? I'd appreciate any hints if you have time. Thanks.

    Replies
    1. Same hunspell problem here using version 3.3.4.
      Ryan, could you please check whether you have a /usr/share/hunspell or /usr/local/share/hunspell directory and give us the file list for proper linking? Thanks!

    2. Thanks for the instructions on how to get SE working on OS X!

      ...but I too am having problems with SE not being able to use hunspell.

    3. I'm not currently using hunspell. It looks like SE doesn't launch hunspell directly but tries to load libhunspell at runtime.

  5. I've been able to make a .app of Subtitle Edit with Wine Bottler. Great app! Unfortunately it gives a much larger file: 500 MB...
    In case you want to try it, here is a link:
    https://www.dropbox.com/s/2us68lqjmqff7zo/SubtitleEdit_3.3.11_rev2285_mac_osx.zip
    I'm testing it now to convert DVD subs to srt files, but besides that, I have not tested it a lot... So it may not run perfectly.


    If you want to try out Wine Bottler :
    http://winebottler.kronenberg.org/


Ryan Kuhn

★★★★★

Expertise