BinaryWorks.it Official Forum

IMDb API & Web Scraping

Since the script will probably need reworking again and again, or may not deliver the desired results, I suggest two other approaches for getting IMDb data:

[b][gold]1. API (fast, but restricted or pricey)[/gold][/b]

For example, you can register for free on [beige]imdb-api.com[/beige] [teal](1)[/teal] and use their REST API at no cost for up to 100 calls per day. A further restriction is that there is no method for getting alternate titles, and some other information may not be available either.

However, since most information and even complex searches can be queried, it would be easy to build a small app around it, especially as a C# library, [beige]IMDbApiLib[/beige] [teal](2)[/teal], already exists. Furthermore, the community could pool its money for a single paid account that everyone then uses, which would cut the cost per user dramatically.

Links:
- [teal](1)[/teal]: https://imdb-api.com 
- [teal](2)[/teal]: https://www.nuget.org/packages/IMDbApiLib
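
If you would rather not depend on the library, a plain HTTP request is enough for a first test. The following is only a minimal sketch: the class [beige]IMDbApiClient[/beige] and its methods are made-up names, and the URL pattern is an assumption that must be checked against the service's own documentation before use.

[limegreen][code]using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace MyNameSpace {
  // Hypothetical helper, not part of IMDbApiLib
  public static class IMDbApiClient {
    private static readonly HttpClient client = new HttpClient();

    // ASSUMPTION: the URL pattern below is not taken from this post; verify it in the API documentation
    public static Uri BuildTitleUri(string apiKey, string imdbID) {
      return new Uri(String.Format("https://imdb-api.com/en/API/Title/{0}/{1}",
                                   Uri.EscapeDataString(apiKey),
                                   Uri.EscapeDataString(imdbID)));
    }

    // Returns the raw JSON response as a string; throws on a non-success HTTP status
    public static async Task<string> GetTitleJsonAsync(string apiKey, string imdbID) {
      using (HttpResponseMessage response = await client.GetAsync(BuildTitleUri(apiKey, imdbID))) {
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
      }
    }
  }
}[/code][/limegreen]

With only 100 free calls per day, it makes sense to cache every JSON response locally instead of requesting the same title twice.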

[b][gold]2. Web Scraping (slow, but almost all public data can be captured)[/gold][/b]

Using Visual Studio 2019 with C# and .NET (e.g. .NET Standard 2.0 or .NET Framework 4.7.2), you can also scrape the data that IMDb web pages only display after you click on "more". For this, you need the NuGet packages [beige]Selenium.WebDriver[/beige] [teal](1)[/teal] and [beige]Selenium.WebDriver.ChromeDriver[/beige] [teal](2)[/teal]. In addition, [beige]HtmlAgilityPack[/beige] [teal](3)[/teal] is useful for parsing the HTML document and its nodes.

Links:
- [teal](1)[/teal]: https://www.nuget.org/packages/Selenium.WebDriver
- [teal](2)[/teal]: https://www.nuget.org/packages/Selenium.WebDriver.ChromeDriver
- [teal](3)[/teal]: https://www.nuget.org/packages/HtmlAgilityPack

To execute the clicks on the "more" buttons, you need your own (extension) method:

[limegreen][code]using OpenQA.Selenium;
using System;
using System.Threading;

namespace MyNameSpace {
  public static partial class Extensions {
    #region --- safe click ----------------------------------------------------------------
    // Retries element.Click() until it succeeds; each failed attempt adds
    // intervalInMilliseconds to a counter, and the method gives up silently
    // once that counter reaches timeoutInMilliseconds.
    public static void SafeClick(this IWebElement element, int intervalInMilliseconds = 25, int timeoutInMilliseconds = 200) {
      int counter = 0;
      while (counter < timeoutInMilliseconds) {
        try {
          Thread.Sleep(intervalInMilliseconds); // short pause before every attempt
          element.Click();
          return; // click succeeded
        } catch (Exception) {
          counter += intervalInMilliseconds; // click failed: count the wait and retry
        }
      }
    }
    #endregion
  }
}[/code][/limegreen]

Working code example that scrapes an IMDb page and parses some of its content:

[limegreen][code]using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using System;
using System.Collections.Generic;
using System.ComponentModel; // needed for the [Description] attribute on the enum

namespace MyNameSpace {
  public static class IMDbScraper {
    public static HtmlDocument ScrapeIMDbPage(string imdbID) {
      // --- create Selenium service and driver ----------------------------------------------------------
      ChromeDriverService driverService = ChromeDriverService.CreateDefaultService();
      driverService.HideCommandPromptWindow = true;

      ChromeOptions chromeOptions = new ChromeOptions();
      chromeOptions.AddArguments(
        "--blink-settings=imagesEnabled=false"
      );
      chromeOptions.AddUserProfilePreference("profile.managed_default_content_settings.images", 2);
      chromeOptions.AddUserProfilePreference("profile.default_content_setting_values.images", 2);
      IWebDriver driver = new ChromeDriver(driverService, chromeOptions);

      // --- call url in Selenium browser ----------------------------------------------------------------
      string url = String.Format("https://www.imdb.com/title/{0}/companycredits/", imdbID); // e.g. sub page companycredits
      driver.Navigate().GoToUrl(url);

      // --- find and click on any "more" buttons --------------------------------------------------------
      IReadOnlyCollection<IWebElement> elements = driver.FindElements(By.ClassName("ipc-see-more__text"));

      if (elements != null) {
        IJavaScriptExecutor javaScript = (IJavaScriptExecutor)driver;
        bool doIt = true; // toggle: after a click, the next match is skipped
        foreach (IWebElement element in elements) {
          if (doIt) {
            try {
              if (element.Location.Y > 100) {
                string script = String.Format("window.scrollTo({0}, {1})", 0, element.Location.Y - 200);
                javaScript.ExecuteScript(script); // execute JavaScript to scroll to the button
              }

              element.SafeClick(); // element.Click() is buggy and crashes, therefore we use our extension method
              doIt = false;
            } catch { }
          } else {
            doIt = true;
          }
        }
      }

      // --- get body as HtmlDocument for further HtmlAgilityPack parsing --------------------------------
      HtmlDocument result = new HtmlDocument();
      result.LoadHtml(driver.FindElement(By.XPath(@"//body")).GetAttribute("innerHTML"));

      driver.Quit(); // close the browser and stop the chromedriver process

      return result;
    }

    public static List<Company> ParseProductionCompanies(HtmlDocument document) {
      List<Company> result = new List<Company>(); // own class: see below

      // --- get node of desired section -----------------------------------------------------------------
      string path = @"//div[@data-testid=""sub-section-production""]";
      HtmlNode node = document.DocumentNode.SelectSingleNode(path);

      // --- parse content -------------------------------------------------------------------------------
      if (node != null) {
        try {
          foreach (HtmlNode entry in node.ChildNodes[0].ChildNodes) {
            if (entry.Name != "li") {
              continue;
            }

            Company company = new Company() {
              Sphere = Sphere.Production // own Enum: see below
            };

            try {
              company.ID = entry.ChildNodes[0]
                                .Attributes["href"]
                                .Value
                                .GetSubstringBetweenStrings("/company/", "?ref"); // own extension method for string: create it yourself ;)

              company.Name = entry.ChildNodes[0].InnerText;
            } catch { }

            try {
              HtmlNode details = entry.ChildNodes[1]
                                      .ChildNodes[0]
                                      .ChildNodes[0];

              try {
                company.Remark = details.ChildNodes[0].InnerText;
              } catch { }
            } catch { }

            result.Add(company);
          }
        } catch { }
      }

      return result;
    }
  }

  public class Company {
    public string Country { get; set; }
    public string ID      { get; set; }
    public string Name    { get; set; }
    public string Remark  { get; set; }
    public Sphere Sphere  { get; set; }
    public int    Year    { get; set; }
  }

  public enum Sphere {
    [Description("Distributor")]
    Distributor,

    [Description("Other")]
    Other,

    [Description("Production")]
    Production,
    
    [Description("Special Effects")]
    SpecialEffects
  };
}[/code][/limegreen]
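
The parser calls [beige]GetSubstringBetweenStrings[/beige], which is left for you to write. Here is one possible sketch. The behaviour is my assumption, not a given: it returns the text between the first occurrence of the start marker and the next occurrence of the end marker, and an empty string if either marker is missing.

[limegreen][code]using System;

namespace MyNameSpace {
  public static partial class Extensions {
    // Hypothetical implementation; adapt the "not found" behaviour to your needs
    public static string GetSubstringBetweenStrings(this string source, string start, string end) {
      if (String.IsNullOrEmpty(source) || String.IsNullOrEmpty(start) || String.IsNullOrEmpty(end)) {
        return String.Empty;
      }

      // position right after the first occurrence of the start marker
      int from = source.IndexOf(start, StringComparison.Ordinal);
      if (from < 0) {
        return String.Empty;
      }
      from += start.Length;

      // first occurrence of the end marker behind the start marker
      int to = source.IndexOf(end, from, StringComparison.Ordinal);
      if (to < 0) {
        return String.Empty;
      }

      return source.Substring(from, to - from);
    }
  }
}[/code][/limegreen]

For a link such as "/company/co0023400?ref_=abc" (a made-up example value), the call with "/company/" and "?ref" yields "co0023400".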

Then you can use the parsed content for further processing, e.g. save it to your own database or export it to a file in your desired format (CSV, JSON, Text, XML).
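
As an illustration of the export step, here is a minimal CSV sketch. It reuses the [beige]Company[/beige] class from the code above; the class [beige]CompanyExporter[/beige], the chosen columns and the file name are my own assumptions.

[limegreen][code]using System;
using System.Collections.Generic;
using System.Text;

namespace MyNameSpace {
  // Hypothetical helper; relies on the Company class and Sphere enum defined above
  public static class CompanyExporter {
    // Wraps a field in quotes and doubles embedded quotes, so commas inside names are safe
    private static string Escape(string value) {
      return "\"" + (value ?? String.Empty).Replace("\"", "\"\"") + "\"";
    }

    // Builds one header line plus one CSV line per company
    public static string ToCsv(IEnumerable<Company> companies) {
      StringBuilder csv = new StringBuilder();
      csv.AppendLine("ID,Name,Remark,Sphere");

      foreach (Company company in companies) {
        csv.AppendLine(String.Join(",",
                                   Escape(company.ID),
                                   Escape(company.Name),
                                   Escape(company.Remark),
                                   Escape(company.Sphere.ToString())));
      }

      return csv.ToString();
    }
  }
}[/code][/limegreen]

Possible usage, with tt0111161 only as an example ID:

[limegreen][code]HtmlDocument page = IMDbScraper.ScrapeIMDbPage("tt0111161");
List<Company> companies = IMDbScraper.ParseProductionCompanies(page);
System.IO.File.WriteAllText("companies.csv", CompanyExporter.ToCsv(companies), Encoding.UTF8);[/code][/limegreen]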
