BinaryWorks.it Official Forum
IMDb API & Web Scraping
Since the script will probably need to be reworked again and again, or may not deliver the desired results, I suggest two other approaches to scraping IMDb data:

[b][gold]1. API (fast, but restricted or pricey)[/gold][/b]

For example, you can register for free on [beige]imdb-api.com[/beige] [teal](1)[/teal] and use their REST API for up to 100 calls per day at no cost. A further restriction is that there is no method to get alternate titles, and some other information may not be available either. But since most information, including complex searches, can be queried, it would be easy to build a small app for it, especially as a [beige]C# IMDBApiLib[/beige] [teal](2)[/teal] already exists.

Furthermore, the whole community could share the cost of one paid account that everyone then uses, which would reduce the cost per user dramatically.

Links:
- [teal](1)[/teal]: https://imdb-api.com
- [teal](2)[/teal]: https://www.nuget.org/packages/IMDbApiLib

[b][gold]2. Web Scraping (slow, but almost all public data can be captured)[/gold][/b]

Using Visual Studio 2019 with C# .NET (e.g. Standard 2.0 or Framework 4.7.2), you can also scrape the data on those IMDb web pages that is only displayed after clicking on "more". For this, you need [beige]Selenium.WebDriver[/beige] [teal](1)[/teal] and [beige]Selenium.WebDriver.ChromeDriver[/beige] [teal](2)[/teal] as NuGet packages. In addition, the [beige]HtmlAgilityPack[/beige] [teal](3)[/teal] is useful for parsing the HTML document and its nodes.
Links:
- [teal](1)[/teal]: https://www.nuget.org/packages/Selenium.WebDriver
- [teal](2)[/teal]: https://www.nuget.org/packages/Selenium.WebDriver.ChromeDriver
- [teal](3)[/teal]: https://www.nuget.org/packages/HtmlAgilityPack

To execute the clicks on the "more" buttons, you need your own (extension) method:

[limegreen][code]using OpenQA.Selenium;
using System;
using System.Threading;

namespace MyNameSpace
{
    public static partial class Extensions
    {
        #region --- safe click ---

        // Retries Click() until it succeeds or the accumulated wait time reaches the timeout.
        // If the timeout is reached, the method gives up silently.
        public static void SafeClick(this IWebElement element, int intervalInMilliseconds = 25, int timeoutInMilliseconds = 200)
        {
            int counter = 0;

            while (counter < timeoutInMilliseconds)
            {
                try
                {
                    Thread.Sleep(TimeSpan.FromMilliseconds(intervalInMilliseconds));
                    element.Click();
                    return;
                }
                catch (Exception)
                {
                    counter += intervalInMilliseconds; // click failed: count the interval and retry
                }
            }
        }

        #endregion
    }
}[/code][/limegreen]

Working code example for scraping an IMDb page and parsing some of its content:

[limegreen][code]using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using System;
using System.Collections.Generic;
using System.ComponentModel; // needed for the [Description] attribute

namespace MyNameSpace
{
    public static class IMDbScraper
    {
        public static HtmlDocument ScrapeIMDbPage(string imdbID)
        {
            // --- create Selenium service and driver (image loading disabled) ---
            ChromeDriverService driverService = ChromeDriverService.CreateDefaultService();
            driverService.HideCommandPromptWindow = true;

            ChromeOptions chromeOptions = new ChromeOptions();
            chromeOptions.AddArguments("--blink-settings=imagesEnabled=false");
            chromeOptions.AddUserProfilePreference("profile.managed_default_content_settings.images", 2);
            chromeOptions.AddUserProfilePreference("profile.default_content_setting_values.images", 2);

            IWebDriver driver = new ChromeDriver(driverService, chromeOptions);

            try
            {
                // --- call url in Selenium browser ---
                string url = String.Format("https://www.imdb.com/title/{0}/companycredits/", imdbID); // e.g. sub page companycredits
                driver.Navigate().GoToUrl(url);

                // --- find and click on any "more" buttons ---
                IReadOnlyCollection<IWebElement> elements = driver.FindElements(By.ClassName("ipc-see-more__text"));

                if (elements != null)
                {
                    IJavaScriptExecutor javaScript = (IJavaScriptExecutor)driver;
                    bool doIt = true; // after a successful click, the next match is skipped

                    foreach (IWebElement element in elements)
                    {
                        if (doIt)
                        {
                            try
                            {
                                if (element.Location.Y > 100)
                                {
                                    string script = String.Format("window.scrollTo({0}, {1})", 0, element.Location.Y - 200);
                                    javaScript.ExecuteScript(script); // execute JavaScript to scroll to the button
                                }

                                element.SafeClick(); // element.Click() is buggy and crashes, therefore we use our extension method
                                doIt = false;
                            }
                            catch { }
                        }
                        else
                        {
                            doIt = true;
                        }
                    }
                }

                // --- get body as HtmlDocument for further HtmlAgilityPack parsing ---
                HtmlDocument result = new HtmlDocument();
                result.LoadHtml(driver.FindElement(By.XPath(@"//body")).GetAttribute("innerHTML"));

                return result;
            }
            finally
            {
                driver.Quit(); // always close the browser and the driver process
            }
        }

        public static List<Company> ParseProductionCompanies(HtmlDocument document)
        {
            List<Company> result = new List<Company>(); // own class: see below

            // --- get node of desired section ---
            string path = @"//div[@data-testid=""sub-section-production""]";
            HtmlNode node = document.DocumentNode.SelectSingleNode(path);

            // --- parse content ---
            if (node != null)
            {
                try
                {
                    foreach (HtmlNode entry in node.ChildNodes[0].ChildNodes)
                    {
                        if (entry.Name != "li") { continue; }

                        Company company = new Company()
                        {
                            Sphere = Sphere.Production // own Enum: see below
                        };

                        try
                        {
                            company.ID = entry.ChildNodes[0]
                                              .Attributes["href"]
                                              .Value
                                              .GetSubstringBetweenStrings("/company/", "?ref"); // own extension method for string: create it yourself ;)
                            company.Name = entry.ChildNodes[0].InnerText;
                        }
                        catch { }

                        try
                        {
                            HtmlNode details = entry.ChildNodes[1]
                                                    .ChildNodes[0]
                                                    .ChildNodes[0];
                            company.Remark = details.ChildNodes[0].InnerText;
                        }
                        catch { }

                        result.Add(company);
                    }
                }
                catch { }
            }

            return result;
        }
    }

    public class Company
    {
        public string Country { get; set; }
        public string ID { get; set; }
        public string Name { get; set; }
        public string Remark { get; set; }
        public Sphere Sphere { get; set; }
        public int Year { get; set; }
    }

    public enum Sphere
    {
        [Description("Distributor")]
        Distributor,
        [Description("Other")]
        Other,
        [Description("Production")]
        Production,
        [Description("Special Effects")]
        SpecialEffects
    }
}[/code][/limegreen]

Then you can use the parsed content for further processing, e.g. save it to your own database or export it to a file in your desired format (CSV, JSON, Text, XML).
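
The parsing code above calls a string extension named GetSubstringBetweenStrings that is not shown in the post. The following is only a minimal sketch of how such a helper could look, not the author's actual implementation. The behavior is an assumption: it returns the text between the first occurrence of the start marker and the next occurrence of the end marker, or an empty string if either marker is missing. The sample href in the comment is made up for illustration.

```csharp
using System;

namespace MyNameSpace
{
    public static partial class Extensions
    {
        // Hypothetical helper (assumed behavior, not from the original post):
        // returns the text between the first 'start' and the following 'end',
        // or an empty string if the input is empty or a marker is not found.
        // Illustrative input: "/company/co0000001?ref_=x" -> "co0000001"
        public static string GetSubstringBetweenStrings(this string source, string start, string end)
        {
            if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(start) || string.IsNullOrEmpty(end))
            {
                return string.Empty;
            }

            int startIndex = source.IndexOf(start, StringComparison.Ordinal);
            if (startIndex < 0)
            {
                return string.Empty;
            }
            startIndex += start.Length; // move past the start marker

            int endIndex = source.IndexOf(end, startIndex, StringComparison.Ordinal);
            if (endIndex < 0)
            {
                return string.Empty;
            }

            return source.Substring(startIndex, endIndex - startIndex);
        }
    }
}
```

Returning an empty string instead of throwing is a design choice for this sketch, so that a link with an unexpected format yields an empty ID. If the real links contain a trailing slash before "?ref", the end marker passed in would have to be adjusted accordingly.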
© Binaryworks.it