Any data scientists | data engineers here?

Discussion in 'Mac Programming' started by patent10021, Sep 11, 2017.

  1. patent10021 macrumors 68030


    Apr 23, 2004
    I have a few questions related to data science at social media companies or any other companies.

    Structured data usually arrives as CRM reports and the like. How does a data scientist tap into unstructured user or location data? Does the company create an in-house API that lets the data scientist access any data at any time, for example server timestamps of chat sessions? Does the data scientist usually create such an API? If so, how would the API be built for a social app? A data scientist usually isn't fluent in backend languages, right? So would it be the responsibility of other engineers to create an API etc. for the data scientist to access server data? What languages would APIs accessing server data for social apps usually be written in? Node and Express? I know many data scientists are fluent in JavaScript in addition to Java, Python, R, etc.

    Example case: Instagram data scientists would always have access to ANY server data in order to be able to do their job and describe and predict. Right or wrong? Would Instagram server data be sent to a data warehouse using Hadoop or something similar? How would the data scientist or data engineer get access to all the server/user/app data above and beyond the structured data they receive from in-house departments?
  2. MasConejos macrumors regular


    Jun 25, 2007
    Houston, TX
    You are combining data preprocessing, data science, and building scalable APIs. These are three different tasks that are often performed by three different people.

    Data preprocessing (usually referred to as ETL) involves cleaning up the data and putting it into a regular format. Most data is actually structured (server logs, database records) or semi-structured; typically, only things like freeform text are truly unstructured. Someone at this level might aggregate multiple sources of data, clean up or remove bad/missing data, etc. Sometimes this is performed by an ETL specialist, other times by a data scientist.
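
    A toy sketch of that cleanup step in Python (the records and field names here are invented for illustration, not from any real pipeline):

```python
from datetime import datetime, timezone

# Hypothetical raw records aggregated from two sources (a server log and a
# CRM export). Two of them are "bad" in typical ways: an unparseable
# timestamp and a missing user name.
raw_records = [
    {"user": "alice", "ts": "2017-09-11T08:30:00Z", "country": "US"},
    {"user": "bob",   "ts": "not-a-date",           "country": "JP"},  # bad timestamp
    {"user": "",      "ts": "2017-09-11T09:00:00Z", "country": "US"},  # missing user
]

def clean(record):
    """Return a normalized record, or None if it fails validation."""
    if not record["user"]:
        return None
    try:
        ts = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return {
        "user": record["user"],
        "ts": ts.replace(tzinfo=timezone.utc),  # regularize to aware UTC datetimes
        "country": record["country"],
    }

cleaned = [r for r in (clean(r) for r in raw_records) if r is not None]
```

Only the first record survives both checks; the other two are dropped rather than passed downstream.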

    The data science portion, while it frequently involves the activities before and after, is centered on building a predictive model of some sort. Sometimes this model is continuously improved with additional data; other times the model is created once and used as is, depending on the context.

    Once the model is created, APIs and such come into play. The role of the person here is to apply the model. The API itself would typically be built by someone with more backend experience; the portion that interfaces with the model would more likely be a joint effort between a backend specialist and a data scientist.

    The thing to remember about data science is that using a model to predict is typically very fast, but training a model is very expensive. When training becomes too expensive, specialized internal processes are used to train and update the model. This would typically be done by a data scientist, and this is almost always an internal, non-customer facing process. The type of scalability required for machine learning is very different from the type of scalability required to host and process an API.
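
    To make that train-slow/predict-fast asymmetry concrete, here is a toy pure-Python sketch (the numbers are made up): "training" an ordinary least-squares line requires a pass over all the data, while prediction afterwards is a single multiply-add.

```python
# Made-up training data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

def train(xs, ys):
    """Ordinary least squares for a line y = a*x + b. This is the expensive
    step: cost grows with the size of the data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

a, b = train(xs, ys)

def predict(x):
    """Applying the trained model: one multiply-add, constant time."""
    return a * x + b

estimate = predict(6.0)  # extrapolate one step past the training data
```

With a real model the gap is the same in kind but vastly larger in degree, which is why retraining gets its own internal, non-customer-facing process.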
  3. patent10021, Sep 12, 2017
    Last edited: Sep 12, 2017

    patent10021 thread starter macrumors 68030


    Apr 23, 2004
    Great, informative reply. Thank you very much @MasConejos. I am familiar with much of the process except for the development of the API and the communication between the database, Hadoop (ETL/ELT), and the hardware.

    I use Java, Python, R, and JavaScript etc. for mobile, back-end, and ML projects, so fortunately the programming curve is manageable. I'm studying "full-stack" data science for experience and will eventually choose the engineering or analytics side. I'm looking at it from the viewpoint of working at a startup using Hadoop (not a traditional RDBMS), or maybe a startup that is only using Python/R libraries with Tableau and cloud CRM systems and doesn't want to pay for SAS/Hadoop. I am familiar with the theory behind developing the pipeline and applying the model, but I'd like your feedback on the development of the API and the ETL/ELT process in detail.

    Traditional RDBMS


    1. In the diagrams there is the Source Systems label. Examples of source systems would be Instagram servers or an indie dev's BaaS, correct? In data science the (application data model + API) together are considered to be a server. So the (application data model + API) is one source system and the Instagram / BaaS is a separate source system? But in a Hadoop ecosystem the (application data model + API) together are contained in Hadoop along with the data lake, correct? And in a traditional RDBMS the (application data model + API) are in the staging area?

    But obviously in either case APIs are needed to extract the data from the source system (Instagram server / BaaS) to the staging area or to Hadoop.

    2. Hadoop has Knox as the REST API gateway. So you create an API to interface with Knox? I read on Stack Overflow that the Knox component is not always available within the Hadoop cluster. I guess in those exceptions you would use the Hortonworks API for HDFS?

    3. Does the design of the API change depending on whether or not the business is using Hadoop or traditional RDBMS?

    4. What languages are commonly used in designing APIs for ETL/ELT? Java? JavaScript? Does it depend on the model?

    API Architecture
    API code accesses the application via the application interface and exposes it as a RESTful API. Between the application interface and the RESTful API, the application data model complies with the RESTful architectural style.

    Of course we are dealing with JSON objects here, with a type, associated data, relationships to other resources, and a set of methods that operate on it. But as I understand it, in data science only a few standard methods are defined for the resource, corresponding to the standard HTTP GET, POST, PUT, and DELETE methods, right? An API resource model defines the available data types and their behavior, for example a JSON data model using scalar, array, and object data types.

    Resources and associated data are delineated in terms of the JSON data model, but resources need to be serialized. What is the convention for serialization in data science? JSON, XML, or HTML? I am assuming JSON, since the data model is often defined in terms of the JSON model.
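
    For what it's worth, here is a minimal Python sketch of such a resource and its JSON serialization round-trip. The resource type and field names are hypothetical, just illustrating scalar, array, and object data types plus a relationship link:

```python
import json

# A hypothetical "chat session" resource in the JSON data model.
resource = {
    "type": "chat_session",
    "id": 42,                                    # scalar
    "started_at": "2017-09-11T08:30:00Z",        # scalar (timestamp as string)
    "participants": ["alice", "bob"],            # array
    "location": {"lat": 29.76, "lon": -95.37},   # nested object
    "links": {"user": "/users/alice"},           # relationship to another resource
}

body = json.dumps(resource)    # serialize, e.g. for an HTTP GET response
roundtrip = json.loads(body)   # a client deserializes it back
```

The round-trip is lossless for these types, which is a big part of why JSON is the default choice over XML or HTML for this kind of API.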

    p.s. Have you used Hydrograph?

    Thank you for responding MasConejos.
  4. MasConejos macrumors regular


    Jun 25, 2007
    Houston, TX
    I will answer what I can from this. Note that I'm finishing a MS in CS/machine learning this semester, so I don't have much real world data science experience.

    You seem to be using "API" to refer to both a data source (e.g. Instagram) and a data sink (an API you provide for others to use). First, note that for a data source you will probably not be using an API. Most of the time you will be operating on your own data. If you are operating on someone else's data, they will either provide it to you as a data dump or provide direct access to it. Using Instagram as an example, their public API limits you to 5000 hits per hour, and automated activities (scraping, bulk copying, etc.) are explicitly disallowed. If you had a partnership with Instagram to use their data, they would probably give you a different channel to access it.

    Hadoop is HDFS plus MapReduce. If MapReduce isn't what you want, there are several services that ride on top of Hadoop, like Pig (a vaguely SQL-like language for massaging data) and Hive (a SQL-like system that uses HDFS instead of an RDBMS). Both of these will let you preprocess data to extract what you want in a scalable manner.

    Additionally there is Spark, which also uses HDFS but provides a scalable programmatic system to do ETL and ML.

    There are other platforms too, but these are the ones I am familiar with. In any event, all of these are just platforms that run on top of whatever server (or cluster) configuration you want, and selecting and configuring a cluster is a different topic.

    In general, HDFS and these systems are relatively slow as far as fetching and serving a specific item of data goes. I'm sure there is an exception somewhere, but generally one would not want to talk to any of these systems directly. Either one wants to fetch the raw data, in which case you don't need any of this, or one wants the processed output, which would be better stored in a traditional RDBMS after these processes run. More often, these processes will manipulate the data or merely generate a model, and the output is then handled using a traditional (non-big-data) server with an API.

    Using Instagram as an example again, let's say you were doing some sort of computer vision project and Instagram gives you access to their data. You want to train a model to recognize dogs. You would probably query Instagram, get everything related to dogs (correctly or incorrectly), bring it over onto your own servers (stored in HDFS or an RDBMS), do ETL and processing, and create a model that recognizes dogs. Afterwards, on your own server, you make an API that accepts a picture and returns true/false depending on whether it contains a dog.
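
    A heavily simplified Python sketch of that last step. The model here is a stub (a real classifier and the web-framework wiring are omitted), but the shape of the endpoint is the point: prediction per request is cheap because training already happened offline.

```python
import json

def contains_dog(image_bytes):
    """Stand-in for the trained model. Real code would decode the image and
    run the already-trained classifier; this placeholder just keeps the
    sketch runnable."""
    return b"dog" in image_bytes

def handle_request(image_bytes):
    """What the API endpoint does per request: apply the model, return JSON."""
    return json.dumps({"dog": contains_dog(image_bytes)})

# A real deployment would wrap handle_request in a web framework route
# (Flask, Express, etc.); the request body would carry the uploaded image.
response = handle_request(b"...dog photo bytes...")
```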

    Conversely, let's say you already have a copy of (or direct fast access to) all of Instagram's data in HDFS. There is no use case for deciding on the fly (or by user input) that you want to recognize dogs, learning how to recognize dogs, and then evaluating a picture, all as part of an API. HDFS is slow, learning is slow, and more than a handful of users would kill your system.

    I'm sure there are exceptions somewhere, but in general, these systems are slow and not really designed to be interacted with. The line gets a little blurry with online learning, but even then, the process and pipeline is set up in advance, one is just dumping more data in the virtual hopper while it runs.

    I know this doesn't address all of your questions, but I hope it helps. Looking at what you've written on API Architecture, it looks correct. Serialization is a decision made when developing the API; it is independent of the data science considerations. I've not used (nor heard of) Hydrograph. When I try to look it up, all I immediately see is information about graphing flow rates.
  5. patent10021, Sep 13, 2017
    Last edited: Sep 13, 2017

    patent10021 thread starter macrumors 68030


    Apr 23, 2004

    Well, I was really talking about two situations: enterprises like Instagram, and smaller startups/indie devs. I was using Instagram as an example; I'm not using their API.

    For a startup they might use Twitter, Instagram, Facebook APIs and scrape web pages. This would all be unstructured external data. Or their marketing team provides structured CRM reports etc.

    I was thinking that a startup without Hadoop might be using Spark + a scalable file system like S3, Cassandra, or HDFS (Spark has Standalone, YARN, and Mesos modes). Or a startup with some investment might be using Tachyon/Hadoop + Spark or MapReduce for processing.

    But this is another topic :)


    Specifically, I wanted to know how an enterprise like Instagram or a startup might pull app data (i.e. Instagram app user/location/chat data) from their servers into Hadoop or an RDBMS.

    The key question is what you were talking about here.
    Are you saying there is no API between the server holding the Instagram app data [i.e. usernames, location, chat logs, media] and their RDBMS staging area or Hadoop HDFS?

    1. How would an enterprise like Instagram access their app data on their servers? Would Hadoop be directly connected to the same server holding the Instagram app data? If yes how is it connected?
    2. How would a small startup using only Python, R, and maybe Spark access their app data on their local servers?

    What I have noticed is that even small startups already have most of the data they need for CRM purposes. Services like HockeyApp and LeanPlum have direct access to the iOS/Android app data, and the marketing team then looks at various reports. This is all internal structured data, of course, which could then be given to the data analyst in Excel format etc.

    In this case an API isn't needed of course. But what if the analyst wants to use internal unstructured data sitting on the servers that the marketing team isn't even aware of? How does the scientist/analyst access that data?

    KEY: I guess Instagram / startup company's app data is copied from the server that is serving the actual social iOS/Android app into the Hadoop data lake.

    1. How is it copied over? Not API? Purely physically connected?
    2. If they aren't using Hadoop how is it copied over to S3, Cassandra or HDFS?
    3. The data being copied is all server data containing media, location data, timestamps, chat logs, etc. How is that data transformed into usable data that can be used by Spark/MapReduce/Python/R?
  6. MasConejos macrumors regular


    Jun 25, 2007
    Houston, TX
    I suppose technically it would be an "API", but I mentally categorize it as just an internal connection.

    Starting with a traditional server configuration, let's assume we have a processing server that does web scraping or whatever, and a DB server. From the processing server, to talk to the DB and create/update/delete data, I just open a network connection using the server name, login information, and maybe a port. This falls into the standard usage of ADO.NET for C#/VB, ActiveRecord for Rails, etc.; basically any database library.
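
    A minimal Python sketch of that pattern. Here sqlite3 (in-memory) stands in for a networked RDBMS; with a real server you would pass host/port/credentials to the driver (psycopg2, mysql-connector, etc.) instead of a connection string like this, but the create/update/select flow is the same.

```python
import sqlite3

# Open a connection; a networked DB would take host, login, and port here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create/update data from the processing server, e.g. scraping results.
cur.execute("CREATE TABLE scraped (url TEXT, status INTEGER)")
cur.execute("INSERT INTO scraped VALUES (?, ?)", ("http://example.com", 200))
cur.execute("UPDATE scraped SET status = ? WHERE url = ?",
            (404, "http://example.com"))
conn.commit()

rows = cur.execute("SELECT url, status FROM scraped").fetchall()
```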

    Talking to HDFS works the same way: you use a library and a URI ("hdfs://...") to read and write data in HDFS. For talking to S3, Amazon provides a command line executable that can be invoked to move data back and forth between the local and S3 servers. To use it in a program you would call it as a shell command.
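
    For example, shelling out to the AWS CLI from Python. The bucket and paths below are hypothetical, and the `aws` executable must be installed and configured, so the command is only built here rather than executed:

```python
import subprocess

# Hypothetical source file and destination bucket.
local_path = "/data/export/events.csv"
s3_uri = "s3://my-bucket/raw/events.csv"

# `aws s3 cp <src> <dst>` copies between local disk and S3.
cmd = ["aws", "s3", "cp", local_path, s3_uri]

# In a real pipeline you would invoke it, e.g.:
# subprocess.run(cmd, check=True)
```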

    As a specific example, in a Spark program you call the CSV read function, and as part of that you pass the path URI. The URI can be a file system path or an HDFS path.

    import java.nio.file.Paths
    import org.apache.spark.sql.SparkSession

    val sparkSession = SparkSession.builder
      .appName("CSE 8803 Project")
      .config("spark.debug.maxToStringFields", 35) // suppress warning about number of columns in INPUTEVENTS_MV.csv
      .getOrCreate()

    val dfAdmissions = sparkSession.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(Paths.get(path, "ADMISSIONS.csv").toString)

    To make the data usable you can use one of many tools to preprocess it. If it's in a normal DB or file system, you can use anything from compiled programs to scripting languages, to R, SQL, or a dedicated ETL tool to manipulate the data.

    If it's in HDFS you can use Pig, Hive, or other tools. You can even use Spark to preprocess the data (there's nothing preventing this, and depending on how complicated your transformations are, it might even be the best tool).

    Most of the HDFS tools give you the option of SQL-like syntax to filter and modify the data. This lets you easily sort and filter the data, as well as perform transformations on it. Like I said, most data is at least semi-structured. For example, for Facebook/Instagram/etc. data, the content might be raw text, but timestamps, user names, etc. will be structured. For raw text, you could still use some of these tools to downcase/upcase everything, strip punctuation, and do a lot of the other text mining tasks you would want to do to filter out bad data and get the rest into a more structured format.
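
    A tiny Python sketch of that raw-text cleanup (the sample post is made up): downcase, strip punctuation, drop empty tokens.

```python
import string

raw_post = "OMG!! Look at my DOG... #cute"

def normalize(text):
    """Downcase, remove all ASCII punctuation, split into tokens."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in stripped.split() if token]

tokens = normalize(raw_post)
```

The structured output (a token list) is then much easier to feed into the rest of a text-mining pipeline than the raw string.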

    Finally, keep in mind that not all of data science is big data. Frequently you could be looking at just a few gigs of data (or far less) and use any technique you would normally use (Python, R, Java, etc.). A normal DB is fully capable of handling these cases.
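
    For example, a few rows of made-up "small data" handled entirely with the Python standard library, no cluster involved:

```python
import csv
import io
from collections import Counter

# io.StringIO stands in for a small CSV file on disk.
data = io.StringIO("user,country\nalice,US\nbob,JP\ncarol,US\n")

reader = csv.DictReader(data)
by_country = Counter(row["country"] for row in reader)  # users per country
```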
  7. alembic macrumors regular

    Oct 13, 2005

    This diagram maps the basic data flow in a Hortonworks implementation. Hortonworks is an open source data platform provider.

    Unstructured/structured data (Data Sources) is ingested then processed within the cluster of appliances (green area) using any of the available components, e.g. Hadoop, Storm, Spark, etc. This processing can be done in batch mode or real-time. Notice the bi-directional flow between the appliance data store and traditional Data Repository. The appliance cluster can combine info from traditional stores with processed external sources to create the required data structures for analytical models. Sometimes this is then exported to the traditional data repository for faster retrieval by end users.

    Typically the data scientist determines the data requirements (type, structure, volume); the processing to fulfill these requirements is then usually implemented by the data/ETL engineer.

    The data scientist uses tools in the Applications area to perform analytics. e.g. R Studio, Tableau, Spark console, etc. These tools may access data from either data store. A custom application will use an API provided by the development language or associated frameworks to access either data source. A company may offer an API, e.g. web service, to the processed data for external consumption.
