Loop Git Repository
Synopsis
This operator loops through the contents of a Git repository. When entering your Git repository URL, ensure that you use the clone URL, which ends in .git. Here is an example that is used in the tutorial process: https://github.com/dbarnett/python-helloworld.git
Description
This operator creates a copy of the specified version of the Git repository. You can choose a branch or a tag to check out.
The contents are copied into a temporary folder, and the operator then loops through each of the files. This is particularly useful if you would like to execute a Python or R project from an external source. The operator offers filtering capabilities on the repository, allowing you to specify an entry-point file which can be passed into an Execute Python or Execute R operator.
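The flow described above (check out one branch or tag into a temporary folder, then visit each file) can be sketched in Python. This is only an illustration under stated assumptions, not the operator's implementation: the function names are hypothetical, the shallow clone is a choice made for the sketch, and skipping the .git metadata folder is an assumption.

```python
import subprocess
import tempfile
from pathlib import Path


def clone_command(url: str, ref: str, target: str) -> list[str]:
    """Build a shallow `git clone` command for one branch or tag (hypothetical helper)."""
    return ["git", "clone", "--depth", "1", "--branch", ref, url, target]


def loop_files(root: Path):
    """Yield every regular file under root, skipping Git's own metadata (assumption)."""
    for path in sorted(root.rglob("*")):
        if path.is_file() and ".git" not in path.relative_to(root).parts:
            yield path


def loop_git_repository(url: str, ref: str):
    """Clone into a temporary folder, then yield paths relative to the checkout."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(clone_command(url, ref, tmp), check=True)
        for path in loop_files(Path(tmp)):
            yield path.relative_to(tmp)
```

Calling loop_git_repository with the tutorial URL and a branch name would need network access and a local git installation; loop_files can be tried on any local folder.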
Input
con
The RapidMiner connection to be used to connect to the Git repository. This is not mandatory. Public repositories do not require any authentication and thus require no connection.
input
The Loop Git Repository operator can have multiple inputs. When one input is connected, another input port becomes available which is ready to accept another input (if any). The order of inputs remains the same. The Object supplied at the first input port of the subprocess is available at the first input port of the nested chain (inside the subprocess).
Output
con
The operator passes through the connection from the input port for reuse later in the RapidMiner process.
output
The Loop Git Repository operator can have multiple outputs. When one output is connected, another output port becomes available which is ready to deliver another output (if any). The order of outputs remains the same. The Object delivered at the first output port of the subprocess is delivered at the first output of the outer process.
Parameters
Repository url
Select the location of the Git repository from where to start scanning for files. This parameter is used when there is no connection input, so it is only suitable for repositories that require no authentication.
Set repository url
Used when a connection object is connected to the input port. Ensure that the URL form matches the connection type:
- Username and Password: use an http(s) URL
- SSH Key: use an ssh URL
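The pairing of URL form and connection type can be checked with a few lines of Python. This is a minimal sketch; the connection type labels "username_password" and "ssh_key" and the function name are hypothetical and do not come from RapidMiner.

```python
def url_matches_connection(url: str, connection_type: str) -> bool:
    """Return True when the URL form suits the (hypothetically named) connection type."""
    is_http = url.startswith(("http://", "https://"))
    # SSH remotes come as ssh://host/path or in the scp-like user@host:path form.
    is_ssh = url.startswith("ssh://") or (
        "://" not in url and "@" in url and ":" in url.split("@", 1)[1]
    )
    if connection_type == "username_password":
        return is_http
    if connection_type == "ssh_key":
        return is_ssh
    raise ValueError(f"unknown connection type: {connection_type}")
```

For example, https://github.com/dbarnett/python-helloworld.git suits a username and password connection, while git@github.com:dbarnett/python-helloworld.git suits an SSH key connection.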
Branch name
Use this parameter to select the branch to loop over from the list of available branches.
Filter type
Specifies how to filter file names. You can use either standard, command-shell-style glob filtering or a regular expression.
Filter by glob
Specifies a glob expression which is used as filter for the file and directory names.
Here is a short overview:
- *: any number of characters
- **: same as '*', but crosses directory boundaries. Useful to match complete paths.
- ?: matches exactly one char
- {}: contains collections that are separated by ','. The glob filter will try to match the string to any of the strings in the collection.
- []: contains a range of chars or a single char (e.g. [a-z]).
- String(*): *
- String(?): ?
- String(**): **
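To make the wildcard rules concrete, the following Python sketch translates a glob into a regular expression. It covers only '*', '**', '?', brace collections such as {py,R}, and bracket ranges such as [a-z]. The helper names are hypothetical, and matching against a '/'-separated relative path is an assumption; the operator's own matcher may differ in details.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a glob into a regular expression (illustrative, not the operator's code)."""
    out = []
    i = 0
    while i < len(glob):
        ch = glob[i]
        if glob.startswith("**", i):
            out.append(".*")              # crosses directory boundaries
            i += 2
            continue
        if ch == "*":
            out.append("[^/]*")           # any characters within one path segment
        elif ch == "?":
            out.append("[^/]")            # exactly one character
        elif ch == "{":
            end = glob.index("}", i)
            options = glob[i + 1:end].split(",")
            out.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
            i = end
        elif ch == "[":
            end = glob.index("]", i)
            out.append(glob[i:end + 1])   # a range or set passes through unchanged
            i = end
        else:
            out.append(re.escape(ch))     # everything else is literal
        i += 1
    return "".join(out)


def glob_matches(glob: str, path: str) -> bool:
    """Return True when the whole path matches the glob."""
    return re.fullmatch(glob_to_regex(glob), path) is not None
```

With this sketch, "*.py" matches main.py but not src/main.py, while "**.py" matches both, and "*.{py,R}" matches model.R.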
Filter by regex
Specifies a regular expression which is used as filter for the file and directory names, e.g. 'a.*b' for all files starting with 'a' and ending with 'b'. Ignored if empty.
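The 'a.*b' example can be tried out with Python's re module, which behaves the same way as other common regex engines for a pattern this simple. The sketch assumes the expression has to match the whole name, as the description of the example suggests; the function name and the sample file names are made up for illustration.

```python
import re


def filter_by_regex(names: list[str], pattern: str) -> list[str]:
    """Keep names that fully match the pattern; an empty pattern filters nothing."""
    if not pattern:
        return list(names)
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


names = ["ab", "a_report_b", "abc", "cab"]   # made-up file names
matching = filter_by_regex(names, r"a.*b")   # names starting with 'a' and ending with 'b'
unfiltered = filter_by_regex(names, "")      # empty expression: filter is ignored
```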
Recursive
Set whether to recursively search every directory. If set to true, the operator will include files inside sub-directories (and sub-sub-directories ...) of the selected directory.
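The difference between the two settings can be sketched with Python's pathlib on a small throwaway folder. The folder layout and the function name are made up for illustration.

```python
import tempfile
from pathlib import Path


def list_files(root: Path, recursive: bool) -> list[str]:
    """List files directly in root, or in all sub-directories when recursive is True."""
    walker = root.rglob("*") if recursive else root.glob("*")
    return sorted(p.relative_to(root).as_posix() for p in walker if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "main.py").write_text("print('hello')\n")
    (root / "lib").mkdir()
    (root / "lib" / "util.py").write_text("")
    top_level = list_files(root, recursive=False)   # only main.py
    everything = list_files(root, recursive=True)   # also lib/util.py
```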
Enable macros
If this parameter is enabled, you can name and extract three macros (for file name, file type and file folder) and use them in your subprocess.
Macro for file name
If filled, a macro with this name will be set to the name of the current entry. To access the full path including the containing directory, combine this with the folder macro. Can be left blank.
Macro for file type
Will be set to the file's extension. Can be left blank.
Macro for file folder
If filled, a macro with this name will be set to the containing folder of the current file. To access the full path, combine this with the name macro. Can be left blank.
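The three macros correspond to parts of a file path, and the folder and name macros together give back the full path. A minimal Python sketch follows; the macro names file_name, file_type and file_folder and the sample path are hypothetical, and whether the operator reports the extension with or without the leading dot is an assumption here.

```python
from pathlib import PurePosixPath


def file_macros(path: str) -> dict[str, str]:
    """Split a path into the three values the macros would carry (names are hypothetical)."""
    p = PurePosixPath(path)
    return {
        "file_name": p.name,
        "file_type": p.suffix.lstrip("."),   # assumption: extension without the dot
        "file_folder": str(p.parent),
    }


macros = file_macros("/tmp/checkout/src/hello.py")
# Combining the folder macro and the name macro restores the full path.
full_path = str(PurePosixPath(macros["file_folder"]) / macros["file_name"])
```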
Reuse results
Set whether to reuse the results of each iteration as the input of the next iteration. If set to true, the output of each iteration is used as input for the next iteration. Enabling this parameter will force the operator to NOT run in a parallel fashion. If set to false, the input of each iteration will be the original input.
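The two behaviours can be compared in a short Python sketch. It also shows why reuse rules out parallel execution: each iteration has to wait for the result of the previous one. The function names are hypothetical, and collecting every iteration's result in a list is a choice made for the illustration.

```python
def run_loop(files, initial, step, reuse_results):
    """Run step once per file and collect each iteration's result."""
    outputs = []
    current = initial
    for name in files:
        result = step(current, name)
        outputs.append(result)
        if reuse_results:
            current = result   # the next iteration starts from this result
    return outputs


def append_name(seen, name):
    """Toy inner process: add the current file name to the incoming list."""
    return seen + [name]


chained = run_loop(["a.py", "b.py"], [], append_name, reuse_results=True)
independent = run_loop(["a.py", "b.py"], [], append_name, reuse_results=False)
```

With reuse enabled the second iteration sees the output of the first; with reuse disabled each iteration starts again from the original input.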
Enable parallel execution
This parameter enables the parallel execution of the inner processes. Please disable the parallel execution if you either run into memory problems or if you need an inner loop. The end result will be propagated to the outside process and can be used in the usual way.
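As a rough analogy, independent iterations can be handed to a pool of workers, and the collected results look the same as those of a sequential run. The sketch below uses Python's concurrent.futures; the worker count and the toy process_file function are made up, and nothing here reflects how RapidMiner schedules its inner processes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(name: str) -> str:
    """Toy inner process: stands in for whatever the subprocess does with one file."""
    return name.upper()


def run_iterations(files: list[str], parallel: bool) -> list[str]:
    """Process every file, either one by one or on a small worker pool."""
    if parallel:
        with ThreadPoolExecutor(max_workers=4) as pool:
            # map returns the results in input order, regardless of completion order.
            return list(pool.map(process_file, files))
    return [process_file(name) for name in files]


sequential = run_iterations(["a.py", "b.R"], parallel=False)
concurrent = run_iterations(["a.py", "b.R"], parallel=True)
```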
Cache
This parameter determines whether the repository will be stored on disk following the checkout. By default, the data will be deleted.